And one more service for checking passports, or again the question is how many gigabytes are in one megabyte

Some time ago, it became possible to pay attention to the Go language and the publication "Passport Control, or How to Compress One and a Half Gigabytes to 42 Mb" came across . The article briefly, but informatively, describes a test task for developing a service for checking Russian passport numbers for their presence in the list of invalid passports. Among the main requirements for the implementation are the speed of verification and the availability of the service.  





After reading the article, I decided for myself that it is on this practical task that you can build your Go learning. Indeed, the problem of checking invalid passports, although hackneyed, is interesting, and since performance requirements are a priority here, the implementation in Go would be quite appropriate here.





In addition, the article mentioned above touches upon the issues of efficient, from the point of view of memory, organization of bit arrays (bitmaps). And this topic is quite relevant and in demand in various applied solutions, for example, in the form of bitmap indexes for a DBMS.





As a result, the following: there is a desire to look at the new Go language, there is an interesting problem in the form of organization and use of bitmap (naturally in Go), there is a practical task on which these two goals can be worked out. If interested, go ahead.





What as raw data

The very task of checking passports is built around the primary list of invalid passports presented on the website of the  Ministry of Internal Affairs of the Russian Federation , where you can also manually check the passport number. You can download the entire list from the site as a packed archive of 460 MB in size, while unpacked we get a list of about 1.5 GB, presented as a single CSV file with two columns: series and passport number.





To date, this CSV file contains about 132 million passport numbers. But the file also contains erroneous numbers that cannot be identified unambiguously, for example, all 4 digits for a series or all 6 digits for a number are not indicated, there may be alphabetic series.   





4- 6- , , , .  , , , .. , .





Go , , .





. , . , – . , 10 000 ( 0 9999), 1 000 000 ( 0 999998, .. 000001).  , ( , G), ..     10 000 bitmap, bitmap 15 625 uint64.





15 625?

, 1 000 000 125 000 , , , 15 625 ( x86-64)





, , 1.25 , 10 000 bitmap. , .. . – , 10 000 bitmap 3 300 ( ).





– , , bitmap , .  , , , ..15 625 1 000 000 . , . , .  





bitmap – roaring bitmap.





, , .  Pilosa.





Pilosa – . Pilosa , , .  – () .





, , , .. Pilosa. , Pilosa, .. (binary matrices).  





Pilosa , Go, «» Go, .  « ».   Pilosa.





  , :





  • . - , – Pilosa.





  • http . GET , POST, json.  .





  • .





  • , , ..





  • , , , Docker .





, . .





. , Go « » , , , , . , , , Go . Go , , , Linux, Mac OS Windows, . , , uint16 uint64.





Go, , , .





, , GitHub.  Go, Go «». roaring bitmap Pilosa , .





, , . , , , , . , - API .   – , , .  . , .. , .  , .





API

, , GET





http://localhost:8080/passport?series=8003&number=011384





html json , , ContentType "application/json" .





"application/json":





[{
	"series": "8003",
	"number": "011384",
	"status": "non-valid"
}]
      
      



status “valid” “non-valid”





json POST , status:





http://localhost:8080/passport





[
   {
        "series""4050",
        "number""039589"
    },
    {
        "series""5203",
        "number""257719"
    },
    {
        "series""5000",
        "number""347024"
    },
    {
        "series""2507",
        "number""857721"
    },
    {
        "series""2507" ,
        "number""857728"
    }
]
      
      



, status.  - , , ( 100)





i7 7- 16 Gb , Windows Pro WSL2(Windows Subsystem for Linux), Docker Desktop For Windows. , , : Windows, WSL url ApacheBench 1000- , 100 . , Go (benchmark), , .  





  :





  • ( );





  • ;





  • .





bitmap

  1.5 30 , .  , , . 





, «» , 1.25 Gb ( ), 440 . - , .. . , . 





(1000 100 ), , «Time per request» 0.1 ms , , .





, :





  • 30 ;





  • 440 ;





  • 0.10 ms.





, .





roaring bitmap

  1.5 , bitmap. 44 (!) , , , . 0.18ms, , , , .





  roaring bitmap:





, bitmap :





  • 90 ;





  • 44 ;





  • 0.18 ms.





.  , , .. roaring bitmap, 64- C (Croaring).





Pilosa

 Pilosa , .  , .  , Pilosa Windows, , , . , « ».   , Pilosa docker- Windows – .





Pilosa , , .. . 





Pilosa – 1 1000 . 132 .





Pilosa WSL2, .





Pilosa 4 , , , , , , , , .. .





  Pilosa , , , , .. Pilosa bitmap roaring bitmap.





– 0.5 ms .





Pilosa:





  • 240 ;





  • ;





  • 0.5 ms.





, Pilosa , , , .  , (timestamp), Pilosa.





, Go .  . , . – Go .





, - , , Go.





roaring bitmap . bitmap bitmap, roaring bitmap.   








All Articles