Once again about captchas

Articles about captcha recognition appear on Habr fairly often. I have always read them with great interest, and now it is my turn to write one. The path from a naive implementation with Tesseract to a web service backed by a complex neural network took me about a year. Over that time the recognition error rate dropped from 90% to less than 1%.





One day a fairly well-known file hosting service changed something in its algorithm yet again, and my download program started showing a captcha input window before every download. It was annoying, and the old workarounds apparently no longer worked, so I began to think about how to solve the problem. Hooking a plug-in into the download program turned out to be simple; what remained was recognizing the captcha itself. It consisted of 4 colored characters (letters and digits) of different sizes, rotated by up to 30 degrees in either direction, on a colored background crossed by thin straight lines. While searching, I came across the OCR program Tesseract, saved several captcha files and tried to recognize them. The naive approach gave about 10% correct results. I soon found out how to restrict the list of allowed characters, which raised the hit rate to 20%. That was already enough to work with, so I wrote a Python program that sends captchas for recognition and returns the result to the download program.

Along the way, I began to experiment with image preprocessing to improve recognition accuracy. At first I tried converting the images to black and white, but because of the low resolution and a slight color gradient, the edges of the characters came out clipped. I settled on reducing the color depth by discarding the 6 least significant bits of each color value. I also came up with the idea of processing the image character by character: splitting it into parts and making several recognition attempts at different rotation angles. Rotating from -30 to 30 degrees in steps of 5 degrees and picking the most common result gave an accuracy of 30-40%, but the time per captcha grew to 12 seconds.
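The two preprocessing ideas above, dropping the low color bits and voting over rotation angles, can be sketched roughly as follows. This is only an illustration, not the original program: the function names are made up, and `rotate` and `recognize` are hypothetical callables that the caller supplies (for example, thin wrappers around an image library's rotation and a Tesseract call restricted to the allowed characters).

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple

Pixel = Tuple[int, int, int]  # one (R, G, B) value, 0..255 per channel


def reduce_color_depth(pixels: List[Pixel], keep_bits: int = 2) -> List[Pixel]:
    """Keep only the top `keep_bits` bits of every channel.

    keep_bits=2 discards the 6 least significant bits, as described above.
    """
    mask = (0xFF << (8 - keep_bits)) & 0xFF  # 0b11000000 for keep_bits=2
    return [tuple(channel & mask for channel in px) for px in pixels]


def vote_over_rotations(
    image,
    rotate: Callable,     # hypothetical: rotate(image, angle) -> image
    recognize: Callable,  # hypothetical: recognize(image) -> str
    angles: Iterable[int] = range(-30, 31, 5),
) -> Optional[str]:
    """Recognize the image at each angle and return the most common answer."""
    votes = Counter()
    for angle in angles:
        text = recognize(rotate(image, angle)).strip()
        if text:  # ignore attempts where nothing was recognized
            votes[text] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

With the default range this makes 13 recognition attempts per character, which is consistent with the sharp increase in processing time mentioned above.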





Tesseract . , - . OCR .





, , 3 . , . - , . , , . , 40-50%. .





- OpenCV NumPy . , , . 70-80%, 85 . - , . .





, MNIST . ( 2500), 2828 , 25% - 9000 . , Keras Tensorflow, 100% 75% . - 1,8 . , NumPy, . "Python Machine Learning" , .





. , , . , . , - . , , . , , . 90%.





, 6000 . , 2 , . - , - - "How to implement an OCR model using CNNs, RNNs and CTC loss". Keras.





, "" , 2%. "", , 20 . - , ( ), . , , - 15000 . - - , 2 . 1 250 . Keras Tensorflow . 3 2 , . - Flask, .





, . , .





:









  1. - - ,





  2. ,





  3. - -





  4. " , . , , , " ()







