Tesseract OCR: highlighting recognized text in an image

Reading a picture, saving the text, processing it, and getting a result is quite easy. This article shows how to present that result to the user on the picture that was read, for example by highlighting the piece of text that contains a target sentence. This is useful for marking an important part of a report and showing it to management.





You can draw on the picture by coordinates, but the coordinates have to be obtained while the text is being read. pytesseract returns them when image_to_data is called with a special output_type:





pd_dataframe = pytesseract.image_to_data(image, output_type=Output.DATAFRAME)
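To put the call above in context, here is a minimal sketch of reading one page. The helper names read_page and dict_to_rows are placeholders and not the author's code. It assumes opencv-python and pytesseract are installed, and it requests Output.DICT rather than Output.DATAFRAME only so that the rows can be handled without pandas; the columns are the same.

```python
def read_page(image_path):
    """Run Tesseract on one image and return its data as a dict of columns."""
    # Third-party imports stay local so dict_to_rows works without them.
    import cv2
    import pytesseract
    from pytesseract import Output

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    # OpenCV loads images as BGR, while pytesseract expects RGB.
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_data(rgb, output_type=Output.DICT)


def dict_to_rows(data):
    """Turn the column-oriented dict into a list with one dict per row."""
    keys = list(data)
    return [dict(zip(keys, values)) for values in zip(*data.values())]
```

For a page saved as, say, report_page.png (a made-up file name), dict_to_rows(read_page("report_page.png")) yields one dict per row. When working with the dataframe instead, pd_dataframe.to_dict("records") should give the same kind of list.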
      
      



Let's look at an example. We take a publicly available financial report from the Internet, read one of its pages (we will draw on it below), and look at the resulting dataframe with the text.





result output_type = Output.DATAFRAME
Detailed description of the columns
  • level: 1, 2, 3, 4 or 5, the hierarchy level of the current item (1 = page, 2 = block, 3 = paragraph, 4 = line, 5 = word).





  • page_num: the page index of the current item. In most cases an image has only one page.





  • block_num: the block index of the current item. When Tesseract runs OCR on an image, it splits the image into several blocks according to the PSM (page segmentation mode) parameter and some internal rules. The words of one line usually fall into the same block.





  • par_num: the paragraph index of the current item within its block. It comes from the page layout analysis.





  • line_num: the line index of the current item within its paragraph. It comes from the page layout analysis.





  • word_num: the word index of the current item within its line.





  • left/top/width/height: the top-left coordinate, width, and height of the current item's bounding box, in pixels.





  • conf: the confidence of the current word, from -1 to 100. A value of -1 means the row contains no recognized text; 100 is the highest confidence.





  • text: the OCR result for the word.
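The column descriptions above can be turned into two small helpers. This is a sketch under assumptions: rows is a list of per-row dicts (for example from pd_dataframe.to_dict("records")), and the names LEVEL_NAMES, recognized_words, and count_by_level are invented for illustration.

```python
LEVEL_NAMES = {1: "page", 2: "block", 3: "paragraph", 4: "line", 5: "word"}


def is_missing(value):
    """True for None and for NaN (NaN is the only value not equal to itself)."""
    return value is None or value != value


def recognized_words(rows, min_conf=0):
    """Keep word-level rows that carry text and meet the confidence floor."""
    words = []
    for row in rows:
        text = row.get("text")
        if row["level"] != 5 or is_missing(text) or not str(text).strip():
            continue
        # conf may arrive as int, float, or string depending on the version.
        if float(row["conf"]) < min_conf:
            continue
        words.append(row)
    return words


def count_by_level(rows):
    """Count how many rows each hierarchy level contributes."""
    counts = {}
    for row in rows:
        name = LEVEL_NAMES[row["level"]]
        counts[name] = counts.get(name, 0) + 1
    return counts
```

The min_conf threshold is a tuning knob, not a value taken from the article; a floor of 0 only drops the structural rows whose conf is -1.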










Difference between blocks and paragraphs (block_num vs par_num)

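Each row of the dataframe carries four coordinates, in pixels:

left - distance from the left edge of the image to the left edge of the box

top - distance from the top edge of the image to the top edge of the box

width - width of the box

height - height of the box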





For a block or paragraph, we go over all of its words and take the minimum left and top, together with the maximum right edge (left + width) and bottom edge (top + height). The result is the coordinates of the box around the selected text:
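One way to compute such an enclosing box is sketched below. It assumes words is a list of word-level row dicts with the columns described earlier; the function name group_boxes and its keys parameter are hypothetical. The box runs from the smallest left and top to the largest right (left + width) and bottom (top + height) edge, and is returned in the same left, top, width, height form.

```python
def group_boxes(words, keys=("block_num",)):
    """Return {group key: (left, top, width, height)} enclosing each group's words."""
    edges = {}
    for w in words:
        key = tuple(w[k] for k in keys)
        left, top = w["left"], w["top"]
        right, bottom = left + w["width"], top + w["height"]
        if key in edges:
            l, t, r, b = edges[key]
            edges[key] = (min(l, left), min(t, top), max(r, right), max(b, bottom))
        else:
            edges[key] = (left, top, right, bottom)
    # Convert the collected edges back to left/top/width/height.
    return {k: (l, t, r - l, b - t) for k, (l, t, r, b) in edges.items()}
```

Passing keys=("block_num", "par_num") groups by paragraph instead of by block. Tesseract's level 2 and level 3 rows already carry boxes for whole blocks and paragraphs, so a helper like this is mainly useful for groups it does not report, such as a sentence.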





Once you have a dataframe with the coordinates of each word, you can build a dataframe with the coordinates of blocks, paragraphs, lines, or sentences, depending on the problem being solved. To draw with these coordinates (geometric shapes, additional text, and so on), load the picture the text was read from and draw on it. If you enlarged the picture before recognition to obtain the coordinates, remember to enlarge it by the same factor before drawing.
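As an illustration of going from words to a target sentence, the sketch below looks for a phrase among consecutive word rows. It assumes the rows are in reading order, which is how Tesseract emits them, and matches case-insensitively with surrounding punctuation ignored. It also shows an alternative to enlarging the picture again: dividing the coordinates by the same factor so they fit the original picture. The names find_phrase and scale_box are made up for this example.

```python
PUNCTUATION = ".,;:!?"


def find_phrase(words, phrase):
    """Return the consecutive word rows whose texts match the phrase, or []."""
    target = [p.strip(PUNCTUATION).lower() for p in phrase.split()]
    texts = [str(w["text"]).strip(PUNCTUATION).lower() for w in words]
    for start in range(len(texts) - len(target) + 1):
        if texts[start:start + len(target)] == target:
            return words[start:start + len(target)]
    return []


def scale_box(box, factor):
    """Map (left, top, width, height) from the enlarged image back to the original."""
    return tuple(round(v / factor) for v in box)
```

The rows returned by find_phrase can then be merged into a single box by taking the outermost edges of their individual boxes.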





The code for reading the text and drawing with cv2.rectangle by coordinates will be posted on GitHub.
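A minimal sketch of the drawing step, not the author's published code: cv2.rectangle takes two corner points rather than a width and height, so the box is converted first. It assumes opencv-python is installed; box_corners and draw_boxes are hypothetical helper names, and the default color is red in OpenCV's BGR order.

```python
def box_corners(box):
    """Convert (left, top, width, height) into the two corners cv2.rectangle takes."""
    left, top, width, height = box
    return (int(left), int(top)), (int(left + width), int(top + height))


def draw_boxes(image_path, boxes, out_path, color=(0, 0, 255), thickness=2):
    """Draw every box on the image and save the result to out_path."""
    import cv2  # local import: only this function needs OpenCV

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    for box in boxes:
        top_left, bottom_right = box_corners(box)
        cv2.rectangle(image, top_left, bottom_right, color, thickness)
    cv2.imwrite(out_path, image)
```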







