Can neural networks read checks?

For almost three years now, I have been scrupulously recording all my income and expenses in hledger . Why exactly he? It happened historically. With the onset of 2018, I began to write everything down on Google, and in April I went to Japan. I was sitting in a hotel and trying to figure out how to correctly calculate prices in different currencies, and decided to write something on the Lisp. And he wrote . And he showed it to people in the email chat. To which I received the answer "but there is already a ready-made one" and a link to hledger. Then I dragged all my entries from the google plate into hledger.

What I love about this way of accounting for expenses is the ability to go out and rewrite history. So I decided that the headphones bought last year for my wife should be written not as a "technique", but as "gifts" - no problem.

And while studying at some point a check from the Platypus, I thought, how much do I spend on chocolate? Spoiler - a lot. I went into the order history, found old receipts there, rewrote the entries by category. There were just "expenses: food", but now "expenses: food: fruits" and others. At the same time, some household goods were also found there.

The first time I was doing this rewriting completely by hand. That is, he came home from the store, looked at the check and wrote down many lines. Then he automated a little - he made a template for a plate in emacs, where the lines contain goods with their categories and prices, and in the last column, the filter by category immediately gives the amounts.

But neural networks and other datasatanism.

An idea arose in my head that the neural network is quite able to understand that "GL.VIL.Oranges SELECT.fas.1kg" is such a fruit, unlike "BOTTOM.HL.aton PODMOSKOVNY 400g" (bread, but for this I had to copy this is the name on google).

The problem to be solved is obviously classification. Lines-names are submitted to the input, and a category is expected at the output. Initially, I will put these categories manually, and then only edit the predictions. Accordingly, the program must first analyze the text of the check, select the names of the products and their prices from there, predict the category for each name. Show me the predicted so that I can correct it. When everything is in order, form a set of lines for the hledger.

, . , .

def parse_utk(lines):
    while lines:
        if lines[0].startswith('  '):
            break
        lines.pop(0)
    else:
        return
    lines.pop(0)
    result = []
    while lines:
        data = lines[0:6]
        name, price, lines = lines[0], lines[3], lines[6:]
        if name.startswith(''):
            break
        assert price.startswith('= ')
        result.append((name, Decimal(price[2:]))
    return result

, . .

, . , character-level , flair. – . – . , , .

, . " ", . . . 23 21. – .

hledger.

$ hledger bal  -b thisyear -% -S
             100.0 %  expenses:
              20.6 %    <unsorted>
              15.3 %    
               7.9 %    
               7.1 %    
               6.9 %    
               6.3 %    
               5.2 %    
               4.8 %    
               4.0 %    
               3.8 %    
               2.7 %    
               2.7 %    
               2.0 %    
               2.0 %    
               1.7 %    
               1.6 %    
               1.3 %    
               1.2 %    
               0.9 %    
               0.9 %    
               0.5 %    
               0.4 %    
               0.2 %    

, . . , ( ) .

. , . , ?

– , . rule-based . ( - ) , ? , .

– NER. flair, ( ? ). . , IOB- … , .

rule-based , . . , "" . , . , , , . -, rule-based .

, . , . , . flair SpaceTokenizer, , . " ". , .

- - : " , , , ". , "1 107 99" , 107 99 . , ( ). ? , , .

, . . – " ", " " " ". NER . " " "".

. , (, , NewlineTokenizer), . , ColumnCorpus . \r, , . vertical tab (\x0b), RS US.

- , . , O, . , , . , , , .

, 'entry' 'entry'. – . 'O', , 'O\n', .

, . – flair.

. – , "" . .

rule-based , "" . , , rule-based . , , , rule-based .

One single check was enough to train taggers. True, at first I had to finish the script in order to edit the intermediate results, and then also save them in the correct format (I managed to forget about the special column separator and did not understand for an hour why nothing worked). It remains only to screw the "neural network" partition to the main parser.

Then comb the code and publish.

PS sources




All Articles