NLP: extract facts from text using Tomita parser

NLP - natural language processing





Most of the data in the world is not structured - it's just texts in Russian or any other language. Extracted facts from such texts can be of particular interest for business, so such tasks often arise. A separate area of ​​artificial intelligence deals with this issue: natural language processing, the same NLP ( Natural Language Processing ).





There are many ways to extract facts from text, and they all have their pros and cons:





  • Regular Expressions





High speed and stability are offset by the complexity of the syntax and low coverage.





  • Neural networks





Fashionable, good quality when training on a large sample, but work requires a lot of markup, and each new task requires a new markup.





  • KS grammars





Predictability of the result, easy to write rules, but difficult to run in PROM





,  , , , , . , , , , , :









№2 ,15 2020., 5400,00





.. 15299,00 ,





№575 , 145 17.09.2020 2020 , 18% — 5300 .





, 23, 51 01.09.2020 — 7500 .





№1-03 01.07.2020 211 2020 23000 ..(18%)





-?

- – () , (- ) .  GitHub, .





-?

  • – ,





















-?

- , , , . ,  ,  , .





tomitaparser.exe ( . ) :





  • config.proto — . , . tomitaparser.exe;





  • dic.gzt – . . , , , . ;





  • mygram.cxx – . , . . ;





  • facttypes.proto – ;





  • kwtypes.proto – . , .





 utf8 , ( ).





«dic.gzt», , .





encoding "utf8"; //   

//        ,     
import "base.proto";
import "articles_base.proto";
//     
TAuxDicArticle "payment" {
    key = { "tomita:mygram.cxx" type=CUSTOM }
};
      
      



. , — , . . «->»  . , – . . . (Noun, Verb, Adj), (Comma, Punct, Ampersand, PlusSign) . .  .  





. , « () , .  .





() , (), , . ()  «< >»  .  -  . , «cxx», – «mygram.cxx». . . — , , «», «», «».





#encoding "utf8" //   

//  "|"    ""
Rent -> '' | '' | '';
//  "" ,       0   
//  <gnc-agr[1]>   ,         ,   
Purpose -> Rent Adj<gnc-agr[1]> Noun<gnc-agr[1]>;
      
      



. , , , . , , .





//   StreetW    ,   StreetAbbr -   
StreetW -> '' | '' | '' | '';
StreetAbbr -> '' | '' | '' | '-' | '';

//       StreetDescr,       StreetW   StreetAbbr
StreetDescr -> StreetW | StreetAbbr;
StreetNameNoun -> (Adj<gnc-agr[1]>) Word<gnc-agr[1], rt> (Word<gram="">);
StreetNameAdj -> Adj<h-reg1> Adj*;
      
      



«StreetNameNoun»  , . , , «<rt>». , , . , , . . , , .. , «()». «StreetNameAdj»  , . . «<h-reg1>». , «*». , .





Address -> StreetDescr StreetNameNoun<gram="", h-reg1>;
Address -> StreetDescr StreetNameNoun<gram="", h-reg1>;

Address -> StreetNameAdj<gnc-agr[1]> StreetW<gnc-agr[1]>;
Address -> StreetNameAdj StreetAbbr;
      
      



. , . , . . , . :





//       «dic.gzt»
TAuxDicArticle "month" {
    key = { "" | "" | "" | "" | "" | "" | "" | "" | "" | "" | "" | "" }
};
      
      



:





Month -> Noun<kwtype="month">;
Year -> AnyWord<wff=/[1-2]?[0-9]{1,3}?\.?/>;

Period -> Month Year;
      
      



«kwtype» , «month» , 0 2999 «» «.» . , . «Result» :





Result -> Purpose AnyWord* Address AnyWord* Period;
Result -> Purpose AnyWord* Address;
Result -> Purpose;
      
      



«AnyWord» «*» , 0 . : , . : , .





. – «facttypes.proto» «dic.gzt»  (, - , ).





import "facttypes.proto"; //    «dic.gzt» 
      
      



«facttypes.proto» «Payment»  (): , . :





//   
import "base.proto";
import "facttypes_base.proto";

message Payment: NFactType.TFact {
    required string Purpose = 1;
    optional string Address = 2;
    optional string Period = 3;
};
      
      



«Payment» «NFactType.TFact», «required» «optional» , . , , «interp» , . , .





//  «Purpose»    «Purpose»  «Payment»
//  «Address»    «Address»  «Payment»
//  «Period»    «Period»  «Payment»
Result -> Purpose interp(Payment.Purpose) AnyWord* Address interp(Payment.Address) AnyWord* Period interp(Payment.Period);
Result -> Purpose interp(Payment.Purpose) AnyWord* Address interp(Payment.Address);
Result -> Purpose interp(Payment.Purpose);
      
      



, , , .





encoding "utf8"; //   

TTextMinerConfig {
    //   
    Dictionary = "dic.gzt";
    //  
    Input = {File = "input.txt"}
    //      
    Output = {File = "output.txt"
            Format = text}
    // ,     
    Articles = [
        { Name = "payment" }
        ]
    // ,  
    Facts = [
        { Name = "Payment" }
        ]
    //       
    PrettyOutput = "pretty.html"
}
      
      



:





> tomitaparser.exe config.proto
      
      



In the file " input.txt " we have placed the source text placed at the very beginning of the article. After work, the parser wrote the result to the " output.txt " file :





2      , 15   2020 . ,   5400,00  
    Payment
    {
        Purpose =   
        Address =   
        Period =  2020
    }
     . .      15299,00  ,   
    Payment
    {
        Purpose =  
    }575      , 145  17.09.2020   2020  ,   18% - 5300  . 
    Payment
    {
        Purpose =  
        Address =   
        Period =  2020
    }
          , 23 ,  51  01.09.2020 - 7500  . 
    Payment
    {
        Purpose =   
        Address =  
    }1-03  01.07.2020       211   2020  23000  .. ( 18% ) 
    Payment
    {
        Purpose =  
        Address =  
        Period =  2020
    }
      
      



Extracting facts from natural language is a rather non-trivial task in the IT world to this day. Now we have another available tool in our hands. As you can see, creating your first grammar can be quite easy, while spending a little time learning. for Tomita, detailed and comprehensive documentation is provided. However, the quality of the highlighted facts strongly depends on the developer himself and his knowledge in the expert field.








All Articles