Sibur Challenge 2020 or "how we came up with features"

Hello everyone! This year, Sibur Digital again hosted a large (by Russian standards) data analysis championship. My friend and I participated in it and would like to share our solution and the experience we gained with the readers of Habr. Of course, this article is unlikely to reveal anything new to seasoned competitors, but a beginner in DS competitions will definitely find something useful in it.





Who are we?

We are students who are very passionate about DS and ML. We first learned about this field at the AI Journey conference held at our university. Since then, we have completed not one, not two, and not even three courses (from Omsk State Technical University to Andrew Ng), and now we constantly participate in hackathons and competitions (in some we have even won prizes), while also looking for internships.





About the task

We took on the second challenge of the competition - "name matching".





The essence is as follows: Sibur works with a huge number of new companies, and to streamline its workflow it is useful to recognize when a new counterparty belongs to an already familiar holding. For example, Sibur Neftekhim and SIBUR IT belong to the same holding, and when working with either of them it makes sense to use the information accumulated earlier on the SIBUR holding.





Let's restate the problem in DS terms: given two company names, determine whether the companies belong to the same holding or not.





name_1                    | name_2             | is_duplicate
Japan Synthetic Rubber Co | Jsr Bst Elastomer  | 1
JSR Corporation           | BST ELASTOMERS CO. | 0

This is what the dataset looked like.





Data preprocessing

First of all, we converted the data to the Latin alphabet using the magic unidecode module. Then we lowercased everything and removed garbage such as stray punctuation marks, double spaces, and so on.





from unidecode import unidecode
import re

def preprocess(text: str) -> str:
    text = unidecode(text)                # transliterate to the Latin alphabet
    text = text.lower()
    text = re.sub(r'[\.,]+', '', text)    # drop periods and commas
    text = re.sub(r"\(.*\)", ' ', text)   # drop parenthesized fragments
    text = re.sub(r"[^\w\s]", ' ', text)  # drop remaining punctuation
    text = re.sub(r'\b\w\b', ' ', text)   # drop single-character tokens
    return ' '.join(text.split())         # collapse repeated whitespace

# e.g. preprocess("BST ELASTOMERS CO.") -> "bst elastomers co"



Next, we used the pycountry module to detect and remove country names from the company names.





We did the same with city names: for example, the word "shanghai" appears in many company names but says nothing about the holding, so we removed such words as well.










After preprocessing we moved on to feature engineering. Among the features we built for each pair of names were:

  • tf-idf based similarity between the two names

  • n-gram based features
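The tf-idf and n-gram similarity features mentioned above can be sketched as follows; the vectorizer settings (character 3-grams via `char_wb`) are our assumption, not necessarily the competition setup.

```python
# Sketch: one pairwise similarity feature built from tf-idf over character n-grams.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pairs = [
    ("japan synthetic rubber co", "jsr bst elastomer"),
    ("jsr corporation", "bst elastomers co"),
]

# character 3-grams tolerate abbreviations and typos better than whole words
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
vec.fit([name for pair in pairs for name in pair])

def tfidf_sim(a: str, b: str) -> float:
    va, vb = vec.transform([a]), vec.transform([b])
    return float(cosine_similarity(va, vb)[0, 0])

features = [tfidf_sim(a, b) for a, b in pairs]
```

Each value lands in [0, 1] and can be fed to the model as one column; varying `ngram_range` gives a family of such features.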





On these features we trained XGBoost, which gave us a score of ~0.59.





After further improvements to the features, the score rose to 0.69.










What could have been done better?





We could have taken the semantics of words into account, or assigned weights to words: if a word occurs in both names and is informative (i.e. it is part of the actual company name), the match should carry positive weight, and a mismatch in such words should count against the pair just as strongly. We could also have used as much external data with company names as possible. And one should not forget to analyze the observations the model gets wrong (false positives and false negatives) and build new features based on them.
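The error analysis suggested above can be sketched like this; the `pred` column and the tiny frame are hypothetical, standing in for real model predictions.

```python
# Sketch: split the model's mistakes into false positives and false negatives.
import pandas as pd

df = pd.DataFrame({
    "name_1": ["japan synthetic rubber co", "jsr corporation"],
    "name_2": ["jsr bst elastomer", "bst elastomers co"],
    "is_duplicate": [1, 0],   # ground truth
    "pred": [0, 0],           # hypothetical model predictions
})

false_negatives = df[(df.is_duplicate == 1) & (df.pred == 0)]
false_positives = df[(df.is_duplicate == 0) & (df.pred == 1)]
```

Reading through each group by hand usually suggests the next feature: e.g. many false negatives sharing an abbreviation hint at an abbreviation-matching feature.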





PS

All the code is available here





If you want to contact us: matnik2001@gmail.com, domonion@list.ru







