Clustering and Classifying Big Text Data with Machine Learning in Java. Article #3 - Architecture / Results

Hello, Habr! This is the final part of the series Clustering and Classification of Big Text Data Using Machine Learning in Java. It continues the first and second articles.









The article describes the system architecture, algorithm, and visual results. All the details of the theory and algorithms can be found in the first two articles.









The system architecture can be divided into two main parts: the web application and the data clustering and classification software.









The machine learning software algorithm consists of 3 main parts:

  1. natural language processing;
    1. tokenization;
    2. lemmatization;
    3. stop-word removal (stop listing);
    4. word frequency counting;
  2. clustering methods;
    1. TF-IDF;
    2. SVD;
    3. finding cluster groups;
  3. classification methods - Aylien API.
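To make the first stage of the list above more concrete, here is a minimal sketch of tokenization, stop listing and word frequency counting in plain Java. It is not the code from the earlier articles: the class name NlpSketch, the tiny stop list and the regular expression are assumptions chosen for illustration, and lemmatization is left out because it needs a dictionary-backed library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, not taken from the original project.
class NlpSketch {
    // Tiny illustrative stop list; a real one would be far larger.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and", "is", "to"));

    // Tokenization: lower-case the text, then split on anything that is not a letter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stop listing: drop every token that appears in the stop list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Word frequency: count how often each remaining token occurs.
    static Map<String, Integer> wordFrequency(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : tokens) {
            freq.merge(t, 1, Integer::sum);
        }
        return freq;
    }
}
```

Chaining the three calls on a short sentence yields a word-to-count map that can feed the TF-IDF stage.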





Natural language processing

The algorithm starts by reading any text data. Since our system is an electronic library, the books are mostly in PDF format. The implementation and details of the NLP step are covered in the earlier articles of this series.





Below is a comparison of the results of running the lemmatization and stemming algorithms:





  : 4173415
    : 88547
    : 82294
      
      











Sample of the text after lemmatization:





characterize, design, space, render, robot, face, alisa, kalegina, university, washington, seattle, washington, grace, schroeder, university, washington, seattle, washington, aidan, allchin, lakeside, also, il, school, seattle, washington, keara, berlin, macalester, college, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, university, washington, seattle, washington, abstract, face, critical, establish, agency, social, robot, building, expressive, mechanical, face, costly, difficult, robot, build, year, face, ren, der, screen, great, flexibility, robot, face, open, design, space, tablish, robot, character, perceive, property, despite, prevalence, robot, render, face, systematic, exploration, design, space, work, aim, fill, gap, conduct, survey, identify, robot, render, face, code, term, property, statistics
      
      



Sample of the text after stemming:





character, design, space, render, robot, face, alisa, kalegina, univers, washington, seattl, washington, grace, schroeder, univers, washington, seattl, washington, grsuwedu, aidan, allchin, lakesid, also, il, school, seattl, washington, keara, berlin, macalest, colleg, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, univers, washington, seattl, washington, abstract, face, critic, establish, agenc, social, robot, build, express, mechan, face, cost, difficult, mani, robot, built, year, face, ren, dere, screen, great, flexibl, robot, face, open, design, space, tablish, robot, charact, perceiv, properti, despit, preval, robot, render, face, systemat, explor, design, space, work, aim, fill, gap, conduct, survey, identifi, robot, render, face, code, term, properti, statist, common, pattern, observ, data, set, face, conduct, survey, understand, peopl, percep, tion, render, robot, face, identifi, impact, differ, face, featur, survey, result, indic, prefer, vari, level, realism, detail, robot, face
      
      











Next, tf-idf is calculated. The values are kept in a HashMap.
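As an illustration of the tf-idf step, the sketch below computes the plain textbook weighting (term frequency multiplied by the logarithm of the inverse document frequency) and stores the result in a HashMap. The exact variant used in this project is not shown here, so the formula, the class name TfIdfSketch and the method names are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; uses the textbook tf-idf definition.
class TfIdfSketch {
    // tf: occurrences of the term in the document divided by the document length.
    static double tf(List<String> doc, String term) {
        long count = doc.stream().filter(term::equals).count();
        return doc.isEmpty() ? 0.0 : (double) count / doc.size();
    }

    // idf: natural log of (number of documents / documents containing the term).
    static double idf(List<List<String>> docs, String term) {
        long containing = docs.stream().filter(d -> d.contains(term)).count();
        return containing == 0 ? 0.0 : Math.log((double) docs.size() / containing);
    }

    // Weight of every distinct term of one document, keyed by the term.
    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> docs) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : new HashSet<>(doc)) {
            weights.put(term, tf(doc, term) * idf(docs, term));
        }
        return weights;
    }
}
```

A term that occurs in every document gets weight 0, which is why common words contribute nothing to the later clustering.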










tf-idf:









A fragment of the matrix obtained from the tf-idf values (three values per row):





-0.0031139399383999997 0.023330604746 -1.3650204652799997E-4
-0.038380206566 0.00104373247064 0.056140327901
-0.006980774822399999 0.073057418689 -0.0035209342337999996
-0.0047152503238 0.0017397257449 0.024816828582999998
-0.005195951771999999 0.03189764447 -5.9991080912E-4
-0.008568593700999999 0.114337675179 -0.0088221197958
-0.00337365927 0.022604474721999997 -1.1457816390099999E-4
-0.03938283525 -0.0012682796482399999 0.0023486548592
-0.034341362795999995 -0.00111758118864 0.0036010404917
-0.0039026609385999994 0.0016699372352999998 0.021206653766000002
-0.0079418490394 0.003116062838 0.072380311755
-0.007021828444599999 0.0036496566028 0.07869801528199999
-0.0030219410092 0.018637386319 0.00102082843809
-0.0042041069026 0.023621439238999998 0.0022947637053
-0.0061050946438 0.00114796066823 0.018477825284
-0.0065708646563999995 0.0022944737838999996 0.035902813761
-0.037790461814 -0.0015372596281999999 0.008878823611899999
-0.13264545848599998 -0.0144908102251 -0.033606397957999995
-0.016229093174 1.41831464625E-4 0.005181988760999999
-0.024075296507999996 -8.708131965899999E-4 0.0034344653516999997

      
      











SVD   .





The documents and their cluster assignments are stored in OrientDB.














– . , , DBSCAN. . . r=0.007. 562 80.000 , . , .
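Since DBSCAN and the radius r = 0.007 are named above, a compact sketch of the classic DBSCAN procedure over three-dimensional points may help. This is the textbook algorithm, not necessarily the variant used in the project; the class name DbscanSketch and the minPts parameter are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper class implementing textbook DBSCAN.
class DbscanSketch {
    static final int NOISE = -1;
    static final int UNVISITED = -2;

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Indices of all points within radius eps of point i (the point itself included).
    static List<Integer> neighbors(double[][] pts, int i, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < pts.length; j++) {
            if (distance(pts[i], pts[j]) <= eps) {
                out.add(j);
            }
        }
        return out;
    }

    // Returns one label per point: a cluster id starting at 0, or NOISE.
    static int[] cluster(double[][] pts, double eps, int minPts) {
        int[] labels = new int[pts.length];
        Arrays.fill(labels, UNVISITED);
        int nextId = 0;
        for (int i = 0; i < pts.length; i++) {
            if (labels[i] != UNVISITED) {
                continue;
            }
            List<Integer> seeds = neighbors(pts, i, eps);
            if (seeds.size() < minPts) {
                labels[i] = NOISE; // may later be claimed as a border point
                continue;
            }
            int id = nextId++;
            labels[i] = id;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (labels[q] == NOISE) {
                    labels[q] = id; // border point: joins the cluster, is not expanded
                }
                if (labels[q] != UNVISITED) {
                    continue;
                }
                labels[q] = id;
                List<Integer> qNeighbors = neighbors(pts, q, eps);
                if (qNeighbors.size() >= minPts) {
                    queue.addAll(qNeighbors); // core point: keep expanding
                }
            }
        }
        return labels;
    }
}
```

The neighbour search here is a full scan per point, which is quadratic overall; for large collections a spatial index would be the usual replacement.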





r = max (D) / n









   max(D)  β€’ , . n -
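The radius formula r = max(D) / n can be sketched as follows. The definitions of D and n are not readable in this copy, so the code assumes that D is the set of pairwise Euclidean distances between the points and that n is a divisor supplied by the caller; both are assumptions, as is the class name RadiusSketch.

```java
// Hypothetical helper class; D and n are interpreted as described in the lead-in.
class RadiusSketch {
    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // max(D): the largest pairwise distance in the point set.
    static double maxDistance(double[][] pts) {
        double max = 0.0;
        for (int i = 0; i < pts.length; i++) {
            for (int j = i + 1; j < pts.length; j++) {
                max = Math.max(max, distance(pts[i], pts[j]));
            }
        }
        return max;
    }

    // r = max(D) / n
    static double radius(double[][] pts, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        return maxDistance(pts) / n;
    }
}
```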






















, . 4-. ( > nt)





nt = N / S

Nβ€’ - , S β€’ .
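The threshold nt = N / S and the "> nt" comparison can be illustrated with a small filter. The meanings of N and S are not readable in this copy; the sketch assumes N is the total number of clustered points and S is the number of clusters found, so nt is the average cluster size. The class name ClusterThresholdSketch and the map layout (cluster id to member indices) are likewise assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; N and S are interpreted as described in the lead-in.
class ClusterThresholdSketch {
    // nt = N / S
    static double threshold(int totalPoints, int clusterCount) {
        return (double) totalPoints / clusterCount;
    }

    // Keeps only the clusters whose member count is strictly greater than nt.
    static Map<Integer, List<Integer>> keepLargeClusters(Map<Integer, List<Integer>> clusters) {
        int total = 0;
        for (List<Integer> members : clusters.values()) {
            total += members.size();
        }
        double nt = threshold(total, clusters.size());
        Map<Integer, List<Integer>> kept = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : clusters.entrySet()) {
            if (e.getValue().size() > nt) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```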














Classification methods - Aylien API





Aylien API . API json , . API . 9 , . POST API:





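import com.aylien.textapi.parameters.ClassifyByTaxonomyParams;
import com.aylien.textapi.responses.TaxonomyCategory;
import com.aylien.textapi.responses.TaxonomyClassifications;
import com.orientechnologies.orient.core.sql.executor.OResult;
import com.orientechnologies.orient.core.sql.executor.OResultSet;

// Collect the text of every document that belongs to the cluster.
// A positional parameter replaces SQL built by string concatenation.
String queryText = "select DocText from documents where clusters = ?";
try (OResultSet resultSet = database.query(queryText, cluster)) {
    while (resultSet.hasNext()) {
        OResult result = resultSet.next();
        // Strip the serialization characters and the field label (in any letter case).
        String textDoc = result.toString()
                .replaceAll("[<>{}]", "")
                .replaceAll("(?i)doctext:", "")
                .toLowerCase();
        keywords.add(textDoc.replaceAll("\\n", ""));
    }
}

// Classify the combined cluster text against the IAB QAG taxonomy.
ClassifyByTaxonomyParams.Builder taxonomyBuilder = ClassifyByTaxonomyParams.newBuilder();
taxonomyBuilder.setText(keywords.toString());
taxonomyBuilder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(taxonomyBuilder.build());
for (TaxonomyCategory c : response.getCategories()) {
    clusterUpdate.add(c.getLabel());
}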

      
      







GET, :































Web application





The web application is built with Vaadin Flow:






















“Technology & Computing”:



























, . . , , . . . . : .





, , , -, tf-idf, . , . DBSCAN . . , , . , , , , ..





For storage, a NoSQL database was chosen: OrientDB, which supports 4 NoSQL models.





Aylien API, . , 100 . , , , k-, . , .







