Hello, Habr! This is the final part of the series "Clustering and Classification of Big Text Data Using Machine Learning in Java". It continues the first and second articles.
The article describes the system architecture, algorithm, and visual results. All the details of the theory and algorithms can be found in the first two articles.
The system architecture can be divided into two main parts: a web application, and the software that clusters and classifies the data.
The machine learning software algorithm consists of 3 main parts:
natural language processing: tokenization, lemmatization, stop-word filtering, word frequency counting;
clustering methods: TF-IDF, SVD, finding cluster groups;
classification methods: Aylien API.
Natural language processing
The algorithm starts by reading the text data. Since our system is an electronic library, most of the books are in PDF format. You can read about the implementation and details of the NLP stage here.
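To make the preprocessing steps concrete, here is a minimal sketch of tokenization, stop-word filtering and word frequency counting in plain Java. It is not the code from the earlier articles: the class name `TextPreprocessor`, the tiny stop list and the rule of splitting on non-letter characters are assumptions made for illustration. Lemmatization is left out because it needs a dictionary or a trained model from an NLP library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not taken from the original project.
final class TextPreprocessor {
    // Illustrative stop list (an assumption); a real one is much longer.
    static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "of", "and", "in", "to", "is");

    // Lower-cases the text, splits it on anything that is not a letter,
    // and drops empty tokens and stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Counts how many times each token occurs.
    static Map<String, Integer> wordFrequency(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : tokens) {
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }
}
```

Calling `TextPreprocessor.wordFrequency(TextPreprocessor.tokenize(text))` on the text extracted from a book gives the per-document word counts that the later steps can build on.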
Below is a comparison of the output of the lemmatization and stemming algorithms:
: 4173415 : 88547 : 82294
Text after lemmatization:
characterize, design, space, render, robot, face, alisa, kalegina, university, washington, seattle, washington, grace, schroeder, university, washington, seattle, washington, aidan, allchin, lakeside, also, il, school, seattle, washington, keara, berlin, macalester, college, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, university, washington, seattle, washington, abstract, face, critical, establish, agency, social, robot, building, expressive, mechanical, face, costly, difficult, robot, build, year, face, ren, der, screen, great, flexibility, robot, face, open, design, space, tablish, robot, character, perceive, property, despite, prevalence, robot, render, face, systematic, exploration, design, space, work, aim, fill, gap, conduct, survey, identify, robot, render, face, code, term, property, statistics
Text after stemming:
character, design, space, render, robot, face, alisa, kalegina, univers, washington, seattl, washington, grace, schroeder, univers, washington, seattl, washington, grsuwedu, aidan, allchin, lakesid, also, il, school, seattl, washington, keara, berlin, macalest, colleg, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, univers, washington, seattl, washington, abstract, face, critic, establish, agenc, social, robot, build, express, mechan, face, cost, difficult, mani, robot, built, year, face, ren, dere, screen, great, flexibl, robot, face, open, design, space, tablish, robot, charact, perceiv, properti, despit, preval, robot, render, face, systemat, explor, design, space, work, aim, fill, gap, conduct, survey, identifi, robot, render, face, code, term, properti, statist, common, pattern, observ, data, set, face, conduct, survey, understand, peopl, percep, tion, render, robot, face, identifi, impact, differ, face, featur, survey, result, indic, prefer, vari, level, realism, detail, robot, face
tf-idf . HashMap, - , - -.
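As an illustration of computing tf-idf with a HashMap, here is a minimal sketch. The exact weighting is described in the earlier articles, so the variant used here is an assumption: tf is the term count divided by the document length, and idf is the natural logarithm of the number of documents divided by the number of documents containing the term. The class name `TfIdf` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from the original project.
final class TfIdf {
    // Returns one map per document: term -> tf-idf weight.
    // Assumed variant: tf = count / document length,
    // idf = ln(number of documents / documents containing the term).
    static List<Map<String, Double>> compute(List<List<String>> docs) {
        // Document frequency: in how many documents each term appears.
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts inside this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.size();
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                weights.put(e.getKey(), tf * idf);
            }
            result.add(weights);
        }
        return result;
    }
}
```

With this variant a term that occurs in every document gets a weight of zero, which is why very common words contribute nothing to the document vectors.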
-:
, , tf-idf. :
-0.0031139399383999997 0.023330604746 -1.3650204652799997E-4
-0.038380206566 0.00104373247064 0.056140327901
-0.006980774822399999 0.073057418689 -0.0035209342337999996
-0.0047152503238 0.0017397257449 0.024816828582999998
-0.005195951771999999 0.03189764447 -5.9991080912E-4
-0.008568593700999999 0.114337675179 -0.0088221197958
-0.00337365927 0.022604474721999997 -1.1457816390099999E-4
-0.03938283525 -0.0012682796482399999 0.0023486548592
-0.034341362795999995 -0.00111758118864 0.0036010404917
-0.0039026609385999994 0.0016699372352999998 0.021206653766000002
-0.0079418490394 0.003116062838 0.072380311755
-0.007021828444599999 0.0036496566028 0.07869801528199999
-0.0030219410092 0.018637386319 0.00102082843809
-0.0042041069026 0.023621439238999998 0.0022947637053
-0.0061050946438 0.00114796066823 0.018477825284
-0.0065708646563999995 0.0022944737838999996 0.035902813761
-0.037790461814 -0.0015372596281999999 0.008878823611899999
-0.13264545848599998 -0.0144908102251 -0.033606397957999995
-0.016229093174 1.41831464625E-4 0.005181988760999999
-0.024075296507999996 -8.708131965899999E-4 0.0034344653516999997
SVD .
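For readers who want to see what is behind the SVD step, here is a small sketch that approximates only the leading singular value and right singular vector of a matrix by power iteration. This is a stand-in chosen for brevity, not the decomposition used in the project; a real system would normally call a linear algebra library for a full SVD. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, not taken from the original project.
final class PowerIterationSvd {
    // Approximates the leading right singular vector of `a` by power iteration on A^T A.
    // The uniform start vector is an arbitrary choice; the method fails only if that
    // vector happens to be orthogonal to the true leading vector.
    static double[] leadingRightSingularVector(double[][] a, int iterations) {
        int cols = a[0].length;
        double[] v = new double[cols];
        Arrays.fill(v, 1.0 / Math.sqrt(cols));
        for (int it = 0; it < iterations; it++) {
            double[] w = multiplyTransposed(a, multiply(a, v)); // A^T (A v)
            double n = norm(w);
            if (n == 0.0) break; // v lies in the null space of A
            for (int j = 0; j < cols; j++) v[j] = w[j] / n;
        }
        return v;
    }

    // If v is a unit right singular vector, ||A v|| is its singular value.
    static double singularValue(double[][] a, double[] v) {
        return norm(multiply(a, v));
    }

    // A v: one value per row, i.e. each row's coordinate along v.
    static double[] multiply(double[][] a, double[] v) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < v.length; j++) out[i] += a[i][j] * v[j];
        return out;
    }

    // A^T u: one value per column.
    static double[] multiplyTransposed(double[][] a, double[] u) {
        double[] out = new double[a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < out.length; j++) out[j] += a[i][j] * u[i];
        return out;
    }

    static double norm(double[] x) {
        double sum = 0.0;
        for (double value : x) sum += value * value;
        return Math.sqrt(sum);
    }
}
```

Here `multiply(a, v)` gives the coordinate of every document (row) along the strongest direction in the data. Keeping a few such directions is what reduces a wide tf-idf matrix to a handful of columns.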
, . β , . OrientDB , OrientDB . OrientDB , , , . . .
, .
β . , , DBSCAN. . . r=0.007. 562 80.000 , . , .
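A minimal sketch of DBSCAN over points such as the three-column rows shown earlier may help here. The radius corresponds to the r value mentioned above. The `minPts` parameter, the Euclidean distance and the class name `Dbscan` are assumptions for illustration, and the brute-force neighbour search is only practical for small inputs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not taken from the original project.
final class Dbscan {
    static final int NOISE = -1;
    private static final int UNVISITED = -2;

    // Returns a cluster id (0, 1, ...) for each point, or NOISE.
    static int[] cluster(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, UNVISITED);
        int clusterId = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) continue;
            List<Integer> neighbors = regionQuery(points, i, eps);
            if (neighbors.size() < minPts) {
                labels[i] = NOISE; // may later be claimed as a border point
                continue;
            }
            labels[i] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(neighbors);
            while (!queue.isEmpty()) {
                int j = queue.poll();
                if (labels[j] == NOISE) labels[j] = clusterId; // border point
                if (labels[j] != UNVISITED) continue;
                labels[j] = clusterId;
                List<Integer> jNeighbors = regionQuery(points, j, eps);
                if (jNeighbors.size() >= minPts) queue.addAll(jNeighbors); // core point
            }
            clusterId++;
        }
        return labels;
    }

    // Indices of all points within eps of points[index], the point itself included.
    private static List<Integer> regionQuery(double[][] points, int index, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int k = 0; k < points.length; k++) {
            if (distance(points[index], points[k]) <= eps) result.add(k);
        }
        return result;
    }

    // Euclidean distance between two points of equal dimension.
    private static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

A call such as `Dbscan.cluster(points, 0.007, 2)` returns one label per row. Points that do not have enough neighbours within the radius and are not reachable from a dense region stay labelled as noise.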
max(D) β , . n -
, . β , β
, . 4-. ( > nt)
Nβ - , S β .
, .
Classification methods: Aylien API
Aylien API . API json , . API . 9 , . POST API:
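// `database`, `cluster`, `keywords`, `clusterUpdate` and `client` are defined elsewhere in the application.
// Fetch the text of every document in the cluster; the positional parameter avoids SQL injection.
String queryText = "select DocText from documents where clusters = ?";
try (OResultSet resultSet = database.query(queryText, cluster)) {
    while (resultSet.hasNext()) {
        OResult result = resultSet.next();
        // Strip the wrapper characters and the field name left by toString(), whatever its case.
        String textDoc = result.toString()
                .replaceAll("[<>{}|]", "")
                .replaceAll("(?i)doctext:", "")
                .toLowerCase();
        keywords.add(textDoc.replaceAll("\\n", ""));
    }
}
// Classify the combined text of the cluster against the IAB QAG taxonomy.
ClassifyByTaxonomyParams.Builder classifyByTaxonomyBuilder = ClassifyByTaxonomyParams.newBuilder();
classifyByTaxonomyBuilder.setText(keywords.toString());
classifyByTaxonomyBuilder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(classifyByTaxonomyBuilder.build());
// Store the label of every category returned for the cluster.
for (TaxonomyCategory c : response.getCategories()) {
    clusterUpdate.add(c.getLabel());
}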
GET, :
. .
. . , . . , . , :
Web application
- β . , . - , . Vaadin Flow:
"Technology & Computing":
, . . , , . . . . : .
, , , -, tf-idf, . , . DBSCAN . . , , . , , , , ..
, NoSQL , OrientDB, 4 NoSQL. , . OrientDB , .
Aylien API, . , 100 . , , , k-, . , .