Clustering and Classification of Big Text Data with Java Machine Learning. Article #1: Theory




This article will be divided into 3 parts (Theory / Methods and Algorithms for Problem Solving / Development and Implementation in Java) to describe the full picture. This first article covers only the theory, to give readers the necessary background.



Purpose of the article:



  • Partial or full automation of the task of clustering and classifying big data, namely text data.
  • Application of machine learning algorithms "unsupervised" (clustering) and "supervised" (classification).
  • Analysis of current problem solutions.


Tasks to be considered in general:



  1. Development and application of natural language processing algorithms and methods.
  2. Development and application of clustering methods to determine the cluster groups of input documents.
  3. Application of classification methods to define the subject matter of each cluster group.
  4. Web interface development based on Java Vaadin


Hypotheses that I derived from the problem statement and while studying the theory:



  • Classifying cluster groups, rather than individual objects, yields more abstract and more valuable hidden knowledge while ignoring noise.
  • Clustering accuracy is directly proportional to the number of cluster groups and inversely proportional to the number of objects in one cluster group.


For anyone interested in the algorithm itself, here is an overview up front.



The machine learning software algorithm consists of 3 main parts:



  1. Natural language processing.

    1. tokenization;
    2. lemmatization;
    3. stop listing;
    4. frequency of words;


  2. Clustering methods.

    1. TF-IDF;
    2. SVD;
    3. finding cluster groups;

  3. Classification methods - the Aylien API.


So let's start the theory.



1. The concept of machine learning



Machine learning is a branch of artificial intelligence that studies methods capable of improving their performance automatically by generalizing from data, rather than by following explicitly programmed rules. Instead of hand-coding a separate solution for every situation, we give an algorithm examples and let it extract the regularities itself.



The best-known formal definition belongs to Tom Mitchell: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." Here the experience E is the data the program observes, T is the task being solved, and P measures how well it is solved. In a spam filter, for example, E is a collection of emails already marked as spam or not, T is deciding whether a new email is spam, and P is the share of, say, 100 test emails classified correctly.



The word "experience" is key: a learning system gets better at its task the more relevant data it sees, and the quality of that data largely determines the quality of the result. This is why machine learning and large datasets are so tightly linked, a relationship examined in the next section.





2. Machine learning and big data



Big data is commonly characterized by four defining properties, often called the "four V's":



  1. Volume: the sheer amount of data, far beyond what traditional tools can store and process on a single machine.
  2. Velocity: the speed at which new data arrives and must be processed, often in near real time.
  3. Variety: the diversity of formats in which data comes, from structured tables to text, images, audio and logs. Most of it is unstructured, and text is one of the most widespread unstructured forms, which is exactly the kind of data this project deals with.
  4. Veracity: the reliability and quality of the data, which vary widely and must be accounted for during analysis.



The volume of data generated in the world grows every year, and most of it is unstructured: text, images, audio, video, logs. Companies such as Google, Yahoo, Microsoft and Amazon were among the first to face datasets too large for traditional processing, and they built the infrastructure and the algorithms needed to exploit them. Social platforms such as Facebook, YouTube and Twitter add enormous streams of user-generated content: posts, comments, likes, photos and videos.



Raw data of this kind has little value on its own. Its value appears only after analysis, when hidden patterns and regularities are extracted from it, and at such a scale this extraction can no longer be done manually. This is exactly where machine learning enters: big data supplies the experience, and machine learning algorithms turn that experience into models, predictions and decisions.



In conclusion, we can say that big data and machine learning are closely tied to each other: big data is useless without analysis and information extraction, and machine learning could not exist without big data, which supplies the algorithms with the experience they learn from.





3. Types of machine learning



Machine learning, as a science, can be classified into 3 main categories depending on the nature of the learning:



  1. supervised learning (learning with a teacher);
  2. unsupervised learning (learning without a teacher);
  3. reinforcement learning.


In some scientific works, learning is divided by its nature into 4 categories, the fourth being semi-supervised (partial) learning, but this is simply a hybrid of supervised and unsupervised learning.





3.1. Supervised learning



In supervised learning, the algorithm is trained on examples for which the correct answers are already known: every input object comes paired with the desired output. Acting as a "student", the algorithm learns, under the control of this "teacher", a function that maps inputs to outputs, and after training that function is used to predict answers for objects it has never seen. The practical difficulty is that the labeled examples usually have to be prepared by people, which makes good training data expensive to obtain.



To apply supervised learning to a problem, the following steps must be performed:



  1. Determine the type of training examples. First of all, you need to decide what data should be used as a training set.
  2. Data collection. The dataset must be representative of the actual use of the function. Thus, a set of input features and associated outputs are collected.
  3. Determination of the input representation of the object of the studied function. The accuracy of the function being studied is highly dependent on how the input object is represented. Typically, the input object is converted to a vector of objects that contains a series of objects that describe the object. The number of functions should not be too large, due to the "curse of dimension", but should contain enough information to accurately predict the result.
  4. Determine the structure of the learned function and the corresponding learning algorithm (for example, a support vector machine or a decision tree).
  5. Complete the design and run the learning algorithm on the collected training set. Some algorithms require control parameters to be tuned; these are usually adjusted by optimizing performance on a validation subset of the training data (or by cross-validation).
  6. Evaluate the accuracy of the learned function. After training and parameter tuning, performance must be measured on a test set that is separate from the training set.


Algorithms are trained using preprocessed examples, and at this stage the performance of the algorithms is evaluated using test data. Sometimes patterns identified in a subset of data cannot be found in a larger population of data. If the model is only suitable for representing patterns that exist in a subset of training, a problem called β€œOverfitting” is created.



Overfitting means the model is fine-tuned for the training dataset, but cannot be applied to large datasets of unknown data. To protect against overfitting, testing should be done against unexpected or unknown data. Using unexpected data for a test suite can help you gauge the accuracy of a model when predicting results. Supervised learning models have broad applicability to a variety of business problems, including fraud detection, recommendation, speech recognition, or risk analysis.



The most widely used and popular supervised learning algorithms are:



  • support vector machine;
  • linear regression;
  • logistic regression;
  • naive Bayesian classifier;
  • decision tree training;
  • k-nearest neighbors method;
  • artificial neural network;
  • similarity learning.


Each of the above algorithms takes a different mathematical and statistical approach and uses different formulas. But a common pattern can be stated, since all of them are supervised learning algorithms:

A training set of n pairs (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) is given, where x_i is the feature description of the i-th object and y_i is the known answer for it. The features in x_i may be numeric, binary or categorical, and the answer y_i may be a class label such as "spam" or "not spam".



The goal is to predict the answers for m new objects (x_(n+1), x_(n+2), ..., x_(n+m)). In other words, the algorithm must generalize from the labeled examples (for instance, from emails already marked "spam" or "not spam") to objects it has never seen.
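This setting of labeled pairs (x_i, y_i) can be illustrated with a minimal k-nearest-neighbors classifier, one of the algorithms listed above. The sketch below is plain Java over toy numeric data, not production code; the features and labels are invented for the example:

```java
import java.util.Arrays;

/** Minimal k-nearest-neighbors classifier: it predicts the label of a new
 *  point by majority vote among the labels of the k closest training
 *  pairs (x_i, y_i). */
public class KnnClassifier {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Predict a label for query x, given training features X and labels y. */
    public static int predict(double[][] X, int[] y, double[] x, int k) {
        // sort training indices by distance to the query point
        Integer[] idx = new Integer[X.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (i, j) -> Double.compare(distance(X[i], x), distance(X[j], x)));

        // majority vote among the k nearest labels
        java.util.Map<Integer, Integer> votes = new java.util.HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(y[idx[i]], 1, Integer::sum);
        return votes.entrySet().stream()
                .max(java.util.Map.Entry.comparingByValue())
                .get().getKey();
    }

    public static void main(String[] args) {
        double[][] X = {{0, 0}, {0, 1}, {5, 5}, {6, 5}};  // training features
        int[] y = {0, 0, 1, 1};                            // known answers
        System.out.println(predict(X, y, new double[]{0.2, 0.2}, 3)); // near class 0
    }
}
```

Note that k-NN stores the whole training set and does all its work at prediction time; most of the other algorithms in the list instead compress the training pairs into a fitted model.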





3.3. Reinforcement learning



Reinforcement learning is learning through interaction. The learning system (an agent) is not told which actions are correct; instead, it tries actions in an environment, observes the resulting state, and receives a numerical reward or penalty as feedback. The agent's goal is to find a behavior strategy (a policy) that maximizes the total reward accumulated over time.



This goal creates a fundamental tension. To collect a lot of reward, the agent must prefer actions it has already tried and found effective, that is, it must exploit its current knowledge. But to discover which actions are actually the best, it has to try actions it has not selected before, that is, it must explore. An agent that only exploits risks settling on a mediocre strategy; an agent that only explores never profits from what it has learned.

The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The algorithm must try a variety of actions and gradually favor those that appear best. In a stochastic problem, each action has to be tried many times to obtain a reliable estimate of its reward. The exploration-exploitation dilemma has been studied intensively by mathematicians for decades and remains unresolved.



Mistakes help the agent learn because they carry a measure of discipline (cost, wasted time, regret, pain, and so on), teaching it that a certain course of action is less promising than others. An interesting example of reinforcement learning occurs when computers learn to play video games by themselves, without human intervention.
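The exploration-exploitation trade-off can be made concrete with the classic multi-armed bandit problem. The sketch below, in plain Java with arbitrarily chosen reward probabilities and a fixed random seed, implements the epsilon-greedy strategy: with a small probability the agent explores a random action, otherwise it exploits the action with the best estimated reward so far:

```java
import java.util.Random;

/** Epsilon-greedy strategy for a multi-armed bandit: with probability
 *  epsilon pick a random arm (explore), otherwise pick the arm with the
 *  highest estimated reward so far (exploit). */
public class EpsilonGreedy {
    final double[] estimates;   // running mean reward per arm
    final int[] pulls;          // how often each arm was tried
    final double epsilon;
    final Random rng;

    EpsilonGreedy(int arms, double epsilon, long seed) {
        this.estimates = new double[arms];
        this.pulls = new int[arms];
        this.epsilon = epsilon;
        this.rng = new Random(seed);
    }

    int chooseArm() {
        if (rng.nextDouble() < epsilon) return rng.nextInt(estimates.length); // explore
        int best = 0;
        for (int a = 1; a < estimates.length; a++)
            if (estimates[a] > estimates[best]) best = a;
        return best; // exploit
    }

    void update(int arm, double reward) {
        pulls[arm]++;
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]; // incremental mean
    }

    public static void main(String[] args) {
        double[] trueMeans = {0.2, 0.5, 0.8};   // hidden reward probabilities
        EpsilonGreedy agent = new EpsilonGreedy(3, 0.1, 42);
        for (int t = 0; t < 10000; t++) {
            int arm = agent.chooseArm();
            double reward = agent.rng.nextDouble() < trueMeans[arm] ? 1.0 : 0.0;
            agent.update(arm, reward);
        }
        // after many trials the truly best arm (index 2) should dominate
        System.out.println(agent.pulls[0] + " " + agent.pulls[1] + " " + agent.pulls[2]);
    }
}
```

Even this tiny agent shows the dilemma: set epsilon to 0 and it may lock onto a mediocre arm forever; set it near 1 and it never benefits from what it has learned.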



Machine learning can also be classified based on the desired results:



  1. classification;
  2. clustering;
  3. regression.


Regression algorithms are commonly used for statistical analysis. Regression helps you model relationships between data points. Regression algorithms can quantify the strength of the correlation between variables in a dataset, and regression analysis can be useful for predicting future data values based on historical values. It is important to remember, however, that regression captures correlation, not cause and effect: without understanding the context around the data, regression analysis can lead to inaccurate predictions. Regression types:



  • linear regression;
  • nonlinear regression;
  • support vector regression;
  • logistic regression.
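The simplest of these, one-variable linear regression, fits y ≈ a + b·x by ordinary least squares, for which the slope and intercept have closed-form solutions. A minimal sketch in plain Java, with invented toy data:

```java
/** Ordinary least squares for simple (one-variable) linear regression:
 *  fits y = a + b*x by minimizing the sum of squared errors.
 *  Closed form: b = cov(x, y) / var(x), a = mean(y) - b * mean(x). */
public class SimpleLinearRegression {

    /** Returns {intercept a, slope b}. */
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double xMean = 0, yMean = 0;
        for (int i = 0; i < n; i++) { xMean += x[i]; yMean += y[i]; }
        xMean /= n; yMean /= n;

        double cov = 0, var = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - xMean) * (y[i] - yMean);
            var += (x[i] - xMean) * (x[i] - xMean);
        }
        double slope = cov / var;
        double intercept = yMean - slope * xMean;
        return new double[]{intercept, slope};
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};          // exactly y = 1 + 2x
        double[] ab = fit(x, y);
        System.out.printf("y = %.2f + %.2f*x%n", ab[0], ab[1]);
    }
}
```

Fitting perfectly linear data recovers the generating coefficients exactly; with noisy data the same formulas give the least-squares compromise line.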


Clustering is a fairly simple technique to understand. Objects with similar parameters are grouped together (in a cluster). All objects in a cluster are more similar to each other than objects in other clusters. Clustering is a type of unsupervised learning because the algorithm itself determines the general characteristics of the elements in the data. The algorithm interprets the parameters that make up each element and then groups them accordingly.



Clustering categories:



  • k-means method;
  • density-based spatial clustering of applications with noise (DBSCAN);
  • clustering algorithm OPTICS;
  • method of principal components.
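Of the methods above, k-means is the easiest to sketch. The toy implementation below uses plain Java, fixed initial centroids for reproducibility instead of random seeding, and a fixed iteration count instead of a convergence test; it shows the two alternating steps of the algorithm:

```java
import java.util.Arrays;

/** A minimal k-means sketch: assign each point to its nearest centroid,
 *  move each centroid to the mean of its assigned points, and repeat. */
public class KMeans {

    public static int[] cluster(double[][] points, double[][] centroids, int iterations) {
        int k = centroids.length;
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // assignment step: nearest centroid for every point
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist2(points[p], centroids[c]) < dist2(points[p], centroids[best]))
                        best = c;
                assignment[p] = best;
            }
            // update step: each centroid becomes the mean of its points
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < points[p].length; d++)
                    sums[assignment[p]][d] += points[p][d];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int d = 0; d < sums[c].length; d++)
                        centroids[c][d] = sums[c][d] / counts[c];
        }
        return assignment;
    }

    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centroids = {{0, 0}, {10, 10}};   // deliberately simple seeds
        System.out.println(Arrays.toString(cluster(points, centroids, 10)));
        // → [0, 0, 1, 1]
    }
}
```

In practice, the number of clusters k and the initial centroids strongly affect the result, which is why real implementations use seeding heuristics and multiple restarts.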


But it is important to note that in clustering, as in unsupervised learning generally, the algorithm itself looks for connections in the input data. The beauty of machine learning is finding hidden connections in data, better known as latent connections. To search for latent relationships in clustering, a latent variable model is used, which studies the relationships between the values of variables. Latent variable models include:



  • EM algorithm;
  • method of moments;
  • blind signal separation;
  • method of principal components;
  • analysis of independent components;
  • non-negative matrix factorization;
  • singular value decomposition.


Classification is the process of predicting the class of given data points. Classes are sometimes referred to as labels or categories. Classification predictive modeling is the problem of approximating a mapping function f from input variables X to discrete output variables y. Classification belongs to the supervised learning category. Types of classification schemes:



  • thesaurus;
  • taxonomy;
  • data model;
  • transport network;
  • ontology.


But in machine learning, classification is usually categorized by algorithm, with each algorithm relating in one way or another to the schemes above. The most widely used classification algorithms are:



  • support vector machine;
  • logistic regression;
  • naive Bayesian classifier;
  • k-nearest neighbors method;
  • artificial neural network;
  • decision tree learning.


4. Natural language processing



Machine learning algorithms cannot consume raw text directly: they operate on numbers. Before a collection of documents can be clustered or classified, each document has to be transformed into a structured, numeric representation, and that transformation is the subject of this section.



Natural language processing (NLP) is a field at the intersection of computer science, artificial intelligence and linguistics. It studies the problems of computer analysis and representation of natural (human) language, from low-level tasks such as splitting text into words to high-level ones such as extracting meaning. In this project NLP plays a supporting but indispensable role: it turns unstructured text into clean material for the clustering and classification algorithms.



The preprocessing pipeline used here consists of the following steps:



  • Tokenization: splitting raw text into individual tokens (words, numbers, punctuation marks). Tokens are the atomic units that every later stage of processing works with.
  • Lemmatization: reducing each word form to its dictionary form (lemma). Why is this needed? Different grammatical forms of one word carry essentially the same meaning, so collapsing them makes word statistics far more reliable. For example, "went" and "goes" are both reduced to the lemma "go".
  • Stop listing: removing very frequent words that carry little meaning on their own. Such stop words typically include:

    • articles, prepositions, conjunctions, pronouns, particles and other function words;
    • words that occur in almost every document of the corpus and therefore do not help to distinguish documents.



After these steps, every document is reduced to a cleaned sequence of lemmas. The remaining problem is to turn that sequence into numbers. The simplest numeric description of a document is its word-frequency profile: how many times each vocabulary word occurs in it. On its own, a raw frequency is a crude measure, because very common words dominate it, so in practice frequencies are reweighted by more informative schemes.



In machine learning, a text document can belong to many categories in classification or overlap with many clusters in clustering. The most commonly used feature extraction algorithms are:



  • Term Frequency - Inverse Document Frequency (TF-IDF) is commonly used to weight each word in a text document according to its uniqueness. The word (token) weight is often used for information retrieval and semantic analysis of text. This weight is a statistical measure of how important a word is to a document in a collection or corpus. In other words, the TF-IDF approach reflects how relevant a word is to a specific text document and category.
  • Word2Vec is a tool (a set of algorithms) for computing vector representations of words, implementing two main architectures: Continuous Bag of Words (CBOW) and Skip-gram. A text document or word is passed as input, and the output is a set of vectors (coordinates in a vector space).
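The TF-IDF weighting mentioned above can be sketched in a few lines of Java. This uses the textbook formulas (tf as the relative frequency of a term in a document, idf as ln(N / number of documents containing the term)); real systems usually apply smoothed variants, and the toy corpus below is invented for the example:

```java
import java.util.List;

/** Textbook TF-IDF: tf(t, d) = count of t in d / length of d,
 *  idf(t) = ln(N / df(t)), weight = tf * idf. Assumes every queried
 *  term occurs in at least one document of the corpus. */
public class TfIdf {

    public static double tf(List<String> doc, String term) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    public static double idf(List<List<String>> corpus, String term) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / containing);
    }

    public static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        return tf(doc, term) * idf(corpus, term);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
            List.of("java", "machine", "learning"),
            List.of("java", "web", "vaadin"),
            List.of("text", "clustering", "java"));
        // "java" appears in every document, so its idf (and tf-idf) is zero;
        // "clustering" is rare, so it gets a positive weight.
        System.out.println(tfIdf(corpus.get(2), corpus, "java"));       // 0.0
        System.out.println(tfIdf(corpus.get(2), corpus, "clustering")); // > 0
    }
}
```

This is exactly the behavior the weighting is meant to have: words shared by the whole corpus are discounted, while distinctive words are promoted.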


Afterword



In fact, machine learning theory is vast. Here I have presented it in more abstract and simpler terms. If you have corrections to the text or to the theory, please write them. The purpose of this article, again, is to prepare readers for the practical problem and its solution in the following parts.



Leave a comment if you are waiting for more.


