Clustering and Classification of Big Text Data with Machine Learning in Java. Article #2: Algorithms




Hello, Habr! Today we continue the topic of clustering and classification of big text data using machine learning in Java. This article is a continuation of the first article.





The article covers the theory and the implementation of the algorithms that I used.





1. Tokenization



Theory:



Tokenization is the process of splitting a stream of text into tokens: words, phrases, numbers and other meaningful elements. The list of tokens becomes the input for the further processing steps. On the surface the task looks trivial, since in most European languages words are separated by spaces, but punctuation, hyphenation, abbreviations, apostrophes and compound expressions make the boundaries of a token ambiguous.

Another practical difficulty is the quality of the source text itself. Documents extracted from PDF files often contain broken line wraps, words hyphenated across lines and stray characters, so the tokenizer has to tolerate such noise.

In this project tokenization is implemented as a simple iterator over the words of a document: the text is read line by line, and a regular expression matching non-whitespace sequences extracts the next token on demand, as shown in the code below.





The code:



Iterator<String> finalIterator = new WordIterator(reader);


// Iterates over the words of a BufferedReader one token at a time.
// The regex, the "next" field and the Iterator methods were missing from the
// original fragment and are filled in here (assumed) so that the class compiles;
// requires java.io.*, java.util.Iterator and java.util.regex.{Pattern, Matcher}.
public class WordIterator implements Iterator<String> {
    private static final Pattern notWhiteSpace = Pattern.compile("\\S+");
    private final BufferedReader br;
    private Matcher matcher;
    private String next;
    private String curLine;

    public WordIterator(BufferedReader br) {
        this.br = br;
        curLine = null;
        advance();
    }

    // Reads ahead to the next token, pulling in new lines as needed.
    private void advance() {
        try {
            while (true) {
                if (curLine == null || !matcher.find()) {
                    String line = br.readLine();
                    if (line == null) {        // end of input
                        next = null;
                        br.close();
                        return;
                    }
                    matcher = notWhiteSpace.matcher(line);
                    curLine = line;
                    if (!matcher.find())
                        continue;
                }
                next = curLine.substring(matcher.start(), matcher.end());
                break;
            }
        } catch (IOException ioe) {
            throw new IOError(ioe);
        }
    }

    public boolean hasNext() {
        return next != null;
    }

    public String next() {
        String token = next;
        advance();
        return token;
    }
}


2. Stop words



Theory:



Stop words are words that carry little or no meaning of their own and are filtered out before or after the main processing of the text. The notion of a stop list dates back to 1958. Typical stop words are function words: articles, prepositions, conjunctions, pronouns, particles, auxiliary verbs and other very frequent words such as "the", "a", "and", "of". Removing them shrinks the index and the dimensionality of the feature space while losing little information that is useful for clustering or classification.

At the same time there is no single universal stop list: which words are uninformative depends on the language, the domain and the task, so a general-purpose list is usually combined with a corpus-specific one.

The main approaches to building a stop list are:



  • Classic, manually compiled general-purpose stop lists.
  • Corpus- or domain-specific lists that extend a general list with terms that are uninformative for the particular collection.
  • Methods based on Zipf's law: removing the most frequent terms (TF-High), terms that occur only once in the collection (TF1), and terms with a low inverse document frequency (IDF).
  • Mutual information (MI): the amount of information that the presence or absence of a term gives about the class of a document; terms with low mutual information are considered uninformative and become stop-word candidates.


Term-based random sampling (TBRS): a method in which stop words are detected automatically from the documents themselves. It iterates over randomly selected chunks of the data and ranks the terms in each chunk by how informative they are, using the Kullback-Leibler divergence measure as shown in the following equation:



d_x(t) = P_x(t) * log2( P_x(t) / P(t) )



where P_x(t) is the normalized frequency of the term t within the chunk x, and P(t) is the normalized frequency of the term t in the entire collection.

The final stop list is built by taking the least informative terms across all documents and removing duplicates.
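
For illustration, here is a minimal sketch of this ranking step in Java. The class and method names (KlTermRanker, klRank) are made up for this example and are not part of the project code; a chunk is represented simply as a list of tokens, and the collection-wide frequencies P(t) are assumed to be precomputed:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KlTermRanker {

    // Ranks every term of one randomly selected chunk by
    // d_x(t) = P_x(t) * log2(P_x(t) / P(t)).
    // The lowest-scoring (least informative) terms are stop-word candidates.
    public static Map<String, Double> klRank(List<String> chunk,
                                             Map<String, Double> collectionFreq) {
        Map<String, Double> chunkFreq = new HashMap<>();          // P_x(t)
        for (String t : chunk)
            chunkFreq.merge(t, 1.0 / chunk.size(), Double::sum);

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> e : chunkFreq.entrySet()) {
            double px = e.getValue();
            double p = collectionFreq.getOrDefault(e.getKey(), 1e-12); // P(t) over the whole collection
            scores.put(e.getKey(), px * (Math.log(px / p) / Math.log(2)));
        }
        return scores;
    }
}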





The code:

TokenFilter filter = new TokenFilter().loadFromResource("stopwords.txt");
if (!filter.accept(token)) continue;


private Set<String> tokens;
private boolean excludeTokens;
private TokenFilter parent;

public TokenFilter loadFromResource(String fileName) {
		try {
			ClassLoader classLoader = getClass().getClassLoader();
			String str = IOUtils.toString(
					classLoader.getResourceAsStream(fileName),
					Charset.defaultCharset());
			InputStream is = new ByteArrayInputStream(str.getBytes());
			BufferedReader br = new BufferedReader(new InputStreamReader(is));

			Set<String> words = new HashSet<String>();
			for (String line = null; (line = br.readLine()) != null;)
				words.add(line);
			br.close();

			this.tokens = words;
			this.excludeTokens = true;
			this.parent = null;
		} catch (Exception e) {
			throw new IOError(e);
		}
		return this;
	}
public boolean accept(String token) {
		token = token.toLowerCase().replaceAll("[\\. \\d]", "");
		// keep the token only if it passes the parent filter, is not in the
		// stop list (excludeTokens == true), is longer than two characters
		// and consists of letters; the original pattern was lost, "[a-z-]+"
		// is assumed here
		return (parent == null || parent.accept(token))
				&& (tokens.contains(token) ^ excludeTokens)
				&& token.length() > 2
				&& token.matches("^[a-z-]+$");
	}


File:

The stop-word list (stopwords.txt) is a plain text file with one word per line.
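
Purely as an illustration of the format (the original listing is not reproduced above), such a file is just one lowercase word per line, for example:

the
a
an
and
or
of
in
to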


3. Lemmatization



Theory:



Different grammatical forms of the same word (number, tense, case and so on) usually carry the same meaning for clustering and classification, so it makes sense to reduce them to one normalized form before building the term-document matrix.

Lemmatization is the process of reducing a word form to its lemma, its normal (dictionary) form, using a vocabulary and morphological analysis of the word. For example, working, works and work are all reduced to the lemma work, and computers, computing and computer are brought to a common base form, compute. Unlike simple stemming, which cuts off endings with heuristic rules and can produce non-words, lemmatization takes the part of speech of the word into account and returns a valid dictionary form.



Over the years, a number of tools have been developed that provide lemmatization functionality. Despite the different processing methods used, they all use a lexicon of words, a set of rules, or a combination of these as resources for morphological analysis. The most famous lemmatization tools are:



  • WordNet is a large lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms. Its morphological processor maps an inflected form to the base form stored in the dictionary, so it can be used as a lemmatizer.
  • CLEAR (ClearNLP) provides a morphological analyzer that combines a WordNet-based dictionary with inflection rules; it is designed for fast, large-scale NLP pipelines.
  • The GENIA tagger is a POS tagger trained on biomedical text (the GENIA and PennBioIE corpora) that also outputs the base form of every token, which makes it convenient for lemmatizing domain-specific documents.
  • TreeTagger annotates text with part-of-speech and lemma information; thanks to its language-independent design it has been trained for many languages, although its lemma output depends heavily on the quality of the POS tags.
  • Norm and LuiNorm are normalization tools from the UMLS SPECIALIST lexical tools; they reduce case, inflection and spelling variants to a single normalized form and are aimed primarily at medical text.
  • MorphAdorner is a Java library that "adorns" a text with lemmas, standard spellings and POS tags.
  • morpha is a morphological analyzer for English built from roughly 1,400 rules and an exception list of five to six thousand entries derived in part from WordNet; it is widely used as a lightweight lemmatizer.


The code:



Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String token = documentTokens.next().replaceAll("[^a-zA-Z]", "").toLowerCase();
Annotation lemmaText = new Annotation(token);
pipeline.annotate(lemmaText);
List<CoreLabel> lemmaToken = lemmaText.get(TokensAnnotation.class);
String word = "";
for (CoreLabel t : lemmaToken) {
    word = t.get(LemmaAnnotation.class);  // the lemma of the token (the last one, if several labels are returned)
}
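
For example, feeding the tokens "computers" and "studies" through this pipeline yields the lemmas "computer" and "study".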


4. TF-IDF



Theory:



Term Frequency - Inverse Document Frequency (TF-IDF) is the most widely used algorithm for computing the weight of a term (a keyword in a document) in modern information retrieval systems. This weight is a statistical measure of how important a word is to a document within a collection or corpus. The value grows in proportion to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.

...

Term frequency (TF) measures how often a term occurs in a document. Since documents differ in length, a raw count on its own favours long documents, which is why several weighting variants are used in practice. In the simplest case the term frequency of a term t in a document D is just its raw count:



tf(t,D) = f(t,D)



where f(t,D) is the number of occurrences of the term t in the document D.

Other variants include:

boolean "frequency": tf(t,D) = 1 if t occurs in D and 0 otherwise;

frequency normalized by the length of the document:



tf(t,D) = f(t,D) / Σ_{t'∈D} f(t',D)



logarithmically scaled frequency:



tf(t,D) = log(1 + f(t,D))



and augmented frequency, where the raw count is divided by the count of the most frequent term in the document to prevent a bias towards longer documents:



tf(t,D) = 0.5 + 0.5 * f(t,D) / max{f(t',D) : t'∈D}



Inverse document frequency (IDF) measures how much information a term carries, that is, how rare it is across the whole collection. Terms that occur in almost every document receive a weight close to zero, while rare terms receive a high weight. IDF is the logarithm of the total number of documents N divided by the number of documents that contain the term:



idf(t,D) = log( N / |{d ∈ D : t ∈ d}| )



The TF-IDF weight is the product of the two values. A term gets a high weight when it occurs frequently in a given document and rarely in the rest of the collection, so the weighting naturally suppresses common, uninformative terms:



tfidf(t,D)=tf(t,D)*idf(t,D)
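
As a quick worked example: suppose a term occurs 3 times in a document of 100 terms and appears in 10 of the 1,000 documents in the collection. With the length-normalized TF variant and a base-10 logarithm, tf = 3/100 = 0.03 and idf = log(1000/10) = 2, so tf-idf = 0.03 * 2 = 0.06.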





The code:

private final TObjectIntMap<T> counts;
private int sum;   // total number of counted objects (assumed here; not shown in the original fragment)

// Increments the counter for obj and returns its new count.
public int count(T obj) {
    int count = counts.get(obj);
    count++;
    counts.put(obj, count);
    sum++;
    return count;
}


// Appends one column (one document) to the sparse matrix file:
// the number of non-zero entries, then a (row index, value) pair for each of them.
public synchronized int addColumn(SparseArray<? extends Number> column) {
     if (column.length() > numRows)
         numRows = column.length();
    
     int[] nonZero = column.getElementIndices();
     nonZeroValues += nonZero.length;
     try {
         matrixDos.writeInt(nonZero.length);
         for (int i : nonZero) {
             matrixDos.writeInt(i); // write the row index
             matrixDos.writeFloat(column.get(i).floatValue());
         }
     } catch (IOException ioe) {
         throw new IOError(ioe);
     }
     return ++curCol;
}


public interface SparseArray<T> {
    int cardinality();
    T get(int index);
    int[] getElementIndices();
    int length();
    void set(int index, T obj);
    <E> E[] toArray(E[] array);
}


// Applies a global transform (here the TF-IDF weighting) to every cell of a dense binary matrix file.
public File transform(File inputFile, File outFile, GlobalTransform transform) {
     try {
         DataInputStream dis = new DataInputStream(
             new BufferedInputStream(new FileInputStream(inputFile)));
         int rows = dis.readInt();
         int cols = dis.readInt();
         DataOutputStream dos = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream(outFile)));
         dos.writeInt(rows);
         dos.writeInt(cols);
         for (int row = 0; row < rows; ++row) {
             for (int col = 0; col < cols; ++col) {
                 double val = dis.readFloat();
                 dos.writeFloat((float) transform.transform(row, col, val));
             }
         }
         dos.close();
         return outFile;
     } catch (IOException ioe) {
         throw new IOError(ioe);
     }
}

public double transform(int row, int column, double value) {
        double tf = value / docTermCount[column];
        double idf = Math.log(totalDocCount / (termDocCount[row] + 1));
        return tf * idf;
}
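
In terms of the formulas above: value is the raw count f(t,D) of term row in document column, docTermCount[column] is the total number of terms in that document (so tf here is the length-normalized variant), termDocCount[row] is the number of documents that contain the term, and the +1 is a small smoothing step that avoids division by zero for terms that appear in no document.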


// Runs the external SVDLIBC "svd" binary on the matrix file and reads the
// factor matrices U, S and Vt back into memory.
public void factorize(MatrixFile mFile, int dimensions) {
        try {
            String formatString = "";
            switch (mFile.getFormat()) {
            case SVDLIBC_DENSE_BINARY:
                formatString = " -r db ";
                break;
            case SVDLIBC_DENSE_TEXT:
                formatString = " -r dt ";
                break;
            case SVDLIBC_SPARSE_BINARY:
                formatString = " -r sb ";
                break;
            case SVDLIBC_SPARSE_TEXT:
                break;
            default:
                throw new UnsupportedOperationException(
                    "Format type is not accepted");
            }

            File outputMatrixFile = File.createTempFile("svdlibc", ".dat");
            outputMatrixFile.deleteOnExit();
            String outputMatrixPrefix = outputMatrixFile.getAbsolutePath();

            LOG.fine("creating SVDLIBC factor matrices at: " + 
                              outputMatrixPrefix);
            String commandLine = "svd -o " + outputMatrixPrefix + formatString +
                " -w dt " + 
                " -d " + dimensions + " " + mFile.getFile().getAbsolutePath();
            LOG.fine(commandLine);
            Process svdlibc = Runtime.getRuntime().exec(commandLine);
            BufferedReader stdout = new BufferedReader(
                new InputStreamReader(svdlibc.getInputStream()));
            BufferedReader stderr = new BufferedReader(
                new InputStreamReader(svdlibc.getErrorStream()));

            StringBuilder output = new StringBuilder("SVDLIBC output:\n");
            for (String line = null; (line = stderr.readLine()) != null; ) {
                output.append(line).append("\n");
            }
            LOG.fine(output.toString());
            
            int exitStatus = svdlibc.waitFor();
            LOG.fine("svdlibc exit status: " + exitStatus);

            if (exitStatus == 0) {
                File Ut = new File(outputMatrixPrefix + "-Ut");
                File S  = new File(outputMatrixPrefix + "-S");
                File Vt = new File(outputMatrixPrefix + "-Vt");
                U = MatrixIO.readMatrix(
                        Ut, Format.SVDLIBC_DENSE_TEXT, 
                        Type.DENSE_IN_MEMORY, true); // matrix U (read from the transposed Ut file)
                scaledDataClasses = false; 
                
                V = MatrixIO.readMatrix(
                        Vt, Format.SVDLIBC_DENSE_TEXT,
                        Type.DENSE_IN_MEMORY); // matrix V
                scaledClassFeatures = false;


                singularValues =  readSVDLIBCsingularVector(S, dimensions);
            } else {
                // stderr was already fully read above, so report the captured output
                LOG.warning("svdlibc exited with error status.  " + 
                               "stderr:\n" + output.toString());
            }
        } catch (IOException ioe) {
            LOG.log(Level.SEVERE, "SVDLIBC", ioe);
        } catch (InterruptedException ie) {
            LOG.log(Level.SEVERE, "SVDLIBC", ie);
        }
    }

    public MatrixBuilder getBuilder() {
        return new SvdlibcSparseBinaryMatrixBuilder();
    }

    private static double[] readSVDLIBCsingularVector(File sigmaMatrixFile,
                                                      int dimensions)
            throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(sigmaMatrixFile));
        double[] m = new double[dimensions];

        int readDimensions = Integer.parseInt(br.readLine());
        if (readDimensions != dimensions)
            throw new RuntimeException(
                    "SVDLIBC generated the incorrect number of " +
                    "dimensions: " + readDimensions + " versus " + dimensions);

        int i = 0;
        for (String line = null; (line = br.readLine()) != null; )
            m[i++] = Double.parseDouble(line);
        return m;
    }


The SVD is computed in Java using the S-Space package.



5. Aylien API



Aylien Text Analysis API is a package of APIs for analyzing and extracting information from textual content.

The Aylien API offers, among other things, article extraction, summarization, entity extraction, sentiment analysis and classification; in this work only the classification endpoint is used.

The classification endpoint supports several taxonomies, including the IPTC subject codes and the IAB-QAG contextual taxonomy; the latter is used here.



The IAB-QAG contextual taxonomy was developed by the IAB (Interactive Advertising Bureau) together with taxonomy experts from academia to define content categories on at least two levels, making content classification much more consistent. The first level is a broad category, and the second level is a more detailed subcategory of that root category (Figure 6).

To use this API, you need to get the key and ID on the official website. Then, using this data, you can use Java code to call the POST and GET methods.



private static TextAPIClient client = new TextAPIClient(" ", " "); // application ID and key obtained from the official site


You can then use the classification by passing in the data to be classified.



ClassifyByTaxonomyParams.Builder builder = ClassifyByTaxonomyParams.newBuilder();
URL url = new URL("http://techcrunch.com/2015/07/16/microsoft-will-never-give-up-on-mobile");
builder.setUrl(url);
builder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(builder.build());
for (TaxonomyCategory c: response.getCategories()) {
  System.out.println(c);
}


The response from the service is returned in JSON format:



{
  "categories": [
    {
      "confident": true,
      "id": "IAB19-36",
      "label": "Windows",
      "links": [
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19-36",
          "rel": "self"
        },
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19",
          "rel": "parent"
        }
      ],
      "score": 0.5675236066291172
    },
    {
      "confident": true,
      "id": "IAB19",
      "label": "Technology & Computing",
      "links": [
        {
          "link": "https://api.aylien.com/api/v1/classify/taxonomy/iab-qag/IAB19",
          "rel": "self"
        }
      ],
      "score": 0.46704140928338533
    }
  ],
  "language": "en",
  "taxonomy": "iab-qag",
  "text": "When Microsoft announced its wrenching..."
}


This API is used to classify the clusters obtained with the unsupervised clustering method.



Afterword



For the algorithms described above there are alternatives and ready-made libraries; you just have to look for them. If you liked the article, or if you have ideas or questions, please leave a comment. The third part will be more abstract, mainly about the system architecture: a description of the algorithm, what was used and in what order.



It will also include the results after applying each algorithm, as well as the final result of this work.



