CPython library "VKF" for machine learning

In a previous post, the author described a web server for conducting experiments with the VKF method of machine learning based on lattice theory. As an alternative to the web server, this article shows how to use the CPython library directly. We will reproduce working sessions of experiments with the Mushroom and Wine Quality arrays from the UCI repository of data for testing machine learning algorithms. Then the formats of the input data will be explained.







Introduction



This article describes a new version of a machine learning program based on lattice theory. The main advantage of this version is an interface for Python programmers to efficient algorithms programmed in C++.



Today, machine learning procedures (such as convolutional neural networks, random forests, and support vector machines) have reached a very high level, surpassing humans in speech, video, and image recognition. However, they cannot provide arguments proving the correctness of their conclusions.



On the other hand, symbolic approaches to machine learning (inductive logic programming, cover learning using integer programming) have provably very high computational complexity and are practically inapplicable even to samples of relatively small size.



The approach described here uses probabilistic algorithms to avoid these problems. The VKF method of machine learning uses techniques of modern algebraic lattice theory (Formal Concept Analysis) and of probability theory (especially Markov chains). But now no knowledge of advanced mathematics is needed to use the VKF system and create programs based on it. The author wrote a library (vkf.cp38-win32.pyd under Windows or vkf.cpython-38-x86_64-linux-gnu.so under Linux) that provides access to the programs through an interface that Python programmers understand.



The described approach has a parent: the JSM method of automatic hypothesis generation, invented in the early 1980s by Prof. V.K. Finn, D.Sc. It has since evolved, in the words of its creator, into a method of automated support of scientific research. It allows the use of the logic of argumentation, checking the stability of the found hypotheses when the training sample is extended, and combining the results of predictions obtained with different strategies of JSM reasoning.



Unfortunately, the author of this note and his colleagues discovered and investigated some theoretical shortcomings of the JSM method:



  1. In the worst case, the number of generated similarities can be exponentially large compared to the size of the input data (the training sample).
  2. The problem of detecting large similarities is computationally (NP-)hard.
  3. Overfitting is unavoidable and occurs in practice.
  4. There exist "phantom" similarities between training examples, each of which has its own counter-example.
  5. The probability of encountering a stable similarity tends to zero as the number of sample extensions grows.


The author's research is summarized in Chapter 2 of his dissertation. The last shortcoming was discovered by the author only recently but, in his opinion, it puts an end to the expanding-sample approach.



Finally, the JSM community does not offer access to the source code of its systems. Moreover, the programming languages used (Forth and C#) prevent their use by the general public. The only free C++ version of a JSM solver known to the author (by T.A. Volkova, robofreak) contains an annoying error that sometimes leads to abnormal termination of computations.



Initially, the author planned to share the encoder developed for the VKF-method system with the JSM community, so he put all the algorithms applicable to both JSM and VKF systems into a separate library (vkfencoder.cp38-win32.pyd under Windows or vkfencoder.cpython-38-x86_64-linux-gnu.so under Linux). Unfortunately, these algorithms turned out to be unclaimed by the JSM community. The vkf library implements the same algorithms (for example, in the vkf.FCA class), but fills the tables directly through the web interface rather than from files. Here we will use the vkfencoder library.



1 Procedure for working with the library for discrete features





Let's assume that the reader has the MariaDB server and MariaDB Connector/C installed (by default we use IP address 127.0.0.1:3306 and user 'root' with password 'toor'). We start by installing the vkfencoder and vkf libraries into the 'demo' virtual environment and by creating an empty MariaDB database named 'mushroom'.



krrguest@amd2700vii:~/src$ python3 -m venv demo 
krrguest@amd2700vii:~/src$ source demo/bin/activate 
(demo) krrguest@amd2700vii:~/src$ cd vkfencoder
(demo) krrguest@amd2700vii:~/src/vkfencoder$ python3 ./setup.py build
(demo) krrguest@amd2700vii:~/src/vkfencoder$ python3 ./setup.py install
(demo) krrguest@amd2700vii:~/src/vkfencoder$ cd ../vkf
(demo) krrguest@amd2700vii:~/src/vkf$ python3 ./setup.py build
(demo) krrguest@amd2700vii:~/src/vkf$ python3 ./setup.py install
(demo) krrguest@amd2700vii:~/src/vkf$ cd ../demo
(demo) krrguest@amd2700vii:~/src/demo$ mysql -u root -p
MariaDB [(none)]> CREATE DATABASE IF NOT EXISTS mushroom;
MariaDB [(none)]> exit;


As a result of the work, the database 'mushroom' will appear, the file vkfencoder.cpython-38-x86_64-linux-gnu.so will appear in the folder ~/src/demo/lib/python3.8/site-packages/vkfencoder-1.0.3-py3.8-linux-x86_64.egg/, and the file vkf.cpython-38-x86_64-linux-gnu.so will appear in the folder ~/src/demo/lib/python3.8/site-packages/vkf-2.0.1-py3.8-linux-x86_64.egg/.



Launch Python3 and run a VKF experiment on the Mushrooms array. It is assumed that the ~/src/demo/files/ folder contains three files (mushrooms.xml, MUSHROOMS.train, MUSHROOMS.rest). The structure of these files is described in the next section. The first file, 'mushrooms.xml', defines the structure of a lower semilattice on the values of each attribute describing mushrooms. The second and third files are the file 'agaricus-lepiota.data' (a digitized version of a 1981 field guide to North American mushrooms published in New York) divided approximately in half by a random number generator. The names 'encoder', 'lattices', 'trains' and 'tests' that follow are the names of tables in the 'mushroom' database holding, respectively, the encodings of feature values by bit substrings, the cover relations of the semilattice diagrams on these values, and the training and test examples.



(demo) krrguest@amd2700vii:~/src/demo$ python3
>>> import vkfencoder
>>> xml = vkfencoder.XMLImport('./files/mushrooms.xml', 'encoder', 'lattices', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> trn = vkfencoder.DataImport('./files/MUSHROOMS.train', 'e', 'encoder', 'trains', 'mushroom', '127.0.0.1', 'root', 'toor') 
>>> tst = vkfencoder.DataImport('./files/MUSHROOMS.rest', 'e', 'encoder', 'tests', 'mushroom', '127.0.0.1', 'root', 'toor') 


The 'e' in the last two lines corresponds to the edibility of the mushroom (this is how the target feature is encoded in this array).



It is important to note that there is a vkfencoder.XMLExport class that saves the information from the two tables 'encoder' and 'lattices' into an XML file which, after the desired changes are made, can be processed again by the vkfencoder.XMLImport class.
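
The constructor signature of vkfencoder.XMLExport is not spelled out here; the sketch below assumes it mirrors vkfencoder.XMLImport, with the XML file now acting as the output target. Treat the argument order as an assumption to be checked against the library itself.

>>> # assumption: XMLExport mirrors XMLImport's argument order,
>>> # with the XML file as the output target
>>> out = vkfencoder.XMLExport('./files/mushrooms_out.xml', 'encoder', 'lattices', 'mushroom', '127.0.0.1', 'root', 'toor')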



Now we turn to the VKF experiment itself: we connect the encoder, load previously computed hypotheses (if any), compute an additional number (100) of hypotheses in the specified number (4) of threads, and save them in the 'vkfhyps' table.



>>> import vkf
>>> enc = vkf.Encoder('encoder', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> ind = vkf.Induction()
>>> ind.load_discrete_hypotheses(enc, 'trains', 'vkfhyps', 'mushroom', '127.0.0.1', 'root', 'toor') 
>>> ind.add_hypotheses(100, 4) 
>>> ind.save_discrete_hypotheses(enc, 'vkfhyps', 'mushroom', '127.0.0.1', 'root', 'toor')


You can get a Python list of all nontrivial pairs (feature_name, feature_value) for the VKF hypothesis with number ndx:



>>> ind.show_discrete_hypothesis(enc, ndx)


One of the hypotheses for making a decision about the edibility of a mushroom has the form



[('gill_attachment', 'free'), ('gill_spacing', 'close'), ('gill_size', 'broad'), ('stalk_shape', 'enlarging'), ('stalk_surface_below_ring', 'scaly'), ('veil_type', 'partial'), ('veil_color', 'white'), ('ring_number', 'one'), ('ring_type', 'pendant')]


Due to the probabilistic nature of the algorithms of the VKF method, this particular hypothesis may not be generated; however, the author has proved that with a sufficiently large number of generated VKF hypotheses a very similar hypothesis will arise, and it will predict the target property of important test examples almost as well. Details are in Chapter 4 of the author's dissertation.



Finally, we turn to prediction. We create a test sample to assess the quality of the generated hypotheses and, at the same time, to predict the target property of its elements:



>>> tes = vkf.TestSample(enc, ind, 'tests', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> tes.correct_positive_cases()
>>> tes.correct_negative_cases()
>>> exit()


2 Description of the data structure



2.1 Lattice structures on discrete features



The vkfencoder.XMLImport class parses an XML file describing the order between feature values, then creates and fills two tables: 'encoder' (converting each value into a bit string) and 'lattices' (storing the relations between the values of a single feature).

The structure of the input file should look like this:



<?xml version="1.0"?>
<document name="mushrooms_db">
  <attribute name="cap_color">
    <vertices>
      <node string="null" char='_'></node>
      <node string="brown" char='n'></node>
      <node string="buff" char='b'></node>
      <node string="cinnamon" char='c'></node>
      <node string="gray" char='g'></node>
      <node string="green" char='r'></node>
      <node string="pink" char='p'></node>
      <node string="purple" char='u'></node>
      <node string="red" char='e'></node>
      <node string="white" char='w'></node>
      <node string="yellow" char='y'></node>
    </vertices>
    <edges>
      <arc source="brown" target="null"></arc>
      <arc source="buff" target="brown"></arc>
      <arc source="buff" target="yellow"></arc>
      <arc source="cinnamon" target="brown"></arc>
      <arc source="cinnamon" target="red"></arc>
      <arc source="gray" target="null"></arc>
      <arc source="green" target="null"></arc>
      <arc source="pink" target="red"></arc>
      <arc source="pink" target="white"></arc>
      <arc source="purple" target="red"></arc>
      <arc source="red" target="null"></arc>
      <arc source="white" target="null"></arc>
      <arc source="yellow" target="null"></arc>
    </edges>
  </attribute>
</document>


The above example represents the ordering between the values of the third attribute, 'cap_color', of the Mushrooms array from the Machine Learning Repository (University of California, Irvine). Each 'attribute' element represents a lattice structure on the values of the corresponding attribute; in our example it corresponds to the discrete attribute 'cap_color'. The list of all attribute values forms the 'vertices' subelement. We added the special value 'null' to denote a trivial (missing) similarity. The remaining values correspond to the description in the accompanying file 'agaricus-lepiota.names'.

The ordering between the values is represented by the content of the 'edges' subelement. Each 'arc' element describes a more-specific/more-general relation between a pair of attribute values. For example, <arc source="pink" target="red"></arc> means that a pink mushroom cap is more specific than a red one.



The author has proved a theorem on the correctness of the representation generated by the constructor of the vkfencoder.XMLImport class and stored in the 'encoder' table. The algorithm and its correctness proof use a modern branch of lattice theory known as Formal Concept Analysis. For details, the reader is again referred to the author's dissertation (Chapter 1).
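
The library computes this encoding internally, but the underlying idea can be illustrated with a short toy sketch: represent each value by its order ideal (the set of values below it in the semilattice, including itself) and write that ideal out as a bit string. The code below illustrates only the principle on the 'cap_color' arcs shown above; it is not the library's actual algorithm.

# Toy illustration (not the library's algorithm): encode each value of
# 'cap_color' by its order ideal in the semilattice defined by the arcs.
arcs = [('brown', 'null'), ('buff', 'brown'), ('buff', 'yellow'),
        ('cinnamon', 'brown'), ('cinnamon', 'red'), ('gray', 'null'),
        ('green', 'null'), ('pink', 'red'), ('pink', 'white'),
        ('purple', 'red'), ('red', 'null'), ('white', 'null'),
        ('yellow', 'null')]

def ideal(value, arcs):
    """All values reachable downward from 'value', including itself."""
    down = {value}
    frontier = [value]
    while frontier:
        v = frontier.pop()
        for src, dst in arcs:
            if src == v and dst not in down:
                down.add(dst)
                frontier.append(dst)
    return down

values = sorted({v for arc in arcs for v in arc})
position = {v: i for i, v in enumerate(values)}  # one bit per value

for v in values:
    bits = ['0'] * len(values)
    for w in ideal(v, arcs):
        bits[position[w]] = '1'
    print(v.ljust(8), ''.join(bits))

With such an encoding, the bitwise AND of two codes is again an order ideal, whose maximal elements describe the similarity of the two values.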



2.2 Sample structure for discrete features



First of all, it should be noted that the reader can implement their own loader of training and test examples into the database, or use the vkfencoder.DataImport class available in the library. In the latter case, the reader should take into account that the target feature must occupy the first position and consist of a single character (for example, '+'/'-', '1'/'0' or 'e'/'p').



The training sample should be a CSV file (with comma-separated values) describing training examples and counter-examples (examples that lack the target property).



The structure of the input file should look like this:



e,f,f,g,t,n,f,c,b,w,t,b,s,s,p,w,p,w,o,p,k,v,d
p,x,s,p,f,c,f,c,n,u,e,b,s,s,w,w,p,w,o,p,n,s,d
p,f,y,g,f,f,f,c,b,p,e,b,k,k,b,n,p,w,o,l,h,v,g
p,x,y,y,f,f,f,c,b,p,e,b,k,k,n,n,p,w,o,l,h,v,p
e,x,y,b,t,n,f,c,b,e,e,?,s,s,e,w,p,w,t,e,w,c,w
p,f,f,g,f,f,f,c,b,g,e,b,k,k,n,p,p,w,o,l,h,y,g
p,x,f,g,f,f,f,c,b,p,e,b,k,k,p,n,p,w,o,l,h,y,g
p,x,f,y,f,f,f,c,b,h,e,b,k,k,n,b,p,w,o,l,h,y,g
p,f,f,y,f,f,f,c,b,h,e,b,k,k,p,p,p,w,o,l,h,y,g
p,x,y,g,f,f,f,c,b,h,e,b,k,k,p,p,p,w,o,l,h,v,d
p,x,f,g,f,c,f,c,n,u,e,b,s,s,w,w,p,w,o,p,n,v,d
p,x,y,g,f,f,f,c,b,h,e,b,k,k,p,b,p,w,o,l,h,v,g
e,f,y,g,t,n,f,c,b,n,t,b,s,s,p,g,p,w,o,p,k,y,d
e,f,f,e,t,n,f,c,b,p,t,b,s,s,p,p,p,w,o,p,k,v,d
p,f,y,g,f,f,f,c,b,p,e,b,k,k,p,b,p,w,o,l,h,y,p


Each line here describes one training example. The first position contains 'e' or 'p' depending on the edibility/toxicity of the described mushroom. The remaining positions (separated by commas) contain either letters (short names) or strings (full names of the values of the corresponding attribute). For example, the fourth position corresponds to the attribute 'cap_color', where 'g', 'p', 'y', 'b' and 'e' stand for gray, pink, yellow, buff and red, respectively. The question mark in the middle of the table above denotes a missing value. The values of the remaining (non-target) features may also be given as full strings. It is important to note that, when saving to the database, spaces in such names are replaced with '_' characters.
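
If you want to inspect such a file before loading it, a few lines of standard-library Python suffice. The attribute names below follow the accompanying 'agaricus-lepiota.names' file and are listed here only for illustration.

import csv

# first attribute names from 'agaricus-lepiota.names' (target label first)
names = ['edibility', 'cap_shape', 'cap_surface', 'cap_color', 'bruises']

with open('./files/MUSHROOMS.train', newline='') as source:
    for row in csv.reader(source):
        target, features = row[0], row[1:]
        # a '?' among the features marks a missing value
        pairs = list(zip(names[1:], features))
        print(target, pairs)
        break  # show only the first example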



The file with test examples has the same form; the system is simply not told the actual label of each example, and its prediction is compared with that label to compute the quality of the generated hypotheses and their predictive power.



2.3 Sample structure for continuous features



In this case, there are again two options: the reader can implement their own loader of training and test examples into the database, or use the vkfencoder.DataLoad class available in the library. In the latter case, the reader should take into account that the target feature must occupy the first position and consist of a natural number (for example, 0, 7, 15).



The training sample must be a CSV file (with values separated by some delimiter) describing training examples and counter-examples (examples that lack the target property).



The structure of the input file should look like this:



"quality";"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol"
5;7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4
5;7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8
5;7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8
6;11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8
5;7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4
5;7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4
5;7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4
7;7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10
7;7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5


In our case, these are the first few lines of the file 'vv_red.csv', generated from the file 'winequality-red.csv' of red Portuguese wines (the Wine Quality array of the UCI repository of data for testing machine learning procedures) by moving the last column to the front. It is important to note that, when saving to the database, spaces in column names are replaced with '_' characters.
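
The rearrangement itself takes only a few lines with the standard csv module; here is a minimal sketch (file names as in the text, ';' as the delimiter of the original UCI file; quoting of the header may differ from the listing above):

import csv

# move the last column ('quality') to the front, producing vv_red.csv
with open('winequality-red.csv', newline='') as src, \
     open('vv_red.csv', 'w', newline='') as dst:
    reader = csv.reader(src, delimiter=';')
    writer = csv.writer(dst, delimiter=';')
    for row in reader:
        writer.writerow([row[-1]] + row[:-1])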



3 VKF experiment on examples with continuous features



As previously stated, we will demonstrate the work on the Wine Quality array from the UCI repository. Let's start by creating an empty MariaDB database named 'red_wine'.



(demo) krrguest@amd2700vii:~/src/demo$ mysql -u root -p
MariaDB [(none)]> CREATE DATABASE IF NOT EXISTS red_wine;
MariaDB [(none)]> exit;


As a result of the work, the 'red_wine' database will appear.



We launch Python3 and conduct a VKF experiment on the Wine Quality array. It is assumed that the ~/src/demo/files/ folder contains the file vv_red.csv, whose structure was described in the previous section. The names 'verges', 'complex', 'trains' and 'tests' that follow are the names of tables in the 'red_wine' database holding, respectively, the thresholds (both for the original features and for the ones computed by regression), the coefficients of significant logistic regressions between pairs of features, and the training and test examples.



(demo) krrguest@amd2700vii:~/src/demo$ python3
>>> import vkfencoder
>>> nam = vkfencoder.DataLoad('./files/vv_red.csv', 7, 'verges', 'trains', 'red_wine', ';', '127.0.0.1', 'root', 'toor')


The second argument sets the threshold (7 in our case) above which an example is declared positive. The ';' argument specifies the delimiter (',' is the default).



As indicated in the author's previous note, the procedure differs from the case of discrete features. First, we compute logistic regressions using the vkf.Join class; their coefficients are stored in the 'complex' table.



>>> import vkf
>>> join = vkf.Join('trains', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> join.compute_save('complex', 'red_wine', '127.0.0.1', 'root', 'toor') 


Now, using information theory, we compute thresholds with the vkf.Generator class; together with the maxima and minima, they are stored in the 'verges' table.



>>> gen = vkf.Generator('complex', 'trains', 'verges', 'red_wine', 7, '127.0.0.1', 'root', 'toor')


The fifth argument sets the number of thresholds on the original features. By default (and when the value is 0), it is computed as the logarithm of the number of features. For the regression-derived features, a single threshold is sought.
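
The library's exact criterion is not reproduced here, but the flavor of an information-theoretic threshold choice can be conveyed by a toy sketch: choose the cut that minimizes the size-weighted entropy of the target on the two sides, as in decision-tree induction. The alcohol values below come from the listing in section 2.3; the labels mark the two quality-7 wines as positive, purely for illustration.

from math import log2

def entropy(labels):
    """Binary entropy of a 0/1 label list."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def best_threshold(values, labels):
    """Cut point minimizing the size-weighted entropy of the two halves."""
    best, best_cost = None, float('inf')
    for cut in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= cut]
        right = [y for x, y in zip(values, labels) if x > cut]
        cost = (len(left) * entropy(left) + len(right) * entropy(right)) / len(values)
        if cost < best_cost:
            best, best_cost = cut, cost
    return best

# 'alcohol' values and positivity flags from the listing in section 2.3
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5]
positive = [0, 0, 0, 0, 0, 0, 0, 1, 1]
print(best_threshold(alcohol, positive))  # a single cut on this feature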



Now we turn to the VKF experiment itself: we connect the encoders, load previously computed hypotheses (if any), compute an additional number (300) of hypotheses in the specified number (8) of threads, and save them in the 'vkfhyps' table.



>>> qual = vkf.Qualifier('verges', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> beg = vkf.Beget('complex', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> ind = vkf.Induction()
>>> ind.load_continuous_hypotheses(qual, beg, 'trains', 'vkfhyps', 'red_wine', '127.0.0.1', 'root', 'toor') 
>>> ind.add_hypotheses(300, 8) 
>>> ind.save_continuous_hypotheses(qual, 'vkfhyps', 'red_wine', '127.0.0.1', 'root', 'toor')


You can get a Python list of triples (feature_name, (lower_bound, upper_bound)) for the VKF hypothesis with number ndx:



>>> ind.show_continuous_hypothesis(qual, ndx)


Finally, we turn to prediction. We create a test sample to assess the quality of the generated hypotheses and, at the same time, to predict the target property of its elements:



>>> tes = vkf.TestSample(qual, ind, beg, 'trains', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> tes.correct_positive_cases()
>>> tes.correct_negative_cases()
>>> exit()


Since we have only one file here, we used the training sample as the test examples.
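
If you later want separate halves (as was done for MUSHROOMS.train and MUSHROOMS.rest), a random split is easy to script; the output file names below are hypothetical.

import random

# split vv_red.csv roughly in half, keeping the header line in both parts
random.seed(2020)  # arbitrary seed, for reproducibility only
with open('vv_red.csv') as src:
    header, *rows = src.readlines()
with open('VV_RED.train', 'w') as trn, open('VV_RED.rest', 'w') as tst:
    trn.write(header)
    tst.write(header)
    for row in rows:
        (trn if random.random() < 0.5 else tst).write(row)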



4 Note about uploading files



I used the aiofiles library to upload files (such as 'mushrooms.xml') to the web server. Here is a piece of code you might find useful:



import os

import aiofiles
import vkfencoder
from sanic import response  # assumption: these handlers follow Sanic's API

async def write_file(path, body):
    # 'async with' closes the file automatically
    async with aiofiles.open(path, 'wb') as file_write:
        await file_write.write(body)

# 'Settings.UPLOAD_DIR' is the application's upload folder,
# defined elsewhere in the application's configuration
async def xml_upload(request):
    # determine the table names
    encoder_table_name = request.form.get('encoder_table_name')
    lattices_table_name = request.form.get('lattices_table_name')
    # create the upload folder if it does not exist
    if not os.path.exists(Settings.UPLOAD_DIR):
        os.makedirs(Settings.UPLOAD_DIR)
    # ensure a file was sent
    upload_file = request.files.get('file_name')
    if not upload_file:
        return response.redirect("/?error=no_file")
    # write the file to disk and redirect back to the main page
    short_name = upload_file.name.split('/')[-1]
    full_path = f"{Settings.UPLOAD_DIR}/{short_name}"
    try:
        await write_file(full_path, upload_file.body)
    except Exception as exc:
        return response.redirect('/?error=' + str(exc))
    return response.redirect('/ask_data')
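
The handler above follows Sanic's request/response conventions ('request.files', 'response.redirect'). If Sanic is indeed the framework in use, registering the handler might look like this (the application name and route are hypothetical):

from sanic import Sanic

app = Sanic('vkf_demo')  # hypothetical application name
app.add_route(xml_upload, '/xml_upload', methods=['POST'])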


Conclusion



The vkf.cpython-38-x86_64-linux-gnu.so library contains many hidden classes and algorithms. They rely on such advanced mathematical concepts as Close-by-One operations, lazy evaluations, coupled Markov chains, transient states of Markov chains, the total variation metric, and others.



In practice, experiments with arrays from the Machine Learning Repository (University of California, Irvine) have shown the real applicability of the C++ 'VKF Method' program to medium-sized data (the Adult array contains 32,560 training and 16,280 test cases).



The 'VKF Method' system outperformed (in prediction accuracy) the CLIP3 system (Cover Learning using Integer Programming, v.3) on the SPECT array; CLIP3 is a program created by the authors of that array. On the Mushrooms array (randomly divided into training and test samples), the 'VKF Method' system showed 100% accuracy (both with respect to the mushroom-toxicity criterion and with respect to the edibility criterion). The program was also applied to the Adult array to generate more than a million hypotheses (without failures).



The source code of the vkf CPython library is undergoing moderation at savannah.nongnu.org. The code of the auxiliary vkfencoder library can be obtained from Bitbucket.



The author would like to thank his colleagues and students (D.A. Anokhin, E.D. Baranova, I.V. Nikulin, E.Yu. Sidorova, L.A. Yakimova) for their support, useful discussions and joint research. However, as usual, the author is solely responsible for all remaining shortcomings.


