Hello, this is my third article on Habr. Earlier I wrote about the ALM language model; now I want to introduce the ASC typo-correction system (implemented on top of ALM).
Yes, there are a huge number of typo-correction systems, and each has its own strengths and weaknesses. Among the open-source ones I would single out JamSpell as one of the most promising, and that is what we will compare against. There is also a similar system from DeepPavlov that many might think of, but I never managed to get along with it.
Feature list:
- Correction of errors in words with an edit distance of up to 4 (Levenshtein distance).
- Correction of typos in words (insertion, deletion, substitution, transposition of characters).
- Yofication (restoring the letter «ё») taking the context into account.
- Restoring the capitalization of the first letter of a word (for proper names and titles), taking the context into account.
- Splitting run-together words into separate words, taking the context into account.
- Analysis of a text without modifying the original text.
- Searching a text for errors, typos, and incorrect context.
Supported operating systems:
- MacOS X
- FreeBSD
- Linux
The system is written in C++11, and there is a port for Python3.
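If you just want to try it out, here is a minimal sketch using the Python port. It assumes one of the ready dictionaries from the table below has been downloaded locally (the file path is illustrative); the calls mirror the testing example later in the article.

import asc

def status(text, status):
    print(text, status)

# Use the second-generation ALM backend and all available cores,
# as in the testing example further down
asc.setAlmV2()
asc.setThreads(0)
# Enable a few of the correction options described above
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.ascSplit)
asc.setOption(asc.options_t.ascAlter)

# Load a downloaded prebuilt dictionary (path is an example)
asc.loadIndex("./wittenbell-3-middle.asc", "", status)

# spell() returns the corrected text as the first element of its result
res = asc.spell("your text with typos goes here")
print(res[0])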
Ready dictionaries
Name | Size (GB) | RAM (GB) | N-gram order | Language |
---|---|---|---|---|
wittenbell-3-big.asc | 1.97 | 15.6 | 3 | RU |
wittenbell-3-middle.asc | 1.24 | 9.7 | 3 | RU |
mkneserney-3-middle.asc | 1.33 | 9.7 | 3 | RU |
wittenbell-3-single.asc | 0.772 | 5.14 | 3 | RU |
wittenbell-5-single.asc | 1.37 | 10.7 | 5 | RU |
Testing
Data from the 2016 Dialog21 typo-correction competition was used to test the system. Testing was performed with the trained binary dictionary wittenbell-3-middle.asc.
Test conducted | Precision | Recall | F-measure |
---|---|---|---|
Typo correction mode | 76.97 | 62.71 | 69.11 |
Error correction mode | 73.72 | 60.53 | 66.48 |
I don't think other data needs to be added here; anyone who wishes can repeat the test, and all the materials used in testing are attached below.
Materials used in testing
- test.txt: the text to be checked
- correct.txt: the text with the correct variants
- evaluate.py: a Python3 script for calculating the correction results
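The numbers above come from evaluate.py. As a rough illustration of what word-level precision, recall and F-measure mean here, a simplified scorer (not the actual script, and ignoring word splits and merges) could look like this:

# Simplified word-level scoring sketch (not the official evaluate.py):
# a change is counted when the output token differs from the source token,
# and it is correct when it matches the reference token.
def score(source, corrected, reference):
    made = needed = correct = 0
    for src, out, ref in zip(source.split(), corrected.split(), reference.split()):
        if out != src:
            made += 1
        if ref != src:
            needed += 1
        if out != src and out == ref:
            correct += 1
    precision = correct / made if made else 0.0
    recall = correct / needed if needed else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(score("she is amazzing", "she is amazing", "she is amazing"))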
Now it is interesting to compare the typo-correction systems themselves under equal conditions: we will train two different systems on the same text data and run the same test.
For comparison, let's take JamSpell, the typo-correction system mentioned above.
ASC vs JamSpell
Installation
ASC
$ git clone --recursive https://github.com/anyks/asc.git
$ cd ./asc
$ mkdir ./build
$ cd ./build
$ cmake ..
$ make
JamSpell
$ git clone https://github.com/bakwc/JamSpell.git
$ cd ./JamSpell
$ mkdir ./build
$ cd ./build
$ cmake ..
$ make
Training
ASC
train.json
{
"ext": "txt",
"size": 3,
"alter": {"":""},
"debug": 1,
"threads": 0,
"method": "train",
"allow-unk": true,
"reset-unk": true,
"confidence": true,
"interpolate": true,
"mixed-dicts": true,
"only-token-words": true,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"corpus": "./texts/correct.txt",
"w-bin": "./dictionary/3-middle.asc",
"w-vocab": "./train/lm.vocab",
"w-arpa": "./train/lm.arpa",
"mix-restwords": "./similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"bin-code": "ru",
"bin-name": "Russian",
"bin-author": "You name",
"bin-copyright": "You company LLC",
"bin-contacts": "site: https://example.com, e-mail: info@example.com",
"bin-lictype": "MIT",
"bin-lictext": "... License text ...",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./train.json
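A note on the embedding block: it assigns every character of the alphabet to one of embedding-size (28) classes, and characters the dictionary treats as interchangeable share an index (for example "a" and "o", or all digits, in the table above). Below is a hedged sketch of how such a table can collapse a word into a coarse class key; the function and names are purely illustrative and are not ASC's internal API.

# Illustrative only: fold a word into its sequence of character classes,
# using the same kind of table as the "embedding" block above.
EMBEDDING = {
    "a": 0, "o": 0, "b": 2, "c": 15, "s": 15, "e": 5,
    "0": 27, "1": 27, "-": 26,  # digits and punctuation share classes
}

def class_key(word, table=EMBEDDING, unknown=25):
    return tuple(table.get(ch, unknown) for ch in word.lower())

# Words whose characters fall into the same classes get the same key,
# which could be used to narrow down the set of correction candidates.
print(class_key("cab") == class_key("sab"))  # True: 'c' and 's' share class 15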
Python3
import asc
asc.setSize(3)
asc.setAlmV2()
asc.setThreads(0)
asc.setLocale("en_US.UTF-8")
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.allowUnk)
asc.setOption(asc.options_t.resetUnk)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.tokenWords)
asc.setOption(asc.options_t.confidence)
asc.setOption(asc.options_t.interpolate)
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
def statusArpa1(status):
    print("Build arpa", status)

def statusArpa2(status):
    print("Write arpa", status)

def statusVocab(status):
    print("Write vocab", status)

def statusIndex(text, status):
    print(text, status)

def status(text, status):
    print(text, status)
asc.collectCorpus("./texts/correct.txt", asc.smoothing_t.wittenBell, 0.0, False, False, status)
asc.buildArpa(statusArpa1)
asc.writeArpa("./train/lm.arpa", statusArpa2)
asc.writeVocab("./train/lm.vocab", statusVocab)
asc.setCode("RU")
asc.setLictype("MIT")
asc.setName("Russian")
asc.setAuthor("You name")
asc.setCopyright("You company LLC")
asc.setLictext("... License text ...")
asc.setContacts("site: https://example.com, e-mail: info@example.com")
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
JamSpell
$ ./main/jamspell train ../test_data/alphabet_ru.txt ../test_data/correct.txt ./model.bin
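JamSpell also ships Python bindings, so the same training step can be done from Python. A rough equivalent of the CLI command above (assuming the bindings are installed, e.g. via pip install jamspell) is:

import jamspell

corrector = jamspell.TSpellCorrector()

# Same corpus and alphabet files as in the CLI command above
corrector.TrainLangModel('../test_data/correct.txt', '../test_data/alphabet_ru.txt', './model.bin')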
Testing
ASC
spell.json
{
"debug": 1,
"threads": 0,
"method": "spell",
"spell-verbose": true,
"confidence": true,
"mixed-dicts": true,
"asc-split": true,
"asc-alter": true,
"asc-esplit": true,
"asc-rsplit": true,
"asc-uppers": true,
"asc-hyphen": true,
"asc-wordrep": true,
"r-text": "./texts/test.txt",
"w-text": "./texts/output.txt",
"r-bin": "./dictionary/3-middle.asc"
}
$ ./asc -r-json ./spell.json
Python3
import asc
asc.setAlmV2()
asc.setThreads(0)
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.ascSplit)
asc.setOption(asc.options_t.ascAlter)
asc.setOption(asc.options_t.ascESplit)
asc.setOption(asc.options_t.ascRSplit)
asc.setOption(asc.options_t.ascUppers)
asc.setOption(asc.options_t.ascHyphen)
asc.setOption(asc.options_t.ascWordRep)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.confidence)
def status(text, status):
    print(text, status)

asc.loadIndex("./dictionary/3-middle.asc", "", status)

f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')

for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])

f2.close()
f1.close()
JamSpell
The test can be run either from Python or from C++; both variants are shown below.
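A rough Python variant using the jamspell bindings (a hedged sketch; paths match the C++ program that follows):

import jamspell

corrector = jamspell.TSpellCorrector()
# Load the model trained in the previous step
corrector.LoadLangModel('./model.bin')

# Correct the test file line by line
with open('./test_data/test.txt') as fin, open('./test_data/output.txt', 'w') as fout:
    for line in fin:
        line = line.rstrip('\n')
        if line:
            fout.write(corrector.FixFragment(line) + '\n')

The C++ program used for the actual comparison: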
#include <fstream>
#include <iostream>
#include <jamspell/spell_corrector.hpp>
// Use Boost for the UTF-8 conversion if requested
#ifdef USE_BOOST_CONVERT
#include <boost/locale/encoding_utf.hpp>
// Otherwise fall back to the standard <codecvt> facilities
#else
#include <codecvt>
#endif

using namespace std;

/**
 * convert Converts a wide string to a UTF-8 string
 * @param str wide string to convert
 * @return    UTF-8 encoded result
 */
const string convert(const wstring & str){
    // Resulting UTF-8 string
    string result = "";
    // Only convert non-empty input
    if(!str.empty()){
        // Conversion via BOOST
        #ifdef USE_BOOST_CONVERT
            // Import the converter
            using boost::locale::conv::utf_to_utf;
            // Convert the wide string to UTF-8
            result = utf_to_utf <char> (str.c_str(), str.c_str() + str.size());
        // Conversion via the standard library
        #else
            // UTF-8 converter type
            using convert_type = codecvt_utf8 <wchar_t, 0x10ffff, little_endian>;
            // Converter object
            wstring_convert <convert_type, wchar_t> conv;
            // wstring_convert <codecvt_utf8 <wchar_t>> conv;
            // Convert the wide string to UTF-8
            result = conv.to_bytes(str);
        #endif
    }
    // Return the result
    return result;
}
/**
 * convert Converts a UTF-8 string to a wide string
 * @param str UTF-8 string to convert
 * @return    wide string result
 */
const wstring convert(const string & str){
    // Resulting wide string
    wstring result = L"";
    // Only convert non-empty input
    if(!str.empty()){
        // Conversion via BOOST
        #ifdef USE_BOOST_CONVERT
            // Import the converter
            using boost::locale::conv::utf_to_utf;
            // Convert the UTF-8 string to a wide string
            result = utf_to_utf <wchar_t> (str.c_str(), str.c_str() + str.size());
        // Conversion via the standard library
        #else
            // Converter object
            // wstring_convert <codecvt_utf8 <wchar_t>> conv;
            wstring_convert <codecvt_utf8_utf16 <wchar_t, 0x10ffff, little_endian>> conv;
            // Convert the UTF-8 string to a wide string
            result = conv.from_bytes(str);
        #endif
    }
    // Return the result
    return result;
}
/**
 * safeGetline Reads a line from a stream, handling \n, \r\n and EOF endings
 * @param is input stream
 * @param t  string that receives the line
 * @return   the input stream
 */
istream & safeGetline(istream & is, string & t){
    // Clear the output string
    t.clear();
    istream::sentry se(is, true);
    streambuf * sb = is.rdbuf();
    for(;;){
        int c = sb->sbumpc();
        switch(c){
            case '\n': return is;
            case '\r':
                if(sb->sgetc() == '\n') sb->sbumpc();
                return is;
            case streambuf::traits_type::eof():
                if(t.empty()) is.setstate(ios::eofbit);
                return is;
            default: t += (char) c;
        }
    }
}
/**
 * main Program entry point
 */
int main(){
    // Spell corrector object
    NJamSpell::TSpellCorrector corrector;
    // Load the trained language model
    corrector.LoadLangModel("model.bin");
    // Open the file with the test text
    ifstream file1("./test_data/test.txt", ios::in);
    // If the input file is open
    if(file1.is_open()){
        // Current line and correction result
        string line = "", res = "";
        // Open the output file
        ofstream file2("./test_data/output.txt", ios::out);
        // If the output file is open
        if(file2.is_open()){
            // Read the input file line by line
            while(file1.good()){
                // Read the next line
                safeGetline(file1, line);
                // If the line is not empty
                if(!line.empty()){
                    // Correct the line and convert it back to UTF-8
                    res = convert(corrector.FixFragment(convert(line)));
                    // If a result was produced
                    if(!res.empty()){
                        // Append a line break
                        res.append("\n");
                        // Write the corrected line to the output file
                        file2.write(res.c_str(), res.size());
                    }
                }
            }
            // Close the output file
            file2.close();
        }
        // Close the input file
        file1.close();
    }
    return 0;
}
$ g++ -std=c++11 -I../JamSpell -L./build/jamspell -L./build/contrib/cityhash -L./build/contrib/phf -ljamspell_lib -lcityhash -lphf ./test.cpp -o ./bin/test
$ ./bin/test
Getting results
$ python3 evaluate.py ./texts/test.txt ./texts/correct.txt ./texts/output.txt
ASC
Precision | Recall | F-measure |
---|---|---|
92.13 | 82.51 | 87.05 |
JamSpell
Precision | Recall | F-measure |
---|---|---|
77.87 | 63.36 | 69.87 |
One of the main features of ASC is learning from dirty data. It is practically impossible to find openly available text corpora without errors and typos, and a lifetime is not enough to fix terabytes of data by hand, yet the data still has to be used somehow.
The training approach I propose:
- Assemble a language model from the dirty data
- Remove all rare words and N-grams from the assembled language model
- Add single words so that the typo-correction system works more correctly
- Assemble the binary dictionary
Let's get started
Suppose we have several corpora on different subjects; it is more logical to train on them separately and then combine the results.
Assembling the corpus with ALM
collect.json
{
"size": 3,
"debug": 1,
"threads": 0,
"ext": "txt",
"method": "train",
"allow-unk": true,
"mixed-dicts": true,
"only-token-words": true,
"smoothing": "wittenbell",
"locale": "en_US.UTF-8",
"w-abbr": "./output/alm.abbr",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"w-words": "./output/words.txt",
"corpus": "./texts/corpus",
"abbrs": "./abbrs/abbrs.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./collect.json
- size – the N-gram size (3)
- debug – display the collection progress
- threads – number of threads (0 = use all available cores)
- ext – extension of the text files in the corpus directory
- allow-unk – allow the ⟨unk⟩ token in the language model
- mixed-dicts – allow words with letters substituted from other alphabets
- only-token-words – keep only words in the N-grams, without punctuation
- smoothing – the smoothing algorithm (wittenbell)
- locale – environment locale (optional)
- w-abbr – file to save the abbreviation suffixes found in the corpus
- w-map – file to save the sequence map of the corpus
- w-vocab – file to save the collected vocabulary
- w-words – file to save the list of collected words (optional)
- corpus – directory with the text corpus files
- abbrs – abbreviations that cannot be recognized as such without context
- goodwords – whitelist of words
- badwords – blacklist of words
- mix-restwords – file with similar-looking letters from different alphabets
- alphabet – the alphabet to use (Latin in this example)
Python
import alm

# Set the N-gram size to 3
alm.setSize(3)
# Use all available cores
alm.setThreads(0)
# Environment locale (optional)
alm.setLocale("en_US.UTF-8")
# Alphabet used (Latin in this example)
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Similar-looking letters from different alphabets
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Allow the <unk> token in the language model
alm.setOption(alm.options_t.allowUnk)
# Allow words with mixed alphabets
alm.setOption(alm.options_t.mixDicts)
# Keep only words in the N-grams, without punctuation
alm.setOption(alm.options_t.tokenWords)
# Use the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)

# Abbreviations that cannot be recognized without context
f = open('./abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    alm.addAbbr(abbr)
f.close()

# Whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()

# Blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

def status(text, status):
    print(text, status)

def statusWords(status):
    print("Write words", status)

def statusVocab(status):
    print("Write vocab", status)

def statusMap(status):
    print("Write map", status)

def statusSuffix(status):
    print("Write suffix", status)

# Collect the text corpus
alm.collectCorpus("./texts/corpus", status)
# Save the list of collected words
alm.writeWords("./output/words.txt", statusWords)
# Save the collected vocabulary
alm.writeVocab("./output/alm.vocab", statusVocab)
# Save the sequence map of the corpus
alm.writeMap("./output/alm.map", statusMap)
# Save the abbreviation suffixes
alm.writeSuffix("./output/alm.abbr", statusSuffix)
Pruning the assembled corpus with ALM
prune.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"method": "vprune",
"vprune-wltf": -15.0,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-map": "./corpus1/alm.map",
"r-vocab": "./corpus1/alm.vocab",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./prune.json
- size – the N-gram size (3)
- debug – display the pruning progress
- allow-unk – allow the ⟨unk⟩ token in the language model
- vprune-wltf – the wltf (weighted log term frequency) pruning threshold; words scoring below it are removed
- locale – environment locale (optional)
- smoothing – the smoothing algorithm (wittenbell)
- r-map – sequence map of the corpus to read
- r-vocab – vocabulary to read
- w-map – file to save the pruned sequence map
- w-vocab – file to save the pruned vocabulary
- goodwords – whitelist of words
- badwords – blacklist of words
- alphabet – the alphabet to use (Latin in this example)
Python
import alm

# Set the N-gram size to 3
alm.setSize(3)
# Use all available cores
alm.setThreads(0)
# Environment locale (optional)
alm.setLocale("en_US.UTF-8")
# Alphabet used (Latin in this example)
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Allow the <unk> token in the language model
alm.setOption(alm.options_t.allowUnk)
# Use the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)

# Whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()

# Blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

def statusPrune(status):
    print("Prune data", status)

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusWriteVocab(status):
    print("Write vocab", status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusWriteMap(status):
    print("Write map", status)

# Read the previously collected vocabulary
alm.readVocab("./corpus1/alm.vocab", statusReadVocab)
# Read the previously collected sequence map
alm.readMap("./corpus1/alm.map", statusReadMap)
# Prune rare words using the wltf threshold
alm.pruneVocab(-15.0, 0, 0, statusPrune)
# Save the pruned vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Save the pruned sequence map
alm.writeMap("./output/alm.map", statusWriteMap)
Combining collected data with ALM
merge.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"method": "merge",
"mixed-dicts": "true",
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-words": "./texts/words",
"r-map": "./corpus1",
"r-vocab": "./corpus1",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./merge.json
- size – the N-gram size (3)
- debug – display the merge progress
- allow-unk – allow the ⟨unk⟩ token in the language model
- mixed-dicts – allow words with letters substituted from other alphabets
- locale – environment locale (optional)
- smoothing – the smoothing algorithm (wittenbell)
- r-words – directory with lists of single words to add
- r-map – directory with the sequence maps to merge
- r-vocab – directory with the vocabularies to merge
- w-map – file to save the merged sequence map
- w-vocab – file to save the merged vocabulary
- goodwords – whitelist of words
- badwords – blacklist of words
- alphabet – the alphabet to use (Latin in this example)
Python
import alm

# Set the N-gram size to 3
alm.setSize(3)
# Use all available cores
alm.setThreads(0)
# Environment locale (optional)
alm.setLocale("en_US.UTF-8")
# Alphabet used (Latin in this example)
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Similar-looking letters from different alphabets
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Allow the <unk> token in the language model
alm.setOption(alm.options_t.allowUnk)
# Allow words with mixed alphabets
alm.setOption(alm.options_t.mixDicts)
# Use the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)

# Whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()

# Blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

# Single words to add to the model
f = open('./texts/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addWord(word)
f.close()

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusWriteVocab(status):
    print("Write vocab", status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusWriteMap(status):
    print("Write map", status)

# Read the collected vocabularies from the corpus directory
alm.readVocab("./corpus1", statusReadVocab)
# Read the collected sequence maps from the corpus directory
alm.readMap("./corpus1", statusReadMap)
# Save the merged vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Save the merged sequence map
alm.writeMap("./output/alm.map", statusWriteMap)
Training the language model with ALM
train.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"reset-unk": true,
"interpolate": true,
"method": "train",
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-map": "./output/alm.map",
"r-vocab": "./output/alm.vocab",
"w-arpa": "./output/alm.arpa",
"w-words": "./output/words.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./train.json
- size – the N-gram size (3)
- debug – display the training progress
- allow-unk – allow the ⟨unk⟩ token in the language model
- reset-unk – reset the frequency of the ⟨unk⟩ token
- interpolate – interpolate frequencies during calculation
- locale – environment locale (optional)
- smoothing – the smoothing algorithm (wittenbell)
- r-map – sequence map of the corpus to read
- r-vocab – vocabulary to read
- w-arpa – file to save the ARPA model
- w-words – file to save the list of words (optional)
- alphabet – the alphabet to use (Latin in this example)
Python
import alm

# Set the N-gram size to 3
alm.setSize(3)
# Use all available cores
alm.setThreads(0)
# Environment locale (optional)
alm.setLocale("en_US.UTF-8")
# Alphabet used (Latin in this example)
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Similar-looking letters from different alphabets
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Allow the <unk> token in the language model
alm.setOption(alm.options_t.allowUnk)
# Reset the frequency of the <unk> token
alm.setOption(alm.options_t.resetUnk)
# Allow words with mixed alphabets
alm.setOption(alm.options_t.mixDicts)
# Interpolate frequencies during calculation
alm.setOption(alm.options_t.interpolate)
# Use the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusBuildArpa(status):
    print("Build ARPA", status)

def statusWriteMap(status):
    print("Write map", status)

def statusWriteArpa(status):
    print("Write ARPA", status)

def statusWords(status):
    print("Write words", status)

# Read the merged vocabulary
alm.readVocab("./output/alm.vocab", statusReadVocab)
# Read the merged sequence map
alm.readMap("./output/alm.map", statusReadMap)
# Build the ARPA data
alm.buildArpa(statusBuildArpa)
# Save the ARPA file
alm.writeArpa("./output/alm.arpa", statusWriteArpa)
# Save the list of words (optional)
alm.writeWords("./output/words.txt", statusWords)
Training the ASC spell-checker
train.json
{
"size": 3,
"debug": 1,
"threads": 0,
"confidence": true,
"mixed-dicts": true,
"method": "train",
"alter": {"":""},
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"w-bin": "./dictionary/3-single.asc",
"r-abbr": "./output/alm.abbr",
"r-vocab": "./output/alm.vocab",
"r-arpa": "./output/alm.arpa",
"abbrs": "./texts/abbrs/abbrs.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"alters": "./texts/alters/yoficator.txt",
"upwords": "./texts/words/upp",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"bin-code": "ru",
"bin-name": "Russian",
"bin-author": "You name",
"bin-copyright": "You company LLC",
"bin-contacts": "site: https://example.com, e-mail: info@example.com",
"bin-lictype": "MIT",
"bin-lictext": "... License text ...",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./train.json
- size – the N-gram size (3)
- debug – display the training progress
- threads – number of threads (0 = use all available cores)
- confidence – load the ARPA data as is, without re-weighting
- mixed-dicts – allow words with letters substituted from other alphabets
- alter – alternative letters (for example, the replacement pair for the letter «ё»)
- locale – environment locale (optional)
- smoothing – the smoothing algorithm (wittenbell)
- pilots – single-letter words of the alphabet
- w-bin – file to save the binary dictionary
- r-abbr – abbreviation suffixes collected earlier
- r-vocab – vocabulary to read
- r-arpa – ARPA file to read
- abbrs – abbreviations that cannot be recognized as such without context
- goodwords – whitelist of words
- badwords – blacklist of words
- alters – words that can always be replaced unambiguously (yoficator list)
- upwords – list of words that should always start with a capital letter (names, cities, ...)
- mix-restwords – file with similar-looking letters from different alphabets
- alphabet – the alphabet to use (Latin in this example)
- bin-code – language code of the dictionary
- bin-name – name of the dictionary
- bin-author – author of the dictionary
- bin-copyright – copyright holder of the dictionary
- bin-contacts – author's contact details
- bin-lictype – license type of the dictionary
- bin-lictext – license text
- embedding-size – size of the character embedding
- embedding – the character embedding itself (characters treated as equivalent share an index)
Python
import asc

# Set the N-gram size to 3
asc.setSize(3)
# Use all available cores
asc.setThreads(0)
# Environment locale (optional)
asc.setLocale("en_US.UTF-8")
# Detect words starting with a capital letter
asc.setOption(asc.options_t.uppers)
# Allow the <unk> token in the language model
asc.setOption(asc.options_t.allowUnk)
# Reset the frequency of the <unk> token
asc.setOption(asc.options_t.resetUnk)
# Allow words with mixed alphabets
asc.setOption(asc.options_t.mixDicts)
# Load the ARPA data as is, without re-weighting
asc.setOption(asc.options_t.confidence)
# Alphabet used (Latin in this example)
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Single-letter (pilot) words of the alphabet
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
# Similar-looking letters from different alphabets
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

# Whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addGoodword(word)
f.close()

# Blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addBadword(word)
f.close()

# Abbreviation suffixes collected earlier
f = open('./output/alm.abbr')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addSuffix(word)
f.close()

# Abbreviations that cannot be recognized without context
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    asc.addAbbr(abbr)
f.close()

# Words that should always start with a capital letter (names, cities, ...)
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addUWord(word)
f.close()

# Alternative letter pair
asc.addAlt("", "")

# Words that can always be replaced unambiguously (yoficator list)
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
    words = words.replace("\n", "")
    words = words.split('\t')
    asc.addAlt(words[0], words[1])
f.close()

def statusIndex(text, status):
    print(text, status)

def statusBuildIndex(status):
    print("Build index", status)

def statusArpa(status):
    print("Read arpa", status)

def statusVocab(status):
    print("Read vocab", status)

# Read the ARPA file
asc.readArpa("./output/alm.arpa", statusArpa)
# Read the vocabulary
asc.readVocab("./output/alm.vocab", statusVocab)

# Language code of the dictionary
asc.setCode("RU")
# License type of the dictionary
asc.setLictype("MIT")
# Name of the dictionary
asc.setName("Russian")
# Author of the dictionary
asc.setAuthor("You name")
# Copyright holder of the dictionary
asc.setCopyright("You company LLC")
# License text
asc.setLictext("... License text ...")
# Author's contact details
asc.setContacts("site: https://example.com, e-mail: info@example.com")
# Character embedding (characters treated as equivalent share an index)
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
# Build the index
asc.buildIndex(statusBuildIndex)
# Save the binary dictionary
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
I understand that not everyone will be able to train their own binary dictionary: this requires text corpora and significant computing resources. Therefore, ASC can also work with a single ARPA file as its main dictionary.
Usage example
spell.json
{
"ad": 13,
"cw": 38120,
"debug": 1,
"threads": 0,
"method": "spell",
"alter": {"":""},
"asc-split": true,
"asc-alter": true,
"confidence": true,
"asc-esplit": true,
"asc-rsplit": true,
"asc-uppers": true,
"asc-hyphen": true,
"mixed-dicts": true,
"asc-wordrep": true,
"spell-verbose": true,
"r-text": "./texts/test.txt",
"w-text": "./texts/output.txt",
"upwords": "./texts/words/upp",
"r-arpa": "./dictionary/alm.arpa",
"r-abbr": "./dictionary/alm.abbr",
"abbrs": "./texts/abbrs/abbrs.txt",
"alters": "./texts/alters/yoficator.txt",
"mix-restwords": "./similars/letters.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./spell.json
Python
import asc

# Use all available cores
asc.setThreads(0)
# Detect words starting with a capital letter
asc.setOption(asc.options_t.uppers)
# Allow splitting of run-together words
asc.setOption(asc.options_t.ascSplit)
# Allow substitution of alternative letters
asc.setOption(asc.options_t.ascAlter)
# Allow splitting of run-together words that contain errors
asc.setOption(asc.options_t.ascESplit)
# Allow merging of words that were incorrectly split apart
asc.setOption(asc.options_t.ascRSplit)
# Allow case correction of the first letter
asc.setOption(asc.options_t.ascUppers)
# Allow handling of hyphenated words
asc.setOption(asc.options_t.ascHyphen)
# Allow removal of repeated words
asc.setOption(asc.options_t.ascWordRep)
# Allow words with mixed alphabets
asc.setOption(asc.options_t.mixDicts)
# Load the ARPA data as is, without re-weighting
asc.setOption(asc.options_t.confidence)
# Alphabet used (Latin in this example)
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Single-letter (pilot) words of the alphabet
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
# Similar-looking letters from different alphabets
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

# Whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addGoodword(word)
f.close()

# Blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addBadword(word)
f.close()

# Abbreviation suffixes collected earlier
f = open('./output/alm.abbr')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addSuffix(word)
f.close()

# Abbreviations that cannot be recognized without context
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    asc.addAbbr(abbr)
f.close()

# Words that should always start with a capital letter (names, cities, ...)
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addUWord(word)
f.close()

# Alternative letter pair
asc.addAlt("", "")

# Words that can always be replaced unambiguously (yoficator list)
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
    words = words.replace("\n", "")
    words = words.split('\t')
    asc.addAlt(words[0], words[1])
f.close()

def statusArpa(status):
    print("Read arpa", status)

def statusIndex(status):
    print("Build index", status)

# Read the ARPA file
asc.readArpa("./dictionary/alm.arpa", statusArpa)
# Set the corpus statistics: 38120 words across 13 documents
asc.setAdCw(38120, 13)
# Character embedding (characters treated as equivalent share an index)
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
# Build the index
asc.buildIndex(statusIndex)

f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')

for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])

f2.close()
f1.close()
P.S. For those who do not want to collect or train anything at all, I have put up a web version of ASC. Keep in mind that a typo-correction system is not omniscient, and the entire Russian language cannot be fed into it. ASC will not correct arbitrary texts; it has to be trained separately for each subject area.