How many foreign tourists are there in your city? In mine there are few, but they do turn up, usually lost in the middle of the street and repeating a single word: the name of some place. Passers-by try to explain on their fingers where to go, and when it comes down to "me no understand you", they take the tourists by the hand and lead them to their destination. Surprisingly, the destination is usually within a five-minute walk, so these tourists did have a rough idea of the city. Maybe they were using a paper map.
How often have you personally found yourself in such a situation, in an unfamiliar city in another country?
The advent of smartphones and navigation apps has solved many problems. Hurray, you can see your geolocation, find where to go, estimate the direction, and even plot a route.
Only one problem remains: every street in the app is labeled in local hieroglyphs in the local dialect. It is fine if the host country uses the Latin alphabet: every smartphone has a Latin keyboard and the world is used to it, though even then I felt some discomfort with the diacritics of the Czech alphabet. I can only imagine the pain and suffering of foreigners faced with Cyrillic; look at pseudo-Cyrillic and you will understand. In their place, I would type names and addresses in Latin letters, trying to reproduce the sound. That is phonetic search.
In this article I will describe how to implement the Soundex phonetic search algorithms on the Sphinx Search engine. Transliteration alone is not enough here, although you cannot do without it either. The resulting configuration file is available as a GitHub Gist.
Introduction
, , -, , , Sphinx Search.
, , , .. , - Sphinx.
, , , , , . , , .
, . Soundex Metaphone, . Soundex , Metaphone .
, Sphinx Soundex, , . , , . .. . .
. , : « » – , , « », , . , , , , , .
, Soundex, , , NYSIIS, Daitch-Mokotoff.
To connect over SphinxQL:
mysql -h 127.0.0.1 -P 9306 --default-character-set=utf8
Sphinx, , Sphinx Search, , , . .
Soundex
. , Sphinx Search, , , .. .
, : , – . .
– , Sphinx .
, , , , , : . – , - , , – . " ", . , , , .
regexp_filter = (|) => a
regexp_filter = (|) =>
, – , GitHub Gist.
Enable soundex:
morphology = soundex
, , Sphinx Soundex.
, , Sphinx. -. - , , . . «», «», - , «Lenina», «ulitsa Lenina».
mysql> call keywords(' Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | lenin | l500 |
| 2 | lenina | l500 |
| 3 | lenina | l500 |
| 4 | lennina | l500 |
| 5 | lenin | l500 |
+------+-----------+------------+
, tokenized , . normalized, Sphinx , , morphology. 'Lenina' l500, '' l500, , - , . Lennina, Lenena, Lennona. , , .
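The codes in the normalized column can be reproduced with a short Python sketch. It is not Sphinx's implementation, only a guess inferred from the output shown in this article: the first letter is kept, the remaining letters are mapped to the classic Soundex digit groups, a digit is skipped when it repeats the previous coded letter (vowels in between do not reset this), and the result is padded with zeros to four characters without being truncated. The name `soundex_like` is hypothetical.

```python
# Classic Soundex digit groups; vowels and h, w, y carry no code.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex_like(word: str) -> str:
    """Approximate the codes seen in the article's CALL KEYWORDS output."""
    word = word.lower()
    if not word:
        return ""
    digits = []
    # Assumption: the first letter's own code counts when skipping repeats.
    last = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch)
        if code is None:
            continue  # uncoded letters are skipped and do not reset `last`
        if code != last:
            digits.append(code)
        last = code
    # Pad short codes with zeros; longer codes are kept whole.
    return (word[0] + "".join(digits)).ljust(4, "0")


for name in ("Lenin", "Lenina", "Lennina", "Lenena"):
    print(name, soundex_like(name))
```

All four spellings collapse to the same code, which is why a misspelled query such as Lenena can still find the street.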
, :
mysql> select * from STREETS where match('Lenena');
+------+--------------------------------------+-----------+--------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+--------------+
| 387 | 4b919f60-7f5d-4b9e-99af-a7a02d344767 | | |
+------+--------------------------------------+-----------+--------------+
Sphinx , . . , :
mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+----------------+------------+
| qpos | tokenized | normalized |
+------+----------------+------------+
| 1 | plekhanovskaja | p42512 |
| 2 | plechanovskaya | p42512 |
| 3 | plehanovskaja | p4512 |
| 4 | plekhanovska | p42512 |
+------+----------------+------------+
plehanovskaja -
. Sphinx . , CALL QSUGGEST:
mysql> CALL QSUGGEST('Plehanovskaja', 'STREETS');
+----------------+----------+------+
| suggest | distance | docs |
+----------------+----------+------+
| plekhanovskaja | 1 | 1 |
| petrovskaja | 4 | 1 |
+----------------+----------+------+
, , . .. .
, :
min_infix_len = 2
suggest tokenized, .. , . , Soundex , QSUGGEST .
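The distance column in the CALL QSUGGEST output above looks like a plain edit distance between the query word and each suggestion. A minimal Python sketch, assuming the metric is Levenshtein distance (the values shown are consistent with it; the function name `levenshtein` is illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("plehanovskaja", "plekhanovskaja"))
print(levenshtein("plehanovskaja", "petrovskaja"))
```

The first pair differs by the single missing "k"; the second needs four edits, matching the order in which the suggestions are returned.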
- :
mysql> select * from STREETS where match('30 let Pobedy');
+------+--------------------------------------+-----------+------------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+------------------------+
| 677 | 87234d80-4098-40c0-adb2-fc83ef237a5f | | 30 |
+------+--------------------------------------+-----------+------------------------+
mysql> select * from STREETS where match('30 ');
+------+--------------------------------------+-----------+------------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+------------------------+
| 677 | 87234d80-4098-40c0-adb2-fc83ef237a5f | | 30 |
+------+--------------------------------------+-----------+------------------------+
, .
Refined Soundex
. , , , .
.
Sphinx index
, , , . , Sphinx , . .. , regexp_filter
, regexp_filter
.
morphology = soundex
– , . , .
Sphinx , , ! . RE2.
, : regexp_filter = \A(A|a) => a
, 0.
regexp_filter = \B(A|a) => 0
regexp_filter = \B(Y|y) => 0
...
, regexp_filter = \B(Y|y) =>
, - . , «» «Veelkaseem» .
mysql> call keywords(' Veelkaseem', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | v738 | v738 |
| 2 | v738 | v738 |
+------+-----------+------------+
- :
mysql> call keywords(' Veelkaseem', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | v738 | v738 |
| 2 | v0730308 | v0730308 |
+------+-----------+------------+
, H W .
, , /, H W, . .
regexp_filter = 0+ => 0
regexp_filter = 1+ => 1
...
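The codes in this section's CALL KEYWORDS output are consistent with the Refined Soundex letter groups. The Python sketch below is inferred from those outputs, not taken from the author's configuration: it keeps the first letter, maps the remaining consonants to Refined Soundex digits, drops vowels and h, w, y, and collapses runs of the same digit. RE2 has no backreferences, which is why the config needs one `regexp_filter` rule per digit; Python's `re` can do the collapse in a single pattern. The name `refined_like` and the pass-through of non-alphabetic tokens are assumptions.

```python
import re

# Refined Soundex digit groups; vowels and h, w, y are left uncoded here.
REFINED_CODES = {
    **dict.fromkeys("bp", "1"),
    **dict.fromkeys("fv", "2"),
    **dict.fromkeys("cks", "3"),
    **dict.fromkeys("gj", "4"),
    **dict.fromkeys("qxz", "5"),
    **dict.fromkeys("dt", "6"),
    "l": "7",
    **dict.fromkeys("mn", "8"),
    "r": "9",
}


def refined_like(word: str) -> str:
    """Approximate the codes seen in this section's output tables."""
    word = word.lower()
    if not word.isalpha():
        return word  # assumption: tokens such as "30" pass through unchanged
    digits = "".join(REFINED_CODES.get(ch, "") for ch in word[1:])
    # Equivalent of the per-digit rules "1+ => 1", "2+ => 2", and so on.
    digits = re.sub(r"(\d)\1+", r"\1", digits)
    return word[0] + digits


for name in ("Veelkaseem", "Lenina", "Plekhanovskaja", "Plehanovskaja"):
    print(name, refined_like(name))
```

Unlike the built-in soundex, this variant separates words such as vorovskogo and verbovaja, at the cost of longer codes that tolerate fewer misspellings.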
:
mysql> call keywords(' Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | l8 | l8 |
| 2 | l8 | l8 |
| 3 | l8 | l8 |
| 4 | l8 | l8 |
| 5 | l8 | l8 |
+------+-----------+------------+
mysql> select * from STREETS where match('Lenina');
+------+--------------------------------------+-----------+--------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+--------------+
| 387 | 4b919f60-7f5d-4b9e-99af-a7a02d344767 | | |
+------+--------------------------------------+-----------+--------------+
, . , tokenized , soundex-. QSUGGEST . - , – . ngram_chars. .
:
mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | p738234 | p738234 |
| 2 | p73823 | p73823 |
| 3 | p78234 | p78234 |
| 4 | p73823 | p73823 |
+------+-----------+------------+
, , QSUGGEST :
mysql> CALL QSUGGEST('Plehanovskaja', 'STREETS');
Empty set (0.00 sec)
mysql> CALL QSUGGEST('p73823', 'STREETS');
Empty set (0.00 sec)
mysql> CALL QSUGGEST('p78234', 'STREETS');
Empty set (0.00 sec)
, , , . , , . . , «30 »:
mysql> call keywords('30 let Podedy', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | 30 | 30 |
| 2 | l6 | l6 |
| 3 | p6 | p6 |
+------+-----------+------------+
mysql> select * from STREETS where match('30 let Pobedy');
+------+--------------------------------------+-----------+------------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+------------------------+
| 677 | 87234d80-4098-40c0-adb2-fc83ef237a5f | | 30 |
+------+--------------------------------------+-----------+------------------------+
:
mysql> select * from STREETS where match('');
+------+--------------------------------------+--------------+----------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+--------------+----------------------+
| 873 | abdb0221-bfe8-4cf8-9217-0ed40b2f6f10 | | 30 |
| 1208 | f1127b16-8a8e-4520-b1eb-6932654abdcd | | 50 |
+------+--------------------------------------+--------------+----------------------+
, , , .
NYSIIS
. «» - . «» , , - , .
(?i) .
, . :
regexp_filter = (?i)\b(mac) => mcc
regexp_filter = (?i)(ee)\b => y
: H, W
regexp_filter = (?i)(a|e|i|o|u|y)h => \1
regexp_filter = (?i)(a|e|i|o|u|y)w => \1a
regexp_filter = (?i)\B(e|i|o|u) => a
regexp_filter = (?i)\B(q) => g
S
regexp_filter = (?i)s\b =>
AY Y
A
, , !!!
, - , , , CALL QSUGGEST.
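To see how such an ordered rule chain rewrites a word, here is a small Python sketch that applies only the `regexp_filter` rules shown in full above, in the same order. It is an illustration, not the complete rule set: the remaining rules live in the author's Gist, so the result is an intermediate form rather than the final key. Python's `re` stands in for RE2 here, and the names `RULES` and `apply_rules` are hypothetical.

```python
import re

# (pattern, replacement) pairs copied from the rules listed above, in order.
RULES = [
    (r"(?i)\b(mac)", "mcc"),
    (r"(?i)(ee)\b", "y"),
    (r"(?i)(a|e|i|o|u|y)h", r"\1"),   # drop H after a vowel
    (r"(?i)(a|e|i|o|u|y)w", r"\1a"),  # W after a vowel becomes A
    (r"(?i)\B(e|i|o|u)", "a"),        # non-initial vowels become A
    (r"(?i)\B(q)", "g"),
    (r"(?i)s\b", ""),                 # drop a trailing S
]


def apply_rules(text: str) -> str:
    """Apply each rule once, in order, the way a filter chain would."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


print(apply_rules("plehanovskaja"))
print(apply_rules("macdonalds"))
```

Order matters: the H rule has to run before the vowel rule, otherwise "eh" would first turn into "ah" and only then lose its H, and later rules would see a different string.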
:
mysql> call keywords(' Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | lanan | lanan |
| 2 | lanan | lanan |
| 3 | lanan | lanan |
| 4 | lannan | lannan |
| 5 | lanan | lanan |
+------+-----------+------------+
mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+---------------+---------------+
| qpos | tokenized | normalized |
+------+---------------+---------------+
| 1 | plachanavscaj | plachanavscaj |
| 2 | plachanavscay | plachanavscay |
| 3 | plaanavscaj | plaanavscaj |
| 4 | plachanavsc | plachanavsc |
+------+---------------+---------------+
, CALL QSUGGEST Plehanovskaja, plaanavscaj:
mysql> CALL QSUGGEST('plaanavscaj', 'STREETS');
+---------------+----------+------+
| suggest | distance | docs |
+---------------+----------+------+
| paanarscaj | 2 | 1 |
| plachanavscaj | 2 | 1 |
| latavscaj | 3 | 1 |
| sladcavscaj | 3 | 1 |
| pacravscaj | 3 | 1 |
+---------------+----------+------+
. - .
paanarscaj →
plachanavscaj →
latavscaj →
sladcavscaj →
pacravscaj →
- , . - . , . , , .
Daitch-Mokotoff Soundex
, , Soundex.
. , « », , , - , , - .
, .
.
, .. :
regexp_filter = (?i)\b(au) => 0
regexp_filter = (?i)(a|e|i|o|u|y)(au) => \17
, \B ,
regexp_filter = (?i)au =>
– - :
regexp_filter = (?i)j => 1
:
mysql> call keywords(' Lenina Lennina Lenin', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | 866 | 866 |
| 2 | 866 | 866 |
| 3 | 866 | 866 |
| 4 | 8666 | 8666 |
| 5 | 866 | 866 |
+------+-----------+------------+
mysql> call keywords(' Plechanovskaya Plehanovskaja Plekhanovska', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | 7856745 | 7856745 |
| 2 | 7856745 | 7856745 |
| 3 | 786745 | 786745 |
| 4 | 7856745 | 7856745 |
+------+-----------+------------+
, QSUGGEST . .
mysql> select * from STREETS where match('Veelkaseem'); show meta;
+------+--------------------------------------+--------------+----------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+--------------+----------------------+
| 873 | abdb0221-bfe8-4cf8-9217-0ed40b2f6f10 | | 30 |
| 1208 | f1127b16-8a8e-4520-b1eb-6932654abdcd | | 50 |
+------+--------------------------------------+--------------+----------------------+
2 rows in set (0.00 sec)
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 2 |
| total_found | 2 |
| time | 0.000 |
| keyword[0] | 78546 |
| docs[0] | 2 |
| hits[0] | 2 |
+---------------+-------+
, , - .
Soundex, , Soundex NYSIIS, CALL QSUGGEST, Sphinx , NYSIIS -. Soundex Daitch-Mokotoff Soundex, , , , 1286 , , - . :
mysql> call keywords(' ', 'STREETS', 0);
+------+------------+------------+
| qpos | tokenized | normalized |
+------+------------+------------+
| 1 | vorovskogo | v612 |
| 2 | verbovaja | v612 |
+------+------------+------------+
Soundex, :
mysql> call keywords(' ', 'STREETS', 0);
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | v9234 | v9234 |
| 2 | v9124 | v9124 |
+------+-----------+------------+
, . , Soundex:
mysql> select * from STREETS where match('');
+------+--------------------------------------+-----------+--------------------------+
| id | aoguid | shortname | offname |
+------+--------------------------------------+-----------+--------------------------+
| 12 | 0278d3ee-4e17-4347-b128-33f8f62c59e0 | | |
+------+--------------------------------------+-----------+--------------------------+
.
QSUGGEST, . , . , – .
, , : Soundex . - , , - , , Sphinx.
, , , Soundex Daitch-Mokotoff - , . NYSIIS , , , .
sphinx-3.3.1, 2.1.1-beta, . Manticore. Manticore Search, . , , .
, . , .
P.S.
, . Metaphone . , , . :
-
????
PROFIT