Continuing to internationalize address search with Sphinx or Manticore: now with Metaphone

This is a continuation of the publication "Internationalization of City Address Search. Implementing the Russian-language Soundex in Sphinx Search", in which I showed how to add support for the Soundex family of phonetic algorithms to Sphinx Search for text written in Cyrillic. For Latin text, Soundex support is already available. The same is true of Metaphone: it works for the Latin alphabet but not for Cyrillic. We will try to correct this annoying fact with the help of transliteration, regular expressions and a bit of filing by hand.





This is a direct continuation, in which we will look at how to implement the original Metaphone, a Russian Metaphone (in the sense that no transliteration is needed) and Caverphone. Double Metaphone, however, we will not manage to implement.





The implementation is suitable for both Sphinx Search and Manticore Search platforms.





At the end, we will see how Metaphone perceives the "rakomakophone".





Docker image





I have prepared the Docker image tkachenkoivan/searchfonetic so that you can try out the result yourself. All indexes from this publication and from the previous one have been added to the image. Be aware, though, that the index names used in the previous publication do not match the names stored in the image. Why? Because good ideas come late.





The descriptions of the algorithms are, as before, taken from the publication "Phonetic algorithms". I will try to repeat its text as little as possible.





Original Metaphone

The implementation is elementary. First, regular expressions are created for the transliteration:





	regexp_filter = (А|а) => a
	regexp_filter = (Б|б) => b
	regexp_filter = (В|в) => v
	…
      
      



Then enable Metaphone:





morphology = metaphone
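
To make the mechanics concrete, here is a minimal Java sketch of what such a chain of regexp_filter rules does: every rule is a search-and-replace pass over the whole text, and the passes run in the order they are listed. This is not how Sphinx or Manticore implement it internally. The class name TranslitSketch and its method are hypothetical, and the sketch assumes that the three rules shown map the Cyrillic letters А, Б and В (in either case) to a, b and v. The Cyrillic letters are written as Unicode escapes so that the file compiles regardless of source encoding.

```java
// Hypothetical helper, not part of Sphinx or Manticore: it imitates how a
// chain of regexp_filter rules rewrites text before morphology is applied.
class TranslitSketch {
    // Pattern/replacement pairs, applied top to bottom like the config lines.
    // Only the three rules shown in the article; the remaining ones are elided there.
    private static final String[][] RULES = {
        {"(\u0410|\u0430)", "a"}, // Cyrillic letter A, upper or lower case
        {"(\u0411|\u0431)", "b"}, // Cyrillic letter Be
        {"(\u0412|\u0432)", "v"}, // Cyrillic letter Ve
    };

    static String apply(String text) {
        String result = text;
        for (String[] rule : RULES) {
            // Each rule is one full pass over the text.
            result = result.replaceAll(rule[0], rule[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // The argument spells "Baba" in Cyrillic letters.
        System.out.println(apply("\u0411\u0430\u0431\u0430")); // prints baba
    }
}
```

With the full rule set in place, the Metaphone morphology only ever sees Latin text, which is the input it was designed for.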
      
      



, Soundex. , , , Soundex , Soundex, – , .





, , , Metaphone + . .





Sphinx blend_chars. , Sphinx , , , , – , , , .., .. , , , , «&». «M&M’s» ? «&»? blend_chars



.





, blend_chars



:





blend_chars = U+0020
      
      



, - “ ”, , , . , , .





mysql> select * from metaphone where match('');
+------+--------------------------------------+-----------+---------------------------+
| id   | aoguid                               | shortname | offname                   |
+------+--------------------------------------+-----------+---------------------------+
| 1130 | e21aec85-0f63-4367-b9bb-1943b2b5a8fb |         |               |
+------+--------------------------------------+-----------+---------------------------+
      
      



, « », call keywords



:





mysql> call keywords ('мориса тореза', 'metaphone');
+------+---------------+------------+
| qpos | tokenized     | normalized |
+------+---------------+------------+
| 1    | morisa toreza | MRSTRS     |
| 1    | morisa        | MRS        |
| 2    | toreza        | TRS        |
+------+---------------+------------+
      
      



, : «morisa», «toreza» «morisa toreza», Metaphone, «».





Metaphone Sphinx Search. , . , , :





regexp_filter = [ ] => 
      
      



« », , , .





, , , .





Caverphone , .





mysql> call keywords ('мориса тореза', 'caverphone');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | mrsa trza | mrsa trza  |
| 1    | mrsa      | mrsa       |
| 2    | trza      | trza       |
+------+-----------+------------+

mysql> select * from caverphone where match('');
Empty set (0.00 sec)
      
      



Soundex ( ), Sphinx, , , , , «morisa» «toreza» , «morisa toreza» :





mysql> call keywords ('мориса тореза', 'simple_soundex');
+------+---------------+---------------+
| qpos | tokenized     | normalized    |
+------+---------------+---------------+
| 1    | morisa toreza | morisa toreza |
| 1    | morisa        | m620          |
| 2    | toreza        | t620          |
+------+---------------+---------------+
      
      



blend_chars



– , . metaphone. ( ) – : , .





.





Double Metaphone

Metaphone , , , .





, , Metaphone . , , , , DoubleMetaphone.java. , «C», , .





, , – , , , Sphinx Manticore.





, Metaphone . , . Sphinx . .





, , Java, Commons Codec. – , . , – , .





, , , . – .





, , :





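import org.apache.commons.codec.language.DoubleMetaphone;

DoubleMetaphone dm = new DoubleMetaphone();
String metaphone1 = dm.doubleMetaphone("Text", false); // primary encoding
String metaphone2 = dm.doubleMetaphone("Text", true);  // alternate encoding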
      
      



metaphone1



metaphone2



.





– .





, Commons Codec. , . Metaphone , , . , : , , .





Sphinx .





Russian Metaphone

.





. , . « », « Metaphone».





, , , .





, , . , « », «», «» , :





mysql> call keywords (' ', 'rus_metaphone');
+------+--------------+--------------+
| qpos | tokenized    | normalized   |
+------+--------------+--------------+
| 1    |        |        |
| 2    |         |         |
+------+--------------+--------------+
      
      



. , , GitHub Gist manticore.conf.





  • :





regexp_filter = (?i)(|||) => 
regexp_filter = (?i)(||) => 
regexp_filter = (?i)(||) => 
regexp_filter = (?i)() => 
      
      



  • , , , , , :





regexp_filter = (?i)()(||||||||||||||||) => \2
regexp_filter = (?i)()(||||||||||||||||) => \2
regexp_filter = (?i)()(||||||||||||||||) => \2
regexp_filter = (?i)()(||||||||||||||||) => \2
regexp_filter = (?i)()(||||||||||||||||) => \2
regexp_filter = (?i)()(||||||||||||||||) => \2
      
      



  • ,





regexp_filter = (?i)\b => 
regexp_filter = (?i)\b => 
regexp_filter = (?i)\b => 
regexp_filter = (?i)\b => 
regexp_filter = (?i)\b => 
regexp_filter = (?i)\b => 
      
      







regexp_filter = (?i)(||) => 
      
      



Caverphone

.





  • , :





regexp_filter = (A|a) => a
regexp_filter = (B|b) => b
…
      
      



, , , , .





  • e





regexp_filter = e\b =>
      
      



  • , , :





regexp_filter = \b(cough) => cou2f
regexp_filter = \b(rough) => rou2f
…
      
      







regexp_filter = (cq) => 2q
regexp_filter = (ci) => si
…
      
      



  • a, — 3





regexp_filter = (?i)\b(a|e|i|o|u|y) => A
# no (?i) here: a case-insensitive match would also turn the A marker into 3
regexp_filter = (a|e|i|o|u|y) => 3
      
      







regexp_filter = (j) => y
regexp_filter = \b(y3) => Y3
…

      
      



  • 2





regexp_filter = 2 => 
      
      



  • 3, A





regexp_filter = 3\b => A
      
      



  • 3





regexp_filter = 3 =>
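
The vowel-handling steps listed above can be traced outside the search engine with a small Java sketch. It is only an illustration under stated assumptions: the class CaverphoneVowelSketch is hypothetical, it covers just the vowel, 2 and 3 rules shown here (none of the elided ones), the vowel-to-3 rule is applied case-sensitively so that it does not overwrite the A marker, and the final toLowerCase stands in for the case folding that the index performs when tokenizing.

```java
// Hypothetical helper: replays only the vowel-related Caverphone rules
// from the config on text that has already been transliterated to lowercase.
class CaverphoneVowelSketch {
    // Pattern/replacement pairs, applied top to bottom.
    private static final String[][] RULES = {
        {"(?i)\\b(a|e|i|o|u|y)", "A"}, // vowel at the start of a word -> A
        {"(a|e|i|o|u|y)", "3"},        // lowercase vowels -> 3 (assumption: case-sensitive)
        {"2", ""},                     // drop every 2
        {"3\\b", "A"},                 // a 3 at the end of a word -> A
        {"3", ""},                     // drop every remaining 3
    };

    static String encode(String text) {
        String result = text;
        for (String[] rule : RULES) {
            result = result.replaceAll(rule[0], rule[1]);
        }
        // Stand-in for the tokenizer's case folding (assumption).
        return result.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(encode("morisa toreza")); // prints mrsa trza
    }
}
```

For "morisa toreza" the sketch returns "mrsa trza", the same keys that call keywords reports for the caverphone index.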
      
      



10 .





:





mysql> select * from caverphone where match ('');
+------+--------------------------------------+-----------+------------------+
| id   | aoguid                               | shortname | offname          |
+------+--------------------------------------+-----------+------------------+
|    5 | 01339f2b-6907-4cb8-919b-b71dbed23f06 |         |          |
|  387 | 4b919f60-7f5d-4b9e-99af-a7a02d344767 |         |            |
+------+--------------------------------------+-----------+------------------+
      
      



«» «». , , , Daitch Mokotoff Soundex - «»:





mysql> select * from daitch_mokotoff_soundex where match ('');
+------+--------------------------------------+-----------+--------------+
| id   | aoguid                               | shortname | offname      |
+------+--------------------------------------+-----------+--------------+
|  387 | 4b919f60-7f5d-4b9e-99af-a7a02d344767 |         |        |
|  541 | 69b8220e-a42d-4fec-a346-1df56370c363 |         |        |
+------+--------------------------------------+-----------+--------------+
      
      



:





mysql> call keywords ('  ', 'caverphone');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | lnna      | lnna       |
| 2    | lnna      | lnna       |
| 3    | lna       | lna        |
+------+-----------+------------+


mysql> call keywords ('  ', 'daitch_mokotoff_soundex');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | 866       | 866        |
| 2    | 8616      | 8616       |
| 3    | 866       | 866        |
+------+-----------+------------+
      
      



, , , - . , .





: .

, , . Just for fun.





, rock the microphone?! , Metaphone . !





-, blend_chars, rock the microphone, :





blend_chars = U+0020
      
      



- metaphone, .





keywords



Sphinx:





mysql> call keywords ('ракомакофон', 'metaphone');
+------+-------------+------------+
| qpos | tokenized   | normalized |
+------+-------------+------------+
| 1    | rakomakofon | RKMKFN     |
+------+-------------+------------+
      
      



rock the microphone:





mysql> call keywords ('rock the microphone', 'metaphone');
+------+---------------------+------------+
| qpos | tokenized           | normalized |
+------+---------------------+------------+
| 1    | rock the microphone | RK0MKRFN   |
| 1    | rock                | RK         |
| 2    | the                 | 0          |
| 3    | microphone          | MKRFN      |
+------+---------------------+------------+
      
      



RK0MKRFN, RKMKFN, 2(!). the , RKMKRFN:





mysql> call keywords ('rock microphone', 'metaphone');
+------+-----------------+------------+
| qpos | tokenized       | normalized |
+------+-----------------+------------+
| 1    | rock microphone | RKMKRFN    |
| 1    | rock            | RK         |
| 2    | microphone      | MKRFN      |
+------+-----------------+------------+
      
      



RKMKRFN RKMKFN, 1! .
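
The distances quoted here can be checked with a few lines of Java. The sketch below is a plain textbook Levenshtein distance (insertions, deletions and substitutions each cost 1), not code taken from Sphinx or Manticore. The class name LevenshteinSketch is hypothetical.

```java
// Hypothetical helper class; a textbook Levenshtein distance, not Sphinx code.
class LevenshteinSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous row
        int[] curr = new int[b.length() + 1]; // distances for the current row
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i characters of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("RK0MKRFN", "RKMKFN")); // prints 2
        System.out.println(distance("RKMKRFN", "RKMKFN"));  // prints 1
    }
}
```

Dropping the 0 produced by "the" removes one of the two edits, which is why the key without it is only one deletion (the extra R) away from RKMKFN.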





«the», stopwords , - blend_chars = U+0020



«the» . , 1, .





The hope that qsuggest would help did not come true: it gives no hints. Why? Notice that the output of call keywords has two columns, tokenized and normalized. qsuggest builds its hints from the tokenized column and measures the Levenshtein distance against it. It does not care that the distance in normalized is 1.

So the observation is amusing, but of no practical use.







