My phone rang. Who speaks? .. Will help "elephant"

Automatic identification of a client and his region by an incoming phone call has become an integral part of any developed HelpDesk or CRM system. You just need to be able to do it quickly - then a lot of opportunities appear.



For example, you can immediately show the manager from which city the call is coming from, tighten up the current price list and delivery conditions, display the caller's card, the last transactions with him, a specific contact person, ... - and a lot of useful things, as our VLSI CRM can do !





How to implement this functionality yourself? It turns out not that difficult. You can literally build and test a working model on the knee - you just need a bundle of Node.js and PostgreSQL.



Determine the region by number



Let's assume that the PBX sends us an incoming phone number, already normalized and formatted up to 10 digits (we will consider only calls within Russia). What is the most efficient way to understand where the call came from?



Collecting telephone codes



First, we need a database of telephone codes of Russia in relation to regions. To do this, you can use an official source - an up-to-date extract from the numbering plan on the website of the Federal Communications Agency.



But finding is not enough, you need to download and extract this data. A small script for Node.js using the request library will help us with this :



const async = require('async')
  , request = require('request');

const fs = require('fs');

let queue = [
  'ABC-3xx'
, 'ABC-4xx'
, 'ABC-8xx'
, 'DEF-9xx'
]
  .map(key => (
    {
      base : 'https://rossvyaz.gov.ru'
    , path : `/data/${key}.csv`
    }
  ));

let ranges = [];

async.doWhilst(
  cb => {
    //       
    let task = queue.shift();
    request(
      {
        url  : task.base + task.path
      , pool : false
      }
    , (err, res, body) => {
        //   CSV
        body.split('\n').forEach(line => {
          let tds = line.split(';');
          let place = tds[5].split('|');
          ranges.push([
            tds[0]
          , tds[1]
          , tds[2]
          , tds[4]
          , place[place.length - 1]
          , place[place.length - 2] && place[place.length - 2].startsWith('-') ? place[place.length - 2] : ''
          , place.length > 1
            ? place[0].startsWith('-')
              ? ''
              : place[0]
            : ''
          ]);
        });
        return cb(err);
      }
    );
  }
  // ,    
, cb => {
    return cb(null, queue.length);
  }
  //    -         
, err => {
    //    
    ranges.forEach(row => {
      //      
      let ln = row[0].length + row[1].length - 10;
      if (ln > 0) {
        let sfx = row[0].slice(-ln);
        if (row[1].startsWith(sfx) && row[2].startsWith(sfx)) {
          row[1] = row[1].slice(ln);
          row[2] = row[2].slice(ln);
        }
      }

      //   
      let pfx;
      for (let i = 1; i < row[1].length; i++) {
        if (row[2].startsWith(row[1].slice(0, i))) {
          pfx = row[1].slice(0, i);
        }
        else {
          break;
        }
      }
      if (pfx) {
        row[0] = row[0] + pfx;
        row[1] = row[1].slice(pfx.length);
        row[2] = row[2].slice(pfx.length);
      }
    });

    let sql = `
SET client_encoding = 'UTF-8';
CREATE TABLE phonecodes(
  code
    varchar
, numb
    varchar
, nume
    varchar
, oper
    varchar
, region
    varchar
, district
    varchar
, city
    varchar
);
COPY phonecodes FROM STDIN;
`;
    //  COPY-
    let copy = ranges.map(row => row.join('\t')).join('\n') + '\n\\.\n';

    fs.writeFileSync('phonecodes.sql', sql + copy);
  }
);


Now let's load it into our test base, and you can work:



psql -f phonecodes.sql -U postgres tst


If everything worked out as it should, almost 378 thousand ranges will be loaded into our table:



SET
CREATE TABLE
COPY 377937


Note that in our example, both the code and the boundary numbers of the range are represented by strings. Yes, they can be turned into integer/bigint, but we will not do this for now. Moreover, the incoming phone number does not always consist of numbers only - for example, some payphones can report their number with the "digit A".


"They are looking for firefighters, the police are looking for ..."



Let's try a naive query first:



WITH src AS (
  SELECT '4852262000' num --  
)
SELECT
  *
FROM
  src
, phonecodes
WHERE
  num LIKE (code || '%') AND --   
  num BETWEEN (code || numb) AND (code || nume) --    
LIMIT 1;




[look at explain.tensor.ru]



Almost 70 thousand lines were subtracted (and it was lucky that not all 380 lines!), almost 10MB of data was shoveled ... not very efficiently, but the result is achieved:



num        | code   | numb | nume | oper | region           | district | city
-----------------------------------------------------------------------------------
4852262000 | 485226 | 0000 | 9999 |   |  . |          | 


But let's get rid of it somehow Seq Scan! To do this, we just need an index that will help to search by LIKE, right? ..



Alas, no. If we need to search column LIKE (val || '%'), then prefix indices with varchar_pattern_ops will help us , but we have the opposite - val LIKE (column || '%'). And we get a situation close to the one that I described in the article "Classifying errors from PostgreSQL logs" .



We use knowledge of the applied field



A close one, but, fortunately, it is still much simpler - our data is fixed and there are relatively few of them. Moreover, the records are distributed rather sparsely by codes:



SELECT --     - 
  ranges
, count(*)
FROM
  (
    SELECT --     
      code
    , count(*) ranges
    FROM
      phonecodes
    GROUP BY
      1
  ) T
GROUP BY
  1
ORDER BY
  1 DESC;


Only about a hundred codes have 10 ranges, and almost a quarter have exactly one:



ranges | count
--------------
    10 |   121
     9 |   577
     8 |  1705
     7 |  3556
     6 |  6667
     5 | 10496
     4 | 12491
     3 | 20283
     2 | 22627
     1 | 84453


So let's only index the code for now. And since we need all the ranges of the same code all together CLUSTER, let's arrange our table with the help of so that the records are physically next to each other:



CREATE INDEX ON phonecodes(code);
CLUSTER phonecodes USING phonecodes_code_idx;


And now let's remember that our phone number consists of exactly (all!) 10 digits, among which we need to isolate the prefix code. That is, our task is calmly solved by a simple enumeration of no more than 10 options:



WITH RECURSIVE src AS (
  SELECT '4852262000' num
)
, T AS (
  SELECT
    num pfx --    ""   
  , NULL::phonecodes pc
  FROM
    src
UNION ALL
  SELECT
    substr(pfx, 1, length(pfx) - 1) -- ""  
  , (
      SELECT
        X
      FROM
        phonecodes X
      WHERE
        code = T.pfx AND --    
        (TABLE src) BETWEEN (code || numb) AND (code || nume) --    
      LIMIT 1
    ) pc
  FROM
    T
  WHERE
    pc IS NOT DISTINCT FROM NULL AND -- ,    
    length(pfx) > 2 -- ...      
)
SELECT
  (pc).* -- ""     
FROM
  T
WHERE
  pc IS DISTINCT FROM NULL;




[look at explain.tensor.ru]



It took us only 5 index calls to find the code we were looking for. The gain seems microscopic in absolute numbers, but we got a 150-fold reduction in load relative to the naive option! If your system has to process tens or hundreds of thousands of such requests per hour, the savings become very substantial!

And you can do even fewer iterations over the index - if all codes are pre-reduced to the classic form "from 3 to 5 digits". However, then the number of ranges in each code will increase, and filtering them can add problems.


int8range + GiST



As correctly noted in the comments miksir, since we have all pairs "code + range" and the incoming number have strictly the same dimension of 10 digits, then the problem can be reduced to an interval search among numeric values.



To do this, we will create an index that will treat our records as int8range:



CREATE INDEX ON phonecodes USING gist(
  int8range(
    (code || numb)::bigint --   
  , (code || nume)::bigint --   
  , '[]' --   
  )
);


After that we can use it in the request:



WITH src AS (
  SELECT '4852262000'::bigint num
)
SELECT
  *
FROM
  phonecodes
WHERE
  int8range((code || numb)::bigint, (code || nume)::bigint, '[]') @> ( --  
    SELECT
      int8range(num, num, '[]') -- ""   
    FROM
      src
  )
LIMIT 1;




[look at explain.tensor.ru]



Non-overlapping intervals + btree



First, let's make sure our number ranges don't really overlap:



SELECT
  *
FROM
  phonecodes X
, phonecodes Y
WHERE
  int8range((X.code || X.numb)::bigint, (X.code || X.nume)::bigint, '[]') &&
  int8range((Y.code || Y.numb)::bigint, (Y.code || Y.nume)::bigint, '[]') AND
  X.ctid <> Y.ctid;


If you get "nothing" - everything is fine, and you can apply the following optimization: the number can only be included in the range, to the end (or beginning) of which it is closest .



To find the closest "beginning", we need a regular btree index:



CREATE INDEX ON phonecodes((code || numb));


WITH src AS (
  SELECT '4852262000' num
)
SELECT
  *
FROM
  src
, LATERAL (
    SELECT
      *
    FROM
      ( --     
        SELECT
          *
        FROM
          phonecodes
        WHERE
          (code || numb) <= src.num
        ORDER BY
          (code || numb) DESC
        LIMIT 1
      ) T
    WHERE
      src.num BETWEEN (code || numb) AND (code || nume) --  
  ) T;


Despite its apparent simplicity, this option gives worse performance than the previous one:





[look at explain.tensor.ru]



We identify the client by number



Now let's imagine that we already have a table with clients, where the "cleaned up" phone number is written - all brackets, hyphens, etc. are removed.



But here's a nuisance, not all of them have a city code - either managers are too lazy to score, or the PBX is so configured that it sends not full, but "intracity" numbers ... How then to find a client - after all, a full match search will no longer work?



PBX gives the full number



In this case, we will use the same "exhaustive" algorithm . Only we will “pinch off” the numbers not from the end of the number, but from the beginning.



If the number in the customer card was indicated in full, we will stumble upon it at the very first iteration. If not completely - when we "cut off" some of the appropriate codes.



Of course, we will need some kind of cross-checking by other details (address, TIN, ...) so that we do not get a situation that we “cut off” the Moscow code from the incoming number, and found a client from St. Petersburg.



PBX gives a "city" number



        :     262000
   : 4852262000


Here the situation is more interesting. We cannot "increment" every possible code to a short number and try to search - there are too many of them. Let's look at the situation from the other side - literally:



    reverse(262000) -> 000262
reverse(4852262000) -> 0002622584


It turns out that if you expand the lines with numbers, then the problem turns into a regular prefix search , which can be easily solved using an index with varchar_pattern_ops and LIKE!



CREATE INDEX ON client(reverse(phone) varchar_pattern_ops);


SELECT
  *
FROM
  client
WHERE
  reverse(phone) LIKE (reverse($1) || '%');


And then, again, we double-check the additional information - from which region the PBX sent us the number, which region the client belongs to.



All Articles