For example, you can immediately show the manager which city the call is coming from, pull up the current price list and delivery terms, display the caller's card, their latest deals, the right contact person, ... - and plenty of other useful things that our VLSI CRM can do!
How do you implement this functionality yourself? It turns out to be not that difficult. You can literally throw together and test a quick-and-dirty working model - all you need is Node.js plus PostgreSQL.
Determining the region from the number
Let's assume the PBX sends us the incoming phone number already normalized to exactly 10 digits (we will only consider calls within Russia). What is the most efficient way to figure out where the call came from?
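If in your setup the PBX hands over "raw" numbers instead, a rough sketch of such normalization right in PostgreSQL could look like this (the sample number is purely illustrative):
SELECT right(regexp_replace('+7 (4852) 26-20-00', '[^0-9]', '', 'g'), 10);
-- keep only the digits, then take the last 10 of them: 4852262000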
Collecting telephone codes
First, we need a database of Russian phone codes mapped to regions. For that we can use an official source: the up-to-date extract from the numbering plan published on the website of the Federal Communications Agency (Rossvyaz).
Finding it is not enough, though - we still need to download and parse this data. A small Node.js script using the request library will help us here:
const async = require('async')
    , request = require('request');
const fs = require('fs');

// CSV files with the number ranges published by Rossvyaz
let queue = [
  'ABC-3xx'
, 'ABC-4xx'
, 'ABC-8xx'
, 'DEF-9xx'
].map(key => ({
  base : 'https://rossvyaz.gov.ru'
, path : `/data/${key}.csv`
}));

let ranges = [];
async.doWhilst(
  cb => {
    // take the next file from the queue and download it
    let task = queue.shift();
    request(
      {
        url  : task.base + task.path
      , pool : false
      }
    , (err, res, body) => {
        // parse the CSV line by line
        (body || '').split('\n').forEach(line => {
          let tds = line.split(';');
          if (tds.length < 6) {
            return; // skip empty or malformed lines
          }
          let place = tds[5].split('|');
          ranges.push([
            tds[0]                    // code
          , tds[1]                    // start of the range
          , tds[2]                    // end of the range
          , tds[4]                    // operator
          , place[place.length - 1]   // region
          , place[place.length - 2] && place[place.length - 2].startsWith('-') ? place[place.length - 2] : '' // district
          , place.length > 1          // city
              ? place[0].startsWith('-')
                ? ''
                : place[0]
              : ''
          ]);
        });
        return cb(err);
      }
    );
  }
  // keep going while there are still files in the queue
, cb => {
    return cb(null, queue.length);
  }
  // once everything has been downloaded
, err => {
    ranges.forEach(row => {
      // if code + boundary is longer than 10 digits and the boundaries
      // duplicate the tail of the code, trim that duplicate off the boundaries
      let ln = row[0].length + row[1].length - 10;
      if (ln > 0) {
        let sfx = row[0].slice(-ln);
        if (row[1].startsWith(sfx) && row[2].startsWith(sfx)) {
          row[1] = row[1].slice(ln);
          row[2] = row[2].slice(ln);
        }
      }
      // move the common prefix of the range boundaries into the code
      let pfx;
      for (let i = 1; i < row[1].length; i++) {
        if (row[2].startsWith(row[1].slice(0, i))) {
          pfx = row[1].slice(0, i);
        }
        else {
          break;
        }
      }
      if (pfx) {
        row[0] = row[0] + pfx;
        row[1] = row[1].slice(pfx.length);
        row[2] = row[2].slice(pfx.length);
      }
    });
    let sql = `
SET client_encoding = 'UTF-8';
CREATE TABLE phonecodes(
  code
    varchar
, numb
    varchar
, nume
    varchar
, oper
    varchar
, region
    varchar
, district
    varchar
, city
    varchar
);
COPY phonecodes FROM STDIN;
`;
    // build a COPY-compatible data block and write everything to one file
    let copy = ranges.map(row => row.join('\t')).join('\n') + '\n\\.\n';
    fs.writeFileSync('phonecodes.sql', sql + copy);
  }
);
Now let's load it into our test database and get to work:
psql -f phonecodes.sql -U postgres tst
If everything worked out as it should, almost 378 thousand ranges will be loaded into our table:
SET
CREATE TABLE
COPY 377937
Note that in our example both the code and the boundary numbers of the range are stored as strings. Yes, they could be converted to integer/bigint, but we won't do that for now. Moreover, an incoming phone number does not always consist of digits only - for example, some payphones can report their number with the "digit A" in it.
"They are looking for firefighters, the police are looking for ..."
Let's try a naive query first:
WITH src AS (
  SELECT '4852262000' num -- the number we are looking for
)
SELECT
  *
FROM
  src
, phonecodes
WHERE
  num LIKE (code || '%') AND -- the code is a prefix of the number
  num BETWEEN (code || numb) AND (code || nume) -- the number falls within the range
LIMIT 1;
[look at explain.tensor.ru]
Almost 70 thousand rows were read (and we were lucky it wasn't all 378 thousand!), and almost 10MB of data was shoveled through... not very efficient, but the result is achieved:
    num     |  code  | numb | nume | oper |      region      | district | city
------------+--------+------+------+------+------------------+----------+------
 4852262000 | 485226 | 0000 | 9999 |      | Yaroslavl Oblast |          |
But let's somehow get rid of this Seq Scan! To do that, we just need an index that helps searching with LIKE, right?..
Alas, no. If we needed to search for column LIKE (val || '%'), then prefix indexes with varchar_pattern_ops would help us, but here it's the other way around: val LIKE (column || '%'). And we end up in a situation close to the one I described in the article "Classifying errors from PostgreSQL logs".
Using domain knowledge
Close, but fortunately much simpler: our data set is fixed and relatively small. Moreover, the records are distributed quite sparsely across the codes:
SELECT -- how many codes have a given number of ranges
  ranges
, count(*)
FROM
  (
    SELECT -- how many ranges each code has
      code
    , count(*) ranges
    FROM
      phonecodes
    GROUP BY
      1
  ) T
GROUP BY
  1
ORDER BY
  1 DESC;
Only about a hundred codes have 10 ranges each, while almost a quarter of all the ranges belong to codes with just a single range:
ranges | count
--------------
10 | 121
9 | 577
8 | 1705
7 | 3556
6 | 6667
5 | 10496
4 | 12491
3 | 20283
2 | 22627
1 | 84453
So let's index only the code for now. And since we always need all the ranges of one code together, let's rearrange the table with CLUSTER so that these records end up physically next to each other:
CREATE INDEX ON phonecodes(code);
CLUSTER phonecodes USING phonecodes_code_idx;
Now let's remember that our phone number consists of exactly 10 digits, from which we need to isolate the code prefix. So the task is easily solved by simply trying no more than 10 candidate prefixes:
WITH RECURSIVE src AS (
  SELECT '4852262000' num
)
, T AS (
  SELECT
    num pfx -- start from the "full" prefix
  , NULL::phonecodes pc
  FROM
    src
  UNION ALL
  SELECT
    substr(pfx, 1, length(pfx) - 1) -- "shorten" the prefix by one digit
  , (
      SELECT
        X
      FROM
        phonecodes X
      WHERE
        code = T.pfx AND -- the prefix matches the code
        (TABLE src) BETWEEN (code || numb) AND (code || nume) -- and the number falls within the range
      LIMIT 1
    ) pc
  FROM
    T
  WHERE
    pc IS NOT DISTINCT FROM NULL AND -- keep going while no record has been found
    length(pfx) > 2 -- ... and the prefix has not become too short
)
SELECT
  (pc).* -- "unwrap" the record we found
FROM
  T
WHERE
  pc IS DISTINCT FROM NULL;
[look at explain.tensor.ru]
We needed only 5 index lookups to find the code we were looking for. The gain looks microscopic in absolute numbers, but it is a 150-fold reduction in load compared to the naive version! If your system has to handle tens or hundreds of thousands of such requests per hour, the savings become quite substantial.
You can make even fewer passes over the index if all codes are first reduced to the classic "3 to 5 digits" form. However, the number of ranges per code then grows, and filtering them can create new problems.
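For illustration only, one rough way such a renormalization could look is below; this is a sketch of the idea, not something from the article, and it arbitrarily caps codes at 4 digits:
-- sketch: shorten every code longer than 4 digits by moving
-- the "extra" digits back into the range boundaries
UPDATE phonecodes SET
  numb = substr(code, 5) || numb
, nume = substr(code, 5) || nume
, code = substr(code, 1, 4)
WHERE length(code) > 4;
All SET expressions see the old row values, so the full numbers (code || numb) and (code || nume) stay the same - only the split point between code and boundary moves.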
int8range + GiST
As miksir correctly noted in the comments, since every "code + range boundary" pair and the incoming number all have exactly the same length of 10 digits, the problem can be reduced to an interval search over numeric values.
To do this, let's create an index that treats our records as int8range values:
CREATE INDEX ON phonecodes USING gist(
  int8range(
    (code || numb)::bigint -- start of the range
  , (code || nume)::bigint -- end of the range
  , '[]' -- both boundaries inclusive
  )
);
After that, we can use it in the query:
WITH src AS (
  SELECT '4852262000'::bigint num
)
SELECT
  *
FROM
  phonecodes
WHERE
  int8range((code || numb)::bigint, (code || nume)::bigint, '[]') @> ( -- the range contains ...
    SELECT
      int8range(num, num, '[]') -- ... the "point" range built from the number
    FROM
      src
  )
LIMIT 1;
[look at explain.tensor.ru]
Non-overlapping intervals + btree
First, let's make sure that our number ranges really don't overlap:
SELECT
  *
FROM
  phonecodes X
, phonecodes Y
WHERE
  int8range((X.code || X.numb)::bigint, (X.code || X.nume)::bigint, '[]') &&
  int8range((Y.code || Y.numb)::bigint, (Y.code || Y.nume)::bigint, '[]') AND
  X.ctid <> Y.ctid;
If the query returns nothing, everything is fine, and we can apply the following optimization: the number can only fall into the range whose start (or end) is the closest to it.
To find the closest "beginning", we need a regular btree index:
CREATE INDEX ON phonecodes((code || numb));

WITH src AS (
  SELECT '4852262000' num
)
SELECT
  *
FROM
  src
, LATERAL (
    SELECT
      *
    FROM
      ( -- the range with the closest "start"
        SELECT
          *
        FROM
          phonecodes
        WHERE
          (code || numb) <= src.num
        ORDER BY
          (code || numb) DESC
        LIMIT 1
      ) T
    WHERE
      src.num BETWEEN (code || numb) AND (code || nume) -- check that the number really falls within it
  ) T;
Despite its apparent simplicity, this option gives worse performance than the previous one:
[look at explain.tensor.ru]
Identifying the client by number
Now let's imagine that we already have a table of clients with a "cleaned up" phone number stored in it - all brackets, hyphens and the like removed.
But here's the catch: not all of the numbers include the city code - either the managers were too lazy to type it in, or the PBX is configured to send "intracity" numbers rather than full ones... How do we find the client then - after all, an exact-match search will no longer work?
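For the examples below, let's assume a minimal client table roughly like this (the structure and column names are purely illustrative, a real CRM schema will be richer):
CREATE TABLE client(
  id
    serial PRIMARY KEY
, name
    varchar
, phone
    varchar -- the "cleaned up" number: digits only, possibly without the city code
, region
    varchar -- used for cross-checking against the region detected from the code
);
-- for exact-match lookups on the stored number
CREATE INDEX ON client(phone);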
PBX gives the full number
In this case we use the same "brute-force" algorithm as above, except that we "pinch off" digits not from the end of the number but from its beginning.
If the number in the client card was entered in full, we will hit it on the very first iteration. If it was entered without the code - as soon as we have "cut off" one of the matching codes.
Of course, we will need some cross-checking against other details (address, TIN, ...), so that we don't end up "cutting off" the Moscow code from the incoming number and finding a client from St. Petersburg.
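A minimal sketch of such a lookup, mirroring the recursive query from the region search above; the client table and the minimum suffix length of 6 digits are assumptions made purely for illustration:
WITH RECURSIVE T AS (
  SELECT
    '4852262000'::varchar sfx -- the full number reported by the PBX
  , NULL::client c
  UNION ALL
  SELECT
    substr(sfx, 2) -- "pinch off" one digit from the beginning
  , (
      SELECT
        X
      FROM
        client X
      WHERE
        X.phone = T.sfx -- exact match against the stored number
      LIMIT 1
    ) c
  FROM
    T
  WHERE
    c IS NOT DISTINCT FROM NULL AND -- keep going while no client has been found
    length(sfx) >= 6 -- ... and the suffix is still at least as long as a "city" number (an assumption)
)
SELECT
  (c).*
FROM
  T
WHERE
  c IS DISTINCT FROM NULL;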
PBX gives a "city" number
Incoming number: 262000
Number in the client card: 4852262000
Here the situation is more interesting. We cannot prepend every possible code to the short number and try each variant - there are far too many of them. Let's look at the problem from the other end - literally:
reverse(262000) -> 000262
reverse(4852262000) -> 0002622584
It turns out that if we reverse the strings holding the numbers, the problem turns into an ordinary prefix search, which is easily solved with a varchar_pattern_ops index and LIKE!
CREATE INDEX ON client(reverse(phone) varchar_pattern_ops);

SELECT
  *
FROM
  client
WHERE
  reverse(phone) LIKE (reverse($1) || '%');
And then, once again, we cross-check against additional information: which region the PBX reported the number from, and which region the client belongs to.
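As a final illustration, that cross-check could look roughly like this, again assuming the illustrative client.region column from the sketch above:
-- sketch: keep only the candidates whose region matches the one
-- previously determined from the code of the incoming number
SELECT
  *
FROM
  client
WHERE
  reverse(phone) LIKE (reverse($1) || '%') AND
  region = $2; -- the region resolved from the phone code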