Immutable Trie: find something, I don't know what, but fast, and don't litter

A lot has been written about the prefix tree ( Trie ), including on Habré . Here's an example of how it might look:





And even implementations in the code, including JavaScript, there are many for it - from the "canonical" by John Resig and various optimized versions to a series of modules in NPM .



Why did we need to use it for a service for collecting and analyzing PostgreSQL plans , and even “cycling” some new implementation? ..



Glued logs



Let's take a look at a small piece of the PostgreSQL server log:



2020-09-11 14:49:53.281 MSK [80927:619/4255507] [explain.tensor.ru] 10.76.182.154(59933) pgAdmin III - ???????????????????? ???????????????? LOG:  duration: 0.016 ms  plan:
	Query Text: explain analyze
	SELECT
		*
	FROM
		pg_class
	WHERE
		relname = '
	
	';
	Index Scan using pg_class_relname_nsp_index on pg_class  (cost=0.29..2.54 rows=1 width=265) (actual time=0.014..0.014 rows=0 loops=1)
	  Index Cond: (relname = '
	
	'::name)
	  Buffers: shared hit=2


We can use the format set by the log_line_prefix variable to identify and trim the header line that starts with a date :



SHOW log_line_prefix;
-- "%m [%p:%v] [%d] %r %a "


It takes quite a bit of regex magic
const reTS = "\\d{4}(?:-\\d{2}){2} \\d{2}(?::\\d{2}){2}"
  , reTSMS = reTS + "\\.\\d{3}"
  , reTZ   = "(?:[A-Za-z]{3,5}|GMT[+\\-]\\d{1,2})";

var re = {
// : log_line_prefix
      '%a' : "(?:[\\x20-\\x7F]{0,63})"
    , '%u' : "(?:[\\x20-\\x7F]{0,63})"
    , '%d' : "[\\x20-\\x7F]{0,63}?"
    , '%r' : "(?:(?:\\d{1,3}(?:\\.\\d{1,3}){3}|[\\-\\.\\_a-z0-9])\\(\\d{1,5}\\)|\\[local\\]|)"
    , '%h' : "(?:(?:\\d{1,3}(?:\\.\\d{1,3}){3}|[\\-\\.\\_a-z0-9])|\\[local\\]|)"
    , '%p' : "\\d{1,5}"
    , '%t' : reTS + ' ' + reTZ
    , '%m' : reTSMS + ' ' + reTZ
    , '%i' : "(?:SET|SELECT|DO|INSERT|UPDATE|DELETE|COPY|COMMIT|startup|idle|idle in transaction|streaming [0-9a-f]{1,8}\/[0-9a-f]{1,8}|)(?: waiting)?"
    , '%e' : "[0-9a-z]{5}"
    , '%c' : "[0-9a-f]{1,8}\\.[0-9a-f]{1,8}"
    , '%l' : "\\d+"
    , '%s' : "\\d{4}(?:-\\d{2}){2} \\d{2}(?::\\d{2}){2} [A-Z]{3}"
    , '%v' : "(?:\\d{1,9}\\/\\d{1,9}|)"
    , '%x' : "\\d+"
    , '%q' : ""
    , '%%' : "%"
// : log_min_messages
    , '%!' : "(?:DEBUG[1-5]|INFO|NOTICE|WARNING|ERROR|LOG|FATAL|PANIC)"
// : log_error_verbosity
    , '%@' : "(?:DETAIL|HINT|QUERY|CONTEXT|LOCATION|STATEMENT)"
    };
re['%#'] = "(?:" + re['%!'] + "|" + re['%@'] + ")";

//  log_line_prefix  RegExp   
let lre = self.settings['log_line_prefix'].replace(/([\[\]\(\)\{\}\|\?\$\\])/g, "\\\$1") + '%#:  ';
self.tokens = lre.match(new RegExp('(' + Object.keys(re).join('|') + ')', 'g'));

let matcher = self.tokens.reduce((str, token) => str.replace(token, '(' + re[token] + ')'), lre);
self.matcher = new RegExp('^' + matcher, '');


But then we have a request along with a plan - and how to understand where one ends and the other begins? ..



It would seem that the plan is a textual representation of a tree , so there should be one "root"? That is, the first line from the bottom with the minimum indentation (omitting, Trigger ...) - the desired root and the beginning of the plan?



Unfortunately no. In our example, such a string will be the "tail" '::name)from the split into parts multiline string. How to be?



Use Trie, Luke!



But note that the plan must start from one of the nodes: Seq Scan, Index Scan, Sort, Aggregate, ...- neither more nor less, but 133 different options, excluding CTE, InitPlan SubPlanthose that cannot be root.



In fact, we do not know which of the nodes we know is at the beginning of this line (and if it is at all), but we want to find it. This is where the prefix tree will help us .



Immutable Trie



But our tree that we want to build has several features:



  • compactness

    We have tens / hundreds of possible elements of quite limited length, so there cannot be a situation of a large number of very long almost identical keys that differ only in the last character. The longest of our keys is probably 'Parallel Index Only Scan Backward'.


  • . .


  • . , .
  • -

    , «» Garbage Collector'.


The last requirement is due to the fact that the analysis of logs on our collectors is carried out without interruptions, in streaming mode. And the less we can "litter", the more resources we will direct to useful activity instead of cleaning up after ourselves.



Two useful functions will help us with this:





Building a map



Let's look at an example of how you can build a map in order to quickly find the elements you need from the original set using these two operations: Hmm ... they have the same “In” prefix !



Insert

Index Scan

Index Scan Backward

Index Only Scan

Index Only Scan Backward








//      Longest Common Prefix
let len, lcp;
for (let key of keys) {
  //  
  if (lcp === undefined) {
    len = key.length;
    lcp = key.slice(off);
    continue;
  }

  len = Math.min(len, key.length);
  // ,   "" LCP      
  if (lcp == '' || key.startsWith(lcp, off)) {
    continue;
  }
  //  LCP   
  for (let i = 0; i < lcp.length; i++) {
    if (lcp.charCodeAt(i) != key.charCodeAt(off + i)) {
      lcp = lcp.slice(0, i);
      break;
    }
  }
}


And since it is the same, then by checking its symbols, we will not be able to get new information in any way - which means that we only need to check the symbols that go further, up to the length of the shortest element . They will help us to split all the elements into several groups: In this case, it didn’t matter which symbol we take for splitting (3rd or 5th, for example) - the composition of the groups will still be the same, so the exact same combination of splitting into groups is repeated there is no need to process :



Insert

Index Scan

Index Scan Backward

Index Only Scan

Index Only Scan Backward








//      
let grp = new Set();
res.pos = {};
for (let i = off + lcp.length; i < len; i++) {
  //      [i]-
  let chr = keys.reduce((rv, key) => {
    if (key.length < i) {
      return rv;
    }
    let ch = key.charCodeAt(i);
    rv[ch] = rv[ch] || [];
    rv[ch].push(key);
    return rv;
  }, {});

  //          
  let cmb = Object.values(chr)
    .map(seg => seg.join('\t'))
    .sort()
    .join('\n');
  if (grp.has(cmb)) {
    continue;
  }
  else {
    grp.add(cmb);
  }

  res.pos[i] = chr;
}


Optimum metric



It remains only to understand - and if the groups were different on the 3rd and 5th symbols - which of these tree branches should you choose? To do this, we introduce a metric that will give us the answer to this question - the number of single character comparisons to find each of the keys.



Here we neglect the fact that some of the nodes are found in reality in the plans much more often than others, and we consider them equal.



, 3- 's', startsWith, , 6 , , Insert.

: 1 (.charCodeAt(2)) + 6 (.startsWith('Insert')) = 7 .



'd', 7-, , 'O' 'S'. — 'Index Scan Backward' (+19 ) 'Index Scan' (+10 ).



, 'Index Scan Backward', 19 , 'Index Scan' — 19 + 10 = 29.

: 1 (.charCodeAt(2)) + 1 (.charCodeAt(6)) + 19 + 29 (.startsWith(...)) = 50 .



As a result, for our example, the optimal map will look like this:



{
  '$pos' : 2 //  3- 
, '$chr' : Map {
    100 => {         // 'd'
      '$pos' : 6 //  7- 
    , '$chr' : Map {
        79 => [ 'Index Only Scan Backward', 'Index Only Scan' ] // 'O'
      , 83 => [ 'Index Scan Backward', 'Index Scan' ]           // 'S'
      }
    }
  , 115 => 'Insert' // 's'
  }
}


Vzhuh!



Now all that remains is to bring everything together, add a search function, some optimizations - and you can use:



//   
const fill = (obj, off, hash) => {
  off = off || 0;
  hash = hash || {};

  let keys = obj.src;

  //        
  let H = keys.join('\n');
  hash[off] = hash[off] || {};
  if (hash[off][H]) {
    obj.res = hash[off][H];
    return;
  }
  obj.res = {};
  hash[off][H] = obj.res;

  let res = obj.res;

  //    -     
  if (keys.length == 1) {
    res.lst = [...keys];
    res.cmp = res.lst[0].length;
    return;
  }

  //      Longest Common Prefix
  let len, lcp;
  for (let key of keys) {
    //  
    if (lcp == undefined) {
      len = key.length;
      lcp = key.slice(off);
      continue;
    }

    len = Math.min(len, key.length);
    // ,   "" LCP      
    if (lcp == '' || key.startsWith(lcp, off)) {
      continue;
    }
    //  LCP   
    for (let i = 0; i < lcp.length; i++) {
      if (lcp.charCodeAt(i) != key.charCodeAt(off + i)) {
        lcp = lcp.slice(0, i);
        break;
      }
    }
  }

  //       
  if (off + lcp.length == len) {
    let cmp = 0;
    //       - 
    if (keys.length == 2) {
      res.lst = [...keys];
    }
    //  " "   
    else {
      res.src = keys.filter(key => key.length > off + lcp.length);
      res.lst = keys.filter(key => key.length <= off + lcp.length);
    }

    //    ,     ,   
    res.lst.sort((x, y) => y.length - x.length); // s.length DESC
    cmp += res.lst.reduce((rv, key, idx, keys) => rv + (keys.length - idx + 1) * key.length, 0);

    //    -  
    if (res.src && res.src.length) {
      fill(res, off + lcp.length + 1, hash);
      cmp += res.res.cmp;
    }
    res.cmp = cmp + 1;
    return;
  }

  //      
  let grp = new Set();
  res.pos = {};
  for (let i = off + lcp.length; i < len; i++) {
    //      [i]-
    let chr = keys.reduce((rv, key) => {
      if (key.length < i) {
        return rv;
      }
      let ch = key.charCodeAt(i);
      rv[ch] = rv[ch] || [];
      rv[ch].push(key);
      return rv;
    }, {});

    //          
    let cmb = Object.values(chr)
      .map(seg => seg.join('\t'))
      .sort()
      .join('\n');
    if (grp.has(cmb)) {
      continue;
    }
    else {
      grp.add(cmb);
    }

    let fl = true;
    let cmp = 0;
    for (let [ch, keys] of Object.entries(chr)) {
      //     
      if (keys.length == 1) {
        let key = keys[0];
        chr[ch] = key;
        cmp += key.length;
      }
      //       
      else {
        fl = false;
        chr[ch] = {src : keys};
        fill(chr[ch], i + 1, hash);
        cmp += chr[ch].res.cmp;
      }
    }

    res.pos[i] = {
      chr
    , cmp
    };

    //   "" 
    if (res.cmp === undefined || cmp + 1 < res.cmp) {
      res.cmp = cmp + 1;
      res.bst = i;
    }

    //       ,      
    if (fl) {
      res.bst = i;
      for (let j = off; j < i; j++) {
        delete res.pos[j];
      }
      break;
    }
  }
};

//      
const comp = obj => {
  //   
  delete obj.src;
  delete obj.cmp;
  if (obj.res) {
    let res = obj.res;
    if (res.pos !== undefined) {
      //      
      obj.$pos = res.bst;
      let $chr = res.pos[res.bst].chr;
      Object.entries($chr).forEach(([key, val]) => {
        //   
        comp(val);
        //      - ""    
        let keys = Object.keys(val);
        if (keys.length == 1 && keys[0] == '$lst') {
          $chr[key] = val.$lst;
        }
      });
      //    -  Map  -
      obj.$chr = new Map(Object.entries($chr).map(([key, val]) => [Number(key), val]));
    }
    //    ""     
    if (res.lst !== undefined) {
      obj.$lst = res.lst;
      delete res.lst;
      if (res.res !== undefined) {
        comp(res);
        Object.assign(obj, res);
      }
    }
    delete obj.res;
  }
};

//    -     
const find = (str, off, map) => {
  let curr = map;
  do {
    //    
    let $pos = curr.$pos;
    if ($pos !== undefined) {
      let next = curr.$chr.get(str.charCodeAt(off + $pos));
      if (typeof next === 'string') {   //  
        if (str.startsWith(next, off)) {
          return next;
        }
      }
      else if (next instanceof Array) { //    
        for (let key of next) {
          if (str.startsWith(key, off)) {
            return key;
          }
        }
      }
      else if (next !== undefined) {    //  map,  
        curr = next;
        continue;
      }
    }
    //    ,   
    if (curr.$lst) {
      for (let key of curr.$lst) {
        if (str.startsWith(key, off)) {
          return key;
        }
      }
    }
    return;
  }
  while (true);
};

function ImmutableTrie(keys) {
  this.map = {src : keys.sort((x, y) => x < y ? -1 : +1)};
  fill(this.map);
  comp(this.map);
}

const p = ImmutableTrie.prototype;

p.get = function(line, off) {
  return find(line, off || 0, this.map);
};

p.has = function(line, off) {
  return this.get(line, off) !== undefined;
};

module.exports = ImmutableTrie;


As you can see, when searching in such an Immutable Trie, no animal was harmed , no new objects are created in memory, for which the GC would then like to come.



From the bonuses: now we can get the desired prefix without having to do it .sliceon the original line, even if we know that at the very beginning it, traditionally for the plan, has something extraneous:



const nodeIT = new ImmutableTrie(...);
nodeIT.get('  ->  Parallel Seq Scan on abc', 6); // 'Parallel Seq Scan'


Well, when we have already decided where the plan begins, in exactly the same way (but with the help of Trie attribute names), we define the lines that are the beginning of the node attribute, and which are the continuation of the multiline string and "glue" them:



Index Scan using pg_class_relname_nsp_index on pg_class  (cost=0.29..2.54 rows=1 width=265) (actual time=0.014..0.014 rows=0 loops=1)
  Index Cond: (relname = '\n\n'::name)
  Buffers: shared hit=2


Well, in this form, it is much easier to disassemble it.



All Articles