Problem
We all have days at work when someone shows up with a requirement that looks impossible and would take a miracle to fulfill. Mine came when a marketing colleague asked a seemingly simple question: could I pull the data from one table for a certain month a couple of years back? Nothing wrong with that, you might say, but I vaguely remembered that the table was very large. It had a datetime column with the creation time, but was there an index on that column?
Of course, they also wanted the data fast. As usual, I said, “I'll see what I can do,” and went to take a closer look at the table in question. As luck would have it, the index didn't exist and the table was huge. Not a problem, we can just scan the table, right? Wrong. If I've learned anything from years of working with databases, it's that size matters. A table with hundreds of millions of records and a few integer columns would be formidable enough. Add several varchar and datetime columns, and now it's a real challenge.
The table I was looking at had a billion records, literally. A simple table scan could easily take more than a day, and I still needed to process the extracted data. Besides, scanning a table of this size is not as harmless for the overall health of the system as it might seem at first glance. All the scanned pages have to be read from disk into SQL Server's memory, filling it up. That, in turn, evicts data pages from the cache that are needed for current operational workloads. Your users' queries may wait longer while the server rereads data from disk instead of quickly reusing pages already in memory. Performance degrades, and in the worst case the server can be brought to its knees. So when scanning a huge table you should use a special technique: run it in small batches, keeping the current position and intermediate results somewhere. That gives the server room to breathe and avoids a large performance drop.
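As an illustration, here is a minimal sketch of such a batched scan, assuming a hypothetical table TableX with an ever-increasing primary key Id; the batch size and all names are made up for this example.

-- a minimal sketch of a batched scan (table/column names are hypothetical)
-- rows are read in primary-key order, 100,000 at a time, and the last
-- processed Id is remembered so the next iteration continues from it
declare @lastId bigint = 0
declare @rows int = 1
declare @batch table (Id bigint primary key, created_dt datetime)
while @rows > 0
begin
delete from @batch
insert @batch (Id, created_dt)
select top (100000) Id, created_dt
from TableX
where Id > @lastId
order by Id
set @rows = @@rowcount
if @rows > 0
begin
-- ... process the batch and store intermediate results here ...
select @lastId = max(Id) from @batch
end
end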
And so I was thinking through the plan and estimating how long the script would take to write and run when it occurred to me that the creation datetime values were correlated with the table ID. The ID was a column of ever-increasing values, and the primary key was built on it. When fetching by ID, the creation datetime values naturally came out sorted the same way as the ID values. Wait, I thought, I can use something much faster than a table scan, something that in the worst case would require only about 30 steps instead of 500,000,000. Binary search!
Search
For everyone who, like me, finished their studies a long time ago, a refresher on the theory is in order. Binary search is a simple but extremely efficient way to find a value in a sorted array (in our case, a table column). The array must be pre-sorted and finite. The search is done by comparing the middle element of the array with the target. If they are equal, the search is over. If not, you discard the half in which the target obviously cannot be, and repeat the previous step on the remaining half: find the middle, compare it with the target, and if they are not equal, discard half again, and so on. The total number of steps needed to find the target in the worst case, when you find it only at the very last step, is log2(N), where N is the number of elements. Compare that with roughly N/2 when you simply iterate over the array. Strictly speaking, the worst case of a linear scan is N, but if we could guess which end the target was closer to, we could start from that end and reduce the expected number of steps to about N/2.
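To make the steps concrete, here is a small self-contained sketch of binary search in T-SQL over a sorted table variable; the numbers are invented purely for illustration.

-- binary search over a small sorted set (illustrative values only)
declare @a table (i int primary key, v int)   -- i = position, v = sorted value
insert @a (i, v) values (1, 2), (2, 5), (3, 8), (4, 13), (5, 21), (6, 34), (7, 55)
declare @target int = 21
declare @l int = 1, @r int = 7, @m int, @v int
while @l < @r
begin
-- compare the middle element with the target and discard half of the range
set @m = @l + (@r - @l) / 2
select @v = v from @a where i = @m
if @v < @target
set @l = @m + 1
else
set @r = @m
end
-- position 5, value 21: found in 3 comparisons instead of up to 7 with a linear scan
select [position] = @l, [value] = (select v from @a where i = @l)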
In my case, N = 1,000,000,000, which is where the two numbers above come from: 30 and 500,000,000. The ID would play the role of the array index, and the creation datetime would be the value of the array element. There is one caveat, though. An array index, by its very definition, is a contiguous sequence of integers, while table IDs can easily contain gaps due to deleted records or ID overflow. So after computing the middle simply by halving the ID range, you should not expect that a record with exactly that ID exists. Instead of a direct lookup, I had to use top(). Something like this:
Select top(1) * from Table where id <= @id order by id desc
I used “<=” and “desc” because I wanted to find the value that is either equal to or immediately before the target. Given that @l and @r are the left and right borders respectively, @id is the middle, @dt is the creation datetime found, @tdt is the target and @idr is the real table ID, the algorithm may look like this:
while @l < @r
begin
-- the middle of the current ID range
set @id = @l + floor((@r - @l)/2)
-- the record at or immediately before the middle
select top(1) @idr = id, @dt = creation_datetime from Table where id <= @id order by id desc
-- discard the half that cannot contain the target
if (@dt < @tdt)
set @l = @id + 1
else
set @r = @id
end
The last @idr found would be the ID of the record I was after, or of the one right next to it. The algorithm has a kind of "left" bias, that is, a tendency to pick the leftmost of the candidate values. Since I wanted the record whose value was as close to the target as possible, I also checked its nearest left and right neighbors in the table to see whether one of them was a better match. Note that I did not use the real ID from the table to drive the iteration: with gaps in the ID values, under some circumstances, that could send the algorithm into an infinite loop.
It took me about an hour to write and test the script. Using it, I found the first record with the required creation datetime. After that, I used a simple select with a where clause containing both conditions: id >= @idr and creation_datetime >= @dt1 and creation_datetime < @dt2. I only needed to make sure that the optimizer would choose the primary key rather than an index or a table scan. All in all, I was done in under 2 hours! It was surprising to rediscover that algorithms are not something esoteric buried deep inside SQL Server, but rather something that can easily be used in day-to-day work.
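For completeness, here is a hedged sketch of what that final query might look like, again with hypothetical names (@idr is the ID found by the script below, @dt1/@dt2 bound the requested month); the forceseek hint is just one way to nudge the optimizer toward a primary-key seek, not necessarily what was actually used.

-- extract the month of data starting from the record found by the binary search
-- (table/column names are hypothetical)
declare @idr bigint = 123456789               -- the ID returned by the search script
declare @dt1 datetime = '2020-01-01'
declare @dt2 datetime = '2020-02-01'
select Id, created_dt /*, other columns */
from TableX with (forceseek)                  -- encourage a seek on the clustered primary key
where Id >= @idr
and created_dt >= @dt1
and created_dt < @dt2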
Below is the entire script I wrote. Please see the comments inside to understand how to use it.
/*
The script finds the record whose creation datetime is closest to the target,
using binary search over the ID column (the datetime values are assumed to
grow together with the ID). Set the parameters below before running.
*/
-- debug flag: when set to 1, the intermediate values of each iteration are printed
declare @debug bit = 0
-- @Table - the name of the table to search in
declare @Table varchar(128) = 'TableX'
-- @ID - the name of the identifier (id) column
declare @ID varchar(128) = 'Id'
-- @DateTimeColumn - the name of the datetime column with the creation time
declare @DateTimeColumn varchar(128) = 'created_dt'
-- the target datetime value to search for
declare @TargetDateTime datetime = '2020-01-03 18:23:03'
declare @sql varchar(max)
set @sql = '
-- debug flag passed in from the outer script
declare @debug bit = <debug>
-- the target datetime value
declare @tdt datetime = ''<TargetDateTime>''
-- the middle of the current ID range (the probe value)
declare @id bigint
-- the left and right borders of the search range
declare @l bigint, @r bigint
-- the datetime value found and the real ID of the record it belongs to
declare @dt datetime, @idr bigint
-- initialize the left and right borders with the minimum and maximum IDs
select @l = min(<ID>), @r = max(<ID>) from <Table>
while @l < @r
begin
-- the middle of the current ID range
set @id = @l +floor((@r-@l)/2)
-- the record at or immediately before the middle
select top(1) @dt = <DateTimeColumn>, @idr = <ID> from <Table> where <ID> <= @id order by <ID> desc
-- in debug mode, print the intermediate values of this iteration
if( @debug = 1 )
begin
select ''@id'' = @id, ''@l'' = @l, ''@r'' = @r, ''@dt'' = @dt, ''@idr'' = @idr
end
if(@dt < @tdt)
set @l = @id +1
else
set @r = @id
end
-- check the nearest left and right neighbors to see whether one of them is closer to the target
declare @t table(id bigint, dt datetime, dlt float)
insert @t(id, dt, dlt)
select top(1) <ID>, <DateTimeColumn>, abs(cast(<DateTimeColumn> as float) -cast(@tdt as float))
from <Table>
where <ID> < @idr
order by <ID> desc
insert @t(id, dt, dlt)
select top(1) <ID>, <DateTimeColumn>, abs(cast(<DateTimeColumn> as float) -cast(@tdt as float))
from <Table>
where <ID> > @idr
order by <ID>
insert @t(id, dt, dlt)
select @idr, @dt, abs(cast(@dt as float) -cast(@tdt as float))
select top(1) @dt = dt, @idr = id from @t order by dlt, id
select ''Found Value'' = @dt, ''Record Id'' = @idr
'
set @sql = replace(
replace(
replace(
replace(
replace(@sql, '<ID>', @ID),
'<Table>', @Table),
'<TargetDateTime>', convert(varchar(50), @TargetDateTime, 121)),
'<debug>', @debug),
'<DateTimeColumn>', @DateTimeColumn)
exec (@sql)