How we scanned the entire Internet

Hello everyone! My name is Alexander and I write code for 2ip.ru. I am behind a good half of its services, so feel free to blame me for them; I am ready to fight back. Today I want to tell you a little about the rework of one of our old services. It is certainly not "big data", but still a fairly large amount of information, so I think it will be interesting.





We are going to talk about Sites on one IP, which, as you can guess, lets you find out all the domains hosted on a single IP address. It is quite handy for seeing who has latched onto your server (yes, such people exist), or onto someone else's (shared hosting, for example).





How did it work before? We queried Bing from a large pool of addresses with a special search query and parsed the results. Yes, the solution was so-so, but it was what we had. Then Bing tightened the screws, and we decided to do everything properly.





Our own database

What if we just take and crawl the entire internet? In principle, that is not a problem, but we are not Google and do not have vast resources for crawling. Or do we?





There is a server with 12 cores and 64 GB of memory, and MySQL, PHP, Go and a pile of frameworks in the arsenal. Obviously, goroutines are the way to get good results here: Go is fast and needs minimal resources. The real question is on the database side: will plain old MySQL cope?





Let's try.





Making a prototype

Collecting all the domains ourselves is a thankless task, so we bought a database of 260 million domain records. There are quite a few services selling such datasets, and they cost pennies.





So, there is a 5 GB CSV file on my disk. All that remains is to write a mass resolver that reads it line by line and writes a "domain - IP address" pair to STDOUT.





The only question is performance: it has to be very, very fast, because we cannot wait a month for the result.





A few hours of work and my daemon was up and running. The main function came out something like this:





package main

import (
    "bufio"
    "log"
    "os"
    "sync"
)

func main() {
    file, err := os.Open("domains.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    // A buffered channel works as a semaphore: at most 500 lookups run at once.
    maxGoroutines := 500
    guard := make(chan struct{}, maxGoroutines)

    var wg sync.WaitGroup

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        guard <- struct{}{} // blocks once the limit is reached

        host := scanner.Text()
        wg.Add(1)
        go func(host string) {
            defer wg.Done()
            resolve(host) // resolves the domain and prints the "domain - IP" pair
            <-guard
        }(host)
    }

    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }

    wg.Wait() // let the last lookups finish before exiting
}

      
      



The file is read line by line, and the guard channel keeps at most 500 goroutines in flight at a time so as not to drown the 12 cores.





The resolve function is simple: it resolves a domain to an IP and writes the pair to STDOUT. It asks DNS for the A records; domains that do not resolve are skipped.
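
The article does not show resolve itself, so here is a minimal sketch of what it might look like, using the default system resolver from the standard net package (the function body, the comma-separated output format and the extra "fmt" and "net" imports are my assumptions):

// resolve looks up the A records for host and prints one
// "domain,IP" pair per address to STDOUT. A sketch of the idea,
// not the article's original code.
func resolve(host string) {
    ips, err := net.LookupIP(host)
    if err != nil {
        return // the domain simply does not resolve - skip it
    }
    for _, ip := range ips {
        if v4 := ip.To4(); v4 != nil { // keep only IPv4, i.e. A records
            fmt.Printf("%s,%s\n", host, v4)
        }
    }
}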





Our own DNS

If you fire this many DNS queries from a single IP at someone else's resolver, you get banned pretty quickly. So we raised our own unbound in Docker.





It did not go smoothly, though. Under this load the DNS in Docker kept choking: lookups piled up and timed out, and we started looking for alternatives.





We also tried Google DNS: it is fast, but it throttles heavy clients, and 500 parallel lookups from one address is more than it will put up with.





unbound on localhost

In the end we dropped the extra layer and ran unbound directly on localhost. It holds 500 concurrent lookups without breaking a sweat. That turned out to be the right call.
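
Switching the Go side to that local resolver takes only a few lines. Here is a sketch of the idea; the address 127.0.0.1:53, the timeouts and the helper name lookupA are my assumptions:

package main

import (
    "context"
    "net"
    "time"
)

// localResolver sends queries to the unbound instance on localhost
// instead of whatever /etc/resolv.conf points at.
var localResolver = &net.Resolver{
    PreferGo: true,
    Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
        d := net.Dialer{Timeout: 2 * time.Second}
        return d.DialContext(ctx, "udp", "127.0.0.1:53")
    },
}

// lookupA returns only the IPv4 (A record) addresses for host.
func lookupA(host string) ([]net.IP, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    return localResolver.LookupIP(ctx, "ip4", host)
}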





I experimented with 1000 goroutines, but on 12 cores it only made things worse, so I went back to 500. The throughput settled at ~2000 resolved domains per second.





Of course, not everything resolves. Some zones turn out to be almost entirely dead, the TLD .bar for example, and such records simply drop out of the result.





I split the CSV into 10 parts, launched everything in tmux, and settled in to wait.





Done! Now all of this has to be put into a database.





The table is called domain_ip and holds exactly that pair: the domain and its IP. The only query that matters is selecting all domains for a given IP.





The IP goes into a BIGINT (the address converted to a number), the domain into a VARCHAR(255).





Let me remind you that there are about 260 million records. For selects by IP to finish in reasonable time, an index on the IP column is a must, but an index alone is not the whole story.
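
The article does not give the exact DDL; based on the description, the table could look roughly like this (a sketch: the engine, the index name and storing the address via INET_ATON are my assumptions):

-- A sketch of the table, not the exact DDL from the article.
CREATE TABLE domain_ip (
    ip     BIGINT       NOT NULL,  -- IPv4 address as a number, e.g. INET_ATON('8.8.8.8')
    domain VARCHAR(255) NOT NULL,
    KEY idx_ip (ip)                -- the workload is "all domains for a given IP"
) ENGINE=InnoDB;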





Even with an index, a select over 260 million records took around 20 seconds, which is no good at all. Partitioning to the rescue.





We partition the table by the IP column into ranges of 200 million each, about 20 partitions in total. Like this:





ALTER TABLE domain_ip PARTITION BY RANGE COLUMNS (ip)  (
    PARTITION p0 VALUES LESS THAN (200000000),
    PARTITION p1 VALUES LESS THAN (400000000),
    PARTITION p2 VALUES LESS THAN (600000000),
    PARTITION p3 VALUES LESS THAN (800000000),
    PARTITION p4 VALUES LESS THAN (1000000000),
    PARTITION p5 VALUES LESS THAN (1200000000),
    PARTITION p6 VALUES LESS THAN (1400000000),
    PARTITION p7 VALUES LESS THAN (1600000000),
    PARTITION p8 VALUES LESS THAN (1800000000),
    PARTITION p9 VALUES LESS THAN (2000000000),
    PARTITION p10 VALUES LESS THAN (2200000000),
    PARTITION p11 VALUES LESS THAN (2400000000),
    PARTITION p12 VALUES LESS THAN (2600000000),
    PARTITION p13 VALUES LESS THAN (2800000000),
    PARTITION p14 VALUES LESS THAN (3000000000),
    PARTITION p15 VALUES LESS THAN (3200000000),
    PARTITION p16 VALUES LESS THAN (3400000000),
    PARTITION p17 VALUES LESS THAN (3600000000),
    PARTITION p18 VALUES LESS THAN (3800000000),
    PARTITION p19 VALUES LESS THAN (4000000000),
    PARTITION p20 VALUES LESS THAN (MAXVALUE) 
);

      
      



The only thing left is to figure out how to pour the data in quickly, right?





Anyone who has worked with MySQL knows that loading large data dumps is a rather slow operation. Over the years I have not found anything better than importing from CSV. It looks like this:





LOAD DATA INFILE '/tmp/domains.csv' IGNORE 
INTO TABLE domain_ip
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n'

      
      



The machine digests the ~10 GB of CSV in 30 minutes.
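
If the CSV stores the addresses in dotted form rather than as numbers, the conversion can be done during the import itself. A sketch, assuming a domain,IP column order in the file:

LOAD DATA INFILE '/tmp/domains.csv' IGNORE
INTO TABLE domain_ip
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(domain, @ip)
SET ip = INET_ATON(@ip);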





The finale

The result is this nice little service. A lookup across ~300 million records comes back practically instantly on a server that is quite modest by today's standards, and it needs about 8 GB of RAM.





Now you can find out, for example, that humanity has pointed 8194 domains at the IP 8.8.8.8, or come up with your own queries...
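
Under the hood such a lookup boils down to a single query along these lines (assuming the address is stored as a number via INET_ATON):

SELECT domain
FROM domain_ip
WHERE ip = INET_ATON('8.8.8.8');

With the table partitioned by ip, MySQL only has to touch the one partition that covers the requested address.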





Thanks for your attention.







