How we scanned all domains on the Internet again

I'm sure you've seen headlines like “40% of sites use WordPress”, “10% of sites are on CloudFlare”, or “Most common PHP version XX”. Usually neither the way the sites were sampled nor the size of the sample is stated. Does almost half of the Internet actually run on WordPress?





Three years ago I published an article about how we analyzed the home pages of more than 250 million available domains.





At the beginning of 2021 we ran a new data collection, added technology detection and tracking-pixel detection, and improved the content and link analysis.





This article is an overview of the current state of the main indicators: how many sites are actually running, what percentage use HTTPS, and which version of PHP is currently dominant.





Under the hood

In general, parsing data is not the hardest task, and many programmers take it on without much enthusiasm.

What could be difficult? You take your favorite programming language, which for parsers and scrapers is often Python; a library for working with the network; something to parse HTML (or maybe you prefer regular expressions); a database to store everything (although CSV will also do); a reasonably powerful server; and off you go. I think it would take a mid-level developer up to a week (against an initial estimate of a day) to build a working prototype that can comfortably crawl 1-10 million pages.
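To make the "prototype in a week" idea concrete, here is a minimal sketch of such a scraper. It is an illustration only, not the crawler used for this study: it relies solely on the Python standard library, writes to CSV rather than a database, and the names `extract_title`, `fetch` and `crawl` are hypothetical.

```python
import csv
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def fetch(url, timeout=10):
    """Download a page and return (status, body as text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def crawl(domains, out, fetch_page=fetch):
    """Fetch each domain's home page and write one CSV row per domain."""
    writer = csv.writer(out)
    writer.writerow(["domain", "status", "title"])
    for domain in domains:
        try:
            status, body = fetch_page(f"http://{domain}/")
            writer.writerow([domain, status, extract_title(body)])
        except Exception as exc:  # failures are routine at crawl scale
            writer.writerow([domain, "error", type(exc).__name__])
```

Passing the download function in as `fetch_page` keeps the loop testable without network access. A sequential loop like this is adequate for a prototype, but it says nothing about the throughput needed for hundreds of millions of domains.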





. 250-260 , .

, , , IP , .. www 500 .

. - RPS ( , RPS, ). , , .

10-20 DNS . DNS, , .





- DNS . . DNS , . , IP , . DNS .





, IP, . IP 4 :

http://domain.com

http://www.domain.com

https://domain.com

https://www.domain.com
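The four variants listed above can be generated mechanically from a bare domain name. The following is a small sketch; the function name `candidate_urls` is hypothetical, and stripping a leading `www.` from the input is an assumption made here so that the function always returns exactly four URLs.

```python
def candidate_urls(domain):
    """Return the four home-page URLs to probe for a domain name."""
    host = domain.strip().lower().rstrip(".")
    # Assumption: treat "www.example.com" and "example.com" as the same input.
    if host.startswith("www."):
        host = host[4:]
    return [
        f"{scheme}://{prefix}{host}"
        for scheme in ("http", "https")
        for prefix in ("", "www.")
    ]


if __name__ == "__main__":
    for url in candidate_urls("domain.com"):
        print(url)
```

For `domain.com` this yields the same four URLs, in the same order, as the list above.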





IP . , IP IP.





, , , , .. , .





, , robots.txt, sitemap.xml .





Go - , . , .





, . , , , . , random? . Redis sets, + - , .





- SSD, . - () payload .





, Go 1-2 VPS, 5-10 EUR/mo . , , .





252 , 80 443 - 200 , 200 - 148 . .. .





, IP - 2018 13.2 , 2021 - 14.3 IP , A .





, site.com www.site.com https://site.com . .





, .. 4 ( www/non www, http/https)





HTTPS





- (), HTTPS. , https, 106/86 - 1 = 23%.





www or non-www?





, 10 , , www . www , , . : non-www www 50 , www non-www - 37 .





Server





server 143 286 , .





( /), :





- openresty, 4 , 67 . - , nginx, .





X-Powered-By





43 52 .





- PHP, :





Version 5.6 is still the single most common release, but the 7.x versions taken together are already ahead.
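The distinction between "most common single release" and "most common major version" comes down to how `X-Powered-By` values are grouped. Below is a minimal sketch of such a tally. The function name `tally_php_versions` and the regular expression are assumptions for illustration, not the code used for this study.

```python
import re
from collections import Counter

# Matches values such as "PHP/5.6.40" or "PHP/7.4.3-ubuntu".
PHP_VERSION = re.compile(r"PHP/(\d+)\.(\d+)", re.IGNORECASE)


def tally_php_versions(header_values):
    """Count PHP versions by minor release ("5.6") and by major version ("7")."""
    by_minor, by_major = Counter(), Counter()
    for value in header_values:
        match = PHP_VERSION.search(value or "")
        if not match:
            continue  # header missing, not PHP, or version hidden
        major, minor = match.groups()
        by_minor[f"{major}.{minor}"] += 1
        by_major[major] += 1
    return by_minor, by_major
```

Calling `by_minor.most_common(1)` gives the leading single release, while comparing the entries of `by_major` shows which major version wins in total. The two answers can differ, as they do here.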





wappalyzer. , . - html , url js, css .





, WordPress 23 148 200 = 15% . 55 295 200 = 18% .





The figure for CloudFlare is surprising.

At the same time, we see about 10 million hosts behind CloudFlare. Perhaps their statistics also count subdomains, which our database does not include.





Conclusion

Collecting and processing data on the Internet is a very exciting activity that makes you look for non-standard approaches, for queues for example. It also forces you to add every check you can think of, and some you would not think of, at each processing stage (for instance, against robots.txt files that are a gigabyte in size).





To be honest, I expected many more of the numbers to have changed in three years. In fact, the total number of domains is more or less stable, and so is the number of working sites. The population of the Earth appears to be growing faster than the number of sites on the Internet.





I would be glad to hear your comments and remarks.







