Content often needs to be collected in large volumes, and if the data is also needed at a certain frequency, such a task cannot be solved by hand. This is where special algorithms come to the rescue: based on the specified conditions, they collect information, structure it, and present it in the desired form.
Who needs to parse sites and why?
Parsing is mainly used by professionals to solve work tasks, since automation makes it possible to obtain a large amount of data at once, but it is also useful for solving individual, private tasks.
- Marketers collect information about sales volumes, identify shelf share, find out category demand and other indicators that allow predicting sales;
- Product managers collect information about changes in product metrics, conduct A/B tests, and measure statistical significance;
- Analysts monitor changes in competitors' prices;
- Developers fill online stores with wholesaler content and automatically update prices;
- SEO specialists check whether all metadata (H1, Title, Description) is filled in, analyze non-existent pages returning 404 errors, and identify keywords;
- Managers of manufacturing companies make sure that partners are not dumping prices, and obtain business indicators;
- For private purposes, you can build a collection of recipes, lessons, or any other information you want to keep for personal use.
Now that the purpose of parsing is clear, let's figure out what kinds of parsers exist and select a tool to solve your tasks. To do this, we will divide parsers into several groups and see what solutions are on the market.
Classification of programs and tools for parsing
By resource use
This is an important point: if the parser will be used for business tasks on a regular basis, you need to decide on which side the algorithm will run, on the provider's side or on yours. On the one hand, deploying a solution in-house will require a specialist to install and maintain the software and dedicated space on a server, and the program will consume server capacity, which is expensive. On the other hand, if you can afford it, such a solution may turn out to be cheaper (if the scale of data collection is truly industrial); you need to study the pricing plans.
There is also the matter of privacy: some companies' policies do not allow data to be stored on third-party servers, and here you need to look at the specific service. First, the data collected by the parser can be transmitted immediately via an API; second, this issue can be resolved with an additional clause in the agreement.
By access method
Remote solutions
This includes cloud programs (SaaS solutions), the main advantage of such solutions is that they are installed on a remote server and do not use the resources of your computer. You connect to the server through a browser (in this case, work with any OS is possible) or an application and take the data you need.
Cloud services, like all ready-made solutions in this article, do not guarantee that you will be able to parse any site. You may encounter a complex structure, site technology that the service does not “understand”, protection that turns out to be too tough, or the inability to interpret the data (for example, text displayed as images rather than as text).
Pros:
- Does not require installation on a computer;
- The data is stored remotely and does not take up your disk space; you download only the results you need;
- They can work with large amounts of data;
- Ability to work with API and subsequent automation of data visualization;
Cons:
- As a rule, more expensive than desktop solutions;
- Requires customization and maintenance;
- Inability to parse sites with complex protection and/or to interpret the data.
Let's consider popular services and working conditions.
Octoparse is one of the popular cloud services.
Service features:
- Visual interface for capturing data;
- No programming knowledge required;
- Works with dynamic site elements such as infinite scrolling, authorization windows, drop-down lists;
- Service language - English;
Cost, per month:
- The free plan allows you to collect up to 10,000 values and run 2 streams in parallel;
- Paid plans at $89 and $249 with different data parsing limits;
- Customizable plan for companies with individual requirements.
Scraper API is an API service with detailed documentation.
Service features:
- Automatic proxy rotation and retrying of failed requests;
- Captcha solving;
- Works through an API and requires coding skills;
- Service language - English;
An example of a GET request:
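A minimal sketch of such a request in Python, using the requests library (the endpoint parameters follow the service's public documentation; the API key and target URL below are placeholders):

```python
import requests

API_KEY = "YOUR_API_KEY"                      # placeholder: your Scraper API key
target_url = "https://example.com/product/1"  # placeholder: the page you want to scrape

# The service fetches the target page on your behalf, rotating proxies and
# retrying failed requests, and returns the resulting HTML.
response = requests.get(
    "http://api.scraperapi.com",
    params={"api_key": API_KEY, "url": target_url},
    timeout=60,
)

print(response.status_code)
print(response.text[:500])  # first 500 characters of the returned HTML
```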
Cost, per month:
- Free - 1000 API calls (up to 5 concurrent requests);
- Starter and Medium paid plans at $29 and $99, without proxy geo-targeting and without JavaScript support;
- Business plan with JavaScript support and extended data collection limits;
- A custom plan for companies with individual requirements.
ScrapingHub is a powerful cloud-based tool that includes a proxy rotation tool, a headless browser for parsing (requiring coding) and a data storage tool.
Service features:
- The service is a set of tools, and you can choose only the ones you need; the flip side of this flexibility is that each tool is paid for separately;
- API availability;
- Availability of video lessons for a quick start;
- The service language is English.
Proxy cost, per month:
- Demo access with 10,000 requests;
- $99 per month for 200,000 requests and $349 for 2.5 million requests;
- Unlimited service starts at $999.
Cost of cloud storage for data, per month:
- The free plan limits data storage to 7 days and scanning time to 1 hour;
- Paid plan $9.
Browser for parsing, per month:
- $25 / $50 / $100 for browser access on servers with different capacities.
The cost of a custom service for individual requests is calculated individually.
Mozenda is a popular service that allows you to work in the cloud and on a local machine, has an interface for visual data capture without programming knowledge.
Service features:
- The ability to return money if you cannot collect the necessary data using the service;
- Good tech support;
- Ability to parse without programming knowledge;
- API availability;
- Integration with various services, trackers, and BI systems;
- The service language is English.
Cost, per month:
- Free plan for 30 days;
- Paid plans from $250 to $450 with a different set of included services;
- Customizable plan for companies with individual requirements.
ScrapingBee is a service that provides the ability to parse data through a headless browser; it requires programming knowledge.
Service features:
- Automatic proxy change in case of blocking;
- API availability;
- Ability to work with Javascript;
- No fee will be charged if the parser fails to receive the data;
- The service language is English.
Cost, per month:
- The free plan includes 1000 API calls;
- $29, includes 250,000 requests, proxies, no API;
- $99, includes 1,000,000 requests, proxies, and an API;
- Customizable plan for companies with individual requirements.
Desktop solutions (parsing programs)
Such programs are installed on a computer. They are used for irregular and non-resource-intensive tasks. Many allow you to customize data collection parameters visually.
Pros:
- Always at hand, especially if installed on a laptop;
- They often have a visual programming interface.
Cons:
- Consume computer resources (computing power, disk space);
- They work only on the OS they are written for;
- There is no guarantee that the program will be able to collect the required data or page through listings;
- You often need to find your own proxy addresses to bypass site protection.
ParseHub is a program that allows you to visually collect data from sites without programming knowledge.
Features:
- Parsing startup scheduler;
- Proxy support (you need to use your own);
- Regular expression support;
- API availability;
- Working with JavaScript and AJAX;
- Storing data on servers and uploading results to Google Sheets;
- Works on Windows, Mac, Linux;
- The service language is English.
Cost, per month:
- The free plan allows you to collect data from 200 pages per launch, with a limit of 40 minutes, only text data, no proxy rotation;
- $149: 10,000 pages per launch with a limit of 200 pages in 10 minutes, file upload, proxies, scheduler;
- $499: unlimited pages per launch, limited to 200 pages in 2 minutes, file upload, proxies, scheduler;
- Individual tariff.
Easy Web Extract is a simple website scraping tool that doesn't require any programming knowledge.
Features:
- Visual programming;
- Up to 24 parallel streams;
- Parsing of sites with dynamic content;
- Simulates human behavior;
- Scheduler;
- Saving files;
- Works on Windows;
- The service language is English.
Cost:
- Free version for 14 days, you can collect up to 200 first results, export up to 50 results;
- The unlocked version costs $39; an additional license is $29.
FMiner is a visual web scraping tool with an intuitive interface. Works with sites that require form input and proxy servers.
Features:
- Editor for visual programming of the parser;
- Parsing dynamic sites using Ajax and Javascript;
- Multithreaded scanning;
- Bypass captcha;
- Works on Windows, Mac;
- The service language is English.
Cost:
- The free version is limited to 15 days;
- The Basic version costs $168 and does not have the advanced features of the Pro version;
- Pro version includes reports, scheduler, customization with javascript.
Helium Scraper is a multithreaded parsing program with the ability to collect databases of up to 140 TB.
Features:
- Visual programming of the parser;
- Parsing dynamic sites using Ajax and Javascript;
- Multithreaded scanning;
- Automatic rotation of proxy servers;
- Works on Windows;
- The service language is English.
Cost:
- Free, fully functional version limited to 10 days;
- 4 tariff plans from $99 to $699, differing in the number of licenses and the period of major updates.
WebHarvy Web Scraper is a website scraping program with the ability to detect patterns in website templates and then automatically process such data. This feature greatly simplifies the programming of the parser.
Features:
- Visual programming of parsing;
- Parsing of dynamically loaded websites using Javascript and Ajax;
- Multithreaded scanning;
- Proxy / VPN support;
- Filling out forms;
- Scheduler;
- The ability to collect data from a list of links;
- Working with captcha;
- Works on Windows;
- The service language is English.
Cost:
- Free full-featured version is limited to 15 days and the ability to grab 2 pages from the site;
- 5 tariff plans from $139 to $699, differing in the number of licenses.
By the framework used
If your data-collection tasks are non-standard, you need to build a suitable architecture and work with multiple threads, and existing solutions do not suit you, you will have to write your own parser. This requires resources: programmers, servers, and special tools that make it easier to write and integrate the parser, as well as ongoing support (if the data source changes, the code will have to be changed). Let's take a look at what libraries currently exist. In this section we will not evaluate the advantages and disadvantages of the solutions, since the choice may be driven by the characteristics of your existing software and other features of the environment; what is an advantage for some will be a disadvantage for others.
Parsing sites in Python
Libraries for parsing sites in Python provide the ability to create fast and efficient programs, with subsequent API integration. An important feature is that the frameworks presented below are open source.
Scrapy is the most common framework, has a large community and detailed documentation, and is well structured.
License: BSD
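As an illustration, a minimal Scrapy spider sketch (the URL and CSS selectors are placeholders and must be adapted to the real site markup):

```python
import scrapy


class PricesSpider(scrapy.Spider):
    """Collects product names and prices from a hypothetical catalog page."""

    name = "prices"
    start_urls = ["https://example.com/catalog"]  # placeholder URL

    def parse(self, response):
        # The selectors below are assumptions about the page structure.
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Follow pagination if a "next" link is present.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Such a spider can be run with `scrapy runspider prices_spider.py -o prices.csv` to save the results to a CSV file.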
BeautifulSoup is designed for parsing HTML and XML documents; it has documentation in Russian, and its notable features are speed and automatic encoding detection.
License: Creative Commons, Attribution-ShareAlike 2.0 Generic (CC BY-SA 2.0)
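A minimal sketch with BeautifulSoup (the URL and tag/class names are placeholders; requests is used here only to fetch the page):

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/catalog", timeout=30).text  # placeholder URL

# "html.parser" is the built-in parser; "lxml" can be passed instead if it is installed.
soup = BeautifulSoup(html, "html.parser")

for product in soup.find_all("div", class_="product"):  # assumed markup
    title = product.find("h2")
    price = product.find("span", class_="price")
    if title and price:
        print(title.get_text(strip=True), price.get_text(strip=True))
```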
PySpider is powerful and fast, supports Javascript, no built-in proxy support.
License: Apache License, Version 2.0
Grab's distinguishing feature is asynchrony: it allows you to write parsers with a large number of network threads; there is documentation in Russian, and it works via an API.
License: MIT License
Lxml is a simple and fast library for parsing large documents, it allows you to work with XML and HTML documents, converts source information to Python data types, is well documented. Compatible with BeautifulSoup, in which case the latter uses Lxml as a parser.
License: BSD
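A minimal lxml sketch using XPath (the URL and XPath expressions are placeholders):

```python
import requests
from lxml import html

page = requests.get("https://example.com/catalog", timeout=30)  # placeholder URL
tree = html.fromstring(page.content)  # lxml detects the encoding from the raw bytes

# The XPath expressions below are assumptions about the page structure.
titles = tree.xpath("//div[@class='product']/h2/text()")
prices = tree.xpath("//div[@class='product']//span[@class='price']/text()")

for title, price in zip(titles, prices):
    print(title.strip(), price.strip())
```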
Selenium is a browser automation toolkit that includes a number of libraries for deployment, browser management, and recording and replaying user actions. It lets you write scripts in various languages: Java, Python, C#, JavaScript, Ruby.
License: Apache License, Version 2.0
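A minimal Selenium sketch with headless Chrome (the URL and selector are placeholders; requires the selenium package and a Chrome/Chromium installation):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # run the browser without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/catalog")  # placeholder URL
    # The CSS selector is an assumption about the page markup.
    for heading in driver.find_elements(By.CSS_SELECTOR, "div.product h2"):
        print(heading.text)
finally:
    driver.quit()
```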
Parsing sites in JavaScript
JavaScript also offers ready-made frameworks for building parsers with convenient APIs.
Puppeteer is a headless Chrome API for Node.js programmers who want granular control over the browser while parsing. As an open-source tool, Puppeteer is free to use. It is actively developed and maintained by the Google Chrome team itself. It has a well-thought-out API and automatically installs a compatible Chromium binary during installation, which means you don't have to keep track of browser versions yourself. While this is much more than just a website parsing library, it is very often used to parse data that requires JavaScript to render, and it treats scripts, stylesheets, and fonts like a real browser. Note that while this is a great solution for sites that require JavaScript to display data, the tool requires significant CPU and memory resources.
License: Apache License, Version 2.0
Cheerio - fast, parses page markup and offers functions for processing the received data. Works with HTML, has an API similar to the jQuery API.
License: MIT License
Apify SDK is a Node.js library that allows you to work with JSON, JSONL, CSV, XML, XLSX or HTML, CSS. Works with proxies.
License: Apache License, Version 2.0
Osmosis - written in Node.js, searches and loads AJAX, supports CSS 3.0 and XPath 1.0 selectors, logs URLs, fills in forms.
License: MIT License
Parsing sites in Java
Java also offers various libraries that can be used to parse sites.
Jaunt - the library offers a lightweight headless browser (no GUI) for parsing and automation. It allows you to interact with REST APIs or web applications (JSON, HTML, XHTML, XML), fills in forms, downloads files, works with tabular data, and supports regex.
License: Apache License (Software expires monthly, after which the latest version must be downloaded)
Jsoup is an HTML library that provides a convenient API for fetching URLs and for extracting and manipulating data using HTML5 DOM methods and CSS selectors. Supports proxies. Does not support XPath.
License: MIT License
HtmlUnit is not a universal unit-testing framework; it is a browser without a GUI. It models HTML pages and provides an API that allows you to request pages, fill out forms, and click links. Supports JavaScript and XPath-based parsing.
License: Apache License, Version 2.0
CyberNeko HTML Parser is a simple parser that allows you to parse HTML documents and process them using XPath.
License: Apache License, Version 2.0
Browser extensions
Site parsers made in the form of browser extensions are convenient to use: installation is minimal (you only need a browser) and data capture is visual, requiring no programming.
Scrape.it is a Chrome browser extension for collecting data from sites with a visual Point-Click interface.
Features:
- Visual Point-Click data capture;
- Parsing dynamic websites using Javascript;
- Multithreaded scanning;
- Server proxy;
- Chrome browser;
- The service language is English.
Cost, per month:
- Free trial period for 30 days;
- 3 tariff plans at $19.9, $49.9, and $199.9, differing in the number of parallel requests and page crawling speed.
Web Scraper.io is a site scraping tool designed as an extension for Chrome, a service with a wide range of options and the ability to visually program the scraping.
Features:
- Visual capture of data from the site;
- Parsing of dynamic sites with Ajax and Javascript, with the ability to scroll;
- Multithreaded scanning;
- Automatic rotation of proxy servers;
- Works with browsers Chrome, Firefox;
- API;
- Transferring results via Dropbox;
- The service language is English.
Cost, per month:
- Free trial period for 30 days;
- 3 tariff plans at $19.9, $49.9, and $199.9, differing in the number of parallel requests and page crawling speed.
Data miner is an extension for Google Chrome and Microsoft Edge that helps you collect data from sites using a simple visual interface.
Features:
- Collection of data from the site without programming;
- Ready-made templates for 15,000+ popular sites;
- Parsing a list of URLs;
- Support for pagination with additional loading;
- Automatic form filling;
- Works with browsers Chrome, Edge;
- Emulation of human behavior;
- Service language - English;
Cost, per month:
- Free account with the ability to parse up to 500 pages per month;
- 4 tariff plans at $19, $49, $99, and $199.9, differing in the number of pages you can parse, from 500 to 9,000;
- Enterprise, customizable, contractual plan for on-demand tasks.
Scraper.Ai is an extension with a wide range of functionality and reasonable prices, works with Chrome, Firefox and Edge.
Features:
- Collection of data from the site without programming;
- Ready-made templates for Facebook, Instagram and Twitter;
- Support for pagination with additional loading;
- Automatic form filling;
- Works with browsers Chrome, Firefox, Edge;
- Scheduler;
- Tracking changes on the site;
- The ability to limit the number of pages so as not to exhaust your quota;
- The service language is English.
Cost, per month:
- Free plan for 3 months with the ability to parse up to 50 pages;
- 3 tariff plans at $9, $49, and $99, differing in the number of pages you can parse.
Depending on the tasks to be solved
Competitor monitoring
Price monitoring services allow you to track the dynamics of competitors' prices for the same items that you sell. Prices are then compared, and you can raise or lower yours depending on the market situation. This lets you offer the best price on the market at any time, making a purchase in your store more attractive than in a competitor's, and avoid missing out on profit if competitors raise their prices for some reason.
Such services are usually adapted to specific marketplaces; to collect prices from online stores selling through their own sites, you need to set up data collection yourself or order an individually configured parser.
Such services are monetized through a subscription model, with pricing tiers based on the number of prices/competitors collected.
Organization of joint purchases
Such services are designed to organize joint (group) purchases in social networks. These parsers collect product data and upload it to VKontakte and Odnoklassniki groups, which makes it possible to automate filling the storefront and to monitor the assortment, stock levels, and prices on suppliers' websites. As a rule, such parsers have a personal account with management features, pre-built integrations for data collection, a notification system, and data export, and do not require modification.
Monetization is a subscription billed according to the number of sites.
Automation of online stores
Such services allow you to automate the loading of goods (pictures, descriptions, characteristics) from a wholesaler, and synchronize prices and stock levels. This allows you to add products and manage prices in a fully automated mode and to save on personnel. The source can be either an XML or CSV file, or the site from which the robot takes the information.
SEO data parsing and analytics
Parsers used for search engine optimization help to collect metadata (H1, Title, Description) and keywords, compose a semantic core, and gather behavioral and quantitative analytical data about competitors. The range of tools is very wide in functionality; let's look at popular services so that you can choose the right one.
SiteAnalyzer is a web scraping program for checking basic technical and SEO data of websites. The main feature is that the program is completely free. Works on local computer, available only for Windows OS.
Features:
- Not demanding on computer resources;
- Checking pages, images, scripts and documents;
- Checking response codes (200, 404 ...);
- Checking Title, Description, and Canonical tags;
- Search for duplicate pages;
- Analysis of internal and external links;
- Works on Windows;
- Data export to CSV, Excel, PDF;
- Localization in 17 languages, including Russian;
Cost:
- Free.
Screaming Frog SEO Spider is a powerful and popular SEO site audit program. The parser has established itself as one of the best in its class and provides a wide range of SEO analysis functionality.
Features:
- Demanding on computer resources;
- Support for Google Analytics API and Google Search Console (Google Webmaster);
- User-Agent support;
- Support for URL redirects (local htaccess);
- Scheduler;
- Customizable scan configuration;
- Checking pages, images, scripts and documents;
- Checking response codes (200, 404 ...);
- Checking Title, Description, and Canonical tags;
- Search for duplicate pages;
- Analysis of internal and external links;
- Works on Windows, MacOS, Ubuntu;
- Data export;
- English-language interface.
Cost:
- The free version is limited to scanning 500 addresses and reduced functionality;
- Paid full version £149.99 (roughly $200, or 14,600 rubles).
ComparseR is a program specializing in analyzing how a website is indexed in the Yandex and Google search engines. You can find out which pages are in the search index and which are not, and analyze them.
Features:
- Search for pages in the index;
- Regular expression support when customizing;
- Auto captcha input;
- Checking response codes (200, 404 ...);
- Checking Title, Description, and Canonical tags;
- Search for duplicate pages;
- Analysis of internal and external links;
- Works on Windows;
- Data export;
- Russian language interface.
Cost:
- The free version parses the first 150 pages or the first 150 search results;
- The full version costs 2,000 rubles.
Parsing based on Excel and Google Sheets
Such parsers collect data directly into Excel and Google Sheets spreadsheets. They are based on macros that automate actions, or on special formulas that extract data from sites. These parsers are suitable for simple tasks, when the collected data is not protected and sits on simple, non-dynamic sites.
ParserOk is an add-on for parsing sites into Microsoft Excel tables based on VBA macros. The add-on allows you to import data from sites according to pre-created templates and is relatively easy to configure. The downside is that if a template does not match your request, some extra work will be needed.
The license costs 2,700 rubles; the demo version lasts 10 days.
Google Sheets functions IMPORTHTML and IMPORTXML allow you to import data directly into spreadsheets. With their help you can organize simple data collection according to pre-programmed inputs. Knowledge of the XPath query language significantly expands the scope of these formulas; a basic example is shown below.
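A minimal sketch of both functions (the URLs and XPath expression are placeholders; depending on your spreadsheet locale, arguments are separated by commas or semicolons):

```
=IMPORTHTML("https://example.com/prices", "table", 1)
=IMPORTXML("https://example.com/product-page", "//h1")
```

The first formula imports the first HTML table found on the page; the second extracts the page's H1 heading using an XPath query.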
Customizable parsing solutions
Such services work on a turnkey basis and approach each task individually; the parser is written for a specific request. Such solutions are best suited for private business tasks, for example when you need to analyze competitors, collect certain types of data, and do so regularly. Their advantage is that a solution designed specifically for the task will collect data even from well-protected sites, or data that requires interpretation, for example when a price is displayed not as text but as an image. Programs and services that you configure yourself cannot cope with these situations. In addition, such services do not require a dedicated employee to spend time collecting data or reworking the parser when the source site changes.
If you have several different sites and need to receive data regularly, individually configured parsing can turn out to be more cost-effective. This is easy to check by adding up the cost of a ready-made solution, the cost of a programmer to write and support the parser, and the cost of maintaining servers.
There are examples of such services at the beginning of the article in the section of cloud parsers, many of them offer custom solutions. Let's add a Russian-language service.
iDatica is a service specializing in organizing parsing, data cleansing, matching, and data visualization on request. iDatica has Russian-speaking support and experienced specialists, and has established itself as a reliable partner for developing data collection and visualization solutions. Upon request, the team allocates analysts to work with your projects.
Features of the service:
- Personal approach to the task;
- Turnkey completion of tasks: you only need to describe the task;
- Working with sites of any complexity;
- The ability to connect BI services for visualization;
- The ability to connect analytics;
- The service language is Russian.
Cost, per month:
- From 2,000 rubles, calculated based on the complexity and frequency of parsing.
How to choose the right parser
- First, define your tasks: price monitoring, product analytics, machine learning, SEO data, process automation;
- Determine the sources of data collection: competitors' sites, data sources for training, your site, etc.;
- Determine how much data you need, in what form, and how often you need to receive it;
- Based on this, choose a suitable solution.
If you have a standard task with a small amount of data and have a separate person to complete the task, then a ready-made solution in the form of a program or browser extension is suitable for you.
For parsing complex sites with a certain regularity, pay attention to cloud solutions. You will need a separate employee to run this project.
If the task is tied to increasing profits or even the viability of the project, it is worth paying attention to a cloud service with the ability to program or libraries for parsing, allocate a separate programmer for this task and server capacity.
If you need to get a solution quickly and you need to be sure of the quality of the result, you should choose a company that implements a turnkey project.