Book "Modern Website Scraping with Python. 2nd int. edition"

Hello, Habitants! If programming is like magic, then web scraping is a very powerful kind of witchcraft. By writing a simple automated program, you can send requests to web servers, request data from them, and then analyze the responses and extract the information you need. The new expanded edition of the book not only introduces web scraping but also helps you collect any kind of data on the modern Internet. Part I focuses on the mechanics of web scraping: how to use Python to request information from a web server, perform basic processing of the server's response, and organize automated interactions with sites. Part II explores more specific tools and applications that come in handy in any web scraping scenario.

- Parse complex HTML pages.
- Develop search robots using the Scrapy framework.
- Learn how to store scraped data.
- Read and extract data from documents.
- Clean up and normalize poorly formatted data.
- Read and write information in natural languages.
- Master crawling through forms and logins.
- Learn to scrape JavaScript and work with APIs.
- Use and write programs to convert images to text.
- Learn to bypass scraping traps and bot blockers.
- Test your own website with scraping.



Web crawling with API



JavaScript has traditionally been considered the universal curse of web crawlers. Once upon a time, you could be sure that a request you sent to a web server would return the same data that a user would see in their browser after making the same request.



As JavaScript and Ajax methods for generating and loading content proliferate, the situation described above is becoming less common. In Chapter 11, we looked at one way to solve this problem: using Selenium to automate the browser and fetch data. This is easy to do. This almost always works.



The problem is that when you have a hammer as powerful and effective as Selenium in your hands, every web scraping task starts to feel like a nail.



In this chapter, you'll learn how to bypass all this JavaScript (without executing or even loading it!) and directly access the data source: the APIs that generate that data.



A Brief Introduction to the API



There are countless books, talks, and tutorials on the nuances of REST APIs, GraphQL, JSON, and XML, but they are all based on one simple concept. An API defines a standardized syntax that allows one program to interact with another, even if they are written in different languages or have different structures.



This section focuses on web APIs (in particular those that allow a web server to interact with a browser), and from here on the term API refers to this type of interface. Keep in mind, though, that in other contexts API is also a generic term that can mean, for example, an interface that allows a Java program to interact with a Python program running on the same computer. An API does not always mean "an interface over the Internet" and does not have to involve any web technologies.



Web APIs are most commonly used by developers to interact with widely advertised and well-documented public services. For example, the American cable sports TV channel ESPN provides an API (http://www.espn.com/apis/devcenter/docs/) for information on athletes, game scores, and so on. Google has a developer section (https://console.developers.google.com) with dozens of APIs for language translation, analytics, and geolocation.



The documentation for these APIs usually describes routes or endpoints in the form of URLs that can be requested, with variable parameters passed either as part of the URL path or as GET parameters.



For example, in the following URL, pathparam is a path parameter:



example.com/the-api-route/pathparam



And here pathparam is the value of the param1 parameter:



example.com/the-api-route?param1=pathparam



Both methods of passing data to an API are widely used, although, like many other aspects of computer science, they are the subject of heated philosophical debate about when variables should be passed via the path and when via parameters.
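To make the difference concrete, here is a minimal sketch using Python's standard urllib.parse module. The host and route are the illustrative ones from the examples above, and the two helper functions are hypothetical names invented for this sketch.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE = "http://example.com/the-api-route"  # illustrative route from the text


def with_path_param(value):
    """Put the value into the URL path itself."""
    return f"{BASE}/{quote(value, safe='')}"


def with_query_param(**params):
    """Put the values into the query string as GET parameters."""
    return f"{BASE}?{urlencode(params)}"


path_url = with_path_param("pathparam")
query_url = with_query_param(param1="pathparam")
print(path_url)
print(query_url)

# Reading the value back out of each URL
path_value = urlsplit(path_url).path.rsplit("/", 1)[-1]
query_value = parse_qs(urlsplit(query_url).query)["param1"][0]
print(path_value, query_value)
```

Either way the server receives the same value; only the place it has to look for that value changes.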



The response to an API request is usually returned in JSON or XML format. Nowadays JSON is much more popular than XML, but the latter is also sometimes encountered. Many APIs allow you to choose the type of response, usually with another parameter that determines what type of response you want to receive.



Here's an example of a JSON response to an API request:



{"user":{"id": 123, "name": "Ryan Mitchell", "city": "Boston"}}
      
      







And here is the response to the API request in XML format:



<user><id>123</id><name>Ryan Mitchell</name><city>Boston</city></user>
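As a rough illustration of how the two formats are consumed, the sketch below parses both sample responses with Python's standard library (json and xml.etree.ElementTree). The variable names are chosen for this example only.

```python
import json
import xml.etree.ElementTree as ET

json_text = '{"user":{"id": 123, "name": "Ryan Mitchell", "city": "Boston"}}'
xml_text = "<user><id>123</id><name>Ryan Mitchell</name><city>Boston</city></user>"

# JSON maps directly onto Python dicts, strings, and numbers
user_from_json = json.loads(json_text)["user"]

# XML yields a tree of elements, and every value arrives as text
root = ET.fromstring(xml_text)
user_from_xml = {child.tag: child.text for child in root}
user_from_xml["id"] = int(user_from_xml["id"])  # numbers must be converted by hand

print(user_from_json)
print(user_from_xml)
print(user_from_json == user_from_xml)
```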
      
      







The site ip-api.com (http://ip-api.com/) has a clear and convenient API that converts IP addresses to real physical addresses. You can try making a simple API request by typing ip-api.com/json/50.78.253.58 into your browser, and you will get a response like this:



{"ip":"50.78.253.58","country_code":"US","country_name":"United States",
"region_code":"MA","region_name":"Massachusetts","city":"Boston",
"zip_code":"02116","time_zone":"America/New_York","latitude":42.3496,
"longitude":-71.0746,"metro_code":506}

      
      





Note that the request contains the path parameter json. To get a response in XML or CSV format, replace it with the corresponding format name:



ip-api.com/xml/50.78.253.58

ip-api.com/csv/50.78.253.58
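The following sketch shows one way this could look in Python. The function lookup_url and the SUPPORTED_FORMATS tuple are hypothetical helpers, not part of any library, and the JSON is the sample response quoted above, parsed without touching the network.

```python
import json

SUPPORTED_FORMATS = ("json", "xml", "csv")  # the formats mentioned in the text


def lookup_url(ip, fmt="json"):
    """Build the lookup URL; the response format is the first path segment."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"http://ip-api.com/{fmt}/{ip}"


for fmt in SUPPORTED_FORMATS:
    print(lookup_url("50.78.253.58", fmt))

# The sample JSON response from the text, parsed locally
sample = (
    '{"ip":"50.78.253.58","country_code":"US","country_name":"United States",'
    '"region_code":"MA","region_name":"Massachusetts","city":"Boston",'
    '"zip_code":"02116","time_zone":"America/New_York","latitude":42.3496,'
    '"longitude":-71.0746,"metro_code":506}'
)
location = json.loads(sample)
print(location["city"], location["region_code"], location["country_code"])
```

To query the live service, the URL could be passed to urllib.request.urlopen; the fields of a live response may differ from the sample shown here.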




API and HTTP Methods



In the previous section, we used an API by sending a GET request to the server to obtain information. There are four main ways (or methods) of making a request to a web server via HTTP:



- GET;

- POST;

- PUT;

- DELETE.



Technically, there are more than four request types (for example, there are also HEAD, OPTIONS, and CONNECT), but they are rarely used in APIs, and it is unlikely that you will ever come across them. The vast majority of APIs are limited to these four methods, and sometimes to only a subset of them. It is common to see APIs that use only GET, or only GET and POST.



GET is the request that you use when you visit a site by entering its address in the address bar of your browser. When you visit ip-api.com/json/50.78.253.58, you are using the GET method. This request can be thought of as a command: "Hey, web server, please give me this information."



A GET request, by definition, does not change the contents of the server database. Nothing is saved and nothing is changed. The information is only read.



POST is the request that is used when you fill out a form or submit information, presumably intended for processing by a script on the server. Each time you log into a site, you make a POST request, passing in a username and (hopefully) an encrypted password. By making a POST request through an API, you are telling the server, "Kindly save this information in the database."



PUT requests are used less often when interacting with websites, but they do occur in APIs from time to time. This request is used to update an object or information. For example, an API might use a POST request to create a user and a PUT request to change that user's email address.



DELETE requests, as you might guess, are used to delete an object. For example, if you send a DELETE request to myapi.com/user/23, the user with ID 23 will be deleted. DELETE methods are not often found in public APIs, because such APIs are mainly created to distribute information or to let users create or publish information, not to remove it from databases.
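As a small sketch of how the four methods map onto code, the snippet below builds (but never sends) one request per method with Python's standard urllib.request.Request. The myapi.com endpoints follow the example in the text and are purely hypothetical.

```python
from urllib.request import Request

# Hypothetical endpoints, following the myapi.com/user/23 example in the text
requests_by_method = {
    "GET": Request("http://myapi.com/user/23", method="GET"),        # read user 23
    "POST": Request("http://myapi.com/user", method="POST"),         # create a user
    "PUT": Request("http://myapi.com/user/23", method="PUT"),        # change user 23
    "DELETE": Request("http://myapi.com/user/23", method="DELETE"),  # delete user 23
}

for name, req in requests_by_method.items():
    # Nothing is sent; the objects only describe the requests
    print(req.get_method(), req.full_url)
```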



Unlike GET requests, POST, PUT, and DELETE requests allow information to be passed in the body of the request, in addition to the URL or route from which the data is requested.

Like the response received from the web server, the data in the request body is usually JSON or, less commonly, XML. The specific data format is determined by the API syntax. For example, when using an API that adds comments to blog posts, you might send a PUT request to the following URL:



example.com/comments?post=123



with a request body like this:



{"title": "Great post about APIs!", "body": "Very informative. Really helped me out with a tricky technical challenge I was facing. Thanks for taking the time to write such a detailed blog post about PUT requests!", "author": {"name": "Ryan Mitchell", "website": "http://pythonscraping.com", "company": "O'Reilly Media"}}
      
      







Note that the blog post ID (123) is passed as a parameter in the URL, while the content of the comment being created is passed in the request body. Parameters and data can be passed both in the URL and in the request body. Which parameters are required and where they are passed is, again, determined by the API syntax.
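The request described above could be assembled in Python roughly as follows. This is a sketch using the standard urllib.request module; the comment body is shortened from the example, and nothing is sent because example.com does not implement such an API.

```python
import json
from urllib.request import Request

comment = {
    "title": "Great post about APIs!",
    "body": "Very informative.",  # shortened from the example in the text
    "author": {"name": "Ryan Mitchell", "website": "http://pythonscraping.com"},
}

req = Request(
    "http://example.com/comments?post=123",    # the post ID travels in the URL
    data=json.dumps(comment).encode("utf-8"),  # the comment travels in the body
    headers={"Content-Type": "application/json"},
    method="PUT",
)

print(req.get_method(), req.full_url)
print(json.loads(req.data)["title"])
# urllib.request.urlopen(req) would send the request; it is left out here
# because the endpoint is only an illustration.
```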



More About Responding to API Requests



As we saw in the ip-api.com example at the beginning of this chapter, an important feature of APIs is that these interfaces return well-formatted responses. The most common response formats are XML (eXtensible Markup Language) and JSON (JavaScript Object Notation).



In recent years, JSON has become much more popular than XML for several main reasons. First, JSON files are usually smaller than well-crafted XML files. Compare, for example, the following XML data that is 98 characters long:



<user><firstname>Ryan</firstname><lastname>Mitchell</lastname><username>Kludgist</username></user>
      
      





Now look at the same JSON data:



{"user":{"firstname":"Ryan","lastname":"Mitchell","username":"Kludgist"}}
      
      







That's just 73 characters, about 26% less than the same XML data.

Of course, one could argue that XML can be formatted like this:



<user firstname="ryan" lastname="mitchell" username="Kludgist"></user>
      
      







However, this form is not recommended because attributes do not support deep nesting of data. Even so, the entry is 70 characters long, about the same as the equivalent JSON.
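The character counts in this comparison are easy to check for yourself. The short script below measures the three snippets quoted above with len(); the variable names are made up for this example.

```python
xml_elements = (
    "<user><firstname>Ryan</firstname><lastname>Mitchell</lastname>"
    "<username>Kludgist</username></user>"
)
json_text = '{"user":{"firstname":"Ryan","lastname":"Mitchell","username":"Kludgist"}}'
xml_attributes = '<user firstname="ryan" lastname="mitchell" username="Kludgist"></user>'

for label, text in [
    ("XML (elements)", xml_elements),
    ("JSON", json_text),
    ("XML (attributes)", xml_attributes),
]:
    print(f"{label}: {len(text)} characters")

# Relative size reduction of JSON compared with the element-based XML
saving = 1 - len(json_text) / len(xml_elements)
print(f"JSON is {saving:.0%} shorter than the element-based XML")
```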



Another reason why JSON is overtaking XML so quickly has to do with changing web technologies. Previously, the recipients of API responses were mostly server-side scripts in PHP or .NET. Now it may well be a framework like Angular or Backbone that sends and receives API calls. Server technologies are, to a certain extent, indifferent to the form in which data arrives. However, JavaScript libraries like Backbone find it easier to handle JSON.



APIs are generally expected to return a response in either XML or JSON format, but any other option is possible. The API response type is limited only by the imagination of the programmer who created the interface. Another typical response format is CSV (as seen in the ip-api.com example). Some APIs even generate files: you can send a request to a server that will generate an image with the specified text superimposed on it, or you can request a specific XLSX or PDF file.



Some APIs do not return a response body at all. For example, if you send a request to the server to create a comment on a blog post, it may return only an HTTP response code of 200, which means: "I posted the comment; everything is fine!" Other requests may return a minimal response like this:



{"success": true}
      
      







In case of an error, you can get a response like this:



{"error": {"message": "Something super bad happened"}}
      
      







Alternatively, if the API is not well configured, you might end up with an unparseable stack trace or some plain English text. When making a request to an API, it usually makes sense to first verify that the response you receive is indeed in JSON format (or XML, or CSV, or whatever format you expect to receive).
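One possible way to perform such a check in Python is sketched below. The function parse_api_response is a hypothetical helper, and treating a top-level "error" key as a failure is an assumption based on the sample error response above; real APIs signal errors in different ways.

```python
import json


def parse_api_response(text):
    """Return the decoded JSON object, or raise ValueError with a reason."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not JSON: {text[:40]!r}") from exc
    # Assumption: errors are reported under a top-level "error" key
    if isinstance(data, dict) and "error" in data:
        raise ValueError(f"API reported an error: {data['error']}")
    return data


samples = [
    '{"success": true}',
    '{"error": {"message": "Something super bad happened"}}',
    "Traceback (most recent call last): ...",  # e.g. a leaked stack trace
]
for text in samples:
    try:
        print("ok:", parse_api_response(text))
    except ValueError as err:
        print("rejected:", err)
```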



About the Author



Ryan Mitchell is a Senior Software Engineer at HedgeServ in Boston, where she develops APIs and tools for data analysis. Ryan graduated from the Franklin W. Olin College of Engineering and holds an MSc in Software Engineering and a Certificate in Data Analysis and Processing from Harvard University's continuing education courses. Prior to joining HedgeServ, Ryan worked at Abine, where she developed web scrapers and automation tools in Python. She regularly advises on web scraping projects in retail, finance, and pharmaceuticals. She also works as a consultant and freelance teacher at Northeastern University and the Franklin W. Olin College of Engineering.



More details about the book can be found on the publisher's website

» Table of Contents

» Excerpt



Habitants get a 25% discount with the coupon Python.

When you pay for the paper version of the book, the e-book is sent to your e-mail.


