Habr recently published the article "Scraping modern websites without headless browsers", and one of the comments claimed that without a headless browser you cannot get the phone number from an ad on Avito or Youla. I want to refute that: below is a Python script of fewer than 100 lines that successfully parses Avito.
I am not a specialist in scraping websites, and it is not my job, but I often end up doing it for work tasks and for other things too. For example, you may need the balance of a personal account at some service (mobile operators) that has no API for this, or, which is rather sad, a list of domains from a registrar (yet another one) that also has no API.
Like the author of that article, a couple of comments on which prompted this post, I use Python and the requests library. If you cannot find an "internal" API, you have to bring in the BeautifulSoup library. Here, though, everything turned out to be much simpler.
If you open the "full" version of the site, https://avito.ru, and try to copy a phone number, it becomes clear that the number is not written as text but drawn as an image. In the mobile version of the site, however, the number is returned as text. You can check this in the browser's developer tools by looking at the responses when you click the "Call" button.
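To illustrate reading the number from a text response, here is a minimal sketch. The header value and the payload shape are assumptions made for this example; they are not taken from the script or confirmed against the site, so check the real response in the developer tools first. The name `extract_phone` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

# Assumption: a mobile User-Agent makes the server return the mobile version.
MOBILE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_0 like Mac OS X) "
                  "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
}


def extract_phone(payload):
    """Return the phone number found in a decoded JSON payload, or None.

    Assumed (hypothetical) payload shape:
    {"result": {"action": {"uri": "app://phone/call?number=%2B79990000000"}}}
    """
    uri = payload.get("result", {}).get("action", {}).get("uri", "")
    # The number is carried as a URL-encoded query parameter of the URI.
    numbers = parse_qs(urlparse(uri).query).get("number")
    return numbers[0] if numbers else None


# Stand-in payload instead of a real HTTP response.
sample = {"result": {"action": {"uri": "app://phone/call?number=%2B79990000000"}}}
print(extract_phone(sample))
```

With requests, the payload would typically come from something like `session.get(url, headers=MOBILE_HEADERS).json()`, where `url` is whatever request the "Call" button triggers in the developer tools.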
I will not go through the script in detail; the comments in the code are enough to understand what happens at each stage. In short, the script uses the mobile version of the site. It declares the search variables and two more variables, "key" and "cookie". It then obtains cookies by opening the main page and runs a loop that goes through all result pages and collects the ids of all ads. Once all ids are collected, a second loop goes through them and fetches the information we are interested in.
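The two-loop flow described above can be sketched roughly as follows. This is not the original script: the function names `collect_ad_ids` and `collect_ads` are hypothetical, and the HTTP calls are replaced by plain callables so the logic can be run without touching the site.

```python
def collect_ad_ids(fetch_page, max_pages=100):
    """First loop: walk result pages until an empty one, return all ad ids."""
    ids = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)  # expected: a list of dicts with an "id" key
        if not items:
            break  # an empty page means there are no more results
        ids.extend(item["id"] for item in items)
    return ids


def collect_ads(ids, fetch_ad):
    """Second loop: fetch the details of every collected ad id."""
    return {ad_id: fetch_ad(ad_id) for ad_id in ids}


# Stand-in data instead of real HTTP calls (assumption for the example).
fake_pages = {1: [{"id": 101}, {"id": 102}], 2: [{"id": 103}]}
ids = collect_ad_ids(lambda page: fake_pages.get(page, []))
ads = collect_ads(ids, lambda ad_id: {"title": f"ad {ad_id}"})
print(ids)
```

In a real run, `fetch_page` and `fetch_ad` would wrap calls on a single `requests.Session()`. A session keeps the cookies it receives, so opening the main page once before the loops makes those cookies available to every later request.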
Screenshot of the script:
If there is interest, I can describe in more detail how I looked for the API, or write a similar example for Youla.