A customer came to me with a problem: his data collector (scraper) could not get past the Incapsula protection.
In a nutshell, instead of the page's HTML the server returns JavaScript code. When that code runs, it sends a request to the Incapsula servers, which check a number of browser parameters; if the browser is recognized as valid, the page is returned along with some cookies.
A detailed description is available on the developer's website (www.imperva.com).
Adding a JavaScript interpreter, like the other solutions that turned up on Google (for example, setting up your own servers), seemed too complicated and time-consuming. Selenium, as it turned out, gets past this protection without trouble. However, there was a lot of data, so I did not want to collect it in a single thread (or even by switching between tabs), and there were not enough resources to run several browsers. So I decided to write a proxy server.
Because the load varied with the time of day and other conditions, I decided to make the web part scalable using the combination Nginx + uWSGI + Flask. Running a separate Selenium instance for every worker seemed too expensive, so Selenium was moved into its own service, which communicates with the other components through Redis. To keep the implementation as simple as possible, requests are executed synchronously.
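The synchronous exchange between a web worker and the Selenium service can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: it assumes jobs travel through a Redis list using `rpush` and `blpop`, and the key names (`selenium:jobs`, `selenium:reply:...`), the function names and the `fetch_cookies` callback are all hypothetical. So that the sketch runs without a Redis server, it uses a small in-memory stand-in that exposes only those two methods. A real redis-py client could be passed in its place.

```python
import json
import queue
import threading
import uuid
from collections import defaultdict

JOBS_KEY = "selenium:jobs"  # hypothetical key name, not taken from the project


class InMemoryQueues:
    """Stand-in for a Redis client that supports only rpush and blpop."""

    def __init__(self):
        self._queues = defaultdict(queue.Queue)
        self._lock = threading.Lock()

    def _get(self, key):
        # Guard creation of per-key queues against concurrent access.
        with self._lock:
            return self._queues[key]

    def rpush(self, key, value):
        self._get(key).put(value)

    def blpop(self, key, timeout=0):
        # Mirrors redis-py: returns (key, value), or None when the timeout expires.
        # A timeout of 0 blocks indefinitely.
        try:
            value = self._get(key).get(timeout=timeout or None)
        except queue.Empty:
            return None
        return key, value


def request_cookies(client, url, timeout=30):
    """Web worker side: enqueue a job, then block until the browser service replies."""
    reply_key = f"selenium:reply:{uuid.uuid4().hex}"
    client.rpush(JOBS_KEY, json.dumps({"url": url, "reply_to": reply_key}))
    reply = client.blpop(reply_key, timeout=timeout)
    if reply is None:
        raise TimeoutError(f"no reply for {url} within {timeout}s")
    return json.loads(reply[1])


def serve_one(client, fetch_cookies, timeout=5):
    """Browser service side: handle one job. fetch_cookies would drive Selenium."""
    job = client.blpop(JOBS_KEY, timeout=timeout)
    if job is None:
        return False
    payload = json.loads(job[1])
    cookies = fetch_cookies(payload["url"])
    client.rpush(payload["reply_to"], json.dumps(cookies))
    return True
```

Because each job carries its own reply key, several workers can wait at the same time without reading each other's answers, while each individual worker still blocks until its own reply arrives.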
Project structure
uwsgi.ini – . , . (
selenium:
gecko/Sel.py
selenium . , selenium , ( ). cookie Redis. Cookie , Redis. cookie callback .
API:
src
, 1 url:
@app.route('/', methods=['GET', 'POST'])
, url url, , post .
:
http://127.0.0.1:5000/?url=https://www.example.com/vehicledetails/34313441?RowNumber=0&
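One detail worth illustrating with this kind of URL: the target address has its own query string, so an unencoded `&` in it is read as a separator of the proxy's parameters. The sketch below uses only the standard library and shows one way to avoid that by percent-encoding the target. Whether the original service expects an encoded or a raw URL is not stated here, and the helper names and the `page=2` parameter are hypothetical.

```python
from urllib.parse import parse_qs, quote, urlsplit

PROXY = "http://127.0.0.1:5000/"


def build_proxy_url(target, proxy=PROXY):
    """Wrap a target URL in a proxy request, escaping every reserved character."""
    # safe="" makes quote() escape ':', '/', '?', '=' and '&' as well.
    return f"{proxy}?url={quote(target, safe='')}"


def extract_target(proxy_url):
    """Recover the target URL from the proxy request's 'url' parameter."""
    values = parse_qs(urlsplit(proxy_url).query).get("url")
    if not values:
        raise ValueError("missing 'url' parameter")
    return values[0]


target = "https://www.example.com/vehicledetails/34313441?RowNumber=0&page=2"

# Encoded: the whole target survives the round trip.
encoded = build_proxy_url(target)
recovered = extract_target(encoded)

# Unencoded: everything after the first '&' is parsed as a separate parameter.
truncated = extract_target(PROXY + "?url=" + target)
```

Here `recovered` equals `target`, while `truncated` ends at `RowNumber=0` because `page=2` was split off as a parameter of the proxy request itself.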
, , , .
request.py .
requests, .
Redis, Post, Get with requests.
, cookie, Selenium .
, . https, , , . . , .