XPATH + XML = Fast Processing





When executing queries, XPath operates on nodes. Nodes come in several kinds: element (element node), attribute (attribute node), text (text node), namespace (namespace node), processing-instruction (executable instruction node), comment (comment node), and document (document node).
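
Each node kind has its own node test in queries. Here is a minimal sketch of selecting them from Python with the lxml library (our choice for the short illustrations in this section; the article's own example further below uses Selenium):

from lxml import etree

doc = etree.fromstring('<catalog><!-- stock --><book id="1">XPath Basics</book></catalog>')
print(doc.xpath('//book')[0].tag)   # element node -> 'book'
print(doc.xpath('//book/@id'))      # attribute node -> ['1']
print(doc.xpath('//book/text()'))   # text node -> ['XPath Basics']
print(doc.xpath('//comment()'))     # comment node -> [<!-- stock -->]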



Let's consider how XPath defines sequences of nodes, how selection directions are set, and how nodes with specific values are selected.



To select nodes, six basic constructs are mainly used (a short sketch of them in action follows the list):

  • nodename — selects all child nodes with the given name
  • / — selects starting from the root node
  • // — selects matching nodes anywhere below the current node
  • . — selects the current node
  • .. — selects the parent of the current node
  • @name — selects the attribute with the given name
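
A minimal lxml sketch of these constructs:

from lxml import etree

doc = etree.fromstring('<shop><item id="1"><name>Tea</name></item></shop>')
print(doc.xpath('/shop/item/name/text()'))  # path from the root: ['Tea']
print(doc.xpath('//name/text()'))           # anywhere in the document: ['Tea']
item = doc.xpath('//item')[0]
print(item.xpath('./name/text()'))          # relative to the current node: ['Tea']
print(item.xpath('..')[0].tag)              # the parent node: 'shop'
print(item.xpath('@id'))                    # the attribute: ['1']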
Also, when selecting nodes, wildcards can be used for cases when we do not know in advance which kind of node we need (an example follows the list):

  • * — matches any element node
  • @* — matches any attribute node
  • node() — matches any node of any kind
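
For example (again a sketch with lxml):

from lxml import etree

row = etree.fromstring('<row id="7"><name>Ann</name><age>30</age></row>')
print([el.tag for el in row.xpath('*')])  # any child element: ['name', 'age']
print(row.xpath('@*'))                    # any attribute of the current node: ['7']
print(len(row.xpath('//node()')))         # every node of any kind: 5 (row, name, 'Ann', age, '30')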
In the XPath language, special constructs called axes are used to select nodes relative to the current one (a sketch with several axes follows the list):

  • ancestor — all ancestors of the current node
  • ancestor-or-self — the ancestors and the current node itself
  • attribute — the attributes of the current node
  • child — the children of the current node
  • descendant — all descendants of the current node
  • descendant-or-self — the descendants and the current node itself
  • following — everything in the document after the closing tag of the current node
  • following-sibling — the siblings that follow the current node
  • namespace — the namespace nodes of the current node
  • parent — the parent of the current node
  • preceding — everything in the document before the current node, excluding its ancestors
  • preceding-sibling — the siblings that precede the current node
  • self — the current node itself
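
A small lxml sketch with a few of these axes:

from lxml import etree

doc = etree.fromstring('<body><div><p>one</p><p>two</p></div></body>')
p = doc.xpath('//p[1]')[0]
print(p.xpath('ancestor::*[1]')[0].tag)        # the nearest ancestor: 'div'
print(p.xpath('following-sibling::p/text()'))  # the next sibling's text: ['two']
print(p.xpath('self::p/text()'))               # the current node's text: ['one']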
The selection rule can be either absolute (//input[@placeholder="Login"] — selection starting from the root node) or relative (*[@class="okved-table__code"] — selection relative to the current node).



A selection rule is built step by step; each location step is evaluated relative to the current node and takes into account:



  • the name of the axis along which to select
  • a node test that filters nodes by name or type
  • zero or more predicates


In general, the syntax of a single location step is:



axisname::nodetest[predicate]
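
For example, here are the full and the abbreviated forms of the same location step (an lxml sketch):

from lxml import etree

doc = etree.fromstring('<ul><li>a</li><li>b</li><li>c</li></ul>')
# full (unabbreviated) form: axis::node-test[predicate]
print(doc.xpath('child::li[position()=2]/text()'))  # ['b']
# the equivalent abbreviated form
print(doc.xpath('li[2]/text()'))                    # ['b']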


Predicates are the tool for selecting nodes that satisfy certain conditions on values, parameters, or positions. The predicate condition is enclosed in square brackets. Examples (a runnable sketch follows the list):

  • //li[1] — selects the first li element
  • //li[last()] — selects the last li element
  • //li[position()<3] — selects the first two li elements
  • //input[@name] — selects input elements that have a name attribute
  • //input[@name="login"] — selects input elements whose name attribute equals "login"
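
The same predicates can be tried out with lxml:

from lxml import etree

form = etree.fromstring(
    '<form><input name="login"/><input name="password"/><input type="submit"/></form>'
)
print(form.xpath('//input[1]/@name'))        # ['login']: the first input
print(form.xpath('//input[last()]/@type'))   # ['submit']: the last input
print(len(form.xpath('//input[@name]')))     # 2: inputs that have a name attribute
print(form.xpath('//input[@name="login"]'))  # inputs whose name equals "login"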
In addition to the constructs above, the XPath language supports a number of operators (+, -, *, div, mod, =, !=, and, or, etc.) as well as over 200 built-in functions.
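
A short sketch of the operators and a couple of the built-in functions (lxml again):

from lxml import etree

doc = etree.fromstring('<items><i price="5"/><i price="12"/><i price="20"/></items>')
# arithmetic, comparison and logical operators inside a predicate
print(len(doc.xpath('//i[@price > 10 and @price mod 2 = 0]')))  # 2
# a couple of the built-in functions
print(doc.xpath('count(//i)'))             # 3.0
print(doc.xpath('string(//i[1]/@price)'))  # '5'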



Let's look at a practical example. We need to extract information about the dates of life of a certain list of people. For this we will use the probate case search on the notariat.ru service.



First, we import the dependencies:



from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from multiprocessing import Pool
from retry import retry
import itertools, time, pprint, os, re, traceback, sys, datetime
import pandas as pd, numpy as np, multiprocessing as mp


Loading data on people:



df_people = pd.read_excel('people.xlsx')


Next, a function that extracts information from the pages with the search results:



def find_persons(driver, name, birth_date):
    base_url = 'https://notariat.ru/ru-ru/help/probate-cases/'
    # open the probate case search page
    driver.get(base_url)
    # NB: the Russian UI texts inside the quoted XPaths below (field placeholders,
    # the button caption) were stripped from this listing and must be restored
    # to match the live form on notariat.ru
    # enter the person's full name into the search field
    driver.find_element(By.XPATH, '//input[@name="name"]').send_keys(name)
    # open the drop-down list with the day of birth
    driver.find_element(By.XPATH, '//select[@data-placeholder=""]/following::div/a').click()
    # pick the required day in the drop-down list
    driver.find_element(By.XPATH, '//select[@data-placeholder=""]/following::div//li[@data-option-array-index={}]'.format(birth_date.day)).click()
    # open the drop-down list with the month of birth
    driver.find_element(By.XPATH, '//select[@data-placeholder=""]/following::div/a').click()
    # pick the required month in the drop-down list
    driver.find_element(By.XPATH, '//select[@data-placeholder=""]/following::div//li[@data-option-array-index={}]'.format(birth_date.month)).click()
    # enter the year of birth
    driver.find_element(By.XPATH, '//input[@placeholder=""]').send_keys(str(birth_date.year))
    # click the search button
    driver.find_element(By.XPATH, '//*[contains(., " ")]').click()
    # wait up to 20 seconds until the results list with the class
    # "probate-cases__result-list" becomes visible
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CLASS_NAME, "probate-cases__result-list")))
    time.sleep(2)
    
    # determine how many pages of results were returned
    max_pages = 1
    pages_counters = driver.find_elements(By.XPATH, '//a[@class="pagination__item-content"]')
    if pages_counters:
        max_pages = int(pages_counters[-1].text)
    
    data = []
    def parse_page_data():
        # every result is an <li> item of the list with the class "probate-cases__result-list"
        lines = driver.find_elements(By.XPATH, '//ol[@class="probate-cases__result-list"]/li')
        for line in lines:
            # the <h4> tag holds the full name; normalize it to Title Case
            name = ' '.join(map(lambda el: el[0].upper() + el[1:].lower(), line.find_element(By.XPATH, './/h4').text.split()))
            # the <p> tag holds the date of death after a colon
            death_date = datetime.datetime.strptime(line.find_element(By.XPATH, './/p').text.split(':')[-1].strip(), '%d.%m.%Y')
            data.append((name, birth_date, death_date))
    # walk through the pages with the results
    if max_pages == 1:
        parse_page_data()  # a single page: parse it right away
    else:
        for page_num in range(1, max_pages + 1):
            # click the pagination link with the given page number and let the page refresh
            driver.find_element(By.XPATH, '//li[./a[@class="pagination__item-content" and text()="{}"]]'.format(page_num)).click()
            time.sleep(0.2)
            # collect the entries from the current page
            parse_page_data()
    return data


We perform searches using the multiprocessing module to speed up data collection.



def parse_persons(persons_data_chunk, pool_num):
    # start Chrome in headless mode with a fixed window size (without it the DOM
    # rendered by notariat.ru differs and the XPath selectors break)
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--window-size=1920,1080")
    driver = webdriver.Chrome(options=chrome_options)
    driver.set_page_load_timeout(20)
    data = [] 
    print(pool_num, 'started')
    # process every person from the chunk assigned to this worker
    for ind, (person_name, person_date) in enumerate(persons_data_chunk, start=1):
        print('pool:', pool_num, ', person: ', ind, '/', len(persons_data_chunk))
        try:
            data.extend(find_persons(driver, person_name, person_date))
        except Exception as e:
            print(pool_num, 'failed to load', person_name, person_date, "error:", e)
            traceback.print_exception(*sys.exc_info()) 
    print(pool_num, 'done')
    return data

def parse(people_data, parts=5):
    p = mp.Pool(parts)
    # split the list of people into chunks, one chunk per worker process
    people_in_chunks = np.array_split(people_data, parts if parts < len(people_data) else 1) or []
    all_data = p.starmap(parse_persons, zip(people_in_chunks, range(parts)))
    out = []
    for el in all_data:
        out.extend(el)
    return out
# every row of the spreadsheet is expected to hold a (name, birth date) pair
parsed_data = parse(df_people.values)




And we save the results:



df = pd.DataFrame({
    'Name': list(map(lambda el: el[0], parsed_data)),
    'Birth date': list(map(lambda el: el[1], parsed_data)),
    'Death date': list(map(lambda el: el[2], parsed_data))
})
df.to_excel('results.xlsx', index=False)


The figure below shows the case search page, where the full name and date of birth to search for are entered. After entering them, the algorithm clicks the search button and then analyzes the results.







The next figure shows the result list whose elements the algorithm parses.







The example above showed how XPath can be used to collect information from web pages. But, as already mentioned, XPath is applicable to processing any XML document: it is the industry standard for addressing elements of XML and XHTML and lies at the heart of XSLT transformations.
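
As a closing illustration of the XSLT point, here is a minimal sketch (assuming lxml; the document and element names are invented for the example):

from lxml import etree

# XPath expressions drive the match/select attributes of XSLT templates
stylesheet = etree.XSLT(etree.fromstring('''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <active>
      <xsl:for-each select="//person[@active='true']">
        <name><xsl:value-of select="@name"/></name>
      </xsl:for-each>
    </active>
  </xsl:template>
</xsl:stylesheet>'''))
doc = etree.fromstring('<people><person name="Ann" active="true"/>'
                       '<person name="Bob" active="false"/></people>')
print(str(stylesheet(doc)))  # only the active person remains in the output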



Readability strongly affects code quality, so consider abandoning regular expressions for parsing markup: learn XPath and start applying it in your workflow. Your code will become easier to understand, you will make fewer mistakes, and you will spend less time debugging.


