Parsing and audit





Let's start with an idea. Suppose you, as a real auditor, want to examine a dog breeder's reporting, relying, among other things, on third-party resources. To do this, you need to collect systematized information about the breeder's puppies, knowing, say, only the names of their breeds, and build a Pandas table from it that is suitable for any further processing (statistical research, aggregation, and so on). But your data lives in the depths of some abstract website, from which you can only pull it out as an archive stuffed with documents of several formats, containing text, pictures and tables. And what if there are many puppy breeds, each with a dozen pdf files full of tables, from which you do not need all the information, and certainly not the table captions or footnotes? Let's add several functions to our project that solve the following tasks: downloading and unpacking the archive with the data, finding and processing the pdf files from the archive, and analyzing the extracted data.



First, let's import everything we need. We'll split the libraries into standard ones:



import os
import re
import glob
import csv
import shutil


and external ones that require installation (pip install, as I said):



import requests as req
import pandas as pd
from zipfile import ZipFile
import tabula
import PyPDF2
from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image, ImageDraw


Now, for each puppy, we will download a large archive of data from the site, referring to it by the name of its breed:



def get_doquments_archive(breed):
    url = 'https://yourwebsite' + breed + '/document/download'
    with req.get(url, stream=True) as r:
        r.raise_for_status()
        with open('/Users/user/Desktop/' + breed + '.zip', 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
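

For example, assuming a hypothetical breed name (the site URL above is only a placeholder anyway), the call might look like this:


# hypothetical breed name; after the call, labrador.zip should appear on the desktop
get_doquments_archive('labrador')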


We now have an archive on our desktop. Let's unpack it; for this we only need the path to the archive file:



def unzipper(zippath, cond=False):
    dirpath = zippath[:-4] + '_package'
    # recreate the target folder from scratch
    if os.path.exists(dirpath) and os.path.isdir(dirpath):
        shutil.rmtree(dirpath)
    os.mkdir(dirpath)
    with ZipFile(zippath, 'r') as zipObj:
        zipObj.extractall(path=dirpath)
    return dirpath  # so the caller knows where the files ended up
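

Chaining the two steps is now straightforward. A minimal sketch, with the breed name and desktop path being the same placeholders as above (unzipper derives the folder name by swapping the '.zip' suffix for '_package'):


breed = 'labrador'  # hypothetical
zippath = '/Users/user/Desktop/' + breed + '.zip'
dirpath = unzipper(zippath)  # path to the unpacked folder, e.g. .../labrador_package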


At this step we get a folder of documents that may contain pdf, csv, xls, png and other nice things. Let's say we want to process several pdf files that contain tables with data. But how do we get them out of there? First, let's select the documents of the required format from the folder:



all_pdfs = glob.glob(dirpath + '/*_pd*.pdf')


Excellent. We now have a bunch of files with text and tables inside. When you try to extract information from them, it may turn out that the tools recognize such a mixture rather poorly, especially if tables are glued to each other and their titles or footnotes are separate text. Tabula comes to the rescue! But first, let's pull out of the first page of each document the short textual description that is not part of the table (such text can be a problem for tabula). Since the first page can also contain a table, let's use a trick:



def get_text_description(path):
    pdfFileObj = open(path, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # render only the first page of the pdf as an image
    pages = convert_from_path(path, first_page=1, last_page=1)
    page = pages[0]
    pname = '/Users/user/Desktop/text_description.png'
    page.save(pname, 'PNG')
    # recognize the text on that page with tesseract
    text = image_to_string(Image.open(pname), lang='rus')
    text_file = open('res', 'w')
    text_file.write(text)
    text_file.close()
    return text
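

Note that pdf2image relies on the poppler utilities and pytesseract on the tesseract binary (with the Russian language data installed, since we pass lang = 'rus'); both have to be present on the machine. A quick, illustrative run over the pdfs we found earlier:


for pdf_path in all_pdfs:
    description = get_text_description(pdf_path)
    print(description[:200])  # peek at the recognized header text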


Now let's start working with the tables. If you're lucky and the table in the pdf is quite readable, tabula will correctly export it to csv, so you won't even have to parse anything by hand:



tabula.convert_into(file, 'output_file.csv', output_format = "csv", pages = 'all')
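

If you would rather skip the intermediate csv, tabula-py can also hand the tables back as pandas DataFrames directly; a sketch under the same assumption about the file variable:


# returns a list of DataFrames, one per table that tabula manages to detect
tables = tabula.read_pdf(file, pages='all', multiple_tables=True)
for t in tables:
    print(t.shape)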


See how simple it now is to get, for example, data on the temperament of the selected puppy:



data = pd.read_csv('/Users/user/Desktop/output_file.csv')
temperament = data[data[''] == '']['']


But what if the author of the document glued tables together, gave rows a different number of columns, or mixed them with text? Then we will convert the file produced by tabula into a new format:



def get_table_data(path):
    data = []
    with open(path) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            for val in row.values():
                data.append(val)
    # flatten everything into one string and strip the list punctuation
    data = str(data)
    data = re.sub(r"[\[\],']", '', data)
    data = data.replace("\\n", "")
    return data
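

A purely illustrative run on a tiny, made-up csv shows what the flattening produces:


# hypothetical two-column csv, written just for the demo
with open('demo.csv', 'w') as f:
    f.write('parameter,value\ncolors,black fawn\nweight,30 kg\n')
print(get_table_data('demo.csv'))  # -> colors black fawn weight 30 kg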


Why? Because this lets you search for the information you need quickly and painlessly with regular expressions. Say we want to find the set of possible colors of a breed:



def get_colors(data):
    res = re.search('^: (.*)', data).group(1)
    return res
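

The label that the pattern anchors on is left blank in the original, so here is the same idea applied to the toy string from the demo above, with a made-up label:


flat = 'colors black fawn weight 30 kg'   # output of the demo above
print(re.search('colors (.*) weight', flat).group(1))  # -> black fawn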


We have now accumulated a certain amount of information from the files for a single puppy (for example, temperament, colors, weight). Let's add it to the pandas dataframe as a new row:



def append_new_row(dataframe, breed, temperament, colors, weight):
    return dataframe.append({'': breed,
                             '': temperament,
                             '': colors,
                             '': weight
                             }, ignore_index=True)
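

One caveat: DataFrame.append was deprecated and then removed in pandas 2.0, so on recent versions the same row is added with pd.concat. A sketch with illustrative column names (the original leaves them blank):


def append_new_row_v2(dataframe, breed, temperament, colors, weight):
    # column names here are made up for the example
    new_row = pd.DataFrame([{'breed': breed, 'temperament': temperament,
                             'colors': colors, 'weight': weight}])
    return pd.concat([dataframe, new_row], ignore_index=True)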


What we have now is a dataframe in which each processed breed occupies a row with its temperament, colors and weight.







So, we downloaded an archive of data from the site, unpacked it, picked out the documents we needed, pulled the important information from them and brought it into a convenient format. Now this data can be compared with the figures provided by the company, transformed, analyzed and much more! Far more convenient than downloading and copying everything out by hand.



Finally, let's clean up after ourselves and remove both the downloaded archive and the unpacked folder:


def clean_all(path):
    os.remove(path + '.zip')
    shutil.rmtree(path + '_package')
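

Putting the pieces together, an end-to-end pass over several breeds could look roughly like this (the breed list and the desktop paths are the same placeholders as above, and the actual field extraction is only hinted at):


breeds = ['labrador', 'beagle']   # hypothetical audit scope
df = pd.DataFrame()
for breed in breeds:
    get_doquments_archive(breed)
    base = '/Users/user/Desktop/' + breed
    dirpath = unzipper(base + '.zip')
    for pdf in glob.glob(dirpath + '/*_pd*.pdf'):
        tabula.convert_into(pdf, 'output_file.csv', output_format='csv', pages='all')
        flat = get_table_data('output_file.csv')
        # ... pull temperament, colors, weight out of flat with regexes ...
    # df = append_new_row(df, breed, temperament, colors, weight)
    clean_all(base)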


It is important that your actions remain completely legal. You may take data from websites, but you may not steal content. You may download automatically, but you must not bring the server down. Study copyright law and the Criminal Code of the Russian Federation, and do not cause damage.


