Parsing Wikipedia, with filtering, for NLP tasks in 44 lines of code

In this post I would like to supplement that article and show how to use WikiExtractor more flexibly by filtering Wikipedia articles by category.



It all started with the fact that I needed definitions for various terms. A term and its definition are usually the first sentence of the corresponding Wikipedia page. Taking the simplest path, I extracted all the articles and quickly grabbed what I needed with regular expressions. The problem is that the extracted definitions took up more than 500 MB and contained a lot of things I didn't need: named entities, cities, years, and so on.



I rightly assumed that the WikiExtractor tool (I will use a different version; the link is below) has some kind of filter, and it turned out to be a category filter. Categories are tags attached to articles, with a hierarchical structure for organizing pages. I happily set the category to "Exact Sciences", naively believing that every article related to the exact sciences would end up in the list, but the miracle did not happen: each page carries only its own tiny set of categories, and no single page tells you how these categories relate to each other. This means that if I want pages on the exact sciences, I have to list every category that is a descendant of "Exact Sciences".



Fine, I thought, I'll just find a service that will hand me all the categories reachable from a given starting point. Unfortunately, I only found this, where you can merely look at how the categories are related. An attempt to walk the categories by hand also failed, but I was "pleased" to discover that these categories form not a tree, as I had assumed all along, but a directed graph with cycles. Moreover, the hierarchy itself is very loose: I'll say right away that starting from "Mathematics", you can easily reach Alexander I. So all that remained was to reconstruct this graph locally and somehow extract the list of categories I was interested in.



A few words about what follows: we will need a Wikipedia dump plus a few SQL tables, from which we will rebuild the category graph locally, traverse it to collect the categories we need, and feed the resulting list to WikiExtractor.



Everything below was written and tested on Ubuntu 16.04, but it will most likely work on 18.04 as well.





To begin with, we need the data. Download the following dumps (a small download sketch follows the list):



  • ruwiki-latest-pages-articles.xml.bz2
  • ruwiki-latest-categorylinks.sql.gz
  • ruwiki-latest-category.sql.gz
  • ruwiki-latest-page.sql.gz
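
A minimal sketch of fetching them, assuming the standard dump layout at dumps.wikimedia.org (swap the ruwiki prefix for another language edition if needed):

# Assumed: the dumps live under https://dumps.wikimedia.org/ruwiki/latest/
from urllib.request import urlretrieve

BASE = "https://dumps.wikimedia.org/ruwiki/latest/"
FILES = [
    "ruwiki-latest-pages-articles.xml.bz2",
    "ruwiki-latest-categorylinks.sql.gz",
    "ruwiki-latest-category.sql.gz",
    "ruwiki-latest-page.sql.gz",
]
for name in FILES:
    print("downloading", name)
    urlretrieve(BASE + name, name)  # several GB in total, this takes a while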


The categorylinks table contains, for every page, the categories it belongs to, i.e. the [[Category:Title]] links found in its wiki markup. The cl_from column holds the id of the page, and cl_to holds the title of the category it links to. An id by itself tells us nothing, so we also need the page table, which maps page_id to page_title. Keep in mind that category pages are themselves ordinary pages, so their titles are listed there too. The category table lists all existing categories in its cat_title column. Finally, pages-articles.xml is the dump with the article texts themselves.



The easiest way to work with the SQL dumps is MySQL. Install the server and the client:



sudo apt-get install mysql-server mysql-client


Then log in to mysql and create a separate database for each dump.



$ mysql -u username -p
mysql> create database category;
mysql> create database categorylinks;
mysql> create database page;


Now unpack the .sql.gz archives and import the dumps into the corresponding databases. The import can take quite a while.
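
A minimal sketch of the unpacking step (plain gunzip from the shell works just as well):

# unpack the three .sql.gz dumps before feeding them to mysql
import gzip, shutil

for name in ["ruwiki-latest-category.sql",
             "ruwiki-latest-categorylinks.sql",
             "ruwiki-latest-page.sql"]:
    with gzip.open(name + ".gz", "rb") as src, open(name, "wb") as dst:
        shutil.copyfileobj(src, dst)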



$ mysql -u username -p category < ruwiki-latest-category.sql
$ mysql -u username -p categorylinks < ruwiki-latest-categorylinks.sql
$ mysql -u username -p page < ruwiki-latest-page.sql




Now export the pairs we need (category, parent category) into a csv file: join categorylinks with page on cl_from = page_id, and keep only the rows whose page is itself a category, i.e. whose title appears in the category table.



mysql> select page_title, cl_to from categorylinks.categorylinks
    -> join page.page on cl_from = page_id
    -> where page_title in (select cat_title from category.category)
    -> into outfile '/var/lib/mysql-files/category.csv'
    -> fields terminated by ';' enclosed by '"' lines terminated by '\n';


The result ends up in /var/lib/mysql-files/category.csv; copy it to your working directory and we can move on to Python.





Before rebuilding the graph, it is worth cleaning the data: it is full of service categories that we do not need. The filtering boils down to dropping every row whose parent matches one of a set of patterns (hidden maintenance categories, categories by year, and the like); in my case this reduced the data from roughly 1.6 million to 1.1 million rows. After that we assemble the graph itself.



import pandas as pd
import networkx as nx
from tqdm.auto import tqdm

# Filtering
# The csv has no header row: the first column (page_title) is the child
# category, the second (cl_to) is its parent.
df = pd.read_csv("category.csv", sep=";", error_bad_lines=False,
                 names=["child", "parant"])
df = df.dropna()

# Drop service categories (hidden maintenance categories, categories by year
# and so on) by matching the parent against a list of regexes.
service_patterns = [
    # put your patterns here, one regex per entry
]
df_filtered = df
for pattern in service_patterns:
    df_filtered = df_filtered[df_filtered.parant.str.contains(pattern) != True]

# Graph recovery: every edge goes from a parent category to a child category
G = nx.DiGraph()
for _, gr in tqdm(df_filtered.groupby('child')):
    for _, r in gr.iterrows():
        G.add_node(r.parant, color="white")   # "white" = not visited yet
        G.add_node(r.child, color="white")
        G.add_edge(r.parant, r.child)
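
A quick sanity check of the assembled graph also confirms the point above: the category hierarchy is a directed graph with cycles, not a tree:

# how big did the graph get, and is it acyclic? (for ruwiki it is not)
print(G.number_of_nodes(), G.number_of_edges())
print(nx.is_directed_acyclic_graph(G))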




Now we traverse the graph depth-first from the starting category, limiting the depth, and collect every category we visit; the node colors keep us from going around in circles.



counter = 0   # current depth of the traversal
nodes = []    # every category visited so far

def dfs(G, node, max_depth):
    global nodes, counter
    G.nodes[node]['color'] = 'gray'   # mark as visited
    nodes.append(node)
    counter += 1
    if counter == max_depth:          # depth limit reached, do not go deeper
        counter -= 1
        return
    for v in G.successors(node):
        if G.nodes[v]['color'] == 'white':    # not visited yet
            dfs(G, v, max_depth)
        elif G.nodes[v]['color'] == 'gray':   # already visited, skip to avoid cycles
            continue
    counter -= 1


After the traversal, the nodes list contains all the categories we visited. Starting from "Математика" with a depth limit of 5, I got about 2500 categories. The list is far from perfect: alongside genuinely mathematical categories it picks up rather distant ones, because, as noted above, the category graph is loosely connected and full of cycles, so the result may still need some manual cleaning. Still, it is good enough to hand to WikiExtractor.
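
A minimal sketch of how the traversal can be driven and the result saved; the starting category and depth are the ones mentioned above, and the output file name categories is an assumption matching the --filter_category call below:

dfs(G, "Математика", 5)   # start category and depth limit as described above

# one category per line — this file is later passed to WikiExtractor
with open("categories", "w", encoding="utf-8") as f:
    f.write("\n".join(nodes))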




However, to apply these categories as a filter for Russian Wikipedia you need to tweak the sources a little. I used this version; it has since been updated, so the fixes below may no longer be relevant. In WikiExtractor.py you need to replace "Category" with "Категория" in two places. The fragments with the fix already applied are shown below:




tagRE = re.compile(r'(.*?)<(/?\w+)[^>]*?>(?:([^<]*)(<.*?>)?)?')
#                    1     2               3      4
keyRE = re.compile(r'key="(\d*)"')
catRE = re.compile(r'\[\[Категория:([^\|]+).*\]\].*')  # capture the category name [[Категория:Category name|Sortkey]]

def load_templates(file, output_file=None):
...


if inText:
    page.append(line)
    # extract categories
    if line.lstrip().startswith('[[Категория:'):
        mCat = catRE.search(line)
        if mCat:
            catSet.add(mCat.group(1))


After that, you need to run the command



python WikiExtractor.py --filter_category categories --output wiki_filtered ruwiki-latest-pages-articles.xml


where categories is the file with the list of categories; the filtered articles will end up in wiki_filtered.

That's all. Thank you for your attention.



