Collecting training data for NLP tasks

Selecting a source and implementation tools





As a source of information, I decided to use habr.com - a collective blog with elements of a news site (it publishes news, analytical articles, and articles on information technology, business, the Internet, etc.). On this resource, all materials are divided into categories (hubs), of which there are 416 main ones. Each material can belong to one or more categories.





For the implementation (the parsing), I chose Python; the development environment is a Jupyter notebook in Google Colab. The following libraries were used:





  • BeautifulSoup – for parsing HTML/XML documents;





  • Requests – for sending HTTP requests;





  • Re – for working with regular expressions;





  • Pandas – for working with tabular data.





Additionally, the tqdm and ratelim libraries were used (for progress display and request rate limiting, respectively).
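For reference, the imports used across the snippets below would look roughly like this (the bs alias for BeautifulSoup is inferred from the parsing code further down):

import re

import pandas as pd
import ratelim
import requests
from bs4 import BeautifulSoup as bs
from tqdm import tqdm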









To begin, set the basic parameters: the base URL of the posts and the number of posts to download:





mainUrl = 'https://habr.com/ru/post/'
postCount = 10000
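The get_post function below appends rows to a global pandas DataFrame. Its creation is not shown in the original text, but judging by the fields collected, it would be initialized roughly like this (the column names are my assumption):

habrParse_df = pd.DataFrame(columns=['postNum', 'url', 'title', 'post',
                                     'numComment', 'rating', 'ratingUp',
                                     'ratingDown', 'bookMark', 'views'])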
      
      



Some posts may have been deleted, hidden, or moved, so a request can fail. Such errors are handled with a try… except block and the exception classes of requests. The function for downloading a single post:





@ratelim.patient(1, 1)  # no more than one request per second
def get_post(postNum):
    currPostUrl = mainUrl + str(postNum)
    try:
        response = requests.get(currPostUrl)
        response.raise_for_status()
        (response_title, response_post, response_numComment, response_rating,
         response_ratingUp, response_ratingDown, response_bookMark,
         response_views) = executePost(response)
        dataList = [postNum, currPostUrl, response_title, response_post,
                    response_numComment, response_rating, response_ratingUp,
                    response_ratingDown, response_bookMark, response_views]
        # append the collected fields as a new row of the global DataFrame
        habrParse_df.loc[len(habrParse_df)] = dataList
    except requests.exceptions.HTTPError:
        pass  # the post is deleted or hidden - skip it
      
      



The @ratelim.patient(1, 1) decorator limits the function to one call per second, so the site is not flooded with requests. The try block keeps the collection running when a request fails: a post that returns an HTTP error is simply skipped.
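A minimal sketch of the decorator's effect (assuming ratelim's patient decorator behaves as used above, allowing one call per one-second window):

import time
import ratelim

@ratelim.patient(1, 1)  # at most one call per one-second window
def ping(i):
    print(i, time.strftime('%H:%M:%S'))

for i in range(3):
    ping(i)  # timestamps show the calls spaced roughly a second apart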





The executePost function extracts the needed fields from a downloaded page:





def executePost(page):
    soup = bs(page.text, 'html.parser')
    # article title
    title = soup.find('meta', property='og:title')
    title = str(title).split('="')[1].split('" ')[0]
    # article body (raw HTML, with newlines collapsed)
    post = str(soup.find('div', id='post-content-body'))
    post = re.sub('\n', ' ', post)
    # number of comments
    num_comment = soup.find('span', id='comments_count').text
    num_comment = int(re.sub('\n', '', num_comment).strip())
    # panel with the post statistics
    info_panel = soup.find('ul', attrs={'class': 'post-stats post-stats_post js-user_'})
    # rating: the counter's CSS class differs for zero, positive and negative scores
    try:
        rating = int(info_panel.find('span', attrs={'class': 'voting-wjt__counter js-score'}).text)
    except AttributeError:
        rating = info_panel.find('span', attrs={'class': 'voting-wjt__counter voting-wjt__counter_positive js-score'})
        if rating:
            rating = int(re.sub(r'\+', '', rating.text))
        else:
            rating = info_panel.find('span', attrs={'class': 'voting-wjt__counter voting-wjt__counter_negative js-score'}).text
            rating = -int(re.sub('–', '', rating))
    # up- and downvote counts from the counter's tooltip (e.g. "... 72: ↑67 ↓5")
    vote = info_panel.find_all('span')[0].attrs['title']
    rating_upVote = int(re.search(r'↑(\d+)', vote).group(1))
    rating_downVote = int(re.search(r'↓(\d+)', vote).group(1))
    # number of bookmarks
    bookmk = int(info_panel.find_all('span')[1].text)
    # number of views
    views = info_panel.find_all('span')[3].text
    return title, post, num_comment, rating, rating_upVote, rating_downVote, bookmk, views
      
      



The page is parsed with BeautifulSoup: soup = bs(page.text, 'html.parser'). The find / find_all methods then locate elements by tag name and attributes (id, class, and other HTML attributes). To figure out which tags and attributes identify the needed data, inspect the page's HTML source in the browser's developer tools.
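To illustrate the difference: find returns only the first matching element, while find_all returns all of them. A toy example, unrelated to habr.com's markup:

from bs4 import BeautifulSoup as bs

html = '<ul><li class="item">first</li><li class="item">second</li></ul>'
soup = bs(html, 'html.parser')
print(soup.find('li', attrs={'class': 'item'}).text)   # first match only: 'first'
print([li.text for li in soup.find_all('li')])         # all matches: ['first', 'second']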





Now start the collection itself (the parsing). At no more than one request per second, downloading the first 10 thousand posts takes a noticeable amount of time, so the tqdm wrapper is used to display a progress bar:





for pc in tqdm(range(postCount)):
    postNum = pc + 1  # post numbering on habr.com starts at 1
    get_post(postNum)
      
      



Finally, the resulting pandas DataFrame is saved to a file:
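The saving snippet itself is missing from the text; a typical one-liner (the file name is my choice) would be:

habrParse_df.to_csv('habrParse.csv', index=False)

The index=False flag keeps the DataFrame's internal row index out of the file.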





As a result, I got a dataset containing the texts of articles from habr.com, along with additional information: the title, the link to the article, the number of comments, the rating, the number of bookmarks, and the number of views.





In the future, the resulting dataset can be enriched with additional data and used for training various language models, text classification, and other NLP tasks.







