Collecting training data for NLP tasks

Selecting a source and implementation tools





As a source of information, I decided to use habr.com - a collective blog with elements of a news site (it publishes news, analytical articles, and articles on information technology, business, the Internet, etc.). On this resource, all materials are divided into categories (hubs), of which there are 416 main ones. Each material can belong to one or more categories.





For the implementation (the parsing), I chose Python; the development environment is a Jupyter notebook in Google Colab. The following libraries were used:





  • BeautifulSoup – for parsing HTML/XML documents;





  • Requests – for sending HTTP requests;





  • Re – for working with regular expressions;





  • Pandas – for working with tabular data.





Additionally, the tqdm and ratelim libraries were used (for progress display and request rate limiting, respectively).
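For reference, the imports used across the snippets below would look roughly like this (the bs alias for BeautifulSoup is inferred from the parsing code further down):

import re

import pandas as pd
import ratelim
import requests
from bs4 import BeautifulSoup as bs
from tqdm import tqdm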









To begin, set the basic parameters: the base URL of the posts and the number of posts to download:





mainUrl = 'https://habr.com/ru/post/'
postCount = 10000
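The get_post function below appends rows to a global pandas DataFrame. Its creation is not shown in the original text, but judging by the fields collected, it would be initialized roughly like this (the column names are my assumption):

habrParse_df = pd.DataFrame(columns=['postNum', 'url', 'title', 'post',
                                     'numComment', 'rating', 'ratingUp',
                                     'ratingDown', 'bookMark', 'views'])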
      
      



Some posts may have been deleted, hidden, or moved, so a request can fail. Such errors are handled with a try… except block and the exception classes of requests. The function for downloading a single post:





@ratelim.patient(1, 1)  # no more than one request per second
def get_post(postNum):
    currPostUrl = mainUrl + str(postNum)
    try:
        response = requests.get(currPostUrl)
        response.raise_for_status()
        (response_title, response_post, response_numComment, response_rating,
         response_ratingUp, response_ratingDown, response_bookMark,
         response_views) = executePost(response)
        dataList = [postNum, currPostUrl, response_title, response_post,
                    response_numComment, response_rating, response_ratingUp,
                    response_ratingDown, response_bookMark, response_views]
        # append the collected fields as a new row of the global DataFrame
        habrParse_df.loc[len(habrParse_df)] = dataList
    except requests.exceptions.HTTPError:
        pass  # the post is deleted or hidden - skip it
      
      



The @ratelim.patient(1, 1) decorator limits the function to one call per second, so the site is not flooded with requests. The try block keeps the collection running when a request fails: a post that returns an HTTP error is simply skipped.
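A minimal sketch of the decorator's effect (assuming ratelim's patient decorator behaves as used above, allowing one call per one-second window):

import time
import ratelim

@ratelim.patient(1, 1)  # at most one call per one-second window
def ping(i):
    print(i, time.strftime('%H:%M:%S'))

for i in range(3):
    ping(i)  # timestamps show the calls spaced roughly a second apart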





The executePost function extracts the needed fields from a downloaded page:





def executePost(page):
    soup = bs(page.text, 'html.parser')
    # article title
    title = soup.find('meta', property='og:title')
    title = str(title).split('="')[1].split('" ')[0]
    # article body (raw HTML, with newlines collapsed)
    post = str(soup.find('div', id='post-content-body'))
    post = re.sub('\n', ' ', post)
    # number of comments
    num_comment = soup.find('span', id='comments_count').text
    num_comment = int(re.sub('\n', '', num_comment).strip())
    # panel with the post statistics
    info_panel = soup.find('ul', attrs={'class': 'post-stats post-stats_post js-user_'})
    # rating: the counter's CSS class differs for zero, positive and negative scores
    try:
        rating = int(info_panel.find('span', attrs={'class': 'voting-wjt__counter js-score'}).text)
    except AttributeError:
        rating = info_panel.find('span', attrs={'class': 'voting-wjt__counter voting-wjt__counter_positive js-score'})
        if rating:
            rating = int(re.sub(r'\+', '', rating.text))
        else:
            rating = info_panel.find('span', attrs={'class': 'voting-wjt__counter voting-wjt__counter_negative js-score'}).text
            rating = -int(re.sub('–', '', rating))
    # up- and downvote counts from the counter's tooltip (e.g. "... 72: ↑67 ↓5")
    vote = info_panel.find_all('span')[0].attrs['title']
    rating_upVote = int(re.search(r'↑(\d+)', vote).group(1))
    rating_downVote = int(re.search(r'↓(\d+)', vote).group(1))
    # number of bookmarks
    bookmk = int(info_panel.find_all('span')[1].text)
    # number of views
    views = info_panel.find_all('span')[3].text
    return title, post, num_comment, rating, rating_upVote, rating_downVote, bookmk, views
      
      



The page is parsed with BeautifulSoup: soup = bs(page.text, 'html.parser'). The find / find_all methods then locate elements by tag name and attributes (id, class, and other HTML attributes). To figure out which tags and attributes identify the needed data, inspect the page's HTML source in the browser's developer tools.
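To illustrate the difference: find returns only the first matching element, while find_all returns all of them. A toy example, unrelated to habr.com's markup:

from bs4 import BeautifulSoup as bs

html = '<ul><li class="item">first</li><li class="item">second</li></ul>'
soup = bs(html, 'html.parser')
print(soup.find('li', attrs={'class': 'item'}).text)   # first match only: 'first'
print([li.text for li in soup.find_all('li')])         # all matches: ['first', 'second']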





Now start the collection itself (the parsing). At no more than one request per second, downloading the first 10 thousand posts takes a noticeable amount of time, so the tqdm wrapper is used to display a progress bar:





for pc in tqdm(range(postCount)):
    postNum = pc + 1  # post numbering on habr.com starts at 1
    get_post(postNum)
      
      



Finally, the resulting pandas DataFrame is saved to a file:
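The saving snippet itself is missing from the text; a typical one-liner (the file name is my choice) would be:

habrParse_df.to_csv('habrParse.csv', index=False)

The index=False flag keeps the DataFrame's internal row index out of the file.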





As a result, I got a dataset containing the texts of articles from habr.com, along with additional information: the title, the link to the article, the number of comments, the rating, the number of bookmarks, and the number of views.





In the future, the resulting dataset can be enriched with additional data and used for training various language models, text classification, and other NLP tasks.







