You can work with the dataset either in ordinary Excel or in a Jupyter notebook; the data fields are separated by tabs. We will focus on the latter option, and all commands below assume that the work is done in a Jupyter notebook.
We will work on Windows, so open cmd, go to the folder with the downloaded dataset, and start Jupyter with the jupyter notebook command.
Next, let's import the modules.
import pandas as pd
import numpy as np
Since the dataset does not contain headers, let's define them before loading it:
headers=['story_title','link','story_id','data_rating','data_timestamp','story_comments','data_author_id','data_meta_rating','user_name','user_link','story__community_link']
Everything is clear here: article title, link to it, article id, rating (number of pluses), article date, number of comments, author id, article meta rating, author name, link to author, link to the community.
Let's read in the dataset.
df = pd.read_csv('400k-pikabu.csv',
                 parse_dates=['data_timestamp'],
                 warn_bad_lines=True,
                 index_col=False,
                 dtype={'story_title': 'object', 'link': 'object',
                        'story_id': 'float32', 'data_rating': 'float32',
                        'story_comments': 'float32', 'data_author_id': 'float32'},
                 delimiter='\t',
                 names=headers)
The dtype argument is a small optimization of the values being read, so that some columns are treated as numeric.
The dataset contains 468,595 rows and 11 columns.
print(df.shape)  # (468595, 11)
First 5 records
df.head(5)
Statistical description:
df.describe()
Working with empty values in the dataset
Even though the parsers worked tirelessly, the dataset has small holes: missing values, which pandas represents as NaN. Let's count the rows with such gaps:
len(df.loc[pd.isnull( df['story_title'])])
Here is how it looks in the dataset:
df.loc[pd.isnull( df['story_title'])]
1,444 rows with gaps do not spoil the overall picture, but let's get rid of them anyway:
data1 = df.dropna(axis=0, thresh=5)  # keep only rows with at least 5 non-NaN values
We check that the deletion was successful:
len(data1.loc[pd.isnull(data1['story_id'])])
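As a side note, instead of thresh the rows could also be dropped by the specific columns we care about; a minimal sketch (data_alt is an illustrative name only and is not used later):
# Sketch: drop rows that are missing the key fields explicitly
data_alt = df.dropna(subset=['story_title', 'story_id'])
# Sanity check: remaining NaN count per column
print(data_alt.isnull().sum())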
Let's work with the dataset
Let's see the names of the columns
df.columns
Let's select the first column
col = df['story_title']
col
Let's look at the column-wise minimums in the dataset
data1.min()
Maximum
data1.max()
The same, but in a more readable form, for selected columns:
data1.loc[:,['user_name', 'data_rating', 'story_comments']].min()
Now let's collect the values from the columns of interest into an array:
arr = data1[['story_id', 'data_rating', 'data_timestamp','user_name']].values
You can look at one of the columns of the array:
arr[:, 1]  # the data_rating column
Let's look at the number of articles with a rating of over 10,000:
print((arr[:, 1] > 10000.0).sum())
Only 2,672 articles out of roughly 450k have such an ultra-high rating.
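The same count can be obtained directly from the DataFrame, without the intermediate NumPy array (a quick sketch):
# Count articles with a rating above 10,000 straight from the DataFrame
print((data1['data_rating'] > 10000).sum())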
Let's draw charts
First, let's import the module:
import matplotlib.pyplot as plt
Let's find out whether there is a relationship between the author id and the article rating:
plt.scatter(data1['data_author_id'], data1['data_rating'])
plt.xlabel('data_author_id')
plt.ylabel('data_rating')
Due to the large amount of data the relationship is hard to see, and most likely there is none.
Is there a relationship between article id and article rating?
plt.scatter(data1['story_id'], data1['data_rating'])
plt.xlabel('story_id')
plt.ylabel('data_rating')
It is noticeable that posts with higher ids (later posts) receive higher ratings; they are voted for more often. Is the resource growing in popularity?
Is there a relationship between the date of the article and the rating?
plt.scatter(data1['data_timestamp'], data1['data_rating'])
plt.xlabel('data_timestamp')
plt.ylabel('data_rating')
Here, too, later posts tend to have higher ratings. Better content, or, again, just an increase in website traffic?
Is there a connection between the rating of an article and the number of comments to it?
plt.scatter(data1['story_comments'], data1['data_rating'])
plt.xlabel('story_comments')
plt.ylabel('data_rating')
Here there is a roughly linear, though widely scattered, relationship. It makes sense: the higher the rating of a post, the more comments it gets.
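To put rough numbers on what the scatter plots suggest, we can compute pairwise Pearson correlations with the rating (a sketch; the exact values depend on the dataset version):
# Pearson correlation of the rating with the other numeric columns
print(data1[['data_rating', 'story_id', 'story_comments', 'data_author_id']].corr()['data_rating'])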
Let's take a look at the top authors (authors with the highest total post ratings):
top_users_df = data1.groupby('user_name')[['data_rating']].sum().sort_values('data_rating', ascending=False).head(10)
top_users_df
Let's make it more visual:
top_users_df.style.bar()
Let's try other visualization tools, for example seaborn:
# install seaborn if it is not available yet
! pip3 install seaborn
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)
# disable warnings
import warnings
warnings.simplefilter('ignore')
# show plots inline in the Jupyter notebook
%pylab inline
# render plots as svg
%config InlineBackend.figure_format = 'svg'
# set the default figure size
from pylab import rcParams
rcParams['figure.figsize'] = 6, 3
import seaborn as sns
Let's build pairwise plots for the post id, rating, and comment columns, and save the result as a .png:
%config InlineBackend.figure_format = 'png'
sns_plot = sns.pairplot(data1[['story_id', 'data_rating', 'story_comments']]);
sns_plot.savefig('pairplot.png')
Let's try the Plotly visualization tool
from plotly.offline import init_notebook_mode, iplot
import plotly
import plotly.graph_objs as go
init_notebook_mode(connected=True)
Let's group the data by date and sum the article ratings for each date:
df2 = data1.groupby('data_timestamp')[['data_rating']].sum()
df2.head()
Let's see how many articles were published on each date (month):
released_stories = data1.groupby('data_timestamp')[['story_id']].count()
released_stories.head()
Let's join the two tables:
years_df = df2.join(released_stories)
years_df.head()
Now let's draw using plotly:
trace0 = go.Scatter(
    x=years_df.index,
    y=years_df.data_rating,
    name='data_rating'
)
trace1 = go.Scatter(
    x=years_df.index,
    y=years_df.story_id,
    name='story_id'
)
data = [trace0, trace1]
layout = {'title': 'Statistics'}
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=False)
The beauty of plotly is its interactivity. Here, on hover, the graph shows the total rating of the articles for a given date (month). The rating visibly drops in 2020, but this can be explained by the fact that the parsers have not yet collected enough articles from that period, and also that the newest posts have not yet gathered enough pluses.
At the bottom of the graph, a red line also interactively shows the number of unique articles for a specific date.
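If the timestamps turn out to be finer-grained than a month, the same statistics can be aggregated explicitly by calendar month before plotting; a sketch, assuming data_timestamp was successfully parsed as a datetime:
# Sketch: aggregate total rating and article count by calendar month
monthly = (data1
           .assign(month=data1['data_timestamp'].dt.to_period('M').dt.to_timestamp())
           .groupby('month')
           .agg(data_rating=('data_rating', 'sum'), story_id=('story_id', 'count')))
monthly.head()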
Let's save the chart as an html file.
plotly.offline.plot(fig, filename='stats_pikabu.html', show_link=False);
Data groupings
Let's see how many authors are in the dataset:
data1.groupby('user_name').size()
How many articles per author:
data1['user_name'].value_counts()
Who writes most often (more than 500 articles):
counts = data1.groupby('user_name').size()
for name, count in counts.items():
    if count > 500:
        print(name, count)
So that is who "clogs up" the resource :). There are not that many of them:
authors
crackcraft 531
mpazzz 568
kastamurzik 589
pbdsu 773
RedCatBlackFox 4882
Wishhnya 1412
haalward 1190
iProcione 690
tooNormal 651
Drugayakuhnya 566
Ozzyab 1088
kalinkaElena9 711
Freshik04 665
100pudofff 905
100pudofff 1251
Elvina.Brestel 1533
1570525 543
Samorodok 597
Mr.Kolyma 592
kka2012 505
DENTAARIUM 963
4nat1k 600
chaserLI 650
kostas26 1192
portal13 895
exJustice 1477
alc19 525
kuchka70 572
SovietPosters 781
Grand.Bro 1051
Rogo3in 1068
fylhtq2222 774
deystvitelno 539
lilo26 802
al56.81 2498
Hebrew01 596
TheRovsh 803
ToBapuLLI 1143
ragnarok777 893
Ichizon 890
hoks1 610
arthik 700
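The same list of prolific authors can be obtained more idiomatically with value_counts (a sketch):
# Authors with more than 500 articles, via value_counts
author_counts = data1['user_name'].value_counts()
print(author_counts[author_counts > 500])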
Let's see how many communities there are on the resource in total:
data1.groupby('story__community_link').size()
And which one is the most prolific:
data1['story__community_link'].value_counts()
* The community data is not entirely accurate, since only the first community mentioned in a post was collected during parsing, while authors often list several.
Finally, let's see how to apply a function and write its result to a separate column.
This will be needed for further study of the dataset.
Here is a simple function that assigns an article's rating to a group:
a rating of 5,000 or less is 'bad', 20,000 or more is 'good', and everything in between (as well as NaN) is 'Undef'.
def ratingGroup(row):
    # if the rating is not NaN, assign it to a group
    if not pd.isnull(row['data_rating']):
        if row['data_rating'] <= 5000:
            return 'bad'
        if row['data_rating'] >= 20000:
            return 'good'
    # rating is NaN or falls between the thresholds - return Undef
    return 'Undef'
Let's apply the ratingGroup function to the DataFrame and store the result in a separate column, ratingGroup:
data1['ratingGroup'] = data1.apply(ratingGroup, axis=1)
data1.head(10)
A new column with the group labels appears in the dataset.
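To see how the articles are distributed between the groups, we can count the values in the new column (a quick sketch):
# Distribution of articles across the rating groups
data1['ratingGroup'].value_counts()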
Download the dataset.
Download the uncleaned dataset if you want to remove the duplicates yourself.
* Python takes almost an hour to clean the file (removing duplicate lines by article id)! If someone rewrites the code in C++, I will be grateful:
a = []  # article ids already seen
with open('f-final-clean-.txt', 'a', encoding='utf8', newline='') as f:
    for line in my_lines:
        try:
            b = line.split("\t")[2]  # article id is the third field
            if b in a:
                pass
            else:
                a.append(b)
                f.write(line)
        except:
            print(line)
The question is withdrawn, because I (unexpectedly) discovered that a Python dictionary does the job 10 times faster:
a = {}  # article ids already seen
f = open("f-final.txt", 'r', encoding='utf8', newline='')
f1 = open("f-final-.txt", 'a', encoding='utf8', newline='')
for line in f.readlines():
    try:
        b = line.split("\t")[2]  # article id is the third field
        if b in a:
            pass
        else:
            a[b] = b
            f1.write(line)
    except:
        pass
f.close()
f1.close()
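For comparison, the same de-duplication by article id can also be done with pandas drop_duplicates; a sketch, assuming the raw file keeps the tab-separated layout described above and parses without errors:
# Sketch: de-duplicate the raw file by article id (third column) with pandas
raw = pd.read_csv('f-final.txt', delimiter='\t', header=None, dtype=str)
raw.drop_duplicates(subset=2).to_csv('f-final-.txt', sep='\t',
                                     header=False, index=False)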
Download the Jupyter notebook.