You can work with the dataset either in ordinary Excel or in a Jupyter notebook; the data fields are separated by tabs. We will focus on the latter option, and all commands below assume that the work is done in a Jupyter notebook.
We will work on Windows, so open cmd, go to the folder with the downloaded dataset, and start Jupyter with the jupyter notebook command.
Next, let's import the modules.
import pandas as pd
import numpy as np
Since the dataset does not contain headers, let's define them before loading it:
headers=['story_title','link','story_id','data_rating','data_timestamp','story_comments','data_author_id','data_meta_rating','user_name','user_link','story__community_link']
Everything is clear here: article title, link to it, article id, rating (number of pluses), article date, number of comments, author id, article meta rating, author name, link to author, link to the community.
Let's read in the dataset.
df = pd.read_csv('400k-pikabu.csv',
                 parse_dates=['data_timestamp'],
                 warn_bad_lines=True,
                 index_col=False,
                 dtype={'story_title': 'object', 'link': 'object',
                        'story_id': 'float32', 'data_rating': 'float32',
                        'story_comments': 'float32', 'data_author_id': 'float32'},
                 delimiter='\t',
                 names=headers)
The dtype argument is a small optimization of the values being read, so that some columns are treated as numeric.
The dataset contains 468,595 rows and 11 columns.
print(df.shape)  # (468595, 11)
First 5 records
df.head(5)
Statistical description:
df.describe()
Working with empty values in the dataset
Even though the parsers worked tirelessly, the dataset has small holes: missing values, which pandas represents as NaN. Let's count the rows with such gaps:
len(df.loc[pd.isnull( df['story_title'])])
Here is how it looks in the dataset:
df.loc[pd.isnull( df['story_title'])]
1,444 rows with gaps do not spoil the overall picture, but let's get rid of them anyway:
data1 = df.dropna(axis=0, thresh=5)  # keep only rows with at least 5 non-NaN values
We check that the deletion was successful:
len(data1.loc[pd.isnull(data1['story_id'])])
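As a side note, instead of thresh the rows could also be dropped by the specific columns we care about; a minimal sketch (data_alt is an illustrative name only and is not used later):
# Sketch: drop rows that are missing the key fields explicitly
data_alt = df.dropna(subset=['story_title', 'story_id'])
# Sanity check: remaining NaN count per column
print(data_alt.isnull().sum())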
Let's work with the dataset
Let's see the names of the columns
df.columns
Let's select the first column
col = df['story_title']
col
Let's look at the column-wise minimums in the dataset
data1.min()
Maximum
data1.max()
The same, but in a more readable form, for selected columns:
data1.loc[:,['user_name', 'data_rating', 'story_comments']].min()
Now let's collect the values from the columns of interest into an array:
arr = data1[['story_id', 'data_rating', 'data_timestamp','user_name']].values
You can look at one of the columns of the array:
arr[:, 1]  # the data_rating column
Let's look at the number of articles with a rating of over 10,000:
print((arr[:, 1] > 10000.0).sum())
Only 2,672 articles out of roughly 450k have such an ultra-high rating.
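The same count can be obtained directly from the DataFrame, without the intermediate NumPy array (a quick sketch):
# Count articles with a rating above 10,000 straight from the DataFrame
print((data1['data_rating'] > 10000).sum())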
Let's draw charts
First, let's import the module:
import matplotlib.pyplot as plt
Let's find out whether there is a relationship between the author id and the article rating:
plt.scatter(data1['data_author_id'], data1['data_rating'])
plt.xlabel('data_author_id')
plt.ylabel('data_rating')
Due to the large amount of data the relationship is hard to see, and most likely there is none.
Is there a relationship between article id and article rating?
plt.scatter(data1['story_id'], data1['data_rating'])
plt.xlabel('story_id')
plt.ylabel('data_rating')
It is noticeable that posts with higher ids (later posts) receive higher ratings; they are voted for more often. Is the resource growing in popularity?
Is there a relationship between the date of the article and the rating?
plt.scatter(data1['data_timestamp'], data1['data_rating'])
plt.xlabel('data_timestamp')
plt.ylabel('data_rating')
Here, too, later posts tend to have higher ratings. Better content, or, again, just an increase in website traffic?
Is there a connection between the rating of an article and the number of comments to it?
plt.scatter(data1['story_comments'], data1['data_rating'])
plt.xlabel('story_comments')
plt.ylabel('data_rating')
Here there is a roughly linear, though widely scattered, relationship. It makes sense: the higher the rating of a post, the more comments it gets.
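To put rough numbers on what the scatter plots suggest, we can compute pairwise Pearson correlations with the rating (a sketch; the exact values depend on the dataset version):
# Pearson correlation of the rating with the other numeric columns
print(data1[['data_rating', 'story_id', 'story_comments', 'data_author_id']].corr()['data_rating'])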
Let's take a look at the top authors (authors with the highest total post ratings):
top_users_df = data1.groupby('user_name')[['data_rating']].sum().sort_values('data_rating', ascending=False).head(10)
top_users_df
Let's make it more visual:
top_users_df.style.bar()
Let's try other visualization tools, for example seaborn:
# install seaborn if it is not available yet
! pip3 install seaborn
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)
# disable warnings
import warnings
warnings.simplefilter('ignore')
# show plots inline in the Jupyter notebook
%pylab inline
# render plots as svg
%config InlineBackend.figure_format = 'svg'
# set the default figure size
from pylab import rcParams
rcParams['figure.figsize'] = 6, 3
import seaborn as sns
Let's build pairwise plots for the post id, rating, and comment columns, and save the result as a .png:
%config InlineBackend.figure_format = 'png'
sns_plot = sns.pairplot(data1[['story_id', 'data_rating', 'story_comments']]);
sns_plot.savefig('pairplot.png')
Let's try the Plotly visualization tool
from plotly.offline import init_notebook_mode, iplot
import plotly
import plotly.graph_objs as go
init_notebook_mode(connected=True)
Let's group the data by date and sum the article ratings for each date:
df2 = data1.groupby('data_timestamp')[['data_rating']].sum()
df2.head()
Let's see how many articles were published on each date (month):
released_stories = data1.groupby('data_timestamp')[['story_id']].count()
released_stories.head()
Let's join the two tables:
years_df = df2.join(released_stories)
years_df.head()
Now let's draw using plotly:
trace0 = go.Scatter(
    x=years_df.index,
    y=years_df.data_rating,
    name='data_rating'
)
trace1 = go.Scatter(
    x=years_df.index,
    y=years_df.story_id,
    name='story_id'
)
data = [trace0, trace1]
layout = {'title': 'Statistics'}
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=False)
The beauty of plotly is its interactivity. Here, on hover, the graph shows the total rating of the articles for a given date (month). The rating visibly drops in 2020, but this can be explained by the fact that the parsers have not yet collected enough articles from that period, and also that the newest posts have not yet gathered enough pluses.
At the bottom of the graph, a red line also interactively shows the number of unique articles for a specific date.
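If the timestamps turn out to be finer-grained than a month, the same statistics can be aggregated explicitly by calendar month before plotting; a sketch, assuming data_timestamp was successfully parsed as a datetime:
# Sketch: aggregate total rating and article count by calendar month
monthly = (data1
           .assign(month=data1['data_timestamp'].dt.to_period('M').dt.to_timestamp())
           .groupby('month')
           .agg(data_rating=('data_rating', 'sum'), story_id=('story_id', 'count')))
monthly.head()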
Let's save the chart as an html file.
plotly.offline.plot(fig, filename='stats_pikabu.html', show_link=False);
Data groupings
Let's see how many authors are in the dataset:
data1.groupby('user_name').size()
How many articles per author:
data1['user_name'].value_counts()
Who writes most often (more than 500 articles):
counts = data1.groupby('user_name').size()
for name, count in counts.items():
    if count > 500:
        print(name, count)
So that is who "clogs up" the resource :). There are not that many of them:
authors
crackcraft 531
mpazzz 568
kastamurzik 589
pbdsu 773
RedCatBlackFox 4882
Wishhnya 1412
haalward 1190
iProcione 690
tooNormal 651
Drugayakuhnya 566
Ozzyab 1088
kalinkaElena9 711
Freshik04 665
100pudofff 905
100pudofff 1251
Elvina.Brestel 1533
1570525 543
Samorodok 597
Mr.Kolyma 592
kka2012 505
DENTAARIUM 963
4nat1k 600
chaserLI 650
kostas26 1192
portal13 895
exJustice 1477
alc19 525
kuchka70 572
SovietPosters 781
Grand.Bro 1051
Rogo3in 1068
fylhtq2222 774
deystvitelno 539
lilo26 802
al56.81 2498
Hebrew01 596
TheRovsh 803
ToBapuLLI 1143
ragnarok777 893
Ichizon 890
hoks1 610
arthik 700
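The same list of prolific authors can be obtained more idiomatically with value_counts (a sketch):
# Authors with more than 500 articles, via value_counts
author_counts = data1['user_name'].value_counts()
print(author_counts[author_counts > 500])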
Let's see how many communities there are on the resource in total:
data1.groupby('story__community_link').size()
And which one is the most prolific:
data1['story__community_link'].value_counts()
* The community data is not entirely accurate, since only the first community mentioned in a post was collected during parsing, while authors often list several.
Finally, let's see how to apply a function and write its result to a separate column.
This will be needed for further study of the dataset.
Here is a simple function that assigns an article's rating to a group:
a rating of 5,000 or less is 'bad', 20,000 or more is 'good', and everything in between (as well as NaN) is 'Undef'.
def ratingGroup(row):
    # if the rating is not NaN, assign it to a group
    if not pd.isnull(row['data_rating']):
        if row['data_rating'] <= 5000:
            return 'bad'
        if row['data_rating'] >= 20000:
            return 'good'
    # rating is NaN or falls between the thresholds - return Undef
    return 'Undef'
Let's apply the ratingGroup function to the DataFrame and store the result in a separate column, ratingGroup:
data1['ratingGroup'] = data1.apply(ratingGroup, axis=1)
data1.head(10)
A new column with the group labels appears in the dataset.
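To see how the articles are distributed between the groups, we can count the values in the new column (a quick sketch):
# Distribution of articles across the rating groups
data1['ratingGroup'].value_counts()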
Download the dataset.
Download the uncleaned dataset if you want to remove the duplicates yourself.
* Python takes almost an hour to clean the file (removing duplicate lines by article id)! If someone rewrites the code in C++, I will be grateful:
a = []  # article ids already seen
with open('f-final-clean-.txt', 'a', encoding='utf8', newline='') as f:
    for line in my_lines:
        try:
            b = line.split("\t")[2]  # article id is the third field
            if b in a:
                pass
            else:
                a.append(b)
                f.write(line)
        except:
            print(line)
The question is withdrawn, because I (unexpectedly) discovered that a Python dictionary does the job 10 times faster:
a = {}  # article ids already seen
f = open("f-final.txt", 'r', encoding='utf8', newline='')
f1 = open("f-final-.txt", 'a', encoding='utf8', newline='')
for line in f.readlines():
    try:
        b = line.split("\t")[2]  # article id is the third field
        if b in a:
            pass
        else:
            a[b] = b
            f1.write(line)
    except:
        pass
f.close()
f1.close()
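For comparison, the same de-duplication by article id can also be done with pandas drop_duplicates; a sketch, assuming the raw file keeps the tab-separated layout described above and parses without errors:
# Sketch: de-duplicate the raw file by article id (third column) with pandas
raw = pd.read_csv('f-final.txt', delimiter='\t', header=None, dtype=str)
raw.drop_duplicates(subset=2).to_csv('f-final-.txt', sep='\t',
                                     header=False, index=False)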
Download the Jupyter notebook.