Python, data science and choices: part 1

This five-post series for beginners is a remix of the first chapter of the 2015 book Clojure for Data Science. The book's author, Henry Garner, kindly agreed to let its material be used for this remix in Python.





The book was written as an invitation to so-called "data science", a field that has gained strong momentum in recent years from the need to process large datasets quickly, both locally and in distributed environments.





The book's material is written in lively language and a task-oriented style, with the emphasis on analyzing data using suitable algorithms and computing platforms, and with short, direct explanations along the way.





It is unfair when excellent educational material gathers dust simply because it is implemented in a rather academic, if not elite, language such as the functional programming language Clojure. So I wanted to contribute my five kopecks and make the book's material available to a wider audience.





Three chapters of the book were adapted for Python in 2016, the year after the book was published. Publishing the remix in the Russian Federation did not work out for various reasons, and one of the main ones will become clear at the end of this series of posts. At the end of the final post, you can vote for or against a follow-up series. Until then...





Post #1 is about preparing the environment and the data.





Statistics

 It matters not who votes, but who counts the votes





- Joseph Stalin





, , , . , , , Β« Β» Β« 80/20Β». . : .





, Python- pandas. , , , numpy . β€” 2010 . 2011 . β€” , .





SciPy: SciPy - , pandas , , NumPy .





. SciPy , NumPy , pandas -, - . R Python, REPL, . , .





. - , , :





import numpy as np
import scipy as sp
import pandas as pd
from collections import Counter  # used later to count value frequencies
      
      



, Python . , , random , collections , Counter.





pandas , DataFrame



, .. , , . , pandas , . , , . pandas , , , , :





  • Comma-separated (.csv) and tab-separated (.tsv) text files, via read_csv







  • Excel files (for example, .xls and .xlsx), via read_excel







  • Other sources (including JSON and HTML, among others)





– Series, .. . , , .
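To make the loading step concrete without the Excel file, here is a minimal sketch that parses a small in-memory CSV with read_csv. The column names are borrowed from the post, but the rows, the CSV_TEXT constant and the load_sample helper are made up purely for illustration. It shows that the reader returns a DataFrame and that selecting a single column yields a Series.

```python
import io

import pandas as pd

# Made-up stand-in data (assumption); the post itself reads data/ch01/UK2010.xls.
CSV_TEXT = """Constituency Name,Election Year,Votes
Alpha,2010,100
Beta,2010,200
"""


def load_sample():
    '''Parse the in-memory CSV text into a DataFrame (hypothetical helper).'''
    return pd.read_csv(io.StringIO(CSV_TEXT))


sample = load_sample()     # a table with named columns
votes = sample['Votes']    # one column of the table is a Series
```

The same pattern should carry over to read_excel, which also returns a DataFrame, provided an Excel reader package is installed.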





Excel, read_excel



. β€” β€” , . . . , :





pd.read_excel('data/ch01/UK2010.xls')
      
      



, . load_uk



:





def load_uk():
    '''Load the UK election data from the Excel file'''
    return pd.read_excel('data/ch01/UK2010.xls')
      
      



DataFrame



pandas, . , .





UK2010.xls . pandas read_excel



. β€” columns , (.



):





def ex_1_1():
    '''Return the column names of the UK data'''
    return load_uk().columns
      
      



pandas:





Index(['Press Association Reference', 'Constituency Name', 'Region',
       'Election Year', 'Electorate', 'Votes', 'AC', 'AD', 'AGS', 'APNI',
       ...
       'UKIP', 'UPS', 'UV', 'VCCA', 'Vote', 'Wessex Reg', 'WRP', 'You',
       'Youth', 'YRDPL'],
       dtype='object', length=144)
      
      



, 144 . ; :





  • : , ( )





  • : ,





  • : ,





  • : ,





  • : ,





  • :





, , , . . , , 2010 ., Election Year.





pandas () () . , . :





def ex_1_2():
    '''Return the "Election Year" column'''
    return load_uk()['Election Year']
      
      



:





0      2010.0
1      2010.0
2      2010.0
...
646    2010.0
647    2010.0
648    2010.0
649    2010.0
650       NaN
Name: Election Year, dtype: float64
      
      



. , . , , , unique . pandas , , Python. :





def ex_1_3():
    '''Return the unique values of the "Election Year" column'''
    return load_uk()['Election Year'].unique()
      
      



[ 2010.    nan]
      
      



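The value 2010 confirms that the records come from the 2010 election. The second value, nan, is unexpected. It stands for "not a number", i.e. a value that is missing or undefined.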
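As a small illustration of how unique treats missing values, the sketch below uses a made-up four-element Series in place of the real "Election Year" column (the values are an assumption, not the actual data). NaN shows up as its own distinct value unless it is dropped first.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the 'Election Year' column: three years and one gap.
years = pd.Series([2010.0, 2010.0, 2010.0, np.nan], name='Election Year')

distinct = years.unique()                  # NaN is reported as a separate value
distinct_known = years.dropna().unique()   # drop missing values before deduplicating
```

Here distinct holds two entries (2010.0 and nan), while distinct_known holds only 2010.0.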





, , , , . Counter



Python collections



. , , .. :





def ex_1_4():
    '''Count how often each value occurs in the "Election Year" column
       (missing values are counted too)'''
    return Counter(load_uk()['Election Year'])
      
      



Counter({nan: 1, 2010.0: 650}) 
      
      



, , 2010 . 650 . , , . , , nan , . , , .
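The counting step can be tried on a plain Python list without loading the spreadsheet. This is only a sketch on made-up values (three years and one missing entry), using Counter from the standard library together with math.isnan to count the gaps explicitly.

```python
import math
from collections import Counter

nan = float('nan')                     # a single missing-value marker
years = [2010.0, 2010.0, 2010.0, nan]  # made-up stand-in data

counts = Counter(years)                # frequency of each value
missing = sum(1 for y in years if math.isnan(y))  # explicit count of gaps
```

With these values, counts maps 2010.0 to 3 and the nan marker to 1, and missing equals 1.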





, 80% . .





nan , . , pandas , . pandas.





pandas, , . , . , , :





def ex_1_5():
    '''Return the rows whose "Election Year" value is missing
       (filtering with a boolean mask)'''
    df = load_uk()
    return df[df['Election Year'].isnull()]
      
      



 









     Press Association Reference  Constituency Name  Region  Election Year  Electorate     Votes   AC   AD  AGS  ...
650                          NaN                NaN     NaN            NaN         NaN  29687604  NaN  NaN  NaN  ...





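The expression df['Election Year'].isnull() returns a Series of booleans that is True where the value is missing and False otherwise. Indexing the DataFrame with this Series keeps only the rows marked True. For readers who know SQL, this plays the role of a WHERE clause.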





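In the output of ex_1_5, every field in this row (except one) is NaN. The row looks like a summary line carried over from the Excel spreadsheet rather than a record for a constituency, so it should be removed. The notnull() method builds the opposite mask, which keeps only the rows where the value is not NaN: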





df = load_uk()
df[df['Election Year'].notnull()]   # keep only rows with a known election year
      
      



. , load_uk_scrubbed



:





def load_uk_scrubbed():
    '''Load the UK election data and drop rows with a missing election year'''
    df = load_uk()
    return df[df['Election Year'].notnull()]
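The scrubbing step can be checked on a tiny made-up table. In this sketch the three rows are invented for illustration (the last one imitates a summary row with missing fields), and the variable names is_missing, bad_rows and scrubbed are just local choices. It shows that the isnull and notnull masks split the rows into two complementary parts.

```python
import numpy as np
import pandas as pd

# Made-up stand-in table (assumption): two ordinary rows and one summary-like row.
df = pd.DataFrame({
    'Constituency Name': ['Alpha', 'Beta', np.nan],
    'Election Year': [2010.0, 2010.0, np.nan],
    'Votes': [100, 200, 300],
})

is_missing = df['Election Year'].isnull()        # boolean mask, one flag per row
bad_rows = df[is_missing]                        # rows with no election year
scrubbed = df[df['Election Year'].notnull()]     # rows that survive the scrub
```

Every row lands in exactly one of the two results, so the lengths of bad_rows and scrubbed add up to the length of df.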
      
      



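We now have two loader functions: load_uk, which returns the raw data, and load_uk_scrubbed, which returns the 650 rows that remain once the row with the missing election year is dropped.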





, . β€” β€” , . , , , , , .





The source code examples for this post are in my GitHub repo.





The next part, Part 2, of the Python, Data Science, and Choices post series focuses on descriptive statistics, data grouping, and the normal distribution. All of this lays the foundation for further analysis of the electoral data.







