Python and statistical inference: part 4

This final post covers analysis of variance. See the previous post here.





Analysis of variance

Analysis of variance, usually abbreviated in the literature as ANOVA (ANalysis Of VAriance), is a set of statistical methods for measuring the statistical significance of differences between groups. It was developed by the extremely gifted statistician Ronald Fisher, who also popularized significance testing in his papers on biological research.





Note. The original Russian text of this series renders the English term "variance" with the conventional Russian term, which literally reads "dispersion", and in places gives the cognate "variation" in parentheses. This is no coincidence. English has the paired terms "variance" and "covariance", which in principle should be translated with a shared root, but in Russian usage the pairing is broken and the two are rendered with unrelated words. There is a further complication. In English, "dispersion" (statistical dispersion) is a separate, generic concept: the degree to which a distribution is stretched or squeezed. Its measures include the variance, the standard deviation, and the interquartile range. Dispersion as a generic concept and variance as one of its measures, based on distance from the mean, are therefore two different things. The rest of this text uses "variance" throughout, but the terminological discrepancy is worth keeping in mind.





Our z-tests and t-tests focused on sample means as the primary way of telling two samples apart. In each case we took the difference between the means and divided it by the amount of difference we could reasonably expect, as quantified by the standard error.





The mean is not the only sample indicator that may indicate a discrepancy between samples. In fact, sample variance can also be used as an indicator of statistical discrepancy.





Figure: Duration (sec), page by page and combined

, , . - . , , .





β€” . , , . , , .





F-distribution

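The F-distribution is parameterized by two degrees of freedom: one for the sample size and one for the number of groups.

The first degree of freedom is the number of groups minus 1, and the second is the sample size minus the number of groups. If $k$ is the number of groups and $n$ is the sample size, then:

$$df_1 = k - 1$$

$$df_2 = n - k$$
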
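
As a small illustration of the two degrees of freedom, the sketch below computes them for a given number of groups and sample size. The function name `f_degrees_of_freedom` is hypothetical, and the values 20 and 1000 are illustrative, chosen to match the 19 and 980 degrees of freedom quoted in a figure caption later in this post.

```python
def f_degrees_of_freedom(k, n):
    """Return (df1, df2) for k groups and n observations in total."""
    if k < 2 or n <= k:
        raise ValueError("need at least 2 groups and more observations than groups")
    return k - 1, n - k

# 20 groups and 1000 observations (illustrative values)
print(f_degrees_of_freedom(20, 1000))   # (19, 980)
```
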

Different F-distributions can be visualized with the pandas plot function:





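import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

def ex_2_Fisher():
    '''Plot F-distribution densities for several pairs of degrees of freedom'''
    mu = 0  # location parameter of the distribution
    d1_values, d2_values = [4, 9, 49], [95, 90, 50]
    linestyles = ['-', '--', ':', '-.']
    x = np.linspace(0, 5, 101)[1:]  # skip x = 0
    ax = None
    for (d1, d2, ls) in zip(d1_values, d2_values, linestyles):
        dist = stats.f(d1, d2, mu)
        df = pd.DataFrame({0: x, 1: dist.pdf(x)})
        # passing ax back in draws every curve on the same axes
        ax = df.plot(0, 1, ls=ls,
                     label=r'$d_1=%i,\ d_2=%i$' % (d1, d2), ax=ax)
    plt.xlabel('$x$\nF-statistic')
    plt.ylabel('Probability density\n$p(x|d_1, d_2)$')
    plt.show()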
      
      



The curves show F-distributions for a sample of 100 points split into 5, 10 and 50 groups.





F-statistic

, , F-. F- , . F- :





$$F = \frac{S_b^2}{S_w^2}$$

where $S_b^2$ is the between-group variance and $S_w^2$ is the within-group variance.





F . , , . , , , .





F- , F. F .





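The within-group variance for the F-test is the mean squared deviation from the group means: the sum of squared deviations divided by the within-group degrees of freedom, $n - k$. For $k$ groups, each with mean $\bar{x}_k$, the sum of squares is:

$$SSW = \sum_{k} \sum_{j} (x_{jk} - \bar{x}_k)^2$$

where $SSW$ is the sum of squares within groups and $x_{jk}$ is the value of the $j$-th element in group $k$.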





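The formula for SSW looks intimidating, but in Python it is easy to implement as a function, ssdev, which computes the sum of squared deviations from the mean: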





def ssdev(xs):
    '''Sum of squared deviations from the mean
       (xs is a pandas Series or a numpy array)'''
    mu = xs.mean()
    return ((xs - mu) ** 2).sum()
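
To make the behaviour of ssdev concrete, here is a minimal sketch on a four-element series. The data are made up for illustration, and the vectorized body is a restatement of the function above rather than the original listing.

```python
import pandas as pd

def ssdev(xs):
    """Sum of squared deviations from the mean."""
    return ((xs - xs.mean()) ** 2).sum()

sample = pd.Series([1.0, 2.0, 3.0, 4.0])   # illustrative data, mean is 2.5
total = ssdev(sample)                      # 2.25 + 0.25 + 0.25 + 2.25
print(total)                               # 5.0
```

The result equals the population variance multiplied by the number of observations, which is why dividing a sum of squares by a count (or by degrees of freedom) yields a variance.
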
      
      



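The between-group variance for the F-test has a similar formula:

$$S_b^2 = \frac{SST - SSW}{k - 1}$$

where $SST$ is the total sum of squares and $SSW$ is the within-group sum of squares described above. The total sum of squares is the sum of squared deviations from the "grand" mean $\bar{x}$ of the whole sample:

$$SST = \sum_{k} \sum_{j} (x_{jk} - \bar{x})^2$$

In other words, SST is the overall sum of squares with no grouping applied. In Python, SST and SSW are computed as follows, and their difference gives the between-group sum of squares.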





ssw = groups.apply(ssdev).sum()   # sum of squares within groups
sst = ssdev(df['dwell-time'])     # total sum of squares, ignoring the grouping
ssb = sst - ssw                   # sum of squares between groups
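
The decomposition used here (total sum of squares equals within plus between) can be checked on a toy data set. This is a sketch under stated assumptions: the data frame `toy` and its values are invented for illustration, with column names mirroring the ones used in this post, and `ssb_direct` is an independent calculation added only as a cross-check.

```python
import pandas as pd

def ssdev(xs):
    """Sum of squared deviations from the mean."""
    return ((xs - xs.mean()) ** 2).sum()

# Two made-up groups of three observations each
toy = pd.DataFrame({
    'site':       [0, 0, 0, 1, 1, 1],
    'dwell-time': [1.0, 2.0, 3.0, 5.0, 6.0, 7.0],
})
groups = toy.groupby('site')['dwell-time']

ssw = groups.apply(ssdev).sum()     # 2 + 2 = 4
sst = ssdev(toy['dwell-time'])      # grand mean is 4, so 9+4+1+1+4+9 = 28
ssb = sst - ssw                     # 24

# Cross-check: group size times squared distance of each group mean
# from the grand mean, summed over the groups
grand_mean = toy['dwell-time'].mean()
ssb_direct = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
print(ssw, sst, ssb, ssb_direct)
```
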
      
      



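The F-statistic is the ratio of the between-group variance to the within-group variance. Combining ssb and ssw with the two degrees of freedom gives the F-statistic.

In Python, the F-statistic is computed as follows: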





msb = ssb / df1      # mean square between groups
msw = ssw / df2      # mean square within groups
f_stat = msb / msw
      
      



Now that we can compute the F-statistic from the groups, we can use it in an F-test.
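
The steps above (sums of squares, degrees of freedom, mean squares, ratio) can be sketched end to end on two tiny made-up samples. The arrays `a` and `b` are illustrative, and `scipy.stats.f_oneway` is not used in the original post; it appears here only as an independent reference for the hand-computed value.

```python
import numpy as np
from scipy import stats

def ssdev(xs):
    """Sum of squared deviations from the mean."""
    return ((xs - xs.mean()) ** 2).sum()

a = np.array([1.0, 2.0, 3.0])       # illustrative group 1
b = np.array([5.0, 6.0, 7.0])       # illustrative group 2
pooled = np.concatenate([a, b])

k, n = 2, len(pooled)               # number of groups, sample size
df1, df2 = k - 1, n - k             # 1 and 4
ssw = ssdev(a) + ssdev(b)           # 4
sst = ssdev(pooled)                 # 28
ssb = sst - ssw                     # 24
f_stat = (ssb / df1) / (ssw / df2)  # 24 / 1

# One-way ANOVA from scipy as a cross-check of the manual calculation
reference = stats.f_oneway(a, b)
print(f_stat, reference.statistic)
```
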





F-test

, , () ,   , .





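The scipy function stats.f.sf is the survival function of the F-distribution: it returns the probability that an F-distributed variable exceeds a given value. The F-test here covers the 20 websites, so the groups are the dwell times recorded for each website. The function f_test computes the F-statistic from the groups and converts it to a p-value using the F-distribution with the corresponding degrees of freedom: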





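import pandas as pd
from scipy import stats

def f_test(groups):
    '''p-value of a one-way F-test over a pandas SeriesGroupBy'''
    m, n = len(groups), groups.count().sum()    # number of groups, sample size
    df1, df2 = m - 1, n - m
    values = pd.concat([g for _, g in groups])  # all observations, ungrouped
    ssw = groups.apply(ssdev).sum()             # sum of squares within groups
    sst = ssdev(values)                         # total sum of squares
    ssb = sst - ssw                             # sum of squares between groups
    msb = ssb / df1                             # mean square between groups
    msw = ssw / df2                             # mean square within groups
    f_stat = msb / msw
    return stats.f.sf(f_stat, df1, df2)

def ex_2_24():
    '''F-test of dwell time across the websites'''
    # load_data: helper assumed to be defined in the earlier posts of this series
    df = load_data('multiple-sites.tsv')
    groups = df.groupby('site')['dwell-time']
    return f_test(groups)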
      
      



0.014031745203658217
      
      



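The F-test returns a p-value, computed with the scipy function stats.f.sf. We test at the 5% significance level.

The p-value of 0.014 is below that threshold, so the result is statistically significant: the dwell times on the websites differ by more than chance alone would explain.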
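
The relationship between a p-value and a critical value of the F-distribution can be sketched as follows. The degrees of freedom 19 and 980 are taken from the figure caption in this post, and the p-value is the one printed above; the names `critical`, `f_implied` and `is_significant` are hypothetical and introduced only for this illustration.

```python
from scipy import stats

df1, df2 = 19, 980                  # degrees of freedom quoted in this post
alpha = 0.05                        # 5% significance level
p_reported = 0.014031745203658217   # p-value printed above

# Critical value: the F-statistic above which the p-value drops below alpha
critical = stats.f.ppf(1 - alpha, df1, df2)

# Inverse survival function: the F-statistic that corresponds to p_reported
f_implied = stats.f.isf(p_reported, df1, df2)

def is_significant(f_stat):
    """True when the one-tailed p-value of f_stat is below alpha."""
    return stats.f.sf(f_stat, df1, df2) < alpha

print(critical, f_implied, is_significant(f_implied))
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two views of the same decision, since the survival function decreases as the statistic grows.
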





Figure: F-distribution with 19 and 980 degrees of freedom

- , :





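def ex_2_25():
    '''Box plot of dwell time for each website'''
    df = load_data('multiple-sites.tsv')
    df.boxplot(by='site', showmeans=True)
    plt.xlabel('Website number')
    plt.ylabel('Dwell time, sec.')
    plt.title('')      # clear the titles that boxplot(by=...) adds automatically
    plt.suptitle('')
    plt.show()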
      
      



boxplot



, -. - 0, .





, - 10 , . , , , , 6, 144 .:





def ex_2_26():
    '''t-test between website 0 and website 10'''
    df = load_data('multiple-sites.tsv')
    groups   = df.groupby('site')['dwell-time']
    site_0   = groups.get_group(0)
    site_10  = groups.get_group(10)
    # equal_var=False selects Welch's t-test (unequal variances)
    _, p_val = stats.ttest_ind(site_0, site_10, equal_var=False)
    return p_val
      
      



0.0068811940138903786
      
      



Having confirmed a significant effect with the F-test, we run the same comparison for website 6:





def ex_2_27():
    '''t-test between website 0 and website 6'''
    df = load_data('multiple-sites.tsv')
    groups   = df.groupby('site')['dwell-time']
    site_0   = groups.get_group(0)
    site_6   = groups.get_group(6)
    # equal_var=False selects Welch's t-test (unequal variances)
    _, p_val = stats.ttest_ind(site_0, site_6, equal_var=False)
    return p_val
      
      



0.005534181712508717
      
      



, , , - 6 -. AcmeContent -. - - !





, , . , β€” , . . , , .





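Cohen's d

Cohen's d is an adjustment that shows whether an observed difference is not just statistically significant but actually large. It is computed as:

$$d = \frac{\bar{x}_b - \bar{x}_a}{S_{ab}}$$

where $\bar{x}_a$ and $\bar{x}_b$ are the sample means and $S_{ab}$ is the pooled standard deviation (not the pooled standard error) of the two samples. It can be computed as follows: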





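import numpy as np

def pooled_standard_deviation(a, b):
    '''Pooled standard deviation of two samples
       (not the pooled standard error)'''
    # standard_deviation: helper assumed to be defined in an earlier post
    return np.sqrt(standard_deviation(a) ** 2 +
                   standard_deviation(b) ** 2)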
      
      



For website 6, Cohen's d is computed as follows:





def ex_2_28():
    '''Cohen's d for the dwell times
       of website 0 and website 6'''
    df = load_data('multiple-sites.tsv')
    groups = df.groupby('site')['dwell-time']
    a      = groups.get_group(0)
    b      = groups.get_group(6)
    return (b.mean() - a.mean()) / pooled_standard_deviation(a, b)
      
      



0.38913648705499848
      
      



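Unlike p-values, Cohen's d has no absolute threshold. Whether an effect counts as large depends partly on context, but d does provide a useful, normalized measure of effect size. Values above 0.5 are usually considered large, so 0.38 is a moderate effect. It represents a meaningful increase in dwell time on the website, one that is worth the effort of updating the website.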
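
A minimal sketch of the Cohen's d calculation on made-up data follows. It assumes that `np.std` (the population standard deviation, `ddof=0`) can stand in for the `standard_deviation` helper from the earlier posts, whose exact definition is not shown here; the function name `cohens_d` and the two arrays are hypothetical.

```python
import numpy as np

def pooled_standard_deviation(a, b):
    """Same combination as in this post: root of the sum of the two variances.
    np.std (ddof=0) is an assumption standing in for standard_deviation."""
    return np.sqrt(np.std(a) ** 2 + np.std(b) ** 2)

def cohens_d(a, b):
    """Difference of the means in units of the pooled standard deviation."""
    return (np.mean(b) - np.mean(a)) / pooled_standard_deviation(a, b)

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative baseline sample
b = a + 1.0                               # same spread, mean shifted by 1
d = cohens_d(a, b)                        # approximately 0.5
print(d)
```

Each sample has variance 2, so the pooled value is 2 and a shift of 1 in the mean gives an effect size of about 0.5. Doubling the spread of both samples while keeping the shift would halve d, which is the sense in which the measure is normalized.
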





   Github.    .





, . , , z-, t- F-.





, , , . β€” , β€” . , , F- 1- 2- .





.





In the next series of posts, if readers are interested, we will apply what we have learned about variance and the F-test to single samples. We will introduce regression analysis and use it to find correlations between variables in a sample of Olympic athletes.







