
Performing summary statistics and plots —— Python Data Science Cookbook

2017-02-09 00:03
Source: Python Data Science Cookbook

The primary purpose of summary statistics is to get a good understanding of the location and dispersion of the data. By summary statistics, we refer to the mean, median, and standard deviation. These quantities are quite easy to calculate, but they should be used with care: if the underlying data is not unimodal, that is, it has multiple peaks, these quantities may not be of much use.
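
To illustrate the caveat about multiple peaks, here is a minimal sketch using synthetic data rather than anything from the recipe. The two clusters, their centers, and the seed are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical bimodal sample: two well-separated clusters of equal size.
rng = np.random.default_rng(0)
low_peak = rng.normal(loc=-5.0, scale=0.5, size=500)
high_peak = rng.normal(loc=5.0, scale=0.5, size=500)
bimodal = np.concatenate([low_peak, high_peak])

mean = np.mean(bimodal)
std = np.std(bimodal)

# Fraction of points that lie within one unit of the mean.
near_mean = np.mean(np.abs(bimodal - mean) < 1.0)

print("mean = %0.2f, std = %0.2f" % (mean, std))
print("fraction of points within 1.0 of the mean = %0.3f" % near_mean)
```

The mean lands between the two peaks, where almost no data lies, so it describes neither cluster well.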

Note

If the given data is unimodal, that is, it has only one peak, then the mean, which gives the location, and the standard deviation, which gives the dispersion, are valuable metrics.

Compared to the regular mean, a trimmed mean is less sensitive to outliers. SciPy provides a trim_mean function. We will demonstrate the trimmed mean calculation in step 2.
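
As a small illustration of why trimming helps, the sketch below compares the two means on a made-up sample with one extreme value. The numbers are hypothetical and are not taken from the Iris data.

```python
import numpy as np
from scipy.stats import trim_mean

# Hypothetical sample: nine ordinary values and one extreme outlier.
values = np.array([10.0, 11.0, 12.0, 12.0, 13.0, 13.0, 14.0, 15.0, 16.0, 500.0])

regular_mean = np.mean(values)
# A cut proportion of 0.1 removes 10% of the sorted values from each end,
# here the smallest value (10.0) and the largest (500.0).
trimmed_mean = trim_mean(values, 0.1)

print("regular mean = %0.2f" % regular_mean)
print("trimmed mean = %0.2f" % trimmed_mean)
```

The single outlier drags the regular mean far above every ordinary value, while the trimmed mean stays within their range.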
The mean is very sensitive to outliers. The variance is computed from the mean, and hence it is prone to the same issue. We can use other measures of dispersion to avoid this trap. One such measure is the mean absolute deviation (also called the average absolute deviation): instead of squaring the differences between the individual values and the mean and dividing by the number of instances, we take the absolute value of each difference, sum them, and divide by the number of instances. In step 5, we will define a function for this:
def mad(x, axis=None):
    # Keep the reduced axis so the mean broadcasts against x
    mean = np.mean(x, axis=axis, keepdims=True)
    return np.mean(np.abs(x - mean), axis=axis)
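
The following sketch compares the standard deviation with the mean absolute deviation on a made-up sample, before and after adding one outlier. The helper name `mean_abs_dev` and the data are assumptions for illustration only.

```python
import numpy as np

def mean_abs_dev(values):
    """Mean absolute deviation of a 1-D array (hypothetical helper name)."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - np.mean(values)))

clean = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
with_outlier = np.append(clean, 50.0)

for label, sample in (("clean", clean), ("with outlier", with_outlier)):
    print("%s: std = %0.2f, mean abs dev = %0.2f"
          % (label, np.std(sample), mean_abs_dev(sample)))
```

Because the differences are not squared, the outlier inflates the mean absolute deviation less than the standard deviation. It still depends on the mean, though, so it is not fully robust.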

When the data has many outliers, another set of metrics comes in handy: the median and percentiles. Traditionally, the median is defined as a value from the dataset such that half of all the points in the dataset are smaller and the other half are larger than the median value.
Interpreting the percentiles:

25% of the points in the dataset are below 13.00 (25th percentile value).

50% of the points in the dataset are below 18.50 (50th percentile value).

75% of the points in the dataset are below 25.25 (75th percentile value).

A point to note is that the 50th percentile is the median. Percentiles give us a good idea of the range of our values.
The median is a measure of the location of the data distribution. Using percentiles, we can also get a metric for the dispersion of the data, the interquartile range. The interquartile range is the distance between the 75th percentile and the 25th percentile.
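
A minimal sketch of these quantities with NumPy, using a small made-up sample rather than the dataset behind the percentile values quoted above:

```python
import numpy as np

# Hypothetical sample, already sorted for readability.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# np.percentile accepts a list of percentiles and returns one value per entry.
q25, q50, q75 = np.percentile(values, [25, 50, 75])
iqr = q75 - q25

print("25th = %0.2f, 50th = %0.2f, 75th = %0.2f" % (q25, q50, q75))
print("median = %0.2f" % np.median(values))
print("interquartile range = %0.2f" % iqr)
```

With NumPy's default linear interpolation, the 50th percentile and `np.median` agree, and the interquartile range here is 7.0 - 3.0 = 4.0.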
Similar to the mean absolute deviation explained previously, we also have the median absolute deviation:
def mdad(x, axis=None):
    # Keep the reduced axis so the median broadcasts against x
    median = np.median(x, axis=axis, keepdims=True)
    return np.median(np.abs(x - median), axis=axis)
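
To see how robust this measure is, the sketch below replaces one value in a made-up sample with an extreme one. The helper name `median_abs_dev` and the data are hypothetical.

```python
import numpy as np

def median_abs_dev(values):
    """Median absolute deviation of a 1-D array (hypothetical helper name)."""
    values = np.asarray(values, dtype=float)
    return np.median(np.abs(values - np.median(values)))

clean = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, 10.0])
with_outlier = clean.copy()
with_outlier[-1] = 1000.0  # replace the largest value with an extreme one

for label, sample in (("clean", clean), ("with outlier", with_outlier)):
    print("%s: std = %0.2f, median abs dev = %0.2f"
          % (label, np.std(sample), median_abs_dev(sample)))
```

In this sample the median absolute deviation is the same with and without the outlier, while the standard deviation grows sharply.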

Source code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@author: snaildove
"""
# Load libraries
from sklearn.datasets import load_iris
import numpy as np
from scipy.stats import trim_mean

# Load iris data
data = load_iris()
x = data['data']
y = data['target']
col_names = data['feature_names']

# Let's now demonstrate how to calculate the mean, trimmed mean, and range values:
# 1. Calculate and print the mean value of each column in the Iris dataset
print("col name,mean value")
for i, col_name in enumerate(col_names):
    print("%s,%0.2f" % (col_name, np.mean(x[:, i])))
print()

# 2. Trimmed mean calculation.
p = 0.1  # trim 10% of the sorted values from each end
print("col name,trimmed mean value")
for i, col_name in enumerate(col_names):
    print("%s,%0.2f" % (col_name, trim_mean(x[:, i], p)))
print()

# 3. Data dispersion, calculate and display the range values.
print("col_names,max,min,range")
for i, col_name in enumerate(col_names):
    col_max, col_min = np.max(x[:, i]), np.min(x[:, i])
    print("%s,%0.2f,%0.2f,%0.2f" % (col_name, col_max, col_min, col_max - col_min))
print()

# Finally, we will show the variance, standard deviation, mean absolute deviation, and
# median absolute deviation calculations:
# 4. Data dispersion, variance and standard deviation
print("col_names,variance,std-dev")
for i, col_name in enumerate(col_names):
    print("%s,%0.2f,%0.2f" % (col_name, np.var(x[:, i]), np.std(x[:, i])))
print()

# 5. Mean absolute deviation calculation
def mad(x, axis=None):
    # Keep the reduced axis so the mean broadcasts against x
    mean = np.mean(x, axis=axis, keepdims=True)
    return np.mean(np.abs(x - mean), axis=axis)

print("col_names,mad")
for i, col_name in enumerate(col_names):
    print("%s,%0.2f" % (col_name, mad(x[:, i])))
print()

# 6. Median absolute deviation calculation
def mdad(x, axis=None):
    # Keep the reduced axis so the median broadcasts against x
    median = np.median(x, axis=axis, keepdims=True)
    return np.median(np.abs(x - median), axis=axis)

print("col_names,median,median abs dev,inter quartile range")
for i, col_name in enumerate(col_names):
    # Both percentiles are taken over the same column, x[:, i]
    iqr = np.percentile(x[:, i], 75) - np.percentile(x[:, i], 25)
    print("%s,%0.2f,%0.2f,%0.2f" % (col_name, np.median(x[:, i]), mdad(x[:, i]), iqr))
print()

Output:

col name,mean value
sepal length (cm),5.84
sepal width (cm),3.05
petal length (cm),3.76
petal width (cm),1.20

col name,trimmed mean value
sepal length (cm),5.81
sepal width (cm),3.04
petal length (cm),3.76
petal width (cm),1.18

col_names,max,min,range
sepal length (cm),7.90,4.30,3.60
sepal width (cm),4.40,2.00,2.40
petal length (cm),6.90,1.00,5.90
petal width (cm),2.50,0.10,2.40

col_names,variance,std-dev
sepal length (cm),0.68,0.83
sepal width (cm),0.19,0.43
petal length (cm),3.09,1.76
petal width (cm),0.58,0.76

col_names,mad
sepal length (cm),0.69
sepal width (cm),0.33
petal length (cm),1.56
petal width (cm),0.66

col_names,median,median abs dev,inter quartile range
sepal length (cm),5.80,0.70,1.30
sepal width (cm),3.00,0.25,0.50
petal length (cm),4.35,1.25,3.50
petal width (cm),1.30,0.70,1.50