Performing summary statistics and plots —— Python Data Science Cookbook
2017-02-09 00:03
Source: Python Data Science Cookbook
The primary purpose of using summary statistics is to get a good understanding of the location and dispersion of the data. By summary statistics, we refer to the mean, median, and standard deviation. These quantities are quite easy to calculate; however, one should be careful when using them. If the underlying data is not unimodal, that is, if it has multiple peaks, these quantities may not be of much use.
Note
If the given data is unimodal, that is, if it has only one peak, the mean, which gives the location, and the standard deviation, which gives the spread, are valuable metrics.
Compared to the regular mean, a trimmed mean is less sensitive to outliers. SciPy provides a trim_mean function; we will demonstrate the trimmed mean calculation in step 2 below.
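To illustrate why trimming helps, here is a minimal sketch on made-up numbers (not the recipe's data): a single outlier drags the regular mean far away, while the trimmed mean stays near the bulk of the values.

```python
import numpy as np
from scipy.stats import trim_mean

# Nine well-behaved values plus one large outlier (made-up data)
values = np.array([10, 11, 12, 12, 13, 13, 14, 14, 15, 200], dtype=float)

print(np.mean(values))          # 31.4 -- pulled up by the outlier
print(trim_mean(values, 0.1))   # 13.0 -- drops 10% from each tail before averaging
```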
The mean is very sensitive to outliers; the variance is computed from the mean, and hence it is prone to the same issues. We can use other measures of dispersion to avoid this trap. One such measure is the mean absolute deviation: instead of taking the square of the difference between the individual values and the mean and dividing it by the number of instances, we take the absolute value of the difference between the individual values and the mean and divide it by the number of instances. In step 5, we will define a function for this:
def mad(x, axis=None):
    mean = np.mean(x, axis=axis)
    return np.sum(np.abs(x - mean)) / (1.0 * len(x))
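A quick sanity check of this definition on made-up numbers:

```python
import numpy as np

def mad(x, axis=None):
    mean = np.mean(x, axis=axis)
    return np.sum(np.abs(x - mean)) / (1.0 * len(x))

x = np.array([2.0, 4.0, 6.0, 8.0])   # mean is 5; absolute deviations are 3, 1, 1, 3
print(mad(x))                        # (3 + 1 + 1 + 3) / 4 = 2.0
```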
With data that contains many outliers, another set of metrics comes in handy: the median and percentiles. Traditionally, the median is defined as a value from the dataset such that half of the points in the dataset are smaller than it and the other half are larger.
Interpreting the percentiles:
25% of the points in the dataset are below 13.00 (25th percentile value).
50% of the points in the dataset are below 18.50 (50th percentile value).
75% of the points in the dataset are below 25.25 (75th percentile value).
A point to note is that the 50th percentile is the median. Percentiles give us a good idea of the range of our values.
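To make the percentile interpretation concrete, here is a small sketch on made-up data (the 13.00, 18.50, and 25.25 values above come from the recipe's own example, not from this array):

```python
import numpy as np

# Made-up data; np.percentile interpolates linearly by default
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)

q25, q50, q75 = np.percentile(data, [25, 50, 75])
print(q25, q50, q75)      # 3.5 6.0 8.5

# The 50th percentile is the same thing as the median
print(np.median(data))    # 6.0
```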
The median is a measure of the location of the data distribution. Using percentiles, we can get a metric for the dispersion of the data: the interquartile range. The interquartile range is the distance between the 75th percentile and the 25th percentile.
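The interquartile range reduces to a one-line calculation; a sketch on the same kind of made-up array:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)

# IQR = 75th percentile minus 25th percentile
iqr = np.percentile(data, 75) - np.percentile(data, 25)
print(iqr)   # 8.5 - 3.5 = 5.0
```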
Similar to the mean absolute deviation explained previously, we also have the median absolute deviation:
def mdad(x, axis=None):
    median = np.median(x, axis=axis)
    return np.median(np.abs(x - median))
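To see why the median absolute deviation is considered robust, a small sketch on made-up data with and without a single large outlier: the standard deviation explodes, while the median absolute deviation does not move at all.

```python
import numpy as np

def mdad(x, axis=None):
    median = np.median(x, axis=axis)
    return np.median(np.abs(x - median))

# Made-up data: the same values with and without one large outlier
clean = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
dirty = np.array([10.0, 11.0, 12.0, 13.0, 140.0])

# The outlier inflates the standard deviation dramatically, while the
# median absolute deviation stays at 1.0 in both cases.
print(np.std(clean), mdad(clean))   # roughly 1.41, and exactly 1.0
print(np.std(dirty), mdad(dirty))   # roughly 51.4, and exactly 1.0
```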
Source code:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
@author: snaildove
"""
# Load libraries
from sklearn.datasets import load_iris
import numpy as np
from scipy.stats import trim_mean

# Load iris data
data = load_iris()
x = data['data']
y = data['target']
col_names = data['feature_names']

# Let's now demonstrate how to calculate the mean, trimmed mean, and range values:
# 1. Calculate and print the mean value of each column in the Iris dataset
print "col name,mean value"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f"%(col_name,np.mean(x[:,i]))

# 2. Trimmed mean calculation
p = 0.1 # 10% trimmed mean
print "col name,trimmed mean value"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f"%(col_name,trim_mean(x[:,i],p))

# 3. Data dispersion: calculate and display the range values
print "col_names,max,min,range"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f,%0.2f,%0.2f"%(col_name,max(x[:,i]),min(x[:,i]),max(x[:,i])-min(x[:,i]))

# Finally, we will show the variance, standard deviation, mean absolute deviation, and
# median absolute deviation calculations:
# 4. Data dispersion: variance and standard deviation
print "col_names,variance,std-dev"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f,%0.2f"%(col_name,np.var(x[:,i]),np.std(x[:,i]))

# 5. Mean absolute deviation calculation
def mad(x,axis=None):
    mean = np.mean(x,axis=axis)
    return np.sum(np.abs(x-mean))/(1.0 * len(x))

print "col_names,mad"
for i,col_name in enumerate(col_names):
    print "%s,%0.2f"%(col_name,mad(x[:,i]))

# 6. Median absolute deviation and interquartile range calculation
def mdad(x,axis=None):
    median = np.median(x,axis=axis)
    return np.median(np.abs(x-median))

print "col_names,median,median abs dev,inter quartile range"
for i,col_name in enumerate(col_names):
    # fixed: both percentiles must be taken over column i (was x[i,:] for the 25th)
    iqr = np.percentile(x[:,i],75) - np.percentile(x[:,i],25)
    print "%s,%0.2f,%0.2f,%0.2f"%(col_name,np.median(x[:,i]),mdad(x[:,i]),iqr)
Output:
col name,mean value
sepal length (cm),5.84
sepal width (cm),3.05
petal length (cm),3.76
petal width (cm),1.20
col name,trimmed mean value
sepal length (cm),5.81
sepal width (cm),3.04
petal length (cm),3.76
petal width (cm),1.18
col_names,max,min,range
sepal length (cm),7.90,4.30,3.60
sepal width (cm),4.40,2.00,2.40
petal length (cm),6.90,1.00,5.90
petal width (cm),2.50,0.10,2.40
col_names,variance,std-dev
sepal length (cm),0.68,0.83
sepal width (cm),0.19,0.43
petal length (cm),3.09,1.76
petal width (cm),0.58,0.76
col_names,mad
sepal length (cm),0.69
sepal width (cm),0.33
petal length (cm),1.56
petal width (cm),0.66
col_names,median,median abs dev,inter quartile range
sepal length (cm),5.80,0.70,1.30
sepal width (cm),3.00,0.25,0.50
petal length (cm),4.35,1.25,3.50
petal width (cm),1.30,0.70,1.50