
Pandas: value_counts, index, and to_dict

2016-04-21 19:30
The data in this post describe college majors and employment outcomes. There are two CSV files, all-ages.csv and recent-grads.csv.

The main columns are:

Rank - The numerical rank of the major by post-graduation median earnings.

Major_code - The numerical code of the major.

Major - The description of the major.

Major_category - The category of the major.

Total - The total number of people who studied the major.

Men - The number of men who studied the major.

Women - The number of women who studied the major.

ShareWomen - The share of women (from 0 to 1) who studied the major.

Employed - The number of people who studied the major and were employed post-graduation.

recent-grads.csv

all-ages.csv is similar to it; only the values in some columns differ.

Summarizing Major Categories

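For both datasets, compute the number of people who studied in each major category (each major category contains several majors).

Series.value_counts returns a Series containing the counts of unique values, that is, how many times each distinct value occurs in the original Series.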

print(all_ages['Major_category'].value_counts())
'''
Engineering                            29
Education                              16
Humanities & Liberal Arts              15
Biology & Life Science                 14
Business                               13
Health                                 12
Computers & Mathematics                11
Physical Sciences                      10
Agriculture & Natural Resources        10
Psychology & Social Work                9
Social Science                          9
Arts                                    8
Industrial Arts & Consumer Services     7
Law & Public Policy                     5
Communications & Journalism             4
Interdisciplinary                       1
Name: Major_category, dtype: int64
'''
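
Note that value_counts counts rows (here, majors) per category, not people. A minimal sketch on made-up data makes the distinction concrete; the toy DataFrame and its numbers are assumptions for illustration and are not taken from the CSV files:

```python
import pandas as pd

# Hypothetical toy data: three majors in two categories.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Major_category": ["Engineering", "Engineering", "Arts"],
    "Total": [100, 50, 30],
})

# Number of rows (majors) in each category, most frequent first.
majors_per_category = toy["Major_category"].value_counts()
print(majors_per_category["Engineering"])  # number of Engineering majors
print(majors_per_category["Arts"])         # number of Arts majors
```

The head counts in Total play no role here, which is why the totals per category have to be summed separately.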


Its index attribute then gives the distinct category names as an Index object:

print(all_ages['Major_category'].value_counts().index)
'''
Index([u'Engineering', u'Education', u'Humanities & Liberal Arts',
u'Biology & Life Science', u'Business', u'Health',
u'Computers & Mathematics', u'Physical Sciences',
u'Agriculture & Natural Resources', u'Psychology & Social Work',
u'Social Science', u'Arts', u'Industrial Arts & Consumer Services',
u'Law & Public Policy', u'Communications & Journalism',
u'Interdisciplinary'],
dtype='object')
'''
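
As a small aside, the distinct values can also be obtained with Series.unique. The sketch below, on made-up values, compares the two; the only differences are the return type and the ordering:

```python
import pandas as pd

# Hypothetical category values, not taken from the CSV files.
categories = pd.Series(["Engineering", "Arts", "Engineering"])

from_counts = categories.value_counts().index  # Index, ordered by frequency
from_unique = categories.unique()              # array, in order of first appearance

print(list(from_counts))
print(list(from_unique))
```

Either result can be iterated over in a for loop, so the choice mostly depends on whether the frequency ordering matters.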


The code below therefore computes the number of students in each major category; the head count of each major is stored in the Total column.

all_ages_major_categories = dict()
recent_grads_major_categories = dict()

def calculate_major_cat_totals(df):
    # cats holds the distinct values of Major_category
    cats = df['Major_category'].value_counts().index
    counts_dictionary = dict()
    for c in cats:
        major_df = df[df["Major_category"] == c]  # rows whose category is c
        total = major_df["Total"].sum(axis=0)  # sum of Total over those rows
        counts_dictionary[c] = total
    return counts_dictionary

all_ages_major_categories = calculate_major_cat_totals(all_ages)
recent_grads_major_categories = calculate_major_cat_totals(recent_grads)
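
To see what the function returns, here is a run on a tiny made-up DataFrame. The data are assumptions for illustration, and the function is repeated only so that the sketch is self-contained:

```python
import pandas as pd

def calculate_major_cat_totals(df):
    # Distinct category names.
    cats = df["Major_category"].value_counts().index
    counts_dictionary = dict()
    for c in cats:
        major_df = df[df["Major_category"] == c]  # rows in category c
        counts_dictionary[c] = major_df["Total"].sum(axis=0)
    return counts_dictionary

# Hypothetical toy data; the real DataFrames come from the CSV files.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Arts"],
    "Total": [100, 50, 30],
})

toy_totals = calculate_major_cat_totals(toy)
print(toy_totals["Engineering"])  # 100 + 50
print(toy_totals["Arts"])
```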


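Building on what was covered earlier, there is a simpler way that gives exactly the same result: pivot_table does the aggregation, and to_dict() converts the resulting Series to a dict.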

# -*- coding: utf-8 -*-
import pandas as pd

all_ages = pd.read_csv("all-ages.csv")
recent_grads = pd.read_csv("recent-grads.csv")

# In current pandas versions pivot_table returns a DataFrame, so select the
# Total column to get a Series before calling to_dict().
all_ages_major_categories = all_ages.pivot_table(index="Major_category", values="Total", aggfunc="sum")["Total"].to_dict()
recent_grads_major_categories = recent_grads.pivot_table(index="Major_category", values="Total", aggfunc="sum")["Total"].to_dict()
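
One thing to watch: depending on the pandas version, pivot_table may return a DataFrame rather than a Series, and to_dict() on a DataFrame is keyed by column name first. The sketch below, on made-up data and assuming a recent pandas version, shows the difference and a groupby variant that yields a Series directly:

```python
import pandas as pd

# Hypothetical toy data standing in for all_ages / recent_grads.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Arts"],
    "Total": [100, 50, 30],
})

pivoted = toy.pivot_table(index="Major_category", values="Total", aggfunc="sum")

# With a DataFrame, to_dict() nests the categories under the column name.
nested = pivoted.to_dict()
# Selecting the column first gives a flat {category: total} mapping.
flat = pivoted["Total"].to_dict()

# groupby followed by a column selection returns a Series, so to_dict() is flat.
grouped = toy.groupby("Major_category")["Total"].sum().to_dict()

print(nested)
print(flat)
print(grouped)
```

The flat form is the one that matches the dictionary built by the explicit loop.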


Low Wage Jobs Rates

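Next, we look at what share of graduates end up in low-wage jobs rather than well-paid ones.

"Low_wage_jobs": the number of people working in low-wage jobs

"Total": the number of people in each major

The proportion of graduates in low-wage jobs (a fraction between 0 and 1) is therefore: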

low_wage_percent = 0.0
low_wage_percent = (recent_grads['Low_wage_jobs'].sum(axis=0))/(recent_grads['Total'].sum(axis=0))
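
Dividing the sum by the sum weights each major by its size, which is not the same as averaging the per-major rates. A small worked example on made-up numbers (assumptions for illustration only) shows the difference:

```python
import pandas as pd

# Hypothetical toy data: a small major and a large one.
toy = pd.DataFrame({
    "Major": ["A", "B"],
    "Total": [100, 300],
    "Low_wage_jobs": [10, 60],
})

# Overall share: (10 + 60) / (100 + 300)
overall = toy["Low_wage_jobs"].sum() / toy["Total"].sum()

# Per-major shares: 10/100 and 60/300, then their unweighted mean.
per_major = toy["Low_wage_jobs"] / toy["Total"]
unweighted_mean = per_major.mean()

print(overall)
print(unweighted_mean)
```

The overall share is the right figure for "what fraction of all graduates hold low-wage jobs", while the unweighted mean answers a question about the typical major.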


Comparing Datasets

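There are now two datasets: all_ages (the overall historical data) and recent_grads (the last few years). Both have 173 rows, so they can be compared major by major.

Comparing the unemployment rate of each major gives 43:128. Recent graduates have the lower unemployment rate in only 43 majors, while the all-ages group has the lower rate in 128 majors, so for most majors recent graduates face higher unemployment.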

# All majors, common to both DataFrames
majors = recent_grads['Major'].value_counts().index

# Each counter records how often that dataset has the lower unemployment rate
recent_grads_lower_emp_count = 0
all_ages_lower_emp_count = 0
for m in majors:
    recent_grads_row = recent_grads[recent_grads['Major'] == m]
    all_ages_row = all_ages[all_ages['Major'] == m]

    recent_grads_unemp_rate = recent_grads_row['Unemployment_rate'].values[0]
    all_ages_unemp_rate = all_ages_row['Unemployment_rate'].values[0]

    # Ties are counted in neither group
    if recent_grads_unemp_rate < all_ages_unemp_rate:
        recent_grads_lower_emp_count += 1
    elif all_ages_unemp_rate < recent_grads_unemp_rate:
        all_ages_lower_emp_count += 1
print(recent_grads_lower_emp_count)
print(all_ages_lower_emp_count)
'''
43
128
'''
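
The same comparison can be sketched without an explicit loop by merging the two tables on Major and comparing the columns. The toy frames, the suffixes and the variable names below are assumptions for illustration, not the author's code:

```python
import pandas as pd

# Hypothetical toy data with the two columns the comparison needs.
recent = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.05, 0.08, 0.04],
})
older = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.03, 0.09, 0.04],
})

# One row per major, with both rates side by side.
merged = recent.merge(older, on="Major", suffixes=("_recent", "_all"))

# Boolean comparisons sum to the number of True entries; ties count for neither.
recent_lower = int((merged["Unemployment_rate_recent"] < merged["Unemployment_rate_all"]).sum())
all_lower = int((merged["Unemployment_rate_all"] < merged["Unemployment_rate_recent"]).sum())

print(recent_lower)
print(all_lower)
```

An inner merge also drops any major that appears in only one table, whereas the loop would raise an error on such a major.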