Pandas: value_counts, index, and to_dict
2016-04-21 19:30
The data in this article describes college majors and employment outcomes. It comes in two CSV files, all-ages.csv and recent-grads.csv.
The main columns are:
Rank - The numerical rank of the major by post-graduation median earnings.
Major_code - The numerical code of the major.
Major - The description of the major.
Major_category - The category of the major.
Total - The total number of people who studied the major.
Men - The number of men who studied the major.
Women - The number of women who studied the major.
ShareWomen - The share of women (from 0 to 1) who studied the major.
Employed - The number of people who studied the major and were employed post-graduation.
recent-grads.csv
all-ages.csv is similar to recent-grads.csv; only the values in some columns differ.
Summarizing Major Categories
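For both datasets, compute the number of students in each major category (each major category contains several majors).
Series.value_counts returns a Series holding the number of times each unique value occurs (the documentation says: "Returns object containing counts of unique values"). Applied to the Major_category column, it counts how many majors fall into each category, not how many students.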
    print(all_ages['Major_category'].value_counts())
    '''
    Engineering                            29
    Education                              16
    Humanities & Liberal Arts              15
    Biology & Life Science                 14
    Business                               13
    Health                                 12
    Computers & Mathematics                11
    Physical Sciences                      10
    Agriculture & Natural Resources        10
    Psychology & Social Work                9
    Social Science                          9
    Arts                                    8
    Industrial Arts & Consumer Services     7
    Law & Public Policy                     5
    Communications & Journalism             4
    Interdisciplinary                       1
    Name: Major_category, dtype: int64
    '''
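To make the behaviour of value_counts concrete, here is a minimal sketch on a tiny made-up Series (the values are illustrative and are not taken from all-ages.csv). It shows that the result counts rows per unique value and is sorted from most to least frequent.

```python
import pandas as pd

# Toy data, invented for illustration only
categories = pd.Series(
    ["Engineering", "Arts", "Engineering", "Health", "Engineering", "Arts"],
    name="Major_category",
)

# One entry per unique value; the value is the number of rows holding it
counts = categories.value_counts()
print(counts)

# Individual counts can be looked up by label
print(counts["Engineering"])  # 3

# The counts always add up to the number of (non-null) rows
print(counts.sum() == len(categories))  # True
```

Because each row of the real dataset is one major, the same call on Major_category gives the number of majors in each category.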
Taking .index of that result gives an Index object holding the category names:
    print(all_ages['Major_category'].value_counts().index)
    '''
    Index([u'Engineering', u'Education', u'Humanities & Liberal Arts',
           u'Biology & Life Science', u'Business', u'Health',
           u'Computers & Mathematics', u'Physical Sciences',
           u'Agriculture & Natural Resources', u'Psychology & Social Work',
           u'Social Science', u'Arts', u'Industrial Arts & Consumer Services',
           u'Law & Public Policy', u'Communications & Journalism',
           u'Interdisciplinary'],
          dtype='object')
    '''
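As a small illustration of what that Index contains, the sketch below uses a made-up Series and compares value_counts().index with Series.unique(). Both list each distinct value once; they differ only in ordering.

```python
import pandas as pd

# Toy data, invented for illustration only
s = pd.Series(["a", "b", "b", "c", "b", "a"])

# Index of the value_counts result: distinct values, most frequent first
cats = s.value_counts().index
print(list(cats))        # ['b', 'a', 'c']

# unique() returns the same distinct values in order of first appearance
print(list(s.unique()))  # ['a', 'b', 'c']

# An Index can be iterated like a list
for c in cats:
    print(c, (s == c).sum())
```

If only the distinct values are needed and the order does not matter, either form works as the loop source.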
The code below therefore computes the number of students in each major category. The head count of each major is stored in the Total column.
    all_ages_major_categories = dict()
    recent_grads_major_categories = dict()

    def calculate_major_cat_totals(df):
        # cats holds the distinct values of Major_category
        cats = df['Major_category'].value_counts().index
        counts_dictionary = dict()
        for c in cats:
            major_df = df[df["Major_category"] == c]  # rows whose category is c
            total = major_df["Total"].sum(axis=0)     # sum of Total for those rows
            counts_dictionary[c] = total
        return counts_dictionary

    all_ages_major_categories = calculate_major_cat_totals(all_ages)
    recent_grads_major_categories = calculate_major_cat_totals(recent_grads)
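Based on what I learned earlier, I came up with a simpler approach that produces exactly the same result. It uses a pivot table and then to_dict() to convert the resulting Series into a dict.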
    # -*- coding: utf-8 -*-
    import pandas as pd

    all_ages = pd.read_csv("all-ages.csv")
    recent_grads = pd.read_csv("recent-grads.csv")

    # In current pandas, pivot_table returns a DataFrame here, so select the
    # Total column to get a Series before calling to_dict(); otherwise the
    # result would be nested as {'Total': {...}}.
    all_ages_major_categories = all_ages.pivot_table(
        index="Major_category", values="Total", aggfunc="sum")["Total"].to_dict()
    recent_grads_major_categories = recent_grads.pivot_table(
        index="Major_category", values="Total", aggfunc="sum")["Total"].to_dict()
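The claim that both approaches agree can be checked on a small made-up DataFrame. The sketch below is an illustration under that assumption (the rows are invented, not read from the CSV files) and also adds a groupby variant, which is not in the original article but is another common way to write the same aggregation.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "Major_category": ["Engineering", "Arts", "Engineering", "Health"],
    "Major": ["Civil", "Music", "Mechanical", "Nursing"],
    "Total": [100, 40, 250, 90],
})

# 1) Explicit loop over the distinct categories
loop_totals = {}
for c in df["Major_category"].value_counts().index:
    loop_totals[c] = df.loc[df["Major_category"] == c, "Total"].sum()

# 2) Pivot table, selecting the Total column to get a Series
pivot_totals = df.pivot_table(
    index="Major_category", values="Total", aggfunc="sum"
)["Total"].to_dict()

# 3) groupby (an alternative, not used in the article)
groupby_totals = df.groupby("Major_category")["Total"].sum().to_dict()

print(groupby_totals)
print(loop_totals == pivot_totals == groupby_totals)  # True
```

All three produce a dict mapping each category to the summed Total, so the choice is mostly a matter of readability.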
Low Wage Jobs Rates
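Next, we look at how many graduates fail to find a well-paid job and end up in low-wage work. Two columns are needed:
“Low_wage_jobs”: the number of people working in low-wage jobs
“Total”: the number of people in each major
The share of graduates in low-wage jobs is therefore: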
    # Fraction (between 0 and 1) of all recent graduates who hold low-wage jobs
    low_wage_percent = (recent_grads['Low_wage_jobs'].sum(axis=0)) / (recent_grads['Total'].sum(axis=0))
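A short sketch on invented numbers shows what this ratio measures. Dividing the two column sums weights every graduate equally, which is not the same as averaging the per-major shares; the figures below are made up purely to show the difference.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [1000, 200, 50],
    "Low_wage_jobs": [100, 60, 25],
})

# Overall share: all low-wage graduates divided by all graduates
overall_share = toy["Low_wage_jobs"].sum() / toy["Total"].sum()
print(overall_share)  # 185 / 1250

# Per-major shares, then their unweighted mean
per_major_share = toy["Low_wage_jobs"] / toy["Total"]
mean_of_shares = per_major_share.mean()
print(mean_of_shares)
```

Here the overall share is 185/1250, while the mean of the per-major shares (0.1, 0.3, 0.5) is noticeably higher, because small majors count as much as large ones in the unweighted mean.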
Comparing Datasets
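There are now two datasets, all_ages (all historical data) and recent_grads (the last few years), each with 173 rows, so they can be compared major by major. Comparing the unemployment rate of each major gives 43:128: recent graduates have the lower unemployment rate in 43 majors, while the all-ages data has the lower rate in 128 majors. In other words, recent graduates face higher unemployment than the overall population in most majors.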
    # All majors, common to both DataFrames
    majors = recent_grads['Major'].value_counts().index

    # Despite the names, these count the majors in which each dataset
    # has the lower unemployment rate
    recent_grads_lower_emp_count = 0
    all_ages_lower_emp_count = 0
    for m in majors:
        recent_grads_row = recent_grads[recent_grads['Major'] == m]
        all_ages_row = all_ages[all_ages['Major'] == m]

        recent_grads_unemp_rate = recent_grads_row['Unemployment_rate'].values[0]
        all_ages_unemp_rate = all_ages_row['Unemployment_rate'].values[0]

        if recent_grads_unemp_rate < all_ages_unemp_rate:
            recent_grads_lower_emp_count += 1
        elif all_ages_unemp_rate < recent_grads_unemp_rate:
            all_ages_lower_emp_count += 1

    print(recent_grads_lower_emp_count)
    print(all_ages_lower_emp_count)
    '''
    43
    128
    '''
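The same comparison can also be written without an explicit loop. The following is a sketch on made-up rates, assuming every major appears exactly once in each DataFrame; it joins the two tables on Major with DataFrame.merge and counts the rows where one rate is lower than the other. The variable names and the "_recent"/"_all" suffixes are chosen for this example only.

```python
import pandas as pd

# Toy data, invented for illustration only
recent = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Unemployment_rate": [0.05, 0.08, 0.06, 0.04],
})
everyone = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Unemployment_rate": [0.06, 0.05, 0.06, 0.03],
})

# One row per major, with both rates side by side
merged = recent.merge(everyone, on="Major", suffixes=("_recent", "_all"))

recent_rate = merged["Unemployment_rate_recent"]
all_rate = merged["Unemployment_rate_all"]

# Boolean Series sum to the number of True entries
recent_lower = int((recent_rate < all_rate).sum())
all_lower = int((all_rate < recent_rate).sum())
ties = len(merged) - recent_lower - all_lower

print(recent_lower, all_lower, ties)  # 1 2 1
```

Counting ties separately makes it clear why the two counts need not add up to the number of majors.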