Pandas Data Analysis Library: Reading Data

2017-11-09 17:43, 579 views

All examples below assume pandas has already been imported:

import pandas

We use a CSV file, food_info.csv, as an example to show how pandas reads data.

Read the CSV file:

food_info = pandas.read_csv("food_info.csv")
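Since food_info.csv itself is not included here, the following is a minimal sketch that reads a tiny inline sample instead. The three rows reuse values that appear in the outputs further down; the variable names `csv_text` and `sample` are illustrative assumptions, not part of the original article.

```python
import io
import pandas

# Hypothetical three-row stand-in for food_info.csv (only four of the 36 columns)
csv_text = """NDB_No,Shrt_Desc,Water_(g),Energ_Kcal
1001,BUTTER WITH SALT,15.87,717
1003,BUTTER OIL ANHYDROUS,0.24,876
1004,CHEESE BLUE,42.41,353
"""

# read_csv accepts a file path or any file-like object, such as StringIO
sample = pandas.read_csv(io.StringIO(csv_text))
print(sample)
```

The first line of the text is used as the header row, and a default integer index starting at 0 is assigned to the rows.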

1. Check the data structure. pandas.read_csv returns a DataFrame.

print(type(food_info))

OUT:

<class 'pandas.core.frame.DataFrame'>

2. Check the data type of each column. Common pandas dtypes include int, float, object, datetime, and bool; object usually means the column holds string values.

print(food_info.dtypes)

OUT:

NDB_No               int64
Shrt_Desc           object
Water_(g)          float64
Energ_Kcal           int64
Protein_(g)        float64
Lipid_Tot_(g)      float64
…………………………
Cholestrl_(mg)     float64
dtype: object
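One detail worth seeing in isolation: with default settings, read_csv infers an integer column as int64 only when it has no missing values, and falls back to float64 when a cell is empty. The sketch below uses a hypothetical two-row sample with one deliberately blank cell; the data and names are assumptions for illustration.

```python
import io
import pandas

# Hypothetical sample: the second row has an empty Sodium_(mg) cell
csv_text = """NDB_No,Shrt_Desc,Water_(g),Sodium_(mg)
1001,BUTTER WITH SALT,15.87,643
1003,BUTTER OIL ANHYDROUS,0.24,
"""

sample = pandas.read_csv(io.StringIO(csv_text))

# NDB_No has no gaps, so it is inferred as an integer column.
# Sodium_(mg) contains a missing value, so it is inferred as float64.
print(sample.dtypes)
```

This is probably why columns such as Vit_A_IU show values like 721.0 in the outputs below even though the numbers are whole.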

3. View the built-in help documentation. help() prints the text itself, so it does not need to be wrapped in print().

help(pandas.read_csv)

OUT:

Help on function read_csv in module pandas.io.parsers:
…………………………
Returns
-------
result : DataFrame or TextParser
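The help text lists many optional parameters. As a small sketch of two commonly used ones, `usecols` limits which columns are loaded and `nrows` limits how many data rows are read. The inline sample and the name `subset` are assumptions for illustration.

```python
import io
import pandas

# Hypothetical three-row stand-in for food_info.csv
csv_text = """NDB_No,Shrt_Desc,Water_(g),Energ_Kcal
1001,BUTTER WITH SALT,15.87,717
1003,BUTTER OIL ANHYDROUS,0.24,876
1004,CHEESE BLUE,42.41,353
"""

# Load only two columns and only the first two data rows
subset = pandas.read_csv(
    io.StringIO(csv_text),
    usecols=["NDB_No", "Energ_Kcal"],
    nrows=2,
)
print(subset)
```

Both options can be useful for a quick look at a large file before loading all of it.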

4. (1) View the first 5 rows of the DataFrame. Pass an argument n to show the first n rows instead.

food_info.head()

(2) View the last 5 rows of the DataFrame. Pass an argument n to show the last n rows instead.

food_info.tail()
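A minimal sketch of the n argument, using a hypothetical ten-row frame so the result is easy to check by eye:

```python
import pandas

# Hypothetical frame with a single column holding 0..9
numbers = pandas.DataFrame({"n": range(10)})

first_three = numbers.head(3)  # rows 0, 1, 2
last_two = numbers.tail(2)     # rows 8, 9

print(first_three)
print(last_two)
```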

5. Return the column names

print(food_info.columns)

OUT:

Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
       'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
       'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
       'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
       'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
       'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
       'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
       'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
       'Cholestrl_(mg)'],
      dtype='object')

6. Check the shape of the DataFrame as (rows, columns)

print(food_info.shape)

OUT:

(8618, 36)
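Because shape is a plain tuple and columns is an Index, both can be unpacked or converted for later use. A minimal sketch on a hypothetical three-row frame:

```python
import pandas

# Hypothetical small frame reusing values from the article's outputs
sample = pandas.DataFrame({
    "NDB_No": [1001, 1003, 1004],
    "Energ_Kcal": [717, 876, 353],
})

n_rows, n_cols = sample.shape            # shape is a (rows, columns) tuple
column_names = sample.columns.tolist()   # Index converted to a plain list

print(n_rows, n_cols)
print(column_names)
```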

7. View the row with a given index label

print(food_info.loc[0])

OUT:

NDB_No                         1001
Shrt_Desc          BUTTER WITH SALT
Water_(g)                     15.87
Energ_Kcal                      717
Protein_(g)                    0.85
Lipid_Tot_(g)                 81.11
Ash_(g)                        2.11
Carbohydrt_(g)                 0.06
Fiber_TD_(g)                      0
Sugar_Tot_(g)                  0.06
Calcium_(mg)                     24
Iron_(mg)                      0.02
Magnesium_(mg)                    2
Phosphorus_(mg)                  24
Potassium_(mg)                   24
Sodium_(mg)                     643
Zinc_(mg)                      0.09
Copper_(mg)                       0
Manganese_(mg)                    0
Selenium_(mcg)                    1
Vit_C_(mg)                        0
Thiamin_(mg)                  0.005
Riboflavin_(mg)               0.034
Niacin_(mg)                   0.042
Vit_B6_(mg)                   0.003
Vit_B12_(mcg)                  0.17
Vit_A_IU                       2499
Vit_A_RAE                       684
Vit_E_(mg)                     2.32
Vit_D_mcg                       1.5
Vit_D_IU                         60
Vit_K_(mcg)                       7
FA_Sat_(g)                   51.368
FA_Mono_(g)                  21.021
FA_Poly_(g)                   3.043
Cholestrl_(mg)                  215
Name: 0, dtype: object

8. Return a slice of rows by index label (with loc, both endpoints are included)

print(food_info.loc[3:5])

OUT:

   NDB_No     Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  Lipid_Tot_(g)  \
3    1004   CHEESE BLUE      42.41         353        21.40          28.74
4    1005  CHEESE BRICK      41.11         371        23.24          29.68
5    1006   CHEESE BRIE      48.42         334        20.75          27.68

   Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)       ...        \
3     5.11            2.34           0.0           0.50       ...
4     3.18            2.79           0.0           0.51       ...
5     2.70            0.45           0.0           0.45       ...

   Vit_A_IU  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  Vit_K_(mcg)  \
3     721.0      198.0        0.25        0.5      21.0          2.4
4    1080.0      292.0        0.26        0.5      22.0          2.5
5     592.0      174.0        0.24        0.5      20.0          2.3

   FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)
3      18.669        7.778        0.800            75.0
4      18.764        8.598        0.784            94.0
5      17.410        8.013        0.826           100.0

[3 rows x 36 columns]
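The slice above returns three rows (3, 4, and 5) because loc slices by label and includes the end label. Position-based slicing with iloc follows normal Python rules and excludes the end. A minimal sketch on a hypothetical six-row frame:

```python
import pandas

# Hypothetical frame with the default integer index 0..5
sample = pandas.DataFrame({"NDB_No": [1001, 1002, 1003, 1004, 1005, 1006]})

by_label = sample.loc[3:5]      # labels 3, 4, 5 (end label included)
by_position = sample.iloc[3:5]  # positions 3, 4 (end position excluded)

print(by_label)
print(by_position)
```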

9. Return several rows selected by a list of index labels

two_five_ten = [2, 5, 10]
print(food_info.loc[two_five_ten])

OUT:

    NDB_No             Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \
2     1003  BUTTER OIL ANHYDROUS       0.24         876         0.28
5     1006           CHEESE BRIE      48.42         334        20.75
10    1011          CHEESE COLBY      38.20         394        23.76

    Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  \
2           99.48     0.00            0.00           0.0           0.00
5           27.68     2.70            0.45           0.0           0.45
10          32.11     3.36            2.57           0.0           0.52

         ...        Vit_A_IU  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  \
2        ...          3069.0      840.0        2.80        1.8      73.0
5        ...           592.0      174.0        0.24        0.5      20.0
10       ...           994.0      264.0        0.28        0.6      24.0

    Vit_K_(mcg)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)
2           8.6      61.924       28.732        3.694           256.0
5           2.3      17.410        8.013        0.826           100.0
10          2.7      20.218        9.280        0.953            95.0

[3 rows x 36 columns]
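When loc is given a list, the rows come back in the order of the list, and a label may be repeated. A minimal sketch on a hypothetical six-row frame (the name `picked` is illustrative):

```python
import pandas

# Hypothetical frame with the default integer index 0..5
sample = pandas.DataFrame({"NDB_No": [1001, 1002, 1003, 1004, 1005, 1006]})

# Rows are returned in the requested order; label 2 appears twice
picked = sample.loc[[5, 2, 2]]
print(picked)
```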

10. (1) View a single column of the DataFrame

food_info["NDB_No"]

OUT:

0        1001
1        1002
2        1003
3        1004
4        1005
5        1006
6        1007
7        1008
8        1009
9        1010
10       1011
11       1012
12       1013
13       1014
14       1015
15       1016
16       1017
17       1018
18       1019
19       1020
20       1021
21       1022
22       1023
23       1024
24       1025
25       1026
26       1027
27       1028
28       1029
29       1030
...
8588    43544
8589    43546
8590    43550
8591    43566
8592    43570
8593    43572
8594    43585
8595    43589
8596    43595
8597    43597
8598    43598
8599    44005
8600    44018
8601    44048
8602    44055
8603    44061
8604    44074
8605    44110
8606    44158
8607    44203
8608    44258
8609    44259
8610    44260
8611    48052
8612    80200
8613    83110
8614    90240
8615    90480
8616    90560
8617    93600
Name: NDB_No, Length: 8618, dtype: int64

(2) View multiple columns

col = ["Ash_(g)", "Fiber_TD_(g)"]
food_info[col]

OUT:

Ash_(g) Fiber_TD_(g)
0   2.11    0.0
1   2.11    0.0
2   0.00    0.0
3   5.11    0.0
4   3.18    0.0
5   2.70    0.0
6   3.68    0.0
7   3.28    0.0
8   3.71    0.0
9   3.60    0.0
10  3.36    0.0
11  1.41    0.0
12  1.20    0.2
13  1.71    0.0
14  1.27    0.0
15  1.39    0.0
16  1.32    0.0
17  4.22    0.0
18  5.20    0.0
19  3.79    0.0
20  4.75    0.0
21  3.94    0.0
22  4.30    0.0
23  3.79    0.0
24  3.55    0.0
25  3.28    0.0
26  2.91    0.0
27  3.27    0.0
28  3.80    0.0
29  3.66    0.0
... ... ...
8588    2.00    2.6
8589    0.76    1.6
8590    0.29    1.0
8591    1.85    5.7
8592    1.22    4.2
8593    1.71    14.2
8594    0.52    2.0
8595    3.50    0.0
8596    0.80    2.1
8597    2.40    0.0
8598    0.40    0.0
8599    0.00    0.0
8600    0.00    0.1
8601    4.74    0.0
8602    13.90   27.8
8603    9.90    6.1
8604    0.22    0.1
8605    0.08    0.8
8606    0.35    2.6
8607    0.07    0.0
8608    5.70    10.1
8609    1.86    0.9
8610    6.80    0.8
8611    1.00    0.6
8612    1.40    0.0
8613    13.40   0.0
8614    2.97    0.0
8615    0.86    0.0
8616    1.30    0.0
8617    1.20    0.0
8618 rows × 2 columns
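The two outputs above look different because a single column name returns a Series, while a list of names returns a DataFrame, even when the list has only one entry. A minimal sketch with hypothetical variable names:

```python
import pandas

# Hypothetical small frame reusing the first three rows shown above
sample = pandas.DataFrame({
    "Ash_(g)": [2.11, 2.11, 0.00],
    "Fiber_TD_(g)": [0.0, 0.0, 0.0],
})

as_series = sample["Ash_(g)"]   # string key  -> Series
as_frame = sample[["Ash_(g)"]]  # list of keys -> DataFrame with one column

print(type(as_series))
print(type(as_frame))
```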

11. Find the columns in food_info whose unit is grams (g)

col_name = food_info.columns.tolist()
print(col_name)
print("__________")
gram_columns = []

for c in col_name:
    # str.endswith() returns True if the string ends with the given suffix, otherwise False
    if c.endswith("(g)"):
        gram_columns.append(c)

# Select only the columns measured in grams
gram_df = food_info[gram_columns]
print(gram_df.head())

OUT:

['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)', 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)', 'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)', 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)', 'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg', 'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)', 'Cholestrl_(mg)']
__________
   Water_(g)  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  \
0      15.87         0.85          81.11     2.11            0.06
1      15.87         0.85          81.11     2.11            0.06
2       0.24         0.28          99.48     0.00            0.00
3      42.41        21.40          28.74     5.11            2.34
4      41.11        23.24          29.68     3.18            2.79

   Fiber_TD_(g)  Sugar_Tot_(g)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)
0           0.0           0.06      51.368       21.021        3.043
1           0.0           0.06      50.489       23.426        3.012
2           0.0           0.00      61.924       28.732        3.694
3           0.0           0.50      18.669        7.778        0.800
4           0.0           0.51      18.764        8.598        0.784
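The same selection can be written more compactly. The sketch below shows two alternatives on a hypothetical one-row frame: a list comprehension, and a boolean mask built with the string methods of the column Index. It is offered as an equivalent option, not as the article's own method.

```python
import pandas

# Hypothetical one-row frame reusing values from row 0 above
sample = pandas.DataFrame({
    "NDB_No": [1001],
    "Water_(g)": [15.87],
    "Calcium_(mg)": [24],
    "Protein_(g)": [0.85],
})

# Alternative 1: list comprehension over the column names
gram_columns = [c for c in sample.columns if c.endswith("(g)")]

# Alternative 2: boolean mask over the columns, applied with loc
mask = sample.columns.str.endswith("(g)")
gram_df = sample.loc[:, mask]

print(gram_columns)
print(gram_df)
```

Note that "Calcium_(mg)" is correctly excluded, since it ends with "(mg)" rather than "(g)".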

12. Sort in ascending order

food_info.sort_values("Sodium_(mg)", inplace = True, ascending = True)
print(food_info["Sodium_(mg)"])

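Parameters:

inplace: whether to modify the DataFrame in place. With True, food_info itself is sorted and nothing is returned; with False (the default), a new sorted DataFrame is returned and food_info is left unchanged.

ascending: whether to sort in ascending order (True, the default) or descending order (False).

Rows with missing values (NaN) are placed at the end by default.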

OUT:

760     0.0
8607    0.0
629     0.0
631     0.0
6470    0.0
654     0.0
8599    0.0
657     0.0
633     0.0
635     0.0
637     0.0
638     0.0
639     0.0
646     0.0
653     0.0
632     0.0
606     0.0
6463    0.0
634     0.0
666     0.0
8387    0.0
611     0.0
434     0.0
655     0.0
661     0.0
3663    0.0
3664    0.0
3665    0.0
656     0.0
3697    0.0
...
8153    NaN
8155    NaN
8156    NaN
8157    NaN
8158    NaN
8159    NaN
8160    NaN
8161    NaN
8163    NaN
8164    NaN
8165    NaN
8167    NaN
8169    NaN
8170    NaN
8172    NaN
8173    NaN
8174    NaN
8175    NaN
8176    NaN
8177    NaN
8178    NaN
8179    NaN
8180    NaN
8181    NaN
8183    NaN
8184    NaN
8185    NaN
8195    NaN
8251    NaN
8267    NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
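As a complement to the in-place sort, here is a minimal sketch of the non-in-place form, sorting in descending order and renumbering the index afterwards. The four values are hypothetical apart from 643, and names such as `sorted_desc` are illustrative.

```python
import pandas

# Hypothetical sample with one missing value
sample = pandas.DataFrame({"Sodium_(mg)": [643.0, None, 0.0, 2.0]})

# Without inplace=True, a new sorted frame is returned and sample is unchanged
sorted_desc = sample.sort_values("Sodium_(mg)", ascending=False)

# The original index labels travel with the rows; reset_index renumbers them
renumbered = sorted_desc.reset_index(drop=True)

# na_position controls where NaN rows go ("last" is the default)
nan_first = sample.sort_values("Sodium_(mg)", na_position="first")

print(sorted_desc)
print(renumbered)
print(nan_first)
```

Keeping the original frame untouched makes it easier to compare orderings side by side, at the cost of holding a second copy in memory.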

See also:

Pandas Data Analysis Library: Data Preprocessing

Pandas Data Analysis Library: Common Functions

Pandas Data Analysis Library: The Series Structure
