
How to read CSV data with pandas

2018-01-10 16:21
This article shows how to use pandas to read data stored in CSV format.

CSV format here means any file whose values are separated by commas, whatever the file extension is (for example .txt, .data, and so on).



# x.txt

1,2,3
4,5,6

----------------------------------------------------------
import pandas as pd

column_name = ['A', 'B', 'C']
t = pd.read_csv('./x.txt', names=column_name)
print(t)

>>
   A  B  C
0  1  2  3
1  4  5  6
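The names argument matters here because x.txt has no header line. A minimal sketch of the difference, using an in-memory string as a stand-in for x.txt (the io.StringIO wrapper is an assumption made so the snippet runs without a file on disk):

```python
import io
import pandas as pd

# In-memory stand-in for x.txt so the sketch runs without touching disk.
raw = "1,2,3\n4,5,6\n"

# Default: the first line is consumed as the header row.
inferred = pd.read_csv(io.StringIO(raw))
print(list(inferred.columns))   # ['1', '2', '3']
print(len(inferred))            # 1 (only the second line is left as data)

# header=None: keep every line as data; columns are numbered 0, 1, 2.
numbered = pd.read_csv(io.StringIO(raw), header=None)
print(numbered.shape)           # (2, 3)

# names=[...]: keep every line as data and label the columns.
named = pd.read_csv(io.StringIO(raw), names=["A", "B", "C"])
print(named)
```

If the file does start with a header line, as the breast-cancer files used later appear to, the default call without names is the one to use.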


1. Import the pandas package

import pandas as pd


2. Read the files with the read_csv function

train = pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-train.csv')
test = pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-test.csv')
print(train.shape)
print(type(train))

>> (175, 4)
>> <class 'pandas.core.frame.DataFrame'>


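The loaded data is stored in train, but its type is a pandas DataFrame rather than the NumPy array we usually work with. It can be converted with np.array(train) (with numpy imported as np), after which it can be handled like any other matrix.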
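A minimal sketch of that conversion on a small, made-up DataFrame (the column names a and b are hypothetical). DataFrame.to_numpy(), available in pandas 0.24 and later, is shown alongside as an equivalent route:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; the column names are illustrative only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

arr = np.array(df)        # the conversion used in this article
same = df.to_numpy()      # pandas' own method (pandas 0.24+)

print(type(arr))          # <class 'numpy.ndarray'>
print(arr.shape)          # (3, 2)
print(arr[:, 1])          # second column: [4 5 6]
```

Note that the column names are lost in the conversion, so columns have to be addressed by position from then on.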

3. Fit the data

3.1 Convert to an array first

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

train = pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-train.csv')
test = pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-test.csv')
train_data = np.array(train)
test_data = np.array(test)

X_train = train_data[:, 1:3]  # columns 1 and 2 are the training features
y_train = train_data[:, 3]    # column 3 is the label

X_test = test_data[:, 1:3]
y_test = test_data[:, 3]

p_index = np.where(train_data[:, 3] == 1)[0]  # indices of all positive samples
n_index = np.where(train_data[:, 3] == 0)[0]  # indices of all negative samples
positive = X_train[p_index, :]  # all positive samples
negative = X_train[n_index, :]  # all negative samples

plt.scatter(negative[:, 0], negative[:, 1], marker='o', s=200, c='red')  # plot the sample points
plt.scatter(positive[:, 0], positive[:, 1], marker='x', s=150, c='black')
plt.show()

lr = LogisticRegression()
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))  # mean accuracy on the test set
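The slicing and np.where steps above are easy to get wrong by one column, so here is a small sketch on made-up numbers. The four-column layout (an id column, two features, a label) is an assumption that mirrors the indices used in this article, not the real dataset. It also shows that a boolean mask selects the same rows as np.where:

```python
import numpy as np

# Hypothetical rows laid out as [id, feature 1, feature 2, label].
data = np.array([
    [0, 1, 1, 0],
    [1, 5, 4, 1],
    [2, 2, 1, 0],
    [3, 8, 7, 1],
])

X = data[:, 1:3]   # slice 1:3 takes columns 1 and 2 (the end index is exclusive)
y = data[:, 3]     # column 3 holds the labels

idx = np.where(y == 1)[0]   # integer positions of the positive rows
via_where = X[idx, :]
via_mask = X[y == 1]        # a boolean mask selects the same rows

print(idx)         # [1 3]
print(via_mask)
```

Either form works; the mask version just skips the intermediate index array.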


3.2 Work with the DataFrame directly

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

train = pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-train.csv')
test = pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-test.csv')

# Select rows by label and keep only the two feature columns
negative = train.loc[train['Type'] == 0, ['Clump Thickness', 'Cell Size']]
positive = train.loc[train['Type'] == 1, ['Clump Thickness', 'Cell Size']]
plt.scatter(negative['Clump Thickness'], negative['Cell Size'],
            marker='o', s=200, c='red')
plt.scatter(positive['Clump Thickness'], positive['Cell Size'],
            marker='x', s=150, c='black')
plt.show()

X_train = train[['Clump Thickness', 'Cell Size']]
y_train = train['Type']
X_test = test[['Clump Thickness', 'Cell Size']]
y_test = test['Type']

lr = LogisticRegression()
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))  # mean accuracy on the test set
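To see what the label-based selection returns without the real files, here is a rough sketch on a hand-made DataFrame. The rows are invented; only the column names are taken from the code above:

```python
import pandas as pd

# Hypothetical rows; only the column names come from the article.
df = pd.DataFrame({
    "Clump Thickness": [1, 5, 2, 8],
    "Cell Size": [1, 4, 1, 7],
    "Type": [0, 1, 0, 1],
})

features = ["Clump Thickness", "Cell Size"]

# Row filter and column selection in a single .loc call.
positive = df.loc[df["Type"] == 1, features]
negative = df.loc[df["Type"] == 0, features]

X = df[features]   # two-column DataFrame of features
y = df["Type"]     # Series of labels

print(positive.index.tolist())   # [1, 3]
print(X.shape, y.shape)          # (4, 2) (4,)
```

Selecting by column name keeps working if the column order in the file changes, which is the main advantage over the positional slicing in 3.1.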



Reference:

Python机器学习及实践 (Python Machine Learning and Practice)