
Chapter 6: Data Loading, Storage, and File Formats (Day 12-14)

2016-07-31 15:13

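Note: this post is a Python data processing study log. It records the errors encountered while working through the book's examples and the places where current behavior differs from the book; simple operations are not repeated here. The material mainly follows Python for Data Analysis (《利用Python进行数据分析》) by Wes McKinney, published by China Machine Press (机械工业出版社).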

Reading and Writing Data in Text Format

read_csv()

Signature:

pd.read_csv(
    filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None,
    index_col=None, usecols=None, squeeze=False, prefix=None,
    mangle_dupe_cols=True, dtype=None, engine=None, converters=None,
    true_values=None, false_values=None, skipinitialspace=False,
    skiprows=None, skipfooter=None, nrows=None, na_values=None,
    keep_default_na=True, na_filter=True, verbose=False,
    skip_blank_lines=True, parse_dates=False, infer_datetime_format=False,
    keep_date_col=False, date_parser=None, dayfirst=False, iterator=False,
    chunksize=None, compression='infer', thousands=None, decimal='.',
    lineterminator=None, quotechar='"', quoting=0, escapechar=None,
    comment=None, encoding=None, dialect=None, tupleize_cols=False,
    error_bad_lines=True, warn_bad_lines=True, skip_footer=0,
    doublequote=True, delim_whitespace=False, as_recarray=False,
    compact_ints=False, use_unsigned=False, low_memory=True,
    buffer_lines=None, memory_map=False, float_precision=None)

Docstring: Read CSV (comma-separated) file into DataFrame. Also supports optionally iterating or breaking of the file into chunks.

Additional help can be found in the online docs for IO Tools <http://pandas.pydata.org/pandas-docs/stable/io.html>.

Parameters:

filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)
    The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv

sep : str, default ','
    Delimiter to use. If sep is None, will try to automatically determine this. Regular expressions are accepted and will force use of the python parsing engine and will ignore quotes in the data.

delimiter : str, default None
    Alternative argument name for sep.

header : int or list of ints, default 'infer'
    Row number(s) to use as the column names, and the start of the data. Default behavior is as if set to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

names : array-like, default None
    List of column names to use. If file contains no header row, then you should explicitly pass header=None

index_col : int or sequence or False, default None
    Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to not use the first column as the index (row names)

usecols : array-like, default None
    Return a subset of the columns. Results in much faster parsing time and lower memory usage.

squeeze : boolean, default False
    If the parsed data only contains one column then return a Series

prefix : str, default None
    Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...

mangle_dupe_cols : boolean, default True
    Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'

dtype : Type name or dict of column -> type, default None
    Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (Unsupported with engine='python'). Use str or object to preserve and not interpret dtype.

engine : {'c', 'python'}, optional
    Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.

converters : dict, default None
    Dict of functions for converting values in certain columns. Keys can either be integers or column labels

true_values : list, default None
    Values to consider as True

false_values : list, default None
    Values to consider as False

skipinitialspace : boolean, default False
    Skip spaces after delimiter.

skiprows : list-like or integer, default None
    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file

skipfooter : int, default 0
    Number of lines at bottom of file to skip (Unsupported with engine='c')

nrows : int, default None
    Number of rows of file to read. Useful for reading pieces of large files

na_values : str or list-like or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.

keep_default_na : bool, default True
    If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they're appended to.

na_filter : boolean, default True
    Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file

verbose : boolean, default False
    Indicate number of NA values placed in non-numeric columns

skip_blank_lines : boolean, default True
    If True, skip over blank lines rather than interpreting as NaN values

parse_dates : boolean or list of ints or names or list of lists or dict, default False
    * boolean. If True -> try parsing the index.
    * list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
    * list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
    * dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
    Note: A fast-path exists for iso8601-formatted dates.

infer_datetime_format : boolean, default False
    If True and parse_dates is enabled for a column, attempt to infer the datetime format to speed up the processing

keep_date_col : boolean, default False
    If True and parse_dates specifies combining multiple columns then keep the original columns.

date_parser : function, default None
    Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.

dayfirst : boolean, default False
    DD/MM format dates, international and European format

iterator : boolean, default False
    Return TextFileReader object for iteration or getting chunks with get_chunk().

chunksize : int, default None
    Return TextFileReader object for iteration. See IO Tools docs for more information <http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking> on iterator and chunksize.

compression : {'infer', 'gzip', 'bz2', None}, default 'infer'
    For on-the-fly decompression of on-disk data. If 'infer', then use gzip or bz2 if filepath_or_buffer is a string ending in '.gz' or '.bz2', respectively, and no decompression otherwise. Set to None for no decompression.

thousands : str, default None
    Thousands separator

decimal : str, default '.'
    Character to recognize as decimal point (e.g. use ',' for European data).

lineterminator : str (length 1), default None
    Character to break file into lines. Only valid with C parser.

quotechar : str (length 1), optional
    The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.

quoting : int or csv.QUOTE_* instance, default None
    Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3). Default (None) results in QUOTE_MINIMAL behavior.

escapechar : str (length 1), default None
    One-character string used to escape delimiter when quoting is QUOTE_NONE.

comment : str, default None
    Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3' with header=0 will result in 'a,b,c' being treated as the header.

encoding : str, default None
    Encoding to use for UTF when reading/writing (ex. 'utf-8'). List of Python standard encodings <https://docs.python.org/3/library/codecs.html#standard-encodings>

dialect : str or csv.Dialect instance, default None
    If None defaults to Excel dialect. Ignored if sep longer than 1 char. See csv.Dialect documentation for more details

tupleize_cols : boolean, default False
    Leave a list of tuples on columns as is (default is to convert to a Multi Index on the columns)

error_bad_lines : boolean, default True
    Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these "bad lines" will be dropped from the DataFrame that is returned. (Only valid with C parser)

warn_bad_lines : boolean, default True
    If error_bad_lines is False, and warn_bad_lines is True, a warning for each "bad line" will be output. (Only valid with C parser).

Returns:

result : DataFrame or TextParser
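The Returns entry above lists two possible result types. As a rough illustration (a sketch assuming a recent pandas on Python 3; csv_text is made-up in-memory data, not a file from the book), passing chunksize switches read_csv from returning one DataFrame to returning a reader that yields DataFrames piece by piece:

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "key,value\n" + "\n".join(f"k{i % 3},{i}" for i in range(10))

# Without chunksize, read_csv parses everything into a single DataFrame.
whole = pd.read_csv(io.StringIO(csv_text))

# With chunksize, read_csv returns an iterable reader; each item is a
# DataFrame holding at most `chunksize` rows.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
chunk_lengths = [len(chunk) for chunk in reader]

print(type(whole).__name__)
print(chunk_lengths)
```

Iterating in chunks is mainly useful when the whole file would not fit comfortably in memory.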

read_table is similar to read_csv, except that its default sep is '\t'; everything else is largely the same.
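To make the relationship concrete, here is a small sketch (assuming a recent pandas; the tab-separated text is invented for the example) that reads the same data both ways and compares the results:

```python
import io

import pandas as pd

# Hypothetical tab-separated data.
tsv_text = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"

# read_table uses a tab as its default separator.
from_table = pd.read_table(io.StringIO(tsv_text))

# read_csv gives the same result once sep is set to a tab explicitly.
from_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

same = from_table.equals(from_csv)
print(same)
```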

to_csv()

Signature: data.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True,
    index_label=None, mode='w', encoding=None, compression=None,
    quoting=None, quotechar='"', line_terminator='\n', chunksize=None,
    tupleize_cols=False, date_format=None, doublequote=True,
    escapechar=None, decimal='.', **kwds)

Docstring: Write DataFrame to a comma-separated values (csv) file

Parameters:

path_or_buf : string or file handle, default None
    File path or object, if None is provided the result is returned as a string.

sep : character, default ','
    Field delimiter for the output file.

na_rep : string, default ''
    Missing data representation

float_format : string, default None
    Format string for floating point numbers

columns : sequence, optional
    Columns to write

header : boolean or list of string, default True
    Write out column names. If a list of string is given it is assumed to be aliases for the column names

index : boolean, default True
    Write row names (index)

index_label : string or sequence, or False, default None
    Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R

nanRep : None
    deprecated, use na_rep

mode : str
    Python write mode, default 'w'

encoding : string, optional
    A string representing the encoding to use in the output file, defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.

compression : string, optional
    a string representing the compression to use in the output file, allowed values are 'gzip', 'bz2', only used when the first argument is a filename

line_terminator : string, default '\n'
    The newline character or character sequence to use in the output file

quoting : optional constant from csv module
    defaults to csv.QUOTE_MINIMAL

quotechar : string (length 1), default '"'
    character used to quote fields

doublequote : boolean, default True
    Control quoting of quotechar inside a field

escapechar : string (length 1), default None
    character used to escape sep and quotechar when appropriate

chunksize : int or None
    rows to write at a time

tupleize_cols : boolean, default False
    write multi_index columns as a list of tuples (if True) or new (expanded format) (if False)

date_format : string, default None
    Format string for datetime objects

decimal : string, default '.'
    Character recognized as decimal separator. E.g. use ',' for European data
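A few of these parameters are easier to see in action. The sketch below (assuming a recent pandas; the frame contents are made up) leaves path_or_buf at None so that to_csv returns a string, and sets sep and na_rep explicitly:

```python
import pandas as pd

# Hypothetical frame with one missing number and one missing string.
frame = pd.DataFrame(
    {"a": [1, 5], "b": [2.5, float("nan")], "msg": ["hello", None]}
)

# With no path_or_buf, the CSV text is returned instead of written to disk.
text = frame.to_csv(sep="|", na_rep="NULL", index=False)

# splitlines() keeps the comparison independent of the platform newline.
lines = text.splitlines()
print(lines)
```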

Notes on the Book

P164 header=None

The column labels differ from those shown in the book:

pd.read_csv('ex2.csv',header=None)
Out[14]:
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
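As a small sketch of the two ways to handle a file without a header row (assuming a recent pandas; the text below mirrors the three rows shown above, and the column names are illustrative), header=None assigns integer labels while names supplies explicit ones:

```python
import io

import pandas as pd

# Three data rows and no header line, like the output above.
csv_text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None: pandas labels the columns 0, 1, 2, ...
default_labels = pd.read_csv(io.StringIO(csv_text), header=None)

# names=...: explicit labels; the first line is then treated as data.
named = pd.read_csv(
    io.StringIO(csv_text), names=["a", "b", "c", "d", "message"]
)

print(list(default_labels.columns))
print(list(named.columns))
```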

P166 Reading a txt file

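This no longer seems to need the more involved approach used in the book, since the readers have gained functionality since then, but the result format differs slightly. Note that in the first two outputs below each line is read as a single string column; only sep='\s+' splits the fields: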

list(open('ex3.txt'))
Out[21]:
['            A         B         C\n',
'aaa -0.264438 -1.026059 -0.619500\n',
'bbb  0.927272  0.302904 -0.032399\n',
'ccc -0.264273 -0.386314 -0.217601\n',
'ddd -0.871858 -0.348382  1.100491\n']

result = pd.read_csv('ex3.txt')

result
Out[23]:
               A         B         C
0  aaa -0.264438 -1.026059 -0.619500
1  bbb  0.927272  0.302904 -0.032399
2  ccc -0.264273 -0.386314 -0.217601
3  ddd -0.871858 -0.348382  1.100491

result = pd.read_table('ex3.txt')

result
Out[25]:
               A         B         C
0  aaa -0.264438 -1.026059 -0.619500
1  bbb  0.927272  0.302904 -0.032399
2  ccc -0.264273 -0.386314 -0.217601
3  ddd -0.871858 -0.348382  1.100491

result = pd.read_table('ex3.txt',sep='\s+')

result
Out[27]:
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
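The difference between the outputs above is easier to check on shapes than on the printed tables. This sketch (assuming a recent pandas; the text reproduces the ex3.txt lines listed above) reads the data once with the default comma separator and once with a whitespace regular expression:

```python
import io

import pandas as pd

# Same content as the ex3.txt lines shown above.
text = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
    "ccc -0.264273 -0.386314 -0.217601\n"
    "ddd -0.871858 -0.348382  1.100491\n"
)

# Default sep=',': there are no commas, so each line becomes one string cell.
single = pd.read_csv(io.StringIO(text))

# A whitespace regex splits the fields. The header has one field fewer than
# the data rows, so the first column is used as the index.
split = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(single.shape)
print(split.shape)
```

A raw string (r"\s+") is used for the pattern so that the backslash is not interpreted as a Python escape sequence.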

P166 skiprows

Without skiprows, the comment lines are parsed as data, and each uninterrupted run of comment text ends up in a single cell:

pd.read_csv('ex4.csv')
Out[29]:
# hey!
a                                                  b        c   d    message
# just wanted to make things more difficult for... NaN      NaN NaN      NaN
# who reads CSV files with computers                anyway? NaN NaN      NaN
1                                                  2        3   4      hello
5                                                  6        7   8      world
9                                                  10       11  12       foo

pd.read_csv('ex4.csv',skiprows=[0,2,3])
Out[30]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
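The comment parameter described in the docstring above offers an alternative to listing row numbers. The sketch below (assuming a recent pandas; the comment lines are placeholders rather than the exact text of ex4.csv) checks that both approaches give the same frame for a file laid out like this one:

```python
import io

import pandas as pd

# Same layout as ex4.csv: comment lines at positions 0, 2 and 3.
# The comment wording here is a placeholder, not the book's text.
text = (
    "# first comment\n"
    "a,b,c,d,message\n"
    "# second comment\n"
    "# third comment, with a comma\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# Skip the comment lines by their 0-indexed line numbers.
by_rows = pd.read_csv(io.StringIO(text), skiprows=[0, 2, 3])

# Or let the parser ignore every line that starts with '#'.
by_comment = pd.read_csv(io.StringIO(text), comment="#")

print(by_rows.equals(by_comment))
```

The comment approach does not depend on knowing where the comment lines are, but it also cuts off anything after a '#' inside a data line.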

P166 Meaning of na_values

na_values : str or list-like or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.

na_values=['xxx'] means that every element of the DataFrame equal to xxx is marked as NaN:

result = pd.read_csv('ex5.csv')
result
Out[33]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

result = pd.read_csv('ex5.csv',na_values=['5'])
result
Out[40]:
  something    a   b     c   d message
0       one  1.0   2   3.0   4     NaN
1       two  NaN   6   NaN   8   world
2     three  9.0  10  11.0  12     foo

result = pd.read_csv('ex5.csv',na_values=['three'])
result
Out[42]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2       NaN  9  10  11.0  12     foo
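The examples above pass a list, which applies to every column. The docstring also allows a dict for per-column markers; here is a sketch of that form (assuming a recent pandas; the text is modeled on the ex5.csv output above and the chosen sentinel values are arbitrary):

```python
import io

import pandas as pd

# Data modeled on the ex5.csv output shown above.
text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Per-column markers: 'foo' counts as missing only in 'message',
# and 'two' only in 'something'.
sentinels = {"message": ["foo", "NA"], "something": ["two"]}
result = pd.read_csv(io.StringIO(text), na_values=sentinels)

print(result.isna().sum())
```

Because keep_default_na is left at True, the default markers still apply, which is why the empty field in column c is also read as NaN.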

P168 Displaying detailed information

"""
Typing result and pressing Enter prints the full contents of result
"""
result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
one      10000 non-null float64
two      10000 non-null float64
three    10000 non-null float64
four     10000 non-null float64
key      10000 non-null object
dtypes: float64(4), object(1)
memory usage: 390.7+ KB
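For a quick look at a large frame without printing all of it, the summary from info() can be combined with head(). A minimal sketch, assuming a recent pandas and NumPy and using a synthetic frame shaped like the one above (the values are placeholders):

```python
import io

import numpy as np
import pandas as pd

# Synthetic 10000-row frame with the same column names as above.
frame = pd.DataFrame(
    np.zeros((10000, 4)), columns=["one", "two", "three", "four"]
)
frame["key"] = "L"

# info() writes its summary to stdout by default; buf captures it instead.
buffer = io.StringIO()
frame.info(buf=buffer)
summary = buffer.getvalue()

# head() returns only the first rows, which keeps the display short.
preview = frame.head(3)

print(summary)
print(preview)
```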

P170 to_csv()

The parameter has been renamed from cols to columns:

data.to_csv(sys.stdout,index=False,cols=list('abc'))
something,a,b,c,d,message
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

data.to_csv(sys.stdout,index=False,columns=list('abc'))
a,b,c
1,2,3.0
5,6,
9,10,11.0
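The first call above shows that the old cols keyword was silently ignored, so every column was written. A short sketch of the columns form that can be checked without a file (assuming a recent pandas; the frame is a cut-down stand-in for data):

```python
import pandas as pd

# Cut-down stand-in for the frame used above.
data = pd.DataFrame(
    {"something": ["one", "two"], "a": [1, 5], "b": [2, 6], "c": [3.0, None]}
)

# columns selects and orders the fields that are written.
subset = data.to_csv(index=False, columns=["a", "b", "c"])
lines = subset.splitlines()

print(lines)
```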

P172 diaect

The book has a typo here; the keyword should be dialect:

reader = csv.reader(f,diaect=my_dialect)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-97-fff676b470d2> in <module>()
----> 1 reader = csv.reader(f,diaect=my_dialect)

TypeError: 'diaect' is an invalid keyword argument for this function

P172 Error when writing: the class definition does not seem to inherit a usable quoting value

with open('mydata.csv','w') as f:
    writer = csv.writer(f,dialect=my_dialect)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))
    writer.writerow(('1','2','3'))
    writer.writerow(('1','2','3'))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-106-1f82c8cdfdb0> in <module>()
1 with open('mydata.csv','w') as f:
----> 2     writer = csv.writer(f,dialect=my_dialect)
3     writer.writerow(('one','two','three'))
4     writer.writerow(('1','2','3'))
5     writer.writerow(('1','2','3'))

TypeError: "quoting" must be an integer

"""
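Redefine quoting in my_dialect. It must be an integer; I am not sure what it means, so it is set to 0 for now (0 is QUOTE_MINIMAL according to the quoting description in the read_csv docstring above).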
"""
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = 0

with open('mydata.csv','w') as f:
    writer = csv.writer(f,dialect=my_dialect)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))
    writer.writerow(('1','2','3'))
    writer.writerow(('1','2','3'))

!cat mydata.csv
one;two;three
1;2;3
1;2;3
1;2;3
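The same dialect can be written with the named constant instead of a bare 0, which makes the intent clearer. This is a sketch on Python 3 using only the standard library; the class name SemicolonDialect and the sample rows are invented for the example, and an in-memory buffer replaces the file:

```python
import csv
import io


class SemicolonDialect(csv.Dialect):
    """Semicolon-delimited dialect with an explicit quoting constant."""

    delimiter = ";"
    quotechar = '"'
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL  # integer constant; quote only when needed


# Write to an in-memory buffer instead of mydata.csv.
buffer = io.StringIO()
writer = csv.writer(buffer, dialect=SemicolonDialect)
writer.writerow(("one", "two", "three"))
writer.writerow(("1", "2;x", "3"))  # this field contains the delimiter
written = buffer.getvalue()

# Reading back with the same dialect recovers the original fields.
rows = list(csv.reader(io.StringIO(written), dialect=SemicolonDialect))

print(written)
print(rows)
```

With QUOTE_MINIMAL, only the field containing the delimiter is wrapped in quotes; the other fields are written as they are.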

P173 json.loads() and json.dumps()

After the round trip, the result still differs slightly from the original obj:

obj = """
{"name":"Wes",
"place_lived":["United States","Spain","Germany"],
"pet":null,
"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},
{"naem":"Katie","age":33,"pet":"Cisco"}]
}
"""

import json

obj
Out[117]: '\n{"name":"Wes",\n"place_lived":["United States","Spain","Germany"],\n"pet":null,\n"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},\n{"naem":"Katie","age":33,"pet":"Cisco"}]\n}\n'

result = json.loads(obj)

result
Out[119]:
{u'name': u'Wes',
u'pet': None,
u'place_lived': [u'United States', u'Spain', u'Germany'],
u'siblings': [{u'age': 25, u'name': u'Scott', u'pet': u'Zuko'},
{u'age': 33, u'naem': u'Katie', u'pet': u'Cisco'}]}

asjson = json..dumps(result)
File "<ipython-input-120-b73195ced089>", line 1
asjson = json..dumps(result)
^
SyntaxError: invalid syntax

asjson = json.dumps(result)

asjson
Out[122]: '{"pet": null, "place_lived": ["United States", "Spain", "Germany"], "name": "Wes", "siblings": [{"pet": "Zuko", "age": 25, "name": "Scott"}, {"pet": "Cisco", "age": 33, "naem": "Katie"}]}'
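The differences are in the text only (whitespace and, in the output above, key order), not in the data. A small sketch on Python 3 with the standard library, using a shortened made-up object, shows that the parsed values survive the round trip and that sort_keys gives a stable text form:

```python
import json

# Shortened, made-up JSON text with irregular whitespace.
obj = """
{"name": "Wes",
 "pet": null,
 "places_lived": ["United States", "Spain"]}
"""

result = json.loads(obj)      # JSON text -> Python objects
asjson = json.dumps(result)   # Python objects -> JSON text

# The text differs from the original, but parsing it again gives equal data.
text_identical = asjson == obj
round_trip_equal = json.loads(asjson) == result

# sort_keys and compact separators produce a stable, comparable text form.
canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))

print(text_identical, round_trip_equal)
print(canonical)
```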
