Scrapy: set FEED_EXPORT_ENCODING to stop Chinese from being written to JSON files as `\uXXXX` escapes
0. The problem
Scraped item:
2017-10-16 18:17:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.huxiu.com/v2_action/article_list> {'author': u'\u5546\u4e1a\u8bc4\u8bba\u7cbe\u9009\xa9', 'cmt': 5, 'fav': 194, 'time': u'4\u5929\u524d', 'title': u'\u96f7\u519b\u8c08\u5c0f\u7c73\u201c\u65b0\u96f6\u552e\u201d\uff1a\u50cfZara\u4e00\u6837\u5f00\u5e97\uff0c\u8981\u505a\u5f97\u6bd4Costco\u66f4\u597d', 'url': u'/article/217755.html'}
Written to the JSON Lines (.jl) file:
{"title": "\u8fd9\u4e00\u5468\uff1a\u8d2b\u7a77\u66b4\u51fb", "url": "/article/217997.html", "author": "\u864e\u55c5", "fav": 8, "time": "2\u5929\u524d", "cmt": 5}
{"title": "\u502a\u840d\u8001\u516c\u7684\u65b0\u620f\u6251\u8857\u4e86\uff0c\u9ec4\u6e24\u6301\u80a1\u7684\u516c\u53f8\u8981\u8d54\u60e8\u4e86", "url": "/article/217977.html", "author": "\u5a31\u4e50\u8d44\u672c\u8bba", "fav": 5, "time": "2\u5929\u524d", "cmt": 3}
Each item is serialized to a str with the default ensure_ascii=True, so every non-ASCII character is escaped as `\uXXXX` when each `{...}` record is written to the file.
The goal is output like the following. (Verify the result with Chrome or Notepad++ at the end; opening the .jl file in Firefox may show garbled Chinese unless you specify the encoding manually.)
{"title": "这一周:贫穷暴击", "url": "/article/217997.html", "author": "虎嗅", "fav": 8, "time": "2天前", "cmt": 5}
{"title": "倪萍老公的新戏扑街了,黄渤持股的公司要赔惨了", "url": "/article/217977.html", "author": "娱乐资本论", "fav": 5, "time": "2天前", "cmt": 3}
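The escaping itself comes from json's default ensure_ascii=True; a minimal Python 3 sketch (the item dict is shortened from the log above):

```python
import json

item = {"author": "虎嗅", "time": "2天前"}

# Default ensure_ascii=True: every non-ASCII char becomes a \uXXXX escape.
print(json.dumps(item))
# ensure_ascii=False keeps the characters themselves.
print(json.dumps(item, ensure_ascii=False))
```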
1. References
Scrapy scrapes Chinese text but saves it to the JSON file as Unicode escapes: how to fix it.
```python
import json
import codecs

class JsonWithEncodingPipeline(object):
    def __init__(self):
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
```
The Scrapy framework returns Chinese results as Unicode escapes: how to convert to UTF-8.
Both references are essentially the pipeline example from the official docs, with ensure_ascii=False specified in addition:
Write items to a JSON file
The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item per line serialized in JSON format:
```python
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"  # additionally pass ensure_ascii=False
        self.file.write(line)
        return item
```
Note
The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.
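For such a pipeline to run at all it must be enabled in the project's settings.py; a sketch, assuming the class lives in a module named myproject.pipelines (that path is made up, substitute your project's own):

```python
# settings.py -- 'myproject.pipelines' is a hypothetical module path;
# use your project's actual module. The number (0-1000) is the run order.
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}
```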
2. A better solution
Scrapy's item export writes Chinese to the JSON file as Unicode escapes: how to output actual Chinese?
The Stack Overflow thread http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence mentions that setting the JSONEncoder's ensure_ascii parameter to False is enough.
And the Scrapy item exporters documentation notes:
The additional constructor arguments are passed to the
BaseItemExporter constructor, and the leftover arguments to the
JSONEncoder constructor, so you can use any JSONEncoder constructor
argument to customize this exporter.
So just pass ensure_ascii=False additionally when instantiating scrapy.contrib.exporter.JsonItemExporter.
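Those leftover constructor kwargs end up in json.JSONEncoder, so the effect can be sketched with the stdlib alone (Python 3 here; the item dict is made up):

```python
import json

# JsonItemExporter forwards leftover constructor kwargs to json.JSONEncoder,
# so ensure_ascii=False reaches the encoder. The same effect, stdlib only:
item = {"author": "虎嗅"}

print(json.JSONEncoder().encode(item))                    # \uXXXX escapes
print(json.JSONEncoder(ensure_ascii=False).encode(item))  # readable Chinese
```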
3. Given the answers above, plus the official docs and source code, the direct fixes:
1. Add FEED_EXPORT_ENCODING = 'utf-8' to the project's settings.py, or
2. pass the setting on the command line: G:\pydata\pycode\scrapy\huxiu_com>scrapy crawl -o new.jl -s FEED_EXPORT_ENCODING='utf-8' huxiu
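A quick way to confirm the fix worked, sketched in plain Python 3 (the file name new.jl is taken from the command above; here the exporter's write is simulated directly):

```python
import json

# Simulate what the feed exporter writes once FEED_EXPORT_ENCODING = 'utf-8'
# is in effect: ensure_ascii=False, UTF-8 bytes on disk.
item = {"title": "贫穷暴击", "author": "虎嗅"}
with open('new.jl', 'w', encoding='utf-8') as f:
    f.write(json.dumps(item, ensure_ascii=False) + '\n')

# Read it back: the raw line should contain readable Chinese, no \uXXXX.
with open('new.jl', encoding='utf-8') as f:
    raw = f.readline()
assert '\\u' not in raw
print(raw)
```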
https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-encoding
FEED_EXPORT_ENCODING
Default: None
The encoding to be used for the feed.
If unset or set to None (default) it uses UTF-8 for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.
Use utf-8 if you want UTF-8 for JSON too.
```
In [615]: json.dump?
Signature: json.dump(obj, fp, skipkeys=False, ensure_ascii=True, check_circular=True,
                     allow_nan=True, cls=None, indent=None, separators=None,
                     encoding='utf-8', default=None, sort_keys=False, **kw)
Docstring:
Serialize ``obj`` as a JSON formatted stream to ``fp`` (a ``.write()``-supporting
file-like object).

If ``ensure_ascii`` is true (the default), all non-ASCII characters in the output
are escaped with ``\uXXXX`` sequences, and the result is a ``str`` instance
consisting of ASCII characters only. If ``ensure_ascii`` is ``False``, some chunks
written to ``fp`` may be ``unicode`` instances. This usually happens because the
input contains unicode strings or the ``encoding`` parameter is used. Unless
``fp.write()`` explicitly understands ``unicode`` (as in ``codecs.getwriter``) this
is likely to cause an error.
```
Excerpts (abridged) from C:\Program Files\Anaconda2\Lib\site-packages\scrapy\exporters.py:

```python
class JsonLinesItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        kwargs.setdefault('ensure_ascii', not self.encoding)

class JsonItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        kwargs.setdefault('ensure_ascii', not self.encoding)

class XmlItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        if not self.encoding:
            self.encoding = 'utf-8'
```
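The setdefault line above is the whole mechanism: once a feed encoding is configured, ensure_ascii defaults to False. A minimal mirror of that logic (the helper name is hypothetical):

```python
# Minimal mirror of the exporters' logic: ensure_ascii defaults to True
# only when no feed encoding is configured; an explicit kwarg still wins.
def default_ensure_ascii(encoding, **kwargs):
    kwargs.setdefault('ensure_ascii', not encoding)
    return kwargs['ensure_ascii']

print(default_ensure_ascii(None))     # True  -> \uXXXX escapes in output
print(default_ensure_ascii('utf-8'))  # False -> real UTF-8 output
```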
- Scrapy scrapes Chinese but saves it to the JSON file as Unicode escapes: how to fix it
- Garbled text when CFile writes Chinese to a txt file under the Unicode character set and the file is opened in Notepad
- A complete tutorial on writing your own Chinese word-segmentation parser, with discussion of the problems encountered (complete C# code plus dll/txt file downloads)
- Trimming NK by changing the font-library components and related settings (especially for Simplified Chinese systems)
- java freemarker: garbled Chinese when exporting Word files from ftl templates, and fixing the XML error shown when opening the Word file
- python multi-language support: reading an Excel file and writing JSON, fixing Unicode in the JSON output
- One fix for MFC CStdioFile WriteString failing to write Chinese to a file in a UNICODE project
- Saving scraped data to a JSON file where the Chinese comes out as Unicode: a solution
- Fixing CStdioFile WriteString being unable to write Chinese under the UNICODE character set, and appending to an existing file
- Fixing the IE file-download prompt for springmvc JSON responses and garbled Chinese in the JSON data
- Garbled output when CFile writes Unicode-encoded files: cause and fix
- Fix for garbled text when nodejs reads a local Chinese JSON file
- file_put_contents file writes: fixing garbled Chinese with JSON and AJAX
- Fixing the missing encoding (unicode) when scrapy xpath.extract() results are converted to json and .csv files under python 3.6
- Fixing garbled output when an nvelocity template file contains Chinese