Experimenting with Python: scraping a novel from Zhulang (逐浪)

2015-11-18 10:36 609 views

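I just wanted to try scraping web pages with Python. I started with Mihua (米花), but at some point that site stopped loading for reasons I never figured out, so I switched to Zhulang (逐浪).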

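# -*- coding: utf-8 -*-
# Python 2 script: it uses urllib2, unicode() and the print statement.
import io
import os
import urllib2

from bs4 import BeautifulSoup

IMAGE_DIR = '/home/cloud/temp/'  # output directory

if not os.path.exists(IMAGE_DIR):
    os.mkdir(IMAGE_DIR)


def request(url):
    """Fetch a page and return its raw HTML.

    The original post calls request() without showing it; this minimal
    version is an assumption about what it did.
    """
    return urllib2.urlopen(url).read()


def get_book_without_db(url):
    """Scrape and write to the file as we go, without storing anything in a database."""
    soup = BeautifulSoup(request(url))
    title = soup.find_all("title")[0].string.split('_')[0]  # book title
    book_path = os.path.join(IMAGE_DIR, title)
    # io.open with an explicit encoding accepts unicode text directly, so the
    # reload(sys) / sys.setdefaultencoding("utf-8") hack is not needed.
    with io.open(book_path, 'a', encoding='utf-8') as book:
        # As in the original, the chapter lists are taken to start at the
        # third <ul> on the page (index 2), one <ul> per <h2> volume heading.
        for i, volume in enumerate(soup.find_all('h2'), 2):
            book.write(volume.text + u'\n\n\n')
            for chapter in soup.find_all('ul')[i].find_all("li"):
                link = chapter.find('a')
                book.write(link.text + u'\n')
                content_soup = BeautifulSoup(request(link.get('href')))
                # Only the first node of the first <p> is kept, as in the original.
                content = content_soup.find_all("p")[0].contents[0]
                book.write(unicode(content) + u'\n\n')
    print 'Book path: ', book_path


get_book_without_db('testurl')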

Here, testurl is the URL of the novel's table-of-contents page.
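The script passes each chapter's href straight to request(). That only works if the table-of-contents page uses absolute links. In case the links are relative (an assumption, since the page markup is not shown here), a small sketch of resolving them against the table-of-contents URL follows. The URL and the helper name absolute_chapter_url are made up for illustration, and the import fallback is there so the snippet is expected to work on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def absolute_chapter_url(toc_url, href):
    """Resolve a chapter link found on the table-of-contents page.

    Absolute hrefs come back unchanged; relative ones are resolved
    against the URL of the table-of-contents page.
    """
    return urljoin(toc_url, href)


# Hypothetical table-of-contents URL, for illustration only.
toc_url = "http://book.example.com/novel/12345/index.html"

print(absolute_chapter_url(toc_url, "chapter_1.html"))
print(absolute_chapter_url(toc_url, "/novel/12345/chapter_2.html"))
print(absolute_chapter_url(toc_url, "http://book.example.com/x.html"))
```

In the script above this would mean calling request(absolute_chapter_url(url, link.get('href'))) instead of passing the href directly.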

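I am self-taught, and some of the content-extraction code was only written after inspecting the objects in memory while debugging, so it may not follow best practice.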

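Also, the chapter text I scrape comes out as one long string with no line breaks. I could not find an answer on Baidu. If anyone knows how to make the chapter text wrap automatically, please let me know.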
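One possible answer to the wrapping question, as a rough sketch rather than a tested fix for this site: the standard-library textwrap module can hard-wrap a long string at a fixed width. The helper name wrap_chapter, the width of 20 and the sample text are all made up for illustration. The snippet avoids version-specific syntax, so it is expected to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import textwrap


def wrap_chapter(text, width=40):
    """Hard-wrap each paragraph of `text` to at most `width` characters per line."""
    wrapped = []
    for paragraph in text.splitlines():
        paragraph = paragraph.strip()
        if paragraph:
            # Chinese prose has no spaces, so textwrap sees each paragraph as
            # one long word and cuts it every `width` characters.
            wrapped.append(textwrap.fill(paragraph, width=width))
    return u"\n".join(wrapped)


# Made-up sample text standing in for a scraped chapter.
sample = u"这是一段没有换行的示例文字，用来演示按固定宽度折行。" * 3
print(wrap_chapter(sample, width=20))
```

Note that textwrap counts characters, not display columns, so a width of 20 means 20 Chinese characters per line. If the missing breaks come from the page separating paragraphs with <br> tags inside a single <p>, then taking only contents[0] would drop everything after the first tag; in that case calling get_text('\n') on the <p> tag, which joins its text nodes with newlines, may be the simpler fix.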

Note: I later found another post that also scrapes Zhulang: http://www.oschina.net/code/snippet_1788589_48365

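What a coincidence, since I had simply picked a site at random from Youshu (优书网). That post does not seem to handle line wrapping either.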

Tags: python, web scraping
