Data Scraping and Analysis (Python + MongoDB)
2017-10-24 18:18
Modules used: requests, lxml, pymongo, time, BeautifulSoup
First, fetch the category URLs for all products:
def step():
    try:
        headers = {
            # ... (request headers elided in the original)
        }
        r = requests.get(url, headers=headers, timeout=30)  # headers must be passed as a keyword argument
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        tags = soup.find_all(...)  # selector (a regex in the original) elided
        for i in tags:
            links = i.find_all('a')
            for j in links:
                step1url = url + j['href']  # renamed so the base url is not shadowed by the find_all result
                print(step1url)
                step2(step1url)
    except Exception as e:
        print(e)
While walking the product categories we need to determine whether the address we visit is a product page or yet another level of category pages (so we check whether the page contains the marker used in the if test):
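A minimal sketch of that page-type check, using only the standard library's html.parser in place of BeautifulSoup; divTbl is the marker id the code below looks for:

```python
from html.parser import HTMLParser

class MarkerFinder(HTMLParser):
    """Detects whether the page contains <div id="divTbl">,
    the marker used to tell category pages from product pages."""
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and dict(attrs).get('id') == 'divTbl':
            self.found = True

def is_category_page(html):
    finder = MarkerFinder()
    finder.feed(html)
    return finder.found

print(is_category_page('<div id="divTbl"></div>'))  # True: category page
print(is_category_page('<div id="other"></div>'))   # False: product page
```

The real script makes the same decision with `soup.find('div', id='divTbl')`; this sketch only illustrates the branching logic without requiring bs4.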
def step2(step1url):
    try:
        headers = {
            # ... (request headers elided in the original)
        }
        r = requests.get(step1url, headers=headers, timeout=30)
        html = r.content
        soup = BeautifulSoup(html, "lxml")
        a = soup.find('div', id='divTbl')
        if a:
            tds = soup.find_all('td', class_='S-ITabs')
            for i in tds:
                classifyurl = i.find_all('a')
                for j in classifyurl:
                    step2url = step1url + j['href']  # join against the page URL, not the tag list
                    # print(step2url)
                    step3(step2url)
        else:
            postdata(step1url)
    except Exception as e:
        print(e)
When the if test is true we collect the next level of category URLs (back to step one); otherwise we call the postdata function to scrape the product URLs from the page.
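One caveat with the snippets above: they build absolute URLs by string concatenation (`url + j['href']`), which breaks when the href is root-relative or already absolute. The standard library's urljoin handles both cases; the base URL here is a hypothetical example:

```python
from urllib.parse import urljoin

base = 'https://example.com/catalog/'  # hypothetical base URL for illustration

# A root-relative link replaces the path; a page-relative link extends it:
print(urljoin(base, '/products/123'))  # https://example.com/products/123
print(urljoin(base, 'page2.html'))     # https://example.com/catalog/page2.html
```

Swapping the concatenation for `urljoin(step1url, j['href'])` would make the crawler robust to either link style.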
def producturl(url):
    try:
        p1url = doc.xpath(...)  # XPath expression elided in the original
        for i in range(1, len(p1url) + 1):
            p2url = doc.xpath(...)  # XPath expression elided in the original
            if len(p2url) > 0:
                producturl = url + p2url[0].get('href')
                # untangled from the garbled chained call in the original:
                # insert the URL if it is new, otherwise update the record
                count = db.find({'url': producturl}).count()
                if count <= 0:
                    db.insert({'sn': sn, 'url': producturl})
                else:
                    db.update({'sn': sn}, {'$set': dt})
    except Exception as e:
        print(e)
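The database step at the end of producturl amounts to "insert the URL if it is new, otherwise update the existing record". A minimal sketch of that dedup logic, using a plain dict as a stand-in for the MongoDB collection (all names here are illustrative):

```python
# Hypothetical in-memory stand-in for the MongoDB collection, keyed by url,
# to illustrate the insert-or-update intent of the db calls.
collection = {}

def save_product(sn, url, dt):
    """Insert a new record for url, or update the existing one with dt."""
    if url not in collection:
        collection[url] = {'sn': sn, 'url': url}
    else:
        collection[url].update(dt)

save_product('A001', 'https://example.com/p/1', {'price': 9.9})
save_product('A001', 'https://example.com/p/1', {'price': 8.5})
print(collection['https://example.com/p/1']['price'])  # 8.5
```

With pymongo this pattern collapses into one call: `db.update_one({'url': producturl}, {'$set': dt}, upsert=True)` inserts when no document matches and updates otherwise.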