Author: qapo | Source: Internet | 2023-05-18 11:33
Case1 Scraping popular commented articles from Jianshu
- Case description
Use third-party libraries and a multiprocess crawler to scrape the popular commented articles from Jianshu's "首页投稿" (front-page submissions) section and store the data in MongoDB.
import requests
from lxml import etree
import pymongo
from multiprocessing import Pool

# Connect to local MongoDB and create the database and collection
client = pymongo.MongoClient('localhost', 27017)
mydb = client['mydb']
jianshu_shouye = mydb['jianshu_shouye']

def get_jianshu_info(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//ul[@class="note-list"]/li')
    for info in infos:
        try:
            author = info.xpath('div/div[1]/div/a/text()')[0]
            time = info.xpath('div/div[1]/div/span/@data-shared-at')[0]
            title = info.xpath('div/a/text()')[0]
            content = info.xpath('div/p/text()')[0].strip()
            view = info.xpath('div/div[2]/a[1]/text()')[1].strip()
            comment = info.xpath('div/div[2]/a[2]/text()')[0].strip()
            like = info.xpath('div/div[2]/span[1]/text()')[0].strip()
            rewards = info.xpath('div/div[2]/span[2]/text()')
            if len(rewards) == 0:
                reward = '无'  # some articles carry no reward info
            else:
                reward = rewards[0].strip()  # fix: original indexed the undefined name "reward"
            data = {
                'author': author,
                'time': time,
                'title': title,
                'content': content,
                'view': view,
                'comment': comment,
                'like': like,
                'reward': reward,
            }
            jianshu_shouye.insert_one(data)
        except IndexError:
            # skip list items whose layout does not match the expected structure
            pass

if __name__ == '__main__':
    urls = ['https://www.jianshu.com/c/bDHhpK?order_by=commented_at&page={}'.format(str(i))
            for i in range(1, 10001)]
    pool = Pool(processes=4)
    pool.map(get_jianshu_info, urls)
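The per-article extraction logic can be exercised offline on a tiny HTML snippet, which also demonstrates the fallback for the optional reward field. The markup below is an illustrative assumption shaped to match the XPath expressions, not Jianshu's real page structure:

```python
from lxml import etree

# Hypothetical markup mimicking one <li> of the note list; it exists only
# to exercise the same XPath expressions as the crawler.
html = '''
<ul class="note-list">
  <li><div>
    <div><div><a>alice</a><span data-shared-at="2023-05-18"></span></div></div>
    <a>Hello Jianshu</a>
    <p> a short abstract </p>
    <div><a>icon</a><a>12</a><span> 3 </span></div>
  </div></li>
</ul>
'''

selector = etree.HTML(html)
for info in selector.xpath('//ul[@class="note-list"]/li'):
    author = info.xpath('div/div[1]/div/a/text()')[0]
    time = info.xpath('div/div[1]/div/span/@data-shared-at')[0]
    title = info.xpath('div/a/text()')[0]
    content = info.xpath('div/p/text()')[0].strip()
    rewards = info.xpath('div/div[2]/span[2]/text()')  # may be an empty list
    reward = rewards[0].strip() if rewards else '无'
    print(author, time, title, content, reward)
```

Because this sample item has no second `<span>` in its stats row, `rewards` comes back empty and the fallback value `'无'` is used, which is exactly the branch the crawler guards.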
- Code analysis
1. The first four lines import the libraries: pymongo handles the MongoDB operations and multiprocessing drives the multiprocess crawler;
2. The next three lines create the MongoDB database and collection;
3. get_jianshu_info() extracts each article's fields; because some articles have reward information and others do not, the rewards list must be checked before indexing into it;
4. The main block builds 10,000 URLs and crawls them with a pool of four worker processes.
Case2 Scraping second-hand marketplace listings from Zhuanzhuan