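Preface: Scrapy only sees the static data returned by a request, so content that the page loads dynamically is missing from the response. How can this be solved?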
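1. First step: fetching the news
Parse the static pages
Capture the network traffic (for example in the browser's developer tools) to check whether the data is static. If it is already present in the raw response, parse it directly from that response.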
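The same check can be done outside the browser by comparing the raw HTML with a piece of text that is visible on the rendered page. The sketch below is only an illustration: `is_static_content` and the sample HTML strings are hypothetical and are not part of the original project.

```python
def is_static_content(raw_html: str, visible_text: str) -> bool:
    """Return True if text seen in the browser is already in the raw HTML.

    If the text is missing from the raw response, it was most likely
    injected later by JavaScript, i.e. it is dynamic data.
    """
    return visible_text.strip() in raw_html


# Hypothetical raw responses, standing in for what a plain HTTP request returns.
static_page = "<html><body><h1>Top story</h1></body></html>"
dynamic_page = "<html><body><div id='list'></div><script src='app.js'></script></body></html>"

print(is_static_content(static_page, "Top story"))   # True
print(is_static_content(dynamic_page, "Top story"))  # False
```

Pages that fail this check are the ones that need a real browser to render them.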
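2. Use selenium plus a downloader middleware to parse the dynamic data
Note: the chromedriver.exe file must match the installed Chrome version. Check the browser version and download the corresponding driver from the official site.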
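# In the spider class (requires: from selenium import webdriver)
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # Raw string so the backslashes in the Windows path are not read as escapes.
    # Newer Selenium 4 releases replace executable_path with a Service object.
    self.bro = webdriver.Chrome(executable_path=r'D:\PY\chromedriver.exe')

def closed(self, spider):
    # Called when the spider closes: shut the browser down.
    self.bro.quit()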
- In middlewares.py, intercept the response before it reaches the spider:
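# In News163ItemDownloaderMiddleware (requires: from scrapy.http import HtmlResponse)
def process_response(self, request, response, spider):
    bro = spider.bro
    if request.url in spider.modules_url:
        # Dynamic page: load it in the selenium browser and return the rendered HTML.
        bro.get(request.url)
        page_text = bro.page_source
        n_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
        return n_response
    else:
        # Static page: pass the original response through unchanged.
        return response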
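The interception logic can be tried without Scrapy or a browser. The sketch below is a stand-in under stated assumptions: `FakeBrowser`, `FakeResponse`, `intercept`, and the example.com URLs are hypothetical and only mimic the shape of `process_response`; the real middleware uses `webdriver.Chrome` and `HtmlResponse`.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response (url and body only)."""
    url: str
    body: str


class FakeBrowser:
    """Hypothetical stand-in for the selenium driver."""

    def __init__(self):
        self.page_source = ""

    def get(self, url: str) -> None:
        # A real driver would load the page and run its JavaScript here.
        self.page_source = f"<html>rendered {url}</html>"


def intercept(response: FakeResponse, browser: FakeBrowser, dynamic_urls: list) -> FakeResponse:
    """Mirror of process_response: swap the body only for dynamic URLs."""
    if response.url in dynamic_urls:
        browser.get(response.url)
        return FakeResponse(url=response.url, body=browser.page_source)
    return response


browser = FakeBrowser()
dynamic_urls = ["https://example.com/module"]

rendered = intercept(FakeResponse("https://example.com/module", "<html>empty</html>"), browser, dynamic_urls)
untouched = intercept(FakeResponse("https://example.com/about", "<html>about</html>"), browser, dynamic_urls)

print(rendered.body)   # <html>rendered https://example.com/module</html>
print(untouched.body)  # <html>about</html>
```

Only URLs listed as dynamic pay the cost of a browser load; every other response passes through untouched.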
3. Once the detail data has been fetched, enable a pipeline to persist it
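# pipelines.py
class News163ItemPipeline:
    fp = None

    def open_spider(self, spider):
        # Runs once when the spider starts: open the output file.
        print("Spider started")
        self.fp = open('./news163item.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        content = item['content']
        # Trailing newline keeps consecutive items from running together.
        content = title + "\n" + content + "\n"
        self.fp.write(content)
        return item

    def close_spider(self, spider):
        # Runs once when the spider finishes: close the file.
        print("Spider finished")
        self.fp.close()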
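To see the order in which these hooks run, they can be driven by hand outside Scrapy. This is a minimal sketch under assumptions: `TextFilePipeline`, the temporary file path, and the sample items are hypothetical, and the class simply has the same three hooks as the pipeline above.

```python
import os
import tempfile


class TextFilePipeline:
    """Hypothetical pipeline with the same three hooks as the one above."""

    def __init__(self, path: str):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One line for the title, one for the content.
        self.fp.write(item["title"] + "\n" + item["content"] + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()


path = os.path.join(tempfile.mkdtemp(), "news.txt")
pipeline = TextFilePipeline(path)

# Scrapy calls the hooks in this order; here they are driven by hand.
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Title A", "content": "Body A"}, spider=None)
pipeline.process_item({"title": "Title B", "content": "Body B"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```

Opening the file once in `open_spider` and closing it in `close_spider` avoids reopening it for every item.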
- Both the middleware and the pipeline must be enabled in the settings file:
DOWNLOADER_MIDDLEWARES = {
    'News163Item.middlewares.News163ItemDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'News163Item.pipelines.News163ItemPipeline': 300,
}
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
Final result