Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API. It is implemented in Python on top of Twisted and Qt, which give the service asynchronous processing capability so that it can take advantage of WebKit's concurrency.
These days, to speed up loading, large parts of a page are generated with JavaScript. That is a big problem for a Scrapy crawler: Scrapy has no JS engine, so it only fetches the static page and cannot obtain content generated dynamically by JS.
Solutions:
1. Use third-party middleware that provides a JS rendering service, such as scrapy-splash.
2. Use WebKit, or a library built on top of WebKit.
Here is how to use scrapy-splash. First, install the package:
pip install scrapy-splash
scrapy-splash talks to the Splash HTTP API, so you need a running splash instance. Splash is normally run with docker, so docker has to be installed first; see https://www.jianshu.com/p/c5795d4c7e44 for details.
After installing, start docker. A successful installation leaves a "Docker Quickstart Terminal" icon; double-click it to start the Docker machine.
Note the IP address printed in the terminal (boxed in red in the original screenshot): it is the default IP assigned to the Docker machine and will be used later. At this point the docker tooling is ready.
$ docker pull scrapinghub/splash
This pulls the splash image so the service can be started.
Once docker is installed, the official documentation gives the command for starting the splash container (docker run -d -p 8050:8050 scrapinghub/splash), but be sure to read the splash documentation to understand the startup options.
For example, I need to pass the --max-timeout option when starting, because my JS operations can run for a long time and may exceed the default timeout; to be safe I set it to 3600 (one hour). If your JS operations are short, do not set max-timeout to an arbitrarily large value.
$ docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600
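The server-side --max-timeout only raises the ceiling; each request still uses the default render timeout unless you ask for more. A minimal sketch, assuming the scrapy-splash setup described below, of requesting a longer render:

```python
from scrapy_splash import SplashRequest

# Hedged sketch (inside a spider's start_requests): ask Splash for a longer
# per-request timeout; 'wait' and 'timeout' are standard render arguments,
# and 'timeout' must stay within the server's --max-timeout.
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse,
                            args={'wait': 0.5, 'timeout': 3600})
```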
The first startup is slow because a few things have to be loaded; if you start it repeatedly you may see the message shown below.
When that happens, close the current window, kill the leftover processes in the task manager, and start again.
Reopen Docker Quickstart Terminal and run: docker run -p 8050:8050 scrapinghub/splash
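Before configuring Scrapy, it is worth checking that the service is reachable (assuming the default Docker machine IP from above):

```
curl http://192.168.99.100:8050/_ping
# the _ping endpoint should report an "ok" status if Splash is up
```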
1) Add the splash server address.
2) Add the splash middlewares to DOWNLOADER_MIDDLEWARES.
3) Enable SplashDeduplicateArgsMiddleware.
4) Set a custom DUPEFILTER_CLASS.
5) Set a custom cache storage backend.
Concretely, add the following to settings.py:
```python
# URL of the Splash rendering service
SPLASH_URL = 'http://192.168.99.100:8050'

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Spider middleware that deduplicates splash arguments
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Splash-aware HTTP cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```
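With those settings in place, a spider only needs to issue SplashRequest objects instead of plain Requests. A minimal sketch (the spider name here is a placeholder; the full JD example follows below):

```python
import scrapy
from scrapy_splash import SplashRequest

class DemoSpider(scrapy.Spider):
    name = 'splash_demo'
    start_urls = ['https://item.jd.com/4483094.html']

    def start_requests(self):
        for url in self.start_urls:
            # 'wait' gives the page time to run its JS before Splash renders it
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.body now holds the JS-rendered HTML
        self.logger.info('rendered page size: %d bytes', len(response.body))
```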
The example below crawls the detail information of a phone product on JD.com: https://item.jd.com/4483094.html
As shown in the screenshot below, the boxed information is the content to extract.
The corresponding HTML:
1. JD price
Extraction code: prices = site.xpath('//span[@class="p-price"]/span/text()')
2. Promotions
Extraction code: cxs = site.xpath('//div[@class="J-prom-phone-jjg"]/em/text()')
3. Value-added services
Extraction code: value_addeds = site.xpath('//ul[@class="choose-support lh"]/li/a/span/text()')
4. Weight
Extraction code: quality = site.xpath('//div[@id="summary-weight"]/div[2]/text()')
5. Color options
Extraction code: colors = site.xpath('//div[@id="choose-attr-1"]/div[2]/div/@title')
6. Version options
Extraction code: versions = site.xpath('//div[@id="choose-attr-2"]/div[2]/div/@data-value')
7. Purchase method
Extraction code: buy_style = site.xpath('//div[@id="choose-type"]/div[2]/div/a/text()')
8. Bundles
Extraction code: suits = site.xpath('//div[@id="choose-suits"]/div[2]/div/a/text()')
9. Value-added protection
Extraction code: vaps = site.xpath('//div[@class="yb-item-cat"]/div[1]/span[1]/text()')
10. Baitiao installments
Extraction code: stagings = site.xpath('//div[@class="baitiao-list J-baitiao-list"]/div[@class="item"]/a/strong/text()')
Before crawling, start the splash service first (docker run -p 8050:8050 scrapinghub/splash) by clicking the "Docker Quickstart Terminal" icon.
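Each of the XPaths above can be sanity-checked against the Splash-rendered page before writing the spider, for example with scrapy shell and the render.html endpoint (the IP is the Docker machine address from earlier):

```
scrapy shell 'http://192.168.99.100:8050/render.html?url=https://item.jd.com/4483094.html&wait=0.5'
>>> response.xpath('//span[@class="p-price"]/span/text()').extract()
```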
```python
# -*- coding: utf-8 -*-
import sys

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy_splash import SplashRequest
from splash_test.items import SplashTestItem

# Python 2: force utf-8 as the default encoding so Chinese text prints cleanly
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdout = open('output.txt', 'w')


class SplashSpider(Spider):
    name = 'scrapy_splash'
    start_urls = ['https://item.jd.com/2600240.html']

    # Requests must be wrapped as SplashRequest
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': '0.5'}
                                # , endpoint='render.json'
                                )

    def parse(self, response):
        # This article crawls a single JD product page whose price is generated by Ajax,
        # and saves the rendered HTML. To keep crawling, use CrawlSpider or uncomment
        # the commented lines below.
        site = Selector(response)
        it_list = []
        it = SplashTestItem()
        # JD price
        # prices = site.xpath('//span[@class="price J-p-2600240"]/text()')
        # it['price'] = prices[0].extract()
        # print 'JD price:' + it['price']
        prices = site.xpath('//span[@class="p-price"]/span/text()')
        it['price'] = prices[0].extract() + prices[1].extract()
        print 'JD price:' + it['price']
        # Promotions
        cxs = site.xpath('//div[@class="J-prom-phone-jjg"]/em/text()')
        strcx = ''
        for cx in cxs:
            strcx += str(cx.extract()) + ' '
        it['promotion'] = strcx
        print 'Promotions: %s ' % strcx
        # Value-added services
        value_addeds = site.xpath('//ul[@class="choose-support lh"]/li/a/span/text()')
        strValueAdd = ''
        for va in value_addeds:
            strValueAdd += str(va.extract()) + ' '
        print 'Value-added services: %s ' % strValueAdd
        it['value_add'] = strValueAdd
        # Weight
        quality = site.xpath('//div[@id="summary-weight"]/div[2]/text()')
        print 'Weight: %s ' % str(quality[0].extract())
        it['quality'] = quality[0].extract()
        # Color options
        colors = site.xpath('//div[@id="choose-attr-1"]/div[2]/div/@title')
        strcolor = ''
        for color in colors:
            strcolor += str(color.extract()) + ' '
        print 'Color options: %s ' % strcolor
        it['color'] = strcolor
        # Version options
        versions = site.xpath('//div[@id="choose-attr-2"]/div[2]/div/@data-value')
        strversion = ''
        for ver in versions:
            strversion += str(ver.extract()) + ' '
        print 'Version options: %s ' % strversion
        it['version'] = strversion
        # Purchase method
        buy_style = site.xpath('//div[@id="choose-type"]/div[2]/div/a/text()')
        print 'Purchase method: %s ' % str(buy_style[0].extract())
        it['buy_style'] = buy_style[0].extract()
        # Bundles
        suits = site.xpath('//div[@id="choose-suits"]/div[2]/div/a/text()')
        strsuit = ''
        for tz in suits:
            strsuit += str(tz.extract()) + ' '
        print 'Bundles: %s ' % strsuit
        it['suit'] = strsuit
        # Value-added protection
        vaps = site.xpath('//div[@class="yb-item-cat"]/div[1]/span[1]/text()')
        strvaps = ''
        for vap in vaps:
            strvaps += str(vap.extract()) + ' '
        print 'Value-added protection: %s ' % strvaps
        it['value_add_protection'] = strvaps
        # Baitiao installments
        stagings = site.xpath('//div[@class="baitiao-list J-baitiao-list"]/div[@class="item"]/a/strong/text()')
        strstaging = ''
        for st in stagings:
            ststr = str(st.extract())
            strstaging += ststr.strip() + ' '
        print 'Baitiao installments: %s ' % strstaging
        it['staging'] = strstaging
        it_list.append(it)
        return it_list
```
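Assuming a standard Scrapy project named splash_test, the spider above can be run in the usual way:

```
scrapy crawl scrapy_splash
```

The printed fields end up in output.txt (because stdout is redirected), and the pipeline below writes each item to spider.txt.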
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class SplashTestItem(scrapy.Item):
    # Unit price
    price = scrapy.Field()
    # description = Field()
    # Promotions
    promotion = scrapy.Field()
    # Value-added services
    value_add = scrapy.Field()
    # Weight
    quality = scrapy.Field()
    # Color options
    color = scrapy.Field()
    # Version options
    version = scrapy.Field()
    # Purchase method
    buy_style = scrapy.Field()
    # Bundles
    suit = scrapy.Field()
    # Value-added protection
    value_add_protection = scrapy.Field()
    # Baitiao installments
    staging = scrapy.Field()
    # post_view_count = scrapy.Field()
    # post_comment_count = scrapy.Field()
    # url = scrapy.Field()
```
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json


class SplashTestPipeline(object):
    def __init__(self):
        # self.file = open('data.json', 'wb')
        self.file = codecs.open('spider.txt', 'w', encoding='utf-8')
        # self.file = codecs.open('spider.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    # Scrapy calls close_spider() on pipelines; the original spider_closed()
    # method name would never have been invoked automatically.
    def close_spider(self, spider):
        self.file.close()
```
```python
# -*- coding: utf-8 -*-

# Scrapy settings for splash_test project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
ITEM_PIPELINES = {
    'splash_test.pipelines.SplashTestPipeline': 300,
}

BOT_NAME = 'splash_test'

SPIDER_MODULES = ['splash_test.spiders']
NEWSPIDER_MODULE = 'splash_test.spiders'

SPLASH_URL = 'http://192.168.99.100:8050'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'splash_test (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'splash_test.middlewares.SplashTestSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'splash_test.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'splash_test.pipelines.SplashTestPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
### 1. Connect to the docker machine with SecureCRT
Download and install SecureCRT. In the connection dialog enter the docker machine's address (default: 192.168.99.100), username: docker, password: tcuser.
### 2. Install splash inside docker
Connect to the docker machine via SecureCRT and run:
# Pull the splash image from docker hub
sudo docker pull scrapinghub/splash
Note that docker hub is hosted outside China, so the download may take a while; if it is unbearably slow, use a proxy or a registry mirror.
# Start the splash service over http, https, and telnet
# Usually only http is used, so exposing just port 8050 is enough
# Splash will listen on 0.0.0.0 at ports 8050 (http), 8051 (https) and 5023 (telnet)
sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
Open 192.168.99.100:8050 in a browser to check that the service has started.
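You can also exercise the HTTP API directly; for example (the target URL here is arbitrary), render.html returns the fully rendered HTML of the page:

```
curl 'http://192.168.99.100:8050/render.html?url=https://www.sina.com.cn&wait=0.5&timeout=10'
```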
Splash supports page filtering with the same rule syntax as Adblock Plus, so you can download the Adblock Plus filter lists and apply them directly. To speed up page loading and rendering, you can also write rules that block content you do not want to download, such as images and video. A common first step is to load the Adblock Plus rules to block ads.
# Map a local directory into the container as splash's filter directory, for
# Adblock-Plus-style ad filtering,
# and point the adblock filter directory at /etc/splash/filters
$ docker run -p 8050:8050 -v <my-filters-dir>:/etc/splash/filters scrapinghub/splash --filters-path=/etc/splash/filters
The screenshot below shows the Sina homepage without the filter applied.
The next screenshot shows the Sina homepage with the filter applied.
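Once a filter file is in the mapped directory, it can be applied per request through the standard splash arguments. A minimal sketch, assuming a filter file named easylist.txt was placed in the filters directory (the file name is an assumption):

```python
from scrapy_splash import SplashRequest

# Hedged sketch (inside a spider): 'filters' names filter files (without the
# .txt extension) found in the mapped /etc/splash/filters directory;
# 'images': 0 additionally skips image downloads to speed up rendering.
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse,
                            args={'wait': 0.5, 'filters': 'easylist', 'images': 0})
```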
Some settings for passing extra arguments along with splash requests:
```python
# -*- coding: utf-8 -*-
import random

import scrapy
from pymongo import MongoClient
from scrapy_splash import SplashRequest

# 'agents' (a list of user-agent strings) and 'cookies' are assumed to be
# defined elsewhere in the original project.


class FlySpider(scrapy.Spider):
    name = "FlySpider"
    house_pc_index_url = 'xxxxx'

    def __init__(self):
        client = MongoClient("mongodb://name:pwd@localhost:27017/myspace")
        db = client.myspace
        self.fly = db["fly"]

    def start_requests(self):
        for x in xrange(0, 1):
            try:
                script = """
                function process_one(splash)
                    splash:runjs("$('#next_title').click()")
                    splash:wait(1)
                    local content = splash:evaljs("$('.scrollbar_content').html()")
                    return content
                end

                function process_mul(splash, totalPageNum)
                    local res = {}
                    for i = 1, totalPageNum, 1 do
                        res[i] = process_one(splash)
                    end
                    return res
                end

                function main(splash)
                    splash.resource_timeout = 1800
                    local tmp = splash:get_cookies()
                    splash:add_cookie('PHPSESSID', splash.args.cookies['PHPSESSID'], "/", "www.feizhiyi.com")
                    splash:add_cookie('FEIZHIYI_LOGGED_USER', splash.args.cookies['FEIZHIYI_LOGGED_USER'], "/", "www.feizhiyi.com")
                    splash:autoload("http://cdn.bootcss.com/jquery/2.2.3/jquery.min.js")
                    assert(splash:go{
                        splash.args.url,
                        http_method=splash.args.http_method,
                        headers=splash.args.headers,
                    })
                    assert(splash:wait(splash.args.wait))
                    return {res=process_mul(splash, 100)}
                end
                """
                agent = random.choice(agents)
                print "------cookie---------"
                headers = {
                    "User-Agent": agent,
                    "Referer": "xxxxxxx",
                }
                splash_args = {
                    'wait': 3,
                    "http_method": "GET",
                    # "images": 0,
                    "timeout": 1800,
                    "render_all": 1,
                    "headers": headers,
                    'lua_source': script,
                    "cookies": cookies,
                    # "proxy": "http://101.200.153.236:8123",
                }
                yield SplashRequest(self.house_pc_index_url, self.parse_result, endpoint='execute',
                                    args=splash_args, dont_filter=True)
                # + "&page=" + str(x + 1)
            except Exception, e:
                print e.__doc__
                print e.message
                pass
```
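With endpoint='execute' and a Lua main() that returns a table, scrapy-splash gives the callback a JSON response whose decoded body is exposed as response.data. A minimal sketch of the parse_result callback referenced above (the field extraction and storage are placeholders):

```python
# Hedged sketch: a method of the FlySpider class above.
def parse_result(self, response):
    # 'res' is the key returned by the Lua main() above; it holds one HTML
    # fragment per simulated page turn.
    for page_html in response.data['res']:
        sel = scrapy.Selector(text=page_html)
        # ... extract the fields you need and write them to MongoDB via self.fly ...
```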
Scroll-to-load pages with scrapy-splash
Splash scripts (in Lua) that scroll the page down to trigger lazy loading:
Method 1
```lua
function main(splash, args)
    splash:set_viewport_size(1028, 10000)
    splash:go(args.url)
    local scroll_to = splash:jsfunc("window.scrollTo")
    scroll_to(0, 2000)
    splash:wait(5)
    return {png=splash:png()}
end
```
Method 2
```lua
function main(splash, args)
    splash:set_viewport_size(1028, 10000)
    splash:go(args.url)
    splash.scroll_position = {0, 2000}
    splash:wait(5)
    return {png=splash:png()}
end
```
Implementing scroll-to-load in the spider:
```python
def start_requests(self):
    script = """
    function main(splash)
        splash:set_viewport_size(1028, 10000)
        splash:go(splash.args.url)
        local scroll_to = splash:jsfunc("window.scrollTo")
        scroll_to(0, 2000)
        splash:wait(15)
        return { html = splash:html() }
    end
    """
    for url in self.start_urls:
        yield Request(url, callback=self.parse_info_index, meta={
            'dont_redirect': True,
            'splash': {
                'args': {'lua_source': script, 'images': 0},
                'endpoint': 'execute',
            }
        })
```
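For pages that keep loading new content as you scroll, the single scrollTo call above can be extended into a loop that scrolls to the current bottom several times. A sketch (the scroll count and wait times are arbitrary assumptions):

```lua
function main(splash, args)
    splash:set_viewport_size(1028, 10000)
    splash:go(args.url)
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_height = splash:jsfunc("function() { return document.body.scrollHeight }")
    -- scroll to the bottom a few times, waiting for new content after each step
    for i = 1, 5 do
        scroll_to(0, get_height())
        splash:wait(2)
    end
    return { html = splash:html() }
end
```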
References:
https://www.cnblogs.com/shaosks/p/6950358.html
https://www.jianshu.com/p/4052926bc12c
https://www.jianshu.com/p/b9a2ea9277ce
https://www.jianshu.com/p/2516138e9e75?open_source=weibo_search
https://www.cnblogs.com/zhonghuasong/p/5976003.html