Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API. It is implemented in Python on top of Twisted and Qt, which give the service asynchronous processing capability so that it can take advantage of WebKit's concurrency.
These days, to speed up loading, large parts of a page are generated with JavaScript. That is a big problem for a Scrapy crawler: Scrapy has no JS engine, so it only fetches the static page and cannot obtain content generated dynamically by JS.
Solutions:
1. Use third-party middleware that provides a JS rendering service, such as scrapy-splash.
2. Use WebKit, or a library built on top of WebKit.
Here is how to use scrapy-splash. First, install the package:
pip install scrapy-splash
scrapy-splash talks to the Splash HTTP API, so you need a running splash instance. Splash is normally run with docker, so docker has to be installed first; see https://www.jianshu.com/p/c5795d4c7e44 for details.
After installing, start docker. A successful installation leaves a "Docker Quickstart Terminal" icon; double-click it to start the Docker machine.
Note the IP address printed in the terminal (boxed in red in the original screenshot): it is the default IP assigned to the Docker machine and will be used later. At this point the docker tooling is ready.
$ docker pull scrapinghub/splash
This pulls the splash image so the service can be started.
Once docker is installed, the official documentation gives the command for starting the splash container (docker run -d -p 8050:8050 scrapinghub/splash), but be sure to read the splash documentation to understand the startup options.
For example, I need to pass the --max-timeout option when starting, because my JS operations can run for a long time and may exceed the default timeout; to be safe I set it to 3600 (one hour). If your JS operations are short, do not set max-timeout to an arbitrarily large value.
$ docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600
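The server-side --max-timeout only raises the ceiling; each request still uses the default render timeout unless you ask for more. A minimal sketch, assuming the scrapy-splash setup described below, of requesting a longer render:

```python
from scrapy_splash import SplashRequest

# Hedged sketch (inside a spider's start_requests): ask Splash for a longer
# per-request timeout; 'wait' and 'timeout' are standard render arguments,
# and 'timeout' must stay within the server's --max-timeout.
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse,
                            args={'wait': 0.5, 'timeout': 3600})
```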
The first startup is slow because a few things have to be loaded; if you start it repeatedly you may see the message shown below.
When that happens, close the current window, kill the leftover processes in the task manager, and start again.
Reopen Docker Quickstart Terminal and run: docker run -p 8050:8050 scrapinghub/splash
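Before configuring Scrapy, it is worth checking that the service is reachable (assuming the default Docker machine IP from above):

```
curl http://192.168.99.100:8050/_ping
# the _ping endpoint should report an "ok" status if Splash is up
```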
1) Add the splash server address.
2) Add the splash middlewares to DOWNLOADER_MIDDLEWARES.
3) Enable SplashDeduplicateArgsMiddleware.
4) Set a custom DUPEFILTER_CLASS.
5) Set a custom cache storage backend.
Concretely, add the following to settings.py:
```python
# URL of the Splash rendering service
SPLASH_URL = 'http://192.168.99.100:8050'

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Spider middleware that deduplicates splash arguments
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Splash-aware HTTP cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```
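With those settings in place, a spider only needs to issue SplashRequest objects instead of plain Requests. A minimal sketch (the spider name here is a placeholder; the full JD example follows below):

```python
import scrapy
from scrapy_splash import SplashRequest

class DemoSpider(scrapy.Spider):
    name = 'splash_demo'
    start_urls = ['https://item.jd.com/4483094.html']

    def start_requests(self):
        for url in self.start_urls:
            # 'wait' gives the page time to run its JS before Splash renders it
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.body now holds the JS-rendered HTML
        self.logger.info('rendered page size: %d bytes', len(response.body))
```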
The example below crawls the detail information of a phone product on JD.com: https://item.jd.com/4483094.html
As shown in the screenshot below, the boxed information is the content to extract.
The corresponding HTML:
1. JD price
Extraction code: prices = site.xpath('//span[@class="p-price"]/span/text()')
2. Promotions
Extraction code: cxs = site.xpath('//div[@class="J-prom-phone-jjg"]/em/text()')
3. Value-added services
Extraction code: value_addeds = site.xpath('//ul[@class="choose-support lh"]/li/a/span/text()')
4. Weight
Extraction code: quality = site.xpath('//div[@id="summary-weight"]/div[2]/text()')
5. Color options
Extraction code: colors = site.xpath('//div[@id="choose-attr-1"]/div[2]/div/@title')
6. Version options
Extraction code: versions = site.xpath('//div[@id="choose-attr-2"]/div[2]/div/@data-value')
7. Purchase method
Extraction code: buy_style = site.xpath('//div[@id="choose-type"]/div[2]/div/a/text()')
8. Bundles
Extraction code: suits = site.xpath('//div[@id="choose-suits"]/div[2]/div/a/text()')
9. Value-added protection
Extraction code: vaps = site.xpath('//div[@class="yb-item-cat"]/div[1]/span[1]/text()')
10. Baitiao installments
Extraction code: stagings = site.xpath('//div[@class="baitiao-list J-baitiao-list"]/div[@class="item"]/a/strong/text()')
Before crawling, start the splash service first (docker run -p 8050:8050 scrapinghub/splash) by clicking the "Docker Quickstart Terminal" icon.
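Each of the XPaths above can be sanity-checked against the Splash-rendered page before writing the spider, for example with scrapy shell and the render.html endpoint (the IP is the Docker machine address from earlier):

```
scrapy shell 'http://192.168.99.100:8050/render.html?url=https://item.jd.com/4483094.html&wait=0.5'
>>> response.xpath('//span[@class="p-price"]/span/text()').extract()
```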
```python
# -*- coding: utf-8 -*-
import sys

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy_splash import SplashRequest
from splash_test.items import SplashTestItem

# Python 2: force utf-8 as the default encoding so Chinese text prints cleanly
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdout = open('output.txt', 'w')


class SplashSpider(Spider):
    name = 'scrapy_splash'
    start_urls = ['https://item.jd.com/2600240.html']

    # Requests must be wrapped as SplashRequest
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': '0.5'}
                                # , endpoint='render.json'
                                )

    def parse(self, response):
        # This article crawls a single JD product page whose price is generated by Ajax,
        # and saves the rendered HTML. To keep crawling, use CrawlSpider or uncomment
        # the commented lines below.
        site = Selector(response)
        it_list = []
        it = SplashTestItem()
        # JD price
        # prices = site.xpath('//span[@class="price J-p-2600240"]/text()')
        # it['price'] = prices[0].extract()
        # print 'JD price:' + it['price']
        prices = site.xpath('//span[@class="p-price"]/span/text()')
        it['price'] = prices[0].extract() + prices[1].extract()
        print 'JD price:' + it['price']
        # Promotions
        cxs = site.xpath('//div[@class="J-prom-phone-jjg"]/em/text()')
        strcx = ''
        for cx in cxs:
            strcx += str(cx.extract()) + ' '
        it['promotion'] = strcx
        print 'Promotions: %s ' % strcx
        # Value-added services
        value_addeds = site.xpath('//ul[@class="choose-support lh"]/li/a/span/text()')
        strValueAdd = ''
        for va in value_addeds:
            strValueAdd += str(va.extract()) + ' '
        print 'Value-added services: %s ' % strValueAdd
        it['value_add'] = strValueAdd
        # Weight
        quality = site.xpath('//div[@id="summary-weight"]/div[2]/text()')
        print 'Weight: %s ' % str(quality[0].extract())
        it['quality'] = quality[0].extract()
        # Color options
        colors = site.xpath('//div[@id="choose-attr-1"]/div[2]/div/@title')
        strcolor = ''
        for color in colors:
            strcolor += str(color.extract()) + ' '
        print 'Color options: %s ' % strcolor
        it['color'] = strcolor
        # Version options
        versions = site.xpath('//div[@id="choose-attr-2"]/div[2]/div/@data-value')
        strversion = ''
        for ver in versions:
            strversion += str(ver.extract()) + ' '
        print 'Version options: %s ' % strversion
        it['version'] = strversion
        # Purchase method
        buy_style = site.xpath('//div[@id="choose-type"]/div[2]/div/a/text()')
        print 'Purchase method: %s ' % str(buy_style[0].extract())
        it['buy_style'] = buy_style[0].extract()
        # Bundles
        suits = site.xpath('//div[@id="choose-suits"]/div[2]/div/a/text()')
        strsuit = ''
        for tz in suits:
            strsuit += str(tz.extract()) + ' '
        print 'Bundles: %s ' % strsuit
        it['suit'] = strsuit
        # Value-added protection
        vaps = site.xpath('//div[@class="yb-item-cat"]/div[1]/span[1]/text()')
        strvaps = ''
        for vap in vaps:
            strvaps += str(vap.extract()) + ' '
        print 'Value-added protection: %s ' % strvaps
        it['value_add_protection'] = strvaps
        # Baitiao installments
        stagings = site.xpath('//div[@class="baitiao-list J-baitiao-list"]/div[@class="item"]/a/strong/text()')
        strstaging = ''
        for st in stagings:
            ststr = str(st.extract())
            strstaging += ststr.strip() + ' '
        print 'Baitiao installments: %s ' % strstaging
        it['staging'] = strstaging
        it_list.append(it)
        return it_list
```
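Assuming a standard Scrapy project named splash_test, the spider above can be run in the usual way:

```
scrapy crawl scrapy_splash
```

The printed fields end up in output.txt (because stdout is redirected), and the pipeline below writes each item to spider.txt.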
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class SplashTestItem(scrapy.Item):
    # Unit price
    price = scrapy.Field()
    # description = Field()
    # Promotions
    promotion = scrapy.Field()
    # Value-added services
    value_add = scrapy.Field()
    # Weight
    quality = scrapy.Field()
    # Color options
    color = scrapy.Field()
    # Version options
    version = scrapy.Field()
    # Purchase method
    buy_style = scrapy.Field()
    # Bundles
    suit = scrapy.Field()
    # Value-added protection
    value_add_protection = scrapy.Field()
    # Baitiao installments
    staging = scrapy.Field()
    # post_view_count = scrapy.Field()
    # post_comment_count = scrapy.Field()
    # url = scrapy.Field()
```
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json


class SplashTestPipeline(object):
    def __init__(self):
        # self.file = open('data.json', 'wb')
        self.file = codecs.open('spider.txt', 'w', encoding='utf-8')
        # self.file = codecs.open('spider.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    # Scrapy calls close_spider() on pipelines; the original spider_closed()
    # method name would never have been invoked automatically.
    def close_spider(self, spider):
        self.file.close()
```
```python
# -*- coding: utf-8 -*-

# Scrapy settings for splash_test project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
ITEM_PIPELINES = {
    'splash_test.pipelines.SplashTestPipeline': 300,
}

BOT_NAME = 'splash_test'

SPIDER_MODULES = ['splash_test.spiders']
NEWSPIDER_MODULE = 'splash_test.spiders'

SPLASH_URL = 'http://192.168.99.100:8050'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'splash_test (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'splash_test.middlewares.SplashTestSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'splash_test.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'splash_test.pipelines.SplashTestPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
### 1. Connect to the docker machine with SecureCRT
Download and install SecureCRT. In the connection dialog enter the docker machine's address (default: 192.168.99.100), username: docker, password: tcuser.
### 2. Install splash inside docker
Connect to the docker machine via SecureCRT and run:
# Pull the splash image from docker hub
sudo docker pull scrapinghub/splash
Note that docker hub is hosted outside China, so the download may take a while; if it is unbearably slow, use a proxy or a registry mirror.
# Start the splash service over http, https, and telnet
# Usually only http is used, so exposing just port 8050 is enough
# Splash will listen on 0.0.0.0 at ports 8050 (http), 8051 (https) and 5023 (telnet)
sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
Open 192.168.99.100:8050 in a browser to check that the service has started.
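You can also exercise the HTTP API directly; for example (the target URL here is arbitrary), render.html returns the fully rendered HTML of the page:

```
curl 'http://192.168.99.100:8050/render.html?url=https://www.sina.com.cn&wait=0.5&timeout=10'
```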
Splash supports page filtering with the same rule syntax as Adblock Plus, so you can download the Adblock Plus filter lists and apply them directly. To speed up page loading and rendering, you can also write rules that block content you do not want to download, such as images and video. A common first step is to load the Adblock Plus rules to block ads.
# Map a local directory into the container as splash's filter directory, for
# Adblock-Plus-style ad filtering,
# and point the adblock filter directory at /etc/splash/filters
$ docker run -p 8050:8050 -v <my-filters-dir>:/etc/splash/filters scrapinghub/splash --filters-path=/etc/splash/filters
The screenshot below shows the Sina homepage without the filter applied.
The next screenshot shows the Sina homepage with the filter applied.
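Once a filter file is in the mapped directory, it can be applied per request through the standard splash arguments. A minimal sketch, assuming a filter file named easylist.txt was placed in the filters directory (the file name is an assumption):

```python
from scrapy_splash import SplashRequest

# Hedged sketch (inside a spider): 'filters' names filter files (without the
# .txt extension) found in the mapped /etc/splash/filters directory;
# 'images': 0 additionally skips image downloads to speed up rendering.
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse,
                            args={'wait': 0.5, 'filters': 'easylist', 'images': 0})
```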
Some settings for passing extra arguments along with splash requests:
```python
# -*- coding: utf-8 -*-
import random

import scrapy
from pymongo import MongoClient
from scrapy_splash import SplashRequest

# 'agents' (a list of user-agent strings) and 'cookies' are assumed to be
# defined elsewhere in the original project.


class FlySpider(scrapy.Spider):
    name = "FlySpider"
    house_pc_index_url = 'xxxxx'

    def __init__(self):
        client = MongoClient("mongodb://name:pwd@localhost:27017/myspace")
        db = client.myspace
        self.fly = db["fly"]

    def start_requests(self):
        for x in xrange(0, 1):
            try:
                script = """
                function process_one(splash)
                    splash:runjs("$('#next_title').click()")
                    splash:wait(1)
                    local content = splash:evaljs("$('.scrollbar_content').html()")
                    return content
                end

                function process_mul(splash, totalPageNum)
                    local res = {}
                    for i = 1, totalPageNum, 1 do
                        res[i] = process_one(splash)
                    end
                    return res
                end

                function main(splash)
                    splash.resource_timeout = 1800
                    local tmp = splash:get_cookies()
                    splash:add_cookie('PHPSESSID', splash.args.cookies['PHPSESSID'], "/", "www.feizhiyi.com")
                    splash:add_cookie('FEIZHIYI_LOGGED_USER', splash.args.cookies['FEIZHIYI_LOGGED_USER'], "/", "www.feizhiyi.com")
                    splash:autoload("http://cdn.bootcss.com/jquery/2.2.3/jquery.min.js")
                    assert(splash:go{
                        splash.args.url,
                        http_method=splash.args.http_method,
                        headers=splash.args.headers,
                    })
                    assert(splash:wait(splash.args.wait))
                    return {res=process_mul(splash, 100)}
                end
                """
                agent = random.choice(agents)
                print "------cookie---------"
                headers = {
                    "User-Agent": agent,
                    "Referer": "xxxxxxx",
                }
                splash_args = {
                    'wait': 3,
                    "http_method": "GET",
                    # "images": 0,
                    "timeout": 1800,
                    "render_all": 1,
                    "headers": headers,
                    'lua_source': script,
                    "cookies": cookies,
                    # "proxy": "http://101.200.153.236:8123",
                }
                yield SplashRequest(self.house_pc_index_url, self.parse_result, endpoint='execute',
                                    args=splash_args, dont_filter=True)
                # + "&page=" + str(x + 1)
            except Exception, e:
                print e.__doc__
                print e.message
                pass
```
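With endpoint='execute' and a Lua main() that returns a table, scrapy-splash gives the callback a JSON response whose decoded body is exposed as response.data. A minimal sketch of the parse_result callback referenced above (the field extraction and storage are placeholders):

```python
# Hedged sketch: a method of the FlySpider class above.
def parse_result(self, response):
    # 'res' is the key returned by the Lua main() above; it holds one HTML
    # fragment per simulated page turn.
    for page_html in response.data['res']:
        sel = scrapy.Selector(text=page_html)
        # ... extract the fields you need and write them to MongoDB via self.fly ...
```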
Scroll-to-load pages with scrapy-splash
Splash scripts (in Lua) that scroll the page down to trigger lazy loading:
Method 1
```lua
function main(splash, args)
    splash:set_viewport_size(1028, 10000)
    splash:go(args.url)
    local scroll_to = splash:jsfunc("window.scrollTo")
    scroll_to(0, 2000)
    splash:wait(5)
    return {png=splash:png()}
end
```
Method 2
```lua
function main(splash, args)
    splash:set_viewport_size(1028, 10000)
    splash:go(args.url)
    splash.scroll_position = {0, 2000}
    splash:wait(5)
    return {png=splash:png()}
end
```
Implementing scroll-to-load in the spider:
```python
def start_requests(self):
    script = """
    function main(splash)
        splash:set_viewport_size(1028, 10000)
        splash:go(splash.args.url)
        local scroll_to = splash:jsfunc("window.scrollTo")
        scroll_to(0, 2000)
        splash:wait(15)
        return { html = splash:html() }
    end
    """
    for url in self.start_urls:
        yield Request(url, callback=self.parse_info_index, meta={
            'dont_redirect': True,
            'splash': {
                'args': {'lua_source': script, 'images': 0},
                'endpoint': 'execute',
            }
        })
```
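For pages that keep loading new content as you scroll, the single scrollTo call above can be extended into a loop that scrolls to the current bottom several times. A sketch (the scroll count and wait times are arbitrary assumptions):

```lua
function main(splash, args)
    splash:set_viewport_size(1028, 10000)
    splash:go(args.url)
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_height = splash:jsfunc("function() { return document.body.scrollHeight }")
    -- scroll to the bottom a few times, waiting for new content after each step
    for i = 1, 5 do
        scroll_to(0, get_height())
        splash:wait(2)
    end
    return { html = splash:html() }
end
```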
References:
https://www.cnblogs.com/shaosks/p/6950358.html
https://www.jianshu.com/p/4052926bc12c
https://www.jianshu.com/p/b9a2ea9277ce
https://www.jianshu.com/p/2516138e9e75?open_source=weibo_search
https://www.cnblogs.com/zhonghuasong/p/5976003.html