Speeding up scraping in my code with multithreading/multiprocessing

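How can I speed up my Scrapy code with multithreading or multiprocessing?
I have attached my code below. I am not familiar with threading in Python and don't know where to start, so I would appreciate any help with this code.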

import scrapy
import logging
domain = 'https://www.spdigital.cl/categories/view/'
categories = [
'334','335','553','607','336','340','339','540','486','489','485','598','347','562','348','349','353','351','352','532','350','477','475','476','474','559','355','356','580','337','357','358','360','374','363','362','361','338','344','593','359','604','478','507','509','508','510','512','600','590','511','459','564','376','375','558','341','377','378','484','554','567','563','379','342','343','370','481','365','556','364','541','555','492','570','579','576','574','575','572','578','577','588','573','596','597','601','595','387','468','536','391','390','589','389','399','394','396','397','398','392','592','401','402','530','560','407','406','408','404','403','405','413','411','414','410','409','412','418','599','603','465','415','487','416','382','419','417','479','515','582','518','514','581','583','517','519','520','420','421','422','423','424','425','521','557','538','428','430','432','434','436','433','435','427','437','429','482','544','552','545','546','550','547','551','549','548','491','535','494','493','472','471','470','534','537','587','586','585','602','569','561','438','446','488','439','496','440','566','445','447','565','448','449','450','451','452','531','453','454','456','455','501','505','506','504','502','498','500','503','369','527','460','529','606','528','591','462','526','525','605','463','464',]
class Productosspider(scrapy.Spider):
    name = 'productos'
    allowed_domains = ['www.spdigital.cl']

    def start_requests(self):
        # One request per category listing page.
        for i in categories:
            yield scrapy.Request(url=domain + i, callback=self.parse, headers={
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/78.0.3904.108 Chrome/78.0.3904.108 Safari/537.36'
            })

    def parse(self, response):
        for product in response.xpath('//div[@class="span8 grid-style-mosaic"]/div/div[@class="span2 product-item-mosaic"]'):
            yield {
                # Both branches of the union start with "." so they stay relative to this product.
                'product_name': product.xpath('.//div[@class="name"]/a/text() | .//div[@class="name"]/a/span/@data-original-title').get(),
                'product_brand': product.xpath('.//div[@class="brand"]/text()').get(),
                'product_url': response.urljoin(product.xpath('.//div[@class="name"]/a/@href').get()),
                'product_original': product.xpath('.//div[@class="cash-price"]/text()').get(),
                'product_discount': product.xpath('.//span[@class="cash-previous-price-value"]/text()').get(),
            }
        # Check the raw href before joining: response.urljoin(None) returns the
        # current page URL, so testing the joined value would always be true.
        next_page = response.xpath('//a[@class="next"]/@href').get()
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse, headers={
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/78.0.3904.108 Chrome/78.0.3904.108 Safari/537.36'
            })



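Scrapy is single-threaded, so it does not support multithreading. It is built on Twisted, however, and therefore issues requests asynchronously. To speed up the crawl, you can raise the number of concurrent requests by changing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN in settings.py; their defaults are 16 and 8 respectively. The section of the Scrapy documentation on concurrent requests is worth reading for more detail.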
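As a rough illustration of the settings change described in the answer, a settings.py fragment might look like the sketch below. The numbers 32 and 16 are assumptions picked only to show a value above each default; they are not tuned for this site, and the right values depend on what the target server tolerates.

```python
# settings.py (sketch; the values are illustrative assumptions, not recommendations)

# Upper bound on requests Scrapy keeps in flight across all domains.
# Scrapy's default is 16.
CONCURRENT_REQUESTS = 32

# Upper bound on requests in flight to any single domain.
# Scrapy's default is 8.
CONCURRENT_REQUESTS_PER_DOMAIN = 16
```

Because every request in the spider above goes to www.spdigital.cl, the per-domain limit is the one that actually caps throughput here; raising CONCURRENT_REQUESTS alone would have little effect. The same keys can also be set for one spider only, through its custom_settings class attribute, instead of project-wide in settings.py.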

