
Python Scrapy Crawler Framework Example (Part 1)

Author: 我爱妈妈的家常菜_712 | Source: Internet | 2023-09-18 17:45


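Earlier posts covered the basics of Scrapy but did not include a worked example, so here is a small one for reference and study.

Note: from here on the Python version is not called out again; Python 3.x is assumed throughout.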

Crawl target

We pick a simple picture site and collect some information about its pictures.

Site URL: http://www.58pic.com/c/

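Creating the project

Run the following command in a terminal:

scrapy startproject AdilCrawler

The command generates a new project directory named AdilCrawler.

[Screenshot: generated project structure and command output]

As the command output suggests, cd into the project directory; from there you can run scrapy genspider example example.com to create a spider file named example for the domain example.com.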

Writing items.py

To start with, we only scrape simple fields such as the picture's author name and its theme.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AdilcrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()  # author of the picture
    theme = scrapy.Field()   # theme of the picture

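Writing the spider file

Enter the AdilCrawler directory and create a basic spider class with this command:

scrapy genspider thousandPic www.58pic.com
# thousandPic is the spider name; www.58pic.com is the domain the spider is allowed to crawl

The command creates a file named thousandPic.py in the spiders folder. Now edit it: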

# -*- coding: utf-8 -*-
import scrapy


# A first try at a spider
class ThousandpicSpider(scrapy.Spider):
    name = 'thousandPic'
    allowed_domains = ['www.58pic.com']
    start_urls = ['http://www.58pic.com/c/']

    def parse(self, response):
        '''
        Inspecting the page elements gives this XPath for the author text:
        /html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()
        The page holds many pictures, told apart by the index i in
        /html/body/div[4]/div[3]/div[i]. Leaving out i makes the
        expression match every picture under that path.
        '''
        # author: the picture's author; theme: the picture's theme
        author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
        theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()

        # Use the spider's log method to print the scraped content to the console.
        self.log(author)
        self.log(theme)

        # Print the scraped content picture by picture (the page showed 20
        # pictures when this was written). zip() stops at the shorter list,
        # so a page with fewer pictures does not raise IndexError.
        for i, (t, a) in enumerate(zip(theme, author), start=1):
            print(i, '****', t, ':', a)
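To see why leaving out the index selects every picture, the behaviour can be tried on a tiny document. The sketch below is only an illustration: it uses Python's standard-library xml.etree.ElementTree (which supports a limited XPath subset) rather than Scrapy's selectors, and the markup is a made-up, much simplified stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one <div> per picture, with the theme in
# the first <p> and the author in the second <p>.
SAMPLE = """
<body>
  <div><a><p><span>Theme A</span></p><p><span>Author A</span></p></a></div>
  <div><a><p><span>Theme B</span></p><p><span>Author B</span></p></a></div>
  <div><a><p><span>Theme C</span></p><p><span>Author C</span></p></a></div>
</body>
"""

root = ET.fromstring(SAMPLE.strip())

# No index on <div>: the path matches the author span in every picture block.
all_authors = [span.text for span in root.findall("./div/a/p[2]/span")]

# With an index on <div>: only the second picture block is matched.
second_author = [span.text for span in root.findall("./div[2]/a/p[2]/span")]

print(all_authors)    # ['Author A', 'Author B', 'Author C']
print(second_author)  # ['Author B']
```

The same idea applies to the longer absolute path used in the spider: an unindexed step matches all sibling elements with that tag.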

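Run the command and check the printed output:

scrapy crawl thousandPic

[Screenshot: console output] In the output, the lines marked DEBUG come from the log calls.

Improving the code

Bring in the item class AdilcrawlerItem: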

# -*- coding: utf-8 -*-
import scrapy

# Either this import or the "from" form below works; which one resolves
# depends on whether the project is opened as a project in PyCharm.
# This form is the recommended one.
import AdilCrawler.items as items
# The "from" form requires AdilCrawler to be opened as a project.
# from AdilCrawler.items import AdilcrawlerItem


class ThousandpicSpider(scrapy.Spider):
    name = 'thousandPic'
    allowed_domains = ['www.58pic.com']
    start_urls = ['http://www.58pic.com/c/']

    def parse(self, response):
        '''
        Inspecting the page elements gives this XPath for the author text:
        /html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()
        The page holds many pictures, told apart by the index i in
        /html/body/div[4]/div[3]/div[i]. Leaving out i makes the
        expression match every picture under that path.
        '''
        item = items.AdilcrawlerItem()

        # author: the picture's author; theme: the picture's theme
        author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
        theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()

        item['author'] = author
        item['theme'] = theme
        return item
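As written, the spider returns a single item whose author and theme fields each hold a list covering the whole page. If one record per picture is preferred, the two lists can be paired up first. A minimal sketch, using plain dicts and made-up sample values instead of real scraped data; pair_records is a hypothetical helper name, not part of Scrapy or of the original project:

```python
def pair_records(authors, themes):
    """Yield one dict per picture by pairing the two page-level lists.

    zip() stops at the shorter list, so a value missing at the end of either
    list drops that picture instead of raising IndexError.
    """
    for author, theme in zip(authors, themes):
        yield {"author": author.strip(), "theme": theme.strip()}


# Made-up values standing in for the lists returned by .extract()
authors = ["alice ", "bob", "carol"]
themes = ["Sunset", " Poster", "Banner"]

records = list(pair_records(authors, themes))
for record in records:
    print(record)
```

Pairing by position assumes both XPath queries return their values in the same order with nothing missing in the middle. Looping over each picture's div and querying relative to it would be more robust, at the cost of slightly longer code.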

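Run the spider again.

[Screenshot: console output showing the returned item]

Saving the results to a file

Run the following command:

scrapy crawl thousandPic -o items.json

This generates the file items.json.

[Screenshot: generated items.json file]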
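The -o option with a .json file name writes the scraped items as a JSON array. The sketch below reads such a file back with the standard library. The sample content is made up and only mimics the shape this spider would produce (a single item whose fields are lists); the temporary file stands in for the real items.json.

```python
import json
import os
import tempfile

# Made-up stand-in for the exported file content.
sample = [{"author": ["alice", "bob"], "theme": ["Sunset", "Poster"]}]

path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Reading the export: the top level is a list of items.
with open(path, encoding="utf-8") as f:
    items = json.load(f)

first = items[0]
pairs = list(zip(first["theme"], first["author"]))
print(pairs)  # [('Sunset', 'alice'), ('Poster', 'bob')]
```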

Improving it again with the ItemLoader helper class

ItemLoader replaces the cluttered extract() and xpath() calls.

The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from AdilCrawler.items import AdilcrawlerItem
# Import the ItemLoader helper class
from scrapy.loader import ItemLoader


# "Optimize": the improved version of the spider
class ThousandpicoptimizeSpider(scrapy.Spider):
    name = 'thousandPicOptimize'
    allowed_domains = ['www.58pic.com']
    start_urls = ['http://www.58pic.com/c/']

    def parse(self, response):
        '''
        Inspecting the page elements gives this XPath for the author text:
        /html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()
        The page holds many pictures, told apart by the index i in
        /html/body/div[4]/div[3]/div[i]. Leaving out i makes the
        expression match every picture under that path.
        '''
        # Use ItemLoader in place of the cluttered-looking extract() and xpath() calls
        i = ItemLoader(item=AdilcrawlerItem(), response=response)
        # author: the picture's author; theme: the picture's theme
        i.add_xpath('author', '/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()')
        i.add_xpath('theme', '/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()')
        return i.load_item()

Writing the pipelines file

The default pipelines.py file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class Adilcrawler1Pipeline(object):
    def process_item(self, item, spider):
        return item

The improved code is as follows:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class AdilcrawlerPipeline(object):
    '''Save item data.'''

    def __init__(self):
        # encoding='utf-8' keeps the non-ASCII text written below readable
        # regardless of the platform's default encoding.
        self.filename = open('thousandPic.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False avoids escaped, garbled-looking characters in the JSON file.
        # Items are stored one dict at a time; the trailing ',\n' separates
        # them and starts a new line.
        text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
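One thing to be aware of: because every record ends with ',\n' and nothing wraps the records in [ ], the resulting thousandPic.json is a sequence of JSON objects rather than one valid JSON document. A common alternative is the JSON Lines layout, with one object per line. The sketch below is an assumption-laden illustration, not the author's code: JsonLinesPipeline and the .jl file name are hypothetical, and the class is driven by hand with made-up items so it runs without Scrapy. It relies only on the open_spider, process_item and close_spider hooks that Scrapy calls on item pipelines.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes one JSON object per line."""

    def __init__(self, path="thousandPic.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Open the file when the spider starts rather than at construction time.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # No trailing comma: each line is a complete JSON document.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand, without Scrapy, using made-up items.
path = os.path.join(tempfile.mkdtemp(), "demo.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"author": ["alice"], "theme": ["日落"]}, spider=None)
pipeline.process_item({"author": ["bob"], "theme": ["Poster"]}, spider=None)
pipeline.close_spider(spider=None)

# Each line parses on its own, so the file can be read back record by record.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded)
```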

Configuring the settings file

Edit the settings.py configuration file.

Find the pipelines section and change it:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'AdilCrawler.pipelines.AdilcrawlerPipeline': 300,
#}

# A pipeline only runs once it has been added to the ITEM_PIPELINES setting.
# The key is the dotted path to the class: AdilCrawler is the project package,
# pipelines is the pipeline module, and AdilcrawlerPipeline is the class name.
ITEM_PIPELINES = {
    'AdilCrawler.pipelines.AdilcrawlerPipeline': 300,
}
# Adding the entry enables the pipeline: when the spider runs, Scrapy
# instantiates the listed class and calls its methods, in this case the
# data-saving logic shown above.
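The integer assigned to each entry in ITEM_PIPELINES sets the order: items pass through pipelines from lower to higher values, and values are customarily kept in the 0 to 1000 range. The sketch below only illustrates that ordering with plain Python; the CleanupPipeline entry is hypothetical and does not exist in this project.

```python
ITEM_PIPELINES = {
    "AdilCrawler.pipelines.AdilcrawlerPipeline": 300,
    # Hypothetical second pipeline, shown only to illustrate ordering.
    "AdilCrawler.pipelines.CleanupPipeline": 100,
}

# Lower numbers run first, so CleanupPipeline would see each item before
# AdilcrawlerPipeline writes it to disk.
run_order = [path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(run_order)
```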

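Run the command:

scrapy crawl thousandPicOptimize

After it finishes, the file thousandPic.json is created and holds the saved data.

[Screenshot: generated file and saved data]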

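Crawling across pages with the CrawlSpider class

Create a CrawlSpider from the crawl template.

Run the following command:

scrapy genspider -t crawl thousandPicPaging www.58pic.com

items.py stays unchanged. Take a look at the generated spider file thousandPicPaging.py: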

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ThousandpicpagingSpider(CrawlSpider):
    name = 'thousandPicPaging'
    allowed_domains = ['www.58pic.com']
    start_urls = ['http://www.58pic.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

After modification it looks like this:

# -*- coding: utf-8 -*-
import scrapy
# Import the link extractor class, used to pull out links that match a rule
from scrapy.linkextractors import LinkExtractor
# Import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
import AdilCrawler.items as items


class ThousandpicpagingSpider(CrawlSpider):
    name = 'thousandPicPaging'
    allowed_domains = ['www.58pic.com']
    # Changed start page
    start_urls = ['http://www.58pic.com/c/']

    # Link-extraction rule applied to each response; it produces the list of
    # links that match.
    # Paging URLs look like http://www.58pic.com/c/1-0-0-03.html, so the part
    # 1-0-0-03 becomes the regex \S-\S-\S-\S\S, passed through allow.
    # The author advises against restrict_xpaths here, reporting that the
    # regex stopped taking effect when it was used.
    # Raw string so the backslashes reach the regex engine unchanged; the
    # dots are escaped so they match a literal '.'.
    page_link = LinkExtractor(allow=r'http://www\.58pic\.com/c/\S-\S-\S-\S\S\.html', allow_domains='www.58pic.com')

    rules = (
        # Request each extracted link in turn, keep following links from the
        # fetched pages, and hand every response to the named callback.
        Rule(page_link, callback='parse_item', follow=True),  # the trailing ',' is required, otherwise rules is not a tuple and an error is raised
    )

    # parse_item() alone does not scrape the first page. parse_start_url is a
    # CrawlSpider method that handles the start URL responses, so override it.
    def parse_start_url(self, response):
        i = items.AdilcrawlerItem()
        author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
        theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()
        i['author'] = author
        i['theme'] = theme
        yield i

    # The callback named in the rule
    def parse_item(self, response):
        i = items.AdilcrawlerItem()
        author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
        theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()
        i['author'] = author
        i['theme'] = theme
        yield i

Run it again:

scrapy crawl thousandPicPaging

Checking the output, you can see that content from 4 pages was scraped.
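The paging pattern can be checked offline with Python's re module before it goes into a LinkExtractor. The sketch below is a rough check under stated assumptions: only the first URL comes from the article, the other sample URLs are made up, and re.search is used here as a stand-in for however the link extractor applies the pattern.

```python
import re

# Raw string so the backslashes reach the regex engine unchanged; the dots
# are escaped so that they only match a literal '.'.
PAGE_LINK = re.compile(r"http://www\.58pic\.com/c/\S-\S-\S-\S\S\.html")

urls = [
    "http://www.58pic.com/c/1-0-0-03.html",  # paging URL quoted in the article
    "http://www.58pic.com/c/1-0-0-04.html",  # made-up sibling page
    "http://www.58pic.com/c/",               # start page, not a paging link
    "http://www.58pic.com/about.html",       # made-up unrelated page
]

matches = [u for u in urls if PAGE_LINK.search(u)]
print(matches)
```

Note that \S\S matches exactly two non-whitespace characters, so a three-digit page number would not match; a pattern built on \d+ would be needed to cover that case.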

Improving it again by bringing in the ItemLoader class

# -*- coding: utf-8 -*-
import scrapy
# Import the link extractor class, used to pull out links that match a rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
# Import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
import AdilCrawler.items as items


class ThousandpicpagingopSpider(CrawlSpider):
    name = 'thousandPicPagingOp'
    allowed_domains = ['www.58pic.com']
    # Changed start page
    start_urls = ['http://www.58pic.com/c/']

    # Link-extraction rule applied to each response; it produces the list of
    # links that match.
    # Paging URLs look like http://www.58pic.com/c/1-0-0-03.html, so the part
    # 1-0-0-03 becomes the regex \S-\S-\S-\S\S, passed through allow.
    # The author advises against restrict_xpaths here, reporting that the
    # regex stopped taking effect when it was used.
    # Raw string so the backslashes reach the regex engine unchanged; the
    # dots are escaped so they match a literal '.'.
    page_link = LinkExtractor(allow=r'http://www\.58pic\.com/c/\S-\S-\S-\S\S\.html', allow_domains='www.58pic.com')

    rules = (
        # Request each extracted link in turn, keep following links from the
        # fetched pages, and hand every response to the named callback.
        Rule(page_link, callback='parse_item', follow=True),  # the trailing ',' is required, otherwise rules is not a tuple and an error is raised
    )

    # parse_item() alone does not scrape the first page. parse_start_url is a
    # CrawlSpider method that handles the start URL responses, so override it.
    def parse_start_url(self, response):
        i = ItemLoader(item=items.AdilcrawlerItem(), response=response)
        i.add_xpath('author', '/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()')
        i.add_xpath('theme', '/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()')
        yield i.load_item()

    # The callback named in the rule
    def parse_item(self, response):
        i = ItemLoader(item=items.AdilcrawlerItem(), response=response)
        i.add_xpath('author', '/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()')
        i.add_xpath('theme', '/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()')
        yield i.load_item()

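The result is the same as before.

Finally, a quick plug for an online regular expression tester: http://tool.oschina.net/regex/

[Screenshot: the regex tester applied to the paging URL]

That completes a simple crawl of basic information from one site. More topics will follow in later posts.

If you found this useful, a tip would be much appreciated and would keep me motivated. Thank you!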
