Author: 留心6_136 | Source: Internet | 2023-10-14 15:02
1. Create the Scrapy project

scrapy startproject CrawlMeiziTu
2. cd into the project directory

cd CrawlMeiziTu
3. Create the spider and set its initial crawl URL

scrapy genspider Meizitu http://www.meizitu.com/a/more_1.html
Project structure:
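For reference, a freshly generated project with this name typically has the default Scrapy layout (a sketch; the exact files depend on your Scrapy version):

```
CrawlMeiziTu/
    scrapy.cfg            # deploy configuration
    CrawlMeiziTu/
        __init__.py
        items.py          # item definitions (step 6)
        pipelines.py      # item pipelines (step 7)
        settings.py       # project settings (step 5)
        spiders/
            __init__.py
            Meizitu.py    # the spider created in step 3
```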
4. Create a main.py file, so the crawl can be launched directly from an IDE

from scrapy import cmdline
# Equivalent to running "scrapy crawl Meizitu" on the command line
cmdline.execute("scrapy crawl Meizitu".split())
5. Edit the settings file: mainly the USER_AGENT, the download path, and the download delay
BOT_NAME = 'CrawlMeiziTu'
SPIDER_MODULES = ['CrawlMeiziTu.spiders']
NEWSPIDER_MODULE = 'CrawlMeiziTu.spiders'
# Where downloaded images are stored
IMAGES_STORE = '/Users/vincentwen/Downloads/img/meizitu/'
# Impersonate a browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
# Delay between downloads (seconds)
DOWNLOAD_DELAY = 0.3
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'CrawlMeiziTu.pipelines.CrawlmeizituPipeline': 300,
}
6. Edit the item. Items store the information extracted by the Spider. Since we are crawling Meizitu, we need each image's name, its link, its tags, and so on.
import scrapy

class CrawlmeizituItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # title: used as the folder name
    title = scrapy.Field()
    url = scrapy.Field()
    tags = scrapy.Field()
    # image link
    src = scrapy.Field()
    # alt: the image name
    alt = scrapy.Field()
7. Edit the pipeline. Pipelines process the information collected in items: for example, deriving a folder or file name from the title, and downloading each image from its link.
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import requests
from CrawlMeiziTu.settings import IMAGES_STORE

class CrawlmeizituPipeline(object):
    def process_item(self, item, spider):
        # Folder name derived from the title (unused while everything
        # is stored in a single folder, see below)
        fold_name = "".join(item['title'])
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
            'Cookie': 'b963ef2d97e050aaf90fd5fab8e78633',
        }
        images = []
        # All images go into a single folder
        dir_path = IMAGES_STORE
        if item['src'] and not os.path.exists(dir_path):
            os.makedirs(dir_path)
        if not item['src']:
            # Record pages that yielded no images, for later checking
            with open('../check.txt', 'a+') as fp:
                fp.write("".join(item['title']) + ":" + "".join(item['url']))
                fp.write("\n")
        for jpg_url, name, num in zip(item['src'], item['alt'], range(0, 100)):
            file_name = name + str(num)
            file_path = os.path.join(dir_path, file_name + '.jpg')
            images.append(file_path)
            if os.path.exists(file_path):
                continue
            # Download the image with the browser headers set above
            req = requests.get(jpg_url, headers=headers, timeout=10)
            with open(file_path, 'wb') as f:
                f.write(req.content)
        return item
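Titles and alt texts scraped from web pages often contain characters that are not legal in file names (`/`, `:`, `?`, and so on). A small stdlib-only helper like the following (hypothetical, not part of the original pipeline) could be used to sanitize `file_name` before writing to disk:

```python
import re

def sanitize_filename(name, max_len=100):
    """Replace characters that are illegal in file names and trim length."""
    # Collapse path separators, Windows-forbidden characters, and
    # whitespace runs into single underscores, then trim the edges.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', name).strip('_')
    return cleaned[:max_len]

print(sanitize_filename('妹子图: 第1页/封面?'))  # prints 妹子图_第1页_封面
```

Calling `sanitize_filename(file_name)` before building `file_path` would keep a malformed title from breaking `open()`.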
image.png
8. Edit the Meizitu spider's main program.