Python: Overriding file_path in Scrapy's ImagesPipeline, but the override is never called

Posted by 晶晶9930_195 on 2022-11-07 10:09

Environment

Python: 2.7.6 (64-bit)
Scrapy: 0.22.2 (64-bit)
Operating system: Windows 7 (64-bit)


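Problem

By default, when the ImagesPipeline component downloads an image, it saves the file under a name derived from the SHA1 hash of the image URL.
For example:
Image URL: http://www.example.com/image.jpg
SHA1 result: 3afec3b4765f8f0a07b78f98c07b83f013567a0a
Saved file name: 3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
However, I want to save each image under its original name. For the example above, the file saved locally should be named image.jpg.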
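To make the two naming schemes concrete, here is a minimal sketch in plain Python (no Scrapy involved). The helper names sha1_name and original_name are hypothetical and exist only for illustration; the first mimics the default SHA1-based name, the second keeps the last path segment of the URL.

```python
import hashlib
import posixpath

try:  # Python 3
    from urllib.parse import urlparse
except ImportError:  # Python 2.7, as used in the question
    from urlparse import urlparse


def sha1_name(url):
    """Mimic the default naming: SHA1 hex digest of the URL plus '.jpg'."""
    return 'full/%s.jpg' % hashlib.sha1(url.encode('utf-8')).hexdigest()


def original_name(url):
    """Hypothetical alternative: keep the last path segment of the URL."""
    return 'full/%s' % posixpath.basename(urlparse(url).path)


url = 'http://www.example.com/image.jpg'
print(sha1_name(url))      # 'full/' + 40 hex characters + '.jpg'
print(original_name(url))  # full/image.jpg
```

One trade-off to keep in mind: hashing the whole URL gives every distinct URL its own file name, whereas keeping the original name means two URLs that end in the same file name would map to the same path.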
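An answer on Stack Overflow says that you can override the image_key function, but when I tried it, my overridden image_key was never called. I then looked at the ImagesPipeline source code: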

class ImagesPipeline(FilesPipeline):
    """Abstract pipeline that implement the image thumbnail generation logic

    """

    MEDIA_NAME = 'image'
    MIN_WIDTH = 0
    MIN_HEIGHT = 0
    THUMBS = {}
    DEFAULT_IMAGES_URLS_FIELD = 'image_urls'
    DEFAULT_IMAGES_RESULT_FIELD = 'images'

... (omitted)

    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                          'please use file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(url).hexdigest()  # change to request.url after deprecation
        return 'full/%s.jpg' % (image_guid)
    # deprecated
    def image_key(self, url):
        return self.file_path(url)
    image_key._base = True

... (omitted)

It contains this message:
ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, please use file_path(request, response=None, info=None) instead
In other words, in the latest version of Scrapy (0.22.2), file_path should be used in place of image_key.
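The override check in the quoted source relies on a _base marker attribute. The sketch below is not Scrapy code: StandInPipeline and KeepNamePipeline are hypothetical stand-ins that copy only that check, and they take a plain URL string instead of a Request. It suggests that, in the quoted version, an overridden image_key would still be picked up by file_path, provided the subclass is the pipeline that actually runs.

```python
import hashlib


class StandInPipeline(object):
    """Not Scrapy: a stripped-down copy of the override check quoted above."""

    def file_path(self, url):
        # An overridden image_key lacks the _base marker, so it takes over.
        if not hasattr(self.image_key, '_base'):
            return self.image_key(url)
        return 'full/%s.jpg' % hashlib.sha1(url.encode('utf-8')).hexdigest()

    def image_key(self, url):
        return self.file_path(url)
    image_key._base = True  # marks the base implementation


class KeepNamePipeline(StandInPipeline):
    def image_key(self, url):  # this override carries no _base attribute
        return 'full/%s' % url.split('/')[-1]


url = 'http://www.example.com/image.jpg'
print(StandInPipeline().file_path(url))   # SHA1-based name
print(KeepNamePipeline().file_path(url))  # full/image.jpg
```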
So I overrode file_path in my custom ImagesPipeline subclass, but when I ran the spider, that override was not called either.
The code is as follows:

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
import os
class DownPhotosPipeline(ImagesPipeline):

    def file_path(self, request):
        print "~~~~~~~~~~~~~~~~~~~~~~"
        print "~~~~~~~"+request.url+"~~~~~~~"
        print "~~~~~~~~~~~~~~~~~~~~~~"
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % (image_guid)

    def get_media_requests(self, item, info):
        for image_url in item['images']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        #item['image_paths'] = image_paths
        return item

settings.py

DOWNLOAD_DELAY = 2
IMAGES_STORE = 'budejie_photos'
DOWNLOAD_TIMEOUT = 1200
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
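One thing that may be worth checking in the settings above: ITEM_PIPELINES names only the stock scrapy.contrib.pipeline.images.ImagesPipeline, and Scrapy runs only the pipelines listed in that setting, so a subclass that is not listed there would not be used. A minimal sketch of enabling the custom class follows. The module path myproject.pipelines is an assumption, since the post does not give the real project name.

```python
# settings.py (sketch)
# 'myproject.pipelines' is a hypothetical module path; substitute the module
# that actually defines DownPhotosPipeline.
IMAGES_STORE = 'budejie_photos'
ITEM_PIPELINES = ['myproject.pipelines.DownPhotosPipeline']
```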
1 answer
  • def file_path(self, request):
    should be changed to
    def file_path(self, request, response=None, info=None):
    and that fixes it. After that, just return the image file name you want from file_path.

    Answered 2022-11-12 01:42
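Why the signature matters can be sketched without Scrapy. The example assumes the framework calls file_path with response and info as keyword arguments, which is what the signature in the quoted source suggests. StandInCaller, FakeRequest, NarrowOverride and FullOverride are hypothetical names used only for this illustration.

```python
import collections

# Hypothetical stand-in for a request object that only carries a URL.
FakeRequest = collections.namedtuple('FakeRequest', 'url')


class StandInCaller(object):
    """Not Scrapy: mimics a framework passing response and info by keyword."""

    def store(self, request):
        return self.file_path(request, response=None, info=None)

    def file_path(self, request, response=None, info=None):
        return 'full/default.jpg'


class NarrowOverride(StandInCaller):
    def file_path(self, request):  # signature from the question
        return 'full/%s' % request.url.split('/')[-1]


class FullOverride(StandInCaller):
    def file_path(self, request, response=None, info=None):  # from the answer
        return 'full/%s' % request.url.split('/')[-1]


req = FakeRequest('http://www.example.com/image.jpg')
print(FullOverride().store(req))  # full/image.jpg
try:
    NarrowOverride().store(req)
except TypeError as exc:
    # The one-argument override cannot accept the extra keyword arguments.
    print('TypeError: %s' % exc)
```

With the narrow signature the call fails as soon as the extra keyword arguments are passed, while the full signature accepts them and returns the original file name.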