Author: 立而山0605_408 | Source: Internet | 2023-05-17 15:21
I have the following question: I need to save images to MongoDB during web scraping. I have an image link. I tried this:
images_binaries = []  # this will store all images' raw data before saving it to mongodb
# save as a file on the hard disk, then read it back
urllib.urlretrieve(url, self.album_path + '/' + photo_file_name)
images_binaries.append(open(self.album_path + '/' + photo_file_name, 'rb').read())
....
# after that, I append this array of raw image data to the Item
post = WaralbumPost()
post['images_binary'] = images_binaries
....
The code of the Waralbum item:
from scrapy.item import Item, Field

class WaralbumPost(Item):
    images_binary = Field()
But this causes an error when it saves to Mongo: bson.errors.InvalidStringData: strings in documents must be valid UTF-8: '\xff\.....
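For context, the error comes from handing PyMongo raw image bytes as a plain string: BSON string fields must be valid UTF-8, and image data (JPEGs start with the byte 0xFF) is not. A minimal stdlib-only sketch of the problem, plus the base64 workaround (wrapping the bytes in bson.Binary, or using GridFS as in the solution below, is the cleaner fix):

```python
import base64

# JPEG files begin with the bytes FF D8; such bytes are not valid UTF-8,
# which is exactly why BSON rejects them when stored as a plain string.
raw = b"\xff\xd8\xff\xe0fake-jpeg-data"

try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False  # this branch is taken: 0xFF is an invalid UTF-8 start byte

# Workaround: base64-encode before inserting (at a ~33% size cost);
# the preferred fix is post['images_binary'] = [bson.Binary(raw)], which
# needs pymongo installed and stores the bytes as a BSON binary field.
encoded = base64.b64encode(raw).decode("ascii")
decoded = base64.b64decode(encoded)
```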
What is a better way to do this? Would converting the raw image data solve this problem? Maybe Scrapy has a nice way of saving images? Thanks for your answers.
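On the "does Scrapy have a nice way" part: Scrapy does ship a built-in ImagesPipeline that downloads every URL listed in an item's image_urls field (it requires Pillow and writes files to disk or S3, not to MongoDB). A sketch of the settings.py fragment that enables it; the store path is a hypothetical placeholder, and in old Scrapy versions the import path was scrapy.contrib.pipeline.images.ImagesPipeline:

```python
# settings.py -- enable Scrapy's built-in image pipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/images'  # hypothetical path; files land on disk, not in MongoDB
```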
SOLUTION: I deleted these lines:

images_binaries.append(open(self.album_path + '/' + photo_file, 'r').read())
post['images_binary'] = images_binaries

In my WaralbumPost I also save the image URL. Then, in pipelines.py, I fetch this URL and save the image in Mongo via GridFS. The code of pipelines.py:
import mimetypes

import gridfs
import pymongo
import requests
from scrapy import log
from scrapy.conf import settings

class WarAlbum(object):
    def __init__(self):
        connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]
        self.grid_fs = gridfs.GridFS(db)

    def process_item(self, item, spider):
        links = item['img_links']
        ids = []
        for i, link in enumerate(links):
            mime_type = mimetypes.guess_type(link)[0]
            request = requests.get(link, stream=True)
            _id = self.grid_fs.put(request.raw, contentType=mime_type,
                                   filename=item['local_images'][i])
            ids.append(_id)
        item['data_chunk_id'] = ids
        self.collection.insert(dict(item))
        log.msg("Item wrote to MongoDB database %s/%s" %
                (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                level=log.DEBUG, spider=spider)
        return item
Hope this will be helpful for someone.