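I have a Scrapy spider that runs perfectly with scrapy crawl single, but I cannot get it to run from a Python script.
The main problem is that the SingleBlogSpider.parse method is never executed, although start_requests is.
Here is the code and the output from running the script. I also tried moving the execution part into a separate file, but the same thing happens.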
from urlparse import urlparse
from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class SingleBlogSpider(BaseSpider):
    name = 'single'

    def __init__(self, **kwargs):
        super(SingleBlogSpider, self).__init__(**kwargs)
        url = kwargs.get('url') or kwargs.get('domain') or 'seaofshoes.com'
        if not url.startswith('http://') and not url.startswith('https://'):
            url = 'http://%s/' % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.link_extractor = SgmlLinkExtractor()
        self.cookies_seen = set()
        print 0, self.url

    def start_requests(self):
        print '1', self.url
        return [Request(self.url, callback=self.parse)]

    def parse(self, response):
        print '2'
        # Actual scraper code, which is never executed


if __name__ == '__main__':
    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log, signals

    spider = SingleBlogSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
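
To rule out the URL handling in __init__, that logic can be checked on its own, without Scrapy. This is only a minimal sketch: the helper names normalize_url and allowed_domain are hypothetical and not part of the spider, and the import fallback is there only so that it runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as in the script above


def normalize_url(url):
    """Hypothetical helper: add http:// and a trailing slash if no scheme is given."""
    if not url.startswith('http://') and not url.startswith('https://'):
        url = 'http://%s/' % url
    return url


def allowed_domain(url):
    """Hypothetical helper: same expression the spider uses for allowed_domains."""
    return urlparse(url).hostname.lstrip('www.')


url = normalize_url('scrapinghub.com')
print(url)                                      # http://scrapinghub.com/
print(allowed_domain(url))                      # scrapinghub.com
print(allowed_domain('http://www.wired.com/'))  # ired.com
```

For the domain used in this run, both values come out as expected. Note that lstrip('www.') strips any leading 'w' and '.' characters rather than the literal prefix, which is why www.wired.com turns into ired.com; scrapinghub.com is not affected by this.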
Output:
0 http://scrapinghub.com/
1 http://scrapinghub.com/
2013-09-13 14:21:46-0500 [single] INFO: Closing spider (finished)
2013-09-13 14:21:46-0500 [single] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 221,'downloader/request_count': 1,'downloader/request_method_count/GET': 1,'downloader/response_bytes': 9403,'downloader/response_count': 1,'downloader/response_status_count/200': 1,'finish_reason': 'finished','finish_time': datetime.datetime(2013,9,13,19,21,46,563184),'response_received_count': 1,'scheduler/dequeued': 1,'scheduler/dequeued/memory': 1,'scheduler/enqueued': 1,'scheduler/enqueued/memory': 1,'start_time': datetime.datetime(2013,328961)}
2013-09-13 14:21:46-0500 [single] INFO: Spider closed (finished)
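The program never reaches SingleBlogSpider.parse and never prints '2', so it does not scrape anything. Yet the output shows that the request was actually made and a response came back (response_received_count is 1), so I am not sure what is going wrong.
Scrapy version == 0.18.2
I really cannot spot the error and would greatly appreciate any help.
Thanks!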