Author: 依然-狠幸福 | Source: Internet | 2023-05-20 14:40
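I have a Python script that fetches tweets. The script uses the Tweepy library with valid authentication parameters. After running it, some tweets are stored in my MongoDB and some are rejected by the if statement. But I still get this error: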
requests.packages.urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read, 2457 more expected)'
My question is: which part of the script can I improve so that I no longer get the error above?
This is my script:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time
import json
from pymongo import MongoClient

# Mongo settings
client = MongoClient()
db = client.Sentiment
Tweets = db.Tweet

# Twitter credentials
ckey = 'myckey'
csecret = 'mycsecret'
atoken = 'myatoken'
asecret = 'myasecret'


class listener(StreamListener):

    def on_data(self, data):
        try:
            tweet = json.loads(data)
            # Not every stream message is a tweet, so "lang" may be absent
            if tweet.get("lang") == "nl":
                print(tweet["id"])
                Tweets.insert(tweet)
            return True
        except Exception as e:
            # Catch Exception rather than BaseException so Ctrl-C still works
            print('failed on_data, ' + str(e))
            time.sleep(5)

    def on_error(self, status):
        print(status)


auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(
    track=[
        "geld lenen",
        "lening",
        "Defam",
        "DEFAM",
        "Credivance",
        "CREDIVANCE",
        "Alpha Credit",
        "ALPHA CREDIT",
        "Advanced Finance",
        "krediet",
        "KREDIET",
        "private lease",
        "ing",
        "Rabobank",
        "Interbank",
        "Nationale Nerderlanden",
        "Geldshop",
        "Geldlenen",
        "ABN AMBRO",
        "Independer",
        "DGA adviseur",
        "VDZ",
        "vdz",
        "Financieel Attent",
        "Anderslenen",
        "De Nederlandse Kredietmaatschappij",
        "Moneycare",
        "De Financiele Makelaar Kredieten",
        "Finanplaza",
        "Krediet",
        "CFSN Kredietendesk",
        "De Graaf Assurantien en Financieel Adviseurs",
        "AMBTENARENLENING",
        "VDZ Geldzaken",
        "Financium Primae",
        "SNS",
        "AlfamConsumerCredit",
        "GreenLoans",
    ],
    # languages expects a list; a bare string is split into single characters
    languages=["nl"]
)
I hope you can help me...
1> dbernard..:
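IncompleteRead generally occurs when your consumption of incoming tweets starts to fall behind, which makes sense in your case given the long list of terms you are tracking. The general approach most people seem to take (myself included) is simply to suppress this error and continue collecting (the original answer linked to a discussion of this).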
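I can't remember for certain whether IncompleteRead closes your connection (I think it might, because my own solution reconnects my stream), but you might consider something like the following (I'm just winging it, so it will probably need reworking for your situation):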
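# from httplib import IncompleteRead  # Python 2
from http.client import IncompleteRead  # Python 3
# The traceback in the question shows the error wrapped in urllib3's
# ProtocolError, so catch that as well
from requests.packages.urllib3.exceptions import ProtocolError
...
while True:
    try:
        # Connect/reconnect the stream
        stream = Stream(auth, listener())
        # DON'T run this approach async or you'll just create a ton of streams!
        # terms is the list of track keywords from the question
        stream.filter(track=terms)
    except (IncompleteRead, ProtocolError):
        # Oh well, reconnect and keep trucking
        continue
    except KeyboardInterrupt:
        # Or however you want to exit this loop
        stream.disconnect()
        break
...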
Again, I'm just winging it here, but the moral of the story is that the general approach is to suppress the error and continue.
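The loop above reconnects immediately after every failure. A minimal sketch of the same suppress-and-reconnect idea, with a pause that grows between attempts, might look like this. The names run_with_reconnect and connect, and the delay values, are hypothetical and chosen for illustration; none of them are part of Tweepy. Here connect stands in for whatever builds the stream and calls filter.

```python
import time
from http.client import IncompleteRead


def run_with_reconnect(connect, retry_on=(IncompleteRead,), max_retries=5,
                       base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call connect() until it returns normally, retrying on the given errors.

    Returns the number of retries that were needed and re-raises the last
    error once max_retries has been used up. All names and defaults here are
    illustrative assumptions, not a Tweepy API.
    """
    retries = 0
    while True:
        try:
            connect()  # blocks for as long as the stream stays healthy
            return retries
        except retry_on:
            if retries >= max_retries:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay
            sleep(min(base_delay * 2 ** retries, max_delay))
            retries += 1
```

With the question's code, connect could be a small function that builds Stream(auth, listener()) and calls filter(track=...). The sleep parameter is injectable only so the behaviour can be checked without actually waiting.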
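EDIT (10/11/2016): A useful note for anyone handling a large volume of tweets: one way to deal with this without losing connection time or tweets is to drop your incoming tweets into a queuing solution (RabbitMQ, Kafka, etc.) and have them ingested/processed by an application that reads from that queue.
This moves the bottleneck from the Twitter API to your queue, which should have no problem waiting for you to consume the data.
This is more of a "production" software solution, so if you don't mind losing tweets or reconnecting, the solution above is still perfectly valid.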
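As a rough illustration of the queue idea, here is a sketch that uses Python's standard-library queue.Queue and a worker thread as an in-process stand-in for RabbitMQ or Kafka. It is not the answerer's actual setup: tweet_queue, stored, on_data, and worker are hypothetical names, and stored replaces the MongoDB collection so the example runs on its own. The point is that the stream callback only enqueues the raw payload, while parsing, filtering, and storage happen elsewhere at their own pace.

```python
import json
import queue
import threading

tweet_queue = queue.Queue()  # in-process stand-in for RabbitMQ/Kafka
stored = []                  # stand-in for the MongoDB collection


def on_data(raw):
    """Stream callback: do no real work here, just hand the payload off."""
    tweet_queue.put(raw)
    return True


def worker():
    """Consume payloads independently of the stream's read loop."""
    while True:
        raw = tweet_queue.get()
        try:
            if raw is None:  # sentinel value: shut the worker down
                return
            tweet = json.loads(raw)
            if isinstance(tweet, dict) and tweet.get("lang") == "nl":
                stored.append(tweet)  # e.g. insert into MongoDB here
        except ValueError:
            pass  # skip payloads that are not valid JSON
        finally:
            tweet_queue.task_done()


consumer = threading.Thread(target=worker, daemon=True)
consumer.start()
```

An in-process queue only decouples reading from processing inside one program and is lost if that program exits. An external broker such as those named above also survives restarts of the consumer, which is presumably why the answer describes it as the "production" option.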