A note up front: the scraper below is synchronous; rewriting it with coroutines would improve throughput considerably (a minimal sketch follows the results note below).
It collects basic product information, product parameters, and reviews from JD.com (京东),
parsing the pages with BeautifulSoup.
Note: only 100 pages of reviews can be fetched per product, so to gather more data the scraper fetches up to 100 pages each of positive, medium, and negative reviews for every product.
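Reviews come from JD's comment endpoint, whose URL and score parameter (0 = all, 1 = negative, 2 = medium, 3 = positive, as used in the loops below) appear in the code comments further down. getCommJson itself is cut off from this excerpt, so the following is only a plausible sketch of its shape, assuming the endpoint returns plain JSON when the callback parameter is omitted:

import json
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # stand-in for the class's get_user_hearder()

def get_comm_json(item_id, page=0, score=0):
    # Sketch only -- URL and parameters taken from the comments in getCommMeta below
    url = ("https://club.jd.com/comment/skuProductPageComments.action"
           "?productId={}&score={}&sortType=5&page={}&pageSize=10"
           "&isShadowSku=0&fold=1").format(item_id, score, page)
    try:
        resp = requests.get(url, headers=HEADERS, timeout=10)
        return json.loads(resp.text)
    except Exception:
        return None  # callers skip a page when None comes back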
Crawler results: [screenshot of the output folders in the project root omitted]
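As for the coroutine rewrite mentioned above, here is a minimal sketch using aiohttp (my choice of library, not the original's; the original uses synchronous requests throughout) that fetches several product pages concurrently:

import asyncio
import aiohttp

async def fetch_page(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_page(session, u) for u in urls))

# e.g., using two SKU ids from the product list at the bottom of the script:
# pages = asyncio.run(fetch_all(["https://item.jd.com/%d.html" % i
#                                for i in (3857525, 4669576)]))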
,"wb") as jpg:#保存图片
for chunk in image:
jpg.write(chunk)
i=i+1
except:
traceback.print_exc()
return images
    def getCommMeta(self, item_id):
        """
        Fetch the "relative" attributes: buyer impressions and the review summary.
        """
        commentJson = self.getCommJson(item_id)
        # https://club.jd.com/comment/skuProductPageComments.action
        # ?callback=fetchJSON_comment98vv40836&productId=4669576&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1
        # &callback=jQuery3649390&_=1500941065939
        # https://club.jd.com/comment/productCommentSummaries.action?referenceIds=3564110
        commentMetas = {}
        commentMetas['goodRateShow'] = str(commentJson["productCommentSummary"]["goodRateShow"])  # positive-review rate
        commentMetas['poorRateShow'] = str(commentJson["productCommentSummary"]["poorRateShow"])  # negative-review rate
        commentMetas['commentCount'] = str(commentJson["productCommentSummary"]["commentCount"])  # total number of reviews
        commentMetas['goodCount'] = str(commentJson["productCommentSummary"]["goodCount"])        # number of positive reviews
        commentMetas['generalCount'] = str(commentJson["productCommentSummary"]["generalCount"])  # number of medium reviews
        commentMetas['poorCount'] = str(commentJson["productCommentSummary"]["poorCount"])        # number of negative reviews
        # buyer impressions
        commentMetas['hotCommentTags'] = commentJson["hotCommentTagStatistics"]
        return commentMetas
    def getComments(self, item_id):
        """
        Fetch up to 100 pages each of the product's positive, medium, and negative reviews.
        """
        comments = {}
        comments['goodComments'] = []
        comments['geneComments'] = []
        comments['badComments'] = []
        # positive reviews
        for i in range(100):
            commentJson = self.getCommJson(item_id, i, score=3)
            if commentJson is None:
                continue
            if len(commentJson['comments']) == 0:
                break
            comments['goodComments'].extend(self.splitComments(commentJson))
            time.sleep(1)
        # medium reviews
        for i in range(100):
            commentJson = self.getCommJson(item_id, i, score=2)
            if commentJson is None:
                continue
            if len(commentJson['comments']) == 0:
                break
            comments['geneComments'].extend(self.splitComments(commentJson))
            time.sleep(1)
        # negative reviews
        for i in range(100):
            commentJson = self.getCommJson(item_id, i, score=1)
            if commentJson is None:
                continue
            if len(commentJson['comments']) == 0:
                break
            comments['badComments'].extend(self.splitComments(commentJson))
            time.sleep(1)
        return comments
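    # The three loops above differ only in the score argument; a helper like the
    # following (a sketch, not part of the original code) would remove the
    # duplication, e.g. comments['goodComments'] = self._collect_comments(item_id, score=3):
    def _collect_comments(self, item_id, score, max_pages=100):
        collected = []
        for i in range(max_pages):
            commentJson = self.getCommJson(item_id, i, score=score)
            if commentJson is None:
                continue  # transient failure -- try the next page
            if len(commentJson['comments']) == 0:
                break  # no more pages for this score
            collected.extend(self.splitComments(commentJson))
            time.sleep(1)
        return collected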
    def splitComments(self, commentJson):
        comments = []
        for comm in commentJson['comments']:
            comment = {}
            comment["cmid"] = str(comm.get('id', ""))    # id of this review
            comment["guid"] = str(comm.get('guid', ""))  # purpose of guid is unclear
            # convert ASCII commas to full-width so review text cannot break the CSV columns
            comment["content"] = str(comm.get('content', "")).replace(',', ',').replace(' ', "").replace('\n', "").strip()
            comment["creationTime"] = str(comm.get('creationTime', ""))
            comment["referenceId"] = str(comm.get('referenceId', ""))  # the product this review belongs to
            comment["replyCount"] = str(comm.get('replyCount', ""))
            comment["score"] = str(comm.get('score', ""))
            comment["nickname"] = str(comm.get('nickname', ""))
            comment["productColor"] = str(comm.get('productColor', ""))
            comment["productSize"] = str(comm.get('productSize', ""))
            comments.append(comment)
        return comments
    def parseProducts(self, product_list):
        """
        product_list is a list of the form
        [[p1_sku1_id, p1_sku2_id, p1_sku3_id], [p2_sku1_id, p2_sku2_id, p2_sku3_id, p2_sku4_id], ...]
        where each element such as [p1_sku1_id, p1_sku2_id, p1_sku3_id] is itself a list of
        SKUs of one product with the same configuration but different colours.
        @param product_list: a hand-built list of SKU ids for the 60 qualifying products,
        passed in for the program to parse.
        """
        for products in product_list:
            parent_product_id = products[0]  # by convention, the first id in each sub-list is the parent
            for item_id in products:
                try:
                    url = "https://item.jd.com/" + str(item_id) + ".html"  # product page url
                    print(url)
                    html = requests.get(url, headers=self.get_user_hearder())
                    soup = BeautifulSoup(html.text, "html.parser")
                    name = soup.find("div", attrs={"class": "J-crumb-br"}).find("div", attrs={"class": "head"}).find('a').text  # brand
                    self.path = name
                    if not os.path.exists(self.path):
                        os.mkdir(self.path)
                    if not os.path.exists(self.path + "/propertys"):  # holds the parameter files
                        os.mkdir(self.path + "/propertys")
                    try:
                        self.item_path = os.path.join(self.path, str(parent_product_id))  # images of SKUs sharing a parent go in one folder
                        if not os.path.exists(self.item_path):
                            os.mkdir(self.item_path)
                        params = self.getParams(soup, item_id)    # fetch and save the parameters (the "absolute" attributes)
                        commentMetas = self.getCommMeta(item_id)  # the "relative" attributes from the review summary
                        comments = self.getComments(item_id)      # up to 100 pages of reviews per score
                        images = self.getImages(soup, self.item_path, item_id)  # fetch and save the photos
                        if parent_product_id == item_id:  # the parent SKU's record is the primary one
                            with open('products.csv', 'a') as f:
                                f.write(name + ',' + str(item_id) + ',' + params['skuName'] + ',' + params['price'] + ',' + commentMetas['goodRateShow'] + ',' + commentMetas['poorRateShow']
                                        + ',' + commentMetas['commentCount'] + ',' + commentMetas['goodCount'] + "," + commentMetas['generalCount'] + ',' + commentMetas['poorCount'])
                                for hotTag in list(commentMetas['hotCommentTags']):
                                    f.write(',' + hotTag['name'] + ":" + str(hotTag['count']))
                                f.write('\n')
                        # every SKU (parent included) is also appended to a backup file
                        with open('products_backup.csv', 'a') as f:
                            f.write(name + ',' + str(item_id) + ',' + params['skuName'] + ',' + params['price'] + ',' + commentMetas['goodRateShow'] + ',' + commentMetas['poorRateShow']
                                    + ',' + commentMetas['commentCount'] + ',' + commentMetas['goodCount'] + "," + commentMetas['generalCount'] + ',' + commentMetas['poorCount'])
                            for hotTag in list(commentMetas['hotCommentTags']):
                                f.write(',' + hotTag['name'] + ":" + str(hotTag['count']))
                            f.write('\n')
                        with open(name + '/propertys/' + str(item_id) + '_propertys.csv', 'w') as f:
                            for key in params['paramsList'].keys():
                                f.write(key + ',' + params['paramsList'][key] + '\n')
                        with open(name + '/' + str(parent_product_id) + '_comments.csv', 'a') as f:
                            try:
                                # save the positive reviews
                                for comm in comments['goodComments']:
                                    try:
                                        f.write(str(comm['cmid']) + ',' + str(comm['guid']) + ',' + comm['nickname'] + ',' + comm['score'] + ',' + 'good,' + comm['creationTime'] + ',' + comm['content'])
                                    except Exception as e:
                                        print("exception: " + str(e))
                                    # note: splitComments does not copy commentTags, so this branch rarely fires
                                    if 'commentTags' in comm.keys():
                                        for commentTag in comm['commentTags']:
                                            f.write(',' + commentTag['name'])
                                    f.write('\n')
                            except:
                                print('comment error save good comm' + str(item_id))
                                traceback.print_exc()
                            try:
                                # save the medium reviews
                                for comm in comments['geneComments']:
                                    try:
                                        f.write(str(comm['cmid']) + ',' + str(comm['guid']) + ',' + comm['nickname'] + ',' + comm['score'] + ',' + 'gene,' + comm['creationTime'] + ',' + comm['content'])
                                    except Exception as e:
                                        print("exception: " + str(e))
                                    if 'commentTags' in comm.keys():
                                        for commentTag in comm['commentTags']:
                                            f.write(',' + commentTag['name'])
                                    f.write('\n')
                            except:
                                print('comment error save gene comm' + str(item_id))
                            try:
                                # save the negative reviews
                                for comm in comments['badComments']:
                                    try:
                                        f.write(str(comm['cmid']) + ',' + str(comm['guid']) + ',' + comm['nickname'] + ',' + comm['score'] + ',' + 'bad,' + comm['creationTime'] + ',' + comm['content'])
                                    except Exception as e:
                                        print("exception: " + str(e))
                                    if 'commentTags' in comm.keys():
                                        for commentTag in comm['commentTags']:
                                            f.write(',' + commentTag['name'])
                                    f.write('\n')
                            except:
                                print('comment error save bad comm' + str(item_id))
                    except Exception as e:
                        # log every item whose parsing failed
                        with open('item_exception.log', 'a') as f:
                            # format: current time, category, product id, error
                            log = str(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')) + "," + self.path + "," + str(item_id) + "," + str(e) + "\n"
                            f.write(log)
                        traceback.print_exc()
                    time.sleep(2)  # pause for 2 seconds between items
                except Exception as e:
                    with open('item_error.log', 'a') as f:
                        # format: current time, product id, error
                        log = str(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')) + "," + str(item_id) + "," + str(e) + "\n"
                        f.write(log)
                    print(log)
                # break
    def split_comment_csv(self):
        """
        Walk all review files (those ending in _comments.csv) and split each
        review into clauses on punctuation.
        """
        file_list = []
        dirs = os.listdir(".")
        for dir_name in dirs:
            if os.path.isdir(dir_name):
                for name in os.listdir(dir_name):
                    # only the review files, per the docstring
                    if os.path.isfile(dir_name + "/" + name) and name.endswith('_comments.csv'):
                        item = {}
                        item["filePath"] = dir_name + "/" + name
                        item['fileName'] = name
                        item['dirName'] = dir_name
                        file_list.append(item)
        if not os.path.exists('comment_clause'):
            os.mkdir('comment_clause')
        for item in file_list:
            reader = csv.reader(open(item['filePath']))
            csv_writer_name = 'comment_clause/' + item['dirName'] + "_" + item['fileName']
            with open(csv_writer_name, 'w', newline='\n') as csvfile:
                for row in reader:
                    if len(row) >= 7:
                        clauses = re.split(r'[,。?!;;、?!·)(]', row[6])
                        for clause in clauses:
                            clause = clause.replace('&hellip', '').strip()
                            if len(clause) != 0:
                                csvfile.write(clause + "\n")
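    # A quick illustration of the clause split above, on a made-up review:
    #   re.split(r'[,。?!;;、?!·)(]', '很好用,屏幕清晰。就是电池不行!')
    #   -> ['很好用', '屏幕清晰', '就是电池不行', '']
    # The trailing empty string is dropped by the len(clause) != 0 check.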
    def count_origin_comments(self):
        """
        Count the original reviews, before clause-splitting, and the clauses
        they would produce.
        """
        file_list = []
        dirs = os.listdir(".")
        for dir_name in dirs:
            if os.path.isdir(dir_name):
                for name in os.listdir(dir_name):
                    if os.path.isfile(dir_name + "/" + name) and name.endswith('_comments.csv'):
                        item = {}
                        item["filePath"] = dir_name + "/" + name
                        item['fileName'] = name
                        item['dirName'] = dir_name
                        file_list.append(item)
        countData = []
        totalRowNum = 0     # total number of reviews
        totalClauseNum = 0  # total number of clauses after splitting
        for item in file_list:
            reader = csv.reader(open(item['filePath']))
            rowNum = 0  # number of review rows in this file
            clauseNum = 0
            for row in reader:
                if len(row) >= 7:
                    rowNum = rowNum + 1
                    clauses = re.split(r'[,。?!;;、?!·)(]', row[6])
                    for clause in clauses:
                        clause = clause.replace('&hellip', '').strip()
                        if len(clause) != 0:
                            clauseNum = clauseNum + 1
            totalClauseNum = totalClauseNum + clauseNum
            totalRowNum = totalRowNum + rowNum
            data_item = {}
            data_item['fileName'] = item['fileName']
            data_item['clauseNum'] = str(clauseNum)
            data_item['rowNum'] = str(rowNum)
            countData.append(data_item)
        with open('countData.csv', 'w') as f:
            f.write('file name,original review count,clause count\n')
            for item in countData:
                f.write(item['fileName'] + "," + item['rowNum'] + "," + item['clauseNum'] + '\n')
            f.write('total reviews,' + str(totalRowNum) + "\n")
            f.write('total clauses,' + str(totalClauseNum) + "\n")
    # ----------------------- tests ---------------------------
    def test_get_all_brand_url(self):
        text = json.loads(requests.get(self.url, headers=self.get_user_hearder()).text)
        for brand in text['brands']:
            url = 'https://list.jd.com/list.html?cat=9987,653,655&ev=exbrand_' + str(brand['id']) + '&sort=sort_rank_asc&trans=1&JL=3_' + quote(brand['name'])
            print(url)

    def test_find_next_page(self, url):
        soup = BeautifulSoup(requests.get(url, headers=self.get_user_hearder()).text, "html.parser")
        href = soup.find("a", attrs={"class": "pn-next"})  # next page
        if href:
            print(href.get('href'))
            brand_url = 'https://list.jd.com' + href.get('href')
        else:
            brand_url = ''
            print('url is None')
        print(brand_url)

    def test_get_comment_json(self, productId):
        json_content = self.getCommJson(productId)
        print(json_content)
        for comm in json_content['comments']:
            print(comm['content'].replace('\n', ''))

    def test_read_csv(self):
        reader = csv.reader(open('test.csv'))
        for row in reader:
            if len(row) >= 7:  # row[6] needs at least 7 columns
                print(row[6] + '\n')
if __name__ == '__main__':
    jingdong = Jingdong()
    # crawl every brand
    # jingdong.parse_brand()
    # tests
    # jingdong.getCommJson(12280434216, 0, 0)
    # jingdong.test_get_all_brand_url()
    # jingdong.test_find_next_page('https://list.jd.com/list.html?cat=9987,653,655&ev=exbrand%5F8557&page=3&sort=sort_rank_asc&trans=1&JL=6_0_0')
    # jingdong.test_get_comment_json(11083454031)
    # the 60 required phone models (only 33 collected so far)
    # product_list = [
    #     [3857525,4669576],[4411638, 4316775,4431603],[3924115,3875973],[3398125],[5097448,4199965],
    #     [4502433,4199967],[4411628],[3857521],[1345368],[11375078958,11774045896,11546640578],
    #     [4461939],[10417752533,10417197477],
    #     [10827008669],[4869176],[4086221,4086223,3867555,3867557],
    #     [4432058,4432056,4432052,4086229,4086227],[3352172,3352168],[4222708,3763103],[4170768,4170788,4170784,4170782],
    #     [4978326,4978306,4978332,5247848],[3729301,3729311,3729315],[10399574837,10416687137,10437750952,11089374104,11089374105],
    #     [1816276356,1816276354,10256482570,1816276355],[10065260353,10065260354,10069410228,10069410229],
    #     [10654370492,11022002650,10654370493,10654370494],[12481158400,12481163501,13304714040],
    #     [2166504],[3548595,3548599,3979666,3979664],[4363831,4363833,4363805,4363811,4363847],
    #     [4230493,5158518,5158508],[2589814,2589808,2589818],[2972184,2972174,2972172,2972186],[10213303571,10213303572]
    # ]
    # later additions: Coolpad, nubia, OnePlus
    product_list = [
        [3397564,3075827,3785780],[3151585,3159473],[3159465],[3789933],[3697279],[2917215],
        [2214850],[4066471],[2401116],[5019352,4160791],[10072766014],[10717616871],
        [4345197],[4345173],[5014204,4229972,4161762,2943569],[4746242,4983290,4024777,4746262,4245285],
        [4899658,4996220,4100837,5239536],[4220709,4534743,4220711,4497841],[3139087,3569552],
        [11881030122,11881076398,11839878990,12627332950]
    ]
    jingdong.parseProducts(product_list)
    # jingdong.test_read_csv()
    # jingdong.split_comment_csv()
    # jingdong.count_origin_comments()