
Python web scraping and data visualization: a hands-on example for complete beginners!

Author: 手机用户2502923413 | Source: Internet | 2023-10-12 09:00


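In a few simple steps, we will use Python to scrape the bangumi (anime series) ranking data from Bilibili and then visualize it.

Let's get started!

PS: I am a beginner at Python web scraping myself, so if anything here is wrong, corrections from more experienced readers are very welcome.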

This project scrapes the Bilibili bangumi ranking page and then analyzes the data with charts.

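First, prepare the required libraries:

requests, pandas, BeautifulSoup, matplotlib, and so on.

These are third-party libraries, so they have to be installed separately.
There are two ways to install them (requests is used as the example; the other libraries are installed the same way):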

(1) From the command line

Prerequisite: pip is installed (pip is Python's package manager; it can find, download, install, and uninstall Python packages).

pip install requests
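
Before moving on, it can help to confirm that every library is actually importable. The sketch below is not part of the original project; the helper name missing_packages and the REQUIRED table are illustrative assumptions. It also records two details that are easy to miss: BeautifulSoup is installed from the pip package beautifulsoup4 but imported as bs4, and pandas relies on openpyxl to write .xlsx files.

```python
import importlib.util

# Maps the import name to the pip package name (they are not always the same).
# This table is an assumption based on the libraries used in this article.
REQUIRED = {
    "requests": "requests",
    "pandas": "pandas",
    "bs4": "beautifulsoup4",   # BeautifulSoup is imported as bs4
    "matplotlib": "matplotlib",
    "openpyxl": "openpyxl",    # used by pandas when writing .xlsx files
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```

Running the script prints a ready-to-copy pip command for anything that is still missing.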

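(2) Through PyCharm

Step 1: In the top-left corner of the IDE, open File -> Settings…

Step 2: Open Project Interpreter, click the plus button in the top-right corner, type the library name (requests) into the search box at the top of the dialog, and click Install in the bottom-left corner. When a success message appears, the installation is complete.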


With the preparation done, we can start on the project itself.

1. Fetching the page content

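import requests


def get_html(url):
    try:
        r = requests.get(url, timeout=10)  # fetch the page with a GET request
        r.raise_for_status()  # raise an exception for 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # use the encoding detected from the content
        return r.text  # return the page text
    except requests.RequestException:
        return '错误'  # sentinel string meaning "error"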

Let's look at what was fetched and check whether it contains the content we want:

def main():
    url = 'https://www.bilibili.com/v/popular/rank/bangumi'  # page URL
    html = get_html(url)  # fetch the page
    print(html)  # print it


if __name__ == '__main__':  # entry point
    main()

The result of the scrape is shown in the figure below:

Success!

2. Parsing the information

Step 1: build a BeautifulSoup instance

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # choose the parser BeautifulSoup uses

Step 2: initialize the containers that will hold the data

# lists that will hold the extracted data
TScore = []  # overall score
name = []  # series title
play = []  # play count
review = []  # comment count
favorite = []  # favorite count

Step 3: extract the data
We start with the series titles and store them in a list.

# ***** series titles
for tag in soup.find_all('div', class_='info'):
    # print(tag)
    bf = tag.a.string  # text of the first <a> inside the div
    name.append(str(bf))
print(name)

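Here we use BeautifulSoup's find_all() to parse the page. Its first argument is the tag name and the second is the tag's class value (note the trailing underscore in class_='info'; it is needed because class is a reserved word in Python). Press F12 in the browser to open the page source, locate the relevant element, and the information is easy to see:

Next, we use almost the same approach to extract the overall score, play count, comment count, and favorite count.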

import re

# ***** play counts
for tag in soup.find_all('div', class_='detail'):
    # print(tag)
    bf = tag.find('span', class_='data-box').get_text()
    # normalize the unit to 万 (ten thousand)
    if '亿' in bf:
        # 1 亿 = 10,000 万; \d+(\.\d+)? also matches multi-digit values such as 12.3
        num = float(re.search(r'\d+(\.\d+)?', bf).group()) * 10000
        # print(num)
        bf = num
    else:
        bf = re.search(r'\d+(\.\d+)?', bf).group()
    play.append(float(bf))
print(play)

# ***** comment counts
for tag in soup.find_all('div', class_='detail'):
    # pl = tag.span.next_sibling.next_sibling
    pl = tag.find('span', class_='data-box').next_sibling.next_sibling.get_text()
    # normalize the unit: plain counts are converted to 万
    if '万' not in pl:
        pl = '%.1f' % (float(pl) / 10000)
        # print(123, pl)
    else:
        pl = re.search(r'\d+(\.\d+)?', pl).group()
    review.append(float(pl))
print(review)

# ***** favorite counts
for tag in soup.find_all('div', class_='detail'):
    sc = tag.find('span', class_='data-box').next_sibling.next_sibling.next_sibling.next_sibling.get_text()
    sc = re.search(r'\d+(\.\d+)?', sc).group()
    favorite.append(float(sc))
print(favorite)

# ***** overall score
for tag in soup.find_all('div', class_='pts'):
    zh = tag.find('div').get_text()
    TScore.append(int(zh))
print('综合评分', TScore)

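The .next_sibling attribute moves from a node to the next node at the same level of the tree. find() stops at the first matching 'span', so .next_sibling is chained to walk on to the spans that follow it. Text nodes, including the whitespace between tags, also count as siblings, which is typically why two hops are needed to get from one span to the next (stack as many hops as the page structure requires).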
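
To make the sibling hops concrete, here is a small sketch that runs on a made-up HTML fragment. The SAMPLE markup is an assumption that only imitates the shape of one ranking entry; the real page may be structured differently. It compares chained .next_sibling with two alternatives that BeautifulSoup also provides, find_next_sibling() and find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one ranking entry; not copied from the real page.
SAMPLE = """
<div class="detail">
  <span class="data-box">1.2亿</span>
  <span class="data-box">3.4万</span>
  <span class="data-box">56.7万</span>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
detail = soup.find("div", class_="detail")
first = detail.find("span", class_="data-box")

# One hop lands on the whitespace text node that sits between the two tags...
print(repr(first.next_sibling))
# ...so a second hop is needed to reach the next <span>.
print(first.next_sibling.next_sibling.get_text())

# find_next_sibling() skips text nodes and returns the next matching tag.
second = first.find_next_sibling("span", class_="data-box")
print(second.get_text())

# find_all() returns every matching <span> inside the div, in document order.
values = [span.get_text() for span in detail.find_all("span", class_="data-box")]
print(values)
```

Collecting all spans with find_all() and indexing into the list tends to be less fragile than counting hops, because it does not depend on how much whitespace the page contains.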

We also use regular expressions to pull the numbers out of the text (this requires importing the re library).
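
The unit handling can be gathered into one helper. The following is a minimal sketch, assuming the counts appear as text such as '1.2亿', '345.6万', or a plain number like '9876'; the function name to_wan is hypothetical and does not appear in the original code.

```python
import re

# One or more digits, optionally followed by a decimal part.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def to_wan(text: str) -> float:
    """Convert a count such as '1.2亿', '345.6万' or '9876' to units of 万."""
    match = _NUMBER.search(text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    value = float(match.group())
    if "亿" in text:
        return value * 10000  # 1 亿 = 10,000 万
    if "万" in text:
        return value  # already in 万
    return value / 10000  # plain count, convert to 万


if __name__ == "__main__":
    for sample in ["1.2亿", "345.6万", "9876"]:
        print(sample, "->", to_wan(sample))
```

Using \d+ for the integer part matters: a pattern that allows only a single leading digit would cut a value such as '12.3亿' short.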

Finally, we save the extracted data to an Excel spreadsheet and return the result lists.

import pandas

# save to an Excel spreadsheet
info = {'动漫名': name, '播放量(万)': play, '评论数(万)': review,
        '收藏数(万)': favorite, '综合评分': TScore}
dm_file = pandas.DataFrame(info)
dm_file.to_excel('Dongman.xlsx', sheet_name="动漫数据分析")
# return all the lists (this line sits inside the save() function in the full code)
return name, play, review, favorite, TScore
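
pandas.DataFrame raises a ValueError when the lists in the dictionary have different lengths, which can happen if one of the loops above skips or misses an entry. A small check before building the table gives a clearer message. This is a sketch only; the helper name check_columns is an assumption, not part of the original code.

```python
def check_columns(columns):
    """Return the shared row count, or raise ValueError if the lengths differ."""
    lengths = {key: len(values) for key, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    # All lengths are equal here; an empty dict counts as zero rows.
    return next(iter(lengths.values()), 0)


if __name__ == "__main__":
    demo = {"name": ["A", "B"], "play": [1.0, 2.0], "score": [10, 9]}
    print("rows:", check_columns(demo))
```

Calling check_columns(info) just before pandas.DataFrame(info) shows which column came up short instead of a generic length error.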

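We can open the file to check the format of the stored data (double-click to open it).

Success!

3. Data visualization

First, some basic setup.
You need a font file, STHeiti Medium.ttc [mind where it is placed in the project].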

import matplotlib.pyplot as plt
from matplotlib import font_manager

# Chinese font so that the chart can display Chinese text
my_font = font_manager.FontProperties(fname='./data/STHeiti Medium.ttc')
# so that the axes can display Chinese text
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# info is the tuple of lists returned by the parsing step
dm_name = info[0]  # series titles
dm_play = info[1]  # play counts
dm_review = info[2]  # comment counts
dm_favorite = info[3]  # favorite counts
dm_com_score = info[4]  # overall scores
# print(dm_com_score)

Then we use matplotlib to draw the charts and carry out the visual analysis.
The code is commented in detail, so I won't repeat it here; you will follow it at a glance~

# ***** overall score vs. play count
# bar chart of the overall score
fig, ax1 = plt.subplots()
plt.bar(dm_name, dm_com_score, color='red')  # draw the bars
plt.title('综合评分和播放量数据分析', fontproperties=my_font)  # chart title
ax1.tick_params(labelsize=6)
plt.xlabel('番剧名')  # x-axis label
plt.ylabel('综合评分')  # y-axis label
plt.xticks(rotation=90, color='green')  # rotation and color of the x tick labels
# line chart of the play count
ax2 = ax1.twinx()  # required for a combined chart: second y-axis on the same x-axis
ax2.plot(dm_play, color='cyan')  # draw the line
plt.ylabel('播放量')  # label of the right-hand y-axis
plt.plot(1, label='综合评分', color="red", linewidth=5.0)  # legend entry
plt.plot(1, label='播放量', color="cyan", linewidth=1.0, linestyle="-")  # legend entry
plt.legend()
plt.savefig(r'E:1.png', dpi=1000, bbox_inches='tight')  # save to disk
plt.show()

Let's look at the result.

Doesn't it instantly look a lot more polished? (hehe~)

Then we use the same method to draw a few more comparison charts:

# ***** comment count vs. favorite count
# bar chart of the comment count
fig, ax3 = plt.subplots()
plt.bar(dm_name, dm_review, color='green')
plt.title('番剧评论数和收藏数分析')
plt.ylabel('评论数（万）')
ax3.tick_params(labelsize=6)
plt.xticks(rotation=90, color='green')
# line chart of the favorite count
ax4 = ax3.twinx()  # required for a combined chart
ax4.plot(dm_favorite, color='yellow')  # draw the line
plt.ylabel('收藏数（万）')
plt.plot(1, label='评论数', color="green", linewidth=5.0)
plt.plot(1, label='收藏数', color="yellow", linewidth=1.0, linestyle="-")
plt.legend()
plt.savefig(r'E:2.png', dpi=1000, bbox_inches='tight')

# ***** overall score vs. favorite count
# bar chart of the overall score
fig, ax5 = plt.subplots()
plt.bar(dm_name, dm_com_score, color='red')
plt.title('综合评分和收藏数量数据分析')
plt.ylabel('综合评分')
ax5.tick_params(labelsize=6)
plt.xticks(rotation=90, color='green')
# line chart of the favorite count
ax6 = ax5.twinx()  # required for a combined chart
ax6.plot(dm_favorite, color='yellow')  # draw the line
plt.ylabel('收藏数（万）')
plt.plot(1, label='综合评分', color="red", linewidth=5.0)
plt.plot(1, label='收藏数', color="yellow", linewidth=1.0, linestyle="-")
plt.legend()
plt.savefig(r'E:3.png', dpi=1000, bbox_inches='tight')

# ***** play count vs. comment count
# bar chart of the play count
fig, ax7 = plt.subplots()
plt.bar(dm_name, dm_play, color='cyan')
plt.title('播放量和评论数数据分析')
plt.ylabel('播放量（万）')
ax7.tick_params(labelsize=6)
plt.xticks(rotation=90, color='green')
# line chart of the comment count
ax8 = ax7.twinx()  # required for a combined chart
ax8.plot(dm_review, color='green')  # draw the line
plt.ylabel('评论数（万）')
plt.plot(1, label='播放量', color="cyan", linewidth=5.0)
plt.plot(1, label='评论数', color="green", linewidth=1.0, linestyle="-")
plt.legend()
plt.savefig(r'E:4.png', dpi=1000, bbox_inches='tight')
plt.show()
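
Since the four charts differ only in their data, labels, and colors, the repeated steps could be folded into one function. The sketch below is one possible way to do that, not the author's code: the function name bar_line_chart, its parameters, and the demo data are all assumptions, and English labels are used so that it runs without a Chinese font. It builds the legend from the actual bar and line objects rather than from extra one-point plots.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; remove to use plt.show()
import matplotlib.pyplot as plt


def bar_line_chart(names, bar_values, line_values, bar_label, line_label,
                   title, bar_color="red", line_color="cyan"):
    """Draw bar_values as bars and line_values as a line on a twin y-axis."""
    fig, ax_bar = plt.subplots()
    bars = ax_bar.bar(names, bar_values, color=bar_color, label=bar_label)
    ax_bar.set_title(title)
    ax_bar.set_ylabel(bar_label)
    ax_bar.tick_params(axis="x", labelsize=6, labelrotation=90)

    ax_line = ax_bar.twinx()  # second y-axis that shares the same x-axis
    (line,) = ax_line.plot(names, line_values, color=line_color, label=line_label)
    ax_line.set_ylabel(line_label)

    # Build the legend from the real artists instead of plotting dummy points.
    ax_bar.legend([bars, line], [bar_label, line_label])
    return fig


# Made-up demo data, only to show the call.
fig = bar_line_chart(["A", "B", "C"], [3, 2, 1], [10.0, 20.0, 15.0],
                     "score", "plays", "score vs plays")
```

Each comparison then becomes a single call followed by fig.savefig(...), for example bar_line_chart(dm_name, dm_review, dm_favorite, ...) for comments against favorites.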

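Let's look at the final result.

Nice! It looks great~ You can combine the data in whatever way you like and analyze it with the same method.

Finally, here is the complete code.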

import re

import pandas
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from matplotlib import font_manager


def get_html(url):
    try:
        r = requests.get(url, timeout=10)  # fetch the page with a GET request
        r.raise_for_status()  # raise an exception for 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # use the encoding detected from the content
        return r.text  # return the page text
    except requests.RequestException:
        return '错误'  # sentinel string meaning "error"


def save(html):
    # parse the page
    soup = BeautifulSoup(html, 'html.parser')  # use the "html.parser" parser
    # 'w' creates the file if it is missing; the ./data directory must already exist
    with open('./data/B_data.txt', 'w', encoding='UTF-8') as f:
        f.write(soup.text)
    # lists that will hold the extracted data
    TScore = []  # overall score
    name = []  # series title
    bfl = []  # play count
    pls = []  # comment count
    scs = []  # favorite count
    # ***** series titles
    for tag in soup.find_all('div', class_='info'):
        # print(tag)
        bf = tag.a.string
        name.append(str(bf))
    print(name)
    # ***** play counts
    for tag in soup.find_all('div', class_='detail'):
        # print(tag)
        bf = tag.find('span', class_='data-box').get_text()
        # normalize the unit to 万 (ten thousand)
        if '亿' in bf:
            # 1 亿 = 10,000 万
            num = float(re.search(r'\d+(\.\d+)?', bf).group()) * 10000
            # print(num)
            bf = num
        else:
            bf = re.search(r'\d+(\.\d+)?', bf).group()
        bfl.append(float(bf))
    print(bfl)
    # ***** comment counts
    for tag in soup.find_all('div', class_='detail'):
        # pl = tag.span.next_sibling.next_sibling
        pl = tag.find('span', class_='data-box').next_sibling.next_sibling.get_text()
        # normalize the unit: plain counts are converted to 万
        if '万' not in pl:
            pl = '%.1f' % (float(pl) / 10000)
            # print(123, pl)
        else:
            pl = re.search(r'\d+(\.\d+)?', pl).group()
        pls.append(float(pl))
    print(pls)
    # ***** favorite counts
    for tag in soup.find_all('div', class_='detail'):
        sc = tag.find('span', class_='data-box').next_sibling.next_sibling.next_sibling.next_sibling.get_text()
        sc = re.search(r'\d+(\.\d+)?', sc).group()
        scs.append(float(sc))
    print(scs)
    # ***** overall score
    for tag in soup.find_all('div', class_='pts'):
        zh = tag.find('div').get_text()
        TScore.append(int(zh))
    print('综合评分', TScore)
    # save to an Excel spreadsheet
    info = {'动漫名': name, '播放量(万)': bfl, '评论数(万)': pls, '收藏数(万)': scs, '综合评分': TScore}
    dm_file = pandas.DataFrame(info)
    dm_file.to_excel('Dongman.xlsx', sheet_name="动漫数据分析")
    # return all the lists
    return name, bfl, pls, scs, TScore


def view(info):
    # Chinese font so that the chart can display Chinese text
    my_font = font_manager.FontProperties(fname='./data/STHeiti Medium.ttc')
    dm_name = info[0]  # series titles
    dm_play = info[1]  # play counts
    dm_review = info[2]  # comment counts
    dm_favorite = info[3]  # favorite counts
    dm_com_score = info[4]  # overall scores
    # print(dm_com_score)
    # so that the axes can display Chinese text
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False

    # ***** overall score vs. play count
    # bar chart of the overall score
    fig, ax1 = plt.subplots()
    plt.bar(dm_name, dm_com_score, color='red')  # draw the bars
    plt.title('综合评分和播放量数据分析', fontproperties=my_font)  # chart title
    ax1.tick_params(labelsize=6)
    plt.xlabel('番剧名')  # x-axis label
    plt.ylabel('综合评分')  # y-axis label
    plt.xticks(rotation=90, color='green')  # rotation and color of the x tick labels
    # line chart of the play count
    ax2 = ax1.twinx()  # required for a combined chart
    ax2.plot(dm_play, color='cyan')  # draw the line
    plt.ylabel('播放量')  # label of the right-hand y-axis
    plt.plot(1, label='综合评分', color="red", linewidth=5.0)  # legend entry
    plt.plot(1, label='播放量', color="cyan", linewidth=1.0, linestyle="-")  # legend entry
    plt.legend()
    plt.savefig(r'E:1.png', dpi=1000, bbox_inches='tight')  # save to disk
    # plt.show()

    # ***** comment count vs. favorite count
    # bar chart of the comment count
    fig, ax3 = plt.subplots()
    plt.bar(dm_name, dm_review, color='green')
    plt.title('番剧评论数和收藏数分析')
    plt.ylabel('评论数（万）')
    ax3.tick_params(labelsize=6)
    plt.xticks(rotation=90, color='green')
    # line chart of the favorite count
    ax4 = ax3.twinx()  # required for a combined chart
    ax4.plot(dm_favorite, color='yellow')  # draw the line
    plt.ylabel('收藏数（万）')
    plt.plot(1, label='评论数', color="green", linewidth=5.0)
    plt.plot(1, label='收藏数', color="yellow", linewidth=1.0, linestyle="-")
    plt.legend()
    plt.savefig(r'E:2.png', dpi=1000, bbox_inches='tight')

    # ***** overall score vs. favorite count
    # bar chart of the overall score
    fig, ax5 = plt.subplots()
    plt.bar(dm_name, dm_com_score, color='red')
    plt.title('综合评分和收藏数量数据分析')
    plt.ylabel('综合评分')
    ax5.tick_params(labelsize=6)
    plt.xticks(rotation=90, color='green')
    # line chart of the favorite count
    ax6 = ax5.twinx()  # required for a combined chart
    ax6.plot(dm_favorite, color='yellow')  # draw the line
    plt.ylabel('收藏数（万）')
    plt.plot(1, label='综合评分', color="red", linewidth=5.0)
    plt.plot(1, label='收藏数', color="yellow", linewidth=1.0, linestyle="-")
    plt.legend()
    plt.savefig(r'E:3.png', dpi=1000, bbox_inches='tight')

    # ***** play count vs. comment count
    # bar chart of the play count
    fig, ax7 = plt.subplots()
    plt.bar(dm_name, dm_play, color='cyan')
    plt.title('播放量和评论数数据分析')
    plt.ylabel('播放量（万）')
    ax7.tick_params(labelsize=6)
    plt.xticks(rotation=90, color='green')
    # line chart of the comment count
    ax8 = ax7.twinx()  # required for a combined chart
    ax8.plot(dm_review, color='green')  # draw the line
    plt.ylabel('评论数（万）')
    plt.plot(1, label='播放量', color="cyan", linewidth=5.0)
    plt.plot(1, label='评论数', color="green", linewidth=1.0, linestyle="-")
    plt.legend()
    plt.savefig(r'E:4.png', dpi=1000, bbox_inches='tight')
    plt.show()


def main():
    url = 'https://www.bilibili.com/v/popular/rank/bangumi'  # page URL
    html = get_html(url)  # fetch the page
    # print(html)
    info = save(html)
    view(info)


if __name__ == '__main__':
    main()

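As for analyzing the charts and drawing conclusions, I won't go into that here. There are a thousand Hamlets in a thousand readers' eyes: everyone has their own way of analyzing and describing data, and I trust you will come up with deeper insights of your own.

That is all for this walkthrough of web scraping and data visualization. I hope it helps!

The source files are available on GitHub: https://github.com/Lemon-Sheep/Py/tree/master

If you enjoyed it, remember to give it a like~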
