Building a Homemade Information Retrieval Site (Part 2): Analyzing Juejin Data

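Today we continue with the second step of building a homemade information retrieval site: a quick analysis of the Juejin data. Starting from the data collected in step one, we do some simple cleaning and visualization to get a picture of Juejin.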

This time I work in a Jupyter notebook. The libraries used are pymongo, to connect to the MongoDB database; jieba, for word segmentation; and pyecharts, for data visualization.

Juejin article length

This part gathers length statistics for all the crawled Juejin articles. First, compute the length of every article:

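# Analyze the distribution of article lengths
# `collection` is the pymongo collection filled by the crawler in Part 1
all_items = collection.find({})
all_content_length = [len(item['content']) for item in all_items]
# The original called all_items.count(); Cursor.count() was removed in PyMongo 4,
# so count the collected lengths instead
print(len(all_content_length))
# Output: 2996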

Then classify the articles by length as large (>= 10000 characters), middle (5000-10000 characters), small (1000-5000 characters), or x-small (0-1000 characters):

content_length_distribution = {
    'large': 0,
    'middle': 0,
    'small': 0,
    'x-small': 0,
}
for item_content_length in all_content_length:
    if item_content_length >= 10000:
        content_length_distribution['large'] += 1
    elif 5000 <= item_content_length < 10000:
        content_length_distribution['middle'] += 1
    elif 1000 <= item_content_length < 5000:
        content_length_distribution['small'] += 1
    else:
        content_length_distribution['x-small'] += 1
print(content_length_distribution)
# Output: {'small': 1707, 'middle': 440, 'large': 175, 'x-small': 674}
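The same bucketing can also be written with a sorted list of thresholds and a binary search, which keeps the boundaries in one place. This is only a sketch: the names `BOUNDS`, `LABELS`, `classify_length`, and `length_distribution` are made up for illustration, and the sample lengths are invented, not taken from the crawl.

```python
from bisect import bisect_right
from collections import Counter

# Lower bounds of the buckets in ascending order (same thresholds as above).
BOUNDS = [1000, 5000, 10000]
LABELS = ['x-small', 'small', 'middle', 'large']


def classify_length(length):
    """Return the bucket label for an article length in characters."""
    # bisect_right counts how many bounds are <= length,
    # which is exactly the index of the matching label.
    return LABELS[bisect_right(BOUNDS, length)]


def length_distribution(lengths):
    """Count how many lengths fall into each bucket."""
    counts = Counter(classify_length(n) for n in lengths)
    # Report every label, even when its count is zero.
    return {label: counts.get(label, 0) for label in LABELS}


# Invented sample data that sits on and around each boundary.
sample_lengths = [0, 999, 1000, 4999, 5000, 9999, 10000, 25000]
print(length_distribution(sample_lengths))
```

Adding or moving a bucket then only means editing the two lists, instead of touching a chain of `elif` branches.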

Plot it with pyecharts:

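# pyecharts 0.5.x-style API, as used in the original notebook;
# pyecharts 1.x changed the import paths and the chart-building interface
from pyecharts import Pie

pie = Pie("Juejin article length distribution", "Unit: characters")
pie.add("Characters", ["x-small", "large", "middle", "small"], [674, 175, 440, 1707], is_more_utils=True)
pie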

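As the chart shows, most Juejin articles fall in the 1000-5000 character range (1707 of 2996, about 57%). Next, let's look at how article lengths are distributed within 1000-5000 characters.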

from pyecharts import Bar

# Distribution of article lengths between 1000 and 5000 characters
small_content_length_distribution = {
    '1000-2000': 0,
    '2000-3000': 0,
    '3000-4000': 0,
    '4000-5000': 0,
}
for item_content_length in all_content_length:
    if 1000 <= item_content_length < 2000:
        small_content_length_distribution['1000-2000'] += 1
    elif 2000 <= item_content_length < 3000:
        small_content_length_distribution['2000-3000'] += 1
    elif 3000 <= item_content_length < 4000:
        small_content_length_distribution['3000-4000'] += 1
    elif 4000 <= item_content_length < 5000:
        small_content_length_distribution['4000-5000'] += 1
print(small_content_length_distribution)
# Output: {'3000-4000': 347, '4000-5000': 238, '2000-3000': 495, '1000-2000': 627}

bar = Bar('Article lengths between 1000 and 5000 characters', "Unit: characters")
labels = list(small_content_length_distribution.keys())
values = list(small_content_length_distribution.values())
bar.add('Characters', labels, values, is_more_utils=True)
bar

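Within 1000-5000 characters the articles are spread across all four bins without a sharp spike, although the count falls steadily as the length grows (627, 495, 347, 238).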
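When the bins all have the same width, the bin can be derived from the length by integer division instead of listing each range by hand. The sketch below assumes that; `bin_label` and `binned_counts` are hypothetical helper names, and the sample lengths are invented.

```python
from collections import Counter


def bin_label(length, width=1000):
    """Return a label such as '2000-3000' for the bin containing length."""
    start = (length // width) * width
    return f'{start}-{start + width}'


def binned_counts(lengths, low=1000, high=5000, width=1000):
    """Count lengths in [low, high) per fixed-width bin.

    Returns (labels, values) in ascending bin order, ready to pass
    to a bar chart. Empty bins are reported with a count of zero.
    """
    counts = Counter(bin_label(n, width) for n in lengths if low <= n < high)
    labels = [f'{s}-{s + width}' for s in range(low, high, width)]
    return labels, [counts.get(label, 0) for label in labels]


# Invented sample data; 800 and 5000 fall outside [1000, 5000).
sample_lengths = [1200, 1999, 2000, 3500, 4999, 5000, 800]
labels, values = binned_counts(sample_lengths)
print(labels, values)
```

Because the labels are generated in ascending order, the bars come out sorted regardless of dictionary ordering.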

Juejin tags

Most Juejin articles come with tags, so let's see which tags appear most often.

# Analyze the tags of all articles
all_items = collection.find({})
# Collect every tag
all_tags = []
for item in all_items:
    all_tags += item['tags']
print(all_tags[:20])
# Output: ['Javascript', '前端', '微信小程序', 'RxJS', '微信', 'Javascript', '设计', '微信小程序', '前端', '微信', '微信小程序', '微信小程序', '前端', '微信', '微信小程序', '前端', '微信', '微信小程序', 'Javascript', '前端']

all_tags_set = set(all_tags)
print(len(all_tags_set))
all_tags_distribution = {}
for set_item in all_tags_set:
    all_tags_distribution[set_item] = 0
for tag_item in all_tags:
    all_tags_distribution[tag_item] += 1

from pyecharts import WordCloud

word_cloud = WordCloud('Distribution of Juejin article tags', '')
labels = list(all_tags_distribution.keys())
values = list(all_tags_distribution.values())
word_cloud.add('Count', labels, values)
word_cloud

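The word cloud shows that the tags are mostly tied to front-end development and artificial intelligence, the topics that have been popular over the past two years.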
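To back up a reading like this with numbers rather than the word cloud alone, the tag counts can be ranked with `collections.Counter`. This is a sketch on invented in-memory documents; `tag_counts` is a hypothetical helper, and it assumes each document may carry a `tags` list, as the crawled ones do.

```python
from collections import Counter


def tag_counts(articles):
    """Count how often each tag appears across a list of article dicts."""
    counter = Counter()
    for article in articles:
        # Articles without a 'tags' field simply contribute nothing.
        counter.update(article.get('tags', []))
    return counter


# Invented sample documents, shaped like the crawled ones.
sample_articles = [
    {'tags': ['Javascript', '前端']},
    {'tags': ['前端', '微信小程序']},
    {'tags': ['前端', 'Javascript']},
    {},
]
top_tags = tag_counts(sample_articles).most_common(2)
print(top_tags)
```

`most_common(n)` returns `(tag, count)` pairs in descending order, so the labels and values for a chart can be unpacked with `zip(*top_tags)`.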

Juejin articles per year

What trend does the yearly number of Juejin articles follow? Let's take a look.

# all_created_date: the creation-date strings ('YYYY-MM-...') of all articles,
# collected in an earlier cell that the original post does not show
year_list = []
for item_create_date in all_created_date:
    year_list.append(item_create_date.split('-')[0])
# The original sorted on the last digit only (int(x[3])); sort on the whole year
year_list = sorted(year_list, key=int)
year_set = set(year_list)
print(year_set)

all_year_distribution = {}
for set_item in year_set:
    all_year_distribution[set_item] = 0
for list_item in year_list:
    all_year_distribution[list_item] += 1
print(all_year_distribution)

from pyecharts import Line

line = Line('Juejin articles per year', '')
# The values below are the counts hard-coded in the original notebook
line.add('Articles', ['2015', '2016', '2017', '2018'], [4, 327, 1162, 1503], is_more_utils=True)
line

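As the chart shows, Juejin has grown every year in this data set, and more and more people are choosing it. Of course, the data I crawled is limited, so the result is for reference only. Next let's look at last year, 2017, and its article count per month.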
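The upward trend can be quantified as a year-over-year ratio. The sketch below uses the four yearly counts from the chart above; `year_over_year` is a hypothetical helper, and since the crawl is only a sample, the ratios describe the crawled articles, not Juejin as a whole.

```python
def year_over_year(counts_by_year):
    """Return {year: count / previous year's count} for consecutive years.

    Years are compared as four-digit strings, so lexicographic
    order matches chronological order.
    """
    years = sorted(counts_by_year)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        if counts_by_year[prev] == 0:
            continue  # ratio is undefined when the previous year had no articles
        changes[curr] = counts_by_year[curr] / counts_by_year[prev]
    return changes


# Yearly article counts taken from the chart above.
yearly = {'2015': 4, '2016': 327, '2017': 1162, '2018': 1503}
for year, ratio in year_over_year(yearly).items():
    print(f'{year}: {ratio:.2f}x the previous year')
```

The ratio shrinks from year to year in this sample, so the growth is real but slowing, which is worth keeping in mind when reading the line chart.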

# Distribution of articles per month in 2017
from pyecharts import Line

month_list = []
for item_created_date in all_created_date:
    if item_created_date.split('-')[0] == '2017':
        month_list.append(int(item_created_date.split('-')[1]))
month_list.sort()
month_set = set(month_list)

month_distribution = {}
for set_item in month_set:
    month_distribution[set_item] = 0
for list_item in month_list:
    month_distribution[list_item] += 1
print(month_distribution)
# Output: {1: 60, 2: 15, 3: 17, 4: 25, 5: 41, 6: 23, 7: 29, 8: 51, 9: 64, 10: 137, 11: 200, 12: 500}

line = Line('Juejin articles per month in 2017', '')
labels = list(month_distribution.keys())
values = list(month_distribution.values())
line.add('Articles', labels, values, is_more_utils=True)
line
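One fragile point in a per-month count is that a month with no articles never becomes a key, so the line chart would silently skip it. A minimal sketch that always returns twelve values follows; `monthly_counts` is a hypothetical helper, the date strings are invented, and it assumes the same 'YYYY-MM-...' format that the `split('-')` calls above rely on.

```python
from collections import Counter


def monthly_counts(created_dates, year):
    """Return a 12-item list with the number of articles per month of `year`.

    Months without any article are reported as 0 instead of being omitted.
    """
    counts = Counter()
    for created in created_dates:
        parts = created.split('-')
        if parts[0] == str(year):
            counts[int(parts[1])] += 1
    return [counts.get(month, 0) for month in range(1, 13)]


# Invented sample dates in the assumed 'YYYY-MM-DD' format.
sample_dates = ['2017-01-05', '2017-01-20', '2017-12-31', '2018-02-01', '2016-07-07']
per_month = monthly_counts(sample_dates, 2017)
print(per_month)
```

The x-axis labels can then simply be `list(range(1, 13))`.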

Views and collections of Juejin articles

Next, let's see which articles are the stars of Juejin. First, the top 50 articles by view count.

# Top 50 by view count
import pymongo
from pyecharts import Bar

# The original post does not show this query; it is reconstructed by analogy
# with the collection-count and comment-count queries later in the notebook
sort_by_views_count = collection.find().sort('views_count', pymongo.DESCENDING)

views_count_distribution = {}
for item in sort_by_views_count[:50]:
    views_count_distribution[item['title']] = item['views_count']
print(views_count_distribution)
# Output:
# {'编写自己的代码库（Javascript常用实例的实现与封装）': 16254,
#  '未来的前端工程师': 11002,
#  '你敢在post和get上刁难我，就别怪我装逼了': 14472,
#  '送给前端开发者的一份新年礼物': 12775,
#  'B站的前端之路': 17492,
#  '2018前端值得关注的技术': 23680,
#  '微信小游戏跳一跳外挂辅助程序': 12887,
#  '如何优雅地使用 Git': 11595,
#  'AI 系统首次实现真正自主编程，完爆初级程序员': 28363,
#  '首个微信小程序开发教程！': 125928,
#  '面试过阿里等互联网大公司，我知道了这些套路 | 掘金技术征文': 18862,
#  '鹿晗关晓彤公开恋情，是如何把微博服务器搞炸的？': 21196,
#  '某小公司RESTful、共用接口、前后端分离、接口约定的实践': 11735,
#  '这一次，彻底弄懂 Javascript 执行机制': 24905,
#  '打造自己的Javascript武器库': 12927,
#  '100+ 超全的 web 开发工具和资源': 10838,
#  'Javascript专题系列20篇正式完结！': 20346,
#  'iView 发布后台管理系统 iview-admin，没错，它就是你想要的': 14191,
#  '2018 我所了解的 Vue 知识大全（一）': 11840,
#  '[译] React、Jest、Flow 和 Immutable.js 将使用 MIT 许可证': 27578,
#  '个人总结（css3新特性）': 12321,
#  '教你用Python来玩微信跳一跳': 22652,
#  '2017下半年掘金日报优质文章合集：前端篇': 16820,
#  '[译] 2017 年比较 Angular、React、Vue 三剑客 ': 15398,
#  '个人分享--web前端学习资源分享': 19548,
#  '如何无痛降低 if else 面条代码复杂度': 19974,
#  '技术胖155集前端视频教程-全部免费观看': 19691,
#  '关于IT培训机构的个人看法': 18355,
#  'JS维护nginx反向代理，妈妈再也不用担心我跨域了！': 10739,
#  '能让你开发效率翻倍的 VSCode 插件配置（上）': 11849,
#  '手摸手，带你优雅的使用 icon': 11728,
#  'Vue 脱坑记 - 查漏补缺(汇总下群里高频询问的xxx及给出不靠谱的解决方案)': 26215,
#  '前端入行两年--教会了我这些道理': 12649,
#  '反击爬虫，前端工程师的脑洞可以有多大？': 11503,
#  '永久免费！吴恩达刚公布的深度学习课程上线网易云课堂': 15603,
#  '2018 Web 开发者最佳学习路线': 11105,
#  '别再拿奇技淫巧搬砖了': 17900,
#  'GitHub 排名前 100 的安卓、iOS 项目简介': 14091,
#  '源码圈 365 胖友的书单整理': 67221,
#  '妈妈再也不用担心我不会webpack了': 15861}

bar = Bar('Top 50 Juejin articles by views', '')
labels = list(views_count_distribution.keys())
values = list(views_count_distribution.values())
bar.add('Views', labels, values, is_more_utils=True)
bar

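Because the titles are too long, the chart does not display well here; if you are interested, you can download the notebook from GitHub and take a look. Next, the collection counts: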
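One possible workaround for the long-title problem is to shorten each title before using it as an axis label. The sketch below is an assumption about what might help, not something the original notebook does; `shorten` and `chart_labels` are hypothetical helpers.

```python
def shorten(title, max_chars=12, suffix='...'):
    """Return title cut to at most max_chars characters, ending in suffix if cut."""
    title = title.strip()
    if len(title) <= max_chars:
        return title
    return title[:max_chars - len(suffix)] + suffix


def chart_labels(pairs, max_chars=12):
    """Turn (title, count) pairs into parallel label and value lists.

    Lists are used instead of a dict because two shortened titles
    may collide, and a dict would silently drop one of them.
    """
    labels = [shorten(title, max_chars) for title, _ in pairs]
    values = [count for _, count in pairs]
    return labels, values


# Two titles and view counts taken from the output above.
pairs = [('如何优雅地使用 Git', 11595), ('这一次，彻底弄懂 Javascript 执行机制', 24905)]
labels, values = chart_labels(pairs)
print(labels, values)
```

The full title can still be shown elsewhere, for example in a table under the chart.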

# Top 50 by collection count
import pymongo
from pyecharts import Bar

sort_by_collection_count = collection.find().sort('collection_count', pymongo.DESCENDING)

collection_count_distribution = {}
for item in sort_by_collection_count[:50]:
    collection_count_distribution[item['title']] = item['collection_count']

bar = Bar('Top 50 Juejin articles by collections', '')
labels = list(collection_count_distribution.keys())
values = list(collection_count_distribution.values())
bar.add('Collections', labels, values, is_more_utils=True)
bar

And here are the comment counts:

# Top 50 by comment count
import pymongo
from pyecharts import Bar

sort_by_comments_count = collection.find().sort('comments_count', pymongo.DESCENDING)

comments_count_distribution = {}
for item in sort_by_comments_count[:50]:
    comments_count_distribution[item['title']] = item['comments_count']

bar = Bar('Top 50 Juejin articles by comments', '')
labels = list(comments_count_distribution.keys())
values = list(comments_count_distribution.values())
bar.add('Comments', labels, values, is_more_utils=True)
bar
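The three rankings differ only in the field they sort on, so the shared logic can be folded into one function. The sketch below works on an in-memory list of dicts rather than a MongoDB cursor so that it runs without a database; `top_n` and `chart_series` are hypothetical names and the sample documents are invented.

```python
import heapq


def top_n(articles, field, n=50):
    """Return the n articles with the largest value of `field`, largest first."""
    return heapq.nlargest(n, articles, key=lambda a: a.get(field, 0))


def chart_series(articles, field, n=50):
    """Return parallel (labels, values) lists for the top-n articles.

    Lists keep articles that share a title, whereas a dict keyed
    by title would keep only one of them.
    """
    ranked = top_n(articles, field, n)
    labels = [a['title'] for a in ranked]
    values = [a.get(field, 0) for a in ranked]
    return labels, values


# Invented sample documents; two of them share the same title.
sample = [
    {'title': 'A', 'views_count': 10, 'comments_count': 3},
    {'title': 'B', 'views_count': 30, 'comments_count': 1},
    {'title': 'A', 'views_count': 20, 'comments_count': 7},
    {'title': 'C', 'views_count': 5},
]
print(chart_series(sample, 'views_count', n=3))
print(chart_series(sample, 'comments_count', n=2))
```

With a real collection, the sorting would still be better left to MongoDB as in the notebook; only the label and value extraction needs to be shared.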

Juejin titles and content

Now let's use word segmentation to see roughly what Juejin titles and article bodies talk about. First, the titles:

# all_tokens_list: the tokens obtained by segmenting the article titles
# in an earlier cell that the original post does not show
all_tokens_set = set(all_tokens_list)
all_tokens_distribution = {}
for set_item in all_tokens_set:
    all_tokens_distribution[set_item] = 0
for token_item in all_tokens_list:
    all_tokens_distribution[token_item] += 1

from pyecharts import WordCloud

word_cloud = WordCloud('Juejin article title words', '')
labels = list(all_tokens_distribution.keys())
values = list(all_tokens_distribution.values())
word_cloud.add('Count', labels, values)
word_cloud

Next, the content:

# Analyze the article content
import jieba
from pyecharts import WordCloud

all_content = [item['content'] for item in collection.find({})]

# add_punc: the punctuation characters to skip, defined in an earlier cell
# that the original post does not show
content_tokens_list = []
for line in all_content:
    cuts = jieba.cut(line, cut_all=False)
    for cut in cuts:
        if cut not in add_punc:
            content_tokens_list.append(cut)
print(len(content_tokens_list))
# Output: 4806605

content_tokens_set = set(content_tokens_list)
print(len(content_tokens_set))
# Output: 80081

content_tokens_distribution = {}
for set_item in content_tokens_set:
    content_tokens_distribution[set_item] = 0
for token_item in content_tokens_list:
    content_tokens_distribution[token_item] += 1

# Sort ascending by count, then skip the 150 most frequent tokens
new_content_tokens_distribution = sorted(content_tokens_distribution.items(), key=lambda x: x[1])
word_cloud_data = new_content_tokens_distribution[-800:-150]

word_cloud = WordCloud('Juejin article content words', '')
labels = [label[0] for label in word_cloud_data]
values = [value[1] for value in word_cloud_data]
word_cloud.add('Count', labels, values)
word_cloud

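The article bodies contain far too many distinct words, and a large share of the most frequent ones are stop words that carry almost no meaning, so I only took a slice of the ranking for display, and even then the result is not great. Finally, because the crawled data is limited, all of these results are for reference only and should not be read as conclusive. The project source code is here: juejin_spider.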
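A more targeted alternative to slicing the ranking by position is to drop stop words explicitly before counting. The sketch below shows the idea on an invented token list; `STOP_WORDS` is a tiny illustrative set rather than a real stop-word list, and `top_tokens` is a hypothetical helper. In practice the tokens would come from `jieba.cut` and the stop words from a maintained list.

```python
from collections import Counter

# Tiny illustrative stop-word set; a real list would be much longer.
STOP_WORDS = {'的', '了', '是', '在', '和'}


def top_tokens(tokens, stop_words=STOP_WORDS, n=650):
    """Return the n most frequent tokens as (token, count) pairs.

    Whitespace-only tokens and stop words are ignored.
    """
    counts = Counter(
        token for token in tokens
        if token.strip() and token not in stop_words
    )
    return counts.most_common(n)


# Invented tokens standing in for segmented article text.
sample_tokens = ['前端', '的', '框架', '是', '前端', ' ', '框架', '前端', '的']
print(top_tokens(sample_tokens, n=2))
```

Filtering this way keeps frequent but meaningful words that a positional slice such as `[-800:-150]` would throw away together with the stop words.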
