
Scraping the Maoyan Movies Top 100 with regular expressions

Maoyan Movies also provides real-time box office data; I will play with that some other time.


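Ranking rule: the classic films in the Maoyan film library are ranked from highest to lowest by a combination of rating and number of raters, and the top 100 are taken. The list is updated every day at 10 a.m. The data comes from the "Maoyan film library".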




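Step 1: analyze the URL. There are 10 pages with 10 films each. Looking at the URLs:


http://maoyan.com/board/4?offset=0 The number at the end is the offset: it is 0 for the first page and increases by 10 for each following page.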

# Build the URLs of the 10 pages
base_url = 'http://maoyan.com/board/4?offset={}'
urls = []
for i in range(10):
    urls.append(base_url.format(10 * i))
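The same list can also be derived from the total number of films and the page size instead of hard-coded numbers. This is only a minimal sketch; `page_urls`, `total`, and `page_size` are names introduced here for illustration and are not part of the original script.

```python
BASE_URL = 'http://maoyan.com/board/4?offset={}'

def page_urls(total=100, page_size=10):
    """Return one URL per page; the offset starts at 0 and grows by page_size."""
    return [BASE_URL.format(offset) for offset in range(0, total, page_size)]

urls = page_urls()
print(len(urls), urls[0], urls[-1])
```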

Step 2: analyze a single page


Without headers the request is blocked; the site reports it as malicious access.

# Build headers
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36','COOKIE':'__mta=251008569.1536744988778.1536745503768.1536745544944.17; _lxsdk_cuid=165cd22f1a0c8-0ec18267a3a2f5-3c604504-1fa400-165cd22f1a0c8; uuid_n_v=v1; uuid=53511D10B66F11E894F593DFAB82C37F1716AA64EB0848C2BADBC75AA2E23EA6; _csrf=5f3373e2e85bd09c75c54ffbc624db254862d2ceb2dfbeff81eb9fa58e289001; __guid=17099173.3022114119644780000.1536744988414.424; _lx_utm=utm_source%3DBaidu%26utm_medium%3Dorganic; _lxsdk=53511D10B66F11E894F593DFAB82C37F1716AA64EB0848C2BADBC75AA2E23EA6; __mta=251008569.1536744988778.1536744999954.1536745003054.4; monitor_count=17; _lxsdk_s=165cd22f1a1-995-e6-849%7C%7C47'
}
def get_onepage(url):
    response = requests.get(url, headers=headers).text
    # The slice drops the first and last matches, presumably hits outside the ranking list
    index = re.findall('board-index.*?>(.*?)<', response, re.S)[1:-1]
    # NOTE: the HTML tags inside the name, star, date and img patterns were lost
    # when this page was extracted; those patterns are reconstructed assumptions.
    name = re.findall('<p class="name"><a.*?>(.*?)</a>', response, re.S)
    star = re.findall('<p class="star">(.*?)</p>', response, re.S)
    date = re.findall('releasetime">(.*?)</p>', response, re.S)
    img = re.findall('dd>.*?<img.*?<img.*?src="(.*?)"', response, re.S)
    score = re.findall('score.*?integer">(.*?)<.*?fraction">(.*?)<', response, re.S)
    # star and score need some post-processing
    stars = [i.split() for i in star]
    scores = [i + j for i, j in score]  # computed but not stored, see below
    for i in range(10):
        all_dict[index[i]] = {'index': index[i], 'name': name[i], 'star': stars[i], 'img': img[i], 'date': date[i]}

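In the end score is left out of the stored record. It worked the first time, but today the entry for My Neighbor Totoro (龙猫) unexpectedly had no score, so that page yields only 9 scores and indexing by position raises an error. Finding img was also a bit fiddly and took several attempts: during testing, the attributes of the img tag turned out to be in a different order in the fetched HTML than the order shown in the browser.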
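One way to sidestep both problems is to parse each `<dd>` entry on its own, so a missing score becomes `None` for that film only, and to find the poster `<img>` tag first and then read its attribute, so attribute order no longer matters. The sketch below is a swapped-in approach, not the author's code. The `SAMPLE` markup, the `board-img` class check, and the helper names `first` and `parse_items` are assumptions made for illustration.

```python
import re

# Hypothetical sample of two <dd> entries: the second has no score and its
# <img> attributes are in a different order. This markup is an assumption
# made for illustration, not a copy of the live page.
SAMPLE = '''
<dd>
<i class="board-index board-index-1">1</i>
<img data-src="http://example.com/a.jpg" alt="Film A" class="board-img" />
<p class="name"><a href="/films/1">Film A</a></p>
<p class="star">
    Starring: X,Y
</p>
<p class="releasetime">Release date: 1993-01-01</p>
<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>
<dd>
<i class="board-index board-index-2">2</i>
<img class="board-img" alt="Film B" data-src="http://example.com/b.jpg" />
<p class="name"><a href="/films/2">Film B</a></p>
<p class="star">
    Starring: Z
</p>
<p class="releasetime">Release date: 1988-04-16</p>
</dd>
'''

def first(pattern, text):
    """Return the first capture group of pattern in text, or None if absent."""
    m = re.search(pattern, text, re.S)
    return m.group(1).strip() if m else None

def parse_items(html):
    items = []
    # One chunk per <dd>, so fields cannot shift between films
    for block in re.findall(r'<dd>(.*?)</dd>', html, re.S):
        score = re.search(r'integer">(.*?)<.*?fraction">(.*?)<', block, re.S)
        # Pick the poster tag by class, then read src/data-src inside it,
        # regardless of where the attribute sits within the tag
        img = None
        for tag in re.findall(r'<img[^>]*>', block):
            if 'board-img' in tag:
                img = first(r'src="(.*?)"', tag)
        items.append({
            'index': first(r'board-index.*?>(.*?)<', block),
            'name': first(r'class="name"><a.*?>(.*?)</a>', block),
            'star': first(r'class="star">(.*?)</p>', block),
            'date': first(r'releasetime">(.*?)</p>', block),
            'img': img,
            'score': score.group(1) + score.group(2) if score else None,
        })
    return items

for item in parse_items(SAMPLE):
    print(item)
```

With this shape, a film without a score still produces a complete record, and the score can be stored again instead of being dropped.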


Finally, the complete code: it crawls the 10 pages, prints the data, writes it to a JSON file, and reads it back.

import requests, re
import json

base_url = 'http://maoyan.com/board/4?offset={}'
urls = []
for i in range(10):
    urls.append(base_url.format(10 * i))

# Build headers
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36','COOKIE':'__mta=251008569.1536744988778.1536745503768.1536745544944.17; _lxsdk_cuid=165cd22f1a0c8-0ec18267a3a2f5-3c604504-1fa400-165cd22f1a0c8; uuid_n_v=v1; uuid=53511D10B66F11E894F593DFAB82C37F1716AA64EB0848C2BADBC75AA2E23EA6; _csrf=5f3373e2e85bd09c75c54ffbc624db254862d2ceb2dfbeff81eb9fa58e289001; __guid=17099173.3022114119644780000.1536744988414.424; _lx_utm=utm_source%3DBaidu%26utm_medium%3Dorganic; _lxsdk=53511D10B66F11E894F593DFAB82C37F1716AA64EB0848C2BADBC75AA2E23EA6; __mta=251008569.1536744988778.1536744999954.1536745003054.4; monitor_count=17; _lxsdk_s=165cd22f1a1-995-e6-849%7C%7C47'
}
def get_onepage(url):
    response = requests.get(url, headers=headers).text
    # The slice drops the first and last matches, presumably hits outside the ranking list
    index = re.findall('board-index.*?>(.*?)<', response, re.S)[1:-1]
    # NOTE: the HTML tags inside the name, star, date and img patterns were lost
    # when this page was extracted; those patterns are reconstructed assumptions.
    name = re.findall('<p class="name"><a.*?>(.*?)</a>', response, re.S)
    star = re.findall('<p class="star">(.*?)</p>', response, re.S)
    date = re.findall('releasetime">(.*?)</p>', response, re.S)
    img = re.findall('dd>.*?<img.*?<img.*?src="(.*?)"', response, re.S)
    score = re.findall('score.*?integer">(.*?)<.*?fraction">(.*?)<', response, re.S)
    # star and score need some post-processing
    stars = [i.split() for i in star]
    scores = [i + j for i, j in score]  # computed but not stored
    for i in range(10):
        all_dict[index[i]] = {'index': index[i], 'name': name[i], 'star': stars[i], 'img': img[i], 'date': date[i]}

all_dict = {}
for i in urls:
    get_onepage(i)
for i in all_dict.items():
    print(i)
with open('maoyan.json', 'w', encoding='utf8') as f:
    json.dump(all_dict, f)
with open('maoyan.json', 'r', encoding='utf8') as f:
    print(json.load(f))

One last thing: when I open the JSON file, the Chinese text is not shown as UTF-8 characters. Is that the editor's doing?



Reading the file back in code works fine.



An online JSON formatter also parses it correctly.
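The escaped text is most likely not an editor problem. By default `json.dump` writes non-ASCII characters as `\uXXXX` escape sequences (`ensure_ascii=True`). That is still valid JSON, which would explain why both `json.load` and an online formatter decode it correctly. A minimal sketch of the difference, using a made-up one-entry dictionary rather than the scraped data:

```python
import json

data = {'1': {'name': '霸王别姬'}}  # made-up sample entry

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keep the characters as-is

print(escaped)
print(readable)

# Both forms decode back to the same object
assert json.loads(escaped) == json.loads(readable) == data
```

Passing `ensure_ascii=False` to the `json.dump(all_dict, f)` call, with the file still opened using `encoding='utf8'`, should make `maoyan.json` readable in an editor.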


