Author: 惬听风吟jyy_802 | Source: Internet | 2023-08-22 20:13
I Scraped 女神视界 with Python: the Scraping Road Never Ends (Source Code Included)
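I've noticed that plenty of girls on Douyin go viral just by posting a dance video. Are people really there for the dancing? No, they are there for the looks. If this article showed up in your feed, you are probably no exception. I do things a bit differently, though: swiping through videos one by one is too much hassle, so I scrape them all and watch as many as I like. Here are a couple of samples first.
[Image: two sample screenshots]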
Scraping target
Target site: 女神世界
Preview of the result
[Image: result screenshot]
Tools used
Environment: Python 3.7. IDE: PyCharm. Libraries: requests and pyquery (third-party), re (standard library).
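Scraping approach:
- What we download is the video data itself, as raw binary bytes.
- The listing page does not contain the video addresses. You have to open each detail page, so the scraping starts from the video playback page.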
Press F12 to open the browser's developer tools.
No rush. First find the video address, then search for it to see which response contains it.
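Tracking it down shows that the address comes back as part of the static page itself, so it sits directly in the HTML source.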
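Since the address is sitting in the page source, a plain regular expression is enough to pull it out. Here is a minimal sketch of that step on its own. The pattern is the one used in this article's code; the HTML fragment and the `extract_video_url` helper are made up for illustration and are not taken from the real site.

```python
import re
from typing import Optional

# Pattern from the article's code: the page contains  url: "<address>",
VIDEO_URL_PATTERN = re.compile(r'url: "(.*?)",')


def extract_video_url(html: str) -> Optional[str]:
    """Return the first video address found in the page source, or None."""
    match = VIDEO_URL_PATTERN.search(html)
    return match.group(1) if match else None


# Hypothetical page fragment, invented for this example only.
sample_html = '''
<script>
  var player = new Player({
    url: "https://example.com/videos/demo.mp4",
    autoplay: false
  });
</script>
'''

print(extract_video_url(sample_html))
print(extract_video_url("<html>no player here</html>"))
```

Returning None instead of indexing with [0] means a page without a video can be skipped rather than crashing the whole run with an IndexError.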
Now for the code:
import re
import requests


def Tools(url):  # helper function that wraps the HTTP request
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36 Edg/93.0.961.52'
    }
    response = requests.get(url, headers=headers)
    return response


url = 'https://www.520mmtv.com/9614.html'
response = Tools(url).text
video_url = re.findall(r'url: "(.*?)",', response)[0]  # extract the video address with a regular expression
video_content = Tools(video_url).content
# The video is saved under ./短视频, a folder you have to create by hand next to the script
# 'wb' overwrites on each run; append mode would corrupt the file if the script is run twice
with open('./短视频/123.mp4', 'wb') as f:
    f.write(video_content)
# One video downloaded
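The snippet above expects the 短视频 folder to exist already. If you would rather not create it by hand, something like the following sketch can do it for you. `save_video` is a hypothetical helper name, not part of the original code, and the bytes in the demo are fake stand-ins for a real download.

```python
import tempfile
from pathlib import Path
from typing import Union


def save_video(content: bytes, filename: str,
               out_dir: Union[str, Path] = "./短视频") -> Path:
    """Write content to out_dir/filename, creating out_dir if it is missing."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = folder / filename
    target.write_bytes(content)  # binary write, replaces any earlier file
    return target


# Demo in a throwaway directory, with fake bytes instead of a real video.
with tempfile.TemporaryDirectory() as tmp:
    path = save_video(b"fake video bytes", "123.mp4", out_dir=Path(tmp) / "短视频")
    print(path.name, path.exists())
```

In the article's code this would replace the `with open(...)` block, for example `save_video(video_content, '123.mp4')`.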
from pyquery import PyQuery as pq


def main():
    url = 'https://www.520mmtv.com/hd/rewu.html'
    response = Tools(url).text
    doc = pq(response)  # build a pyquery object; data is picked out with CSS class and id selectors
    # Class selector: where the class attribute contains spaces, replace each space with a dot
    i_list = doc('.i_list.list_n2.cxudy-list-formatvideo a').items()
    meta_title = doc('.meta-title').items()  # titles
    for i, t in zip(i_list, meta_title):
        href = i.attr('href')
        Play(t.text(), href)  # Play is defined in the full code
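One thing the article does not say is whether the `href` values on the listing page are absolute or relative. Assuming some of them might be relative, a small sketch like this resolves them against the listing page address before they are passed on. `absolute_link` is a hypothetical helper introduced here for illustration.

```python
from urllib.parse import urljoin

LIST_URL = 'https://www.520mmtv.com/hd/rewu.html'  # listing page from the article


def absolute_link(href: str, base: str = LIST_URL) -> str:
    """Resolve href against the listing page URL; absolute links pass through unchanged."""
    return urljoin(base, href)


print(absolute_link('/9614.html'))
print(absolute_link('https://www.520mmtv.com/9614.html'))
```

If the site already returns absolute links, the call is harmless, so it costs nothing to keep it in the loop.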
Full code:
import requests
import re
from pyquery import PyQuery as pq


def Tools(url):
    # Send the request with a browser-like user-agent
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36 Edg/93.0.961.52'
    }
    response = requests.get(url, headers=headers)
    return response


def Play(title, url):
    # url = 'https://www.520mmtv.com/9614.html'
    response = Tools(url).text
    video_url = re.findall(r'url: "(.*?)",', response)[0]  # video address from the detail page
    video_content = Tools(video_url).content
    # The ./短视频 folder must already exist; 'wb' overwrites instead of appending
    with open('./短视频/{}.mp4'.format(title), 'wb') as f:
        f.write(video_content)
    print('{} downloaded....'.format(title))


def main():
    url = 'https://www.520mmtv.com/hd/rewu.html'
    response = Tools(url).text
    doc = pq(response)  # build a pyquery object; data is picked out with CSS class and id selectors
    # Class selector: where the class attribute contains spaces, replace each space with a dot
    i_list = doc('.meta-title').items()  # links
    meta_title = doc('.meta-title').items()  # titles
    for i, t in zip(i_list, meta_title):
        href = i.attr('href')
        Play(t.text(), href)


if __name__ == '__main__':
    main()
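The file name comes straight from the page title, and titles can contain characters that are not allowed in file names (a slash or a question mark, for instance). A minimal sketch of a cleanup step, assuming you want to swap those characters for underscores; `safe_filename` is a hypothetical helper, not part of the original code.

```python
import re

# Characters that are invalid in file names on Windows or act as path separators
_INVALID = re.compile(r'[\\/:*?"<>|\r\n\t]+')


def safe_filename(title: str, fallback: str = "untitled") -> str:
    """Replace runs of invalid characters with '_' and trim stray edges."""
    cleaned = _INVALID.sub("_", title).strip(" ._")
    return cleaned or fallback


print(safe_filename('a/b:c?'))
print(safe_filename('???'))
```

Inside `Play`, the path would then be built from `safe_filename(title)` instead of the raw title.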
Downloads were slow for me because my connection is poor. With a faster connection they will go faster.
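If the bottleneck is waiting on one video at a time rather than raw bandwidth, running a few downloads in parallel may help. The following is only a sketch under that assumption: `download_all` is a hypothetical helper, and `worker` stands for any function with the same `(title, url)` signature as `Play`. The pool is kept small on purpose so the site is not hammered.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def download_all(jobs: Iterable[Tuple[str, str]],
                 worker: Callable[[str, str], None],
                 max_workers: int = 4) -> Dict[str, bool]:
    """Run worker(title, url) for every job in a small thread pool.

    Returns {title: True/False} so failures are visible without stopping the rest.
    """
    results: Dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, title, url): title for title, url in jobs}
        for future in as_completed(futures):
            title = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                results[title] = True
            except Exception as exc:  # one failed video should not abort the batch
                print('{} failed: {}'.format(title, exc))
                results[title] = False
    return results
```

To use it with the code above, `main()` could collect `(t.text(), href)` pairs into a list and call `download_all(jobs, Play)` instead of calling `Play` inside the loop.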
Result:
[Image: result screenshot]