Ten Ways to Fetch Web Resources with Python 3

Author: 机智的孙志嵘 | Source: Internet | 2017-05-14 02:44

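Over the past couple of days I have been learning how to fetch web resources with Python 3 and came across quite a few approaches, so here are some short notes.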

1. The simplest approach

import urllib.request

response = urllib.request.urlopen('http://python.org/')
html = response.read()
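
read() returns bytes, so the page usually has to be decoded before it can be used as text. Here is a minimal sketch, assuming the server declares its charset in the Content-Type header and that UTF-8 is an acceptable fallback. fetch_text is a hypothetical helper name, not part of urllib.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Hypothetical helper: fetch a URL and return its body as str."""
    # The with-block closes the connection even if read() fails.
    with urllib.request.urlopen(url) as response:
        # Use the charset from the Content-Type header when present,
        # otherwise fall back to the assumed default.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)
```

For example, fetch_text('http://python.org/') would return the page as a string. urlopen also accepts data: URLs, which makes it easy to try the helper without a network connection.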

2. Using a Request object

import urllib.request

req = urllib.request.Request('http://python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()

3. Sending data

#!/usr/bin/env python3

import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456',
}

# In Python 3 the POST body must be bytes, so encode the urlencoded string.
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Referer', 'http://www.python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()

print(the_page.decode("utf8"))

4. Sending data and headers

#!/usr/bin/env python3

import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456',
}
headers = {'User-Agent': user_agent}

# In Python 3 the POST body must be bytes, so encode the urlencoded string.
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()

print(the_page.decode("utf8"))
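
The two examples above only differ in how the headers are attached. A Request can be inspected before it is sent, which is a handy way to check what urllib will actually transmit. This is a sketch under assumptions: build_login_request, the example.com address and the agent string are made up for illustration.

```python
import urllib.parse
import urllib.request


def build_login_request(url, fields, user_agent):
    """Hypothetical helper: build (but do not send) a form POST request."""
    # urlencode() returns str; Request needs bytes for the body.
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=body, headers={"User-Agent": user_agent})


req = build_login_request(
    "http://localhost/login.php",
    {"act": "login", "login[email]": "user@example.com"},  # illustrative values
    "example-agent/1.0",
)

# Supplying data switches the request method from GET to POST.
print(req.get_method())
print(req.data)
```

Nothing is sent over the network here; passing req to urllib.request.urlopen would perform the actual POST.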

5. HTTP errors

#!/usr/bin/env python3

import urllib.error
import urllib.request

req = urllib.request.Request('http://www.python.org/fish.html')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    # e.code is the HTTP status; the error object can be read like a response.
    print(e.code)
    print(e.read().decode("utf8"))

6. Exception handling, part 1

#!/usr/bin/env python3

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("http://twitter.com/")
try:
    response = urlopen(req)
except HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first.
    print("The server couldn't fulfill the request.")
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))

7. Exception handling, part 2

#!/usr/bin/env python3

from urllib.request import Request, urlopen
from urllib.error import URLError

req = Request("http://twitter.com/")
try:
    response = urlopen(req)
except URLError as e:
    # HTTPError is a subclass of URLError and has both 'code' and 'reason',
    # so check for 'code' first to tell the two cases apart.
    if hasattr(e, 'code'):
        print("The server couldn't fulfill the request.")
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))
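
Both exception-handling variants depend on HTTPError being a subclass of URLError. The sketch below shows that relationship without touching the network by constructing the exceptions by hand. describe_error is a hypothetical helper, and the example.com URL is only a placeholder.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def describe_error(exc):
    """Hypothetical helper: turn a urllib exception into a short message."""
    # HTTPError is a subclass of URLError, so it has to be tested first.
    if isinstance(exc, HTTPError):
        return "HTTP error {}".format(exc.code)
    if isinstance(exc, URLError):
        return "failed to reach server: {}".format(exc.reason)
    raise TypeError("not a urllib error: {!r}".format(exc))


# Build the exceptions manually so that no request is sent.
not_found = HTTPError(
    "http://example.com/missing", 404, "Not Found", Message(), io.BytesIO()
)
unreachable = URLError("connection refused")

print(issubclass(HTTPError, URLError))
print(describe_error(not_found))
print(describe_error(unreachable))
```

If the isinstance checks were swapped, every HTTP error would be reported as a connection failure, which is the same ordering concern as in the try/except blocks above.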

8. HTTP authentication

#! /usr/bin/env python3
 
import urllib.request
 
# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
 
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://cms.tetx.com/"
password_mgr.add_password(None, top_level_url, 'yzhang', 'cccddd')
 
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
 
# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
 
# use the opener to fetch a URL
a_url = "https://cms.tetx.com/"
x = opener.open(a_url)
print(x.read())
 
# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
 
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)
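
HTTPBasicAuthHandler only sends the credentials after the server has answered with a 401 challenge. If a server expects them on the first request, one possible alternative is to set the Authorization header directly. This is a sketch rather than the approach used above, and basic_auth_header is a hypothetical helper name.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Hypothetical helper: build the value of a Basic Authorization header."""
    # Basic auth is "username:password", base64-encoded.
    credentials = "{}:{}".format(username, password).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Attach the header up front; the request is built but not sent here.
req = urllib.request.Request("https://cms.tetx.com/")
req.add_header("Authorization", basic_auth_header("yzhang", "cccddd"))
```

Basic credentials are only encoded, not encrypted, so this should be used over HTTPS only.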

9. Using a proxy

#!/usr/bin/env python3

import urllib.request

# ProxyHandler maps URL schemes (http, https) to proxy URLs. urllib has no
# SOCKS support, so this assumes an HTTP proxy listening on localhost:1080.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

a = urllib.request.urlopen("http://g.cn").read().decode("utf8")
print(a)
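
install_opener changes the behaviour of every later urlopen call in the process. When only some requests should go through the proxy, the opener can be kept local instead. A minimal sketch, assuming an HTTP proxy at localhost:1080; make_proxy_opener is a hypothetical helper name.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Hypothetical helper: return an opener that routes http/https via a proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    # build_opener adds the default handlers around the proxy handler.
    return urllib.request.build_opener(handler)


# Assumed proxy address; nothing is contacted until opener.open() is called
# with an http or https URL.
opener = make_proxy_opener("http://localhost:1080")
```

Calling opener.open("http://g.cn") would then use the proxy, while plain urllib.request.urlopen calls elsewhere stay unaffected.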

10. Timeouts

#! /usr/bin/env python3
 
import socket
import urllib.request
 
# timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)
 
# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('http://twitter.com/')
a = urllib.request.urlopen(req).read()
print(a)
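
socket.setdefaulttimeout affects every socket created afterwards, not just urllib. urlopen also takes a timeout argument that applies to a single call. Here is a sketch of that variant; fetch_with_timeout is a hypothetical helper, and returning None on timeout is just one possible design.

```python
import socket
import urllib.request
from urllib.error import URLError


def fetch_with_timeout(url, seconds=2.0):
    """Hypothetical helper: return the body as bytes, or None on timeout."""
    try:
        with urllib.request.urlopen(url, timeout=seconds) as response:
            return response.read()
    except socket.timeout:
        # Raised directly when the timeout hits while reading the response.
        return None
    except URLError as exc:
        # A timeout while connecting is wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return None
        raise  # other errors (DNS failure, HTTP status, ...) are not hidden
```

For example, fetch_with_timeout('http://twitter.com/', seconds=2) would give up after roughly two seconds of waiting on the connection or on a read.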
