
Coursera: Using Python to Access Web Data (Notes for Weeks 2–6)


Week 2: Regular Expressions
http://www.cnblogs.com/moonache/p/5110322.html
Week 3: Networks and Sockets
http://www.cnblogs.com/moonache/p/5112060.html
Week 4: Programs that Surf the Web
http://www.cnblogs.com/moonache/p/5112088.html

BeautifulSoup tutorial (Chinese): http://cuiqingcai.com/1319.html
Official documentation (Chinese translation): http://beautifulsoup.readthedocs.io/zh_CN/latest/
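
Before diving into the exercises, here is a minimal BeautifulSoup sketch of the calls used below (Python 2 to match the course code; the HTML fragment is invented purely for illustration):

from bs4 import BeautifulSoup

# An invented HTML fragment, just to show the calls used in the exercises
html = '<p><a href="http://example.com">Alice</a> <span>97</span></p>'
soup = BeautifulSoup(html, "html.parser")

# soup('a') is shorthand for soup.find_all('a')
for tag in soup('a'):
    print tag.get('href')    # attribute lookup -> http://example.com
    print tag.contents[0]    # first child node -> Alice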

Week 5: http://www.cnblogs.com/moonache/p/5112109.html

Week 4 Homework:

Exercise 1: Sum the comment counts on the page at the URL below.

http://python-data.dr-chuck.net/comments_329805.html

import urllib
from bs4 import BeautifulSoup

url = raw_input("Enter - ")

html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('span')   # each comment count lives in a <span> tag
total = 0
for tag in tags:
    total += int(tag.contents[0])   # the span's text is the number
print total

Exercise 2.

In this assignment you will write a Python program that expands on http://www.pythonlearn.com/code/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name you find.

We provide two files for this assignment. One is a sample file where we give you the name for your testing, and the other is the actual data you need to process for the assignment.

Sample problem: Start at http://python-data.dr-chuck.net/known_by_Fikret.html
Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.
Sequence of names: Fikret Montgomery Mhairade Butchi Anayah
Last name in sequence: Anayah
Actual problem: Start at: http://python-data.dr-chuck.net/known_by_Wiktorja.html
Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.
Hint: The first character of the name of the last page that you will load is: G

Approach: this problem requires iteration; on each pass, fetch the current page, collect its anchor tags, and move to the link at the given position.

import urllib
from bs4 import BeautifulSoup

url = raw_input("Enter URL: ")
count = int(raw_input("Enter count: "))
position = int(raw_input("Enter position: "))
print url
for i in range(count):   # repeat the link-following process count times
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')                       # all anchor tags on the page
    url = tags[position - 1].get('href')   # position is 1-based, lists are 0-based
    print url

The answer is Giyia.

Week 5 Homework:

Extracting Data from XML

In this assignment you will write a Python program somewhat similar to http://www.pythonlearn.com/code/geoxml.py. The program will prompt for a URL, read the XML data from that URL using urllib, parse and extract the comment counts from the XML data, and compute the sum of the numbers in the file.

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

Sample data: http://python-data.dr-chuck.net/comments_42.xml (Sum=2553)
Actual data: http://python-data.dr-chuck.net/comments_329802.xml (Sum ends with 74)
You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.
Data Format and Approach
The data consists of a number of names and comment counts in XML as follows:

<comment>
  <name>Matthias</name>
  <count>97</count>
</comment>

You are to look through all the <count> tags, find the values, and sum the numbers. The closest sample code that shows how to parse XML is geoxml.py. But since the nesting of the elements in our data is different from the data parsed in that sample code, you will have to make real changes to the code.
To make the code a little simpler, you can use an XPath selector string to look through the entire tree of XML for any tag named 'count' with the following line of code:

counts = tree.findall('.//count')
Take a look at the Python ElementTree documentation and look for the supported XPath syntax for details. You could also work from the top of the XML down to the comments node and then loop through the child nodes of the comments node.
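As a self-contained sketch of the XPath selector, here it is applied to an invented XML literal that mirrors the assignment's nesting:

import xml.etree.ElementTree as ET

# Invented sample mirroring the assignment's structure
data = '''<commentinfo>
  <comments>
    <comment><name>Matthias</name><count>97</count></comment>
    <comment><name>Geomer</name><count>97</count></comment>
  </comments>
</commentinfo>'''

tree = ET.fromstring(data)
counts = tree.findall('.//count')          # every <count> anywhere under the root
print sum(int(c.text) for c in counts)     # prints 194
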
Sample Execution

$ python solution.py
Enter location: http://python-data.dr-chuck.net/comments_42.xml
Retrieving http://python-data.dr-chuck.net/comments_42.xml
Retrieved 4204 characters
Count: 50
Sum: 2…
Solution 1:

import urllib
import xml.etree.ElementTree as ET

location = raw_input('Enter location: ')
print 'Retrieving', location
xml = urllib.urlopen(location).read()
print 'Retrieved %d characters' % len(xml)
# ET.fromstring() returns the root element, so children can be
# looked up directly beneath it
commentinfo = ET.fromstring(xml)
comments = commentinfo.findall('comments/comment')
total = 0
for item in comments:
    total += int(item.find('count').text)
print total

Solution 2:

import urllib
import xml.etree.ElementTree as ET

location = raw_input('Enter location: ')
print 'Retrieving', location
xml = urllib.urlopen(location).read()
print 'Retrieved %d characters' % len(xml)
commentinfo = ET.fromstring(xml)
# './/count' selects every <count> element anywhere in the tree
counts = commentinfo.findall('.//count')
total = 0
for item in counts:
    total += int(item.text)
print total

Week 6 Homework

1. Programming Assignment 1: https://pr4e.dr-chuck.com/tsugi/mod/python-data/index.php

Extracting Data from JSON

In this assignment you will write a Python program somewhat similar to http://www.pythonlearn.com/code/json2.py. The program will prompt for a URL, read the JSON data from that URL using urllib, parse and extract the comment counts from the JSON data, compute the sum of the numbers in the file, and enter the sum below:
We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

Sample data: http://python-data.dr-chuck.net/comments_42.json (Sum=2553)
Actual data: http://python-data.dr-chuck.net/comments_329806.json (Sum ends with 50)
You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.
Data Format
The data consists of a number of names and comment counts in JSON as follows:

{
  "comments": [
    {
      "name": "Matthias",
      "count": 97
    },
    {
      "name": "Geomer",
      "count": 97
    }
  ]
}
The closest sample code that shows how to parse JSON and extract a list is json2.py. You might also want to look at geoxml.py to see how to prompt for a URL and retrieve data from a URL.
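
As a quick sanity check of the structure, here is a sketch that parses the sample above with json.loads:

import json

# The sample data above, quoted and comma-separated as strict JSON requires
data = '''{
  "comments": [
    {"name": "Matthias", "count": 97},
    {"name": "Geomer", "count": 97}
  ]
}'''

info = json.loads(data)          # returns a dict
for item in info['comments']:    # a list of dicts
    print item['name'], item['count']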

Sample Execution

$ python solution.py
Enter location: http://python-data.dr-chuck.net/comments_42.json
Retrieving http://python-data.dr-chuck.net/comments_42.json
Retrieved 2733 characters
Count: 50
Sum: 2…

Solution:

import json
import urllib

url = raw_input('Enter location: ')
print 'Retrieving', url

html = urllib.urlopen(url).read()
print 'Retrieved %d characters' % len(html)
info = json.loads(html)
# info.find('comments') would raise AttributeError: a dict has no
# find() method, so index it by key instead
comments = info['comments']
print 'Count:', len(comments)
total = 0
for item in comments:
    total += item['count']
print 'Sum:', total

2. Programming Assignment 2

Calling a JSON API

In this assignment you will write a Python program somewhat similar to http://www.pythonlearn.com/code/geojson.py. The program will prompt for a location, contact a web service, retrieve the JSON from the web service, parse that data, and retrieve the first place_id from the JSON. A place ID is a textual identifier that uniquely identifies a place within Google Maps.
API End Points

To complete this assignment, you should use this API endpoint that has a static subset of the Google Data:

http://python-data.dr-chuck.net/geojson
This API uses the same parameters (sensor and address) as the Google API. This API also has no rate limit so you can test as often as you like. If you visit the URL with no parameters, you get a list of all of the address values which can be used with this API.
To call the API, you need to provide a sensor=false parameter and the address that you are requesting as the address= parameter, properly URL encoded using the urllib.urlencode() function as shown in http://www.pythonlearn.com/code/geojson.py
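
A small sketch of what urllib.urlencode() produces for these two parameters (Python 2; the address value is the sample location used below):

import urllib

params = {'sensor': 'false', 'address': 'South Federal University'}
# urlencode() percent-encodes each value and joins pairs with '&';
# dict ordering is arbitrary in Python 2, so the pair order may vary
print urllib.urlencode(params)
# e.g. sensor=false&address=South+Federal+University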

Test Data / Sample Execution

You can test to see if your program is working with a location of “South Federal University” which will have a place_id of “ChIJJ8oO7_B_bIcR2AlhC8nKlok”.

$ python solution.py
Enter location: South Federal University
Retrieving http://…
Retrieved 2101 characters
Place id ChIJJ8oO7_B_bIcR2AlhC8nKlok
Turn In
The data queried in the sample execution above can be viewed at http://python-data.dr-chuck.net/geojson?sensor=false&address=South%20Federal%20University

Please run your program to find the place_id for this location:

Washington State University
Make sure to enter the name and case exactly as above and enter the place_id and your Python code below. Hint: The first seven characters of the place_id are “ChIJvQ5 …”
Make sure to retrieve the data from the URL specified above and not the normal Google API. Your program should work with the Google API, but the place_id may not match for this assignment.

import urllib
import json

serviceurl = 'http://python-data.dr-chuck.net/geojson?'

address = raw_input('Enter location: ')
url = serviceurl + urllib.urlencode({'sensor': 'false', 'address': address})
print 'Retrieving:', url
html = urllib.urlopen(url)
data = html.read()
print 'Retrieved', len(data), 'characters'

try:
    js = json.loads(str(data))
except:
    js = None

# Guard against a failed fetch or a non-OK status before indexing
if js is None or 'status' not in js or js['status'] != 'OK':
    print '==== Failure To Retrieve ===='
else:
    results = js['results']
    print results[0]['place_id']


