Author: 悍受蓁 | Source: Internet | 2023-09-16 05:26
I am using the rvest package to scrape information from the page http://www.radiolab.org/series/podcasts. After scraping the first page, I want to follow the "Next" link at the bottom, scrape that second page, move on to the third page, and so on.
The following line gives an error:
html_session("http://www.radiolab.org/series/podcasts") %>% follow_link("Next")
## Navigating to
##
## ./2/
## Error in parseURI(u) : cannot parse URI
##
## ./2/
Inspecting the HTML shows there is some extra cruft around the "./2/" that rvest apparently doesn't like:
html("http://www.radiolab.org/series/podcasts") %>% html_node(".pagefooter-next a")
## Next
.Last.value %>% html_attrs()
## href
## "\n \n ./2/ "
Question 1: How can I get rvest::follow_link to treat this link correctly, as my browser does? (I could manually grab the "Next" link and clean it up with a regex, but I would prefer to take advantage of the automation that rvest provides.)
At the end of the follow_link code, it calls jump_to. So I tried the following:
html_session("http://www.radiolab.org/series/podcasts") %>% jump_to("./2/")
## http://www.radiolab.org/series/2/
## Status: 404
## Type: text/html; charset=utf-8
## Size: 10744
## Warning message:
## In request_GET(x, url, ...) : client error: (404) Not Found
Digging into the code, it looks like the issue is with XML::getRelativeURL, which uses dirname to strip off the last part of the original path ("/podcasts"):
XML::getRelativeURL("./2/", "http://www.radiolab.org/series/podcasts/")
## [1] "http://www.radiolab.org/series/./2"
XML::getRelativeURL("../3/", "http://www.radiolab.org/series/podcasts/2/")
## [1] "http://www.radiolab.org/series/3"
Question 2: How can I get rvest::jump_to and XML::getRelativeURL to correctly handle relative paths?
1 Solution
Since this problem still seems to occur with RadioLab.com, your best solution is to create a custom function to handle this edge case. If you're only worried about this site - and this particular error - then you can write something like this:
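library(rvest)
# Follow a "Next"-style link whose href is padded with whitespace and starts with "./"
follow_next <- function(session, text = "Next", ...) {
  # Find the first element whose own text contains `text`
  link <- html_node(session, xpath = sprintf("//*[text()[contains(.,'%s')]]", text))
  url <- html_attr(link, "href")
  url <- trimws(url)            # strip the surrounding newlines and spaces
  url <- sub("^\\./", "", url)  # drop a leading "./"
  message("Navigating to ", url)
  jump_to(session, url, ...)
}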
That would allow you to write code like this:
html_session("http://www.radiolab.org/series/podcasts") %>%
follow_next()
#> Navigating to 2/
#> http://www.radiolab.org/series/podcasts/2/
#> Status: 200
#> Type: text/html; charset=utf-8
#> Size: 61261
This is not a bug in rvest per se: the href on the RadioLab page is malformed (it is padded with newlines and spaces), and failing to parse a malformed URL is not a bug. If you want to be liberal in what you accept, you need to work around the problem manually.
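As an alternative to stripping the leading "./" by hand, the cleaned href can be resolved against the page URL with xml2::url_absolute (xml2 is the package rvest is built on). This is only a sketch: resolve_href is a hypothetical helper name, and the hard-coded href is the attribute value shown in the question. It also suggests why the trailing slash on the base URL matters for the 404 seen with jump_to("./2/").

```r
library(xml2)

# Hypothetical helper (not part of rvest): tidy a scraped href and resolve it
# against the URL of the page it was found on.
resolve_href <- function(href, base) {
  href <- trimws(href)      # drop the newlines and spaces around "./2/"
  url_absolute(href, base)  # standard relative-reference resolution
}

raw_href <- "\n \n ./2/ "   # the attribute value shown in the question

# With a trailing slash on the base, "./2/" stays under /series/podcasts/
resolve_href(raw_href, "http://www.radiolab.org/series/podcasts/")

# Without it, "podcasts" is treated like a file name and is replaced, which
# matches the /series/2/ URL that returned 404 in the question
resolve_href(raw_href, "http://www.radiolab.org/series/podcasts")
```

The resolved URL can then be passed to jump_to in place of the relative one.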
Note that you could also use RSelenium to launch an actual browser (e.g. Chrome) and have that perform the URL parsing for you.
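To keep following "Next" through the third page and beyond, as the question asks, the step function can be applied in a loop. The sketch below is one possible shape, not part of rvest: collect_pages is a hypothetical helper, and it assumes the step function (for example follow_next) either signals an error or returns NULL once there is no further page.

```r
# Hypothetical helper: repeatedly apply `step` (e.g. follow_next) starting
# from `first`, stopping when `step` fails, returns NULL, or `max_pages`
# results have been collected.
collect_pages <- function(first, step, max_pages = 5) {
  pages <- list(first)
  while (length(pages) < max_pages) {
    current <- pages[[length(pages)]]
    # Treat any error from `step` as "no further page"
    nxt <- tryCatch(step(current), error = function(e) NULL)
    if (is.null(nxt)) break
    pages[[length(pages) + 1]] <- nxt
  }
  pages
}

# Intended use with the follow_next function above (needs network access):
# sessions <- collect_pages(html_session("http://www.radiolab.org/series/podcasts"),
#                           follow_next, max_pages = 3)
```

Capping the loop with max_pages keeps a misbehaving "Next" link from causing an endless crawl.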