Author: 冰月雪镜樱1993 | Source: Internet | 2023-05-16 09:11
I need to automate a process involving a website that is using a login form. I need to capture some data in the pages following the login page.
I know how to screen-scrape normal pages, but not those behind a secure site.
- Can this be done with the .NET WebClient class?
- How would I log in automatically?
- How would I stay logged in for the other pages?
4 answers
9
One way would be through automating a browser -- you mentioned WebClient, so I'm guessing you might be referring to WebClient in .NET.
Two main points:
- There's nothing special about HTTPS where WebClient is concerned; it just works.
- Cookies are typically used to carry authentication, so you'll need to capture and replay them.
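To make "capture and replay" concrete, here is a minimal, language-neutral sketch in Python rather than .NET. The helper names `capture_cookies` and `cookie_header` and the sample header values are my own illustrations, not part of WebClient or of any real site.

```python
from http.cookies import SimpleCookie


def capture_cookies(set_cookie_headers, jar=None):
    """Merge the Set-Cookie headers of one response into a name -> value dict."""
    jar = dict(jar or {})
    for header in set_cookie_headers:
        parsed = SimpleCookie()
        parsed.load(header)
        for name, morsel in parsed.items():
            jar[name] = morsel.value  # a later cookie with the same name wins
    return jar


def cookie_header(jar):
    """Build the Cookie request header that replays the captured cookies."""
    return "; ".join(f"{name}={value}" for name, value in jar.items())


# Hypothetical Set-Cookie headers, as a login form and login response might send.
jar = capture_cookies(["SESSIONID=abc123; Path=/; HttpOnly"])
jar = capture_cookies(["auth=xyz789; Path=/; Secure"], jar)
print(cookie_header(jar))
```

Only the name/value pairs go back to the server; attributes such as `Path` and `HttpOnly` are instructions to the client and are not replayed.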
Here are the steps I'd follow:
1. GET the login form and capture the cookie in the response.
2. Using XPath and HtmlAgilityPack, find the "input type=hidden" field names and values.
3. POST to the login form's action with the user name, password, and hidden field values in the request body. Include the cookie in the request headers, and again capture the cookie from the response.
4. GET the pages you want, again with the cookie in the request headers.
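The four steps can be sketched roughly as follows. This is a Python illustration of the flow, not .NET code. It assumes the form's fields are literally named `username` and `password` (check the real form), and it simplifies by collecting every hidden input on the page, not only those inside the login form. The standard library's cookie jar does the capturing and replaying automatically.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode, urljoin
from urllib.request import HTTPCookieProcessor, build_opener


class LoginFormParser(HTMLParser):
    """Records the first form's action and every hidden input on the page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action") or ""
        elif tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            if attrs.get("name"):
                self.hidden[attrs["name"]] = attrs.get("value") or ""


def build_login_request(form_html, form_url, username, password):
    """Step 2 plus the body for step 3: return (post_url, encoded_body)."""
    parser = LoginFormParser()
    parser.feed(form_html)
    fields = dict(parser.hidden)
    # Assumed field names; a real form may call these something else.
    fields["username"] = username
    fields["password"] = password
    post_url = urljoin(form_url, parser.action or "")
    return post_url, urlencode(fields).encode("ascii")


def login_and_fetch(form_url, username, password, page_url):
    """Run steps 1-4 against a live site and return the protected page's HTML."""
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    with opener.open(form_url) as resp:              # step 1: GET the form
        form_html = resp.read().decode("utf-8", "replace")
    post_url, body = build_login_request(form_html, form_url, username, password)
    with opener.open(post_url, data=body) as resp:   # step 3: POST the login
        resp.read()
    with opener.open(page_url) as resp:              # step 4: GET the data page
        return resp.read().decode("utf-8", "replace")
```

`build_login_request` is kept separate from the network calls so that the parsing and body-building can be checked against a saved copy of the form before pointing the script at the live site.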
Step 2 describes a somewhat complicated way of automating the login. Usually you can POST the user name and password directly to the known login form action, without fetching the initial form or relaying the hidden fields. Some sites, however, validate the form itself (as distinct from validating individual fields), and against those the shortcut does not work.
HtmlAgilityPack is a .NET library that parses ill-formed HTML into a document tree modeled on XmlDocument, so you can run XPath queries over it. Quite useful.
Finally, you may run into a situation where the form relies on client script to alter the form values before submitting. You may need to simulate this behavior.
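As a purely hypothetical illustration of simulating such a script: suppose the page's submit handler replaces the clear-text password with a SHA-256 digest of a hidden `nonce` field plus the password. The field names and the hashing scheme here are invented for the example; what a real site does has to be read out of its JavaScript. A Python sketch of mimicking it before the POST:

```python
import hashlib


def apply_client_script(fields):
    """Mimic a hypothetical onsubmit handler that hashes the password with a nonce."""
    out = dict(fields)
    material = (out["nonce"] + out["password"]).encode("utf-8")
    out["password_hash"] = hashlib.sha256(material).hexdigest()  # field the script adds
    out["password"] = ""  # the script blanks the clear-text field
    return out


fields = apply_client_script(
    {"username": "alice", "password": "s3cret", "nonce": "n-42"}
)
```

The transformed `fields` would then be URL-encoded and sent as the POST body in place of the raw form values.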
Using a tool to view the HTTP traffic is extremely helpful for this type of work. I recommend ieHttpHeaders, Fiddler, or FireBug (Net tab).