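A Simple Java Crawler Using HttpClient and Jsoup
Introduction to HttpClient
HttpClient began as a subproject of Apache Jakarta Commons and is now maintained under Apache HttpComponents. It is an efficient, feature-rich client-side toolkit for the HTTP protocol; the 4.x line used here supports HTTP/1.0 and HTTP/1.1. Its main features are:
- (1) Implements all HTTP methods (GET, POST, PUT, HEAD, etc.)
- (2) Follows redirects automatically
- (3) Supports HTTPS
- (4) Supports proxy servers, among other features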
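Introduction to Jsoup
jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Its main features are:
- (1) Parses HTML from a URL, a file, or a string;
- (2) Finds and extracts data using DOM traversal or CSS selectors;
- (3) Manipulates HTML elements, attributes, and text.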
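Feature (3) is easy to overlook, so here is a minimal sketch of editing a parsed document. The class name JsoupEditSketch, the method retarget, and the sample markup are made up for illustration; it assumes the jsoup library is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; not part of the original article's code.
final class JsoupEditSketch {
    // Rewrites the first link's target and label, then returns the body markup.
    static String retarget(String html, String newHref, String newText) {
        Document document = Jsoup.parse(html);
        Element link = document.select("a").first();
        if (link != null) {
            // attr(key, value) sets an attribute; text(value) replaces the element's text.
            link.attr("href", newHref).text(newText);
        }
        return document.body().html();
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/old\">Old</a></p>";
        System.out.println(retarget(html, "/new", "New"));
    }
}
```

Both setters return the element itself, which is why the two calls can be chained.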
Steps
Adding dependencies to the Maven project
Add the following dependencies to pom.xml:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
Writing the JUnit test
Code
import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.junit.Test;
import java.util.List;

/**
 * HttpClient & Jsoup library test class
 *
 * Created by xuyh at 2017/11/6 15:28.
 */
public class HttpClientJsoupTest {
    @Test
    public void test() {
        HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());
        CloseableHttpClient httpClient = null;
        CloseableHttpResponse response = null;
        String responseStr = "";
        try {
            httpClient = HttpClientBuilder.create().build();
            HttpClientContext context = HttpClientContext.create();
            response = httpClient.execute(httpGet, context);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // Read the body only for a 200 response; otherwise responseStr stays empty.
            if (state == 200 && entity != null)
                responseStr = EntityUtils.toString(entity, "utf-8");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null)
                    response.close();
                if (httpClient != null)
                    httpClient.close();
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
        // Nothing to parse if the request failed or returned no body.
        if (responseStr.isEmpty())
            return;
        Document document = Jsoup.parse(responseStr);
        List<Element> elements = document.getElementsByAttributeValue("class", "phdnews_txt fr").first()
                .getElementsByAttributeValue("class", "phdnews_hdline");
        elements.forEach(element -> {
            // Print the target and the label of every link in the headline block.
            for (Element e : element.getElementsByTag("a")) {
                System.out.println(e.attr("href"));
                System.out.println(e.text());
            }
        });
    }
}
Explanation
- Create an HttpGet object that sends a GET request to http://sports.sina.com.cn/, and set both the socket timeout and the connection timeout to 30000 ms.
HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");
httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());
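- Build a CloseableHttpClient with HttpClientBuilder and execute the HttpGet request defined above, passing in a newly created HttpClientContext. The execute call returns a CloseableHttpResponse, and the response body is then read from its HttpEntity as text.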
CloseableHttpClient httpClient = null;
CloseableHttpResponse response = null;
String responseStr = "";
try {
    httpClient = HttpClientBuilder.create().build();
    HttpClientContext context = HttpClientContext.create();
    response = httpClient.execute(httpGet, context);
    int state = response.getStatusLine().getStatusCode();
    HttpEntity entity = response.getEntity();
    // Read the body only for a 200 response; otherwise responseStr stays empty.
    if (state == 200 && entity != null)
        responseStr = EntityUtils.toString(entity, "utf-8");
} catch (Exception e) {
    e.printStackTrace();
} finally {
    try {
        if (response != null)
            response.close();
        if (httpClient != null)
            httpClient.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
- Parse the response text with Jsoup and walk the resulting elements.
Document document = Jsoup.parse(responseStr);
List<Element> elements = document.getElementsByAttributeValue("class", "phdnews_txt fr").first()
        .getElementsByAttributeValue("class", "phdnews_hdline");
elements.forEach(element -> {
    for (Element e : element.getElementsByTag("a")) {
        System.out.println(e.attr("href"));
        System.out.println(e.text());
    }
});
- Jsoup's Document class extends org.jsoup.nodes.Element. Some of the methods available on both Document and Element:
public Element getElementById(String id);
public Elements getElementsByClass(String className);
public Elements getElementsByAttributeValue(String key, String value);
public Elements getElementsByTag(String tagName);
public String attr(String attributeKey);
public String text();
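To see these methods in use without a network request, the sketch below parses a hard-coded HTML string. The markup, the class name JsoupMethodsSketch, and its helper methods are invented for illustration; only the Jsoup calls come from the library, and the jsoup dependency is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the HTML below is invented for illustration.
final class JsoupMethodsSketch {
    static final String HTML = "<div id=\"news\" class=\"headline\">"
            + "<a href=\"http://example.com/a\">First</a>"
            + "<a href=\"http://example.com/b\">Second</a>"
            + "</div>";

    // Collects "href text" for every link inside elements that carry the given class.
    static List<String> linksInClass(String html, String className) {
        Document document = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element block : document.getElementsByClass(className)) {
            for (Element a : block.getElementsByTag("a")) {
                result.add(a.attr("href") + " " + a.text());
            }
        }
        return result;
    }

    // The same lookup expressed as a single CSS selector.
    static List<String> linksBySelector(String html, String className) {
        List<String> result = new ArrayList<>();
        for (Element a : Jsoup.parse(html).select("." + className + " a")) {
            result.add(a.attr("href") + " " + a.text());
        }
        return result;
    }

    public static void main(String[] args) {
        Element news = Jsoup.parse(HTML).getElementById("news");
        System.out.println(news.getElementsByTag("a").size());
        linksInClass(HTML, "headline").forEach(System.out::println);
    }
}
```

The selector form is usually shorter when the lookup spans several levels, while the getElementsBy... methods read more explicitly step by step.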
Output
http://sports.sina.com.cn/sportsevents/3v3/2017-11-05/doc-ifynmzrs7218551.shtml
3X3黄金联赛冠军赛山西队夺冠!独享48万
http://video.sina.com.cn/sports/k/cba/1105final3x3/
视频
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/181467390769.html
黄金mvp集锦
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/170167390621.html
直捣黄龙1v2
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/183267390917.html
5佳球:库里式虚晃
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/150067390331.html
大嫂徐冬冬亮相
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/145367390313.html
现场众多美女云集
http://video.sina.com.cn/p/sports/c/zj/v/doc/2017-11-05/150867390337.html
啦啦队热舞表演
http://sports.sina.com.cn/nba/
哈登56分周琦暴扣火箭胜
http://sports.sina.com.cn/basketball/nba/2017-11-06/doc-ifynmzrs7300047.shtml
詹皇26分骑士负
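Every href in this run is already an absolute URL, but a crawler will often meet relative links as well. A minimal sketch of resolving them against the page address with java.net.URI follows; the class name LinkResolver and the sample paths are illustrative assumptions, not part of the original program.

```java
import java.net.URI;

// Hypothetical helper; not part of the original test class.
final class LinkResolver {
    // Resolves an href found on a page against that page's URL.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        // Sample page path chosen for illustration only.
        String page = "http://sports.sina.com.cn/basketball/index.html";
        System.out.println(resolve(page, "/nba/"));
        System.out.println(resolve(page, "cba/1.shtml"));
        System.out.println(resolve(page, "http://video.sina.com.cn/sports/"));
    }
}
```

jsoup can do the same through Element.absUrl("href"), provided the base URI is passed to Jsoup.parse(html, baseUri).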
Writing a utility class
HttpClient and Jsoup can be wrapped in a utility class like this:
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.CookieStore;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.cookie.Cookie;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import javax.net.ssl.*;
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * HTTP utility: sends HTTP and HTTPS requests with HttpClient.
 *
 * Created by xuyh at 2017/7/17 19:08.
 */
public class HttpUtils {
    /** Request timeout in milliseconds; defaults to 20000. */
    private int timeout = 20000;
    /** Cookies collected from responses and sent with later requests. */
    private Map<String, String> cookieMap = new HashMap<>();
    /** Charset used to decode response bodies; defaults to UTF-8. */
    private String charset = "UTF-8";
    private static HttpUtils httpUtils;

    private HttpUtils() {
    }

    /** Returns the shared instance. */
    public static HttpUtils getInstance() {
        if (httpUtils == null)
            httpUtils = new HttpUtils();
        return httpUtils;
    }

    /** Clears the cookie map. */
    public void invalidCookieMap() {
        cookieMap.clear();
    }

    public int getTimeout() {
        return timeout;
    }

    /** Sets the request timeout in milliseconds. */
    public void setTimeout(int timeout) {
        this.timeout = timeout;
    }

    public String getCharset() {
        return charset;
    }

    /** Sets the charset used to decode response bodies. */
    public void setCharset(String charset) {
        this.charset = charset;
    }

    /** Parses an HTML string into a Document. */
    public static Document parseHtmlToDoc(String html) throws Exception {
        return removeHtmlSpace(html);
    }

    private static Document removeHtmlSpace(String str) {
        Document doc = Jsoup.parse(str);
        // Assumption: the original literal was "&nbsp;", which the published page rendered
        // as a plain space. Stripping ordinary spaces would corrupt the markup.
        String result = doc.html().replace("&nbsp;", "");
        return Jsoup.parse(result);
    }

    /** Executes a GET request and returns the parsed document. */
    public Document executeGetAsDocument(String url) throws Exception {
        return parseHtmlToDoc(executeGet(url));
    }

    /** Executes a GET request and returns the body as a string. */
    public String executeGet(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        String str = "";
        // try-with-resources closes the response and the client even when an exception is thrown.
        try (CloseableHttpClient httpClient = HttpClientBuilder.create().build()) {
            HttpClientContext context = HttpClientContext.create();
            try (CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
                getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
                int state = response.getStatusLine().getStatusCode();
                HttpEntity entity = response.getEntity();
                // A 404 yields an empty string; any other response body is decoded.
                if (state != 404 && entity != null) {
                    str = EntityUtils.toString(entity, charset);
                }
            }
        }
        return str;
    }

    /** Executes a GET request over HTTPS and returns the parsed document. */
    public Document executeGetWithSSLAsDocument(String url) throws Exception {
        return parseHtmlToDoc(executeGetWithSSL(url));
    }

    /** Executes a GET request over HTTPS and returns the body as a string. */
    public String executeGetWithSSL(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        String str = "";
        try (CloseableHttpClient httpClient = createSSLInsecureClient()) {
            HttpClientContext context = HttpClientContext.create();
            try (CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
                getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
                int state = response.getStatusLine().getStatusCode();
                HttpEntity entity = response.getEntity();
                // A 404 yields an empty string; any other response body is decoded.
                if (state != 404 && entity != null) {
                    str = EntityUtils.toString(entity, charset);
                }
            }
        }
        return str;
    }

    /** Executes a form POST request and returns the parsed document. */
    public Document executePostAsDocument(String url, Map<String, String> params) throws Exception {
        return parseHtmlToDoc(executePost(url, params));
    }

    /** Executes a form POST request and returns the body as a string. */
    public String executePost(String url, Map<String, String> params) throws Exception {
        String reStr = "";
        HttpPost httpPost = new HttpPost(url);
        httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
        List<NameValuePair> paramsRe = new ArrayList<>();
        for (String key : params.keySet()) {
            paramsRe.add(new BasicNameValuePair(key, params.get(key)));
        }
        httpPost.setEntity(new UrlEncodedFormEntity(paramsRe));
        try (CloseableHttpClient httpClient = HttpClientBuilder.create().build()) {
            HttpClientContext context = HttpClientContext.create();
            try (CloseableHttpResponse response = httpClient.execute(httpPost, context)) {
                getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    reStr = EntityUtils.toString(entity, charset);
                }
            }
        }
        return reStr;
    }

    /** Executes a form POST request over HTTPS and returns the parsed document. */
    public Document executePostWithSSLAsDocument(String url, Map<String, String> params) throws Exception {
        return parseHtmlToDoc(executePostWithSSL(url, params));
    }

    /** Executes a form POST request over HTTPS and returns the body as a string. */
    public String executePostWithSSL(String url, Map<String, String> params) throws Exception {
        String re = "";
        HttpPost post = new HttpPost(url);
        List<NameValuePair> paramsRe = new ArrayList<>();
        for (String key : params.keySet()) {
            paramsRe.add(new BasicNameValuePair(key, params.get(key)));
        }
        post.setHeader("Cookie", convertCookieMapToString(cookieMap));
        post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        post.setEntity(new UrlEncodedFormEntity(paramsRe));
        try (CloseableHttpClient httpClientRe = createSSLInsecureClient()) {
            HttpClientContext contextRe = HttpClientContext.create();
            try (CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    re = EntityUtils.toString(entity, charset);
                }
                getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
            }
        }
        return re;
    }

    /**
     * Sends a POST request with a JSON body.
     *
     * @param url target address
     * @param jsonBody JSON body
     */
    public String executePostWithJson(String url, String jsonBody) throws Exception {
        String reStr = "";
        HttpPost httpPost = new HttpPost(url);
        httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpPost.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
        try (CloseableHttpClient httpClient = HttpClientBuilder.create().build()) {
            HttpClientContext context = HttpClientContext.create();
            try (CloseableHttpResponse response = httpClient.execute(httpPost, context)) {
                getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    reStr = EntityUtils.toString(entity, charset);
                }
            }
        }
        return reStr;
    }

    /**
     * Sends a POST request with a JSON body over HTTPS.
     *
     * @param url target address
     * @param jsonBody JSON body
     */
    public String executePostWithJsonAndSSL(String url, String jsonBody) throws Exception {
        String re = "";
        HttpPost post = new HttpPost(url);
        post.setHeader("Cookie", convertCookieMapToString(cookieMap));
        post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        post.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
        try (CloseableHttpClient httpClientRe = createSSLInsecureClient()) {
            HttpClientContext contextRe = HttpClientContext.create();
            try (CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    re = EntityUtils.toString(entity, charset);
                }
                getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
            }
        }
        return re;
    }

    /** Copies every cookie from the store into the map, keyed by cookie name. */
    private void getCookiesFromCookieStore(CookieStore cookieStore, Map<String, String> cookieMap) {
        List<Cookie> cookies = cookieStore.getCookies();
        for (Cookie cookie : cookies) {
            cookieMap.put(cookie.getName(), cookie.getValue());
        }
    }

    /** Joins the map into a Cookie header value such as "a=1; b=2". */
    private String convertCookieMapToString(Map<String, String> map) {
        String cookie = "";
        for (String key : map.keySet()) {
            cookie += (key + "=" + map.get(key) + "; ");
        }
        if (map.size() > 0) {
            // Drop the trailing "; ".
            cookie = cookie.substring(0, cookie.length() - 2);
        }
        return cookie;
    }

    /**
     * Creates an HTTPS client that trusts every certificate and skips host name
     * verification, so it is insecure by design.
     */
    private static CloseableHttpClient createSSLInsecureClient() throws GeneralSecurityException {
        SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, (chain, authType) -> true).build();
        SSLConnectionSocketFactory sslConnectionSocketFactory = new SSLConnectionSocketFactory(sslContext,
                (s, sslSession) -> true);
        return HttpClients.custom().setSSLSocketFactory(sslConnectionSocketFactory).build();
    }
}
Besides fetching web page content, the utility class above can also send general HTTP requests, including form and JSON POSTs.
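The cookie handling in the utility class boils down to flattening a name-to-value map into a single Cookie header value. The sketch below shows that step in isolation with Collectors.joining, which avoids trimming the trailing separator by hand. CookieHeaderSketch and toHeader are hypothetical names, and a LinkedHashMap is used only so that the order is predictable.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical demo class; an alternative way to write the cookie-joining step.
final class CookieHeaderSketch {
    // Joins entries as "name=value" separated by "; ", the format of a Cookie request header.
    static String toHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123");
        cookies.put("lang", "zh-CN");
        System.out.println(toHeader(cookies)); // JSESSIONID=abc123; lang=zh-CN
    }
}
```

An empty map produces an empty string, so a caller could skip setting the header in that case.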
Source code
https://github.com/johnsonmoon/HttpUtils.git