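■ Suspected cause of the problem
The request was blocked by anti-crawling measures because the cookie was missing. What the request returned is a piece of JS that, when run,
generates the cookie. See args1? That is the key. The part below it is not encoded either; it is just obfuscated JS.
Anti-crawling sites require some basic HTTP headers so that the request looks like it comes from a browser.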
https://www.jianshu.com/p/401a25134b89
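The advice about browser-like headers can be sketched as follows. This is only a minimal illustration using the JDK's own HttpURLConnection; the class name BrowserLikeRequest and the header values are assumptions made for the example, not values taken from the original post, and headers alone may well not be enough if the site also expects the JS-generated cookie.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

final class BrowserLikeRequest {

    // Hypothetical header values that resemble a desktop browser; adjust as needed.
    static Map<String, String> browserLikeHeaders() {
        final Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                        + "(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36");
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8");
        return headers;
    }

    // Creates the connection object and sets the headers; nothing is sent yet.
    static HttpURLConnection prepare(final String url) {
        try {
            final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            for (final Map.Entry<String, String> e : browserLikeHeaders().entrySet()) {
                conn.setRequestProperty(e.getKey(), e.getValue());
            }
            return conn;
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        final HttpURLConnection conn = prepare("https://blog.csdn.net/sxzlc/article/list/3");
        // The request is only sent once the response is read, e.g. via conn.getResponseCode().
        System.out.println("User-Agent to be sent: " + conn.getRequestProperty("User-Agent"));
    }
}
```

With Apache HttpClient the same idea would be one httpGet.addHeader(name, value) call per header, in the same way the listing adds Accept-Encoding.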
Output of running the code below
package com.sxz.timecontroal;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URLDecoder;
import java.util.zip.GZIPInputStream;

import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class CheckTimeWithNet {

    static final String LOGINURL = "https://blog.csdn.net/sxzlc/article/list/3";

    public static void main(final String[] args) {
        final DefaultHttpClient httpclient = new DefaultHttpClient();
        final HttpGet httpGet = new HttpGet(LOGINURL);
        HttpResponse response = null;
        try {
            // Ask the server for a compressed response.
            httpGet.addHeader("Accept-Encoding", "gzip, deflate");
            response = httpclient.execute(httpGet);
        } catch (final ClientProtocolException cpException) {
            cpException.printStackTrace();
        } catch (final IOException ioException) {
            ioException.printStackTrace();
        }
        // Without a response there is nothing to inspect (avoids a NullPointerException below).
        if (response == null) {
            return;
        }

        // verify response is HTTP OK
        final int statusCode = response.getStatusLine().getStatusCode();
        if (statusCode != HttpStatus.SC_OK) {
            System.out.println("Unexpected HTTP status: " + statusCode);
            return;
        }

        System.out.println("---------------------Status code Info Start---------------------");
        System.out.println(response.getStatusLine());
        System.out.println("---------------------Status code Info end ---------------------");

        System.out.println("---------------------Head Info Start---------------------");
        final Header[] hs = response.getAllHeaders();
        for (final Header h : hs) {
            System.out.println(h.getName() + ":" + h.getValue());
        }
        System.out.println("---------------------Head Info End ---------------------");

        String getResult = null;
        try {
            // response.setEntity(new GzipDecompressingEntity(response.getEntity()));
            // getResult = EntityUtils.toString(response.getEntity(),"UTF-8");
            getResult = getStringFromResponseUzip(response);
        } catch (final Exception exception) {
            // Report the failure instead of silently leaving getResult as null.
            exception.printStackTrace();
        }
        System.out.println(getResult);
    }

    // Returns the response body as text, decompressing it first when Content-Encoding is gzip.
    public static String getStringFromResponseUzip(final HttpResponse response) throws Exception {
        if (response == null) {
            return null;
        }
        String responseText = "";
        final Header[] headers = response.getHeaders("Content-Encoding");
        for (final Header h : headers) {
            System.out.println(h.getValue());
            if (h.getValue().indexOf("gzip") > -1) {
                // For GZip response
                try {
                    final InputStream in = response.getEntity().getContent();
                    final GZIPInputStream gzin = new GZIPInputStream(in);
                    final InputStreamReader isr = new InputStreamReader(gzin, "UTF-8");
                    responseText = getStringFromStream(isr);
                    // responseText = URLDecoder.decode(responseText, "utf-8");
                } catch (final IOException exception) {
                    exception.printStackTrace();
                }
                return responseText;
            }
        }
        // No gzip Content-Encoding header: read the body as plain UTF-8 text.
        responseText = EntityUtils.toString(response.getEntity(), "utf-8");
        return responseText;
    }

    // Reads the whole stream line by line, normalizing line endings to CRLF.
    public static String getStringFromStream(final InputStreamReader isr) throws Exception {
        final BufferedReader br = new BufferedReader(isr);
        final StringBuilder sb = new StringBuilder();
        String tmp;
        while ((tmp = br.readLine()) != null) {
            sb.append(tmp);
            sb.append("\r\n");
        }
        br.close();
        isr.close();
        return sb.toString();
    }
}
---------------------Status code Info Start---------------------
HTTP/1.1 200 OK
---------------------Status code Info end ---------------------
---------------------Head Info Start---------------------
Server:Tengine
Date:Sat, 07 Dec 2019 12:20:38 GMT
Content-Type:text/html; charset=utf-8
Transfer-Encoding:chunked
Connection:keep-alive
Set-COOKIE:acw_tc=2760820215757212385795097e52a909ebbcda96b20e30f4c216c0bfbc89e6;path=/;HttpOnly;Max-Age=2678401
Content-Encoding:gzip
cache-control:no-cache, no-store
Pragma:no-cache
Strict-Transport-Security:max-age=86400
---------------------Head Info End ---------------------
gzip
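After decompression the content is hex-escaped code; still to be resolved...
\x65 z
This is caused by URL encoding and can be fixed by URL decoding.
Thanks to [gybao] for the help.
https://bbs.csdn.net/topics/395274030
However, without using URL decoding, I ran the earlier code again and it simply worked.
How I got that output earlier is unclear... The suspected cause is described below.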
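One observation that may help here: the sample \x65 has the shape of a JavaScript hex escape rather than a percent-escape (%65), and java.net.URLDecoder only interprets the latter. The sketch below shows one way such escapes could be turned back into characters for inspection. The class name HexEscapes is hypothetical, and the code assumes each \xNN stands for a single character; it does not reassemble multi-byte UTF-8 sequences.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HexEscapes {

    // Matches a literal backslash, 'x', and exactly two hex digits.
    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");

    // Replaces every \xNN escape with the character whose code is NN (hex).
    static String decode(final String input) {
        final Matcher m = HEX_ESCAPE.matcher(input);
        final StringBuffer out = new StringBuffer();
        while (m.find()) {
            final char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' in the decoded text from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(final String[] args) {
        System.out.println(decode("\\x65\\x7a")); // prints "ez"
    }
}
```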
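■ Code after further modification
Notes on the current latest code
When execution enters the gzip branch (line 79 in the original numbering), the output is not garbled whether or not the URLDecoder.decode call (line 85) is present.
Code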
package com.sxz.timecontroal;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URLDecoder;
import java.util.zip.GZIPInputStream;

import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class CheckTimeWithNet {

    //static final String LOGINURL = "https://blog.csdn.net/sxzlc?orderby=ViewCount";
    static final String LOGINURL = "https://blog.csdn.net/sxzlc/article/list/2?orderby=ViewCount";

    public static void main(final String[] args) {
        final DefaultHttpClient httpclient = new DefaultHttpClient();
        final HttpGet httpGet = new HttpGet(LOGINURL);
        HttpResponse response = null;
        try {
            // Ask the server for a compressed response.
            httpGet.addHeader("Accept-Encoding", "gzip, deflate");
            response = httpclient.execute(httpGet);
        } catch (final ClientProtocolException cpException) {
            cpException.printStackTrace();
        } catch (final IOException ioException) {
            ioException.printStackTrace();
        }
        // Without a response there is nothing to inspect (avoids a NullPointerException below).
        if (response == null) {
            return;
        }

        // verify response is HTTP OK
        final int statusCode = response.getStatusLine().getStatusCode();
        if (statusCode != HttpStatus.SC_OK) {
            System.out.println("Unexpected HTTP status: " + statusCode);
            return;
        }

        System.out.println("---------------------Status code Info Start---------------------");
        System.out.println(response.getStatusLine());
        System.out.println("---------------------Status code Info end ---------------------");

        System.out.println("---------------------Head Info Start---------------------");
        final Header[] hs = response.getAllHeaders();
        for (final Header h : hs) {
            System.out.println(h.getName() + ":" + h.getValue());
        }
        System.out.println("---------------------Head Info End ---------------------");

        String getResult = null;
        try {
            // response.setEntity(new GzipDecompressingEntity(response.getEntity()));
            // getResult = EntityUtils.toString(response.getEntity(),"UTF-8");
            getResult = getStringFromResponseUzip(response);
        } catch (final Exception exception) {
            // Report the failure instead of silently leaving getResult as null.
            exception.printStackTrace();
        }
        System.out.println(getResult);
    }

    // Returns the response body as text, decompressing it first when Content-Encoding is gzip.
    public static String getStringFromResponseUzip(final HttpResponse response) throws Exception {
        if (response == null) {
            return null;
        }
        String responseText = "";
        final Header[] headers = response.getHeaders("Content-Encoding");
        for (final Header h : headers) {
            System.out.println(h.getValue());
            if (h.getValue().indexOf("gzip") > -1) {
                // For GZip response
                try {
                    final InputStream in = response.getEntity().getContent();
                    final GZIPInputStream gzin = new GZIPInputStream(in);
                    final InputStreamReader isr = new InputStreamReader(gzin, "UTF-8");
                    responseText = getStringFromStream(isr);
                    // Throws IllegalArgumentException if the text contains a '%'
                    // that does not start a valid escape; that is not an IOException.
                    responseText = URLDecoder.decode(responseText, "UTF-8");
                } catch (final IOException exception) {
                    exception.printStackTrace();
                }
                System.out.println("---------------------is gzip---------------------");
                return responseText;
            }
        }
        System.out.println("---------------------is not gzip---------------------");
        responseText = EntityUtils.toString(response.getEntity(), "utf-8");
        return responseText;
    }

    // Reads the whole stream line by line, normalizing line endings to CRLF.
    public static String getStringFromStream(final InputStreamReader isr) throws Exception {
        final BufferedReader br = new BufferedReader(isr);
        final StringBuilder sb = new StringBuilder();
        String tmp;
        while ((tmp = br.readLine()) != null) {
            sb.append(tmp);
            sb.append("\r\n");
        }
        br.close();
        isr.close();
        return sb.toString();
    }
}
Result of running the code above
If the GET request does not set gzip:
---
-------------------------------------------------------
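■ Suspected cause
The site is probably still doing some special processing on its side.
It worked in the morning because the site returned the result without gzip compression,
whereas the same URL requested in the afternoon came back gzip-compressed, so it could not be parsed correctly.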
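If the same URL really does come back compressed at some times and uncompressed at others, one defensive option is to look at the body bytes themselves rather than rely only on the Content-Encoding header. The sketch below checks for the gzip magic bytes (0x1F 0x8B) and decompresses only when they are present. The class BodyDecoder and its method names are hypothetical, and the code assumes the body has already been read fully into a byte array and is UTF-8 text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class BodyDecoder {

    // A gzip stream always starts with the two magic bytes 0x1F 0x8B.
    static boolean looksGzipped(final byte[] body) {
        return body.length >= 2 && (body[0] & 0xFF) == 0x1F && (body[1] & 0xFF) == 0x8B;
    }

    // Decompresses when the body looks gzipped, otherwise decodes the bytes directly.
    static String decodeBody(final byte[] body) {
        if (!looksGzipped(body)) {
            return new String(body, StandardCharsets.UTF_8);
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            final ByteArrayOutputStream out = new ByteArrayOutputStream();
            final byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper used only to build sample input for the demonstration.
    static byte[] gzip(final String text) {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(final String[] args) {
        final String html = "<html>sample</html>";
        System.out.println(decodeBody(gzip(html)));
        System.out.println(decodeBody(html.getBytes(StandardCharsets.UTF_8)));
    }
}
```

With the HttpClient listing, the byte array could be obtained with EntityUtils.toByteArray(response.getEntity()) before calling decodeBody.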
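■ Symptom 1
Running the same code again in the afternoon produced garbled output again.
Adding the decode step (lines 88 and 87 of the listing at the time) did not help either; presumably decoding failed and null was returned directly.
■ Symptom 2
In the morning, running curl against the URL above in a cmd window
returned the page HTML; in the afternoon it no longer did, and the result looked as follows.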
■ Checking the URL encoding
A portion extracted from the garbled output above was confirmed to be URL-encoded, but decoding the entire string returned null.
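A possible explanation for the null result: URLDecoder.decode throws IllegalArgumentException when the text contains a '%' that does not start a valid escape, which is common in whole HTML or JS documents (for example "width:100%;"). A caller that catches and discards that exception would be left with a null result variable, which would match the symptom. The sketch below makes the failure visible; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class UrlDecodeCheck {

    // Returns the decoded text, or null when the input is not valid percent-encoded text.
    static String tryUrlDecode(final String text) {
        try {
            return URLDecoder.decode(text, "UTF-8");
        } catch (final IllegalArgumentException | UnsupportedEncodingException e) {
            System.err.println("URL decoding failed: " + e.getMessage());
            return null;
        }
    }

    public static void main(final String[] args) {
        // A well-formed fragment decodes normally.
        System.out.println(tryUrlDecode("%E4%B8%AD%E6%96%87"));
        // A stray '%' followed by non-hex characters makes the whole call fail.
        System.out.println(tryUrlDecode("width:100%; height:50px"));
    }
}
```

This would also be consistent with a small extracted fragment decoding fine while the full string fails.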
■ Additional note
Also, the garbled response seems to contain far less than the full page HTML returned in the morning!
-------------------------------------------------------
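■ Follow-up (result note 1)
As for the response being gzip at some times and not at others,
the suspected cause is load balancing: each request reaches a different server.
The servers are based on two Nginx variants (Openresty and Tengine).
・Server info when the response is gzip: Tengine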
------------------------------------
------------------------------------
・Server info when the response is not gzip
------------------------------------
TODO
------------------------------------
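To test the load-balancing guess, the Server and Content-Encoding headers of repeated requests could be tallied and compared. The sketch below only does the tallying, on header maps that are assumed to have been collected already. The class ServerTally is hypothetical, the first sample mirrors the Tengine/gzip headers shown earlier, and the second sample ("openresty" without Content-Encoding) is a made-up placeholder, since the non-gzip server info is still a TODO above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ServerTally {

    // Builds a label such as "Tengine / gzip" from one response's headers.
    static String label(final Map<String, String> headers) {
        final String server = headers.getOrDefault("Server", "(no Server header)");
        final String encoding = headers.getOrDefault("Content-Encoding", "identity");
        return server + " / " + encoding;
    }

    // Counts how often each server/encoding combination was seen.
    static Map<String, Integer> tally(final List<Map<String, String>> responses) {
        final Map<String, Integer> counts = new LinkedHashMap<>();
        for (final Map<String, String> headers : responses) {
            counts.merge(label(headers), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(final String[] args) {
        final Map<String, String> gzipped = new HashMap<>();
        gzipped.put("Server", "Tengine");
        gzipped.put("Content-Encoding", "gzip");

        // Hypothetical placeholder for a response that was not compressed.
        final Map<String, String> plain = new HashMap<>();
        plain.put("Server", "openresty");

        System.out.println(tally(Arrays.asList(gzipped, plain, gzipped)));
    }
}
```

If the counts split cleanly by Server value, that would support the guess; if the same server returns both encodings, another cause is more likely.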
----
---