How can I fetch the news list from Sohu's comment ranking page?

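Sohu comment ranking URL: http://comment.news.sohu.com/djpm/
I want to extract the hot news items listed there.
I tried httpclient + htmlcleaner, but it still doesn't work.
The cause: the HTML that comes back is garbled. I tried decoding it as gbk (the charset declared in the page), utf-8, utf-16, and ascii, and none of them works.

Is there a better way to solve this? Or is there a better tool than httpclient for requesting web pages?
Any help would be much appreciated!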

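Additional information:

flootball wrote:

What does the page source look like?
If you can see the content in the page source, httpclient can fetch it.
As for the extraction rules, just analyze the content.

I can see the content in the page source, but httpclient still can't fetch it properly; everything comes back garbled. You can try it yourself.
Only some pages behave this way, for example the click ranking page.
Most other Sohu pages parse fine.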

Additional information:

flootball wrote:

Post your code so we can take a look!
I can't connect to Sohu from here.

HttpClient client = new DefaultHttpClient();
HttpGet httpGet = new HttpGet(urlString);

try {
    HttpResponse response = client.execute(httpGet);
    HttpEntity entity = response.getEntity();
    System.out.println("status line:" + response.getStatusLine());
    if (entity != null) {
        System.out.println("response content length:" + entity.getContentLength());
    } else {
        return;
    }
    InputStream inputStream = entity.getContent();
    Reader reader = new InputStreamReader(inputStream, "gb2312");
    BufferedReader bufferedreader = new BufferedReader(reader);
    String lineString;
    while ((lineString = bufferedreader.readLine()) != null) {
        System.out.println(lineString);
    }
} catch (ClientProtocolException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

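I'm using httpclient 4.1.
I don't think this has anything to do with my code. It is probably a problem with how the httpclient library parses the response: the browser renders the page fine, but decoding it in a program produces garbage.
I also tried C#. It gets the byte stream too, but that byte stream can't be turned back into readable letters and text; it's all garbled.
Please take a look; this has been bugging me for days.
Or could someone recommend a more mature HTTP client library?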

Additional information:

flootball wrote:

HttpClient httpClient = new HttpClient();
GetMethod getMethod = null;
BufferedReader br = null;
BufferedWriter bw = null;
getMethod = new GetMethod(url);
getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
        new DefaultHttpMethodRetryHandler());
int statusCode = httpClient.executeMethod(getMethod);
if (statusCode != HttpStatus.SC_OK) {
    System.err.println("Method failed: " + getMethod.getStatusLine());
}
br = new BufferedReader(new InputStreamReader(getMethod.getResponseBodyAsStream(), "UTF-8"));

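Fill in the rest of the code yourself.
Give this a try.

I've solved it myself. The problem wasn't what you suggested: the HTTP response stream was gzip-compressed. Decompressing it with gzip before decoding fixed it. Thanks a lot for your help, though. The points are yours.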
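
For anyone who lands here with the same symptom, here is a minimal sketch of the "gunzip first, then decode" fix, using only the JDK. It is not the exact code I ended up with: the class name GzipBodyDecoder and its helper methods are made up for illustration, and it assumes the response body has already been read into a byte array. The demo uses UTF-8 so it runs on any JVM; for the Sohu page you would presumably pass Charset.forName("GBK") instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: decodes an HTTP body that may or may not be gzip-compressed.
class GzipBodyDecoder {

    // A gzip stream always starts with the magic bytes 0x1f 0x8b.
    static boolean looksGzipped(byte[] body) {
        return body.length >= 2
                && (body[0] & 0xff) == 0x1f
                && (body[1] & 0xff) == 0x8b;
    }

    // Decompresses the body if it is gzipped, then decodes it with the given charset.
    static String decode(byte[] body, Charset charset) {
        if (!looksGzipped(body)) {
            return new String(body, charset);
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo-only helper that gzips a byte array, standing in for a compressed server response.
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        byte[] compressed = gzip("<title>hot news</title>".getBytes(utf8));
        // Decoding the compressed bytes directly would print garbage;
        // decode() decompresses first and prints the original markup.
        System.out.println(decode(compressed, utf8));
    }
}
```

Checking the magic bytes rather than trusting a header means the same method also works for the pages that come back uncompressed. Inspecting the response's Content-Encoding header would be the more conventional check when the header is available.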
