URLConnection encoding issues

 
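A while ago I ran into encoding problems while scraping web pages, so I wrote a method that checks a page's encoding before each fetch. The code is as follows: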
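// Needs: java.io.BufferedReader, java.io.InputStream, java.io.InputStreamReader,
// java.net.HttpURLConnection, java.net.URL, java.util.regex.Matcher, java.util.regex.Pattern
private static String getEncode(String strUrl){
		// Fall back to the default encoding configured on HttpClient
		String encode = HttpClient.encode;
		InputStream in = null;
		HttpURLConnection con = null;
		try{
			log.debug("Checking URL encoding: " + strUrl);
			URL url = new URL(strUrl);
			con = (HttpURLConnection)url.openConnection();
			// Timeouts only take effect if they are set before the connection is opened
			con.setConnectTimeout(5*1000);
			con.setReadTimeout(10*1000);

			// getContentEncoding() returns Content-Encoding (e.g. gzip), not the charset;
			// the charset is declared in the Content-Type header
			String contentType = con.getContentType();
			log.debug("Content-Type: " + contentType);
			if(contentType != null){
				Matcher hm = Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)",
						Pattern.CASE_INSENSITIVE).matcher(contentType);
				if(hm.find()){
					return hm.group(1);
				}
			}
			in = con.getInputStream();

			// ISO-8859-1 maps every byte to one char, so the ASCII-only meta tag can be scanned safely
			BufferedReader read = new BufferedReader(new InputStreamReader(in, "ISO-8859-1"));
			String inStr = null;

			// Matches both <meta http-equiv="Content-Type" content="...; charset=xxx">
			// and <meta charset="xxx">, regardless of case
			String reg = "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)";
			Pattern p = Pattern.compile(reg, Pattern.CASE_INSENSITIVE);

			while ((inStr = read.readLine()) != null) {
				Matcher m = p.matcher(inStr);
				if(m.find()){
					encode = m.group(1);
					log.debug("code:" + encode);
					break;
				}
			}

		}catch(Exception e){
			log.error(e.getMessage(),e);
		}finally{
			try {
				// in is still null if the connection failed before getInputStream()
				if(in != null){
					in.close();
				}
			} catch (Exception e) {
				// Nothing useful to do if close fails
			}
			if(con != null){
				con.disconnect();
			}
		}
		return encode;
	}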
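
Once the encoding name is known, it still has to be applied when the page bytes are turned into text, and pages sometimes declare a name the JVM does not recognise. Below is a minimal sketch of that step. The class `CharsetDecoding` and the helpers `toCharset` and `decode` are hypothetical names that are not part of the code above, and the UTF-8 fallback is an assumption of mine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDecoding {

    // Hypothetical helper: convert a detected charset name into a Charset,
    // returning the fallback when the name is null, malformed, or unsupported.
    static Charset toCharset(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        String trimmed = name.trim();
        try {
            return Charset.isSupported(trimmed) ? Charset.forName(trimmed) : fallback;
        } catch (IllegalArgumentException e) {
            // isSupported throws for syntactically illegal names (e.g. empty or with spaces)
            return fallback;
        }
    }

    // Hypothetical helper: decode the raw page bytes with the detected charset,
    // assuming UTF-8 when detection produced nothing usable.
    static String decode(byte[] body, String detectedName) {
        return new String(body, toCharset(detectedName, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Bytes as they might arrive from a page served in ISO-8859-1
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(body, "iso-8859-1"));
    }
}
```

In the scraper this would be called with the value returned by `getEncode` and the bytes read from the connection, so a bogus charset declaration degrades to the fallback instead of throwing.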
