ld362093642

Extracting the main text from HTML, etc.

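Currently learning readability-style content extraction.
A resource a friend gave me: https://github.com/selectingProcess/snacktory (will be removed on request if it infringes)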


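// Extract the main article text from a locally saved HTML file with snacktory.
File f = new File("htmtmp/4186.htm");
Converter c = new Converter();
ArticleTextExtractor extractor = new ArticleTextExtractor();
// Read the file into a string, then let the extractor locate the article body.
JResult res = extractor.extractContent(c.streamToString(new FileInputStream(f)));
System.out.println(res.getText());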


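// Read a file whose lines each contain a quoted URL, fetch every page,
// and save it as <domain>.html (a repeated domain gets the suffix "2").
Set<String> existing = new LinkedHashSet<String>();
HtmlFetcher fetcher = new HtmlFetcher();
try (BufferedReader reader = new BufferedReader(new FileReader("htmtmp/1.htm"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // The URL is the text between the first pair of double quotes.
        int index1 = line.indexOf("\"");
        int index2 = line.indexOf("\"", index1 + 1);
        if (index1 < 0 || index2 < 0)
            continue; // no quoted URL on this line, skip it
        String url = line.substring(index1 + 1, index2);
        String domainStr = SHelper.extractDomain(url, true);
        String counterStr = "";
        // TODO more similarities (a third page from the same domain still overwrites the second)
        // Set.add returns false when the domain has already been seen.
        if (!existing.add(domainStr))
            counterStr = "2";

        String html = fetcher.fetchAsString(url, 20000);
        String outFile = domainStr + counterStr + ".html";
        // try-with-resources closes the writer even if write() throws.
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(outFile))) {
            writer.write(html);
        }
    }
}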
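
The download loop above only tells a first visit to a domain apart from a repeat one, so every repeat ends up with the same "2" suffix. One possible way to handle the TODO is to count the visits per domain. The following is just a minimal sketch under that assumption: `DomainFileNamer` and `nextFileName` are hypothetical names of my own, not part of snacktory, and the example domains are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of snacktory): hands out a distinct
// output file name for every page fetched from the same domain.
public class DomainFileNamer {
    // How many times each domain has been seen so far.
    private final Map<String, Integer> seen = new HashMap<String, Integer>();

    // The first call for a domain returns "<domain>.html"; later calls
    // append the running count: "<domain>2.html", "<domain>3.html", ...
    public String nextFileName(String domain) {
        // merge stores 1 on the first visit, otherwise old count + 1,
        // and returns the updated count.
        int count = seen.merge(domain, 1, Integer::sum);
        return domain + (count == 1 ? "" : String.valueOf(count)) + ".html";
    }

    public static void main(String[] args) {
        DomainFileNamer namer = new DomainFileNamer();
        System.out.println(namer.nextFileName("example.com"));
        System.out.println(namer.nextFileName("example.com"));
        System.out.println(namer.nextFileName("example.org"));
    }
}
```

In the loop, `String outFile = namer.nextFileName(domainStr);` would then stand in for the `existing` set and `counterStr`. Note that this sketch does not guard against a real domain whose name already ends in a digit colliding with a counted name.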



