A Batch-Processing Example with the Stanford Chinese Word Segmenter

icenows

Blog category: Natural Language Processing (NLP)

IE Eclipse

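I had been complaining about word segmentation for a long time. Then I found that Stanford's Chinese word segmenter is very good, so I decided to try it and see how well it works.

The software is so powerful that I have not had time to analyze it carefully. I only studied the demo, read the three or four related classes, and used the API to write a piece of sample code for batch processing.

Note: this segmenter has a learning step and uses the conditional random field (CRF) method, so unless that part is separated out, processing large-scale data would probably be too slow.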
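One way to read "separating out" that step is to build the expensive object once and push every input through the same instance. The sketch below only illustrates that pattern: `SlowSegmenter` is a hypothetical stand-in that splits on whitespace and counts how often it is constructed. It is not the Stanford API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a model that is expensive to load. */
class SlowSegmenter {
    /** Counts how many times the "model" has been loaded. */
    static int loadCount = 0;

    SlowSegmenter() {
        // A real segmenter would deserialize its trained model here.
        loadCount++;
    }

    /** Placeholder segmentation: splits on whitespace. */
    List<String> segment(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }
}

class LoadOnceDemo {
    /** Segments every input with a single shared segmenter instance. */
    static List<List<String>> segmentAll(List<String> inputs) {
        SlowSegmenter segmenter = new SlowSegmenter(); // loaded once, outside the loop
        List<List<String>> results = new ArrayList<>();
        for (String input : inputs) {
            results.add(segmenter.segment(input));
        }
        return results;
    }

    public static void main(String[] args) {
        List<List<String>> out = segmentAll(Arrays.asList("a b c", "d e"));
        System.out.println(out + " loads=" + SlowSegmenter.loadCount);
    }
}
```

The point is only where the constructor call sits: outside the loop, so the loading cost is paid once no matter how many inputs follow.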

Here is the code.
package Test;

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;

/** This is a very simple demo of calling the Chinese Word Segmenter
 *  programmatically. It assumes an input file in UTF8.
 *
 *  Usage: java -mx1g -cp seg.jar SegDemo fileName
 *
 *  This will run correctly in the distribution home directory. To
 *  run in general, the properties for where to find dictionaries or
 *  normalizations have to be set.
 *
 *  @author Christopher Manning
 */
public class Demo {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties(); // configure the segmenter
        props.setProperty("sighanCorporaDict", "data");
        // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
        // props.setProperty("normTableEncoding", "UTF-8");
        // below is needed because CTBSegDocumentIteratorFactory accesses it
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");

        CRFClassifier classifier = new CRFClassifier(props); // initialize the classifier
        // Load the model trained on CTB; pku.gz can be used instead.
        classifier.loadClassifierNoExceptions("data/ctb.gz", props);
        // flags must be re-set after data is loaded
        classifier.flags.setProperties(props);
        // classifier.writeAnswers(classifier.test(args[0]));
        // classifier.testAndWriteAnswers(file);

        // The segmented output sequence. Note that segmentString only applies to
        // Chinese word segmentation. For other languages, consider modifying the
        // source: add a method to AbstractSequenceClassifier that returns the
        // result, modeled on testAndWriteAnswers, with the final print changed
        // to a return.
        List<String> test = classifier.segmentString("面对新世纪，世界各国人民的共同愿望是：继续发展人类以往创造的一切文明成果，克服20世纪困扰着人类的战争和贫困问题，推进和平与发展的崇高事业，创造一个美好的世界。");
        for (String word : test) {
            System.out.println(word);
        }
        // classifier.testAndWriteAnswers("/home/xu/sample.base");
    }
}
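The above is a small experiment. One issue when running it from Eclipse is that the program uses a lot of memory, so I recommend adding the parameter "-Xmx1024m" under Run Configurations > Arguments > VM arguments.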
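A quick way to confirm that the VM argument took effect is to print the maximum heap the JVM reports. This is a minimal sketch. The class name `HeapCheck` is made up for illustration, and the reported figure can come out somewhat below the `-Xmx` value depending on the JVM and garbage collector.

```java
class HeapCheck {
    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() is the most heap this JVM will try to use.
        long maxMiB = toMiB(Runtime.getRuntime().maxMemory());
        System.out.println("Max heap available to this JVM: " + maxMiB + " MiB");
    }
}
```

Run it with the same run configuration as the demo. If the number is far below 1024, the argument was probably entered under program arguments instead of VM arguments.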

With this, it should be clear how to implement batch processing.
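To make the batch step concrete, here is one possible shape for the loop over files. It is a sketch under assumptions, not the author's code: `BatchSegment`, the `.seg` output suffix, and the command-line layout are all invented for illustration, and the segmenter is passed in as a plain `Function` so the example runs without the Stanford jar. With the demo's classifier you would pass a lambda that wraps `classifier.segmentString`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class BatchSegment {
    /**
     * Segments each line of each UTF-8 input file and writes the tokens,
     * space-separated, to "<input file name>.seg" in outDir.
     * Returns the paths that were written.
     */
    static List<Path> run(List<Path> inputs, Path outDir,
                          Function<String, List<String>> segmenter) throws IOException {
        Files.createDirectories(outDir);
        List<Path> written = new ArrayList<>();
        for (Path input : inputs) {
            List<String> outLines = new ArrayList<>();
            for (String line : Files.readAllLines(input, StandardCharsets.UTF_8)) {
                outLines.add(String.join(" ", segmenter.apply(line)));
            }
            Path out = outDir.resolve(input.getFileName().toString() + ".seg");
            Files.write(out, outLines, StandardCharsets.UTF_8);
            written.add(out);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java BatchSegment <outDir> <file>...");
            return;
        }
        // Placeholder segmenter: one token per character.
        Function<String, List<String>> perChar = s -> Arrays.asList(s.split(""));
        List<Path> inputs = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            inputs.add(Paths.get(args[i]));
        }
        for (Path p : run(inputs, Paths.get(args[0]), perChar)) {
            System.out.println("wrote " + p);
        }
    }
}
```

Because the segmenter is created by the caller and handed in, the model is loaded once however many files are processed.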

2009-06-26 02:28