Building Your Own DSL, Part 2: Processing Crawled Text

fuliang


Please credit the source when reposting: http://fuliang.iteye.com/blog/1122051

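Each record fetched by our company's crawler is a line of TAB-separated field values, and the record format keeps evolving from version to version as new classification scores and static content-scoring strategies are added. Every text-processing task (sampling, analysis, building classification corpora) used to mean writing boilerplate code and then looking up the position of each field for a given version on the wiki. A good DSL that handles this automatically removes a lot of repetitive work and lets us focus on the task itself.
We want a simple, natural API for this. Our common requirements are:
1. A version change should require almost no code changes, only an addition to the configuration file. For example, when a new version adds travel_confidence, we can use it without modifying any code: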

crawler_file.find_by_travel_confidence(90)
crawler_file.find_by_travel_confidence_gt(50)
...

2. The version is detected automatically and the version number is available:

crawler_file.version

3. The field names are listed in order:

crawler_file.field_names

4. Field names can be searched by pattern:

crawler_file.grep_fields(/url/)

5. Records in a file can be found by a fuzzy or exact value of a field:

# find records by host
crawler_file.find_by_host("www.9tour.cn") do |record|
    printf("%s\t%s\n", record.title, record.host)
end
# fuzzy search on the title field
crawler_file.find_by_title_like(/线路/) do |record|
    puts record.title
end

6. Numeric fields support lookups by comparison: gt (>), ge (>=), eq (=), le (<=), lt (<):

# records whose content_confidence is greater than 50
crawler_file.find_by_content_confidence_gt(50) do |record|
    printf("%s\t%s\n", record.title, record.content_confidence)
end

7. For more complex needs, we can write small field filters to find the records we want:

filter = lambda{|host,title| host == "www.9tour.cn" && title =~ /线路/}

crawler_file.find_by_fields([:host,:title],filter) do |record|
    printf("%s\t%s\n", record.host,record.title)
end

8. The code should be DRY.

Let's see how to implement this. First, we use YAML to configure each version and the position of every field in a record:

v1:
    download_time: 0
    host: 1
    url: 2
    url_md5: 3
    parent_url_md5: 4
    crawl_level: 5
    loading_time: 6
    anchor_text: 7
    title: -4
    keywords: -3
    description: -2
    content: -1
v2:
    download_time: 0
    host: 1
    url: 2
    url_md5: 3
    parent_url_md5: 4
    crawl_level: 5
    loading_time: 6
    http_code: 7
    content_confidence: 8
    anchor_text: 9
    title: -4
    keywords: -3
    description: -2
    content: -1
... # intermediate versions omitted

v9:
    download_time: 0
    host: 1
    url: 2
    url_md5: 3
    parent_url_md5: 4
    crawl_level: 5
    publish_time: 6
    http_code: 7
    content_confidence: 8
    list_confidence: 9
    feeling_confidence: 10
    travel_confidence: 11
    qnc_cat: 12
    qnc_chi: 13
    qnc_zhu: 14
    qnc_xing: 15
    qnc_you: 16
    qnc_gou: 17
    qnc_le: 18
    anchor_text: 19
    raw_title: -10
    title: -9
    keywords: -8
    description: -7
    content: -6
    lda_tag: -5
    location_text: -4
    location_confidence: -3
    hotel_confidence: -2
    gonglue_confidence: -1

and the set of fields that are numeric, across all versions:

num_fields:
- download_time
- crawl_level
- publish_time
- content_confidence
- list_confidence
- feeling_confidence
- travel_confidence
- hotel_confidence
- gonglue_confidence
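The configuration above maps each field name to a position, and negative positions count from the end of the record, just as negative indexes do on a Ruby array. As a rough sketch of how such a mapping can be applied to one line, the following uses a shortened, made-up layout and a made-up sample record (neither comes from the real conf.yml):

```ruby
require 'yaml'

# Hypothetical, shortened layout in the same shape as the config above.
config = YAML.load(<<~YML)
  v1:
    host: 0
    url: 1
    title: -2
    content: -1
  num_fields:
  - crawl_level
YML

# A made-up record with four TAB-separated fields.
line = "www.example.com\thttp://www.example.com/a\tSample title\tSample body\n"

# chomp drops the newline; limit -1 keeps trailing empty fields so that
# negative positions still count from the real end of the record.
fields = line.chomp.split("\t", -1)

# Resolve every configured field name to its value in this record.
record = config["v1"].each_with_object({}) do |(name, index), result|
  result[name] = fields[index]
end

puts record["host"]
puts record["content"]
```

Adding a field to a version then only means adding one line to the YAML; the lookup code stays the same.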

Feature 1: a simple way to detect the version from the number of fields:

class VersionDetector
    # number of TAB-separated fields in a record => format version
    @@field_num_version_map = {
        12 => 1,
        14 => 2,
        15 => 3,
        24 => 4,
        25 => 5,
        16 => 6,
        26 => 7,
        27 => 8,
        30 => 9
    }

    class << self
        # Accepts a path (String) or an open File and returns the version number.
        def detect(file)
            if file.is_a?(String) then
                line = File.open(file) { |f| f.gets }
            elsif file.is_a?(File) then
                before_pos = file.pos
                file.seek(0)
                line = file.gets
                file.seek(before_pos) # restore the caller's read position
            else
                # double quotes are required for #{} interpolation
                raise ArgumentError.new "Argument type: #{file.class} is invalid, must be a String or File"
            end

            raise ArgumentError.new 'Cannot detect version: file is empty' if line.nil?

            # chomp plus limit -1 so trailing empty fields are still counted
            version = @@field_num_version_map[line.chomp.split(/\t/, -1).size]
            raise ArgumentError.new 'Unknown version file format' if version.nil?

            version
        end
    end
end
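To illustrate the idea in isolation, here is a minimal sketch that detects the version from the first line of any IO-like object and then restores the read position. It uses StringIO in place of a real crawl file, and the lambda name and the sample data are made up for the example; only the 12 and 14 field counts are taken from the layouts shown earlier.

```ruby
require 'stringio'

# Field count => version, limited to the two layouts spelled out in the config.
field_count_to_version = { 12 => 1, 14 => 2 }

# Peek at the first record, then put the read position back where it was.
detect_version = lambda do |io|
  saved_pos = io.pos
  io.rewind
  first_line = io.gets
  io.seek(saved_pos)
  first_line && field_count_to_version[first_line.chomp.split("\t", -1).size]
end

# Made-up data: one 12-field record followed by a second line.
first_record = (1..12).map { |i| "f#{i}" }.join("\t")
sample = StringIO.new(first_record + "\n" + "second line\n")

sample.gets                         # the caller has already read one line
version = detect_version.call(sample)
puts version                        # version for a 12-field record
puts sample.gets                    # reading continues from the saved position
```

Restoring the position matters because the same file handle is reused for the queries later on.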

We load the version configuration with YAML:

require 'yaml'

class FieldConfig
    attr_reader :fields_map, :num_fields

    def initialize(version)
        config = YAML.load_file 'conf.yml'
        @fields_map = config["v#{version}"]
        @num_fields = config["num_fields"]
    end
end

We define the record's fields dynamically from the configuration file, so changing the fields requires no code changes:

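class CrawlerRecord
    def self.config(field_config)
        @@field_config = field_config
        attr_reader(*field_config.fields_map.keys) # dynamically define a reader for each field
    end

    def initialize(raw_line)
        @raw_line = raw_line
        # chomp so the last field carries no trailing newline; limit -1 keeps trailing
        # empty fields so negative positions still count from the real end of the record
        fields = raw_line.chomp.split(/\t/, -1)
        @@field_config.fields_map.each do |key,value| # dynamically set the value of each field
            instance_variable_set("@" + key.to_s,fields[value])
        end
    end

    def raw
        @raw_line
    end
end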
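The two pieces of metaprogramming used here are attr_reader called with a computed list of names and instance_variable_set. A self-contained sketch of the same pattern follows; the class name SketchRecord, its configure method, and the three-field layout are invented for the illustration, and the layout is kept in a class-level instance variable rather than a @@ class variable.

```ruby
# Hypothetical field layout; in the article this comes from conf.yml.
layout = { "host" => 0, "title" => -2, "content" => -1 }

class SketchRecord
  # Remember the layout and define one reader method per configured field.
  def self.configure(layout)
    @layout = layout
    attr_reader(*layout.keys)
  end

  def self.layout
    @layout
  end

  def initialize(raw_line)
    fields = raw_line.chomp.split("\t", -1)
    # Store each field in an instance variable named after the field.
    self.class.layout.each do |name, index|
      instance_variable_set("@#{name}", fields[index])
    end
  end
end

SketchRecord.configure(layout)
record = SketchRecord.new("www.example.com\tSample title\tSample body\n")
puts record.title
puts record.respond_to?(:content)
```

Because the readers are generated from the keys, a new field in the layout becomes a new method without touching the class.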

We write a CrawlerFile class to support the features described above:

class CrawlerFile

end

In this class we define the comparison operators supported for numeric fields:

@@num_fields_op = {
		:gt => ">",
		:lt => "<",
		:eq => "==", # equality is ==; a single = is assignment
		:ge => ">=",
		:le => "<="
}
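Comparison operators in Ruby are ordinary methods, so a table like this can also be applied with public_send instead of building a string for eval. The sketch below assumes the field values are plain numeric text; the names comparators and compare are made up for the example.

```ruby
# Operator table keyed like the one above, but holding method names.
comparators = { :gt => :>, :lt => :<, :eq => :==, :ge => :>=, :le => :<= }

# Compare a raw text field against a numeric threshold without eval.
# Float() raises ArgumentError on non-numeric text instead of guessing.
compare = lambda do |raw_value, op, threshold|
  Float(raw_value).public_send(comparators.fetch(op), threshold)
end

puts compare.call("72", :gt, 50)
puts compare.call("50", :le, 50)
puts compare.call("10", :ge, 50)
```

Since nothing from the file is ever evaluated as Ruby code, a malformed field can at worst raise an error.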

Readers for the field names and the version:

attr_reader :field_names, :version

The initializer:

def initialize(path)
	@file = File.new(path) # the underlying file
	@version = VersionDetector.detect(@file) # detect the version
	@@field_config = FieldConfig.new(@version) # load the configuration for that version
	@field_names = @@field_config.fields_map.keys # field names, in configuration order
	CrawlerRecord.config(@@field_config) # let CrawlerRecord generate its field readers
	define_help_method # define the helper methods that provide the other features listed above
end

Implementing define_help_method:

def define_help_method
    CrawlerFile.class_eval do
        # Shared scan loop: passes every matching record to the caller's block, or
        # returns the matches as an array when no block is given.
        define_method :select_records do |block, &matcher|
            matches = []
            @file.each_line do |raw_line|
                record = CrawlerRecord.new(raw_line)
                next unless matcher.call(record)
                block ? block.call(record) : matches << record
            end
            @file.seek(0) # rewind so the next query starts from the first record
            matches unless block
        end
        private :select_records

        @@field_config.fields_map.keys.each do |field|
            # dynamically define the fuzzy (regex) finder find_by_xxx_like for each field
            define_method :"find_by_#{field}_like" do |regex,&block|
                select_records(block) { |record| record.send(field) =~ regex }
            end
            # dynamically define the exact-match finder find_by_xxx for each field
            define_method :"find_by_#{field}" do |value,&block|
                select_records(block) { |record| record.send(field) == value }
            end
        end

        # dynamically define the comparison finders for every numeric field
        @@field_config.num_fields.each do |field|
            next if not @@field_config.fields_map[field] # field absent in this version

            @@num_fields_op.each do |op,op_val|
                define_method :"find_by_#{field}_#{op}" do |value,&block|
                    # convert the text field to a number and call the operator as a
                    # method; this avoids eval on data read from the file
                    select_records(block) { |record| record.send(field).to_f.send(op_val, value) }
                end
            end
        end
    end
end
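The core technique here is define_method driven by a list of field names. As a minimal illustration that runs without a crawl file, the sketch below generates the same kind of finders over an in-memory array; SketchPage, SketchStore, define_finders, and the sample data are all invented for the example.

```ruby
# Hypothetical in-memory stand-in for the records of a crawl file.
SketchPage = Struct.new(:host, :title)

class SketchStore
  def initialize(records)
    @records = records
  end

  # Generate find_by_<field> and find_by_<field>_like for each field name.
  def self.define_finders(*fields)
    fields.each do |field|
      define_method(:"find_by_#{field}") do |value, &block|
        matches = @records.select { |record| record.public_send(field) == value }
        block ? matches.each(&block) : matches
      end

      define_method(:"find_by_#{field}_like") do |regex, &block|
        matches = @records.select { |record| record.public_send(field) =~ regex }
        block ? matches.each(&block) : matches
      end
    end
  end

  define_finders :host, :title
end

store = SketchStore.new([
  SketchPage.new("www.example.com", "Beijing routes"),
  SketchPage.new("www.example.org", "Hotel list")
])

puts store.find_by_host("www.example.org").map(&:title).inspect
store.find_by_title_like(/routes/) { |page| puts page.host }
```

The block passed to define_method is a closure, so each generated method remembers which field it was created for.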

Queries that combine several fields:

def find_by_fields(fields,cond_checker)
    lines = []
    @file.each_line do |raw_line|
        line = CrawlerRecord.new(raw_line)
        # pass the listed field values to the checker, in the order given
        next unless cond_checker.call(*fields.collect{|field| line.send(field) })
        if block_given? then
            yield line
        else
            lines << line
        end
    end
    @file.seek(0) # rewind so the next query starts from the first record
    lines unless block_given? # without a block, return the matching records
end
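The combined query works by collecting the listed field values and splatting them into the checker, so the checker's parameters line up with the field list by position. A small sketch of that mechanism over in-memory data follows; SketchRow, select_by_fields, and the sample rows are made up, and the confidence values are assumed to be numbers already rather than text.

```ruby
# Hypothetical records; content_confidence is assumed to be numeric here.
SketchRow = Struct.new(:host, :title, :content_confidence)

rows = [
  SketchRow.new("www.9tour.cn", "Route A", 80),
  SketchRow.new("www.9tour.cn", "Route B", 30),
  SketchRow.new("www.example.com", "Route C", 90)
]

# Keep the rows for which the checker accepts the listed fields, in order.
select_by_fields = lambda do |records, fields, checker|
  records.select do |record|
    checker.call(*fields.map { |field| record.public_send(field) })
  end
end

# The lambda's parameters correspond to [:host, :content_confidence] by position.
checker = lambda { |host, confidence| host == "www.9tour.cn" && confidence > 50 }

hits = select_by_fields.call(rows, [:host, :content_confidence], checker)
puts hits.map(&:title).inspect
```

A lambda checks its argument count strictly, so the checker must take exactly as many parameters as there are fields in the list; otherwise Ruby raises ArgumentError.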

Closing the file:

def close
   @file.close
end

Building Your Own DSL, Part 3: Managing Crawled Files | Building Your Own DSL, Part 1: Simple Crawler

2011-07-11 23:18