Building Your Own DSL, Part 3: Managing Crawled Files

fuliang

Blog categories:

Ruby
Machine Learning

Ruby Text Processing

Please credit the source when reposting: http://fuliang.iteye.com/blog/1127437

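The content we extract from crawled pages is stored in date-stamped files, and we often need to work on the files in a given date range for statistics, sampling, or loading into a database. So we need a convenient DSL for this.
We want to specify a few conditions and get back the files that match them, for example: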

data_set = CrawlerDataSet.with_cond do |cond|
     cond.dir("/mydir").
          from_date("2011-05-01").
          to_date("2011-07-08")
end

We can then get the names of the matching files:

data_set.file_names
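
As a rough illustration of what sits behind this call, an inclusive date range can be expanded into one filename prefix per day. This is only a sketch: the YYYYMMDD prefix format is the one used by the implementation later in this post, while the dates and the variable names `from`, `to`, and `prefixes` are chosen here for illustration.

```ruby
require 'date'

# Illustrative dates; any two parseable date strings would do.
from = Date.parse("2011-06-29")
to   = Date.parse("2011-07-02")

# A Range of Date objects steps one day at a time and includes both ends.
prefixes = (from..to).map { |date| date.strftime("%Y%m%d") }

puts prefixes
```

The range crosses the month boundary without any special handling, which is the main reason to iterate over Date objects rather than over integers or strings.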

We can also use the CrawlerFile from "Building Your Own DSL, Part 2":

data_set.each do |file|
   puts file.version
end

A Date range makes all of this easy to implement:

#!/usr/bin/env ruby

require 'date'
require 'crawler_file'

class CrawlerDataSet
	class << self 
		def with_cond
			return yield CrawlerDataSet.new
		end
	end

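	def initialize
		@files = []
		# Default the end date to today so the range is bounded even if to_date is never called.
		@to_date = Date.today
	end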

	def dir(dir)
		@dir = dir
		self
	end
	
	def from_date(from_date)
		@from_date = Date.parse(from_date)
		self
	end

	def to_date(to_date=nil)
		@to_date = if to_date.nil? then Date.today else Date.parse(to_date) end
		self
	end
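	# Walk the date range (both ends included) and collect the files for each day.
	def file_names
		# Reset on every call so repeated calls do not accumulate duplicates.
		@files = []
		(@from_date..@to_date).each do |date|
			date_str = date.strftime("%Y%m%d")
			Dir.glob("#{@dir}/#{date_str}-*dedup").each do |file|
				@files << file
			end
		end
		@files
	end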

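	# Yield a CrawlerFile for each matching file and close it afterwards.
	def each
		file_names.each do |file_name|
			begin
				crawler_file = CrawlerFile.new(file_name)
				yield crawler_file
			ensure
				# crawler_file is nil if CrawlerFile.new itself raised.
				crawler_file.close if crawler_file
			end
		end
	end

	# Same as each, but also yields the file name.
	def each_with_name
		file_names.each do |file_name|
			begin
				crawler_file = CrawlerFile.new(file_name)
				yield crawler_file, file_name
			ensure
				crawler_file.close if crawler_file
			end
		end
	end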
end
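
To try the date-range globbing on its own, without CrawlerFile, something like the following sketch may help. It assumes a throwaway temporary directory and made-up file names; only the "YYYYMMDD-*dedup" pattern comes from the code above, and `sample_names` and `matched` are names invented for this example.

```ruby
require 'date'
require 'fileutils'
require 'tmpdir'

# Hypothetical file names; only the "YYYYMMDD-*dedup" shape comes from the post.
sample_names = %w[20110630-a.dedup 20110701-b.dedup 20110705-c.dedup 20110701-b.raw]

matched = Dir.mktmpdir do |dir|
  # Create empty files so the glob has something to find.
  sample_names.each { |name| FileUtils.touch(File.join(dir, name)) }

  range = Date.parse("2011-06-30")..Date.parse("2011-07-02")

  # One glob per day in the range, flattened into a single list of paths.
  range.flat_map do |date|
    Dir.glob(File.join(dir, "#{date.strftime('%Y%m%d')}-*dedup")).sort
  end.map { |path| File.basename(path) }
end

puts matched
```

Only the two ".dedup" files dated inside the range are returned: the file dated 2011-07-05 is outside the range, and the ".raw" file does not match the pattern.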

2011-07-18 23:26