Hi, everyone,
I have been using Scrubyt for a few days and it has worked well in most cases. However, a problem came up when I scraped URLs from Google and Yahoo in the same script. Here is my code:
require 'rubygems'
require 'scrubyt'

Scrubyt.logger = Scrubyt::Logger.new
query = 'ruby'

# Google extractor
google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', query
  submit
  # retrieve result links by XPath
  title "/html/body/div/div/div/a" do
    url "href", :type => :attribute
  end
end # end of extractor

google_file = File.open("google.xml", "w")
google_data.to_xml.write(google_file, 1)
google_file.close

# Yahoo extractor
yahoo_data = Scrubyt::Extractor.define do
  fetch 'http://search.yahoo.com'
  fill_textfield 'p', query
  submit
  # retrieve result links by XPath
  title "/html/body/div/div/div/div/div/div/div/ol/li/div/h3/a" do
    url "href", :type => :attribute
  end
end # end of extractor

yahoo_file = File.open("yahoo.xml", "w")
yahoo_data.to_xml.write(yahoo_file, 1)
yahoo_file.close
Running Environment: Ubuntu 7.04 + Netbeans 6.0 + Scrubyt
google.xml
<root>
<title>
<url>http://www.ruby-lang.org/</url>
</title>
<title>
<url>http://www.ruby-lang.org/en/20020101.html</url>
</title>
...
</root>
yahoo.xml
<root>
<title>
<url>http://rds.yahoo.com/_ylt=A0oGklhqbodHe08AchtXNyoA;_ylu=X3oDMTE5MXY5dDllBHNlYwNzcgRwb3MDMQRjb2xvA3NrMQR2dGlkA1lTMTk4XzgyBGwDV1Mx/SIG=11ff2e34s/EXP=1200144362/**http%3a//www.ruby-lang.org/en</url>
</title>
<title>
<url>http://rds.yahoo.com/_ylt=A0oGklhqbodHe08AdBtXNyoA;_ylu=X3oDMTE5cHJpN25qBHNlYwNzcgRwb3MDMgRjb2xvA3NrMQR2dGlkA1lTMTk4XzgyBGwDV1Mx/SIG=12aq03736/EXP=1200144362/**http%3a//en.wikipedia.org/wiki/Ruby_programming_language</url>
</title>
...
</root>
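The Yahoo links above are redirect URLs that carry the real target after a `/**` marker, URL-encoded. As a stopgap they can be unwrapped after extraction. The following is a minimal sketch, assuming the target always follows `/**` as in the two results shown; `unwrap_yahoo_url` is a hypothetical helper, not part of Scrubyt:

```ruby
require 'uri'

# Hypothetical helper (not part of Scrubyt): return the target of a Yahoo
# redirect link. Assumes the target follows the "/**" marker and is
# URL-encoded, as in the yahoo.xml output above. Links without the marker
# are returned unchanged.
# Note: URI.decode_www_form_component also turns "+" into a space.
def unwrap_yahoo_url(url)
  marker = '/**'
  index = url.index(marker)
  return url if index.nil?

  URI.decode_www_form_component(url[(index + marker.length)..-1])
end

redirect = 'http://rds.yahoo.com/_ylt=A0oGklhqbodHe08AchtXNyoA;' \
           '_ylu=X3oDMTE5MXY5dDllBHNlYwNzcgRwb3MDMQRjb2xvA3NrMQR2dGlkA1lTMTk4XzgyBGwDV1Mx' \
           '/SIG=11ff2e34s/EXP=1200144362/**http%3a//www.ruby-lang.org/en'

puts unwrap_yahoo_url(redirect)  # prints http://www.ruby-lang.org/en
```

This only cleans up the output; it does not explain why the two extractors affect each other.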
If I switch the order of the two extractors, that is, define the Yahoo extractor first, the results change:
google.xml
<root/>
yahoo.xml
<root>
<title>
<url>http://www.ruby-lang.org/en</url>
</title>
<title>
<url>http://en.wikipedia.org/wiki/Ruby_programming_language</url>
</title>
...
</root>
It seems that the later extractor is influenced by the earlier one. Since the XPath I used for Yahoo is longer than the one for Google, the Google result is empty when the Yahoo extractor is defined first.
Why does this happen, and how can I work around it? Thanks in advance.
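One way to test the suspicion that something carries over between the two extractors is to run each one in its own child process, so that no in-memory state can be shared. This is a minimal sketch, assuming a platform where `Kernel#fork` is available (such as the Ubuntu setup above); `run_isolated` is a hypothetical helper, not part of Scrubyt, and this is a diagnostic experiment rather than a confirmed fix:

```ruby
# Hypothetical helper (not part of Scrubyt): run the given block in a child
# process and wait for it to finish. Returns true if the child exited with
# status 0. Anything the block changes in memory stays in the child.
def run_isolated
  pid = fork do
    yield
  end
  Process.wait(pid)
  $?.success?
end

# Intended use (assumption: each block holds one extractor definition from
# the script above, plus the code that writes its XML file):
#
#   run_isolated { ... define the Google extractor, write google.xml ... }
#   run_isolated { ... define the Yahoo extractor, write yahoo.xml ... }
```

If both files come out the same regardless of order when run this way, that would suggest the extractors share state inside a single process.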