Below is the configuration file Spider.properties, found in the config folder under the weblech directory:
# Spider configuration file
#
# All of these settings default to sensible values if not specified.
# Directory in which to save downloaded files, defaults to "."
saveRootDirectory = c:/weblech/sites
# Filename in which to save mailto links
mailtoLogFile = mailto.txt
# Tell the spider to reload HTML pages each time, but not images
# or other files
refreshHTMLs = true
refreshImages = false
refreshOthers = false
# Set the extensions the Spider should use to determine which
# pages are of MIME type text/html. The Spider also learns new
# types as it downloads them.
htmlExtensions = htm,html,shtm,shtml
# Similarly for MIME type image/*
imageExtensions = gif,jpg,jpeg,png,bmp
# URL at which we should start the spider
startLocation = http://www.slashdot.org/
# Whether to do depth first search, or the default breadth
# first search when finding URLs to download
depthFirst = false
# Maximum depth of pages to retrieve (the first page is depth
# 0, links from there depth 1, etc). Setting to 0 is "unlimited"
maxDepth = 2
# Basic URL filtering. URLs must contain this string in order
# to be downloaded by WebLech
urlMatch = slashdot.org
# Basic URL prioritisation. URLs which are "interesting" are
# downloaded first, URLs which are "boring" last.
interestingURLs=pollBooth.pl,faq
boringURLs=article.pl
# User Agent header
userAgent = Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
# Username and password for basic HTTP authentication, if required.
# The same username and password will be used for all authentication
# challenges during a download session.
basicAuthUser = myUser
basicAuthPassword = 1234
# Number of download threads to start
spiderThreads = 1
# How often to checkpoint the Spider. A checkpoint file is named
# "spider.checkpoint" and can be used to start the spider in the
# middle of a run. Setting this value to 0 disables checkpoints.
# Here we checkpoint every 30 seconds
checkpointInterval = 30000
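The file above uses standard Java .properties syntax (`key = value` pairs, `#` comments), so it can be read with `java.util.Properties`. The sketch below is illustrative, not WebLech's actual loading code; the class name, the excerpt of keys, and the fallback defaults are assumptions chosen to mirror the defaults documented in the comments above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class SpiderConfigDemo {
    public static void main(String[] args) throws IOException {
        // A minimal excerpt of the Spider.properties file shown above.
        // In practice you would load from a FileInputStream instead.
        String config = String.join("\n",
            "startLocation = http://www.slashdot.org/",
            "maxDepth = 2",
            "depthFirst = false",
            "spiderThreads = 1");

        Properties props = new Properties();
        props.load(new StringReader(config));

        // getProperty's second argument supplies a default when the key
        // is absent, matching the "sensible defaults" behaviour described
        // in the file's header comment.
        String start = props.getProperty("startLocation", "http://localhost/");
        int maxDepth = Integer.parseInt(props.getProperty("maxDepth", "0"));
        boolean depthFirst = Boolean.parseBoolean(props.getProperty("depthFirst", "false"));
        int threads = Integer.parseInt(props.getProperty("spiderThreads", "1"));

        System.out.println("start=" + start
            + " maxDepth=" + maxDepth
            + " depthFirst=" + depthFirst
            + " threads=" + threads);
    }
}
```

Note that values are always read as strings and parsed by the caller; a typo such as `maxDepth = two` would surface as a `NumberFormatException` at parse time, not at load time.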