Cobra Parsing: Disable Persistent Connections and Set Socket Timeouts
I use the Cobra Toolkit to parse web pages for various projects. In a project that has eight concurrent parsers running, I found that some of the parsers would hang indefinitely in a socket read during JavaScript processing. I think, but have not confirmed, that most of the hung sockets are related to persistent connections to Google's safe browser/YouTube servers in the 1E100.net domain. So I wanted a method to disable persistent connections in Cobra. As it's also possible for a server not to respond, URLConnection read timeouts would be a useful option, too.
I created some simple code that causes URLConnection objects created by Cobra to be configured with a timeout and to optionally disable persistent connections.
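The effect of a read timeout on a server that never answers can be seen in isolation, without Cobra. The following is a minimal sketch using only java.net; the class name ReadTimeoutDemo and the method readTimesOut() are made up for illustration, and the "server" is just a local socket that listens but never replies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

/** Hypothetical demo class; not part of Cobra or of the code in this article. */
public class ReadTimeoutDemo {

    /** Returns true if a read from a silent local server gives up after the timeout. */
    public static boolean readTimesOut(int timeoutMillis) {
        // Listen on an ephemeral loopback port but never accept or reply.
        // The operating system still completes the TCP handshake.
        try (ServerSocket silent =
                 new ServerSocket(0, 50, InetAddress.getByName("127.0.0.1"))) {
            URL url = URI.create(
                "http://127.0.0.1:" + silent.getLocalPort() + "/").toURL();
            // Bypass any system proxy so the request really goes to the local socket.
            URLConnection c = url.openConnection(Proxy.NO_PROXY);
            c.setConnectTimeout(timeoutMillis);
            c.setReadTimeout(timeoutMillis);
            try (InputStream in = c.getInputStream()) {
                in.read();
                return false;   // the server answered, which is not expected here
            } catch (SocketTimeoutException e) {
                return true;    // the read was abandoned instead of hanging
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

Without the setReadTimeout() call the default read timeout is zero, which means the read would block indefinitely.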
The Cobra DOM parser requires an org.lobobrowser.html.UserAgentContext object. A sample UserAgentContext object is provided in org.lobobrowser.html.test.SimpleUserAgentContext. This class has a createHttpRequest() method that returns an org.lobobrowser.html.test.SimpleHttpRequest object whenever the parser needs to open a socket connection. In one of the several open() methods, SimpleHttpRequest creates a URLConnection object. The timeout and persistence settings need to be applied to the URLConnection.
I created two simple classes in the package com.benjysbrain.CobraExtension:
- CobraUserAgentContext extends SimpleUserAgentContext.
- CobraHttpRequest extends SimpleHttpRequest.
CobraUserAgentContext
The CobraUserAgentContext constructor takes two parameters:
- int timeout - the timeout in milliseconds used on the URLConnection setReadTimeout() and setConnectTimeout() methods
- boolean persistent - false to disable persistent connections
The createHttpRequest() method returns a CobraHttpRequest object that has been configured with the timeout and persistence values.
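The two settings map onto three standard URLConnection calls. As a rough sketch, independent of Cobra, they can be applied to any connection that has not yet been connected; the class ConnectionSettings and its methods are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

/** Hypothetical helper; not part of Cobra or of the code in this article. */
public class ConnectionSettings {

    /** Applies the timeout and persistence settings to an unconnected URLConnection. */
    public static URLConnection configure(URLConnection c, int timeoutMillis,
                                          boolean persistent) {
        c.setConnectTimeout(timeoutMillis);   // limit on establishing the connection
        c.setReadTimeout(timeoutMillis);      // limit on each blocking read
        if (!persistent) {
            // Ask the server to close the connection after the response.
            c.setRequestProperty("Connection", "close");
        }
        return c;
    }

    /** Creates a connection object for the URL (no network I/O yet) and configures it. */
    public static URLConnection open(String url, int timeoutMillis, boolean persistent) {
        try {
            return configure(URI.create(url).toURL().openConnection(),
                             timeoutMillis, persistent);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // One minute, non-persistent; the URL is only a placeholder.
        URLConnection c = open("http://example.com/", 60 * 1000, false);
        System.out.println("read timeout (ms): " + c.getReadTimeout());
        System.out.println("Connection header: " + c.getRequestProperty("Connection"));
    }
}
```

Both the timeouts and the request property must be set before the connection is actually made; setRequestProperty() throws IllegalStateException once the connection is established.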
CobraHttpRequest
The new constructor for this class takes the UserAgentContext and proxy parameters of the parent class and adds the timeout and persistent settings described for CobraUserAgentContext. The five-parameter open() method is overridden. When it is called, the parent's five-parameter open() method creates a URLConnection, and then the timeout and persistence settings are applied to it.
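The override pattern can be sketched without the Cobra jar. In the example below, BaseRequest is a made-up stand-in for a parent class that creates the URLConnection in open() and keeps it in a protected field; it is not Cobra's SimpleHttpRequest, and its two-parameter open() is a simplification of the real five-parameter method.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

/** Hypothetical illustration of the subclass-and-override approach. */
public class OverrideOpenSketch {

    /** Stand-in for the parent request class (an assumption, not Cobra code). */
    static class BaseRequest {
        protected URLConnection connection;

        public void open(String method, URL url) throws IOException {
            // Creates the connection object only; nothing is sent yet.
            connection = url.openConnection();
        }
    }

    /** Subclass that configures the connection right after the parent creates it. */
    static class TimedRequest extends BaseRequest {
        private final int timeout;
        private final boolean persistent;

        TimedRequest(int timeout, boolean persistent) {
            this.timeout = timeout;
            this.persistent = persistent;
        }

        @Override
        public void open(String method, URL url) throws IOException {
            super.open(method, url);
            connection.setReadTimeout(timeout);
            connection.setConnectTimeout(timeout);
            if (!persistent) {
                connection.setRequestProperty("Connection", "close");
            }
        }
    }

    /** Opens a request with the given settings and returns its configured connection. */
    public static URLConnection openConfigured(String url, int timeout,
                                               boolean persistent) {
        try {
            TimedRequest request = new TimedRequest(timeout, persistent);
            request.open("GET", URI.create(url).toURL());
            return request.connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection c = openConfigured("http://example.com/", 60 * 1000, false);
        System.out.println("connect timeout (ms): " + c.getConnectTimeout());
    }
}
```

This only works because the parent's open() creates the connection without sending the request, so the subclass still has a chance to change the settings before any data is exchanged.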
Using the code
Substitute CobraUserAgentContext for SimpleUserAgentContext in your programs. Use the constructor that allows you to set timeouts and persistence values, and pass the object to org.lobobrowser.html.parser.DocumentBuilderImpl. Whenever the parser needs a new URL connection, it will use the CobraHttpRequest object, which sets the timeouts and persistence settings.
To compile the code, listed below, you will need Java 1.5 or greater as the timeout methods of URLConnection are not in earlier versions. Create the com.benjysbrain.CobraExtension directory structure, put the source in the leaf directory, add the Cobra Toolkit jar to your classpath, and compile the source. I recommend setting a timeout value of about one minute (60000 milliseconds), but you might want to increase this depending on the responsiveness of the servers from which pages are parsed.
package com.benjysbrain.CobraExtension ;
import org.lobobrowser.html.test.* ;
import org.lobobrowser.html.* ;
/** CobraUserAgentContext is a subclass of
org.lobobrowser.html.test.SimpleUserAgentContext that overrides the
createHttpRequest() method to provide an HttpRequest object with
a URLConnection object with timeouts and other properties. In addition
to the new createHttpRequest() method, a new constructor has been
added.
<p>
The Cobra Toolkit (http://lobobrowser.org/cobra.jsp) is part of
the Lobo Project.
<p>
Java 1.5 or later required.
<p>
Copyright 2010 by Ben E. Cline. This source code is provided
for educational purposes "as is" with no warranty. If you use
the code, please acknowledge the author.
<p>
http://www.benjysbrain.com
@author Benjy Cline
*/
public class CobraUserAgentContext extends SimpleUserAgentContext {
    /** The read timeout and connection timeout in milliseconds. */
    int timeout ;

    /** If false, HttpRequest objects have the "Connection : close" property
        set to discourage persistent connections. */
    boolean persistent ;

    /** Create a CobraUserAgentContext object where createHttpRequest()
        returns an HttpRequest object with a URLConnection object with
        the specified timeout and persistence setting. */
    public CobraUserAgentContext(int timeout, boolean persistent) {
        super() ;
        this.timeout = timeout ;
        this.persistent = persistent ;
    }

    /** Create an HttpRequest object, used to load images, scripts, etc.,
        with timeout and persistence values. */
    public HttpRequest createHttpRequest() {
        return new CobraHttpRequest(this, this.getProxy(), timeout, persistent) ;
    }
}
package com.benjysbrain.CobraExtension ;
import org.lobobrowser.html.test.* ;
import org.lobobrowser.html.* ;
import java.io.* ;
/**
CobraHttpRequest is a subclass of
org.lobobrowser.html.test.SimpleHttpRequest. It adds a constructor
and a modified version of the open() method. If the new constructor
is used, a timeout and persistent state are used during open() calls
to configure the URLConnection object. See
com.benjysbrain.CobraExtension.CobraUserAgentContext.
<p>
The Cobra Toolkit (http://lobobrowser.org/cobra.jsp) is part of
the Lobo Project.
<p>
Java 1.5 or later required.
<p>
Copyright 2010 by Ben E. Cline. This source code is provided
for educational purposes "as is" with no warranty. If you use
the code, please acknowledge the author.
<p>
http://www.benjysbrain.com
<p>
@author Benjy Cline
*/
public class CobraHttpRequest extends SimpleHttpRequest {
    /** The read timeout and connection timeout in milliseconds
        (default: 30 minutes). */
    int timeout = 1000*60*30 ;

    /** If false, HttpRequest objects have the "Connection : close" property
        set to discourage persistent connections. */
    boolean persistent = false ;

    /** Create an HttpRequest object whose open() methods create
        URLConnection objects with timeout and persistence values.
    */
    public CobraHttpRequest(UserAgentContext context, java.net.Proxy proxy,
                            int timeout, boolean persistent) {
        super(context, proxy) ;
        this.timeout = timeout ;
        // The parameter must be named "persistent"; with any other name this
        // line would assign the field to itself and ignore the argument.
        this.persistent = persistent ;
    }

    /** Override the primary open() method so that the URLConnection object
        can be configured. */
    public void open(final String method, final java.net.URL url,
                     boolean asyncFlag, final String userName,
                     final String password) throws java.io.IOException {
        super.open(method, url, asyncFlag, userName, password) ;
        connection.setReadTimeout(timeout) ;
        connection.setConnectTimeout(timeout) ;
        if(!persistent)
            connection.setRequestProperty("Connection", "close") ;
    }
}
These classes are not particularly general, but they can serve as a model for more elegant code. If you have questions or comments or if you discover errors in this page or the code, please let me know at the e-mail address in the footer of this page.
This page © copyright 2010 by Ben E. Cline. E-Mail: