
Using the Lobo Cobra Toolkit to Retrieve the HTML of Rendered Pages



Cobra Parsing: Disable Persistent Connections and Set Socket Timeouts

I use the Cobra Toolkit to parse web pages for various projects. In a project that runs eight concurrent parsers, I found that some of the parsers would hang indefinitely in a socket read during JavaScript processing. I think, but have not confirmed, that most of the hung sockets are related to persistent connections to Google's Safe Browsing and YouTube servers in the 1e100.net domain. So I wanted a way to disable persistent connections in Cobra. Because a server may also simply fail to respond, URLConnection read timeouts would be a useful option, too.
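The hang itself can be reproduced with the JDK alone. The following is a minimal sketch, not part of Cobra: it assumes a local server socket that accepts TCP connections but never sends a response, and the names SilentServerDemo and timesOut are made up for illustration. By default a URLConnection has a read timeout of 0, which means the read blocks indefinitely; with setReadTimeout() the read should fail with a SocketTimeoutException instead.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical demo class; not part of the Cobra Toolkit.
final class SilentServerDemo {

    /** Returns true if reading from a server that never answers ends in a timeout. */
    static boolean timesOut(int timeoutMillis) throws IOException {
        // The listening socket completes TCP handshakes through its backlog,
        // but accept() is never called and no HTTP response is ever written.
        try (ServerSocket silent =
                 new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))) {
            URI uri = URI.create("http://127.0.0.1:" + silent.getLocalPort() + "/");

            // NO_PROXY keeps any system proxy settings out of the experiment.
            URLConnection connection = uri.toURL().openConnection(Proxy.NO_PROXY);
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            connection.setRequestProperty("Connection", "close");

            try {
                // Sends the request, then waits for a response that never comes.
                connection.getInputStream().read();
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + timesOut(500));
    }
}
```

Setting the timeout to 0 in this sketch would bring back the behavior described above: the call would never return.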

I created some simple code that causes URLConnection objects created by Cobra to be configured with a timeout and to optionally disable persistent connections.

The Cobra DOM parser requires an org.lobobrowser.html.UserAgentContext object. A sample UserAgentContext object is provided in org.lobobrowser.html.test.SimpleUserAgentContext. This class has a createHttpRequest() method that returns an org.lobobrowser.html.test.SimpleHttpRequest object whenever the parser needs to open a socket connection. In one of the several open() methods, SimpleHttpRequest creates a URLConnection object. The timeout and persistence settings need to be applied to the URLConnection.

I created two simple classes in the package com.benjysbrain.CobraExtension:

  • CobraUserAgentContext extends SimpleUserAgentContext.
  • CobraHttpRequest extends SimpleHttpRequest.

CobraUserAgentContext

The CobraUserAgentContext constructor takes two parameters:

  • int timeout - the timeout in milliseconds used on the URLConnection setReadTimeout() and setConnectTimeout() methods
  • boolean persistent - false to disable persistent connections

The createHttpRequest() method returns a CobraHttpRequest object that has been configured with the timeout and persistence values.

CobraHttpRequest

The new constructor for this class takes the UserAgentContext and proxy parameters of the parent class and adds the timeout and persistent settings described for CobraUserAgentContext. The five-parameter open() method is overridden. When it is called, the parent's five-parameter open() method creates the URLConnection, and then the timeout and persistence settings are applied to it.
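Stripped of the Cobra classes, the settings applied after open() come down to three standard java.net.URLConnection calls. The sketch below uses only the JDK so that it can be run without the Cobra jar; the class name ConnectionSettings, the method name configure, and the example.com URL are illustrative assumptions, not part of Cobra or of the original code.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper; shows the same calls the overridden open() makes.
final class ConnectionSettings {

    /** Applies one timeout to both connect and read, and optionally asks for a non-persistent connection. */
    static URLConnection configure(URLConnection connection,
                                   int timeoutMillis,
                                   boolean persistent) {
        connection.setReadTimeout(timeoutMillis);
        connection.setConnectTimeout(timeoutMillis);
        if (!persistent) {
            // Asks the server to close the connection after this response.
            connection.setRequestProperty("Connection", "close");
        }
        return connection;
    }

    public static void main(String[] args) throws IOException {
        // openConnection() only creates the object; nothing is sent yet.
        URLConnection c = configure(
                URI.create("http://example.com/").toURL().openConnection(),
                60000, false);
        System.out.println("read timeout (ms): " + c.getReadTimeout());
        System.out.println("connect timeout (ms): " + c.getConnectTimeout());
    }
}
```

These calls must be made before the connection is actually opened, which is why the override applies them right after the parent's open() has created the URLConnection.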

Using the code

Substitute CobraUserAgentContext for SimpleUserAgentContext in your programs. Use the constructor that allows you to set timeouts and persistence values, and pass the object to org.lobobrowser.html.parser.DocumentBuilderImpl. Whenever the parser needs a new URL connection, it will use the CobraHttpRequest object, which sets the timeouts and persistence settings.

To compile the code, listed below, you will need Java 1.5 or later, as the timeout methods of URLConnection are not available in earlier versions. Create the com.benjysbrain.CobraExtension directory structure, put the source in the leaf directory, add the Cobra Toolkit jar to your classpath, and compile the source. I recommend setting a timeout value of about one minute, but you might want to increase this depending on the responsiveness of the servers from which pages are parsed.

 

package com.benjysbrain.CobraExtension ;

import org.lobobrowser.html.test.* ;
import org.lobobrowser.html.* ;

/** CobraUserAgentContext is a subclass of
    org.lobobrowser.html.test.SimpleUserAgentContext that overrides the
    createHttpRequest() method to provide an HttpRequest object with a
    URLConnection object with timeouts and other properties. In addition
    to the new createHttpRequest() method, a new constructor has been added.
    <p>
    The Cobra Toolkit (http://lobobrowser.org/cobra.jsp) is part of the
    Lobo Project.
    <p>
    Java 1.5 or later required.
    <p>
    Copyright 2010 by Ben E. Cline. This source code is provided for
    educational purposes "as is" with no warranty. If you use the code,
    please acknowledge the author.
    <p>
    http://www.benjysbrain.com

    @author Benjy Cline
*/

public class CobraUserAgentContext extends SimpleUserAgentContext {

    /** The read timeout and connection timeout in milliseconds. */
    int timeout ;

    /** If false, HttpRequest objects have the "Connection: close"
        property set to discourage persistent connections. */
    boolean persistent ;

    /** Create a CobraUserAgentContext object where createHttpRequest()
        returns an HttpRequest object with a URLConnection object with
        the specified timeout and persistence setting. */
    public CobraUserAgentContext(int timeout, boolean persistent) {
        super() ;
        this.timeout = timeout ;
        this.persistent = persistent ;
    }

    /** Create an HttpRequest object, used to load images, scripts, etc.,
        with timeout and persistence values. */
    public HttpRequest createHttpRequest() {
        return new CobraHttpRequest(this, this.getProxy(), timeout,
                                    persistent) ;
    }
}

 

package com.benjysbrain.CobraExtension ;

import org.lobobrowser.html.test.* ;
import org.lobobrowser.html.* ;
import java.io.* ;

/** CobraHttpRequest is a subclass of
    org.lobobrowser.html.test.SimpleHttpRequest. It adds a constructor and
    a modified version of the open() method. If the new constructor is
    used, a timeout and persistent state are used during open() calls to
    configure the URLConnection object. See
    com.benjysbrain.CobraExtension.CobraUserAgentContext.
    <p>
    The Cobra Toolkit (http://lobobrowser.org/cobra.jsp) is part of the
    Lobo Project.
    <p>
    Java 1.5 or later required.
    <p>
    Copyright 2010 by Ben E. Cline. This source code is provided for
    educational purposes "as is" with no warranty. If you use the code,
    please acknowledge the author.
    <p>
    http://www.benjysbrain.com
    <p>
    @author Benjy Cline
*/

public class CobraHttpRequest extends SimpleHttpRequest {

    /** The read timeout and connection timeout in milliseconds.
        The default is 30 minutes. */
    int timeout = 1000*60*30 ;

    /** If false, HttpRequest objects have the "Connection: close"
        property set to discourage persistent connections. */
    boolean persistent = false ;

    /** Create an HttpRequest object whose open() methods create
        URLConnection objects with timeout and persistence values. */
    public CobraHttpRequest(UserAgentContext context, java.net.Proxy proxy,
                            int timeout, boolean persistent) {
        super(context, proxy) ;
        this.timeout = timeout ;
        this.persistent = persistent ;
    }

    /** Override the primary open() method so that the URLConnection
        object can be configured. */
    public void open(final String method, final java.net.URL url,
                     boolean asyncFlag, final String userName,
                     final String password) throws java.io.IOException {
        // Let the parent create the URLConnection, then configure it.
        super.open(method, url, asyncFlag, userName, password) ;
        connection.setReadTimeout(timeout) ;
        connection.setConnectTimeout(timeout) ;
        if(!persistent)
            connection.setRequestProperty("Connection", "close") ;
    }
}

These classes are not particularly general, but they can serve as a model for more elegant code. If you have questions or comments or if you discover errors in this page or the code, please let me know at the e-mail address in the footer of this page.


This page © copyright 2010 by Ben E. Cline.  E-Mail:  

 

 

 

 
