idl.tmt.documentsource.webcrawl
Class WgetWebCrawlContext

java.lang.Object
  |
  +--idl.tmt.documentsource.webcrawl.WgetWebCrawlContext
All Implemented Interfaces:
DocumentProvider, FilesystemDocumentProvider, WebCrawlContext

public class WgetWebCrawlContext
extends java.lang.Object
implements WebCrawlContext, FilesystemDocumentProvider

Provides the context for a web crawl performed with wget. The crawl must have been run with the options '-nv -o [logfilename]' so that wget writes a log file in the format this class expects. See the wget manual for detailed documentation of these and other options. Created on Feb 27, 2004.

Author:
jelsas

Field Summary
private static java.lang.String DEFAULT_INDEX
           
private  int docCount
           
private  MultiMapDocumentIDMap docIDMap
           
private  java.util.Iterator localDocIterator
           
private  java.io.File localRootDir
           
private  MultiMapURLMapper urlMapper
           
 
Constructor Summary
WgetWebCrawlContext(java.io.File wgetLogFile, java.io.File localRootDir)
          Creates a new WgetWebCrawlContext object.
 
Method Summary
 int documentCount()
          Returns a count of documents
 DocumentIDMapper getDocumentIDMapper()
          Returns the DocumentIDMapper for this web crawl
 FilesystemDocumentProvider getDocumentProvider()
          Returns a reference to the FilesystemDocumentProvider
 java.io.File getNextDocument()
          Returns the next document as a local file
 java.net.URL[] getRemoteCrawlRoots()
          This method is unsupported for this implementation, and returns null.
 java.io.File getRoot()
          Returns the local root directory where the mirrored documents are located.
 URLMapper getURLMapper()
          Returns the URLMapper object
 boolean hasMoreDocuments()
          Checks if there are more documents to be returned
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, toString, wait, wait, wait
 

Field Detail

localRootDir

private java.io.File localRootDir

urlMapper

private MultiMapURLMapper urlMapper

docIDMap

private MultiMapDocumentIDMap docIDMap

localDocIterator

private java.util.Iterator localDocIterator

docCount

private int docCount

DEFAULT_INDEX

private static final java.lang.String DEFAULT_INDEX
Constructor Detail

WgetWebCrawlContext

public WgetWebCrawlContext(java.io.File wgetLogFile,
                           java.io.File localRootDir)
                    throws java.io.IOException
Creates a new WgetWebCrawlContext object. This constructor reads the entire wget log file and populates an in-memory mapping of all the remote and local URLs.
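
The log-reading step described above can be sketched roughly as follows. This is not the class's actual implementation: the helper name WgetLogSketch, the regular expression, and the sample log line are assumptions for illustration, based on the usual shape of a 'wget -nv' log entry (URL:<remote> [<size>] -> "<local path>"). The sketch keeps one local path per remote URL, whereas the real class stores its mapping in a MultiMapURLMapper whose behavior is not documented here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the documented API.
class WgetLogSketch {
    // Assumed shape of a 'wget -nv' log entry:
    //   <timestamp> URL:http://example.com/ [2345/2345] -> "example.com/index.html" [1]
    private static final Pattern ENTRY =
        Pattern.compile("URL:(\\S+) \\[[^\\]]*\\] -> \"([^\"]+)\"");

    // Builds a remote-URL -> local-path map from the full text of a log.
    static Map<String, String> parse(String logText) {
        Map<String, String> remoteToLocal = new LinkedHashMap<>();
        for (String line : logText.split("\\R")) {
            Matcher m = ENTRY.matcher(line);
            if (m.find()) {
                // group(1) is the remote URL, group(2) the mirrored file path
                remoteToLocal.put(m.group(1), m.group(2));
            }
            // Lines that do not match (errors, summary lines) are skipped.
        }
        return remoteToLocal;
    }

    public static void main(String[] args) {
        String sample =
            "2004-02-27 10:15:30 URL:http://example.com/ [2345/2345] -> \"example.com/index.html\" [1]\n"
          + "2004-02-27 10:15:31 URL:http://example.com/a.html [120] -> \"example.com/a.html\" [1]\n";
        System.out.println(parse(sample));
    }
}
```

Because the whole log is held in memory, as the constructor description states, very large crawls may need a correspondingly large heap.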
Method Detail

getURLMapper

public URLMapper getURLMapper()
Returns the URLMapper object
Specified by:
getURLMapper in interface WebCrawlContext
See Also:
WebCrawlContext.getURLMapper()

getRemoteCrawlRoots

public java.net.URL[] getRemoteCrawlRoots()
This method is unsupported for this implementation, and returns null. It is not possible to know the roots of the web crawl from the log file generated by wget with the options above.
Specified by:
getRemoteCrawlRoots in interface WebCrawlContext
See Also:
WebCrawlContext.getRemoteCrawlRoots()
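
Since getRemoteCrawlRoots() returns null for this implementation, callers that work against the WebCrawlContext interface may want to guard against it. A minimal sketch, assuming a stand-in interface (CrawlRootsSource) that mirrors only this one method; both it and the helper rootsOrEmpty are hypothetical and not part of the documented API:

```java
import java.net.URL;

// Hypothetical stand-in for the single WebCrawlContext method used here.
interface CrawlRootsSource {
    URL[] getRemoteCrawlRoots();
}

// Hypothetical helper, not part of the documented API.
class CrawlRootsSketch {
    // Normalizes a null result to an empty array so callers can iterate safely.
    static URL[] rootsOrEmpty(CrawlRootsSource context) {
        URL[] roots = context.getRemoteCrawlRoots();
        return roots == null ? new URL[0] : roots;
    }

    public static void main(String[] args) {
        // Behaves like the wget-based context: the roots are unknown.
        CrawlRootsSource wgetLike = () -> null;
        for (URL root : rootsOrEmpty(wgetLike)) {
            System.out.println(root);
        }
        System.out.println("roots known: " + rootsOrEmpty(wgetLike).length);
    }
}
```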

getDocumentProvider

public FilesystemDocumentProvider getDocumentProvider()
Returns a reference to the FilesystemDocumentProvider
Specified by:
getDocumentProvider in interface WebCrawlContext
See Also:
WebCrawlContext.getDocumentProvider()

documentCount

public int documentCount()
Returns a count of documents
Specified by:
documentCount in interface FilesystemDocumentProvider
See Also:
FilesystemDocumentProvider.documentCount()

getRoot

public java.io.File getRoot()
Returns the local root directory where the mirrored documents are located.
Specified by:
getRoot in interface FilesystemDocumentProvider
See Also:
FilesystemDocumentProvider.getRoot()

getNextDocument

public java.io.File getNextDocument()
Returns the next document as a local file
Specified by:
getNextDocument in interface DocumentProvider
See Also:
DocumentProvider.getNextDocument()

hasMoreDocuments

public boolean hasMoreDocuments()
Checks if there are more documents to be returned
Specified by:
hasMoreDocuments in interface DocumentProvider
See Also:
DocumentProvider.hasMoreDocuments()
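
The hasMoreDocuments() and getNextDocument() pair is typically consumed in a simple loop. The sketch below uses a stand-in interface (DocumentProviderSketch) that mirrors the two method signatures documented above, plus an iterator-backed implementation that is only a guess at how the private localDocIterator field might be used. All names in it are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in mirroring the two documented DocumentProvider methods.
interface DocumentProviderSketch {
    boolean hasMoreDocuments();
    File getNextDocument();
}

// Hypothetical helper, not part of the documented API.
class DocumentLoopSketch {
    // Collects every remaining document by polling the provider until it is exhausted.
    static List<File> drain(DocumentProviderSketch provider) {
        List<File> docs = new ArrayList<>();
        while (provider.hasMoreDocuments()) {
            docs.add(provider.getNextDocument());
        }
        return docs;
    }

    // Wraps a list of files in a provider backed by a java.util.Iterator.
    static DocumentProviderSketch over(List<File> files) {
        final Iterator<File> it = files.iterator();
        return new DocumentProviderSketch() {
            public boolean hasMoreDocuments() { return it.hasNext(); }
            public File getNextDocument() { return it.next(); }
        };
    }

    public static void main(String[] args) {
        List<File> mirrored = Arrays.asList(
            new File("example.com/index.html"),
            new File("example.com/a.html"));
        for (File doc : drain(over(mirrored))) {
            System.out.println(doc.getPath());
        }
    }
}
```

Note that a provider of this kind is single-pass: once drained, hasMoreDocuments() stays false, so a second traversal needs a fresh provider.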

getDocumentIDMapper

public DocumentIDMapper getDocumentIDMapper()
Returns the DocumentIDMapper for this web crawl
Specified by:
getDocumentIDMapper in interface WebCrawlContext
See Also:
WebCrawlContext.getDocumentIDMapper()