idl.tmt.documentparsing
Class TextDocumentParser

java.lang.Object
  |
  +--idl.tmt.documentparsing.TextDocumentParser
All Implemented Interfaces:
DocumentParser

public class TextDocumentParser
extends java.lang.Object
implements DocumentParser

Created on Apr 5, 2004

Author:
jelsas

Field Summary
private  java.lang.String delimiters
           
static java.lang.String DELIMITERS_PROP
          Property key for setting the delimiters to use for parsing.
private  Filter filter
           
private  boolean isParsingCharacters
           
private  boolean isParsingWords
           
private  java.util.List listeners
           
private  char maxDelimChar
           
private static int READ_BUFFER_SIZE
           
static java.lang.String WORD_FILTER_PROP
          Property key for setting the filter to use for parsing.
 
Constructor Summary
TextDocumentParser()
           
 
Method Summary
 void addParsingListener(ParsingListener listener)
          Registers a new ParsingListener for this document parser
 Filter getFilter()
           
private  void notifyListenersDocumentComplete()
           
private  void notifyListenersNewDoc(int docID)
           
private  void notifyListenersOfCharacters(char[] chars, int pos)
           
private  void notifyListenersOfWord(java.lang.String word, int pos)
           
 void parseDocument(int docID, java.io.Reader documentReader)
          Initiates the parsing of a document.
 void parseText(int startPosition, char[] text)
           
 void parseText(int startPosition, char[] text, int offset, int length)
          Handles a chunk of text, and notifies the character parsing listeners, or word parsing listeners.
 void removeParsingListener(ParsingListener listener)
          Removes a parsing listener from this document parser
 void setFilter(Filter filter)
           
private  void setMaxDelimChar()
          Set maxDelimChar to the highest char in the delimiter set.
 void setParameter(java.lang.String name, java.lang.Object value)
          This method configures this parser for various runtime parameters.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, toString, wait, wait, wait
 

Field Detail

WORD_FILTER_PROP

public static final java.lang.String WORD_FILTER_PROP
Property key for setting the filter to use for parsing. The value for this property must be an instance of idl.tmt.documentparsing.filters.Filter.

DELIMITERS_PROP

public static final java.lang.String DELIMITERS_PROP
Property key for setting the delimiters to use for parsing. The value for this property must be a String.

READ_BUFFER_SIZE

private static final int READ_BUFFER_SIZE

delimiters

private java.lang.String delimiters

maxDelimChar

private char maxDelimChar

filter

private Filter filter

isParsingWords

private boolean isParsingWords

isParsingCharacters

private boolean isParsingCharacters

listeners

private java.util.List listeners
Constructor Detail

TextDocumentParser

public TextDocumentParser()
Method Detail

setParameter

public void setParameter(java.lang.String name,
                         java.lang.Object value)
                  throws InvalidParameterException
This method configures this parser for various runtime parameters.
Specified by:
setParameter in interface DocumentParser
See Also:
DocumentParser.setParameter(java.lang.String, java.lang.Object)
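
The type requirements stated for WORD_FILTER_PROP and DELIMITERS_PROP suggest a per-key type check. The following is a minimal sketch of that idea, not the actual implementation: the key strings, the Filter stand-in, the default delimiters, the handling of unknown keys, and the use of IllegalArgumentException in place of InvalidParameterException are all assumptions made for illustration.

```java
final class ParameterSketch {
    // Hypothetical key values; the real constants' string values are not documented here.
    static final String WORD_FILTER_PROP = "wordFilter";
    static final String DELIMITERS_PROP = "delimiters";

    /** Hypothetical stand-in for idl.tmt.documentparsing.filters.Filter. */
    interface Filter {
        String filter(String word);
    }

    private String delimiters = " \t\n\r"; // assumed default
    private Filter filter;

    // Checks the value type for each known key before storing it.
    void setParameter(String name, Object value) {
        if (DELIMITERS_PROP.equals(name)) {
            if (!(value instanceof String)) {
                throw new IllegalArgumentException(name + " requires a String");
            }
            delimiters = (String) value;
        } else if (WORD_FILTER_PROP.equals(name)) {
            if (!(value instanceof Filter)) {
                throw new IllegalArgumentException(name + " requires a Filter");
            }
            filter = (Filter) value;
        } else {
            // Assumption: unknown keys are rejected rather than ignored.
            throw new IllegalArgumentException("Unknown parameter: " + name);
        }
    }

    String getDelimiters() { return delimiters; }
    Filter getFilter() { return filter; }
}
```

With this sketch, `setParameter(DELIMITERS_PROP, ",;")` succeeds, while passing a non-String value for the same key is rejected.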

addParsingListener

public void addParsingListener(ParsingListener listener)
Description copied from interface: DocumentParser
Registers a new ParsingListener for this document parser
Specified by:
addParsingListener in interface DocumentParser
Following copied from interface: idl.tmt.documentparsing.DocumentParser
Parameters:
listener - The listener to be registered

removeParsingListener

public void removeParsingListener(ParsingListener listener)
Description copied from interface: DocumentParser
Removes a parsing listener from this document parser
Specified by:
removeParsingListener in interface DocumentParser
Following copied from interface: idl.tmt.documentparsing.DocumentParser
Parameters:
listener - the listener to be removed.
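
Listener registration and removal of this kind is commonly backed by a plain list that is walked on each notification. The sketch below illustrates that pattern only; WordListener is a hypothetical stand-in for ParsingListener, whose real methods are not shown on this page, and the duplicate check is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class ListenerSketch {
    /** Hypothetical stand-in for ParsingListener. */
    interface WordListener {
        void word(String word, int pos);
    }

    private final List<WordListener> listeners = new ArrayList<>();

    void addParsingListener(WordListener listener) {
        // Assumption: null and duplicate registrations are ignored.
        if (listener != null && !listeners.contains(listener)) {
            listeners.add(listener);
        }
    }

    void removeParsingListener(WordListener listener) {
        listeners.remove(listener);
    }

    void notifyListenersOfWord(String word, int pos) {
        // Iterate over a copy so a listener may unregister itself during the callback.
        for (WordListener l : new ArrayList<>(listeners)) {
            l.word(word, pos);
        }
    }
}
```

A listener removed with removeParsingListener receives no further notifications.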

parseDocument

public void parseDocument(int docID,
                          java.io.Reader documentReader)
                   throws java.io.IOException
Description copied from interface: DocumentParser
Initiates the parsing of a document.
Specified by:
parseDocument in interface DocumentParser
Following copied from interface: idl.tmt.documentparsing.DocumentParser
Parameters:
docID - The numeric ID of this document
documentReader - The Reader for the document to be parsed
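
The private READ_BUFFER_SIZE field hints that the Reader is consumed in fixed-size chunks. As a sketch under that assumption (the buffer size of 8 and the class and method names are hypothetical), a chunked read that tracks each chunk's starting position could look like this:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class ChunkedReadSketch {
    // Hypothetical value; the real READ_BUFFER_SIZE is private and not documented.
    static final int READ_BUFFER_SIZE = 8;

    /** Returns one "startPosition:text" entry per chunk read from the reader. */
    static List<String> readInChunks(Reader reader) throws IOException {
        List<String> chunks = new ArrayList<>();
        char[] buffer = new char[READ_BUFFER_SIZE];
        int position = 0; // offset of the current chunk within the document
        int read;
        while ((read = reader.read(buffer, 0, buffer.length)) != -1) {
            // A real parser would call parseText(position, buffer, 0, read) here.
            chunks.add(position + ":" + new String(buffer, 0, read));
            position += read;
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Prints [0:alpha be, 8:ta gamma]
        System.out.println(readInChunks(new StringReader("alpha beta gamma")));
    }
}
```

Note that the word "beta" straddles the chunk boundary in this example, which is the situation where a word split across parseText invocations would be reported as two words.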

parseText

public void parseText(int startPosition,
                      char[] text)

parseText

public void parseText(int startPosition,
                      char[] text,
                      int offset,
                      int length)
Handles a chunk of text and notifies the character parsing listeners or the word parsing listeners. There are several caveats within this method:
(1) If a listener implements both word and character parsing, both interfaces are notified, but possibly not in the correct order; that is, the character and word positions will not be sequential.
(2) If a word is split across separate invocations of this method, the pieces are treated as two different words. I don't know how the HTML parser works, but it seems this method should be invoked only once per chunk of text (i.e. between HTML tags).
(3) Words are filtered using the idl.tmt.documentparsing.filters.Filter specified via the setParameter() method.
(4) A word is any continuous string of characters or digits occurring between delimiter characters. The delimiters can be specified via the setParameter() method.
Parameters:
startPosition - The offset in the file where parsing begins
text - The array of character data
offset - The offset in the array to start parsing
length - The number of characters in the array to parse, starting at offset
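
The word definition and the split-word caveat can be illustrated with a small stand-alone sketch. It is not the parser's code: the class and method names are hypothetical, every non-delimiter character is treated as part of a word, and filtering and listener notification are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class WordSplitSketch {
    /** Collects the runs of non-delimiter characters in text[offset .. offset+length). */
    static List<String> words(char[] text, int offset, int length, String delimiters) {
        List<String> out = new ArrayList<>();
        int end = offset + length;
        int start = -1; // index where the current word began, or -1 when between words
        for (int i = offset; i < end; i++) {
            boolean isDelim = delimiters.indexOf(text[i]) >= 0;
            if (!isDelim && start < 0) {
                start = i;
            } else if (isDelim && start >= 0) {
                out.add(new String(text, start, i - start));
                start = -1;
            }
        }
        if (start >= 0) {
            // A word running to the end of the chunk is emitted as-is.
            out.add(new String(text, start, end - start));
        }
        return out;
    }
}
```

Parsing "hello world" in one call yields hello and world. Parsing the same text as two chunks, "hel" and "lo world", yields hel, then lo and world, because no state is carried between calls.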

notifyListenersNewDoc

private void notifyListenersNewDoc(int docID)

notifyListenersDocumentComplete

private void notifyListenersDocumentComplete()

notifyListenersOfWord

private void notifyListenersOfWord(java.lang.String word,
                                   int pos)

notifyListenersOfCharacters

private void notifyListenersOfCharacters(char[] chars,
                                         int pos)

setMaxDelimChar

private void setMaxDelimChar()
Set maxDelimChar to the highest char in the delimiter set. This method is "borrowed" from the java.util.StringTokenizer class. :)
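
The StringTokenizer technique referred to here caches the largest delimiter character so that most non-delimiter characters can be rejected with a single comparison. A minimal sketch of the idea follows; the class name and the isDelimiter helper are hypothetical.

```java
final class MaxDelimSketch {
    /** Returns the highest char value in the delimiter set, or 0 if the set is empty. */
    static char maxDelimChar(String delimiters) {
        char max = 0;
        for (int i = 0; i < delimiters.length(); i++) {
            char c = delimiters.charAt(i);
            if (c > max) {
                max = c;
            }
        }
        return max;
    }

    static boolean isDelimiter(char c, String delimiters, char maxDelimChar) {
        // Characters above the maximum cannot be delimiters, so the linear search is skipped.
        return c <= maxDelimChar && delimiters.indexOf(c) >= 0;
    }
}
```

For the delimiter set " ,;" the maximum is ';', so every letter is ruled out by the comparison alone and only characters at or below ';' fall through to the indexOf search.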

setFilter

public void setFilter(Filter filter)

getFilter

public Filter getFilter()