idl.tmt.documentparsing
Class TextDocumentParser
java.lang.Object
|
+--idl.tmt.documentparsing.TextDocumentParser
- All Implemented Interfaces:
- DocumentParser
- public class TextDocumentParser
- extends java.lang.Object
- implements DocumentParser
Created on Apr 5, 2004
- Author:
- jelsas
Method Summary
void addParsingListener(ParsingListener listener)
    Registers a new ParsingListener for this document parser
Filter getFilter()
private void notifyListenersDocumentComplete()
private void notifyListenersNewDoc(int docID)
private void notifyListenersOfCharacters(char[] chars, int pos)
private void notifyListenersOfWord(java.lang.String word, int pos)
void parseDocument(int docID, java.io.Reader documentReader)
    Initiates the parsing of a document.
void parseText(int startPosition, char[] text)
void parseText(int startPosition, char[] text, int offset, int length)
    Handles a chunk of text, and notifies the character parsing
    listeners, or word parsing listeners.
void removeParsingListener(ParsingListener listener)
    Removes a parsing listener from this document parser
void setFilter(Filter filter)
private void setMaxDelimChar()
    Set maxDelimChar to the highest char in the delimiter set.
void setParameter(java.lang.String name, java.lang.Object value)
    This method configures this parser for various runtime parameters.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, toString, wait, wait, wait
WORD_FILTER_PROP
public static final java.lang.String WORD_FILTER_PROP
- Property key for setting the filter to use for parsing. The
property value for this property must be an instance of
idl.tmt.documentparsing.filters.Filter
DELIMITERS_PROP
public static final java.lang.String DELIMITERS_PROP
- Property key for setting the delimiters to use for parsing. The
property value for this property must be a String
READ_BUFFER_SIZE
private static final int READ_BUFFER_SIZE
delimiters
private java.lang.String delimiters
maxDelimChar
private char maxDelimChar
filter
private Filter filter
isParsingWords
private boolean isParsingWords
isParsingCharacters
private boolean isParsingCharacters
listeners
private java.util.List listeners
TextDocumentParser
public TextDocumentParser()
setParameter
public void setParameter(java.lang.String name,
java.lang.Object value)
throws InvalidParameterException
- This method configures this parser for various runtime parameters.
- Specified by:
setParameter in interface DocumentParser
- See Also:
DocumentParser.setParameter(java.lang.String, java.lang.Object)
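The property-key contract described for WORD_FILTER_PROP and DELIMITERS_PROP can be sketched roughly as follows. This is a minimal illustration, not the actual TextDocumentParser source: the class ParameterSketch, the key string values, the use of java.util.function.Predicate in place of Filter, and the use of IllegalArgumentException in place of InvalidParameterException are all assumptions made so the example is self-contained.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for illustration only; not the real TextDocumentParser.
final class ParameterSketch {
    // Assumed key values; the real constant values are not given in this page.
    static final String WORD_FILTER_PROP = "wordFilter";
    static final String DELIMITERS_PROP = "delimiters";

    private String delimiters = " \t\n\r";
    // Predicate<String> stands in for idl.tmt.documentparsing.filters.Filter.
    private Predicate<String> filter = word -> true;

    // Validates the value type for each known key and rejects unknown keys.
    void setParameter(String name, Object value) {
        if (DELIMITERS_PROP.equals(name)) {
            if (!(value instanceof String)) {
                throw new IllegalArgumentException(name + " requires a String");
            }
            delimiters = (String) value;
        } else if (WORD_FILTER_PROP.equals(name)) {
            if (!(value instanceof Predicate)) {
                throw new IllegalArgumentException(name + " requires a filter");
            }
            @SuppressWarnings("unchecked")
            Predicate<String> typed = (Predicate<String>) value;
            filter = typed;
        } else {
            throw new IllegalArgumentException("Unknown parameter: " + name);
        }
    }

    String getDelimiters() {
        return delimiters;
    }

    boolean accepts(String word) {
        return filter.test(word);
    }

    public static void main(String[] args) {
        ParameterSketch parser = new ParameterSketch();
        parser.setParameter(DELIMITERS_PROP, " ,.;");
        System.out.println(parser.getDelimiters());
    }
}
```

Checking the value type inside setParameter keeps a misconfigured parser from failing later, in the middle of a parse.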
addParsingListener
public void addParsingListener(ParsingListener listener)
- Description copied from interface:
DocumentParser
- Registers a new ParsingListener for this document parser
- Specified by:
addParsingListener in interface DocumentParser
- Following copied from interface:
idl.tmt.documentparsing.DocumentParser
- Parameters:
listener - The listener to be registered
removeParsingListener
public void removeParsingListener(ParsingListener listener)
- Description copied from interface:
DocumentParser
- Removes a parsing listener from this document parser
- Specified by:
removeParsingListener in interface DocumentParser
- Following copied from interface:
idl.tmt.documentparsing.DocumentParser
- Parameters:
listener - the listener to be removed.
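Listener registration and removal follow the usual observer pattern. The sketch below shows one plausible shape for it. The SketchListener interface and its onWord method are hypothetical, since the methods of the real ParsingListener interface are not listed on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not the real TextDocumentParser.
final class ListenerSketch {
    // Assumed callback shape; the real ParsingListener may differ.
    interface SketchListener {
        void onWord(String word, int pos);
    }

    private final List<SketchListener> listeners = new ArrayList<>();

    void addParsingListener(SketchListener listener) {
        listeners.add(listener);
    }

    void removeParsingListener(SketchListener listener) {
        listeners.remove(listener);
    }

    // Iterates over a copy so a listener may unregister itself during a callback.
    void notifyListenersOfWord(String word, int pos) {
        for (SketchListener listener : new ArrayList<>(listeners)) {
            listener.onWord(word, pos);
        }
    }

    public static void main(String[] args) {
        ListenerSketch parser = new ListenerSketch();
        SketchListener printer = (word, pos) -> System.out.println(word + " at " + pos);
        parser.addParsingListener(printer);
        parser.notifyListenersOfWord("alpha", 0);
        parser.removeParsingListener(printer);
        parser.notifyListenersOfWord("beta", 6); // no listener left, nothing printed
    }
}
```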
parseDocument
public void parseDocument(int docID,
java.io.Reader documentReader)
throws java.io.IOException
- Description copied from interface:
DocumentParser
- Initiates the parsing of a document.
- Specified by:
parseDocument in interface DocumentParser
- Following copied from interface:
idl.tmt.documentparsing.DocumentParser
- Parameters:
docID - The numeric ID of this document
document - The Reader for the document to be parsed
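The private READ_BUFFER_SIZE field suggests that parseDocument reads the Reader in fixed-size chunks and hands each chunk to parseText. A minimal sketch of such a loop follows, under that assumption. The class ReadLoopSketch, the buffer size of 8, and the recording of chunks in a list (instead of notifying listeners) are illustrative choices, not the actual implementation.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not the real TextDocumentParser.
final class ReadLoopSketch {
    // Assumed value, kept deliberately small so the chunking is visible.
    static final int READ_BUFFER_SIZE = 8;

    // Each entry is "startPosition:text" for one parseText invocation.
    final List<String> chunks = new ArrayList<>();

    void parseDocument(int docID, Reader documentReader) throws IOException {
        char[] buffer = new char[READ_BUFFER_SIZE];
        int position = 0;
        int read;
        // Reader.read returns -1 at end of stream, otherwise the number of chars read.
        while ((read = documentReader.read(buffer, 0, buffer.length)) != -1) {
            parseText(position, buffer, 0, read);
            position += read;
        }
    }

    void parseText(int startPosition, char[] text, int offset, int length) {
        chunks.add(startPosition + ":" + new String(text, offset, length));
    }

    public static void main(String[] args) throws IOException {
        ReadLoopSketch parser = new ReadLoopSketch();
        parser.parseDocument(1, new StringReader("hello world text"));
        System.out.println(parser.chunks);
    }
}
```

Note that a chunk boundary can fall in the middle of a word ("world" above), so a loop like this would need extra handling to avoid splitting words across parseText calls.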
parseText
public void parseText(int startPosition,
char[] text)
parseText
public void parseText(int startPosition,
char[] text,
int offset,
int length)
- Handles a chunk of text and notifies the character parsing
listeners or the word parsing listeners. There are a few caveats
within this method:
(1) If a listener implements both word and character parsing, both
interfaces will be notified, but possibly not in the correct order.
That is, the character and word positions will not be sequential.
(2) If a word is split across different invocations of this method,
it will be treated as two different words. I don't know how the
HTML parser works, but it seems like this method should only
be invoked once per chunk of text (i.e. between HTML tags).
(3) Words are filtered using the idl.tmt.documentparsing.filters.Filter
specified via the setParameter() method.
(4) A word is any continuous string of characters or digits
occurring between delimiter characters. The delimiter characters can be
specified via the setParameter() method.
- Parameters:
startPosition - The offset in the file where parsing begins
text - The array of character data
offset - The offset in the array to start parsing
length - The length of the character array to parse
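Caveats (2) and (4) can be illustrated with a small delimiter-based splitter. This is a sketch under stated assumptions, not the real parseText: the class WordSplitSketch is hypothetical, words are recorded in a list rather than passed to listeners, no Filter is applied, and startPosition is assumed to be the file position of text[offset].

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not the real TextDocumentParser.
final class WordSplitSketch {
    private final String delimiters;

    // Each entry is "word@position".
    final List<String> words = new ArrayList<>();

    WordSplitSketch(String delimiters) {
        this.delimiters = delimiters;
    }

    void parseText(int startPosition, char[] text, int offset, int length) {
        int end = offset + length;
        int wordStart = -1;
        for (int i = offset; i <= end; i++) {
            // The end of the chunk is treated like a delimiter, which is what
            // causes a word spanning two invocations to be reported twice.
            boolean atDelimiter = (i == end) || delimiters.indexOf(text[i]) >= 0;
            if (!atDelimiter && wordStart < 0) {
                wordStart = i;
            } else if (atDelimiter && wordStart >= 0) {
                String word = new String(text, wordStart, i - wordStart);
                words.add(word + "@" + (startPosition + wordStart - offset));
                wordStart = -1;
            }
        }
    }

    public static void main(String[] args) {
        char[] text = "hello world".toCharArray();
        WordSplitSketch split = new WordSplitSketch(" ");
        split.parseText(0, text, 0, 8); // chunk "hello wo"
        split.parseText(8, text, 8, 3); // chunk "rld"
        System.out.println(split.words); // prints [hello@0, wo@6, rld@8]
    }
}
```

Passing the whole array in a single call yields hello@0 and world@6 instead, which is why one invocation per chunk of text is the safer usage.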
notifyListenersNewDoc
private void notifyListenersNewDoc(int docID)
notifyListenersDocumentComplete
private void notifyListenersDocumentComplete()
notifyListenersOfWord
private void notifyListenersOfWord(java.lang.String word,
int pos)
notifyListenersOfCharacters
private void notifyListenersOfCharacters(char[] chars,
int pos)
setMaxDelimChar
private void setMaxDelimChar()
- Sets maxDelimChar to the highest char in the delimiter set.
This method was "borrowed" from the java.util.StringTokenizer
class. :)
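The point of tracking the highest delimiter char, as java.util.StringTokenizer does, is to reject most non-delimiter characters with a single comparison before searching the delimiter string. A minimal sketch of the idea follows. The class MaxDelimSketch and its isDelimiter helper are hypothetical and are not taken from the TextDocumentParser source.

```java
// Hypothetical stand-in for illustration only; not the real TextDocumentParser.
final class MaxDelimSketch {
    private final String delimiters;
    private char maxDelimChar;

    MaxDelimSketch(String delimiters) {
        this.delimiters = delimiters;
        setMaxDelimChar();
    }

    // Scans the delimiter set once and remembers its highest char value.
    private void setMaxDelimChar() {
        char max = 0;
        for (int i = 0; i < delimiters.length(); i++) {
            char c = delimiters.charAt(i);
            if (c > max) {
                max = c;
            }
        }
        maxDelimChar = max;
    }

    char getMaxDelimChar() {
        return maxDelimChar;
    }

    // Any char above maxDelimChar cannot be a delimiter, so indexOf is skipped.
    boolean isDelimiter(char c) {
        return c <= maxDelimChar && delimiters.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        MaxDelimSketch sketch = new MaxDelimSketch(" ,;");
        System.out.println(sketch.isDelimiter(','));
        System.out.println(sketch.isDelimiter('a'));
    }
}
```

With typical delimiters (whitespace and punctuation), letters have higher char values than every delimiter, so most characters in a word fail the first comparison and never reach indexOf.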
setFilter
public void setFilter(Filter filter)
getFilter
public Filter getFilter()