idl.tmt.representation
Class MetaTextRepresentationBuilder
java.lang.Object
|
+--idl.tmt.representation.BagOfWordsRepresentationBuilder
|
+--idl.tmt.representation.MetaTextRepresentationBuilder
- All Implemented Interfaces:
- HTMLParsingListener, ParsingListener, RepresentationBuilder
- public class MetaTextRepresentationBuilder
- extends BagOfWordsRepresentationBuilder
- implements HTMLParsingListener
Creates a document representation based on the text contained in the
content attribute of HTML META tags.
Created on Mar 19, 2004
- Author:
- jelsas
Method Summary
void documentCollectionComplete()
    Builds the document representation matrix.
void documentComplete()
    Indicates that the parsing of the current document has completed.
void endTag(javax.swing.text.html.HTML.Tag tag, int pos)
    Indicates that an HTML end tag has been reached.
java.util.Set getAllowMetaNames()
    Gets a set of the Strings that are allowed for "meta" names in this document representation.
void newDocument(int docID)
    Indicates that a new document parsing has begun.
void setAllowMetaNames(java.util.Set metaNames)
    Sets the allowable meta names.
void setTextParser(TextDocumentParser textParser)
    Overrides the superclass's setTextParser() method to add a listener to the text parser.
void startTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet atts, int pos)
    Indicates that a new HTML start tag has been entered.
Methods inherited from class idl.tmt.representation.BagOfWordsRepresentationBuilder
addTermToDocRepresentation, buildRepresentation, cleanup, getRepresentation, getTermList, getWeight, isBinarize, isDebug, isShareTermlist, setBinarize, setDebug, setNumDocuments, setShareTermlist, setTermList, setWeight, toString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, wait, wait, wait
allowMetaNamesStrs
private static final java.lang.String[] allowMetaNamesStrs
allowMetaNames
private java.util.Set allowMetaNames
currentDocID
private int currentDocID
privateTextParser
private TextDocumentParser privateTextParser
MetaTextRepresentationBuilder
public MetaTextRepresentationBuilder(int numDocs)
- Creates a new MetaTextRepresentationBuilder. This class can build
a matrix document representation from the meta text of HTML documents.
The meta text is the value of the "content" attribute of <meta> tags.
- Parameters:
numDocs - The number of documents in the collection.
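As a rough illustration of where the meta text comes from, the sketch below collects the "content" attribute of every <meta> tag using only the JDK's Swing HTML parser. It does not use any idl.tmt class: the names MetaContentSketch and metaContents are hypothetical, and the real builder may tokenize, filter, and weight this text differently.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of idl.tmt.
final class MetaContentSketch {

    /** Returns the "content" attribute of each <meta> tag, in document order. */
    static List<String> metaContents(String html) {
        List<String> contents = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet atts, int pos) {
                // <meta> has no end tag, so the parser reports it as a simple tag.
                if (tag == HTML.Tag.META) {
                    Object content = atts.getAttribute(HTML.Attribute.CONTENT);
                    if (content != null) {
                        contents.add(content.toString());
                    }
                }
            }
        };
        try {
            // ignoreCharSet = true: do not abort on a charset declared in a <meta> tag.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return contents;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta name=\"keywords\" content=\"text mining\">"
                + "<title>Example</title></head><body><p>Body text</p></body></html>";
        System.out.println(metaContents(html));
    }
}
```

Body text and the <title> element are ignored here, which mirrors the idea that only meta text contributes to this representation.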
MetaTextRepresentationBuilder
public MetaTextRepresentationBuilder(int numDocs,
TermList termList,
Filter filter)
MetaTextRepresentationBuilder
public MetaTextRepresentationBuilder(boolean binarize,
TermList termList,
Filter filter)
- Creates a new MetaTextRepresentationBuilder with the specified term
list and filter. This class can build a matrix document representation from the
meta text of HTML documents.
- Parameters:
filter - The filter to use when filtering words. Can be null.
numDocs - The number of docs in the collection.
termList - The term list to use. Can be null.
MetaTextRepresentationBuilder
public MetaTextRepresentationBuilder()
setTextParser
public void setTextParser(TextDocumentParser textParser)
- Overrides the superclass's setTextParser() method to
add a listener to the text parser. This listener
extracts words from the text in the meta tag and adds
those words to the documents' representations.
- Overrides:
setTextParser in class BagOfWordsRepresentationBuilder
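The listener arrangement described above can be sketched as follows. The TextDocumentParser listener API is not shown on this page, so WordListener and SimpleTextParser are hypothetical stand-ins, and the lower-casing and splitting rule is an assumption rather than the library's actual tokenization.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; none of these types belong to idl.tmt.
final class TextListenerSketch {

    /** Hypothetical callback invoked once per word found in the text. */
    interface WordListener {
        void word(String word);
    }

    /** Hypothetical stand-in for a text parser that notifies registered listeners. */
    static final class SimpleTextParser {
        private final List<WordListener> listeners = new ArrayList<>();

        void addListener(WordListener listener) {
            listeners.add(listener);
        }

        void parse(String text) {
            // Assumption: words are runs of letters or digits, compared in lower case.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
                if (token.isEmpty()) {
                    continue; // split() yields an empty first token if the text starts with a separator
                }
                for (WordListener listener : listeners) {
                    listener.word(token);
                }
            }
        }
    }

    /** Counts the words of one meta content string by listening to the parser. */
    static Map<String, Integer> countWords(String metaContent) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        SimpleTextParser parser = new SimpleTextParser();
        parser.addListener(word -> counts.merge(word, 1, Integer::sum));
        parser.parse(metaContent);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Text mining, text retrieval"));
    }
}
```

Registering the listener inside the setter means any parser handed to the builder feeds it words without extra wiring by the caller.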
getAllowMetaNames
public java.util.Set getAllowMetaNames()
- Gets a set of the Strings that are allowed for "meta"
names in this document representation.
- Returns:
The Set of Strings that are allowed as "meta" names.
setAllowMetaNames
public void setAllowMetaNames(java.util.Set metaNames)
- Sets the allowable meta names. This should be a Set of Strings.
- Parameters:
metaNames - The Set of Strings to allow as "meta" names.
startTag
public void startTag(javax.swing.text.html.HTML.Tag tag,
javax.swing.text.MutableAttributeSet atts,
int pos)
- Description copied from interface:
HTMLParsingListener
- Indicates that a new HTML start tag has been entered.
- Specified by:
startTag in interface HTMLParsingListener
- Following copied from interface:
idl.tmt.documentparsing.HTMLParsingListener
- Parameters:
tag - The tag
atts - The tag's attributes
pos - The character position of this tag in the document
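A start-tag handler that honours a set of allowed meta names might look like the sketch below. It uses the same parameter types as startTag but is not the library's implementation: MetaStartTagSketch is a hypothetical class, the names "keywords" and "description" are example values only, and the case-insensitive comparison of names is an assumption.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; not the idl.tmt implementation.
final class MetaStartTagSketch {
    private final Set<String> allowMetaNames = new HashSet<>();
    private final List<String> acceptedContent = new ArrayList<>();

    MetaStartTagSketch(Set<String> names) {
        // Assumption: meta names are compared without regard to case.
        for (String name : names) {
            allowMetaNames.add(name.toLowerCase(Locale.ROOT));
        }
    }

    /** Keeps the content of <meta> tags whose name is in the allowed set. */
    void startTag(HTML.Tag tag, MutableAttributeSet atts, int pos) {
        if (tag != HTML.Tag.META) {
            return; // every other tag is ignored
        }
        Object name = atts.getAttribute(HTML.Attribute.NAME);
        Object content = atts.getAttribute(HTML.Attribute.CONTENT);
        if (name == null || content == null) {
            return; // e.g. <meta http-equiv=...> carries no name attribute
        }
        if (allowMetaNames.contains(name.toString().toLowerCase(Locale.ROOT))) {
            acceptedContent.add(content.toString());
        }
    }

    List<String> acceptedContent() {
        return acceptedContent;
    }

    /** Builds the attribute set of a <meta name=... content=...> tag for the demo. */
    static MutableAttributeSet meta(String name, String content) {
        SimpleAttributeSet atts = new SimpleAttributeSet();
        atts.addAttribute(HTML.Attribute.NAME, name);
        atts.addAttribute(HTML.Attribute.CONTENT, content);
        return atts;
    }

    public static void main(String[] args) {
        MetaStartTagSketch sketch =
                new MetaStartTagSketch(new HashSet<>(Arrays.asList("keywords", "description")));
        sketch.startTag(HTML.Tag.META, meta("Keywords", "text mining"), 0);
        sketch.startTag(HTML.Tag.META, meta("generator", "some editor"), 40);
        System.out.println(sketch.acceptedContent());
    }
}
```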
endTag
public void endTag(javax.swing.text.html.HTML.Tag tag,
int pos)
- Description copied from interface:
HTMLParsingListener
- Indicates that an HTML end tag has been reached.
- Specified by:
endTag in interface HTMLParsingListener
- Following copied from interface:
idl.tmt.documentparsing.HTMLParsingListener
- Parameters:
tag - The tag
pos - The character position of this tag in the document
newDocument
public void newDocument(int docID)
- Description copied from interface:
ParsingListener
- Indicates that a new document parsing has begun. The
invocation of this method implies that parsing has completed
on the current document.
- Specified by:
newDocument in interface ParsingListener
- Following copied from interface:
idl.tmt.documentparsing.ParsingListener
- Parameters:
docID - the numeric ID of the new document to be parsed
documentComplete
public void documentComplete()
- Description copied from interface:
ParsingListener
- Indicates that the parsing of the current document has completed.
- Specified by:
documentComplete in interface ParsingListener
documentCollectionComplete
public void documentCollectionComplete()
- Builds the document representation matrix.
- Specified by:
documentCollectionComplete in interface ParsingListener
- See Also:
ParsingListener.documentCollectionComplete()
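The order of the ParsingListener callbacks (newDocument, then documentComplete for each document, and finally documentCollectionComplete) can be illustrated with a small, self-contained sketch. ListenerLifecycleSketch and its addTerm and matrix methods are hypothetical, the raw-count matrix is only one possible weighting, and document IDs are assumed to run from 0 to numDocs - 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the callback order; not the idl.tmt implementation.
final class ListenerLifecycleSketch {
    private final int numDocs;
    private final List<String> termList = new ArrayList<>();           // column order
    private final Map<String, Integer> termIndex = new HashMap<>();    // term -> column
    private final Map<Integer, Map<Integer, Integer>> counts = new HashMap<>(); // docID -> column -> count
    private int currentDocID = -1;
    private int[][] matrix;

    ListenerLifecycleSketch(int numDocs) {
        this.numDocs = numDocs;
    }

    void newDocument(int docID) {
        currentDocID = docID;
        counts.putIfAbsent(docID, new HashMap<>());
    }

    /** Hypothetical hook: called for each word extracted from the current document. */
    void addTerm(String term) {
        if (currentDocID < 0) {
            throw new IllegalStateException("addTerm called outside a document");
        }
        Integer column = termIndex.get(term);
        if (column == null) {
            column = termList.size();
            termList.add(term);
            termIndex.put(term, column);
        }
        counts.get(currentDocID).merge(column, 1, Integer::sum);
    }

    void documentComplete() {
        currentDocID = -1;
    }

    /** Builds a numDocs x numTerms matrix of raw term counts. */
    void documentCollectionComplete() {
        matrix = new int[numDocs][termList.size()];
        for (Map.Entry<Integer, Map<Integer, Integer>> doc : counts.entrySet()) {
            for (Map.Entry<Integer, Integer> cell : doc.getValue().entrySet()) {
                matrix[doc.getKey()][cell.getKey()] = cell.getValue();
            }
        }
    }

    int[][] matrix() {
        return matrix;
    }

    public static void main(String[] args) {
        ListenerLifecycleSketch builder = new ListenerLifecycleSketch(2);
        builder.newDocument(0);
        builder.addTerm("text");
        builder.addTerm("mining");
        builder.addTerm("text");
        builder.documentComplete();
        builder.newDocument(1);
        builder.addTerm("mining");
        builder.documentComplete();
        builder.documentCollectionComplete();
        System.out.println(Arrays.deepToString(builder.matrix()));
    }
}
```

Deferring matrix construction to documentCollectionComplete means the number of columns is known before any row is allocated.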