idl.tmt.representation
Class MetaTextRepresentationBuilder

java.lang.Object
  |
  +--idl.tmt.representation.BagOfWordsRepresentationBuilder
        |
        +--idl.tmt.representation.MetaTextRepresentationBuilder
All Implemented Interfaces:
HTMLParsingListener, ParsingListener, RepresentationBuilder

public class MetaTextRepresentationBuilder
extends BagOfWordsRepresentationBuilder
implements HTMLParsingListener

Creates a document representation based on the text contained in the content attribute of HTML META tags. Created on Mar 19, 2004.

Author:
jelsas

Field Summary
private  java.util.Set allowMetaNames
           
private static java.lang.String[] allowMetaNamesStrs
           
private  int currentDocID
           
private  TextDocumentParser privateTextParser
           
 
Fields inherited from class idl.tmt.representation.BagOfWordsRepresentationBuilder
binarize, debug, myMatrix, numDocs, rep, shareTermlist, termList, textParser, weight
 
Constructor Summary
MetaTextRepresentationBuilder()
           
MetaTextRepresentationBuilder(boolean binarize, TermList termList, Filter filter)
          Creates a new MetaTextRepresentationBuilder with the specified term list and filter.
MetaTextRepresentationBuilder(int numDocs)
          Creates a new MetaTextRepresentationBuilder.
MetaTextRepresentationBuilder(int numDocs, TermList termList, Filter filter)
           
 
Method Summary
 void documentCollectionComplete()
          Builds the document representation matrix.
 void documentComplete()
          Indicates that the parsing of the current document has completed.
 void endTag(javax.swing.text.html.HTML.Tag tag, int pos)
          Indicates that an HTML end tag has been reached.
 java.util.Set getAllowMetaNames()
          Gets the set of Strings that are allowed as "meta" names in this document representation.
 void newDocument(int docID)
          Indicates that parsing of a new document has begun.
 void setAllowMetaNames(java.util.Set metaNames)
          Sets the allowable meta names.
 void setTextParser(TextDocumentParser textParser)
          Overrides the superclass's setTextParser() method to add a listener to the text parser.
 void startTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet atts, int pos)
          Indicates that a new HTML start tag has been entered.
 
Methods inherited from class idl.tmt.representation.BagOfWordsRepresentationBuilder
addTermToDocRepresentation, buildRepresentation, cleanup, getRepresentation, getTermList, getWeight, isBinarize, isDebug, isShareTermlist, setBinarize, setDebug, setNumDocuments, setShareTermlist, setTermList, setWeight, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, wait, wait, wait
 

Field Detail

allowMetaNamesStrs

private static final java.lang.String[] allowMetaNamesStrs

allowMetaNames

private java.util.Set allowMetaNames

currentDocID

private int currentDocID

privateTextParser

private TextDocumentParser privateTextParser
Constructor Detail

MetaTextRepresentationBuilder

public MetaTextRepresentationBuilder(int numDocs)
Creates a new MetaTextRepresentationBuilder. This class can build a matrix document representation from the meta text of HTML documents. The meta text is taken from the "content" attribute of <meta> tags.
Parameters:
numDocs - The number of documents in the collection.
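
The extraction step described above can be illustrated with the JDK's own Swing HTML parser. This is only a minimal sketch, not the implementation of this class: MetaContentSketch and extractMetaContent are hypothetical names, and the case-insensitive match against the allowed meta names is an assumption. It collects the "content" attribute of each <meta> tag whose "name" attribute is in an allowed set.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in for illustration; not part of idl.tmt.
final class MetaContentSketch {

    /**
     * Returns the "content" values of META tags whose "name" attribute
     * (compared in lower case, an assumption) is in allowedNames.
     */
    static List<String> extractMetaContent(String html, Set<String> allowedNames) {
        final List<String> contents = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // META is an empty element, so the Swing parser reports it
            // through handleSimpleTag rather than handleStartTag.
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet atts, int pos) {
                if (!HTML.Tag.META.equals(tag)) {
                    return;
                }
                Object name = atts.getAttribute(HTML.Attribute.NAME);
                Object content = atts.getAttribute(HTML.Attribute.CONTENT);
                if (name == null || content == null) {
                    return; // e.g. <meta http-equiv=...> has no name attribute
                }
                if (allowedNames.contains(name.toString().toLowerCase(Locale.ROOT))) {
                    contents.add(content.toString());
                }
            }
        };
        try {
            // ignoreCharSet = true: do not abort on a charset declared in a META tag.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return contents;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta name=\"keywords\" content=\"text mining\">"
                + "<meta name=\"generator\" content=\"some tool\">"
                + "</head><body>ignored</body></html>";
        System.out.println(extractMetaContent(html, Set.of("keywords", "description")));
    }
}
```

The returned strings would still need to be split into words and filtered before they could contribute to a bag-of-words representation.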

MetaTextRepresentationBuilder

public MetaTextRepresentationBuilder(int numDocs,
                                     TermList termList,
                                     Filter filter)

MetaTextRepresentationBuilder

public MetaTextRepresentationBuilder(boolean binarize,
                                     TermList termList,
                                     Filter filter)
Creates a new MetaTextRepresentationBuilder with the specified term list and filter. This class can build a matrix document representation from the meta text of HTML documents.
Parameters:
binarize - Whether to binarize the document representation.
termList - The term list to use. Can be null.
filter - The filter to use when filtering words. Can be null.

MetaTextRepresentationBuilder

public MetaTextRepresentationBuilder()
Method Detail

setTextParser

public void setTextParser(TextDocumentParser textParser)
Overrides the superclass's setTextParser() method to add a listener to the text parser. This listener extracts words from the text in the meta tag and adds those words to the documents' representations.
Overrides:
setTextParser in class BagOfWordsRepresentationBuilder

getAllowMetaNames

public java.util.Set getAllowMetaNames()
Gets the set of Strings that are allowed as "meta" names in this document representation.
Returns:
The Set of allowed meta name Strings.

setAllowMetaNames

public void setAllowMetaNames(java.util.Set metaNames)
Sets the allowable meta names. This should be a Set of Strings.
Parameters:
metaNames - The Set of Strings to allow as meta names.

startTag

public void startTag(javax.swing.text.html.HTML.Tag tag,
                     javax.swing.text.MutableAttributeSet atts,
                     int pos)
Description copied from interface: HTMLParsingListener
Indicates that a new HTML start tag has been entered.
Specified by:
startTag in interface HTMLParsingListener
Following copied from interface: idl.tmt.documentparsing.HTMLParsingListener
Parameters:
tag - The tag
atts - The tag's attributes
pos - The character position of this tag in the document

endTag

public void endTag(javax.swing.text.html.HTML.Tag tag,
                   int pos)
Description copied from interface: HTMLParsingListener
Indicates that an HTML end tag has been reached.
Specified by:
endTag in interface HTMLParsingListener
Following copied from interface: idl.tmt.documentparsing.HTMLParsingListener
Parameters:
tag - The tag
pos - The character position of this tag in the document

newDocument

public void newDocument(int docID)
Description copied from interface: ParsingListener
Indicates that parsing of a new document has begun. The invocation of this method implies that parsing has completed on the current document.
Specified by:
newDocument in interface ParsingListener
Following copied from interface: idl.tmt.documentparsing.ParsingListener
Parameters:
docID - the numeric ID of the new document to be parsed

documentComplete

public void documentComplete()
Description copied from interface: ParsingListener
Indicates that the parsing of the current document has completed.
Specified by:
documentComplete in interface ParsingListener

documentCollectionComplete

public void documentCollectionComplete()
Builds the document representation matrix.
Specified by:
documentCollectionComplete in interface ParsingListener
See Also:
ParsingListener.documentCollectionComplete()
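
The callback sequence documented above (newDocument, documentComplete, then documentCollectionComplete) can be sketched with a self-contained stand-in. This is an illustration under stated assumptions, not the code of this class: BagOfWordsSketch, addText and matrix are hypothetical names, rows follow the order of newDocument calls rather than the docID, raw integer counts stand in for whatever weighting the real builder applies, and the binarize flag is assumed to clamp counts to 0 or 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for illustration; not part of idl.tmt.
final class BagOfWordsSketch {
    private final boolean binarize;
    // term -> column index, in order of first appearance
    private final Map<String, Integer> termIndex = new LinkedHashMap<>();
    // one entry per document: column index -> term count
    private final List<Map<Integer, Integer>> docCounts = new ArrayList<>();
    private Map<Integer, Integer> current;
    private int[][] matrix;

    BagOfWordsSketch(boolean binarize) {
        this.binarize = binarize;
    }

    /** Starts a new row; docID is ignored here (rows follow call order). */
    void newDocument(int docID) {
        current = new HashMap<>();
        docCounts.add(current);
    }

    /** Splits text on non-word characters and counts each term for the current document. */
    void addText(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            Integer column = termIndex.get(token);
            if (column == null) {
                column = termIndex.size();
                termIndex.put(token, column);
            }
            current.merge(column, 1, Integer::sum);
        }
    }

    void documentComplete() {
        current = null;
    }

    /** Builds the documents-by-terms matrix once every document has been seen. */
    void documentCollectionComplete() {
        matrix = new int[docCounts.size()][termIndex.size()];
        for (int row = 0; row < docCounts.size(); row++) {
            for (Map.Entry<Integer, Integer> e : docCounts.get(row).entrySet()) {
                matrix[row][e.getKey()] = binarize ? 1 : e.getValue();
            }
        }
    }

    int[][] matrix() {
        return matrix;
    }

    public static void main(String[] args) {
        BagOfWordsSketch builder = new BagOfWordsSketch(false);
        builder.newDocument(0);
        builder.addText("text mining text");
        builder.documentComplete();
        builder.newDocument(1);
        builder.addText("mining tools");
        builder.documentComplete();
        builder.documentCollectionComplete();
        System.out.println(java.util.Arrays.deepToString(builder.matrix()));
    }
}
```

The matrix is built only in documentCollectionComplete because its column count depends on the full term list, which is not known until the last document has been parsed.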