idl.tmt.classification
Class HTMLMetricsClassifier

java.lang.Object
  |
  +--idl.tmt.classification.HTMLMetricsClassifier
All Implemented Interfaces:
CharacterParsingListener, ClassificationBuilder, HTMLParsingListener, ParsingListener

public class HTMLMetricsClassifier
extends java.lang.Object
implements ClassificationBuilder, HTMLParsingListener, CharacterParsingListener

Created on Jun 21, 2004

Author:
jelsas

Field Summary
private  SimpleClassification classification
           
private  double contentThreshold
           
private  int currDocAnchorCount
           
private  int currDocId
           
private  int currDocTagCount
           
private  int currDocTDCharCount
           
private  int currDocTDCount
           
private  double indexThreshold
           
private  boolean inTD
           
private static java.lang.String IS_CONTENT
           
private static java.lang.String IS_INDEX
           
private static java.lang.String IS_TABLE
           
private  double tableThreshold
           
 
Constructor Summary
HTMLMetricsClassifier()
           
 
Method Summary
 void characters(char[] characters, int pos)
          Indicates that a string of characters has been encountered in the document being parsed.
 void documentCollectionComplete()
          Indicates that the parsing of the entire collection of documents is complete.
 void documentComplete()
          Indicates that the parsing of the current document has completed.
 void endTag(javax.swing.text.html.HTML.Tag tag, int pos)
          Indicates that an HTML end tag has been reached.
 DocumentClassification getClassification()
           
 double getIndexThreshold()
           
 double getTableThreshold()
           
 void newDocument(int docID)
          Indicates that parsing of a new document has begun.
 void setIndexThreshold(double d)
           
 void setTableThreshold(double d)
           
 void startTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet atts, int pos)
          Indicates that a new HTML start tag has been entered.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, toString, wait, wait, wait
 

Field Detail

classification

private SimpleClassification classification

IS_INDEX

private static final java.lang.String IS_INDEX

IS_TABLE

private static final java.lang.String IS_TABLE

IS_CONTENT

private static final java.lang.String IS_CONTENT

currDocId

private int currDocId

currDocAnchorCount

private int currDocAnchorCount

currDocTagCount

private int currDocTagCount

currDocTDCount

private int currDocTDCount

currDocTDCharCount

private int currDocTDCharCount

inTD

private boolean inTD

indexThreshold

private double indexThreshold

tableThreshold

private double tableThreshold

contentThreshold

private double contentThreshold

Constructor Detail

HTMLMetricsClassifier

public HTMLMetricsClassifier()

Method Detail

getClassification

public DocumentClassification getClassification()
Specified by:
getClassification in interface ClassificationBuilder

startTag

public void startTag(javax.swing.text.html.HTML.Tag tag,
                     javax.swing.text.MutableAttributeSet atts,
                     int pos)
Description copied from interface: HTMLParsingListener
Indicates that a new HTML start tag has been entered.
Specified by:
startTag in interface HTMLParsingListener
Following copied from interface: idl.tmt.documentparsing.HTMLParsingListener
Parameters:
tag - The tag
atts - The tag's attributes
pos - The character position of this tag in the document
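
The startTag signature closely mirrors HTMLEditorKit.ParserCallback.handleStartTag in the JDK's Swing HTML parser. The sketch below is not the idl.tmt implementation; it only illustrates how a callback of this shape could tally start tags and anchors. The class StartTagCounter and its count helper are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in, not part of idl.tmt: tallies start tags the way a
// startTag(tag, atts, pos) listener might.
class StartTagCounter extends HTMLEditorKit.ParserCallback {
    int tagCount;     // every start tag reported by the parser
    int anchorCount;  // start tags that are <a>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet atts, int pos) {
        tagCount++;
        if (HTML.Tag.A.equals(tag)) {
            anchorCount++;
        }
    }

    // Parses an HTML string with the JDK's Swing parser and returns the tallies.
    // The parser may also report tags it implies, so tagCount can exceed the
    // number of tags written in the input.
    static StartTagCounter count(String html) {
        StartTagCounter counter = new StartTagCounter();
        try {
            new ParserDelegator().parse(new StringReader(html), counter, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter;
    }
}
```

A caller could pass a page's markup to StartTagCounter.count and read anchorCount and tagCount afterwards. Whether HTMLMetricsClassifier computes currDocAnchorCount and currDocTagCount this way is an assumption.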

endTag

public void endTag(javax.swing.text.html.HTML.Tag tag,
                   int pos)
Description copied from interface: HTMLParsingListener
Indicates that an HTML end tag has been reached.
Specified by:
endTag in interface HTMLParsingListener
Following copied from interface: idl.tmt.documentparsing.HTMLParsingListener
Parameters:
tag - The tag
pos - The character position of this tag in the document

characters

public void characters(char[] characters,
                       int pos)
Description copied from interface: CharacterParsingListener
Indicates that a string of characters has been encountered in the document being parsed.
Specified by:
characters in interface CharacterParsingListener
Following copied from interface: idl.tmt.documentparsing.CharacterParsingListener
Parameters:
characters - The characters encountered.
pos - The start position of these characters in the document
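
The fields inTD, currDocTDCount and currDocTDCharCount suggest that characters are only counted while the parser is inside a table cell. The following minimal sketch shows one way the three callbacks could cooperate to do that. It is an assumption about the behaviour, not the documented implementation, and TdCharCounter is a hypothetical class.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;

// Hypothetical stand-in: counts <td> cells and the characters found inside them.
class TdCharCounter {
    private boolean inTD;     // true between a <td> start tag and its end tag
    private int tdCount;      // number of <td> start tags seen
    private int tdCharCount;  // characters seen while inTD is true

    void startTag(HTML.Tag tag, MutableAttributeSet atts, int pos) {
        if (HTML.Tag.TD.equals(tag)) {
            inTD = true;
            tdCount++;
        }
    }

    void endTag(HTML.Tag tag, int pos) {
        if (HTML.Tag.TD.equals(tag)) {
            inTD = false;
        }
    }

    void characters(char[] characters, int pos) {
        // Text outside table cells is ignored.
        if (inTD) {
            tdCharCount += characters.length;
        }
    }

    int getTdCount() { return tdCount; }

    int getTdCharCount() { return tdCharCount; }
}
```

A single boolean flag does not track nested tables; a depth counter would be needed if cells containing tables should keep counting after the inner cell closes.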

newDocument

public void newDocument(int docID)
Description copied from interface: ParsingListener
Indicates that parsing of a new document has begun. Invoking this method implies that parsing of the previous document has completed.
Specified by:
newDocument in interface ParsingListener
Following copied from interface: idl.tmt.documentparsing.ParsingListener
Parameters:
docID - The numeric ID of the new document to be parsed

documentComplete

public void documentComplete()
Description copied from interface: ParsingListener
Indicates that the parsing of the current document has completed.
Specified by:
documentComplete in interface ParsingListener
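
This page does not say how the per-document counters are turned into labels. As a rough sketch only, the counts could be compared against the three thresholds once a document completes. Every rule below (anchor ratio for index pages, cell ratio for table pages, average characters per cell for content) is an assumption, as are the class name MetricsLabeler and the label string values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not the HTMLMetricsClassifier implementation.
class MetricsLabeler {
    // Placeholder values; the real constants' values are not documented.
    static final String IS_INDEX = "index";
    static final String IS_TABLE = "table";
    static final String IS_CONTENT = "content";

    private final double indexThreshold;
    private final double tableThreshold;
    private final double contentThreshold;

    MetricsLabeler(double indexThreshold, double tableThreshold, double contentThreshold) {
        this.indexThreshold = indexThreshold;
        this.tableThreshold = tableThreshold;
        this.contentThreshold = contentThreshold;
    }

    // Returns the labels whose (assumed) metric meets or exceeds its threshold.
    List<String> label(int tagCount, int anchorCount, int tdCount, int tdCharCount) {
        List<String> labels = new ArrayList<>();
        if (tagCount > 0) {
            // Assumed rule: share of tags that are anchors.
            if ((double) anchorCount / tagCount >= indexThreshold) {
                labels.add(IS_INDEX);
            }
            // Assumed rule: share of tags that are table cells.
            if ((double) tdCount / tagCount >= tableThreshold) {
                labels.add(IS_TABLE);
            }
        }
        // Assumed rule: average number of characters per table cell.
        if (tdCount > 0 && (double) tdCharCount / tdCount >= contentThreshold) {
            labels.add(IS_CONTENT);
        }
        return labels;
    }
}
```

Guarding against zero counts avoids dividing by zero for documents with no tags or no table cells.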

documentCollectionComplete

public void documentCollectionComplete()
Description copied from interface: ParsingListener
Indicates that the parsing of the entire collection of documents is complete.
Specified by:
documentCollectionComplete in interface ParsingListener
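
Because newDocument implies that the previous document has finished, a listener may want to finalize a document whether or not documentComplete was called explicitly. The sketch below shows one defensive way to handle the three lifecycle callbacks. PerDocumentTally and its tagSeen method are hypothetical and only stand in for whatever per-document state a real listener keeps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: keeps one tag count per document ID.
class PerDocumentTally {
    private int currDocId = -1;
    private int currDocTagCount;
    private boolean open;  // true while a document is being parsed
    private final Map<Integer, Integer> tagCounts = new LinkedHashMap<>();

    void newDocument(int docID) {
        // A new document implies the previous one is finished.
        if (open) {
            documentComplete();
        }
        currDocId = docID;
        currDocTagCount = 0;  // reset per-document state
        open = true;
    }

    // Stands in for startTag/endTag/characters callbacks.
    void tagSeen() {
        currDocTagCount++;
    }

    void documentComplete() {
        if (!open) {
            return;  // already finalized; ignore repeated calls
        }
        tagCounts.put(currDocId, currDocTagCount);
        open = false;
    }

    void documentCollectionComplete() {
        documentComplete();  // flush a document left open at the end
    }

    Map<Integer, Integer> getTagCounts() {
        return tagCounts;
    }
}
```

With this arrangement each document is recorded once, regardless of whether the parser signals completion through documentComplete, the next newDocument, or documentCollectionComplete.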

getIndexThreshold

public double getIndexThreshold()

getTableThreshold

public double getTableThreshold()

setIndexThreshold

public void setIndexThreshold(double d)

setTableThreshold

public void setTableThreshold(double d)