Friday, July 22, 2005

Meeting No. 9

In this meeting Ernesto and I discussed the new system for making document comparisons (known as Stage 2) and how it should be altered. We also spotted a bug in Stage 1 (ontology learning).

Learner Bug: Some possessive "'s" suffixes are being treated as separate words. I will need to work out how to avoid this while keeping the learner compatible with both the NLProcessor and QTag libraries.
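
One possible workaround, sketched here in Python purely as an illustration, is a post-processing pass that glues a stand-alone "'s" back onto the token before it. This assumes the tagger output can be reduced to a plain list of token strings; merge_possessives is a hypothetical helper and does not use the NLProcessor or QTag APIs. It also cannot tell a possessive from a contracted "is" (as in "it's"), so the part-of-speech tag would probably need checking in the real fix.

```python
def merge_possessives(tokens):
    """Re-attach a stand-alone "'s" token to the token that precedes it.

    Hypothetical helper: assumes `tokens` is a list of plain strings.
    """
    merged = []
    for token in tokens:
        # Only merge when there is a previous token to attach to.
        if token in ("'s", "\u2019s") and merged:
            merged[-1] += token
        else:
            merged.append(token)
    return merged


if __name__ == "__main__":
    print(merge_possessives(["Ernesto", "'s", "idea", "was", "good"]))
```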

Existing Stage 2 Implementation: My current implementation of Stage 2 allows for the comparison of numerical values and the sizes of sets:

Numerical values:
  • number of words in the original document
  • number of concepts
  • number of attributes
  • number of relationships
  • information density (number of concepts / number of words)

Sets to compare:
  • concept names
  • concept contents (name + list of attributes)
  • relationship pairs (single parent->child branch)
  • full relationships (parent->child->grandchild->... branches)
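
To make the two lists above concrete, here is a rough Python sketch of how the values and sets could be derived from one knowledge base. It is not the actual Stage 2 code: the KnowledgeBase class, its field names and the choice to count each parent->child pair as one relationship are all assumptions made for illustration, and the relationships are assumed to form a tree with no cycles.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeBase:
    """Hypothetical container for the output of Stage 1."""
    word_count: int   # words in the original document
    concepts: dict    # concept name -> set of attribute names
    pairs: set        # (parent, child) relationship pairs

    def numeric_values(self):
        n_concepts = len(self.concepts)
        return {
            "words": self.word_count,
            "concepts": n_concepts,
            "attributes": sum(len(a) for a in self.concepts.values()),
            # Assumption: one relationship per parent->child pair.
            "relationships": len(self.pairs),
            "density": n_concepts / self.word_count if self.word_count else 0.0,
        }

    def full_branches(self):
        """Return every root-to-leaf path as a tuple of concept names."""
        children = {}
        for parent, child in self.pairs:
            children.setdefault(parent, []).append(child)
        all_children = {child for _, child in self.pairs}
        roots = [p for p in children if p not in all_children]
        branches = set()

        def walk(node, path):
            path = path + (node,)
            if node not in children:
                branches.add(path)  # reached a leaf
                return
            for child in children[node]:
                if child not in path:  # only stops infinite recursion
                    walk(child, path)

        for root in roots:
            walk(root, ())
        return branches

    def comparison_sets(self):
        return {
            "concept_names": set(self.concepts),
            "concept_contents": {
                (name, frozenset(attrs)) for name, attrs in self.concepts.items()
            },
            "relationship_pairs": set(self.pairs),
            "full_relationships": self.full_branches(),
        }
```

With the sets in this form, comparing two knowledge bases comes down to ordinary set operations (intersection, difference) on the matching entries.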


Comparisons using these values alone cannot tell us much about the relationship between two documents. Therefore further metrics were suggested by Ernesto:

Percentage of concepts in kbA that also occur in kbB (and vice versa). Various bandings will be used to give different outputs. For example, a 20% crossover for both KBs suggests they are discussing the same topic. Similar comparisons can be done with relationship sets.
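
As a minimal sketch of this metric in Python (the function names are my own, and the only banding shown is the 20% example from the meeting; the real bandings are still to be decided):

```python
def overlap_percentage(a, b):
    """Percentage of the items in set `a` that also occur in set `b`."""
    if not a:
        return 0.0
    return 100.0 * len(a & b) / len(a)


def same_topic(concepts_a, concepts_b, threshold=20.0):
    """True when the overlap reaches the threshold in both directions.

    The default of 20.0 is only the example figure quoted above.
    """
    return (
        overlap_percentage(concepts_a, concepts_b) >= threshold
        and overlap_percentage(concepts_b, concepts_a) >= threshold
    )


if __name__ == "__main__":
    kb_a = {"election", "vote", "minister", "budget", "tax"}
    kb_b = {"election", "poll", "party", "leader"}
    print(overlap_percentage(kb_a, kb_b), overlap_percentage(kb_b, kb_a))
    print(same_topic(kb_a, kb_b))
```

The percentage is not symmetric, because each direction divides by the size of a different set, which is why both directions are checked. The same two functions would work unchanged on the relationship sets.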

A quick look at different news articles on the same subject showed that the overlap of concepts does not have to be very high for the two documents to be on the same subject.

My aim is to modify the XML schema and the underlying comparison code by Tuesday to incorporate these new ideas. Use of ConcepTool's lexical and structural analysis tools will be added at a later date.
