Thursday, July 28, 2005

Demo to Chris Mellish 2

Showed Chris the system working with ontology comparisons. He warned us to watch our choice of phrasing: for example, quoting a measure of information density is only as reliable as our ability to extract information from the text in the first place.

Further work to be done to the system before performing evaluations:
  • Add the use of lexical and structural analysis to calculate measure statistics for use in comparisons.
  • Add functionality for an overlap KB to be built from two KBs being compared: the overlap KB will contain all that is common to both original KBs (see the sketch after this list).
  • Check the entire project for code worth cleaning.
  • Refine the Learn GUI.
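Since the overlap KB is essentially a set intersection over concepts (and, later, relationships), the core operation is small. A minimal sketch, with a made-up KnowledgeBase class standing in for ConcepTool's real model:

    import java.util.LinkedHashSet;
    import java.util.Set;

    // Stand-in for ConcepTool's real knowledge base model.
    class KnowledgeBase {
        final Set<String> concepts = new LinkedHashSet<String>();
    }

    class OverlapBuilder {
        // Build a new KB containing only what both inputs share.
        static KnowledgeBase overlap(KnowledgeBase a, KnowledgeBase b) {
            KnowledgeBase common = new KnowledgeBase();
            common.concepts.addAll(a.concepts);
            common.concepts.retainAll(b.concepts); // set intersection
            return common;
        }
    }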

Wednesday, July 27, 2005

ConcepTool Restrictions

I have found an issue with ConcepTool that works against some of my design decisions. I've designed the learner system to allow the user to specify any number of files, folders or web pages, and for folder recursion and link following to be used to reach as many documents as you like.

There were 2 motivators behind this approach:
1) You can create a desktop search engine index with ontologies by specifying "C:/" or "/" as the folder to learn from, including recursion to all subfolders.
2) You can learn from an entire website by turning on link following.

I've been testing the effectiveness of various learning heuristics, and found that learning from just 20 documents is enough to cause problems. The learning process itself only takes a couple of minutes, but the process of loading the CT XML file into ConcepTool grinds away for about 30 minutes before running out of memory. Since the XML file it made was only 2.5 meg (compared to about 30k for single files), there's obviously a lot of bloating going on inside ConcepTool, which makes it pretty much unusable with that many documents.

It's just as well this is a proof of concept and not a commercial venture.

Tuesday, July 26, 2005

Meeting No. 10

Demo of the system to Ernesto. Not much to report on.

Work for Thursday (when we meet with Chris Mellish again):
  • Write a report on Stage 1 heuristics.
  • Write a report on Stage 2 heuristics.
  • Add newly calculated measure values to a Knowledge Base's information section for easy review.
  • Experiment with different websites to find good percentage values to use in heuristics.

Sunday, July 24, 2005

Note: Avoid Omondo's UML plugin for Eclipse

About a month ago I looked into different plugins for Eclipse that offer UML class diagram creation directly from source code. Of the two I looked at (SDE and Omondo), I chose Omondo for reasons I don't quite remember.

Anyway, I started using an evaluation copy of the full version of Omondo, and that worked quite nicely. The demo ended about a week ago though, so I downloaded the free version, which still offered everything I'd been using in the full version, so I didn't seem to be missing much.

Then I tried opening a file I'd made before, but was told I couldn't open it because I was using the free version! I'm glad I've got exported JPEGs of my old diagrams.

Anyway, I get on to making a new diagram, spend ages making it look good (something Omondo isn't good at is laying out a diagram well to start with) and save it. I come back to it today and find this:

[screenshot: the reopened diagram, now completely garbled]

What a mess.

So, in short, don't use Omondo.

Friday, July 22, 2005

Meeting No. 9

In this meeting Ernesto and I discussed the new system for making document comparisons (known as Stage 2) and how it should be altered, and spotted a bug in Stage 1 (ontology learning).

Learner Bug: Basically, some possessive "'s" tokens are being treated as separate words. I will need to work out how to avoid this while maintaining a working relationship with both the NLProcessor and QTag libraries.
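The likely fix is a pass over the token stream that re-attaches any stray "'s" to the word before it. A rough sketch, assuming a plain list of token strings (the real fix has to work with whatever NLProcessor and QTag actually hand back):

    import java.util.ArrayList;
    import java.util.List;

    class PossessiveFixer {
        // Re-attach possessive "'s" tokens to the preceding word.
        static List<String> mergePossessives(List<String> tokens) {
            List<String> merged = new ArrayList<String>();
            for (String token : tokens) {
                if (token.equals("'s") && !merged.isEmpty()) {
                    int last = merged.size() - 1;
                    merged.set(last, merged.get(last) + "'s");
                } else {
                    merged.add(token);
                }
            }
            return merged;
        }
    }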

Existing Stage 2 Implementation: My current implementation of Stage 2 allows for the comparison of numerical values and the sizes of sets:

Numerical values:
  • number of words in the original document
  • number of concepts
  • number of attributes
  • number of relationships
  • information density (number of concepts / number of words)

Sets to compare:
  • concept names
  • concept contents (name + list of attributes)
  • relationship pairs (single parent->child branch)
  • full relationships (parent->child->grandchild->... branches)


Comparisons using these values alone cannot tell us much about the relationship between two documents. Therefore further metrics were suggested by Ernesto:

Percentage of concepts in kbA that also occur in kbB (and vice versa). Various bandings will be used to give different outputs. For example, a 20% crossover for both KBs suggests they are discussing the same topic. Similar comparisons can be done with relationship sets.

A quick look at different news articles on the same subject showed that the overlap of concepts does not have to be very high for the two documents to be on the same subject.
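To make the banding idea concrete, the percentage calculation amounts to the sketch below. Only the 20% figure comes from our discussion; the rest of the banding is illustrative, and as the news-article experiment shows, the right thresholds still have to be found empirically.

    import java.util.HashSet;
    import java.util.Set;

    class ConceptOverlap {
        // Percentage of items in a that also occur in b.
        static double percentIn(Set<String> a, Set<String> b) {
            if (a.isEmpty()) return 0.0;
            Set<String> common = new HashSet<String>(a);
            common.retainAll(b);
            return 100.0 * common.size() / a.size();
        }

        // Example banding; thresholds are placeholders to be tuned.
        static String interpret(Set<String> conceptsA, Set<String> conceptsB) {
            double aInB = percentIn(conceptsA, conceptsB);
            double bInA = percentIn(conceptsB, conceptsA);
            if (aInB >= 20.0 && bInA >= 20.0) return "same topic";
            if (aInB >= 20.0 || bInA >= 20.0) return "possibly related";
            return "unrelated";
        }
    }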

My aim is to modify the XML schema and underlying code for comparisons by Tuesday to incorporate these new ideas. Use of ConcepTool's lexical and structural analysis tools will be added at a later date.

Tuesday, July 19, 2005

Demo to Chris Mellish

My meeting today included a demonstration of work so far to Chris Mellish, who is very well informed in the area of natural language processing and who, together with Ernesto, developed the heuristics my system is based on.

The demo used my new interface for the ontology mining process, which is now integrated into ConcepTool: a new ontology is created by selecting "Learn" from the "Project" menu

[screenshot: the "Learn" entry in the "Project" menu]

then entering the details of what you wish to learn from:

[screenshot: the Learn details dialog]
Chris was very impressed with the speed of execution and the relatively low amount of noise in results. He was even more impressed by the use of an XML file to specify the heuristics used in ontology learning. He has agreed to meet with us again next week to go over the next stage of work, which is document comparison through ontologies.

Ernesto and I discussed Stage 2 once Chris had left. The measures he suggested for making statistical comparisons included some of what I had previously considered, and a few I had not.

I kept quiet about my work so far on Stage 2 as I have already developed an XML schema for specifying how to make assertions about a document pairing provided a set of statistical conditions are met. I thought that bringing this up might hinder the thought process as the schema is a bit restrictive at the moment, and needs to be expanded to allow for more expressiveness. Also, I implemented all comparisons myself, but there are existing tools within ConcepTool that could be used to help here.

Wednesday, July 06, 2005

Addressing Issues Raised in No. 7

Using Wikipedia highlighted yet another issue in my file handling: UTF-8 and other encodings. I've fixed this (rather crudely by removing all non-ASCII characters) and the other web crawling issues from before.
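For the record, the crude fix boils down to a one-liner; tabs and newlines survive because they sit below 0x80:

    class AsciiFilter {
        // Crude encoding fix: drop every character outside 7-bit ASCII.
        static String stripNonAscii(String text) {
            return text.replaceAll("[^\\x00-\\x7F]", "");
        }
    }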

Attributes: Added Rule 5 to the base heuristics. Also added the WordNet heuristic. This required adding a new Rule type, PostProcessRule. That's quite fortunate, as it completes the set for my rules (already present: PreProcessRule and ProcessRule).
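I won't reproduce the rule classes here, but the shape of the three-phase pipeline is roughly this (everything other than the three rule type names is invented for the sketch):

    import java.util.List;

    // Placeholder for the knowledge base a rule operates on.
    class KB { }

    // The three rule phases, applied in order during learning.
    interface PreProcessRule  { List<String> apply(List<String> rawTokens); }   // before POS tagging
    interface ProcessRule     { void apply(List<String> taggedTokens, KB kb); } // concept/relation extraction
    interface PostProcessRule { void apply(KB kb); }                            // refinement, e.g. the WordNet heuristic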

Note: The JWNL project only supports WordNet 2.0 at the moment, not the (current) latest version 2.1 (though support is possible by renaming the dictionary files in 2.1).

I've also added Properties to knowledge bases, in the same vein as Statistics in MS Word, so that quick comparisons can be made between various rules.

As a side note, progress has been hindered slightly: I finally got around to installing Eclipse 3.1, then spent a while working out how to reverse the new "features" like highlighting anything you click on in bright yellow. Thanks guys, that's really helpful!

Tuesday, July 05, 2005

Meeting No. 7

A general meeting looking at possible test sites and attribute learning.

Test sites: Wikipedia and Maryland General are good sources of dense documents (I have used a document from MG for testing so far).

Attributes: If we can learn cardinality for an attribute, it will take one of four forms: 0..1, 1..1, 0..N, 1..N.
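In code this is just a four-value type; a trivial sketch:

    // The four cardinality forms an attribute can take.
    enum Cardinality {
        ZERO_OR_ONE,  // 0..1
        EXACTLY_ONE,  // 1..1
        ZERO_OR_MANY, // 0..N
        ONE_OR_MANY   // 1..N
    }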

A fifth rule to be added to the current set: A of_IN B to create attribute A for concept B.
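As a scan over (word, tag) pairs from the tagger, Rule 5 might look like the sketch below; the token representation and the noun check are my assumptions, not the final rule:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class OfRule {
        // A (word, POS-tag) pair as produced by the tagger.
        static class Tagged {
            final String word, tag;
            Tagged(String word, String tag) { this.word = word; this.tag = tag; }
        }

        // Rule 5: for the pattern "A of/IN B", record A as an attribute of concept B.
        static void apply(List<Tagged> tokens, Map<String, List<String>> attributes) {
            for (int i = 1; i + 1 < tokens.size(); i++) {
                Tagged middle = tokens.get(i);
                boolean isOf = middle.word.equalsIgnoreCase("of") && middle.tag.equals("IN");
                Tagged a = tokens.get(i - 1), b = tokens.get(i + 1);
                // Restricting A and B to noun tags (NN*) keeps the rule from firing on junk.
                if (isOf && a.tag.startsWith("NN") && b.tag.startsWith("NN")) {
                    if (!attributes.containsKey(b.word)) {
                        attributes.put(b.word, new ArrayList<String>());
                    }
                    attributes.get(b.word).add(a.word); // A becomes an attribute of B
                }
            }
        }
    }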

We can look at the use of PRPS (possessive pronoun), WPS (possessive wh-pronoun) and IN for further attribute learning.

Currently the system builds a taxonomy (hierarchy of words) rather than an ontology. It may be possible to use WordNet to identify children of a concept that would be better suited as a sole attribute. For example, concept "car" has children "green car" and "blue car". WordNet identifies "green" and "blue" as having the hypernym "colour", so "colour" becomes an attribute for "car" with possible values "green" and "blue".
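With JWNL that lookup is short. A sketch (the properties file name is mine, and picking the right sense is the real difficulty, since the first noun sense of "green" isn't necessarily the colour):

    import java.io.FileInputStream;
    import net.didion.jwnl.JWNL;
    import net.didion.jwnl.data.IndexWord;
    import net.didion.jwnl.data.POS;
    import net.didion.jwnl.data.PointerUtils;
    import net.didion.jwnl.data.Synset;
    import net.didion.jwnl.data.list.PointerTargetNodeList;
    import net.didion.jwnl.dictionary.Dictionary;

    class HypernymLookup {
        public static void main(String[] args) throws Exception {
            // JWNL is configured with an XML properties file pointing at the WordNet dictionary.
            JWNL.initialize(new FileInputStream("jwnl_properties.xml"));
            Dictionary dictionary = Dictionary.getInstance();

            // Look up "green" as a noun and list the direct hypernyms of its first sense.
            IndexWord green = dictionary.lookupIndexWord(POS.NOUN, "green");
            Synset sense = green.getSense(1);
            PointerTargetNodeList hypernyms = PointerUtils.getInstance().getDirectHypernyms(sense);
            hypernyms.print(); // the colour sense leads towards "chromatic color" and "color"
        }
    }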

Web crawling issues: My system does not handle frames or #anchor links intelligently (the former are ignored; the latter would fetch the same page many times).

The system should allow for the specification of a maximum depth in crawling, or for a maximum number of pages to parse.

Potential application of system: Document version comparison. By adding meta information to an ontology for a document (such as number of words, number of concepts, number of attributes, the maximum hierarchy depth and the number of concepts at each depth) it would be possible to see if two documents are related: if one subsumes the other or if they intersect.

Monday, July 04, 2005

Remote File Retrieval and PDF Support

Today's work so far has been in the following two areas:

Remote File Retrieval: I have added functionality for downloading a single web document, and support for extracting text from an HTML file. I have also added site downloading: given a single web address, ConcepTrainer will download the given file and any files from the same domain that are linked from it. Link following occurs in all downloaded files, and each file is only attempted once.
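The downloader is roughly the following shape. This is only a sketch: it finds links with a regex where the real code should use an HTML parser, and it doesn't yet do anything clever with #anchors.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class SiteCrawler {
        static final Pattern HREF = Pattern.compile("href=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

        // Download the start page plus every same-domain page reachable from it,
        // attempting each URL only once.
        static void crawl(URL start) throws Exception {
            Set<String> attempted = new HashSet<String>();
            Deque<URL> queue = new ArrayDeque<URL>();
            queue.add(start);
            while (!queue.isEmpty()) {
                URL url = queue.poll();
                if (!attempted.add(url.toExternalForm())) continue; // already tried
                String html;
                try { html = read(url); } catch (Exception e) { continue; } // one attempt per file
                process(html); // hand the page over to text extraction / learning
                Matcher m = HREF.matcher(html);
                while (m.find()) {
                    try {
                        URL link = new URL(url, m.group(1)); // resolve relative links
                        if (start.getHost().equals(link.getHost())) queue.add(link);
                    } catch (Exception ignored) { }
                }
            }
        }

        static String read(URL url) throws Exception {
            StringBuilder page = new StringBuilder();
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            for (String line; (line = in.readLine()) != null; ) page.append(line).append('\n');
            in.close();
            return page.toString();
        }

        static void process(String html) { /* text extraction elided */ }
    }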

PDF Text Extraction: I have briefly investigated support for extracting text from PDF documents in Java. The only library with an open license I have come across is PDFBox (old BSD license). A quick test of this wasn't very impressive though: the 4 files I tested with all gave fatal errors when parsing. This may be because I feed the file in through a ByteArrayInputStream rather than the FileInputStream it was most likely designed to expect. This will be investigated later, as will command line PDF text extraction tools.
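For reference, the basic usage per PDFBox's own examples is roughly this (package names as in the release I tried; treat the exact signatures as approximate):

    import java.io.FileInputStream;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    class PdfText {
        // Extract plain text from a PDF, following PDFBox's example code.
        static String extract(String filename) throws Exception {
            // Feeding a FileInputStream here; my failing tests used a
            // ByteArrayInputStream instead, which may be the problem.
            PDDocument document = PDDocument.load(new FileInputStream(filename));
            try {
                return new PDFTextStripper().getText(document);
            } finally {
                document.close();
            }
        }
    }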

Friday, July 01, 2005

Meeting No. 6

Ernesto expressed his concern over my decision to use the NLProcessor tool for Part-of-Speech tagging: I could only get a 90 day evaluation license, so future use of my system in the department might be tricky. Unfortunately there is little information on their site about purchasing full licenses or obtaining an academic license. I have to write a small report detailing my choice of NLProcessor over other systems and its pros and cons.

I then demonstrated the core of my system (I'm sick of referring to it as "my system", "the ConcepTool extension we propose" and the rest, so for now my project is called ConcepTrainer: remember kids, good names give no hits on Google) which he was very impressed with. The extraction of concepts and parent-child relationships works well with the heuristics he drew up for me, but the attribute learning isn't very good. We will both investigate this area and try to come up with new heuristics for this.

Since I'm a bit further forward with the project than expected, we can now begin looking at the next stage: comparing the ontologies for 2 different documents with each other. This requires some form of meta-concept to describe a document, which we will brainstorm on Tuesday.