Wednesday, November 16, 2005

The Mark

I just realised that I had mentioned posting my mark once it was announced, and that happened about a month ago. I got 19 out of 20 for the project, which I am very happy with.

This blog is now definitely dormant!

Friday, September 09, 2005

Project Completed

On Wednesday I handed in the final version of my report. The day after, I demonstrated the system to my second marker, George Coghill, who seemed impressed with it.

This blog is now effectively dormant, save perhaps for a post giving my mark (due on the 13th of October, which is thankfully not a Friday).

Saturday, August 27, 2005

Meeting 16

I tarballed and submitted my code archive earlier this week. I've found a couple of small things that could have been changed since then, but nothing major.

I emailed Ernesto my second draft of the dissertation - it wasn't quite complete, still needing the last few chapters, but the main content was there. Ernesto gave comments on structure and style that I've been working into it over the last few days.

During our meeting yesterday, we discussed some of his comments on the draft, then looked through the slides I'd made for my presentation (which has now been moved to the 1st of September). Ernesto gave ideas on how to improve headers and bullets, and gave me advice on how to pace myself during the talk.

Friday, August 19, 2005

Meetings No. 13-15

Over the last week, my meetings have consisted of discussion of my dissertation and (today only) my presentation.

Ernesto recommended that the main bulk of my report should follow the standard format of Background, Req Spec, Design, Implementation, Evaluation, instead of my suggestion to split the report into two parts: one for mining, one for comparing. The work I've done on the report so far (Background to Evaluation) has followed this suggestion and seems to work well.

My presentation is currently scheduled for the 30th of August. My second marker is someone I've never heard of, and my presentation session will include Pete Edwards and his students, so I'm sure to get some tough questions.

Current plan: get a decent amount of the dissertation written within the next few days, so that I have a good idea of the content to use for my presentation.

Monday, August 08, 2005

Meeting No. 12

Since the code for the system is nearly complete, Ernesto and I discussed the evaluation of the system that I will perform.

The idea is that for the two parts of the system (learning and comparing ontologies), I will determine how effective the system is, based on error and success rates. I'll also use the evaluation to generate statistical values for comparison rules, and to determine the thresholds (if any) that exist for KB sizes when using different comparison measures.
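As a rough illustration of the kind of figure this evaluation would produce, here is a minimal sketch that turns a list of pass/fail outcomes into success and error rates. The function name and the boolean representation of outcomes are assumptions made for the example, not the project's actual code.

```python
def evaluation_rates(outcomes):
    """Return (success_rate, error_rate) for a list of pass/fail outcomes.

    Each outcome is True when the system's result was judged correct.
    This representation is hypothetical and only serves the example.
    """
    if not outcomes:
        raise ValueError("no outcomes to evaluate")
    total = len(outcomes)
    successes = sum(1 for ok in outcomes if ok)
    return successes / total, (total - successes) / total


# Four judged results, three of them correct.
rates = evaluation_rates([True, True, False, True])
```

The same function could be run separately over the learning results and the comparison results, giving one pair of rates for each part of the system.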

Thursday, July 28, 2005

Demo to Chris Mellish 2

Showed Chris the system working with ontology comparisons. He warned us to watch our phrasing: for example, a measure of information density is only as reliable as our ability to extract information from text.

Further work to be done on the system before performing evaluations:
Add the use of lexical and structural analysis to calculate measure statistics for use in comparisons.
Add functionality for an overlap KB to be built from the two KBs being compared: the overlap KB will contain everything that is common to both original KBs.
Check the entire project for code worth cleaning.
Refine Learn GUI.
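
The overlap KB idea can be sketched as a set intersection. This is only an illustration under a big assumption: that a KB can be reduced to a set of concept names plus a set of (subject, predicate, object) triples. The `KB` class and `overlap` function are hypothetical and say nothing about how ConcepTool stores its KBs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KB:
    """Hypothetical, simplified KB: concept names and relation triples."""
    concepts: frozenset = frozenset()
    relations: frozenset = frozenset()  # (subject, predicate, object)


def overlap(a, b):
    """Build a KB containing only what is common to both a and b."""
    concepts = a.concepts & b.concepts
    # Keep a shared relation only if both of its ends are shared concepts.
    relations = frozenset(
        r for r in a.relations & b.relations
        if r[0] in concepts and r[2] in concepts
    )
    return KB(concepts, relations)


kb1 = KB(frozenset({"car", "wheel", "engine"}),
         frozenset({("car", "has", "wheel"), ("car", "has", "engine")}))
kb2 = KB(frozenset({"car", "wheel", "bicycle"}),
         frozenset({("car", "has", "wheel"), ("bicycle", "has", "wheel")}))
common = overlap(kb1, kb2)
```

Here `common` holds the concepts "car" and "wheel" and the single relation linking them, since those are the only items present in both inputs.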

Wednesday, July 27, 2005

ConcepTool Restrictions

I have found an issue with ConcepTool that works against some of my design decisions. I've designed the learner system to let the user specify any number of files, folders or web pages, with folder recursion and link following used to reach as many documents as they like.

There were two motivations behind this approach:
1) You can create a desktop search engine index with ontologies by specifying "C:/" or "/" as the folder to learn from, including recursion to all subfolders.
2) You can learn from an entire website by turning on link following.
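
To give an idea of the folder side of this, here is a minimal sketch of gathering documents from a mix of files and folders, with recursion as an option. The function name and its `recurse` parameter are assumptions for the example, and link following for web pages is left out entirely.

```python
from pathlib import Path


def collect_documents(sources, recurse=True):
    """Return the files reachable from the given files and folders.

    sources: iterable of file or folder paths (hypothetical interface).
    recurse: when True, descend into all subfolders of each folder.
    """
    found = []
    for source in sources:
        path = Path(source)
        if path.is_file():
            found.append(path)
        elif path.is_dir():
            walker = path.rglob("*") if recurse else path.glob("*")
            # Sort for a stable order; keep files, skip the folders themselves.
            found.extend(p for p in sorted(walker) if p.is_file())
    return found
```

Calling `collect_documents(["/"])` would correspond to the desktop search case above, which shows how quickly the document count can grow once recursion is switched on.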

I've been testing the effectiveness of various learning heuristics, and found that learning from just 20 documents is enough to cause problems. The learning process itself only takes a couple of minutes, but loading the resulting CT XML file into ConcepTool grinds away for about 30 minutes before running out of memory. Since the XML file was only 2.5 MB (compared to about 30 KB for single files), there's obviously a lot of bloat in ConcepTool, which makes it pretty unusable with that many documents.

It's just as well this is a proof of concept and not a commercial venture.