Tuesday, June 28, 2005

Recent Work: Core of the System

Since my last post I have been working quite a lot on the system's core functionality: the ability to apply a set of heuristics to an arbitrary collection of documents.

Requirements of the core:
  • Heuristics are specified in an external file that can be modified without hacking at the Java code in between. XML is screaming to be let in here (see the sketch after this list).

  • One or more files can be used as an ontology basis. Files are identified by URIs, either for separate files or for folders (remote only); folders can optionally be recursed into to find further files; remote files can optionally be parsed to find links to further files.

  • Documents can be of varied formats: the system must be adaptable to handle .txt, .doc, .pdf, etc.
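
To make the first bullet concrete, here is a minimal sketch of loading rules from such a file with JDOM (which we are already using elsewhere). The file name, element names and attributes are hypothetical placeholders, not the real schema:

    import java.util.Iterator;
    import org.jdom.Document;
    import org.jdom.Element;
    import org.jdom.input.SAXBuilder;

    public class HeuristicsLoader {
        // Reads a hypothetical heuristics.xml of the form:
        //   <heuristics><heuristic name="..." pattern="..."/></heuristics>
        public static void main(String[] args) throws Exception {
            Document doc = new SAXBuilder().build("heuristics.xml");
            for (Iterator it = doc.getRootElement()
                    .getChildren("heuristic").iterator(); it.hasNext();) {
                Element rule = (Element) it.next();
                System.out.println(rule.getAttributeValue("name") + " -> "
                        + rule.getAttributeValue("pattern"));
            }
        }
    }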

I'm sure there was more motivation behind my work recently, but it escapes me just now.

Based on these requirements I have done the following:
  • Developed an XML Schema to be used for specifying all heuristics (pre-processing and processing rules) within the system.

  • Created an instance of this schema based on the heuristics given to me a couple of weeks ago by Ernesto.

  • Implemented much of the core of the system: conversion from plain text to Part-of-Speech-tagged text using NLProcessor from Edinburgh University/Infogistics.com; written the pre-processors; implemented part of the I/O side of the system.

I've decided to use the Factory design pattern a lot in my system and it seems to be working well at separating concerns.
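
As an example of what I mean, here is a minimal sketch of a factory for readers of the different document formats; DocumentReader, PlainTextReader and DocumentReaderFactory are hypothetical names, not my actual classes:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;

    // Each document format gets its own implementation of this interface.
    interface DocumentReader {
        String readText(File file) throws IOException;
    }

    class PlainTextReader implements DocumentReader {
        public String readText(File file) throws IOException {
            StringBuffer text = new StringBuffer();
            BufferedReader in = new BufferedReader(new FileReader(file));
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                text.append(line).append('\n');
            }
            in.close();
            return text.toString();
        }
    }

    class DocumentReaderFactory {
        // Only the factory knows which formats exist, so supporting .doc
        // or .pdf later means adding a reader class and a case here.
        public static DocumentReader readerFor(String fileName) {
            if (fileName.endsWith(".txt")) {
                return new PlainTextReader();
            }
            throw new IllegalArgumentException("No reader for: " + fileName);
        }
    }

The core code only ever sees the DocumentReader interface, which is exactly the separation of concerns I am after.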

I also modified the project plan slightly, as per Ernesto's requests. It is now ready for submission.

Finally, I've been playing with Omondo's UML plugin for Eclipse. It makes some really nice-looking class diagrams based on a given package, and seems easy to edit, though I'm certain it misses some relationships when scanning (most symptomatic: a "return new Class()" statement is not picked up as a use of that class).

A note for anyone considering use of this plugin: I ran it on a small package and it spent about a minute searching for Java classes everywhere - other packages (my next Subversion update included ALL files) and maybe even other open projects (though I had no others open). That was annoying, but each subsequent use of the plugin during that session was much faster.

Friday, June 24, 2005

Meeting No. 5

I didn't email Ernesto my latest reports until just before our meeting, so we'll have to discuss them at a later date if necessary.

Ernesto detailed the work he wants me to do over the next two weeks:
  1. Investigate existing Part-of-Speech taggers that we can make use of

  2. Design the architecture of my system

  3. Work on the heuristics of the system

  4. Code the core system

I have already investigated item 1 since the meeting, and have found NLProcessor, created by Edinburgh University and Infogistics.com, to be worth using - it has a Java interface and is free to use for 90 days. Obviously this ties us to a short-term solution, but the system should be written with this in mind and allow NLProcessor to be swapped out in future.
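
To keep that future swap cheap, the plan is to hide the tagger behind my own interface. A minimal sketch follows, where PosTagger and NLProcessorTagger are hypothetical names, and the actual NLProcessor call is left out since I haven't pinned down its API yet:

    // The core only ever talks to this interface, never to NLProcessor itself.
    interface PosTagger {
        // Returns the text with a Part-of-Speech tag attached to each term.
        String tag(String plainText);
    }

    // Adapter wrapping NLProcessor; a replacement tagger later just needs
    // its own adapter class, with no changes to the core.
    class NLProcessorTagger implements PosTagger {
        public String tag(String plainText) {
            // ... invoke the NLProcessor Java interface here ...
            throw new UnsupportedOperationException("Not wired up yet");
        }
    }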

I've had ideas about the architecture floating around my head for a couple of weeks so it'll be good to get those details down soon.

The heuristics will require a lot of thought - I may have to further research NLP.

Project Plan

I have written my first draft of the project plan. Will review this and my background research document with Ernesto at 1pm today.

Thursday, June 23, 2005

Background Document

I've written up my findings from researching the area of (semi-)automatic ontology extraction from text. I meet with Ernesto tomorrow to discuss this document.

I shall now start work on my project plan.

Tuesday, June 21, 2005

LaTeX Templates

I've been beaten to the punch slightly, but I heard Steve say he had to re-learn LaTeX for writing reports, so I've put the templates Ernesto gave me, plus a cheat sheet and a skeleton document, up in my CSD space.

These should be fine for anyone to use as long as you keep Ernesto's name in them. megart.cls is for short reports, and megarep.cls is for the dissertation.

Friday, June 17, 2005

Meeting No. 4

Just had a very long meeting (105 minutes) with Ernesto.

JDOM Migration: Ernesto is happy with the work done and my report on it. From now on Robin and I are to work alone, unless we wish to discuss system features that might benefit both of us.

Business Case: Ernesto liked my ideas for potential business applications of my system, notably the use of ontologies to aid desktop searching. He thinks this may be more appealing than crawling web pages to find all documents to work with, as that is quite common functionality and not overly exciting.

We also went through my list of papers earmarked for reading. Ernesto had many good tips for how to spot a good paper (at least 10 pages long, not more than 4 years old). I will annotate my paper list with the comments he made, then begin working through those deemed worth a look.

My next deadline is set for next Friday, when I should produce a page of bullet points giving an overview of what I've found to be common among the research done by other institutes, and perhaps what is missing or has not yet been tried properly. I should also look into finding systems for grammar tokenisation (i.e. converting natural language to a string giving the main terms and their sense: noun, verb, etc.) that I may be able to use rather than starting from scratch - a tagger of this kind might turn "The cat sat" into something like "The/DT cat/NN sat/VBD".

Thursday, June 16, 2005

JDOM Migration Report

Ernesto has asked that for each "mini-project" I do as part of the overall project, I write a short report on what I've done. I therefore wrote a 3-page report on the JDOM Migration mini-project.

I have also added a few more papers to my CiteULike list. Meeting with Ernesto tomorrow to go over both these items.

Use of Subversion is going very well - it made merging my work with Robin's very easy during the JDOM task, since I knew I wouldn't lose anything I had done. A couple of days ago, though, svn claimed that a folder was under version control, then that it wasn't, and I ended up recreating the repository because of it. I think I've figured out the problem now (included here for my future reference):

If a folder is under version control, but its contained folders are not, the following happens:
  • svn add folder: folder is already under version control

  • svn remove folder: folder is not under version control (use --force to force removal)
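
  • svn add --force folder: this seems to be the escape hatch - as I understand it, --force makes svn descend into the already-versioned folder and schedule only the unversioned items inside it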

Wednesday, June 15, 2005

JDOM Migration

Ernesto had asked Robin and myself to migrate ConcepTool from DOM (the org.w3c.dom package and its children) to JDOM (org.jdom, the package we've been using in our course for a while now). This was because, in shifting from Java 1.4 to 1.5, a bug crept in that affects DOM's use of streams. Also, JDOM is well recognised as a good way to handle XML in Java.
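
For anyone who hasn't seen the two APIs side by side, the parsing end of the change amounts to something like this (a minimal sketch with a made-up file name, not ConcepTool's actual code):

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;

    public class ParseComparison {
        public static void main(String[] args) throws Exception {
            File file = new File("model.xml");  // hypothetical input

            // Before: DOM, via the verbose factory dance and W3C types.
            org.w3c.dom.Document domDoc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(file);
            System.out.println(domDoc.getDocumentElement().getTagName());

            // After: JDOM, one builder and friendlier types.
            org.jdom.Document jdomDoc =
                    new org.jdom.input.SAXBuilder().build(file);
            System.out.println(jdomDoc.getRootElement().getName());
        }
    }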

Robin and I began work on this yesterday and have just finished the migration. There is still one reference to DOM that is required for use in XSL Transformation code.

Current todo list:
  • Create a list of papers, then reduce it with Ernesto to those worth reading (I already have a reasonable list at CiteULike but plan on adding to it).

  • Write a Project Plan (first deliverable of the module). I had planned to start this next week, but may manage to start sooner since the XML work was shorter than expected.

Tuesday, June 14, 2005

Third Meeting (Ernesto + Robin)

This meeting was arranged so that Ernesto could check that we were happy with our projects and our current progress.

I asked for guidance on research systems similar to what I will write and Ernesto proposed that I create a large list of potential papers for reading, then spend 5-10 minutes with him reducing this to those that are worth reading.

I have been working on my Business Case document for the past few days. I emailed my first draft to Ernesto on Sunday night, and he commented on it so that I could make a second draft (emailed today). Aside from further inclusion of investigation into other systems, I think this document is in good shape.

For Friday Robin and I are expected to have finished work on migrating ConcepTool from DOM to JDOM.

Wednesday, June 08, 2005

Second Meeting (Ernesto + Robin)

Most of the time was spent being shown ConcepTool in action by Ernesto. He had emailed us the source code and some papers the previous night.

He also suggested a new motivation for our work: document organisation on the fly. This would be useful for internal document management in government, hospital and other settings, and is currently an unsolved problem; at present, "commercial taxonomy software" is used for these purposes.

Items received from Ernesto:
  • ConcepTool (LexicalAnalysis version) source code

  • Lexical Analysis in CT paper (printed version also given)

  • Ontology Learning for the Semantic Web paper

  • Ontology Learning and Its Application to Automated Terminology Translation paper

  • OntoWeb Deliverable 1.5 paper (printed version also given)

  • An XML Core Infrastructure for the ConcepTool System report (printed version also given)

Tuesday, June 07, 2005

First Meeting with Ernesto (and Robin)

Met with Ernesto to discuss the project in broad strokes. Robin was also there as he and I will be working together for the start of our projects.

My project is to add to the existing ConcepTool system the ability to learn ontologies from websites. This means that many of the tools I will make use of, such as lexical reasoning, will be taken from ConcepTool. This software platform is apparently quite tightly coupled, and must be used in an all-or-nothing way. Robin's project is to make the platform more modular, with the option to plug in the modules you wish to use.

In essence, I am a client of Robin's, asking him to provide me with ConcepTool in a modularised form so that I can develop my system easily.

My first task is to research the Business Case for my project: what is the motivation behind it, and where it could prove useful.

From Ernesto's notes:
Most ontologies are either automatically developed and then human-validated, or not built at all (consider the industrial cost-effectiveness of ontologies and of developing them). We can therefore pitch this system as a semi-automatic ontology learner for businesses without the funds or time to develop ontologies manually.

Create ontologies on the fly to browse catalogues and web pages in a better way: allow the user to view many websites with the same navigational structure, using each site's ontology to map between the user's own representation and that of the website.

Items received from Ernesto:
  • Learning path for myself and Robin

  • An intelligent system for the extraction of ontological knowledge from text: Ernesto's specification of my project

  • Ontology learning for the semantic web: outline by Ernesto of this area of research

  • OntoElicitor: extracting expressive domain ontologies from text: paper by Ernesto and Chris Mellish

Monday, June 06, 2005

Project Allocation

I have been allocated Dr Ernesto Compatangelo as my supervisor. I spoke with Ernesto about an idea I had for a project last week, and our discussion led me to believe this is who I would end up with. My project should be a melding of my original idea with Ernesto's knowledge of the domain (ironically worded), his work on ConcepTool and what is actually possible in 3 months.

For completeness, here is the original sketch of my idea:

  • Download all pages from a given root URL, including all linked pages within the same domain and at the same or lower levels of the hierarchy.
  • Store each page in a DB table.
  • Use all pages / a subset / a folder to locate information that is identical / almost identical on each page -> website maintenance
  • Identify significant differences -> content extraction
  • Identify patterns -> ontology learning


System is built around many subsystems:
  • Web site downloader - use spider design to queue pages
  • XHTML corrector
  • XHTML tokeniser - replace text with
  • Pattern identifier and matcher - should locate patterns repeating within a page, e.g. blog posts from a month; allows bloggers to migrate
  • Pattern visualiser - display to the user information on which pages differ from the majority and how
      • Potential: open 2 docs in the browser with an extra CSS file used to highlight differences
      • Or: generate a new page from the existing ones with the same highlighting and a control bar at the top to move between the different pages - the highlighted text changes
      • (Use the Firefox caret to reselect an area and try again)
  • Content extractor - use the visualiser to identify information, then extract this to create a new DB
      • Application: convert a static website to a dynamic site with a DB
  • Ontology learner - automatic extraction of content to generate an ontology + instances

Entirely modular so that extra functionality can be added easily.


Bottleneck - downloading pages:
  • Use threading to hit a website at most once per second, always waiting for the reply before starting the next one-second delay.
  • In a separate thread, perform correction, tokenising, etc.
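
A minimal sketch of the download loop I have in mind follows; the root URL is a placeholder, and the handoff to the correction/tokenising thread is only a comment:

    import java.io.InputStream;
    import java.net.URL;
    import java.util.LinkedList;

    public class PoliteDownloader {
        public static void main(String[] args) throws Exception {
            LinkedList queue = new LinkedList();
            queue.add("http://example.com/");  // hypothetical root URL

            while (!queue.isEmpty()) {
                URL url = new URL((String) queue.removeFirst());
                InputStream in = url.openStream();  // wait for the reply...
                byte[] buffer = new byte[4096];
                while (in.read(buffer) != -1) {
                    // ...hand bytes to the correction/tokenising thread here
                }
                in.close();
                Thread.sleep(1000);  // ...then wait a second before the next hit
            }
        }
    }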

Created Pending Allocation

Blog created to chart progress of my CS5914 project in E-Commerce at the University of Aberdeen.

I was expecting allocation of my project on Friday past, but have still not received it.