Tuesday, June 28, 2005

Recent Work: Core of the System

Since my last post I have been working quite a lot on the system's core functionality: the ability to apply a set of heuristics to an arbitrary collection of documents.

Requirements of the core:
  • Heuristics are specified in an external file that can be modified without touching the Java code. XML is the obvious candidate here.

  • One or more files can be used as the basis for an ontology. Files are identified by URIs pointing at individual files or folders (remote only); folders can optionally be recursed for further files, and remote files can optionally be parsed for links to further files.

  • Documents can come in various formats: the system must be adaptable to .txt, .doc, .pdf, etc.
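
To make the file-gathering requirement concrete, here is a minimal sketch of collecting documents from a local file or folder with optional recursion. It is not the project's actual code: the class name DocumentCollector, its methods, and the fixed set of extensions are assumptions made for illustration, it assumes a recent Java (9 or later), and it leaves out remote URIs and link parsing entirely.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: gathers candidate documents from a local file or folder. */
class DocumentCollector {

    // Assumed list of supported extensions; the post only names .txt, .doc and .pdf.
    private static final Set<String> SUPPORTED = Set.of("txt", "doc", "pdf");

    /** Returns the supported files under root; descends into subfolders only if recurse is true. */
    static List<Path> collect(Path root, boolean recurse) throws IOException {
        if (Files.isRegularFile(root)) {
            return isSupported(root) ? List.of(root) : List.of();
        }
        // Depth 1 visits the folder itself and its direct children only.
        int depth = recurse ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(root, depth)) {
            return paths.filter(Files::isRegularFile)
                        .filter(DocumentCollector::isSupported)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    /** Case-insensitive check of the file extension against the supported set. */
    static boolean isSupported(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot >= 0 && SUPPORTED.contains(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}
```

A call such as DocumentCollector.collect(Path.of("corpus"), true) would then return every .txt, .doc and .pdf file below the (hypothetical) corpus folder.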

I'm sure there was more motivation behind my work recently, but it escapes me just now.

Based on these requirements I have done the following:
  • Developed an XML Schema to be used for specifying all heuristics (pre-processing and processing rules) within the system.

  • Created an instance of this schema based on the heuristics given to me a couple of weeks ago by Ernesto.

  • Implemented much of the core of the system: converted plain text to part-of-speech tagged text using the NLProcessor from Edinburgh Uni/infogistics.com; wrote the pre-processors; implemented part of the IO side of the system.
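
As an illustration of keeping heuristics outside the Java code, the sketch below reads rules from an XML string using the standard DOM parser. The element and attribute names (heuristics, rule, phase) and the Rule class are invented for this example; the real schema is not shown in this post and may look quite different.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical loader for an externally specified set of heuristics. */
class HeuristicLoader {

    /** Hypothetical rule: the phase it belongs to plus the rule text itself. */
    static final class Rule {
        final String phase;
        final String pattern;

        Rule(String phase, String pattern) {
            this.phase = phase;
            this.pattern = pattern;
        }
    }

    /** Parses every <rule> element in the given XML into a Rule, in document order. */
    static List<Rule> load(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList nodes = doc.getElementsByTagName("rule");
        List<Rule> rules = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            rules.add(new Rule(element.getAttribute("phase"),
                               element.getTextContent().trim()));
        }
        return rules;
    }
}
```

Under these assumptions, an input such as <heuristics><rule phase="pre-processing">strip-headers</rule></heuristics> yields a single Rule, and adding or changing rules means editing the XML file rather than recompiling.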

I've decided to use the Factory design pattern a lot in my system and it seems to be working well at separating concerns.
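
For readers unfamiliar with the pattern, here is a rough sketch of how a factory could hide the choice of document format from the rest of the system. All names (TextExtractor, PlainTextExtractor, ExtractorFactory) are hypothetical and only plain text is handled; it shows the general idea, not the design actually used in the project.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical abstraction: turns raw document bytes into plain text. */
interface TextExtractor {
    String extract(byte[] content);
}

/** Trivial extractor for .txt files; assumes UTF-8 content. */
class PlainTextExtractor implements TextExtractor {
    @Override
    public String extract(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }
}

/** Factory: callers ask for an extractor by file name and never name a concrete class. */
class ExtractorFactory {

    // Supporting a new format (.doc, .pdf, ...) would mean adding one entry here.
    private static final Map<String, Supplier<TextExtractor>> REGISTRY =
            Map.of("txt", PlainTextExtractor::new);

    static TextExtractor forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Supplier<TextExtractor> supplier = REGISTRY.get(extension);
        if (supplier == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return supplier.get();
    }
}
```

The separation of concerns comes from the calling code depending only on TextExtractor: which class is instantiated for which format is decided in one place.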

I also modified the project plan slightly, as per Ernesto's requests. It is now ready for submission.

Finally, I've been playing with Omondo's UML plugin for Eclipse. It makes some really nice-looking class diagrams based on a given package, and seems to be easy to edit, though I'm certain it misses some relationships when scanning (the most telling symptom is that "return new Class()" is not picked up as a use of that class).

A note for anyone considering use of this plugin: I ran it on a small package and it spent about a minute searching for Java classes everywhere - other packages (my next Subversion update included ALL files) and maybe even other open projects (though I had no others open). That was annoying, but each subsequent use of the plugin during that session was much faster.
