Wednesday, July 06, 2005

Addressing Issues Raised in No. 7

Using Wikipedia highlighted yet another issue in my file handling: UTF-8 and other character encodings. I've fixed this (rather crudely, by removing all non-ASCII characters), along with the other web crawling issues from before.
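For illustration, the crude fix amounts to something like the sketch below. The class and method names (AsciiFilter, stripNonAscii) are made up for this example and are not the project's actual code; the only thing taken from the description above is the idea of dropping every character outside the 7-bit ASCII range.

```java
public class AsciiFilter {

    // Hypothetical helper: keeps only characters in the 7-bit ASCII range (0-127).
    // Anything else (accented letters, CJK, surrogate pairs) is silently dropped.
    public static String stripNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "Caf Mnchen": the two accented letters are removed, not transliterated.
        System.out.println(stripNonAscii("Caf\u00e9 M\u00fcnchen"));
    }
}
```

This shows why the approach is crude: words lose letters rather than being converted, so a proper fix would decode each page with its declared encoding instead.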

Attributes: Added Rule 5 to the base heuristics. Also added the WordNet heuristic. This required adding a new rule type, PostProcessRule. That's quite fortunate, as it completes the set for my rules (already present: PreProcessRule and ProcessRule).
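To make the three rule types concrete, here is a rough sketch of how such a set could fit together. Only the names PreProcessRule, ProcessRule and PostProcessRule come from the project; the method signatures, the RulePipeline class and the toy rules are assumptions invented for this example, and the word-list filter is merely a stand-in for a real WordNet lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RulePipeline {

    // Assumed signatures: the real interfaces may look quite different.
    interface PreProcessRule { String apply(String rawText); }
    interface ProcessRule { List<String> apply(String text); }
    interface PostProcessRule { List<String> apply(List<String> candidates); }

    // Runs the three phases in order: clean up, extract, then filter.
    public static List<String> run(String raw, PreProcessRule pre,
                                   ProcessRule process, PostProcessRule post) {
        return post.apply(process.apply(pre.apply(raw)));
    }

    // Toy example: trim the text, split it into candidates, keep known words only.
    public static List<String> demo() {
        Set<String> knownWords = new HashSet<>(Arrays.asList("colour", "size"));
        PreProcessRule trim = text -> text.trim();
        ProcessRule split = text -> Arrays.asList(text.split(",\\s*"));
        PostProcessRule keepKnown = candidates -> {
            List<String> kept = new ArrayList<>();
            for (String c : candidates) {
                if (knownWords.contains(c)) {
                    kept.add(c);
                }
            }
            return kept;
        };
        return run("  colour, xyzzy ", trim, split, keepKnown);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [colour]
    }
}
```

The point of the post-processing phase in this sketch is that it sees the extracted candidates rather than the raw text, which is where a dictionary-based check would naturally sit.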

Note: The JWNL project only supports WordNet 2.0 at the moment, not the latest version, 2.1 (though support is possible by renaming the dictionary files in 2.1).

I've also added Properties to knowledge bases, in the same vein as Statistics in MS Word, so that quick comparisons can be made between various rules.

As a side note, progress has been hindered slightly: I finally got around to installing Eclipse 3.1, then spent a while working out how to reverse the new "features" like highlighting anything you click on in bright yellow. Thanks guys, that's really helpful!
