Monday, June 06, 2005

Project Allocation

I have been allocated Dr Ernesto Compatangelo as my supervisor. I spoke with Ernesto about an idea I had for a project last week, and our discussion led me to believe this is who I would end up with. My project should be a melding of my original idea with Ernesto's knowledge of the domain (ironically worded), his work on ConcepTool and what is actually possible in 3 months.

For completeness, here is the original sketch of my idea:

Download all pages from a given root URL, including every linked page that is within the same domain and at the same or a lower level of the hierarchy.
Store each page in a DB table
Use all pages / a subset / a folder to locate information that is identical / almost identical on each page -> website maintenance
Identify significant differences -> content extraction
Identify patterns -> ontology learning
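
The "same domain, same or lower level" rule from the first step can be sketched roughly as below. This is only a minimal illustration: the function name `in_scope` and the example.org URLs are made up for the example, and it assumes "level of hierarchy" means the directory part of the URL path.

```python
from urllib.parse import urljoin, urlsplit
import posixpath


def in_scope(root_url: str, candidate_url: str) -> bool:
    """Hypothetical helper: True if the candidate link is on the same host
    as the root URL and sits in the root's directory or below it."""
    root = urlsplit(root_url)
    # Resolve relative links (e.g. "2005/06/post.html") against the root.
    cand = urlsplit(urljoin(root_url, candidate_url))

    if cand.scheme not in ("http", "https"):
        return False  # skips mailto:, javascript:, ftp:, etc.
    if cand.hostname != root.hostname:
        return False  # different domain

    # Directory of the root page: everything up to the last slash.
    root_dir = root.path.rsplit("/", 1)[0] + "/"
    # Collapse any "." or ".." segments left in the candidate path.
    cand_path = posixpath.normpath(cand.path or "/")
    # Append "/" so that "/blog" matches "/blog/" but "/blogger" does not.
    return (cand_path + "/").startswith(root_dir)
```

With a root of `http://example.org/blog/index.html`, a link to `2005/06/post.html` would be queued, while `/about.html` or a link to another host would be skipped.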


System is built around many subsystems:
Web site downloader
Use spider design to queue pages
XHTML corrector
XHTML tokeniser - replace text with
Pattern identifier and matcher
Should locate patterns repeating within a page - e.g. blog posts from a month. Allows bloggers to migrate
Pattern visualiser - display to user information on which pages differ from the majority and how
Potential - open 2 docs in browser with extra CSS file used to highlight differences.
OR: generate new page from existing ones with same highlighting and a control bar at top to move between different pages - highlighted text changes.
(Use Firefox caret to reselect area and try again)
Content extractor - use visualiser to identify information, then extract this to create a new DB
Application - convert static website to dynamic site with DB
Ontology learner - automatic extraction of content to generate ontology + instances

Entirely modular so that extra functionality can be added easily.
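
As a very rough sketch of what the pattern identifier and content extractor might do in their simplest form: treat each page as a list of tokenised lines, call a line "shared" if it appears on most pages (site template, navigation, footer), and treat whatever is left as page-specific content. The function name, the 0.8 threshold and the sample pages are all assumptions for illustration; real "almost identical" matching would need something fuzzier than exact comparison.

```python
from collections import Counter


def split_shared_and_unique(pages: dict[str, list[str]], threshold: float = 0.8):
    """Hypothetical helper. `pages` maps a URL to its list of lines/tokens.
    Returns (shared, unique): `shared` is the set of lines found on at least
    `threshold` of the pages; `unique` maps each URL to its remaining lines."""
    counts = Counter()
    for lines in pages.values():
        counts.update(set(lines))  # count each line once per page

    needed = threshold * len(pages)
    shared = {line for line, n in counts.items() if n >= needed}
    unique = {
        url: [line for line in lines if line not in shared]
        for url, lines in pages.items()
    }
    return shared, unique


# Made-up sample data: three pages sharing a header and footer.
sample = {
    "a.html": ["<h1>My Blog</h1>", "Post about cats", "<p>footer</p>"],
    "b.html": ["<h1>My Blog</h1>", "Post about dogs", "<p>footer</p>"],
    "c.html": ["<h1>My Blog</h1>", "Post about fish", "<p>footer</p>"],
}
shared, unique = split_shared_and_unique(sample)
```

Here `shared` holds the header and footer lines (the maintenance view), and `unique["a.html"]` holds only the post text (the extraction view).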


Bottleneck - downloading pages
Use threading to hit a website at most once per second, always waiting for the reply before starting the next one-second delay
In a separate thread perform correction, tokenising, etc.
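
A minimal sketch of that two-thread arrangement, under my reading that the one-second delay only starts once the reply has arrived. The names `crawl`, `fetch` and `process` are hypothetical; `fetch` and `process` are passed in so the downloader and the correction/tokenising step stay separate modules.

```python
import queue
import threading
import time


def crawl(urls, fetch, process, delay=1.0):
    """Hypothetical sketch: one thread downloads politely, a second thread
    processes pages as they arrive. Returns a list of (url, processed)."""
    fetched = queue.Queue()
    results = []
    done = object()  # sentinel telling the processor to stop

    def downloader():
        for url in urls:
            body = fetch(url)         # blocks until the reply arrives
            fetched.put((url, body))  # hand over to the processing thread
            time.sleep(delay)         # only then start the delay
        fetched.put(done)

    def processor():
        while True:
            item = fetched.get()
            if item is done:
                break
            url, body = item
            results.append((url, process(body)))

    threads = [threading.Thread(target=downloader),
               threading.Thread(target=processor)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In real use `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()` and `process` the XHTML corrector and tokeniser; because the queue decouples the two threads, slow processing never delays the next request.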
