Monday, July 04, 2005

Remote File Retrieval and PDF Support

Today's work so far has been in the following two areas:

Remote File Retrieval: I have added functionality for downloading a single web document, along with support for extracting text from an HTML file. I have also added site downloading: given a single web address, ConcepTrainer downloads that file and any files on the same domain that are linked from it. Links are followed in every downloaded file, and each file is attempted only once.
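
As a rough illustration of the site downloading described above, here is a minimal sketch of a same-domain crawl that attempts each address only once. It is not ConcepTrainer's actual code: the class name SiteCrawlSketch, the methods crawl, extractLinks and sameHost, and the pluggable fetcher are all hypothetical names made up for this example. It also assumes that "same domain" means "same host name", and it finds links with a simple regular expression where a real HTML parser would be more robust.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from ConcepTrainer.
public class SiteCrawlSketch {

    // Naive href matcher (assumption: quoted attribute values).
    // It stops at '#', so fragments are dropped from the captured link.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Resolves every href found in the page against the page's own address. */
    static List<String> extractLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }

    /** Assumption: "same domain" is taken to mean "same host name". */
    static boolean sameHost(String a, String b) {
        String hostA = URI.create(a).getHost();
        String hostB = URI.create(b).getHost();
        return hostA != null && hostA.equalsIgnoreCase(hostB);
    }

    /**
     * Breadth-first crawl from startUrl. The fetcher returns the page body,
     * or null if the download failed. Returns every address that was
     * attempted, in the order it was attempted.
     */
    static Set<String> crawl(String startUrl, Function<String, String> fetcher) {
        Set<String> attempted = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!attempted.add(url)) {
                continue; // already attempted once, never retry
            }
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // failed download still counts as attempted
            }
            for (String link : extractLinks(url, html)) {
                if (sameHost(startUrl, link) && !attempted.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return attempted;
    }
}
```

Passing the download step in as a function keeps the crawl logic separate from the network code, so the same loop can be exercised against an in-memory map of pages or against a real HTTP fetcher.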

PDF Text Extraction: I have briefly investigated support for extracting text from PDF documents in Java. The only library with an open license I have come across is PDFBox (old BSD license). A quick test wasn't very impressive though: all four files I tested with gave fatal errors when parsing. This may be because I fed the files in through a ByteArrayInputStream rather than the FileInputStream the library was most likely designed to expect. I will investigate this later, along with command-line PDF text extraction tools.
