Details
-
Improvement
-
Status: Closed
-
Major
-
Resolution: Fixed
-
0.1-incubating
-
None
-
None
Description
Hi,
I have developed a patch that allows MsOffice properties extraction. I wasn't able to extract the MsOffice properties and full text from a single inputstream, I always get this error : java.io.IOException Source code of java.io.IOException: Unable to read entire header; -1 bytes read;
expected 512 bytes.
I don't know how they make it work in Nutch (any ideas ?).
To get it work, I have added "filePath" variable in the parser class, and I populate it from ParseUtils class. After that I create an inputStream from filePath or Url and I use it to extract properties and I use the default inputstream to extract full text.
I didn't commit this modification; I would like to have your opinions before.
Regards.