[TIKA-35] Extract MsOffice properties - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.1-incubating
Fix Version/s: 0.1-incubating
Component/s: None
Labels: None

Description

Hi,
I have developed a patch that extracts MsOffice properties. I wasn't able to extract both the MsOffice properties and the full text from a single InputStream; I always get this error: java.io.IOException: Unable to read entire header; -1 bytes read; expected 512 bytes.
I don't know how they make it work in Nutch (any ideas?).
To get it to work, I added a "filePath" variable to the parser class, and I populate it from the ParseUtils class. I then create a second InputStream from the filePath or URL and use it to extract the properties, while the default InputStream is used to extract the full text.
I haven't committed this modification; I would like to have your opinions first.
Regards.
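
For illustration only, here is a minimal sketch of one alternative to reopening the file: buffer the stream in memory once and hand each extraction pass its own fresh stream. It is not the code from the attached patch or files. The class name BufferedReread and the helpers readFully and readHeader are hypothetical, and the 512-byte read only mimics the header read mentioned in the error message; no MsOffice parsing is performed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedReread {

    /** Copies everything left in the stream into a byte array. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** Stand-in for a parser's first step: try to read a 512-byte header. */
    static int readHeader(InputStream in) throws IOException {
        byte[] header = new byte[512];
        return in.read(header);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes standing in for a real document.
        byte[] document = new byte[2048];
        InputStream source = new ByteArrayInputStream(document);

        // First pass consumes the stream completely.
        byte[] copy = readFully(source);

        // A second pass on the same stream finds it at end-of-stream.
        System.out.println(readHeader(source)); // -1

        // Each pass over the buffered copy starts from the beginning.
        System.out.println(readHeader(new ByteArrayInputStream(copy))); // 512
        System.out.println(readHeader(new ByteArrayInputStream(copy))); // 512
    }
}
```

The trade-off of this sketch is memory: the whole document is held in a byte array, which may not suit very large files.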

Attachments

RereadableInputStream.java, 02/Oct/07 00:53, 3 kB, Keith Bennett
RereadableInputStreamTest.java, 02/Oct/07 00:53, 1 kB, Keith Bennett
tika35.patch, 27/Sep/07 20:05, 23 kB, Rida Benjelloun
tika35.patch, 27/Sep/07 17:13, 24 kB, Rida Benjelloun

People

Assignee: Rida Benjelloun
Reporter: Rida Benjelloun
Votes: 0
Watchers: 0

Dates

Created: 27/Sep/07 17:12
Updated: 03/Oct/07 20:30
Resolved: 01/Oct/07 16:44