TIKA-35: Extract MsOffice properties

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1-incubating
    • Fix Version/s: 0.1-incubating
    • Component/s: None
    • Labels: None

    Description

      Hi,
      I have developed a patch that adds extraction of MsOffice properties. I wasn't able to extract both the MsOffice properties and the full text from a single InputStream; I always get this error: java.io.IOException: Unable to read entire header; -1 bytes read; expected 512 bytes.
      I don't know how Nutch makes this work (any ideas?).
      To get it working, I added a "filePath" variable to the parser class, and I populate it from the ParseUtils class. I then create a second InputStream from the file path or URL and use it to extract the properties, while the default InputStream is used to extract the full text.
      I haven't committed this modification; I would like to have your opinions first.
      Regards.
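
      The error above is most likely what a parser reports when it is handed a stream that an earlier pass has already read to the end: "-1 bytes read" is the end-of-stream value. As a rough sketch of an alternative to reopening the file, the stream can be buffered once and replayed for each pass. The class and method names below (StreamReuseSketch, readFully, consume) are hypothetical and are not taken from the attached patch or from RereadableInputStream.java; consume merely stands in for a real extraction pass.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; not part of Tika or the attached patch.
final class StreamReuseSketch {

    /** Reads the whole stream into memory so its content can be replayed. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Stand-in for one extraction pass: drains the stream, returns the byte count. */
    static int consume(InputStream in) throws IOException {
        return readFully(in).length;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: 600 bytes of dummy data stand in for an Office document.
        InputStream original = new ByteArrayInputStream(new byte[600]);

        // Buffer once; the original stream is now exhausted.
        byte[] copy = readFully(original);
        System.out.println("original after buffering: " + original.read());

        // Each pass gets its own fresh stream over the same bytes.
        int propertiesPass = consume(new ByteArrayInputStream(copy));
        int fullTextPass = consume(new ByteArrayInputStream(copy));
        System.out.println(propertiesPass + " / " + fullTextPass);
    }
}
```

      Buffering the whole document in memory is only reasonable for modest file sizes; for large inputs, spooling to a temporary file would be the safer variant of the same idea.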

      Attachments

        1. tika35.patch
          24 kB
          Rida Benjelloun
        2. tika35.patch
          23 kB
          Rida Benjelloun
        3. RereadableInputStreamTest.java
          1 kB
          Keith Bennett
        4. RereadableInputStream.java
          3 kB
          Keith Bennett

          People

            Assignee: Rida Benjelloun (rbenjelloun)
            Reporter: Rida Benjelloun (rbenjelloun)