Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.1
    • Component/s: None
    • Labels:
      None

      Description

      I would like both MimeTypes and the POIFSContainerDetector to be able to detect files created with StarOffice Draw, Impress, Writer and Calc.

      I started working on this, but stumbled upon a POI issue, which I posted to poi-user.

      http://thread.gmane.org/gmane.comp.jakarta.poi.user/17857

      Nick? Yegor? I know you're on the Tika list as well. Could you take a look? How do I get the raw content of the CompObj entry?

      1. testStarOffice-5.2-calc.sdc
        17 kB
        Antoni Mylka
      2. testStarOffice-5.2-draw.sda
        29 kB
        Antoni Mylka
      3. testStarOffice-5.2-impress.sdd
        29 kB
        Antoni Mylka
      4. testStarOffice-5.2-write.sdw
        8 kB
        Antoni Mylka

        Activity

        Antoni Mylka added a comment -

        These are the files I want to distinguish inside POIFSContainerDetector. Impress and Draw have the same set of top-level names. I'd like to distinguish them by strings contained in the raw content of the CompObj entry, but I don't know how to get that content via POI. Please have a look at my user@poi question.

        Nick Burch added a comment -

        Note that the strings appear to be prefixed with a 4-byte length field and are null-terminated. The first one may always start at the same place in the file. If so, you can probably skip forward to that offset, then use the POI utils to read the string from the DocumentInputStream.
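
The layout described above can be sketched with the Java standard library alone. This is a minimal illustration, not the code that was committed: the class and method names are hypothetical, the length field is assumed to be little-endian, and the caller is assumed to have already copied the CompObj bytes into an array (for example from a DocumentInputStream) and to know the offset of the string.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI or Tika.
final class LengthPrefixedStrings {
    private LengthPrefixedStrings() {}

    /**
     * Reads one string stored as a 4-byte length (assumed little-endian)
     * followed by that many bytes, the last of which is usually a null
     * terminator. Returns null when the buffer is too short or the
     * length field is implausible.
     */
    static String readAt(byte[] data, int offset) {
        if (offset < 0 || data.length - offset < 4) {
            return null;
        }
        int length = ByteBuffer.wrap(data, offset, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        int start = offset + 4;
        if (length <= 0 || length > data.length - start) {
            return null;
        }
        int end = start + length;
        // Drop the trailing null terminator if it is present.
        if (data[end - 1] == 0) {
            end--;
        }
        return new String(data, start, end - start, StandardCharsets.US_ASCII);
    }
}
```

A detector could then compare the returned string against whatever marker text distinguishes Draw from Impress. Which markers appear, and at which offset, has to be checked against the attached sample files.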

        Alex Ott added a comment -

        For .sdw and .sdc you can just look at the names of the streams in the root directory: they should be /StarWriterDocument and /StarCalcDocument. For .sda and .sdd it's more complicated: they both have /StarDrawDocument3 entries, so you'll need to parse CompObj as you suggested.
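
The name-based check described above might look like the sketch below. It assumes the caller has already collected the root directory entry names into a set, without the leading slash; the class name and the returned labels are hypothetical and are not Tika media types.

```java
import java.util.Set;

// Hypothetical helper illustrating detection by top-level entry names.
final class StarOfficeNames {
    private StarOfficeNames() {}

    /** Maps a set of root entry names to a coarse document kind. */
    static String classify(Set<String> topLevelNames) {
        if (topLevelNames.contains("StarWriterDocument")) {
            return "writer";
        }
        if (topLevelNames.contains("StarCalcDocument")) {
            return "calc";
        }
        if (topLevelNames.contains("StarDrawDocument3")) {
            // Draw and Impress share this entry, so the names alone
            // cannot tell them apart; CompObj has to be inspected.
            return "draw-or-impress";
        }
        return "unknown";
    }
}
```

Only the "draw-or-impress" result needs the extra CompObj parsing step, so the cheaper name lookup can run first.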

        Antoni Mylka added a comment -

        Committed in r1221686. Thanks for the tip about DocumentInputStream. The commit also fixes the indentation in a few places, as noticed by Nick in a dev@tika email:

        http://www.mail-archive.com/dev@tika.apache.org/msg03608.html


          People

          • Assignee:
            Unassigned
            Reporter:
            Antoni Mylka