TIKA-2461

Wordperfect file identified as Quattro Pro document

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.16
    • Fix Version/s: None
    • Component/s: detector
    • Labels:
      None
    • Environment:

      Linux Mint 17

      Description

      While running Tika 1.16 in detect mode over some legacy files from our repository system, I came across one file with a .wpd extension for which Tika reported the following MIME type:

      application/x-quattro-pro; version=7-8
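
For anyone trying to reproduce this, a minimal sketch of one way to run detection from a script is shown here. It assumes the tika-app command-line jar and its `--detect` option; the jar path, the file path, and both function names are placeholders for illustration, not something taken from the original report.

```python
import subprocess


def tika_detect_command(jar_path, target_path):
    """Build the command line that asks tika-app to detect a file's type.

    jar_path and target_path are hypothetical local paths.
    """
    return ["java", "-jar", str(jar_path), "--detect", str(target_path)]


def tika_detect(jar_path, target_path):
    """Run tika-app in detect mode and return what it prints on stdout.

    Requires a Java runtime and the tika-app jar to be present locally.
    """
    result = subprocess.run(
        tika_detect_command(jar_path, target_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `tika_detect("tika-app-1.16.jar", "some-file.wpd")` with a real jar and file returns the detected type as a string.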
      

      Opening the file in LibreOffice reveals that it is actually a WordPerfect document (I am not sure which version; the .wpd extension suggests WP 6 or later). I had a look at the Quattro Pro entry in tika-mimetypes.xml:

            <mime-type type="application/x-quattro-pro">
              <_comment>
                Quattro Pro - Corel Spreadsheet (part of WordPerfect Office suite)
              </_comment>
              <!-- qp2 and wb3 are currently detected by POIFSContainerDetector
                  TODO: add detection for wb2 and wb1 -->
              <glob pattern="*.qpw"/>
              <glob pattern="*.wb1"/>
              <glob pattern="*.wb2"/>
              <glob pattern="*.wb3"/>
            </mime-type>
      

      This suggests that the problem originates from POIFSContainerDetector.
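
A quick way to test this hypothesis is to look at the leading bytes of the file. The sketch below assumes that POIFSContainerDetector only comes into play for OLE2 compound files, which start with the signature D0 CF 11 E0 A1 B1 1A E1, whereas native WordPerfect files start with FF 57 50 43 (0xFF followed by "WPC"). The function names are made up for this example.

```python
# Signature of an OLE2 compound file (the container format POI reads).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a native WordPerfect file: 0xFF followed by "WPC".
WPC_MAGIC = b"\xffWPC"


def sniff_container(header: bytes) -> str:
    """Classify a file by its leading bytes.

    Returns "ole2", "wordperfect", or "unknown".
    """
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(WPC_MAGIC):
        return "wordperfect"
    return "unknown"


def sniff_file(path) -> str:
    """Read the first 8 bytes of the file at path and classify them."""
    with open(path, "rb") as handle:
        return sniff_container(handle.read(8))
```

If `sniff_file` reports "ole2" for the problem file, the document is wrapped in an OLE2 container and the container detector is a plausible culprit; if it reports "wordperfect", the misidentification probably comes from somewhere else.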

      For legal reasons I cannot share the original file. However, I was able to create a derived file by truncating the original after the first 18 kB, and this derived file shows the same behaviour. The file is available at this link:

      tika-identified-as-quattro-pro-truncated.wpd
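
The truncation described above can be sketched as follows. This is an illustration, not the exact procedure used for the attached file: it assumes "18 kB" means 18 × 1024 bytes, and the function name and paths are hypothetical.

```python
def truncate_copy(src_path, dst_path, size=18 * 1024):
    """Copy at most the first `size` bytes of src_path to dst_path.

    The default of 18 * 1024 bytes is an assumption about what "18 kB" means.
    Returns the number of bytes written.
    """
    with open(src_path, "rb") as source, open(dst_path, "wb") as target:
        chunk = source.read(size)
        target.write(chunk)
    return len(chunk)
```

Because the original file is left untouched, the full and truncated copies can both be run through the detector to confirm that they yield the same result.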


            People

            • Assignee:
              Unassigned
              Reporter:
              johanvanderknijff Johan van der Knijff