Nutch / NUTCH-1605

mime type detector recognizes xlsx as zip file

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7, 2.2.1
    • Fix Version/s: 2.3, 1.9
    • Component/s: parser
    • Labels: None
    • Patch Info: Patch Available

      Description

      With mime.type.magic set to true (the default), Office Open XML spreadsheets (*.xlsx) are detected as zip files and are not parsed correctly:

      % bin/nutch parsechecker http://localhost/test.xlsx
      fetching: http://localhost/test.xlsx
      parsing: http://localhost/test.xlsx
      contentType: application/zip
      ...
      

      Xlsx files are formally zip archives. Nevertheless, both the HTTP Content-Type header and the file name extension identify the type unambiguously:

      % wget -d http://localhost/test.xlsx
      ...
      HTTP/1.1 200 OK
      ...
      Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
      ...
      

      Tika 1.4 detects the type correctly:

      % java -jar tika-app-1.4.jar -d http://localhost/test/test.xlsx
      application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
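
      To illustrate why content-only (magic) detection reports application/zip for an .xlsx, and how the Content-Type header or file name could refine that result, here is a minimal Python sketch. It is not Nutch's or Tika's actual code and does not reproduce the attached patches; the function names, the two lookup tables, and the stand-in archive are all hypothetical, and the refinement rule is only one plausible strategy.

```python
import io
import zipfile

# Local file header signature that starts a non-empty zip archive.
ZIP_MAGIC = b"PK\x03\x04"
XLSX_TYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Hypothetical tables for this sketch only (not taken from Nutch or Tika).
ZIP_BASED_TYPES = {XLSX_TYPE}            # types that are zip containers
EXTENSION_TYPES = {".xlsx": XLSX_TYPE}   # file name extension -> type


def detect_by_magic(data: bytes) -> str:
    """Content-only detection: it sees the container format, nothing more."""
    if data.startswith(ZIP_MAGIC):
        return "application/zip"
    return "application/octet-stream"


def detect_by_name(url: str):
    """Return the type suggested by the file name extension, or None."""
    for ext, mime in EXTENSION_TYPES.items():
        if url.lower().endswith(ext):
            return mime
    return None


def resolve_type(data: bytes, url: str, header_type=None) -> str:
    """Let a hint refine application/zip only if the hint is a zip-based type."""
    magic = detect_by_magic(data)
    if magic == "application/zip":
        for hint in (header_type, detect_by_name(url)):
            if hint in ZIP_BASED_TYPES:
                return hint
    return magic


def make_fake_xlsx() -> bytes:
    """Build a tiny zip with an OOXML-style entry; not a valid spreadsheet."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
    return buf.getvalue()


if __name__ == "__main__":
    data = make_fake_xlsx()
    print(detect_by_magic(data))                             # magic alone
    print(resolve_type(data, "http://localhost/test.xlsx"))  # magic plus name hint
```

      In this sketch a hint never overrides the magic result outright; it only narrows application/zip to a more specific zip-based type, so a mislabeled non-zip payload keeps its magic-detected type.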
      

        Attachments

        1. test.xlsx
          4 kB
          Sebastian Nagel
        2. NUTCH-1605-trunk-v2.patch
          10 kB
          Sebastian Nagel
        3. NUTCH-1605-trunk-v1.patch
          2 kB
          Sebastian Nagel

            People

            • Assignee: Unassigned
            • Reporter: Sebastian Nagel (snagel)