Nutch / NUTCH-1605

mime type detector recognizes xlsx as zip file

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7, 2.2.1
    • Fix Version/s: 2.3, 1.9
    • Component/s: parser
    • Labels: None
    • Patch Info: Patch Available

    Description

      With mime.type.magic set to true (the default), Office Open XML spreadsheets (*.xlsx) are detected as zip files and not parsed correctly:

      % bin/nutch parsechecker http://localhost/test.xlsx
      fetching: http://localhost/test.xlsx
      parsing: http://localhost/test.xlsx
      contentType: application/zip
      ...
      

      Formally, xlsx files are zip archives. Nevertheless, both the HTTP Content-Type header and the file name identify the type unambiguously:

      % wget -d http://localhost/test.xlsx
      ...
      HTTP/1.1 200 OK
      ...
      Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
      ...
      

      Tika 1.4 detects the type correctly:

      % java -jar tika-app-1.4.jar -d http://localhost/test/test.xlsx
      application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
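
      The mismatch above can be illustrated with a minimal sketch. It is not the Nutch or Tika implementation and not the attached patch; the class name, the extension table, and the precedence rule (a specific Content-Type header or file extension refines a generic zip result from magic detection) are assumptions made for illustration only. Query strings and URL fragments are ignored.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: why magic-only detection reports xlsx as zip,
// and how a more specific hint could refine that result.
final class MimeResolutionSketch {

    static final String ZIP = "application/zip";
    static final String OCTET_STREAM = "application/octet-stream";
    static final String XLSX =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    // Zip local file header signature "PK\u0003\u0004"; an xlsx file starts with it too.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Illustrative extension table (assumption, deliberately tiny).
    private static final Map<String, String> BY_EXTENSION =
        Map.of("xlsx", XLSX, "zip", ZIP);

    // Magic-only detection: looks at the leading bytes and nothing else.
    static String detectByMagic(byte[] head) {
        if (head.length >= ZIP_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, ZIP_MAGIC.length), ZIP_MAGIC)) {
            return ZIP;
        }
        return OCTET_STREAM;
    }

    // Name-based hint: maps the text after the last dot, or returns null.
    static String detectByName(String url) {
        int dot = url.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        return BY_EXTENSION.get(url.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // If magic detection yields the generic zip container type, prefer a
    // specific Content-Type header, then a specific file-name hint.
    static String resolve(byte[] head, String url, String headerType) {
        String magic = detectByMagic(head);
        if (!ZIP.equals(magic)) {
            return magic;
        }
        if (headerType != null && !headerType.isEmpty()
                && !ZIP.equals(headerType) && !OCTET_STREAM.equals(headerType)) {
            return headerType;
        }
        String byName = detectByName(url);
        return (byName != null) ? byName : magic;
    }

    public static void main(String[] args) {
        byte[] head = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(detectByMagic(head));
        System.out.println(resolve(head, "http://localhost/test.xlsx", XLSX));
    }
}
```

      In this sketch, magic detection alone can only ever answer "zip" for a zip-based container format, which is why the header and file name are consulted before the result is accepted.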
      

      Attachments

        1. test.xlsx
          4 kB
          Sebastian Nagel
        2. NUTCH-1605-trunk-v1.patch
          2 kB
          Sebastian Nagel
        3. NUTCH-1605-trunk-v2.patch
          10 kB
          Sebastian Nagel

          People

            Assignee: Unassigned
            Reporter: Sebastian Nagel