  Nutch / NUTCH-2033

parse-tika skips valid documents.

Details

    Description

      If we run:

      bin/nutch parsechecker -dumpText http://ngdc.noaa.gov/geoportal/openSearchDescription
      

      we’ll get:

      Status: failed(2,0): Can't retrieve Tika parser for mime-type application/opensearchdescription+xml
      

      the same occurs for:

      bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json
      

      Both are perfectly valid documents that would parse if they were returned as "application/xml" and "text/plain" respectively.

      This happens because parse-tika uses the mime type to retrieve a suitable parser, and some composite mime types are not included in its list even though the documents are perfectly valid and parsable. This does not even take into account that servers often return incorrect mime types for the documents requested.

      We created a helper class as a workaround for this issue. The class uses regular expressions to define synonyms. In the first case, any mime type that matches "application/(.*)+xml" is replaced by "application/xml", so parse-tika parses the document just fine.
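
      The helper class itself is not attached here, so the following is only a minimal sketch of how such a regex-based synonym lookup might look. The class name MimeTypeSynonyms and the methods addRule and normalize are hypothetical and are not part of Nutch or Tika. The sketch escapes the "+" in the pattern so that it matches a literal plus sign, and it assumes that mime type parameters such as "; charset=UTF-8" can be dropped before matching.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch or Tika): maps mime types to synonyms via regex rules. */
final class MimeTypeSynonyms {
    // LinkedHashMap keeps insertion order, so rules added first are tried first.
    private final Map<Pattern, String> rules = new LinkedHashMap<>();

    /** Registers a rule: any mime type fully matching regex is replaced by synonym. */
    MimeTypeSynonyms addRule(String regex, String synonym) {
        rules.put(Pattern.compile(regex), synonym);
        return this;
    }

    /** Returns the synonym of the first matching rule, or the cleaned-up input if none matches. */
    String normalize(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        // Assumption: parameters (e.g. "; charset=UTF-8") are irrelevant for parser lookup.
        int semi = mimeType.indexOf(';');
        String base = (semi >= 0 ? mimeType.substring(0, semi) : mimeType)
                .trim()
                .toLowerCase(Locale.ROOT);
        for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
            if (rule.getKey().matcher(base).matches()) {
                return rule.getValue();
            }
        }
        return base;
    }

    public static void main(String[] args) {
        MimeTypeSynonyms synonyms = new MimeTypeSynonyms()
                // "\\+" matches a literal plus sign in the "+xml" suffix.
                .addRule("application/(.*)\\+xml", "application/xml");

        System.out.println(synonyms.normalize("application/opensearchdescription+xml"));
        System.out.println(synonyms.normalize("text/html"));
    }
}
```

      With a lookup like this, the normalized type rather than the raw one would be handed to parse-tika when it retrieves a parser. Further rules, for instance one for the swagger.json case, could be registered with addRule in the same way.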

          People

            lewismc Lewis John McGibbney
            betolink Luis Lopez