Nutch / NUTCH-1259

Store detected content type in crawldatum metadata

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.5
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The MIME type detected by Tika's detect() API is never added to a Parse's ContentMetaData or ParseMetaData. As a result, bad Content-Type values end up in the documents.
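
      To illustrate the behaviour described above, here is a minimal Java sketch. It is an assumption-laden toy, not the attached patch: it uses no Nutch or Tika classes, the detector is a simple magic-byte sniffer standing in for Tika's detection, and the names DetectedTypeSketch, withDetectedType and the "detected-content-type" key are hypothetical. It only shows the idea of recording the detected type next to the (possibly wrong) type declared in the HTTP headers.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; none of these names come from Nutch or Tika. */
final class DetectedTypeSketch {

    /** Hypothetical metadata key under which the detected type is stored. */
    static final String DETECTED_TYPE_KEY = "detected-content-type";

    /** Toy stand-in for a real detector: sniffs a few well-known leading bytes. */
    static String detect(byte[] content) {
        if (startsWith(content, "%PDF-")) {
            return "application/pdf";
        }
        int n = Math.min(content.length, 64);
        String head = new String(content, 0, n, StandardCharsets.US_ASCII)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (head.startsWith("<!doctype html") || head.startsWith("<html")) {
            return "text/html";
        }
        return "application/octet-stream";
    }

    /** Returns true if the content begins with the ASCII bytes of the prefix. */
    static boolean startsWith(byte[] content, String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.US_ASCII);
        if (content.length < p.length) {
            return false;
        }
        for (int i = 0; i < p.length; i++) {
            if (content[i] != p[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Copies the header-derived metadata and adds the detected type under its
     * own key, so the declared Content-Type is kept but no longer the only value.
     */
    static Map<String, String> withDetectedType(Map<String, String> headers, byte[] content) {
        Map<String, String> meta = new HashMap<>(headers);
        meta.put(DETECTED_TYPE_KEY, detect(content));
        return meta;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-Type", "text/plain"); // wrong type declared by the server
        byte[] body = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);

        Map<String, String> meta = withDetectedType(headers, body);
        System.out.println(meta.get("Content-Type"));    // prints "text/plain"
        System.out.println(meta.get(DETECTED_TYPE_KEY)); // prints "application/pdf"
    }
}
```

      With both values available, a downstream consumer can choose the detected type over the declared one. Which metadata container and key the actual fix uses is not stated in this issue.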

        Issue Links

          Activity

          Field | Original Value | New Value

          Markus Jelsma created issue
          Markus Jelsma made changes
            Link | (none) | This issue is related to NUTCH-1258 [ NUTCH-1258 ]
          Markus Jelsma made changes
            Attachment | (none) | NUTCH-1259-1.5-1.patch [ 12513622 ]
          Markus Jelsma made changes
            Patch Info | (none) | Patch Available [ 10042 ]
          Julien Nioche made changes
            Summary | TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata | Store detected content type in crawldatum metadata
          Julien Nioche made changes
            Status | Open [ 1 ] | Resolved [ 5 ]
            Resolution | (none) | Fixed [ 1 ]
          Markus Jelsma made changes
            Resolution | Fixed [ 1 ] | (none)
            Status | Resolved [ 5 ] | Reopened [ 4 ]
            Assignee | Markus Jelsma [ markus17 ] | Julien Nioche [ jnioche ]
          Julien Nioche made changes
            Status | Reopened [ 4 ] | Resolved [ 5 ]
            Resolution | (none) | Fixed [ 1 ]
          Markus Jelsma made changes
            Link | (none) | This issue relates to NUTCH-1293 [ NUTCH-1293 ]
          Lewis John McGibbney made changes
            Status | Resolved [ 5 ] | Closed [ 6 ]

            People

            • Assignee:
              Julien Nioche
            • Reporter:
              Markus Jelsma
            • Votes:
              0
            • Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved:
