Nutch / NUTCH-1262

Map `duplicating` content-types to a single type

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.6
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Similar or duplicate content types can end up as different values in an index. For example, when both application/xhtml+xml and text/html occur, no single filter can select `web pages`.

      See also: http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html

      Content-Type mapping is disabled by default and is enabled via moreIndexingFilter.mapMimeTypes. An example mapping file is provided in conf/.

      # target MIME-type <TAB> type1 [<TAB> type2 ...]
      
      # Map XHTML to HTML
      text/html       application/xhtml+xml
      
      # Map XHTML and HTML to something else
      Web page        text/html       application/xhtml+xml
      
      # Map some office documents to each other
      Office document application/vnd.oasis.opendocument.text application/x-tika-msoffice
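
The mapping-file format above can be illustrated with a small parser. This is only a sketch of how such a file might be read, written in Python for brevity. It is not the code from the attached patch, and the names `parse_mime_mapping`, `map_content_type`, and `EXAMPLE` are hypothetical. It assumes fields are separated by tabs, lines starting with `#` are comments, and a later line overrides an earlier one for the same source type.

```python
from typing import Dict, Iterable


def parse_mime_mapping(lines: Iterable[str]) -> Dict[str, str]:
    """Build a source-type -> target-type lookup from mapping-file lines.

    Hypothetical helper: each data line is
    target <TAB> type1 [<TAB> type2 ...]
    """
    mapping: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split on tabs only, because a target such as "Web page" contains a space.
        fields = [field.strip() for field in line.split("\t") if field.strip()]
        if len(fields) < 2:
            continue  # a target with no source types maps nothing
        target, sources = fields[0], fields[1:]
        for source in sources:
            # Assumption: a later line wins if a source type is listed twice.
            mapping[source] = target
    return mapping


def map_content_type(mapping: Dict[str, str], content_type: str) -> str:
    """Return the mapped type, or the original type when no mapping exists."""
    return mapping.get(content_type, content_type)


# Two of the example lines from the description, with explicit tab separators.
EXAMPLE = "\n".join([
    "# target MIME-type <TAB> type1 [<TAB> type2 ...]",
    "text/html\tapplication/xhtml+xml",
    "Office document\tapplication/vnd.oasis.opendocument.text\tapplication/x-tika-msoffice",
])

if __name__ == "__main__":
    lookup = parse_mime_mapping(EXAMPLE.splitlines())
    for content_type in ("application/xhtml+xml", "application/x-tika-msoffice", "image/png"):
        print(content_type, "->", map_content_type(lookup, content_type))
```

Types that do not appear in the file pass through unchanged, so with a mapping like this a single filter on the target value would select every mapped variant.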
      

        Attachments

        1. NUTCH-1262-1.5-2.patch
          3 kB
          Markus Jelsma
        2. NUTCH-1262-1.5-1.patch
          3 kB
          Markus Jelsma

            People

            • Assignee:
              markus17 Markus Jelsma
              Reporter:
              markus17 Markus Jelsma