Nutch / NUTCH-1262

Map `duplicating` content-types to a single type

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.6
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

    Description

      Similar or duplicate content types can end up as different values in an index. For example, when both application/xhtml+xml and text/html occur, it is impossible to select `web pages` with a single filter.

      See also: http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html

      Content-Type mapping is disabled by default and is enabled via moreIndexingFilter.mapMimeTypes. An example mapping file is provided in conf/. Each line names a target type followed by the source types that are mapped to it:

      # target MIME-type <TAB> type1 [<TAB> type2 ...]
      
      # Map XHTML to HTML
      text/html       application/xhtml+xml
      
      # Map XHTML and HTML to something else
      Web page        text/html       application/xhtml+xml
      
      # Map some office documents to each other
      Office document application/vnd.oasis.opendocument.text application/x-tika-msoffice
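
      To illustrate how a file in this format could be read, here is a minimal sketch in Java. It is not the code from the attached patches: the class name MimeTypeMapping and the methods parse and map are hypothetical, and it assumes only what the example above shows (tab-separated fields, `#` comment lines, target type in the first field). Types without a mapping are assumed to pass through unchanged.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: reads "target <TAB> type1 [<TAB> type2 ...]" lines.
public class MimeTypeMapping {

    // Builds a lookup from each source type to its target type.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> sourceToTarget = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip blank lines and comments.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] fields = trimmed.split("\t");
            // A usable line needs a target and at least one source type.
            if (fields.length < 2) {
                continue;
            }
            String target = fields[0].trim();
            for (int i = 1; i < fields.length; i++) {
                String source = fields[i].trim();
                if (!source.isEmpty()) {
                    sourceToTarget.put(source, target);
                }
            }
        }
        return sourceToTarget;
    }

    // Returns the mapped type, or the original type when no mapping exists (assumed behaviour).
    static String map(Map<String, String> mapping, String contentType) {
        return mapping.getOrDefault(contentType, contentType);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "# Map XHTML and HTML to something else",
            "Web page\ttext/html\tapplication/xhtml+xml",
            "Office document\tapplication/vnd.oasis.opendocument.text\tapplication/x-tika-msoffice");
        Map<String, String> mapping = parse(lines);
        System.out.println(map(mapping, "application/xhtml+xml")); // prints "Web page"
        System.out.println(map(mapping, "image/png"));             // prints "image/png"
    }
}
```

      With such a lookup, both text/html and application/xhtml+xml would be indexed under the same value, so a single filter on that value selects all web pages.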
      

      Attachments

        1. NUTCH-1262-1.5-1.patch
          3 kB
          Markus Jelsma
        2. NUTCH-1262-1.5-2.patch
          3 kB
          Markus Jelsma

          People

            Assignee: Markus Jelsma (markus17)
            Reporter: Markus Jelsma (markus17)
            Votes: 0
            Watchers: 2
