NUTCH-2720: ROBOTS metatag ignored when capitalized

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.15
    • Fix Version/s: 1.17
    • Component/s: indexer, robots
    • Labels: None

    Description

      As discussed on the mailing list, index-metadata fails to ignore a web page with a capitalized robots metatag such as <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">. This only applies when parse-tika is used; parse-html "decapitalizes" (lowercases) the metatag name, so the lookup succeeds there.

      Parsing the attached noindex.html leads to the following results:

      parse-html:

      bin/nutch parsechecker -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" -Dindexer.delete.robots.noindex="true" -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" http://localhost:8080/noindex.html
      
      Parse Metadata: [...] metatag.robots=noindex,nofollow robots=noindex,nofollow

      parse-tika:

      bin/nutch parsechecker -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" -Dindexer.delete.robots.noindex="true" -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" http://localhost:8080/noindex.html
      
      Parse Metadata: metatag.robots=NOINDEX,NOFOLLOW  [...] ROBOTS=NOINDEX,NOFOLLOW [...]

       

      Because the field is named "ROBOTS" rather than "robots", parseData.getMeta("robots") returns null in https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257, so the noindex directive is never seen.
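
      The mismatch can be illustrated with a minimal, self-contained sketch. It is not the Nutch code and not necessarily how the issue was fixed: RobotsMetaLookup, getIgnoreCase and isNoIndex are hypothetical names, and a plain Map stands in for the parse metadata. It only shows why an exact-case lookup misses the key produced by parse-tika and how a case-insensitive lookup would find it.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: a plain Map stands in for the
// parse metadata so the example is runnable on its own.
class RobotsMetaLookup {

  // Returns the value of the first key equal to `name` ignoring case, or null.
  static String getIgnoreCase(Map<String, String> meta, String name) {
    for (Map.Entry<String, String> e : meta.entrySet()) {
      if (e.getKey().equalsIgnoreCase(name)) {
        return e.getValue();
      }
    }
    return null;
  }

  // True if the robots metatag (in any capitalization) contains "noindex".
  static boolean isNoIndex(Map<String, String> meta) {
    String robots = getIgnoreCase(meta, "robots");
    return robots != null && robots.toLowerCase(Locale.ROOT).contains("noindex");
  }

  public static void main(String[] args) {
    // Mirrors the parse-tika output above: upper-case key and value.
    Map<String, String> tikaMeta = new LinkedHashMap<>();
    tikaMeta.put("ROBOTS", "NOINDEX,NOFOLLOW");

    System.out.println(tikaMeta.get("robots")); // exact-case lookup misses the key
    System.out.println(isNoIndex(tikaMeta));    // case-insensitive lookup finds it
  }
}
```

      Normalizing the metatag name once at parse time, as parse-html appears to do, would be an alternative to making each lookup case-insensitive.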

      Attachments

        1. noindex.html
          0.1 kB
          Felix Zett

          People

            Assignee: Sebastian Nagel (snagel)
            Reporter: Felix Zett (fezett)