  1. Nutch
  2. NUTCH-2387

Nutch should not index documents with a "noindex" robots meta tag

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: indexer
    • Labels:
    • Environment: Linux Mint 18

      Description

      I'm using Nutch 1.12 in local mode with Solr 4.10.3.
      I have found that Nutch indexes documents that carry a "noindex" robots meta tag.
      I ran the Nutch crawl script for a complete cycle:
      bin/crawl -i urls/ crawl/ -2
      with this url:
      https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
      After repeated testing the problem persists, and approximately 200 documents with this robots meta tag are indexed incorrectly.
      I have read the configure method in the IndexerMapReduce.java class. It has a line that reads this property, but for some reason it does not take effect:
      this.deleteRobotsNoIndex = job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX,false); (line 97)

        Activity

        Sebastian Nagel (wastl-nagel) added a comment:

        Is the property indexer.delete.robots.noindex set to true in nutch-site.xml or via command-line -Dindexer.delete.robots.noindex=true?
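
        For reference, a minimal sketch of the override this question refers to, assuming the usual Hadoop-style conf/nutch-site.xml layout. The property name is the one quoted above. The getBoolean call quoted in the description falls back to false, so the noindex handling stays off unless the property is set explicitly. The description text is illustrative wording, not copied from Nutch.

        ```xml
        <!-- conf/nutch-site.xml: site-specific overrides (sketch, adjust to your setup) -->
        <configuration>
          <property>
            <name>indexer.delete.robots.noindex</name>
            <!-- the code quoted in the description defaults this to false -->
            <value>true</value>
            <description>Illustrative: ask the indexer to act on robots "noindex" meta.</description>
          </property>
        </configuration>
        ```

        After changing the setting, the indexing step has to be run again for it to take effect on documents that are already in the index.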


          People

          • Assignee: Unassigned
          • Reporter: Eyeris Rodriguez Rueda (eyeris)
          • Votes: 0
          • Watchers: 2
