Nutch / NUTCH-1798

Crawl script not calling index command correctly


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 2.3
    • Component/s: None
    • Labels: None
    • Environment: Ubuntu 13.10, Elasticsearch 1, HBase 0.94.9

    Description

      Hopefully this is just something I am doing wrong. I checked out 2.x because I would like to use the new metatag extraction features, ran ant runtime to build, and updated nutch-site.xml like so:

      <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
      <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library.
      </description>
      </property>

      <property>
      <name>elastic.cluster</name>
      <value>elasticsearch</value>
      <description>The cluster name to discover. Either host and port must be defined
      or cluster.</description>
      </property>

      I then created a folder called urls and added seed.txt to it.

      I ran the following commands:
      bin/nutch inject urls
      bin/nutch generate -topN 1000
      bin/nutch fetch -all
      bin/nutch parse -all
      bin/nutch updatedb

      bin/nutch index -all

      They run without errors, but no documents have been indexed.

      I also tried the same setup with Solr, and no documents were indexed there either.
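The command sequence above can be wrapped so that a failing step stops the round instead of being skipped over unnoticed. This is only a minimal sketch: the helper names `run_step` and `run_crawl_round` and the `NUTCH` variable are hypothetical, and the subcommands and arguments are copied verbatim from the report rather than checked against a particular Nutch release. It does not address the empty index itself; it only rules out an earlier step exiting with an error.

```shell
#!/bin/sh
# Sketch: run the crawl steps from the report in order and stop at the first failure.
# NUTCH is an assumption: the path to the nutch launcher, overridable by the caller.
NUTCH="${NUTCH:-bin/nutch}"

# Run one nutch subcommand, echo it first, and report a non-zero exit status.
run_step() {
  echo "== nutch $*"
  "$NUTCH" "$@" || { echo "step failed: nutch $*" >&2; return 1; }
}

# The six steps exactly as listed in the report; && stops the chain on failure.
run_crawl_round() {
  run_step inject urls &&
  run_step generate -topN 1000 &&
  run_step fetch -all &&
  run_step parse -all &&
  run_step updatedb &&
  run_step index -all
}

# Usage (from the Nutch runtime directory):
#   run_crawl_round || echo "crawl round aborted"
```

Because each step is chained with `&&`, the index step only runs when every earlier step returned success, which makes it easier to tell a failed step apart from an index job that ran but found nothing to send.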

      Log:

      2014-06-24 02:57:57,804 INFO parse.ParserJob - ParserJob: success
      2014-06-24 02:57:57,805 INFO parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06
      2014-06-24 02:57:59,823 INFO indexer.IndexingJob - IndexingJob: starting
      2014-06-24 02:58:00,815 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
      2014-06-24 02:58:00,815 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
      2014-06-24 02:58:01,774 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
      2014-06-24 02:58:01,776 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
      2014-06-24 02:58:01,776 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
      2014-06-24 02:58:03,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      2014-06-24 02:58:04,920 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
      2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
      2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] initializing ...
      2014-06-24 02:58:05,377 INFO elasticsearch.plugins - [Silver] loaded [], sites []
      2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] initialized
      2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] starting ...
      2014-06-24 02:58:08,431 INFO elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}

      2014-06-24 02:58:11,540 INFO cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
      2014-06-24 02:58:11,553 INFO elasticsearch.discovery - [Silver] elasticsearch/jXIC3VT6THukKDFB7GMw7Q
      2014-06-24 02:58:11,562 INFO elasticsearch.http - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}

      2014-06-24 02:58:11,566 INFO elasticsearch.node - [Silver] started
      2014-06-24 02:58:11,568 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
      2014-06-24 02:58:11,569 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
      2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
      2014-06-24 02:58:11,581 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
      2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
      2014-06-24 02:58:11,716 INFO elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
      2014-06-24 02:58:11,717 INFO elastic.ElasticIndexWriter - Processing to finalize last execute
      2014-06-24 02:58:11,717 INFO elasticsearch.node - [Silver] stopping ...
      2014-06-24 02:58:11,751 INFO elasticsearch.node - [Silver] stopped
      2014-06-24 02:58:11,751 INFO elasticsearch.node - [Silver] closing ...
      2014-06-24 02:58:11,756 INFO elasticsearch.node - [Silver] closed
      2014-06-24 02:58:11,759 WARN mapred.FileOutputCommitter - Output path is null in cleanup
      2014-06-24 02:58:12,511 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
      2014-06-24 02:58:12,511 INFO indexer.IndexingJob - Active IndexWriters :
      ElasticIndexWriter
      elastic.cluster : elastic prefix cluster
      elastic.host : hostname
      elastic.port : port (default 9300)
      elastic.index : elastic index command
      elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
      elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

      2014-06-24 02:58:12,525 INFO elasticsearch.node - [Lifeguard] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
      2014-06-24 02:58:12,525 INFO elasticsearch.node - [Lifeguard] initializing ...
      2014-06-24 02:58:12,555 INFO elasticsearch.plugins - [Lifeguard] loaded [], sites []
      2014-06-24 02:58:13,025 INFO elasticsearch.node - [Lifeguard] initialized
      2014-06-24 02:58:13,025 INFO elasticsearch.node - [Lifeguard] starting ...
      2014-06-24 02:58:13,032 INFO elasticsearch.transport - [Lifeguard] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}

      2014-06-24 02:58:16,063 INFO cluster.service - [Lifeguard] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
      2014-06-24 02:58:16,072 INFO elasticsearch.discovery - [Lifeguard] elasticsearch/MWiqtTiqS5aC_M7QvGtfyg
      2014-06-24 02:58:16,074 INFO elasticsearch.http - [Lifeguard] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}

      2014-06-24 02:58:16,076 INFO elasticsearch.node - [Lifeguard] started
      2014-06-24 02:58:16,076 INFO indexer.IndexingJob - IndexingJob: done.
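The key line in the log above is the ElasticIndexWriter summary, which reports `total docs = 0`. A rough way to check for this after each run is to pull that number out of the log. The sketch below assumes the summary line keeps the exact wording shown in this report; the function names `last_total_docs` and `check_indexed` are hypothetical, and the log file path is whatever your installation writes to.

```shell
#!/bin/sh
# Sketch: extract the last "total docs" figure logged by ElasticIndexWriter.
# Assumption: the summary line has the same wording as in the log quoted above.

# Print the number from the last matching summary line in log file $1.
last_total_docs() {
  sed -n '/ElasticIndexWriter - Processing remaining requests/s/.*total docs = \([0-9][0-9]*\)].*/\1/p' "$1" | tail -n 1
}

# Return 0 if documents were reported, 1 if the count was zero,
# 2 if no summary line was found at all.
check_indexed() {
  total=$(last_total_docs "$1")
  if [ -z "$total" ]; then
    echo "no ElasticIndexWriter summary line found in $1"
    return 2
  elif [ "$total" -eq 0 ]; then
    echo "index job ran but reported 0 documents"
    return 1
  fi
  echo "index job reported $total documents"
}

# Usage (the log path is an assumption, adjust to your setup):
#   check_indexed logs/hadoop.log
```

A zero count together with "IndexingJob: done." matches the symptom described here: the job completes cleanly but hands no documents to the writer.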

      Attachments

        1. part-r-00000
          43 kB
          Aaron Bedward

          People

            Assignee: Unassigned
            Reporter: Aaron Bedward (mrbedward)
            Votes: 0
            Watchers: 5
