Nutch / NUTCH-1791

Null pointer exceptions with gora-cassandra-0.4

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Auto Closed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.5
    • Component/s: generator, storage
    • Labels: None
    • Environment: dsc-cassandra-2.0.2, dsc-cassandra-2.0.7

    Description

      The latest nutch-2.x source checkout fails to run with Cassandra 2.0.2 (and also Cassandra 2.0.7) as the storage backend, both in the normal Nutch cycle (inject, generate, fetch) and in the JUnit test TestGoraStorage:

      2014-06-03 11:24:23,495 INFO  connection.CassandraHostRetryService (CassandraHostRetryService.java:<init>(48)) - Downed Host Retry service started with queue size -1 and retry delay 10s
      2014-06-03 11:24:23,535 INFO  service.JmxMonitor (JmxMonitor.java:registerMonitor(52)) - Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
      Exception in thread "main" java.lang.NullPointerException
      	at org.apache.gora.cassandra.query.CassandraResult.updatePersistent(CassandraResult.java:121)
      	at org.apache.gora.cassandra.query.CassandraResult.nextInner(CassandraResult.java:57)
      	at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114)
      	at org.apache.nutch.storage.TestGoraStorage.readWrite(TestGoraStorage.java:93)
      	at org.apache.nutch.storage.TestGoraStorage.main(TestGoraStorage.java:230)
      

      After injecting:

      ksmets@precise64 ~/l/a/r/local> ./bin/nutch inject urls
      InjectorJob: starting at 2014-06-03 11:55:11
      InjectorJob: Injecting urlDir: urls
      InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
      InjectorJob: total number of urls rejected by filters: 0
      InjectorJob: total number of urls injected after normalization and filtering: 1
      Injector: finished at 2014-06-03 11:55:13, elapsed: 00:00:02
      
      ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -stats
      WebTable statistics start
      Statistics for WebTable:
      min score:	1.0
      retry 0:	1
      jobs:	{db_stats-job_local1403358409_0001={jobID=job_local1403358409_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=97, MAP_INPUT_RECORDS=1, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=12, MAP_OUTPUT_BYTES=53, COMMITTED_HEAP_BYTES=358612992, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=769, COMBINE_INPUT_RECORDS=4, REDUCE_INPUT_RECORDS=6, REDUCE_INPUT_GROUPS=6, COMBINE_OUTPUT_RECORDS=6, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=6, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4}, FileSystemCounters={FILE_BYTES_READ=974145, FILE_BYTES_WRITTEN=1144369}, File Output Format Counters ={BYTES_WRITTEN=225}}}}
      max score:	1.0
      TOTAL urls:	1
      status 0 (null):	1
      avg score:	1.0
      WebTable statistics: done
      
      ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -url http://example.com/
      key:	http://example.com/
      baseUrl:	null
      status:	0 (null)
      fetchTime:	1401789311270
      prevFetchTime:	0
      fetchInterval:	2592000
      retriesSinceFetch:	0
      modifiedTime:	0
      prevModifiedTime:	0
      protocolStatus:	(null)
      parseStatus:	(null)
      title:	null
      score:	1.0
      markers:	org.apache.gora.persistency.impl.DirtyMapWrapper@eb173c
      reprUrl:	null
      metadata _csh_ : 	?�
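
A quick sanity check on the record above: the sketch below converts the reported fetchTime to a wall-clock time so it can be compared with the InjectorJob start time. It assumes that fetchTime is milliseconds since the Unix epoch and that the reporter's machine was at UTC+2; the issue states neither, so treat both as assumptions.

```python
from datetime import datetime, timedelta, timezone

# fetchTime copied from the `readdb -url` output above.
FETCH_TIME_MS = 1401789311270

# Assumption: fetchTime is milliseconds since the Unix epoch.
utc = datetime.fromtimestamp(FETCH_TIME_MS // 1000, tz=timezone.utc)

# Assumption: the reporter's local time zone was UTC+2 (not stated in the issue).
local = utc.astimezone(timezone(timedelta(hours=2)))

print("UTC:  ", utc.isoformat(timespec="seconds"))
print("UTC+2:", local.isoformat(timespec="seconds"))
```

Under those assumptions the value corresponds to 09:55:11 UTC on 2014-06-03, i.e. 11:55:11 at UTC+2, which lines up with the InjectorJob start time in the log. That would suggest the row was written correctly at inject time.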
      

      After generating:

      ksmets@precise64 ~/l/a/r/local> ./bin/nutch generate -topN 1
      GeneratorJob: starting at 2014-06-03 11:55:38
      GeneratorJob: Selecting best-scoring urls due for fetch.
      GeneratorJob: starting
      GeneratorJob: filtering: true
      GeneratorJob: normalizing: true
      GeneratorJob: topN: 1
      GeneratorJob: finished at 2014-06-03 11:55:40, time elapsed: 00:00:02
      GeneratorJob: generated batch id: 1401789338-222512082 containing 1 URLs
      
      ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -stats
      WebTable statistics start
      Statistics for WebTable:
      jobs:	{db_stats-job_local73029265_0001={jobID=job_local73029265_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=358612992, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=769, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=974054, FILE_BYTES_WRITTEN=1144028}, File Output Format Counters ={BYTES_WRITTEN=98}}}}
      TOTAL urls:	0
      WebTable statistics: done
      
      ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -url http://example.com/
      WebTableReader: java.lang.NullPointerException
      	at org.apache.gora.cassandra.query.CassandraResult.updatePersistent(CassandraResult.java:121)
      	at org.apache.gora.cassandra.query.CassandraResult.nextInner(CassandraResult.java:57)
      	at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114)
      	at org.apache.nutch.crawl.WebTableReader.read(WebTableReader.java:238)
      	at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:494)
      	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      	at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:430)
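
The symptom in the two `readdb -stats` runs (one URL after inject, none after generate) can be checked mechanically. The following is a minimal sketch, not part of Nutch: the helper name `total_urls` is hypothetical, and the sample text is copied from the outputs above with the long `jobs:` line omitted.

```python
import re


def total_urls(stats_output: str) -> int:
    """Return the 'TOTAL urls' count from `nutch readdb -stats` text output."""
    match = re.search(r"^\s*TOTAL urls:\s*(\d+)\s*$", stats_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'TOTAL urls' line found")
    return int(match.group(1))


# Output after `nutch inject` (the `jobs:` line is omitted for brevity).
AFTER_INJECT = """\
WebTable statistics start
Statistics for WebTable:
min score:\t1.0
retry 0:\t1
max score:\t1.0
TOTAL urls:\t1
status 0 (null):\t1
avg score:\t1.0
WebTable statistics: done
"""

# Output after `nutch generate -topN 1` (the `jobs:` line is omitted for brevity).
AFTER_GENERATE = """\
WebTable statistics start
Statistics for WebTable:
TOTAL urls:\t0
WebTable statistics: done
"""

before = total_urls(AFTER_INJECT)
after = total_urls(AFTER_GENERATE)
print(f"TOTAL urls: {before} after inject, {after} after generate")
if after < before:
    print("URL count dropped after the generate step")
```

In practice the two strings would come from capturing the command output before and after `generate`; a drop in the count without any delete operation in between is the behaviour this issue reports.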
      

            People

              Assignee: Unassigned
              Reporter: Koen Smets (ksmets)