Description
The latest nutch-2.x source checkout fails to run with Cassandra 2.0.2 (and also Cassandra 2.0.7) as the storage backend, both in the normal Nutch cycle (inject, generate, fetch) and in the JUnit test TestGoraStorage:
2014-06-03 11:24:23,495 INFO connection.CassandraHostRetryService (CassandraHostRetryService.java:<init>(48)) - Downed Host Retry service started with queue size -1 and retry delay 10s
2014-06-03 11:24:23,535 INFO service.JmxMonitor (JmxMonitor.java:registerMonitor(52)) - Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
Exception in thread "main" java.lang.NullPointerException
at org.apache.gora.cassandra.query.CassandraResult.updatePersistent(CassandraResult.java:121)
at org.apache.gora.cassandra.query.CassandraResult.nextInner(CassandraResult.java:57)
at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114)
at org.apache.nutch.storage.TestGoraStorage.readWrite(TestGoraStorage.java:93)
at org.apache.nutch.storage.TestGoraStorage.main(TestGoraStorage.java:230)
After injecting:
ksmets@precise64 ~/l/a/r/local> ./bin/nutch inject urls
InjectorJob: starting at 2014-06-03 11:55:11
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-06-03 11:55:13, elapsed: 00:00:02
ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -stats
WebTable statistics start
Statistics for WebTable:
min score: 1.0
retry 0: 1
jobs: {db_stats-job_local1403358409_0001={jobID=job_local1403358409_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=97, MAP_INPUT_RECORDS=1, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=12, MAP_OUTPUT_BYTES=53, COMMITTED_HEAP_BYTES=358612992, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=769, COMBINE_INPUT_RECORDS=4, REDUCE_INPUT_RECORDS=6, REDUCE_INPUT_GROUPS=6, COMBINE_OUTPUT_RECORDS=6, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=6, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4}, FileSystemCounters={FILE_BYTES_READ=974145, FILE_BYTES_WRITTEN=1144369}, File Output Format Counters ={BYTES_WRITTEN=225}}}}
max score: 1.0
TOTAL urls: 1
status 0 (null): 1
avg score: 1.0
WebTable statistics: done
ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -url http://example.com/
key: http://example.com/
baseUrl: null
status: 0 (null)
fetchTime: 1401789311270
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
markers: org.apache.gora.persistency.impl.DirtyMapWrapper@eb173c
reprUrl: null
metadata _csh_ : ?�
After generating:
ksmets@precise64 ~/l/a/r/local> ./bin/nutch generate -topN 1
GeneratorJob: starting at 2014-06-03 11:55:38
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 1
GeneratorJob: finished at 2014-06-03 11:55:40, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1401789338-222512082 containing 1 URLs
ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -stats
WebTable statistics start
Statistics for WebTable:
jobs: {db_stats-job_local73029265_0001={jobID=job_local73029265_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=358612992, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=769, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=974054, FILE_BYTES_WRITTEN=1144028}, File Output Format Counters ={BYTES_WRITTEN=98}}}}
TOTAL urls: 0
WebTable statistics: done
ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -url http://example.com/
WebTableReader: java.lang.NullPointerException
at org.apache.gora.cassandra.query.CassandraResult.updatePersistent(CassandraResult.java:121)
at org.apache.gora.cassandra.query.CassandraResult.nextInner(CassandraResult.java:57)
at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114)
at org.apache.nutch.crawl.WebTableReader.read(WebTableReader.java:238)
at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:494)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:430)
Issue Links
- is blocked by
  - GORA-395 NPE occurs in o.a.g.cassandra.query.CassandraResult when accessing cc (=null) and setting the unionField (title) (Closed)
- is related to
  - NUTCH-1780 ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file (Closed)
  - NUTCH-1714 Nutch 2.x upgrade to Gora 0.4 (Closed)
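
The GORA-395 summary attributes the NPE to a column object (cc) being null when a nullable union field such as title is set. The following is a minimal, self-contained sketch of that failure mode and of a null guard, not the actual Gora code: the class NullColumnSketch, the nested Column type, and both copy methods are hypothetical names introduced only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these types exist in Gora or Nutch.
class NullColumnSketch {

    /** Stand-in for a stored column value. */
    static final class Column {
        private final String value;
        Column(String value) { this.value = value; }
        String getValue() { return value; }
    }

    /** Unguarded copy: dereferences the column even when the row has none for the field. */
    static Map<String, String> copyUnguarded(Map<String, Column> row, String[] fields) {
        Map<String, String> persistent = new HashMap<>();
        for (String field : fields) {
            Column cc = row.get(field);            // null when the column was never written
            persistent.put(field, cc.getValue());  // NullPointerException for absent columns
        }
        return persistent;
    }

    /** Guarded copy: an absent column leaves the field unset instead of failing. */
    static Map<String, String> copyGuarded(Map<String, Column> row, String[] fields) {
        Map<String, String> persistent = new HashMap<>();
        for (String field : fields) {
            Column cc = row.get(field);
            if (cc == null) {
                continue;                          // nullable field without a stored value
            }
            persistent.put(field, cc.getValue());
        }
        return persistent;
    }

    public static void main(String[] args) {
        // Assumed row shape: score is stored, title is not (as after inject above).
        Map<String, Column> row = new HashMap<>();
        row.put("score", new Column("1.0"));
        String[] fields = {"score", "title"};

        System.out.println("guarded copy: " + copyGuarded(row, fields));
        try {
            copyUnguarded(row, fields);
        } catch (NullPointerException e) {
            System.out.println("unguarded copy failed with NullPointerException");
        }
    }
}
```

Whether the real fix in GORA-395 takes this form is not stated in this report; the sketch only shows why a row with an unset title could trigger the trace above.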