Nutch / NUTCH-2526

NPE in scoring-opic when indexing document without CrawlDb datum

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 1.15
    • Component/s: parser, scoring
    • Labels:
      None
    • Docs Text:
      2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
      2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
      2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20180307130959
      2018-03-07 15:41:53,677 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
      2018-03-07 15:41:54,861 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
      2018-03-07 15:41:55,168 INFO client.AbstractJestClient - Setting server pool to a list of 1 servers: [http://localhost:9200]
      2018-03-07 15:41:55,170 INFO client.JestClientFactory - Using multi thread/connection supporting pooling connection manager
      2018-03-07 15:41:55,238 INFO client.JestClientFactory - Using default GSON instance
      2018-03-07 15:41:55,238 INFO client.JestClientFactory - Node Discovery disabled...
      2018-03-07 15:41:55,238 INFO client.JestClientFactory - Idle connection reaping disabled...
      2018-03-07 15:41:55,282 INFO elasticrest.ElasticRestIndexWriter - Processing remaining requests [docs = 1, length = 210402, total docs = 1]
      2018-03-07 15:41:55,361 INFO elasticrest.ElasticRestIndexWriter - Processing to finalize last execute
      2018-03-07 15:41:55,458 INFO elasticrest.ElasticRestIndexWriter - Previous took in ms 175, including wait 97
      2018-03-07 15:41:55,468 WARN mapred.LocalJobRunner - job_local1561152089_0001
      java.lang.Exception: java.lang.NullPointerException
      at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
      Caused by: java.lang.NullPointerException
      at org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
      at org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
      at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
      at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
      at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
      at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
      at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
      at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
      at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

      Description

      I was trying to write a parse filter plugin that parses internal links as separate documents. Basically, I split the page into multiple parse results, each with the ParseText and ParseData for one internal link. Parsing them separately worked, but an error occurred at scoring time during indexing.

      I am attaching the indexing logs.

       

       2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
      2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
      2018-03-07 15:41:52,327 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20180307130959
      2018-03-07 15:41:53,677 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
      2018-03-07 15:41:54,861 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
      2018-03-07 15:41:55,168 INFO client.AbstractJestClient - Setting server pool to a list of 1 servers: [http://localhost:9200]
      2018-03-07 15:41:55,170 INFO client.JestClientFactory - Using multi thread/connection supporting pooling connection manager
      2018-03-07 15:41:55,238 INFO client.JestClientFactory - Using default GSON instance
      2018-03-07 15:41:55,238 INFO client.JestClientFactory - Node Discovery disabled...
      2018-03-07 15:41:55,238 INFO client.JestClientFactory - Idle connection reaping disabled...
      2018-03-07 15:41:55,282 INFO elasticrest.ElasticRestIndexWriter - Processing remaining requests [docs = 1, length = 210402, total docs = 1]
      2018-03-07 15:41:55,361 INFO elasticrest.ElasticRestIndexWriter - Processing to finalize last execute
      2018-03-07 15:41:55,458 INFO elasticrest.ElasticRestIndexWriter - Previous took in ms 175, including wait 97
      2018-03-07 15:41:55,468 WARN mapred.LocalJobRunner - job_local1561152089_0001
      java.lang.Exception: java.lang.NullPointerException
      at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
      Caused by: java.lang.NullPointerException
      at org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
      at org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
      at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
      at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
      at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
      at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
      at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
      at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
      at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
      at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
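
      The trace above fails inside OPICScoringFilter.indexerScore, and the issue title attributes it to a document that has no CrawlDb datum. The following is only a minimal sketch of a null guard for that situation, not necessarily the change that resolved this issue. The Datum class, the method signature, and the choice to fall back to the unmodified initial score are all assumptions made for illustration; the real method takes Nutch types that are not reproduced here.

```java
public class NullSafeIndexerScore {

    /** Hypothetical stand-in for a CrawlDb entry; it only carries a score. */
    static final class Datum {
        private final float score;

        Datum(float score) {
            this.score = score;
        }

        float getScore() {
            return score;
        }
    }

    /**
     * Computes an index-time score from the stored score and an initial score.
     * Assumption: when no CrawlDb entry exists for the document (dbDatum is
     * null), the initial score is returned unchanged instead of dereferencing
     * the missing datum.
     */
    static float indexerScore(Datum dbDatum, float initScore, float scorePower) {
        if (dbDatum == null) {
            // Document was created by a parse filter and is unknown to the CrawlDb.
            return initScore;
        }
        return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
    }

    public static void main(String[] args) {
        // Document with a CrawlDb entry: 4.0^0.5 * 1.0
        System.out.println(indexerScore(new Datum(4.0f), 1.0f, 0.5f)); // prints 2.0
        // Document without a CrawlDb entry: falls back to the initial score
        System.out.println(indexerScore(null, 1.0f, 0.5f)); // prints 1.0
    }
}
```

      With a guard like this, sub-documents emitted by a parse filter would be indexed with their initial score rather than aborting the whole indexing job.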

              People

              • Assignee: Sebastian Nagel (snagel)
              • Reporter: Yash Thenuan (yash21)
              • Votes: 0
              • Watchers: 4
