NUTCH-593: Nutch crawl problem


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None
    • Environment: java version: jdk-6u1-linux-amd64.bin, hadoop version: hadoop-0.12.0

      Description

      I am using nutch-0.9 with hadoop-0.12.2. When I run the command
      "bin/nutch crawl urls -dir crawled -depth 3", I get this output:

      crawl started in: crawled
      rootUrlDir = input
      threads = 10
      depth = 3
      Injector: starting
      Injector: crawlDb: crawled/crawldb
      Injector: urlDir: input
      Injector: Converting injected urls to crawl db entries.
      Total input paths to process : 1
      Running job: job_0001
      map 0% reduce 0%
      map 100% reduce 0%
      map 100% reduce 100%
      Job complete: job_0001
      Counters: 6
      Map-Reduce Framework
      Map input records=3
      Map output records=1
      Map input bytes=22
      Map output bytes=52
      Reduce input records=1
      Reduce output records=1
      Injector: Merging injected urls into crawl db.
      Total input paths to process : 2
      Running job: job_0002
      map 0% reduce 0%
      map 100% reduce 0%
      map 100% reduce 58%
      map 100% reduce 100%
      Job complete: job_0002
      Counters: 6
      Map-Reduce Framework
      Map input records=3
      Map output records=1
      Map input bytes=60
      Map output bytes=52
      Reduce input records=1
      Reduce output records=1
      Injector: done
      Generator: Selecting best-scoring urls due for fetch.
      Generator: starting
      Generator: segment: crawled/segments/25501213164325
      Generator: filtering: false
      Generator: topN: 2147483647
      Total input paths to process : 2
      Running job: job_0003
      map 0% reduce 0%
      map 100% reduce 0%
      map 100% reduce 100%
      Job complete: job_0003
      Counters: 6
      Map-Reduce Framework
      Map input records=3
      Map output records=1
      Map input bytes=59
      Map output bytes=77
      Reduce input records=1
      Reduce output records=1
      Generator: 0 records selected for fetching, exiting ...
      Stopping at depth=0 - no more URLs to fetch.
      No URLs to fetch - check your seed list and URL filters.
      crawl finished: crawled
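
      This first run did not crash: the Generator selected 0 records, so the
      crawl stopped by design. (Note that the log shows rootUrlDir = input even
      though the quoted command passed urls, so the output may come from a
      slightly different invocation; also, the Injector did write one record,
      so the crawldb is not empty.) As the last log line suggests, the usual
      things to verify in Nutch 0.9 are the seed list and the crawl URL filter.
      A sketch of that check, assuming the stock 0.9 configuration; the seed
      URL and the domain below are examples, not taken from this report:

          urls/seed.txt            - one absolute URL per line, e.g.
              http://lucene.apache.org/nutch/

          conf/crawl-urlfilter.txt - applied by the one-step crawl command;
              its stock rule
                  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
              accepts nothing until MY.DOMAIN.NAME is replaced with the
              target domain, e.g.
                  +^http://([a-z0-9]*\.)*apache.org/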

      But sometimes, when I crawl certain URLs, it fails at indexing time:
      Indexer: done
      Dedup: starting
      Dedup: adding indexes in: crawled/indexes
      Total input paths to process : 2
      Running job: job_0025
      map 0% reduce 0%
      Task Id : task_0025_m_000001_0, Status : FAILED
      task_0025_m_000001_0: - Error running child
      task_0025_m_000001_0: java.lang.ArrayIndexOutOfBoundsException: -1
      task_0025_m_000001_0: at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
      task_0025_m_000001_0: at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
      task_0025_m_000001_0: at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
      task_0025_m_000001_0: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
      task_0025_m_000001_0: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
      task_0025_m_000001_0: at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
      Task Id : task_0025_m_000000_0, Status : FAILED
      Task Id : task_0025_m_000000_1, Status : FAILED
      Task Id : task_0025_m_000001_1, Status : FAILED
      Task Id : task_0025_m_000001_2, Status : FAILED
      Task Id : task_0025_m_000000_2, Status : FAILED
      map 100% reduce 100%
      Task Id : task_0025_m_000001_3, Status : FAILED
      Task Id : task_0025_m_000000_3, Status : FAILED
      [each failed attempt logged the same ArrayIndexOutOfBoundsException
      stack trace as task_0025_m_000001_0 above]
      Exception in thread "main" java.io.IOException: Job failed!
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
      at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
      at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

      How do I solve it?
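
      The stack traces show org.apache.lucene.index.MultiReader.isDeleted
      being handed document id -1 from DeleteDuplicates' record reader. One
      plausible reading of the trace is that -1 is what maxDoc() - 1 yields
      when an index part under crawled/indexes contains no documents; per the
      Resolution and Fix Version fields above, the bug was fixed for the
      1.0.0 release. The following is only a minimal, self-contained sketch
      of that failure mode and of the kind of bounds check that prevents it;
      plain arrays stand in for the Lucene index, and every name in it is
      illustrative, not Nutch's actual patch:

          // Sketch of why reading "the last document" of an empty index part
          // throws ArrayIndexOutOfBoundsException: -1, and the guard that avoids it.
          public class DedupBoundsSketch {

              /** Stand-in for one Lucene index part; maxDoc() counts its documents. */
              static final class IndexPart {
                  private final boolean[] deleted;
                  IndexPart(int maxDoc) { deleted = new boolean[maxDoc]; }
                  int maxDoc() { return deleted.length; }
                  boolean isDeleted(int doc) { return deleted[doc]; } // throws for doc = -1
              }

              /** Unchecked pattern: maxDoc() - 1 is -1 when the part is empty. */
              static void readLastUnchecked(IndexPart part) {
                  part.isDeleted(part.maxDoc() - 1); // AIOOBE: -1 on an empty part
              }

              /** Defensive pattern: skip parts that contain no documents. */
              static void readAllChecked(IndexPart part) {
                  if (part.maxDoc() == 0) return;    // nothing to deduplicate here
                  for (int doc = 0; doc < part.maxDoc(); doc++) {
                      part.isDeleted(doc);           // doc is always a valid id
                  }
              }

              public static void main(String[] args) {
                  IndexPart empty = new IndexPart(0);
                  readAllChecked(empty);             // completes without error
                  readLastUnchecked(empty);          // reproduces the -1 failure
              }
          }

      The sketch only demonstrates the arithmetic behind the trace; the
      shipped fix is the 1.0.0 release noted in the Fix Version field.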

    People

    • Assignee: Unassigned
    • Reporter: jubjib sudarat
    • Votes: 0
    • Watchers: 0
