Nutch / NUTCH-2143

GeneratorJob ignores batch id passed as argument

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.3.1
    • Component/s: generator
    • Labels: None

    Description

      The batch id passed to GeneratorJob via the -batchId <id> option is ignored; a newly generated batch id is used to mark the current batch instead. Log snippets from a run of bin/crawl:

      bin/nutch generate ... -batchId 1444941073-14208
      ...
      GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
      
      Fetching : 
      bin/nutch fetch ... 1444941073-14208 ...
      ...
      QueueFeeder finished: total 0 records. Hit by time limit :0
      

      The generated URLs are marked with the wrong batch id:

      hbase(main):010:0> scan 'test_webpage'
      ROW                            COLUMN+CELL
       org.apache.nutch:http/        column=f:bid, timestamp=1444941077080, value=1444941074-858443668
       ...
       org.apache.nutch:http/        column=mk:_gnmrk_, timestamp=1444941077080, value=1444941074-858443668
      

      As a result, the fetcher does not fetch anything. This problem was reported by Sherban Drulea [1], [2].
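
The behaviour the report expects can be sketched roughly as follows. This is only an illustration, not the code from the attached patches: the class `BatchIdSketch` and the method `resolveBatchId` are hypothetical names, and the generated id format (epoch seconds, a dash, a non-negative random integer) is an assumption read off the log output above.

```java
import java.util.Random;

/** Hypothetical sketch; not taken from the Nutch sources or the attached patches. */
final class BatchIdSketch {

    /**
     * Returns the batch id requested on the command line if one was given.
     * Only when none was given is a fresh id built, in the assumed form
     * "<epoch seconds>-<non-negative random int>".
     */
    static String resolveBatchId(String requested, long nowMillis, Random random) {
        if (requested != null && !requested.trim().isEmpty()) {
            // Keep the caller's id so that generate and fetch agree on the batch.
            return requested.trim();
        }
        // Clear the sign bit so the random suffix is never negative.
        int suffix = random.nextInt() & Integer.MAX_VALUE;
        return (nowMillis / 1000) + "-" + suffix;
    }

    public static void main(String[] args) {
        Random random = new Random();
        long now = System.currentTimeMillis();
        // An id passed in is returned unchanged.
        System.out.println(resolveBatchId("1444941073-14208", now, random));
        // No id passed in: a new one is generated.
        System.out.println(resolveBatchId(null, now, random));
    }
}
```

With this ordering, the id written to f:bid and mk:_gnmrk_ would be the one later handed to bin/nutch fetch, so the fetcher would find the generated rows.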

      Attachments

        1. NUTCH-2143-v2.patch
          2 kB
          Sebastian Nagel
        2. NUTCH-2143-v3.patch
          3 kB
          Sebastian Nagel
        3. patch
          0.8 kB
          liuqibj

          People

            Assignee: Lewis John McGibbney (lewismc)
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 4
