Nutch / NUTCH-2143

GeneratorJob ignores batch id passed as argument

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.3.1
    • Component/s: generator
    • Labels:
      None

      Description

      The batch id passed to GeneratorJob via the -batchId <id> argument is ignored; a newly generated batch id is used to mark the current batch instead. Log snippets from a run of bin/crawl:

      bin/nutch generate ... -batchId 1444941073-14208
      ...
      GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
      
      Fetching : 
      bin/nutch fetch ... 1444941073-14208 ...
      ...
      QueueFeeder finished: total 0 records. Hit by time limit :0
      

      The generated URLs are marked with the wrong batch id:

      hbase(main):010:0> scan 'test_webpage'
      ROW                            COLUMN+CELL
       org.apache.nutch:http/        column=f:bid, timestamp=1444941077080, value=1444941074-858443668
       ...
       org.apache.nutch:http/        column=mk:_gnmrk_, timestamp=1444941077080, value=1444941074-858443668
      

      and the fetcher will not fetch anything. This problem was reported by Sherban Drulea [1], [2].
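
      The intended behaviour can be illustrated with a minimal sketch. This is not the Nutch GeneratorJob code and not the attached patch; the class BatchIdSketch and its methods are hypothetical names, and the generated id format (epoch seconds, a dash, a random non-negative integer) is only an assumption modelled on the ids visible in the log above. The point is simply that a batch id given via -batchId should take precedence, and a generated one should be used only as a fallback.

```java
import java.util.Random;

// Hypothetical illustration only: not part of Nutch.
final class BatchIdSketch {

    /** Returns the value following "-batchId" in args, or null if the option is absent. */
    static String parseBatchId(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-batchId".equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    /** Builds a fallback id; the "<seconds>-<random>" format is an assumption based on the log output. */
    static String generateBatchId(long epochSeconds, Random random) {
        return epochSeconds + "-" + random.nextInt(Integer.MAX_VALUE);
    }

    /** Prefers the batch id passed on the command line and generates one only if none was given. */
    static String resolveBatchId(String[] args, long epochSeconds, Random random) {
        String passed = parseBatchId(args);
        return passed != null ? passed : generateBatchId(epochSeconds, random);
    }

    public static void main(String[] args) {
        String[] withId = {"-topN", "50", "-batchId", "1444941073-14208"};
        long now = System.currentTimeMillis() / 1000;
        // The passed id wins, so this prints 1444941073-14208.
        System.out.println(resolveBatchId(withId, now, new Random()));
        // No -batchId given, so a generated id is printed.
        System.out.println(resolveBatchId(new String[] {"-topN", "50"}, now, new Random()));
    }
}
```

      With this precedence, the id passed to the generate step and the id later handed to the fetch step are the same, so the fetcher would find the marked rows.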

        Attachments

        1. NUTCH-2143-v2.patch
          2 kB
          Sebastian Nagel
        2. NUTCH-2143-v3.patch
          3 kB
          Sebastian Nagel
        3. patch
          0.8 kB
          liuqibj

            People

            • Assignee:
              lewismc Lewis John McGibbney
              Reporter:
              snagel Sebastian Nagel
            • Votes:
              0
              Watchers:
              4

              Dates

              • Created:
                Updated:
                Resolved: