  Nutch / NUTCH-2635

Generator writes unneeded temporary output

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.15
    • Fix Version/s: 1.16
    • Component/s: generator
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The Generator writes the temporary output of the Selector job/step twice (see line 516). This is not a big issue when generating small fetch lists, but it may be when working on large data. The temporary output looks like:

      % tree -h generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/
      generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/
      |-- [4.0K]  fetchlist-1
      |   `-- [ 25M]  part-r-00000
      `-- [ 77M]  part-r-00000
      
      1 directory, 2 files
      
      % file generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/part-r-00000 
      generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/part-r-00000: ASCII text
      
      % file generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/fetchlist-1/part-r-00000 
      generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/fetchlist-1/part-r-00000: Apache Hadoop Sequence file version 6
      

      The unneeded output is plain text, which explains its larger size compared to the Hadoop Sequence file.
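
      The `file` checks above can be reproduced without the `file` utility. The following is a minimal sketch, not part of Nutch: it assumes the temporary directory is available on the local filesystem and relies only on the fact that a Hadoop Sequence file starts with the bytes `SEQ` followed by a one-byte version number. The names `classify` and `scan` are hypothetical helpers introduced for illustration.

```python
import os
import tempfile

SEQ_MAGIC = b"SEQ"  # first three bytes of a Hadoop Sequence file


def classify(path):
    """Return 'sequence-file vN' if the 4-byte header matches, else 'other'."""
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) == 4 and header[:3] == SEQ_MAGIC:
        # The fourth byte is the format version (6 in the listing above).
        return f"sequence-file v{header[3]}"
    return "other"


def scan(temp_dir):
    """Map each file's path (relative to temp_dir) to (classification, size in bytes)."""
    report = {}
    for root, _dirs, files in os.walk(temp_dir):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, temp_dir)
            report[rel] = (classify(full), os.path.getsize(full))
    return report
```

      Calling `scan()` on a `generate-temp-...` directory would list every output file with its type and size, so a top-level `part-r-00000` classified as `other` next to a `fetchlist-1/part-r-00000` Sequence file points to the duplicated output described here.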

              People

              • Assignee: Unassigned
              • Reporter: Sebastian Nagel (snagel)