Flume / FLUME-2458

Separate hdfs tmp directory for flume hdfs sink

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0.1
    • Fix Version/s: None
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      The current HDFS sink writes temporary files to the same directory where the final file will be stored. This is a problem for several reasons:

      1) File moving
      When MapReduce fetches the list of files to process and some of those files are gone by the time it processes them (i.e. they were renamed from .tmp to whatever final name they are supposed to have), the MapReduce job will crash.

      2) File type
      When MapReduce decides how to process a file, it looks at the file's extension. If the file is compressed, MapReduce decompresses it for you. If the file has a .tmp extension (in the same folder), MapReduce treats a compressed file as an uncompressed file, thus breaking the results of the MapReduce job.
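The file type problem can be illustrated with a minimal sketch. It assumes the codec is chosen from the last extension of the file name only. The suffix table and the file names are made up for illustration, and this is a simplified stand-in, not Hadoop's actual lookup code.

```python
import os

# Illustrative suffix-to-codec table (an assumption for this sketch,
# not Hadoop's real codec registry).
CODEC_BY_SUFFIX = {".gz": "gzip", ".bz2": "bzip2", ".snappy": "snappy"}


def pick_codec(filename):
    """Return the codec name implied by the last extension, or None for plain text."""
    _, ext = os.path.splitext(filename)
    return CODEC_BY_SUFFIX.get(ext)


# A finished file is recognised as gzip; the same data under an in-progress
# name ending in .tmp is not, so it would be read as uncompressed.
finished = pick_codec("events.1408000000.gz")
in_progress = pick_codec("events.1408000000.gz.tmp")
```

Here `finished` is `"gzip"` while `in_progress` is `None`, which is the mismatch described in point 2.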

      I propose that the sink get an optional tmp path for storing these temporary files, to avoid both issues.
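A rough sketch of the proposed behaviour, using the local filesystem as a stand-in for HDFS. The function name, directory names, and file name are hypothetical and chosen only for illustration. This is not the attached patch and not a Flume API. It assumes both directories live on the same filesystem, so that moving the finished file is a single rename.

```python
import os
import tempfile


def write_via_staging(tmp_dir, final_dir, name, data):
    """Write data under tmp_dir, then move the finished file into final_dir."""
    os.makedirs(tmp_dir, exist_ok=True)
    os.makedirs(final_dir, exist_ok=True)
    staged = os.path.join(tmp_dir, name + ".tmp")
    with open(staged, "wb") as f:
        f.write(data)
    final = os.path.join(final_dir, name)
    # The file only ever appears in final_dir under its final name.
    os.replace(staged, final)
    return final


def demo():
    """Run the sketch in a scratch directory and return both directory listings."""
    root = tempfile.mkdtemp()
    tmp_dir = os.path.join(root, "flume-tmp")   # hypothetical staging directory
    final_dir = os.path.join(root, "events")    # directory a job would scan
    write_via_staging(tmp_dir, final_dir, "events.1.gz", b"payload")
    return sorted(os.listdir(final_dir)), sorted(os.listdir(tmp_dir))
```

With this layout a job that lists the final directory never sees a .tmp name, so neither the vanishing-file case nor the wrong-extension case arises there.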

        Attachments

        1. patch-2458.txt
          17 kB
          Neerja Khattar
        2. FLUME-2458.patch
          17 kB
          Neerja Khattar

            People

            • Assignee: Neerja Khattar (neerjakhattar)
            • Reporter: Sverre Bakke (sbakke)
            • Votes: 3
            • Watchers: 10
