  1. Hadoop Map/Reduce
  2. MAPREDUCE-6854

Each map task should create a unique temporary name that includes an object name

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0-alpha2
    • Fix Version/s: None
    • Component/s: distcp

      Description

      Consider an example: a local file "/data/a.txt" needs to be copied to swift://container.mil01/data/a.txt

      distcp first uploads "/data/a.txt" to swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_000000_0

      Upon completion, distcp renames swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_000000_0 to swift://container.mil01/data/a.txt
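
The upload-then-rename flow above can be sketched as follows. This is a minimal illustration that uses the local filesystem through java.nio.file as a stand-in for the Swift object store; the class and method names (TempThenRename, copyViaTemp) are hypothetical and this is not distcp's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: local files stand in for objects in the store.
final class TempThenRename {
    private TempThenRename() {}

    /**
     * Copies source to a temporary sibling of target named
     * ".distcp.tmp.<attemptId>", then renames it to the final target.
     */
    static Path copyViaTemp(Path source, Path target, String attemptId) throws IOException {
        // Step 1: write the data under the temporary name next to the target.
        Path tmp = target.resolveSibling(".distcp.tmp." + attemptId);
        Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);

        // Step 2: rename the temporary object to its final name.
        return Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Note that the temporary name here depends only on the attempt ID, so it carries no hint of which target object it belongs to.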
      ************************************
      The temporary file naming convention assumes that each map task sequentially creates objects named swift://container.mil01/.distcp.tmp.attempt_ID
      and then renames them to their final names. Most Hadoop ecosystem components include the object name in the temporary name, but distcp does not.

      This JIRA proposes adding a configuration key that makes distcp include the object name in the temporary file name.

      For example,
      "/data/a.txt" will be uploaded to
      "swift://container.mil01/data/a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0"

      The name "a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0" does not affect flows in the access drivers: "a.txt" is not treated as a sub-directory, so no special operations are performed.

      The benefits of the patch:
      1. Temp object names will be better distributed in object stores, since they all have different prefixes.
      2. Debugging becomes easier. Today it is sometimes impossible to tell which data was copied and which failed. When temp files are left unrenamed, inspecting the temp name reveals which objects were being copied.
      3. Other systems can take a name such as "a.txt.distcp.tmp.attempt_local2036034928_0001_m_000000_0" and extract the value before ".distcp.tmp" to obtain the destination object name.
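
The proposed naming and the name recovery mentioned in benefit 3 might look roughly like the sketch below. All names here (DistCpTempNames, tempPath, destinationName, and the includeObjectName flag standing in for the proposed configuration key) are assumptions made for illustration; the attached patch may differ.

```java
import java.util.Optional;

// Hypothetical helper: names are illustrative, not distcp's API.
final class DistCpTempNames {
    private static final String MARKER = ".distcp.tmp.";

    private DistCpTempNames() {}

    /**
     * Builds the temporary path for a target such as ".../data/a.txt".
     * includeObjectName stands in for the proposed configuration key.
     */
    static String tempPath(String targetPath, String attemptId, boolean includeObjectName) {
        int slash = targetPath.lastIndexOf('/');
        String parent = targetPath.substring(0, slash + 1); // up to and including the last '/'
        String name = targetPath.substring(slash + 1);      // e.g. "a.txt"
        return parent + (includeObjectName ? name : "") + MARKER + attemptId;
    }

    /**
     * Recovers the destination object name from a temporary name or path.
     * Returns empty when the marker is missing or no object name precedes it.
     */
    static Optional<String> destinationName(String tempName) {
        String name = tempName.substring(tempName.lastIndexOf('/') + 1);
        int idx = name.lastIndexOf(MARKER);
        if (idx <= 0) {
            return Optional.empty();
        }
        return Optional.of(name.substring(0, idx));
    }
}
```

With the flag off, tempPath reproduces the current ".distcp.tmp.attempt_ID" form, for which destinationName returns empty; with the flag on, the object name can be read back from the temporary name alone.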

        Attachments

        1. HADOOP-6854-002.patch
          5 kB
          Gil Vernik
        2. HADOOP-6854-001.patch
          5 kB
          Gil Vernik

            People

            • Assignee:
              Unassigned
              Reporter:
              gvernik Gil Vernik