Pig / PIG-129

need to create temp files in the task's working directory

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels: None

      Description

      Currently, Pig creates temp data, such as spilled bags, in the directory specified by java.io.tmpdir. The problem is that this directory is usually shared by all tasks and can easily run out of space.

      A better approach would be to create these files in the temp dir inside the task's working directory, as these locations usually have much more space and can also be hosted on different disks, so performance could be better.

      There are 2 parts to this fix:

      (1) Change org.apache.pig.data.DataBag to check whether the temp directory exists, and create it if not, before trying to create the temp file. This is somewhere around line 390 in the code.
      (2) Change mapred.child.java.opts in hadoop-site.xml to set the java.io.tmpdir property to ./tmp. For instance:
      <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024M -Djava.io.tmpdir="./tmp"</value>
      <description>arguments passed to child jvms</description>
      </property>
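Part (1) can be sketched roughly as follows. This is an illustrative reconstruction, not Pig's actual DataBag code; the class and method names are hypothetical. The double check on directory creation mirrors the race condition noted in the comments below.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch of part (1): ensure the temp directory exists
// before creating a spill file (names are illustrative, not Pig's code).
public class SpillFileFactory {
    public static File createSpillFile() throws IOException {
        File tmpDir = new File(System.getProperty("java.io.tmpdir"));
        // Check whether the temp directory exists and create it if not.
        // The second exists() check guards against a race where another
        // task created the directory between exists() and mkdirs().
        if (!tmpDir.exists() && !tmpDir.mkdirs() && !tmpDir.exists()) {
            throw new IOException("Could not create temp dir: " + tmpDir);
        }
        return File.createTempFile("pigbag", null, tmpDir);
    }
}
```

With part (2) in place, java.io.tmpdir resolves to ./tmp under the task's working directory, so the directory may not exist yet when the first spill happens, which is why the existence check is needed.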

      1. TempAllocator0.patch
        13 kB
        Pi Song
      2. PIG-129.patch
        3 kB
        Amir Youssefi

        Activity

        Olga Natkovich added a comment -

        I committed the patch. Thanks Amir for contributing.

        Amir Youssefi added a comment -

        Moved Pi Song's additional improvement requests to PIG-138.

        Amir Youssefi added a comment -

        Here is a summary of the decisions made with Olga:

        Goal of this JIRA is to have a means to create a temporary directory under Hadoop Task Dir. I will open a new JIRA so others (Pi Song) can continue work on local mode and multiple directories.

        • We address the case in which Hadoop Platform is used.
        • We rely on Hadoop to clean up the directory.
        • We tested this on a cluster and observed logs showing creation of directory and actual directory/file being generated.
        • The added code block is actually called from a synchronized block of code. The second check on directory creation is due to a case observed on a cluster.
        Amir Youssefi added a comment -

        Patch to create temporary directory.

        Pi Song added a comment -

        The mentioned sample code

        Pi Song added a comment -

        Olga,

        I agree with you. But don't forget we're in an open source project. What you can also do for a low-priority task is give a good direction and leave it until someone does it.

        Go ahead on Hadoop side and please don't forget to keep implementation generic.

        Here I will leave a simple implementation that gives an idea of what I expect, as a guideline. If I have time on the weekend, I may come back to complete it.

        Olga Natkovich added a comment -

        Hi Pi,

        On the Hadoop side, I am hoping that by using the task's directory we can get the multi-disk distribution and cleaning for free.

        For local Pig, I think the idea is good, but I question the priority of this feature. The use case is fairly limited. Local Pig is mostly for coming up to speed on the system, not for running large-scale processes.

        Amir Youssefi added a comment -

        I think it's a good idea to have multiple tmp dirs. Having several physical drives is common these days. I brought up the same idea earlier this week as next logical step.

        A new feature in Hadoop 0.16.1 will partially address the tmp dir issue. But it takes a while for it to go through the pipeline and reach users. Currently the tmp directory is a hot issue for us, so we plan to address it in Pig.

        I will probably do this in two stages.

        1) A ./tmp directory under the working directory. This automatically gets cleaned.
        2) An open discussion on the details of using multiple tmp directories (possibly over multiple physical drives). We need to take cleaning scenarios into account as well.

        -Amir

        Pi Song added a comment -

        Olga,
        I want to clarify a bit more about what I think, and I really need your opinion on this bit. Regarding temp file creation due to DataBag spill, this can happen in 2 places:

        • In Hadoop Map Reduce execution engine
        • In Local execution engine

        I agree with you that the working dir mechanism in Hadoop is already good and you're trying to adopt it, BUT what about the local execution engine?

        I think even though most people pay more attention to the Hadoop backend, and that's where Pig started, the local engine still has its uses.

        A sample use case: I have a big data file on my hard disk (thus it cannot be too big), and I just download Pig and quickly write a Pig script to perform the processing on my local machine using the local execution engine (without running Hadoop).

        A good local engine implementation will help improve usability of Pig!!!

        Can we handle this issue in 2 different ways, one for the Hadoop backend and one for the local engine? I'm willing to implement what I've proposed in the last comment for the local engine.

        Pi Song added a comment -

        I think the concept of a multi-dir temp file creator (LocalDirAllocator in Hadoop) should be adopted in Pig. What it does is:

        • You can set up a set of tmp file dirs in configuration (they can be on different physical drives, so you can utilize more disk space)
        • When a temp file is being created, the system probes the given temp dirs in round-robin fashion
        • For a selected temp dir, if it exists and you have permission to write, the temp file is created
        • For a selected temp dir, if it doesn't exist or you don't have permission to write, the temp dir is added to a blacklist and not used later on
        • For the next temp file, move on to the next temp dir
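The round-robin scheme above can be sketched as follows. This is an illustrative reconstruction, not Hadoop's actual LocalDirAllocator; the class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the multi-dir temp file scheme described above:
// probe configured dirs round-robin, blacklist unusable ones.
public class RoundRobinTempAllocator {
    private final List<File> dirs;
    private final Set<File> blacklist = new HashSet<>();
    private int next = 0;

    public RoundRobinTempAllocator(List<File> candidateDirs) {
        this.dirs = new ArrayList<>(candidateDirs);
    }

    public File createTempFile(String prefix) throws IOException {
        for (int i = 0; i < dirs.size(); i++) {
            File dir = dirs.get(next);
            next = (next + 1) % dirs.size();   // move on for the next file
            if (blacklist.contains(dir)) {
                continue;
            }
            if (dir.isDirectory() && dir.canWrite()) {
                return File.createTempFile(prefix, null, dir);
            }
            // Missing or unwritable: blacklist it so it is not probed again.
            blacklist.add(dir);
        }
        throw new IOException("No writable temp directory available");
    }
}
```

Spreading successive temp files across directories on different physical drives gives both more total space and better I/O parallelism, which is the motivation stated in the comment above.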

          People

          • Assignee:
            Amir Youssefi
          • Reporter:
            Olga Natkovich
          • Votes:
            0
          • Watchers:
            0
