Pig
  1. Pig
  2. PIG-2591

Unit tests should not write to /tmp but respect java.io.tmpdir

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.14.0
    • Component/s: tools
    • Labels:
      None

      Description

      Several tests use /tmp but should derive temporary file location from java.io.tmpdir to avoid side effects (java.io.tmpdir is already set to a test run specific location in build.xml)

      1. bugPIG-2591.patch
        5 kB
        Jarek Jarcec Cecho
      2. PIG-2495.patch
        49 kB
        Jarek Jarcec Cecho

        Issue Links

          Activity

          Hide
          Olga Natkovich added a comment -

          Moving to 12 since no work has been done and the ticket is unassigned

          Show
          Olga Natkovich added a comment - Moving to 12 since no work has been done and the ticket is unassigned
          Hide
          Jarek Jarcec Cecho added a comment -

          I've started working on this JIRA. So far I've fixed two tests. It's kind of slow procedure. It seems that I need to fix additional 32 files, so it will take some time. I'm attaching my current patch so that anyone else can take a look.

          Jarcec

          Show
          Jarek Jarcec Cecho added a comment - I've started working on this JIRA. So far I've fixed two tests. It's kind of slow procedure. It seems that I need to fix additional 32 files, so it will take some time. I'm attaching my current patch so that anyone else can take a look. Jarcec
          Hide
          Cheolsoo Park added a comment -

          Hi Jarcec,

          Can you look at PIG-2995? What's suggested there is to use FileLocalizer.getTemporaryPath() to get temporary paths. If we use FileLocalizer.getTemporaryPath(), the "pig.temp.dir" can be used to control the root directory as well.

          I like PIG-2995 more. Do you agree?

          Thanks!

          Show
          Cheolsoo Park added a comment - Hi Jarcec, Can you look at PIG-2995 ? What's suggested there is to use FileLocalizer.getTemporaryPath() to get temporary paths. If we use FileLocalizer.getTemporaryPath(), the "pig.temp.dir" can be used to control the root directory as well. I like PIG-2995 more. Do you agree? Thanks!
          Hide
          Jarek Jarcec Cecho added a comment -

          Yup agreed, I'll take a look.

          Show
          Jarek Jarcec Cecho added a comment - Yup agreed, I'll take a look.
          Hide
          Prashant Kommireddi added a comment -

          Hi Jarek, just looked at your patch. It would probably good to have a common method to access the temp directory. The following code in FileLocalizer.java sets default temp dir to "/tmp". You can probably modify this class to make it that shared util

              private static synchronized ContainerDescriptor relativeRoot(final PigContext pigContext)
                      throws DataStorageException {
          
                  if (relativeRoot.get() == null) {
                      String tdir= pigContext.getProperties().getProperty("pig.temp.dir", "/tmp");
                      relativeRoot.set(pigContext.getDfs().asContainer(tdir + "/temp" + r.nextInt()));
                      toDelete().push(relativeRoot.get());
                  }
          
                  return relativeRoot.get();
              }
          

          Thanks for working on this.

          Show
          Prashant Kommireddi added a comment - Hi Jarek, just looked at your patch. It would probably good to have a common method to access the temp directory. The following code in FileLocalizer.java sets default temp dir to "/tmp". You can probably modify this class to make it that shared util private static synchronized ContainerDescriptor relativeRoot( final PigContext pigContext) throws DataStorageException { if (relativeRoot.get() == null ) { String tdir= pigContext.getProperties().getProperty( "pig.temp.dir" , "/tmp" ); relativeRoot.set(pigContext.getDfs().asContainer(tdir + "/temp" + r.nextInt())); toDelete().push(relativeRoot.get()); } return relativeRoot.get(); } Thanks for working on this.
          Hide
          Jarek Jarcec Cecho added a comment -

          I've moved the logic of getting tmp directory into new util class TmpUtil, so that It can be used both in FileLocator class and in the tests.

          I've changed only one single test case TestBatchAliases for the time being. I'll start working on other test cases as soon as this will get accepted.

          Show
          Jarek Jarcec Cecho added a comment - I've moved the logic of getting tmp directory into new util class TmpUtil , so that It can be used both in FileLocator class and in the tests. I've changed only one single test case TestBatchAliases for the time being. I'll start working on other test cases as soon as this will get accepted.
          Hide
          Cheolsoo Park added a comment -

          Jarek Jarcec Cecho, sorry for the late review. I like what you propose.

          Show
          Cheolsoo Park added a comment - Jarek Jarcec Cecho , sorry for the late review. I like what you propose.
          Hide
          Jarek Jarcec Cecho added a comment -

          Cheolsoo Park, thank you very much for your review, I appreciate your time. There is a lot of tests that needs to be fixed and they seems to be changing, so I would prefer to fix all of them incrementally rather than all at once. Do you think that it would be feasible to get this one committed and create subtask for each additional chunk of fixed test cases?

          Show
          Jarek Jarcec Cecho added a comment - Cheolsoo Park , thank you very much for your review, I appreciate your time. There is a lot of tests that needs to be fixed and they seems to be changing, so I would prefer to fix all of them incrementally rather than all at once. Do you think that it would be feasible to get this one committed and create subtask for each additional chunk of fixed test cases?
          Hide
          Prashant Kommireddi added a comment -

          Hey Jarek Jarcec Cecho Cheolsoo Park correct me if I am wrong but the patch does not pick up temporary directory from system properties. I think the JIRA was intended to pick the temp dir from system properties? "pig.temp.dir" is used for specifying the temporary directory (local, hdfs) for storing intermediate data between MR jobs. Does it make more sense to pick up base path for tests from system props instead?

          I like the approach otherwise, it's a rather large effort to change all tests in the same patch to use this.

          Show
          Prashant Kommireddi added a comment - Hey Jarek Jarcec Cecho Cheolsoo Park correct me if I am wrong but the patch does not pick up temporary directory from system properties. I think the JIRA was intended to pick the temp dir from system properties? "pig.temp.dir" is used for specifying the temporary directory (local, hdfs) for storing intermediate data between MR jobs. Does it make more sense to pick up base path for tests from system props instead? I like the approach otherwise, it's a rather large effort to change all tests in the same patch to use this.
          Hide
          Cheolsoo Park added a comment -

          Prashant Kommireddi, you're right about the original intention of this jira and pig.temp.dir.

          I think that there are 2 problems here:

          1. Some unit tests hard-coded /tmp.
          2. pig.temp.dir is not set, so FileLocalizer.getTemporaryPath() creates files in /tmp.

          This is what I suggest:

          1. We make every unit test use FileLocalizer.getTemporaryPath().
          2. We set pig.temp.dir to java.io.tmpdir in build.xml (or ./build so ant clean can delete them).

          Then, we meet all the requirements of PIG-2995 and PIG-2591. Do you agree?

          Show
          Cheolsoo Park added a comment - Prashant Kommireddi , you're right about the original intention of this jira and pig.temp.dir. I think that there are 2 problems here: Some unit tests hard-coded /tmp. pig.temp.dir is not set, so FileLocalizer.getTemporaryPath() creates files in /tmp. This is what I suggest: We make every unit test use FileLocalizer.getTemporaryPath(). We set pig.temp.dir to java.io.tmpdir in build.xml (or ./build so ant clean can delete them). Then, we meet all the requirements of PIG-2995 and PIG-2591 . Do you agree?
          Hide
          Jarek Jarcec Cecho added a comment -

          Hi Cheolsoo Park and Prashant Kommireddi,
          thank you very much for your feedback. I agree with proposed steps and I'll be more than happy to execute them myself. I just have small comment, direct usage of FileLocalizer is not simple as it requires a lot of initialization (for example that class needs valid PigContext). Thus I would propose to move the actual logic of getting temporary directory into standalone easily reusable class TmpUtil (which is part of bugPIG-2591.patch).

          Jarcec

          Show
          Jarek Jarcec Cecho added a comment - Hi Cheolsoo Park and Prashant Kommireddi , thank you very much for your feedback. I agree with proposed steps and I'll be more than happy to execute them myself. I just have small comment, direct usage of FileLocalizer is not simple as it requires a lot of initialization (for example that class needs valid PigContext ). Thus I would propose to move the actual logic of getting temporary directory into standalone easily reusable class TmpUtil (which is part of bugPIG-2591.patch ). Jarcec
          Hide
          Cheolsoo Park added a comment -

          I agree with Jarcec. Sorry if my comment wasn't clear enough.

          Show
          Cheolsoo Park added a comment - I agree with Jarcec. Sorry if my comment wasn't clear enough.
          Hide
          Prashant Kommireddi added a comment -

          Thanks guys. Here are a couple cases I am thinking about:

          1. On Windows machine - java.io.tempdir could default to something like, say "C:\DOCUME~1\username\LOCALS~1\Temp\". This is fine if we are running unit tests on a local machine but could have unexpected results when pig attempts to create that directory on HDFS for intermediate files?

          2. Clarity - would it be better to separate these properties as "pig.test.temp.dir" and "pig.temp.dir"? It might be good (and expected) if we default pig.temp.dir to "/tmp" as that's been the case always.

          Jarcec, I agree on what you propose too. It makes sense to separate the helper methods into TmpUtil. The patch does not currently use java.io.tmpdir, how do you see that being used?

          Show
          Prashant Kommireddi added a comment - Thanks guys. Here are a couple cases I am thinking about: 1. On Windows machine - java.io.tempdir could default to something like, say "C:\DOCUME~1\username\LOCALS~1\Temp\". This is fine if we are running unit tests on a local machine but could have unexpected results when pig attempts to create that directory on HDFS for intermediate files? 2. Clarity - would it be better to separate these properties as "pig.test.temp.dir" and "pig.temp.dir"? It might be good (and expected) if we default pig.temp.dir to "/tmp" as that's been the case always. Jarcec, I agree on what you propose too. It makes sense to separate the helper methods into TmpUtil. The patch does not currently use java.io.tmpdir, how do you see that being used?
          Hide
          Cheolsoo Park added a comment -

          Prashant Kommireddi,

          1. If we set "pig.temp.dir" to "./build", it would work for both Linux and Windows.
          2. I am not proposing to change the default value of "pig.temp.dir". I am proposing to set it for unit tests only in build.xml. I don't think we need to introduce an additional property.

          Do you agree?

          Show
          Cheolsoo Park added a comment - Prashant Kommireddi , If we set "pig.temp.dir" to "./build", it would work for both Linux and Windows. I am not proposing to change the default value of "pig.temp.dir". I am proposing to set it for unit tests only in build.xml. I don't think we need to introduce an additional property. Do you agree?
          Hide
          Prashant Kommireddi added a comment -

          1. Agreed (assuming we no longer are considering java.io.tmpdir and instead using "./build"?)
          2. I am a bit unclear on the reasons behind reusing "pig.temp.dir". It's a bit confusing to me that pig.temp.dir is set differently for tests vs intermediate store at different places.

          Show
          Prashant Kommireddi added a comment - 1. Agreed (assuming we no longer are considering java.io.tmpdir and instead using "./build"?) 2. I am a bit unclear on the reasons behind reusing "pig.temp.dir". It's a bit confusing to me that pig.temp.dir is set differently for tests vs intermediate store at different places.
          Hide
          Cheolsoo Park added a comment -

          PIG-2995 explains the reasoning. I think that the advantage of having a property such as "pig.temp.dir" is to use it on a case-by-case basis without having to changing code. So I don't understand why setting the property to different values for different purposes is confusing.

          As far as I understand, both PIG-2995 and PIG-2591 are trying to solve the same problem. When you have automated builds, you want to control where builds generate temporary files. In fact, there are two kinds of files that tests generate. As you pointed out, 1) intermediate files and 2) output files (e.g. STORE foo INTO '/tmp'). To me, controlling them with a single property sounds better rather than having two properties.

          Does this make sense?

          Show
          Cheolsoo Park added a comment - PIG-2995 explains the reasoning. I think that the advantage of having a property such as "pig.temp.dir" is to use it on a case-by-case basis without having to changing code. So I don't understand why setting the property to different values for different purposes is confusing. As far as I understand, both PIG-2995 and PIG-2591 are trying to solve the same problem. When you have automated builds, you want to control where builds generate temporary files. In fact, there are two kinds of files that tests generate. As you pointed out, 1) intermediate files and 2) output files (e.g. STORE foo INTO '/tmp'). To me, controlling them with a single property sounds better rather than having two properties. Does this make sense?
          Hide
          Cheolsoo Park added a comment -

          Jarek Jarcec Cecho and Prashant Kommireddi, I thought about this more last night. I withdraw my agreement on Jarcec's patch.

          • I don't think we want to have the following method:
            public static String getTemporaryDirectory() {
                return getTemporaryDirectory(System.getProperties(), random);
            }
            

            Let's say that we set "pig.temp.dir" to "/x" in system properties (e.g. <sysproperty> in build.xml). But FilLocalizer.getTemporaryPath() is not effected by this because "pig.temp.dir" is not set in PigContext. So Pig will still generate intermediate data under "file:///tmp". Now if we do "store foo into TmpUtil.getTemporaryDirectory()", unit tests will generate temporary files into two directories: output files into "/x" and intermediate files into "/tmp".

            The direct usage of FileLocalizer is not simple as it requires a lot of initialization (for example that class needs valid PigContext).

            In fact, this is good because that's how we guarantee all the temporary files go into a single directory controlled by PigContext. So we should use FilLocalizer.getTemporaryPath().

          • I was wrong about setting "pig.temp.dir" in system properties in build.xml for unit tests. In fact, it won't be propagated to PigContext. So we should set "pig.temp.dir" in PigContext.
          • All of these are only applicable to local mode (i.e. file://). If unit tests run on mini cluster, we shouldn't set "pig.temp.dir". Currently, mini cluster already generates files under "./build", so we don't need to worry about them.
          • We should use File.createTempFile() for local temporary files that are not generated by Pig. (e.g. Some tests generate input files on local and copy them to mini cluster.) Then, "java.io.tmpdir" will be automatically honored.

          I think I covered all the cases. Please let me know what you think.

          Show
          Cheolsoo Park added a comment - Jarek Jarcec Cecho and Prashant Kommireddi , I thought about this more last night. I withdraw my agreement on Jarcec's patch. I don't think we want to have the following method: public static String getTemporaryDirectory() { return getTemporaryDirectory( System .getProperties(), random); } Let's say that we set "pig.temp.dir" to "/x" in system properties (e.g. <sysproperty> in build.xml). But FilLocalizer.getTemporaryPath() is not effected by this because "pig.temp.dir" is not set in PigContext. So Pig will still generate intermediate data under "file:///tmp". Now if we do "store foo into TmpUtil.getTemporaryDirectory()", unit tests will generate temporary files into two directories: output files into "/x" and intermediate files into "/tmp". The direct usage of FileLocalizer is not simple as it requires a lot of initialization (for example that class needs valid PigContext). In fact, this is good because that's how we guarantee all the temporary files go into a single directory controlled by PigContext. So we should use FilLocalizer.getTemporaryPath(). I was wrong about setting "pig.temp.dir" in system properties in build.xml for unit tests. In fact, it won't be propagated to PigContext. So we should set "pig.temp.dir" in PigContext. All of these are only applicable to local mode (i.e. file:// ). If unit tests run on mini cluster, we shouldn't set "pig.temp.dir". Currently, mini cluster already generates files under "./build", so we don't need to worry about them. We should use File.createTempFile() for local temporary files that are not generated by Pig. (e.g. Some tests generate input files on local and copy them to mini cluster.) Then, "java.io.tmpdir" will be automatically honored. I think I covered all the cases. Please let me know what you think.
          Hide
          Prashant Kommireddi added a comment -

          Cheolsoo Park I agree with your assessment.


          I was wrong about setting "pig.temp.dir" in system properties in build.xml for unit tests. In fact, it won't be propagated to PigContext. So we should set "pig.temp.dir" in PigContext.

          That was partly a cause for my confusion . Setting a system property would not reflect in PigContext.

          Jarek Jarcec Cecho apologies for the back-and-forth and thanks for your patience. But it's important we all agree on the best approach as this patch affects the way we write and run tests

          Show
          Prashant Kommireddi added a comment - Cheolsoo Park I agree with your assessment. I was wrong about setting "pig.temp.dir" in system properties in build.xml for unit tests. In fact, it won't be propagated to PigContext. So we should set "pig.temp.dir" in PigContext. That was partly a cause for my confusion . Setting a system property would not reflect in PigContext. Jarek Jarcec Cecho apologies for the back-and-forth and thanks for your patience. But it's important we all agree on the best approach as this patch affects the way we write and run tests
          Hide
          Alan Gates added a comment -

          Canceling patch as it's clear we don't have agreement on the approach yet.

          Show
          Alan Gates added a comment - Canceling patch as it's clear we don't have agreement on the approach yet.

            People

            • Assignee:
              Jarek Jarcec Cecho
              Reporter:
              Thomas Weise
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:

                Development