Pig
  1. Pig
  2. PIG-3014

CurrentTime() UDF has undesirable characteristics

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels:
      None

      Description

      As part of the explanation of the new DateTime datatype I noticed that we had added a CurrentTime() UDF. The issue with this UDF is that it returns the current time of every exec invocation, which can lead to confusing results. In PIG-1431 I proposed a way such that every instance of the same NOW() will return the same time, which I think is better. Would enjoy thoughts.

      1. PIG-3014-3.patch
        0.5 kB
        Cheolsoo Park
      2. PIG-3014-2.patch
        7 kB
        Cheolsoo Park
      3. PIG-3014-1.patch
        5 kB
        Jonathan Coveney
      4. PIG-3014-0.patch
        8 kB
        Jonathan Coveney

        Issue Links

          Activity

          Hide
          Alan Gates added a comment -

          +1. It's hard to envision why you'd want the current behavior.

          Show
          Alan Gates added a comment - +1. It's hard to envision why you'd want the current behavior.
          Hide
          Jonathan Coveney added a comment -

          I've attached a fix and a couple light tests. Note that I uncovered PIG-3032 while developing this, though this isn't affected by that bug.

          Show
          Jonathan Coveney added a comment - I've attached a fix and a couple light tests. Note that I uncovered PIG-3032 while developing this, though this isn't affected by that bug.
          Hide
          Zhijie Shen added a comment -

          Hi Jonathan,

          thanks for correcting my error. According to you patch, I've also some comments.

          As far as I know, getArgToFuncMapping() is called when generating the logic plan, while exec() is called at runtime. Hence the DateTime object generated in getArgToFuncMapping() reflects the timestamp when the pig latin statements are parsed. If there's several statements containing CurrentTime(), the timestamps will be similar. Please correct me if I'm wrong.

          For example,

          "
          A = load 'justSomeRows' using mock.Storage();
          B = foreach A generate *, CurrentTime();
          ......
          C = foreach B generate *, CurrentTime();
          "

          In this case, there're a bunch of statements between B and C. In B, we want to get the timestamp before executing the statements, while in C, want to get the timestamp after executing them. The difference between the two timestamps should reflect the runtime interval instead of the interval between parsing two CurrentTime() UDFs.

          I think the more accurate behavior of CurrentTime() should be generating a unique timestamp for a statement when it is executed.

          Show
          Zhijie Shen added a comment - Hi Jonathan, thanks for correcting my error. According to you patch, I've also some comments. As far as I know, getArgToFuncMapping() is called when generating the logic plan, while exec() is called at runtime. Hence the DateTime object generated in getArgToFuncMapping() reflects the timestamp when the pig latin statements are parsed. If there's several statements containing CurrentTime(), the timestamps will be similar. Please correct me if I'm wrong. For example, " A = load 'justSomeRows' using mock.Storage(); B = foreach A generate *, CurrentTime(); ...... C = foreach B generate *, CurrentTime(); " In this case, there're a bunch of statements between B and C. In B, we want to get the timestamp before executing the statements, while in C, want to get the timestamp after executing them. The difference between the two timestamps should reflect the runtime interval instead of the interval between parsing two CurrentTime() UDFs. I think the more accurate behavior of CurrentTime() should be generating a unique timestamp for a statement when it is executed.
          Hide
          Jonathan Coveney added a comment -

          Zhijie,

          I think that the semantics in my patch are sufficient. You are correct that in some cases, they might be "closer together" than we might want, but what does that even mean? The semantics are not well specified. What if the optimizer in fact put C before B? What if the optimizer had them run at the same time? What if my cluster happens to be tuned to a certain workload...and so on and so on. I think as long as "now" is defined as "after the script runs," and as long as it is the same for every value in a given relation that uses it, that's the only guarantee that we can make. We can document this limitation (i.e. that "now" is a more or less arbitrary value in between the beginning of your script and when it is finished being parsed).

          I suppose there would be some utility in a CurrentTime() where the time is with respect to the beginning of execution, but it could easily suffer from the same issue if it was in a foreach with a really time consuming value, where the "now" value quickly becomes stale. I think the incremental gain is minimal, and the incremental complexity is quite high. If you deeply disagree, though, we can discuss how to do it. I think the following would work: per each instantiation of the UDF, we create two unique files and put them in HDFS (I do not think the distributed cache will work in this specific case, but it may). Those files will be the constructor argument. On first execution, each mapper tries to delete the file. Since delete is atomic, only one should succeed. This is the leader. It will record the current time and serialize it to the second file. We would have to coordinate atomicity...perhaps it could write a magic value at the end of the serialized date time, so all of the mappers would read the file until they read the magic number, and then they'd know it was done.

          This would be pretty complicated for what I see as a minimal gain, but it would probably be a "more correct" now() implementation. I do not know if Hadoop has a more convenient coordination mechanism between mappers (this sort of goes against the whole point).

          I welcome more thoughts

          Show
          Jonathan Coveney added a comment - Zhijie, I think that the semantics in my patch are sufficient. You are correct that in some cases, they might be "closer together" than we might want, but what does that even mean? The semantics are not well specified. What if the optimizer in fact put C before B? What if the optimizer had them run at the same time? What if my cluster happens to be tuned to a certain workload...and so on and so on. I think as long as "now" is defined as "after the script runs," and as long as it is the same for every value in a given relation that uses it, that's the only guarantee that we can make. We can document this limitation (i.e. that "now" is a more or less arbitrary value in between the beginning of your script and when it is finished being parsed). I suppose there would be some utility in a CurrentTime() where the time is with respect to the beginning of execution, but it could easily suffer from the same issue if it was in a foreach with a really time consuming value, where the "now" value quickly becomes stale. I think the incremental gain is minimal, and the incremental complexity is quite high. If you deeply disagree, though, we can discuss how to do it. I think the following would work: per each instantiation of the UDF, we create two unique files and put them in HDFS (I do not think the distributed cache will work in this specific case, but it may). Those files will be the constructor argument. On first execution, each mapper tries to delete the file. Since delete is atomic, only one should succeed. This is the leader. It will record the current time and serialize it to the second file. We would have to coordinate atomicity...perhaps it could write a magic value at the end of the serialized date time, so all of the mappers would read the file until they read the magic number, and then they'd know it was done. This would be pretty complicated for what I see as a minimal gain, but it would probably be a "more correct" now() implementation. I do not know if Hadoop has a more convenient coordination mechanism between mappers (this sort of goes against the whole point). I welcome more thoughts
          Hide
          Jonathan Coveney added a comment -

          (as a sidenote, I was thinking about this, and IF (big if) Hadoop can guarantee an atomic write action (I don't think it can?) then we only need one file. Each mapper can attempt to read it, and if it is empty, it appends the current time, and then it reads the first date time in that file. It would avoid a race condition because of the atomic write action. If writing isn't atomic though you'd have to abuse some atomic action for coordination, a la delete above. In fact, we could even make this a generic API that let's you coordinate some runtime value for udf invocations, but once again, it's not really a pattern we want to encourage).

          Now I sort of want to do this just for the challenge of it...

          Show
          Jonathan Coveney added a comment - (as a sidenote, I was thinking about this, and IF (big if) Hadoop can guarantee an atomic write action (I don't think it can?) then we only need one file. Each mapper can attempt to read it, and if it is empty, it appends the current time, and then it reads the first date time in that file. It would avoid a race condition because of the atomic write action. If writing isn't atomic though you'd have to abuse some atomic action for coordination, a la delete above. In fact, we could even make this a generic API that let's you coordinate some runtime value for udf invocations, but once again, it's not really a pattern we want to encourage). Now I sort of want to do this just for the challenge of it...
          Hide
          Jonathan Coveney added a comment -

          I think this patch is a good compromise. I was talking to Bill earlier today and he mentioned that we have in the job conf a timestamp for around when a given job starts. IMHO this is close enough, and it will be the same. Easy peasy.

          Show
          Jonathan Coveney added a comment - I think this patch is a good compromise. I was talking to Bill earlier today and he mentioned that we have in the job conf a timestamp for around when a given job starts. IMHO this is close enough, and it will be the same. Easy peasy.
          Hide
          Cheolsoo Park added a comment -

          Hi Jonathan,

          I agree with using the job start time. That sounds reasonable to me.

          But I have two comments regarding your patch:

          • The Apache license header shouldn't be removed in src/org/apache/pig/builtin/CurrentTime.java.
          • After applying your patch, src/org/apache/pig/builtin/CurrentTime.java looks like this. Can you please fix indentations?
            /**
                 * This is a default constructor for Pig reflection purposes. It should
                 * never actually be used.
             */
                public CurrentTime() {}
            
                @Override
                public DateTime exec(Tuple input) throws IOException {
                    if (!isInitialized) {
                        String dateTimeValue = UDFContext.getUDFContext().getJobConf().get("pig.job.submitted.timestamp");
                        if (dateTimeValue == null) {
                            throw new ExecException("pig.job.submitted.timestamp was not set!");
                }   
                        dateTime = new DateTime(Long.parseLong(dateTimeValue));
                        isInitialized  = true;
                }   
                    return dateTime;
                }
            

          Thanks!

          Show
          Cheolsoo Park added a comment - Hi Jonathan, I agree with using the job start time. That sounds reasonable to me. But I have two comments regarding your patch: The Apache license header shouldn't be removed in src/org/apache/pig/builtin/CurrentTime.java . After applying your patch, src/org/apache/pig/builtin/CurrentTime.java looks like this. Can you please fix indentations? /** * This is a default constructor for Pig reflection purposes. It should * never actually be used. */ public CurrentTime() {} @Override public DateTime exec(Tuple input) throws IOException { if (!isInitialized) { String dateTimeValue = UDFContext.getUDFContext().getJobConf().get( "pig.job.submitted.timestamp" ); if (dateTimeValue == null ) { throw new ExecException( "pig.job.submitted.timestamp was not set!" ); } dateTime = new DateTime( Long .parseLong(dateTimeValue)); isInitialized = true ; } return dateTime; } Thanks!
          Hide
          Cheolsoo Park added a comment -

          I was going to commit the patch after fixing whitespaces.

          But I realized that the new test case TestCurrentTime fails with hadoop-2.0.x.

          ERROR 0: pig.job.submitted.timestamp was not set!
          
          Show
          Cheolsoo Park added a comment - I was going to commit the patch after fixing whitespaces. But I realized that the new test case TestCurrentTime fails with hadoop-2.0.x. ERROR 0: pig.job.submitted.timestamp was not set!
          Hide
          Cheolsoo Park added a comment -

          Here is how I ended up fixing the test with hadoop-2.0.x:

          from

          MapReduceLauncher.java
          for (Job job : jc.getWaitingJobs()) {
              job.getJobConf().set("pig.script.submitted.timestamp", Long.toString(scriptSubmittedTimestamp));
              job.getJobConf().set("pig.job.submitted.timestamp", Long.toString(System.currentTimeMillis()));
          }
          

          to

          MapReduceLauncher.java
          for (Job job : jc.getWaitingJobs()) {
              JobConf jobConfCopy = job.getJobConf();
              jobConfCopy.set("pig.script.submitted.timestamp", Long.toString(scriptSubmittedTimestamp));
              jobConfCopy.set("pig.job.submitted.timestamp", Long.toString(System.currentTimeMillis()));
              job.setJobConf(jobConfCopy);
          }
          

          Apparently, job.getJobConf() returns a different JobConf object each time, so properties that are set by job.getJobConf().set() do not last at all.

          This is quite surprising to me because this means that there are many other properties that are not properly set with hadoop-2.0.x now. I will open another jira to get this issue fixed.

          Show
          Cheolsoo Park added a comment - Here is how I ended up fixing the test with hadoop-2.0.x: from MapReduceLauncher.java for (Job job : jc.getWaitingJobs()) { job.getJobConf().set( "pig.script.submitted.timestamp" , Long .toString(scriptSubmittedTimestamp)); job.getJobConf().set( "pig.job.submitted.timestamp" , Long .toString( System .currentTimeMillis())); } to MapReduceLauncher.java for (Job job : jc.getWaitingJobs()) { JobConf jobConfCopy = job.getJobConf(); jobConfCopy.set( "pig.script.submitted.timestamp" , Long .toString(scriptSubmittedTimestamp)); jobConfCopy.set( "pig.job.submitted.timestamp" , Long .toString( System .currentTimeMillis())); job.setJobConf(jobConfCopy); } Apparently, job.getJobConf() returns a different JobConf object each time, so properties that are set by job.getJobConf().set() do not last at all. This is quite surprising to me because this means that there are many other properties that are not properly set with hadoop-2.0.x now. I will open another jira to get this issue fixed.
          Hide
          Cheolsoo Park added a comment -

          Attaching a patch that makes TestCurrentTime pass in both hadoop 20 and 23. I also fixed whitespace and Apache header issues that I mentioned in a previous comment.

          Thanks!

          Show
          Cheolsoo Park added a comment - Attaching a patch that makes TestCurrentTime pass in both hadoop 20 and 23. I also fixed whitespace and Apache header issues that I mentioned in a previous comment. Thanks!
          Hide
          Rohini Palaniswamy added a comment -

          Apparently, job.getJobConf() returns a different JobConf object each time, so properties that are set by job.getJobConf().set() do not last at all.

          This is got me worried. I don't see a clone happening in mapreduce code. Were you able to get to the root cause of the behaviour?

          Show
          Rohini Palaniswamy added a comment - Apparently, job.getJobConf() returns a different JobConf object each time, so properties that are set by job.getJobConf().set() do not last at all. This is got me worried. I don't see a clone happening in mapreduce code. Were you able to get to the root cause of the behaviour?
          Hide
          Jonathan Coveney added a comment -

          Thanks for fixing my patch over the weekend, Cheolsoo! Feel free to commit, or I can later. The reviewer has become the reviewed

          I agree with Rohini that we need to be vigilant about this getJobConf situation.

          Show
          Jonathan Coveney added a comment - Thanks for fixing my patch over the weekend, Cheolsoo! Feel free to commit, or I can later. The reviewer has become the reviewed I agree with Rohini that we need to be vigilant about this getJobConf situation.
          Hide
          Rohini Palaniswamy added a comment -

          I don't see a clone happening in mapreduce code. Were you able to get to the root cause of the behaviour?

          I was looking at the wrong version of hadoop code. H23 indeed returns a copy of the JobConf for jobcontrol.Job. Checked that we don't do a set in other places. Even the submitted.timestamp that was set was not being used in code elsewhere before this case. May be it was just set for debugging purposes. So we should be good with 0.9 and 0.10 for h23.

          public synchronized JobConf getJobConf() {
              return new JobConf(super.getJob().getConfiguration());
            }
          

          My +1 as well.

          Show
          Rohini Palaniswamy added a comment - I don't see a clone happening in mapreduce code. Were you able to get to the root cause of the behaviour? I was looking at the wrong version of hadoop code. H23 indeed returns a copy of the JobConf for jobcontrol.Job. Checked that we don't do a set in other places. Even the submitted.timestamp that was set was not being used in code elsewhere before this case. May be it was just set for debugging purposes. So we should be good with 0.9 and 0.10 for h23. public synchronized JobConf getJobConf() { return new JobConf( super .getJob().getConfiguration()); } My +1 as well.
          Hide
          Cheolsoo Park added a comment -

          Committed to trunk. Thanks Jonathan and Rohini!

          Show
          Cheolsoo Park added a comment - Committed to trunk. Thanks Jonathan and Rohini!
          Hide
          Julien Le Dem added a comment -

          I see a failing test:
          org.apache.pig.test.TestBuiltin.testConversionBetweenDateTimeAndString

          java.lang.NullPointerException
          at org.apache.pig.builtin.CurrentTime.exec(CurrentTime.java:41)
          at org.apache.pig.test.TestBuiltin.testConversionBetweenDateTimeAndString(TestBuiltin.java:450)

          Show
          Julien Le Dem added a comment - I see a failing test: org.apache.pig.test.TestBuiltin.testConversionBetweenDateTimeAndString java.lang.NullPointerException at org.apache.pig.builtin.CurrentTime.exec(CurrentTime.java:41) at org.apache.pig.test.TestBuiltin.testConversionBetweenDateTimeAndString(TestBuiltin.java:450)
          Hide
          Cheolsoo Park added a comment -

          Hi Julien,

          Sorry for that. It is failing because TestBuiltin is not set pig.job.submitted.timestamp. I will get it fixed now.

          Show
          Cheolsoo Park added a comment - Hi Julien, Sorry for that. It is failing because TestBuiltin is not set pig.job.submitted.timestamp . I will get it fixed now.
          Hide
          Cheolsoo Park added a comment -

          Attached a patch that fixes TestBuiltin.

          The CurrentTime() must get called only in the back-end because it reads the value of "pig.job.submitted.timestamp" out of JobConf. But the unit test case was calling it in the front-end, resulting in a NullPointerException.

          Since the test case is not valid, I simply removed it.

          Thanks!

          Show
          Cheolsoo Park added a comment - Attached a patch that fixes TestBuiltin . The CurrentTime() must get called only in the back-end because it reads the value of "pig.job.submitted.timestamp" out of JobConf. But the unit test case was calling it in the front-end, resulting in a NullPointerException. Since the test case is not valid, I simply removed it. Thanks!
          Hide
          Rohini Palaniswamy added a comment -

          Since the test case is not valid, I simply removed it.

          +1. TestCurrentTime covers CurrentTime udf adequately.

          Just an observation though. All builtin udf tests are in TestBuiltin, but CurrentTime alone has a separate test class with just one test. Should we move that to TestBuiltin?

          Show
          Rohini Palaniswamy added a comment - Since the test case is not valid, I simply removed it. +1. TestCurrentTime covers CurrentTime udf adequately. Just an observation though. All builtin udf tests are in TestBuiltin, but CurrentTime alone has a separate test class with just one test. Should we move that to TestBuiltin?
          Hide
          Cheolsoo Park added a comment -

          Thanks Rohini.

          In fact, I asked that question on the dev mailing list a while ago:
          http://search-hadoop.com/m/OVyoR1Ktpcy/Adding+new+test+cases+to+TestBuiltin.java&subj=Adding+new+test+cases+to+TestBuiltin+java

          Julien said that each built-in UDF should have its own test suite, so I followed it in PIG-2881. I guess that the same applies to CurrentTime().

          Please anyone correct me if I am wrong.

          Show
          Cheolsoo Park added a comment - Thanks Rohini. In fact, I asked that question on the dev mailing list a while ago: http://search-hadoop.com/m/OVyoR1Ktpcy/Adding+new+test+cases+to+TestBuiltin.java&subj=Adding+new+test+cases+to+TestBuiltin+java Julien said that each built-in UDF should have its own test suite, so I followed it in PIG-2881 . I guess that the same applies to CurrentTime(). Please anyone correct me if I am wrong.
          Hide
          Julien Le Dem added a comment -

          I think it's better to have one test class per UDF.
          Usually tests are grouped per class or functional group of classes.
          All builtin UDFs do not make a functional group as they have various different purposes. It just makes a huge Test class which is undesirable.

          Show
          Julien Le Dem added a comment - I think it's better to have one test class per UDF. Usually tests are grouped per class or functional group of classes. All builtin UDFs do not make a functional group as they have various different purposes. It just makes a huge Test class which is undesirable.
          Hide
          Cheolsoo Park added a comment -

          Patch committed to trunk.

          Show
          Cheolsoo Park added a comment - Patch committed to trunk.
          Hide
          Rohini Palaniswamy added a comment -

          Ah. I had forgotten about that question. Agree with Julien.

          Show
          Rohini Palaniswamy added a comment - Ah. I had forgotten about that question. Agree with Julien.

            People

            • Assignee:
              Jonathan Coveney
              Reporter:
              Jonathan Coveney
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development