HIVE-3822: Casting from double to timestamp type seems inconsistent

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      select cast(1355944339 as timestamp) from decimal_3 limit 1;
      1970-01-16 08:39:04.339
      
      select cast(1355944339000 as timestamp) from decimal_3 limit 1;
      2012-12-19 11:12:19
      
      select cast(1355944339.123456789 as timestamp) from decimal_3 limit 1;
      2012-12-19 11:12:19.1234567
      

      When the unix timestamp is specified without a decimal point, it is interpreted as milliseconds. When it is specified with a decimal point, the integer part is interpreted as seconds and the fractional part as sub-second precision. Moreover, it seems like some precision is lost (notice that the trailing '89' is dropped in the last query).
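
      As a rough illustration of the apparent behavior (plain Java, not the Hive implementation; the class and method names below are made up for the sketch): the integral cast acts as if the value were epoch milliseconds, while the decimal cast treats the integer part as epoch seconds and the fraction as sub-second precision, losing digits beyond what a double can hold.

      import java.sql.Timestamp;

      // Sketch of the apparent cast behavior described above (not Hive source).
      public class CastBehaviorSketch {
          // Without a decimal point the value appears to be taken as epoch milliseconds.
          static Timestamp fromLong(long v) {
              return new Timestamp(v);               // 1355944339 -> 1970-01-16 ...
          }

          // With a decimal point the integer part appears to be taken as epoch seconds.
          static Timestamp fromDouble(double v) {
              long seconds = (long) Math.floor(v);
              int nanos = (int) Math.round((v - seconds) * 1e9);
              Timestamp t = new Timestamp(seconds * 1000L);
              t.setNanos(nanos);                     // double precision has already dropped the trailing '89'
              return t;
          }

          public static void main(String[] args) {
              System.out.println(fromLong(1355944339L));            // 1970-01-16 ...
              System.out.println(fromLong(1355944339000L));         // 2012-12-19 ...
              System.out.println(fromDouble(1355944339.123456789)); // 2012-12-19 ... .1234567xx
          }
      }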


          Activity

          Arijit Banerjee added a comment -

          I think a good approach would be to fix it in a way that it works with both double and long values. This is what I did and it works for me. Let me know what you think. I also suggested this in the duplicate issue HIVE-3454.

          In PrimitiveObjectInspectorUtils.java (package org.apache.hadoop.hive.serde2.objectinspector.primitive), edit the getTimestamp method so that the long value is converted to a double.

          getTimestamp(Object o, PrimitiveObjectInspector oi)

          case LONG:
            // Timestamp conversion from LONG is messy. Convert to double instead.
            long tsLongPrimitive = ((LongObjectInspector) oi).get(o);
            result = TimestampWritable.doubleToTimestamp(tsLongPrimitive * 1.0);
            break;

          Ryan Harris added a comment -

          I don't have enough experience with this operation on other database systems to have clear expectations of how things should work in Hive.

          I initially noticed the inconsistency when using to_utc_timestamp() and then identified that the behavior seemed to come from the cast() conversion.

          If you are saying that the behavior of cast() is expected and appropriate, then it would seem that to_utc_timestamp() is producing unexpected results....

          However, to_utc_timestamp(timestamp,string) expects a timestamp (not an integer/float) value...thus cast() is being used to convert the submitted int/float INTO a timestamp, which is why

          select from_utc_timestamp(to_utc_timestamp(0.0,'GMT'),'GMT') from test limit 1
          and
          select from_utc_timestamp(to_utc_timestamp(cast(0.0 as timestamp),'GMT'),'GMT') from test limit 1
          both return the same (wrong) result:
          1969-12-31 17:00:00

          Here is my question....
          If you are processing a log file that contains a GMT epoch formatted date value, what is the appropriate way to process/store that value in a timestamp field and get the same stored timestamp regardless of the timezone setting of the cluster processing the file?

          What should be changed to ensure that:

          select from_utc_timestamp(to_utc_timestamp(0.0,'GMT'),'GMT') from test limit 1

          ALWAYS returns 1970-01-01 00:00:00 as it should, no matter what the system timezone is set for?
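
          For what it's worth, here is a minimal standalone Java sketch (illustrative only, not Hive code) of why that round trip is timezone dependent: the underlying instant (epoch 0) is fixed, but rendering it without pinning a timezone uses whatever the JVM default happens to be.

          import java.sql.Timestamp;
          import java.text.SimpleDateFormat;
          import java.util.TimeZone;

          public class EpochRenderingSketch {
              public static void main(String[] args) {
                  Timestamp epoch = new Timestamp(0L); // 1970-01-01 00:00:00 UTC
                  SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

                  // Uses the JVM default timezone, e.g. 1969-12-31 17:00:00 on a GMT-7 system.
                  System.out.println("default zone: " + fmt.format(epoch));

                  // Pinned to GMT: always 1970-01-01 00:00:00, regardless of the system zone.
                  fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
                  System.out.println("GMT:          " + fmt.format(epoch));
              }
          }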

          Mark Grover added a comment -

          select cast(0.0 as timestamp) from test limit 1
          1969-12-31 17:00:00
          select cast('1970-01-01 00:00:00' as timestamp) from test limit 1
          1970-01-01 00:00:00

          We can't talk about human-readable timestamps without referring to a particular timezone. When your query is select cast('1970-01-01 00:00:00' as timestamp) from test limit 1, you are implicitly referring to your default timezone (you could argue that the implicit reference should be the GMT timezone here, but as I said before, that's historically not been the case with database systems), and therefore the results are different. The first query creates a timestamp that is 01/01/1970 midnight in UTC, while the second creates the same wall-clock time but in Denver.

          "The problem is that cast(STRING datetime to TIMESTAMP) produces different results from cast(int/float datetime to TIMESTAMP)"

          That would always be the case (unless your system timezone is UTC) since the STRING datetime refers to your system timezone while the int/float datetime always refers to UTC timezone.
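
          A small Java sketch of that distinction (illustrative only, not the Hive code path): a wall-clock string parsed without an explicit zone lands in the system timezone, while a numeric epoch value always counts from 1970-01-01 00:00:00 UTC, so the two only coincide when the system timezone is UTC.

          import java.sql.Timestamp;
          import java.text.SimpleDateFormat;

          public class StringVsNumericSketch {
              public static void main(String[] args) throws Exception {
                  SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

                  // Parsed in the JVM default zone; on a GMT-7 system this is epoch 25200000 ms.
                  Timestamp fromString = new Timestamp(fmt.parse("1970-01-01 00:00:00").getTime());

                  // The numeric form always refers to the UTC epoch.
                  Timestamp fromNumber = new Timestamp(0L);

                  System.out.println(fromString.getTime()); // 25200000 on GMT-7; 0 only if the system zone is UTC
                  System.out.println(fromNumber.getTime()); // 0
              }
          }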

          Ryan Harris added a comment -

          Also,
          select from_utc_timestamp(to_utc_timestamp('1970-01-01 00:00:00','GMT'),'America/Denver') from test limit 1
          1969-12-31 17:00:00

          is an expected result...

          select from_utc_timestamp(to_utc_timestamp(0.0,'GMT'),'America/Denver') from test limit 1
          1969-12-31 10:00:00

          is VERY unexpected...

          Ryan Harris added a comment -

          I'm still not sure that I agree with you on the time conversion issue...
          I'm not sure what the "correct" answer is, but I'm experiencing inconsistent/unexpected behavior depending on how I convert the timestamp.

          From my point of view, if a timezone isn't specified then 0 = 1970-01-01 00:00:00

          select cast(0.0 as timestamp) from test limit 1
          1969-12-31 17:00:00

          select cast('1970-01-01 00:00:00' as timestamp) from test limit 1
          1970-01-01 00:00:00

          Shouldn't those two results be the same?
          This is the "inconsistency" I reference.

          select to_utc_timestamp(0.0,'GMT') from test limit 1
          1969-12-31 17:00:00

          select to_utc_timestamp('1970-01-01 00:00:00','GMT') from test limit 1
          1970-01-01 00:00:00

          The same inconsistency is produced here...but the behavior seemed to be related to the cast() conversion taking place in to_utc_timestamp(), not in the UDF itself. If you still think that this is actually an issue with the UDF itself, I can file a separate JIRA for that, but to me, all of this seems to come back to cast().

          In this case, my cluster is set to the America/Denver timezone, so if I attempt to convert a local epoch time to a timestamp in what seems to me a logical approach, this appears to work:
          select to_utc_timestamp(0.0,'America/Denver') from test limit 1
          1970-01-01 00:00:00

          However, I say that it "appears" to work, because if I then retrieve that same stored timestamp using the from_utc_timestamp() UDF:
          select from_utc_timestamp(to_utc_timestamp(0.0,'America/Denver'),'America/Denver') from test limit 1
          1969-12-31 17:00:00

          which again is different than if I had done:
          "select from_utc_timestamp(to_utc_timestamp('1970-01-01 00:00:00','America/Denver'),'America/Denver') from test limit 1
          1970-01-01 00:00:00

          The problem is that cast(STRING datetime to TIMESTAMP) produces different results from cast(int/float datetime to TIMESTAMP)

          Mark Grover added a comment -

          "In my opinion the result of a query should NOT depend on the time configuration of your server/cluster."

          Ryan, that's presently not the case. And, that's not just Hive but other databases as well. MySQL, for example, uses the default timezone of the system for unix timestamp to human-readable conversions as well.

          What you are saying about to_utc_timestamp() is most likely a bug in the UDF code itself. If so, please file a separate JIRA for that. In my personal opinion, a bug in how to_utc_timestamp works doesn't warrant changing the behavior of the Timestamp type to use the GMT timezone instead of the default system timezone when converting to a human-readable format.

          I haven't looked at the code but I would expect to_utc_timestamp() to return the same result as long as the first parameter represents the same logical timestamp regardless of how it is expressed - integer timestamp or timestamp type. Are you saying the result is different depending on the representation being passed as the first parameter?

          Ryan Harris added a comment -

          I understand that is the way the code does work; the question, in my mind, is: is that the way it should work?

          I understand that timestamps are inherently timezone-less, and if you want/need to deal with timezone conversions that is the reason the from_utc_timestamp() and to_utc_timestamp() functions were created.

          In this case, the result of the query is going to change/depend on the local timezone of the cluster you are running the query against.

          In my opinion the result of a query should NOT depend on the time configuration of your server/cluster.

          If you WANT/NEED to alter the conversion of the timestamp to account for a specific timezone, shouldn't you use from_utc_timestamp()?

          As it is, because of the implicit cast() conversion that occurs when to_utc_timestamp() is called, the result of to_utc_timestamp(1355944339.0, 'GMT')
          STILL varies based on the timezone of the JVM/cluster where the function is called...that makes NO sense.

          IMO, the cast() function should assume that the epoch integer/double passed to it is timezone-less and NOT apply any skew/adjustment based on whatever the local timezone of the JVM might be.

          Mark Grover added a comment -

          The short answer is no.
          Here is why:
          That particular unix timestamp represents Wed, 19 Dec 2012 19:12:19 GMT. The Hive timestamp datatype is based on TimestampWritable, which is essentially a wrapper around java.sql.Timestamp. java.sql.Timestamp, as you know, is timezone-less; it's more or less equivalent to a unix timestamp. Consequently, Hive timestamps are inherently timezone-less. The tricky bit, though, is when you have to print this timestamp in a human-readable form, where you need to use a particular timezone. As you will see in the code, the DateFormat specifies no particular timezone when formatting the Timestamp object to a human-readable string, so it ends up using the default timezone of the JVM, which is the same as the system's default timezone unless specifically overridden.

          To summarize, when we print a timestamp object in Hive as a human-readable string, the default JVM/system timezone is used. In my case, the above Hive queries were run on a PST machine (GMT-8), so 2012-12-19 11:12:19 is the correct expected output instead of 2012-12-19 19:12:19.

          Hope that helps!
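
          A tiny illustration of that last point (plain Java, not Hive code; the zone names are just examples): the Timestamp object itself never changes, only its human-readable rendering, which follows the JVM default timezone.

          import java.sql.Timestamp;
          import java.util.TimeZone;

          public class TimestampToStringSketch {
              public static void main(String[] args) {
                  // Wed, 19 Dec 2012 19:12:19 GMT, as an instant (milliseconds since the epoch).
                  Timestamp ts = new Timestamp(1355944339000L);

                  TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"));
                  System.out.println(ts); // 2012-12-19 11:12:19.0

                  TimeZone.setDefault(TimeZone.getTimeZone("GMT"));
                  System.out.println(ts); // 2012-12-19 19:12:19.0 -- same instant, different rendering
              }
          }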

          Ryan Harris added a comment -

          Shouldn't these tests have returned:
          Wed, 19 Dec 2012 19:12:19
          ?


            People

            • Assignee: Mark Grover
            • Reporter: Mark Grover
            • Votes: 1
            • Watchers: 5
