Hive
  1. Hive
  2. HIVE-5527

Use of localtime in vectorized Timestamp arithmetic results in data corruption (results depends on localtime)

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      A query like

      SELECT ctimestamp2 from alltypesorc WHERE ctimestamp2 > -10669;

      returns rows in row mode, but not in vector mode when running in GMT+2 timezone.

      I know what causes this, but I don’t know exactly whether is a bug or not.
      The reading of the TIMESTAMP types is done in TimeStampTreeReader class,

      long ms = (result.vector[result.isRepeating ? 0 : i] + WriterImpl.BASE_TIMESTAMP)

      • WriterImpl.MILLIS_PER_SECOND;
        long ns = parseNanos(nanoVector.vector[nanoVector.isRepeating ? 0 : i]);
        // the rounding error exists because java always rounds up when dividing integers
        // -42001/1000 = -42; and -42001 % 1000 = -1 (+ 1000)
        // to get the correct value we need
        // (-42 - 1)*1000 + 999 = -42001
        // (42)*1000 + 1 = 42001
        if(ms < 0 && ns != 0) { ms -= 1000; }

        // Convert millis into nanos and add the nano vector value to it
        result.vector[i] = (ms * 1000000) + ns;

      As you see this relies on the ORC WriterImpl.BASE_TIMESTAMP, which is declared as:

      static final long BASE_TIMESTAMP =
      Timestamp.valueOf("2015-01-01 00:00:00").getTime() / MILLIS_PER_SECOND;

      On US/Pacific time, this will be 1420099200
      On EEST (GMT+2) time is 1420063200

      The first row in alltypesorc for ctimestamp2 reads -1420099192 as data[0] and 7005 as nanos[0]. On US/Pacific, with a LONG vector timestamp value of 8875000000. On EEST it ends up with -35992125000000. (Note how the abs(data[0]) value is smaller than the US/Pacific basetime, but bigger than the EEST, so it goes negative on EEST and just cascades to a huge negative number).

      The vector filter simply compares this with -10669 (the query WHERE clause) and it qualifies the row on US/Pacific, but fails on EEST.

      I’m not sure what the right solution is, the whole of Hive code appears to be riddled with Timezone problems. As a side node, the build-common.xml sets an environment variable TZ to US/Pacific, but this has no effect in running tests on Windows.

      But the gist of it is this: in row mode the results are consistent on any time zone. In vector mode the results vary (rows qualify for WHERE clause) depending on the timezone.

        Activity

        There are no comments yet on this issue.

          People

          • Assignee:
            Unassigned
            Reporter:
            Remus Rusanu
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:

              Development