Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-5527

Use of localtime in vectorized Timestamp arithmetic results in data corruption (results depends on localtime)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      A query like

      SELECT ctimestamp2 from alltypesorc WHERE ctimestamp2 > -10669;

      returns rows in row mode, but not in vector mode when running in GMT+2 timezone.

      I know what causes this, but I don’t know exactly whether is a bug or not.
      The reading of the TIMESTAMP types is done in TimeStampTreeReader class,

      long ms = (result.vector[result.isRepeating ? 0 : i] + WriterImpl.BASE_TIMESTAMP)

      • WriterImpl.MILLIS_PER_SECOND;
        long ns = parseNanos(nanoVector.vector[nanoVector.isRepeating ? 0 : i]);
        // the rounding error exists because java always rounds up when dividing integers
        // -42001/1000 = -42; and -42001 % 1000 = -1 (+ 1000)
        // to get the correct value we need
        // (-42 - 1)*1000 + 999 = -42001
        // (42)*1000 + 1 = 42001
        if(ms < 0 && ns != 0) { ms -= 1000; }

        // Convert millis into nanos and add the nano vector value to it
        result.vector[i] = (ms * 1000000) + ns;

      As you see this relies on the ORC WriterImpl.BASE_TIMESTAMP, which is declared as:

      static final long BASE_TIMESTAMP =
      Timestamp.valueOf("2015-01-01 00:00:00").getTime() / MILLIS_PER_SECOND;

      On US/Pacific time, this will be 1420099200
      On EEST (GMT+2) time is 1420063200

      The first row in alltypesorc for ctimestamp2 reads -1420099192 as data[0] and 7005 as nanos[0]. On US/Pacific, with a LONG vector timestamp value of 8875000000. On EEST it ends up with -35992125000000. (Note how the abs(data[0]) value is smaller than the US/Pacific basetime, but bigger than the EEST, so it goes negative on EEST and just cascades to a huge negative number).

      The vector filter simply compares this with -10669 (the query WHERE clause) and it qualifies the row on US/Pacific, but fails on EEST.

      I’m not sure what the right solution is, the whole of Hive code appears to be riddled with Timezone problems. As a side node, the build-common.xml sets an environment variable TZ to US/Pacific, but this has no effect in running tests on Windows.

      But the gist of it is this: in row mode the results are consistent on any time zone. In vector mode the results vary (rows qualify for WHERE clause) depending on the timezone.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              rusanu Remus Rusanu
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: