Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-5035

Selecting timestamp value from Hive table causes IndexOutOfBoundsException

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.9.0
    • None
    • Execution - Data Types
    • None

    Description

      I used the new option to read Hive timestamps.

      alter session set `store.parquet.reader.int96_as_timestamp` = true;

      This query fails:

      select timestamp_id from orders_parts_hive where timestamp_id = '2016-10-03 06:11:52.429';
      Error: SYSTEM ERROR: IndexOutOfBoundsException: readerIndex: 0, writerIndex: 36288 (expected: 0 <= readerIndex <= writerIndex <= capacity(32768))

      Fragment 0:0

      [Error Id: 50537b32-cdc9-4898-9581-531066288fbd on qa-node211:31010] (state=,code=0)

      Selecting all the columns succeed.

      0: jdbc:drill:zk=10.10.100.186:5181> select * from orders_parts_hive where timestamp_id = '2016-10-03 06:11:52.429';
      ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

      o_orderkey o_custkey o_orderstatus o_totalprice o_orderdate o_clerk o_shippriority o_comment int_id bigint_id float_id double_id varchar_id date_id timestamp_id dir0

      ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

      11335 871 F 133549.0 1994-10-22 null 0 ealms. theodolites maintain. regular, even instructions against t -4 -4 -4.0 -4.0 -4 2016-09-29 2016-10-03 06:11:52.429 o_orderpriority=2-HIGH

      ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

      Attachments

        1. orders_parts_hive.tar
          1.60 MB
          Robert Hou

        Activity

          rhou Robert Hou added a comment -

          I am using RC1.

          0: jdbc:drill:zk=10.10.100.186:5181> select * from sys.version;
          ------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

          version commit_id commit_message commit_time build_email build_time

          ------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

          1.9.0 5cea9afa6278e21574c6a982ae5c3d82085ef904 [maven-release-plugin] prepare release drill-1.9.0 09.11.2016 @ 10:28:44 PST rhou@mapr.com 10.11.2016 @ 12:56:24 PST

          ------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

          rhou Robert Hou added a comment - I am using RC1. 0: jdbc:drill:zk=10.10.100.186:5181> select * from sys.version; --------- ----------------------------------------- --------------------------------------------------- -------------------------- -------------- ---------------------------+ version commit_id commit_message commit_time build_email build_time --------- ----------------------------------------- --------------------------------------------------- -------------------------- -------------- ---------------------------+ 1.9.0 5cea9afa6278e21574c6a982ae5c3d82085ef904 [maven-release-plugin] prepare release drill-1.9.0 09.11.2016 @ 10:28:44 PST rhou@mapr.com 10.11.2016 @ 12:56:24 PST --------- ----------------------------------------- --------------------------------------------------- -------------------------- -------------- ---------------------------+
          rhou Robert Hou added a comment - - edited

          This table is partitioned on a string. The problem occurs with a partition that has null values. The value of the string is "NOT SPECIFIED".

          I can select every row up to the null partition using:

          select timestamp_id from orders_parts_hive limit 9026;

          But the next row is in the null partition and causes an exception.

          select timestamp_id from orders_parts_hive limit 9027;

          rhou Robert Hou added a comment - - edited This table is partitioned on a string. The problem occurs with a partition that has null values. The value of the string is "NOT SPECIFIED". I can select every row up to the null partition using: select timestamp_id from orders_parts_hive limit 9026; But the next row is in the null partition and causes an exception. select timestamp_id from orders_parts_hive limit 9027;
          rhou Robert Hou added a comment -

          The Hive table is partitioned on o_orderpriority, which is a string.

          rhou Robert Hou added a comment - The Hive table is partitioned on o_orderpriority, which is a string.
          rhou Robert Hou added a comment - - edited

          The partition only has null values for timestamp_id. Could this be an issue with empty batches? There are 3024 null values in the partition.

          rhou Robert Hou added a comment - - edited The partition only has null values for timestamp_id. Could this be an issue with empty batches? There are 3024 null values in the partition.

          Can we remove the new session option and use IMPALA_TIMESTAMP and see if it has the same issue? And run the same query on drill-1.8.0 and see if this is a regression.

          rkins Rahul Kumar Challapalli added a comment - Can we remove the new session option and use IMPALA_TIMESTAMP and see if it has the same issue? And run the same query on drill-1.8.0 and see if this is a regression.
          rhou Robert Hou added a comment - - edited

          I set the new option to false and I do not get an exception. I will try with IMPALA_TIMESTAMP.

          rhou Robert Hou added a comment - - edited I set the new option to false and I do not get an exception. I will try with IMPALA_TIMESTAMP.

          Are you sure that the data is generated using itself? You can have a hive table sitting on top of data generated by drill.

          rkins Rahul Kumar Challapalli added a comment - Are you sure that the data is generated using itself? You can have a hive table sitting on top of data generated by drill.

          Also it would be helpful if you can upload the data along with hive ddl. Assuming data is less than 10MB

          rkins Rahul Kumar Challapalli added a comment - Also it would be helpful if you can upload the data along with hive ddl. Assuming data is less than 10MB
          rhou Robert Hou added a comment -

          Yes, I created it. It is a Hive table partitioned on a string. I created it using data from a Drill table.

          rhou Robert Hou added a comment - Yes, I created it. It is a Hive table partitioned on a string. I created it using data from a Drill table.
          rhou Robert Hou added a comment -

          I'm trying to figure out how to do that. Because it is a Hive partitioned table, it has five directories, each with one file, and they all have the same name. Maybe I'll use a tar file.

          rhou Robert Hou added a comment - I'm trying to figure out how to do that. Because it is a Hive partitioned table, it has five directories, each with one file, and they all have the same name. Maybe I'll use a tar file.
          rhou Robert Hou added a comment -

          This is a Hive partitioned table. It is partitioned on o_orderpriority.

          rhou Robert Hou added a comment - This is a Hive partitioned table. It is partitioned on o_orderpriority.

          I am a little confused. How did you generate the data for the hive table? If it is generated by drill, that explains the behavior and this is not a bug

          rkins Rahul Kumar Challapalli added a comment - I am a little confused. How did you generate the data for the hive table? If it is generated by drill, that explains the behavior and this is not a bug
          rhou Robert Hou added a comment -

          Interesting.

          I exported Drill data to a tbl file. I edited the tbl file so that Hive could read it. I created a Hive table and loaded it from the tbl file. Created a parquet Hive table from the first Hive table. And then created a partitioned Hive table from the parquet Hive table.

          rhou Robert Hou added a comment - Interesting. I exported Drill data to a tbl file. I edited the tbl file so that Hive could read it. I created a Hive table and loaded it from the tbl file. Created a parquet Hive table from the first Hive table. And then created a partitioned Hive table from the parquet Hive table.
          rhou Robert Hou added a comment -

          The DDL for the partitioned Hive table:

          create table orders_parts_hive (
          o_orderkey int,
          o_custkey int,
          o_orderstatus string,
          o_totalprice double,
          o_orderdate date,
          o_clerk string,
          o_shippriority int,
          o_comment string,
          int_id int,
          bigint_id bigint,
          float_id float,
          double_id double,
          varchar_id string,
          date_id date,
          timestamp_id timestamp)
          partitioned by (o_orderpriority string)
          stored as parquet;

          rhou Robert Hou added a comment - The DDL for the partitioned Hive table: create table orders_parts_hive ( o_orderkey int, o_custkey int, o_orderstatus string, o_totalprice double, o_orderdate date, o_clerk string, o_shippriority int, o_comment string, int_id int, bigint_id bigint, float_id float, double_id double, varchar_id string, date_id date, timestamp_id timestamp) partitioned by (o_orderpriority string) stored as parquet;
          rhou Robert Hou added a comment -

          I am not sure this is a release stopper. It may be due to the fact that I have a partition that only has null values for the column.

          rhou Robert Hou added a comment - I am not sure this is a release stopper. It may be due to the fact that I have a partition that only has null values for the column.

          Based on your explanation, the parquet files are created by hive itself. Then this is a bug. But just to confirm, can you do the below checks

          1. Inspect the parquet metadata and look for "creator" field
          2. Try to run a similar query from hive and see if it succeeds

          rkins Rahul Kumar Challapalli added a comment - Based on your explanation, the parquet files are created by hive itself. Then this is a bug. But just to confirm, can you do the below checks 1. Inspect the parquet metadata and look for "creator" field 2. Try to run a similar query from hive and see if it succeeds
          rhou Robert Hou added a comment -

          ~/bin/parquet-meta 000000_0
          file: file:/root/drill-test-framework-pushdown/data/orders_parts_hive/o_orderpriority=1-URGENT/000000_0
          creator: parquet-mr version 1.6.0

          rhou Robert Hou added a comment - ~/bin/parquet-meta 000000_0 file: file:/root/drill-test-framework-pushdown/data/orders_parts_hive/o_orderpriority=1-URGENT/000000_0 creator: parquet-mr version 1.6.0
          rhou Robert Hou added a comment -

          I tried with Hive. It succeeds.

          rhou Robert Hou added a comment - I tried with Hive. It succeeds.

          Thanks rhou. This confirms that its a bug with drill. Can you also check whether it is a regression? (Use timestamp_impala with drill-1.8.0 and see if it succeeds)

          https://drill.apache.org/docs/parquet-format/#about-int96-support

          rkins Rahul Kumar Challapalli added a comment - Thanks rhou . This confirms that its a bug with drill. Can you also check whether it is a regression? (Use timestamp_impala with drill-1.8.0 and see if it succeeds) https://drill.apache.org/docs/parquet-format/#about-int96-support
          rhou Robert Hou added a comment -

          I am not able to use timestamp_impala yet. But I tried the original query with Drill 1.8, and I get zero rows back. Which makes sense, since we are not interpreting the timestamp correctly.

          select timestamp_id from orders_parts_hive where timestamp_id >= '2016-10-09 13:36:38.986' and timestamp_id <= '2016-10-09 13:45:38.986';
          ---------------

          timestamp_id

          ---------------
          ---------------

          I also tried selecting the whole column. I get bad values (known problem), but I get all the values. I don't get an exception.

          select timestamp_id from orders_parts_hive;

          rhou Robert Hou added a comment - I am not able to use timestamp_impala yet. But I tried the original query with Drill 1.8, and I get zero rows back. Which makes sense, since we are not interpreting the timestamp correctly. select timestamp_id from orders_parts_hive where timestamp_id >= '2016-10-09 13:36:38.986' and timestamp_id <= '2016-10-09 13:45:38.986'; --------------- timestamp_id --------------- --------------- I also tried selecting the whole column. I get bad values (known problem), but I get all the values. I don't get an exception. select timestamp_id from orders_parts_hive;

          People

            Unassigned Unassigned
            rhou Robert Hou
            Robert Hou Robert Hou
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: