Apache Drill / DRILL-5970

DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead of "REQUIRED" ones

    Description

      The root cause of the issue is that adding REQUIRED (non-nullable) data types to the container is not implemented in any of the MapWriters.

      This can lead to an invalid schema.
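
      For illustration, the difference comes down to the data mode the column's MaterializedField is created with. A minimal sketch using Drill's Types helpers (illustrative only, not the actual reader code; "Bucket" is the column from the repro below):

      import org.apache.drill.common.types.TypeProtos.MinorType;
      import org.apache.drill.common.types.Types;
      import org.apache.drill.exec.record.MaterializedField;

      // What the schema should contain for a Parquet "required binary Bucket (UTF8)" column:
      MaterializedField required =
          MaterializedField.create("Bucket", Types.required(MinorType.VARCHAR));

      // What DrillParquetReader effectively produces, because only the
      // OPTIONAL (nullable) path is implemented in the MapWriters:
      MaterializedField optional =
          MaterializedField.create("Bucket", Types.optional(MinorType.VARCHAR));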

      0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
      SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
      SLF4J: Defaulting to no-operation (NOP) logger implementation
      SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
      +-----------+----------------------------+
      | Fragment  | Number of records written  |
      +-----------+----------------------------+
      | 0_0       | 1                          |
      +-----------+----------------------------+
      1 row selected (2.376 seconds)
      

      Run the following test from the Drill unit test framework to see the data mode (for example, in a class extending BaseTestQuery, which provides testSqlWithResults() and printResult()):

      @Test
      public void test() throws Exception {
        setColumnWidths(new int[] {25, 25});
        List<QueryDataBatch> queryDataBatches = testSqlWithResults("select * from dfs.tmp.bof_repro_1");
        printResult(queryDataBatches);
      }
      
      1 row(s):
      -------------------------------------------------------
      | MYCOL<VARCHAR(REPEATED)> | Bucket<VARCHAR(OPTIONAL)>|
      -------------------------------------------------------
      | ["hello","hai"]          | Bucket1                  |
      -------------------------------------------------------
      Total record count: 1
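
      To inspect the data mode programmatically, the batch schema can be loaded with a RecordBatchLoader, continuing the test above (a sketch; the getAllocator() helper and the column index are assumptions about the test base class and the result layout):

      QueryDataBatch batch = queryDataBatches.get(0);
      RecordBatchLoader loader = new RecordBatchLoader(getAllocator());
      loader.load(batch.getHeader().getDef(), batch.getData());
      // Bucket is the second column in the result above
      MaterializedField bucket = loader.getSchema().getColumn(1);
      // The Parquet file stores Bucket as required, but the reader reports OPTIONAL
      System.out.println(bucket.getDataMode());
      loader.clear();
      batch.release();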
      
      
      vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema /tmp/bof_repro_1/0_0_0.parquet 
      message root {
        repeated binary MYCOL (UTF8);
        required binary Bucket (UTF8);
      }
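
      The same footer schema can be read programmatically with parquet-mr (a sketch; readFooter(Configuration, Path) matches the pre-1.9 API, and the org.apache.parquet package names are newer than the parquet-tools 1.6 build used above):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.parquet.hadoop.ParquetFileReader;
      import org.apache.parquet.hadoop.metadata.ParquetMetadata;

      // Read the footer schema, equivalent to the parquet-tools output above
      ParquetMetadata meta = ParquetFileReader.readFooter(
          new Configuration(), new Path("/tmp/bof_repro_1/0_0_0.parquet"));
      System.out.println(meta.getFileMetaData().getSchema());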
      

      To reproduce the wrong result, run an aggregation query that involves both the new Parquet reader (DrillParquetReader, used by default for complex data types) and the old Parquet reader. The file without complex columns is read by the old reader, which reports Bucket as REQUIRED, while the file with the repeated column is read by DrillParquetReader, which reports Bucket as OPTIONAL; the aggregation therefore sees a schema change between batches, and a false "Hash aggregate does not support schema changes" error occurs.
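
      Which reader runs can also be pinned down explicitly: forcing the new reader for every Parquet file should make both files go through the same code path and report the same data mode (a sketch from a BaseTestQuery subclass, assuming the store.parquet.use_new_reader session option):

      // Force the new (complex) Parquet reader even for files without complex columns
      test("ALTER SESSION SET `store.parquet.use_new_reader` = true");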

      1) Create two parquet files.

      0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_1 as select * from (select CONVERT_FROM('["hello","hai"]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
      +-----------+----------------------------+
      | Fragment  | Number of records written  |
      +-----------+----------------------------+
      | 0_0       | 1                          |
      +-----------+----------------------------+
      1 row selected (1.122 seconds)
      0: jdbc:drill:schema=dfs> CREATE TABLE dfs.tmp.bof_repro_2 as select * from (select CONVERT_FROM('[]','JSON') AS MYCOL, 'Bucket1' AS Bucket FROM (VALUES(1)));
      +-----------+----------------------------+
      | Fragment  | Number of records written  |
      +-----------+----------------------------+
      | 0_0       | 1                          |
      +-----------+----------------------------+
      1 row selected (0.552 seconds)
      0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
      

      2) Copy the parquet files from bof_repro_1 to bof_repro_2.

      [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_1
      Found 1 items
      -rw-r--r--   3 mapr mapr        415 2017-07-25 11:46 /tmp/bof_repro_1/0_0_0.parquet
      [root@naravm1 ~]# hadoop fs -ls /tmp/bof_repro_2
      Found 1 items
      -rw-r--r--   3 mapr mapr        368 2017-07-25 11:46 /tmp/bof_repro_2/0_0_0.parquet
      [root@naravm1 ~]# hadoop fs -cp /tmp/bof_repro_1/0_0_0.parquet /tmp/bof_repro_2/0_0_1.parquet
      [root@naravm1 ~]#
      

      3) Disable streaming aggregate (so the hash aggregate operator is used) and query the table.

      0: jdbc:drill:schema=dfs> ALTER SESSION SET  `planner.enable_streamagg`=false;
      +-------+------------------------------------+
      |  ok   |              summary               |
      +-------+------------------------------------+
      | true  | planner.enable_streamagg updated.  |
      +-------+------------------------------------+
      1 row selected (0.124 seconds)
      0: jdbc:drill:schema=dfs> select * from dfs.tmp.bof_repro_2;
      +------------------+----------+
      |      MYCOL       |  Bucket  |
      +------------------+----------+
      | ["hello","hai"]  | Bucket1  |
      | null             | Bucket1  |
      +------------------+----------+
      2 rows selected (0.247 seconds)
      0: jdbc:drill:schema=dfs> select bucket, count(*) from dfs.tmp.bof_repro_2 group by bucket;
      Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema changes
      
      Fragment 0:0
      
      [Error Id: 60f8ada3-5f00-4413-a676-4881fc8cb409 on naravm3:31010] (state=,code=0)
      

            People

              Assignee: Vitalii Diravka
              Reporter: Vitalii Diravka
              Reviewer: Salim Achouche
