Apache Drill / DRILL-4614

Drill must assign one data type per column for self-describing data when querying directories


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.7.0
    • Component/s: Execution - Data Types
    • Labels: None

    Description

      When Drill reads data from a directory and infers data types on the fly,
      a single field can end up with several different data types.

      For example:

      1. Create an input file, data.json, as follows.
      20K rows of:
      {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
      followed by 200 rows of:
      {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}
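
      An input file of this shape can be generated with a short script. This is a minimal sketch, not the script used for the attached file: the function names `build_rows` and `write_input` are hypothetical, and it assumes one JSON record per line with the 200 extended rows placed last.

```python
import json


def build_rows(n_base=20_000, n_extended=200):
    """Return the JSON lines described in step 1: base rows first, extended rows last."""
    base = {
        "some": "yes",
        "others": {"other": "true", "all": "false", "sometimes": "yes"},
    }
    # The extended record adds one nested field that the base rows lack.
    extended = {
        "some": "yes",
        "others": {**base["others"], "additional": "last entries only"},
    }
    compact = (",", ":")  # no spaces, matching the records shown in the report
    rows = [json.dumps(base, separators=compact)] * n_base
    rows += [json.dumps(extended, separators=compact)] * n_extended
    return rows


def write_input(path="data.json"):
    """Write the rows to `path`, one record per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(build_rows()) + "\n")
```

      Calling `write_input()` writes data.json to the current directory; it still has to be placed where the `dfs` storage plugin can read it.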

      2. CTAS as follows

      CREATE TABLE dfs.`tmp`.`tp` as select * from dfs.`data.json` t
      

      In this case Drill creates the Parquet table as a folder containing two files.

      3. Select the data

      select t.others.additional from dfs.`tmp`.`tp` t
      

      The result of the select is a mix of EXPR$0<INT(OPTIONAL)> and EXPR$0<VARCHAR(OPTIONAL)>.

      This happens because Drill determines the column data type per file.
      The same result occurs with JSON files.
      Because streaming aggregate does not support schema changes, this issue makes it impossible to use aggregate functions on the query results.
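
      The per-file behaviour can be illustrated with a toy model. This is only a sketch and not Drill's reader code: `infer_type` is a hypothetical helper, and the fallback to INT(OPTIONAL) for a field missing from every record in a file is an assumption taken from the INT(OPTIONAL) result reported above.

```python
def infer_type(records, path):
    """Toy per-file inference for the nested field at `path`.

    Returns VARCHAR(OPTIONAL) if any record in the file holds a string there,
    otherwise the assumed default for a column the file never contains.
    """
    for rec in records:
        value = rec
        for key in path:
            # Walk down the nested maps; a missing key yields None.
            value = value.get(key) if isinstance(value, dict) else None
        if isinstance(value, str):
            return "VARCHAR(OPTIONAL)"
    return "INT(OPTIONAL)"  # assumption: default type for a missing column


base = {"some": "yes", "others": {"other": "true"}}
extended = {"some": "yes", "others": {"other": "true", "additional": "last entries only"}}

# Two "files": only the second one contains rows with the extra field.
file_a = [base] * 3
file_b = [base, extended]

types = {
    name: infer_type(recs, ("others", "additional"))
    for name, recs in [("file_a", file_a), ("file_b", file_b)]
}
```

      In this model `types` maps the two files to two different types for the same column, which is the schema change that a downstream operator such as streaming aggregate cannot handle.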

      Attachments

        1. data.json
          1.45 MB
          Vitalii Diravka

            People

              Assignee: Unassigned
              Reporter: Vitalii Diravka
              Votes: 0
              Watchers: 1
