Apache Drill / DRILL-4614

Drill must assign one data type per column for self-describing data when querying directories


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.7.0
    • Component/s: Execution - Data Types
    • Labels: None

      Description

      When Drill selects data from a directory and detects data types on the fly,
      a single field can end up with several data types.

      For example:

      1. Create an input file as follows
      20K rows with the following -
      {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
      200 rows with the following -
      {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}

      2. CTAS as follows

      CREATE TABLE dfs.`tmp`.`tp` as select * from dfs.`data.json` t
      

      In this case the Parquet table is created as a folder containing two files.

      3. Select the data

      select t.others.additional from dfs.`tmp`.`tp` t
      

      The result is a mix of EXPR$0<INT(OPTIONAL)> and EXPR$0<VARCHAR(OPTIONAL)>.
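To reproduce step 1 without the attachment, the input file can be generated with a short script. This is a minimal sketch: the function name `write_sample` and its parameters are hypothetical, and the row counts and row contents are taken from the description above.

```python
import copy
import json
from pathlib import Path

# Base row, as given in step 1 of the description.
BASE_ROW = {"some": "yes", "others": {"other": "true", "all": "false", "sometimes": "yes"}}


def write_sample(path, base_rows=20_000, extra_rows=200):
    """Write newline-delimited JSON: base rows first, then rows with the extra field.

    Returns the total number of rows written.
    """
    extended = copy.deepcopy(BASE_ROW)
    extended["others"]["additional"] = "last entries only"

    # Compact separators reproduce the rows exactly as shown in the report.
    base_line = json.dumps(BASE_ROW, separators=(",", ":"))
    extended_line = json.dumps(extended, separators=(",", ":"))

    with Path(path).open("w", encoding="utf-8") as out:
        for _ in range(base_rows):
            out.write(base_line + "\n")
        for _ in range(extra_rows):
            out.write(extended_line + "\n")
    return base_rows + extra_rows
```

Calling `write_sample("data.json")` writes the 20K base rows followed by the 200 rows that carry the `additional` field, so the field appears only at the end of the file.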

      The mixed types occur because Drill determines column data types per file.
      The same result occurs with JSON files.
      Because the streaming aggregate does not support schema changes, this issue makes it impossible to use aggregate functions on the query results.
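The per-file behavior can be illustrated with a toy model. This is not Drill's code: `infer_column_type` is a hypothetical helper, and the rule that a field missing from a whole file is reported as optional INT is an assumption based on the output shown above.

```python
def infer_column_type(rows, path):
    """Return a type label for a nested field, judged from one file's rows only."""
    for row in rows:
        value = row
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                break
        if isinstance(value, str):
            return "VARCHAR(OPTIONAL)"
    # Assumption: a field absent from every row of the file falls back to optional INT.
    return "INT(OPTIONAL)"


# Two hypothetical output files: only the second contains the extra field.
files = {
    "file_a": [{"some": "yes", "others": {"other": "true"}}],
    "file_b": [{"some": "yes", "others": {"other": "true", "additional": "last entries only"}}],
}

column = ("others", "additional")
types = {name: infer_column_type(rows, column) for name, rows in files.items()}

# The same column resolves to two different types, one per file.
schema_changed = len(set(types.values())) > 1
```

Under these assumptions the two files disagree on the column's type, which is the schema change that the streaming aggregate cannot handle.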

        Attachments

        1. data.json
          1.45 MB
          Vitalii Diravka


              People

              • Assignee: Unassigned
              • Reporter: Vitalii Diravka
              • Votes: 0
              • Watchers: 1
