  1. Apache Drill
  2. DRILL-4614

Drill must assign a single data type per column for self-describing data when querying directories


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.7.0
    • Component/s: Execution - Data Types
    • Labels: None

    Description

      When Drill reads data from a directory and detects data types on the fly,
      a single field can end up with several different data types.

      For example:

      1. Create an input file as follows:
      20K rows of the following:
      {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
      followed by 200 rows of the following:
      {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}
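To reproduce step 1 without the attachment, an equivalent input file can be generated with a short script. This is a minimal sketch: the function name `write_sample` and its parameters are made up for illustration, and it assumes one JSON object per line with the 200 extended rows placed last.

```python
import json


def write_sample(path, base_rows=20_000, extra_rows=200):
    """Write base_rows records without 'additional', then extra_rows with it.

    Hypothetical helper, not part of Drill. Returns the number of rows written.
    """
    base = {"some": "yes",
            "others": {"other": "true", "all": "false", "sometimes": "yes"}}
    # Same record, with the extra nested field appended at the end.
    extra = {"some": "yes",
             "others": {**base["others"], "additional": "last entries only"}}

    with open(path, "w", encoding="utf-8") as f:
        for _ in range(base_rows):
            # Compact separators match the formatting shown in the report.
            f.write(json.dumps(base, separators=(",", ":")) + "\n")
        for _ in range(extra_rows):
            f.write(json.dumps(extra, separators=(",", ":")) + "\n")
    return base_rows + extra_rows


if __name__ == "__main__":
    write_sample("data.json")
```

Running the script writes `data.json` to the current directory; copy it to whatever location the `dfs` storage plugin is configured to read from.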

      2. Run a CTAS (CREATE TABLE AS SELECT) as follows:

      CREATE TABLE dfs.`tmp`.`tp` as select * from dfs.`data.json` t
      

      In this case the Parquet table is created as a folder containing two files.

      3. Select the data

      select t.others.additional from dfs.`tmp`.`tp` t
      

      The result is a mix of EXPR$0<INT(OPTIONAL)> and EXPR$0<VARCHAR(OPTIONAL)>.

      This happens because Drill determines the column data type per file.
      The same result occurs with JSON files.
      Because the streaming aggregate operator does not support schema changes, this issue makes it impossible to use aggregate functions on such query results.
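The per-file behaviour described above can be illustrated with a toy model. This is only a sketch and not Drill's actual implementation: `lookup` and `infer_type` are hypothetical helpers, and the model assumes that a column absent from every row of a file is reported as INT(OPTIONAL), which matches the INT(OPTIONAL) seen in the result above. It also assumes that one of the two files contains only rows without the `additional` field.

```python
def lookup(row, path):
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    for key in path.split("."):
        if not isinstance(row, dict) or key not in row:
            return None
        row = row[key]
    return row


def infer_type(rows, path):
    """Toy per-file type inference for one column (hypothetical, not Drill code)."""
    values = [v for v in (lookup(r, path) for r in rows) if v is not None]
    if not values:
        # Assumption: a column missing from the whole file defaults to nullable INT.
        return "INT(OPTIONAL)"
    if all(isinstance(v, str) for v in values):
        return "VARCHAR(OPTIONAL)"
    raise ValueError("mixed value types within one file are outside this sketch")


base = {"some": "yes",
        "others": {"other": "true", "all": "false", "sometimes": "yes"}}
extra = {"some": "yes",
         "others": {**base["others"], "additional": "last entries only"}}

# Two files of one table: the first never sees the 'additional' field.
file_a = [base, base, base]
file_b = [extra, extra]

types = {infer_type(f, "others.additional") for f in (file_a, file_b)}
schema_changed = len(types) > 1  # True: the column has two types across files
```

Because each file is typed independently, the scan of the directory emits batches with two different types for the same column, which is the schema change that the streaming aggregate cannot handle.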

      Attachments

        1. data.json
          1.45 MB
          Vitalii Diravka

          People

            Assignee: Unassigned
            Reporter: Vitalii Diravka