Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7747

Function to determine Unknow fields / on fly generated missing fields

    XMLWordPrintableJSON

    Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.17.0
    • Fix Version/s: None
    • Component/s: Functions - Drill
    • Labels:
      None

      Description

      it would be really useful to have a function allowing to know if a field comes from an existing column or not.

      With this data:

      apache drill 1.17> SELECT * FROM dfs.test.`f1.parquet`;
      +---+--------+-------+
      | a |   b    |   c   |
      +---+--------+-------+
      | 1 | test-1 | other |
      | 2 | test-2 | null  |
      | 3 | test-3 | old   |
      +---+--------+-------+
      
      apache drill 1.17> SELECT * FROM dfs.test.`f2.parquet`;
      +----+---------+
      | a  |    b    |
      +----+---------+
      | 10 | test-10 |
      | 20 | test-20 |
      | 30 | test-30 |
      +----+---------+
      
      apache drill 1.17> SELECT *, drilltypeof(c), modeof(c) FROM dfs.test.`f*.parquet`;
      +------------+----+---------+-------+---------+----------+
      |    dir0    | a  |    b    |   c   | EXPR$1  |  EXPR$2  |
      +------------+----+---------+-------+---------+----------+
      | f1.parquet | 1  | test-1  | other | VARCHAR | NULLABLE |
      | f1.parquet | 2  | test-2  | null  | VARCHAR | NULLABLE |
      | f1.parquet | 3  | test-3  | old   | VARCHAR | NULLABLE |
      | f2.parquet | 10 | test-10 | null  | VARCHAR | NULLABLE |
      | f2.parquet | 20 | test-20 | null  | VARCHAR | NULLABLE |
      | f2.parquet | 30 | test-30 | null  | VARCHAR | NULLABLE |
      +------------+----+---------+-------+---------+----------+
      

      It will be nice to know when 'c' data is present because the column exists in the Parquet (or other type file) or if the value NULL was generated because the column was missing.

      Example a function 'origin' that take a column name and return for each row if the value was 'generated' or 'original' (other/better keyword could be choose (exist(column)=>true/false))
      Virtual Example with previous data:

      apache drill> SELECT *, drilltypeof(c), modeof(c), origin(c) AS origin FROM dfs.test.`f*.parquet`;
      +------------+----+---------+-------+---------+----------+-----------+
      |    dir0    | a  |    b    |   c   | EXPR$1  |  EXPR$2  |  origin   |
      +------------+----+---------+-------+---------+----------+-----------+
      | f1.parquet | 1  | test-1  | other | VARCHAR | NULLABLE | original |
      | f1.parquet | 2  | test-2  | null  | VARCHAR | NULLABLE | original |
      | f1.parquet | 3  | test-3  | old   | VARCHAR | NULLABLE | original |
      | f2.parquet | 10 | test-10 | null  | VARCHAR | NULLABLE | generated |
      | f2.parquet | 20 | test-20 | null  | VARCHAR | NULLABLE | generated |
      | f2.parquet | 30 | test-30 | null  | VARCHAR | NULLABLE | generated |
      +------------+----+---------+-------+---------+----------+-----------+
      

      Or maybe another way could be to have an implicit column name (like filename, filepath...) that contains the list of available "columns"

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              benj641 benj
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: