Apache Drill / DRILL-5542

Scan unnecessarily adds implicit columns to ScanRecordBatch for select * queries


Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved

    Description

      It seems that Drill adds several implicit columns (`fqn`, `filepath`, `filename`, `suffix`) to ScanBatch even when no downstream operator requires them. Although those implicit columns are dropped later on, populating them increases both memory and CPU overhead.

      1. JSON
      ```

      {a: 100}

      ```

      select * from dfs.tmp.`1.json`;
      +------+
      |  a   |
      +------+
      | 100  |
      +------+
      

      The schema of ScanRecordBatch:

      [ schema:
          BatchSchema [fields=[fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL), a(BIGINT:OPTIONAL)], selectionVector=NONE], 
       

      2. Parquet

      select * from cp.`tpch/nation.parquet`;
      +--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
      | n_nationkey  |     n_name      | n_regionkey  |                                                      n_comment                                                      |
      +--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
      | 0            | ALGERIA         | 0            |  haggle. carefully final deposits detect slyly agai                                                                 |
      ...
      

      The schema of ScanRecordBatch:

        schema:
          BatchSchema [fields=[n_nationkey(INT:REQUIRED), n_name(VARCHAR:REQUIRED), n_regionkey(INT:REQUIRED), n_comment(VARCHAR:REQUIRED), fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE], 
      

      3. Text

      cat 1.csv
      a, b, c
      
      select * from dfs.tmp.`1.csv`;
      +----------------+
      |    columns     |
      +----------------+
      | ["a","b","c"]  |
      +----------------+
      

      The schema of ScanRecordBatch:

        schema:
          BatchSchema [fields=[columns(VARCHAR:REPEATED)[$data$(VARCHAR:REQUIRED)], fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE], 
      

      If implicit columns are not part of the result of a `select *` query, the Scan operator should not populate them.
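
The proposed behavior can be sketched as a small projection check. This is only an illustration in Python, not Drill's implementation: the function names `implicit_columns_to_populate` and `scan_schema` are hypothetical, and the sketch assumes that a star projection expands to the table's own columns only, as the query results above suggest.

```python
# Hypothetical sketch; none of these names come from Drill's code base.
IMPLICIT_COLUMNS = ("fqn", "filepath", "filename", "suffix")


def implicit_columns_to_populate(projected):
    """Return the implicit columns the scan actually has to materialize.

    `projected` is the list of column names in the SELECT list, with "*"
    standing for a star projection. Assumption: a star never requests
    implicit columns, so only explicitly named ones are kept.
    """
    requested = {name.lower() for name in projected}
    return [col for col in IMPLICIT_COLUMNS if col in requested]


def scan_schema(table_columns, projected):
    """Build the field list a scan batch would carry under this proposal."""
    if "*" in projected:
        # Star projection: every table column, no implicit columns.
        fields = list(table_columns)
    else:
        # Explicit projection: only the table columns that were named.
        fields = [c for c in table_columns if c in projected]
    return fields + implicit_columns_to_populate(projected)


print(scan_schema(["a"], ["*"]))              # ['a']
print(scan_schema(["a"], ["filename", "a"]))  # ['a', 'filename']
```

Under this sketch, the JSON example would produce a batch with only `a` for `select *`, while a query that names `filename` explicitly would still get that column populated.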

          People

            Assignee: Unassigned
            Reporter: Jinfeng Ni (jni)
            Votes: 0
            Watchers: 3
