[DRILL-5542] Scan unnecessarily adds implicit columns to ScanRecordBatch for select * query - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Execution - Relational Operators
Labels: None

Description

Drill appears to add several implicit columns (`fqn`, `filepath`, `filename`, `suffix`) to ScanBatch even when no downstream operator requires them. Although these implicit columns are dropped later, populating them increases both memory and CPU overhead.

1. JSON
```
{a: 100}
```

select * from dfs.tmp.`1.json`;
+------+
|  a   |
+------+
| 100  |
+------+

The schema of ScanRecordBatch:

[ schema:
    BatchSchema [fields=[fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL), a(BIGINT:OPTIONAL)], selectionVector=NONE],

2. Parquet

select * from cp.`tpch/nation.parquet`;
+--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
| n_nationkey  |     n_name      | n_regionkey  |                                                      n_comment                                                      |
+--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
| 0            | ALGERIA         | 0            |  haggle. carefully final deposits detect slyly agai                                                                 |
...

The schema of ScanRecordBatch:

  schema:
    BatchSchema [fields=[n_nationkey(INT:REQUIRED), n_name(VARCHAR:REQUIRED), n_regionkey(INT:REQUIRED), n_comment(VARCHAR:REQUIRED), fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],

3. Text

cat 1.csv
a, b, c

select * from dfs.tmp.`1.csv`;
+----------------+
|    columns     |
+----------------+
| ["a","b","c"]  |
+----------------+

The schema of ScanRecordBatch:

  schema:
    BatchSchema [fields=[columns(VARCHAR:REPEATED)[$data$(VARCHAR:REQUIRED)], fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],

If the implicit columns are not part of the result of a `select *` query, the Scan operator should not populate them.
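The requested behavior can be illustrated with a minimal sketch. It is not Drill's implementation: `ImplicitColumnSketch` and `columnsToPopulate` are hypothetical names invented for this example, and the case-insensitive name matching is an assumption. The idea is only that the scan would materialize an implicit column when the projection list names it explicitly, and none of them for a bare `*`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names are Drill classes.
final class ImplicitColumnSketch {
    // The four implicit columns named in this report.
    static final List<String> IMPLICIT =
            Arrays.asList("fqn", "filepath", "filename", "suffix");

    // Returns the implicit columns a scan would need to materialize for the
    // given projection list. A bare "*" does not request any of them.
    // Assumption: column names are matched case-insensitively.
    static List<String> columnsToPopulate(List<String> projected) {
        List<String> needed = new ArrayList<>();
        for (String implicit : IMPLICIT) {
            for (String col : projected) {
                if (col.equalsIgnoreCase(implicit)) {
                    needed.add(implicit);
                    break;
                }
            }
        }
        return needed;
    }

    public static void main(String[] args) {
        // select * ...            -> no implicit columns
        System.out.println(columnsToPopulate(Arrays.asList("*")));
        // select *, filename ...  -> only filename
        System.out.println(columnsToPopulate(Arrays.asList("*", "filename")));
    }
}
```

Under this sketch, the three `select *` queries above would yield an empty list, so the scan schema would contain only the data columns.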

People

Assignee: Unassigned

Reporter: Jinfeng Ni

Votes: 0

Watchers: 3

Dates

Created: 26/May/17 00:52

Updated: 27/May/17 14:39