Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
Drill appears to add several implicit columns (`fqn`, `filepath`, `filename`, `suffix`) to the ScanBatch output even when no downstream operator requires them. Although these implicit columns are dropped later, populating them increases both memory and CPU overhead.
1. JSON
```
select * from dfs.tmp.`1.json`;
+------+
| a    |
+------+
| 100  |
+------+
```
The schema from ScanRecordBatch is:
[ schema: BatchSchema [fields=[fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL), a(BIGINT:OPTIONAL)], selectionVector=NONE],
2. Parquet
select * from cp.`tpch/nation.parquet`;
+--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
| n_nationkey | n_name | n_regionkey | n_comment |
+--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
| 0 | ALGERIA | 0 | haggle. carefully final deposits detect slyly agai |
...
The schema of ScanRecordBatch:
schema: BatchSchema [fields=[n_nationkey(INT:REQUIRED), n_name(VARCHAR:REQUIRED), n_regionkey(INT:REQUIRED), n_comment(VARCHAR:REQUIRED), fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
3. Text
cat 1.csv
a, b, c
select * from dfs.tmp.`1.csv`;
+----------------+
| columns        |
+----------------+
| ["a","b","c"]  |
+----------------+
The schema of ScanRecordBatch:
schema: BatchSchema [fields=[columns(VARCHAR:REPEATED)[$data$(VARCHAR:REQUIRED)], fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
If the implicit columns are not part of the result of a `select *` query, the Scan operator should not populate them.
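The requested behavior could be sketched roughly as follows. This is only an illustration, not Drill's actual code: the class `ImplicitColumnSelection` and the method `implicitColumnsToPopulate` are hypothetical names, and the sketch assumes the scan can see the query's projection list and that column names are matched case-insensitively.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Drill: decides which implicit
// columns a scan needs to populate for a given projection list.
public class ImplicitColumnSelection {

    // The implicit column names listed in this issue.
    static final List<String> IMPLICIT_COLUMNS =
            Arrays.asList("fqn", "filepath", "filename", "suffix");

    // Returns only the implicit columns that are named explicitly in the
    // projection. A bare "*" matches none of them, so a star query would
    // populate no implicit columns.
    static Set<String> implicitColumnsToPopulate(List<String> projectedColumns) {
        Set<String> needed = new LinkedHashSet<>();
        for (String column : projectedColumns) {
            // Assumption: column names are compared case-insensitively.
            String name = column.toLowerCase(Locale.ROOT);
            if (IMPLICIT_COLUMNS.contains(name)) {
                needed.add(name);
            }
        }
        return needed;
    }

    public static void main(String[] args) {
        // select * ... : nothing to populate
        System.out.println(implicitColumnsToPopulate(Arrays.asList("*")));
        // select a, filename ... : only filename is needed
        System.out.println(implicitColumnsToPopulate(Arrays.asList("a", "filename")));
    }
}
```

With a check of this kind, the scans in the three examples above would emit only the data columns (`a`, the `n_*` columns, or `columns`), and an implicit column would be materialized only when a query names it explicitly.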