Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
The scanner, in its original form, was something of a prototype query engine. It handled complex projection (beyond just casting) and filtering. Over time, features have been moved out of the scanner and into the execution engine, to the point that the scanner is now just a tool for scanning multiple files simultaneously to feed as input to an exec plan (i.e. a "scan node").
The concept of a "scanner" should mostly be removed from our public API surface. Those working directly with the execution engine will still need to know about the scan node but that should be about it.
For example, in Python we have pages like this and code like this:
import pyarrow.dataset as ds
dataset = ds.dataset('/tmp/my_dataset', format='parquet')
scanner = dataset.scanner(columns=['x'])
ds.write_dataset(scanner, '/tmp/my_new_dataset', format='parquet')
Over time, I think this will lead to confusion. It's already a little convoluted. For example, a call to dataset.to_table(...) creates a Scanner and calls ToTable with ScanOptions. This method then creates an ExecPlan and, in order to do so, must create a ScanNode. The ScanNode consumes some (but not all) of the options in ScanOptions while the ExecPlan consumes the rest.
The Scanner (if one continues to exist) should be an internal detail not visible to users. The previous code could either change to use a new term, such as query:
dataset = ds.dataset('/tmp/my_dataset', format='parquet')
query = dataset.query(columns=['x'])
ds.write_dataset(query, '/tmp/my_new_dataset', format='parquet')
Or we could use the record batch reader concept:
dataset = ds.dataset('/tmp/my_dataset', format='parquet')
record_batch_reader = dataset.to_reader(columns=['x'])
ds.write_dataset(record_batch_reader, '/tmp/my_new_dataset', format='parquet')
I would like to make some changes to the scanner in 9.0.0 and would hope to address this then, so I'm happy to hear opinions / thoughts.
Issue Links
- is depended upon by: ARROW-16410 [C++] Scanner -> ScanNode (Open)
- relates to: ARROW-16616 [Python] Allow lazy evaluation of filters in Dataset and add Dataset.filter method (Resolved)