Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
ARROW-8058 added options related to schema inference and validation for the Dataset factory. We should expose these in Python in the dataset(..) factory function:
- Add the ability to pass a user-specified schema via a schema keyword, instead of inferring the schema from (one of) the files. The schema would be passed to the factory's finish method.
- Add a validate_schema option to toggle whether the schema is validated against the actual files.
- Expose in some way the number of fragments to be inspected when inferring or validating the schema. Not sure yet what the best API for this would be.
Some relevant notes from the original PR: https://github.com/apache/arrow/pull/6687#issuecomment-604394407
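To make the intended semantics of the three options concrete, here is a rough plain-Python model of how they could interact. It is only a sketch and does not use pyarrow: schemas are stood in for by dicts mapping column name to type name, and the names resolve_schema, unify and inspect_fragments are hypothetical, not an existing or proposed API. It also assumes that "validation" means shared fields must have matching types, while fields missing from a fragment are allowed.

```python
from typing import Dict, List, Optional

# Stand-in for an Arrow schema: column name -> type name (illustration only).
Schema = Dict[str, str]


def unify(schemas: List[Schema]) -> Schema:
    """Merge schemas field by field; raise if a field has conflicting types."""
    merged: Schema = {}
    for schema in schemas:
        for name, typ in schema.items():
            existing = merged.setdefault(name, typ)
            if existing != typ:
                raise ValueError(f"field {name!r}: {existing} vs {typ}")
    return merged


def resolve_schema(
    fragment_schemas: List[Schema],
    schema: Optional[Schema] = None,
    validate_schema: bool = True,
    inspect_fragments: Optional[int] = 1,
) -> Schema:
    """Hypothetical model of the schema / validate_schema / inspection options.

    inspect_fragments=None means "inspect every fragment".
    """
    # Slicing with None as the stop value keeps all fragments.
    inspected = fragment_schemas[:inspect_fragments]
    if schema is None:
        # No user schema: infer one from the inspected fragments.
        if not inspected:
            raise ValueError("cannot infer a schema without inspecting a fragment")
        return unify(inspected)
    if validate_schema:
        # User schema given: check it against the inspected fragments.
        unify([schema] + inspected)
    # The user-specified schema is returned unchanged, not the merged one.
    return dict(schema)


fragments = [
    {"a": "int64"},
    {"a": "int64", "b": "string"},
    {"a": "string"},  # conflicts with the first two fragments
]
first_only = resolve_schema(fragments)
first_two = resolve_schema(fragments, inspect_fragments=2)
unchecked = resolve_schema(fragments, schema={"a": "int64"}, validate_schema=False)
```

In this model, inspecting only the first fragment hides both the extra column b and the type conflict in the third fragment, which is why the number of inspected fragments matters for inference as well as for validation.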
Issue Links
- duplicates
  - ARROW-8964 [Python][Parquet] improve reading of partitioned parquet datasets whose schema changed (Closed)
  - ARROW-9455 [Python] add option for taking all columns from all files in pa.dataset (Closed)
- is duplicated by
  - ARROW-17308 ValueError: Keyword 'validate_schema' is not yet supported with the new Dataset API (Closed)
- links to