Apache Arrow / ARROW-8221

[Python][Dataset] Expose schema inference / validation options in the factory


Details

    Description

      ARROW-8058 added options related to schema inference / validation for the Dataset factory. We should expose these in Python in the dataset(..) factory function:

      • Add a schema keyword to pass a user-specified schema (forwarded to the factory's finish method) instead of inferring the schema from (one of) the files.
      • Add a validate_schema option to toggle whether the schema is validated against the actual files.
      • Expose in some way the number of fragments to inspect when inferring or validating the schema. The best API for this is not yet clear.
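
      To make the intended semantics of the three options concrete, here is a minimal pure-Python sketch. It is a toy model, not the Arrow API: schemas are plain dicts mapping column names to type names, and finish_schema, unify, and the inspect_fragments keyword are hypothetical names chosen for illustration (the issue itself leaves the API for the fragment count open). The schema and validate_schema keywords mirror the names proposed above.

```python
from typing import Dict, List, Optional

# Toy stand-in for a real schema: column name -> type name.
Schema = Dict[str, str]


def unify(schemas: List[Schema]) -> Schema:
    """Merge schemas field by field; conflicting types raise ValueError."""
    merged: Schema = {}
    for schema in schemas:
        for name, type_name in schema.items():
            # setdefault returns the already-recorded type if the field exists.
            if merged.setdefault(name, type_name) != type_name:
                raise ValueError(
                    f"field {name!r}: {merged[name]} vs {type_name}"
                )
    return merged


def finish_schema(
    fragment_schemas: List[Schema],
    schema: Optional[Schema] = None,   # user-specified schema, skips inference
    validate_schema: bool = True,      # check the schema against the fragments
    inspect_fragments: int = 1,        # hypothetical: -1 means "all fragments"
) -> Schema:
    """Hypothetical helper: decide the dataset schema from per-file schemas."""
    if inspect_fragments < 0:
        inspected = fragment_schemas
    else:
        inspected = fragment_schemas[:inspect_fragments]

    if schema is None:
        # Inference: unify whatever was inspected (conflicts raise).
        return unify(inspected)
    if validate_schema:
        # Validation: shared fields must agree with the inspected fragments.
        unify([schema] + inspected)
    return dict(schema)


fragments = [
    {"a": "int64"},
    {"a": "int64", "b": "string"},
    {"a": "double"},
]
inferred_default = finish_schema(fragments)
inferred_two = finish_schema(fragments, inspect_fragments=2)
unchecked = finish_schema(fragments, schema={"a": "double"}, validate_schema=False)
```

      In this sketch, inspecting only the first fragment hides both the extra column b and the conflicting type of a in the third file, which is the trade-off the third bullet is about: inspecting more fragments costs more I/O but catches more mismatches up front.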

      Some relevant notes from the original PR: https://github.com/apache/arrow/pull/6687#issuecomment-604394407

            People

              Assignee: Unassigned
              Reporter: Joris Van den Bossche (jorisvandenbossche)
              Votes: 0
              Watchers: 6

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 50m