Apache Arrow / ARROW-13755

[Python] Allow usage of field_names in partitioning when saving datasets

Details

    Description

      When loading a dataset back, you can specify just the names of the columns the data was partitioned by:

      partitioning=pyarrow.dataset.partitioning(field_names=["year"])
      

      This is convenient because it is easier and quicker than providing the whole schema, which can still be autodetected from the loaded data.

      On the other hand, this is not supported when saving data. If you provide field_names instead of a schema, the write fails with a ValueError:

      pyarrow/dataset.py in _ensure_write_partitioning(scheme)
          684     if not isinstance(scheme, Partitioning):
          685         # TODO support passing field names, and get types from schema
      --> 686         raise ValueError("partitioning needs to be actual Partitioning object")
          687     return scheme
          688 
      

      It would be convenient to allow field_names alone when saving as well, since the schema can be detected automatically from the table being saved.

            People

              amol- Alessandro Molina
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 8h