  1. Apache Arrow
  2. ARROW-3947

[Python] query distinct values of a given partition from a ParquetDataset

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.10.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels:
    • Environment: Mac OS X, Python 3.6

      Description

      Right now the values of a given partition in a `ParquetDataset` are buried inside `ParquetDataset.pieces`, which makes them inconvenient for the user to dig out. A helper function or method on the `ParquetDataset` class that performs this task would be very helpful.

      Purely a personal opinion on the name of this method: `ParquetDataset.select_distinct()`, taking `partition_name` as the positional argument, to resemble SQL's `SELECT DISTINCT column FROM table`.

      I'm not sure how to contribute here on Jira, so I created this GitHub Gist as a possible solution to this problem.
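
To make the request concrete, here is a minimal sketch of the behaviour described above. It is not the linked Gist and not an existing pyarrow API: `select_distinct` is written as a hypothetical free function, it assumes Hive-style partition directories (`key=value`), and it works on plain file paths rather than on `ParquetDataset.pieces` directly. The example paths are made up for illustration.

```python
from pathlib import PurePosixPath


def select_distinct(paths, partition_name):
    """Return the sorted distinct values of one Hive-style partition key.

    `paths` is any iterable of file paths such as
    "data/year=2018/month=01/part-0.parquet" (assumed layout).
    Values are returned as strings, exactly as they appear in the path.
    """
    values = set()
    for path in paths:
        # Look at directory components only; the last component is the file name.
        for part in PurePosixPath(path).parts[:-1]:
            key, sep, value = part.partition("=")
            if sep and key == partition_name:
                values.add(value)
    return sorted(values)


# Hypothetical file layout, for illustration only.
example_paths = [
    "data/year=2017/month=12/part-0.parquet",
    "data/year=2018/month=01/part-0.parquet",
    "data/year=2018/month=02/part-0.parquet",
]
print(select_distinct(example_paths, "year"))  # ['2017', '2018']
```

With a real dataset, the paths could presumably be collected with something like `[piece.path for piece in dataset.pieces]`, assuming each piece exposes its file path as `piece.path`. A partition name that never appears in the paths yields an empty list.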

            People

            • Assignee: Unassigned
            • Reporter: XiUpsilon Ji Xu