Apache Arrow / ARROW-15716

[Dataset][Python] Parse a list of fragment paths to gather filters

Details

    • Type: Wish
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version: 7.0.0
    • Fix Version: None
    • Component: Python

    Description

      Could partitioning.parse() be extended to accept a list of paths instead of only a single path?

      I pass the paths collected by file_visitor during write_dataset to downstream tasks, which then process the newly saved data. This breaks when I later overwrite a partition with existing_data_behavior="delete_matching" to consolidate small files, because the original paths no longer exist.

      Here is the output of my current approach, which builds a filter from the paths instead of reading the paths directly:

      # Fragments saved during write_dataset 
      ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 'dev/dataset/fragments/date_id=20210114/data-0.parquet']
      
      # Run partitioning.parse() on each fragment 
      [<pyarrow.compute.Expression (date_id == 20210813)>, <pyarrow.compute.Expression (date_id == 20210114)>, <pyarrow.compute.Expression (date_id == 20210114)>, <pyarrow.compute.Expression (date_id == 20210114)>]
      
      # Format those expressions into a list of tuples
      [('date_id', 'in', [20210114, 20210813])]
      
      # Convert to an expression which is used as a filter in .to_table()
      is_in(date_id, {value_set=int64:[
        20210114,
        20210813
      ], skip_nulls=false})
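
The path-to-tuple part of this workflow can be approximated with plain string handling. The sketch below does not call partitioning.parse(); it assumes hive-style key=value directory names and casts all-digit values to int. paths_to_filters is a hypothetical helper name, not a pyarrow function.

```python
from collections import defaultdict


def paths_to_filters(paths):
    """Collapse hive-style fragment paths into [(key, 'in', [values])] tuples.

    Hypothetical helper: assumes every partition segment looks like key=value.
    """
    values = defaultdict(set)
    for path in paths:
        for segment in path.split("/"):
            key, sep, value = segment.partition("=")
            if sep:  # skip segments such as 'dev' or 'data-0.parquet'
                # Assumption: all-digit values are integer partition keys.
                values[key].add(int(value) if value.isdigit() else value)
    return [(key, "in", sorted(vals)) for key, vals in values.items()]


paths = [
    "dev/dataset/fragments/date_id=20210813/data-0.parquet",
    "dev/dataset/fragments/date_id=20210114/data-2.parquet",
    "dev/dataset/fragments/date_id=20210114/data-1.parquet",
    "dev/dataset/fragments/date_id=20210114/data-0.parquet",
]
filters = paths_to_filters(paths)
```

Using a set per key removes the duplicates that arise when several fragments share one partition.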
      

      Ideally, something like filt_exp = partitioning.parse(paths) would return a single dataset expression covering all of the given paths.

            People

              Assignee: vibhatha Vibhatha Lakmal Abeykoon
              Reporter: ldacey Lance Dacey
              Votes: 0
              Watchers: 5

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h