Apache Arrow / ARROW-15281

[C++] Implement ability to retrieve fragment filename




      A user has requested the ability to include the filename of the CSV in the dataset output - see discussion on ARROW-15260 for more context.

      Relevant info from that ticket:

      "From a C++ perspective we've got many of the pieces needed already. One challenge is that the datasets API is written to work with "fragments" and not "files". For example, a dataset might be an in-memory table in which case we are working with InMemoryFragment and not FileFragment so there is no concept of "filename".

      That being said, the low level ScanBatchesAsync method actually returns a generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is a struct with the record batch as well as the source fragment for that record batch.

      So if you were to execute scan, you could inspect the fragment and, if it is a FileFragment, you could extract the filename.

      Another challenge is that R is moving towards more and more access through an exec plan and not directly using a scanner. In order for that to work we would need to augment the scan results with the filename in C++ before sending into the exec plan. Luckily, we already do this a bit as well. We currently augment the scan results with fragment index, batch index, and whether the batch is the last batch in the fragment.

      Since ExecBatch can work with constants efficiently I don't think there will be much performance cost in always including the filename. So the work remaining is simply to add a new augmented field __fragment_source_name which is always attached if the underlying fragment is a filename. Then users can get this field if they want by including "__fragment_source_name" in the list of columns they query for."
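
      The idea in the quoted comment can be sketched in a few lines of C++17. The sketch below is not Arrow code: Fragment, FileFragment, InMemoryFragment, TaggedBatch and AugmentedBatch are simplified, hypothetical stand-ins for the dataset classes, and SourceName and Augment are illustrative helper names. It only shows the two steps described above, assuming a fragment can be tested for being file-backed and that a constant column is stored once per batch rather than once per row. The field name __fragment_source_name is the one proposed in this ticket.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the dataset fragment classes (NOT the Arrow API).
struct Fragment {
  virtual ~Fragment() = default;
};
struct InMemoryFragment : Fragment {};
struct FileFragment : Fragment {
  explicit FileFragment(std::string p) : path(std::move(p)) {}
  std::string path;
};

// Stand-in for TaggedRecordBatch: the batch data plus its source fragment.
struct TaggedBatch {
  std::vector<std::int64_t> values;  // one data column, for brevity
  std::shared_ptr<Fragment> fragment;
};

// Step 1: inspect the fragment. Only a file-backed fragment has a filename;
// an in-memory fragment yields no value.
std::optional<std::string> SourceName(const TaggedBatch& tagged) {
  if (auto* file = dynamic_cast<const FileFragment*>(tagged.fragment.get())) {
    return file->path;
  }
  return std::nullopt;
}

// Stand-in for an ExecBatch: row data plus constant columns, each stored as a
// single value no matter how many rows the batch holds.
struct AugmentedBatch {
  std::vector<std::int64_t> values;
  std::map<std::string, std::string> constants;
};

// Field name proposed in the ticket.
constexpr const char* kSourceNameField = "__fragment_source_name";

// Step 2: attach the filename as a constant column when one is available.
AugmentedBatch Augment(const TaggedBatch& tagged) {
  AugmentedBatch out;
  out.values = tagged.values;
  if (auto name = SourceName(tagged)) {
    out.constants[kSourceNameField] = *name;
  }
  return out;
}
```

      In this sketch, a batch read from a FileFragment gains one constant entry under __fragment_source_name, while a batch from an InMemoryFragment is passed through with no extra column. A query would then pick the field up only if it asks for that column by name.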





              sanjibansg Sanjiban Sengupta
              thisisnic Nicola Crane



                Time Tracking

                  Original Estimate - Not Specified
                  Remaining Estimate - 0h
                  Time Spent - 9h 10m