  1. Apache Arrow
  2. ARROW-15281

[C++] Implement ability to retrieve fragment filename

Details

    Description

      A user has requested the ability to include the filename of the CSV in the dataset output - see discussion on ARROW-15260 for more context.

      Relevant info from that ticket:

       
      "From a C++ perspective we've got many of the pieces needed already. One challenge is that the datasets API is written to work with "fragments" and not "files". For example, a dataset might be an in-memory table in which case we are working with InMemoryFragment and not FileFragment so there is no concept of "filename".

      That being said, the low level ScanBatchesAsync method actually returns a generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is a struct with the record batch as well as the source fragment for that record batch.

      So if you were to execute scan, you could inspect the fragment and, if it is a FileFragment, you could extract the filename.

      Another challenge is that R is moving towards more and more access through an exec plan and not directly using a scanner. In order for that to work we would need to augment the scan results with the filename in C++ before sending into the exec plan. Luckily, we already do this a bit as well. We currently augment the scan results with fragment index, batch index, and whether the batch is the last batch in the fragment.

      Since ExecBatch can work with constants efficiently, I don't think there will be much performance cost in always including the filename. So the work remaining is simply to add a new augmented field, __fragment_source_name, which is always attached if the underlying fragment is backed by a file. Users can then get this field if they want by including "__fragment_source_name" in the list of columns they query for."
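
The approach described in the quoted comment can be sketched roughly as follows. This is a minimal, self-contained model and not the Arrow implementation: `Fragment`, `FileFragment`, `InMemoryFragment`, and `TaggedRecordBatch` here are simplified stand-ins that only mirror the relationships named in the ticket, while `SourceName`, `Augment`, and `AugmentedFields` are hypothetical helpers introduced for illustration. The index field names are assumed to follow the same `__` prefix pattern as the proposed field.

```cpp
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the dataset concepts in the ticket.
// These are NOT the Arrow classes; they only model "a batch knows its fragment".
struct Fragment {
  virtual ~Fragment() = default;
};

// A fragment held in memory: it has no notion of a filename.
struct InMemoryFragment : Fragment {};

// A fragment backed by a file on some filesystem.
struct FileFragment : Fragment {
  explicit FileFragment(std::string p) : path(std::move(p)) {}
  std::string path;
};

// A batch paired with the fragment it was read from.
struct TaggedRecordBatch {
  std::vector<int> values;  // placeholder for real columnar data
  std::shared_ptr<Fragment> fragment;
};

// Hypothetical helper: returns the file path only when the batch
// originated from a file-backed fragment, and nothing otherwise.
std::optional<std::string> SourceName(const TaggedRecordBatch& batch) {
  if (auto* file = dynamic_cast<const FileFragment*>(batch.fragment.get())) {
    return file->path;
  }
  return std::nullopt;
}

// Per-batch constant metadata, keyed by column name (one value per batch).
using AugmentedFields = std::map<std::string, std::string>;

// Hypothetical helper: attaches the scan metadata the ticket mentions, plus
// the source name when (and only when) the fragment is file-backed.
AugmentedFields Augment(const TaggedRecordBatch& batch, int fragment_index,
                        int batch_index, bool last_in_fragment) {
  AugmentedFields fields{
      {"__fragment_index", std::to_string(fragment_index)},
      {"__batch_index", std::to_string(batch_index)},
      {"__last_in_fragment", last_in_fragment ? "true" : "false"},
  };
  if (auto name = SourceName(batch)) {
    fields["__fragment_source_name"] = *name;
  }
  return fields;
}
```

In this model, a caller who wants the filename looks up "__fragment_source_name" in the augmented fields. Batches from an in-memory fragment simply do not carry that key, which matches the point that not every fragment has a filename.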

            People

              sanjibansg Sanjiban Sengupta
              thisisnic Nicola Crane
              Votes: 0
              Watchers: 5

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 9h 10m