  1. Apache Arrow
  2. ARROW-8163 [C++][Dataset] Allow FileSystemDataset's file list to be lazy
  3. ARROW-17306

[C++] Provide an optimized `GetFileInfoGenerator` specialization for `LocalFileSystem`

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 10.0.0
    • Component/s: C++

    Description

      At the moment, `LocalFileSystem` has no optimized implementation of `GetFileInfoGenerator` of its own. It falls back to the generic `FileSystem::GetFileInfoGenerator`, which simply queues the synchronous `GetFileInfo(FileSelector)` call on a background thread and waits for it to complete before yielding anything.

      This largely defeats the purpose of `GetFileInfoGenerator`: we cannot use it to push `FileInfo` items to a consumer "on the fly" (e.g. to `FileSystemDatasetFactory` and, in turn, `FileSystemDataset`).

      Provide a proper implementation that yields more than once and lets the data be retrieved in chunks, so that the resulting `FileInfoGenerator` is usable for streaming processing.
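
      To illustrate the chunked behaviour requested above, here is a minimal sketch that uses only the C++17 standard library. It is not Arrow code and not the implementation that resolved this issue: `EntryInfo` and `ChunkedLister` are hypothetical names, the listing is non-recursive, and the lister is a plain pull-style object rather than an asynchronous generator. The point is only that each `Next()` call reads a bounded number of directory entries, so a consumer can start work before the listing is complete.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical stand-in for arrow::fs::FileInfo; not the Arrow type.
struct EntryInfo {
  std::string path;
  bool is_directory;
  std::uintmax_t size;  // 0 for directories or when the size is unavailable
};

// Hypothetical pull-based lister: each Next() call reads at most `chunk_size`
// entries, instead of collecting the whole directory before returning.
class ChunkedLister {
 public:
  // Throws std::filesystem::filesystem_error if `dir` cannot be opened.
  ChunkedLister(const fs::path& dir, std::size_t chunk_size)
      : it_(dir), chunk_size_(chunk_size) {
    assert(chunk_size_ > 0);
  }

  // Returns the next chunk, or std::nullopt once the directory is exhausted.
  std::optional<std::vector<EntryInfo>> Next() {
    std::vector<EntryInfo> chunk;
    while (it_ != fs::directory_iterator{} && chunk.size() < chunk_size_) {
      const fs::directory_entry& entry = *it_;
      std::error_code ec;
      const bool is_dir = entry.is_directory(ec);
      std::uintmax_t size = 0;
      if (!is_dir) {
        size = entry.file_size(ec);
        if (ec) size = 0;  // size unknown: report 0 rather than failing
      }
      chunk.push_back({entry.path().string(), is_dir, size});
      it_.increment(ec);  // non-throwing advance
      if (ec) break;      // stop this chunk if the directory became unreadable
    }
    if (chunk.empty()) return std::nullopt;
    return chunk;
  }

 private:
  fs::directory_iterator it_;
  std::size_t chunk_size_;
};
```

      A consumer would loop with `while (auto chunk = lister.Next()) { ... }` and process each chunk as it arrives. The generic fallback described above corresponds to draining this loop into a single vector before handing anything to the consumer.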

       

            People

              Assignee: Pavel Solodovnikov (psolodovnikov)
              Reporter: Pavel Solodovnikov (psolodovnikov)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 11h 20m