[ARROW-8884] [C++] Listing files with S3FileSystem is slow - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: C++
Labels:
- filesystem

External issue URL:
https://github.com/apache/arrow/issues/25019

Description

Listing files on S3 is slow due to the recursive nature of the algorithm.

The following change modifies the behavior of the S3Result to include all objects but no "grouping" (directories). This dramatically lowers the number of HTTP calls, since the listing no longer needs a separate request for each directory.

diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 70c87f46ec..98a40b17a2 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -986,7 +986,7 @@ class S3FileSystem::Impl {
     if (!prefix.empty()) {
       req.SetPrefix(ToAwsString(prefix) + kSep);
     }
-    req.SetDelimiter(Aws::String() + kSep);
+    // req.SetDelimiter(Aws::String() + kSep);
     req.SetMaxKeys(kListObjectsMaxKeys);
 
     while (true) {

The suggested change is to add an option to Selector, e.g. `no_directory_result` or something similar.
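
To illustrate why dropping the delimiter reduces the number of calls, here is a minimal sketch that simulates both listing strategies against an in-memory key set. Everything in it is an assumption made for illustration: `FakeBucket`, `SelectorSketch`, `WalkRecursive` and `ListFiles` are hypothetical names, not Arrow or AWS SDK APIs, and pagination (`kListObjectsMaxKeys`) is ignored, so each simulated call stands for at least one real HTTP request.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a bucket: a sorted set of object keys.
struct FakeBucket {
  struct ListResult {
    std::vector<std::string> objects;          // keys returned directly
    std::vector<std::string> common_prefixes;  // grouped "directories"
  };

  std::set<std::string> keys;
  mutable std::size_t request_count = 0;  // number of simulated list calls

  // Mimics a single list request (pagination ignored).
  // Empty delimiter: every key under `prefix` is returned as an object.
  // Non-empty delimiter: keys containing the delimiter after the prefix
  // are rolled up into common prefixes instead.
  ListResult List(const std::string& prefix, const std::string& delimiter) const {
    ++request_count;
    ListResult out;
    std::set<std::string> groups;
    for (auto it = keys.lower_bound(prefix); it != keys.end(); ++it) {
      const std::string& key = *it;
      if (key.compare(0, prefix.size(), prefix) != 0) break;  // past the prefix range
      const std::size_t pos = delimiter.empty()
                                  ? std::string::npos
                                  : key.find(delimiter, prefix.size());
      if (pos == std::string::npos) {
        out.objects.push_back(key);
      } else {
        groups.insert(key.substr(0, pos + delimiter.size()));
      }
    }
    out.common_prefixes.assign(groups.begin(), groups.end());
    return out;
  }
};

// Delimiter-based walk: one list call per directory level visited.
void WalkRecursive(const FakeBucket& bucket, const std::string& prefix,
                   std::vector<std::string>* out) {
  assert(out != nullptr);
  const FakeBucket::ListResult result = bucket.List(prefix, "/");
  out->insert(out->end(), result.objects.begin(), result.objects.end());
  for (const std::string& dir : result.common_prefixes) {
    WalkRecursive(bucket, dir, out);
  }
}

// Hypothetical selector carrying the option proposed in this issue.
struct SelectorSketch {
  std::string base_dir;
  bool no_directory_result = false;
};

// Chooses the flat listing (no delimiter) when the option is set,
// and the recursive delimiter-based walk otherwise.
std::vector<std::string> ListFiles(const FakeBucket& bucket,
                                   const SelectorSketch& sel) {
  std::string prefix;
  if (!sel.base_dir.empty()) prefix = sel.base_dir + "/";
  std::vector<std::string> files;
  if (sel.no_directory_result) {
    files = bucket.List(prefix, "").objects;
  } else {
    WalkRecursive(bucket, prefix, &files);
  }
  return files;
}
```

For a bucket holding the keys `5`, `a/1`, `a/2`, `a/b/3` and `c/4`, the recursive walk in this simulation issues four list calls (root, `a/`, `a/b/`, `c/`), while the flat listing issues one and returns the same five keys. The trade-off is the one noted above: the flat result contains no directory entries.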

Issue Links

relates to: ARROW-10788 [C++] Make S3 recursive walks parallel (Resolved)

People

Assignee: Unassigned

Reporter: Francois Saint-Jacques

Votes: 0

Watchers: 3

Dates

Created: 21/May/20 18:10

Updated: 11/Jan/23 08:03