  Apache Arrow / ARROW-10758

[C++] Arrow Dataset Loading CSV format file from S3

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: C++
    • Labels: None

    Description

      I am using `S3FileSystem` together with `CsvFileFormat` in the Arrow dataset API to load all CSV files in an S3 bucket.

      The main test code is below:

       

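      // CSV format shared by every file discovered in the dataset.
      auto format = std::make_shared<CsvFileFormat>();
      std::string output_path;
      // Resolve the S3 filesystem; output_path receives the path part of the URI.
      auto s3_file_system = arrow::fs::FileSystemFromUri("s3://test-csv-bucket", &output_path).ValueOrDie();
      
      FileSystemFactoryOptions options;
      options.partition_base_dir = output_path;
      
      // Default-constructed selector: base_dir is empty and recursive is false.
      arrow::fs::FileSelector _file_selector;
      
      ASSERT_OK_AND_ASSIGN(auto factory,
                           FileSystemDatasetFactory::Make(s3_file_system, _file_selector, format, options));
      
      // The failure described in this report occurs inside this call.
      ASSERT_OK_AND_ASSIGN(auto schema, factory->Inspect());
      
      ASSERT_OK_AND_ASSIGN(auto dataset, factory->Finish(schema));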
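
To make explicit what the snippet above passes as `partition_base_dir`, here is a minimal standard-library sketch of how an `s3://` URI is assumed to map to a bucket-qualified path. `SplitS3Uri` and `S3Location` are hypothetical names used only for illustration; they are not Arrow APIs. The sketch assumes that `FileSystemFromUri` stores the path without the scheme and without a trailing slash in `output_path`, and it ignores credentials or query parameters that a real URI may carry.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper types, not part of Arrow.
struct S3Location {
  std::string bucket;  // first path segment after "s3://"
  std::string path;    // bucket-qualified path, e.g. "bucket/key/prefix"
};

// Splits "s3://bucket/key/prefix" into the bucket name and the
// bucket-qualified path. This only mimics the assumed behaviour of
// FileSystemFromUri's output path; it is not Arrow's implementation.
S3Location SplitS3Uri(const std::string& uri) {
  const std::string scheme = "s3://";
  if (uri.compare(0, scheme.size(), scheme) != 0) {
    throw std::invalid_argument("not an s3:// URI: " + uri);
  }
  std::string rest = uri.substr(scheme.size());
  // Drop trailing slashes so "s3://bucket/" and "s3://bucket" agree.
  while (!rest.empty() && rest.back() == '/') {
    rest.pop_back();
  }
  if (rest.empty()) {
    throw std::invalid_argument("missing bucket name: " + uri);
  }
  const std::size_t slash = rest.find('/');
  S3Location loc;
  loc.bucket = rest.substr(0, slash);  // whole string when no '/' is present
  loc.path = rest;
  return loc;
}
```

Under this assumption, `output_path` for the URI in this report would be `test-csv-bucket`. The default-constructed `_file_selector` leaves `base_dir` empty, so the selector itself is not tied to that path.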
      
      

      However, when `ASSERT_OK_AND_ASSIGN(auto schema, factory->Inspect());` is called, the process aborts while reading a file from the S3 bucket. The stack trace is as follows:

       

       

      __pthread_kill 0x00007fff70dc033a
      pthread_kill 0x00007fff70e7ce60
      abort 0x00007fff70d47808
      malloc_vreport 0x00007fff70e3d50b
      malloc_report 0x00007fff70e4040f
      Aws::Free(void*) AWSMemory.cpp:97
      std::__1::enable_if<std::is_polymorphic<std::__1::basic_iostream<char, std::__1::char_traits<char> > >::value, void>::type Aws::Delete<std::__1::basic_iostream<char, std::__1::char_traits<char> > >(std::__1::basic_iostream<char, std::__1::char_traits<char> >*) AWSMemory.h:119
      Aws::Utils::Stream::ResponseStream::ReleaseStream() ResponseStream.cpp:62
      Aws::Utils::Stream::ResponseStream::~ResponseStream() ResponseStream.cpp:54
      Aws::Utils::Stream::ResponseStream::~ResponseStream() ResponseStream.cpp:53
      Aws::S3::Model::GetObjectResult::~GetObjectResult() GetObjectResult.h:30
      Aws::S3::Model::GetObjectResult::~GetObjectResult() GetObjectResult.h:30
      arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt(long long, long long, void*) s3fs.cc:724
      arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt(long long, long long) s3fs.cc:735
      arrow::dataset::OpenReader(arrow::dataset::FileSource const&, arrow::dataset::CsvFileFormat const&, std::__1::shared_ptr<arrow::dataset::ScanOptions> const&, arrow::MemoryPool*) file_csv.cc:119
      arrow::dataset::CsvFileFormat::Inspect(arrow::dataset::FileSource const&) const file_csv.cc:182
      arrow::dataset::FileSystemDatasetFactory::InspectSchemas(arrow::dataset::InspectOptions) discovery.cc:219
      arrow::dataset::DatasetFactory::Inspect(arrow::dataset::InspectOptions) discovery.cc:41
      

       

      Does the Arrow dataset API support reading CSV, Parquet, and IPC files through `S3FileSystem`?

       

            People

              Assignee: Unassigned
              Reporter: Lynch Wu (BitStream)
              Votes: 0
              Watchers: 3