Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 5.0.0, 6.0.0, 6.0.1, 7.0.0
- Environment: CentOS 7, gcc 9.2.0
Description
For certain data sets containing lists of structs, repeated reads of the same file yield different results. I have a file that exhibits this behavior, and the code below reproduces it:
filesystem::path filePath = dirPath / "writeReadRowGroup.parquet";
arrow::MemoryPool *pool = arrow::default_memory_pool();

// Open the Parquet file and create an Arrow reader for it.
std::shared_ptr<arrow::io::ReadableFile> infile;
PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(filePath, pool));
std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
auto status = parquet::arrow::OpenFile(infile, pool, &arrow_reader);
CHECK_OK(status);

std::shared_ptr<arrow::Schema> readSchema;
CHECK_OK(arrow_reader->GetSchema(&readSchema));

// First read: keep only the "recordList" column as the reference result.
std::shared_ptr<arrow::Table> table;
std::vector<int> indicesToGet;
CHECK_OK(arrow_reader->ReadTable(&table));
auto recordListCol1 = arrow::Table::Make(
    arrow::schema({table->schema()->GetFieldByName("recordList")}),
    {table->GetColumnByName("recordList")});

// Re-read with the same reader and compare each result to the first read.
for (int i = 0; i < 20; ++i) {
  cout << "data reread operation number = " + std::to_string(i) << endl;
  std::shared_ptr<arrow::Table> table2;
  CHECK_OK(arrow_reader->ReadTable(&table2));
  auto recordListCol2 = arrow::Table::Make(
      arrow::schema({table2->schema()->GetFieldByName("recordList")}),
      {table2->GetColumnByName("recordList")});
  bool equals = recordListCol1->Equals(*recordListCol2);
  if (!equals) {
    // Dump both tables so the differing entries can be inspected.
    cout << recordListCol1->ToString() << endl;
    cout << endl << "new table" << endl;
    cout << recordListCol2->ToString() << endl;
    throw std::runtime_error("Subsequent re-read failure ");
  }
}
Apparently, as shown in the attached capture, the state machine used to track nulls is broken on subsequent use.
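To make the suspected failure mode concrete, here is a minimal, self-contained sketch of how a stateful null tracker can return different results when it is reused without being reset. It is purely illustrative and is not Arrow's implementation: `NullRunDecoder`, `Decode`, `Reset`, `RereadMatches`, and the alternating run-length encoding are all hypothetical names and assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical decoder, NOT Arrow's implementation: expands alternating
// run lengths (valid run, null run, valid run, ...) into validity flags.
class NullRunDecoder {
 public:
  // Restores the initial state: the first run of a stream is a valid run.
  void Reset() { in_null_run_ = false; }

  // Decodes one full stream of run lengths. The state is NOT reset here,
  // so whatever state the previous call ended in carries over.
  std::vector<bool> Decode(const std::vector<std::size_t>& run_lengths) {
    std::vector<bool> validity;
    for (std::size_t length : run_lengths) {
      validity.insert(validity.end(), length, !in_null_run_);
      in_null_run_ = !in_null_run_;  // Each run flips valid <-> null.
    }
    return validity;
  }

 private:
  bool in_null_run_ = false;
};

// Reads the same stream twice with one decoder and reports whether the
// two reads produced identical validity flags.
bool RereadMatches(const std::vector<std::size_t>& run_lengths,
                   bool reset_between_reads) {
  NullRunDecoder decoder;
  const std::vector<bool> first = decoder.Decode(run_lengths);
  if (reset_between_reads) decoder.Reset();
  const std::vector<bool> second = decoder.Decode(run_lengths);
  // Both reads expand the same run lengths, so only the flags can differ.
  assert(first.size() == second.size());
  return first == second;
}
```

In this sketch, a stream with an odd number of runs, such as `{2, 1, 1}`, leaves the decoder in the null state, so `RereadMatches({2, 1, 1}, false)` returns `false` while `RereadMatches({2, 1, 1}, true)` returns `true`. Whether the actual reader fails in an analogous way is an assumption based on the description above.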
Attachments
Issue Links
- is related to: ARROW-15550 [C++] Add an environment variable to debug memory (Resolved)
- links to