  1. Apache Arrow
  2. ARROW-14047

[C++] [Parquet] FileReader returns inconsistent results on repeat reads


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0.0, 6.0.0, 6.0.1, 7.0.0
    • Fix Version/s: 7.0.2, 8.0.0
    • Component/s: C++
    • Environment: CentOS 7, gcc 9.2.0

    Description

      For certain data sets containing lists of structs, repeated reads of the same file through the same FileReader yield different results. The attached file (writeReadRowGroup.parquet) exhibits this behavior, and the code below reproduces it:

        filesystem::path filePath = dirPath / "writeReadRowGroup.parquet";
        arrow::MemoryPool *pool = arrow::default_memory_pool();
        std::shared_ptr<arrow::io::ReadableFile> infile;
        PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(filePath, pool));
        std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
        auto status = parquet::arrow::OpenFile(infile, pool, &arrow_reader);
        CHECK_OK(status);

        std::shared_ptr<arrow::Schema> readSchema;
        CHECK_OK(arrow_reader->GetSchema(&readSchema));
        std::shared_ptr<arrow::Table> table;
        std::vector<int> indicesToGet;
        CHECK_OK(arrow_reader->ReadTable(&table));

        // Baseline: the "recordList" column from the first read.
        auto recordListCol1 = arrow::Table::Make(arrow::schema({table->schema()->GetFieldByName("recordList")}),
                                                 {table->GetColumnByName("recordList")});

        // Re-read through the same reader and compare each result with the baseline.
        for (int i = 0; i < 20; ++i) {
          cout << "data reread operation number = " + std::to_string(i) << endl;
          std::shared_ptr<arrow::Table> table2;
          CHECK_OK(arrow_reader->ReadTable(&table2));
          auto recordListCol2 = arrow::Table::Make(arrow::schema({table2->schema()->GetFieldByName("recordList")}),
                                                   {table2->GetColumnByName("recordList")});
          bool equals = recordListCol1->Equals(*recordListCol2);
          if (!equals) {
            cout << recordListCol1->ToString() << endl;
            cout << endl << "new table" << endl;
            cout << recordListCol2->ToString() << endl;
            throw std::runtime_error("Subsequent re-read failure ");
          }
        }
      
      

      Apparently, as shown in the attached capture (Capture.PNG), the state machine used to track nulls does not behave correctly when the reader is used for subsequent reads.
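
      The kind of failure described here, decoder state that survives from one read into the next, can be illustrated with a small standalone sketch. This is not Arrow's implementation and makes no claim about where the actual defect lies: `ToyNullTracker`, `ReadResult`, and `RepeatReadsAgree` are hypothetical names, and the rule "a slot is valid only when its definition level equals the maximum level" is a simplification assumed for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy types for illustration only; none of this is Arrow code.
struct ReadResult {
  std::vector<bool> validity;  // true where a value is present
  std::size_t null_count = 0;  // number of null slots reported for this read
};

// Turns definition levels into validity flags and a null count.
// The null counter is a member, so it persists between Read() calls
// unless it is explicitly reset.
class ToyNullTracker {
 public:
  ToyNullTracker(int max_def_level, bool reset_each_read)
      : max_def_level_(max_def_level), reset_each_read_(reset_each_read) {
    assert(max_def_level >= 0);
  }

  ReadResult Read(const std::vector<int>& def_levels) {
    if (reset_each_read_) {
      null_count_ = 0;  // start every read from a clean state
    }
    ReadResult result;
    result.validity.reserve(def_levels.size());
    for (int level : def_levels) {
      const bool is_valid = (level == max_def_level_);
      if (!is_valid) ++null_count_;
      result.validity.push_back(is_valid);
    }
    result.null_count = null_count_;
    return result;
  }

 private:
  int max_def_level_;
  bool reset_each_read_;
  std::size_t null_count_ = 0;  // state carried across reads when not reset
};

// Reads the same levels twice through one tracker and reports whether
// both reads produced identical results.
bool RepeatReadsAgree(ToyNullTracker& tracker, const std::vector<int>& def_levels) {
  const ReadResult first = tracker.Read(def_levels);
  const ReadResult second = tracker.Read(def_levels);
  return first.validity == second.validity && first.null_count == second.null_count;
}
```

      With definition levels {2, 0, 2, 1} and a maximum level of 2, a tracker constructed with `reset_each_read = false` reports 2 nulls on the first read and 4 on the second, so `RepeatReadsAgree` returns false. The same tracker constructed with `reset_each_read = true` reports 2 nulls both times.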

       

      Attachments

        1. writeReadRowGroup.parquet
          16 kB
          Radu Teodorescu
        2. Capture.PNG
          58 kB
          Radu Teodorescu

            People

              wjones127 Will Jones
              wjones127 Will Jones

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 8.5h