
SPARK-24230: With Parquet 1.10 upgrade has errors in the vectorized reader


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.3.1, 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      When reading some Parquet files you can get an error like:

      java.io.IOException: expecting more rows but reached last block. Read 0 out of 1194236

      This happens when looking for a needle that's pretty rare in a large haystack, i.e. when a highly selective pushed-down filter lets most row groups be skipped.
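      For context, here is a minimal sketch of the kind of job that hits this, assuming a large Parquet table with many row groups and a very selective pushed-down filter; the path, column name, and class name below are made up:

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;

      public class SelectiveParquetScan {
        public static void main(String[] args) {
          SparkSession spark = SparkSession.builder()
              .appName("SPARK-24230 repro sketch")
              .getOrCreate();

          // A very selective predicate over a big file: most row groups can be
          // skipped by row-group filtering, which is where the row-count
          // bookkeeping described below goes wrong.
          Dataset<Row> needles = spark.read()
              .parquet("/data/haystack")     // hypothetical path
              .filter("id = 123456789");     // rare "needle" value

          needles.show();
          spark.stop();
        }
      }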

       

      I believe the issue here is that the total row count is calculated at

      https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L229

       

      But we pass the blocks we filtered via 

      org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups

      to the ParquetFileReader constructor.

       

      However, the ParquetFileReader constructor will filter the list of blocks again using

       

      https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L737

       

      If a block is filtered out by the latter method but not the former, the vectorized reader will believe it should see more rows than it can actually read.
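      To make that concrete, here is an illustrative sketch (not the actual Spark source; names are simplified) of the bookkeeping that turns the over-count into the IOException above: the reader keeps expecting rows until totalRowCount is reached, so an inflated totalRowCount eventually runs past the last row group.

      import java.io.IOException;

      class RowCountBookkeepingSketch {
        private long totalRowCount;   // computed up front from the block list
        private long rowsReturned;    // rows produced so far
        private int rowGroupsRead;    // row groups consumed so far
        private int totalRowGroups;   // row groups the reader actually holds

        // Called before decoding the next batch; with an inflated totalRowCount
        // we still "expect more rows" after every available row group is read.
        void checkEndOfRowGroup() throws IOException {
          if (rowsReturned < totalRowCount && rowGroupsRead == totalRowGroups) {
            throw new IOException("expecting more rows but reached last block. Read "
                + rowsReturned + " out of " + totalRowCount);
          }
        }
      }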

      The fix I used locally is pretty straightforward:

      for (BlockMetaData block : blocks) {
        this.totalRowCount += block.getRowCount();
      }
      

      becomes

      this.totalRowCount = this.reader.getRecordCount();
      

      rdblue, do you know if this sounds right? Could the second filter method in ParquetFileReader filter out more blocks than the first, leading to the count being off?
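      One way to see the two counts side by side, sketched with the parquet-mr 1.10 API (the file path is made up): sum the row counts of the block list taken from the footer, then compare with what the opened ParquetFileReader reports. In the failing case, the list Spark sums is the one pre-filtered with RowGroupFilter.filterRowGroups, while reader.getRecordCount() reflects any additional filtering done in the constructor, so the summed value can be larger.

      import java.io.IOException;
      import java.util.List;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.parquet.hadoop.ParquetFileReader;
      import org.apache.parquet.hadoop.metadata.BlockMetaData;

      public class RowCountCheck {
        public static void main(String[] args) throws IOException {
          Configuration conf = new Configuration();
          Path file = new Path("/data/haystack/part-00000.parquet");  // hypothetical

          try (ParquetFileReader reader = ParquetFileReader.open(conf, file)) {
            // Old approach: sum the row counts of a block list computed outside
            // the reader (here, simply all blocks from the footer).
            long summed = 0L;
            List<BlockMetaData> blocks = reader.getFooter().getBlocks();
            for (BlockMetaData block : blocks) {
              summed += block.getRowCount();
            }

            // Proposed approach: ask the reader, which only counts the row
            // groups it actually kept after its own filtering.
            long fromReader = reader.getRecordCount();

            System.out.println("sum over block list:     " + summed);
            System.out.println("reader.getRecordCount(): " + fromReader);
          }
        }
      }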


          People

            Assignee: Ryan Blue (rdblue)
            Reporter: Ian O Connell (ianoc)
