PARQUET-2089

[C++] RowGroupMetaData file_offset set incorrectly


    Description

      The RowGroupMetaData file_offset property appears to be set to the same value as the file_offset of the row group's first column (the offset of its ColumnMetaData) in https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/metadata.cc#L1557-L1565

      This is not consistent with the definition of these properties given in the Thrift file: https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift

      struct ColumnChunk {
        ...
      
        /** Byte offset in file_path to the ColumnMetaData **/
        2: required i64 file_offset
        
        ...
      }
      
      ...
      
      struct RowGroup {
        ...
      
        /** Byte offset from beginning of file to first page (data or dictionary)
         * in this row group **/
        5: optional i64 file_offset
      
        ...
      }
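
The difference between the two definitions can be sketched as follows. This is only an illustration, not the Arrow implementation: the structs are simplified, hypothetical stand-ins for the Thrift types, `has_dictionary_page_offset` is an assumed flag standing in for the optional field, and the sketch assumes the first column's pages are the first pages written in the row group.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified, hypothetical stand-ins for the Thrift structs. Only the
// offset fields relevant to this issue are modelled.
struct ColumnMetaData {
  int64_t data_page_offset;         // offset of the first data page
  bool has_dictionary_page_offset;  // assumed flag for the optional field
  int64_t dictionary_page_offset;   // meaningful only when the flag is set
};

struct ColumnChunk {
  int64_t file_offset;  // byte offset to the ColumnMetaData
  ColumnMetaData meta_data;
};

// Offset of the first page (dictionary or data) of one column chunk.
int64_t FirstPageOffset(const ColumnChunk& chunk) {
  const ColumnMetaData& md = chunk.meta_data;
  if (md.has_dictionary_page_offset &&
      md.dictionary_page_offset < md.data_page_offset) {
    return md.dictionary_page_offset;
  }
  return md.data_page_offset;
}

// Value the Thrift comment describes for RowGroup.file_offset, assuming
// the first column's pages come first in the row group.
int64_t ExpectedRowGroupFileOffset(const std::vector<ColumnChunk>& columns) {
  assert(!columns.empty());
  return FirstPageOffset(columns.front());
}

// Value described in this report: the first column chunk's file_offset.
int64_t ReportedRowGroupFileOffset(const std::vector<ColumnChunk>& columns) {
  assert(!columns.empty());
  return columns.front().file_offset;
}
```

The two functions agree only when the first column chunk's file_offset happens to equal the offset of its first page.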
      

      This is causing issues when trying to read the file with the parquet-mr libraries, because the RowGroup's file offset is used to determine whether a RowGroup exists within a given file split: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1226-L1251
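
To illustrate why the offset matters for split handling, here is a deliberately simplified sketch. It is not the parquet-mr code: `Split` and `SplitContains` are hypothetical names, and the half-open byte range is an assumption made for the example.

```cpp
#include <cstdint>

// Hypothetical half-open byte range [start, end) describing one file split.
struct Split {
  int64_t start;
  int64_t end;
};

// A row group is attributed to a split when its recorded offset falls
// inside the split's byte range.
bool SplitContains(const Split& split, int64_t row_group_offset) {
  return row_group_offset >= split.start && row_group_offset < split.end;
}
```

With a made-up split covering bytes [0, 512), a row group whose first page sits at offset 4 belongs to that split. If its metadata instead records an offset of 1000, the same check places it outside the split, so a reader relying on this check would not pick up the row group there.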

       

      As a result, the written Parquet files cannot be read by parquet-mr, because the row group metadata is incorrect.

          People

            emkornfield Micah Kornfield
            archmenzies Archie Menzies

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 40m
