  1. Parquet
  2. PARQUET-2027

Merging parquet files created in 1.11.1 not possible using 1.12.0

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12.0
    • Fix Version/s: 1.12.1
    • Component/s: parquet-mr
    • Labels:
      None

      Description

      I have Parquet files created using 1.11.1. As part of my process I merge two files (with the same schema) into one output file. I create a Hadoop writer:

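      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path
      import org.apache.parquet.column.ParquetProperties._
      import org.apache.parquet.hadoop.ParquetFileWriter
      import org.apache.parquet.hadoop.ParquetFileWriter.Mode
      import org.apache.parquet.hadoop.util.HadoopOutputFile

      val hadoopWriter = new ParquetFileWriter(
        HadoopOutputFile.fromPath(
          new Path(outputPath.toString),
          new Configuration()
        ),
        outputSchema,
        Mode.OVERWRITE,
        8 * 1024 * 1024, // row group size in bytes (8 MiB)
        2097152,         // maximum padding size in bytes (2 MiB)
        DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH,
        DEFAULT_STATISTICS_TRUNCATE_LENGTH,
        DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED
      )
      hadoopWriter.start()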
      

      and then try to append an input file to the output:

      hadoopWriter.appendFile(HadoopInputFile.fromPath(new Path(file), new Configuration()))
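
      For reference, the two snippets above can be combined into one end-to-end sketch of the merge. This is only an illustration: the `MergeParquetFiles` object and its `merge` helper are hypothetical names, the schema is assumed to be taken from the footer of the first input, and the final `end` call with empty extra metadata is an assumption, since the report does not show how the writer is closed.

```scala
import java.util.Collections

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.column.ParquetProperties
import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
import org.apache.parquet.hadoop.util.{HadoopInputFile, HadoopOutputFile}

object MergeParquetFiles {
  // Hypothetical helper: copies the row groups of every input into outputPath.
  // Assumes all inputs share the schema of the first file.
  def merge(inputs: Seq[String], outputPath: String): Unit = {
    require(inputs.nonEmpty, "at least one input file is required")
    val conf = new Configuration()

    // Read the schema from the footer of the first input file.
    val firstReader =
      ParquetFileReader.open(HadoopInputFile.fromPath(new Path(inputs.head), conf))
    val schema =
      try firstReader.getFooter.getFileMetaData.getSchema
      finally firstReader.close()

    val writer = new ParquetFileWriter(
      HadoopOutputFile.fromPath(new Path(outputPath), conf),
      schema,
      ParquetFileWriter.Mode.OVERWRITE,
      8L * 1024 * 1024, // row group size in bytes
      2097152,          // maximum padding size in bytes
      ParquetProperties.DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH,
      ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH,
      ParquetProperties.DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED
    )

    writer.start()
    // appendFile is the call that fails on 1.12.0 in the stack trace of this report.
    inputs.foreach { file =>
      writer.appendFile(HadoopInputFile.fromPath(new Path(file), conf))
    }
    // Assumption: no extra key/value metadata is written to the footer.
    writer.end(Collections.emptyMap[String, String]())
  }
}
```

      Calling `MergeParquetFiles.merge(Seq(fileA, fileB), output)` with two placeholder input paths would exercise the same `appendFile` code path as the report.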
      

      Everything works on 1.11.1, but after switching to 1.12.0 it fails with the following error:

      STDERR: Exception in thread "main" java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
       at org.apache.parquet.format.Util.read(Util.java:365)
       at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
       at org.apache.parquet.format.Util.readPageHeader(Util.java:127)
       at org.apache.parquet.hadoop.Offsets.readDictionaryPageSize(Offsets.java:75)
       at org.apache.parquet.hadoop.Offsets.getOffsets(Offsets.java:58)
       at org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroup(ParquetFileWriter.java:998)
       at org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroups(ParquetFileWriter.java:918)
       at org.apache.parquet.hadoop.ParquetFileReader.appendTo(ParquetFileReader.java:888)
       at org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:895)
       at [...]
      Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
       at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1108)
       at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
       at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
       at org.apache.parquet.format.Util.read(Util.java:362)
       ... 14 more
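
      When reproducing this, it can help to confirm which writer version produced each input file. A minimal sketch, assuming the footer's `created_by` string is populated; the `PrintCreatedBy` object and `createdBy` helper are hypothetical names, not part of the report:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

object PrintCreatedBy {
  // Hypothetical helper: returns the created_by string stored in the file footer.
  def createdBy(file: String): String = {
    val reader =
      ParquetFileReader.open(HadoopInputFile.fromPath(new Path(file), new Configuration()))
    try reader.getFooter.getFileMetaData.getCreatedBy
    finally reader.close()
  }

  // Prints one line per file path passed on the command line.
  def main(args: Array[String]): Unit =
    args.foreach(f => println(s"$f: ${createdBy(f)}"))
}
```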
      

            People

            • Assignee: Gabor Szadovszky (gszadovszky)
            • Reporter: Matthew M (eltherion)