PARQUET-222

parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

      Description

      In Spark SQL, the DataFrame (formerly SchemaRDD) API provides a saveAsParquetFile function. That function calls into parquet-mr, and it sometimes fails with an OutOfMemoryError thrown from parquet-mr. The exception stack trace looks as follows:

      [WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space
              at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
              at parquet.column.values.dictionary.IntList.<init>(IntList.java:83)
              at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85)
              at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549)
              at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
              at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
              at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
              at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
              at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
              at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
              at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
              at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
              at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
              at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
              at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
              at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
              at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
              at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
              at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
              at org.apache.spark.scheduler.Task.run(Task.scala:56)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)
      
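      To make the failure mode concrete, here is a hypothetical minimal reproduction sketch. It assumes Spark 1.3+ with parquet-mr 1.6.0 on the classpath; the object name, schema, and output path are made up for illustration, and the data volume needed to actually trigger the OOM depends on executor heap size and schema width.

          import org.apache.spark.{SparkConf, SparkContext}
          import org.apache.spark.sql.SQLContext

          object SaveAsParquetRepro {
            def main(args: Array[String]): Unit = {
              val sc = new SparkContext(new SparkConf().setAppName("parquet-oom-repro"))
              val sqlContext = new SQLContext(sc)
              import sqlContext.implicits._

              // Each Parquet column gets its own DictionaryValuesWriter (and IntList
              // slab) when the record writer is initialized, so wide schemas and many
              // concurrent writers inflate the initial allocation.
              val df = sc.parallelize(1 to 1000000)
                .map(i => (i, s"value_$i"))
                .toDF("id", "value")

              // This call goes through ParquetOutputFormat.getRecordWriter, the same
              // path as the stack trace above where IntList.initSlab throws the OOM.
              df.saveAsParquetFile("/tmp/parquet-oom-repro")
            }
          }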

      By the way, there is a similar issue, https://issues.apache.org/jira/browse/PARQUET-99, but the reporter has closed it and marked it as resolved.
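      As a possible workaround sketch (not a fix in parquet-mr itself), the per-column dictionary buffers that IntList allocates can be shrunk or disabled through the standard parquet-mr Hadoop properties before writing. Continuing the sketch above, and assuming the Spark version in use forwards sc.hadoopConfiguration to the Parquet record writer:

          // Disable dictionary encoding entirely, so DictionaryValuesWriter
          // (and its IntList slabs) is never constructed...
          sc.hadoopConfiguration.setBoolean("parquet.enable.dictionary", false)

          // ...or keep dictionary encoding but lower the buffer sizes.
          sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)
          sc.hadoopConfiguration.setInt("parquet.dictionary.page.size", 512 * 1024)

          df.saveAsParquetFile("/tmp/parquet-oom-repro")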


            People

            • Assignee: Unassigned
            • Reporter: debugger87 Chaozhong Yang
            • Votes: 1
            • Watchers: 9


                Time Tracking

                • Original Estimate: 336h
                • Remaining Estimate: 336h
                • Time Spent: Not Specified