  1. Parquet
  2. PARQUET-1355

Improve Binary write performance


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.10.0
    • Fix Version/s: None
    • Component/s: parquet-mr

    Description

      Benchmark code:

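      test("Parquet write benchmark") {
        val count = 100 * 1024 * 1024  // rows written per case
        val numIters = 5               // timed iterations per case
        withTempPath { path =>
          // The second argument is valuesPerIteration. It is 5 here rather than count,
          // so Rate(M/s) prints as 0.0 and Per Row(ns) is the best time divided by 5.
          val benchmark = new Benchmark(s"Parquet write benchmark ${spark.sparkContext.version}", 5)

          // One case per column type: cast the generated ids and write them as Parquet.
          Seq("long", "string", "decimal(18, 0)", "decimal(38, 18)").foreach { dt =>
            benchmark.addCase(s"$dt type", numIters = numIters) { iter =>
              spark.range(count).selectExpr(s"cast(id as $dt) as id")
                .write.mode("overwrite").parquet(path.getAbsolutePath)
            }
          }
          benchmark.run()
        }
      }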
      

      Result:

      -- Spark 2.3.3-SNAPSHOT with Parquet 1.8.3
      
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
      Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
      
      Parquet write benchmark 2.3.3-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      long type                                   10963 / 11344          0.0  2192675973.8       1.0X
      string type                                 28423 / 29437          0.0  5684553922.2       0.4X
      decimal(18, 0) type                         11558 / 11696          0.0  2311587203.6       0.9X
      decimal(38, 18) type                        43858 / 44432          0.0  8771537663.4       0.2X
      
      
      -- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
      
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
      Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
      
      Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      long type                                   11633 / 12070          0.0  2326572295.8       1.0X
      string type                                 31374 / 32178          0.0  6274760187.4       0.4X
      decimal(18, 0) type                         13019 / 13294          0.0  2603841925.4       0.9X
      decimal(38, 18) type                        50719 / 50983          0.0 10143775007.6       0.2X
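
To make the gap between the two runs easier to read, the best times can be turned into percentage changes. The sketch below is only arithmetic on the numbers reported above; the object and method names (WriteTimeDelta, pctSlower) are made up for illustration and are not part of Spark or Parquet.

```scala
object WriteTimeDelta {
  // Best write times in ms, copied from the two result tables above.
  val parquet183: Seq[(String, Long)] = Seq(
    "long" -> 10963L, "string" -> 28423L,
    "decimal(18, 0)" -> 11558L, "decimal(38, 18)" -> 43858L)
  val parquet1100: Seq[(String, Long)] = Seq(
    "long" -> 11633L, "string" -> 31374L,
    "decimal(18, 0)" -> 13019L, "decimal(38, 18)" -> 50719L)

  // Percentage increase in best time going from `before` to `after`.
  def pctSlower(before: Long, after: Long): Double =
    (after - before) * 100.0 / before

  def main(args: Array[String]): Unit =
    parquet183.zip(parquet1100).foreach { case ((dt, before), (_, after)) =>
      println(f"$dt%-16s ${pctSlower(before, after)}%.1f%% slower")
    }
}
```

On these numbers the Parquet 1.10.0 run is about 6% slower for long and about 16% slower for decimal(38, 18), with string and decimal(18, 0) in between.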
      

      The main factor affecting performance is toByteBuffer.
      If toByteBuffer is not used when comparing Binary values, the result is:

      -- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
      
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
      Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
      
      Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      long type                                   11171 / 11508          0.0  2234189382.0       1.0X
      string type                                 30072 / 30290          0.0  6014346455.4       0.4X
      decimal(18, 0) type                         12150 / 12239          0.0  2430052708.8       0.9X
      decimal(38, 18) type                        44974 / 45423          0.0  8994773738.8       0.2X
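
As an illustration of the kind of change being measured, here is a minimal sketch of two unsigned lexicographic byte comparisons: one that wraps each operand in a ByteBuffer first, and one that reads the backing arrays directly. This is not the parquet-mr comparator code. The object and method names are hypothetical, and it assumes the values are available as plain byte arrays.

```scala
import java.nio.ByteBuffer

object BinaryCompareSketch {
  // Wraps both operands in a ByteBuffer before comparing, which allocates
  // two wrapper objects per call.
  def compareViaByteBuffer(a: Array[Byte], b: Array[Byte]): Int = {
    val b1 = ByteBuffer.wrap(a)
    val b2 = ByteBuffer.wrap(b)
    val n = math.min(b1.remaining(), b2.remaining())
    var i = 0
    while (i < n) {
      // Mask with 0xff so bytes compare as unsigned values.
      val c = (b1.get(b1.position() + i) & 0xff) - (b2.get(b2.position() + i) & 0xff)
      if (c != 0) return c
      i += 1
    }
    b1.remaining() - b2.remaining()
  }

  // Same ordering, computed directly on the arrays with no intermediate objects.
  def compareArrays(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val c = (a(i) & 0xff) - (b(i) & 0xff)
      if (c != 0) return c
      i += 1
    }
    a.length - b.length
  }
}
```

Both methods return the same sign for any pair of inputs. The difference is only in the per-call wrapping, which matters when the comparison runs once per written value.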
      

          People

            Assignee: Yuming Wang (yumwang)
            Reporter: Yuming Wang (yumwang)
            Votes: 0
            Watchers: 2
