SPARK-16321: [Spark 2.0] Performance regression when reading Parquet with PPD and the non-vectorized reader


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      UPDATE
      Please start with this comment
      https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785

      I assume the problem stems from a performance regression in reading Parquet files.

      Original Issue description

      I ran some tests on a Parquet file with many nested columns (about 30 GB
      in 400 partitions), and Spark 2.0 is 2x slower than Spark 1.6.

      # path and some_id are placeholders; the lambda keeps only ids divisible by 100000
      df = sqlctx.read.parquet(path)
      df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 100000 else []).collect()

      Spark 1.6 -> 2.3 min
      Spark 2.0 -> 4.6 min (2x slower)
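
Since the title points at PPD (predicate pushdown) and the non-vectorized reader, one way to narrow the slowdown is to time the same query under each on/off combination of the two Parquet settings. The sketch below is only an illustration, not the procedure used in this report. `time_configs`, `set_conf`, and `run_query` are hypothetical names. The two option keys are Spark SQL settings, and the vectorized-reader key is assumed to apply to Spark 2.0 only.

```python
import time
from itertools import product

# Spark SQL option names (boolean settings passed as strings).
PUSHDOWN = "spark.sql.parquet.filterPushdown"
VECTORIZED = "spark.sql.parquet.enableVectorizedReader"


def time_configs(set_conf, run_query):
    """Time run_query() under each on/off combination of the two settings.

    set_conf(key, value) and run_query() are supplied by the caller
    (hypothetical hooks), for example sqlctx.setConf and a function that
    re-reads the Parquet file and runs the query from the description.
    """
    timings = {}
    for pushdown, vectorized in product(("true", "false"), repeat=2):
        set_conf(PUSHDOWN, pushdown)
        set_conf(VECTORIZED, vectorized)
        start = time.perf_counter()
        run_query()
        timings[(pushdown, vectorized)] = time.perf_counter() - start
    return timings
```

In a live session this could be called as `time_configs(sqlctx.setConf, run_query)`, where `run_query` wraps both `sqlctx.read.parquet(path)` and the `collect()` so that each run is planned under the settings just applied.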

      I used BasicProfiler for this task and cumulative time was:
      Spark 1.6 - 4300 sec
      Spark 2.0 - 5800 sec
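
For reference, the two measurements do not grow by the same factor. The arithmetic below uses only the numbers reported above; the variable names are made up for the illustration.

```python
# Figures copied from the report above.
wall_clock_min = {"1.6": 2.3, "2.0": 4.6}
profiler_sec = {"1.6": 4300.0, "2.0": 5800.0}

wall_ratio = wall_clock_min["2.0"] / wall_clock_min["1.6"]
profiler_ratio = profiler_sec["2.0"] / profiler_sec["1.6"]

print("wall clock: %.2fx, BasicProfiler cumulative: %.2fx" % (wall_ratio, profiler_ratio))
# prints "wall clock: 2.00x, BasicProfiler cumulative: 1.35x"
```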

      Should I expect such a drop in performance?

      I don't know how to prepare sample data that shows the problem.
      Any ideas? Or is there public data with many nested columns?
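
One possible way to prepare such sample data is to generate synthetic nested records as JSON lines and let Spark infer nested struct columns from them. This is a minimal sketch under that assumption. `make_nested_record` and `write_json_lines` are hypothetical helpers, and whether data of this shape reproduces the slowdown has not been verified.

```python
import json
import random


def make_nested_record(record_id, width=4, depth=3, rng=None):
    """Return a dict with an integer `id` and `width` nested fields.

    Each field nests `depth` levels deep (depth must be >= 1), so a record
    holds width ** depth integer leaves.
    """
    if rng is None:
        rng = random.Random(record_id)  # deterministic per record

    def build(level):
        if level == 0:
            return rng.randint(0, 1000)
        return {"f%d" % i: build(level - 1) for i in range(width)}

    record = build(depth)
    record["id"] = record_id
    return record


def write_json_lines(path, n_records, **kwargs):
    """Write one JSON object per line."""
    with open(path, "w") as out:
        for record_id in range(n_records):
            out.write(json.dumps(make_nested_record(record_id, **kwargs)) + "\n")
```

The resulting file could then be converted with `sqlctx.read.json(json_path).write.parquet(parquet_path)`, where both paths are placeholders, and queried the same way as the original data.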

      Attachments

        1. visualvm_spark2.png
          114 kB
          Maciej Bryński
        2. visualvm_spark16.png
          113 kB
          Maciej Bryński
        3. visualvm_spark2_G1GC.png
          110 kB
          Maciej Bryński
        4. spark2_trace.png
          115 kB
          Maciej Bryński
        5. spark16._trace.png
          107 kB
          Maciej Bryński
        6. Spark2.nps
          510 kB
          Maciej Bryński
        7. Spark16.nps
          682 kB
          Maciej Bryński
        8. spark2_nofilterpushdown.nps
          580 kB
          Maciej Bryński
        9. spark2_query.nps
          101 kB
          Maciej Bryński
        10. spark16_query.nps
          109 kB
          Maciej Bryński

            People

              Assignee: L. C. Hsieh (viirya)
              Reporter: Maciej Bryński (maver1ck)
              Votes: 0
              Watchers: 10