SPARK-32989: Performance regression when selecting from str_to_map


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.1
    • Fix Version/s: 3.1.0
    • Component/s: SQL
    • Labels: None

    Description

      When I create a map using str_to_map and select more than a single value, I notice a notable performance regression in 3.0.1 compared to 2.4.7. When selecting a single value, the performance is the same. Plans are identical between versions.

      It seems like in 2.x the map produced by str_to_map is computed once per row and reused across columns, but in 3.x it's recalculated for each selected column. One hint that this is the case: when I forced materialisation of the map in 3.x (via a coalesce; I don't know if there's a better way), performance came roughly back to 2.x levels.
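
      One way to poke at this (assuming Spark 3.0+, where DataFrame.explain accepts a mode argument; the one-row DataFrame below is just a stand-in to get the same shape of plan) is to dump the whole-stage generated code and look at where str_to_map ends up being evaluated:

      from pyspark.sql import SparkSession
      import pyspark.sql.functions as f

      spark = SparkSession.builder.getOrCreate()
      # one-row stand-in for the real data
      df = spark.createDataFrame([('foo=bar&baz=bak&bar=foo',)], ['foo'])
      dd = (df
              .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
              .select(
                  f.col('my_map')['foo'],
                  f.col('my_map')['bar'],
                  f.col('my_map')['baz'],
              )
          )
      # prints the physical plan plus the code generated for each
      # whole-stage-codegen subtree (mode='codegen' is Spark 3.0+ only;
      # on 2.4 the closest equivalent is dd.explain(True), plans only)
      dd.explain(mode='codegen')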

      Here's a reproducer (the CSV in question gets autogenerated by the Python code):

      $ head regression.csv 
      foo
      foo=bar&baz=bak&bar=foo
      foo=bar&baz=bak&bar=foo
      foo=bar&baz=bak&bar=foo
      foo=bar&baz=bak&bar=foo
      foo=bar&baz=bak&bar=foo
      ... (10M more rows)
      
      import time
      import os
      
      import pyspark  
      from pyspark.sql import SparkSession
      
      import pyspark.sql.functions as f
      
      if __name__ == '__main__':
          print(pyspark.__version__)
          spark = SparkSession.builder.getOrCreate()
      
          filename = 'regression.csv'
          if not os.path.isfile(filename):
              with open(filename, 'wt') as fw:
                  fw.write('foo\n')
                  for _ in range(10_000_000):
                      fw.write('foo=bar&baz=bak&bar=foo\n')
      
          df = spark.read.option('header', True).csv(filename)
          t = time.time()
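          # baseline: build the map, then select a single key from it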
          dd = (df
                  .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
                  .select(
                      f.col('my_map')['foo'],
                  )
              )
          dd.write.mode('overwrite').csv('tmp')
          t2 = time.time()
          print('selected one', t2 - t)
      
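          # regression case: select three keys from the same map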
          dd = (df
                  .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
                  # .coalesce(100) # forcing evaluation before selection speeds it up in 3.0.1
                  .select(
                      f.col('my_map')['foo'],
                      f.col('my_map')['bar'],
                      f.col('my_map')['baz'],
                  )
              )
          dd.explain(True)
          dd.write.mode('overwrite').csv('tmp')
          t3 = time.time()
          print('selected three', t3 - t2)
      

      Results for 2.4.7 and 3.0.1, both installed from PyPI, on Python 3.7 and macOS (times are in seconds):

      # 3.0.1
      # selected one 6.375471830368042                                                  
      # selected three 14.847578048706055
      
      # 2.4.7
      # selected one 6.679579019546509                                                  
      # selected three 6.5622029304504395  
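
      For completeness, the workaround mentioned above is just the three-key query with the coalesce uncommented; on 3.0.1 this brought the timing roughly back to 2.x levels (there may well be a cleaner way to force the map to be materialised once):

      dd = (df
              .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
              .coalesce(100) # forcing evaluation before selection speeds it up in 3.0.1
              .select(
                  f.col('my_map')['foo'],
                  f.col('my_map')['bar'],
                  f.col('my_map')['baz'],
              )
          )
      dd.write.mode('overwrite').csv('tmp')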
      

          People

            Assignee: L. C. Hsieh (viirya)
            Reporter: Ondrej Kokes (ondrej)