  1. Zeppelin
  2. ZEPPELIN-2160

PySpark: Matplotlib Integration extremely slow

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.7.0
    • Fix Version/s: None
    • Component/s: front-end, GUI
    • Labels: None

    Description

      Issue:
      I tested the matplotlib integration in PySpark. As a baseline, each of the following 3 examples took 1 - 2 seconds in Jupyter on the same machine. The time measured in Zeppelin is given after each example.

      %pyspark
      
      import matplotlib.pyplot as plt
      plt.plot([1,2,3,4])
      plt.ylabel('some numbers')
      z.show(plt)
      

      ==> 1 sec

      %pyspark
      
      import numpy as np
      import matplotlib.pyplot as plt
      
      # Fixing random state for reproducibility
      np.random.seed(19680801)
      
      mu, sigma = 100, 15
      x = mu + sigma * np.random.randn(10000)
      
      # the histogram of the data
      n, bins, patches = plt.hist(x, 50, normed=1, facecolor='g', alpha=0.75)
      
      plt.xlabel('Smarts')
      plt.ylabel('Probability')
      plt.title('Histogram of IQ')
      plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
      plt.axis([40, 160, 0, 0.03])
      plt.grid(True)
      plt.show()
      

      ==> 11 sec

      %pyspark
      from ggplot import *
      
      ggplot(diamonds, aes(x='price', fill='cut')) +\
          geom_density(alpha=0.25) +\
          facet_wrap("clarity")
      

      ==> 138 sec
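
      To narrow down where the time goes, it may help to time plot construction and display separately. The following is a minimal sketch, not part of the original measurements: `stopwatch` is a hypothetical helper built only on the Python standard library, and the commented usage assumes the `plt` and `z` objects from the first example.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Store the wall-clock seconds spent inside the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so partial timings are not lost.
        results[label] = time.perf_counter() - start


# Hypothetical usage inside a %pyspark paragraph (assumes plt and z exist):
#
#   timings = {}
#   with stopwatch("build", timings):
#       plt.plot([1, 2, 3, 4])
#       plt.ylabel('some numbers')
#   with stopwatch("show", timings):
#       z.show(plt)
#   print(timings)
```

      If "build" and "show" together are much shorter than the time the paragraph takes to finish, the delay presumably lies outside the Python code itself, for example in transferring or rendering the output.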

      Environment:
      Downloaded http://apache.mirror.digionline.de/zeppelin/zeppelin-0.7.0/zeppelin-0.7.0-bin-netinst.tgz and installed the spark, python, sh, md and angular interpreters.
      Started via bin/zeppelin.sh.

            People

              Assignee: Unassigned
              Reporter: Bernhard Walter (bwalter42)