Description
Issue:
I tested the matplotlib integration in the PySpark interpreter. As a baseline, each of the following 3 examples took 1 to 2 seconds in Jupyter on the same machine. The times measured in Zeppelin are shown after each example.
%pyspark
import matplotlib.pyplot as plt
plt.plot([1,2,3,4])
plt.ylabel('some numbers')
z.show(plt)
==> 1 sec
%pyspark
import numpy as np
import matplotlib.pyplot as plt
# Fixing random state for reproducibility
np.random.seed(19680801)
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)
# the histogram of the data
n, bins, patches = plt.hist(x, 50, normed=1, facecolor='g', alpha=0.75)
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()
==> 11 sec
%pyspark
from ggplot import *
ggplot(diamonds, aes(x='price', fill='cut')) +\
    geom_density(alpha=0.25) +\
    facet_wrap("clarity")
==> 138 sec
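The figures above read as wall-clock durations per paragraph. As a rough sketch of how such numbers could be collected consistently in both Jupyter and Zeppelin (this is not how the timings in this report were taken; `timed` and `timings` are hypothetical names, and Python 3 is assumed for `time.perf_counter`), the plotting code can be wrapped in a small timing context manager:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the enclosed block raises.
        results[label] = time.perf_counter() - start


timings = {}

with timed("example", timings):
    # Placeholder workload (assumption): replace with the plotting code
    # of one of the paragraphs being compared.
    total = sum(range(100000))

for label, seconds in timings.items():
    print("%s: %.3f s" % (label, seconds))
```

Running the same wrapped block in both environments would give directly comparable numbers, independent of how each notebook reports paragraph run time.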
Environment:
Downloaded http://apache.mirror.digionline.de/zeppelin/zeppelin-0.7.0/zeppelin-0.7.0-bin-netinst.tgz and installed the spark, python, sh, md and angular interpreters
Started via bin/zeppelin.sh
Issue Links
- duplicates ZEPPELIN-1894 Matplotlib is very slow in python interpreter (Resolved)