Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Not Applicable
None
Environment: pyspark with local Spark 2.1
Description
Do we have undocumented size limits for RDDConverterUtilsExt.convertPy4JArrayToMB?
The simple script below works for 23,100 rows, while 46,900 rows fail. It reproduces the problem easily and consistently.
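For scale, a back-of-envelope estimate of the payload involved (my assumption, based on the trace below, is that the whole matrix is shipped to the JVM as a single Base64-encoded string of 8-byte doubles through Py4J):

# Rough payload estimate; assumes one Base64-encoded transfer of doubles.
for nr in (23100, 46900):
    raw = nr * 784 * 8             # total bytes of raw double data
    b64 = raw * 4 // 3             # Base64 inflates payloads by ~4/3
    print("%d rows: %.0f MB raw, %.0f MB Base64" % (nr, raw / 2.0**20, b64 / 2.0**20))
# 23100 rows: 138 MB raw, 184 MB Base64  (works)
# 46900 rows: 281 MB raw, 374 MB Base64  (fails)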
START:
$pyspark --master local --jars $SYSTEMML_HOME/SystemML.jar --driver-memory 8G --executor-memory 2G
PYTHON SCRIPT:
from systemml import MLContext, dml
import pandas as pd
sc.version
ml = MLContext(sc)
print "Spark Version:", sc.version
print "SystemML Version:", ml.version()
print "SystemML Built-Time:", ml.buildTime()
# note: nr = 23100 works, while nr = 46900 fails
nr = 46900
X_pd = pd.DataFrame(range(1, (nr*784)+1,1),dtype=float).values.reshape(nr,784)
script ="""
write(X, $Xfile, format="csv")
"""
# "$Xfile" is passed via a **-dict since it is not a valid Python keyword;
# the path value here is a hypothetical placeholder
prog = dml(script).input(X=X_pd).input(**{"$Xfile": "X.csv"})
ml.execute(prog)
OUTPUT:
Spark Version: 2.1.0
SystemML Version: 0.14.0-incubating-SNAPSHOT
SystemML Built-Time: 2017-03-03 07:33:40 UTC
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
.......
Py4JError: An error occurred while calling z:org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt.convertPy4JArrayToMB. Trace:
java.lang.NegativeArraySizeException
at py4j.Base64.decode(Base64.java:321)
at py4j.Protocol.getBytes(Protocol.java:173)
at py4j.Protocol.getObject(Protocol.java:294)
at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:82)
at py4j.commands.CallCommand.execute(CallCommand.java:77)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
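A possible workaround, sketched here under the assumption that MLContext also accepts Spark DataFrame inputs (as the MLContext API advertises): route the matrix through Spark's own serialization instead of a single Base64-encoded Py4J argument. The "X.csv" path is again a hypothetical placeholder.
PYTHON SCRIPT (WORKAROUND SKETCH):
from systemml import MLContext, dml
import pandas as pd

ml = MLContext(sc)
nr = 46900
X_pd = pd.DataFrame(range(1, (nr * 784) + 1, 1), dtype=float).values.reshape(nr, 784)

# Ship the matrix as a Spark DataFrame so it is transferred via Spark's
# serialization rather than one huge Base64-encoded Py4J byte string.
X_df = spark.createDataFrame(pd.DataFrame(X_pd))

script = """
write(X, $Xfile, format="csv")
"""
prog = dml(script).input(X=X_df).input(**{"$Xfile": "X.csv"})  # placeholder path
ml.execute(prog)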