Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.5.0
Environment: Spark Standalone Cluster on Linux
Description
Aggregating with the LAG analytic function gives the wrong result. In my test case it always returned the fixed value '103079215105' when applied to an integer column.
Note that this only happens on Spark 1.5.0, and only when running in cluster mode.
It works fine when running on Spark 1.4.1, or when running in local mode.
I did not test on a YARN cluster.
I did not test other analytic aggregates.
Input JSON:
{"VAA":"A", "VBB":1}
{"VAA":"B", "VBB":-1}
{"VAA":"C", "VBB":2}
{"VAA":"d", "VBB":3}
{"VAA":null, "VBB":null}
Java:
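import static org.apache.spark.sql.functions.lag;

import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.hive.HiveContext;

// conf is the job's SparkConf (its construction was not shown in the report)
SparkContext sc = new SparkContext(conf);
HiveContext sqlContext = new HiveContext(sc);
DataFrame df = sqlContext.read().json("file:///home/app/input.json");
// Add the previous row's VBB, with rows ordered by VAA
df = df.withColumn(
    "previous",
    lag(df.col("VBB"), 1).over(Window.orderBy(df.col("VAA")))
);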
The conditions under which the job ran are important here. I submitted it to a standalone Spark cluster in client mode as follows:
spark-submit \
  --master spark://xxxxxx:7077 \
  --deploy-mode client \
  --class package.to.DriverClass \
  --driver-java-options -Dhdp.version=2.2.0.0-2041 \
  --num-executors 2 \
  --driver-memory 2g \
  --executor-memory 2g \
  --executor-cores 2 \
  /path/to/sample-program.jar
Expected Result:
{"VAA":null, "VBB":null, "previous":null}
{"VAA":"A", "VBB":1, "previous":null}
{"VAA":"B", "VBB":-1, "previous":1}
{"VAA":"C", "VBB":2, "previous":-1}
{"VAA":"d", "VBB":3, "previous":2}
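For reference, the expected result above can be reproduced without Spark. The following is a minimal plain-Java sketch of what LAG with offset 1 should compute, assuming rows are ordered by VAA ascending with nulls first (which is what the expected output implies). The class and method names (`LagSketch`, `lagByKey`) are hypothetical and are not part of Spark or of the original test case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper illustrating LAG semantics; not Spark code.
class LagSketch {
    /**
     * Orders rows by key (nulls first, then natural String order) and returns,
     * for each row in that order, the value of the row `offset` positions
     * earlier, or null when no such row exists.
     */
    static List<Integer> lagByKey(String[] keys, Integer[] values, int offset) {
        Integer[] idx = new Integer[keys.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        // Sort row indices by their key, placing null keys first.
        Arrays.sort(idx, Comparator.comparing(
                (Integer i) -> keys[i],
                Comparator.nullsFirst(Comparator.<String>naturalOrder())));

        List<Integer> previous = new ArrayList<>();
        for (int pos = 0; pos < idx.length; pos++) {
            previous.add(pos >= offset ? values[idx[pos - offset]] : null);
        }
        return previous;
    }

    public static void main(String[] args) {
        String[] vaa = {"A", "B", "C", "d", null};
        Integer[] vbb = {1, -1, 2, 3, null};
        // Prints [null, null, 1, -1, 2], matching the "previous" column above.
        System.out.println(lagByKey(vaa, vbb, 1));
    }
}
```

The second null in the output is the VBB value of the null-keyed row itself, not a missing predecessor, which is why the row for "A" also shows a null "previous".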
Actual Result:
{"VAA":null, "VBB":null, "previous":103079215105}
{"VAA":"A", "VBB":1, "previous":103079215105}
{"VAA":"B", "VBB":-1, "previous":103079215105}
{"VAA":"C", "VBB":2, "previous":103079215105}
{"VAA":"d", "VBB":3, "previous":103079215105}
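A side observation that is not part of the original report: the constant 103079215105 is not arbitrary. It equals 24 * 2^32 + 1, so its upper 32 bits hold 24 and its lower 32 bits hold 1. That might hint that two packed 32-bit fields are being read back as a single long, but this is only a guess and is not confirmed by anything in this issue. The sketch below just checks the arithmetic; the class name `LagConstantSketch` is hypothetical.

```java
// Hypothetical helper that splits the observed value into two 32-bit halves.
class LagConstantSketch {
    // Upper 32 bits of the value, as an unsigned quantity.
    static long highWord(long v) {
        return v >>> 32;
    }

    // Lower 32 bits of the value, as an unsigned quantity.
    static long lowWord(long v) {
        return v & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        long observed = 103079215105L;
        System.out.println(Long.toHexString(observed)); // 1800000001
        System.out.println(highWord(observed));         // 24
        System.out.println(lowWord(observed));          // 1
    }
}
```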
Issue Links
- duplicates SPARK-11009 RowNumber in HiveContext returns negative values in cluster mode (Resolved)