[SPARK-10893] Lag Analytic function broken - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.5.0
Fix Version/s: None
Component/s: Spark Core, SQL
Labels: None
Environment: Spark Standalone Cluster on Linux

Description

Aggregating with the LAG analytic function gives the wrong result. In my test case it always returned the fixed value '103079215105' when applied to an integer column.
Note that this happens only on Spark 1.5.0, and only when running against a cluster rather than in local mode.
It works fine on Spark 1.4.1, and in local mode.
I did not test on a YARN cluster.
I did not test other analytic functions.

Input JSON:

/home/app/input.json

{"VAA":"A", "VBB":1}
{"VAA":"B", "VBB":-1}
{"VAA":"C", "VBB":2}
{"VAA":"d", "VBB":3}
{"VAA":null, "VBB":null}

Java:

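    import static org.apache.spark.sql.functions.lag;
    import org.apache.spark.SparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.hive.HiveContext;

    // conf is the job's SparkConf (not shown in this report).
    SparkContext sc = new SparkContext(conf);
    HiveContext sqlContext = new HiveContext(sc);
    DataFrame df = sqlContext.read().json("file:///home/app/input.json");

    // Add the previous row's VBB, with rows ordered by VAA.
    df = df.withColumn(
      "previous",
      lag(df.col("VBB"), 1)
        .over(Window.orderBy(df.col("VAA")))
      );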

To make the run conditions clear: I submitted the job to a standalone Spark cluster in client deploy mode, as follows:

spark-submit \
  --master spark://xxxxxx:7077 \
  --deploy-mode client \
  --class package.to.DriverClass \
  --driver-java-options -Dhdp.version=2.2.0.0-2041 \
  --num-executors 2 \
  --driver-memory 2g \
  --executor-memory 2g \
  --executor-cores 2 \
  /path/to/sample-program.jar

Expected Result:

{"VAA":null, "VBB":null, "previous":null}
{"VAA":"A", "VBB":1, "previous":null}
{"VAA":"B", "VBB":-1, "previous":1}
{"VAA":"C", "VBB":2, "previous":-1}
{"VAA":"d", "VBB":3, "previous":2}
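
For reference, the expected "previous" column can be reproduced without Spark. The following is a minimal plain-Java sketch of what LAG(VBB, 1) ordered by VAA should produce for the five input rows, assuming ascending order with nulls first (as in the expected result above). The class LagSketch, its Row type, and the method names are hypothetical helpers for illustration only; they are not part of Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration of LAG(VBB, 1) OVER (ORDER BY VAA); no Spark involved.
class LagSketch {
    // Hypothetical row type mirroring one record of input.json.
    static final class Row {
        final String vaa;
        final Long vbb;
        Row(String vaa, Long vbb) { this.vaa = vaa; this.vbb = vbb; }
    }

    // The five records from the report, in file order.
    static List<Row> sampleRows() {
        return Arrays.asList(
            new Row("A", 1L),
            new Row("B", -1L),
            new Row("C", 2L),
            new Row("d", 3L),
            new Row(null, null));
    }

    // Returns the "previous" column in VAA order: each row gets the VBB of
    // the row before it, and the first row gets null.
    static List<Long> lagByVaa(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        // Ascending by VAA with nulls first (assumption, matches the expected result).
        sorted.sort(Comparator.comparing((Row r) -> r.vaa,
                Comparator.nullsFirst(Comparator.<String>naturalOrder())));
        List<Long> previous = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            previous.add(i == 0 ? null : sorted.get(i - 1).vbb);
        }
        return previous;
    }

    public static void main(String[] args) {
        // Prints [null, null, 1, -1, 2]
        System.out.println(lagByVaa(sampleRows()));
    }
}
```

The second null comes from the row with VAA "A": its predecessor is the all-null row, so its lagged VBB is null as well.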

Actual Result:

{"VAA":null, "VBB":null, "previous":103079215105}
{"VAA":"A", "VBB":1, "previous":103079215105}
{"VAA":"B", "VBB":-1, "previous":103079215105}
{"VAA":"C", "VBB":2, "previous":103079215105}
{"VAA":"d", "VBB":3, "previous":103079215105}
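
One observation that may help whoever debugs this: the constant is not arbitrary when viewed as a 64-bit word. The sketch below only does the arithmetic, splitting the observed value into its upper and lower 32 bits. The class and method names are hypothetical, and nothing in this report establishes what the two halves mean inside Spark.

```java
// Splits the observed bogus LAG value into 32-bit halves. Arithmetic only;
// any interpretation of the halves is an assumption, not a finding.
class LagValueBits {
    static long high32(long v) { return v >>> 32; }
    static long low32(long v) { return v & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        long observed = 103079215105L;
        System.out.println(Long.toHexString(observed));                 // 1800000001
        System.out.println(high32(observed) + " " + low32(observed));   // 24 1
    }
}
```

That is, 103079215105 = 24 * 2^32 + 1. A value with this shape looks more like two packed 32-bit integers than like data from the input column, but that is only a guess.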

Issue Links

duplicates

SPARK-11009 RowNumber in HiveContext returns negative values in cluster mode

Resolved

People

Assignee: Unassigned

Reporter: Jo Desmet

Votes: 0

Watchers: 2

Dates

Created: 01/Oct/15 02:45

Updated: 18/Oct/15 15:17

Resolved: 18/Oct/15 15:17