Spark / SPARK-18528

limit + groupBy leads to java.lang.NullPointerException


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.3, 2.1.1, 2.2.0
    • Component/s: PySpark, SQL
    • Labels: None
    • Environment: CentOS release 6.6, Linux 2.6.32-504.el6.x86_64

    Description

Calling limit on a DataFrame prior to groupBy leads to a java.lang.NullPointerException at execution time. Coalescing or repartitioning the limited DataFrame avoids the crash (a consolidated script follows these examples):

      will crash: df.limit(3).groupBy("user_id").count().show()
      will work:  df.limit(3).coalesce(1).groupBy("user_id").count().show()
      will work:  df.limit(3).repartition("user_id").groupBy("user_id").count().show()
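
      For reference, the setup and both workarounds as one self-contained script (a minimal sketch; in the pyspark shell `spark` already exists, so the builder call below is only needed for a standalone run):

      from pyspark.sql import SparkSession

      # Standalone scripts must build a session; the pyspark shell provides `spark`.
      spark = SparkSession.builder.appName("spark-18528-repro").getOrCreate()

      df = spark.createDataFrame(
          [(1, 1), (1, 3), (2, 1), (3, 2), (3, 3)],
          ["user_id", "genre_id"])

      # Crashes with java.lang.NullPointerException in generated aggregate code:
      # df.limit(3).groupBy("user_id").count().show()

      # Workaround 1: collapse the limited result to a single partition.
      df.limit(3).coalesce(1).groupBy("user_id").count().show()

      # Workaround 2: repartition on the grouping key before aggregating.
      df.limit(3).repartition("user_id").groupBy("user_id").count().show()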

      Here is a reproducible example along with the error message:

      >>> df = spark.createDataFrame([ (1, 1), (1, 3), (2, 1), (3, 2), (3, 3) ], ["user_id", "genre_id"])
      >>>
      >>> df.show()
      +-------+--------+
      |user_id|genre_id|
      +-------+--------+
      |      1|       1|
      |      1|       3|
      |      2|       1|
      |      3|       2|
      |      3|       3|
      +-------+--------+
      >>> df.groupBy("user_id").count().show()
      +-------+-----+
      |user_id|count|
      +-------+-----+
      |      1|    2|
      |      3|    2|
      |      2|    1|
      +-------+-----+
      >>> df.limit(3).groupBy("user_id").count().show()
      [Stage 8:===================================================>(1964 + 24) / 2000]16/11/21 01:59:27 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 8204, lvsp20hdn012.stubprod.com): java.lang.NullPointerException
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
      at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
      at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
      at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
      at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
      at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
      at org.apache.spark.scheduler.Task.run(Task.scala:86)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
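
      Since the NPE is thrown from generated code (GeneratedClass / WholeStageCodegenExec in the trace above), disabling whole-stage code generation might also sidestep the crash while the bug is open. This is an untested assumption based on the stack trace, not a verified workaround:

      >>> # assumption: turning off whole-stage codegen avoids the generated-code path
      >>> spark.conf.set("spark.sql.codegen.wholeStage", "false")
      >>> df.limit(3).groupBy("user_id").count().show()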

People

    Assignee: Takeshi Yamamuro (maropu)
    Reporter: Corey (correedsh)
    Votes: 3
    Watchers: 7
