Details
Description
We are seeing issues with the code generator when querying Java bean encoded data with two nested full outer joins.
dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer");
will generate invalid code in the code generator and, depending on the data used, can produce stack traces like:
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
Or:
Caused by: java.lang.AssertionError: index (2) should < 2
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
When we look at the generated code, it appears the code generator is mixing up parameters. For example:
if (smj_leftOutputRow_0 != null) {                            // <==== null check for wrong/left parameter
    boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); // <==== causes NPE on right parameter here
It is as if the nesting of two full outer joins confuses the code generator, causing it to emit invalid code.
There is one other strange thing: we found this issue when using Datasets encoded with the Java bean encoder. We tried to reproduce it in the Spark shell and with Scala case classes, but were unable to do so.
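For reference, the row values the two chained full outer joins should produce can be sketched in plain Java (no Spark involved; the ids and values below are hypothetical illustration data, not taken from our actual datasets):

```java
import java.util.*;

// Plain-Java sketch of the expected semantics of
// dsA.join(dsB, "full_outer").join(dsC, "full_outer"):
// every id present on either side appears once, with nulls
// filling the columns of the missing side.
public class FullOuterJoinSketch {

    // Full outer join of an already-joined left side (id -> row of values)
    // with a new right side (id -> single value).
    static Map<Integer, List<String>> fullOuter(
            Map<Integer, List<String>> left, Map<Integer, String> right, int leftWidth) {
        Set<Integer> ids = new TreeSet<>(left.keySet());
        ids.addAll(right.keySet());
        Map<Integer, List<String>> out = new LinkedHashMap<>();
        for (int id : ids) {
            List<String> row = new ArrayList<>(
                    left.getOrDefault(id, Collections.nCopies(leftWidth, (String) null)));
            row.add(right.get(id)); // null when the id is absent on the right
            out.put(id, row);
        }
        return out;
    }

    // Lift a single-column dataset into one-element rows.
    static Map<Integer, List<String>> asRows(Map<Integer, String> ds) {
        Map<Integer, List<String>> rows = new LinkedHashMap<>();
        ds.forEach((id, v) -> rows.put(id, List.of(v)));
        return rows;
    }

    public static Map<Integer, List<String>> run() {
        Map<Integer, String> dsA = Map.of(1, "a1", 2, "a2");
        Map<Integer, String> dsB = Map.of(2, "b2", 3, "b3");
        Map<Integer, String> dsC = Map.of(3, "c3", 4, "c4");
        Map<Integer, List<String>> ab = fullOuter(asRows(dsA), dsB, 1);
        return fullOuter(ab, dsC, 2);
    }

    public static void main(String[] args) {
        // Prints one row per id, e.g. 1 -> [a1, null, null]
        run().forEach((id, row) -> System.out.println(id + " -> " + row));
    }
}
```

The invalid generated code instead reads null flags from the wrong side's row, which is how the NPE and the out-of-range `isNullAt` index arise.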
We made a reproduction scenario as unit tests (one for each of the stack traces above) on the Spark code base and made it available as a pull request attached to this case.