Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 1.0
- Fix Version: None
- Component: None
Description
I am seeing executor errors such as the following for flow datasets (50M+ records) containing IPv6 source and destination IP values:
java.io.IOException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
The problem does not occur for larger IPv4 flow datasets (e.g. 500M+ records). This could be related to https://issues.apache.org/jira/browse/SPARK-6235; however, setting the following had no effect:
spark.sql.autoBroadcastJoinThreshold=-1
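For reference, a minimal sketch of how that setting was applied, assuming the standard SparkSession builder API (the application name is a hypothetical placeholder):

```scala
import org.apache.spark.sql.SparkSession

// -1 means "never automatically broadcast"; this setting had no effect
// on the Integer.MAX_VALUE errors described above.
val spark = SparkSession.builder()
  .appName("flow-analysis")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()
```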
After removing the broadcast hints for the joins involved, the Integer.MAX_VALUE errors no longer occur and the job completes.
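For context, an explicit broadcast hint takes the following shape in the DataFrame API. This is a hypothetical sketch: the `flows` and `lookup` DataFrames and the join keys are stand-ins for the actual joins, which are not shown above.

```scala
import org.apache.spark.sql.functions.broadcast

// With an explicit hint, the hinted relation is collected and shipped to
// every executor as a single broadcast; oversized serialized blocks can
// trip Spark's 2 GB limit ("Size exceeds Integer.MAX_VALUE").
val withHint = flows.join(broadcast(lookup), Seq("src_ip", "dst_ip"))

// Without the hint, Spark falls back to a shuffle-based join, which
// avoided the error in this case.
val withoutHint = flows.join(lookup, Seq("src_ip", "dst_ip"))
```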
Has anyone noticed similar behavior?
Is the model designed/tested to handle IPv6 data?