[SPOT-286] Spot ML Flow IPv6 values causing memory error - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0
Fix Version/s: None
Labels: None

Description

I am seeing executor errors such as the following on flow datasets (50M+ records) that contain IPv6 source and destination IP values:

java.io.IOException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

The problem does not occur for larger IPv4 flow datasets (e.g. 500M+ records). This could be related to https://issues.apache.org/jira/browse/SPARK-6235; however, setting the following had no effect:

spark.sql.autoBroadcastJoinThreshold=-1
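For reference, a minimal sketch of how this setting can be applied programmatically. It assumes a Spark 2.x `SparkSession`; the object name, application name, and local master are placeholders and are not taken from the Spot code. Note that, as far as I understand, this threshold only controls broadcasts that the planner chooses automatically. A join wrapped in an explicit `broadcast(...)` hint is still broadcast, which may be why the setting had no effect here.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone driver; not part of spot-ml.
object DisableAutoBroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("disable-auto-broadcast-sketch") // placeholder name
      .master("local[*]")                       // assumption: local run for illustration
      // -1 turns off size-based automatic broadcast joins.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()

    // Read the value back to confirm it was applied to this session.
    println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

    spark.stop()
  }
}
```

The same value can also be passed on the command line with `--conf spark.sql.autoBroadcastJoinThreshold=-1` when submitting the job.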

After removing the broadcast hints from the following two joins, the Integer.MAX_VALUE errors no longer occur and the job completes:

https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowSuspiciousConnectsModel.scala#L67

https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowSuspiciousConnectsModel.scala#L73
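To illustrate the kind of change described above, here is a rough sketch of a join with and without an explicit broadcast hint. This is not the code from FlowSuspiciousConnectsModel.scala: the DataFrame names, the column names (`ip`, `port`, `topic`), and the sample rows are all hypothetical, and a Spark 2.x `SparkSession` is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

// Hypothetical example; names and data are illustrative only.
object BroadcastHintSketch {

  // With the hint: Spark is asked to ship `lookup` to every executor.
  def joinWithHint(flows: DataFrame, lookup: DataFrame): DataFrame =
    flows.join(broadcast(lookup), Seq("ip"))

  // Without the hint: the planner is free to pick a shuffle-based join.
  def joinWithoutHint(flows: DataFrame, lookup: DataFrame): DataFrame =
    flows.join(lookup, Seq("ip"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-hint-sketch")
      .master("local[*]")
      // Disable automatic broadcasts so only the explicit hint triggers one.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    val flows  = Seq(("2001:db8::1", 80), ("10.0.0.1", 443)).toDF("ip", "port")
    val lookup = Seq(("2001:db8::1", 0), ("10.0.0.1", 1)).toDF("ip", "topic")

    // Print both physical plans to compare the chosen join strategies.
    joinWithHint(flows, lookup).explain()
    joinWithoutHint(flows, lookup).explain()

    spark.stop()
  }
}
```

With automatic broadcasting disabled, the hinted join should typically plan as a broadcast hash join and the unhinted one as a sort-merge join, so comparing the two `explain()` outputs is a quick way to check whether a hint is still forcing a broadcast.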

Has anyone noticed similar behavior?

Is the model designed/tested to handle IPv6 data?

People

Assignee: Unassigned
Reporter: Curtis Howard
Votes: 0
Watchers: 3

Dates

Created: 25/Oct/18 20:07
Updated: 11/Jul/20 14:57
Resolved: 11/Jul/20 14:57

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 0.5h