Spot (Retired) / SPOT-286

Spot ML Flow IPv6 values causing memory error


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • 1.0
    • None
    • None

    Description

      I am seeing executor errors such as the following on flow datasets (50M+ records) that contain IPv6 source and destination IP values:

       java.io.IOException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

      The problem does not occur on larger IPv4 flow datasets (e.g. 500M+ records). This could be related to https://issues.apache.org/jira/browse/SPARK-6235; however, setting the following had no effect:

      spark.sql.autoBroadcastJoinThreshold=-1

      After removing the broadcast hints from the following joins, the Integer.MAX_VALUE errors no longer occur and the job completes:

       

      https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowSuspiciousConnectsModel.scala#L67

      https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/netflow/model/FlowSuspiciousConnectsModel.scala#L73
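
      For anyone trying to reproduce this, here is a minimal sketch of the difference between a join with an explicit broadcast hint and the same join without it. It is not the code from FlowSuspiciousConnectsModel.scala: the DataFrame names (flows, ipScores), the "ip" column, and the sample rows are hypothetical. As far as I understand, in Spark 2.x an explicit broadcast() hint is honored even when spark.sql.autoBroadcastJoinThreshold is -1, which would explain why that setting alone had no effect.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastHintSketch {
  // Explicit hint: asks Spark to broadcast `small` to every executor.
  // The hint is independent of spark.sql.autoBroadcastJoinThreshold.
  def joinWithHint(large: DataFrame, small: DataFrame): DataFrame =
    large.join(broadcast(small), Seq("ip"))

  // No hint: the planner picks the join strategy on its own.
  // With the threshold set to -1 it should not auto-broadcast.
  def joinWithoutHint(large: DataFrame, small: DataFrame): DataFrame =
    large.join(small, Seq("ip"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-hint-sketch")
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data keyed by an IPv6 address string.
    val flows = Seq(("2001:db8::1", 10L), ("2001:db8::2", 20L)).toDF("ip", "bytes")
    val ipScores = Seq(("2001:db8::1", 0.9), ("2001:db8::2", 0.1)).toDF("ip", "score")

    // Compare the physical plans to see which join strategy was chosen.
    joinWithHint(flows, ipScores).explain()
    joinWithoutHint(flows, ipScores).explain()

    spark.stop()
  }
}
```

      Both variants return the same rows; only the physical join strategy differs, so comparing the two explain() outputs is a quick way to check whether a hint is still forcing a broadcast.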

       

      Has anyone noticed similar behavior? 

      Is the model designed/tested to handle IPv6 data?

       

       

          People

            Assignee: Unassigned
            Reporter: Curtis Howard (curtis_howard)
            Votes: 0
            Watchers: 3

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 0.5h