[SPOT-43] [ML] Job count at InvalidDataHandler.scala:43 in SPOT-ML for Flow data is taking too long. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Labels:
- patch
- performance
Environment:
Cluster: 9 nodes, each with 256 GB RAM and 48 virtual CPUs.
Spark: 43 executors, each with 30 GB memory and 8 cores.

Flags: Patch

Description

When running spot-ml on ~1 TB of flow data, the last job takes 3.3 hours. The Spark UI shows that this job, count at InvalidDataHandler.scala:43, performs the same join that was already executed by the job map at FlowSuspiciousConnectsAnalysis.scala:42 (6.4 hours).
count at InvalidDataHandler.scala:43 filters corrupt records from scoredFlowRecords, and scoredFlowRecords appears to be recomputed for that count.
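
A common way to avoid this kind of recomputation in Spark is to persist the joined DataFrame before running more than one action on it. The sketch below is not the spot-ml code and not the change in the linked pull request; the object name, column names, sample rows, and the use of a negative score to mark corrupt records are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Hypothetical example object; none of these names come from spot-ml.
object PersistBeforeReuse {

  // Returns (valid record count, corrupt record count).
  def countValidAndCorrupt(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Tiny stand-ins for the flow records and the per-record scores.
    val flows  = Seq(("10.0.0.1", 80), ("10.0.0.2", 443), ("10.0.0.3", 22)).toDF("ip", "port")
    val scores = Seq(("10.0.0.1", 0.9), ("10.0.0.2", -1.0), ("10.0.0.3", 0.1)).toDF("ip", "score")

    // The expensive step. Without persist, every action on a DataFrame
    // derived from `scored` can re-run this join from the original inputs.
    val scored = flows.join(scores, Seq("ip")).persist(StorageLevel.MEMORY_AND_DISK)

    try {
      // Assumption: a negative score marks a corrupt record.
      val valid   = scored.filter(col("score") >= 0.0)
      val corrupt = scored.filter(col("score") < 0.0)

      // Two separate actions; the second can read the persisted join result.
      (valid.count(), corrupt.count())
    } finally {
      scored.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-before-reuse")
      .getOrCreate()
    try {
      val (validCount, corruptCount) = countValidAndCorrupt(spark)
      println(s"valid=$validCount corrupt=$corruptCount")
    } finally {
      spark.stop()
    }
  }
}
```

MEMORY_AND_DISK is used here so that partitions that do not fit in executor memory spill to disk instead of being recomputed; whether that trade-off suits a 1 TB data set on this cluster would need to be measured.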

Steps to reproduce:
Start spot-ml for flow data with a data set of 1 TB or larger.
Wait until it completes, then check in the Spark UI how long the last job took.

Attachments

Screen Shot 2017-02-07 at 11.18.37 AM.png
07/Feb/17 17:20
544 kB
Ricardo Barona

Issue Links

links to

GitHub Pull Request #20

People

Assignee: Unassigned

Reporter: Ricardo Barona

Votes: 0

Watchers: 2

Dates

Created: 07/Feb/17 17:20

Updated: 17/Feb/17 22:11

Resolved: 17/Feb/17 22:11

Time Tracking

Estimated: 48h

Remaining: 48h

Logged: Not Specified