[SPARK-12026] ChiSqTest gets slower and slower over time when number of features is large - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.2
Fix Version/s: 1.6.1, 2.0.0
Component/s: MLlib
Labels:
- mllib
- stats

Target Version/s: 1.6.1, 2.0.0

Description

I've been running a ChiSqTest to pick features for feature reduction. My understanding is that internally it creates jobs to run on batches of 1000 features at a time.

I was under the impression that the features are treated as independent, but this does not appear to be the case. When the number of features is large (160k in my case), each batch gets slower and slower. As an example, running on 25 m3.2xlarge instances on Amazon EMR, the first batches took just over 1 minute each. By the end, each batch was taking over 30 minutes.
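To make the expectation concrete, here is a minimal pure-Python sketch of a per-feature Pearson chi-squared computation done in column batches. It is not Spark's implementation: the function names `chi_squared_statistic` and `chi_squared_by_batch` and the in-memory list-of-rows layout are assumptions made for illustration only. It shows that, when features are tested independently against the label, each batch only needs to touch its own columns, so the work per batch should depend on the batch width rather than on how many batches came before.

```python
from collections import Counter


def chi_squared_statistic(pairs):
    """Pearson chi-squared statistic for an iterable of (feature_value, label) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    observed = Counter(pairs)
    feature_totals = Counter(f for f, _ in pairs)
    label_totals = Counter(l for _, l in pairs)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            # Expected count under independence of feature value and label.
            expected = f_count * l_count / n
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat


def chi_squared_by_batch(rows, labels, batch_size=1000):
    """Compute one statistic per feature column, visiting columns batch by batch.

    Hypothetical helper: `rows` is a list of equal-length feature lists and
    `labels` is the matching list of class labels.
    """
    num_features = len(rows[0])
    stats = []
    for start in range(0, num_features, batch_size):
        end = min(start + batch_size, num_features)
        # Only columns start..end-1 are read here; earlier batches are not revisited.
        for col in range(start, end):
            column_pairs = ((row[col], label) for row, label in zip(rows, labels))
            stats.append(chi_squared_statistic(column_pairs))
    return stats
```

In this sketch the result is the same whatever `batch_size` is used, which is the sense in which the features are independent of one another.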

Attachments

- First Stages.png (462 kB), attached by Hunter Kelly, 27/Nov/15 14:55
- Latest Stages.png (388 kB), attached by Hunter Kelly, 27/Nov/15 14:56

Issue Links

links to

[Github] Pull Request #10146 (hhbyyh)

People

Assignee: yuhao yang

Reporter: Hunter Kelly

Votes: 0

Watchers: 4

Dates

Created: 27/Nov/15 12:32

Updated: 14/Jan/16 01:44

Resolved: 14/Jan/16 01:44