Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
Description
Very useful dataset for testing the pipeline because of:
- "Big data" dataset - original Kaggle competition dataset is 12 gb, but there's 1tb dataset of the same schema as well.
- Sparse models - categorical features has high cardinality
- Reproducible results - because it's public and many other distributed machine learning libraries (e.g. wormwhole, parameter server, azure ml etc.) have made a base line benchmarks on which we could compare.
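As a concrete illustration of hashing the high-cardinality categorical fields, here is a minimal sketch using the existing HashingTF transformer from spark.ml. The column names (c1..c26) and the bucket count are assumptions for illustration, not part of this ticket:

{code}
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical schema: string columns c1..c26 hold the categorical fields.
val catCols = (1 to 26).map(i => s"c$i")

// Prefix each value with its field name so the same string appearing in two
// different fields hashes to different feature indices.
val fieldValueTokens = udf { (values: Seq[String]) =>
  catCols.zip(values).map { case (field, value) => s"$field:$value" }
}

def hashCategoricals(df: DataFrame): DataFrame = {
  val withTokens = df.withColumn("tokens", fieldValueTokens(array(catCols.map(col): _*)))
  new HashingTF()
    .setInputCol("tokens")
    .setOutputCol("hashedFeatures")
    .setNumFeatures(1 << 24) // assumption: 2^24 buckets to keep collisions rare
    .transform(withTokens)
}
{code}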
I have some baseline results with custom models (GBDT encoders and a tuned LR) on Spark 1.4. I will build pipelines using the public Spark models. The winning solution used a GBDT encoder (not available in Spark, but not difficult to build from the GBT in MLlib) + hashing + a factorization machine (planned for Spark 1.6). A sketch of the GBDT leaf-encoding step follows.
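Below is a minimal sketch of what such a GBDT leaf-encoder built on MLlib's GradientBoostedTrees could look like, feeding one-hot leaf indicators into logistic regression. The helper names and hyperparameters are assumptions, as is the premise that the input already contains fully numeric features; the hashing and factorization machine stages are omitted:

{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.{BoostingStrategy, FeatureType}
import org.apache.spark.mllib.tree.model.Node
import org.apache.spark.rdd.RDD

// Walk one tree and return the id of the leaf this point falls into.
def leafId(node: Node, features: Vector): Int =
  if (node.isLeaf) node.id
  else {
    val split = node.split.get
    val goLeft =
      if (split.featureType == FeatureType.Continuous) features(split.feature) <= split.threshold
      else split.categories.contains(features(split.feature))
    leafId(if (goLeft) node.leftNode.get else node.rightNode.get, features)
  }

// Enumerate a tree's leaf ids (trees are small and live on the driver).
def leaves(node: Node): Seq[Int] =
  if (node.isLeaf) Seq(node.id) else leaves(node.leftNode.get) ++ leaves(node.rightNode.get)

def gbdtEncodedLR(data: RDD[LabeledPoint]) = {
  val boosting = BoostingStrategy.defaultParams("Classification")
  boosting.numIterations = 30 // assumption: tune on the real data
  val gbt = GradientBoostedTrees.train(data, boosting)

  // Map each (tree, leaf) pair to a dense column index in the encoded space.
  var offset = 0
  val leafIndex = gbt.trees.map { tree =>
    val m = leaves(tree.topNode).zipWithIndex.map { case (id, j) => id -> (offset + j) }.toMap
    offset += m.size
    m
  }
  val numFeatures = offset

  // Re-encode every example as a one-hot vector over the leaves it lands in.
  val encoded = data.map { lp =>
    val indices = gbt.trees.zipWithIndex.map { case (tree, i) =>
      leafIndex(i)(leafId(tree.topNode, lp.features))
    }
    LabeledPoint(lp.label, Vectors.sparse(numFeatures, indices, Array.fill(indices.length)(1.0)))
  }

  new LogisticRegressionWithLBFGS().setNumClasses(2).run(encoded)
}
{code}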