Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
Description
Very useful dataset for testing the pipeline because of:
- "Big data" dataset - original Kaggle competition dataset is 12 gb, but there's 1tb dataset of the same schema as well.
- Sparse models - categorical features has high cardinality
- Reproducible results - because it's public and many other distributed machine learning libraries (e.g. wormwhole, parameter server, azure ml etc.) have made a base line benchmarks on which we could compare.
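As a concrete illustration of hashing the high-cardinality categorical fields, here is a minimal sketch using the existing HashingTF transformer from spark.ml. The column names (c1..c26) and the bucket count are assumptions for illustration, not part of this ticket:

{code}
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical schema: string columns c1..c26 hold the categorical fields.
val catCols = (1 to 26).map(i => s"c$i")

// Prefix each value with its field name so the same string appearing in two
// different fields hashes to different feature indices.
val fieldValueTokens = udf { (values: Seq[String]) =>
  catCols.zip(values).map { case (field, value) => s"$field:$value" }
}

def hashCategoricals(df: DataFrame): DataFrame = {
  val withTokens = df.withColumn("tokens", fieldValueTokens(array(catCols.map(col): _*)))
  new HashingTF()
    .setInputCol("tokens")
    .setOutputCol("hashedFeatures")
    .setNumFeatures(1 << 24) // assumption: 2^24 buckets to keep collisions rare
    .transform(withTokens)
}
{code}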
I have some baseline results with custom models (GBDT encoders and a tuned LR) on Spark 1.4. I will build pipelines using the public Spark models. The winning solution used a GBDT encoder (not available in Spark, but not difficult to build from the GBT in MLlib) + hashing + a factorization machine (planned for Spark 1.6). A sketch of the GBDT leaf-encoding step follows.
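Below is a minimal sketch of what such a GBDT leaf-encoder built on MLlib's GradientBoostedTrees could look like, feeding one-hot leaf indicators into logistic regression. The helper names and hyperparameters are assumptions, as is the premise that the input already contains fully numeric features; the hashing and factorization machine stages are omitted:

{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.{BoostingStrategy, FeatureType}
import org.apache.spark.mllib.tree.model.Node
import org.apache.spark.rdd.RDD

// Walk one tree and return the id of the leaf this point falls into.
def leafId(node: Node, features: Vector): Int =
  if (node.isLeaf) node.id
  else {
    val split = node.split.get
    val goLeft =
      if (split.featureType == FeatureType.Continuous) features(split.feature) <= split.threshold
      else split.categories.contains(features(split.feature))
    leafId(if (goLeft) node.leftNode.get else node.rightNode.get, features)
  }

// Enumerate a tree's leaf ids (trees are small and live on the driver).
def leaves(node: Node): Seq[Int] =
  if (node.isLeaf) Seq(node.id) else leaves(node.leftNode.get) ++ leaves(node.rightNode.get)

def gbdtEncodedLR(data: RDD[LabeledPoint]) = {
  val boosting = BoostingStrategy.defaultParams("Classification")
  boosting.numIterations = 30 // assumption: tune on the real data
  val gbt = GradientBoostedTrees.train(data, boosting)

  // Map each (tree, leaf) pair to a dense column index in the encoded space.
  var offset = 0
  val leafIndex = gbt.trees.map { tree =>
    val m = leaves(tree.topNode).zipWithIndex.map { case (id, j) => id -> (offset + j) }.toMap
    offset += m.size
    m
  }
  val numFeatures = offset

  // Re-encode every example as a one-hot vector over the leaves it lands in.
  val encoded = data.map { lp =>
    val indices = gbt.trees.zipWithIndex.map { case (tree, i) =>
      leafIndex(i)(leafId(tree.topNode, lp.features))
    }
    LabeledPoint(lp.label, Vectors.sparse(numFeatures, indices, Array.fill(indices.length)(1.0)))
  }

  new LogisticRegressionWithLBFGS().setNumClasses(2).run(encoded)
}
{code}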