
Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML

    Description

      A very useful dataset for testing the pipeline, because:

      1. "Big data" scale - the original Kaggle competition dataset is 12 GB, and a 1 TB dataset with the same schema is also available.
      2. Sparse models - the categorical features have high cardinality.
      3. Reproducible results - the data is public, and many other distributed machine learning libraries (e.g. wormhole, Parameter Server, Azure ML) have published baseline benchmarks against which we can compare.

      I have some baseline results with custom models (GBDT encoders and a tuned LR) on Spark 1.4. I will build pipelines using the public Spark models. The winning solution used a GBDT encoder (not available in Spark, but not difficult to build on top of MLlib's GBT) + feature hashing + a factorization machine (planned for Spark 1.6); a sketch of such a pipeline is below.
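      As a rough illustration, here is a minimal sketch of that kind of pipeline on the Spark 1.4-era spark.mllib API. The column layout (label, 13 numeric columns, 26 categorical columns, tab-separated), the 4M-dimensional hashing space, and the names CriteoSketch, leafId, and train are illustrative assumptions, not the actual benchmark code; the GBT model is assumed to be pre-trained on the numeric columns, and a factorization machine would slot in where the LR is.

      import org.apache.spark.SparkContext
      import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
      import org.apache.spark.mllib.feature.HashingTF
      import org.apache.spark.mllib.linalg.{Vector, Vectors}
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.tree.configuration.FeatureType
      import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel, Node}

      object CriteoSketch {

        // GBDT encoder: follow one tree from the root to the leaf that a
        // feature vector falls into and return that leaf's node id. Each
        // tree then contributes one high-cardinality categorical feature.
        def leafId(node: Node, features: Vector): Int =
          if (node.isLeaf) node.id
          else {
            val split = node.split.get
            val goLeft =
              if (split.featureType == FeatureType.Continuous)
                features(split.feature) <= split.threshold
              else split.categories.contains(features(split.feature))
            leafId(if (goLeft) node.leftNode.get else node.rightNode.get, features)
          }

        // Hash the GBDT leaf ids together with the raw categorical fields
        // into one sparse vector, then train LR on it. `gbdt` is assumed
        // to be pre-trained on the 13 numeric columns.
        def train(sc: SparkContext, path: String, gbdt: GradientBoostedTreesModel) = {
          val hasher = new HashingTF(1 << 22) // ~4M-dimensional hashed space
          val data = sc.textFile(path).map { line =>
            val cols = line.split("\t", -1) // label, 13 ints, 26 categoricals
            val label = cols(0).toDouble
            val numeric = Vectors.dense(
              cols.slice(1, 14).map(v => if (v.isEmpty) 0.0 else v.toDouble))
            val leafTokens = gbdt.trees.zipWithIndex.map {
              case (tree, i) => s"tree$i:${leafId(tree.topNode, numeric)}"
            }
            val catTokens = cols.drop(14).zipWithIndex.collect {
              case (v, i) if v.nonEmpty => s"c$i:$v"
            }
            LabeledPoint(label, hasher.transform(leafTokens ++ catTokens))
          }
          new LogisticRegressionWithLBFGS().run(data.cache())
        }
      }

      Prefixing each token with its tree or column index before hashing keeps identical values from different columns in distinct buckets, which is the usual way to flatten a GBDT leaf encoding and the raw categoricals into a single sparse vector.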

          People

            Assignee: Unassigned
            Reporter: Peter Rudenko
