[SPARK-10329] Cost RDD in k-means|| initialization is not storage-efficient


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: 1.3.1, 1.4.1, 1.5.0
    • Fix Version/s: None
    • Component/s: MLlib

    Description

      Currently we use `RDD[Vector]` to store point costs during k-means|| initialization, where each `Vector` has size `runs`. This is not storage-efficient because `runs` is usually 1, so each record is a `Vector` of size 1. We need only 8 bytes to store the cost, but we introduce two extra objects (a `DenseVector` and its values array), which could cost 16 bytes: 200% overhead. Thanks to Jie Huang and Jiayin Hu from Intel for reporting this issue!
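
      For illustration, here is a minimal sketch of the pattern described above, using hypothetical names (`pointCosts`, `costForRun`) rather than the actual code in `KMeans.scala`:

      ```scala
      import org.apache.spark.mllib.linalg.{Vector, Vectors}
      import org.apache.spark.rdd.RDD

      // Hypothetical stand-in for the per-run point cost computation.
      def costForRun(point: Vector, run: Int): Double = ???

      // Sketch of the current pattern: one DenseVector of length `runs`
      // per point, even when runs == 1, so each 8-byte cost drags along
      // a DenseVector object plus its backing array.
      def pointCosts(points: RDD[Vector], runs: Int): RDD[Vector] =
        points.map(p => Vectors.dense(Array.tabulate(runs)(r => costForRun(p, r))))
      ```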

      There are several possible solutions:

      1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which drops the `DenseVector` wrapper and saves 8 bytes per record.
      2. Use `RDD[Array[Double]]` but batch the values for storage, e.g., have each `Array[Double]` object cover 1024 instances, which would remove most of the overhead (see the sketch after this list).
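
      A rough sketch of option 2, assuming a hypothetical helper `batchedCosts` (not the actual patch):

      ```scala
      import org.apache.spark.rdd.RDD

      // Pack per-point costs into chunks of `batchSize` doubles so the
      // per-object overhead is amortized across many records.
      def batchedCosts(costs: RDD[Double], batchSize: Int = 1024): RDD[Array[Double]] =
        costs.mapPartitions(_.grouped(batchSize).map(_.toArray))
      ```

      Note that batching changes record boundaries (one record per `batchSize` points), so downstream consumers would need to index into the batches rather than zip record by record.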

      Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY for the cost RDDs could prevent them from evicting the training dataset from memory.
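
      A minimal sketch of that persistence change, assuming the cost RDD has already been built:

      ```scala
      import org.apache.spark.rdd.RDD
      import org.apache.spark.storage.StorageLevel

      // Spill cost blocks to disk under memory pressure instead of letting
      // them evict the cached training data.
      def persistCosts(costs: RDD[Array[Double]]): RDD[Array[Double]] =
        costs.persist(StorageLevel.MEMORY_AND_DISK)
      ```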


          People

            Assignee: hujiayin
            Reporter: Xiangrui Meng
            Votes: 0
            Watchers: 4

