SPARK-2465

Use long as user / item ID for ALS

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.0.1
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

    Description

      I'd like to float this for consideration: use longs instead of ints for user and product IDs in the ALS implementation.

      The main reason is that identifiers are not generally numeric at all, and will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits means collisions are likely after hundreds of thousands of users and items, which is not unrealistic. Hashing to 64 bits pushes this back to billions.
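
To put rough numbers on the collision argument, here is a minimal sketch in Python. It assumes IDs hash uniformly and uses the standard birthday-bound approximation; `hash_id` and `collision_probability` are hypothetical helpers for illustration, not part of MLlib, and SHA-256 is just one possible choice of hash.

```python
import hashlib
import math


def hash_id(raw_id: str, bits: int) -> int:
    """Hash a string ID to a signed integer of the given width (hypothetical helper)."""
    digest = hashlib.sha256(raw_id.encode("utf-8")).digest()
    unsigned = int.from_bytes(digest[: bits // 8], "big")
    # Reinterpret as two's complement so the value fits a JVM int (32) or long (64).
    if unsigned >= 1 << (bits - 1):
        unsigned -= 1 << bits
    return unsigned


def collision_probability(n_ids: int, bits: int) -> float:
    """Birthday-bound estimate of P(at least one collision) among n_ids uniform hashes."""
    pairs = n_ids * (n_ids - 1) / 2
    # 1 - exp(-pairs / 2^bits), written with expm1 to stay accurate for tiny values.
    return -math.expm1(-pairs / 2 ** bits)


if __name__ == "__main__":
    for n in (200_000, 5_000_000_000):
        for bits in (32, 64):
            print(f"n={n:>13,} bits={bits}: {collision_probability(n, bits):.3g}")
```

Under these assumptions, 200,000 IDs hashed to 32 bits give a collision probability of about 0.99, while the same IDs hashed to 64 bits give roughly one in a billion; the 64-bit estimate only approaches one half at around 5 billion IDs.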

      It would also mean numeric IDs that happen to be larger than the largest int can be used directly as identifiers.

      On the downside, of course: each ID takes 8 bytes instead of 4, so a Rating, which holds two IDs, uses 8 more bytes of memory.
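
The memory cost can be estimated with simple arithmetic. The sketch below counts only raw field bytes for a rating made of a user ID, a product ID, and an 8-byte double; it deliberately ignores JVM object headers, padding, and serialization overhead, so treat it as a lower-bound illustration. The function names are hypothetical.

```python
def rating_payload_bytes(id_bytes: int) -> int:
    """Raw field bytes for one rating: two IDs plus one 8-byte double value."""
    return 2 * id_bytes + 8


def extra_bytes(num_ratings: int) -> int:
    """Additional raw bytes needed when IDs grow from 4-byte ints to 8-byte longs."""
    return num_ratings * (rating_payload_bytes(8) - rating_payload_bytes(4))


if __name__ == "__main__":
    n = 1_000_000_000  # assumed data set size, for illustration only
    print(f"int IDs:  {rating_payload_bytes(4)} bytes per rating")
    print(f"long IDs: {rating_payload_bytes(8)} bytes per rating")
    print(f"extra for {n:,} ratings: {extra_bytes(n) / 1e9:.0f} GB")
```

By this count a rating grows from 16 to 24 raw bytes, which is 8 GB of additional field data per billion ratings before any JVM overhead.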

      Thoughts? I will post a PR to show what the change would look like.

      Attachments

        1. Screen Shot 2014-07-13 at 8.49.40 PM.png
          219 kB
          Xiangrui Meng
        2. ALS using MEMORY_AND_DISK.png
          47 kB
          Xiangrui Meng
        3. ALS using MEMORY_AND_DISK_SER.png
          47 kB
          Xiangrui Meng

            People

              Assignee: Unassigned
              Reporter: Sean R. Owen (srowen)
              Votes: 0
              Watchers: 8