Mahout / MAHOUT-308

Improve Lanczos to handle extremely large feature sets (without hashing)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.3
    • Fix Version/s: 0.5
    • Component/s: Math
    • Labels: None
    • Environment: all

      Description

      DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the driver (client) machine while Hadoop is iterating. The memory requirement is (desiredRank) * (numColumnsOfInput) * 8 bytes, which, for a desiredRank of a few hundred, caps usefulness at a few million columns on most commodity hardware.
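
To make the formula above concrete, here is a minimal sketch of the arithmetic. The function name and the sample sizes are illustrative assumptions, not values from Mahout; the only thing taken from the description is rank * columns * 8 bytes, which assumes dense vectors of 8-byte doubles.

```python
def lanczos_basis_bytes(desired_rank: int, num_columns: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed to hold desired_rank dense vectors of length num_columns."""
    return desired_rank * num_columns * bytes_per_entry


# Illustrative sizes only: a rank of 300 at increasing column counts.
for cols in (1_000_000, 10_000_000, 100_000_000):
    gib = lanczos_basis_bytes(300, cols) / 2**30
    print(f"{cols:>11,} columns, rank 300: {gib:8.1f} GiB")
```

At rank 300, one million columns already needs 2.4e9 bytes on the driver, and the requirement grows linearly with the column count.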

      The solution (without resorting to stochastic decomposition) is to persist the Lanczos basis to disk, keeping only the two most recent vectors in memory. Some care must be taken in the "orthogonalizeAgainstBasis()" method, which uses the entire basis; with the basis on disk, that step would be slower.
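
The idea of orthogonalizing against a basis that lives on disk can be sketched as follows. This is not the attached patch or Mahout's code: `DiskBackedBasis` and `orthogonalize_against_basis` are hypothetical names, the file layout is an assumption, and the basis vectors are assumed to be unit-norm. The sketch covers only the orthogonalization step and does not model keeping the two most recent vectors in memory.

```python
import os
import tempfile
from array import array


class DiskBackedBasis:
    """Hypothetical helper: stores each basis vector as a file of float64 values."""

    def __init__(self, directory):
        self.directory = directory
        self.count = 0

    def _path(self, index):
        return os.path.join(self.directory, f"basis-{index}.bin")

    def append(self, vector):
        # Write the vector to its own file so it can be dropped from memory.
        with open(self._path(self.count), "wb") as f:
            array("d", vector).tofile(f)
        self.count += 1

    def load(self, index, length):
        # Read one vector back; only this vector is resident at a time.
        vec = array("d")
        with open(self._path(index), "rb") as f:
            vec.fromfile(f, length)
        return vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonalize_against_basis(vector, basis):
    """Remove from `vector` its component along each stored (unit-norm) basis vector."""
    result = list(vector)
    for i in range(basis.count):
        q = basis.load(i, len(result))  # one disk read per basis vector
        coeff = dot(result, q)
        result = [r - coeff * x for r, x in zip(result, q)]
    return result


with tempfile.TemporaryDirectory() as tmp:
    basis = DiskBackedBasis(tmp)
    basis.append([1.0, 0.0, 0.0])
    basis.append([0.0, 1.0, 0.0])
    residual = orthogonalize_against_basis([3.0, 4.0, 5.0], basis)
    print(residual)  # [0.0, 0.0, 5.0]
```

Memory use stays at two vectors (the one being orthogonalized and the one just loaded) regardless of rank, at the cost of one disk read per basis vector on every call, which is the slowdown the description anticipates.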

      1. MAHOUT-308.patch
        14 kB
        Danny Leshem

        Activity

        No work has yet been logged on this issue.

          People

          • Assignee:
            Jake Mannix
            Reporter:
            Jake Mannix
          • Votes:
            2
            Watchers:
            5

            Dates

            • Created:
              Updated:
              Resolved:
