Uploaded image for project: 'Mahout'
  1. Mahout
  2. MAHOUT-1640

Better collections would significantly improve vector-operation speed

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.11.0
    • 0.11.2
    • None

    Description

      The collections currently used by Mahout to implement sparse vectors are extremely slow. The proposed patch (localized to RandomAccessSparseVector) uses fastutil's maps and the speed improvements in vector benchmarks are very significant. It would be interesting to see whether these improvements percolate to high-level classes using sparse vectors.

      I had to patch two unit tests (an off-by-one bug and an overfitting bug; both were exposed by the different order in which key/values were returned by iterators).

      The included files speed-std and speed-fastutil show the speed improvement. Some more speed might be gained by using everywhere the standard java.util.Map.Entry interface instead of Element.

      DISCLAIMER: The "Times" set of tests has been run multiplying two identical vectors. The standard tests multiply two random vectors, so in fact they just test the speed of the underlying map remove() method, as almost all products are zero. This is not very realistic and was heavily penalizing fastutil's "true deletions". Better tests, with a typical overlap of nonzero entries, would be even more realistic.

      Attachments

        1. fastutil.patch
          11 kB
          Sebastiano Vigna
        2. speed-fastutil
          27 kB
          Sebastiano Vigna
        3. speed-std
          27 kB
          Sebastiano Vigna

        Activity

          People

            smarthi Suneel Marthi
            vigna Sebastiano Vigna
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: