  1. Mahout
  2. MAHOUT-1507

Support input and output using user defined ID wherever possible



    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.9
    • Fix Version/s: 0.10.0
    • Component/s: Math
    • Labels:
    • Environment:

      Spark Scala, Mahout v2


      All users of Mahout have data that is addressed by keys or IDs of their own devising. To use much of Mahout they must translate these IDs into Mahout IDs, run their jobs, and then translate back again when retrieving the output. If the ID space is very large, this is a difficult problem for users to solve at scale.
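
      A minimal sketch of the round trip described above, in plain Python rather than Mahout code. The IdDictionary class, the encode/decode helpers, and the sample IDs are hypothetical and only illustrate the bookkeeping users carry today; an in-memory dictionary like this is exactly the part that stops working when the ID space is very large.

```python
from typing import Dict, Hashable, List, Tuple


class IdDictionary:
    """Hypothetical two-way map between external IDs and dense int indexes."""

    def __init__(self) -> None:
        self._to_index: Dict[Hashable, int] = {}
        self._to_id: List[Hashable] = []

    def index_of(self, external_id: Hashable) -> int:
        # Assign the next free index the first time an ID is seen.
        if external_id not in self._to_index:
            self._to_index[external_id] = len(self._to_id)
            self._to_id.append(external_id)
        return self._to_index[external_id]

    def id_of(self, index: int) -> Hashable:
        return self._to_id[index]


def encode(pairs, users: IdDictionary, items: IdDictionary) -> List[Tuple[int, int]]:
    # External (user, item) pairs -> (row index, column index) pairs.
    return [(users.index_of(u), items.index_of(i)) for u, i in pairs]


def decode(pairs, users: IdDictionary, items: IdDictionary):
    # (row index, column index) pairs -> external (user, item) pairs.
    return [(users.id_of(r), items.id_of(c)) for r, c in pairs]


# Sample data is made up for illustration.
interactions = [("pat", "iPad"), ("sam", "iPad"), ("pat", "Nexus")]
users, items = IdDictionary(), IdDictionary()

encoded = encode(interactions, users, items)   # what the job would consume
decoded = decode(encoded, users, items)        # what the user wants back
```

      Both dictionaries have to be kept alongside the job output, since the integer indexes mean nothing without them.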

      For many Mahout operations this would not be necessary if these external keys could be maintained for vectors and dimensions, or for rows and columns of a DRM.

      The reason I bring this up now is that much groundwork is being laid for Mahout's future on Spark so getting this notion in early could be fundamentally important and used to build on.

      If external IDs for rows and columns were maintained then RSJ, DRM Transpose (and other DRM ops), vector extraction, clustering, and recommenders would need no ID translation steps, a big user win.

      A partial solution might be to support external row IDs alone, somewhat like the NamedVector and PropertyVector in the Mahout Hadoop code.

      On Apr 3, 2014, at 11:00 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

      Perhaps this is best phrased as a feature request.

      On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:


      Sequence file keys also have special meaning if they are Ints. E.g. the A'
      physical operator requires keys to be ints, in which case it interprets
      them as row indexes that become column indexes. This of course isn't always
      the case, e.g. (Aexpr).t %*% Aexpr doesn't require int indices because in
      reality the optimizer will never choose actual transposition as a physical step
      in such a pipeline. This interpretation is consistent with the interpretation of
      the long-existing Hadoop-side DistributedRowMatrix#transpose.
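
      The point about int keys can be illustrated with a toy model in plain Python. This is not Mahout's optimizer or its physical operators; the Drm type alias and the transpose and at_a functions are assumptions made for the sketch. An explicit transpose turns each row key into a column index, so the key must be an int, while A'A can be accumulated as a sum of per-row outer products that never reads the keys.

```python
from typing import Dict, Hashable

Row = Dict[int, float]        # sparse row: column index -> value
Drm = Dict[Hashable, Row]     # toy "DRM": row key -> sparse row


def transpose(a: Drm) -> Dict[int, Row]:
    """Explicit transpose: every row key becomes a column index."""
    out: Dict[int, Row] = {}
    for key, row in a.items():
        if not isinstance(key, int):
            raise TypeError(f"row key {key!r} cannot become a column index")
        for col, value in row.items():
            out.setdefault(col, {})[key] = value
    return out


def at_a(a: Drm) -> Dict[int, Row]:
    """A'A as a sum of outer products of rows; row keys are never read."""
    out: Dict[int, Row] = {}
    for row in a.values():
        for i, vi in row.items():
            target = out.setdefault(i, {})
            for j, vj in row.items():
                target[j] = target.get(j, 0.0) + vi * vj
    return out


# Same matrix twice: once with int keys, once with made-up external keys.
a_int: Drm = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
a_named: Drm = {"pat": {0: 1.0, 1: 2.0}, "sam": {1: 3.0}}

a_int_t = transpose(a_int)     # fine: keys are ints
gram_named = at_a(a_named)     # fine: keys are ignored
```

      Calling transpose(a_named) raises a TypeError in this sketch, which mirrors the constraint described above.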

      On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

      On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel <pat@occamsmachete.com> wrote:

      On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

      I think this duality, names and keys, is not very healthy really, and
      creates additional hassle. Spark DRM takes care of keys automatically
      throughout, but propagating names from named vectors is solely the algorithm's
      concern as it stands.

      Not sure what you mean.

      Not what you think, it looks like.

      I mean that the Mahout DRM structure is a bag of (key -> Vector) pairs. When
      persisted, the key goes to the key of a sequence file. In particular, it means
      that there is a case of Bag[key -> NamedVector], which means an external
      anchor could be saved to either the key or the name of a row. In practice it causes
      a compatibility mess, e.g. we saw those numerous cases where seq2sparse
      saves external keys (file paths) into the key, whereas clustering
      algorithms do not see them because they expect them to be the name part
      of the vector. I am just saying we have two ways to name the rows, and it
      is generally not a healthy choice for the aforementioned reason.
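
      The mismatch described above can be reproduced with a small plain-Python model. Vec stands in for a possibly named vector and a record is a (key, vector) pair; the writer and reader functions and the file paths are hypothetical and do not correspond to the actual seq2sparse or clustering code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Vec:
    values: List[float]
    name: Optional[str] = None    # plays the role of a NamedVector's name


Record = Tuple[object, Vec]       # (sequence-file key, vector)


def write_id_in_key(doc_id: str, values: List[float]) -> Record:
    # Convention 1: external ID goes into the key, vector stays unnamed.
    return (doc_id, Vec(values))


def write_id_in_name(row: int, doc_id: str, values: List[float]) -> Record:
    # Convention 2: key is a row number, external ID goes into the name.
    return (row, Vec(values, name=doc_id))


def read_ids_from_name(records: List[Record]) -> List[Optional[str]]:
    # A consumer that only looks at the name part of each vector.
    return [vec.name for _key, vec in records]


# Made-up file paths used as external IDs.
by_key = [write_id_in_key("/docs/a.txt", [1.0, 0.0]),
          write_id_in_key("/docs/b.txt", [0.0, 1.0])]
by_name = [write_id_in_name(0, "/docs/a.txt", [1.0, 0.0]),
           write_id_in_name(1, "/docs/b.txt", [0.0, 1.0])]

lost = read_ids_from_name(by_key)     # IDs were in the key, so none are found
found = read_ids_from_name(by_name)   # IDs were in the name, so all are found
```

      With a single agreed place for the external ID, the reader and writer could not disagree in this way.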

      In my experience Names and Properties are primarily used to store
      external keys, which are quite healthy.

      Users never have data with Mahout keys, they must constantly go back and
      forth. This is exactly what the R data frame does, no? I'm not so concerned
      with being able to address an element by the external key
      drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have the
      external ids follow the data through any calculation that makes sense.

      I am with you on this.

      This would mean clustering, recommendations, transpose, RSJ would require
      no id transforming steps. This would make dealing with Mahout much easier.

      Data frames are a slightly different thing; right now we work just with
      matrices. Although, yes, our in-core matrices support row and column names
      (just like in R) and distributed matrices support row keys only. What I
      mean is that an algebraic expression, e.g.

      Aexpr %*% Bexpr, will automatically propagate keys from Aexpr as implied
      above, but not necessarily named vectors, because internally algorithms
      blockify things into matrix blocks, and I am far from sure that Mahout
      in-core stuff works correctly with named vectors as part of a matrix block
      in all situations. I may be wrong. I always relied on sequence file keys to
      identify data points.
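
      Key propagation through a product can be sketched in plain Python as follows. This is a toy model, not the Spark DRM implementation: the left operand is a list of (key, row) pairs, the right operand is assumed to be a small in-core matrix, and the times function and sample keys are hypothetical.

```python
from typing import Hashable, List, Tuple

KeyedRow = Tuple[Hashable, List[float]]


def times(a: List[KeyedRow], b: List[List[float]]) -> List[KeyedRow]:
    """Row-wise A times B: each output row keeps the key of its input row."""
    n_cols = len(b[0])
    return [
        (key, [sum(x * b[k][j] for k, x in enumerate(row)) for j in range(n_cols)])
        for key, row in a
    ]


# Made-up external row keys on the left operand.
a = [("pat", [1.0, 2.0]), ("sam", [0.0, 3.0])]
b = [[1.0, 0.0],
     [1.0, 1.0]]

c = times(a, b)
result_keys = [key for key, _ in c]
```

      Because each output row depends on exactly one input row, the key can ride along without any translation step.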

      Note that sequence file keys are more than just a name; a key can be anything
      Writable. I.e. you could save a data structure there, as long as you have a
      Writable for it.

      On Apr 2, 2014 1:08 PM, "Pat Ferrel" <pat@occamsmachete.com> wrote:

      Are the Spark efforts supporting all Mahout Vector types? Named,
      Vectors? It occurred to me that data frames in R are a related but more
      general solution. If all rows and columns of a DRM and their
      Vectors (row or column vectors) were to support arbitrary properties
      attached to them in such a way that they are preserved during
      Vector extraction and any other operations that make sense, there
      would be a huge benefit for users.

      One of the constant problems with input to Mahout is translation of
      external to Mahout going in, Mahout to external coming out. Most of
      this would be unneeded if Mahout supported data frames; some would be
      avoided by supporting named or property vectors universally.




            • Assignee:
              pferrel Pat Ferrel
            • Votes:
              0
            • Watchers:
              6


              • Created: