[MAHOUT-1507] Support input and output using user defined ID wherever possible - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.9
Fix Version/s: 0.10.0
Component/s: classic
Labels:
- DSL
- scala
- spark
Environment:

Spark Scala, Mahout v2

Description

All users of Mahout have data which is addressed by keys or IDs of their own devising. In order to use much of Mahout they must translate these IDs into Mahout IDs, then run their jobs and translate back again when retrieving the output. If the ID space is very large this is a difficult problem for users to solve at scale.
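To make the round trip concrete, here is a minimal sketch in plain Python of the translation a user has to do today. It does not use any Mahout API; the helper `build_id_index` and the sample interactions are hypothetical and only illustrate the bookkeeping.

```python
def build_id_index(external_ids):
    """Assign a dense integer index to each distinct external ID, in first-seen order."""
    to_index = {}
    to_external = []
    for ext in external_ids:
        if ext not in to_index:
            to_index[ext] = len(to_external)
            to_external.append(ext)
    return to_index, to_external


# Hypothetical (user, item) interactions addressed by the user's own IDs.
interactions = [("pat", "iPad"), ("dmitriy", "iPad"), ("pat", "nexus")]

user_index, user_names = build_id_index(u for u, _ in interactions)
item_index, item_names = build_id_index(i for _, i in interactions)

# Translate in: external IDs -> integer (row, column) pairs.
encoded = [(user_index[u], item_index[i]) for u, i in interactions]

# Translate out: integer pairs -> external IDs.
decoded = [(user_names[r], item_names[c]) for r, c in encoded]
```

In this sketch both dictionaries sit in one process's memory. With a very large ID space they would themselves have to be built and joined as distributed data sets, which is the scaling problem described above.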

For many Mahout operations this would not be necessary if these external keys could be maintained for vectors and dimensions, or for rows and columns of a DRM.

The reason I bring this up now is that much groundwork is being laid for Mahout's future on Spark so getting this notion in early could be fundamentally important and used to build on.

If external IDs for rows and columns were maintained then RSJ, DRM Transpose (and other DRM ops), vector extraction, clustering, and recommenders would need no ID translation steps, a big user win.

A partial solution might be to support external row IDs alone, somewhat like the NamedVector and PropertyVector in the Mahout Hadoop code.

On Apr 3, 2014, at 11:00 AM, Pat Ferrel <pat@occamsmachete.com> wrote:

Perhaps this is best phrased as a feature request.

On Apr 2, 2014, at 2:55 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

PS.

Sequence file keys also have special meaning if they are Ints. E.g., the A'
physical operator requires keys to be Ints, in which case it interprets
them as row indexes that become column indexes. This of course isn't always
the case, e.g. (Aexpr).t %*% Aexpr doesn't require int indices because in
reality the optimizer will never choose actual transposition as a physical
step in such a pipeline. This interpretation is consistent with that of the
long-existing Hadoop-side DistributedRowMatrix#transpose.
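The point about Int keys can be sketched in plain Python. This is an illustration only, not how Mahout's optimizer or physical operators are implemented; `transpose` and `gram` are hypothetical helpers over a dict of sparse rows. An explicit transpose turns row keys into column indexes, so it needs Int keys, whereas A'A can be accumulated row by row without ever reading the keys.

```python
def transpose(rows):
    """Transpose a bag of (int row key -> {int column index: value}) sparse rows."""
    transposed = {}
    for row_key, vector in rows.items():
        if not isinstance(row_key, int):
            # A row key becomes a column index, so it has to be an int.
            raise TypeError("transpose needs int row keys; got %r" % (row_key,))
        for col, value in vector.items():
            transposed.setdefault(col, {})[row_key] = value
    return transposed


def gram(rows):
    """Compute A'A as a sum of per-row outer products; row keys are never read."""
    result = {}
    for vector in rows.values():
        for i, vi in vector.items():
            for j, vj in vector.items():
                out_row = result.setdefault(i, {})
                out_row[j] = out_row.get(j, 0.0) + vi * vj
    return result


int_keyed = {0: {0: 1.0, 2: 3.0}, 1: {1: 2.0}}
int_keyed_t = transpose(int_keyed)

# Rows keyed by external string IDs: fine for A'A, not for an explicit transpose.
name_keyed = {"pat": {0: 1.0, 1: 2.0}, "dmitriy": {1: 3.0}}
ata = gram(name_keyed)
```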

On Wed, Apr 2, 2014 at 2:45 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

On Wed, Apr 2, 2014 at 1:56 PM, Pat Ferrel <pat@occamsmachete.com> wrote:

On Apr 2, 2014, at 1:39 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

I think this duality, names and keys, is not very healthy really, and just
creates additional hassle. Spark drm takes care of keys automatically
throughout, but propagating names from named vectors is solely an algorithm
concern as it stands.

Not sure what you mean.

Not what you think, it looks like.

I mean that the Mahout DRM structure is a bag of (key -> Vector) pairs. When
persisted, the key goes to the key of a sequence file. In particular, it means
that there is a case of Bag[key -> NamedVector], which means an external
anchor could be saved to either the key or the name of a row. In practice it
causes a compatibility mess, e.g. we saw numerous cases where seq2sparse
saves external keys (file paths) into the key, whereas clustering
algorithms do not see them because they expect them to be the name part
of the vector. I am just saying we have two ways to name the rows, and it
is generally not a healthy choice for the aforementioned reason.
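A small Python sketch of that mismatch, using hypothetical stand-ins rather than Mahout classes: `NamedRow` plays the role of a vector with an optional name, and the two producers store the same external ID in different places.

```python
from collections import namedtuple

# Stand-in for a vector that may carry a name (hypothetical, not a Mahout class).
NamedRow = namedtuple("NamedRow", ["name", "values"])

# Convention 1: the external ID lives in the key, the vector is unnamed.
keyed = [("/docs/a.txt", NamedRow(None, [1.0, 0.0]))]

# Convention 2: the external ID lives in the vector's name, the key is an int.
named = [(0, NamedRow("/docs/a.txt", [1.0, 0.0]))]


def labels_from_names(pairs):
    """A consumer that only looks at the name part of each row."""
    return [row.name for _, row in pairs]


def external_id(key, row):
    """A tolerant reader: prefer the name, fall back to the key."""
    return row.name if row.name is not None else key


lost = labels_from_names(keyed)    # the external ID is not seen
found = labels_from_names(named)   # the external ID is seen
recovered = [external_id(k, r) for k, r in keyed]
```

Every consumer needing a fallback like `external_id` is the kind of hassle that having a single place for the row identity would avoid.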

In my experience Names and Properties are primarily used to store
external keys, which are quite healthy.

Users never have data with Mahout keys, they must constantly go back and
forth. This is exactly what the R data frame does, no? I'm not so concerned
with being able to address an element by the external key
drmB["pat"]["iPad"] like a HashMap. But it would sure be nice to have the
external ids follow the data through any calculation that makes sense.

I am with you on this.

This would mean clustering, recommendations, transpose, RSJ would require
no id transforming steps. This would make dealing with Mahout much easier.

Data frames are a slightly different thing; right now we work just with
matrices. Although, yes, our in-core matrices support row and column names
(just like in R) and distributed matrices support row keys only. What I
mean is that an algebraic expression, e.g.

Aexpr %*% Bexpr will automatically propagate keys from Aexpr as implied
above, but not necessarily named vectors, because internally algorithms
blockify things into matrix blocks, and I am far from sure that Mahout
in-core stuff works correctly with named vectors as part of a matrix block
in all situations. I may be wrong. I always relied on sequence file keys to
identify data points.
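As a rough illustration of key propagation (plain Python, not the Mahout DSL), the hypothetical `times` below multiplies keyed sparse rows by a small in-core matrix. Each output row keeps the key of the input row it came from, so external row IDs survive the multiplication without a translation step.

```python
def times(rows, b):
    """Multiply keyed sparse rows by a dense matrix b (list of lists), keeping each row's key."""
    n_cols = len(b[0])
    product = {}
    for key, vector in rows.items():
        out = [0.0] * n_cols
        for i, value in vector.items():
            for j in range(n_cols):
                out[j] += value * b[i][j]
        product[key] = out  # the row key is carried through unchanged
    return product


# Rows keyed by external IDs; column indexes are still ints.
a = {"pat": {0: 1.0, 1: 2.0}, "dmitriy": {1: 3.0}}
b = [[1.0, 1.0],
     [0.0, 2.0]]

c = times(a, b)
```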

Note that sequence file keys are bigger than just a name, it is anything
Writable. I.e. you could save a data structure there, as long as you have a
Writable for it.

On Apr 2, 2014 1:08 PM, "Pat Ferrel" <pat@occamsmachete.com> wrote:

Are the Spark efforts supporting all Mahout Vector types? Named, Property
Vectors? It occurred to me that data frames in R are a related but more
general solution. If all rows and columns of a DRM and their corresponding
Vectors (row or column vectors) were to support arbitrary properties
attached to them in such a way that they are preserved during transpose,
Vector extraction, and any other operations that make sense, there would be
a huge benefit for users.

One of the constant problems with input to Mahout is translation of IDs.
External to Mahout going in, Mahout to external coming out. Most of this
would be unneeded if Mahout supported data frames, some would be avoided by
supporting named or property vectors universally.
