Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.8
    • Component/s: Integration
    • Labels: None

      Description

      Utility to concatenate matrices stored as SequenceFiles of vectors.
      Each pair in the SequenceFile is the IntWritable row number and a VectorWritable.
      The input and output files may skip rows.
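
      For illustration, here is a minimal sequential sketch of the idea (a hypothetical class, not the committed ConcatenateVectorsJob). For brevity it assumes both inputs contain the same rows in the same order; a real implementation must merge by row number to handle skipped rows.

      import java.io.IOException;
      import java.util.Iterator;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.mahout.math.SequentialAccessSparseVector;
      import org.apache.mahout.math.Vector;
      import org.apache.mahout.math.VectorWritable;

      // Hypothetical sketch: concatenate row i of A with row i of B and
      // write (IntWritable, VectorWritable) pairs to a new SequenceFile.
      public final class ConcatSketch {
        public static void concat(Path aPath, Path bPath, Path outPath,
                                  int dimA, int dimB) throws IOException {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          SequenceFile.Reader a = new SequenceFile.Reader(fs, aPath, conf);
          SequenceFile.Reader b = new SequenceFile.Reader(fs, bPath, conf);
          SequenceFile.Writer out = SequenceFile.createWriter(fs, conf, outPath,
              IntWritable.class, VectorWritable.class);
          try {
            IntWritable rowA = new IntWritable();
            IntWritable rowB = new IntWritable();
            VectorWritable vecA = new VectorWritable();
            VectorWritable vecB = new VectorWritable();
            while (a.next(rowA, vecA) && b.next(rowB, vecB)) {
              Vector row = new SequentialAccessSparseVector(dimA + dimB);
              copyInto(row, vecA.get(), 0);     // A keeps its column indices
              copyInto(row, vecB.get(), dimA);  // B's columns shift right by dimA
              out.append(rowA, new VectorWritable(row));
            }
          } finally {
            a.close();
            b.close();
            out.close();
          }
        }

        private static void copyInto(Vector dest, Vector src, int offset) {
          Iterator<Vector.Element> it = src.iterateNonZero();
          while (it.hasNext()) {
            Vector.Element e = it.next();
            dest.setQuick(offset + e.index(), e.get());
          }
        }
      }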

      Attachments

      1. MAHOUT-884.patch
        13 kB
        Suneel Marthi
      2. MAHOUT-884.patch
        13 kB
        Lance Norskog
      3. MAHOUT-884.patch
        18 kB
        Lance Norskog
      4. MAHOUT-884.patch
        16 kB
        Lance Norskog

        Activity

        Hudson added a comment -

        Integrated in Mahout-Quality #2071 (See https://builds.apache.org/job/Mahout-Quality/2071/)
        MAHOUT-884: Matrix Concatenate Utility change the utility name to concatmatrices (Revision 1491325)

        Result = SUCCESS
        smarthi :
        Files :

        • /mahout/trunk/src/conf/driver.classes.default.props
        Hudson added a comment -

        Integrated in Mahout-Quality #2070 (See https://builds.apache.org/job/Mahout-Quality/2070/)
        MAHOUT-884: Matrix Concatenate Utility (Revision 1491309)

        Result = SUCCESS
        smarthi :
        Files :

        • /mahout/trunk/CHANGELOG
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/ConcatenateVectorsJob.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/ConcatenateVectorsReducer.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/TestConcatenateVectorsJob.java
        • /mahout/trunk/src/conf/driver.classes.default.props
        Suneel Marthi added a comment -

        Committed patch to trunk, will be adding additional test cases.

        Suneel Marthi added a comment -

        Modified patch to be compatible with present codebase.

        Robin Anil added a comment -

        Not a blocker, might need some cleanup. Pushing to backlog.

        Suneel Marthi added a comment -

        Also will be adding unit tests as part of committing this patch.

        Suneel Marthi added a comment -

        Agree with Sebastian. I can work on this later today.

        Sebastian Schelter added a comment -

        Regarding the patch: please make sure to always close readers in finally blocks, and don't throw an InterruptedException if the job fails.

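        For example, the usual shape of that pattern (a generic sketch, not the patch's code) is:

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.mahout.math.VectorWritable;

        public final class ReaderSketch {
          // Close the reader in a finally block so an exception on any
          // read cannot leak the file handle.
          static int countRows(FileSystem fs, Path path, Configuration conf)
              throws IOException {
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            try {
              IntWritable key = new IntWritable();
              VectorWritable value = new VectorWritable();
              int rows = 0;
              while (reader.next(key, value)) {
                rows++;
              }
              return rows;
            } finally {
              reader.close();
            }
          }
        }
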
        Ted Dunning added a comment -

        Suneel, can you commit this if you think it is good?

        Lance Norskog added a comment -

        You're right, the Metadata call should be removed. Mahout does not use the Metadata feature anywhere.

        Suneel Marthi added a comment -

        Has this code been committed to trunk yet? Looking at the code, the following call seems unnecessary, as the retrieved metadata value is never used in getDimensions().

        Metadata m = reader.getMetadata();

        Suneel Marthi added a comment -

        Lance, thanks for this effort. This is something I need now for the stuff I am working on.

        Lance Norskog added a comment - edited

        Completely redone. Now a Hadoop job which uses Jake's trick of caching the row widths. It supports any Writable as the key class. Input vectors can be in multiple files and out of order. Supports named vectors.

        Minuses:

        • Only concatenates two matrices
        • Hard-coded to SequentialAccessSparseVector with no compression
        Jake Mannix added a comment -

        There's a better way to do this. If the two matrices have different column dimensionalities (which is always the case except for very special concatenations), you can put the cardinality of each matrix in the Configuration, then use a simple identity Mapper and have the Reducer use the cardinalities of the pair of Vectors to decide which goes first.

        No custom partitioner or comparator, and it's all simple, scalable map-reduce.

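        A rough sketch of that reducer (hypothetical class and property names; it assumes the driver stored the two column counts in the Configuration, and that the dimensionalities differ, as noted above):

        import java.io.IOException;
        import java.util.Iterator;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.mahout.math.SequentialAccessSparseVector;
        import org.apache.mahout.math.Vector;
        import org.apache.mahout.math.VectorWritable;

        public class ConcatByCardinalityReducer
            extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

          private int dimA;
          private int dimB;

          @Override
          protected void setup(Context ctx) {
            // column counts of A and B, stored by the (hypothetical) driver
            dimA = ctx.getConfiguration().getInt("concat.dimA", -1);
            dimB = ctx.getConfiguration().getInt("concat.dimB", -1);
          }

          @Override
          protected void reduce(IntWritable row, Iterable<VectorWritable> vectors,
              Context ctx) throws IOException, InterruptedException {
            Vector out = new SequentialAccessSparseVector(dimA + dimB);
            for (VectorWritable vw : vectors) {
              Vector v = vw.get();
              // a vector of cardinality dimA came from A and keeps its
              // indices; one of cardinality dimB came from B and shifts right
              int offset = v.size() == dimA ? 0 : dimA;
              Iterator<Vector.Element> it = v.iterateNonZero();
              while (it.hasNext()) {
                Vector.Element e = it.next();
                out.setQuick(offset + e.index(), e.get());
              }
            }
            ctx.write(row, new VectorWritable(out));
          }
        }
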
        Jake Mannix added a comment -

        The trick is that we want the vectors to come in right-to-left order at each reducer, so that the output vector writes sequentially. See Ricky Ho's blog post and search for "Optimized reducer-side join"; he uses a partitioner to achieve this.

        Yeah, you just do a secondary sort-by-value (using a custom partitioner and comparator).

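        The partitioner half of that trick might look like this (a hypothetical sketch: RowAndMatrix is an assumed composite WritableComparable holding the row number plus a matrix index; its sort comparator orders by both fields so the left matrix's vector arrives first, while partitioning and grouping look only at the row):

        import org.apache.hadoop.mapreduce.Partitioner;
        import org.apache.mahout.math.VectorWritable;

        public class RowPartitioner extends Partitioner<RowAndMatrix, VectorWritable> {
          @Override
          public int getPartition(RowAndMatrix key, VectorWritable value,
              int numPartitions) {
            // ignore the matrix index: both vectors for a row must reach
            // the same reducer, where the sort order presents them in
            // matrix order
            return (key.getRow() & Integer.MAX_VALUE) % numPartitions;
          }
        }
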
        Lance Norskog added a comment -

        Map-reduce does not handle this well. There are two ways to implement this in Hadoop:

        1. Null Mapper -> Reducer<IntWritable,VectorWritable>
          • The Reducer loads iterators for all VectorWritables, then walks forward monotonically through all iterators.
        2. Mapper -> Partitioner<1 Reducer per row> -> (Reducer<IntWritable index, DoubleWritable value>)
          • The Reducer's setup/teardown creates an output VectorWritable; each reduce() call receives one vector index and one or more values.

        The first requires loading the contents of row X from each matrix into memory simultaneously. ConcatenateMatrices already has this problem, and does not copy the vectors over the network. The second is a "map-increase" algorithm: it creates a separate key pair for each value in the output matrix. Neither of these scales.

        The only way to do this is to precondition the input matrices into one file with ordered rows and use the above single-threaded concatenator. If you want multiple files, you can partition the matrices into matching sets of rows: part-r-00000 holds rows 0-499, part-r-00001 rows 500-999, and so on. You then run ConcatenateMatrices on each pair, as sketched below.

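        To illustrate that last point, a hypothetical driver loop pairing up part files (concatenate() stands in for the sequential utility; the real tool's interface may differ):

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public final class PairwiseConcat {
          // Assumes both directories hold the same row ranges in identically
          // named part files (part-r-00000 = rows 0-499, part-r-00001 =
          // rows 500-999, ...), so each pair can be handled independently.
          public static void run(Path dirA, Path dirB, Path outDir,
              Configuration conf) throws IOException {
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus partA : fs.globStatus(new Path(dirA, "part-r-*"))) {
              String name = partA.getPath().getName();
              concatenate(fs, partA.getPath(), new Path(dirB, name),
                  new Path(outDir, name), conf);
            }
          }

          private static void concatenate(FileSystem fs, Path a, Path b,
              Path out, Configuration conf) throws IOException {
            // stand-in for the single-threaded concatenator described above
          }
        }
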
        Lance Norskog added a comment -

        What is the range of matrix sizes where this is really worth running on, say, a ten-machine cluster vs. a sequential program?

        Lance Norskog added a comment -

        Then this should be a map-reduce job, not a sequential process, as these matrices could be really large.

        Ah! But how are they stored? Is it an HDFS directory with part-r-00000, 00001 ... 0000n for n distinct sets of rows?

        Identity mapper + reduce-side join with concatenation would be the most straightforward scalable way to do it.

        The trick is that we want the vectors to come in right-to-left order at each reducer, so that the output vector writes sequentially. See Ricky Ho's blog post and search for "Optimized reducer-side join"; he uses a partitioner to achieve this.

        http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

        Lance Norskog added a comment -

        Supports NamedVectors. If both input vectors have a name, the left one wins.

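        In code, that rule might look like this (a sketch; NamedVector is Mahout's wrapper class, the helper name is made up):

        import org.apache.mahout.math.NamedVector;
        import org.apache.mahout.math.Vector;

        public final class NamingRule {
          // If both inputs carry a name, the left vector's name wins.
          static Vector nameResult(Vector left, Vector right, Vector concatenated) {
            if (left instanceof NamedVector) {
              return new NamedVector(concatenated, ((NamedVector) left).getName());
            }
            if (right instanceof NamedVector) {
              return new NamedVector(concatenated, ((NamedVector) right).getName());
            }
            return concatenated;
          }
        }
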
        Jake Mannix added a comment -

        Ah! So the point is to concatenate the rows themselves. This makes much more sense, yes, I can see wanting to do this.

        Then this should be a map-reduce job, not a sequential process, as these matrices could be really large. Identity mapper + reduce-side join with concatenation would be the most straightforward scalable way to do it.

        Dan Brickley added a comment -

        My original use case was here: http://www.searchworkings.org/forum/-/message_boards/view_message/356639

        """I have a matrix of 100,000 items x 30k features; and another of those
        same 100,000 items, x however-many different features (from n-gram
        collocation extraction). In current app, these are library holdings
        and subject codes + extracted phrases. (later these should be 14
        million items by somewhat but not shockingly larger feature space, if
        that is useful to know)

        I'd like to compose these into a larger unified feature matrix, with
        same row structure, and with feature columns drawing from both input
        matrices. So far in this work I've managed to get by using bin/mahout
        rather than firing up Eclipse and messing with Java; I'd be happy to
        learn I can continue in this work style. But if custom code is needed
        that's fine. Either way, some pointer would be much appreciated..."""

        Jake Mannix added a comment -

        Why do we want the part files squished into one? If the result is small enough to read into memory somewhere, then you can easily just iterate over the part files, reading each one into memory, right?

        Lance Norskog added a comment -

        I forgot about NamedVectors.


          People

          • Assignee:
            Suneel Marthi
          • Reporter:
            Lance Norskog
          • Votes:
            0
          • Watchers:
            5

            Dates

            • Created:
              Updated:
              Resolved:
