Mahout / MAHOUT-873

Provide MapReduce job for creating Encoded Vectors from sequence files

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: None

      Description

      Similar to SparseVectorsFromSequenceFiles, provide a version that produces encoded vectors. Start simple by handling basic text, but this could easily evolve to support pluggable Vectorizers that deal better with other kinds of features (numerics, etc.).
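
      As a rough illustration of what "encoded vectors" means for basic text, the sketch below hashes each token directly into a fixed-size vector, so no dictionary pass is needed. It is plain Java written for this note, not code from the patch: the class name HashedTextEncoder, the probes parameter, and the choice of an FNV-1a style hash are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper (not a Mahout class): feature-hashes text into a fixed-size vector. */
final class HashedTextEncoder {
    private final int cardinality; // fixed vector length, chosen up front
    private final int probes;      // number of hashed slots updated per token

    HashedTextEncoder(int cardinality, int probes) {
        if (cardinality <= 0 || probes <= 0) {
            throw new IllegalArgumentException("cardinality and probes must be positive");
        }
        this.cardinality = cardinality;
        this.probes = probes;
    }

    /** Lowercases, splits on non-letter/non-digit runs, and adds 1.0 per token per probe. */
    double[] encode(String text) {
        double[] vector = new double[cardinality];
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first element when text starts with a delimiter
            }
            for (int probe = 0; probe < probes; probe++) {
                vector[slot(token, probe)] += 1.0;
            }
        }
        return vector;
    }

    /** Maps (token, probe) to an index in [0, cardinality) with an FNV-1a style hash seeded by the probe. */
    int slot(String token, int probe) {
        int hash = 0x811C9DC5;
        hash ^= probe;
        hash *= 16777619;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return Math.floorMod(hash, cardinality);
    }
}
```

      Because colliding tokens simply add into the same slot, the vector length stays fixed regardless of vocabulary size; that is the trade-off against the dictionary-based approach of SparseVectorsFromSequenceFiles.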

      1. MAHOUT-873.patch
        35 kB
        Grant Ingersoll
      2. MAHOUT-873.patch
        47 kB
        Grant Ingersoll
      3. MAHOUT-873.patch
        51 kB
        Grant Ingersoll

          Activity

          Grant Ingersoll created issue -
          Grant Ingersoll added a comment -

          Patch that does basic work. Also refactors AbstractJob and HadoopUtil a bit to make it easier to use PrepareJob.

          Grant Ingersoll made changes -
          Field Original Value New Value
          Attachment MAHOUT-873.patch [ 12502530 ]
          Grant Ingersoll added a comment -

          Progress. Extracted a Vectorizer interface, made the encoder pluggable, various other goodness.

          Also started to flesh out hooking seq2encoded into build-asf-email so as to run SGD over the ASF email archive. Now just need train/test per MAHOUT-851.
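
          The "pluggable" idea in the comment above can be sketched as an interface plus a lookup by configured name. This is an illustration only, in plain Java: TextVectorizer, the two implementations, and VectorizerRegistry are hypothetical names, and the method signature is an assumption rather than the signature of the Vectorizer interface in the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical interface for illustration; not the signature of Mahout's Vectorizer. */
interface TextVectorizer {
    double[] vectorize(String text, int cardinality);
}

/** Hashes each whitespace-separated token into one slot and counts occurrences. */
final class TokenCountVectorizer implements TextVectorizer {
    @Override
    public double[] vectorize(String text, int cardinality) {
        double[] v = new double[cardinality];
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                v[Math.floorMod(token.hashCode(), cardinality)] += 1.0;
            }
        }
        return v;
    }
}

/** Same hashing scheme, applied to overlapping character bigrams instead of tokens. */
final class CharBigramVectorizer implements TextVectorizer {
    @Override
    public double[] vectorize(String text, int cardinality) {
        double[] v = new double[cardinality];
        for (int i = 0; i + 2 <= text.length(); i++) {
            v[Math.floorMod(text.substring(i, i + 2).hashCode(), cardinality)] += 1.0;
        }
        return v;
    }
}

/** Resolves an implementation from a configured name, so calling code never names a concrete class. */
final class VectorizerRegistry {
    private static final Map<String, Supplier<TextVectorizer>> FACTORIES = new LinkedHashMap<>();

    static {
        FACTORIES.put("tokens", TokenCountVectorizer::new);
        FACTORIES.put("bigrams", CharBigramVectorizer::new);
    }

    static TextVectorizer create(String name) {
        Supplier<TextVectorizer> factory = FACTORIES.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown vectorizer: " + name);
        }
        return factory.get();
    }
}
```

          A mapper written against the interface would call VectorizerRegistry.create(...) once during setup with a name taken from the job configuration, then reuse the instance for every record.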

          Grant Ingersoll made changes -
          Attachment MAHOUT-873.patch [ 12502550 ]
          Grant Ingersoll added a comment -

          Starts the conversion of DictionaryVectorizer. I now think we could fold SparseVectorsFromSequenceFiles and EncodedVectorsFromSequenceFiles, perhaps, into one class, if we use appropriate command line grouping of options.

          Grant Ingersoll made changes -
          Attachment MAHOUT-873.patch [ 12502552 ]
          Grant Ingersoll added a comment -

          I've checked in some baseline functionality here. Going to leave this open, as I think we can really take this to an interesting level of capabilities.

          Grant Ingersoll made changes -
          Labels MAHOUT_INTRO_CONTRIBUTE
          Grant Ingersoll made changes -
          Link This issue relates to MAHOUT-851 [ MAHOUT-851 ]
          Hudson added a comment -

          Integrated in Mahout-Quality #1151 (See https://builds.apache.org/job/Mahout-Quality/1151/)
          MAHOUT-873: baseline of simple vectorization encoding capabilities

          gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1197839
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/common/ClassUtils.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/DictionaryVectorizer.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/EncodedVectorsFromSequenceFiles.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/EncodingMapper.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SimpleTextEncodingVectorizer.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/Vectorizer.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/VectorizerConfig.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/EncodedVectorsFromSequenceFilesTest.java
          • /mahout/trunk/examples/bin/build-asf-email.sh
          • /mahout/trunk/src/conf/driver.classes.props
          Grant Ingersoll made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Sean Owen made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Grant Ingersoll
              Reporter:
              Grant Ingersoll
            • Votes:
              0
              Watchers:
              0
