Mahout / MAHOUT-873

Provide MapReduce job for creating Encoded Vectors from sequence files

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: None

      Description

      Similar to SparseVectorsFromSequenceFiles, provide a version that can produce encoded vectors. Start simple by handling basic text, but this could easily evolve to handle pluggable Vectorizers that can better deal with other kinds of features (numerics, etc.).
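
For readers unfamiliar with encoded (hashed) vectors, the idea in the description can be sketched as follows. This is a minimal, self-contained illustration and not the code in the attached patches: the class `HashedTextEncoder`, its methods, and the choice of FNV-1a hashing are assumptions made for the example, and it uses no Mahout or Hadoop classes. Each token is hashed straight to a slot in a fixed-size vector, so no dictionary pass over the corpus is needed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical sketch of hashed text encoding; not part of Mahout. */
public class HashedTextEncoder {
    private final int cardinality;

    public HashedTextEncoder(int cardinality) {
        if (cardinality <= 0) {
            throw new IllegalArgumentException("cardinality must be positive");
        }
        this.cardinality = cardinality;
    }

    /** Maps a token to a vector slot using 32-bit FNV-1a (an assumed hash choice). */
    int slot(String token) {
        int h = 0x811c9dc5;                  // FNV-1a offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;                 // FNV prime
        }
        return Math.floorMod(h, cardinality); // always in [0, cardinality)
    }

    /** Lower-cases the text, splits on non-word characters, and counts tokens per slot. */
    public double[] encode(String text) {
        double[] vector = new double[cardinality];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                vector[slot(token)] += 1.0;  // colliding tokens share a slot
            }
        }
        return vector;
    }
}
```

In a MapReduce setting, a map-only job could apply something like `encode` to each (key, text) record of a sequence file and emit (key, vector); that wiring is omitted here.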

      1. MAHOUT-873.patch
        51 kB
        Grant Ingersoll
      2. MAHOUT-873.patch
        47 kB
        Grant Ingersoll
      3. MAHOUT-873.patch
        35 kB
        Grant Ingersoll

          Activity

          Grant Ingersoll added a comment -

          Patch that does the basic work. Also refactors AbstractJob and HadoopUtil a bit to make it easier to use PrepareJob.

          Grant Ingersoll added a comment -

          Progress. Extracted a Vectorizer interface, made the encoder pluggable, various other goodness.

          Also started to flesh out hooking seq2encoded into build-asf-email so as to run SGD over the ASF email archive. Now we just need train/test per MAHOUT-851.
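
To illustrate what a pluggable vectorizer could look like, here is a minimal sketch. It is an assumption-laden example, not the interface from the patch: `TextVectorizer`, `TokenCountVectorizer`, and `load` are hypothetical names, and the real `Vectorizer` interface in Mahout may have a different signature. The sketch shows the general pattern of picking an implementation by class name at run time.

```java
import java.util.Locale;

/** Hypothetical sketch of a pluggable vectorizer; names are not Mahout's. */
public class PluggableVectorizerDemo {

    /** Assumed interface: turn one text record into a fixed-size vector. */
    public interface TextVectorizer {
        double[] vectorize(String text, int cardinality);
    }

    /** One possible plug-in: hashed token counts. */
    public static class TokenCountVectorizer implements TextVectorizer {
        @Override
        public double[] vectorize(String text, int cardinality) {
            double[] vector = new double[cardinality];
            for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    vector[Math.floorMod(token.hashCode(), cardinality)] += 1.0;
                }
            }
            return vector;
        }
    }

    /** Instantiates the implementation named by className via its no-arg constructor. */
    public static TextVectorizer load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(TextVectorizer.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load vectorizer: " + className, e);
        }
    }
}
```

With this shape, a job driver could read the class name from a command-line option or job configuration and pass it to `load`, so that new encoders can be added without changing the driver.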

          Grant Ingersoll added a comment -

          Starts the conversion of DictionaryVectorizer. I now think we could fold SparseVectorsFromSequenceFiles and EncodedVectorsFromSequenceFiles, perhaps, into one class, if we use appropriate command line grouping of options.

          Grant Ingersoll added a comment -

          I've checked in some baseline functionality here. I'm going to leave this open, as I think we can really take this to an interesting level of capability.

          Hudson added a comment -

          Integrated in Mahout-Quality #1151 (See https://builds.apache.org/job/Mahout-Quality/1151/)
          MAHOUT-873: baseline of simple vectorization encoding capabilities

          gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1197839
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/common/ClassUtils.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/DictionaryVectorizer.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/EncodedVectorsFromSequenceFiles.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/EncodingMapper.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SimpleTextEncodingVectorizer.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/Vectorizer.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/VectorizerConfig.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/vectorizer/EncodedVectorsFromSequenceFilesTest.java
          • /mahout/trunk/examples/bin/build-asf-email.sh
          • /mahout/trunk/src/conf/driver.classes.props

            People

            • Assignee: Grant Ingersoll
            • Reporter: Grant Ingersoll