Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels: None

      Description

      It would be great to have an M/R job that takes in a line, applies a regex to it, and then writes the capturing groups out in various formats (FPG, Classifier, etc.).
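
      The per-line step described above can be sketched in plain Java, outside of any M/R job. This is only an illustration: the class name, the sample pattern, the sample line, and the space-delimited output are assumptions, not code or formats taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupSketch {

  /** Collects the text of every capturing group, for every match of the pattern in the line. */
  static List<String> captureGroups(Pattern pattern, String line) {
    List<String> out = new ArrayList<>();
    Matcher m = pattern.matcher(line);
    while (m.find()) {
      for (int g = 1; g <= m.groupCount(); g++) {
        String group = m.group(g);
        if (group != null) { // a group that did not participate in the match is null
          out.add(group);
        }
      }
    }
    return out;
  }

  /** Hypothetical formatter: one space-delimited output line per input line. */
  static String formatAsLine(List<String> groups) {
    return String.join(" ", groups);
  }

  public static void main(String[] args) {
    // Sample pattern and input are made up for illustration.
    Pattern p = Pattern.compile("user=(\\w+)\\s+item=(\\w+)");
    String line = "user=alice item=book42 ts=123";
    System.out.println(formatAsLine(captureGroups(p, line))); // prints "alice book42"
  }
}
```

      In a real job the same logic would sit inside a mapper, with the formatter chosen per target format.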

      1. MAHOUT-403.patch
        23 kB
        Grant Ingersoll
      2. MAHOUT-403.patch
        24 kB
        Grant Ingersoll

        Activity

        Hudson added a comment -

        Integrated in Mahout-Quality #1155 (See https://builds.apache.org/job/Mahout-Quality/1155/)
        MAHOUT-403: add in some regex transformation capabilities for converting raw content

        gsingers : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1197992
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/commandline/DefaultOptionCreator.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/EncodedVectorsFromSequenceFiles.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/AnalyzerTransformer.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/ChainTransformer.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/FPGFormatter.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/IdentityFormatter.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/IdentityTransformer.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/RegexConverterDriver.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/RegexFormatter.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/RegexMapper.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/RegexTransformer.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/RegexUtils.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/utils/regex/URLDecodeTransformer.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/regex
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/regex/RegexMapperTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/utils/regex/RegexUtilsTest.java
        • /mahout/trunk/src/conf/driver.classes.props
        Grant Ingersoll added a comment -

        Committed revision 1197992.

        Grant Ingersoll added a comment -

        Here's an example of running it against some Solr request logs:

        --input /path/to/logs --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url
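
        To see what that pattern extracts, here is a minimal standalone sketch. It assumes the group inside the lookbehind is meant to be `(\?|&)`, since `(?|` does not compile in java.util.regex, and it assumes the `url` transformer amounts to URL-decoding the match. The class name and the sample log lines are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SolrQueryExtractSketch {

  // Matches the value of the q parameter: preceded by "?q=" or "&q=",
  // running lazily up to the next "&" or the end of the input.
  static final Pattern QUERY = Pattern.compile("(?<=(\\?|&)q=).*?(?=&|$)");

  /** Returns the URL-decoded q value from a log line, or null if there is none. */
  static String extractQuery(String line) {
    Matcher m = QUERY.matcher(line);
    if (!m.find()) {
      return null;
    }
    try {
      return URLDecoder.decode(m.group(), "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }

  public static void main(String[] args) {
    // Sample lines are made up; real Solr request logs carry more fields.
    System.out.println(extractQuery("GET /solr/select?q=hello%20world&rows=10")); // hello world
    System.out.println(extractQuery("params={fq=type:a&q=mahout+regex&rows=5}")); // mahout regex
  }
}
```

        Note that `fq=` is not picked up, because the lookbehind requires `?` or `&` directly before `q=`.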

        Grant Ingersoll added a comment -

        The main issue it has is passing in the regex and dealing with escaping. We may have to load regexes from a file.
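
        One way loading from a file could look is sketched below. It is not part of the patch: the class and method names are hypothetical, and it assumes one pattern per file, stored as UTF-8. Reading the pattern from a file sidesteps shell quoting, because the text is compiled exactly as written.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

public class RegexFromFileSketch {

  /** Reads the whole file as UTF-8 and compiles it as a single pattern. */
  static Pattern loadPattern(Path file) throws IOException {
    String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    return Pattern.compile(stripTrailingLineBreaks(raw));
  }

  /** Drops only trailing CR/LF so that spaces inside the pattern stay significant. */
  static String stripTrailingLineBreaks(String s) {
    int end = s.length();
    while (end > 0 && (s.charAt(end - 1) == '\n' || s.charAt(end - 1) == '\r')) {
      end--;
    }
    return s.substring(0, end);
  }

  /** Demo helper: writes the regex to a temp file and loads it back. */
  static Pattern roundTrip(String regex) {
    try {
      Path tmp = Files.createTempFile("regex", ".txt");
      try {
        Files.write(tmp, (regex + "\n").getBytes(StandardCharsets.UTF_8));
        return loadPattern(tmp);
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // The file content needs no shell escaping; only Java string escaping applies here.
    System.out.println(roundTrip("(?<=(\\?|&)q=).*?(?=&|$)").pattern());
  }
}
```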

        Grant Ingersoll added a comment -

        Up to date with trunk, adds test. I used this to convert a bunch of Solr log files to FPG format.

        I think it will work fairly well for single-line files. It is more flexible than Hadoop's built-in RegexMapper, AFAICT.

        Grant Ingersoll added a comment -

        It may be that something like this is better served by Pig, but I think it is nice to have some basic functionality like this readily available in Mahout.

        Grant Ingersoll added a comment -

        OK, I think I have something that is workable.

        Sean Owen added a comment -

        (Merely marking Won't Fix as a provocation. If you stand behind a use case for this, and can finish/update the patch, by all means reopen.)

        Grant Ingersoll added a comment -

        Here's a start on this. It still needs more testing and a review of the implementation. Namely, I'm not sure I'm sold on the RegexTransformer/Formatter stuff and what its long-term ramifications are.


          People

          • Assignee: Grant Ingersoll
          • Reporter: Grant Ingersoll
          • Votes: 0
          • Watchers: 0