Flume
  1. Flume
  2. FLUME-722

RegexAllExtractor doesn't ignore empty groups

    Details

      Description

      Hi flume devs.

      I saw a bug when using RegexAllExtractor: line 94:

      if(names.get(grp-1) != "")

      { Attributes.setString(e, names.get(grp-1), val); }

      Please help to file a jira and correct it to use String.equal(), otherwise it doesn't ignore empty groups.

      (I don't think I can open a issue at cloudera jira.)

      Thanks,
      Mingjie

      1. FLUME-722.patch
        0.9 kB
        Mingjie Lai

        Activity

        Hide
        Hudson added a comment -

        Integrated in flume-728 #40 (See https://builds.apache.org/job/flume-728/40/)
        FLUME-722. Transactinal memory channel implementation.

        (Prasad Mujumdar via Arvind Prabhakar)

        arvind : http://svn.apache.org/viewvc/?view=rev&rev=1185412
        Files :

        • /incubator/flume/branches/flume-728/flume-ng-core/src/main/java/org/apache/flume/channel/MemoryChannel.java
        • /incubator/flume/branches/flume-728/flume-ng-core/src/main/java/org/apache/flume/channel/MultiOpMemChannel.java
        • /incubator/flume/branches/flume-728/flume-ng-core/src/main/java/org/apache/flume/channel/PseudoTxnMemoryChannel.java
        • /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/channel/TestMemoryChannelTransaction.java
        • /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/sink/TestAvroSink.java
        • /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/sink/TestLoggerSink.java
        • /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/sink/TestRollingFileSink.java
        • /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/source/TestAvroSource.java
        • /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/source/TestSequenceGeneratorSource.java
        • /incubator/flume/branches/flume-728/flume-ng-node/src/test/java/org/apache/flume/source/TestNetcatSource.java
        • /incubator/flume/branches/flume-728/flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestHDFSEventSink.java
        Show
        Hudson added a comment - Integrated in flume-728 #40 (See https://builds.apache.org/job/flume-728/40/ ) FLUME-722 . Transactinal memory channel implementation. (Prasad Mujumdar via Arvind Prabhakar) arvind : http://svn.apache.org/viewvc/?view=rev&rev=1185412 Files : /incubator/flume/branches/flume-728/flume-ng-core/src/main/java/org/apache/flume/channel/MemoryChannel.java /incubator/flume/branches/flume-728/flume-ng-core/src/main/java/org/apache/flume/channel/MultiOpMemChannel.java /incubator/flume/branches/flume-728/flume-ng-core/src/main/java/org/apache/flume/channel/PseudoTxnMemoryChannel.java /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/channel/TestMemoryChannelTransaction.java /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/sink/TestAvroSink.java /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/sink/TestLoggerSink.java /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/sink/TestRollingFileSink.java /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/source/TestAvroSource.java /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/source/TestSequenceGeneratorSource.java /incubator/flume/branches/flume-728/flume-ng-node/src/test/java/org/apache/flume/source/TestNetcatSource.java /incubator/flume/branches/flume-728/flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestHDFSEventSink.java
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1448/
        -----------------------------------------------------------

        (Updated 2011-08-10 07:11:31.240611)

        Review request for Flume and jmhsieh.

        Changes
        -------

        Update do include link to jira issue

        Summary
        -------

        Reposted for Mingjie Lai (was originally posted and reviewed on https://review.cloudera.org/r/1891/)

        This addresses bug flume-722.
        https://issues.apache.org/jira/browse/flume-722

        Diffs


        /trunk/flume-core/src/main/java/com/cloudera/flume/core/extractors/RegexAllExtractor.java 1155987
        /trunk/flume-core/src/test/java/com/cloudera/flume/core/extractors/TestExtractors.java 1155987

        Diff: https://reviews.apache.org/r/1448/diff

        Testing
        -------

        Ran test, it passed.

        Thanks,

        jmhsieh

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/1448/ ----------------------------------------------------------- (Updated 2011-08-10 07:11:31.240611) Review request for Flume and jmhsieh. Changes ------- Update do include link to jira issue Summary ------- Reposted for Mingjie Lai (was originally posted and reviewed on https://review.cloudera.org/r/1891/ ) This addresses bug flume-722. https://issues.apache.org/jira/browse/flume-722 Diffs /trunk/flume-core/src/main/java/com/cloudera/flume/core/extractors/RegexAllExtractor.java 1155987 /trunk/flume-core/src/test/java/com/cloudera/flume/core/extractors/TestExtractors.java 1155987 Diff: https://reviews.apache.org/r/1448/diff Testing ------- Ran test, it passed. Thanks, jmhsieh
        Hide
        Jonathan Hsieh added a comment -

        looks good. Thanks Mingjie!

        Still waiting on svn import so committing to github.com for now. Will close issue when pushed to apache svn.

        Show
        Jonathan Hsieh added a comment - looks good. Thanks Mingjie! Still waiting on svn import so committing to github.com for now. Will close issue when pushed to apache svn.
        Hide
        Mingjie Lai added a comment -

        @jon Thanks for your comments.

        I created https://review.cloudera.org/r/1891/ for easier review. And I also posted a new patch which includes the test case at the rb.

        Show
        Mingjie Lai added a comment - @jon Thanks for your comments. I created https://review.cloudera.org/r/1891/ for easier review. And I also posted a new patch which includes the test case at the rb.
        Hide
        Jonathan Hsieh added a comment -

        Actually, instead of the previous comment, this is a smaller and better change that better demonstrates the root cause. (second call to names.add different and added an assert).

        Can you modify the existing function to be this instead?

          @Test
          public void testRegexAllExtractor() throws IOException, InterruptedException {
            MemorySinkSource mem = new MemorySinkSource();
            mem.open();
            ArrayList<String> names = new ArrayList<String>();
            names.add("d1");
            // when parsed, a separate instance of "" is created
            names.add(new String(new byte[0]));
            names.add("d2");
        
            RegexAllExtractor re = new RegexAllExtractor(mem, "(\\d):(\\d):(\\d)",
                names);
        
            re.open();
            re.append(new EventImpl("1:2:3.4foobar5".getBytes()));
            re.close();
        
            mem.close();
            mem.open();
            Event e1 = mem.next();
            assertEquals("1", Attributes.readString(e1, "d1"));
            assertEquals(null, Attributes.readString(e1, ""));
            assertEquals("3", Attributes.readString(e1, "d2"));
          }
        

        Thanks,
        Jon.

        Show
        Jonathan Hsieh added a comment - Actually, instead of the previous comment, this is a smaller and better change that better demonstrates the root cause. (second call to names.add different and added an assert). Can you modify the existing function to be this instead? @Test public void testRegexAllExtractor() throws IOException, InterruptedException { MemorySinkSource mem = new MemorySinkSource(); mem.open(); ArrayList< String > names = new ArrayList< String >(); names.add( "d1" ); // when parsed, a separate instance of "" is created names.add( new String ( new byte [0])); names.add( "d2" ); RegexAllExtractor re = new RegexAllExtractor(mem, "(\\d):(\\d):(\\d)" , names); re.open(); re.append( new EventImpl( "1:2:3.4foobar5" .getBytes())); re.close(); mem.close(); mem.open(); Event e1 = mem.next(); assertEquals( "1" , Attributes.readString(e1, "d1" )); assertEquals( null , Attributes.readString(e1, "")); assertEquals( "3" , Attributes.readString(e1, "d2" )); } Thanks, Jon.
        Hide
        Jonathan Hsieh added a comment -

        Here's a test the fails before and succeeds after the fix. Fold this into TestExtractors please?

          /**
          * the parser step creates a "" that is not the canonical "" which requires equals vs == test
          **/
          @Test
          public void testRegexAllExtractorEmptyProblem() throws IOException,
              InterruptedException, FlumeSpecException {
            final MemorySinkSource mem = new MemorySinkSource();
            mem.open();
            SinkFactoryImpl sf = new SinkFactoryImpl();
            sf.setSink("mem", new SinkBuilder() {
              @Override
              public EventSink build(Context context, String... argv) {
                return mem;
              }
              
            });
            FlumeBuilder.setSinkFactory(sf);
            RegexAllExtractor re = (RegexAllExtractor) FlumeBuilder.buildSink(
                LogicalNodeContext.testingContext(), "regexAll(\"(.+)\\\\t(.+)\","
                    + "\"\", \"keep\") mem");
        
            re.open();
            re.append(new EventImpl("ignoreme\tkeepme".getBytes()));
            re.close();
        
            mem.close();
            mem.open();
            Event e1 = mem.next();
            assertEquals(null, Attributes.readString(e1, ""));
            assertEquals("keepme", Attributes.readString(e1, "keep"));
          }
        

        Thanks,
        Jon.

        Show
        Jonathan Hsieh added a comment - Here's a test the fails before and succeeds after the fix. Fold this into TestExtractors please? /** * the parser step creates a "" that is not the canonical " " which requires equals vs == test **/ @Test public void testRegexAllExtractorEmptyProblem() throws IOException, InterruptedException, FlumeSpecException { final MemorySinkSource mem = new MemorySinkSource(); mem.open(); SinkFactoryImpl sf = new SinkFactoryImpl(); sf.setSink( "mem" , new SinkBuilder() { @Override public EventSink build(Context context, String ... argv) { return mem; } }); FlumeBuilder.setSinkFactory(sf); RegexAllExtractor re = (RegexAllExtractor) FlumeBuilder.buildSink( LogicalNodeContext.testingContext(), "regexAll(\" (.+)\\\\t(.+)\ "," + "\" \ ", \" keep\ ") mem" ); re.open(); re.append( new EventImpl( "ignoreme\tkeepme" .getBytes())); re.close(); mem.close(); mem.open(); Event e1 = mem.next(); assertEquals( null , Attributes.readString(e1, "")); assertEquals( "keepme" , Attributes.readString(e1, "keep" )); } Thanks, Jon.
        Hide
        Jonathan Hsieh added a comment -

        Ok, I've been able to cause the problem and it seems to come from the parsing of the config. Diving deeper.

        Show
        Jonathan Hsieh added a comment - Ok, I've been able to cause the problem and it seems to come from the parsing of the config. Diving deeper.
        Hide
        Jonathan Hsieh added a comment -

        @Mingjie

        Nice catch! I agree this is a bad code smell. Am I right to assume that after the patch, the { : 210.5.102.6 } and { :194.78.100.30 } parts are not present?

        I'm really curious about how to duplicate this and why this happens and would really like to figure this out before we commit. I've tried to make a test case that would make an "" attribute show up but I can't seem to do it either. Can you give a scrubbed example line? Do they look like:

        "1.2.3.4\txxxLicense\t1304406740\t42\txxxUrl"

        Also, a style nit – can you invert the equals expression to eliminate the possibility of a NPE? This could be done by changing:
        !names.get(grp-1).equals("")
        into
        !"".equals(names.get(grp-1))

        Show
        Jonathan Hsieh added a comment - @Mingjie Nice catch! I agree this is a bad code smell. Am I right to assume that after the patch, the { : 210.5.102.6 } and { :194.78.100.30 } parts are not present? I'm really curious about how to duplicate this and why this happens and would really like to figure this out before we commit. I've tried to make a test case that would make an "" attribute show up but I can't seem to do it either. Can you give a scrubbed example line? Do they look like: "1.2.3.4\txxxLicense\t1304406740\t42\txxxUrl" Also, a style nit – can you invert the equals expression to eliminate the possibility of a NPE? This could be done by changing: !names.get(grp-1).equals("") into !"".equals(names.get(grp-1))
        Hide
        Mingjie Lai added a comment -

        Patch available.

        Show
        Mingjie Lai added a comment - Patch available.
        Hide
        Mingjie Lai added a comment - - edited

        I cannot reproduce it with test cases – TestExtractors.

        But for a flume deployment, I have an flume node configured as:

        <code>
        tail ("/tmp/log.log") | regexAll( "(.)\\t(.)\\t(.)\\t(.)
        t(.+)", "", "license", "timestamp", "rating", "url") => text("/tmp/aa.txt")
        <code>

        what I saw is:
        <code>
        localhost [INFO Thu Aug 04 13:05:12 PDT 2011] { : 210.5.102.6 }

        { license : xxx }

        { rating : 38 }

        { tailSrcFile : (long)7382926376034922343 (string) 'full.log' (double)3.641231503360977E185 }

        { timestamp : 1304406740 }

        { url : xxx}

        localhost [INFO Thu Aug 04 13:05:12 PDT 2011] { : 194.78.100.30 }

        { license : xxx }

        { rating : 41 }

        { tailSrcFile : (long)7382926376034922343 (string) 'full.log' (double)3.641231503360977E185 }

        { timestamp : 1304406740 }

        { url : xxx}


        <code>

        It's' an obvious java string comparison bug. So no test cases for this issue.

        Show
        Mingjie Lai added a comment - - edited I cannot reproduce it with test cases – TestExtractors. But for a flume deployment, I have an flume node configured as: <code> tail ("/tmp/log.log") | regexAll( "(. )\\t(. )\\t(. )\\t(. ) t(.+)", "", "license", "timestamp", "rating", "url") => text("/tmp/aa.txt") <code> what I saw is: <code> localhost [INFO Thu Aug 04 13:05:12 PDT 2011] { : 210.5.102.6 } { license : xxx } { rating : 38 } { tailSrcFile : (long)7382926376034922343 (string) 'full.log' (double)3.641231503360977E185 } { timestamp : 1304406740 } { url : xxx} localhost [INFO Thu Aug 04 13:05:12 PDT 2011] { : 194.78.100.30 } { license : xxx } { rating : 41 } { tailSrcFile : (long)7382926376034922343 (string) 'full.log' (double)3.641231503360977E185 } { timestamp : 1304406740 } { url : xxx} <code> It's' an obvious java string comparison bug. So no test cases for this issue.

          People

          • Assignee:
            Mingjie Lai
            Reporter:
            Nicholas Verbeck
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development