Hadoop Map/Reduce
MAPREDUCE-350

Generalize the SequenceFileInputFilter to apply to any InputFormat

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      I'd like to generalize the SequenceFileInputFilter that was introduced in HADOOP-412 so that it can be applied to any InputFormat. To do this, I propose:

      interface WritableFilter {
        boolean accept(Writable item);
      }

      class FilterInputFormat implements InputFormat {
        ...
      }

      FilterInputFormat would look in the JobConf for:
      • mapred.input.filter.source = the underlying input format
      • mapred.input.filter.filters = a list of class names that implement WritableFilter

      The FilterInputFormat will work like the current SequenceFileInputFilter, but use an internal RecordReader rather than the SequenceFile directly. This will require adding next(key) and getCurrentValue(value) methods to the RecordReader interface, but that will be addressed in a different issue.

      Attachments

      1. filtering_v3.patch
        97 kB
        Enis Soztutar
      2. filtering_v2.patch
        144 kB
        Enis Soztutar
      3. filterinputformat_v1.patch
        63 kB
        Enis Soztutar

          Activity

          Owen O'Malley added a comment -

          Oops, I should use the structure of SequenceFileInputFilter and do:

          public class FilterInputFormat implements InputFormat {
            public static interface Filter {
              boolean accept(Writable key);
            }
            ...
          }

          Enis Soztutar added a comment -

          Here is the patch to generalize the filtering capability on input records.

          The patch

          • deprecates SequenceFileInputFilter
          • introduces the o.a.h.io.filter package
          • refactors SequenceFileInputFilter.Filter to o.a.h.io.filter.WritableFilter, SequenceFileInputFilter.RegexFilter to o.a.h.io.filter.RegexFilter, etc.
          • introduces new filters DefaultFilter, ItemFilter, SetFilter, and RangeFilter
          • introduces FilterRecordReader, which applies the provided filter to the records returned by the underlying RecordReader
          • introduces FilterInputFormat, a decorator that adds filtering capability; it works with legacy code with no change
          Enis Soztutar added a comment -

          Marking it for 0.17.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12373832/filterinputformat_v1.patch
          against trunk revision 614721.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1655/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1655/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1655/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1655/console

          This message is automatically generated.

          Chris Douglas added a comment - edited

          • It looks like one can't meaningfully nest FilterInputFormats. A check in initBaseInputFormat that rejects an attempt to do this would probably be a good idea.
          • IIRC, Configuration::IntegerRanges is limited to positive integers, so the default range of (Integer.MIN_VALUE + "-" + Integer.MAX_VALUE) in IntRangeFilter::setConf may be invalid.
          • RangeFilter should probably accept a WritableComparator to support user types and an alternative syntax for projections.
          • Would it be too constraining to limit SetFilter and ItemFilter to Text? Filtering based on the string representation of Writables seems like an overly general strategy.

          The patch looks good; this should be very useful.

          Enis Soztutar added a comment -

          Thanks very much for the review.

          It looks like one can't meaningfully nest FilterInputFormats. A check in initBaseInputFormat that rejects an attempt to do this would probably be a good idea.

          In my opinion, people are more likely to nest WritableFilters than FilterInputFormats for multi-stage filtering (for example, to filter by range and by percent). However, I do not expect chaining of more than one filter to be common (one can always write a ChainFilter, configured to apply several Filters sequentially).

          I think people will understand that nesting FilterInputFormats cannot be done with the current API:

          job.setInputFormat(FilterInputFormat.class);
          FilterInputFormat.setBaseInputFormat(job, FilterInputFormat.class); // nesting
          // now we should set the actual InputFormat
          FilterInputFormat.setBaseInputFormat(job, TextInputFormat.class);   // already confusing

          IIRC, Configuration::IntegerRanges is limited to positive integers, so the default range of (Integer.MIN_VALUE + "-" + Integer.MAX_VALUE) in IntRangeFilter::setConf may be invalid.

          Right, I had missed that IntegerRanges is limited to positive integers. I think we should make IntRangeFilter extend ComparableRangeFilter instead (not using IntegerRanges anymore).

          RangeFilter should probably accept a WritableComparator to support user types and an alternative syntax for projections

          Please see below.

          Would it be too constraining to limit SetFilter and ItemFilter to Text? Filtering based on the string representation of Writables seems like an overly general strategy.

          The main reason behind this weird strategy of using string comparison on serialized versions of the Writables is that we need to somehow pass the specified Writables (for example, min and max values) to the tasks, and currently the only way to do this is to store them in the configuration. It would be great if we had setWritable() and getWritable() methods in the configuration, so that we could directly compare WritableComparables (possibly using WritableComparators); however, the proposal to add these methods was rejected (I cannot remember the issue number).
          Another solution might be adding an interface, for example WritableStorable or WritableDeserializer, which would provide a Writable forName(String) method.

          If you see a better solution for passing the Writables to the tasks, I will be very glad to adopt it. Or should we add setWritable()/getWritable() to the Configuration?

          Chris Douglas added a comment -

          I think people will understand that nesting FilterInputFormat cannot be done with the current API

          True. The case I had in mind wasn't so much nesting as it was simultaneous instantiation. For example, the classes in mapred.join wouldn't hesitate to accept multiple FilterInputFormats for multiple datasources, but- as you point out- the API is such that one would quickly realize that chained filters aren't possible without some effort. I was hoping that these classes could be integrated into the aforementioned framework and I'm confident they can be.

          On that note, the MD5PercentFilter guards access to the MessageDigest within the instance, but multiple instances could corrupt it. It would probably be better if it were not static.

          Also, since a filter may discard the vast majority of the input, is it necessary to update the reporter to avoid a timeout? A call to next may churn through data for some time, and I'm uncertain whether one can expect the base InputFormat to keep the task alive. I'd expect it to be fine for the majority of cases, but if you felt like being paranoid it's not insane.

          If you see a better solution to pass the Writables to the tasks, I will be very glad to adopt it. Or should we add setWritable() getWritable() to the Configuration?

          I don't, sorry. I remember the JIRA you mention, the rejection of get/setWritable, and the reasoning probably remains sound. Other than the solutions you propose, the only other way I can think of would be to have an auxiliary InputFormat/input dir that slurps a set of keys (no splits!) into an in-memory Set and assume that the OOM exceptions are a strong hint to the user. Gross.

          Still, you could probably still restrict these to Text, as long as the user is aware of SequenceFileAsTextInputFormat and related options. Automatically converting to String could produce some weird results if one isn't aware of how the filter is effected. Forcing someone to figure out how to get their WritableComparables to Text is ample warning, I think.

          Enis Soztutar added a comment -

          The case I had in mind wasn't so much nesting as it was simultaneous instantiation. For example, the classes in mapred.join wouldn't hesitate to accept multiple FilterInputFormats for multiple datasources, but- as you point out- the API is such that one would quickly realize that chained filters aren't possible without some effort. I was hoping that these classes could be integrated into the aforementioned framework and I'm confident they can be.

          I did not think about the join framework. Having a look at it, I guess we can still stick with the current framework. The reason is that FilterInputFormat filters based on keys, and the join framework performs joins based on keys, so when we perform joins we would want to use the same filter for all the datasets. Personally I have not tried it, but I guess the following might work:

          // set join expression as usual
          job.set("mapred.join.expr",
              "inner(tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,"
            + "\"hdfs://host:8020/foo/bar\"),"
            + "tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,"
            + "\"hdfs://host:8020/foo/baz\"))");
          // wrap CompositeInputFormat with FilterInputFormat
          job.setInputFormat(FilterInputFormat.class);
          FilterInputFormat.setBaseInputFormat(job, CompositeInputFormat.class);

          Do you think this will do the trick?

          On that note, the MD5PercentFilter guards access to the MessageDigest within the instance, but multiple instances could corrupt it. It would probably be better if it were not static.

          I think the current implementation is OK, since we update and digest the MessageDigest only in the MD5Hashcode function, which is already synchronized.

          Also, since a filter may discard the vast majority of the input, is it necessary to update the reporter to avoid a timeout? A call to next may churn through data for some time, and I'm uncertain whether one can expect the base InputFormat to keep the task alive. I'd expect it to be fine for the majority of cases, but if you felt like being paranoid it's not insane.

          I think we had better be pragmatic about this one. Let's not spend a nontrivial amount of effort on this; we can fix it if it is exploited in some way.

          Still, you could probably still restrict these to Text

          I envision that, given this implementation, filters will mostly be used with Text anyway, but I do not think we should limit their use. The fact that keys are converted to strings before comparison/regex matching is clearly documented in the javadocs of the respective filters. People are expected to read the javadocs before using the classes. :)

          Enis Soztutar added a comment -

          Cancelling the patch until I find some time to fix IntRangeFilter.

          Chris Douglas added a comment -

          I did not think about the join framework. Having a look at it, i guess we can still stick with the current framework.

          I think your example would work, but I was considering filters at arbitrary positions in the join. I was thinking of adding a new node to the parser that accepts a Filter and an argument (the range, the regexp, etc) and sets the filter expression prior to the instantiation of the RecordReader (as it does for mapred.input.dir). Both should work.

          I think current implementation is OK, since we are updating and digesting the MessageDigest in only the MD5Hashcode function which is already synchronized.

          The MD5Hashcode function is synchronized on the instance, but it's protecting a static. Unless there's only one instance of the MD5PercentFilter, synchronizing on the method is insufficient, no?

          I think we better be pragmatic about this one. Lets not spend some nontrivial amount of effort on this. We can fix it if it is exploited in some way.

          *nod* Again, I think it'll be fine for the majority of cases, but I thought I'd mention it.

          People are expected to read the javadocs before using the classes.

          Well, fair enough. Really, it only supports Text, and this seems like a convenient way to annotate the class since it's not difficult to effect the translation. Further, toString isn't usually considered in the Comparable/equals/hashCode family of equality, so it seems risky.

          Owen O'Malley added a comment -

          I agree with Chris. Using obj1.toString().equals(obj2.toString()) is both very expensive and likely to break. We got 2x speedups out of TextOutputFormat by avoiding an obj.toString().getBytes("UTF-8") data conversion pair. I'd be very worried about the performance if you are doing toString for compares.

          Enis Soztutar added a comment -

          After spending some time thinking about this patch, I have redesigned the API. The changes are:

          • Refactored WritableFilter to Filter, so that Filter can be applied to non-Writables (according to the Serialization framework)
          • Added a Stringifier interface and a default implementation using the Hadoop serialization framework. Now ordinary objects can be kept in the configuration. Given the performance loss of String.equals() comparisons, we had to either pass the actual objects through the configuration or not filter at all.
          • Added FilterEngine to evaluate postfix filter expressions
          • Added OR, AND, NOT filters
          • Fixed the synchronization issue in MessageDigest
          • Moved filtering into the core framework instead of a library
          • Changed the API so that JobConf is now used to add filters. This API is better since it hides nearly all the details from the application code; the application just configures filters by calling JobConf#addFilter().
          • Added a counter for filtered-out records
          • Added a filtering section to the mapred tutorial
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12378052/filtering_v2.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 20 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to cause Findbugs to fail.

          core tests -1. The patch failed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1983/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1983/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1983/console

          This message is automatically generated.

          Enis Soztutar added a comment -

          The previous patch was generated incorrectly. This one should do the trick.

          Enis Soztutar added a comment -

          Resubmitting so that Hudson picks up the patch.

          Chris Douglas added a comment -

          Wow, looks like quite a rewrite! A few points:

          • Would it be possible to separate the Stringify functionality into a separate JIRA? It is independently useful functionality and would be easier to review if it were extracted from an already formidable patch. I like this idea.
            • DefaultStringifier could reuse more of the objects it creates. The serializer will hold a reference to the ByteArray(Input/Output)Buffers you've defined, so re-opening and closing them every time is unnecessary if you use a resettable stream like o.a.h.io.Data(Input/Output)Buffer.
            • The static convenience methods (load, store, storeArray) that create new Stringifiers in DefaultStringifier are probably unnecessary. Adding a stringified object to a config doesn't need that much support. Further, it's not clear to me how one would register a Stringifier that isn't a DefaultStringifier. Would it make sense to follow the pattern in the Serialization framework from HADOOP-1986- or the WritableComparator static initialization- and register custom Stringifiers for classes?
          • Your previous version- where this was library, not framework code- is preferred. The FilterRecordReader should be part of a FilterInputFormat, which is- as you illustrated with your earlier patch- not an undue burden on a user. Weaving this into Task and MapTask doesn't seem to add any additional functionality or syntactic elegance. You could add counters to the library code without changing core functionality. Being part of Task also prevents it from being integrated into other libraries (like the aforementioned join framework).
          • There are no progress updates as you churn through records in your filtering engine. You might want to keep a reporter around to prevent timeouts.
          • It would be much better if one could employ a (Raw)Comparator instead of using compareTo in RangeFilter and equals in ItemFilter. Using JobConf::getOutputKeyComparator would be a good start, though one could imagine uses where one would want different comparators for different filters.
          • It is unnecessary to distinguish between Filters and FunctionFilters. A single eval method accepting the key and stack is sufficient (and permits functions like AND and OR to short-circuit, somewhat). Forcing the application of every filter should probably be avoided when possible (see next bullet).
          • FilterEngine uses a static method and int to track the insertion order of filters and their associated properties, which is probably not the correct approach. Consider a case where a client submits two jobs, both with filters. Given that you require the user to add their expression in postfix notation, it doesn't seem unduly onerous to require them to track the position of their filters and avoid keeping a count. That said, assigning sequential identifiers to your filter parameters is probably not an ideal approach in itself. You might consider writing a parser or creating a Stringify-able object hierarchy (as in Swing, sort of) for your expressions. Something that looks more like an expression parse tree would probably effect a more efficient data structure.
          • Why is the MessageDigest static in MD5PercentFilter? Unless it is expensive to create, the synchronization will probably cost more than making this an instance variable.

          Well done! I look forward to this.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12378107/filtering_v3.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 15 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac -1. The applied patch generated 584 javac compiler warnings (more than the trunk's current 582 warnings).

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1991/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1991/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1991/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1991/console

          This message is automatically generated.

          Enis Soztutar added a comment -

          Wow, looks like quite a rewrite! A few points:

          Indeed it is. :)

          Would it be possible to separate the Stringify functionality into a separate JIRA? It is independently useful functionality and would be easier to review if it were extracted from an already formidable patch. I like this idea.

          Done in HADOOP-3048.

          Your previous version- where this was library, not framework code- is preferred. The FilterRecordReader should be part of a FilterInputFormat, which is- as you illustrated with your earlier patch- not an undue burden on a user. Weaving this into Task and MapTask doesn't seem to add any additional functionality or syntactic elegance. You could add counters to the library code without changing core functionality. Being part of Task also prevents it from being integrated into other libraries (like the aforementioned join framework).

          Implementing this functionality as a library was good, and as you mentioned, there are lots of benefits to doing so. However, after spending some time on this I realized implementing it in the core framework would be even better, because:

          • Theoretically every job can/should use the filtering functionality, since there are no drawbacks but lots of benefits. This would mean the InputFormat of every job should be FilterInputFormat, hiding the real InputFormat behind FilterInputFormat#setBaseInputFormat().
          • There is a lot of legacy code which can benefit from this, but people will be reluctant (or lazy) to convert their job's input format to a filtering one, so we should aim for maximum usability with minimum code change.
          • Although the functionality is in the core, we only change a few lines (apart from FilterRecordReader) in the Task and MapTask classes, effectively encapsulating the functionality. We could extract FilterRecordReader into its own class so that it is completely separate. I should note that join can readily use filtering: the record is filtered before being passed to the mapper, so the joined keys would be filtered.

          At the risk of repeating myself, I believe that the InputFormat of every job should be FilterInputFormat, which is why we should integrate this into the core. However, I am open to suggestions and discussion. :)

          There are no progress updates as you churn through records in your filtering engine. You might want to keep a reporter around to prevent timeouts.

          I will check that

          It would be much better if one could employ a (Raw)Comparator instead of using compareTo in RangeFilter and equals in ItemFilter. Using JobConf::getOutputKeyComparator would be a good start, though one could imagine uses where one would want different comparators for different filters.

          I will check that

          It is unnecessary to distinguish between Filters and FunctionFilters. A single eval method accepting the key and stack is sufficient (and permits functions like AND and OR to short-circuit, somewhat). Forcing the application of every filter should probably be avoided when possible (see next bullet).

          Yes, you're right (indeed, I did it that way at first). However, it is very unlikely that a user will implement a FunctionFilter, and quite likely that she will implement a Filter. Thus adding a stack argument that no plain filter uses seems confusing and unnecessary. Consider the javadoc for the stack argument in Filter#accept() reading "@param stack filters should not use this".

          FilterEngine uses a static method and int to track the insertion order of filters and their associated properties, which is probably not the correct approach. Consider a case where a client submits two jobs, both with filters. Given that you require the user to add their expression in postfix notation, it doesn't seem unduly onerous to require them to track the position of their filters and avoid keeping a count. That said, assigning sequential identifiers to your filter parameters is probably not an ideal approach in itself. You might consider writing a parser or creating a Stringify-able object hierarchy (as in Swing, sort of) for your expressions. Something that looks more like an expression parse tree would probably effect a more efficient data structure.

          I also think the best way to do the filtering is to create the object hierarchy and pass it through the configuration, but somehow I went the postfix-expression way. I should think of a better approach; serializing the filters -> stringify -> store in conf -> pass to Task -> load from conf -> destringify -> deserialize might work.

          Why is the MessageDigest static in MD5PercentFilter? Unless it is expensive to create, the synchronization will probably cost more than making this an instance variable.

          I just took the existing MD5PercentFilter implementation, which used a static digest, and I assumed that MessageDigest.getInstance() returns a singleton instance of the object. However, it seems that is not the case. I will fix this.

          Well done! I look forward to this.

          Me too; thanks for the in-depth review, Chris.

          Chris Douglas added a comment -

          Theoretically every job can/should use the filtering functionality, since there is no drawback but lots of benefits. So this necessitates that the InputFormats of every job should be FilterInputFormat, shading the real InputFormat under FilterInputFormat#setBaseInputFormat().

          Even if every job can use the filtering functionality, integrating it into Task/MapTask limits where it may be applied. If, for example, one were reading from multiple sources, different sets of filters could be applied to each source. Similarly, a map or a reduce task could use a filtering record reader to read a subset of records indirectly. If it's limited to the interfaces you provide to MapTask, then this code can't be reused elsewhere. Again, since weaving it into core doesn't seem to give you extra functionality- it seems to make it less general- and there's zero performance hit, making it a library looks laced with win.

          There is a lot of legacy code which can benefit from this, but people will be reluctant (or lazy) to convert their job's input format to filter. So maximum usability and minimum code change should be aimed.

          I disagree, and I cite your previous patch. Its interface was not only easier to understand than the postfix additions, but specifying the baseInputFormat was very intuitive. For users seeking to benefit from this, the difficulty delta between the library and Task implementations is so slight that I doubt it'll actually prevent someone from taking advantage of it.

          Although the functionality is at the core, we only change a few lines(except FilterRR) from the Task and MapTask classes, effectively encapsulating the functionality. We may extract FilterRecordReader to its own class, so that it is completely separate. I should note that join can readily use filtering. The filtering just filters before passing the record to the mapper, so the joined keys would be filtered.

          Not exactly. If I apply a RangeFilter to each of my record readers, the join considers a smaller subset of the records read. Since it's generating the cross of all the matching records (i.e. sets A, B, C sharing key k and containing values x1, x2 would emit [a1, b1, c1], [a1, b1, c2], [a1, b2, c1], ... [a2, b2, c2]), my filter would have to reject the cross of all those records, rather than each individually. Further, if I only want to filter the records from B in the previous example, the filter in my map would need more state to ensure I'm not emitting duplicate records to the map (or my map code would have to deal with that). One can imagine other cases where, again, filtering shouldn't be limited to a single part of the job, or cases where it might change the result if filters can only be applied at a certain stage.

          However it is very unlikely that a user may implement a FunctionFilter, but it is quite likely that she can implement a Filter. Thus adding a stack argument that no filter uses seems confusing and unnecessary. Consider the javadoc for the stack argument in the Filter#accept() method being as "@param stack filters should not use this".

          Though Filters will be useful, the semantics of a FunctionFilter aren't so mysterious that people won't want to write those, too. Again, the purpose of both parameters is easily explained, and people will decide whether they should employ them or not. It seems premature to decide that there are only two types of filters, anyway. It sounds like we agree that it's a cleaner interface with only one signature for the eval; I'm just not sure I see the extensibility benefit as clearly.

          Hide
          Owen O'Malley added a comment -

          A big +1 to leaving this in library code rather than framework.

          Devaraj Das added a comment -

          Cancelling patch as it is under discussion

          Enis Soztutar added a comment -

          Even if every job can use the filtering functionality, integrating it into Task/MapTask limits where it may be applied. If, for example, one were reading from multiple sources, different sets of filters could be applied to each source. Similarly, a map or a reduce task could use a filtering record reader to read a subset of records indirectly. If it's limited to the interfaces you provide to MapTask, then this code can't be reused elsewhere. Again, since weaving it into core doesn't seem to give you extra functionality- it seems to make it less general- and there's zero performance hit, making it a library looks laced with win.

          Please see below.

          Not exactly. If I apply a RangeFilter to each of my record readers, the join considers a smaller subset of the records read. Since it's generating the cross of all the matching records (i.e. sets A, B, C sharing key k and containing values x1, x2 would emit [a1, b1, c1], [a1, b1, c2], [a1, b2, c1], ... [a2, b2, c2]), my filter would have to reject the cross of all those records, rather than each individually. Further, if I only want to filter the records from B in the previous example, the filter in my map would need more state to ensure I'm not emitting duplicate records to the map (or my map code would have to deal with that). One can imagine other cases where, again, filtering shouldn't be limited to a single part of the job, or cases where it might change the result if filters can only be applied at a certain stage.

          What we discuss differs in just one part.
          As a library:

          • FilterRecordReader same as before
          • classes under io.filter - same as before
          • FilterInputFormat as a thin wrapper
          • Configure filters from FilterInputFormat (delegating to FilterEngine)
          • Users may freely extend / use FilterInputFormat, FilterRecordReader, and Filters.

          As integrated into the core:

          • FilterRecordReader same as before
          • classes under io.filter - same as before
          • instead of FilterInputFormat, use FilterRecordReader wrapping MapTask.TrackedRecordReader if filtering is enabled
          • Configure filters from the JobConf (delegating to FilterEngine)
          • Users may freely extend / use FilterRecordReader and Filters.

          The bulk of the implementation is in the FilterRecordReader, FilterEngine and Filter implementations, so this code (and the filtering functionality) CAN be reused. Moreover, the FilterInputFormat from the previous patch is only a wrapper. Assuming we move FilterRecordReader out to a separate class, the join framework can easily extend its grammar to accept an InputFormat which filters records from its underlying InputFormat (but that is a different issue). In this sense, the current patch does not "limit" the applicability of filtering to other parts of the system, such as using more than one InputFormat, filtering map output records, etc.; rather, it provides a filtering framework and enables the frequent use case in which we filter the input records to the job. The other use cases can indeed be implemented on top of the patch (in the case of join, it can bypass the core filter and introduce filtering at another layer).
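
          To make the shared core concrete, here is a minimal sketch of the wrapping idea (illustrative only, not the actual patch code), written against the old mapred RecordReader interface and the Filter#accept(Writable) contract from this issue:

          import java.io.IOException;
          import org.apache.hadoop.io.Writable;
          import org.apache.hadoop.mapred.RecordReader;

          // Sketch: deliver only the records whose keys the configured filter
          // accepts; rejected records are read and silently skipped.
          public class FilterRecordReader<K extends Writable, V extends Writable>
              implements RecordReader<K, V> {
            private final RecordReader<K, V> raw; // the reader being decorated
            private final Filter filter;          // combined filter expression

            public FilterRecordReader(RecordReader<K, V> raw, Filter filter) {
              this.raw = raw;
              this.filter = filter;
            }

            public boolean next(K key, V value) throws IOException {
              while (raw.next(key, value)) {
                if (filter.accept(key)) {
                  return true; // hand the accepted record to the caller
                }
              }
              return false; // underlying reader is exhausted
            }

            // everything else simply delegates to the wrapped reader
            public K createKey() { return raw.createKey(); }
            public V createValue() { return raw.createValue(); }
            public long getPos() throws IOException { return raw.getPos(); }
            public float getProgress() throws IOException { return raw.getProgress(); }
            public void close() throws IOException { raw.close(); }
          }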

          I disagree, and I cite your previous patch. Its interface was not only easier to understand than the postfix additions, but specifying the baseInputFormat was very intuitive. For users seeking to benefit from this, the difficulty delta between the library and Task implementations is so slight that I doubt it'll actually prevent someone from taking advantage of it.

          The postfix additions are irrelevant to whether filtering should be a library or not. The postfix expressions are a way to specify the filtering expression to use; that part of the API would not have changed if we had stuck with FilterInputFormat.

          Though Filters will be useful, the semantics of a FunctionFilter aren't so mysterious that people won't want to write those, too. Again, the purpose of both parameters is easily explained, and people will decide whether they should employ them or not. It seems premature to decide that there are only two types of filters, anyway. It sounds like we agree that it's a cleaner interface with only one signature for the eval; I'm just not sure I see the extensibility benefit as clearly.

          I have separated the Filter and FunctionFilter interfaces to enhance encapsulation, not extensibility. Having one eval method is a cleaner interface for core developers, who can understand how the postfix expression is evaluated using Dijkstra's algorithm in the FilterEngine class, but it is cleaner to hide all these details from a user who just wants to develop a Filter.
          Think about 3 + 4, or in postfix, 3 4 +. The 3 and 4 are values, and + is a function. Clearly, Filters (which are values) and FunctionFilters (which are functions) are different in that sense. However, 3, 4 and + are all represented as symbols so that they can be interpreted and passed to a machine evaluating the expression. Similarly, FunctionFilter implements Filter (although it is not a filter) so that it can be passed to the FilterEngine. As most users will only implement Filters, and the only boolean binary function not implemented is XOR, I see no loss of generality in hiding the gory details of FunctionFilters.
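
          For illustration, stack-based evaluation of such a postfix expression could look like the following; FunctionFilter#apply and the surrounding types are assumptions for the sketch, not the patch's actual API:

          import java.util.ArrayDeque;
          import java.util.Deque;
          import org.apache.hadoop.io.Writable;

          // Hypothetical sketch: plain Filters push the result of testing the key;
          // FunctionFilters pop their operands and push the combined result.
          boolean evaluate(Iterable<Filter> postfix, Writable key) {
            Deque<Boolean> stack = new ArrayDeque<Boolean>();
            for (Filter item : postfix) {
              if (item instanceof FunctionFilter) {
                boolean right = stack.pop();
                boolean left = stack.pop();
                stack.push(((FunctionFilter) item).apply(left, right)); // e.g. AND, OR
              } else {
                stack.push(item.accept(key));
              }
            }
            return stack.pop(); // the expression's final value
          }
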
          Maybe the best way would have been to construct the hierarchy:

          interface ExpressionItem { }
          interface Filter extends ExpressionItem { boolean accept(Writable key); }
          interface Function extends ExpressionItem { }
          FilterEngine.add(ExpressionItem item);
          

          But I had chosen not to do this, to simplify things. Anyway, as I said before, now that passing complete objects is possible thanks to HADOOP-3048, I will change the postfix expressions and develop a more intuitive way:

          Filter f1 = new RangeFilter(2, 5);
          Filter f2 = new RangeFilter(10, 20);
          Filter orFilter = new ORFilter(f1, f2);
          Job.addFilter(orFilter);
          

          Well, Chris, I share your wisdom about the patch being a library, but I only insist because I [still] think that this way might be better. Other than that, since the implicit votes are 2 to 1, I will probably implement the next version of the patch as a library including FilterInputFormat, unless someone objects.

          Chris Douglas added a comment -

          will change the postfix expressions and develop a more intuitive way...

          +1 I like this syntax. Since you're passing serialized objects with your filters, you might want to test larger expressions to make sure length limits in the Configuration aren't a problem, but hitting them seems unlikely. I don't know if we even have limits in that area, but again: it'd be worth testing. On that note, for the FunctionFilters you've defined, it might be a good idea to permit them to take an arbitrary number of arguments >=2, as in:

          Filter f1 = new RangeFilter(2, 5);
          Filter f2 = new RangeFilter(10, 20);
          Filter f3 = new RangeFilter(30, 40);
          Filter orFilter = new ORFilter(f1, f2, f3);
          

          With your new syntax, this would be easy to implement, would present more opportunities for optimization within your FilterEngine, and would be very convenient for users.
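
          For instance, a variadic ORFilter might look like the sketch below; only the Filter#accept(Writable) contract comes from this issue, the rest is assumed:

          import org.apache.hadoop.io.Writable;

          // Hypothetical variadic ORFilter: accepts a key as soon as any child does.
          public class ORFilter implements Filter {
            private final Filter[] children;

            public ORFilter(Filter... children) {
              if (children.length < 2) {
                throw new IllegalArgumentException("ORFilter needs at least two children");
              }
              this.children = children;
            }

            public boolean accept(Writable key) {
              for (Filter f : children) {
                if (f.accept(key)) {
                  return true; // short-circuit on the first accepting child
                }
              }
              return false;
            }
          }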

          Having one eval method is a cleaner interface to core developers who could understand how the postfix expression is evaluated...

          Isn't all of this hidden by the FilterEngine? I'm not sure I understand what you're asserting in this paragraph... I thought we were discussing whether or not it made sense to collapse Filters and FunctionFilters into a single Filter interface that manipulates the key/stack. By construction, you know that your FunctionFilters have either Filters or FunctionFilters as children. Once you reconstruct the tree, it's not clear to me why you'd even need a stack. The key gets passed through your tree to the child Filters, which return results to the parent, which may or may not pass the key to its other children depending on the return value. It might make sense to have a FunctionFilter base type from which your operators descend- since they share common functionality- but the additional interface seems unnecessary. Have I misunderstood you, or am I responding to your new syntax instead of the original, postfix, stack-based implementation?

          The postfix additions are irrelevant to whether filtering should be a library or not. The postfix expressions are a way to specify the filtering expression to use; that part of the API would not have changed if we had stuck with FilterInputFormat.

          Sorry, I was unclear. You're right, the postfix syntax is orthogonal to this discussion since that functionality wasn't present in the original patch. I was only pointing out that those who could benefit from Filters aren't going to be turned away because they need to use a different InputFormat, i.e. using the library poses a more familiar and less difficult problem to users than the syntax and implications of Filters.

          [library vs core in general]

          It cannot be disputed that your integration of Filters into Tasks has a negligible cost and that it does not prohibit their use elsewhere and in other frameworks. That said, the semantics of Filters match those of InputFormat precisely. At its point of integration, it does exactly what an InputFormat would effect (with one caveat concerning map counters). It also avoids any confusion about where the filtering occurs, particularly where other decorator InputFormats are applied. Though I'm sympathetic to making filtering part of every job, setting the InputFormat seems like a modest burden that happens to also fit the existing semantics in an intuitive and efficient way.
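
          To illustrate the burden in question, job setup with the library approach might look roughly like this; setBaseInputFormat appears earlier in this discussion, while setFilter and MyJob are hypothetical names for the sketch:

          // assumed setup with the library approach (old mapred API)
          JobConf conf = new JobConf(MyJob.class);
          conf.setInputFormat(FilterInputFormat.class);
          FilterInputFormat.setBaseInputFormat(conf, SequenceFileInputFormat.class);
          FilterInputFormat.setFilter(conf, new ORFilter(new RangeFilter(2, 5),
                                                         new RangeFilter(10, 20)));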

          Enis Soztutar added a comment -

          We cannot get this into 0.17; moving to 0.18.

          Harsh J added a comment -

          With PathFilter available in the API today, I don't see value in adding this.

          Closing out for now; feel free to reopen if this is still required. Better as a new discussion, since this one has gone stale.
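
          For reference, a PathFilter selects which input files a job reads (rather than filtering individual records); typical usage with the old API looks roughly like this, where NoTmpFiles is a made-up example class:

          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.fs.PathFilter;
          import org.apache.hadoop.mapred.FileInputFormat;
          import org.apache.hadoop.mapred.JobConf;

          // Example PathFilter: skip temporary files when listing job input.
          public class NoTmpFiles implements PathFilter {
            public boolean accept(Path path) {
              return !path.getName().endsWith(".tmp");
            }
          }

          // in job setup:
          // JobConf conf = new JobConf();
          // FileInputFormat.setInputPathFilter(conf, NoTmpFiles.class);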


            People

            • Assignee: Enis Soztutar
            • Reporter: Owen O'Malley
            • Votes: 0
            • Watchers: 2