Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-941

Enhancements to Hadoop record I/O - Part 1

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.1
    • Fix Version/s: 0.12.0
    • Component/s: record
    • Labels:
      None
    • Environment:

      All

      Description

      Hadoop record I/O can be used effectively outside of Hadoop. It would increase its utility if developers can use it without having to import hadoop classes, or having to depend on Hadoop jars. Following changes to the current translator and runtime are proposed.

      Proposed Changes:

      1. Use java.lang.String as a native type for ustring (instead of Text.)
      2. Provide a Buffer class as a native Java type for buffer (instead of BytesWritable), so that later BytesWritable could be implemented as following DDL:
      module org.apache.hadoop.io {
      record BytesWritable

      { buffer value; }

      }
      3. Member names in generated classes should not have prefixes 'm' before their names. In the above example, the private member name would be 'value' not 'mvalue' as it is done now.
      4. Convert getters and setters to have CamelCase. e.g. in the above example the getter will be:
      public Buffer getValue();
      5. Generate clone() methods for records in Java i.e. the generated classes should implement Cloneable.
      6. Make generated Java codes for maps and vectors use Java generics.

      These are the proposed user-visible changes. Internally, the translator will be restructured so that it is easier to plug-in translators for different targets.

      1. jute-patch.txt
        249 kB
        Milind Bhandarkar

        Activity

        Hide
        dbowen David Bowen added a comment -

        I don't think that anyone can be relying on them, since they didn't work until yesterday. But compatibility is a good reason to make this small fix asap.

        Show
        dbowen David Bowen added a comment - I don't think that anyone can be relying on them, since they didn't work until yesterday. But compatibility is a good reason to make this small fix asap.
        Hide
        cutting Doug Cutting added a comment -

        > the variable int serialization unnecessarily uses 8 bytes for any number less than -112

        Perhaps we should file a separate issue for this one. It breaks compatibility, but I don't know that we have many folks yet using negative VInts...

        Show
        cutting Doug Cutting added a comment - > the variable int serialization unnecessarily uses 8 bytes for any number less than -112 Perhaps we should file a separate issue for this one. It breaks compatibility, but I don't know that we have many folks yet using negative VInts...
        Hide
        sameerp Sameer Paranjpye added a comment -

        Milind could you link this issue with the new one.

        Thanks

        Show
        sameerp Sameer Paranjpye added a comment - Milind could you link this issue with the new one. Thanks
        Hide
        milindb Milind Bhandarkar added a comment -

        David,

        I have filed another jira issue to address the remaining enhancements. I will add your points [3] and [5] to that issue.

        Show
        milindb Milind Bhandarkar added a comment - David, I have filed another jira issue to address the remaining enhancements. I will add your points [3] and [5] to that issue.
        Hide
        dbowen David Bowen added a comment -

        As a reminder, points [3] and [5] from an earlier comment (https://issues.apache.org/jira/browse/HADOOP-941#action_12475802) have not yet been addressed. Are they covered by another Jira ticket? (With so much email traffic, I could have missed it.)

        • David
        Show
        dbowen David Bowen added a comment - As a reminder, points [3] and [5] from an earlier comment ( https://issues.apache.org/jira/browse/HADOOP-941#action_12475802 ) have not yet been addressed. Are they covered by another Jira ticket? (With so much email traffic, I could have missed it.) David
        Hide
        dbowen David Bowen added a comment -

        One more comment: the variable int serialization unnecessarily uses 8 bytes for any number less than -112. This would penalize an app that uses a lot of negative shorts or ints.

        I think there is a simple fix. Instead of clearing and setting the sign bit for negative numbers, flip all the bits. I.e. in the writer change

        i &= 0x7FFFFFFFFFFFFFFFL;

        to:

        i ^= 0xFFFFFFFFFFFFFFFFL;

        and do the same operation in the reader to get the negative number back.

        Show
        dbowen David Bowen added a comment - One more comment: the variable int serialization unnecessarily uses 8 bytes for any number less than -112. This would penalize an app that uses a lot of negative shorts or ints. I think there is a simple fix. Instead of clearing and setting the sign bit for negative numbers, flip all the bits. I.e. in the writer change i &= 0x7FFFFFFFFFFFFFFFL; to: i ^= 0xFFFFFFFFFFFFFFFFL; and do the same operation in the reader to get the negative number back.
        Hide
        milindb Milind Bhandarkar added a comment -

        The pleasure is all mine, Doug.

        Show
        milindb Milind Bhandarkar added a comment - The pleasure is all mine, Doug.
        Hide
        cutting Doug Cutting added a comment -

        I just committed this. Thanks for your persistence, Milind.

        Show
        cutting Doug Cutting added a comment - I just committed this. Thanks for your persistence, Milind.
        Show
        hadoopqa Hadoop QA added a comment - +1, because http://issues.apache.org/jira/secure/attachment/12352278/jute-patch.txt applied and successfully tested against trunk revision http://svn.apache.org/repos/asf/lucene/hadoop/trunk/513049 . Results are at http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch
        Hide
        milindb Milind Bhandarkar added a comment -

        Making patch available, because alll concerns have been addressed.

        Show
        milindb Milind Bhandarkar added a comment - Making patch available, because alll concerns have been addressed.
        Hide
        milindb Milind Bhandarkar added a comment -

        Addressed Doug's concern about duplicate code.

        Show
        milindb Milind Bhandarkar added a comment - Addressed Doug's concern about duplicate code.
        Hide
        cutting Doug Cutting added a comment -

        > org.apache.hadoop.record package controls the serialization formats it uses, I believe the read and write methods belong correctly to o.a.h.r.

        Then, when they need to differ from those other methods, add private versions. But, until then, I can see no reason to copy these methods.

        Show
        cutting Doug Cutting added a comment - > org.apache.hadoop.record package controls the serialization formats it uses, I believe the read and write methods belong correctly to o.a.h.r. Then, when they need to differ from those other methods, add private versions. But, until then, I can see no reason to copy these methods.
        Hide
        milindb Milind Bhandarkar added a comment -

        Since org.apache.hadoop.record package controls the serialization formats it uses, I believe the read and write methods belong correctly to o.a.h.r. Note that the binary record serialization format is not guaranteed to be identical to hadoop's writable implementations.

        Show
        milindb Milind Bhandarkar added a comment - Since org.apache.hadoop.record package controls the serialization formats it uses, I believe the read and write methods belong correctly to o.a.h.r. Note that the binary record serialization format is not guaranteed to be identical to hadoop's writable implementations.
        Hide
        cutting Doug Cutting added a comment -

        This still copies code from elsewhere in the Hadoop tree (as noted by David, above). In particular, Util.java copies a number of methods from WritableUtils.java and a number from WritableComparator.java. We should avoid code replication if at all possible. Is there any other copied code here?

        Show
        cutting Doug Cutting added a comment - This still copies code from elsewhere in the Hadoop tree (as noted by David, above). In particular, Util.java copies a number of methods from WritableUtils.java and a number from WritableComparator.java. We should avoid code replication if at all possible. Is there any other copied code here?
        Show
        hadoopqa Hadoop QA added a comment - +1, because http://issues.apache.org/jira/secure/attachment/12352272/jute-patch.txt applied and successfully tested against trunk revision http://svn.apache.org/repos/asf/lucene/hadoop/trunk/512944 . Results are at http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch
        Hide
        milindb Milind Bhandarkar added a comment -

        Making patch available.

        Note that this patch moves some files from src/test/org/apache/hadoop/record/test to its parent directory, which the patch process does not handle well. Please keep this in mind while committing.

        Show
        milindb Milind Bhandarkar added a comment - Making patch available. Note that this patch moves some files from src/test/org/apache/hadoop/record/test to its parent directory, which the patch process does not handle well. Please keep this in mind while committing.
        Hide
        milindb Milind Bhandarkar added a comment -

        Diffs between this patch and previous one:

        The contentious issue of alllowing the record i/o translator to have an option to generate code that could be used outside the Hadoop context (for which no consensus could be reached so far) has been separated into a different issue, and those changes have been removed. i.e. generated records always implement WritableComparable and provide a raw comparator.

        David Bowen pointed out that Binary deserialization for longs did not handle negative numbers. This has been fixed, and test added. This bug existed in c++ deserialization as welll, and has been fixed.

        Every generated record is now cloneable. i.e. it implements Object.clone(). Test added.

        Buffer class now has a toString() method that spits out its string representation as sequence of hex-pairs. In addition it has a toString(String charsetName) for application that want to use buffer it to store non-utf8 string.

        Show
        milindb Milind Bhandarkar added a comment - Diffs between this patch and previous one: The contentious issue of alllowing the record i/o translator to have an option to generate code that could be used outside the Hadoop context (for which no consensus could be reached so far) has been separated into a different issue, and those changes have been removed. i.e. generated records always implement WritableComparable and provide a raw comparator. David Bowen pointed out that Binary deserialization for longs did not handle negative numbers. This has been fixed, and test added. This bug existed in c++ deserialization as welll, and has been fixed. Every generated record is now cloneable. i.e. it implements Object.clone(). Test added. Buffer class now has a toString() method that spits out its string representation as sequence of hex-pairs. In addition it has a toString(String charsetName) for application that want to use buffer it to store non-utf8 string.
        Hide
        milindb Milind Bhandarkar added a comment -

        Split the issue of making record I/O usable outside Hadoop into a separate issue. Will pull the current patch, and upload a new patch.

        Show
        milindb Milind Bhandarkar added a comment - Split the issue of making record I/O usable outside Hadoop into a separate issue. Will pull the current patch, and upload a new patch.
        Hide
        cutting Doug Cutting added a comment -

        > dependency on the Lucene jar. Which was deemed unacceptable, so we copied Lucene's PriorityQueue

        I didn't like that change, since it duplicated code. But the duplication was across projects, and thus harder to avoid, and it fixed a serious bug, that Hadoop-based applications that wanted to use a different version of Lucene from the one in Hadoop could not easily do so, as the version included in Hadoop was found first on the class path. So the cost seemed worth the benefit. But here I don't see any serious functional bug that implementing Writable and WritableComparable causes. Perhaps, somewhere down the line, such a bug will emerge, and we'll take on what source-code complexity is required to fix it then. But, perhaps again, we'll never have to. Let's wait and see. None of the arguments I have heard (package names, generated code size, jar file size) yet constitute bugs worthy of fixing unless the cost of the fix is negligible. The record code generation is likely to get more complex as time passes, to add features that fix real problems. Making it more complex now has a combinatorial effect on our ability to maintain this code: more conditional branches to edit, more modes to test, more features to document, etc. The long-term cost of adding code complexity are much greater than, e.g., the long-term costs of slightly larger generated code size.

        Show
        cutting Doug Cutting added a comment - > dependency on the Lucene jar. Which was deemed unacceptable, so we copied Lucene's PriorityQueue I didn't like that change, since it duplicated code. But the duplication was across projects, and thus harder to avoid, and it fixed a serious bug, that Hadoop-based applications that wanted to use a different version of Lucene from the one in Hadoop could not easily do so, as the version included in Hadoop was found first on the class path. So the cost seemed worth the benefit. But here I don't see any serious functional bug that implementing Writable and WritableComparable causes. Perhaps, somewhere down the line, such a bug will emerge, and we'll take on what source-code complexity is required to fix it then. But, perhaps again, we'll never have to. Let's wait and see. None of the arguments I have heard (package names, generated code size, jar file size) yet constitute bugs worthy of fixing unless the cost of the fix is negligible. The record code generation is likely to get more complex as time passes, to add features that fix real problems. Making it more complex now has a combinatorial effect on our ability to maintain this code: more conditional branches to edit, more modes to test, more features to document, etc. The long-term cost of adding code complexity are much greater than, e.g., the long-term costs of slightly larger generated code size.
        Hide
        dbowen David Bowen added a comment -

        Milind,

        Excuse my newbie-ness. I didn't realize that readVLong etc were old code from WritableUtils. Are these methods now duplicated in record.Utils simply to facilitate using the o.a.h.r package stand-alone? That would seem unfortunate.

        I still can't see that the methods are correct. I see the sign bit removed from negative numbers, and I don't see where it is put back. In any case, it would seem logical for writeVLong to use less than 8 bytes for small negative ints, and it does not appear to do that.

        On a separate topic: it might be worth considering a different approach to code generation for record Comparators. E.g. a generated record could have an additional method to return its "legend", like this:

        private static final byte[] legend =

        { TYPE_BOOL, TYPE_FLOAT, TYPE_LONG, TYPE_USTRING }

        ;
        public byte[] getLegend()

        { return legend; }

        where the TYPE_* things are static final bytes. Then you could have a Comparator that knows how to compare the binary forms of things that have legends - it just iterates over the legend using a switch statement to do the right thing based on the type.

        I think there is a maintenance benefit in keeping the generated code as small and as simple as possible. Performance-wise, this adds the overhead of a for-loop and a switch statement dispatch, but I don't think that would be significant.

        • David
        Show
        dbowen David Bowen added a comment - Milind, Excuse my newbie-ness. I didn't realize that readVLong etc were old code from WritableUtils. Are these methods now duplicated in record.Utils simply to facilitate using the o.a.h.r package stand-alone? That would seem unfortunate. I still can't see that the methods are correct. I see the sign bit removed from negative numbers, and I don't see where it is put back. In any case, it would seem logical for writeVLong to use less than 8 bytes for small negative ints, and it does not appear to do that. On a separate topic: it might be worth considering a different approach to code generation for record Comparators. E.g. a generated record could have an additional method to return its "legend", like this: private static final byte[] legend = { TYPE_BOOL, TYPE_FLOAT, TYPE_LONG, TYPE_USTRING } ; public byte[] getLegend() { return legend; } where the TYPE_* things are static final bytes. Then you could have a Comparator that knows how to compare the binary forms of things that have legends - it just iterates over the legend using a switch statement to do the right thing based on the type. I think there is a maintenance benefit in keeping the generated code as small and as simple as possible. Performance-wise, this adds the overhead of a for-loop and a switch statement dispatch, but I don't think that would be significant. David
        Hide
        owen.omalley Owen O'Malley added a comment -

        Milind,
        In every map/reduce implementation that I know of the contract does guarantee that the keys to the reduce are sorted. Many map/reduce applications depend on that sort. Some applications only need identical keys to be grouped, so in theory if the framework wanted to support 2 completely different datapaths, the keys could be grouped but not sorted. But removing the sort completely from the framework would make many applications impossible to write.

        Show
        owen.omalley Owen O'Malley added a comment - Milind, In every map/reduce implementation that I know of the contract does guarantee that the keys to the reduce are sorted. Many map/reduce applications depend on that sort. Some applications only need identical keys to be grouped, so in theory if the framework wanted to support 2 completely different datapaths, the keys could be grouped but not sorted. But removing the sort completely from the framework would make many applications impossible to write.
        Hide
        milindb Milind Bhandarkar added a comment -

        David,

        Thanks for testing the patch. Since I have moved record I/O tests from the org.apache.hadoop.record.test package to o.a.h.r package (to remove instances that needed to be public only because of tests, and since moving (i.e. renaming) files does not work very well with the patch process (rename is treated as remove followed by an add), such problems occur.

        Other points are well taken (3,4,5). I will provide more tests and do the refactoring in the future patch. As you can guess, just the read methods were in in WritableUtils (Owen, add one more dependency of static methods only to your list), they were copied to record.Utils, and so it seems like an awkward way of doing things. Now that they are in record.Utils, and referred to only from the generated code, I can make them more amenable to "doing the right thing" .

        Show
        milindb Milind Bhandarkar added a comment - David, Thanks for testing the patch. Since I have moved record I/O tests from the org.apache.hadoop.record.test package to o.a.h.r package (to remove instances that needed to be public only because of tests, and since moving (i.e. renaming) files does not work very well with the patch process (rename is treated as remove followed by an add), such problems occur. Other points are well taken (3,4,5). I will provide more tests and do the refactoring in the future patch. As you can guess, just the read methods were in in WritableUtils (Owen, add one more dependency of static methods only to your list), they were copied to record.Utils, and so it seems like an awkward way of doing things. Now that they are in record.Utils, and referred to only from the generated code, I can make them more amenable to "doing the right thing" .
        Hide
        dbowen David Bowen added a comment -

        A few notes from trying out this patch.

        [1] When I apply the patch, to what I think is a clean workspace, I get this:
        ...
        patching file src/test/org/apache/hadoop/record/test/TestWritable.java
        Reversed (or previously applied) patch detected! Assume -R? [n] n
        Apply anyway? [n]
        Skipping patch.
        1 out of 1 hunk ignored – saving rejects to file src/test/org/apache/hadoop/record/test/TestWritable.java.rej
        ...

        [2] TestRecordWritable needs to be brought up-to-date because of the change in the Reporter interface. Instead of defining its own do-nothing Reporter, it can do this: "Reporter reporter = Reporter.NULL;".

        [3] In the generated code I see sections like this:

        int i1 = org.apache.hadoop.record.Utils.readVInt(b1, s1);
        int i2 = org.apache.hadoop.record.Utils.readVInt(b2, s2);
        int z1 = org.apache.hadoop.record.Utils.getVIntSize(i1);
        int z2 = org.apache.hadoop.record.Utils.getVIntSize(i2);
        s1+=z1; s2+=z2; l1-=z1; l2-=z2;
        int r1 = org.apache.hadoop.record.Utils.compareBytes(b1,s1,l1,b2,s2,l2);
        if (r1 != 0)

        { return (r1<0)?-1:0; }

        s1+=i1; s2+=i2; l1-=i1; l1-=i2;

        I think the compareBytes call should be using i1 and i2, not l1 and l2. This would not show up as a bug in testing, just a loss of performance.

        Actually, I think it might have been better to replace this open-coding with method calls. I.e. have a packed-buffer class, comprising a byte buffer, position and length remaining. It would have methods like compareFloat, compareVLong etc. The generated code would then be much smaller and probably at least as efficient. (Currently the variable length is being computed twice for each int.)

        [4] I don't see any unit tests for the readVLong/writeVLong combination. Reading the code, I am not convinced that it will work correctly with negative numbers. (It looks like the number read will always be positive.)

        [5] rcc when called with no arguments should print out a usage message that explains what all the command line options are.

        Show
        dbowen David Bowen added a comment - A few notes from trying out this patch. [1] When I apply the patch, to what I think is a clean workspace, I get this: ... patching file src/test/org/apache/hadoop/record/test/TestWritable.java Reversed (or previously applied) patch detected! Assume -R? [n] n Apply anyway? [n] Skipping patch. 1 out of 1 hunk ignored – saving rejects to file src/test/org/apache/hadoop/record/test/TestWritable.java.rej ... [2] TestRecordWritable needs to be brought up-to-date because of the change in the Reporter interface. Instead of defining its own do-nothing Reporter, it can do this: "Reporter reporter = Reporter.NULL;". [3] In the generated code I see sections like this: int i1 = org.apache.hadoop.record.Utils.readVInt(b1, s1); int i2 = org.apache.hadoop.record.Utils.readVInt(b2, s2); int z1 = org.apache.hadoop.record.Utils.getVIntSize(i1); int z2 = org.apache.hadoop.record.Utils.getVIntSize(i2); s1+=z1; s2+=z2; l1-=z1; l2-=z2; int r1 = org.apache.hadoop.record.Utils.compareBytes(b1,s1,l1,b2,s2,l2); if (r1 != 0) { return (r1<0)?-1:0; } s1+=i1; s2+=i2; l1-=i1; l1-=i2; I think the compareBytes call should be using i1 and i2, not l1 and l2. This would not show up as a bug in testing, just a loss of performance. Actually, I think it might have been better to replace this open-coding with method calls. I.e. have a packed-buffer class, comprising a byte buffer, position and length remaining. It would have methods like compareFloat, compareVLong etc. The generated code would then be much smaller and probably at least as efficient. (Currently the variable length is being computed twice for each int.) [4] I don't see any unit tests for the readVLong/writeVLong combination. Reading the code, I am not convinced that it will work correctly with negative numbers. (It looks like the number read will always be positive.) [5] rcc when called with no arguments should print out a usage message that explains what all the command line options are.
        Hide
        milindb Milind Bhandarkar added a comment -

        Owen,

        I think raw comparison functions are not even useful INSIDE hadoop. Note that the reduce contract does not say anything about "sorting". It only assures "grouping", which is much weaker contract. Hadoop's implementation of grouping enforces a stricter contract, i.e. sorting.

        Efficient implementation of grouping does not require raw "compareTo", rather it requires raw "equals". Advantage of raw equals over raw compareTo is that raw equals is a generic byte array comparison, rather than one that needs to be implemented by every class.

        Even if grouping is implemented using sorting, one can use sort on the hashCode of keys (again raw hashCode is generic implementation on byte array, not implemented by every class), which would fulfill reduce's contract.

        I am even more puzzled by your --nohadoop proposal. Users of record I/O in the Hadoop context should know about hadoop, not those using it outside of Hadoop context. So having non-hadoop users know aboout --nohadoop option is clearly a wrong thing to do.

        Show
        milindb Milind Bhandarkar added a comment - Owen, I think raw comparison functions are not even useful INSIDE hadoop. Note that the reduce contract does not say anything about "sorting". It only assures "grouping", which is much weaker contract. Hadoop's implementation of grouping enforces a stricter contract, i.e. sorting. Efficient implementation of grouping does not require raw "compareTo", rather it requires raw "equals". Advantage of raw equals over raw compareTo is that raw equals is a generic byte array comparison, rather than one that needs to be implemented by every class. Even if grouping is implemented using sorting, one can use sort on the hashCode of keys (again raw hashCode is generic implementation on byte array, not implemented by every class), which would fulfill reduce's contract. I am even more puzzled by your --nohadoop proposal. Users of record I/O in the Hadoop context should know about hadoop, not those using it outside of Hadoop context. So having non-hadoop users know aboout --nohadoop option is clearly a wrong thing to do.
        Hide
        owen.omalley Owen O'Malley added a comment -

        I think that the raw comparison functions are useful outside of Hadoop in any application that involves sorting or merging. They provide a useful optimization for Hadoop, but are generally useful. Therefore, I don't think it is fair to count that code as pure Hadoop support.

        (And you are right that the dependent classes do include at least Writable, WritableComparable, WritableComparator and DataInputBuffer.)

        That said, I think that a "-nohadoop" command line switch wouldn't be bad. I do think it is pretty unreasonable to make the default behavior not support Hadoop, since that is the primary intended audience.

        Show
        owen.omalley Owen O'Malley added a comment - I think that the raw comparison functions are useful outside of Hadoop in any application that involves sorting or merging. They provide a useful optimization for Hadoop, but are generally useful. Therefore, I don't think it is fair to count that code as pure Hadoop support. (And you are right that the dependent classes do include at least Writable, WritableComparable, WritableComparator and DataInputBuffer.) That said, I think that a "-nohadoop" command line switch wouldn't be bad. I do think it is pretty unreasonable to make the default behavior not support Hadoop, since that is the primary intended audience.
        Hide
        dbowen David Bowen added a comment -

        > In this tradeoff between larger [generated] code size and smaller source code, I choose smaller source code.

        It seems to me that the amount of code added to the compiler for this feature is very small. There are a bit over 2000 lines of code in org.apache.hadoop.record.compiler. The support for "-hadoop" consists of one line to extract the command-line option, and three if-thens to omit parts which are not needed outside of Hadoop. So that makes seven lines, counting the lines containing the closing braces.

        I don't know how bad it is to have the larger generated code. On test cases it looks like the size is roughly doubled. That is perhaps much less of an issue than having to either bundle the hadoop jar file or else carefully disentangle the needed parts. I realized after I wrote my previous comment that it is not just a matter of bundling the Writable and WritableComparable interfaces. The generated code subclasses WritableComparator, so that it and the transitive closure of its dependencies would need to be bundled also.

        So, once again: this feature provides a clear benefit at a negligible cost in complexity, and it should be accepted on its merits.

        Show
        dbowen David Bowen added a comment - > In this tradeoff between larger [generated] code size and smaller source code, I choose smaller source code. It seems to me that the amount of code added to the compiler for this feature is very small. There are a bit over 2000 lines of code in org.apache.hadoop.record.compiler. The support for "-hadoop" consists of one line to extract the command-line option, and three if-thens to omit parts which are not needed outside of Hadoop. So that makes seven lines, counting the lines containing the closing braces. I don't know how bad it is to have the larger generated code. On test cases it looks like the size is roughly doubled. That is perhaps much less of an issue than having to either bundle the hadoop jar file or else carefully disentangle the needed parts. I realized after I wrote my previous comment that it is not just a matter of bundling the Writable and WritableComparable interfaces. The generated code subclasses WritableComparator, so that it and the transitive closure of its dependencies would need to be bundled also. So, once again: this feature provides a clear benefit at a negligible cost in complexity, and it should be accepted on its merits.
        Hide
        milindb Milind Bhandarkar added a comment -

        Doug,

        Thanks for your clarification. For the entire duration fo this debate I have been thinking from the record I/O users' perspective. Here is a recent example that happened with Hadoop. We were using Lucene's PriorityQueue class. And that was causing dependency on the Lucene jar. Which was deemed unacceptable, so we copied Lucene's PriorityQueue class to Hadoop, and got rid of the Lucene dependency.

        I see Hadoop record I/O's users' dependency on the entire Hadoop jar the same way. Users of Hadooop record I/O notice that they are essentialy using only two interfaces (not even implementations) from outside of record I/O. It is obvious that the dependency on Hadoop jar is something they will like to avoid (similar to what we did with dependency on the entire Lucene Jar).

        If one sees things from the record I/O users' perspectives, I think allowing users to generate code to be used outside of Hadoop context is the right thing to do.

        Maybe I misunderstood your point earlier, which to me seemed like arguing for "one artifact per project". I agreed to that. That is why proposal #10 has not been implemented in these changes, nor is it proposed anymore. Hadoop should produce only one tar file as an artifact. That is why I have not modified build process to produce a separate record I/O artifact. Instead, for record I/O's users, I have given a simple shelll command that will produce a record I/O jar file containing only record I/O classes, separate from all other Hadoop classes. Please educate me about the downside of that.

        I value consensus more than anything, and that is why I am trying to explain my position. Even the merit of changes needs to be decided by consensus.

        Show
        milindb Milind Bhandarkar added a comment - Doug, Thanks for your clarification. For the entire duration fo this debate I have been thinking from the record I/O users' perspective. Here is a recent example that happened with Hadoop. We were using Lucene's PriorityQueue class. And that was causing dependency on the Lucene jar. Which was deemed unacceptable, so we copied Lucene's PriorityQueue class to Hadoop, and got rid of the Lucene dependency. I see Hadoop record I/O's users' dependency on the entire Hadoop jar the same way. Users of Hadooop record I/O notice that they are essentialy using only two interfaces (not even implementations) from outside of record I/O. It is obvious that the dependency on Hadoop jar is something they will like to avoid (similar to what we did with dependency on the entire Lucene Jar). If one sees things from the record I/O users' perspectives, I think allowing users to generate code to be used outside of Hadoop context is the right thing to do. Maybe I misunderstood your point earlier, which to me seemed like arguing for "one artifact per project". I agreed to that. That is why proposal #10 has not been implemented in these changes, nor is it proposed anymore. Hadoop should produce only one tar file as an artifact. That is why I have not modified build process to produce a separate record I/O artifact. Instead, for record I/O's users, I have given a simple shelll command that will produce a record I/O jar file containing only record I/O classes, separate from all other Hadoop classes. Please educate me about the downside of that. I value consensus more than anything, and that is why I am trying to explain my position. Even the merit of changes needs to be decided by consensus.
        Hide
        cutting Doug Cutting added a comment -

        > the generated code is much larger

        In this tradeoff between larger code size and smaller source code, I choose smaller source code.

        > Writable imposes a single format.

        Writable and WritableComparable are for high-performance binary i/o and comparison. Records support other, non-binary representations. Applications that require the highest-performance binary format can use the Writable methods. Applications that desire format flexibility can use the other interface. If record's other binary i/o implementation is efficient enough, then it could be implemented on the base Record class, with no separate Writable or WritableComparable methods implemented per generated class. If it is as performant and more general, perhaps it should replace Writable and WritableComparable. But if, on the other hand, we can generate significantly higher-performance implementations of the Writable and WritableComparable API, then we should generate these per-class implementations, and all applications can choose.

        The description of this issue lists 11 changes. I object to one of them. I have done so since before any code was written, so this is not a surprise. I'm somewhat surprised that my concerns have been ignored. The constructive thing to do at this point is to make a patch without that one item, then perhaps pursue that separately. Apache operates by consensus and merit. I do not see the merit of that change.

        Show
        cutting Doug Cutting added a comment - > the generated code is much larger In this tradeoff between larger code size and smaller source code, I choose smaller source code. > Writable imposes a single format. Writable and WritableComparable are for high-performance binary i/o and comparison. Records support other, non-binary representations. Applications that require the highest-performance binary format can use the Writable methods. Applications that desire format flexibility can use the other interface. If record's other binary i/o implementation is efficient enough, then it could be implemented on the base Record class, with no separate Writable or WritableComparable methods implemented per generated class. If it is as performant and more general, perhaps it should replace Writable and WritableComparable. But if, on the other hand, we can generate significantly higher-performance implementations of the Writable and WritableComparable API, then we should generate these per-class implementations, and all applications can choose. The description of this issue lists 11 changes. I object to one of them. I have done so since before any code was written, so this is not a surprise. I'm somewhat surprised that my concerns have been ignored. The constructive thing to do at this point is to make a patch without that one item, then perhaps pursue that separately. Apache operates by consensus and merit. I do not see the merit of that change.
        Hide
        milindb Milind Bhandarkar added a comment -

        >Since one of the uses that is being considered is for a wire protocol, that is precisely what makes Writable and
        WritableComparable valuable.

        Unfortunately, Writable imposes a single format. Ås it is currently in Hadoop, Writable implies Binary (not-so packed Binary) format. From the users' experience, just by changing the format as a string in the record i/o interface, they can output csv or xml, other common formats (pluggable, JSON as a possible candidate), with the advantage that they are readable and therefore more debuggable. If Writable interface allowed us to specify a different serialization method, it would not be as indecipherable as it is now.

        The Record interface, which has more readable "serialize" and "deserialize" methods rather than "write" and "readFields" (i.e. binary) were to replace Writable completely, it would be good for Hadoop long-term.

        e.g. Then all the popular formats, e.g. Text, Binary, and XML will all be interpretable through a single interface.

        Show
        milindb Milind Bhandarkar added a comment - >Since one of the uses that is being considered is for a wire protocol, that is precisely what makes Writable and WritableComparable valuable. Unfortunately, Writable imposes a single format. Ås it is currently in Hadoop, Writable implies Binary (not-so packed Binary) format. From the users' experience, just by changing the format as a string in the record i/o interface, they can output csv or xml, other common formats (pluggable, JSON as a possible candidate), with the advantage that they are readable and therefore more debuggable. If Writable interface allowed us to specify a different serialization method, it would not be as indecipherable as it is now. The Record interface, which has more readable "serialize" and "deserialize" methods rather than "write" and "readFields" (i.e. binary) were to replace Writable completely, it would be good for Hadoop long-term. e.g. Then all the popular formats, e.g. Text, Binary, and XML will all be interpretable through a single interface.
        Hide
        dbowen David Bowen added a comment -

        Doug, I think that you may be missing a point here. From the point of view of people who want to use this package outside of Hadoop, there is a significant drawback to your suggestion of just including the two interfaces Writable and WritableComparable in the stand-alone jar file. Namely, the generated code is much larger, and results in more class files, because of the WritableComparator subclasses that are generated. These classes don't provide any value to non-Hadoop-users.

        The difference between the generated code in the two cases, with and without the -hadoop option, is pretty simple. There is just a bunch of stuff left out when -hadoop is not specified. Therefore, I don't think that there is a strong case to be made that this is adding excessive complexity to the rcc.

        I don't buy your argument about the danger of using the wrong command-line option and getting the wrong generated code. The command-line option will generally be in a build.xml file, and if someone gets it wrong the compiler will immediately point out the problem.

        So I think this patch offers a clear benefit to a (currently) small number of people, with essentially no downside. So I vote to commit it.

        Show
        dbowen David Bowen added a comment - Doug, I think that you may be missing a point here. From the point of view of people who want to use this package outside of Hadoop, there is a significant drawback to your suggestion of just including the two interfaces Writable and WritableComparable in the stand-alone jar file. Namely, the generated code is much larger, and results in more class files, because of the WritableComparator subclasses that are generated. These classes don't provide any value to non-Hadoop-users. The difference between the generated code in the two cases, with and without the -hadoop option, is pretty simple. There is just a bunch of stuff left out when -hadoop is not specified. Therefore, I don't think that there is a strong case to be made that this is adding excessive complexity to the rcc. I don't buy your argument about the danger of using the wrong command-line option and getting the wrong generated code. The command-line option will generally be in a build.xml file, and if someone gets it wrong the compiler will immediately point out the problem. So I think this patch offers a clear benefit to a (currently) small number of people, with essentially no downside. So I vote to commit it.
        Hide
        milindb Milind Bhandarkar added a comment -

        P.S. There is no comparison between weight of Java serialization and the serialization of Writable and Record. Writable and Record are equally lightweight, so that should not be a deciding factor between these two.

        Show
        milindb Milind Bhandarkar added a comment - P.S. There is no comparison between weight of Java serialization and the serialization of Writable and Record. Writable and Record are equally lightweight, so that should not be a deciding factor between these two.
        Hide
        jimk Jim Kellerman added a comment -

        >> Milind Bhandarkar [26/Jan/07 02:57 PM] wrote:
        >> A particular usage of Record I/O that I am considering is as a wire protocol for records. Having a common wire
        >> protocol for multiple language targets is obviously facilitated by Record I/O.

        >> Sameer Paranjpye [23/Feb/07 04:51 PM] wrote:
        >> If we don't want the compiler option, the alternatives are to stop generating the Writable and WritableComparable
        >> interfaces altogether or to make them part of org.apache.hadoop.record

        > Doug Cutting [23/Feb/07 05:33 PM] wrote:
        > Either of those would be preferable to the current situation, but my first choice is still simply to include Writable and
        > WritableComparable in any record jar file and be done with it. This is not a hard problem.

        +1

        Since one of the uses that is being considered is for a wire protocol, that is precisely what makes Writable and
        WritableComparable valuable. They are much more lightweight than Java serialization and furthermore would
        be simpler for a non-Java client or server to deal with.

        Show
        jimk Jim Kellerman added a comment - >> Milind Bhandarkar [26/Jan/07 02:57 PM] wrote: >> A particular usage of Record I/O that I am considering is as a wire protocol for records. Having a common wire >> protocol for multiple language targets is obviously facilitated by Record I/O. >> Sameer Paranjpye [23/Feb/07 04:51 PM] wrote: >> If we don't want the compiler option, the alternatives are to stop generating the Writable and WritableComparable >> interfaces altogether or to make them part of org.apache.hadoop.record > Doug Cutting [23/Feb/07 05:33 PM] wrote: > Either of those would be preferable to the current situation, but my first choice is still simply to include Writable and > WritableComparable in any record jar file and be done with it. This is not a hard problem. +1 Since one of the uses that is being considered is for a wire protocol, that is precisely what makes Writable and WritableComparable valuable. They are much more lightweight than Java serialization and furthermore would be simpler for a non-Java client or server to deal with.
        Hide
        cutting Doug Cutting added a comment -

        > The difficulty is precisely in including interfaces from another package.

        What's so hard about that?

        > It causes generated code to depend on Hadoop.

        This is Hadoop, isn't it?

        > If we don't want the compiler option, the alternatives are to stop generating the Writable and WritableComparable interfaces altogether or to make them part of org.apache.hadoop.record

        Either of those would be preferable to the current situation, but my first choice is still simply to include Writable and WritableComparable in any record jar file and be done with it. This is not a hard problem. We've already wasted way too much discussion on it. We're solving a non-problem by adding complexity.

        Show
        cutting Doug Cutting added a comment - > The difficulty is precisely in including interfaces from another package. What's so hard about that? > It causes generated code to depend on Hadoop. This is Hadoop, isn't it? > If we don't want the compiler option, the alternatives are to stop generating the Writable and WritableComparable interfaces altogether or to make them part of org.apache.hadoop.record Either of those would be preferable to the current situation, but my first choice is still simply to include Writable and WritableComparable in any record jar file and be done with it. This is not a hard problem. We've already wasted way too much discussion on it. We're solving a non-problem by adding complexity.
        Hide
        sameerp Sameer Paranjpye added a comment -

        The difficulty is precisely in including interfaces from another package. It causes generated code to depend on Hadoop. We've gotten pretty strong feedback from users (see Bens comments above) that this is undesirable. In addition, if we ever want to split this out into a separate project including Writable will make it problematic.

        If we don't want the compiler option, the alternatives are to stop generating the Writable and WritableComparable interfaces altogether or to make them part of org.apache.hadoop.record

        Show
        sameerp Sameer Paranjpye added a comment - The difficulty is precisely in including interfaces from another package. It causes generated code to depend on Hadoop. We've gotten pretty strong feedback from users (see Bens comments above) that this is undesirable. In addition, if we ever want to split this out into a separate project including Writable will make it problematic. If we don't want the compiler option, the alternatives are to stop generating the Writable and WritableComparable interfaces altogether or to make them part of org.apache.hadoop.record
        Hide
        cutting Doug Cutting added a comment -

        > For using record I/O outside of Hadoop, the only classes that are needed are from org.apache.hadoop.record.

        As I've now said many times, I don't see the significant benefit of this, while I do see a benefit to having only a single form of generated java code: it removes the opportunity for confusion (wrong classes generated) and simplifies the generation code. Those both seem to vastly outweigh the overhead of including two interfaces from another package in the jar for standalone use.

        Show
        cutting Doug Cutting added a comment - > For using record I/O outside of Hadoop, the only classes that are needed are from org.apache.hadoop.record. As I've now said many times, I don't see the significant benefit of this, while I do see a benefit to having only a single form of generated java code: it removes the opportunity for confusion (wrong classes generated) and simplifies the generation code. Those both seem to vastly outweigh the overhead of including two interfaces from another package in the jar for standalone use.
        Hide
        milindb Milind Bhandarkar added a comment -

        > Records require a runtime jar file containing many org.apache.hadoop classes.

        For using record I/O outside of Hadoop, the only classes that are needed are from org.apache.hadoop.record. Nothing else. With a single shell command, one can repackage the jar to contain only o.a.h.r.

        Only when --hadoop is specified, there is dependency on classes outside o.a.h.r.

        > However generating two different versions of classes with the same name offers significant chance for confusion.

        This a commmon patttern observed in translators. Various DHTML code generators look at the requesting brower and generate different code specialized for that browser. This is not any different.

        Show
        milindb Milind Bhandarkar added a comment - > Records require a runtime jar file containing many org.apache.hadoop classes. For using record I/O outside of Hadoop, the only classes that are needed are from org.apache.hadoop.record. Nothing else. With a single shell command, one can repackage the jar to contain only o.a.h.r. Only when --hadoop is specified, there is dependency on classes outside o.a.h.r. > However generating two different versions of classes with the same name offers significant chance for confusion. This a commmon patttern observed in translators. Various DHTML code generators look at the requesting brower and generate different code specialized for that browser. This is not any different.
        Hide
        cutting Doug Cutting added a comment -

        Mostly this looks good to me.

        > 6. The default -language="java" target would generate class code for records that would not have Hadoop dependency on WritableComparable interface, but instead would have "implements Record, Comparable". (i.e. It will not have write() and readFields() methods.) An additional option "-writable" will need to be specified on rcc commandline to generate classes that "implements Record, WritableComparable".

        I think this will cause more harm that it spares. Records require a runtime jar file containing many org.apache.hadoop classes. Adding a few additional interfaces (Writable, WritableComparable) into that runtime jar does not seem significant to me. However generating two different versions of classes with the same name offers significant chance for confusion.

        Show
        cutting Doug Cutting added a comment - Mostly this looks good to me. > 6. The default - language="java" target would generate class code for records that would not have Hadoop dependency on WritableComparable interface, but instead would have "implements Record, Comparable". (i.e. It will not have write() and readFields() methods.) An additional option " -writable" will need to be specified on rcc commandline to generate classes that "implements Record, WritableComparable". I think this will cause more harm that it spares. Records require a runtime jar file containing many org.apache.hadoop classes. Adding a few additional interfaces (Writable, WritableComparable) into that runtime jar does not seem significant to me. However generating two different versions of classes with the same name offers significant chance for confusion.
        Hide
        milindb Milind Bhandarkar added a comment -

        Making patch available. Will file a new Jira issue for remaining enhancements.

        Show
        milindb Milind Bhandarkar added a comment - Making patch available. Will file a new Jira issue for remaining enhancements.
        Hide
        milindb Milind Bhandarkar added a comment -

        This is the first of the two patches. It implements the following changes out of all proposed changes:

        1. Use java.lang.String as a native type for ustring (instead of Text.)
        2. Provide a Buffer class as a native Java type for buffer (instead of BytesWritable).
        3. Member names in generated classes should not have prefixes 'm' before their names. In the above example, the private member name would be 'value' not 'mvalue' as it is done now.
        4. Convert getters and setters to have CamelCase. e.g. in the above example the getter will be:
        public Buffer getValue();
        6. The default -language="java" target would generate class code for records that would not have Hadoop dependency on WritableComparable interface, but instead would have "implements Record, Comparable". (i.e. It will not have write() and readFields() methods.) An additional option "-writable" will need to be specified on rcc commandline to generate classes that "implements Record, WritableComparable".
        11. Make generated Java codes for maps and vectors use Java generics.

        Show
        milindb Milind Bhandarkar added a comment - This is the first of the two patches. It implements the following changes out of all proposed changes: 1. Use java.lang.String as a native type for ustring (instead of Text.) 2. Provide a Buffer class as a native Java type for buffer (instead of BytesWritable). 3. Member names in generated classes should not have prefixes 'm' before their names. In the above example, the private member name would be 'value' not 'mvalue' as it is done now. 4. Convert getters and setters to have CamelCase. e.g. in the above example the getter will be: public Buffer getValue(); 6. The default - language="java" target would generate class code for records that would not have Hadoop dependency on WritableComparable interface, but instead would have "implements Record, Comparable". (i.e. It will not have write() and readFields() methods.) An additional option " -writable" will need to be specified on rcc commandline to generate classes that "implements Record, WritableComparable". 11. Make generated Java codes for maps and vectors use Java generics.
        Hide
        cutting Doug Cutting added a comment -

        > It is not clear that Hadoop considers the record io a stable interface to be used outside of Hadoop.

        All public features in Hadoop's core are subject to the same back-compatibility goals. Someone who uses just records should be on no less stable ground than someone who uses just HDFS.

        My recommendation would be to simply include hadoop's jar in your project and to submit patches to hadoop if there are record features that you'd like to add. If a sufficiently large, independent community of record users develops, then it might someday make sense to split it into a separate project, with its own releases, committers, etc. But in the meantime I think forking your own version will prove inconvenient, and I doubt there is yet a critical mass to support a separate project.

        Show
        cutting Doug Cutting added a comment - > It is not clear that Hadoop considers the record io a stable interface to be used outside of Hadoop. All public features in Hadoop's core are subject to the same back-compatibility goals. Someone who uses just records should be on no less stable ground than someone who uses just HDFS. My recommendation would be to simply include hadoop's jar in your project and to submit patches to hadoop if there are record features that you'd like to add. If a sufficiently large, independent community of record users develops, then it might someday make sense to split it into a separate project, with its own releases, committers, etc. But in the meantime I think forking your own version will prove inconvenient, and I doubt there is yet a critical mass to support a separate project.
        Hide
        milindb Milind Bhandarkar added a comment -

        >Personally, I'm fine with it either way.

        Ben,

        The way I see it, there are two types of end-users of Hadoop Record I/O. One who develop codes using record I/O translator, and the other who use this generated code as libraries in their own applications. It is clear that your use of record I/O falls in the first category, where you implement wire-protocols using record I/O, and other people use it as a library within their own applications. While you are fine with dependency on the Hadoop jar, could you provide insight as to whether your users would be fine with it too ?

        Thanks.

        Show
        milindb Milind Bhandarkar added a comment - >Personally, I'm fine with it either way. Ben, The way I see it, there are two types of end-users of Hadoop Record I/O. One who develop codes using record I/O translator, and the other who use this generated code as libraries in their own applications. It is clear that your use of record I/O falls in the first category, where you implement wire-protocols using record I/O, and other people use it as a library within their own applications. While you are fine with dependency on the Hadoop jar, could you provide insight as to whether your users would be fine with it too ? Thanks.
        Hide
        breed Benjamin Reed added a comment -

        Using the Hadoop jar would be an extreme cognitive burden. Plunking a map/reduce/distribute filesystem jar into an otherwise simple application to use a small library from that jar (that is currently not used by the rest of Hadoop) seems to defy logic.

        We maintain a copy of the record io to:

        1) Address some of the issues that Milind's patch address.
        2) We added C support.
        3) We wanted to address the "cognitive burden" raised above by providing a simple standalone library.
        4) It is not clear that Hadoop considers the record io a stable interface to be used outside of Hadoop. Since our application is only interested in the record io, we need that to be stable.

        (It has nothing to do with the frequencies of Hadoop releases...)

        I realize this is a tangential issue since we aren't even using Hadoop in our application. We just noticed that the record io is a small portion of hadoop that is very useful outside of Hadoop as well. It would be nice to make that available for other applications.

        Personally, I'm fine with it either way. While it would be convenient to have a standard record io library, I don't want to push you in a direction you are not interested in.

        Show
        breed Benjamin Reed added a comment - Using the Hadoop jar would be an extreme cognitive burden. Plunking a map/reduce/distribute filesystem jar into an otherwise simple application to use a small library from that jar (that is currently not used by the rest of Hadoop) seems to defy logic. We maintain a copy of the record io to: 1) Address some of the issues that Milind's patch address. 2) We added C support. 3) We wanted to address the "cognitive burden" raised above by providing a simple standalone library. 4) It is not clear that Hadoop considers the record io a stable interface to be used outside of Hadoop. Since our application is only interested in the record io, we need that to be stable. (It has nothing to do with the frequencies of Hadoop releases...) I realize this is a tangential issue since we aren't even using Hadoop in our application. We just noticed that the record io is a small portion of hadoop that is very useful outside of Hadoop as well. It would be nice to make that available for other applications. Personally, I'm fine with it either way. While it would be convenient to have a standard record io library, I don't want to push you in a direction you are not interested in.
        Hide
        cutting Doug Cutting added a comment -

        > I will admit to being one of the motivators of this bug.

        So how much of a burden would it be to you, cognitive or otherwise, to have Writable, WritableComparable (and perhaps even a few other classes) in the jar that you use? How much of a burden is it today to simply use the Hadoop jar?

        > Currently we maintain a separate customized copy of the record io code [ ... ]

        Why, pray tell? Are Hadoop's releases too infrequent?

        Show
        cutting Doug Cutting added a comment - > I will admit to being one of the motivators of this bug. So how much of a burden would it be to you, cognitive or otherwise, to have Writable, WritableComparable (and perhaps even a few other classes) in the jar that you use? How much of a burden is it today to simply use the Hadoop jar? > Currently we maintain a separate customized copy of the record io code [ ... ] Why, pray tell? Are Hadoop's releases too infrequent?
        Hide
        breed Benjamin Reed added a comment -

        I will admit to being one of the motivators of this bug. We have found record io extremely useful in applications that have nothing to do with Hadoop. (We use it in a message passing layer and to maintain a "database" log as well as serialize the contents of the "database".) Currently we maintain a separate customized copy of the record io code, but it seems very counter productive if everyone who wants to use record io does this.

        I can see this easily becoming a very useful part of Hadoop in its own right (much as Hadoop is a useful part of Nutch in its own right) since generic cross language cross encoding record layers seem hard to come by for some reason. As it is, developers are going to be very hesitant to use record io outside of Hadoop because Hadoop developers could change things (such as introduce more dependencies on other parts of Hadoop).

        It is very simple to use the --writable flag Milind proposes or even make record io stand alone and have a simple RecordWritable class that Hadoop uses to interface to record io.

        Implementing these changes and making it possible to build just a recordio.jar will allow developers to realize that they can use record io in their project rather than reinvent the wheel.

        Show
        breed Benjamin Reed added a comment - I will admit to being one of the motivators of this bug. We have found record io extremely useful in applications that have nothing to do with Hadoop. (We use it in a message passing layer and to maintain a "database" log as well as serialize the contents of the "database".) Currently we maintain a separate customized copy of the record io code, but it seems very counter productive if everyone who wants to use record io does this. I can see this easily becoming a very useful part of Hadoop in its own right (much as Hadoop is a useful part of Nutch in its own right) since generic cross language cross encoding record layers seem hard to come by for some reason. As it is, developers are going to be very hesitant to use record io outside of Hadoop because Hadoop developers could change things (such as introduce more dependencies on other parts of Hadoop). It is very simple to use the --writable flag Milind proposes or even make record io stand alone and have a simple RecordWritable class that Hadoop uses to interface to record io. Implementing these changes and making it possible to build just a recordio.jar will allow developers to realize that they can use record io in their project rather than reinvent the wheel.
        Hide
        runping Runping Qi added a comment -

        The generated classes should have a static method returning the IDL string from which the classes are generated.

        Also, it may be a good time to start to think how to deal with the issues related to IDL versioning/evolution. Of course, this issue is a big issue, and certainly deserves a separate JIRA

        Show
        runping Runping Qi added a comment - The generated classes should have a static method returning the IDL string from which the classes are generated. Also, it may be a good time to start to think how to deal with the issues related to IDL versioning/evolution. Of course, this issue is a big issue, and certainly deserves a separate JIRA
        Hide
        cutting Doug Cutting added a comment -

        > I am proposing that there should be a way (i.e. ant target) to produce a stand-alone record I/O Jar [ ...]

        Who would build this target? Would it be included in releases as a separate jar? I'm okay with multiple jars in releases, but multiple tar files is a harder sell.

        > I do not believe the problem lies with the size of Hadoop Jar file, but with a perception among Record I/O users that somehow their application has a dependency on Hadoop.

        That perception would be correct. So long as records are a part of Hadoop, they're a part of Hadoop and will be released with Hadoop, etc. Hadoop itself is layered, and I'm not opposed to structuring the jars and javadoc to reflect that layering, e.g., by adding more groups, so that folks can only rely on the layers they desire.

        Show
        cutting Doug Cutting added a comment - > I am proposing that there should be a way (i.e. ant target) to produce a stand-alone record I/O Jar [ ...] Who would build this target? Would it be included in releases as a separate jar? I'm okay with multiple jars in releases, but multiple tar files is a harder sell. > I do not believe the problem lies with the size of Hadoop Jar file, but with a perception among Record I/O users that somehow their application has a dependency on Hadoop. That perception would be correct. So long as records are a part of Hadoop, they're a part of Hadoop and will be released with Hadoop, etc. Hadoop itself is layered, and I'm not opposed to structuring the jars and javadoc to reflect that layering, e.g., by adding more groups, so that folks can only rely on the layers they desire.
        Hide
        milindb Milind Bhandarkar added a comment -

        Comments Inline:

        >I don't understand the motivation for this. The Writable and WritableComparable >interfaces are small and standalone. Having record-defined classes that cannot >be easily used with the rest of Hadoop seems like it could cause more confusion >than including these.

        The use of Writable in the context of Record I/O is for binary serialization format only. However, the record I/O supports serialization to two additional formats CSV and XML. Since there is no way to incorporate these into Writable, one needs an additional record-i/o defined interface, which is Record. If Record can handle all three serialization formats anyway, Writable support is needed only if one wants to use SequenceFile or IPC in Hadoop. Outside of Hadoop, these are not useful.

        >Are you proposing a separate release cycle or just separate release artifacts? >If a separate release cycle, then this should be placed in a separate >sub-project. Each sub-project requires a diverse developer community, which I'm >not sure that record alone has.

        I am not proposing any of these. Single Hadoop build should have a single artifact, which is the Hadoop-core.jar. I am proposing that there should be a way (i.e. ant target) to produce a stand-alone record I/O Jar that does not include anything from outside org.apache.hadoop.record.*. This target is not invoked by default at build time.

        >I suspect that many users of records might find SequenceFile also useful.

        A particular usage of Record I/O that I am considering is as a wire protocol for records. Having a common wire protocol for multiple language targets is obviously facilitated by Record I/O.

        I do not believe the problem lies with the size of Hadoop Jar file, but with a perception among Record I/O users that somehow their application has a dependency on Hadoop.

        Show
        milindb Milind Bhandarkar added a comment - Comments Inline: >I don't understand the motivation for this. The Writable and WritableComparable >interfaces are small and standalone. Having record-defined classes that cannot >be easily used with the rest of Hadoop seems like it could cause more confusion >than including these. The use of Writable in the context of Record I/O is for binary serialization format only. However, the record I/O supports serialization to two additional formats CSV and XML. Since there is no way to incorporate these into Writable, one needs an additional record-i/o defined interface, which is Record. If Record can handle all three serialization formats anyway, Writable support is needed only if one wants to use SequenceFile or IPC in Hadoop. Outside of Hadoop, these are not useful. >Are you proposing a separate release cycle or just separate release artifacts? >If a separate release cycle, then this should be placed in a separate >sub-project. Each sub-project requires a diverse developer community, which I'm >not sure that record alone has. I am not proposing any of these. Single Hadoop build should have a single artifact, which is the Hadoop-core.jar. I am proposing that there should be a way (i.e. ant target) to produce a stand-alone record I/O Jar that does not include anything from outside org.apache.hadoop.record.*. This target is not invoked by default at build time. >I suspect that many users of records might find SequenceFile also useful. A particular usage of Record I/O that I am considering is as a wire protocol for records. Having a common wire protocol for multiple language targets is obviously facilitated by Record I/O. I do not believe the problem lies with the size of Hadoop Jar file, but with a perception among Record I/O users that somehow their application has a dependency on Hadoop.
        Hide
        cutting Doug Cutting added a comment -

        > generate class code for records that would not have Hadoop dependency on WritableComparable interface

        I don't understand the motivation for this. The Writable and WritableComparable interfaces are small and standalone. Having record-defined classes that cannot be easily used with the rest of Hadoop seems like it could cause more confusion than including these.

        > As part of Hadoop build process, produce a tar bundle for Record I/O alone.

        Are you proposing a separate release cycle or just separate release artifacts? If a separate release cycle, then this should be placed in a separate sub-project. Each sub-project requires a diverse developer community, which I'm not sure that record alone has.

        As a counter-proposal, we might consider splitting hadoop's jar into a layered set of jars: hadoop-common (util, config, io, record), hadoop-fs and hadoop-mapred, and perhaps other, all still bundled into a single release artifact. The javadoc could be structured into corresponding sections. I suspect that many users of records might find SequenceFile also useful.

        More generally, what is the goal? Is the hadoop jar file, at 1MB, unwieldy? Or is there some other problem? Creating more release artifacts could create more confusion (will folks know what to download?). Multiple jars are less confusing, since most developers don't deal explicitly with jar files, but just use the set included in the download. Should metrics also be released separately? How about fs w/o mapred? There're just too many possible ways of bundling things. I think the general mapping is one project, one release artifact, unless there's a very strong reason otherwise, which I've not yet seen.

        Show
        cutting Doug Cutting added a comment - > generate class code for records that would not have Hadoop dependency on WritableComparable interface I don't understand the motivation for this. The Writable and WritableComparable interfaces are small and standalone. Having record-defined classes that cannot be easily used with the rest of Hadoop seems like it could cause more confusion than including these. > As part of Hadoop build process, produce a tar bundle for Record I/O alone. Are you proposing a separate release cycle or just separate release artifacts? If a separate release cycle, then this should be placed in a separate sub-project. Each sub-project requires a diverse developer community, which I'm not sure that record alone has. As a counter-proposal, we might consider splitting hadoop's jar into a layered set of jars: hadoop-common (util, config, io, record), hadoop-fs and hadoop-mapred, and perhaps other, all still bundled into a single release artifact. The javadoc could be structured into corresponding sections. I suspect that many users of records might find SequenceFile also useful. More generally, what is the goal? Is the hadoop jar file, at 1MB, unwieldy? Or is there some other problem? Creating more release artifacts could create more confusion (will folks know what to download?). Multiple jars are less confusing, since most developers don't deal explicitly with jar files, but just use the set included in the download. Should metrics also be released separately? How about fs w/o mapred? There're just too many possible ways of bundling things. I think the general mapping is one project, one release artifact, unless there's a very strong reason otherwise, which I've not yet seen.

          People

          • Assignee:
            milindb Milind Bhandarkar
            Reporter:
            milindb Milind Bhandarkar
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development