Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      There is a need to add versioning support to Record I/O. Users frequently update DDL files, usually by adding/removing fields, but do not want to change the name of the data structure. They would like older & newer deserializers to read as much data as possible. For example, suppose Record I/O is used to serialize/deserialize log records, each of which contains a message and a timestamp. An initial data definition could be as follows:

      class MyLogRecord {
        ustring msg;
        long timestamp;
      }
      

      Record I/O creates a class, MyLogRecord, which represents a log record and can serialize/deserialize itself. Now, suppose newer log records additionally contain a severity level. A user would want to update the definition for a log record but use the same class name. The new definition would be:

      class MyLogRecord {
        ustring msg;
        long timestamp;
        int severity;
      }
      

      Users would want a new deserializer to read old log records (and perhaps use a default value for the severity field), and an old deserializer to read newer log records (and skip the severity field).

      This requires some concept of versioning in Record I/O, or rather, the additional ability to read/write type information of a record. The following is a proposal to do this.

      Every Record I/O Record will have type information which represents how the record is structured (what fields it has, what types, etc.). This type information, represented by the class RecordTypeInfo, is itself serializable/deserializable. Every Record supports a method getRecordTypeInfo(), which returns a RecordTypeInfo object. Users are expected to serialize this type information (by calling RecordTypeInfo.serialize()) in an appropriate fashion (in a separate file, for example, or at the beginning of a file). Using the same DDL as above, here's how we could serialize log records:

      FileOutputStream fOut = new FileOutputStream("data.log");
      CsvRecordOutput csvOut = new CsvRecordOutput(fOut);
      ...
      // get the type information for MyLogRecord
      RecordTypeInfo typeInfo = MyLogRecord.getRecordTypeInfo();
      // ask it to write itself out
      typeInfo.serialize(csvOut);
      ...
      // now, serialize a bunch of records
      while (...) {
         MyLogRecord log = new MyLogRecord();
         // fill up the MyLogRecord object
        ...
        // serialize
        log.serialize(csvOut);
      }
      

      In this example, the type information of a Record is serialized first, followed by the contents of the various records, all into the same file.

      Every Record also supports a method that allows a user to set a filter for deserializing. A method setRTIFilter() takes a RecordTypeInfo object as a parameter. This filter represents the type information of the data that is being deserialized. When deserializing, the Record uses this filter (if one is set) to figure out what to read. Continuing with our example, here's how we could deserialize records:

      FileInputStream fIn = new FileInputStream("data.log");
      // we know the record was written in CSV format
      CsvRecordInput csvIn = new CsvRecordInput(fIn);
      ...
      // we know the type info is written in the beginning. read it. 
      RecordTypeInfo typeInfoFilter = new RecordTypeInfo();
      // deserialize it
      typeInfoFilter.deserialize(csvIn);
      // let MyLogRecord know what to expect
      MyLogRecord.setRTIFilter(typeInfoFilter);
      // deserialize each record
      while (there is data in file) {
        MyLogRecord log = new MyLogRecord();
        log.read(csvIn);
        ...
      }
      

      The filter is optional. If not provided, the deserializer expects data to be in the same format as it would serialize. (Note that a filter can also be provided for serializing, forcing the serializer to write information in the format of the filter, but there is no use case for this functionality yet).

      What goes in the type information for a record? The type information for each field in a Record is made up of:
      1. a unique field ID, which is the field name.
      2. a type ID, which denotes the type of the field (int, string, map, etc).

      The type information for a composite type contains type information for each of its fields. This approach is somewhat similar to the one taken by Facebook's Thrift, as well as by Google's Sawzall. The main difference is that we use field names as the field ID, whereas Thrift and Sawzall use user-defined field numbers. While field names take more space, they have the big advantage that no changes are needed to support existing DDLs.
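
      As an illustration only (the class and field names below are ours, not the actual Record I/O API), the type information for a record can be pictured as a list of {field name, type ID} entries, with composite types nesting a list of their own:

      // Hypothetical sketch of per-field type information as {field name, type ID}
      // pairs; composite fields carry a nested list. Names here are illustrative.
      import java.util.Arrays;
      import java.util.Collections;
      import java.util.List;

      class FieldInfo {
        final String name;              // the unique field ID is the field name itself
        final String typeId;            // "int", "ustring", "struct", "map", ...
        final List<FieldInfo> children; // non-empty only for composite types

        FieldInfo(String name, String typeId, List<FieldInfo> children) {
          this.name = name;
          this.typeId = typeId;
          this.children = children;
        }

        public static void main(String[] args) {
          // type information for the newer MyLogRecord definition above
          List<FieldInfo> myLogRecord = Arrays.asList(
              new FieldInfo("msg", "ustring", Collections.emptyList()),
              new FieldInfo("timestamp", "long", Collections.emptyList()),
              new FieldInfo("severity", "int", Collections.emptyList()));
          System.out.println(myLogRecord.size() + " fields");
        }
      }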

      When deserializing, a Record looks at the filter and compares it with its own set of {field name, field type} tuples. If there is a field in the data that it doesn't know about, it skips it (it knows how many bytes to skip, based on the filter). If the deserialized data does not contain some field values, the Record gives them default values. Additionally, we could allow users to optionally specify default values in the DDL. The location of a field in a structure does not matter, which lets us support reordering of fields. Note that there is no change required to the DDL syntax, and very minimal changes to client code (clients just need to read/write type information, in addition to record data).
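
      As a rough, self-contained sketch of that matching logic (hypothetical names; maps of {field name -> type} stand in for the writer's filter and the reader's own type information, and strings stand in for actual stream reads):

      import java.util.LinkedHashMap;
      import java.util.Map;

      // Minimal sketch of filter-driven deserialization: read fields the reader knows,
      // skip unknown ones, and give defaults to fields missing from the data.
      class FilteredReadSketch {
        static Map<String, String> read(Map<String, String> filter, Map<String, String> known) {
          Map<String, String> result = new LinkedHashMap<>();
          for (Map.Entry<String, String> field : filter.entrySet()) {
            String name = field.getKey(), type = field.getValue();
            if (type.equals(known.get(name))) {
              result.put(name, "read " + type + " from stream"); // a field the reader understands
            } else {
              // unknown field: its type tells us how many bytes to skip in the stream
              System.out.println("skipping '" + name + "' of type " + type);
            }
          }
          // fields the reader expects but the data does not contain get default values
          for (Map.Entry<String, String> field : known.entrySet()) {
            result.putIfAbsent(field.getKey(), "default " + field.getValue());
          }
          return result;
        }

        public static void main(String[] args) {
          Map<String, String> writer = new LinkedHashMap<>(); // newer MyLogRecord
          writer.put("msg", "ustring");
          writer.put("timestamp", "long");
          writer.put("severity", "int");
          Map<String, String> reader = new LinkedHashMap<>(); // older MyLogRecord
          reader.put("msg", "ustring");
          reader.put("timestamp", "long");
          System.out.println(read(writer, reader)); // the old reader skips 'severity'
        }
      }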

      This scheme gives us an additional powerful feature: we can build a generic serializer/deserializer, so that users can read all kinds of data without having access to the original DDL or the original stubs. As long as you know where the type information of a record is serialized, you can read any such data. One can also build a simple UI that displays the structure of data serialized in any generic file. This is very useful for handling data across lots of versions.

      1. 1883_patch01
        74 kB
        Vivek Ratan
      2. example.zip
        2 kB
        Vivek Ratan
      3. 1883_patch02
        106 kB
        Vivek Ratan
      4. example.zip
        5 kB
        Vivek Ratan
      5. 1883_patch03
        142 kB
        Vivek Ratan
      6. 1883_patch04
        145 kB
        Vivek Ratan

        Activity

        Nilay Vaish added a comment -

        I was thinking about the type info required for composite types. As per the Sawzall paper, the field ID is required for identifying the field in the binary representation. I was wondering how this will work if we use field names as field IDs. Can you provide your thoughts on this?

        Vivek Ratan added a comment -

        Not sure I understand your question fully. As long as a field name or a field number uniquely identifies a field within the enclosing record, we can serialize/deserialize simple and composite types. Suppose we have the following DDL:

        class small_struct {
          int a;
          boolean b;
        }
        
        class big_struct {
          int a;
          small_struct s;
        }
        

        Within a class, I cannot have more than one field with the same {type, name} tuple. So the {type, name} tuple is the unique identifier of the field. The two fields named 'a' are in different classes, so we're OK. Sawzall/Thrift leave it to the user to assign unique field numbers to each field, and use that as the unique field identifier. This approach uses fewer bytes, but requires the user's support. For Jute, the former approach seems better.

        Is this what you were asking?

        Nilay Vaish added a comment -

        Let me break my question into parts. That should help.

        1. Why do Sawzall/Thrift use field numbers when each field can be recognised by the ID? Do they not keep the field IDs? Having both field IDs and field names should require more space than just having the field IDs.
        2. Within a class, doesn't a name by itself identify a field? AFAIK the type is not required. The class given below should be invalid since it has two fields of the same name.

        class a{
          int s;
          char s;
        }
        

        Thanks
        Nilay

        Vivek Ratan added a comment -

        Let me answer #2 first. Remember, with versioning, we're looking to figure out which fields have changed. You need both a field name and its type to uniquely identify a field. Suppose we have:

        class a{
          int s;
          char c;
        }
        

        Now, suppose I replace the second field:

        class a{
          int s;
          long c;
        }
        

        A deserializer generated from the new class, when reading data serialized by the old class, needs to understand that the second field in the serialized data needs to be skipped, i.e., 'char c' is different from 'long c'. Similarly, if I changed just the name of a field, and not its type, it would be a different field (this one is more debatable, but it's safer to assume that it's a different field). So the type information needs to contain both the field name and field type, which a deserializer matches with the field name and type of its own fields to see what to read and what to skip. [BTW, your example is true in that the class is invalid because both fields have the same name, but I think it's the wrong example for what we're discussing].

        Now, alternatively, I could have the user assign a unique field number to each field (which Thrift and Sawzall do). I could have something like this:

        class a{
          1: int s;
          2: char c;
        }
        

        If a user changes a field, they decide whether the field number changes. They SHOULD change the field number if the field type changes (and maybe if only the name changes). Any deserializer will depend on the user-generated field numbers to decide what fields to read and what to skip. So, for example, you could change the class as follows:

        class a{
          1:int s;
          3:long c;
        }
        

        Here, the changed field has been given a new field number.

        Either approach works. The latter (using field numbers) is more space-efficient, as we just need to keep track of field numbers (and types, in order to know how many bytes to skip). But it requires manual user support and is more prone to errors. We're recommending the former, as the space inefficiency is minuscule (since you presumably store the type information once, for hundreds or thousands of records), and you don't have to change DDLs and depend on users generating field numbers.

        Nilay Vaish added a comment -

        That explains it all. Have you started working on this?

        Thanks
        Nilay

        Jeff Hammerbacher added a comment -

        i'm no expert here, but this approach appears to be sacrificing the long-term benefit of ever using jute for high-frequency rpc in favor of the short-term benefit of not breaking existing ddls (and also encourages developers to come up with short names for fields, never a good thing). your approach does make for much nicer ddl files, however. it should be noted that thrift works just fine without the numbering of fields: it generates field identifiers counting down from -1 for unnumbered fields.

        this seems like a drawback if one ever wanted to replace hadoop's ipc serialization (about which i know nothing) with jute, though i'm having trouble concocting scenarios of high-frequency, low payload rpc in the hadoop system.

        Vivek Ratan added a comment -

        Jeff, you have a valid point, in that if we use Record I/O versioning for Hadoop RPC (and it certainly is possible), the cost of using field names can become prohibitive. If you think about it, the requirements for RPC are quite different from those for storage. In the former, the serialized type info is comparable in size to the serialized parameters, whereas in the latter, the type information is minuscule compared to the large amounts of serialized data. It's unlikely that one solution will satisfy both, so here's a long-term view of our proposed solution: for storage, using field names seems appropriate. If we use Record I/O for RPC, we introduce optional field numbers, a la Thrift/Sawzall. If field numbers are not provided, one can use field names (or have the compiler generate field numbers). We just need to make sure that the current design (and its implementation) does not prevent us from adding field numbers in the future, and I don't think it does. It's also worth seeing what the cost of using field names in RPCs will actually be. Perhaps there are other, bigger bottlenecks that will show up. Regardless, we can relatively easily add support for field numbers (or some such solution) in the future with minimal fuss.

        Regarding your other points:

        • Developers do not need to use short names when using Record I/O for storage. The savings will be negligible.
        • The Thrift compiler does generate default negative numbers for fields. There is a problem with that approach, however, because it cannot remember these numbers across different versions of the DDL. So suppose I have a DDL as follows:
          class a {
            1: int s;
            char c;
          }
          

          The Thrift compiler will assign a value of -1 (or a similar number) to 'c'. Now suppose I change the DDL to:

          class a {
            1: int s;
            long l;
            char c;
          }
          

          we have a problem, because the compiler does not remember that it assigned -1 to 'c' (unless it explicitly changes the DDL, which it doesn't). It will go ahead and assign -1 to 'l' and -2 to 'c', which will cause a new deserializer to not read the field 'c' from the old serialized data. Things will still not crash, but the solution is not optimal.

        In summary, I think that once/if we get to RPC, and if we see a real performance issue, we should look at adding field numbers or something similar. And I think we can do it in a way that doesn't break backwards compatibility, and lets you pick the optimal solution, depending on whether you're using versioning for storage or RPC. But the details, and the implementation, can probably wait till we get to that point.

        Benjamin Reed added a comment -

        Using DDLs with RPC shouldn't be inefficient. The schemas of RPCs don't change. So you get the DDL of the RPC when you first connect and then remember it for subsequent invocations. (Kind of like what happens when you use the DDL for storage.)

        If you are only using the RPC once, then you do pay a price for the DDL transfer, but after a couple of calls it becomes much more efficient than passing a compressed encoding by mapping names to numbers with each invocation.

        Even supporting changes to the RPC DDL while running isn't hard. (For example, the RPC server goes away and comes back with a new DDL.) You simply have a unique version number (the server can just generate a counter) that represents an instance of the DDL or RPC. The client sends that number with each request. If a client sends an invalid number, the server sends back a new DDL with a new number. (This is just simple caching.)

        Note the above implies a client-conforms-to-server scheme. You can do the reverse if you want a server-conforms-to-client scheme. I prefer the former.
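
        A minimal sketch of that caching idea (names and payloads invented; this is not part of the patch):

        import java.util.concurrent.atomic.AtomicLong;

        // The server tags each DDL instance with a counter; a client sends the number it
        // has cached, and the server re-sends the DDL only when that number is stale.
        class RpcSchemaCache {
          private final AtomicLong version = new AtomicLong(1);
          private volatile String currentDdl =
              "class MyLogRecord { ustring msg; long timestamp; }";

          // called on the server when it comes back with a new interface definition
          void updateDdl(String newDdl) {
            currentDdl = newDdl;
            version.incrementAndGet();
          }

          // server-side check for each request: null means "your cached DDL is current"
          String[] handleRequest(long clientVersion) {
            long current = version.get();
            if (clientVersion == current) {
              return null; // client is up to date; just process the call
            }
            return new String[] { Long.toString(current), currentDdl }; // new number + DDL
          }
        }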

        Sameer Paranjpye added a comment -

        > i'm no expert here, but this approach appears to be sacrificing the long-term benefit of ever using jute for high-frequency rpc in favor of the short-term benefit of not breaking
        > existing ddls

        Not really. The field names are not part of the serialized record. Field names are only ever present in the serialized DDL. In the RPC use case, the client and server would do an initial handshake where the client discovers the server's interface. All subsequent calls would involve record serialization only. It might actually help in high-frequency RPCs since type-ids/names don't occur over and over again.

        Jeff Hammerbacher added a comment -

        hmm, talked this over with a thrift dev. in our services at facebook, most follow the connect -> make call -> disconnect model. maintaining persistent connections, a negotiation step, and caching of ddl all overly complicate matters, which is deadly for high volume and high performance service deployment. all of these things may make sense in a system for storage and transfer of massive data streams, however.

        eric baldeschwieler added a comment -

        Yupp. So we should all pause and discuss before integrating recordio into hadoop RPC.
        There are solutions to your concern, but they all add complexity.
        I'm not sure it makes sense to evolve recordio in that direction.

        Vivek Ratan added a comment -

        I have attached the first cut of the code changes (1883_patch01). The versioning functionality currently works only for Java-generated classes. I'd appreciate any comments. Some details of what's been done:

        • A new folder org/apache/hadoop/record/platform contains all the new classes that the compiler needs for versioning support.
        • Every generated class has a static method getTypeInfo() which returns an object of type RecordTypeInfo. This object is itself a Record and can be read/written.
        • Every generated class also has a method, setTypeFilter() wherein a user passes in a RecordTypeInfo object and instructs the class to deserialize data based on this filter.
        • Internally, a RecordTypeInfo object is a collection of TypeInfo objects, each representing a field in the record. A TypeInfo object is made of a field name and a TypeID object, which represents the type of the field. TypeID represents a basic type, whereas StructTypeID, VectorTypeID, and MapTypeID (which are subclasses) represent the composite types.
        • When deserializing, an object checks whether it has a filter set. If it doesn't, it deserializes as before. If a filter is set, the filter is compared against the object's type information to decide what to read, which member variables to assign data to, and what to skip. To avoid comparing field names and types each time, we compare the two RecordTypeInfo objects once and create a vector of ints that indicates which field to assign data to.
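
        To make that concrete, here is an illustrative sketch (made-up names; field descriptors simplified to "name:type" strings) of the one-time index computation: for each field in the filter, record the index of the matching field in this record's own layout, or -1 if the field should be skipped.

        import java.util.Arrays;
        import java.util.List;

        // Hypothetical sketch of the vector-of-ints idea described above.
        class FieldIndexSketch {
          static int[] computeIndexMap(List<String> filterFields, List<String> ownFields) {
            int[] indexMap = new int[filterFields.size()];
            for (int i = 0; i < filterFields.size(); i++) {
              indexMap[i] = ownFields.indexOf(filterFields.get(i)); // -1 => skip this field
            }
            return indexMap;
          }

          public static void main(String[] args) {
            // the writer had (msg, timestamp, severity); this older record knows only two fields
            int[] map = computeIndexMap(
                Arrays.asList("msg:ustring", "timestamp:long", "severity:int"),
                Arrays.asList("msg:ustring", "timestamp:long"));
            System.out.println(Arrays.toString(map)); // prints [0, 1, -1]
          }
        }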

        I'm happy to provide class diagrams of changes, if things are not clear.

        I have also attached a zip file (example.zip) which contains a DDL and test code to read and write records with version information.

        Milind Bhandarkar added a comment -

        Vivek,

        Can you please generate the patch from within the top hadoop dir? I cannot apply this patch to trunk, because the filenames all start with C:\eclipse... etc.

        Vivek Ratan added a comment -

        Sorry, I've been inexplicably forgetting to remove the full path from the patch files. I have attached a new patch (1883_patch02) which contains support for both Java and C++. It's pretty complete. I have also attached a new zip file (example.zip) which contains Java and C++ files that read and write records using the new code. They're pretty good for testing.

        Milind Bhandarkar added a comment -

        Overall:

        Place apache license at the top of each newly added file.
        All public methods, and preferably all methods should have javadoc comments.
        Please integrate examples (converting them to unit tests) into the patch.
        Also please run the record I/O benchmark with and without RTI filter, and please report results.

        package: org.apache.hadoop.record.platform:

        TypeID.java:

        TypeID should extend Record, since it needs to be serialized and deserialized. Currently it implements only the write method. The write method of Record serializes to binary format only, and not any generic serializers, whereas TypeID's write method can serialize to any RecordOut. This applies to all TypeIDs that extend TypeID.

        Please consider making RIOType a private static inner class.

        StructTypeID.java:

        Class javadoc should say typeID for struct type.
        Also, the serialization format comment does not include the struct name.

        VectorTypeID.java:

        MapTypeID.java:

        class javadoc comment says vector.
        Serialization format comment does not mention keyType, valueType.

        TypeInfo.java:

        TypeInfo should extend Record. (See comment on TypeID above.)

        RecordTypeInfo.java:

        Should extend TypeInfo, once TypeInfo extends Record.
        That would move the hairy serialization and deserialization code (the genericRead* methods) to TypeInfo, and TypeID where they belong.

        Utils.java:

        The skip method really belongs to TypeID. So, we can get rid of Utils.java.

        package: org.apache.hadoop.record.compiler:

        JType.java:

        Methods genRtiFieldCondition and genRtiNestedFieldCondition are commented out. Please remove them.

        JCompType.java:

        Consts.java:

        Please consider making these constants into static methods that take a string as parameter, and return another string (prepended by that constant).

        CppGenerator.java:

        JBuffer.java:
        JBoolean.java:
        JByte.java:
        JDouble.java:
        JFloat.java
        JInt.java:
        JLong.java:
        JString.java:

        All types have an additional inner class now that extends from generic CppType, and adds one method getTypeIDObjectString() that returns the TypeID string. These classes can be eliminated by providing this method in the base class CppType. CppType's constructor should be modified to take as parameter the TypeIDString. Is there any other functionality that these inner type-specific classes are providing ?

        JVector.java:
        JMap.java:
        JType.java:

        Methods genRtiFieldCondition and genRtiNestedFieldCondition are commented out. Please remove them.

        Method getID() should be modified to prepend RIO_PREFIX, rather than prepending RIO_PREFIX everywhere to getId()'s return value.

        genCompareBytes() and genSlurpBytes() code is not modified to use RIO_PREFIXES.

        JRecord.java:

        Needs cleanup. Lots of commented out code.
        Generate another private method called deserializeWithFilter. That will make generated code more readable.

        In C++ generated code, deserializeWithoutFilter need not be virtual.

        Vivek Ratan added a comment -

        Thanks for the feedback, Milind. This was the first patch, a prototype, and is weak on documentation. That will be fixed in the next patch. I will also be adding test cases, and removing code that has been commented out.

        My feedback's below. I've spent a fair bit of time thinking about the class design for the type information functionality, both in terms of simplicity/extensibility, and also for performance, so that we don't create extra objects or spend too much time dealing with type information. Therefore I'm addressing your concerns individually.

        >> TypeID should extend Record, since it needs to be serialized and deserialized.
        TypeID is not a Record. Yes, it can write itself to a stream, but that's the only similarity. A Record implements WritableComparable and Cloneable, neither of which is needed for TypeID. TypeID, and its children, are very different from Record. They represent the ID of a Jute type.

        >> Please consider making RIOType a private static inner class.
        RIOType is referred to outside of TypeID (in RecordTypeInfo, for example). It cannot be private.

        >> TypeInfo should extend Record. (See comment on TypeID above.)
        As I mentioned for TypeID, TypeInfo is also not a Record and shouldn't extend it.

        >> RecordTypeInfo.java:Should extend TypeInfo, once TypeInfo extends Record.
        >>That would move the hairy serialization and deserialization code (the genericRead* methods) to TypeInfo, and TypeID where they belong.
        Again, RecordTypeInfo is actually a Record, as you can serialize/deserialize and compare it. It, however, is not a TypeInfo. RecordTypeInfo represents the type information for a record. It is the class that a Jute user sees. TypeInfo and TypeID are internal classes used by RecordTypeInfo to represent type information (though they do show up in the generated code). Think of it this way: type information for a record, represented by RecordTypeInfo, is a collection of type information for each of its fields. Each field's type information, represented by a TypeInfo object, is composed of the field name (a string) and a field type ID (a TypeID object). So TypeInfo and TypeID are helper classes used by RecordTypeInfo.
        The genericRead* methods belong to RecordTypeInfo as they are used during deserialization. Not sure why you think they're 'hairy'. They're required because there is a hierarchy of TypeID classes and RecordTypeInfo does not know which to create unless it reads the type of a field during deserialization. And having read the type, it can itself create the right TypeID object - this is clean, simple, and minimizes performance impact. In fact, this same code will be used for generic deserialization. Any particular reasons you think the code is hairy or belongs elsewhere?

        >> The skip method really belongs to TypeID. So, we can get rid of Utils.java.
        Not really. skip() is a general purpose method and has nothing to do with a TypeID. It's used during deserialization by RecordTypeInfo. I placed it in a separate class because it really is a utility method, and it can be used in other places (generic deserialization, for example). I realize that Utils only has this one method for now, but that's where it belongs.

        >> Please consider making these constants into static methods that take a string as parameter, and return another string (prepended by that constant).
        Why? These really are constants. Turning them into method calls will cost us a method invocation each time they're used, which seems completely unnecessary. What do I gain by using static methods? What I've done is a pretty standard way of defining constants in Java.

        >> All types have an additional inner class now that extends from generic CppType, and adds one method getTypeIDObjectString() that returns the TypeID string. These classes can be eliminated by providing this method in the base class CppType. CppType's constructor should be modified to take as parameter the TypeIDString. Is there any other functionality that these inner type-specific classes are providing ?

        I've defined inner classes for the CPP types for the following reasons:

        • they mirror more closely the inner classes for the Java types
        • even though they really have only one method for now, this is the right way to do OO. Rather than pass a string to the constructor, each class overrides the getTypeIDObjectString() method, just as with the Java classes. Makes it much easier to change - tomorrow, if the method does other stuff besides just returning a string (which it actually did, before I optimized the code out), then this design makes it so much easier. I don't like passing subclass-specific stuff to the constructor, just to save on a method. It's not good design. I realize we have additional classes, but so what? There's no performance impact, and the code, IMO, is much cleaner, symmetric, and more OO.
        • it's quite possible that we will add other methods to these classes in the future. The Java classes already have other methods. I didn't look at redesigning the C++ classes, but I would think that the C++ inner classes would mirror the Java classes quite closely. This way, we don't have two different algorithms or control flows, one for Java and one for C++.

        >> Method getID() should be modified to prepend RIO_PREFIX, rather than prepending RIO_PREFIX everywhere to getId()'s return value.
        I'm ambivalent on this one. getID() is supposed to append a level number to a string. The string is prefixed by RIO_PREFIX in some cases (where we can clash with user defined variables; see HADOOP-1994), but there are other cases where the string passed to getID() is not prefixed by RIO_PREFIX. So we can't add RIO_PREFIX in getID(). Well, we could, but the generated code would look a little more ugly (RIO_PREFIX, while required, does not make for pretty variable names). Right now, the Jute code looks a little more ugly with RIO_PREFIX being explicitly prefixed in some cases, but I'd prefer that over slightly uglier generated code, which the user is more likely to look at. This is a pretty small matter, so if you have stronger feelings about it than I do, I can make the change.

        >> genCompareBytes() and genSlurpBytes() code is not modified to use RIO_PREFIXES.
        This is because this code does not use the DDL-defined fields, so we don't need RIO_PREFIX (again, see HADOOP-1994).

        >> Generate another private method called deserializeWithFilter. That will make generated code more readable.
        deserialize() is really 'deserialize with filter'. It checks if a filter is present. If not, it calls deserializeWithoutFilter(). Otherwise it continues. Seems unnecessary to have another method called deserializeWithFilter, and have deserialize() just do a check for the presence of a filter and call either deserializeWithFilter() or deserializeWithoutFilter(). I don't think this will enhance readability much.
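
        Roughly, the generated structure looks like this (an illustrative sketch with made-up names, not the actual generated code):

        // deserialize() checks for a filter and falls back to the old, unversioned read
        // path when none is set; with a filter, it continues inline.
        class GeneratedRecordSketch {
          private Object rtiFilter; // stands in for the RecordTypeInfo filter, if any

          void deserialize(Object recordInput) {
            if (rtiFilter == null) {
              deserializeWithoutFilter(recordInput); // old behavior, no filter set
              return;
            }
            // filter present: compare it against this record's own type information
            // to decide what to read, where to assign it, and what to skip
          }

          void deserializeWithoutFilter(Object recordInput) {
            // the pre-versioning deserialization logic
          }
        }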

        >> In C++ generated code, deserializeWithoutFilter need not be virtual
        deserializeWithoutFilter(), which really is the old deserialize(), is virtual because deserialize() is virtual. I think we either want both to be virtual or both to not be virtual. Since deserialize() was virtual in the old code, I made deserializeWithoutFilter() virtual as well.

        Owen O'Malley added a comment -

        Please change to use autoconf/automake for the c++ code like the src/c++/utils and pipes do. It makes it much much easier to make the code compile in a cross-platform environment. There is also duplicated code between utils and librecordio that should be shared.

        Vivek Ratan added a comment -

        >> Please change to use autoconf/automake for the c++ code like the src/c++/utils and pipes do. It makes it much much easier to make the code compile in a cross-platform environment. There is also duplicated code between utils and librecordio that should be shared.

        These are valid points, but I think it makes more sense to track them separately in Jira. They're more to do with cleanup of the current C++ code, than with Record I/O versioning. Owen, if it makes sense, can you open up a separate Jira issue for your suggestions? Thanks.

        Owen O'Malley added a comment -

        Ok, I just assigned HADOOP-397 to you. :)

        Chad Walters added a comment -

        Picking back up the thread on RPC, HBase has some use cases that would entail high-frequency, low payload RPC. So the HBase folks would be interested in solutions that allow for this.

        Vivek Ratan added a comment -

        The latest patch (1883_patch03) is attached, along with unit test cases for both Java and C++.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12372354/1883_patch03
        against trunk revision .

        @author +1. The patch does not contain any @author tags.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new compiler warnings.

        findbugs -1. The patch appears to introduce 9 new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests -1. The patch failed contrib unit tests.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1453/testReport/
        Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1453/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1453/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1453/console

        This message is automatically generated.

        Vivek Ratan added a comment -

        Fixed the Findbugs findings. New patch (1883_patch04) submitted.

        Vivek Ratan added a comment -

        Resubmitting the new patch.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12372512/1883_patch04
        against trunk revision .

        @author +1. The patch does not contain any @author tags.

        javadoc -1. The javadoc tool appears to have generated messages.

        javac +1. The applied patch does not generate any new compiler warnings.

        findbugs -1. The patch appears to cause Findbugs to fail.

        core tests -1. The patch failed core unit tests.

        contrib tests -1. The patch failed contrib unit tests.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1482/testReport/
        Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1482/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1482/console

        This message is automatically generated.

        Vivek Ratan added a comment -

        The build broke because of HBase problems (I don't know why). Cancelling the patch so I can resubmit it.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12372512/1883_patch04
        against trunk revision .

        @author +1. The patch does not contain any @author tags.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new compiler warnings.

        findbugs -1. The patch appears to introduce 4 new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1497/testReport/
        Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1497/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1497/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1497/console

        This message is automatically generated.

        Vivek Ratan added a comment -

        The Findbugs warnings can be ignored.

        • The Bad Practice warning, 'org.apache.hadoop.record.meta.RecordTypeInfo defines compareTo(Object) and uses Object.equals()', can be ignored because RecordTypeInfo::compareTo() is unsupported (it throws an exception and should not be called).
        • The other three warnings, in the 'Dodgy' category, are for unchecked casts. All three casts are valid because they are guarded by the typeVal field of a TypeID object; a dynamic check would add nothing, since typeVal already guarantees the runtime type of the object (see the sketch below).
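
        For context on the second point, the following is a minimal, hypothetical sketch of the guarded-cast pattern being described. The class names, constants, and numeric values here are simplified stand-ins for the real org.apache.hadoop.record.meta types (RecordTypeInfo, TypeID, and its subclasses), not the committed code:

        // Hypothetical, simplified sketch -- not the committed implementation.
        // The typeVal constants below are arbitrary placeholder values.
        class RIOType {
          static final byte STRUCT = 10;
          static final byte VECTOR = 11;
        }

        class TypeID {
          // typeVal identifies which concrete subclass this TypeID really is
          byte typeVal;
          TypeID(byte typeVal) { this.typeVal = typeVal; }
        }

        class StructTypeID extends TypeID {
          StructTypeID() { super(RIOType.STRUCT); }
          // ... field type infos would live here
        }

        class VectorTypeID extends TypeID {
          VectorTypeID() { super(RIOType.VECTOR); }
          // ... element type info would live here
        }

        class SkipDemo {
          // The cast is reached only through the matching typeVal branch, so the
          // runtime type is already known; an additional instanceof check would be
          // redundant, which is why the unchecked-cast warnings are benign.
          static void process(TypeID tid) {
            switch (tid.typeVal) {
              case RIOType.STRUCT:
                StructTypeID sti = (StructTypeID) tid;
                // ... recurse into the struct's field type infos
                break;
              case RIOType.VECTOR:
                VectorTypeID vti = (VectorTypeID) tid;
                // ... handle the vector's element type
                break;
              default:
                // primitive types need no cast
                break;
            }
          }
        }

        Because each cast can only be reached through the branch that has already checked typeVal, the ClassCastException path is effectively unreachable, which is why the 'Dodgy' warnings are treated as benign.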
        Devaraj Das added a comment -

        I just committed this. Thanks, Vivek!

        Devaraj Das added a comment -

        Reverted patch due to a bad commit. Will commit the patch again in a bit.

        Devaraj Das added a comment -

        Fixed the commit issue.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Vivek Ratan
          • Votes:
            0
          • Watchers:
            3
