Avro
  1. Avro
  2. AVRO-637

GenericArray should implement Collection

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: java
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed

      Description

      It would be nice if Avro arrays were better integrated with Java collections. The GenericArray interface permits array element reuse, which is awkward with java.util.Collection. But if GenericArray implemented Collection and the Avro runtime permitted arbitrary Collection implementations to be passed for Arrays then it would simplify many applications. The runtime could still reuse elements if an array implemented GenericArray, so performance would not suffer for applications that, e.g., loop over a data file, reusing instances.

      1. AVRO-637-fix.patch
        0.6 kB
        Doug Cutting
      2. AVRO-637.patch
        14 kB
        Doug Cutting
      3. AVRO-637.patch
        14 kB
        Doug Cutting
      4. AVRO-637.patch
        11 kB
        Doug Cutting

        Issue Links

          Activity

          Hide
          Thiruvalluvan M. G. added a comment -

          The interface java.util.Collection does not preserve the order, but we need to preserve the order for Avro array. So I feel a more appropriate type for GenericArray should be java.util.List.

          Show
          Thiruvalluvan M. G. added a comment - The interface java.util.Collection does not preserve the order, but we need to preserve the order for Avro array. So I feel a more appropriate type for GenericArray should be java.util.List.
          Hide
          Doug Cutting added a comment -

          Yes, List would probably be a better interface to implement.

          But what type should we require for data passed in? Should

          {"type":"array", "items":"string"}

          in generated methods & fields be Collection<CharSequence> or List<CharSequence>? In either case, the runtime will create a GenericData.Array<Utf8>, but we'd like folks to be able to pass, e.g., ArrayDeque<String>, no?

          Show
          Doug Cutting added a comment - Yes, List would probably be a better interface to implement. But what type should we require for data passed in? Should {"type":"array", "items":"string"} in generated methods & fields be Collection<CharSequence> or List<CharSequence>? In either case, the runtime will create a GenericData.Array<Utf8>, but we'd like folks to be able to pass, e.g., ArrayDeque<String>, no?
          Hide
          Doug Cutting added a comment -

          Here's a patch for this. This changes GenericArray<T> to implement List<T>. It also changes specific and generic to accept arbitrary Collection implementations as array parameters and field values.

          This change is incompatible, in that it changes generated method signatures and field types. If someone approves it today, then I can commit it and quickly roll another 1.4.0 candidate, otherwise it might have to wait until 1.5.0.

          Show
          Doug Cutting added a comment - Here's a patch for this. This changes GenericArray<T> to implement List<T>. It also changes specific and generic to accept arbitrary Collection implementations as array parameters and field values. This change is incompatible, in that it changes generated method signatures and field types. If someone approves it today, then I can commit it and quickly roll another 1.4.0 candidate, otherwise it might have to wait until 1.5.0.
          Hide
          Doug Cutting added a comment -

          Here's an updated version that also update the generic data documentation. Note that this documentation was also out-of-date for enums and strings, so I updated that too.

          Show
          Doug Cutting added a comment - Here's an updated version that also update the generic data documentation. Note that this documentation was also out-of-date for enums and strings, so I updated that too.
          Hide
          Scott Carey added a comment -

          I think the type in SpecificRecord and elsewhere has to be a List<T> instead of Collection<T> in order to guarantee ordering, and that methods like add() actually append to the list. If someone sets a field to an arbitrary collection, say a Set, then that gets passed on to other code that expects add() to function like an array, there will be surprises.

          I believe that at minimum, GenericDatumReader.newArray() and addToArray() need to check for instanceof List instead of Collection, and if a non-list Collection, create a new GenericArray from a non-list Collection.

          We should be able to take in any Collection to initialize an inner List; if there was a getter/setter style for SpecificRecord then the setter could take Collection<T> and the getter return List<T>, but with raw public member variables our hands are tied.

          Alternatively, a simple thing to do is just change Collection to List in this patch – clients can always use new ArrayList(Collection c);

          As much as I like this feature and think that making such changes should be done sooner rather than later, I suspect we will be changing these APIs more in the future and it might be less painful to batch them all up. Specifically, I'm thinking about changes to the generated classes such as using getters/setters to hide the internal representation of the data.

          Show
          Scott Carey added a comment - I think the type in SpecificRecord and elsewhere has to be a List<T> instead of Collection<T> in order to guarantee ordering, and that methods like add() actually append to the list. If someone sets a field to an arbitrary collection, say a Set, then that gets passed on to other code that expects add() to function like an array, there will be surprises. I believe that at minimum, GenericDatumReader.newArray() and addToArray() need to check for instanceof List instead of Collection, and if a non-list Collection, create a new GenericArray from a non-list Collection. We should be able to take in any Collection to initialize an inner List; if there was a getter/setter style for SpecificRecord then the setter could take Collection<T> and the getter return List<T>, but with raw public member variables our hands are tied. Alternatively, a simple thing to do is just change Collection to List in this patch – clients can always use new ArrayList(Collection c); As much as I like this feature and think that making such changes should be done sooner rather than later, I suspect we will be changing these APIs more in the future and it might be less painful to batch them all up. Specifically, I'm thinking about changes to the generated classes such as using getters/setters to hide the internal representation of the data.
          Hide
          Doug Cutting added a comment -

          > I think the type in SpecificRecord and elsewhere has to be a List<T> instead of Collection<T> in order to guarantee ordering,

          I don't follow this logic. If they pass a List, then its order will be preserved. But if someone passes in an unordered data structure, why should they expect order to be preserved?

          Show
          Doug Cutting added a comment - > I think the type in SpecificRecord and elsewhere has to be a List<T> instead of Collection<T> in order to guarantee ordering, I don't follow this logic. If they pass a List, then its order will be preserved. But if someone passes in an unordered data structure, why should they expect order to be preserved?
          Hide
          Doug Cutting added a comment -

          > I suspect we will be changing these APIs more in the future and it might be less painful to batch them all up.

          1.4 already makes changes to generated APIs (Utf8 -> CharSequence) so 1.4. is an opportunity to batch up such changes. I'd much rather include this in 1.4 than 1.5. No one has signed up to imminently replace the compiler with a template system. I don't think we should delay changes without bound.

          Show
          Doug Cutting added a comment - > I suspect we will be changing these APIs more in the future and it might be less painful to batch them all up. 1.4 already makes changes to generated APIs (Utf8 -> CharSequence) so 1.4. is an opportunity to batch up such changes. I'd much rather include this in 1.4 than 1.5. No one has signed up to imminently replace the compiler with a template system. I don't think we should delay changes without bound.
          Hide
          Scott Carey added a comment -

          I don't follow this logic. If they pass a List, then its order will be preserved. But if someone passes in an unordered data structure, why should they expect order to be preserved?

          Producers and Consumers are sometimes decoupled and really different users. User A wants to create a list from a Set and persist that later, user B intercepts the serialization process and tries to append something to the list.
          The Avro contract for an array type preserves order. So, for example one helper method might alter a type and assume that the data is ordered and appendable (it was in 1.3, and comes out of deserialization that way), then a different user applies this helper method to an object constructed some other way, and the result is a surprise.
          The raw type is Collection, which does not guarantee this so such a user would be mistaken, but the semantics are more complicated and error prone.

          If we are going as far as Collection and abandoning the idea that the data in memory is consistently ordered, sortable, and appendable we should probably consider going one step further up the interface inheritance tree, past Collection to the top: Iterable. Then you could serialize arrays that don't fit in memory too.

          1.4 already makes changes to generated APIs (Utf8 -> CharSequence)

          Good point, another change in 1.4 along these lines doesn't add much incremental work to upgrade.

          Show
          Scott Carey added a comment - I don't follow this logic. If they pass a List, then its order will be preserved. But if someone passes in an unordered data structure, why should they expect order to be preserved? Producers and Consumers are sometimes decoupled and really different users. User A wants to create a list from a Set and persist that later, user B intercepts the serialization process and tries to append something to the list. The Avro contract for an array type preserves order. So, for example one helper method might alter a type and assume that the data is ordered and appendable (it was in 1.3, and comes out of deserialization that way), then a different user applies this helper method to an object constructed some other way, and the result is a surprise. The raw type is Collection, which does not guarantee this so such a user would be mistaken, but the semantics are more complicated and error prone. If we are going as far as Collection and abandoning the idea that the data in memory is consistently ordered, sortable, and appendable we should probably consider going one step further up the interface inheritance tree, past Collection to the top: Iterable. Then you could serialize arrays that don't fit in memory too. 1.4 already makes changes to generated APIs (Utf8 -> CharSequence) Good point, another change in 1.4 along these lines doesn't add much incremental work to upgrade.
          Hide
          Scott Carey added a comment -

          Another way to frame this is:

          What is the purpose of this ticket?

          1: To allow users to avoid having to use/learn GenericArray for populating records with data?
          or
          2: To use the most generic applicable built-in Java interface to define the type that an Avro array is?

          If it is #1, I feel that List<> is the most appropriate because it is semantically the same as GenericArray was.
          If it is #2, I feel that Iterable is a better generalization for an Avro array than Collection.

          Show
          Scott Carey added a comment - Another way to frame this is: What is the purpose of this ticket? 1: To allow users to avoid having to use/learn GenericArray for populating records with data? or 2: To use the most generic applicable built-in Java interface to define the type that an Avro array is? If it is #1, I feel that List<> is the most appropriate because it is semantically the same as GenericArray was. If it is #2, I feel that Iterable is a better generalization for an Avro array than Collection.
          Hide
          Scott Carey added a comment -

          If it is #2, I feel that Iterable is a better generalization for an Avro array than Collection.

          Well, now that I think about it a bit more, there is the (very big) problem of size() not existing on Iterable. So I'll correct myself and say that Collection is as high up the inheritance tree we can go. if #2 is the goal.

          Show
          Scott Carey added a comment - If it is #2, I feel that Iterable is a better generalization for an Avro array than Collection. Well, now that I think about it a bit more, there is the (very big) problem of size() not existing on Iterable. So I'll correct myself and say that Collection is as high up the inheritance tree we can go. if #2 is the goal.
          Hide
          Doug Cutting added a comment -

          Scott> Good point, another change in 1.4 along these lines doesn't add much incremental work to upgrade.

          Great. So we're agreed that this should go into 1.4.0.

          So the questions to resolve are:

          • what types should we accept at runtime; and
          • what type should generated code declare

          At runtime I think it's useful to accept the most general interface we can, so that means Collection, as we require just iterator(), size(), clear() and add(T) in order to read & write data.

          As for generated declarations, I prefer to declare things to be the most general type required, again, in this case Collection. The only downside I see to this is that folks who want to randomly access a value they receive will not be able to without casting or somesuch, as get(int) is a List method, not a Collection method. But if we declare them as List they'll be unable to pass other types of collections. HDFS RPCs do pass Set implementations over RPC. With mechanisms like reflection's "java-class" annotation the set's class can be preserved by the DatumReader. And the runtime will support arbitrary Collections. So it seems a shame to not permit them in specific data. So, on the whole, I lean towards Collection here too, but would be willing to go with List if others feel strongly.

          Show
          Doug Cutting added a comment - Scott> Good point, another change in 1.4 along these lines doesn't add much incremental work to upgrade. Great. So we're agreed that this should go into 1.4.0. So the questions to resolve are: what types should we accept at runtime; and what type should generated code declare At runtime I think it's useful to accept the most general interface we can, so that means Collection, as we require just iterator(), size(), clear() and add(T) in order to read & write data. As for generated declarations, I prefer to declare things to be the most general type required, again, in this case Collection. The only downside I see to this is that folks who want to randomly access a value they receive will not be able to without casting or somesuch, as get(int) is a List method, not a Collection method. But if we declare them as List they'll be unable to pass other types of collections. HDFS RPCs do pass Set implementations over RPC. With mechanisms like reflection's "java-class" annotation the set's class can be preserved by the DatumReader. And the runtime will support arbitrary Collections. So it seems a shame to not permit them in specific data. So, on the whole, I lean towards Collection here too, but would be willing to go with List if others feel strongly.
          Hide
          Philip Zeyliger added a comment -

          The schema we're talking about here is '

          {"type":"array", "items":"string"}

          ', no? In Java, "array" reads "List" to me. For the generated code, List is the closest match we've got. (The generated code is not a good match for things that don't fit into memory, so I'm not persuaded by the Iterable argument.) General types are nice for writing, but they're a massive pain for reading. I recently wrote an RPC Server using Avro Java, and the lack of get() on an array that the client wrote was quite irritating (hence I'm a big fan of this jira). At this layer, I think you have to choose the interfaces that are closest to the implementation.

          Show
          Philip Zeyliger added a comment - The schema we're talking about here is ' {"type":"array", "items":"string"} ', no? In Java, "array" reads "List" to me. For the generated code, List is the closest match we've got. (The generated code is not a good match for things that don't fit into memory, so I'm not persuaded by the Iterable argument.) General types are nice for writing, but they're a massive pain for reading. I recently wrote an RPC Server using Avro Java, and the lack of get() on an array that the client wrote was quite irritating (hence I'm a big fan of this jira). At this layer, I think you have to choose the interfaces that are closest to the implementation.
          Hide
          Doug Cutting added a comment -

          > For the generated code, List is the closest match we've got.

          Okay, here's a version of the patch that uses List in generated code (for array parameters and fields). Collection is used by the runtime, so both Reflect and Generic will accept arbitrary collection implementations.

          Show
          Doug Cutting added a comment - > For the generated code, List is the closest match we've got. Okay, here's a version of the patch that uses List in generated code (for array parameters and fields). Collection is used by the runtime, so both Reflect and Generic will accept arbitrary collection implementations.
          Hide
          Philip Zeyliger added a comment -

          Patch looks good. I took a look at it via reviewboard at https://review.cloudera.org/r/745/, if anyone else is curious what it looks like side-by-side.

          Show
          Philip Zeyliger added a comment - Patch looks good. I took a look at it via reviewboard at https://review.cloudera.org/r/745/ , if anyone else is curious what it looks like side-by-side.
          Hide
          Doug Cutting added a comment -

          I just committed this.

          Show
          Doug Cutting added a comment - I just committed this.
          Hide
          Eric Evans added a comment -

          Using the 1.4.0 rc2 artifacts I get the following...

          org.apache.avro.AvroRuntimeException: java.lang.NoSuchMethodException: org.apach
          e.cassandra.avro.CassandraServer.batch_mutate(java.util.Collection, org.apache.cassandra.avro.ConsistencyLevel)
                  at org.apache.avro.specific.SpecificResponder.respond(SpecificResponder.java:97)
                  at org.apache.avro.ipc.Responder.respond(Responder.java:136)
                  at org.apache.avro.ipc.Responder.respond(Responder.java:88)
                  at org.apache.avro.ipc.ResponderServlet.doPost(ResponderServlet.java:48)
                  at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
                  at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
                  at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
                  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390)
                  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
                  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
                  at org.mortbay.jetty.Server.handle(Server.java:322)
                  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536)
                  at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:930)
                  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
                  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
                  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405)
                  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
                  at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
          Caused by: java.lang.NoSuchMethodException: org.apache.cassandra.avro.CassandraServer.batch_mutate(java.util.Collection, org.apache.cassandra.avro.ConsistencyLevel)
                  at java.lang.Class.getMethod(Class.java:1622)
                  at org.apache.avro.specific.SpecificResponder.respond(SpecificResponder.java:91)
                  ... 17 more
          

          ... from an interface that SpecificCompiler created with the following sig:

          java.lang.Void batch_mutate(java.util.List<org.apache.cassandra.avro.MutationsMapEntry> mutation_map, org.apache.cassandra.avro.ConsistencyLevel consistency_level) throws org.apache.avro.ipc.AvroRemoteException, org.apache.cassandra.avro.InvalidRequestException, org.apache.cassandra.avro.UnavailableException, org.apache.cassandra.avro.TimedOutException;
          
          Show
          Eric Evans added a comment - Using the 1.4.0 rc2 artifacts I get the following... org.apache.avro.AvroRuntimeException: java.lang.NoSuchMethodException: org.apach e.cassandra.avro.CassandraServer.batch_mutate(java.util.Collection, org.apache.cassandra.avro.ConsistencyLevel) at org.apache.avro.specific.SpecificResponder.respond(SpecificResponder.java:97) at org.apache.avro.ipc.Responder.respond(Responder.java:136) at org.apache.avro.ipc.Responder.respond(Responder.java:88) at org.apache.avro.ipc.ResponderServlet.doPost(ResponderServlet.java:48) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:322) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:930) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) Caused by: java.lang.NoSuchMethodException: org.apache.cassandra.avro.CassandraServer.batch_mutate(java.util.Collection, org.apache.cassandra.avro.ConsistencyLevel) at java.lang.Class.getMethod(Class.java:1622) at org.apache.avro.specific.SpecificResponder.respond(SpecificResponder.java:91) ... 17 more ... from an interface that SpecificCompiler created with the following sig: java.lang.Void batch_mutate(java.util.List<org.apache.cassandra.avro.MutationsMapEntry> mutation_map, org.apache.cassandra.avro.ConsistencyLevel consistency_level) throws org.apache.avro.ipc.AvroRemoteException, org.apache.cassandra.avro.InvalidRequestException, org.apache.cassandra.avro.UnavailableException, org.apache.cassandra.avro.TimedOutException;
          Hide
          Doug Cutting added a comment -

          Eric, I think this should fix the problem you've found. Can you please try it? Thanks!

          Show
          Doug Cutting added a comment - Eric, I think this should fix the problem you've found. Can you please try it? Thanks!
          Hide
          Eric Evans added a comment -

          Yup, that works. +1

          Show
          Eric Evans added a comment - Yup, that works. +1
          Hide
          Doug Cutting added a comment -

          Thanks, Eric. I committed this.

          Show
          Doug Cutting added a comment - Thanks, Eric. I committed this.
          Hide
          Philip Zeyliger added a comment -

          BTW, using this is still awkward with the generated code. Since record.stuff is still typed GenericArray, you can't assign a List to it. I think we'll be able to change up the templates to make this more pleasant to use on the "write" side.

          Show
          Philip Zeyliger added a comment - BTW, using this is still awkward with the generated code. Since record.stuff is still typed GenericArray, you can't assign a List to it. I think we'll be able to change up the templates to make this more pleasant to use on the "write" side.
          Hide
          Doug Cutting added a comment -

          Philip> Since record.stuff is still typed GenericArray [ ... ]

          I talked to Philip, and I think he was using a a version of Avro w/o this patch. With this patch, generated code uses List<T>.

          Show
          Doug Cutting added a comment - Philip> Since record.stuff is still typed GenericArray [ ... ] I talked to Philip, and I think he was using a a version of Avro w/o this patch. With this patch, generated code uses List<T>.

            People

            • Assignee:
              Doug Cutting
              Reporter:
              Doug Cutting
            • Votes:
              1 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development