Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: spec
    • Labels:
      None

      Description

      Cassandra uses Set as a return type for some methods. It would be nice to not have to use a List as a workaround.

        Activity

        Doug Cutting added a comment -

        My preference is an attribute that's a hint to the runtime, e.g., "useSet":true might indicate that a set-like representation should be used if available, while permitting implementations to ignore the annotation and still process the data correctly. If a writer intends setwise comparisons, it must sort the array before writing. We could require that, if this attribute is specified, no duplicates are present and perhaps that things are sorted, but we should not force readers to check this or otherwise handle these arrays any differently than other arrays.
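        As a concrete illustration (hypothetical, since "useSet" is only an attribute name proposed in this comment, not one the spec defines), such a schema might read:

```json
{"type": "array", "items": "string", "useSet": true}
```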

        For Cassandra, what matters is that the Java specific API represent these differently than arrays. So, in terms of a patch, that's the implementation to target. We might actually implement it in generic too, since specific inherits its array code from there anyway. But I don't think we need to implement it anywhere else right off, since it is an optional feature. So SpecificCompiler should generate a different type when this attribute is specified, a GenericData.Set class should be added, extending GenericData.Array, and GenericDatumReader#newArray() should create an instance of GenericData.Set when "useSet" is specified. Does that sound like a reasonable plan?

        Scott Carey added a comment -

        An ordered set is easily compared to an array; perhaps sets should be ordered.

        If things are unordered, comparison gets more complicated. For example, the simple way to compare would be to build both sets up in memory – this would not work for large sets or lists.
        I propose that a set is ordered by default. A client can do this by either sorting when writing, or with a data structure like a linked set. A client can choose to disregard order and accept that set equivalence is invalid if they wish.
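        The writer-side convention proposed here can be sketched outside Avro entirely; the helper below is a hypothetical illustration, not Avro API:

```python
def write_set_as_sorted_array(s):
    # Sorting fixes element order regardless of the set's in-memory
    # iteration order, so two equal sets always encode identically and
    # setwise comparison reduces to a plain array comparison.
    return sorted(s)

a = write_set_as_sorted_array({3, 1, 2})
b = write_set_as_sorted_array({2, 3, 1})
```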

        We could consider an "ordered": true property as well; "unique": false, "ordered": true would then be the implicit default. An array or set that is not ordered is always unequal to another array or set. A reader with an 'ordered' schema reading an 'unordered' serialization would be in a tough spot; that might not be a supported promotion.
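        For illustration, the two attributes floated in this comment (hypothetical, not part of the spec) might combine as:

```json
{"type": "array", "items": "long", "unique": true, "ordered": true}
```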


        That leads us to the next question: What are the resolution rules between arrays and sets? My answer to this is: sets written can always be read as arrays. Arrays written can be read as sets as long as the uniqueness constraint is not violated.

        One can always read an array as a set, even with duplicates. The duplicates get eliminated in the process of creating the set. Interestingly, one can go either direction, but not back and forth.
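        A minimal sketch of that resolution direction (plain Python, not Avro API; the function names are hypothetical):

```python
def read_array_as_set(items):
    # Reader-side promotion: reading an array as a set eliminates
    # duplicates (first occurrence wins in this sketch).
    out = []
    for x in items:
        if x not in out:
            out.append(x)
    return out

original = [1, 2, 2, 3]
as_set = read_array_as_set(original)  # duplicates are gone
round_trip = list(as_set)             # reading the set back as an array
# round_trip no longer equals original: the conversion is lossy one way,
# which is why one can go either direction but not back and forth.
```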

        I think in the short run, Doug's version is appropriate. The above would be valuable but would also take a while to sort out the details of what works best across languages, and it is a spec change rather than a Java API extension. Besides, it should be possible to specify what object to use as an array container regardless.

        Taken in combination with AVRO-436 (ordered maps), these two things are potentially significant changes for something that clients can emulate on their own. Ordered maps can be emulated by an array of {key, value} tuples.
        The simplest option other than Doug's Java Reflect API version is to require that Avro sets are ordered for the purposes of equivalence and comparison, and if a client wants to compare two objects for equality or sort order they must guarantee order on writing (this restriction already happens when serializing sets as arrays).
        "unique": (true|false) is then just a reserved keyword hint for languages to construct Set-like APIs for data access.
        We can consider adding the more difficult support for unordered sets/lists incrementally.

        Thiruvalluvan M. G. added a comment -

        +1 for "unique":"true".

        Arrays differ from sets in two ways:

        • Arrays guarantee the order of elements and sets don't. Using arrays to implement sets shouldn't be a problem.
        • Arrays do not guarantee uniqueness. Using arrays for sets would work if we can define equality in the spec. Unfortunately we cannot say "two entities are equal if their schemas are equal and their bit representations in Avro are equal". The trouble comes from maps (and sets when we have them) because Avro does not enforce order in them. But we can define equality something like this:
        • Two primitive entities are equal if and only if their schemas and Avro binary representations are equal.
        • Two complex entities (other than maps and sets) are equal if and only if their schemas are equal and their contents are equal. For example, two unions are equal if and only if their union indexes are equal and the union members are equal.
        • Two maps (or sets) are equal if and only if their schemas are equal and their elements are equal except for their order.
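        Python's built-in containers happen to follow the same rules, which makes a compact illustration of the proposed equality semantics (an analogy, not Avro code):

```python
# Arrays (lists) compare element by element, so order matters.
lists_equal = [1, 2, 3] == [3, 2, 1]
# Sets and maps (dicts) compare contents irrespective of order.
sets_equal = {1, 2, 3} == {3, 2, 1}
maps_equal = {"a": 1, "b": 2} == {"b": 2, "a": 1}
```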

        Another question to answer is: Is it the responsibility of the Avro library to ensure uniqueness? My answer is no. Since in Avro we interpret the contents assuming that the schema we have is indeed the schema that the writer used, we can trust "unique":"true" as well.

        That leads us to the next question: What are the resolution rules between arrays and sets? My answer to this is: sets written can always be read as arrays. Arrays written can be read as sets as long as the uniqueness constraint is not violated. It extends our schema resolution philosophy - we try to be as lenient as possible unless we encounter data violations.

        Scott Carey added a comment -

        I think that this is best done at the API level with serialization unchanged. Something like Doug's proposal should work.

        For cross-language purposes farther down the road, we might want to have some support for this in schema metadata that is not language-specific. Perhaps something like:

        {"type": "array", "items": "int", "subtype": "set"}

        (or, "unique": true ?)

        Languages could optionally do something with that information.

        Doug Cutting added a comment -

        Adding a new fundamental type to Avro is difficult without breaking back-compatibility. But a way to achieve this might be with a schema attribute. For example, the Java reflect API uses the "java-class" attribute to indicate which class implements a collection. Thus the schema ReflectDatumWriter infers for HashSet<String> is

        {"type": "array", "items": "string", "java-class": "java.util.HashSet"}

        ReflectDatumReader then instantiates this appropriately.

        Might something like this work for Cassandra?


          People

          • Assignee:
            Unassigned
            Reporter:
            Jonathan Ellis
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:

              Development