AVRO-23: Schema.toString() fails for a union of versioned records.

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: java
    • Labels:
      None

      Description

      Schema.toString() fails to create a string representation for a union that contains versioned records.

        Activity

        Hong Tang added a comment -

        test case:

          public void testVersionedUnion() throws Exception {
            Schema s1 = Schema.create("R", null, false);
            Schema s2 = Schema.create("R", null, false);
            Map<String, Schema> f1 = new LinkedHashMap<String, Schema>();
            f1.put("a", Schema.create(Type.BOOLEAN));
            Map<String, Schema> f2 = new LinkedHashMap<String, Schema>();
            f2.put("a", Schema.create(Type.INT));
            s1.setFields(f1);
            s2.setFields(f2);
            List<Schema> union = new ArrayList<Schema>();
            union.add(s1);
            union.add(s2);
            Schema s3 = Schema.create(union);
            String ss = s3.toString();
            assertTrue(ss.length() > 0);
          }
        

        Error received:

        Can't redefine: R
        org.apache.avro.SchemaParseException: Can't redefine: R
            at org.apache.avro.Schema$Names.put(Schema.java:439)
            at org.apache.avro.Schema$Names.put(Schema.java:1)
            at org.apache.avro.Schema$RecordSchema.toString(Schema.java:234)
            at org.apache.avro.Schema$UnionSchema.toString(Schema.java:333)
            at org.apache.avro.Schema.toString(Schema.java:167)
        
        Doug Cutting added a comment -

        One way to implement this, without altering Avro, would be to write a renaming schema copier, something like:

        public Schema copy(Schema source, Map<String,String> renames);

        This would be pretty simple to write. One could construct the renaming table by first walking the schema to find all of the names in it, with a method like:

        public Map<String,Schema> getNames(Schema schema);

        This would also be simple to write. Then, given a set of schemas, one can maintain a set of all names already seen and, as one adds the schemas to a union, rename any whose names have already been used.
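The renaming pass described above can be sketched without touching Avro at all. The class below is hypothetical: `uniquify` stands in for the proposed `copy(source, renames)` plus name-collection step, operating on bare record names rather than real `Schema` objects, and the `_2`, `_3` suffix convention is just one illustrative choice.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal, Avro-independent sketch of the renaming idea: walk the record
// names destined for a union, track every name already seen, and rename
// any duplicate by appending a numeric suffix.
public class UnionRenamer {
  public static List<String> uniquify(List<String> names) {
    Set<String> seen = new HashSet<>();
    List<String> out = new ArrayList<>();
    for (String name : names) {
      String candidate = name;
      int suffix = 2;
      // Bump the suffix until the candidate name is unused.
      while (!seen.add(candidate)) {
        candidate = name + "_" + suffix++;
      }
      out.add(candidate);
    }
    return out;
  }

  public static void main(String[] args) {
    // Two versions of R plus an unrelated S; duplicates get renamed.
    System.out.println(uniquify(List.of("R", "R", "S", "R")));
  }
}
```

A real implementation would apply the same bookkeeping while deep-copying each `Schema`, rewriting record names via the renames map.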

        Another approach might be to define scoping rules, e.g., a re-definition might only be visible within itself. This would be confusingly inconsistent, however, since in a protocol we'd like sibling definitions to be visible to one another, but in this case we do not want sibling definitions to be visible to one another. So I prefer the renaming approach.

        Note that with renaming, one could use a convention like suffixing names with _X, which one could then remove again later. Or, once namespaces are correctly supported (AVRO-30), renaming could just change the package.

        Hong Tang added a comment -

        Another way to look at it is that the name defined in the JSON schema is what a user sees, while the names stored in schema objects and maintained in the internal dictionary are always renamed or, in compiler terms, mangled symbols. I personally prefer prefixing and/or length-delimited schemes, which are easier to demangle.
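A minimal sketch of such a length-delimited mangling scheme, in the spirit of C++ symbol mangling. Everything here is illustrative, not anything Avro actually does: the user-visible name is prefixed with its character count, and a `$`-separated version tag disambiguates redefinitions.

```java
// Hypothetical mangling for internal schema names: "R" at version 2
// becomes "1R$2" (length, name, '$', version). Because Avro names
// cannot start with a digit, the length prefix is unambiguous.
public class NameMangler {
  public static String mangle(String name, int version) {
    return name.length() + name + "$" + version;
  }

  // Demangling recovers the user-visible name from the length prefix.
  public static String demangle(String mangled) {
    int i = 0;
    while (Character.isDigit(mangled.charAt(i))) i++;
    int len = Integer.parseInt(mangled.substring(0, i));
    return mangled.substring(i, i + len);
  }

  public static void main(String[] args) {
    String m = mangle("R", 2);
    System.out.println(m + " -> " + demangle(m));
  }
}
```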

        Doug Cutting added a comment -

        Can I resolve this as "won't fix"? I don't think we want to permit multiple records with the same name in a schema.

        Hong Tang added a comment -

        What if records of the same name creep in due to inclusion and transitive closure calculation?

        Doug Cutting added a comment -

        > What if records of the same name creeps in due to inclusion and transitive closure calculation?

        That would be an error. Since toString() fails on such schemas, this is implemented correctly. One cannot write data without at some point calling toString() on a schema, even if only to cache its MD5 hash.

        Hong Tang added a comment -

        @doug, I am concerned that Avro's ability to support versioned records is diminished if we disallow schemas of the same name in a union. Here is my hypothetical use case:

        1. I have multiple data sources, each may contain a particular version of the same record R. To be concrete, let's say I have two sources, S1 is a container of V1 of R, and S2 is a container of V2 of R.
        2. I want to sort/merge the two containers by some key (outside R), which would lead to the interleaving of records from S1 and S2.
        3. I could convert all records into a common schema, but the problem is that I do not know which version is supposed to be the "best" one.
        4. In particular, the program that does the merge may only care about the keys, and not the versions of records in the containers. What it could do is create a new schema as union(R_v1, R_v2), and prefix the bytes coming from S1 with index 0 and the bytes coming from S2 with index 1.
        5. The consumer of this merged container must provide a version of R (Vx), and perform the normal conversion from either V1 to Vx or V2 to Vx for each individual record.

        Like you said earlier, a workaround is to rename the schemas to different names (and possibly rename back in step 5). But it seems messy. And what about records nested inside? Do we also need to rename those too? Or does/will avro support nested (static) scoping rules?
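Step 4 of the use case above can be sketched in miniature. Per the Avro specification, a union value is encoded by first writing the zero-based branch index as a zig-zag varint; the tagging the merger would do looks roughly like this (the payload bytes are placeholders for serialized R records):

```java
import java.io.ByteArrayOutputStream;

// Sketch of tagging record bytes with their schema's index in the
// hypothetical union(R_v1, R_v2). Avro writes the branch index as a
// zig-zag varint before the value; for small indices that is one byte.
public class TaggedMerge {
  // Zig-zag encode a small branch index as Avro does (0 -> 0, 1 -> 2).
  public static int zigZag(int n) {
    return (n << 1) ^ (n >> 31);
  }

  public static byte[] tag(int branch, byte[] payload) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(zigZag(branch));  // one byte suffices for index < 64
    out.write(payload, 0, payload.length);
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] fromS1 = tag(0, new byte[]{1});  // a V1 record from S1
    byte[] fromS2 = tag(1, new byte[]{1});  // a V2 record from S2
    System.out.println(fromS1[0] + " " + fromS2[0]);
  }
}
```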

        Doug Cutting added a comment -

        This is all an optimization to avoid deserializing while merging, right? Can't we deserialize as fast as we can read off disk anyway? We can simply read using the latest version of the schema and re-write things.

        If you really believe you must have this optimization, then you could build it into your container. Rather than using a union, keep an explicit array of schemas that you've seen.

        Hong Tang added a comment -

        > This is all an optimization to avoid deserializing while merging, right? Can't we deserialize as fast as we can read off disk anyway? We can simply read using the latest version of the schema and re-write things.

        No, in the use case I described, this is more than an optimization. The sources may come from different origins and there may not be a total ordering of the versions of R. I think one of the advantages of Avro over Protocol Buffers is that it does not require global coordination for schema evolution.

        > If you really believe you must have this optimization, then you could build it into your container. Rather than using a union, keep an explicit array of schemas that you've seen.

        Yes, I could work around the issue. However, schema evolution and bi-directional compatibility should be something that is closer to the framework than to the application. The analogy to your suggestion is that C++ would not need to provide method overloading; users could simply work around it by renaming methods with parameter types, like addii(int, int) and adddd(double, double).


  People

  • Assignee: Unassigned
  • Reporter: Hong Tang
  • Votes: 0
  • Watchers: 2