Uploaded image for project: 'Avro'
  1. Avro
  2. AVRO-1933

SchemaCompatibility class could be more user-friendly about incompatibilities

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.7.7, 1.8.1
    • Fix Version/s: 1.9.0
    • Component/s: java
    • Labels:
      None
    • Environment:

      Any Java env

    • Hadoop Flags:
      Incompatible change
    • Release Note:
      SchemaCompatibility returns the more detailed object SchemaCompatibilityResult instead of the simple enum SchemaCompatibilityType.
    • Flags:
      Patch

      Description

      Today, the class SchemaCompatibility reports incompatibilities with quite little detail. The whole reader and the whole writer schema is listed, and no particular detail about what was incompatible.

      The attached patch fixes this, introducing a new enum (SchemaIncompatibilityType), and more specific sub-schemas that were incompatible.
      The old, overall picture, is still there - the new compatibility state is encapsulated in the SchemaCompatibilityDetails class.
      Lots of test cases have been added, and there has been refactoring done in the TestSchemaCompatibility and other test classes.

      1. AVRO-1933.patch
        106 kB
        Anders Sundelin
      2. AVRO-1933-compatible-with-AVRO-1931.patch
        105 kB
        Anders Sundelin

        Issue Links

          Activity

          Hide
          epkanol Anders Sundelin added a comment - - edited

          Patch set adding more error details to SchemaCompatibility.
          Also includes part of AVRO-1931 (though not all tests therein)
          This patch is not compatible with changes done in AVRO-1931 - if you come that way, please use the other patch attached to this issue.
          Contact me if there is any issues.

          Show
          epkanol Anders Sundelin added a comment - - edited Patch set adding more error details to SchemaCompatibility. Also includes part of AVRO-1931 (though not all tests therein) This patch is not compatible with changes done in AVRO-1931 - if you come that way, please use the other patch attached to this issue. Contact me if there is any issues.
          Hide
          epkanol Anders Sundelin added a comment -

          Patch set compatible with AVRO-1931 patch
          One of these patches apply, depending on your starting point.

          Show
          epkanol Anders Sundelin added a comment - Patch set compatible with AVRO-1931 patch One of these patches apply, depending on your starting point.
          Hide
          teabot Elliot West added a comment - - edited

          I agree with the motivations behind this effort. There is a lot of room for improvement with regards to the reporting of incompatibilities within schemas. This is especially important as the end user focus of Avro Schema development shifts from developers to analysts and I know of a few projects that are heading in this direction. For example, one can envisage systems built upon Avro that have web-based UI tooling for schema submission etc.

          In these cases it would be very handy to model situations that would enable us to make statements such as:

          • "Field 'uuid' in record 'namespace.record_name' in the existing schema cannot be read by the new schema because it has changed to an incompatible type; was 'string', now 'long'"
          • "Field 'address' in record 'namespace.record_name' in the new schema does not exist in the earlier schema and does not declare a default"
          • "Field union branch 1 at 'namespace.record_name.uField' in the earlier schema does not exist in the new schema"

          Certainly these conditions should generated as an accessible model and not simply returned as a message so that integrations with editors will be possible. With this is mind, would it be possible for SchemaCompatibilityDetails to also encapsulate a machine readable, fully qualified path to the incompatible node within the schema tree?

          Nice to see the patch and hope to see such a feature present in a future version of Avro.

          Show
          teabot Elliot West added a comment - - edited I agree with the motivations behind this effort. There is a lot of room for improvement with regards to the reporting of incompatibilities within schemas. This is especially important as the end user focus of Avro Schema development shifts from developers to analysts and I know of a few projects that are heading in this direction. For example, one can envisage systems built upon Avro that have web-based UI tooling for schema submission etc. In these cases it would be very handy to model situations that would enable us to make statements such as: " Field 'uuid' in record 'namespace.record_name' in the existing schema cannot be read by the new schema because it has changed to an incompatible type; was 'string', now 'long' " " Field 'address' in record 'namespace.record_name' in the new schema does not exist in the earlier schema and does not declare a default " " Field union branch 1 at 'namespace.record_name.uField' in the earlier schema does not exist in the new schema " Certainly these conditions should generated as an accessible model and not simply returned as a message so that integrations with editors will be possible. With this is mind, would it be possible for SchemaCompatibilityDetails to also encapsulate a machine readable, fully qualified path to the incompatible node within the schema tree? Nice to see the patch and hope to see such a feature present in a future version of Avro.
          Hide
          teabot Elliot West added a comment -

          I've made an attempt at including a path to the location within the schema where the compatibility problem exists. I encoded the path according to the JSON Pointer specification with respect the the JSON form of the Avro schema. This requires the simple addition of a stack to track the SchemaCompatibility's current location within the schema. In this form it should be trivial to apply said path to the schema's JSON tree to flag/highlight the problem in the document. It is also fairly human readable.

          If it is ok, I'd like to add a patched version of the existing work attached to this ticket.

          Show
          teabot Elliot West added a comment - I've made an attempt at including a path to the location within the schema where the compatibility problem exists. I encoded the path according to the JSON Pointer specification with respect the the JSON form of the Avro schema. This requires the simple addition of a stack to track the SchemaCompatibility 's current location within the schema. In this form it should be trivial to apply said path to the schema's JSON tree to flag/highlight the problem in the document. It is also fairly human readable. If it is ok, I'd like to add a patched version of the existing work attached to this ticket.
          Hide
          epkanol Anders Sundelin added a comment -

          I agree that the addition of the conflicting path would make it more usable for complex schemas. The JSON Pointer spec talks a bit about escaped characters, though I do not know if these are valid in Avro identifiers.
          How would unions be represented in the JSON pointer syntax? In particular, if a union contains two "similar" records (where some fields might be the same, for instance due to evolved data model). Make sure to include some test cases in the patch, then it will be self-documenting (this is how I originally learned about the SchemaCompatibility class

          Show
          epkanol Anders Sundelin added a comment - I agree that the addition of the conflicting path would make it more usable for complex schemas. The JSON Pointer spec talks a bit about escaped characters, though I do not know if these are valid in Avro identifiers. How would unions be represented in the JSON pointer syntax? In particular, if a union contains two "similar" records (where some fields might be the same, for instance due to evolved data model). Make sure to include some test cases in the patch, then it will be self-documenting (this is how I originally learned about the SchemaCompatibility class
          Hide
          teabot Elliot West added a comment - - edited

          Thanks for your interest Anders Sundelin. JSON Pointer has a 'unusual' escaping system. As the / character is a path separator, any JSON identifiers containing said character must have those instances be replaced with ~0. The tilde character then holds a special meaning, so then any literal uses in JSON identifiers must then be escaped with ~1. The patch I have does this. With reference to your question regarding unions, the index of the branch is used as the identifier, which translates nicely to JSON Pointer's array index specification. The same applies to fields in records. Consider the following schema:

          {
            "type" : "record",
            "name" : "myRecord",
            "fields" : [ {
              "name" : "myId",
              "type" : "long"
            }, {
              "name" : "myValue",
              "type" : [ "string", "long", "null" ]
            } ]
          }
          

          If say the string branch of a union was missing in the companion schema, the path to the location of the issue in the schema would be: /fields/1/type/0.

          I'll certainly add some more test cases

          Show
          teabot Elliot West added a comment - - edited Thanks for your interest Anders Sundelin . JSON Pointer has a 'unusual' escaping system. As the / character is a path separator, any JSON identifiers containing said character must have those instances be replaced with ~0 . The tilde character then holds a special meaning, so then any literal uses in JSON identifiers must then be escaped with ~1 . The patch I have does this. With reference to your question regarding unions, the index of the branch is used as the identifier, which translates nicely to JSON Pointer's array index specification. The same applies to fields in records. Consider the following schema: { "type" : "record" , "name" : "myRecord" , "fields" : [ { "name" : "myId" , "type" : " long " }, { "name" : "myValue" , "type" : [ "string" , " long " , " null " ] } ] } If say the string branch of a union was missing in the companion schema, the path to the location of the issue in the schema would be: /fields/1/type/0 . I'll certainly add some more test cases
          Hide
          teabot Elliot West added a comment - - edited

          Anders Sundelin, I do not have permission to attach my patch to this ticket so have instead created a new issue, AVRO-2003, that depends on this one. The patch is probably clearer and my issue better documented this way too. I've updated all the test cases and added more.

          One further thought; perhaps the noun 'results' might better describe the return type than 'details'. This would lead to a classname of SchemaCompatibilityResults and also SchemaPairCompatibility.getResults(). Additionally, the term 'details' in SchemaCompatibilityDetails might be better described as the 'message'?

          These changes would avoid the potential ungainly invocation chain of: SchemaPairCompatibility.getDetails().getDetails() and would become: SchemaPairCompatibility.getResults().getMessage()

          Show
          teabot Elliot West added a comment - - edited Anders Sundelin , I do not have permission to attach my patch to this ticket so have instead created a new issue, AVRO-2003 , that depends on this one. The patch is probably clearer and my issue better documented this way too. I've updated all the test cases and added more. One further thought; perhaps the noun 'results' might better describe the return type than 'details'. This would lead to a classname of SchemaCompatibilityResults and also SchemaPairCompatibility.getResults() . Additionally, the term 'details' in SchemaCompatibilityDetails might be better described as the 'message'? These changes would avoid the potential ungainly invocation chain of: SchemaPairCompatibility.getDetails().getDetails() and would become: SchemaPairCompatibility.getResults().getMessage()
          Hide
          teabot Elliot West added a comment -

          Anders Sundelin can this ticket status be set to 'Patch Available'?

          Show
          teabot Elliot West added a comment - Anders Sundelin can this ticket status be set to 'Patch Available'?
          Hide
          epkanol Anders Sundelin added a comment -

          Two patches is attached, choose one, depending on whether AVRO-1931 is in the baseline or not

          Show
          epkanol Anders Sundelin added a comment - Two patches is attached, choose one, depending on whether AVRO-1931 is in the baseline or not
          Hide
          satish.duggana Satish Duggana added a comment - - edited

          Doug Cutting Ryan Blue It is nice to have this as it gives more insight into what is wrong with the schema when it is checked for compatibility with another schema.

          AVRO-2003 adds location details about the incompatible elements.

          Anders Sundelin I guess you need to raise PR for this JIRA.

          Show
          satish.duggana Satish Duggana added a comment - - edited Doug Cutting Ryan Blue It is nice to have this as it gives more insight into what is wrong with the schema when it is checked for compatibility with another schema. AVRO-2003 adds location details about the incompatible elements. Anders Sundelin I guess you need to raise PR for this JIRA.
          Hide
          epkanol Anders Sundelin added a comment - - edited

          Created a PR, from the correct repo, I hope. Please have a look in github

          Show
          epkanol Anders Sundelin added a comment - - edited Created a PR, from the correct repo, I hope. Please have a look in github
          Hide
          satish.duggana Satish Duggana added a comment -

          Anders Sundelin Please raise PR on apache avro repo.

          Show
          satish.duggana Satish Duggana added a comment - Anders Sundelin Please raise PR on apache avro repo.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user epkanol opened a pull request:

          https://github.com/apache/avro/pull/200

          AVRO-1933: Add more specific error details to SchemaCompatibility class

          New try, this time forking off of the apache:master branch

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/epkanol/avro-1 AVRO-1933

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/avro/pull/200.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #200



          Show
          githubbot ASF GitHub Bot added a comment - GitHub user epkanol opened a pull request: https://github.com/apache/avro/pull/200 AVRO-1933 : Add more specific error details to SchemaCompatibility class New try, this time forking off of the apache:master branch You can merge this pull request into a Git repository by running: $ git pull https://github.com/epkanol/avro-1 AVRO-1933 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/avro/pull/200.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #200
          Hide
          nkollar Nandor Kollar added a comment -

          Anders Sundelin since AVRO-1931 is committed to trunk now, it looks like your pull request has several merge conflicts. Could you please merge Avro trunk into your fork, and resolve the merge conflicts?

          Show
          nkollar Nandor Kollar added a comment - Anders Sundelin since AVRO-1931 is committed to trunk now, it looks like your pull request has several merge conflicts. Could you please merge Avro trunk into your fork, and resolve the merge conflicts?
          Hide
          epkanol Anders Sundelin added a comment -

          Updated according to your comments, please have a new look at my fork

          Show
          epkanol Anders Sundelin added a comment - Updated according to your comments, please have a new look at my fork
          Hide
          nkollar Nandor Kollar added a comment -

          Thanks Anders Sundelin, I'll help the committers with review, do my best to get this committed!

          Show
          nkollar Nandor Kollar added a comment - Thanks Anders Sundelin , I'll help the committers with review, do my best to get this committed!
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit db8ed216eaf13c8f3862eb31152185f0504c4467 in avro's branch refs/heads/master from Anders Sundelin
          [ https://git-wip-us.apache.org/repos/asf?p=avro.git;h=db8ed21 ]

          AVRO-1933: Add more specific error details to SchemaCompatibility class

          Closes #200

          Signed-off-by: Sriharsha Chintalapani <harsha@hortonworks.com>
          Signed-off-by: Anna Szonyi <szonyi@cloudera.com>
          Signed-off-by: Nandor Kollar <nkollar@cloudera.com>
          Signed-off-by: Gabor Szadovszky <gabor@apache.org>

          Show
          jira-bot ASF subversion and git services added a comment - Commit db8ed216eaf13c8f3862eb31152185f0504c4467 in avro's branch refs/heads/master from Anders Sundelin [ https://git-wip-us.apache.org/repos/asf?p=avro.git;h=db8ed21 ] AVRO-1933 : Add more specific error details to SchemaCompatibility class Closes #200 Signed-off-by: Sriharsha Chintalapani <harsha@hortonworks.com> Signed-off-by: Anna Szonyi <szonyi@cloudera.com> Signed-off-by: Nandor Kollar <nkollar@cloudera.com> Signed-off-by: Gabor Szadovszky <gabor@apache.org>
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/avro/pull/200

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/avro/pull/200

            People

            • Assignee:
              epkanol Anders Sundelin
              Reporter:
              epkanol Anders Sundelin
            • Votes:
              3 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 1h
                1h
                Remaining:
                Remaining Estimate - 1h
                1h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development