Avro
  1. Avro
  2. AVRO-1318

Python schema should store fingerprints

    Details

    • Type: Bug Bug
    • Status: Patch Available
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: python
    • Labels:

      Description

      Python schema objects need to produce a simple representation that demonstrates their field identity.

      {avro.schema.Schema}

      objects need to provide a

      {fingerprint}

      member field to enable quick checking of schema matching (even when the schema has other, possibly changed decoration).

      Based on a patch pulled from Uri Laserson's proposed changes to make a collection of C-typing hints. These changes will be backwards-compatible.

        Issue Links

          Activity

          Hide
          Jeremy Kahn added a comment -

          Patch from here. A copy of changes from Uri Laserson for fingerprinting files.

          Show
          Jeremy Kahn added a comment - Patch from here . A copy of changes from Uri Laserson for fingerprinting files.
          Hide
          Jeremy Kahn added a comment -

          Tests pass ant clean build for me.

          Show
          Jeremy Kahn added a comment - Tests pass ant clean build for me.
          Hide
          Jeremy Kahn added a comment -

          Oh, to be clear: when AVRO-1323 is included, then AVRO-1318 passes. The AVRO-1318 changes trigger the bug that AVRO-1323 addresses.

          Show
          Jeremy Kahn added a comment - Oh, to be clear: when AVRO-1323 is included, then AVRO-1318 passes. The AVRO-1318 changes trigger the bug that AVRO-1323 addresses.
          Hide
          Jeremy Kahn added a comment -

          Nudging this issue to ask for review from a Pythonista and/or a committer. It'd be great if AVRO-1318 and AVRO-1323 could be included in Avro 1.7.5 release.

          Show
          Jeremy Kahn added a comment - Nudging this issue to ask for review from a Pythonista and/or a committer. It'd be great if AVRO-1318 and AVRO-1323 could be included in Avro 1.7.5 release.
          Hide
          Doug Cutting added a comment -

          This is a different notion of "fingerprint" than is used in the Avro specification:

          http://avro.apache.org/docs/current/spec.html#Schema+Fingerprints

          Is the purpose the same? If so, perhaps this should implement the specification. If not, can you please explain the purpose?

          Show
          Doug Cutting added a comment - This is a different notion of "fingerprint" than is used in the Avro specification: http://avro.apache.org/docs/current/spec.html#Schema+Fingerprints Is the purpose the same? If so, perhaps this should implement the specification. If not, can you please explain the purpose?
          Hide
          Jeremy Kahn added a comment -

          The purpose is roughly the same, if I understand correctly. This "fingerprint" notion is copied from Uri Laserson's perf branch to avoid recomputation of evolution decisions (to "to cache encoder and decoder objects", quoting the spec).

          This delta does most of the "parsing canonical form" part of the spec, if I understand correctly, but should be reviewed in light of that, for sure.

          I've found Uri's work on this useful to support Cython extensions, but adapting the Python decoder and encoder to cache those encoders and decoders is a pretty big change. I thought this one bit should be safe enough to include without requiring a 1.8.0 bump, so I pushed it forward as a proposal.

          Show
          Jeremy Kahn added a comment - The purpose is roughly the same, if I understand correctly. This "fingerprint" notion is copied from Uri Laserson 's perf branch to avoid recomputation of evolution decisions (to "to cache encoder and decoder objects", quoting the spec). This delta does most of the "parsing canonical form" part of the spec, if I understand correctly, but should be reviewed in light of that, for sure. I've found Uri's work on this useful to support Cython extensions, but adapting the Python decoder and encoder to cache those encoders and decoders is a pretty big change. I thought this one bit should be safe enough to include without requiring a 1.8.0 bump, so I pushed it forward as a proposal.
          Hide
          Doug Cutting added a comment -

          I'd prefer we don't overload the term 'fingerprint'. Perhaps this could be called 'profile' or 'signature' instead? Or it might be reworked to actually be "parsing canonical form".

          It's also strange to me to add something like this without adding a use of it. It doesn't seem like an end-user feature. If it's to aid performance, then the performance improvements should probably be added too, validating these changes.

          Perhaps we should we try to get Uri to contribute his Perf branch?

          Show
          Doug Cutting added a comment - I'd prefer we don't overload the term 'fingerprint'. Perhaps this could be called 'profile' or 'signature' instead? Or it might be reworked to actually be "parsing canonical form". It's also strange to me to add something like this without adding a use of it. It doesn't seem like an end-user feature. If it's to aid performance, then the performance improvements should probably be added too, validating these changes. Perhaps we should we try to get Uri to contribute his Perf branch?
          Hide
          Jeremy Kahn added a comment -

          It is not an end-user use case. it's a useful performance win, if you're caching encoder and decoder objects, as Uri's changes do. I've written my own extensions based heavily on this "signature" behavior.

          I'd be happy to have Uri Laserson's perf branch added. Doug Cutting, perhaps you can go chat with him about this? He's a Clouderian, if I understand correctly.

          Show
          Jeremy Kahn added a comment - It is not an end-user use case. it's a useful performance win, if you're caching encoder and decoder objects, as Uri's changes do. I've written my own extensions based heavily on this "signature" behavior. I'd be happy to have Uri Laserson 's perf branch added. Doug Cutting , perhaps you can go chat with him about this? He's a Clouderian, if I understand correctly.
          Hide
          Uri Laserson added a comment -

          I'd be happy to contribute, though it requires some additional cleanup. The only issue is that implementing these performance improvements would substantially change (and simplify) the read/write API.

          Show
          Uri Laserson added a comment - I'd be happy to contribute, though it requires some additional cleanup. The only issue is that implementing these performance improvements would substantially change (and simplify) the read/write API.
          Hide
          Jeremy Kahn added a comment -

          Uri Laserson, glad you're game to contribute!

          • what cleanup do you think is needed?
          • would it be possible to use your work without requiring a change to the read/write API? (could the old API be preserved in terms of your new one?)
          Show
          Jeremy Kahn added a comment - Uri Laserson , glad you're game to contribute! what cleanup do you think is needed? would it be possible to use your work without requiring a change to the read/write API? (could the old API be preserved in terms of your new one?)
          Hide
          Doug Cutting added a comment -

          An alternative is to add an entirely new API and deprecate the old one.

          If we add a new API it would be good to have it object-based rather than dictionary-based (AVRO-283), and for it to correctly resolve unions, using record names rather than recursive validation (AVRO-1343).

          Show
          Doug Cutting added a comment - An alternative is to add an entirely new API and deprecate the old one. If we add a new API it would be good to have it object-based rather than dictionary-based ( AVRO-283 ), and for it to correctly resolve unions, using record names rather than recursive validation ( AVRO-1343 ).
          Hide
          Jeremy Kahn added a comment -

          I prefer the approach you (Doug Cutting) suggest – especially the object aspects of it, and especially if the objects can be derived from collections.Sequence and collections.Mapping so that existing accessors can keep working the same way.

          Unfortunately, I don't have any free cycles for this, though I'd be happy to contribute later in July. I don't know if this should block 1.7.5 release though.

          Show
          Jeremy Kahn added a comment - I prefer the approach you ( Doug Cutting ) suggest – especially the object aspects of it, and especially if the objects can be derived from collections.Sequence and collections.Mapping so that existing accessors can keep working the same way. Unfortunately, I don't have any free cycles for this, though I'd be happy to contribute later in July. I don't know if this should block 1.7.5 release though.
          Hide
          Uri Laserson added a comment -

          I don't have bandwidth at this exact moment either. Probably better in August.

          Switching to object-based is also another larger effort I think, though not a bad idea, since the python dicts don't exactly correspond to Avro records IIRC.

          Show
          Uri Laserson added a comment - I don't have bandwidth at this exact moment either. Probably better in August. Switching to object-based is also another larger effort I think, though not a bad idea, since the python dicts don't exactly correspond to Avro records IIRC.

            People

            • Assignee:
              Jeremy Kahn
              Reporter:
              Jeremy Kahn
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:

                Development