Avro
  1. Avro
  2. AVRO-219

Rewrite Python implementation's IO path (schema.py, io.py, genericio.py, datafile.py) and associated tests

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: python
    • Labels:
      None

      Description

      Currently, the unit tests for schema.py, genericio.py, and datafile.py are grouped in with the unit tests for io.py in testio.py. We should break the tests into individual files so that we have better modularization of tests.

      1. AVRO-219-schema-io-and-datafile.patch
        81 kB
        Jeff Hammerbacher
      2. AVRO-219.patch.schema_and_io
        63 kB
        Jeff Hammerbacher
      3. AVRO-219.patch.schema
        27 kB
        Jeff Hammerbacher
      4. AVRO-219.patch
        227 kB
        Jeff Hammerbacher

        Issue Links

          Activity

          Hide
          Jeff Hammerbacher added a comment -

          Issues encountered in current schema.py:

          • {"type": "string"}

            is not a valid schema

          Show
          Jeff Hammerbacher added a comment - Issues encountered in current schema.py: {"type": "string"} is not a valid schema
          Hide
          Jeff Hammerbacher added a comment -

          Basically rewriting io path of Python implementation for this issue. See http://github.com/hammer/avro for progress.

          Show
          Jeff Hammerbacher added a comment - Basically rewriting io path of Python implementation for this issue. See http://github.com/hammer/avro for progress.
          Hide
          Jeff Hammerbacher added a comment -

          Okay, I've got most schemas parsing correctly in my new implementation. See the attached patch or http://github.com/hammer/avro.

          I'd really appreciate a review of the direction for new_schema.py; I'm moving on tonight to io.py and genericio.py and should be able to work out any issues with the implementation through that process, but I'd appreciate an extra set of eyes.

          One note: I didn't want to maintain state when serializing a schema object to a string or when comparing to another schema object, so I kept around private variables for the schema objects which could have children to keep track of whether or not a child's name was resolved from the names cache. It's a bit hacky but I can easily switch to the alternative method if people don't like it.

          Also, there are a few things not implemented: I don't check for correct default values, I don't handle the error schema, and I don't actually parse arbitrary properties (though they should not call a parse failure). On the other hand, I handle a far wider variety of schemas correctly than the existing Python implementation, and the variety of schemas used to test my implementation is wider.

          Show
          Jeff Hammerbacher added a comment - Okay, I've got most schemas parsing correctly in my new implementation. See the attached patch or http://github.com/hammer/avro . I'd really appreciate a review of the direction for new_schema.py; I'm moving on tonight to io.py and genericio.py and should be able to work out any issues with the implementation through that process, but I'd appreciate an extra set of eyes. One note: I didn't want to maintain state when serializing a schema object to a string or when comparing to another schema object, so I kept around private variables for the schema objects which could have children to keep track of whether or not a child's name was resolved from the names cache. It's a bit hacky but I can easily switch to the alternative method if people don't like it. Also, there are a few things not implemented: I don't check for correct default values, I don't handle the error schema, and I don't actually parse arbitrary properties (though they should not call a parse failure). On the other hand, I handle a far wider variety of schemas correctly than the existing Python implementation, and the variety of schemas used to test my implementation is wider.
          Hide
          Jeff Hammerbacher added a comment -

          Okay, schema parsing as well as round-trip IO works. I will add object file container support by tomorrow night, at which point this will be ready to check in.

          Any reviews would be helpful!

          Show
          Jeff Hammerbacher added a comment - Okay, schema parsing as well as round-trip IO works. I will add object file container support by tomorrow night, at which point this will be ready to check in. Any reviews would be helpful!
          Hide
          Philip Zeyliger added a comment -

          Are you doing the "new" object container (AVRO-160) or the "older" one?

          Show
          Philip Zeyliger added a comment - Are you doing the "new" object container ( AVRO-160 ) or the "older" one?
          Hide
          Jeff Hammerbacher added a comment -

          New. My initial attempt last night failed a bit, so will head over to that ticket for some comments.

          Show
          Jeff Hammerbacher added a comment - New. My initial attempt last night failed a bit, so will head over to that ticket for some comments.
          Hide
          Jeff Hammerbacher added a comment -

          Okay, I have the whole IO path working in a new implementation now: schema parsing and printing, binary datum serialization and deserialization, and the readng and writing the new (AVRO-160) file object container.

          Time for a code review? I'm going to start on the IPC path tomorrow.

          Also, given that Sharad has been absent of late, I'd like to propose replacing the current implementation with this implementation once it passes reviews.

          Show
          Jeff Hammerbacher added a comment - Okay, I have the whole IO path working in a new implementation now: schema parsing and printing, binary datum serialization and deserialization, and the readng and writing the new ( AVRO-160 ) file object container. Time for a code review? I'm going to start on the IPC path tomorrow. Also, given that Sharad has been absent of late, I'd like to propose replacing the current implementation with this implementation once it passes reviews.
          Hide
          Jeff Hammerbacher added a comment -

          I should also mention: if you'd like to see how this implementation proceeded, please check out http://github.com/hammer/avro/commits/trunk.

          Show
          Jeff Hammerbacher added a comment - I should also mention: if you'd like to see how this implementation proceeded, please check out http://github.com/hammer/avro/commits/trunk .
          Hide
          Jeff Hammerbacher added a comment -

          Started on the IPC path. See AVRO-264.

          Show
          Jeff Hammerbacher added a comment - Started on the IPC path. See AVRO-264 .
          Hide
          Jeff Hammerbacher added a comment -

          I should also mention: I've removed the dependency on odict.

          Show
          Jeff Hammerbacher added a comment - I should also mention: I've removed the dependency on odict.
          Hide
          Doug Cutting added a comment -

          A few random comments from someone who does not program in Python:

          • please package this as a single patch file that replaces the existing implementation.
          • some of the TODO's seem critical, like skip_int.
          • those big if .. elif expressions in read_data, write_data and skip_data look like performance pits. might something like http://simonwillison.net/2004/May/7/switch/ be better? or should schema itself have read/write/skip methods?
          • validate is overkill for picking the union branch. a union can only have one branch of each unnamed type, and named types can be distinguished by name. in python, the only two types that are not distinct are records and maps, since both are represented by python dicts. so any dict without a name might be considered a map and those with are records whose names can be checked against the schema in the union.
          Show
          Doug Cutting added a comment - A few random comments from someone who does not program in Python: please package this as a single patch file that replaces the existing implementation. some of the TODO's seem critical, like skip_int. those big if .. elif expressions in read_data, write_data and skip_data look like performance pits. might something like http://simonwillison.net/2004/May/7/switch/ be better? or should schema itself have read/write/skip methods? validate is overkill for picking the union branch. a union can only have one branch of each unnamed type, and named types can be distinguished by name. in python, the only two types that are not distinct are records and maps, since both are represented by python dicts. so any dict without a name might be considered a map and those with are records whose names can be checked against the schema in the union.
          Hide
          Jeff Hammerbacher added a comment -

          please package this as a single patch file that replaces the existing implementation.

          Sure, will address this on Sunday, when I return to the states.

          some of the TODO's seem critical, like skip_int.

          skip_int and skip_long are copied from the old Python implementation. I believe they are broken, but this patch doesn't introduce the problem. I plan to add tests and sort out that issue soon, but can I address the TODOs in separate JIRAs? Blocking the commit of this patch for TODO scrubbing will mean more work outside of Apache's SVN.

          those big if .. elif expressions in read_data, write_data and skip_data look like performance pits.

          The comments on that blog post point out that a bit if/(elif)+/else block is the standard way to approximate switch/case in Python. Simon's idiom is less popular in Python code I've seen. The previous implementation built a dict of function calls, similar to the blog post you point out, and I found that to be unnecessarily complex. My goal with the Python code is to be correct, concise, and easy to understand first, and fast second. Can we keep the current approach and benchmark it in AVRO-217?

          validate is overkill for picking the union branch.

          Your suggestion sounds like a performance optimization to avoid calling validate() many times, but which would further obfuscate the function of the code. I don't think it's a good idea at this time, given the above stated aims of the Python implementation. If I've misunderstood your intent, please correct me.

          Show
          Jeff Hammerbacher added a comment - please package this as a single patch file that replaces the existing implementation. Sure, will address this on Sunday, when I return to the states. some of the TODO's seem critical, like skip_int. skip_int and skip_long are copied from the old Python implementation. I believe they are broken, but this patch doesn't introduce the problem. I plan to add tests and sort out that issue soon, but can I address the TODOs in separate JIRAs? Blocking the commit of this patch for TODO scrubbing will mean more work outside of Apache's SVN. those big if .. elif expressions in read_data, write_data and skip_data look like performance pits. The comments on that blog post point out that a bit if/(elif)+/else block is the standard way to approximate switch/case in Python. Simon's idiom is less popular in Python code I've seen. The previous implementation built a dict of function calls, similar to the blog post you point out, and I found that to be unnecessarily complex. My goal with the Python code is to be correct, concise, and easy to understand first, and fast second. Can we keep the current approach and benchmark it in AVRO-217 ? validate is overkill for picking the union branch. Your suggestion sounds like a performance optimization to avoid calling validate() many times, but which would further obfuscate the function of the code. I don't think it's a good idea at this time, given the above stated aims of the Python implementation. If I've misunderstood your intent, please correct me.
          Hide
          Jeff Hammerbacher added a comment -

          Also holding up this patch: AVRO-160 changed the file object container format between the time I posted the patch and now. I'll find some time Sunday to implement the new file object container format once I've understood its specification.

          Show
          Jeff Hammerbacher added a comment - Also holding up this patch: AVRO-160 changed the file object container format between the time I posted the patch and now. I'll find some time Sunday to implement the new file object container format once I've understood its specification.
          Hide
          Doug Cutting added a comment -

          > I believe they are broken, but this patch doesn't introduce the problem.

          Okay, that's fine then.

          > Can we keep the current approach and benchmark it in AVRO-217?

          Yes, that's a good plan.

          > Your suggestion sounds like a performance optimization to avoid calling validate() many times [ .. ]

          It is in part performance, but also correctness. A union can contain two records with different names but which are otherwise identical. The current definition of validate does not handle this correctly, since it only validates field names and values and not the record name.

          Show
          Doug Cutting added a comment - > I believe they are broken, but this patch doesn't introduce the problem. Okay, that's fine then. > Can we keep the current approach and benchmark it in AVRO-217 ? Yes, that's a good plan. > Your suggestion sounds like a performance optimization to avoid calling validate() many times [ .. ] It is in part performance, but also correctness. A union can contain two records with different names but which are otherwise identical. The current definition of validate does not handle this correctly, since it only validates field names and values and not the record name.
          Hide
          Jeff Hammerbacher added a comment -

          Okay, here's a patch that replaces the current Python implementation with just the IO path that I've rewritten. I have not addressed the union validation concern, as I didn't completely grok the point. I'd like to address that issue in a separate JIRA after this patch gets committed, if possible.

          Also, I have removed the interop tests for now. Once this patch is approved and committed, I'll clean up AVRO-264 and get that committed, then open a ticket to add the interop tests back in.

          Thanks,
          Jeff

          Show
          Jeff Hammerbacher added a comment - Okay, here's a patch that replaces the current Python implementation with just the IO path that I've rewritten. I have not addressed the union validation concern, as I didn't completely grok the point. I'd like to address that issue in a separate JIRA after this patch gets committed, if possible. Also, I have removed the interop tests for now. Once this patch is approved and committed, I'll clean up AVRO-264 and get that committed, then open a ticket to add the interop tests back in. Thanks, Jeff
          Hide
          Jeff Hammerbacher added a comment -

          I should have noted: the new patch contains an implementation of the file object container format specified in AVRO-160.

          Show
          Jeff Hammerbacher added a comment - I should have noted: the new patch contains an implementation of the file object container format specified in AVRO-160 .
          Hide
          Doug Cutting added a comment -

          I just committed this. Thanks, Jeff!

          Show
          Doug Cutting added a comment - I just committed this. Thanks, Jeff!

            People

            • Assignee:
              Jeff Hammerbacher
              Reporter:
              Jeff Hammerbacher
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development