Avro / AVRO-673

Reduce time spent validating schemas

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.1
    • Component/s: python
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      avro.io has a validate method that currently occupies around half the time it takes to serialize a fairly complex record through a datafile. validate() gets called repeatedly during an object's traversal, even though validate itself is already recursive. This introduces combinatorially excessive validation that has a significant impact on the performance of serializing complex records.
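
      To make the cost concrete, here is a minimal sketch of the pattern described above. It is a toy model and not the avro.io source: the tuple-based schemas, CountingValidator, write_data and nest are all hypothetical names chosen for illustration. The writer re-validates each sub-datum while traversing, although validate() already recurses on its own.

```python
from typing import Any, List, Tuple


class CountingValidator:
    """Hypothetical stand-in for a recursive validate(); counts its own calls."""

    def __init__(self) -> None:
        self.calls = 0

    def validate(self, schema: tuple, datum: Any) -> bool:
        self.calls += 1
        if schema[0] == "int":
            return isinstance(datum, int) and not isinstance(datum, bool)
        # Toy record: schema is ("record", [field schemas]), datum is a list.
        fields = schema[1]
        return (
            isinstance(datum, list)
            and len(datum) == len(fields)
            and all(self.validate(s, d) for s, d in zip(fields, datum))
        )


def write_data(validator: CountingValidator, schema: tuple, datum: Any,
               out: List[int]) -> None:
    # The problematic pattern: validate again at every level of the traversal,
    # on top of the recursion that validate() already performs.
    if not validator.validate(schema, datum):
        raise TypeError("datum does not match schema")
    if schema[0] == "int":
        out.append(datum)
    else:
        for field_schema, field_datum in zip(schema[1], datum):
            write_data(validator, field_schema, field_datum, out)


def nest(depth: int) -> Tuple[tuple, Any]:
    """Build a chain of single-field records wrapped around one int."""
    schema, datum = ("int",), 7
    for _ in range(depth):
        schema, datum = ("record", [schema]), [datum]
    return schema, datum


schema, datum = nest(3)  # 3 records + 1 int = 4 schema nodes

repeated = CountingValidator()
out: List[int] = []
write_data(repeated, schema, datum, out)
repeated_calls = repeated.calls

single = CountingValidator()
single.validate(schema, datum)
single_pass_calls = single.calls
```

      For this four-node chain, the repeated pattern makes 10 validate calls against 4 for a single pass, and the gap widens as nesting deepens.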

        Activity

        Doug Cutting added a comment -

        I just committed this. Thanks, Erik!

        Doug Cutting added a comment -

        This looks reasonable to me. I'll commit it soon unless someone objects.

        Erik Frey added a comment -

        Agreed! The scope of this patch, though, is just to address what looks to me like a logic error.

        Doug Cutting added a comment -

        It's not clear to me that we need to validate at all before we start writing. The write should fail on invalid data.

        AVRO-654 is also related. Recursive validation is also not required to select a union branch, and, in the worst case, can result in exponentially bad performance.
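
        One way to read the point about union branches is that a shallow, top-level type check can pick the branch without descending into the datum. The sketch below illustrates that idea only. It is not the change made for AVRO-654, and shallow_match and select_branch are hypothetical names. It also ignores the case where a union holds several record types, which a shallow check alone cannot tell apart.

```python
from typing import Any, Callable, Dict, Sequence

# Top-level checks only: none of these looks inside a container.
_SHALLOW_CHECKS: Dict[str, Callable[[Any], bool]] = {
    "null": lambda d: d is None,
    "boolean": lambda d: isinstance(d, bool),
    "int": lambda d: isinstance(d, int) and not isinstance(d, bool),
    "string": lambda d: isinstance(d, str),
    "array": lambda d: isinstance(d, list),
    "record": lambda d: isinstance(d, dict),
}


def shallow_match(schema_type: str, datum: Any) -> bool:
    """Return True if the datum's outermost Python type fits the schema type."""
    return _SHALLOW_CHECKS[schema_type](datum)


def select_branch(union_types: Sequence[str], datum: Any) -> int:
    """Return the index of the first union branch whose shallow check passes."""
    for index, schema_type in enumerate(union_types):
        if shallow_match(schema_type, datum):
            return index
    raise TypeError("datum matches no branch of the union")


array_branch = select_branch(["null", "string", "array"], ["a"])
null_branch = select_branch(["null", "int"], None)
```

        Each branch is tested in constant time, so selecting a branch costs at most one check per branch regardless of how deeply the datum is nested.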

        Philip Zeyliger added a comment -

        Patch looks reasonable to me. Haven't downloaded and tried it.

        Erik Frey added a comment -

        Ensures validation is done only once in the .write() method. In an ad hoc test, this reduced the time to serialize a datafile with a complex schema from 8 seconds to 5.5 seconds. Also includes a small test to ensure AvroTypeException is thrown before and after the patch.
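
        The shape of that change, and of the accompanying test, might look roughly like the sketch below. It is an assumption-laden toy and not the committed patch: ToyDatumWriter, ToyTypeException (standing in for AvroTypeException) and the tuple-based schemas are hypothetical.

```python
from typing import Any, List


class ToyTypeException(Exception):
    """Hypothetical stand-in for AvroTypeException."""


def validate(schema: tuple, datum: Any) -> bool:
    """Recursive check: ("int",) or ("record", [field schemas]) with list data."""
    if schema[0] == "int":
        return isinstance(datum, int) and not isinstance(datum, bool)
    fields = schema[1]
    return (
        isinstance(datum, list)
        and len(datum) == len(fields)
        and all(validate(s, d) for s, d in zip(fields, datum))
    )


class ToyDatumWriter:
    def __init__(self, schema: tuple) -> None:
        self.schema = schema

    def write(self, datum: Any, out: List[int]) -> None:
        # Validate exactly once, at the public entry point.
        if not validate(self.schema, datum):
            raise ToyTypeException("datum does not match schema")
        self._write_data(self.schema, datum, out)

    def _write_data(self, schema: tuple, datum: Any, out: List[int]) -> None:
        # No validation here: write() has already checked the whole datum.
        if schema[0] == "int":
            out.append(datum)
        else:
            for field_schema, field_datum in zip(schema[1], datum):
                self._write_data(field_schema, field_datum, out)


schema = ("record", [("int",), ("record", [("int",)])])
writer = ToyDatumWriter(schema)

good_out: List[int] = []
writer.write([1, [2]], good_out)

bad_out: List[int] = []
raised = False
try:
    writer.write([1, ["x"]], bad_out)  # nested field has the wrong type
except ToyTypeException:
    raised = True
```

        Because the single check runs before any output is produced, an invalid nested field still raises the type exception and nothing is written.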


          People

          • Assignee:
            Erik Frey
          • Reporter:
            Erik Frey
          • Votes:
            0
          • Watchers:
            1
