AVRO-91: Add JSON Encoder and Decoder in Python implementation

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: python
    • Labels: None

      Description

      Now that AVRO-50 is complete, it would be good to have a JSON encoder and decoder in Python.

        Issue Links

          Activity

          Ravi Gummadi added a comment -

          Planning to incorporate changes similar to AVRO-50 (and AVRO-90) in Python.

          Is there a better/simpler way of doing the same in Python?

          Doug Cutting added a comment -

          One can implement a JSON codec without implementing a parser. We do not want to force every Avro implementation to implement a parser.

          You will need to add some methods to your encoder/decoder API. Please look at my original patch of 6/26 for AVRO-50, where I implemented a JSON codec without a parser. In particular, note the addition of the {read,write}{Record,Union}{Start,End}, etc. methods in Encoder.java and Decoder.java.
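
          For illustration, here is a minimal Python sketch of what such hooks might look like. The class and method names (Encoder, JsonEncoder, write_record_start, and so on) are assumptions for this sketch, not the actual avro.io API; the point is only that the binary codec can treat the new calls as no-ops while the JSON codec uses them to emit structural characters.

import json


class Encoder(object):
    """Base encoder sketch: binary behaviour elided, hooks are no-ops."""

    def __init__(self, writer):
        self.writer = writer  # file-like object opened for writing

    # New structural hooks; the binary codec ignores them entirely.
    def write_record_start(self, schema):
        pass

    def write_record_end(self, schema):
        pass

    def write_union_start(self, schema, branch_name):
        pass

    def write_union_end(self, schema):
        pass


class JsonEncoder(Encoder):
    """JSON codec sketch: the same hooks emit braces and tags."""

    def write_record_start(self, schema):
        self.writer.write("{")

    def write_record_end(self, schema):
        self.writer.write("}")

    def write_union_start(self, schema, branch_name):
        # One plausible convention: tag the value with its branch name,
        # e.g. {"int": 3}; this convention is an assumption of the sketch.
        self.writer.write("{%s: " % json.dumps(branch_name))

    def write_union_end(self, schema):
        self.writer.write("}")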

          Sharad Agarwal added a comment -

          > In particular note the addition of {read,write}{Record,Union}{Start,End}, etc. methods in Encoder.java and Decoder.java.

          I think it is a reasonable trade-off as it avoids the parser complexity.

          Thiruvalluvan M. G. added a comment -

          One drawback of having {read,write}{Record,Union}{Start,End} methods is that all clients that use the decoder/encoder will have to generate these calls. This could be cumbersome for the clients and/or have a performance impact.

          Here is an approach that is not as complicated as the Java implementation of the parser. This parser is not as efficient as the one implemented in Java, but I guess performance is not vital for the JSON encoder/decoder, as their main purpose is diagnostics and debugging.

          Here I describe the Encoder, but the idea can be implemented for the decoder as well.

          The JSON encoder has a stack of "Markers". Markers are of these types: SCHEMA, RECORD_START, RECORD_END, FIELD, ARRAY_START, ARRAY_END, MAP_START, MAP_END, REPEATER, etc. The SCHEMA marker has a schema object associated with it. The REPEATER marker has one or two schema objects associated with it. The FIELD marker has the field name and the field number associated with it.
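
          As a concrete (and purely illustrative) representation of these markers in Python, one could use something like the following; the names mirror the comment above and are not from any actual implementation.

import collections

# A marker is a (kind, payload) pair. kind is one of: "SCHEMA",
# "RECORD_START", "RECORD_END", "FIELD", "ARRAY_START", "ARRAY_END",
# "MAP_START", "MAP_END", "REPEATER". payload carries the associated
# data: a schema object for SCHEMA, a tuple of one or two schemas for
# REPEATER, and (field_name, field_number) for FIELD.
Marker = collections.namedtuple("Marker", ["kind", "payload"])


def initial_stack(schema):
    """The encoder starts with a single SCHEMA marker for the top-level schema."""
    return [Marker("SCHEMA", schema)]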

          The method writeBoolean() will call advance(schema.BOOLEAN) before writing "true" or "false" into the underlying stream. Similarly writeInt() will call advance(schema.INT) before writing the decimal string corresponding to the int into the underlying stream. Other write() methods for primitive types call advance() with an appropriate schema type.

          The advance() method looks at the top of the stack. If the top of the stack is a SCHEMA marker and the schema matches the type passed to advance(), it simply pops the top element of the stack and returns. If the top of the stack is a SCHEMA marker but the schema type is a compound type (such as a record, map or array), it "expands" the top element (see below). If the top element is a SCHEMA marker, the schema is a non-compound type, and it does not match the argument type of advance(), it is an error. If the top element is not a SCHEMA marker, it inserts appropriate text into the output stream. For example, if it is a RECORD_START or MAP_START, an open brace is written. Similarly, if it is an ARRAY_START, an open square bracket is written. If it is a FIELD marker, the field name associated with that field is written, followed by a colon.
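
          A rough Python sketch of this advance() logic follows. All names are hypothetical; schema objects are assumed to expose a .type attribute (as the avro.schema classes do), primitive key schemas are represented by plain type-name strings for brevity, and expand() is sketched further below.

class JsonEncoderSketch(object):
    """Illustrative marker-stack JSON encoder; not the real avro.io API."""

    COMPOUND_TYPES = ("record", "array", "map", "union")

    def __init__(self, writer, schema):
        self.writer = writer
        self.stack = [("SCHEMA", schema)]  # start with the top-level schema

    # Primitive writers call advance() with the expected schema type
    # before emitting the JSON literal.
    def write_boolean(self, datum):
        self.advance("boolean")
        self.writer.write("true" if datum else "false")

    def write_int(self, datum):
        self.advance("int")
        self.writer.write(str(datum))

    def advance(self, expected_type):
        while True:
            kind, payload = self.stack[-1]
            if kind == "SCHEMA":
                # payload may be a schema object or a plain type-name string.
                schema_type = getattr(payload, "type", payload)
                if schema_type == expected_type:
                    self.stack.pop()       # matched: consume and return
                    return
                if schema_type in self.COMPOUND_TYPES:
                    self.expand()          # compound: expand in place, retry
                    continue
                raise ValueError("schema mismatch: expected %s, found %s"
                                 % (expected_type, schema_type))
            if kind == "REPEATER":
                self.expand()              # another array/map item is coming
                continue
            # Any other marker emits structural text and is consumed.
            self.stack.pop()
            if kind in ("RECORD_START", "MAP_START"):
                self.writer.write("{")
            elif kind == "ARRAY_START":
                self.writer.write("[")
            elif kind in ("RECORD_END", "MAP_END"):
                self.writer.write("}")
            elif kind == "ARRAY_END":
                self.writer.write("]")
            elif kind == "FIELD":
                name, _number = payload
                self.writer.write('"%s":' % name)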

          The expand() operation pops the top of the stack and replaces it with the expansion of that marker. Only SCHEMA markers with compound schema types or REPEATER markers get expanded. A record SCHEMA marker gets expanded to the sequence [RECORD_START, <FIELD, SCHEMA>*, RECORD_END]. The number of FIELD, SCHEMA pairs is the same as the number of fields of the record. The expanded sequence is pushed in reverse order; that is, RECORD_START will be at the top of the stack after expansion. An array SCHEMA marker gets expanded to {ARRAY_START, REPEATER, ARRAY_END}; the REPEATER has the schema of the element type of the array. A map SCHEMA marker gets expanded to {MAP_START, REPEATER, MAP_END}; the REPEATER has a string schema and the schema for the values of the map.

          Expanding a union is somewhat different: it replaces the union SCHEMA marker with a SCHEMA marker for the appropriate branch. A REPEATER marker is expanded to {SCHEMA, REPEATER} or {SCHEMA, SCHEMA, REPEATER}, where the SCHEMAs are the contents of the REPEATER. On reaching the end of an array/map, the REPEATER marker at the top of the stack gets discarded.
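
          Continuing the same hypothetical sketch, the expand() step for records, arrays, maps, unions and repeaters could look roughly like this. It is meant to slot into the JsonEncoderSketch class above; the accessors .fields, .items, .values and .schemas follow the avro.schema classes, and union branch selection is simplified to an index the caller is assumed to have recorded.

def expand(self):
    """Replace the top compound SCHEMA or REPEATER marker with its expansion."""
    kind, payload = self.stack.pop()
    schema_type = getattr(payload, "type", None)
    if kind == "SCHEMA" and schema_type == "record":
        expansion = [("RECORD_START", None)]
        for number, field in enumerate(payload.fields):
            expansion.append(("FIELD", (field.name, number)))
            expansion.append(("SCHEMA", field.type))
        expansion.append(("RECORD_END", None))
    elif kind == "SCHEMA" and schema_type == "array":
        expansion = [("ARRAY_START", None),
                     ("REPEATER", (payload.items,)),            # element schema
                     ("ARRAY_END", None)]
    elif kind == "SCHEMA" and schema_type == "map":
        expansion = [("MAP_START", None),
                     ("REPEATER", ("string", payload.values)),  # key, value
                     ("MAP_END", None)]
    elif kind == "SCHEMA" and schema_type == "union":
        # Assumption: the branch to write was recorded earlier, e.g. by a
        # write_union-style call that set self.pending_branch.
        expansion = [("SCHEMA", payload.schemas[self.pending_branch])]
    elif kind == "REPEATER":
        # One more item: push its schema(s), then the REPEATER again so
        # further items can follow; it is discarded at array/map end.
        expansion = [("SCHEMA", s) for s in payload] + [("REPEATER", payload)]
    else:
        raise ValueError("cannot expand marker %s" % kind)
    # Push in reverse so the first element of the expansion ends up on top.
    self.stack.extend(reversed(expansion))

          In a real implementation this would simply be a method of the encoder class; it is shown separately here only to follow the order of the discussion.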

          The above should take care of all aspects of JSON encoding except the commas that should appear between fields in a record or between elements of an array/map. The field-number part of the FIELD marker can be used to decide whether a comma needs to be inserted. Some additional information can be kept in the REPEATER to decide whether a comma is needed in arrays/maps.
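
          For example, the FIELD branch of the advance() sketch above could use the field number to decide on the comma (a purely illustrative helper; arrays/maps could likewise keep a per-REPEATER item count):

def write_field_prefix(writer, field_name, field_number):
    """Emit the separator (if needed) and the quoted field name with a colon.

    Every field after the first is preceded by a comma.
    """
    if field_number > 0:
        writer.write(",")
    writer.write('"%s":' % field_name)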

          Doug Cutting added a comment -

          > One drawback of having {read,write}{Record,Union}{Start,End} methods is that all clients that use decoder/encoder will have to generate these calls. This could be cumbersome for the clients and/or have performance impact.

          There are not many clients. Mostly it's just generic, since specific inherits from that, no? So this is only a problem if we expect applications to code directly to the Encoder/Decoder API. The Java parser was implemented with that in mind, so that folks could, e.g., write data in a streaming manner, without ever building objects. Do we expect folks to do this much in Python?

          As for performance, I would not expect two no-op method calls per record and union would impact things much.

          Thiruvalluvan M. G. added a comment -

          Actually, in addition to the {read,write}{Record,Union}{Start,End} methods, we need to introduce a method to write field names for each field of a record, add an additional parameter while writing enums, pass the enum schema while reading enums, and pass the union schema while reading/writing the branch. So the additional overhead will be something like one no-op call per field.
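
          Sketching those additions as hypothetical Python signatures (illustrative names, not the actual avro.io API; for the binary codec they can be no-ops or simply ignore the extra schema arguments):

class Encoder(object):
    # One extra call per record field to carry the field name.
    def write_field_name(self, field_name):
        pass

    # Extra schema parameter so a JSON codec can emit the symbol name.
    def write_enum(self, enum_schema, index):
        pass

    # Union schema (and branch index) so the branch can be written as a tag.
    def write_union_index(self, union_schema, index):
        pass


class Decoder(object):
    # Enum schema needed to map a symbol name back to its index.
    def read_enum(self, enum_schema):
        pass

    # Union schema needed to resolve which branch the JSON tag names.
    def read_union_index(self, union_schema):
        pass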

          Roman Inozemtsev added a comment -

          Is somebody working on this?

          Sergei Kuzmin added a comment -

          I've provided a patch in https://issues.apache.org/jira/browse/AVRO-1291. Can someone merge it into the library, or should I do this somehow?


            People

            • Assignee: Ravi Gummadi
            • Reporter: Doug Cutting
            • Votes: 1
            • Watchers: 7

              Dates

              • Created:
              • Updated:

                Development