  1. Apache Avro
  2. AVRO-4004

[Rust] Canonical form transformation does not strip the logicalType


Details

    Description

      The Rust implementation of the canonical form transformation does not strip the logicalType attribute, as required by the [STRIP] rule (https://avro.apache.org/docs/1.11.0/spec.html#Transforming+into+Parsing+Canonical+Form). This results in different fingerprints for the same schema compared to other implementations (at least Python and Java).
      This can become an issue, for instance, for kafka-delta-ingest (https://github.com/delta-io/kafka-delta-ingest).

      Rust

      # Cargo.toml
      [package]
      name = "avro-issue"
      version = "0.2.0"
      edition = "2018"

      [dependencies]
      apache-avro = "0.16.0"
      anyhow = "1.0.86"

      // src/main.rs
      use anyhow::Result;
      use apache_avro::{rabin::Rabin, Schema};

      fn main() -> Result<()> {
      
          let schema_str = r#"
            {
              "type": "record",
              "name": "test",
              "fields": [
                  {"name": "a", "type": "long", "default": 42, "doc": "The field a"},
                  {"name": "b", "type": "string", "namespace": "test.a"},
                  {"name": "c", "type": "long", "logicalType": "timestamp-micros"}
              ]
          }"#;
      
          let schema = Schema::parse_str(schema_str)?;
      
          let canonical_form = schema.canonical_form();
          let fp_rabin = schema.fingerprint::<Rabin>();
          println!("Canonical form: {}", canonical_form);
          println!("Rabin fingerprint: {}", fp_rabin);
          Ok(())
      }
      

      Output:

      Canonical form: {"name":"test","type":"record","fields":[{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"c","type":{"type":"long","logicalType":"timestamp-micros"}}]}
      Rabin fingerprint: 28cf0a67d9937bb3
      

      As you can see, the logicalType is still present in the "canonical form."

      Python

       
      import avro.schema
      
      schema_str = """
          {
              "type": "record",
              "name": "test",
              "fields": [
                  {"name": "a", "type": "long", "default": 42, "doc": "The field a"},
                  {"name": "b", "type": "string", "namespace": "test.a"},
                  {"name": "c", "type": "long", "logicalType": "timestamp-micros"}
              ]
          }"""
      
      schema = avro.schema.parse(schema_str)
      print(f"Canonical form: {schema.canonical_form}")
      print(f"Rabin fingerprint: {schema.fingerprint().hex()}")
      

      Output:

      Canonical form: {"name":"test","type":"record","fields":[{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"c","type":"long"}]}
      Rabin fingerprint: 385501e341b00a1c
      

      Java returns the same output as Python.

      In my opinion, changing the line
      https://github.com/apache/avro/blob/main/lang/rust/avro/src/schema.rs#L2159
      to

      //...
      if field_ordering_position(k).is_none() || k == "default" || k == "doc" || k == "aliases" || k == "logicalType" {
      //...
       

      should resolve the issue. However, I am unsure whether this line should also list further attributes (beyond those currently stated explicitly).
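
      On the question of which attributes to strip: the [STRIP] rule in the specification is an allow-list (type, name, fields, symbols, items, values, size), so everything else, including logicalType, is dropped. The following is a minimal Python sketch of that idea using only the standard library. It is not the code of any Avro implementation; the names `KEPT_KEYS`, `strip_schema` and `canonical_form_sketch` are made up for illustration, and the sketch deliberately ignores the [FULLNAMES], [INTEGERS] and [STRINGS] rules.

```python
import json

# Attributes kept by the [STRIP] rule, listed in the order required by [ORDER].
KEPT_KEYS = ["name", "type", "fields", "symbols", "items", "values", "size"]
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def strip_schema(node):
    """Apply simplified [STRIP], [ORDER] and [PRIMITIVES] rules to parsed schema JSON."""
    if isinstance(node, list):
        # Unions, field lists and symbol lists: process every element.
        return [strip_schema(item) for item in node]
    if isinstance(node, dict):
        # Allow-list: anything not in KEPT_KEYS (doc, default, logicalType, ...) is dropped.
        stripped = {key: strip_schema(node[key]) for key in KEPT_KEYS if key in node}
        # [PRIMITIVES]: {"type": "long"} collapses to "long" once nothing else is left.
        only_type = list(stripped) == ["type"]
        if only_type and isinstance(stripped["type"], str) and stripped["type"] in PRIMITIVES:
            return stripped["type"]
        return stripped
    return node


def canonical_form_sketch(schema_json):
    """Return a compact JSON string of the stripped schema (no whitespace)."""
    return json.dumps(strip_schema(json.loads(schema_json)), separators=(",", ":"))


# The schema from this report.
SAMPLE_SCHEMA = """
{
  "type": "record",
  "name": "test",
  "fields": [
      {"name": "a", "type": "long", "default": 42, "doc": "The field a"},
      {"name": "b", "type": "string", "namespace": "test.a"},
      {"name": "c", "type": "long", "logicalType": "timestamp-micros"}
  ]
}"""

if __name__ == "__main__":
    print(canonical_form_sketch(SAMPLE_SCHEMA))
```

      For the schema in this report, the sketch yields the same string as the Python output above. A nested `{"type": "long", "logicalType": "timestamp-micros"}` also collapses to `"long"`, which is the behaviour the Rust implementation currently lacks.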

      Nevertheless, the test in https://github.com/apache/avro/blob/fdab5db0816e28e3e10c87910c8b6f98c33072dc/lang/rust/avro/src/schema.rs#L3388
      must also be adapted to reflect the correct canonical form and the corresponding fingerprints:

      Rabin: 385501e341b00a1c
      MD5: 384f46367ef8c22dbbf44109b82ff7aa
      SHA-256: 8e72f58f2d84a59d6a08e8db5fdc6484dee35babf33179cea72889ae63083f36
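
      To illustrate why the leftover logicalType changes every fingerprint, here is a small Python sketch that hashes both canonical forms from this report. The Rabin part follows the CRC-64-AVRO pseudocode in the specification. The function names are hypothetical, and printing the Rabin value as little-endian hex is an assumption about how the outputs above were rendered. The resulting values have not been checked against the ones listed here.

```python
import hashlib

# Initial value / polynomial constant from the specification's CRC-64-AVRO definition.
EMPTY = 0xC15D213AA4D7A795


def _build_table():
    """Precompute the 256-entry lookup table described in the specification."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            # Mirrors the spec's `(fp >>> 1) ^ (EMPTY & -(fp & 1L))`.
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table


FP_TABLE = _build_table()


def rabin_fingerprint(data: bytes) -> int:
    """64-bit Rabin fingerprint (CRC-64-AVRO) of the given bytes."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ byte) & 0xFF]
    return fp


def fingerprints(canonical_form: str) -> dict:
    """Hash the UTF-8 bytes of a canonical form with all three algorithms."""
    data = canonical_form.encode("utf-8")
    return {
        # Assumption: little-endian byte order for the hex rendering.
        "rabin": rabin_fingerprint(data).to_bytes(8, "little").hex(),
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Canonical forms copied from the Rust and Python outputs in this report.
WITH_LOGICAL_TYPE = '{"name":"test","type":"record","fields":[{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"c","type":{"type":"long","logicalType":"timestamp-micros"}}]}'
STRIPPED = '{"name":"test","type":"record","fields":[{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"c","type":"long"}]}'

if __name__ == "__main__":
    for label, form in (("with logicalType", WITH_LOGICAL_TYPE), ("stripped", STRIPPED)):
        print(label, fingerprints(form))
```

      Because a fingerprint is computed over the bytes of the canonical form, any attribute that survives stripping changes the result for all three algorithms, not only for Rabin.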

            People

              mgrigorov Martin Tzvetanov Grigorov
              dominikm Dominik Mautz