[AVRO-4004] [Rust] Canonical form transformation does not strip the logicalType - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.12.0, 1.11.4
Component/s: rust
Labels:
- pull-request-available

Description

The Rust implementation of the Parsing Canonical Form transformation does not strip the logicalType attribute, as required by the [STRIP] rule (https://avro.apache.org/docs/1.11.0/spec.html#Transforming+into+Parsing+Canonical+Form). As a result, the same schema yields a different fingerprint in Rust than in other implementations (at least Python and Java).
This can become an issue, for instance, for kafka-delta-ingest (https://github.com/delta-io/kafka-delta-ingest).
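
To make the expected behaviour concrete, here is a minimal Python sketch of just two of the specification's rules: [STRIP] (keep only type, name, fields, symbols, items, values, size) and [ORDER] (emit them in the order name, type, fields, symbols, items, values, size). The helper name `strip_and_order` is hypothetical and is not part of any Avro library. The remaining rules (fullnames, primitives, integers, strings) are deliberately left out, so this is not a complete canonical-form implementation. It happens to be sufficient for the schema used in this report.

```python
import json

# Attributes kept by [STRIP], listed in the order required by [ORDER].
KEPT_ATTRIBUTES = ("name", "type", "fields", "symbols", "items", "values", "size")


def strip_and_order(node):
    """Apply only the [STRIP] and [ORDER] rules to a parsed JSON schema."""
    if isinstance(node, list):
        # Unions, field lists, and symbol lists: process every element.
        return [strip_and_order(item) for item in node]
    if isinstance(node, dict):
        # Keep the relevant attributes in canonical order; drop everything
        # else (doc, default, aliases, namespace, logicalType, ...).
        return {
            key: strip_and_order(node[key])
            for key in KEPT_ATTRIBUTES
            if key in node
        }
    # Strings and numbers (type names, symbols, sizes) pass through unchanged.
    return node


schema_str = """
    {
        "type": "record",
        "name": "test",
        "fields": [
            {"name": "a", "type": "long", "default": 42, "doc": "The field a"},
            {"name": "b", "type": "string", "namespace": "test.a"},
            {"name": "c", "type": "long", "logicalType": "timestamp-micros"}
        ]
    }"""

# Compact separators remove all whitespace between JSON tokens.
canonical = json.dumps(strip_and_order(json.loads(schema_str)), separators=(",", ":"))
print(canonical)
```

For this schema, the sketch drops logicalType together with default, doc, and namespace, so field "c" ends up as a plain "long".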

Rust

[package]
name = "avro-issue"
version = "0.2.0"
edition = "2018"

[dependencies]
apache-avro = "0.16.0"
anyhow = "1.0.86"

use anyhow::Result;
use apache_avro::{rabin::Rabin, Schema};


fn main() -> Result<()> {

    let schema_str = r#"
      {
        "type": "record",
        "name": "test",
        "fields": [
            {"name": "a", "type": "long", "default": 42, "doc": "The field a"},
            {"name": "b", "type": "string", "namespace": "test.a"},
            {"name": "c", "type": "long", "logicalType": "timestamp-micros"}
        ]
    }"#;

    let schema = Schema::parse_str(schema_str)?;

    let canonical_form = schema.canonical_form();
    let fp_rabin = schema.fingerprint::<Rabin>();
    println!("Canonical form: {}", canonical_form);
    println!("Rabin fingerprint: {}", fp_rabin);
    Ok(())
}

Output:

Canonical form: {"name":"test","type":"record","fields":[{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"c","type":{"type":"long","logicalType":"timestamp-micros"}}]}
Rabin fingerprint: 28cf0a67d9937bb3

As you can see, the logicalType is still present in the "canonical form."

Python

 
import avro.schema

schema_str = """
    {
        "type": "record",
        "name": "test",
        "fields": [
            {"name": "a", "type": "long", "default": 42, "doc": "The field a"},
            {"name": "b", "type": "string", "namespace": "test.a"},
            {"name": "c", "type": "long", "logicalType": "timestamp-micros"}
        ]
    }"""

schema = avro.schema.parse(schema_str)
print(f"Canonical form: {schema.canonical_form}")
print(f"Rabin fingerprint: {schema.fingerprint().hex()}")

Output:

Canonical form: {"name":"test","type":"record","fields":[{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"c","type":"long"}]}
Rabin fingerprint: 385501e341b00a1c

Java returns the same output as Python.

IMHO, changing the line
https://github.com/apache/avro/blob/main/lang/rust/avro/src/schema.rs#L2159
to

//...
 if field_ordering_position(k).is_none() || k == "default" || k == "doc" || k == "aliases"  || k == "logicalType" {
//...

should resolve the issue. However, I am unsure whether this condition should strip even more attributes than the ones currently listed explicitly.

In any case, the test in https://github.com/apache/avro/blob/fdab5db0816e28e3e10c87910c8b6f98c33072dc/lang/rust/avro/src/schema.rs#L3388
must also be adapted to reflect the correct canonical form and the corresponding fingerprints:

Rabin: 385501e341b00a1c
MD5: 384f46367ef8c22dbbf44109b82ff7aa
SHA-256: 8e72f58f2d84a59d6a08e8db5fdc6484dee35babf33179cea72889ae63083f36
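
As a rough way to cross-check fingerprints like these, the following Python sketch computes all three directly from the canonical form printed by the Python snippet above. The CRC-64-AVRO routine follows the table-driven algorithm given in the Avro specification. The function name `rabin_fingerprint` is hypothetical, and emitting the 64-bit value in little-endian byte order is an assumption (the specification defines only the numeric value). The results have not been verified here against the values listed in this report.

```python
import hashlib

# Seed ("empty" fingerprint) for CRC-64-AVRO, as given in the Avro specification.
EMPTY = 0xC15D213AA4D7A795


def _build_table():
    """Precompute the 256-entry lookup table used by CRC-64-AVRO."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            # -(fp & 1) is -1 (all bits set) or 0, so the mask selects EMPTY or 0.
            fp = (fp >> 1) ^ (EMPTY & -(fp & 1))
        table.append(fp)
    return table


_TABLE = _build_table()


def rabin_fingerprint(data: bytes) -> bytes:
    """Return the 64-bit Rabin fingerprint of data as 8 bytes (little-endian, assumed)."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ _TABLE[(fp ^ byte) & 0xFF]
    return fp.to_bytes(8, byteorder="little")


# Canonical form as printed by the Python implementation in this report.
canonical = (
    '{"name":"test","type":"record","fields":['
    '{"name":"a","type":"long"},'
    '{"name":"b","type":"string"},'
    '{"name":"c","type":"long"}]}'
).encode("utf-8")

rabin_hex = rabin_fingerprint(canonical).hex()
md5_hex = hashlib.md5(canonical).hexdigest()
sha256_hex = hashlib.sha256(canonical).hexdigest()

print(f"Rabin: {rabin_hex}")
print(f"MD5: {md5_hex}")
print(f"SHA-256: {sha256_hex}")
```

Hashing the canonical string rather than the original schema text is the point of the exercise: any attribute that survives stripping, such as logicalType, changes every one of these digests.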

Attachments

Issue Links

is related to

AVRO-1721 Should LogicalTypes introduce schema (in)compatibility and canonical parsing form changes?

Open

links to

GitHub Pull Request #2976

Activity

People

Assignee: Martin Tzvetanov Grigorov

Reporter: Dominik Mautz

Votes: 0

Watchers: 3

Dates

Created: 21/Jun/24 19:45

Updated: 17/Jul/24 11:18

Resolved: 12/Jul/24 14:41

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

0.5h