Apache Avro / AVRO-1783

Gracefully handle strings with wrong character encoding

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.7
    • Fix Version/s: 1.8.0
    • Component/s: ruby
    • Labels: None

    Description

      In the vote thread for Avro 1.8.0-rc2, busbey noticed that phunt's avro-rpc-quickstart fails:

      busbey$ ruby sample_ipc_client.rb avro_user pat Hello_World
      Avro::IO::AvroTypeError: The datum
      "\x89\xA9\xD1\xFF@NUm\xEA\x9A\xFB\xDAx\xF5Zq"
      is not an example of schema
      {"type":"fixed","name":"MD5","namespace":"org.apache.avro.ipc","size":16}
                    write_data at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:543
                  write_record at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:610
                          each at org/jruby/RubyArray.java:1613
                  write_record at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:609
                    write_data at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:561
                         write at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/io.rb:538
       write_handshake_request at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/ipc.rb:136
                       request at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/ipc.rb:105
                       request at /Users/busbey/.rvm/gems/jruby-1.7.3/gems/avro-1.8.0/lib/avro/ipc.rb:117
                        (root) at sample_ipc_client.rb:49
      

      I tried reproducing the error, and it is quite strange. avro-rpc-quickstart works fine for me in Ruby (MRI) 2.2 and 2.1, and in JRuby 1.7.23. However, busbey was using JRuby 1.7.3 (as visible from the path names above), and in this particular version of JRuby I was able to reproduce the issue.

      It seems that in some circumstances (but not always, bizarrely), JRuby 1.7.3 returns a UTF-8 encoded string from Digest::MD5.digest rather than a binary-encoded string. Schema.validate checks that the string is suitable for writing as a datum for a fixed type by calling #size, which counts characters rather than bytes. In this case, although the MD5 digest of the schema is a 16-byte string, it consists of only 13 characters when interpreted as UTF-8, because some byte sequences are read as multibyte characters. The size check against the schema's declared size of 16 therefore fails.
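
The difference between character count and byte count can be reproduced in any recent Ruby. The sketch below does not use the actual digest from the report. It builds a made-up 16-byte value that contains three valid two-byte UTF-8 sequences, so the numbers line up with the 16 versus 13 described above. The helper name fixed_datum_valid? is hypothetical and is not part of Avro's API; it only illustrates a size check based on #bytesize instead of #size.

```ruby
# Made-up 16-byte value: three 2-byte UTF-8 sequences plus ten ASCII bytes.
raw = ("\xC3\xA9" * 3 + "0123456789").b   # String#b returns a binary-encoded copy

binary = raw.dup
utf8   = raw.dup.force_encoding(Encoding::UTF_8)

binary.size     # => 16 (binary encoding: one character per byte)
utf8.size       # => 13 (each "\xC3\xA9" pair counts as a single character)
utf8.bytesize   # => 16 (byte count does not depend on the encoding tag)

# Hypothetical helper (not Avro's API): validate a fixed datum by byte length.
def fixed_datum_valid?(datum, expected_size)
  datum.is_a?(String) && datum.bytesize == expected_size
end

fixed_datum_valid?(utf8, 16)   # => true, even though utf8.size is 13
```

A check written this way gives the same answer whichever encoding the string happens to be tagged with.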

      Rather than trying to divine why JRuby is being weird here, I think this is an opportunity to fix Avro's handling of strings to make it robust against unexpected encodings.
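
One way such robustness could look, as a minimal sketch and not necessarily what the attached patches do, is to retag a string as binary before it is measured or written. The helper name to_binary is hypothetical. The sample value is the same made-up 16-byte string as before, deliberately tagged as UTF-8 to mimic the behaviour seen on JRuby 1.7.3.

```ruby
# Hypothetical helper (not part of Avro): return the same bytes tagged as binary.
# The input is never mutated; a copy is retagged only when needed.
def to_binary(datum)
  return datum if datum.encoding == Encoding::BINARY
  datum.dup.force_encoding(Encoding::BINARY)
end

# Made-up 16-byte value, explicitly tagged UTF-8.
digest_like = ("\xC3\xA9" * 3 + "0123456789").force_encoding(Encoding::UTF_8)

normalized = to_binary(digest_like)

digest_like.size      # => 13 (characters, as seen by the failing check)
normalized.size       # => 16 (bytes, which is what a fixed type cares about)
normalized.encoding   # => #<Encoding:ASCII-8BIT>
```

force_encoding only changes the encoding tag and leaves the bytes untouched, so the data written to the wire is identical either way.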

      Attachments

        1. AVRO-1783.patch
          3 kB
          Martin Kleppmann
        2. AVRO-1783.stack.text
          45 kB
          Ryan Blue
        3. AVRO-1783-2.patch
          6 kB
          Martin Kleppmann

          People

            Assignee: Martin Kleppmann (martinkl)
            Reporter: Martin Kleppmann (martinkl)