Avro / AVRO-565

Investigate Python encoding error

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: python
    • Labels: None

      Description

      Tyler B is seeing the following encoding error: http://avro.pastebin.com/b4HSYjCz.

        Activity

        Miki Tebeka added a comment -

        IMO it's not the library's role. To quote "The Zen of Python": In the face of ambiguity, refuse the temptation to guess.

        You can do the same in your code (try to encode; if that fails, convert to UTF-8).
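The caller-side approach suggested here might look roughly like the sketch below. It is written for Python 3, where `bytes` plays the role of the Python 2 `str` discussed in this thread. The helper name `to_text` and its parameters are hypothetical and are not part of Avro.

```python
def to_text(value, charset=None, fallback="utf-8"):
    """Return `value` as text, decoding bytes explicitly rather than implicitly."""
    if isinstance(value, str):
        return value  # already text, nothing to decode
    if charset:
        try:
            return value.decode(charset)
        except (UnicodeDecodeError, LookupError):
            pass  # declared charset was wrong or unknown; use the fallback
    # Fallback: decode as UTF-8 and mark undecodable bytes with U+FFFD.
    return value.decode(fallback, errors="replace")
```

Replacing undecodable bytes is a lossy choice made by the application, which is arguably why it belongs in calling code and not in the library.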

        Russell Jurney added a comment -

        Pardon me, I mean it would be neat if Avro worked with ugly strings.

        Russell Jurney added a comment -

        I create unicode objects when I know the codec, otherwise the decode crashes.

        if charset:
            subject = subject.decode(charset)

        It would be neat if Python worked with ugly strings.

        Miki Tebeka added a comment -

        Russell, the problem is that you pass a str with non-ASCII characters. Python has no way of knowing the encoding, and the default is 'ascii'. If you prepend a 'u' to make the body unicode:

        email_hash = {'body': u"Verit\xc3\xa1\r\nEstat\xc3\xadstica\r\n"}
        

        This will work since now it's a unicode string and it can be encoded.

        As a side note, these kinds of things were one of the reasons for Python 3. I also recommend viewing this video, which personally helped me.

        I'll give you time to respond before closing this.
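One caveat about the `u` prefix is worth illustrating. With a literal that contains UTF-8 byte escapes, the prefix avoids the exception, but each escape then becomes its own code point, so the text may not be what was intended. The sketch below uses Python 3, where an unprefixed string literal behaves like a Python 2 `u"..."` literal. The variable names are illustrative only.

```python
raw = b"Verit\xc3\xa1"        # UTF-8 bytes, as in a Python 2 str literal

decoded = raw.decode("utf-8")  # interpret the bytes as UTF-8 text
prefixed = "Verit\xc3\xa1"     # like u"Verit\xc3\xa1": code points U+00C3 and U+00A1

# Encoding the decoded text round-trips to the original bytes.
reencoded_decoded = decoded.encode("utf-8")
# Encoding the prefixed text encodes each of the two code points again.
reencoded_prefixed = prefixed.encode("utf-8")
```

Here `reencoded_decoded` equals `raw`, while `reencoded_prefixed` contains four bytes where the original had two, which is the kind of double encoding mentioned elsewhere in this thread.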

        Russell Jurney added a comment -

        Reproduces the error this ticket describes.

        Russell Jurney added a comment -

        Miki, the problem is reproduced here: https://gist.github.com/2381748 and in an attached file.

        In my use case, some emails don't contain charsets... so I can't convert them into unicode objects via decode() to make them play well with Avro. So Avro dies. I'd rather it throw an exception I can catch, and then continue to write the record, than stop mid-record. Either that, or I would like the ability to remove the last Avro record written, in the Python Avro client.
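One way to avoid stopping mid-record from the calling side is to decode every field before the record is handed to the writer, so a bad value fails early and can be skipped. This is a minimal sketch in Python 3 (`bytes` standing in for Python 2 `str`). The names `prepare_record` and `write_all` are hypothetical, and `append` stands for whatever function passes a record to Avro. Defaulting to UTF-8 when no charset is known is an assumption.

```python
def prepare_record(record, charsets=None):
    """Return a copy of `record` with every bytes value decoded to text.

    Raises UnicodeDecodeError before anything is written if a value
    cannot be decoded.
    """
    charsets = charsets or {}
    clean = {}
    for key, value in record.items():
        if isinstance(value, bytes):
            # Assumption: fall back to UTF-8 when no charset is known.
            value = value.decode(charsets.get(key, "utf-8"))
        clean[key] = value
    return clean


def write_all(records, append):
    """Call `append` for each record that decodes cleanly; return the skipped ones."""
    skipped = []
    for record in records:
        try:
            clean = prepare_record(record)
        except UnicodeDecodeError as exc:
            skipped.append((record, exc))  # keep the failure for later inspection
            continue
        append(clean)  # only reached with fully decoded text
    return skipped
```

Because decoding happens before `append` is called, a failure leaves nothing partially written for that record.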

        Miki Tebeka added a comment -

        Russell, can you provide an example input that causes the problem? Even better would be an example script that reproduces the problem.

        Russell Jurney added a comment -

        On further reflection...

        I have changed my code to have all strings become Unicode objects, which Avro can write: decode(mystring, mycharset)

        Before I was using UTF-8 encoded strings: mystring.decode(mycharset).encode('utf-8')

        As outlined above, this was resulting in double UTF-age. Suggest that the API/docs should only accept Unicode objects, not strings, since strings do not work well.

        Russell Jurney added a comment -

        When I change io.py's BinaryEncoder.write_utf8 to comment out the re-encoding to UTF-8, my problem goes away:

        def write_utf8(self, datum):
            datum = datum  # .encode("utf-8")
            self.write_bytes(datum)

        However, with the original code, not encoding to UTF-8 before writing the string field also fails. So I'm not sure what to do here. Is UTF-8 a requirement for Avro string fields? How should I fix this?
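On the question of whether UTF-8 is required: the Avro specification defines a string as a long giving the byte length, followed by that many bytes of UTF-8 encoded character data. The sketch below illustrates that layout in Python 3. It is not the library's code, and the function names `encode_long` and `encode_string` are hypothetical.

```python
def encode_long(n):
    """Encode an integer as an Avro long: zig-zag, then variable-length base 128."""
    n = (n << 1) ^ (n >> 63)  # zig-zag maps small negatives to small positives
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_string(text):
    """Encode text as an Avro string: byte length as a long, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data
```

Skipping the encode step only produces valid output when the bytes passed in are already UTF-8, which is why handing text objects to the writer is the safer route.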

        Russell Jurney added a comment -

        Oh man, this has been KILLING me for a couple weeks.

        What am I to do? What is the status of this? No matter what I do to the input field, python avro cannot write the field.

        Even though I do this: subject.decode(charset).encode('utf-8')

        The write still dies. Oh man this has been KILLING me. Help

        R. Tyler Croy added a comment -

        I found the source of our issues. It appears we have some places where UTF-8 encoded str objects are floating around (with UTF-8 code points in them) that were failing to "re-encode" to UTF-8.

        The solution we're working on is removing encoded strings from the code base and just using unicode objects for everything. I think we can close this ticket.

        Philip Zeyliger added a comment -

        What's the client call? What's the value of datum?

        I believe you see this error when you pass a str containing non-ASCII bytes, which Python cannot implicitly convert to unicode before encoding.

        >>> "\xe2".encode("utf-8")
        Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
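The reason an encode call reports a decode error is that Python 2's `str.encode` first decodes the byte string with the default codec (ASCII) and then encodes the result. The sketch below mimics that two-step behavior in Python 3, where `bytes` has no `encode` method. The helper name `py2_str_encode` is hypothetical.

```python
def py2_str_encode(raw, encoding="utf-8"):
    """Mimic Python 2's str.encode: implicit ASCII decode, then the requested encode."""
    return raw.decode("ascii").encode(encoding)


try:
    py2_str_encode(b"\xe2")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # the implicit ASCII decode step is what fails
```

Pure-ASCII input passes through unchanged, which is why the problem only appears once non-ASCII data arrives.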

        Jeff Hammerbacher added a comment -

        Copying the error here in case the pastebin expires:

          File "/usr/local/lib/python2.6/site-packages/avro/ipc.py", line 134, in request
            self.write_call_request(message_name, request_datum, buffer_encoder)
          File "/usr/local/lib/python2.6/site-packages/avro/ipc.py", line 181, in write_call_request
            self.write_request(message.request, request_datum, encoder)
          File "/usr/local/lib/python2.6/site-packages/avro/ipc.py", line 185, in write_request
            datum_writer.write(request_datum, encoder)
          File "/usr/local/lib/python2.6/site-packages/avro/io.py", line 720, in write
            self.write_data(self.writers_schema, datum, encoder)
          File "/usr/local/lib/python2.6/site-packages/avro/io.py", line 755, in write_data
            self.write_record(writers_schema, datum, encoder)
          File "/usr/local/lib/python2.6/site-packages/avro/io.py", line 843, in write_record
            self.write_data(field.type, datum.get(field.name), encoder)
          File "/usr/local/lib/python2.6/site-packages/avro/io.py", line 753, in write_data
            self.write_union(writers_schema, datum, encoder)
          File "/usr/local/lib/python2.6/site-packages/avro/io.py", line 833, in write_union
            self.write_data(writers_schema.schemas[index_of_schema], datum, encoder)
          File "/usr/local/lib/python2.6/site-packages/avro/io.py", line 733, in write_data
            encoder.write_utf8(datum)
          File "/usr/local/lib/python2.6/site-packages/avro/io.py", line 328, in write_utf8
            datum = datum.encode("utf-8")
        UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
        

          People

          • Assignee: Unassigned
          • Reporter: Jeff Hammerbacher
          • Votes: 1
          • Watchers: 3
