Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-4240

Python UDF output serialization is invalid

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 0.12.0, 0.12.1, 0.13.0, 0.12.2, 0.13.1
    • None
    • None
    • None

    Description

      The serialize_output function handles str and unicode return types from UDFs inappropriately.

      Namely, a function could return a valid UTF-8 string, but the call to unicode will trigger a decode using the default Python encoding, ascii which can cause UnicodeDecodeError s.

      Example:

      #!/usr/bin/env python
      # -*- coding: utf-8 -*-
      # pig_test.py
      
      s = "¼½¾"  # valid utf-8 string
      print unicode(s, 'utf-8')  # works
      print unicode(s)  # triggers UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
      

      I propose making a few changes to serialize_output:

      • Since ASCII (Python 2.x's default encoding) is a proper subset of UTF-8, there's no need to check for str being UTF-8 encoded, so we should remove checking where output_type == str
      • Add a check for output_type == unicode that ensures utf-8 via output.encode('utf-8')

      Patch forthcoming.

      Attachments

        1. PIG-4240-1.patch
          0.8 kB
          Mike Sukmanowsky

        Activity

          People

            Unassigned Unassigned
            msukmanowsky Mike Sukmanowsky
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: