Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-2613

Pig substitutes/mangles "upper ASCII" characters (values > 127)



    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 0.8.1
    • None
    • data, parser
    • None
    • linux

    • ASCII upper range high non-ASCII higher above 127 thorn character


      Create small/dummy input file that contains ASCII 254 (decimal) characters. These are often represented as the Thorn character. A sample line looks like this:


      but your browser may not render that correctly. Hex representation of that sample line:


      or, with spaces added for your convenience in reading:

      31 FE 34 FE 61 61 61 FE 62 62 62 FE 63 63 63 FE 64 64 64 FE 37 FE 38 FE 39 0D 0A

      You can see that this is just a sample line of plain ASCII numerals and lower-case letters, separated by the FE (hex) or 254 (decimal) code point.

      Now load, like this:

      dummyts = load '/test/DummyDataTS.txt' using PigStorage(',') as (line:chararray);

      A dump

      dump dummyts;

      shows this:


      The problem does not seem to be with the dump. I have written a UDF that counts characters in the line and returns TRUE if the character count is correct. When I do this:

      fd = filter dummyts by CountRight(line, 254, 8);

      which is saying "validate that there are 8 instances of the ASCII 254 code point/character" I get no results. When I do this:

      fd1 = filter dummyts by CountRight(line, 97, 3);

      which says "validate that there are three instances of the 'a' (ASCII 97) character the results are perfect.

      It looks like something in Pig's load is changing instances of ASCII 254 to the following three characters:



        1. DummyDataTS.txt
          0.2 kB
          Leo Heska



            Unassigned Unassigned
            leoheska Leo Heska
            0 Vote for this issue
            0 Start watching this issue