PIG-3749: PigPerformance - data in the map gets lost during parsing

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.12.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Release Note:
      Bug in PigPerformanceLoader: when reading bytes, the loop that looks for a termination character in a map does not check for the null value (ASCII 0).

      Description

      Create a PigMix sample dataset that looks as follows:
      keren 1 2 qt 3 4 5.0 aaaabbbb mccccddddeeeedmffffgggghhhh

      Launch the following query:
      A = load 'page_views_sample.txt' using org.apache.pig.test.pigmix.udf.PigPerformanceLoader()
      as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
      store A into 'L1out_A';

      B = foreach A generate user, (int)action as action, (map[])page_info as page_info, flatten((bag{tuple(map[])})page_links) as page_links;
      store B into 'L1out_B';

      The result looks like this:
      keren 1 b#bbb,a#aaa d#,e#eee,c#ccc
      keren 1 b#bbb,a#aaa [f#fff,g#ggg,h#hhh

      It is missing the 'ddd' value and a closing bracket.

      Thanks,
      Keren

      1. PIG-3749.patch
        0.5 kB
        Keren Ouaknine

        Activity

        Daniel Dai added a comment -

        I tried something similar but was not able to reproduce it.

        It seems your patch deals with the 0x00 in the bytearray. Is it in the middle of the bytearray or at the end? I checked DataGenerator, and it does not seem that we generate 0x00 in the middle. If it is at the end, shouldn't the loop also be bounded by b.length?

        Can you upload your page_views_sample with the offending record?
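
To make the question about 0x00 and b.length concrete, here is a minimal Java sketch of a scan that stops at a delimiter, at a 0x00 byte, or at the end of the array, whichever comes first. It is not the code from PigPerformanceLoader or from the attached patch; the class name DelimiterScan, the method findEnd, and the delimiter value 0x03 are assumptions made purely for illustration.

```java
public class DelimiterScan {
    // Hypothetical delimiter value, chosen for illustration only;
    // it is not taken from PigPerformanceLoader.
    static final byte ENTRY_DELIM = 0x03;

    /**
     * Returns the index of the first terminator at or after start.
     * A terminator is the delimiter, a 0x00 byte, or the end of the
     * array (b.length) when neither byte occurs.
     */
    static int findEnd(byte[] b, int start, byte delim) {
        int i = start;
        // The bounds check comes first so b[i] is never read past the end.
        while (i < b.length && b[i] != delim && b[i] != 0x00) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] withDelim = {'d', 'd', 'd', ENTRY_DELIM, 'e'};
        byte[] withNull = {'d', 'd', 'd', 0x00, 'e'};
        byte[] unterminated = {'d', 'd', 'd'};

        System.out.println(findEnd(withDelim, 0, ENTRY_DELIM));    // 3
        System.out.println(findEnd(withNull, 0, ENTRY_DELIM));     // 3
        System.out.println(findEnd(unterminated, 0, ENTRY_DELIM)); // 3
    }
}
```

In all three cases the value "ddd" spans indexes 0 to 2, so a scan written this way would not drop it whether the terminator is the delimiter, a null byte, or the end of the buffer. Whether the real loader hits any of these cases depends on the offending record, which has not been attached yet.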

        Daniel Dai added a comment -

        Keren Ouaknine, is this still an issue?

        Cheolsoo Park added a comment -

        Canceling patch while waiting for response.

        Prashant Kommireddi added a comment -

        Keren Ouaknine, moving this to 0.13; let me know if you have concerns with that. Also, can you please answer Cheolsoo's question above?

        Cheolsoo Park added a comment -

        I don't seem to be able to reproduce it. I used "keren 1 2 qt 3 4 5.0 aaaabbbb mccccddddeeeedmffffgggghhhh" as input, and it gives me the following:

        (keren	1	2	qt	3	4	5.0	aaaabbbb	mccccddddeeeemffffgggghhhh,,,,,,,,)
        (keren	1	2	qt	3	4	5.0	aaaabbbb	mccccddddeeeemffffgggghhhh,,,)
        

        I think I am not loading the data properly. Do you mind attaching a sample dataset to the jira?

        Also, can you post a patch that can be easily applied with patch < filename in the root directory? Not a big deal for small patches, but it's helpful to reviewers.

        Thanks!


          People

          • Assignee:
            Keren Ouaknine
            Reporter:
            Keren Ouaknine
          • Votes:
            0
            Watchers:
            4
