Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-739

Filter in foreach seems to drop records resulting in decreased count of records

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 0.3.0
    • 0.3.0
    • impl
    • None

    Description

      I have a Pig script in which I count the number of distinct records resulting from the filter, this statement is embedded in a foreach. The number of records I get with alias TESTDATA_AGG_2 is 1.

      TESTDATA =  load 'testdata' using PigStorage() as (timestamp:chararray, testid:chararray, userid: chararray, sessionid:chararray, value:long, flag:int);
      
      TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and timestamp lt '1230804000000' and value != 0);
      
      TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
      
      TESTDATA_AGG = foreach TESTDATA_GROUP {
                              A = filter TESTDATA_FILTERED by (userid eq sessionid);
                              C = distinct A.userid;
                              generate group as testid, COUNT(TESTDATA_FILTERED) as counttestdata, COUNT(C) as distcount, SUM(TESTDATA_FILTERED.flag) as total_flags;
                      }
      
      TESTDATA_AGG_1 = group TESTDATA_AGG ALL;
      
      -- count records generated through nested foreach which contains distinct
      TESTDATA_AGG_2 = foreach TESTDATA_AGG_1 generate COUNT(TESTDATA_AGG);
      
      --explain TESTDATA_AGG_2;
      dump TESTDATA_AGG_2;
      --RESULT (1L)
      

      But when I do the counting of records without the filter and distinct in the foreach I get a different value (20L)

      TESTDATA =  load 'testdata' using PigStorage() as (timestamp:chararray, testid:chararray, userid: chararray, sessionid:chararray, value:long, flag:int);
      
      TESTDATA_FILTERED = filter TESTDATA by (timestamp gte '1230800400000' and timestamp lt '1230804000000' and value != 0);
      
      TESTDATA_GROUP = group TESTDATA_FILTERED by testid;
      
      -- count records generated through simple foreach
      TESTDATA_AGG2 = foreach TESTDATA_GROUP generate group as testid, COUNT(TESTDATA_FILTERED) as counttestid, SUM(TESTDATA_FILTERED.flag) as total_flags;
      
      TESTDATA_AGG2_1 = group TESTDATA_AGG2 ALL;
      TESTDATA_AGG2_2 = foreach TESTDATA_AGG2_1 generate COUNT(TESTDATA_AGG2);
      dump TESTDATA_AGG2_2;
      --RESULT (20L)
      

      Attaching testdata

      Attachments

        1. testdata
          34 kB
          Viraj Bhat
        2. filter_distinctbug.pig
          1 kB
          Viraj Bhat

        Issue Links

          Activity

            People

              pkamath Pradeep Kamath
              viraj Viraj Bhat
              Votes:
              1 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: