Pig / PIG-2137

SAMPLE should not be pushed above DISTINCT

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.8.0, 0.8.1, 0.9.0, 0.10.0
    • Fix Version/s: 0.9.0, 0.10.0
    • Component/s: None
    • Labels: None

    Description

      I have an input file that contains 50,000 distinct integers. Each integer appears twice, for a total of 100,000 lines.

      Script 1, using GROUP BY to get distinct entries in the data, works:

      grunt> f = load 'tmp/dupnumbers.txt';              
      grunt> d = foreach (group f by $0) generate group; 
      grunt> s = sample d 0.01;                          
      grunt> n = foreach (group s all) generate COUNT(s);
      grunt> dump n;
      (493)
      

      Script 2, using DISTINCT for the same purpose, returns roughly twice as many rows, because SAMPLE is pushed above DISTINCT and the sampling runs before deduplication:

      grunt> f = load 'tmp/dupnumbers.txt';
      grunt> d = distinct f;
      grunt> s = sample d 0.01;
      grunt> n = foreach (group s all) generate COUNT(s);
      grunt> dump n;
      (980)
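
The gap between the two counts matches what the reordering predicts. Sampling 1% of the 50,000 distinct values should give about 50,000 × 0.01 = 500 rows. Sampling 1% of the 100,000 raw lines and then deduplicating should give about 50,000 × (1 − 0.99²) ≈ 995 rows, since both copies of a value are rarely picked together. The following is a minimal Python sketch of that reasoning, not Pig code. It assumes SAMPLE behaves like an independent per-row random filter; the function names and the seed are made up for illustration.

```python
import random


def bernoulli_sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction`
    (assumed model of SAMPLE, not Pig's actual implementation)."""
    return [r for r in rows if rng.random() < fraction]


def distinct_then_sample(rows, fraction, rng):
    """Intended order: deduplicate first, then sample."""
    return len(bernoulli_sample(sorted(set(rows)), fraction, rng))


def sample_then_distinct(rows, fraction, rng):
    """Order after SAMPLE is pushed above DISTINCT: sample first, then deduplicate."""
    return len(set(bernoulli_sample(rows, fraction, rng)))


# Same shape as the input described above: 50,000 distinct integers, each appearing twice.
rows = [i for i in range(50_000) for _ in range(2)]

rng = random.Random(2137)  # arbitrary fixed seed so the run is repeatable
intended = distinct_then_sample(rows, 0.01, rng)
pushed = sample_then_distinct(rows, 0.01, rng)
print("distinct then sample:", intended)
print("sample then distinct:", pushed)
```

Under this model the first count lands near 500 and the second near 1,000, in line with the 493 and 980 reported by the two scripts.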
      

      Attachments

        1. PIG-2137.1.patch
          4 kB
          Thejas Nair
        2. PIG-2137.2.patch
          5 kB
          Thejas Nair
        3. PIG-2137.patch
          4 kB
          Dmitriy V. Ryaboy

            People

              Assignee: Dmitriy V. Ryaboy (dvryaboy)
              Reporter: Dmitriy V. Ryaboy (dvryaboy)
              Votes: 0
              Watchers: 1
