Pig / PIG-420

Limit on nothing functionality

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Pig 2.0 implements the limit feature but as a standalone statement.

      Limit is very useful in debug mode, where we can run queries on a smaller amount of data (faster and on fewer nodes) to iron out issues, but in production mode we would like to run through all the data. It would be good to have an easy "switch" between debug and prod mode using the limit statement, without having to change the underlying code templates. Because LIMIT is a separate standalone statement, it is hard to parameterize the code.

      For instance, a query template might look like:
      A = LOAD '...';
      B = LIMIT A $N;
      C = FOREACH B ....

      In debug mode, we would like to set the variable $N to 100, but in prod mode we would like to set it to a 'special value' that disables the LIMIT and lets us run on all the data.
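
      Until such a special value exists, a wrapper can approximate the switch by choosing what it passes for $N. The sketch below is a hypothetical Python helper, not part of Pig: the function name `pig_command`, the script name `template.pig`, and the use of the largest 32-bit integer as the "no limit" value are all assumptions. The only Pig feature it relies on is the `-param name=value` command-line option. Note that the LIMIT operator still runs in prod mode; it just never truncates in practice.

```python
# Hypothetical wrapper: build the `pig` command line for debug or prod mode.
# Assumption: 2147483647 (the largest 32-bit int) stands in for "no limit".
NO_LIMIT = 2**31 - 1


def pig_command(script, debug, debug_rows=100):
    """Return the argv list that runs `script` with $N set for the chosen mode."""
    n = debug_rows if debug else NO_LIMIT
    return ["pig", "-param", f"N={n}", script]


if __name__ == "__main__":
    print(" ".join(pig_command("template.pig", debug=True)))
    print(" ".join(pig_command("template.pig", debug=False)))
```

      This keeps the template untouched, but it is only a stand-in for the requested feature, since the LIMIT step remains in the plan.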

        Activity

        Romain Rigaux added a comment -

        We have commands that look like Unix commands (e.g. top-queries) and run Pig scripts underneath. These commands take parameters such as -limit (how many results to return), and the user specifies -limit N, where N is an integer.
        This is then simply transformed into:

        B = LIMIT A $N;
        

        It would be nice if we could specify -limit * and have the compiler remove the statement (in case users want everything). Currently we use a custom limit UDF filter or LIMIT with Integer.MAX_VALUE (soon Long.MAX_VALUE!).
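
        As a rough illustration of what "removing the statement" could look like on the wrapper side, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `render` function, the `LIMIT_STMT` pattern, and the choice to replace the LIMIT with an identity projection (`FOREACH ... GENERATE *`) so that the alias B stays defined for later statements. It also assumes the template spells the statement exactly as `B = LIMIT A $N;` with an uppercase keyword.

```python
import re

# Matches one statement of the form "B = LIMIT A $N;" and captures
# the indentation, the output alias, and the input alias.
LIMIT_STMT = re.compile(
    r"^([ \t]*)(\w+)[ \t]*=[ \t]*LIMIT[ \t]+(\w+)[ \t]+\$N[ \t]*;",
    re.MULTILINE,
)


def render(template, limit):
    """Substitute $N, or neutralise the LIMIT statement when limit is '*'."""
    if limit == "*":
        # Assumption: an identity projection keeps the alias defined
        # without truncating the relation.
        return LIMIT_STMT.sub(r"\g<1>\g<2> = FOREACH \g<3> GENERATE *;", template)
    # int() rejects anything that is not a plain integer.
    return template.replace("$N", str(int(limit)))


TEMPLATE = "A = LOAD 'input';\nB = LIMIT A $N;\nC = FOREACH B GENERATE $0;\n"

if __name__ == "__main__":
    print(render(TEMPLATE, "100"))
    print(render(TEMPLATE, "*"))
```

        A text rewrite like this is fragile compared with compiler support, which is the point of the request.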

        Thejas M Nair added a comment -

        The idea proposed by Rekha seems to be a better alternative to 'limit on nothing'. It would be good to have something similar to C++ preprocessor macros. This way the "if debug" decisions can be made at compile time, and there will be no performance impact.

        Pig could have some syntax to denote debug-only sections of the Pig script, something like:

        a = load 'file';
        b = #IFDEF DEBUG { limit a, 100; } #ELSE { a; /*assuming we start supporting the syntax "b=a;" */}
        c = filter b by $0 == 1;
        #IFDEF DEBUG { store c into 'debug_file' ; }
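
        To make the compile-time idea concrete, the following Python sketch expands such directives before the script reaches Pig. It is purely illustrative: the `#IFDEF NAME { ... } #ELSE { ... }` syntax is the hypothetical one proposed in this comment, and `preprocess`, `DIRECTIVE`, and the sample script are names invented here. The sample uses `limit a 100` and `foreach a generate *` so that each branch is a statement Pig accepts today, and bodies containing `}` (nesting) are not handled.

```python
import re

# Hypothetical directive syntax:
#   #IFDEF NAME { ... } #ELSE { ... }   (the #ELSE part is optional)
# Limitation: a body must not contain '}', so nesting is unsupported.
DIRECTIVE = re.compile(
    r"#IFDEF\s+(\w+)\s*\{([^}]*)\}(?:\s*#ELSE\s*\{([^}]*)\})?"
)


def preprocess(script, defined):
    """Expand #IFDEF blocks, keeping the branch selected by `defined`."""
    def expand(match):
        name, then_body, else_body = match.groups()
        body = then_body if name in defined else (else_body or "")
        return body.strip()

    return DIRECTIVE.sub(expand, script)


SCRIPT = """a = load 'file';
b = #IFDEF DEBUG { limit a 100; } #ELSE { foreach a generate *; }
c = filter b by $0 == 1;
#IFDEF DEBUG { store c into 'debug_file'; }
"""

if __name__ == "__main__":
    print(preprocess(SCRIPT, {"DEBUG"}))  # debug build
    print(preprocess(SCRIPT, set()))      # prod build
```

        Because the expansion happens before Pig parses the script, the prod version contains no trace of the debug-only statements.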
        
        
        Rekha added a comment -

        Although the long/int issue of limit is taken care of by ticket 3201952, this use case is a subset of the broader need for a debug mode in Pig scripts.

        I faced a similar concern and wanted a debug mode for use cases such as storing intermediate data only in debug (not QA/prod), limiting the dataset, applying filters only in debug mode, etc.

        I worked around the issue by loading a dummy dataset whose single record had one column populated with the passed parameter value $DEBUG. Depending on the value of this column, processing was controlled.

        Since that was a roundabout way, I agree that built-in support for a debug mode in Pig scripts would help. Thanks!


          People

          • Assignee:
            Unassigned
            Reporter:
            Anand Murugappan
          • Votes:
            1
            Watchers:
            1
