Pig
  1. Pig
  2. PIG-3367

Add assert keyword (operator) in pig

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: parser
    • Labels:
      None

      Description

      Assert operator can be used for data validation. With assert you can write script as following-

      a = load 'something' as (a0:int, a1:int);
      assert a by a0 > 0, 'a cant be negative for reasons';
      

      This script will fail if assert is violated.

      1. PIG-3367.patch
        12 kB
        Aniket Mokashi
      2. PIG-3367-2.patch
        10 kB
        Aniket Mokashi

        Issue Links

          Activity

          Hide
          Alex Levenson added a comment -

          I think you should be able to provide a message:

          assert a by a0 > 0, 'a cant be negative for reasons'
          
          Show
          Alex Levenson added a comment - I think you should be able to provide a message: assert a by a0 > 0, 'a cant be negative for reasons'
          Hide
          Alex Levenson added a comment -

          It might even be nice to require a message

          Show
          Alex Levenson added a comment - It might even be nice to require a message
          Hide
          Aniket Mokashi added a comment -

          Good suggestion! I will add that to the requirement.

          Show
          Aniket Mokashi added a comment - Good suggestion! I will add that to the requirement.
          Hide
          Aniket Mokashi added a comment -

          We also need to decide if we want to fail fast if assertion fails (Terminate the job instantaneously). However, I am not sure if hadoop/pig has a friendly mechanism to do so.

          Show
          Aniket Mokashi added a comment - We also need to decide if we want to fail fast if assertion fails (Terminate the job instantaneously). However, I am not sure if hadoop/pig has a friendly mechanism to do so.
          Hide
          Alex Levenson added a comment -

          also: adding a keyword means that some scripts will fail (ones that had relations named assert)
          I remember that happening with the "rank" keyword.

          What's the big win from having this built into the language opposed to as a UDF?
          The potential to fail fast the entire script instead of letting each task fail one by one?

          Could a UDF (especially a new builtin one like this) have the same ability to fail fast the whole script?

          Show
          Alex Levenson added a comment - also: adding a keyword means that some scripts will fail (ones that had relations named assert) I remember that happening with the "rank" keyword. What's the big win from having this built into the language opposed to as a UDF? The potential to fail fast the entire script instead of letting each task fail one by one? Could a UDF (especially a new builtin one like this) have the same ability to fail fast the whole script?
          Hide
          Julien Le Dem added a comment -

          I was thinking we could make the syntax part of FOREACH.

          B = FOREACH A GENERATE a, b, c ASSERT a >= 0, b IS NOT NULL;
          

          That way it is easy to integrate asserts in the flow.

          The advantage of having it part of the language:

          • the error message can be clear without extra user input.
          • it's more natural than doing a filter that does not filter. Also if the filter is not in the predecessors of a STORE, it won't be executed.

          A UDF can stop the job by throwing an exception. Although the task will retry before failing completely.

          For reference, the UDF based syntax:

          FILTER members BY ASSERT( (member_id >= 0 ? 1 : 0), 'Doh! Some member ID is negative.' );
          

          Yes adding new keywords is inconvenient when the keyword was used for relation or column names.
          When a field collides with a keyword it is sometimes difficult to rename it.
          I think we should:

          • try to avoid new keywords if possible
          • provide a mechanism to escape field names to facilitate fixing conflicts when they happen (using quotes or a similar mechanism)
          Show
          Julien Le Dem added a comment - I was thinking we could make the syntax part of FOREACH. B = FOREACH A GENERATE a, b, c ASSERT a >= 0, b IS NOT NULL; That way it is easy to integrate asserts in the flow. The advantage of having it part of the language: the error message can be clear without extra user input. it's more natural than doing a filter that does not filter. Also if the filter is not in the predecessors of a STORE, it won't be executed. A UDF can stop the job by throwing an exception. Although the task will retry before failing completely. For reference, the UDF based syntax: FILTER members BY ASSERT( (member_id >= 0 ? 1 : 0), 'Doh! Some member ID is negative.' ); Yes adding new keywords is inconvenient when the keyword was used for relation or column names. When a field collides with a keyword it is sometimes difficult to rename it. I think we should: try to avoid new keywords if possible provide a mechanism to escape field names to facilitate fixing conflicts when they happen (using quotes or a similar mechanism)
          Hide
          Alex Levenson added a comment -

          Making it part of foreach looks more readable.
          I think there should still be an option for custom error messages.

          Show
          Alex Levenson added a comment - Making it part of foreach looks more readable. I think there should still be an option for custom error messages.
          Hide
          Arun Ahuja added a comment -

          Isn't this available with http://linkedin.github.io/datafu/docs/javadoc/datafu/pig/util/ASSERT.html, so it this mainly about bringing it into the language?

          Show
          Arun Ahuja added a comment - Isn't this available with http://linkedin.github.io/datafu/docs/javadoc/datafu/pig/util/ASSERT.html , so it this mainly about bringing it into the language?
          Hide
          Aniket Mokashi added a comment -

          @Arun Ahuja, correct. We feel that it should be part of the language itself. Very useful indeed.

          Show
          Aniket Mokashi added a comment - @ Arun Ahuja , correct. We feel that it should be part of the language itself. Very useful indeed.
          Hide
          Aniket Mokashi added a comment -

          B = FOREACH A GENERATE a, b, c ASSERT a >= 0, b IS NOT NULL;

          I think this is a little ambiguous. Does that mean following is a right syntax?

          B = FOREACH A GENERATE a, b, c ASSERT d >= 0;

          I think assert should be it's own statement- just like other languages.

          A UDF can stop the job by throwing an exception. Although the task will retry before failing completely.

          Do we want to have a mechanism to identify and fast-fail? (It would be hacky, (publish a counter and client kills it etc.) but useful for this case).

          Show
          Aniket Mokashi added a comment - B = FOREACH A GENERATE a, b, c ASSERT a >= 0, b IS NOT NULL; I think this is a little ambiguous. Does that mean following is a right syntax? B = FOREACH A GENERATE a, b, c ASSERT d >= 0; I think assert should be it's own statement- just like other languages. A UDF can stop the job by throwing an exception. Although the task will retry before failing completely. Do we want to have a mechanism to identify and fast-fail? (It would be hacky, (publish a counter and client kills it etc.) but useful for this case).
          Hide
          Aniket Mokashi added a comment -

          I have attached a patch, but it does not have fast-fail mechanism yet. I will open a jira to have a generic framework for fail-fast from a UDF and modify assert to use that later.

          Show
          Aniket Mokashi added a comment - I have attached a patch, but it does not have fast-fail mechanism yet. I will open a jira to have a generic framework for fail-fast from a UDF and modify assert to use that later.
          Hide
          Aniket Mokashi added a comment -

          I opened PIG-3452 to add fail-fast mechanism.

          Show
          Aniket Mokashi added a comment - I opened PIG-3452 to add fail-fast mechanism.
          Hide
          Julien Le Dem added a comment -

          Looks good to me.
          Is there a way you can factor out some of the content of buildAssertOp() ? It looks like some of this would be common with other methods.

          Show
          Julien Le Dem added a comment - Looks good to me. Is there a way you can factor out some of the content of buildAssertOp() ? It looks like some of this would be common with other methods.
          Hide
          Aniket Mokashi added a comment -

          Incorporated code review comments

          Show
          Aniket Mokashi added a comment - Incorporated code review comments
          Hide
          Aniket Mokashi added a comment -

          Committed to trunk. Thanks Julien for the review!

          Show
          Aniket Mokashi added a comment - Committed to trunk. Thanks Julien for the review!
          Hide
          Cheolsoo Park added a comment -

          Aniket Mokashi, would you mind updating the documentation - PIG-3483?

          Show
          Cheolsoo Park added a comment - Aniket Mokashi , would you mind updating the documentation - PIG-3483 ?
          Hide
          Aniket Mokashi added a comment -

          Thanks Cheolsoo Park, I will try to wrap it up today.

          Show
          Aniket Mokashi added a comment - Thanks Cheolsoo Park , I will try to wrap it up today.
          Hide
          Daniel Dai added a comment -

          Seems the syntax:

          assert a by a0 > 0, 'a cant be negative for reasons';
          

          does not work, Grunt parser will complain. We will need to use the following syntax:

          b = assert a by a0 > 0, 'a cant be negative for reasons';
          

          Is that expected?

          Show
          Daniel Dai added a comment - Seems the syntax: assert a by a0 > 0, 'a cant be negative for reasons'; does not work, Grunt parser will complain. We will need to use the following syntax: b = assert a by a0 > 0, 'a cant be negative for reasons'; Is that expected?

            People

            • Assignee:
              Aniket Mokashi
              Reporter:
              Aniket Mokashi
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development