Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-4704

Customizable Error Handling for Storers in Pig

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.16.0
    • None
    • None
    • Patch Available
    • Reviewed

    Description

      On Thu, Oct 15, 2015 at 4:06 AM, Saggi Neumann <saggi@xplenty.com> wrote:
      You may also check these for ideas. It would be good to have them
      implemented:

      https://wiki.apache.org/pig/PigErrorHandlingInScripts
      https://issues.apache.org/jira/browse/PIG-2620

      Saggi Neumann

      Co-founder and CTO, Xplenty

      M: +972-544-546102

      On Thu, Oct 15, 2015 at 12:17 AM, Siddhi Mehta <smehtauser@gmail.com> wrote:

      > Hello Everyone,
      >
      > Just wanted to follow up on the my earlier post and see if there are any
      > thoughts around the same.
      > I was planning to take a stab to implement the same.
      >
      > The approach I was planning to use for the same is
      > 1. Make the storer that wants error handling capability implement an
      > interface(ErrorHandlingStoreFunc).
      > 2. Using this interface the storer can define if the thresholds for
      > error.Each store func can determine what the threshold should be.For
      > example HbaseStorage can have a different threshold from ParquetStorage.
      > 3. Whenever the storer gets created in
      >
      > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc()
      > we intercept the called and give it a wrappedStoreFunc
      > 4. Every put next calls now gets delegated to the actual storer via the
      > delegate and we can listen in for error on putNext() and take care of the
      > allowing the error if within threshold or re throwing from there.
      > 5. The client can get information about the threshold value from the
      > counters to know if there was any data dropped.
      >
      > Thougts?
      >
      > Thanks,
      > Siddhi
      >
      >
      > On Mon, Oct 12, 2015 at 1:49 PM, Siddhi Mehta <smehtauser@gmail.com>
      > wrote:
      >
      > > Hey Guys,
      > >
      > > Currently a Pig job fails when one record out of the billions records
      > > fails on STORE.
      > > This is not always desirable behavior when you are dealing with millions
      > > of records and only few fail.
      > > In certain use-cases its desirable to know how many such errors and have
      > > an accounting for the same.
      > > Is there a configurable limits that we can set for pig so that we can
      > > allow a threshold for bad records on STORE similar to the lines of the
      > JIRA
      > > for LOAD PIG-3059 <https://issues.apache.org/jira/browse/PIG-3059>
      > >
      > > Thanks,
      > > Siddhi
      > >
      >

      Attachments

        1. PIG-4704_3.patch
          24 kB
          Siddhi Mehta
        2. PIG-4704_4.patch
          27 kB
          Siddhi Mehta
        3. PIG-4704.patch
          23 kB
          Siddhi Mehta
        4. PIG-4704.patch
          20 kB
          Siddhi Mehta

        Issue Links

          Activity

            People

              siddhimehta Siddhi Mehta
              siddhimehta Siddhi Mehta
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: