Pig / PIG-2344

UDF / LoadFunc / StoreFunc should be serializable

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      If there is a redesign, this should be a requirement. It would let us do away with saving state that was created in the frontend and then recreating the same state in the backend.
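
As a rough illustration of the idea, the sketch below round-trips a configured function object through Java serialization, so that the state built in the "frontend" arrives intact in the "backend". `PrefixUdf` and `SerializableUdfDemo` are hypothetical stand-ins and do not use Pig's real EvalFunc/LoadFunc/StoreFunc classes; how Pig would actually ship the bytes is not specified in this issue.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a UDF; it does not extend Pig's EvalFunc.
class PrefixUdf implements Serializable {
    private static final long serialVersionUID = 1L;

    // State that is computed once in the frontend.
    private final String prefix;

    PrefixUdf(String prefix) {
        this.prefix = prefix;
    }

    String exec(String input) {
        return prefix + input;
    }
}

class SerializableUdfDemo {
    // "Frontend": turn the configured instance into bytes to ship with the job.
    static byte[] toBytes(Serializable udf) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(udf);
        }
        return buffer.toByteArray();
    }

    // "Backend": rebuild the instance, including its frontend state.
    static PrefixUdf fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (PrefixUdf) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        PrefixUdf frontend = new PrefixUdf("id-");
        PrefixUdf backend = fromBytes(toBytes(frontend));
        System.out.println(backend.exec("42")); // prints "id-42"
    }
}
```

With this shape there is no need to stash the prefix in a side channel in the frontend and read it back in the backend; the object carries it.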

        Issue Links

          Activity

          Thomas Weise added a comment -

          Serialization alone would not help where a UDF's exec(..) depends on state that must be initialized in the place where exec(..) runs. One current workaround is to initialize that state lazily from exec(..) itself, which guarantees it happens where the action is.
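
The lazy-initialization workaround could look roughly like the following sketch. `LazyLookupUdf` is a hypothetical stand-in that does not extend Pig's EvalFunc, and the in-memory map stands in for whatever backend-only resource (for example a local file) the real UDF would need.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a UDF; it does not extend Pig's EvalFunc.
class LazyLookupUdf {
    // Built only where exec(..) actually runs, never in the constructor.
    private Map<String, String> lookup;

    // Counts how often the expensive setup ran (for demonstration only).
    int loads = 0;

    // Stand-in for backend-only setup such as reading a local file.
    private Map<String, String> loadLookup() {
        loads++;
        Map<String, String> table = new HashMap<>();
        table.put("us", "United States");
        table.put("de", "Germany");
        return table;
    }

    String exec(String code) {
        if (lookup == null) {
            // First call on this instance: initialize here, where the work happens.
            lookup = loadLookup();
        }
        return lookup.getOrDefault(code, "unknown");
    }
}

class LazyLookupDemo {
    public static void main(String[] args) {
        LazyLookupUdf udf = new LazyLookupUdf();
        System.out.println(udf.exec("de")); // prints "Germany"
        System.out.println(udf.exec("fr")); // prints "unknown"
        System.out.println(udf.loads);      // prints "1"
    }
}
```

The cost is a null check on every call and setup logic buried inside exec(..), which is part of what the hooks proposed below are meant to avoid.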

          Discussing solutions with Ashutosh, we identified the following needs:

          Pig should construct a UDF through its default (no-arg) constructor. No more repeated instantiation through UDF constructors with arguments.

          Pig should call initialize(...) in the frontend, with the arguments provided for the UDF.

          Pig should call preExec() once in the backend; this would be the place for things like local file system access.

          There should probably also be a postExec() hook for any cleanup.

          Finally, backward compatibility needs to be addressed so that existing UDFs don't suddenly stop working.
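
The needs listed above can be sketched as a small lifecycle contract. Everything here is hypothetical: `UdfLifecycle`, `SuffixUdf`, and `LifecycleDriver` are illustrative names, the hook names are taken from this comment rather than from Pig's API, and both phases run in one JVM for simplicity (shipping the initialized instance to the backend is out of scope). Default methods are one possible way to let a function that only implements exec(..) keep working; they do not address existing UDFs that lack a no-arg constructor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical lifecycle contract; not part of Pig's actual API.
interface UdfLifecycle {
    default void initialize(String... args) {} // frontend, with the UDF's arguments
    default void preExec() {}                  // backend, once before any exec call
    String exec(String input);
    default void postExec() {}                 // backend, once for cleanup
}

class SuffixUdf implements UdfLifecycle {
    final List<String> calls = new ArrayList<>(); // records the call order
    private String suffix = "";

    public SuffixUdf() {} // the only constructor the framework uses

    @Override public void initialize(String... args) {
        calls.add("initialize");
        suffix = args.length > 0 ? args[0] : "";
    }
    @Override public void preExec() { calls.add("preExec"); }
    @Override public String exec(String input) { calls.add("exec"); return input + suffix; }
    @Override public void postExec() { calls.add("postExec"); }
}

class LifecycleDriver {
    // Frontend: construct through the no-arg constructor, then pass the arguments.
    static <T extends UdfLifecycle> T frontend(Class<T> type, String... args)
            throws ReflectiveOperationException {
        T udf = type.getDeclaredConstructor().newInstance();
        udf.initialize(args);
        return udf;
    }

    // Backend: preExec once, exec per input, postExec even if exec throws.
    static List<String> backend(UdfLifecycle udf, List<String> inputs) {
        List<String> results = new ArrayList<>();
        udf.preExec();
        try {
            for (String input : inputs) {
                results.add(udf.exec(input));
            }
        } finally {
            udf.postExec();
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        SuffixUdf udf = frontend(SuffixUdf.class, "!");
        System.out.println(backend(udf, Arrays.asList("a", "b"))); // prints "[a!, b!]"
        System.out.println(udf.calls); // prints "[initialize, preExec, exec, exec, postExec]"

        // A function that ignores the hooks still runs unchanged.
        UdfLifecycle legacy = s -> s.toUpperCase();
        System.out.println(backend(legacy, Arrays.asList("x"))); // prints "[X]"
    }
}
```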

          Dmitriy V. Ryaboy added a comment -

          I'm a fan of the general idea, but let's rethink those method names and provide cleaner (complete) lifecycle methods.

          How are you going to make sure this is backwards compatible? Some UDFs might not even have no-arg constructors.

          Ashutosh Chauhan added a comment -

          A few related problems that can possibly be fixed without the redesign:

          First, Pig instantiates a LoadFunc/StoreFunc (LF/SF) three times in the frontend and calls different methods of the interface on different objects, making it impossible to communicate state between the constructor and the other methods within the frontend. An illustration can be found in HCatLoader, which saves the schema into UDFContext in its constructor and retrieves it again in the frontend itself. This is a nasty, nasty hack. The problem manifests itself in other places in HCatalog too, making the code there brittle.

          Second, if these LF/SF functions perform non-idempotent actions, they have to work around that too.

          Third, some of these methods are passed a JobConf, but writing anything into it is useless because that JobConf is thrown away. After making the call on the interface, Pig should save this JobConf and, when it later instantiates the real JobConf, initialize it from the saved one.
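
The third point could be handled along the lines of the sketch below: keep whatever a frontend callback wrote and seed the real job configuration with it. A plain `Map<String, String>` stands in for Hadoop's JobConf, and `ConfCarryOver`, `callFrontendHook`, `newJobConf`, and the key `example.loader.schema` are hypothetical names chosen for illustration, not Pig or Hadoop APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper; a Map stands in for Hadoop's JobConf.
class ConfCarryOver {
    // Settings written by frontend callbacks, kept instead of discarded.
    private final Map<String, String> captured = new HashMap<>();

    // Hand the callback a scratch conf, then remember whatever it wrote.
    void callFrontendHook(Consumer<Map<String, String>> hook) {
        Map<String, String> scratch = new HashMap<>();
        hook.accept(scratch);
        captured.putAll(scratch);
    }

    // Later, when the real job conf is created, initialize it from the saved values.
    Map<String, String> newJobConf() {
        return new HashMap<>(captured);
    }
}

class ConfCarryOverDemo {
    public static void main(String[] args) {
        ConfCarryOver pig = new ConfCarryOver();

        // A loader writes a setting during a frontend call.
        pig.callFrontendHook(conf -> conf.put("example.loader.schema", "a:int,b:chararray"));

        Map<String, String> jobConf = pig.newJobConf();
        System.out.println(jobConf.get("example.loader.schema")); // prints "a:int,b:chararray"
    }
}
```

How saved values should interact with settings Pig itself writes into the real JobConf is not covered in this issue; the sketch simply copies the saved values in first.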


            People

            • Assignee:
              Unassigned
              Reporter:
              Ashutosh Chauhan
            • Votes:
              0
              Watchers:
              4

              Dates

              • Created:
                Updated: