[TEZ-1062] Create SimpleProcessor for processors that only need to implement the run method - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.5.0
Component/s: None
Labels: None
Hadoop Flags: Reviewed

Description

The SimpleProcessor could take care of common chores such as starting inputs and committing outputs. It would handle no events, since simple processors don't need to handle them. Thus the user would only need to implement their custom task logic in a new execute() method.

Attachments

TEZ-1062.1.patch, 21/Apr/14 23:06, 7 kB, Mohammad Islam
TEZ-1062.2.patch, 24/Apr/14 02:27, 22 kB, Mohammad Islam
TEZ-1062.3.patch, 26/Apr/14 02:22, 22 kB, Mohammad Islam
TEZ-1062.4.patch, 27/Apr/14 23:54, 22 kB, Bikas Saha

Activity

Bikas Saha added a comment - 16/Apr/14 17:44

kamrul This is motivated by the discussion we had on TEZ-700 about simplifying most of the existing processors we currently have. We skipped that in TEZ-700, but we can now build on TEZ-695 and create another abstract class, SimpleProcessor, that provides impls for all base methods except for the run method. We can use your suggestion of having SimpleProcessor.run(Map<Inputs, Outputs>) call input.start() and then call the real run(), similar to what we did for the initialize() method. This will really simplify writing simple processors.
This won't work for Input/Output since they usually have to handle events (unlike Processor). What do you think?
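The proposal above is essentially a template method, and could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from any of the attached patches: `Input` and `Output` are hypothetical stand-ins for the Tez logical input/output types, and `preOp`, `postOp`, `getInputs` and `getOutputs` are names chosen here for the hooks and getters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SimpleProcessorSketch {
  /** Hypothetical stand-in for a Tez logical input. */
  interface Input { void start() throws Exception; }

  /** Hypothetical stand-in for a Tez logical output. */
  interface Output { }

  /** Base class: subclasses only implement the no-arg run(). */
  abstract static class SimpleProcessor {
    private Map<String, Input> inputs;
    private Map<String, Output> outputs;

    /** Framework entry point: pre-op, user logic, post-op. */
    public final void run(Map<String, Input> inputs, Map<String, Output> outputs)
        throws Exception {
      this.inputs = inputs;
      this.outputs = outputs;
      preOp();
      run();
      postOp();
    }

    /** Starts every input; tolerates a null input map. */
    protected void preOp() throws Exception {
      if (inputs != null) {
        for (Input input : inputs.values()) {
          input.start();
        }
      }
    }

    /** Empty by default; a subclass could commit outputs here. */
    protected void postOp() throws Exception { }

    /** The only method a simple processor has to write. */
    public abstract void run() throws Exception;

    protected Map<String, Input> getInputs() { return inputs; }
    protected Map<String, Output> getOutputs() { return outputs; }
  }

  /** Runs a tiny processor and returns the order in which things happened. */
  static List<String> demo() {
    List<String> log = new ArrayList<>();
    Map<String, Input> inputs = new LinkedHashMap<>();
    inputs.put("a", () -> log.add("start a"));
    inputs.put("b", () -> log.add("start b"));
    SimpleProcessor processor = new SimpleProcessor() {
      @Override
      public void run() {
        log.add("run with " + getInputs().size() + " inputs");
      }
    };
    try {
      processor.run(inputs, null); // outputs may legitimately be null
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return log;
  }

  public static void main(String[] args) {
    demo().forEach(System.out::println);
  }
}
```

The two-argument `run` is final in this sketch so that a subclass cannot accidentally skip the pre-op and post-op steps; whether the real class should enforce that is a design choice the thread does not settle.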

Mohammad Islam added a comment - 19/Apr/14 03:09

Sure.

bikassaha How will it be related to TEZ-694? Will that be done separately based on this SimpleProcessor?

Bikas Saha added a comment - 19/Apr/14 18:30

For now, we can start with just the code that got left out of TEZ-700 and provide the empty impls along with a pre-op that calls input.start() for all inputs. Let's see how that looks. Then we can add a post-op that examines all outputs and calls commit for all instances of MROutput.

Mohammad Islam added a comment - 21/Apr/14 23:06

WIP patch to get early feedback.

Bikas Saha added a comment - 22/Apr/14 21:04

Looks good overall.

This needs to be in the tez-runtime-library project under a new package, org.apache.tez.runtime.library.processor.

In general, either inputs or outputs could be null.

+    Preconditions.checkNotNull(inputs, "inputs can't be null");
+    Preconditions.checkNotNull(outputs, "ouputs can't be null");

This means we will need null checks in other places.

How about exposing the inputs and outputs via getters and renaming this to run()?

+  public abstract void execute(Map<String, LogicalInput> inputs, Map<String, LogicalOutput> outputs)
+      throws Exception;

I think it makes sense to move this code into the postOp of SimpleProcessor.
Secondly, we should call getContext().canCommit() only once. Sorry, the code in the original UnionExample is wrong. We need to check whether a commit is required. If yes, get permission from the context, then commit all outputs that need a commit. If any output fails to commit, we should abort all the outputs that needed a commit.

+    protected void postOp(Map<String, LogicalInput> inputs, Map<String, LogicalOutput> outputs)
+        throws Exception {
+      for (LogicalOutput output : outputs.values()) {
+        if ((output instanceof MROutput) && (((MROutput) output).isCommitRequired())) {
+          while (!getContext().canCommit()) {
+            Thread.sleep(100);
+          }
+          ((MROutput) output).commit();
+        }
+      }

There are 3 pure Tez examples now, WordCount, UnionExample and BroadcastAndOneToOneExample. We should change all of them to use the SimpleProcessor where it makes sense.

Mohammad Islam added a comment - 23/Apr/14 02:07 - edited

Thanks again bikassaha for the review.
Need some clarifications:

Secondly, we should call getContext().canCommit() only once ...

Struggling to find the correct ordering of two checks. How about this?

+   while (!getContext().canCommit()) {
+        Thread.sleep(100);
+    }
+   for (LogicalOutput output : outputs.values()) {
+        if ((output instanceof MROutput) && (((MROutput) output).isCommitRequired())) {
+          ((MROutput) output).commit();
+        }
+    }

Related question:

If any output fails to commit then we should abort all the outputs that needed commit.

I think outputs that have already been committed can't be aborted.

How do we determine that an output has failed? When output.commit() throws an exception?

By abort, do you mean breaking out of the loop?

EDITED:

I think it makes sense to move this code into the postOp of SimpleProcessor.

MROutput is not accessible from tez-runtime-library. If I include tez-mapreduce (where MROutput resides) in the pom.xml of the tez-runtime-library module, it creates a circular dependency. Any thoughts?

Bikas Saha added a comment - 23/Apr/14 19:42

We should first go through all outputs to check if they are MROutput and if they need a commit. Collect these. If the collection is non-empty, ask for commit permission. After that, call commit on all collected outputs. If any commit throws an exception, call abort on all collected outputs.
We can create a SimpleProcessor in tez-runtime-library and a SimpleProcessorWithMRCommit that derives from SimpleProcessor in the tez-mapreduce project.
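The collect, ask, commit, abort sequence described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the patch itself: `Output`, `CommittableOutput` and `Context` are hypothetical stand-ins (the thread special-cases MROutput instead of using a general interface), and the 100 ms polling interval is borrowed from the snippet quoted earlier in the thread. Permission is requested once for the whole task, not once per output.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommitSketch {
  /** Hypothetical stand-in for any task output. */
  interface Output { }

  /** Hypothetical stand-in for an output that supports commit/abort. */
  interface CommittableOutput extends Output {
    boolean isCommitRequired();
    void commit() throws IOException;
    void abort() throws IOException;
  }

  /** Hypothetical stand-in for the processor context. */
  interface Context { boolean canCommit() throws IOException; }

  static void commitAll(Context context, Iterable<? extends Output> outputs)
      throws IOException, InterruptedException {
    // 1. Collect the outputs that actually need a commit.
    List<CommittableOutput> toCommit = new ArrayList<>();
    for (Output output : outputs) {
      if (output instanceof CommittableOutput
          && ((CommittableOutput) output).isCommitRequired()) {
        toCommit.add((CommittableOutput) output);
      }
    }
    if (toCommit.isEmpty()) {
      return; // nothing to commit, so no permission is needed
    }
    // 2. Wait for permission once, for the task as a whole.
    while (!context.canCommit()) {
      Thread.sleep(100);
    }
    // 3. Commit everything; on the first failure stop and abort all.
    try {
      for (CommittableOutput output : toCommit) {
        output.commit();
      }
    } catch (IOException commitFailure) {
      for (CommittableOutput output : toCommit) {
        try {
          output.abort();
        } catch (IOException abortFailure) {
          commitFailure.addSuppressed(abortFailure);
        }
      }
      throw commitFailure;
    }
  }

  /** Test double that records calls and can be told to fail on commit. */
  static final class RecordingOutput implements CommittableOutput {
    private final String name;
    private final boolean failOnCommit;
    private final List<String> log;

    RecordingOutput(String name, boolean failOnCommit, List<String> log) {
      this.name = name;
      this.failOnCommit = failOnCommit;
      this.log = log;
    }

    @Override public boolean isCommitRequired() { return true; }

    @Override public void commit() throws IOException {
      if (failOnCommit) {
        throw new IOException("commit failed for " + name);
      }
      log.add("commit " + name);
    }

    @Override public void abort() { log.add("abort " + name); }
  }

  /** Commits "a", fails on "b", then aborts both collected outputs. */
  static List<String> demo() {
    List<String> log = new ArrayList<>();
    List<Output> outputs = Arrays.asList(
        new RecordingOutput("a", false, log),
        new Output() { }, // e.g. an intermediate output: never committed
        new RecordingOutput("b", true, log));
    try {
      commitAll(() -> true, outputs);
    } catch (IOException | InterruptedException e) {
      log.add("failed: " + e.getMessage());
    }
    return log;
  }

  public static void main(String[] args) {
    demo().forEach(System.out::println);
  }
}
```

Note that in this sketch abort is also called on outputs whose commit already succeeded; whether an already committed output can meaningfully be aborted is exactly the question raised earlier in the thread, and the sketch does not answer it.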

Hitesh Shah added a comment - 23/Apr/14 19:50

We should first go through all outputs to check if they are MROutput and if they need commit.

This is wrong IMO. Checking for specific classes should not be done. Either assume all outputs need a commit or query the output to check if it requires a commit. I believe currently pretty much all current outputs ( intermediate data outputs too ) that write to disk need a commit as this is a task level commit to handle cases such as speculative attempts.

Siddharth Seth added a comment - 23/Apr/14 22:41

Intermediate outputs don't need a commit. They always write to their own specific paths - and if multiple attempts succeed (speculative), it should be possible to pull data from any of the successful attempts.

Bikas Saha added a comment - 23/Apr/14 23:12

We should first go through all outputs to check if they are MROutput

This is what we were discussing in TEZ-694 but we decided that we do not have enough context to formalize an API on the output that will tell us if an output needs commit. We only have 1 output that needs commit and we can special case that.
https://issues.apache.org/jira/browse/TEZ-694?focusedCommentId=13972156&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13972156

Mohammad Islam added a comment - 24/Apr/14 02:27

Uploaded a new patch addressing the review comments.
I haven't run the example code yet. I will do that after the review.

Bikas Saha added a comment - 26/Apr/14 00:06

Shouldn't we break out here?

+          output.commit();
+        } catch (IOException ioe) {
+          willAbort = true;
+          savedEx = ioe;
+        }

Bikas Saha added a comment - 26/Apr/14 00:08

Overall this looks good to me. Hitesh, please let me know if the separation of SimpleProcessor and SimpleMRProcessor addresses your concerns based on the comments linked in TEZ-694. SimpleMRProcessor is the one that takes care of the MROutput commit.

Mohammad Islam added a comment - 26/Apr/14 00:40

bikassaha

Shouldn't we break out here?

Yes, I missed that. I will fix it once we get feedback from Hitesh.

Hitesh Shah added a comment - 26/Apr/14 01:04

bikassaha Sounds good. As long as the MROutput commit is invoked only for the MR processor, this should be fine. I am assuming that someone using a SimpleProcessor is expected to call commit on the outputs themselves?

Bikas Saha added a comment - 26/Apr/14 01:17 - edited

Yes. SimpleProcessor has a postOp() that is invoked after run(). The derived class can call commit in its impl of the postOp() method.
kamrul I think we can proceed with the final patch.

Mohammad Islam added a comment - 26/Apr/14 02:22

new patch

Bikas Saha added a comment - 27/Apr/14 23:54

Attaching commit patch with minor changes.

Bikas Saha added a comment - 27/Apr/14 23:56

Thanks for your contribution. Committed.
commit b084c7f8d12ff38ff3d824734e009a1fc71ee20b
Author: Bikas Saha <bikas@apache.org>
Date: Sun Apr 27 16:55:16 2014 -0700

TEZ-1062. Create SimpleProcessor for processors that only need to implement the run method (Mohammad Kamrul Islam via bikas)

Bikas Saha added a comment - 06/Sep/14 01:35

Bulk close for jiras fixed in 0.5.0.

People

Assignee: Mohammad Islam
Reporter: Bikas Saha
Votes: 0
Watchers: 4

Dates

Created: 16/Apr/14 17:11
Updated: 06/Sep/14 01:35
Resolved: 27/Apr/14 23:56
