Description
There can be problems when Oozie cannot update the database properly. Recently, we have experienced erratic behavior with two setups:
- MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic locking which might cause a transaction to rollback if there are two or more parallel transaction running and one of them cannot complete because of a conflict.
- MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, Oozie might get "Communications link failure" exception during the failover.
The problem is that failed DB transactions later might cause a workflow (which are started/re-started by RecoveryService) to get stuck. It's not clear to us how this happens but it has to do with the fact that certain DB updates are not executed.
The solution is to use some sort of retry logic with exponential backoff if the DB update fails. We could start with a 100ms wait time which is doubled at every retry. The operation can be considered a failure if it still fails after 10 attempts. These values could be configurable. We should discuss initial values in the scope of this JIRA.
Note that this solution is to handle transient failures. If the DB is down for a longer period of time, we have to accept that the internal state of Oozie is corrupted.
Attachments
Attachments
- OOZIE-2854.006.patch
- 115 kB
- Andras Piros
- OOZIE-2854.007.patch
- 126 kB
- Andras Piros
- OOZIE-2854.008.patch
- 127 kB
- Andras Piros
- OOZIE-2854.009.patch
- 132 kB
- Andras Piros
- OOZIE-2854.010.patch
- 132 kB
- Andras Piros
- OOZIE-2854.011.patch
- 132 kB
- Andras Piros
- OOZIE-2854.012.patch
- 134 kB
- Andras Piros
- OOZIE-2854.013.patch
- 136 kB
- Andras Piros
- OOZIE-2854-001.patch
- 24 kB
- Peter Bacsko
- OOZIE-2854-002.patch
- 23 kB
- Peter Bacsko
- OOZIE-2854-003.patch
- 23 kB
- Peter Bacsko
- OOZIE-2854-004.patch
- 25 kB
- Peter Bacsko
- OOZIE-2854-005.patch
- 24 kB
- Peter Bacsko
- OOZIE-2854-POC-001.patch
- 18 kB
- Peter Bacsko
Issue Links
Activity
Hi Steven - no, I haven't tried it with Oracle RAC. In fact, I haven't tried it with these MySQL setups either. One of our customers experienced this issue and we have a MySQL expert who pointed out how Galera works and that's the reason why we saw exceptions in the Oozie server logs.
About the optimistic locking in Galera, here is a nice blog entry which explains in detail how this might cause problems (plus proposes a workaround): https://severalnines.com/blog/avoiding-deadlocks-galera-set-haproxy-single-node-writes-and-multi-node-reads
Hi Peter,
Yes I found the article and I set up a lab to test it.
It seems that is works well if I only run it on two nodes. Meaning the
locking is not an issue if only to active nodes are used. But executing it
on the third one will generate a deadlock on of them.. One way to fix this
is to use a DNS or ProxyHA and send write traffic to only two nodes.
Sending to only one as in the blog limits the throughput. Also I am working
with Cloudera on this take a look at case 132505.
Steven
–
This email is intended only for the individual(s) to whom it is addressed
and may be a confidential
communication protected by law. Any unauthorized use, dissemination,
distribution, disclosure,
or copying is prohibited. Please notify the sender immediately by return
email, if you believe
you have received this message in error, and please delete it from your
system.
I uploaded a POC which implements a retry logic.
To do:
- Add tests
- Make hard-coded values configurable
- Discuss which is better: number of max retries or a timeout
andras.piros, gezapeti, asasvari, rkanter pls check the uploaded patch.
Thanks for the PoC. I haven't tried it out, only looked at the code, but it seems good overall. I think a max number of retries is more reliable than a timeout, as we don't know how long each query will take (and I imagine different queries with different amounts of data will take varying amounts of time too). A few other comments:
- The new log message when getting the single results, LOG.info("No results found");, doesn't seem helpful because it has no context. I think we can just remove it.
- I think we should use a for loop in DBOperationRetryHandler instead of setting retries to 0 and doing a retries+ in the while condition. for (int retries = 0; retires < maxRetryCount; retries+)
- Also in DBOperationRetryHandler, I would change the log messages. People will see the error message and panic and might not notice the info message right after. I think we should combine the two messages into a warn message
LOG.warn("Operation failed, Sleeping {0} msecs before retry #{1}", waitTime, retries, e)
And we should have it log an error message only when the last retry fails, which should also have a different message in any case because we're not going to do any more retries.
Btw, are you sure about the for loop? I think we usually do for loops where we want to fully iterate through a range. To me a while loop is more natural here.
pbacsko thanks for the POC, I would also vote for having max retries. I also agree to handle the last exception.
I only have minor comments so far
- I would extract the anonymous classes passed to retryHandler.executeWithRetry into a separate classes
- em.getTransaction() calls could be assigned to local variable.
- There are call() methods that could throw less general exceptions (JPAExecutorException is enough in some cases)
- To me DBOperationRetryHandler looks like a general retryhandler (nothing DB specific, except name and package).
1. I think anon classes are perfectly valid here. The logic inside the callables are strongly tied to the methods that invoke them.
2. We can certainly throw JPAExecutorException from executeWithRetry() but since it's a checked exception it must be declared with throws. So the method would lose the "generic" nature that you mentioned.
em.getTransaction() calls could be assigned to local variable
I just have one problem with this, are we sure that em.getTransaction() returns the same transaction object after we retry? I checked OpenJPA and looks like it does, because the entity manager code implements the Transaction interface, but I'm not sure it's a good practice to rely on such explicit knowledge.
Btw, are you sure about the for loop? I think we usually do for loops where we want to fully iterate through a range. To me a while loop is more natural here.
I'm not a fan of modifying a variable during a condition check (the post-increment ++ in the while condition). I don't feel too strongly on this point, so if you prefer to leave it, that's fine. Another alternative is to put the retries++ inside the while loop instead of the condition.
Ok, moved retries++ to the statement body
Uploaded first version of the patch. Contains tests & updated oozie-default.xml
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
+1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. +1 the patch does not introduce any line longer than 132
. +1 the patch does adds/modifies 1 testcase(s)
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings
0 There are [4] new bugs found in total that would be nice to have fixed.
. +1 There are no new bugs found in [server].
. +1 There are no new bugs found in [client].
. 0 There are [4] new bugs found in [core] that would be nice to have fixed.
. You can find the FindBugs diff here: core/findbugs-new.html
. +1 There are no new bugs found in [docs].
. +1 There are no new bugs found in [hadooplibs/hadoop-utils-2].
. +1 There are no new bugs found in [tools].
. +1 There are no new bugs found in [examples].
. +1 There are no new bugs found in [sharelib/streaming].
. +1 There are no new bugs found in [sharelib/sqoop].
. +1 There are no new bugs found in [sharelib/distcp].
. +1 There are no new bugs found in [sharelib/oozie].
. +1 There are no new bugs found in [sharelib/hcatalog].
. +1 There are no new bugs found in [sharelib/hive].
. +1 There are no new bugs found in [sharelib/hive2].
. +1 There are no new bugs found in [sharelib/pig].
. +1 There are no new bugs found in [sharelib/spark].
+1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. +1 the patch does not modify JPA files
-1 TESTS - patch does not compile, cannot run testcases
+1 DISTRO
. +1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3805/
Thanks for the patch!
Some comments:
- There is a TODO left in QueryExecutor.
- DBOperationRetryHandler is a generic retry handler, only the constants in it are referring to DB. Is it possible to either rename it and find a better place for the constants or instead of a Callable have some functions like DBOperationRetryHandler.executeUpdate(EntityManager em, Query q) which calls the q.executeUpdate() with the retry logic?
I don't see clearly if there is a nice way to define this interface though.
The question is which is better from a responsibility standpoint - should the retry handler be responsible for transaction handling? Or should it handle the Callable<?> failures only? In my opinion, it's better to separate the two (think of SRP). The client code which uses the retry handler should expect & handle the case when the code inside the callable is invoked again.
If we keep the class generic, we can rename it to RetryHandler.
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
+1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. +1 the patch does not introduce any line longer than 132
. +1 the patch does adds/modifies 1 testcase(s)
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings
0 There are [4] new bugs found in total that would be nice to have fixed.
. +1 There are no new bugs found in [server].
. +1 There are no new bugs found in [client].
. 0 There are [4] new bugs found in [core] that would be nice to have fixed.
. You can find the FindBugs diff here: core/findbugs-new.html
. +1 There are no new bugs found in [docs].
. +1 There are no new bugs found in [hadooplibs/hadoop-utils-2].
. +1 There are no new bugs found in [tools].
. +1 There are no new bugs found in [examples].
. +1 There are no new bugs found in [sharelib/streaming].
. +1 There are no new bugs found in [sharelib/sqoop].
. +1 There are no new bugs found in [sharelib/distcp].
. +1 There are no new bugs found in [sharelib/oozie].
. +1 There are no new bugs found in [sharelib/hcatalog].
. +1 There are no new bugs found in [sharelib/hive].
. +1 There are no new bugs found in [sharelib/hive2].
. +1 There are no new bugs found in [sharelib/pig].
. +1 There are no new bugs found in [sharelib/spark].
+1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. +1 the patch does not modify JPA files
-1 TESTS - patch does not compile, cannot run testcases
+1 DISTRO
. +1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3806/
Another option is to use a 3rd party library like https://github.com/rholder/guava-retrying so we don't have to re-invent the wheel
I only took a quick look, but it's got a bunch of different options and retry strategies/policies.
That's a great idea Robert. I should have googled that first. Will check it out.
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
-1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. +1 the patch does not introduce any line longer than 132
. -1 the patch does not add/modify any testcase
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
-1 COMPILE
. +1 HEAD compiles
. -1 patch does not compile
. +1 the patch does not seem to introduce new javac warnings
-1 There are [181] new bugs found below threshold in total that must be fixed.
. -1 There are [98] new bugs found below threshold in [core] that must be fixed, listing only the first [5] ones.
. You can find the FindBugs diff here (look for the red and orange ones): core/findbugs-new.html
. The top [5] most important FindBugs errors are:
. At CLIParser.java:[line 169]: Found reliance on default encoding in org.apache.oozie.cli.CLIParser.showHelp(CommandLine): new java.io.PrintWriter(OutputStream)
. At CLIParser.java:[line 57]: new org.apache.oozie.cli.CLIParser(String, String[]) may expose internal representation by storing an externally mutable object into CLIParser.cliHelp
. At CLIParser.java:[lines 80-104]: Should org.apache.oozie.cli.CLIParser$Command be a static inner class?
. At OozieCLI.java:[line 845]: Found reliance on default encoding in org.apache.oozie.cli.OozieCLI.getConfiguration(OozieClient, CommandLine): new java.io.FileReader(File)
. At OozieCLI.java:[line 2175]: Found reliance on default encoding in org.apache.oozie.cli.OozieCLI.validateCommandV41(CommandLine): new java.io.FileReader(File)
. +1 There are no new bugs found in [client].
. -1 There are [1] new bugs found below threshold in [docs] that must be fixed.
. You can find the FindBugs diff here (look for the red and orange ones): docs/findbugs-new.html
. The most important FindBugs errors are:
. At StreamingMain.java:[line 70]: org.apache.oozie.action.hadoop.StreamingMain.addActionConf(JobConf, Configuration) concatenates strings using + in a loop
. -1 There are [2] new bugs found below threshold in [sharelib/streaming] that must be fixed.
. You can find the FindBugs diff here (look for the red and orange ones): sharelib/streaming/findbugs-new.html
. The most important FindBugs errors are:
. At HiveMain.java:[line 340]: Found reliance on default encoding in org.apache.oozie.action.hadoop.HiveMain.readStringFromFile(String): new java.io.FileReader(String)
. Obligation to clean up resource created at HiveMain.java:[line 184] is not discharged: org.apache.oozie.action.hadoop.HiveMain.setUpHiveSite() may fail to clean up java.io.OutputStream on checked exception
. Path continues at HiveMain.java:[line 185]
. -1 There are [3] new bugs found below threshold in [sharelib/hive] that must be fixed.
. You can find the FindBugs diff here (look for the red and orange ones): sharelib/hive/findbugs-new.html
. The most important FindBugs errors are:
. At Hive2Main.java:[line 281]: Found reliance on default encoding in org.apache.oozie.action.hadoop.Hive2Main.readStringFromFile(String): new java.io.FileReader(String)
. At Hive2Main.java:[line 270]: Found reliance on default encoding in org.apache.oozie.action.hadoop.Hive2Main.runBeeline(String[], String): new java.io.PrintStream(OutputStream)
. At Hive2Main.java:[line 273]: org.apache.oozie.action.hadoop.Hive2Main.runBeeline(String[], String) invokes System.exit(...), which shuts down the entire virtual machine
. +1 There are no new bugs found in [sharelib/hive2].
. -1 There are [19] new bugs found below threshold in [sharelib/sqoop] that must be fixed, listing only the first [5] ones.
. You can find the FindBugs diff here (look for the red and orange ones): sharelib/sqoop/findbugs-new.html
. The top [5] most important FindBugs errors are:
. At LauncherMain.java:[line 124]: Found reliance on default encoding in org.apache.oozie.action.hadoop.LauncherMain.getHadoopJobIds(String, Pattern[]): new java.io.FileReader(String)
. At LauncherMain.java:[line 160]: Found reliance on default encoding in org.apache.oozie.action.hadoop.LauncherMain.writeExternalChildIDs(String, Pattern[], String): String.getBytes()
. At LauncherMain.java:[line 61]: org.apache.oozie.action.hadoop.LauncherMain.HADOOP_SITE_FILES should be both final and package protected
. At LauncherMain.java:[line 327]: Exceptional return value of java.io.File.mkdirs() ignored in org.apache.oozie.action.hadoop.LauncherMain.writeHadoopConfig(String, File)
. At LauncherMapper.java:[line 407]: Found reliance on default encoding in org.apache.oozie.action.hadoop.LauncherMapper.getLocalFileContentStr(File, String, int): new java.io.FileReader(File)
. +1 There are no new bugs found in [sharelib/oozie].
. -1 There are [8] new bugs found below threshold in [sharelib/hcatalog] that must be fixed, listing only the first [5] ones.
. You can find the FindBugs diff here (look for the red and orange ones): sharelib/hcatalog/findbugs-new.html
. The top [5] most important FindBugs errors are:
. At OoziePigStats.java:[line 130]: org.apache.oozie.action.hadoop.OoziePigStats.toJSONFromMultiStoreCounters(Map) makes inefficient use of keySet iterator instead of entrySet iterator
. At PigMain.java:[line 334]: Dead store to klass in org.apache.oozie.action.hadoop.PigMain.runPigJob(String[], String, boolean, boolean)
. At PigMain.java:[line 306]: Found reliance on default encoding in org.apache.oozie.action.hadoop.PigMain.handleError(String): new java.io.FileReader(String)
. At PigMain.java:[line 407]: Found reliance on default encoding in org.apache.oozie.action.hadoop.PigMain.writeExternalData(String, File): new java.io.FileWriter(File)
. Obligation to clean up resource created at PigMain.java:[line 130] is not discharged: org.apache.oozie.action.hadoop.PigMain.run(String[]) may fail to clean up java.io.OutputStream on checked exception
. -1 There are [2] new bugs found below threshold in [sharelib/pig] that must be fixed.
. You can find the FindBugs diff here (look for the red and orange ones): sharelib/pig/findbugs-new.html
. The most important FindBugs errors are:
. At SparkMain.java:[line 493]: org.apache.oozie.action.hadoop.SparkMain.getJarVersion(File) may fail to close stream
. At SparkMain.java:[line 326]: Exceptional return value of java.io.File.mkdirs() ignored in org.apache.oozie.action.hadoop.SparkMain.createPySparkLibFolder()
. +1 There are no new bugs found in [sharelib/spark].
. -1 There are [6] new bugs found below threshold in [sharelib/distcp] that must be fixed, listing only the first [5] ones.
. You can find the FindBugs diff here (look for the red and orange ones): sharelib/distcp/findbugs-new.html
. The top [5] most important FindBugs errors are:
. At DateList.java:[line 55]: Nullcheck of date at line 55 of value previously dereferenced in org.apache.oozie.example.DateList.main(String[])
. Private method org.apache.oozie.example.DateList.formatDateUTC(Calendar) is never called: Redundant null check at DateList.java:[line 62]
. org.apache.oozie.example.LocalOozieExample.execute(String[]) may fail to clean up java.io.InputStream: At DateList.java:[line 97]
. Path continues at LocalOozieExample.java:[line 76]: Obligation to clean up resource created at LocalOozieExample.java:[line 72] is not discharged
. Path continues at LocalOozieExample.java:[line 78]: Path continues at LocalOozieExample.java:[line 77]
. +1 There are no new bugs found in [examples].
. -1 There are [42] new bugs found below threshold in [hadooplibs/hadoop-utils-2] that must be fixed, listing only the first [5] ones.
. You can find the FindBugs diff here (look for the red and orange ones): hadooplibs/hadoop-utils-2/findbugs-new.html
. The top [5] most important FindBugs errors are:
. At OozieDBCLI.java:[line 548]: Found reliance on default encoding in org.apache.oozie.tools.OozieDBCLI.convertClobToBlobInMysql(String, Connection): new java.io.FileWriter(String, boolean)
. At OozieDBCLI.java:[line 577]: Found reliance on default encoding in org.apache.oozie.tools.OozieDBCLI.convertClobToBlobInPostgres(String, Connection, String): new java.io.FileWriter(String, boolean)
. At OozieDBCLI.java:[line 983]: Found reliance on default encoding in org.apache.oozie.tools.OozieDBCLI.createOozieSysTable(String, boolean, String): new java.io.FileWriter(String, boolean)
. At OozieDBCLI.java:[line 759]: Found reliance on default encoding in org.apache.oozie.tools.OozieDBCLI.ddlTweaks(String, boolean): new java.io.FileWriter(String, boolean)
. At OozieDBCLI.java:[line 712]: Found reliance on default encoding in org.apache.oozie.tools.OozieDBCLI.ddlTweaksFor50(String, boolean, String): new java.io.FileWriter(String, boolean)
. 0 There are [1] new bugs found in [tools] that would be nice to have fixed.
. You can find the FindBugs diff here: tools/findbugs-new.html
+1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. +1 the patch does not modify JPA files
-1 TESTS - patch does not compile, cannot run testcases
-1 DISTRO
. -1 distro tarball fails with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3812/
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
-1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. -1 the patch contains 1 line(s) with trailing spaces
. +1 the patch does not introduce any line longer than 132
. -1 the patch does not add/modify any testcase
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings
0 There are [4] new bugs found in total that would be nice to have fixed.
. +1 There are no new bugs found in [server].
. +1 There are no new bugs found in [client].
. +1 There are no new bugs found in [docs].
. +1 There are no new bugs found in [sharelib/hive].
. +1 There are no new bugs found in [sharelib/spark].
. +1 There are no new bugs found in [sharelib/hcatalog].
. +1 There are no new bugs found in [sharelib/hive2].
. +1 There are no new bugs found in [sharelib/streaming].
. +1 There are no new bugs found in [sharelib/pig].
. +1 There are no new bugs found in [sharelib/sqoop].
. +1 There are no new bugs found in [sharelib/distcp].
. +1 There are no new bugs found in [sharelib/oozie].
. +1 There are no new bugs found in [hadooplibs/hadoop-utils-2].
. 0 There are [4] new bugs found in [core] that would be nice to have fixed.
. You can find the FindBugs diff here: core/findbugs-new.html
. +1 There are no new bugs found in [tools].
. +1 There are no new bugs found in [examples].
+1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. +1 the patch does not modify JPA files
-1 TESTS - patch does not compile, cannot run testcases
+1 DISTRO
. +1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3813/
I performed tests with the latest patch. Oozie was running with MySQL. The execution successfully recovered after MySQL came back.
Note: in the exception handlers, there are separate branches for ExecutionException and RetryException. In our case, only RetryException can occur. ExecutionException would mean that retry did not take place but the Callable<V> threw an exception. Since we retry on all kinds of exception, this should not happen (unless it's an Error or Throwable). Perhaps this is worth a short comment.
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
-1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. +1 the patch does not introduce any line longer than 132
. -1 the patch does not add/modify any testcase
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings
-1 There are [1] new bugs found below threshold in total that must be fixed.
. +1 There are no new bugs found in [server].
. +1 There are no new bugs found in [client].
. +1 There are no new bugs found in [docs].
. +1 There are no new bugs found in [sharelib/hive].
. +1 There are no new bugs found in [sharelib/spark].
. +1 There are no new bugs found in [sharelib/hcatalog].
. +1 There are no new bugs found in [sharelib/hive2].
. +1 There are no new bugs found in [sharelib/streaming].
. +1 There are no new bugs found in [sharelib/pig].
. +1 There are no new bugs found in [sharelib/sqoop].
. +1 There are no new bugs found in [sharelib/distcp].
. +1 There are no new bugs found in [sharelib/oozie].
. +1 There are no new bugs found in [hadooplibs/hadoop-utils-2].
. -1 There are [1] new bugs found below threshold in [core] that must be fixed.
. You can find the FindBugs diff here (look for the red and orange ones): core/findbugs-new.html
. The most important FindBugs errors are:
. At OozieRetryListener.java:[line 27]: org.apache.oozie.util.db.OozieRetryListener.INSTANCE isn't final but should be
. +1 There are no new bugs found in [tools].
. +1 There are no new bugs found in [examples].
+1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. +1 the patch does not modify JPA files
-1 TESTS - patch does not compile, cannot run testcases
+1 DISTRO
. +1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3820/
Looks good overall, the guava-retrying stuff looks very clean. Here's a few related things I think we should change:
- The patch loads the config properties every time we create a Retryer. I think we should move that into JPAService so we can load them once, and then pass them to the Retryer.
- The Retryer is general enough that I think it should go in org.apache.oozie.util so that other code can use it, which is another reason not to tie the configs directly to the Retryer. I'm sure we have some other places in Oozie where we have custom retrying code that could be replaced by this, but that's out of scope for this JIRA.
- All of the database calls are in JPAService, except for QueryExecutor#insert. That's not your doing, but I think we should move it to JPAService to be more consistent and so we can use the configs that JPAService will load in point 1.
- Let's rename the config properties to oozie.service.JPAService.retry.initial-wait-time.ms and oozie.service.JPAService.retry.max-retries now that they'd be moved into JPAService, plus that's more consistent with the other DB related configs.
Let me know if that doesn't make sense.
Thanks for the comments rkanter.
Unfortunately I don't think we can use guava-retrier. The problem is that some of the Executor implementation that is passed to JPAService throw JPAExecutorException on purpose, like WorkflowJobGetJPAExecutor if a given job is not found. So if we catch all of them, we can't always know whether it's an underlying SQL problem or it came from the business logic (that's why previous tests timed out before). We have to revert back to our original implementation, but that's not enough in itself.
I think the next steps are:
1. Revert back to DBOperationRetryHandler
2. Catch JPAExecutorException and examine it:
- if it's an SQL error (E0603) then we retry
- otherwise we don't
I was also thinking about the following: we don't just check the error type, but the SQL exception that was thrown by the JDBC driver. For example, there are certain cases where doing a retry makes no sense, eg. SQLSyntaxErrorException or SQLFeatureNotSupportedException, although these are very unlikely. This is perhaps a bit too much and we should not go to this direction, so I'm just thinking out loud.
Update: after a short discussion with andras.piros, I decided not to implement this - too risky & adds extra complications.
Taken over from pbacsko. Walking that same path w/ following extensions.
Failure injection
- subclass org.apache.commons.dbcp.BasicDataSource to have its createConnectionFactory() method fixed, to have Class.forName(driverClassName) a real effect. (See the fixed method)
- introduce a JDBC driver extending com.mysql.jdbc.Driver that delegates its getConnection(String, Properties) method to a special wrapper
- let this java.sql.Connection wrapper do nothing but intercept the prepareStatement(String, int, int) call:
- investigate whether it's a DML statement
- investigate whether it's a statement handling an Oozie table
- if so, try to inject a PersistenceException w/ a relatively low database error percentage (5 %)
Integration testing
- use FailingHSQLDBDriverWrapper extending org.hsqldb.jdbcDriver to intercept JDBC calls
- use integration test cases extending MiniOozieTestCase for following scenarios:
- parallel call on JPA queries can easily succeed despite of the injected errors
- workflows continue to pass w/o injected errors
- workflows pass w/ injected errors
Functional testing
- use FailingMySQLDriverWrapper extending com.mysql.jdbc.Driver to intercept JDBC calls
- use following coordinator / workflow scenario:
- fired every minute
- executing for multiple days
- workflow consists of a <decision /> action followed by two paths of consecutive <fs /> and <shell /> actions
All functionalities covered here. A blacklist based javax.persistence.PersistenceException retry predicate is present to provide a database agnostic method to identify which Exception instances originating from JPA / DBCP / JDBC layers are good candidates to begin / continue retrying.
Tests covered in code:
- unit tests
- testing the retry handler, the retry predicate filter, and parallel calls to JPA EntityManager (mostly Oozie database reads and writes) when injecting failures
- integration tests
- using the MiniOozieTestCase framework
- fixing it so that also asynchronous workflow applications (the ones that use CallableQueueService) can be run
- for following workflow scenarios:
- a very simple one consisting only of a <start/> and an <end/> node
- a more sophisticated one consisting of multiple synchronous <fs/> nodes and a <decision/> node
- the ultimate one consisting of a <decision/> node, and two branches of an <fs/> and an asynchronous <shell/> nodes
Test cases run:
mvn clean test -Dtest=TestDBOperationRetryHandler,TestPersistenceExceptionSubclassFilterRetryPredicate,TestParallelJPAOperationRetries,TestWorkflow,TestWorkflowRetries,TestJPAService
Functional and stress tests performed on a 4-node MySQL cluster. MySQL daemon has been stopped / killed / restarted several times. Also firewall rules have been modified temporarily to simulate network outages.
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
-1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. -1 the patch contains 1 line(s) with trailing spaces
. -1 the patch contains 5 line(s) longer than 132 characters
. +1 the patch does adds/modifies 10 testcase(s)
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
. WARNING: the current HEAD has 1 Javadoc warning(s)
+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings
+1 There are no new bugs found in total.
-1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. -1 the patch modifies 1 JPA file(s), persistence.xml or *-orm.xml
+1 TESTS
. Tests run: 1022
. Tests rerun: 22
. Tests failed at first run: org.apache.oozie.action.hadoop.TestLauncherAM,
+1 DISTRO
. +1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
. There is at least one warning, please check
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3933/
Addressing review comments as well as line length, and header issues.
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
-1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. -1 the patch contains 6 line(s) longer than 132 characters
. +1 the patch does adds/modifies 12 testcase(s)
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
. WARNING: the current HEAD has 1 Javadoc warning(s)
+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings
+1 There are no new bugs found in total.
-1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. -1 the patch modifies 1 JPA file(s), persistence.xml or *-orm.xml
+1 TESTS
. Tests run: 1603
. Tests rerun: 3
. Tests failed at first run: org.apache.oozie.action.hadoop.TestHdfsOperations,
+1 DISTRO
. +1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
. There is at least one warning, please check
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3939/
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
+1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. +1 the patch does not introduce any line longer than 132
. +1 the patch does adds/modifies 3 testcase(s)
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
. WARNING: the current HEAD has 1 Javadoc warning(s)
-1 COMPILE
. +1 HEAD compiles
. -1 patch does not compile
. +1 the patch does not seem to introduce new javac warnings
+1 There are no new bugs found in total.
-1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. -1 the patch modifies 1 JPA file(s), persistence.xml or *-orm.xml
-1 TESTS - patch does not compile, cannot run testcases
-1 DISTRO
. -1 distro tarball fails with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
. There is at least one warning, please check
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3949/
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
-1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. -1 the patch contains 5 line(s) longer than 132 characters
. +1 the patch does adds/modifies 12 testcase(s)
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
. WARNING: the current HEAD has 1 Javadoc warning(s)
+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings
+1 There are no new bugs found in total.
-1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. -1 the patch modifies 1 JPA file(s), persistence.xml or *-orm.xml
+1 TESTS
. Tests run: 1606
. Tests rerun: 3
. Tests failed at first run: org.apache.oozie.action.hadoop.TestHdfsOperations,
+1 DISTRO
. +1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
. There is at least one warning, please check
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3950/
Addressing remaining review comments, as well as documenting the feature.
Addressing remaining review comments, as well as documenting the feature.
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
-1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. -1 the patch contains 1 line(s) longer than 132 characters
. +1 the patch does adds/modifies 12 testcase(s)
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
. WARNING: the current HEAD has 1 Javadoc warning(s)
+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings
+1 There are no new bugs found in total.
-1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. -1 the patch modifies 1 JPA file(s), persistence.xml or *-orm.xml
+1 TESTS
. Tests run: 954
. Tests rerun: 3
. Tests failed at first run: org.apache.oozie.action.hadoop.TestHdfsOperations,
+1 DISTRO
. +1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
. There is at least one warning, please check
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3952/
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
1 PATCH_APPLIES
1 CLEAN
-1 RAW_PATCH_ANALYSIS
. 1 the patch does not introduce any @author tags
. 1 the patch does not introduce any tabs
. 1 the patch does not introduce any trailing spaces
. -1 the patch contains 1 line(s) longer than 132 characters
. 1 the patch does adds/modifies 12 testcase(s)
1 RAT
. 1 the patch does not seem to introduce new RAT warnings
1 JAVADOC
. 1 the patch does not seem to introduce new Javadoc warnings
. WARNING: the current HEAD has 1 Javadoc warning(s)
1 COMPILE
. 1 HEAD compiles
. 1 patch compiles
. 1 the patch does not seem to introduce new javac warnings
1 There are no new bugs found in total.
-1 BACKWARDS_COMPATIBILITY
. 1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. -1 the patch modifies 1 JPA file(s), persistence.xml or *-orm.xml
1 TESTS
. Tests run: 1110
. Tests rerun: 22
. Tests failed at first run: org.apache.oozie.action.hadoop.TestLauncherAM,
1 DISTRO
. 1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
. There is at least one warning, please check
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3953/
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
1 PATCH_APPLIES
1 CLEAN
-1 RAW_PATCH_ANALYSIS
. 1 the patch does not introduce any @author tags
. 1 the patch does not introduce any tabs
. 1 the patch does not introduce any trailing spaces
. -1 the patch contains 1 line(s) longer than 132 characters
. 1 the patch does adds/modifies 12 testcase(s)
1 RAT
. 1 the patch does not seem to introduce new RAT warnings
1 JAVADOC
. 1 the patch does not seem to introduce new Javadoc warnings
. WARNING: the current HEAD has 1 Javadoc warning(s)
1 COMPILE
. 1 HEAD compiles
. 1 patch compiles
. 1 the patch does not seem to introduce new javac warnings
1 There are no new bugs found in total.
-1 BACKWARDS_COMPATIBILITY
. 1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. -1 the patch modifies 1 JPA file(s), persistence.xml or *-orm.xml
1 TESTS
. Tests run: 344
. Tests rerun: 3
. Tests failed at first run: org.apache.oozie.action.hadoop.TestHdfsOperations,
1 DISTRO
. 1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
. There is at least one warning, please check
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3955/
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
-1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. -1 the patch contains 1 line(s) longer than 132 characters
. +1 the patch does adds/modifies 12 testcase(s)
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
. WARNING: the current HEAD has 1 Javadoc warning(s)
+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings
+1 There are no new bugs found in total.
-1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. -1 the patch modifies 1 JPA file(s), persistence.xml or *-orm.xml
+1 TESTS
. Tests run: 1558
. Tests rerun: 3
. Tests failed at first run: org.apache.oozie.action.hadoop.TestHdfsOperations,
+1 DISTRO
. +1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
. There is at least one warning, please check
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3964/
Testing JIRA OOZIE-2854
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
-1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. -1 the patch contains 1 line(s) longer than 132 characters
. +1 the patch does adds/modifies 12 testcase(s)
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
. WARNING: the current HEAD has 1 Javadoc warning(s)
+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings
+1 There are no new bugs found in total.
-1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. -1 the patch modifies 1 JPA file(s), persistence.xml or *-orm.xml
+1 TESTS
. Tests run: 1080
. Tests rerun: 22
. Tests failed at first run: org.apache.oozie.action.hadoop.TestLauncherAM,
+1 DISTRO
. +1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
. There is at least one warning, please check
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3967/
Peter,
I am glad that you found this issue. I just built two labs to test an HA solution for cluster metadata storage. It makes no sense to have a hadoop cluster and a single MySQL failure will bring it down.
So far I build a Galera MySQL cluster and Galera MariaDB cluster but I have not pointed hadoop to it just imported the data. NDB clustering will not work as it does not uses the InnoDB engine and the import will fail.
I also have a case opened with Cloudera on this matter that I am waiting on.
Galera cluster states that it is synchronous https://mariadb.com/kb/en/mariadb/about-galera-replication/
but your findings are different.
All I want to do is to have a metadata storage that is not prone to a single point of failure. Have you run similar test with Oracle RAC?
Steven