mvn dependency:analyze reports a number of things that should be cleaned up in the new pom:
[INFO] --- maven-dependency-plugin:2.2:analyze (default-cli) @ hadoop-archive-logs ---
[WARNING] Used undeclared dependencies found:
[WARNING] Unused declared dependencies found:
It would be nice if the usage output used the actual values in the code rather than hardcoded strings. For example, we now have to keep minNumLogFiles and the usage string manually in sync. If the usage output were built from the minNumLogFiles value directly, then changing that value would automatically correct the usage message. On a related note, the usage currently mentions values like "1GB", but I don't believe the code supports memory units.
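To illustrate the idea, here is a rough sketch of help text built from the same constants the code uses. The class name, constant names, default values, and the megabyte unit are all placeholders of mine, not taken from the patch:

```java
// Hypothetical sketch: every name and value below is illustrative only.
final class UsageSketch {
    // Assumed defaults; the real ones live in the tool's own fields.
    static final int DEFAULT_MIN_NUM_LOG_FILES = 20;
    static final long DEFAULT_MAX_TOTAL_LOGS_SIZE_MB = 1024L;

    // The default shown to the user is read from the constant, so changing
    // the constant changes the usage text with it.
    static String minNumLogFilesHelp() {
        return "The minimum number of log files required to be eligible (default: "
            + DEFAULT_MIN_NUM_LOG_FILES + ")";
    }

    // The unit is stated explicitly so the help never implies suffixes
    // such as "1GB" that the parser does not accept.
    static String maxTotalLogsSizeHelp() {
        return "The maximum total logs size in megabytes required to be eligible (default: "
            + DEFAULT_MAX_TOTAL_LOGS_SIZE_MB + ")";
    }
}
```

With something along these lines, the option descriptions passed to the command-line parser cannot drift from the defaults.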
Do we only want to consider aggregating logs that have totally succeeded? What about the FAILED case or other terminal states? Seems like any terminal state where we know there aren't going to be any more logs arriving should be eligible.
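A sketch of what I mean, using a stand-in enum rather than the real YARN type so the snippet is self-contained. The enum, class name, and the choice of which states count as terminal are my assumptions:

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: AppState is a local stand-in for the application
// state type the tool actually queries.
final class TerminalStateSketch {
    enum AppState { NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    // Assumption: once an app reaches one of these states, no more logs
    // will arrive, so its logs are safe to archive.
    static final Set<AppState> TERMINAL =
        EnumSet.of(AppState.FINISHED, AppState.FAILED, AppState.KILLED);

    static boolean isTerminal(AppState state) {
        return TERMINAL.contains(state);
    }
}
```

Filtering on a set of terminal states keeps the eligibility rule in one place if more states need to be added later.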
Nit: it's wasteful for checkFiles to continue iterating over the files once it finds an excluding condition. We can also eliminate the need to track file counts explicitly and simply check files.length directly before we even start looping.
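Roughly what I have in mind, as a simplified sketch. It takes plain arrays instead of the real file status objects, and the specific excluding conditions (an existing ".har" entry, a total size cap) are my assumptions about what the method checks:

```java
// Hypothetical sketch: signature and conditions are illustrative only.
final class CheckFilesSketch {
    static boolean isEligible(String[] names, long[] sizes,
                              int minNumLogFiles, long maxTotalLogsSize) {
        // The count is known up front, so check it before looping.
        if (names.length < minNumLogFiles) {
            return false;
        }
        long totalSize = 0;
        for (int i = 0; i < names.length; i++) {
            // Assumed excluding condition: an archive already exists.
            if (names[i].endsWith(".har")) {
                return false; // stop at the first excluding condition
            }
            totalSize += sizes[i];
            if (totalSize > maxTotalLogsSize) {
                return false; // no need to look at the remaining files
            }
        }
        return true;
    }
}
```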
Is there a reason to support maxEligible being zero? Wondering if that should be equivalent to a negative value and just cover everything.
Should the working directory contain something unique, like the application ID, somewhere in its name? That would make it easier to clean up after a run without worrying about affecting other, possibly simultaneous, runs.
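For example, something along these lines. The class name, directory prefix, and sample values are placeholders of mine, and the unique ID could be whatever identifier the run has available:

```java
// Hypothetical sketch: names and the prefix are illustrative only.
final class WorkingDirSketch {
    static final String PREFIX = "archive-logs-work-";

    // Each run gets its own directory, so cleanup can delete exactly this
    // path without touching a concurrent run's files.
    static String workingDirFor(String baseDir, String uniqueId) {
        String base = baseDir.endsWith("/")
            ? baseDir.substring(0, baseDir.length() - 1)
            : baseDir;
        return base + "/" + PREFIX + uniqueId;
    }
}
```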