- Set up simple LAMP submitter for basic bug report details: Hadoop version (Git SHA or SVN path and rev), HBase version (Git SHA or SVN path and rev), ZooKeeper version (Git SHA or SVN path and rev), HBase maven profile, stacktrace - Form - POST as JSON - Announce the availability of the above and meanwhile build the below. - Build a Hadoop+HBase back end - Switch over submission directly into HBase, or set up a continuous scheduled sqoop import of new arrivals. (The latter is possibly more representative of a real use case.) - Enumerate HBase revisions and build code paths for each. - Allocate an 8 bit revision identifier for each maven profile in a release. We only care about HBase versions -- Hadoop and ZK version information retained in the report but not used. We'll only be able to track 256 versions at a time, but this seems sufficient for major releases, RCs, and recent DRs. - Use FindBugs to dump all code paths discovered by control flow analysis for the revision: - Create a detector: - e.u.c.f.detect.CheckCalls is a template for proper method call resolution - e.u.c.f.detect.BuildInterproceduralCallGraph shows how to make your own ICG (but we should do it with nondeprecated types) - Each vertex stores method information: file, starting line, ending line. - Build a dictionary of codes for method signatures. 16 bits should be sufficient for only HBase for a while: a quick and dirty FindBugs plugin that simply counts methods as visited during classfile analysis found 15,593 methods in the HBase 0.92 jar (log2=14). If we want to include all dependencies, the HBase jar plus all in lib/ have 153,026 methods (log2=18). - Create a table for the revision and store a map from method signatures to dictionary codes. Encode the revision identifier into the table name. - Store a representation of all code paths into HBase tables that allows for fast lookup from a stacktrace: - Find all simple paths, enumerate the IGC depth-first from daemon Mains. When returning from the recursion, encode the accumulated path at the current level into a binary key, using the dictionary of methods, in reverse order, and store the path into a path table using that key. Use all the bits of the method code, vints won't sort right. Prepend the revision identifier to the key. This will let us quickly look up all paths to an exception site for a given revision, and then prune the results with successful resolutions of subsequent locations in the stacktrace. Can run this on multiple slaves to process revisions in parallel when bootstrapping. - Use Pig to calcuate the intersection of a batch of bug reports with code paths. - For Pig<->HBase integration coverage - Increment counters in path records in HBase - Probably can do in memory joins with the method signature dictionary - Use Pig to build reports by revision of code paths ordered by sum - Can visualize the reports as a tree map of packages with "hot" sites tiled in front. Quantize counts into 4 buckets -- red (most frequent), orange, yellow, green (least frequent) -- and color accordingly. - Can build different reports according to the severity level associated with the logged stacktrace.