An anonymizer implemented in Python is attached. It works with v20, v22, or rumen log files. During anonymization, a private file of mapping tables is created, which can later be used to de-anonymize the anonymized trace. When working with multiple traces, the tables file can be used in two ways: grown incrementally across traces, or used standalone.
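To illustrate the idea of a tables file that can be grown incrementally, here is a minimal sketch. It is not the attached anonymizer; the class name `AnonymizationTable`, the token format, and the JSON layout of the tables file are all assumptions made for illustration.

```python
import json


class AnonymizationTable:
    """Hypothetical two-way mapping between real values and tokens."""

    def __init__(self, forward=None):
        # forward maps a field kind (e.g. "user") to {real value: token}
        self.forward = forward or {}

    def anonymize(self, kind, value):
        table = self.forward.setdefault(kind, {})
        if value not in table:
            # Assumed token format: the kind followed by a running index.
            table[value] = "%s%d" % (kind, len(table))
        return table[value]

    def deanonymize(self, kind, token):
        # Linear scan keeps the sketch short; a reverse dict would be faster.
        for value, known in self.forward.get(kind, {}).items():
            if known == token:
                return value
        raise KeyError(token)

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.forward, f, indent=2, sort_keys=True)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(json.load(f))
```

Loading an existing tables file before processing the next trace keeps the tokens consistent across traces (the incremental mode); starting from an empty table gives the standalone mode.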
Also attached is same.py, a simple Python script that compares two JSON-based trace files. It works much like diff. Because JSON objects can be semantically equivalent even when the keys in their dictionaries appear in different orders, running diff directly on the two files may not work as desired. The script outputs nothing if the two files represent the same trace; otherwise it prints the objects that differ between the two files (these can be large). v22 and rumen log files can be compared with this script. Keys in v20 log files have a fixed order, so v20 log files can be compared with diff directly.
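The key-order point can be shown with a short sketch. This is not the code in same.py; the function name `same_json` is hypothetical. It relies only on the fact that Python dict equality ignores key order while list equality does not.

```python
import json


def same_json(text_a, text_b):
    """Return True if two JSON documents are semantically equal.

    Parsed objects become dicts, whose equality ignores key order;
    arrays become lists, whose order still matters.
    """
    return json.loads(text_a) == json.loads(text_b)
```

For example, `same_json('{"a": 1, "b": 2}', '{"b": 2, "a": 1}')` is true even though a textual diff of the two strings would report a difference.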
1. In v22 and rumen-trace log files, multiple JSON objects are stored in one file, separated by whitespace. Lacking the streaming capability of the Java Jackson package, the load functions of the Python json module read only a single JSON object from a string or a file. Currently, the scripts rely on detecting "}\n" as a whole line to determine the end of a JSON object. That may fail if this pattern occurs inside a string value. A better implementation would behave like Jackson: read one object from the file and leave the rest of the file available for further reads.
2. The sample rumen-trace and rumen-topology files were taken from hadoop-mapreduce/src/test/tools/data/rumen/. These sample files seem to have been generated from v20 log files, since "." is escaped as "\." in many fields. I'm not sure whether rumen works with v22 log files, or whether rumen files generated from v22 log files differ from those generated from v20 log files.
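Regarding item 1, one possible direction is the `raw_decode` method of `json.JSONDecoder`, which parses one JSON value starting at a given index and returns the index where it stopped. The sketch below is only an illustration, not what the attached scripts do; the function name `iter_json_objects` is hypothetical, and it assumes the whole file fits in memory as a single string.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of whitespace-separated values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        # Returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
```

Because the decoder tracks string boundaries itself, a "}" inside a string value does not end the object early. A truly streaming variant that reads the file in chunks would need extra buffering that this sketch leaves out.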