|
Attaching a patch the implements the above mentioned technique. Tested it with
It seems that the jobhistory uses the display names instead of actual names in the stringified representation.
One such example of actual name and display name is Note that the mapping from the actual name to the display name is mentioned in the .properties file. Looks like its difficult to get the (actual) name from the display name because of the following reasons
So, I suggest we also add the actual name to jobhistory along with the display name and provide api to set the display-name of the counter/group. Let's just keep the display names. The internal names are for mostly for mapping to the counters. Yes, in theory the display names could collide, but in practice they won't, because they are always displayed to the user by the display name.
I think that putting in the lengths would make it very hard to read. I think it is better to stay with in the current scheme and use the current: g1.c1: v1, g1.c2:v2, ... As you point out, that means we need to quote the characters . : , and probably " and \. After an out-of-band discussion with Amar, I agree that we need both the internal and display names to be saved. If we are breaking the format, we should probably pull out the group names and put in both. Maybe we should use a lisp like format:
{g1 {ci1 cd1 v1} {ci2 cd2 v2} {ci3 cd3 v3}} {g2 {ci4 cd4 v4}}
Thoughts? Amar and I thought about this a bit. Groups also have internal and display names, apparently. Since space is a character that may be more common in display names (we have examples in the current list of counters), it may be better to do something like:
{(gi1)(gd1){(ci1)(cd1)(v1)}}
Real example: {(org.apache.hadoop.mapred.Task$FileSystemCounter)(File Systems){(HDFS_READ)(HDFS bytes read)(1020634)}}
as opposed to {org.apache.hadoop.mapred.Task$FileSystemCounter File\ Systems {HDFS_READ HDFS\ bytes\ read 1020634}}
IMHO, the first is a bit more readable, and clear where the tokens are ending. Comments ? Attaching a patch that adds the facility to recover the counter from (encoded)string. The format in which the counters are represented is
{(gn1) (gd1) [(cn1) (cd1) (cv1)] [] [] []} {}
Where {} are used for groups, [] are used for counter and () for basic values
Th change w.r.t Hemanth's comment is in the encapsulation for counter from '{ }' to '[ ]' which makes parsing and the recovery code simple. Added testcases that test counters. Comments? Thoughts? Uploading a new patch the fixes the findbugs warnings.
StringUtils
Counters
Counter
Group
Counters:
TestCounters:
Attaching a patch the incorporates Hemanth's comments.
The reason for deprecating the api is that I have introduced a new/enhanced api that essentially does the same thing but is more generic. I personally dont have any strong opinion about this.
The code to check content equality of counters is moved to the test case as only the test case requires it for now.
Now getBlock() throws ParseException if the string is malformed and returns null is no block found
+1 I agree with Hemant to not deprecate the api escapeString(String str, char escapeChar, char charToEscape). Attaching a patch that removes the deprecated tag. As rightly pointed out by Hemanth, I have not optimized the code for performance. Most of the changes in StringUtils.java are due to refactoring. The apis added to Counters are kept simple. Will check if it can be easily optimized.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12389722/HADOOP-3970-v4.1.patch against trunk revision 693048. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3213/testReport/ This message is automatically generated. Core test failed on TestDatanodeDeath. Not sure if its related to this patch or Counters. Ran the same test on my machine and it passed.
I just committed this. Thanks, Amar!
Integrated in Hadoop-trunk #600 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/600/
Amar, patch applied through JIRA provides the function "Counters.fromEscapedCompactString(String compactString)". I was planing to use it to parse the counters printed in job history log file, but looks like job history logfile does NOT have the counters string representation in a escaped compact string format. Is it going to be changed? Which JIRA would address this issue (HADOOP-3200 ?). Pl. comment.
Suhas,
Here we have introduced 2 new api's namely Counters.makeEscapedCompactString() and Counters.fromEscapedCompactString(). What history uses is Counters.makeCompactString() and there is no mechanism to recover it into an object. If you look closely, the counter string in the job history is ambiguous and hence recovery becomes problematic. Counters.makeEscapedCompactString() represents the counters in an unambiguous way and will be used by Will this be a standardized format for counters in future Hadoop releases, or are there other issues you guys think will need to be fixed? I've been working on some scripts for parsing job history logs to compute statistics about how a Hadoop cluster is being used (see https://issues.apache.org/jira/browse/HADOOP-3708
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
groupname.countername:value
So for a counter which has the following contents
We get
g1.c1:v1, g1.c2:v2, g1.c3:v3, g2.c4:v4, g2.c5:v5
One way to overcome the problem stated above is to use the length of the names present in the counter
So the above counter might look like
[|g1|] g1 {[|c1|] c1:v1, [|c2|] c2:v2, [|c3|] c3:v3 }, [|g2|] g2 {[|c4|] c4:v4, [|c5|] c5:v5 }
Here |s| means length of s.
Hence the length of the name in the counter helps to correctly identify the name and helps in counter recovery.
Thoughts?