Issue Details (XML | Word | Printable)

Key: HADOOP-2423
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Tsz Wo (Nicholas), SZE
Reporter: Tsz Wo (Nicholas), SZE
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

The codes in FSDirectory.mkdirs(...) is inefficient.

Created: 13/Dec/07 11:45 PM   Updated: 08/Jul/09 04:42 PM
Return to search
Component/s: None
Affects Version/s: 0.15.1
Fix Version/s: 0.17.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 2423_20080130.patch 2008-01-30 11:36 PM Tsz Wo (Nicholas), SZE 14 kB
Text File Licensed for inclusion in ASF works 2423_20080303.patch 2008-03-03 07:13 PM Tsz Wo (Nicholas), SZE 13 kB
Text File Licensed for inclusion in ASF works 2423_20080304.patch 2008-03-04 06:10 PM Tsz Wo (Nicholas), SZE 5 kB
Text File Licensed for inclusion in ASF works 2423_20080304b.patch 2008-03-04 07:29 PM Tsz Wo (Nicholas), SZE 8 kB
Text File Licensed for inclusion in ASF works 2423_20080304c.patch 2008-03-04 08:20 PM Tsz Wo (Nicholas), SZE 8 kB
Text File Licensed for inclusion in ASF works 2423_20080304d.patch 2008-03-04 10:16 PM Tsz Wo (Nicholas), SZE 8 kB
Text File Licensed for inclusion in ASF works 2423_20080310.patch 2008-03-10 10:54 PM Tsz Wo (Nicholas), SZE 7 kB
Text File Licensed for inclusion in ASF works 2423_20080311.patch 2008-03-11 06:05 PM Tsz Wo (Nicholas), SZE 7 kB
Issue Links:
Reference
 

Release Note: Improved FSDirectory.mkdirs(...) performance. In NNThroughputBenchmark-create, the ops per sec in was improved ~54%.
Resolution Date: 13/Mar/08 07:51 PM


 Description  « Hide
FSDirectory.mkdirs(...) creates List<String> v to store all dirs. e.g.
//Suppose 
src = "/foo/bar/bas/"
//Then,
v = {"/", "/foo", "/foo/bar", "/foo/bar/bas"}

For each directory string cur in v, no matter cur already exists or not, it will try to do a unprotectedMkdir(cur, ...). Then, cur is parsed to byte[][] in INodeDirectory.addNode (...).

We don't need to do the parsing for each string in v. Instead, byte[][] should be stored. Also, the loop should not continue once it finds an existing subdirectory.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Konstantin Shvachko added a comment - 30/Jan/08 09:59 PM
Is this similar or a part of HADOOP-2002?

Tsz Wo (Nicholas), SZE added a comment - 30/Jan/08 10:35 PM
Yes, but this one is mainly for mkdir.

Tsz Wo (Nicholas), SZE added a comment - 03/Mar/08 07:13 PM
2423_20080303.patch: updated with current trunk

Konstantin Shvachko added a comment - 03/Mar/08 11:23 PM
This patch actually does 2 things.
  1. It introduces a class that caches INodes and local names on the path.
  2. Optimizes the mkDirs() using this class.

I agree both tasks are important, but if you just want to optimize mkDirs() as stated above then it could be done much easier.
The INode.getExistingPathINodes() should be made package private rather than INode-private.
And then you can get all components of the path and all existing inodes along the path by

FSDirectory.mkDirs(src, ... ) {
    byte[][] components = INode.getPathComponents(src);
    INode[] inodes = new INode[components.length];
    rootDir.getExistingPathINodes(components, inodes);

    // create missing inodes
    ...............
}

After that you just find first null in the array of inodes, and start adding directories using addNode() and fill in the null elements one by one.
You don't even need to start from the rootDir for every component.

Introducing the caching class is important but will take more effort, because we would probably want to to convert all String-path
parameters to it and use across all name-node parts. This is going to be a big change, and it will require proper testing,
benchmarking, etc.

IMO optimizing mkDirs is important by itself, since it is extensively used by our clients.


Tsz Wo (Nicholas), SZE added a comment - 04/Mar/08 06:10 PM
2423_20080304.patch:

The ops per sec in NNThroughputBenchmark create increases from 82.6 to 107.0 in my machine.


Tsz Wo (Nicholas), SZE added a comment - 04/Mar/08 07:29 PM
2423_20080304b.patch: further improved ops per sec to 127.2

Tsz Wo (Nicholas), SZE added a comment - 04/Mar/08 08:20 PM
2423_20080304c.patch: fixed a bug.

Tsz Wo (Nicholas), SZE added a comment - 04/Mar/08 10:16 PM
2423_20080304d.patch: fixed another bug. The one passed all the tests.

Konstantin Shvachko added a comment - 10/Mar/08 08:07 PM - edited
This looks much better. There is room for more optimization.
  1. Rather than calling rootDir.addNode() you should call parentDir.addChild().
    This is what essentially happens in your code.
    But doing this explicitely will let you avoid creating the new addNode() with redundant parameters.
  2. The inheritPermission logic should go from addNode() into addChild() in this case.
  3. String[] strings = path.split(Path.SEPARATOR, -1);
    why limit = -1 in the current code is replaced by 0?
  4. <T extends INode> T addNode(String path, byte[][] pathComponents, INode parentNode, T newNode, boolean inheritPermission)
    parentNode should be of type INodeDirectory
    String path, don't need the first parameter:
    return null rather than throwing an exception; and throw FileNotFoundException in the calling method.
  5. In FSDirectory.mkdirs() the FileNotFoundException thrown by addNode() is absorbed. Should it be?

Tsz Wo (Nicholas), SZE added a comment - 10/Mar/08 10:54 PM - edited
1. This is indeed a very good point. The codes becomes much simpler.

2. Sure.

3. Since path should not have consecutive "/", split with limit -1 or 0 are the same. So, I think the split with less number of parameters is preferred.

4. Because of (1), the changes of addNode(String path, T newNode, boolean inheritPermission) are reverted. We could make the change from throwing FileNotFoundException to returning null later.

5. Since addChild(...) does not throw FileNotFoundException, nothing to catch now.


Konstantin Shvachko added a comment - 11/Mar/08 01:32 AM
+1
I am only worried about limit 0 in String.split(). There was a reason for -1, which I don't remember now,
but could you please verify some corner cases, like creating a file/directory in the root, probably renaming, etc..

Tsz Wo (Nicholas), SZE added a comment - 11/Mar/08 06:05 PM
> I am only worried about limit 0 in String.split().
The only different for split(p, 0) and split(p, -1) is the root path, i.e. "/". However, root path is treated as a special case in the codes (for both before and after the patch.) In addition to the regression tests, I manually tested it.

2423_20080311.patch: fixed a bug.


Tsz Wo (Nicholas), SZE added a comment - 11/Mar/08 06:06 PM
Since there are already a lot of tests for mkdir, no additional tests added.

Hadoop QA added a comment - 13/Mar/08 08:15 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12377629/2423_20080311.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included -1. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1950/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1950/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1950/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1950/console

This message is automatically generated.


dhruba borthakur added a comment - 13/Mar/08 07:51 PM
I just committed this. Thanks Nicholas.

Hudson added a comment - 14/Mar/08 03:53 PM