Description
There's a fair amount of Hadoop code which uses {{FileSystem.listStatus(path)}} just to get a {{FileStatus[]}} array which it can then iterate over in a for loop.
This is inefficient and scales badly, as the entire listing is fetched and held in memory before any processing starts; it cannot handle directories with millions of entries.
The listLocatedStatus() calls return a RemoteIterator, which can't be used in for-each loops because any hasNext()/next() call may throw an IOException. That doesn't matter, as we now have closures and simple stream operations.
{{listLocatedStatus(path).filter((st) -> st.length > 0).apply(st -> fs.delete(st.path))}}
See? We could do shiny new closure things. It wouldn't necessarily need changes to FileSystem either, just something which takes a RemoteIterator and lets you chain some closures off it, similar to the Java 8 stream operations.
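A minimal sketch of what such a helper could look like, to make the idea concrete. Everything here is an assumption for illustration: the local RemoteIterator interface is only a stand-in with the same shape as Hadoop's (hasNext/next may throw IOException), and the names RemoteIteratorOps, IOPredicate, IOConsumer, filter, foreach and fromIterator are hypothetical, not an existing or proposed API.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Stand-in with the same shape as Hadoop's RemoteIterator (assumption: kept local so the sketch is self-contained). */
interface RemoteIterator<T> {
  boolean hasNext() throws IOException;
  T next() throws IOException;
}

/** Hypothetical predicate which may raise an IOException. */
@FunctionalInterface
interface IOPredicate<T> {
  boolean test(T t) throws IOException;
}

/** Hypothetical consumer which may raise an IOException. */
@FunctionalInterface
interface IOConsumer<T> {
  void accept(T t) throws IOException;
}

/** Hypothetical helper: chains closures off a RemoteIterator without touching FileSystem. */
final class RemoteIteratorOps {
  private RemoteIteratorOps() {
  }

  /** Lazily pass through only the entries accepted by the predicate. */
  static <T> RemoteIterator<T> filter(RemoteIterator<T> source, IOPredicate<? super T> predicate) {
    return new RemoteIterator<T>() {
      private T lookahead;
      private boolean ready;

      @Override
      public boolean hasNext() throws IOException {
        // Pull from the source until an entry matches or the source is exhausted.
        while (!ready && source.hasNext()) {
          T candidate = source.next();
          if (predicate.test(candidate)) {
            lookahead = candidate;
            ready = true;
          }
        }
        return ready;
      }

      @Override
      public T next() throws IOException {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        ready = false;
        T result = lookahead;
        lookahead = null;
        return result;
      }
    };
  }

  /** Apply the action to every remaining entry; returns the number of entries processed. */
  static <T> long foreach(RemoteIterator<T> source, IOConsumer<? super T> action) throws IOException {
    long count = 0;
    while (source.hasNext()) {
      action.accept(source.next());
      count++;
    }
    return count;
  }

  /** Demo/test support only: wrap an in-memory iterator as a RemoteIterator. */
  static <T> RemoteIterator<T> fromIterator(Iterator<T> it) {
    return new RemoteIterator<T>() {
      @Override
      public boolean hasNext() {
        return it.hasNext();
      }

      @Override
      public T next() {
        return it.next();
      }
    };
  }
}
```

Against a real FileSystem the one-liner above might then read roughly as foreach(filter(fs.listLocatedStatus(path), st -> st.getLen() > 0), st -> fs.delete(st.getPath(), false)). Each entry is processed as the listing pages in, so the full listing is never materialized as an array.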
Once implemented, we can move to using it in the Hadoop code wherever we use listFiles() today.
Issue Links
- is part of: HADOOP-17450 hadoop-common to add IOStatistics API (Resolved)
- is related to: HADOOP-14000 S3guard metadata stores to support millions of entries (Resolved)