+1; I'll also update the filesystem spec docs to match the javadocs
I raised the topic on hdfs-dev as to whether we should say "any order works", or whether there is a hard sort-order requirement. The consensus was: POSIX doesn't specify an order, and neither should Hadoop. But the fact that HDFS is ordered means that applications may have expectations that other filesystems (or future versions of HDFS) may not meet.
Java's File.listFiles() javadoc:
"There is no guarantee that the name strings in the resulting array will appear in any specific order; they are not, in particular, guaranteed to appear in alphabetical order."
The POSIX spec for readdir (http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html) doesn’t spell out a sort order, so it should be assumed that the ordering isn’t guaranteed.
Chris Siebenmann has written a few relevant blog posts on the topic that might be of interest here:
So I think it’s OK to break the API here ...
POSIX ls (http://pubs.opengroup.org/onlinepubs/000095399/utilities/ls.html) DOES require its output be sorted. So breaking the sort order of 'hadoop fs -ls' would be extremely bad. We need to make sure that doesn’t change.
We had a discussion about this on
HADOOP-10798. Although HDFS always
returns listStatus results in alphabetically sorted order because of
implementation issues, the local filesystem does not return things in
alphabetically sorted order.
I think it's fine in principle to specify that listStatus returns
things in undefined order. After all, as Allen mentioned, this is
what POSIX does. I do think that in practice, this will result in a
lot of HDFS-only code getting written where there is a hidden
assumption that listStatus, globStatus, etc. sort their responses.
This might make portability more difficult.
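To illustrate the portability concern, here is a minimal sketch of a caller that sorts a listing itself instead of relying on the order the filesystem happens to return. It uses plain java.io.File so that it is self-contained; the class and method names (SortedListing, sortedCopy, listSorted) are hypothetical, and the assumption is that the same pattern would apply to the array returned by listStatus.

```java
import java.io.File;
import java.util.Arrays;

public class SortedListing {

    /** Returns a sorted copy of the names; the input array is left untouched. */
    static String[] sortedCopy(String[] names) {
        String[] copy = Arrays.copyOf(names, names.length);
        Arrays.sort(copy); // natural String order, independent of the filesystem
        return copy;
    }

    /** Lists a directory and sorts explicitly. File.list() returns null on error. */
    static String[] listSorted(File dir) {
        String[] names = dir.list();
        return names == null ? new String[0] : sortedCopy(names);
    }

    public static void main(String[] args) {
        // Hypothetical usage: list the given directory, or the current one.
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (String name : listSorted(dir)) {
            System.out.println(name);
        }
    }
}
```

Code written this way behaves the same whether the underlying filesystem returns sorted results (as HDFS currently does) or not (as the local filesystem may).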
I'm not sure if there is a good way around this problem. Requiring
results to be returned in sorted order would be really harmful to
performance for things like Ceph and Lustre-- we'd essentially be
forcing a ton of client-side buffering and a sort. But having HDFS do
sorted order and other FSes not do it would certainly make portability
more difficult.
One possibility is that we could randomize the order of returned
results in HDFS (at least within a given batch of results returned
from the NN). This is similar to how the Go programming language
randomizes the order of iteration over hash table keys, to avoid code
being written which relies on a specific implementation-defined order.
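A rough sketch of what per-batch randomization could look like, purely as an illustration and not as a description of any actual HDFS code. The names BatchShuffler and shuffledBatch are hypothetical, and the batch is modelled as a plain list of entry names rather than whatever the NN actually returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class BatchShuffler {

    /**
     * Returns a shuffled copy of one batch of results. Only the order inside
     * the batch changes; the set of entries is the same, and the caller's
     * list is not modified.
     */
    static <T> List<T> shuffledBatch(List<T> batch, Random rng) {
        List<T> copy = new ArrayList<>(batch);
        Collections.shuffle(copy, rng);
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical batch of names, in the sorted order HDFS would produce.
        List<String> batch = Arrays.asList("a.txt", "b.txt", "c.txt", "d.txt");
        System.out.println(shuffledBatch(batch, new Random()));
    }
}
```

A caller that silently depends on sorted output would then tend to fail quickly in testing rather than only after moving to another filesystem.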
Regardless of whether we do that, though, there is a bunch of code
even in Hadoop common that doesn't properly deal with unsorted
listStatus / globStatus... such as "hadoop fs -ls"