Hadoop HDFS / HDFS-202

Add a bulk FileSystem.getFileBlockLocations

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: hdfs-client, namenode
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed

      Description

      Currently map-reduce applications (specifically file-based input-formats) use FileSystem.getFileBlockLocations to compute splits. However, they are forced to call it once per file.
      The downsides are multiple:

      1. Even with a few thousand files to process the number of RPCs quickly starts getting noticeable
      2. The current implementation of getFileBlockLocations is too slow since each call results in 'search' in the namesystem. Assuming a few thousand input files it results in that many RPCs and 'searches'.

      It would be nice to have a FileSystem.getFileBlockLocations which can take in a directory, and return the block-locations for all files in that directory. We could eliminate both the per-file RPC and also the 'search' by a 'scan'.

      When I tested this for terasort, a moderate job with 8,000 input files, the runtime halved from the current 8s to 4s. Clearly this is much more important for latency-sensitive applications...
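      For illustration, here is a rough sketch of the per-file pattern described above, using the existing FileSystem API (the input path is made up); every getFileBlockLocations call is a separate NameNode RPC:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class PerFileLocations {
        public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          // One listStatus RPC for the directory...
          for (FileStatus stat : fs.listStatus(new Path("/user/data/input"))) {
            if (!stat.isDir()) {
              // ...then one getFileBlockLocations RPC per file.
              BlockLocation[] blocks =
                  fs.getFileBlockLocations(stat, 0, stat.getLen());
              // build splits from 'blocks' here
            }
          }
        }
      }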

      Attachments

      1. hdfsListFiles5.patch
        48 kB
        Hairong Kuang
      2. hdfsListFiles4.patch
        47 kB
        Hairong Kuang
      3. hdfsListFiles3.patch
        54 kB
        Hairong Kuang
      4. hdfsListFiles2.patch
        42 kB
        Hairong Kuang
      5. hdfsListFiles1.patch
        40 kB
        Hairong Kuang
      6. hdfsListFiles.patch
        43 kB
        Hairong Kuang

        Issue Links

          Activity

          Doug Cutting added a comment -

          An alternative to passing directories might be to pass a list of files. The request might get larger, but this is more precise, e.g., when only a subset of files in a directory will be used only that subset need be passed. Since globbing is client-side, this requires two round trips, one to list files and one to list their blocks, but that would still be a huge improvement over per-file RPC.

          Doug Cutting added a comment -

          How about adding something like:
          Map<FileStatus, BlockLocation[]> listBlockLocations(Path[]);
          This would permit a glob-free job to get everything it needs in a single RPC, and a globbing job to do so with two RPCs.
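          As a rough sketch only (this exact method was never added to FileSystem in this form), the proposed shape would be something like:

          import java.io.IOException;
          import java.util.Map;
          import org.apache.hadoop.fs.BlockLocation;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.Path;

          // Hypothetical interface sketch of the proposal; directories in 'paths'
          // would be expanded to their (non-recursive) file children.
          interface BulkBlockLocations {
            Map<FileStatus, BlockLocation[]> listBlockLocations(Path[] paths) throws IOException;
          }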

          Arun C Murthy added a comment -

          Map<FileStatus, BlockLocation[]> listBlockLocations(Path[]);

          +1

          Konstantin Shvachko added a comment -

          Currently getBlockLocations(src, offset, length) returns a class called LocatedBlocks, which contains a list of LocatedBlock belonging to the file.

          public class LocatedBlocks implements Writable {
            private long fileLength;
            private List<LocatedBlock> blocks; // array of blocks with prioritized locations
          }
          

          The question is whether we should modify LocatedBlocks, which would include the map proposed by Doug and extend the semantics of getBlockLocations() to handle directories, or should we introduce a new method (rpc) getBlockLocations(srcDir) returning LocatedBlockMap.
          Is there a reason to keep current per file getBlockLocations() if we had a more generic method?

          Doug Cutting added a comment -

          > Is there a reason to keep current per file getBlockLocations() if we had a more generic method?

          Not that I can think of. +1 for replacing it.

          dhruba borthakur added a comment -

          If we adopt the approach that Doug has suggested, then the namenode still has to search for each input path in the file system namespace. This approach still has the advantage that the number of RPC calls are reduced. If we adopt Arun's proposal that specifies a directory and the RPC-call returns the splits of all the files in that directory, then it reduces the number of searches in the FS namespace as well as the number of RPC calls. I was kind-of leaning towards Arun's proposal, but Doug's approach is a little more flexible in nature, isn't it?

          Arun C Murthy added a comment -

          Dhruba, I was thinking it was implicit in Doug's proposal that if one of the paths in the Path[] is a directory, then the new api would return block-locations of all its children (non-recursively?), which would satisfy the original requirement. Doug, can you please confirm?

          Doug Cutting added a comment -

          > Doug, can you please confirm?

          Yes, I had assumed that any directories in the request would be expanded. The goal is to have something we can call from FileInputFormat, which takes a list of patterns. When the patterns contain no wildcards, we should be able to create splits with a single RPC to the NameNode. So the semantics should match those of FileInputFormat in this case.

          Konstantin Shvachko added a comment -

          > (non-recursively?)

          I think the rpc call itself should not be recursive. It is like with ls: the getListing() call is non-recursive, but the client recursively calls getListing() on sub-directories.
          The idea is to prevent people from making the mistake of calling getBlockLocations("/") on large directory trees recursively, which may freeze the name-node for a long period of time.
          Non-recursive variant should be sufficient to cover Arun's use case.
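          For illustration, the client-side recursion over a flat listing call looks roughly like this (sketch only, using the public FileSystem API rather than the raw getListing RPC):

          import java.io.IOException;
          import java.util.List;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          class RecursiveLister {
            // One non-recursive listing RPC per directory; the client decides whether to descend.
            static void collect(FileSystem fs, Path dir, List<FileStatus> out) throws IOException {
              for (FileStatus stat : fs.listStatus(dir)) {
                if (stat.isDir()) {
                  collect(fs, stat.getPath(), out);
                } else {
                  out.add(stat);
                }
              }
            }
          }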

          dhruba borthakur added a comment -

          Ok, so from what I can understand, here is the proposal:

          Map<FileStatus, BlockLocation[]> listBlockLocations(Path[] inputPaths);

          The "inputPaths" can be a set of files and/or directories. If one of the inputPaths is a directory, then all items inside that directory (only one level, not recursive) are listed and their block locations are returned by this call. if one of the inputPaths is a file, then its block locations are returned by this call. The FileStatus returned by this call should have the absolulte path of the object being returned.

          Doug Cutting added a comment -

          Dhruba: yes, that sounds right to me.

          A further clarification: should subdirectories be included, with empty block lists, or elided? My hunch is to eliminate them, so that every FileStatus returned is for a plain file--no directories. Does that sound right to others?

          dhruba borthakur added a comment -

          > every FileStatus returned is for a plain file--no directories

          Sounds good to me.

          Raghu Angadi added a comment -

          I see why the interface takes an array of paths. But I am not sure why it returns a map (not that there is anything wrong with it). This is probably the only RPC returning a map in Hadoop.

          How does a user figure out which were valid and which were invalid/non-existent/empty paths? Maybe the user does not care?

          'getBlockLocations()' returns the blocks (sort of) sorted w.r.t. the client. Should this interface do that too? The M/R use case does not need them sorted.

          Is there any other interface that resembles this?

          dhruba borthakur added a comment -

          > Is there any other interface that resembles this?

          The only thing that comes relatively close to this one is the READDIRPLUS operation in NFS. This call is more like getFileStatusBulk() for HDFS.

          Doug Cutting added a comment -

          > But not sure why it returns a map

          It could perhaps instead return an array of two-element structs, each containing a filestatus/blocklocations pair, but a Map seems simpler.

          > How does a user figure out which were valid and which were invalid/non-existent/empty paths?

          Non-existent paths should be ignored. Paths whose URIs are for different filesystems or are somehow unparseable should cause an exception.

          Jakob Homan added a comment -

          The current API implementation of getBlockLocations includes parameters for the byte offset within the file and the number of bytes within the files for which to return blocks. These parameters aren't provided currently in the specification for the new API. Would it be better to pass in an array of BlockLocationRequests, each of which would consist of the path, start and length?

          The other option would be to add the start and length specifications and for them to apply to each of the paths within the array, which doesn't seem particularly useful.
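          Purely as a sketch of this suggestion (such a class was never committed), a request record might look like:

          import org.apache.hadoop.fs.Path;

          // Hypothetical request type from the discussion; not part of the actual API.
          class BlockLocationRequest {
            final Path path;
            final long start;   // byte offset within the file
            final long length;  // number of bytes for which to return blocks
            BlockLocationRequest(Path path, long start, long length) {
              this.path = path;
              this.start = start;
              this.length = length;
            }
          }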

          dhruba borthakur added a comment -

          > pass in an array of BlockLocationRequests, each of which would consist of the path, start and length

          +1. This sounds better than assuming that we need to send back all blocks for the specified path(s).

          Doug Cutting added a comment -

          > Would it be better to pass in an array of BlockLocationRequests, each of which would consist of the path, start and length?

          That is more general, but what is the use case? The motivating use case for listBlockLocations() is map reduce split construction, which typically takes a list of files as input, not a list of sections of files. Adding a feature that won't be used will just make this new API harder to use. -1 without a compelling use case.

          Jakob Homan added a comment -

          That is more general, but what is the use case?

          The original motivation was Arun and Owen noticing during the terasort work that a large number of RPC calls were made during task scheduling and that a bulk method could ameliorate that. That seems reasonable to me. I'll let Arun lobby further.

          One point that came up in discussions is that it would be a good idea to have a maximum number of files that can be returned at once in order to not overwhelm the namenode. Whether this is hard-coded or configurable was not decided.

          Doug Cutting added a comment -

          I meant, what is the use case for passing in start/end positions per file? I support the idea of a bulk call, but don't see the need to pass start/end positions per file.

          dhruba borthakur added a comment -

          I think the extended version of the API would help in doing incremental distcp when hdfs-append is supported. We use "distcp -update" to do an incremental copy of files that have changed in length, but having this proposed extended API (and more) allows distcp to copy only changed portions of a file.

          Arun C Murthy added a comment -

          Quick note: making the length mandatory (i.e. part of the api) has the unfortunate side-effect of forcing a stat on each file prior to the call to listBlockLocations. So, from a Map-Reduce perspective it is important to have an api which does not force InputFormats to pass in the lengths.

          OTOH if we really need the more general version of the api I'd like to pass in "-1" to imply the whole file.

          Doug Cutting added a comment -

          > I think the extended version of the API would help in doing incremental distcp when hdfs-append is supported.

          Thanks for the use case! An append-savvy incremental distcp might first use listStatus to get all file lengths and dates from both filesystems, then figure out which had grown longer but whose creation dates had not changed, indicating they'd been appended to. Then a batch call could be made to fetch block locations of just newly appended sections, and these would be used to construct splits that can be localized well. Does that sound right?

          In this case we would not list directories, but rather always pass in a list of individual files. The mapping from inputs to outputs would be 1:1 so it could take the form:

          BlockLocation[] getBlockLocations(BlockLocationRequest[])

          A corollary is that it does not make sense to pass start/end positions for a directory, although these could be ignored.

          Do we want to try to develop a single swiss-army-knife batch call, or add operation-optimized calls as we go?

          Sanjay Radia added a comment -

          Is the optimization for sending only partial block reports really necessary? Most files have very few blocks ...
          Also, Arun's point about the extra call to getFileStatus() is valid.

          Why not create a class called DetailedFileStatus which contains both the file status and block locations:

          DetailedFileStatus[] = getBlockLocations(Path[] paths); // 1:1 mapping between the two arrays as Doug suggested.

          We can add the range one later if we really need that optimization.

          Doug Cutting added a comment -

          > Is the optimization for sending only partial block reports really necessary?

          It may help the append-savvy distcp use case, but is not needed in the mapred job submission use case. Even in the append-savvy distcp use case, it's not clear that it's required. Maybe we should punt that until someone develops an append-savvy distcp?

          > Why not create a class called DetailedFileStatus which contains both the file status and block locations:

          Why is DetailedFileStatus[] better than Map<FileStatus,BlockLocation[]>? The latter seems more transparent.

          > DetailedFileStatus[] = getBlockLocations(Path[] paths); // 1:1 mapping between the two arrays as Doug suggested.

          That was intended for the append-savvy distcp use case. The original use case was for mapred job submission, where we typically have a list of directories. With directories there is not a 1:1 mapping.

          Sanjay Radia added a comment -

          > Maybe we should punt that until someone develops an append-savvy distcp?
          +1

          >Why is DetailedFileStatus[] better than Map<FileStatus,BlockLocation[]>? The latter seems more transparent.
          I was holding out against a file system interface returning a map. But that is old school.
          Fine, I am convinced.

          I suspect you also want the rpc signature to return a map (that makes me more nervous because most rpcs do not support that - but ours does, I guess).

          Wrt the new FileContext api, my proposal is that it provides a single getBlockLocations method:

          Map<FileStatus,BlockLocation[]> getBlockLocations(Path[] path)

          and abandon the BlockLocation[] getBlockLocations(path, start, end).

          (of course FileSystem will continue to support the old getBlockLocations.)

          Doug Cutting added a comment -

          +1

          dhruba borthakur added a comment -

          +1

          Hairong Kuang added a comment -

          I am quite bothered that the proposed API returns a map. Is the reason for returning a map because the API does one-level listPath? Is there a use case that needs only one-level expansion?

          If we eventually need to get the block locations of all files recursively under the input paths, is the following API a better choice?

          /**
            * @return the block locations of all files recursively under the input paths
            */
          Iterator<BlockLocation> getBlockLocations(Path[] paths)
          

          When implementing this in HDFS, we might need to issue multiple RPCs and be very careful to limit the size of each RPC request and response.

          Hairong Kuang added a comment -

          I read FileInputFormat and understand the use case much better. So the client needs to know FileStatus for filtering, and there is a configuration parameter to specify whether the input paths need to be traversed recursively. In this case, how about the following revised API?

          class FileStatusAndBlockLocations {
            FileStatus fileStatus;
            BlockLocation [] blocks;
          }
          
          Iterator<FileStatusAndBlockLocations> getBlockLocations(Path[] paths, boolean isRecursive);
          
          Hairong Kuang added a comment -

          The above proposed method is an API in FileSystem.

          Internally in HDFS, I plan to add two new client-to-namenode RPCs:

          class HdfsFileStatusAndBlockLocations {
            HdfsFileStatus fileStatus;
            BlockLocation [] blocks;
          }

          /**
           * Given an array of input paths, return an array of file status and block locations.
           * The input array and output array have the same size.
           * The ith item in the output array is the file status and block locations
           * of the ith path in the input array.
           * If an input path is a directory, its block locations are empty.
           */
          HdfsFileStatusAndBlockLocations[] getFileStatusAndBlockLocations(Path[] paths);

          /**
           * Given an input directory, return the file status and block locations of its children.
           */
          HdfsFileStatusAndBlockLocations[] listFileStatusAndBlockLocations(Path path);

          Suppose the subtrees that represent a job's input paths contain N directories; the two APIs allow a dfs client to issue N+1 RPCs to the NameNode to implement the above proposed file system API.

          Hairong Kuang added a comment -

          I also plan to use the same idea of iterative listing (HDFS-985) to limit the size of the response when listing the FileStatusAndBlockLocations of a directory.

          Hairong Kuang added a comment -

          I want to explain the difference between my proposal and the previous proposal.
          1. For the FileSystem API, the user can specify whether the input paths need to be recursively traversed or not. The return result is an iterator, which allows the input files to be fetched from the server one batch at a time so as to avoid OOM exceptions when the input paths are huge.
          2. The design of the new RPCs allows us to return HdfsFileStatus (local file name) instead of FileStatus (full path name), saving CPU processing time. It also allows us to easily limit the response size.

          If nobody is against it, I will go ahead with the implementation.

          dhruba borthakur added a comment -

          +1 to this proposal.

          > The return result is an iterator, which allows the input files to be fetched from

          However, if the number of files in a directory is small (say 500), then we can still fetch everything in one RPC, can't we?

          Hairong Kuang added a comment -

          > if the number of files in a directory is small (say 500), then we can still fetch everything in one RPC, can't we?
          I will reuse DFS_LIST_LIMIT introduced in HDFS-985. Its default value is 1000. So by default, 500 files will be fetched in one RPC.

          Hairong Kuang added a comment -

          Taking multiple paths as an input to a FileContext API and an HDFS client-NN RPC seems to be a bad idea. It adds quite a lot of complexity for grouping paths by file systems and for resolving symbolic links. It does not sound clean and I'd like to avoid it. So here is the revised proposal:

          class LocatedFileStatus extends FileStatus {
            BlockLocation [] blocks;
          }
          

          FileSystem and FileContext will have a new API

          public Iterator<FileStatusAndBlockLocations> listLocatedFileStatus(Path path, boolean isRecursive);
          

          This new API is similar to FileContext#listStatus in many ways, except that the returned LocatedFileStatus contains its block locations and, if isRecursive is true, all the files in the subtree rooted at the input path will be returned.

          Similarly in HDFS, we will have

          class HdfsLocatedFileStatus extends HdfsFileStatus {
            BlockLocations[] blocks;
          }
          

          ClientProtocol will add one more parameter "boolean withLocation" to the existing getListing RPC.

          public DirectoryListing getListing(String src,
                                               byte[] startAfter,
                                               boolean withLocation)
                throws AccessControlException, FileNotFoundException,
                UnresolvedLinkException, IOException;
          

          If withLocation is false, the semantics are the same as before. When withLocation is true, DirectoryListing will contain LocatedFileStatus.
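          A minimal sketch of how a client might page through a large directory with the extended RPC, assuming the iterative-listing helpers from HDFS-985 (DirectoryListing#getPartialListing, #getLastName, #hasMore) and HdfsFileStatus.EMPTY_NAME as the starting cookie:

          import java.io.IOException;
          import org.apache.hadoop.hdfs.protocol.ClientProtocol;
          import org.apache.hadoop.hdfs.protocol.DirectoryListing;
          import org.apache.hadoop.hdfs.protocol.HdfsFileStatus;

          class LocatedListingSketch {
            static void listWithLocations(ClientProtocol namenode, String src) throws IOException {
              byte[] startAfter = HdfsFileStatus.EMPTY_NAME;
              DirectoryListing page;
              do {
                // the boolean flag asks the NameNode to piggyback block locations
                page = namenode.getListing(src, startAfter, true);
                if (page == null) {
                  break;  // src no longer exists
                }
                for (HdfsFileStatus stat : page.getPartialListing()) {
                  // files come back as located statuses carrying their block locations
                }
                startAfter = page.getLastName();
              } while (page.hasMore());
            }
          }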

          Hairong Kuang added a comment -

          Sorry the new FileSystem/FileContext API should be

          public Iterator<LocatedFileStatus> listLocatedFileStatus(Path path, boolean isRecursive);
          
          Hairong Kuang added a comment -

          This is an initial patch for the HDFS implementation of the proposed FileSystem API.

          Hairong Kuang added a comment -

          I am not sure what we should do if a child of the input directory is a symbolic link. Whether the symbolic link should be resolved or not is better decided by applications.

          It seems cleaner if the new API changes to listLocatedFileStatus(Path path), so it does not traverse the subtree recursively and it returns all the contents of the directory. BlockLocations are piggybacked if a child is a file. This design decision leaves questions like how to deal with a child that is a symbolic link or a directory to be answered by applications.

          Doug Cutting added a comment -

          > I am not sure what should we do if a child of the input directory is a symbolic link.

          Handling of symlinks should be addressed in HADOOP-6870, no?

          Hairong Kuang added a comment -

          Hi Doug, thanks for your review comments.

          Yes, handling of symlinks should be addressed in FileContext in HADOOP-6870. HDFS-202 serves as the discussion board for this issue, so I posted the question here.

          My question is whether this new API should handle recursive traversal and symbolic link resolution. Is it cleaner if it does not do any of these and leaves the decisions to applications?

          Doug Cutting added a comment -

          > My question is whether this new API should handle recursive traversal and symbolic resolution.

          My intuition is that recursive file listings for open should follow symbolic links, since open follows symbolic links. Recursive traversal for remove should not follow symbolic links, but should just remove the symbolic link, like remove does on a symbolic link.

          Hairong Kuang added a comment -

          hdfsListFiles1.patch supports the listFiles API in both DistributedFileSystem and Hdfs.

          It has two unit tests. In particular, TestListFilesInFileContext includes a test on an input directory that contains two symbolic links.
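          For reference, a small usage sketch from the application side, assuming the FileSystem#listFiles signature in the patch (RemoteIterator<LocatedFileStatus> listFiles(Path, boolean recursive)); the input path is made up:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.BlockLocation;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.LocatedFileStatus;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.fs.RemoteIterator;

          public class ListFilesSketch {
            public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              // Recursively list all files under the input path; block locations are
              // piggybacked on each LocatedFileStatus, so no per-file RPC is needed.
              RemoteIterator<LocatedFileStatus> it =
                  fs.listFiles(new Path("/user/data/input"), true);
              while (it.hasNext()) {
                LocatedFileStatus stat = it.next();
                BlockLocation[] blocks = stat.getBlockLocations();
                // build splits from stat.getLen() and 'blocks'
              }
            }
          }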

          Suresh Srinivas added a comment -
          1. HDFS.java
            • Not sure about "NB:" in the comment
            • DirListingIterator
              • make f, src, needLocation final. Add javadoc to the class.
              • getNext() should call hasNext(). A caller calling next() without calling hasNext() could result in not fetching the new partial list and also an ArrayIndexOutOfBoundsException (see the sketch below this list).
          2. DistributedFileSystem
            • listLocatedStatus() make src final.
            • listFiles() make itor, curFile private and dirStats final.
          3. HDFSFileLocatedStatus.java - missing banner.
          4. FSNamesystem.java - make getBlockLocationsInternal() private
          5. NameNode.java - change variable hasLocation to needLocation. Is the getListing() variant with the boolean flag just used by fsck? Do we need that variant?
          6. General - as in the hadoop common implementation, on IOException, should next() and hasNext() throw RuntimeException instead of returning false? Please note the comment in HADOOP-6870 about whether this is the right way to handle FileNotFoundException.
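          Regarding the getNext()/hasNext() point in item 2 above, the batched-iterator pattern being asked for looks roughly like this (generic sketch, not the actual DirListingIterator code):

          import java.util.Iterator;
          import java.util.NoSuchElementException;

          // Sketch: hasNext() is the only place that fetches the next partial batch,
          // and next() always goes through hasNext(), so a caller that skips hasNext()
          // still triggers the fetch instead of running off the end of the array.
          abstract class BatchedIterator<T> implements Iterator<T> {
            private T[] batch;
            private int pos;
            private boolean done;

            // Returns the next partial listing, or null/empty when exhausted (assumption).
            protected abstract T[] fetchNextBatch();

            public boolean hasNext() {
              if (done) {
                return false;
              }
              if (batch == null || pos >= batch.length) {
                batch = fetchNextBatch();
                pos = 0;
                if (batch == null || batch.length == 0) {
                  done = true;
                }
              }
              return !done;
            }

            public T next() {
              if (!hasNext()) {
                throw new NoSuchElementException("No more entries");
              }
              return batch[pos++];
            }

            public void remove() {
              throw new UnsupportedOperationException();
            }
          }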
          Doug Cutting added a comment -

          By default I think exceptions should be thrown. This is like the return status of the unix 'ls' command, which is non-zero if, e.g., a directory is unreadable. But perhaps an option to force enumeration in the face of exceptions would be useful.

          Suresh Srinivas added a comment -

          Unix 'ls' returns all the results in one shot. However, when getting responses iteratively the behavior is different:

          1. When listing a single directory, if some ls results have been returned and the directory is deleted, we should throw FileNotFoundException, to indicate the directory is no longer available.
          2. When recursively listing under a directory, if a subdirectory is deleted, the more appropriate response is to ignore FileNotFound for that directory and return the remaining results (sketched below). This would be consistent with the result if the command were repeated. Further, if an application is recursively listing a large directory whose state keeps changing, the application may have to retry many times to list it.
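          A sketch of the distinction in point 2: during the recursive walk a FileNotFoundException from a concurrently deleted subdirectory is swallowed, while the top-level directory still fails loudly (illustration only, using the plain listStatus API rather than the new iterator):

          import java.io.FileNotFoundException;
          import java.io.IOException;
          import java.util.List;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          class TolerantLister {
            // The root listing is allowed to throw FileNotFoundException to the caller.
            static void listRecursive(FileSystem fs, Path root, List<FileStatus> out) throws IOException {
              for (FileStatus stat : fs.listStatus(root)) {
                if (stat.isDir()) {
                  try {
                    listRecursive(fs, stat.getPath(), out);
                  } catch (FileNotFoundException fnfe) {
                    // Subdirectory deleted while we were walking: skip it and continue.
                  }
                } else {
                  out.add(stat);
                }
              }
            }
          }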
          Hairong Kuang added a comment -

          As I commented in HADOOP-6890, I would prefer throwing exceptions when a file/directory is deleted during listing. This is because getFiles is used by the MapReduce job client to calculate splits, so the expectation is that the input directories remain unchanged during job execution. It is better to fail the job earlier than later.

          Hairong Kuang added a comment -

          This patch addressed Suresh's review comments.

          Hairong Kuang added a comment -

          hdfsListFiles3.patch adds a fault-injection test to make sure that the iterator returned by listLocatedStatus throws RuntimeException in case of an I/O error.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12451084/hdfsListFiles3.patch
          against trunk revision 981289.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The patch appears to cause tar ant target to fail.

          -1 findbugs. The patch appears to cause Findbugs to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/226/testReport/
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/226/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/226/console

          This message is automatically generated.

          Hairong Kuang added a comment -

          There seems to be a problem with Hudson. Let me try it one more time.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12451084/hdfsListFiles3.patch
          against trunk revision 982091.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/228/console

          This message is automatically generated.

          Hairong Kuang added a comment -

          This patch incorporated changes made in HADOOP-6900.

          Suresh Srinivas added a comment -

          Comments:

          1. ListPathAspects.aj - callGetListing() method has a description which says rename
          2. HDFSFileLocatedStatus.java - missing banner.
          Suresh Srinivas added a comment -

          +1 for the patch if the above comments are taken care of.

          Hairong Kuang added a comment -

          This patch addresses Suresh's comments.

          Hairong Kuang added a comment -

          Not able to run ant test-patch because the trunk does not compile. But I checked that this patch does not introduce new Javadoc warnings and adds new tests. There were quite a few unit tests failing, but they seem unrelated to this patch.

          Konstantin Shvachko added a comment -

          > the trunk does not compile
          See here

          Hairong Kuang added a comment -

          Konstantin, the hdfs trunk should be able to compile because I've committed this patch. HDFS-202 is the HDFS side of HADOOP-6900!

          Thanks Suresh for reviewing this patch at full speed!

          Hairong Kuang added a comment -

          I've committed this!

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #370 (See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/370/)

          Amareshwari Sriramadasu added a comment -

          Shouldn't we mark this feature as an Incompatible change? It changed the signature of getListing() and broke the MapReduce build, MAPREDUCE-2022

          Tsz Wo Nicholas Sze added a comment -

          Yes, this is an Incompatible change.


            People

            • Assignee: Hairong Kuang
            • Reporter: Arun C Murthy
            • Votes: 1
            • Watchers: 14
