Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      We would like to be able to create certain files on certain storage device classes (e.g. spinning media, solid state devices, RAM disk, non-volatile memory). HDFS-2832 enables heterogeneous storage at the DataNode, so the NameNode can gain awareness of which storage options are available in the pool and where they are located, but no API is provided for clients or block placement plugins to perform device-aware block placement. We would like to propose a set of extensions that also have broad applicability to use cases where storage device affinity is important:

      • Add an enum of generic storage device classes, borrowing from the current taxonomy of the storage industry (a minimal sketch of such an enum, and of a create-time hint, follows this list)
      • Augment DataNode volume metadata in storage reports with this enum
      • Extend the namespace so pluggable block policies can be specified on a directory and storage device class can be tracked in the Inode. Perhaps this could be a larger discussion on adding support for extended attributes in the HDFS namespace. The Inode should track both the storage device class hint and the current actual storage device class. FileStatus should expose this information (or xattrs in general) to clients.
      • Extend the pluggable block policy framework so policies can also consider, and specify, affinity for a particular storage device class
      • Extend the file creation API to accept a storage device class affinity hint. Such a hint can be supplied directly as a parameter, or, if we are considering extended attribute support, then instead as one of a set of xattrs. The hint would be stored in the namespace and also used by the client to indicate to the NameNode/block placement policy/DataNode constraints on block placement. Furthermore, if xattrs or storage device class affinity hints are associated with directories, then the NameNode should provide the storage device affinity hint to the client in the create API response, so the client can provide the appropriate hint to DataNodes when writing new blocks.
      • The list of candidate DataNodes for new blocks supplied by the NameNode to clients should be weighted/sorted by availability of the desired storage device class.
      • Block replication should consider storage device affinity hints. If a client move()s a file from a location under a path with affinity hint X to a location under a path with affinity hint Y, then all blocks currently residing on media X should eventually be replicated onto media Y, with the then-excess replicas on media X deleted.
      • Introduce the concept of a degraded path: a path can be degraded if a block placement policy is forced to abandon a constraint in order to persist the block (for example, when there is no available space on the desired device class) or to maintain the minimum necessary replication factor. This concept is distinct from a corrupt path, where one or more blocks are missing. Paths in the degraded state should be periodically reevaluated for re-replication.
      • The FSShell should be extended with commands for changing the storage device class hint for a directory or file.
      • Clients like DistCP that compare metadata should be extended to be aware of the storage device class hint. For DistCP specifically, there should be an option to ignore the storage device class hints, enabled by default.
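
      To make the enum and the create-time hint concrete, here is a minimal sketch; all names below (StorageDeviceClass, CreateHint, and the example values) are hypothetical illustrations for this proposal, not an existing HDFS API.

{code:java}
// Hypothetical sketch only; none of these types exist in HDFS today.
public final class StorageHintSketch {

  /** Generic storage device classes, borrowing from common industry taxonomy. */
  public enum StorageDeviceClass {
    DEFAULT,              // the "null" class: no preference, cluster default applies
    SPINNING_DISK,        // conventional rotational media
    SOLID_STATE,          // SATA/PCIe flash
    RAM_DISK,             // memory-backed volume
    NON_VOLATILE_MEMORY
  }

  /** A create-time affinity hint a client could pass to the NameNode. */
  public static final class CreateHint {
    private final StorageDeviceClass deviceClass;

    public CreateHint(StorageDeviceClass deviceClass) {
      this.deviceClass = deviceClass;
    }

    public StorageDeviceClass getDeviceClass() {
      return deviceClass;
    }
  }

  public static void main(String[] args) {
    // e.g. a client creating a latency-sensitive file might ask for flash:
    CreateHint hint = new CreateHint(StorageDeviceClass.SOLID_STATE);
    System.out.println("create(..., hint=" + hint.getDeviceClass() + ")");
  }
}
{code}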

      Suggested semantics:

      • The default storage device class should be the null class, or simply the “default class”, for all cases where a hint is not available. This should be configurable; hdfs-default.xml could provide the default as spinning media.
      • A storage device class hint should be provided (and is necessary) only when the default is not sufficient.
      • For backwards compatibility, any FSImage or edit log entry lacking a storage device class hint is interpreted as having affinity for the null class.
      • All blocks for a given file share the same storage device class. If the replication factor for this file is increased, the replicas should all be placed on the same storage device class.
      • If one or more blocks for a given file cannot be placed on the required device class, then the file is marked as degraded. Files in degraded state should be periodically reevaluated for re-replication.
      • A directory or file can have only one storage device affinity hint. If the file inode specifies a hint, it is used; otherwise we walk up the path until a hint is found and use that one; if no hint is found, the default storage class is used (a minimal sketch of this resolution follows).
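
      A minimal sketch of the resolution rule above, assuming hypothetical INode and hint accessors (none of these names exist in HDFS; real INodes are not shaped like this):

{code:java}
// Hypothetical sketch of "file hint, else nearest ancestor's hint, else default".
public final class HintResolutionSketch {

  public enum StorageDeviceClass { DEFAULT, SPINNING_DISK, SOLID_STATE, RAM_DISK }

  /** Minimal stand-in for an inode, for illustration only. */
  public interface INodeLike {
    INodeLike getParent();                // null at the root
    StorageDeviceClass getStorageHint();  // null if no hint is set on this inode
  }

  public static StorageDeviceClass resolveHint(INodeLike inode,
                                               StorageDeviceClass clusterDefault) {
    for (INodeLike cur = inode; cur != null; cur = cur.getParent()) {
      StorageDeviceClass hint = cur.getStorageHint();
      if (hint != null) {
        return hint;                      // nearest hint on the path wins
      }
    }
    return clusterDefault;                // e.g. SPINNING_DISK from configuration
  }
}
{code}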

        Issue Links

          Activity

          Andrew Purtell added a comment -

          HDFS-2832 and its new subtasks have picked up some ideas from here; we might address the deltas later. Once we have production experience with our respective implementations, perhaps we can circle back and compare notes.

          Suresh Srinivas added a comment -

          Today the scope of HDFS-2832 was widened to duplicate this issue. Since the issues are linked, that was not necessary.

          I disagree. Here is the brief comment I had posted on that jira - https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12539644&commentId=13192326

          1. Support for heterogeneous storages:
            • DN could support, along with disks, other types of storage such as flash, etc.
            • Suitable storage can be chosen based on client preference such as need for random reads etc.
          2. Block report scaling: instead of a single monolithic block report, a smaller block report per storage becomes possible. This is important with the growth in disk capacity and number of disks per datanode.
          3. Better granularity of storage failure handling:
            • DN could just indicate loss of storage and namenode can handle it better since it knows the list of blocks belonging to a storage.
            • DN could locally handle storage failures or provide decommissioning of a storage by marking a storage as ReadOnly.
          4. Hot pluggability of disks/storages: adding and deleting a storage to a node is simplified.
          5. Other flexibility: includes future enhancements to balance storages within a datanode, balance the load (number of transceivers) per storage, etc., and better block placement strategies.

          It briefly mentions the following, which is duplicated in this jira:

          1. Client preference for writing to storages - this automatically means that block placement must consider storage type, etc.
          2. Support for different storage types in datanode and block reports based on that.
          3. Awareness of those storage types at the namenode (not just for block placement; there are various other benefits)
          4. Affinity of replicas to a storage type.

          Certainly you have elaborated on these points and added more implementation details. That does not mean it is a different jira.

          Andrew Purtell added a comment -

          Given poorly toned comments on HDFS-2832 and on public forums like Twitter, I am asking why this is not a dupe of HDFS-2832?

          Suresh,

          With this issue I had hoped to engage in discussions about tiered storage concerns specifically, and volunteered some of our thoughts for consideration. Today the scope of HDFS-2832 was widened to duplicate this issue. Since the issues are linked, that was not necessary. This is not an event that happened in a vacuum. Several individuals were disappointed to see this, and were discussing it in a non-ASF forum. Your response here to those conversations, a suggestion to shut down at least this aspect of our attempt to engage the Hadoop community, highlights why those conversations took place.

          Colin Patrick McCabe added a comment -

          I think the progress on HDFS-2832 has been encouraging. Perhaps we should have a meetup about HDFS-2832, HDFS-4672, and related issues next week? It would be a shame if some of the code here turned out to be duplicated effort.

          Suresh Srinivas added a comment -

          What is the difference between this issue and HDFS-2832?

          I have been waiting for an answer to this comment. Given poorly toned comments on HDFS-2832 and on public forums like Twitter, I am asking why this is not a dupe of HDFS-2832?

          Konstantin Shvachko added a comment - edited

          I think the Extended Attributes feature is orthogonal to this issue.

          Extended file attributes is a file system feature that enables users to associate computer files with metadata not interpreted by the filesystem

          This implies that xattrs should not be interpreted, just stored, by the system, while in this case you create a whole framework for block placement inside the file system.

          What is the difference between this issue and HDFS-2832? Are you adding an API to specify file placement, or is it an extension to HDFS-2832?
          Sorry for hitting the save button too early.

          Andrew Purtell added a comment -

          The granularity of where the policies can be placed is an important consideration. Having it at a directory level can make it much easier to manage, even though a per-file level appears to give more flexibility.

          Sure.

          Do you have an opinion on whether directory-level policies should be inherited from parents (or not)?

          I want the HBase journal and HBase store files to have different storage policies even though they reside in the same volume.

          Makes sense. I would like to see intermediate persisted output from chained MR jobs have different storage policies than the final output.

          Sanjay Radia added a comment -

          The granularity of where the policies can be placed is an important consideration. Having it at a directory level can make it much easier to manage, even though a per-file level appears to give more flexibility. HDFS snapshots and quotas are at a dir level. BTW, many file systems allow certain policies at a volume level. If HDFS supported multiple volumes within a NN then I would have put quotas and snapshots at a volume level, but I would still propose that storage policies be at a directory level, because one may want slightly finer granularity than a volume (I want the HBase journal and HBase store files to have different storage policies even though they reside in the same volume).

          Andrew Purtell added a comment -

          I would like to see a clear real use case to be identified first, in addition to generic design goals. HBASE-6572 can be it, but it needs more details, IMO

          Sure, if there is consensus on the general direction (introduce xattrs, use xattrs for storing device class hints, give block placement plugins xattr / device hint awareness), then we can put together a strawman implementation of HBASE-6572 that follows that consensus, with patches provided here and/or on HDFS-2006 later for the bits that reach down into HDFS, for further consideration and discussion at that time. I think the general direction here is good, modulo details on how exactly xattrs will work, such that the new capabilities will be applicable to many use cases beyond HBASE-6572.

          Andrew Purtell added a comment -

          I linked this issue to related jira HDFS-2006.

          Kihwal Lee added a comment -

          I would like to see a clear real use case to be identified first, in addition to generic design goals. HBASE-6572 can be it, but it needs more details, IMO. Without it, a lot of technical discussions will see no end. It lets us think about different features in a concrete context. Without a concrete use case to run through, some design decisions are very hard to make.

          Chris Nauroth added a comment -

          xattrs can also serve as a building block in the future towards implementing ACLs, something I had researched a few months ago.

          Existing file systems have used numerous strategies for implementing xattrs, with different space/time trade-offs. I've found this document to be useful:

          http://users.suse.com/~agruen/acl/linux-acls/online/

          The document primarily focuses on ACLs, but there is a section titled Extended Attributes, which goes into some detail about implementation on various file systems. Some of the points I find most interesting are:

          1. It's common to use blocks to store xattrs, but in a distributed file system, I expect we wouldn't want to incur the extra latency of a block read to retrieve xattrs.
          2. ext3 employs a flyweight-style pattern, so that multiple inodes with identical sets of xattrs can share the same copy, even if they are not in a parent-child relationship. In practice, this is likely to save a lot of space, because the number of distinct sets of xattrs is likely to be much lower than the number of inodes (a sketch of this sharing follows the list).
          3. XFS initially stores xattrs on the inode, which has a statically allocated size. Once the number of xattrs grows too large to fit on the inode, XFS then promotes xattrs through a series of more and more sophisticated external data structures to handle the growth. It can effectively support very large numbers of xattrs on a single inode.
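
          A minimal sketch of the flyweight-style sharing described in point 2, with purely hypothetical names (this is neither ext3's implementation nor an HDFS design): identical xattr sets are interned so that many inodes can reference one shared, immutable copy.

{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of flyweight-style sharing of identical xattr sets. */
public final class XAttrSetInterner {

  // Canonical copies keyed by their contents; many inodes can point at one entry.
  private final ConcurrentHashMap<Map<String, String>, Map<String, String>> pool =
      new ConcurrentHashMap<>();

  /** Returns a canonical, immutable copy of the given xattr set. */
  public Map<String, String> intern(Map<String, String> xattrs) {
    Map<String, String> canonical =
        Collections.unmodifiableMap(new TreeMap<>(xattrs));
    Map<String, String> existing = pool.putIfAbsent(canonical, canonical);
    return existing != null ? existing : canonical;
  }

  public static void main(String[] args) {
    XAttrSetInterner interner = new XAttrSetInterner();
    Map<String, String> a = interner.intern(Map.of("user.storage.hint", "SOLID_STATE"));
    Map<String, String> b = interner.intern(Map.of("user.storage.hint", "SOLID_STATE"));
    // Same contents -> same shared object, so N inodes pay for one copy.
    System.out.println(a == b);  // true
  }
}
{code}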

          Project management question: do we think it makes sense to spin out a separate xattrs jira and move discussion there? It would serve as a pre-requisite for storage policies and also ACLs. I think the xattrs feature itself is sufficiently complex to warrant its own round of design and implementation. (In fact, this jira has already spent more time discussing xattrs design than storage policies.)

          Andrew Purtell added a comment -

          In terms of storage, you would see a big win by only requiring one attribute per dir vs per file & again you would have only a single place to look, so less code.

          Can you clarify whether you mean only ONE attribute per directory or file, or that one (or more) attributes apply only to the one directory or file they are associated with?

          eric baldeschwieler added a comment -

          The complexity of lookup is why I'd suggested only the immediately containing directory. You don't want to have to walk the tree to see what policy applies. Checking exactly one directory would be a lot simpler.

          In terms of storage, you would see a big win by only requiring one attribute per dir vs per file & again you would have only a single place to look, so less code.

          Colin Patrick McCabe added a comment -

          We can add extended attributes in a way that imposes zero overhead for users who don't make use of them, by creating another subclass (or subclasses) of INode. Inherited xattrs (that apply to all descendants) are also a reasonable idea.

          Andrew Purtell added a comment -

          One way to reduce complexity and RAM pressure would be to only support placement hints on directories and have them apply only to files in that immediate directory. That should limit meta-data cost and address HBase and other use cases.

          On minimizing RAM pressure, the thing to do here might be to allow hints on a directory to apply to all descendants. Otherwise, if we have N directories under one parent, we would need N hints instead of 1.

          If the proposal for storage device class hints will be generalized/incorporated into an extended attributes facility, then this may be an interesting discussion. In the case of at least Linux, Windows NT+, and *BSD, xattrs are arbitrary name/value pairs associated only with a single file or directory object, and a query on a given file or directory returns only the xattrs found in its inode (or equivalent). However, since namespace storage in HDFS is at a premium, it may make sense to introduce a bit that signals the xattr should be inherited by all descendants.
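
          As a minimal sketch of what such an entry might look like (the class shape, field names, and flag are all hypothetical):

{code:java}
/** Hypothetical sketch of an xattr entry carrying an "inherit" bit. */
public final class InheritableXAttr {
  private final String name;        // e.g. "user.storage.hint"
  private final byte[] value;       // e.g. "SOLID_STATE" encoded as UTF-8
  private final boolean inherited;  // if true, applies to all descendants

  public InheritableXAttr(String name, byte[] value, boolean inherited) {
    this.name = name;
    this.value = value.clone();
    this.inherited = inherited;
  }

  public String getName() { return name; }
  public byte[] getValue() { return value.clone(); }
  public boolean isInherited() { return inherited; }
}
{code}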

          Andrew Purtell added a comment -

          It would be particularly interesting if people could use both flash and hard disks in the same cluster. Perhaps the flash could be used for HBase-backed storage, and the hard disks for everything else, for example.

          That is certainly a use case we are looking at. More specifically, migration of the blocks for a given column accessed in a read-mostly random access manner to the most suitable available storage device class for that type of workload. That would be the first aim of HBASE-6572.

          I feel like we might also want to enable automatic migration between tiers, at least for some files. I suppose this could also be done outside HDFS, with a daemon that looks at file access times (atimes) and attaches the correct xattrs.

          Our thinking is that block placement and replication policy plug points could be extended or introduced so it’s not necessary to deploy and manage an additional set of daemons, but that is only one possible implementation option.

          eric baldeschwieler added a comment -

          We should be careful to add as little complexity as possible while enabling the core feature here...

          Adding extended attributes is a well-discussed idea, probably the right way to go, but it adds RAM pressure on the NN and needs to be thought out carefully. I believe there is already a JIRA on that?

          One way to reduce complexity and RAM pressure would be to only support placement hints on directories and have them apply only to files in that immediate directory. That should limit meta-data cost and address HBase and other use cases.

          That said, tying namespace data to the blocks, where replication policy is applied, is a little complicated and deserves discussion. It is something Sanjay, Suresh, and I have been discussing; maybe they can jump in with their thoughts.

          Colin Patrick McCabe added a comment -

          Thanks for thinking about this, Andrew. This will be a nice feature to have in the future.

          It would be particularly interesting if people could use both flash and hard disks in the same cluster. Perhaps the flash could be used for HBase-backed storage, and the hard disks for everything else, for example.

          The xattr idea sounds like the right way to go for when you know what tier you want to put something in. I feel like we might also want to enable automatic migration between tiers, at least for some files. I suppose this could also be done outside HDFS, with a daemon that looks at file access times (atimes) and attaches the correct xattrs. However, traditional hierarchical storage management (HSM) systems integrate this into the filesystem itself, so we may want to consider this.
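
          As a rough sketch of one pass of such an out-of-band daemon, using the access time already exposed by the FileSystem API and assuming an xattr setter along the lines of what HDFS-2006 proposes (the path, xattr name, value, and threshold below are all hypothetical):

{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical one-shot pass of an out-of-band tiering daemon. */
public class AtimeTieringSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Files not read in the last 30 days are considered "cold" (arbitrary threshold).
    long coldBefore = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;
    for (FileStatus st : fs.listStatus(new Path("/data"))) {   // hypothetical path
      if (st.isFile() && st.getAccessTime() < coldBefore) {
        // Hypothetical xattr name/value; assumes an xattr API like HDFS-2006 proposes.
        fs.setXAttr(st.getPath(), "user.storage.hint",
            "SPINNING_DISK".getBytes(StandardCharsets.UTF_8));
      }
    }
  }
}
{code}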

          This would also allow us to consider other features like compressing infrequently-used data.


            People

            • Assignee: Unassigned
            • Reporter: Andrew Purtell
            • Votes: 0
            • Watchers: 49
