Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      This jira proposes to add object store capabilities into HDFS.
      As part of the federation work (HDFS-1052) we separated block storage as a generic storage layer. Using the Block Pool abstraction, new kinds of namespaces can be built on top of the storage layer i.e. datanodes.
      In this jira I will explore building an object store using the datanode storage, but independent of namespace metadata.

      I will soon update with a detailed design document.

      1. Ozone-architecture-v1.pdf
        1.60 MB
        Jitendra Nath Pandey
      2. ozone_user_v0.pdf
        4.48 MB
        Anu Engineer
      3. Ozonedesignupdate.pdf
        970 kB
        Anu Engineer
      4. HDFS-7240.001.patch
        5.83 MB
        Mukul Kumar Singh
      5. HDFS-7240.002.patch
        5.96 MB
        Mukul Kumar Singh
      6. HDFS-7240.003.patch
        5.96 MB
        Mukul Kumar Singh
      7. HDFS-7240.003.patch
        5.96 MB
        Mukul Kumar Singh
      8. HDFS-7240.004.patch
        5.96 MB
        Mukul Kumar Singh
      9. HDFS Scalability and Ozone.pdf
        328 kB
        Sanjay Radia
      10. HDFS-7240.005.patch
        6.03 MB
        Mukul Kumar Singh
      11. HDFS-7240.006.patch
        6.13 MB
        Mukul Kumar Singh
      12. MeetingMinutes.pdf
        94 kB
        Anu Engineer

        Issue Links

        1. Ozone: C/C++ implementation of ozone client using curl Sub-task Patch Available Shashikant Banerjee
         
        2. Ozone: Improve SQLCLI performance Sub-task Patch Available Yuanbo Liu
         
        3. Use GenericOptionsParser for scm and ksm daemon Sub-task Patch Available Elek, Marton
         
        4. Ozone : delete open key entries that will no longer be closed Sub-task Patch Available Chen Liang
         
        5. Ozone: Documentation: Add version specific documentation site to KSM Sub-task Patch Available Anu Engineer
         
        6. Add support for KSM --expunge command Sub-task Patch Available Shashikant Banerjee
         
        7. Ozone: Non-admin user is unable to run InfoVolume to the volume owned by itself Sub-task In Progress Lokesh Jain
         
        8. Ozone: Container : Add key versioning support-1 Sub-task In Progress Chen Liang
         
        9. Ozone: KSM: Allocate key should honour volume quota if quota is set on the volume Sub-task In Progress Lokesh Jain
         
        10. Ozone : debug cli: add support to load user-provided SQL query Sub-task In Progress Chen Liang
         
        11. Ozone : add read/write random access to Chunks of a key Sub-task In Progress Chen Liang
         
        12. Ozone: Audit Logs Sub-task Open Weiwei Yang
         
        13. Ozone: Use time units in the Ozone configuration values Sub-task Open Elek, Marton
         
        14. Ozone: SCM CLI: Implement get container metrics command Sub-task Open Yuanbo Liu
         
        15. Ozone: Support range in getKey operation Sub-task Open Mukul Kumar Singh
         
        16. Ozone:SCM: explore if we need 3 maps for tracking the state of nodes Sub-task Open Unassigned
         
        17. Ozone: SCM : Add priority for datanode commands Sub-task Open Unassigned
         
        18. Ozone: add TestDistributedOzoneVolumesRatis, TestOzoneRestWithMiniClusterRatis and TestOzoneWebAccessRatis Sub-task Open Tsz Wo Nicholas Sze
         
        19. Ozone : better handling of operation fail due to chill mode Sub-task Open Unassigned
         
        20. Ozone: SCM: clean up containers that timeout during creation Sub-task Open Xiaoyu Yao
         
        21. Ozone: In Ratis, leader should validate ContainerCommandRequestProto before propagating it to followers Sub-task Open Tsz Wo Nicholas Sze
         
        22. Ozone: Ozone shell: the root is assumed to hdfs Sub-task Open Nanda kumar
         
        23. Ozone: change TestRatisManager to check cluster with data Sub-task Open Tsz Wo Nicholas Sze
         
        24. Ozone: KSM: Add checkBucketAccess Sub-task Open Nanda kumar
         
        25. Ozone: Container server needs enhancements to control of bind address for greater flexibility and testability. Sub-task Open Anu Engineer
         
        26. Ozone : Optimize putKey operation to be async and consensus Sub-task Open Weiwei Yang
         
        27. Ozone: provide a way to validate ContainerCommandRequestProto Sub-task Open Anu Engineer
         
        28. ChunkManager functions do not use the argument keyName Sub-task Open Chen Liang
         
        29. Ozone: Document ozone metadata directory structure Sub-task Open Xiaoyu Yao
         
        30. OZone: SCM CLI: Implement get container command Sub-task Open Chen Liang
         
        31. Ozone:SCM: Add support for close containers in SCM Sub-task Open Anu Engineer
         
        32. Ozone: Support CopyContainer Sub-task Open Anu Engineer
         
        33. Ozone: Replace Jersey container with Netty Container Sub-task Open Anu Engineer
         
        34. Ozone: SCM: Handle duplicate Datanode ID Sub-task Open Anu Engineer
         
        35. Ozone: Fix the Cluster ID generation code in SCM Sub-task Open Shashikant Banerjee
         
        36. Ozone: Compact DB should be called on Open Containers. Sub-task Open Weiwei Yang
         
        37. Ozone: Support SCM multiple instance for HA Sub-task Open Anu Engineer
         
        38. Ozone:KSM: Add setVolumeAcls to allow adding/removing acls from a KSM volume Sub-task Open Mukul Kumar Singh
         
        39. Ozone: KSM : Support for simulated file system operations Sub-task Open Mukul Kumar Singh
         
        40. Ozone: Handle potential inconsistent states while listing keys Sub-task Open Weiwei Yang
         
        41. Ozone: Misc: Explore if the default memory settings are correct Sub-task Open Mukul Kumar Singh
         
        42. Ozone: Misc : Cleanup error messages Sub-task Open Unassigned
         
        43. Ozone: Corona: Support for online mode Sub-task Open Nanda kumar
         
        44. Ozone: Documentation: Add Ozone-defaults documentation Sub-task Open Ajay Kumar
         
        45. Ozone: Purge metadata of deleted blocks after max retry times Sub-task Open Yuanbo Liu
         
        46. Ozone: write deleted block to RAFT log for consensus on datanodes Sub-task Open Unassigned
         
        47. Ozone: ListVolume output misses some attributes Sub-task Open Mukul Kumar Singh
         
        48. Ozone: Remove the Priority Queues used in the Container State Manager Sub-task Open Elek, Marton
         
        49. Ozone: OzoneClient: Removal of old OzoneRestClient Sub-task Open Nanda kumar
         
        50. Ozone : Add an API to get Open Container by Owner, Replication Type and Replication Count Sub-task Open Lokesh Jain
         
        51. Ozone: TestContainerPersistence#testListContainer sometimes timeout Sub-task Open Ajay Kumar
         
        52. Ozone: Put key operation concurrent executes failed on Windows Sub-task Open Unassigned
         
        53. Ozone: fix and reenable TestKeysRatis#testPutAndGetKeyWithDnRestart Sub-task Open Mukul Kumar Singh
         
        54. Ozone: number of keys/values/buckets to KSMMetrics Sub-task Open Elek, Marton
         
        55. Ozone: OzoneFileSystem: Add StorageStatistics to OzoneFileSystem Sub-task Open Mukul Kumar Singh
         
        56. Ozone: OzoneFileSystem: both rest/rpc backend should be supported using unified OzoneClient client Sub-task Open Mukul Kumar Singh
         
        57. Ozone: SCM: avoid synchronously loading all the keys from containers upon SCM datanode start Sub-task Open Xiaoyu Yao
         
        58. Ozone: start-all script is missing ozone start Sub-task Open Bharat Viswanadham
         
        59. Ozone: Ratis: Moving Ratis pipeline creation to client Sub-task Open Unassigned
         
        60. Ozone: SCM: Lease support for pipeline creation Sub-task Open Unassigned
         
        61. Ozone: generate optional, version specific documentation during the build Sub-task Open Elek, Marton
         
        62. Make ContainerStateMachine#applyTransaction async Sub-task Open Lokesh Jain
         
        63. ADD support for KSM --createObjectStore command Sub-task Patch Available Shashikant Banerjee
         
        64. Ozone: XceiverClientManager should cache objects based on pipeline name Sub-task Patch Available Mukul Kumar Singh
         
        65. Ozone: SCM: update container allocated size to container db for all the open containers in ContainerStateManager#close Sub-task Open Unassigned
         
        66. Ozone: SCM: Make container report processing asynchronous Sub-task Open Unassigned
         
        67. Ozone: OzoneException needs to become an IOException Sub-task Open Anu Engineer
         
        68. Ozone: Parallelize ChunkOutputSream Writes to container Sub-task Patch Available Shashikant Banerjee
         
        69. Ozone: SCM: Support for Container LifeCycleState PENDING_CLOSE and LifeCycleEvent FULL_CONTAINER Sub-task Patch Available Nanda kumar
         
        70. Ozone: SCM: Close containers: extend SCMCommandResponseProto with SCMCloseContainerCmdResponseProto Sub-task Patch Available Elek, Marton
         
        71. Ozone: Expose RockDB stats via JMX for Ozone metadata stores Sub-task In Progress Elek, Marton
         
        72. Ozone: dozone: initialize scm directory on cluster startup Sub-task Patch Available Elek, Marton
         
        73. Ozone: Optimize number of allocated block rpc by aggregating multiple block allocation requests Sub-task Patch Available Mukul Kumar Singh
         
        74. Ozone: remove container name from pipeline and protobuf definition. Sub-task Open Mukul Kumar Singh
         
        75. Ozone: Client: TestOzoneRpcClient#testPutKeyRatisThreeNodes is failing Sub-task Open Nanda kumar
         
        76. Ozone: TestContainerPersistence#testListContainer fails frequently due to timed out Sub-task Patch Available Yiqun Lin
         
        77. Ozone: Upgrade to latest ratis build Sub-task Patch Available Mukul Kumar Singh
         

          Activity

          ebortnik Edward Bortnikov added a comment -

          Very interested to follow. How is this related to the previous jira and design on Block-Management-as-a-Service (HDFS-5477)?

          jnp Jitendra Nath Pandey added a comment -

          I think HDFS-5477 takes us towards making the block management service generic enough to support different storage semantics and API. In that sense object store will be one more use case for the block management. The object store design should work with the block management service.

          azuryy Fengdong Yu added a comment -

          Please look here for a simple description:
          http://www.hortonworks.com/blog/ozone-object-store-hdfs/

          gsogol Jeff Sogolov added a comment -

          Salivating at this feature. Would want to talk over the geo-replication at some point. But it's a killer feature set. I could start doing direct analytics on unstructured data without any ETL. Beam me up Scotty!

          gautamphegde Gautam Hegde added a comment -

          Awesome feature... would like to help develop this... where do I sign up?

          jnp Jitendra Nath Pandey added a comment -

          I have attached a high level architecture document for the object store in hdfs. Please provide your feedback. The development will start in a feature branch and jiras will be created as subtasks to this jira.

          lhofhansl Lars Hofhansl added a comment -

          Awesome stuff. We (Salesforce) have a need for this.

          I think these will lead to immediate management problems:

          • Object Size : 5G
          • Number of buckets system-wide : 10 million
          • Number of objects per bucket: 1 million
          • Number of buckets per storage volume : 1000

          We have a large number of tenants (many times more than 1000). Some of the tenants will be very large (storing many times more than 1M objects). Of course there are simple workarounds for that, such as including a tenant id in the volume name and a bucket name in our internal blob ids. Are these technical limits?

          I don't think we're the only ones who will want to store a large number of objects (more than 1M), and the bucket management would get in the way rather than help.

          stevel@apache.org Steve Loughran added a comment -

          cluster level

          1. is there a limit on the #of storage volumes in a cluster? does GET/ return all of them?

          Storage Volume Level

          1. any way to enum users? e.g. GET /admin/user/

          Bucket Level

          1. what if I want to GET the 1001st entry in an object store? GET spec doesn't allow this.
          2. Propose: listing of entries to be a structure that includes length, block sizes, everything needed to rebuild a FileStatus

          Object level

          1. GET on object must support ranges
          2. HEAD should supply content-length
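
          For illustration, a minimal client-side sketch of the ranged GET and the HEAD content-length check suggested above, assuming a hypothetical http://host:port/volume/bucket/key REST layout (host, port and names here are placeholders, not the actual Ozone endpoint):

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class OzoneRestRangeSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder URL; the real endpoint layout and port are defined by the Ozone REST spec.
    URL url = new URL("http://ozone-host:9864/vol1/bucket1/key1");

    // HEAD: learn the object length from Content-Length without downloading the data.
    HttpURLConnection head = (HttpURLConnection) url.openConnection();
    head.setRequestMethod("HEAD");
    long length = head.getContentLengthLong();
    head.disconnect();

    // GET with a Range header: fetch only the first 1 KB of the object.
    HttpURLConnection get = (HttpURLConnection) url.openConnection();
    get.setRequestProperty("Range", "bytes=0-1023");
    try (InputStream in = get.getInputStream()) {
      byte[] buf = new byte[1024];
      int n = in.read(buf);
      // 206 Partial Content indicates the server honoured the range.
      System.out.println("HTTP " + get.getResponseCode() + ", length=" + length + ", read=" + n);
    } finally {
      get.disconnect();
    }
  }
}
```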
          clamb Charles Lamb added a comment -

          Jitendra Nath Pandey et al,

          This is very interesting. Thanks for posting it.

          Is the 1KB key size limit a hard limit or just a design/implementation target? There will be users who want keys that can be arbitrarily large (e.g. 10's to 100's of KB). So although it may be acceptable to degrade above 1KB, I don't think you want to make it a hard limit. You could argue that they could just hash their keys, or that they could have some sort of key map, but then it would be hard to do secondary indices in the future.

          The details of partitions are kind of lacking beyond the second to last paragraph on page 4. Are partitions and storage containers 1:1? ("A storage container can contain a maximum of one partition..."). Obviously a storage container holds more than just a partition. Perhaps a little more detail about partitions and how they are located, etc. is warranted.

          In the call flow diagram on page 6, it looks like there's a lot going on in terms of network traffic. There's the initial REST call, then an RPC to get the bucket metadata, then one to read the bucket metadata, then another to get the object's container location, then back to the client who gets redirected. That's a lot of REST/RPCs just to get to the actual data. Will any of this be cached, perhaps in the Ozone Handler or maybe even on the client (I realize that's a bit hard with a REST based protocol). For instance, if it were possible to cache some of the hash in the client, then that would cut some RPCs to the Ozone Handler. If the cache were out of date, then the actual call to the data (step (9) in the diagram) could be rejected, the cache invalidated, and the entire call sequence (1) - (8) could be executed to get the right location.
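
          A minimal sketch of the client-side caching idea described above, with invalidation when a data call is rejected as stale; all names are hypothetical, and the handler lookup is only a placeholder for the REST/RPC round trips in the diagram:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class BucketLocationCacheSketch {
  // Hypothetical handle for "where the bucket's container lives"; real metadata would be richer.
  static final class ContainerLocation {
    final String address;
    ContainerLocation(String address) { this.address = address; }
  }

  private final ConcurrentMap<String, ContainerLocation> cache = new ConcurrentHashMap<>();

  // Resolve a bucket's container location, going to the Ozone handler only on a cache miss.
  ContainerLocation locate(String volume, String bucket) {
    return cache.computeIfAbsent(volume + "/" + bucket, this::lookupFromHandler);
  }

  // Called when the datanode rejects a data call because the cached location was stale:
  // drop the entry so the next locate() repeats the full (1)-(8) lookup sequence.
  void invalidate(String volume, String bucket) {
    cache.remove(volume + "/" + bucket);
  }

  private ContainerLocation lookupFromHandler(String volumeAndBucket) {
    // Placeholder for the REST/RPC round trips shown in the call-flow diagram.
    return new ContainerLocation("datanode-1:9866");
  }
}
```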

          IWBNI there was some description of the protocols used between all these moving parts. I know that it's REST from client to Ozone Handler, but what about the other network calls in the diagram? Will it be more REST, or Hadoop RPC, or something else? You talk about security at the end so I guess the authentication will be Kerberos based? Or will you allow more authentication options such as those that HDFS currently has?

          Hash partitioning can also suffer from hotspots depending on the semantics of the key. That's not to say that it's the wrong decision to use it, only that it can have similar drawbacks as key partitioning. Since it looks like you have two separate hashes, one for buckets, and then one for the object key within the bucket, it is possible that there could be hotspots based on a particular bucket. Presumably some sort of caching would help here since the bucket mapping is relatively immutable.

          Secondary indexing will not be easy in a distributed sharded system, especially the consistency issues in dealing with updates. That said, I am reasonably certain that you will find that many users will need this feature relatively soon such that it is high on the roadmap.

          You don't say much about ACLs other than to include them in the REST API. I suppose they'll be implemented in the Ozone Handler, but what will they look like? HDFS/Linux ACLs?

          In the Cluster Level APIs, presumably DELETE Storage Volume is only allowed by the admin. What about GET?

          How are quotas enabled and set? I don't see it in the API anywhere. There's mention early on that they're set up by the administrator. Perhaps it's via some http jsp thing to the Ozone Handler or Storage Container Manager? Who enforces them?

          "no guarantees on partially written objects" - Does this also mean that there are no block-order guarantees during write? Are "holey" objects allowed or will the only inconsistencies be at the tail of an object. This is obviously important for log-based storage systems.

          In the "Size requirements" section on page 3 you say "Number of objects per bucket: 1 million", and then later on you say "A bucket can have millions of objects". You may want to shore that up a little.

          Also in the Size requirements section you say "Object Size: 5G", but then later it says "The storage container needs to store object data that can vary from a few hundred KB to hundreds of megabytes". I'm not sure those are necessarily inconsistent, but I'm also not sure how to reconcile them.

          Perhaps you could include a diagram showing how an object maps to partitions and storage containers and then onto DNs. In other words, a general diagram showing all the various storage concepts (objects, partitions, storage containers, hash tables, transactions, etc.)

          "We plan to re-use Namenode's block management implementation for container management, as much as possible." I'd love to see more detail on what can be reused, what high level changes to the BlkMgr code will be needed, what existing APIs (RPCs) you'll continue to use or need to be changed, etc.

          wrt the storage container prototype using leveldbjni. Where would this fit into the scheme of things? I get the impression that it would just be a backend storage component for the storage container manager. Would it use HDFS blocks as its persistent storage? Presumably not. Maybe a little bit more detail here?

          s/where where/where/

          ebortnik Edward Bortnikov added a comment -

          Great stuff. Block- and object- level storage scales much better from the metadata perspective (flat space). Could play really well with the block-management-as-a-service proposal (HDFS-5477) that splits the namenode into the FS manager and the block manager services, and scales the latter horizontally.

          jnp Jitendra Nath Pandey added a comment -

          Thanks for the feedback and comments. I will try to answer the questions over my next few comments. I will also update the document to reflect the discussion here.

          The stated limits in the document are more of the design goals, and parameters we have in mind while designing for the first phase of the project. These are not hard limits and most of these will be configurable. First I will state a few technical limits and then describe some back of the envelope calculations and heuristics I have used behind these numbers.
          The technical limitations are as follows.

           1. The memory in the storage container manager limits the number of storage containers. From the namenode experience, I believe we can go up to a few hundred million storage containers. In later phases of the project we can have a federated architecture with multiple storage container managers for further scale-up.
           2. The size of a storage container is limited by how quickly we want to be able to replicate the containers when a datanode goes down. The advantage of using a large container size is that it reduces the metadata needed to track container locations, which is proportional to the number of containers. However, a very large container will reduce the parallelization that the cluster can achieve when replicating after a node fails. The container size will be configurable. A default size of 10G seems like a good choice, which is much larger than hdfs block sizes, but still allows hundreds of containers on datanodes with a few terabytes of disk.

          The maximum size of an object is stated as 5G. In the future we would like to increase this limit further once we can support multi-part writes similar to S3. However, it is expected that the average size of objects will be much smaller. The most common range is expected to be a few hundred KBs to a few hundred MBs.
          Assuming 100 million containers, a 1MB average object size, and a 10G storage container size, it amounts to 10 trillion objects. I think 10 trillion is a lofty goal to have : ). The division of 10 trillion into 10 million buckets with a million objects in each bucket is somewhat arbitrary, but we believe users will prefer smaller buckets for better organization. We will keep these configurable.

          The storage volume settings give admins control over the usage of the storage. In a private cloud, a cluster shared by lots of tenants can have a storage volume dedicated to each tenant. A tenant can be a user, a project, or a group of users. Therefore, a limit of 1000 buckets, implying around 1PB of storage per tenant, seems reasonable. But I do agree that when we have a quota on storage volume size, an additional limit on the number of buckets is not really needed.

          We plan to carry out the project in several phases. I would like to propose the following phases:

          Phase 1

          1. Basic API as covered in the document.
          2. Storage container machinery, reliability, replication.

          Phase 2

          1. High availability
          2. Security
          3. Secondary index for object listing with prefixes.

          Phase 3

          1. Caching to improve latency.
          2. Further scalability in terms of number of objects and object sizes.
          3. Cross-geo replication.

          I have created branch HDFS-7240 for this work. We will start filing jiras and posting patches.

          jnp Jitendra Nath Pandey added a comment -

          Steve Loughran, thanks for the review.

          is there a limit on the #of storage volumes in a cluster? does GET/ return all of them?

          Please see my discussion on limits above. The storage volumes are created by admins and therefore are not expected to be too many.

          any way to enum users? e.g. GET /admin/user/ ?

          We don't plan to manage users in ozone. In this respect we deviate from popular public object stores. This is because, in a private cluster deployment, user management is usually tied to corporate user accounts. Instead we chose the storage volume abstraction for certain administrative settings like quota. However, admins can choose to allocate a storage volume for each user.

          what if I want to GET the 1001st entry in an object store?

          Not sure I understand the use case. Do you mean the users would like to query using some sort of entry number or index?

          GET on object must support ranges

          Agree, we plan to take up this feature in the 2nd phase.

          HEAD should supply content-length

          This should be easily doable. We will keep it in mind for container implementation.

          jnp Jitendra Nath Pandey added a comment -

          Charles Lamb, thanks for a detailed review and feedback.
          Some of the answers are below, for others I will post the updated document with details as you have pointed out.

          Is the 1KB key size limit a hard limit or just a design/implementation target

          It is a design target. Amazon's S3 limits keys to 1KB. I doubt there would be many use cases that need more than that. I see the point about allowing degradation instead of a hard limit, but at this point in the project I would prefer to have stricter limits and relax them later, instead of setting user expectations too high to begin with.

          Caching to reduce network traffic

          I agree that a good caching layer will significantly help the performance. Ozone handler seems like a natural place for caching. However, a thick client can do its own caching without overloading datanodes. The focus of phase 1 is to get the semantics right and lay down the basic architecture in place. We plan to attack performance improvements in a later phase of the project.

          Security mechanisms

          Frankly, I haven't thought about anything other than kerberos. I agree, we should evaluate it against what other popular object stores use.

          Hot spots in hash partitioning.

          It is possible for a pathological sequence of keys, but in practice hash partitioning has been successfully used to avoid hot spots e.g. hash-partitioned indexes in databases. We would need to pick hash functions with nice distribution properties.
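
          As an illustration of the kind of hash partitioning being discussed (not the actual Ozone implementation), a small sketch that maps a (bucket, key) pair to one of N partitions using an MD5 digest, which has the kind of distribution properties mentioned above:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HashPartitionSketch {
  // Map a (bucket, key) pair to one of numPartitions partitions via a well-distributed hash.
  static int partitionFor(String bucket, String key, int numPartitions) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5")
        .digest((bucket + "/" + key).getBytes(StandardCharsets.UTF_8));
    // Fold the first 4 digest bytes into a non-negative int, then take it modulo N.
    int h = ((digest[0] & 0xff) << 24) | ((digest[1] & 0xff) << 16)
          | ((digest[2] & 0xff) << 8) | (digest[3] & 0xff);
    return (h & 0x7fffffff) % numPartitions;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(partitionFor("bucket1", "user/2015/06/03/event.json", 16));
  }
}
```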

          Secondary indexing consistency

          The secondary index need not be strictly consistent with the bucket. That means a listing operation with a prefix or key range may not reflect the latest state of the bucket. We will have a more concrete proposal in the second phase of the project.

          Storage volume GET for admin

          I believe it is not a security concern to allow users to see all storage volume names. However, it is possible to conceive of a use case where an admin would want to restrict that. Probably we can support both modes.

          "no guarantees on partially written objects"

          The object will not be visible until completely written. Also, no recovery is planned for the first phase if the write fails. In future, we would like to support multi-part uploads.

          Re-using block management implementation for container management.

          We intend to reuse the DatanodeProtocol that datanode uses to talk to namenode. I will add more details to the document and on the corresponding jira.

          storage container prototype using leveldbjni

          We will add a lot more details on this in its own jira. The idea is to use leveldbjni in the storage container on the datanodes. We plan to prototype a storage container that stores objects as individual files within the container; however, that would need an index within the container to map a key to a file. We will use leveldbjni for that index.
          Another possible prototype is to put the entire object in leveldb itself. It will take some experimentation to zero in on the right approach. We will also try to make the storage container implementation pluggable to make it easy to try different implementations.
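
          A minimal sketch of the first prototype direction, a per-container leveldbjni index mapping object keys to the files that hold their data; the paths and key formats here are illustrative only:

```java
import static org.fusesource.leveldbjni.JniDBFactory.bytes;
import static org.fusesource.leveldbjni.JniDBFactory.factory;

import java.io.File;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

public class ContainerKeyIndexSketch {
  public static void main(String[] args) throws Exception {
    Options options = new Options().createIfMissing(true);
    // Per-container index living inside the container directory (path is illustrative).
    try (DB index = factory.open(new File("/data/containers/c-0001/index"), options)) {
      // Map an object key to the file inside the container that holds its bytes.
      index.put(bytes("bucket1/key1"), bytes("chunks/obj-000042.dat"));
      String file = new String(index.get(bytes("bucket1/key1")), "UTF-8");
      System.out.println("bucket1/key1 -> " + file);
    }
  }
}
```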

          How are quotas enabled and set? ....who enforces them

          All the Ozone APIs are implemented in ozone handler. The quota will also be enforced by the ozone handler. I will update the document with the APIs.
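
          A minimal sketch of the kind of check the ozone handler could make before accepting a write into a volume with a byte quota; the names and the quota representation are assumptions for illustration only:

```java
import java.io.IOException;

public class VolumeQuotaCheckSketch {
  // Reject a write if it would push the volume's usage past its byte quota (0 = unlimited).
  static void checkVolumeQuota(long quotaBytes, long usedBytes, long incomingBytes) throws IOException {
    if (quotaBytes > 0 && usedBytes + incomingBytes > quotaBytes) {
      throw new IOException("Volume quota exceeded: used=" + usedBytes
          + " incoming=" + incomingBytes + " quota=" + quotaBytes);
    }
  }

  public static void main(String[] args) throws IOException {
    // Throws: 900 GB already used plus 200 GB incoming exceeds a 1 TB quota.
    checkVolumeQuota(1L << 40, 900L << 30, 200L << 30);
  }
}
```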

          cmccabe Colin P. McCabe added a comment -

          This looks like a really interesting way to achieve a scalable blob store using some of the infrastructure we already have in HDFS. It could be a good direction for the project to go in.

          We should have a meeting to review the design and talk about how it fits in with the rest of what's going on in HDFS-land. Perhaps we could have a webex on the week of May 25th or June 1? (I am going to be out of town next week so I can't do next week.)

          jnp Jitendra Nath Pandey added a comment -

          That's a good idea Colin P. McCabe, I will setup a webex in the week of June 1st. Does June 3rd, 1pm to 3pm PST sound ok? I will update with the details.

          cmccabe Colin P. McCabe added a comment -

          Thanks, Jitendra Nath Pandey. June 3rd at 1pm to 3pm sounds good to me. I believe Aaron T. Myers and Zhe Zhang will be able to attend as well.

          kanaka Kanaka Kumar Avvaru added a comment -

          Very interesting to follow, Jitendra Nath Pandey; we also have some requirements to support trillion-level small objects/files.
          We would be interested in contributing to Ozone development. Can you please also invite me to the webex meeting?

          For now I have a few comments on this project:

          In practice, partitioning may be difficult for the storage layer to control alone, since the distribution depends on how applications construct keys.
          So a bucket partitioner class could be an input when creating a bucket, so that applications can handle the partitions well (see the sketch below).
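
          A hypothetical sketch of what such a pluggable bucket partitioner could look like; this interface does not exist in the proposal and only illustrates the suggestion above:

```java
// Hypothetical interface, supplied by the application at bucket-creation time.
public interface BucketPartitioner {
  // Return the partition index (0..numPartitions-1) for an object key in this bucket.
  int partition(String key, int numPartitions);
}

// Example implementation: keep keys that share a path-style prefix in the same partition.
class PrefixPartitioner implements BucketPartitioner {
  @Override
  public int partition(String key, int numPartitions) {
    int slash = key.indexOf('/');
    String prefix = slash > 0 ? key.substring(0, slash) : key;
    return (prefix.hashCode() & 0x7fffffff) % numPartitions;
  }
}
```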

          Object-level metadata such as tags/labels would be required, which computing jobs can use as additional info (similar to xattrs on a file).

          What is the plan for persisting the leveldbjni content, and is a WAL-like concept planned for reliability?
          When and how will the leveldbjni content be replicated?

          Since millions of buckets are expected, is partitioning of buckets by volume name also required?

          Swift and AWS S3 support object versions and replace. Does Ozone also plan for the same?

          Missing features like multi-part upload, splitting of heavy objects/storage space, etc. can also be pooled into the coming phases (maybe phase 2 or later).

          Could we also add readable snapshots of a bucket to the feature queue? (maybe at a later stage of the project)

          As part of transparent encryption, applications may expect encryption zones at the bucket level.

          jnp Jitendra Nath Pandey added a comment -

          Webex:
          https://hortonworks.webex.com/meet/jitendra

          1-650-479-3208 Call-in toll number (US/Canada)
          1-877-668-4493 Call-in toll-free number (US/Canada)
          Access code: 623 433 021​

          Time: 6/3/2015, 1pm to 3pm

          thodemoor Thomas Demoor added a comment -

          Maybe some of the (ongoing) work for currently supported object stores can be reused here (f.i. HADOOP-9565)? Will probably call-in.

          cmccabe Colin P. McCabe added a comment -

          Hi Jitendra Nath Pandey,

          Thanks for organizing the webex on Wednesday. Are you planning on having an in-person meetup, or just the webex? We can offer to host if that is convenient (or visit your office in Santa Clara).

          jnp Jitendra Nath Pandey added a comment -

          I was planning to do it on webex only, but you are welcome to come over to our office.
          Ask for Jitendra Pandey at the lobby.

          5470 Great America Pkwy,
          Santa Clara, CA 95054

          cmccabe Colin P. McCabe added a comment -

          Thanks, Jitendra. I think it makes sense to meet up, since we're in the area.

          thodemoor Thomas Demoor added a comment -

          Very interesting call yesterday. Might be interesting to have a group discussion at Hadoop Summit next week?

          lars_francke Lars Francke added a comment -

          Would anyone mind - for documentation and those of us who couldn't join - to post a quick summary of the call? Thank you!

          jnp Jitendra Nath Pandey added a comment -

          I will post a summary shortly.

          jnp Jitendra Nath Pandey added a comment -

          The call started with a high-level description of object stores, the motivations, and the design approach as covered in the architecture document.
          The following points were discussed in detail:

          1. 3 level namespace with storage volumes, buckets and keys vs 2 level namespace with buckets and keys
            • Storage volumes are created by admins and provide admin controls such as quota. Buckets are created and managed by users.
              Since HDFS doesn't have a separate notion of user accounts as in S3 or Azure, the storage volume allows admins to set policies.
            • The argument in favor of 2 level scheme was that typically organizations have very few buckets and users organize their data within the buckets. The admin controls can be set at bucket level.
          2. Is it exactly S3 API? It would be good to easily migrate from s3 to Ozone.
            • Storage volume concept is not in S3. In Azure, accounts are part of the URL, Ozone URLs look similar to Azure with storage volume instead of account name.
            • We will publish a more detailed spec including headers, authorization semantics etc. We try to follow S3 closely.
          3. Http2
            • There is a jira already in hadoop for http2. We should evaluate supporting http2 as well.
          4. OzoneFileSystem: Hadoop file system implementation on top of ozone, similar to S3FileSystem.
            • It will not support rename
            • This was only briefly mentioned.
          5. Storage Container Implementation
            • Storage container replication must provide efficient replication. Replication by key-object enumeration will be too inefficient. RocksDB is a promising choice as it provides features for live replication i.e. replication while it is being written. In the architecture document we talked about leveldbjni. RocksDB is similar, and provides additional features and java binding as well.
            • If a datanode dies and some of the containers lag in generation stamp, these containers will be discarded. Since containers are much larger than typical hdfs blocks, this will be a lot more inefficient. An important optimization is needed to allow stale containers to catch up with the state.
            • To support a large range of object sizes, a hybrid model may be needed: store small objects in RocksDB, but large objects as files with their file paths in RocksDB (a sketch follows this list).
            • Colin suggested Linux sparse files.
            • We are working on a prototype.
          6. Ordered listing with read after write semantics might be an important requirement. In the hash partitioning scheme that would need consistent secondary indexes or a range partitioning should be used. This needs to be investigated.
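
          A minimal sketch of the hybrid small/large object model from point 5, assuming a per-container RocksDB instance for metadata; the 1 MB inline threshold, paths and key format are assumptions for illustration, not decided values:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;

public class HybridContainerStoreSketch {
  // Objects up to this size are stored inline in RocksDB; larger ones go to files.
  private static final int INLINE_LIMIT = 1 << 20; // 1 MB, illustrative threshold

  static void put(RocksDB db, Path chunkDir, String key, byte[] value) throws Exception {
    byte[] k = key.getBytes(StandardCharsets.UTF_8);
    if (value.length <= INLINE_LIMIT) {
      // Small object: the value itself lives in RocksDB.
      db.put(k, value);
    } else {
      // Large object: write the bytes to a file and keep only the file path in RocksDB.
      Path file = chunkDir.resolve(Integer.toHexString(key.hashCode()) + ".dat");
      Files.createDirectories(chunkDir);
      Files.write(file, value);
      db.put(k, ("FILE:" + file).getBytes(StandardCharsets.UTF_8));
    }
  }

  public static void main(String[] args) throws Exception {
    RocksDB.loadLibrary();
    Options opts = new Options().setCreateIfMissing(true);
    RocksDB db = RocksDB.open(opts, "/data/containers/c-0001/meta");
    try {
      put(db, Paths.get("/data/containers/c-0001/chunks"),
          "bucket1/small-key", "small value".getBytes(StandardCharsets.UTF_8));
    } finally {
      db.close();
      opts.close();
    }
  }
}
```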

          I will follow up on these points and update the design doc.

          It was a great discussion with many valuable points raised. Thanks to everyone who attended.

          john.jian.fang Jian Fang added a comment -

          Sorry for jumping in. I saw the following statement and wonder why you need to follow S3FileSystem closely.

          "OzoneFileSystem: Hadoop file system implementation on top of ozone, similar to S3FileSystem.

          • It will not support rename"

          Not supporting rename could cause a lot of trouble because Hadoop uses it to achieve something like a "two-phase commit" in many places, for example FileOutputCommitter. What would you do for such use cases? Add extra logic to the Hadoop code, or copy the data to the final destination? The latter could be very expensive, by the way.
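          For context, a minimal sketch of the write-to-temporary-then-rename pattern referred to above (not the actual FileOutputCommitter code; the class name and paths are made up):

          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          class RenameCommitSketch {
            // Phase 1: the task writes its output under tempOutput.
            // Phase 2: a cheap, atomic rename publishes it at finalOutput.
            static void commitTask(Configuration conf, Path tempOutput, Path finalOutput) throws IOException {
              FileSystem fs = finalOutput.getFileSystem(conf);
              if (!fs.rename(tempOutput, finalOutput)) {
                throw new IOException("Could not commit " + tempOutput + " to " + finalOutput);
              }
              // On a store without rename, this step degrades into copying the data to
              // finalOutput and deleting tempOutput, i.e. the double write discussed here.
            }
          }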

          I am not very clear about the motivations here. Shouldn't some of these features already be covered by HBase? Would this feature make HDFS too fat and difficult to manage? Or is this on top of HDFS, just like HBase? Also, how do you handle small objects in the object store?

          john.jian.fang Jian Fang added a comment -

          One more question is how you would handle object partitions. HDFS federation is more like a manual partitioning process to me. You probably need some dynamic partitioning mechanism similar to HBase's region idea. Data consistency is another big issue here.

          khanderao khanderao added a comment -

          Great initiative! Jitendra and team.

          1. What are the assumptions about data locality? Typical cloud storage is not a candidate for computation, so hardware choices are made differently for storage-centric vs. compute-centric vs. typical Hadoop nodes (typically compute + storage).

          2. Can a datanode be exclusively configured for either the object store or HDFS? I assume yes.

          andrew.wang Andrew Wang added a comment -

          I see some JIRAs related to volumes; did we resolve the question of the 2-level vs. 3-level scheme? Based on Colin's (and my own) experiences using S3, we did not feel the need for users to be able to create buckets, which seemed to be the primary motivation for volume -> bucket. Typically users also create their own hierarchy under the bucket anyway, and prefix scans become important then.

          My main reason for asking is to make the API as simple and as similar to S3 as possible, which should help with porting applications.

          jnp Jitendra Nath Pandey added a comment -

          Jian Fang, thanks for the review and comments.
          1) There is some work going on to support YARN on S3 (HADOOP-11262) by Thomas Demoor and Steve Loughran. We hope we can leverage that to support the MapReduce use case.
          2) This will be part of HDFS natively, unlike HBase, which is a separate service. Only one additional daemon (the storage container manager) is needed, and only if Ozone is deployed. HDFS already supports multiple namespaces and blockpools, and we will leverage that work. This work will not impact HDFS manageability in any way. There will be some additional code in the datanode to support storage containers. In the future, we hope to use containers to store HDFS blocks as well, for HDFS block-space scalability.
          3) We will partition objects using hash partitioning and range partitioning; the document already talks about hash partitioning, and I will add more details on range partition support (a minimal sketch of hash partitioning follows this list). Small objects will also be stored in the container. In the document I mentioned leveldbjni, but we are also looking at RocksDB for the container implementation to store objects in the container.
          4) The containers will be replicated and kept consistent using a data pipeline, similar to HDFS blocks. The document talks about it at a high level. We will update more details in the related JIRA.
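          A minimal illustration of the hash-partitioning idea in point 3 (purely hypothetical; this is not Ozone's actual scheme):

          // Illustrative only: hash partitioning maps a key to one of N partitions/containers.
          // A prefix or range scan cannot use this mapping and would have to consult every
          // partition, which is what motivates the range-partitioning discussion further down.
          class HashPartitionSketch {
            static int partitionFor(String volume, String bucket, String key, int numPartitions) {
              int h = (volume + "/" + bucket + "/" + key).hashCode();
              return Math.floorMod(h, numPartitions);
            }
          }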

          jnp Jitendra Nath Pandey added a comment -

          khanderao, thanks for the review.
          1) Ozone stores data on the datanodes themselves, therefore it can provide locality for computations running on the datanodes. The hardware can be chosen based on the use case and the computational needs on the datanodes.
          2) Of course, it would be possible to have a dedicated object store or HDFS deployment, with the datanodes configured to talk to the respective blockpool.

          jnp Jitendra Nath Pandey added a comment -

          For prefix scans we are planning to support a range-partitioned index in the storage container manager. I will update the design document. That will allow users to have a hierarchy under the bucket.

          The storage volumes do not really change the semantics that much from S3. For a URL like http://host:port/volume/bucket/key, apart from the volume prefix, the semantics are really similar. I believe having a notion of 'admin-created volumes' allows us to have buckets that are very similar to S3's, as it provides clear domains of admin control vs. user control. Volumes can be viewed as accounts, which are similar to accounts in S3 or Azure.
          It would be a bigger deviation if we insisted that only admins can create buckets. It is useful to organize data in buckets using prefixes, but having that as the only mechanism for organizing data seems too restrictive.
          For similarity to S3, we will try to have similar headers and auth semantics as well, which are also very important for portability.

          thodemoor Thomas Demoor added a comment -

          Jian Fang and Jitendra Nath Pandey:

           • Avoiding rename happens in HADOOP-9565 by introducing ObjectStore (which extends FileSystem) and letting FileOutputCommitter, the Hadoop CLI, ... act on this (by avoiding rename). Ozone could easily extend ObjectStore and benefit from this.
           • HADOOP-11262 extends DelegateToFileSystem to implement s3a as an AbstractFileSystem and works around issues such as modification times for directories (cf. Azure).
          john.jian.fang Jian Fang added a comment -

          Thanks for all your explanations; however, I think you missed my points. Doable and performant are two different concepts. From my own experience with S3 and the S3 native file system, the most costly operations are listing keys and copying data from one bucket to another to simulate the rename operation. The former takes a very long time for a bucket with millions of objects, and the latter carries a double performance penalty, i.e., if your objects total 1 TB, you actually upload almost 2 TB of data to S3. That is why fast key listing and a native, fast rename operation are two of the most desirable features for S3.

          Before you make the decision to follow the S3N API, I would suggest you actually test the performance of S3N and learn what is good and what is bad about it. Why do you need to follow the bad parts at all?

          It is still not very clear to me how you guarantee that your partitions are balanced. HBase uses region auto-split to achieve that, which is also my concern: the code and logic would grow rapidly as your object store matures. In my personal opinion, it is better to build the object store on top of HDFS and keep HDFS simple.

          cmccabe Colin P. McCabe added a comment -

          3) We will partition objects using hash partitioning and range partitioning. The document already talks about hash partitioning, I will add more details for range partition support. The small objects will also be stored in the container. In the document I mentioned leveldbjni, but we are also looking at rocksDB for container implementation to store objects in the container.

          Interesting, thanks for posting more details about this.

          Maybe this has been discussed on another JIRA (I apologize if so), but does this mean that the admin will have to choose between hash and range partitioning for a particular bucket? This seems suboptimal to me... we will have to support both approaches, which is more complex, and admins will be left with a difficult choice.

          It seems better just to make everything range-partitioned. Although this is more complex than simple hash partitioning, it provides "performance compatibility" with s3 and other object stores. s3 provides a fast (sub-linear) way of getting all the keys in between some A and B. It will be very difficult to really position ozone as s3-compatible if operations that are quick in s3 such as listing all the keys between A and B are O(num_keys^2) in ozone.

          For example, in (all of) Hadoop's s3 filesystem implementations, listStatus uses this quick listing of keys between A and B. When someone does "listStatus /a/b/c", we can ask s3 for all the keys between /a/b/c/ and /a/b/c0 (0 is the ASCII value right after slash). Of course, s3 does not really have directories, but we can treat the keys in this range as being in the directory /a/b/c for the purposes of s3a or s3n. If we just had hash partitioning, this kind of operation would be O(N^2) where N is the number of keys. It would just be infeasible for any large bucket.
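          A rough sketch of the range computation described above (purely illustrative; the listKeys interface here is hypothetical, not an Ozone or S3 API):

          import java.util.List;

          class DirectoryRangeSketch {
            // Hypothetical range query against a sorted, range-partitioned key index.
            interface SortedKeyIndex {
              List<String> listKeys(String startInclusive, String endExclusive);
            }

            // List the keys that fall under the "directory" dir, e.g. /a/b/c, in a flat key space.
            static List<String> listStatus(SortedKeyIndex index, String dir) {
              String base = dir.endsWith("/") ? dir.substring(0, dir.length() - 1) : dir;
              String startKey = base + "/";   // "/a/b/c/"
              String endKey = base + "0";     // "/a/b/c0": '0' is the ASCII character right after '/'
              return index.listKeys(startKey, endKey);  // sub-linear on a sorted index
            }
          }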

          cmccabe Colin P. McCabe added a comment -

          Jitendra Nath Pandey, Sanjay Radia, did we come to a conclusion about range partitioning?

          ozawa Tsuyoshi Ozawa added a comment -

          Two branches, HDFS-7240 and hdfs-7240, seem to have been created in the repository.

          $ git pull
          From https://git-wip-us.apache.org/repos/asf/hadoop
           * [new branch]      HDFS-7240 -> origin/HDFS-7240
           * [new branch]      hdfs-7240 -> origin/hdfs-7240

          Is this intended?

          cnauroth Chris Nauroth added a comment -

          The upper-case one is the correct one. The lower-case one was created by mistake. We haven't been able to delete it yet though because of the recent ASF infrastructure changes to prohibit force pushes.

          anu Anu Engineer added a comment -

          Please see INFRA-10720 for the status of branch deletes. I am presuming that sooner or later we will be able to delete the lower-case "hdfs-7240".

          ozawa Tsuyoshi Ozawa added a comment -

          Thank you for following up, Anu and Chris.

          anu Anu Engineer added a comment -

          Proposed user interfaces for Ozone. Documentation of REST and CLI interfaces.

          andrew.wang Andrew Wang added a comment -

          Hi all, I had the opportunity to hear more about Ozone at Apache Big Data, and chatted with Anu afterwards. Quite interesting, I learned a lot. Thanks Anu for the presentation and fielding my questions.

          I'm re-posting my notes and questions here. Anu said he'd be posting a new design doc soon to address my questions.

          Notes:

          • Key Space Manager and Storage Container Manager are the "master" services in Ozone, and are the equivalent of FSNamesystem and the BlockManager in HDFS. Both are Raft-replicated services. There is a new Raft implementation being worked on internally.
          • The block container abstraction is a mutable range of KV pairs. It's essentially a ~5GB LevelDB for metadata + on-disk files for the data. Container metadata is replicated via Raft. Container data is replicated via chain replication.
          • Since containers are mutable and the replicas are independent, the on-disk state will be different. This means we need to do logical rather than physical replication.
          • Container data is stored as chunks, where a chunk is maybe 4-8MB. Chunks are immutable. Chunks are a (file, offset, length) triplet. Currently each chunk is stored as a separate file.
           • Use of copysets to reduce the risk of data loss due to independent node failures (a toy placement sketch follows this list).
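          For readers unfamiliar with copysets, a toy sketch of the placement idea (node names and group size are made up; this is not the actual SCM policy):

          import java.util.Arrays;
          import java.util.List;

          // Toy illustration: replicas of a container always land inside one pre-computed
          // group of nodes, so data is lost only if every node of some single copyset fails
          // together, rather than any 3 nodes across the whole cluster.
          class CopysetPlacementSketch {
            private static final List<List<String>> COPYSETS = Arrays.asList(
                Arrays.asList("dn1", "dn2", "dn3"),
                Arrays.asList("dn4", "dn5", "dn6"),
                Arrays.asList("dn7", "dn8", "dn9"));

            static List<String> replicasFor(long containerId) {
              int idx = (int) Math.floorMod(containerId, (long) COPYSETS.size());
              return COPYSETS.get(idx);
            }
          }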

          Questions:

          • My biggest concern is that erasure coding is not a first-class consideration in this system, and seems like it will be quite difficult to implement. EC is table stakes in the blobstore world, it's implemented by all the cloud blobstores I'm aware of (S3, WASB, etc). Since containers are mutable, we are not able to erasure-code containers together, else we suffer from the equivalent of the RAID-5 write hole. It's the same issue we're dealing with on HDFS-7661 for hflush/hsync EC support. There's also the complexity that a container is replicated to 3 nodes via Raft, but EC data is typically stored across 14 nodes.
          • Since LevelDB is being used for metadata storage and separately being replicated via Raft, are there concerns about metadata write amplification?
          • Can we re-use the QJM code instead of writing a new replicated log implementation? QJM is battle-tested, and consensus is a known hard problem to get right.
          • Are there concerns about storing millions of chunk files per disk? Writing each chunk as a separate file requires more metadata ops and fsyncs than appending to a file. We also need to be very careful to never require a full scan of the filesystem. The HDFS DN does full scans right now (DU, volume scanner).
          • Any thoughts about how we go about packing multiple chunks into a larger file?
          • Merges and splits of containers. We need nice large 5GB containers to hit the SCM scalability targets. However, I think we're going to have a harder time with this than a system like HBase. HDFS sees a relatively high delete rate for recently written data, e.g. intermediate data in a processing pipeline. HDFS also sees a much higher variance in key/value size. Together, these factors mean Ozone will likely be doing many more merges and splits than HBase to keep the container size high. This is concerning since splits and merges are expensive operations, and based on HBase's experience, are hard to get right.
          • What kind of sharing do we get with HDFS, considering that HDFS doesn't use block containers, and the metadata services are separate from the NN? not shared?
          • Any thoughts on how we will transition applications like Hive and HBase to Ozone? These apps use rename and directories for synchronization, which are not possible on Ozone.
          • Have you experienced data loss from independent node failures, thus motivating the need for copysets? I think the idea is cool, but the RAMCloud network hardware expectations are quite different from ours. Limiting the set of nodes for re-replication means you have less flexibility to avoid top-of-rack switches and decreased parallelism. It's also not clear how this type of data placement meshes with EC, or the other quite sophisticated types of block placement we currently support in HDFS.
          • How do you plan to handle files larger than 5GB? Large files right now are also not spread across multiple nodes and disks, limiting IO performance.
          • Are all reads and writes served by the container's Raft master? IIUC that's how you get strong consistency, but it means we don't have the same performance benefits we have now in HDFS from 3-node replication.

          I also ask that more of this information and decision making be shared on public mailing lists and JIRA. The KSM is not mentioned in the architecture document, nor is the fact that the Ozone metadata is being replicated via Raft rather than stored in containers. I was not aware that there is already progress internally at Hortonworks on a Raft implementation. We've previously expressed interest in being involved in the design and implementation of Ozone, but we can't meaningfully contribute if this work is being done privately.

          anu Anu Engineer added a comment -

          Andrew Wang Thank you for showing up at the talk and for a very interesting follow-up conversation. I am glad that we are continuing it on this JIRA. Just to make sure that others who read our discussion get the right context, here are the slides from the talk:

          http://schd.ws/hosted_files/apachebigdata2016/fc/Hadoop%20Object%20Store%20-%20Ozone.pdf

          anu Anu Engineer added a comment -

          Andrew Wang Thank you for your comments; they are well thought out and extremely valuable questions. I will make sure that all the areas you are asking about are discussed in the next update of the design doc.

          Anu said he'd be posting a new design doc soon to address my questions.

          I am working on that, but just to make sure your questions are not lost in the big picture of the design doc, I am answering them individually here.

          My biggest concern is that erasure coding is not a first-class consideration in this system.

          Nothing in ozone prevents a chunk from being EC encoded. In fact, ozone makes no assumptions about the location or the types of chunks at all, so it is quite trivial to create a new chunk type and write it into containers. We are focused on the overall picture of ozone right now, and I would welcome any contribution you can make on EC and ozone chunks if that is a concern you would like us to address earlier. From the architecture point of view I do not see any issues.

          Since LevelDB is being used for metadata storage and separately being replicated via Raft, are there concerns about metadata write amplification?

          Metadata is such a small slice of information about a block – really what you are saying is that the block name and the hash for the block get written twice, once through the Raft log and a second time when Raft commits this information. Since the data we are talking about is so small, I am not worried about it at all.

          Can we re-use the QJM code instead of writing a new replicated log implementation? QJM is battle-tested, and consensus is a known hard problem to get right.

          We considered this; however, the consensus is to write a consensus protocol that is easier to understand and makes it easier for more contributors to work on it. The fact that QJM was not written as a library makes it very hard for us to pull it out in a clean fashion. Again, if you feel strongly about it, please feel free to move QJM into a library that can be reused, and all of us will benefit from it.

          Are there concerns about storing millions of chunk files per disk? Writing each chunk as a separate file requires more metadata ops and fsyncs than appending to a file. We also need to be very careful to never require a full scan of the filesystem. The HDFS DN does full scans right now (DU, volume scanner).

          Nothing in the chunk architecture assumes that chunks are stored as separate files. The fact that a chunk is a triplet {FileName, Offset, Length} gives you the flexibility to store thousands of chunks in a single physical file.

          Any thoughts about how we go about packing multiple chunks into a larger file?

          Yes, write the first chunk and then write the second chunk to the same file. In fact, chunks are specifically designed to address the small-file problem, so two keys can point to the same file. For example
          KeyA -> {File, 0, 100}
          KeyB -> {File, 101, 1000}
          is a perfectly valid layout under the container architecture.
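          Expressed in code (an illustrative type only, not the actual Ozone classes; the file name is made up), the triplet and the shared-file layout look like this:

          import java.util.Arrays;
          import java.util.List;
          import java.util.Map;

          class ChunkLayoutSketch {
            // A chunk is the {fileName, offset, length} triplet described above.
            static final class Chunk {
              final String fileName;
              final long offset;
              final long length;
              Chunk(String fileName, long offset, long length) {
                this.fileName = fileName;
                this.offset = offset;
                this.length = length;
              }
            }

            // Two small keys packed into one container file: KeyA owns bytes [0, 100),
            // KeyB owns bytes [101, 1101) of the same file.
            static final Map<String, List<Chunk>> KEY_TO_CHUNKS = Map.of(
                "KeyA", Arrays.asList(new Chunk("containerFile-001", 0, 100)),
                "KeyB", Arrays.asList(new Chunk("containerFile-001", 101, 1000)));
          }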

          Merges and splits of containers. We need nice large 5GB containers to hit the SCM scalability targets. Together, these factors mean Ozone will likely be doing many more merges and splits than HBase to keep the container size high

          Ozone actively tries to avoid merges and tries to split only when needed. A container can be thought of as a really large block, so I am not sure I am going to see anything other than a standard block workload on containers. The fact that containers can be split allows us to avoid pre-allocation of container space. That is merely a convenience, and if you think of these as blocks, you will see that it is very similar.

          Ozone will never try to do merges and splits at the HBase level. From the container and ozone perspective we are more focused on a good data distribution across the cluster – aka what the balancer does today – and containers are a flat namespace, just like blocks, which we allocate when needed.

          So once more, just to make sure we are on the same page: merges are rare (not generally required), and splits happen when we want to redistribute data that sits on the same machine.

          What kind of sharing do we get with HDFS, considering that HDFS doesn't use block containers, and the metadata services are separate from the NN? not shared?

          Great question. We initially started off by attacking the scalability question for ozone and soon realized that HDFS scalability and ozone scalability have to solve the same problems. So the container infrastructure we have built is something that can be used by both ozone and HDFS. Currently we are focused on ozone, and containers will co-exist on datanodes with blockpools. That is, ozone should be and will be deployable on a vanilla HDFS cluster. In the future, if we want to scale HDFS, containers might be an easy way to do it.

          Any thoughts on how we will transition applications like Hive and HBase to Ozone? These apps use rename and directories for synchronization, which are not possible on Ozone.

          These applications are written with the assumption of a POSIX file system, so migrating them to Ozone does not make much sense. The work we are doing in ozone, especially the container layer, is useful in HDFS, and these applications might benefit indirectly.

          Have you experienced data loss from independent node failures, thus motivating the need for copysets?

          Yes, we have seen this issue in the real world. We had to pick an allocation algorithm, and copysets offered a set of advantages for very minimal work; hence the choice. If you look at the paper you will see this was done on HDFS itself at very minimal cost, and the work originally came from Facebook, so it is useful in both normal HDFS and in RAMCloud.

          It's also not clear how this type of data placement meshes with EC, or the other quite sophisticated types of block placement we currently support in HDFS

          Chunks will support remote blocks. That is a notion we did not discuss due to time constraints in the presentation. There we just showed chunks as files on the same machine, but a chunk could live on an EC block and a key could point to it.

          How do you plan to handle files larger than 5GB?

          Large files will live on a set of containers; again, the easiest way to reason about containers is to think of them as blocks. In other words, just like normal HDFS.

          Large files right now are also not spread across multiple nodes and disks, limiting IO performance.

          Are you saying that when you write a file of, say, 256 MB you would be constrained to the same machine since the block size is much larger? That is why we have chunks. With chunks we can easily distribute this information and leave pointers to those chunks. I agree with the concern, and we might have to tune the placement algorithms once we see more real-world workloads.

          Are all reads and writes served by the container's Raft master? IIUC that's how you get strong consistency, but it means we don't have the same performance benefits we have now in HDFS from 3-node replication.

          Yes, and I don't think it will be a bottleneck, since we are only reading and writing metadata – which is very small – and data can be read from any machine.

          I also ask that more of this information and decision making be shared on public mailing lists and JIRA.

          Absolutely, we have been working actively and posting code patches to the ozone branch. Feel free to ask us anything. It is kind of sad that we have to wait for an ApacheCon for someone to ask me these questions. I would encourage you to follow the JIRAs tagged with ozone to keep up with what is happening in the ozone world. I can assure you that all the work we are doing is always tagged with "ozone:" so that when you get the mail from JIRA it is immediately visible.

          The KSM is not mentioned in the architecture document.

          The namespace management component did not have a separate name; it was called SCM in the original architecture doc. Thanks to Arpit Agarwal – he came up with the name (just like ozone) and argued that we should make it a separate component, which will make it easier to understand and maintain. If you look at the code you will see the KSM functionality is currently in SCM. I am planning to move it out to KSM.

          the fact that the Ozone metadata is being replicated via Raft rather than stored in containers.

          Metadata is stored in containers, but containers need replication. Replication via Raft was something the design evolved to, and we are surfacing that to the community. Like all proposals under consideration, we will update the design doc and listen to community feedback. You just got to hear our design (a little earlier) in person because we had a chance to meet. Writing documents takes far more energy and time than chatting face to face. We will soon update the design docs.

          I was not aware that there is already progress internally at Hortonworks on a Raft implementation.

          We have been prototyping various parts of the system, including KSM, SCM and other parts of Ozone.

          We've previously expressed interest in being involved in the design and implementation of Ozone, but we can't meaningfully contribute if this work is being done privately.

          I feel that this is a completely uncalled-for statement and a misplaced sentiment. We have 54 JIRAs on ozone so far, and you are always welcome to ask questions or comment. I have personally posted a 43-page document explaining what ozone would look like to users, how they can manage it, and the REST interfaces offered by Ozone. Unfortunately I have not seen much engagement. While I am sorry that you feel left out, I do not see how you can attribute that to the engineers who work on ozone. We have been more than liberal about sharing internals, including the specific presentation we are discussing here, so I am completely lost when you say that this work is being done privately. In fact, I would ask you to be active on ozone JIRAs, start by asking questions, and make comments (just like you did with this one, except for the last remark), and I would love to work with you on ozone. It is never our intent to abandon our fellow contributors on this journey, but unless you participate in the JIRAs it is very difficult for us to know that you have more than a fleeting interest.

          andrew.wang Andrew Wang added a comment -

          Thanks for the reply Anu, I'd like to follow up on some points.

          Nothing in ozone prevents a chunk being EC encoded. In fact ozone makes no assumptions about the location or the types of chunks at all...

          The chunks will support remote blocks...

          My understanding was that the SCM was the entity responsible for the equivalent of BlockPlacementPolicy, and doing it on containers. It sounds like that's incorrect, and each container is independently doing chunk placement. That raises a number of questions:

          • How are we coordinating distributed data placement and replication? Are all containers heartbeating to other containers to determine liveness? Giving up global coordination of replication makes it hard to do throttling and control use of top-of-rack switches. It also makes it harder to understand the operation of the system.
          • Aren't 4MB chunks a rather small unit for cross-machine replication? We've been growing the HDFS block size over the years as networks get faster, since it amortizes overheads.
           • Does this also mean we have a "chunk report" from the remote chunk servers to the master?

          I also still have the same questions about mutability of an EC group requiring the parities to be rewritten. How are we forming and potentially rewriting EC groups?

          The fact that QJM was not written as a library makes it very hard for us to pull it out in a clean fashion. Again if you feel very strongly about it, please feel free to move QJM to a library which can be reused and all of us will benefit from it.

          I don't follow the argument that a new consensus implementation is more understandable than the one we've been supporting and using for years. Working with QJM, and adding support for missing functionality like multiple logs and dynamic quorum membership, would also have benefits in HDFS.

          I'm also just asking questions here. I'm not required to refactor QJM into a library to discuss the merits of code reuse.

          Nothing in the chunk architecture assumes that chunk files are separate files. The fact that a chunk is a triplet {FileName, Offset, Length} gives you the flexibility to store 1000s of chunks in a physical file.

          Understood, but in this scenario how do you plan to handle compaction? We essentially need to implement mutability on immutability. The traditional answer here is an LSM tree, a la Kudu or HBase. If this is important, it really should be discussed.

          One easy option would be storing the data in LevelDB as well. I'm not sure about the performance though, and it also doesn't blend well with the mutability of EC groups.

          So once more – just make sure we are on the same page – Merges are rare(not required generally) and splits happen if we want to re-distribute data on a same machine.

          I think I didn't explain my two points thoroughly enough. Let me try again:

          The first problem is the typical write/delete pattern for a system like HDFS. IIUC, in Ozone each container is allocated a contiguous range of the keyspace by the KSM. As an example, perhaps the KSM decides to allocate the range (i,j] to a container. Then the user kicks off a job that writes a whole bunch of files with the format ingest/file_N. Until we do a split, all those files land in that (i,j] container. So we split. Then, it's common for ingested data to be ETL'd and deleted. If we split earlier, that means we now have a lot of very small containers. This kind of hotspotting is less common in HBase, since DB users aren't encoding this type of nested structure in their keys.

          The other problem is that files can be pretty big. 1GB is common for data warehouses. If we have a 5GB container, a few deletes could quickly drop us below that target size. Similarly, a few additions can quickly raise us past it.

          Would appreciate an answer in light of the above concerns.

          So the container infrastructure that we have built is something that can be used by both ozone and HDFS...In future, if we want to scale HDFS, containers might be an easy way to do it.

          This sounds like a major refactoring of HDFS. We'd need to start by splitting the FSNS and BM locks, which is a massive undertaking, and possibly incompatible for operations like setrep. Moving the BM across an RPC boundary is also a demanding task.

          I think a split FSN / BM is a great architecture, but it's also something that has been attempted unsuccessfully a number of times in the community.

          These applications are written with the assumption of a Posix file system, so migrating them to Ozone does not make much sense.

          If we do not plan to support Hive and HBase, what is the envisioned set of target applications for Ozone?

          Yes, we have seen this issue in real world.

          That's a very interesting datapoint. Could you give any more details, e.g. circumstances and frequency? I asked around internally, and AFAIK we haven't encountered this issue before.

          It is kind of sad that we have to wait for an apachecon for someone to ask me these questions.

          I feel that this is a completely uncalled for statement / misplaced sentiment here. We have 54 JIRAs on ozone so far. You are always welcome to ask questions or comment.

          We have been more than liberal about the sharing internals including this specific presentation we are discussing here. So I am completely lost when you say that this work is being done privately.

          but unless you participate in JIRAs it is very difficult for us to know that you have more than a fleeting interest.

          As I said in my previous comment, it's very hard for the community to contribute meaningfully if the design is being done internally. I wouldn't have had the context to ask any of the above questions. None of the information about the KSM, Raft, or remote chunks has been mentioned in JIRA. For all I knew, we were still progressing down the design proposed in the architecture doc.

          I think community interest in Ozone has also been very clearly expressed. Speaking for myself, you can see my earlier comments on this JIRA, as well as my attendance at previous community phone calls. Really though, even if the community hadn't explicitly expressed interest, all of this activity should still have been done in a public forum. It's very hard for newcomers to ramp up unless design discussions are being done publicly.

          This "newcomer" issue is basically what's happening right now with our conversation. I'm sure you've already discussed many of the points I'm raising now with your colleagues. I actually would be delighted to commit my time and energy to Ozone development if I believed it's the solution to our HDFS scalability issues. However, since this work is going on internally at HWX, it's hard for me to assess, assist, or advocate for the project.

          zhz Zhe Zhang added a comment -

          Thanks for the discussions Andrew Wang Anu Engineer.

          I'm still trying to catch up on all the new updates (looking forward to the updated design doc; maybe also post the ApacheCon video?). Meanwhile, some thoughts around EC:

          My biggest concern is that erasure coding is not a first-class consideration in this system, and seems like it will be quite difficult to implement.

          I agree. The most difficult part of building HDFS-EC was removing / generalizing inflexible assumptions about replicas in HDFS block management and namespace logic. So it would help a lot to at least conceptually discuss the plan for EC implementation in the new design doc.

          Another question about reading the ApacheCon slides: the question "Why an Object Store" was well answered. How about "why an object store as part of HDFS"? IIUC Ozone is only leveraging a very small portion of HDFS code. Why should it be a part of HDFS instead of a separate project?

          cmccabe Colin P. McCabe added a comment -

          Another question about reading the ApacheCon slides: the question "Why an Object Store" was well answered. How about "why an object store as part of HDFS"? IIUC Ozone is only leveraging a very small portion of HDFS code. Why should it be a part of HDFS instead of a separate project?

          That's a very good question. Why can't ozone be its own subproject within Hadoop? We could add a hadoop-ozone directory at the top level of the git repo. Ozone seems to be reusing very little of the HDFS code. For example, it doesn't store blocks the way the DataNode stores blocks. It doesn't run the HDFS NameNode. It doesn't use the HDFS client code.

          arpitagarwal Arpit Agarwal added a comment -

          Andrew, this tone is not helpful. Nothing we presented at ApacheCon (also an Apache forum) was a significant change from the architecture doc. We will post a design doc soon and there is ample opportunity/need for community contributions as the implementation is still in an early stage. This is in line with how features are developed in Apache.

          andrew.wang Andrew Wang added a comment -

          Arpit Agarwal the impedance mismatch here is illustrated in your most recent comment:

          We will post a design doc soon and there is ample opportunity/need for community contributions as the implementation is still in an early stage. This is in line with how features are developed in Apache.

          The Apache community is supposed to be involved in the design too, not just the implementation. I thought we were doing this, since we had a nice design discussion when the architecture doc was released, and when we last spoke in late February this year, the design seemed unchanged from the design doc.

          Since then, it's clear that a lot of work has been done internally at Hortonworks, without community involvement. I consider changing how metadata is stored to be a very significant design change, as well as the addition of a new master service.

          If the design is still flexible and under discussion, great. What it feels like though is a completed design being dropped on us. It's hard for external contributors to interpret these design changes without the related context and discussions. If the design is viewed as completed and just needs implementation, it's also hard for us to make meaningful design changes.

          Again, I would love to collaborate with everyone on this project. HDFS scale is a topic at the forefront of my mind, and we would all benefit from working together on a single solution. But that requires opening it up so non-Hortonworkers can be deeply involved in the requirements and design, not just implementation.

          arpitagarwal Arpit Agarwal added a comment -

          I actually would be delighted to commit my time and energy to Ozone development

          I would love to collaborate with everyone on this project.

          Andrew, what has been your technical contribution over the last year to help move the project forward? Did you give any thought to how the architecture spec could be converted to a technically feasible design and did you at any time post your ideas on the Jira or approach the developers who were prototyping in the feature branch?

          drankye Kai Zheng added a comment -

          Looking forward, things could be better, given the prototype implementation, the upcoming updated design doc, and, also important, the fact that it's now on the right track under this active discussion. IMHO, it may help if you could meet and discuss this together, as the HDFS erasure coding effort did, considering this is another significant architecture change to the project. I hope the overall direction and design doc can be settled sooner, and I will also try to catch up.

          andrew.wang Andrew Wang added a comment -

          Sorry, hit reply too early. Quoting from my earlier response to Anu:

          Really though, even if the community hadn't explicitly expressed interest, all of this activity should still have been done in a public forum. It's very hard for newcomers to ramp up unless design discussions are being done publicly.

          This is how software is supposed to be developed at Apache, so everyone can watch and contribute. It's not a reasonable standard to require each of the 160 watchers on this JIRA to explicitly reach out to be involved in the conversation. And, like I said above, it's very hard to contribute unless this conversation is happening publicly.

          I'm a bit annoyed here since we did reach out in late Feb, and we had a nice design convo. We discussed the need for range instead of hash partitioning (which I'm happy to see made it), as well as the overhead of doing metadata and data lookups (which could motivate storing Ozone metadata in Raft instead of in a container). Then, as now, I also asked to be involved in the design discussions since this is a topic I'm very interested in. Here, I'm also trying to be as constructive as possible, raising questions as well as proposing possible solutions.

          I keep saying this, but I would like to collaborate on this project. If you're willing to revisit some of the design points we're discussing above, we can put the past behind us and move forward. So far though it feels like I'm being rebuffed.

          stevel@apache.org Steve Loughran added a comment -

          +1 for some meetup, an online hangout webex would be good for remote people like me to catch up.

          arguing with each other over a JIRA isn't the way to review designs or their implementations

          jnp Jitendra Nath Pandey added a comment -

          Of course the design is flexible and the project would benefit from a constructive discussion here. As repeatedly mentioned before, an updated document will be posted soon, and that essentially means we will discuss any input and concerns. No design gets frozen until it is implemented. All the implementation so far is in the JIRAs, and will continue to be so.

          anu Anu Engineer added a comment -

          Steve Loughran Kai Zheng We will certainly have a call to discuss the design once a detailed design doc is posted.

          Andrew Wang Thanks for your comments.

          Here, I'm also trying to be as constructive as possible, raising questions as well as proposing possible solutions.

          I appreciate the spirit and rest assured that we really appreciate you raising questions. It is just that writing a design doc takes a little time.

          We discussed the need for range instead of hash partitioning (which I'm happy to see made it), as well as the overhead of doing metadata and data lookups (which could motivate storing Ozone metadata in Raft instead of in a container).

          This has been my sentiment all along: we have been listening to community feedback and making changes, and we will certainly do the same going forward. I look forward to your comments and thoughts on ozone once we post the design doc.

          Zhe Zhang Colin P. McCabe and Andrew Wang I would like to discuss the technical issues that have been raised in this JIRA after I post the design doc. It will allow us to have a shared understanding of where we are and will eliminate a lot of repetition. I personally believe it would be much more productive to have the discussion once we all have a shared view of the issues and suggested solutions.

          jingzhao Jing Zhao added a comment -

          Talking about EC in ozone, I had a general discussion with Kai Zheng last week while he was visiting us. We think ozone's storage container layer can make EC work easier and cleaner, especially considering we're planning EC phase II, i.e., doing EC in an offline mode.

          Fundamentally, EC and replication should be handled in the storage layer (i.e., the block in HDFS, and the storage container in ozone) as two options for maintaining data durability. An ozone storage container will have the capability to support both. The general design to support EC in ozone can be very similar to some existing object stores such as Magic Pocket. We can have a more detailed discussion about the design and finally have a section in the design doc, but I do not think supporting EC will become a hurdle for us.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Andrew Wang, I understand you would like to contribute to this issue. However, why don't you fix HDFS symlinks first? It is also a very useful and important feature, one of the most wanted features. Many people are asking for it.

          I seem to recall that you got your committership by contributing the symlink feature; however, the symlink feature is still not working as of today. Why don't you fix it? I think you want to build up a good track record for yourself.

          jingzhao Jing Zhao added a comment -

          Looks like contributors do not have permission to attach files anymore? I am assigning the JIRA to Anu Engineer so that he can upload the updated design doc.

          anu Anu Engineer added a comment -

          Hi All,

          I have attached the ozone design update. Hopefully this addresses the concerns expressed by Andrew Wang. My apologies for the delay.

          I am also hoping that this will take us back to ozone's technical issues, and I would like to host a call if anyone would like to discuss this in greater depth.

          Andrew Wang Zhe Zhang Colin P. McCabe Kai Zheng I would like to respond to the technical issues you have raised in this JIRA once you get time to read through this design update and we all have a shared understanding of the current state of ozone.

          I would like to reassure you all that this is a design proposal and very much open to change. I would love to discuss the merits of this proposal and would love to see more community engagement and participation in ozone. Please do let me know if I can do anything more to address that concern.

          anu Anu Engineer added a comment -

          Thank you, I have updated the JIRA and assigned this back to Jitendra

          bikassaha Bikas Saha added a comment -

          In case there is a conference call, please send an email to hdfs-dev with the proposed meeting details for wider dispersal and participation since that is the right forum to organize community activities.

          cmccabe Colin P. McCabe added a comment -

          Tsz Wo Nicholas Sze wrote: I seem to recall that you got your committership by contributing the symlink feature, however, the symlink feature is still not working as of today. Why don't you fix it? I think you want to build up a good track record for yourself.

          Andrew Wang did not get his committership by contributing the symlink feature. By the time he was elected as a committer, he had contributed a system for efficiently storing and reporting high-percentile metrics, an API to expose disk location information to advanced HDFS clients, converted all remaining JUnit 3 HDFS tests to JUnit 4, and added symlink support to FileSystem. The last one was just contributing a new API to the FileSystem class, not implementing the symlink feature itself. You are probably thinking of Eli Collins, who became a committer partly by working on HDFS symlinks.

          stevel@apache.org Steve Loughran added a comment -

          For example, in (all of) Hadoop's s3 filesystem implementations, listStatus uses this quick listing of keys between A and B. When someone does "listStatus /a/b/c", we can ask s3 for all the keys between /a/b/c/ and /a/b/c0 (0 is the ASCII value right after slash). Of course, s3 does not really have directories, but we can treat the keys in this range as being in the directory /a/b/c for the purposes of s3a or s3n. If we just had hash partitioning, this kind of operation would be O(N^2) where N is the number of keys. It would just be infeasible for any large bucket.

          FWIW I'm looking at bulk-recursive directory listing in s3a for listStatus, moving the cost of listing from a very slow O(all-directories) to O(all-files/1000). It would be nice to retain that, as otherwise dir listing is a very expensive operation, which kills split calculation.
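          For illustration, a rough sketch of the key-range trick described in the quote above, assuming a range-partitioned key space; the helper is hypothetical, not the actual s3a code.

          // Hypothetical sketch: derive the key range that covers one "directory"
          // so a range-partitioned store can list it without scanning the whole keyspace.
          public final class PrefixRange {
            /** Start key (inclusive): the directory path with a trailing slash. */
            static String startKey(String dir) {
              return dir.endsWith("/") ? dir : dir + "/";
            }

            /** End key (exclusive): the same prefix with the trailing '/' bumped to '0',
             *  the next character in ASCII order. */
            static String endKey(String dir) {
              String start = startKey(dir);
              return start.substring(0, start.length() - 1) + "0";
            }

            public static void main(String[] args) {
              // Listing "/a/b/c" scans keys in ["/a/b/c/", "/a/b/c0").
              System.out.println(startKey("/a/b/c") + " .. " + endKey("/a/b/c"));
            }
          }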

          Now, can people stop being territorial or making any form of criticism of each other. It is fundamentally against the ASF philosophy of collaborative, community development, doesn't help long term collaboration and makes the entire project look bad. Thanks.

          stack stack added a comment -

          Now, can people stop being territorial or making any form of criticism of each other. It is fundamentally against the ASF philosophy of collaborative, community development, doesn't help long term collaboration and makes the entire project look bad. Thanks.

          Amen.

          Thanks for posting design Anu Engineer

          Datanodes provide a shared generic storage service called the container layer.

          Is this HDFS Datanode? We'd add block manager functionality to the Datanode? (Did we answer the Zhe Zhang question, "How about "why an object store as part of HDFS"?)

          Thanks

          arpitagarwal Arpit Agarwal added a comment -

          So far though it feels like I'm being rebuffed.

          As you pointed out, your and Colin's feedback from our last discussion has influenced the design (and Anu rightly credited you for that during the ApacheCon talk too). Also I recall Anu spending over an hour with you in person at ApacheCon to go over your comments. It is unfair to say that you are being rebuffed. I again request you avoid such remarks and share your technical feedback/ideas with us to help identify gaps in our thinking. We'd be happy to schedule a webex. Many of us working on Ozone are remote but perhaps we can get together at the Hadoop Summit in June.

          stack stack added a comment -

          It is unfair to say that you are being rebuffed.

          Can we please move to discussion of the design. Back and forth on what is 'fair', 'tone', and how folks got commit bits is corrosive and derails what is important here; i.e. landing this big one.

          jnp Jitendra Nath Pandey added a comment -

          Why an object store as part of HDFS?

          It is one of the goals to have both HDFS and ozone available in the same deployment. That means the same datanodes serve both ozone and HDFS data. Therefore, having ozone as a separate subproject in Hadoop is OK as long as they can share the storage layer. The datanode changes would still be needed in HDFS.
          There is another proposal, HDFS-10419, that moves HDFS data into storage containers. I think that effort will need a new datanode implementation that shares the storage container layer with ozone.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          > ... and added symlink support to FileSystem. The last one was just contributing a new API to the FileSystem class, not implementing the symlink feature itself. You are probably thinking of Eli Collins, who became a committer partly by working on HDFS symlinks.

          Thanks Colin for clarifying it.

          Correct me if I am wrong – before Andrew Wang's contribution, symlinks were somehow working (based on Eli Collins's work). After Andrew's work, we had no choice but to disable the symlink feature. In this sense, symlinks became even worse. Anyway, Andrew/Eli, any plan to fix symlinks?

          Indeed, this JIRA is about the object store. We should not discuss symlinks too much here. My previous comment was just a suggestion to Andrew. Let's discuss symlinks in the dev mailing list or another JIRA. Thanks.

          cmccabe Colin P. McCabe added a comment -

          Correct me if I am wrong – before Andrew Wang's contribution, symlink was somehow working (based on Eli Collins's work). After Andrew's work, we had no choice but disable the symlink feature. It this sense, symlink became even worse. Anyway, Andrew/Eli, any plan to fix symlink?

          Symlinks were broken before Andrew started working on them. They had serious security, performance, and usability issues. If you are interested in learning more about the issues and helping to fix them, take a look at HADOOP-10019. They were disabled to avoid exposing people to serious security risks. In the meantime, I will note that you were one of the reviewers on the JIRA that initially introduced symlinks, HDFS-245, before Andrew or I had even started working on Hadoop.

          lars_francke Lars Francke added a comment -

          I'm trying to get up to speed on the current proposal. Your new document starts with

          This document is an Ozone design update that builds on the original Ozone Architecture ​and describes in greater detail Ozone namespace management and data replication consistency.

          Does that mean I can disregard everything in the Ozone-architecture-v1.pdf document? I have to admit I am a bit confused about what the current state is. I started reading both.

          anu Anu Engineer added a comment -

          Lars Francke Sorry for the confusion. Ozone-architecture-v1.pdf is the original ozone architecture document that the design update refers to, so you are on the right track. This is an update of the original design, in which we are proposing that the SCM – which was similar to the Namenode in that it did both namespace management and block management – be separated into a KSM and an SCM. So most of the original document stands as is. This design update also contains a section on the data pipeline with details on how we would like to use an RSM to get strong consistency. The talk we gave ( http://schd.ws/hosted_files/apachebigdata2016/fc/Hadoop%20Object%20Store%20-%20Ozone.pdf ) at ApacheCon assumes no prior knowledge of the current state of ozone. If you like, you can look at those slides and then read the updated design doc; that will also give you continuity.

          anu Anu Engineer added a comment -

          As promised earlier, we would like to host an ozone design review meeting. The agenda is to discuss the ozone design and future work.

          Anu Engineer is inviting you to a scheduled Zoom meeting. 
          
          Topic: Ozone design review
          Time: Jun 9, 2016 2:00 PM (GMT-7:00) Pacific Time (US and Canada) 
          
          Join from PC, Mac, Linux, iOS or Android: https://hortonworks.zoom.us/j/679978944
          
          Or join by phone:
          
          +1 646 558 8656 (US Toll) or +1 408 638 0968 (US Toll)
          +1 855 880 1246 (US Toll Free)
          +1 888 974 9888 (US Toll Free)
          Meeting ID: 679 978 944 
          International numbers available: https://hortonworks.zoom.us/zoomconference?m=VJJvnfHtsvBoBXaaCftwMsOm8b-4ZkBj 
          

          Kai Zheng Steve Loughran Akira Ajisaka My apologies for a very North America-centric meeting time; we will host a follow-up meeting for contributors from Asia and Europe.

          eddyxu Lei (Eddy) Xu added a comment -

          Hi Anu Engineer, thanks a lot for organizing the meeting.

          I also have a few questions that will hopefully be answered in the meeting:

          • Since Ozone has decided to use range partitioning, how will key / data distribution achieve balance from the initial state? For example, a user Foo runs Hive and creates 10GB of data; are these data distributed to up to 6 (containers) DNs?
          • Would you explain the benefit of recovering a failed pipeline by using parallel writes to all 3 containers? It is not very clear in the design.
          • It seems to me that in the new pipeline in Ozone, there are no multiple intermediate states for each chunk?

            due to immutability of chunks, write chunk is an idempotent operation

            How does Ozone differentiate a recovery write from a malicious (or buggy) re-write?

          • You mentioned that the KSM/SCM separation is for future scalability. Do KSM / SCM maintain a 1:1, 1:n or n:m relationship? Though it is not in this phase, I'd like to know whether it is considered. Btw, are they also Raft replicated?
          • The raft ring / leader is per-container?
          • For the pipeline, say we have a pipeline A->B->C: if the data writes succeed on A and B, and the metadata Raft writes succeed on B and C, what would be the result of a read request sent to A or C?
          • How do you handle container split (merge, migrate) during writes?
          • Since container size is determined by space usage instead of # of keys, would that result in large performance variance on listing operations, because # of DNs reached for a list operation = total # of keys / (# of keys per container), and # of keys per container is determined by the average object size in the container?

          Thanks.

          anu Anu Engineer added a comment -

          Hi Lei (Eddy) Xu, thank you for reviewing the design doc and for your comments. Please see my responses below.

          Since Ozone has decided to use range partitioning, how will key / data distribution achieve balance from the initial state? For example, a user Foo runs Hive and creates 10GB of data; are these data distributed to up to 6 (containers) DNs?

          You bring up a very valid point. This was the most contentious issue in the ozone world for a while. We originally went with a hash partitioning scheme and a secondary index because of these concerns. The issue (and very rightly so) with that approach was that the secondary index is eventually consistent, which makes it hard to use. So we switched over to this scheme.

          So our current thought is this: each of the containers will report size, number of operations, and number of keys to the SCM. This will allow the SCM to balance the allocation of the key space. So if you have a large number of reads and writes that are completely independent, they will fill up the cluster/container space evenly.

          But we have an opposing requirement here: generally there is locality of access in the namespace. So in most cases, if you are reading and writing to a bucket, it is most efficient to keep that data together.

          Now let us look at this specific case: if containers are configured to, say, 2GB, then 10GB of data will map to 5 containers. These containers will be spread across a set of machines by the SCM's location-choosing algorithm.

          Would you explain the benefit of recovering a failed pipeline by using parallel writes to all 3 containers? It is not very clear in the design.

          The point I was trying to make is that the pipeline relies on a quorum as defined by the RSM.
          So if we decide to use this pipeline with Raft, the pipeline can be broken, and we will not attempt to heal it. Please let me know if this makes sense.

          How does Ozone differentiate a recovery write from a malicious (or buggy) re-write?

          Thanks for flagging this; right now we do not. We can always prevent it in the container layer. It is a small extension to make: we can write to a temporary file and replace the original if and only if the hashes match. I will file a work item to fix this.
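          A rough sketch of the temp-file-plus-hash check suggested above; the code is hypothetical (it assumes a SHA-256 digest and local java.nio files), and the real guard would live in the container layer.

          import java.io.IOException;
          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.nio.file.StandardCopyOption;
          import java.security.MessageDigest;
          import java.security.NoSuchAlgorithmException;
          import java.util.Arrays;

          // Hypothetical sketch: write the chunk to a temp file and only replace an
          // existing chunk if the content hashes match (a recovery re-write); otherwise reject.
          public final class GuardedChunkWrite {
            static void writeChunk(Path chunkFile, byte[] data)
                throws IOException, NoSuchAlgorithmException {
              Path tmp = chunkFile.resolveSibling(chunkFile.getFileName() + ".tmp");
              Files.write(tmp, data);

              if (Files.exists(chunkFile)) {
                byte[] oldHash = sha256(Files.readAllBytes(chunkFile));
                byte[] newHash = sha256(data);
                if (!Arrays.equals(oldHash, newHash)) {
                  Files.delete(tmp); // not an identical re-write: refuse to overwrite
                  throw new IOException("Chunk already exists with different content");
                }
              }
              Files.move(tmp, chunkFile, StandardCopyOption.REPLACE_EXISTING,
                  StandardCopyOption.ATOMIC_MOVE);
            }

            static byte[] sha256(byte[] bytes) throws NoSuchAlgorithmException {
              return MessageDigest.getInstance("SHA-256").digest(bytes);
            }
          }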

          You mentioned that the KSM/SCM separation is for future scalability. Do KSM / SCM maintain a 1:1, 1:n or n:m relationship? Though it is not in this phase, I'd like to know whether it is considered. Btw, are they also Raft replicated?

          KSM:SCM has an n:m relationship, even though in the easiest deployment configuration it is 1:1. So yes, it is defined that way. They are always Raft replicated.

          The raft ring / leader is per-container?

          Yes and no. Let me explain this a little more. If you think only in terms of the Raft protocol, then the Raft leader is per machine set. That is, we are going to have a leader for 3 machines (assuming a 3-machine Raft ring). Now let us switch over to a developer's point of view. Someone like me who is writing code against containers thinks strictly in terms of containers. So from an ozone developer's point of view, we have a Raft leader for a container. In other words, containers provide an abstraction that makes you think the Raft protocol is per container, whereas in reality it is a shared ring used by the many containers that share those 3 machines. This might be something that we want to explore in greater depth during the call.
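          A tiny sketch of the abstraction described above, using hypothetical names: many containers map onto one Raft ring shared by the machines that host them, so asking for the leader of a container really returns the leader of that shared ring.

          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          // Hypothetical sketch: a per-container view over a Raft ring that is
          // actually shared by every container hosted on the same 3 machines.
          public class ContainerRaftView {
            static final class RaftRing {
              final List<String> members; // e.g. the 3 datanodes in the ring
              final String leader;        // current Raft leader among the members
              RaftRing(List<String> members, String leader) {
                this.members = members;
                this.leader = leader;
              }
            }

            private final Map<Long, RaftRing> containerToRing = new HashMap<>();

            void addContainer(long containerId, RaftRing ring) {
              containerToRing.put(containerId, ring);
            }

            /** From the container developer's point of view, each container "has" a leader. */
            String leaderOf(long containerId) {
              return containerToRing.get(containerId).leader;
            }

            public static void main(String[] args) {
              RaftRing sharedRing = new RaftRing(List.of("dn1", "dn2", "dn3"), "dn2");
              ContainerRaftView view = new ContainerRaftView();
              view.addContainer(101L, sharedRing); // two containers, one shared ring
              view.addContainer(102L, sharedRing);
              System.out.println(view.leaderOf(101L)); // dn2
              System.out.println(view.leaderOf(102L)); // dn2
            }
          }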

          For the pipeline, say we have a pipeline A->B->C: if the data writes succeed on A and B, and the metadata Raft writes succeed on B and C, what would be the result of a read request sent to A or C?

          I am going to walk thru this with little more details, so that we are all on the same page.

          What you are describing is a situation where the Raft leader is either B or C (since Raft is an active-leader protocol). For the sake of this illustration, let us assume that we are talking about two scenarios: one where data is written to the leader and another datanode, and a second where data is written to the followers but not to the leader.

          Let us look at both in greater detail.

          Case 1: Data is written to machine B (the leader) and machine A. But when the RAFT commit happens, machine A is offline and the RAFT data is written to machines B and C.

          So we have a situation where B is the only machine with both metadata and data. We deal with this issue in two ways. First, when the commit callback happens on C, C will check whether it has the data block; since it does not, it will attempt to copy that block from either B or A.

          Second, when A's RAFT ring comes back up, it will catch up with the RAFT log, and the data is already available on machine A.

          So in both cases, we are replicating data/metadata as soon as we can. Now let us look at the case where a client goes to C and asks for this data block before the copy is done – the client will find that the read is a bit slow, since machine C will copy the data from machine B or A, write it to its local storage and then return the data.

          Case 2: Data is written to the 2 followers and the leader does not have the data block. The workflow is identical; the leader will copy the block from another machine before returning it.

          Case 3: I also want to illustrate an extreme case. Let us say a client did NOT write any data blocks but attempted to write a key. The key will get committed, but the container will not be able to find the data blocks at all. Since no data blocks were written by the client, the copy attempt will fail, and the RAFT leader will learn that this is a block with no replicas. This would be similar in nature to HDFS.
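          As an illustrative aside, a minimal sketch of the recovery behaviour in cases 1–3 (the interfaces and names are hypothetical, not the actual container code): on a metadata commit callback, or on a read, a replica that is missing the data block tries to copy it from a peer, and if no peer has it the block is treated as having no replicas.

{code:java}
import java.util.List;
import java.util.Optional;

public class MissingBlockRecovery {

  /** Hypothetical view of a replica's local chunk store. */
  interface LocalStore {
    boolean hasBlock(String blockId);
    void putBlock(String blockId, byte[] data);
    byte[] getBlock(String blockId);
  }

  /** Hypothetical client for fetching a block from another datanode in the ring. */
  interface PeerClient {
    Optional<byte[]> fetchBlock(String peer, String blockId);
  }

  private final LocalStore store;
  private final PeerClient peers;
  private final List<String> otherReplicas;   // the other two machines in the ring

  MissingBlockRecovery(LocalStore store, PeerClient peers, List<String> otherReplicas) {
    this.store = store;
    this.peers = peers;
    this.otherReplicas = otherReplicas;
  }

  /** Invoked when the Raft commit callback fires for a key whose data block
   *  may not have reached this replica (cases 1 and 2 above). */
  void onMetadataCommitted(String blockId) {
    if (!store.hasBlock(blockId)) {
      copyFromAnyPeer(blockId);
    }
  }

  /** Invoked on a read; a slow first read is the visible cost of the repair. */
  byte[] readBlock(String blockId) {
    if (!store.hasBlock(blockId) && !copyFromAnyPeer(blockId)) {
      // Case 3: the client never wrote the data, so no replica has it.
      throw new IllegalStateException("Block " + blockId + " has no replicas");
    }
    return store.getBlock(blockId);
  }

  private boolean copyFromAnyPeer(String blockId) {
    for (String peer : otherReplicas) {
      Optional<byte[]> data = peers.fetchBlock(peer, blockId);
      if (data.isPresent()) {
        store.putBlock(blockId, data.get());
        return true;
      }
    }
    return false;   // no peer had the block
  }
}
{code}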

          How do we handle container split (merge, migrate) during writes?

          I have made an eloquent argument about why we don't need to support merge in the first release of Ozone.

          When split is happening, the easiest way to deal with it is to pause the writes.

          If you don't mind, could you please take a look at
          http://schd.ws/hosted_files/apachebigdata2016/fc/Hadoop%20Object%20Store%20-%20Ozone.pdf - slides 38-42.
          I avoided repeating that in the design doc, since it was already quite large. We can go over this in detail during the call if you like.

          Since container size is determined by space usage instead of the number of keys, would that result in large performance variance for the listing operation?

          You are absolutely right; it can have variation in performance. The alternative we have is to use hash partitioning with secondary indices. If you like, we can revisit hash vs. range partitioning in the conference call.
          Last time, we decided to have range as the primary method, but reserved the option of bringing hash partitioning back at a later stage.

          anu Anu Engineer added a comment -

          Just posting a reminder here for the ozone design review. It is scheduled for Jun 9, 2016, 2:00 PM (GMT-7:00) Pacific Time.
          This meeting is to review ozone's proposed design. Hopefully everyone has had a chance to read the posted doc already.

          anu Anu Engineer added a comment - - edited

          Ozone meeting notes – Jun 9th, 2016

          Attendees: Thomas Demoor, Arpit Agarwal, JV Jujjuri, Jing Zhao, Andrew Wang, Lei Xu, Aaron Myers, Colin McCabe, Aaron Fabbri, Lars Francke, Sijie Guo, Stiwari, Anu Engineer

          We started the discussion with how erasure coding will be supported in ozone. This was quite a lengthy discussion, taking over half the meeting time. Jing Zhao explained the high-level architecture and pointed to similar work done by Dropbox.

          We then dove into the details of this problem, since we wanted to make sure that supporting erasure coding will be easy and efficient in ozone.

          Here are the major points:

          SCM currently supports a simple replicated container. To support erasure coding, SCM will have to return more than 3 machines; say we were using a 6 + 3 model of erasure coding, then a container is spread across nine machines. Once we modify SCM to support this model, the container client will have to write data to those locations and update the RAFT state with the metadata of this block.

          When a file read happens in ozone, the container client will go to KSM/SCM to find the container to read the metadata from. The metadata will tell the client where the actual data resides, and it will reconstruct the data from the EC-coded blocks.
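          As an illustrative aside, a minimal sketch of the 6 + 3 placement described above (names are hypothetical and the parity encoder is treated as a black box): SCM hands back nine locations, the client assigns six data stripes and three parity stripes to them, and the resulting location map is the metadata that gets recorded in the container's Raft state and later used by readers to locate or reconstruct the block.

{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EcPlacementSketch {

  static final int DATA_UNITS = 6;
  static final int PARITY_UNITS = 3;

  /** Hypothetical result: which stripe index goes to which datanode. */
  static Map<Integer, String> assignStripes(List<String> locationsFromScm) {
    if (locationsFromScm.size() != DATA_UNITS + PARITY_UNITS) {
      throw new IllegalArgumentException("6+3 EC needs exactly 9 locations");
    }
    Map<Integer, String> placement = new LinkedHashMap<>();
    for (int i = 0; i < locationsFromScm.size(); i++) {
      // Stripes 0..5 carry data, stripes 6..8 carry parity produced by the codec.
      placement.put(i, locationsFromScm.get(i));
    }
    return placement;
  }

  public static void main(String[] args) {
    List<String> nine = List.of("dn1", "dn2", "dn3", "dn4", "dn5",
                                "dn6", "dn7", "dn8", "dn9");
    // This map is the per-block metadata the container client would record via
    // Raft, and what a reader would use to locate (or reconstruct) the block.
    System.out.println(assignStripes(nine));
  }
}
{code}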

          We all agreed that getting EC done for ozone is an important goal, and to get to that objective, we will need to get the SCM and KSM done first.

          We also discussed how small files will cause an issue with EC, especially since a container would pack lots of these together, and how this would lead to requiring compaction due to deletes.

          Eddy brought up the issue of making sure that data is spread evenly across the cluster. Currently our plan is to maintain a list of machines based on container reports. The container reports would contain the number of keys, bytes stored, and number of accesses for that container. Based on this, SCM would be able to maintain a list that allows it to pick under-utilized machines from the cluster, thus ensuring a good data spread. Andrew Wang pointed out that counting I/O requests is not good enough and that we actually need the number of bytes read/written. That is an excellent suggestion; we will modify container reports to carry this information and use it in SCM's allocation decisions.
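          As an illustrative aside, a minimal sketch of the allocation idea (class, field and weighting choices are hypothetical): SCM aggregates container reports per datanode and prefers the least-loaded nodes, where the load now includes bytes read/written as suggested above, not just request counts.

{code:java}
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class AllocationByContainerReports {

  /** Hypothetical per-datanode aggregate built from container reports. */
  static final class NodeLoad {
    final String datanode;
    final long keyCount;
    final long bytesStored;
    final long bytesReadWritten;   // added per the discussion above

    NodeLoad(String datanode, long keyCount, long bytesStored, long bytesReadWritten) {
      this.datanode = datanode;
      this.keyCount = keyCount;
      this.bytesStored = bytesStored;
      this.bytesReadWritten = bytesReadWritten;
    }

    /** Simple illustrative score; a real policy would weight these differently. */
    double score() {
      return bytesStored + 0.5 * bytesReadWritten + 1024.0 * keyCount;
    }
  }

  /** Pick the n least-loaded datanodes for a new container. */
  static List<String> pickLeastLoaded(Collection<NodeLoad> nodes, int n) {
    return nodes.stream()
        .sorted(Comparator.comparingDouble(NodeLoad::score))
        .limit(n)
        .map(node -> node.datanode)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<NodeLoad> reports = List.of(
        new NodeLoad("dn1", 10_000, 500L << 30, 200L << 30),
        new NodeLoad("dn2",  1_000,  50L << 30,  10L << 30),
        new NodeLoad("dn3",  5_000, 250L << 30, 400L << 30));
    System.out.println(pickLeastLoaded(reports, 2));   // prints [dn2, dn3]
  }
}
{code}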

          Eddy followed up with a question about how something like Hive would behave over ozone. Say Hive creates a bucket, creates lots of tables, and after the work is done, deletes all the tables. Ozone would have allocated containers to accommodate the overflowing bucket, so it is possible to have many empty containers on an ozone cluster.

          SCM is free to delete any container that does not have a key. This is because in the ozone world, metadata exists inside a container. Therefore, if a container is empty, then we know that no objects (ozone volumes, buckets or keys) exist in that container. This gives us the freedom to delete any empty container, and this is how containers would be removed in the ozone world.
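          As an illustrative aside, a minimal sketch of that cleanup rule (names are hypothetical): because all object metadata lives inside the container, a zero-key container report is sufficient evidence that the container holds no ozone objects and may be deleted.

{code:java}
import java.util.List;
import java.util.stream.Collectors;

public class EmptyContainerCleanup {

  /** Hypothetical container report fields relevant to the cleanup decision. */
  record ContainerReport(long containerId, long keyCount, long bytesStored) {}

  /** Containers with no keys hold no ozone objects at all, so SCM may delete them. */
  static List<Long> containersSafeToDelete(List<ContainerReport> reports) {
    return reports.stream()
        .filter(r -> r.keyCount() == 0)
        .map(ContainerReport::containerId)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<ContainerReport> reports = List.of(
        new ContainerReport(1, 0, 0),
        new ContainerReport(2, 42, 1_000_000));
    System.out.println(containersSafeToDelete(reports));   // prints [1]
  }
}
{code}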

          Andrew Wang pointed out that it is possible to create thousands of volumes and map them to a similar number of containers. He was worried that it would become a scalability bottleneck. While this is possible, in reality, if you have a cluster with only volumes, then KSM is free to map many ozone volumes to a single container. We agreed that if this indeed becomes a problem, we can write a simple compaction tool for KSM which will move all these volumes to a few containers. Then SCM's container deletion would kick in and clean up the cluster.

          We iterated through all the scenarios for merge and concluded that for v1, ozone can live without needing to support merges of containers.

          Then Eddy pointed out that by switching from hash partitions to range partitions we have introduced variability in the list operations for a container. Since it is not documented on JIRA why we switched to range partitioning, we discussed the issue that caused the switch.

          The original design called for hash partitioning, with operations like list relying on a secondary index. This would create an eventual-consistency model where you might create a key, but it is visible in the namespace only after the secondary index is updated. Colin argued that it is easier for our users to see consistent namespace operations. This is the core reason why we moved to using range partitions.

          However, range partitions do pose the issue that a bucket might be split across a large number of containers, and the list operation does not have fixed-time guarantees. The worst-case scenario is a bucket with thousands of 5 GB objects, which internally causes the bucket to be mapped over a set of containers. This would imply that a list operation could have to read sequentially from many containers to build the list.

          We discussed many solutions to this problem:

          • In the original design, we had proposed separate metadata and data containers. We can follow the same model, with the assumption that the data container and metadata container are on the same machine. Both Andrew and Thomas seemed to think that is a good idea.

          • Anu argued that this may not be an issue, since the datanodes (front ends) would be able to cache lots of this info as well as pre-fetch lists, since it is a forward iteration.

          • Arpit pointed out that while this is an issue that we need to tackle, we would need to build the system, measure and choose the appropriate solution based on data.

          • In an off-line conversation after the call, Jitendra pointed out that this will not have any performance impact: since each split point is well known to KSM, it is trivial to add hints/caching in the KSM layer itself to address this issue. In other words, we can issue parallel reads to all the containers if the client wants 1000 keys and we know we need to reach out to 3 containers to get that many keys, since KSM would give us that hint (see the sketch below).
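          As an illustrative aside, a minimal sketch of that suggestion (names and the executor choice are hypothetical): KSM returns the containers whose key ranges overlap the listing request, and the client issues the per-container reads in parallel instead of walking the containers one by one.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

public class ParallelBucketListing {

  /**
   * Hypothetical listing helper. containersFromKsm is the hint KSM provides
   * (the containers whose key ranges overlap the requested listing), and
   * listContainer is a function that lists up to 'limit' keys from one container.
   */
  static List<String> listKeys(List<String> containersFromKsm,
                               BiFunction<String, Integer, List<String>> listContainer,
                               int limit) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(containersFromKsm.size());
    try {
      // Fan the per-container reads out in parallel.
      List<Future<List<String>>> futures = new ArrayList<>();
      for (String container : containersFromKsm) {
        futures.add(pool.submit(() -> listContainer.apply(container, limit)));
      }
      // Merge the results; a sort keeps the final listing in key order.
      List<String> keys = new ArrayList<>();
      for (Future<List<String>> f : futures) {
        keys.addAll(f.get());
      }
      Collections.sort(keys);
      return keys.size() > limit ? keys.subList(0, limit) : keys;
    } finally {
      pool.shutdown();
    }
  }
}
{code}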

          While we agree that this is an issue we might have to tackle eventually in the ozone world, we were not able to converge on an exact solution since we ran out of time at this point.

          ATM mentioned that we would benefit from getting together and doing some whiteboarding of ozone’s design, and we intend to do that soon.

          This was a very productive discussion and I want to thank all participants. It was a pleasure talking to all of you.

          Please feel free to add/edit these notes for completeness or corrections.

          drankye Kai Zheng added a comment -

          Thanks all for the discussion and Anu Engineer for this nice summary.

          To support erasure coding, SCM will have to return more than 3 machines; say we were using a 6 + 3 model of erasure coding, then a container is spread across nine machines. Once we modify SCM to support this model, the container client will have to write data to those locations and update the RAFT state with the metadata of this block.

          This looks like it supports striping erasure coding in the client when putting/updating a k/v to the store, right? For small objects, the write will trigger the relatively expensive work of encoding and writing to 6+3 locations, so I doubt the performance/overhead is worth the benefit. For large objects, it sounds fine. So, like we did for striped files, users should also be able to opt in or out of striping according to their bucket conditions, I guess.

          For HDFS files, in addition to striping, there is another way to do erasure coding at the block level, as discussed in HDFS-8030, mainly targeting the conversion of old/cold data from replicas into erasure-coded form to save storage. How about this approach in Ozone? Would we have old/cold buckets that can be frozen and no longer updated? I'm not sure about this from a user's point of view, but we might not reuse the same sets of buckets/containers across many years, right?

          anu Anu Engineer added a comment -

          I have opened an ozone channel on the ASF Slack. Since we are deploying and testing ozone, real-time communication is very useful for flagging issues as we see them.

          Please sign up at the following page using your Apache ID, and join the #Ozone channel if you would like to get notified about how the testing and deployment is going for ozone.

          https://the-asf.slack.com/signup

          elek Elek, Marton added a comment -

          Is this chat for committers only?

          anu Anu Engineer added a comment -

          Elek, Marton Nope, all are welcome.

          elek Elek, Marton added a comment -

          Yeah, but it seems that for registration I need an @apache.org email address, and there is no information about invites anywhere.

          stevel@apache.org Steve Loughran added a comment -

          Putting my ASF process hat on, it is important that anyone interested in collaborating is allowed to join in, especially as real-time chats tend to be exclusive enough anyway.

          IRC channel, perhaps?

          anu Anu Engineer added a comment -

          Steve Loughran, Elek, Marton: I have asked the Slack community how this can be solved. I am hopeful there is a way to invite people without an Apache email ID. I will update this discussion when I hear back from the community on Slack. If we cannot add people without an Apache ID, I will move this to IRC as Steve suggested.

          anu Anu Engineer added a comment - - edited

          djohnament Would you care to comment on ASF Slack usage for people without an Apache email ID?

          msingh Mukul Kumar Singh added a comment -

          Posting a preliminary patch for HDFS-7240 to get Jenkins feedback (checkstyle/findbugs/unit tests/whitespace/etc.) before the merge.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 25s Docker mode activated.
                Prechecks
          0 shelldocs 0m 17s Shelldocs was not available.
          +1 @author 0m 1s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 118 new or modified test files.
                trunk Compile Tests
          0 mvndep 0m 25s Maven dependency ordering for branch
          +1 mvninstall 14m 1s trunk passed
          +1 compile 15m 50s trunk passed
          +1 checkstyle 2m 24s trunk passed
          +1 mvnsite 9m 44s trunk passed
          +1 shadedclient 9m 6s branch has no errors when building and testing our client artifacts.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-tools/hadoop-tools-dist hadoop-tools hadoop-client-modules/hadoop-client-minicluster hadoop-client-modules/hadoop-client-check-test-invariants .
          +1 findbugs 5m 41s trunk passed
          +1 javadoc 5m 30s trunk passed
                Patch Compile Tests
          0 mvndep 0m 48s Maven dependency ordering for patch
          +1 mvninstall 24m 45s the patch passed
          +1 compile 13m 17s the patch passed
          +1 cc 13m 17s the patch passed
          -1 javac 13m 17s root generated 160 new + 1115 unchanged - 156 fixed = 1275 total (was 1271)
          -0 checkstyle 2m 32s root: The patch generated 60 new + 872 unchanged - 16 fixed = 932 total (was 888)
          +1 mvnsite 11m 21s the patch passed
          +1 shellcheck 0m 26s There were no new shellcheck issues.
          -1 whitespace 0m 2s The patch has 11 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
          -1 whitespace 0m 2s The patch 1 line(s) with tabs.
          +1 xml 0m 18s The patch has no ill-formed XML file.
          +1 shadedclient 10m 50s patch has no errors when building and testing our client artifacts.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project . hadoop-client-modules/hadoop-client-check-test-invariants hadoop-client-modules/hadoop-client-minicluster hadoop-tools hadoop-tools/hadoop-tools-dist
          +1 findbugs 7m 12s the patch passed
          +1 javadoc 6m 2s the patch passed
                Other Tests
          -1 unit 15m 45s root in the patch failed.
          +1 asflicense 0m 35s The patch does not generate ASF License warnings.
          159m 33s



          Reason Tests
          Failed junit tests hadoop.net.TestDNS



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:3d04c00
          JIRA Issue HDFS-7240
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12891538/HDFS-7240.001.patch
          Optional Tests asflicense shellcheck shelldocs mvnsite unit shadedclient compile javac javadoc mvninstall xml findbugs checkstyle cc
          uname Linux f4c60f5ccb4f 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / fa5cfc6
          Default Java 1.8.0_144
          shellcheck v0.4.6
          findbugs v3.1.0-RC1
          javac https://builds.apache.org/job/PreCommit-HDFS-Build/21650/artifact/patchprocess/diff-compile-javac-root.txt
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/21650/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/21650/artifact/patchprocess/whitespace-eol.txt
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/21650/artifact/patchprocess/whitespace-tabs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/21650/artifact/patchprocess/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/21650/testReport/
          modules C: hadoop-project hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-ozone . hadoop-client-modules/hadoop-client-check-test-invariants hadoop-client-modules/hadoop-client-minicluster hadoop-tools hadoop-tools/hadoop-tools-dist U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/21650/console
          Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          msingh Mukul Kumar Singh added a comment -

          The updated v2 patch fixes the checkstyle, javac, and whitespace issues in the last patch.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 11m 48s Docker mode activated.
                Prechecks
          0 shelldocs 0m 18s Shelldocs was not available.
          +1 @author 0m 1s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 132 new or modified test files.
                trunk Compile Tests
          0 mvndep 0m 28s Maven dependency ordering for branch
          +1 mvninstall 20m 31s trunk passed
          +1 compile 16m 56s trunk passed
          +1 checkstyle 3m 17s trunk passed
          +1 mvnsite 14m 54s trunk passed
          +1 shadedclient 10m 58s branch has no errors when building and testing our client artifacts.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-tools/hadoop-tools-dist hadoop-tools hadoop-client-modules/hadoop-client-minicluster hadoop-client-modules/hadoop-client-check-test-invariants hadoop-dist .
          +1 findbugs 7m 46s trunk passed
          +1 javadoc 8m 12s trunk passed
                Patch Compile Tests
          0 mvndep 1m 0s Maven dependency ordering for patch
          +1 mvninstall 35m 53s the patch passed
          +1 compile 18m 43s the patch passed
          +1 cc 18m 43s the patch passed
          -1 javac 18m 43s root generated 17 new + 1231 unchanged - 17 fixed = 1248 total (was 1248)
          -0 checkstyle 3m 32s root: The patch generated 61 new + 858 unchanged - 16 fixed = 919 total (was 874)
          +1 mvnsite 15m 42s the patch passed
          +1 shellcheck 0m 29s There were no new shellcheck issues.
          -1 whitespace 0m 3s The patch 1 line(s) with tabs.
          +1 xml 0m 37s The patch has no ill-formed XML file.
          +1 shadedclient 13m 37s patch has no errors when building and testing our client artifacts.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project . hadoop-client-modules/hadoop-client-check-test-invariants hadoop-client-modules/hadoop-client-minicluster hadoop-dist hadoop-tools hadoop-tools/hadoop-tools-dist
          +1 findbugs 10m 28s the patch passed
          +1 javadoc 8m 57s the patch passed
                Other Tests
          -1 unit 170m 24s root in the patch failed.
          +1 asflicense 0m 49s The patch does not generate ASF License warnings.
          378m 57s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting
            hadoop.hdfs.server.federation.metrics.TestFederationMetrics
            hadoop.hdfs.TestDFSUpgradeFromImage
            hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
            hadoop.hdfs.server.blockmanagement.TestReconstructStripedBlocksWithRackAwareness
            hadoop.hdfs.TestReadStripedFileWithMissingBlocks
            hadoop.yarn.server.nodemanager.scheduler.TestDistributedScheduler
            hadoop.yarn.server.nodemanager.TestNodeStatusUpdater



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:5b98639
          JIRA Issue HDFS-7240
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12893962/HDFS-7240.002.patch
          Optional Tests asflicense shellcheck shelldocs mvnsite unit shadedclient compile javac javadoc mvninstall xml findbugs checkstyle cc
          uname Linux d6f22be77cbc 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 5b98639
          maven version: Apache Maven 3.3.9
          Default Java 1.8.0_131
          shellcheck v0.4.6
          findbugs v3.1.0-RC1
          javac https://builds.apache.org/job/PreCommit-HDFS-Build/21822/artifact/patchprocess/diff-compile-javac-root.txt
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/21822/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/21822/artifact/patchprocess/whitespace-tabs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/21822/artifact/patchprocess/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/21822/testReport/
          modules C: hadoop-project hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-ozone . hadoop-client-modules/hadoop-client-check-test-invariants hadoop-client-modules/hadoop-client-minicluster hadoop-dist hadoop-tools hadoop-tools/hadoop-tools-dist U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/21822/console
          Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          msingh Mukul Kumar Singh added a comment -

          Re-uploading the v3 patch to trigger Jenkins again.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 21s Docker mode activated.
                Prechecks
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 132 new or modified test files.
                trunk Compile Tests
          0 mvndep 1m 38s Maven dependency ordering for branch
          +1 mvninstall 13m 18s trunk passed
          +1 compile 11m 28s trunk passed
          +1 checkstyle 2m 25s trunk passed
          +1 mvnsite 9m 4s trunk passed
          +1 shadedclient 8m 45s branch has no errors when building and testing our client artifacts.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-tools/hadoop-tools-dist hadoop-tools hadoop-client-modules/hadoop-client-minicluster hadoop-client-modules/hadoop-client-check-test-invariants hadoop-dist .
          +1 findbugs 4m 41s trunk passed
          +1 javadoc 5m 22s trunk passed
                Patch Compile Tests
          0 mvndep 0m 46s Maven dependency ordering for patch
          +1 mvninstall 22m 45s the patch passed
          +1 compile 11m 7s the patch passed
          +1 cc 11m 7s the patch passed
          -1 javac 11m 7s root generated 17 new + 1231 unchanged - 17 fixed = 1248 total (was 1248)
          -0 checkstyle 2m 27s root: The patch generated 10 new + 859 unchanged - 16 fixed = 869 total (was 875)
          +1 mvnsite 9m 28s the patch passed
          +1 shellcheck 0m 25s There were no new shellcheck issues.
          +1 shelldocs 0m 10s The patch generated 0 new + 100 unchanged - 4 fixed = 100 total (was 104)
          -1 whitespace 0m 2s The patch 1 line(s) with tabs.
          +1 xml 0m 18s The patch has no ill-formed XML file.
          +1 shadedclient 10m 24s patch has no errors when building and testing our client artifacts.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project . hadoop-client-modules/hadoop-client-check-test-invariants hadoop-client-modules/hadoop-client-minicluster hadoop-dist hadoop-tools hadoop-tools/hadoop-tools-dist
          +1 findbugs 6m 16s the patch passed
          +1 javadoc 5m 41s the patch passed
                Other Tests
          -1 unit 135m 4s root in the patch failed.
          -1 asflicense 0m 38s The patch generated 3 ASF License warnings.
          265m 17s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.TestReencryptionWithKMS
            hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA
            hadoop.hdfs.TestDFSUpgradeFromImage
            hadoop.yarn.server.nodemanager.scheduler.TestDistributedScheduler



          Subsystem Report/Notes
          Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639
          JIRA Issue HDFS-7240
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12894504/HDFS-7240.003.patch
          Optional Tests asflicense shellcheck shelldocs mvnsite unit shadedclient compile javac javadoc mvninstall xml findbugs checkstyle cc
          uname Linux 3b769eee827f 3.13.0-117-generic #164-Ubuntu SMP Fri Apr 7 11:05:26 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/patchprocess/precommit/personality/provided.sh
          git revision trunk / 8be5707
          maven version: Apache Maven 3.3.9
          Default Java 1.8.0_131
          shellcheck v0.4.6
          findbugs v3.1.0-RC1
          javac https://builds.apache.org/job/PreCommit-HDFS-Build/21860/artifact/out/diff-compile-javac-root.txt
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/21860/artifact/out/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/21860/artifact/out/whitespace-tabs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/21860/artifact/out/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/21860/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HDFS-Build/21860/artifact/out/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-ozone . hadoop-client-modules/hadoop-client-check-test-invariants hadoop-client-modules/hadoop-client-minicluster hadoop-dist hadoop-tools hadoop-tools/hadoop-tools-dist U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/21860/console
          Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          shv Konstantin Shvachko added a comment -

          It is an interesting question whether Ozone should be a part of Hadoop. There are two main reasons why I think it should not.

          1. With close to 500 sub-tasks, with 6 MB of code changes, and with a sizable community behind it, it looks to me like a whole new project.
            It is essentially a new storage system, with a different architecture than HDFS and separate S3-like APIs. This is really great - the world sure needs more distributed file systems. But it is not clear why Ozone should co-exist with HDFS under the same roof.
          2. Ozone is probably just the first step in rebuilding HDFS under a new architecture, with the next steps presumably being HDFS-10419 and HDFS-11118.
            The design doc for the new architecture has never been published. I can only assume, based on some presentations and personal communications, that the idea is to use Ozone as a block storage, and re-implement the NameNode so that it stores only a partial namespace in memory, while the bulk of it (cold data) is persisted to local storage.
            Such an architecture makes me wonder if it solves Hadoop's main problems. There are two main limitations in HDFS:
            a) The throughput of namespace operations, which is limited by the number of RPCs the NameNode can handle.
            b) The number of objects (files + blocks) the system can maintain, which is limited by the memory size of the NameNode.
            The RPC performance (a) is more important for Hadoop scalability than the object count (b), with read RPCs being the main priority.
            The new architecture targets the object-count problem, but at the expense of RPC throughput, which seems to be the wrong resolution of the tradeoff.
            Also, based on the usage patterns on our large clusters, we read up to 90% of the data we write, so cold data is a small fraction and most of it must be cached.

          To summarize:

          • Ozone is a big enough system to deserve its own project.
          • The architecture that Ozone leads to does not seem to solve the intrinsic problems of current HDFS.
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 12m 44s Docker mode activated.
                Prechecks
          +1 @author 0m 1s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 132 new or modified test files.
                trunk Compile Tests
          0 mvndep 0m 27s Maven dependency ordering for branch
          +1 mvninstall 20m 27s trunk passed
          +1 compile 15m 58s trunk passed
          +1 checkstyle 3m 7s trunk passed
          +1 mvnsite 14m 32s trunk passed
          +1 shadedclient 12m 7s branch has no errors when building and testing our client artifacts.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-tools/hadoop-tools-dist hadoop-tools hadoop-client-modules/hadoop-client-minicluster hadoop-client-modules/hadoop-client-check-test-invariants hadoop-dist .
          +1 findbugs 7m 1s trunk passed
          +1 javadoc 7m 27s trunk passed
                Patch Compile Tests
          0 mvndep 0m 52s Maven dependency ordering for patch
          +1 mvninstall 37m 0s the patch passed
          +1 compile 17m 26s the patch passed
          +1 cc 17m 26s the patch passed
          -1 javac 17m 26s root generated 17 new + 1231 unchanged - 17 fixed = 1248 total (was 1248)
          -0 checkstyle 3m 24s root: The patch generated 9 new + 858 unchanged - 16 fixed = 867 total (was 874)
          +1 mvnsite 16m 1s the patch passed
          +1 shellcheck 0m 29s There were no new shellcheck issues.
          +1 shelldocs 0m 15s The patch generated 0 new + 100 unchanged - 4 fixed = 100 total (was 104)
          -1 whitespace 0m 2s The patch 1 line(s) with tabs.
          +1 xml 0m 28s The patch has no ill-formed XML file.
          +1 shadedclient 14m 50s patch has no errors when building and testing our client artifacts.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project . hadoop-client-modules/hadoop-client-check-test-invariants hadoop-client-modules/hadoop-client-minicluster hadoop-dist hadoop-tools hadoop-tools/hadoop-tools-dist
          +1 findbugs 9m 45s the patch passed
          +1 javadoc 9m 0s the patch passed
                Other Tests
          -1 unit 17m 35s root in the patch failed.
          -1 asflicense 0m 51s The patch generated 3 ASF License warnings.
          225m 35s



          Reason Tests
          Failed junit tests hadoop.security.TestShellBasedUnixGroupsMapping
            hadoop.fs.shell.TestCopyPreserveFlag
            hadoop.security.token.delegation.TestZKDelegationTokenSecretManager



          Subsystem Report/Notes
          Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639
          JIRA Issue HDFS-7240
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12894749/HDFS-7240.004.patch
          Optional Tests asflicense shellcheck shelldocs mvnsite unit shadedclient compile javac javadoc mvninstall xml findbugs checkstyle cc
          uname Linux ce8d51ff22a0 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/patchprocess/precommit/personality/provided.sh
          git revision trunk / 9711b78
          maven version: Apache Maven 3.3.9
          Default Java 1.8.0_131
          shellcheck v0.4.6
          findbugs v3.1.0-RC1
          javac https://builds.apache.org/job/PreCommit-HDFS-Build/21875/artifact/out/diff-compile-javac-root.txt
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/21875/artifact/out/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/21875/artifact/out/whitespace-tabs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/21875/artifact/out/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/21875/testReport/
          asflicense https://builds.apache.org/job/PreCommit-HDFS-Build/21875/artifact/out/patch-asflicense-problems.txt
          modules C: hadoop-project hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-ozone . hadoop-client-modules/hadoop-client-check-test-invariants hadoop-client-modules/hadoop-client-minicluster hadoop-dist hadoop-tools hadoop-tools/hadoop-tools-dist U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/21875/console
          Powered by Apache Yetus 0.7.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jnp Jitendra Nath Pandey added a comment -

          Konstantin Shvachko Thank you for taking the time to review Ozone. I appreciate your comments and questions.

          There are two main limitations in HDFS
          a) The throughput of namespace operations, which is limited by the number of RPCs the NameNode can handle.
          b) The number of objects (files + blocks) the system can maintain, which is limited by the memory size of the NameNode.

          I agree completely. We believe ozone attempts to address both these issues for HDFS.

          Let us look at the number-of-objects problem. Ozone directly addresses the scalability of the number of blocks by introducing storage containers that can hold multiple blocks together. The earlier efforts on this were complicated by the fact that the block manager and the namespace are intertwined in the HDFS Namenode. There have been efforts in the past to separate the block manager from the namespace, e.g. HDFS-5477. Ozone addresses this problem by cleanly separating the block layer. Separation of the block layer also addresses file/directory scalability because it frees up the block map from the namenode.

          A separate block layer relieves the namenode from handling block reports, IBRs, heartbeats, the replication monitor etc., and thus reduces the contention on the FSNamesystem lock and significantly reduces the GC pressure on the namenode. These improvements will greatly help the RPC performance of the Namenode.

          Ozone is probably just the first step in rebuilding HDFS under a new architecture. With the next steps presumably being HDFS-10419 and HDFS-11118. The design doc for the new architecture has never been published.

          We do believe that the Namenode can leverage ozone’s storage container layer; however, that is also a big effort. We would like to first have the block layer stabilized in ozone before taking that up. However, we would certainly support any community effort on that, and in fact it was brought up in the last BoF session at the summit.

          Big data is evolving rapidly. We see our customers needing scalable file systems, object stores (like S3) and block stores (for Docker and VMs). Ozone improves HDFS in two ways: it addresses the throughput and scale issues of HDFS, and it enriches HDFS with newer capabilities.

          Ozone is a big enough system to deserve its own project.

          I took a quick look at the core code in ozone and the cloc command reports 22,511 lines of functionality changes in Java.

          This patch also brings in web framework code like Angular.js, and that brings in a bunch of CSS and JS files that contribute to the size of the patch; the rest are test and documentation changes.

          I hope this addresses your concerns.

          stevel@apache.org Steve Loughran added a comment -

          I'm starting with hadoop-common and hadoop-ozone; more to follow on thursday.

          For now, the biggest issue I have is that OzoneException needs to become an IOE, so simplifying exception handling all round, preserving information, not losing stack traces, and generally leading to happy support teams as well as developers. Changing the base class isn't itself traumatic, but it will affect the client code, as there's almost no longer any need to catch & wrap things.

          Other: What's your scale limit? I see a single PUT for the upload, GET path > tmp in open() . Is there a test for different sizes of file?

          hadoop-common

          Config

          I've filed some comments on the newly created HADOOP-15007, "Stabilize and document Configuration <tag> element", to cover making sure that there are tests & docs for this to go in.

          • HDFSPropertyTag: s/DEPRICATED/DEPRECATED/
          • OzonePropertyTag: s/there/their/
          • OzoneConfig Property.toString() is going to be "key valuenull" if there is no tag defined. Space?

          FileUtils

          Minor: imports are all shuffled about compared to trunk & branch-2; revert.

          OzoneException

          This is its own exception, not an IOE, and at least in OzoneFileSystem the process to build an IOE from it invariably loses the inner stack trace and all meaningful information about the exception type. Equally, OzoneBucket catches all forms of IOException and converts them to an OzoneRestClientException.

          We don't need to do this.

          It loses stack trace data, causes confusion, and is already making the client code over-complex: catching IOEs, wrapping them in OzoneException, then catching OzoneException and converting it back to an IOE, at which point all core information is lost.

          1. Make this a subclass of IOE, consistent with the rest of our code; then clients can throw it up untouched, except in the special case that they need to perform some form of exception handling (a minimal sketch follows the snippet below).
          2. Except for (any?) special cases, pass up IOEs raised in the http client as is.

          Also.

          • Confused by the overriding of message/getMessage(). Is it for serialization?
          • Consider adding a setMessage(String format, String... args) that calls String.format(): it would tie in with uses in the code.
          • Have an override where setThrowable() and setMessage() are called to set the nested exception (hence the full stack trace), handling the case where the exception returns null for getMessage():
          public OzoneException initCause(Throwable t) {
            super.initCause(t);
            setMessage(t.getMessage() != null ? t.getMessage() : t.toString());
            return this;
          }
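
          For context, here is a minimal sketch of what an IOE-based OzoneException could look like; the class body and field names are assumptions for illustration, not the actual Ozone code:

          import java.io.IOException;

          // Hypothetical sketch: OzoneException as an IOException subclass, so REST-layer
          // errors can propagate up the stack without catch-and-wrap.
          public class OzoneException extends IOException {
            private String resource;    // assumed fields mirroring the REST error body
            private String requestId;

            public OzoneException(String message) {
              super(message);
            }

            public OzoneException(String message, Throwable cause) {
              super(message, cause);    // keeps the inner stack trace intact
            }
          }

          With that change, callers such as OzoneFileSystem could simply declare throws IOException and let the original exception surface.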
          

          OzoneFileSystem

          general

          • Various places use LOG.info("text " + something); they should all move to LOG.info("text {}", something). See the sketch after this list.
          • Once OzoneException -> IOE, you can cut the catch-and-translate here.
          • Qualify paths before all uses. That's needed to stop them being relative, and to catch things like someone calling ozfs.rename("o3://bucket/src", "s3a://bucket/dest"), delete("s3a://bucket/path"), etc., as well as problems with validation happening before paths are made absolute.
          • RenameIterator.iterate() is going to log @ warn whenever it can't delete a temp file because it doesn't exist, which may be a distraction in failures. Better: if (!tmpFile.delete() && tmpFile.exists()), as that will only warn if the temp file is actually there.
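
          A small sketch of the logging and temp-file points above; the class, logger and method names are placeholders, assuming SLF4J as used elsewhere in Hadoop:

          import java.io.File;
          import org.slf4j.Logger;
          import org.slf4j.LoggerFactory;

          class RenameCleanupSketch {                        // hypothetical helper
            private static final Logger LOG =
                LoggerFactory.getLogger(RenameCleanupSketch.class);

            void cleanup(File tmpFile, String key) {
              // parameterized logging: the message is only formatted if INFO is enabled
              LOG.info("renamed key {}", key);

              // only warn when the temp file is actually present but could not be deleted
              if (!tmpFile.delete() && tmpFile.exists()) {
                LOG.warn("Could not delete temporary file {}", tmpFile);
              }
            }
          }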

          OzoneFileSystem.rename().

          Rename() is the operation to fear on an object store. I haven't looked at it in full detail.

          • Qualify all the paths before doing directory validation. Otherwise you can defeat the "don't rename into self" checks with rename("/path/src", "/path/../path/src/dest").
          • Log @ debug all the paths taken before returning so you can debug if needed.
          • S3A rename ended up having a special RenameFailedException() which innerRename() raises, with text and return code. Outer rename logs the text and returns the return code. This means that all failing paths have an exception clearly thrown, and when we eventually make rename/3 public, it's lined up to throw exceptions back to the caller. Consider copying this code.

          OzoneFileSystem.delete

          • qualify path before use
          • Don't log at error if you can't delete a nonexistent path; it is used everywhere for silent cleanup. Cut it.

          OzoneFileSystem.ListStatusIterator

          • make status field final

          OzoneFileSystem.mkdir

          Liked your algorithm here; took me a moment to understand how rollback didn't need to track all created directories. nice.

          • do qualify path first.

          OzoneFileSystem.getFileStatus

          getKeyInfo() catches all exceptions and maps them to null, which is interpreted as not found and eventually surfaces as FNFE. This is misleading if the failure is for any other reason.

          Once OzoneException -> IOException, getKeyInfo() should only catch & downgrade the explicit not found (404?) responses.

          OzoneFileSystem.listKeys()

          Unless this needs to be tagged as @VisibleForTesting, make it private.

          OzoneOutputStream

          • Implement StreamCapabilities and declare that hsync/hflush are not supported (see the sketch after this list).
          • Unless there is no limit on the size of a PUT request, or multipart uploads are supported, consider having the
            stream's write(int) method fail when the limit is reached. That way, things will at least fail fast.
          • After close, set backupStream = null.
          • flush() should be a no-op if called on a closed stream, so if (closed) return.
          • write() must fail if called on a closed stream.
          • Again, OzoneException -> IOE translation which could/should be eliminated.
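
          A rough outline of those stream semantics, under the assumption that the stream buffers to a local file before the PUT; the field names here are guesses, not the real class:

          import java.io.IOException;
          import java.io.OutputStream;
          import org.apache.hadoop.fs.StreamCapabilities;

          // Hypothetical sketch only: illustrates the capability declaration and
          // the closed-stream semantics suggested above.
          class OzoneOutputStreamSketch extends OutputStream implements StreamCapabilities {
            private OutputStream backupStream;   // assumed local buffer stream
            private boolean closed;

            OzoneOutputStreamSketch(OutputStream backupStream) {
              this.backupStream = backupStream;
            }

            @Override
            public boolean hasCapability(String capability) {
              return false;                      // neither hsync nor hflush is supported
            }

            @Override
            public void write(int b) throws IOException {
              if (closed) {
                throw new IOException("Stream is closed");
              }
              backupStream.write(b);
            }

            @Override
            public void flush() throws IOException {
              if (closed) {
                return;                          // no-op on a closed stream
              }
              backupStream.flush();
            }

            @Override
            public void close() throws IOException {
              if (closed) {
                return;
              }
              closed = true;
              backupStream.close();              // the real stream would PUT the buffer here
              backupStream = null;               // release the buffer after close
            }
          }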

          OzoneInputStream

          • You have chosen an interesting solution to the "efficient seek" problem here: D/L the entire file and
            then seek around. While this probably works for the first release, larger files will have problems in both
            disk space and size of
          • Again, OzoneException -> IOE translation which could/should be eliminated.

          Testing

          • Implement something like AbstractSTestS3AHugeFiles for scale tests, again with the ability to specify on the maven build how big the files to be created are. Developers should be able to ask for a test run with an 8GB test write, read and seek, to see what happens.
          • Add a subclass of org.apache.hadoop.fs.FileSystemContractBaseTest, ideally org.apache.hadoop.fs.FSMainOperationsBaseTest. These test things which the newer contract tests haven't yet reimplemented.

          TestOzoneFileInterfaces

          • Needs a Timeout rule for test timeouts.
          • All your assertEquals arguments are the wrong way round, sorry. A sketch of both points follows below.
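
          For illustration, a minimal JUnit 4 sketch of the timeout rule and the expected/actual ordering; the class name and values are placeholders:

          import static org.junit.Assert.assertEquals;
          import org.junit.Rule;
          import org.junit.Test;
          import org.junit.rules.Timeout;

          public class TestOzoneFileInterfacesSketch {       // hypothetical test class

            // fail any test case that runs longer than five minutes
            @Rule
            public Timeout globalTimeout = Timeout.seconds(300);

            @Test
            public void testReplicationDefault() {
              int actualReplication = 1;                     // stand-in for the value under test
              // assertEquals takes (expected, actual) in that order,
              // so failure messages read correctly.
              assertEquals(1, actualReplication);
            }
          }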
          anu Anu Engineer added a comment -

          Steve Loughran Thank you for the comments.

          For now, biggest issue I have is that OzoneException needs to become an IOE

          I have filed HDFS-12755 for converting the OzoneException to an IOException.

          What's your scale limit? I see a single PUT for the upload, GET path > tmp in open() . Is there a test for different sizes of file?

          We have tested with different sizes from 1-byte files to 2 GB. There is no size limit imposed by the ozone architecture. However, we have always planned to follow the S3 limit of 5 GB. We can certainly add tests for different sizes of files – but creating these data files during unit tests takes time. We have strived to keep the unit tests of ozone under 4 minutes so far. Large key sizes add prohibitive unit test times. So our approach is to use Corona, which is a load-generation tool for ozone. We run this 4 times daily with different key sizes. It is trivial to set up and run.

          For the comments on the OzoneFileSystem, I will let the appropriate person respond.

          shv Konstantin Shvachko added a comment -

          I hope this addresses your concerns.

          I don't think that addressed any of my concerns.

          • Ozone by itself does not solve any of HDFS's problems. It uses an HDFS-agnostic S3-like API, and I cannot use it on my clusters.
            Unless I can convince thousands of my users to rewrite their thousands of applications, along with the existing computational frameworks: YARN, Hive, Pig, Spark, ...... created over the past 10 years.
          • I was talking about the futuristic architecture, when you start using Ozone for block management and rewrite the NameNode to store its namespace in LevelDB, if this is still your plan. I agree this architecture solves the object-count problem. But it does not solve the problem of scaling RPC requests, which is more important to me than the number of objects, since you still cannot grow the cluster beyond a single NameNode's RPC-processing capability.
          stevel@apache.org Steve Loughran added a comment -

          Anu, I've got some more comments. Given the size of this JIRA & the number of watchers, I'm going to suggest a low-level "merge HDFS-7240" JIRA where we can discuss the low-level code details and attach a .patch of the entire thing for Jenkins/Yetus to handle. This is how we've done the S3Guard work, and it helps split code issues from more strategic things like Konstantin's.

          For large-scale tests, make sure you have tests which scale and a test runner timeout designed to support multi-hour tests. AWS S3 now supports multi-TB files through multipart uploads; we do run multi-GB long-haul uploads as part of the release process.

          Now, more client-side comments. I think this is my client side done, and it's more at the HTTP REST API level than anything else.

          OzoneFileSystem

          • Implement getDefaultBlockSize(); add a config option to let people set it, and a sensible default like 64 or 128 MB (see the sketch after this list).
          • You could implement copyFromLocalFile and copyToLocalFile trivially using bucket.putKey(dst, path) & bucket.getKey(). This lines you up for HADOOP-14766, which is a high-performance upload from the local FS to a store.
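
          A sketch of a configurable block size; the key fs.ozone.block.size and the 128 MB default are made up here for illustration:

          import org.apache.hadoop.conf.Configuration;

          final class OzoneBlockSizeSketch {
            // hypothetical key and default, for illustration only
            static final String OZONE_FS_BLOCK_SIZE_KEY = "fs.ozone.block.size";
            static final long OZONE_FS_BLOCK_SIZE_DEFAULT = 128L * 1024 * 1024; // 128 MB

            static long defaultBlockSize(Configuration conf) {
              // getLongBytes accepts suffixed values such as "64m" or "256m"
              return conf.getLongBytes(OZONE_FS_BLOCK_SIZE_KEY, OZONE_FS_BLOCK_SIZE_DEFAULT);
            }
          }

          OzoneFileSystem.getDefaultBlockSize(Path) could then delegate to a helper like this.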

          org.apache.hadoop.ozone.web.ozShell.Shell

          1. dispatch() always returns 1, so it doesn't differentiate success (0) from any other error.
          2. run(String[] args) doesn't check for `parseHandler` returning null; it will NPE on a parse error, when normally you'd want to print a usage message.
          3. Shell bucket options should return some explicit usage error.

          Have a look at what we've done in org.apache.hadoop.fs.s3a.s3guard.S3GuardTool w.r.t raising exceptions.

          • Any tool (for you: handler) can raise an UnknownOptionException which triggers the usage() command, no stack trace printed.
          • Any `ExitUtil.ExitException` sets the exit code of the process.
          • we use `ExitUtil.terminate(status, text)` to exit the shell in main(); this can be turned off, so that tests
            can invoke the full CLI and verify failure modes. A sketch of the pattern follows below.
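
          A condensed sketch of that error-handling pattern, assuming Hadoop's ExitUtil; the shell class and messages here are hypothetical:

          import org.apache.hadoop.util.ExitUtil;

          public final class OzShellSketch {                 // hypothetical CLI entry point

            /** Returns an exit code instead of calling System.exit directly. */
            static int run(String[] args) {
              try {
                if (args.length == 0) {
                  // usage problems surface as an ExitException carrying the exit code
                  throw new ExitUtil.ExitException(1, "Usage: oz <command> ...");
                }
                // ... dispatch to the real handlers here ...
                return 0;
              } catch (ExitUtil.ExitException e) {
                System.err.println(e.getMessage());
                return e.status;
              }
            }

            public static void main(String[] args) {
              // tests can call ExitUtil.disableSystemExit() and invoke run() directly
              ExitUtil.terminate(run(args));
            }
          }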

          Tests: I don't see any CLI tests. Look at TestS3GuardCLI to see how we do CLI testing: looking for those ExitExceptions & asserting on the return value,
          as well as grabbing all the string output (which can be printed to a string as well as stdout) & checking that.

          OzoneRestClient

          Need to plan for a common failure mode being the wrong endpoint, possibly returning plain-text or HTML error messages. Causes include client config and proxy issues. Also failures where the response is cut off partway through the read. Content-Length is your friend here.

          • All HTTP requests MUST verify the content-type of the response. Otherwise a GET of an HTML page will return 200 & trigger a parse failure, when it's probably "you just got the URL wrong". See the sketch after this list.
          • Error handling should handle content: text/plain or text/html and build a string from it. Why? Jetty &c can raise their own exceptions and they will return text.
          • Probably good to print the URL being used here on HTTP failures, as it helps debug foundational config issues of "Wrong endpoint"
          • Anything downloading to a string before parsing should look for a Content-Length header, and, if set (1) verify that it's within a sensible range & (2) use it to size the download. Actually, EntityUtils.toString(entity) does that, but it limits the size of a response to 4KB for that reason. Are you confident that all responses parsed that way will be <= 4KB long? I'd write tests there.
          • You can parse JSON straight off an InputStream: do that for any operation which can return large amounts of data.
          • why does setEndPointURI(URI) not just use Preconditions.checkArgument and throw an InvalidArgumentException?
          • for consistency setEndPoint(String clusterFQDN) needs a Preconditions.checkArgument(!Strings.isNullOrEmpty(clusterFQDN)).
          • delete(). Allow for the full set of delete responses (200, 204). Less critical against your own endpoint, but still best practise.
          • putKey. It's very inefficient to generate the checksum before the upload, as it means 1x scan of the data before the upload. This won't scale efficiently to multiGB uploads. What about: calculate during the put and have it returned by the endpoint; client to verify the resulting value. This does mean that a corrupted PUT will overwrite the dest data.
          • If you do use the checksum, verify that the result of an invalid checksum on PUT returns an exception which is handled in OzoneBucket.executePutKey(), that is: 400, 422 or whatever is sent back by the web engine. Of course, an invalid checksum must trigger a failure.
          • listVolumes: what's the max # of volumes? Can all volumes be returned in a single 4KB payload?
          • getKey: verify the content type; require Content-Length & use it when reading the file. Otherwise you won't pick up a broken GET.
          • Add a LOG.debug() before every HTTP request for debugging of live systems.
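
          To make the content-type/Content-Length checks concrete, here is a hedged sketch against the Apache HttpClient types the REST client already uses; the helper name and the 4 KB cap are assumptions:

          import java.io.IOException;
          import org.apache.http.HttpEntity;
          import org.apache.http.HttpResponse;
          import org.apache.http.entity.ContentType;

          final class RestResponseChecks {                          // hypothetical helper
            private static final long MAX_METADATA_RESPONSE = 4 * 1024;   // assumed cap

            static HttpEntity requireJson(HttpResponse response, String url) throws IOException {
              HttpEntity entity = response.getEntity();
              if (entity == null) {
                throw new IOException("No response body from " + url);
              }
              String mime = ContentType.getOrDefault(entity).getMimeType();
              if (!"application/json".equals(mime)) {
                // probably the wrong endpoint: an HTML or plain-text page came back
                throw new IOException("Unexpected content type " + mime + " from " + url);
              }
              long length = entity.getContentLength();
              if (length > MAX_METADATA_RESPONSE) {
                throw new IOException("Response from " + url + " is too large: " + length + " bytes");
              }
              return entity;
            }
          }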

          I see handling pooled connections is on the TODO list. It would be good to have, obviously. It will make a significant difference on metadata ops, list etc. The big perf killer here is getKeyInfo, which is executed 1 or 2 times in OzoneFileSystem.getFileStatus(), and that is used throughout the Hadoop stack on the assumption that it's a low-cost operation. Here it involves at worst setting up two HTTP connections. At least here they are not HTTPS connections, which are much more expensive.

          Ozone Bucket

          • OzoneBucket.executeListKeys: again, needs to move to direct JSON parse to handle response > 4GB
          • OzoneBucket.listKeys: javadocs to indicate "number of entries to return" to distinguish from response length in chars
          • OzoneBucket.executeGetKeyInfo.

          REST protocol itself

          • I'd have expected KeyInfo to be a HEAD with the fields returned as different headers to parse. That way the same headers can be returned on a GET, and it's consistent with the RESTy way. Many of the keyinfo fields are the standard HTTP headers anyway (Content-Length, checksum, last modified).
          • And its potentially useful to be able to support custom headers in future.

          Testing with TestOzoneRpcClient

          • Add a rule for timeouts; critical here
          • Parameterize it for both REST and RPC protocols, so guaranteeing equivalent behaviour & identifying regressions.
          • TestOzoneRpcClient to use LambdaTestUtils.intercept() to catch the IOE and check the message. It will (correctly) fail when an operation doesn't raise an exception, and include the output of the operation instead. LambdaTestUtils.intercept(IOException.class, "VOLUME_ALREADY_EXISTS", () -> Store.createVolume(volumeName)). See the sketch after this list.
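
          A self-contained sketch of that assertion style; the VolumeClient interface below is a stand-in for the real Ozone client API, stubbed so the example compiles on its own:

          import java.io.IOException;
          import org.apache.hadoop.test.LambdaTestUtils;
          import org.junit.Test;

          public class TestVolumeCreateSketch {               // hypothetical test class

            interface VolumeClient {                          // stand-in for the real client
              void createVolume(String name) throws IOException;
            }

            private final VolumeClient store = name -> {
              throw new IOException("VOLUME_ALREADY_EXISTS: " + name);   // stubbed failure
            };

            @Test
            public void testCreateDuplicateVolumeFails() throws Exception {
              // intercept() fails the test if no exception is raised, and includes the
              // lambda's return value in the failure text.
              LambdaTestUtils.intercept(IOException.class, "VOLUME_ALREADY_EXISTS",
                  () -> {
                    store.createVolume("volume-1");
                    return "createVolume unexpectedly succeeded";
                  });
            }
          }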

          Tests to add

          • Add a test where the OzoneRestClient is pointed at a MiniDFSCluster RPC endpoint, e.g. http://localhost:8088/, to verify it fails meaningfully.
          • Add a test where it's pointed at the miniDFS cluster web UI; again, expect failure.
          • Create a volume with an odd name like '%20' or '&'; see what happens.
          • Create lots of volumes with long names. List them.
          • Create lots of long keys, verify that listing works
          anu Anu Engineer added a comment -

          Steve Loughran I have filed HDFS-12761 to discuss code/design/arch questions on the merge. I really appreciate your feedback and time. Please do share any other issues you find. Sharing HDFS-12761 here so others know which jira to follow to see the merge discussion comments.

          msingh Mukul Kumar Singh added a comment -

          Steve Loughran I have filed HDFS-12768, HDFS-12767, HDFS-12762 & HDFS-12764 to address OzoneFileSystem related review comments. Thanks a lot for taking a look at the code and for your valuable comments.

          sanjay.radia Sanjay Radia added a comment -

          I have added a document that explains a design for scaling HDFS and how Ozone paves the way towards the full solution.

          shv Konstantin Shvachko added a comment -

          Sanjay Radia, thank you for sharing the doc, your vision for Ozone evolution, motivation, and compelling use cases.
          I am glad I had generally correct understanding that you envisioned Ozone as a block management layer for HDFS and a NameNode with partial namespace in memory.
          As I mentioned above the partial namespace architecture does not fully address the problem of scaling RPCs on Hadoop clusters, which is the main pain point for me and I believe everybody else running big analytics clusters.

          You give three main reasons for Ozone's inclusion into Hadoop. I think Ozone can do all three as a separate project as well.
          People run different systems on the same cluster along with Hadoop, e.g. HBase, Spark. So Ozone will be yet one more.
          A separate Ozone project does not prevent using it as a scalable block-container layer in HDFS. HDFS can always include Ozone as a dependency, especially if Ozone is already optimized for large IO scans.

          stack stack added a comment -

          The posted document needs an author, date, and a ref to this issue. Can it be made a Google doc so one can comment inline rather than here?

          I skipped to the end, "So why put the Ozone in HDFS and not keep it a separate project". There is no argument here on why Ozone needs to be part of Apache Hadoop. As per Konstantin Shvachko above, Ozone as a separate project does not preclude its being brought in instead as a dependency, nor does it dictate the shape of deploy (Bullet #3 is an aspiration, not an argument).

          hadoopqa Hadoop QA added a comment -

          A patch to the testing environment has been detected.
          Re-executing against the patched versions to perform further tests.
          The console is at https://builds.apache.org/job/PreCommit-HDFS-Build/21996/console in case of problems.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 25s Docker mode activated.
                Prechecks
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 134 new or modified test files.
                trunk Compile Tests
          0 mvndep 1m 53s Maven dependency ordering for branch
          +1 mvninstall 17m 15s trunk passed
          +1 compile 13m 36s trunk passed
          +1 checkstyle 2m 34s trunk passed
          +1 mvnsite 11m 31s trunk passed
          +1 shadedclient 10m 5s branch has no errors when building and testing our client artifacts.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-tools/hadoop-tools-dist hadoop-tools hadoop-client-modules/hadoop-client-minicluster hadoop-client-modules/hadoop-client-check-test-invariants hadoop-dist .
          +1 findbugs 5m 23s trunk passed
          +1 javadoc 5m 11s trunk passed
                Patch Compile Tests
          0 mvndep 0m 48s Maven dependency ordering for patch
          +1 mvninstall 31m 43s the patch passed
          +1 compile 15m 7s the patch passed
          +1 cc 15m 7s the patch passed
          -1 javac 15m 7s root generated 24 new + 1216 unchanged - 24 fixed = 1240 total (was 1240)
          -0 checkstyle 3m 14s root: The patch generated 15 new + 1397 unchanged - 19 fixed = 1412 total (was 1416)
          +1 mvnsite 13m 28s the patch passed
          +1 shellcheck 0m 27s There were no new shellcheck issues.
          +1 shelldocs 0m 10s The patch generated 0 new + 100 unchanged - 4 fixed = 100 total (was 104)
          -1 whitespace 0m 2s The patch 1 line(s) with tabs.
          +1 xml 0m 21s The patch has no ill-formed XML file.
          +1 shadedclient 13m 0s patch has no errors when building and testing our client artifacts.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project . hadoop-client-modules/hadoop-client-check-test-invariants hadoop-client-modules/hadoop-client-minicluster hadoop-dist hadoop-tools hadoop-tools/hadoop-tools-dist
          +1 findbugs 7m 26s the patch passed
          +1 javadoc 7m 2s the patch passed
                Other Tests
          -1 unit 19m 3s root in the patch failed.
          +1 asflicense 0m 42s The patch does not generate ASF License warnings.
          183m 30s



          Reason Tests
          Failed junit tests hadoop.net.TestDNS



          Subsystem Report/Notes
          Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639
          JIRA Issue HDFS-7240
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12896583/HDFS-7240.005.patch
          Optional Tests asflicense shellcheck shelldocs mvnsite unit shadedclient compile javac javadoc mvninstall xml findbugs checkstyle cc
          uname Linux 15f28a3c8a9e 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/patchprocess/precommit/personality/provided.sh
          git revision trunk / bb8a6ee
          maven version: Apache Maven 3.3.9
          Default Java 1.8.0_131
          shellcheck v0.4.6
          findbugs v3.1.0-RC1
          javac https://builds.apache.org/job/PreCommit-HDFS-Build/21996/artifact/out/diff-compile-javac-root.txt
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/21996/artifact/out/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/21996/artifact/out/whitespace-tabs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/21996/artifact/out/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/21996/testReport/
          Max. process+thread count 1333 (vs. ulimit of 5000)
          modules C: hadoop-project hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-ozone . hadoop-client-modules/hadoop-client-check-test-invariants hadoop-client-modules/hadoop-client-minicluster hadoop-dist hadoop-tools hadoop-tools/hadoop-tools-dist U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/21996/console
          Powered by Apache Yetus 0.7.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          msingh Mukul Kumar Singh added a comment -

          Patch v6 fixes the checkstyle/findbugs/whitespace issues in the HDFS-7240 branch.

          hadoopqa Hadoop QA added a comment -

          A patch to the testing environment has been detected.
          Re-executing against the patched versions to perform further tests.
          The console is at https://builds.apache.org/job/PreCommit-HDFS-Build/22054/console in case of problems.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 14m 59s Docker mode activated.
                Prechecks
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 137 new or modified test files.
                trunk Compile Tests
          0 mvndep 1m 38s Maven dependency ordering for branch
          +1 mvninstall 17m 4s trunk passed
          +1 compile 14m 16s trunk passed
          +1 checkstyle 2m 34s trunk passed
          +1 mvnsite 13m 46s trunk passed
          +1 shadedclient 10m 11s branch has no errors when building and testing our client artifacts.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-tools/hadoop-tools-dist hadoop-tools hadoop-client-modules/hadoop-client-minicluster hadoop-client-modules/hadoop-client-check-test-invariants hadoop-dist .
          +1 findbugs 6m 49s trunk passed
          +1 javadoc 8m 1s trunk passed
                Patch Compile Tests
          0 mvndep 0m 33s Maven dependency ordering for patch
          +1 mvninstall 48m 25s the patch passed
          +1 compile 24m 58s the patch passed
          +1 cc 24m 58s the patch passed
          -1 javac 24m 58s root generated 26 new + 1208 unchanged - 26 fixed = 1234 total (was 1234)
          -0 checkstyle 4m 33s root: The patch generated 16 new + 1399 unchanged - 19 fixed = 1415 total (was 1418)
          -1 mvnsite 0m 44s root in the patch failed.
          +1 shellcheck 0m 27s There were no new shellcheck issues.
          -0 shelldocs 0m 26s The patch generated 260 new + 100 unchanged - 4 fixed = 360 total (was 104)
          -1 whitespace 0m 2s The patch 1 line(s) with tabs.
          +1 xml 0m 43s The patch has no ill-formed XML file.
          -1 shadedclient 14m 5s patch has errors when building and testing our client artifacts.
          0 findbugs 0m 0s Skipped patched modules with no Java source: hadoop-project hadoop-tools/hadoop-tools-dist hadoop-tools hadoop-client-modules/hadoop-client-minicluster hadoop-client-modules/hadoop-client-check-test-invariants hadoop-dist .
          +1 findbugs 7m 10s the patch passed
          +1 javadoc 5m 56s the patch passed
                Other Tests
          -1 unit 14m 45s root in the patch failed.
          +1 asflicense 0m 32s The patch does not generate ASF License warnings.
          216m 44s



          Reason Tests
          Failed junit tests hadoop.security.TestRaceWhenRelogin



          Subsystem Report/Notes
          Docker Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:5b98639
          JIRA Issue HDFS-7240
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12897244/HDFS-7240.006.patch
          Optional Tests asflicense shellcheck shelldocs mvnsite unit shadedclient compile javac javadoc mvninstall xml findbugs checkstyle cc
          uname Linux eef8a1dbc8c3 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 11:55:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/patchprocess/precommit/personality/provided.sh
          git revision trunk / ff9f7fc
          maven version: Apache Maven 3.3.9
          Default Java 1.8.0_151
          shellcheck v0.4.6
          findbugs v3.1.0-RC1
          javac https://builds.apache.org/job/PreCommit-HDFS-Build/22054/artifact/out/diff-compile-javac-root.txt
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/22054/artifact/out/diff-checkstyle-root.txt
          mvnsite https://builds.apache.org/job/PreCommit-HDFS-Build/22054/artifact/out/patch-mvnsite-root.txt
          shelldocs https://builds.apache.org/job/PreCommit-HDFS-Build/22054/artifact/out/diff-patch-shelldocs.txt
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/22054/artifact/out/whitespace-tabs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/22054/artifact/out/patch-unit-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/22054/testReport/
          Max. process+thread count 1334 (vs. ulimit of 5000)
          modules C: hadoop-project hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-ozone hadoop-tools/hadoop-tools-dist hadoop-tools hadoop-client-modules/hadoop-client-minicluster hadoop-client-modules/hadoop-client-check-test-invariants hadoop-dist . U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/22054/console
          Powered by Apache Yetus 0.7.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          shv Konstantin Shvachko added a comment -

          We had an F2F meeting with the Ozone authors. Anu is publishing his notes. The focus was on the following issues:

          • Should Ozone be a part of HDFS or a separate project?
          • How can Ozone help address RPC scalability and performance?
          • Can Ozone be used as a block management layer for HDFS?
          • Migration to Ozone from HDFS

          Ozone as a block management layer

          I think we made pretty good progress in understanding the role of Ozone and the future of HDFS.
          On large production Hadoop clusters, such as LinkedIn's and others tracked via multiple publications, we see that:

          1. We read 90% of the data that we write; there is no cold metadata.
          2. RPC load on the NameNode increases in proportion to the growth of storage, which is exponential.

          Thus, the idea of a NameNode with a partial namespace in memory does not fully solve these growth problems, because (a) it is still limited by single-NN performance, and (b) we will still have to provision the NN to keep most of the namespace in memory.

          We came to the following high-level roadmap for evolving HDFS:

          1. A NameNode with block management delegated to the Ozone layer. There is a prototype of such a NN, which is believed to show a 30-50% performance improvement. A POC would be good.
          2. A single NameNode with the namespace implemented as a KV-collection. The KV-collection is partitionable in memory, which allows breaking the single-lock restriction of the current NN (a minimal sketch of this idea follows the list). Performance gains have not been measured yet.
          3. Split the KV-namespace into two or more physical NNs.
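
          To make item 2 more concrete, here is a minimal, hypothetical sketch of a namespace kept as an in-memory KV-collection with one lock per partition, so operations on different partitions no longer contend on a single global lock. This is an illustration only, not the actual NameNode design; the class and method names are invented for the example.

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Hypothetical sketch: a namespace stored as a KV-collection that is
 * partitioned in memory, with one lock per partition instead of a single
 * global namespace lock. Illustrative only, not the actual NN design.
 */
public class PartitionedKVNamespace {
  private final int partitions;
  private final ConcurrentHashMap<String, byte[]>[] maps; // path -> serialized inode metadata
  private final ReentrantReadWriteLock[] locks;

  @SuppressWarnings("unchecked")
  public PartitionedKVNamespace(int partitions) {
    this.partitions = partitions;
    this.maps = new ConcurrentHashMap[partitions];
    this.locks = new ReentrantReadWriteLock[partitions];
    for (int i = 0; i < partitions; i++) {
      maps[i] = new ConcurrentHashMap<>();
      locks[i] = new ReentrantReadWriteLock();
    }
  }

  private int partitionOf(String path) {
    // Simple hash partitioning; a real design might partition by subtree or key range.
    return (path.hashCode() & Integer.MAX_VALUE) % partitions;
  }

  public void put(String path, byte[] inode) {
    int p = partitionOf(path);
    locks[p].writeLock().lock(); // only this partition is blocked, not the whole namespace
    try {
      maps[p].put(path, inode);
    } finally {
      locks[p].writeLock().unlock();
    }
  }

  public byte[] get(String path) {
    int p = partitionOf(path);
    locks[p].readLock().lock();
    try {
      return maps[p].get(path);
    } finally {
      locks[p].readLock().unlock();
    }
  }
}
{code}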

          An important requirement: we should provide a no-data-copy migration of clusters along the entire transformation.
          It is not feasible to DistCp a 100 PB cluster, for example, since it requires prolonged downtime and is expensive: it doubles the amount of hardware involved.
          Thus, an upgrade should keep the data blocks on the same DataNodes, and may need to provide an offline tool to convert the metadata (fsimage) to the new format.

          There is a lot to design here, but it looks to me like a gradual path from the current single NN to a distributed namespace architecture. So if people agree with the direction in general, I'll be glad to create a wiki page describing this intention so that folks can comment and discuss.
          Could the Ozone authors (Anu Engineer, Jitendra Nath Pandey, Sanjay Radia) please confirm our common understanding of the roadmap?

          Merging Ozone to HDFS

          There are pros and cons to merging Ozone into Hadoop vs a separate project. The pros include (please expand):

          • Code sharing
          • Ozone should improve DataNode pipeline code
          • Better testing for Ozone within Hadoop

          Some cons:

          • As part of HDFS it will need to support standard HDFS features like security, snapshots, erasure coding, etc., whereas as a separate project it could implement them later
          • As a separate project Ozone can benefit from frequent release cycles
          • Bugs in Ozone can affect HDFS and vice versa
          • Incompatible changes may be allowed in Ozone in its early stages, but are not allowed in Hadoop
          • Rolling upgrades are required for HDFS, which may not be possible for Ozone

          The roadmap above positions Ozone as a step toward a partitioned NameNode, which solves both the RPC scalability and the cluster growth problems for big Hadoop installations. This validates merging Ozone into Hadoop for me. Given the cons, though, I'm not sure when the right time is. I think we should at least have a design doc for security before merging, in order to avoid API changes.

          ywskycn Wei Yan added a comment -

          Thanks Konstantin Shvachko for the detailed notes.

          I have a quick question there:

          2. A single NameNode with namespace implemented as KV-collection. The KV-collection is partitionable in memory, which allows breaking the single lock restriction of current NN. Performance gains not measured yet.
          3. Split the KV-namespace into two or more physical NNs.

          How does this align with the router-based federation HDFS-10467?

          anu Anu Engineer added a comment -

          Konstantin Shvachko, Thanks for the write-up.

          Anu is publishing his notes.

          I have attached the meeting notes to this JIRA.

          Could Ozone authors (Anu Engineer, Jitendra Nath Pandey, Sanjay Radia) please confirm our common understanding of the roadmap.

          Confirmed. My notes pretty much echo yours.

          jnp Jitendra Nath Pandey added a comment -

          Thanks for posting the meeting minutes, Konstantin Shvachko and Anu Engineer. It is great to see alignment in the roadmap.

          pono Daniel Takamori added a comment - - edited

          This message is being proxy-posted by an Infra admin because it contains banned strings that trigger our spam filter.

          Before we send out the [Vote] thread for Ozone, I propose that we hold two community meetings. This allows us to address any questions or issues over a high-bandwidth medium.
          Since many contributors/committers of Ozone are spread across the world, the first meeting is friendly toward Europe/Asia time zones. I propose the following time for the meeting.

          || Location || Local Time || Time zone || UTC Offset ||
          | Seattle / San Jose | Thursday, November 16, 2017 at 1:00:00 am (night) | PST | UTC-8 hours |
          | London | Thursday, November 16, 2017 at 9:00:00 am | GMT | UTC |
          | Budapest/Brussels | Thursday, November 16, 2017 at 10:00:00 am | CET | UTC+1 hour |
          | Bangalore | Thursday, November 16, 2017 at 2:30:00 pm | IST | UTC+5:30 hours |
          | Shanghai | Thursday, November 16, 2017 at 5:00:00 pm | CST | UTC+8 hours |

          I propose that we have a follow-up meeting targeting the Americas time zones, and I propose Friday, November 17, 2017, at 4:00 pm PST for that meeting.

          Here is the meeting info:

          
          

          Topic: Ozone Merge meeting
          Time: Nov 16, 2017 1:00 AM Pacific Time (US and Canada)

          Join from PC, Mac, Linux, iOS or Android: https://hortonworks.zoom.us/j/5451676776
          Or join by phone:

              +1 646 558 8656 (US Toll) or +1 669 900 6833 (US Toll)
              +1 877 369 0926 (US Toll Free)
              +1 877 853 5247  (US Toll Free)
              Meeting ID: 545 167 6776
              International numbers available: https://hortonworks.zoom.us/zoomconference?m=rYZYSAOVLYtFE6wkwIrjJeqO3CP_I6ij

          anu Anu Engineer added a comment -

          Daniel Takamori, thank you very much for your help. I really appreciate it.

          My apologies for not posting this meeting message earlier. I had been trying for a while to post it, but apparently if you try to post a phone number, you get banned from JIRA. I did not know that and was without JIRA access for a while. Thanks to Daniel Takamori for using his admin superpowers and helping me post the meeting invite.

          eddyxu Lei (Eddy) Xu added a comment -

          Anu Engineer, Daniel Takamori, thanks for posting the meeting details. One question: does the Americas time zone meeting (Friday, November 17th, 4 PM) have the same dial-in numbers and zoom.us URL? We'd like to join that one.

          anu Anu Engineer added a comment -

          Lei (Eddy) Xu, yes, it is the same. Thanks for checking.

          shv Konstantin Shvachko added a comment - - edited

          How does this align with the router-based federation HDFS-10467?

          Hey Wei Yan, router-based federation (in fact, all federation approaches) is orthogonal to a distributed NN. One should be able to run RBF over multiple HDFS clusters, potentially running different versions.

          shv Konstantin Shvachko added a comment -

          Thanks for organizing community meeting(s). Hope there will be a deep-dive into Ozone impl, as it may take a long time to go through the code on your own.
          Would be good to give people some time to review the code before starting the vote.

          Anything on Ozone security design?

          anu Anu Engineer added a comment -

          Thanks for organizing community meeting(s). Hope there will be a deep-dive into Ozone impl, as it may take a long time to go through the code on your own.

          I will be happy to do it.

          Anything on Ozone security design?

          We are working on a design and will post it soon.

          andrew.wang Andrew Wang added a comment -

          Some Hortonworkers and Clouderans met yesterday; here are my meeting notes. I wanted to get them up before the broader meeting today. I already sent these around to the attendees, but please comment if I got anything incorrect.

          Attendees: ATM, Andrew, Anu, Aaron Fabbri, Jitendra, Sanjay, other listeners on the phone

          High-level questions raised:

          • Wouldn't Ozone be better off as a separate project?
          • Why should it be merged now?

          Things we agree on:

          • We're all on Team Ozone, and applaud any effort to address scaling HDFS.
          • There are benefits to Ozone being a separate project. Can release faster, iterate more quickly on feedback, and mature without having to worry about features like high-availability, security, encryption, etc. that not all customers need.
          • No agreement on whether the benefits of separation outweigh the downsides.

          Discussion:

          • Anu: Don't want to have this separate since it confuses people about the long-term vision of Ozone. It's intended as block management for HDFS.
          • Andrew: In its current state, Ozone cannot be plugged into the NN as the BM layer, so it seems premature to merge. Can't benefit existing users, and they can't test it.
          • Response: The Ozone block layer is at a good integration point, and we want to move on to the NameNode changes like splitting the FSN/BM lock.
          • Andrew: We can do the FSN/BM lock split without merging Ozone. Separate efforts. This lock split is also a major effort by itself, and is a dangerous change. It's something that should be baked in production.
          • Sanjay: Ozone developers "willing to take the hit" of the slow Hadoop release cadence. Want to make this part of HDFS since it's easier for users to test and consume without installing a new cluster.
          • ATM: Can still share the same hardware, and run the Ozone daemons alongside.
          • Sanjay: Want to keep Ozone block management inside the Datanode process to enable a fast-copy between HDFS and Ozone. Not all data needs all the HDFS features like encryption, erasure coding, etc, and this data could be stored in Ozone.
          • Andrew: This fast-copy hasn't been implemented or discussed yet. Unclear if it'll work at all with existing HDFS block management. Won't work with encryption or erasure coding. Not clear whether it requires being in the same DN process even.
          • Sanjay/Anu: Ozone is also useful to test with just the key-value interface. It's a Hadoop-compatible FileSystem, so apps that work on S3 will work on Ozone too.
          • Andrew: If it provides a new API and doesn't support the HDFS feature-set, doesn't this support it being its own project?

          Summary

          • No consensus on the high-level questions raised
          • Ozone could be its own project and integrated later, or remain on an HDFS branch
          • Without the FSN/BM lock split, it can't serve as the block management layer for HDFS
          • Without fast copy, there's no need for it to be part of the DataNode process, and it might not need to be in the same process anyway.
          anu Anu Engineer added a comment -

          Ozone - First community meeting

          Time: Thursday, November 16, 2017, at 1:00:00 am PST
          Participants: Anu Engineer, Mukul Kumar Singh, Nandakumar Vadivelu, Weiwei Yang, Steve Loughran, Thomas Demoor, Shashikant Banerjee, Lokesh Jain

          We discussed quite a large number of technical issues at this meeting.

          We went over how Ozone works, the namespace architecture of KSM, and how it interacts with SCM. We traced both the write I/O path and the read I/O path.

          There was some discussion of the REST protocol and of making sure that it is good enough to support Hadoop-based workloads. We looked at the various REST APIs of Ozone and also discussed the O3 FileSystem working over RPC instead of the REST protocol. This is a work in progress.
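
          As a rough illustration of the client side of this, the sketch below uses the generic Hadoop FileSystem API against an Ozone path, so only the URI selects the backend. The "o3fs" scheme and the bucket.volume URI layout used here are assumptions for the example and may not match the actual Ozone client bindings.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch: accessing an object store through the generic Hadoop FileSystem API.
 * The "o3fs" scheme and the bucket.volume URI layout are assumptions for this
 * example; the real Ozone FileSystem binding may use a different scheme/layout.
 */
public class OzoneFsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical URI; whichever FileSystem implementation is registered for
    // this scheme handles the call, so application code stays unchanged.
    Path path = new Path("o3fs://bucket.volume/dir/key1");
    FileSystem fs = path.getFileSystem(conf);
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeUTF("hello ozone");
    }
    System.out.println("exists: " + fs.exists(path));
  }
}
{code}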

          Steve Loughran suggested that we add Storm to the applications that are tested against Ozone. Currently we use Hive, Spark, and YARN as the applications to test against Ozone. We will add Storm to this testing mix.

          We discussed performance and scale testing; Ozone has been tested with millions of keys. We have also tested with cluster sizes of up to 300 nodes.

          Steve suggested that we upgrade the Ratis version and lock that down before the merge.

          Thomas Demoor pointed out the difference between the commit ordering of S3 and Ozone: Ozone uses the actual commit time to decide the key ordering, while S3 uses the key creation time. He also mentioned that this should not matter in the real world, as he is not aware of any hard-coded dependency on commit ordering.
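
          A toy sketch of that ordering difference, purely for illustration (the KeyRecord class and the timestamps below are invented, not Ozone or S3 code): a key that is created first but committed last sorts differently under the two rules.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy illustration of commit-time vs creation-time key ordering. */
public class KeyOrderingSketch {
  static class KeyRecord {
    final String name;
    final long createTimeMs; // when the client started writing the key
    final long commitTimeMs; // when the write completed (committed)
    KeyRecord(String name, long createTimeMs, long commitTimeMs) {
      this.name = name;
      this.createTimeMs = createTimeMs;
      this.commitTimeMs = commitTimeMs;
    }
    @Override
    public String toString() { return name; }
  }

  public static void main(String[] args) {
    List<KeyRecord> keys = new ArrayList<>();
    // "a" was created first but committed last, e.g. a slow upload.
    keys.add(new KeyRecord("a", 100, 500));
    keys.add(new KeyRecord("b", 200, 300));

    keys.sort(Comparator.comparingLong(k -> k.createTimeMs));
    System.out.println("creation-time order (S3-style):  " + keys); // [a, b]

    keys.sort(Comparator.comparingLong(k -> k.commitTimeMs));
    System.out.println("commit-time order (Ozone-style): " + keys); // [b, a]
  }
}
{code}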

          sanjay.radia Sanjay Radia added a comment -

          Ozone Cloudera Meeting Date: Thursday, November 16th 2017
          Location: online conferencing

          Attendees: ATM, Andrew, Anu, Aaron Fabbri, Jitendra, Sanjay, Sean Mackrory, other listeners on the phone

          Main discussion centered around:

          • Wouldn't Ozone be better off as a separate project?
          • Why should it be merged now?

          Discussion: (This incorporates Andrew’s minutes and adds to them.)

          • Anu: Don't want to have this separate since it confuses people about the long-term vision of Ozone. It's intended as block management for HDFS.
          • Andrew: In its current state, Ozone cannot be plugged into the NN as the BM layer, so it seems premature to merge. Can't benefit existing users, and they can't test it.
          • Response: The Ozone block layer is at a good integration point, and we want to move on with the NameNode integration, using it as the new block layer. The benefits via the KV namespace/FileSystem API are already there and completely usable for Hive and Spark apps.
          • Andrew: We can do the FSN/BM lock split without merging Ozone. Separate efforts. This lock split is also a major effort by itself, and is a dangerous change. It's something that should be baked in production.
          • Sanjay: Agree that the lock split should be done in a branch, but disagree on how hard it will be. The split was hard in the past but will be easier with the new block layer: one of the key reasons for the coupling of the block layer to the namespace layer is that the block length of each replica at block close time, especially under failures, has to be consistent. This is done in the central NN today (due to the lack of a Raft/Paxos-like protocol in the original block layer). The block-container layer uses Raft for consistency and no longer needs a central agent like the NN. The new block layer's built-in consistent state management simplifies the separation (see the sketch after this list).
          • Sanjay: Ozone developers "willing to take the hit" of the slow Hadoop release cadence. Want to make this part of HDFS since it's easier for users to test and consume without installing a new cluster.
          • ATM: Can still share the same hardware, and run the Ozone daemons alongside.
          • Sanjay countered this <see summary section to avoid repetition>
          • Sanjay: Want to keep Ozone block management inside the DataNode process to enable various synergies, such as sharing the new Netty-based protocol engine or fast-copy between HDFS and Ozone. Not all data needs all the HDFS features like encryption, erasure coding, etc., and such data could be stored in Ozone.
          • Andrew: This fast-copy hasn't been implemented or discussed yet. Unclear if it'll work at all with existing HDFS block management. Won't work with encryption or erasure coding. Not clear whether it requires being in the same DN process even.
          • It does not have to work with encryption and EC to give value. It can work with non-encrypted and non-EC data, which are the majority of blocks in most Hadoop clusters. We will provide a design for the shallow copy.
          • Sanjay/Anu: Ozone is also useful to test with just the key-value interface. It's a Hadoop-compatible FileSystem, so many apps such as Hive and Spark can also work on Ozone, since they have ensured that they work well on a flat KV namespace.
          • Andrew: If it provides a new API and doesn't support the HDFS feature-set, doesn't this support it being its own project?
          • Sanjay: It provides the EXISTING Hadoop FileSystem interface now. Note that customers are used to having different parts of the namespace(s) with different features: customers have asked for zones with different features enabled [see summary, to avoid duplication].
          • AaronF: Ozone is a lot of new code and Hadoop already has so much code. It is better to have separate projects and not add to Hadoop/HDFS.
          • Sanjay: Agree it is a lot of code. Sometimes we have to add significant new code for a project to move forward. We have tried to incrementally work around HDFS scaling, the NN's manageability, and slow startup issues. This new code base fundamentally moves us forward in addressing these long-standing issues. Besides, this "lots of new code" argument could be used later to prevent a merge of the projects.
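
          To illustrate the block-close consistency point made above, here is a sketch under stated assumptions (illustrative pseudocode only, not Ratis or Ozone code): if writes and close are entries in a single replicated, Raft-ordered log, every replica applies them in the same order and therefore agrees on the final length at close time, with no central NameNode arbitrating it.

{code:java}
import java.util.ArrayList;
import java.util.List;

/**
 * Toy sketch: replicas as deterministic state machines over a shared,
 * consensus-ordered log agree on the committed length at close time.
 * Illustrative only; not Ratis, not the Ozone container protocol.
 */
public class ReplicatedCloseSketch {
  interface Entry {}
  static class Write implements Entry { final int numBytes; Write(int n) { numBytes = n; } }
  static class Close implements Entry {}

  static class Replica {
    long committedLength = 0;
    boolean closed = false;
    void apply(Entry e) {
      if (closed) return; // entries ordered after close are ignored
      if (e instanceof Write) committedLength += ((Write) e).numBytes;
      else if (e instanceof Close) closed = true;
    }
  }

  public static void main(String[] args) {
    // The consensus layer (e.g. Raft) guarantees all replicas see the same ordered log.
    List<Entry> log = new ArrayList<>();
    log.add(new Write(64));
    log.add(new Write(32));
    log.add(new Close());
    log.add(new Write(16)); // a straggling write that lost the race with close

    Replica r1 = new Replica(), r2 = new Replica(), r3 = new Replica();
    for (Entry e : log) { r1.apply(e); r2.apply(e); r3.apply(e); }

    // All replicas agree on the length at close time: 96, not 112.
    System.out.println(r1.committedLength + " " + r2.committedLength + " " + r3.committedLength);
  }
}
{code}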

          Summary:
          There is agreement that the new block-container layer is a good way to solve the block scaling issue of HDFS. There is no consensus on merging the branch in vs. forking Ozone into a new project. The main objection to merging into HDFS is that integrating the new block-container layer with the existing NN will be a very hard project, since the lock split in the NN is very challenging.

          Cloudera team’s perspective (taken from Andrew’s minutes):

          • Ozone could be its own project and integrated later, or remain on an HDFS branch. There are benefits to Ozone being a separate project. Can release faster, iterate more quickly on feedback, and mature without having to worry about features like high-availability, security, encryption, etc. that not all customers need.
          • Without the FSN/BM lock split, it can't serve as the block management layer for HDFS
          • Without fast copy, there's no need for it to be part of the DataNode process, and it might not need to be in the same process anyway.

          Hortonworks HDFS team’s perspective:

          • Keeping Ozone inside HDFS makes sense, as its block layer is designed to address core HDFS issues; the KV namespace is simply an intermediate step to stabilize it.
          • Based on feedback from Konstantin and Chris Douglas, we have shown how the NN can plug into the new block layer (including getting some namespace relief), which can then be evolved.
          • Saying that Ozone must be feature-compatible before it can be merged in is not a reasonable argument. HDFS allows multiple namespaces (federation within the same cluster, with HDFS and non-HDFS namespaces) and also multiple zones within a namespace with different features (encryption zones, EC zones); customers have asked to use different zones for such features. The KV namespace fits right into this model until a full NN is plugged in. It can be used via the FileSystem or FileContext API.
          • We gave multiple reasons (not just one) to keep them together:
            • Sharing the DN daemon,
            • sharing the new Netty-based protocol engine in the DNs (a long-time sore point for the HDFS block layer),
            • shallow data copy is practical only within the same project and daemon; otherwise we have to deal with security settings and coordination across daemons,
            • sharing disk scheduling,
            • as customers test the new block layer, they want to share the same cluster and DNs, i.e. the storage and daemons.
          • Merging later will be harder. The code and goals will diverge further. We will take on extra work to split and then extra work to merge. The opponents will raise the same issues as today: show feature parity in both file systems so that one can completely replace the other, or that this is a lot of new code and HDFS is already very big, or that a new NN can plug into a separate project's block layer (it's all software, so anything is possible) and hence it should be kept separate because the objectors do not fully buy into the benefits.
          Show
          sanjay.radia Sanjay Radia added a comment - Ozone Cloudera Meeting Date: Thursday, November 16th 2017 Location: online conferencing Attendees: ATM, Andrew, Anu, Aaron Fabbri, Jitendra, Sanjay, Sean Mackrory, other listeners on the phone Main discussion centered around: Wouldn't Ozone be better off as a separate project? Why should it be merged now? Discussion: (This incorporate Andrew’s minutes and adds to it.) Anu: Don't want to have this separate since it confuses people about the long-term vision of Ozone. It's intended as block management for HDFS. Andrew: In its current state, Ozone cannot be plugged into the NN as the BM layer, so it seems premature to merge. Can't benefit existing users, and they can't test it. Response: The Ozone block layer is at a good integration point, and we want to move on with the NameNode integration as new block layer. Benefits via KV namespace/FileSystemAPI is there and completely usable for Hive and Spark apps. Andrew: We can do the FSN/BM lock split without merging Ozone. Separate efforts. This lock split is also a major effort by itself, and is a dangerous change. It's something that should be baked in production. Sanjay: Agree that the lock split should be done in branch. But disagree on how hard it will be. The split was hard in past but will be easier with new block layer: one of the key reasons for the coupling of Block-layer to Namespace layer is that the block length of the each replica at block close time, esp under failures, has to be consistent. This is done in the central NN today (due to lack of raft/paxos like protocol in the original block layer). The block-container layer uses raft for consistency and no longer needs a central agent like the NN. Then new block-layers built-in consistent state management simplifies the separation. Sanjay: Ozone developers "willing to take the hit" of the slow Hadoop release cadence. Want to make this part of HDFS since it's easier for users to test and consume without installing a new cluster. ATM: Can still share the same hardware, and run the Ozone daemons alongside. Sanjay countered this <see summary section to avoid repetition> Sanjay: Want to keep Ozone block management inside the Datanode process to enable various synergies such as sharing the new netty based protocol engy or fast-copy between HDFS and Ozone. Not all data needs all the HDFS features like encryption, erasure coding, etc, and this data could be stored in Ozone. Andrew: This fast-copy hasn't been implemented or discussed yet. Unclear if it'll work at all with existing HDFS block management. Won't work with encryption or erasure coding. Not clear whether it requires being in the same DN process even. It does have to work with encryption and EC to give value. It can work with non-encrypted and non EC which are majority of blocks in most Hadoop clusters. We will provide a design of the shallow-copy. Sanjay/Anu: Ozone is also useful to test with just the key-value interface. It's a Hadoop-compatible FileSystem, so apps many apps such as Hive and Spark can work also on Ozone since they have or ensured that they work well on KV flat namespace. Andrew: If it provides a new API and doesn't support the HDFS feature-set, doesn't this support it being its own project? Sanjay - It provides the EXISTING Hadoop FileSystem interface now. Note customers are used to have different parts of the namespace(s) having different features: Customers have asked for Zones with different features enabled [ see summary - to avoid duplication]. 
AaronF: Ozone is a lot of new code and Hadoop already has so much code.. It is better to have separate projects and not add to Hadoop/HDFS. Sanjay: Agree it is lot of code. Sometimes, we often have to add siginficant new code for a project move forward. We have tried to incrementally work around HDFS Scaling, the NN’s manageability and slow startup issues. This new code base fundamentally moves us forward in addressing the long standing issues. Besides this “lots of new code” argument can be used later to prevent the merge of the projects. Summary: There is agreement that the new block-container layer is a good way to solve the block scaling issue of HDFS. There is no consensus on merging the branch in vs fork Ozone into a new project. The main objection to merging into HDFS is that integrating the new block-container layer with the exiting NN will be a very hard project since the lock split in the NN is very challenging. Cloudera’s team perspective: (taken from Anderew’s minutes) Ozone could be its own project and integrated later, or remain on an HDFS branch. There are benefits to Ozone being a separate project. Can release faster, iterate more quickly on feedback, and mature without having to worry about features like high-availability, security, encryption, etc. that not all customers need. Without the FSN/BM lock split, it can't serve as the block management layer for HDFS Without fast copy, there's no need for the to be part of the DataNode process, and it might not need to be in the same process anyway. Hortonworks HDFS team’s perspective: Keeping Ozone inside HDFS makes sense as it’s block layer is designed to address core HDFS issues. the KV namespace is simply an intermediate step to stabilize. Based on feedback from Konstantine and Chris Douglas we have show how the NN plugged into the new block layer (including get some namespace relief) which can then be evolved. Saying that Ozone must be feature compatible before it can be merged in is not a reasonable argument. HDFS allows multiple namespace (federation within same cluster, to HDFS and non-HDFS namespace) and also multiple zone within a namespace of different features (Encryption zone, EC zone); customer have asked for an use different zones for such features.. The KV namespace fits right into this model till a full NN is plugged in. It can be used via FileSystem or File Context API. We gave multiple reasons (not just one) to keep together: Sharing DN daemon, sharing the new netty based protocol engine in DNs (a long time sore issue for HDFS block layer), shallow data copy is practical only if within same project and daemon otherwise have deal with security setting and coordinations across daemons. share disk scheduling. As customers test the new block layer they want to share the same cluster and DNs the storage and daemons. Merge later will be harder. The code and goals will diverge further. We will taking on extra work to split and then take extra work to merge. The opponents will raise the same issue as today: show feature parity in both file system so that one can completely replace the other, or that this is a lot of new code and HDFS is already very big, or a new NN can plug to a separate project’s block layer (its all software so anything is possible) and hence keep it separate because the objector’s does not fully buy into the benefits.
          fabbri Aaron Fabbri added a comment -

          Thanks for taking notes, and thank you for the lively discussion. A lot of valid concerns on all sides here. A couple of minor corrections:

          AaronF: Ozone is a lot of new code and Hadoop already has so much code.

          My concerns are particularly around things that affect stability and operability. I'm not concerned about lines of code, but about how manageable the codebase is in terms of getting a stable release and supporting it in production.

          shallow data copy, which is practical only within the same project and daemon, otherwise we have to deal with security settings and coordination across daemons

          We can factor common code, if any, to a shared dependency. I don't see how which repository code lives in really affects fast copy between storage systems. I can think of ways to do it both within a JVM process consisting of code from multiple git repositories, and via IPC (hand off ownership of a file to another process--not even talking about fancy stuff like shmem).

          The opponents will raise the same issues as today: show feature parity

          I get your concern, but I didn't hear anyone say feature parity. I only heard "integrate with HDFS". Even integrated with HDFS, there is still a high bar of "utility" to pass, IMO, to justify a very large patch which affects production code.

          We all want stable, scalable HDFS. Nobody opposes that ideal.

          I'm not sure trying to evolve HDFS to scale is a better approach than being separate with maybe some shared, well-factored dependencies. The latter, IMO, could result in better code and dramatically less risk to HDFS.

          Appreciate all your hard work thus far and appreciate the challenges you guys face here. I hope you can understand my perspective as well.

          anu Anu Engineer added a comment -

          Ozone - Second community meeting

          Time: Friday, November 17, 2017, at 4:00 pm PST

          Participants: Arpit Agarwal, Robert Boyd, Wei Chiu, Marton Elek, Anu Engineer, Aaron Fabbri, Manoj Govindassamy, Virajith Jalaparti, Aaron Myers, Jitendra Pandey, Sanjay Radia, Chao Sun, Bharat Viswanadham, Andrew Wang, Lei (Eddy) Xu, Wei Yan, Xiaoyu Yao. [Apologies to anyone that I might have missed who joined over phone]

          We started the meeting by discussing the notions of Ozone's block storage layer, followed by a deep dive into the code.
          We discussed the notions of the block layer, which is similar to the HDFS block layer, Ozone's container layer, and how replication works via pipelines. Then we did a code walk-through of the Ozone codebase, starting with KSM, SCM, the container layer and the REST handler.

          We had some technical questions about containers: is the unit of replication the container, and can we truncate or delete a block that is already part of a container, say block three inside a container? Both of these were answered in the affirmative: the unit of replication is indeed the container, and block three can be removed inside the container without any issues.
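          To make that answer concrete, here is a toy, illustrative sketch (not the actual container implementation; the class and method names are made up) of a container as a local index from block IDs to chunk files: the container as a whole is what SCM replicates across datanodes, while an individual block such as block three can be dropped from the index on its own.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy model of a container: many blocks, replicated as one unit.
    class ToyContainer {
      private final long containerId;
      private final Map<Long, List<String>> blockIndex = new HashMap<>();

      ToyContainer(long containerId) { this.containerId = containerId; }

      long id() { return containerId; }

      void putBlock(long blockId, List<String> chunkFiles) {
        blockIndex.put(blockId, chunkFiles);
      }

      // Removing one block only edits this container's local index; the
      // container itself is still the unit that gets replicated.
      boolean removeBlock(long blockId) {
        return blockIndex.remove(blockId) != null;
      }

      int blockCount() { return blockIndex.size(); }
    }

    public class ContainerSketch {
      public static void main(String[] args) {
        ToyContainer c = new ToyContainer(42L);
        c.putBlock(1L, Arrays.asList("chunk_1_0"));
        c.putBlock(2L, Arrays.asList("chunk_2_0"));
        c.putBlock(3L, Arrays.asList("chunk_3_0", "chunk_3_1"));
        System.out.println("container " + c.id()
            + " removed block 3: " + c.removeBlock(3L));   // true
        System.out.println("blocks left: " + c.blockCount()); // 2
      }
    }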

          Once we finished the technical discussion, we discussed some of the merge issues; essentially the question was whether we should postpone the merge of Ozone into HDFS.

          • Andrew Wang wanted to know how this would benefit enterprise customers.
            • It was pointed out that customers can use the storage via a Hadoop-compatible file system (FileSystem or FileContext) and, more importantly, apps such as Hive and Spark that use those APIs will work (we are testing Hive and Spark). In fact, all the data in Ozone is expected to come via Hive, YARN, Spark, etc. Making Ozone work seamlessly via such Hadoop frameworks is very important because it enables real customer use.
          • ATM objected to the Ozone merge, as he wanted to see the new block layer integrated with the existing NN. He argued that creating the block layer is just the first phase, and the separation of the block layer inside the Namenode still needs to be done. He further argued that we should merge only after the Namenode block separation is completely done.
            • Sanjay countered that a project of this size can only be implemented in phases. Fixing HDFS's scalability in a fundamental way requires fixing both the namespace layer and the block layer. We provide a simpler namespace (key-value) as an intermediate step to allow real customer usage via Spark and Hive, and also as a way of stabilizing the new block layer. This is a good consistency point from which to start working on integrating with the hierarchical namespace of the NN.
          • Aaron Fabbri was concerned that the code is new and may not be stable, and that the support load for HDFS is already quite high; this would further destabilize HDFS.
            • Sanjay's response: It was pointed out that the feature is not on by default and that the code is in a separate module. Indeed, new shareable parts like the new netty protocol engine in the DN will replace the old thread-based protocol engine only with the HDFS community's blessing, after it has been sufficiently tested via the Ozone path. Further, Ozone can be kept disabled if a customer so desires.
          • ATM's concern is that connecting the NN to the new block layer will require separating the FSN/BM lock (a good thing to do), which is very hard to do.
            • Sanjay's response: This issue was raised and explained at yesterday's meeting. A very strong coupling was added between the block layer and the namespace layer when we wrote the new block pipeline as part of the append work in 2010: the block length of each replica at finalization time, especially under failures, has to be consistent. This is done in the central NN today (due to the lack of a raft/paxos-like protocol in the original block layer). The new block-container layer uses Raft for consistency and no longer needs a central agent like the NN. Thus the new block-container layer's built-in consistent state management eliminates this coupling and hence simplifies the separation of the lock (a toy sketch of this idea follows the list).
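          The sketch below is purely illustrative (none of these class names exist in the patch) and assumes only the property described above: once a "finalize block B at length L" entry is committed through a replicated log, every replica applies it in the same order, so all replicas agree on the final length without a central NameNode arbitrating it.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // A committed log entry: "block B is finalized at length L".
    class FinalizeEntry {
      final long blockId;
      final long length;
      FinalizeEntry(long blockId, long length) {
        this.blockId = blockId;
        this.length = length;
      }
    }

    // One replica's state machine. Every replica applies the same committed
    // entries in the same order, so all of them record the same finalized
    // length; a replica that had locally written fewer bytes before a failure
    // simply reconciles to the agreed length.
    class ReplicaStateMachine {
      private final Map<Long, Long> finalizedLength = new HashMap<>();
      void apply(FinalizeEntry e) { finalizedLength.put(e.blockId, e.length); }
      Long lengthOf(long blockId) { return finalizedLength.get(blockId); }
    }

    public class ReplicatedFinalizeSketch {
      public static void main(String[] args) {
        List<ReplicaStateMachine> replicas = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
          replicas.add(new ReplicaStateMachine());
        }

        // In the real system this entry would be committed by a Raft quorum;
        // here we simply deliver it to every replica in log order to show
        // the effect.
        FinalizeEntry committed = new FinalizeEntry(3L, 4096L);
        for (ReplicaStateMachine r : replicas) {
          r.apply(committed);
        }

        // All replicas report the same length; no central arbiter needed.
        for (ReplicaStateMachine r : replicas) {
          System.out.println("finalized length of block 3: " + r.lengthOf(3L));
        }
      }
    }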

            People

            • Assignee: jnp Jitendra Nath Pandey
            • Reporter: jnp Jitendra Nath Pandey
            • Votes: 15
            • Watchers: 203
