Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      This jira proposes to add object store capabilities to HDFS.
      As part of the federation work (HDFS-1052), we separated block storage into a generic storage layer. Using the Block Pool abstraction, new kinds of namespaces can be built on top of the storage layer, i.e. the datanodes.
      In this jira I will explore building an object store using the datanode storage, but independent of the namespace metadata.

      I will soon update with a detailed design document.

      Attachments

      1. ozone_user_v0.pdf
        4.48 MB
        Anu Engineer
      2. Ozone-architecture-v1.pdf
        1.60 MB
        Jitendra Nath Pandey
      3. Ozonedesignupdate.pdf
        970 kB
        Anu Engineer

        Issue Links

        1.
        DataNode support for multiple datasets Sub-task Resolved Arpit Agarwal
         
        2.
        Ozone: Introduce STORAGE_CONTAINER_SERVICE as a new NodeType. Sub-task Resolved Arpit Agarwal
         
        3.
        Ozone: Refactor FsDatasetSpi to pull up HDFS-agnostic functionality into parent interface Sub-task Closed Arpit Agarwal
         
        4.
        OzoneHandler : Add Quota Support Sub-task Resolved Anu Engineer
         
        5.
        Create REST Interface for Volumes Sub-task Resolved Anu Engineer
         
        6.
        OzoneHandler : Add Error Table Sub-task Resolved Anu Engineer
         
        7.
        OzoneHandler: Add userAuth Interface and Simple userAuth handler Sub-task Resolved Anu Engineer
         
        8.
        Ozone: fix the missing entries in hdfs script Sub-task Resolved Chen Liang
         
        9.
        Ozone: TestBlockPoolManager fails in ozone branch. Sub-task Resolved Chen Liang
         
        10.
        Ozone:SCM: Add support for registerNode in datanode Sub-task Resolved Anu Engineer
         
        11.
        OzoneHandler : Support List and Info Volumes Sub-task Resolved Anu Engineer
         
        12.
        Ozone: StorageContainerManager fails to compile after merge of HDFS-10312 maxDataLength enforcement. Sub-task Resolved Chris Nauroth
         
        13.
        Ozone: Add Ozone Client lib for volume handling Sub-task Resolved Anu Engineer
         
        14.
        Ozone: Support starting StorageContainerManager as a daemon Sub-task Resolved Arpit Agarwal
         
        15.
        Ozone : move Chunk IO and container protocol calls to hdfs-client Sub-task Resolved Chen Liang
         
        16.
        Ozone : fix XceiverClient slow shutdown Sub-task Resolved Chen Liang
         
        17.
        Ozone : move ozone XceiverClient to hdfs-client Sub-task Resolved Chen Liang
         
        18.
        Ozone: SCM: Add datanode protocol Sub-task Resolved Anu Engineer
         
        19.
        Ozone: TestEndpoint task failure Sub-task Resolved Xiaoyu Yao
         
        20.
        Ozone:SCM: Add support for registerNode in SCM Sub-task Resolved Anu Engineer
         
        21.
        Ozone : add client-facing container APIs and container references Sub-task Resolved Chen Liang
         
        22.
        OzoneHandler : Add volume handler Sub-task Resolved Anu Engineer
         
        23.
        Ozone: SCM: Make SCM use container protocol Sub-task Resolved Anu Engineer
         
        24.
        DataNode should filter the set of NameSpaceInfos passed to Datasets Sub-task Closed Arpit Agarwal
         
        25.
        OzoneHandler : Add ACL support Sub-task Resolved Anu Engineer
         
        26.
        Move DatasetSpi to new package Sub-task Closed Arpit Agarwal
         
        27.
        OzoneHandler : Add Local StorageHandler support for volumes Sub-task Resolved Anu Engineer
         
        28.
        Ozone: Introduce KeyValueContainerDatasetSpi Sub-task Closed Arpit Agarwal
         
        29.
        OzoneHandler : Add Volume Interface to Data Node HTTP Server Sub-task Resolved Kanaka Kumar Avvaru
         
        30.
        OzoneHandler : Add common bucket objects Sub-task Resolved Anu Engineer
         
        31.
        OzoneHandler: Integration of REST interface and container data pipeline back-end Sub-task Resolved Chris Nauroth
         
        32.
        OzoneHandler : Add Bucket REST Interface Sub-task Resolved Anu Engineer
         
        33.
        Ozone: Add small file support RPC Sub-task Resolved Anu Engineer
         
        34.
        Ozone: Cleanup some dependencies Sub-task Resolved Anu Engineer
         
        35.
        Ozone: use containers with the state machine Sub-task Resolved Anu Engineer
         
        36.
        Ozone: Add allocateContainer RPC Sub-task Resolved Anu Engineer
         
        37.
        Ozone: Unify StorageContainerConfiguration with ozone-default.xml & ozone-site.xml Sub-task Resolved Kanaka Kumar Avvaru
         
        38.
        OzoneHandler : Enable stand-alone local testing mode Sub-task Resolved Anu Engineer
         
        39.
        OzoneHandler : Add localStorageHandler support for Buckets Sub-task Resolved Anu Engineer
         
        40.
        OzoneHandler : Enable MiniDFSCluster based testing for Ozone Sub-task Resolved Anu Engineer
         
        41.
        Ozone: Add container definitions Sub-task Resolved Anu Engineer
         
        42.
        Exclude Ozone protobuf-generated classes from Findbugs analysis. Sub-task Resolved Chris Nauroth
         
        43.
        Ozone: Add container transport server Sub-task Resolved Anu Engineer
         
        44.
        Ozone: Add container transport client Sub-task Resolved Anu Engineer
         
        45.
        Stop tracking CHANGES.txt in the HDFS-7240 feature branch. Sub-task Resolved Chris Nauroth
         
        46.
        OzoneHandler : Add Key handler Sub-task Resolved Anu Engineer
         
        47.
        Ozone: Add Ozone Client lib for bucket handling Sub-task Resolved Anu Engineer
         
        48.
        ozone : Add volume commands to CLI Sub-task Resolved Anu Engineer
         
        49.
        Ozone : Add container dispatcher Sub-task Resolved Anu Engineer
         
        50.
        Ozone: Add buckets commands to CLI Sub-task Resolved Anu Engineer
         
        51.
        OzoneHandler : Add localstorage support for keys Sub-task Resolved Anu Engineer
         
        52.
        Ozone: Adding logging support Sub-task Resolved Anu Engineer
         
        53.
        Ozone: Refactor container Namespace Sub-task Resolved Anu Engineer
         
        54.
        Ozone : Enable better error reporting for failed commands in ozone shell Sub-task Resolved Anu Engineer
         
        55.
        ozone : Add key commands to CLI Sub-task Resolved Anu Engineer
         
        56.
        Ozone: Add container persistence Sub-task Resolved Anu Engineer
         
        57.
        Ozone: Implement storage container manager Sub-task Resolved Chris Nauroth
         
        58.
        Fix failing HDFS tests on HDFS-7240 Ozone branch. Sub-task Resolved Xiaobing Zhou
         
        59.
        Ozone : Add chunk persistance Sub-task Resolved Anu Engineer
         
        60.
        Ozone: Add key Persistence Sub-task Resolved Anu Engineer
         
        61.
        Ozone: end-to-end integration for create/get volumes, buckets and keys. Sub-task Resolved Chris Nauroth
         
        62.
        Ozone: Make config key naming consistent Sub-task Resolved Anu Engineer
         
        63.
        Ozone: Shutdown cleanly Sub-task Resolved Anu Engineer
         
        64.
        Ozone : exclude cblock protobuf files from findbugs check Sub-task Resolved Xiaoyu Yao
         
        65.
        Ozone : reuse Xceiver connection Sub-task Resolved Chen Liang
         
        66.
        Ozone : move StorageContainerLocation protocol to hdfs-client Sub-task Resolved Chen Liang
         
        67.
        Ozone: Add metrics for container operations and export over JMX Sub-task Resolved Mukul Kumar Singh
         
        68.
        Ozone:SCM: Add chill mode support to NodeManager. Sub-task Resolved Anu Engineer
         
        69.
        Ozone: Add paging support to list Volumes Sub-task Resolved Anu Engineer
         
        70.
        Ozone: Optimize key writes to chunks by providing a bulk write implementation in ChunkOutputStream. Sub-task Resolved Chris Nauroth
         
        71.
        Ozone: Introduce new config keys for SCM service endpoints Sub-task Resolved Arpit Agarwal
         
        72.
        Fix Ozone unit tests to use MiniOzoneCluster Sub-task Resolved Arpit Agarwal
         
        73.
        Ozone: SCM: Add NodeManager Sub-task Resolved Anu Engineer
         
        74.
        Ozone: TestContainerMapping needs to cleanup levelDB files Sub-task Resolved Xiaoyu Yao
         
        75.
        Ozone: SCM: Send node report to SCM with heartbeat Sub-task Resolved Xiaoyu Yao
         
        76.
        Ozone: Fix flaky TestNodeManager#testScmNodeReportUpdate Sub-task Resolved Xiaoyu Yao
         
        77.
        Ozone: SCM: Add negative tests cases for datanodeStatemachine Sub-task Resolved Weiwei Yang
         
        78.
        Ozone: Exclude container protobuf files from findbugs check Sub-task Resolved Yuanbo Liu
         
        79.
        Ozone: Fix Container test regression. Sub-task Resolved Anu Engineer
         
        80.
        Ozone: TestDatanodeStateMachine is failing intermittently Sub-task Resolved Weiwei Yang
         
        81.
        Ozone: Add protobuf definitions for container reports Sub-task Resolved Anu Engineer
         
        82.
        Ozone: Fix TestEndpoint test regression Sub-task Resolved Anu Engineer
         
        83.
        Ozone: Improve logging and error handling in the container layer Sub-task Resolved Anu Engineer
         
        84.
        Ozone:SCM: Support MXBean for SCM and NodeManager Sub-task Resolved Weiwei Yang
         
        85.
        Ozone: Separate XceiverServer and XceiverClient into interfaces and implementations Sub-task Resolved Tsz Wo Nicholas Sze
         
        86.
        Ozone: SCM: Move SCM config keys to ScmConfig Sub-task Resolved Weiwei Yang
         
        87.
        Ozone: SCM: Container allocation based on node report Sub-task Resolved Xiaoyu Yao
         
        88.
        Ozone: Fix TestContainerPersistence failures Sub-task Resolved Xiaoyu Yao
         
        89.
        Ozone: Fix TestDatanodeStateMachine#testDatanodeStateContext Sub-task Resolved Anu Engineer
         
        90.
        Ozone: SCM: Add close container RPC Sub-task Resolved Anu Engineer
         
        91.
        Ozone: Fix datanode ID handling in MiniOzoneCluster Sub-task Resolved Weiwei Yang
         
        92.
        Ozone: Add the ability to handle sendContainerReport Command Sub-task Resolved Anu Engineer
         
        93.
        Ozone: Merge with trunk needed a ignore duplicate entry in pom file due to shading Sub-task Resolved Tsz Wo Nicholas Sze
         
        94.
        Ozone: MiniOzoneCluster prints too many log messages by default Sub-task Resolved Tsz Wo Nicholas Sze
         
        95.
        Ozone: Add a check to prevent removing a container that has keys in it Sub-task Resolved Weiwei Yang
         
        96.
        Ozone: SCM: Add node pool management API Sub-task Resolved Xiaoyu Yao
         
        97.
        Ozone: SCM: Support update container Sub-task Resolved Weiwei Yang
         
        98.
        Ozone: close container should call compactDB Sub-task Resolved Anu Engineer
         
        99.
        Ozone: SCM: Add Comparable Metric Support Sub-task Resolved Anu Engineer
         
        100.
        Ozone: Implement XceiverServerSpi and XceiverClientSpi using Ratis Sub-task Resolved Tsz Wo Nicholas Sze
         
        101.
        Ozone: Document missing metrics for container operations Sub-task Resolved Yiqun Lin
         
        102.
        Ozone: Unit tests running failed in Windows Sub-task Resolved Yiqun Lin
         
        103.
        Ozone: Allocate container for MiniOzone cluster fails because of insufficient space error Sub-task Resolved Mukul Kumar Singh
         
        104.
        Ozone: Fix UT failures that caused by hard coded datanode data dirs Sub-task Resolved Weiwei Yang
         
        105.
        Ozone: support setting chunk size in streaming API Sub-task Resolved Yiqun Lin
         
        106.
        Ozone:SCM: Remove null command Sub-task Resolved Yuanbo Liu
         
        107.
        Ozone: TestContainerPlacement fails because of string mismatch in expected Message Sub-task Resolved Mukul Kumar Singh
         
        108.
        Ozone: Implement listKey function for KeyManager Sub-task Resolved Weiwei Yang
         
        109.
        Ozone: SCM: CLI: Design SCM Command line interface Sub-task Resolved Anu Engineer
         
        110.
        Ozone: Add unit test for storage container metrics Sub-task Resolved Yiqun Lin
         
        111.
        Ozone: Support force delete a container Sub-task Resolved Yuanbo Liu
         
        112.
        Ozone: SCM daemon is unable to be started via CLI Sub-task Resolved Weiwei Yang
         
        113.
        Ozone: SCM: CLI: Add shell code placeholder classes Sub-task Resolved Chen Liang
         
        114.
        Ozone: SCM: Add Block APIs Sub-task Resolved Xiaoyu Yao
         
        115.
        Ozone: Fix compile error due to inconsistent package name Sub-task Resolved Yiqun Lin
         
        116.
        Ozone: misc improvements for SCM CLI Sub-task Resolved Weiwei Yang
         
        117.
        Ozone: SCM: Support listContainers API Sub-task Resolved Xiaoyu Yao
         
        118.
        Ozone: SCM CLI: Implement delete container command Sub-task Resolved Weiwei Yang
         
        119.
        Ozone: add the DB names to OzoneConsts Sub-task Resolved Chen Liang
         
        120.
        Ozone: Revise create container CLI specification and implementation Sub-task Resolved Weiwei Yang
         
        121.
        Ozone:SCM: Add support for getContainer in SCM Sub-task Resolved Nandakumar
         
        122.
        Ozone : need to fix OZONE_SCM_DEFAULT_PORT Sub-task Resolved Chen Liang
         
        123.
        Ozone: Reuse ObjectMapper instance to improve the performance Sub-task Resolved Yiqun Lin
         
        124.
        Ozone: SCM: CLI: Add Debug command Sub-task Resolved Chen Liang
         
        125.
        Ozone : SCMNodeManager#close() should also close node pool manager object Sub-task Resolved Chen Liang
         
        126.
        Ozone: Get container report should only report closed containers Sub-task Resolved Weiwei Yang
         
        127.
        Ozone: SCM: CLI: Revisit delete container API Sub-task Resolved Weiwei Yang
         
        128.
        Ozone : add DEBUG CLI support of blockDB file Sub-task Resolved Chen Liang
         
        129.
        Ozone: Add finer error codes for container operaions Sub-task Resolved Yiqun Lin
         
        130.
        Ozone: SCM CLI: Implement info container command Sub-task Resolved Yuanbo Liu
         
        131.
        Ozone: Fix spotbugs warnings Sub-task Resolved Weiwei Yang
         
        132.
        Ozone: Correct description for ozone.handler.type in ozone-default.xml Sub-task Resolved Mukul Kumar Singh
         
        133.
        Ozone : add DEBUG CLI support for nodepool db file Sub-task Resolved Chen Liang
         
        134.
        Ozone: KSM: Create Key Space manager service Sub-task Resolved Anu Engineer
         
        135.
        Ozone : add DEBUG CLI support for open container db file Sub-task Resolved Chen Liang
         
        136.
        Ozone: SCM: Support Delete Block Sub-task Resolved Xiaoyu Yao
         
        137.
        Ensure LevelDB DBIterator is closed Sub-task Resolved Chen Liang
         
        138.
        CBlockManager#main should join() after start() service Sub-task Resolved Chen Liang
         
        139.
        Ozone: Add archive support to containers Sub-task Resolved Anu Engineer
         
        140.
        Ozone: fix the consistently timeout test testUpgradeFromRel22Image Sub-task Resolved Chen Liang
         
        141.
        Ozone: Improve the way of getting test file path in unit tests for Windows Sub-task Resolved Yiqun Lin
         
        142.
        Ozone: Fix javac warnings Sub-task Resolved Yiqun Lin
         
        143.
        Ozone : add sql debug CLI to hdfs script Sub-task Resolved Chen Liang
         
        144.
        Ozone: XceiverClientRatis should implement XceiverClientSpi.connect() Sub-task Resolved Tsz Wo Nicholas Sze
         
        145.
        Ozone: SCM: SCMContainerPlacementCapacity#chooseNode sometimes does not remove chosen node from healthy list. Sub-task Resolved Xiaoyu Yao
         
        146.
        Ozone: Fix Http connection leaks in ozone clients Sub-task Resolved Weiwei Yang
         
        147.
        Ozone: Stack Overflow in XceiverClientManager because of race condition in accessing openClient Sub-task Resolved Mukul Kumar Singh
         
        148.
        Ozone: Datanode needs to re-register to SCM if SCM is restarted Sub-task Resolved Weiwei Yang
         
        149.
        Ozone : need to refactor StorageContainerLocationProtocolServerSideTranslatorPB Sub-task Resolved Chen Liang
         
        150.
        Ozone : implement StorageContainerManager#getStorageContainerLocations Sub-task Resolved Chen Liang
         
        151.
        Ozone: KSM: Add createVolume API Sub-task Resolved Mukul Kumar Singh
         
        152.
        Ozone: KSM: Add setVolumeProperty Sub-task Resolved Mukul Kumar Singh
         
        153.
        Ozone: KSM : add createBucket Sub-task Resolved Nandakumar
         
        154.
        Ozone: KSM: add getBucketInfo Sub-task Resolved Nandakumar
         
        155.
        Ozone: SCM: Separate BlockLocationProtocol from ContainerLocationProtocol Sub-task Resolved Xiaoyu Yao
         
        156.
        Ozone: KSM : Add putKey Sub-task Resolved Chen Liang
         
        157.
        Ozone: Do not initialize Ratis cluster during datanode startup Sub-task Resolved Tsz Wo Nicholas Sze
         
        158.
        Ozone: CLI: Guarantees user runs SCM commands has appropriate permission Sub-task Resolved Weiwei Yang
         
        159.
        Ozone: KSM: Add getKey Sub-task Resolved Chen Liang
         
        160.
        Ozone: KSM: add SetBucketProperty Sub-task Resolved Nandakumar
         
        161.
        Ozone:KSM : add deleteVolume Sub-task Resolved Mukul Kumar Singh
         
        162.
        Ozone: Differentiate time interval for different DatanodeStateMachine state tasks Sub-task Resolved Weiwei Yang
         
        163.
        Ozone: Cleaning up local storage when closing MiniOzoneCluster Sub-task Resolved Mingliang Liu
         
        164.
        Ozone: Cleanup imports Sub-task Resolved Weiwei Yang
         
        165.
        Ozone: Add Ratis management API Sub-task Resolved Tsz Wo Nicholas Sze
         
        166.
        Ozone: TestKeySpaceManager#testDeleteVolume fails Sub-task Resolved Weiwei Yang
         
        167.
        Ozone: MiniOzoneCluster should set "ozone.handler.type" key correctly Sub-task Resolved Mukul Kumar Singh
         
        168.
        Ozone: KSM: Add checkVolumeAccess Sub-task Resolved Mukul Kumar Singh
         
        169.
        Ozone: KSM: Add deleteKey Sub-task Resolved Yuanbo Liu
         
        170.
        Ozone: SCM: TestNodeManager takes too long to execute Sub-task Resolved Yiqun Lin
         
        171.
        Shared XceiverClient should be closed if there is no open clients to avoid resource leak Sub-task Resolved Mukul Kumar Singh
         
        172.
        Ozone: Create metadata path automatically after null checking Sub-task Resolved Mukul Kumar Singh
         
        173.
        Ozone: Implement a common helper to return a range of KVs in levelDB Sub-task Resolved Weiwei Yang
         
        174.
        Ozone: KSM: add deleteBucket Sub-task Resolved Nandakumar
         
        175.
        Ozone: KSM: Remove protobuf formats from KSM wrappers Sub-task Resolved Nandakumar
         
        176.
        Ozone: KSM: add listBuckets Sub-task Resolved Weiwei Yang
         
        177.
        Ozone: Fix TestContainerSQLCli#testConvertContainerDB Sub-task Resolved Weiwei Yang
         
        178.
        Remove Guava v21 usage from HDFS-7240 Sub-task Resolved Xiaoyu Yao
         
        179.
        Ozone: remove disabled tests Sub-task Resolved Anu Engineer
         
        180.
        Ozone: Output error when DN handshakes with SCM Sub-task Resolved Weiwei Yang
         
        181.
        Ozone: Ensure KSM is initiated using ProtobufRpcEngine Sub-task Resolved Weiwei Yang
         
        182.
        Ozone: Containers in different datanodes are mapped to the same location Sub-task Resolved Nandakumar
         
        183.
        Ozone: Add start-ozone.sh to quickly start ozone. Sub-task Resolved Weiwei Yang
         
        184.
        Ozone: Add stop-ozone.sh script Sub-task Resolved Weiwei Yang
         
        185.
        Ozone: cannot enable test debug/trace log Sub-task Resolved Tsz Wo Nicholas Sze
         
        186.
        Ozone: TestContainerPersistence never uses MiniOzoneCluster Sub-task Resolved Tsz Wo Nicholas Sze
         
        187.
        Ozone: KSM: Add listKey Sub-task Resolved Yiqun Lin
         
        188.
        Ozone: Support force update a container Sub-task Resolved Yuanbo Liu
         
        189.
        Ozone: Clarify startup error message of Datanode in case namenode is missing Sub-task Resolved Elek, Marton
         
        190.
        Ozone: Documentation: Add getting started page Sub-task Resolved Anu Engineer
         
        191.
        Ozone: Add documentation for metrics in KSMMetrics to OzoneMetrics.md Sub-task Resolved Yiqun Lin
         
        192.
        Ozone: TestXceiverClientManager.testFreeByEviction fails occasionally Sub-task Resolved Mukul Kumar Singh
         
        193.
        Ozone: SCM: Container metadata are not loaded properly after datanode restart Sub-task Resolved Xiaoyu Yao
         
        194.
        Ozone: Recover SCM state when SCM is restarted Sub-task Resolved Anu Engineer
         
        195.
        Ozone: Add all configurable entries into ozone-default.xml Sub-task Resolved Yiqun Lin
         
        196.
        Ozone: Encapsulate KSM metadata key for better (de)serialization Sub-task Resolved Weiwei Yang
         
        197.
        Ozone: Enable HttpServer2 for SCM and KSM Sub-task Resolved Elek, Marton
         
        198.
        Ozone: Misc: Add CBlocks defaults to ozone-defaults.xml Sub-task Resolved Chen Liang
         
        199.
        Ozone: CLI: remove noisy slf4j binding output from hdfs oz command. Sub-task Resolved Chen Liang
         
        200.
        Ozone: Rename OzoneClient to OzoneRestClient Sub-task Resolved Nandakumar
         
        201.
        Ozone: StorageHandler: Implementation of "close" for releasing resources during shutdown Sub-task Resolved Nandakumar
         
        202.
        Ozone: Ozone shell: Multiple RPC calls for put/get key Sub-task Resolved Yiqun Lin
         
        203.
        Ozone: KSM : add listVolumes Sub-task Resolved Weiwei Yang
         
        204.
        Ozone: Set proper parameter default values for listBuckets http request Sub-task Resolved Weiwei Yang
         
        205.
        Ozone : SCM cli misc fixes/improvements Sub-task Resolved Chen Liang
         
        206.
        Ozone: CLI: support infoKey command Sub-task Resolved Yiqun Lin
         
        207.
        Ozone: Test if all the configuration keys are documented in ozone-defaults.xml Sub-task Resolved Elek, Marton
         
        208.
        Ozone: ozone server should create missing metadata directory if it has permission to Sub-task Resolved Weiwei Yang
         
        209.
        Ozone: Add REST API documentation Sub-task Resolved Weiwei Yang
         
        210.
        Ozone:TestOzoneContainerRatis & TestRatisManager are failing consistently Sub-task Resolved Mukul Kumar Singh
         
        211.
        Ozone: listKey doesn't work from ozone commandline Sub-task Resolved Yiqun Lin
         
        212.
        Ozone: Add infoKey REST API document Sub-task Resolved Weiwei Yang
         
        213.
        Ozone: Add the unit test for KSMMetrics Sub-task Resolved Yiqun Lin
         
        214.
        Ozone: Review all cases where we are returning FAILED_INTERNAL_ERROR Sub-task Resolved Chen Liang
         
        215.
        Ozone: OzoneClient: Implementation of OzoneClient Sub-task Resolved Nandakumar
         
        216.
        Ozone: SCM CLI: Implement list container command Sub-task Resolved Yuanbo Liu
         
        217.
        Ozone: Improvement rest API output format for better looking Sub-task Resolved Weiwei Yang
         
        218.
        Ozone: Fix UT failure in TestOzoneConfigurationFields Sub-task Resolved Mukul Kumar Singh
         
        219.
        Ozone: listVolumes doesn't work from ozone commandline Sub-task Resolved Yiqun Lin
         
        220.
        Ozone: Bucket versioning design document Sub-task Resolved Weiwei Yang
         
        221.
        Ozone: SCM http server is not stopped with SCM#stop() Sub-task Resolved Weiwei Yang
         
        222.
        Ozone: KSM: previous key has to be excluded from result in listVolumes, listBuckets and listKeys Sub-task Resolved Nandakumar
         
        223.
        Ozone: SCM: Add the ability to handle container reports Sub-task Resolved Anu Engineer
         
        224.
        Ozone: OzoneClient: Abstraction of OzoneClient and default implementation Sub-task Resolved Nandakumar
         
        225.
        Ozone: add TestKeysRatis, TestBucketsRatis and TestVolumeRatis Sub-task Resolved Tsz Wo Nicholas Sze
         
        226.
        Ozone: KSM: Cleanup of keys in KSM for failed clients Sub-task Resolved Anu Engineer
         
        227.
        Ozone: Create a general abstraction for metadata store Sub-task Resolved Weiwei Yang
         
        228.
        Ozone: TestOzoneConfigurationFields is failing because ozone-default.xml has some missing properties Sub-task Resolved Weiwei Yang
         
        229.
        Ozone: Ozone shell: Add more testing for volume shell commands Sub-task Resolved Yiqun Lin
         
        230.
        Ozone: OzoneClient: OzoneClientImpl add getDetails calls Sub-task Resolved Nandakumar
         
        231.
        Ozone : add an UT to test partial read of chunks Sub-task Resolved Chen Liang
         
        232.
        Ozone: Ozone shell: Add more testing for bucket shell commands Sub-task Resolved Yiqun Lin
         
        233.
        Ozone: Corona: Implementation of Corona Sub-task Resolved Nandakumar
         
        234.
        Ozone: RocksDB implementation of ozone metadata store Sub-task Resolved Weiwei Yang
         
        235.
        Ozone: Ozone shell: Add more testing for key shell commands Sub-task Resolved Yiqun Lin
         
        236.
        Ozone: TestNodeManager times out before it is able to find all nodes Sub-task Resolved Yuanbo Liu
         
        237.
        Ozone: Fix TestContainerReplicationManager by setting proper log level for LogCapturer Sub-task Resolved Mukul Kumar Singh
         
        238.
        Ozone: Fix Leaking in TestStorageContainerManager#testRpcPermission Sub-task Resolved Xiaoyu Yao
         
        239.
        Ozone: Fix Leaking in TestXceiverClientManager Sub-task Resolved Xiaoyu Yao
         
        240.
        Ozone: SCM: Add queryNode RPC Call Sub-task Resolved Anu Engineer
         
        241.
        Ozone: OzoneFileSystem: Ozone & KSM should support "/" delimited key names Sub-task Resolved Mukul Kumar Singh
         
        242.
        Ozone: OzoneFileSystem: KSM should maintain key creation time and modification time Sub-task Resolved Mukul Kumar Singh
         
        243.
        Ozone: Corona: Add stats and progress bar to corona Sub-task Resolved Nandakumar
         
        244.
        Ozone : add RocksDB support to DEBUG CLI Sub-task Resolved Chen Liang
         
        245.
        Ozone: Intermittent failure TestContainerPersistence#testListKey Sub-task Resolved Surendra Singh Lilhore
         
        246.
        Ozone: Fix the remaining failure tests for Windows caused by incorrect path generated Sub-task Resolved Yiqun Lin
         
        247.
        Ozone : add support to DEBUG CLI for ksm.db Sub-task Resolved Chen Liang
         
        248.
        Ozone: OzoneClient: OzoneClientImpl Add setBucketProperty and delete calls Sub-task Resolved Nandakumar
         
        249.
        Ozone: KSM : Use proper defaults for block client address Sub-task Resolved Lokesh Jain
         
        250.
        Ozone: Corona: Adding corona as part of hdfs command Sub-task Resolved Nandakumar
         
        251.
        Ozone : add key partition Sub-task Resolved Chen Liang
         
        252.
        Ozone: DeleteKey-1: KSM replies delete key request asynchronously Sub-task Resolved Yuanbo Liu
         
        253.
        Ozone: Web interface for KSM Sub-task Resolved Elek, Marton
         
        254.
        Ozone: Ensures listKey lists all required key fields Sub-task Resolved Yiqun Lin
         
        255.
        OzoneClientUtils#updateListenAddress should use server address to update listening addrees Sub-task Resolved Surendra Singh Lilhore
         
        256.
        Ozone: Support asynchronus client API for SCM and containers Sub-task Resolved Anu Engineer
         
        257.
        Ozone: TestStorageContainerManager#testRpcPermission fails when ipv6 address used Sub-task Resolved Yiqun Lin
         
        258.
        Ozone: OzoneClient: Handling SCM container creationFlag at client side Sub-task Resolved Nandakumar
         
        259.
        Ozone: KeySpaceManager should unregister KSMMetrics upon stop Sub-task Resolved Yiqun Lin
         
        260.
        Ozone: Reduce MiniOzoneCluster handler thread count Sub-task Resolved Weiwei Yang
         
        261.
        Ozone: potential thread leaks Sub-task Resolved Weiwei Yang
         
        262.
        Ozone: KSM: Add creation time field in volume info Sub-task Resolved Yiqun Lin
         
        263.
        Ozone: KSM: Reduce default handler thread count from 200 Sub-task Resolved Ajay Kumar
         
        264.
        Ozone: OzoneClient: Refactor and move ozone client from hadoop-hdfs to hadoop-hdfs-client Sub-task Resolved Nandakumar
         
        265.
        Ozone: XceiverClientManager should not close the connection if client holds the reference Sub-task Resolved Nandakumar
         
        266.
        Ozone : add debug cli to hdfs script Sub-task Resolved Chen Liang
         
        267.
        Ozone: Corona: move corona from test to tools package Sub-task Resolved Nandakumar
         
        268.
        Ozone: DeleteKey-2: Implement block deleting service to delete stale blocks at background Sub-task Resolved Weiwei Yang
         
        269.
        Ozone: Extend MBeans utility to add any key value pairs to the registered MXBeans Sub-task Resolved Elek, Marton
         
        270.
        Ozone: Ozone-default.xml has 3 properties that do not match the default Config value Sub-task Resolved Ajay Kumar
         
        271.
        Ozone: Web interface for SCM Sub-task Resolved Elek, Marton
         
        272.
        Ozone: KSM: Add creation time field in bucket info Sub-task Resolved Yiqun Lin
         
        273.
        Ozone: Block deletion service floods the log when deleting large number of block files Sub-task Resolved Yiqun Lin
         
        274.
        Ozone: Add valid trace ID check in sendCommandAsync Sub-task Resolved Ajay Kumar
         
        275.
        Ozone: SCM: Add StateMachine for pipeline/container Sub-task Resolved Xiaoyu Yao
         
        276.
        Ozone: TestKeys#testPutAndGetKeyWithDnRestart fails Sub-task Resolved Weiwei Yang
         
        277.
        Ozone: SCM: Add create replication pipeline RPC Sub-task Resolved Anu Engineer
         
        278.
        Ozone: Implement update volume owner in ozone shell Sub-task Resolved Lokesh Jain
         
        279.
        Ozone: SCM: move container/pipeline StateMachine to the right package Sub-task Resolved Xiaoyu Yao
         
        280.
        Ozone: TestKeys is failing consistently Sub-task Resolved Mukul Kumar Singh
         
        281.
        Ozone: support setting timeout in background service Sub-task Resolved Yiqun Lin
         
        282.
        Ozone: the static cache provided by ContainerCache does not work in Unit tests Sub-task Resolved Anu Engineer
         
        283.
        Ozone: Concurrent RocksDB open calls fail because of "No locks available" Sub-task Resolved Mukul Kumar Singh
         
        284.
        Ozone: DeleteKey-5: Implement SCM DeletedBlockLog Sub-task Resolved Yuanbo Liu
         
        285.
        Ozone: SCM: use state machine for open containers allocated for key/blocks Sub-task Resolved Xiaoyu Yao
         
        286.
        Ozone: TestQueryNode#testStaleNodesCount is failing. Sub-task Resolved Xiaoyu Yao
         
        287.
        Ozone: TestOzoneContainer#testCreateOzoneContainer fails Sub-task Resolved Lokesh Jain
         
        288.
        TestKeys#testPutAndGetKeyWithDnRestart failed Sub-task Resolved Mukul Kumar Singh
         
        289.
        Ozone: maven dist compilation fails with "Duplicate classes found" error Sub-task Resolved Mukul Kumar Singh
         
        290.
        Ozone: Corona: Support for random validation of writes Sub-task Resolved Nandakumar
         
        291.
        Ozone: DeleteKey-4: Block delete between SCM and DN Sub-task Resolved Weiwei Yang
         
        292.
        Ozone: ListVolume displays incorrect createdOn time when the volume was created by OzoneRpcClient Sub-task Resolved Weiwei Yang
         
        293.
        Ozone: Shuffle container list for datanode BlockDeletingService Sub-task Resolved Yiqun Lin
         
        294.
        Ozone: SCM failed to start when a container metadata is empty Sub-task Resolved Weiwei Yang
         
        295.
        Ozone: Refactor KSM metadata class names to avoid confusion Sub-task Resolved Weiwei Yang
         
        296.
        Ozone: Fix TestArchive#testArchive Sub-task Resolved Xiaoyu Yao
         
        297.
        Ozone: Extend Datanode web interface with SCM information Sub-task Resolved Elek, Marton
         
        298.
        Ozone: Too many open files error while running corona Sub-task Resolved Nandakumar
         
        299.
        Ozone: OzoneFileSystem: OzoneFileystem initialization code Sub-task Resolved Mukul Kumar Singh
         
        300.
        Ozone: TestKSMSQLCli is not working as expected Sub-task Resolved Weiwei Yang
         
        301.
        Ozone: SCM: BlockManager creates a new container for each allocateBlock call Sub-task Resolved Nandakumar
         
        302.
        Ozone: Storage container data pipeline Sub-task Resolved Jitendra Nath Pandey
         
        303.
        Ozone: Datanode is unable to register with scm if scm starts later Sub-task Resolved Weiwei Yang
         
        304.
        Ozone: DeleteKey-3: KSM SCM block deletion message and ACK interactions Sub-task Resolved Weiwei Yang
         
        305.
        Ozone: Implement TopN container choosing policy for BlockDeletionService Sub-task Resolved Yiqun Lin
         
        306.
        Ozone: oz commandline list calls should return valid JSON format output Sub-task Resolved Weiwei Yang
         
        307.
        Ozone: KSM: Garbage collect deleted blocks Sub-task Resolved Weiwei Yang
         
        308.
        Ozone: OzoneClient: Refactoring OzoneClient API Sub-task Resolved Nandakumar
         
        309.
        Ozone: KSM: Unable to put keys with zero length Sub-task Resolved Mukul Kumar Singh
         
        310.
        Add markdown documentation about Ozone Sub-task Resolved Mukul Kumar Singh
         
        311.
        Ozone: Object store handler supports reusing http client for multiple requests. Sub-task Resolved Xiaoyu Yao
         
        312.
        Ozone: Cleanup Checkstyle issues Sub-task Resolved Shashikant Banerjee
         
        313.
        Ozone: TopN container choosing policy should ignore containers that has no pending deletion blocks Sub-task Resolved Yiqun Lin
         
        314.
        Ozone: SCM CLI: Implement close container command Sub-task Resolved Chen Liang
         
        315.
        Ozone: KSM: set creationTime for volume/bucket/key Sub-task Resolved Mukul Kumar Singh
         
        316.
        Ozone: Implement the trace ID generator Sub-task Resolved Yiqun Lin
         
        317.
        Ozone: OzoneClient : Remove createContainer handling from client Sub-task Resolved Yuanbo Liu
         
        318.
        Ozone: SCM: Container State Machine -1- Track container creation state in SCM Sub-task Resolved Xiaoyu Yao
         
        319.
        Ozone: KSM: multiple delete methods in KSMMetadataManager Sub-task Resolved Nandakumar
         
        320.
        Ozone: Corona: Support for variable key length in offline mode Sub-task Resolved Nandakumar
         
        321.
        Ozone: fix a KeySpaceManager startup message typo Sub-task Resolved Nandakumar
         
        322.
        Ozone: fix hard coded version in the Ozone GettingStarted guide Sub-task Resolved Elek, Marton
         
        323.
        Ozone: Fix TestXceiverClientMetrics#testMetrics Sub-task Resolved Yiqun Lin
         
        324.
        Ozone: TestAllocateContainer fails on jenkins Sub-task Resolved Weiwei Yang
         
        325.
        Ozone: Misc : Make sure that ozone-site.xml is in etc/hadoop in tarball created from mvn package. Sub-task Resolved Mukul Kumar Singh
         
        326.
        Ozone: Ozone data placement is not even Sub-task Resolved Weiwei Yang
         
        327. Ozone: add TestDistributedOzoneVolumesRatis, TestOzoneRestWithMiniClusterRatis and TestOzoneWebAccessRatis Sub-task Patch Available Tsz Wo Nicholas Sze
         
        328. Ozone : add read/write random access to Chunks of a key Sub-task Patch Available Chen Liang
         
        329. Ozone : debug cli: add support to load user-provided SQL query Sub-task Patch Available Chen Liang
         
        330. Ozone: Add metrics for pending storage container requests Sub-task Patch Available Yiqun Lin
         
        331. Ozone: Container : Add key versioning support-1 Sub-task Patch Available Chen Liang
         
        332. Ozone: Ratis: Readonly calls in XceiverClientRatis should use sendReadOnly Sub-task Resolved Mukul Kumar Singh
         
        333. Ozone: Mini cluster can't start up on Windows after HDFS-12159 Sub-task Patch Available Yiqun Lin
         
        334. Ozone: SCM: NodeManager should log when it comes out of chill mode Sub-task Patch Available Nandakumar
         
        335. Ozone: Support Ratis as a first class replication mechanism Sub-task Patch Available Anu Engineer
         
        336. Ozone : the sample ozone-site.xml in OzoneGettingStarted does not work Sub-task Patch Available Chen Liang
         
        337. Ozone : handle inactive containers on DataNode Sub-task Resolved Chen Liang
         
        338. Ozone: KSM: Allocate key should honour volume quota if quota is set on the volume Sub-task Patch Available Lokesh Jain
         
        339. Ozone: Container: Move IPC port to 98xx range Sub-task Patch Available Nandakumar
         
        340. Ozone: OzoneFileSystem: OzoneFileystem read/write/create/open/getFileInfo APIs Sub-task Patch Available Mukul Kumar Singh
         
        341. Ozone: List Key on an empty ozone bucket fails with command failed error Sub-task Patch Available Lokesh Jain
         
        342. Ozone: KSM: Make "ozone.ksm.address" as mandatory property for client Sub-task Resolved Nandakumar
         
        343. Ozone: TestXceiverClientManager and TestAllocateContainer occasionally fails Sub-task Reopened Weiwei Yang
         
        344. Ozone: Improve SCM block deletion throttling algorithm Sub-task In Progress Weiwei Yang
         
        345. Ozone : better handling of operation fail due to chill mode Sub-task Open Unassigned
         
        346. Ozone: C/C++ implementation of ozone client using curl Sub-task Open Shashikant Banerjee
         
        347. Ozone: SCM: clean up containers that timeout during creation Sub-task Open Xiaoyu Yao
         
        348. Ozone: More detailed documentation about the ozone components Sub-task Patch Available Elek, Marton
         
        349. Ozone: Non-admin user is unable to run InfoVolume to the volume owned by itself Sub-task Patch Available Lokesh Jain
         
        350. Ozone: OzoneClient: Add list calls in RpcClient Sub-task Patch Available Nandakumar
         
        351. Ozone: Ozone shell: the root is assumed to hdfs Sub-task Open Nandakumar
         
        352. Ozone: change TestRatisManager to check cluster with data Sub-task Open Tsz Wo Nicholas Sze
         
        353. Ozone: KSM: Add checkBucketAccess Sub-task Open Nandakumar
         
        354. Ozone: Container server needs enhancements to control of bind address for greater flexibility and testability. Sub-task Open Anu Engineer
         
        355. Ozone: SCM: Add Node Metrics for SCM Sub-task Open Yiqun Lin
         
        356. OzoneFileSystem: A Hadoop file system implementation for Ozone Sub-task Open Mukul Kumar Singh
         
        357. Ozone : Optimize putKey operation to be async and consensus Sub-task Open Weiwei Yang
         
        358. Ozone: provide a way to validate ContainerCommandRequestProto Sub-task Open Anu Engineer
         
        359. Ozone: Document ozone metadata directory structure Sub-task Open Xiaoyu Yao
         
        360. ChunkManager functions do not use the argument keyName Sub-task Open Chen Liang
         
        361. OZone: SCM CLI: Implement get container command Sub-task Open Chen Liang
         
        362. Ozone: Compact DB should be called on Open Containers. Sub-task Open Weiwei Yang
         
        363. Ozone:SCM: Add support for close containers in SCM Sub-task Open Anu Engineer
         
        364. Ozone: Support CopyContainer Sub-task Open Anu Engineer
         
        365. Ozone: Replace Jersey container with Netty Container Sub-task Open Anu Engineer
         
        366. Ozone: SCM: Handle duplicate Datanode ID Sub-task Open Anu Engineer
         
        367. Ozone: Fix the Cluster ID generation code in SCM Sub-task Open Anu Engineer
         
        368. Ozone: Support SCM multiple instance for HA Sub-task Open Anu Engineer
         
        369. Ozone: enforce DependencyConvergence uniqueVersions Sub-task Resolved Tsz Wo Nicholas Sze
         
        370. Ozone: Cleanup javadoc issues Sub-task Open Mukul Kumar Singh
         
        371. Ozone:KSM: Add setVolumeAcls to allow adding/removing acls from a KSM volume Sub-task Open Mukul Kumar Singh
         
        372. Ozone:SCM: explore if we need 3 maps for tracking the state of nodes Sub-task Open Unassigned
         
        373. Ozone: SCM CLI: Implement get container metrics command Sub-task Open Yuanbo Liu
         
        374. Ozone: In Ratis, leader should validate ContainerCommandRequestProto before propagating it to followers Sub-task Open Tsz Wo Nicholas Sze
         
        375. Ozone: SCM : Add priority for datanode commands Sub-task Open Unassigned
         
        376. Ozone: KSM: Changing log level for client calls in KSM Sub-task Resolved Shashikant Banerjee
         
        377. Ozone: KSM : Support for simulated file system operations Sub-task Open Mukul Kumar Singh
         
        378. Ozone: Support range in getKey operation Sub-task Open Unassigned
         
        379. Ozone: Handle potential inconsistent states while listing keys Sub-task Open Weiwei Yang
         
        380. Ozone: Audit Logs Sub-task Open Weiwei Yang
         
        381. Ozone: Misc: Explore if the default memory settings are correct Sub-task Open Mukul Kumar Singh
         
        382. Ozone: Cleanup findbugs issues Sub-task Resolved Shashikant Banerjee
         
        383. Ozone: Misc : Cleanup error messages Sub-task Open Unassigned
         
        384. Ozone: Documentation: Add Ozone-defaults documentation Sub-task Open Ajay Kumar
         
        385. Ozone: Corona: Support for online mode Sub-task Open Nandakumar
         
        386. Ozone: Purge metadata of deleted blocks after max retry times Sub-task Open Yuanbo Liu
         
        387. Ozone: write deleted block to RAFT log for consensus on datanodes Sub-task Open Unassigned
         
        388. Ozone: TestBlockDeletingService#testBlockDeletionTimeout sometimes timeout Sub-task Open Weiwei Yang
         
        389. Ozone: Add container usage information to DN container report Sub-task Open Xiaoyu Yao
         
        390. Ozone: Create docker-compose definition to easily test real clusters Sub-task Patch Available Elek, Marton
         
        391. Ozone: SCM: Handling container report with key count and container usage. Sub-task Open Xiaoyu Yao
         
        392. Ozone : add document for using Datanode http address Sub-task Open Lokesh Jain
         
        393. Ozone: Some minor text improvement in SCM web UI Sub-task Resolved Elek, Marton
         
        394. Ozone: OzoneRestClient needs to be configuration awareness Sub-task Open Weiwei Yang
         
        395. Ozone: OzoneRestClientException swallows exceptions which makes client hard to debug failures Sub-task Resolved Weiwei Yang
         
        396. Ozone: OzoneClient: OzoneBucket should have information about the bucket creation time Sub-task Patch Available Mukul Kumar Singh
         
        397. Ozone: ListVolume output misses some attributes Sub-task Open Mukul Kumar Singh
         
        398. Ozone: add logger for oz shell commands and move error stack traces to DEBUG level Sub-task Open Yiqun Lin
         
        399. Ozone: Cleanup javac issues Sub-task Patch Available Yiqun Lin
         
        400. Ozone: some UX improvements to oz_debug Sub-task Resolved Weiwei Yang
         
        401. Ozone: Improve SQLCLI performance Sub-task Open Yuanbo Liu
         
        402. Ozone: ListBucket is too slow Sub-task Resolved Weiwei Yang
         
        403. Ozone: Revert files not related to ozone change in HDFS-7240 branch Sub-task Patch Available Mukul Kumar Singh
         
        404. Ozone: mvn package compilation fails on HDFS-7240 Sub-task Resolved Mukul Kumar Singh
         
        405. Ozone: mvn package is failing with out skipshade Sub-task Resolved Bharat Viswanadham
         
        406. Ozone: Create UI page to show Ozone configs by tags Sub-task Open Ajay Kumar
         
        407. Ozone: Add a Lease Manager to SCM Sub-task Open Anu Engineer
         
        408. Ozone : Add an API to get Open Container by Owner, Replication Type and Replication Count Sub-task Open Unassigned
         
        409. Ozone: SCM should read all Container info into memory when booting up Sub-task Open Unassigned
         
        410. Ozone: Remove the Priority Queues used in the Container State Manager Sub-task Open Elek, Marton
         
        411. Ozone: Add tags to config Sub-task In Progress Ajay Kumar
         
        412. Ozone: Record number of keys scanned and hinted for getRangeKVs call Sub-task Patch Available Weiwei Yang
         
        413. Ozone: OzoneClient: Verify bucket/volume name in create calls Sub-task Patch Available Nandakumar
         
        414. javadoc: error - class file for org.apache.http.annotation.ThreadSafe not found Sub-task Resolved Mukul Kumar Singh
         
        415. Ozone: Reduce key creation overhead in Corona Sub-task Patch Available Lokesh Jain
         
        416. Ozone: refactor some functions in KSMMetadataManagerImpl to be more readable and reusable Sub-task Open Yuanbo Liu
         
        417. Ozone: node status text reported by SCM is a confusing Sub-task Open Unassigned
         

          Activity

          ebortnik Edward Bortnikov added a comment -

          Very interested to follow. How is this related to the previous jira and design on Block-Management-as-a-Service (HDFS-5477)?

          jnp Jitendra Nath Pandey added a comment -

           I think HDFS-5477 takes us towards making the block management service generic enough to support different storage semantics and APIs. In that sense, the object store will be one more use case for block management. The object store design should work with the block management service.

          azuryy Fengdong Yu added a comment -

           Please look here for a simple description:
          http://www.hortonworks.com/blog/ozone-object-store-hdfs/

          gsogol Jeff Sogolov added a comment -

           Salivating at this feature. Would like to talk over the geo-replication at some point, but it's a killer feature set. I could start doing direct analytics on unstructured data without any ETL. Beam me up, Scotty!

          gautamphegde Gautam Hegde added a comment -

           Awesome feature... would like to help develop this... where do I sign up?

          jnp Jitendra Nath Pandey added a comment -

           I have attached a high-level architecture document for the object store in HDFS. Please provide your feedback. The development will start in a feature branch, and jiras will be created as subtasks of this jira.

          lhofhansl Lars Hofhansl added a comment -

          Awesome stuff. We (Salesforce) have a need for this.

          I think these will lead to immediate management problems:

           • Object Size: 5G
           • Number of buckets system-wide: 10 million
           • Number of objects per bucket: 1 million
           • Number of buckets per storage volume: 1000

           We have a large number of tenants (many times more than 1000). Some of the tenants will be very large (storing many times more than 1M objects). Of course there are simple workarounds for that, such as including a tenant id in the volume name and a bucket name in our internal blob ids. Are these technical limits?

           I don't think we're the only ones who will want to store a large number of objects (more than 1M), and the bucket management would get in the way rather than help.
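
           A minimal Java sketch of the workaround described above: derive the volume and bucket names from a tenant id and a hash of the internal blob id, so that no single volume or bucket exceeds the stated limits. The naming scheme and the 1000-bucket fan-out per tenant are illustrative assumptions, not part of the Ozone design.

           public final class TenantKeyMapper {
             // Illustrative fan-out, chosen to stay within the stated buckets-per-volume limit.
             private static final int BUCKETS_PER_TENANT = 1000;

             /** Maps (tenantId, blobId) to {volume, bucket, key}: one volume per tenant, hashed bucket fan-out. */
             public static String[] locate(String tenantId, String blobId) {
               int bucketIndex = Math.floorMod(blobId.hashCode(), BUCKETS_PER_TENANT);
               String volume = tenantId;
               String bucket = String.format("%s-bkt-%04d", tenantId, bucketIndex);
               return new String[] { volume, bucket, blobId };
             }
           }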

          stevel@apache.org Steve Loughran added a comment -

           Cluster Level

           1. Is there a limit on the number of storage volumes in a cluster? Does GET / return all of them?

          Storage Volume Level

           1. Any way to enumerate users? e.g. GET /admin/user/

          Bucket Level

           1. What if I want to GET the 1001st entry in an object store? The GET spec doesn't allow this.
           2. Propose: the listing of entries should be a structure that includes length, block sizes, and everything needed to rebuild a FileStatus

           Object Level

           1. GET on an object must support ranges
           2. HEAD should supply Content-Length (see the sketch after this list)
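
           A small Java sketch of the two object-level asks: a HEAD that reads Content-Length and a GET with a Range header. The /volume/bucket/key URL layout and the port are assumptions for illustration, not the finalized Ozone REST endpoint.

           import java.io.InputStream;
           import java.net.HttpURLConnection;
           import java.net.URL;

           public class RangedReadSketch {
             public static void main(String[] args) throws Exception {
               // Hypothetical object URL; the real Ozone REST layout may differ.
               URL key = new URL("http://datanode.example.com:9864/vol1/bucket1/key1");

               // HEAD: metadata only; the server should report Content-Length here.
               HttpURLConnection head = (HttpURLConnection) key.openConnection();
               head.setRequestMethod("HEAD");
               System.out.println("Content-Length = " + head.getContentLengthLong());
               head.disconnect();

               // GET with a Range header: fetch only the first 1 MB of the object.
               HttpURLConnection get = (HttpURLConnection) key.openConnection();
               get.setRequestProperty("Range", "bytes=0-1048575");
               try (InputStream in = get.getInputStream()) {
                 byte[] buf = new byte[8192];
                 long read = 0;
                 for (int n; (n = in.read(buf)) != -1; ) {
                   read += n;
                 }
                 // A server that honours ranges replies 206 Partial Content.
                 System.out.println("status = " + get.getResponseCode() + ", bytes read = " + read);
               } finally {
                 get.disconnect();
               }
             }
           }
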
          clamb Charles Lamb added a comment -

          Jitendra Nath Pandey et al,

          This is very interesting. Thanks for posting it.

          Is the 1KB key size limit a hard limit or just a design/implementation target? There will be users who want keys that can be arbitrarily large (e.g. 10's to 100's of KB). So although it may be acceptable to degrade above 1KB, I don't think you want to make it a hard limit. You could argue that they could just hash their keys, or that they could have some sort of key map, but then it would be hard to do secondary indices in the future.

          The details of partitions are kind of lacking beyond the second to last paragraph on page 4. Are partitions and storage containers 1:1? ("A storage container can contain a maximum of one partition..."). Obviously a storage container holds more than just a partition. Perhaps a little more detail about partitions and how they are located, etc. is warranted.

          In the call flow diagram on page 6, it looks like there's a lot going on in terms of network traffic. There's the initial REST call, then an RPC to get the bucket metadata, then one to read the bucket metadata, then another to get the object's container location, then back to the client who gets redirected. That's a lot of REST/RPCs just to get to the actual data. Will any of this be cached, perhaps in the Ozone Handler or maybe even on the client (I realize that's a bit hard with a REST based protocol). For instance, if it were possible to cache some of the hash in the client, then that would cut some RPCs to the Ozone Handler. If the cache were out of date, then the actual call to the data (step (9) in the diagram) could be rejected, the cache invalidated, and the entire call sequence (1) - (8) could be executed to get the right location.
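
           A minimal Java sketch of the client-side caching idea above: cache the key-to-container-location lookup and, if the data read is rejected because the cached location is stale, invalidate and redo the full lookup once. The Locator and DataNodeClient interfaces are hypothetical stand-ins for the Ozone Handler / container-location RPCs, not real APIs.

           import java.util.concurrent.ConcurrentHashMap;
           import java.util.concurrent.ConcurrentMap;

           public class CachingLocationClient {
             interface Locator { String lookupContainer(String volume, String bucket, String key); }
             interface DataNodeClient { byte[] read(String containerLocation, String key) throws StaleLocationException; }
             static class StaleLocationException extends Exception { }

             private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
             private final Locator locator;
             private final DataNodeClient dataNode;

             CachingLocationClient(Locator locator, DataNodeClient dataNode) {
               this.locator = locator;
               this.dataNode = dataNode;
             }

             byte[] getKey(String volume, String bucket, String key) throws StaleLocationException {
               String cacheKey = volume + "/" + bucket + "/" + key;
               // Steps (1)-(8) of the call flow happen only on a cache miss.
               String location = cache.computeIfAbsent(cacheKey,
                   k -> locator.lookupContainer(volume, bucket, key));
               try {
                 return dataNode.read(location, key);                // step (9): read the data directly
               } catch (StaleLocationException stale) {
                 cache.remove(cacheKey);                             // cached location was wrong: invalidate,
                 String fresh = locator.lookupContainer(volume, bucket, key);  // redo the full lookup once,
                 cache.put(cacheKey, fresh);
                 return dataNode.read(fresh, key);                   // and retry the read
               }
             }
           }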

          IWBNI there was some description of the protocols used between all these moving parts. I know that it's REST from client to Ozone Handler, but what about the other network calls in the diagram? Will it be more REST, or Hadoop RPC, or something else? You talk about security at the end so I guess the authentication will be Kerberos based? Or will you allow more authentication options such as those that HDFS currently has?

          Hash partitioning can also suffer from hotspots depending on the semantics of the key. That's not to say that it's the wrong decision to use it, only that it can have similar drawbacks as key partitioning. Since it looks like you have two separate hashes, one for buckets, and then one for the object key within the bucket, it is possible that there could be hotspots based on a particular bucket. Presumably some sort of caching would help here since the bucket mapping is relatively immutable.

          Secondary indexing will not be easy in a distributed, sharded system, especially the consistency issues around updates. That said, I am reasonably certain that many users will need this feature relatively soon, so it should be high on the roadmap.

          You don't say much about ACLs other than to include them in the REST API. I suppose they'll be implemented in the Ozone Handler, but what will they look like? HDFS/Linux ACLs?

          In the Cluster Level APIs, presumably DELETE Storage Volume is only allowed by the admin. What about GET?

          How are quotas enabled and set? I don't see it in the API anywhere. There's mention early on that they're set up by the administrator. Perhaps it's via some HTTP/JSP interface to the Ozone Handler or Storage Container Manager? Who enforces them?

          "no guarantees on partially written objects" - Does this also mean that there are no block-order guarantees during a write? Are "holey" objects allowed, or will the only inconsistencies be at the tail of an object? This is obviously important for log-based storage systems.

          In the "Size requirements" section on page 3 you say "Number of objects per bucket: 1 million", and then later on you say "A bucket can have millions of objects". You may want to shore that up a little.

          Also in the Size requirements section you say "Object Size: 5G", but then later it says "The storage container needs to store object data that can vary from a few hundred KB to hundreds of megabytes". I'm not sure those are necessarily inconsistent, but I'm also not sure how to reconcile them.

          Perhaps you could include a diagram showing how an object maps to partitions and storage containers and then onto DNs. In other words, a general diagram showing all the various storage concepts (objects, partitions, storage containers, hash tables, transactions, etc.)

          "We plan to re-use Namenode's block management implementation for container management, as much as possible." I'd love to see more detail on what can be reused, what high level changes to the BlkMgr code will be needed, what existing APIs (RPCs) you'll continue to use or need to be changed, etc.

          wrt the storage container prototype using leveldbjni. Where would this fit into the scheme of things? I get the impression that it would just be a backend storage component for the storage container manager. Would it use HDFS blocks as its persistent storage? Presumably not. Maybe a little bit more detail here?

          s/where where/where/

          ebortnik Edward Bortnikov added a comment -

          Great stuff. Block- and object-level storage scales much better from the metadata perspective (flat key space). It could play really well with the block-management-as-a-service proposal (HDFS-5477), which splits the namenode into an FS manager and a block manager service and scales the latter horizontally.

          jnp Jitendra Nath Pandey added a comment -

          Thanks for the feedback and comments. I will try to answer the questions over my next few comments. I will also update the document to reflect the discussion here.

          The stated limits in the document are design goals and parameters we have in mind for the first phase of the project. They are not hard limits, and most of them will be configurable. First I will state a few technical limits, and then describe some back-of-the-envelope calculations and heuristics behind these numbers.
          The technical limitations are as follows.

          1. The memory in the storage container manager limits the number of storage containers. From the namenode experience, I believe we can go up to a few hundred million storage containers. In later phases of the project we can have a federated architecture with multiple storage container managers for further scale.
          2. The size of a storage container is limited by how quickly we want to re-replicate containers when a datanode goes down. The advantage of a large container size is that it reduces the metadata needed to track container locations, which is proportional to the number of containers. However, a very large container reduces the parallelism the cluster can achieve when re-replicating after a node failure. The container size will be configurable. A default size of 10G seems like a good choice: it is much larger than HDFS block sizes, but still allows hundreds of containers on datanodes with a few terabytes of disk.

          The maximum size of an object is stated as 5G. In the future we would like to increase this limit further once we support multi-part writes similar to S3. However, the average object size is expected to be much smaller; the most common range is expected to be a few hundred KB to a few hundred MB.
          Assuming 100 million containers, a 1MB average object size, and a 10G storage container size, it amounts to 10 trillion objects. I think 10 trillion is a lofty goal to have : ). The division of 10 trillion into 10 million buckets with a million objects in each bucket is somewhat arbitrary, but we believe users will prefer smaller buckets for better organization. We will keep these configurable.

          The storage volume settings give admins control over storage usage. In a private cloud, a cluster shared by many tenants can have a storage volume dedicated to each tenant; a tenant can be a user, a project, or a group of users. Therefore, a limit of 1000 buckets, implying around 1PB of storage per tenant, seems reasonable. But I do agree that once we have a quota on storage volume size, an additional limit on the number of buckets is not really needed.

          We plan to carry out the project in several phases. I would like to propose the following phases:

          Phase 1

          1. Basic API as covered in the document.
          2. Storage container machinery, reliability, replication.

          Phase 2

          1. High availability
          2. Security
          3. Secondary index for object listing with prefixes.

          Phase 3

          1. Caching to improve latency.
          2. Further scalability in terms of number of objects and object sizes.
          3. Cross-geo replication.

          I have created branch HDFS-7240 for this work. We will start filing jiras and posting patches.

          jnp Jitendra Nath Pandey added a comment -

          Steve Loughran, thanks for the review.

          is there a limit on the #of storage volumes in a cluster? does GET/ return all of them?

          Please see my discussion on limits above. The storage volumes are created by admins and therefore are not expected to be numerous.

          any way to enum users? e.g. GET /admin/user/ ?

          We don't plan to manage users in Ozone. In this respect we deviate from popular public object stores. This is because, in a private cluster deployment, user management is usually tied to corporate user accounts. Instead we chose the storage volume abstraction for certain administrative settings like quota. However, admins can choose to allocate a storage volume for each user.

          what if I want to GET the 1001st entry in an object store?

          Not sure I understand the use case. Do you mean the users would like to query using some sort of entry number or index?

          GET on object must support ranges

          Agree, we plan to take up this feature in the 2nd phase.

          HEAD should supply content-length

          This should be easily doable. We will keep it in mind for container implementation.

          jnp Jitendra Nath Pandey added a comment -

          Charles Lamb, thanks for a detailed review and feedback.
          Some of the answers are below, for others I will post the updated document with details as you have pointed out.

          Is the 1KB key size limit a hard limit or just a design/implementation target

          It is a design target. Amazon's S3 limits keys to 1KB, and I doubt there would be many use cases that need more. I see the point about allowing graceful degradation instead of a hard limit, but at this point in the project I would prefer stricter limits and relax them later, rather than setting user expectations too high to begin with.

          Caching to reduce network traffic

          I agree that a good caching layer will significantly help performance. The Ozone handler seems like a natural place for caching; however, a thick client can do its own caching without overloading datanodes. The focus of phase 1 is to get the semantics right and put the basic architecture in place. We plan to attack performance improvements in a later phase of the project.

          Security mechanisms

          Frankly, I haven't thought about anything other than Kerberos. I agree, we should evaluate it against what other popular object stores use.

          Hot spots in hash partitioning.

          It is possible for a pathological sequence of keys, but in practice hash partitioning has been successfully used to avoid hot spots e.g. hash-partitioned indexes in databases. We would need to pick hash functions with nice distribution properties.

          Secondary indexing consistency

          The secondary index need not be strictly consistent with the bucket. That means a listing operation with a prefix or key range may not reflect the latest state of the bucket. We will have a more concrete proposal in the second phase of the project.

          Storage volume GET for admin

          I don't believe it is a security concern to allow users to see all storage volume names. However, it is possible to conceive of a use case where an admin would want to restrict that. We can probably support both modes.

          "no guarantees on partially written objects"

          The object will not be visible until completely written. Also, no recovery is planned for the first phase if the write fails. In future, we would like to support multi-part uploads.

          Re-using block management implementation for container management.

          We intend to reuse the DatanodeProtocol that datanode uses to talk to namenode. I will add more details to the document and on the corresponding jira.

          storage container prototype using leveldbjni

          We will add a lot more details on this in its own jira. The idea is to use leveldbjni in the storage container on the datanodes. We plan to prototype a storage container that stores objects as individual files within the container; however, that would need an index within the container to map a key to a file. We will use leveldbjni for that index (a rough sketch follows below).
          Another possible prototype is to put the entire object into leveldbjni itself. It will take some experimentation to zero in on the right approach. We will also try to make the storage container implementation pluggable to make it easy to try different implementations.
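          A minimal leveldbjni sketch of that key-to-file index. The container path and key layout below are assumptions for illustration, not the actual container format:

          import java.io.File;
          import java.io.IOException;
          import org.iq80.leveldb.DB;
          import org.iq80.leveldb.Options;
          import static org.fusesource.leveldbjni.JniDBFactory.asString;
          import static org.fusesource.leveldbjni.JniDBFactory.bytes;
          import static org.fusesource.leveldbjni.JniDBFactory.factory;

          public class ContainerIndexSketch {
            public static void main(String[] args) throws IOException {
              Options options = new Options();
              options.createIfMissing(true);
              // Hypothetical per-container index living next to the object files on a datanode.
              DB index = factory.open(new File("/data/containers/c-0001/index"), options);
              try {
                // Map an object key to the file inside the container that holds its data.
                index.put(bytes("bucket1/key1"), bytes("/data/containers/c-0001/objects/0000042"));
                String file = asString(index.get(bytes("bucket1/key1")));
                System.out.println("data file for bucket1/key1: " + file);
              } finally {
                index.close();
              }
            }
          }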

          How are quotas enabled and set? ....who enforces them

          All the Ozone APIs are implemented in ozone handler. The quota will also be enforced by the ozone handler. I will update the document with the APIs.

          cmccabe Colin P. McCabe added a comment -

          This looks like a really interesting way to achieve a scalable blob store using some of the infrastructure we already have in HDFS. It could be a good direction for the project to go in.

          We should have a meeting to review the design and talk about how it fits in with the rest of what's going on in HDFS-land. Perhaps we could have a webex on the week of May 25th or June 1? (I am going to be out of town next week so I can't do next week.)

          jnp Jitendra Nath Pandey added a comment -

          That's a good idea Colin P. McCabe, I will setup a webex in the week of June 1st. Does June 3rd, 1pm to 3pm PST sound ok? I will update with the details.

          cmccabe Colin P. McCabe added a comment -

          Thanks, Jitendra Nath Pandey. June 3rd at 1pm to 3pm sounds good to me. I believe Aaron T. Myers and Zhe Zhang will be able to attend as well.

          kanaka Kanaka Kumar Avvaru added a comment -

          Very interesting to follow, Jitendra Nath Pandey; we also have requirements to support trillions of small objects/files.
          We would be interested in contributing to Ozone development. Can you please invite me to the webex meeting as well?

          For now I have a few comments on this project:

          In practice, partitioning may be difficult for the storage layer to control alone, as the distribution depends on how applications construct keys.
          So a bucket partitioner class could be an input when creating a bucket, so that applications can handle partitioning well.

          Object-level metadata such as tags/labels would be required, which computing jobs can use as additional info (similar to xattrs on a file).

          What is the plan for leveldbjni content file persistence, and is any concept like a WAL planned for reliability?
          When and how will the leveldbjni content be replicated?

          As millions of buckets are expected, is partitioning of buckets also required, based on volume name?

          Swift and AWS S3 support object versions and replace. Does Ozone also plan for the same?

          Missing features like multi-part upload, heavy object/storage space splits, etc. could also be pooled into the coming phases (maybe phase 2 or later).

          We could also add readable snapshots of a bucket to the feature queue (maybe at a later stage of the project).

          As part of transparent encryption, an encryption zone at the bucket level could be an expectation from applications.

          jnp Jitendra Nath Pandey added a comment -

          Webex:
          https://hortonworks.webex.com/meet/jitendra

          1-650-479-3208 Call-in toll number (US/Canada)
          1-877-668-4493 Call-in toll-free number (US/Canada)
          Access code: 623 433 021​

          Time: 6/3/2015, 1pm to 3pm

          thodemoor Thomas Demoor added a comment -

          Maybe some of the (ongoing) work for the currently supported object stores can be reused here (f.i. HADOOP-9565)? I will probably call in.

          cmccabe Colin P. McCabe added a comment -

          Hi Jitendra Nath Pandey,

          Thanks for organizing the webex on Wednesday. Are you planning on having an in-person meetup, or just the webex? We can offer to host if that is convenient (or visit your office in Santa Clara).

          jnp Jitendra Nath Pandey added a comment -

          I was planning to do it on webex only, but you are welcome to come over to our office.
          Ask for Jitendra Pandey at the lobby.

          5470 Great America Pkwy,
          Santa Clara, CA 95054

          cmccabe Colin P. McCabe added a comment -

          Thanks, Jitendra. I think it makes sense to meet up, since we're in the area.

          thodemoor Thomas Demoor added a comment -

          Very interesting call yesterday. Might be interesting to have a group discussion at Hadoop Summit next week?

          lars_francke Lars Francke added a comment -

          Would anyone mind posting a quick summary of the call, for documentation and for those of us who couldn't join? Thank you!

          jnp Jitendra Nath Pandey added a comment -

          I will post a summary shortly.

          jnp Jitendra Nath Pandey added a comment -

          The call started with a high-level description of object stores, the motivations, and the design approach as covered in the architecture document.
          The following points were discussed in detail:

          1. 3-level namespace (storage volumes, buckets and keys) vs. 2-level namespace (buckets and keys)
            • Storage volumes are created by admins and provide admin controls such as quota. Buckets are created and managed by users.
              Since HDFS doesn't have a separate notion of user accounts as in S3 or Azure, the storage volume allows admins to set policies.
            • The argument in favor of the 2-level scheme was that organizations typically have very few buckets and users organize their data within the buckets. The admin controls can then be set at the bucket level.
          2. Is it exactly the S3 API? It would be good to be able to migrate easily from S3 to Ozone.
            • The storage volume concept does not exist in S3. In Azure, accounts are part of the URL; Ozone URLs look similar to Azure's, with a storage volume in place of the account name.
            • We will publish a more detailed spec including headers, authorization semantics, etc. We will try to follow S3 closely.
          3. HTTP/2
            • There is already a jira in Hadoop for HTTP/2. We should evaluate supporting HTTP/2 as well.
          4. OzoneFileSystem: a Hadoop file system implementation on top of Ozone, similar to S3FileSystem.
            • It will not support rename.
            • This was only briefly mentioned.
          5. Storage Container Implementation
            • Storage containers must support efficient replication. Replication by key-object enumeration would be too inefficient. RocksDB is a promising choice as it provides features for live replication, i.e. replication while the container is being written. In the architecture document we talked about leveldbjni; RocksDB is similar and provides additional features and a Java binding as well.
            • If a datanode dies and some of its containers lag in generation stamp, those containers will be discarded. Since containers are much larger than typical HDFS blocks, this will be a lot more inefficient. An important optimization is to allow stale containers to catch up to the current state.
            • To support a large range of object sizes, a hybrid model may be needed: store small objects in RocksDB, but large objects as files with their file paths in RocksDB (a minimal sketch follows this list).
            • Colin suggested Linux sparse files.
            • We are working on a prototype.
          6. Ordered listing with read-after-write semantics might be an important requirement. In the hash partitioning scheme that would need consistent secondary indexes, or range partitioning should be used instead. This needs to be investigated.
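          A minimal sketch of the hybrid model from point 5, using the RocksDB Java binding mentioned above. The 1 MB cutoff, the paths and the "@"-prefix convention are assumptions for illustration only, not part of the design:

          import java.io.IOException;
          import java.nio.charset.StandardCharsets;
          import java.nio.file.Files;
          import java.nio.file.Path;
          import java.nio.file.Paths;
          import org.rocksdb.Options;
          import org.rocksdb.RocksDB;
          import org.rocksdb.RocksDBException;

          public class HybridContainerSketch {
            private static final int INLINE_LIMIT = 1 << 20; // assumed 1 MB small-object cutoff

            private final RocksDB db;
            private final Path objectDir;

            public HybridContainerSketch(String containerDir) throws RocksDBException, IOException {
              RocksDB.loadLibrary();
              this.objectDir = Files.createDirectories(Paths.get(containerDir, "objects"));
              this.db = RocksDB.open(new Options().setCreateIfMissing(true),
                                     Paths.get(containerDir, "meta").toString());
            }

            /** Small values go inline into RocksDB; large values go to a file, with the path in RocksDB. */
            public void put(String key, byte[] value) throws RocksDBException, IOException {
              byte[] k = key.getBytes(StandardCharsets.UTF_8);
              if (value.length <= INLINE_LIMIT) {
                db.put(k, value);
              } else {
                Path file = Files.write(objectDir.resolve(Integer.toHexString(key.hashCode())), value);
                db.put(k, ("@" + file).getBytes(StandardCharsets.UTF_8)); // "@" marks an indirect entry
              }
            }

            /** A real implementation would tag entries unambiguously; "@" is just for this sketch. */
            public byte[] get(String key) throws RocksDBException, IOException {
              byte[] v = db.get(key.getBytes(StandardCharsets.UTF_8));
              if (v == null || v.length == 0 || v[0] != '@') {
                return v; // stored inline (or missing)
              }
              return Files.readAllBytes(Paths.get(new String(v, 1, v.length - 1, StandardCharsets.UTF_8)));
            }
          }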

          I will follow up on these points and update the design doc.

          It was a great discussion with many valuable points raised. Thanks to everyone who attended.

          john.jian.fang Jian Fang added a comment -

          Sorry for jumping in. I saw the following statement and wonder why you need to follow S3FileSystem closely.

          "OzoneFileSystem: Hadoop file system implementation on top of ozone, similar to S3FileSystem.

          • It will not support rename"

          Not supporting rename could cause a lot of trouble because Hadoop uses it to achieve something like a "two-phase commit" in many places, for example FileOutputCommitter. What would you do for such use cases? Add extra logic to the Hadoop code, or copy the data to the final destination? The latter could be very expensive, by the way.

          I am not very clear about the motivations here. Shouldn't some of these features already be covered by HBase? Would this feature make HDFS too fat and difficult to manage? Or is this on top of HDFS, just like HBase? Also, how do you handle small objects in the object store?

          john.jian.fang Jian Fang added a comment -

          One more question: how would you handle object partitioning? HDFS federation is more like a manual partitioning process to me. You probably need some dynamic partitioning mechanism similar to HBase's region concept. Data consistency is another big issue here.

          khanderao khanderao added a comment -

          Great initiative! Jitendra and team.

          1. What are the assumptions about data locality? Typical cloud storage is not a candidate for computation, hence hardware choices are made differently for storage-centric vs. compute-centric vs. typical Hadoop nodes (typically compute + storage).

          2. Can a datanode be exclusively configured for either the object store or HDFS? I assume yes.

          andrew.wang Andrew Wang added a comment -

          I see some JIRAs related to volumes; did we resolve the question of the 2-level vs. 3-level scheme? Based on Colin's (and my own) experiences using S3, we did not feel the need for users to be able to create buckets, which seemed to be the primary motivation for volume -> bucket. Typically users also create their own hierarchy under the bucket anyway, and prefix scans become important then.

          My main reason for asking is to make the API as simple and as similar to S3 as possible, which should help with porting applications.

          jnp Jitendra Nath Pandey added a comment -

          Jian Fang, thanks for review and comments
          1) There is some work going on to support YARN on S3 (HADOOP-11262) by Thomas Demoor and Steve Loughran. We hope we can leverage that to support the MapReduce use case.
          2) This will be part of HDFS natively, unlike HBase, which is a separate service. Only one additional daemon (the storage container manager) is needed, and only if Ozone is deployed. HDFS already supports multiple namespaces and block pools, and we will leverage that work. This work will not impact HDFS manageability in any way. There will be some additional code in the datanode to support storage containers. In the future, we hope to use containers to store HDFS blocks as well, for HDFS block-space scalability.
          3) We will partition objects using hash partitioning and range partitioning. The document already talks about hash partitioning; I will add more details for range partition support. Small objects will also be stored in the container. In the document I mentioned leveldbjni, but we are also looking at RocksDB for the container implementation to store objects in the container.
          4) The containers will be replicated and kept consistent using a data pipeline, similar to HDFS blocks. The document talks about it at a high level. We will update more details in the related jira.

          jnp Jitendra Nath Pandey added a comment -

          khanderao, thanks for the review.
          1) Ozone stores data on the datanodes themselves, so it can provide locality for computations running on the datanodes. The hardware can be chosen based on the use case and the computational needs on the datanodes.
          2) Of course, it would be possible to have a dedicated object store or HDFS deployment, with datanodes configured to talk to the respective block pool.

          jnp Jitendra Nath Pandey added a comment -

          For prefix scans we are planning to support range partitioned index in the storage container manager. I will update the design document. That will allow users to have hierarchy under the bucket.

          The storage volumes do not really change the semantics that much from S3. For a URL like http://host:port/volume/bucket/key, apart from the volume prefix, the semantics are very similar. I believe having a notion of 'admin-created volumes' allows us to have buckets that are very similar to S3's, as it provides clear domains of admin control vs. user control. Volumes can be viewed as accounts, which are kind of similar to accounts in S3 or Azure.
          It would be a bigger deviation if we insisted that only admins can create buckets. It is useful to organize data within buckets using prefixes, but having that as the only mechanism for organizing data seems too restrictive.
          For similarity to S3, we will try to have similar headers and auth semantics as well, which are also very important for portability.

          thodemoor Thomas Demoor added a comment -

          Jian Fang and Jitendra Nath Pandey:

          • Avoiding rename happens in HADOOP-9565 by introducing ObjectStore (extends Filesystem) and letting FileOutputCommitter, Hadoop CLI, ... act on this (by avoiding rename). Ozone could easily extend ObjectStore and benefit from this.
          • HADOOP-11262 extends DelegateToFileSystem to implement s3a as an AbstractFileSystem and works around issues as modification times for directories (cfr. Azure).
          john.jian.fang Jian Fang added a comment -

          Thanks for all your explanations; however, I think you missed my points. Doable and performant are two different things. From my own experience with S3 and the S3 native file system, the most costly operations are listing keys and copying data from one bucket to another to simulate a rename. The former takes a very long time for a bucket with millions of objects, and the latter has a double performance penalty, i.e., if your objects total 1TB, you actually upload almost 2TB of data to S3. That is why fast key listing and a native, fast rename are two of the most desirable features for S3.

          Before you decide to follow the S3N API, I would suggest you actually test the performance of S3N and learn what is good and what is bad about it. Why follow the bad parts at all?

          It is still not very clear to me how you guarantee your partitions are balanced. HBase uses region auto-split to achieve that, which is also my concern: the code and logic would grow rapidly as your object store matures. In my personal opinion, it is better to build the object store on top of HDFS and keep HDFS simple.

          cmccabe Colin P. McCabe added a comment -

          3) We will partition objects using hash partitioning and range partitioning. The document already talks about hash partitioning, I will add more details for range partition support. The small objects will also be stored in the container. In the document I mentioned leveldbjni, but we are also looking at rocksDB for container implementation to store objects in the container.

          Interesting, thanks for posting more details about this.

          Maybe this has been discussed on another JIRA (I apologize if so), but does this mean that the admin will have to choose between hash and range partitioning for a particular bucket? This seems suboptimal to me... we will have to support both approaches, which is more complex, and admins will be left with a difficult choice.

          It seems better just to make everything range-partitioned. Although this is more complex than simple hash partitioning, it provides "performance compatibility" with s3 and other object stores. s3 provides a fast (sub-linear) way of getting all the keys in between some A and B. It will be very difficult to really position ozone as s3-compatible if operations that are quick in s3 such as listing all the keys between A and B are O(num_keys^2) in ozone.

          For example, in (all of) Hadoop's s3 filesystem implementations, listStatus uses this quick listing of keys between A and B. When someone does "listStatus /a/b/c", we can ask s3 for all the keys between /a/b/c/ and /a/b/c0 (0 is the ASCII value right after slash). Of course, s3 does not really have directories, but we can treat the keys in this range as being in the directory /a/b/c for the purposes of s3a or s3n. If we just had hash partitioning, this kind of operation would be O(N^2) where N is the number of keys. It would just be infeasible for any large bucket.
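          To make the listing argument concrete, here is a minimal, self-contained sketch of the range scan described above, using an in-memory sorted map as a stand-in for a range-partitioned index (this is not Ozone or S3 code):

          import java.util.NavigableMap;
          import java.util.TreeMap;

          public class RangeListingSketch {
            public static void main(String[] args) {
              // Stand-in for a range-partitioned index: object keys kept in sorted order.
              NavigableMap<String, Long> keys = new TreeMap<>();
              keys.put("/a/b/c/part-00000", 100L);
              keys.put("/a/b/c/part-00001", 200L);
              keys.put("/a/b/x/unrelated", 300L);

              // "listStatus /a/b/c" becomes a scan of [ "/a/b/c/", "/a/b/c0" ),
              // where '0' is the ASCII character immediately after '/'.
              String dir = "/a/b/c";
              NavigableMap<String, Long> listing = keys.subMap(dir + "/", true, dir + "0", false);
              listing.forEach((key, len) -> System.out.println(key + " (" + len + " bytes)"));

              // Under pure hash partitioning there is no equivalent sub-linear scan:
              // every key in the bucket would have to be enumerated and filtered.
            }
          }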

          cmccabe Colin P. McCabe added a comment -

          Jitendra Nath Pandey, Sanjay Radia, did we come to a conclusion about range partitioning?

          ozawa Tsuyoshi Ozawa added a comment -

          Two branches, HDFS-7240 and hdfs-7240, seem to have been created in the repository:

          $ git pull
          From https://git-wip-us.apache.org/repos/asf/hadoop
          [new branch]      HDFS-7240 -> origin/HDFS-7240
          [new branch]      hdfs-7240 -> origin/hdfs-7240

          Is this intentional?

          cnauroth Chris Nauroth added a comment -

          The upper-case one is the correct one. The lower-case one was created by mistake. We haven't been able to delete it yet though because of the recent ASF infrastructure changes to prohibit force pushes.

          anu Anu Engineer added a comment -

          Please see INFRA-10720 for the status of branch deletes. I am presuming that sooner or later we will be able to delete the lower-case "hdfs-7240".

          ozawa Tsuyoshi Ozawa added a comment -

          Thank you for following up, Anu and Chris.

          anu Anu Engineer added a comment -

          Proposed user interfaces for Ozone. Documentation of REST and CLI interfaces.

          andrew.wang Andrew Wang added a comment -

          Hi all, I had the opportunity to hear more about Ozone at Apache Big Data, and chatted with Anu afterwards. Quite interesting, I learned a lot. Thanks Anu for the presentation and fielding my questions.

          I'm re-posting my notes and questions here. Anu said he'd be posting a new design doc soon to address my questions.

          Notes:

          • Key Space Manager and Storage Container Manager are the "master" services in Ozone, and are the equivalent of FSNamesystem and the BlockManager in HDFS. Both are Raft-replicated services. There is a new Raft implementation being worked on internally.
          • The block container abstraction is a mutable range of KV pairs. It's essentially a ~5GB LevelDB for metadata + on-disk files for the data. Container metadata is replicated via Raft. Container data is replicated via chain replication.
          • Since containers are mutable and the replicas are independent, the on-disk state will be different. This means we need to do logical rather than physical replication.
          • Container data is stored as chunks, where a chunk is maybe 4-8MB. Chunks are immutable. Chunks are a (file, offset, length) triplet. Currently each chunk is stored as a separate file.
          • Use of copysets to reduce the risk of data loss due to independent node failures.

          Questions:

          • My biggest concern is that erasure coding is not a first-class consideration in this system, and seems like it will be quite difficult to implement. EC is table stakes in the blobstore world, it's implemented by all the cloud blobstores I'm aware of (S3, WASB, etc). Since containers are mutable, we are not able to erasure-code containers together, else we suffer from the equivalent of the RAID-5 write hole. It's the same issue we're dealing with on HDFS-7661 for hflush/hsync EC support. There's also the complexity that a container is replicated to 3 nodes via Raft, but EC data is typically stored across 14 nodes.
          • Since LevelDB is being used for metadata storage and separately being replicated via Raft, are there concerns about metadata write amplification?
          • Can we re-use the QJM code instead of writing a new replicated log implementation? QJM is battle-tested, and consensus is a known hard problem to get right.
          • Are there concerns about storing millions of chunk files per disk? Writing each chunk as a separate file requires more metadata ops and fsyncs than appending to a file. We also need to be very careful to never require a full scan of the filesystem. The HDFS DN does full scans right now (DU, volume scanner).
          • Any thoughts about how we go about packing multiple chunks into a larger file?
          • Merges and splits of containers. We need nice large 5GB containers to hit the SCM scalability targets. However, I think we're going to have a harder time with this than a system like HBase. HDFS sees a relatively high delete rate for recently written data, e.g. intermediate data in a processing pipeline. HDFS also sees a much higher variance in key/value size. Together, these factors mean Ozone will likely be doing many more merges and splits than HBase to keep the container size high. This is concerning since splits and merges are expensive operations, and based on HBase's experience, are hard to get right.
          • What kind of sharing do we get with HDFS, considering that HDFS doesn't use block containers, and the metadata services are separate from the NN? not shared?
          • Any thoughts on how we will transition applications like Hive and HBase to Ozone? These apps use rename and directories for synchronization, which are not possible on Ozone.
          • Have you experienced data loss from independent node failures, thus motivating the need for copysets? I think the idea is cool, but the RAMCloud network hardware expectations are quite different from ours. Limiting the set of nodes for re-replication means you have less flexibility to avoid top-of-rack switches and decreased parallelism. It's also not clear how this type of data placement meshes with EC, or the other quite sophisticated types of block placement we currently support in HDFS.
          • How do you plan to handle files larger than 5GB? Large files right now are also not spread across multiple nodes and disks, limiting IO performance.
          • Are all reads and writes served by the container's Raft master? IIUC that's how you get strong consistency, but it means we don't have the same performance benefits we have now in HDFS from 3-node replication.

          I also ask that more of this information and decision making be shared on public mailing lists and JIRA. The KSM is not mentioned in the architecture document, nor is the fact that the Ozone metadata is being replicated via Raft rather than stored in containers. I was not aware that there is already progress internally at Hortonworks on a Raft implementation. We've previously expressed interest in being involved in the design and implementation of Ozone, but we can't meaningfully contribute if this work is being done privately.

          anu Anu Engineer added a comment -

          Andrew Wang Thank you for showing up at the talk and for a very interesting follow-up conversation. I am glad that we are continuing it on this JIRA. Just to make sure that others who might read our discussion get the right context, here are the slides from the talk:

          http://schd.ws/hosted_files/apachebigdata2016/fc/Hadoop%20Object%20Store%20-%20Ozone.pdf

          Show
          anu Anu Engineer added a comment - Andrew Wang Thank you for showing up at the talk and having a very interesting follow up conversation. I am glad that we are continuing that on this JIRA. Just to make sure that others who might read our discussion gets the right context, Here are the slides from the talk http://schd.ws/hosted_files/apachebigdata2016/fc/Hadoop%20Object%20Store%20-%20Ozone.pdf
          Hide
          anu Anu Engineer added a comment -

          Andrew Wang Thank you for your comments; they are well thought out and extremely valuable questions. I will make sure that all the areas you are asking about are discussed in the next update of the design doc.

          Anu said he'd be posting a new design doc soon to address my questions.

          I am working on that, but just to make sure your questions are not lost in the big picture of the design doc, I am answering them individually here.

          My biggest concern is that erasure coding is not a first-class consideration in this system.

          Nothing in ozone prevents a chunk from being EC encoded. In fact, ozone makes no assumptions about the location or the types of chunks at all, so it is quite trivial to create a new chunk type and write it into containers. We are focused on the overall picture of ozone right now, and I would welcome any contribution you can make on EC and ozone chunks if that is a concern you would like us to address earlier. From the architecture point of view I do not see any issues.

          Since LevelDB is being used for metadata storage and separately being replicated via Raft, are there concerns about metadata write amplification?

          Metadata is such a small slice of the information for a block – really what you are saying is that the block name and the hash for the block get written twice, once through the Raft log and a second time when Raft commits this information. Since the data we are talking about is so small, I am not worried about it at all.
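
          As a rough back-of-the-envelope sketch of why that amplification is negligible (the sizes below are my own assumptions for illustration, not numbers from the design doc):

          // Hypothetical sizes: a per-key metadata entry is roughly a name plus a hash,
          // and it is written twice (Raft log append + state-machine commit).
          public final class WriteAmplification {
              public static void main(String[] args) {
                  long keyNameBytes = 64;               // assumed average key name size
                  long hashBytes = 32;                  // e.g. a SHA-256 digest
                  long metadataWrites = 2;              // log + commit
                  long metadataBytes = (keyNameBytes + hashBytes) * metadataWrites;

                  long objectBytes = 4L * 1024 * 1024;  // a single 4 MB data chunk

                  System.out.printf("metadata written per key: %d bytes%n", metadataBytes);
                  System.out.printf("metadata / data ratio: %.4f%%%n",
                      100.0 * metadataBytes / objectBytes);  // roughly 0.005%
              }
          }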

          Can we re-use the QJM code instead of writing a new replicated log implementation? QJM is battle-tested, and consensus is a known hard problem to get right.

          We considered this; however, the consensus was to write a consensus implementation that is easier to understand and makes it easy for more contributors to work on it. The fact that QJM was not written as a library makes it very hard for us to pull it out in a clean fashion. Again, if you feel very strongly about it, please feel free to move QJM into a library that can be reused, and all of us will benefit from it.

          Are there concerns about storing millions of chunk files per disk? Writing each chunk as a separate file requires more metadata ops and fsyncs than appending to a file. We also need to be very careful to never require a full scan of the filesystem. The HDFS DN does full scans right now (DU, volume scanner).

          Nothing in the chunk architecture assumes that chunk files are separate files. The fact that a chunk is a triplet {FileName, Offset, Length} gives you the flexibility to store 1000s of chunks in a physical file.

          Any thoughts about how we go about packing multiple chunks into a larger file?

          Yes, write the first chunk and then write the second chunk to the same file. In fact, chunks are specifically designed to address the small-file problem, so two keys can point to the same file. For example:

          KeyA -> {File, 0, 100}
          KeyB -> {File, 101, 1000}

          is a perfectly valid layout under the container architecture.
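
          The same idea as a minimal sketch (the types and file name below are hypothetical illustrations, not actual Ozone classes):

          // Hypothetical illustration of the {FileName, Offset, Length} chunk triplet.
          import java.util.HashMap;
          import java.util.Map;

          public final class ChunkLayoutExample {
              // A chunk is just a pointer into some physical file.
              record Chunk(String fileName, long offset, long length) {}

              public static void main(String[] args) {
                  Map<String, Chunk> keyTable = new HashMap<>();
                  // Two keys packed back-to-back into the same container file.
                  keyTable.put("KeyA", new Chunk("container-0001.data", 0, 100));
                  keyTable.put("KeyB", new Chunk("container-0001.data", 101, 1000));
                  keyTable.forEach((k, c) -> System.out.println(k + " -> " + c));
              }
          }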

          Merges and splits of containers. We need nice large 5GB containers to hit the SCM scalability targets. Together, these factors mean Ozone will likely be doing many more merges and splits than HBase to keep the container size high

          Ozone actively tries to avoid merges and tries to split only when needed. A container can be thought of as a really large block, so I am not sure that I am going to see anything other than the standard block workload on containers. The fact that containers can be split is something that allows us to avoid pre-allocation of container space. That is merely a convenience, and if you think of these as blocks, you will see that it is very similar.

          Ozone will never try to do merges and splits at the HBase level. From the container and ozone perspective we are more focused on a good data distribution across the cluster – aka what the balancer does today – and containers are a flat namespace, just like blocks, which we allocate when needed.

          So once more – just to make sure we are on the same page – merges are rare (not generally required), and splits happen if we want to re-distribute data on the same machine.

          What kind of sharing do we get with HDFS, considering that HDFS doesn't use block containers, and the metadata services are separate from the NN? not shared?

          Great question. We initially started off by attacking the scalability question of ozone and soon realized that HDFS scalability and ozone scalability have to solve the same problems. So the container infrastructure that we have built is something that can be used by both ozone and HDFS. Currently we are focused on ozone, and containers will co-exist on datanodes with block pools. That is, ozone should be, and will be, deployable on a vanilla HDFS cluster. In the future, if we want to scale HDFS, containers might be an easy way to do it.

          Any thoughts on how we will transition applications like Hive and HBase to Ozone? These apps use rename and directories for synchronization, which are not possible on Ozone.

          These applications are written with the assumption of a POSIX file system, so migrating them to Ozone does not make much sense. The work we are doing in ozone, especially the container layer, is useful in HDFS, and these applications might benefit indirectly.

          Have you experienced data loss from independent node failures, thus motivating the need for copysets?

          Yes, we have seen this issue in the real world. We had to pick an allocation algorithm, and copysets offered a set of advantages with very minimal work; hence the allocation choice. If you look at the paper you will see this was done on HDFS itself with very minimal cost, and this work originally came from Facebook. So it is useful in both normal HDFS and in RAMCloud.

          It's also not clear how this type of data placement meshes with EC, or the other quite sophisticated types of block placement we currently support in HDFS

          The chunks will support remote blocks. That is a notion we did not discuss due to time constraints in the presentation. In the presentation we just showed chunks as files on the same machine, but a chunk could exist on an EC block and a key could point to it.

          How do you plan to handle files larger than 5GB?

          Large files will live on a set of containers; again, the easiest way to reason about containers is to think of them as blocks. In other words, just like normal HDFS.

          Large files right now are also not spread across multiple nodes and disks, limiting IO performance.

          Are you saying that when you write a file of, say, 256 MB you would be constrained to the same machine since the block size is much larger? That is why we have chunks. With chunks we can easily distribute this data and leave pointers to those chunks. I agree with the concern, and we might have to tune the placement algorithms once we see more real-world workloads.

          Are all reads and writes served by the container's Raft master? IIUC that's how you get strong consistency, but it means we don't have the same performance benefits we have now in HDFS from 3-node replication.

          Yes, and I don't think it will be a bottleneck, since we are only reading and writing metadata – which is very small – and data can be read from any machine.

          I also ask that more of this information and decision making be shared on public mailing lists and JIRA.

          Absolutely, we have been working actively and posting code patches to the ozone branch. Feel free to ask us anything. It is kind of sad that we have to wait for an ApacheCon for someone to ask me these questions. I would encourage you to follow the JIRAs tagged with ozone to keep up with what is happening in the ozone world. I can assure you that all the work we are doing is always tagged with "ozone:" so that when you get the mail from JIRA it is immediately visible.

          The KSM is not mentioned in the architecture document.

          The namespace management component did not have a separate name; it was called SCM in the original architecture doc. Thanks to Arpit Agarwal – he came up with this name (just as he did with ozone) and argued that we should make it separate, which will make it easier to understand and maintain. If you look at the code you will see the KSM functionality is currently in SCM. I am planning to move it out to KSM.

          the fact that the Ozone metadata is being replicated via Raft rather than stored in containers.

          Metadata is stored in containers, but containers need replication. Replication via Raft was something the design evolved to, and we are surfacing that to the community. Like all proposals under consideration, we will update the design doc and listen to community feedback. You just had a chance to hear our design a little earlier, in person, since we got a chance to meet. Writing documents takes far more energy and time than chatting face to face. We will update the design docs soon.

          I was not aware that there is already progress internally at Hortonworks on a Raft implementation.

          We have been prototyping various parts of the system, including KSM, SCM, and other parts of Ozone.

          We've previously expressed interest in being involved in the design and implementation of Ozone, but we can't meaningfully contribute if this work is being done privately.

          I feel that this is a completely uncalled-for statement / misplaced sentiment. We have 54 JIRAs on ozone so far. You are always welcome to ask questions or comment. I have personally posted a 43-page document explaining how ozone would look to users, how they can manage it, and the REST interfaces offered by Ozone. Unfortunately I have not seen much engagement. While I am sorry that you feel left out, I do not see how you can attribute it to the engineers who work on ozone. We have been more than liberal about sharing internals, including this specific presentation we are discussing here. So I am completely lost when you say that this work is being done privately. In fact, I would ask you to be active on the ozone JIRAs, start by asking questions, and make comments (just like you did with this one, except for this last comment), and I would love to work with you on ozone. It is never our intent to abandon our fellow contributors in this journey, but unless you participate in the JIRAs it is very difficult for us to know that you have more than a fleeting interest.

          andrew.wang Andrew Wang added a comment -

          Thanks for the reply Anu, I'd like to follow up on some points.

          Nothing in ozone prevents a chunk being EC encoded. In fact ozone makes no assumptions about the location or the types of chunks at all...

          The chunks will support remote blocks...

          My understanding was that the SCM was the entity responsible for the equivalent of BlockPlacementPolicy, and doing it on containers. It sounds like that's incorrect, and each container is independently doing chunk placement. That raises a number of questions:

          • How are we coordinating distributed data placement and replication? Are all containers heartbeating to other containers to determine liveness? Giving up global coordination of replication makes it hard to do throttling and control use of top-of-rack switches. It also makes it harder to understand the operation of the system.
          • Aren't 4MB chunks a rather small unit for cross-machine replication? We've been growing the HDFS block size over the years as networks get faster, since it amortizes overheads.
          • Does this mean also we have a "chunk report" from the remote chunk servers to the master?

          I also still have the same questions about mutability of an EC group requiring the parities to be rewritten. How are we forming and potentially rewriting EC groups?

          The fact that QJM was not written as a library makes it very hard for us to pull it out in a clean fashion. Again if you feel very strongly about it, please feel free to move QJM to a library which can be reused and all of us will benefit from it.

          I don't follow the argument that a new consensus implementation is more understandable than the one we've been supporting and using for years. Working with QJM, and adding support for missing functionality like multiple logs and dynamic quorum membership, would also have benefits in HDFS.

          I'm also just asking questions here. I'm not required to refactor QJM into a library to discuss the merits of code reuse.

          Nothing in the chunk architecture assumes that chunk files are separate files. The fact that a chunk is a triplet {FileName, Offset, Length} gives you the flexibility to store 1000s of chunks in a physical file.

          Understood, but in this scenario how do you plan to handle compaction? We essentially need to implement mutability on immutability. The traditional answer here is an LSM tree, a la Kudu or HBase. If this is important, it really should be discussed.

          One easy option would be storing the data in LevelDB as well. I'm not sure about the performance though, and it also doesn't blend well with the mutability of EC groups.

          So once more – just make sure we are on the same page – Merges are rare(not required generally) and splits happen if we want to re-distribute data on a same machine.

          I think I didn't explain my two points thoroughly enough. Let me try again:

          The first problem is the typical write/delete pattern for a system like HDFS. IIUC in Ozone, each container is allocated a contiguous range of the keyspace by the KSM. As an example, perhaps the KSM decides to allocate the range (i,j] to a container. Then, the user decides to kick off a job that writes a whole bunch of files with the format ingest/file_N. Until we do a split, all those files land in that (i,j] container. So we split. Then, it's common for ingested data to be ETL'd and deleted. If we split earlier, that means we now have a lot of very small containers. This kind of hotspotting is less common in HBase, since DB users aren't encoding this type of nested structure in their keys.
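
          A toy sketch of that hotspotting pattern (the range boundaries and key names are made up for illustration, not taken from Ozone):

          // Toy illustration: every key under a common prefix sorts into one range partition.
          public final class RangeHotspot {
              public static void main(String[] args) {
                  String rangeStart = "i";   // exclusive lower bound of the container's key range
                  String rangeEnd   = "j";   // inclusive upper bound

                  int inRange = 0;
                  for (int n = 0; n < 1000; n++) {
                      String key = "ingest/file_" + n;
                      // All of these keys sort between "i" and "j", so they all hit the
                      // same container until it is split.
                      if (key.compareTo(rangeStart) > 0 && key.compareTo(rangeEnd) <= 0) {
                          inRange++;
                      }
                  }
                  System.out.println(inRange + " of 1000 keys land in the (i, j] container");
              }
          }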

          The other problem is that files can be pretty big. 1GB is common for data warehouses. If we have a 5GB container, a few deletes could quickly drop us below that target size. Similarly, a few additions can quickly raise us past it.

          Would appreciate an answer in light of the above concerns.

          So the container infrastructure that we have built is something that can be used by both ozone and HDFS...In future, if we want to scale HDFS, containers might be an easy way to do it.

          This sounds like a major refactoring of HDFS. We'd need to start by splitting the FSNS and BM locks, which is a massive undertaking, and possibly incompatible for operations like setrep. Moving the BM across an RPC boundary is also a demanding task.

          I think a split FSN / BM is a great architecture, but it's also something that has been attempted unsuccessfully a number of times in the community.

          These applications are written with the assumption of a Posix file system, so migrating them to Ozone does not make much sense.

          If we do not plan to support Hive and HBase, what is the envisioned set of target applications for Ozone?

          Yes, we have seen this issue in real world.

          That's a very interesting datapoint. Could you give any more details, e.g. circumstances and frequency? I asked around internally, and AFAIK we haven't encountered this issue before.

          It is kind of sad that we have to wait for an apachecon for someone to ask me these questions.

          I feel that this is a completely uncalled for statement / misplaced sentiment here. We have 54 JIRAs on ozone so far. You are always welcome to ask questions or comment.

          We have been more than liberal about the sharing internals including this specific presentation we are discussing here. So I am completely lost when you say that this work is being done privately.

          but unless you participate in JIRAs it is very difficult for us to know that you have more than a fleeting interest.

          As I said in my previous comment, it's very hard for the community to contribute meaningfully if the design is being done internally. I wouldn't have had the context to ask any of the above questions. None of the information about the KSM, Raft, or remote chunks has been mentioned on JIRA. For all I knew, we were still progressing down the design proposed in the architecture doc.

          I think community interest in Ozone has also been very clearly expressed. Speaking for myself, you can see my earlier comments on this JIRA, as well as my attendance at previous community phone calls. Really though, even if the community hadn't explicitly expressed interest, all of this activity should still have been done in a public forum. It's very hard for newcomers to ramp up unless design discussions are being done publicly.

          This "newcomer" issue is basically what's happening right now with our conversation. I'm sure you've already discussed many of the points I'm raising now with your colleagues. I actually would be delighted to commit my time and energy to Ozone development if I believed it's the solution to our HDFS scalability issues. However, since this work is going on internally at HWX, it's hard for me to assess, assist, or advocate for the project.
