Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0-alpha1
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed
    • Release Note:
      HDFS now provides native support for erasure coding (EC) to store data more efficiently. Each individual directory can be configured with an EC policy with command `hdfs erasurecode -setPolicy`. When a file is created, it will inherit the EC policy from its nearest ancestor directory to determine how its blocks are stored. Compared to 3-way replication, the default EC policy saves 50% of storage space while also tolerating more storage failures.

      To support small files, the current phase of HDFS-EC stores blocks in _striped_ layout, where a logical file block is divided into small units (64KB by default) and distributed to a set of DataNodes. This enables parallel I/O but also decreases data locality. Therefore, the cluster environment and I/O workloads should be considered before configuring EC policies.
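      The cell-to-DataNode mapping described above can be sketched as follows. This is an illustrative model, not HDFS source code: the 64KB cell size comes from the release note, while the choice of 6 data units (matching an RS(6,3)-style policy) is an assumption for the example.

      ```python
      # Illustrative sketch of the striped layout described above (not HDFS code).
      CELL_SIZE = 64 * 1024   # 64 KB striping cell, per the release note
      DATA_UNITS = 6          # assumed number of data blocks per block group

      def locate(logical_offset):
          """Map a logical byte offset to (stripe row, data-unit index, offset in cell)."""
          cell = logical_offset // CELL_SIZE    # which cell of the file overall
          stripe = cell // DATA_UNITS           # which row of cells across the group
          unit = cell % DATA_UNITS              # which data unit (DataNode) holds it
          within = logical_offset % CELL_SIZE   # byte offset inside that cell
          return stripe, unit, within
      ```

      Because consecutive 64KB cells land on different DataNodes, a sequential read touches many nodes in parallel, which is exactly why striping helps throughput but hurts locality.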

      Description

      Erasure Coding (EC) can greatly reduce storage overhead without sacrificing data reliability, compared to the existing HDFS 3-replica approach. For example, if we use a 10+4 Reed-Solomon coding, we can tolerate the loss of any 4 blocks, with a storage overhead of only 40%. This makes EC an attractive alternative for big data storage, particularly for cold data.
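      The overhead figures above can be checked with a back-of-the-envelope calculation (a sketch, not HDFS code): an RS(k, m) scheme stores m parity blocks per k data blocks, and any m blocks of the k+m group may be lost.

      ```python
      # Back-of-the-envelope check of the RS(10, 4) figures in the description.
      def storage_overhead(data_blocks, parity_blocks):
          """Extra storage as a fraction of the raw data size."""
          return parity_blocks / data_blocks

      def tolerated_losses(parity_blocks):
          """Any m blocks of the k+m block group may be lost and reconstructed."""
          return parity_blocks

      # RS(10, 4): 4/10 = 40% overhead, tolerates 4 lost blocks.
      rs_overhead = storage_overhead(10, 4)        # 0.4
      # 3-way replication: 2 extra copies = 200% overhead, tolerates 2 losses.
      replication_overhead = storage_overhead(1, 2)  # 2.0
      ```

      The comparison makes the trade-off concrete: RS(10, 4) tolerates twice as many failures as 3-way replication at one fifth of the extra storage cost.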

      Facebook had a related open source project called HDFS-RAID. It used to be one of the contrib packages in HDFS but was removed in Hadoop 2.0 for maintainability reasons. Its drawbacks are: 1) it sits on top of HDFS and depends on MapReduce for encoding and decoding tasks; 2) it can only be used for cold files that are not intended to be appended to anymore; 3) its pure-Java EC coding implementation is extremely slow in practical use. For these reasons, it might not be a good idea to simply bring HDFS-RAID back.

      We (Intel and Cloudera) are working on a design that builds EC into HDFS without external dependencies, making it self-contained and independently maintainable. This design builds the EC feature on top of storage type support and aims to be compatible with existing HDFS features such as caching, snapshots, encryption, and high availability. It will also support different EC coding schemes, implementations, and policies for different deployment scenarios. By utilizing advanced libraries (e.g. the Intel ISA-L library), an implementation can greatly improve the performance of EC encoding/decoding and make the EC solution even more attractive. We will post the design document soon.

      1. Compare-consolidated-20150824.diff
        72 kB
        Zhe Zhang
      2. Consolidated-20150707.patch
        1.02 MB
        Zhe Zhang
      3. Consolidated-20150806.patch
        1.24 MB
        Zhe Zhang
      4. Consolidated-20150810.patch
        1.23 MB
        Zhe Zhang
      5. ECAnalyzer.py
        2 kB
        Zhe Zhang
      6. ECParser.py
        5 kB
        Zhe Zhang
      7. fsimage-analysis-20150105.pdf
        82 kB
        Zhe Zhang
      8. HDFS-7285-Consolidated-20150911.patch
        1.20 MB
        Zhe Zhang
      9. HDFS-7285-initial-PoC.patch
        470 kB
        Zhe Zhang
      10. HDFS-7285-merge-consolidated.trunk.03.patch
        1.04 MB
        Vinayakumar B
      11. HDFS-7285-merge-consolidated.trunk.04.patch
        1.03 MB
        Vinayakumar B
      12. HDFS-7285-merge-consolidated-01.patch
        1.06 MB
        Vinayakumar B
      13. HDFS-7285-merge-consolidated-trunk-01.patch
        1.06 MB
        Vinayakumar B
      14. HDFS-bistriped.patch
        19 kB
        Zhe Zhang
      15. HDFS-EC-merge-consolidated-01.patch
        1.06 MB
        Zhe Zhang
      16. HDFS-EC-Merge-PoC-20150624.patch
        811 kB
        Zhe Zhang
      17. HDFSErasureCodingDesign-20141028.pdf
        1.98 MB
        Zhe Zhang
      18. HDFSErasureCodingDesign-20141217.pdf
        1.59 MB
        Zhe Zhang
      19. HDFSErasureCodingDesign-20150204.pdf
        1.40 MB
        Tsz Wo Nicholas Sze
      20. HDFSErasureCodingDesign-20150206.pdf
        1.42 MB
        Tsz Wo Nicholas Sze
      21. HDFSErasureCodingPhaseITestPlan.pdf
        111 kB
        Zhe Zhang
      22. HDFSErasureCodingSystemTestPlan-20150824.pdf
        72 kB
        Rui Gao
      23. HDFSErasureCodingSystemTestReport-20150826.pdf
        218 kB
        Rui Gao

        Issue Links

        1.
        Configurable erasure coding policy for individual files and directories Sub-task Resolved Zhe Zhang
         
        2.
        Representing striped block groups in NameNode with hierarchical naming protocol Sub-task Resolved Zhe Zhang
         
        3.
        Process block reports for erasure coded blocks Sub-task Resolved Zhe Zhang
         
        4.
        [umbrella] Data striping support in HDFS client Sub-task Resolved Li Bo
         
        5.
        Erasure coding: distribute recovery work for striped blocks to DataNode Sub-task Resolved Zhe Zhang
         
        6.
        Document the HDFS Erasure Coding feature Sub-task Resolved Uma Maheswara Rao G
         
        7.
        Erasure Coding: extend BlockInfo to handle EC info Sub-task Resolved Jing Zhao
         
        8.
        Implement COMPLETE state of erasure coding block groups Sub-task Resolved Zhe Zhang
         
        9.
        Add a test for BlockGroup support in FSImage Sub-task Resolved Takuya Fukudome
         
        10.
        Add unit tests for editlog transactions for EC Sub-task Resolved Hui Zheng
         
        11.
        Change disk quota calculation for EC files Sub-task Resolved Tsz Wo Nicholas Sze
         
        12.
        Erasure Coding: update the Balancer/Mover data migration logic Sub-task Resolved Walter Su
         
        13.
        Erasure Coding: consolidate streamer coordination logic and handle failure when writing striped blocks Sub-task Resolved Tsz Wo Nicholas Sze
         
        14.
        Erasure coding: DFSInputStream with decode functionality (pread) Sub-task Resolved Zhe Zhang
         
        15.
        Change fsck to support EC files Sub-task Resolved Takanobu Asanuma
         
        16.
        Client side api/config changes to support online encoding Sub-task Resolved Vinayakumar B
         
        17.
        Add periodic checker to find the corrupted EC blocks/files Sub-task Resolved Vinayakumar B
         
        18.
        Avoid Block movement in Balancer and Mover for the erasure encoded blocks Sub-task Resolved Vinayakumar B
         
        19.
        Add logic to DFSOutputStream to support writing a file in striping layout Sub-task Resolved Li Bo
         
        20.
        Erasure Coding: Add striped block support in INodeFile Sub-task Resolved Jing Zhao
         
        21.
        Erasure coding: pread from files in striped layout Sub-task Resolved Zhe Zhang
         
        22.
        Support appending to a striping layout file Sub-task Resolved Li Bo
         
        23.
        Erasure Coding: Update INodeFile quota computation for striped blocks Sub-task Resolved Kai Sasaki
         
        24.
        Erasure Coding: allocate and persist striped blocks in NameNode Sub-task Resolved Jing Zhao
         
        25.
        Erasure Coding: support striped blocks in non-protobuf fsimage Sub-task Resolved Hui Zheng
         
        26.
        Erasure coding: implement facilities in NameNode to create and manage EC zones Sub-task Resolved Zhe Zhang
         
        27.
        Erasure coding: extend LocatedBlocks to support reading from striped files Sub-task Resolved Jing Zhao
         
        28.
        Erasure Coding: Update safemode calculation for striped blocks Sub-task Resolved Rui Gao
         
        29.
        Erasure Coding: INodeFile.dumpTreeRecursively() supports to print striped blocks Sub-task Resolved Takuya Fukudome
         
        30.
        Subclass DFSOutputStream to support writing striping layout files Sub-task Resolved Li Bo
         
        31.
        Erasure Coding: track invalid, corrupt, and under-recovery striped blocks in NameNode Sub-task Resolved Jing Zhao
         
        32.
        Erasure coding: use BlockInfo[] for both striped and contiguous blocks in INodeFile Sub-task Resolved Zhe Zhang
         
        33.
        Erasure Coding: track BlockInfo instead of Block in UnderReplicatedBlocks and PendingReplicationBlocks Sub-task Resolved Jing Zhao
         
        34.
        Erasure coding: resolving conflicts in the branch when merging trunk changes. Sub-task Resolved Zhe Zhang
         
        35.
        Erasure Coding: INodeFile quota computation unit tests Sub-task Resolved Kai Sasaki
         
        36.
        WebImageViewer need support file size calculation with striped blocks Sub-task Resolved Rakesh R
         
        37.
        Erasure coding: NameNode support for lease recovery of striped block groups Sub-task Resolved Zhe Zhang
         
        38.
        Erasure coding: DataNode support for block recovery of striped block groups Sub-task Resolved Yi Liu
         
        39.
        getStoragePolicy() regards HOT policy as EC policy Sub-task Resolved Takanobu Asanuma
         
        40.
        Erasure Coding: simplify striped block recovery work computation and add tests Sub-task Resolved Jing Zhao
         
        41.
        Erasure coding: extend UnderReplicatedBlocks to accurately handle striped blocks Sub-task Resolved Zhe Zhang
         
        42.
        Erasure Coding: retrieve erasure coding schema for a file from NameNode Sub-task Resolved Vinayakumar B
         
        43.
        Erasure Coding: ECworker frame, basics, bootstraping and configuration Sub-task Resolved Uma Maheswara Rao G
         
        44.
        Erasure Coding: Update CHANGES-HDFS-7285.txt with branch commits Sub-task Resolved Vinayakumar B
         
        45.
        Erasure coding: stateful (non-positional) read from files in striped layout Sub-task Resolved Zhe Zhang
         
        46.
        Erasure coding: Decommission handle for EC blocks. Sub-task Resolved Yi Liu
         
        47.
        Define a system-wide default EC schema Sub-task Resolved Kai Zheng
         
        48.
        Erasure coding: fix bug in EC zone and symlinks Sub-task Resolved Jing Zhao
         
        49.
        Erasure Coding: Add RPC to client-namenode to list all ECSchemas loaded in Namenode. Sub-task Resolved Vinayakumar B
         
        50.
        Erasure coding: fix bug in TestFSImage Sub-task Resolved Rakesh R
         
        51.
        Make hard-coded values consistent with the system default schema first before remove them Sub-task Resolved Kai Zheng
         
        52.
        Erasure coding: Add auditlog FSNamesystem#createErasureCodingZone if this operation fails Sub-task Resolved Rakesh R
         
        53.
        Erasure coding: created util class to analyze striped block groups Sub-task Resolved Zhe Zhang
         
        54.
        BlockManager treats good blocks in a block group as corrupt Sub-task Resolved Li Bo
         
        55.
        Erasure Coding: Support specifying ECSchema during creation of ECZone Sub-task Resolved Vinayakumar B
         
        56.
        Erasure Coding: Better to move EC related proto messages to a separate erasurecoding proto file Sub-task Resolved Rakesh R
         
        57.
        DFSStripedInputStream fails to read data after one stripe Sub-task Resolved Zhe Zhang
         
        58.
        Erasure Coding: Maintain consistent naming for Erasure Coding related classes - EC/ErasureCoding Sub-task Resolved Uma Maheswara Rao G
         
        59.
        Client gets and uses EC schema when reads and writes a stripping file Sub-task Resolved Kai Sasaki
         
        60.
        Send the EC schema to DataNode via EC encoding/recovering command Sub-task Resolved Uma Maheswara Rao G
         
        61.
        Fix the editlog corruption exposed by failed TestAddStripedBlocks Sub-task Resolved Jing Zhao
         
        62.
        Protobuf changes for BlockECRecoveryCommand and its fields for making it ready for transfer to DN Sub-task Resolved Uma Maheswara Rao G
         
        63.
        Support DFS command for the EC encoding Sub-task Resolved Vinayakumar B
         
        64.
        Add/implement necessary APIs even we just have the system default schema Sub-task Resolved Kai Zheng
         
        65.
        Detect if reserved EC Block ID is already used Sub-task Resolved Hui Zheng
         
        66.
        DFSStripedOutputStream should not create empty blocks Sub-task Resolved Jing Zhao
         
        67.
        BlockManager.addBlockCollectionWithCheck should check if the block is a striped block Sub-task Resolved Hui Zheng
         
        68.
        Erasure Coding: Keep default schema's name consistent Sub-task Resolved Unassigned
         
        69.
        Failure handling: DFSStripedOutputStream continues writing with enough remaining datanodes Sub-task Resolved Li Bo
         
        70.
        Erasure Coding: DataNode reconstruct striped blocks Sub-task Resolved Yi Liu
         
        71.
        createErasureCodingZone sets retryCache state as false always Sub-task Resolved Uma Maheswara Rao G
         
        72.
        Erasure Coding: Improve DFSStripedOutputStream closing of datastreamer threads Sub-task Resolved Rakesh R
         
        73.
        Erasure coding: Make block placement policy for EC file configurable Sub-task Resolved Walter Su
         
        74.
        Erasure coding: refactor client-related code to sync with HDFS-8082 and HDFS-8169 Sub-task Resolved Zhe Zhang
         
        75.
        ClientProtocol#createErasureCodingZone API was wrongly annotated as Idempotent Sub-task Resolved Vinayakumar B
         
        76.
        StripedBlockUtil.getInternalBlockLength may have overflow error Sub-task Resolved Tsz Wo Nicholas Sze
         
        77.
        Erasure coding: Fix file quota change when we complete/commit the striped blocks Sub-task Resolved Takuya Fukudome
         
        78.
        Improve end to end striping file test to add erasure recovering test Sub-task Resolved Xinwei Qin
         
        79.
        Erasure Coding: Seek and other Ops in DFSStripedInputStream. Sub-task Resolved Yi Liu
         
        80.
        DistributedFileSystem.createErasureCodingZone should pass schema in FileSystemLinkResolver Sub-task Resolved Tsz Wo Nicholas Sze
         
        81.
        TestDFSStripedOutputStream should use BlockReaderTestUtil to create BlockReader Sub-task Resolved Tsz Wo Nicholas Sze
         
        82.
        Erasure Coding: StripedDataStreamer fails to handle the blocklocations which doesn't satisfy BlockGroupSize Sub-task Resolved Rakesh R
         
        83.
        Should calculate checksum for parity blocks in DFSStripedOutputStream Sub-task Resolved Yi Liu
         
        84.
        Erasure Coding: Ignore DatanodeProtocol#DNA_ERASURE_CODING_RECOVERY commands from standbynode if any Sub-task Resolved Vinayakumar B
         
        85.
        Erasure Coding: SequentialBlockGroupIdGenerator#nextValue may cause block id conflicts Sub-task Resolved Jing Zhao
         
        86.
        Fix DFSStripedOutputStream#getCurrentBlockGroupBytes when the last stripe is at the block group boundary Sub-task Resolved Jing Zhao
         
        87.
        Erasure Coding: Create DFSStripedInputStream in DFSClient#open Sub-task Resolved Kai Sasaki
         
        88.
        Erasure coding: [bug] should always allocate unique striped block group IDs Sub-task Resolved Zhe Zhang
         
        89.
        Erasure Coding: XML based end-to-end test for ECCli commands Sub-task Resolved Rakesh R
         
        90.
        DFSStripedOutputStream.closeThreads releases cellBuffers multiple times Sub-task Resolved Kai Sasaki
         
        91.
        Avoid assigning a leading streamer in StripedDataStreamer to tolerate datanode failure Sub-task Resolved Tsz Wo Nicholas Sze
         
        92.
        Erasure Coding: simplify the retry logic in DFSStripedInputStream (stateful read) Sub-task Resolved Jing Zhao
         
        93.
        Erasure Coding: Implement batched listing of erasure coding zones Sub-task Resolved Rakesh R
         
        94.
        Erasure Coding: implement parallel stateful reading for striped layout Sub-task Resolved Jing Zhao
         
        95.
        Erasure coding: move striped reading logic to StripedBlockUtil Sub-task Resolved Zhe Zhang
         
        96.
        Refactor DFSStripedOutputStream and StripedDataStreamer Sub-task Resolved Tsz Wo Nicholas Sze
         
        97.
        Erasure Coding: add ECSchema to HdfsFileStatus Sub-task Resolved Yong Zhang
         
        98.
        Erasure Coding: Fix Findbug warnings present in erasure coding Sub-task Resolved Rakesh R
         
        99.
        Erasure Coding: NameNode may get blocked in waitForLoadingFSImage() when loading editlog Sub-task Resolved Jing Zhao
         
        100.
        Erasure Coding: DFSStripedOutputStream#close throws NullPointerException exception in some cases Sub-task Resolved Li Bo
         
        101.
        Erasure coding: refactor EC constants to be consistent with HDFS-8249 Sub-task Resolved Zhe Zhang
         
        102.
        Erasure Coding: support decoding for stateful read Sub-task Resolved Jing Zhao
         
        103.
        Erasure coding: consolidate striping-related terminologies Sub-task Resolved Zhe Zhang
         
        104.
        Bump GenerationStamp for write failure in DFSStripedOutputStream Sub-task Resolved Tsz Wo Nicholas Sze
         
        105.
        Add trace info to DFSClient#getErasureCodingZoneInfo(..) Sub-task Resolved Vinayakumar B
         
        106.
        Follow-on to update decode for DataNode striped blocks reconstruction Sub-task Resolved Yi Liu
         
        107.
        Erasure coding: Rename Striped block recovery to reconstruction to eliminate confusion. Sub-task Resolved Yi Liu
         
        108.
        Erasure coding: rename DFSStripedInputStream related test classes Sub-task Resolved Zhe Zhang
         
        109.
        Expose some administrative erasure coding operations to HdfsAdmin Sub-task Resolved Uma Maheswara Rao G
         
        110.
        Erasure Coding: Badly treated when createBlockOutputStream failed in DataStreamer Sub-task Resolved Unassigned
         
        111.
        Erasure Coding: test skip in TestDFSStripedInputStream Sub-task Resolved Walter Su
         
        112.
        Erasure Coding: test failed in TestDFSStripedInputStream.testStatefulRead() when use ByteBuffer Sub-task Resolved Walter Su
         
        113.
        Erasure Coding: whether to use the same chunkSize in decoding with the value in encoding Sub-task Resolved Unassigned
         
        114.
        Erasure Coding: test webhdfs read write stripe file Sub-task Resolved Walter Su
         
        115.
        Erasure Coding: Refactor BlockInfo and BlockInfoUnderConstruction Sub-task Resolved Tsz Wo Nicholas Sze
         
        116.
        Fix FindBugs issues introduced by erasure coding Sub-task Resolved Unassigned
         
        117.
        Erasure Coding: DFSStripedInputStream#seekToNewSource Sub-task Resolved Yi Liu
         
        118.
        Erasure coding: fix some minor bugs in EC CLI Sub-task Resolved Walter Su
         
        119.
        Erasure Coding: Badly treated when short of Datanode in StripedDataStreamer Sub-task Resolved Walter Su
         
        120.
        Erasure Coding: Make the timeout parameter of polling blocking queue configurable in DFSStripedOutputStream Sub-task Resolved Li Bo
         
        121.
        BlockInfoStriped uses EC schema Sub-task Resolved Kai Sasaki
         
        122.
        Erasure Coding: DFS opening a non-existent file need to be handled properly Sub-task Resolved Rakesh R
         
        123.
        Erasure Coding: TestRecoverStripedFile#testRecoverOneParityBlock is failing Sub-task Resolved Rakesh R
         
        124.
        Erasure coding: compute storage type quotas for striped files, to be consistent with HDFS-8327 Sub-task Resolved Zhe Zhang
         
        125.
        Add cellSize as an XAttr to ECZone Sub-task Resolved Vinayakumar B
         
        126.
        Erasure Coding: Few improvements for the erasure coding worker Sub-task Resolved Rakesh R
         
        127.
        Fix issues like NPE in TestRecoverStripedFile Sub-task Resolved Kai Zheng
         
        128.
        Remove chunkSize and initialize from erasure coder Sub-task Resolved Kai Zheng
         
        129.
        NN should consider current EC tasks handling count from DN while assigning new tasks Sub-task Resolved Uma Maheswara Rao G
         
        130.
        Erasure Coding: unit test the behaviour of BlockManager recovery work for the deleted blocks Sub-task Resolved Rakesh R
         
        131.
        Revisit and refactor ErasureCodingInfo Sub-task Resolved Vinayakumar B
         
        132.
        Erasure Coding: Pread failed to read data starting from not-first stripe Sub-task Resolved Walter Su
         
        133.
        Fix the isNeededReplication calculation for Striped block in NN Sub-task Resolved Yi Liu
         
        134.
        Erasure Coding: ECZoneManager#getECZoneInfo is not resolving the path properly if zone dir itself is the snapshottable dir Sub-task Resolved Rakesh R
         
        135.
        Remove dataBlockNum and parityBlockNum from BlockInfoStriped Sub-task Resolved Kai Sasaki
         
        136.
        Erasure Coding: Fix the NullPointerException when deleting file Sub-task Resolved Yi Liu
         
        137.
        set blockToken in LocatedStripedBlock Sub-task Resolved Walter Su
         
        138.
        Erasure Coding: make condition check earlier for setReplication Sub-task Resolved Walter Su
         
        139.
        Erasure Coding: fix cannot rename a zone dir Sub-task Resolved Walter Su
         
        140.
        Erasure Coding: Consolidate erasure coding zone related implementation into a single class Sub-task Resolved Rakesh R
         
        141.
        Erasure coding: properly handle start offset for internal blocks in a block group Sub-task Resolved Zhe Zhang
         
        142.
        Erasure Coding: stateful read result doesn't match data occasionally because of flawed test Sub-task Resolved Walter Su
         
        143.
        Erasure coding: fix priority level of UnderReplicatedBlocks for striped block Sub-task Resolved Walter Su
         
        144.
        Refactor BlockInfoContiguous and fix NPE in TestBlockInfo#testCopyConstructor() Sub-task Resolved Vinayakumar B
         
        145.
        2 RPC calls for every file read in DFSClient#open(..) resulting in double Audit log entries Sub-task Resolved Vinayakumar B
         
        146.
        createErasureCodingZone should check whether cellSize is available Sub-task Resolved Yong Zhang
         
        147.
        Erasure coding: fix striping related logic in FSDirWriteFileOp to sync with HDFS-8421 Sub-task Resolved Zhe Zhang
         
        148.
        Erasure coding: remove workarounds in client side stripped blocks recovering Sub-task Resolved Zhe Zhang
         
        149.
        Erasure coding: test DataNode reporting bad/corrupted blocks which belongs to a striped block. Sub-task Resolved Takanobu Asanuma
         
        150.
        Erasure coding: Two contiguous blocks occupy IDs belong to same striped group Sub-task Resolved Walter Su
         
        151.
        ErasureCodingWorker fails to do decode work Sub-task Resolved Li Bo
         
        152.
        Fix a decoding issue in stripped block recovering in client side Sub-task Resolved Kai Zheng
         
        153.
        Restore ECZone info inside FSImageLoader Sub-task Resolved Kai Sasaki
         
        154.
        Erasure Coding: processOverReplicatedBlock() handles striped block Sub-task Resolved Walter Su
         
        155.
        Erasure Coding: Fix FindBugs Multithreaded correctness Warning Sub-task Resolved Rakesh R
         
        156.
        Erasure Coding: Fix usage of 'createZone' Sub-task Resolved Vinayakumar B
         
        157.
        Allow to configure RS and XOR raw coders Sub-task Resolved Kai Zheng
         
        158.
        Erasure Coding: fix non-protobuf fsimage for striped blocks Sub-task Resolved Jing Zhao
         
        159.
        Erasure Coding: fsck handles file smaller than a full stripe Sub-task Resolved Walter Su
         
        160.
        Erasure Coding: SafeMode handles file smaller than a full stripe Sub-task Resolved Walter Su
         
        161.
        Fix TestErasureCodingCli test Sub-task Resolved Vinayakumar B
         
        162.
        Erasure coding: Persist cellSize in BlockInfoStriped and StripedBlocksFeature Sub-task Resolved Walter Su
         
        163.
        Erasure Coding: Remove dataBlockNum and parityBlockNum from StripedBlockProto Sub-task Resolved Yi Liu
         
        164.
        Erasure Coding: fix the copy constructor of BlockInfoStriped and BlockInfoContiguous Sub-task Resolved Vinayakumar B
         
        165.
        Erasure Coding: Client can't read(decode) the EC files which have corrupt blocks. Sub-task Resolved Kai Sasaki
         
        166.
        Erasure Coding: revisit replica counting for striped blocks Sub-task Resolved Jing Zhao
         
        167.
        Erasure Coding: handle missing internal block locations in DFSStripedInputStream Sub-task Resolved Jing Zhao
         
        168.
        Erasure Coding: fix some block number calculation for striped block Sub-task Resolved Yi Liu
         
        169.
        DFSClient hang up when there are not sufficient DataNodes in EC cluster. Sub-task Resolved Kai Sasaki
         
        170.
        Erasure coding: update BlockManager.blockHasEnoughRacks(..) logic for striped block Sub-task Resolved Kai Sasaki
         
        171.
        Erasure Coding: client generates too many small packets when writing parity data Sub-task Resolved Li Bo
         
        172.
        Erasure coding: revisit and simplify BlockInfoStriped and INodeFile Sub-task Resolved Zhe Zhang
         
        173.
        Erasure coding: For a small file missing and under replicated ec-block calculation is incorrect Sub-task Resolved J.Andreina
         
        174.
        Erasure Coding: Fail to read a file with corrupted blocks Sub-task Resolved Walter Su
         
        175.
        Erasure Coding: fix one cell need two packets Sub-task Resolved Walter Su
         
        176.
        Erasure Coding: the number of chunks in packet is not updated when writing parity data Sub-task Resolved Li Bo
         
        177.
        Erasure Coding: reuse BlockReader when reading the same block in pread Sub-task Resolved Jing Zhao
         
        178.
        Erasure Coding: unit test for SequentialBlockGroupIdGenerator Sub-task Resolved Rakesh R
         
        179.
        Erasure Coding: Correctly handle BlockManager#InvalidateBlocks for striped block Sub-task Resolved Yi Liu
         
        180.
        Erasure coding: rename BlockInfoContiguousUC and BlockInfoStripedUC to be consistent with trunk Sub-task Resolved Zhe Zhang
         
        181.
        Erasure Coding: fix DFSStripedInputStream/DFSStripedOutputStream re-fetch token when expired Sub-task Resolved Walter Su
         
        182.
        Erasure Coding: use DirectBufferPool in DFSStripedInputStream for buffer allocation Sub-task Resolved Jing Zhao
         
        183.
        Erasure Coding: Client no need to decode missing parity blocks Sub-task Resolved Walter Su
         
        184.
        Erasure Coding: add test for namenode process over replicated striped block Sub-task Resolved Takuya Fukudome
         
        185.
        Erasure Coding: Fix NPE when NameNode processes over-replicated striped blocks Sub-task Resolved Walter Su
         
        186.
        Erasure coding: store EC schema and cell size in INodeFile and eliminate notion of EC zones Sub-task Resolved Zhe Zhang
         
        187.
        Erasure Coding: Tolerate datanode failures in DFSStripedOutputStream when the data length is small Sub-task Resolved Tsz Wo Nicholas Sze
         
        188.
        Erasure Coding: client occasionally gets less block locations when some datanodes fail Sub-task Resolved Li Bo
         
        189.
        Erasure Coding: Provide ECSchema validation when setting EC policy Sub-task Resolved J.Andreina
         
        190.
        Erasure coding: add ECPolicy to replace schema+cellSize in hadoop-hdfs Sub-task Resolved Walter Su
         
        191.
        Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure Sub-task Resolved Li Bo
         
        192.
        Erasure Coding: Use datablocks, parityblocks and cell size from ErasureCodingPolicy Sub-task Resolved Vinayakumar B
         
        193.
        Erasure Coding: use threadpool for EC recovery tasks on DataNode Sub-task Resolved Rakesh R
         
        194.
        Erasure coding: update BlockInfoContiguousUC and BlockInfoStripedUC to use BlockUnderConstructionFeature Sub-task Resolved Jing Zhao
         
        195.
        Erasure coding: do not throw exception when setting replication factor on EC files Sub-task Resolved Rui Gao
         
        196.
        Erasure coding : Fix random failure in TestSafeModeWithStripedFile Sub-task Resolved J.Andreina
         
        197.
        Erasure coding: fix 2 failed tests of DFSStripedOutputStream Sub-task Resolved Walter Su
         
        198.
        Erasure coding: MapReduce job failed when I set the / folder to the EC zone Sub-task Resolved Unassigned
         
        199.
        Rename dfs.datanode.stripedread.threshold.millis to dfs.datanode.stripedread.timeout.millis Sub-task Resolved Andrew Wang
         
        200.
        Cleanup erasure coding documentation Sub-task Resolved Andrew Wang
         
        201.
        Erasure Coding: Provide DistributedFilesystem API to getAllErasureCodingPolicies Sub-task Resolved Rakesh R
         
        202.
        Erasure coding: update EC command "-s" flag to "-p" when specifying policy Sub-task Resolved Zhe Zhang
         
        203.
        ErasureCodingWorker#processErasureCodingTasks should not fail to process remaining tasks due to one invalid ECTask Sub-task Resolved Uma Maheswara Rao G
         
        204.
        Erasure Coding: when recovering lost blocks, logs can be too verbose and hurt performance Sub-task Resolved Rui Li
         
        205.
        Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests to Coordinator) Sub-task Resolved Jing Zhao
         

          Activity

          zhz Zhe Zhang added a comment -

          Thanks Weihua Jiang for reporting this JIRA.

          I'm attaching the first draft of the design doc and will be more than happy to incorporate feedback under this JIRA.

          I'd also like to invite the community to a meeting at the Cloudera headquarters (1001 Page Mill Road, Palo Alto) to discuss the design in more detail. My tentative plan is to hold it on the morning of the coming Friday (Oct. 31st). Please let me know if you are interested and, if so, your availability. After soliciting feedback I will confirm the logistics, including the remote meeting URL.

          drankye Kai Zheng added a comment -

          Thanks Zhe for hosting the session.

          After soliciting feedback I will confirm the logistics, including the remote meeting URL.

          It sounds great to have a local meeting to discuss the design; it can be efficient. For those who can't conveniently join in person or over the remote meeting, please also post the feedback and the logistics here for broader discussion. Thanks.

          In particular, since our design targets supporting EC (erasure coding) layered on top of the HSM feature, it would be great to discuss this aspect and address questions like: what's the overall status of HSM? What's the gap to boost the EC support? Can we align the two efforts? Etc.

          zhz Zhe Zhang added a comment -

          A meeting has been scheduled:

          When: Friday Oct. 31st, 10am~12pm
          Where: Cloudera headquarters, 1001 Page Mill Road, Palo Alto. Both the lobby (for guest check-in) and the meeting room (Hadoop) are in building #2
          URL: https://cloudera.webex.com/cloudera/j.php?MTID=me26394d0a3559c7a9498f18ad7de8962
          Call-in: 1-650-479-3208 (US/Canada) with access code: 290 472 605

          Please drop me a note (zhezhang@cloudera.com) if you prefer a different time.

          Thanks Kai Zheng for the suggestion. The interface of the erasure coding feature potentially has a close relationship with HSM (HDFS-2832) and archival storage (HDFS-6584). We'll make sure to cover this topic in the meeting and share the summary here.
          yzhangal Yongjun Zhang added a comment -

          Hi Zhe,

          Thanks for organizing the meeting, and thanks to all the folks for attending; it was a good one!

          One thing I forgot to mention: we need to add a section in the design spec about the impact on distcp. There are multiple things to be addressed (this list is not meant to be complete):

          • copying between two clusters, one with EC support and one without;
          • whether we should restore data before copying or just copy the coded blocks. The latter may not be feasible, since the coded data may contain other files that are not to be copied. Even if the coded blocks belong to the file being copied, copying them to a target directory not tagged as EC requires some handling there, and copying to a cluster that doesn't support EC needs special care too;
          • distcp has a switch to preserve block size; if this switch is not specified, a different block size may be used at the target, and we need special handling here.
            ...

          Thanks.

          andrew.wang Andrew Wang added a comment -

          Hi all, here are some discussion notes from today's meeting. Thanks everyone for attending, and Zhe for presenting.

          Attendees: Yongjun, Dave Wang, ATM, Eddy, Jing, Zhe, Matteo, Charles, Suresh, Todd, Michael C, Colin, Govind, myself

          EC cost

          • Suresh: EC is much more CPU/network intensive than replication. Sanjay talked to Dhruba at FB and they didn't erasure code more than 10-20% of their data because of the impact on MR workloads.
          • Todd: Systems like QFS and Colossus EC 100% of their data though, which shows this is possible. They have fast networks, which makes this possible.
          • Andrew: Can also offload work to dedicated erasure coding nodes rather than doing it on DNs, like Facebook f4
          • Suresh/ATM: Higher-level point, it would be good to have a way of delaying replication/EC work when an archival node or rack dies. There's already work for a "maintenance mode" for DNs, would be good to have the same thing at a rack-level. Then automatically kick an archival node or rack into maintenance mode on detecting a failure.
          • Overall theme was that having throttling of EC work is a basic requirement, which is provided for in the design doc.

          Distributing work

          • Suresh: Possible for lots of EC work and tracking to stack up at the NN. Would be good to push some of these responsibilities down to the DNs, let groups of DNs figure out EC work by themselves without NN involvement.
          • Jing: Copyset work is related.
          • Todd: Striping might alleviate NN load concerns, since it's simpler in terms of metadata and coordination.
          • This doesn't need to happen in the first cut, but just something to keep in mind for the future. This idea is being examined more generally for the blobstore work.

          Quotas

          • This is an open issue. No quick agreement in the room whether to charge users for the replicated cost, or the EC cost.
          • Charging the EC cost is good since it encourages users to save space
          • However, can get in strange scenarios where user is at quota when things are EC'd, tries to append or otherwise convert to replication, are now over quota.
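The gap between the two charging schemes discussed above can be made concrete with a quick sketch. The RS(6,3) schema below is a hypothetical choice for illustration; the thread does not fix a schema:

```python
def replicated_usage(logical_bytes, replication=3):
    """Raw bytes charged if quota counts the replicated cost."""
    return logical_bytes * replication

def ec_usage(logical_bytes, data_units=6, parity_units=3):
    """Raw bytes charged if quota counts the erasure-coded cost."""
    return logical_bytes * (data_units + parity_units) / data_units

logical = 100  # GiB of logical data
print(replicated_usage(logical))  # 300 GiB under 3-way replication
print(ec_usage(logical))          # 150.0 GiB under RS(6,3)
```

The strange scenario in the last bullet is exactly this 2x gap: a user charged 150 GiB under EC would jump to 300 GiB on conversion to replication, potentially blowing past their quota without writing a byte.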

          Policy

          • Agreement that StoragePolicy is currently an inflexible entity, hardcoded specification of what StorageTypes to use.
          • Need some higher-level blob of code that looks at file attributes, access patterns, other higher-level metadata, and then based on that chooses the StorageType for the data.
          • Ideally, flow looks like (file attributes and metadata) -> (policy engine) -> (data temperature) -> (storage types / EC / compression to use based on what's present in cluster).
          • User would not be allowed to manually set the data temperature, but could query it. This prevents users and the policy engine from fighting each other.
          • Users could possibly set some kind of "force" xattr though, which the policy engine would respect.
          • Issue of keeping the policy consistent. Things like the balancer and mover need to be aware of the policy so they don't fight it. How is the policy distributed if it's not hardcoded, and is some code blob? Ultimately would be good to move these responsibilities back into the NN.
          • Question of what if anything needs to be changed in branch-2.6. Since custom StoragePolicies are not allowed, we should be good as long as whatever future policy engine respects the current settings.
          • AI: rename StoragePolicy to StorageTag or StoragePolicyTag or some other name.
          • AI: potentially rename StoragePolicy#SSD to something else, naming confusion with StorageType#SSD
          • AI: Potentially still some more discussion about naming and interfaces to be done. Might want a more complex struct for StorageTag to also encompass EC/replication, compressed/not, the different ways things can be EC or compressed.
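The proposed flow (file attributes and metadata) -> (policy engine) -> (data temperature) -> (storage types / EC / compression) could look roughly like the following sketch. All names, thresholds, and mappings here are hypothetical; none of this exists in HDFS:

```python
from dataclasses import dataclass

@dataclass
class FileMeta:
    path: str
    days_since_access: int
    size_bytes: int

def temperature(meta: FileMeta) -> str:
    # The policy engine derives a temperature from attributes/access patterns;
    # users could query this value but not set it directly.
    if meta.days_since_access < 7:
        return "HOT"
    if meta.days_since_access < 90:
        return "WARM"
    return "COLD"

def storage_choice(temp: str) -> dict:
    # Temperature then maps to storage types / layout based on what's present.
    return {
        "HOT":  {"storage": "SSD",     "layout": "3x-replication"},
        "WARM": {"storage": "DISK",    "layout": "3x-replication"},
        "COLD": {"storage": "ARCHIVE", "layout": "EC-RS(6,3)"},
    }[temp]

meta = FileMeta("/data/logs/2014-10-31", days_since_access=200, size_bytes=1 << 30)
print(storage_choice(temperature(meta)))  # {'storage': 'ARCHIVE', 'layout': 'EC-RS(6,3)'}
```

Keeping the temperature read-only for users, as the notes suggest, prevents the user and this engine from fighting over the same field.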

          Misc other notes

          • Suresh: in archival storage, data goes from hot to cold and never back, good simplification
          • Suresh: Important to document the hardware / network topology requirements, i.e. RS(10,4) needing 14 racks.
          • Thinking is that 90% of data on the cluster will be EC'd, which means failures are going to have a very large impact on performance. Hard to predict what will happen.
          • Definitely need to handle small files well too, since most data is also <1 full block, like ~40MB.
          • Suresh/Andrew: would be cool to rewrite BlockPlacementPolicy to be more general, handle the little twists that are necessary for node-striping, rack-striping, etc.
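To make the RS(10,4) topology requirement above concrete, here is a back-of-the-envelope sketch (assuming rack-level striping places one unit per rack, as the bullet implies):

```python
def ec_overhead(data_units, parity_units):
    """Raw-to-logical storage ratio, and the minimum rack count when
    each unit of a stripe must land on a distinct rack."""
    total = data_units + parity_units
    return total / data_units, total

ratio, racks = ec_overhead(10, 4)
print(ratio)  # 1.4 -- vs 3.0 for 3-way replication
print(racks)  # 14 racks minimum for RS(10,4)
```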

          I'll file JIRAs for the above action items for further discussion.

          andrew.wang Andrew Wang added a comment -

          I filed HDFS-7317 for renaming StoragePolicy, HDFS-7318 for renaming some policies in the default StoragePolicySuite that are named after a StorageType.

          I think HDFS-7317 is also a reasonable place to further discuss the policy logic / StorageTag semantics. Since this stuff is currently in branch-2.6, it would be good to figure out any potentially incompatible changes so we don't block the release.

          zhz Zhe Zhang added a comment -

          Thanks Andrew Wang for the great summary! Many helpful suggestions were brought up in the meeting, thanks everyone for attending.

          I will post an updated design doc to incorporate the feedbacks pretty soon. If you have any additional comments please post on the JIRA.

          zhz Zhe Zhang added a comment -

          Yongjun Zhang and I had an offline discussion. Erasure coding could reduce the data locality for distcp, similar to how it impacts other MapReduce jobs. This shouldn't be a significant performance degradation for warm/cold data.

          apurtell Andrew Purtell added a comment -
          • User would not be allowed to manually set the data temperature, but could query it. This prevents users and the policy engine from fighting each other.
          • Users could possibly set some kind of "force" xattr though, which the policy engine would respect.

          If a user (application) is aware - through its own statistics - of data temperature, it should be possible to hint this to the policy engine.

          How do applications plug into the policy engine machinery?

          Ideally, flow looks like (file attributes and metadata) -> (policy engine) -> (data temperature) -> (storage types / EC / compression to use based on what's present in cluster).

          Why wouldn't the policy engine have the last word? I.e. (file attributes and metadata, data temperature, cluster configuration metadata) -> (policy engine)

          umamaheswararao Uma Maheswara Rao G added a comment -

          I think this feature development should go in a branch. Lets create a branch for this development?

          vinayrpet Vinayakumar B added a comment -

          Hi Uma, yes, we will need the branch, considering the changes involved in this feature.
          I will create a branch soon and post it here. Thanks.

          vinayrpet Vinayakumar B added a comment -

          I have created the branch "HDFS-EC".

          I named the branch "HDFS-EC" just so it's easy to recognize.

          Thanks

          zhz Zhe Zhang added a comment -

          Uma Maheswara Rao G Vinayakumar B Thanks, I agree we certainly need a feature branch for this.

          drankye Kai Zheng added a comment -

          Thanks everyone for getting the branch created. We're currently working on the breakdown into sub-tasks for the upcoming implementation.

          drankye Kai Zheng added a comment -

          Hi Andrew Purtell, I think you have good points. Since the policy engine mentioned might not be closely related to or coupled with the EC feature, it might be better to have a separate JIRA to discuss and implement it. I will create one and track your points there. Thanks.

          drankye Kai Zheng added a comment -

          Hi Andrew Wang, to address the policy engine aspects as discussed, I just opened HDFS-7343, "A comprehensive and flexible storage policy engine". It may not be tightly coupled with this one. Thanks.

          sureshms Suresh Srinivas added a comment -

          Zhe Zhang, can you please post an updated design document for this work that considers the comments from the meeting we had earlier?

          zhz Zhe Zhang added a comment -

          Suresh Srinivas Sure, we are working on an updated design document and will post it soon.

          zhz Zhe Zhang added a comment -

          Based on feedback from the meetup and a deeper study of file size distributions (detailed report to be posted later), data striping is added to this updated design, mainly to support EC on small files. A few highlights compared to the first version:

          1. Client: extended with striping and codec logic
          2. NameNode: INodeFile extended to store both block and BlockGroup information; optimizations are proposed to reduce memory usage caused by striping and parity data
          3. DataNode remains mostly unchanged from the original EC design
          4. Prioritizing EC with striping as the focus of the initial phase, and putting EC with contiguous (non-striping) layout to a 2nd phase
          zhz Zhe Zhang added a comment -

          To motivate and guide the design (especially data striping), we have analyzed several production clusters and generated what-if scenarios with different policies. Please refer to the attached report for full details.

          raviprak Ravi Prakash added a comment -

          Hi Zhe Zhang. Do you plan to publish the tool you used to analyze the fsimage?

          zhz Zhe Zhang added a comment -

          Ravi Prakash Thanks for your interest. Sure, I can publish the program in a few days, after some basic code cleaning.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Thanks for posting the design doc. It looks really nice! Some comments/questions:

          • Is a BlockGroup only used by one file, i.e. it cannot be shared by several files?
          • Suppose the answer to the above question is yes. Then, how to encode small files?
          • HDFS-3107 "HDFS truncate" is now committed. We should revisit it. (EC with truncation is still a non-goal in the design doc.)
          • "Due to the complexity we will not support hflush or hsync for EC files at this phase." Then what will happen if users call hflush/hsync?
          szetszwo Tsz Wo Nicholas Sze added a comment -
          • HDFS upgrade is not covered in the doc. More specifically, if there are existing blocks using the block id's reserved for BlockGroup, how to upgrade the cluster? And how to rollback?
          zhz Zhe Zhang added a comment -

          Attaching the Python scripts used to generate the fsimage analysis report.

          zhz Zhe Zhang added a comment -

          Tsz Wo Nicholas Sze Thank you for reviewing the design doc; great questions!

          Is a BlockGroup only used by one file, i.e. it cannot be shared by several files?

          Each BlockGroup is used by only one file. In general, bundling multiple files in a single erasure coding stripe/group complicates file deletions.

          Suppose the answer to the above question is yes. Then, how to encode small files?

          That's the benefit of striping: when divided into small units (64KB by default) and striped to multiple servers, even small files can be encoded. The fsimage analysis report quantifies the difference in space saving between striping and the traditional contiguous data layouts.
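The cell-by-cell placement described above can be sketched as follows. The 64KB cell comes from the comment; the 6-data-node group width is an illustrative assumption:

```python
CELL = 64 * 1024  # 64KB striping cell, per the default mentioned above

def stripe_cells(file_size, data_nodes=6):
    """Assign each 64KB cell of a file, round-robin, to a data node index."""
    ncells = -(-file_size // CELL)  # ceiling division
    return [cell % data_nodes for cell in range(ncells)]

# Even a 200KB "small" file spreads over 4 cells on 4 different DataNodes,
# so its stripe can be encoded together with parity units -- unlike a
# contiguous layout, where a sub-block file occupies a single block.
print(stripe_cells(200 * 1024))  # [0, 1, 2, 3]
```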

          "HDFS truncate" is now committed. We should revisit it. (EC with truncation is still a non-goal in the design doc.)

          Good point. I will update the design doc to address this. We can either disallow truncate for encoded files, or convert the file (or at least the last block) into replication before truncating.

          "Due to the complexity we will not support hflush or hsync for EC files at this phase." Then what will happen if users call hflush/hsync?

          Similarly to the above, we can either return an error telling the user hflush/hsync is not supported on the target file, or convert the file first (which sounds too slow for frequent flush/sync operations). An ongoing optimization is to leverage the incremental encoding feature of Intel's Intelligent Storage Acceleration Library (ISA-L) to flush parity data from a partial stripe, then update the parity data when more data is available.

          HDFS upgrade is not covered in the doc. More specifically, if there are existing blocks using the block id's reserved for BlockGroup, how to upgrade the cluster? And how to rollback?

          This is a really good catch. IIUC SequentialBlockIdGenerator#LAST_RESERVED_BLOCK_ID is for this purpose? We can just divide the unreserved ID space between regular blocks and BlockGroups. The latest patch under HDFS-7339 implements this logic (SequentialBlockIdGenerator and SequentialBlockGroupIdGenerator).
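One way to carve up the ID space is to key off the top bit of the 64-bit block ID, so a single mask distinguishes the two generators. This is only a sketch of the idea; the actual partitioning is whatever HDFS-7339 settles on:

```python
BLOCK_GROUP_FLAG = 1 << 63  # hypothetical: top bit marks BlockGroup IDs

def is_block_group(block_id):
    """BlockGroup IDs occupy the half of the ID space with the top bit set;
    regular block IDs keep the lower half."""
    return bool(block_id & BLOCK_GROUP_FLAG)

print(is_block_group(0x0000000000001234))  # False: a regular block ID
print(is_block_group(0x8000000000001234))  # True: a BlockGroup ID
```

Nicholas's upgrade question still applies to any such split: pre-existing blocks may already hold IDs in the half newly reserved for BlockGroups.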

          Show
          zhz Zhe Zhang added a comment - Tsz Wo Nicholas Sze Thank you for reviewing the design doc; great questions! Is a BlockGroup only used by one file, i.e. it cannot be shared by servel files? Each BlockGroup is used by only one file. In general, bundling multiple files in a single erasure coding stripe/group complicates file deletions. Suppose the answer to the above question is yes. Then, how to encode small files? That's the benefit of striping: when divided into small units (64KB by default) and striped to multiple servers, even small files can be encoded. The fsimage analysis report quantifies the difference in space saving between striping and the traditional contiguous data layouts. "HDFS truncate" is now committed. We should revisit it. (EC with truncation is still a non-goal in the design doc.) Good point. I will update the design doc to address this. We can either disallow truncate for encoded files, or convert the file (or at least the last block) into replication before truncating. "Due to the complexity we will not support hflush or hsync for EC files at this phase." Then what will happen if users call hflush/hsync? Similar as above, we can either return an error telling the user hflush/hsync is not supported on the target file, or convert the file first (which sounds like too slow for frequent flush/sync operations). An ongoing optimization is to leverage the incremental encoding feature from Intel's Storage Acceleration Library (ISAL) to flush parity data from partial stripe, and updating the parity data when more data is available. HDFS upgrade is not covered in the doc. More specifically, if there are existing blocks using the block id's reserved for BlockGroup, how to upgrade the cluster? And how to rollback? This is a really good catch. IIUC SequentialBlockIdGenerator#LAST_RESERVED_BLOCK_ID is for this purpose? We can just divided the unreserved ID space into regular blocks and BlockGroups. 
The latest patch under HDFS-7339 implements this logic ( SequentialBlockIdGenerator and SequentialBlockGroupIdGenerator .
          szetszwo Tsz Wo Nicholas Sze added a comment -

          > ... We can just divide the unreserved ID space into regular blocks and BlockGroups. ...

          Let me clarify my question. The unreserved ID space currently is used only by blocks. After the division, the IDs for BlockGroup could possibly be already used by some existing blocks. How to upgrade the cluster to fix this problem?

          szetszwo Tsz Wo Nicholas Sze added a comment -

          > That's the benefit of striping: when divided into small units (64KB by default) and striped to multiple servers, even small files can be encoded. The fsimage analysis report quantifies the difference in space saving between striping and the traditional contiguous data layouts.

          I thought the stripe size is 1MB according to the figure in Data Striping Support in HDFS Client. Anyway, small files (say < 10MB) may not be erasure-coded at all, since the disk space savings are not much and the namespace usage is increased significantly.

          BTW, the fsimage analysis is very nice. There are two more types of cost not covered, CPU overhead and replication cost (re-constructing the EC blocks). How could we quantify them?

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Some more questions:

          • The failure cases for write seem not yet covered – what happens if some datanodes fail during a write?
          • A datanode failure may generate an EC storm – say a datanode has 40TB of data; recovering it requires accessing 240TB of data. It is on the order of PB for a rack failure. How could we solve this problem?
          zhz Zhe Zhang added a comment -

          The unreserved ID space currently is used only by blocks. After the division, the IDs for BlockGroup could possibly be already used by some existing blocks. How to upgrade the cluster to fix this problem?

          OK, I understand the issue now. Do you know when HDFS started to use sequentially generated (rather than random) block IDs – starting from which version? I guess we can still attempt to allocate block group IDs from the second half of the unreserved ID space, and check for conflicts in each allocation; if there is a conflict we move the pointer forward past the conflicting ID. In that scenario, to tell whether a block is regular or striped, we need to parse out the middle part of the ID and check if it exists in the map of block groups.
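The conflict-checking allocation just described could be sketched as below. This is only an illustration of the idea (all names are hypothetical, and this is not the HDFS-7339 implementation): candidate group IDs start in the upper half of the ID space, and each candidate is checked against the set of legacy randomly generated block IDs, skipping forward on conflict.

```java
import java.util.Set;

/** Illustrative sketch of a conflict-checking block group ID allocator. */
public class GroupIdAllocatorSketch {
    private final Set<Long> legacyIds;  // randomly generated block IDs from legacy clusters
    private long next;                  // allocation pointer in the block-group ID range

    GroupIdAllocatorSketch(Set<Long> legacyIds, long start) {
        this.legacyIds = legacyIds;
        this.next = start;
    }

    /** Returns the next group ID that does not collide with a legacy block ID. */
    long allocate() {
        while (legacyIds.contains(next)) {
            next++;  // move the pointer past the conflicting ID
        }
        return next++;
    }
}
```

As a usage example, with legacy IDs {100, 101} and a starting pointer of 100, the first two allocations would yield 102 and 103.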

          I thought the stripe size is 1MB according to the figure in Data Striping Support in HDFS Client.

          Good catch. I need to update that design doc to match the 64KB default stripe cell size. BTW 1MB is the I/O buffer size (see parameters C and B on page 8 of the master design doc).

          Anyway, small files (say < 10MB) may not be EC at all since the disk space save is not much and the namespace usage is increased significantly.

          I agree it doesn't make much sense to encode files as small as a few MB. The current fsimage analysis didn't further categorize files under 1 block. But my guess is they only contribute a minor portion of space usage in most clusters. I'll try to run the analysis again to verify.

          BTW, the fsimage analysis is very nice. There are two more types of cost not covered, CPU overhead and replication cost (re-constructing the EC blocks). How could we quantify them?

          Thanks and I really like the suggestion. CPU and I/O bandwidth usage is hard to simulate in a simple analyzer. We'll make sure it's included in the real system test plan.

          The failure cases for write seems not yet covered – what happen if some datanodes fails during a write?

          I believe Li Bo will share more details under HDFS-7545 soon. There is a range of policies we can adopt, the strictest being to return an I/O error when any one target DN fails. In a "smarter" policy, the application can keep writing until m DNs fail, where m equals the number of parity blocks in the schema.
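The "smarter" policy above reduces to a simple predicate on the number of failed streamers. A hypothetical sketch (not the HDFS-7545 design; the class and method names are illustrative):

```java
/** Sketch of a write-failure tolerance policy: with an RS schema of
 *  (data + parity) units, the writer can tolerate up to m = parity
 *  streamer failures before the write must abort. */
public class WriteFailurePolicySketch {
    static boolean canContinue(int failedStreamers, int parityBlocks) {
        // Data remains reconstructible as long as failures do not exceed parity.
        return failedStreamers <= parityBlocks;
    }
}
```

So under RS(6,3), a write could survive up to 3 failed streamers, while a fourth failure would force an error back to the application.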

          Datanode failure may generate a EC strom – say a datanode has 40TB data, it requires accessing 240TB data for recovering it. It is in the order of PB for rack failure. How could we solve this problem?

          I think this is an inevitable challenge with EC. The best we can do is to schedule EC recovery tasks together with UnderReplicatedBlocks with appropriate priority settings. This way blocks are recovered when the system is relatively idle. When lost blocks are accessed they can be recovered on the fly, but traffic from online recovery should be much lower.
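The "EC storm" numbers quoted above (a 40TB datanode requiring 240TB of reads) follow from the fact that reconstructing one lost RS(k, m) unit requires reading k surviving units. A hypothetical one-line sketch of that arithmetic:

```java
/** Sketch of the recovery-traffic arithmetic behind the "EC storm" concern:
 *  reconstructing lost RS(k, m) data requires reading k surviving units per
 *  lost unit, so losing a whole DataNode multiplies its data size by k. */
public class RecoveryTrafficSketch {
    static long recoveryReadTB(long lostTB, int dataUnits) {
        return lostTB * dataUnits;  // TB that must be read to reconstruct lostTB
    }
}
```

With the default 6 data units, a 40TB node loss indeed implies 6 × 40 = 240TB of recovery reads, which is why throttling and prioritizing recovery matters.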

          szetszwo Tsz Wo Nicholas Sze added a comment -

          > ... Do you know when HDFS started to use sequentially generated (rather than random) block IDs – starting from which version? ...

          It was done by HDFS-4645.

          After some discussion with Jing, we think that block group ID is not needed at all – we only need to keep the block group index within a file. Will give more details later.

          zhz Zhe Zhang added a comment -

          Thanks for clarifying.

          After some discussion with Jing, we think that block group ID is not needed at all – we only need to keep the block group index within a file. Will give more details later.

          This is discussed under HDFS-7339.

          drankye Kai Zheng added a comment -

          From HDFS-7353, Tsz Wo Nicholas Sze suggested using the 'erasure' package name instead of 'ec':

          ec also can mean error correcting. How about renaming the package to io.erasure? Then, using EC inside the package won't be ambiguous.

          I'm not sure about this, but we'd better discuss it thoroughly and reach a conclusion. Once decided, we should use it consistently everywhere: design, discussion, code, etc. Currently we all use EC/ec to refer to erasure coding. Does it conflict with error correction? Is there any related work on error correction? If not, I guess we could still use EC, as we might not wish to change all the places. A better name is good, but being consistent is important for a big effort.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Zhe Zhang, how about using the 'erasure' package name instead of 'ec', since ec is ambiguous – it could also mean error correcting (or elliptic curve)?

          BTW, would you mind sharing the design doc in an editable format? Otherwise, I am going to post a new revised design doc.

          drankye Kai Zheng added a comment -

          I'm not a native English speaker, but I guess we could find many meanings for 'erasure' too. The perfect name to resolve any ambiguity would be 'erasure coding' or 'erasure code', but those are a little verbose for a package name. When a simplified form like 'ec' is desired, we can surely find other meanings for it, like the cases you mentioned, but does that mean we shouldn't use it? If so, we could probably find many such cases in the world.

          I may be expressing my thoughts badly, but I do not think it makes great sense to change the overall design just for this. Would anyone give more thoughts?

          drankye Kai Zheng added a comment -

          I searched the code base and didn't find any existing package named 'ec', so there won't actually be a conflict.
          As you might note:
          1. In the JIRA description for this effort, it's explicitly stated as below:

          Erasure Coding (EC) can greatly reduce the storage overhead without sacrifice of data reliability

          2. Our branch is also named HDFS-EC.

          If we do have other efforts on error correcting (or elliptic curve) in the project, then that would be the point to avoid using the 'ec' package name, since it's already used here for 'erasure coding'.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          > ... but I guess we could also find many meanings for 'erasure' too. ...

          Since it is under the io package, it is much harder to have an ambiguous meaning. However, io.ec could still possibly mean error correcting (or arguably elliptic-curve cryptography). Won't you agree that a two-letter acronym "ec" is much more ambiguous than the word "erasure"?

          > In this JIRA description for this effort, it's explicitly stated as below:

          Yes, it is very clear in the current context since there is no other project for error correcting. But the package name sits there forever. If we have an error correcting project later on, it may become a problem.

          > Our branch is also named HDFS-EC.

          Branches are temporary and invisible to users; we may delete, rename, and reuse branch names at some point. It is much harder to change package names, since that is an incompatible change.

          The "io.erasure" package name is just a minor suggestion. I am fine if you insist on using "io.ec", although I think it may lead to some unnecessary confusion.

          zhz Zhe Zhang added a comment -

          I created a Google doc which is editable. If possible please login so I know who's making each comment and update.

          Tsz Wo Nicholas Sze The doc was last updated in mid December and doesn't contain some of the latest updates (mainly from HDFS-7339). If you don't mind the wait I plan to finish updating it before Wednesday. You can also go ahead with your updates assuming the HDFS-7339 discussions were incorporated.

          Package naming is an interesting topic. Erasure doesn't sound very appropriate because it literally means "the act of erasing something" and is a bit ambiguous itself. Actually, erasure coding is a type of error correction code, so we don't need to worry about the conflict with "error correction". The only way to decrease ambiguity in general is to lengthen the abbreviation. Two potential candidates come to mind: ecc, standing for "error correction codes"; or erc, standing for "erasure coding" more specifically. Thoughts?

          szetszwo Tsz Wo Nicholas Sze added a comment -

          > I created a Google doc which is editable. If possible please login so I know who's making each comment and update.

          Thanks for sharing it. I do suggest that we only share it with the contributors who intend to edit the doc. Anyone who wants to edit the doc should send a request to you. It would prevent accidental changes from someone reading the doc. Sound good?

          > ... If you don't mind the wait I plan to finish updating it before Wednesday. ...

          Happy to wait. Please take your time.

          I will think more about the package name.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          How about io.erasure_code? Intel ISA-L also uses erasure_code as the directory name and library name.

          drankye Kai Zheng added a comment -

          It's good to use 'erasure_code' as a C library directory name, but it's not so elegant for a Java package name. How about "erasurecoding" in full?

          drankye Kai Zheng added a comment -

          The rationale for using "erasurecoding" or "erasurecode" is simple: if no abbreviation sounds comfortable, then use the full words.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          I am fine with using "erasurecode", although I prefer "erasure_code".

          drankye Kai Zheng added a comment -

          Thanks for your confirmation. I will use "erasurecode", in the style of names I found in the codebase such as "datatransfer", "blockmanagement", and many others.

          vinayrpet Vinayakumar B added a comment -

          Hi Zhe Zhang

          Design doc updates might be required for some points.

          1. DISK-EC was added with the intention of identifying the parity blocks in the case of non-striped encoding. But now, IMO the logical storage type DISK-EC is no longer required, as a block can be identified as either parity or original using the BlockGroup.
          2. blockStoragePolicydefault.xml is no longer in the code base and storage policies are no longer user configurable. It was removed before merging HDFS-6584 to trunk; instead, all the policies are hardcoded in BlockStoragePolicySuite.java.
          3.

          Transition between erasurecoded and replicated forms can be done by changing the storage policy and triggering the Mover to enforce the new policy.


          I think this is not applicable to the striped design. This should be completely controlled by the ECManager, right?

          4.

          Under this framework, a unique storage policy should be defined for each codec schema. For example, if both 3of5 and 4of10 ReedSolomon coding are supported, policies RS3of5 and RS4of10 should be defined and they can be applied on different paths.


          This also may not be applicable to the striped design, as the schema information will also be saved inside the BlockGroup itself. So IMO there is no need for separate policies for each schema.

          zhz Zhe Zhang added a comment -

          Thanks Tsz Wo Nicholas Sze for the suggestion. The Google doc has been updated with limited permission. Please let me know if you'd like to be added as an editor. Note that Jing Zhao proposed and arranged an offline discussion today with Nicholas and myself. I'll make another update to the doc this afternoon.

          zhz Zhe Zhang added a comment -

          Vinayakumar B Good points on the storage policy section. Please take a look at the updated design doc.

          This also may not be applicable for the striped design, as schema information also will be saved inside the BlockGroup itself. So IMO there is no need of separate policies for each of the schema.

          This is also related to the discussion under HDFS-7349. Maybe we should wrap up that JIRA and incorporate the conclusion in the next rev of the design?

          zhz Zhe Zhang added a comment -

          We had a very productive meetup today. Please find a summary below:
          Attendees: Tsz Wo Nicholas Sze, Zhe Zhang, Jing Zhao

          NameNode handling of block groups (HDFS-7339):

          1. Under the striping layout, it's viable to use the first block to represent the entire block group.
          2. A separate map for block groups is not necessary; blocksMap can be used for both regular blocks and striped block groups.
          3. Block ID allocation: we will use the following protocol, which partitions the entire ID space with a binary flag
            Contiguous: {reserved block IDs | flag | block ID}
            Striped: {reserved block IDs | flag | reserved block group IDs | block group ID | index in group}
            
          4. When the cluster has randomly generated block IDs (from legacy code), the block group ID generator needs to check for ID conflicts in the entire range of IDs generated. We should file a follow-on JIRA to investigate possible optimizations for efficient conflict detection.
          5. To make HDFS-7339 more trackable, we should shrink its scope and remove the client RPC code. It should be limited to block management and INode handling.
          6. Existing block states are sufficient to represent a block group. A client should COMMIT a block group just as a block. The COMPLETE state needs to collect ack from all participating DNs in the group.
          7. We should subclass BlockInfo to remember the block group layout. This is an optimization to avoid frequently retrieving the info from file INode.

          EC and storage policy:

          1. We agreed that EC vs. replication is another configuration dimension, orthogonal to the current storage-type-based policies (HOT, WARM, COLD). Adding EC in the storage policy space will require too many combinations to be explicitly listed and chosen from.
          2. On-going development can still use HDFS-7347, which embeds EC as one of the storage policies (it has already been committed to HDFS-EC). HDFS-7337 should take the EC policy out from file header and put it as an XAttr. Other EC parameters, including codec algorithm and schema, should also be stored in XAttr
          3. HDFS-7343 fundamentally addresses the issue of complex storage policy space. It's a hard problem and should be kept separate from the HDFS-EC project.

          Client and DataNode:

          1. At this point the design of HDFS-7545 – which wraps around the DataStreamer logic – looks reasonable. In the future we can consider adding a simpler and more efficient output class for the one replica scenario.

          We also went over the list of subtasks. Several high level comments:

          1. The list is already pretty long. We should reorder the items to have better grouping and more appropriate priorities. I will make a first pass.
          2. It seems HDFS-7689 should extend the ReplicationMonitor rather than creating another checker.
          3. We agreed the best way to support hflush/hsync is to write temporary parity data and update later, when a complete stripe is accumulated.
          4. We need another JIRA for truncate/append support.
          szetszwo Tsz Wo Nicholas Sze added a comment -

          Thanks for posting the meeting note. The meeting was very productive!

          > ... The COMPLETE state needs to collect ack from all participating DNs in the group.

          It should collect acks from the minimum number of DNs required for reading the data, e.g. the min is 6 for (6,3)-Reed-Solomon.
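          This rule can be sketched as follows (a minimal illustration; the function name is made up and is not the real HDFS-EC API):

```python
# Sketch of the COMPLETE rule: with a (k, m) Reed-Solomon schema, any k
# of the k + m blocks suffice to reconstruct the data, so COMPLETE only
# needs acks from at least k DataNodes. Illustrative, not real HDFS code.
def can_complete(acked_dns, data_units, parity_units):
    if acked_dns > data_units + parity_units:
        raise ValueError("more acks than DNs in the group")
    return acked_dns >= data_units

# (6,3)-Reed-Solomon: 6 acks suffice, 5 do not.
```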

          hitliuyi Yi Liu added a comment - - edited

          Good catch. I need to update that design doc to match the 64KB default stripe cell size. BTW, 1MB is the I/O buffer size (see parameters C and B on page 8 of the master design doc).

          I think we need to allow the stripe cell size to vary with the file size. If we only use a small fixed value, for example 64KB, as the stripe cell size, then for a large file we need many more EC block groups to store the file than the number of blocks needed with replication. Even if, as implemented in HDFS-7339, we only store the first EC block of each block group in the NN, NN memory consumption becomes a big issue because there are too many EC block groups.

          zhz Zhe Zhang added a comment -

          If we only use small fixed value, for example 64KB as the stripe cell size, then for large file, we need much more ec block groups to store the entire file than the number of blocks we need using replication way,

          The number of block groups is actually unrelated to the cell size (e.g. 64KB). For example, under a 6+3 schema, any file smaller than 9 blocks will have 1 block group.

          A smaller cell size better handles small files. But data locality is degraded – for example, it might be hard to fit MapReduce records into 64KB cells.

          drankye Kai Zheng added a comment -

          I think we need to allow dynamic stripe cell size depends on the file size.

          Good idea. Use a small stripe cell size for small files in one zone, and a large stripe cell size for large files in another zone. For MR or data-locality-sensitive files, use a larger cell size. Since we're going to support various striping and EC forms via configurable schemas and file system zones, different stripe cell sizes should be possible, I guess.

          hitliuyi Yi Liu added a comment - - edited

          The number of block groups is actually unrelated to the cell size (e.g. 64KB). For example, under a 6+3 schema, any file smaller than 9 blocks will have 1 block group.
          A smaller cell size better handles small files. But data locality is degraded – for example, it might be hard to fit MapReduce records into 64KB cells.

          I think it's incorrect for a normal file. For example, say we have a file whose length is 128MB. If we use the 6+3 schema and the EC stripe cell size is 64KB, then we need (128*1024KB)/(6*64KB) = 342 block groups. But if the EC stripe cell size is 8MB, then we need 128/(6*8) ≈ 3 block groups.
          Obviously, a small stripe cell size will cost much more NN memory for normal/big files, even if we only store the first EC block of each block group in the NN.

          hitliuyi Yi Liu added a comment - - edited

          Small strip cell size for small files in a zone, and large strip cell size for large files in another zone

          Right, for large files, using a large stripe cell size can decrease NN memory consumption. Otherwise, the EC feature will cause a big NN memory issue.
          BTW, one thing is not very clear to me: do we need the concept of a "zone"? What is the definition of "zone"?

          drankye Kai Zheng added a comment -

          Yes, we have EC zones. Each zone actually represents a folder path and is associated with an EC schema; all the files in the zone will be stored in the form defined by that schema.

          zhz Zhe Zhang added a comment -

          I think it's incorrect. For example, we have a file, and it's length is 128M. If we use 6+3 schema, and ec stripe cell size is 64K, then we need (128*1024K)/(6*64K) = 342 block groups.

          Aah I see where the confusion came from. Sorry that the design doc didn't explain clearly the different parameters. When the client writes to a striped file, the following 3 events happen:

          1. Once the client accumulates 6*64KB data, it does not flush the data to the DNs. The client buffers the data and starts buffering the next 6*64KB stripe.
          2. Once the client accumulates 1024 / 64 = 16 stripes – that is 1MB for each DN – it flushes out the data to DNs.
          3. Once the data flushed to each DN reaches 128MB – that is 128MB * 6 = 768MB data overall – it allocates a new block group from NN.

          Section 2.1 of the QFS paper has a pretty detailed explanation too.
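          The three thresholds above can be sketched as follows (a hedged illustration assuming the 6+3 schema, 64KB cells, 1MB per-DN flush buffers and 128MB blocks from this discussion; the names are made up, not the real client API):

```python
# Sketch of the client-side write thresholds for a (6,3) schema.
CELL = 64 * 1024                  # one stripe cell, written to one DN
STRIPE = 6 * CELL                 # one full data stripe (6 cells)
FLUSH = 16 * STRIPE               # 16 stripes buffered = 1MB per data DN
GROUP = 6 * 128 * 1024 * 1024     # data capacity of one block group (768MB)

def block_groups(file_size):
    # The block group count depends on the 128MB block size,
    # not on the 64KB cell size.
    return -(-file_size // GROUP)  # ceiling division
```

          So a 128MB file still needs only a single block group, not 342.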

          hitliuyi Yi Liu added a comment -

          Yes we have EC zones, each zone actually represents a folder path and associates with an EC schema. Using the schema all the files in the zone will be in the form defined by it.

          OK, I see. That's fine with me. It's similar to storage policies for directories and files, so I think we don't need the concept of a zone here. What concerns me is that zones come with restrictions; for example, for an encryption zone, files can't be renamed to folders outside the zone, and so on.

          drankye Kai Zheng added a comment -

          I'm not sure storage policy can cover all the cases and forms we're going to support, considering striping support. I guess an EC zone might not hurt. You're right about the restrictions for an EC zone: yes, a file in a zone should not move outside it without the necessary transformation first.

          hitliuyi Yi Liu added a comment -
          1. Once the client accumulates 6*64KB data, it does not flush the data to the DNs. The client buffers the data and starts buffering the next 6*64KB stripe.
          2. Once the client accumulates 1024 / 64 = 16 stripes – that is 1MB for each DN – it flushes out the data to DNs.
          3. Once the data flushed to each DN reaches 128MB – that is 128MB * 6 = 768MB data overall – it allocates a new block group from NN.

          Yes, it makes sense now. Thanks.

          drankye Kai Zheng added a comment -

          == Update ==

          I'm happy to share that Huawei is also interested in this erasure coding support, and some engineering resources from the company will join the effort. They're most welcome; with their dedicated contributions we'll surely be able to move even faster. Today we had a PRC local meeting with the Huawei engineers. The attendees: Zhe Zhang (Cloudera); Yi Liu, Li Bo and Kai Zheng (Intel); Yong Zhang, dandantu and some other team members (Huawei). They will start with two challenging tasks: 1) implementing an EC-customized block placement policy (HDFS-7613); 2) investigating and implementing the Hitchhiker erasure coding algorithm (HDFS-7715). Thanks!

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Revised the design doc as follows:

          • Revised the sections for
            • Saving I/O bandwidth
            • BlockGroup,
            • ErasureCodec,
            • ECClient, and
            • NameNode Memory Usage Reduction.
          • Added new sections for
            • EC Writer,
            • Handling Datanode Failure during Write,
            • Reading a Closed File,
            • Reading a Being Written File,
            • Hflush,
            • Hsync,
            • Append,
            • Truncate,
            • BlockGroup States,
            • Generation Stamp,
            • BlockGroup Recovery,
            • EC Block Reconstruction,
            • Collision with Random Block ID.
          szetszwo Tsz Wo Nicholas Sze added a comment -

          Forgot to mention that the new design doc file is HDFSErasureCodingDesign-20150204.pdf.

          jingzhao Jing Zhao added a comment -

          I tried to merge trunk into the HDFS-EC branch and got a lot of conflicts. It looks like some of the trunk changes were applied manually to the EC branch? "git rebase origin/trunk" shows several hundred diverged commits. Does anyone know how to fix this? Or we can create a different EC branch.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Since there are only a few patches committed, let's recreate the branch in order to fix the divergence.

          zhz Zhe Zhang added a comment -

          Jing Zhao, Tsz Wo Nicholas Sze I do the following every week (Monday) to merge trunk into HDFS-EC:

               git rebase apache/trunk
               git rebase apache/HDFS-EC
               git push apache HDFS-EC:HDFS-EC
          

          Does it look correct? I just rebased again, didn't see any conflict.

          zhz Zhe Zhang added a comment -

          Thanks Tsz Wo Nicholas Sze for adding more details to the design doc. I see you are still editing so will hold on to my updates for now.

          In general the added sections look good to me. I think they reflect what we discussed in the meetup.

          I didn't put too much detail on ECClient / EC Writer originally, because there's a separate design under HDFS-7545. When I make the next rev I can move all client related content there and refer it in the master design doc.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          > ... I see you are still editing so will hold on to my updates for now.

          Yes, I just found some notes which were not yet added to the doc.

          HDFSErasureCodingDesign-20150206.pdf:

          • adds new sections for
            • Datanode Decommission, and
            • Appendix: 3-replication vs (6,3)-Reed-Solomon.
          drankye Kai Zheng added a comment -

          Thanks Tsz Wo Nicholas Sze and Zhe Zhang a lot for taking care of and updating the main design doc.

          When I make the next rev I can move all client related content there and refer it in the master design doc.

          Good point Zhe Zhang. Let's go this way, as we previously discussed some time ago. The doc here is for the overall design and general considerations, with more focus on the NameNode/ECManager part; for ECWorker, ECClient and ErasureCodec, it refers to their low-level design docs attached elsewhere or to be attached here. I will take some time to update the design doc attached in HDFS-7337, move the erasure codec details there, and consolidate all the related discussions into that doc. I think Li Bo can do a similar thing for HDFS-7344. This way we can keep the overall design doc here relatively maintainable and easier to read.

          zhz Zhe Zhang added a comment -

          The current HDFS-EC branch needs double rebasing because both 'git merge' and 'git rebase' were used in its history. To resolve this issue and make maintenance a little easier, I will recreate the branch now.

          Both 'git rebase' and 'git merge' have their own merits. Based on an offline discussion, Andrew Wang and I prefer 'git rebase', and Jing Zhao is OK with either. If you plan to contribute code and prefer 'git merge', please let me know. I usually import trunk changes into HDFS-EC every Monday, so let's reach an agreement within a week (before Feb. 16). Thanks!

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Either 'git rebase' or 'git merge' is fine. Thanks for taking care of it.

          We should keep "HDFS-EC" for a while. How about using "HDFS-7285" as the name for the new branch?

          zhz Zhe Zhang added a comment -

          We should keep "HDFS-EC" for a while.

          Tsz Wo Nicholas Sze Good point. I'd like to keep using "HDFS-EC" as the name of the primary branch, though. How about we rename the current branch as-is to "HDFS-EC-backup" or convert it to a tag?

          drankye Kai Zheng added a comment -

          keep using "HDFS-EC" to name the primary branch

          I would appreciate keeping "HDFS-EC" if it's possible or doable. It avoids us having to update the relevant target/fix/affects versions in all the related issues, and it would also not disturb our related discussions. Thanks!

          szetszwo Tsz Wo Nicholas Sze added a comment -

          > ... It avoids we having to update relevant target/fix/affect versions ...

          We do not have to update individual JIRAs. It is easy to rename HDFS-EC to something else.

          > ... How about we rename the current branch as-is to "HDFS-EC-backup" or convert it to a tag?

          It is fine if you really like "HDFS-EC". It just looks different from most other development branches, which simply use the JIRA number.

          drankye Kai Zheng added a comment -

          It is easy to rename HDFS-EC to something else

          Glad to know this. Thanks for clarifying this for me.

          It is fine if you really like "HDFS-EC". It just looks different from most of other development branches, which simply use the JIRA number.

          It's great that we can keep "HDFS-EC". I agree it looks kind of different; "fs-encryption" is another exception.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          "branch-trunk-win", "fs-encryption" and "HDFS-EC" are the only exceptions. There are 39 other branches that use the JIRA number as the name (or the name prefix). Why don't we follow the naming convention here?

          zhz Zhe Zhang added a comment -

          I'm fine with changing to "HDFS-7285" to be consistent with other feature branches. So here's the summary:

          1. All future commits should go into HDFS-7285
          2. I will import trunk changes into HDFS-7285 with git rebase every Monday. If anyone else needs to rebase please do the same
          3. Let's keep HDFS-EC for a week to be safe. Meanwhile let's make all necessary changes, including JIRA fix/target version.
          zhz Zhe Zhang added a comment -

          Sorry forgot to mention that the HDFS-7285 branch has been created with all HDFS-EC commits applied on top of the current trunk. Kai Zheng and Jing Zhao please let me know if I missed anything. Thanks!

          umamaheswararao Uma Maheswara Rao G added a comment -

          Note: since the branch name is now HDFS-7285, I have also changed the version name to HDFS-7285 in JIRA.

          zhz Zhe Zhang added a comment -

          Thanks Uma! It seems that took care of the target versions of all the subtasks too.

          zhz Zhe Zhang added a comment -

          Per discussion above let's officially switch to the HDFS-7285 branch now. We have a nightly Jenkins job to monitor all incoming changes: https://builds.apache.org/job/Hadoop-HDFS-7285-nightly/

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Zhe Zhang, thanks for setting up the Jenkins job!

          zhz Zhe Zhang added a comment -

          I'm seeing a lot of conflicts when rebasing against trunk. Somehow git decided to re-apply HDFS-7723. Below is the output of git rebase -i apache/trunk.

            1 pick 5c27789 HDFS-7347. Configurable erasure coding policy for individual files and directories ( Contributed by Zhe Zhang )
            2 pick ae4e4d4 HDFS-7339. Allocating and persisting block groups in NameNode. Contributed by Zhe Zhang
            3 pick eb3132b HDFS-7652. Process block reports for erasure coded blocks. Contributed by Zhe Zhang
            4 pick 2477b02 Fix Compilation Error in TestAddBlockgroup.java after the merge
            5 pick 0ae52c8 HADOOP-11514. Raw Erasure Coder API for concrete encoding and decoding (Kai Zheng via umamahesh)
            6 pick f9e1cc2 HADOOP-11534. Minor improvements for raw erasure coders ( Contributed by Kai Zheng )
            7 pick c36a7a9 HADOOP-11541. Raw XOR coder
            8 pick 93fc299 Added the missed entry for commit of HADOOP-11541
            9 pick 2516efd HDFS-7716. Erasure Coding: extend BlockInfo to handle EC info. Contributed by Jing Zhao.
           10 pick e746443 HADOOP-11542. Raw Reed-Solomon coder in pure Java. Contributed by Kai Zheng
           11 pick 1611bb2 HDFS-7723. Quota By Storage Type namenode implemenation. (Contributed by Xiaoyu Yao)
          

          I'll just re-apply 1~10 on top of current trunk.

          zhz Zhe Zhang added a comment - - edited

          We had another meetup last Friday (2/20). Below is a summary, followed by a plan to generate a functional prototype.

          Attendees: Zhe Zhang, Jing Zhao, and Kai Zheng

          Summary of BlockInfo extension

          1. The following diagram illustrates the extension of BlockInfo to handle striped block groups (I recreated it from a whiteboard drawing). This is mainly contributed by Jing and thanks again for the great work!
                            BlockInfo
                           /   |     \
            BlockInfoStriped   |      BlockInfoContiguous
                   |           |            |
                   |       BlockInfoUC?     |
                   |       /         \      |
            BlockInfoStripedUC       BlockInfoContiguousUC
            
          2. BlockInfoStriped and BlockInfoContiguous are already created under HDFS-7743 and HDFS-7716
          3. BlockInfoStripedUC and BlockInfoContiguousUC are created under HDFS-7749. The current plan is to keep them separate despite the duplicated code. A later effort will abstract out a common BlockInfoUC class.
          4. HDFS-7837 as well as part of HDFS-7749 handle persisting BlockInfo variants in multiple places:
            • BlockManager
            • INodeFile
            • FSImage
            • Editlog
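For readers following along, the hierarchy in the diagram can be sketched as plain Java classes. This is a minimal, hypothetical skeleton: the class names follow the whiteboard diagram, while the fields and methods are illustrative and not the committed HDFS code.

```java
// Minimal sketch of the BlockInfo hierarchy from the diagram above.
// Class names follow the diagram; fields/methods are illustrative only.
abstract class BlockInfo {
    final long blockId;
    BlockInfo(long blockId) { this.blockId = blockId; }
    abstract boolean isStriped();
}

class BlockInfoContiguous extends BlockInfo {
    BlockInfoContiguous(long blockId) { super(blockId); }
    boolean isStriped() { return false; }
}

class BlockInfoStriped extends BlockInfo {
    final short dataBlockNum;   // e.g. 6 for a (6,3) Reed-Solomon schema
    final short parityBlockNum; // e.g. 3
    BlockInfoStriped(long blockId, short data, short parity) {
        super(blockId);
        this.dataBlockNum = data;
        this.parityBlockNum = parity;
    }
    boolean isStriped() { return true; }
}

// Under-construction variants are kept separate for now (item 3 above);
// a later refactor could extract a shared BlockInfoUC abstraction.
class BlockInfoStripedUC extends BlockInfoStriped {
    BlockInfoStripedUC(long blockId, short data, short parity) {
        super(blockId, data, parity);
    }
}

class BlockInfoContiguousUC extends BlockInfoContiguous {
    BlockInfoContiguousUC(long blockId) { super(blockId); }
}
```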

          Remaining NameNode tasks

          1. LocatedBlocks should be extended for striped reader (HDFS-7853)
          2. Initial XAttr structure for EC configuration (HDFS-7839)
          3. Other tasks, including HDFS-7369, do not block creating an initial prototype and should have a lower priority.

          DataNode high level thoughts

          1. The NN will select a DN as the ECWorker in charge of recovering the lost data or parity block. That worker node might or might not be the same as the storage target (e.g., the ECWorker should have a powerful CPU)
          2. At this stage we should use simple logic assuming the ECWorker is the final target. It should construct the recovered block and store it locally, before pushing to the next targets if necessary
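To illustrate the simple logic in item 2, here is a hedged sketch of an ECWorker reconstructing a lost unit locally, using the trivial XOR schema purely for demonstration. Class and method names are hypothetical; a real implementation would go through the raw erasure coder API (HADOOP-11514) rather than inline XOR.

```java
// Illustrative ECWorker sketch: recover the lost unit locally first,
// then (in a real system) push it to further targets if necessary.
// Uses XOR parity (as in an XOR-2-1 schema) purely for demonstration.
class EcWorkerSketch {
    // XOR parity over the data units.
    static byte[] xor(byte[]... units) {
        byte[] out = new byte[units[0].length];
        for (byte[] unit : units)
            for (int i = 0; i < out.length; i++)
                out[i] ^= unit[i];
        return out;
    }

    // Recover a lost data unit from the surviving unit plus parity.
    // For XOR codes, decoding is the same XOR operation as encoding.
    static byte[] recover(byte[] survivingUnit, byte[] parityUnit) {
        return xor(survivingUnit, parityUnit);
    }
}
```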

          EC policies

          1. A set of default EC schemas should be embedded as part of HDFS
          2. An interface should be provided to define new EC schemas (either through the command line or by editing and refreshing an XML file)
          3. EC and block layout (striping vs. contiguous) should be two orthogonal configuration dimensions: in the next phase we can enable contiguous+EC. At this phase we can assume the striping layout when EC is enabled.

          Plan for a PoC prototype

          1. An initial PoC prototype should contain the following features:
            • Configure a file to be stored in striping + EC format
            • Client requests to allocate and persist the striped block groups in NN
            • NN returns located striped block group
            • Client writes to the allocated DNs in striping fashion
            • NN correctly processes striped block reports
            • Blocks in the striped block group can go through the state machine of UC-COMMITTED-COMPLETE. UNDER_RECOVERY doesn't have to be supported at this stage.
            • Client can close the file
            • Client can read back the content correctly
            • Optional: File system states and metrics are correctly updated – fsimage, edit logs, quota, etc.
          2. I think the following JIRAs should be resolved for the prototype:
            • HDFS-7749: need to fix a few Jenkins test failures
            • HDFS-7837
            • HDFS-7853
            • HDFS-7839
            • HDFS-7782

          It's quite likely that the list is incomplete. So please feel free to add to it. Thanks!

          drankye Kai Zheng added a comment -

          Thanks Zhe Zhang a lot for scheduling the meetup discussion and the complete summary!
          To make the first prototype complete and more solid, I guess we also need to incorporate HDFS-7349.

          zhz Zhe Zhang added a comment -

          To follow up on the PoC prototype plan, I created a very rough test by manually applying the following patches, and it seems to work – based on the description above:

          1. HDFS-7729 (this one needs major refactor after HDFS-7793)
          2. HDFS-7853
          3. HDFS-7782

          A few bugs have been found and I'll post them under individual JIRAs.

          drankye Kai Zheng added a comment -

          Thank you for taking this on and getting it to work!

          zhz Zhe Zhang added a comment -

          This is the patch from trunk that was used in the PoC test. It demonstrates the changes we have made to support basic I/O in striping layout.

          drankye Kai Zheng added a comment -

          As discussed previously, I updated the document for the codec framework part, Configurable and pluggable erasure codec, in HDFS-7337. More reviews and comments are welcome. Thanks!

          walter.k.su Walter Su added a comment -

          I have a problem with how to make EC files use a specially designed placement policy, and I need some help in decision making. Thanks. Link: HDFS-7068

          drankye Kai Zheng added a comment -

          Thanks Walter Su for raising the issue. I just commented with my thoughts there. Let's discuss there and see how it goes.

          zhz Zhe Zhang added a comment -

          We have been discussing how to fit EC with other storage policies since the first meetup and haven't reached a clear conclusion. This design is now blocking several ongoing JIRAs: HDFS-7068, HDFS-7349, HDFS-7839, HDFS-7866. I'd like to propose the following potential solution based on the ideas we have exchanged:

          To reiterate the challenge: multiple dimensions of storage policies could be applied to the same file. Across these dimensions we could have a large number of combinations – easily over 50, possibly over 100. Fitting them into a single-dimension policy space is inefficient for the system to manage and inconvenient for admins to set / get.

          • Storage-type preference: HOT / WARM / COLD
          • Erasure coding schema: ReedSolomon-6-3 / XOR-2-1 (targeting 5~10)
          • Block layout: Striping / contiguous
          • Other potential policies, e.g. compression

          We can setup a family of storage policy XAttrs, where each dimension can be independently set / get:

          • system.hdfs.storagePolicy.type
          • system.hdfs.storagePolicy.erasurecoding
          • system.hdfs.storagePolicy.layout

          Each dimension has a default value. So if an admin only wants to change the EC schema, the following commands can be used. getStoragePolicy should return policies on all dimensions unless an optional argument like -erasureCoding is used.

          setStoragePolicy -erasureCoding RS63 /home/zhezhang/foo
          getStoragePolicy /home/zhezhang/foo
          

          Like the current storage policy semantics, the initial policy of a file or dir is inherited from its parent. Nested policy setting is allowed (/home is not ECed but /home/zhezhang is). A single file can have a storage policy without being in a zone.
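The inheritance semantics above (nearest ancestor wins, with per-dimension defaults) can be sketched as follows. This is a hypothetical illustration only: the map stands in for the inode tree, and the XAttr key name and default value are assumptions drawn from the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-dimension policy resolution: an unset
// dimension falls back to the nearest ancestor, then a system default.
class PolicyResolver {
    static final String EC_KEY = "system.hdfs.storagePolicy.erasurecoding";
    static final String DEFAULT_EC = "REPLICATION"; // assumed default

    // path -> (xattr name -> value); stands in for the inode tree
    final Map<String, Map<String, String>> xattrs = new HashMap<>();

    void set(String path, String key, String value) {
        xattrs.computeIfAbsent(path, p -> new HashMap<String, String>())
              .put(key, value);
    }

    // Walk from the path up to "/" and return the first value found.
    String resolve(String path, String key) {
        for (String p = path; ; p = parent(p)) {
            Map<String, String> attrs = xattrs.get(p);
            if (attrs != null && attrs.containsKey(key)) return attrs.get(key);
            if (p.equals("/")) return DEFAULT_EC;
        }
    }

    static String parent(String path) {
        int i = path.lastIndexOf('/');
        return i <= 0 ? "/" : path.substring(0, i);
    }
}
```

With this, setting the EC dimension on /home/zhezhang makes /home/zhezhang/foo resolve to RS63 while siblings under /home keep the default, mirroring the nested-policy example in the comment.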

          Any feedback is very welcome. Jing Zhao, Tsz Wo Nicholas Sze, I think we should have another meetup to sync on this (and several other issues)?

          andrew.wang Andrew Wang added a comment -

          I think we could pack all of this into a single xattr (i.e., system.storagePolicy) as a protobuf. This will be more efficient, and also standardize the serde, since xattr values are just bytes.

          We could also leave the storage type in the file header the way it is, since that's zero overhead, and just store the additional parameters in the xattr.
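A rough sketch of the single-xattr packing idea: in practice the value would be a protobuf message, but the stand-in below just joins the fields with a NUL separator to show that the whole policy travels as one opaque byte[]. The field names and separator are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;

// Stand-in for packing all policy dimensions into one xattr value.
// A real implementation would serialize a protobuf message; here we
// join fields with a NUL byte to show the value is just opaque bytes.
class PackedPolicy {
    static byte[] pack(String ecSchema, String layout) {
        return (ecSchema + "\0" + layout).getBytes(StandardCharsets.UTF_8);
    }

    static String[] unpack(byte[] value) {
        return new String(value, StandardCharsets.UTF_8).split("\0");
    }
}
```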

          drankye Kai Zheng added a comment -

          Thanks Zhe Zhang for the great post.
          This documents the existing relevant discussions well and gives a good proposal summary that unifies storage policy so EC and striping can fit in in a much cleaner and more elegant way. I would explicitly point out that in this way we would not use an EC ZONE as previously discussed here and in other issues. We don't need to explicitly create and manage EC ZONEs; what is needed now is simply a storage policy for a file or folder. Once we all agree on this approach, we need to update the overall design here and rebase the relevant issues as well. It would be great if we could gather as much feedback and as many ideas as possible this time.

          zhz Zhe Zhang added a comment -

          Andrew Wang Thanks for the comment. I forgot to include those from our latest discussion. Yes, leaving the HSM policies as-is will work with this proposal and will just add some logic to combine data from XAttr and file header.

          Kai Zheng Good point. The semantics of EC configurations more closely resemble storage policies than zones. Like mentioned above, an EC policy can exist for a single file, and can be configured in a nesting manner.

          jingzhao Jing Zhao added a comment -

          Thanks for the summary, Zhe!

          One question for the EC policy or EC ZONE is whether we allow users to change the policy/schema of a file/dir. Currently, for storage policies like COLD, WARM, and HOT, users can change a file/directory's policy; the change is applied to newly created/appended data, and can be enforced on existing files later by external tools like Mover. This lazy enforcement semantic also applies to renamed files.

          However, for EC, since whether a file is EC'ed and its EC schema directly determine its read/write/append pattern, things become different and more complicated. If we allow changing the EC schema associated with a directory, we need to make sure the old EC schema of all the files inside can still be found, which means we may need to associate the schema directly with the files or even the blocks (which can be inefficient). And then how to handle newly appended data and file renames also becomes a challenge. If we disallow schema changes or renames across directories with different EC policies, in the end we may have a design like an EC ZONE.

          zhz Zhe Zhang added a comment -

          Jing Zhao This is a great question to discuss and was missing in the above summary; thanks for bringing it up!

          In the initial design (page 6 of the latest design doc ), EC policy changes will be lazily enforced by Mover or a similar tool. The rationales are:

          1. Conversion between replication and EC is a very important use case, so we do need to support changing the EC policy on files and dirs
          2. The conversion should be done lazily for the same reason we lazily enforce HSM policies: the purpose (saving space) is not urgent and the operation is expensive

          If we allow changing the EC schema associated with a directory, we need to make sure the old EC schema of all the files inside can still be found, which means we may need to associate the schema directly with the files or even the blocks (which can be inefficient).

          Great point. Adding a little formality might help the discussion here. Essentially every file or dir has a desired storage policy and an actual one. Luckily, in the context of HSM we don't need to explicitly keep track of the actual placement policy. EC policies are indeed more complicated in this respect. I think we can solve it by doing the following:

          1. In the storage policy XAttr, always store the actual policy instead of the desired one
          2. Mover (or a similar tool) should keep track of a queue of desired changes
          3. When converting an individual file, keep the old form until the block conversion is done. Then "flip" the XAttr
          4. Because of the above, when converting a directory we need to store the new policy XAttr on some of its files
          5. Appends should either be disallowed during a conversion, or handled with a more advanced mechanism like appending to both the old and new forms
          6. A renamed file should materialize and carry over the policy XAttr from the old dir. Then it becomes a nested scenario, where the new dir has policy B and the moved file has policy A
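Steps 1 to 3 above can be sketched as a small state tracker: the XAttr always holds the actual policy, desired changes wait in a queue, and the XAttr is flipped only after the block conversion completes. All names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of "store the actual policy, queue the desired changes".
class ConversionTracker {
    final Map<String, String> actualPolicy = new HashMap<>();   // path -> policy
    final Queue<String[]> desiredChanges = new ArrayDeque<>();  // {path, policy}

    void requestConversion(String path, String newPolicy) {
        desiredChanges.add(new String[] { path, newPolicy });
    }

    // Called by a Mover-like tool once the blocks have been rewritten
    // in the new form; only then is the XAttr "flipped".
    void finishNextConversion() {
        String[] change = desiredChanges.poll();
        if (change == null) return;
        actualPolicy.put(change[0], change[1]);
    }
}
```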
          jingzhao Jing Zhao added a comment -

          Thanks for the comment, Zhe! Some comments inline:

          In the storage policy XAttr, always store the actual policy instead of the desired one

          As you mentioned in #4, this may finally lead to the requirement that every file/block has to store its own "actual" storage policy. My main concern is that doing this at the file level will lead to much harder management. Administrators would have to check individual files to understand their storage scheme.

          Mover (or a similar tool) should keep track of a queue of desired changes

          Considering Mover is just an external tool, if we use storage policy for EC files, clients (including Mover) will still directly talk to the NN to set storage policy. And finally these desired changes still have to be handled/maintained by the NN (or even a separate HDFS internal service).

          Conversion between replication and EC is a very important use case; so we do need to support changing EC policy on files and dirs

          Agree. But different from HSM, where the migration only moves blocks across datanodes and keeps file->block metadata unchanged, the conversion between replication and EC will finally generate brand new blocks for the file. Therefore, a much simpler and maybe cleaner way to convert is just copying the files using the new scheme, deleting the old data after the copy, and renaming the new data to the old names if necessary. This will not cause inefficiency since we need to write new data anyway. Also, this can easily be adopted by a Mover-like tool, and can avoid a lot of complexity when handling changes on files during the conversion.

          In general, currently I prefer an EC-Zone design:

          1. Conversion is supported through copy
          2. The EC policy is annotated on the root directory of the zone as its XAttr
          3. No rename or EC schema change is allowed
            This is very similar to specifying the EC schema at the volume level (if we later support volumes).
          zhz Zhe Zhang added a comment -

          Thanks Jing for the insights! I certainly agree there are non-trivial tradeoffs between these 2 options.

          Therefore, a much simpler and maybe cleaner way to convert is just copying the files using the new scheme, deleting the old data after the copy, and renaming the new data to the old names if necessary.

          In that case, I guess the newly created file (with the old name) still needs to carry the new policy/schema? I don't see an easy way to avoid storing per-file policy info if the user chooses to convert individual files between EC and replication forms.

          jingzhao Jing Zhao added a comment -

          I think in most cases this conversion should happen at the directory level. For converting all the files contained in a directory, since all the files in the target directory (which can be a temporary directory before the later rename) share the same storage scheme, the storage schema only needs to be at the directory level if necessary. For converting an individual file from replication to EC, or between two EC schemas, in most use cases the file is just copied into another EC zone, and the schema should already be on the root of the zone.

          azuryy Fengdong Yu added a comment -

          Zhe Zhang, can you explain how to run your Python code? You don't have a parameter specification.

          azuryy Fengdong Yu added a comment -

          Wow, why are lots of repeated comments showing up here?

          zhz Zhe Zhang added a comment -

          Fengdong Yu Good catch. The fsimage analyzer should be run following the steps below:

          1. Download both Python programs to the same directory
          2. Run ./ECAnalyzer <fsimage file name> <flag>, where the flag (-Dold or -Dnew) indicates whether the fsimage is in the old (delimited) or new (XML) format.
          zhz Zhe Zhang added a comment -

          I had another offline discussion with Andrew Wang around the storagePolicy vs. zone topic. We agreed that it's a difficult decision because it requires prediction of production usage patterns. The desired EC setup might not always align with directories. E.g., it is possible for a directory to contain both big files (suitable for striping) and small ones (which would cause heavy NN overhead under striping). In this case, we can keep the directory policy non-EC, so only big files need to carry the EC policy in their XAttr – it is a small NN overhead since only a small fraction of files are big. As a follow-on optimization we can even set up a size-based policy for automatic conversion. I'll look at a few applications like HBase / Hive to get a better understanding.
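The size-based idea could be as simple as the following sketch; the threshold value is an assumption purely for illustration.

```java
// Hypothetical size-based chooser: keep the directory non-EC and mark
// only large files for striped EC, so small files avoid the extra
// NameNode overhead of striping. The 1 GB threshold is illustrative.
class SizeBasedEcChooser {
    static final long EC_THRESHOLD_BYTES = 1L << 30; // assumed: 1 GB

    static boolean shouldUseStripedEc(long fileSizeBytes) {
        return fileSizeBytes >= EC_THRESHOLD_BYTES;
    }
}
```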

          I think we can follow an incremental development plan:

          1. We can start with a simple zone-like policy as Jing proposed above. In this step we don't even need to fully implement the enforcement of zone constraints (empty directory, no nesting etc.).
          2. After collecting potential usage patterns (in terms of directory structure), we'll decide whether the use case of per-file and nested EC configuration is important enough. Based on that, we'll either fully implement zone constraints or implement fine-grained EC policies.
          3. We'll finally decide whether and how to integrate with other storage policies.

          Thoughts?

          jingzhao Jing Zhao added a comment -

          Thanks for the update, Zhe! The plan sounds good to me. Let's finish the zone work first.

          drankye Kai Zheng added a comment -

          Good discussion and plan. But I'm still a little bit confused. Zhe Zhang was mentioning EC policies, and thinking about integrating them with other storage policies (HSM ones); Jing Zhao said let's finish the zone work first. What term or concept would we use as a final choice? I'm worried about this because it's kind of messy; we need to choose one and use it consistently, update the overall design doc, and sync with related issues. It also affects the implementation, for example, setStoragePolicy or createZone for an admin to set an EC policy for a directory... Let's have a conclusion. Thanks.

          In my view, if we use something like extended storage policy (maybe better than EC policy), it would be easier to unify and integrate with the existing HSM storage policies, and it would also save some DFS commands for creating EC zones. If we use EC Zone, it might not be so natural to create a zone just for a file, in case file-level policy is needed in the future. If we're likely to support file-level EC policy, an EC zone for a directory sounds more natural. Since for the medium term we only support directory-level EC, either one is good; we just need to pick one and stick to the choice.

          zhz Zhe Zhang added a comment -

          Thanks Kai Zheng for the thoughts.

          I just talked to some HIVE folks about the HDFS directory structure in their typical workloads. In a nutshell, it looks like the following:

                 warehouse
                /         \
               db1       db2
              /  \     /     \ 
           ...  ... table_1   table_2
                             /   |   \
                        part_1 part_2 part_3 ...
          

          Each DB table is represented as a directory (usually with a huge fan-out), under which each partition is stored as a file. Each partition maps to a fixed range in the key space. I was told that it's quite common to see skewed partitions. A likely scenario is thousands of small partitions along with a few outliers that are much larger than average. In the EC context, this indicates a potential need for per-file policies (e.g. EC for large files, replication for small files).
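The per-file idea above ("EC for large files, replication for small files") can be sketched with a tiny, hypothetical helper. Neither the class nor the threshold below exists in HDFS; both are illustrative assumptions only:

```java
/**
 * Hypothetical sketch of size-based policy selection, as discussed above.
 * The class name, the policy strings, and the 256 MB threshold are all
 * illustrative assumptions, not actual HDFS APIs or defaults.
 */
public class SizeBasedPolicyChooser {
  // Illustrative threshold: files at or above 256 MB would be EC-coded.
  static final long EC_THRESHOLD_BYTES = 256L * 1024 * 1024;

  /** Returns the policy name a file of the given size would receive. */
  public static String choosePolicy(long fileSizeBytes) {
    return fileSizeBytes >= EC_THRESHOLD_BYTES ? "RS-6-3" : "REPLICATION";
  }

  public static void main(String[] args) {
    // A small skewed partition stays replicated; a large outlier is EC-coded.
    System.out.println(choosePolicy(4L * 1024 * 1024));        // REPLICATION
    System.out.println(choosePolicy(2L * 1024 * 1024 * 1024)); // RS-6-3
  }
}
```

A size-based rule like this would keep NN overhead small, since only the few large files carry the EC policy in their XAttr.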

          I still plan to look at a few more cases. At this stage, I think extended storage policy is a good term to use in our APIs (maybe we can abbreviate it as XStoragePolicy).

          drankye Kai Zheng added a comment -

          At this stage, I think extended storage policy is a good term to use in our APIs (maybe we can abbreviate it as XStoragePolicy).

          Sounds good to me. XStoragePolicy is nice.

          drankye Kai Zheng added a comment -

          Per discussion with Tsz Wo Nicholas Sze, I updated the ECWorker design doc in HDFS-7344, incorporating the latest discussions and thoughts scattered across related JIRAs, from Li Bo, Zhe Zhang, Jing Zhao, Tsz Wo Nicholas Sze, etc. Its breakdown and sub-tasks are also opened accordingly for all parties to consider taking. Hope this way we can move forward and make as good progress in the DataNode as we are making in the NameNode, client, and codec framework.

          vinayrpet Vinayakumar B added a comment -

          Hi,
          I think most of the commits to HDFS-7285 were not added to CHANGES-HDFS-EC-7285.txt.
          This will help to update CHANGES.txt at the time of merging to trunk, and hence record the contributions.
          Very happy to see many new people contributing to this work.

          For all commits till now I have updated CHANGES-HDFS-EC-7285.txt through HDFS-8027.
          Please take care for further commits.
          Thanks.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Agree. We should add an entry to CHANGES-HDFS-EC-7285.txt for each commit.

          zhz Zhe Zhang added a comment -

          Yesterday we had another offline meetup. I think the discussion was very productive. Below please find the summary:
          Attendees: Nicholas, Jing, Zhe

          Project phasing
          We went over the list of subtasks under this JIRA and separated them into 3 categories:

          1. Basic EC functionalities under the striping layout. Those subtasks were kept under this umbrella JIRA. The goal is for the HDFS-7285 branch to be ready for merging into trunk upon their completion.
          2. Follow-on tasks for EC+striping (including code and performance optimization, as well as support for advanced HDFS features). Those subtasks were moved under HDFS-8031. Following the common practice, those follow-on tasks are targeted for trunk, after HDFS-7285 is merged.
          3. EC with non-striping / contiguous block layout. Those subtasks were moved to HDFS-8030, which represents the 2nd phase of the erasure coding project.

          Extending from the initial PoC prototype, the following basic EC functionalities will be finished under this JIRA (Tsz Wo Nicholas Sze please let me know if I missed anything from your list):

          • A striped block group is distributed evenly on racks
          • NN handles striped block groups in existing block management logics:
            • Missing and corrupted blocks
            • To-invalidate blocks
            • Lease recovery
            • DN decommissioning
          • NN periodically distributes tasks to DN to reconstruct missing striped blocks
          • DN executes the reconstruction task by pulling data from peer DNs
          • Client can read a striped block group even if some blocks are missing, through decoding
          • Client should handle DN failures during writing
          • Basic command for directory-level EC configuration (similar to a zone)
          • Correctly handle striped block groups in file system statistics and metrics
          • Documentation
          • More comprehensive testing
          • Optional: instead of hard-coding, incorporate the ECSchema class with 1~2 schemas

          Key remaining tasks
          We think the following remaining tasks are key in terms of complexity and amount of work:

          1. Client writing: the basic striped writing logic is close to complete (patch available under HDFS-7889), but it's challenging to handle failures during writing in an elegant way.
          2. Client reading: the logic isn't too complex but the amount of work is non-trivial
          3. DN reconstruction: logic is clean but work has not been started yet

          Client design
          We also dived into more details of the design of client reading/writing paths, and are synced on the overall approach. A few points were raised and will be addressed:

          1. Cell size in striping currently has a default value of 1MB. We should study its impact more carefully. Intuitively, a smaller value (like 128KB) might be more suitable.
          2. Pread in striping format should always try to fetch data in parallel, when the requested range spans multiple striping cells.
          3. Stateful read in striping format should maintain multiple block readers to minimize overhead of creating new readers.
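To make the cell-size and pread points above concrete, here is a minimal sketch of the round-robin striping math: which block within a group holds a given logical byte, and where inside that block it lives. This is illustrative code based on the layout described in this discussion, not actual HDFS-7285 source; the class and method names are assumptions:

```java
/**
 * Illustrative round-robin striping arithmetic. A logical file offset is
 * divided into fixed-size cells, which are distributed across the data
 * blocks of a group in round-robin order. Names are assumptions, not
 * actual HDFS-7285 code.
 */
public class StripeMath {
  /** Index of the block (within the group) holding the byte at logicalOffset. */
  public static int blockIndex(long logicalOffset, int cellSize, int numDataBlocks) {
    long cellIndex = logicalOffset / cellSize;   // which cell, file-wide
    return (int) (cellIndex % numDataBlocks);    // cells go round-robin
  }

  /** Offset of that byte inside its block. */
  public static long offsetInBlock(long logicalOffset, int cellSize, int numDataBlocks) {
    long cellIndex = logicalOffset / cellSize;
    long stripeIndex = cellIndex / numDataBlocks; // which full stripe
    return stripeIndex * cellSize + logicalOffset % cellSize;
  }

  public static void main(String[] args) {
    int cell = 64 * 1024, k = 6; // 6 data blocks, 64KB cells
    System.out.println(blockIndex(0, cell, k));            // 0
    System.out.println(blockIndex(cell, cell, k));         // 1
    System.out.println(blockIndex(6L * cell, cell, k));    // 0  (wraps around)
    System.out.println(offsetInBlock(6L * cell, cell, k)); // 65536 (one cell deep)
  }
}
```

This mapping is also what decides how many blocks a pread range spans: any range wider than one cell touches multiple blocks and can be fetched in parallel.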
          szetszwo Tsz Wo Nicholas Sze added a comment -

          Here is the list we discussed.

          Phase 1 – Basic EC features

          • Support (6,3)-Reed-Solomon
          • Read
            • from closed EC files
            • from files with some missing blocks
          • Write
            • Write to 9 datanodes in parallel
            • Failure handling: continue writing with the remaining datanodes as long as #existing datanodes >= 6.
          • EC blocks reconstruction
            • Scheduled by NN like replication
            • Datanode executes block group reconstruction
          • Block group lease recovery
            • Datanode executes lease recovery
            • Truncate at stripe group boundary
          • NN changes
            • EC block group placement
            • EC zone
            • Safemode calculation
            • Quota
            • Block report processing
            • Snapshot
            • Fsck
            • Editlog/image
            • Block group support
            • EC file deletion
            • Decommission
            • Corrupted EC blocks
            • ID collision
          • Balancer/Mover
            • Do not move EC blocks
          • Documentation
          • Testing
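The reconstruction items in the list above can be illustrated in miniature. The real feature uses (6,3) Reed-Solomon over a Galois field; the toy single-parity XOR code below is only a simplified stand-in that shows the same mechanic a DataNode would execute: rebuild a lost block from the surviving peers plus parity. All names here are illustrative, not HDFS code:

```java
/**
 * Simplified single-parity erasure code (XOR). The actual feature uses
 * (6,3) Reed-Solomon; this toy version only illustrates the mechanic of
 * reconstructing a lost cell from the surviving ones plus parity.
 */
public class XorParityDemo {
  /** Parity cell = XOR of all data cells (all cells same length). */
  public static byte[] parity(byte[][] dataCells) {
    byte[] p = new byte[dataCells[0].length];
    for (byte[] cell : dataCells)
      for (int i = 0; i < p.length; i++) p[i] ^= cell[i];
    return p;
  }

  /** Rebuild one lost cell by XOR-ing the survivors with the parity. */
  public static byte[] reconstruct(byte[][] survivingCells, byte[] parity) {
    byte[] lost = parity.clone();
    for (byte[] cell : survivingCells)
      for (int i = 0; i < lost.length; i++) lost[i] ^= cell[i];
    return lost;
  }

  public static void main(String[] args) {
    byte[][] data = { {1, 2}, {3, 4}, {5, 6} };
    byte[] p = parity(data);
    // Pretend cell 1 ({3, 4}) was lost; rebuild it from cells 0, 2 and parity.
    byte[] rebuilt = reconstruct(new byte[][] { data[0], data[2] }, p);
    System.out.println(java.util.Arrays.toString(rebuilt)); // [3, 4]
  }
}
```

With RS(6,3) instead of single parity, up to 3 of the 9 cells in a stripe can be lost and still reconstructed, which is why the write path above can keep going as long as at least 6 datanodes survive.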
          drankye Kai Zheng added a comment -

          Thank you all for the comprehensive discussion and for sorting these out! It's very helpful.

          zhz Zhe Zhang added a comment -

          I just finished the weekly rebase. This time it was quite heavy, with several non-trivial changes. With this rebase, we finally have a stable Jenkins build with no test failures.

          If you have any ongoing work, please save the patch against your local repo and reapply on the new HDFS-7285 branch. Thanks!

          szetszwo Tsz Wo Nicholas Sze added a comment -

          > ... This time it's quite heavy, with several non-trivial changes. ...

          Some of the conflicts are probably due to HDFS-8048. I will make sure that it won't affect this JIRA much.

          zhz Zhe Zhang added a comment -

          Thanks Nicholas for the consideration! I think we also need to coordinate with HDFS-6200 (in particular, HDFS-8053 and HDFS-8054). Haohui Mai Could you advise when you plan to start/finish these 2 JIRAs? If necessary we can try to push our changes in DFSInputStream and DFSOutputStream to trunk.

          wheat9 Haohui Mai added a comment -

          If necessary we can try to push our changes in DFSInputStream and DFSOutputStream to trunk.

          It should be unnecessary as git is smart enough to detect renames.

          zhz Zhe Zhang added a comment -

          Thanks Haohui for the advice. You are right. In the most recent rebase git did merge changes to renamed files smoothly.

          zhz Zhe Zhang added a comment -

          Thanks for your great job on making erasure coding native in HDFS.
          I am working on proactive data protection in HDFS by incorporating a hard drive failure detection method, based on collected SMART attributes, into the HDFS kernel and scheduling the disk warning process in advance, and I want to have erasure coding natively supported by the HDFS kernel instead of HDFS-RAID.
          I have some questions below, but I don't know where to ask them, so I just list my questions here and hope it won't bother you too much.
          1. I am wondering whether and where I can download the project source code you are working on.
          2. When will this project be accomplished? Will it take a long time?
          3. Can guys like me join your group?

          Copying comments from lpstudy over here and please find my answers below:

          1. Erasure coding has been developed under the HDFS-7285 branch and the code can be accessed on github: https://github.com/apache/hadoop/tree/HDFS-7285
          2. We haven't explicitly discussed the target Hadoop release of the erasure coding feature. The plan will be discussed here.
          3. Sure! Contributions are always very welcome. Please feel free to file JIRAs on issues you see or take existing ones after checking with original assignee.
          Vincent.Wei Vincent.Wei added a comment -

          Hi all,
          I am a newcomer. I want to know if I can apply the HDFS-7285-initial-PoC.patch to Hadoop v2.2.0?
          Thanks.

          vinayrpet Vinayakumar B added a comment -

          Hi Vincent.Wei, thanks for looking here.

          I am a newcomer. I want to know if I can apply the HDFS-7285-initial-PoC.patch to Hadoop v2.2.0?

          That patch was still in the development phase at that time, and it was based on trunk code. I think a lot has changed since v2.2.0.
          Current progress is really good towards the completion of Phase I of the EC feature.
          Hopefully the feature will be available soon in trunk/branch-2.

          zhz Zhe Zhang added a comment -

          Uploading a test plan for phase I of the feature (thanks Kai Zheng for filling in the details on codec testing). Any questions and comments are very welcome.

          When we move on to follow-on optimizations (HDFS-8031) and phase II of the erasure coding feature (HDFS-8030) I will post additional test plans.

          zhz Zhe Zhang added a comment -

          Since most of the planned functionalities for this phase are complete, we should perhaps start examining the entire consolidated patch in preparation for merging into trunk.

          As you might have noticed, there have been a few trunk-based JIRAs trying to merge generic code refactors to trunk first, so as to minimize the consolidated patch: HDFS-8487, HDFS-8605, HDFS-8608, HDFS-8623, etc. I'm working on a PoC branch which rebases HDFS-7285 on those efforts. I just finished a first pass, which (I think) includes all the changes except for fsimage/editlog support.

          In particular, the updated BlockInfo structure from HDFS-8487 will cause some non-trivial changes to the current HDFS-7285 code (hopefully making it cleaner). I'm attaching the proposed BlockInfoStriped and BlockInfoUCStriped patch. Comments and suggestions are very welcome.

          zhz Zhe Zhang added a comment -

          Proposed structure for BlockInfoStriped and BlockInfoUCStriped. It's a little outdated. Please refer to my github branch for the latest proposed code.

          vinayrpet Vinayakumar B added a comment - - edited

          Proposed structure for BlockInfoStriped and BlockInfoUCStriped. It's a little outdated. Please refer to my github branch for the latest proposed code.

          Below are some comments on the updated code from the github PoC branch, for the BlockInfo hierarchy.

            @Override
            BlockInfoUnderConstruction convertCompleteBlockToUC(
                HdfsServerConstants.BlockUCState s, DatanodeStorageInfo[] targets) {
              BlockInfoUnderConstructionContiguous ucBlock =
                  new BlockInfoUnderConstructionContiguous(this,
                      getBlockCollection().getPreferredBlockReplication(), s, targets);
              ucBlock.setBlockCollection(getBlockCollection());
              return ucBlock;
            }

          BlockInfoStriped#convertCompleteBlockToUC(..) should return a BlockInfoUnderConstructionStriped instance.

          StripedBlockStorageOp.java needs to have the Apache license header.

          zhz Zhe Zhang added a comment -

          Thanks Vinay for reviewing the BlockInfo code! Those are good catches. I will address them in the 2nd pass that I'm working on.

          zhz Zhe Zhang added a comment -

          I just finished a pass rebasing all non-test changes. Attached is the consolidated PoC patch. The aforementioned github repo also has the same changes. We have roughly:

          1. 5,600 LoC in hadoop-common, which is entirely new code
          2. 2,000 LoC in blockmanagement. This is mainly to support block-level management of striping. Aside from a few new classes, most changes are in BlockManager
          3. 1,600 LoC in namenode. This is mainly to add striped blocks support in INodeFile, and fsimage/editlog. Biggest changes are on INodeFile, FSNameSystem, and FSDirWriteFileOp
          4. 1,100 LoC in datanode, 2,500 in client, and 1,000 in util. This is mainly new code.

          Next I plan to make another pass and divide the consolidated patch into functional pieces (while doing that, add associated tests to each piece).

          Meanwhile, any comments / questions / suggestions on the patch are very welcome.

          zhz Zhe Zhang added a comment -

          Compared with the current HDFS-7285 branch, besides pre-merged refactor changes, the biggest difference is around INodeFile. With the new BlockInfo hierarchy introduced in HDFS-8499, we maintained the existing BlockInfo - BlockInfoUC structure in trunk; as a cost, we are losing the BlockInfoStriped - BlockInfoStripedUC inheritance in the current HDFS-7285 branch.

          Consequently, INodeFile#blocks and FileWithStripedBlocksFeature#blocks can no longer use BlockInfoContiguous and BlockInfoStriped types because they contain both complete and UC blocks. Yi Liu's approach under HDFS-8058 is an option to solve the issue. Or for stronger type safety, we can build two additional interfaces, one for striped complete or UC blocks and another one for contiguous ones.
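The "two additional interfaces" idea can be sketched as follows. This is an illustration only, with locally defined stub types; the interface names StripedBlk and ContiguousBlk are made up here and are not part of any patch.

```java
// Hypothetical marker interfaces: one per layout, each spanning both the
// complete and the under-construction (UC) block variants.
interface StripedBlk {}
interface ContiguousBlk {}

abstract class BaseBlockInfo {}
class CompleteContiguous extends BaseBlockInfo implements ContiguousBlk {}
class UCContiguous extends BaseBlockInfo implements ContiguousBlk {}
class CompleteStriped extends BaseBlockInfo implements StripedBlk {}
class UCStriped extends BaseBlockInfo implements StripedBlk {}

// A field typed against the layout interface can hold complete and UC blocks
// of that layout, but rejects the other layout at compile time.
class StripedFileBlocks {
  private StripedBlk[] blocks;
  void setBlocks(StripedBlk[] blocks) { this.blocks = blocks; }
  int numBlocks() { return blocks == null ? 0 : blocks.length; }
}

class MarkerInterfaceDemo {
  public static void main(String[] args) {
    StripedFileBlocks f = new StripedFileBlocks();
    f.setBlocks(new StripedBlk[] { new CompleteStriped(), new UCStriped() });
    System.out.println(f.numBlocks()); // prints: 2
    // Passing contiguous blocks would not compile, which is the type-safety win.
  }
}
```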

          jingzhao Jing Zhao added a comment -

          With the new BlockInfo hierarchy introduced in HDFS-8499....

          Then why not bring the BlockInfoXXX - BlockInfoXXXUC inheritance back and just make the inheritance structure like the HDFS-7285 branch?

          walter.k.su Walter Su added a comment -

          INodeFile#blocks and FileWithStripedBlocksFeature#blocks can no longer use BlockInfoContiguous and BlockInfoStriped types because they contain both complete and UC blocks.

          We can use BlockInfo as an abstraction for complete and UC blocks. We need to change FileWithStripedBlocksFeature#blocks to BlockInfo as well.

          It's doable, because currently in trunk INodeFile#blocks is already of type BlockInfo and it works fine. We usually don't cast BlockInfo to BlockInfoContiguous (or its UC variant); I didn't find such casts in the trunk code, and if we had them we should worry about type safety.
          I saw you cast BlockInfo to BlockInfoStriped multiple times in BlockManager in the github branch. Those casts can be eliminated.

          HDFS-8058 is irrelevant here because in trunk INodeFile#blocks is already BlockInfo; HDFS-8058 is about reducing memory usage.

          zhz Zhe Zhang added a comment -

          Thanks Jing and Walter for the helpful discussion!

          Then why not bringing the BlockInfoXXX - BlockInfoXXXUC inheritance back and just make the inheritance structure like HDFS-7285 branch?

          This is because several places in trunk are relying on the BlockInfo - BlockInfoUC inheritance. As discussed under HDFS-8499, this multi-inheritance problem is fundamentally hard. HDFS-8499 patch keeps the BlockInfo - BlockInfoUC inheritance to minimize change to trunk. This structure also makes it easier to share common code because the code difference along the contiguous-striped dimension is smaller than the UC dimension.

          I'm open to revisiting the BlockInfo structure based on discussion here. With either structure discussed above, I think we should solve the BlockInfo multi-inheritance problem more completely as a follow-on.

          We can use BlockInfo as an abstraction for complete and UC blocks. We need to change FileWithStripedBlocksFeature#blocks to BlockInfo as well.

          The PoC patch already does that. As Jing and I commented under HDFS-8058, the downside is weaker type safety. For example, at the API level, setBlocks allows some other method to assign an array of BlockInfo to the INode, and it's not easy to verify whether the array contains mixed types. My current thought is that we can create an abstraction BlocksInAFile, with a type and an array of BlockInfo. This would serve as a central place to control type safety. I'll post a patch under HDFS-8058 to demonstrate the idea.

          I saw you cast BlockInfo to BlockInfoStriped multiple times in BlockManager in github branch. They can be eliminated.

          This is a good point. We can use isStriped and getStripedBlockStorageOp instead.
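A minimal sketch of the BlocksInAFile idea described above: pair a layout type with the block array and validate homogeneity in one place at assignment time. The types below are stubs invented for illustration; the actual patch under HDFS-8058 may look quite different.

```java
// Stub layout tag and block type, for illustration only.
enum Layout { CONTIGUOUS, STRIPED }

class SimpleBlockInfo {
  final Layout layout;
  SimpleBlockInfo(Layout layout) { this.layout = layout; }
}

// Central place to control type safety: rejects arrays that mix layouts
// or do not match the file's declared layout.
class BlocksInAFile {
  private final Layout layout;
  private SimpleBlockInfo[] blocks = new SimpleBlockInfo[0];

  BlocksInAFile(Layout layout) { this.layout = layout; }

  void setBlocks(SimpleBlockInfo[] newBlocks) {
    for (SimpleBlockInfo b : newBlocks) {
      if (b.layout != layout) {
        throw new IllegalArgumentException(
            "block layout " + b.layout + " does not match file layout " + layout);
      }
    }
    blocks = newBlocks;
  }

  int numBlocks() { return blocks.length; }
}

class BlocksInAFileDemo {
  public static void main(String[] args) {
    BlocksInAFile f = new BlocksInAFile(Layout.STRIPED);
    f.setBlocks(new SimpleBlockInfo[] { new SimpleBlockInfo(Layout.STRIPED) });
    System.out.println(f.numBlocks()); // prints: 1
  }
}
```

With this shape, callers still pass plain block arrays, but the single setBlocks check replaces scattered per-call-site validation.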

          walter.k.su Walter Su added a comment -

          As Jing and myself commented under HDFS-8058, the downside is weaker type safety.

          Hmm. I was worrying about mixed UC/non-UC types; you are worrying about mixed striped/contiguous types. Isn't a type check in setBlocks enough? I don't know.

          walter.k.su Walter Su added a comment -

          Comments on the github branch (PoC-20150624.patch):
          1. You missed merging INodeFile#getLastBlock and numBlocks into the github branch.
          2. I ran the tests and saw an NPE. It will be fixed after HDFS-8653 merges into trunk.
          3. Please remove this line, because it throws an NPE when lastBlock == null:

          2437       Preconditions.checkState(!lastBlock.isStriped());  //FSNamesystem#appendFileInternal()
          
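The null-safe form of that check could look like the sketch below. This is a stand-alone illustration with a stub block type, not the actual FSNamesystem code.

```java
// Stub block type carrying only the striped flag.
class Blk {
  private final boolean striped;
  Blk(boolean striped) { this.striped = striped; }
  boolean isStriped() { return striped; }
}

class AppendCheck {
  // Guard the precondition: an empty file has no last block (lastBlock == null),
  // which is fine to append to, so the striped check must be skipped then.
  static void checkAppendable(Blk lastBlock) {
    if (lastBlock != null && lastBlock.isStriped()) {
      throw new IllegalStateException("append to a striped file is not supported");
    }
  }

  public static void main(String[] args) {
    checkAppendable(null);           // empty file: no NPE, no exception
    checkAppendable(new Blk(false)); // contiguous last block: allowed
    System.out.println("ok");        // prints: ok
  }
}
```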
          zhz Zhe Zhang added a comment -

          Thanks Walter! Will address in the next rev.

          zhz Zhe Zhang added a comment -

          I just finished another pass of the consolidated patch. The github repo has the latest code. I divided the consolidated patch into 13 sub-patches. Please let me know if the list looks reasonable to you.

              1. HADOOP-COMMON side support for codec calculations.
              2. Support Erasure Coding Zones.
              3. Extend BlockInfo to handle striped block groups.
              4. Allocate and manage striped blocks in NameNode blockmanagement module.
              5. BlockPlacementPolicies for erasure coding
              6. Create LocatedStripedBlock abstraction to represent striped block groups.
              7. Client side support
              8. Distribute recovery work for striped blocks to DataNode.
              9. Datanode support
              10. Add striped block support in INodeFile.
              11. Change fsck to support EC files
              12. Support striped block groups in fsimage and edit logs
              13. Balancer and mover support for striped block groups
          
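For context on the striped layout these sub-patches implement: a logical file offset maps to a cell in a block group by dividing the byte stream into fixed-size cells distributed round-robin over the data units. The 64KB cell size and 6 data units below reflect the branch's RS(6,3) defaults as I understand them; they are assumptions for this sketch, not fixed API constants.

```java
// Round-robin striping arithmetic: which data unit holds a given file offset,
// and at which cell position inside that unit's internal block.
class StripeMapping {
  static final int CELL_SIZE = 64 * 1024; // assumed default cell size
  static final int NUM_DATA_UNITS = 6;    // assumed RS(6,3) data units

  // Index (0..NUM_DATA_UNITS-1) of the data unit within the block group.
  static int dataUnitIndex(long fileOffset) {
    return (int) ((fileOffset / CELL_SIZE) % NUM_DATA_UNITS);
  }

  // Cell index within that data unit's internal block.
  static long cellIndexInUnit(long fileOffset) {
    return (fileOffset / CELL_SIZE) / NUM_DATA_UNITS;
  }

  public static void main(String[] args) {
    // The 7th cell (offset 6 * 64KB) wraps around to data unit 0, cell 1.
    System.out.println(dataUnitIndex(6 * 64 * 1024));   // prints: 0
    System.out.println(cellIndexInUnit(6 * 64 * 1024)); // prints: 1
  }
}
```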

          I still haven't added tests to the sub-patches. Also the INodeFile implementation is still being discussed. They'll be addressed in the next pass.

          In the next pass, to ensure we capture all latest branch changes, I will do the following:

          1. Take a latest consolidated patch from HDFS-7285
          2. Examine each file (if necessary, each diff in a file) in the patch and fit them into one of the 13 sub-patches (or discard as pre-merged changes).

          The consolidated patch has roughly 25k lines of code, so any volunteer help is much appreciated.

          drankye Kai Zheng added a comment -

          Thanks Zhe Zhang for the great work! I will have some time to look at some parts I'm familiar with.

          zhz Zhe Zhang added a comment -

          Thanks Kai! Feel free to claim the component you'd like to look at.

          drankye Kai Zheng added a comment -

          There are some changes in patch 1 (codec) that should be placed elsewhere.

          In FSOutputSummer, this one fits better in patch 9 (datanode support):

          +  protected DataChecksum getDataChecksum() {
          +    return sum;
          +  }

          In FsShell, this one fits better in patch 7 (client side support):

          +  protected String getUsagePrefix() {
          +    return usagePrefix;
          +  }

          I will continue to look at some other parts. By the way, I'm going to move the remaining codec issues planned for HDFS-7285 to the follow-on issue, since I don't want to interrupt this pre-merge effort.

          vinayrpet Vinayakumar B added a comment -

          Attaching the consolidated merge patch.

          1. Merged the latest trunk code into the HDFS-7285 branch.
          2. Took a diff against trunk.

          It includes all changes done to date in the branch.

          Thanks to Zhe Zhang. The most complicated BlockInfo hierarchy conflicts were resolved using the POC patch posted earlier.

          The patch is pretty big, but it is also intended to run through Jenkins.

          hadoopqa Hadoop QA added a comment -


          -1 overall

          Vote  Subsystem  Runtime  Comment
          -1    patch      0m 1s    The patch command could not apply the patch during dryrun.

          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12743065/HDFS-7285-merge-consolidated-01.patch
          Optional Tests javadoc javac unit findbugs checkstyle shellcheck
          git revision HDFS-7285 / 0b7af27
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/11561/console

          This message was automatically generated.

          vinayrpet Vinayakumar B added a comment -

          Attaching the same patch, with 'trunk' in the name.

          zhz Zhe Zhang added a comment -

          If Jenkins still tries to apply against HDFS-7285, we can rename the patch to HDFS-EC-xxx

          zhz Zhe Zhang added a comment -

          Thanks Vinay for the great effort! When I finish my current pass to divide the patch, we can compare the 2 consolidated patches and resolve all conflicts.

          vinayrpet Vinayakumar B added a comment -

          Sure. Actually, I created this for exactly that purpose, so that nothing gets missed.

          hadoopqa Hadoop QA added a comment -


          -1 overall

          Vote  Subsystem  Runtime  Comment
          -1    patch      0m 1s    The patch command could not apply the patch during dryrun.

          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12743121/HDFS-7285-merge-consolidated-trunk-01.patch
          Optional Tests javadoc javac unit findbugs checkstyle shellcheck
          git revision HDFS-7285 / 0b7af27
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/11565/console

          This message was automatically generated.

          zhz Zhe Zhang added a comment -

          Uploading the same patch Vinay generated with a different name. Hopefully Jenkins tries it against trunk this time.

          hadoopqa Hadoop QA added a comment -


          -1 overall

          Vote  Subsystem  Runtime  Comment
          -1    patch      0m 1s    The patch command could not apply the patch during dryrun.

          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12743179/HDFS-EC-merge-consolidated-01.patch
          Optional Tests javadoc javac unit findbugs checkstyle shellcheck
          git revision HDFS-7285 / 0b7af27
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/11569/console

          This message was automatically generated.

          vinayrpet Vinayakumar B added a comment -

          Rebased patch.

          zhz Zhe Zhang added a comment -

          I finished a complete pass of the patch (both main and test code) and pushed it to my github repo. I'm also uploading the consolidated patch (renaming it again and hoping Jenkins applies it against trunk instead of our branch).

          I made some minor adjustments to the list of sub patches:

          1. HADOOP-COMMON side support for codec calculations
          2. Support Erasure Coding Zones.
          3. Extend BlockInfo to handle striped block groups.
          4. Allocate and manage striped blocks in NameNode blockmanagement module.
          5. BlockPlacementPolicies for erasure coding
          6. Create LocatedStripedBlock abstraction to represent striped block groups.
          7. Distribute recovery work for striped blocks to DataNode.
          8. Add striped block support in INodeFile.
          9. Client side support
          10. Datanode support
          11. Change fsck to support EC files
          12. Support striped block groups in fsimage and edit logs
          13. Balancer and mover support for striped block groups
          14. Additional unit tests for erasure coding
          

          I also cherry-picked the 2 new patches from the HDFS-7285 branch.

          I created separate Jenkins builds for sub-patches 1~4. Most tests passed. The remaining handful of failures seem to result from the patch splitting (they pass in the final consolidated patch).

          So sub-patches 1~4 are ready for review. Any comments and suggestions are much appreciated. In particular, sub-patch 3 has the new implementation of BlockInfoStriped, which has not been reviewed in the branch.