Hadoop HDFS / HDFS-7337

Configurable and pluggable erasure codec and policy

    Details

    • Target Version/s:
    • Hadoop Flags:
      Incompatible change
    • Release Note:
      This allows users to:
      * develop and plug in their own erasure codecs and coders. Plugins are loaded automatically from Hadoop jars, and the corresponding codecs and coders are registered for runtime use.
      * define their own erasure coding policies through an XML file and a CLI command. The added policies are persisted into the fsimage.

      Description

      According to HDFS-7285 and the design, this issue covers support for multiple erasure codecs via a pluggable approach. It allows multiple codec schemas with different coding algorithms and parameters to be defined and configured. The resulting codec schemas can then be specified for different file folders via a command-line tool. While designing and implementing this pluggable framework, a concrete codec (Reed-Solomon) will also be implemented by default to prove the framework is useful and workable. A separate JIRA could be opened for the RS codec implementation.

      Note that HDFS-7353 will focus on the very low-level codec API and implementation to make concrete vendor libraries transparent to the upper layer. This JIRA focuses on the higher-level parts that interact with configuration, schemas, etc.
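      To make the pluggable idea concrete, below is a minimal, hypothetical sketch of how codec plugins could be discovered from jars on the classpath using Java's standard ServiceLoader mechanism. The ErasureCodecProvider interface and CodecPluginRegistry class are assumptions made up for this illustration, not the actual Hadoop API.

        import java.util.HashMap;
        import java.util.Map;
        import java.util.ServiceLoader;

        // Hypothetical plugin contract; names are illustrative only.
        interface ErasureCodecProvider {
          String getCodecName();   // e.g. "rs", "xor"
        }

        // Discovers providers from any jar on the classpath via
        // META-INF/services entries for ErasureCodecProvider, indexed by codec name.
        public class CodecPluginRegistry {
          private final Map<String, ErasureCodecProvider> codecs = new HashMap<>();

          public CodecPluginRegistry() {
            for (ErasureCodecProvider p : ServiceLoader.load(ErasureCodecProvider.class)) {
              codecs.put(p.getCodecName(), p);
            }
          }

          public ErasureCodecProvider getCodec(String name) {
            return codecs.get(name);
          }
        }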

      1. PluggableErasureCodec v4.pdf
        154 kB
        SammiChen
      2. PluggableErasureCodec-v3.pdf
        363 kB
        Kai Zheng
      3. PluggableErasureCodec-v2.pdf
        432 kB
        Kai Zheng
      4. PluggableErasureCodec.pdf
        409 kB
        Kai Zheng
      5. HDFS-7337-prototype-v3.zip
        49 kB
        Kai Zheng
      6. HDFS-7337-prototype-v2.zip
        34 kB
        Kai Zheng
      7. HDFS-7337-prototype-v1.patch
        14 kB
        Kai Zheng

        Issue Links

          Activity

          drankye Kai Zheng added a comment -

          To better position this issue and avoid possible confusion with storage policy, I rephrased the title a bit.

          zhz Zhe Zhang added a comment -

          Kai Zheng Sorry about the confusion. I actually created this JIRA for marking individual files or directories to be erasure coded (or not). I'll create another JIRA for that purpose, since we do need this pluggable codec schema support anyway.

          drankye Kai Zheng added a comment -

          Yes, agree. Please go ahead. Thanks.

          drankye Kai Zheng added a comment -

          Also opened HDFS-7353 to focus on the very low-level codec API and implementation to make concrete vendor libraries transparent to the upper layer. This JIRA focuses on the higher-level parts that interact with configuration, schemas, etc.

          drankye Kai Zheng added a comment -

          Zhe, let me consider these issues together and think about how to define and implement such a configurable and pluggable codec plus schema. I will give my thoughts here for discussion. Assigned to me.

          drankye Kai Zheng added a comment -

          Initial prototype patch for quick ideas.

          drankye Kai Zheng added a comment -

          Zhe Zhang, would you take a look at the quick prototype patch and see if it works? Thanks. It contains the codec definition relevant to ECManager and the coder definition relevant to ECWorker. I have checked the coder definition with Bo and it roughly works.

          I would also appreciate anyone else's feedback. Thanks.

          drankye Kai Zheng added a comment -

          The points made in the prototype patch are (a rough sketch of these abstractions follows the list):

          • Multiple erasure codecs can be configured and referenced by their names;
          • Multiple erasure codec instances or schemas can be defined with various options in a schema file, and can be specified via their distinct names;
          • ErasureCodec takes care of two aspects, ECSchema for NameNode/ECManager, and ErasureCoder for DataNode/ECWorker;
          • ECSchema is loaded from configuration and can also be persisted in compact form to be passed to DataNode if desired;
          • ErasureCodec is also responsible for calculating the BlockGroup, given the required original data blocks and the parity blocks to be computed;
          • ErasureCoder can be initialized with options from the schema and performs basic encoding/decoding of ECChunks;
          • ErasureCoder can be implemented using the Jerasure library or the Intel ISA library. The concrete coder should only be created on the DataNode side, so the corresponding libraries are only required on DataNodes. The NameNode doesn't need to create coders;
          • RS codec and LRC codec with corresponding coders are to be supported, as they're representative cases for such an API definition;
          • RS and LRC coder implementations will be provided by default using Intel ISA library.
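          To make the division of labor above concrete, here is a rough, illustrative-only sketch of how these abstractions could relate to each other. All type names, fields and signatures below are assumptions derived from the bullet points, not the committed API.

            import java.nio.ByteBuffer;

            // Minimal stand-ins for the entities mentioned above; illustrative only.
            class ECSchema { String codecName; int dataBlocks; int parityBlocks; int chunkSize; }
            class ECChunk { ByteBuffer buffer; }
            class ECBlock { long blockId; boolean parity; }
            class BlockGroup { ECBlock[] dataBlocks; ECBlock[] parityBlocks; }

            // Forms a BlockGroup from the original data blocks and the parity blocks
            // to be computed (used on the NameNode/ECManager side).
            interface BlockGrouper {
              BlockGroup makeBlockGroup(ECBlock[] dataBlocks, ECBlock[] parityBlocks);
            }

            // Performs the actual encoding/decoding of ECChunks
            // (used on the DataNode/ECWorker side, backed by Jerasure, ISA, etc.).
            interface ErasureCoder {
              void initialize(ECSchema schema);
              void encode(ECChunk[] dataChunks, ECChunk[] parityChunks);
              void decode(ECChunk[] inputs, int[] erasedIndexes, ECChunk[] outputs);
            }

            // The top-level construct tying the two aspects together for one code.
            interface ErasureCodec {
              ECSchema getSchema();
              BlockGrouper createBlockGrouper();
              ErasureCoder createErasureCoder();
            }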
          drankye Kai Zheng added a comment -

          To save time, I composed this doc to illustrate the erasure codec framework for review. Your feedback is welcome.

          drankye Kai Zheng added a comment -

          Updated the prototype code, listed as follows.
          hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/ec:

          ./ECSchema.java
          ./BlockGroup.java
          ./ECBlock.java
          ./ECChunk.java
          ./ECConfiguration.java
          ./SchemaLoader.java
          ./SubBlockGroup.java

          ./grouper/BlockGrouper.java
          ./grouper/LRCBlockGrouper.java
          ./grouper/RSBlockGrouper.java

          ./codec/ErasureCodec.java
          ./codec/IsaLRCErasureCodec.java
          ./codec/IsaRSErasureCodec.java
          ./codec/JavaRSErasureCodec.java
          ./codec/JerasureRSErasureCodec.java
          ./codec/LRCErasureCodec.java
          ./codec/RSErasureCodec.java

          ./coder/AbstractErasureCoder.java
          ./coder/ErasureCoder.java
          ./coder/IsaLRCErasureCoder.java
          ./coder/IsaRSErasureCoder.java
          ./coder/JavaRSErasureCoder.java
          ./coder/JerasureRSErasureCoder.java
          ./coder/LRCErasureCoder.java
          ./coder/RSErasureCoder.java
          ./coder/util/GaloisField.java

          ./rawcoder/AbstractRawErasureCoder.java
          ./rawcoder/impl/IsaReedSolomonDecoder.java
          ./rawcoder/impl/IsaReedSolomonEncoder.java
          ./rawcoder/IsaRSRawErasureCoder.java
          ./rawcoder/JavaRSRawErasureCoder.java
          ./rawcoder/JavaXORRawErasureCoder.java
          ./rawcoder/RawErasureCoder.java

          drankye Kai Zheng added a comment -

          Updated the prototype code:
          1. Refined the APIs: ErasureCoder handles BlockGroup, so ECWorker does not need to understand it;
          2. Included unit test code as an example of how ErasureCodec/ErasureCoder can be used.

          zhz Zhe Zhang added a comment -

          Great work Kai Zheng! I went over the design and have the following comments:

          1. I like the idea of creating an ec package under org.apache.hadoop.hdfs. It is a good place to host all codec classes.
          2. I think the ec package should focus on codec calculation based on a packet unit. Below is how I think the functions should be logically divided:
            • The ErasureCodec interface should simply provide encode and decode functions that take a byte[][] and produce another byte[][]. It should be unaware of blocks. For example, I imagine our encode function should look similar to Jerasure's (https://github.com/tsuraan/Jerasure/blob/master/Manual.pdf):
               void jerasure_matrix_encode(k, m, w, matrix, data_ptrs, coding_ptrs, size)
            • BlockGroups should be formed by ECManager. In doing so it calls the encode and decode functions from ErasureCodec
          3. Logically, BlockGroup is applicable even without EC, because striping can be done without EC. So an alternative is to put it in the protocol package.
          4. I don't think we should reference the schema through a name (since it wastes space and is fragile). We should look at other configurable policies (e.g., block placement algorithm) and see how they are loaded. IIRC a factory class is used.
          5. It's great that we are considering LRC in advance. However, with LEGAL-211 pending, I suggest we keep BlockGroup simpler for now. For example, it can contain only dataBlocks and parityBlocks. When we implement LRC we can subclass or extend it.
          6. I guess ECBlock is for testing purpose? An erasure coded block should have all properties of a regular block. I think we can just add a couple of flags to the Block class.
          7. It's not quite clear to me why we need ErasureCoderCallback. Is it for async codec calculation? If codec calculations are done on small packets, I think sync operations are fine.

          Thanks!

          drankye Kai Zheng added a comment -

          We're discussing the design and code offline. When that's finished, we'll see how we're aligned and I will update here.

          andrew.wang Andrew Wang added a comment -

          Hey Kai, thanks for getting us started here. I gave this a quick look, had a few comments:

          • Could you generate normal plaintext diffs rather than a zip? We might also want to reorganize things into existing packages. The rawcoder stuff could go somewhere in hadoop-common for instance. We could move the block grouper classes into blockmanagement. etc.
          • I see mixed tabs and spaces, we do spaces only in Hadoop.
          • Since the LRC stuff is still up in the air, could we defer everything related to that to a later JIRA?
          • In RSBlockGrouper, using ExtendedBlockId is overkill, since the bpid is the same for everything

          Configuration

          • The XML file approach seems potentially error-prone. IIUC, after a set of parameters is assigned to a schema name, the parameters should never be changed. We thus also need to keep the XML file in sync between the NN, DN, and client. The client part is especially troublesome. Are we planning to put this into the editlog/fsimage down the road, like how we do storage policies?
          • Also, I think we want to separate out the type of erasure coding from the implementation. The schema definition from the PDF encodes both together, e.g. JerasureRS. While it's not possible to change the RS part, the user might want to swap out Jerasure for ISAL, which should be allowed. This is sort of like how we did things for encryption; we define a CipherSuite (i.e. AES-CTR) and then the user can choose among the multiple pluggable implementations for that cipher.

          BlockGroup:

          • Zhe told me this is a placeholder class, but a few comments nonetheless.
          • Can we just set the two fields in the constructor? They should also be final.
          • Since the schema encodes the layout, does SubBlockGroup need to encode both data and parity? Do we even need SubBlockGroup? Seems like a single array and a schema (a concrete object, which also encodes the RS or LRC parameters) tells you the layout, which is sufficient. This will save some memory.
          drankye Kai Zheng added a comment -

          Hi Andrew, thanks for your great and detailed feedback. Sorry for my late response. I will address them soon.

          drankye Kai Zheng added a comment -

          Hi Zhe, let me address your comments. We have discussed quite a bit offline, so I'm here to summarize and clarify further. If I missed anything, please comment. Thanks.

          I like the idea of creating an ec package under org.apache.hadoop.hdfs. It is a good place to host all codec classes.

          Glad you like it. I guess we could put all the central EC constructs and facilities here that aren't specific to the client, namenode, or datanode. Currently the codec-related classes are the best examples.

          I think the ec package should focus on codec calculation based on a packet unit. Below is how I think the functions should be logically divided:

          In this work RawErasureCoder focuses on calculation over a packet unit or chunk. It won't be much; it simply implements how to encode/decode a group of chunks. Why not stop there, and why ask for a higher-level construct like ErasureCodec? As I explained to you, it is better to have a central place to maintain all the codec-specific logic. The effect is that a customer only needs to plug in at one place (ErasureCodec) instead of many places, which avoids possible inconsistency; to add support for a new erasure code algorithm, implementing an ErasureCodec is all that's needed, and we don't have to modify many places in ECManager, ECWorker and ECClient.

          So the question becomes which aspects are covered, and how, when supporting a new code algorithm: 1) how to calculate over a group of bytes, units or chunks, which is covered by ErasureCoder and RawErasureCoder; 2) how to lay out/order the group of chunks, which is covered by BlockGrouper. Both aspects are abstracted and can be extended per code algorithm or codec. So when adding support for a new code, it's expected that one would: 1) add a new ErasureCoder; 2) add a new BlockGrouper; 3) add a new ErasureCodec using the former two; 4) update hdfs-site.xml (or whatever place) to register the new ErasureCodec with a name (a minimal illustration of this step follows this paragraph). Then a customer would simply configure/create a new EC schema by referencing the new codec name; using that schema an EC file system zone can be created, and so on.

          Given that all the code-specific logic is extracted into the ErasureCodec construct, how is it called by, and how does it interact with, ECManager, ECWorker and ECClient? It all starts by assuming a schema is known by whatever means. Using the schema the ErasureCodec can be instantiated; using the codec instance the BlockGrouper can be created and utilized by ECManager to create a BlockGroup given the necessary information; and the ErasureCoder can be created and then utilized by ECWorker or ECClient to perform encoding/decoding over a group of chunks.
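          As a minimal illustration of step 4 above, something along these lines could register a new codec implementation under a name that schemas can then reference. The configuration key and class name below are hypothetical placeholders for this sketch, not actual Hadoop properties.

            import org.apache.hadoop.conf.Configuration;

            public class RegisterNewCodecExample {
              public static void main(String[] args) {
                Configuration conf = new Configuration();
                // Hypothetical key: map the codec name "mycode" to its implementation
                // class, so that an EC schema can later reference the codec by name.
                conf.set("hdfs.ec.codec.mycode", "org.example.ec.MyErasureCodec");
              }
            }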

          Sure, it won't be that easy. Zhe actually pointed out a hurdle: a codec would have to be hard-coded in order to be efficiently maintained/associated by an inode, so adding support for a new code may also involve changing code in some places outside of the codec framework. I will investigate such chances. Still, it would be ideal to avoid changing or adding code in many places besides the new codec itself.

          To demonstrate how the codec framework works, as Zhe suggested, we will come up with more than one codec so that we can compare and see more clearly. Currently only the RS codec is implemented, with a test case and sample; we're working on another one using the XOR code, though it may never be used in production.

          drankye Kai Zheng added a comment -

          Continued.

          Logically, BlockGroup is applicable even without EC, because striping can be done without EC. So an alternative is to put it in the protocol package.

          Good point! I agree we need to decouple. EC does need something like ECBlockGroup that can be derived from BlockGroup. Maybe we need a better name for it.

          I don't think we should reference the schema through a name (since it wastes space and is fragile).

          I agree and will investigate further.

          It's great that we are considering LRC in advance. However, with LEGAL-211 pending, I suggest we keep BlockGroup simpler for now. For example, it can contain only dataBlocks and parityBlocks. When we implement LRC we can subclass or extend it.

          Good point. Let me see how it can be simplified. Basically you're right: only data blocks and parity blocks are needed, I guess, in whatever code algorithm. ECManager only needs to provide an array of data blocks as sources and an array of parity blocks as placeholders, in addition to a block group id, to create a BlockGroup. How these blocks are organized/ordered is specific to the codec and can be hidden from the outside. So the SubBlockGroup stuff is really only for the codec framework. Sure, I will make it internal to avoid polluting the public API.

          drankye Kai Zheng added a comment -

          Continuing to address Zhe's comments above.

          I guess ECBlock is for testing purpose? An erasure coded block should have all properties of a regular block. I think we can just add a couple of flags to the Block class.

          You're right in a way; the ECBlock class isn't finalized, as the whole bundle of code was attached just for us to look at and discuss. I'm working on this and will decouple ECBlock from the HDFS block. It's possible because the codec framework already has a nice arrangement that delegates how to pull/extract chunks (ECChunk) from an ECBlock. It's the caller's (ECWorker or ECClient) responsibility to handle how to extract/collect byte chunks from an actual HDFS block. Once decoupled, the ECBlock or similar would be very lightweight and wouldn't need so many fields at all. I will have new code for us to discuss further.

          It's not quite clear to me why we need ErasureCoderCallback. Is it for async codec calculation? If codec calculations are done on small packets, I think sync operations are fine.

          The ErasureCoderCallback may be better named to avoid such confusion. It's not about sync vs. async. It's basically for the codec caller (ECWorker or ECClient) to handle how to get chunks from blocks; the codec will call it to pull chunks from blocks, so it can be regarded as a data source provider. In the ECWorker transforming case, many chunks can be pulled from the blocks being transformed, so the enclosed byte-level encode() or decode() in the raw coder is called repeatedly in a while loop. In the ECClient striping EC case, it's similar, until the application finishes writing/reading data for a BlockGroup. A rough sketch of the idea follows.
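          The following is a hypothetical sketch only; the interface and method names are assumptions made for illustration and may differ from the prototype code.

            import java.nio.ByteBuffer;

            // Minimal stand-ins for the block/chunk handles used below; illustrative only.
            class ECBlock { long blockId; }
            class ECChunk { ByteBuffer buffer; }

            // The caller (ECWorker or ECClient) implements this to feed chunks pulled
            // from actual HDFS blocks into the coder, group by group, until the blocks
            // are drained; the coder drives the loop and calls back for each group.
            interface ErasureCoderCallback {
              boolean hasNextInputs();
              ECChunk[] getNextInputChunks(ECBlock[] inputBlocks);
              void onCodedChunks(ECChunk[] inputChunks, ECChunk[] outputChunks); // consume output
            }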

          drankye Kai Zheng added a comment -

          Hi Andrew Wang, thanks for your comments and sorry for my late response.

          Could you generate normal plaintext diffs rather than a zip? We might also want to reorganize things into existing packages. The rawcoder stuff could go somewhere in hadoop-common for instance. We could move the block grouper classes into blockmanagement. etc.

          Yes, I will provide diff or patch format when attaching the new revision.
          I have also discussed with Uma, Zhe and Weihua how to organize the bundle of new code. It looks like we all agree to move the rawcoder classes to hadoop-common. About the block grouper in this codec code: it's not about block placement, but only about codec-specific logic. As discussed above (and Zhe also agreed), we need to support pluggable modules for how to form a block group for an EC code algorithm. The block grouper here is for that, and is taken care of by the high-level construct ErasureCodec. Please kindly review my comments to Zhe above and let me know if I'm not going in the right direction anywhere.

          I see mixed tabs and spaces, we do spaces only in Hadoop.

          Sorry for the mess. I will absolutely clean up and follow the style guidelines when I break this down and submit patches for the sub-tasks.

          Since the LRC stuff is still up in the air, could we defer everything related to that to a later JIRA?

          I agree. I added the LRC* stuff just to make sure I keep codes like LRC in mind, so that the codec framework is general enough and we won't have to redesign it when we consider supporting such code algorithms. I won't submit any LRC-related formal patches before the legal question is settled.

          In RSBlockGrouper, using ExtendedBlockId is overkill, since the bpid is the same for everything

          I'm happy to know that about bpid. Thanks.

          Will address the remaining comments later.

          drankye Kai Zheng added a comment -

          Continued

          The XML file approach seems potentially error-prone. ... Are we planning to put this into the editlog/fsimage down the road, like how we do storage policies?

          Yes I agree and will follow the approach you suggested.

          I think we want to separate out the type of erasure coding from the implementation....

          Good suggestion. It makes sense to swap among implementations using different erasure coding libraries (Java, ISA and Jerasure) for a given codec. It's easy to allow this, since in the current code ErasureCoder and RawErasureCoder are already separated; what's needed is just to allow changing the RawErasureCoder for an ErasureCodec via configuration. A rough illustration follows.
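          For example, the raw coder implementation behind a codec could be selected by a configuration entry along these lines. The key and class names are hypothetical and only meant to illustrate separating the code type (RS) from the implementation (ISA-based vs. pure Java).

            import org.apache.hadoop.conf.Configuration;

            public class RawCoderSelectionExample {
              public static void main(String[] args) {
                Configuration conf = new Configuration();
                // Same "rs" codec; only the backing RawErasureCoder implementation changes
                // (hypothetical key and classes).
                conf.set("io.erasurecode.codec.rs.rawcoder",
                    "org.example.ec.rawcoder.IsaRSRawErasureCoder");
                // Or fall back to a pure-Java implementation:
                // conf.set("io.erasurecode.codec.rs.rawcoder",
                //     "org.example.ec.rawcoder.JavaRSRawErasureCoder");
              }
            }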

          BlockGroup

          As discussed with Zhe and noted in my comments above, I need to hide internal details like SubBlockGroup that only the codec is interested in, rather than exposing them. I need to provide two factory methods or constructors for the two cases of creating a BlockGroup: 1) in non-striping mode, given an array of existing data blocks along with the block group id; 2) in the striping case, only the block group id is needed, because the data blocks are all new and yet to be allocated.

          Since the schema encodes the layout,...

          In the current design, as I understand it, the schema records configuration globally for all files in an EC zone. A BlockGroup object can be regarded as an instance of the schema for an inode or file, recording how the blocks in the group, including data blocks and parity blocks, are organized and ordered. In effect, one copy of the codec-specific configuration (e.g. k=6, m=3, chunk_size=16MB) in the schema, plus some number of BlockGroup instances, needs to be persisted in the fsimage/editlog. It looks like Zhe has good ideas about how to persist block groups efficiently: not all the fields that appear in BlockGroup need to be persisted, since they can be restored or derived from minimal persistent information. Sorry for the confusion if my previous discussions gave you that impression. I do need to update the design and code to clarify all of this. I will catch up soon next week.

          zhz Zhe Zhang added a comment -

          Thanks Kai for the deeper discussion.

          So the question becomes which aspects are covered, and how, when supporting a new code algorithm: 1) how to calculate over a group of bytes, units or chunks, which is covered by ErasureCoder and RawErasureCoder; 2) how to lay out/order the group of chunks, which is covered by BlockGrouper.

          I think this is a good summary of what's included in the current patch. Actually I think the patch will be more trackable if we separate these 2 features. The arithmetic part is primarily used by client/DN (HDFS-7545 and HDFS-7344). The grouper component will be used by NN (HDFS-7339). I suggest we keep this JIRA for configurable/pluggable arithmetic codec calculation, and create another JIRA for configurable/pluggable block layout. This way they can be reviewed and committed more quickly.

          As Kai also echoed above, we should first create a working prototype with simplest schema, then add at least one other schema, and finally figure out how to abstract out the common logic between different schemas.

          In my understanding, the simplest working prototype would be a striping client (HDFS-7545) asking the NN (HDFS-7339) to allocate and persist block groups, using the arithmetic codec provided in this JIRA (HDFS-7337) to calculate Reed-Solomon parity data, and successfully writing an EC file.

          In this flow, all the NN needs from the schema is the numbers of data and parity blocks. I think these 2 numbers can be embedded as an XAttr. We should also assume a pair of default values which are used in the absence of configured XAttrs. A rough example of the XAttr idea follows.
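          The sketch below is illustrative only; the XAttr name and value encoding are made-up assumptions, not what was decided or implemented.

            import java.nio.charset.StandardCharsets;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class EcXAttrExample {
              public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                // Store the data/parity block counts for an EC zone as an XAttr on the
                // zone root directory; name and value format are hypothetical.
                fs.setXAttr(new Path("/ec/zone1"), "user.hdfs.ec.schema",
                    "numDataBlocks=6,numParityBlocks=3".getBytes(StandardCharsets.UTF_8));
              }
            }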

          drankye Kai Zheng added a comment -

          Good idea, Zhe. We have already created the suggested JIRAs. I will update the code according to the feedback and break it down into smaller patches to submit.

          vinayrpet Vinayakumar B added a comment -

          While trying to understand the implementation I ran TestJavaRSErasureCodec#testCodec, but it failed. Encoded and decoded data doesn't match.

          drankye Kai Zheng added a comment -

          Thanks Vinay for your review. I will check the test when I submit a formal patch for that part.

          drankye Kai Zheng added a comment -

          As discussed, most of this source code will be moved to the hadoop-common side, but I'm not sure if it's OK to still use these JIRA entries that start with HDFS instead of HADOOP.

          Would anyone help confirm this? It would be great if we don't have to change them; that seems reasonable because the work is for HDFS, although for other considerations the code is better off over there.

          andrew.wang Andrew Wang added a comment -

          I don't think it's necessary to move to HADOOP. If anything, I find it conceptually easier if everything related to erasure encoding stayed a subtask of HDFS-7285.

          szetszwo Tsz Wo Nicholas Sze added a comment -

          Do you expect that the erasure code package will be used outside hdfs? If not, we could put everything under hdfs for the moment.

          drankye Kai Zheng added a comment -

          Thanks Andrew Wang for the clarification. I agree.

          drankye Kai Zheng added a comment -

          Hi Tsz Wo Nicholas Sze,

          It would be great if the erasure codec work can be used in other contexts too; in any case it's better not to couple it tightly with HDFS. Having the code stay on the hadoop-common side would also avoid a lot of basic bootstrap work when supporting and incorporating native libraries, as compression, encryption, etc. do.

          drankye Kai Zheng added a comment -

          Updated the document according to the major discussions, online and offline, with Andrew Wang, Zhe Zhang, Li Bo, and Uma Maheswara Rao G. Thanks for your great ideas, which I have incorporated here!

          zhz Zhe Zhang added a comment -

          Thanks Kai for the update! The design looks good to me overall.

          I also took the chance to look at ErasureCodec and ECSchema again. IIUC, ErasureCodec is like a factory or a utility class, which creates ErasureCoder and BlockGrouper based on ECSchema.

          If that's the case, I think we can leverage the pattern of BlockStoragePolicySuite. Something like:

          public static ECSchemaSuite createDefaultSuite() {
              final ECSchema[] schemas =
                  new ECSchema[2];
              final byte RS63 = HdfsConstants.RS63_EC_SCHEMA_ID;
              policies[RS63] = new ECSchema(RS63,
                  HdfsConstants.RS63_EC_SCHEMA_NAME,
                  HdfsConstants.RS_EC_ALGORITHM_ID,
                  6, 3, chunkSize);
              final byte XOR21 = HdfsConstants.XOR21_EC_SCHEMA_ID;
              policies[XOR21] = new ECSchema(XOR21,
                  HdfsConstants.XOR21_EC_SCHEMA_NAME,
                  HdfsConstants.XOR_EC_ALGORITHM_ID,
                  2, 1, chunkSize);
            }
          

          Then NN can just pass around the schema ID when communicating with DN and client, which is much smaller than an ErasureCodec object.

          zhz Zhe Zhang added a comment -

          s/policies/schemas/g

          drankye Kai Zheng added a comment -

          Thanks Zhe Zhang for the review and thoughts.

          ErasureCodec is like a factory or a utility class, which creates ErasureCoder and BlockGrouper based on ECSchema

          ErasureCodec would be the high-level construct in the framework that covers all the potential erasure-code-specific aspects, including but not necessarily limited to ErasureCoder and BlockGrouper, and that allows a new code to be implemented and deployed as a whole. All the underlying code-specific logic can be hooked in via the codec and is only accessible through the codec. I understand there will be more to think about; this is generally one of the major goals for the framework.

          I think we can leverage the pattern of BlockStoragePolicySuite

          It's a good pattern. ErasureCodec follows another good pattern, CompressionCodec.

          Something like:...your illustration codes...

          I understand we need to hard-code a default schema for the system. What we have discussed and been doing is allowing EC schemas to be predefined in an external file (currently XML, as we regularly use in the project). For easy reference, a unique schema name (string) and codec name (string) are used. Do you have any concerns about this approach?

          Then NN can just pass around the schema ID when communicating with DN and client, which is much smaller than an ErasureCodec object.

          Yes, similarly the idea is to pass around the schema NAME between any pair of NN, DN and client. It doesn't mean passing an ErasureCodec object. Is there a confusing sentence I need to clarify in the doc? All the ErasureCodecs are loaded through core-site configuration or service locators, and kept in a map with the codec name as the key. Given the codec name, a codec is fetched from the map. The codec object doesn't need to be passed around; the codec name is. I guess you mean the schema object. In the f2f meetup discussion with Jing Zhao, we mentioned that it may be necessary to pass around the schema object. If we don't want to hard-code all the schemas, then we need to pass the schema object, I guess.

          zhz Zhe Zhang added a comment -

          Thanks Kai for the explanation! Now I have a much clearer understanding of the codec design.

          ErasureCodec would be the high-level construct in the framework ...

          I agree with the high level goal. The reason I think ErasureCodec seems like a utility class is that (at least in the current HADOOP-11643 / HADOOP-11645 code) it is pretty much stateless. It creates ErasureCoder and BlockGrouper based on the given schema type. But as you said we might extend its functionalities in the future. So we can revisit this point later.

          It's a good pattern. ErasureCodec follows another good pattern, CompressionCodec.

          My statement was a little confusing. I wasn't suggesting leveraging BlockStoragePolicySuite to build the ErasureCodec class. I was suggesting we build a similar schema suite class to store all schemas.

          All the {{ErasureCodec}}s are loaded through core-site configuration or service locators, and kept in a map keyed by codec name.

          Agreed. Actually the ECSchemaSuite idea I proposed above is doing the same thing: besides a few hard-coded schemas, it can also parse the XML and load more schemas in the suite. If we don't use something like a schema suite, where should we maintain this map? I see HADOOP-11664 loads schemas from XML. Is there another JIRA handling the management of loaded schemas? If not maybe we can consider ECSchemaSuite? It has a simple task of mapping an ID (either a byte or a String as you proposed) to the ECSchema object.
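          As an illustration of this idea, here is a minimal sketch of such a suite, with hypothetical names standing in for the real ECSchema:

  import java.util.HashMap;
  import java.util.Map;

  /** Sketch of a schema suite: a few hard-coded entries plus whatever is loaded from XML. */
  class EcSchemaSuite {
    /** Hypothetical stand-in for the real ECSchema: codec name plus k and m. */
    static class SchemaInfo {
      final String codecName;
      final int dataUnits;
      final int parityUnits;
      SchemaInfo(String codecName, int dataUnits, int parityUnits) {
        this.codecName = codecName;
        this.dataUnits = dataUnits;
        this.parityUnits = parityUnits;
      }
    }

    private final Map<String, SchemaInfo> schemas = new HashMap<>();

    EcSchemaSuite() {
      schemas.put("RS-6-3", new SchemaInfo("rs", 6, 3));   // hard-coded default
    }

    /** Entries parsed from the predefined XML file would be added here. */
    void add(String id, SchemaInfo schema) {
      schemas.put(id, schema);
    }

    SchemaInfo get(String id) {
      return schemas.get(id);
    }
  }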

          If we don't want to hard-code all the schemas, then we need to pass the schema object, I guess.

          Agreed. Actually, even if we hard-code all schemas, it's still dangerous to pass only the schema ID. The DN might be on a different version of Hadoop than the NN. However, when storing per-dir or per-file schemas, we should only store the IDs.

          drankye Kai Zheng added a comment -

          Thanks Zhe Zhang for the detailed clarification. It's great that we're largely aligned!

          Is there another JIRA handling the management of loaded schemas? If not maybe we can consider ECSchemaSuite?

          I got your point about ECSchemaSuite. HDFS-7866 is the JIRA that does the job you're describing. I do have some rough code for it, where ECSchemaManager is the core part. ECSchemaManager basically wraps a map and contains all the ACTIVE schemas, synced between NN metadata and the predefined ones. On one side it lets dfsadmin request a reload of the predefined schemas (from HADOOP-11664) in an authorization-controlled way; on the other side it serves client requests for the schema list and detailed definitions. I will rethink that code and see whether ECSchemaSuite works better, or borrow its benefits. By the way, HDFS-7859 is used to persist schemas in NN metadata. Anything more for us to fill the gap?

          Actually, even if we hard-code all schemas, it's still dangerous to pass only the schema ID

          I agree. Thanks for the confirmation. With the schema object passed around and available on the DN and client, we can perform schema-driven encoding and decoding, which will be much safer and more flexible.

          Currently I'm working from the bottom up, and hopefully it won't take too long to reach the NN and get all the work hooked together.

          zhz Zhe Zhang added a comment -

          Thanks for the pointers to HDFS-7859 and HDFS-7866. Yes, I believe they are along the same lines as the ECSchemaSuite in the above discussion.

          zhz Zhe Zhang added a comment -

          I had an offline discussion with Kai about simplifying the code structure.

          Basically, with the abstraction provided by ECBlockGroup, we can have a unified codec function signature. This will allow us to get rid of separate ErasureEncoder and ErasureDecoder interfaces.
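          As a rough illustration of the idea, the unified signature might look like the sketch below; the type and method names are hypothetical, not the committed API:

  /** Placeholder for illustration; the real ECBlockGroup holds the data and parity blocks. */
  class BlockGroup { }

  /** Sketch: one entry point instead of separate ErasureEncoder / ErasureDecoder interfaces. */
  interface BlockGroupCoder {
    /**
     * Performs the coding work for the group: generating the parity blocks when
     * encoding, or reconstructing the blocks marked as erased when decoding.
     */
    void code(BlockGroup group);
  }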

          drankye Kai Zheng added a comment -

          Thanks Zhe for the very good thought and the new JIRA HADOOP-11740 to work on it.

          drankye Kai Zheng added a comment -

          As inspired by the discussion in HDFS-7344 with Tsz Wo Nicholas Sze, a codec understands, and should give hints to the NN about, how erased block(s) should be prioritized for recovery. For example, in RS(6,3), 1 erased block is less urgent than 2 or 3 erased blocks. Will update the patch in HADOOP-11645 to reflect this thinking.

          drankye Kai Zheng added a comment -

          Updated the design doc.

          drankye Kai Zheng added a comment -

          I updated the schema & codec doc. This revision entirely rewrites the section about EC Schema, and reflects the latest discussions in various related issues, mainly with Vinayakumar B, Tsz Wo Nicholas Sze, Zhe Zhang, Uma Maheswara Rao G and Rakesh R. Thanks for your further thoughts, comments and questions!

          vinayrpet Vinayakumar B added a comment -

          Thanks for the update Kai Zheng.
          The updated section clearly explains how EC schemas can be made configurable.

          One improvement I am thinking of:
          Instead of maintaining an XML file on the server side, which has to be updated before issuing a command to load from it, let's make the admin's life simpler.

          • We can have explicit commands to add/delete schemas from the shell itself.
          • For adding new schemas (this does not apply to deletion):
            1. If the schema is simple with few parameters, it can be given as an argument to the command in a predefined structure (probably comma-separated).
            2. Otherwise, the command can load the schemas from an XML file maintained locally on the admin side and send them via RPC to the NameNode. In this case the argument will be the file name to load. This way the admin can create the XML and execute the command from any machine using admin authentication.
          • For deletion of schemas, a separate command via a separate RPC, with just the schema name(s) as a parameter, can do the job. Of course, a schema should be deleted only if it is not in use.

          So this way, XML is used just as a friendly way of specifying schemas; the NameNode doesn't need to depend on it.

          Any thoughts?

          drankye Kai Zheng added a comment -

          Thanks Vinayakumar B for the comments, suggestions and more options.

          Before deciding which way to go, I thought it would make sense to first answer the following questions:

          • What possible erasure codes or codecs would we have, now and in the future? XOR, RS, HitchHiker, LRC, and more: typical codes from broad industry experience.
          • What kinds of schema parameters would each possible erasure codec have?

          Let's slow down and let me find some time for further investigation. With these questions well answered, it should not be hard to tell which way sounds better: creating schemas on the command line or through a schema definition file.

          drankye Kai Zheng added a comment -

          For the LRC code, take a typical example, 8+3+2: 8 source blocks (X1, X2, X3, X4, Y1, Y2, Y3, Y4), 2 local parity blocks (L1, L2) and 3 global parity blocks (R1, R2, R3). We need to configure the 8 (how many source blocks) and the 3 (how many global parity blocks); we could avoid configuring the 2 by assuming two local parity groups, each with one local parity block. So a schema like LRC-k8-m3, similar to the RS one RS-k6-m3, should work.

          drankye Kai Zheng added a comment -

          For the Hitchhiker code, Rashmi Vinayak and jack liuquan: for the supported modes, how should they be configured, and what are the key/optional parameters that matter in a deployment? Could you help comment on this? Thanks a lot.

          jack_liuquan jack liuquan added a comment -

          Hi Kai,
          As far as I know, the Hitchhiker code can be configured the same way as the RS code. Using the system-defined schemas RS(6,3) and RS(10,4) is OK.
          The Hitchhiker codec can also be configured as you show in PluggableErasureCodec-v3.pdf.

          rashmikv Rashmi Vinayak added a comment -

          Hi Kai,

          For each of the modes in Hitchhiker, configuration would be identical to RS (as jack liuquan also mentioned). So we could have schemas like HH-k10-m4 just as in RS.

          andrew.wang Andrew Wang added a comment -

          Hi Kai Zheng, this JIRA resurfaced due to the discussion on HDFS-8059 and also the need to persist this information in the fsimage/editlog. I was hoping we could clarify the configuration language for an EC codec and schema.

          I read through the v3 design doc, and please let me know if you think the following would work:

          Schema (stored on the EC zone or INode as a PB so we can evolve it; a sketch follows this list):

          • Codec enum (e.g. RS, LRC, etc), which would also have a friendly human-readable name. The enum is good for efficiency and so the user can only pick from supported codecs.
          • List of k,v pairs for configuration. This could be used for k, m, and any other arbitrary parameters needed by the codec. Very general.
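          To illustrate the shape of such a schema record, here is a Java rendering (not the actual protobuf definition; the names are hypothetical):

  import java.util.Map;

  /** Sketch of the proposed schema: a codec identifier plus free-form options. */
  class EcSchemaRecord {
    /** A wildcard value could be added later for fully pluggable codecs. */
    enum Codec { RS, RS_LEGACY, XOR }

    Codec codec;                   // enum for efficiency and validation
    Map<String, String> options;   // e.g. {"k": "6", "m": "3", "cellsize": "65536"}
  }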

          Client-side:

          • Client would validate the codec of a file against the codecs supported in its own software version. This way, if we add a new codec type, we can restrict old clients from reading it.
          • In client's hdfs-site.xml, we can configure a codec implementation for every codec. This would look something like e.g. dfs.client.ec.codec.reed-solomon.impl = org.apache.hadoop....isal, saying to use ISA-L for reed-solomon.

          This is just to get us going for phase 1. We'd be restricting users to choosing from a list of known-good codecs, while they could still provide their own codec implementations as long as they implement the interfaces.

          When we get to the point of fully-pluggable codecs, we can add a special "wildcard" enum value to support this, and then potentially add new fields to the PB if required. This will require another HDFS upgrade before we can support full pluggability, but it sounds like we still need to figure out interfaces for things like block placement and recovery logic anyway.

          drankye Kai Zheng added a comment -

          Thanks Andrew Wang for the thoughts on moving this forward a bit. They sound good to me. Some points for further discussion:

          ...also the need to persist this information in the fsimage/editlog, ...

          Did you mean the schema? If so, it looks like a point we all agree on. Per the discussion in HDFS-7859 and related issues, we planned to do it in a follow-on, along with support for multiple schemas. For now we only support one system-defined schema, RS(6, 3).

          Codec enum (e.g. RS, LRC, etc), ...When we get to the point of fully-pluggable codecs, we can add a special "wildcard" enum value to support this

          Good to have the enum for built-in codecs for now and the wildcard for customized additional ones in the future.

          In client's hdfs-site.xml, we can configure a codec implementation for every codec. This would look something like...

          In the existing code we're using the following format for a similar purpose. Please confirm whether it looks good.

            /** Raw coder factory for the RS codec. */
            public static final String IO_ERASURECODE_CODEC_RS_RAWCODER_KEY =
                "io.erasurecode.codec.rs.rawcoder";
          
            /** Raw coder factory for the XOR codec. */
            public static final String IO_ERASURECODE_CODEC_XOR_RAWCODER_KEY =
                "io.erasurecode.codec.xor.rawcoder";
          

          The related code resides in CodecUtil, which reads the above configurations. Please check it if necessary.
          When we're clear on what needs to be done for this phase, I'll open an issue to get it done separately. Thanks.

          wqijun wqijun added a comment -

          Hi Kai,

          We would like to add a new RS coder to the EC framework, such as an RS(8,4) coder. More importantly, we want to leverage a C library to accelerate this coder, not only for the Intel chip but also for the IBM POWER chip. We are not sure which branch our work should be based on, 7337 or 7285. Should we add a new JIRA branch? In addition, I have downloaded the latest Hadoop trunk version and found there are native ISA-L acceleration files but no coder class for ISA-L. Do you plan to add these coder classes to Hadoop trunk? Thanks a lot!

          drankye Kai Zheng added a comment -

          Hi Qijun,

          Thanks for your questions. I think it's fine to open a JIRA for your work on the new RS coder, based on trunk. Regarding the extra 8+4 schema, it's easy to add once we implement multiple-schema support. The ISA-L coder is under way and isn't finished yet.

          One question: why would the newly proposed RS coder be coupled with the specific 8+4 schema? Sure, we can discuss this separately on your new JIRA.

          drankye Kai Zheng added a comment -

          Andrew Wang, Zhe Zhang, Rakesh R or anybody

          Trying not to overcomplicate things: based on the existing code we already have, the goal here now seems easier to target.

          In ErasureCodingPolicyManager we have these built-in EC policies:

            private static final int DEFAULT_CELLSIZE = 64 * 1024;
            private static final ErasureCodingPolicy SYS_POLICY1 =
                new ErasureCodingPolicy(ErasureCodeConstants.RS_6_3_SCHEMA,
                    DEFAULT_CELLSIZE, HdfsConstants.RS_6_3_POLICY_ID);
            private static final ErasureCodingPolicy SYS_POLICY2 =
                new ErasureCodingPolicy(ErasureCodeConstants.RS_3_2_SCHEMA,
                    DEFAULT_CELLSIZE, HdfsConstants.RS_3_2_POLICY_ID);
            private static final ErasureCodingPolicy SYS_POLICY3 =
                new ErasureCodingPolicy(ErasureCodeConstants.RS_6_3_LEGACY_SCHEMA,
                    DEFAULT_CELLSIZE, HdfsConstants.RS_6_3_LEGACY_POLICY_ID);
            private static final ErasureCodingPolicy SYS_POLICY4 =
                new ErasureCodingPolicy(ErasureCodeConstants.XOR_2_1_SCHEMA,
                    DEFAULT_CELLSIZE, HdfsConstants.XOR_2_1_POLICY_ID);
            private static final ErasureCodingPolicy SYS_POLICY5 =
                new ErasureCodingPolicy(ErasureCodeConstants.RS_10_4_SCHEMA,
                    DEFAULT_CELLSIZE, HdfsConstants.RS_10_4_POLICY_ID);
          

          In ErasureCodeConstants we have these schemas used by the above policies:

            public static final String RS_CODEC_NAME = "rs";
            public static final String RS_LEGACY_CODEC_NAME = "rs-legacy";
            public static final String XOR_CODEC_NAME = "xor";
            public static final String HHXOR_CODEC_NAME = "hhxor";
          
            public static final ECSchema RS_6_3_SCHEMA = new ECSchema(
                RS_CODEC_NAME, 6, 3);
          
            public static final ECSchema RS_3_2_SCHEMA = new ECSchema(
                RS_CODEC_NAME, 3, 2);
          
            public static final ECSchema RS_6_3_LEGACY_SCHEMA = new ECSchema(
                RS_LEGACY_CODEC_NAME, 6, 3);
          
            public static final ECSchema XOR_2_1_SCHEMA = new ECSchema(
                XOR_CODEC_NAME, 2, 1);
          
            public static final ECSchema RS_10_4_SCHEMA = new ECSchema(
                RS_CODEC_NAME, 10, 4);
          

          HDFS-11314 allows enforcing the set of enabled EC policies on the NameNode as follows:

           <property>
            <name>dfs.namenode.ec.policies.enabled</name>
            <value>RS-6-3-64k, RS-10-4-128k</value>
            <description>Comma-delimited list of enabled erasure coding policies.
              The NameNode will enforce this when setting an erasure coding policy
              on a directory.
            </description>
          </property>
          

          For a codec, the raw coder implementation to use can be configured as follows, taking the rs codec as an example:

          <property>
            <name>io.erasurecode.codec.rs.rawcoder</name>
            <value>org.apache.hadoop.io.erasurecode.rawcoder.RSRawErasureCoderFactory</value>
            <description>
              Raw coder implementation for the rs codec. The default value is a
              pure Java implementation. There is also a native implementation. Its value
              is org.apache.hadoop.io.erasurecode.rawcoder.NativeRSRawErasureCoderFactory.
            </description>
          </property>
          

          So given the above, what is lacking and needed now is a mechanism (say, writing an XML file) to let admin users define their own EC schemas and policies on the NameNode side. The reasons to do this:

          • Users want to try a different codec;
          • Users want to use different codec parameters, for the RS codec, say 10+4 rather than 6+3;
          • Users want to try a cell size other than 64k.

          Yes, it's nice to have. I've heard of people wanting to try things other than the built-in ones available in the code. If it doesn't sound too heavyweight, we can work on it and make it in the release cycle.
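          For illustration, such an XML file might look roughly like the sketch below; the element names and layout are hypothetical here, since the exact format would be settled in the follow-on JIRAs:

  <ecpolicies>
    <schemas>
      <schema id="RSk12m4">
        <codec>rs</codec>
        <k>12</k>
        <m>4</m>
      </schema>
    </schemas>
    <policies>
      <policy>
        <schema>RSk12m4</schema>
        <cellsize>131072</cellsize>
      </policy>
    </policies>
  </ecpolicies>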

          Comments?

          andrew.wang Andrew Wang added a comment -

          Thanks for reviving this Kai Zheng!

          Will we allow users to define new EC codecs? If so, we could exercise it by refactoring out rs-legacy to be a pluggable policy as a test, or HDFS-9345 to implement a dummy codec. It looks like the ECSchemaProto and ErasureCodingPolicyProto support this kind of pluggability already, so we can transmit this info to the client.

          I think it'd be good to keep this info in the normal NN metadata if possible. Without the policy information, you essentially have data loss, and from an operational POV, admins might not be used to backing up an extra XML file.

          andrew.wang Andrew Wang added a comment -

          From looking at the protos, one other question I had is about the overhead of these protos when using the hardcoded policies. There are a bunch of strings and ints, which can be kind of heavy since they're added to each HdfsFileStatus. Should we make the built-in ones identified by purely an ID, with these fully specified protos used for the pluggable policies?

          drankye Kai Zheng added a comment -

          Thanks Andrew Wang for the response!

          Will we allow users to define new EC codecs?

          Ah yes, it's a good point to refactor out the rs-legacy codec: not listing the related policies as built-in, but rather adding them back as pluggable. I think this is what you meant, right? Totally makes sense if so.

          I think it'd be good to keep this info in the normal NN metadata if possible.

          Yes, this is needed if we allow customized codecs and policies. I'll revisit the attached design doc and provide a new revision, making sure something like this is well considered. Basically, by writing an XML file, admin users can define their own codecs, schemas and policies. An admin command is needed to trigger loading it, populating the user policies list managed by ErasureCodingPolicyManager. The list will be persisted into the fsimage. Another command is provided to allow removal of a policy, if the policy isn't referenced by the dfs.namenode.ec.policies.enabled property or used at all (is it hard to know whether a policy is used by a file/folder?).

          Should we make the built-in ones identified by purely an ID, with these fully specified protos used for the pluggable policies?

          This sounds like it could be considered separately because, for either built-in policies or plugged-in policies, the full meta info is maintained either in the code or persisted in the fsimage, so identifying them purely by an ID should work fine. If you agree, we could refactor the code you mentioned above separately.

          Thanks!

          andrew.wang Andrew Wang added a comment -

          Hi Kai,

          Ah yes, it's a good point to refactor out the rs-legacy codec: not listing the related policies as built-in, but rather adding them back as pluggable. I think this is what you meant, right? Totally makes sense if so.

          Yep, exactly

          by writing an XML file, admin users can define their own codecs, schemas and policies.

          I gave the v3 doc a quick look; it sounds like the XML file is basically an input for a "refresh" command, and is unnecessary after it's loaded since the information is persisted to the NN metadata.

          It might be simpler for admins if we still do this over an RPC interface. Rather than specifying all the ECSchema info as arguments, the CLI tool can take the XML file as input. The CLI tool can also perform basic validation, and prompt the user when doing possibly destructive operations like removing a schema.

          I like this a bit better since the admin doesn't need to be SSH'd into the NameNode, know where to put the XML file, or know which NN is active. It might also simplify error reporting for malformed requests, since it'll be returned on the CLI rather than in a log file.

          andrew.wang Andrew Wang added a comment -

          Hit submit too soon; I also filed HDFS-11565 to cover the HdfsFileStatus optimizations. Will leave the existing protos for use by pluggable schemas.

          drankye Kai Zheng added a comment -

          Thanks Andrew!

          It might be simpler for admins if we still do this over an RPC interface. Rather than specifying all the ECSchema info as arguments, the CLI tool can take the XML file as input. The CLI tool can also perform basic validation, and prompt the user when doing possibly destructive operations like removing a schema.

          It's a great suggestion and, as you said, it sounds much better. We use an XML file to define codecs, schemas and policies, and then have the CLI parse and validate it, send it over RPC, and load the entries on the NameNode side. One thing left: do we need a sample XML file in the configuration folder for admins to reference?

          Do you think we should allow removing a schema/policy by this XML means? IMO, the XML file is only for new entries. An extra CLI command could be provided for removal. When doing a removal, the codec/schema/policy name would be used to identify the entry to remove. No update is supported, since admins can remove and then add.

          Glad we're much closer now. Hope we can revive the work soon.

          andrew.wang Andrew Wang added a comment -

          Hi Kai, glad the suggestion was helpful,

          Do you think we should allow removing a schema/policy by this XML means? IMO, the XML file is only for new entries. An extra CLI command could be provided for removal. When doing a removal, the codec/schema/policy name would be used to identify the entry to remove. No update is supported, since admins can remove and then add.

          Agree, sounds good to me. Thanks again for driving this!

          andrew.wang Andrew Wang added a comment -

          Adding the incompatible flag and raising the priority so we know we need this before beta; it's possibly one of the bigger EC JIRAs left too.

          drankye Kai Zheng added a comment -

          Agreed, Andrew. Let's speed up on this, and I'll break it down so we can parallelize.

          drankye Kai Zheng added a comment -

          Unassigned this because more than one person will help with the work.

          Sammi SammiChen added a comment -

          For policy removal, one suggestion is lazy removal. That is, when a user runs the CLI command to remove a policy A, policy A will only be marked as removed in the NN and will no longer be available to apply to new dirs/files. Then, at the next NN restart, if no dir/file is using the policy, policy A can be deleted permanently from the system. With this approach, we leverage the NN restart process, which walks the entire namespace, to find out whether any file uses this specific policy.

          Sammi SammiChen added a comment -

          Updated the design document to reflect the up-to-date design.

          drankye Kai Zheng added a comment -

          Thanks SammiChen for moving this forward.

          The PluggableErasureCodec-v4 design reflects all the latest discussions (mainly with Andrew Wang, SammiChen and Wei-Chiu Chuang) and related work. Important ones are:
          HADOOP-13200 Implement customizable and configurable erasure coders;
          HDFS-11314 Enforce set of enabled EC policies on the NameNode;
          HDFS-11605 Allow user to customize new erasure code policies.

          To close the gap for the upcoming 3.0 ALPHA3 release, we're also actively working on these:
          HDFS-11794 Add ec sub command -listCodec to show currently supported ec codecs
          HDFS-11606 Add CLI cmd to remove an erasure code policy
          HDFS-7859 Erasure Coding: Persist erasure coding policies in NameNode

          As we're close and will make the changes to the NameNode fsimage/editlog in HDFS-7859, it would be great if more folks could take a look at the latest design doc and raise any concerns. In particular, could I ping Tsz Wo Nicholas Sze, Jing Zhao, Zhe Zhang and Vinayakumar B? Thanks for your time!

          eddyxu Lei (Eddy) Xu added a comment -

          Thanks a lot for the new design doc; it looks great, SammiChen.

          For policy removal, one suggestion is lazy removal. That is, when a user runs the CLI command to remove a policy A, policy A will only be marked as removed in the NN and will no longer be available to apply to new dirs/files. Then, at the next NN restart, if no dir/file is using the policy, policy A can be deleted permanently from the system.

          Do you know what the overhead of this check is when the NN restarts? Will it introduce a noticeable slowdown to the NN startup process?

          • Enable policy: while it supports adding/removing policies using the CLI, why does it not support enabling/using a policy via the CLI? This limitation makes the user experience inconsistent.
          • For the configuration keys of codecs, why do we need ".rawcoders" as a suffix for each one?

          The order is defined by a list of coder names separated by commas.

          Do you mean the implementations of the same algorithm? Btw, is there a system-wide default codec / configuration to use?

          Sammi SammiChen added a comment -

          Hi Lei (Eddy) Xu, thanks for reviewing the design doc!

          Do you know what the overhead of this check is when the NN restarts? Will it introduce a noticeable slowdown to the NN startup process?

          My current idea is that when the NN restarts, it will load all file and directory inodes and check whether each file is a striped file or each directory has an EC policy applied. So we can leverage this process: if the file is a striped file or the directory is an EC directory, add one extra step to put its EC policy ID into a global map. Once we have all the EC policy IDs in use, we can decide whether a user-removed EC policy can ultimately be deleted or not. I think one extra simple step will not introduce a noticeable slowdown to the NN startup process.
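          A minimal sketch of that extra step, with hypothetical class and method names (the real logic would live in the inode-loading path):

  import java.util.HashSet;
  import java.util.Set;

  /** Sketch: collect the EC policy IDs actually referenced while inodes are loaded. */
  class EcPolicyUsageTracker {
    private final Set<Byte> usedPolicyIds = new HashSet<>();

    /** Called once per loaded file/directory inode that carries an EC policy ID. */
    void recordUsage(byte ecPolicyId) {
      usedPolicyIds.add(ecPolicyId);
    }

    /** After loading finishes, a policy marked as removed can be purged only if unreferenced. */
    boolean canPurge(byte removedPolicyId) {
      return !usedPolicyIds.contains(removedPolicyId);
    }
  }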
          We have also thought about an alternative solution: user-removed EC policies would no longer be visible to users through any NN API, but would not actually be deleted from the system, leaving some stale information in the system.

          Enable policy: while it supports adding/removing policies using the CLI, why does it not support enabling/using a policy via the CLI? This limitation makes the user experience inconsistent.

          It has a history. At first, there were only built-in EC policies, such as RS(6,3) and RS(10,4). Then an improvement was made to keep users from applying an EC policy that is not feasible for their cluster; for example, if the cluster has only 6 datanodes, RS(10,4) is not feasible at all. The improvement was made through the 'dfs.namenode.ec.policies.enabled' property, which requires admin privilege to set the enabled policies and a cluster restart, out of caution. Then came user-defined EC policies, so we think 'dfs.namenode.ec.policies.enabled' can be leveraged to enable or disable user-defined policies, also out of caution.
          Meanwhile, I have discussed this question with Kai. Adding an extra API is also an alternative, but I want to hear more from you all.

          For the configuration keys of codecs, why do we need ".rawcoders" as a suffix for each one? Do you mean the implementations of the same algorithm?

          Yes, they are different implementations of the same algorithm. For example, for the RS algorithm there are the pure Java coder, the ISA-L coder and the HDFS RAID coder. And because the configuration keys are used to define the order of the different raw coders, we use .rawcoders as the suffix.
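          For example, such an ordered-coder configuration might look like the following; the exact key and coder names here are illustrative, since they were still being finalized under HADOOP-13200:

  <property>
    <name>io.erasurecode.codec.rs.rawcoders</name>
    <value>rs_native,rs_java</value>
    <description>
      Comma-separated coder names for the rs codec, tried in order; the first
      implementation available on the node is used.
    </description>
  </property>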

          is there a system-wide default codec / configuration to use?

          Yes. There are several system-wide codecs to use, including the RS codec, the RS legacy codec and the XOR codec.

          drankye Kai Zheng added a comment -

          Lei (Eddy) Xu, do the above comments address your questions? Any further thoughts? Thanks for taking the time to look at this aspect; very helpful. Andrew Wang, any concern if we proceed with a CLI cmd to enable/disable an EC policy? I mean, as discussed above, as an alternative to the configuration property we introduced before.

          eddyxu Lei (Eddy) Xu added a comment -

          Hi, Kai Zheng and SammiChen

          Thanks a lot for the reply. The explanation helps a lot.

          There are several system-wide codecs available, including the RS codec, the RS legacy codec and the XOR codec.

          Is there a way to choose a system-wide default codec? So that after the cluster is initialized, users and admins can simply mark a zone / directory as "erasure coded", instead of choosing from several different codecs, each of which has its own trade-offs that the user / admin needs to understand?

          while it supports adding / removing policies using the CLI, it does not support enabling / using a policy via the CLI?

          My concern is that, even if the admin is able to add a policy via the API dynamically, it still requires the admin to reboot the NN, or ssh into the NN / change conf files and reload NN confs, to enable the policy. It makes the workflow complicated. I think using the API / CLI and ssh-ing into the NN / changing conf files should be two different sets of operations. If possible, it is more consistent to do the EC policy management entirely in one approach, or in both. The current design does half of the management in each approach.

          Thanks.
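          (For concreteness, the admin experience being asked about might look like the sketch below. It assumes the standard hdfs ec subcommands and that a system-wide default policy is configured so that -policy can be omitted; the default-policy behaviour is exactly what is being discussed here, so treat this as illustrative.)

              # Explicitly choose a policy for a directory ...
              hdfs ec -setPolicy -path /data/warehouse -policy RS-6-3-64k

              # ... or, with a system-wide default policy in place, omit the policy name entirely
              hdfs ec -setPolicy -path /data/warehouse
              hdfs ec -getPolicy -path /data/warehouse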

          andrew.wang Andrew Wang added a comment -

          Hi folks, thanks for the discussion,

          Is there a way to choose a system-wide default codec? So that after the cluster is initialized, users and admins can simply mark a zone / directory as "erasure coded", instead of choosing from several different codecs, each of which has its own trade-offs that the user / admin needs to understand?

          We had a system default policy originally, but then moved away from it. I'm open to bringing it back if we believe that there's typically only one policy in a cluster. I think this is likely true.

          My concern is that, even if the admin is able to add a policy via the API dynamically, it still requires the admin to reboot the NN, or ssh into the NN / change conf files and reload NN confs, to enable the policy. It makes the workflow complicated.

          Yea, this is true. I can envision how this would work with just CLI commands: add/remove/enable/disable. I don't know how we'd do this with just config, since we want the safety of persisting things in the fsimage.

          So, shall we do it all via API?
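          (To make the all-CLI option concrete, a policy lifecycle along these lines might look like the sketch below. The add/remove/set/get subcommands already exist in the hdfs ec tool; the enable/disable verbs are the addition being discussed, so all of this, including the RS-6-3-1024k policy name, is illustrative rather than final syntax.)

              # Hypothetical end-to-end lifecycle of a user-defined policy, all via CLI:
              hdfs ec -addPolicies -policyFile my_ec_policies.xml      # policy persisted in the fsimage
              hdfs ec -enablePolicy -policy RS-6-3-1024k               # no NN restart or conf file edit
              hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k
              hdfs ec -disablePolicy -policy RS-6-3-1024k              # stop new use of the policy
              hdfs ec -removePolicy -policy RS-6-3-1024k               # only once no file uses it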

          Sammi SammiChen added a comment - - edited

          Thanks Lei (Eddy) Xu and Andrew Wang for the discussion and feedback! Agree that a CLI command to enable/disable an erasure coding policy will be very helpful to end users. HDFS-11870 has been created to track this. I will move on with the implementation.

          danielpol Daniel Pol added a comment -

          Maybe you can help me with some EC info. This is with alpha4. Not sure if this is the correct JIRA to comment under.
          I’ve built a custom EC policy and managed to use it properly (RS-5-2-256k, just because I have only 7 datanodes in my setup and want lower overhead while still supporting 2 failures). It generates data as expected and works fine.

          The problem I have is that when I restart HDFS it all gets lost, and there are 2 possible cases:
          1. If I have it enabled in dfs.namenode.ec.policies.enabled, the namenode won’t start up, as it’s not part of the “standard” EC policies. The error logged is:
          2017-07-13 18:47:37,339 ERROR namenode.NameNode (NameNode.java:main(1709)) - Failed to start namenode.
          java.lang.IllegalArgumentException: EC policy 'RS-5-2-256k' specified at dfs.namenode.ec.policies.enabled is not a valid policy. Please choose from list of available policies: [RS-6-3-64k, RS-3-2-64k, RS-LEGACY-6-3-64k, XOR-2-1-64k, RS-10-4-64k, ]
          at org.apache.hadoop.hdfs.server.namenode.ErasureCodingPolicyManager.init(ErasureCodingPolicyManager.java:123)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:868)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:693)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:664)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:726)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:949)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:928)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1637)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1704)

          2. If I don’t have it enabled, the namenode starts up but doesn’t leave safe mode, because it doesn’t know how to read the blocks (without knowing the custom EC policy).

          Right now, I end up deleting all EC data before restarts (or manually leaving safe mode and deleting the EC data). This delays my testing quite a bit since I have to recreate the data all the time.

          I’m not 100% sure whether this is a bug or just something I don’t know how to configure properly. Maybe there’s some undocumented setting I’m not aware of that can point to my custom EC policy XML file at startup.
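          (For context, a user-defined policy such as RS-5-2-256k is typically introduced with a small policy file plus one CLI call, roughly as sketched below. The element names follow the user_ec_policies.xml template shipped with Hadoop 3.x and may differ slightly between releases; as Kai explains next, at this point the added policy was not yet persisted across a NameNode restart, which is what Daniel is running into.)

              <!-- my_ec_policies.xml: hypothetical file defining an RS schema with k=5, m=2 -->
              <configuration>
                <layoutversion>1</layoutversion>
                <schemas>
                  <schema id="RSk5m2">
                    <codec>rs</codec>
                    <k>5</k>
                    <m>2</m>
                    <options></options>
                  </schema>
                </schemas>
                <policies>
                  <policy>
                    <schema>RSk5m2</schema>
                    <cellsize>262144</cellsize>  <!-- 256 KB cells, yielding the name RS-5-2-256k -->
                  </policy>
                </policies>
              </configuration>

              # Register the policy with the NameNode:
              hdfs ec -addPolicies -policyFile my_ec_policies.xml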

          drankye Kai Zheng added a comment -

          Hi Daniel Pol,

          Yes, you're asking in the right place. Please note this isn't finished yet; the last step is to persist the configured EC policies into the fsimage, so that in your case they'll be recovered when HDFS restarts. The work is tracked in HDFS-7859, so please stay tuned. It's likely to be solved in the beta 1 release.

          danielpol Daniel Pol added a comment -

          Thanks for letting me know; I'll wait for beta1 for this feature. I'll continue testing with my custom EC policy in the meantime.

          drankye Kai Zheng added a comment -

          Thanks for the testing and contribution! Note that, apart from user-defined policies not yet being persisted and the default system policy still needing to be supported, the other functionality discussed here is already done and should be fine. If you find anything that doesn't work, please let us know or file bugs. Thanks!

          andrew.wang Andrew Wang added a comment -

          Hey folks, are we on track for completing this by beta1 in ~three weeks? We want metadata and API changes to be complete by then.

          Sammi SammiChen added a comment -

          Hi Andrew Wang, the remaining task is to persist erasure coding policies to the fsimage. We are on track to complete it before beta1, and review help is very much needed.

          andrew.wang Andrew Wang added a comment -

          I think we've resolved the scope targeted for beta1, shall we close this umbrella JIRA and move out the remaining subtasks?

          drankye Kai Zheng added a comment -

          Good idea, Andrew. Let's do that.

          drankye Kai Zheng added a comment -

          Resolved this issue with the release note added. The remaining issues here were moved out, as they're not targeted for 3.0.

          andrew.wang Andrew Wang added a comment -

          Thanks Kai!

          andrew.wang Andrew Wang added a comment -

          Assigning this to Sammi for release notes process.

          Sammi SammiChen added a comment -

          Hi Andrew Wang, the release note is ready. Is there anything I need to do besides that?

          andrew.wang Andrew Wang added a comment -

          Nothing additional to do, just wanted to make sure all fixed JIRAs have an assignee.

          xiaochen Xiao Chen added a comment -

          Hi Kai Zheng and SammiChen,
          Thanks a lot for working on this cool feature!
          Is there a jira for persisting EC policy to NN metadata? I searched the linked jiras but didn't find one.

          rakeshr Rakesh R added a comment -

          Is there a jira for persisting EC policy to NN metadata?

          HDFS-7859; hopefully this JIRA will give you more info.

          Sammi SammiChen added a comment -

          Hi Xiao Chen, HDFS-7859 is for persisting EC policies in the protobuf fsimage, and HDFS-12395 is for supporting the EC policy APIs in the edit log.

          Thanks Rakesh R!

          xiaochen Xiao Chen added a comment -

          Got it, will look into those. Thanks for the prompt response!


            People

            • Assignee:
              Sammi SammiChen
              Reporter:
              zhz Zhe Zhang
            • Votes:
              2
              Watchers:
              26
