Hadoop Map/Reduce: MAPREDUCE-2036

Enable Erasure Code in Tool similar to Hadoop Archive

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: contrib/raid, harchive
    • Labels:
      None

      Description

      Features:
      1) HAR-like tool
      2) RAID5/RAID6, with a pluggable interface for implementing additional codes
      3) Ability to group blocks across files
      4) Portable across clusters, since all necessary metadata is embedded

      While it was developed separately from HAR and RAID due to time constraints, it would make sense to integrate it with either of them.
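The RAID5 case in feature 2 reduces to XOR parity over the data blocks in a group: one parity block protects against any single lost block. A minimal sketch of that idea (class and method names are illustrative, not taken from the attached patch):

```java
// Minimal sketch of RAID5-style single-parity coding over a block group.
// Names are illustrative, not from the attached MAPREDUCE-2036 patch.
public class XorParity {

    // The parity block is the byte-wise XOR of all data blocks in the group.
    public static byte[] encode(byte[][] dataBlocks) {
        byte[] parity = new byte[dataBlocks[0].length];
        for (byte[] block : dataBlocks) {
            for (int i = 0; i < parity.length; i++) {
                parity[i] ^= block[i];
            }
        }
        return parity;
    }

    // Any single lost block is recovered by XORing the parity block
    // with all surviving data blocks.
    public static byte[] reconstruct(byte[][] survivingBlocks, byte[] parity) {
        byte[] lost = parity.clone();
        for (byte[] block : survivingBlocks) {
            for (int i = 0; i < lost.length; i++) {
                lost[i] ^= block[i];
            }
        }
        return lost;
    }
}
```

RAID6 and the longer codes discussed below add a second (or n-th) parity block computed over a Galois field rather than by plain XOR, which is where a pluggable coding interface pays off.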

      1. hdfs-raid.tar.gz
        34 kB
        Wittawat Tantisiriroj
      2. MAPREDUCE-2036.patch
        198 kB
        Wittawat Tantisiriroj
      3. RaidTool.pdf
        917 kB
        Wittawat Tantisiriroj

        Issue Links

          Activity

          Wittawat Tantisiriroj created issue -
          Wittawat Tantisiriroj added a comment -

          Design Document

          Wittawat Tantisiriroj made changes -
          Attachment RaidTool.docx [ 12453272 ]
          Wittawat Tantisiriroj added a comment -

          PDF version of the design document is uploaded.

          Wittawat Tantisiriroj made changes -
          Attachment RaidTool.pdf [ 12453274 ]
          Wittawat Tantisiriroj made changes -
          Attachment RaidTool.docx [ 12453272 ]
          Wittawat Tantisiriroj added a comment -

          Prototype is uploaded.

          Wittawat Tantisiriroj made changes -
          Attachment hdfs-raid.tar.gz [ 12453275 ]
          Attachment MAPREDUCE-2036.patch [ 12453276 ]
          Tsz Wo Nicholas Sze made changes -
          Assignee Wittawat Tantisiriroj [ wtantisi ]
          Component/s harchive [ 12312903 ]
          Ramkumar Vadali added a comment -

          Hi Wittawat, good work on generating this patch! The concept of RAIDing files in a directory is a good complement to the existing RAID, which requires larger files.
          Some thoughts:
          1. It will be really good to integrate this with the current RAID. Apart from the obvious code reuse in DistributedRaidFileSystem, it has some automation around generating parity files. I also have a lot of upcoming patches that automate repair of lost blocks.
          2. I did not see code that reduces the replication for RAIDed files. Is that supposed to be done independent of this tool?
          3. A usage-related question: I assume the source directory under consideration is older data such that users can tolerate some increase in read latency. If so, the source directory could be HAR'ed and the result files then RAIDed using the current RAID. Thoughts?

          Looking forward to a good discussion!

          Jeff Hammerbacher made changes -
          Link This issue is related to HDFS-503 [ HDFS-503 ]
          Wittawat Tantisiriroj added a comment -

          1) Yes, that is a great idea. Please let me know how I can help integrate it with the current RAID.
          2) Before reducing the replication, I would like to make sure that no block in the same group is located on the same datanode. I have been working on a tool (similar to the Balancer) to migrate blocks so that no two blocks in the same group end up on the same datanode.
          3) I agree. However, would it make sense to store the parity files inside a HAR directory?

          Plus, I am also working on porting the RS codes from Jerasure (http://www.cs.utk.edu/~plank/plank/papers/CS-08-627.html), so the tool can support more than two parities.

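The invariant behind the block-migration tool described above is that a single datanode failure must never take out two blocks of the same erasure-coded group. A minimal per-group check, sketched with hypothetical names (the actual tool's data structures may differ):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the placement invariant behind the proposed migration tool:
// no two blocks of the same erasure-coded group may sit on the same
// datanode, or a single node failure could lose two blocks of one stripe.
// Names are hypothetical, not from the attached patch.
public class PlacementCheck {

    // blockToNode maps each block id in one group to the datanode hosting it.
    public static boolean groupIsSafe(Map<String, String> blockToNode) {
        Set<String> seenNodes = new HashSet<>();
        for (String node : blockToNode.values()) {
            if (!seenNodes.add(node)) {
                return false; // two blocks of this group share a datanode
            }
        }
        return true;
    }
}
```

A Balancer-like tool would run this check per group and schedule moves for any group where it fails, before the replication factor is reduced.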
          Scott Chen added a comment -

          Hey Wittawat,

          2) Before reducing the replication, I would like to make sure that no block in the same group located on the same datanode. I have been working on a tool (similar Balancer) to migrate blocks so that no block in the same group located on the same datanode.

          I like this idea of migrating blocks. Would it be possible for you to implement this in the current RAID project? That would be really helpful.

          Plus, I am also working on porting RS codes from Jerasure (http://www.cs.utk.edu/~plank/plank/papers/CS-08-627.html), so it can support more than 2 parities.

          We have also implemented a Java version of the RS code in MAPREDUCE-1970. It has been deployed on our test cluster, which holds 300 TB of data.
          That patch includes an interface for general erasure codes.
          Maybe you can make your patch implement the same interface, so that different codecs can be configured.
          I think encode/decode is more IO-bound, because the parity length we are using is really small compared to the regular use cases of RS codes.

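The codec interface Scott mentions might look roughly like the sketch below: a small interface plus a registry keyed by a configured codec name. This is a hypothetical shape; the actual MAPREDUCE-1970 API may differ.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of a pluggable erasure-code abstraction, in the
// spirit of the interface Scott describes. Not the MAPREDUCE-1970 API.
interface ErasureCode {
    int parityLength();              // parity symbols produced per stripe
    int[] encode(int[] message);     // compute parity from a data stripe
}

// Single-parity (RAID5-like) codec: parity is the XOR of the stripe.
class XorCode implements ErasureCode {
    public int parityLength() { return 1; }
    public int[] encode(int[] message) {
        int p = 0;
        for (int m : message) {
            p ^= m;
        }
        return new int[] { p };
    }
}

public class Codecs {
    // A configuration key such as "rs" could map to an RS codec here.
    private static final Map<String, Supplier<ErasureCode>> REGISTRY =
        Map.of("xor", XorCode::new);

    // Look up a codec by its configured name.
    public static ErasureCode get(String name) {
        return REGISTRY.get(name).get();
    }
}
```

With such a registry, the tool could pick the codec from a config value, so RAID5, RAID6, or a longer RS code becomes a deployment choice rather than a code change.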
          李志然 made changes -
          Assignee Wittawat Tantisiriroj [ wtantisi ] 李志然 [ lizhiran ]

            People

            • Assignee: 李志然
            • Reporter: Wittawat Tantisiriroj
            • Votes: 1
            • Watchers: 19

              Dates

              • Created:
                Updated:
