Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Implemented

    Description

      We can support reasonably well those use cases on non-HDFS filesystems, like S3, where an external writer has loaded (and continues to load) HFiles via the bulk load mechanism, and we then serve a read-only workload at the HBase API.

      Mixed or write-heavy workloads won't fare as well. In fact, data loss seems certain. The details will depend on the specific filesystem, but all of the S3-backed Hadoop filesystems suffer from a couple of obvious problems, notably a lack of atomic rename.
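      To make the rename problem concrete, here is a simplified sketch of the write-then-rename commit idiom HBase relies on for flushes and compactions. The class and method names are illustrative, not actual HBase code; the point is the rename step in the middle.

          // Illustrative sketch of the write-then-rename commit idiom; not actual HBase code.
          import java.io.IOException;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class RenameCommitSketch {
            static void commitStoreFile(FileSystem fs, Path tmpFile, Path finalFile)
                throws IOException {
              // On HDFS this rename is an atomic, O(1) metadata operation. On an
              // S3-backed filesystem it is a server-side copy followed by a delete:
              // O(data) and not atomic, so a failure partway through can leave a
              // partially copied or duplicated file, and an undetected server-side
              // copy failure can lose the commit entirely.
              if (!fs.rename(tmpFile, finalFile)) {
                throw new IOException("Failed to commit " + tmpFile + " to " + finalFile);
              }
            }
          }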

      This umbrella will serve to collect some related ideas for consideration.


        Activity

          apurtell Andrew Kyle Purtell added a comment -

          Let me just state this to get it out of the way. As you can imagine, reading between the lines, the motivation to look at this where I work is the good probability that our storage stack is going to use either Amazon's S3 service "where applicable" or a compatible API analogue. Please don't take this to imply anything about business relationships, or not. Really, I personally would have no idea one way or the other.

          zyork Zach York added a comment -

          Andrew Kyle Purtell I'll look into the specifics in a little bit, but I feel like relying less on the FS (atomic renames for example) might be the right way to go here. A while back there was some work done (or proposed) to have HBase handle the file metadata somewhere to avoid the necessity of renames (HBase would update the path/location in this table so in effect, the rename would be atomic). I didn't spend a ton of time looking for the old issues, but I think this one was related: HBASE-14090.

          Michael Stack and Umesh Agashe did some initial planning on it and I planned to help out, but got sidelined by other stuff. They might be able to chime in here as well.
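          For illustration, a rough sketch of that rename-avoidance idea: write the file once in its final location, then "commit" it by atomically recording its path in a metadata store, so the object store never has to rename anything. The interface and names here are hypothetical, not the HBASE-14090 design.

              // Hypothetical sketch of a pointer-flip commit; not the HBASE-14090 design.
              import java.io.IOException;
              import org.apache.hadoop.fs.Path;

              interface StoreFileLocations {
                // Atomically record that this store file now lives at 'path',
                // backed by something with atomic updates (e.g. a system table).
                void commitLocation(String store, String fileName, Path path) throws IOException;

                // Readers resolve store files from here instead of listing the directory.
                Iterable<Path> list(String store) throws IOException;
              }

              final class PointerFlipCommit {
                static void commit(StoreFileLocations locations, String store,
                    String fileName, Path writtenPath) throws IOException {
                  // The data was written directly to its final path; the only
                  // "move" is this single atomic metadata update.
                  locations.commitLocation(store, fileName, writtenPath);
                }
              }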

          apurtell Andrew Kyle Purtell added a comment - edited

          Zach York Oh yes, certainly, see HBASE-20431. Not relying on atomic rename would be the first order of business.


          apurtell Andrew Kyle Purtell added a comment -

          A while back there was some work done (or proposed) to have HBase handle the file metadata somewhere to avoid the necessity of renames (HBase would update the path/location in this table so in effect, the rename would be atomic).

          This is interesting. Of course I missed it, mired in 0.98 fleet maintenance. :-/ We could do this instead of, or in addition to, HBASE-20431, which proposes something like SplitTransaction but for commits of store files after compaction or flush.
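          A hedged sketch of what "something like SplitTransaction" could mean here: a journaled, resumable state machine around the store-file commit, so a crashed RegionServer can roll the commit forward or back on recovery. The states and names are illustrative, not the actual HBASE-20431 proposal.

              // Illustrative journaled commit, loosely modeled on SplitTransaction;
              // not the actual HBASE-20431 design.
              enum CommitState { STARTED, FILE_WRITTEN, METADATA_UPDATED, COMMITTED }

              final class StoreFileCommitTransaction {
                private final java.util.List<CommitState> journal = new java.util.ArrayList<>();

                void transition(CommitState next) {
                  journal.add(next);  // in practice, persist the journal somewhere durable
                }

                // On recovery, the last journaled state tells us whether to roll
                // forward (finish the commit) or roll back (delete the new file).
                CommitState lastState() {
                  return journal.isEmpty() ? CommitState.STARTED
                                           : journal.get(journal.size() - 1);
                }
              }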

          uagashe Umesh Agashe added a comment -

          Andrew Kyle Purtell, Zach York: HBASE-14090 had multiple objectives, and one of them was supporting file systems other than HDFS. Currently that work is on hold. I also found that, along with atomic rename, at a couple of places we check file permissions (rwx), which are not supported by, say, S3.
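          For example, a check like the following is meaningful on HDFS but not on S3A, which reports synthetic permissions rather than enforcing any (the wrapper class is illustrative, not a spot in the HBase code):

              // Meaningful on HDFS; on S3A the returned permissions are synthetic
              // (the connector does not enforce POSIX permissions), so gating
              // behavior on this check is misleading there.
              import java.io.IOException;
              import org.apache.hadoop.fs.FileStatus;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;
              import org.apache.hadoop.fs.permission.FsAction;

              public class PermissionCheckSketch {
                static boolean ownerCanWrite(FileSystem fs, Path p) throws IOException {
                  FileStatus st = fs.getFileStatus(p);
                  return st.getPermission().getUserAction().implies(FsAction.WRITE);
                }
              }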


          apurtell Andrew Kyle Purtell added a comment -

          HBASE-14090 has far too wide a scope; I'm not going to attempt that. Let's proceed here with the goal of getting HBase running with mixed workloads on S3 (or not).

          stack Michael Stack added a comment -

          Andrew Kyle Purtell

          Should we pow-wow on S3'ing? A meeting/hangout? I see a bunch of efforts in this direction (e.g. WAL elsewhere). Perhaps it'd be possible for there to be a bit of coordination.

          I like your talking out loud about your experience. That helps.

          I'd be interested too in how we could avoid the FS redo.

          stevel@apache.org Steve Loughran added a comment - edited

          One thing which would be good for you all to write down is: what your expectations are of an FS for things to work.

          In particular:

          • create/read/update/delete consistency
          • listing consistency
          • which ops are required to be atomic and O(1)
          • is it ok for create(path, overwrite=false) to be non-atomic?
          • when you expect things to be written to the store
          • how long do you expect the final close() to take.

          Identify these things and you can start to see which stores can work, and where you need to involve other things to get the semantics you need.
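          As a rough guide, here is where HDFS and the S3 connectors stood on these points at the time of this discussion:

          Expectation                        HDFS          S3 / S3A (at the time)
          create/read/update/delete          consistent    eventual for overwrites and deletes
          listing consistency                consistent    eventually consistent
          atomic, O(1) rename                yes           no (copy + delete, O(data))
          atomic create(overwrite=false)     yes           no (check-then-write race)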

          stack Michael Stack added a comment -

          (note to self: invite Steve Loughran to any meeting if it happens... and review S3Guard...)

          zyork Zach York added a comment -

          I was planning to start work on some of this in a little bit, but I think we need to decide whether we want to:

          1) fix this in the FileSystem (i.e., not change HBase's assumption of a strongly consistent FileSystem)

          or

          2) fix this in HBase, where we know what we are doing with the data and any guarantees needed.

          Personally I think #2 will be easier, but I'm willing to discuss. It might end up being a mix of things.

          Also, let's start with what currently is not working with HBase backed by S3 - what are the pain points we are trying to solve? That will help us direct the effort better. I'll help where I can with that list.

          zyork Zach York added a comment -

          Anyone interested in this want to review two related S3 issues? HBASE-21070 and HBASE-21098 (patch coming soon).

          Thanks!

          stack Michael Stack added a comment -

          +1 for #2

          apurtell Andrew Kyle Purtell added a comment - edited

          I don't want this to be considered "FSredo". We need improvements in branch-1 for today's production. I deliberately did not file this under the related FS redo JIRAs, and I am going to remove the tag. We can have #2, at least to some extent, without reinventing how we manage files in an incompatible way; for example, HBASE-20431. It is not going to be sufficient to commit something to master (aka 3.0) and call it a day.


          apurtell Andrew Kyle Purtell added a comment -

          Reiterating: the goal of this JIRA is stability on S3 for write workloads, all the way down through to a new 1.x minor release.

          stack Michael Stack added a comment -

          If for 1.x, ok to remove label.

          zyork Zach York added a comment -

          Andrew Kyle Purtell Makes sense. We'll tackle FS redo in a separate JIRA.

          Have you tested using a consistency layer such as S3Guard? That should handle most of the consistency guarantees, so you don't have RS aborts when you encounter inconsistencies after writing store files. Alternatively, you could add a retry in the failure logic instead of aborting the RS immediately when a file doesn't exist.

          FWIW, I have seen a large number of heavy write workloads working well with S3 provided you use a consistency layer (EMRFS Consistent View in my case), so that might be sufficient for your case until you can upgrade to a version that (hopefully) contains the FS redo work.
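          A minimal sketch of the retry idea (a hypothetical helper, not actual RegionServer code): ride out eventual consistency by retrying the existence check with backoff before treating the file as genuinely missing.

              // Hypothetical sketch of retry-before-abort; not actual RegionServer code.
              import java.io.FileNotFoundException;
              import java.io.IOException;
              import org.apache.hadoop.fs.FileStatus;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;

              public class EventualConsistencyRetry {
                static FileStatus statusWithRetry(FileSystem fs, Path path, int attempts,
                    long backoffMs) throws IOException, InterruptedException {
                  for (int i = 1; ; i++) {
                    try {
                      return fs.getFileStatus(path);  // throws FileNotFoundException if absent
                    } catch (FileNotFoundException e) {
                      if (i >= attempts) {
                        throw e;  // only now treat the file as genuinely missing
                      }
                      // A just-written object may not be visible yet on an eventually
                      // consistent store; back off and look again.
                      Thread.sleep(backoffMs * i);
                    }
                  }
                }
              }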

          apurtell Andrew Kyle Purtell added a comment - edited

          Zach York Yes, there are two improvements I've become aware of since then that I'd like to apply and then retest:

          1. HBASE-20723
          2. S3Guard

          Edit: It's been a while, but I also seem to remember some failures where a PUT-COPY failed server side and we didn't catch it, so committed store files went missing in a way that S3Guard wouldn't address.
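          One way to catch that class of failure would be a verify-after-commit check, sketched below. The helper and names are illustrative, not actual HBase code.

              // Hedged sketch of a verify-after-commit check; not actual HBase code.
              import java.io.IOException;
              import org.apache.hadoop.fs.FileStatus;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;

              public class VerifiedCommitSketch {
                static void renameAndVerify(FileSystem fs, Path src, Path dst,
                    long expectedLen) throws IOException {
                  if (!fs.rename(src, dst)) {
                    throw new IOException("rename returned false: " + src + " -> " + dst);
                  }
                  // A server-side copy can fail in ways the client misses; confirm
                  // the destination exists and has the length we wrote.
                  FileStatus st = fs.getFileStatus(dst);
                  if (st.getLen() != expectedLen) {
                    throw new IOException("Post-commit length mismatch at " + dst);
                  }
                }
              }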

          zyork Zach York added a comment -

          If you're interested in HBASE-20723, you could also apply/review HBASE-20734 (the real fix for the issue), but I know you were tracking that as well.

          I'm not sure what you mean by PUT-COPY; where is this happening? It's been a while since I looked at the actual operations being called (and they might be slightly different on our side). It seems that with HBase, PUT-COPYs wouldn't need to be used, but again I'm not sure what filesystem operation is being called here that implements a PUT-COPY under the hood.


          apurtell Andrew Kyle Purtell added a comment -

          Zach York S3A, Hadoop 2.9. Bottom line: I need to retest.

          stevel@apache.org Steve Loughran added a comment -

          BTW, HADOOP-15691 is my latest iteration of having each FS declare its capabilities. As I've noted at the end, as well as through a new interface, we could expose this as new config options you can look for in fsInstance.getConf().get("option"), provided the FS instances clone their supplied configs and then patch them. This would let you check what an FS offers.
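          A sketch of what that probe could look like from HBase's side, assuming a hypothetical capability key (the actual names would come from HADOOP-15691):

              // Sketch only: "fs.capability.atomic.rename" is a hypothetical key,
              // not a name defined by HADOOP-15691.
              import org.apache.hadoop.fs.FileSystem;

              public class CapabilityProbeSketch {
                static boolean supportsAtomicRename(FileSystem fs) {
                  // Assumes the FS cloned and patched its Configuration as described
                  // above; defaults to false when the option is absent.
                  return fs.getConf().getBoolean("fs.capability.atomic.rename", false);
                }
              }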

          W.r.t. S3Guard, you need to know what semantics you get. With S3Guard you get consistent listings, but rename is still sub-atomic.

          Thanks for promising to invite me to any discussions; as long as it's not via Amazon Chime or Skype for Business, I'm up for it.

          apurtell Andrew Kyle Purtell added a comment - edited

          With S3Guard you get consistent listings, but rename is still sub-atomic

          • Consistent listings: needed. Not using it was my mistake.
          • Atomic rename: not needed, provided we handle it ourselves.


          People

            Assignee: Unassigned
            Reporter: Andrew Kyle Purtell
