Details
- Type: Umbrella
- Status: Resolved
- Priority: Major
- Resolution: Implemented
Description
We can reasonably well support use cases on non-HDFS filesystems, like S3, where an external writer has loaded (and continues to load) HFiles via the bulk load mechanism, and we then serve a read-only workload at the HBase API.
Mixed or write-heavy workloads won't fare as well. In fact, data loss seems certain. The details depend on the specific filesystem, but all of the S3-backed Hadoop filesystems suffer from a couple of obvious problems, notably the lack of an atomic rename.
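To make the atomic-rename problem concrete, here is a minimal sketch of the commit-by-rename pattern HBase relies on (flushes and compactions write to a temporary location, then rename into place). The paths and class name are illustrative, not HBase's actual layout; the sketch runs against a local POSIX filesystem, where the move really is atomic:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative only: the commit-by-rename pattern, using java.nio on a
// local filesystem. Names (RenameCommit, region.tmp, hfile) are made up.
public class RenameCommit {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rename-commit");
        Path tmp = dir.resolve("region.tmp");
        Path committed = dir.resolve("hfile");

        // Step 1: write the full contents to a temporary file.
        Files.write(tmp, "row data".getBytes());

        // Step 2: commit with an atomic rename. On HDFS and POSIX
        // filesystems readers see either the old state or the new file,
        // never a partial one. S3-backed Hadoop filesystems implement
        // rename as a non-atomic copy-then-delete, so a failure mid-rename
        // can leave a partial copy, or both copies, visible.
        Files.move(tmp, committed, StandardCopyOption.ATOMIC_MOVE);

        System.out.println("committed=" + Files.exists(committed)
                + " tmpGone=" + Files.notExists(tmp));
    }
}
```

On S3 the equivalent "rename" is an object-by-object copy followed by deletes, which is exactly the window in which a crash corrupts the commit.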
This umbrella will serve to collect some related ideas for consideration.
Issue Links
- relates to HBASE-21070 SnapshotFileCache won't update for snapshots stored in S3 (Resolved)
1. Store commit transaction for filesystems that do not support an atomic rename (Closed, Unassigned)
Let me just state this to get it out of the way. As you can imagine, reading between the lines, the motivation to look at this where I work is the good probability that our storage stack will either utilize Amazon's S3 service "where applicable" or a compatible API analogue. Please don't take this to imply anything about business relationships, or not. Really, I would personally have no idea one way or the other.