  HBase / HBASE-19289

CommonFSUtils$StreamLacksCapabilityException: hflush when running test against hadoop3 beta1

    Details

    • Type: Test
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 2.0.0-beta-1
    • Component/s: None
    • Labels: None

      Description

      As of commit d8fb10c8329b19223c91d3cda6ef149382ad4ea0, I encountered the following exception when running a unit test against hadoop3 beta1:

      testRefreshStoreFiles(org.apache.hadoop.hbase.regionserver.TestHStore)  Time elapsed: 0.061 sec  <<< ERROR!
      java.io.IOException: cannot get log writer
      	at org.apache.hadoop.hbase.regionserver.TestHStore.initHRegion(TestHStore.java:215)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:220)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:195)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:190)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:185)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:179)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:173)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.testRefreshStoreFiles(TestHStore.java:962)
      Caused by: org.apache.hadoop.hbase.util.CommonFSUtils$StreamLacksCapabilityException: hflush
      	at org.apache.hadoop.hbase.regionserver.TestHStore.initHRegion(TestHStore.java:215)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:220)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:195)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:190)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:185)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:179)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:173)
      	at org.apache.hadoop.hbase.regionserver.TestHStore.testRefreshStoreFiles(TestHStore.java:962)
      
      1. HBASE-19289.v5.patch
        7 kB
        Mike Drob
      2. HBASE-19289.v4.patch
        8 kB
        Mike Drob
      3. HBASE-19289.v3.patch
        19 kB
        Mike Drob
      4. HBASE-19289.v2.patch
        10 kB
        Mike Drob
      5. HBASE-19289.patch
        2 kB
        Mike Drob
      6. 19289.v2.txt
        2 kB
        Ted Yu
      7. 19289.v1.txt
        2 kB
        Ted Yu

        Issue Links

          Activity

          busbey Sean Busbey added a comment -

          Sounds like LocalFileSystem doesn't support flush/sync. That's odd. Let's do a quick check if Hadoop can provide that and then update tests to deal with it as necessary.

          yuzhihong@gmail.com Ted Yu added a comment -

          Sean Busbey:
          To unblock testing against hadoop3, how about relaxing the following for unit tests?

              if (!(CommonFSUtils.hasCapability(output, "hflush"))) {
                throw new StreamLacksCapabilityException("hflush");
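The shape of the relaxation Ted is suggesting can be sketched in plain Java. This is illustrative only: the `CAPABILITY_ENFORCE_KEY` name and the `Set`-based stand-in for a stream's capabilities are assumptions for the sketch, not HBase's or Hadoop's actual API.

```java
import java.util.Set;

public class CapabilityCheckSketch {
  // Hypothetical config key for the proposed test-only relaxation.
  static final String CAPABILITY_ENFORCE_KEY = "hbase.stream.capability.enforce";

  /** Returns whether the stream supports hflush; throws only when enforcement is on. */
  static boolean checkHflush(Set<String> streamCapabilities, boolean enforce)
      throws Exception {
    if (!streamCapabilities.contains("hflush")) {
      if (enforce) {
        // Strict mode: fail hard, as in the stack trace above.
        throw new Exception("StreamLacksCapabilityException: hflush");
      }
      return false; // relaxed mode: warn-and-continue instead of failing
    }
    return true;
  }

  public static void main(String[] args) throws Exception {
    // A LocalFileSystem-like stream: no hflush capability declared.
    Set<String> localFsCapabilities = Set.of("unbuffer");
    // Enforcement off (the proposed test relaxation): no exception.
    System.out.println(checkHflush(localFsCapabilities, false));
    // Enforcement on: throws, matching the test failure in the description.
    try {
      checkHflush(localFsCapabilities, true);
    } catch (Exception e) {
      System.out.println(e.getMessage());
    }
  }
}
```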
          
          busbey Sean Busbey added a comment -

          How would you propose relaxing it?

          Two solutions I see:

          • updating LocalFileSystem to support hflush/hsync
          • telling CommonFSUtils to ignore the check on LocalFileSystem

          The former is probably going to be easier due to the way we wrap FileSystems.

          yuzhihong@gmail.com Ted Yu added a comment - - edited

          updating LocalFileSystem to support hflush/hsync

          This would be done in Hadoop, right?
          Consider logging a HADOOP issue.

          busbey Sean Busbey added a comment -

          The more I think about this, the more I think changing LocalFileSystem needs to be the answer, unless we want to expressly tell folks not to run HBase on top of LocalFileSystem, in which case why are we running tests against it in the first place?

          yuzhihong@gmail.com Ted Yu added a comment -

          Logged HADOOP-15051

          yuzhihong@gmail.com Ted Yu added a comment -

          Patch v1 uses approach #2.
          This would allow me to run test suite against hadoop3.

          stevel@apache.org Steve Loughran added a comment -

          Closed HADOOP-15051 as wontfix. LocalFS output streams don't declare their support for hflush/sync for the following reason, as covered in HADOOP-13327 (outstanding, reviews welcome)

          Output streams which do not implement the flush/persistence semantics of hflush/hsync MUST NOT declare that their streams have that capability.

          LocalFileSystem is a subclass of ChecksumFileSystem; ChecksumFileSystem output streams don't implement hflush/hsync, therefore it's the correct behaviour in the Hadoop code.

          If HBase requires the methods for the correct persistence of its data, then it cannot safely use localFS as the destination of its output. Its check is therefore also the correct behaviour.

          In which case, "expressly tell folks not to run HBase on top of LocalFileSystem," is the correct action on your part. People must not be using the local FS as a direct destination for HBase output.

          busbey Sean Busbey added a comment -

          Thanks for weighing in Steve Loughran. This helps a ton; I had overlooked that limitation of ChecksumFileSystem.

          Would updating RawLocalFileSystem to create output streams that do implement hflush/hsync be viable? I think long term HBase needs a FileSystem implementation that can be used with the local disks on a single node. I'd rather it get maintained in a place where folks other than us can use it and help with maintenance.

          busbey Sean Busbey added a comment -

          Ted Yu v1 is expressly unsafe based on the discussion so far, please don't commit it.

          How onerous is converting the failed tests to use MiniDFSCluster instead of LocalFileSystem?

          Does this mean we're also broken in standalone mode?

          stevel@apache.org Steve Loughran added a comment -

          If people really want hbase -> file://

          • they'd need a distributed file:// or some shared NFS server
          • it'd presumably need its own RAID > 0 to do checksumming; so checksum fs is moot

          I'd look at seeing whether checksumfs could actually bypass its checksum, say if we set the property "bytes per checksum == 0" as the secret "no, turn me off" switch. But people would probably then use it for performance and then be upset when all their data got corrupted without anything noticing. It's too critical a layer under HDFS really.

          I was thinking about what if we added a raw:// URL which bonded directly to raw local fs, but RawLocalFileSystem has an expectation that file:// is its schema and returns it in getURI(), so forcing you back to ChecksumFS

          I believe the way to do this is

          • subclass RawLocalFileSystem
          • give it a new schema, like say "raw"
          • have it remember its URI in initialize() and return it in getURI()
          • register it (statically, dynamically)
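The steps above can be sketched in self-contained Java. A real implementation would extend org.apache.hadoop.fs.RawLocalFileSystem; here a minimal stub base class stands in so the shape of the fix is visible without Hadoop on the classpath. All names are illustrative.

```java
import java.net.URI;

public class RawSchemeSketch {
  /** Stub for the relevant slice of Hadoop's FileSystem contract. */
  static class StubRawLocalFileSystem {
    public URI getUri() { return URI.create("file:///"); } // hardcoded scheme: the problem
    public void initialize(URI name) { /* base class ignores the URI */ }
  }

  /** Steps 1-3: subclass, accept a new "raw" scheme, remember the URI from initialize(). */
  static class RawFs extends StubRawLocalFileSystem {
    private URI uri;

    @Override
    public void initialize(URI name) {
      super.initialize(name);
      this.uri = name; // remember the URI instead of hardcoding file://
    }

    @Override
    public URI getUri() { return uri; } // return the remembered raw:// URI
    // Step 4 (registration) would map the "raw" scheme to this class, e.g. via a
    // fs.raw.impl configuration entry or ServiceLoader metadata in real Hadoop.
  }

  public static void main(String[] args) {
    RawFs fs = new RawFs();
    fs.initialize(URI.create("raw:///"));
    System.out.println(fs.getUri()); // now reports raw:/// instead of file:///
  }
}
```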
          stevel@apache.org Steve Loughran added a comment -

          ps: Are people using HBase against file:// today? If so, they've not been getting the persistence/durability HBase needs. Tell them to stop it.

          mdrob Mike Drob added a comment -

          I can't get this to fail locally... Did somebody fix it in a separate issue?

          My command is

          mvn clean test -Dtest=TestHStore\#testRefreshStoreFiles -Dhadoop-profile=3.0 -Dhadoop-three.version=3.0.0-beta1 -pl hbase-server -am
          
          busbey Sean Busbey added a comment -

          ps: Are people using HBase against file:// today? If so, they've not been getting the persistence/durability HBase needs. Tell them to stop it.

          I believe we do this in standalone mode.

          yuzhihong@gmail.com Ted Yu added a comment -

          Mike:
          You may need the patch from HBASE-19422.
          Here is the command I used:

           mvn test -Dtest=TestHStore\#testRefreshStoreFiles -Dhadoop-profile=3.0 -Dhadoop-three.version=3.0.0-beta1 -Dhadoop-two.version=3.0.0-beta1 -pl hbase-server -am
          
          [ERROR] testRefreshStoreFiles(org.apache.hadoop.hbase.regionserver.TestHStore)  Time elapsed: 0.718 s  <<< ERROR!
          java.io.IOException: cannot get log writer
          	at org.apache.hadoop.hbase.regionserver.TestHStore.initHRegion(TestHStore.java:215)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:220)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:195)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:190)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:185)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:179)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:173)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.testRefreshStoreFiles(TestHStore.java:962)
          Caused by: org.apache.hadoop.hbase.util.CommonFSUtils$StreamLacksCapabilityException: hflush
          	at org.apache.hadoop.hbase.regionserver.TestHStore.initHRegion(TestHStore.java:215)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:220)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:195)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:190)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:185)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:179)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.init(TestHStore.java:173)
          	at org.apache.hadoop.hbase.regionserver.TestHStore.testRefreshStoreFiles(TestHStore.java:962)
          
          mdrob Mike Drob added a comment -

          Cool, I get this to fail locally now. I tried setting the file system to raw:///, then I realized that you were suggesting that as a future fix, not something that's available now, Steve Loughran.

          At least for our tests here, since we're testing higher level functionality in HBase we shouldn't need to care about the underlying FS implementation - there is an expectation that the local fs (ext3 or whatever) supports flush and sync, and beyond that there's no reason to start a full DFS cluster - this will add a lot of time to the test suite if it's not strictly necessary.

          Anyway, I'm attaching a patch that will start a DFS cluster for tests where we need to use this. So far I've only done it for one test, but I know there are lots of failures out there... Attaching the WIP for initial review of the approach. Josh Elser, Ted Yu - WDYT?

          mdrob Mike Drob added a comment -

          HBASE-19369 is related to it, but I don't think it will fix the cases where we run on LocalFS

          mdrob Mike Drob added a comment -

          My proposed solution works for hbase-server, but we also have failing tests in the same vein in hbase-procedure. I would prefer not to introduce a dependency on hdfs that low, but maybe there's no other way around it?

          elserj Josh Elser added a comment -

          unless we want to expressly tell folks not to run HBase on top of LocalFileSystem, in which case why are we running tests against it in the first place?

          ps: Are people using HBase against file:// today? If so, they've not been getting the persistence/durability HBase needs. Tell them to stop it.

          Is this solvable by a flag that says "yes I acknowledge that I may lose data"? I think we're well aware that we may experience "data loss" with XSumFileSystem and this is OK because we just don't care (because it's a short lived test). We don't want to wait the extra 5+secs for a full MiniDFSCluster.

          Anyway, I'm attaching a patch that will start a DFS cluster for tests where we need to use this. So far I've only done it for one test, but I know there are lots of failures out there

          Your approach is a nice idea as well, but I'm worried about the tail in tracking all of these down.

          yuzhihong@gmail.com Ted Yu added a comment -

          I tried the patch on both Linux and Mac.
          I got the same error:

          [ERROR] org.apache.hadoop.hbase.regionserver.TestHStore  Time elapsed: 2.663 s  <<< ERROR!
          java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
          
          mvn clean install -DskipTests -Dhadoop.profile=3.0 -Dhadoop-three.version=3.0.0-beta1 -Dhadoop-two.version=3.0.0-beta1
          mvn test -Dtest=TestHStore\#testRefreshStoreFiles -Dhadoop.profile=3.0 -Dhadoop-three.version=3.0.0-beta1 -Dhadoop-two.version=3.0.0-beta1 -pl hbase-server -am
          
          busbey Sean Busbey added a comment -

          unless we want to expressly tell folks not to run HBase on top of LocalFileSystem, in which case why are we running tests against it in the first place?

          ps: Are people using HBase against file:// today? If so, they've not been getting the persistence/durability HBase needs. Tell them to stop it.

          Is this solvable by a flag that says "yes I acknowledge that I may lose data"? I think we're well aware that we may experience "data loss" with XSumFileSystem and this is OK because we just don't care (because it's a short lived test). We don't want to wait the extra 5+secs for a full MiniDFSCluster.

          That doesn't work for our reliance on LocalFileSystem for standalone mode (ref the quickstart guide). We don't actually call sync / flush or anything like that for the local OS, so the WAL is essentially useless.

          elserj Josh Elser added a comment -

          That doesn't work for our reliance on LocalFileSystem for standalone mode (ref the quickstart guide). We don't actually call sync / flush or anything like that for the local OS, so the WAL is essentially useless.

          Understood. I still think it's something that's easily caveat'ed: "The use of HDFS (or a comparable FileSystem implementation) is required to ensure that HBase does not experience data loss. This quickstart example may lose data in unexpected-failure conditions – this should only be used for testing/investigation purposes, never in a production environment."

          stevel@apache.org Steve Loughran added a comment -

          looking at the patch

          • it can take time to start and stop the server; good to see you making this a class rule.
          • failures in stop should be caught & logged, in case raising them could hide the underlying exception triggering a test failure. (I don't know enough about custom rules here to know for sure, just based on test teardown method experience)
          mdrob Mike Drob added a comment -

          Josh Elser - let me see if I understand your proposed solution.

          1. Introduce new flag to not fail on the StreamCapability flush/sync checks (but still print appropriate warnings)
          2. Apply this flag to unit tests that don't use an underlying DFS cluster
          3. Apply this flag to standalone clusters
          4. Update documentation to include blatant caveats about data loss in standalone cluster

          That doesn't seem right to me. We have a bunch of tests that check for data correctness, that we're apparently running with a file system that does not offer the guarantees that we expect. This is like the difference between SloppyMath and StrictMath - one is fast; the other is accurate. Do we gain any confidence from running tests quickly, seeing them pass, and then discovering that we don't have the coverage we thought we did at a later date?

          elserj Josh Elser added a comment -

          That doesn't seem right to me. We have a bunch of tests that check for data correctness, that we're apparently running with a file system that does not offer the guarantees that we expect.

          Sorry, the corollary got lost. For tests in which we're explicitly testing for correctness in the face of failure (e.g. CM doing CM things), we should definitely use something which has the guarantees we require.

          My assertion was that not all tests are testing for correctness in the face of failure – in these cases, MiniDFS would be overkill. This is still based on my previous understanding that XSumFileSystem does not have inherent issues under normal circumstances (e.g. those in which we're not abruptly killing/interrupting things). Maybe this has changed since Hadoop2 and I'm coming from bad knowledge?

          mdrob Mike Drob added a comment -

          patch v2 attached.

          Here's the start of the approach suggested by Josh Elser. I'm terrified of introducing this because I have visions that somebody is going to accidentally set this property and then end up losing their data, but there's only so much that can be done to prevent users from shooting themselves in the foot.

          I also am not quite sure how we're going to automatically activate this flag for standalone mode - AFAICT anybody starting HBase on local file system is going to run straight into a brick wall.

          stevel@apache.org Steve Loughran added a comment -

          What about giving the property some name to make clear it's experimental/risky? "hbase.experimental.stream.capability.enforce.disabled"

          Then if people set it, well, "told you so"

          mdrob Mike Drob added a comment -

          hbase.unsafe.stream.capability.enforce.disabled?

          yuzhihong@gmail.com Ted Yu added a comment -
               if (!(CommonFSUtils.hasCapability(newStream, durability))) {
          -      throw new IllegalStateException("The procedure WAL relies on the ability to " + durability +
          +      if (fs.getConf().getBoolean("hbase.stream.capability.enforce", true))

          Since the value of the new config is unlikely to change during the lifetime of the server, you can save the value in an instance variable.
          The order of the if (!(CommonFSUtils.hasCapability( check and the config check should be reversed.
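For illustration, the reordered check with the flag cached in a field might look like the following. This is a sketch, not the actual patch: the class name WalCapabilityGuard and the accepts method are hypothetical, and the boolean parameter stands in for the result of CommonFSUtils.hasCapability(newStream, durability).

```java
public class WalCapabilityGuard {

    // Cached once at construction: the config value is unlikely to change
    // during the lifetime of the server, so there is no need to re-read it.
    private final boolean enforce;

    WalCapabilityGuard(boolean enforce) {
        this.enforce = enforce;
    }

    // Config check first, capability probe second -- the reversed order the
    // review asks for. `streamHasCapability` stands in for
    // CommonFSUtils.hasCapability(newStream, durability).
    boolean accepts(boolean streamHasCapability) {
        if (!enforce) {
            return true; // enforcement disabled: skip the capability check
        }
        return streamHasCapability;
    }
}
```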

          mdrob Mike Drob added a comment -

          v3:

          • pulls the config key out into a constant
          • check for enforcement enabled before checking for stream capabilities
          yuzhihong@gmail.com Ted Yu added a comment -

          lgtm

          mdrob Mike Drob added a comment -

          Setting a property on the conf doesn't work for the tests in hbase-server because the FS is already created and then gets pulled out of the cache in FSHLogProvider.createWAL(); we need to do the DFSCluster approach here...

          v4 coming soon.

          yuzhihong@gmail.com Ted Yu added a comment -

          How about putting the config in:
          ./hbase-server/src/test/resources/hbase-site.xml

          This way, you don't need to modify individual tests.
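As a sketch, the test-only override in ./hbase-server/src/test/resources/hbase-site.xml might look like this. The property name here follows Ted's earlier diff; the exact key (e.g. the hbase.unsafe variant proposed above) was still under discussion in this thread, so treat the name as illustrative.

```xml
<property>
  <name>hbase.stream.capability.enforce</name>
  <value>false</value>
  <description>Controls whether HBase will require stream capabilities
    (hflush) in tests using the LocalFileSystem class.</description>
</property>
```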

          mdrob Mike Drob added a comment -

          v4: put the property in test/resources/hbase-site.xml

          mdrob Mike Drob added a comment -

          Thanks for the suggestion, Ted Yu - this is a much simpler solution.

          elserj Josh Elser added a comment -

          put the property in test/resources/hbase-site.xml

          That's a nice trick!

          <description>We don't need to enforce stream capabilities in tests using local FS</description>
          

          How about instead: "Controls whether or not HBase will require stream capabilities (hflush) in tests using the LocalFileSystem class".

          Nit: TestHStore.java has a whitespace-only change.

          AFAICT anybody starting HBase on local file system is going to run straight into a brick wall.

          Need to return to this too, eh? The book does have instructions for users to create an hbase-site.xml – is updating this sufficient? A release note is obviously good to warn users. Is the error message a user would find if they didn't set this property sufficient to lead them to somewhere that will clearly show them how to fix the problem (e.g. via a web-search)?

          mdrob Mike Drob added a comment -

          How about instead: "Controls whether or not HBase will require stream capabilities (hflush) in tests using the LocalFileSystem class".

          I'll wordsmith this a bit.

          Nit: TestHStore.java has a whitespace-only change.

          Fixed.

          Do you mind if we handle the bit about standalone cluster in a follow-on issue? Currently we're blocking tests, and while that should definitely be a release blocker I'm not sure it needs to be addressed with equal urgency.

          elserj Josh Elser added a comment -

          Do you mind if we handle the bit about standalone cluster in a follow on issue? Currently we're blocking tests, and while that should definitely be a release blocker I'm not sure it needs to be addressed with equal urgency.

          Not at all. Just wanted to make sure we didn't miss this aspect.

          elserj Josh Elser added a comment -

          LGTM. I assume you're running these tests successfully using H3 on your local machine (as I don't think PreCommit is, right?) – if that's the case, push it!


            People

            • Assignee:
              mdrob Mike Drob
              Reporter:
              yuzhihong@gmail.com Ted Yu