Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Fix Version/s: 1.3.0, 1.4.0, 1.2.3, 1.1.7, 2.0.0
- Component/s: None
- Hadoop Flags: Reviewed
Description
In Compactor#performCompaction:
do {
  hasMore = scanner.next(cells, scannerContext);
  // output to writer:
  for (Cell c : cells) {
    if (cleanSeqId && c.getSequenceId() <= smallestReadPoint) {
      CellUtil.setSequenceId(c, 0);
    }
    writer.append(c);
  }
  cells.clear();
} while (hasMore);
scanner.next returns at most "hbase.hstore.compaction.kv.max" KVs per call, and the last cell returned is still referenced by StoreScanner.prevCell. So if cleanSeqId zeroes that cell's seqId while the scanner still holds it, the next scanner.next call may fail in StoreScanner.checkScanOrder and throw an exception that brings the regionserver down.
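For context, the order check that fires is roughly as follows (a simplified sketch modeled on StoreScanner.checkScanOrder; the exact code may differ per branch). Because the cell comparator breaks ties on sequence id, zeroing prevCell's seqId in place can make it compare greater than the next cell with the same key, tripping the assert:

protected void checkScanOrder(Cell prevKV, Cell kv, CellComparator comparator) throws IOException {
  // The scan heap must return cells in non-decreasing order; ties on the key are
  // broken by seqId, so a prevKV whose seqId was mutated to 0 can violate this.
  assert prevKV == null || comparator == null || comparator.compare(prevKV, kv) <= 0
      : "Key " + prevKV + " followed by a smaller key " + kv;
}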
Attachments
- HBASE-16931_master_v2.patch (8 kB, Lijin Bin)
- HBASE-16931_master_v3.patch (9 kB, Lijin Bin)
- HBASE-16931_master_v4.patch (9 kB, Lijin Bin)
- HBASE-16931_master_v5.patch (8 kB, Lijin Bin)
- HBASE-16931.branch-1.patch (8 kB, Lijin Bin)
- HBASE-16931.branch-1.v2.patch (8 kB, Lijin Bin)
- HBASE-16931-master.patch (7 kB, Lijin Bin)
Activity
export HBASE_OPTS="-ea -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC"
We enable assertions (-ea) in production, which is why the failed assert in checkScanOrder takes the regionserver down.
When I checked the HFile data, one HFile contains "K: 067:532053748281/a:priceTrend/1476280811000/Put/vlen=4/seqid=11993975 V: 39.0" and another contains "K: 067:532053748281/a:priceTrend/1476280811000/Put/vlen=0/seqid=12076597 V:", so there is actually no cell with key "067:532053748281/a:priceTrend/1476280811000/Put/vlen=0/seqid=0" in any HFile.
And this compaction exception brought down 100+ regionservers in one of our production clusters.
Oh, nice debugging. It must have been hard to track this down.
On that patch: what you are doing now is change the cell's seqId to zero, pass it to the writer, and after the for loop set the last cell's seqId back to its original value. This can still fail in one case. Within the loop there is now a shipped() call. We use shipped() to do in-between cleanup in the readers, but the writers are also implemented as ShipperListener, and this will clone the last cells they refer to. This is needed because when the blocks for a compaction come from an L2 shared-memory cache, the shipped() call may return the block to the block cache and that memory can get released; if the lastCells still refer to that block area, our cell data can get corrupted. So before returning the blocks we clone the cells (note that we clone only the needed part). FYI, I can foresee another bug here too: we don't clone the seqId, so all cloned cells will report a seqId of 0! That is another issue to solve. So, having said that, when you come out of the for loop, the prevCells that other parts of the code refer to may not be the same objects as those in the cells list.
cc ram_krish
So the way to solve this is that we should be able to pass some info to the writer on the write call telling it to persist this cell's seqId as zero, with no need to call setSequenceId on individual cells at all. wdyt?
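A minimal sketch of that idea, using a hypothetical helper (not the actual HBase writer API): the decision of which seqId to persist moves to write time, so no shared Cell needs to be mutated.

// Hypothetical helper only -- illustrates deciding the persisted seqId at write time
// instead of calling setSequenceId on the shared Cell.
static long seqIdToPersist(Cell c, boolean cleanSeqId, long smallestReadPoint) {
  return (cleanSeqId && c.getSequenceId() <= smallestReadPoint) ? 0L : c.getSequenceId();
}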
I just changed the subject of the JIRA and upped the priority to Critical, as this is a very bad bug when it occurs. The chance of it occurring may be rare, but still!
Checking this patch.
The fix looks ok, but one question: we call shipped() after a batch of cells is written to avoid OOME, because during compaction we would otherwise hold all the blocks until the compaction completes. But because we call shipped(), there is a chance the blocks are cleared while the write flow still holds on to 'lastCell' etc., so those could get corrupted when the block is released.
So we added the beforeShipped() call. Now, even before this bug existed in the read path, wouldn't the write path end up with the same problem?
The lastCell in the write path just before cleanSeqId starts happening will have a seqId, but the next cell's will become 0. So I believe it is going to be a problem in the Writer#checkKey() method.
One more question: can't we set the seqId back immediately after append()?
Ok. So I just saw this in the Writer:
if (lastCell != null) {
  int keyComp = comparator.compareKeyIgnoresMvcc(lastCell, cell);
  if (keyComp > 0) {
    throw new IOException("Added a key not lexically larger than"
        + " previous. Current cell = " + cell + ", lastCell = " + lastCell);
  } else if (keyComp == 0) {
    isDuplicateKey = true;
  }
}
So we ignore the mvcc there; this issue will not happen in the writer.
And if we set the sequence id back before ((ShipperListener) writer).beforeShipped() is called, would that fix this bug?
You mean call that before calling beforeShipped() as well? Yes, that seems like it will work. I'm fine with that. Please add fat comments there explaining what we are doing, and in the other places too. This is a very bad bug and this is how we solve it; otherwise people will later wonder what nonsense we are doing.
Yes, I missed checking the writer, thanks very much for the confirmation.
Sure. Please post a new patch for the affected versions. Once QA is ok, we can commit.
There is a small issue, no? Within the if we set lastCleanCell and its seqId, and reset it back based on lastCleanCell != null. This might set the seqId on a wrong cell, one on which the set-to-0 never happened. We need an else branch that sets lastCleanCell = null there, no? We also need a better name for this variable.
IMO changing the seqId of a cell that is still referenced elsewhere is dangerous; cloning the cell for writing out might be safer, although more garbage would be generated.
And please move the test case into TestCompaction like I mentioned in our local review, aoxiang.
I agree that avoiding the set-and-reset of the cell seqId is the best way, and cloning is not acceptable IMO. We should be able to tell the writer whether it needs to use the seqId of the cell or zero, but that would be a bigger change, so I'm fine with the current approach. Yes, I was about to ask: for one test do we need to spin up a cluster? If possible, avoid it. But I like that you added a UT for this so that we won't have a regression here in the future. It seems v3 also doesn't have the change to reset the local variable to null.
Uploaded HBASE-16931_master_v4.patch; this patch resets the local variable to null.
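For reference, the approach being discussed looks roughly like this in the compaction loop (an illustrative sketch; variable names and exact structure may differ from the committed patch):

Cell lastCleanCell = null;
long lastCleanCellSeqId = 0;
for (Cell c : cells) {
  if (cleanSeqId && c.getSequenceId() <= smallestReadPoint) {
    // Remember the cell and its original seqId before zeroing it for the writer.
    lastCleanCell = c;
    lastCleanCellSeqId = c.getSequenceId();
    CellUtil.setSequenceId(c, 0);
  } else {
    // Reset, so we never restore a seqId on a cell that was not zeroed (see review above).
    lastCleanCell = null;
    lastCleanCellSeqId = 0;
  }
  writer.append(c);
}
if (lastCleanCell != null) {
  // Restore the original seqId before beforeShipped()/shipped() so the cell still
  // referenced by StoreScanner.prevCell keeps its real seqId.
  CellUtil.setSequenceId(lastCleanCell, lastCleanCellSeqId);
}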
I agree that avoiding the set-and-reset of the cell seqId is the best way, and cloning is not acceptable IMO. We should be able to tell the writer whether it needs to use the seqId of the cell or zero, but that would be a bigger change, so I'm fine with the current approach.
Ok, we'll also use the current version as a hot fix online, so I don't have much right to argue here I guess (smile).
Yes, I was about to ask: for one test do we need to spin up a cluster? If possible, avoid it. But I like that you added a UT for this so that we won't have a regression here in the future.
Yes, adding a UT case to cover a new problem is required on our side (smile). And I still insist we move the case into TestCompaction, guys.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 17s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
+1 | mvninstall | 3m 16s | master passed |
+1 | compile | 0m 39s | master passed |
+1 | checkstyle | 0m 50s | master passed |
+1 | mvneclipse | 0m 15s | master passed |
-1 | findbugs | 1m 52s | hbase-server in master has 1 extant Findbugs warnings. |
+1 | javadoc | 0m 31s | master passed |
+1 | mvninstall | 0m 49s | the patch passed |
+1 | compile | 0m 37s | the patch passed |
+1 | javac | 0m 37s | the patch passed |
+1 | checkstyle | 0m 47s | the patch passed |
+1 | mvneclipse | 0m 15s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | hadoopcheck | 29m 17s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha1. |
+1 | hbaseprotoc | 0m 12s | the patch passed |
+1 | findbugs | 1m 48s | the patch passed |
+1 | javadoc | 0m 26s | the patch passed |
-1 | unit | 83m 52s | hbase-server in the patch failed. |
+1 | asflicense | 0m 15s | The patch does not generate ASF License warnings. |
126m 16s |
Reason | Tests |
---|---|
Timed out junit tests | org.apache.hadoop.hbase.client.TestFromClientSide |
org.apache.hadoop.hbase.client.TestReplicaWithCluster | |
org.apache.hadoop.hbase.client.TestHCM | |
org.apache.hadoop.hbase.client.TestMobCloneSnapshotFromClient |
Subsystem | Report/Notes |
---|---|
Docker | Client=1.11.2 Server=1.11.2 Image:yetus/hbase:7bda515 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12834889/HBASE-16931_master_v3.patch |
JIRA Issue | |
Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
uname | Linux 5e03579f4917 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
git revision | master / 187ff19 |
Default Java | 1.8.0_101 |
findbugs | v3.0.0 |
findbugs | https://builds.apache.org/job/PreCommit-HBASE-Build/4155/artifact/patchprocess/branch-findbugs-hbase-server-warnings.html |
unit | https://builds.apache.org/job/PreCommit-HBASE-Build/4155/artifact/patchprocess/patch-unit-hbase-server.txt |
unit test logs | https://builds.apache.org/job/PreCommit-HBASE-Build/4155/artifact/patchprocess/patch-unit-hbase-server.txt |
Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/4155/testReport/ |
modules | C: hbase-server U: hbase-server |
Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/4155/console |
Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |
This message was automatically generated.
The failed cases are known to always time out and are unrelated to the patch here. Committing the patch with 3 binding +1s; thanks all for the review, anoop.hbase, ram_krish, tedyu!
Oops, on second glance the previous HadoopQA report was for patch v3; will wait for the QA reports for v5 and branch-1 before committing the patch.
aoxiang, please update the branch-1 patch to merge the UT case into TestCompaction, thanks.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 13s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
+1 | mvninstall | 3m 21s | master passed |
+1 | compile | 0m 36s | master passed |
+1 | checkstyle | 0m 46s | master passed |
+1 | mvneclipse | 0m 14s | master passed |
-1 | findbugs | 1m 44s | hbase-server in master has 1 extant Findbugs warnings. |
+1 | javadoc | 0m 26s | master passed |
+1 | mvninstall | 0m 46s | the patch passed |
+1 | compile | 0m 36s | the patch passed |
+1 | javac | 0m 36s | the patch passed |
+1 | checkstyle | 0m 45s | the patch passed |
+1 | mvneclipse | 0m 14s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | hadoopcheck | 27m 25s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha1. |
+1 | hbaseprotoc | 0m 12s | the patch passed |
+1 | findbugs | 1m 49s | the patch passed |
+1 | javadoc | 0m 25s | the patch passed |
-1 | unit | 79m 19s | hbase-server in the patch failed. |
+1 | asflicense | 0m 15s | The patch does not generate ASF License warnings. |
119m 26s |
Reason | Tests |
---|---|
Timed out junit tests | org.apache.hadoop.hbase.master.procedure.TestTableDescriptorModificationFromClient |
org.apache.hadoop.hbase.master.procedure.TestDeleteColumnFamilyProcedure | |
org.apache.hadoop.hbase.master.procedure.TestAddColumnFamilyProcedure |
Subsystem | Report/Notes |
---|---|
Docker | Client=1.11.2 Server=1.11.2 Image:yetus/hbase:7bda515 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12834902/HBASE-16931_master_v5.patch |
JIRA Issue | |
Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
uname | Linux 06f275318878 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
git revision | master / 97cb1d7 |
Default Java | 1.8.0_101 |
findbugs | v3.0.0 |
findbugs | https://builds.apache.org/job/PreCommit-HBASE-Build/4158/artifact/patchprocess/branch-findbugs-hbase-server-warnings.html |
unit | https://builds.apache.org/job/PreCommit-HBASE-Build/4158/artifact/patchprocess/patch-unit-hbase-server.txt |
unit test logs | https://builds.apache.org/job/PreCommit-HBASE-Build/4158/artifact/patchprocess/patch-unit-hbase-server.txt |
Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/4158/testReport/ |
modules | C: hbase-server U: hbase-server |
Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/4158/console |
Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |
This message was automatically generated.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 17s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
+1 | mvninstall | 3m 34s | master passed |
+1 | compile | 0m 49s | master passed |
+1 | checkstyle | 0m 54s | master passed |
+1 | mvneclipse | 0m 17s | master passed |
-1 | findbugs | 2m 16s | hbase-server in master has 1 extant Findbugs warnings. |
+1 | javadoc | 0m 36s | master passed |
+1 | mvninstall | 0m 55s | the patch passed |
+1 | compile | 0m 45s | the patch passed |
+1 | javac | 0m 45s | the patch passed |
+1 | checkstyle | 0m 50s | the patch passed |
+1 | mvneclipse | 0m 16s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | hadoopcheck | 33m 45s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha1. |
+1 | hbaseprotoc | 0m 14s | the patch passed |
+1 | findbugs | 2m 11s | the patch passed |
+1 | javadoc | 0m 31s | the patch passed |
-1 | unit | 84m 37s | hbase-server in the patch failed. |
+1 | asflicense | 0m 15s | The patch does not generate ASF License warnings. |
133m 25s |
Reason | Tests |
---|---|
Timed out junit tests | org.apache.hadoop.hbase.client.TestFromClientSide |
org.apache.hadoop.hbase.client.TestReplicaWithCluster | |
org.apache.hadoop.hbase.client.TestHCM | |
org.apache.hadoop.hbase.client.TestMobCloneSnapshotFromClient | |
org.apache.hadoop.hbase.client.TestMobSnapshotCloneIndependence |
Subsystem | Report/Notes |
---|---|
Docker | Client=1.11.2 Server=1.11.2 Image:yetus/hbase:7bda515 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12834900/HBASE-16931_master_v4.patch |
JIRA Issue | |
Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
uname | Linux f7fc783d6af3 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh |
git revision | master / 187ff19 |
Default Java | 1.8.0_101 |
findbugs | v3.0.0 |
findbugs | https://builds.apache.org/job/PreCommit-HBASE-Build/4157/artifact/patchprocess/branch-findbugs-hbase-server-warnings.html |
unit | https://builds.apache.org/job/PreCommit-HBASE-Build/4157/artifact/patchprocess/patch-unit-hbase-server.txt |
unit test logs | https://builds.apache.org/job/PreCommit-HBASE-Build/4157/artifact/patchprocess/patch-unit-hbase-server.txt |
Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/4157/testReport/ |
modules | C: hbase-server U: hbase-server |
Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/4157/console |
Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |
This message was automatically generated.
Checked the UT results: all 3 cases failed because of OOME and are unrelated to the patch here.
Running org.apache.hadoop.hbase.master.procedure.TestTableDescriptorModificationFromClient
Running org.apache.hadoop.hbase.master.procedure.TestDeleteColumnFamilyProcedure
Exception in thread "Thread-2421"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-2415"
Running org.apache.hadoop.hbase.master.procedure.TestAddColumnFamilyProcedure
Exception in thread "Thread-2415" java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3332)
	at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
	at java.lang.StringBuffer.append(StringBuffer.java:369)
	at java.io.BufferedReader.readLine(BufferedReader.java:358)
	at java.io.BufferedReader.readLine(BufferedReader.java:389)
	at org.apache.maven.surefire.shade.org.apache.maven.shared.utils.cli.StreamPumper.run(StreamPumper.java:66)
Exception in thread "Thread-2419" java.lang.OutOfMemoryError: Java heap space
Exception in thread "Thread-2423" java.lang.OutOfMemoryError: Java heap space
Exception in thread "Thread-2425" java.lang.OutOfMemoryError: Java heap space
And confirmed all could pass locally.
-1 overall |
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
0 | reexec | 0m 18s | Docker mode activated. |
+1 | @author | 0m 0s | The patch does not contain any @author tags. |
+1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
+1 | mvninstall | 2m 3s | branch-1 passed |
+1 | compile | 0m 32s | branch-1 passed with JDK v1.8.0_101 |
+1 | compile | 0m 35s | branch-1 passed with JDK v1.7.0_80 |
+1 | checkstyle | 0m 55s | branch-1 passed |
+1 | mvneclipse | 0m 17s | branch-1 passed |
-1 | findbugs | 1m 55s | hbase-server in branch-1 has 2 extant Findbugs warnings. |
+1 | javadoc | 0m 25s | branch-1 passed with JDK v1.8.0_101 |
+1 | javadoc | 0m 33s | branch-1 passed with JDK v1.7.0_80 |
+1 | mvninstall | 0m 47s | the patch passed |
+1 | compile | 0m 31s | the patch passed with JDK v1.8.0_101 |
+1 | javac | 0m 31s | the patch passed |
+1 | compile | 0m 34s | the patch passed with JDK v1.7.0_80 |
+1 | javac | 0m 34s | the patch passed |
+1 | checkstyle | 0m 56s | the patch passed |
+1 | mvneclipse | 0m 16s | the patch passed |
+1 | whitespace | 0m 0s | The patch has no whitespace issues. |
+1 | hadoopcheck | 16m 51s | The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. |
+1 | hbaseprotoc | 0m 16s | the patch passed |
+1 | findbugs | 2m 8s | the patch passed |
+1 | javadoc | 0m 23s | the patch passed with JDK v1.8.0_101 |
+1 | javadoc | 0m 33s | the patch passed with JDK v1.7.0_80 |
-1 | unit | 84m 29s | hbase-server in the patch failed. |
+1 | asflicense | 0m 17s | The patch does not generate ASF License warnings. |
116m 4s |
Reason | Tests |
---|---|
Failed junit tests | hadoop.hbase.mapreduce.TestMultiTableSnapshotInputFormat |
Timed out junit tests | org.apache.hadoop.hbase.TestHBaseOnOtherDfsCluster |
org.apache.hadoop.hbase.filter.TestFuzzyRowFilterEndToEnd | |
org.apache.hadoop.hbase.tool.TestCanaryTool |
Subsystem | Report/Notes |
---|---|
Docker | Client=1.11.2 Server=1.11.2 Image:yetus/hbase:b2c5d84 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12834926/HBASE-16931.branch-1.v2.patch |
JIRA Issue | |
Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
uname | Linux 59828eb69aaf 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | maven |
Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
git revision | branch-1 / d76cc4c |
Default Java | 1.7.0_80 |
Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-oracle:1.7.0_80 |
findbugs | v3.0.0 |
findbugs | https://builds.apache.org/job/PreCommit-HBASE-Build/4161/artifact/patchprocess/branch-findbugs-hbase-server-warnings.html |
unit | https://builds.apache.org/job/PreCommit-HBASE-Build/4161/artifact/patchprocess/patch-unit-hbase-server.txt |
unit test logs | https://builds.apache.org/job/PreCommit-HBASE-Build/4161/artifact/patchprocess/patch-unit-hbase-server.txt |
Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/4161/testReport/ |
modules | C: hbase-server U: hbase-server |
Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/4161/console |
Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |
This message was automatically generated.
All timed out cases failed because of OOME (why so frequent OOME?...)
Running org.apache.hadoop.hbase.filter.TestFuzzyRowFilterEndToEnd
Exception in thread "Thread-2475" java.lang.OutOfMemoryError: Java heap space
Running org.apache.hadoop.hbase.TestHBaseOnOtherDfsCluster
Running org.apache.hadoop.hbase.tool.TestCanaryTool
Exception in thread "process reaper" java.lang.OutOfMemoryError: Java heap space
Exception in thread "Thread-2505" java.lang.OutOfMemoryError: Java heap space
Exception in thread "Thread-2507" java.lang.OutOfMemoryError: Java heap space
And the failed case seems to have encountered an environment issue (Unable to create region directory):
Running org.apache.hadoop.hbase.mapreduce.TestMultiTableSnapshotInputFormat
Tests run: 4, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 23.592 sec <<< FAILURE! - in org.apache.hadoop.hbase.mapreduce.TestMultiTableSnapshotInputFormat
testScanYZYToEmpty(org.apache.hadoop.hbase.mapreduce.TestMultiTableSnapshotInputFormat)  Time elapsed: 0.044 sec  <<< ERROR!
java.io.IOException: java.util.concurrent.ExecutionException: java.io.IOException: Unable to create region directory: /tmp/scantest1_snapshot__8235bb48-4e7b-4e00-ad80-b2ce716c8522/data/default/scantest1/519e450e89d832d702a416a9bca04b5d
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:188)
	at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:180)
	at org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper.cloneHdfsRegions(RestoreSnapshotHelper.java:527)
	at org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper.restoreHdfsRegions(RestoreSnapshotHelper.java:234)
	at org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper.restoreHdfsRegions(RestoreSnapshotHelper.java:170)
	at org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper.copySnapshotForScanner(RestoreSnapshotHelper.java:736)
	at org.apache.hadoop.hbase.mapreduce.MultiTableSnapshotInputFormatImpl.restoreSnapshot(MultiTableSnapshotInputFormatImpl.java:249)
	at org.apache.hadoop.hbase.mapreduce.MultiTableSnapshotInputFormatImpl.restoreSnapshots(MultiTableSnapshotInputFormatImpl.java:243)
	at org.apache.hadoop.hbase.mapreduce.MultiTableSnapshotInputFormatImpl.setInput(MultiTableSnapshotInputFormatImpl.java:80)
	at org.apache.hadoop.hbase.mapreduce.MultiTableSnapshotInputFormat.setInput(MultiTableSnapshotInputFormat.java:106)
	at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initMultiTableSnapshotMapperJob(TableMapReduceUtil.java:319)
	at org.apache.hadoop.hbase.mapreduce.TestMultiTableSnapshotInputFormat.initJob(TestMultiTableSnapshotInputFormat.java:72)
Ran above 4 cases locally and confirmed all could pass.
Pushed commit to master branch and branch-1.
AFAICS all branch-1.1+ branches have this problem, and since this is a critical issue, please let me know your thoughts on whether to include this in branch-1.1/1.2/1.3, gentlemen ndimiduk, busbey, stack, mantonov. Thanks.
Should we do another version of the patch after the hot fix has gone in? It looks like we get the sequenceid twice from a Cell; we could cache it since the calculation costs. What does this mean: "...so if cleanSeqId is called when the scanner.next call..."? Where does it come from? Can we block it for ongoing scans? Like carp84, I like the idea of cloning rather than resetting the sequenceid on a Cell (but I appreciate the Ram and Anoop conversation too). Is beforeShipped making copies of Cells? Pardon if the questions are dumb; I'm just looking at the patch without spending time garnering more context.
If the problem also exists in the earlier branch-1 release lines then we should probably fix it there. Does this patch cause some change in behavior or other downside that would give us pause on backporting it to all branches?
It looks like we get the sequenceid twice from a Cell
Not really needed; getSequenceId returns a field of the cell, so there's no real calculation there.
Does this patch cause some change in behavior or other downside that would give us pause on backporting it to all branches?
No, I don't think so. +1 to fix this in all applicable branches now.
FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #1846 (See https://builds.apache.org/job/HBase-Trunk_matrix/1846/)
HBASE-16931 Setting cell's seqId to zero in compaction flow might cause (liyu: rev fa3cbd1d80ab414901a650e41679534578ebcf63)
- (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/Compactor.java
- (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
FAILURE: Integrated in Jenkins build HBase-1.4 #493 (See https://builds.apache.org/job/HBase-1.4/493/)
HBASE-16931 Setting cell's seqId to zero in compaction flow might cause (liyu: rev 16823ff55e7da4822d82fb2a2108b4a253fd42f9)
- (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
- (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/Compactor.java
Thanks busbey stack and anoop.hbase for the confirmation, will commit into branch-1.1/1.2 soon if no objections.
For branch-1.3, mantonov I know you're trying to stabilize the 1.3.0 release sir, so a special ping before committing this, please let me know your thoughts. Thanks.
Pushed into branch-1.1 and branch-1.2. Now waiting for the decision on branch-1.3 to close the JIRA.
The choice is between this going into 1.3.0 or waiting for 1.3.1, right?
FAILURE: Integrated in Jenkins build HBase-1.1-JDK8 #1888 (See https://builds.apache.org/job/HBase-1.1-JDK8/1888/)
HBASE-16931 Setting cell's seqId to zero in compaction flow might cause (liyu: rev f4cbf77f28a2668743b1c73b5aeb113c3456fb13)
- (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/Compactor.java
- (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
FAILURE: Integrated in Jenkins build HBase-1.1-JDK7 #1804 (See https://builds.apache.org/job/HBase-1.1-JDK7/1804/)
HBASE-16931 Setting cell's seqId to zero in compaction flow might cause (liyu: rev f4cbf77f28a2668743b1c73b5aeb113c3456fb13)
- (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
- (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/Compactor.java
SUCCESS: Integrated in Jenkins build HBase-1.2-JDK8 #50 (See https://builds.apache.org/job/HBase-1.2-JDK8/50/)
HBASE-16931 Setting cell's seqId to zero in compaction flow might cause (liyu: rev 7f10379b1e6756002993c2638c866a6537a088c4)
- (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
- (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/Compactor.java
SUCCESS: Integrated in Jenkins build HBase-1.2-JDK7 #55 (See https://builds.apache.org/job/HBase-1.2-JDK7/55/)
HBASE-16931 Setting cell's seqId to zero in compaction flow might cause (liyu: rev 7f10379b1e6756002993c2638c866a6537a088c4)
- (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java
- (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/Compactor.java
The choice is between this going into 1.3.0 or waiting for 1.3.1, right?
Yes sir. busbey
Skimmed the logs of the failed post-commit checks for trunk/branch-1/branch-1.1 and confirmed they were all caused by known flaky cases or timeouts; no regression caused by the change here, FWIW.
Ping mantonov, pending your decision on branch-1.3 before closing this JIRA, boss.
Plan to commit to branch-1.3 if no objections in another 12 hours and close this JIRA, thanks.
anoopsamjohn, for that other possible bug in the clone (the seqId not being copied), is there a JIRA to track it? Would it also affect the same set of branches?
Sorry for the late reply here, carp84, and great investigation.
This bug seems bad enough to get into 1.3.0. As I'm not very familiar with this code, let me have another look Monday morning (tomorrow, PST) and we should be able to cherry-pick to 1.3. cc ghelmling, who has been debugging compactions recently.
(Any more details on how to trigger/repro it? Some specific workload?)
No problem, I totally understand the cost and how busy it is to maintain production clusters in the meantime; will wait for your feedback.
And yes aoxiang has added a UT case in his patch to reproduce the issue, please refer to that for more details.
We dont clone the seqId and all cloned cells will give seqId of 0!.
It seems there is no issue. Only this method affects the mentioned area:
public static KeyValue toNewKeyCell(final Cell cell) {
  byte[] bytes = new byte[keyLength(cell)];
  appendKeyTo(cell, bytes, 0);
  KeyValue kv = new KeyValue.KeyOnlyKeyValue(bytes, 0, bytes.length);
  // Set the seq id. The new key cell could be used in comparisons so it
  // is important that it uses the seqid also. If not the comparsion would fail
  kv.setSequenceId(cell.getSequenceId());
  return kv;
}
And we are copying the seqId there too, so all good.
Yeah, I don't think we've seen that in the workloads we ran on 1.3, but let's pull it into 1.3. +1. Thanks for the patience here, carp84!
Closing this one since the fix has been pushed to all 1.1+ branches. Again, thanks for the good work aoxiang, and thanks all for the review.
FAILURE: Integrated in Jenkins build HBase-1.3-JDK8 #64 (See https://builds.apache.org/job/HBase-1.3-JDK8/64/)
HBASE-16931 Setting cell's seqId to zero in compaction flow might cause (liyu: rev 84481361b618f9f8ca12a6671eac2bb6ba894ea1)
- (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/Compactor.java
- (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java