Details
- Type: Umbrella
- Status: Closed
- Priority: Major
- Resolution: Fixed
- 0.95.2
- None
- None
Description
Part of the approach is described here: https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
The ideal target is:
- a failure, whatever its cause, affects client applications only through an added delay in executing a query.
- this delay is always less than 1 second.
We're not going to achieve that immediately...
Priority will be given to the most frequent issues.
Short term:
- software crash
- standard administrative tasks such as stopping and starting a cluster.
Issue Links
- is related to
  - HBASE-2108 [HA] hbase cluster should be able to ride over hdfs 'safe mode' flip and namenode restart/move (Closed)
  - HDFS-2296 If read error while lease is being recovered, client reverts to stale view on block info (Open)
  - HBASE-3809 .META. may not come back online if > number of executors servers crash and one of those > number of executors was carrying meta (Closed)
  - HBASE-4177 Handling read failures during recovery - when HMaster calls Namenode recovery, recovery may be a failure leading to read failure while splitting logs (Closed)
  - HBASE-6401 HBase may lose edits after a crash if used with HDFS 1.0.3 or older (Closed)
  - HBASE-2958 When hbase.hlog.split.skip.errors is set to false, we fail the split but thats it (Closed)
  - HBASE-6140 Make distributed log splitting faster by changing call site of tmp log renaming (Closed)
  - HBASE-7390 Add extra test cases for assignement on the region server and fix the related issues (Closed)
  - HBASE-6328 FSHDFSUtils#recoverFileLease tries to rethrow InterruptedException but actually shallows it (Closed)
  - HBASE-7407 TestMasterFailover under tests some cases and over tests some others (Closed)
  - HBASE-8216 Be able to differentiate Power failures from Rack switch reboot (Closed)
  - HBASE-6175 TestFSUtils flaky on hdfs getFileStatus method (Closed)
  - HBASE-6356 printStackTrace in FSUtils (Closed)
  - HBASE-2315 BookKeeper for write-ahead logging (Closed)
  - HDFS-1075 Separately configure connect timeouts from read timeouts in data path (Open)
  - HDFS-1094 Intelligent block placement policy to decrease probability of block loss (Open)
  - HDFS-3706 Add the possibility to mark a node as 'low priority' for writes in the DFSClient (Open)
  - ZOOKEEPER-922 enable faster timeout of sessions in case of unexpected socket disconnect (Open)
  - HDFS-4642 Allow lease recovery for multiple paths to be issued in one request (Resolved)
  - ZOOKEEPER-1147 Add support for local sessions (Resolved)
  - HDFS-3702 Add an option for NOT writing the blocks locally if there is a datanode on the same box as the client (Resolved)
  - HBASE-6134 Improvement for split-worker to speed up distributed log splitting (Closed)
  - HBASE-1316 ZooKeeper: use native threads to avoid GC stalls (JNI integration) (Closed)
  - HBASE-2183 Ride over restart (Closed)
  - HBASE-6295 Possible performance improvement in client batch operations: presplit and send in background (Closed)
  - HBASE-6508 [0.89-fb] Filter out edits at log split time (Closed)
  - HDFS-3703 Decrease the datanode failure detection time (Closed)
  - HBASE-6490 'dfs.client.block.write.retries' value could be increased in HBase (Closed)
  - ZOOKEEPER-702 GSoC 2010: Failure Detector Model (Open)
  - HBASE-1111 [performance] Crash recovery takes way too long (Closed)
- relates to
  - HBASE-6060 Regions's in OPENING state from failed regionservers takes a long time to recover (Closed)
  - HADOOP-8144 pseudoSortByDistance in NetworkTopology doesn't work properly if no local node and first node is local rack node (Closed)
  - ZOOKEEPER-1495 ZK client hangs when using a function not available on the server. (Closed)
  - HBASE-5970 Improve the AssignmentManager#updateTimer and speed up handling opened event (Closed)
  - HBASE-7386 Investigate providing some supervisor support for znode deletion (Closed)
- requires
  - HDFS-3912 Detecting and avoiding stale datanodes for writing (Closed)
  - HBASE-6737 NullPointerException at regionserver.wal.SequenceFileLogWriter.append (Closed)
  - HBASE-6738 Too aggressive task resubmission from the distributed log manager (Closed)
  - HBASE-6364 Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table (Closed)
  - HBASE-6713 Stopping META/ROOT RS may take 50mins when some region is splitting (Closed)
  - HBASE-6736 Distributed Split: a split tasks can be mark as DONE but keep unassigned (Closed)
  - HBASE-6970 hbase-deamon.sh creates/updates pid file even when that start failed. (Closed)
  - HBASE-7271 Have a single executor for all zkWorkers in the assignment manager (Closed)
  - HBASE-7756 Strange code in ServerCallable#shouldRetry (Closed)
  - HBASE-7989 Client with a cache info on a dead server will wait for 20s before trying another one. (Closed)
  - HBASE-8204 Don't use hdfs append during lease recovery (Closed)
  - HBASE-5992 Generalization of region move implementation + manage draining servers in bulk assign (Closed)
  - HBASE-6156 Improve multiop performances in HTable#flushCommits (Closed)
  - HBASE-6315 ipc.HBaseClient should support address change as does hdfs (Closed)
  - HBASE-6878 DistributerLogSplit can fail to resubmit a task done if there is an exception during the log archiving (Closed)
  - HBASE-7815 Too subtle behavior for HConnection#getRegionLocation reload parameter and performance risk (Closed)
  - HBASE-5902 Some scripts are not executable (Closed)
  - HBASE-4755 HBase based block placement in DFS (Closed)
  - HBASE-7006 [MTTR] Improve Region Server Recovery Time - Distributed Log Replay (Closed)
  - HBASE-7590 Add a costless notifications mechanism from master to regionservers & clients (Closed)
  - HDFS-2576 Namenode should have a favored nodes hint to enable clients to have control over block placement. (Closed)
  - HDFS-3705 Add the possibility to mark a node as 'low priority' for read in the DFSClient (Resolved)
  - HBASE-6870 HTable#coprocessorExec always scan the whole table (Closed)
  - HBASE-7213 Have HLog files for .META. and -ROOT- edits only (Closed)
  - HBASE-5844 Delete the region servers znode after a regions server crash (Closed)
  - HBASE-5924 In the client code, don't wait for all the requests to be executed before resubmitting a request in error. (Closed)
  - HBASE-5930 Limits the amount of time an edit can live in the memstore. (Closed)
  - HBASE-6309 [MTTR] Do NN operations outside of the ZK EventThread in SplitLogManager (Closed)
  - HBASE-6435 Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes (Closed)
  - HBASE-6751 Too many retries, leading a a delay to read the HLog after a datanode failure (Closed)
  - HBASE-6752 On region server failure, serve writes and timeranged reads during the log split (Closed)
  - HBASE-6772 Make the Distributed Split HDFS Location aware (Closed)
  - HBASE-6773 Make the dfs replication factor configurable per table (Closed)
  - HBASE-6774 Immediate assignment of regions that don't have entries in HLog (Closed)
  - HBASE-6783 Make read short circuit the default (Closed)
  - HBASE-7246 Assignment#nodeChildrenChanged calls listChildrenAndWatchForNewChildren, overloading master & zookeper needlessly (Closed)
  - HBASE-7247 Assignment performances decreased by 50% because of regionserver.OpenRegionHandler#tickleOpening (Closed)
  - HBASE-7327 Assignment Timeouts: Remove the code from the master (Closed)
  - HBASE-7334 We should expire the zk session for crashed servers rather than deleting ephemeral znodes (Closed)
  - HDFS-4721 Speed up lease/block recovery when DN fails and a block goes into recovery (Closed)
  - HBASE-5859 Optimize the rolling restart script (Closed)
  - HBASE-5877 When a query fails because the region has moved, let the regionserver return the new address to the client (Closed)
  - HBASE-5926 Delete the master znode after a master crash (Closed)
  - HBASE-5939 Add an autorestart option in the start scripts (Closed)
  - HBASE-5998 Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receveing an open regions request (Closed)
  - HBASE-6058 Use ZK 3.4 API 'multi' in bulk assignment (Closed)
  - HBASE-6109 Improve RIT performances during assignment on large clusters (Closed)
  - HBASE-6290 Add a function a mark a server as dead and start the recovery the process (Closed)
  - HDFS-4754 Add an API in the namenode to mark a datanode as stale (Patch Available)
- supercedes
  - HBASE-1111 [performance] Crash recovery takes way too long (Closed)