[HBASE-8778] Region assignments scan table directory making them slow for huge tables - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.98.0, 0.95.2
Component/s: None
Labels: None

Hadoop Flags: Reviewed
Release Note:

Table descriptors have moved within HDFS: instead of residing directly in the table directory (alongside the region directories), they now live in a well-known subdirectory called ".tabledesc". For example, instead of /hbase/exampleTable/.tableinfo.0000000003, the file is /hbase/exampleTable/.tabledesc/.tableinfo.0000000003 after this release. The same is true for snapshots. The first active master to start up moves these files for existing tables and snapshots.
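
The path change in the release note can be illustrated with a minimal sketch. This is plain string manipulation, not the actual migration code from the patch; the class and method names (TableDescPathSketch, migratedPath) are hypothetical and exist only for illustration.

```java
public final class TableDescPathSketch {
    // Subdirectory name taken from the release note above.
    static final String TABLEDESC_DIR = ".tabledesc";
    // File name prefix taken from the example path in the release note.
    static final String TABLEINFO_PREFIX = ".tableinfo.";

    /**
     * Maps an old-style descriptor path (directly in the table directory)
     * to its new location inside the ".tabledesc" subdirectory.
     * Hypothetical helper, not an HBase API.
     */
    static String migratedPath(String oldPath) {
        int slash = oldPath.lastIndexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("expected an absolute path: " + oldPath);
        }
        String tableDir = oldPath.substring(0, slash);
        String fileName = oldPath.substring(slash + 1);
        if (!fileName.startsWith(TABLEINFO_PREFIX)) {
            throw new IllegalArgumentException("not a table descriptor file: " + oldPath);
        }
        return tableDir + "/" + TABLEDESC_DIR + "/" + fileName;
    }

    public static void main(String[] args) {
        System.out.println(migratedPath("/hbase/exampleTable/.tableinfo.0000000003"));
    }
}
```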
Tags:
0.96notable

Description

On a table with 130k regions it takes about 3 seconds for a region server to open a region once it has been assigned.

Watching the threads of a region server running 0.94.5 that is opening many such regions shows the thread opening the region in code like this:

"PRI IPC Server handler 4 on 60020" daemon prio=10 tid=0x00002aaac07e9000 nid=0x6566 runnable [0x000000004c46d000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.String.indexOf(String.java:1521)
        at java.net.URI$Parser.scan(URI.java:2912)
        at java.net.URI$Parser.parse(URI.java:3004)
        at java.net.URI.<init>(URI.java:736)
        at org.apache.hadoop.fs.Path.initialize(Path.java:145)
        at org.apache.hadoop.fs.Path.<init>(Path.java:126)
        at org.apache.hadoop.fs.Path.<init>(Path.java:50)
        at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:215)
        at org.apache.hadoop.hdfs.DistributedFileSystem.makeQualified(DistributedFileSystem.java:252)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:311)
        at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:159)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:842)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:867)
        at org.apache.hadoop.hbase.util.FSUtils.listStatus(FSUtils.java:1168)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoPath(FSTableDescriptors.java:269)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoPath(FSTableDescriptors.java:255)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoModtime(FSTableDescriptors.java:368)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:155)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:126)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:2834)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:2807)
        at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)

To open the region, the region server first loads the latest HTableDescriptor. Since HBASE-4553, HTableDescriptors are stored in the file system at "/hbase/<tableDir>/.tableinfo.<sequenceNum>". The file with the largest sequenceNum is the current descriptor. This is done so that the current descriptor is updated atomically. However, since the filename is not known in advance, FSTableDescriptors has to do a FileSystem.listStatus operation, which has to list all files in the directory to find it. The directory also contains all the region directories, so in our case it has to load 130k FileStatus objects. Even using a globStatus matching function still transfers all the objects to the client before performing the pattern matching. Furthermore, HDFS uses a default of transferring 1000 directory entries in each RPC call, so it requires 130 round trips to the namenode to fetch all the directory entries.
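
A minimal sketch of the lookup described above, assuming the directory listing has already been fetched as a list of entry names. It is not the FSTableDescriptors implementation; the class and method names (CurrentDescriptorSketch, currentDescriptor) are hypothetical. It shows why every entry has to be examined: the current descriptor is only identifiable by comparing sequence numbers across the whole listing.

```java
import java.util.List;
import java.util.Optional;

public final class CurrentDescriptorSketch {
    // File name prefix from the path pattern in the description.
    static final String PREFIX = ".tableinfo.";

    /**
     * Returns the descriptor file name with the largest sequence number.
     * Every entry in the listing is visited, including all region
     * directories, which is the cost this issue is about.
     */
    static Optional<String> currentDescriptor(List<String> dirEntries) {
        String best = null;
        long bestSeq = -1;
        for (String name : dirEntries) {
            if (!name.startsWith(PREFIX)) {
                continue; // region directory or unrelated file
            }
            long seq;
            try {
                seq = Long.parseLong(name.substring(PREFIX.length()));
            } catch (NumberFormatException e) {
                continue; // suffix is not a sequence number
            }
            if (seq > bestSeq) {
                bestSeq = seq;
                best = name;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

With the descriptors in their own subdirectory, the same selection would run over a listing that holds only a handful of ".tableinfo" files rather than one entry per region.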

Consequently, reassigning all the regions of a table (or a constant fraction thereof) requires time proportional to the square of the number of regions.

In our case, if a region server fails with 200 such regions, it takes 10+ minutes for them all to be reassigned, after the zk expiration and log splitting.
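
The figures above can be checked with a rough back-of-the-envelope sketch. It assumes each region open lists the whole table directory once, that regions are opened one after another, and that each open takes the roughly 3 seconds reported above; the class and method names are hypothetical.

```java
public final class ReassignmentCostSketch {
    /** Namenode round trips needed to list a directory once (ceiling division). */
    static long roundTripsPerOpen(long entriesInTableDir, long entriesPerRpc) {
        return (entriesInTableDir + entriesPerRpc - 1) / entriesPerRpc;
    }

    /** Total round trips when each region open lists the table directory once. */
    static long totalRoundTrips(long regionsToOpen, long entriesInTableDir, long entriesPerRpc) {
        return regionsToOpen * roundTripsPerOpen(entriesInTableDir, entriesPerRpc);
    }

    /** Wall-clock estimate, assuming region opens happen strictly one at a time. */
    static long serialOpenSeconds(long regionsToOpen, long secondsPerOpen) {
        return regionsToOpen * secondsPerOpen;
    }

    public static void main(String[] args) {
        long regions = 130_000; // table size from the description
        long perRpc = 1_000;    // HDFS default listing batch size cited above
        System.out.println(roundTripsPerOpen(regions, perRpc));        // 130
        System.out.println(totalRoundTrips(regions, regions, perRpc)); // 16900000
        System.out.println(totalRoundTrips(200, regions, perRpc));     // 26000
        System.out.println(serialOpenSeconds(200, 3));                 // 600
    }
}
```

Under these assumptions, 200 regions at about 3 seconds each comes to 600 seconds, which is consistent with the 10+ minutes observed, and reassigning the full table grows as regions × (regions / 1000) round trips.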

Attachments

8778-dirmodtime.txt          22/Jul/13 18:49   1 kB    Lars Hofhansl
HBASE-8778.patch             24/Jul/13 21:05   80 kB   Dave Latham
HBASE-8778-0.94.5.patch      20/Jun/13 23:36   77 kB   Dave Latham
HBASE-8778-0.94.5-v2.patch   24/Jun/13 17:25   77 kB   Dave Latham
HBASE-8778-v2.patch          25/Jul/13 23:33   81 kB   Dave Latham
HBASE-8778-v3.patch          26/Jul/13 02:05   81 kB   Dave Latham
HBASE-8778-v4.patch          30/Jul/13 19:03   81 kB   Dave Latham
HBASE-8778-v5.patch          02/Aug/13 17:29   82 kB   Dave Latham

Issue Links

is related to

HBASE-8348 Polish the migration to 0.96 (Closed)
HBASE-9132 Use table dir modtime to avoid scanning table dir to check cached table descriptor in 0.94 (Closed)
