HBase Region Split

Please read "Apache HBase Region Splitting and Merging" first.
This post is a quick explanation of the HBase region split policies.

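Regions are the basic element of availability and distribution for tables, and consist of one Store per Column Family. The hierarchy of objects is as follows:

Table
    Region
        Store (one per Column Family in each Region)
            MemStore (one per Store)
            StoreFile (one or more per Store)
                Block (blocks within a StoreFile)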

Pre-split

There are two predefined split algorithms: HexStringSplit and UniformSplit.

1. HexStringSplit

The format of a HexStringSplit region boundary is the ASCII representation of an MD5 checksum, or any other uniformly distributed hexadecimal value. Rows are hex-encoded long values in the range "00000000" => "FFFFFFFF" and are left-padded with zeros to keep the same order lexicographically as if they were binary.

Sample:
The command below creates a table with 10 regions using the HexStringSplit algorithm:

hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
DEBUG util.RegionSplitter: Creating table test_table with 1 column families.  Presplitting to 10 regions

10 regions created:

[root]# hadoop fs -ls /apps/hbase/data/test_table
Found 12 items
-rw-r--r--   3 hbase hadoop        673 2014-05-21 09:54 /apps/hbase/data/test_table/.tableinfo.0000000001
drwxr-xr-x   - hbase hadoop          0 2014-05-21 09:54 /apps/hbase/data/test_table/.tmp
drwxr-xr-x   - hbase hadoop          0 2014-05-21 09:54 /apps/hbase/data/test_table/339d0eb61160df679c6ea628ee80b0d6
drwxr-xr-x   - hbase hadoop          0 2014-05-21 09:54 /apps/hbase/data/test_table/86da408d174d83aae3fb0bcdb68145c8
drwxr-xr-x   - hbase hadoop          0 2014-05-21 09:54 /apps/hbase/data/test_table/b0129aac1ec9f20a6a4ffe27b125cd27
drwxr-xr-x   - hbase hadoop          0 2014-05-21 09:54 /apps/hbase/data/test_table/b94f184ee55374ed5d5db71b88a7bc05
drwxr-xr-x   - hbase hadoop          0 2014-05-21 09:54 /apps/hbase/data/test_table/c003cd8b2ff3b4a9c6c653ce1a3c0fce
drwxr-xr-x   - hbase hadoop          0 2014-05-21 09:54 /apps/hbase/data/test_table/ca8cc09027606d6c51f189d61fe6eb4f
drwxr-xr-x   - hbase hadoop          0 2014-05-21 09:54 /apps/hbase/data/test_table/d41006f677c222b62695035364c528d6
drwxr-xr-x   - hbase hadoop          0 2014-05-21 09:54 /apps/hbase/data/test_table/e8e2f820883ccd5771d1470f3a36b88f
drwxr-xr-x   - hbase hadoop          0 2014-05-21 09:54 /apps/hbase/data/test_table/ef2cb46e051fdcccb070b1e637bb5fd5
drwxr-xr-x   - hbase hadoop          0 2014-05-21 09:54 /apps/hbase/data/test_table/f7d3a744d584f44a890e398618c85c4f

You can find the key range of each region in its .regioninfo file:

hadoop fs -cat /apps/hbase/data/test_table/339d0eb61160df679c6ea628ee80b0d6/.regioninfo
STARTKEY => '99999996', ENDKEY => 'b333332f'

hadoop fs -cat /apps/hbase/data/test_table/86da408d174d83aae3fb0bcdb68145c8/.regioninfo
STARTKEY => '', ENDKEY => '19999999'
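The boundaries shown above can be reproduced with a little arithmetic. The Python sketch below is not HBase code. It assumes the algorithm cuts the 32-bit key space into equal integer steps and prints each boundary as eight lowercase hex digits, which is consistent with the '19999999', '99999996' and 'b333332f' values in the .regioninfo output. The function name hex_string_splits is made up for illustration.

```python
def hex_string_splits(num_regions, first=0x00000000, last=0xFFFFFFFF):
    """Return the num_regions - 1 boundaries between evenly sized hex ranges.

    Assumption: each region spans (last - first + 1) // num_regions keys,
    so the last region absorbs the rounding remainder.
    """
    if num_regions < 2:
        return []  # a single region has no internal boundaries
    step = (last - first + 1) // num_regions
    # Left-pad to 8 hex digits so that string order matches numeric order.
    return [format(first + i * step, "08x") for i in range(1, num_regions)]


splits = hex_string_splits(10)
print(splits[0])             # 19999999 (end key of the first region)
print(splits[5], splits[6])  # 99999996 b333332f
```

Ten regions need only nine boundaries, because the first region starts at the empty key and the last region ends at the empty key.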

2. UniformSplit

A SplitAlgorithm that divides the space of possible keys evenly. Useful when the keys are approximately uniform random bytes (e.g. hashes). Rows are raw byte values in the range 00 => FF and are right-padded with zeros to keep the same memcmp() order. This is the natural algorithm to use for a byte[] environment and saves space, but is not necessarily the easiest for readability.

Sample:

hbase org.apache.hadoop.hbase.util.RegionSplitter test_table3 UniformSplit -c 3 -f f1
DEBUG util.RegionSplitter: Creating table test_table3 with 1 column families.  Presplitting to 3 regions

3 regions created:

[root@hdm ~]# hadoop fs -ls /apps/hbase/data/test_table3
Found 5 items
-rw-r--r--   3 hbase hadoop        675 2014-05-21 14:09 /apps/hbase/data/test_table3/.tableinfo.0000000001
drwxr-xr-x   - hbase hadoop          0 2014-05-21 14:09 /apps/hbase/data/test_table3/.tmp
drwxr-xr-x   - hbase hadoop          0 2014-05-21 14:09 /apps/hbase/data/test_table3/02bcd58dc337bc28fac74ee0e36a11a2
drwxr-xr-x   - hbase hadoop          0 2014-05-21 14:09 /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb
drwxr-xr-x   - hbase hadoop          0 2014-05-21 14:09 /apps/hbase/data/test_table3/e9f130fc2ebbafb20e5ebc45ea3bc7bd

Range of each region:

hadoop fs -cat /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb/.regioninfo
STARTKEY => '', ENDKEY => 'UUUUUUUU'

hadoop fs -cat /apps/hbase/data/test_table3/02bcd58dc337bc28fac74ee0e36a11a2/.regioninfo
STARTKEY => 'UUUUUUUU', ENDKEY => '\xAA\xAA\xAA\xAA\xAA\xAA\xAA\xAA'

hadoop fs -cat /apps/hbase/data/test_table3/e9f130fc2ebbafb20e5ebc45ea3bc7bd/.regioninfo
STARTKEY => '\xAA\xAA\xAA\xAA\xAA\xAA\xAA\xAA', ENDKEY => ''
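The 'UUUUUUUU' boundary looks odd at first, but it is just eight bytes of 0x55, and 0x55 is the ASCII code for "U". The Python sketch below is an illustration rather than the HBase implementation. It assumes 8-byte keys running from all 0x00 to all 0xFF and evenly spaced boundaries, which matches the output above. The name uniform_splits is hypothetical.

```python
def uniform_splits(num_regions, key_len=8):
    """Return num_regions - 1 evenly spaced boundaries as raw byte strings.

    Assumption: keys are key_len bytes wide, from all 0x00 to all 0xFF.
    """
    if num_regions < 2:
        return []
    last = (1 << (8 * key_len)) - 1  # key_len bytes of 0xFF as an integer
    return [
        (last * i // num_regions).to_bytes(key_len, "big")
        for i in range(1, num_regions)
    ]


for boundary in uniform_splits(3):
    print(boundary)
# b'UUUUUUUU'
# b'\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'
```

Because keys compare as raw bytes, a short key such as "1" (0x31) sorts before 'UUUUUUUU', while "zzz" (0x7a...) sorts between the two boundaries.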

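Keys "1" and "2" fall into the first region, and key "zzz" falls into the second region. The store files listed below appear once the memstore has been flushed (for example with flush 'test_table3'):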

put 'test_table3','1','f1:col1','data_1_col1'
# hadoop fs -ls /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb/f1
Found 1 items
-rw-r--r--   3 hbase hadoop        697 2014-05-21 14:12 /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb/f1/585ad8fa038c434880848a260160eed2

put 'test_table3','2','f1:col1','data_2_col1'
# hadoop fs -ls /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb/f1
Found 2 items
-rw-r--r--   3 hbase hadoop        697 2014-05-21 14:13 /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb/f1/1d26504ba52444309dc03c0a4ef92283
-rw-r--r--   3 hbase hadoop        697 2014-05-21 14:12 /apps/hbase/data/test_table3/e24e8865d98d52605f166d2e9afa07eb/f1/585ad8fa038c434880848a260160eed2

put 'test_table3','zzz','f1:col1','data_zzz_col1'
# hadoop fs -ls /apps/hbase/data/test_table3/02bcd58dc337bc28fac74ee0e36a11a2/f1
Found 1 items
-rw-r--r--   3 hbase hadoop        705 2014-05-21 15:05 /apps/hbase/data/test_table3/02bcd58dc337bc28fac74ee0e36a11a2/f1/a10e61a74a394e5d93a20eb61372d674

3. Desired split points

If you already have split points at hand, you can also use the HBase shell to create the table with the desired split points.

Sample:

create 'test_table2', 'f1', {SPLITS => ['a', 'b', 'c']}

4 regions created (3 split points produce 4 regions):

# hadoop fs -ls /apps/hbase/data/test_table2/
Found 6 items
-rw-r--r--   3 hbase hadoop        675 2014-05-21 13:06 /apps/hbase/data/test_table2/.tableinfo.0000000001
drwxr-xr-x   - hbase hadoop          0 2014-05-21 13:06 /apps/hbase/data/test_table2/.tmp
drwxr-xr-x   - hbase hadoop          0 2014-05-21 13:06 /apps/hbase/data/test_table2/17eb744fc9788cab51f92d4e9ed740d7
drwxr-xr-x   - hbase hadoop          0 2014-05-21 13:06 /apps/hbase/data/test_table2/b8ef19896ac8e43ab5c050c01f129329
drwxr-xr-x   - hbase hadoop          0 2014-05-21 13:06 /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd
drwxr-xr-x   - hbase hadoop          0 2014-05-21 13:06 /apps/hbase/data/test_table2/c9baa85d4d5302d8fa53e807741d323d

Range of each region:

hadoop fs -cat /apps/hbase/data/test_table2/c9baa85d4d5302d8fa53e807741d323d/.regioninfo
STARTKEY => '', ENDKEY => 'a'

hadoop fs -cat /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd/.regioninfo
STARTKEY => 'a', ENDKEY => 'b'

hadoop fs -cat /apps/hbase/data/test_table2/17eb744fc9788cab51f92d4e9ed740d7/.regioninfo
STARTKEY => 'b', ENDKEY => 'c'

hadoop fs -cat /apps/hbase/data/test_table2/b8ef19896ac8e43ab5c050c01f129329/.regioninfo
STARTKEY => 'c', ENDKEY => ''

Keys "a", "b", "123" and "abcd" fall into the following regions (the start key of a region is inclusive, the end key is exclusive):

put 'test_table2','a','f1:col1','data_a_col1'

# hadoop fs -ls /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd/f1
Found 1 items
-rw-r--r--   3 hbase hadoop        697 2014-05-21 13:11 /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd/f1/4c54189cb64f452a98e722a6bfef23b7

put 'test_table2','b','f1:col1','data_b_col1'

# hadoop fs -ls /apps/hbase/data/test_table2/17eb744fc9788cab51f92d4e9ed740d7/f1
Found 1 items
-rw-r--r--   3 hbase hadoop        697 2014-05-21 13:37 /apps/hbase/data/test_table2/17eb744fc9788cab51f92d4e9ed740d7/f1/4159bd0e73dd4dcaad49efbead735851

put 'test_table2','123','f1:col1','data_123_col1'

# hadoop fs -ls /apps/hbase/data/test_table2/c9baa85d4d5302d8fa53e807741d323d/f1
Found 1 items
-rw-r--r--   3 hbase hadoop        705 2014-05-21 13:38 /apps/hbase/data/test_table2/c9baa85d4d5302d8fa53e807741d323d/f1/37c2532ae31a4904ad593887ce9dd70c

put 'test_table2','abcd','f1:col1','data_abcd_col1'

# hadoop fs -ls /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd/f1
Found 2 items
-rw-r--r--   3 hbase hadoop        697 2014-05-21 13:11 /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd/f1/4c54189cb64f452a98e722a6bfef23b7
-rw-r--r--   3 hbase hadoop        709 2014-05-21 13:39 /apps/hbase/data/test_table2/be78b0afc4ba7a4118234630104bfbbd/f1/a27ae4c6a09644f7b7c9c23344a878fc
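The placement of the four keys above follows from plain byte-wise comparison against the sorted split points. Here is a minimal Python sketch of that lookup, assuming the split points are sorted and each region covers a half-open range [start key, end key). The helper name region_index is made up for this post and is not an HBase API.

```python
from bisect import bisect_right


def region_index(row_key, split_points):
    """Return the 0-based index of the region whose range contains row_key.

    Region i covers [split_points[i - 1], split_points[i]); the first region
    starts at the empty key and the last region ends at the empty key.
    split_points must be sorted in byte order.
    """
    return bisect_right(split_points, row_key)


splits = [b"a", b"b", b"c"]  # 3 split points give 4 regions
for key in (b"a", b"b", b"123", b"abcd"):
    print(key, "-> region", region_index(key, splits))
# b'a' -> region 1
# b'b' -> region 2
# b'123' -> region 0
# b'abcd' -> region 1
```

Region 0 is the ('', 'a') region, region 1 is ('a', 'b'), and so on, which lines up with the directories the store files landed in.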

Auto Split

Once a region grows past a configured size limit, it is automatically split into two regions.
Three of the predefined auto split policies are covered here: ConstantSizeRegionSplitPolicy, IncreasingToUpperBoundRegionSplitPolicy, and KeyPrefixRegionSplitPolicy.

hbase.regionserver.region.split.policy
A split policy determines when a region should be split. The various other split policies that are available currently are:
ConstantSizeRegionSplitPolicy, DisabledRegionSplitPolicy, DelimitedKeyPrefixRegionSplitPolicy, KeyPrefixRegionSplitPolicy etc.
Default: org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy

1. ConstantSizeRegionSplitPolicy

A RegionSplitPolicy implementation which splits a region as soon as any of its store files exceeds a maximum configurable size ("hbase.hregion.max.filesize", default = 10GB).
This was the default split policy before 0.94.0. From 0.94.0 on, the default split policy is IncreasingToUpperBoundRegionSplitPolicy.

hbase.hregion.max.filesize
Maximum HStoreFile size. If any one of a column families' HStoreFiles has grown to exceed this value, the hosting HRegion is split in two.
Default: 10737418240

2. IncreasingToUpperBoundRegionSplitPolicy

For 0.94: The split size is the number of regions of the same table on this region server, squared, times the region flush size, OR the maximum region split size, whichever is smaller.

Min (R^2 * "hbase.hregion.memstore.flush.size", "hbase.hregion.max.filesize"), 
where R is the number of regions of the same table hosted on the same region server.

By default, "hbase.hregion.memstore.flush.size" = 128MB, "hbase.hregion.max.filesize"=10GB.

hbase.hregion.memstore.flush.size
Memstore will be flushed to disk if size of the memstore exceeds this number of bytes. Value is checked by a thread that runs every hbase.server.thread.wakefrequency.
Default: 134217728

So the split size for R = 1, 2, 3, ... is: 128MB, 512MB, 1152MB, 2GB, 3.2GB, 4.6GB, 6.2GB, 8GB, 10GB, 10GB, ...

For 0.98: The split size is the number of regions of the same table on this region server, cubed, times 2x the region flush size, OR the maximum region split size, whichever is smaller.

Min (R^3 * 2 * "hbase.hregion.memstore.flush.size", "hbase.hregion.max.filesize"), 
where R is the number of regions of the same table hosted on the same region server.

So the split size for R = 1, 2, 3, ... is: 256MB, 2GB, 6.75GB, 10GB, 10GB, ...
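Both formulas are easy to check by hand or with a few lines of Python. The sketch below simply evaluates the two Min(...) expressions given above with the default values of "hbase.hregion.memstore.flush.size" and "hbase.hregion.max.filesize". It is a calculator for the formulas as stated in this post, not the policy class itself, and the names split_size, FLUSH_SIZE and MAX_FILE_SIZE are made up.

```python
MB = 1024 * 1024
FLUSH_SIZE = 128 * MB           # hbase.hregion.memstore.flush.size default
MAX_FILE_SIZE = 10 * 1024 * MB  # hbase.hregion.max.filesize default


def split_size(regions, version="0.98"):
    """Split threshold in bytes for R regions of one table on one server."""
    if version == "0.94":
        grown = regions ** 2 * FLUSH_SIZE      # R^2 * flush size
    else:
        grown = regions ** 3 * 2 * FLUSH_SIZE  # R^3 * 2 * flush size
    return min(grown, MAX_FILE_SIZE)


for version in ("0.94", "0.98"):
    sizes_mb = [split_size(r, version) // MB for r in range(1, 10)]
    print(version, sizes_mb)
# 0.94 [128, 512, 1152, 2048, 3200, 4608, 6272, 8192, 10240]
# 0.98 [256, 2048, 6912, 10240, 10240, 10240, 10240, 10240, 10240]
```

With the 0.94 formula the 10GB cap is reached at R = 9, while with the 0.98 formula it is already reached at R = 4.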

In short, the formula differs between versions, so check the one that applies to your release.

3. KeyPrefixRegionSplitPolicy

A custom RegionSplitPolicy that groups rows by a prefix of the row key. This ensures that a region is not split "inside" a prefix of a row key, i.e. rows can be co-located in a region by their prefix.
The "prefix_split_key_policy.prefix_length" attribute of the table defines the prefix length.
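One way to picture the effect: whatever split key would normally be chosen, it gets cut down to the prefix length, so the boundary always lands at the start of a prefix group. The Python sketch below illustrates that idea under this assumption. The row keys, the candidate split key and the function name prefix_split_point are all invented for the example.

```python
def prefix_split_point(candidate, prefix_length):
    """Truncate a candidate split key to the configured prefix length."""
    return candidate[:prefix_length]


rows = [b"user1_a", b"user1_b", b"user2_a", b"user2_b", b"user3_a"]
candidate = b"user2_b"  # hypothetical split key that falls inside a prefix group
split = prefix_split_point(candidate, 5)

first = [r for r in rows if r < split]    # rows before the split key
second = [r for r in rows if r >= split]  # split key and everything after it
print(split)   # b'user2'
print(first)   # [b'user1_a', b'user1_b']
print(second)  # [b'user2_a', b'user2_b', b'user3_a']
```

Without the truncation, splitting at "user2_b" would put "user2_a" and "user2_b" into different regions.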

Force Split

hbase(main):004:0> help 'split'
Split entire table or pass a region to split individual region.  With the
second parameter, you can specify an explicit split key for the region.
Examples:
    split 'tableName'
    split 'regionName' # format: 'tableName,startKey,id'
    split 'tableName', 'splitKey'
    split 'regionName', 'splitKey'

Sample:

create 'testforce','f1'
put 'testforce','row1','f1:col1','data1'
put 'testforce','row2','f1:col1','data2'
put 'testforce','row3','f1:col1','data3'
put 'testforce','row4','f1:col1','data4'
flush 'testforce'

# hadoop fs -ls /apps/hbase/data/testforce/1a146d535be7662bb1102e44961ddb7e/f1
Found 1 items
-rw-r--r--   3 hbase hadoop        808 2014-05-21 15:56 /apps/hbase/data/testforce/1a146d535be7662bb1102e44961ddb7e/f1/dbe88fd159324a9499405a8536c66c4b

[root@hdm ~]# hadoop fs -ls /apps/hbase/data/testforce
Found 3 items
-rw-r--r--   3 hbase hadoop        671 2014-05-21 15:55 /apps/hbase/data/testforce/.tableinfo.0000000001
drwxr-xr-x   - hbase hadoop          0 2014-05-21 15:55 /apps/hbase/data/testforce/.tmp
drwxr-xr-x   - hbase hadoop          0 2014-05-21 15:56 /apps/hbase/data/testforce/1a146d535be7662bb1102e44961ddb7e

hbase(main):034:0> split '1a146d535be7662bb1102e44961ddb7e','row2'
0 row(s) in 0.0420 seconds

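Right after the split, the parent region directory is still present next to the two new daughter regions:

# hadoop fs -ls /apps/hbase/data/testforce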
Found 5 items
-rw-r--r--   3 hbase hadoop        671 2014-05-21 15:55 /apps/hbase/data/testforce/.tableinfo.0000000001
drwxr-xr-x   - hbase hadoop          0 2014-05-21 15:55 /apps/hbase/data/testforce/.tmp
drwxr-xr-x   - hbase hadoop          0 2014-05-21 15:58 /apps/hbase/data/testforce/1a146d535be7662bb1102e44961ddb7e
drwxr-xr-x   - hbase hadoop          0 2014-05-21 15:58 /apps/hbase/data/testforce/2645e315abec969864d5d5610b004c60
drwxr-xr-x   - hbase hadoop          0 2014-05-21 15:58 /apps/hbase/data/testforce/439b0a8e5306b24370fd5d61ff1eeb03

A little later, the parent region has been cleaned up and only the two daughter regions remain:

# hadoop fs -ls /apps/hbase/data/testforce
Found 4 items
-rw-r--r--   3 hbase hadoop        671 2014-05-21 15:55 /apps/hbase/data/testforce/.tableinfo.0000000001
drwxr-xr-x   - hbase hadoop          0 2014-05-21 15:55 /apps/hbase/data/testforce/.tmp
drwxr-xr-x   - hbase hadoop          0 2014-05-21 15:59 /apps/hbase/data/testforce/2645e315abec969864d5d5610b004c60
drwxr-xr-x   - hbase hadoop          0 2014-05-21 15:59 /apps/hbase/data/testforce/439b0a8e5306b24370fd5d61ff1eeb03

# hadoop fs -ls /apps/hbase/data/testforce/439b0a8e5306b24370fd5d61ff1eeb03/f1
Found 1 items
-rw-r--r--   3 hbase hadoop        715 2014-05-21 15:58 /apps/hbase/data/testforce/439b0a8e5306b24370fd5d61ff1eeb03/f1/3ba626091f7948eb9b19a328fe108716
# hadoop fs -ls /apps/hbase/data/testforce/2645e315abec969864d5d5610b004c60/f1
Found 1 items
-rw-r--r--   3 hbase hadoop        645 2014-05-21 15:58 /apps/hbase/data/testforce/2645e315abec969864d5d5610b004c60/f1/eb093dbeb36a44c29d049135f0fcbfe8

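The split key 'row2' becomes the start key of the second daughter region, so row2 through row4 end up in one daughter and row1 in the other:

# hadoop fs -cat /apps/hbase/data/testforce/439b0a8e5306b24370fd5d61ff1eeb03/f1/3ba626091f7948eb9b19a328fe108716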
row2f1col1F data2 row3f1col1F data3 row4f1col1F data4

# hadoop fs -cat /apps/hbase/data/testforce/2645e315abec969864d5d5610b004c60/f1/eb093dbeb36a44c29d049135f0fcbfe8
row1f1col1F data1

Thursday, May 22, 2014