Tuesday, February 10, 2015

Understanding the MapR-DB table auto-split algorithm

This article explains the behavior of the MapR-DB table auto-split algorithm.
All tests below were done on MapR 4.0.1.

Theory

1. Auto split is enabled by default.
Check using "maprcli table info", for example:
# maprcli table info -path /maprtable -json|grep -i autosplit
   "autosplit":true,
2. The "regionsizemb" table attribute controls the average size of the regions into which MapR-DB tries to split the table as the table grows.
Check using "maprcli table info", for example:
# maprcli table info -path /maprtable -json|grep -i regionsizemb   
"regionsizemb":4096,
According to http://doc.mapr.com/display/MapR/table+edit :
If autosplit is set to true, MapR-DB splits a region when the size of the region exceeds the average value by 50%, that is, when it is larger than 150% of regionsizemb. For example, if the average value is 4096 MB, MapR-DB splits a region that is larger than 6144 MB.
Note that while a table has fewer than 4 regions, MapR-DB ignores the regionsizemb parameter and aggressively distributes the table data.
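
The documented rule can be written down as a small calculation. The following Python sketch is only a model of the wording quoted above, not MapR-DB's actual implementation; the names SPLIT_FACTOR, MIN_REGIONS, split_threshold_mb, exceeds_split_threshold and size_rule_applies are made up for illustration, and treating "exactly 4 regions" as the point where regionsizemb starts to apply is an assumption.

```python
# Model of the documented auto-split rule (illustrative only, not MapR-DB code).

SPLIT_FACTOR = 1.5  # documented: split when a region is 50% over regionsizemb
MIN_REGIONS = 4     # assumption: regionsizemb is honored from 4 regions onward


def split_threshold_mb(regionsizemb: int) -> float:
    """Size in MB above which a region becomes a split candidate."""
    return regionsizemb * SPLIT_FACTOR


def exceeds_split_threshold(region_mb: float, regionsizemb: int) -> bool:
    """True if a region of region_mb MB is larger than the threshold."""
    return region_mb > split_threshold_mb(regionsizemb)


def size_rule_applies(region_count: int) -> bool:
    """With fewer than MIN_REGIONS regions, regionsizemb is ignored."""
    return region_count >= MIN_REGIONS


print(split_threshold_mb(4096))             # threshold for the default 4096 MB
print(exceeds_split_threshold(6200, 4096))  # a 6200 MB region is over it
print(size_rule_applies(1))                 # a one-region table ignores the size
```

With the default regionsizemb of 4096 this gives the 6144 MB threshold from the documentation example.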

Lab

To verify the theory above, let's first create a table "/maprtable" and bulk load 76733449 rows using Spark, following the steps here.
1. Disable auto split manually and merge the regions into one region of about 4.6GB.
# maprcli table edit -autosplit false -path /maprtable
# maprcli table region list -path /maprtable
numberofrows  fid              secondarynodes  primarynode  numberofrowswithdelete  startkey   logicalsize  lastheartbeat  endkey    physicalsize
76733449      2115.523.263486  yarn-92         yarn-94      0                       -INFINITY  4899782656   0              INFINITY  4964327424
2. Set the region size to 512MB.
# maprcli table edit -regionsizemb 512 -path /maprtable
3. Enable auto split. The table is then split into 9 regions with an average size of about 512MB.
# maprcli table edit -autosplit true -path /maprtable

# maprcli table region list -path /maprtable
numberofrows  fid               secondarynodes  primarynode  numberofrowswithdelete  startkey          logicalsize  lastheartbeat  endkey            physicalsize
9288340       2189.761.132748   yarn-94         yarn-92      0                       -INFINITY         551845888    0              \x00Q\x03S        565313536
8218724       2191.465.132312   yarn-94         yarn-92      0                       \x00Q\x03S        538714112    0              \x00\xAC\x89\xDE  553336832
7547911       2192.1708.134654  yarn-94         yarn-92      0                       \x00\xAC\x89\xDE  486211584    0              \x01\x1E\xF2C     490905600
7628796       2193.34.131220    yarn-94         yarn-92      0                       \x01\x1E\xF2C     489406464    0              \x01\x91\xC7\xCA  494075904
8536673       2194.34.131186    yarn-94         yarn-92      0                       \x01\x91\xC7\xCA  547258368    0              \x02\x12-j        552493056
8650526       2195.723.132698   yarn-94         yarn-92      0                       \x02\x12-j        557539328    0              \x02\x95v\xA6     562798592
8927659       2196.569.132256   yarn-94         yarn-92      0                       \x02\x95v\xA6     573784064    0              \x03\x1CB\xAF     579248128
8973834       2116.322.263128   yarn-92         yarn-94      0                       \x03\x1CB\xAF     578256896    0              \x03\xA4jS        583835648
8960986       2190.720.133176   yarn-94         yarn-92      0                       \x03\xA4jS        576765952    0              INFINITY          582320128
4. Disable auto split and merge the regions into one region again.
# maprcli table edit -autosplit false -path /maprtable
# maprcli table region list -path /maprtable
numberofrows  fid              secondarynodes  primarynode  numberofrowswithdelete  startkey   logicalsize  lastheartbeat  endkey    physicalsize
76733449      2198.945.133124  yarn-92         yarn-94      0                       -INFINITY  4899782656   0              INFINITY  4964327424
5. Set region size to 8GB.
# maprcli table edit -regionsizemb 8192 -path /maprtable
6. Enable auto split. Even though the whole table (about 4.6GB) is smaller than the 8GB region size, MapR-DB still aggressively splits it into 4 regions.
# maprcli table edit -autosplit true -path /maprtable

# maprcli table region list -path /maprtable
numberofrows  fid               secondarynodes  primarynode  numberofrowswithdelete  startkey       logicalsize  lastheartbeat  endkey         physicalsize
38305525      2190.721.133178   yarn-94         yarn-92      0                       -INFINITY      2426093568   0              \x01\xE6&R     2466988032
17322521      2192.1709.134656  yarn-94         yarn-92      0                       \x01\xE6&R     1114742784   0              \x02\xECP\xBD  1125318656
16298948      2198.945.133124   yarn-92         yarn-94      0                       \x02\xECP\xBD  1050279936   0              \x03\xE3\x9DM  1060331520
4806455       2191.466.132314   yarn-94         yarn-92      0                       \x03\xE3\x9DM  308666368    0              INFINITY       311689216
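
The region listings above can be summarized with a short script instead of being read by eye. The Python sketch below parses the step 6 output as pasted text and reports the region count, the total and average logical size, and the largest region. The helper names parse_region_list and summarize are made up for illustration, and the parser assumes that no column value contains whitespace, which holds for the listings shown here but is not guaranteed for arbitrary row keys.

```python
# Summarize pasted "maprcli table region list" output (illustrative sketch).

# Step 6 listing, copied verbatim; the raw string keeps the \x.. key text literal.
REGION_LIST = r"""
numberofrows  fid               secondarynodes  primarynode  numberofrowswithdelete  startkey       logicalsize  lastheartbeat  endkey         physicalsize
38305525      2190.721.133178   yarn-94         yarn-92      0                       -INFINITY      2426093568   0              \x01\xE6&R     2466988032
17322521      2192.1709.134656  yarn-94         yarn-92      0                       \x01\xE6&R     1114742784   0              \x02\xECP\xBD  1125318656
16298948      2198.945.133124   yarn-92         yarn-94      0                       \x02\xECP\xBD  1050279936   0              \x03\xE3\x9DM  1060331520
4806455       2191.466.132314   yarn-94         yarn-92      0                       \x03\xE3\x9DM  308666368    0              INFINITY       311689216
"""

MB = 1024 * 1024


def parse_region_list(text):
    """Turn the whitespace-aligned listing into one dict per region."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    # Assumes no field contains spaces, so a plain split() lines up with the header.
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def summarize(regions):
    """Region count plus total, average and maximum logical size."""
    sizes = [int(r["logicalsize"]) for r in regions]
    return {
        "regions": len(sizes),
        "total_bytes": sum(sizes),
        "avg_mb": sum(sizes) / len(sizes) / MB,
        "max_mb": max(sizes) / MB,
    }


summary = summarize(parse_region_list(REGION_LIST))
print(summary)
# Compare the largest region with 150% of the 8192 MB region size set in step 5.
print(summary["max_mb"] < 8192 * 1.5)
```

For the step 6 listing, the largest region is far below 150% of 8192 MB, so the split into 4 regions cannot be explained by the size threshold alone. The same two functions can be applied to the step 3 listing by pasting it in place of REGION_LIST.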

Conclusion

1. MapR-DB splits a region once its size exceeds 150% of the "regionsizemb" table attribute.
2. MapR-DB aggressively splits a table into at least 4 regions, regardless of "regionsizemb".
