Friday, December 11, 2020

HBase replication cheat sheet

Goal:

This article records common commands and known issues for HBase replication.


Solution:

1. Add the target cluster as a peer

hbase shell> add_peer "us_east","hostname.of.zookeeper:5181:/path-to-hbase"
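The cluster key is the peer's ZooKeeper quorum, client port, and HBase znode parent; a comma-separated quorum can be given. A sketch with placeholder hostnames:

hbase shell> add_peer "us_east", "zk1.example.com,zk2.example.com,zk3.example.com:5181:/path-to-hbase"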

2. Enable and Disable table replication

hbase shell> enable_table_replication "t1"
hbase shell> disable_table_replication "t1"
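Under the hood, enable_table_replication sets REPLICATION_SCOPE => 1 on the table's column families. On versions without these shell commands, the same effect can be achieved per column family; the family name "cf" below is a placeholder:

hbase shell> disable "t1"
hbase shell> alter "t1", {NAME => "cf", REPLICATION_SCOPE => 1}
hbase shell> enable "t1"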

3. Copy table from source to target

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=hostname.of.zookeeper:5181:/path-to-hbase t1
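CopyTable also accepts a time range and a column-family list, which helps limit the copy to data written before a given point; the timestamps and family name below are placeholders:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1607558400000 --endtime=1607644800000 --families=cf --peer.adr=hostname.of.zookeeper:5181:/path-to-hbase t1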

4. Remove target as peer

hbase shell> remove_peer "us_east"

5. List all peers

hbase shell> list_peers

6. Verify rows between the source and target tables

hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication peer1 table1

Compare the GOODROWS and BADROWS counters in the MapReduce job output; BADROWS counts rows that differ or are missing on one side.
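VerifyReplication can also restrict the comparison to a time range or specific column families; the values below are placeholders:

hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication --starttime=1607558400000 --endtime=1607644800000 --families=cf peer1 table1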

7.  Monitor Replication Status

# Prints the status of each source and its sinks, sorted by hostname.
hbase shell> status 'replication'

# Prints the status for each replication source, sorted by hostname.
hbase shell> status 'replication', 'source'

# Prints the status for each replication sink, sorted by hostname.
hbase shell> status 'replication', 'sink'
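Replication state is also persisted in ZooKeeper under the HBase znode parent. Assuming the default /hbase parent and the standard replication znode layout, the peers and per-RegionServer queues can be inspected with:

hbase zkcli ls /hbase/replication/peers
hbase zkcli ls /hbase/replication/rs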

8.  HBase Replication Metrics

Metric                        Description
source.sizeOfLogQueue         Number of WALs still to be processed (excluding the one currently being processed) at the replication source.
source.shippedOps             Number of mutations shipped.
source.logEditsRead           Number of mutations read from WALs at the replication source.
source.ageOfLastShippedOp     Age of the last batch shipped by the replication source.
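These metrics are exposed via the RegionServer JMX servlet; assuming the default info port 16030 and the standard Replication metrics bean, they can be pulled with:

curl -s "http://regionserver.example.com:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Replication"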

9. Procedure for replicating an existing table from cluster A to cluster B

on cluster A:
hbase shell> add_peer "B","hostname.of.zookeeper:5181:/path-to-hbase"
hbase shell> enable_table_replication "t1"
hbase shell> disable_peer 'B'

Then, while the peer is disabled, use CopyTable, Export/Import, or ExportSnapshot to copy table "t1" from A to B.
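For example, the ExportSnapshot route looks roughly like this; the snapshot name, target NameNode address, and HBase root directory are placeholders:

# on cluster A: take a snapshot and ship it to cluster B's HDFS
hbase shell> snapshot "t1", "t1_snapshot"
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot t1_snapshot -copy-to hdfs://clusterB-namenode:8020/hbase -mappers 8

# on cluster B: materialize the table from the exported snapshot
# (use clone_snapshot if "t1" does not exist yet; otherwise disable it and use restore_snapshot)
hbase shell> clone_snapshot "t1_snapshot", "t1"

Once the copy completes, re-enable the peer on cluster A so the edits queued while it was disabled are shipped: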

hbase shell> enable_peer 'B'

10. HBase replication-related parameters

<property>
<name>hbase.replication</name>
<value>true</value>
<description>Allow HBase tables to be replicated.</description>
</property>

<property>
<name>replication.source.nb.capacity</name>
<value>25000</value>
<description>Maximum number of edits shipped to the sink in a single replication batch; the default is 25000.</description>
</property>

<property>
<name>replication.source.ratio</name>
<value>0.1</value>
<description>Fraction of RegionServers in the peer cluster chosen as potential replication sinks; the default is 0.1.</description>
</property>

<property>
<name>replication.source.size.capacity</name>
<value>67108864</value>
<description>Maximum size of the edits shipped to the sink in a single replication batch; the default is 64 MB.</description>
</property>

<property>
<name>replication.sleep.before.failover</name>
<value>2000</value>
<description>Time in milliseconds to wait before transferring a dead RegionServer's replication queues to another RegionServer.</description>
</property>

<property>
<name>replication.executor.workers</name>
<value>1</value>
<description>Number of worker threads used by the replication executor (for example, to claim queues from dead RegionServers); the default is 1.</description>
</property>
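These properties belong in hbase-site.xml on the source cluster's RegionServers and take effect after a restart. One way to double-check what a running RegionServer actually picked up is its /conf servlet; the hostname and info port below are placeholders:

curl -s "http://regionserver.example.com:16030/conf?format=json" | python -m json.tool | grep -i "replication.source"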

Known Issues

1. HBASE-18111

The cluster connection is aborted when the ZooKeeperWatcher receives an AuthFailed event; the HBaseInterClusterReplicationEndpoint's replicate() method then gets stuck in a while loop.

One symptom is that a jstack on the RegionServer shows:

java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)

This is fixed in 1.3.3, 1.4.0, and 2.0.0.
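To check whether a RegionServer is currently stuck in this loop, take a thread dump of its process and look for the endpoint class; the pid lookup below is a sketch and depends on how the process is run:

RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
jstack "$RS_PID" | grep -A 6 "HBaseInterClusterReplicationEndpoint.replicate"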

2.  HBASE-24359

Replication gets stuck after a column family is deleted from both the source and the sink while the source still has outstanding edits for it that it can no longer ship. All replication then backs up behind these unreplicatable edits.

The fix introduces a new config, hbase.replication.drop.on.deleted.columnfamily, which defaults to false. When set to true, replication drops edits for column families that have been deleted from both the replication source and target.
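In the same hbase-site.xml style as section 10, enabling it looks like this (only when dropping those edits is acceptable):

<property>
<name>hbase.replication.drop.on.deleted.columnfamily</name>
<value>true</value>
<description>Drop replication edits for column families that have been deleted from both the source and the target.</description>
</property>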

This is fixed in 2.3.0 and 3.0.0.

References

https://blog.cloudera.com/what-are-hbase-znodes/

https://blog.cloudera.com/apache-hbase-replication-overview/

https://blog.cloudera.com/online-apache-hbase-backups-with-copytable/

https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/fault-tolerance/content/manually_enable_hbase_replication.html

https://blog.cloudera.com/introduction-to-apache-hbase-snapshots/

 
