Tuesday, July 10, 2018

Hive: Different lock behaviors between DummyTxnManager and DbTxnManager

Goal

Understand the different lock behaviors between DummyTxnManager and DbTxnManager in Hive. The example query is INSERT into a partition table.

Env

Hive 2.1
MapR 5.2

Test Preparation

1. Create a Hive partition table "testpart" as described in http://www.openkb.info/2014/11/hive-locks-tablepartition-level.html

2. Create 2 child partitions with data for "testpart":

INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords;

INSERT OVERWRITE TABLE testpart
PARTITION(one = 'c', two = 'd')
SELECT  * FROM passwords;

3. Hive Metastore has 2 types of lock configurations

a. DummyTxnManager
It means to set below in hive-site.xml and restart Hive Metastore:
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>

<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager</value>
</property>
b. DbTxnManager
It means to set below in hive-site.xml and restart Hive Metastore:
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>

<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>

4. We will test below 2 types of INSERT

a. Dynamic Partition Insert
set hive.exec.dynamic.partition.mode=nostrict;
INSERT OVERWRITE TABLE testpart
PARTITION(one,two)
SELECT *,'a','b' FROM passwords;
b. Static Partition Insert
INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords;

Solution

1. DummyTxnManager + Dynamic Partition Insert

Here we open 2 Hive CLI sessions, and run one Dynamic Partition Insert in each session.
The locks are:
hive>    show locks extended;
OK
default@passwords SHARED
LOCK_QUERYID:mapr_20180710102038_18601ec0-9f6d-4305-8446-d6c895fc5013
LOCK_TIME:1531243238410
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one,two)
SELECT *,'a','b' FROM passwords
default@passwords SHARED
LOCK_QUERYID:mapr_20180710102138_a9494859-ff06-4359-895c-00eb8c348ffd
LOCK_TIME:1531243299188
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one,two)
SELECT *,'a','b' FROM passwords
default@testpart SHARED
LOCK_QUERYID:mapr_20180710102038_18601ec0-9f6d-4305-8446-d6c895fc5013
LOCK_TIME:1531243238411
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one,two)
SELECT *,'a','b' FROM passwords
default@testpart SHARED
LOCK_QUERYID:mapr_20180710102138_a9494859-ff06-4359-895c-00eb8c348ffd
LOCK_TIME:1531243299189
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one,two)
SELECT *,'a','b' FROM passwords
Time taken: 0.035 seconds, Fetched: 28 row(s)

Both of the Dynamic Partition Inserts(into the same leaf partition) can run concurrently because each of the INSERT only needs:
  • SHARED lock on parent table -- "testpart"

2. DummyTxnManager + Static Partition Insert

Here we open 2 Hive CLI sessions, and run one Static Partition Insert in each session.
The locks are:
hive>    show locks extended;
OK
default@passwords SHARED
LOCK_QUERYID:mapr_20180710104550_8f7fd2d8-7bbc-4e8c-983f-94549ff4cda8
LOCK_TIME:1531244750212
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords
default@passwords SHARED
LOCK_QUERYID:mapr_20180710104604_92fb819d-8cd1-4a1c-bbc5-60db7c59cc51
LOCK_TIME:1531244764648
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords
default@testpart SHARED
LOCK_QUERYID:mapr_20180710104604_92fb819d-8cd1-4a1c-bbc5-60db7c59cc51
LOCK_TIME:1531244764648
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords
default@testpart SHARED
LOCK_QUERYID:mapr_20180710104550_8f7fd2d8-7bbc-4e8c-983f-94549ff4cda8
LOCK_TIME:1531244750212
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords
default@testpart@one=a SHARED
LOCK_QUERYID:mapr_20180710104604_92fb819d-8cd1-4a1c-bbc5-60db7c59cc51
LOCK_TIME:1531244764648
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords
default@testpart@one=a SHARED
LOCK_QUERYID:mapr_20180710104550_8f7fd2d8-7bbc-4e8c-983f-94549ff4cda8
LOCK_TIME:1531244750212
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords
default@testpart@one=a/two=b EXCLUSIVE
LOCK_QUERYID:mapr_20180710104550_8f7fd2d8-7bbc-4e8c-983f-94549ff4cda8
LOCK_TIME:1531244750212
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords
Time taken: 0.027 seconds, Fetched: 49 row(s)
Only one of the Static Partition Inserts(into the same leaf partition) can run at one time because each of the INSERT needs:
  • SHARED lock on parent table -- "testpart"
  • SHARED lock on parent partition(s) -- "testpart@one=a"
  • EXCLUSIVE lock on the leaf partition -- "testpart@one=a/two=b".
So the 2nd Static Partition Insert into the same leaf partition "testpart@one=a/two=b" will be waiting for the EXCLUSIVE lock on the leaf partition.
From above output, the 2nd Static Partition Insert's query ID is mapr_20180710104604_92fb819d-8cd1-4a1c-bbc5-60db7c59cc51.
After the 1st Static Partition Insert completed, then 2nd Static Partition Insert started.
At that time, the locks became:
hive>    show locks extended;
OK
default@passwords SHARED
LOCK_QUERYID:mapr_20180710104604_92fb819d-8cd1-4a1c-bbc5-60db7c59cc51
LOCK_TIME:1531244764648
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords
default@testpart SHARED
LOCK_QUERYID:mapr_20180710104604_92fb819d-8cd1-4a1c-bbc5-60db7c59cc51
LOCK_TIME:1531244764648
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords
default@testpart@one=a SHARED
LOCK_QUERYID:mapr_20180710104604_92fb819d-8cd1-4a1c-bbc5-60db7c59cc51
LOCK_TIME:1531244764648
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords
default@testpart@one=a/two=b EXCLUSIVE
LOCK_QUERYID:mapr_20180710104604_92fb819d-8cd1-4a1c-bbc5-60db7c59cc51
LOCK_TIME:1531244764648
LOCK_MODE:IMPLICIT
LOCK_QUERYSTRING:INSERT OVERWRITE TABLE testpart
PARTITION(one = 'a', two = 'b')
SELECT  * FROM passwords
Time taken: 0.017 seconds, Fetched: 28 row(s)

Due to this lock behaviors, it means, Static Partition Inserts into different leaf partitions can run concurrently because the EXCLUSIVE lock is at leaf partition level.
Of course, SELECT queries on other leaf partitions(without EXCLUSIVE lock) can also run concurrently with the Static Partition Insert.
I have done tests to prove this theory, but I will not share the complete output here.

3. DbTxnManager + Dynamic Partition Insert

Running one Dynamic Partition Insert in Hive CLI session, and the locks are:
hive> show locks extended;
OK
Lock ID Database Table Partition State Blocked By Type Transaction ID Last Hearbeat Acquired At User Hostname Agent Info
12.1 default passwords NULL ACQUIRED              SHARED_READ NULL 1531256068000 1531256068000 mapr s4.poc.com mapr_20180710135425_b4e412b0-ddb9-4c3a-92bc-fda863df71d7
12.2 default testpart NULL ACQUIRED              EXCLUSIVE NULL 1531256068000 1531256068000 mapr s4.poc.com mapr_20180710135425_b4e412b0-ddb9-4c3a-92bc-fda863df71d7
Time taken: 0.012 seconds, Fetched: 3 row(s)
Only one of the Dynamic Partition Insert can run at one time because each of the INSERT needs:
  • EXCLUSIVE lock on parent table -- "testpart".
Due to this lock behaviors, it means, nothing else(SELECT/INSERT/DDL on the same table or even on different leaf partitions) can run together with Dynamic Partition Insert.

4. DbTxnManager + Static Partition Insert

Running one Static Partition Insert in Hive CLI session, and the locks are:
hive> show locks extended;
OK
Lock ID Database Table Partition State Blocked By Type Transaction ID Last Hearbeat Acquired At User Hostname Agent Info
13.1 default passwords NULL ACQUIRED              SHARED_READ NULL 1531256273000 1531256273000 mapr s4.poc.com mapr_20180710135753_3086d4dc-b157-429e-9604-c8a094468a3e
13.2 default testpart one=a/two=b ACQUIRED              EXCLUSIVE NULL 1531256273000 1531256273000 mapr s4.poc.com mapr_20180710135753_3086d4dc-b157-429e-9604-c8a094468a3e
Time taken: 0.01 seconds, Fetched: 3 row(s)
Only one of the Static Partition Inserts(into the same leaf partition) can run at one time because each of the INSERT needs:
  • EXCLUSIVE lock on the leaf partition -- "testpart@one=a/two=b".
Due to this lock behaviors, it means, Static Partition Inserts into different leaf partitions can run concurrently because the EXCLUSIVE lock is at leaf partition level.
See below locks:
hive> show locks extended;
OK
Lock ID Database Table Partition State Blocked By Type Transaction ID Last Hearbeat Acquired At User Hostname Agent Info
18.1 default passwords NULL ACQUIRED              SHARED_READ NULL 1531258035000 1531258035000 mapr s4.poc.com mapr_20180710142715_d40501ee-9267-47de-bf6a-9d18ef4a1df7
19.1 default passwords NULL ACQUIRED              SHARED_READ NULL 1531258040000 1531258040000 mapr s4.poc.com mapr_20180710142720_a1be323c-905d-4f4c-b7dc-33634df68a91
18.2 default testpart one=a/two=b ACQUIRED              EXCLUSIVE NULL 1531258035000 1531258035000 mapr s4.poc.com mapr_20180710142715_d40501ee-9267-47de-bf6a-9d18ef4a1df7
19.2 default testpart one=c/two=d ACQUIRED              EXCLUSIVE NULL 1531258040000 1531258040000 mapr s4.poc.com mapr_20180710142720_a1be323c-905d-4f4c-b7dc-33634df68a91
Time taken: 0.012 seconds, Fetched: 5 row(s)

However due to my test, SELECT on parent table or SELECT on other leaf partition(s) will be blocked.
Here we open 2 Hive CLI sessions. One session is running Static Partition Insert on partition(one=a/two=b). The other session is running SELECT on another partition(one=c/two=d).
The locks are:
hive> show locks extended;
OK
Lock ID Database Table Partition State Blocked By Type Transaction ID Last Hearbeat Acquired At User Hostname Agent Info
16.1 default passwords NULL ACQUIRED              SHARED_READ NULL 1531257690000 1531257690000 mapr s4.poc.com mapr_20180710142129_e99fa6fc-2cfa-4e66-a596-d83e10285802
16.2 default testpart one=a/two=b ACQUIRED              EXCLUSIVE NULL 1531257690000 1531257690000 mapr s4.poc.com mapr_20180710142129_e99fa6fc-2cfa-4e66-a596-d83e10285802
17.1 default testpart NULL WAITING 16.2 SHARED_READ NULL 1531257724000 NULL mapr s4.poc.com mapr_20180710142136_21312ed9-f1a1-442a-8545-1942885fbc77
17.2 default testpart one=c/two=d WAITING              SHARED_READ NULL 1531257724000 NULL mapr s4.poc.com mapr_20180710142136_21312ed9-f1a1-442a-8545-1942885fbc77
Time taken: 0.006 seconds, Fetched: 5 row(s)
As we can see from above, the SELECT on another partition(one=c/two=d) will wait for SHARED lock on the parent table which is blocked by the EXCLUSIVE lock on partition(one=a/two=b).

Key Takeaways

1. DummyTxnManager + Dynamic Partition Insert
  •     SHARED lock on parent table -- "testpart"
Below types of SQL can run together with this Dynamic Partition Insert:
  • Dynamic Partition Inserts(no matter into the same or different leaf partition)
  • SELECT(on parent table or on any partition)

2. DummyTxnManager + Static Partition Insert
  •     SHARED lock on parent table -- "testpart"
  •     SHARED lock on parent partition(s) -- "testpart@one=a"
  •     EXCLUSIVE lock on the leaf partition -- "testpart@one=a/two=b".
Below types of SQL can run together with this Dynamic Partition Insert:
  • Static Partition Insert into different leaf partitions
  • SELECT on other leaf partitions(without EXCLUSIVE lock)
3. DbTxnManager + Dynamic Partition Insert
  •     EXCLUSIVE lock on parent table -- "testpart".
Nothing else(SELECT/INSERT/DDL on the same table or even on different leaf partitions) can run together with this Dynamic Partition Insert.
All of them will be blocked.

4. DbTxnManager + Static Partition Insert
  •     EXCLUSIVE lock on the leaf partition -- "testpart@one=a/two=b".   
Below types of SQL can run together with this Static Partition Insert:
  • Static Partition Insert into different leaf partitions

4 comments:

  1. Is it necessary to use hive transactions and ORC tables to take advantage of the exclusive lock by DbTxnManager + Dynamic Partition Insert ,
    or could I just use 'DbTxnManager + Dynamic Partition Insert' without hive transactions on parquet tables ?

    ReplyDelete
    Replies
    1. It is not a requirement to use ORC or parquet format.
      Above tests are done using text format tables.

      "Hive transaction" is a different feature which only works on ORC format.

      Delete
  2. need a way to figure out which hive query is holding up the locks, but when we enable (DbTxnManager) hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager then the show locks extended on beeline does not show the actual query holding up the lock like it did under DummyTransationManager,

    so to figure out the query holding up the lock(s) which tables in the mysql metastore should I probe and how would one figure out the actual (HS2 based) query holding up the locks pls

    ReplyDelete
  3. It permits you to really look at the ordering status and upgrade the perceivability of your websites.
    onohosting.com/

    ReplyDelete

Popular Posts