Thursday, February 19, 2015

What factors control the number of MapR File System Read Ahead threads when Impala reads Parquet tables?

Impala on MapR triggers Read Ahead (RA) threads by default.
They show up as "mapr::fs::RAThread::runner" in the output of pstack <impalad_pid>.
When Impala reads a Parquet table, the number of RA threads is controlled by the following factors:
  1. Number of columns of the Parquet table
  2. Table Size
  3. Number of Impalad processes
The tests below are meant to verify these assumptions.

Env:

MapR 4.0.1 + Impala 1.4.1
3 Impalad processes

Preparation:

1. Create several Parquet tables of similar size, but with different numbers of columns.

Table Name                       Table Size  Number of Columns
parquet_table_passwords_5        1.0 G       5
parquet_table_passwords_10       1.0 G       10
parquet_table_passwords_20       1.0 G       20
parquet_table_passwords_20_half  526.5 M     20

Sample DDL:
 CREATE TABLE default.parquet_table_passwords_5( 
   col0 STRING,                            
   col1 STRING,                
   col2 STRING,  
   col3 STRING, 
   col4 STRING               
 )                   
 STORED AS PARQUET ;   
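
The DDL for the wider tables follows the same pattern, so it can be generated rather than typed by hand. The following is a minimal sketch; gen_parquet_ddl is a hypothetical helper name, and it assumes every column is a STRING named col0, col1, ... as in the sample above.

```shell
#!/bin/bash
# Print a CREATE TABLE statement for a Parquet table with N STRING columns.
# gen_parquet_ddl is a hypothetical helper; the column naming (col0..colN-1)
# follows the sample DDL.
gen_parquet_ddl() {
    local table="$1" ncols="$2" i
    echo "CREATE TABLE default.${table}("
    for ((i = 0; i < ncols; i++)); do
        # Every column except the last is followed by a comma.
        if ((i < ncols - 1)); then
            echo "  col${i} STRING,"
        else
            echo "  col${i} STRING"
        fi
    done
    echo ")"
    echo "STORED AS PARQUET;"
}
```

For example, "gen_parquet_ddl parquet_table_passwords_10 10" prints the statement for the 10-column table, which can then be pasted into impala-shell.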

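2. Start a monitoring script that counts the number of RA threads spawned by each impalad process at a regular interval (every 60 seconds in the C++ client script below, every 10 seconds in the Java client script).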

First, see [this article] to understand how to switch the file client.
If the MapR C++ File Client is used (the default), use the script below:
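# cat monitorRA.sh
#!/bin/bash
# Every 60 seconds, count the RA threads in the impalad process on each node.
while true; do
    date
    # \$(...) is escaped so that pgrep runs on each remote node, not locally.
    clush -a "pstack \$(pgrep impalad) | grep mapr::fs::RAThread::runner | wc -l"
    sleep 60
done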
Note: pstack actually calls gdb, which blocks the process while it runs, so be careful when using it in production.
If the Hadoop Java File Client is used, run the script below as the user who started impalad:
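# cat monitor_ra_java.sh
#!/bin/bash
# Every 10 seconds, count the "MapR RA" threads in the impalad JVM on each node.
while true; do
    date
    # \$(...) is escaped so that pgrep runs on each remote node, not locally.
    clush -a "jstack \$(pgrep impalad) | grep \"MapR RA\" | wc -l"
    sleep 10
done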
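
Each sample from either script is a timestamp followed by one line per node. To compare runs it helps to have the cluster-wide total as well. The following is a rough sketch; sum_ra_counts is a hypothetical helper name, and it assumes clush prints each result as "<node>: <count>".

```shell
#!/bin/bash
# Sum the per-node RA thread counts from one sample of monitor output.
# Assumes each clush line has the form "<node>: <count>". Any other line,
# such as the timestamp printed by date, is ignored.
sum_ra_counts() {
    awk '/^[^:]+: *[0-9]+$/ { n = $0; sub(/^[^:]+: */, "", n); total += n }
         END { print total + 0 }'
}
```

The function reads one sample on standard input, for example the lines saved from a single iteration of the monitoring loop.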

Lab Tests

Each of the tests below runs "select * from <table_name> limit 100000" repeatedly, one query at a time, while counting the RA threads in each impalad process.
Here are the test results for each factor.
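
The query loop can be driven from the shell. The following is only a sketch of one way to do it, not the exact driver used for these tests; run_test_query and the IMPALA_SHELL variable are hypothetical names, and the sketch relies on the -q option of impala-shell to execute a single query string.

```shell
#!/bin/bash
# Run the test query against a table a fixed number of times, one at a time.
# IMPALA_SHELL and run_test_query are hypothetical names for this sketch.
IMPALA_SHELL="${IMPALA_SHELL:-impala-shell}"

run_test_query() {
    local table="$1" runs="$2" i
    for ((i = 1; i <= runs; i++)); do
        # Discard the result rows; only the read activity matters here.
        "$IMPALA_SHELL" -q "select * from ${table} limit 100000" > /dev/null
    done
}
```

For example, "run_test_query parquet_table_passwords_20 50" issues the query 50 times in sequence while the monitoring script runs in another terminal.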

Factor 1. Number of columns of the Parquet table

Table Name                  Number of RA Threads                 Number of Columns
parquet_table_passwords_5   Node1: 40,  Node2: 40,  Node3: 40    5
parquet_table_passwords_10  Node1: 80,  Node2: 80,  Node3: 80    10
parquet_table_passwords_20  Node1: 120, Node2: 160, Node3: 200   20
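
Adding up the three nodes and dividing by the column count makes the relationship in these results easier to see. The following is a small worked calculation on the numbers above; threads_per_column is a hypothetical helper name.

```shell
#!/bin/bash
# Total RA threads across the nodes divided by the column count,
# using the per-node numbers from the Factor 1 results.
threads_per_column() {
    local columns="$1" total=0 n
    shift
    for n in "$@"; do
        total=$((total + n))
    done
    echo "$((total / columns))"
}

threads_per_column 5 40 40 40       # 5 columns
threads_per_column 10 80 80 80      # 10 columns
threads_per_column 20 120 160 200   # 20 columns
```

All three calls print 24, so in this environment the cluster-wide total grows in proportion to the number of columns, even though the 20-column table is spread unevenly across the nodes.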

Factor 2. Table Size

Table Name                       Number of RA Threads                 Table Size
parquet_table_passwords_20       Node1: 120, Node2: 160, Node3: 200   1.0 G
parquet_table_passwords_20_half  Node1: 80,  Node2: 80,  Node3: 80    526.5 M
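
The same kind of check works for table size. The following is a rough calculation on the numbers above; threads_per_mb is a hypothetical helper name, and the sketch assumes that 1.0 G means 1024 MB.

```shell
#!/bin/bash
# RA threads per MB of table data: cluster-wide total / table size in MB.
threads_per_mb() {
    awk -v total="$1" -v size_mb="$2" 'BEGIN { printf "%.2f\n", total / size_mb }'
}

threads_per_mb 480 1024     # parquet_table_passwords_20, assuming 1.0 G = 1024 MB
threads_per_mb 240 526.5    # parquet_table_passwords_20_half
```

The two calls print 0.47 and 0.46, so with the column count fixed at 20 the total number of RA threads is roughly proportional to the table size.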

Factor 3. Number of Impalad processes

For this test, we repeated the run after stopping one impalad process, and again after stopping two.
Table Name                  Number of RA Threads                 Number of Impalad
parquet_table_passwords_20  Node1: 120, Node2: 160, Node3: 200   3
parquet_table_passwords_20  Node1: 200, Node2: 280               2
parquet_table_passwords_20  Node1: 480                           1

Conclusion

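1. The more columns a Parquet table has, the more RA threads are spawned.
2. The larger the table, the more RA threads are spawned.
3. The total number of RA threads across the cluster does not change when impalad processes are stopped (480 in every run of the Factor 3 test); the threads are spread over the remaining impalad processes.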


