Wednesday, April 24, 2019

Some reducer output files are not moved into final output directory after a MapReduce Job with customized FileOutputCommitter is migrated from MRv1 to MRv2.

Symptom:

After a MapReduce Job with customized FileOutputCommitter got migrated from MRv1 to MRv2, some reducer output files are not moved into final output directory.

Take a sample WordCount job with a customized FileOutputCommitter for example.
In this specific case, only 1st reducer's output files are moved into final output directory.
All other reducers' output files are still sitting in the temporary location:
/hao/wordfinal/part_00000/part-r-00000
/hao/wordfinal/part_00001/_temporary/1/task_1554837836642_0091_r_000001/part-r-00001
/hao/wordfinal/part_00002/_temporary/1/task_1554837836642_0091_r_000001/part-r-00002
The expected final output files should be:
/hao/wordfinal/part_00000/part-r-00000
/hao/wordfinal/part_00001/part-r-00001
/hao/wordfinal/part_00002/part-r-00002
Note: The same job works fine in MRv1 though before migrating to MRv2.

Tuesday, April 23, 2019

What is the difference between mapreduce.fileoutputcommitter.algorithm.version=1 and 2

Goal:

This article explains the difference between mapreduce.fileoutputcommitter.algorithm.version=1 and 2 using a sample wordcount job.

How to customize FileOutputCommitter for MapReduce job by overwriting Output Format Class

Goal:

This article explains how to customize FileOutputCommitter for MapReduce job by overwriting Output Format Class.
This can be used to change the output directory, customize the file name etc.

Monday, March 11, 2019

How to use snapshot feature of MapR-FS to isolate Hive reads and writes

Goal:

As we know, Hive locks can be used to isolate reads and writes.
Please refer to my previous blogs for details:
http://www.openkb.info/2014/11/hive-locks-tablepartition-level.html
http://www.openkb.info/2018/07/hive-different-lock-behaviors-between.html

That lock mechanism is maintained by Hive and it could have performance overhead or unexpected lock behavior based on your application logic.
This article shows another way to use snapshot feature of MapR-FS to isolate Hive reads and writes.

Wednesday, February 6, 2019

"msck repair" of a partition table fails with error "Expecting only one partition but more than one partitions are found.".

Symptom:

In spark-shell, "msck repair" of a partition table named "database_name.table_name" fails with error "Expecting only one partition but more than one partitions are found.".

Thursday, January 24, 2019

Friday, January 11, 2019

How to configure Drill to use cgroups to hard limit CPU resource in Redhat/CentOS 7

Goal:

This article explains how to configure Drill to use cgroups to hard limit CPU resource in Redhat/CentOS 7.

Friday, November 16, 2018

How to write an example MapR Drill JDBC code which connects to a MapR-Drill Cluster with MapRSASL authentication

Goal:

How to write an example MapR Drill JDBC code which connects to a MapR-Drill Cluster with MapRSASL authentication.
The reason why we are using Simba Drill JDBC Driver instead of open source JDBC Driver is:
The open-source JDBC driver is not tested on the MapR Converged Data Platform. The driver supports Kerberos and Plain authentication mechanisms, but does not support the MapR-SASL authentication mechanism. 

Monday, November 12, 2018

How to install MapR Data Fabric for Kubernetes(KDF) on Kubernetes Cluster

Goal:

This article explains the steps in details on how to install MapR Data Fabric for Kubernetes(KDF) on Kubernetes Cluster.

Friday, November 9, 2018

How to start Dashboard in a test Kubernetes Cluster

Goal:

How to start Dashboard in a test Kubernetes Cluster.
Dashboard is a web-based Kubernetes user interface. You can use Dashboard to deploy containerized applications to a Kubernetes cluster, troubleshoot your containerized application, and manage the cluster itself along with its attendant resources.

Popular Posts