Hive fails to read the parquet table created by Impala

Wednesday, May 13, 2015

Env:

Hive 0.13
Impala 1.4.1

Symptom:

Hive fails to read a Parquet table created by Impala and reports the following error:

FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class parquet.hive.serde.ParquetHiveSerDe not found)

Root Cause:

Parquet tables created by Impala use a different SerDe, InputFormat, and OutputFormat than Parquet tables created by Hive.
Impala parquet table:

| SerDe Library:               | parquet.hive.serde.ParquetHiveSerDe        | NULL                 |
| InputFormat:                 | parquet.hive.DeprecatedParquetInputFormat  | NULL                 |
| OutputFormat:                | parquet.hive.DeprecatedParquetOutputFormat | NULL                 |

Hive parquet table:

| SerDe Library:               | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe    | NULL                 |
| InputFormat:                 | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat  | NULL                 |
| OutputFormat:                | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL                 |

However, Hive's libraries do not include the class "parquet.hive.serde.ParquetHiveSerDe".
Impala can still read this table because it has its own Parquet implementation, written separately in C++.
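To see which classes a particular table is registered with, the same three rows can be pulled out of DESCRIBE FORMATTED. The following is a minimal shell sketch, assuming impala-shell is on the PATH; the function name show_storage_classes and the IMPALA_SHELL override variable are made up for illustration and are not part of Impala.

```shell
# Hypothetical helper: print only the SerDe/InputFormat/OutputFormat rows
# of DESCRIBE FORMATTED for the table named in $1.
# IMPALA_SHELL is an illustrative override; it defaults to impala-shell.
show_storage_classes() {
    "${IMPALA_SHELL:-impala-shell}" -q "DESCRIBE FORMATTED $1" |
        grep -E 'SerDe Library|InputFormat|OutputFormat'
}
```

For example, "show_storage_classes my_parquet_tbl" (a hypothetical table name) should print rows like the ones listed above, which shows whether the table was registered with the parquet.hive.* classes or the org.apache.hadoop.hive.ql.io.parquet.* ones.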

Solution:

1. Download the parquet-hive-bundle jar from Maven Central and put it in the Hive metastore lib directory.
For example, put parquet-hive-bundle-1.6.0.jar into /opt/mapr/hive/hive-0.13/lib on the node where the Hive metastore is running.
2. Restart the Hive metastore on that node so that commands like "desc <tablename>" work.
3. Before running Hive queries on Parquet tables created by Impala, add parquet-hive-bundle-1.6.0.jar as an auxiliary JAR, following this article.
For example, if you are using the Hive CLI, just run:

hive> add jar /opt/mapr/hive/hive-0.13/lib/parquet-hive-bundle-1.6.0.jar;
Added /opt/mapr/hive/hive-0.13/lib/parquet-hive-bundle-1.6.0.jar to class path
Added resource: /opt/mapr/hive/hive-0.13/lib/parquet-hive-bundle-1.6.0.jar

After that, run some queries on that parquet table to verify.
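Before restarting the metastore in step 2, it may be worth confirming that the downloaded jar really contains the classes named in the error messages. Here is a rough sketch, assuming unzip is installed; jar_has_class is a hypothetical helper, not something shipped with Hive.

```shell
# Hypothetical helper: succeed if the jar ($1) lists the fully qualified
# class name ($2), and fail otherwise.
jar_has_class() {
    # parquet.hive.serde.ParquetHiveSerDe -> parquet/hive/serde/ParquetHiveSerDe.class
    entry="$(printf '%s' "$2" | tr . /).class"
    # A jar is a zip archive, so its entries can be listed with unzip -l.
    unzip -l "$1" | grep -qF "$entry"
}
```

It could be called as "jar_has_class /opt/mapr/hive/hive-0.13/lib/parquet-hive-bundle-1.6.0.jar parquet.hive.serde.ParquetHiveSerDe && echo found", and likewise for parquet.hive.DeprecatedParquetInputFormat. The match is a plain substring check on the listing, which is enough for a quick sanity test.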

Note:
Adding the parquet-hive-bundle jar to the Hive metastore lib directory lets the Hive metastore return the metadata of that table properly.

Adding the parquet-hive-bundle jar as an auxiliary JAR before running Hive queries makes sure that the MapReduce jobs spawned by Hive can find these classes at run time.
Otherwise the error below may show up (assuming an MRv2 job):

Error: java.io.IOException: cannot find class parquet.hive.DeprecatedParquetInputFormat
 at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:584)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:172)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:414)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
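
To avoid typing "add jar" in every session, the jar can also be passed when the Hive CLI starts, using its --auxpath option. The wrapper below is only a sketch under the paths used in this post; hive_with_parquet_bundle and PARQUET_BUNDLE_JAR are illustrative names, not Hive settings.

```shell
# Path from the example above; adjust it to your installation.
PARQUET_BUNDLE_JAR=/opt/mapr/hive/hive-0.13/lib/parquet-hive-bundle-1.6.0.jar

# Hypothetical wrapper: start the Hive CLI with the bundle on the auxiliary
# path and pass any remaining arguments through unchanged.
hive_with_parquet_bundle() {
    hive --auxpath "$PARQUET_BUNDLE_JAR" "$@"
}
```

For instance, "hive_with_parquet_bundle -e 'SELECT COUNT(*) FROM my_parquet_tbl;'" (my_parquet_tbl is a placeholder table name) would typically launch a MapReduce job, so it exercises the same code path that produced the stack trace above.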
