How to build and use parquet-tools to read parquet files

Monday, February 23, 2015

Goal:

How to build and use parquet-tools to read parquet files.

Solution:

1. Download and install Maven.

Follow the instructions at:
http://maven.apache.org/download.cgi

2. Download the parquet source code

git clone https://github.com/Parquet/parquet-mr.git

3. Build parquet-tools.

cd parquet-mr/parquet-tools/
mvn clean package -Plocal

The resulting jar is target/parquet-tools-<version>.jar, for example target/parquet-tools-1.6.1-SNAPSHOT.jar.

Note: the build may fail with an error such as:

Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository

This happens because pom.xml points to version 1.6.0rc3-SNAPSHOT, which does not exist in https://oss.sonatype.org/content/repositories/snapshots/com/twitter/parquet-hadoop/ .
The fix is to change the version in both parquet-mr/pom.xml and parquet-mr/parquet-tools/pom.xml to a valid one, for example:
<version>1.6.1-SNAPSHOT</version>

4. Show the help message

cd target
java -jar parquet-tools-1.6.1-SNAPSHOT.jar --help
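
Every later step repeats the same java -jar prefix, so a small shell function can shorten the commands. This is only a convenience sketch: pq and the PARQUET_TOOLS_JAR variable are hypothetical names, and the default jar path assumes the 1.6.1-SNAPSHOT build from step 3 sits in the current directory.

```shell
# Hypothetical wrapper: forwards all arguments to the parquet-tools jar.
# PARQUET_TOOLS_JAR is an assumed variable name; point it at the jar
# built in step 3 if it is not in the current directory.
pq() {
    java -jar "${PARQUET_TOOLS_JAR:-parquet-tools-1.6.1-SNAPSHOT.jar}" "$@"
}

# Usage:
#   pq --help
#   pq schema /tmp/nation.parquet
```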

5. Dump the schema

The examples below use a sample file, /tmp/nation.parquet.

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet
message root {
  required int64 N_NATIONKEY;
  required binary N_NAME (UTF8);
  required int64 N_REGIONKEY;
  required binary N_COMMENT (UTF8);
}

6. Read the data

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar cat /tmp/nation.parquet
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT =  haggle. carefully f

(... ...)

7. Read the first n records

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT =  haggle. carefully f

N_NATIONKEY = 1
N_NAME = ARGENTINA
N_REGIONKEY = 1
N_COMMENT = al foxes promise sly

N_NATIONKEY = 2
N_NAME = BRAZIL
N_REGIONKEY = 1
N_COMMENT = y alongside of the p
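
The cat and head commands print one "name = value" line per field, with a blank line between records. For quick inspection it can help to flatten that into one tab-separated line per record. The awk sketch below does so, assuming the output keeps exactly this layout and that values contain no tabs or newlines. records_to_tsv is a hypothetical helper, not a parquet-tools feature.

```shell
# Hypothetical helper: read "name = value" records (blank-line separated)
# on stdin and print one tab-separated line of values per record.
records_to_tsv() {
    awk '
        /^[ \t]*$/ {                 # a blank line ends the current record
            if (n) print row
            n = 0; row = ""
            next
        }
        {
            i = index($0, " = ")     # split on the first " = " only
            if (i == 0) next         # skip lines that are not fields
            if (n++) row = row "\t"
            row = row substr($0, i + 3)
        }
        END { if (n) print row }     # flush the last record
    '
}

# Usage:
#   java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet | records_to_tsv
```

Leading spaces in a value, such as the one in the first N_COMMENT above, are preserved because only the first " = " separator is stripped.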

8. Show the file metadata

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet
file:        file:/tmp/nation.parquet
creator:     parquet-mr

file schema: root
--------------------------------------------------------------------------------
N_NATIONKEY: REQUIRED INT64 R:0 D:0
N_NAME:      REQUIRED BINARY O:UTF8 R:0 D:0
N_REGIONKEY: REQUIRED INT64 R:0 D:0
N_COMMENT:   REQUIRED BINARY O:UTF8 R:0 D:0

row group 1: RC:25 TS:1352 OFFSET:4
--------------------------------------------------------------------------------
N_NATIONKEY:  INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
N_NAME:       BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
N_REGIONKEY:  INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
N_COMMENT:    BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED
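
In the sample above, each SZ field is consistent with the reading compressed bytes / uncompressed bytes / ratio (for N_NATIONKEY, 219 / 130 is about 1.68), and the uncompressed column sizes add up to the row group's TS value (219 + 296 + 218 + 619 = 1352). The sketch below re-derives those totals from meta output with awk. It assumes the column lines keep the SZ:a/b/c layout shown above; sum_sizes is a hypothetical helper name.

```shell
# Hypothetical helper: sum the compressed and uncompressed column sizes
# found in "SZ:compressed/uncompressed/ratio" fields on stdin.
sum_sizes() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^SZ:[0-9]+\/[0-9]+\//) {
                    split(substr($i, 4), part, "/")   # drop the "SZ:" prefix
                    compressed += part[1]
                    uncompressed += part[2]
                }
            }
        }
        END { printf "compressed=%d uncompressed=%d\n", compressed, uncompressed }
    '
}

# Usage:
#   java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet | sum_sizes
# For the sample output above this prints:
#   compressed=944 uncompressed=1352
```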

9. Dump all data

Note: values are printed column by column, not row by row.

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta  /tmp/nation.parquet
INT64 N_NATIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:0
value 2:  R:0 D:0 V:1
value 3:  R:0 D:0 V:2
(...)

BINARY N_NAME
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:ALGERIA
value 2:  R:0 D:0 V:ARGENTINA
value 3:  R:0 D:0 V:BRAZIL
(...)

INT64 N_REGIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:0
value 2:  R:0 D:0 V:1
value 3:  R:0 D:0 V:1
(...)

BINARY N_COMMENT
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V: haggle. carefully f
value 2:  R:0 D:0 V:al foxes promise sly
value 3:  R:0 D:0 V:y alongside of the p
(...)
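
Each value line in the dump carries a repetition level (R), a definition level (D), and the value itself (V). To pull just the values back out, for instance to compare a column between two files, a sed filter along these lines can work. It assumes the "value N:  R:x D:y V:..." layout shown above; dump_values is a hypothetical helper name.

```shell
# Hypothetical helper: keep only the V: payload of each "value N:" line
# read from stdin. Column headers, separators, and row group banners
# are dropped.
dump_values() {
    sed -n 's/^value [0-9][0-9]*: *R:[0-9][0-9]* D:[0-9][0-9]* V://p'
}

# Usage:
#   java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet | dump_values
```

The values of all columns come out in one stream, in the order the dump prints them, so split the dump by column header first if only one column is wanted.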

11 comments:

EOL (Eric O LEBIGOT), August 13, 2017 at 8:01 AM
Thanks for the compilation fix! Too bad that the project on GitHub does not include issues where this could be mentioned, because it is quite a useful fix.
Anonymous, December 7, 2017 at 4:56 PM
Can not clone this project.
error: fatal: repository 'https://github.com/Parquet/parquet-mr.git/' not found
Any idea how to clone it?
Unknown, November 3, 2019 at 8:32 PM
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project parquet-tools: Compilation failure: Compilation failure:
[ERROR] /C:/bigdata/parquet-mr/parquet-tools/src/main/java/org/apache/parquet/tools/command/DumpCommand.java:[286,27] cannot find symbol
[ERROR] symbol: method getCrc()
[ERROR] location: variable pageV1 of type org.apache.parquet.column.page.DataPageV1

I get the above error when running mvn clean package -Plocal.
Erica, October 14, 2024 at 6:16 PM
This is great.