Monday, February 23, 2015

How to build and use parquet-tools to read parquet files

Goal:

Build parquet-tools from source and use it to inspect parquet files: schema, data, and metadata.

Solution:

1. Download and install Maven.

Follow the instructions at:
http://maven.apache.org/download.cgi
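
Before moving on, it can help to confirm that the tools used in the following steps (git, java and mvn) are on the PATH. This is a minimal sketch assuming a POSIX shell; check_tool is a hypothetical helper name, not part of Maven or parquet-tools.

```shell
# Hypothetical helper: report whether a command is available on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

# The steps in this post use git (clone), mvn (build) and java (run the jar).
for tool in git java mvn; do
  check_tool "$tool"
done
```

Any tool reported as missing needs to be installed before the build in step 3 can run.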

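2. Download the parquet-mr source code.

git clone https://github.com/Parquet/parquet-mr.git
Note: the repository has since moved to https://github.com/apache/parquet-mr/ . If the URL above is not found, clone from there instead; the version numbers in the steps below were written against the original repository and may differ in the current code.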

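3. Build parquet-tools.

cd parquet-mr/parquet-tools/
mvn clean package -Plocal
The resulting jar is written to the target/ directory. Its file name includes the version set in pom.xml, for example target/parquet-tools-1.6.1-SNAPSHOT.jar.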

Note: the build may fail with an error such as:
Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository
This happens because pom.xml points to version 1.6.0rc3-SNAPSHOT, which does not exist in https://oss.sonatype.org/content/repositories/snapshots/com/twitter/parquet-hadoop/ .
The fix is to change the version in both parquet-mr/pom.xml and parquet-mr/parquet-tools/pom.xml to a valid one, for example:
<version>1.6.1-SNAPSHOT</version>
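
The version change can also be scripted instead of edited by hand. The sketch below assumes the old version appears in both files as a complete <version>...</version> element; replace_pom_version is a hypothetical helper, and the two version strings are the ones from the error and the fix above. Review the resulting diff before building, since other <version> elements with the same value would be rewritten too.

```shell
# Hypothetical helper: rewrite <version>OLD</version> to <version>NEW</version>
# in a single pom.xml file.
replace_pom_version() {
  old="$1"; new="$2"; pom="$3"
  # Escape dots so sed matches them literally instead of as "any character".
  old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
  sed "s/<version>${old_re}<\/version>/<version>${new}<\/version>/g" "$pom" > "$pom.tmp" \
    && mv "$pom.tmp" "$pom"
}

# Run from the parquet-mr directory; files that are not present are skipped.
for pom in pom.xml parquet-tools/pom.xml; do
  if [ -f "$pom" ]; then
    replace_pom_version 1.6.0rc3-SNAPSHOT 1.6.1-SNAPSHOT "$pom"
  fi
done
```

Writing to a temporary file and moving it into place avoids relying on sed -i, whose syntax differs between GNU and BSD sed.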

4. Show the help text

cd target
java -jar parquet-tools-1.6.1-SNAPSHOT.jar --help

5. Dump the schema

Take the sample file nation.parquet as an example.
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet
message root {
  required int64 N_NATIONKEY;
  required binary N_NAME (UTF8);
  required int64 N_REGIONKEY;
  required binary N_COMMENT (UTF8);
}

6. Read the data


# java -jar parquet-tools-1.6.1-SNAPSHOT.jar cat /tmp/nation.parquet
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT =  haggle. carefully f

(... ...)

7. Read first n records

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT =  haggle. carefully f

N_NATIONKEY = 1
N_NAME = ARGENTINA
N_REGIONKEY = 1
N_COMMENT = al foxes promise sly

N_NATIONKEY = 2
N_NAME = BRAZIL
N_REGIONKEY = 1
N_COMMENT = y alongside of the p 
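
Because cat and head print each record as plain "NAME = value" lines, the output can be post-processed with standard tools. A minimal sketch, assuming that output format; extract_field is a hypothetical helper that prints the values of one column, one per line.

```shell
# Hypothetical helper: read "NAME = value" lines on stdin and print the
# values whose NAME equals the first argument.
extract_field() {
  awk -v key="$1" '
    {
      pos = index($0, " = ")               # first " = " separates name and value
      if (pos > 0 && substr($0, 1, pos - 1) == key)
        print substr($0, pos + 3)          # everything after the separator
    }'
}

# Example usage (requires the jar and sample file from the steps above):
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet | extract_field N_NAME
```

Splitting on the first " = " only keeps values intact even if they contain " = " themselves.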

8. Show metadata


# java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet
file:        file:/tmp/nation.parquet
creator:     parquet-mr

file schema: root
--------------------------------------------------------------------------------
N_NATIONKEY: REQUIRED INT64 R:0 D:0
N_NAME:      REQUIRED BINARY O:UTF8 R:0 D:0
N_REGIONKEY: REQUIRED INT64 R:0 D:0
N_COMMENT:   REQUIRED BINARY O:UTF8 R:0 D:0

row group 1: RC:25 TS:1352 OFFSET:4
--------------------------------------------------------------------------------
N_NATIONKEY:  INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
N_NAME:       BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
N_REGIONKEY:  INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
N_COMMENT:    BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED
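
In the column lines above, the SZ field appears to hold compressed size / uncompressed size / ratio: for N_NATIONKEY, 219/130 is about 1.68, and the four uncompressed sizes (219 + 296 + 218 + 619) add up to the TS:1352 shown for the row group. This reading is inferred from the numbers in this output, not taken from the tool's documentation. Under that assumption, the sizes can be totalled with a short sketch; sum_column_sizes is a hypothetical helper.

```shell
# Hypothetical helper: read "meta" output on stdin and total the first two
# numbers of every SZ:a/b/c field (assumed to be compressed/uncompressed bytes).
sum_column_sizes() {
  awk '
    match($0, /SZ:[0-9]+\/[0-9]+/) {
      # Strip the leading "SZ:" and split "a/b" into its two numbers.
      split(substr($0, RSTART + 3, RLENGTH - 3), sz, "/")
      compressed += sz[1]
      uncompressed += sz[2]
    }
    END { printf "compressed=%d uncompressed=%d\n", compressed, uncompressed }'
}

# Example usage (requires the jar and sample file from the steps above):
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet | sum_column_sizes
```

For the four column lines shown above this gives 944 compressed and 1352 uncompressed bytes.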

9. Dump all data

Note: values are printed column by column rather than row by row.
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet
INT64 N_NATIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:0
value 2:  R:0 D:0 V:1
value 3:  R:0 D:0 V:2
(...)

BINARY N_NAME
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:ALGERIA
value 2:  R:0 D:0 V:ARGENTINA
value 3:  R:0 D:0 V:BRAZIL
(...)

INT64 N_REGIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V:0
value 2:  R:0 D:0 V:1
value 3:  R:0 D:0 V:1
(...)

BINARY N_COMMENT
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1:  R:0 D:0 V: haggle. carefully f
value 2:  R:0 D:0 V:al foxes promise sly
value 3:  R:0 D:0 V:y alongside of the p
(...)
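
Since the dump output is grouped by column, it can be flattened into "column, value" pairs for further processing. A minimal sketch, assuming the layout shown above (a "TYPE NAME" header line followed by a line of dashes, then "value N: ... V:x" lines); dump_to_tsv is a hypothetical helper.

```shell
# Hypothetical helper: read "dump" output on stdin and print one
# "COLUMN<TAB>VALUE" line per value.
dump_to_tsv() {
  awk '
    # A line of dashes follows each "TYPE NAME" header; take the name from it.
    /^-+$/ { n = split(prev, h, " "); col = h[n] }
    /^value [0-9]+:/ {
      pos = index($0, " V:")               # the value starts after " V:"
      if (pos > 0) printf "%s\t%s\n", col, substr($0, pos + 3)
    }
    { prev = $0 }'
}

# Example usage (requires the jar and sample file from the steps above):
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet | dump_to_tsv
```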


8 comments:

  1. Thanks for the compilation fix! Too bad that the project on GitHub does not include issues where this could be mentioned, because it is quite a useful fix.

  2. Cannot clone this project.
    error: fatal: repository 'https://github.com/Parquet/parquet-mr.git/' not found
    Any idea how to clone it?

    Replies
    1. It has moved to:
      https://github.com/apache/parquet-mr/

    2. Thank you. OpenKB.

  3. [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project parquet-tools: Compilation failure: Compilation failure:
    [ERROR] /C:/bigdata/parquet-mr/parquet-tools/src/main/java/org/apache/parquet/tools/command/DumpCommand.java:[286,27] cannot find symbol
    [ERROR] symbol: method getCrc()
    [ERROR] location: variable pageV1 of type org.apache.parquet.column.page.DataPageV1

    Getting the above error while trying to run mvn clean package -Plocal

    Replies
    1. I am also getting the same error:
      parquet-mr-master/parquet-tools/src/main/java/org/apache/parquet/tools/command/DumpCommand.java:[286,27] cannot find symbol

      Any pointers please?
