Goal:
How to build and use parquet-tools to read Parquet files.
Solution:
1. Download and install Maven.
Follow the link below: http://maven.apache.org/download.cgi
2. Download the parquet source code
    git clone https://github.com/Parquet/parquet-mr.git
3. Build the parquet-tools.
    cd parquet-mr/parquet-tools/
    mvn clean package -Plocal
Note: you may hit an error such as the following:
    Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository
This happens because pom.xml points to version 1.6.0rc3-SNAPSHOT, which does not exist in https://oss.sonatype.org/content/repositories/snapshots/com/twitter/parquet-hadoop/ .
The fix is to change the version in both parquet-mr/pom.xml and parquet-mr/parquet-tools/pom.xml to a valid one, for example:
    <version>1.6.1-SNAPSHOT</version>
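If you would rather script the version change than edit both files by hand, the sketch below is one way to do it. replace_version is a hypothetical helper, not part of parquet-mr or Maven. It assumes the old version string appears literally in each file and that every occurrence should be changed, so review the result before building.

```shell
#!/bin/sh
# replace_version is a hypothetical helper for this illustration only.
# Usage: replace_version OLD NEW FILE...
# Replaces every literal occurrence of OLD with NEW in each FILE.
replace_version() {
    old=$1
    new=$2
    shift 2
    # Escape dots so they match literally instead of "any character".
    old_escaped=$(printf '%s' "$old" | sed 's/\./\\./g')
    for f in "$@"; do
        # Write to a temporary file first; in-place sed flags differ between systems.
        sed "s/$old_escaped/$new/g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}

# Example call, run from the directory that contains parquet-mr:
# replace_version 1.6.0rc3-SNAPSHOT 1.6.1-SNAPSHOT parquet-mr/pom.xml parquet-mr/parquet-tools/pom.xml
```

The example call is left commented out so that nothing is modified until you have checked the paths and version strings against your own checkout.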
4. Show help manual
    cd target
    java -jar parquet-tools-1.6.1-SNAPSHOT.jar --help
5. Dump the schema
Take the sample nation.parquet file as an example.
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet
    message root {
      required int64 N_NATIONKEY;
      required binary N_NAME (UTF8);
      required int64 N_REGIONKEY;
      required binary N_COMMENT (UTF8);
    }
6. Read the data
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar cat /tmp/nation.parquet
    N_NATIONKEY = 0
    N_NAME = ALGERIA
    N_REGIONKEY = 0
    N_COMMENT = haggle. carefully f

    (... ...)
7. Read first n records
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet
    N_NATIONKEY = 0
    N_NAME = ALGERIA
    N_REGIONKEY = 0
    N_COMMENT = haggle. carefully f

    N_NATIONKEY = 1
    N_NAME = ARGENTINA
    N_REGIONKEY = 1
    N_COMMENT = al foxes promise sly

    N_NATIONKEY = 2
    N_NAME = BRAZIL
    N_REGIONKEY = 1
    N_COMMENT = y alongside of the p
8. Show meta info
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet
    file:        file:/tmp/nation.parquet
    creator:     parquet-mr

    file schema: root
    --------------------------------------------------------------------------------
    N_NATIONKEY: REQUIRED INT64 R:0 D:0
    N_NAME:      REQUIRED BINARY O:UTF8 R:0 D:0
    N_REGIONKEY: REQUIRED INT64 R:0 D:0
    N_COMMENT:   REQUIRED BINARY O:UTF8 R:0 D:0

    row group 1: RC:25 TS:1352 OFFSET:4
    --------------------------------------------------------------------------------
    N_NATIONKEY:  INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
    N_NAME:       BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
    N_REGIONKEY:  INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
    N_COMMENT:    BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED
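The numbers in the meta output can be cross-checked with a little arithmetic. The three values in each SZ field read as compressed bytes / uncompressed bytes / ratio, and TS for the row group matches the sum of the uncompressed column sizes. The sketch below recomputes both from the figures shown for nation.parquet. size_ratio is a hypothetical helper name used only for this illustration, not a parquet-tools command.

```shell
#!/bin/sh
# size_ratio is a hypothetical helper for this illustration only.
# Usage: size_ratio COMPRESSED UNCOMPRESSED
# Prints uncompressed / compressed, rounded to two decimals.
size_ratio() {
    awk -v c="$1" -v u="$2" 'BEGIN { printf "%.2f\n", u / c }'
}

# Values copied from the SZ fields above: compressed, then uncompressed.
size_ratio 130 219   # N_NATIONKEY
size_ratio 267 296   # N_NAME
size_ratio 79 218    # N_REGIONKEY
size_ratio 468 619   # N_COMMENT

# Sum of the uncompressed sizes, to compare with TS for the row group.
echo $((219 + 296 + 218 + 619))
```

The four ratios come out as 1.68, 1.11, 2.76 and 1.32, and the sum is 1352, which agrees with the SZ and TS values in the meta output.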
9. Dump all data
Note: Values are in column format.
    # java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet
    INT64 N_NATIONKEY
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 25 ***
    value 1: R:0 D:0 V:0
    value 2: R:0 D:0 V:1
    value 3: R:0 D:0 V:2
    (...)

    BINARY N_NAME
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 25 ***
    value 1: R:0 D:0 V:ALGERIA
    value 2: R:0 D:0 V:ARGENTINA
    value 3: R:0 D:0 V:BRAZIL
    (...)

    INT64 N_REGIONKEY
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 25 ***
    value 1: R:0 D:0 V:0
    value 2: R:0 D:0 V:1
    value 3: R:0 D:0 V:1
    (...)

    BINARY N_COMMENT
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 25 ***
    value 1: R:0 D:0 V: haggle. carefully f
    value 2: R:0 D:0 V:al foxes promise sly
    value 3: R:0 D:0 V:y alongside of the p
    (...)
Thanks for the compilation fix! Too bad that the project on GitHub does not include issues where this could be mentioned, because it is quite a useful fix.
Cannot clone this project.
error: fatal: repository 'https://github.com/Parquet/parquet-mr.git/' not found
Any idea how to clone it?
It has moved to:
https://github.com/apache/parquet-mr/
Thank you. OpenKB.
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project parquet-tools: Compilation failure: Compilation failure:
[ERROR] /C:/bigdata/parquet-mr/parquet-tools/src/main/java/org/apache/parquet/tools/command/DumpCommand.java:[286,27] cannot find symbol
[ERROR] symbol: method getCrc()
[ERROR] location: variable pageV1 of type org.apache.parquet.column.page.DataPageV1
Getting the above error while trying to run mvn clean package -Plocal.
I am also getting the same error:
parquet-mr-master/parquet-tools/src/main/java/org/apache/parquet/tools/command/DumpCommand.java:[286,27] cannot find symbol
Any pointers please?
This is great.