Goal:
How to build and use parquet-tools to read Parquet files.
Solution:
1. Download and install Maven.
Follow this link: http://maven.apache.org/download.cgi
2. Download the parquet source code
git clone https://github.com/Parquet/parquet-mr.git
3. Build the parquet-tools.
cd parquet-mr/parquet-tools/
mvn clean package -Plocal
The resulting jar is target/parquet-tools.jar.
Note: you may hit an error such as the one below:
Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository
This happens because pom.xml points to version 1.6.0rc3-SNAPSHOT, but that version does not exist in https://oss.sonatype.org/content/repositories/snapshots/com/twitter/parquet-hadoop/ .
The fix is to change the version in both parquet-mr/pom.xml and parquet-mr/parquet-tools/pom.xml to a valid one, for example:
<version>1.6.1-SNAPSHOT</version>
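The version change can also be scripted. The following is a minimal sketch, assuming the old version appears as a literal <version>...</version> element in each pom.xml; the helper name replace_version is made up for this example and is not part of parquet-mr or Maven.

```shell
#!/bin/sh
# Hypothetical helper: replace one <version> value with another
# in the given pom.xml files. A .bak copy of each file is kept.
replace_version() {
    old="$1"
    new="$2"
    shift 2
    for pom in "$@"; do
        # The dots in the old version are treated as regex wildcards by sed,
        # which is harmless for this one-off substitution.
        sed -i.bak "s|<version>${old}</version>|<version>${new}</version>|g" "$pom"
    done
}

# Example, run from the directory that contains parquet-mr:
#   replace_version 1.6.0rc3-SNAPSHOT 1.6.1-SNAPSHOT \
#       parquet-mr/pom.xml parquet-mr/parquet-tools/pom.xml
```

Review the result with diff against the .bak files before rebuilding, since other modules may reference the same version string.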
4. Show help manual
cd target
java -jar parquet-tools-1.6.1-SNAPSHOT.jar --help
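Because the jar file name contains the version, it can help to look the jar up once and wrap the java call. This is only a sketch: find_tools_jar, pq, and the PARQUET_TOOLS_JAR and JAVA variables are names invented for this example, not something parquet-tools provides.

```shell
#!/bin/sh
# Hypothetical helper: print the first parquet-tools jar found
# in a build directory (default: target).
find_tools_jar() {
    dir="${1:-target}"
    for jar in "$dir"/parquet-tools*.jar; do
        if [ -f "$jar" ]; then
            printf '%s\n' "$jar"
            return 0
        fi
    done
    echo "no parquet-tools jar found in $dir" >&2
    return 1
}

# Hypothetical wrapper: run parquet-tools with the given arguments.
# PARQUET_TOOLS_JAR must point at the built jar; JAVA defaults to "java".
pq() {
    "${JAVA:-java}" -jar "${PARQUET_TOOLS_JAR:?set PARQUET_TOOLS_JAR to the built jar}" "$@"
}

# Example, run from parquet-mr/parquet-tools:
#   PARQUET_TOOLS_JAR=$(find_tools_jar target) && pq --help
```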
5. Dump the schema
Take the sample nation.parquet file as an example.
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet
message root {
required int64 N_NATIONKEY;
required binary N_NAME (UTF8);
required int64 N_REGIONKEY;
required binary N_COMMENT (UTF8);
}
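If only the column names are needed, the schema output can be filtered. The sketch below assumes a flat schema printed in the format shown above (one "required/optional/repeated type name" field per line); schema_columns is a made-up helper name, and nested groups are not handled.

```shell
#!/bin/sh
# Hypothetical helper: read "parquet-tools schema" output on stdin
# and print one column name per line.
schema_columns() {
    awk '$1 ~ /^(required|optional|repeated)$/ {
        name = $3
        sub(/;$/, "", name)   # drop the trailing semicolon, if any
        print name
    }'
}

# Example:
#   java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet | schema_columns
```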
6. Read the data
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar cat /tmp/nation.parquet
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT = haggle. carefully f
(... ...)
7. Read first n records
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet
N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT = haggle. carefully f

N_NATIONKEY = 1
N_NAME = ARGENTINA
N_REGIONKEY = 1
N_COMMENT = al foxes promise sly

N_NATIONKEY = 2
N_NAME = BRAZIL
N_REGIONKEY = 1
N_COMMENT = y alongside of the p
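The "NAME = value" lines printed by cat and head are easy to post-process. As a rough sketch, the hypothetical helper column_values below pulls out the values of a single column; it assumes each field starts at the beginning of a line, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read "parquet-tools cat" or "head" output on stdin
# and print the values of the column named in $1, one per line.
column_values() {
    awk -v col="$1" '
        # Match lines that start with "<col> = " and print the rest.
        index($0, col " = ") == 1 { print substr($0, length(col) + 4) }
    '
}

# Example:
#   java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet | column_values N_NAME
```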
8. Show meta info
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet
file: file:/tmp/nation.parquet
creator: parquet-mr

file schema: root
--------------------------------------------------------------------------------
N_NATIONKEY: REQUIRED INT64 R:0 D:0
N_NAME: REQUIRED BINARY O:UTF8 R:0 D:0
N_REGIONKEY: REQUIRED INT64 R:0 D:0
N_COMMENT: REQUIRED BINARY O:UTF8 R:0 D:0

row group 1: RC:25 TS:1352 OFFSET:4
--------------------------------------------------------------------------------
N_NATIONKEY: INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
N_NAME: BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
N_REGIONKEY: INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
N_COMMENT: BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED
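The SZ fields above can be totalled with a short filter. This sketch reads each SZ:a/b/c as compressed size / uncompressed size / ratio, which is an interpretation that fits the numbers shown (for example 219/130 is about 1.68) rather than something stated in the output itself; sum_sizes is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: read "parquet-tools meta" output on stdin and print
# the summed first and second SZ values (assumed to be compressed and
# uncompressed bytes) as "compressed uncompressed".
sum_sizes() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ "^SZ:[0-9]+/[0-9]+/") {
                split(substr($i, 4), parts, "/")
                compressed += parts[1]
                uncompressed += parts[2]
            }
        }
    }
    END { printf "%d %d\n", compressed, uncompressed }'
}

# Example:
#   java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet | sum_sizes
```

For the four columns listed above this gives 944 and 1352; the second total equals the TS:1352 value on the row group line.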
9. Dump all data
Note: Values are in column format.
# java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet
INT64 N_NATIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V:0
value 2: R:0 D:0 V:1
value 3: R:0 D:0 V:2
(...)

BINARY N_NAME
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V:ALGERIA
value 2: R:0 D:0 V:ARGENTINA
value 3: R:0 D:0 V:BRAZIL
(...)

INT64 N_REGIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V:0
value 2: R:0 D:0 V:1
value 3: R:0 D:0 V:1
(...)

BINARY N_COMMENT
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V: haggle. carefully f
value 2: R:0 D:0 V:al foxes promise sly
value 3: R:0 D:0 V:y alongside of the p
(...)
Thanks for the compilation fix! Too bad that the project on GitHub does not include issues where this could be mentioned, because it is quite a useful fix.
Cannot clone this project.
error: fatal: repository 'https://github.com/Parquet/parquet-mr.git/' not found
Any idea how to clone it?
It has moved to:
https://github.com/apache/parquet-mr/
Thank you. OpenKB.
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project parquet-tools: Compilation failure: Compilation failure:
[ERROR] /C:/bigdata/parquet-mr/parquet-tools/src/main/java/org/apache/parquet/tools/command/DumpCommand.java:[286,27] cannot find symbol
[ERROR] symbol: method getCrc()
[ERROR] location: variable pageV1 of type org.apache.parquet.column.page.DataPageV1
Getting the above error while trying to run mvn clean package -Plocal
I am also getting the same error:
parquet-mr-master/parquet-tools/src/main/java/org/apache/parquet/tools/command/DumpCommand.java:[286,27] cannot find symbol
Any pointers please?
This is great.