Goal:
This article explains how to use PyArrow to view the metadata stored inside a Parquet file.
Env:
CentOS 7
Solution:
1. Create a Python 3 virtual environment
This step is needed because the default Python version on CentOS/Red Hat 7 is 2.x, which is too old to install the latest version of PyArrow.
Using Python 3 and its pip3 is the way to go.
However, if we simply use "alternatives" to point python at python3, it may break other tools such as "yum", which depend on python2.
Using a virtual environment is the easiest way to keep both python2 and python3 on CentOS 7.
    python3 -m venv .venv
    . .venv/bin/activate
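To confirm that the shell is really using the Python 3 interpreter from the virtual environment, a quick check from Python itself can help. The following is a minimal sketch, assuming the environment was created with `python3 -m venv` as above; the helper names `in_virtualenv` and `interpreter_summary` are made up for illustration.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the system installation.
    return sys.prefix != sys.base_prefix


def interpreter_summary():
    """Collect a few facts about the interpreter that is currently running."""
    return {
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "is_python3": sys.version_info.major >= 3,
        "in_virtualenv": in_virtualenv(),
        "prefix": sys.prefix,
    }


if __name__ == "__main__":
    for key, value in interpreter_summary().items():
        print("{}: {}".format(key, value))
```

If `in_virtualenv` is False after activation, the `python` on the PATH is probably still the system interpreter.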
2. Install PyArrow and its dependencies
    pip install --upgrade pip setuptools
    pip install Cython
    pip install pyarrow
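After the install finishes, it may be worth verifying that PyArrow can be found from inside the virtual environment. Here is a small sketch using only the standard library; the helper name `installed_version` is hypothetical, and it simply reports the module's `__version__` attribute if there is one.

```python
import importlib
import importlib.util


def installed_version(module_name):
    """Return the module's version string, or None if it cannot be found."""
    # find_spec looks the module up on the import path without importing it.
    if importlib.util.find_spec(module_name) is None:
        return None
    module = importlib.import_module(module_name)
    # Not every module defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    version = installed_version("pyarrow")
    if version is None:
        print("pyarrow is not installed in this environment")
    else:
        print("pyarrow version:", version)
```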
3. Read the metadata inside a Parquet file
    >>> import pyarrow.parquet as pq
    >>> parquet_file = pq.ParquetFile('/.../part-00000-67861019-20bb-4396-96f8-146141351ff2-c000.snappy.parquet')

    >>> parquet_file.metadata
    <pyarrow._parquet.FileMetaData object at 0x7f8014250bf8>
      created_by: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
      num_columns: 10
      num_rows: 546097
      num_row_groups: 1
      format_version: 1.0
      serialized_size: 1886

    >>> parquet_file.metadata.row_group(0)
    <pyarrow._parquet.RowGroupMetaData object at 0x7f808aaf4f98>
      num_columns: 10
      num_rows: 546097
      total_byte_size: 17515040

    >>> parquet_file.metadata.row_group(0).column(3)
    <pyarrow._parquet.ColumnChunkMetaData object at 0x7f801356cf48>
      file_offset: 6588315
      file_path:
      physical_type: INT64
      num_values: 546097
      path_in_schema: Index
      is_stats_set: True
      statistics:
        <pyarrow._parquet.Statistics object at 0x7f8013fd2ea8>
          has_min_max: True
          min: 0
          max: 396316
          null_count: 0
          distinct_count: 0
          num_values: 546097
          physical_type: INT64
          logical_type: None
          converted_type (legacy): NONE
      compression: SNAPPY
      encodings: ('BIT_PACKED', 'RLE', 'PLAIN')
      has_dictionary_page: False
      dictionary_page_offset: None
      data_page_offset: 6588315
      total_compressed_size: 2277936
      total_uncompressed_size: 4369155

    >>> parquet_file.metadata.row_group(0).column(3).statistics
    <pyarrow._parquet.Statistics object at 0x7f801356cef8>
      has_min_max: True
      min: 0
      max: 396316
      null_count: 0
      distinct_count: 0
      num_values: 546097
      physical_type: INT64
      logical_type: None
      converted_type (legacy): NONE
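The interactive session above inspects one column at a time. If you want the same information for every column in every row group, a short script can walk the metadata instead. This is only a sketch: the function names `column_ranges` and `print_column_ranges` are made up for illustration, and it relies on the same attributes shown in the session (`num_row_groups`, `row_group()`, `column()`, `path_in_schema`, `physical_type`, `statistics`).

```python
def column_ranges(metadata):
    """Yield (row_group, column_name, physical_type, min, max) tuples.

    `metadata` is expected to be a pyarrow FileMetaData object, such as
    `pq.ParquetFile(path).metadata`.
    """
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            stats = chunk.statistics  # None when the writer stored no statistics
            has_range = stats is not None and stats.has_min_max
            yield (
                rg_index,
                chunk.path_in_schema,
                chunk.physical_type,
                stats.min if has_range else None,
                stats.max if has_range else None,
            )


def print_column_ranges(path):
    """Print one line per column chunk of the Parquet file at `path`."""
    # Imported here so that a missing PyArrow install only fails when a file is read.
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(path).metadata
    for rg, name, ptype, low, high in column_ranges(metadata):
        print("row group {}: {} ({}) min={} max={}".format(rg, name, ptype, low, high))
```

Calling `print_column_ranges` with the path of a Parquet file prints one line per column chunk. Columns without statistics show `None` for min and max.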
From the above information, we can tell that:
- The file was written by parquet-mr version 1.10.1 (created_by), and its Parquet format version is 1.0 (format_version).
- It has only 1 row group.
- It has 10 columns and 546097 rows.
- The 4th column (.column(3)), named “Index”, is an INT64 column with min=0 and max=396316.
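Note that the created_by field names the library that wrote the file, which is a separate thing from format_version. If you need the writer name and version as separate values, the string can be split with a regular expression. The sketch below assumes the "<writer> version <x.y.z>" layout seen in the output above; other writers may format this field differently, and the helper name `parse_created_by` is made up for illustration.

```python
import re

# Matches strings such as "parquet-mr version 1.10.1 (build ...)".
CREATED_BY_PATTERN = re.compile(r"^(?P<writer>\S+) version (?P<version>\S+)")


def parse_created_by(created_by):
    """Return (writer, version), or None if the string has another layout."""
    match = CREATED_BY_PATTERN.match(created_by)
    if match is None:
        return None
    return match.group("writer"), match.group("version")


example = "parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)"
print(parse_created_by(example))  # ('parquet-mr', '1.10.1')
```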