Wednesday, February 3, 2021

How to use pyarrow to view the metadata information inside a Parquet file

Goal:

This article explains how to use PyArrow to view the metadata information inside a Parquet file.

Env:

CentOS 7

Solution:

1. Create a Python 3 virtual environment

This step is needed because the default Python version on CentOS/Red Hat 7 is 2.x, which is too old to install the latest version of PyArrow.

Using Python 3 and its pip3 is the way to go.

However, if we just use "alternatives" to configure python to point to python3, it may break other tools such as "yum", which depend on python2.

Using a virtual environment is the easiest way to keep both python2 and python3 on CentOS 7.

python3 -m venv .venv
. .venv/bin/activate
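
As a quick sanity check after activation, the following minimal sketch reports whether the current interpreter is running inside a virtual environment. It uses only the standard library; the helper name in_virtualenv is an illustrative choice, not something provided by venv.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the system-wide Python installation.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Inside a virtual environment:", in_virtualenv())
```

If this prints False, the activate script was probably not sourced in the current shell.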

2. Install PyArrow and its dependencies

pip install --upgrade pip setuptools
pip install Cython
pip install pyarrow
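
To confirm that the installation worked, a small check along the following lines can be run inside the virtual environment. This is only a sketch: the helper name installed_version is hypothetical, and it assumes the module exposes a __version__ attribute, which pyarrow does.

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Not every module defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    version = installed_version("pyarrow")
    if version is None:
        print("pyarrow is not importable; check that the virtual environment is active")
    else:
        print("pyarrow version:", version)
```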

3. Read the metadata inside a Parquet file

>>> import pyarrow.parquet as pq
>>> parquet_file = pq.ParquetFile('/.../part-00000-67861019-20bb-4396-96f8-146141351ff2-c000.snappy.parquet')
 
>>> parquet_file.metadata
<pyarrow._parquet.FileMetaData object at 0x7f8014250bf8>
  created_by: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
  num_columns: 10
  num_rows: 546097
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 1886
 
>>> parquet_file.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x7f808aaf4f98>
  num_columns: 10
  num_rows: 546097
  total_byte_size: 17515040
 
>>> parquet_file.metadata.row_group(0).column(3)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f801356cf48>
  file_offset: 6588315
  file_path:
  physical_type: INT64
  num_values: 546097
  path_in_schema: Index
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f8013fd2ea8>
      has_min_max: True
      min: 0
      max: 396316
      null_count: 0
      distinct_count: 0
      num_values: 546097
      physical_type: INT64
      logical_type: None
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('BIT_PACKED', 'RLE', 'PLAIN')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 6588315
  total_compressed_size: 2277936
  total_uncompressed_size: 4369155
 
>>> parquet_file.metadata.row_group(0).column(3).statistics
<pyarrow._parquet.Statistics object at 0x7f801356cef8>
  has_min_max: True
  min: 0
  max: 396316
  null_count: 0
  distinct_count: 0
  num_values: 546097
  physical_type: INT64
  logical_type: None
  converted_type (legacy): NONE

From the above information, we can tell that:

  • The file was written by parquet-mr version 1.10.1 (created_by), and its Parquet format version is 1.0 (format_version).
  • It has only 1 row group inside.
  • It has 10 columns and 546097 rows.
  • The 4th column (.column(3)), named “Index”, is an INT64 column with min=0 and max=396316.
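
The same facts can be collected in a script instead of an interactive session. The sketch below is one possible way to do it, not a definitive implementation. It reads only attributes that appear in the output above (created_by, format_version, num_row_groups, path_in_schema, statistics, and so on). The function names summarize_file and summarize_columns are illustrative, and the file path is assumed to be passed as the first command-line argument.

```python
import sys


def summarize_file(metadata):
    """Collect the file-level fields of a pyarrow FileMetaData object."""
    return {
        "created_by": metadata.created_by,          # writer library and its version
        "format_version": metadata.format_version,  # Parquet format version
        "num_row_groups": metadata.num_row_groups,
        "num_columns": metadata.num_columns,
        "num_rows": metadata.num_rows,
    }


def summarize_columns(metadata):
    """Yield one dict per column chunk, across all row groups."""
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            column = row_group.column(col_index)
            # Statistics are optional; only read min/max when they are present.
            stats = column.statistics if column.is_stats_set else None
            has_min_max = stats is not None and stats.has_min_max
            yield {
                "row_group": rg_index,
                "column": col_index,
                "name": column.path_in_schema,
                "physical_type": column.physical_type,
                "min": stats.min if has_min_max else None,
                "max": stats.max if has_min_max else None,
                "compressed_bytes": column.total_compressed_size,
                "uncompressed_bytes": column.total_uncompressed_size,
            }


if __name__ == "__main__" and len(sys.argv) > 1:
    # pyarrow is imported here so the helpers above stay importable without it.
    import pyarrow.parquet as pq

    file_metadata = pq.ParquetFile(sys.argv[1]).metadata
    for key, value in summarize_file(file_metadata).items():
        print(f"{key}: {value}")
    for row in summarize_columns(file_metadata):
        print(row)
```

Iterating over every row group and column avoids hard-coding indexes such as row_group(0) or column(3), which is useful when a file contains more than one row group.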

