
Defining a Schema and Generating a Parquet File in Python

Both Java and Python typically convert Avro to Parquet format with the Schema defined in Avro. What we try here is to define the Parquet Schema directly, then populate the data and generate a Parquet file.

I. Simple field definitions

1. Define the Schema and generate a Parquet file

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Define Schema
schema = pa.schema([
    ('id', pa.int32()),
    ('email', pa.string())
])

# Prepare data
ids = pa.array([1, 2], type=pa.int32())
emails = pa.array(['first@', 'second@'], pa.string())

# Generate Parquet data
batch = pa.RecordBatch.from_arrays(
    [ids, emails],
    schema=schema
)
table = pa.Table.from_batches([batch])

# Write Parquet file
pq.write_table(table, 'example.parquet')  # placeholder filename
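As an aside, the intermediate RecordBatch is not strictly required; a minimal sketch (assuming the schema and arrays defined above) that builds the Table in one call:

# Build the Table directly from the arrays, letting the schema name and type them
table = pa.Table.from_arrays([ids, emails], schema=schema)
pq.write_table(table, 'example.parquet')  # placeholder filename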

2. Verify the Parquet data file

We can use the parquet-tools utility to view the Schema and data in the file:

$ parquet-tools schema example.parquet
message schema {
    optional int32 id;
    optional binary email (STRING);
}

$ parquet-tools cat --json example.parquet
{"id":1,"email":"first@"}
{"id":2,"email":"second@"}


No problem, consistent with what we expected. We can also use pyarrow code to read back the Schema and data:

schema = pq.read_schema('example.parquet')
print(schema)

df = pd.read_parquet('example.parquet')
print(df.to_json())

The output is:

id: int32
  -- field metadata --
  PARQUET:field_id: '1'
email: string
  -- field metadata --
  PARQUET:field_id: '2'
{"id":{"0":1,"1":2},"email":{"0":"first@","1":"second@"}}
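A related note: if the rows already live in a pandas DataFrame, the same Schema can be applied during conversion; a minimal sketch, assuming the schema defined above (pa.Table.from_pandas casts the columns to the declared types):

df = pd.DataFrame({'id': [1, 2], 'email': ['first@', 'second@']})

# Cast the DataFrame columns to the declared Schema while converting
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, 'example.parquet')  # placeholder filename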

II. Definitions with nested fields

The Schema definition below adds a nested address object containing email_address and post_address. The code to define the Schema and generate the Parquet file is as follows:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Internal fields
address_fields = [
    ('email_address', pa.string()),
    ('post_address', pa.string()),
]

# Define Parquet Schema; address nests address_fields
schema = pa.schema([
    ('id', pa.int32()),
    ('address', pa.struct(address_fields))
])

# Prepare data
ids = pa.array([1, 2], type=pa.int32())
addresses = pa.array(
    [('first@', 'city1'), ('second@', 'city2')],
    pa.struct(address_fields)
)

# Generate Parquet data
batch = pa.RecordBatch.from_arrays(
    [ids, addresses],
    schema=schema
)
table = pa.Table.from_batches([batch])

# Write Parquet data to file
pq.write_table(table, 'nested.parquet')  # placeholder filename
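For the struct column, pyarrow also accepts Python dicts in place of tuples, which makes the mapping from value to field name explicit; a minimal sketch, assuming the address_fields list above:

# Each dict names the struct fields explicitly instead of relying on position
addresses = pa.array(
    [
        {'email_address': 'first@', 'post_address': 'city1'},
        {'email_address': 'second@', 'post_address': 'city2'},
    ],
    pa.struct(address_fields)
)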

1. Verify the Parquet data file

Check the file with the same parquet-tools:

$ parquet-tools schema nested.parquet
message schema {
    optional int32 id;
    optional group address {
        optional binary email_address (STRING);
        optional binary post_address (STRING);
    }
}

$ parquet-tools cat --json nested.parquet
{"id":1,"address":{"email_address":"first@","post_address":"city1"}}
{"id":2,"address":{"email_address":"second@","post_address":"city2"}}


The Schema printed by parquet-tools does not contain the word struct, but it does reflect the nested relationship between address and its child attributes.

Use pyarrow code to read the file's Schema and data:

schema = pq.read_schema('nested.parquet')
print(schema)

df = pd.read_parquet('nested.parquet')
print(df.to_json())

Output:

id: int32
  -- field metadata --
  PARQUET:field_id: '1'
address: struct<email_address: string, post_address: string>
  child 0, email_address: string
    -- field metadata --
    PARQUET:field_id: '3'
  child 1, post_address: string
    -- field metadata --
    PARQUET:field_id: '4'
  -- field metadata --
  PARQUET:field_id: '2'
{"id":{"0":1,"1":2},"address":{"0":{"email_address":"first@","post_address":"city1"},"1":{"email_address":"second@","post_address":"city2"}}}

The data is of course the same; the difference is in how the Schema is displayed: address is labeled struct<email_address: string, post_address: string>, clearly showing that it is a struct type rather than just a nested hierarchy.
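To pull a single nested attribute back out without going through pandas, a struct column's children can be addressed by name; a minimal sketch, assuming the nested file written above:

import pyarrow.parquet as pq

table = pq.read_table('nested.parquet')  # placeholder filename

# Merge the chunked column into one StructArray, then select one child field
address = table['address'].combine_chunks()
emails = address.field('email_address')
print(emails)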

This concludes this article on defining a Schema and generating a Parquet file in Python. For more on defining Schemas and generating Parquet files with Python, please search my previous posts or browse the related articles below, and I hope you will support me in the future!