Converting Avro to Parquet Format with Java and Python

In Java, the Schema is defined in Avro. What we will try here is to define the Parquet Schema directly in Python, then populate it with data and generate a Parquet file.
I. Simple field definitions
1. Define the Schema and generate the Parquet file
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Define Schema
schema = pa.schema([
    ('id', pa.int32()),
    ('email', pa.string())
])

# Prepare data
ids = pa.array([1, 2], type=pa.int32())
emails = pa.array(['first@', 'second@'], pa.string())

# Generate Parquet data
batch = pa.RecordBatch.from_arrays([ids, emails], schema=schema)
table = pa.Table.from_batches([batch])

# Write Parquet file
pq.write_table(table, '')
2. Verify the Parquet data file
We can use the parquet-tools tool to view the data and the Schema in the file:
$ parquet-tools schema
message schema {
  optional int32 id;
  optional binary email (STRING);
}
$ parquet-tools cat --json
{"id":1,"email":"first@"}
{"id":2,"email":"second@"}
No problem, this is consistent with what we expected. We can also use pyarrow code to read the Schema and the data:
schema = pq.read_schema('')
print(schema)
df = pd.read_parquet('')
print(df.to_json())
The output again shows the same Schema (id as int32, email as string) together with the same JSON data as above.
II. Definitions with nested fields
Below, the Schema definition adds a nested object, address, containing the fields email_address and post_address. The Schema definition and the code to generate the Parquet file are as follows:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Internal fields
address_fields = [
    ('email_address', pa.string()),
    ('post_address', pa.string()),
]

# Define Parquet Schema; address nests address_fields
schema = pa.schema([
    ('id', pa.int32()),
    ('address', pa.struct(address_fields))
])

# Prepare data
ids = pa.array([1, 2], type=pa.int32())
addresses = pa.array(
    [('first@', 'city1'), ('second@', 'city2')],
    pa.struct(address_fields)
)

# Generate Parquet data
batch = pa.RecordBatch.from_arrays([ids, addresses], schema=schema)
table = pa.Table.from_batches([batch])

# Write Parquet data to file
pq.write_table(table, '')
1. Verify the Parquet data file
We use the same parquet-tools to check the file:
$ parquet-tools schema
message schema {
  optional int32 id;
  optional group address {
    optional binary email_address (STRING);
    optional binary post_address (STRING);
  }
}
$ parquet-tools cat --json
{"id":1,"address":{"email_address":"first@","post_address":"city1"}}
{"id":2,"address":{"email_address":"second@","post_address":"city2"}}
The Schema displayed by parquet-tools does not contain the word struct, but it does reflect the nested relationship between address and its child attributes.
Using pyarrow code to read the file's Schema and data:
schema = pq.read_schema("")
print(schema)
df = pd.read_parquet('')
print(df.to_json())
Output:
id: int32
  -- field metadata --
  PARQUET:field_id: '1'
address: struct<email_address: string, post_address: string>
  child 0, email_address: string
    -- field metadata --
    PARQUET:field_id: '3'
  child 1, post_address: string
    -- field metadata --
    PARQUET:field_id: '4'
  -- field metadata --
  PARQUET:field_id: '2'
{"id":{"0":1,"1":2},"address":{"0":{"email_address":"first@","post_address":"city1"},"1":{"email_address":"second@","post_address":"city2"}}}
The data is of course the same; there is a slight difference in how the Schema is displayed. Here address is labeled as struct<email_address: string, post_address: string>, clearly indicating that it is a struct type, rather than just showing the nesting hierarchy.
This concludes this article on using Python to define a Schema and generate Parquet files. For more related content on defining Schemas and generating Parquet files with Python, please search my previous posts or continue to browse the related articles below. I hope you will support me in the future!