Detailed introduction to using Protocol Buffers in Python

Practical environment

protoc-25.

Download address:

/protocolbuffers/protobuf/releases

/protocolbuffers/protobuf/releases/download/v25.4/protoc-25.

protobuf 5.27.2

pip install protobuf==5.27.2

Python 3.9.13

Problem domain

The example that this article will use is a very simple "address book" application that can read and write people's contact information from files. Everyone in the address book has a name, an ID, an email address and a contact number.

How to serialize and retrieve such structured data? There are several ways to solve this problem:

Use Python pickle. This is the default method because it is built into the language, but it doesn't handle schema evolution well, and it doesn't work well if you need to share data with applications written in C++ or Java.

You can invent a special way to encode data items into a single string, for example encode 4 integers as "12:3:-23:67". This is a simple and flexible approach, although it does require writing one-time encoding and parsing code, and the runtime cost of parsing is small. This is best for encoding very simple data.

Serialize data to XML. This approach is attractive because XML is (to some extent) human-readable and has binding libraries for many languages. It may be a good choice if you want to share data with other applications or projects. However, XML is notoriously space-intensive, and encoding and decoding it can impose a large performance penalty on the application. In addition, navigating an XML DOM tree is considerably more complicated than accessing simple fields in a class.
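
To make the second option concrete, here is a minimal sketch of the ad hoc string encoding, using the same four integers as above. The function names encode_ints and decode_ints are hypothetical and chosen only for illustration.

```python
def encode_ints(values):
    """Join integers into a single colon-separated string."""
    return ":".join(str(v) for v in values)


def decode_ints(text):
    """Parse a colon-separated string back into a list of integers."""
    return [int(part) for part in text.split(":")] if text else []


encoded = encode_ints([12, 3, -23, 67])
print(encoded)               # 12:3:-23:67
print(decode_ints(encoded))  # [12, 3, -23, 67]
```

This is enough for a flat list of integers, but every new kind of field (a string that may contain a colon, a nested record) would need more hand-written encoding and parsing code.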

Protocol buffers can be used instead of these options. Protocol buffers are a flexible, efficient, and automated solution to this problem. With protocol buffers, you write a .proto description of the data structure you want to store. From that file, the protocol buffer compiler creates a class that implements automatic encoding and parsing of the protocol buffer data in an efficient binary format. The generated class provides getters and setters for the fields that make up a protocol buffer and takes care of the details of reading and writing the protocol buffer as a unit. Importantly, the protocol buffer format supports the idea of extending the format over time in such a way that the code can still read data encoded in the old format.

Define the protocol format (write a proto file)

To create an address book application, you need to start with a .proto file. The definitions in a .proto file are simple: you add a message for each data structure you want to serialize, then specify a name and a type for each field in the message.

Example:

syntax = "proto2"; // proto2 specifies the protocol buffers syntax version
package tutorial;
message Person {
  optional string name = 1;
  optional int32 id = 2;
  optional string email = 3;
  enum PhoneType {
    PHONE_TYPE_UNSPECIFIED = 0;
    PHONE_TYPE_MOBILE = 1;
    PHONE_TYPE_HOME = 2;
    PHONE_TYPE_WORK = 3;
  }
  message PhoneNumber {
    optional string number = 1;
    optional PhoneType type = 2 [default = PHONE_TYPE_HOME];
  }
  repeated PhoneNumber phones = 4; // phones is a repeated field that can hold multiple phone numbers.
}
message AddressBook {
  repeated Person people = 1;
}

Notes:

The .proto file above starts with a package declaration, which helps prevent naming conflicts between different projects. In Python, packages are normally determined by directory structure, so the package defined in the .proto file has no effect on the generated code. However, you should still declare one to avoid name conflicts in the Protocol Buffers namespace as well as in non-Python languages.

Next come the message definitions. A message is just an aggregate containing a set of typed fields. Many standard simple data types are available as field types, including bool, int32, float, double, and string. You can also add further structure to your messages by using other message types as field types: in the example above, the Person message contains PhoneNumber messages, and the AddressBook message contains Person messages. You can even define message types nested inside other messages; as shown above, the PhoneNumber type is defined inside Person. You can also define enum types if you want one of your fields to have one of a predefined list of values. Here you want to specify that a phone number can be one of the following phone types:

PHONE_TYPE_MOBILE
PHONE_TYPE_HOME
PHONE_TYPE_WORK

The " = 1" and " = 2" markers on each element identify the unique field number ("tag") that the field uses in the binary encoding, so that each field can be correctly identified during serialization and deserialization. Field numbers must be unique within a message. Tag numbers 1 to 15 require one less byte to encode than higher numbers, so as an optimization you can use them for commonly used or repeated elements and leave tag numbers 16 and higher for less commonly used optional elements. Each element in a repeated field requires re-encoding the tag number, so repeated fields are particularly good candidates for this optimization.
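
The one-byte saving can be checked with a few lines of plain Python. This is a minimal sketch of the wire format's key encoding, assuming the standard rule that each field is preceded by the varint of (field_number << 3) | wire_type; the helper names encode_varint and field_key are hypothetical and not part of the protobuf package.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def field_key(field_number, wire_type):
    """Encode the key that precedes every field value on the wire."""
    return encode_varint((field_number << 3) | wire_type)


# Print the key size in bytes for a few field numbers (wire type 0 = varint).
for number in (1, 15, 16, 2047, 2048):
    print(number, len(field_key(number, 0)))
```

Field numbers 1 and 15 give a one-byte key, 16 and 2047 give two bytes, and 2048 gives three, because the field number shares the first byte with the three wire-type bits.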

Each field must be annotated with one of the following modifiers:

optional: the field may or may not be set. If an optional field value isn't set, a default value is used. For simple types, you can specify your own default value, as is done for the phone number type in the example. Otherwise, a system default is used: zero for numeric types, the empty string for strings, and false for bools. For embedded messages, the default value is always the "default instance" or "prototype" of the message, which has none of its fields set. Calling the accessor to get the value of an optional (or required) field that has not been explicitly set always returns that field's default value.
repeated: the field may be repeated any number of times (including zero). Think of repeated fields as dynamically sized arrays; the order of the repeated values is preserved in the protocol buffer.
required: a value for the field must be provided, otherwise the message is considered "uninitialized". Serializing an uninitialized message raises an exception, and parsing an uninitialized message fails. Other than this, a required field behaves exactly like an optional field.

Important

required is forever: be very careful about marking fields as required. If at some point you want to stop writing or sending a required field, changing it to an optional field is problematic: old readers will consider messages without this field incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead. Within Google, required fields are strongly disfavored; most messages defined in proto2 syntax use only optional and repeated. (proto3 does not support required fields at all.)

Compile the protocol buffers

Now that you have a .proto file, the next step is to generate the classes needed to read and write AddressBook (and therefore Person and PhoneNumber) messages. To do this, run the protocol buffer compiler protoc on the .proto file:

1. Download protoc and, after extracting it, add the bin directory that contains protoc to the system PATH environment variable.

>protoc --version
libprotoc 25.4

2. Now run the compiler, specifying the source directory (where your application's source code lives; the current directory is used if you don't provide a value), the destination directory (where you want the generated code to go; often the same as $SRC_DIR), and the path to your .proto file, as follows:

protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/

Because I want Python classes, I use the --python_out option; similar options are provided for the other supported languages.

protoc can also generate Python stubs (.pyi) with --pyi_out.

This generates the corresponding xxxx_pb2.py in the destination directory you specified.
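
If you prefer to drive the compiler from a build script, the same command line can be assembled in Python. This is only a sketch: the helper names build_protoc_command and run_protoc are hypothetical, it assumes protoc is on the PATH, and it uses only the -I, --python_out, and --pyi_out options mentioned above. The file name addressbook.proto2 is the one used in the practice step of this article.

```python
import subprocess


def build_protoc_command(src_dir, dst_dir, proto_file, with_stubs=False):
    """Return the protoc argument list for generating Python code."""
    command = ["protoc", f"-I={src_dir}", f"--python_out={dst_dir}"]
    if with_stubs:
        command.append(f"--pyi_out={dst_dir}")  # also emit .pyi type stubs
    command.append(proto_file)
    return command


def run_protoc(src_dir, dst_dir, proto_file, with_stubs=False):
    """Run protoc; raises CalledProcessError if the compiler reports an error."""
    command = build_protoc_command(src_dir, dst_dir, proto_file, with_stubs)
    subprocess.run(command, check=True)


# Only print the command here; call run_protoc(...) to actually execute it.
print(build_protoc_command(".", ".", "addressbook.proto2"))
```

Keeping the argument list separate from the call that runs it makes the command easy to log or test without having protoc installed.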

Practice: open a console with cmd, change to the directory that contains addressbook.proto2, and run the following command:

protoc --python_out=. addressbook.proto2

After the command succeeds, a directory named after the .proto2 file (addressbook in this example) is created in the current directory, and the corresponding Python file is generated inside it (proto2_pb2.py in this example). Copy that file to the directory that contains addressbook.proto2 and rename it to addressbook_pb2.py.
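
The extra directory appears because the file name does not end in .proto. The following sketch reproduces the naming rule as inferred from the behavior observed in this article (it is not taken from the protoc documentation, so treat it as an assumption); the function name generated_python_path is hypothetical.

```python
def generated_python_path(proto_name):
    """Sketch of how the output path appears to follow from the input name."""
    base = proto_name
    # Only a trailing ".proto" suffix is stripped from the file name.
    if base.endswith(".proto"):
        base = base[: -len(".proto")]
    module_name = base.replace("-", "_") + "_pb2"
    # Any dot left in the module name becomes a directory separator.
    return module_name.replace(".", "/") + ".py"


print(generated_python_path("addressbook.proto2"))  # addressbook/proto2_pb2.py
print(generated_python_path("addressbook.proto"))   # addressbook_pb2.py
```

Under this rule, naming the file addressbook.proto should produce addressbook_pb2.py directly and make the copy-and-rename step unnecessary.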

Protocol Buffer API

Unlike when you generate Java and C++ protocol buffer code, the Python protocol buffer compiler doesn't generate your data access code for you directly. Instead (as you'll see if you look at addressbook_pb2.py), it generates special descriptors for all your messages, enums, and fields, and some mysteriously empty classes, one for each message type.

# -*- coding: utf-8 -*-
# Generated by the protocol buffer compiler.  DO NOT EDIT!
# source: addressbook.proto2
# Protobuf Python Version: 4.25.4
"""Generated protocol buffer code."""
from google.protobuf import descriptor as _descriptor
from google.protobuf import descriptor_pool as _descriptor_pool
from google.protobuf import symbol_database as _symbol_database
from google.protobuf.internal import builder as _builder
# @@protoc_insertion_point(imports)
_sym_db = _symbol_database.Default()
DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\x12\x61\x64\x64ressbook.proto2\x12\x08tutorial\"\xa3\x02\n\x06Person\x12\x0c\n\x04name\x18\x01 \x01(\t\x12\n\n\x02id\x18\x02 \x01(\x05\x12\r\n\x05\x65mail\x18\x03 \x01(\t\x12,\n\x06phones\x18\x04 \x03(\x0b\x32\x1c.tutorial.Person.PhoneNumber\x1aX\n\x0bPhoneNumber\x12\x0e\n\x06number\x18\x01 \x01(\t\x12\x39\n\x04type\x18\x02 \x01(\x0e\x32\x1a.tutorial.Person.PhoneType:\x0fPHONE_TYPE_HOME\"h\n\tPhoneType\x12\x1a\n\x16PHONE_TYPE_UNSPECIFIED\x10\x00\x12\x15\n\x11PHONE_TYPE_MOBILE\x10\x01\x12\x13\n\x0fPHONE_TYPE_HOME\x10\x02\x12\x13\n\x0fPHONE_TYPE_WORK\x10\x03\"/\n\x0b\x41\x64\x64ressBook\x12 \n\x06people\x18\x01 \x03(\x0b\x32\x10.tutorial.Person')
_globals = globals()
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'addressbook.proto2_pb2', _globals)
if _descriptor._USE_C_DESCRIPTORS == False:
  DESCRIPTOR._options = None
  _globals['_PERSON']._serialized_start=33
  _globals['_PERSON']._serialized_end=324
  _globals['_PERSON_PHONENUMBER']._serialized_start=130
  _globals['_PERSON_PHONENUMBER']._serialized_end=218
  _globals['_PERSON_PHONETYPE']._serialized_start=220
  _globals['_PERSON_PHONETYPE']._serialized_end=324
  _globals['_ADDRESSBOOK']._serialized_start=326
  _globals['_ADDRESSBOOK']._serialized_end=373
# @@protoc_insertion_point(module_scope)

The important line in each class is __metaclass__ = reflection.GeneratedProtocolMessageType; in the listing above, which was produced by a newer protoc, the _builder calls do this work instead. Metaclasses can be thought of as templates for creating classes. At load time, the GeneratedProtocolMessageType metaclass uses the specified descriptors to create all the Python methods needed to work with each message type and adds them to the relevant classes. You can then use the fully populated classes in your code.

The end effect of all this is that you can use the Person class as if it defined each field of the Message base class as a regular field. For example:

import addressbook_pb2
person = addressbook_pb2.Person()
person.id = 1234
person.name = "John Doe"
person.email = "jdoe@"
phone = person.phones.add()
phone.number = "555-4321"
phone.type = addressbook_pb2.Person.PHONE_TYPE_HOME

Note that these assignments are not just adding arbitrary new fields to a generic Python object. If you try to assign a field that isn't defined in the .proto file, an AttributeError is raised. If you assign a field a value of the wrong type, a TypeError is raised. Also, reading the value of a field before it has been set returns the default value.
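
The behavior described above can be imitated with a small metaclass in plain Python. This is a toy model for illustration only, not how the protobuf runtime is actually implemented: ToyMessageType, the FIELDS table, and this stand-in Person class are all hypothetical.

```python
class ToyMessageType(type):
    """Toy metaclass: turns a FIELDS table into typed attributes with defaults."""

    def __new__(mcls, name, bases, namespace):
        for field_name, (field_type, default) in namespace.get("FIELDS", {}).items():
            namespace[field_name] = mcls._make_property(field_name, field_type, default)
        # With __slots__, assigning an undeclared attribute raises AttributeError.
        namespace["__slots__"] = ("_values",)
        return super().__new__(mcls, name, bases, namespace)

    @staticmethod
    def _make_property(field_name, field_type, default):
        def getter(self):
            # Unset fields fall back to their default value.
            return self._values.get(field_name, default)

        def setter(self, value):
            if not isinstance(value, field_type):
                raise TypeError(f"{field_name} expects {field_type.__name__}")
            self._values[field_name] = value

        return property(getter, setter)


class Person(metaclass=ToyMessageType):
    # field name -> (Python type, default value)
    FIELDS = {"name": (str, ""), "id": (int, 0), "email": (str, "")}

    def __init__(self):
        self._values = {}


person = Person()
person.id = 1234
print(person.id, repr(person.name))  # 1234 ''

for bad_assignment in (("no_such_field", 1), ("id", "1234")):
    try:
        setattr(person, *bad_assignment)
    except (AttributeError, TypeError) as exc:
        print(type(exc).__name__)  # AttributeError, then TypeError
```

The class body only declares the fields; everything else is attached when the class is created, which is the same division of labor the generated code relies on.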

Enums

The metaclass expands enums into a set of symbolic constants with integer values. For example, the constant addressbook_pb2.Person.PhoneType.PHONE_TYPE_WORK has the value 3.

Standard message methods

Each message class also contains many other methods that allow you to check or manipulate the entire message, including:

IsInitialized(): Check whether all required fields have been set.
__str__(): Returns a human-readable representation of the message, particularly useful for debugging. (Usually invoked as str(message) or print(message).)
CopyFrom(other_msg): Overwrite the message with the value of the given message.
Clear(): Clear all elements to return to empty state.

These methods implement the Message interface. For more information, see the complete API documentation for Message.

Parsing and serialization

Each protocol buffer class has a method to write and read messages of the selected type using the protocol buffer binary format. These methods include:

SerializeToString(): serializes the message and returns the result. Despite the method name, the result is binary data, not text; in Python 3 it is a bytes object.
ParseFromString(data): parses a message from the given bytes.

These are just some of the options used for parsing and serialization. Again, see the Message API reference for the complete list.
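
To give a feel for what SerializeToString() produces, the sketch below hand-encodes the name and id fields of a Person according to the protocol buffer wire format and decodes them again. It is an illustration in plain Python, not the library's implementation; the helper names are hypothetical, and it only handles non-negative varints and length-delimited fields.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Return (value, next_position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_person(name, person_id):
    """Hand-encode name (field 1, wire type 2) and id (field 2, wire type 0)."""
    raw_name = name.encode("utf-8")
    return (
        encode_varint((1 << 3) | 2) + encode_varint(len(raw_name)) + raw_name
        + encode_varint((2 << 3) | 0) + encode_varint(person_id)
    )


def decode_fields(data):
    """Yield (field_number, value) for varint and length-delimited fields."""
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = decode_varint(data, pos)
        elif wire_type == 2:
            length, pos = decode_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        yield field_number, value


payload = encode_person("John Doe", 1234)
print(payload)                       # b'\n\x08John Doe\x10\xd2\t'
print(list(decode_fields(payload)))  # [(1, b'John Doe'), (2, 1234)]
```

A Person message with only name and id set is expected to serialize to the same bytes, which is why the output is compact but unreadable as text. Note that the decoder sees only field numbers, never field names; the names live in the .proto file.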

Important

Protocol Buffers and object-oriented design: protocol buffer classes are basically data holders (like structs in C) and do not provide additional functionality; they don't make good first-class citizens in an object model. If you want to add richer behavior to a generated class, the best way is to wrap the generated protocol buffer class in an application-specific class. Wrapping protocol buffers is also a good idea if you don't control the design of the .proto file (for example, if you are reusing one from another project). In that case, you can use the wrapper class to craft an interface better suited to the unique environment of your application: hiding some data and methods, exposing convenience functions, and so on. You should never add behavior to the generated classes by inheriting from them. This breaks internal mechanisms and is not good object-oriented practice anyway.
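
A wrapper of this kind might look like the following sketch. The Contact class and its display_name property are hypothetical, and a SimpleNamespace object stands in for the generated message so that the example runs without generated code; in the address book application the wrapped object would be an addressbook_pb2.Person instance.

```python
from types import SimpleNamespace


class Contact:
    """Application-specific wrapper that holds a Person-like message."""

    def __init__(self, person_msg):
        # Composition: the message is a member, not a base class.
        self._msg = person_msg

    @property
    def display_name(self):
        """Convenience accessor that the generated class does not provide."""
        if self._msg.email:
            return f"{self._msg.name} <{self._msg.email}>"
        return self._msg.name

    def to_message(self):
        """Expose the underlying message, for example for serialization."""
        return self._msg


# Stand-in for a generated Person message (assumption for this sketch).
person = SimpleNamespace(name="John Doe", email="jdoe@example.com")
print(Contact(person).display_name)  # John Doe <jdoe@example.com>
```

The wrapper decides what the rest of the application sees, while the generated class stays untouched and can be regenerated at any time.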

Write a message

Suppose the first thing you want the address book application to do is write personal details to the address book file. To do this, you need to create and populate instances of the protocol buffer classes and then write them to an output stream.

This sample program reads an AddressBook from a file, adds one new Person to it based on user input, and writes the new AddressBook back to the file.

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import addressbook_pb2
import os
def PromptForAddress(person):
    '''Fill in a Person message based on user input'''
    person.id = int(input('Enter person ID number: '))
    person.name = input('Enter name: ')
    email = input('Enter email address (blank for none): ')
    if email != '':
        person.email = email
    while True:
        number = input('Enter a phone number (or leave blank to finish): ')
        if number == '':
            break
        # add() appends a new PhoneNumber to the repeated field and returns it
        phone_number = person.phones.add()
        phone_number.number = number
        phone_type = input('Is this a mobile, home, or work phone? ')
        if phone_type == 'mobile':
            phone_number.type = addressbook_pb2.Person.PhoneType.PHONE_TYPE_MOBILE
        elif phone_type == 'home':
            phone_number.type = addressbook_pb2.Person.PhoneType.PHONE_TYPE_HOME
        elif phone_type == 'work':
            phone_number.type = addressbook_pb2.Person.PhoneType.PHONE_TYPE_WORK
        else:
            print('Unknown phone type; leaving as default value.')
address_book = addressbook_pb2.AddressBook()
# Read the existing address book, if there is one
if os.path.exists('my_addressbook.db'):
    with open('my_addressbook.db', 'rb') as f:
        address_book.ParseFromString(f.read())
# Add an address
PromptForAddress(address_book.people.add())
# Write the new address book back to disk
with open('my_addressbook.db', 'wb') as f:
    f.write(address_book.SerializeToString())

After running the program, enter the content according to the prompts, as shown below

Enter person ID number: 1
Enter name: shouke
Enter email address (blank for none): shouke@
Enter a phone number (or leave blank to finish): 15813735565
Is this a mobile, home, or work phone? mobile
Enter a phone number (or leave blank to finish):

Read the message

This example reads the file created by the above example and prints all the information in it

# -*- coding:utf-8 -*-
import addressbook_pb2
def ListPeople(address_book):
  '''Iterate over all people in the address book and print their information'''
  for person in address_book.people:
    print('Person ID: ', person.id)
    print('Name: ', person.name)
    # HasField reports whether the optional field was explicitly set
    if person.HasField('email'):
      print('E-mail address: ', person.email)
    for phone_number in person.phones:
      if phone_number.type == addressbook_pb2.Person.PhoneType.PHONE_TYPE_MOBILE:
        print('Mobile phone #: ', end='')
      elif phone_number.type == addressbook_pb2.Person.PhoneType.PHONE_TYPE_HOME:
        print('Home phone #: ', end='')
      elif phone_number.type == addressbook_pb2.Person.PhoneType.PHONE_TYPE_WORK:
        print('Work phone #: ', end='')
      print(phone_number.number)
address_book = addressbook_pb2.AddressBook()
# Read the existing address book
with open('my_addressbook.db', 'rb') as f:
  address_book.ParseFromString(f.read())
ListPeople(address_book)

Run output:

Person ID:  1
Name:  shouke
E-mail address:  shouke@
Mobile phone #: 15813735565

Another example

This example defines a message named Device with 4 fields: name, price, type, and labels.

syntax = "proto3";
message Device {
  string name = 1;
  int32 price = 2;
  string type = 3;
  map<string, string> labels = 15;
}

Generate the Python file from the proto file:

protoc --python_out=.

This automatically creates the device directory and the file device/proto3_pb2.py in the current directory.

Use the generated Python file (copy the file above, rename it to device_pb2.py, and store it in the same directory as the following file):

my_test.py

# -*- coding:utf-8 -*-
import device_pb2
# Create a Device object and set its field values
device = device_pb2.Device()
device.name = 'Lenovo Xiaoxing'
device.price = 3999
device.type = 'Notebook'
device.labels['color'] = 'red'
device.labels['outlook'] = 'fashionable'
# Serialize the Device object to a binary string
serialized_device = device.SerializeToString()
print(f"Serialized data: {serialized_device}")
# Deserialize the binary string into a new Device object
new_device = device_pb2.Device()
new_device.ParseFromString(serialized_device)
# Output the field values of the new Device object
print(type(new_device.labels)) # <class 'google._upb._message.ScalarMapContainer'>
for label, value in new_device.labels.items():
    print(label, value) # The output looks like: color red
print(new_device.labels) # {'color': 'red', 'outlook': 'fashionable'}
print(f'Deserialized data: Device name={new_device.name}, price={new_device.price}, type={new_device.type}, Label={new_device.labels}')
# Output: Deserialized data: Device name=Lenovo Xiaoxing, price=3999, type=Notebook, Label={'color': 'red', 'outlook': 'fashionable'}

Reference link

/getting-started/pythontutorial/

/programming-guides/proto3/
