Using Python's base64 module to encode binary data, in detail

Preamble

Yesterday a colleague on my team asked about the POP3 protocol, so today I studied the POP3 protocol format and Python's poplib. Part of the data sent back by a POP server has to be decoded with Base64, so I also took a look at Python's base64 module.

This post covers the base64 module first. It provides functions for encoding and decoding Base16, Base32, Base64, Base85 and Ascii85. The poplib module will get its own post later. Well, I have dug myself another hole, and at this rate I will never fill them all in...

The following excerpt explains why the returned data is Base64-encoded in the first place:

For historical reasons, some mail systems on the Internet only support 7-bit character transfer, while the internal code of Chinese characters is 8-bit. When an e-mail containing Chinese characters passes through one of these 7-bit-only systems, every 1 in the eighth bit of the characters' internal code is changed to 0.
Take the word "Chinese" (中文) as an example: its internal code in hex is A4A4A4E5, and once the highest bit of each byte is cleared it becomes 24242465, which is "$$$e". Telnet has the same problem.
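
The bit-clearing described above is easy to reproduce. This is only a minimal sketch: it uses the hex value quoted in the excerpt, and the helper name strip_high_bit is made up for illustration.

```python
# Bytes quoted in the excerpt for the word "Chinese".
original = bytes.fromhex("A4A4A4E5")


def strip_high_bit(data):
    """Simulate a 7-bit-only mail relay by clearing the top bit of every byte."""
    return bytes(b & 0x7F for b in data)


mangled = strip_high_bit(original)
print(mangled)  # b'$$$e'
```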

Besides Chinese e-mails, the same problem occurs when e-mail is used to send pictures, programs, compressed files and so on. Various mail encoding methods are therefore used to work around it: 8-bit data is encoded according to fixed rules so that it passes intact through mail systems that only support 7-bit characters.

The common mail encodings are UU and MIME. MIME (Multipurpose Internet Mail Extensions) is generally translated as "Multimedia Extension Mode". As the name suggests, it is meant for sending multimedia content, and it lets files of various types be attached to a single mail.

MIME defines two encoding methods, Base64 and QP (Quoted-Printable), which are used in different situations. QP leaves data that is already 7-bit untouched and only converts 8-bit data into 7-bit form; it suits text that is not pure US-ASCII, such as our Chinese files. Base64 re-encodes the whole file into 7-bit characters and is used when sending binary files. Because the two methods work differently, they also produce encoded files of different sizes. Some lazy software simply uses Base64 for everything.
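
To get a feel for the size difference mentioned in the excerpt, the two encodings can be compared with the standard quopri and base64 modules. This is a rough sketch: the sample text and the helper name encoded_sizes are my own, not from the excerpt.

```python
import base64
import quopri


def encoded_sizes(data):
    """Return (quoted-printable length, Base64 length) for the given bytes."""
    return len(quopri.encodestring(data)), len(base64.b64encode(data))


# Mostly ASCII text ending in two Chinese characters (written as escapes).
mostly_ascii = "The meeting moved to 10:00 in room 3, see you there: \u4e2d\u6587".encode("utf-8")
# Pure 8-bit data, as a stand-in for a binary attachment.
binary_blob = bytes(range(128, 256))

text_qp, text_b64 = encoded_sizes(mostly_ascii)
blob_qp, blob_b64 = encoded_sizes(binary_blob)
print("text:", text_qp, text_b64)
print("blob:", blob_qp, blob_b64)
```

For the mostly-ASCII text QP comes out shorter, because only the few 8-bit bytes are expanded. For the binary blob every byte is expanded to three characters, so Base64 wins clearly.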

Base64

The base64 module provides six functions for Base64 encoding and decoding, and they can be categorized into three groups.

base64.b64encode(s, altchars=None)
base64.b64decode(s, altchars=None, validate=False)

The parameter s is the data to be encoded or decoded. For b64encode, s must be a bytes-like object such as bytes. For b64decode, s can be either a bytes-like object or an ASCII string (str).

Base64-encoded data may contain the symbols '+' and '/', which can cause bugs when the encoded data is used in a URL or a file system path. The base64 module therefore provides a way to replace '+' and '/' in the encoded data.

The parameter altchars must be a bytes object of length 2. Its two symbols are used in place of '+' and '/' in the encoded data. This parameter defaults to None.

The parameter validate defaults to False, in which case characters outside the Base64 alphabet are silently discarded before decoding. If it is True, the base64 module checks s for such characters before decoding and raises binascii.Error if it finds any (in the source I read, with the message "Non-base64 digit found").

If the data is not correctly padded, binascii.Error is raised as well: "Incorrect padding".

>>> import base64
>>> x = base64.b64encode(b'test')
>>> x
b'dGVzdA=='
>>> base64.b64decode(x)
b'test'
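
The parameters described above can be tried out as follows. This is a small sketch: the input bytes are chosen by me so that the standard encoding contains both '+' and '/', and decode_error is a made-up helper that just captures the exception.

```python
import base64
import binascii

raw = b"\xfb\xff\xbe"  # chosen so the standard encoding contains '+' and '/'

standard = base64.b64encode(raw)
custom = base64.b64encode(raw, altchars=b"-_")
print(standard, custom)  # b'+/++' b'-_--'

# The same altchars must be passed again when decoding.
restored = base64.b64decode(custom, altchars=b"-_")

# With validate=False (the default) the stray '!' is silently dropped.
lenient = base64.b64decode(b"dGVz!dA==")


def decode_error(data, **kwargs):
    """Return the binascii.Error raised by b64decode, or None if it succeeds."""
    try:
        base64.b64decode(data, **kwargs)
    except binascii.Error as exc:
        return exc
    return None


strict_error = decode_error(b"dGVz!dA==", validate=True)  # bad character
padding_error = decode_error(b"dGVzdA")                   # '==' is missing
print(lenient, repr(strict_error), repr(padding_error))
```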

base64.standard_b64encode(s)
base64.standard_b64decode(s)

This pair simply passes the argument s straight through to the previous pair, using the standard Base64 alphabet.

base64.urlsafe_b64encode(s)
base64.urlsafe_b64decode(s)

This pair is also built on the first pair, but after encoding it replaces '+' and '/' in the output with '-' and '_'. Before decoding it replaces '-' and '_' in the input with '+' and '/'.

Base64 encoding may also produce the symbol '=', which pads the length of the encoded data to a multiple of 4.
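
A quick sketch of both points, the URL-safe alphabet and the '=' padding. The sample bytes are my own choice.

```python
import base64

raw = b"\xfb\xff\xbe"  # encodes to '+' and '/' characters in standard Base64
standard = base64.b64encode(raw)
urlsafe = base64.urlsafe_b64encode(raw)
print(standard, urlsafe)  # b'+/++' b'-_--'

# Decoding with the URL-safe variant gives the original bytes back.
roundtrip = base64.urlsafe_b64decode(urlsafe)

# Inputs of 1, 2 and 3 bytes need 2, 1 and 0 padding characters.
padded = [base64.b64encode(b"abc"[:n]) for n in (1, 2, 3)]
print(padded)  # [b'YQ==', b'YWI=', b'YWJj']
```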

Base32

base64.b32encode(s)
base64.b32decode(s, casefold=False, map01=None)

The parameter s behaves the same as in the Base64 functions.

Base32 output uses characters in the range [2-7A-Z] and does not include lowercase letters. However, when the parameter casefold is True, b32decode also accepts lowercase input. For security reasons this parameter defaults to False.

Base32 decoding can also treat the digit 0 as the uppercase letter O and the digit 1 as the uppercase letter I or L. The parameter map01 specifies which letter the digit 1 is mapped to (the source code does not check that it is I or L), and whenever map01 is not None the digit 0 is always mapped to the letter O. Again for security reasons, this parameter defaults to None.
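
Here is a short sketch of casefold and map01 in action. The helper try_b32decode and the sample strings are mine, used only to show which inputs are accepted.

```python
import base64
import binascii

encoded = base64.b32encode(b"hello")
print(encoded)  # b'NBSWY3DP'


def try_b32decode(data, **kwargs):
    """Return the decoded bytes, or None if b32decode rejects the input."""
    try:
        return base64.b32decode(data, **kwargs)
    except binascii.Error:
        return None


# Lowercase input is rejected unless casefold=True.
lower_strict = try_b32decode(encoded.lower())
lower_folded = try_b32decode(encoded.lower(), casefold=True)

# Someone typed 0 for O and 1 for L; map01 says which letter 1 stands for.
canonical = b"OLOLOLOL"
typed = b"01010101"
digits_strict = try_b32decode(typed)
digits_mapped = try_b32decode(typed, map01=b"L")
print(lower_strict, lower_folded, digits_strict, digits_mapped)
```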

Base16

base64.b16encode(s)
base64.b16decode(s, casefold=False)

Base16 output uses characters in the range [0-9A-F].

The arguments s and casefold work the same way as in Base32.
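
A minimal sketch for Base16, with sample bytes of my own. It also shows that the result is the same as an uppercase hex dump.

```python
import base64
import binascii

raw = b"\xde\xad\xbe\xef"
encoded = base64.b16encode(raw)
print(encoded)  # b'DEADBEEF'

# Base16 is just uppercase hexadecimal.
same_as_hex = encoded == raw.hex().upper().encode("ascii")

# Lowercase input is only accepted with casefold=True.
try:
    base64.b16decode(b"deadbeef")
    lower_rejected = False
except binascii.Error:
    lower_rejected = True

folded = base64.b16decode(b"deadbeef", casefold=True)
print(same_as_hex, lower_rejected, folded == raw)
```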

Base85

base64.b85encode(b, pad=False)
base64.b85decode(b)

The parameter b is the data to be encoded or decoded, with the same type requirements as the parameter s of the Base64 functions.

When the parameter pad is True, the data is padded with b'\0' to a multiple of 4 bytes before encoding. This padding is not removed again when decoding.

This pair of functions was added in Python 3.4.
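
The effect of pad is easiest to see with an input whose length is not a multiple of 4. A small sketch with my own sample data:

```python
import base64

data = b"abcde"  # 5 bytes, not a multiple of 4

plain = base64.b85encode(data)
padded = base64.b85encode(data, pad=True)
print(len(plain), len(padded))  # 7 10

plain_back = base64.b85decode(plain)
padded_back = base64.b85decode(padded)
print(plain_back)   # b'abcde'
print(padded_back)  # b'abcde\x00\x00\x00'
```

With pad=True the three b'\0' bytes survive the round trip, so the caller has to know the original length to strip them.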

Ascii85

base64.a85encode(b, *, foldspaces=False, wrapcol=0, pad=False, adobe=False)

The parameter b is the data to be encoded and must be a bytes-like object.

When the parameter foldspaces is True, b'y' is used to represent 4 consecutive spaces.

The parameter wrapcol is an integer. When it is non-zero, a line break b'\n' is inserted into the encoded output so that no line is longer than wrapcol characters.

When the parameter pad is True, the data is padded with b'\0' to a multiple of 4 bytes before encoding. This padding is not removed again when decoding.

The parameter adobe specifies whether the output is in Adobe format. Adobe Ascii85-encoded data is surrounded by <~ and ~>. If this parameter is True, the returned data is wrapped in this pair of symbols.

base64.a85decode(b, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v')

The parameter b is the data to be decoded and can be a bytes-like object or an ASCII string (str).

When the parameter foldspaces is True, b'y' in the input is taken to represent 4 consecutive spaces.

The parameter adobe specifies whether the input is in Adobe format. Adobe Ascii85-encoded data is surrounded by <~ and ~>. If this parameter is True, base64 removes this pair of symbols before decoding.

The parameter ignorechars specifies the characters to ignore when decoding. By default it contains all ASCII whitespace characters.

This pair of functions was added in Python 3.4.
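
The Ascii85 options can be sketched like this. The sample inputs are my own, and each option is shown separately to keep the behaviour easy to follow.

```python
import base64

# foldspaces: four spaces collapse into a single 'y'.
spaces = base64.a85encode(b"    ", foldspaces=True)
spaces_back = base64.a85decode(spaces, foldspaces=True)
print(spaces)  # b'y'

# adobe: the output is framed by <~ and ~>, which a85decode strips again.
framed = base64.a85encode(b"hello", adobe=True)
unframed = base64.a85decode(framed, adobe=True)
print(framed[:2], framed[-2:], unframed)

# wrapcol: lines are limited to 20 characters; the default ignorechars
# lets a85decode skip the inserted newlines.
payload = bytes(range(1, 61))
wrapped = base64.a85encode(payload, wrapcol=20)
line_lengths = [len(line) for line in wrapped.split(b"\n")]
restored = base64.a85decode(wrapped)
print(line_lengths)
```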

The official documentation for the base64 module mentions that Base85 and Ascii85 use 5 characters to encode every 4 bytes, while Base64 uses 6 characters for every 4 bytes (in practice, 4 characters for every 3 bytes), so the first two are more efficient than Base64 when space is scarce.
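
The expansion factors are easy to check by encoding the same payload with every scheme. In this sketch the 60-byte payload is my own choice: its length is a multiple of 3, 4 and 5, so no scheme needs padding, and it contains no run of zeros or spaces that Ascii85 would abbreviate.

```python
import base64

payload = bytes(range(1, 61))  # 60 bytes

sizes = {
    "b16": len(base64.b16encode(payload)),
    "b32": len(base64.b32encode(payload)),
    "b64": len(base64.b64encode(payload)),
    "a85": len(base64.a85encode(payload)),
    "b85": len(base64.b85encode(payload)),
}
print(sizes)  # {'b16': 120, 'b32': 96, 'b64': 80, 'a85': 75, 'b85': 75}
```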

Old API

base64 still retains a portion of the old API for some special purposes.

base64.encode(input, output)
base64.decode(input, output)

This pair of functions reads its data from one binary file and writes the encoded/decoded result to another binary file.
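
A minimal sketch of the file-based interface, assuming the pair in question is the module's legacy base64.encode and base64.decode functions. In-memory io.BytesIO objects stand in for real binary files here.

```python
import base64
import io

# Encode from one binary file object into another.
source = io.BytesIO(b"test")
encoded_file = io.BytesIO()
base64.encode(source, encoded_file)
encoded_value = encoded_file.getvalue()
print(encoded_value)  # b'dGVzdA==\n'

# Rewind and decode it back into a third file object.
encoded_file.seek(0)
decoded_file = io.BytesIO()
base64.decode(encoded_file, decoded_file)
decoded_value = decoded_file.getvalue()
print(decoded_value)  # b'test'
```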

base64.encodebytes(s)
base64.decodebytes(s)

Both encodebytes and b64encode call b2a_base64 from the binascii module internally. The difference is that encodebytes feeds the data through in 57-byte chunks and leaves the newline parameter of b2a_base64 at its default of True, which means that its output gets a newline character b'\n' after every 76 characters and at the very end.

decodebytes is basically the same as b64decode with default parameters. Only the argument type check differs: decodebytes accepts bytes-like data only, not str.
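
The difference between the two encoders shows up as soon as the input is longer than one output line. A small sketch with 100 zero bytes as sample data:

```python
import base64

data = bytes(100)  # 100 zero bytes

one_line = base64.b64encode(data)
multi_line = base64.encodebytes(data)
lines = multi_line.splitlines()
print(len(one_line), [len(line) for line in lines])  # 136 [76, 60]

# Apart from the line breaks the two outputs are identical.
same_payload = b"".join(lines) == one_line
decoded = base64.decodebytes(multi_line)

# decodebytes rejects str, while b64decode accepts it.
try:
    base64.decodebytes("dGVzdA==")
    str_rejected = False
except TypeError:
    str_rejected = True
from_str = base64.b64decode("dGVzdA==")
print(same_payload, decoded == data, str_rejected, from_str)
```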

base64.encodestring(s)
base64.decodestring(s)

This pair was deprecated in Python 3.1 and from then on simply called the previous pair directly. It was removed entirely in Python 3.9.

Summary

The base64 module provides an interface for encoding binary data, covering the standard Base64, Base32 and Base16 as well as the de facto standards Ascii85 and Base85. Studying this module also taught me a good deal about the details of binary data encoding, and it left a deep impression on me. Sometimes we think we understand computers and the Internet, but what we see is only a drop in the ocean. There is still a lot in this field that I do not know and that is waiting to be explored, and I will not stop exploring.
