SoFunction
Updated on 2024-07-15

Difference between the bytes and str types in Python

After a morning of searching for information, I think I have sorted out the difference between the bytes type and the str type.

The bytes and str types look alike when displayed: printing a variable of type bytes shows a sequence that starts with b and is normally enclosed in single quotes. For example:

>>> c = b'\x80abc'
>>> type(c)
<class 'bytes'>

We can see that c = b'\x80abc' is a bytes object. It looks a lot like a string, so what does b'\x80abc' mean? \x80 is an escape made of two hexadecimal digits. Two hex digits cover the decimal values 0 to 255, which is exactly one byte (8 bits); here the byte has the value 0x80, or 128. The abc that follows is simply the letters a, b, and c. Why are they not written as \x...? Because bytes whose values correspond to printable ASCII characters are displayed as those characters, and in UTF-8 an ASCII character is stored completely unchanged, so an a takes up exactly one byte.

With that, the storage of b'\x80abc' is fully understood: four bytes in total, with the value of each byte clear at a glance. Here is another experiment.
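The four bytes can be inspected directly. The following is a small sketch rather than part of the original experiment; the variable names data and values are chosen here only for illustration.

```python
data = b'\x80abc'

# Iterating over a bytes object yields one integer (0-255) per byte.
values = list(data)

print(len(data))   # 4
print(values)      # [128, 97, 98, 99]

# Building the same object from integers gives an equal bytes value:
# 0x80 is shown as an escape, 0x61-0x63 are shown as the letters a, b, c.
rebuilt = bytes([0x80, 0x61, 0x62, 0x63])
print(rebuilt == data)  # True
```

Indexing behaves the same way: data[0] is the integer 128, not a one-character string.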

>>> A = b'\xe5\x9d\x8fHello'.decode("utf-8","strict")
>>> A
'Bad Hello'
>>> type(A)
<class 'str'>

The first thing to know is that utf-8 is variable length encoding. Chinese characters take up 3 bytes, and the utf-8 code for the word 'bad' is \xe5\x9d\x8f . So given a sequence of bytes b'\xe5\x9d\x8fHello', decoding it in utf-8 obviously yields the bad Hello, and we see that after decoding, A has been changed to str, exactly as expected.

If Python cannot decode a byte sequence as UTF-8, it raises an error. For example, decoding b'\x80abc' fails, because 0x80 is a continuation byte and cannot be the first byte of a UTF-8 character:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
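The failure can be caught like any other exception, and the second argument to decode (the "strict" used earlier is its default value) controls what happens to undecodable bytes. The sketch below is one way to explore this; the variable names are illustrative.

```python
raw = b'\x80abc'

message = None
try:
    raw.decode('utf-8')            # same as raw.decode('utf-8', 'strict')
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# 'replace' substitutes U+FFFD (the replacement character) for each bad byte.
replaced = raw.decode('utf-8', errors='replace')
print(replaced == '\ufffdabc')     # True

# 'ignore' silently drops the bad byte.
ignored = raw.decode('utf-8', errors='ignore')
print(ignored)                     # abc
```

Both 'replace' and 'ignore' lose information, so 'strict' is usually the safer choice unless the data is known to be damaged.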
