In the past two days I wrote a crawler that monitors web pages — its job is to track changes to a page — but after running for one night it hit a problem... I hope you can give me some advice!
I'm using Python 3, and the error is thrown when decoding the HTML response. The code looks roughly like this (reconstructed — the fetch uses urllib):

import urllib.request
response = urllib.request.urlopen(dsturl)
content = response.read().decode('utf-8')
The error thrown is:

File "./unxingCrawler_p3.py", line 50, in getNewPhones
    content = response.read().decode()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb2 in position 24137: invalid start byte
It was running fine before; after one night this came up... What I don't understand most is why a page that declares itself as UTF-8 encoded contains characters that UTF-8 can't parse.
It was only later, after a tip from a helpful user, that I realized I needed to use decode('utf-8', 'ignore').
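To show what that fix does, here is a minimal sketch: the raw bytes below stand in for a response body, with a 0xb2 byte spliced in that is invalid as a UTF-8 start byte, just like in my crawler's error.

```python
# Bytes that are mostly valid UTF-8, with one bad byte in the middle.
raw = 'utf-8 text'.encode('utf-8') + b'\xb2' + ' more text'.encode('utf-8')

# 'strict' (the default) raises UnicodeDecodeError, as in the traceback.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print('strict:', e.reason)

# 'ignore' silently drops the bad byte.
print(raw.decode('utf-8', 'ignore'))   # utf-8 text more text

# 'replace' substitutes U+FFFD, so you can see where data was lost.
print(raw.decode('utf-8', 'replace'))
```

For a crawler, 'replace' can be the safer choice, since 'ignore' hides the fact that anything went wrong.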
In order to completely understand the coding problems of python, I would like to share the following, and hope that it will help you to familiarize yourself with the coding problems of python.
1. Start with the bytes:
A byte consists of eight bits, each bit being 0 or 1, so a byte can represent 00000000 through 11111111, a total of 2^8 = 256 values. ASCII uses one byte per character (with the highest bit originally reserved as a parity bit), so it really uses only seven bits, enough for 2^7 = 128 characters. For example, 01000001 (decimal 65) represents the character 'A' in ASCII, and adding 32 gives 01100001 (decimal 97), the character 'a'. Open Python and call the chr and ord functions, and you can watch Python do these ASCII conversions for us.
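The arithmetic above can be checked directly in a Python shell:

```python
print(chr(65))        # 0b01000001 -> 'A'
print(chr(65 + 32))   # adding 32 gives the lowercase letter: 'a'
print(ord('a'))       # back from character to code point: 97
print(bin(ord('A')))  # the bit pattern: '0b1000001'
```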
Code 00000000 is the null character, so ASCII really defines 128 code points, covering letters, digits, punctuation marks, and special symbols. Because ASCII was born in the United States, it is sufficient for English, which builds words out of a small alphabet. But speakers of Chinese, Japanese, Korean, and other languages were not satisfied: Chinese has one distinct symbol per word-character, and even all 256 values of a byte come nowhere near enough.
Unicode characters are usually encoded with two bytes, which can represent 256*256 = 65536 characters; this is known as UCS-2. Four bytes are used for some rare characters, which is known as UCS-4, meaning the Unicode standard is still evolving. But UCS-4 appears rarely, so let's just remember: the original ASCII used one byte per character, but because other languages have far more characters, people moved to two bytes, and the unified, multi-language Unicode encoding emerged.
In Unicode, the original ASCII characters only need a full-zero byte prepended: for example, the character 'a' (01100001 from before) becomes 00000000 01100001 in Unicode. Soon, though, the Americans were unhappy: English text that used to need only one byte per character now takes two, which is a great waste of storage space and transmission speed.
People then applied their ingenuity and came up with UTF-8. Aimed squarely at the wasted-space problem, UTF-8 is a variable-length encoding: one byte for the English alphabet, usually three bytes for Chinese, and up to six bytes for rare characters in the original design (the current standard caps sequences at four bytes). Besides solving the space problem, UTF-8 has a fantastic extra feature: it is backward-compatible with big brother ASCII, so some vintage software can continue to work under UTF-8 unchanged.
Note that apart from the letters of the alphabet, which stay the same, Chinese characters usually have different byte values in Unicode and in UTF-8. For example, the Chinese character '中' ('middle') is 01001110 00101101 (0x4E2D) in Unicode, while in UTF-8 it is 11100100 10111000 10101101 (0xE4 0xB8 0xAD).
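Those byte patterns are easy to verify. Encoding with utf-16-be gives the raw two-byte Unicode (UCS-2) form, while utf-8 gives the three-byte form:

```python
ch = '\u4e2d'                   # the character '中'
print(ch.encode('utf-16-be'))   # Unicode bytes 0x4E 0x2D (prints b'N-')
print(ch.encode('utf-8'))       # UTF-8 bytes: b'\xe4\xb8\xad'
# An ASCII letter, by contrast, is one identical byte in UTF-8:
print('a'.encode('utf-8'))      # b'a'
```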
Naturally, our motherland has its own set of standards: GB2312 and GBK. Of course you don't see them very often nowadays; usually UTF-8 is used directly.
2. Default encoding in Python 3
The default in Python 3 is UTF-8, which we can check with the following code:

import sys
sys.getdefaultencoding()

This displays Python 3's default encoding.
3. encode and decode in Python 3
Character encoding work in Python 3 often uses the decode and encode functions; especially when scraping web pages, it pays to use these two skillfully. encode's role is to turn the intuitive characters we see into the computer's byte form; decode is the opposite, turning byte-form characters back into the intuitive, 'human-readable' form we understand.
\x means hexadecimal: \xe4\xb8\xad is the binary 11100100 10111000 10101101, which is the Chinese character '中' encoded into byte form. Conversely, decoding 11100100 10111000 10101101, i.e. \xe4\xb8\xad, gives back the character '中'. Written out in full it is b'\xe4\xb8\xad': in Python 3, a string in byte form must carry the b prefix, i.e. it is written b'xxxx'.
As mentioned earlier, Python 3's default encoding is UTF-8, so Python handles these characters as UTF-8. Even if we deliberately encode the character with an explicit encode('utf-8'), the result that comes out is still the same: b'\xe4\xb8\xad'.
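The round trip between str and bytes looks like this, assuming the default UTF-8:

```python
s = '\u4e2d'                   # the character '中'
b = s.encode()                 # no argument needed: default is UTF-8
print(b)                       # b'\xe4\xb8\xad'
print(b == s.encode('utf-8'))  # True: explicit 'utf-8' changes nothing
print(b.decode())              # back to '中'
```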
Understanding this, and knowing that UTF-8 is compatible with ASCII, we can guess that the 'A' = 65 we recited at university can be decoded the same way here. Decimal 65 converted to hexadecimal is 41, so let's try:
b'\x41'.decode()
The result is the character 'A'.
4. Encoding Conversion in Python3
It is said that characters are uniformly encoded as Unicode in the computer's memory; they only become UTF-8 when written to a file, stored on a hard disk, or sent from a server to a client (e.g., the code of a web page's front end). But what interests me more is how to display these characters as Unicode escapes, to reveal their in-memory form. Here's a magic mirror:
b'\\u4e2d'.decode('unicode-escape')
Either b'\\u4e2d' or b'\u4e2d' works: inside a bytes literal, one backslash or two doesn't seem to matter. We can also find that in the shell, typing '\u4e2d' directly and typing b'\\u4e2d'.decode('unicode-escape') print the same thing, the Chinese character '中'; by contrast, '\u4e2d'.decode('unicode-escape') raises an error, because a str has no decode method in Python 3. This shows that Python 3 not only supports Unicode, but recognizes a character written in the '\uxxxx' format as equivalent to the str itself.
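To make the distinction concrete: the unicode-escape codec turns the six literal characters \u4e2d into the single character they name.

```python
b = b'\\u4e2d'                  # six bytes: \ u 4 e 2 d
s = b.decode('unicode-escape')  # interprets the escape sequence
print(s)                        # 中
# In a str literal, \u4e2d already IS the character:
print(s == '\u4e2d')            # True
```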
If we have a Unicode escape in byte form, how do we turn it into UTF-8 bytes? With the above understood, we now have the idea: first decode, then encode. The code is as follows:

b'\\u4e2d'.decode('unicode-escape').encode()
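Running the full chain confirms it lands on the UTF-8 bytes we met above:

```python
escaped = b'\\u4e2d'                              # Unicode escape, in byte form
utf8 = escaped.decode('unicode-escape').encode()  # decode to str, encode as UTF-8
print(utf8)                                       # b'\xe4\xb8\xad'
```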
4. Final extension
Remember ord from earlier? Times have changed and big brother ASCII has been absorbed into Unicode, but ord still has its place. Try ord('中'): the output is 20013. What is 20013? Try hex(ord('中')): the output is '0x4e2d', i.e. 20013 is the decimal value of the 0x4e2d we met countless times above. hex, by the way, is the function for converting to hexadecimal; anyone who has worked with microcontrollers will certainly be familiar with hex.
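The same round trip, written out so the numbers line up:

```python
cp = ord('\u4e2d')   # the character '中'
print(cp)            # 20013
print(hex(cp))       # 0x4e2d
print(chr(cp))       # and chr takes us back: 中
```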
Finally, an extension to a problem I've seen others raise on the web. When we write characters like '\u4e2d', Python 3 knows what we mean. But if we make Python read a file containing the literal text \u4e2d, won't the computer fail to recognize it? Someone gave the answer below:
import codecs
file = codecs.open("", "r", "unicode-escape")
u = file.read()
print(u)
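Here is a self-contained version of that answer you can actually run: it writes the literal six characters \u4e2d to a temporary file, then reads them back through the unicode-escape codec (the file path is created just for the demo).

```python
import codecs
import os
import tempfile

# Create a file whose contents are the six characters: \ u 4 e 2 d
path = os.path.join(tempfile.mkdtemp(), 'escaped.txt')
with open(path, 'w') as f:
    f.write('\\u4e2d')

# Reading through the unicode-escape codec interprets the escape.
with codecs.open(path, 'r', 'unicode-escape') as f:
    u = f.read()
print(u)   # 中
```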