1 Representation of floating point numbers
Usually, a floating point number is represented in the following format
S | P | M |
where S is the sign bit, P is the exponent (order code), and M is the mantissa.
On the IBM PC, a single-precision floating point number is 32 bits (4 bytes) and a double-precision floating point number is 64 bits (8 bytes). The number of bits occupied by S, P, and M in each can be seen in the following table
S | P | M | Value formula | Bias |
1 | 8 | 23 | (-1)^S * 2^(P-127) * 1.M | 127 |
1 | 11 | 52 | (-1)^S * 2^(P-1023) * 1.M | 1023 |
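As a quick instance of the single-precision formula: 1.0f is stored with S = 0, P = 127 and M = 0, since (-1)^0 * 2^(127-127) * 1.0 = 1.0.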
Taking single-precision floating point numbers as an example, the binary representation format can be obtained as follows
S (bit 31) | P (bits 30 to 23) | M (bits 22 to 0) |
Here S is the sign bit: 0 means positive and 1 means negative. P is the exponent, which is usually stored in shift code (a biased representation; the shift code and the two's complement differ only in the sign bit, the remaining bits being the same). For example, with a bias of 127, an exponent of 5 is stored as 5 + 127 = 132. As a reminder: for positive numbers the sign-magnitude, ones'-complement and two's-complement forms are identical; for a negative number, the two's complement is obtained by inverting all bits of the sign-magnitude form of its absolute value and then adding 1.
For simplicity, this article discusses only single-precision floating point numbers; double-precision numbers are stored and represented in the same way.
2 Representation conventions of floating point numbers
Single-precision and double-precision floating point numbers are defined by the IEEE 754 standard, which makes a few special conventions.
(1) When P = 0 and M = 0, it means 0.
(2) When P = 255 and M = 0, it means infinity; the sign bit determines whether it is positive or negative infinity.
(3) When P = 255 and M != 0, it means NaN (Not a Number).
When we use the .NET Framework, we usually meet the following three constants (the output shown in the comments is what Console.WriteLine prints; the original code here was lost in extraction, so this is a reconstruction using the standard float members whose values match):

Console.WriteLine(float.MaxValue);   // 3.402823E+38
Console.WriteLine(float.MinValue);   // -3.402823E+38
Console.WriteLine(float.Epsilon);    // 1.401298E-45

// If we convert them to double precision, their values are as follows:
Console.WriteLine(Convert.ToDouble(float.MaxValue));  // 3.40282346638529E+38
Console.WriteLine(Convert.ToDouble(float.MinValue));  // -3.40282346638529E+38
Console.WriteLine(Convert.ToDouble(float.Epsilon));   // 1.40129846432482E-45
So how do you find these values?
According to the above conventions, the maximum value of the exponent field P is 11111110 (that is, 254; since 255 is reserved for the special conventions, 254 is the largest exponent of an exactly representable number), and the maximum value of the mantissa is 23 ones: 11111111111111111111111.
The maximum value is therefore: 0 11111110 11111111111111111111111
That is 2^(254-127) * 1.11111111111111111111111 (binary) = 2^127 * (2 - 2^-23) = 3.40282346638529E+38
This agrees with the double-precision output above. The smallest number is naturally -3.40282346638529E+38.
For the number closest to 0, IEEE 754 makes a convention that expands the representable range near zero: the exponent field is P = 0, interpreted as an exponent of -126 with no implicit leading 1, and the mantissa is M = 0.00000000000000000000001 (binary), i.e. 2^-23. The binary representation of this number is: 0 00000000 00000000000000000000001
That is 2^-126 * 2^-23 = 2^-149 = 1.40129846432482E-45. This agrees with float.Epsilon above.
If we restrict ourselves to normalized numbers, the one closest to 0 is: 0 00000001 00000000000000000000000
That is: 2^-126 * (1 + 0) = 1.17549435082229E-38.
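These bit patterns are easy to verify in code. Below is a minimal C# sketch (our own illustration, not part of the original article) that reinterprets each 32-bit pattern as a float by round-tripping through bytes with BitConverter; both GetBytes and ToSingle follow the machine's byte order, so the reinterpretation is consistent on any CPU:

// 0 11111110 11111111111111111111111 -> float.MaxValue
Console.WriteLine(BitConverter.ToSingle(BitConverter.GetBytes(0x7F7FFFFFu), 0)); // 3.402823E+38
// 0 00000000 00000000000000000000001 -> float.Epsilon (smallest denormalized number)
Console.WriteLine(BitConverter.ToSingle(BitConverter.GetBytes(0x00000001u), 0)); // 1.401298E-45
// 0 00000001 00000000000000000000000 -> smallest normalized number
Console.WriteLine(BitConverter.ToSingle(BitConverter.GetBytes(0x00800000u), 0)); // 1.175494E-38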
3 The accuracy problem of floating point numbers
A floating point number approximates the infinite set of real numbers with a finite, 32-bit length, so in most cases it is only an approximation. Floating point operations also spread error: two floating point numbers that look equal at a given precision may in fact differ in their least significant bits.
Because a floating point number may not exactly represent a decimal value, mathematical and comparison operations on floating point numbers may not behave the way the corresponding decimal arithmetic would.
Values involving floating point numbers may also fail to round-trip. A round trip means that some operation converts the original floating point number into another format, the reverse operation converts that format back into a floating point number, and the final floating point number equals the original one. The round trip may fail because one or more of the least significant bits are lost or changed in the conversions.
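A short C# demonstration of both effects (our own sketch, not code from the article):

float sum = 0f;
for (int i = 0; i < 10; i++)
{
    sum += 0.1f;   // 0.1 has no exact binary representation, so error accumulates
}
Console.WriteLine(sum == 1.0f);                 // False: the sum is about 1.0000001
Console.WriteLine(Convert.ToDouble(123.456f));  // about 123.45600128173828, not 123.456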
4 Denote floating point numbers as binary
4.1 Converting floating point numbers without a fractional part
First, we use a floating point number without a fractional part to illustrate how to convert a floating point number into its binary representation. Assume the number to be converted is 45678.0f.
For a floating point number without a fractional part, the integer part is converted directly into binary:
1011001001101110.0. According to the normalization requirement of floating point numbers, the mantissa must now be put into the form 1.M; the leading 1 is the implicit bit that will not be stored.
Move the decimal point to the left until only one bit remains in front of it, that is, 1.011001001101110; a total of 15 bits have been moved. We know that a left shift of the point corresponds to division by 2 and a right shift to multiplication by 2, so the original number equals 1.011001001101110 * 2^15. Now both the mantissa and the exponent are available. Because the leading 1 is implicit under the standard, it is removed, and the mantissa becomes: 011001001101110.
Finally, append 0s to the mantissa until it has 23 bits, that is: 01100100110111000000000.
Now for the exponent. According to the earlier definition, P - 127 = 15, so P = 142, which in binary is: 10001110.
The number 45678.0f is positive, so the sign bit is 0. Assembling the pieces in the format described above gives: 0 10001110 01100100110111000000000.
This is the binary representation of the number 45678.0f. If we want the hexadecimal representation, it is very simple: we only need to convert this binary string four bits at a time, 0100 0111 0011 0010 0110 1110 0000 0000, i.e. 47 32 6E 00. Note, however, that CPUs of the x86 architecture are little endian (the low byte comes first and the high byte last), so in actual memory the number is stored with the bytes of the above string in reverse order: 00 6E 32 47. It is also easy to check whether the CPU is little endian:

Console.WriteLine(BitConverter.IsLittleEndian);   // True on x86
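The whole conversion can be double-checked by dumping the bytes directly (a small sketch of ours using BitConverter):

byte[] bytes = BitConverter.GetBytes(45678.0f);
Console.WriteLine(BitConverter.ToString(bytes));  // 00-6E-32-47 on a little-endian machine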
4.2 Converting floating point numbers with a fractional part
Floating point numbers with a fractional part bring in the precision problem, so let us work through an example. Assume the number to be converted is 123.456f.
For such a number, the integer part and the fractional part must be processed separately. The integer part is handled as before and converts directly into binary: 1111011. The fractional part is more troublesome. Since binary has only the digits 0 and 1, a fraction can only be expressed in the following way:
a1*2^-1 + a2*2^-2 + a3*2^-3 + ... + an*2^-n
where each coefficient ai is either 0 or 1. In theory a fraction can be approximated arbitrarily closely this way, but the mantissa has only 23 bits, which inevitably leads to a precision problem.
In many cases we can only represent a fraction approximately. Consider the decimal fraction 0.456: how can it be represented in binary? Generally speaking, we obtain the bits by repeatedly multiplying by 2, as sketched in the code below.
First multiply the number by 2; the result is less than 1, so the first bit is 0. Multiply by 2 again; the result is greater than 1, so the second bit is 1, and 1 is subtracted from the result. Continue this loop until the number becomes 0 or enough bits have been produced.
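Here is a minimal C# sketch of this multiply-by-2 loop (our own illustration; it uses double arithmetic, which is accurate enough for the first two dozen bits):

double fraction = 0.456;
var bits = new System.Text.StringBuilder("0.");
for (int i = 0; i < 24 && fraction != 0; i++)
{
    fraction *= 2;                // shift one binary digit to the left of the point
    if (fraction >= 1)
    {
        bits.Append('1');
        fraction -= 1;            // remove the digit just produced
    }
    else
    {
        bits.Append('0');
    }
}
Console.WriteLine(bits);          // 0.011101001011110001101010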
In many cases the binary expansion we get has more bits than the 23 the mantissa can hold, and the extra bits must be dropped. The rounding rule is "0 discard, 1 round up": if the first dropped bit is 0, simply truncate; if it is 1, add 1 to the last kept bit. In this way we obtain the binary representation: 1111011.01110100101111001.
Now move the decimal point to the left; a total of 6 bits are moved. The mantissa is then 1.11101101110100101111001, and the exponent is 6 plus 127, giving 133, which in binary is: 10000101. The complete binary representation is therefore:
0 10000101 11101101110100101111001
Represented in hexadecimal, this is: 42 F6 E9 79
Since the CPU is little endian, it appears in memory as: 79 E9 F6 42.
4.3 Converting pure decimals to binary
To convert a pure decimal (a number less than 1) to binary, it must first be normalized. Take 0.0456 for example: we need to rewrite it in the form M * 2^n with 1 <= M < 2. The exponent n for a pure decimal X can be obtained with the following formula:
n = floor( log2(X) )
For X = 0.0456, log2(0.0456) is about -4.455, so n = -5, and 0.0456 can be written as 1.4592 * 2^-5. Once it is in this form, the mantissa is converted with the fraction procedure above, which gives
1.01110101100011100010001
Remove the leading 1 to get the stored mantissa
01110101100011100010001
The exponent is: -5 + 127 = 122, which in binary is 01111010, so the full representation is
0 01111010 01110101100011100010001
Converted to hexadecimal this is 3D 3A C7 11, which a little-endian machine stores in memory as
11 C7 3A 3D
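To check a hand conversion like this one, the three fields can be pulled back out of the stored bits. The following sketch (our own, not from the article) masks out the sign, exponent and mantissa of 0.0456f:

uint bits = BitConverter.ToUInt32(BitConverter.GetBytes(0.0456f), 0);
Console.WriteLine(bits.ToString("X8"));                          // 3D3AC711
Console.WriteLine(bits >> 31);                                   // 0 (sign)
Console.WriteLine((bits >> 23) & 0xFF);                          // 122 (exponent field, i.e. -5 + 127)
Console.WriteLine(Convert.ToString((long)(bits & 0x7FFFFF), 2)); // mantissa bits (leading zeros are not printed)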
5 Mathematical operations of floating point numbers
5.1 Addition and subtraction of floating point numbers
Suppose two floating point numbers X = Mx * 2^Ex and Y = My * 2^Ey.
To compute X ± Y, the following 5 steps must be completed:
(1) Align the exponents: the number with the smaller exponent is adjusted to match the larger exponent, its mantissa being shifted right accordingly
(2) Perform mantissa addition and subtraction operations
(3) Normalization: the result of the mantissa operation must be turned into a normalized floating point number. For a mantissa in two's complement with a double sign bit (00 for a positive number, 11 for a negative number, while 01 indicates positive overflow and 10 negative overflow), the normalized forms are
001xxx...xx or 110xxx...xx
If the result is not in one of these forms, it must be normalized by shifting left or right as appropriate.
(4) Rounding: during alignment or right shifts, the bits shifted out of the right end of the mantissa are rounded with the "0 discard, 1 round up" rule to preserve accuracy.
(5) Check the validity of the result, i.e. whether the exponent overflows or underflows:
if the exponent underflows (its shift code is 00...0), the result is set to machine zero;
if the exponent overflows (exceeds the maximum value the exponent field can represent), the overflow flag is set.
Now let us illustrate the above 5 steps with a concrete example.
Example: assume X = 0.0110011 * 2^11 and Y = 0.1101101 * 2^-10 (all numbers here, including the exponents, are binary); compute X + Y.
First, both numbers must be written in the machine's floating point format. The exponent is usually stored in shift code, while the mantissa is stored in two's complement. After normalizing X to 0.1100110 * 2^10, its exponent +10 has shift code 1010.
It should be noted that the shift code of -10 is 0110.
[X]float: 0 1010 1100110
[Y]float: 0 0110 1101101
(sign bit | exponent | mantissa)
(1) Find the exponent difference: |ΔE| = |1010 - 0110| = 0100, i.e. 4
(2) Align: the exponent of Y is smaller, so the mantissa of Y is shifted right by 4 bits
[Y]float becomes 0 1010 0000110, with the shifted-out bits 1101 saved temporarily
(3) Add the mantissas, using two's complement with a double sign bit
00 1100110
+00 0000110
00 1101100
(4) Normalization: the result 00 1101100 is already of the form 001xx...x, so it meets the requirement and no shift is needed
(5) Rounding, using the "0 discard, 1 round up" rule: the temporarily saved bits 1101 start with 1, so 1 is added to the mantissa, giving 1101101
Therefore, the floating point format of the final result is: 0 1010 1101101
That is, X + Y = +0.1101101 * 2^10 (all digits binary)
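As a sanity check, the example can be redone in ordinary decimal arithmetic (a C# sketch of ours; the 0b binary literals require C# 7 or later):

double x = 0b0110011 / 128.0 * 8;          // X = 0.0110011 (binary) * 2^3  = 3.1875
double y = 0b1101101 / 128.0 / 4;          // Y = 0.1101101 (binary) * 2^-2 = 0.212890625
Console.WriteLine(x + y);                  // 3.400390625 (exact sum)
Console.WriteLine(0b1101101 / 128.0 * 4);  // 3.40625 (machine result +0.1101101 * 2^10 after alignment and rounding)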
5.2 Multiplication and division of floating point numbers
(1) Exponent operation: the exponents are added (for multiplication) or subtracted (for division):
[Ex + Ey]shift = [Ex]shift + [Ey]complement
[Ex - Ey]shift = [Ex]shift + [-Ey]complement
(2) Mantissa operation: the result of multiplying or dividing the mantissas must then be rounded.
Example: X = 0.0110011 * 2^11, Y = 0.1101101 * 2^-10 (binary); find X * Y.
Solution: [X]float: 0 1010 1100110
[Y]float: 0 0110 1101101
(1) Exponent addition:
[Ex + Ey]shift = [Ex]shift + [Ey]complement = 1010 + 1110 = 1000 (the carry out of the 4-bit field is discarded)
The shift code 1000 represents an exponent of 0.
(2) Multiplying the mantissas (in sign-magnitude) gives:
0.10101101101110
(3) Normalization: the product already meets the requirement, so no left shift is needed; the mantissa and the exponent stay unchanged.
(4) Rounding: the bits beyond the 7-bit mantissa are 1101110, which starts with 1, so by the rounding rule 1 is added to the last kept bit.
So X * Y = +0.1010111 * 2^0.
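The same decimal cross-check works for the multiplication (again a sketch of ours):

double x = 0b0110011 / 128.0 * 8;      // X = 3.1875
double y = 0b1101101 / 128.0 / 4;      // Y = 0.212890625
Console.WriteLine(x * y);              // 0.6785888671875 (exact product)
Console.WriteLine(0b1010111 / 128.0);  // 0.6796875 (machine result 0.1010111 * 2^0 after rounding)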
/******************************************************************************************
*【Author】:flyingbread
*【Date】: March 2, 2007
*【Notice】:
*1. This article is an original technical article, first published on the author's blog at Blog Park (/). Please indicate the author and source when reprinting or citing it.
*2. This article must be reproduced and cited in full. No organization or individual may modify any of its content, or use it for commercial purposes, without authorization.
*3. This statement is part of the article; reprints and citations must include it together with the original text.
*4. This article consulted several resources on the Internet; they are not listed one by one, but are thanked together.
******************************************************************************************/