Practical C++: binary data processing and encapsulation

Preface

A recent project at my institute on network terminal testing involves wrapping calls around embedded, low-level data frames. I had rarely dealt with processing and encapsulating raw binary data before, so I have organized my notes here.

The following examples are mainly explained in C++.

What is binary data

All data on a computer is stored in binary (0 or 1). Groups of bits can represent basic types such as integers, floating-point numbers, characters, and strings, as well as more complex data formats.

In everyday programming we usually do not need to care about the underlying binary data. However, to process binary files (audio, video, images, etc.), design more compact data structures (network data frames, bytecode, protobuf), or work with low-level layers, we have to handle binary data directly.

In a computer, each binary digit is called a bit, the smallest unit of storage.

Every 8 bits make up one byte, which is generally the smallest unit that a computer actually stores and processes (larger units are multiples of it). In other words, the computer allocates space and performs calculations in bytes. It cannot allocate less storage than one byte (for example, the smallest data type is char, 1 byte long, and you cannot request 6 bits of storage), nor directly operate on data smaller than a byte (for example, adding or subtracting two 4-bit values).

Several bytes form one computer word (word for short), the fixed-length unit of binary data that a computer processes in one operation; the number of bits in a word is the word length. Computers process and operate on data in words. Two common related concepts are the CPU bit count and the operating system bit count.
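The byte granularity described above can be checked at compile time. The following is a minimal sketch; the struct SixBits and the function maxSixBitValue are hypothetical names used only for illustration.

```cpp
#include <climits>  // CHAR_BIT
#include <cstdint>

// Hypothetical struct that only needs 6 bits of information.
struct SixBits {
    std::uint8_t value : 6;  // bit-field: only 6 bits are usable
};

// sizeof(char) is 1 by definition; CHAR_BIT is the number of bits in that byte.
static_assert(sizeof(char) == 1, "char is the unit of sizeof");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
// Even a 6-bit field still occupies at least one whole byte of storage.
static_assert(sizeof(SixBits) >= 1, "storage is allocated in whole bytes");

// Largest value that fits in 6 bits: 2^6 - 1.
constexpr unsigned maxSixBitValue() { return (1u << 6) - 1; }
```

A bit-field limits the range of values, but the compiler still reserves whole bytes for the enclosing object.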

The bit count of a CPU is the maximum number of bits (one word length) it can handle when executing one instruction, which corresponds to the width of its registers. Among them, the memory address register (MAR) limits the address range of the computer, and the memory data register (MDR) limits the length of data processed at one time. A higher bit count brings a larger address space and more computing power.

Note: The addressing range is not the same as the memory size. Addressable objects include main memory, graphics memory, sound cards, network cards, and other devices. The addressing range is often treated as the upper limit on memory because memory is the CPU's main addressing target.

A brief explanation of common instruction set architectures: x86 is a complex instruction set (CISC) architecture introduced by Intel. It started as a 16-bit architecture and was later extended to 32 bits (x86_32). AMD then introduced AMD64, a 64-bit extension compatible with x86_32, which the industry, including Intel, adopted; it is usually called x86_64, or x64 for short. x86_32 and x86_64 are collectively referred to as x86. In contrast, the ARM instruction set architecture is based on a reduced instruction set (RISC) design and is used mostly in mobile devices.

An operating system is built on the CPU instruction set, so its bit count corresponds directly to that of the CPU. Because the CPU instruction set is backward compatible, a 32-bit operating system can run on a 64-bit CPU, but not the other way around. The operating system in turn provides backward compatibility for software: a 64-bit operating system supports both 64-bit and 32-bit programs, whereas a 32-bit operating system supports only 32-bit programs.
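A program can observe the bit count it was built for through the width of its pointers. This is a small sketch with hypothetical function names. It reports the build target, so a 32-bit program running on a 64-bit operating system would still report 32.

```cpp
#include <climits>  // CHAR_BIT
#include <cstddef>

// Number of bits in a data pointer for the target this code is compiled for.
constexpr std::size_t pointerBits() { return sizeof(void*) * CHAR_BIT; }

// Number of bits in std::size_t, the type used for object sizes.
constexpr std::size_t sizeBits() { return sizeof(std::size_t) * CHAR_BIT; }
```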

Process binary data

In most languages the smallest data type is char, one byte. Binary data is usually represented as unsigned char, commonly written uint8 (uint8_t in C++). Note that the language typically promotes it to int when operating on it.

A binary constant starts with "0b", for example 0b001 (supported in C++ since C++14). Binary data is also commonly written in octal (starting with "0") or hexadecimal (starting with "0x"), for example 0257 (octal, 175 in decimal) and 0x1f (hexadecimal, 31 in decimal). One octal digit represents 3 bits, one hexadecimal digit represents 4 bits, and a byte can be written as 2 hexadecimal digits.

To process data smaller than one byte, you need the bitwise operators (&, |, ^, ~, >>, <<).
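The promotion to int matters in practice, because the result of a bitwise operation on a byte is no longer a byte. A minimal sketch follows; invertByte is a hypothetical helper name.

```cpp
#include <cstdint>
#include <type_traits>

// Operands narrower than int are promoted to int before arithmetic and bitwise operations.
static_assert(std::is_same<decltype(std::uint8_t{1} + std::uint8_t{1}), int>::value,
              "uint8_t + uint8_t yields int");
static_assert(std::is_same<decltype(~std::uint8_t{0}), int>::value,
              "~uint8_t yields int");

// Pitfall: ~0x0F is computed in int and is negative, so it does not equal 0xF0.
static_assert(~std::uint8_t{0x0F} != 0xF0, "the promoted result keeps the high bits set");

// Casting back to uint8_t keeps only the low byte.
constexpr std::uint8_t invertByte(std::uint8_t b) {
    return static_cast<std::uint8_t>(~b);
}
```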
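The same value can be written in each notation, which the compiler can confirm. In this sketch the constant and function names are hypothetical.

```cpp
#include <cstdint>

// The value 175 written in four notations (binary literals need C++14).
constexpr std::uint8_t fromBinary = 0b10101111;
constexpr std::uint8_t fromOctal = 0257;  // digits 2 5 7 -> 010 101 111
constexpr std::uint8_t fromHex = 0xAF;    // digits A F   -> 1010 1111
constexpr std::uint8_t fromDecimal = 175;

static_assert(fromBinary == fromDecimal, "binary literal");
static_assert(fromOctal == fromDecimal, "octal literal");
static_assert(fromHex == fromDecimal, "hexadecimal literal");

// Each hexadecimal digit covers 4 bits, so a byte splits into two hex digits.
constexpr std::uint8_t highNibble(std::uint8_t b) { return static_cast<std::uint8_t>(b >> 4); }
constexpr std::uint8_t lowNibble(std::uint8_t b) { return static_cast<std::uint8_t>(b & 0x0F); }
```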

Bit operator	Description	Operation rule	Use
&	AND	The result bit is 1 only when both bits are 1	Clear bits, or extract specified bits
|	OR	The result bit is 0 only when both bits are 0	Set bits to 1; merge data into positions whose bits are 0
^	XOR	The result bit is 0 when the bits are the same and 1 when they differ	Toggle specified bits
~	NOT	0 becomes 1 and 1 becomes 0	Invert all bits
<<	Left shift	All bits move left by the given count; high bits are discarded and low bits are filled with 0	Compute x * 2^n; move data toward the high bits
>>	Right shift	All bits move right by the given count. For unsigned numbers the high bits are filled with 0. For signed numbers the behavior was implementation-defined before C++20: some compilers fill with the sign bit (arithmetic right shift) and some fill with 0 (logical right shift)	Compute x / 2^n; move data toward the low bits
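The uses listed in the table can be sketched as small helper functions. The function names are hypothetical, and n counts bits from 0 at the least significant end.

```cpp
#include <cstdint>

// n must be in the range [0, 7].
constexpr std::uint8_t setBit(std::uint8_t b, unsigned n) {
    return static_cast<std::uint8_t>(b | (1u << n));   // OR sets the bit to 1
}
constexpr std::uint8_t clearBit(std::uint8_t b, unsigned n) {
    return static_cast<std::uint8_t>(b & ~(1u << n));  // AND with an inverted mask clears it
}
constexpr std::uint8_t toggleBit(std::uint8_t b, unsigned n) {
    return static_cast<std::uint8_t>(b ^ (1u << n));   // XOR flips it
}
constexpr bool testBit(std::uint8_t b, unsigned n) {
    return ((b >> n) & 1u) != 0;                       // shift the bit down, then mask
}
```

For unsigned values, shifting left by n multiplies by 2^n and shifting right by n divides by 2^n, for example 3u << 4 is 48 and 48u >> 4 is 3.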

For example, to determine whether the third lowest bit of a byte is 1:

// Clear the other bits first, then check whether the result equals 0b100
bool isOne = (byte & 0b100) == 0b100;

As another example, in the IP protocol the control flags and the fragment offset are stored together in the 7th and 8th bytes of the IP header. The flags occupy the first 3 bits and the remaining 13 bits are the fragment offset. You can obtain the flags and the offset as follows:

// Get the flags from the first 3 bits of byte7: clear the last 5 bits to keep the first 3, then shift right by 5 to move them to the low end
uint8_t flag = (byte7 & 0b11100000) >> 5;
// Get the offset (stored big-endian): the low 5 bits of byte7 are the high bits and byte8 is the low bits.
// Clear the first 3 bits of byte7, shift the remaining 5 bits left by 8, then OR the result with byte8
uint16_t offset = ((byte7 & 0b00011111) << 8) | byte8;
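The two expressions above can be wrapped into one function and checked against concrete header bytes. This is a sketch; FragmentInfo and parseFragmentInfo are hypothetical names.

```cpp
#include <cstdint>

struct FragmentInfo {
    std::uint8_t flags;    // 3 bits, from high to low: reserved, DF (don't fragment), MF (more fragments)
    std::uint16_t offset;  // 13 bits, counted in units of 8 bytes
};

// byte7 and byte8 are the 7th and 8th bytes of the IPv4 header, in big-endian order.
constexpr FragmentInfo parseFragmentInfo(std::uint8_t byte7, std::uint8_t byte8) {
    return FragmentInfo{
        static_cast<std::uint8_t>((byte7 & 0b11100000) >> 5),
        static_cast<std::uint16_t>(((byte7 & 0b00011111) << 8) | byte8)};
}
```

For the bytes 0x40 0x00 (only DF set) this yields flags 2 and offset 0; for 0x20 0xB9 (MF set) it yields flags 1 and offset 185.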

Additional note: when a value needs multiple bytes, you must define whether its high-order byte is stored at the high address or the low address. This is byte order (endianness). Big-endian stores the high-order byte at the low address, which matches how people write numbers; little-endian stores the low-order byte at the low address. When processing multi-byte data, first find out whether it is big-endian or little-endian.

With the above knowledge we can write general functions that convert between unsigned integers and byte streams:
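To make the two byte orders concrete, the sketch below serializes the value 0x12345678 both ways and probes the byte order of the host. All function names are hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// High-order byte first (big-endian), regardless of the host: 12 34 56 78.
std::array<std::uint8_t, 4> toBigEndian(std::uint32_t v) {
    return {{static_cast<std::uint8_t>(v >> 24), static_cast<std::uint8_t>(v >> 16),
             static_cast<std::uint8_t>(v >> 8), static_cast<std::uint8_t>(v)}};
}

// Low-order byte first (little-endian), regardless of the host: 78 56 34 12.
std::array<std::uint8_t, 4> toLittleEndian(std::uint32_t v) {
    return {{static_cast<std::uint8_t>(v), static_cast<std::uint8_t>(v >> 8),
             static_cast<std::uint8_t>(v >> 16), static_cast<std::uint8_t>(v >> 24)}};
}

// Inspect how the host itself lays out a multi-byte integer in memory.
bool hostIsLittleEndian() {
    const std::uint32_t probe = 1;
    std::uint8_t first = 0;
    std::memcpy(&first, &probe, 1);  // copy the byte at the lowest address
    return first == 1;
}
```

Serializing with shifts, as in the first two functions, gives the same bytes on every host, which is usually what a network or file format needs.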

#include <cstddef>
#include <cstdint>
#include <vector>

// true means big-endian: the low address holds the high-order byte
bool ENDIAN = true;

/**
  * Convert bytes to an unsigned integer (unsigned char, short, int, long, long long, etc.)
  * @tparam T destination type, default is uint32_t
  * @param data payload byte array
  * @param valueSize data length in bytes; -1 means sizeof(T)
  * @param default_value value returned when data is too short, 0 by default
  * @return the unsigned integer converted from data
  */
template<typename T = uint32_t>
T payloadToUnsignedInt(const std::vector<uint8_t>& data, int valueSize = -1, T default_value = T(0)) {
    if (valueSize == -1) valueSize = sizeof(T);
    if (valueSize < 0 || static_cast<std::size_t>(valueSize) > data.size()) return default_value;
    T value = 0;
    for (int i = 0; i < valueSize; i++) {
        // Widen to T before shifting so that 64-bit targets do not overflow the promoted int
        if (ENDIAN) {
            value |= static_cast<T>(data[i]) << ((valueSize - 1 - i) << 3);
        } else {
            value |= static_cast<T>(data[i]) << (i << 3);
        }
    }
    return value;
}
 
/**
  * Convert an unsigned integer to a payload byte array
  * @param value unsigned integer value
  * @param valueSize data length in bytes; -1 means sizeof(T)
  * @return payload byte array
  */
template<typename T>
std::vector<uint8_t> uintToPayload(T value, int valueSize = -1) {
    if (valueSize == -1) valueSize = sizeof(T);
    std::vector<uint8_t> data(valueSize, 0);
    for (int i = 0; i < valueSize; i++) {
        if (ENDIAN) {
            data[i] = (value >> ((valueSize - 1 - i) << 3)) & 0xff;
        } else {
            data[i] = (value >> (i << 3)) & 0xff;
        }
    }
    return data;
}
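One possible variation, sketched here under hypothetical names (ByteOrder, writeUnsigned, readUnsigned), passes the byte order as a parameter instead of reading a global flag, so that big-endian and little-endian fields can be mixed in one program. It is not the project's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class ByteOrder { Big, Little };

// Serialize value into `size` bytes in the requested byte order.
template <typename T>
std::vector<std::uint8_t> writeUnsigned(T value, ByteOrder order, std::size_t size = sizeof(T)) {
    std::vector<std::uint8_t> out(size, 0);
    // i counts bytes from the least significant end of value
    for (std::size_t i = 0; i < size && i < sizeof(T); ++i) {
        const std::size_t pos = (order == ByteOrder::Big) ? size - 1 - i : i;
        out[pos] = static_cast<std::uint8_t>((value >> (8 * i)) & 0xff);
    }
    return out;
}

// Rebuild an unsigned integer from the first `size` bytes of data.
// Returns fallback if data is too short or size does not fit in T.
template <typename T>
T readUnsigned(const std::vector<std::uint8_t>& data, ByteOrder order,
               std::size_t size = sizeof(T), T fallback = T(0)) {
    if (size > data.size() || size > sizeof(T)) return fallback;
    T value = 0;
    for (std::size_t i = 0; i < size; ++i) {
        const std::size_t pos = (order == ByteOrder::Big) ? size - 1 - i : i;
        // Widen the byte to T before shifting it into place
        value |= static_cast<T>(static_cast<T>(data[pos]) << (8 * i));
    }
    return value;
}
```

For example, writeUnsigned<std::uint32_t>(0x12345678, ByteOrder::Big) produces the bytes 12 34 56 78, and readUnsigned<std::uint32_t> with the same order restores the original value.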

Encapsulate binary data

Once you know how to process binary data, the next step is to encapsulate it into an object that people can understand.

Binary data is usually represented as a uint8_t array in which different bits have different meanings, so it must be parsed according to those meanings to obtain useful information. The key is therefore to describe the meaning of each bit and to parse the binary data from that description, providing conversion in both directions between binary data and meaningful objects.

Idea 1: Based on configuration files

This section uses the encapsulation of custom binary instructions as an example (Project gallery), but the configuration approach suits any binary data encapsulation scenario. Faced with this requirement, the first idea that comes to mind is to describe the meaning of each bit of the binary stream in a configuration file. After loading the configuration file, the program uses filter conditions to determine which configuration applies to the current segment of the binary stream and parses it into a dictionary.

Because the project includes embedded components, all files must be compiled and flashed onto the board, and configuration files in ordinary file formats cannot be stored. The project therefore configures through variables: the configuration types and the configuration object (cmd_manager) are declared globally, and the configuration object can be defined anywhere in the project. In other scenarios you could choose JSON, XML, or another configuration format instead.

The configuration object designed for this article is defined as follows:

/**
  * Load configuration item
  */
const CmdManager cmd_manager = { 2, {  // Number of instructions, followed by the configuration of each instruction
        {"TCRQ", 3, {  // Configuration item name, number of fields in the configuration item
            // Field configuration: field name, field offset, field type, filter condition this field imposes on the configuration item
            {"TE_SEQ_NO", -1, &FT_SHORT, 0},
            {"CMD", -1, &FT_CHARS_4, "TCRQ"},  // The configuration item requires this field to equal "TCRQ"; data that does not satisfy this does not match the item
            {"REPEAT_COUNT", -1, &FT_SHORT, 0}}}
}};

The project loads the configuration object automatically. For raw binary data, the PayloadObjectMapFactory factory matches the corresponding configuration and generates a data object, from which you can obtain the object type (the configuration item name) and read or write field values. Alternatively, you can create an empty data object for a specified configuration item, set its fields, and then obtain the raw binary payload.

Evaluation

With this approach, parsing can be adjusted freely and dynamically through the configuration, which makes it easy to reuse, extend, or modify. The difficulty lies in designing the configuration format, and dictionary-style data is not as clear or convenient to use as a directly declared struct type.
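The core of the configuration-driven idea can be sketched in a few lines. This is not the project's CmdManager or PayloadObjectMapFactory: FieldSpec, Layout, and parsePayload are hypothetical names, and the sketch assumes fields are stored back to back as big-endian unsigned integers, with FT_SHORT taken to be 2 bytes and FT_CHARS_4 taken to be 4 bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical field description: a name plus the number of bytes the field occupies.
struct FieldSpec {
    std::string name;
    std::size_t size;
};

// Hypothetical layout: fields follow each other in declaration order.
using Layout = std::vector<FieldSpec>;

// Parse payload into a name -> value dictionary according to layout.
// Returns an empty map if the payload is too short or a field is wider than 8 bytes.
std::map<std::string, std::uint64_t> parsePayload(const Layout& layout,
                                                  const std::vector<std::uint8_t>& payload) {
    std::map<std::string, std::uint64_t> fields;
    std::size_t offset = 0;
    for (const FieldSpec& spec : layout) {
        if (spec.size > sizeof(std::uint64_t) || offset + spec.size > payload.size()) return {};
        std::uint64_t value = 0;
        for (std::size_t i = 0; i < spec.size; ++i) {
            value = (value << 8) | payload[offset + i];  // big-endian accumulation
        }
        fields[spec.name] = value;
        offset += spec.size;
    }
    return fields;
}
```

A filter condition such as "CMD must equal TCRQ" could then be checked by comparing the parsed CMD value with the bytes 'T' 'C' 'R' 'Q' before accepting the layout.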

Idea 2: Based on the underlying data storage method

Here the encapsulation of computer network data frames is used as an example. C++ stores the member fields of an object or struct contiguously in declaration order, subject to type alignment. Taking advantage of this, you can declare the fields naturally according to their meaning and use them directly, while also treating the whole object as a binary data stream. An example implementation follows:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

/**
  * Data abstraction class, providing conversion between binary streams and objects
  * Internal class, only for code reuse, not for polymorphism
  * Assumes the derived class holds exactly size bytes of byte-sized members with no padding
  * @tparam size data length in bytes
  */
template<int size>
class DataType {
public:
    DataType() { resetData(); }
    // Reset all data to zero
    void resetData() { memset((void *) (this), 0, size); }
    // Load data from a binary stream; returns false if the stream is too short
    bool loadData(const std::vector<uint8_t>& data, int startIndex = 0) {
        if (startIndex < 0 || data.size() < static_cast<std::size_t>(startIndex) + size) return false;
        auto * p = (uint8_t *) this;  // Treat the object itself as a byte array
        for (int i = 0; i < size; i++) {
            *p = data[i + startIndex];
            p++;
        }
        return true;
    }
    // Generate a new binary data stream from the object itself
    [[nodiscard]] std::vector<uint8_t> createData() const {
        std::vector<uint8_t> result;
        auto p = (uint8_t const *) this;
        for (int i = 0; i < size; i++) {
            result.push_back(*p);
            p++;
        }
        return result;
    }
    [[nodiscard]] int getSize() const { return size; }
};
 
// Define a concrete binary data type by declaring fields in order; nested declarations are supported
class MACHeader : public DataType<14> {
public:
    // Wrap reads and writes of netType with the unsigned integer / byte stream conversion functions above
    [[nodiscard]] uint16_t getNetType() const {
        return payloadToUnsignedInt(std::vector<uint8_t>(netType.begin(), netType.end()), 2, uint16_t(0));
    }
    void setNetType(uint16_t _netType) {
        auto data = uintToPayload(_netType, 2);
        std::copy(data.begin(), data.end(), netType.begin());
    }

    // Provide conversion to and from JSON, which is used to map to Python objects
    bool loadJson(const Json::Value& json);
    [[nodiscard]] Json::Value createJson() const;

    // Multi-byte data is described with std::array, which avoids losing type information and keeps the data type consistent
    std::array<uint8_t, 6> desMac;
    std::array<uint8_t, 6> srcMac;
    std::array<uint8_t, 2> netType;
};
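Because this technique depends on the exact memory layout, it can help to let the compiler verify the layout. The sketch below is an alternative formulation using a plain struct and memcpy, not the project's code; EthernetHeader, loadHeader, dumpHeader, and netTypeOf are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical plain struct mirroring the 14-byte MAC header.
struct EthernetHeader {
    std::array<std::uint8_t, 6> desMac;
    std::array<std::uint8_t, 6> srcMac;
    std::array<std::uint8_t, 2> netType;  // stored big-endian
};

// Compilation fails if padding was inserted or the type cannot be copied bytewise.
static_assert(sizeof(EthernetHeader) == 14, "unexpected padding in EthernetHeader");
static_assert(std::is_trivially_copyable<EthernetHeader>::value, "memcpy needs a trivially copyable type");

// Copy 14 bytes from data into header; returns false if data is too short.
bool loadHeader(EthernetHeader& header, const std::vector<std::uint8_t>& data, std::size_t start = 0) {
    if (start > data.size() || data.size() - start < sizeof(EthernetHeader)) return false;
    std::memcpy(&header, data.data() + start, sizeof(EthernetHeader));
    return true;
}

// Copy the header back out as a byte stream.
std::vector<std::uint8_t> dumpHeader(const EthernetHeader& header) {
    std::vector<std::uint8_t> out(sizeof(EthernetHeader));
    std::memcpy(out.data(), &header, sizeof(EthernetHeader));
    return out;
}

// Combine the two big-endian netType bytes into one value.
std::uint16_t netTypeOf(const EthernetHeader& header) {
    return static_cast<std::uint16_t>((header.netType[0] << 8) | header.netType[1]);
}
```

With a frame whose last two header bytes are 08 00, netTypeOf returns 0x0800, the EtherType for IPv4.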

The project also needs to map data frame objects in C++ to Python objects. To simplify the CPython extension interface, the C++ layer provides loading from and generating JSON, the Python layer implements a JSON cache, and data is managed by committing and updating that cache. As a tribute to git, the project's commit and update methods are named push and pull, (╯▔＾▔)╯.

Evaluation

This approach defines the meaning of each position in the data stream through sequential declarations (somewhat like a configuration). It is clear and direct to use, and it cleverly obtains conversion between the object and the binary data stream from the underlying storage layout. However, because it requires actual type declarations, it is not as dynamic, flexible, or reusable as Idea 1.
