python BitMap algorithm for processing 2 billion random integers for de-duplication

preamble

For a large number of random integers, how to do de-emphasis? BitMap is a very good choice, this article will take you to recognize the wonders of BitMap

What is BitMap

BitMapThe basic principle is to use a bit to mark the Value of an element, and the Key is that element. Since a bit is used to store a piece of data, it saves a lot of space.

General data storage

We know that when we arbitrarily enter a number into a computer, that number is definitely not stored in the computer's memory in its own numerical form, and the storage form is binary.

For example, if we type 8, it will be stored as 1000 in the computer, and the smallest unit of storage for each character in a computer is a byte, which has 8 bits, so it is at least 00001000, which is why a byte can store up to 1111111111 as a binary value (which stands for 255). This means that one byte is definitely not enough, so you generally need more bytes to store larger numbers.

The number of bytes used to store numbers may be different in each programming language. In Python version 3, the int type is dynamic in length, so theoretically very large numbers can be stored.

BitMap storage method

The role of BitMap is to achieve data de-emphasis or storage, then it is certainly to be stored things become as few as possible, in BitMap ideas use bit to store each value. Look at the following picture together:

The above figure only draws four bits, you can see that the box is respectively four bits, four bits on some places for 1, some places for 0, after which these bits are combined to form a decimal value that we are familiar with. This decimal value is stored in the BitMap storage value. For example, the above 5 can store 1,3 two numbers, 15 can store 1,2,3,4 these numbers. Now understand it, in fact, in which bit there is a 1 is on behalf of a number stored here, this is the principle of BitMap data storage!

Why BitMap can de-duplicate big data

In BitMap idea bit is used to store each value. If you want to store the same numbers, then these numbers will be stored in the same location in BitMap, which will lead to data duplication and cannot achieve the purpose of de-duplication. Therefore, BitMap cannot store the same number, using this feature, BitMap can de-duplicate big data.

Here is the code to implement the most basic BitMap algorithm using Python:

import array
class BitMap:
    def __init__(self, max_num):
        self.max_num = max_num
         = ('B', [0] * (max_num // 8 + 1))
    def set(self, num):
        index = num // 8
        bit = num % 8
        [index] |= 1 << bit
    def get(self, num):
        index = num // 8
        bit = num % 8
        return [index] & (1 << bit) != 0
    def remove_duplicates(self, nums):
        for num in nums:
            if (num):
                continue
            (num)
            yield num

In this class, we define three methods: __init__, set, and get. the __init__ method initializes an array to store the data of the BitMap. the set method is used to set the state of a number, and the get method is used to retrieve the state of a number. the remove_duplicates method is used to de-duplicate the data. Here we are using the array library in Python to manipulate the bit array directly.

This is the most basic implementation of the BitMap algorithm, if you want to know more about the BitMap algorithm, you can refer to other more advanced implementations.

For more information about python BitMap algorithm de-emphasis please follow my other related articles!