
Summary of six mistakes pandas beginners tend to make

We're here to discuss six newbie mistakes. These errors have nothing to do with the API or syntax of the tool you are using; rather, they are directly related to your level of knowledge and experience. In practice they may not produce any error message, but they can cause a lot of trouble in real applications.

Reading large files using the functions that come with pandas

The first mistake has to do with whether you should use Pandas for certain tasks at all. Specifically, the tables we actually work with can be very large, and reading large files with pandas' read_csv will be your biggest mistake.

Why? Because it's so slow! Take a look at this test where we load the TPS October dataset, which has 1M rows and about 300 features and takes up 2.2GB of disk space.

import pandas as pd

%%time
tps_october = pd.read_csv("data/")

Wall time: 21.8 s

read_csv took about 22 seconds. You might say that 22 seconds is not much, but in a project many experiments need to be run at different stages: we create separate scripts for cleaning, feature engineering, model selection, and other tasks, so waiting 20 seconds for the data to load over and over again adds up to a long time. And if the dataset is larger, it will take even longer. So what's a faster solution?

The solution is to drop Pandas at this stage and use an alternative designed for fast IO. My favorite is datatable, but you could also choose Dask, Vaex, cuDF, etc. Here is the time it takes to load the same dataset with datatable:

import datatable as dt  # pip install datatable

%%time

tps_dt_october = dt.fread("data/").to_pandas()

------------------------------------------------------------

Wall time: 2 s

Only two seconds, roughly a ten-fold difference.
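The timing above is specific to datatable, but the other libraries mentioned follow the same pattern. As a minimal sketch, keeping the same placeholder "data/" path used in the snippets above, a Dask version might look like this (exact timings will of course differ):

# pip install "dask[dataframe]"
import dask.dataframe as dd

# Dask reads the CSV lazily in parallel partitions and only
# materializes a pandas DataFrame when .compute() is called.
tps_dask_october = dd.read_csv("data/")  # same placeholder path as above
tps_october = tps_dask_october.compute()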

No vectorization

One of the most important rules of functional programming is to never use loops. It seems that adhering to this "no loops" rule when using Pandas is the best way to speed up calculations.

Functional programming uses recursion instead of loops. Recursion has its own problems (which we won't consider here), but for scientific computing, vectorization is the best option!

Vectorization is at the core of Pandas and NumPy: it performs mathematical operations on whole arrays rather than on individual scalars. Pandas already ships with an extensive set of vectorized functions, so we don't need to reinvent the wheel; we only need to focus on what we want to compute.

Most of Python's arithmetic operators (+, -, *, /, **) work in a vectorized way in Pandas, and any other mathematical function you find in Pandas or NumPy is already vectorized as well.
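As a tiny illustration (with a made-up DataFrame, not the TPS data), a single vectorized expression operates on whole columns at once:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# One expression, no explicit Python loop over rows.
df["ratio"] = df["a"] / df["b"] + df["a"] ** 2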

To verify the speedup, we will use the big_function below, which takes three columns as input and performs some meaningless arithmetic as a test:

def big_function(col1, col2, col3):
    return (col1 ** 10 / col2 ** 9 + (col3 ** 3))

First, we use this function with apply, Pandas' fastest iterator:

%time tps_october['f1000'] = tps_october.apply(
      lambda row: big_function(row['f0'], row['f1'], row['f2']), axis=1
    )

-------------------------------------------------

Wall time: 20.1 s

The operation took 20 seconds. Now let's do the same thing in a vectorized way with the underlying NumPy arrays:

%time tps_october['f1001'] = big_function(tps_october['f0'].values, 
                                          tps_october['f1'].values, 
                                          tps_october['f2'].values)

------------------------------------------------------------------

Wall time: 82 ms

It took only 82 milliseconds, which is about 250 times faster.

In fact, we can't abandon loops entirely, because not all data manipulation operations are mathematical. But whenever you find yourself reaching for a looping function (such as apply, applymap, or itertuples), it's a very good habit to take a moment to check whether what you want to do can be vectorized.
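For example, conditional logic is not an arithmetic operation, yet it can usually still be vectorized with np.where instead of apply. A minimal sketch using the f0 column from the snippets above (the threshold and the new column name are arbitrary):

import numpy as np

# Vectorized replacement for:
#   tps_october.apply(lambda row: "high" if row["f0"] > 0.5 else "low", axis=1)
tps_october["f0_bucket"] = np.where(tps_october["f0"] > 0.5, "high", "low")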

Data types, dtypes!

We can cut memory usage significantly by specifying appropriate data types.

The worst and most memory-hungry data type in pandas is object, which also happens to limit some of Pandas' functionality. Beyond that, we have floating-point and integer types of various widths: int8/16/32/64, uint8/16/32/64, and float16/32/64.

In Pandas nomenclature, the number after a data type's name indicates how many bits of memory each value of that type occupies. So the idea is to convert every column in the dataset to the smallest possible subtype whose range still covers the column's values; the rules simply follow those ranges (for example, int8 holds -128 to 127 and int16 holds -32,768 to 32,767).

Typically, floats are converted to float16/32 and columns containing both positive and negative integers are converted to int8/16/32, depending on their range. uint8 can also be used for boolean columns and columns holding only non-negative integers, to reduce memory consumption even further.
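If you don't want to memorize those ranges, NumPy can report them and pandas can even downcast for you; a short sketch (the example Series is made up):

import numpy as np
import pandas as pd

# The exact range of each subtype, which is what the conversion rules rely on.
print(np.iinfo(np.int8))     # min = -128, max = 127
print(np.finfo(np.float16))  # roughly -65500 to 65500, with reduced precision

# pandas can pick the smallest safe subtype automatically.
s = pd.Series([0, 120, 250])
s_small = pd.to_numeric(s, downcast="unsigned")  # becomes uint8 here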

This function will probably look familiar to you, as it is widely used on Kaggle; it converts floats and integers to their smallest subtypes following exactly these rules:

import numpy as np

def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                # Pick the smallest integer subtype whose range covers the column.
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Same idea for floats.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

Let's use it on the TPS October data and see how much we can reduce it:

>>> reduce_memory_usage(tps_october)
Mem. usage decreased to 509.26 Mb (76.9% reduction)

We compressed the dataset from 2.2 GB down to 510 MB. Unfortunately, this reduction in memory consumption is lost when we save the DataFrame to a CSV file, because CSV stores everything as strings; it is preserved, however, if we save it with pickle.
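A quick sketch of that round trip, again with the placeholder "data/" path; pickle (like the binary formats shown later) stores the dtypes along with the data, so the savings survive a reload:

# Save and reload without losing the downcast dtypes.
tps_october.to_pickle("data/")          # same placeholder path as elsewhere
tps_restored = pd.read_pickle("data/")
print(tps_restored.dtypes.value_counts())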

Why is it important to reduce the memory footprint? Memory consumption plays an important role when you feed datasets like this into large machine learning models. Once you hit your first OutOfMemory error, you will start learning tricks like this to keep the machine happy (after all, Kaggle only gives you 16 GB of RAM, so we have no choice).

No style

One of the most wonderful features of Pandas is its ability to apply different styles when displaying a DataFrame, rendering it in Jupyter as an HTML table styled with CSS.

Pandas allows you to style your DataFrame via the style attribute.

tps_october.sample(20, axis=1).describe().T.style.bar(
    subset=["mean"], color="#205ff2"
).background_gradient(subset=["std"], cmap="Reds").background_gradient(
    subset=["50%"], cmap="coolwarm"
)

We randomly select 20 columns, build a five-number summary for them, transpose the result, and color the mean, standard deviation, and median columns according to their magnitude. Adding a style like this makes it much easier to spot patterns in the raw numbers, without reaching for a separate visualization library.

Actually, there is nothing wrong with not styling your DataFrame. But it's certainly a nice feature, isn't it?
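As one more purely illustrative example of the same style API, you can also format the numbers and highlight each column's maximum:

# Two-decimal formatting plus a highlight on each column's maximum value.
tps_october.sample(5, axis=1).describe().T.style.format(
    "{:.2f}"
).highlight_max(color="lightgreen")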

Saving files using CSV format

Just as reading CSV files is very slow, so is saving data back to them. Here is how long it takes to save TPS October data to CSV:

%%time

tps_october.to_csv("data/")

------------------------------------------

Wall time: 2min 43s

It took almost 3 minutes. To save time, you can save the data as Parquet, Feather, or even pickle instead.

%%time

tps_october.to_feather("data/")

Wall time: 1.05 s

--------------------------------------------------------------------------------

%%time

tps_october.to_parquet("data/")

Wall time: 7.84 s
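Reading the data back from these binary formats is just as straightforward (same placeholder paths; Parquet and Feather require pyarrow, or fastparquet for Parquet, to be installed):

# Both readers restore the DataFrame together with its dtypes.
tps_from_feather = pd.read_feather("data/")
tps_from_parquet = pd.read_parquet("data/")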

Not looking at the documentation!

Actually, the worst of these mistakes for me was not reading the Pandas documentation. But nobody reads documentation, right? Sometimes we would rather search the Internet for hours than read the docs.

When it comes to Pandas, though, this is a very big mistake, because like sklearn it has an excellent user guide covering everything from the basics to contributing code, and even how to set up prettier themes (or maybe there is just so much of it that no one reads it).

Every mistake I mentioned today can be found in the documentation. There is even a section called "Large Datasets" that explicitly tells you to use other packages (like Dask) to read large files and stay away from plain Pandas. In fact, if I had the time to read the user guide from cover to cover, I could probably list about 50 newbie mistakes, so go check out the documentation.

Summary

Today, we learned the six most common mistakes newbies make when working with Pandas.

Most of the mistakes mentioned here relate to large datasets and may only show up when working with GB-sized data. If you are still practicing on a beginner dataset like Titanic, you may not even notice these problems. But once you start working with real-world datasets, these habits will make others see you not as a novice but as someone with actual hands-on experience.
