SoFunction
Updated on 2025-03-02

Examples of selecting a subset of data with Pandas

Data analysis: how to select a subset of data in Pandas

How do you select a single column, specific rows, or a rectangular sub-region of a DataFrame?

Select a single column

For example, if you are only interested in the ages of the passengers in the Titanic dataset, you can do this:

In [4]: ages = titanic["Age"]

In [5]: ages.head()
Out[5]: 
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [6]: type(titanic["Age"])
Out[6]: pandas.core.series.Series

In [7]: titanic["Age"].shape
Out[7]: (891,)
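As a minimal sketch of the difference between a column returned as a Series and as a one-column DataFrame, the following uses a small made-up table (the `df` frame and its values are hypothetical stand-ins, not the real Titanic data):

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic table (not the real data).
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara"],
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["female", "male", "female"],
})

ages = df["Age"]          # single label in brackets -> one-dimensional Series
ages_frame = df[["Age"]]  # list with one label -> two-dimensional DataFrame

print(type(ages).__name__, ages.shape)
print(type(ages_frame).__name__, ages_frame.shape)
```

The Series has a shape with a single entry (rows only), while the DataFrame keeps both a row and a column dimension.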

Select multiple columns

For example, to study several attributes of the Titanic dataset together, such as both the age and the sex of each passenger, pass a list of column names inside the brackets:

In [8]: age_sex = titanic[["Age", "Sex"]]

In [9]: age_sex.head()
Out[9]: 
    Age     Sex
0  22.0    male
1  38.0  female
2  26.0  female
3  35.0  female
4  35.0    male

In [10]: type(titanic[["Age", "Sex"]])
Out[10]: pandas.core.frame.DataFrame

In [11]: titanic[["Age", "Sex"]].shape
Out[11]: (891, 2)
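The list of column names can also be built beforehand and stored in a variable. A small sketch, again on a hypothetical stand-in table (`df` and `wanted` are illustrative names), shows that the result follows the order of the list rather than the order in the original table:

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic table (not the real data).
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara"],
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["female", "male", "female"],
})

wanted = ["Sex", "Age"]  # column labels in the order we want them back
subset = df[wanted]      # a list of labels always returns a DataFrame

print(list(subset.columns))
print(subset.shape)
```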

Filter rows by a condition

For example, suppose you are interested in the passengers in the Titanic dataset who are older than 35.

In [12]: above_35 = titanic[titanic["Age"] > 35]

In [13]: above_35.head()
Out[13]: 
    PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
1             2         1       1  ...  71.2833   C85         C
6             7         0       1  ...  51.8625   E46         S
11           12         1       1  ...  26.5500  C103         S
13           14         0       3  ...  31.2750   NaN         S
15           16         1       2  ...  16.0000   NaN         S

[5 rows x 12 columns]

In [15]: above_35.shape
Out[15]: (217, 12)

The condition inside the brackets is itself a boolean Series with the same number of rows as the DataFrame; only the rows where it is True are kept:

In [14]: titanic["Age"] > 35
Out[14]: 
0      False
1       True
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool
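To make the mask mechanism concrete, here is a small sketch on a hypothetical table (the names and values are invented for illustration). It also shows two details worth knowing: a comparison against a missing value yields False, and several conditions are combined with `&`, each wrapped in parentheses:

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic table (not the real data).
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dev"],
    "Age": [22.0, 38.0, float("nan"), 41.0],
    "Pclass": [3, 1, 2, 2],
})

mask = df["Age"] > 35   # boolean Series; the NaN row compares as False
older = df[mask]        # keeps only the rows where the mask is True

# Combine conditions element-wise with & (and), each side in parentheses.
older_first_class = df[(df["Age"] > 35) & (df["Pclass"] == 1)]

print(mask.tolist())
print(older["Name"].tolist())
print(older_first_class["Name"].tolist())
```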

Similarly, suppose you are interested in the passenger class (the Pclass column). To keep only the passengers in classes 2 and 3, use isin:

In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])]

In [17]: class_23.head()
Out[17]: 
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
2            3         1       3  ...   7.9250   NaN         S
4            5         0       3  ...   8.0500   NaN         S
5            6         0       3  ...   8.4583   NaN         Q
7            8         0       3  ...  21.0750   NaN         S

[5 rows x 12 columns]

# This is equivalent to:
In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

In [19]: class_23.head()
Out[19]: 
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
2            3         1       3  ...   7.9250   NaN         S
4            5         0       3  ...   8.0500   NaN         S
5            6         0       3  ...   8.4583   NaN         Q
7            8         0       3  ...  21.0750   NaN         S

[5 rows x 12 columns]
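The equivalence of the two forms can be checked directly. This is a sketch on a hypothetical table, not the real data. Note that the parentheses around each comparison are required, because `|` binds more tightly than `==` in Python:

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic table (not the real data).
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dev", "Eve"],
    "Pclass": [3, 1, 2, 1, 3],
})

by_isin = df[df["Pclass"].isin([2, 3])]
by_or = df[(df["Pclass"] == 2) | (df["Pclass"] == 3)]

# DataFrame.equals compares values, index and dtypes.
print(by_isin.equals(by_or))
print(by_isin["Name"].tolist())
```

With more than two or three allowed values, `isin` tends to be easier to read than a long chain of `|` conditions.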

Data cleaning often requires separating rows with NA values from rows without them so that each group can be processed separately. To keep only the rows where the age is known:

In [20]: age_no_na = titanic[titanic["Age"].notna()]

In [21]: age_no_na.head()
Out[21]: 
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]

In [22]: age_no_na.shape
Out[22]: (714, 12)
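The complementary selection uses `isna()`. The sketch below, on a hypothetical table, splits the rows into the two groups and checks that, for a single column, the `notna()` filter gives the same result as `dropna(subset=...)`:

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic table (not the real data).
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dev"],
    "Age": [22.0, float("nan"), 26.0, float("nan")],
})

known = df[df["Age"].notna()]   # rows where Age is present
missing = df[df["Age"].isna()]  # rows where Age is missing

print(len(known), len(missing))
# Same rows as dropping the NA entries of that one column.
print(known.equals(df.dropna(subset=["Age"])))
```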

Select specific rows and columns

For example, suppose you are interested in the names of the passengers older than 35.

In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

In [24]: adult_names.head()
Out[24]: 
1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
Name: Name, dtype: object
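In `loc[rows, columns]`, the part before the comma selects rows (here a boolean condition) and the part after it selects columns by label. A minimal sketch on a hypothetical table shows that a single column label gives a Series and a list of labels gives a DataFrame:

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic table (not the real data).
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dev"],
    "Age": [22.0, 38.0, 26.0, 41.0],
    "Sex": ["female", "male", "female", "male"],
})

names = df.loc[df["Age"] > 35, "Name"]           # one label -> Series
pairs = df.loc[df["Age"] > 35, ["Name", "Age"]]  # list of labels -> DataFrame

print(names.tolist())
print(pairs.shape)
```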

If you are interested in rows 10 to 25 and columns 3 to 5, select by position with iloc. Positions are zero-based and the end of each slice is excluded:

In [25]: titanic.iloc[9:25, 2:5]
Out[25]: 
    Pclass                                 Name     Sex
9        2  Nasser, Mrs. Nicholas (Adele Achem)  female
10       3      Sandstrom, Miss. Marguerite Rut  female
11       1             Bonnell, Miss. Elizabeth  female
12       3       Saundercock, Mr. William Henry    male
13       3          Andersson, Mr. Anders Johan    male
..     ...                                  ...     ...
20       2                 Fynney, Mr. Joseph J    male
21       2                Beesley, Mr. Lawrence    male
22       3          McGowan, Miss. Anna "Annie"  female
23       1         Sloper, Mr. William Thompson    male
24       3        Palsson, Miss. Torborg Danira  female

[16 rows x 3 columns]
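Slices behave differently in the two indexers: position-based `iloc` excludes the end of the slice, while label-based `loc` includes it. The following sketch uses a hypothetical numeric table with the default integer index to show the difference:

```python
import pandas as pd

# Hypothetical table with a default 0..9 integer index (not the real data).
df = pd.DataFrame({
    "A": range(10),
    "B": range(10, 20),
    "C": range(20, 30),
    "D": range(30, 40),
})

by_pos = df.iloc[2:5, 1:3]        # row positions 2, 3, 4; column positions 1, 2
by_label = df.loc[2:5, "B":"C"]   # row labels 2 through 5; columns B through C

print(by_pos.shape)
print(by_label.shape)
```

Here the row labels happen to equal the row positions, so the extra row returned by `loc` comes only from its inclusive end point.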

The code above is only a simple example; adjust the expressions and ranges to fit your own problem.
