Data Analysis - How to select a subset of data in Pandas
How do you select a single column, a set of rows, or a sub-region of a DataFrame? This article walks through the common cases.
Select a single column
For example, if you are only interested in the ages of the passengers in the Titanic data set, select the Age column with square brackets:
In [4]: ages = titanic["Age"]

In [5]: ages.head()
Out[5]:
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [6]: type(titanic["Age"])
Out[6]: pandas.core.series.Series

In [7]: titanic["Age"].shape
Out[7]: (891,)
Select multiple columns
For example, if you want to study several attributes together, such as the age and the sex of the passengers, pass a list of column names inside the selection brackets:
In [8]: age_sex = titanic[["Age", "Sex"]]

In [9]: age_sex.head()
Out[9]:
    Age     Sex
0  22.0    male
1  38.0  female
2  26.0  female
3  35.0  female
4  35.0    male

In [10]: type(titanic[["Age", "Sex"]])
Out[10]: pandas.core.frame.DataFrame

In [11]: titanic[["Age", "Sex"]].shape
Out[11]: (891, 2)
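The difference between single and double brackets can be checked on a tiny table. The following is a minimal sketch; the DataFrame `df` and its values are invented for illustration and are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table (values invented for illustration).
df = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cid"],
        "Age": [22.0, 38.0, 26.0],
        "Sex": ["female", "male", "male"],
    }
)

one_col_series = df["Age"]     # single brackets with one label: 1-D Series
one_col_frame = df[["Age"]]    # list with one label: 2-D DataFrame with one column
two_cols = df[["Age", "Sex"]]  # list with two labels: DataFrame with two columns

for selection in (one_col_series, one_col_frame, two_cols):
    print(type(selection).__name__, selection.shape)
```

The inner brackets build an ordinary Python list of column names; the outer brackets perform the selection. A list always yields a DataFrame, even when it holds a single name.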
Filter rows by column values
For example, suppose you are interested in the passengers older than 35. Put the condition inside the selection brackets:
In [12]: above_35 = titanic[titanic["Age"] > 35]

In [13]: above_35.head()
Out[13]:
    PassengerId  Survived  Pclass  ...     Fare  Cabin  Embarked
1             2         1       1  ...  71.2833    C85         C
6             7         0       1  ...  51.8625    E46         S
11           12         1       1  ...  26.5500   C103         S
13           14         0       3  ...  31.2750    NaN         S
15           16         1       2  ...  16.0000    NaN         S

[5 rows x 12 columns]

In [15]: above_35.shape
Out[15]: (217, 12)
The condition inside the brackets is itself a Series of boolean values (True or False) with one entry per row. Only the rows marked True are kept:
In [14]: titanic["Age"] > 35
Out[14]:
0      False
1       True
2      False
3      False
4      False
       ...
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool
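To see how such a boolean mask behaves, including on rows where the age is missing, here is a minimal sketch. The DataFrame `df` is a hypothetical four-row stand-in with invented values, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in table; the third passenger has no recorded age.
df = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cid", "Dee"],
        "Age": [22.0, 38.0, np.nan, 54.0],
    }
)

mask = df["Age"] > 35  # boolean Series, one entry per row
print(mask.tolist())   # a comparison against NaN evaluates to False

older = df[mask]       # keeps only the rows where the mask is True
print(older)
```

Because a comparison with a missing value is False, rows without an age drop out of the result silently. This is why the number of rows kept by a filter and by its negation may not add up to the size of the table when missing values are present.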
Suppose you are also interested in the passengers' cabin class. To keep only the passengers in classes 2 and 3, use isin(), which is True for every row whose value appears in the given list:
In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])]

In [17]: class_23.head()
Out[17]:
   PassengerId  Survived  Pclass  ...     Fare  Cabin  Embarked
0            1         0       3  ...   7.2500    NaN         S
2            3         1       3  ...   7.9250    NaN         S
4            5         0       3  ...   8.0500    NaN         S
5            6         0       3  ...   8.4583    NaN         Q
7            8         0       3  ...  21.0750    NaN         S

[5 rows x 12 columns]

This is equivalent to:

In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

In [19]: class_23.head()
Out[19]:
   PassengerId  Survived  Pclass  ...     Fare  Cabin  Embarked
0            1         0       3  ...   7.2500    NaN         S
2            3         1       3  ...   7.9250    NaN         S
4            5         0       3  ...   8.0500    NaN         S
5            6         0       3  ...   8.4583    NaN         Q
7            8         0       3  ...  21.0750    NaN         S

[5 rows x 12 columns]
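The equivalence of the two forms, and the way conditions combine, can be verified on a small table. This is a sketch under the assumption of a hypothetical five-row DataFrame `df` with invented values; the combined filter `older_23` is an extra illustration and does not appear in the original session.

```python
import pandas as pd

# Hypothetical stand-in table (values invented for illustration).
df = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 2, 1],
        "Age": [22.0, 38.0, 26.0, 40.0, 54.0],
    }
)

by_isin = df[df["Pclass"].isin([2, 3])]
by_or = df[(df["Pclass"] == 2) | (df["Pclass"] == 3)]
print(by_isin.equals(by_or))  # both forms select the same rows

# Conditions can also be combined with & (and).
# Each comparison needs its own parentheses because | and & bind
# more tightly than == and >.
older_23 = df[df["Pclass"].isin([2, 3]) & (df["Age"] > 35)]
print(older_23)
```

Python's `or` and `and` keywords do not work element-wise on a Series; use `|` and `&` instead.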
Data cleaning often requires separating rows with missing (NA) values from rows with known values so that they can be processed separately. For example, to keep only the passengers whose age is known:
In [20]: age_no_na = titanic[titanic["Age"].notna()]

In [21]: age_no_na.head()
Out[21]:
   PassengerId  Survived  Pclass  ...     Fare  Cabin  Embarked
0            1         0       3  ...   7.2500    NaN         S
1            2         1       1  ...  71.2833    C85         C
2            3         1       3  ...   7.9250    NaN         S
3            4         1       1  ...  53.1000   C123         S
4            5         0       3  ...   8.0500    NaN         S

[5 rows x 12 columns]

In [22]: age_no_na.shape
Out[22]: (714, 12)
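Splitting a table into its known and missing parts can be sketched as follows. The DataFrame `df` is a hypothetical stand-in with invented values; `isna()` is the counterpart of `notna()`, and `dropna(subset=...)` is shown as an alternative way to obtain the same known-value rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in table; two passengers have no recorded age.
df = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cid", "Dee"],
        "Age": [22.0, np.nan, 26.0, np.nan],
    }
)

known = df[df["Age"].notna()]   # rows where Age is present
missing = df[df["Age"].isna()]  # rows where Age is missing

# Every row lands in exactly one of the two parts.
print(len(known), len(missing), len(df))

# dropna restricted to the Age column keeps the same rows as notna().
print(known.equals(df.dropna(subset=["Age"])))
```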
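Select specific rows and columns
For example, suppose you are interested in the names of the passengers older than 35. Selecting rows and columns at the same time requires the loc operator: the part before the comma selects the rows, and the part after the comma selects the columns.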
In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

In [24]: adult_names.head()
Out[24]:
1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                      Hewlett, Mrs. (Mary D Kingcome)
Name: Name, dtype: object
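A small sketch of loc with a row condition and a column selection follows. The DataFrame `df` and its values are invented for illustration; the two-column variant `names_sex` is an extra example that is not in the original session.

```python
import pandas as pd

# Hypothetical stand-in table (values invented for illustration).
df = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cid", "Dee"],
        "Age": [22.0, 38.0, 26.0, 54.0],
        "Sex": ["female", "male", "male", "female"],
    }
)

# Row condition before the comma, a single column label after it: a Series.
names = df.loc[df["Age"] > 35, "Name"]
print(names)

# A list of column labels after the comma gives a DataFrame instead.
names_sex = df.loc[df["Age"] > 35, ["Name", "Sex"]]
print(names_sex)
```

The original row labels (1 and 3 here) are preserved in the result, which is why the output of a filter rarely starts at 0.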
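If you are interested in rows 10 to 25 and columns 3 to 5, select by position with the iloc operator. Positions are zero-based and the end of each slice is excluded, so the slices are 9:25 and 2:5: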
In [25]: titanic.iloc[9:25, 2:5]
Out[25]:
    Pclass                                 Name     Sex
9        2  Nasser, Mrs. Nicholas (Adele Achem)  female
10       3      Sandstrom, Miss. Marguerite Rut  female
11       1             Bonnell, Miss. Elizabeth  female
12       3       Saundercock, Mr. William Henry    male
13       3          Andersson, Mr. Anders Johan    male
..     ...                                  ...     ...
20       2                 Fynney, Mr. Joseph J    male
21       2                Beesley, Mr. Lawrence    male
22       3          McGowan, Miss. Anna "Annie"  female
23       1         Sloper, Mr. William Thompson    male
24       3        Palsson, Miss. Torborg Danira  female

[16 rows x 3 columns]
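Slices behave differently under iloc and loc, which is easy to check on a small table. This is a minimal sketch; the DataFrame `df` with columns A to D is hypothetical and uses the default integer row labels.

```python
import pandas as pd

# Hypothetical table with six rows labelled 0..5 and four columns.
df = pd.DataFrame(
    {
        "A": list(range(6)),
        "B": list(range(10, 16)),
        "C": list(range(20, 26)),
        "D": list(range(30, 36)),
    }
)

# iloc slices by position and excludes the end: rows 2, 3, 4 and columns B, C.
by_position = df.iloc[2:5, 1:3]
print(by_position)

# loc slices by label and includes the end: rows 2, 3, 4, 5 and columns B, C.
by_label = df.loc[2:5, "B":"C"]
print(by_label)
```

With the default row labels the two operators look interchangeable, but the inclusive end of loc means the same slice bounds return one more row.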
The code above is only a simple illustration; adjust the conditions, column names, and slice ranges to suit your own data.