SoFunction
Updated on 2024-10-30

Summary of must-have Pandas tricks for Python data analysis

I. Creation of two major data structures in Pandas

No. | Method | Description
1 | pd.Series(object, index=[ ]) | Create a Series. The object can be a list, an ndarray, a dictionary, or a row or column of a DataFrame.
2 | pd.DataFrame(data, columns=[ ], index=[ ]) | Create a DataFrame. columns and index specify the column and row labels, in the order given.

Example: creating a data table with pandas:

import numpy as np
import pandas as pd

# np.nan marks the two missing prices
df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006],
  "date":pd.date_range('20130102', periods=6),
  "city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
  "age":[23,44,54,32,34,32],
  "category":['100-A','100-B','110-A','110-C','210-A','130-F'],
  "price":[1200,np.nan,2133,5433,np.nan,4432]},
  columns=['id','date','city','category','age','price'])
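The table in this section says a Series can be built from a list, an ndarray, a dictionary, or a DataFrame column. A minimal sketch of each case follows; the variable names and sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a list, with an explicit index
s_list = pd.Series([23, 44, 54], index=["a", "b", "c"])

# From an ndarray (the default index is 0, 1, 2, ...)
s_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# From a dictionary: the keys become the index
s_dict = pd.Series({"Beijing": 1200, "Shanghai": 2133})

# From a column of a DataFrame (hypothetical two-row frame)
frame = pd.DataFrame({"id": [1001, 1002], "age": [23, 44]}, index=["x", "y"])
s_col = frame["age"]

print(s_list["b"])         # value stored under label "b"
print(list(s_dict.index))  # labels taken from the dictionary keys
print(type(s_col))         # a DataFrame column is itself a Series
```

A column pulled out of a DataFrame keeps the DataFrame's row index, which is why s_col is indexed by "x" and "y".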

II. DataFrame Common Methods

No. | Method | Description
1 | df.head() | Return the first 5 rows of the data
2 | df.tail() | Return the last 5 rows of the data
3 | pandas.qcut() | Quantile-based discretization: split a variable into equal-sized buckets based on rank or sample quantiles
4 | pandas.cut() | Value-based discretization: split a variable into intervals defined by bin edges
5 | pandas.date_range() | Return a time index
6 | df.apply() | Apply a function along the given axis
7 | Series.value_counts() | Return the count of each distinct value
8 | df.reset_index() | Reset the index. With drop=True the original index is discarded and a new index starting from 0 is set. Often used together with groupby().
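The difference between the two discretization functions is easiest to see side by side. The sketch below uses made-up ages and arbitrary bin edges and labels: pd.cut bins by value, while pd.qcut bins by sample quantiles.

```python
import pandas as pd

ages = pd.Series([23, 44, 54, 32, 34, 32, 61, 19])

# cut: bins are defined by value; here the edges are given explicitly
by_value = pd.cut(ages, bins=[0, 30, 50, 100], labels=["young", "middle", "senior"])

# qcut: bins are defined by sample quantiles, so each bin holds
# about the same number of observations
by_quantile = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(by_value.value_counts().sort_index())
print(by_quantile.value_counts().sort_index())
```

With this data the value-based bins are uneven in size, while the quantile-based bins each receive two observations.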

Example: Resetting the index

df_inner.reset_index()
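As a rough illustration of how reset_index() is typically paired with groupby() and with filtering, the following sketch uses a small hypothetical sales table (not the df_inner table from the example above).

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["beijing", "shanghai", "beijing", "shenzhen"],
    "price": [1200, 2133, 5433, 4432],
})

# After groupby, the group keys become the index of the result
totals = sales.groupby("city")["price"].sum()

# reset_index() moves that index back into an ordinary column
totals_flat = totals.reset_index()

# drop=True discards the old index instead of keeping it as a column
filtered = sales[sales["price"] > 2000].reset_index(drop=True)

print(totals_flat)
print(filtered)
```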

III. Data indexing

No. | Method | Description
1 | .values | Return the DataFrame's data as a two-dimensional ndarray
2 | .append(idx) | Concatenate another Index object, producing a new Index object
3 | .insert(loc, e) | Add an element at position loc
4 | .delete(loc) | Delete the element at position loc
5 | .union(idx) | Compute the union
6 | .intersection(idx) | Compute the intersection
7 | .difference(idx) | Compute the set difference, producing a new Index object
8 | .reindex(index, columns, fill_value, method, limit, copy) | Change or rearrange the index of a Series or DataFrame. A new object is created, and missing values are introduced for any index value that does not currently exist.
9 | .drop() | Delete the specified row or column labels from a Series or DataFrame
10 | .loc[row label, column label] | Query data by label. The first value is the row label and the second is the column label.
11 | .iloc[row position, column position] | Query data by the default integer positions

Example: Extracting a single row of values by index label

df_inner.loc[3]
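The Index operations in the table all return new objects rather than modifying the index in place. A small sketch with two made-up indexes:

```python
import pandas as pd

idx_a = pd.Index(["a", "b", "c"])
idx_b = pd.Index(["b", "c", "d"])

print(idx_a.union(idx_b))         # labels found in either index
print(idx_a.intersection(idx_b))  # labels found in both
print(idx_a.difference(idx_b))    # labels in idx_a but not in idx_b
print(idx_a.insert(1, "z"))       # new Index with "z" at position 1
print(idx_a.delete(0))            # new Index without the element at position 0

s = pd.Series([1, 2, 3], index=idx_a)
# reindex returns a new object; labels missing from s receive fill_value
s2 = s.reindex(["c", "a", "d"], fill_value=0)
print(s2)
```

Because Index objects are immutable, idx_a still holds "a", "b", "c" after all of these calls.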

IV. DataFrame methods for selecting and recombining data

No. | Method | Description
1 | df[val] | Select a single column or a group of columns from a DataFrame. Convenient special cases: Boolean arrays (filter rows), slices (slice rows), and Boolean DataFrames (set values based on a condition).
2 | df.loc[val] | Select a single row or a group of rows by label
3 | df.loc[:, val] | Select a single column or a subset of columns by label
4 | df.loc[val1, val2] | Select rows and columns at the same time by label
5 | df.iloc[where] | Select a single row or a subset of rows by integer position
6 | df.iloc[:, where] | Select a single column or a subset of columns by integer position
7 | df.iloc[where_i, where_j] | Select rows and columns at the same time by integer position
8 | df.at[label_i, label_j] | Select a single scalar by row and column label
9 | df.iat[i, j] | Select a single scalar by row and column position (integers)
10 | reindex | Select rows or columns by label
11 | get_value | Get a single value by row and column label (deprecated in pandas 0.21 and removed in 1.0; use .at instead)
12 | set_value | Set a single value by row and column label (deprecated in pandas 0.21 and removed in 1.0; use .at instead)

Example: Using iloc to extract a region of data by position

df_inner.iloc[:3,:2]

With iloc, the numbers before and after the colon are positions rather than index labels. Positions start at 0, so this selects the first three rows and the first two columns.
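To contrast label-based and position-based selection, here is a short sketch on a hypothetical three-row table whose index labels (1001 to 1003) deliberately differ from the positions (0 to 2).

```python
import pandas as pd

people = pd.DataFrame(
    {"city": ["beijing", "shanghai", "guangzhou"], "age": [23, 44, 54]},
    index=[1001, 1002, 1003],
)

row_by_label = people.loc[1002]      # one row, selected by index label
row_by_position = people.iloc[1]     # the same row, selected by position
block = people.iloc[:2, :1]          # first two rows, first column
age_label = people.at[1003, "age"]   # single scalar by labels
age_position = people.iat[2, 1]      # single scalar by positions
adults = people[people["age"] > 30]  # a Boolean array filters rows

print(block)
print(adults)
```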

V. Sorting

No. | Function | Description
1 | .sort_index(axis=0, ascending=True) | Sort by the index labels along the specified axis
2 | Series.sort_values(axis=0, ascending=True) | Sort a Series by its values; only axis 0 is available
3 | DataFrame.sort_values(by, axis=0, ascending=True) | Sort a DataFrame by values; by is a label or a list of labels on the axis

Example: Sort by index

df_inner.sort_index()
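A brief sketch of the sorting functions on a made-up table with a shuffled index; the column names are assumptions chosen to match the sample data in section I.

```python
import pandas as pd

data = pd.DataFrame(
    {"age": [32, 23, 32, 44], "price": [4432, 1200, 5433, 2133]},
    index=[3, 0, 2, 1],
)

by_index = data.sort_index()         # rows ordered by index label
by_age = data.sort_values(by="age")  # rows ordered by one column
# Several columns: age ascending, then price descending within equal ages
by_both = data.sort_values(by=["age", "price"], ascending=[True, False])

print(by_index)
print(by_both)
```

All three calls return new objects; data itself keeps its original row order.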

VI. Correlation analysis and statistical analysis

No. | Method | Description
1 | .idxmin() | Return the index label of the minimum value (custom index)
2 | .idxmax() | Return the index label of the maximum value (custom index)
3 | .argmin() | Return the integer position of the minimum value (automatic index)
4 | .argmax() | Return the integer position of the maximum value (automatic index)
5 | .describe() | Produce several summary statistics for each column, giving a quick statistical overview of the data
6 | .sum() | Calculate the sum of each column
7 | .count() | Count the non-NaN values
8 | .mean() | Calculate the arithmetic mean
9 | .median() | Calculate the median
10 | .var() | Calculate the variance
11 | .std() | Calculate the standard deviation
12 | .corr() | Calculate the correlation coefficient matrix
13 | .cov() | Calculate the covariance matrix
14 | .corrwith() | Calculate the correlation between the columns or rows of a DataFrame and another Series or DataFrame
15 | .min() | Calculate the minimum value
16 | .max() | Calculate the maximum value
17 | .diff() | Calculate first-order differences; well suited to time series
18 | .mode() | Calculate the mode, returning the most frequent value(s)
19 | .quantile() | Calculate quantiles (q between 0 and 1)
20 | .isin() | Test membership in a set of values (vectorized); useful for filtering subsets of a Series or of DataFrame columns
21 | .unique() | Return an array of the unique values in a Series
22 | .value_counts() | Count how often each value occurs in a Series

Example: Determine if the value in the city column is Beijing.

df_inner['city'].isin(['beijing']) 
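Note that isin() compares exact values, so the raw city strings from section I (with mixed case and stray spaces) would not match 'beijing' until they are cleaned. The sketch below assumes a simple strip-and-lowercase cleanup, and also shows the label versus position distinction between idxmax() and argmax() on a made-up price Series.

```python
import pandas as pd

city = pd.Series(["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "])
price = pd.Series([1200, 2133, 5433, 4432], index=["a", "b", "c", "d"])

# Normalise whitespace and case before testing membership
clean_city = city.str.strip().str.lower()
is_beijing = clean_city.isin(["beijing"])

print(is_beijing.tolist())
print(price.idxmax())  # index label of the maximum
print(price.argmax())  # integer position of the maximum
print(price.quantile(0.5), price.median())  # the 0.5 quantile is the median
```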

VII. Grouping methods

No. | Method | Description
1 | DataFrame.groupby() | Grouping function
2 | pandas.cut() | Divide the objects under analysis into intervals according to a numerical indicator chosen for their characteristics, so that each interval can be studied to reveal underlying relationships and regularities

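Example: .groupby usage

# df here stands for any DataFrame that has a 'name' column
group_by_name = df.groupby('name')
print(type(group_by_name))

The output is similar to the following (the exact module path depends on the pandas version):

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>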
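A groupby object does nothing until an aggregation is applied to it. The following sketch uses a hypothetical staff table with arbitrary bin edges. It groups first by a column and then by the intervals that pd.cut produces.

```python
import pandas as pd

staff = pd.DataFrame({
    "name": ["ann", "bob", "ann", "bob", "cat"],
    "age": [23, 44, 23, 44, 61],
    "salary": [100, 200, 150, 250, 300],
})

# Group by a column, then aggregate each group
total_by_name = staff.groupby("name")["salary"].sum()

# Group by the intervals produced with pd.cut (bin edges are arbitrary here)
age_band = pd.cut(staff["age"], bins=[0, 30, 50, 100])
mean_by_band = staff.groupby(age_band, observed=True)["salary"].mean()

print(total_by_name)
print(mean_by_band)
```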

VIII. Methods for reading and writing text-format data

No. | Method | Description
1 | read_csv | Load delimited data from a file, URL, or file-like object. The default delimiter is a comma.
2 | read_table | Load delimited data from a file, URL, or file-like object. The default delimiter is a tab ('\t').
3 | read_fwf | Read data in fixed-width column format (i.e., no delimiters)
4 | read_clipboard | Read data from the clipboard; the clipboard version of read_table. Useful when converting tables from web pages.
5 | read_excel | Read tabular data from an Excel XLS or XLSX file
6 | read_hdf | Read HDF5 files written by pandas
7 | read_html | Read all tables in an HTML document
8 | read_json | Read data from a JSON string
9 | read_msgpack | Read pandas data encoded in the MessagePack binary format (removed in pandas 1.0)
10 | read_pickle | Read any object stored in the Python pickle format
11 | read_sas | Read a SAS dataset stored in one of the SAS system's custom storage formats
12 | read_sql | Read the results of a SQL query as a pandas DataFrame
13 | read_stata | Read a dataset in Stata file format
14 | read_feather | Read the Feather binary file format

Example: Importing a CSV or xlsx file

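df = pd.read_csv('', header=1)  # pass the path to your CSV file; header=1 reads column names from the second line
df = pd.read_excel('')          # pass the path to your Excel file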
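To try the readers without a file on disk, an in-memory text buffer can stand in for the file. The sketch below uses made-up contents with a title line above the column names, which is the situation header=1 is meant for, and shows the tab default of read_table.

```python
import io
import pandas as pd

# Stand-in for a CSV file: line 0 is a title, line 1 holds the column names
raw = io.StringIO(
    "exported sample\n"
    "id,city,price\n"
    "1001,beijing,1200\n"
    "1002,shanghai,\n"
)

# header=1 takes column names from the second line; the line before it is skipped
table = pd.read_csv(raw, header=1)

# read_table splits on tabs by default
tabbed = pd.read_table(io.StringIO("id\tcity\n1001\tbeijing\n"))

print(table)
print(tabbed)
```

The empty price field in the last CSV row is read as NaN.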

IX. Dealing with missing data

No. | Method | Description
1 | .fillna(value, method, limit, inplace) | Fill in missing values
2 | .dropna() | Delete missing data
3 | .info() | Show information about the data, including each field's name, non-null count, and data type
4 | .isnull() | Return an object (Series or DataFrame) of the same shape with Boolean values indicating which values are missing

Example: Viewing basic data table information (dimensions, column names, data formats, etc.)

df.info()
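The other methods in the table can be sketched on a small made-up table with two missing prices. Filling with the column mean is one common choice, shown here only as an illustration.

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004],
    "price": [1200, np.nan, 2133, np.nan],
})

missing_mask = prices["price"].isnull()  # True where the value is missing
without_missing = prices.dropna()        # drop rows that contain any NaN
filled_zero = prices.fillna(value=0)     # replace NaN with a constant
# Replace NaN with the mean of the non-missing prices
filled_mean = prices["price"].fillna(prices["price"].mean())

print(missing_mask.tolist())
print(filled_mean.tolist())
```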

X. Data conversion

No. | Method | Description
1 | .replace(old, new) | Replace old values with new ones. old and new can be lists to replace several values at once. A new object is returned by default; pass inplace=True to modify the existing object in place.
2 | .duplicated() | Determine whether each row duplicates an earlier row, returning a Boolean Series
3 | .drop_duplicates() | Remove duplicate rows and return the resulting object with the duplicates removed

Example: Drop later occurrences of duplicate values (the first occurrence is kept):

df['city'].drop_duplicates()
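The three conversion methods often appear together when cleaning a column. This is a minimal sketch on a made-up city Series; the keep="last" call is included to contrast with the default of keeping the first occurrence.

```python
import pandas as pd

city = pd.Series(["beijing", "sh", "guangzhou", "shenzhen", "shanghai", "beijing"])

unified = city.replace("sh", "shanghai")  # exact-value replacement, returns a new Series
dup_mask = unified.duplicated()           # True for repeats of an earlier value
keep_first = unified.drop_duplicates()    # keeps the first occurrence (default)
keep_last = unified.drop_duplicates(keep="last")

print(dup_mask.tolist())
print(keep_first.tolist())
print(keep_last.tolist())
```

Because replace() matches whole values here, "shanghai" and "shenzhen" are left untouched even though they start with "sh".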

This article summarizes some common Pandas methods. Basic concepts, such as what a Series or a DataFrame is, are best learned from an introduction to Pandas first. Once those fundamentals are clear, the methods listed here should make data processing and analysis with Pandas much smoother.