Implementing Multi-Index Techniques in Pandas

Hello everyone. In data analysis, working with complex multidimensional data is a common requirement. Python's Pandas library provides a multi-index (MultiIndex, also called a hierarchical index) that makes it possible to manage and analyze multi-level data structures flexibly. This article introduces the multi-index in Pandas, shows how to create, manipulate, and reset one, and demonstrates its use in practice through concrete example code.

1. Overview of the multi-index

A multi-index is a hierarchical index that lets a DataFrame or Series carry more than one level of index labels. It expresses hierarchical relationships in the data explicitly, which makes complex data sets more intuitive to work with.

A multi-index can be built from several columns of data, organizing the DataFrame into a hierarchical form.

import pandas as pd

# Create a DataFrame whose columns will supply the index levels
data = {'City': ['Beijing', 'Beijing', 'Shanghai', 'Shanghai', 'Guangzhou', 'Guangzhou'],
        'years': [2020, 2021, 2020, 2021, 2020, 2021],
        'population': [2154, 2160, 2424, 2430, 1530, 1540],
        'GDP': [36102, 37200, 38155, 39400, 25000, 26000]}

df = pd.DataFrame(data)

# Set the multi-index from the City and years columns
df.set_index(['City', 'years'], inplace=True)

print(df)

Running the above code produces the following output:

                 population    GDP
City      years
Beijing   2020         2154  36102
          2021         2160  37200
Shanghai  2020         2424  38155
          2021         2430  39400
Guangzhou 2020         1530  25000
          2021         1540  26000

In this example, the set_index() function turns the City and years columns into a multi-index, producing a DataFrame with a hierarchy.
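
To see what set_index() actually built, it can help to inspect the index object itself. The following is a minimal sketch that uses a shortened copy of the city data; the variable names (index_type, level_names, and so on) are illustrative only.

```python
import pandas as pd

# Shortened copy of the article's data (two cities only).
data = {
    'City': ['Beijing', 'Beijing', 'Shanghai', 'Shanghai'],
    'years': [2020, 2021, 2020, 2021],
    'GDP': [36102, 37200, 38155, 39400],
}
df = pd.DataFrame(data).set_index(['City', 'years'])

# The index is a pandas MultiIndex with two named levels.
index_type = type(df.index).__name__
level_names = list(df.index.names)
n_levels = df.index.nlevels

# Each row label is a tuple with one entry per level.
first_label = df.index[0]

# Distinct values found in a single level, in order of appearance.
cities = list(df.index.get_level_values('City').unique())

print(index_type, level_names, n_levels)
print(first_label)
print(cities)
```

Thinking of every row label as a tuple such as ('Beijing', 2020) makes the selection operations in the next section easier to read.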

2. Basic multi-index operations

Once a multi-index exists, Pandas offers several ways to manipulate and query the data, including selecting, slicing, swapping levels, and resetting the index.

2.1 Selecting and slicing a multi-index

A multi-index makes it easy to select or slice data. For example, you can select the data for one city, or the data for one specific year.

import pandas as pd

# Create a DataFrame with a two-level index
data = {'City': ['Beijing', 'Beijing', 'Shanghai', 'Shanghai', 'Guangzhou', 'Guangzhou'],
        'years': [2020, 2021, 2020, 2021, 2020, 2021],
        'population': [2154, 2160, 2424, 2430, 1530, 1540],
        'GDP': [36102, 37200, 38155, 39400, 25000, 26000]}

df = pd.DataFrame(data)
df.set_index(['City', 'years'], inplace=True)

# Select the data for a specific city (outer level)
beijing_data = df.loc['Beijing']
print("Beijing's data:", beijing_data, sep="\n")

# Select the data for a specific year (inner level)
data_2021 = df.xs(2021, level='years')
print("2021 data:", data_2021, sep="\n")

Running the above code produces the following output:

Beijing's data:
       population    GDP
years
2020         2154  36102
2021         2160  37200
2021 data:
           population    GDP
City
Beijing          2160  37200
Shanghai         2430  39400
Guangzhou        1540  26000

In this example, loc[] selects the data for Beijing, and the xs() method selects the 2021 data by year.
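
Beyond selecting a whole city or a whole year, a single row can be addressed with a tuple that names one label per level. The sketch below uses a shortened copy of the data and also shows the drop_level=False option of xs(), which keeps the selected level in the result; the variable names are illustrative only.

```python
import pandas as pd

# Shortened copy of the article's data (two cities only).
data = {
    'City': ['Beijing', 'Beijing', 'Shanghai', 'Shanghai'],
    'years': [2020, 2021, 2020, 2021],
    'population': [2154, 2160, 2424, 2430],
    'GDP': [36102, 37200, 38155, 39400],
}
df = pd.DataFrame(data).set_index(['City', 'years'])

# A full (City, years) tuple selects exactly one row, returned as a Series.
row = df.loc[('Beijing', 2021)]

# Adding a column label narrows the result to a single value.
gdp = df.loc[('Beijing', 2021), 'GDP']

# By default xs() drops the level it selected on; drop_level=False keeps it.
kept = df.xs(2021, level='years', drop_level=False)

print(row)
print(gdp)
print(kept)
```

Keeping the level is convenient when the result will later be concatenated with selections for other years.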

2.2 Swapping levels and resetting the index

The levels of a multi-index can be swapped, and a multi-index can be reset to an ordinary index.

import pandas as pd

# Create a DataFrame with a two-level index
data = {'City': ['Beijing', 'Beijing', 'Shanghai', 'Shanghai', 'Guangzhou', 'Guangzhou'],
        'years': [2020, 2021, 2020, 2021, 2020, 2021],
        'population': [2154, 2160, 2424, 2430, 1530, 1540],
        'GDP': [36102, 37200, 38155, 39400, 25000, 26000]}

df = pd.DataFrame(data)
df.set_index(['City', 'years'], inplace=True)

# Swap the two index levels (row order is unchanged)
swapped_df = df.swaplevel()
print("DataFrame after swapping levels:", swapped_df, sep="\n")

# Reset the index, turning both levels back into columns
reset_df = df.reset_index()
print("DataFrame after resetting index:", reset_df, sep="\n")

Running the above code produces the following output:

DataFrame after swapping levels:
                 population    GDP
years City
2020  Beijing          2154  36102
2021  Beijing          2160  37200
2020  Shanghai         2424  38155
2021  Shanghai         2430  39400
2020  Guangzhou        1530  25000
2021  Guangzhou        1540  26000
DataFrame after resetting index:
        City  years  population    GDP
0    Beijing   2020        2154  36102
1    Beijing   2021        2160  37200
2   Shanghai   2020        2424  38155
3   Shanghai   2021        2430  39400
4  Guangzhou   2020        1530  25000
5  Guangzhou   2021        1540  26000

In this example, swaplevel() swaps the City and years index levels, and reset_index() restores the multi-index to an ordinary index.
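
Because swaplevel() only reorders the levels and leaves the rows where they were, it is often followed by sort_index() so that rows sharing the new outer label sit together. The sketch below shows that combination on a shortened copy of the data, along with resetting only one level through the level argument of reset_index(); the variable names are illustrative only.

```python
import pandas as pd

# Shortened copy of the article's data (two cities only).
data = {
    'City': ['Beijing', 'Beijing', 'Shanghai', 'Shanghai'],
    'years': [2020, 2021, 2020, 2021],
    'population': [2154, 2160, 2424, 2430],
    'GDP': [36102, 37200, 38155, 39400],
}
df = pd.DataFrame(data).set_index(['City', 'years'])

# Swap the levels by name, then sort so rows are grouped by year.
swapped = df.swaplevel('City', 'years').sort_index()

# Move only the 'years' level back into a column; 'City' stays as the index.
partial = df.reset_index(level='years')

print(swapped)
print(partial)
```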

3. Advanced multi-index operations

Beyond basic selection, the Pandas multi-index also supports more advanced operations such as grouped aggregation, multi-index slicing, and index sorting. These make it possible to handle complex data sets more flexibly.

3.1 Grouping and aggregating on a multi-index

Rows can be grouped by one or more index levels and then aggregated, for example to compute a sum or a mean.

import pandas as pd

# Create a DataFrame with a two-level index
data = {'City': ['Beijing', 'Beijing', 'Shanghai', 'Shanghai', 'Guangzhou', 'Guangzhou'],
        'years': [2020, 2021, 2020, 2021, 2020, 2021],
        'population': [2154, 2160, 2424, 2430, 1530, 1540],
        'GDP': [36102, 37200, 38155, 39400, 25000, 26000]}

df = pd.DataFrame(data)
df.set_index(['City', 'years'], inplace=True)

# Group by the City index level and sum GDP
grouped_gdp = df.groupby(level='City')['GDP'].sum()
print("Total GDP grouped by city:", grouped_gdp, sep="\n")

Running the above code produces the following output:

Total GDP grouped by city:
City
Beijing      73302
Guangzhou    51000
Shanghai     77555
Name: GDP, dtype: int64

In this example, the rows are grouped by the City level and GDP is summed across years for each city. Because groupby sorts the group keys by default, the cities appear in alphabetical order rather than in their original order.
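
Grouping is not limited to the outer level or to a single aggregate. The following sketch groups the same city data by the years level, and then computes two aggregates per city in one call using named aggregation; the output column names total_gdp and mean_population are made up for this illustration.

```python
import pandas as pd

data = {
    'City': ['Beijing', 'Beijing', 'Shanghai', 'Shanghai', 'Guangzhou', 'Guangzhou'],
    'years': [2020, 2021, 2020, 2021, 2020, 2021],
    'population': [2154, 2160, 2424, 2430, 1530, 1540],
    'GDP': [36102, 37200, 38155, 39400, 25000, 26000],
}
df = pd.DataFrame(data).set_index(['City', 'years'])

# Group by the inner level: total GDP across all cities for each year.
by_year = df.groupby(level='years')['GDP'].sum()

# Named aggregation: several statistics per city in a single call.
# 'total_gdp' and 'mean_population' are illustrative column names.
summary = df.groupby(level='City').agg(
    total_gdp=('GDP', 'sum'),
    mean_population=('population', 'mean'),
)

print(by_year)
print(summary)
```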

3.2 Multi-index slicing

Pandas supports slice-based selection on a multi-index through pd.IndexSlice, which is very useful when dealing with multidimensional data.

import pandas as pd
import numpy as np

# Create a DataFrame with a two-level index
arrays = [
    ['Beijing', 'Beijing', 'Beijing', 'Shanghai', 'Shanghai', 'Guangzhou', 'Guangzhou'],
    [2020, 2021, 2022, 2020, 2021, 2020, 2021]
]
index = pd.MultiIndex.from_arrays(arrays, names=('City', 'years'))
# Random values drawn from a standard normal distribution
data = np.random.randn(7, 2)
df = pd.DataFrame(data, index=index, columns=['Indicator 1', 'Indicator 2'])

# Slice the multi-index: every city, year 2021 only
sliced_df = df.loc[pd.IndexSlice[:, 2021], :]
print("Sliced DataFrame:", sliced_df, sep="\n")

Running the above code produces output like the following (the values are random, so yours will differ):

Sliced DataFrame:
                 Indicator 1  Indicator 2
City      years
Beijing   2021      0.558769     0.722681
Shanghai  2021      0.392982     0.888569
Guangzhou 2021     -0.668413    -0.907221

In this example, pd.IndexSlice slices the multi-index to select the 2021 data for all cities. This kind of slicing makes it easy to extract the portion of interest from multi-level data.
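
Selecting a range of labels, rather than a single label, works the same way, with one caveat: range slices on an unsorted multi-index can raise an UnsortedIndexError, so the usual approach is to call sort_index() first. The sketch below assumes a placeholder value column (the numbers 0 to 6 carry no meaning) so that the result is easy to follow.

```python
import pandas as pd

arrays = [
    ['Beijing', 'Beijing', 'Beijing', 'Shanghai', 'Shanghai', 'Guangzhou', 'Guangzhou'],
    [2020, 2021, 2022, 2020, 2021, 2020, 2021],
]
index = pd.MultiIndex.from_arrays(arrays, names=('City', 'years'))

# Placeholder values 0..6, used only to make the selected rows easy to identify.
df = pd.DataFrame({'value': list(range(7))}, index=index)

# Sort the index first so that label ranges are well defined.
df_sorted = df.sort_index()

idx = pd.IndexSlice

# All cities, years 2021 through 2022 (label slices include both endpoints).
recent = df_sorted.loc[idx[:, 2021:2022], :]

# Cities from 'Beijing' through 'Guangzhou' in sorted order, all years.
first_two_cities = df_sorted.loc[idx['Beijing':'Guangzhou', :], :]

print(recent)
print(first_two_cities)
```

Note that label-based slices are inclusive at both ends, unlike ordinary Python position slices.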

3.3 Sorting a multi-index

A multi-index can also be sorted, which is useful when you need to view the data in a specific order.

import pandas as pd
import numpy as np

# Create a DataFrame with a two-level index
arrays = [
    ['Beijing', 'Beijing', 'Shanghai', 'Shanghai', 'Guangzhou', 'Guangzhou'],
    [2021, 2020, 2021, 2020, 2021, 2020]
]
index = pd.MultiIndex.from_arrays(arrays, names=('City', 'years'))
# Random values drawn from a standard normal distribution
data = np.random.randn(6, 2)
df = pd.DataFrame(data, index=index, columns=['Indicator 1', 'Indicator 2'])

# Sort the multi-index: City ascending, years descending
sorted_df = df.sort_index(level=['City', 'years'], ascending=[True, False])
print("Sorted DataFrame:", sorted_df, sep="\n")

Running the above code produces output like the following (the values are random, so yours will differ):

Sorted DataFrame:
                 Indicator 1  Indicator 2
City      years
Beijing   2021      1.013978     0.731106
          2020     -0.856558     0.696849
Guangzhou 2021     -0.542223     1.212357
          2020      0.221365    -0.055147
Shanghai  2021     -0.585347     0.494768
          2020      0.129116    -0.477598

In this example, the multi-index is sorted by city name in ascending order and by year in descending order. Sorting this way lets you view the data in whatever order best suits the analysis.
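
It can also be useful to check whether an index is already sorted before sorting it. The following sketch uses the is_monotonic_increasing property for that check and compares a default sort with the mixed-direction sort described above; the value column holds placeholder numbers, not real data.

```python
import pandas as pd

arrays = [
    ['Beijing', 'Beijing', 'Shanghai', 'Shanghai', 'Guangzhou', 'Guangzhou'],
    [2021, 2020, 2021, 2020, 2021, 2020],
]
index = pd.MultiIndex.from_arrays(arrays, names=('City', 'years'))

# Placeholder values 0..5, used only to track where each row ends up.
df = pd.DataFrame({'value': list(range(6))}, index=index)

# The index as built is not in ascending order.
was_sorted = df.index.is_monotonic_increasing

# Default sort: every level ascending.
fully_sorted = df.sort_index()
now_sorted = fully_sorted.index.is_monotonic_increasing

# Mixed directions: City ascending, years descending.
mixed = df.sort_index(level=['City', 'years'], ascending=[True, False])

print(was_sorted, now_sorted)
print(mixed)
```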

4. Practical applications of the multi-index

A multi-index is useful in many practical settings, especially when dealing with time series data, panel data, and data cubes.

In time series analysis, dates often need to be combined with other categorical variables (such as product or region). A multi-index can manage and analyze this kind of data.

import pandas as pd
import numpy as np

# Create time series data: every combination of date and product
dates = pd.date_range('2023-01-01', periods=6)
products = ['Product A', 'Product B']
index = pd.MultiIndex.from_product([dates, products], names=['date', 'product'])
# Random values drawn from a standard normal distribution
data = np.random.randn(12, 2)
df = pd.DataFrame(data, index=index, columns=['Sales', 'profit'])

print("DataFrame of time series data:", df, sep="\n")

# Total sales grouped by the product level
total_sales = df.groupby(level='product')['Sales'].sum()
print("\nTotal sales by product group:", total_sales, sep="\n")

Running the above code produces output like the following (the values are random, so yours will differ):

DataFrame of time series data:
                         Sales    profit
date       product
2023-01-01 Product A -0.856051  0.166173
           Product B  0.934522  0.570209
2023-01-02 Product A -0.205493  1.195617
           Product B -1.286157  0.122996
2023-01-03 Product A -1.618019  0.593061
           Product B  0.246715 -0.654644
2023-01-04 Product A  0.158859 -1.404354
           Product B -0.255284  1.383135
2023-01-05 Product A  0.408226  0.799745
           Product B  0.411282  0.339705
2023-01-06 Product A -1.023615 -0.616391
           Product B -1.564080  1.062635

Total sales by product group:
product
Product A   -3.136093
Product B   -1.513002
Name: Sales, dtype: float64

In this example, a multi-index combines the date and the product, and grouping by the product level gives the total sales of each product.
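
A date-by-product multi-index can also be pivoted into a wide table with unstack(), which turns one index level into columns. The sketch below assumes small made-up sales figures (10 through 60) instead of random numbers so that the results can be checked by hand.

```python
import pandas as pd

dates = pd.date_range('2023-01-01', periods=3)
products = ['Product A', 'Product B']
index = pd.MultiIndex.from_product([dates, products], names=['date', 'product'])

# Made-up sales figures for illustration only.
df = pd.DataFrame({'Sales': [10, 20, 30, 40, 50, 60]}, index=index)

# Move the 'product' level into the columns: one row per date, one column per product.
wide = df['Sales'].unstack('product')

# Total sales per product, computed from the long (multi-index) form.
total = df.groupby(level='product')['Sales'].sum()

print(wide)
print(total)
```

The wide form is convenient for side-by-side comparison or plotting, while the long multi-index form is usually easier to group and aggregate.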

In data analysis, the Pandas multi-index is a powerful tool for handling complex multidimensional data. With it, users can organize a DataFrame in layers, which makes selection, slicing, grouping, and aggregation more intuitive. The multi-index is especially flexible for time series, panel data, and other multi-level data. Mastering multi-index operations in Pandas will help you work with complex data structures with confidence.
