Detailed tutorial on using Pandas to process .xlsx files in Python

Preface

Pandas is one of the core libraries for data analysis in Python. It provides rich data processing functions and is especially powerful for tabular data such as .xlsx files. Combining Python's flexibility and simplicity, Pandas lets users easily read, write, clean, manipulate, and analyze data. This article introduces common operations for handling .xlsx files with Pandas, including reading, writing, filtering, merging, and statistics.

1. Environment configuration

1. Install Pandas

First, make sure that Pandas and openpyxl (used for reading .xlsx files) are installed. You can install them with the following command:

pip install pandas openpyxl

openpyxl is the library Pandas relies on by default to read .xlsx files, so make sure it has been installed correctly.

2. Import Pandas

Before you start processing files, import Pandas in your code:

import pandas as pd
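As a quick sanity check after installation, the following minimal sketch imports both packages and prints their versions. The variable name versions is only illustrative.

```python
import openpyxl
import pandas as pd

# Collect the installed versions to confirm that both packages import correctly.
versions = {"pandas": pd.__version__, "openpyxl": openpyxl.__version__}
print(versions)
```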

2. Read Excel files

Pandas provides the pd.read_excel() function, which makes it easy to read .xlsx files.

1. Read a single worksheet

The most common operation is reading a single worksheet from an .xlsx file. Here is the basic usage for reading an Excel file:

# Read the first worksheet in an Excel file (pass the path to your .xlsx file)
df = pd.read_excel('')

# Show the first five rows of data
print(df.head())

Use the sheet_name parameter to specify which worksheet to read:

# Read the worksheet named "Sheet2"
df = pd.read_excel('', sheet_name='Sheet2')
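To try this without an existing workbook, the sketch below first writes a small two-sheet file to a temporary directory and then reads one sheet back, once by name and once by zero-based position. The file name and the sample data are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a small two-sheet workbook in a temporary directory (hypothetical data).
path = Path(tempfile.mkdtemp()) / "example.xlsx"
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [24, 27]}).to_excel(
        writer, sheet_name="Sheet1", index=False
    )
    pd.DataFrame({"City": ["Paris", "Rome"]}).to_excel(
        writer, sheet_name="Sheet2", index=False
    )

# sheet_name accepts either a sheet name or a zero-based position.
by_name = pd.read_excel(path, sheet_name="Sheet2")
by_position = pd.read_excel(path, sheet_name=1)
print(by_name)
```

Selecting by name is usually more robust than selecting by position, because it keeps working if the sheets are reordered.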

2. Read multiple worksheets

If an Excel file contains multiple worksheets and you want to read several of them at once, pass a list to sheet_name:

# Read multiple worksheets; the result is a dictionary keyed by sheet name
sheets = pd.read_excel('', sheet_name=['Sheet1', 'Sheet2'])

# Get the data of one worksheet
sheet1_df = sheets['Sheet1']

3. Read all worksheets

To read all worksheets, set sheet_name=None:

# Read all worksheets
sheets = pd.read_excel('', sheet_name=None)

# Iterate over the dictionary of worksheets
for sheet_name, data in sheets.items():
    print(f"Sheet name: {sheet_name}")
    print(data.head())
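When every worksheet shares the same columns, the dictionary returned by sheet_name=None can be stacked into a single DataFrame. The following is a minimal sketch under that assumption; the workbook, the sheet names "Jan" and "Feb", and the column names are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical workbook with one sheet per month and identical columns.
path = Path(tempfile.mkdtemp()) / "monthly.xlsx"
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"Item": ["pen", "ink"], "Qty": [3, 1]}).to_excel(
        writer, sheet_name="Jan", index=False
    )
    pd.DataFrame({"Item": ["pad"], "Qty": [5]}).to_excel(
        writer, sheet_name="Feb", index=False
    )

# sheet_name=None returns a dict mapping sheet name -> DataFrame.
sheets = pd.read_excel(path, sheet_name=None)

# Stack the sheets; the dict keys become a "sheet" column.
combined = (
    pd.concat(sheets, names=["sheet", "row"])
    .reset_index(level="sheet")
    .reset_index(drop=True)
)
print(combined)
```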

4. Read some columns or rows

Use the usecols parameter to read only specific columns, or the nrows parameter to read only the first few rows:

# Read columns A through C (the first three columns)
df = pd.read_excel('', usecols="A:C")

# Read only the first 10 rows of data
df = pd.read_excel('', nrows=10)
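Besides a letter range, usecols also accepts a list of column names or a comma-separated string of column letters, and it can be combined with nrows. The sketch below illustrates this on a hypothetical four-column file written to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample file with four columns and three data rows.
path = Path(tempfile.mkdtemp()) / "people.xlsx"
pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [24, 27, 22],
        "Score": [85, 62, 90],
        "City": ["Paris", "Rome", "Oslo"],
    }
).to_excel(path, index=False)

# Select columns by header name and keep only the first two data rows.
subset = pd.read_excel(path, usecols=["Name", "Score"], nrows=2)

# The same selection expressed with Excel column letters (A and C).
letters = pd.read_excel(path, usecols="A,C", nrows=2)
print(subset)
```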

5. Skip rows

Use the skiprows parameter to skip the first few rows of the file:

# Skip the first 5 rows of the file
df = pd.read_excel('', skiprows=5)
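A typical reason to skip rows is a title or note placed above the real header. The sketch below builds such a sheet with openpyxl (the title text and data are hypothetical) and then skips the two leading rows so that the third row becomes the header.

```python
import tempfile
from pathlib import Path

import pandas as pd
from openpyxl import Workbook

# Build a sheet whose real header is on the third row (hypothetical content).
path = Path(tempfile.mkdtemp()) / "report.xlsx"
wb = Workbook()
ws = wb.active
ws.append(["Quarterly report"])
ws.append(["Exported by a hypothetical tool"])
ws.append(["Name", "Score"])
ws.append(["Alice", 85])
ws.append(["Bob", 62])
wb.save(str(path))

# Skip the two title rows; the next row is then used as the header.
df = pd.read_excel(path, skiprows=2)
print(df)
```

skiprows also accepts a list of zero-based row numbers, for example skiprows=[0, 1], when the rows to skip are not contiguous.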

3. Write to Excel files

Pandas can write the data in a DataFrame to an Excel file with the to_excel() method.

1. Write DataFrame to Excel

Write a DataFrame to an .xlsx file:

df.to_excel('', index=False)

Here, index=False means that the row index is not written. If you need to keep the index, omit the argument or set it to True.

2. Write multiple worksheets

If you want to write data to multiple worksheets, you can use pd.ExcelWriter:

with pd.ExcelWriter('multi_sheet_output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)
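The same pattern extends to any number of sheets. The following sketch writes a dictionary of DataFrames, one sheet per key, and then lists the sheet names with pd.ExcelFile to confirm the result. The names frames, "Students", and "Scores" are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical DataFrames, keyed by the sheet name each should be written to.
frames = {
    "Students": pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [24, 27]}),
    "Scores": pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 62]}),
}

path = Path(tempfile.mkdtemp()) / "multi_sheet_output.xlsx"
with pd.ExcelWriter(path) as writer:
    for name, frame in frames.items():
        frame.to_excel(writer, sheet_name=name, index=False)

# Inspect the workbook to confirm which sheets were written.
with pd.ExcelFile(path) as xls:
    sheet_names = xls.sheet_names
print(sheet_names)
```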

3. Customize the header

Use the header parameter to customize the header names or to disable the header:

# Customize the header
df.to_excel('', header=['Col1', 'Col2', 'Col3'], index=False)

# Do not write the header row
df.to_excel('', header=False, index=False)
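To see the effect of both header options, the sketch below writes a small DataFrame each way and reads the file back. The file name and the replacement labels are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

path = Path(tempfile.mkdtemp()) / "header_demo.xlsx"  # hypothetical file name
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [24, 27]})

# Write with replacement header labels, then read them back.
df.to_excel(path, header=["Full name", "Age (years)"], index=False)
renamed = pd.read_excel(path)

# Write without a header row; header=None tells read_excel not to treat
# the first data row as column names.
df.to_excel(path, header=False, index=False)
headerless = pd.read_excel(path, header=None)
print(renamed.columns.tolist())
print(headerless)
```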

4. Data operation

After reading Excel files, you can use Pandas' powerful data operation capabilities to process data.

1. Filter data

Assume that the table read from Excel looks like this:

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'Score': [85, 62, 90, 88]
}
df = pd.DataFrame(data)

Data can be filtered based on specific conditions:

# Keep only the rows where Age is greater than 25
filtered_df = df[df['Age'] > 25]
print(filtered_df)
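Several conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. The sketch below uses the same sample data; the Score threshold of 80 is only an illustrative choice. It also shows the equivalent query() form.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [24, 27, 22, 32],
        "Score": [85, 62, 90, 88],
    }
)

# Combine conditions with & / |; each condition needs its own parentheses.
older_high = df[(df["Age"] > 25) & (df["Score"] >= 80)]

# The same filter written as a query string.
same = df.query("Age > 25 and Score >= 80")
print(older_high)
```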

2. Sort data

You can sort the data according to the value of a column:

# Sort by age in ascending order
sorted_df = df.sort_values(by='Age', ascending=True)
print(sorted_df)
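sort_values() also accepts several columns, each with its own direction. As a small sketch on the same sample data, the following sorts by Score from highest to lowest and uses Age (ascending) as a tie-breaker.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Age": [24, 27, 22, 32],
        "Score": [85, 62, 90, 88],
    }
)

# Sort by Score (descending), then by Age (ascending) to break ties.
ranked = df.sort_values(by=["Score", "Age"], ascending=[False, True])

# Renumber the rows so the index reflects the new order.
ranked = ranked.reset_index(drop=True)
print(ranked)
```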

3. Grouping and Aggregation

The data can be grouped according to a certain column and the aggregate result can be calculated:

# Group by age and calculate the average score
grouped = df.groupby('Age')['Score'].mean()
print(grouped)
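Because every age in the sample data is unique, grouping by Age yields one row per person. Grouping is more informative on a column with repeated values. The sketch below adds a hypothetical Team column (not part of the original data) and computes several aggregates at once with named aggregation.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Team": ["A", "B", "A", "B"],  # hypothetical grouping column
        "Score": [85, 62, 90, 88],
    }
)

# Named aggregation: each keyword becomes an output column.
summary = df.groupby("Team").agg(
    mean_score=("Score", "mean"),
    max_score=("Score", "max"),
    members=("Name", "count"),
)
print(summary)
```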

4. Missing value processing

Pandas provides a variety of ways to deal with missing values. For example, find and delete missing values:

# Count the missing values in each column
print(df.isnull().sum())

# Delete the rows that contain missing values
df.dropna(inplace=True)

# Alternatively, replace missing values with a fixed value
df.fillna(0, inplace=True)
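Dropping and filling can also be applied per column. The sketch below uses hypothetical data with gaps: it counts the missing values, drops rows only when Age is missing, and fills each column with its own replacement value (the column mean for Age, 0 for Score).

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [24, nan, 22],
        "Score": [85, 62, nan],
    }
)

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Drop a row only if its Age is missing.
dropped = df.dropna(subset=["Age"])

# Fill each column with its own replacement value.
filled = df.fillna({"Age": df["Age"].mean(), "Score": 0})
print(filled)
```

These calls return new DataFrames and leave df unchanged, which makes it easier to compare the result with the original than when using inplace=True.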

5. Advanced operations of Excel files

1. Merge multiple Excel files

Assuming there are multiple Excel files with the same column structure, you can use the concat() function to merge them:

import pandas as pd

# Read multiple Excel files
df1 = pd.read_excel('')
df2 = pd.read_excel('')

# Merge the data
df_combined = pd.concat([df1, df2], ignore_index=True)
print(df_combined)
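With more than a couple of files, it is often convenient to loop over a directory instead of naming each file. The following sketch assumes that all .xlsx files in one folder share the same columns; the folder, the file names, and the source_file column are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create two hypothetical files with the same column structure.
workdir = Path(tempfile.mkdtemp())
pd.DataFrame({"Item": ["pen", "ink"], "Qty": [3, 1]}).to_excel(
    workdir / "sales_jan.xlsx", index=False
)
pd.DataFrame({"Item": ["pad"], "Qty": [5]}).to_excel(
    workdir / "sales_feb.xlsx", index=False
)

# Read every .xlsx file in the folder and remember where each row came from.
frames = []
for file in sorted(workdir.glob("*.xlsx")):
    frame = pd.read_excel(file)
    frame["source_file"] = file.name
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```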

2. Use custom data types

Use the dtype parameter to specify the data type of columns as they are read:

# Read the 'Age' column as strings
df = pd.read_excel('', dtype={'Age': str})
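The effect of dtype is easiest to see by reading the same file twice. In this sketch (hypothetical file and data), the Age column comes back as integers by default and as strings when dtype is given.

```python
import tempfile
from pathlib import Path

import pandas as pd

path = Path(tempfile.mkdtemp()) / "ages.xlsx"  # hypothetical file name
pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [24, 27]}).to_excel(
    path, index=False
)

# Default read: Age is inferred as a numeric column.
default = pd.read_excel(path)

# Forced read: Age values are converted to strings.
as_text = pd.read_excel(path, dtype={"Age": str})
print(default.dtypes)
print(as_text.dtypes)
```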

3. Process merged cells

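In an Excel file, merged cells can lead to incomplete data when the file is read. Pandas keeps the value only in the first (top-left) cell of a merged range, and the remaining cells of the range are read as NaN. Note that read_excel() in current Pandas versions does not accept a merge_cells argument (merge_cells is an option of to_excel()), so the missing values have to be restored manually, for example with a forward fill:

# Read the file normally; merged cells other than the first come back as NaN
df = pd.read_excel('data_with_merged_cells.xlsx')

# Forward-fill so that every row carries the value of its merged cell
# (limit this to the affected columns if other columns contain genuine gaps)
df = df.ffill()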
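As a self-contained illustration of how merged cells typically come through, the sketch below creates a sheet with a vertically merged cell using openpyxl, reads it with Pandas, and forward-fills only the affected column. The column names and data are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd
from openpyxl import Workbook

# Build a sheet where A2:A3 is one merged cell holding "A" (hypothetical data).
path = Path(tempfile.mkdtemp()) / "data_with_merged_cells.xlsx"
wb = Workbook()
ws = wb.active
ws.append(["Group", "Name"])
ws.append(["A", "Alice"])
ws.append([None, "Bob"])
ws.append(["B", "Charlie"])
ws.merge_cells("A2:A3")
wb.save(str(path))

# Only the top-left cell of the merged range carries a value when read.
raw = pd.read_excel(path)

# Forward-fill just the column that contained merged cells.
filled = raw.copy()
filled["Group"] = filled["Group"].ffill()
print(filled)
```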

4. Conditional formatting

You can apply condition-based styling when writing to an Excel file, for example to highlight the cells that meet a certain condition. The example below highlights the maximum value in each column:

import pandas as pd

# Style function: highlight the maximum value of each column
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

# Create a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 3, 6],
    'C': [7, 8, 5]
})

# Apply the style and save the result to Excel (DataFrame.style requires jinja2)
styled = df.style.apply(highlight_max)
styled.to_excel('styled_output.xlsx', engine='openpyxl', index=False)

6. Summary

This article introduced how to use Pandas to handle .xlsx files, including reading, writing, data operations, and some advanced operations. Pandas provides powerful capabilities for processing Excel files, especially for cleaning, analyzing, and saving data, which makes complex Excel data tasks much easier to handle.

Common operations include:

Use read_excel() to read the contents of an Excel file, selecting specific worksheets or only part of the data as needed.
Use to_excel() to write DataFrame data to an Excel file, with support for multiple worksheets and custom headers.
Use Pandas' data operations to filter, sort, group, and aggregate data and to handle missing values.

By mastering these operations, you will be able to process and analyze data in Excel files more efficiently.
