Working with Excel Files in Pandas
Pandas makes it easy to work with Excel files, mainly through the functions pandas.read_excel() and DataFrame.to_excel().
Syntax

```python
pd.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True, **kwds)
```
Parameters
- File io

The Excel file to read, given as a path, URL, or file-like object.

```python
# str, bytes, ExcelFile, path object, or file-like object
# Local relative path:
pd.read_excel('data/')  # Pay attention to the directory level
pd.read_excel('')  # If the file is in the same directory as the code
# Local absolute path:
pd.read_excel('/user/wfg/data/')
# URL:
pd.read_excel('/file/data/dataset/')
```
- Sheet sheet_name

You can specify which sheet of the Excel file to read; the first one is used by default.

```python
# str, int, list, or None, default 0
pd.read_excel('', sheet_name=1)  # The second sheet
pd.read_excel('', sheet_name='Summary Table')  # By sheet name
# Take the first, the second, and the one named Sheet1; returns a dict of DataFrames
dfs = pd.read_excel('', sheet_name=[0, 1, "Sheet1"])
dfs = pd.read_excel('', sheet_name=None)  # All sheets
dfs['Sheet5']  # Access by sheet name
```
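A minimal round-trip sketch of the sheet_name behaviors, assuming pandas and openpyxl are installed; the sheet names and data are made up for illustration:

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory (hypothetical data)
buf = io.BytesIO()
with pd.ExcelWriter(buf, engine='openpyxl') as writer:
    pd.DataFrame({'a': [1, 2]}).to_excel(writer, sheet_name='Data', index=False)
    pd.DataFrame({'b': [3, 4]}).to_excel(writer, sheet_name='Summary', index=False)
buf.seek(0)

# sheet_name=None returns a dict of DataFrames keyed by sheet name
sheets = pd.read_excel(buf, sheet_name=None)
buf.seek(0)
second = pd.read_excel(buf, sheet_name=1)  # An integer selects a sheet by position
```

The dict form is handy when you do not know the sheet names in advance.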
- Header header

The header of the data defaults to the first row.

```python
pd.read_excel('', header=None)  # No header
pd.read_excel('', header=2)  # The third row is the header
pd.read_excel('', header=[0, 1])  # Two-row header, creating a MultiIndex
```
- Column names names

By default the column names come from the file's header; they can be re-specified.

```python
# array-like, default None
pd.read_excel('', names=['Name', 'age', 'score'])
pd.read_excel('', names=c_list)  # Pass in a list variable
# If the file has no header, header must be set to None
pd.read_excel('', header=None, names=None)
```
- Index column index_col

Which column(s) to use as the row index; by default none is set and a natural index (starting from 0) is used.

```python
# int, list of int, default None
pd.read_excel('', index_col=0)  # Use the first column
pd.read_excel('', index_col=[0, 1])  # First two columns, creating a MultiIndex
```
- Use columns usecols

Specifies which columns to use; the rest are not read. All columns are used by default.

```python
# int, str, list-like, or callable, default None
pd.read_excel('', usecols='A,B')  # Columns A and B
pd.read_excel('', usecols='A:H')  # Columns A through H
pd.read_excel('', usecols='A,C,E:H')  # Columns A and C, plus E through H
pd.read_excel('', usecols=[0, 1])  # The first two columns
pd.read_excel('', usecols=['Name', 'gender'])  # Columns with the specified names
# Columns whose header contains Q
pd.read_excel('', usecols=lambda x: 'Q' in x)
```
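A small runnable sketch of usecols, assuming pandas and openpyxl are installed; the column names and values are invented for illustration:

```python
import io
import pandas as pd

# A small workbook with four columns (hypothetical data)
buf = io.BytesIO()
pd.DataFrame({'Name': ['Tom', 'Ann'], 'gender': ['M', 'F'],
              'Q1': [90, 80], 'Q2': [85, 88]}).to_excel(buf, index=False, engine='openpyxl')
buf.seek(0)

# Select columns by name
named = pd.read_excel(buf, usecols=['Name', 'gender'])
buf.seek(0)
# Select columns whose header contains 'Q'
q_cols = pd.read_excel(buf, usecols=lambda x: 'Q' in x)
```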
- Return a Series squeeze

If the data has only one column, return a Series; the default is still a DataFrame.

```python
# bool, default False
pd.read_excel('', usecols='A', squeeze=True)
```
- Data type dtype

The data type, inferred automatically if not given. It does not take effect on columns handled by converters.

```python
# Type name or dict of column -> type, default None
pd.read_excel(data, dtype=np.float64)  # All data gets this type
pd.read_excel(data, dtype={'c1': np.float64, 'c2': str})  # Per-column types
```
- Processing engine engine

The third-party library used to read the Excel file. Accepted values are "xlrd", "openpyxl", or "odf"; it must be specified if the file is not a buffer or path.

```python
# str, default None
pd.read_excel('', engine='xlrd')
```

In practice, the default xlrd engine may fail to read cells containing special characters such as asterisks (*) and percent signs (%); switching to openpyxl can solve the problem.
- Column data processing converters

Converts column data using a dictionary mapping columns to functions. The key can be a column name or a column number.

```python
# dict, default None
def foo(p):
    return p + 's'

# Apply a named function to 'x' and a lambda to 'y'
pd.read_excel('', converters={'x': foo, 'y': lambda x: x * 3})
# Use column indices
pd.read_excel('', converters={0: foo, 1: lambda x: x * 3})
```
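A runnable converters round trip, assuming pandas and openpyxl are installed; the columns and functions are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical two-column workbook
buf = io.BytesIO()
pd.DataFrame({'x': ['apple', 'pear'], 'y': [1, 2]}).to_excel(buf, index=False, engine='openpyxl')
buf.seek(0)

# Apply a function to column 'x' and a lambda to column 'y' while reading
df = pd.read_excel(buf, converters={'x': lambda p: p + 's', 'y': lambda v: v * 3})
```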
- Boolean values true_values and false_values

Convert the specified text to True or False; multiple values can be given as a list.

```python
# list, default None
pd.read_excel('', true_values=['Yes'], false_values=['No'])
```
- Skip rows skiprows

```python
# list-like, int, or callable, optional
pd.read_excel(data, skiprows=2)  # Skip the first two rows
pd.read_excel(data, skiprows=range(2))  # Skip the first two rows
pd.read_excel(data, skiprows=[24, 234, 141])  # Skip the specified rows
pd.read_excel(data, skiprows=[2, 6, 11])  # Skip the specified rows
pd.read_excel(data, skiprows=lambda x: x % 2 != 0)  # Skip every other row
pd.read_excel(data, skipfooter=2)  # Skip the last two rows
```
- Number of rows to read nrows

The number of rows to read, counted from the start of the file. Often used to take a sample of a large dataset while writing the code.

```python
# int, default None
pd.read_excel(data, nrows=1000)
```
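A quick sketch of sampling a large file with nrows, assuming pandas and openpyxl are installed; the data is invented for illustration:

```python
import io
import pandas as pd

# A hypothetical ten-row workbook standing in for a large file
buf = io.BytesIO()
pd.DataFrame({'n': range(10)}).to_excel(buf, index=False, engine='openpyxl')
buf.seek(0)

# Read only the first three data rows
head = pd.read_excel(buf, nrows=3)
```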
- Null value replacement na_values

A set of values to treat as NA/NaN. Pass a dict to set null values for specific columns.

```python
# scalar, str, list-like, or dict, default None
pd.read_excel(data, na_values=[5])  # 5 and 5.0 will be treated as NaN
pd.read_excel(data, na_values='?')  # ? will be treated as NaN
pd.read_excel(data, keep_default_na=False, na_values=[""])  # Only empty strings are NaN
pd.read_excel(data, keep_default_na=False, na_values=["NA", "0"])  # Only "NA" and "0" are NaN
pd.read_excel(data, na_values=["Nope"])  # "Nope" will be treated as NaN
pd.read_excel(data, na_values='abc')  # The string "abc" will be treated as NaN
# The specified values in the specified columns will be treated as NaN
pd.read_excel(data, na_values={'c': 3, 1: [2, 5]})
```
- Keep the default null values keep_default_na

Whether to include the default NaN values (automatic recognition) when parsing the data. If na_values is specified and keep_default_na=False, the defaults are overridden; otherwise na_values is added to them.

The relationship with na_values is:

| keep_default_na | na_values | Logic |
|---|---|---|
| True | specified | na_values is added to the defaults |
| True | not specified | Only the defaults are recognized |
| False | specified | Only na_values is used |
| False | not specified | No strings are treated as NaN |

Note: if na_filter is False (the default is True), the keep_default_na and na_values parameters are ignored.

```python
# boolean, default True
pd.read_excel(data, keep_default_na=False)  # Default null values are not recognized
```
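The rows of the table above can be checked with a small round trip, assuming pandas and openpyxl are installed; the data is invented for illustration:

```python
import io
import pandas as pd

# Hypothetical column containing the strings 'NA' and '0'
buf = io.BytesIO()
pd.DataFrame({'c': ['x', 'NA', '0']}).to_excel(buf, index=False, engine='openpyxl')
buf.seek(0)

# Defaults: 'NA' is recognized as NaN automatically, '0' is not
d1 = pd.read_excel(buf)
buf.seek(0)
# keep_default_na=False with explicit na_values: only '0' becomes NaN
d2 = pd.read_excel(buf, keep_default_na=False, na_values=['0'])
```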
- Missing value check na_filter

Whether to check for missing values (empty strings or empty values). For large files whose dataset contains no empty values, setting na_filter=False can improve reading speed.

```python
# boolean, default True
pd.read_excel(data, na_filter=False)  # Don't check
```
- Parsing information verbose

Whether to print output from the various parsers, such as the number of missing values in non-numeric columns.

```python
# boolean, default False
pd.read_excel(data, verbose=True)  # Shows parsing information such as:
# Tokenization took: 0.02 ms
# Type conversion took: 0.36 ms
# Parser memory cleanup took: 0.01 ms
```
- Date and time parse_dates

This parameter controls parsing of dates and times.

```python
# boolean, list of ints or names, list of lists, or dict, default False
pd.read_excel(data, parse_dates=True)  # Automatically parse dates and times
pd.read_excel(data, parse_dates=['years'])  # Parse the specified fields as dates
# Merge columns 1 and 4 and parse them into a column named time
pd.read_excel(data, parse_dates={'time': [1, 4]})
```
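A runnable parse_dates sketch, assuming pandas and openpyxl are installed; the column name 'years' and the dates are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical data with dates stored as text
buf = io.BytesIO()
pd.DataFrame({'years': ['2023-01-05', '2023-02-10'], 'v': [1, 2]}).to_excel(buf, index=False, engine='openpyxl')
buf.seek(0)

# Parse the 'years' column into datetime64 while reading
df = pd.read_excel(buf, parse_dates=['years'])
```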
- Date parser date_parser

A function used to parse dates; by default pandas' own date converter is used. Pandas tries to apply it in three different ways, moving on to the next if one raises an error:

- Pass one or more arrays (as specified by parse_dates) as arguments;
- Concatenate the string values of multiple columns into a single array and pass it as the argument;
- Call date_parser once per row, with one or more strings (as specified by parse_dates) as arguments.

```python
# function, default None
date_parser = lambda x: pd.to_datetime(x, utc=True, format='%d%b%Y')
pd.read_excel(data, parse_dates=['years'], date_parser=date_parser)
```
- Thousands separator thousands

The thousands separator character.

```python
# str, default None
pd.read_excel(data, thousands=',')  # Comma-separated
```
- Comment marker comment

Marks content that should not be parsed. If the character is found at the beginning of a line, the whole line is completely ignored. This parameter must be a single character. Fully commented lines are ignored by header but counted by skiprows. For example, with comment='#', parsing '#empty\na,b,c\n1,2,3' with header=0 treats 'a,b,c' as the header.

```python
# str, default None
s = '# notes\na,b,c\n# more notes\n1,2,3'  # For illustration only
pd.read_excel(data, comment='#', skiprows=1)
```
- Skip the tail skipfooter

The number of rows to ignore at the end of the file.

```python
# int, default 0
pd.read_excel(filename, skipfooter=1)  # The last row is not loaded
```
- Convert to integers convert_float

Excel stores all numbers internally as floats, and read_excel converts integral floats back to integers (1.0 -> 1) by default; setting this to False keeps all numeric data as floats.

```python
# bool, default True
pd.read_excel('', convert_float=False)
```
- Handle duplicate column names mangle_dupe_cols

When there are duplicate column names, the parsed names become 'X', 'X.1', ..., 'X.N' instead of 'X', ..., 'X'.
According to the documentation, if this parameter is False, earlier columns are overwritten by later duplicates; in practice, passing False raises a ValueError.

```python
# bool, default True
data = 'a,b,a\n0,1,2\n3,4,5'  # For illustration only
pd.read_excel(data, mangle_dupe_cols=True)  # The header becomes a, b, a.1
# mangle_dupe_cols=False raises a ValueError
```
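The default de-duplication can be seen with a real workbook, assuming pandas and openpyxl are installed; the header and values are invented for illustration:

```python
import io
import pandas as pd
from openpyxl import Workbook

# Build a sheet whose header repeats the name 'a' (hypothetical data)
wb = Workbook()
ws = wb.active
ws.append(['a', 'b', 'a'])
ws.append([0, 1, 2])
buf = io.BytesIO()
wb.save(buf)
buf.seek(0)

# By default, duplicate names are de-duplicated to a, b, a.1
df = pd.read_excel(buf)
```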
- Storage options storage_options

Extra options for a particular storage connection (for example host, port, username, and password for a URL), passed as a dict.
- Other parameters **kwds

Other optional keyword arguments, passed to TextFileReader.
Returns: generally a DataFrame; depending on the parameters, other types (such as a dict of DataFrames or a Series) may be returned.
Examples: here are some key steps and examples for working with Excel files using Pandas:
- Read Excel files
To read data in an Excel file, you can use the pandas.read_excel() function. This function can read data from a specified worksheet and convert it into a Pandas DataFrame object.
```python
import pandas as pd

# Read a specific worksheet in an Excel file
df = pd.read_excel('', sheet_name='Sheet1')
# To read all sheets, set sheet_name to None; this returns a dict of all sheets' data
sheets = pd.read_excel('', sheet_name=None)
```
- Processing read data
Once the data is read into the DataFrame, you can use the various functions and methods provided by Pandas to process this data. For example, you can filter, sort, group, aggregate data, and other operations.
```python
# Suppose we have a DataFrame called 'df'
# Filter rows whose column values meet a specific condition
filtered_df = df[df['column_name'] > some_value]
# Sort the data
sorted_df = df.sort_values(by='column_name')
```
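A self-contained version of the filter and sort steps, with a hypothetical DataFrame standing in for data read from Excel:

```python
import pandas as pd

# Hypothetical data, e.g. just read from an Excel file
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [75, 92, 60]})

# Keep rows where the score exceeds 70
filtered_df = df[df['score'] > 70]
# Sort by score, highest first
sorted_df = df.sort_values(by='score', ascending=False)
```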
- Write data back to Excel file
After processing the data, you may want to save the results back into an Excel file. At this time, you can use the DataFrame.to_excel() method.
```python
# Write the DataFrame to a new Excel file
df.to_excel('', sheet_name='Sheet1', index=False)
# To write multiple DataFrames to different worksheets of the same file, use ExcelWriter
with pd.ExcelWriter('multiple_sheets.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)
```
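A runnable multi-sheet round trip, assuming pandas and openpyxl are installed and writing to an in-memory buffer instead of a real file; the data is invented for illustration:

```python
import io
import pandas as pd

df1 = pd.DataFrame({'a': [1]})
df2 = pd.DataFrame({'b': [2]})

# Write two DataFrames to different sheets of one in-memory workbook
buf = io.BytesIO()
with pd.ExcelWriter(buf, engine='openpyxl') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)
buf.seek(0)

# Read both sheets back as a dict
back = pd.read_excel(buf, sheet_name=None)
```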
Notes:
- File path: Make sure that the file path you provide is correct and that the program has sufficient permissions to read and write files.
- Worksheet Name: When reading or writing to a worksheet, make sure that the specified worksheet name exists or that you have correctly handled the situation where the worksheet does not exist.
- Data type: pay attention to data type compatibility when reading and writing. For example, if the dates in an Excel file are stored as text, you may need to convert the type after reading.
- Performance: For large datasets, reading and writing Excel files may be slower and may be memory-limited. In this case, you might consider batching the data or using a format that is more suitable for large data sets (such as CSV).
- Dependencies: Pandas uses the openpyxl or xlrd library to read and write Excel files (xlrd no longer supports the .xlsx format since version 2.0.0, so openpyxl is recommended). Make sure you have these libraries installed.
Summary
The above is based on personal experience; I hope it can serve as a useful reference.