Working with Excel Files in Pandas
Pandas makes it easy to work with Excel files, mainly through the functions pandas.read_excel() and DataFrame.to_excel().
Syntax

```python
pd.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True, **kwds)
```
Parameters
- File io

The Excel file to read, given as a path, URL, or file-like object.

```python
# str, bytes, ExcelFile, path object, or file-like object
# Local relative path:
pd.read_excel('data/')  # Pay attention to the directory level
pd.read_excel('')  # If the file is in the same directory as the code
# Local absolute path:
pd.read_excel('/user/wfg/data/')
# URL:
pd.read_excel('/file/data/dataset/')
```
- Sheet sheet_name

You can specify which sheet of the Excel file to read; the first one is used by default.

```python
# str, int, list, or None, default 0
pd.read_excel('', sheet_name=1)  # The second sheet
pd.read_excel('', sheet_name='Summary Table')  # By sheet name
# Take the first, the second, and the one named Sheet1; returns a dict of DataFrames
dfs = pd.read_excel('', sheet_name=[0, 1, "Sheet1"])
dfs = pd.read_excel('', sheet_name=None)  # All sheets
dfs['Sheet5']  # Access by sheet name
```
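A minimal round-trip sketch of the sheet_name behaviors, assuming pandas and openpyxl are installed; the sheet names and data are made up for illustration:

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory (hypothetical data)
buf = io.BytesIO()
with pd.ExcelWriter(buf, engine='openpyxl') as writer:
    pd.DataFrame({'a': [1, 2]}).to_excel(writer, sheet_name='Data', index=False)
    pd.DataFrame({'b': [3, 4]}).to_excel(writer, sheet_name='Summary', index=False)
buf.seek(0)

# sheet_name=None returns a dict of DataFrames keyed by sheet name
sheets = pd.read_excel(buf, sheet_name=None)
buf.seek(0)
second = pd.read_excel(buf, sheet_name=1)  # An integer selects a sheet by position
```

The dict form is handy when you do not know the sheet names in advance.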
- Header header

The header of the data defaults to the first row.

```python
pd.read_excel('', header=None)  # No header
pd.read_excel('', header=2)  # The third row is the header
pd.read_excel('', header=[0, 1])  # Two-row header, creating a MultiIndex
```
- Column names names

By default the column names come from the file's header; they can be re-specified.

```python
# array-like, default None
pd.read_excel('', names=['Name', 'age', 'score'])
pd.read_excel('', names=c_list)  # Pass in a list variable
# If the file has no header, header must be set to None
pd.read_excel('', header=None, names=None)
```
- Index column index_col

Which column(s) to use as the row index; by default none is set and a natural index (starting from 0) is used.

```python
# int, list of int, default None
pd.read_excel('', index_col=0)  # Use the first column
pd.read_excel('', index_col=[0, 1])  # First two columns, creating a MultiIndex
```
- Use columns usecols

Specifies which columns to use; the rest are not read. All columns are used by default.

```python
# int, str, list-like, or callable, default None
pd.read_excel('', usecols='A,B')  # Columns A and B
pd.read_excel('', usecols='A:H')  # Columns A through H
pd.read_excel('', usecols='A,C,E:H')  # Columns A and C, plus E through H
pd.read_excel('', usecols=[0, 1])  # The first two columns
pd.read_excel('', usecols=['Name', 'gender'])  # Columns with the specified names
# Columns whose header contains Q
pd.read_excel('', usecols=lambda x: 'Q' in x)
```
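A small runnable sketch of usecols, assuming pandas and openpyxl are installed; the column names and values are invented for illustration:

```python
import io
import pandas as pd

# A small workbook with four columns (hypothetical data)
buf = io.BytesIO()
pd.DataFrame({'Name': ['Tom', 'Ann'], 'gender': ['M', 'F'],
              'Q1': [90, 80], 'Q2': [85, 88]}).to_excel(buf, index=False, engine='openpyxl')
buf.seek(0)

# Select columns by name
named = pd.read_excel(buf, usecols=['Name', 'gender'])
buf.seek(0)
# Select columns whose header contains 'Q'
q_cols = pd.read_excel(buf, usecols=lambda x: 'Q' in x)
```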
- Return a Series squeeze

If the data has only one column, return a Series; the default is still a DataFrame.

```python
# bool, default False
pd.read_excel('', usecols='A', squeeze=True)
```
- Data type dtype

The data type, inferred automatically if not given. It does not take effect on columns handled by converters.

```python
# Type name or dict of column -> type, default None
pd.read_excel(data, dtype=np.float64)  # All data gets this type
pd.read_excel(data, dtype={'c1': np.float64, 'c2': str})  # Per-column types
```
- Processing engine engine

The third-party library used to read the Excel file. Accepted values are "xlrd", "openpyxl", or "odf"; it must be specified if the file is not a buffer or path.

```python
# str, default None
pd.read_excel('', engine='xlrd')
```

In practice, the default xlrd engine may fail to read cells containing special characters such as asterisks (*) and percent signs (%); switching to openpyxl can solve the problem.
- Column data processing converters

Converts column data using a dictionary mapping columns to functions. The key can be a column name or a column number.

```python
# dict, default None
def foo(p):
    return p + 's'

# Apply a named function to 'x' and a lambda to 'y'
pd.read_excel('', converters={'x': foo, 'y': lambda x: x * 3})
# Use column indices
pd.read_excel('', converters={0: foo, 1: lambda x: x * 3})
```
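A runnable converters round trip, assuming pandas and openpyxl are installed; the columns and functions are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical two-column workbook
buf = io.BytesIO()
pd.DataFrame({'x': ['apple', 'pear'], 'y': [1, 2]}).to_excel(buf, index=False, engine='openpyxl')
buf.seek(0)

# Apply a function to column 'x' and a lambda to column 'y' while reading
df = pd.read_excel(buf, converters={'x': lambda p: p + 's', 'y': lambda v: v * 3})
```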
- Boolean values true_values and false_values

Convert the specified text to True or False; multiple values can be given as a list.

```python
# list, default None
pd.read_excel('', true_values=['Yes'], false_values=['No'])
```
- Skip rows skiprows

```python
# list-like, int, or callable, optional
pd.read_excel(data, skiprows=2)  # Skip the first two rows
pd.read_excel(data, skiprows=range(2))  # Skip the first two rows
pd.read_excel(data, skiprows=[24, 234, 141])  # Skip the specified rows
pd.read_excel(data, skiprows=[2, 6, 11])  # Skip the specified rows
pd.read_excel(data, skiprows=lambda x: x % 2 != 0)  # Skip every other row
pd.read_excel(data, skipfooter=2)  # Skip the last two rows
```
- Number of rows to read nrows

The number of rows to read, counted from the start of the file. Often used to take a sample of a large dataset while writing the code.

```python
# int, default None
pd.read_excel(data, nrows=1000)
```
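A quick sketch of sampling a large file with nrows, assuming pandas and openpyxl are installed; the data is invented for illustration:

```python
import io
import pandas as pd

# A hypothetical ten-row workbook standing in for a large file
buf = io.BytesIO()
pd.DataFrame({'n': range(10)}).to_excel(buf, index=False, engine='openpyxl')
buf.seek(0)

# Read only the first three data rows
head = pd.read_excel(buf, nrows=3)
```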
- Null value replacement na_values

A set of values to treat as NA/NaN. Pass a dict to set null values for specific columns.

```python
# scalar, str, list-like, or dict, default None
pd.read_excel(data, na_values=[5])  # 5 and 5.0 will be treated as NaN
pd.read_excel(data, na_values='?')  # ? will be treated as NaN
pd.read_excel(data, keep_default_na=False, na_values=[""])  # Only empty strings are NaN
pd.read_excel(data, keep_default_na=False, na_values=["NA", "0"])  # Only "NA" and "0" are NaN
pd.read_excel(data, na_values=["Nope"])  # "Nope" will be treated as NaN
pd.read_excel(data, na_values='abc')  # The string "abc" will be treated as NaN
# The specified values in the specified columns will be treated as NaN
pd.read_excel(data, na_values={'c': 3, 1: [2, 5]})
```
- Keep the default null values keep_default_na

Whether to include the default NaN values (automatic recognition) when parsing the data. If na_values is specified and keep_default_na=False, the defaults are overridden; otherwise na_values is added to them.

The relationship with na_values is:

| keep_default_na | na_values | Logic |
|---|---|---|
| True | specified | na_values is added to the defaults |
| True | not specified | Only the defaults are recognized |
| False | specified | Only na_values is used |
| False | not specified | No strings are treated as NaN |

Note: if na_filter is False (the default is True), the keep_default_na and na_values parameters are ignored.

```python
# boolean, default True
pd.read_excel(data, keep_default_na=False)  # Default null values are not recognized
```
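The rows of the table above can be checked with a small round trip, assuming pandas and openpyxl are installed; the data is invented for illustration:

```python
import io
import pandas as pd

# Hypothetical column containing the strings 'NA' and '0'
buf = io.BytesIO()
pd.DataFrame({'c': ['x', 'NA', '0']}).to_excel(buf, index=False, engine='openpyxl')
buf.seek(0)

# Defaults: 'NA' is recognized as NaN automatically, '0' is not
d1 = pd.read_excel(buf)
buf.seek(0)
# keep_default_na=False with explicit na_values: only '0' becomes NaN
d2 = pd.read_excel(buf, keep_default_na=False, na_values=['0'])
```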
- Missing value check na_filter

Whether to check for missing values (empty strings or empty values). For large files whose dataset contains no empty values, setting na_filter=False can improve reading speed.

```python
# boolean, default True
pd.read_excel(data, na_filter=False)  # Don't check
```
- Parsing information verbose

Whether to print output from the various parsers, such as the number of missing values in non-numeric columns.

```python
# boolean, default False
pd.read_excel(data, verbose=True)  # Shows parsing information such as:
# Tokenization took: 0.02 ms
# Type conversion took: 0.36 ms
# Parser memory cleanup took: 0.01 ms
```
- Date and time parse_dates

This parameter controls parsing of dates and times.

```python
# boolean, list of ints or names, list of lists, or dict, default False
pd.read_excel(data, parse_dates=True)  # Automatically parse dates and times
pd.read_excel(data, parse_dates=['years'])  # Parse the specified fields as dates
# Merge columns 1 and 4 and parse them into a column named time
pd.read_excel(data, parse_dates={'time': [1, 4]})
```
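A runnable parse_dates sketch, assuming pandas and openpyxl are installed; the column name 'years' and the dates are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical data with dates stored as text
buf = io.BytesIO()
pd.DataFrame({'years': ['2023-01-05', '2023-02-10'], 'v': [1, 2]}).to_excel(buf, index=False, engine='openpyxl')
buf.seek(0)

# Parse the 'years' column into datetime64 while reading
df = pd.read_excel(buf, parse_dates=['years'])
```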
- Date parser date_parser

A function used to parse dates; by default pandas' own date converter is used. Pandas tries to apply it in three different ways, moving on to the next if one raises an error:

- Pass one or more arrays (as specified by parse_dates) as arguments;
- Concatenate the string values of multiple columns into a single array and pass it as the argument;
- Call date_parser once per row, with one or more strings (as specified by parse_dates) as arguments.

```python
# function, default None
date_parser = lambda x: pd.to_datetime(x, utc=True, format='%d%b%Y')
pd.read_excel(data, parse_dates=['years'], date_parser=date_parser)
```
- Thousands separator thousands

The thousands separator character.

```python
# str, default None
pd.read_excel(data, thousands=',')  # Comma-separated
```
- Comment marker comment

Marks content that should not be parsed. If the character is found at the beginning of a line, the whole line is completely ignored. This parameter must be a single character. Fully commented lines are ignored by header but counted by skiprows. For example, with comment='#', parsing '#empty\na,b,c\n1,2,3' with header=0 treats 'a,b,c' as the header.

```python
# str, default None
s = '# notes\na,b,c\n# more notes\n1,2,3'  # For illustration only
pd.read_excel(data, comment='#', skiprows=1)
```
- Skip the tail skipfooter

The number of rows to ignore at the end of the file.

```python
# int, default 0
pd.read_excel(filename, skipfooter=1)  # The last row is not loaded
```
- Convert to integers convert_float

Excel stores all numbers internally as floats, and read_excel converts integral floats back to integers (1.0 -> 1) by default; setting this to False keeps all numeric data as floats.

```python
# bool, default True
pd.read_excel('', convert_float=False)
```
- Handle duplicate column names mangle_dupe_cols

When there are duplicate column names, the parsed names become 'X', 'X.1', ..., 'X.N' instead of 'X', ..., 'X'.
According to the documentation, if this parameter is False, earlier columns are overwritten by later duplicates; in practice, passing False raises a ValueError.

```python
# bool, default True
data = 'a,b,a\n0,1,2\n3,4,5'  # For illustration only
pd.read_excel(data, mangle_dupe_cols=True)  # The header becomes a, b, a.1
# mangle_dupe_cols=False raises a ValueError
```
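The default de-duplication can be seen with a real workbook, assuming pandas and openpyxl are installed; the header and values are invented for illustration:

```python
import io
import pandas as pd
from openpyxl import Workbook

# Build a sheet whose header repeats the name 'a' (hypothetical data)
wb = Workbook()
ws = wb.active
ws.append(['a', 'b', 'a'])
ws.append([0, 1, 2])
buf = io.BytesIO()
wb.save(buf)
buf.seek(0)

# By default, duplicate names are de-duplicated to a, b, a.1
df = pd.read_excel(buf)
```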
- Storage options storage_options

Extra options for a particular storage connection (for example host, port, username, and password for a URL), passed as a dict.
- Other parameters **kwds

Other optional keyword arguments, passed to TextFileReader.
Returns: generally a DataFrame; depending on the parameters, other types (such as a dict of DataFrames or a Series) may be returned.
Examples: here are some key steps and examples for working with Excel files using Pandas:
- Read Excel files
To read data in an Excel file, you can use the pandas.read_excel() function. This function can read data from a specified worksheet and convert it into a Pandas DataFrame object.
```python
import pandas as pd

# Read a specific worksheet in an Excel file
df = pd.read_excel('', sheet_name='Sheet1')
# To read all sheets, set sheet_name to None; this returns a dict of all sheets' data
sheets = pd.read_excel('', sheet_name=None)
```
- Processing read data
Once the data is read into the DataFrame, you can use the various functions and methods provided by Pandas to process this data. For example, you can filter, sort, group, aggregate data, and other operations.
```python
# Suppose we have a DataFrame called 'df'
# Filter rows whose column values meet a specific condition
filtered_df = df[df['column_name'] > some_value]
# Sort the data
sorted_df = df.sort_values(by='column_name')
```
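A self-contained version of the filter and sort steps, with a hypothetical DataFrame standing in for data read from Excel:

```python
import pandas as pd

# Hypothetical data, e.g. just read from an Excel file
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [75, 92, 60]})

# Keep rows where the score exceeds 70
filtered_df = df[df['score'] > 70]
# Sort by score, highest first
sorted_df = df.sort_values(by='score', ascending=False)
```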
- Write data back to Excel file
After processing the data, you may want to save the results back into an Excel file. At this time, you can use the DataFrame.to_excel() method.
```python
# Write the DataFrame to a new Excel file
df.to_excel('', sheet_name='Sheet1', index=False)
# To write multiple DataFrames to different worksheets of the same file, use ExcelWriter
with pd.ExcelWriter('multiple_sheets.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)
```
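A runnable multi-sheet round trip, assuming pandas and openpyxl are installed and writing to an in-memory buffer instead of a real file; the data is invented for illustration:

```python
import io
import pandas as pd

df1 = pd.DataFrame({'a': [1]})
df2 = pd.DataFrame({'b': [2]})

# Write two DataFrames to different sheets of one in-memory workbook
buf = io.BytesIO()
with pd.ExcelWriter(buf, engine='openpyxl') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)
buf.seek(0)

# Read both sheets back as a dict
back = pd.read_excel(buf, sheet_name=None)
```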
Notes:
- File path: Make sure that the file path you provide is correct and that the program has sufficient permissions to read and write files.
- Worksheet Name: When reading or writing to a worksheet, make sure that the specified worksheet name exists or that you have correctly handled the situation where the worksheet does not exist.
- Data type: pay attention to data type compatibility when reading and writing. For example, if the dates in an Excel file are stored as text, you may need to convert the type after reading.
- Performance: For large datasets, reading and writing Excel files may be slower and may be memory-limited. In this case, you might consider batching the data or using a format that is more suitable for large data sets (such as CSV).
- Dependencies: Pandas uses the openpyxl or xlrd library to read and write Excel files (xlrd no longer supports the .xlsx format since version 2.0.0, so openpyxl is recommended). Make sure you have these libraries installed.
Summary
The above is based on personal experience; I hope it can serve as a useful reference.