Several common ways to read Excel in Python

Preface

Excel files are a very common data storage format in the daily work of data analysis and processing. Python, as a powerful programming language, provides a variety of libraries to help users easily read Excel files. This article will introduce several common ways to read Excel files using Python.

1. Use pandas to read Excel files

pandas is one of the most popular libraries in Python data analysis, and it provides the read_excel function to read Excel files directly.

import pandas as pd 
# Read Excel filedf = pd.read_excel('') 
# Show the first few lines of data print(())

This method is simple and fast, and is very suitable for data analysis. pandas supports reading .xls and .xlsx format files, and can specify specific worksheets and cell ranges through parameters.

2. Read Excel file using openpyxl

openpyxl is another powerful library dedicated to reading and writing .xlsx files. Unlike pandas, openpyxl allows users to manipulate Excel files in a more granular way, such as reading or editing cell formats.

from openpyxl import load_workbook
# Load Excel filewb = load_workbook('')
# Select a worksheetsheet = 
# Read data from a specific cellcell_value = sheet['A1'].value
print(cell_value)

openpyxl is an ideal choice for handling complex Excel files, such as reading or setting the format of cells, formulas, etc.

3. Read Excel files using xlrd and xlwt

For older versions of .xls files, the xlrd and xlwt libraries provide read and write functions. Although they do not support .xlsx format files, they are very useful for situations where legacy data needs to be handled.

import xlrd
# Open the fileworkbook = xlrd.open_workbook('')
# Select a worksheetsheet = workbook.sheet_by_index(0)
# Read contents of a specific cellcell_value = sheet.cell_value(0, 0)
print(cell_value)

Although the latest version of xlrd has restricted support for .xlsx files, it is still an effective tool for handling .xls files.

4. Use pyexcel as a unified interface

pyexcel provides a simple API to read, process and write Excel files in many different formats. If you are looking for a unified interface to handle different types of Excel files, pyexcel may be a good choice.

import pyexcel as p
# Read Excel filedata = p.get_records(file_name='')
for record in data: 
print(record)

pyexcel supports .xls, .xlsx, .xlsm and other formats, making it a convenient tool for handling different types of Excel files.

5. Read Excel using Tablib

TablibIt is one of the most popular libraries in Python for importing and exporting data in various formats. It was originally developed by the creators of the popular requests library, so it is characterized by an equal focus on developer experience and ergonomics.

Install:

$ pip install tablib

Code:

**import** tablib

def iter\_excel\_tablib(file: IO\[bytes\]) -> Iterator\[dict\[str, object\]\]:  
yield from ().load(file).dict

With just one line of code, the library can do all the heavy lifting.

Before continuing with the benchmark, let's take a look at the results of the first line:

with open(‘’, ‘rb’) as f:
… rows = iter_excel_tablib(f)
… row = next(rows)
… print(row)
…
OrderedDict([(‘number’, 1),
(‘decimal’, 1.1),
(‘date’, (2000, 1, 1, 0, 0)),
(‘boolean’, True),
(‘text’, ‘CONTROL ROW’)])

OrderedDictis a subclass of Python, dict has some extra methods to rearrange dictionary order. It is built-inCollections moduledefined in , when you request a dictionary, it is what tablib returns. Since OrderedDict is a subclass of dict and it is defined in the built-in module, we don't mind and think it's enough for our purposes.

6. Read Excel using Openpyxl

OpenpyxlIt is a library that reads and writes Excel files in Python. Unlike Tablib, Openpyxl is only dedicated to Excel and does not support any other file types.

In fact, both tablib and pandas use Openpyxl at the bottom when reading xlsx files. Perhaps this specialization will lead to better performance.

Install:

$ pip install openpyxl

Code:

**import** openpyxl

def iter\_excel\_openpyxl(file: IO\[bytes\]) -> Iterator\[dict\[str, object\]\]:  
workbook = \_workbook(file)  
rows =   
headers = \[str() **for** cell in next(rows)\]  
**for** row in rows:  
yield dict(zip(headers, ( **for** cell in row)))

This time we are going to write more code, let's break it down:

Load the workbook from the opened file: The function load_workbook supports both file paths and readable data streams. In this example, we operate on the opened file.
Get active worksheets: An Excel file can contain multiple worksheets, and we can choose which worksheet to read. In this case, we only have one worksheet.
Build a header list: The first line of the Excel file contains the header. To use these headers as keys to a dictionary, we need to read the first line and generate a list of headers.
Returns the result: The cell type used by openpyxl contains values and some metadata. This is useful for other uses, but we only need the value. To access the value of a cell, we use .

Output:

with open(‘’, ‘rb’) as f:
… rows = iter_excel_openpyxl(f)
… row = next(rows)
… print(row)
{‘boolean’: True,
‘date’: (2000, 1, 1, 0, 0),
‘decimal’: 1.1,
‘number’: 1,
‘text’: ‘CONTROL ROW’}

7. Read Excel using LibreOffice

We have now exhausted the traditional and obvious way to import Excel into Python. We used the top-level specified library and got good results. Now is the time to think outside the box.

LibreOfficeIt is a free open source alternative to other office suites. LibreOffice can handle xls and xlsx files, and also happens to include headless mode with some useful command-line options:

One of the LibreOffice command line options is to convert files between different formats. For example, we can use libreoffice to convert the xlsx file to a csv file:

$ libreoffice --headless --convert-to csv --outdir .   
convert  ->  using filter: Text - txt - csv (StarCalc)

$ head   
number,decimal,date,**boolean**,text  
1,1.1,01/01/2000,TRUE,CONTROL ROW  
2,1.2,01/02/2000,FALSE,RANDOM TEXT:0.716658989024692  
3,1.3,01/03/2000,TRUE,RANDOM TEXT:0.966075283958641

Let's splice it in Python. We first convert the xlsx file to CSV, and then import the CSV into Python:

**import** subprocess, tempfile, csv

def iter\_excel\_libreoffice(file: IO\[bytes\]) -> Iterator\[dict\[str, object\]\]:  
with (prefix=‘excelbenchmark') as tempdir:  
(\[  
‘libreoffice', ‘–headless', ‘–convert-to', ‘csv',  
‘–outdir', tempdir, ,  
\])  
with open(f'{tempdir}/{(“.”)\[0\]}.csv', ‘r') as f:  
rows = (f)  
headers = list(map(str, next(rows)))  
**for** row in rows:  
yield dict(zip(headers, row))

Let's analyze it:

Create a temporary directory for storing CSV files: Create a temporary directory using the built-in tempfile module, which will be cleaned automatically when completed. Ideally, we want to convert a specific file into a class file object in memory, but the libreoffice command line does not provide a way to convert to a specific file, it can only be converted to a directory.
Convert files to CSV using the libreoffice command line: Execute operating system commands using built-in child process modules.
Read the generated CSV: Open the newly created CSV file, parse it using the built-in csv module and generate dicts.

8. Read Excel using DuckDB

If we have already started using external tools, why not give newcomers a chance to compete?
DuckDBIt is a "in-process SQL OLAP database management system". This description doesn't immediately explain why DuckDB is useful in this case, but it does. DuckDB is very good at converting between moving data and formats.

Install:

$ pip install duckdb

Code:

**import** duckdb

def iter\_excel\_duckdb(file: IO\[bytes\]) -> Iterator\[dict\[str, object\]\]:  
\_extension(‘spatial')  
\_extension(‘spatial')  
rows = (f"“”  
SELECT \* FROM st\_read(  
‘{}',  
open\_options=\[‘HEADERS=FORCE', ‘FIELD\_TYPES=AUTO'\]) “”")  
**while** row := ():  
yield dict(zip(, row))

Let's analyze it:

Install and load space extensions: To import data from Excel using duckdb, you need to install space extensions. This is a little weird because the space extension is used for geo-operation, but that's exactly what it wants.
Query file: When executing queries directly using duckdb global variables, an in-memory database will be used by default, similar to sqlite using the :memory: option. To actually import an Excel file, we use the st_read function and take the file path as the first parameter. In the function options, we set the first line as the title and activate the option to automatically detect the type (which is also the default option).
The build result traverses each row and builds the dict using the title and value list of each row.

9. Read Excel using Calamine

In recent years, every performance problem in Python seems to end up being solved in another language. As a Python developer, I think this is a real blessing. This means I can continue to use the language I am used to and enjoy the performance benefits of all other languages!

Calamine is a pure Rust library for reading Excel and OpenDocument spreadsheet files. To installpython-calamineFor Python binding to calamine, please execute the following command:

$ pip install python-calamine

Code:

**import** python\_calamine

def iter\_excel\_calamine(file: IO\[bytes\]) -> Iterator\[dict\[str, object\]\]:  
workbook = python\_calamine.\_filelike(file) # type: ignore\[arg-type\]  
rows = iter(\_sheet\_by\_index(0).to\_python())  
headers = list(map(str, next(rows)))  
**for** row in rows:  
yield dict(zip(headers, row))

Performance Summary

Summarize the method of reading Excel files in Python and the time it takes to read Excel:

Pandas 32.98
Tablib 28.52
Openpyxl 35.62
Openpyxl (readonly) 24.79
LibreOffice 15.27
DuckDB (sql) 11.36
DuckDB (execute) 5.73
Calamine (python-calamine) 3.58

The above introduces several methods to read Excel files using Python, each method has its applicable scenarios. For data analysts and developers, choosing the right library can greatly improve productivity.
So which one should you use? It depends...In choosing a library that processes Excel files in Python, there are a few other considerations besides speed:

Write capability: We benchmarked the method of reading Excel, but sometimes we also need to generate Excel files. Some libraries we benchmarked do not support writing. For example, Calamine cannot be written to Excel files, but can only be read.
Other formats: The system may need to load and generate files in other formats other than Excel. Some libraries, such as pandas and Tablib, support multiple additional formats, while calamine and openpyxl only support Excel.

This is the end of this article about the fastest common methods of reading Excel in Python. For more related content on the fastest way to read Excel in Python, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!