When using Pandas to process TB-scale data, it is not feasible to load the entire dataset directly into memory.
Here are some strategies and methods that can be adopted:
Read data in chunks
Pandas provides a chunksize parameter that lets you read a large file in chunks and process one portion of the data at a time, which avoids running out of memory.
Take a CSV file as an example:
import pandas as pd

# Define the number of rows read per chunk
chunk_size = 100000
chunk_number = 0  # track the chunk index so the header is only written once

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each data block here
    processed_chunk = chunk[chunk['column_name'] > 10]
    # You can save or further aggregate the processed data blocks
    # For example, append the processed data block to a file
    processed_chunk.to_csv('processed_file.csv', mode='a', header=not bool(chunk_number))
    chunk_number += 1
In the above code, the chunksize parameter makes read_csv return the CSV file in chunks of the specified number of rows; after each chunk is processed, it is appended to a new file.
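If the goal is an aggregate rather than a filtered copy of the file, the per-chunk results can also be combined at the end. Below is a minimal sketch, reusing the placeholder file and column names from above, that sums one column per group across all chunks:

import pandas as pd

chunk_size = 100000
partial_sums = []

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    filtered = chunk[chunk['column_name'] > 10]
    # Aggregate each chunk separately and keep only the small intermediate result
    partial_sums.append(filtered.groupby('group_column')['column_name'].sum())

# Combine the per-chunk sums into the final aggregation
final_result = pd.concat(partial_sums).groupby(level=0).sum()
print(final_result)

Because only the small per-chunk summaries are kept in memory, this approach stays bounded regardless of the size of the input file.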
Combining Dask with Pandas
Dask is a flexible parallel computing library that can handle data sets larger than memory.
Dask's DataFrame API closely mirrors Pandas, so you can use familiar Pandas operations to handle large datasets.
import dask.dataframe as dd

# Read large CSV files
df = dd.read_csv('large_file.csv')

# Do some data processing operations
result = df[df['column_name'] > 10].groupby('group_column').sum()

# Calculate results
final_result = result.compute()
The above code uses Dask's read_csv function to read the large CSV file into a Dask DataFrame, performs the data processing operations lazily, and finally calls the compute method to calculate the final result.
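Because Dask evaluates lazily, a large result can also be written straight to disk instead of being materialized with compute. A minimal sketch, assuming the same placeholder names and that a Parquet engine such as pyarrow is installed:

import dask.dataframe as dd

# blocksize controls how much of the CSV goes into each partition
df = dd.read_csv('large_file.csv', blocksize='64MB')

# Still lazy: nothing has been read or computed yet
result = df[df['column_name'] > 10].groupby('group_column').sum()

# Writing the result triggers the computation and streams the output to disk,
# so the full result never has to fit in memory at once
result.to_parquet('result_parquet/')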
Data compression and type optimization
When reading data, choose appropriate data types to reduce memory usage.
For example, use the astype method to convert an integer column from int64 to int32 or int16.
import pandas as pd

# Read data
df = pd.read_csv('large_file.csv')

# Optimize data types
df['integer_column'] = df['integer_column'].astype('int32')
df['float_column'] = df['float_column'].astype('float32')
The code uses the astype method to downcast the integer and floating-point columns, reducing memory usage.
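Memory can also be saved while the file is being parsed by passing a dtype mapping to read_csv, rather than converting after loading. A minimal sketch with placeholder column names ('category_column' stands for any low-cardinality string column):

import pandas as pd

# Compact dtypes are applied while the CSV is parsed
dtypes = {
    'integer_column': 'int32',
    'float_column': 'float32',
    'category_column': 'category',  # low-cardinality strings shrink a lot as category
}

df = pd.read_csv('large_file.csv', dtype=dtypes)

# Check how much memory each column actually uses (deep=True includes string data)
print(df.memory_usage(deep=True))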
Database query and filtering
If the data is stored in a database, use the database's query capabilities to select only the columns and rows you need, so the entire dataset never has to be loaded into memory.
import pandas as pd
import sqlite3

# Connect to the database
conn = sqlite3.connect('large_database.db')

# Execute the query statement and select only the required data
query = "SELECT column1, column2 FROM large_table WHERE condition = 'value'"
df = pd.read_sql(query, conn)

# Close the database connection
conn.close()
The above code uses sqlite3 to connect to the database, executes a query that selects only the needed columns and rows, and loads the query result into a Pandas DataFrame.
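If even the filtered query result is too large for memory, pd.read_sql also accepts a chunksize parameter and then yields the result one DataFrame at a time. A small sketch reusing the placeholder database and query from above, here just counting the matching rows:

import pandas as pd
import sqlite3

conn = sqlite3.connect('large_database.db')
query = "SELECT column1, column2 FROM large_table WHERE condition = 'value'"

total_rows = 0
for chunk in pd.read_sql(query, conn, chunksize=100000):
    # Each chunk is an ordinary Pandas DataFrame of at most 100,000 rows
    total_rows += len(chunk)

conn.close()
print(total_rows)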
Distributed computing
For hyperscale data, consider using distributed computing frameworks such as Apache Spark.
Spark can process PB-scale data and provides an API similar to Pandas (the PySpark Pandas API) for easy data processing.
import pyspark.pandas as ps
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("LargeDataProcessing").getOrCreate()

# Read large CSV files
df = ps.read_csv('large_file.csv')

# Perform data processing operations
result = df[df['column_name'] > 10].groupby('group_column').sum()

# Output result
print(result)

# Stop SparkSession
spark.stop()
This code uses the PySpark Pandas API to read large CSV files, perform data processing operations, and finally output the result.
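When the result is itself large, it is usually better to write it back to distributed storage than to print it or collect it on the driver. A sketch under the same placeholder names, writing the result as Parquet:

import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LargeDataProcessing").getOrCreate()

df = ps.read_csv('large_file.csv')
result = df[df['column_name'] > 10].groupby('group_column').sum()

# The write is distributed across the cluster, so the result never has to fit
# on a single machine
result.to_parquet('result_parquet/')

# Convert to a plain Pandas DataFrame only when the result is known to be small
small_result = result.to_pandas()

spark.stop()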
Summary
With the above methods, you can effectively avoid running out of memory when processing TB-scale data, while making full use of available computing resources to process the data efficiently.
These are just my personal experiences; I hope they serve as a useful reference, and your support is much appreciated.