SoFunction
Updated on 2025-03-06

18 ways to implement data cleaning in Python

Data cleaning may be the first big challenge you encounter in data work, but don't worry: Python's strength lies in solving complex problems with concise code. Today we will learn eighteen tricks, each of which completes a data cleaning task in a single line of code. Get ready, and let's simplify the complex and become skilled at data cleaning!

1. Remove spaces on both sides of the string

data = "   Hello World!   "  
cleaned_data = data.strip()  # One line removes the whitespace on both sides
  • Interpretation: the strip() method removes whitespace characters from the beginning and end of a string; it is simple and efficient.
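
As a small extension of the trick above, here is a sketch of the related lstrip() and rstrip() methods, and of passing a set of characters to strip(). The sample strings are made up for illustration.

```python
# Sample values are hypothetical, chosen only to show each variant.
padded = "   Hello World!   "

left_trimmed = padded.lstrip()    # removes leading whitespace only
right_trimmed = padded.rstrip()   # removes trailing whitespace only

# strip() also accepts a string of characters to remove from both ends.
tagged = "--[Hello]--"
bare = tagged.strip("-[]")

print(repr(left_trimmed))
print(repr(right_trimmed))
print(bare)
```

Note that the argument to strip() is treated as a set of characters, not as a prefix or suffix to match exactly.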

2. Convert data types

num_str = "123"  
num_int = int(num_str)  # String to integer, as direct as it gets
  • Note: Make sure the data is in a valid format before converting; otherwise int() raises a ValueError.
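
To illustrate the note above, here is a minimal sketch of a conversion that falls back to a default instead of failing. The helper name to_int and the sample values are assumptions made for this example.

```python
def to_int(value, default=None):
    """Return int(value), or default if the value cannot be converted."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # ValueError: malformed text such as "abc" or "12.5"
        # TypeError: unsupported types such as None
        return default

raw_values = ["123", " 42 ", "12.5", "abc", None]
converted = [to_int(v) for v in raw_values]
print(converted)
```

int() tolerates surrounding whitespace, so " 42 " converts fine, while "12.5" does not because it is not an integer literal.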

3. Case conversion

text = "Python is Awesome"  
lower_text = text.lower()  # All lowercase, for uniform processing
upper_text = text.upper()  # Or all uppercase, as you like

4. Remove duplicate elements from the list

my_list = [1, 2, 2, 3, 4, 4]  
unique_list = list(set(my_list))  # A set keeps only unique elements, so duplicates disappear
  • Tip: This trick is handy, but a set does not guarantee that the original list order is preserved.
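
If the order matters, one common alternative (not part of the original trick) is dict.fromkeys(), which relies on dictionaries keeping insertion order in Python 3.7 and later. A minimal sketch with made-up data:

```python
my_list = [3, 1, 3, 2, 1]

# dict keys are unique and keep first-seen order (Python 3.7+),
# so this removes duplicates without reordering the survivors.
unique_in_order = list(dict.fromkeys(my_list))
print(unique_in_order)
```

Like set(), this approach requires the elements to be hashable.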

5. Quickly count element occurrences

from collections import Counter  
data = ['apple', 'banana', 'apple', 'orange']  
counts = dict(Counter(data))  # Want to know which item is the most popular?
  • Interpretation: Counter is a counting tool that makes it easy to get the frequency of each element.
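
To answer the "most popular" question directly, the Counter can be kept as-is rather than converted to a dict, since it offers most_common(). A short sketch using the same sample data:

```python
from collections import Counter

data = ['apple', 'banana', 'apple', 'orange']
counts = Counter(data)

# most_common(n) returns the n most frequent (element, count) pairs.
top_item, top_count = counts.most_common(1)[0]
print(top_item, top_count)

# Unlike a plain dict, a Counter returns 0 for elements it has not seen.
missing = counts['kiwi']
print(missing)
```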

6. Split string into list

sentence = "Hello world"  
words = sentence.split(" ")  # Split on a space to get a word list; split() with no argument splits on any whitespace
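
The difference between splitting on a single space and splitting with no argument shows up as soon as the input has repeated or mixed whitespace. A small sketch with a made-up string:

```python
messy = "Hello   world \n"

# An explicit " " separator produces empty strings between repeated
# spaces and leaves other whitespace (here the newline) in place.
by_space = messy.split(" ")

# With no argument, runs of any whitespace count as one separator
# and leading/trailing whitespace is ignored.
by_whitespace = messy.split()

print(by_space)
print(by_whitespace)
```

For cleaning free-form text, the no-argument form is usually the safer default.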

7. List merge

list1 = [1, 2, 3]  
list2 = [4, 5, 6]  
merged_list = list1 + list2  # Concatenate the two lists, it's that simple

8. Data Filling

my_list = [1, 2]  
filled_list = my_list * 3  # Repeat the list three times to fill a longer list quickly

9. Parse date and time

from datetime import datetime  
date_str = "2023-04-01"  
date_obj = datetime.strptime(date_str, "%Y-%m-%d")  # Date string to datetime object
  • Key point: %Y-%m-%d is the date format; adjust it to match your data.
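
Once the string is parsed, the individual fields are available as attributes, and strftime() writes the date back out in another format. A minimal sketch; the target format is an arbitrary choice for illustration:

```python
from datetime import datetime

date_obj = datetime.strptime("2023-04-01", "%Y-%m-%d")

# Individual components are plain integers.
year, month, day = date_obj.year, date_obj.month, date_obj.day

# strftime() is the inverse of strptime(): object back to string.
reformatted = date_obj.strftime("%d/%m/%Y")

print(year, month, day)
print(reformatted)
```

If the string does not match the format, strptime() raises a ValueError, so it also works as a validity check.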

10. String replacement

old_string = "Python is fun."  
new_string = old_string.replace("fun", "awesome")  # Swap one word for another

11. Quick sorting

numbers = [5, 2, 9, 1, 5]  
sorted_numbers = sorted(numbers)  # Returns a new sorted list, ascending by default
  • Advanced: pass reverse=True to sort in descending order.
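
A short sketch of reverse=True, plus the key parameter, which is a related option not covered above. The word list is made up for illustration.

```python
numbers = [5, 2, 9, 1, 5]
descending = sorted(numbers, reverse=True)

# By default, uppercase letters sort before lowercase ones.
words = ["banana", "apple", "Cherry"]
default_order = sorted(words)

# key= transforms each element for comparison only;
# the original values are what end up in the result.
case_insensitive = sorted(words, key=str.lower)

print(descending)
print(default_order)
print(case_insensitive)
```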

12. Extract numbers

mixed_str = "The year is 2023"  
nums = ''.join(filter(str.isdigit, mixed_str))  # Keep only the digits, drop everything else
  • Explanation: filter() combined with str.isdigit keeps only the numeric characters.
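
One caveat: joining every digit merges separate numbers into one string. When a text contains several numbers, a regular expression keeps them apart. This is an alternative technique, sketched with a made-up sentence:

```python
import re

mixed_str = "Order 66 shipped in 2023"

# filter + join concatenates all digits, losing the boundaries.
joined = ''.join(filter(str.isdigit, mixed_str))

# \d+ matches each run of digits as its own number.
separate = [int(n) for n in re.findall(r'\d+', mixed_str)]

print(joined)
print(separate)
```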

13. Null value handling (assuming the data is a list)

data_list = [None, 1, 2, None, 3]  
filtered_list = [x for x in data_list if x is not None]  # Drop the None values, clean and neat
  • Syntactic sugar: list comprehension, concise and elegant.

14. Dictionary key-value pair swap

my_dict = {"key1": "value1", "key2": "value2"}  
swapped_dict = {v: k for k, v in my_dict.items()}  # Keys become values and values become keys
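
The swap assumes the values are unique and hashable. When two keys share a value, the later one silently wins. The sketch below shows that behavior and one possible way to keep every key by grouping them into lists; the sample dictionary is made up.

```python
my_dict = {"a": 1, "b": 2, "c": 1}

# Both "a" and "c" map to 1, so only the last one survives the swap.
swapped = {v: k for k, v in my_dict.items()}

# Grouping keeps all original keys for each value.
grouped = {}
for k, v in my_dict.items():
    grouped.setdefault(v, []).append(k)

print(swapped)
print(grouped)
```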

15. Average calculation

numbers = [10, 20, 30, 40]  
average = sum(numbers) / len(numbers)  # The average in one step
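
The one-liner raises a ZeroDivisionError on an empty list, which is common after filtering. A minimal guarded sketch; the helper name safe_average and its default are assumptions for this example:

```python
def safe_average(values, default=None):
    """Return the arithmetic mean, or default when values is empty."""
    if not values:
        return default
    return sum(values) / len(values)

print(safe_average([10, 20, 30, 40]))
print(safe_average([]))
```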

16. String grouping

s = "abcdef"  
grouped = [s[i:i+2] for i in range(0, len(s), 2)]  # One group for every two characters
  • Application: Suitable for any scenario that needs fixed-size grouping.
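
The same slicing idea can be wrapped in a small helper that works for any sequence and group size. The function name chunk is an assumption for this sketch; note that the last group is shorter when the length is not a multiple of the size.

```python
def chunk(seq, size):
    """Split a sequence (string, list, tuple) into pieces of at most `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [seq[i:i + size] for i in range(0, len(seq), size)]

print(chunk("abcdefg", 3))
print(chunk([1, 2, 3, 4, 5], 2))
```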

17. Data Standardization

import numpy as np  
data = np.array([1, 2, 3])
normalized_data = (data - data.mean()) / data.std()  # Z-score: subtract the mean, divide by the standard deviation
  • Background: A staple of data analysis. The result has a mean of 0 and a standard deviation of 1; the shape of the distribution itself is unchanged.
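
For readers without NumPy installed, the same calculation can be sketched with the standard library. This is an alternative to the NumPy version above, not a replacement for it. pstdev() is used because NumPy's std() computes the population standard deviation by default.

```python
from statistics import fmean, pstdev

data = [1, 2, 3]

mu = fmean(data)        # arithmetic mean as a float
sigma = pstdev(data)    # population standard deviation

# Z-score each value: distance from the mean in units of sigma.
normalized = [(x - mu) / sigma for x in data]

print(normalized)
print(fmean(normalized), pstdev(normalized))
```

If every value is identical, sigma is 0 and the division fails, so real data may need a check for that case first.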

18. Data filtering (based on conditions)

data = [1, 2, 3, 4, 5]  
even_numbers = [x for x in data if x % 2 == 0]  # Keep only the even numbers
  • Tip: A list comprehension combined with a condition gives efficient filtering.

Advanced practice and skills

Now that you have mastered the eighteen basic methods, let’s go deeper and explore how to combine these techniques to solve more complex data cleaning problems, and share some tips in practice.

1. Complex string processing: regular expressions

Regular expressions are an indispensable tool in data cleaning. Although, strictly speaking, they do not always fit in a single line, they handle pattern matching and replacement efficiently.

import re  
text = "Email: example@ Phone: 123-456-7890"  
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
phones = re.findall(r'\b\d{3}-\d{3}-\d{4}\b', text)

This code extracts the email addresses and phone numbers from the text, demonstrating the power of regular expressions.
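
Here is a self-contained sketch of the same extraction, followed by a replacement with re.sub(). The address user@example.com is a hypothetical placeholder chosen for this example, and the email pattern is a simple approximation rather than a full validator.

```python
import re

# Hypothetical sample text; the address is a placeholder.
text = "Email: user@example.com Phone: 123-456-7890"

emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
phones = re.findall(r'\b\d{3}-\d{3}-\d{4}\b', text)

# re.sub() covers the "replacement" side: mask every digit in the text.
masked = re.sub(r'\d', '#', text)

print(emails)
print(phones)
print(masked)
```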

2. The magic of the Pandas library

Pandas is one of the most widely used tools for data analysis and cleaning. Although Pandas operations usually take more than one line, their efficiency and simplicity make them worth learning.

import pandas as pd  
df = pd.read_csv('')  
# Delete rows with missing values
df_clean = df.dropna()
# Replace specific values
df['column_name'] = df['column_name'].replace('old_value', 'new_value')
  • Note: Pandas is powerful, but beginners may need more time to get familiar with it.

3. Error handling and logging

Errors are almost inevitable when dealing with large amounts of data. Learning to catch exceptions with the try-except structure and to record events with the logging module can greatly improve debugging efficiency.

import logging  
logging.basicConfig(level=logging.INFO)
try:
    result = some_function_that_might_fail()  # Placeholder for your own function
    logging.info(f"Successfully executed! result: {result}")
except Exception as e:
    logging.error(f"Execution failed: {e}")

In this way, even if problems arise, they can be quickly located.
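
As a runnable illustration of the pattern, the sketch below wraps one cleaning step so that bad records are logged and skipped rather than stopping the whole run. The function parse_price, the log format, and the sample values are all assumptions made for this example.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)

def parse_price(raw):
    """Hypothetical cleaning step: turn '$1,200.50' into 1200.5."""
    try:
        value = float(raw.replace("$", "").replace(",", ""))
        logger.info("Parsed %r as %s", raw, value)
        return value
    except (AttributeError, ValueError) as exc:
        # AttributeError: raw is not a string (e.g. None)
        # ValueError: the text is not a number (e.g. 'n/a')
        logger.error("Could not parse %r: %s", raw, exc)
        return None

prices = [parse_price(p) for p in ["$1,200.50", "n/a", None]]
print(prices)
```

Catching only the exceptions you expect, instead of a bare Exception, keeps genuinely unexpected bugs visible.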

4. Batch operations and function encapsulation

Encapsulating common data cleaning steps into functions can greatly improve the reusability and readability of the code.

def clean_phone(phone):  
    """Remove non-numeric characters from phone number"""  
    return ''.join(c for c in phone if c.isdigit())
  
phone_numbers = ['123-456-7890', '(555) 555-5555']  
cleaned_numbers = [clean_phone(phone) for phone in phone_numbers]  

By defining the clean_phone function, we can easily clean up a batch of phone numbers.

Practical suggestions:

  • Step by step: Don’t try to complete all cleaning tasks at once, process them in steps and gradually optimize them.

  • Test data: Before running your cleaning logic on actual data, use small samples or simulated data to verify the correctness of the code.

  • Documentation and comments: Even in a simple data cleaning script, good comments can be a huge help to your future self or other developers.
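
The "test data" suggestion can be put into practice with a handful of hand-written samples and plain assert statements. The sketch below checks the clean_phone function from the previous section; the sample inputs and expected outputs are made up for illustration.

```python
def clean_phone(phone):
    """Remove non-numeric characters from a phone number."""
    return ''.join(c for c in phone if c.isdigit())

# Small simulated sample: raw input mapped to the expected cleaned output.
samples = {
    '123-456-7890': '1234567890',
    '(555) 555-5555': '5555555555',
    '': '',        # edge case: empty input
    'n/a': '',     # edge case: no digits at all
}

for raw, expected in samples.items():
    actual = clean_phone(raw)
    assert actual == expected, f"{raw!r}: expected {expected!r}, got {actual!r}"

print("all sample checks passed")
```

Including edge cases such as empty strings in the sample makes it more likely that surprises show up before the real data is touched.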
