SoFunction
Updated on 2025-04-14

Tips for using Python to process datasets

1. Import the required library

First, we need to import some Python libraries that will help us process the data. The code is as follows:

import jsonlines
import itertools
import pandas as pd
from pprint import pprint

import datasets
from datasets import load_dataset

Explanation:

jsonlines: Used to read and write files in JSON Lines format.

itertools: Provides efficient iteration tools, such as islice for taking the first few items of a stream.

pandas: Used to work with tabular data, such as Excel or CSV files.

pprint: Pretty-prints data structures so they are easier to read.

datasets: A library specifically for loading and processing datasets (from the Hugging Face Hub).
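To see what the JSON Lines format actually looks like, here is a minimal stdlib-only sketch (the records are made up for illustration; the jsonlines library wraps exactly this one-object-per-line pattern):

```python
import json
from pprint import pprint

# Made-up sample records for illustration
records = [
    {"question": "What is JSON Lines?", "answer": "One JSON object per line."},
    {"question": "Why use it?", "answer": "Easy to stream and append."},
]

# Write: one json.dumps() per line -- this is all the JSON Lines format is
with open("demo.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back line by line
with open("demo.jsonl") as f:
    loaded = [json.loads(line) for line in f]

pprint(loaded)  # pretty-prints the list of dicts
```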

2. Load the pretraining dataset

Next, we load a pretraining dataset. Here we use allenai/c4, a large English web-text dataset.

pretrained_dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

Explanation:

load_dataset: Loads the dataset.

"allenai/c4": The name of the dataset.

"en": Load only the English subset.

split="train": Load only the training split.

streaming=True: Load the data in streaming mode, which is suitable for very large datasets because samples are fetched lazily instead of downloaded all at once.
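A streaming dataset is just an iterable that yields samples one at a time, which is why itertools.islice (used in the next step) can take a few samples without touching the rest. A minimal sketch, with an ordinary generator standing in for the streamed dataset:

```python
import itertools

def fake_stream():
    """Stands in for a streaming dataset: yields samples one at a time, forever."""
    i = 0
    while True:
        yield {"text": f"sample {i}"}
        i += 1

# Take only the first 3 samples; the (infinite) stream is never fully consumed
first_three = list(itertools.islice(fake_stream(), 3))
print(first_three)
```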

3. View the first 5 samples of the dataset

We can view the first 5 samples of the dataset with the following code:

n = 5
print("Pretrained dataset:")
top_n = itertools.islice(pretrained_dataset, n)
for i in top_n:
  print(i)

Explanation:

n = 5: We want to view 5 samples.

itertools.islice(pretrained_dataset, n): Extracts the first 5 samples from the dataset without materializing the rest.

for i in top_n:: Iterates over these 5 samples and prints them.

4. Load the company fine-tuning dataset

Suppose we have a file called lamini_docs.jsonl that stores some questions and answers. We can load this file with the following code:

filename = "lamini_docs.jsonl"
instruction_dataset_df = pd.read_json(filename, lines=True)
instruction_dataset_df

Explanation:

pd.read_json: Reads a JSON Lines file (lines=True) and converts it into a table (DataFrame).

instruction_dataset_df: Writing the variable name on its own line displays the table contents (in a notebook).
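To see how pd.read_json parses JSON Lines into a table, here is a self-contained sketch using two made-up Q&A records in an in-memory buffer (the real lamini_docs.jsonl has the same shape but different content):

```python
import io
import pandas as pd

# Two made-up records in JSON Lines format: one JSON object per line
raw = io.StringIO(
    '{"question": "What is Lamini?", "answer": "An LLM platform."}\n'
    '{"question": "Is it free?", "answer": "It has a free tier."}\n'
)

df = pd.read_json(raw, lines=True)
print(df.shape)           # (2, 2): two rows, columns "question" and "answer"
print(df["question"][0])  # first question
```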

5. Format the data

We can concatenate each question and answer into a single string for easier subsequent processing:

examples = instruction_dataset_df.to_dict()
text = examples["question"][0] + examples["answer"][0]
text

Explanation:

to_dict(): Converts the table data into dictionary format.

examples["question"][0]: Gets the first question.

examples["answer"][0]: Gets the first answer.

text: The question and answer concatenated into one string.
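The default to_dict() orientation produces a nested dictionary keyed first by column name and then by row index, which is why the indexing examples["question"][0] works. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "question": ["Q1?", "Q2?"],
    "answer": ["A1.", "A2."],
})

# Default orient="dict": {column: {row_index: value}}
examples = df.to_dict()
print(examples)
# {'question': {0: 'Q1?', 1: 'Q2?'}, 'answer': {0: 'A1.', 1: 'A2.'}}

text = examples["question"][0] + examples["answer"][0]
print(text)  # 'Q1?A1.'
```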

6. Format data using templates

We can use templates to format questions and answers to make them look neat:

prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

question = examples["question"][0]
answer = examples["answer"][0]

text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
text_with_prompt_template

Explanation:

prompt_template_qa: Defines a template with two parts, "Question" and "Answer".

format: Inserts the question and answer into the template's placeholders.
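As a quick sanity check, here is what format produces for a pair of made-up values (the question and answer below are placeholders, not taken from lamini_docs.jsonl):

```python
prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

# Placeholder values for illustration only
filled = prompt_template_qa.format(
    question="What is fine-tuning?",
    answer="Adapting a pretrained model to new data.",
)
print(filled)
```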

7. Generate fine-tuning datasets

We can format all Q&A pairs and save them to a list:

# Question-only template (assumed to mirror prompt_template_qa, minus the answer)
prompt_template_q = """### Question:
{question}

### Answer:"""

num_examples = len(examples["question"])
finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]

  text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
  finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})

  text_with_prompt_template_q = prompt_template_q.format(question=question)
  finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})

Explanation:

prompt_template_q: A template containing only the question (the answer is kept separately).

num_examples: The number of questions.

finetuning_dataset_text_only: Stores the formatted text (question and answer combined).

finetuning_dataset_question_answer: Stores the formatted question and its answer separately.

8. Save processed data

We can save the processed data to a new file:

with jsonlines.open('lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)

Explanation:

jsonlines.open: Opens the file and prepares to write data.

writer.write_all: Writes all the records to the file.
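Reading the file back is an easy sanity check that the save round-trips cleanly. A stdlib-only sketch (the records are made up stand-ins for finetuning_dataset_question_answer; writing one json.dumps per line is equivalent to what write_all does):

```python
import json

# Made-up records standing in for finetuning_dataset_question_answer
processed = [
    {"question": "### Question:\nQ1?\n\n### Answer:", "answer": "A1."},
    {"question": "### Question:\nQ2?\n\n### Answer:", "answer": "A2."},
]

with open("lamini_docs_processed.jsonl", "w") as f:
    for record in processed:
        f.write(json.dumps(record) + "\n")

# Read it back and confirm the round trip preserved every record
with open("lamini_docs_processed.jsonl") as f:
    reloaded = [json.loads(line) for line in f]

assert reloaded == processed
print(f"Wrote and reloaded {len(reloaded)} records")
```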

9. Load processed data

Finally, we can load the processed dataset. Note that here it is loaded by name from the Hugging Face Hub rather than from the local file:

finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)

Explanation:

load_dataset: Loads the dataset with the specified name.

print(finetuning_dataset): Prints the loaded dataset.

Summary

In this article, we learned how to load, process, and save datasets in Python: starting with simple data loading, then formatting data with templates, saving the results, and finally loading the processed data back.

This concludes this article on tips for using Python to process datasets. For more related content on processing datasets with Python, please search my previous articles or continue browsing the related articles below. I hope you will continue to support me!