1. Import the required libraries
First, we need to import some Python libraries that will help us process the data. The code is as follows:
```python
import jsonlines
import itertools
import pandas as pd
from pprint import pprint

import datasets
from datasets import load_dataset
```
Explanation:
jsonlines: Reads and writes files in JSON Lines format (one JSON object per line).
itertools: Provides efficient looping tools, such as slicing iterators.
pandas: Handles tabular data, such as Excel or CSV files.
pprint: Pretty-prints nested data so it is easier to read.
datasets: A Hugging Face library for loading and processing datasets.
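As a quick sanity check on two of these imports, here is a minimal sketch (with a made-up record) showing what pprint and itertools do; none of this data comes from the dataset used later:

```python
from pprint import pprint
import itertools

# A hypothetical nested record, just to illustrate the imports above.
record = {"question": "What is Lamini?", "answer": "A library for finetuning LLMs."}

# pprint renders nested data with indentation instead of one long line.
pprint(record, width=40)

# itertools.islice takes the first n items of any iterable without
# touching the rest -- useful later for streamed datasets.
first_two = list(itertools.islice(range(10), 2))
print(first_two)  # [0, 1]
```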
2. Load the pretraining dataset
Next, we load a public pretraining dataset. Here we use allenai/c4, a large English web-text corpus.
```python
pretrained_dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
```
Explanation:
load_dataset: Loads the dataset.
"allenai/c4": The name of the dataset on the Hugging Face Hub.
"en": Load only the English subset.
split="train": Load only the training split.
streaming=True: Stream the data instead of downloading it all at once, which suits very large datasets.
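With streaming=True, the result is iterated lazily rather than indexed like a list. The sketch below simulates that behaviour with a plain generator (a stand-in for the real streamed dataset, which needs a network connection):

```python
import itertools

# A generator standing in for a streamed dataset: records are
# produced lazily, so nothing is loaded until you iterate.
def fake_stream():
    for i in itertools.count():
        yield {"text": f"document {i}"}

stream = fake_stream()

# A stream cannot be indexed (stream[0] would fail);
# instead, slice off the first few items.
first_three = list(itertools.islice(stream, 3))
print(first_three)
```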
3. View the first 5 samples of the dataset
We can view the first 5 samples of the dataset with the following code:
```python
n = 5
print("Pretrained dataset:")
top_n = itertools.islice(pretrained_dataset, n)
for i in top_n:
    print(i)
```
Explanation:
n = 5: The number of samples to view.
itertools.islice(pretrained_dataset, n): Extracts the first 5 samples from the streamed dataset.
for i in top_n:: Iterates over these 5 samples and prints each one.
4. Load the company fine-tuning dataset
Suppose we have a file called lamini_docs.jsonl that stores some questions and answers. We can load this file with the following code:
```python
filename = "lamini_docs.jsonl"
instruction_dataset_df = pd.read_json(filename, lines=True)
instruction_dataset_df
```
Explanation:
pd.read_json(..., lines=True): Reads a JSON Lines file and converts it into a table (a pandas DataFrame).
instruction_dataset_df: Placing the DataFrame on the last line of a notebook cell displays its contents.
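If you don't have lamini_docs.jsonl on disk, the same call can be tried on inline text. The records below are invented for illustration, not taken from the real file:

```python
import io
import pandas as pd

# Two hypothetical JSON Lines records, inline instead of on disk.
jsonl_text = (
    '{"question": "What is Lamini?", "answer": "A finetuning library."}\n'
    '{"question": "Is it open source?", "answer": "Check the repo license."}\n'
)

# lines=True tells pandas that each line is a separate JSON object.
df = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(df.shape)          # (2, 2)
print(list(df.columns))  # ['question', 'answer']
```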
5. Format the data
We can concatenate each question and its answer into a single string for later processing:
```python
examples = instruction_dataset_df.to_dict()
text = examples["question"][0] + examples["answer"][0]
text
```
Explanation:
to_dict(): Converts the table into a dictionary keyed by column name, with each column mapped to a {row_index: value} dictionary.
examples["question"][0]: Gets the first question.
examples["answer"][0]: Gets the first answer.
text: The first question and answer concatenated into one string.
6. Format data using templates
We can use templates to format questions and answers to make them look neat:
```python
prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

question = examples["question"][0]
answer = examples["answer"][0]
text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
text_with_prompt_template
```
Explanation:
prompt_template_qa: Defines a template with two sections, "Question" and "Answer".
format: Inserts the question and answer into the template's placeholders.
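A quick standalone check of the template, with a made-up question and answer (assuming the multi-line layout of prompt_template_qa shown above):

```python
prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

# str.format substitutes the named placeholders {question} and {answer}.
filled = prompt_template_qa.format(question="What is 2 + 2?", answer="4")
print(filled)
```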
7. Generate fine-tuning datasets
We can format all Q&A pairs and save them to a list:
```python
# A question-only template: the loop below uses it, but it was not
# defined earlier. It mirrors prompt_template_qa without the answer.
prompt_template_q = """### Question:
{question}

### Answer:"""

num_examples = len(examples["question"])
finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
for i in range(num_examples):
    question = examples["question"][i]
    answer = examples["answer"][i]

    text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
    finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})

    text_with_prompt_template_q = prompt_template_q.format(question=question)
    finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})
```
Explanation:
num_examples: The number of question-answer pairs.
prompt_template_q: A question-only template, which must be defined before the loop uses it.
finetuning_dataset_text_only: Stores each pair formatted as a single text string.
finetuning_dataset_question_answer: Stores each pair as separate formatted question and answer fields.
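The loop above, run end to end on two invented records (toy stand-ins for examples = instruction_dataset_df.to_dict()), shows the shape of both output lists:

```python
# Toy data in the same {column: {row_index: value}} shape as to_dict().
examples = {
    "question": {0: "What is Lamini?", 1: "Is it free?"},
    "answer": {0: "A finetuning library.", 1: "See its pricing page."},
}

prompt_template_qa = "### Question:\n{question}\n\n### Answer:\n{answer}"
prompt_template_q = "### Question:\n{question}\n\n### Answer:"

finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
for i in range(len(examples["question"])):
    question = examples["question"][i]
    answer = examples["answer"][i]
    # One list holds the pair as a single text string...
    finetuning_dataset_text_only.append(
        {"text": prompt_template_qa.format(question=question, answer=answer)})
    # ...the other keeps question and answer as separate fields.
    finetuning_dataset_question_answer.append(
        {"question": prompt_template_q.format(question=question), "answer": answer})

print(len(finetuning_dataset_text_only))                 # 2
print(finetuning_dataset_question_answer[0]["answer"])   # A finetuning library.
```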
8. Save processed data
We can save the processed data to a new file:
```python
with jsonlines.open('lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)
```
Explanation:
jsonlines.open(..., 'w'): Opens the file for writing.
writer.write_all: Writes every record to the file, one JSON object per line.
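Since JSON Lines is just one JSON object per line, the same round trip can be sketched with only the standard json module (handy if jsonlines is not installed; the records here are invented):

```python
import json
import os
import tempfile

records = [
    {"question": "Q1", "answer": "A1"},
    {"question": "Q2", "answer": "A2"},
]

# Write one JSON object per line -- the same layout that jsonlines'
# writer.write_all() produces.
path = os.path.join(tempfile.mkdtemp(), "lamini_docs_processed.jsonl")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read it back line by line.
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)  # True
```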
9. Load processed data
Finally, we can load the published version of this dataset from the Hugging Face Hub:
```python
finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)
```
Explanation:
load_dataset: Loads the dataset by name; here it fetches lamini/lamini_docs from the Hugging Face Hub rather than the local file we just wrote.
print(finetuning_dataset): Prints the loaded dataset.
Summary
In this article, we learned how to load, process, and save datasets in Python: we started with simple data loading, then formatted question-answer pairs with templates, saved the result as JSON Lines, and finally loaded the processed dataset back.
That concludes this article on processing datasets with Python. For more on this topic, please search my previous articles or continue browsing the related articles below. Thank you for your support!