
How to use TensorDataset and DataLoader in PyTorch

The combination of TensorDataset and DataLoader in PyTorch

Let's start with what the names suggest: TensorDataset is a dataset class that stores tensors, and DataLoader is a data loader. DataLoader is generally used when the data needs to be iterated over and processed in batches.

TensorDataset(tensor1, tensor2) pairs tensor1 with tensor2 sample by sample: tensor1 holds the data, and tensor2 holds the corresponding labels.

Let's look at a small example:

from torch.utils.data import TensorDataset, DataLoader
import torch
 
a = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9],
                  [1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9],
                  [1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9],
                  [1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
 
b = torch.tensor([44, 55, 66, 44, 55, 66, 44, 55, 66, 44, 55, 66])
train_ids = TensorDataset(a, b)
# Slice output
print(train_ids[0:4])  # rows 0, 1, 2, 3
print('=' * 47)
# Loop over the data
for x_train, y_label in train_ids:
    print(x_train, y_label)

Here is the corresponding output:

(tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
        [1, 2, 3]]), tensor([44, 55, 66, 44]))
===============================================
tensor([1, 2, 3]) tensor(44)
tensor([4, 5, 6]) tensor(55)
tensor([7, 8, 9]) tensor(66)
tensor([1, 2, 3]) tensor(44)
tensor([4, 5, 6]) tensor(55)
tensor([7, 8, 9]) tensor(66)
tensor([1, 2, 3]) tensor(44)
tensor([4, 5, 6]) tensor(55)
tensor([7, 8, 9]) tensor(66)
tensor([1, 2, 3]) tensor(44)
tensor([4, 5, 6]) tensor(55)
tensor([7, 8, 9]) tensor(66)

From the output we can see how each data tensor corresponds to its label tensor. This is the basic usage of TensorDataset.
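Because TensorDataset implements __getitem__ and __len__, it can also be indexed and measured directly. A quick sketch using the train_ids built above:

print(train_ids[0])    # (tensor([1, 2, 3]), tensor(44)) -- a (data, label) tuple
print(len(train_ids))  # 12, the size of the first dimension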

Next, we wrap the TensorDataset we just built in a DataLoader and use it to operate on the data:

# dataset=train_ids is the dataset to wrap; batch_size is how many samples to take at a time.
# shuffle=False fetches the data in order; shuffle=True fetches it in random order.
train_loader = DataLoader(dataset=train_ids, batch_size=4, shuffle=False)
# Note that enumerate returns two values: the batch index and the data (training data plus labels).
for i, data in enumerate(train_loader, 1):
    train_data, label = data
    print(' batch:{0} x_data:{1}  label: {2}'.format(i, train_data, label))

Here is the corresponding output:

 batch:1 x_data:tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
        [1, 2, 3]])  label: tensor([44, 55, 66, 44])
 batch:2 x_data:tensor([[4, 5, 6],
        [7, 8, 9],
        [1, 2, 3],
        [4, 5, 6]])  label: tensor([55, 66, 44, 55])
 batch:3 x_data:tensor([[7, 8, 9],
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])  label: tensor([66, 44, 55, 66])
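For contrast, setting shuffle=True makes the DataLoader draw the batches in a new random order each epoch. A minimal sketch (the exact order varies from run to run):

shuffled_loader = DataLoader(dataset=train_ids, batch_size=4, shuffle=True)
for i, (x_data, label) in enumerate(shuffled_loader, 1):
    print(' batch:{0} x_data:{1}  label: {2}'.format(i, x_data, label))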

That covers the combined use of TensorDataset and DataLoader.

Let's take a look at the source code of these two classes:

class TensorDataset(Dataset[Tuple[Tensor, ...]]):
    r"""Dataset wrapping tensors.
    Each sample will be retrieved by indexing tensors along the first dimension.
    Arguments:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """
    tensors: Tuple[Tensor, ...]
 
    def __init__(self, *tensors: Tensor) -> None:
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)
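Note the *tensors signature: TensorDataset accepts any number of tensors, as long as they all share the same first-dimension size. A small sketch reusing a and b from above (weights is a made-up third tensor for illustration):

weights = torch.rand(12)  # a third tensor with the same first dimension
triple_ids = TensorDataset(a, b, weights)
x, y, w = triple_ids[0]   # __getitem__ returns one element per tensor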
 
# The class is long, so only the parameters relevant to this article are listed; see the source for the rest.
class DataLoader(Generic[T_co]):
    r"""
    Data loader. Combines a dataset and a sampler, and provides an iterable over
    the given dataset.
    The :class:`~torch.utils.data.DataLoader` supports both map-style and
    iterable-style datasets with single- or multi-process loading, customizing
    loading order and optional automatic batching (collation) and memory pinning.
    See :py:mod:`torch.utils.data` documentation page for more details.
    Arguments:
        dataset (Dataset): dataset from which to load the data.
        batch_size (int, optional): how many samples per batch to load
            (default: ``1``).
        shuffle (bool, optional): set to ``True`` to have the data reshuffled
            at every epoch (default: ``False``).
    """
    dataset: Dataset[T_co]
    batch_size: Optional[int]
 
    def __init__(self, dataset: Dataset[T_co], batch_size: Optional[int] = 1,
                 shuffle: bool = False):

        self.dataset = dataset
        self.batch_size = batch_size
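Although only a simplified excerpt is shown here, the full DataLoader also implements __len__: with batch size B over N samples (and drop_last left at its default of False), it yields ceil(N/B) batches. That is why the 12-sample dataset above came out as exactly 3 batches of 4. A quick check:

import math

print(len(train_loader))               # 3 batches
print(math.ceil(len(train_ids) / 4))   # 3, i.e. ceil(12 / 4)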

Source code analysis of PyTorch's Dataset, TensorDataset, and DataLoader

1. Why use DataLoader and Dataset

When loading and processing a large amount of data, you may run out of memory. In that case you need a dataset class (Dataset or TensorDataset) together with the dataset-loading class DataLoader.

With these classes, the original data can be split into small chunks and read into memory only when needed, instead of loading all of it into memory up front.
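A minimal sketch of this idea (the file names are hypothetical, assuming one preprocessed sample per .npy file): the samples stay on disk, and __getitem__ loads only the one that is requested.

import numpy as np
from torch.utils.data import Dataset

class LazyDataset(Dataset):
    """Loads one sample from disk per __getitem__ call instead of preloading everything."""
    def __init__(self, file_paths):
        self.file_paths = file_paths  # e.g. ['sample_0.npy', 'sample_1.npy', ...]

    def __getitem__(self, index):
        return np.load(self.file_paths[index])  # only this sample enters memory

    def __len__(self):
        return len(self.file_paths)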

2. Use of Dataset

In PyTorch, Dataset is an abstract class representing a dataset. It is generally not used directly; instead, you define a custom dataset that subclasses it.

A custom dataset should inherit from Dataset and provide a __len__ method that returns the dataset size and a __getitem__ method for fetching data by index.
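In skeleton form, such a subclass looks like this (a minimal sketch; a full worked example follows below):

from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return len(self.data)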

The source code of the Dataset class is as follows:

class Dataset(object):
    r"""An abstract class representing a :class:`Dataset`.

    All datasets that represent a map from keys to data samples should subclass
    it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
    data sample for a given key. Subclasses could also optionally overwrite
    :meth:`__len__`, which is expected to return the size of the dataset by many
    :class:`~torch.utils.data.Sampler` implementations and the default options
    of :class:`~torch.utils.data.DataLoader`.

    .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs an index
      sampler that yields integral indices.  To make it work with a map-style
      dataset with non-integral indices/keys, a custom sampler must be provided.
    """

    def __getitem__(self, index):
        raise NotImplementedError

    def __add__(self, other):
        return ConcatDataset([self, other])

    # No `def __len__(self)` default?
    # See NOTE [ Lack of Default `__len__` in Python Abstract Base Classes ]
    # in pytorch/torch/utils/data/sampler.py

You can see that the Dataset class has no __len__ method, and although it has a __getitem__ method, it simply raises NotImplementedError rather than doing anything useful.

So you need to write your own subclass of Dataset to implement the required functionality.

An example of a custom subclass implementation:

import torch
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torch.autograd import Variable
import numpy as np
import pandas as pd

value_df = pd.read_csv('value.csv')  # hypothetical file name
value_array = np.array(value_df)
print("value_array.shape =", value_array.shape)  # (73700, 300)
value_size = value_array.shape[0]  # 73700
train_size = int(0.7*value_size)

train_array = value_array[:train_size]
train_label_array = value_array[60:train_size+60]  # the same series shifted forward by 60 steps

class DealDataset(Dataset):
    """
    Downloading and initializing the data can be done here.
    """

    def __init__(self, *arrays):
        assert all(arrays[0].shape[0] == array.shape[0] for array in arrays)
        self.arrays = arrays

    def __getitem__(self, index):
        return tuple(array[index] for array in self.arrays)

    def __len__(self):
        return self.arrays[0].shape[0]


# Instantiate the class to get a Dataset-type object, which can then be passed straight to DataLoader.
train_dataset = DealDataset(train_array, train_label_array)

train_loader2 = DataLoader(dataset=train_dataset,
                           batch_size=32,
                           shuffle=True)

for epoch in range(2):
    for i, data in enumerate(train_loader2):
        # Read a batch from train_loader2; 32 samples are fetched at a time
        inputs, labels = data

        # Convert the data to Variable type
        # (Variable is a holdover from pre-0.4 PyTorch; nowadays tensors can be used directly)
        inputs, labels = Variable(inputs), Variable(labels)

        # The model would run here; we use print as a stand-in
        print("epoch:", epoch, "batch:", i, "inputs", inputs.shape, "labels", labels.shape)

result:

epoch: 0 batch: 0 inputs torch.Size([32, 300]) labels torch.Size([32, 300])
epoch: 0 batch: 1 inputs torch.Size([32, 300]) labels torch.Size([32, 300])
epoch: 0 batch: 2 inputs torch.Size([32, 300]) labels torch.Size([32, 300])
epoch: 0 batch: 3 inputs torch.Size([32, 300]) labels torch.Size([32, 300])
epoch: 0 batch: 4 inputs torch.Size([32, 300]) labels torch.Size([32, 300])
epoch: 0 batch: 5 inputs torch.Size([32, 300]) labels torch.Size([32, 300])
...

3. Use of TensorDataset

TensorDataset is a dataset class that can be used directly, and its source code is as follows:

class TensorDataset(Dataset):
    r"""Dataset wrapping tensors.

    Each sample will be retrieved by indexing tensors along the first dimension.

    Arguments:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """

    def __init__(self, *tensors):
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)

You can see that TensorDataset is a subclass of Dataset that already provides the __len__ method returning the dataset size and the __getitem__ method for fetching data by index, so it can be used directly.

Its structure is the same as that of the custom subclass above. The only difference is that TensorDataset requires the incoming data to be tensors (torch.Tensor), whereas a custom subclass can accept whatever types you choose.
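In particular, TensorDataset will not accept NumPy arrays directly (its assert calls .size(0), which NumPy arrays do not support as a method), so arrays must be converted first. A small sketch:

import numpy as np
import torch
from torch.utils.data import TensorDataset

arr = np.zeros((10, 3), dtype=np.float32)
lbl = np.zeros(10, dtype=np.float32)

# Convert NumPy arrays to tensors before wrapping them in a TensorDataset
ds = TensorDataset(torch.from_numpy(arr), torch.from_numpy(lbl))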

A full usage example:

import torch
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torch.autograd import Variable
import numpy as np
import pandas as pd

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # pick GPU if available

value_df = pd.read_csv('value.csv')  # hypothetical file name
value_array = np.array(value_df)
print("value_array.shape =", value_array.shape)  # (73700, 300)
value_size = value_array.shape[0]  # 73700
train_size = int(0.7*value_size)

train_array = value_array[:train_size]
train_tensor = torch.tensor(train_array, dtype=torch.float32).to(device)
train_label_array = value_array[60:train_size+60]
train_labels_tensor = torch.tensor(train_label_array, dtype=torch.float32).to(device)

train_dataset = TensorDataset(train_tensor, train_labels_tensor)
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=100,
                          shuffle=False,
                          num_workers=0)

for epoch in range(2):
    for i, data in enumerate(train_loader):
        inputs, labels = data
        inputs, labels = Variable(inputs), Variable(labels)
        print(epoch, i, "inputs", inputs.shape, "labels", labels.shape)

result:

0 0 inputs torch.Size([100, 300]) labels torch.Size([100, 300])
0 1 inputs torch.Size([100, 300]) labels torch.Size([100, 300])
0 2 inputs torch.Size([100, 300]) labels torch.Size([100, 300])
0 3 inputs torch.Size([100, 300]) labels torch.Size([100, 300])
0 4 inputs torch.Size([100, 300]) labels torch.Size([100, 300])
0 5 inputs torch.Size([100, 300]) labels torch.Size([100, 300])
0 6 inputs torch.Size([100, 300]) labels torch.Size([100, 300])
0 7 inputs torch.Size([100, 300]) labels torch.Size([100, 300])
0 8 inputs torch.Size([100, 300]) labels torch.Size([100, 300])
0 9 inputs torch.Size([100, 300]) labels torch.Size([100, 300])
0 10 inputs torch.Size([100, 300]) labels torch.Size([100, 300])
...

Summary

The above is drawn from personal experience; I hope it serves as a useful reference.