
TensorFlow mini-batch data reading tutorial

In previous posts, reading data with tf meant fetching one record at a time; in practice you usually want to fetch a small batch at a time. In tf, the obvious change this brings is that the tensor's rank changes. The face dataset I am currently using consists of 92*112 grayscale images, so at the very beginning each fetched image, after reshape, is a rank-2 tensor of size 92*112 (if you account for channels, you can also reshape it to rank 3, i.e. 92*112*1). With batching, say a batch size of 5, the fetched tensor's rank becomes 3 and its size 5*92*112.
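As a quick illustration of that rank change, here is a minimal sketch (dummy zero-filled tensors, no real data):

import tensorflow as tf

# a dummy 92*112 grayscale image: rank 2
single_img = tf.zeros([92, 112], dtype=tf.uint8)
print(single_img.shape) # (92, 112)

# stacking 5 of them mimics a batch of 5: rank 3
batched = tf.stack([single_img] * 5)
print(batched.shape) # (5, 92, 112)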

The code below follows the general pattern from the official examples: reading data is split into two parts, one function responsible for reading and decoding records, and another responsible for producing batches.

import tensorflow as tf

def read_data(fileNameQue):

  reader = tf.TFRecordReader()
  key, value = reader.read(fileNameQue)
  features = tf.parse_single_example(value, features={'label': tf.FixedLenFeature([], tf.int64),
                            'img': tf.FixedLenFeature([], tf.string),})
  img = tf.decode_raw(features["img"], tf.uint8)
  img = tf.reshape(img, [92, 112]) # restore the original size of the image
  label = tf.cast(features["label"], tf.int32)

  return img, label

def batch_input(filename, batchSize):

  fileNameQue = tf.train.string_input_producer([filename], shuffle=True)
  img, label = read_data(fileNameQue) # fetch image and label
  min_after_dequeue = 1000
  capacity = min_after_dequeue + 3 * batchSize
  # Prefetch images and labels and shuffle them to form a batch; the tensor rank changes here, gaining a batch-size dimension
  exampleBatch, labelBatch = tf.train.shuffle_batch([img, label], batch_size=batchSize, capacity=capacity,
                                                    min_after_dequeue=min_after_dequeue)
  return exampleBatch,labelBatch

if __name__ == "__main__":

  init = tf.initialize_all_variables()
  exampleBatch, labelBatch = batch_input("./data/", batchSize=10)

  with tf.Session() as sess:

    sess.run(init)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(100):
      example, label = sess.run([exampleBatch, labelBatch])
      print(example.shape) # e.g. (10, 92, 112)

    coord.request_stop()
    coord.join(threads)

Reading and decoding the data is basically the same as before; just use a different reader and decoder for each dataset format. What comes after is batch production, whose core is the tf.train.shuffle_batch function. It works like a reservoir: the first argument is the reservoir's inlet, i.e. the records read in one by one; batch_size is naturally the batch size; capacity is the capacity of the reservoir, i.e. how many samples it can hold; min_after_dequeue specifies how many samples must remain in the pool after a dequeue, so that batches can still be sampled randomly from it. Obviously, capacity must be greater than min_after_dequeue; the official docs recommend capacity = min_after_dequeue + (num_threads + a small safety margin) * batch_size. There is also a num_threads parameter, which sets the number of threads used.
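Plugging that recommendation into code (the numbers below are illustrative assumptions, not from the original post):

batch_size = 10
num_threads = 4   # assumed thread count
safety_margin = 3 # the "small safety margin"; value assumed
min_after_dequeue = 1000
capacity = min_after_dequeue + (num_threads + safety_margin) * batch_size
print(capacity) # 1070

Passing num_threads=num_threads to tf.train.shuffle_batch then has several threads fill the queue in parallel.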

The larger the value of min_after_dequeue, the better the random shuffling, but the more memory it consumes.
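As an aside, parse_single_example above assumes the TFRecords file was written with matching feature keys ('img' holding raw uint8 bytes, 'label' an int64). A minimal writer sketch for such a format (the conversion script is not part of this post; the helper name and file path here are hypothetical):

import tensorflow as tf

# hypothetical helper: serialize one face image (raw uint8 bytes) and its label
def write_example(writer, img_bytes, label):
  example = tf.train.Example(features=tf.train.Features(feature={
      'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
      'img': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_bytes])),
  }))
  writer.write(example.SerializeToString())

# writer = tf.python_io.TFRecordWriter("./data/faces.tfrecords") # hypothetical path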

That's all for this TensorFlow mini-batch data reading tutorial. I hope it gives you a useful reference, and I hope you will continue to support me.