
A brief discussion on how Golang reads file content (7 types)

This article aims to give a quick tour of the options the Go standard library offers for reading files.

In Go (and, for that matter, most low-level languages and some dynamic languages such as Node), file reads return a stream of bytes. One advantage of not automatically converting everything into a string is that it avoids expensive string allocations, which would add GC pressure.

To keep this article simple, I will use string(arrayOfBytes) to convert a byte slice to a string. However, this should not be taken as general advice for production code.
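For illustration, here is a minimal sketch showing that string(arrayOfBytes) allocates a new string and copies the bytes, which is where the cost mentioned above comes from:

package main

import "fmt"

func main() {
    b := []byte{'h', 'i'}
    s := string(b) // allocates a new string and copies the bytes
    b[0] = 'H'     // mutating the slice does not affect the string
    fmt.Println(s, string(b)) // prints: hi Hi
}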

1. Read the entire file into memory

First, the standard library provides a variety of functions and utilities for reading file data. We will start with the basic case covered by the os package. This comes with two prerequisites:

  • The file must fit in memory
  • We need to know the file size in advance in order to instantiate a buffer that is sufficient to hold it.

With a handle to the file, we can query its size and instantiate a byte slice.

package main

import (
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt") // placeholder file name
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    fileinfo, err := file.Stat()
    if err != nil {
        fmt.Println(err)
        return
    }

    filesize := fileinfo.Size()
    buffer := make([]byte, filesize)

    bytesread, err := file.Read(buffer)
    if err != nil {
        fmt.Println(err)
        return
    }

    fmt.Println("bytes read: ", bytesread)
    fmt.Println("bytestream to string: ", string(buffer))
}

2. Read the file in chunks

Although most of the time it is possible to read a file all at once, sometimes we want a more memory-conservative method: for example, reading the file in chunks of some fixed size, processing each chunk, and repeating until the end. In the following example, a buffer size of 100 bytes is used.

package main

import (
    "fmt"
    "io"
    "os"
)

const BufferSize = 100

func main() {
    file, err := os.Open("filetoread.txt") // placeholder file name
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    buffer := make([]byte, BufferSize)

    for {
        bytesread, err := file.Read(buffer)
        if err != nil {
            if err != io.EOF {
                fmt.Println(err)
            }
            break
        }
        fmt.Println("bytes read: ", bytesread)
        fmt.Println("bytestream to string: ", string(buffer[:bytesread]))
    }
}

Compared to reading the file in full, the main differences are:

  • We read until we get the EOF marker, so we added a specific check for err == io.EOF
  • We define the size of the buffer so we can control the desired "block" size. If the operating system correctly caches the file being read, it can improve performance when used correctly.
  • If the file size is not an integer multiple of the buffer size, the last iteration will only add the remaining bytes to the buffer, hence the call to buffer[:bytesread]. Under normal circumstances, bytesread will be the same as the buffer size.

For each iteration of the loop, the internal file pointer is updated. The next Read returns data starting at that offset, up to the buffer size. This pointer is not a language construct but belongs to the operating system. On Linux, the pointer is a property of the file descriptor that was created. All read/Read calls (in Ruby/Go respectively) are internally translated into system calls and sent to the kernel, and the kernel manages this pointer.
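This pointer can also be observed from Go. Here is a minimal sketch (the file name is a placeholder) that reads one chunk and then asks where the pointer ended up; Seek with an offset of 0 relative to io.SeekCurrent does not move the pointer, it only reports it:

package main

import (
    "fmt"
    "io"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt") // placeholder file name
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    buffer := make([]byte, 100)
    if _, err := file.Read(buffer); err != nil && err != io.EOF {
        fmt.Println(err)
        return
    }

    // Ask for the current offset without moving the file pointer.
    pos, err := file.Seek(0, io.SeekCurrent)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println("offset after one read: ", pos) // 100 for files of at least 100 bytes
}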

3. Read file chunks concurrently

What should we do if we want to speed up the processing of the chunks above? One way is to use multiple goroutines! Compared to reading the chunks serially, the extra thing we need is the offset for each goroutine. Note that ReadAt behaves slightly differently from Read when the size of the target buffer is larger than the number of bytes remaining.

Also note that I don't limit the number of goroutines here; it is determined only by the file size and the buffer size. In practice, this number probably needs an upper limit (see the semaphore sketch at the end of this section).

package main

import (
    "fmt"
    "os"
    "sync"
)

const BufferSize = 100

type chunk struct {
    bufsize int
    offset  int64
}

func main() {
    file, err := os.Open("filetoread.txt") // placeholder file name
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    fileinfo, err := file.Stat()
    if err != nil {
        fmt.Println(err)
        return
    }

    filesize := int(fileinfo.Size())
    // Number of goroutines we need to spawn.
    concurrency := filesize / BufferSize
    // buffer sizes that each of the goroutines below should use. ReadAt
    // returns an error if the buffer size is larger than the bytes returned
    // from the file.
    chunksizes := make([]chunk, concurrency)

    // All buffer sizes are the same in the normal case. Offsets depend on the
    // index. The second goroutine should start at 100, for example, given our
    // buffer size of 100.
    for i := 0; i < concurrency; i++ {
        chunksizes[i].bufsize = BufferSize
        chunksizes[i].offset = int64(BufferSize * i)
    }

    // check for any leftover bytes. Add the residual number of bytes as the
    // last chunk size.
    if remainder := filesize % BufferSize; remainder != 0 {
        c := chunk{bufsize: remainder, offset: int64(concurrency * BufferSize)}
        concurrency++
        chunksizes = append(chunksizes, c)
    }

    var wg sync.WaitGroup
    wg.Add(concurrency)

    for i := 0; i < concurrency; i++ {
        go func(chunksizes []chunk, i int) {
            defer wg.Done()

            chunk := chunksizes[i]
            buffer := make([]byte, chunk.bufsize)
            bytesread, err := file.ReadAt(buffer, chunk.offset)

            if err != nil {
                fmt.Println(err)
                return
            }

            fmt.Println("bytes read, string(bytestream): ", bytesread)
            fmt.Println("bytestream to string: ", string(buffer))
        }(chunksizes, i)
    }

    wg.Wait()
}

There is much more going on in this approach than in any of the previous ones:

  • I'm spawning a specific number of goroutines, depending on the file size and the buffer size (100 in this case).
  • We need a way to make sure we are "waiting" for all goroutines to finish. In this example, I'm using a wait group.
  • Each goroutine signals that it is done from within, rather than breaking out of a for loop. Because we call wg.Done() in a deferred call, it runs only when each goroutine returns.

Note: Always check the number of bytes returned, and re-slice the output buffer.
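As noted above, the number of goroutines is otherwise unbounded. A common way to cap it is a buffered channel used as a counting semaphore. This is a minimal sketch of how the spawning loop in the example above could be adapted; the limit of 10 concurrent reads is an arbitrary assumption, and file, wg, concurrency, and chunksizes are the same variables as before:

// A buffered channel used as a counting semaphore: at most cap(sem)
// reads are in flight at any time. The limit of 10 is an assumption.
sem := make(chan struct{}, 10)

for i := 0; i < concurrency; i++ {
    go func(chunksizes []chunk, i int) {
        defer wg.Done()

        sem <- struct{}{}        // acquire a slot (blocks when the channel is full)
        defer func() { <-sem }() // release the slot when this goroutine returns

        chunk := chunksizes[i]
        buffer := make([]byte, chunk.bufsize)
        bytesread, err := file.ReadAt(buffer, chunk.offset)
        if err != nil {
            fmt.Println(err)
            return
        }

        fmt.Println("bytes read: ", bytesread)
        fmt.Println("bytestream to string: ", string(buffer))
    }(chunksizes, i)
}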

4. Scan line by line

Reading files with Read() can go a long way, but sometimes you need more convenience. In Ruby, IO functions such as each_line, each_char, and each_codepoint are commonly used. We can achieve a similar purpose using the Scanner type and its associated functions in the bufio package.

The Scanner type takes a "split" function and advances a pointer based on that function. For example, on each iteration, the built-in ScanLines split function advances the pointer until the next newline character.

At each step, the type also exposes methods for obtaining the byte slice/string between the start and end positions.

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt") // placeholder file name
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanLines)

    // Scan() returns a boolean based on whether there's a next instance of
    // the `\n` character in the IO stream. This step also advances the
    // internal pointer to the next position (after '\n') if it did find
    // that token.
    for {
        read := scanner.Scan()
        if !read {
            break
        }
        fmt.Println("read byte array: ", scanner.Bytes())
        fmt.Println("read string: ", scanner.Text())
    }
}

Therefore, to read the entire file line by line in this way, you can use something like this:

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt") // placeholder file name
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanLines)

    // This is our buffer now
    var lines []string

    for scanner.Scan() {
        lines = append(lines, scanner.Text())
    }

    fmt.Println("read lines:")
    for _, line := range lines {
        fmt.Println(line)
    }
}

5. Scan word by word

The bufio package contains basic predefined splitting functions:

  • ScanLines (default)
  • ScanWords
  • ScanRunes (very useful for traversing UTF-8 code points rather than bytes; see the sketch after this list)
  • ScanBytes
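As a quick illustration of ScanRunes, here is a minimal sketch that walks a made-up string containing a multi-byte code point one rune at a time:

package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    // With ScanRunes, each Scan() yields one UTF-8 code point,
    // even when it spans multiple bytes (as "é" does).
    scanner := bufio.NewScanner(strings.NewReader("héllo"))
    scanner.Split(bufio.ScanRunes)

    for scanner.Scan() {
        fmt.Printf("%q ", scanner.Text())
    }
    fmt.Println() // prints: "h" "é" "l" "l" "o"
}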

So, to read a file and build a list of the words in it, you can use something like this:

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt") // placeholder file name
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanWords)

    var words []string

    for scanner.Scan() {
        words = append(words, scanner.Text())
    }

    fmt.Println("word list:")
    for _, word := range words {
        fmt.Println(word)
    }
}

The ScanBytes split function will provide the same output as the earlier Read() example. One major difference between the two is the dynamic allocation that happens in the scanner each time we append to the byte/string slice. This can be avoided by techniques such as pre-initializing the buffer to a specific length and growing it only when the previous limit is reached. Using the same example as above:

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("filetoread.txt") // placeholder file name
    if err != nil {
        fmt.Println(err)
        return
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanWords)

    // initial size of our wordlist
    bufferSize := 50
    words := make([]string, bufferSize)
    pos := 0

    for scanner.Scan() {
        if err := scanner.Err(); err != nil {
            // This error is a non-EOF error. End the iteration if we
            // encounter an error
            fmt.Println(err)
            break
        }

        words[pos] = scanner.Text()
        pos++

        if pos >= len(words) {
            // expand the slice by bufferSize again
            newbuf := make([]string, bufferSize)
            words = append(words, newbuf...)
        }
    }

    fmt.Println("word list:")
    // we are iterating only up to the value of "pos" because the slice may
    // be larger than the number of words (we grow the length by a constant
    // value), or the scanner loop might have terminated prematurely due to
    // an error. In that case "pos" holds the index of the last successful
    // update.
    for _, word := range words[:pos] {
        fmt.Println(word)
    }
}

So we end up doing far fewer slice "growth" operations, but depending on the buffer size and the number of words in the file, we may end up with some empty slots at the end. This is a tradeoff.

6. Split a long string into words

bufio.NewScanner takes as a parameter any type that satisfies the io.Reader interface, which means it works with any type that defines a Read method. The strings.NewReader utility function in the standard library returns a reader type from a string. To read words from a string, we can combine the two:

package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    longstring := "This is a very long string. Not."
    var words []string

    scanner := bufio.NewScanner(strings.NewReader(longstring))
    scanner.Split(bufio.ScanWords)

    for scanner.Scan() {
        words = append(words, scanner.Text())
    }

    fmt.Println("word list:")
    for _, word := range words {
        fmt.Println(word)
    }
}

7. Scan a comma-separated string

Manually parsing CSV files/strings through basic Read() calls or the Scanner type is complicated, because according to the ScanWords split function, a "word" is a run of runes bounded by unicode spaces. Reading individual runes while tracking the buffer size and position (as done in lexical analysis) is too much work.

But this can be avoided. We can define a new split function that reads characters until the reader encounters a comma, and then returns that chunk when Text() or Bytes() is called. The signature of a bufio.SplitFunc looks like this:

type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)

For simplicity, I show an example that reads a string instead of a file. A simple scanner for a CSV string, using the signature above, could be:

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "strings"
)

func main() {
    csvstring := "name, age, occupation"

    // An anonymous function declaration to avoid repeating main()
    ScanCSV := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
        commaidx := bytes.IndexByte(data, ',')
        if commaidx > 0 {
            // we need to return the next position
            buffer := data[:commaidx]
            return commaidx + 1, bytes.TrimSpace(buffer), nil
        }

        // if we are at the end of the string, just return the entire buffer
        if atEOF {
            // but only do that when there is some data. If not, this might mean
            // that we've reached the end of our input CSV string
            if len(data) > 0 {
                return len(data), bytes.TrimSpace(data), nil
            }
        }

        // when 0, nil, nil is returned, this is a signal to the Scanner to
        // read more data in from the input reader. In this case, the input
        // is our string reader and this pretty much will never occur.
        return 0, nil, nil
    }

    scanner := bufio.NewScanner(strings.NewReader(csvstring))
    scanner.Split(ScanCSV)

    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
}


We've seen a variety of ways to read files. But what if you just want to read a file into a buffer in one line? ioutil is a standard-library package containing functions that make exactly that a one-liner.

Read the entire file

package main

import (
    "fmt"
    "io/ioutil"
    "log"
)

func main() {
    bytes, err := ioutil.ReadFile("filetoread.txt") // placeholder file name
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println("Bytes read: ", len(bytes))
    fmt.Println("String read: ", string(bytes))
}

This is closer to what we see in high-level scripting languages.

Read an entire directory of files

Needless to say, don't run this on a directory containing large files.

package main

import (
    "fmt"
    "io/ioutil"
    "log"
)

func main() {
    filelist, err := ioutil.ReadDir(".")
    if err != nil {
        log.Fatal(err)
    }

    for _, fileinfo := range filelist {
        if fileinfo.Mode().IsRegular() {
            bytes, err := ioutil.ReadFile(fileinfo.Name())
            if err != nil {
                log.Fatal(err)
            }
            fmt.Println("Bytes read: ", len(bytes))
            fmt.Println("String read: ", string(bytes))
        }
    }
}


This is the end of this article on how Golang reads file content (7 types).