SoFunction
Updated on 2025-03-03

Computing a file's MD5 checksum in Golang: methods and efficiency comparison

I recently had a requirement to compute the MD5 checksums of multiple files in order to detect duplicates. Because there are many files, some of them fairly large, and the files to be processed are not yet available, I looked into efficiency in advance.

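I currently know of two ways to compute an MD5 checksum in Golang: read the whole file with ioutil.ReadAll and pass the bytes to md5.Sum, or stream the file into an md5.New() hash with io.Copy.

The source code for both is given below.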

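package main

import (
	"crypto/md5"
	"flag"
	"fmt"
	"io"
	"io/ioutil"
	"os"
)

// Command-line flags: which method to run, the file to hash, and the repeat count.
var which = flag.Bool("which", true, "")
var path = flag.String("path", "", "")
var cnt = flag.Int("cnt", 100, "")

// aaa reads the whole file into memory, then hashes it with md5.Sum.
func aaa() {
	f, err := os.Open(*path)
	if err != nil {
		fmt.Println("Open", err)
		return
	}
	defer f.Close()
	body, err := ioutil.ReadAll(f)
	if err != nil {
		fmt.Println("ReadAll", err)
		return
	}
	md5.Sum(body)
	//fmt.Printf("%x\n", md5.Sum(body))
}

// bbb streams the file into an MD5 hash with io.Copy.
func bbb() {
	f, err := os.Open(*path)
	if err != nil {
		fmt.Println("Open", err)
		return
	}
	defer f.Close()
	md5hash := md5.New()
	if _, err := io.Copy(md5hash, f); err != nil {
		fmt.Println("Copy", err)
		return
	}
	md5hash.Sum(nil)
	//fmt.Printf("%x\n", md5hash.Sum(nil))
}

func main() {
	flag.Parse()
	for i := 0; i < *cnt; i++ {
		if *which {
			aaa()
		} else {
			bbb()
		}
	}
}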

For reference, the md5 shell command can also compute the checksum:

md5 -- calculate a message-digest fingerprint (checksum) for a file
md5 [-pqrtx] [-s string] [file ...]

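The test file is a log file from a company project (7285957 bytes). A second file of exactly twice that size (14571914 bytes) was made by copying the log and appending the log to the copy: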

banjakukutekiiMac:shell panshiqu$ ls -an | grep by
-rw-r--r--   1 501  20   7285957 11 17 16:14 
banjakukutekiiMac:shell panshiqu$ cp  
banjakukutekiiMac:shell panshiqu$ cat  >> 
banjakukutekiiMac:shell panshiqu$ ls -an | grep by
-rw-r--r--   1 501  20   7285957 11 17 16:14 
-rw-r--r--   1 501  20  14571914 11 17 17:03 

The timing results are as follows:

banjakukutekiiMac:shell panshiqu$ time ./gomd5 -cnt=1 -which=true -path=""
real 0m0.027s
user 0m0.017s
sys 0m0.012s
banjakukutekiiMac:shell panshiqu$ time ./gomd5 -cnt=1 -which=true -path=""
real 0m0.048s
user 0m0.033s
sys 0m0.018s
banjakukutekiiMac:shell panshiqu$ time ./gomd5 -cnt=1 -which=false -path=""
real 0m0.018s
user 0m0.012s
sys 0m0.004s
banjakukutekiiMac:shell panshiqu$ time ./gomd5 -cnt=1 -which=false -path=""
real 0m0.031s
user 0m0.024s
sys 0m0.005s
banjakukutekiiMac:shell panshiqu$ time md5 
MD5 () = 9d79e19a00cef1ae1bb6518ca4adf9de
real 0m0.023s
user 0m0.019s
sys 0m0.006s
banjakukutekiiMac:shell panshiqu$ time md5 
MD5 () = 0a029a460a20e8dcb00d032d6fab74c6
real 0m0.042s
user 0m0.037s
sys 0m0.009s
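
The `time` figures above cover the whole process, including start-up, and each comes from a single run. As a complement, here is a minimal sketch that times both approaches inside one Go program. It assumes Go 1.16 or later (for `os.ReadFile` and `os.CreateTemp`); the names `sumReadAll`, `sumStream` and `timeBoth`, the throwaway file and the 8 MiB / 16 MiB sizes are illustrative choices, not part of the original test.

```go
package main

import (
	"bytes"
	"crypto/md5"
	"fmt"
	"io"
	"os"
	"time"
)

// sumReadAll loads the whole file into memory and hashes it in one call.
func sumReadAll(path string) ([md5.Size]byte, error) {
	body, err := os.ReadFile(path)
	if err != nil {
		return [md5.Size]byte{}, err
	}
	return md5.Sum(body), nil
}

// sumStream feeds the file to the hash in chunks via io.Copy.
func sumStream(path string) ([md5.Size]byte, error) {
	var out [md5.Size]byte
	f, err := os.Open(path)
	if err != nil {
		return out, err
	}
	defer f.Close()
	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return out, err
	}
	copy(out[:], h.Sum(nil))
	return out, nil
}

// timeBoth writes a throwaway file of n bytes, times both methods on it,
// and reports whether they produced the same digest.
func timeBoth(n int) (readAll, stream time.Duration, same bool, err error) {
	f, err := os.CreateTemp("", "md5demo-*")
	if err != nil {
		return 0, 0, false, err
	}
	defer os.Remove(f.Name())
	_, err = f.Write(bytes.Repeat([]byte{'x'}, n))
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	if err != nil {
		return 0, 0, false, err
	}

	start := time.Now()
	a, err := sumReadAll(f.Name())
	readAll = time.Since(start)
	if err != nil {
		return 0, 0, false, err
	}

	start = time.Now()
	b, err := sumStream(f.Name())
	stream = time.Since(start)
	if err != nil {
		return 0, 0, false, err
	}
	return readAll, stream, a == b, nil
}

func main() {
	for _, n := range []int{8 << 20, 16 << 20} { // 8 MiB and 16 MiB
		ra, st, same, err := timeBoth(n)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		fmt.Printf("%d bytes: ReadAll+Sum %v, io.Copy %v, digests equal: %v\n", n, ra, st, same)
	}
}
```

The absolute numbers depend on the machine, the disk and the file cache, so treat them as a rough comparison only. The streaming version also keeps memory use roughly constant, whereas the read-all version needs a buffer as large as the file.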

Summary:

Whichever method is used, the time grows with the file size. In the runs above, doubling the file size roughly doubled the time.

The io.Copy method (bbb, run with -which=false) was the fastest in these runs, so it is the one I recommend.
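
To connect this back to the original goal of finding duplicate files, here is a minimal sketch that hashes each file with the io.Copy approach and groups paths by checksum. The helper names (`md5Hex`, `md5HexString`, `fileMD5`, `groupByChecksum`) and the three demo files are hypothetical, and the sketch assumes Go 1.16 or later for `os.MkdirTemp` and `os.WriteFile`.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// md5Hex streams r through MD5 and returns the digest as a hex string.
func md5Hex(r io.Reader) (string, error) {
	h := md5.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// md5HexString hashes an in-memory string; handy for quick checks.
func md5HexString(s string) string {
	sum, _ := md5Hex(strings.NewReader(s)) // a strings.Reader does not return read errors
	return sum
}

// fileMD5 hashes the contents of the file at path.
func fileMD5(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	return md5Hex(f)
}

// groupByChecksum maps each checksum to the paths that share it.
func groupByChecksum(paths []string) (map[string][]string, error) {
	groups := make(map[string][]string)
	for _, p := range paths {
		sum, err := fileMD5(p)
		if err != nil {
			return nil, fmt.Errorf("hash %s: %w", p, err)
		}
		groups[sum] = append(groups[sum], p)
	}
	return groups, nil
}

func main() {
	// Demo data: three throwaway files, two with identical contents.
	dir, err := os.MkdirTemp("", "dupdemo-*")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer os.RemoveAll(dir)

	contents := map[string]string{"a.log": "same", "b.log": "same", "c.log": "different"}
	var paths []string
	for name, body := range contents {
		p := filepath.Join(dir, name)
		if err := os.WriteFile(p, []byte(body), 0o600); err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		paths = append(paths, p)
	}

	groups, err := groupByChecksum(paths)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for sum, files := range groups {
		if len(files) > 1 {
			fmt.Println(sum, "is shared by", len(files), "files")
		}
	}
}
```

With the demo data, exactly one checksum is reported as shared by 2 files. For a large set of files, comparing file sizes first and hashing only files of equal size would avoid much of the work.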

Supplement: the efficiency of MD5 calculation methods in Go

I studied how to compute an MD5 string in Go. The fastest approach I have found so far is to call md5.Sum(), which returns a 16-byte checksum, then map the high 4 bits and the low 4 bits of each byte to a hexadecimal character. Each input byte becomes two output bytes, giving 32 bytes in total, which are then converted to a string.

In the benchmark results below, FastMD5 takes at least 46% less time per operation than the other variants (205 ns/op versus 379 ns/op for the next fastest).

 
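const hextable = "0123456789abcdef"

// Author: pengpengzhou
// FastMD5 returns the MD5 checksum of str as a 32-character lowercase hex string.
func FastMD5(str string) string {
	src := md5.Sum([]byte(str)) // 16-byte checksum
	var dst = make([]byte, 32)
	j := 0
	for _, v := range src {
		dst[j] = hextable[v>>4]     // high 4 bits
		dst[j+1] = hextable[v&0x0f] // low 4 bits
		j += 2
	}
	return string(dst)
}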
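
As a quick sanity check on the nibble mapping described above, the following sketch compares a hand-written encoder with the standard library's `hex.EncodeToString`. The names `manualHex` and `hexDigits` are illustrative, and unlike FastMD5 the encoder here accepts input of any length.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

const hexDigits = "0123456789abcdef"

// manualHex encodes each byte as two lowercase hex characters:
// the high 4 bits first, then the low 4 bits.
func manualHex(src []byte) string {
	dst := make([]byte, len(src)*2)
	for i, v := range src {
		dst[i*2] = hexDigits[v>>4]
		dst[i*2+1] = hexDigits[v&0x0f]
	}
	return string(dst)
}

func main() {
	sum := md5.Sum([]byte("hello world")) // [16]byte
	a := manualHex(sum[:])
	b := hex.EncodeToString(sum[:])
	fmt.Println(a)      // 5eb63bbbe01eeed093cb22bb8f5acdc3
	fmt.Println(a == b) // true
}
```

Both encoders produce the same string, so any speed difference between the variants comes from how the string is built, not from the digest itself.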

go test benchmark results:

goos: linux
goarch: amd64
pkg: example
BenchmarkFastMD5-4       5564898               205 ns/op
BenchmarkV1-4            3461698               379 ns/op
BenchmarkV2-4            2277235               516 ns/op
BenchmarkV3-4            2158122               527 ns/op
PASS
ok      example 6.440s

The detailed code is as follows:

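package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
)

const hextable = "0123456789abcdef"

// FastMD5 uses md5.Sum and a manual hex-encoding loop.
func FastMD5(str string) string {
	src := md5.Sum([]byte(str))
	var dst = make([]byte, 32)
	j := 0
	for _, v := range src {
		dst[j] = hextable[v>>4]
		dst[j+1] = hextable[v&0x0f]
		j += 2
	}
	return string(dst)
}

// md5V1 uses md5.New with Write, then hex.EncodeToString.
func md5V1(str string) string {
	h := md5.New()
	h.Write([]byte(str))
	return hex.EncodeToString(h.Sum(nil))
}

// md5V2 uses md5.Sum, then fmt.Sprintf.
func md5V2(str string) string {
	data := []byte(str)
	has := md5.Sum(data)
	md5str := fmt.Sprintf("%x", has)
	return md5str
}

// md5V3 uses md5.New with io.WriteString, then fmt.Sprintf.
func md5V3(str string) string {
	w := md5.New()
	io.WriteString(w, str)
	md5str := fmt.Sprintf("%x", w.Sum(nil))
	return md5str
}

func main() {
	str := "Chinese"
	fmt.Println(FastMD5(str))
	fmt.Println(md5V1(str))
	fmt.Println(md5V2(str))
	fmt.Println(md5V3(str))
}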

The benchmark code, in a _test.go file of the same package:

package main

import (
	"testing"
)

var str = "Golang Chinese Tutorial"

func BenchmarkFastMD5(b *testing.B) {
	for i := 0; i < b.N; i++ {
		FastMD5(str)
	}
}

func BenchmarkV1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		md5V1(str)
	}
}

func BenchmarkV2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		md5V2(str)
	}
}

func BenchmarkV3(b *testing.B) {
	for i := 0; i < b.N; i++ {
		md5V3(str)
	}
}
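
The benchmarks above report only time per operation. Here is a sketch of how allocation counts could be checked as well, using `testing.Benchmark` from an ordinary program. The names `fastMD5`, `sprintfMD5`, `sink` and `measure` are illustrative stand-ins for the article's FastMD5 and md5V2, and storing results in a package-level `sink` is an added precaution so the compiler cannot discard the calls.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"testing"
)

const hexDigits = "0123456789abcdef"

// fastMD5 mirrors the article's FastMD5: md5.Sum plus a manual hex loop.
func fastMD5(s string) string {
	sum := md5.Sum([]byte(s))
	var dst [md5.Size * 2]byte
	for i, v := range sum {
		dst[i*2] = hexDigits[v>>4]
		dst[i*2+1] = hexDigits[v&0x0f]
	}
	return string(dst[:])
}

// sprintfMD5 mirrors md5V2: md5.Sum formatted with fmt.Sprintf.
func sprintfMD5(s string) string {
	return fmt.Sprintf("%x", md5.Sum([]byte(s)))
}

// sink keeps results alive so the compiler cannot discard the calls.
var sink string

// measure runs f under the testing package's benchmark driver.
func measure(f func(string) string, input string) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = f(input)
		}
	})
}

func main() {
	input := "Golang Chinese Tutorial"
	cases := []struct {
		name string
		f    func(string) string
	}{
		{"fastMD5", fastMD5},
		{"sprintfMD5", sprintfMD5},
	}
	for _, c := range cases {
		r := measure(c.f, input)
		fmt.Printf("%-12s %8d ns/op %4d allocs/op\n", c.name, r.NsPerOp(), r.AllocsPerOp())
	}
}
```

The figures will differ between machines and Go versions. With `go test`, the `-benchmem` flag gives the same allocation columns for the Benchmark functions above.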

The above is based on my personal experience. I hope it serves as a useful reference, and thank you for your support. If there are any mistakes or anything I have not fully considered, corrections are welcome.