Preface
Recently, there is a requirement for the project, which requires comparing whether the contents of two files of any size are the same. The requirements are as follows:
- The project is .NET CORE, so use C# to write a comparison method
- The file size is arbitrary, so you cannot read all the file contents into memory for comparison (more professional, you need to use a non-cache comparison method)
- No dependency on third-party libraries
- The faster the better
In order to choose the optimal solution, I built a simple command line project and prepared two files of 912MB size, and the contents of these two files are exactly the same. At the end of this article, you can see the code of the Main method of the project.
Let’s start trying various comparison methods and select the best solution:
To compare whether the two files are exactly the same, the first thing that comes to mind is to use a hash algorithm (such as MD5, SHA) to calculate the hash values of the two files and then compare them.
Less nonsense, roll up your sleeves and write a comparison method for MD5:
/// <summary> /// MD5 /// </summary> /// <param name="file1"></param> /// <param name="file2"></param> /// <returns></returns> private static bool CompareByMD5(string file1, string file2) { // Use .NET built-in MD5 library using (var md5 = ()) { byte[] one, two; using (var fs1 = (file1, )) { // Read file contents with FileStream and calculate HASH value one = (fs1); } using (var fs2 = (file2, )) { // Read file contents with FileStream and calculate HASH value two = (fs2); } // Convert MD5 results (byte array) into string for comparison return (one) == (two); } }
Comparison results:
Method: CompareByMD5, Identical: True. Elapsed: 00:00:05.7933178
It took 5.79 seconds and it felt pretty good. However, is this the best solution?
In fact, if we think about it carefully, the answer should be no.
Because any hash algorithm essentially performs a certain calculation of bytes, and the calculation process takes time.
Many download websites provide the hash value of the downloaded file, because the downloaded source file itself will not change. You only need to calculate the hash value of the source file once and provide it to the user for verification.
In our needs, both files are not fixed, so it is not appropriate to calculate the hash value of the two files each time.
Therefore, the hash comparison scheme is PASS.
My past experience in finding the optimal solution of the algorithm is: Go to * to find :)
After my hard work, I found a very question-oriented answer:How to compare 2 files fast using .NET?
Dezan has at most one answer, and I transformed the code and put it into the project:
/// <summary> /// /a/1359947 /// </summary> /// <param name="file1"></param> /// <param name="file2"></param> /// <returns></returns> private static bool CompareByToInt64(string file1, string file2) { const int BYTES_TO_READ = sizeof(Int64); // Read 8 bytes each time int iterations = (int)((double)new FileInfo(file1).Length / BYTES_TO_READ); // Calculate the number of reads using (FileStream fs1 = (file1, )) using (FileStream fs2 = (file2, )) { byte[] one = new byte[BYTES_TO_READ]; byte[] two = new byte[BYTES_TO_READ]; for (int i = 0; i < iterations; i++) { // Loop to the byte array (one, 0, BYTES_TO_READ); (two, 0, BYTES_TO_READ); // Convert to Int64 for numerical comparison if (BitConverter.ToInt64(one, 0) != BitConverter.ToInt64(two, 0)) return false; } } return true; }
The basic principle of this method is to read two files in a loop, read 8 bytes each time, convert them to Int64, and then compare values. So how efficient is it?
Method: CompareByToInt64, Identical: True. Elapsed: 00:00:08.0918099
What? 8 seconds! It's even slower than MD5? Isn't this the answer that has the most praises from SO? How could this happen?
In fact, it is not difficult to think of the reason after analyzing it. Because only 8 bytes are read at a time, the program frequently performs IO operations, resulting in poor performance. It seems that the answers on SO cannot be superstitious!
Then the direction of optimization becomes how to reduce the loss caused by IO operations.
Since 8 bytes are too few at a time, we define a larger byte array, such as 1024 bytes. Each time, 1024 bytes are read into the array, and then the byte array is compared.
But this brings up a new problem, which is how to quickly compare whether two byte arrays are the same?
The first thing I thought of is what I used in the MD5 method -- converting byte arrays into strings for comparison:
/// <summary> /// Read into byte array to compare (convert to String comparison)/// </summary> /// <param name="file1"></param> /// <param name="file2"></param> /// <returns></returns> private static bool CompareByString(string file1, string file2) { const int BYTES_TO_READ = 1024 * 10; using (FileStream fs1 = (file1, )) using (FileStream fs2 = (file2, )) { byte[] one = new byte[BYTES_TO_READ]; byte[] two = new byte[BYTES_TO_READ]; while (true) { int len1 = (one, 0, BYTES_TO_READ); int len2 = (two, 0, BYTES_TO_READ); if ((one) != (two)) return false; if (len1 == 0 || len2 == 0) break; // Both files are read to the end and exit the while loop } } return true; }
result:
Method: CompareByString, Identical: True. Elapsed: 00:00:07.8088732
It took nearly 8 seconds, which is not much better than the previous method.
Let’s analyze the reason. In each loop, the conversion of strings is a very time-consuming operation. So is there any way to compare byte arrays without type conversion?
I thought of a method in LINQ that compares sequences SequenceEqual, we try to compare using this method:
/// <summary> /// Read into byte array to compare (using LINQ's SequenceEqual comparison)/// </summary> /// <param name="file1"></param> /// <param name="file2"></param> /// <returns></returns> private static bool CompareBySequenceEqual(string file1, string file2) { const int BYTES_TO_READ = 1024 * 10; using (FileStream fs1 = (file1, )) using (FileStream fs2 = (file2, )) { byte[] one = new byte[BYTES_TO_READ]; byte[] two = new byte[BYTES_TO_READ]; while (true) { int len1 = (one, 0, BYTES_TO_READ); int len2 = (two, 0, BYTES_TO_READ); if (!(two)) return false; if (len1 == 0 || len2 == 0) break; // Both files are read to the end and exit the while loop } } return true; }
result:
Method: CompareBySequenceEqual, Identical: True. Elapsed: 00:00:08.2174360
It is actually slower than the first two (actually this is also the slowest of all solutions), and LINQ's SequenceEqual is not born for efficiency.
So how about we use the while loop to compare byte arrays without those fancy functions, and how about we honestly use while loops to compare byte arrays?
/// <summary> /// Read into byte array to compare (while loop to compare byte array)/// </summary> /// <param name="file1"></param> /// <param name="file2"></param> /// <returns></returns> private static bool CompareByByteArry(string file1, string file2) { const int BYTES_TO_READ = 1024 * 10; using (FileStream fs1 = (file1, )) using (FileStream fs2 = (file2, )) { byte[] one = new byte[BYTES_TO_READ]; byte[] two = new byte[BYTES_TO_READ]; while (true) { int len1 = (one, 0, BYTES_TO_READ); int len2 = (two, 0, BYTES_TO_READ); int index = 0; while (index < len1 && index < len2) { if (one[index] != two[index]) return false; index++; } if (len1 == 0 || len2 == 0) break; } } return true; }
turn out....
Method: CompareByByteArry, Identical: True. Elapsed: 00:00:01.5356821
1.53 seconds! Big breakthrough! It seems that sometimes the clumsy method will work better!
At this point, comparing two files with more than 900 MB takes about 1.5 seconds. Are readers satisfied with this method?
No! I am not satisfied! I believe that through hard work, I will definitely find a faster way!
Similarly, .NET CORE is constantly optimizing to write high-performance code.
So, how do we continue to optimize our code?
I suddenly thought of a new value type added in C# 7.2: Span<T>, which is used to represent a continuous memory area and provides a series of methods to manipulate the area.
For our needs, since we will not change the value of the array, we can use another read-only type ReadOnlySpan<T> to pursue higher efficiency.
Modify the code and use ReadOnlySpan<T>:
/// <summary> /// Read into byte array to compare (ReadOnlySpan)/// </summary> /// <param name="file1"></param> /// <param name="file2"></param> /// <returns></returns> private static bool CompareByReadOnlySpan(string file1, string file2) { const int BYTES_TO_READ = 1024 * 10; using (FileStream fs1 = (file1, )) using (FileStream fs2 = (file2, )) { byte[] one = new byte[BYTES_TO_READ]; byte[] two = new byte[BYTES_TO_READ]; while (true) { int len1 = (one, 0, BYTES_TO_READ); int len2 = (two, 0, BYTES_TO_READ); // Byte array can be converted directly to ReadOnlySpan if (!((ReadOnlySpan<byte>)one).SequenceEqual((ReadOnlySpan<byte>)two)) return false; if (len1 == 0 || len2 == 0) break; // Both files are read to the end and exit the while loop } } return true; }
The core is the SequenceEqual method used for comparison. This method is an extension method of ReadOnlySpan. Note that it is just that the method name is the same as in LINQ and the implementation is completely different.
So how does this method perform?
Method: CompareByReadOnlySpan, Identical: True. Elapsed: 00:00:00.9287703
In less than a second!
Compared with the previous one, the speed has increased by about 40%!
I personally feel very satisfied with this result. If you have a faster method, please give me some advice and I am very welcome!
Regarding the Span<T> structure type, readers who are interested can browse the article, which has a very detailed introduction.
postscript
The code in the article is only due to experimental nature, and can continue to optimize details in actual applications, such as:
- If the two files have different sizes, return false directly
- If the paths of the two files are the same, return true directly
- ...
Source code of Main method for experimental engineering:
static void Main(string[] args) { string file1 = @"C:\Users\WAKU\Desktop\"; string file2 = @"C:\Users\WAKU\Desktop\"; var methods = new Func<string, string, bool>[] { CompareByMD5, CompareByToInt64, CompareByByteArry, CompareByReadOnlySpan }; foreach (var method in methods) { var sw = (); bool identical = method(file1, file2); ("Method: {0}, Identical: {1}. Elapsed: {2}", , identical, ); } }
Summarize
The above is the entire content of this article. I hope that the content of this article has certain reference value for your study or work. Thank you for your support.