Using split to cut log files on Linux: a detailed explanation

1. Introduction to the split command

split is a command-line tool on Unix and Unix-like systems such as Linux that breaks a large file into smaller pieces. It is especially useful for handling large log files, transferring data, and working within storage limits.

2. Getting help for the split command

2.1 split command help information

In a terminal, run split with --help to print its basic help information.

root@jeven01:~# split --help
Usage: split [OPTION]... [FILE [PREFIX]]
Output pieces of FILE to PREFIXaa, PREFIXab, ...;
default size is 1000 lines, and default PREFIX is 'x'.

With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   generate suffixes of length N (default 2)
      --additional-suffix=SUFFIX  append an additional SUFFIX to file names
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of records per output file
  -d                      use numeric suffixes starting at 0, not alphabetic
      --numeric-suffixes[=FROM]  same as -d, but allow setting the start value
  -x                      use hex suffixes starting at 0, not alphabetic
      --hex-suffixes[=FROM]  same as -x, but allow setting the start value
  -e, --elide-empty-files  do not generate empty output files with '-n'
      --filter=COMMAND    write to shell COMMAND; file name is $FILE
  -l, --lines=NUMBER      put NUMBER lines/records per output file
  -n, --number=CHUNKS     generate CHUNKS output files; see explanation below
  -t, --separator=SEP     use SEP instead of newline as the record separator;
                            '\0' (zero) specifies the NUL character
  -u, --unbuffered        immediately copy input to output with '-n r/...'
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

The SIZE argument is an integer and optional unit (example: 10K is 10*1024).
Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000).
Binary prefixes can be used, too: KiB=K, MiB=M, and so on.

CHUNKS may be:
  N       split into N files based on size of input
  K/N     output Kth of N to stdout
  l/N     split into N files without splitting lines/records
  l/K/N   output Kth of N to stdout without splitting lines/records
  r/N     like 'l' but use round robin distribution
  r/K/N   likewise but only output Kth of N to stdout

GNU coreutils online help: </software/coreutils/>
Full documentation </software/coreutils/split>
or available locally via: info '(coreutils) split invocation'

2.2 Explanation of split command options

The table below summarizes the options listed in the split help output:

| Option | Description |
| --- | --- |
| -a, --suffix-length=N | Generate suffixes of length N (default 2) |
| --additional-suffix=SUFFIX | Append an additional SUFFIX to the output file names |
| -b, --bytes=SIZE | Put SIZE bytes in each output file |
| -C, --line-bytes=SIZE | Put at most SIZE bytes of whole records (lines) in each output file |
| -d | Use numeric suffixes starting at 0 instead of alphabetic ones |
| --numeric-suffixes[=FROM] | Same as -d, but allows setting the start value |
| -x | Use hexadecimal suffixes starting at 0 instead of alphabetic ones |
| --hex-suffixes[=FROM] | Same as -x, but allows setting the start value |
| -e, --elide-empty-files | Do not generate empty output files when using '-n' |
| --filter=COMMAND | Write each piece to the shell command COMMAND; the file name is available as $FILE |
| -l, --lines=NUMBER | Put NUMBER lines/records in each output file |
| -n, --number=CHUNKS | Generate CHUNKS output files; see below for details |
| -t, --separator=SEP | Use SEP instead of newline as the record separator; '\0' specifies the NUL character |
| -u, --unbuffered | Copy input to output immediately when using '-n r/...' |
| --verbose | Print a diagnostic just before each output file is opened |
| --help | Display help information and exit |
| --version | Output version information and exit |
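As an illustration of the --filter option listed above, here is a minimal sketch that compresses each piece as it is written. It assumes GNU coreutils and gzip are installed; the file name app.log and the prefix part_ are hypothetical and do not come from the article.

```shell
# Assumption: GNU split and gzip; app.log and part_ are made-up names.
workdir=$(mktemp -d)
cd "$workdir"

# Build a 2500-line sample log.
seq -f 'event %g' 1 2500 > app.log

# Each 1000-line piece is piped to gzip. split sets $FILE to the piece's name.
# Single quotes keep $FILE from being expanded by the calling shell.
split -l 1000 --filter='gzip > "$FILE.gz"' app.log part_

ls part_*.gz

# Decompressing the pieces in name order reproduces the original file.
gzip -dc part_*.gz | cmp - app.log && echo "pieces match the original"
```

With --filter, split itself creates no output files; whatever the command does with its standard input is the result, so the pieces here exist only in compressed form.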

SIZE parameter

The SIZE parameter is an integer with an optional unit (for example, 10K means 10*1024).
The units can be K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, … (powers of 1000).
Binary prefixes can also be used: KiB=K, MiB=M, and so on.
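The difference between the two unit families is easy to check on a small file. The following sketch assumes GNU coreutils; sample.bin and the kib_ and kb_ prefixes are hypothetical names chosen for the illustration.

```shell
# Assumption: GNU split; sample.bin, kib_ and kb_ are made-up names.
workdir=$(mktemp -d)
cd "$workdir"

# 30000 zero bytes of sample data.
head -c 30000 /dev/zero > sample.bin

split -b 10K  sample.bin kib_   # 10K  = 10 * 1024 = 10240 bytes per piece
split -b 10KB sample.bin kb_    # 10KB = 10 * 1000 = 10000 bytes per piece

# Print the byte count of every piece.
wc -c kib_* kb_*
```

The last kib_ piece holds the remaining 30000 - 2*10240 = 9520 bytes, while the three kb_ pieces are exactly 10000 bytes each.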

CHUNKS parameter

N: Split into N files based on the size of the input
K/N: Write only the Kth of N chunks to standard output
l/N: Split into N files without splitting lines/records
l/K/N: Write only the Kth of N chunks to standard output without splitting lines/records
r/N: Like 'l', but distribute lines round-robin across the N files
r/K/N: Same as r/N, but write only the Kth of N chunks to standard output
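A short sketch of two of these CHUNKS forms, assuming GNU coreutils. The file nums.txt and the prefix chunk_ are hypothetical names used only for this illustration.

```shell
# Assumption: GNU split; nums.txt and chunk_ are made-up names.
workdir=$(mktemp -d)
cd "$workdir"

# Nine short lines: 1, 2, ..., 9.
seq 1 9 > nums.txt

# l/3: three output files, with every line kept whole.
split -n l/3 nums.txt chunk_
ls chunk_*

# r/2/3: deal the lines round-robin into three chunks and print
# only the second chunk to standard output (lines 2, 5 and 8).
split -n r/2/3 nums.txt
```

The K/N forms write to standard output instead of creating files, which makes them convenient for sampling a large log or for feeding one share of the data to another command.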

3. Basic use of split command

3.1 Generate test files

Generate a 2 MiB test file with dd:

root@jeven01:/test# dd if=/dev/zero bs=1M count=2 of=
2+0 records in
2+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.00158099 s, 1.3 GB/s
root@jeven01:/test# ll -h 
-rw-r--r-- 1 root root 2.0M Oct  3 20:35

3.2 Split into 200 KiB pieces

Use the -b option to split the file you just created into pieces of 200 KiB (204,800 bytes) each. The last piece holds whatever remains and is smaller:

root@jeven01:/test# split -b 200k 
root@jeven01:/test# ls
  xaa  xab  xac  xad  xae  xaf  xag  xah  xai  xaj  xak
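The pieces can be joined back together with cat, because the shell expands the default names in the same order split wrote them. The sketch below assumes GNU coreutils; sample.bin and restored.bin are hypothetical names, since the article's own file name is not shown.

```shell
# Assumption: GNU split; sample.bin and restored.bin are made-up names.
workdir=$(mktemp -d)
cd "$workdir"

# A 2 MiB test file, the same size as in the example above.
head -c 2097152 /dev/zero > sample.bin

# Split into 200 KiB pieces named xaa, xab, ...
split -b 200K sample.bin
ls x*

# Concatenate the pieces in name order and compare with the original.
cat x* > restored.bin
cmp sample.bin restored.bin && echo "restored file is identical"
```

2097152 bytes divided into 204800-byte pieces gives ten full pieces plus one 49152-byte remainder, which is why eleven files appear.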

3.3 Split into files with numeric suffixes

Use the -d option to give the pieces numeric suffixes, and -a 3 to make each suffix three digits long.

root@jeven01:/test# split -b 200k  -d -a 3
root@jeven01:/test# ll
total 4104
drwxr-xr-x  2 root root    4096 Oct  3 20:42 ./
drwxr-xr-x 22 root root    4096 Sep 24 22:37 ../
-rw-r--r--  1 root root 2097152 Oct  3 20:35 
-rw-r--r--  1 root root  204800 Oct  3 20:42 x000
-rw-r--r--  1 root root  204800 Oct  3 20:42 x001
-rw-r--r--  1 root root  204800 Oct  3 20:42 x002
-rw-r--r--  1 root root  204800 Oct  3 20:42 x003
-rw-r--r--  1 root root  204800 Oct  3 20:42 x004
-rw-r--r--  1 root root  204800 Oct  3 20:42 x005
-rw-r--r--  1 root root  204800 Oct  3 20:42 x006
-rw-r--r--  1 root root  204800 Oct  3 20:42 x007
-rw-r--r--  1 root root  204800 Oct  3 20:42 x008
-rw-r--r--  1 root root  204800 Oct  3 20:42 x009
-rw-r--r--  1 root root   49152 Oct  3 20:42 x010
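If the numbering should start somewhere other than 0, the long form --numeric-suffixes=FROM from the help output can be used instead of -d. A minimal sketch, assuming GNU coreutils and a hypothetical file name sample.bin:

```shell
# Assumption: GNU split; sample.bin is a made-up name.
workdir=$(mktemp -d)
cd "$workdir"

head -c 2097152 /dev/zero > sample.bin

# Three-digit numeric suffixes that start at 1 instead of 0.
split -b 200K --numeric-suffixes=1 -a 3 sample.bin
ls x*
```

The eleven pieces are then named x001 through x011.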

3.4 Split files by number of lines

Use the -l option to start a new file every 1000 lines. With the prefix logs_part_, the new files are named logs_part_aa, logs_part_ab, and so on.

split -l 1000  logs_part_
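To see how the lines are distributed, the same command can be tried on a generated file. This sketch assumes GNU coreutils; app.log is a hypothetical input name, while the logs_part_ prefix follows the example above.

```shell
# Assumption: GNU split; app.log is a made-up name.
workdir=$(mktemp -d)
cd "$workdir"

# A 2500-line sample file.
seq 1 2500 > app.log

# Start a new output file every 1000 lines.
split -l 1000 app.log logs_part_

# Show the line count of each piece.
wc -l logs_part_*
```

The first two pieces hold 1000 lines each and the third holds the remaining 500.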

3.5 Set a prefix for the output file names

Pass the prefix as the last argument. Here the prefix is split_file, and the pieces get sequential three-digit suffixes: 000, 001, and so on.

root@jeven01:/test# split -b 200k  -d -a 3 split_file
root@jeven01:/test# ll -h
total 4.1M
drwxr-xr-x  2 root root 4.0K Oct  3 20:57 ./
drwxr-xr-x 22 root root 4.0K Sep 24 22:37 ../
-rw-r--r--  1 root root 200K Oct  3 20:57 split_file000
-rw-r--r--  1 root root 200K Oct  3 20:57 split_file001
-rw-r--r--  1 root root 200K Oct  3 20:57 split_file002
-rw-r--r--  1 root root 200K Oct  3 20:57 split_file003
-rw-r--r--  1 root root 200K Oct  3 20:57 split_file004
-rw-r--r--  1 root root 200K Oct  3 20:57 split_file005
-rw-r--r--  1 root root 200K Oct  3 20:57 split_file006
-rw-r--r--  1 root root 200K Oct  3 20:57 split_file007
-rw-r--r--  1 root root 200K Oct  3 20:57 split_file008
-rw-r--r--  1 root root 200K Oct  3 20:57 split_file009
-rw-r--r--  1 root root  48K Oct  3 20:57 split_file010
-rw-r--r--  1 root root 2.0M Oct  3 20:35
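The prefix can be combined with --additional-suffix from the help output so that every piece also ends in a fixed extension. A minimal sketch, assuming GNU coreutils; sample.bin and the .part extension are hypothetical choices, while split_file is the prefix used above.

```shell
# Assumption: GNU split; sample.bin and .part are made-up names.
workdir=$(mktemp -d)
cd "$workdir"

head -c 2097152 /dev/zero > sample.bin

# Prefix "split_file", three-digit numeric suffix, then a fixed ".part" ending.
split -b 200K -d -a 3 --additional-suffix=.part sample.bin split_file
ls split_file*
```

The pieces are named split_file000.part through split_file010.part, which keeps them easy to match with a single glob pattern.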

4. Things to note

1. Preserve the integrity of log records: When splitting log files by lines or bytes, avoid cutting a single log record across two files, because a truncated record can mislead later analysis. Splitting by bytes with -b can cut a line in the middle. The -C option caps each output file at SIZE bytes while keeping lines whole, as long as each line fits within that limit.

2. Choose a sensible piece size: Set the size of each piece according to your storage needs and log-processing strategy. Files that are too large are inconvenient to process, while files that are too small are harder to manage. For example, if about 50MB of logs is generated per day, consider splitting the file into pieces of about 10MB, which gives roughly five pieces per day.

3. Use clear naming rules: For easy management and identification, give the pieces clear and meaningful prefixes and suffixes. Set the suffix length with the -a option and use numeric suffixes with the -d or --numeric-suffixes option, which makes it easier to process the files in order.

4. Consider timestamp information: split copies the input unchanged, so timestamps in the log are preserved as long as no record is cut in the middle. Keeping each record whole lets you locate and search entries by time later. If the records are delimited by something other than a newline, the -t option sets a different single-character record separator.

5. Test and verify the results: Before using split on real data, run it on a small sample and check that the output files are what you expect. Make sure all options are correct before operating on the full log. This step helps you find problems early and adjust your plan in time.

6. Back up the original log file: Be sure to back up the original log file before performing any cutting operations. Although the split command does not modify the source file, a backup protects against data loss caused by accidental deletion or other human error.
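Notes 1 and 5 can be tried together on a small sample. The sketch below compares -b with -C on the same input and then checks that nothing was lost. It assumes GNU coreutils; app.log and the bytes_ and lines_ prefixes are hypothetical names.

```shell
# Assumption: GNU split; app.log, bytes_ and lines_ are made-up names.
workdir=$(mktemp -d)
cd "$workdir"

# 20 sample lines of 14 bytes each ("log line 0001" plus a newline).
seq -f 'log line %04g' 1 20 > app.log

split -b 50 -d app.log bytes_   # fixed 50-byte pieces; lines may be cut
split -C 50 -d app.log lines_   # at most 50 bytes of whole lines per piece

# The -b piece stops in the middle of a line ...
tail -c 20 bytes_00
echo
# ... while the -C piece ends with a complete line.
tail -n 1 lines_00

# Verify that the -C pieces reassemble into the original file.
cat lines_* | cmp - app.log && echo "no data lost"
```

Running the same check on a small extract of a real log before touching the full file is one way to carry out the test step described in note 5.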
