
The Art of Writing Linux Utilities

Linux and other UNIX-like systems ship with a large number of tools that perform functions ranging from the obvious to the amazing. The success of UNIX-like programming environments owes much to the quality and selection of these tools, and to the ease with which they can be connected to one another.

As a developer, you may find that existing utilities don't always solve your problem. While many problems can be solved easily by combining existing utilities, others require at least some actual programming. These latter tasks are often candidates for new utilities; creating a new utility and combining it with existing ones can solve a problem with minimal work. This article examines the qualities that make a utility excellent and the process such a utility's design goes through.

What qualities does an excellent utility have?

The UNIX Programming Environment by Kernighan & Pike contains a great discussion of this question. An excellent utility is one that does its job as well as possible. It has to play well with others; it must be easy to combine with other utilities. A program that doesn't combine with other utilities isn't a utility, it's an application.

Utilities should let you build one-off applications cheaply and easily from the materials at hand. Many people think of utilities as being like tools in a toolbox. The goal in designing a utility is not to have a single tool that does everything, but to have a set of tools, each of which does one thing as well as possible.

Some utilities are reasonably useful on their own, whereas others really only make sense as part of an ensemble. Examples of the former are sort and grep. xargs, on the other hand, is rarely used by itself; it is almost always combined with other utilities, most commonly find.
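
For instance, a typical pairing looks like this (the file names and search pattern are made up; -print0 and -0 are common GNU/BSD extensions rather than POSIX):

    # List every .log file under the current directory that contains
    # the string ERROR; -print0/-0 keep odd file names from breaking.
    find . -name '*.log' -print0 | xargs -0 grep -l ERROR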

What language is used to write utilities?

Most UNIX system utilities are written in C. The examples in this article use Perl and sh. Use the right tool for the job: if you will use a utility heavily enough, the cost of writing it in a compiled language may be repaid by the performance gain. On the other hand, in the fairly common case where a program's workload is light, a scripting language may offer faster development.

If you can't be sure, use the language you know best. At least while you are prototyping a utility, or figuring out how useful it is, programmer efficiency takes precedence over performance tuning. Most UNIX system utilities are written in C simply because they are used heavily enough that efficiency matters more than development cost. Perl and sh (or ksh) are probably good languages for rapid prototyping. For utilities that mostly orchestrate other programs, the shell may be easier to write in than a more traditional programming language. On the other hand, when you need to work with raw bytes, C is probably the best choice.

Designing a utility

A good rule of thumb is to start thinking about the design of a utility the second time you have to solve a problem. Don't mourn the one-off you wrote the first time; think of it as a prototype. The second time, compare what you need now with what you needed the first time. Around the third time, you should start thinking about taking the time to write a general-purpose utility. Even a purely repetitive task can motivate the development of a utility; for example, many general-purpose file-renaming programs were written out of frustration at trying to rename files in a systematic way.

Below are some design goals for utilities; each is described in its own subsection.

Do one thing well; don't do many things badly. Perhaps the best example of doing one thing well is sort: no utility other than sort needs to know how to sort. The underlying idea is simple: if you solve only one problem at a time, you can take the time to solve it well.

Imagine how frustrating it would be if most programs sorted, but some supported only lexicographic sorting, others only numeric sorting, and still others even supported key selection rather than sorting whole lines. It would be annoying at best.

When you find a problem that needs solving, try to break it into parts, and don't duplicate the parts for which a utility already exists. The more attention you pay to making your tool work with existing tools, the more likely your utility is to stay useful.

Maybe you need to write more than one program. The best way to accomplish a specialized task is usually to write one or two utilities and tie them together with a bit of glue, rather than writing a single program that solves the whole thing. A 20-line shell script that combines your new utility with existing tools is ideal. If you try to solve the whole problem in one pass, the first change that comes along may force you to rethink the whole thing.

I occasionally need to produce two- or three-column output from a database. It is usually more efficient to write one program that produces the output in a single column, and another that columnates its input. The shell script combining the two utilities is itself throwaway, while the separate utilities outlive the script.
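
The glue in question can be tiny. In this sketch the db-report command is hypothetical, and paste does the columnating:

    #!/bin/sh
    # Hypothetical glue: db-report emits one value per line;
    # paste folds every three lines into one tab-separated row.
    db-report "$@" | paste - - -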

Some utilities serve very specialized needs. For a directory with lots of entries, if the output of ls scrolls off the screen very quickly, it may be because one of the files has a very long name, forcing ls to fall back to a single column of output. Paginating the output with more takes a while. Why not just sort the lines by length and pipe the result through tail, as below?

Listing 1. Possibly the smallest useful utility in the world, sl

#!/usr/bin/perl -w
print sort { length $a <=> length $b } <>;


The script in Listing 1 does exactly one thing. It takes no options, because it needs none: it cares only about the length of each line. Thanks to Perl's handy <> idiom, this little utility works both on standard input and on files named on the command line.
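
For example, to see the three longest file names in a crowded directory (assuming Listing 1 is saved as sl somewhere in your PATH and made executable):

    $ ls | sl | tail -3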

Be a filter

Almost all utilities are best conceived of as filters, although a few very useful utilities don't fit this model. (For instance, a program that performs a count can be very useful even though it works poorly as a filter. Programs that take only command-line arguments as input, and potentially produce complex output, can also be very useful.) Still, most utilities should work as filters. By convention, filters work on lines of text. Most filters should support multiple input files.
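
A convenient shape for such a filter in sh is sketched below; the trailing-whitespace job is just a placeholder:

    #!/bin/sh
    # A minimal filter skeleton: sed reads the named files, or
    # standard input when no files are given, so the same script
    # works in a pipeline and on explicit file arguments.
    exec sed 's/[[:space:]]*$//' "$@"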

Remember that utilities run both on the command line and in scripts. Sometimes the ideal behavior differs slightly between the two. For example, most versions of ls automatically arrange output into multiple columns when writing to a terminal. grep's default is to print the name of the file in which a match was found when more than one file is specified. Such differences should have to do with how users want the utility to behave, and nothing else. An older version of GNU bc, for example, displayed an obnoxious copyright notice on startup. Please, don't do that. Make your utility do only what it is supposed to do.
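
When you do want terminal-sensitive behavior, the POSIX test [ -t 1 ] tells a script whether standard output is a terminal. In this sketch, some-listing is a stand-in for your own utility:

    #!/bin/sh
    # Columnate output only when standard output is a terminal,
    # in the spirit of ls; pipe consumers get plain single-column output.
    if [ -t 1 ]; then
        some-listing "$@" | pr -4 -t
    else
        some-listing "$@"
    fi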

Utilities like to live in pipelines. A pipeline lets a utility focus on its own job and not on side issues. To live in a pipeline, a utility needs to read data from standard input and write data to standard output. If you want to handle records, it is best to make each line a record; existing programs such as sort and join already think in those terms. You will thank yourself for doing the same.

I occasionally use a utility that runs another program repeatedly over a tree of files. This makes good use of the standard UNIX filter model, but that model only fits utilities that read input and write output; it can't be used with utilities that operate in place, or that take input and output file names.

Most programs that can run on standard input can also usefully run on a single file or a group of files. It could be argued that this violates the rule against duplicated effort, since obviously the same thing could be achieved by feeding cat's output into the next program in the chain. In practice, though, it seems reasonable.

Some programs may legitimately read records in one format but produce output in a completely different one. A utility that columnates its input is one example: it may treat each input line as a record, yet emit several records per output line.
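
By way of illustration (this demo is not from the original article), paste with several - operands reads standard input round-robin, folding each group of three input records into one output line:

    $ printf '%s\n' red green blue cyan magenta yellow | paste - - -
    red     green   blue
    cyan    magenta yellow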

Not every utility fits this model exactly. For example, xargs takes file names, not records, as input, and all the actual processing is done by some other program.

Generalization

Try to think of tasks similar to the one you are actually performing; if you can find a general description of these tasks, it is probably worth trying to write a utility that fits that description. For instance, if you find yourself sorting text lexicographically one day and numerically the next, it might make sense to consider writing a general-purpose sorting utility.
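
sort itself shows the payoff of generalizing. The file names here are invented, but the options are standard:

    sort names.txt                 # lexicographic sort
    sort -n sizes.txt              # numeric sort
    sort -t: -k3,3n /etc/passwd    # numeric sort on the third colon-separated field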

Generalizing functionality can sometimes lead you to discover that what looked like a single utility is really two utilities used in concert. That is fine: writing two well-designed utilities can be easier than writing one ugly or complicated one.

Doing one thing well doesn't mean doing exactly one thing. It means handling a consistent but useful problem space. Lots of people use grep; much of its utility, however, lies in its ability to perform related tasks. grep's various options do the work of many small utilities which, had they all been written separately, would have ended up with a lot of shared, duplicated code.
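
For example (file names invented), a handful of grep's standard options stand in for what could otherwise have been four separate programs:

    grep -i error app.log    # case-insensitive match
    grep -v '^#' app.conf    # invert: print the lines that do NOT match
    grep -c error app.log    # count matching lines instead of printing them
    grep -l error *.log      # list only the names of files that match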

This rule, like the rule about doing one thing well, is a corollary of an underlying principle: avoid duplicated code whenever you can. If you write half a dozen programs, each of which sorts lines, you may end up fixing six similar bugs six times, rather than relying on one better-maintained sort program.

This is where much of the work of writing a utility actually goes. You may not have time to fully generalize a utility at first, but the effort pays off as you keep using it.

Sometimes it is useful to add a related feature to a program even when it doesn't serve exactly the same task. For instance, a program that pretty-prints raw binary data might be more useful if, when run on a terminal device, it put the terminal into raw mode. That makes it much easier to test problems involving keyboard mappings, new keyboards, and so on. Not sure why you get a tilde (~) when you press the delete key? This is an easy way to find out what is actually being sent. It isn't quite the same task, but it is similar enough to make a plausible extra feature.

The errno utility in Listing 2 is a good example of generalization: it supports both numeric and symbolic names.

 
Be robust
A utility's stability is important. A utility that crashes easily or can't handle real data isn't useful. Utilities should handle arbitrarily long lines, giant files, and so on. It may be tolerable for a utility to fail on datasets larger than its memory, but some utilities don't fail; for instance, sort can generally sort datasets much larger than available memory by using temporary files.
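
With GNU sort, for instance, you can even point that spill space at a roomier file system; the file names here are invented:

    # GNU sort merges runs through temporary files when its buffer
    # fills, so it can sort far more data than fits in memory; -T
    # chooses where those temporary files go.
    sort -T /var/tmp huge-dataset.txt > sorted.txt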

Try to make sure you know what data your utility might plausibly operate on. Don't simply ignore the possibility of data you can't process: your utility should check for it and diagnose it. The more specific your error messages, the more helpful you are to your users. Try to give the user enough information to know what happened and how to fix it. When processing data files, try to identify the bad data as precisely as possible. When trying to parse a number, don't just give up: tell the user what you got and, if possible, on which line of the input stream it appeared.
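
As a sketch of that advice (the sum-col name and its one-number-per-line input format are invented for illustration):

    #!/bin/sh
    # sum-col: add up the first field of each line, pointing at the
    # exact line and datum when the input is not a number.
    exec awk '
        $1 !~ /^-?[0-9]+([.][0-9]+)?$/ {
            printf "sum-col: line %d: expected a number, got \"%s\"\n", NR, $1 | "cat >&2"
            bad = 1
            exit 1
        }
        { total += $1 }
        END { if (!bad) print total }
    ' "$@"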

As a good example, consider the difference between two implementations of dc. If you run dc /home, one implementation says "Cannot use directory as input!" while the other just returns silently, with no error message and no unusual exit code. Which would you rather have in your path when you mistype a cd command? Similarly, if you feed it a stream of data from a directory (perhaps by running dc < /home), the former gives a verbose error message. Then again, giving up early on invalid input can sometimes be the right choice.

Security vulnerabilities often have their roots in programs that aren't robust in the face of unexpected data. Remember that a good utility may find its way into shell scripts run as root. A buffer overflow in a program such as find can put a large number of systems at risk.

The better a program handles unexpected data, the more likely it is to adapt to changing environments. Often, trying to make a program more robust leads you to understand its job better, making it more general.

Be novel

One of the worst kinds of utility to write is one you already have. I wrote a wonderful utility called count. It let me perform almost any counting task. It's an excellent utility, but a standard BSD utility called jot already does the same thing. Likewise, my flexible program for turning data into columns duplicated an existing utility, rs, again found on BSD systems, except that rs is more flexible and better designed. See Resources below for more about jot and rs.

If you are about to start writing a utility, take a moment to browse around various systems to see whether it already exists. Don't be afraid to borrow BSD utilities for Linux, or Linux utilities for BSD; one of the joys of utility code is that almost all utilities are very portable.

Don't forget to consider combining existing applications into a utility. It is possible, in theory, that a combination of existing programs will be too slow, but writing a new utility is rarely faster than waiting for a slightly slow pipeline.

An example utility

In a sense, this program is a counterexample, in that it is never useful as a filter. It works very well as a command-line utility, however.

This program does only one thing: it looks up an error number or name in /usr/include/sys/errno.h and prints the matching line in nearly perfect output format. For example:

$ errno 22
EINVAL [22]: Invalid argument 

Listing 2. errno finder


    #!/bin/sh
    usage() {
        echo >&2 "usage: errno [numbers or error names]"
        exit 1
    }
    for i
    do
        case "$i" in
        [0-9]*)
            # Numeric argument: match the third field of a #define line.
            awk '/^#define/ && $3 == '"$i"' {
                for (i = 5; i < NF; ++i) {
                    foo = foo " " $i;
                }
                printf("%-22s%s\n", $2 " [" $3 "]:", foo);
                foo = ""
            }' < /usr/include/sys/errno.h
            ;;
        E*)
            # Symbolic name: match the second field instead.
            awk '/^#define/ && $2 == "'"$i"'" {
                for (i = 5; i < NF; ++i) {
                    foo = foo " " $i;
                }
                printf("%-22s%s\n", $2 " [" $3 "]:", foo);
                foo = ""
            }' < /usr/include/sys/errno.h
            ;;
        *)
            echo >&2 "errno: can't tell whether $i is a name or a number."
            usage
            ;;
        esac
    done

Is this program general? Fairly. It supports both numeric and symbolic names. On the other hand, it knows nothing about other files that may share the same format, such as /usr/include/sys/signal.h. It could easily be extended to handle them, but for a convenience utility like this, it is easier to simply make a copy called "signal" that reads signal.h and uses "SIG*" as the pattern for matching names.
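
A minimal sketch of that copy follows; it is not from the original listing, and the header path is an assumption that varies between systems:

    #!/bin/sh
    # signal: a hypothetical companion to errno that looks up SIG*
    # names, e.g. "signal SIGINT". Handles names only, not numbers.
    for i
    do
        awk '$1 == "#define" && $2 == "'"$i"'" {
            printf("%-12s[%s]\n", $2, $3)
        }' < /usr/include/sys/signal.h
    done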

Although this is only slightly more convenient than grepping the system headers yourself, it is less error-prone. It doesn't produce useless results from ill-considered arguments. On the other hand, it produces no diagnostic if a given name or number is not found in the header, and it doesn't bother to correct certain input errors. Still, since this command-line utility was never intended for use in an automated environment, those weaknesses are excusable.

Another example might be a program to unsort input (see Resources for a link to this utility). It is simple enough: read the input files, store the lines somehow, then generate a random order in which to emit them. It is a utility with almost unlimited applications, and it is much easier to write than a sorting program; for example, you don't have to specify which keys to not-sort on, or whether to un-order the lines alphabetically, lexicographically, or numerically. The tricky part is reading lines that may be very long. In fact, the version linked to above cheats: it assumes there are no null bytes in the lines it reads. Fixing that would be much harder, and I was too lazy to bother when I wrote it.
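
A minimal sketch of the idea in shell follows; this is not the version from Resources, and it shares the line-length and null-byte caveats just mentioned:

    #!/bin/sh
    # unsort-ish: prefix each line with a random key, sort on the
    # key, then cut the key back off (decorate/sort/discard).
    awk 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' "$@" |
        sort -n |
        cut -f2-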

Conclusion

If you find yourself performing a task repeatedly, consider writing a program to do it. If the program turns out to be worth generalizing a little, generalize it, and you will have written a utility.

Don't design the utility the first time you need it; wait until you have some experience. Feel free to write a prototype or two; a good utility justifies the time and research spent on it better than a bad one does. Don't feel bad if the great utility you envisioned turns out to be useless once you've written it. If you find yourself frustrated by the new program's shortcomings, you've simply had another prototyping phase; if it turns out to be useless, well, that happens sometimes.

What you are looking for is a program that finds general application beyond your initial usage pattern. I wrote unsort because I wanted an easy way to get a random sequence of colors out of an old X11 "rgb.txt" file. Since then I have used it for an amazing number of tasks, not least generating test data for debugging and benchmarking sorting routines.

A good utility can repay the time you spent on all the near misses. The next thing to do is make it available to others so they can experiment with it. Make your failed attempts available too; maybe someone else has a use for a utility you don't need. More importantly, your failed utility may turn out to be someone else's prototype, and everyone ends up with a wonderful utility.