awk introduction and study notes collection page 2/3

5. awk operators
Table 2. Operators

Operator                   Description
= += -= *= /= %= ^= **=    Assignment
?:                         C conditional expression
||                         Logical OR
&&                         Logical AND
~ !~                       Matches / does not match a regular expression
< <= > >= != ==            Relational operators
space                      Concatenation
+ -                        Addition, subtraction
* / %                      Multiplication, division, remainder
+ - !                      Unary plus, unary minus, logical NOT
^ **                       Exponentiation
++ --                      Increment, decrement (prefix or postfix)
$                          Field reference
in                         Array membership

6. Records and Fields
6.1. Record
awk refers to each line ending with a newline as a record.

Record separator: The default input and output record separators are both a newline, stored in the built-in variables RS and ORS respectively.

$0 variable: It refers to the entire record. For example, $ awk '{print $0}' test will output all records in the test file.

Variable NR: A counter; each time a record is processed, NR is incremented by 1. For example, $ awk '{print NR,$0}' test outputs every record in the test file, preceded by its record number.
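
As a small illustration of records, NR and RS, here is a sketch that feeds awk a made-up three-line input through printf instead of a file. The sample words and the variable names records_out and rs_out are assumptions chosen for the example.

```shell
# Hypothetical three-line input; each newline-terminated line is one record.
records_out=$(printf 'alpha\nbeta\ngamma\n' | awk '{ print NR, $0 }')
echo "$records_out"

# Setting RS to ";" makes awk split records on semicolons instead of newlines.
rs_out=$(printf 'a;b;c' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')
echo "$rs_out"
```

In both runs NR counts records, whatever character separates them.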

6.2. Field
Each word in a record is called a "field"; fields are separated by spaces or tabs by default. awk tracks the number of fields in the current record and stores it in the built-in variable NF. For example, $ awk '{print $1,$3}' test prints the first and third fields of each line in the test file, separated by a space.

6.3. Field separator
The built-in variable FS holds the input field separator, which by default is a space or a tab. The value of FS can be changed with the -F command-line option. For example, $ awk -F: '{print $1,$5}' test prints the first and fifth fields, using the colon as the separator.

Several field separators can be used at once by writing them inside square brackets. For example, $ awk -F'[ :\t]' '{print $1,$3}' test uses spaces, colons and tabs as separators.

The output field separator is a space by default and is stored in OFS. For example, in $ awk -F: '{print $1,$5}' test, the comma between $1 and $5 tells print to insert the value of OFS between the two fields.
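
The interaction of FS and OFS can be sketched as follows. The passwd-style line and the variable names are invented for the example; the last command assumes that the awk in use expands \t in the -F argument, as gawk and mawk do.

```shell
# Hypothetical passwd-style record used as input.
line='root:x:0:0:superuser:/root:/bin/sh'

# -F: sets FS to a colon; print joins $1 and $5 with the default OFS (a space).
default_ofs=$(printf '%s\n' "$line" | awk -F: '{ print $1, $5 }')
echo "$default_ofs"

# Changing OFS in BEGIN changes what the comma in print inserts.
custom_ofs=$(printf '%s\n' "$line" | awk -F: 'BEGIN { OFS = "-" } { print $1, $5 }')
echo "$custom_ofs"

# A bracket expression lets space, colon and tab all act as separators.
multi_fs=$(printf 'a:b c\td\n' | awk -F'[ :\t]' '{ print NF, $3 }')
echo "$multi_fs"
```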

7. gawk-specific regular expression metacharacters
The general metacharacter set is not covered here; please refer to my Sed and Grep study notes. The following are specific to gawk and do not work in the Unix versions of awk.

\y
Matches the empty string at the beginning or end of a word.

\B
Matches the empty string within a word.

\<
Matches the empty string at the beginning of a word (start-of-word anchor).

\>
Matches the empty string at the end of a word (end-of-word anchor).

\w
Matches any word-constituent character (a letter, digit or underscore).

\W
Matches any character that is not word-constituent.

\`
Matches the empty string at the beginning of the string.

\'
Matches the empty string at the end of the string.

8. POSIX character set
Please refer to my Grep study notes

9. Match operator (~)
Used to match a regular expression against a record or a field. For example, $ awk '$1 ~/^root/' test displays the lines of the test file whose first field starts with root.
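
A short sketch of the match operator and its negation !~, run against made-up input lines (the user names and the variable names are assumptions for the example):

```shell
# Hypothetical input: the first field is a user name.
input='root 0
rooter 5
nobody 99
toor 1'

# Lines whose first field starts with "root".
matches=$(printf '%s\n' "$input" | awk '$1 ~ /^root/')
echo "$matches"

# First field of the lines that do NOT match, using !~.
nonmatches=$(printf '%s\n' "$input" | awk '$1 !~ /^root/ { print $1 }')
echo "$nonmatches"
```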

10. Comparison expressions
Conditional expression: expression1 ? expression2 : expression3. For example: $ awk '{max = ($1 > $3) ? $1 : $3; print max}' test. If the first field is larger than the third field, $1 is assigned to max; otherwise $3 is assigned to max.

$ awk '$1 + $2 < 100' test. If the sum of the first and second fields is less than 100, the line is printed.

$ awk '$1 > 5 && $2 < 10' test. If the first field is greater than 5 and the second field is less than 10, the line is printed.
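
The conditional expression and a comparison used as a pattern can be tried on inline data. This is only a sketch; the numbers and the variable names max_out and sum_out are made up.

```shell
# Pick the larger of field 1 and field 3 on each hypothetical line.
max_out=$(printf '3 x 9\n12 x 4\n' | awk '{ max = ($1 > $3) ? $1 : $3; print max }')
echo "$max_out"

# A comparison used as a pattern: print lines where $1 + $2 is below 100.
sum_out=$(printf '10 20\n60 70\n' | awk '$1 + $2 < 100')
echo "$sum_out"
```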

11. Range pattern
A range pattern matches all lines from the first occurrence of the first pattern through the first occurrence of the second pattern. If the second pattern never appears, the range extends to the end of the file. For example, $ awk '/root/,/mysql/' test displays all lines from the first occurrence of root through the first occurrence of mysql.
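
A minimal sketch of a range pattern on invented input. It also shows that a range can start again later, and that a range whose closing pattern never appears runs to the end of the input.

```shell
# Hypothetical input with two "root" lines but only one "mysql" line.
range_out=$(printf 'a\nroot 1\nb\nmysql 1\nc\nroot 2\nd\n' | awk '/root/,/mysql/')
echo "$range_out"
```

The line c falls between the two ranges and is not printed; the second range is never closed, so it includes d.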

12. An example that checks the validity of the passwd file

$ cat /etc/passwd | awk -F: '
NF != 7 {
printf("line %d, does not have 7 fields: %s\n", NR, $0)}
$1 !~ /[A-Za-z0-9]/ {printf("line %d, non alpha and numeric user id: %s\n", NR, $0)}
$2 == "*" {printf("line %d, no password: %s\n", NR, $0)}'

cat pipes the file to awk, and awk sets the field separator to a colon.

If the number of fields (NF) is not equal to 7, the action that follows is executed.

printf prints the string "line ?? does not have 7 fields" (with the record number in place of ??) and displays the record.

If the first field does not contain any letters or digits, printf prints "non alpha and numeric user id" together with the record number and the record.

If the second field is an asterisk, printf prints "no password" together with the record number and the record itself.
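
To see the three checks fire without touching the real /etc/passwd, the same rules can be run on a few invented records. This is only a sketch: the four sample lines and the variable name check_out are made up for the example.

```shell
# Four hypothetical passwd-style records: one valid, three with a defect each.
check_out=$(printf '%s\n' \
    'root:x:0:0:root:/root:/bin/sh' \
    'broken:x:1:1' \
    '---:x:2:2:odd:/tmp:/bin/false' \
    'guest:*:3:3:guest:/home/guest:/bin/sh' |
awk -F: '
NF != 7             { printf("line %d, does not have 7 fields: %s\n", NR, $0) }
$1 !~ /[A-Za-z0-9]/ { printf("line %d, non alpha and numeric user id: %s\n", NR, $0) }
$2 == "*"           { printf("line %d, no password: %s\n", NR, $0) }')
echo "$check_out"
```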

13. Several examples
$ awk '/^(no|so)/' test --- Prints all lines that start with no or so.

$ awk '/^[ns]/{print $1}' test --- If the record starts with n or s, prints its first field.

$ awk '$1 ~/[0-9][0-9]$/{print $1}' test --- If the first field ends with two digits, prints the first field.

$ awk '$1 == 100 || $2 < 50' test --- If the first field equals 100 or the second field is less than 50, prints the line.

$ awk '$1 != 10' test --- If the first field is not equal to 10, prints the line.

$ awk '/test/{print $1 + 10}' test --- If the record matches the regular expression test, adds 10 to the first field and prints the result.

$ awk '{print ($1 > 5 ? "ok "$1: "error"$1)}' test --- If the first field is greater than 5, prints the value of the expression after the question mark; otherwise prints the value of the expression after the colon.

$ awk '/^root/,/^mysql/' test --- Prints all records from a record starting with root through the next record starting with mysql. If another record starting with root is found later, printing resumes until the next record starting with mysql or the end of the file.

14. awk programming
14.1. Variables
In awk, variables can be used directly without definition, and the variable type can be a number or a string.

Assignment format: variable = expression. For example: $ awk '$1 ~/test/{count = $2 + $3; print count}' test. awk checks the first field of each record; if it matches test, the values of the second and third fields are added, the result is assigned to the variable count, and count is printed.

awk can be given variable assignments on the command line, which are then available to the awk script. For example, in $ awk -F: -f awkscript month=4 year=2004 test, month and year are user-defined variables assigned the values 4 and 2004 respectively. Within the awk script these variables are used as if they had been created in the script. Note that an assignment placed before the file name test takes effect only when awk reaches it in the argument list, after the BEGIN block has run, so these variables cannot be used in the BEGIN block.

Field variables can also be assigned and modified. For example, with $ awk '{$2 = 100 + $1; print }' test, if the second field does not exist, awk computes 100 plus $1 and assigns the result to $2; if the second field exists, its original value is overwritten with the value of the expression. Another example: $ awk '$1 == "root"{$1 ="test";print}' test. If the value of the first field is "root", it is changed to "test". Note that strings must be enclosed in double quotes.
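
The timing of command-line assignments and the effect of assigning to a field can be checked with a sketch like the following. The temporary file, the sample data and the variable names are assumptions for the example; -v is the standard awk option for assigning a variable before BEGIN runs.

```shell
# Temporary one-line input file (hypothetical data), removed at the end.
tmp=$(mktemp)
printf 'a\n' > "$tmp"

# month=4 is an operand: it is assigned only after BEGIN has already run.
operand_out=$(awk 'BEGIN { print "BEGIN sees [" month "]" }
                   { print "main sees [" month "]" }' month=4 "$tmp")
echo "$operand_out"

# With -v the assignment happens before BEGIN.
v_out=$(awk -v month=4 'BEGIN { print "BEGIN sees [" month "]" }')
echo "$v_out"

# Assigning to $2 creates the field if it is missing, or overwrites it.
assign_out=$(printf '5\n7 old\n' | awk '{ $2 = 100 + $1; print }')
echo "$assign_out"

rm -f "$tmp"
```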

Using built-in variables. The variables were listed earlier; here is an example. $ awk -F: '{IGNORECASE=1}; $1 == "MARY"{print NR,$1,$2,$NF}' test. Setting IGNORECASE to 1 (a gawk feature) makes comparisons ignore case; for records whose first field is MARY in any letter case, the command prints the record number, the first field, the second field and the last field.

14.2. BEGIN block
BEGIN is followed by an action block that is executed before awk processes any input file, so it can be tested without any input. It is usually used to change the values of built-in variables such as OFS, RS and FS, and to print a title. For example: $ awk 'BEGIN{FS=":"; OFS="\t"; ORS="\n\n"}{print $1,$2,$3}' test. Before the input file is processed, the field separator (FS) is set to a colon, the output field separator (OFS) to a tab, and the output record separator (ORS) to two newlines. $ awk 'BEGIN{print "TITLE TEST"}' prints only the title.

14.3. END block
END does not match any input record; its action block is executed after all input has been processed. For example, $ awk 'END{print "The number of records is " NR}' test prints the total number of records processed.
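
BEGIN, a main rule and END can be combined in one program. The following sketch uses invented colon-separated input and a made-up variable name, begin_end_out, to show the order in which the three parts run.

```shell
# BEGIN runs first (sets separators, prints a title), the main rule runs once
# per record, and END runs last with NR holding the total record count.
begin_end_out=$(printf 'a:b:c\nd:e:f\n' | awk '
BEGIN { FS = ":"; OFS = "\t"; print "TITLE" }
      { print $1, $3 }
END   { print "records: " NR }')
echo "$begin_end_out"
```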

14.4. Redirection and pipeline
awk can use the same redirection operators as the shell to redirect output. For example: $ awk '$1 == 100 {print $1 > "output_file" }' test. If the value of the first field equals 100, it is written to output_file. You can also redirect with >>, which appends to the file instead of truncating it.
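
The difference between > and >> can be sketched with a scratch directory. The directory, the sample data and the variable names are assumptions for the example; the output file name is passed in through -v so nothing is written to the current directory.

```shell
# Scratch directory for the hypothetical output file, removed at the end.
dir=$(mktemp -d)

# ">" truncates the file when awk first opens it, then writes each match.
printf '100 a\n5 b\n100 c\n' | awk -v out="$dir/output_file" '$1 == 100 { print $2 > out }'
first=$(cat "$dir/output_file")
echo "$first"

# ">>" appends to the existing content instead of truncating it.
printf '100 d\n' | awk -v out="$dir/output_file" '$1 == 100 { print $2 >> out }'
second=$(cat "$dir/output_file")
echo "$second"

rm -rf "$dir"
```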

The getline function is used for input redirection. getline reads input from standard input, a pipe, or a file other than the one currently being processed. It fetches the next line of input and sets built-in variables such as NF, NR and FNR. getline returns 1 if it reads a record, 0 at end of file, and -1 on error, such as failure to open the file. For example:

$ awk 'BEGIN{ "date" | getline d; print d}' test. Runs the Linux date command and pipes its output to getline, which assigns it to the user-defined variable d; d is then printed.

$ awk 'BEGIN{"date" | getline d; split(d,mon); print mon[2]}' test. Runs the shell's date command and pipes its output to getline, which reads from the pipe and assigns the input to d. The split function splits the variable d into the array mon, and the second element of mon is then printed.

$ awk 'BEGIN{while( "ls" | getline) print}'. The output of the ls command is passed to getline as input; on each iteration getline reads one line of the ls output and print writes it to the screen. No input file is needed here, because the BEGIN block is executed before any input file is opened.

$ awk 'BEGIN{printf "What is your name?"; getline name < "/dev/tty" } $1 ~name {print "Found " name " on line " NR "."} END{print "See you, " name "."}' test. Prints "What is your name?" on the screen and waits for the user to answer. When a line is entered, getline reads it from the terminal and stores it in the user-defined variable name. If the first field matches the value of name, the print statement is executed; the END block prints See you followed by the value of name.

$ awk 'BEGIN{while ((getline < "/etc/passwd") > 0) lc++; print lc}'. awk reads the file /etc/passwd line by line. The counter lc is incremented for each line until the end of the file is reached, and then the value of lc is printed. Note that getline returns -1 if the file does not exist, 0 at the end of the file, and 1 when it reads a line, so a loop written as while (getline < "/etc/passwd") becomes an infinite loop when the file does not exist, because -1 is treated as true.
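
The three return values of getline can be observed with a sketch like this one. The temporary file, the deliberately nonexistent path and the variable names are assumptions for the example.

```shell
# Temporary three-line file (hypothetical data), removed at the end.
tmp=$(mktemp)
printf 'one\ntwo\nthree\n' > "$tmp"

# Comparing the result with "> 0" stops the loop on both 0 (EOF) and -1 (error).
count_out=$(awk -v file="$tmp" 'BEGIN {
    while ((getline line < file) > 0) lc++
    close(file)
    print lc
}')
echo "$count_out"

# Reading from a path that does not exist makes getline return -1.
missing_out=$(awk 'BEGIN { rc = (getline line < "/nonexistent/hypothetical/file"); print rc }')
echo "$missing_out"

# getline can also read one line from a command through a pipe.
cmd_out=$(awk 'BEGIN { "echo hello" | getline d; close("echo hello"); print d }')
echo "$cmd_out"

rm -f "$tmp"
```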

You can open a pipe from within awk; all output sent to the same command goes through one pipe, which stays open until it is closed with close(). For example: $ awk '{print $1, $2 | "sort" } END {close("sort")}' test. awk sends the output of the print statement through the pipe as input to the Linux sort command, and the END block closes the pipe.
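
A sketch of why close() matters, using invented input: closing the pipe in END makes awk wait for sort to finish, so the sorted lines appear before anything END prints afterwards. The variable name sorted_out is made up for the example.

```shell
# Every print goes into the same "sort" pipe; close() ends sort's input and
# waits for it to write its result before "done" is printed.
sorted_out=$(printf 'pear 3\napple 1\nfig 2\n' |
    awk '{ print $1, $2 | "sort" } END { close("sort"); print "done" }')
echo "$sorted_out"
```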

The system function runs Linux commands from within awk. For example: $ awk 'BEGIN{system("clear")}'.

The fflush function flushes the output buffer. With no argument it flushes the buffer of standard output. With an empty string as the argument, as in fflush(""), it flushes the output buffers of all files and pipes.

14.5. Conditional statements
The conditional statements in awk are borrowed from the C language and control the flow of the program.

14.5.1. if statement
Format:
        {if (expression){
                   statement; statement; ...
                     }
        }
$ awk '{if ($1 <$2) print $2 "too high"}' test. If the first field is smaller than the second field, prints the second field followed by too high.

$ awk '{if ($1 < $2) {count++; print "ok"}}' test. If the first field is smaller than the second field, count is incremented by one and ok is printed.

14.5.2. if/else statement, used for two-way decisions.
Format:
        {if (expression){
                   statement; statement; ...
                       }
        else{
                   statement; statement; ...
                       }
        }
$ awk '{if ($1 > 100) print $1 "bad" ; else print "ok"}' test. If $1 is greater than 100, prints $1 followed by bad; otherwise prints ok.

$ awk '{if ($1 > 100){ count++; print $1} else {count--; print $2}}' test. If $1 is greater than 100, count is incremented and $1 is printed; otherwise count is decremented and $2 is printed.

14.5.3. if/else if/else statement, used for multi-way decisions.
Format:
        {if (expression){
                    statement; statement; ...
                   }
        else if (expression){
                    statement; statement; ...
                   }
        else if (expression){
                    statement; statement; ...
                   }
        else {
                   statement; statement; ...
             }
        }
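
The multi-way format above can be filled in as follows. This is only a sketch: the thresholds, the labels high, mid and low, and the sample numbers are invented for the example.

```shell
# Classify each hypothetical score with an if / else if / else chain.
grade_out=$(printf '95\n72\n40\n' | awk '{
    if ($1 >= 90)      { grade = "high" }
    else if ($1 >= 60) { grade = "mid" }
    else               { grade = "low" }
    print $1, grade
}')
echo "$grade_out"
```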
14.6. Loops
There are three types of loops in awk: the while loop, the for loop, and the special for loop (for ... in).

$ awk '{ i = 1; while ( i <= NF ) { print NF,$i; i++}}' test. The variable i starts at 1. While i is less than or equal to NF (the number of fields in the record), the print statement is executed and i is incremented by 1; the loop ends when i becomes greater than NF.

$ awk '{for (i = 1; i<=NF; i++) print NF,$i}' test. Does the same as the while loop above.
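
A quick way to compare a while loop and a for loop that both visit every field is to run them on the same line. The input line and the variable names are made up for this sketch; both loops use the condition i <= NF so that the last field is included.

```shell
# One hypothetical record with three fields.
input='a b c'

while_out=$(printf '%s\n' "$input" | awk '{ i = 1; while (i <= NF) { print NF, $i; i++ } }')
for_out=$(printf '%s\n' "$input" | awk '{ for (i = 1; i <= NF; i++) print NF, $i }')

echo "$while_out"
echo "$for_out"
```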

break and continue statements. break exits the loop when a condition is met; continue skips the remaining statements in the loop body when a condition is met and returns directly to the top of the loop. For example:

{for ( x=3; x<=NF; x++)
            if ($x<0){print "Bottomed out!"; break}}
{for ( x=3; x<=NF; x++)
            if ($x==0){print "Get next item"; continue}}
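
The two fragments above can be made runnable on sample records. In this sketch the input lines, the extra print statements and the variable names are assumptions added so that the effect of break and continue is visible in the output.

```shell
# break: stop scanning fields 3..NF at the first negative value.
break_out=$(printf 'id x 5 -2 7\n' | awk '{
    for (x = 3; x <= NF; x++) {
        if ($x < 0) { print "Bottomed out at field " x; break }
        print "ok", $x
    }
}')
echo "$break_out"

# continue: skip zero values but keep looping over the remaining fields.
continue_out=$(printf 'id x 4 0 6\n' | awk '{
    for (x = 3; x <= NF; x++) {
        if ($x == 0) continue
        print $x
    }
}')
echo "$continue_out"
```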

The next statement reads the next line from the input file and restarts the awk script from the top. For example:

{if ($1 ~/test/){next}
    else {print}
}

The exit statement is used to end the awk program, but the END block is not skipped. An exit status of 0 means success, and a non-zero value indicates an error.
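
A sketch that combines next and exit on invented input: next skips a record, exit stops reading input, the END block still runs, and the value given to exit becomes the exit status of awk. The status value 3 and the variable names are arbitrary choices for the example.

```shell
status=0
# "test 1" is skipped by next; "stop 3" triggers exit, so "keep 4" is never read.
next_out=$(printf 'test 1\nkeep 2\nstop 3\nkeep 4\n' | awk '
$1 ~ /test/   { next }
$1 == "stop"  { exit 3 }
              { print }
END           { print "END ran" }') || status=$?
echo "$next_out"
echo "status=$status"
```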

14.7. Array
An array subscript in awk can be a number or a string; such arrays are called associative arrays.

14.7.1. Subscript and associative array
Use a variable as the array subscript. For example: $ awk '{name[x++]=$2}; END{for(i=0;i<NR;i++) print i,name[i]}' test. The subscript of the array name is the user-defined variable x; awk initializes x to 0, and it is incremented by 1 after each use. The value of the second field of each record is assigned to successive elements of the name array. In the END block, a for loop walks the whole array, starting with the element at subscript 0, and prints the stored values. Because a subscript is just a key, it does not have to start at 0; it can start at any value.

The special for loop is used to read elements in an associative array. The format is as follows:

{for (item in arrayname){
         print arrayname[item]
         }
}

$ awk '/^tom/{name[NR]=$1}; END{for(i in name){print name[i]}}' test. Prints the elements of the array that have values. The order in which they are printed is not defined.

Use a string as a subscript. For example: count["test"]

Use a field value as the array subscript, together with the for (index_value in array) form of the loop. For example: $ awk '{count[$1]++} END{for(name in count) print name,count[name]}' test. This prints how many times each distinct string appears in $1. The first field is used as the subscript of the array count, so whenever the first field changes, a different element is incremented.

The delete statement removes array elements. For example: $ awk '{line[x++]=$1} END{for(x in line) delete line[x]}' test. Each element of the array line is assigned the first field of a record. After all records are processed, the special for loop deletes every element.
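
Counting with an associative array and then emptying an array can be sketched as follows. The names in the input are invented, and the output of the first command is piped through sort only because the order of a for (key in array) loop is not defined.

```shell
# Count how often each value of $1 occurs; sort makes the output order stable.
count_out=$(printf 'tom\nann\ntom\nbob\ntom\nann\n' |
    awk '{ count[$1]++ } END { for (name in count) print name, count[name] }' | sort)
echo "$count_out"

# Delete every element, then count what is left in the array.
delete_out=$(printf 'a\nb\nc\n' | awk '
    { line[x++] = $1 }
END {
    for (k in line) delete line[k]
    n = 0
    for (k in line) n++
    print n
}')
echo "$delete_out"
```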

14.8. awk built-in functions
14.8.1. String functions
The sub function finds the leftmost, longest substring matching the regular expression and replaces it with the replacement string. If no target string is specified, the entire record is used by default. Only the first match is replaced. The format is as follows:

            sub (regular expression, substitution string)
            sub (regular expression, substitution string, target string)

Example:

            $ awk '{ sub(/test/, "mytest"); print }' testfile
            $ awk '{ sub(/test/, "mytest", $1); print }' testfile

The first example matches against the whole record, and only the first match in each record is replaced. To replace every match in the record, use gsub.

The second example matches against the first field of each record, and again only the first match is replaced.
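
The difference between sub, gsub and a sub restricted to one field can be seen on a single invented line. This is a sketch; the input text and the variable names are assumptions for the example.

```shell
# sub replaces only the first match in the record.
sub_out=$(printf 'test one test two\n' | awk '{ sub(/test/, "mytest"); print }')
echo "$sub_out"

# gsub replaces every match in the record.
gsub_out=$(printf 'test one test two\n' | awk '{ gsub(/test/, "mytest"); print }')
echo "$gsub_out"

# With a third argument, the substitution is limited to that target ($1 here).
target_out=$(printf 'test test\n' | awk '{ sub(/test/, "mytest", $1); print }')
echo "$target_out"
```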
