SoFunction
Updated on 2025-04-09

awk basic knowledge summary page 2/2



10. Loop structures
awk has a while loop structure that is equivalent to the corresponding C while loop.
awk also has a "do...while" loop, which evaluates the condition at the end of the code block, rather than at the beginning like a standard while loop.
It is similar to the "repeat...until" loop in other languages.
Example:
do...while example
{
    count=1
    do {
        print "I get printed at least once no matter what"
    } while ( count != 1 )
}

Unlike an ordinary while loop, a "do...while" loop always executes at least once, because the condition is evaluated after the code block.
By contrast, if the condition of an ordinary while loop is false when the loop is first encountered, the loop body never executes.
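
The difference can be seen by running both loop forms with a condition that is false from the start. This is a minimal sketch, assuming a POSIX shell with awk on the PATH; the function name loop_compare is made up for illustration.

```shell
# loop_compare is a hypothetical helper name, not part of awk.
loop_compare() {
  awk 'BEGIN {
    count = 1
    # while: the condition is tested first, so the body is skipped entirely
    while (count != 1) {
      print "while body ran"
    }
    # do...while: the body runs once before the condition is tested
    do {
      print "do...while body ran"
    } while (count != 1)
  }'
}
loop_compare
```

Only the do...while line is printed, because the while body never gets a chance to run.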

for loop
awk also lets you write for loops which, like while loops, are equivalent to their C counterparts:
for ( initial assignment; comparison; increment ) {
 code block
}

Example:
for ( x = 1; x <= 4; x++ ) {
    print "iteration",x
}

This code will print:
iteration 1
iteration 2
iteration 3
iteration 4

break and continue
Like C, awk also provides break and continue statements, which give you finer control over awk's loop structures. Here is a code snippet that desperately needs a break statement:

Infinite while loop
while (1) {
  print "forever and ever..."
}
Because 1 always means true, this while loop will run forever. Here is a loop that only executes ten times:

break statement example
x=1
while(1) {
   print "iteration",x
   if ( x == 10 ) {
      break
   }
  x++
}

Here, the break statement is used to "break out" of the innermost enclosing loop. "break" terminates the loop immediately, and execution continues with the statements that follow the loop's code block.
The continue statement complements break and works as follows:
x=1
while (1) {
    if ( x == 4 ) {
      x++
      continue
     }
     print "iteration",x
     if ( x > 20 ) {
         break
     }
     x++
}

This code prints "iteration 1" through "iteration 21", except for "iteration 4". When x equals 4, x is incremented and the continue statement is called, which immediately starts the next loop iteration without executing the rest of the code block. Like break, the continue statement works in every kind of awk loop. When continue is used in the body of a for loop, the loop control variable is still incremented automatically. Here is an equivalent loop:
for ( x=1; x<=21; x++ ) {
    if ( x == 4 ) {
        continue
    }
    print "iteration",x
}

Unlike in the while loop, there is no need to increment x before calling continue, because the for loop increments x automatically.

Array
You will be pleased to know that awk supports arrays. However, in awk it is customary to start array subscripts at 1 rather than 0:
myarray[1]="jim"
myarray[2]=456

When awk encounters the first assignment statement, it creates myarray and sets the element myarray[1] to "jim". After the second assignment statement executes, the array has two elements.

Array iteration

Once an array is defined, awk offers a convenient mechanism for iterating over its elements, as shown below:
for ( x in myarray ) {
  print myarray[x]
}

This code prints every element in the array myarray. When this special "in" form of for is used, awk assigns each existing subscript of myarray to x (the loop control variable) in turn and executes the loop body once after each assignment. While this is a very handy awk feature, it has one drawback: when awk iterates over the array subscripts, it does not follow any particular order. That means we cannot know whether the output of the code above will be:
jim
456
or:
456
jim
Iterating the contents of the array is like a box of chocolates - you never know what you will get.
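
When the order matters and the subscripts are consecutive integers, one common workaround is to loop over the subscripts numerically instead of using "in". This is a sketch under that assumption; the function name ordered_walk and the hand-maintained counter n are illustrative only.

```shell
# ordered_walk is a hypothetical helper name, not part of awk.
ordered_walk() {
  awk 'BEGIN {
    myarray[1] = "jim"
    myarray[2] = 456
    n = 2                        # element count, tracked by hand
    for (i = 1; i <= n; i++) {   # numeric loop visits subscripts in ascending order
      print myarray[i]
    }
  }'
}
ordered_walk
```

This approach only works when you know the subscripts in advance; for arbitrary string subscripts, "for ( x in myarray )" remains the way to visit every element.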

11. Array subscript stringification
Because awk performs whatever conversions are necessary to get the job done, some weird-looking code works just fine:
a="1"
b="2"
c=a+b+3
After this code executes, c is equal to 6. Because awk is "stringy", adding the strings "1" and "2" is functionally no different from adding the numbers 1 and 2; in both cases awk performs the operation successfully. This stringy nature is rather intriguing: you might wonder what happens if you use string subscripts for an array. Consider the following code:
myarr["1"]="Mr. Whipple"
print myarr["1"]
As expected, this code will print "Mr. Whipple". But what if the quotes in the second "1" subscript are removed?
myarr["1"]="Mr. Whipple"
print myarr[1]

It is difficult to guess the result of this code snippet. Does awk treat myarr["1"] and myarr[1] as two separate elements of the array, or do they refer to the same element? The answer is that they refer to the same element, so awk prints "Mr. Whipple", just like the first code snippet. Although it may seem a bit weird, awk has been using string subscripts for its arrays behind the scenes all along!
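
One way to convince yourself is to assign through both forms of the subscript and then count the elements. This is a small sketch; the function name same_element and the second value "Mrs. Whipple" are made up for illustration.

```shell
# same_element is a hypothetical helper name, not part of awk.
same_element() {
  awk 'BEGIN {
    myarr["1"] = "Mr. Whipple"
    myarr[1]   = "Mrs. Whipple"   # numeric 1 becomes the string "1": same element
    n = 0
    for (k in myarr) n++          # count how many elements exist
    print n, myarr["1"]
  }'
}
same_element
```

The array ends up with a single element, and reading it back through the quoted subscript returns the value assigned through the unquoted one.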

After learning this weird truth, some of us may want to execute weird code similar to the following:
myarr["name"]="Mr. Whipple"
print myarr["name"]

Not only does this code not raise an error, it behaves exactly like the previous example and also prints "Mr. Whipple"! As you can see, awk does not restrict us to pure integer subscripts; we can use string subscripts if we want, without any problems. Whenever we use non-integer array subscripts such as myarr["name"], we are using associative arrays. Technically, awk does nothing different behind the scenes when we use a string subscript (because even an "integer" subscript is still treated as a string). Even so, you should call them associative arrays: it sounds cool and will impress your boss. The stringy subscript thing will be our little secret. ;)

Array Tools
When it comes to arrays, awk gives us a lot of flexibility. String subscripts can be used, and there is no need for consecutive sequences of numeric subscripts (for example, myarr[1] and myarr[1000] can be defined, but all other elements are not defined). While these are useful, in some cases, confusion can occur. Fortunately, awk provides some practical features that help make arrays more manageable.

First, you can delete array elements. To delete element 1 of the array fooarray, enter:
delete fooarray[1]

Also, if you want to see if a particular array element exists, you can use a special "in" boolean operator, as shown below:
if ( 1 in fooarray ) {
   print "Ayep!  It's there."
} else {
   print "Nope!  Can't find it."
}
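
The two features can be combined to confirm that a deleted element is really gone while its neighbours survive. This is a minimal sketch; the function name delete_demo and the element values are made up for illustration.

```shell
# delete_demo is a hypothetical helper name, not part of awk.
delete_demo() {
  awk 'BEGIN {
    fooarray[1] = "first"
    fooarray[2] = "second"
    delete fooarray[1]            # remove element 1 only
    for (i = 1; i <= 2; i++) {
      if (i in fooarray) {        # membership test, does not create the element
        print i, "is present"
      } else {
        print i, "is gone"
      }
    }
  }'
}
delete_demo
```

Using "in" for the check matters: merely referencing fooarray[1] in an expression would create the element again as an empty string.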

12. Formatted output

awk provides two functions printf() and sprintf(). Like many other awk components, these functions are equivalent to corresponding C functions.
printf() prints the formatted string to stdout, while sprintf() returns the formatted string that can be assigned to the variable.
If you are not familiar with printf() and sprintf(), an introductory text on C will quickly bring you up to speed on these two basic printing functions. On Linux systems, you can enter "man 3 printf" to view the printf() manual page.

Here is some sample code for awk's sprintf() and printf(). As you can see, they look almost exactly like their C counterparts.
x=1
b="foo"
printf("%s got a %d on the last test\n","Jim",83)
myout=sprintf("%s-%d",b,x)
print myout

This code will print:
Jim got a 83 on the last test
foo-1
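
Field widths and padding work as they do in C as well. The sketch below uses only standard conversions (%s, %d, a width, the "-" flag for left alignment, and "0" for zero padding); the function name format_demo and the sample values are made up for illustration.

```shell
# format_demo is a hypothetical helper name, not part of awk.
format_demo() {
  awk 'BEGIN {
    # %-6s left-aligns in 6 columns, %5d right-aligns in 5 columns
    printf("%-6s|%5d|\n", "Jim", 83)
    # sprintf returns the formatted string instead of printing it
    padded = sprintf("%03d", 7)
    print padded
  }'
}
format_demo
```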


13. String functions

awk has many string functions.
In awk, string functions are essential because a string cannot be treated as an array of characters, as it can in other languages such as C, C++, and Python.
For example, if you execute the following code:
mystring="How are you doing today?"
print mystring[3]
You will receive an error like the following:
awk: :59: fatal: attempt to use scalar as array

Although they are not as convenient as Python's sequence types, awk's string functions still get the job done. Let's take a look.
First, there is a basic length() function that returns the length of the string. Here is how it is used:
print length(mystring)

This code will print the value:
24

The next string function is index(), which returns the position at which a substring occurs within another string, or 0 if the substring is not found. Using mystring, you can call it as follows:
print index(mystring,"you")

awk will print:
9

Next are two simple functions, tolower() and toupper(). As you might guess, they return the string with all characters converted to lowercase or uppercase, respectively. Note that tolower() and toupper() return a new string and do not modify the original. This code:
print tolower(mystring)
print toupper(mystring)
print mystring
...will produce the following output:
how are you doing today?
HOW ARE YOU DOING TODAY?
How are you doing today?

So far so good, but how exactly do we select a substring, or even a single character, from a string? That is what substr() is for. Here is how to call substr():
mysub=substr(mystring,startpos,maxlen)
mystring should be a string variable or literal string to extract substrings from. startpos should be set to the starting character position, maxlen should contain the maximum length of the string to be extracted. Note that I'm talking about the maximum length; if length(mystring) is shorter than startpos+maxlen, the result will be truncated. substr() does not modify the original string, but returns the substring. Here is an example:
print substr(mystring,9,3)

awk will print:
you
If you regularly program in a language that uses array subscripts to access parts of a string (and who doesn't?), remember that substr() is your awk substitute.
You will need it to extract single characters and substrings; because awk is a string-based language, you will use it often.
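
As a rough illustration, the sketch below pulls out a single character, omits maxlen to take everything up to the end of the string, and shows the truncation that occurs when maxlen runs past the end. The function name substr_demo is made up for illustration.

```shell
# substr_demo is a hypothetical helper name, not part of awk.
substr_demo() {
  awk 'BEGIN {
    mystring = "How are you doing today?"
    print substr(mystring, 3, 1)     # single character at position 3
    print substr(mystring, 19)       # maxlen omitted: rest of the string
    print substr(mystring, 19, 100)  # maxlen too large: result is truncated
  }'
}
substr_demo
```

The first call is the awk counterpart of mystring[3] in languages whose string subscripts start at 1.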

Some more intriguing functions
First is match(). match() is very similar to index(); the difference is that it searches for a regular expression rather than a substring. match() returns the starting position of the match, or 0 if no match is found. In addition, match() sets two variables, RSTART and RLENGTH. RSTART holds the return value (the position of the first match), and RLENGTH holds the length of the match in characters (or -1 if no match was found). With RSTART, RLENGTH, substr(), and a small loop, you can easily iterate through every match in a string. Here is an example match() call:
print match(mystring,/you/), RSTART, RLENGTH
awk will print:
9 9 3
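
The "small loop" mentioned above might look like the following sketch, which repeatedly matches against the unprocessed remainder of the string and tracks how much has already been consumed. The function name all_matches, the variables rest and offset, and the regular expression /o[a-z]/ are all chosen for illustration.

```shell
# all_matches is a hypothetical helper name, not part of awk.
all_matches() {
  awk 'BEGIN {
    rest = "How are you doing today?"
    offset = 0                                  # characters already consumed
    while (match(rest, /o[a-z]/)) {
      # position in the original string, then the matched text
      print offset + RSTART, substr(rest, RSTART, RLENGTH)
      offset += RSTART + RLENGTH - 1
      rest = substr(rest, RSTART + RLENGTH)     # continue after this match
    }
  }'
}
all_matches
```

Each pass prints one match together with its position in the original string; the loop ends when match() returns 0.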

String replacement
Now, we will look at two string replacement functions, sub() and gsub(). These functions are slightly different from those discussed so far, because they do modify the original string. Here is a template showing how to call sub():
sub(regexp,replstring,mystring)

When sub() is called, it finds the first sequence of characters in mystring that matches regexp and replaces that sequence with replstring. sub() and gsub() take the same arguments; the only difference is that sub() replaces only the first regexp match (if any), while gsub() performs a global replacement, swapping out every match in the string. Here is an example of sub() and gsub() calls:
sub(/o/,"O",mystring)
print mystring
mystring="How are you doing today?"
gsub(/o/,"O",mystring)
print mystring

mystring must be reset to its initial value because the first sub() call modifies mystring directly. When executed, this code makes awk output:
HOw are you doing today?
HOw are yOu dOing tOday?

Of course, more complex regular expressions are possible as well. I leave the task of testing some complex regular expressions to you.
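
Two details of standard awk that the text above does not cover can help with such experiments: sub() and gsub() return the number of replacements they made, and an "&" in the replacement string stands for the matched text. The sketch below relies on both; the function name count_subs is made up for illustration.

```shell
# count_subs is a hypothetical helper name, not part of awk.
count_subs() {
  awk 'BEGIN {
    mystring = "How are you doing today?"
    n = gsub(/o/, "[&]", mystring)   # & stands for the text that matched
    print n                          # number of replacements made
    print mystring                   # mystring was modified in place
  }'
}
count_subs
```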

We wrap up our tour of string functions with split(). The job of split() is to "slice up" a string and place the pieces into an array with integer subscripts. Here is an example split() call:
numelements=split("Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec",mymonths,",")

When calling split(), the first argument contains the literal string or string variable to be cut up. In the second argument, specify the name of the array that split() will fill with the pieces. In the third argument, specify the separator used to cut up the string. When split() returns, it returns the number of pieces the string was split into. split() assigns each piece to the array with subscripts starting at 1, so the following code:
print mymonths[1],mymonths[numelements]
...will print:
Jan Dec
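
Because split() fills consecutive integer subscripts and returns the count, its result can be walked in order with an ordinary for loop. The sketch below prints the first month of each quarter; the function name quarter_starts is made up for illustration.

```shell
# quarter_starts is a hypothetical helper name, not part of awk.
quarter_starts() {
  awk 'BEGIN {
    n = split("Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec", mymonths, ",")
    for (i = 1; i <= n; i += 3) {   # subscripts run from 1 to n
      print i, mymonths[i]
    }
  }'
}
quarter_starts
```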

Special string form
A quick note: when calling length(), sub(), or gsub(), you can omit the last argument, and awk will apply the function to $0 (the entire current line). To print the length of each line in a file, use the following awk script:
{
  print length()
}
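
To try the script without creating a file, you can feed it a few lines on standard input. This is a sketch assuming a POSIX shell; the function name line_lengths and the three sample lines (the last one empty) are made up for illustration.

```shell
# line_lengths is a hypothetical helper name, not part of awk.
line_lengths() {
  # three input lines: "one", "three", and an empty line
  printf 'one\nthree\n\n' | awk '{ print length() }'
}
line_lengths
```

The empty line still counts as a record, so awk reports a length of 0 for it.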
