
Detailed explanation of how to use shell command to count logs

Preface

Logs can be counted and analyzed easily with shell commands. When a service behaves abnormally, the logs have to be checked, so knowing how to pull statistics out of logs is an essential skill.

Suppose we have a log file with the following content; we will use these logs as the example throughout this article.

date=2017-09-23 13:32:50 | ip=40.80.31.153 | method=GET | url=/api/foo/bar?params=something | status=200 | time=9.703 | bytes=129 | referrer="-" | user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7" | cookie="-"
date=2017-09-23 00:00:00 | ip=100.109.222.3 | method=HEAD | url=/api/foo/healthcheck | status=200 | time=0.337 | bytes=10 | referrer="-" | user-agent="-" | cookie="-"
date=2017-09-23 13:32:50 | ip=40.80.31.153 | method=GET | url=/api/foo/bar?params=anything | status=200 | time=8.829 | bytes=466 | referrer="-" | user-agent="GuzzleHttp/6.2.0 curl/7.19.7 PHP/7.0.15" | cookie="-"
date=2017-09-23 13:32:50 | ip=40.80.31.153 | method=GET | url=/api/foo/bar?params=everything | status=200 | time=9.962 | bytes=129 | referrer="-" | user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7" | cookie="-"
date=2017-09-23 13:32:50 | ip=40.80.31.153 | method=GET | url=/api/foo/bar?params=nothing | status=200 | time=11.822 | bytes=121 | referrer="-" | user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7" | cookie="-"

Log formats differ between services. The format of the sample logs used in this article is:

date | ip | method | url | status | time | bytes | referrer | user-agent | cookie

Notice: command behavior may differ between macOS and Linux. Please run the following commands on a Linux system.

Exclude special logs

When counting logs, we may not care about HEAD requests, or may only care about GET requests, so we need to filter the logs first. The grep command does this; -v excludes the lines that match.

grep GET              # only count GET requests
grep -v HEAD          # exclude HEAD requests
grep -v 'HEAD\|POST'  # exclude both HEAD and POST requests
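
These commands read from standard input or from a file passed as the last argument. A quick usage sketch, assuming the sample lines above are saved in a hypothetical file named access.log:

grep GET access.log | wc -l       # number of GET request lines
grep -v HEAD access.log | wc -l   # number of lines that are not HEAD requests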

Check the interface time consumption

We can extract the time from each line and then sort the values. awk's match function lets us apply a regular expression:

awk '{ match($0, /time=([0-9]+\.[0-9]+)/, result); print result[1]}' 

The awk command is used as follows:

awk 'pattern { action }' filenames

Here we only use the action part: match($0, /time=([0-9]+\.[0-9]+)/, result); print result[1].

The match function takes three parameters: the text to match, the regular expression, and an array that receives the results. $0 is the current line being processed by awk. The array argument is optional; since we want the captured value, we pass result to store the match.

Note that I did not use \d to represent digits, because awk uses EREs by default, which do not support \d. For details, see "Comparison of the differences between linux shell regular expressions (BREs, EREs, PREs)".
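
The same distinction can be seen with grep, which uses BREs by default and EREs with -E; \d is only available in Perl-compatible regular expressions, e.g. with grep -P (assuming a GNU grep built with PCRE support):

echo 'time=9.703' | grep -oE 'time=[0-9]+\.[0-9]+'   # ERE: matches, prints time=9.703
echo 'time=9.703' | grep -oP 'time=\d+\.\d+'         # PCRE: \d works here, prints time=9.703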

The result array works much like the array returned by a JavaScript match: result[0] holds the whole match and result[1] holds the first capture group, so we print result[1], the matched time. Running this command gives:

9.703
0.337
8.829
9.962
11.822

Of course, a real log file may contain thousands of lines per day, so we want to sort the values and show only the top 3. The sort command handles this.

sort orders lines from smallest to largest by default and compares them as strings, so out of the box "11" would end up before "8". We need -n to sort numerically and -r to sort from largest to smallest.
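
A throwaway sketch (not taken from the logs) makes the difference visible:

printf '8\n11\n9\n' | sort       # string sort: 11, 8, 9
printf '8\n11\n9\n' | sort -n    # numeric sort: 8, 9, 11
printf '8\n11\n9\n' | sort -rn   # numeric, descending: 11, 9, 8

Adding -rn and head to the previous command gives the three slowest times: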

awk '{ match($0, /time=([0-9]+\.[0-9]+)/, result); print result[1]}'  | sort -rn | head -3

result:

11.822
9.962
9.703

View the most time-consuming interface

Of course, we usually don't just want the raw timings; we also need to print the corresponding log details, which the commands above cannot do.

awk splits each line on whitespace by default. For a line such as "2017-09-23 GET", awk '{print $1}' prints "2017-09-23", and $2 prints "GET".
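
A quick throwaway check (not from the log file):

echo '2017-09-23 GET /api/foo/bar' | awk '{print $2}'   # prints GET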

Given the log format, we can use | as the separator and print out each field we are interested in. Since we want to find the most time-consuming interface, we will pull out the time, date and url.

awk's -F parameter sets a custom field separator. Splitting on |, the three fields we want are: time is the 6th, date is the 1st, and url is the 4th.

awk -F '|' '{print $6 $1 $4}' 

The result is:

 time=9.703 date=2017-09-23 13:32:50 url=/api/foo/bar?params=something
 time=0.337 date=2017-09-23 00:00:00 url=/api/foo/healthcheck
 time=8.829 date=2017-09-23 13:32:50 url=/api/foo/bar?params=anything
 time=9.962 date=2017-09-23 13:32:50 url=/api/foo/bar?params=everything
 time=11.822 date=2017-09-23 13:32:50 url=/api/foo/bar?params=nothing

We want to sort by time, and sort works on whitespace-separated columns. Our first column is time=xxx, which cannot be sorted numerically as-is, so we need to remove the time= prefix. Since we deliberately put the time in the first column, we can simply split on time= with a second awk pass.

awk -F '|' '{print $6 $1 $4}'  | awk -F 'time=' '{print $2}'

result:

9.703 date=2017-09-23 13:32:50 url=/api/foo/bar?params=something
0.337 date=2017-09-23 00:00:00 url=/api/foo/healthcheck
8.829 date=2017-09-23 13:32:50 url=/api/foo/bar?params=anything
9.962 date=2017-09-23 13:32:50 url=/api/foo/bar?params=everything
11.822 date=2017-09-23 13:32:50 url=/api/foo/bar?params=nothing

sort's -k parameter specifies which column to sort on, here the 1st; combined with the sorting flags above, the most time-consuming requests can be printed out:

awk -F '|' '{print $6 $1 $4}'  | awk -F 'time=' '{print $2}' | sort -k1nr | head -3

result:

11.822 date=2017-09-23 13:32:50 url=/api/foo/bar?params=nothing
9.962 date=2017-09-23 13:32:50 url=/api/foo/bar?params=everything
9.703 date=2017-09-23 13:32:50 url=/api/foo/bar?params=something
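
As a side note, the time= prefix could also be stripped inside the first awk pass using the POSIX sub() function instead of a second awk; a minimal sketch that produces the same ordering:

awk -F '|' '{ sub(/ *time=/, "", $6); print $6, $1, $4 }' | sort -k1nr | head -3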

The interface with the most requests

If we need to count which interfaces receive the most requests per day, we only need to introduce one new command: uniq.

We can already extract all the urls with grep -v HEAD | awk -F '|' '{print $4}'. The uniq command collapses adjacent identical lines, and -c prefixes each line with its number of occurrences.
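
Because uniq only collapses adjacent lines, the input has to be sorted first; a small throwaway sketch shows why:

printf 'a\nb\na\n' | uniq -c          # three lines with count 1, since the two a lines are not adjacent
printf 'a\nb\na\n' | sort | uniq -c   # 2 a, 1 b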

So we first sort the urls to put the same urls together, and then use uniq -c to count the number of occurrences:

grep -v HEAD  | awk -F '|' '{print $4}' | sort | uniq -c

Because the sample log is so small, let's assume there were many more entries; the result would look something like this:

1 url=/api/foo/bar?params=anything
19 url=/api/foo/bar?params=everything
4 url=/api/foo/bar?params=nothing
5 url=/api/foo/bar?params=something

Next, sort by count and take the top 10:

grep -v HEAD  | awk -F '|' '{print $4}' | sort | uniq -c | sort -k1nr | head -10
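
If the log file covers several days and we only want one day's traffic (the goal here is requests per day), a date filter can go in front; a sketch assuming the same hypothetical access.log as above:

grep 'date=2017-09-23' access.log | grep -v HEAD | awk -F '|' '{print $4}' | sort | uniq -c | sort -k1nr | head -10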

Summary

That's all for this article. I hope it is helpful for your study or work. If you have any questions, feel free to leave a comment. Thank you for your support.