As projects come to depend more and more on Erlang, the problems we run into multiply as well. Some time ago one of our online systems hit a high memory consumption problem, so I am recording the troubleshooting and analysis process here. The online system runs Erlang R16B02.
Problem description
Several online systems that had been running for a while saw their memory soar. The system model is very simple: for each network connection, a process is picked from a pool to handle it. Watching memory with the top command showed that the Erlang process had eaten up nearly all of it, while netstat showed only a few thousand network connections. The problem therefore looked like a memory leak in the Erlang system.
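Roughly, the checks described above look like this (exact flags vary by platform):
$ top %% the beam process's resident memory keeps climbing
$ netstat -an | grep ESTABLISHED | wc -l %% only a few thousand connections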
Analysis methods
One advantage of an Erlang system is that you can go straight into the live system and analyze the problem at the production site. Our system is managed through Rebar, and the online system can be reached in several ways.
Log in from the local machine
You can log in to the online machine directly and then attach to the Erlang system with the following commands:
$ cd /path/to/project
$ rel/xxx/bin/xxx attach
(node@host)>
Via a remote shell
Get the Erlang system's cookie:
$ ps -ef | grep beam %% look for the --setcookie parameter
Open a new shell using the same cookie and a different node name.
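For example (a sketch; thecookie is a placeholder for the value found above):
$ erl -name [email protected] -setcookie thecookie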
Then enter the target system by starting a remote shell:
Erlang R16B02 (erts-5.10.3) [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.3 (abort with ^G)
([email protected])1> net_adm:ping('[email protected]').
pong
([email protected])2> nodes().
['[email protected]']
([email protected])3>
Press ^G to bring up the job-control menu and start a remote shell on the target node:
User switch command
--> h
c [nn] - connect to job
i [nn] - interrupt job
k [nn] - kill job
j - list all jobs
s [shell] - start local shell
r [node [shell]] - start remote shell
q - quit erlang
? | h - this message
--> r '[email protected]'
--> j
1 {shell,start,[init]}
2* {'[email protected]',shell,start,[]}
--> c 2
Analysis process
Erlang has many tools for analyzing system information, such as appmon and webtool. However, with system memory this scarce, there was no way to start those tools. Fortunately, there is still the Erlang shell.
The Erlang shell comes with a lot of useful commands, which you can list with help():
> help().
Memory consumption in the Erlang system
The top output pointed to a memory problem, so the first step was to check the Erlang system's memory consumption:
> erlang:memory().
memory() shows the memory allocated by the Erlang emulator: total memory, memory consumed by atoms, memory consumed by processes, and so on.
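Typical output looks something like this (the byte counts are illustrative, from a run where process memory dominates):
[{total,2381013400},
 {processes,2230704564},
 {processes_used,2230338032},
 {system,150308836},
 {atom,512601},
 {atom_used,505731},
 {binary,678732},
 {code,11764375},
 {ets,1470872}]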
Number of Erlang processes created
The live system showed that memory was mainly being consumed by processes. The next step was to determine whether individual processes were leaking memory or whether too many processes had been created.
> erlang:system_info(process_limit). %% maximum number of processes the system can create
> erlang:system_info(process_count). %% number of processes currently alive
system_info() returns information about the current system, such as the number of processes and ports. I was shocked when I ran the commands above: there were only 2-3k network connections, yet the system already had over 100,000 Erlang processes. Processes were being created but then piling up without exiting, due to the code or some other reason.
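To see what kind of process is piling up, one rough approach (a sketch; it uses dict, since R16 predates maps) is to group all live processes by the function they are currently sitting in:
> D = lists:foldl(
          fun(P, Acc) ->
              case erlang:process_info(P, current_function) of
                  {current_function, MFA} -> dict:update_counter(MFA, 1, Acc);
                  undefined -> Acc %% process exited while we were iterating
              end
          end, dict:new(), erlang:processes()),
  lists:reverse(lists:keysort(2, dict:to_list(D))).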
View information for a single process
Since processes were accumulating for some reason, the cause had to be found in the processes themselves.
First, get the pid of one of the piled-up processes:
> i(). %% prints system information, including a process list
> i(0,61,886). %% prints info for the process with pid <0.61.886>
i() showed a lot of processes hanging there. Checking a specific pid revealed messages sitting unprocessed in its message_queue. Here the powerful erlang:process_info() comes in: it can retrieve quite rich information about a process.
> erlang:process_info(pid(0,61,886), current_stacktrace).
> rp(erlang:process_info(pid(0,61,886), backtrace)).
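The size of the backlog itself can be read with the message_queue_len item (the count shown here is illustrative):
> erlang:process_info(pid(0,61,886), message_queue_len).
{message_queue_len,83}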
Viewing the process's backtrace revealed the following:
0x00007fbd6f18dbf8 Return addr 0x00007fbff201aa00 (gen_event:rpc/2 + 96)
y(0) #Ref<0.0.2014.142287>
y(1) infinity
y(2) {sync_notify,{log,{lager_msg,[], ..........}}
y(3) <0.61.886>
y(4) <0.89.0>
y(5) []
The process had stopped while calling into lager, a third-party Erlang logging library.
Cause of the problem
Checking the lager documentation turned up the following:
Prior to lager 2.0, the gen_event at the core of lager operated purely in synchronous mode. Asynchronous mode is faster, but has no protection against message queue overload. In lager 2.0, the gen_event takes a hybrid approach. it polls its own mailbox size and toggles the messaging between synchronous and asynchronous depending on mailbox size.
{async_threshold, 20}, {async_threshold_window, 5}
This will use async messaging until the mailbox exceeds 20 messages, at which point synchronous messaging will be used, and switch back to asynchronous, when size reduces to 20 - 5 = 15.
If you wish to disable this behaviour, simply set it to 'undefined'. It defaults to a low number to prevent the mailbox growing rapidly beyond the limit and causing problems. In general, lager should process messages as fast as they come in, so getting 20 behind should be relatively exceptional anyway.
It turns out lager has a configuration item that sets a threshold of unprocessed messages: once the backlog exceeds it, messages are handled synchronously!
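In sys.config terms the relevant knobs look like this; 20 and 5 are the documented defaults quoted above, and setting async_threshold to undefined disables the toggling entirely:
{lager, [
  {async_threshold, 20},
  {async_threshold_window, 5}
]}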
Our system had debug logging enabled, and the resulting flood of log messages swamped the system.
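As an immediate mitigation, the log level can be raised at runtime, e.g. for the console backend (a sketch; the backend names depend on your lager configuration):
> lager:set_loglevel(lager_console_backend, warning).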
Others have run into similar problems; the analysis in this thread was a great help.
Summary
Erlang provides a wealth of tools for going into a live system and analyzing problems on site, which makes locating problems fast and efficient. At the same time, the power of Erlang/OTP makes systems more stable. We will keep exploring Erlang and look forward to sharing more hands-on experience.