1. Basic monitoring architecture design
Monitoring metric selection
- Core resources: CPU utilization, memory utilization, disk space and I/O, network traffic, process status, etc.
- Business metrics: HTTP status codes, database connection counts, application response time, etc.
- Containerized scenarios: Docker/Kubernetes container resource usage, Pod health status.
Tools and library selection
- Data collection: psutil (system resources), requests (HTTP status), docker (container monitoring).
- Alert notification: smtplib (email), requests (webhook), twilio (SMS).
- Data storage and visualization: Prometheus (time-series database), Grafana (dashboards), InfluxDB (lightweight storage).
2. Core code implementation and configuration
Scenario 1: Basic resource monitoring and alerting
Configuration notes: use psutil to collect data and send email alerts over SMTP.
Scheduled execution: use crontab to run the script every 5 minutes:
*/5 * * * * /usr/bin/python3 /path/to/
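A minimal sketch of such a script. The 80/85/90% thresholds are illustrative, and the SMTP host and credentials (here read from hypothetical SMTP_HOST, SMTP_USER and SMTP_PASS environment variables) must be replaced with your own:

```python
import os
import smtplib
from email.mime.text import MIMEText

# Illustrative alert thresholds, in percent
THRESHOLDS = {"cpu": 80.0, "memory": 85.0, "disk": 90.0}

def collect_metrics():
    """Sample CPU, memory and root-disk usage as percentages."""
    import psutil  # pip install psutil
    return {
        "cpu": psutil.cpu_percent(interval=1),
        "memory": psutil.virtual_memory().percent,
        "disk": psutil.disk_usage("/").percent,
    }

def over_threshold(metrics, thresholds=THRESHOLDS):
    """Return only the metrics that exceed their configured limit."""
    return {k: v for k, v in metrics.items() if v > thresholds.get(k, 100.0)}

def send_mail_alert(body, subject="Resource alert"):
    """Send an alert email over SMTP (credentials from environment variables)."""
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = "monitor@example.com"
    msg["To"] = "ops@example.com"
    with smtplib.SMTP(os.environ.get("SMTP_HOST", "smtp.example.com"), 587) as server:
        server.starttls()
        server.login(os.environ["SMTP_USER"], os.environ["SMTP_PASS"])
        server.send_message(msg)

# Usage (requires psutil and real SMTP credentials):
#   breaches = over_threshold(collect_metrics())
#   if breaches:
#       send_mail_alert("\n".join(f"{k}: {v:.1f}%" for k, v in breaches.items()))
```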
Scenario 2: HTTP service status monitoring
```python
import requests

def check_http_status(url, expected_code=200):
    try:
        response = requests.get(url, timeout=10)
        if response.status_code != expected_code:
            send_alert(f"Abnormal HTTP status: {url} returned {response.status_code}")
    except Exception as e:
        send_alert(f"Service unreachable: {url}, error: {str(e)}")

def send_alert(message):
    # Integrate a webhook (e.g. DingTalk, WeCom)
    webhook_url = "/robot/send?access_token=xxx"
    headers = {'Content-Type': 'application/json'}
    data = {"msgtype": "text", "text": {"content": message}}
    requests.post(webhook_url, json=data, headers=headers)

# Example call
check_http_status("/api/health")
```
Extended configuration:
- Zabbix integration: use the script output as a custom monitoring item and configure a trigger to fire alerts.
- Prometheus monitoring: use the prometheus-client library to expose metrics for Prometheus to pull.
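A minimal sketch of the prometheus-client approach; the metric name http_service_up and port 8000 are illustrative choices, not fixed conventions:

```python
import requests
from prometheus_client import Gauge, start_http_server

# Hypothetical metric: 1 if the last probe of a URL returned HTTP 200, else 0
HTTP_UP = Gauge("http_service_up", "Result of the last HTTP probe", ["url"])

def probe(url):
    """Probe a URL and record the result so Prometheus can scrape it."""
    try:
        ok = requests.get(url, timeout=10).status_code == 200
    except requests.RequestException:
        ok = False
    HTTP_UP.labels(url=url).set(1.0 if ok else 0.0)
    return ok

# To serve metrics for Prometheus to pull:
#   start_http_server(8000)   # exposes http://<host>:8000/metrics
#   probe("http://localhost/api/health")   # call periodically
```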
Scenario 3: Log Analysis and Anomaly Detection
```python
import re
from collections import defaultdict

def analyze_logs(log_path, pattern=r'ERROR: (.*)'):
    error_counts = defaultdict(int)
    with open(log_path, 'r') as f:
        for line in f:
            match = re.search(pattern, line)
            if match:
                error_type = match.group(1)
                error_counts[error_type] += 1
    # Trigger a threshold alert (send_alert as defined in Scenario 2)
    for error, count in error_counts.items():
        if count > 10:
            send_alert(f"Error type {error} appeared {count} times in the log")

# Example: monitor the Nginx error log
analyze_logs('/var/log/nginx/')
```
Optimization:
- Use loguru or the ELK stack (Elasticsearch + Logstash + Kibana) for log aggregation.
3. Advanced scenarios and integration
1. Container monitoring
Use the docker library to get container status:
```python
import docker

client = docker.from_env()
for container in client.containers.list():
    stats = container.stats(stream=False)
    # The stats payload has no ready-made 'cpu_percent' field;
    # derive it from the raw cgroup counters instead.
    cpu_delta = (stats['cpu_stats']['cpu_usage']['total_usage']
                 - stats['precpu_stats']['cpu_usage']['total_usage'])
    system_delta = (stats['cpu_stats'].get('system_cpu_usage', 0)
                    - stats['precpu_stats'].get('system_cpu_usage', 0))
    cpu_percent = cpu_delta / system_delta * 100.0 if system_delta > 0 else 0.0
    print(f"Container {container.name} CPU usage: {cpu_percent:.2f}%")
```
Kubernetes integration: monitor Pod resources via the kubernetes library.
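A brief sketch using the official kubernetes Python client; summarize_phases is a hypothetical helper, and the client import is deferred so it is only needed when actually querying a cluster:

```python
from collections import Counter

def summarize_phases(pods):
    """Count pods by lifecycle phase (Running, Pending, Failed, ...)."""
    return Counter(pod.status.phase for pod in pods)

def check_pods(namespace="default"):
    """List pods in a namespace and summarize their health by phase."""
    # pip install kubernetes; use load_incluster_config() when running inside a Pod
    from kubernetes import client, config
    config.load_kube_config()
    v1 = client.CoreV1Api()
    return summarize_phases(v1.list_namespaced_pod(namespace).items)

# Usage (requires access to a cluster):
#   print(check_pods("default"))
```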
2. Automatic repair
When low disk space is detected, automatically clean up old logs:
```python
import os
import psutil

# If the root filesystem is more than 90% full, delete *.log files older than 7 days
if psutil.disk_usage('/').percent > 90:
    os.system(r"find /var/log -name '*.log' -mtime +7 -exec rm {} \;")
```
3. Visualization dashboard
Grafana configuration: store the data in InfluxDB and configure a dashboard to display real-time metrics.
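A minimal sketch of writing a metric to InfluxDB with the v1 influxdb client for Grafana to query; build_point, the database name monitoring, and the host tag are illustrative assumptions:

```python
def build_point(measurement, value, tags=None):
    """Build a point dict in the shape expected by InfluxDBClient.write_points()."""
    return {
        "measurement": measurement,
        "tags": tags or {},
        "fields": {"value": float(value)},
    }

def push_metric(value):
    """Write one CPU-usage sample to a local InfluxDB instance."""
    # pip install influxdb (the v1 client); assumes InfluxDB on localhost:8086
    from influxdb import InfluxDBClient
    client = InfluxDBClient(host="localhost", port=8086, database="monitoring")
    client.write_points([build_point("cpu_usage", value, {"host": "web-01"})])
```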
4. Recommended complete tool chain
| Tool/Library | Purpose |
| --- | --- |
| psutil | System resource collection |
| prometheus-client | Expose monitoring metrics |
| Fabric | Batch remote command execution |
| AlertManager | Alert routing and deduplication |
5. Summary
To implement automated operations monitoring with Python, choose the tool chain to fit the scenario:
- Basic monitoring: psutil + SMTP alerts cover single-machine needs.
- Distributed systems: Prometheus + Grafana provide cluster-wide monitoring.
- Log and business monitoring: regex analysis plus the ELK stack speeds up troubleshooting.
- Automated repair: trigger predefined scripts (e.g. cleaning files, restarting services) once a problem is detected.
Things to note:
- Security: store sensitive information (such as passwords) in environment variables or encrypted form.
- Performance overhead: keep the monitoring scripts' own resource usage low so they do not affect the business workload.
- Alert convergence: avoid alert storms with tools such as AlertManager.