How to Create Effective Health Check Scripts for Linux Systems - Step-by-Step Guide

Learn to create effective health check scripts for Linux systems, ensuring optimal performance and reliability with essential metrics and alerts.

Last updated: 2025-12-20

This guide walks through creating custom health check scripts for Linux systems, step by step and with working examples. It covers the essential metrics to monitor, bash scripting techniques, alerting, logging, and best practices for keeping systems performant and reliable.

Introduction to Linux Health Checks

Linux health checks are automated scripts or processes that systematically evaluate your server's operational status, performance metrics, and overall condition. These checks examine system resources, service status, performance indicators, and security status to ensure everything is functioning correctly and identify potential issues before they escalate into critical problems.

Health check scripts transform server management from reactive troubleshooting to proactive maintenance. By automating regular health checks, you can continuously monitor system status, detect problems early, and respond quickly to issues before they impact users or cause downtime. Effective health checks help maintain optimal performance, prevent resource exhaustion, and ensure reliable service delivery.

The goal of a Linux health check script is automated, recurring monitoring that surfaces issues within one check interval and enables a rapid response. Custom scripts tailored to your needs let you monitor exactly what matters most for your infrastructure, integrate with your existing tools, and keep complete control over your monitoring approach.

Essential Metrics to Monitor

Understanding which metrics to include in health checks ensures comprehensive system monitoring.

CPU Usage Metrics

Monitor CPU performance to detect bottlenecks:

  • CPU Utilization: Overall processor usage percentage. Should typically stay below 70-80% under normal load. Sustained high CPU usage indicates potential bottlenecks.
  • Load Average: System load over 1, 5, and 15 minutes. Load average should be below the number of CPU cores for optimal performance.
  • CPU Wait Time: Time CPU spends waiting for I/O operations. High wait times suggest disk or network bottlenecks rather than CPU limitations.
  • Top CPU Processes: Identify which processes consume the most CPU resources to pinpoint resource-intensive applications.
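
As a rough illustration of the load-average guideline above, the sketch below compares the 1-minute load average with the number of CPU cores. The function name `load_exceeds_cores` is hypothetical, and the example assumes a Linux system that provides /proc/loadavg and the nproc command.

```bash
#!/bin/bash
# Hypothetical helper: exits 0 when the load average is above the core count.
# Arguments: $1 = load average (e.g. 3.52), $2 = number of CPU cores.
load_exceeds_cores() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

# Only run the live check where the assumed Linux interfaces exist.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    # The first field of /proc/loadavg is the 1-minute load average.
    read -r load_1min _ < /proc/loadavg
    cores=$(nproc)

    if load_exceeds_cores "$load_1min" "$cores"; then
        echo "WARNING: load ${load_1min} exceeds ${cores} cores"
    else
        echo "OK: load ${load_1min} is within ${cores} cores"
    fi
fi
```

Using awk for the comparison keeps the floating-point math out of the shell, which only handles integers.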

Memory Usage Metrics

Monitor memory to prevent out-of-memory conditions:

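  • RAM Usage: Total and available memory. Aim to keep at least 10-20% of memory available, judged by the "available" column of free rather than the "free" column, because Linux deliberately uses otherwise idle memory for cache. High memory usage can cause swapping and performance degradation.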
  • Swap Usage: Virtual memory usage on disk. High swap usage indicates insufficient RAM. Excessive swapping dramatically impacts performance.
  • Memory Pressure: How close the system is to memory limits. Monitor trends to predict when upgrades are needed.
  • Memory Leaks: Processes with continuously increasing memory consumption. Early detection prevents memory exhaustion.
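
One possible way to measure memory pressure is to read MemAvailable directly from /proc/meminfo, as sketched below. The helper name `mem_used_percent` is hypothetical, and the sketch assumes a kernel recent enough to report MemAvailable (3.14 or later).

```bash
#!/bin/bash
# Hypothetical helper: print the percentage of memory in use, based on
# MemTotal and MemAvailable.
# Argument: $1 = path to a meminfo-style file (defaults to /proc/meminfo).
mem_used_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", (total - avail) / total * 100 }' \
        "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "Memory in use: $(mem_used_percent)%"
fi
```

Logging this value on every run gives a simple trend line for spotting gradual growth that may point to a leak.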

Disk Usage Metrics

Monitor disk to prevent storage exhaustion:

  • Disk Space Usage: Available storage capacity. Maintain at least 15-20% free disk space. Running out of disk space can cause service failures and data loss.
  • Disk I/O Performance: Read/write operations per second and latency. High I/O rates or latency indicate potential bottlenecks.
  • Inode Usage: File system metadata capacity. Running out of inodes prevents file creation even when disk space is available.
  • I/O Wait Time: CPU time spent waiting for disk I/O operations. High I/O wait suggests disk bottlenecks.
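
Inode exhaustion is easy to miss because plain df does not show it. The following sketch lists file systems whose inode usage is at or above a limit; the helper name `inodes_over_limit` and the 90% limit are assumptions for illustration, and mount points containing spaces are not handled.

```bash
#!/bin/bash
# Hypothetical helper: read `df -Pi` output on stdin and print the mount
# points whose inode usage is at or above the limit in $1 (percent).
inodes_over_limit() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        # File systems without inode counts report "-" and are skipped.
        if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6 " " pct "%"
    }'
}

# -P gives one line per file system, -i switches from blocks to inodes.
df -Pi 2>/dev/null | inodes_over_limit 90
```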

Network Performance Metrics

Monitor network to ensure connectivity:

  • Bandwidth Usage: Network traffic volume relative to capacity. High utilization may indicate attacks or saturation.
  • Network Latency: Response times for network requests. Round trips on a local network normally take a few milliseconds at most, so latency approaching 100ms is worth investigating.
  • Packet Loss: Percentage of packets lost during transmission. Should be near 0%.
  • Connection Count: Active network connections. Unusually high counts may indicate attacks or misconfigurations.
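
Packet loss can be read from the summary line that ping prints. The sketch below shows one way to extract it; the function names are hypothetical, and the live measurement is wrapped in a function that is not called here because ICMP is blocked on some networks.

```bash
#!/bin/bash
# Hypothetical helper: extract the packet-loss percentage from the summary
# line that ping prints, read from stdin.
parse_packet_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical wrapper: send 5 quiet pings to host $1 and print the loss.
measure_packet_loss() {
    ping -c 5 -q -W 2 "$1" 2>/dev/null | parse_packet_loss
}

# Demonstration with a captured summary line instead of live traffic.
sample="5 packets transmitted, 4 received, 20% packet loss, time 4005ms"
echo "$sample" | parse_packet_loss   # prints 20
```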

Service Status Metrics

Monitor critical services:

  • Service Status: Verify all critical services are running and enabled at boot.
  • Process Count: Ensure the expected number of processes are running.
  • Port Availability: Verify services are listening on the correct ports.
  • Response Times: Services should respond within acceptable timeframes.
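
Port availability can be checked by parsing the listening-socket table from ss. This is a minimal sketch: the helper name `port_is_listening` and the ports 22, 80, and 443 are assumptions chosen for illustration, not values from the guide.

```bash
#!/bin/bash
# Hypothetical helper: read `ss -ltn` output on stdin and succeed if any
# listening socket uses the TCP port given in $1.
port_is_listening() {
    awk -v port="$1" 'NR > 1 {
        # Column 4 is the local address, e.g. 0.0.0.0:22 or [::]:80;
        # the port is whatever follows the last colon.
        n = split($4, parts, ":")
        if (parts[n] == port) found = 1
    }
    END { exit !found }'
}

if command -v ss >/dev/null 2>&1; then
    for port in 22 80 443; do
        if ss -ltn | port_is_listening "$port"; then
            echo "OK: port $port is listening"
        else
            echo "WARNING: nothing is listening on port $port"
        fi
    done
fi
```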

Creating Your First Health Check Script

Follow these step-by-step instructions to create your first health check script.

Step 1: Create Script File

Create a new bash script file:

# Create script file (writing to /usr/local/bin usually requires root)
sudo nano /usr/local/bin/health-check.sh

# Or use your preferred editor
sudo vim /usr/local/bin/health-check.sh

Step 2: Add Script Header

Start with a proper script header:

#!/bin/bash
# Linux System Health Check Script
# Description: Monitors CPU, memory, disk, and network metrics
# Author: Your Name
# Date: YYYY-MM-DD (fill in by hand; $(date ...) is not expanded inside a comment)

# Set script options
set -euo pipefail

# Configuration
THRESHOLD_CPU=80
THRESHOLD_MEM=85
THRESHOLD_DISK=90
LOG_FILE="/var/log/health-check.log"

Step 3: Add Logging Function

Create a logging function for consistent log output:

# Logging function
log_message() {
    local level=$1
    shift
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*" | tee -a "$LOG_FILE"
}

Step 4: Add CPU Check Function

Create a function to check CPU usage:

# Check CPU usage
check_cpu() {
    local cpu_usage load_avg cpu_cores
    # LC_ALL=C keeps top's decimal separator as "." regardless of system locale
    cpu_usage=$(LC_ALL=C top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
    # The first field of /proc/loadavg is the 1-minute load average
    load_avg=$(cut -d' ' -f1 /proc/loadavg)
    cpu_cores=$(nproc)

    log_message "INFO" "CPU Usage: ${cpu_usage}%, Load Average: ${load_avg}, CPU Cores: ${cpu_cores}"

    # awk handles the floating-point comparison, so bc does not need to be installed
    if awk -v usage="$cpu_usage" -v limit="$THRESHOLD_CPU" 'BEGIN { exit !(usage > limit) }'; then
        log_message "WARNING" "CPU usage is ${cpu_usage}% (threshold: ${THRESHOLD_CPU}%)"
        return 1
    fi

    return 0
}

Step 5: Add Memory Check Function

Create a function to check memory usage:

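# Check memory usage
check_memory() {
    local mem_total mem_used mem_percent swap_total swap_used
    # One free call; LC_ALL=C keeps the "Mem:" and "Swap:" labels untranslated.
    # Column 2 is total and column 3 is used, both in KiB by default.
    read -r mem_total mem_used swap_total swap_used <<< "$(LC_ALL=C free | awk '/^Mem:/ {mt=$2; mu=$3} /^Swap:/ {st=$2; su=$3} END {print mt, mu, st, su}')"
    mem_percent=$(awk -v used="$mem_used" -v total="$mem_total" 'BEGIN {printf "%.2f", (used/total)*100}')

    log_message "INFO" "Memory Usage: ${mem_percent}%, Swap Used: ${swap_used}KB of ${swap_total}KB"

    # awk handles the floating-point comparison, so bc does not need to be installed
    if awk -v usage="$mem_percent" -v limit="$THRESHOLD_MEM" 'BEGIN { exit !(usage > limit) }'; then
        log_message "WARNING" "Memory usage is ${mem_percent}% (threshold: ${THRESHOLD_MEM}%)"
        return 1
    fi

    return 0
}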

Step 6: Add Disk Check Function

Create a function to check disk usage:

# Check disk usage
check_disk() {
    # -P forces one line per file system, so long device names cannot shift the columns
    local disk_usage=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
    
    log_message "INFO" "Disk Usage: ${disk_usage}%"
    
    if [ "$disk_usage" -gt "$THRESHOLD_DISK" ]; then
        log_message "WARNING" "Disk usage is ${disk_usage}% (threshold: ${THRESHOLD_DISK}%)"
        return 1
    fi
    
    return 0
}

Step 7: Add Main Function

Create a main function that runs all checks:

# Main health check function
main() {
    log_message "INFO" "Starting health check..."
    
    local exit_code=0
    
    check_cpu || exit_code=1
    check_memory || exit_code=1
    check_disk || exit_code=1
    
    if [ $exit_code -eq 0 ]; then
        log_message "INFO" "Health check completed successfully"
    else
        log_message "ERROR" "Health check detected issues"
    fi
    
    return $exit_code
}

# Run main function
main "$@"

Step 8: Make Script Executable

Make the script executable:

sudo chmod +x /usr/local/bin/health-check.sh

Step 9: Test the Script

Test your script:

# Run script manually (root is usually needed to write to /var/log/health-check.log)
sudo /usr/local/bin/health-check.sh

# Check log output
tail -f /var/log/health-check.log

Step 10: Schedule with Cron

Schedule the script to run automatically:

# Edit root's crontab so the job can write to /var/log/health-check.log
sudo crontab -e

# Add line to run every 5 minutes
*/5 * * * * /usr/local/bin/health-check.sh

Advanced Health Check Techniques

Enhance your health check scripts with advanced features.

Adding Email Alerts

Add email notification when issues are detected:

# Email alert function
send_alert() {
    local subject="Health Check Alert: $1"
    local message="$2"
    local recipient="alerts@example.com"  # placeholder address; replace with your own
    
    echo "$message" | mail -s "$subject" "$recipient"
}

# Modify check functions to send alerts
check_cpu() {
    # ... existing code from Step 4 (sets cpu_usage and logs it) ...
    if (( $(echo "$cpu_usage > $THRESHOLD_CPU" | bc -l) )); then
        log_message "WARNING" "CPU usage is ${cpu_usage}%"
        send_alert "High CPU Usage" "CPU usage is ${cpu_usage}% (threshold: ${THRESHOLD_CPU}%)"
        return 1
    fi
    return 0
}

Adding Webhook Notifications

Send alerts via webhook for integration with monitoring systems:

# Webhook alert function
send_webhook() {
    local message="$1"
    local webhook_url="https://your-webhook-url.com/alert"
    
    local payload=$(cat <<EOF
{
  "server": "$(hostname)",
  "message": "$message",
  "timestamp": "$(date -Iseconds)"
}
EOF
)
    
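    # -sS hides the progress meter but still prints errors;
    # --max-time keeps a hung endpoint from blocking the health check
    curl -sS --max-time 10 -X POST -H "Content-Type: application/json" \
         -d "$payload" \
         "$webhook_url"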
}
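
Because the payload is assembled by string interpolation, a message containing a double quote or backslash would produce invalid JSON. The sketch below shows one possible escaping helper; `json_escape` is a hypothetical name, and control characters such as newlines are deliberately left out of scope.

```bash
#!/bin/bash
# Hypothetical helper: escape backslashes and double quotes so a string can
# be placed inside a JSON string literal. Backslashes are escaped first so
# that the backslashes added for quotes are not doubled again.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

message='Disk "/" is 95% full'
payload=$(printf '{"server": "%s", "message": "%s"}' \
    "$(hostname)" "$(json_escape "$message")")
echo "$payload"
```

If jq is available on the server, building the payload with it would be a more complete alternative.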

Adding Service Status Checks

Check critical service status:

# Check service status
check_services() {
    local services=("nginx" "mysql" "ssh")
    local failed_services=()
    
    for service in "${services[@]}"; do
        if ! systemctl is-active --quiet "$service"; then
            failed_services+=("$service")
            log_message "ERROR" "Service $service is not running"
        fi
    done
    
    if [ ${#failed_services[@]} -gt 0 ]; then
        log_message "ERROR" "Failed services: ${failed_services[*]}"
        return 1
    fi
    
    return 0
}

Adding Disk I/O Monitoring

Monitor disk I/O performance:

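# Check disk I/O
check_disk_io() {
    if ! command -v iostat &> /dev/null; then
        log_message "WARNING" "iostat not available, skipping disk I/O check"
        return 0
    fi

    # Take two reports one second apart: the first covers the time since
    # boot, so only the second reflects current activity. Each "Device"
    # header resets the maximum, which leaves the busiest device's %util
    # (last column) from the final report, truncated to an integer.
    local disk_util
    disk_util=$(LC_ALL=C iostat -dx 1 2 | awk '
        $1 ~ /^Device/ { max = 0; next }
        NF > 1 && $NF ~ /^[0-9.]+$/ { if ($NF + 0 > max) max = $NF + 0 }
        END { printf "%d", max }')

    if [ -n "$disk_util" ] && [ "$disk_util" -gt 80 ]; then
        log_message "WARNING" "Disk utilization is ${disk_util}%"
        return 1
    fi

    return 0
}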

Adding Network Connectivity Checks

Check network connectivity:

# Check network connectivity
check_network() {
    local test_hosts=("8.8.8.8" "google.com")
    local failed_hosts=()
    
    for host in "${test_hosts[@]}"; do
        if ! ping -c 1 -W 2 "$host" &> /dev/null; then
            failed_hosts+=("$host")
            log_message "WARNING" "Cannot reach $host"
        fi
    done
    
    if [ ${#failed_hosts[@]} -gt 0 ]; then
        log_message "ERROR" "Network connectivity issues to: ${failed_hosts[*]}"
        return 1
    fi
    
    return 0
}

Comprehensive Health Check Script Example

Complete example combining all features:

#!/bin/bash
# Comprehensive Linux Health Check Script

set -euo pipefail

# Configuration
THRESHOLD_CPU=80
THRESHOLD_MEM=85
THRESHOLD_DISK=90
LOG_FILE="/var/log/health-check.log"
ALERT_EMAIL="alerts@example.com"  # placeholder address; replace with your own
WEBHOOK_URL="https://your-webhook-url.com/alert"

# Logging function
log_message() {
    local level=$1
    shift
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*" | tee -a "$LOG_FILE"
}

# Alert functions
send_email() {
    echo "$2" | mail -s "Health Check Alert: $1" "$ALERT_EMAIL"
}

send_webhook() {
    local payload=$(cat <<EOF
{"server": "$(hostname)", "alert": "$1", "message": "$2", "timestamp": "$(date -Iseconds)"}
EOF
)
    curl -s -X POST -H "Content-Type: application/json" -d "$payload" "$WEBHOOK_URL" &> /dev/null || true
}

# Check functions: paste the full definitions of check_cpu, check_memory,
# check_disk, check_services, check_disk_io, and check_network from the
# sections above here. They are omitted in this listing for brevity.

# Main function
main() {
    log_message "INFO" "Starting comprehensive health check..."
    
    local exit_code=0
    local alerts=()
    
    check_cpu || { exit_code=1; alerts+=("High CPU usage"); }
    check_memory || { exit_code=1; alerts+=("High memory usage"); }
    check_disk || { exit_code=1; alerts+=("High disk usage"); }
    check_services || { exit_code=1; alerts+=("Service failures"); }
    check_disk_io || { exit_code=1; alerts+=("Disk I/O issues"); }
    check_network || { exit_code=1; alerts+=("Network issues"); }
    
    if [ ${#alerts[@]} -gt 0 ]; then
        local alert_message="Health check detected issues: ${alerts[*]}"
        log_message "ERROR" "$alert_message"
        send_email "System Issues Detected" "$alert_message"
        send_webhook "System Issues" "$alert_message"
    else
        log_message "INFO" "All health checks passed"
    fi
    
    return $exit_code
}

main "$@"

Tools and Software for Linux Monitoring

While custom scripts provide flexibility, monitoring tools complement health check scripts.

Automated Monitoring Solutions

Zuzia.app - Cloud-based automated monitoring:

  • Automatic Host Metrics collection
  • Continuous 24/7 monitoring
  • Historical data storage
  • Intelligent alerting
  • Dashboard visualization
  • Complements custom scripts with comprehensive monitoring

Nagios - Enterprise monitoring solution:

  • Extensive plugin ecosystem
  • Flexible alerting system
  • Web-based interface
  • Can execute custom health check scripts
  • Integrates with existing scripts

Zabbix - Open-source enterprise monitoring:

  • Comprehensive monitoring capabilities
  • Auto-discovery features
  • Custom script execution support
  • Advanced alerting and visualization
  • Can run health check scripts as external checks

Prometheus + Grafana - Time-series monitoring:

  • Powerful query language
  • Highly customizable dashboards
  • Script exporter for custom metrics
  • Can collect metrics from health check scripts
  • Excellent for technical teams

Command-Line Tools

Use these tools within health check scripts:

  • top/htop: Process and CPU monitoring
  • free: Memory usage information
  • df: Disk space usage
  • iostat: Disk I/O statistics
  • vmstat: System-wide statistics
  • systemctl: Service status checking
  • ss/netstat: Network connection monitoring

Best Practices for Health Check Scripts

Follow these best practices to maintain effective health check scripts.

Keep Scripts Simple and Focused

  • Single responsibility: Each function should check one metric
  • Clear logic: Use straightforward conditional logic
  • Readable code: Add comments explaining complex logic
  • Modular design: Break scripts into reusable functions

Simple scripts are easier to maintain, debug, and modify as requirements change.

Use Proper Error Handling

  • Set error options: Use set -euo pipefail for strict error handling
  • Check command availability: Verify tools exist before using them
  • Handle failures gracefully: Don't let one failed check stop others
  • Return appropriate exit codes: Use exit codes to indicate success/failure

Proper error handling ensures scripts run reliably and provide accurate results.
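
The sketch below illustrates how two of these points can fit together: verifying tools up front, and recording a failed check without stopping the others. The names `require_commands`, `check_ok`, and `check_fail` are hypothetical stand-ins, and only `set -eu` is used here; the same pattern applies when `-o pipefail` is added.

```bash
#!/bin/bash
set -eu   # the scripts in this guide also add -o pipefail

# Hypothetical helper: report any missing commands and fail if one is absent.
require_commands() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "Missing required command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Two stand-in checks: one passes, one fails.
check_ok()   { return 0; }
check_fail() { return 1; }

exit_code=0
# A command on the left of || is exempt from set -e, so a failing check is
# recorded here instead of terminating the script.
check_ok   || exit_code=1
check_fail || exit_code=1
require_commands awk df || exit_code=1

echo "Checks finished with exit code $exit_code"
```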

Implement Logging

  • Structured logging: Use consistent log format with timestamps
  • Log levels: Use INFO, WARNING, ERROR levels appropriately
  • Log rotation: Implement log rotation to prevent disk filling
  • Centralized logging: Consider sending logs to a centralized system

Good logging enables troubleshooting and provides an audit trail of system health.
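
On most distributions the logrotate utility is the usual way to rotate logs. Where that is not an option, a size-based rotation step inside the script is one alternative. The following is a minimal sketch: `rotate_log` is a hypothetical helper that keeps a single previous copy, and the demonstration runs against a temporary file rather than the real log.

```bash
#!/bin/bash
# Hypothetical helper: rotate a log file once it exceeds a size limit.
# $1 = log file, $2 = maximum size in bytes. Keeps one old copy (<file>.1).
rotate_log() {
    local file="$1" max_bytes="$2" size
    [ -f "$file" ] || return 0
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -gt "$max_bytes" ]; then
        mv -f "$file" "$file.1"
        : > "$file"   # start a fresh, empty log
    fi
}

# Demonstration on a temporary file: 18 bytes is over the 5-byte limit,
# so the content moves to <file>.1 and the log is emptied.
demo_log=$(mktemp)
printf 'line one\nline two\n' > "$demo_log"
rotate_log "$demo_log" 5
ls -l "$demo_log" "$demo_log.1"
rm -f "$demo_log" "$demo_log.1"
```

In the health check script, a call such as `rotate_log "$LOG_FILE" 10485760` before the first log_message would cap the log at roughly 10 MB.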

Set Appropriate Thresholds

  • Baseline first: Monitor for 1-2 weeks to establish baselines
  • Adjust thresholds: Fine-tune based on actual usage patterns
  • Different thresholds: Use different thresholds for different servers
  • Document reasoning: Document why thresholds are set as they are

Appropriate thresholds reduce false positives while ensuring critical issues are detected.
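
One way to use different thresholds on different servers without editing the script is an optional per-host override file. In this sketch the function `load_thresholds` and the directory /etc/health-check.d are assumptions for illustration; the defaults are the values used earlier in this guide.

```bash
#!/bin/bash
# Hypothetical helper: set the default thresholds, then apply overrides from
# the file in $1 if it exists. The file holds plain assignments, for example:
#   THRESHOLD_MEM=95   # database host, large page cache is expected
load_thresholds() {
    THRESHOLD_CPU=80
    THRESHOLD_MEM=85
    THRESHOLD_DISK=90
    if [ -r "$1" ]; then
        # The file is sourced, so it must be trusted and writable only by root.
        . "$1"
    fi
}

load_thresholds "/etc/health-check.d/$(hostname).conf"
echo "Thresholds: cpu=${THRESHOLD_CPU}% mem=${THRESHOLD_MEM}% disk=${THRESHOLD_DISK}%"
```

Keeping the reasoning as a comment next to each override also covers the documentation point above.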

Test Scripts Thoroughly

  • Test in non-production: Test scripts in development environments first
  • Test edge cases: Test with high/low values, missing tools, etc.
  • Test failure scenarios: Verify alerts work when issues are detected
  • Regular review: Periodically review and update scripts

Thorough testing ensures scripts work correctly in all scenarios.
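
Failure scenarios can be exercised without filling a real disk by temporarily replacing a command with a shell function. The sketch below uses a simplified, logging-free variant of the disk check and a fake `df` that reports 97% usage; both are illustrative assumptions rather than the exact code from the steps above.

```bash
#!/bin/bash
THRESHOLD_DISK=90

# Simplified variant of the disk check, without logging.
check_disk() {
    local disk_usage
    disk_usage=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
    [ "$disk_usage" -le "$THRESHOLD_DISK" ]
}

# Replace df with a shell function that reports a nearly full disk.
# Shell functions take precedence over commands found in PATH.
df() {
    printf 'Filesystem 1024-blocks Used Available Capacity Mounted on\n'
    printf '/dev/sda1 100000 97000 3000 97%% /\n'
}

if check_disk; then
    echo "FAIL: check_disk did not flag 97% usage"
else
    echo "PASS: check_disk flagged 97% usage"
fi

# Restore the real df command.
unset -f df
```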

Document Your Scripts

  • Header comments: Include purpose, author, date in script header
  • Function documentation: Document what each function does
  • Configuration documentation: Document threshold values and their reasoning
  • Usage instructions: Document how to run and schedule scripts

Documentation helps maintain scripts and enables others to understand and modify them.

Integrate with Monitoring Tools

  • Complement, don't replace: Use scripts to complement automated monitoring
  • Export metrics: Format output for monitoring tools to consume
  • Use monitoring APIs: Integrate with monitoring tool APIs when possible
  • Leverage best of both: Combine custom scripts with automated monitoring

Integration provides comprehensive monitoring combining custom checks with automated solutions.
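
As one example of exporting metrics, check results can be written in the Prometheus text format, which node_exporter can pick up through its textfile collector. This is a sketch under assumptions: the metric name `health_check_ok` and the helper `emit_metric` are hypothetical, and the results are hard-coded where a real script would pass each check's outcome.

```bash
#!/bin/bash
# Hypothetical helper: print one gauge sample in the Prometheus text format.
# $1 = check name, $2 = 1 when the check passed, 0 when it failed.
emit_metric() {
    printf 'health_check_ok{check="%s"} %s\n' "$1" "$2"
}

{
    echo '# HELP health_check_ok 1 if the named health check passed, 0 otherwise.'
    echo '# TYPE health_check_ok gauge'
    emit_metric cpu 1
    emit_metric memory 1
    emit_metric disk 0
}
```

Redirecting this output to a .prom file in the directory the collector watches, written to a temporary name first and then renamed, avoids the collector reading a half-written file.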

Conclusion

Creating effective health check scripts for Linux systems enables proactive monitoring and rapid problem detection. By understanding essential metrics, following step-by-step scripting guides, implementing advanced techniques, and following best practices, you can create custom health check scripts that monitor exactly what matters for your infrastructure.

Key Takeaways

  • Start simple: Begin with basic checks and gradually add complexity
  • Monitor essentials: Focus on CPU, memory, disk, and network metrics
  • Implement logging: Add proper logging for troubleshooting
  • Set up alerts: Configure alerts for immediate notification of issues
  • Test thoroughly: Test scripts before deploying to production
  • Document everything: Maintain documentation for future maintenance
  • Integrate with tools: Combine custom scripts with automated monitoring solutions

Next Steps

  1. Create basic script: Start with simple CPU, memory, and disk checks
  2. Add logging: Implement structured logging
  3. Set up alerts: Add email or webhook notifications
  4. Schedule execution: Use cron to run scripts automatically
  5. Test and refine: Test scripts and adjust thresholds based on results
  6. Integrate monitoring: Complement scripts with automated monitoring like Zuzia.app

Remember, health check scripts are most effective when combined with comprehensive monitoring solutions. Use custom scripts for specific checks while leveraging automated monitoring for continuous coverage.

For more information on Linux monitoring, explore related guides on understanding server health checks, Linux resource monitoring best practices, and server monitoring best practices.

FAQ: Common Questions About Health Check Scripts

What are health check scripts in Linux?

Health check scripts are automated bash scripts that systematically evaluate your Linux server's operational status, performance metrics, and overall condition. They check system resources (CPU, memory, disk, network), service status, performance indicators, and security status to ensure everything is functioning correctly and identify potential issues before they escalate into critical problems.

Health check scripts typically run on a schedule (via cron) and log results, send alerts when issues are detected, and provide visibility into system health. They complement automated monitoring solutions by providing custom checks tailored to your specific needs.

Why are health checks important for Linux systems?

Health checks are important because they:

  • Prevent downtime: Detect problems early before they cause service outages
  • Maintain performance: Monitor performance metrics continuously to maintain optimal performance
  • Enable rapid response: Provide immediate alerts when issues are detected
  • Support proactive maintenance: Transform management from reactive to proactive
  • Optimize resources: Identify resource bottlenecks and optimization opportunities

Without health checks, problems are discovered only after they impact users, leading to emergency fixes, increased costs, and potential data loss. Health checks enable proactive problem detection and rapid response.

How can I automate health checks on my Linux server?

Automate health checks by:

  1. Create health check script: Write a bash script with checks for CPU, memory, disk, and services
  2. Add logging: Implement logging to track health check results
  3. Configure alerts: Add email or webhook notifications for issues
  4. Schedule with cron: Use crontab -e to schedule script execution (e.g., every 5 minutes)
  5. Test thoroughly: Test the script in a non-production environment first
  6. Monitor logs: Review logs regularly to verify the script is working correctly

Example cron entry:

# Run health check every 5 minutes
*/5 * * * * /usr/local/bin/health-check.sh

For comprehensive automated monitoring, complement custom scripts with automated solutions like Zuzia.app that provide continuous monitoring with minimal maintenance.

