Kubernetes Zero to Hero Part 1: Building Production-Ready Observability with Prometheus and Grafana
This is Part 1 of our Kubernetes Zero to Hero series. By the end of this series, you'll have built a production-ready platform with enterprise-grade observability, automated SSL, and Akamai-powered security.
Observability isn't optional in production Kubernetes—it's survival. Without proper monitoring, you're flying blind in a complex distributed system where anything can fail at any time. Today, we're building the foundation: a rock-solid observability stack with Prometheus and Grafana that would make any SRE proud.
What We're Building
By the end of this article, you'll have:
- Prometheus collecting metrics from every corner of your cluster
- Grafana providing beautiful, actionable dashboards
- AlertManager sending intelligent notifications
- Node Exporter monitoring your infrastructure
- Application metrics from your web apps
- Service discovery automatically finding new services
This isn't a toy setup—this is production-grade observability that scales.
Prerequisites
- A Kubernetes cluster (1.20+) with kubectl access
- Basic understanding of Kubernetes concepts
- 30 minutes of focused time
Pro tip: If you need a cluster, Linode Kubernetes Engine is perfect for this tutorial—simple, reliable, and cost-effective.
Step 1: Setting Up the Monitoring Namespace
First, let's create a dedicated namespace for our monitoring stack:
kubectl create namespace monitoring
kubectl label namespace monitoring name=monitoring
This isolation keeps the monitoring components organized and easy to manage, and the name=monitoring label gives us something for network policies to select on later.
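A quick sanity check that the namespace exists and carries the label:

kubectl get namespace monitoring --show-labels
# The LABELS column should include name=monitoring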
Step 2: Deploying Prometheus with Helm
We'll use the community-maintained kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, Alertmanager, and the Prometheus Operator into a single production-ready deployment:
# Add the Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Create values file for Prometheus configuration
cat <<EOF > prometheus-values.yaml
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: "linode-block-storage-retain"
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    retention: 30d
    resources:
      requests:
        memory: 2Gi
        cpu: 1000m
      limits:
        memory: 4Gi
        cpu: 2000m
    # Enable service discovery
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

grafana:
  enabled: true
  adminPassword: "admin123"  # Change this in production!
  persistence:
    enabled: true
    storageClassName: "linode-block-storage-retain"
    size: 10Gi
  resources:
    requests:
      memory: 256Mi
      cpu: 100m
    limits:
      memory: 512Mi
      cpu: 200m

alertmanager:
  enabled: true
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: "linode-block-storage-retain"
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi

nodeExporter:
  enabled: true

kubeStateMetrics:
  enabled: true

# Enable default rules for Kubernetes monitoring
defaultRules:
  create: true
  rules:
    alertmanager: true
    etcd: true
    general: true
    k8s: true
    kubeApiserver: true
    kubePrometheusNodeRecording: true
    kubernetesApps: true
    kubernetesResources: true
    kubernetesStorage: true
    kubernetesSystem: true
    node: true
    prometheus: true
EOF
# Install the Prometheus stack
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values prometheus-values.yaml
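The chart takes a minute or two to come up. Watch the rollout until every pod is Running (the operator deployment name follows the Helm release name, so adjust if yours differs):

# Watch the stack come up (Ctrl+C once everything is Running)
kubectl get pods -n monitoring -w

# Or block until the operator itself is ready
kubectl rollout status -n monitoring deployment/prometheus-kube-prometheus-operator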
Step 3: Exposing Grafana and Prometheus
Let's create LoadBalancer services to reach the monitoring tools. Keep in mind this puts Prometheus and Grafana on public IPs; for anything beyond a tutorial, front them with an authenticated ingress or keep them cluster-internal:
# prometheus-services.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-external
  namespace: monitoring
spec:
  type: LoadBalancer
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    app.kubernetes.io/name: prometheus
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-external
  namespace: monitoring
spec:
  type: LoadBalancer
  ports:
    - port: 3000
      targetPort: 3000
  selector:
    app.kubernetes.io/name: grafana
kubectl apply -f prometheus-services.yaml
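Linode NodeBalancers take a minute or two to provision; watch until EXTERNAL-IP changes from <pending> to a public address:

kubectl get svc prometheus-external grafana-external -n monitoring -w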
Step 4: Setting Up Application Monitoring
Now let's create a sample web application with proper metrics exposition. Stock nginx doesn't speak Prometheus, so we enable its stub_status endpoint via a ConfigMap and run nginx-prometheus-exporter as a sidecar to translate it into metrics:
# sample-app.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-web-app-nginx
  namespace: default
data:
  default.conf: |
    server {
      listen 80;
      location / {
        root /usr/share/nginx/html;
        index index.html;
      }
      # Expose stub_status for the exporter sidecar, pod-local only
      location /nginx_status {
        stub_status;
        allow 127.0.0.1;
        deny all;
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-web-app
  namespace: default
  labels:
    app: sample-web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-web-app
  template:
    metadata:
      labels:
        app: sample-web-app
      annotations:
        # Informational here; the operator discovers targets via the
        # ServiceMonitor below, not these annotations
        prometheus.io/scrape: "true"
        prometheus.io/port: "9113"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: web-app
          image: nginx:1.21
          ports:
            - containerPort: 80
              name: http
          volumeMounts:
            - name: nginx-conf
              mountPath: /etc/nginx/conf.d
          resources:
            requests:
              memory: 64Mi
              cpu: 50m
            limits:
              memory: 128Mi
              cpu: 100m
        # nginx-prometheus-exporter sidecar turns stub_status into metrics
        - name: nginx-exporter
          image: nginx/nginx-prometheus-exporter:0.10.0
          args:
            - -nginx.scrape-uri=http://localhost/nginx_status
          ports:
            - containerPort: 9113
              name: metrics
          resources:
            requests:
              memory: 32Mi
              cpu: 25m
            limits:
              memory: 64Mi
              cpu: 50m
      volumes:
        - name: nginx-conf
          configMap:
            name: sample-web-app-nginx
---
apiVersion: v1
kind: Service
metadata:
  name: sample-web-app
  namespace: default
  labels:
    app: sample-web-app
spec:
  ports:
    - port: 80
      targetPort: 80
      name: http
    - port: 9113
      targetPort: 9113
      name: metrics
  selector:
    app: sample-web-app
---
# ServiceMonitor for Prometheus to discover this service
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-web-app
  namespace: default
  labels:
    app: sample-web-app
spec:
  selector:
    matchLabels:
      app: sample-web-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
kubectl apply -f sample-app.yaml
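Before relying on service discovery, confirm the exporter sidecar actually serves metrics. nginx_up should report 1, meaning the exporter can reach stub_status:

kubectl port-forward -n default deploy/sample-web-app 9113:9113 &
curl -s http://localhost:9113/metrics | grep nginx_up
# nginx_up 1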
Step 5: Configuring AlertManager
Let's set up intelligent alerting by replacing the Alertmanager config secret the chart generated. Note that a helm upgrade will regenerate this secret, so for a durable setup move the same config into the chart's alertmanager.config value instead:
# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-kube-prometheus-alertmanager
  namespace: monitoring
type: Opaque
stringData:
  # The operator expects the key to be "alertmanager.yaml"
  alertmanager.yaml: |
    global:
      smtp_smarthost: 'localhost:587'
      smtp_from: 'alerts@yourdomain.com'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'web.hook'
      routes:
        - match:
            alertname: Watchdog   # the stack's always-firing heartbeat alert
          receiver: 'null'
        - match:
            severity: critical
          receiver: 'critical-alerts'
        - match:
            severity: warning
          receiver: 'warning-alerts'
    receivers:
      - name: 'null'
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://127.0.0.1:5001/'
      - name: 'critical-alerts'
        slack_configs:
          - api_url: 'YOUR_SLACK_WEBHOOK_URL'
            channel: '#alerts-critical'
            title: 'Critical Alert: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'warning-alerts'
        slack_configs:
          - api_url: 'YOUR_SLACK_WEBHOOK_URL'
            channel: '#alerts-warning'
            title: 'Warning: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
kubectl apply -f alertmanager-config.yaml
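To confirm Alertmanager accepted the new configuration, watch its logs for a config reload and inspect the active config through its API (pod and service names follow the Helm release name; adjust if yours differ):

kubectl logs -n monitoring alertmanager-prometheus-kube-prometheus-alertmanager-0

kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093 &
# /api/v2/status echoes the configuration Alertmanager is actually running with
curl -s http://localhost:9093/api/v2/status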
Step 6: Custom Prometheus Rules
Create custom alerting rules for your applications. The first two expressions assume request-level metrics (status codes and latency histograms) exported by your app or ingress controller; the stub_status exporter alone won't provide them:
# custom-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-application-rules
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: application.rules
      rules:
        - alert: HighErrorRate
          # Ratio of 5xx responses to all responses over 5 minutes
          expr: |
            sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
              / sum(rate(nginx_http_requests_total[5m])) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "More than 10% of requests returned 5xx over the last 5 minutes"
        - alert: HighLatency
          expr: histogram_quantile(0.95, rate(nginx_http_request_duration_seconds_bucket[5m])) > 0.5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency detected"
            description: "95th percentile latency is above 500ms"
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod is crash looping"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
kubectl apply -f custom-rules.yaml
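Prometheus picks the rule up within its reload interval. Confirm it registered, both at the API level and in the UI under Status → Rules:

kubectl get prometheusrule custom-application-rules -n monitoring

kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
curl -s http://localhost:9090/api/v1/rules | grep application.rules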
Step 7: Setting Up Grafana Dashboards
Let's access Grafana and set up essential dashboards:
# Get Grafana external IP
kubectl get service grafana-external -n monitoring
# Default credentials: admin / admin123 (change this!)
Once in Grafana, import these essential dashboards:
- Kubernetes Cluster Monitoring (Dashboard ID: 7249)
- Node Exporter Full (Dashboard ID: 1860)
- Nginx Ingress Controller (Dashboard ID: 9614)
- Kubernetes Pod Monitoring (Dashboard ID: 6417)
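Importing by ID is fine for exploring, but dashboards belong in version control. The chart's Grafana sidecar loads any ConfigMap carrying the grafana_dashboard label; a minimal sketch, with placeholder dashboard JSON standing in for a real export:

apiVersion: v1
kind: ConfigMap
metadata:
  name: team-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the sidecar watches for this label
data:
  team-dashboard.json: |
    {
      "title": "Team Dashboard",
      "panels": []
    }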
Step 8: Verifying Your Setup
Check that everything is working:
# Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus-external 9090:9090
# Visit http://localhost:9090/targets - all should be UP
# Check Grafana
kubectl port-forward -n monitoring svc/grafana-external 3000:3000
# Visit http://localhost:3000
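The same check works from the command line: every healthy target reports up == 1 through the query API:

kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
# -g stops curl from glob-expanding the braces in the PromQL selector
curl -s -g 'http://localhost:9090/api/v1/query?query=up{namespace="default"}'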
Pro Tips for Production
1. Resource Management
Always set resource requests and limits:
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "200m"
2. Data Retention Strategy
Configure appropriate retention based on your needs:
prometheus:
  prometheusSpec:
    retention: 30d        # Adjust based on compliance requirements
    retentionSize: 45GB
3. High Availability
For production, enable HA mode:
prometheus:
  prometheusSpec:
    replicas: 2
alertmanager:
  alertmanagerSpec:
    replicas: 3
4. Security Hardening
- Use network policies to restrict access (see the sketch after this list)
- Enable RBAC with minimal permissions
- Secure Grafana with OAuth integration
- Use secrets for sensitive configuration
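As a sketch of the first point: a NetworkPolicy that only admits traffic to Prometheus from pods inside the monitoring namespace, using the name=monitoring label we set in Step 1. Adjust the selectors to match how Grafana and other clients reach Prometheus in your cluster:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-prometheus
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes:
    - Ingress
  ingress:
    # Only pods in namespaces labeled name=monitoring may scrape or query
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 9090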
Performance Optimization
Query Optimization
Write efficient PromQL queries:
# Good: rate over a window of at least a few scrape intervals
rate(http_requests_total[5m])

# Better: aggregate away labels you don't need before graphing
sum(rate(http_requests_total[5m])) by (job, method)

# Best: precompute expensive expressions with recording rules (see below)
custom:http_request_rate5m
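The custom:http_request_rate5m series above only exists if a recording rule precomputes it. A minimal sketch as a PrometheusRule (rule and metric names are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-recording-rules
  namespace: monitoring
spec:
  groups:
    - name: recording.rules
      rules:
        # Evaluated on a schedule and stored as a new series,
        # so dashboards query the cheap precomputed result
        - record: custom:http_request_rate5m
          expr: sum(rate(http_requests_total[5m])) by (job, method)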
Storage Optimization
- Use remote storage for long-term retention (values sketch after this list)
- Configure appropriate scrape intervals
- Use recording rules for expensive queries
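For remote storage, Prometheus streams samples out via remote_write. A sketch of the Helm values with a placeholder endpoint and a hypothetical credentials Secret (Thanos, Mimir, or a hosted service would sit behind the URL):

prometheus:
  prometheusSpec:
    remoteWrite:
      - url: "https://metrics.example.com/api/v1/write"   # placeholder endpoint
        basicAuth:
          username:
            name: remote-write-credentials   # hypothetical Secret in the monitoring namespace
            key: username
          password:
            name: remote-write-credentials
            key: password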
What's Next?
In Part 2, we'll add Traefik as our ingress controller with automatic SSL certificates, integrating it seamlessly with our monitoring stack. You'll learn how to:
- Deploy Traefik with Let's Encrypt integration
- Configure advanced routing and middleware
- Monitor ingress performance and SSL certificate health
- Implement blue-green deployments with observability
Common Troubleshooting
Prometheus Not Scraping Targets
# Check service monitor labels
kubectl get servicemonitor -A
# Verify prometheus service discovery
kubectl logs -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus
Grafana Not Starting
# Check persistent volume claims
kubectl get pvc -n monitoring
# Check pod logs
kubectl logs -n monitoring deployment/prometheus-grafana -c grafana
High Memory Usage
# Check current resource usage
kubectl top pods -n monitoring
# Adjust retention or add more memory
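High memory in Prometheus is usually a cardinality problem. The TSDB status endpoint lists the metric names and labels with the most series:

kubectl port-forward -n monitoring svc/prometheus-external 9090:9090 &
# seriesCountByMetricName shows where your cardinality lives
curl -s http://localhost:9090/api/v1/status/tsdb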
Conclusion
You now have a production-ready observability foundation that will scale with your applications. This monitoring stack provides:
- Complete visibility into your Kubernetes cluster
- Proactive alerting for issues before they impact users
- Performance insights to optimize your applications
- Capacity planning data for informed scaling decisions
The monitoring infrastructure we've built today will be crucial as we add more complexity in the coming articles. With proper observability in place, you can confidently deploy and operate applications knowing you'll see problems before your users do.
Remember: monitoring is not a "set it and forget it" task. Regularly review your dashboards, tune your alerts, and evolve your metrics as your applications grow.
Next up in Part 2: We'll add Traefik ingress with automatic SSL certificates and integrate it with our monitoring stack for complete visibility into your application traffic.
Alexander Cedergren is a Solutions Engineer specializing in Kubernetes, observability, and edge computing. Follow the series to level up your Kubernetes expertise.