Ansible beyond the tutorial: idempotency, drift detection, and the playbook that saved a 3am incident
The demo playbook installs nginx and starts it. It works once on a clean VM and everyone nods in the meeting. What nobody demonstrates is running the same playbook six months later on a server where an engineer manually edited /etc/nginx/nginx.conf to temporarily fix a production problem and then forgot to document it. Or after the nginx package got updated by an unnoticed apt cron job. Or on a server that was never properly converged because someone cancelled the playbook halfway through.
Production Ansible is not about running playbooks. It is about reliably converging infrastructure to known states, including infrastructure that has drifted from whatever Ansible last configured.
Idempotency is a contract, not a feature
Ansible modules are documented as idempotent and most of them are. But "idempotent" in Ansible means "running this module twice with the same arguments produces the same result". It does not mean "this module is safe to run on a system in an unknown state."
Consider a popular pattern that breaks under drift:
# This looks fine. It is not fine if the service was manually stopped.
- name: Ensure application service is running
  ansible.builtin.service:
    name: myapp
    state: started
    enabled: true
If an engineer ran systemctl disable myapp --now on the server to debug a CPU spike and then forgot, this task reports changed, because it has to start and re-enable the service. But changed is all you get, and it is the same result a first deploy to a fresh server would produce. Nothing tells you that a manual intervention occurred. The playbook converges the state, but you have lost the signal that drift happened.
The pattern I use instead:
- name: Check if service has been manually overridden
  ansible.builtin.command: systemctl is-enabled myapp
  register: svc_enabled
  changed_when: false
  failed_when: false   # is-enabled exits non-zero when the unit is disabled
  check_mode: false    # read-only, so run it under --check too (otherwise stdout is undefined below)
- name: Warn on manual override
  ansible.builtin.debug:
    msg: "WARNING: myapp service is {{ svc_enabled.stdout }} (expected 'enabled')"
  when: svc_enabled.stdout != 'enabled'
- name: Converge service state
  ansible.builtin.service:
    name: myapp
    state: started
    enabled: true
The warning does not block the playbook. It produces a visible signal that a human made a change that Ansible is now overwriting. In a CI/CD context you parse that output and create an alert.
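As a rough illustration of that parsing step, here is a minimal Python sketch. It assumes the playbook ran with Ansible's default human-readable output, where a debug result is preceded by a line such as ok: [host] => {, and it keys on the WARNING message format used by the debug task above. The names DriftWarning and find_drift_warnings are hypothetical, and how you turn the result into an alert is left open.

```python
import re
from typing import NamedTuple


class DriftWarning(NamedTuple):
    """One manual override spotted in the playbook output (hypothetical type)."""
    host: str
    service: str
    state: str


# Host header printed by Ansible's default output, e.g. 'ok: [api1] => {'
HOST_LINE = re.compile(r"^(?:ok|changed): \[(?P<host>[^\]]+)\]")
# Message emitted by the "Warn on manual override" debug task
WARN_LINE = re.compile(r"WARNING: (?P<service>\S+) service is (?P<state>[\w-]+)")


def find_drift_warnings(output: str) -> list[DriftWarning]:
    """Return every drift warning in the captured ansible-playbook output."""
    warnings = []
    current_host = None
    for line in output.splitlines():
        host_match = HOST_LINE.match(line)
        if host_match:
            # Remember which host the following result lines belong to
            current_host = host_match.group("host")
            continue
        warn_match = WARN_LINE.search(line)
        if warn_match and current_host is not None:
            warnings.append(
                DriftWarning(
                    current_host,
                    warn_match.group("service"),
                    warn_match.group("state"),
                )
            )
    return warnings
```

In a pipeline you would capture the playbook's stdout, pass it to find_drift_warnings, and raise an alert or fail the job when the list is not empty. Matching on text is fragile by nature; a structured callback would be sturdier, but its output layout is not something this sketch assumes.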
The 3am playbook
The scenario: production API servers returning 502. Load balancer health checks failing. The on-call engineer has 90 seconds before customers notice. The cause: a deploy job timed out halfway through updating the nginx upstream config, leaving three of eight servers with the old configuration and five with the new.
You write the remediation playbook when you are not under pressure, so that when you are under pressure, you run one command:
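---
- name: Emergency nginx config convergence
  hosts: api_servers
  serial: 2                 # converge two at a time, keep 6/8 serving traffic
  max_fail_percentage: 25   # evaluated per batch: one failure in a batch of two (50%) aborts the play
  tasks:
    - name: Validate config template renders without errors
      ansible.builtin.template:
        src: templates/nginx-upstream.conf.j2
        dest: /tmp/nginx-upstream-validate.conf
        mode: '0600'
      changed_when: false
    # nginx -t -c needs a complete main config; a bare upstream block is rejected,
    # so wrap the rendered fragment in a minimal events/http skeleton first
    - name: Wrap the fragment in a minimal config for nginx -t
      ansible.builtin.copy:
        content: |
          events {}
          http { include /tmp/nginx-upstream-validate.conf; }
        dest: /tmp/nginx-validate-main.conf
        mode: '0600'
      changed_when: false
    - name: Syntax check the rendered config
      ansible.builtin.command: nginx -t -c /tmp/nginx-validate-main.conf
      changed_when: false
      # If nginx -t fails, the play fails here, before touching the live config
    - name: Deploy nginx upstream config
      ansible.builtin.template:
        src: templates/nginx-upstream.conf.j2
        dest: /etc/nginx/conf.d/upstream.conf
        owner: root
        group: root
        mode: '0644'
        backup: true # keeps upstream.conf.TIMESTAMP on the server
      notify: reload nginx
    # Handlers normally run at the end of the play; flush them so the reload
    # happens before the health check rather than after it
    - name: Run the reload handler now
      ansible.builtin.meta: flush_handlers
    - name: Verify health endpoint responds after reload
      ansible.builtin.uri:
        url: "http://localhost:{{ app_port }}/health"
        status_code: 200
        timeout: 10
      register: health
      until: health is succeeded
      retries: 3
      delay: 2
  handlers:
    - name: reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
      # reloaded, not restarted: zero downtime config update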
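serial: 2 is the parameter that matters most. With eight servers and serial: 2, at least six are left untouched while each batch converges, and because the whole task list, handlers included, runs to completion on one batch before the next batch starts, a failure stops the play after two servers instead of eight. Without it, Ansible runs each task across the whole group before moving to the next one (up to forks hosts at a time, five by default), so the new config lands on all eight servers before the first health check has run. That is faith-based deployment at scale.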
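The batch arithmetic is easy to check by hand, and a small Python sketch makes it explicit. This is only a model of the numbers, not of Ansible itself: the function names and the host names api1 to api8 are hypothetical. It also encodes how I read the Ansible documentation on max_fail_percentage: when combined with serial, the percentage is evaluated per batch, and the play aborts only when the threshold is exceeded, not when it is merely reached.

```python
def plan_batches(hosts: list[str], serial: int) -> list[list[str]]:
    """Split hosts into consecutive batches of at most `serial` hosts."""
    if serial < 1:
        raise ValueError("serial must be at least 1")
    return [hosts[i:i + serial] for i in range(0, len(hosts), serial)]


def min_hosts_untouched(total_hosts: int, serial: int) -> int:
    """Hosts guaranteed to be outside the batch currently being converged."""
    return max(total_hosts - serial, 0)


def batch_aborts_play(batch_size: int, failed: int, max_fail_percentage: float) -> bool:
    """True when the failed share of a single batch exceeds the threshold."""
    return failed / batch_size * 100 > max_fail_percentage


hosts = [f"api{n}" for n in range(1, 9)]   # eight hypothetical API servers
batches = plan_batches(hosts, serial=2)    # four batches of two hosts each
```

With these inputs, min_hosts_untouched(8, 2) is 6, and batch_aborts_play(2, 1, 25) is true: a single failure in a batch of two is 50%, which is above the 25% threshold, so the remaining batches are never touched.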
Vault and the secret you accidentally committed
Every team eventually commits a secret to their Ansible repository. The textbook answer is Ansible Vault. The production answer: Ansible Vault for secrets that belong to the playbook, external secrets management (HashiCorp Vault, AWS Secrets Manager) for secrets shared between systems, and no_log: true on every task that handles either.
- name: Set database credentials in application config
  ansible.builtin.template:
    src: templates/database.php.j2
    dest: /var/www/html/config/database.php
    mode: '0640'
  vars:
    db_password: "{{ lookup('aws_ssm', '/prod/app/db_password', region='eu-west-1') }}"
  no_log: true # prevents the rendered template (containing the password) from appearing in logs
no_log: true suppresses not just the task output but also the diff output. If you run --diff to review what changed, you will not see the rendered template. That is a feature, not a limitation.
Testing playbooks before they matter
I use two tools for every non-trivial role. The first is Molecule for role-level testing: it spins up a container or VM, runs the role, then runs a verifier (Testinfra in the example here; recent Molecule releases default to an Ansible verify playbook instead) and checks that the desired state was actually achieved, not just that Ansible reported success.
# molecule/default/tests/test_nginx.py
import testinfra


def test_nginx_is_running(host):
    nginx = host.service("nginx")
    assert nginx.is_running
    assert nginx.is_enabled


def test_nginx_config_is_valid(host):
    result = host.run("nginx -t")
    assert result.rc == 0
The second is --check mode with --diff before every production run, which shows what Ansible would change without actually changing it. The diff output on template tasks is particularly useful: you see exactly which lines in the config file would be modified. Limiting the run to one server with --limit 'api_servers[0]' (quoted, so the shell leaves the brackets alone) is non-negotiable: --check across the full production inventory can take minutes, while on one representative server it takes seconds.
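If the dry run is launched from a script rather than typed by hand, a small helper keeps the flags consistent. The sketch below is an assumption about how you might wrap it: build_check_command and the playbook filename are hypothetical, while --check, --diff and --limit are the ansible-playbook flags discussed above. Passing the arguments as a list avoids shell quoting problems with the brackets, and shlex.join shows the quoted form for pasting into a terminal.

```python
import shlex


def build_check_command(playbook: str, group: str) -> list[str]:
    """Build a dry-run command limited to the first host of an inventory group."""
    return [
        "ansible-playbook",
        playbook,
        "--check",               # report what would change, change nothing
        "--diff",                # show line-level differences for file tasks
        "--limit", f"{group}[0]",
    ]


# 'emergency-nginx.yml' is a hypothetical playbook filename
command = build_check_command("emergency-nginx.yml", "api_servers")
printable = shlex.join(command)
```

The list form can be handed directly to subprocess.run(command, check=True); printable is only for logs and copy-paste.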
What I watch for in Ansible code review
Tasks with no changed_when on command or shell modules report changed every time they run, even if nothing changed. That makes the changed count in the play recap useless as a drift signal. ignore_errors: true on anything infrastructure-related is the equivalent of a bare catch (Exception e) {}: the playbook should stop, not continue with a potentially broken server still in the pool. The third is missing become: false on tasks that do not need root: a playbook where every task runs as root is a playbook where any bug has the blast radius of the entire server.
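The first two checks are mechanical enough to automate. The following Python sketch is a crude illustration, not a replacement for a real linter: it assumes the tasks have already been loaded into plain dictionaries (for example by a YAML parser), and the names review_task and review_tasks are hypothetical. It deliberately ignores the become check, which depends on play-level settings that a single task dictionary does not show.

```python
# Module keys that run arbitrary commands, in short and fully qualified form
COMMAND_MODULES = {
    "command",
    "shell",
    "ansible.builtin.command",
    "ansible.builtin.shell",
}


def review_task(task: dict) -> list[str]:
    """Return review findings for one task dictionary."""
    findings = []
    name = task.get("name", "<unnamed task>")
    # command/shell tasks always report 'changed' unless changed_when says otherwise
    if COMMAND_MODULES.intersection(task) and "changed_when" not in task:
        findings.append(f"{name}: command/shell task without changed_when")
    # ignore_errors lets the play continue on a host that may be broken
    if task.get("ignore_errors"):
        findings.append(f"{name}: ignore_errors is set")
    return findings


def review_tasks(tasks: list[dict]) -> list[str]:
    """Collect findings for every task in a list."""
    return [finding for task in tasks for finding in review_task(task)]
```

Run against a list holding one command task without changed_when and one service task with ignore_errors: true, it returns two findings, one per task. A command task that relies on creates or removes would be flagged too, so treat the output as prompts for a reviewer rather than verdicts.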