
How Ansible Took Down Our Servers

As previously mentioned, I work for an IoT company. I was officially hired as a Data Engineer, but as at many start-ups, my job description was merely a guideline, and a loose one at that. I drifted between Data Engineer, Site Reliability Engineer, Operations Engineer, Backend Engineer, and Security Engineer. I was a proverbial Swiss Army Knife. This was great because it let me pick up a lot of new skills and choose my level and type of involvement in a lot of different projects. However, one thing I lamented was that my knowledge was spread too thin, and I would have liked fewer context switches so I could focus on gaining deeper knowledge in one area.

Today this sparse knowledge bit me when I misunderstood how Ansible handles variables.

In short, Ansible is an orchestration and configuration management tool like Puppet or Chef. Unlike those two (at least when I last used them), Ansible runs over ssh and is defined by simple YAML configuration files instead of being written in a programming language and executed from a centralized master.
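To make that concrete, here is a trivial playbook, purely illustrative and not from our repo, that installs nginx on a group of hosts. It's just a YAML file; Ansible reads it locally and carries out the tasks on the remote machines over ssh.

---
# hello-nginx.yml -- an illustrative example, not part of the repo described below
# run with: ansible-playbook -i inventory hello-nginx.yml
- hosts: webservers
  become: yes
  tasks:
    - name: make sure nginx is installed
      apt:
        name: nginx
        state: present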

The layout of a typical Ansible repo looks a bit like this:

heartsucker@archimedes:~/code/ansible/$ tree .
├── README.md
├── database.yml
├── ec2.py
├── push-ssh-keys.yml
├── roles
│   ├── database
│   │   ├── files
│   │   │   ├── pg_hba.conf
│   │   ├── meta
│   │   │   └── main.yml
│   │   └── tasks
│   │       └── main.yml
│   ├── users
│   │   ├── files
│   │   │   ├── user1.ssh.pub
│   │   │   ├── user2.ssh.pub
│   │   │   ├── user3.ssh.pub
│   │   ├── tasks
│   │   │   └── main.yml
│   │   └── vars
│   │       └── main.yml
│   └── webserver
│       ├── meta
│       │   └── main.yml
│       └── tasks
│           └── main.yml
└── webserver.yml

12 directories, 15 files

Simply put, at the top level you have playbooks, which are made up of tasks and roles. Roles are in turn made up of tasks and, sometimes, other roles. In our case, we had a playbook called push-ssh-keys.yml that was used to create users on all hosts. The file roles/users/tasks/main.yml looked like this.

---
# sometimes the remove user command fails to delete home dirs
# keep this before actual removal so removal won't return failure
- name: remove old users' home dirs
  file:
    path: "/home/{{ item }}"
    state: absent
    force: yes
  with_items: '{{ disallowed_users }}'
  when: disallowed_users is defined

# sometimes the remove user command fails to delete mail spools
# keep this before actual removal so removal won't return failure
- name: remove old users' mail spools
  file:
    path: "/var/mail/{{ item }}"
    state: absent
    force: yes
  with_items: '{{ disallowed_users }}'
  when: disallowed_users is defined

- name: remove old users
  user:
    name: "{{ item }}"
    state: absent
    remove: yes
    force: yes
  with_items: '{{ disallowed_users }}'
  when: disallowed_users is defined

- name: create users
  user:
    name: "{{ item.key }}"
    shell: "{{ item.value.shell|default('/bin/bash') }}"
    state: present
  with_dict: '{{ allowed_users }}'

- name: create ~/.ssh/
  file:
    name: "/home/{{ item.key }}/.ssh"
    state: directory
    mode: '0700'
    owner: "{{ item.key }}"
    group: "{{ item.key }}"
  with_dict: '{{ allowed_users }}'

- name: set authorized keys
  copy:
    src: "{{ item.key }}.ssh.pub"
    dest: "/home/{{ item.key }}/.ssh/authorized_keys"
    mode: '0600'
    owner: "{{ item.key }}"
    group: "{{ item.key }}"
  with_dict: '{{ allowed_users }}'

The file containing the variables, roles/users/vars/main.yml, looked like this.

---
allowed_users:
  user1:
  user2:
  user3:
    shell: /bin/zsh

disallowed_users:
  - user4

In plain English, the following steps are sequentially executed.

  • Remove /home/$user for each user in disallowed_users
  • Remove /var/mail/$user for each user in disallowed_users
  • Delete the user $user for each user in disallowed_users
  • Create the user $user for each user in allowed_users
  • Create /home/$user/.ssh/ for each user in allowed_users
  • Set /home/$user/.ssh/authorized_keys for each user in allowed_users

You’ll even notice in the .yml that I set guards on the disallowed_users variable so that if there were no users to delete, there would be no error. With that knowledge, and knowing that we’d rotated instances and keys everywhere and that no old users were left on our servers, I updated the file to read like so.

---
allowed_users:
  user1:
  user2:
  user3:
    shell: /bin/zsh

disallowed_users:

This role was part of every single one of our provisioning playbooks, so it had been run thousands of times. I’d even run it as a one-off job many times. I was comfortable making this small optimization and was ready to deploy the new hire’s ssh key to our servers.

heartsucker@archimedes:~/code/ansible/$ ansible-playbook -i ec2.py --extra-vars 'target=all' push-ssh-keys.yml
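The push-ssh-keys.yml playbook itself isn’t reproduced here, but it presumably did little more than bind a host pattern to the users role, with the hosts line fed by that target extra-var. A sketch of that assumed structure:

---
# sketch of a push-ssh-keys.yml style playbook -- assumed, not the actual file
# 'target' comes in on the command line via --extra-vars, e.g. target=all
- hosts: "{{ target }}"
  become: yes
  roles:
    - users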

While that was running, I set him up with an AWS account and did some other housekeeping. I switched back to the terminal and saw this.

TASK [users : remove old users] ************************************************
failed: [10.1.1.2] (item=None) => {"failed": true, "item": null, "module_stderr": "Could not chdir to home directory /home/heartsucker: No such file or directory\nTraceback (most recent call last):\n  File \"/tmp/ansible_MHz4bs/ansible_module_user.py\", line 2153, in <module>\n    main()\n  File \"/tmp/ansible_MHz4bs/ansible_module_user.py\", line 2071, in main\n    if user.user_exists():\n  File \"/tmp/ansible_MHz4bs/ansible_module_user.py\", line 535, in user_exists\n    if pwd.getpwnam(self.name):\nTypeError: must be string, not None\n", "module_stdout": "", "msg": "MODULE FAILURE", "parsed": false}
failed: [10.1.1.3] (item=None) => {"failed": true, "item": null, "module_stderr": "Could not chdir to home directory /home/heartsucker: No such file or directory\nTraceback (most recent call last):\n  File \"/tmp/ansible_IxQdHP/ansible_module_user.py\", line 2153, in <module>\n    main()\n  File \"/tmp/ansible_IxQdHP/ansible_module_user.py\", line 2071, in main\n    if user.user_exists():\n  File \"/tmp/ansible_IxQdHP/ansible_module_user.py\", line 535, in user_exists\n    if pwd.getpwnam(self.name):\nTypeError: must be string, not None\n", "module_stdout": "", "msg": "MODULE FAILURE", "parsed": false}
failed: [10.1.1.4] (item=None) => {"failed": true, "item": null, "module_stderr": "Could not chdir to home directory /home/heartsucker: No such file or directory\nTraceback (most recent call last):\n  File \"/tmp/ansible_sH9tGv/ansible_module_user.py\", line 2153, in <module>\n    main()\n  File \"/tmp/ansible_sH9tGv/ansible_module_user.py\", line 2071, in main\n    if user.user_exists():\n  File \"/tmp/ansible_sH9tGv/ansible_module_user.py\", line 535, in user_exists\n    if pwd.getpwnam(self.name):\nTypeError: must be string, not None\n", "module_stdout": "", "msg": "MODULE FAILURE", "parsed": false}
failed: [10.1.1.5] (item=None) => {"failed": true, "item": null, "module_stderr": "Could not chdir to home directory /home/heartsucker: No such file or directory\nTraceback (most recent call last):\n  File \"/tmp/ansible_f4nD0I/ansible_module_user.py\", line 2153, in <module>\n    main()\n  File \"/tmp/ansible_f4nD0I/ansible_module_user.py\", line 2071, in main\n    if user.user_exists():\n  File \"/tmp/ansible_f4nD0I/ansible_module_user.py\", line 535, in user_exists\n    if pwd.getpwnam(self.name):\nTypeError: must be string, not None\n", "module_stdout": "", "msg": "MODULE FAILURE", "parsed": false}
failed: [10.1.1.6] (item=None) => {"failed": true, "item": null, "module_stderr": "Could not chdir to home directory /home/heartsucker: No such file or directory\nTraceback (most recent call last):\n  File \"/tmp/ansible_KVGExn/ansible_module_user.py\", line 2153, in <module>\n    main()\n  File \"/tmp/ansible_KVGExn/ansible_module_user.py\", line 2071, in main\n    if user.user_exists():\n  File \"/tmp/ansible_KVGExn/ansible_module_user.py\", line 535, in user_exists\n    if pwd.getpwnam(self.name):\nTypeError: must be string, not None\n", "module_stdout": "", "msg": "MODULE FAILURE", "parsed": false}
...

This clearly wasn’t good. I aborted the play and scrolled up to see what had happened.

TASK [users : remove old users' home dirs] *************************************
changed: [10.1.1.2] => (item=None)
changed: [10.1.1.3] => (item=None)
changed: [10.1.1.4] => (item=None)
changed: [10.1.1.5] => (item=None)
changed: [10.1.1.6] => (item=None)
...

The with_items loop had run on each server even though there were no users to remove: the empty variable had become a single None item. Ansible uses Jinja as its templating language to interpolate variables into strings, and as anyone (myself included) should have known, a Python None value gets rendered as an empty string. Or as a coworker put it:

Jinja takes the PHP approach and says "Fuck it, this web page will render."

What this means is that each call to the file module that was supposed to remove the home directory /home/$user received a None value, cast it to the empty string, and did a force remove on the resulting directory: /home/.

Fuck.

I had just removed the /home/ directory on every server. Since we only allow ssh access via public key, everyone was locked out of every server.
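The root of it is a YAML subtlety: a key with no value is not an empty list, it is null. A null variable is still "defined", so the guard passed, with_items turned it into a single-item list containing None, and the templated path collapsed to /home/. Roughly:

# what the edited vars file actually declared (a bare key parses as null)
disallowed_users:        # => None: still "defined", the guard passes, the loop
                         #    runs once with item=None, and the path renders as /home/

# what would have made the loop a harmless no-op
disallowed_users: []     # => an empty list: nothing to iterate over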

My company runs its architecture on AWS, so many of the instances were simply taken out back and shot, then replaced with fresh copies. Some services, like Postgres, InfluxDB, and Cassandra, couldn’t be replaced so easily.

Our solution was to manually start one instance per availability zone and provision it to a satisfactory state with respect to users and credentials. Then, for each instance in that AZ that had had its /home directory wiped, we did the following:

  • Power off the original instance
  • Detach the EBS root volume from the original instance
  • Attach the EBS volume to the temp instance
  • ssh in to the temp instance
    • mount the volume
    • copy /home to /mnt/old-volume/home
    • unmount the volume
  • Detach the EBS volume from the temp instance
  • Reattach the EBS volume as the root volume on the original instance
  • Restart the original instance
  • ssh in to the original instance
    • do a recursive chown on each directory under /home to fix the permissions broken by mismatched uids and gids

The last step only worked because we got lucky: at least one engineer’s home directory ended up with the correct uid and gid when we copied everything over, so they were able to ssh in and fix the permissions on the rest of the directories.
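We did that chown pass by hand over ssh, but the equivalent Ansible task, reusing the allowed_users dict from the role, would look something like this sketch:

- name: fix home directory ownership after the restore
  file:
    path: "/home/{{ item.key }}"
    state: directory
    owner: "{{ item.key }}"
    group: "{{ item.key }}"
    recurse: yes
  with_dict: "{{ allowed_users }}"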

On top of all that, we had to fix all of it on a garbage wifi connection. This is what my ping looked like during a sample of almost an hour.

--- 8.8.8.8 ping statistics ---
3267 packets transmitted, 1359 packets received, 58.4% packet loss
round-trip min/avg/max/stddev = 22.388/39.702/695.399/23.219 ms

If anyone feels terror at how Ansible handles variables and wants to know what version I’m running:

heartsucker@archimedes:~/code/ansible$ python --version
Python 2.7.11
heartsucker@archimedes:~/code/ansible$ ansible --version
ansible 2.1.0.0
  config file = /Users/heartsucker/code/ansible/ansible.cfg
  configured module search path = Default w/o overrides

The post-mortem was "this was my fault," but also, "ok, those were reasonable assumptions to make, so no one is really upset."
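For what it’s worth, a slightly stricter version of the removal tasks would have shrugged off the null variable entirely. A sketch, not what we actually ran; either the default filter or the truthiness check alone would have been enough:

- name: remove old users' home dirs
  file:
    path: "/home/{{ item }}"
    state: absent
    force: yes
  # default(..., true) substitutes for null and other falsey values, not just
  # undefined ones, and the extra truthiness check skips the task outright
  # when the list is null or empty
  with_items: "{{ disallowed_users | default([], true) }}"
  when: disallowed_users is defined and disallowed_users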

Lesson learned?
