Ansible Optimization



1. Analyzing the Ansible Execution Flow

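Before optimizing anything, you need a basic understanding of how Ansible works, so that you know where performance problems arise and how to address them. To verify the effect of an optimization, you also need to know how to measure task execution speed. This article therefore covers three topics:

  1. Execution flow analysis
  2. Execution flow optimization
  3. Execution speed measurement

Let us start with the first topic: analyzing the Ansible execution flow.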

The ansible and ansible-playbook commands accept the -v option, which prints more detail about what happens during execution:

-v, --verbose         verbose mode (-vvv for more, -vvvv to enable
                      connection debugging)

By analyzing this output, we can see how Ansible works when it executes a task remotely:

$ ansible -vvv 'tencent' -m ping

The output is as follows:

# Environment and version information
ansible 2.9.25
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/usr/share/ansible-library']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Nov 16 2020, 22:23:17) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]

# Configuration and inventory used for the remote execution
Using /etc/ansible/ansible.cfg as config file
host_list declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
script declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
auto declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
Parsed /etc/ansible/hosts inventory source with ini plugin
META: ran handlers

# 1. First connection: get the user's home directory, here /root
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'echo ~ && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '/root\n', '')

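# 2. Second connection: create a temporary directory under the home directory; its location is controlled by the remote_tmp option in the configuration file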
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /root/.ansible/tmp `"&& mkdir "` echo /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618 `" && echo ansible-tmp-1678348016.02-28826-116551982443618="` echo /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618 `" ) && sleep 0'"'"''
<bj-tencent-lhins-1> (0, 'ansible-tmp-1678348016.02-28826-116551982443618=/root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618\n', '')

# 3. Third connection: probe the target node's platform and the Python interpreters available on it
<tencent> Attempting python interpreter discovery
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'echo PLATFORM; uname; echo FOUND; command -v '"'"'"'"'"'"'"'"'/usr/bin/python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.5'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/libexec/platform-python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/bin/python3'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python'"'"'"'"'"'"'"'"'; echo ENDFOUND && sleep 0'"'"''
<bj-tencent-lhins-1> (0, 'PLATFORM\nLinux\nFOUND\n/usr/bin/python\n/usr/bin/python2.7\n/usr/libexec/platform-python\n/usr/bin/python\nENDFOUND\n', '')

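# 4. Fourth connection: collect the target node's OS release information; the module's code and arguments are then packed into a local temporary file and uploaded to the temporary directory on the managed node via sftp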
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '{"osrelease_content": "NAME=\\"CentOS Linux\\"\\nVERSION=\\"7 (Core)\\"\\nID=\\"centos\\"\\nID_LIKE=\\"rhel fedora\\"\\nVERSION_ID=\\"7\\"\\nPRETTY_NAME=\\"CentOS Linux 7 (Core)\\"\\nANSI_COLOR=\\"0;31\\"\\nCPE_NAME=\\"cpe:/o:centos:centos:7\\"\\nHOME_URL=\\"https://www.centos.org/\\"\\nBUG_REPORT_URL=\\"https://bugs.centos.org/\\"\\n\\nCENTOS_MANTISBT_PROJECT=\\"CentOS-7\\"\\nCENTOS_MANTISBT_PROJECT_VERSION=\\"7\\"\\nREDHAT_SUPPORT_PRODUCT=\\"centos\\"\\nREDHAT_SUPPORT_PRODUCT_VERSION=\\"7\\"\\n\\n", "platform_dist_result": ["centos", "7.9.2009", "Core"]}\n', '')
Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/ping.py
<bj-tencent-lhins-1> PUT /root/.ansible/tmp/ansible-local-28818KyMMwB/tmpFO9j5N TO /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618/AnsiballZ_ping.py
# Upload the file via sftp
<bj-tencent-lhins-1> SSH: EXEC sftp -b - -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a '[bj-tencent-lhins-1]'
<bj-tencent-lhins-1> (0, 'sftp> put /root/.ansible/tmp/ansible-local-28818KyMMwB/tmpFO9j5N /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618/AnsiballZ_ping.py\n', '')

# 5. Fifth connection: grant execute permission on the task file on the target node
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'chmod u+x /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618/ /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618/AnsiballZ_ping.py && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '', '')

# 6. Sixth connection: run the task on the target node
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a -tt bj-tencent-lhins-1 '/bin/sh -c '"'"'/usr/bin/python /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618/AnsiballZ_ping.py && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '\r\n{"invocation": {"module_args": {"data": "pong"}}, "ping": "pong"}\r\n', 'Shared connection to bj-tencent-lhins-1 closed.\r\n')

# 7. Seventh connection: delete the temporary directory on the target node
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'rm -f -r /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618/ > /dev/null 2>&1 && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '', '')
tencent | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    }, 
    "changed": false, 
    "invocation": {
        "module_args": {
            "data": "pong"
        }
    }, 
    "ping": "pong"
}
META: ran handlers
META: ran handlers

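To summarize, Ansible establishes 7 SSH connections when it executes a task. Each connection does the following:

  • (1). First connection: get the home directory of the target user on the remote host, here /root
  • (2). Second connection: create a temporary directory under the remote home directory; its location can be changed with the remote_tmp option in ansible.cfg
  • (3). Third connection: probe the target node's platform and the Python interpreters available on it
  • (4). Fourth connection: collect the target node's OS release information, then pack the module's code and arguments into a local temporary file and upload it to the temporary directory on the managed node via sftp
  • (5). Fifth connection: grant execute permission on the task file on the target node
  • (6). Sixth connection: run the task on the target node
  • (7). Seventh connection: delete the temporary directory on the target node and return the execution result to the Ansible control node

This is only a single node. Normally Ansible targets a group of nodes, so the overall process may look like this (default configuration, ignoring callbacks):

  • (1). Enter the first play and pick N nodes, where N is the forks setting
  • (2). Each node runs the first task, establishing 7 SSH connections per node
  • (3). Each node runs the second task, again establishing 7 SSH connections per node
  • (4). The remaining tasks in the play run the same way…
  • (5). After all nodes have finished all tasks in the play, move on to the next play
  • (6). Repeat until all tasks in all plays have run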

The flow above reflects the default configuration only; some settings change Ansible's execution strategy.
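As a rough sense of scale, the default flow can be turned into back-of-the-envelope arithmetic. The sketch below simply multiplies hosts, tasks, and the 7 SSH operations per task counted in the -vvv output. The function name and the example figures are illustrative assumptions, and real runs can differ (for example, interpreter discovery may not be repeated for every task).

```python
def estimate_ssh_operations(hosts: int, tasks: int, per_task: int = 7) -> int:
    """Rough estimate: every host repeats `per_task` SSH operations for each task."""
    return hosts * tasks * per_task


# Hypothetical playbook: 3 hosts, 10 tasks, default flow
print(estimate_ssh_operations(hosts=3, tasks=10))  # 210
```

Even a small playbook quickly reaches hundreds of SSH operations, which is why the per-task connection count matters.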

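2. Ansible Execution Strategies

To make Ansible run the way we expect, we need to look more closely at its "execution strategy", which involves several keywords: forks, serial, strategy, and throttle.

2.1 forks

Earlier we mentioned that forks sets the maximum number of nodes that execute tasks (a playbook) at the same time; the default is 5. Since we have only 3 nodes in total, we lower it to 2 for this experiment: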

$ vim /etc/ansible/ansible.cfg
forks          = 2

Playbook definition:

---
- hosts: ecs
  gather_facts: no
  tasks:
  - name: debug demo
    shell: "sleep 20; echo ansible"

Run the playbook:

$ ansible-playbook playbook-execute-demo1.yml

Inspect the processes:

$ ps -ef | grep playbook
root     23405 22856 92 13:51 pts/7    00:00:00 /usr/bin/python2 /usr/bin/ansible-playbook playbook-execute-demo1.yml
root     23415 23405  0 13:51 pts/7    00:00:00 /usr/bin/python2 /usr/bin/ansible-playbook playbook-execute-demo1.yml
root     23416 23405  0 13:51 pts/7    00:00:00 /usr/bin/python2 /usr/bin/ansible-playbook playbook-execute-demo1.yml

Three processes are listed, but only the two child processes (23415 and 23416) actually execute tasks, as the SSH processes confirm:

$ ps -ef | grep "ssh -C"
root     23469 23415  0 13:51 pts/7    00:00:00 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/16b0cd324c -tt sz-aliyun-ecs-1 /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1640411472.7-23415-13386693082019/AnsiballZ_command.py && sleep 0'
root     23488 23416  0 13:51 pts/7    00:00:00 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/54dc19ddaa -tt bj-huawei-hecs-1 /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1640411473.17-23416-227469691024889/AnsiballZ_command.py && sleep 0'

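As you can see, only 2 SSH processes run at the same time. The forks strategy can be summed up in one sentence: "create up to N worker processes, where N is the forks value, to execute tasks remotely; whenever a node finishes, start a new child process for another node", as shown below: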

$ ps -ef | grep playbook
root     23405 22856 15 13:51 pts/7    00:00:06 /usr/bin/python2 /usr/bin/ansible-playbook playbook-execute-demo1.yml
root     23503 23405  0 13:51 pts/7    00:00:00 /usr/bin/python2 /usr/bin/ansible-playbook playbook-execute-demo1.yml

$ ps -ef | grep "ssh -C"
root     23516 23503  0 13:51 pts/7    00:00:00 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a -tt bj-tencent-lhins-1 /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1640411494.08-23503-146381019301363/AnsiballZ_command.py && sleep 0'

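The output always contains one extra process: the Ansible control process. It monitors the state of task execution on each node and decides whether to create a child process for remote execution. For example, with the default forks of 5 and 10 nodes in total, as soon as one of the first 5 nodes finishes, the control process immediately creates a new process so that the 6th node can run the task.

Besides setting forks in the configuration file, you can also declare the maximum concurrency with ansible-playbook -f <N>, as shown below: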

$ ansible-playbook -f 3 playbook-execute-demo1.yml
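The slot-refilling behaviour described above can be pictured with a small simulation. This is a simplified model rather than Ansible's scheduler: it assumes hosts start in list order and that a waiting host starts the moment any running host finishes. The function name and host labels are made up for illustration.

```python
import heapq


def simulate_forks(durations, forks):
    """Return (start, end) times per host when at most `forks` hosts run at once."""
    running = []   # min-heap of end times for hosts currently occupying a slot
    schedule = []
    for duration in durations:
        if len(running) < forks:
            start = 0.0                      # a slot is still free at time zero
        else:
            start = heapq.heappop(running)   # wait for the earliest slot to free up
        end = start + duration
        heapq.heappush(running, end)
        schedule.append((start, end))
    return schedule


# Three hosts, 20 s each, forks=2 (the experiment in this section)
names = ["host1", "host2", "host3"]
for name, (start, end) in zip(names, simulate_forks([20, 20, 20], forks=2)):
    print(f"{name}: starts at {start:.0f}s, ends at {end:.0f}s")
```

With these inputs the third host only starts at 20 s, which matches the process listings above: two workers first, then a single new worker for the remaining node.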

2.2 serial

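serial is a play-level directive that groups N nodes into one batch; only after that batch has finished the play is the next batch scheduled. If serial is not specified, all nodes are treated as a single batch.

A few points deserve special mention:

  • serial specifies how many nodes form a batch
  • forks specifies the maximum number of nodes executing tasks at the same time

In short, forks limits the number of processes Ansible uses to execute tasks and is fairly coarse-grained, while serial changes how nodes are scheduled through a play and is more flexible and powerful: besides integers, it accepts percentages and lists of batch sizes.

  • Single number N: run all tasks in the play in batches of N nodes, e.g. serial: 3
  • Percentage N%: run all tasks in the play in batches of N% of the nodes, e.g. serial: 50%
  • List: use the list elements in turn as batch sizes, e.g. [1, 3, 50%] means the first batch has one node, the second batch takes 3 of the remaining nodes, and the third batch takes 50% of the nodes (the percentage is calculated against the total number of hosts in the play, capped by how many remain); any later batches reuse the last element as their size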
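To make the batch sizes concrete, here is a minimal sketch of how a serial value could be turned into batches. It is a simplified model written for this article, not Ansible's own code; the function names are illustrative, and it assumes percentages are taken of the total host count and rounded down with a minimum of 1.

```python
def pct_to_int(value, total, min_value=1):
    """Convert "50%" to a host count relative to `total`; pass integers through."""
    if isinstance(value, str) and value.endswith("%"):
        return int(int(value[:-1]) / 100.0 * total) or min_value
    return int(value)


def serial_batches(hosts, serial):
    """Split `hosts` into batches according to a serial value or list of values."""
    batch_sizes = serial if isinstance(serial, list) else [serial]
    remaining = list(hosts)
    total = len(remaining)
    batches = []
    index = 0
    while remaining:
        size = pct_to_int(batch_sizes[index], total)
        if size <= 0:
            # An invalid size falls back to one batch with everything left
            batches.append(remaining)
            break
        batches.append(remaining[:size])
        remaining = remaining[size:]
        # Once the list is exhausted, keep reusing its last element
        index = min(index + 1, len(batch_sizes) - 1)
    return batches


hosts = [f"node{i}" for i in range(1, 11)]
for batch in serial_batches(hosts, [1, 3, "50%"]):
    print(batch)
```

For 10 hypothetical nodes this yields batches of 1, 3, 5, and finally the single node that is left.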

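Using the same playbook with one extra line that sets serial (serial: 1 would produce the one-host batches shown below), pstree shows that only one node is executing the task at any moment: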

$ pstree -apsl `ps -ef|grep ansible-playbook|head -n1|awk '{print $2}'` 
systemd,1 --switched-root --system --deserialize 22
  └─sshd,1068 -D
      └─sshd,19433
          └─zsh,19436
              └─ansible-playboo,20225 /usr/bin/ansible-playbook playbook-execute-demo1.yml
                  ├─ansible-playboo,20235 /usr/bin/ansible-playbook playbook-execute-demo1.yml
                  │   └─ssh,20268 -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/16b0cd324c -tt sz-aliyun-ecs-1 /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1678411494.46-20235-211742113762814/AnsiballZ_command.py && sleep 0'
                  └─{ansible-playboo},20234

Output:

$ ap playbook-execute-demo1.yml 

PLAY [ecs] ***************

TASK [debug demo] ***************
changed: [ecs-1.aliyun.sz]

PLAY [ecs] ***************

TASK [debug demo] ***************
changed: [huawei]

PLAY [ecs] ***************

TASK [debug demo] ***************
changed: [tencent]

PLAY RECAP ***************
ecs-1.aliyun.sz    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
huawei    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tencent    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

The command output also shows that the batches ran one node at a time.

2.3 strategy

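The strategy directive specifies how nodes proceed through the tasks; its focus is on the nodes rather than on the tasks:

  • linear: the default strategy; a node that finishes a task waits until all other nodes have finished that task, and then all nodes move on to the next task together
  • free: a node that finishes a task continues with the remaining tasks in the play without waiting for the other nodes; once it has finished the play, it releases its slot so that a node that has not started yet can begin

The free strategy improves efficiency in most scenarios, because nodes that finish first make room for the others, unless you have strict requirements on the order in which nodes in the same batch run the play. It is configured as follows: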

---
- hosts: ecs
  strategy: free
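The difference between the two strategies can be illustrated with a toy calculation. The sketch below assumes forks is at least the number of hosts and ignores all overhead; the host names and per-task durations are invented for the example.

```python
def play_time_linear(task_times):
    """All hosts synchronise after each task, so each task costs its slowest host."""
    per_task = zip(*task_times.values())
    return sum(max(times) for times in per_task)


def play_time_free(task_times):
    """Each host runs through its tasks independently; the slowest host sets the total."""
    return max(sum(times) for times in task_times.values())


# Hypothetical per-task durations in seconds for three hosts
task_times = {
    "fast": [1, 1, 1],
    "mixed": [5, 1, 1],
    "slow": [1, 1, 5],
}
print(play_time_linear(task_times))  # 11
print(play_time_free(task_times))    # 7
```

The gap only appears when hosts are slow on different tasks; if every host takes the same time on every task, both functions return the same value.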

2.4 throttle

3. Optimizing Ansible Execution

3.1 Measuring speed: the profile_tasks plugin

Enable the profile_tasks plugin on the callback_whitelist line of the ansible.cfg configuration file:

[defaults]

callback_whitelist = profile_tasks
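# To keep the output easy to read, only profile_tasks is enabled for now
# callback_whitelist = timer, profile_tasks, profile_roles

  • timer: shows the total duration of the playbook run
  • profile_tasks: adds a start timestamp to each task and, at the end of the playbook run, lists the time each task took in descending order
  • profile_roles: at the end of the run, lists the time each role took in descending order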

3.2 Optimization 1: increase forks

Target nodes: 6, forks=5

---
- hosts: ecs
#  strategy: free
  gather_facts: no
  tasks:
  - name: debug demo
    shell: "sleep 10; echo ansible"

Execution time:

$ ap --forks=5 playbook-execute-demo1.yml

PLAY [ecs] ***************

TASK [debug demo] ***************
Friday 10 March 2023  09:54:35 +0800 (0:00:00.054)    0:00:00.054 *************** 
changed: [ecs-1.aliyun.sz]
changed: [47.115.121.119]
changed: [114.115.159.174]
changed: [huawei]
changed: [tencent]
changed: [192.144.227.61]

PLAY RECAP ***************
114.115.159.174    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
192.144.227.61    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
47.115.121.119    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
ecs-1.aliyun.sz    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
huawei    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tencent    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Friday 10 March 2023  09:54:58 +0800 (0:00:23.088)    0:00:23.143 *************** 
=============================================================================== 
debug demo --------------- 23.09s

Now with forks=10:

$ ap --forks=10 playbook-execute-demo1.yml

PLAY [ecs] ***************

TASK [debug demo] ***************
Friday 10 March 2023  09:55:04 +0800 (0:00:00.054)    0:00:00.054 *************** 
changed: [ecs-1.aliyun.sz]
changed: [47.115.121.119]
changed: [114.115.159.174]
changed: [192.144.227.61]
changed: [huawei]
changed: [tencent]

PLAY RECAP ***************
114.115.159.174    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
192.144.227.61    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
47.115.121.119    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
ecs-1.aliyun.sz    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
huawei    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
tencent    : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Friday 10 March 2023  09:55:17 +0800 (0:00:13.755)    0:00:13.809 *************** 
=============================================================================== 
debug demo --------------- 13.76s
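A quick sanity check on these two measurements: if every host takes the same time and hosts run in waves of at most forks hosts, the task time is roughly the number of waves times the per-host duration. The function name below is illustrative, and the model ignores connection and module overhead.

```python
import math


def estimated_task_time(hosts, forks, seconds_per_host):
    """Number of waves needed to cover all hosts, times the per-host duration."""
    waves = math.ceil(hosts / forks)
    return waves * seconds_per_host


print(estimated_task_time(hosts=6, forks=5, seconds_per_host=10))   # 20
print(estimated_task_time(hosts=6, forks=10, seconds_per_host=10))  # 10
```

The measured 23.09 s and 13.76 s are consistent with two waves versus one wave, plus a few seconds of overhead in each run.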

3.3 Optimization 2: change the execution strategy

Ansible ships with 4 strategy plugins by default:

$ ls -a /usr/lib/python2.7/site-packages/ansible/plugins/strategy/*.py|grep -v 'init'
/usr/lib/python2.7/site-packages/ansible/plugins/strategy/debug.py
/usr/lib/python2.7/site-packages/ansible/plugins/strategy/free.py
/usr/lib/python2.7/site-packages/ansible/plugins/strategy/host_pinned.py
/usr/lib/python2.7/site-packages/ansible/plugins/strategy/linear.py

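The following two are the ones most commonly used and tuned:

  • linear: a node waits for the other nodes to finish each task and for the other nodes in the same batch to finish the play; the slots are then released together

  • free: reduces waiting and releases slots sooner; a node that finishes early frees its slot early

The speed-up from free shows when there are too many target nodes to fit into a single batch.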

---
- hosts: ecs
  strategy: free
  gather_facts: no
  tasks:
  - name: debug demo
    shell: "sleep 10; echo ansible"

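3.4 Optimization 3: run tasks asynchronously

By default, Ansible executes tasks synchronously: tasks run in order, and the next one starts only after the current one has finished. Some tasks, however, do not need to be waited on.

For example: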

---
- hosts: tencent
  strategy: free
  gather_facts: no
  tasks:
  - name: "debug demo"
    shell: "sleep 10; echo ansible"
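    # Run the task asynchronously in the background; if it has not finished within 20 seconds, it is considered failed
    async: 20
    # How often, in seconds, to check whether the async task has succeeded or failed
    poll: 5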
  - name: "other tasks"
    debug:
        msg: "done."

Result:

$ ap --forks=10 playbook-execute-demo1.yml

PLAY [tencent] ***************
Saturday 11 March 2023  09:01:09 +0800 (0:00:00.093)    0:00:00.093 *************** 

TASK [debug demo] ***************
changed: [tencent]
Saturday 11 March 2023  09:01:23 +0800 (0:00:14.640)    0:00:14.734 *************** 

TASK [other tasks] ***************
ok: [tencent] => {
    "msg": "done."
}

PLAY RECAP ***************
tencent    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Saturday 11 March 2023  09:01:23 +0800 (0:00:00.025)    0:00:14.759 *************** 
=============================================================================== 
debug demo --------------- 14.64s
other tasks --------------- 0.03s

Now comment out the async and poll directives and try again:

$ ap --forks=10 playbook-execute-demo1.yml

PLAY [tencent] ***************
Saturday 11 March 2023  09:02:48 +0800 (0:00:00.063)    0:00:00.063 *************** 

TASK [debug demo] ***************
changed: [tencent]
Saturday 11 March 2023  09:03:00 +0800 (0:00:12.243)    0:00:12.307 *************** 

TASK [other tasks] ***************
ok: [tencent] => {
    "msg": "done."
}

PLAY RECAP ***************
tencent    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Saturday 11 March 2023  09:03:00 +0800 (0:00:00.028)    0:00:12.335 *************** 
=============================================================================== 
debug demo --------------- 12.24s
other tasks --------------- 0.03s

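Adding async and poll actually made the run slower. When poll is non-zero the task is not truly asynchronous, because Ansible periodically checks the task's status. If the task finishes just after a check, Ansible does not notice until the next check interval has elapsed, and only then do the subsequent tasks continue.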
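The effect of the polling interval can be sketched with a small model. It assumes status checks happen at poll, 2*poll, 3*poll, ... seconds and that the job ends at overhead + duration; the function name and the 0.5 s overhead are hypothetical values chosen for illustration.

```python
import math


def detected_at(duration, poll, overhead=0.0):
    """Time of the first status check at or after the moment the job finishes."""
    finished = overhead + duration
    return math.ceil(finished / poll) * poll


print(detected_at(duration=10, poll=5))                # 10
print(detected_at(duration=10, poll=5, overhead=0.5))  # 15
```

A small startup delay is enough to push a 10-second job past the 10-second check, so it is only noticed at the next one. The 14.64 s measured above fits that pattern.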

Change poll to 0:

---
- hosts: tencent
  strategy: free
  gather_facts: no
  tasks:
  - name: "debug demo"
    shell: "sleep 10; echo ansible"
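    # Run the task asynchronously in the background; if it has not finished within 20 seconds, it is considered failed
    async: 20
    # Polling interval in seconds; 0 means start the task and move on without waiting for it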
    poll: 0
  - name: "other tasks"
    debug:
        msg: "done."

Result:

$ ap --forks=10 playbook-execute-demo1.yml

PLAY [tencent] ***************
Saturday 11 March 2023  09:10:52 +0800 (0:00:00.059)    0:00:00.059 *************** 

TASK [debug demo] ***************
changed: [tencent]
Saturday 11 March 2023  09:10:55 +0800 (0:00:03.081)    0:00:03.141 *************** 

TASK [other tasks] ***************
ok: [tencent] => {
    "msg": "done."
}

PLAY RECAP ***************
tencent    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Saturday 11 March 2023  09:10:55 +0800 (0:00:00.033)    0:00:03.174 *************** 
=============================================================================== 
debug demo --------------- 3.08s
other tasks --------------- 0.03s

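Sometimes, however, we do need to check a task's status or result. This is what the async_status module is for: it takes the background task's job id as a parameter and returns the task's status, including the following fields:

  • ansible_job_id: the job id of the async task
  • finished: whether the async task has finished; 1 means finished, 0 means not finished
  • started: whether the async task has started; 1 means started, 0 means not started

Example: start the async task first, run synchronous tasks in the meantime, and check the async task's status at the end: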

---
- hosts: tencent
  strategy: free
  gather_facts: no
  tasks:
  - name: "debug demo"
    shell: "sleep 10; echo ansible"
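    # Run the task asynchronously in the background; if it has not finished within 20 seconds, it is considered failed
    async: 20
    # Polling interval in seconds; 0 means start the task and move on without waiting for it
    poll: 0
    register: async_job
  - name: "other tasks1"
    debug:
        msg: "other tasks1"
  - name: "other tasks2"
    debug:
        msg: "other tasks2"
  # Task name means "wait for the async task to finish"
  - name: "等待异步任务完成"
    async_status:
      jid: "{{ async_job.ansible_job_id }}"
    register: async_job_result
    # until keeps retrying until async_job_result.finished is true (the async task has completed)
    until: async_job_result.finished
    # Number of retries
    retries: 30
    # Delay between retries, in seconds
    delay: 5
  # Not in the original listing: reconstructed from the run output below,
  # which shows a fifth task ("get the async task's output") printing the job's stdout
  - name: "获取异步任务输出"
    debug:
        msg: "{{ async_job_result.stdout }}"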

Result:

$ ap --forks=10 playbook-execute-demo1.yml

PLAY [tencent] ***************
Saturday 11 March 2023  09:25:24 +0800 (0:00:00.062)    0:00:00.062 *************** 

TASK [debug demo] ***************
changed: [tencent]
Saturday 11 March 2023  09:25:27 +0800 (0:00:02.832)    0:00:02.894 *************** 

TASK [other tasks1] ***************
ok: [tencent] => {
    "msg": "other tasks1"
}
Saturday 11 March 2023  09:25:27 +0800 (0:00:00.016)    0:00:02.911 *************** 

TASK [other tasks2] ***************
ok: [tencent] => {
    "msg": "other tasks2"
}
Saturday 11 March 2023  09:25:27 +0800 (0:00:00.017)    0:00:02.928 *************** 
FAILED - RETRYING: 等待异步任务完成 (30 retries left).
FAILED - RETRYING: 等待异步任务完成 (29 retries left).

TASK [等待异步任务完成] ***************
changed: [tencent]
Saturday 11 March 2023  09:25:40 +0800 (0:00:13.705)    0:00:16.634 *************** 

TASK [获取异步任务输出] ***************
ok: [tencent] => {
    "msg": "ansible"
}

PLAY RECAP ***************
tencent    : ok=5    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Saturday 11 March 2023  09:25:40 +0800 (0:00:00.053)    0:00:16.687 *************** 
=============================================================================== 
等待异步任务完成 --------------- 13.71s
debug demo --------------- 2.83s
获取异步任务输出 --------------- 0.05s
other tasks2 --------------- 0.02s
other tasks1 --------------- 0.02s
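The until / retries / delay loop on the waiting task can be pictured as a plain polling loop. The sketch below is a generic model with made-up names, not Ansible's implementation, and Ansible's own counting of attempts versus retries may differ. The fake status source reports "finished" on the third check, mirroring the two FAILED - RETRYING lines in the run above.

```python
import time


def wait_until(check, retries, delay, sleep=time.sleep):
    """Call check() until it returns a truthy value or the attempts run out."""
    for attempt in range(1, retries + 1):
        result = check()
        if result:
            return attempt, result
        if attempt < retries:
            sleep(delay)  # wait before the next status check
    raise TimeoutError(f"condition not met after {retries} attempts")


# Fake job status: not finished twice, then finished
statuses = iter([False, False, True])
attempt, _ = wait_until(lambda: next(statuses), retries=30, delay=5,
                        sleep=lambda seconds: None)  # skip real sleeping in the demo
print(f"finished after {attempt} checks")
```

Because the loop only looks at the job every delay seconds, a shorter delay notices completion sooner at the cost of more status checks.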

In some scenarios you may need to wait for several async tasks; the output of such a run looks like this:

$ ap --forks=10 playbook-execute-demo1.yml

PLAY [tencent] ***************
Saturday 11 March 2023  09:37:56 +0800 (0:00:00.061)    0:00:00.061 *************** 

TASK [async job1] ***************
changed: [tencent]
Saturday 11 March 2023  09:37:59 +0800 (0:00:02.966)    0:00:03.027 *************** 

TASK [async job2] ***************
changed: [tencent]
Saturday 11 March 2023  09:38:01 +0800 (0:00:01.795)    0:00:04.823 *************** 

TASK [other tasks1] ***************
ok: [tencent] => {
    "msg": "other tasks1"
}
Saturday 11 March 2023  09:38:01 +0800 (0:00:00.018)    0:00:04.841 *************** 

TASK [other tasks2] ***************
ok: [tencent] => {
    "msg": "other tasks2"
}
Saturday 11 March 2023  09:38:01 +0800 (0:00:00.018)    0:00:04.859 *************** 
FAILED - RETRYING: 等待异步任务完成 (30 retries left).

TASK [等待异步任务完成] ***************
changed: [tencent] => (item=408631890851.16423)
changed: [tencent] => (item=987065119603.16471)
Saturday 11 March 2023  09:38:10 +0800 (0:00:08.983)    0:00:13.842 *************** 

TASK [获取异步任务输出] ***************
ok: [tencent] => (item={u'stderr_lines': [], u'changed': True, u'ansible_job_id': u'408631890851.16423', u'stdout': u'Async Job1', u'finished': 1, u'delta': u'0:00:05.009015', u'stdout_lines': [u'Async Job1'], u'ansible_loop_var': u'async_job_id', u'end': u'2023-03-11 09:38:04.721359', u'start': u'2023-03-11 09:37:59.712344', u'cmd': u'sleep 5; echo Async Job1', u'attempts': 2, u'failed': False, u'stderr': u'', u'rc': 0, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'strip_empty_ends': True, u'_raw_params': u'sleep 5; echo Async Job1', u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin_add_newline': True, u'stdin': None}}, u'async_job_id': u'408631890851.16423'}) => {
    "msg": "任务输出:Async Job1"
}
ok: [tencent] => (item={u'stderr_lines': [], u'changed': True, u'ansible_job_id': u'987065119603.16471', u'stdout': u'Async Job1', u'finished': 1, u'delta': u'0:00:05.009973', u'stdout_lines': [u'Async Job1'], u'ansible_loop_var': u'async_job_id', u'end': u'2023-03-11 09:38:06.515008', u'start': u'2023-03-11 09:38:01.505035', u'cmd': u'sleep 5; echo Async Job1', u'attempts': 1, u'failed': False, u'stderr': u'', u'rc': 0, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'strip_empty_ends': True, u'_raw_params': u'sleep 5; echo Async Job1', u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin_add_newline': True, u'stdin': None}}, u'async_job_id': u'987065119603.16471'}) => {
    "msg": "任务输出:Async Job1"
}

PLAY RECAP ***************
tencent    : ok=6    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Saturday 11 March 2023  09:38:10 +0800 (0:00:00.068)    0:00:13.911 *************** 
=============================================================================== 
等待异步任务完成 --------------- 8.98s
async job1 --------------- 2.97s
async job2 --------------- 1.80s
获取异步任务输出 --------------- 0.07s
other tasks2 --------------- 0.02s
other tasks1 --------------- 0.02s

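To summarize, async tasks are a good fit when:

  1. A task runs for a very long time and might hit the SSH connection timeout
  2. No other task depends on whether this task has completed
  3. You need control back quickly in order to run other commands

Async tasks are not a good fit when:

  1. Other tasks can only continue after this task has finished
  2. The task acquires an exclusive lock
  3. The playbook consists entirely of short tasks: async brings no noticeable speed-up, makes the run harder to read, and with a non-zero poll can even reduce efficiency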

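3.5 Optimization 4: configure persistent SSH connections

Ansible relies heavily on SSH. As the -vvv output above showed, a single task needs at least seven connections, so tuning SSH also improves Ansible's efficiency.

Enabling persistent SSH connections allows connection reuse: until it expires, an established SSH connection is kept open, and the next SSH connection to the same target node uses it directly.

Edit /etc/ansible/ansible.cfg: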

ssh_args = -C -o ControlMaster=auto -o ControlPersist=1d

First, the execution time without persistent SSH connections:

$ time ansible 'tencent' -m ping
tencent | SUCCESS => {
    "ansible_facts": {
    "discovered_interpreter_python": "/usr/bin/python"
    }, 
    "changed": false, 
    "ping": "pong"
}
ansible 'tencent' -m ping  1.31s user 0.25s system 54% cpu 2.883 total

The connection socket is cached in a directory set by the control_path_dir option in ansible.cfg:

$ ls -l ~/.ansible/cp/
total 0
srw--------------- 1 root root 0 Mar 11 10:00 68cade842a

Execution time with the persistent connection in place:

$ time ansible 'tencent' -m ping
tencent | SUCCESS => {
    "ansible_facts": {
    "discovered_interpreter_python": "/usr/bin/python"
    }, 
    "changed": false, 
    "ping": "pong"
}
ansible 'tencent' -m ping  1.28s user 0.24s system 62% cpu 2.453 total

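Along with the performance gain come some problems, for example:

  1. Even if the authentication credentials on the managed node change, the control node can still connect as long as the cached connection has not expired
  2. Even if the connection variables for the managed node change, the change does not take effect until the cached connection expires
  3. An overly long cache lifetime leaves many ESTABLISHED connections on the system, tying up socket resources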

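3.6 Optimization 5: enable pipelining

With pipelining, Ansible sends the module to the remote Python interpreter over the SSH session instead of first copying it to a temporary file on the target, so a task needs far fewer SSH operations. The -vvv output below shows the effect.

Enable pipelining in ansible.cfg: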

# Enabling pipelining reduces the number of SSH operations required to
pipelining = True

With pipelining enabled, the number of SSH connections drops from 7 to 3:

$ ansible tencent -m ping -vvv

# Environment and version information
ansible 2.9.25
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/usr/share/ansible-library']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Nov 16 2020, 22:23:17) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]

# Configuration and inventory used for the remote execution
Using /etc/ansible/ansible.cfg as config file
host_list declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
script declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
auto declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
Parsed /etc/ansible/hosts inventory source with ini plugin
Skipping callback 'actionable', as we already have a stdout callback.
Skipping callback 'counter_enabled', as we already have a stdout callback.
Skipping callback 'debug', as we already have a stdout callback.
Skipping callback 'dense', as we already have a stdout callback.
Skipping callback 'dense', as we already have a stdout callback.
Skipping callback 'full_skip', as we already have a stdout callback.
Skipping callback 'json', as we already have a stdout callback.
Skipping callback 'minimal', as we already have a stdout callback.
Skipping callback 'null', as we already have a stdout callback.
Skipping callback 'oneline', as we already have a stdout callback.
Skipping callback 'selective', as we already have a stdout callback.
Skipping callback 'skippy', as we already have a stdout callback.
Skipping callback 'stderr', as we already have a stdout callback.
Skipping callback 'unixy', as we already have a stdout callback.
Skipping callback 'yaml', as we already have a stdout callback.
META: ran handlers
<tencent> Attempting python interpreter discovery

# 1. First SSH connection: find the Python interpreters available on the target node
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=1d -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'echo PLATFORM; uname; echo FOUND; command -v '"'"'"'"'"'"'"'"'/usr/bin/python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.5'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/libexec/platform-python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/bin/python3'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python'"'"'"'"'"'"'"'"'; echo ENDFOUND && sleep 0'"'"''
<bj-tencent-lhins-1> (0, 'PLATFORM\nLinux\nFOUND\n/usr/bin/python\n/usr/bin/python2.7\n/usr/libexec/platform-python\n/usr/bin/python\nENDFOUND\n', '')

# 2. Second SSH connection: get the target node's operating system information
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=1d -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '{"osrelease_content": "NAME=\\"CentOS Linux\\"\\nVERSION=\\"7 (Core)\\"\\nID=\\"centos\\"\\nID_LIKE=\\"rhel fedora\\"\\nVERSION_ID=\\"7\\"\\nPRETTY_NAME=\\"CentOS Linux 7 (Core)\\"\\nANSI_COLOR=\\"0;31\\"\\nCPE_NAME=\\"cpe:/o:centos:centos:7\\"\\nHOME_URL=\\"https://www.centos.org/\\"\\nBUG_REPORT_URL=\\"https://bugs.centos.org/\\"\\n\\nCENTOS_MANTISBT_PROJECT=\\"CentOS-7\\"\\nCENTOS_MANTISBT_PROJECT_VERSION=\\"7\\"\\nREDHAT_SUPPORT_PRODUCT=\\"centos\\"\\nREDHAT_SUPPORT_PRODUCT_VERSION=\\"7\\"\\n\\n", "platform_dist_result": ["centos", "7.9.2009", "Core"]}\n', '')

# Prepare to run the task: load the module file the task uses and check whether Pipelining is enabled
Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/ping.py
Pipelining is enabled.

# 3. Third SSH connection: run the task
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=1d -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '\n{"invocation": {"module_args": {"data": "pong"}}, "ping": "pong"}\n', '')
tencent | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": false,
    "invocation": {
        "module_args": {
            "data": "pong"
        }
    },
    "ping": "pong"
}
META: ran handlers
META: ran handlers

Measuring the performance difference with Pipelining enabled versus disabled

---
- name: test for timer
  hosts: timer
  gather_facts: no
  tasks:
    - name: only one debug
      debug: 
        var: inventory_hostname
      
    - name: shell
      shell:
        cp /etc/fstab /tmp/
      loop: "{{ range(0, 100)|list }}"

    - name: scp
      copy:
        src: /etc/hosts
        dest: /tmp/
      loop: "{{ range(0, 100)|list }}"

Run the playbook

$ ap --forks=10 playbook-execute-demo2.yml

Pipelining disabled

PLAY RECAP ***************
tencent    : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Saturday 11 March 2023  12:15:05 +0800 (0:02:55.225)    0:04:42.381 *************** 
=============================================================================== 
scp --------------- 175.23s
shell --------------- 107.05s
only one debug --------------- 0.04s

Pipelining enabled

Saturday 11 March 2023  12:09:40 +0800 (0:02:07.824)    0:03:13.579 *************** 
=============================================================================== 
scp --------------- 127.82s
shell --------------- 65.65s
only one debug --------------- 0.05s

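Total run time drops by roughly 31% (0:04:42.381 down to 0:03:13.579). Per task, scp takes about 27% less time and shell about 39% less, which is a solid gain.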
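
The savings can be recomputed from the two sets of timings above. The following is a minimal sketch that assumes awk is available; pct_saved is a hypothetical helper name used only for illustration, not something Ansible provides.

```shell
#!/bin/sh
# pct_saved BEFORE AFTER
# Hypothetical helper: prints the percentage of run time saved,
# rounded to one decimal place. Both inputs are in seconds.
pct_saved() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Timings copied from the two runs above (Pipelining disabled vs. enabled).
pct_saved 175.23 127.82     # scp task
pct_saved 107.05 65.65      # shell task
pct_saved 282.381 193.579   # whole play: 0:04:42.381 vs. 0:03:13.579
```

Comparing the per-task numbers alongside the total helps show which kind of task benefits most from a given setting.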

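3.7 Optimization 6: Adjust fact-gathering behavior

By default, Ansible gathers all facts from every node, which is a slow process. When the facts are not needed, you can skip gathering them entirely: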

gather_facts: no

Or gather only specific subsets; see the official documentation.

Allowed values: all, all_ipv4_addresses, all_ipv6_addresses, apparmor, architecture, caps, chroot, cmdline, date_time, default_ipv4, default_ipv6, devices, distribution, distribution_major_version, distribution_release, distribution_version, dns, effective_group_ids, effective_user_id, env, facter, fips, hardware, interfaces, is_chroot, iscsi, kernel, local, lsb, machine, machine_id, mounts, network, ohai, os_family, pkg_mgr, platform, processor, processor_cores, processor_count, python, python_version, real_user_id, selinux, service_mgr, ssh_host_key_dsa_public, ssh_host_key_ecdsa_public, ssh_host_key_ed25519_public, ssh_host_key_rsa_public, ssh_host_pub_keys, ssh_pub_keys, system, system_capabilities, system_capabilities_enforced, user, user_dir, user_gecos, user_gid, user_id, user_shell, user_uid, virtual, virtualization_role, virtualization_type

Performance comparison: gathering all facts

---
- name: test for timer
  hosts: tencent
  gather_facts: yes
  # Ansible default
  gather_subset: ["all"]
#  gather_subset: ["!all", "all_ipv4_addresses"]
  tasks:
    - name: Get IP
      debug:
          msg: "{{ ansible_all_ipv4_addresses }}"

Result

$ ap --forks=10 playbook-execute-demo2.yml

PLAY [test for timer] ***************

TASK [Gathering Facts] ***************
Saturday 11 March 2023  12:32:19 +0800 (0:00:00.079)    0:00:00.079 *************** 
ok: [tencent]

TASK [Get IP] ***************
Saturday 11 March 2023  12:32:23 +0800 (0:00:03.489)    0:00:03.568 *************** 
ok: [tencent] => {
    "msg": [
        "10.0.24.14",
        "172.17.0.1",
        "10.4.0.1"
    ]
}

PLAY RECAP ***************
tencent    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Saturday 11 March 2023  12:32:23 +0800 (0:00:00.038)    0:00:03.607 *************** 
=============================================================================== 
Gathering Facts --------------- 3.49s
Get IP --------------- 0.04s

Gathering only all_ipv4_addresses

gather_subset: ["!all", "all_ipv4_addresses"]

Result

$ ap --forks=10 playbook-execute-demo2.yml

PLAY [test for timer] ***************

TASK [Gathering Facts] ***************
Saturday 11 March 2023  12:33:37 +0800 (0:00:00.060)    0:00:00.060 *************** 
ok: [tencent]

TASK [Get IP] ***************
Saturday 11 March 2023  12:33:39 +0800 (0:00:02.066)    0:00:02.127 *************** 
ok: [tencent] => {
    "msg": [
        "10.0.24.14",
        "172.17.0.1",
        "10.4.0.1"
    ]
}

PLAY RECAP ***************
tencent    : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Saturday 11 March 2023  12:33:39 +0800 (0:00:00.044)    0:00:02.171 *************** 
=============================================================================== 
Gathering Facts --------------- 2.07s
Get IP --------------- 0.04s

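3.8 Optimization 7: Split playbooks

Putting every task into a single playbook or role is not recommended. A more sensible approach is to split them as needed, which benefits both maintainability and performance.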

$  tree -L 3
.
├── inventory
├── meta
│   └── main.yml
├── README.md
├── roles
│   ├── consul
│   │   ├── setup.yml
│   │   ├── tasks
│   │   ├── templates
│   │   └── vars
│   ├── initial
│   │   ├── files
│   │   ├── handlers
│   │   ├── setup.yml
│   │   ├── tasks
│   │   ├── tests
│   │   └── vars
│   ├── prometheus
│   │   ├── files
│   │   ├── handlers
│   │   ├── setup.yml
│   │   ├── tasks
│   │   ├── templates
│   │   ├── tests
│   │   └── vars
│   ├── python
│   │   ├── setup.yml
│   │   ├── tasks
│   │   ├── templates
│   │   └── vars
│   ├── terraform
│   │   ├── setup.yml
│   │   ├── tasks
│   │   └── vars
│   └── v2ray
│       ├── files
│       ├── setup.yml
│       ├── tasks
│       ├── templates
│       └── vars
└── setup.yml

31 directories, 10 files
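  • setup.yml: the top-level entry point for all roles
  • roles/<role_name>/setup.yml: the entry point for each individual role

Run them concurrently in the background with & to improve efficiency, for example: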

$ ap -i inventory roles/python/setup.yml &
$ ap -i inventory roles/initial/setup.yml &
$ ap -i inventory roles/terraform/setup.yml &

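Of course, if the roles depend on one another, you still need tasks that wait for or check on the prerequisite. This logic can also be implemented in a shell script, which offers more flexible control flow.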
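
One way to handle that waiting in a shell script is sketched below. It assumes that ap in the commands above is an alias for ansible-playbook, so the script calls ansible-playbook directly. The names run_parallel and PLAYBOOK_CMD, and the dependency of terraform on the other two roles, are hypothetical and only illustrate the idea.

```shell
#!/bin/sh
# Assumption: "ap" in the article is an alias for ansible-playbook.
# PLAYBOOK_CMD is a hypothetical override so the command can be swapped out.
PLAYBOOK_CMD="${PLAYBOOK_CMD:-ansible-playbook}"

# run_parallel INVENTORY PLAYBOOK...
# Start every playbook in the background, then wait for each one.
# Returns non-zero if any of the playbooks failed.
run_parallel() {
    inventory="$1"
    shift
    pids=""
    for playbook in "$@"; do
        $PLAYBOOK_CMD -i "$inventory" "$playbook" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=1
    done
    return "$failed"
}

# Example with a hypothetical dependency (terraform needs the other two first):
#   run_parallel inventory roles/python/setup.yml roles/initial/setup.yml \
#       && run_parallel inventory roles/terraform/setup.yml
```

Collecting each background PID and waiting on it individually lets the script notice a failed playbook before starting anything that depends on it.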

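3.9 Optimization 8: Use a third-party strategy plugin, Mitogen for Ansible

Besides the default execution strategies, you can also use third-party strategy plugins. One that is popular in the community is Mitogen for Ansible.

Mitogen is particularly well suited to playbooks made up of many short-lived operations. Its main optimizations are:

  • Single connection: the default strategy creates connections repeatedly, in proportion to the number of tasks
  • Single round trip: fewer network round trips
  • Resource reuse: avoids launching the Python interpreter and recompiling imports each time
  • Code caching: code is cached temporarily in memory, reducing network bandwidth usage
  • Write optimization: improves on the default temporary-file handling (which repeatedly rewrites a ZIP file)

To start using it, first download the plugin package: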

$ mkdir -p /etc/ansible/plugins
$ cd /etc/ansible/plugins
$ wget https://networkgenomics.com/try/mitogen-0.2.9.tar.gz
$ tar xf mitogen-0.2.9.tar.gz
$ rm -f mitogen-0.2.9.tar.gz

Edit ansible.cfg (these settings belong in the [defaults] section):

strategy_plugins   = /etc/ansible/plugins/mitogen-0.2.9/ansible_mitogen/plugins/strategy
strategy           = mitogen_linear

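If the playbook uses become to perform sudo operations, add the following rule to /etc/sudoers on the target nodes:

<SSH username> = (ALL) NOPASSWD:/usr/bin/python -c*

The mitogen plugin provides three strategies that correspond to Ansible's built-in ones: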

$ ls /etc/ansible/plugins/mitogen-0.2.9/ansible_mitogen/plugins/strategy | grep -v 'init'
mitogen_free.py
mitogen_host_pinned.py
mitogen_linear.py
mitogen.py

The test playbook is the same timer one used earlier

---
- name: test for timer
  hosts: tencent
  gather_facts: no
  tasks:
    - name: only one debug
      debug:
        var: inventory_hostname

    - name: shell
      shell:
        cp /etc/fstab /tmp/
      loop: "{{ range(0, 100)|list }}"

    - name: scp
      copy:
        src: /etc/hosts
        dest: /tmp/
      loop: "{{ range(0, 100)|list }}"

The timings measured earlier were as follows

Pipelining disabled

PLAY RECAP ***************
tencent    : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Saturday 11 March 2023  12:15:05 +0800 (0:02:55.225)    0:04:42.381 *************** 
=============================================================================== 
scp --------------- 175.23s
shell --------------- 107.05s
only one debug --------------- 0.04s

Pipelining enabled

Saturday 11 March 2023  12:09:40 +0800 (0:02:07.824)    0:03:13.579 *************** 
=============================================================================== 
scp --------------- 127.82s
shell --------------- 65.65s
only one debug --------------- 0.05s

Pipelining enabled + mitogen strategy plugin enabled

Saturday 11 March 2023  13:28:32 +0800 (0:00:10.652)    0:00:24.067 *************** 
=============================================================================== 
shell --------------- 13.06s
scp --------------- 10.65s
only one debug --------------- 0.17s

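The performance gain is dramatic: total run time drops from 0:03:13.579 with Pipelining alone to 0:00:24.067, roughly 8 times faster.

That said, when using the mitogen plugin, some settings conflict with native Ansible configuration and need extra work, for example:

  • Native Ansible uses forks to set the maximum number of nodes handled concurrently, but mitogen's thread pool supports at most 32 connections by default. To change that, set the MITOGEN_POOL_SIZE environment variable to the desired maximum concurrency
  • Performance on Python 3 is noticeably worse than on Python 2, apparently because of the core libraries; upstream has not yet investigated the exact cause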
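
As a rough sketch of the first point, the variable can be exported in the shell that launches the playbook. The value 100 is an assumption chosen to mirror the forks value in the final configuration in this article; whether a pool that large actually helps is something to measure in your own environment.

```shell
#!/bin/sh
# Raise Mitogen's pool size above its default of 32.
# 100 is an assumed value that mirrors forks = 100 in this article's config.
export MITOGEN_POOL_SIZE=100

# Then run the playbook from the same shell, for example:
#   ansible-playbook --forks=100 playbook-execute-demo2.yml
echo "MITOGEN_POOL_SIZE=$MITOGEN_POOL_SIZE"
```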

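Overall, Mitogen is worth trying when Ansible's performance is unsatisfactory, but be prepared to run into some pitfalls.

Here is the final optimized configuration on the Ansible control node, /etc/ansible/ansible.cfg: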

[defaults]
roles_path         = /etc/ansible/roles
callback_whitelist = timer, profile_tasks, profile_roles
library            = /usr/share/ansible-library
forks              = 100
host_key_checking  = False
jinja2_extensions  = jinja2.ext.do,jinja2.ext.i18n,jinja2.ext.loopcontrols
filter_plugins     = /usr/share/ansible/plugins/filter

# Mitogen strategy plugin
strategy_plugins   = /etc/ansible/plugins/mitogen-0.2.9/ansible_mitogen/plugins/strategy
strategy           = mitogen_linear

# Facts optimization
gathering               = smart
fact_caching            = jsonfile
fact_caching_connection = /etc/ansible/facts_cache
fact_caching_timeout    = 86400

[ssh_connection]
pipelining = True
ssh_args   = -C -o ControlMaster=auto -o ControlPersist=1d

[inventory]
[privilege_escalation]
[paramiko_connection]

Author: Da
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit the source, Da, when reposting!