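Ansible Optimization
1. Analyzing the Ansible execution flow
Before optimizing anything, you need a working understanding of how Ansible operates: that tells you where performance problems arise and how to address them. To verify that an optimization actually helps, you also need a way to measure how fast tasks run. This article therefore covers three topics:
- Analyzing the execution flow
- Optimizing the execution flow
- Measuring execution speed
Let's start with the first topic: analyzing the Ansible execution flow.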
Both ansible and ansible-playbook accept the -v option, which prints more detail while the command runs:
-v, --verbose verbose mode (-vvv for more, -vvvv to enable
connection debugging)
By analyzing this output we can see how Ansible works when it executes on a remote host:
$ ansible -vvv 'tencent' -m ping
The output is as follows:
# Environment and version information
ansible 2.9.25
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/usr/share/ansible-library']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.5 (default, Nov 16 2020, 22:23:17) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
# Configuration and targets used for the remote run
Using /etc/ansible/ansible.cfg as config file
host_list declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
script declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
auto declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
Parsed /etc/ansible/hosts inventory source with ini plugin
META: ran handlers
# 1. First connection: get the user's home directory, here /root
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'echo ~ && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '/root\n', '')
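# 2. Second connection: create a temporary directory under the home directory; its location is controlled by the remote_tmp setting in the config file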
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /root/.ansible/tmp `"&& mkdir "` echo /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618 `" && echo ansible-tmp-1678348016.02-28826-116551982443618="` echo /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618 `" ) && sleep 0'"'"''
<bj-tencent-lhins-1> (0, 'ansible-tmp-1678348016.02-28826-116551982443618=/root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618\n', '')
# 3. Third connection: probe the target node's platform and the available Python interpreters
<tencent> Attempting python interpreter discovery
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'echo PLATFORM; uname; echo FOUND; command -v '"'"'"'"'"'"'"'"'/usr/bin/python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.5'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/libexec/platform-python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/bin/python3'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python'"'"'"'"'"'"'"'"'; echo ENDFOUND && sleep 0'"'"''
<bj-tencent-lhins-1> (0, 'PLATFORM\nLinux\nFOUND\n/usr/bin/python\n/usr/bin/python2.7\n/usr/libexec/platform-python\n/usr/bin/python\nENDFOUND\n', '')
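# 4. Fourth connection: run the discovered interpreter to read the OS release information; the module code and its arguments are then written to a local temporary file and transferred to the temporary directory on the managed node via sftp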
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '{"osrelease_content": "NAME=\\"CentOS Linux\\"\\nVERSION=\\"7 (Core)\\"\\nID=\\"centos\\"\\nID_LIKE=\\"rhel fedora\\"\\nVERSION_ID=\\"7\\"\\nPRETTY_NAME=\\"CentOS Linux 7 (Core)\\"\\nANSI_COLOR=\\"0;31\\"\\nCPE_NAME=\\"cpe:/o:centos:centos:7\\"\\nHOME_URL=\\"https://www.centos.org/\\"\\nBUG_REPORT_URL=\\"https://bugs.centos.org/\\"\\n\\nCENTOS_MANTISBT_PROJECT=\\"CentOS-7\\"\\nCENTOS_MANTISBT_PROJECT_VERSION=\\"7\\"\\nREDHAT_SUPPORT_PRODUCT=\\"centos\\"\\nREDHAT_SUPPORT_PRODUCT_VERSION=\\"7\\"\\n\\n", "platform_dist_result": ["centos", "7.9.2009", "Core"]}\n', '')
Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/ping.py
<bj-tencent-lhins-1> PUT /root/.ansible/tmp/ansible-local-28818KyMMwB/tmpFO9j5N TO /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618/AnsiballZ_ping.py
# Upload the file over sftp
<bj-tencent-lhins-1> SSH: EXEC sftp -b - -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a '[bj-tencent-lhins-1]'
<bj-tencent-lhins-1> (0, 'sftp> put /root/.ansible/tmp/ansible-local-28818KyMMwB/tmpFO9j5N /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618/AnsiballZ_ping.py\n', '')
# 5. Fifth connection: make the task file on the target node executable
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'chmod u+x /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618/ /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618/AnsiballZ_ping.py && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '', '')
# 6. Sixth connection: run the task on the target node
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a -tt bj-tencent-lhins-1 '/bin/sh -c '"'"'/usr/bin/python /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618/AnsiballZ_ping.py && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '\r\n{"invocation": {"module_args": {"data": "pong"}}, "ping": "pong"}\r\n', 'Shared connection to bj-tencent-lhins-1 closed.\r\n')
# 7. Seventh connection: delete the temporary directory on the target node
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'rm -f -r /root/.ansible/tmp/ansible-tmp-1678348016.02-28826-116551982443618/ > /dev/null 2>&1 && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '', '')
tencent | SUCCESS => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": false,
"invocation": {
"module_args": {
"data": "pong"
}
},
"ping": "pong"
}
META: ran handlers
META: ran handlers
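To summarize, Ansible establishes 7 SSH connections to run a single task. Each connection does the following:
- (1). First connection: get the home directory of the target user on the remote host, here /root
- (2). Second connection: create a temporary directory under the remote home directory; its location is controlled by the remote_tmp setting in ansible.cfg
- (3). Third connection: probe the target node's platform and the available Python interpreters
- (4). Fourth connection: write the module code and its arguments to a local temporary file, then transfer that file to the temporary directory on the managed node via sftp
- (5). Fifth connection: make the task file on the target node executable
- (6). Sixth connection: run the task on the target node
- (7). Seventh connection: delete the temporary directory on the target node; the task result is returned to the Ansible controller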
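That was a single node. Normally Ansible targets a group of nodes, in which case the whole run looks roughly like this (default configuration, callbacks ignored):
- (1). Enter the first play and pick N nodes, where N is the forks setting
- (2). Each node runs the first task, establishing 7 SSH connections per node
- (3). Each node runs the second task, again establishing 7 SSH connections per node
- (4). The remaining tasks in the play run the same way
- (5). Once all nodes have finished every task in the play, move on to the next play
- (6). Repeat until every task in every play has run
This is only the default behavior; several settings change Ansible's execution strategy.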
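To get a feel for how quickly these connections add up, here is a minimal Python sketch that simply multiplies out the steps listed above. The figure of 7 connections per task comes from the trace in this section; the host and task counts are made-up illustration values, and the function name is hypothetical.

```python
# Rough count of SSH connections under the default (non-pipelined) flow.
# 7 per task is the number observed in the -vvv trace; everything else
# here is an illustrative assumption.
CONNECTIONS_PER_TASK = 7


def total_connections(hosts, tasks, per_task=CONNECTIONS_PER_TASK):
    """Every host opens `per_task` SSH connections for every task it runs."""
    return hosts * tasks * per_task


if __name__ == "__main__":
    # 10 hosts running a 20-task play: 10 * 20 * 7 connections
    print(total_connections(hosts=10, tasks=20))
```

The count grows linearly in both hosts and tasks, which is why reducing the per-task figure pays off on larger inventories.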
2. Ansible execution strategies
To make Ansible run the way we expect, we need a closer look at its execution strategy, which involves four keywords: forks, serial, strategy and throttle.
2.1 forks
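As noted above, forks sets the maximum number of nodes that run a task (playbook) at the same time; the default is 5. This setup has only 3 nodes in total, so we lower the value to 2 to make the effect visible: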
$ vim /etc/ansible/ansible.cfg
forks = 2
Playbook definition
---
- hosts: ecs
  gather_facts: no
  tasks:
    - name: debug demo
      shell: "sleep 20; echo ansible"
Run the playbook
$ ansible-playbook playbook-execute-demo1.yml
Check the processes
$ ps -ef | grep playbook
root 23405 22856 92 13:51 pts/7 00:00:00 /usr/bin/python2 /usr/bin/ansible-playbook playbook-execute-demo1.yml
root 23415 23405 0 13:51 pts/7 00:00:00 /usr/bin/python2 /usr/bin/ansible-playbook playbook-execute-demo1.yml
root 23416 23405 0 13:51 pts/7 00:00:00 /usr/bin/python2 /usr/bin/ansible-playbook playbook-execute-demo1.yml
Three processes are listed, but only the two child processes (23415 and 23416) actually execute tasks, as the SSH processes confirm:
$ ps -ef | grep "ssh -C"
root 23469 23415 0 13:51 pts/7 00:00:00 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/16b0cd324c -tt sz-aliyun-ecs-1 /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1640411472.7-23415-13386693082019/AnsiballZ_command.py && sleep 0'
root 23488 23416 0 13:51 pts/7 00:00:00 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/54dc19ddaa -tt bj-huawei-hecs-1 /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1640411473.17-23416-227469691024889/AnsiballZ_command.py && sleep 0'
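As shown, only 2 SSH processes run at the same time. The forks behavior can be summed up in one sentence: "Ansible creates as many worker processes as the forks value N allows to run tasks remotely; whenever a node finishes, a new child process is created to run the task on another node." This is what it looks like once the first two nodes are done: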
$ ps -ef | grep playbook
root 23405 22856 15 13:51 pts/7 00:00:06 /usr/bin/python2 /usr/bin/ansible-playbook playbook-execute-demo1.yml
root 23503 23405 0 13:51 pts/7 00:00:00 /usr/bin/python2 /usr/bin/ansible-playbook playbook-execute-demo1.yml
$ ps -ef | grep "ssh -C"
root 23516 23503 0 13:51 pts/7 00:00:00 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a -tt bj-tencent-lhins-1 /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1640411494.08-23503-146381019301363/AnsiballZ_command.py && sleep 0'
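The output above always shows one extra process: the Ansible control process. It monitors the state of task execution on each node and decides whether to create a child process to run a task remotely. For example, with the default forks of 5 and 10 nodes in total, as soon as any of the first 5 nodes finishes, the control process immediately creates a new process so that the 6th node can run the task.
Besides setting forks in the configuration file, you can pass the maximum concurrency on the command line with ansible-playbook -f <N>, for example: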
$ ansible-playbook -f 3 playbook-execute-demo1.yml
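The worker-pool behavior described in this section can be modelled in a few lines of Python. This is only a sketch of the scheduling idea for a single task, not Ansible's implementation: it ignores SSH and process start-up overhead, and the function name and durations are hypothetical.

```python
import heapq


def wall_time(durations, forks):
    """Simulate `forks` workers; each host starts as soon as a worker is free.

    `durations` holds the seconds each host needs for the task.
    Returns the time at which the last host finishes.
    """
    if forks < 1:
        raise ValueError("forks must be >= 1")
    busy_until = []  # min-heap of times at which busy workers become free
    finish = 0.0
    for seconds in durations:
        # Use an idle worker if one exists, otherwise wait for the earliest one.
        start = 0.0 if len(busy_until) < forks else heapq.heappop(busy_until)
        end = start + seconds
        heapq.heappush(busy_until, end)
        finish = max(finish, end)
    return finish


if __name__ == "__main__":
    # Three hosts that each sleep 20 seconds, as in the playbook above.
    print(wall_time([20, 20, 20], forks=2))
    print(wall_time([20, 20, 20], forks=3))
```

With forks=2 the third host has to wait for a free worker, so the modelled run takes two rounds; with forks=3 all hosts run in one round.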
2.2 serial
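serial is a play-level directive that runs a play on N nodes at a time: the next batch is scheduled only after the current batch has finished. If serial is not set, all nodes form a single batch.
Two points are worth spelling out:
- serial specifies how many nodes make up one batch
- forks specifies the maximum number of nodes executing tasks at the same time
In short, forks limits the number of processes Ansible uses to run tasks and is fairly coarse-grained, whereas serial changes how nodes are scheduled through a play and is more flexible and powerful: besides an integer, it also accepts a percentage or a list of batch sizes.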
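- Single number N: run all tasks in the play in batches of N nodes, e.g. serial: 3
- Percentage N%: run all tasks in the play in batches of N% of the nodes, e.g. serial: 50%
- List: use each list element in turn as the batch size, e.g. [1, 3, 50%], meaning the first batch is 1 node, the second batch is 3 of the remaining nodes, and the third batch takes 50% of the nodes from those that remain; once the list is exhausted, every later batch uses the last element as its size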
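The three forms of serial can be illustrated with a small Python sketch that splits a host list into batches. It assumes a percentage is resolved against the total number of hosts in the play and then capped by the hosts still left, which is my reading of Ansible's behavior and worth verifying against your version. Both function names are hypothetical.

```python
def pct_to_int(value, total, min_value=1):
    """Turn "50%" into a host count relative to `total`; integers pass through."""
    if isinstance(value, str) and value.endswith("%"):
        return int(int(value[:-1]) / 100.0 * total) or min_value
    return int(value)


def serial_batches(hosts, serial):
    """Split `hosts` into batches for a serial spec (int, "N%", or a list of both)."""
    spec = serial if isinstance(serial, list) else [serial]
    remaining = list(hosts)
    total = len(remaining)
    batches = []
    index = 0
    while remaining:
        size = pct_to_int(spec[index], total)
        if size <= 0:
            # Treat a non-positive size as "everything left in one batch".
            batches.append(remaining)
            break
        batches.append(remaining[:size])
        remaining = remaining[size:]
        # After the list is exhausted, keep reusing its last element.
        index = min(index + 1, len(spec) - 1)
    return batches


if __name__ == "__main__":
    hosts = ["h%d" % i for i in range(10)]
    for batch in serial_batches(hosts, [1, 3, "50%"]):
        print(batch)
```

For ten hosts and the list [1, 3, "50%"], this model yields batches of 1, 3, 5 and finally the 1 host that is left.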
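Using the same playbook with one extra line (a serial setting on the play; judging by the output below, serial: 1), pstree shows that only one node is executing the task at any moment: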
$ pstree -apsl `ps -ef|grep ansible-playbook|head -n1|awk '{print $2}'`
systemd,1 --switched-root --system --deserialize 22
└─sshd,1068 -D
└─sshd,19433
└─zsh,19436
└─ansible-playboo,20225 /usr/bin/ansible-playbook playbook-execute-demo1.yml
├─ansible-playboo,20235 /usr/bin/ansible-playbook playbook-execute-demo1.yml
│ └─ssh,20268 -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/16b0cd324c -tt sz-aliyun-ecs-1 /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1678411494.46-20235-211742113762814/AnsiballZ_command.py && sleep 0'
└─{ansible-playboo},20234
Output (ap here appears to be a shell alias for ansible-playbook):
$ ap playbook-execute-demo1.yml
PLAY [ecs] ***************
TASK [debug demo] ***************
changed: [ecs-1.aliyun.sz]
PLAY [ecs] ***************
TASK [debug demo] ***************
changed: [huawei]
PLAY [ecs] ***************
TASK [debug demo] ***************
changed: [tencent]
PLAY RECAP ***************
ecs-1.aliyun.sz : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
huawei : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
tencent : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
The output also shows that the batches run one node at a time.
2.3 strategy
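The strategy directive controls how nodes proceed through tasks; its focus is on the node rather than on the task.
- linear: the default. A node that finishes a task waits until all other nodes have finished it too, and then all nodes move on to the next task together
- free: a node that finishes a task does not wait for the others but carries on with the remaining tasks in the play; only when it has finished the play does it release its slot so that a node that has not started yet can begin
The free strategy improves efficiency in most scenarios, because nodes that finish early make room for others, unless you need a strict order of play execution across the nodes of a batch. It is configured like this: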
---
- hosts: ecs
  strategy: free
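The difference between the two strategies shows up when hosts are slow on different tasks. The following Python sketch compares them under a simplifying assumption: every host has its own worker (forks is at least the number of hosts) and there is no connection overhead. The function names and durations are hypothetical.

```python
def linear_time(durations):
    """durations[h][t] = seconds host h needs for task t.

    Under linear, all hosts synchronize after every task, so each task
    costs as much as its slowest host.
    """
    task_count = len(durations[0])
    return sum(max(host[t] for host in durations) for t in range(task_count))


def free_time(durations):
    """Under free, each host runs through its own task list without waiting."""
    return max(sum(host) for host in durations)


if __name__ == "__main__":
    # Host A is slow on task 2, host B is slow on task 1.
    durations = [[1, 9], [9, 1]]
    print("linear:", linear_time(durations))
    print("free:  ", free_time(durations))
```

In this model free can never be slower than linear, and the two are equal when the same host is the slowest on every task.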
2.4 throttle
3. Optimizing Ansible execution
3.1 Measuring speed: the profile_tasks plugin
Enable the profile_tasks plugin on the callback_whitelist line of ansible.cfg:
[defaults]
callback_whitelist = profile_tasks
# To keep the output easy to read, only profile_tasks is enabled for now
# callback_whitelist = timer, profile_tasks, profile_roles
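- timer: shows the total duration of the playbook run
- profile_tasks: adds a start time to each task and, when the playbook finishes, lists the time spent on each task in descending order
- profile_roles: lists the time spent in each role at the end of the run, in descending order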
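When comparing runs it helps to pull the per-task timings out of the captured output. The sketch below assumes summary lines of the form "task name --------------- 23.09s" (a name, a run of dashes, then seconds), which is how the summaries appear in this article; the helper name is hypothetical and the pattern may need adjusting for other Ansible versions.

```python
import re

# name, whitespace, a run of dashes, whitespace, seconds followed by "s"
SUMMARY_LINE = re.compile(r"^(?P<name>.+?)\s+-{3,}\s+(?P<secs>\d+(?:\.\d+)?)s$")


def parse_profile_summary(text):
    """Return {task name: seconds} for every summary line found in `text`.

    Tasks sharing a name overwrite each other; keep that in mind for loops
    over roles that reuse task names.
    """
    timings = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            timings[match.group("name")] = float(match.group("secs"))
    return timings


if __name__ == "__main__":
    sample = "\n".join([
        "===============================",
        "debug demo --------------- 14.64s",
        "other tasks --------------- 0.03s",
    ])
    print(parse_profile_summary(sample))
```

Lines that do not match the pattern, such as the separator or the PLAY RECAP rows, are skipped.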
3.2 Optimization 1: increase forks
Number of target nodes: 6, forks=5
---
- hosts: ecs
  # strategy: free
  gather_facts: no
  tasks:
    - name: debug demo
      shell: "sleep 10; echo ansible"
Timing:
$ ap --forks=5 playbook-execute-demo1.yml
PLAY [ecs] ***************
TASK [debug demo] ***************
Friday 10 March 2023 09:54:35 +0800 (0:00:00.054) 0:00:00.054 ***************
changed: [ecs-1.aliyun.sz]
changed: [47.115.121.119]
changed: [114.115.159.174]
changed: [huawei]
changed: [tencent]
changed: [192.144.227.61]
PLAY RECAP ***************
114.115.159.174 : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
192.144.227.61 : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
47.115.121.119 : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
ecs-1.aliyun.sz : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
huawei : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
tencent : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Friday 10 March 2023 09:54:58 +0800 (0:00:23.088) 0:00:23.143 ***************
===============================================================================
debug demo --------------- 23.09s
Now with forks=10. All 6 nodes fit into a single round, so the task takes about 14 seconds instead of 23:
$ ap --forks=10 playbook-execute-demo1.yml
PLAY [ecs] ***************
TASK [debug demo] ***************
Friday 10 March 2023 09:55:04 +0800 (0:00:00.054) 0:00:00.054 ***************
changed: [ecs-1.aliyun.sz]
changed: [47.115.121.119]
changed: [114.115.159.174]
changed: [192.144.227.61]
changed: [huawei]
changed: [tencent]
PLAY RECAP ***************
114.115.159.174 : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
192.144.227.61 : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
47.115.121.119 : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
ecs-1.aliyun.sz : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
huawei : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
tencent : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Friday 10 March 2023 09:55:17 +0800 (0:00:13.755) 0:00:13.809 ***************
===============================================================================
debug demo --------------- 13.76s
3.3 Optimization 2: change the execution strategy
Ansible ships with 4 strategy plugins by default:
$ ls -a /usr/lib/python2.7/site-packages/ansible/plugins/strategy/*.py|grep -v 'init'
/usr/lib/python2.7/site-packages/ansible/plugins/strategy/debug.py
/usr/lib/python2.7/site-packages/ansible/plugins/strategy/free.py
/usr/lib/python2.7/site-packages/ansible/plugins/strategy/host_pinned.py
/usr/lib/python2.7/site-packages/ansible/plugins/strategy/linear.py
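Of these, linear and free are the two that are most commonly used and tuned.
The speedup from free shows up mainly when there are too many target nodes to fit into a single batch.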
---
- hosts: ecs
  strategy: free
  gather_facts: no
  tasks:
    - name: debug demo
      shell: "sleep 10; echo ansible"
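3.4 Optimization 3: run tasks asynchronously
By default Ansible executes tasks synchronously: tasks run one after another in order, and the next one starts only when the previous one has finished. Sometimes, though, there is no need to wait for a task to complete.
For example: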
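---
- hosts: tencent
  strategy: free
  gather_facts: no
  tasks:
    - name: "debug demo"
      shell: "sleep 10; echo ansible"
      # Run the task in the background ("asynchronously"); if it has not finished within 20 seconds, it is considered failed
      async: 20
      # How often (in seconds) to check whether the async task has succeeded or failed
      poll: 5
    - name: "other tasks"
      debug:
        msg: "done."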
Result
$ ap --forks=10 playbook-execute-demo1.yml
PLAY [tencent] ***************
Saturday 11 March 2023 09:01:09 +0800 (0:00:00.093) 0:00:00.093 ***************
TASK [debug demo] ***************
changed: [tencent]
Saturday 11 March 2023 09:01:23 +0800 (0:00:14.640) 0:00:14.734 ***************
TASK [other tasks] ***************
ok: [tencent] => {
"msg": "done."
}
PLAY RECAP ***************
tencent : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Saturday 11 March 2023 09:01:23 +0800 (0:00:00.025) 0:00:14.759 ***************
===============================================================================
debug demo --------------- 14.64s
other tasks --------------- 0.03s
Now comment out the async and poll directives and run again:
$ ap --forks=10 playbook-execute-demo1.yml
PLAY [tencent] ***************
Saturday 11 March 2023 09:02:48 +0800 (0:00:00.063) 0:00:00.063 ***************
TASK [debug demo] ***************
changed: [tencent]
Saturday 11 March 2023 09:03:00 +0800 (0:00:12.243) 0:00:12.307 ***************
TASK [other tasks] ***************
ok: [tencent] => {
"msg": "done."
}
PLAY RECAP ***************
tencent : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Saturday 11 March 2023 09:03:00 +0800 (0:00:00.028) 0:00:12.335 ***************
===============================================================================
debug demo --------------- 12.24s
other tasks --------------- 0.03s
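Adding async and poll actually made the run slower. With a non-zero poll the task is not truly asynchronous, because Ansible checks its status periodically. If the async task finishes right after a status check, the completion goes unnoticed until the next check, and only then do the subsequent tasks continue.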
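The waiting cost of polling can be estimated with a toy model in Python. It assumes the first check happens one poll interval after the task starts and ignores SSH and module start-up overhead, so it explains the direction of the effect rather than the exact timings measured above. The function name is hypothetical.

```python
import math


def detected_at(duration, poll):
    """Seconds until the controller notices completion when it checks every `poll` seconds."""
    if poll <= 0:
        raise ValueError("poll must be positive; poll=0 means fire and forget")
    # Completion is only seen at the first check that falls on or after `duration`.
    return math.ceil(duration / poll) * poll


if __name__ == "__main__":
    for duration in (10, 11, 12):
        seen = detected_at(duration, poll=5)
        print(f"task takes {duration}s, noticed after {seen}s, wasted {seen - duration}s")
```

In this model the wasted time is always less than one poll interval, which is why a smaller poll value reduces the penalty but never removes it.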
Change poll to 0:
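---
- hosts: tencent
  strategy: free
  gather_facts: no
  tasks:
    - name: "debug demo"
      shell: "sleep 10; echo ansible"
      # Run the task in the background ("asynchronously"); if it has not finished within 20 seconds, it is considered failed
      async: 20
      # Polling interval for the async task's status; 0 means do not wait for it at all
      poll: 0
    - name: "other tasks"
      debug:
        msg: "done."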
Result
$ ap --forks=10 playbook-execute-demo1.yml
PLAY [tencent] ***************
Saturday 11 March 2023 09:10:52 +0800 (0:00:00.059) 0:00:00.059 ***************
TASK [debug demo] ***************
changed: [tencent]
Saturday 11 March 2023 09:10:55 +0800 (0:00:03.081) 0:00:03.141 ***************
TASK [other tasks] ***************
ok: [tencent] => {
"msg": "done."
}
PLAY RECAP ***************
tencent : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Saturday 11 March 2023 09:10:55 +0800 (0:00:00.033) 0:00:03.174 ***************
===============================================================================
debug demo --------------- 3.08s
other tasks --------------- 0.03s
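Sometimes, however, we do need to check the status or result of the task. The async_status module does this: it takes the background task's job id as a parameter and returns the task's status, which includes:
- ansible_job_id: the job id of the asynchronous task
- finished: whether the asynchronous task has completed; 1 means finished, 0 means not finished
- started: whether the asynchronous task has started; 1 means started, 0 means not started
Example: start an asynchronous task, run synchronous tasks in the meantime, then check the asynchronous task's status at the end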
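---
- hosts: tencent
  strategy: free
  gather_facts: no
  tasks:
    - name: "debug demo"
      shell: "sleep 10; echo ansible"
      # Run the task in the background ("asynchronously"); if it has not finished within 20 seconds, it is considered failed
      async: 20
      # Polling interval for the async task's status; 0 means do not wait for it at all
      poll: 0
      register: async_job
    - name: "other tasks1"
      debug:
        msg: "other tasks1"
    - name: "other tasks2"
      debug:
        msg: "other tasks2"
    # Task name below: "wait for the async task to finish"
    - name: "等待异步任务完成"
      async_status:
        jid: "{{ async_job.ansible_job_id }}"
      register: async_job_result
      # until keeps retrying until async_job_result.finished is true (the async task has completed)
      until: async_job_result.finished
      # Number of retries
      retries: 30
      # Delay between retries, in seconds
      delay: 5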
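The until / retries / delay combination is a bounded retry loop. Here is a minimal Python sketch of that loop, assuming one initial attempt plus up to `retries` further attempts, which matches the "30 retries left" message Ansible prints after the first failed check. The function and its parameters are hypothetical; `sleep` is injected so the loop can be exercised without real waiting.

```python
def wait_until(check, retries, delay, sleep):
    """Call `check()` until it returns True, sleeping `delay` seconds between attempts.

    Returns the number of attempts used, or raises TimeoutError once the
    initial attempt and all `retries` retries have failed.
    """
    attempts = retries + 1  # assumed: one initial try plus `retries` retries
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"condition not met after {attempts} attempts")


if __name__ == "__main__":
    # Simulated async job that reports finished on the third status check.
    answers = iter([False, False, True])
    waited = []
    used = wait_until(lambda: next(answers), retries=30, delay=5, sleep=waited.append)
    print(used, "attempts,", sum(waited), "seconds spent waiting")
```

Under these assumptions, retries: 30 with delay: 5 bounds the waiting time at 30 × 5 = 150 seconds, so the retry budget should comfortably exceed the async task's expected duration.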
Result
$ ap --forks=10 playbook-execute-demo1.yml
PLAY [tencent] ***************
Saturday 11 March 2023 09:25:24 +0800 (0:00:00.062) 0:00:00.062 ***************
TASK [debug demo] ***************
changed: [tencent]
Saturday 11 March 2023 09:25:27 +0800 (0:00:02.832) 0:00:02.894 ***************
TASK [other tasks1] ***************
ok: [tencent] => {
"msg": "other tasks1"
}
Saturday 11 March 2023 09:25:27 +0800 (0:00:00.016) 0:00:02.911 ***************
TASK [other tasks2] ***************
ok: [tencent] => {
"msg": "other tasks2"
}
Saturday 11 March 2023 09:25:27 +0800 (0:00:00.017) 0:00:02.928 ***************
FAILED - RETRYING: 等待异步任务完成 (30 retries left).
FAILED - RETRYING: 等待异步任务完成 (29 retries left).
TASK [等待异步任务完成] ***************
changed: [tencent]
Saturday 11 March 2023 09:25:40 +0800 (0:00:13.705) 0:00:16.634 ***************
TASK [获取异步任务输出] ***************
ok: [tencent] => {
"msg": "ansible"
}
PLAY RECAP ***************
tencent : ok=5 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Saturday 11 March 2023 09:25:40 +0800 (0:00:00.053) 0:00:16.687 ***************
===============================================================================
等待异步任务完成 --------------- 13.71s
debug demo --------------- 2.83s
获取异步任务输出 --------------- 0.05s
other tasks2 --------------- 0.02s
other tasks1 --------------- 0.02s
In some scenarios you wait on several asynchronous tasks at once:
$ ap --forks=10 playbook-execute-demo1.yml
PLAY [tencent] ***************
Saturday 11 March 2023 09:37:56 +0800 (0:00:00.061) 0:00:00.061 ***************
TASK [async job1] ***************
changed: [tencent]
Saturday 11 March 2023 09:37:59 +0800 (0:00:02.966) 0:00:03.027 ***************
TASK [async job2] ***************
changed: [tencent]
Saturday 11 March 2023 09:38:01 +0800 (0:00:01.795) 0:00:04.823 ***************
TASK [other tasks1] ***************
ok: [tencent] => {
"msg": "other tasks1"
}
Saturday 11 March 2023 09:38:01 +0800 (0:00:00.018) 0:00:04.841 ***************
TASK [other tasks2] ***************
ok: [tencent] => {
"msg": "other tasks2"
}
Saturday 11 March 2023 09:38:01 +0800 (0:00:00.018) 0:00:04.859 ***************
FAILED - RETRYING: 等待异步任务完成 (30 retries left).
TASK [等待异步任务完成] ***************
changed: [tencent] => (item=408631890851.16423)
changed: [tencent] => (item=987065119603.16471)
Saturday 11 March 2023 09:38:10 +0800 (0:00:08.983) 0:00:13.842 ***************
TASK [获取异步任务输出] ***************
ok: [tencent] => (item={u'stderr_lines': [], u'changed': True, u'ansible_job_id': u'408631890851.16423', u'stdout': u'Async Job1', u'finished': 1, u'delta': u'0:00:05.009015', u'stdout_lines': [u'Async Job1'], u'ansible_loop_var': u'async_job_id', u'end': u'2023-03-11 09:38:04.721359', u'start': u'2023-03-11 09:37:59.712344', u'cmd': u'sleep 5; echo Async Job1', u'attempts': 2, u'failed': False, u'stderr': u'', u'rc': 0, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'strip_empty_ends': True, u'_raw_params': u'sleep 5; echo Async Job1', u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin_add_newline': True, u'stdin': None}}, u'async_job_id': u'408631890851.16423'}) => {
"msg": "任务输出:Async Job1"
}
ok: [tencent] => (item={u'stderr_lines': [], u'changed': True, u'ansible_job_id': u'987065119603.16471', u'stdout': u'Async Job1', u'finished': 1, u'delta': u'0:00:05.009973', u'stdout_lines': [u'Async Job1'], u'ansible_loop_var': u'async_job_id', u'end': u'2023-03-11 09:38:06.515008', u'start': u'2023-03-11 09:38:01.505035', u'cmd': u'sleep 5; echo Async Job1', u'attempts': 1, u'failed': False, u'stderr': u'', u'rc': 0, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'strip_empty_ends': True, u'_raw_params': u'sleep 5; echo Async Job1', u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin_add_newline': True, u'stdin': None}}, u'async_job_id': u'987065119603.16471'}) => {
"msg": "任务输出:Async Job1"
}
PLAY RECAP ***************
tencent : ok=6 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Saturday 11 March 2023 09:38:10 +0800 (0:00:00.068) 0:00:13.911 ***************
===============================================================================
等待异步任务完成 --------------- 8.98s
async job1 --------------- 2.97s
async job2 --------------- 1.80s
获取异步任务输出 --------------- 0.07s
other tasks2 --------------- 0.02s
other tasks1 --------------- 0.02s
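To sum up, asynchronous tasks in Ansible are a good fit when:
- a task runs for a long time and might hit the SSH connection timeout
- no other task depends on whether this task has finished
- you need to get back to the current shell quickly to run other commands
Asynchronous tasks are a poor fit when:
- the task must finish before other tasks can continue
- the task acquires an exclusive lock
- the playbook consists entirely of short tasks: async brings no noticeable speedup and makes the run harder to read, and with a non-zero poll it can even slow things down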
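3.5 Optimization 4: configure persistent SSH connections
Ansible relies heavily on SSH, and the -vvv output above showed that running a single task takes at least seven connections, so tuning SSH also improves Ansible's efficiency.
Enabling persistent SSH connections allows connection reuse. Roughly, an established SSH connection is kept open until it expires, and the next SSH connection to the same target node reuses it directly.
Edit /etc/ansible/ansible.cfg: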
ssh_args = -C -o ControlMaster=auto -o ControlPersist=1d
First, the timing before persistent connections are enabled:
$ time ansible 'tencent' -m ping
tencent | SUCCESS => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": false,
"ping": "pong"
}
ansible 'tencent' -m ping 1.31s user 0.25s system 54% cpu 2.883 total
The connection is cached as a socket file; its directory is set by the control_path_dir setting in ansible.cfg:
$ ls -l ~/.ansible/cp/
total 0
srw--------------- 1 root root 0 Mar 11 10:00 68cade842a
Timing with the cached connection:
$ time ansible 'tencent' -m ping
tencent | SUCCESS => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": false,
"ping": "pong"
}
ansible 'tencent' -m ping 1.28s user 0.24s system 62% cpu 2.453 total
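The speedup comes with some drawbacks, for example:
- Even if the authentication credentials on the managed node change, the controller can keep connecting as long as the cached connection has not expired
- Even if the connection variables for the managed node change, they do not take effect until the cached connection expires
- An overly long cache time leaves many ESTABLISHED connections on the system and ties up socket resources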
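3.6 Optimization 5: configure pipelining
With pipelining enabled, Ansible sends the module to the remote Python interpreter over the SSH session itself instead of first uploading it as a file, so the actions for a task complete within a single SSH session. The -vvv output below shows this.
Enable pipelining in ansible.cfg: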
# Enabling pipelining reduces the number of SSH operations required to
pipelining = True
With pipelining enabled, the number of SSH connections drops from the original 7 to 3:
$ ansible tencent -m ping -vvv
# Environment and version information
ansible 2.9.25
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/usr/share/ansible-library']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.5 (default, Nov 16 2020, 22:23:17) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
# Configuration and targets used for the remote run
Using /etc/ansible/ansible.cfg as config file
host_list declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
script declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
auto declined parsing /etc/ansible/hosts as it did not pass its verify_file() method
Parsed /etc/ansible/hosts inventory source with ini plugin
Skipping callback 'actionable', as we already have a stdout callback.
Skipping callback 'counter_enabled', as we already have a stdout callback.
Skipping callback 'debug', as we already have a stdout callback.
Skipping callback 'dense', as we already have a stdout callback.
Skipping callback 'dense', as we already have a stdout callback.
Skipping callback 'full_skip', as we already have a stdout callback.
Skipping callback 'json', as we already have a stdout callback.
Skipping callback 'minimal', as we already have a stdout callback.
Skipping callback 'null', as we already have a stdout callback.
Skipping callback 'oneline', as we already have a stdout callback.
Skipping callback 'selective', as we already have a stdout callback.
Skipping callback 'skippy', as we already have a stdout callback.
Skipping callback 'stderr', as we already have a stdout callback.
Skipping callback 'unixy', as we already have a stdout callback.
Skipping callback 'yaml', as we already have a stdout callback.
META: ran handlers
<tencent> Attempting python interpreter discovery
# 1. First SSH connection: find the Python interpreters available on the target node
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=1d -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'echo PLATFORM; uname; echo FOUND; command -v '"'"'"'"'"'"'"'"'/usr/bin/python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.5'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/libexec/platform-python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/bin/python3'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python'"'"'"'"'"'"'"'"'; echo ENDFOUND && sleep 0'"'"''
<bj-tencent-lhins-1> (0, 'PLATFORM\nLinux\nFOUND\n/usr/bin/python\n/usr/bin/python2.7\n/usr/libexec/platform-python\n/usr/bin/python\nENDFOUND\n', '')
# 2. Second SSH connection: read the target node's operating system information
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=1d -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '{"osrelease_content": "NAME=\\"CentOS Linux\\"\\nVERSION=\\"7 (Core)\\"\\nID=\\"centos\\"\\nID_LIKE=\\"rhel fedora\\"\\nVERSION_ID=\\"7\\"\\nPRETTY_NAME=\\"CentOS Linux 7 (Core)\\"\\nANSI_COLOR=\\"0;31\\"\\nCPE_NAME=\\"cpe:/o:centos:centos:7\\"\\nHOME_URL=\\"https://www.centos.org/\\"\\nBUG_REPORT_URL=\\"https://bugs.centos.org/\\"\\n\\nCENTOS_MANTISBT_PROJECT=\\"CentOS-7\\"\\nCENTOS_MANTISBT_PROJECT_VERSION=\\"7\\"\\nREDHAT_SUPPORT_PRODUCT=\\"centos\\"\\nREDHAT_SUPPORT_PRODUCT_VERSION=\\"7\\"\\n\\n", "platform_dist_result": ["centos", "7.9.2009", "Core"]}\n', '')
# Prepare to run the task: load the module file it uses and check whether pipelining is enabled
Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/ping.py
Pipelining is enabled.
# 3. Third SSH connection: run the task
<bj-tencent-lhins-1> ESTABLISH SSH CONNECTION FOR USER: None
<bj-tencent-lhins-1> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=1d -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/68cade842a bj-tencent-lhins-1 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<bj-tencent-lhins-1> (0, '\n{"invocation": {"module_args": {"data": "pong"}}, "ping": "pong"}\n', '')
tencent | SUCCESS => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": false,
"invocation": {
"module_args": {
"data": "pong"
}
},
"ping": "pong"
}
META: ran handlers
META: ran handlers
Compare performance with pipelining on and off:
---
- name: test for timer
  hosts: timer
  gather_facts: no
  tasks:
    - name: only one debug
      debug:
        var: inventory_hostname
    - name: shell
      shell:
        cp /etc/fstab /tmp/
      loop: "{{ range(0, 100)|list }}"
    - name: scp
      copy:
        src: /etc/hosts
        dest: /tmp/
      loop: "{{ range(0, 100)|list }}"
Run the playbook
$ ap --forks=10 playbook-execute-demo2.yml
Pipelining disabled
PLAY RECAP ***************
tencent : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Saturday 11 March 2023 12:15:05 +0800 (0:02:55.225) 0:04:42.381 ***************
===============================================================================
scp --------------- 175.23s
shell --------------- 107.05s
only one debug --------------- 0.04s
With pipelining enabled:
Saturday 11 March 2023 12:09:40 +0800 (0:02:07.824) 0:03:13.579 ***************
===============================================================================
scp --------------- 127.82s
shell --------------- 65.65s
only one debug --------------- 0.05s
Run time drops by roughly 30% (about 27% for the copy task and 39% for the shell task), which is a solid gain.
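The percentages can be checked directly from the profile_tasks timings above. This is a minimal sketch; `pct_saved` is a hypothetical helper name, not part of Ansible.

```shell
#!/bin/sh
# pct_saved BEFORE AFTER
# Prints the percentage of time saved when a task goes from BEFORE seconds
# to AFTER seconds, rounded to one decimal place.
pct_saved() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

pct_saved 175.23 127.82   # scp task   -> 27.1
pct_saved 107.05 65.65    # shell task -> 38.7
```

The shell task benefits more than the copy task, presumably because a copy still has to transfer the file itself in addition to the module code.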
3.7 Optimization 6: Tune fact gathering
By default, Ansible gathers every fact from every node, which is slow. If the playbook does not use any facts, gathering can be turned off:
gather_facts: no
Alternatively, gather only specific subsets (see the official documentation). Possible values:
all, all_ipv4_addresses, all_ipv6_addresses, apparmor, architecture, caps, chroot, cmdline, date_time, default_ipv4, default_ipv6, devices, distribution, distribution_major_version, distribution_release, distribution_version, dns, effective_group_ids, effective_user_id, env, facter, fips, hardware, interfaces, is_chroot, iscsi, kernel, local, lsb, machine, machine_id, mounts, network, ohai, os_family, pkg_mgr, platform, processor, processor_cores, processor_count, python, python_version, real_user_id, selinux, service_mgr, ssh_host_key_dsa_public, ssh_host_key_ecdsa_public, ssh_host_key_ed25519_public, ssh_host_key_rsa_public, ssh_host_pub_keys, ssh_pub_keys, system, system_capabilities, system_capabilities_enforced, user, user_dir, user_gecos, user_gid, user_id, user_shell, user_uid, virtual, virtualization_role, virtualization_type
Performance comparison. First, gathering all facts:
---
- name: test for timer
  hosts: tencent
  gather_facts: yes
  # Ansible default
  gather_subset: ["all"]
  # gather_subset: ["!all", "all_ipv4_addresses"]
  tasks:
    - name: Get IP
      debug:
        msg: "{{ ansible_all_ipv4_addresses }}"
Result:
$ ap --forks=10 playbook-execute-demo2.yml
PLAY [test for timer] ***************
TASK [Gathering Facts] ***************
Saturday 11 March 2023 12:32:19 +0800 (0:00:00.079) 0:00:00.079 ***************
ok: [tencent]
TASK [Get IP] ***************
Saturday 11 March 2023 12:32:23 +0800 (0:00:03.489) 0:00:03.568 ***************
ok: [tencent] => {
    "msg": [
        "10.0.24.14",
        "172.17.0.1",
        "10.4.0.1"
    ]
}
PLAY RECAP ***************
tencent : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Saturday 11 March 2023 12:32:23 +0800 (0:00:00.038) 0:00:03.607 ***************
===============================================================================
Gathering Facts --------------- 3.49s
Get IP --------------- 0.04s
Next, gathering only all_ipv4_addresses:
gather_subset: ["!all", "all_ipv4_addresses"]
Result:
$ ap --forks=10 playbook-execute-demo2.yml
PLAY [test for timer] ***************
TASK [Gathering Facts] ***************
Saturday 11 March 2023 12:33:37 +0800 (0:00:00.060) 0:00:00.060 ***************
ok: [tencent]
TASK [Get IP] ***************
Saturday 11 March 2023 12:33:39 +0800 (0:00:02.066) 0:00:02.127 ***************
ok: [tencent] => {
    "msg": [
        "10.0.24.14",
        "172.17.0.1",
        "10.4.0.1"
    ]
}
PLAY RECAP ***************
tencent : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Saturday 11 March 2023 12:33:39 +0800 (0:00:00.044) 0:00:02.171 ***************
===============================================================================
Gathering Facts --------------- 2.07s
Get IP --------------- 0.04s
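Comparing the two runs by eye works for a single task, but the same numbers can be pulled out of saved profile_tasks summaries with a small filter. This is a sketch that assumes the summary lines keep the "name, dashes, seconds" layout shown above; `task_seconds` is a hypothetical helper name.

```shell
#!/bin/sh
# task_seconds TASK_NAME
# Reads profile_tasks summary lines on stdin, for example
#   "Gathering Facts --------------- 3.49s"
# and prints the seconds recorded for TASK_NAME.
task_seconds() {
  awk -v name="$1" '
    / -+ [0-9.]+s$/ {
      secs = $NF; sub(/s$/, "", secs)            # "3.49s" -> "3.49"
      task = $0; sub(/ -+ [0-9.]+s$/, "", task)  # drop the dashes and timing
      if (task == name) print secs
    }'
}

# The two "Gathering Facts" timings from the runs above.
all=$(printf '%s\n' 'Gathering Facts --------------- 3.49s' | task_seconds 'Gathering Facts')
subset=$(printf '%s\n' 'Gathering Facts --------------- 2.07s' | task_seconds 'Gathering Facts')
awk -v a="$all" -v b="$subset" \
  'BEGIN { printf "saved %.2fs (%.0f%%)\n", a - b, (a - b) / a * 100 }'
```

On this single host, restricting the subset saves about 1.4 seconds of the 3.49 seconds spent gathering facts.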
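3.8 Optimization 7: Split playbooks
Putting every task into a single playbook or role is not recommended. Splitting them as needed is more sensible and benefits both maintainability and performance.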
$ tree -L 3
.
├── inventory
├── meta
│   └── main.yml
├── README.md
├── roles
│   ├── consul
│   │   ├── setup.yml
│   │   ├── tasks
│   │   ├── templates
│   │   └── vars
│   ├── initial
│   │   ├── files
│   │   ├── handlers
│   │   ├── setup.yml
│   │   ├── tasks
│   │   ├── tests
│   │   └── vars
│   ├── prometheus
│   │   ├── files
│   │   ├── handlers
│   │   ├── setup.yml
│   │   ├── tasks
│   │   ├── templates
│   │   ├── tests
│   │   └── vars
│   ├── python
│   │   ├── setup.yml
│   │   ├── tasks
│   │   ├── templates
│   │   └── vars
│   ├── terraform
│   │   ├── setup.yml
│   │   ├── tasks
│   │   └── vars
│   └── v2ray
│       ├── files
│       ├── setup.yml
│       ├── tasks
│       ├── templates
│       └── vars
└── setup.yml
31 directories, 10 files
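- setup.yml: the top-level entry point for all roles
- roles/<role_name>/setup.yml: the entry point for each individual role
Running the per-role playbooks concurrently in the background with & improves throughput, for example: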
$ ap -i inventory roles/python/setup.yml &
$ ap -i inventory roles/initial/setup.yml &
$ ap -i inventory roles/terraform/setup.yml &
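Of course, if roles depend on one another, you still need tasks that wait for or check on their prerequisites. This logic can also live in a shell script, since shell offers more flexible control flow.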
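As a sketch of that shell-side coordination: the helper below starts several commands in the background, waits for all of them, and reports failure if any of them failed, so a dependent playbook can be chained after it. The name `run_parallel` is hypothetical, and the example spells out `ansible-playbook` on the assumption that `ap` is an alias for it, since aliases are not visible inside `sh -c`.

```shell
#!/bin/sh
# run_parallel CMD...
# Starts each argument as a background shell command, waits for all of them,
# and returns non-zero if any command failed.
run_parallel() {
  pids=""
  for cmd in "$@"; do
    sh -c "$cmd" &
    pids="$pids $!"        # remember the PID of each background job
  done
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=1  # collect each job's exit status
  done
  return "$failed"
}
```

For example, if the terraform role needed both python and initial to finish first (an illustrative dependency, not one stated by the layout above), it could be run as `run_parallel "ansible-playbook -i inventory roles/python/setup.yml" "ansible-playbook -i inventory roles/initial/setup.yml" && ansible-playbook -i inventory roles/terraform/setup.yml`.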
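3.9 Optimization 8: Use a third-party strategy plugin, Mitogen for Ansible
Besides the built-in execution strategies, you can use third-party strategy plugins. One that is popular in the community is Mitogen for Ansible.
Mitogen is well suited to playbooks made up of many short-lived operations. Its main optimizations are:
- One connection: the default strategy opens connections repeatedly, in proportion to the number of tasks
- One round trip: fewer network round trips per task
- Resource reuse: avoids relaunching the Python interpreter and recompiling imports
- Code caching: code is cached temporarily in memory, which reduces network bandwidth usage
- Write optimization: avoids the default temporary-file writes (repeatedly rewriting a ZIP file)
To set it up, first download the plugin package: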
$ mkdir -p /etc/ansible/plugins
$ cd /etc/ansible/plugins
$ wget https://networkgenomics.com/try/mitogen-0.2.9.tar.gz
$ tar xf mitogen-0.2.9.tar.gz
$ rm -f mitogen-0.2.9.tar.gz
Then edit the [defaults] section of ansible.cfg:
strategy_plugins = /etc/ansible/plugins/mitogen-0.2.9/ansible_mitogen/plugins/strategy
strategy = mitogen_linear
If the playbook uses become for sudo operations, add the following rule to /etc/sudoers on the target nodes:
your_ssh_username = (ALL) NOPASSWD:/usr/bin/python -c*
The Mitogen plugin provides three strategies that correspond to Ansible's built-in ones:
$ ls /etc/ansible/plugins/mitogen-0.2.9/ansible_mitogen/plugins/strategy | grep -v 'init'
mitogen_free.py
mitogen_host_pinned.py
mitogen_linear.py
mitogen.py
The test playbook is the same timer playbook used earlier:
---
- name: test for timer
  hosts: tencent
  gather_facts: no
  tasks:
    - name: only one debug
      debug:
        var: inventory_hostname
    - name: shell
      shell:
        cp /etc/fstab /tmp/
      loop: "{{ range(0, 100)|list }}"
    - name: scp
      copy:
        src: /etc/hosts
        dest: /tmp/
      loop: "{{ range(0, 100)|list }}"
The measurements from the earlier test were as follows.
With pipelining disabled:
PLAY RECAP ***************
tencent : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Saturday 11 March 2023 12:15:05 +0800 (0:02:55.225) 0:04:42.381 ***************
===============================================================================
scp --------------- 175.23s
shell --------------- 107.05s
only one debug --------------- 0.04s
With pipelining enabled:
Saturday 11 March 2023 12:09:40 +0800 (0:02:07.824) 0:03:13.579 ***************
===============================================================================
scp --------------- 127.82s
shell --------------- 65.65s
only one debug --------------- 0.05s
With pipelining enabled and the Mitogen strategy plugin active:
Saturday 11 March 2023 13:28:32 +0800 (0:00:10.652) 0:00:24.067 ***************
===============================================================================
shell --------------- 13.06s
scp --------------- 10.65s
only one debug --------------- 0.17s
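The improvement is substantial: total run time drops from about 3 minutes 13 seconds to about 24 seconds, roughly 8 times faster.
That said, some settings conflict with stock Ansible configuration when Mitogen is in use, and they need extra work. For example:
- Stock Ansible uses forks to cap the number of hosts handled concurrently, but Mitogen's thread pool supports at most 32 connections by default. To change the limit, set the MITOGEN_POOL_SIZE environment variable.
- Performance on Python 3 is noticeably lower than on Python 2, apparently because of the core libraries, although upstream has not yet investigated the exact cause.
- …
Overall, Mitogen is worth trying when you are unhappy with Ansible's performance, but be prepared to run into some pitfalls.
For reference, here is the final optimized configuration on the Ansible control node, /etc/ansible/ansible.cfg: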
[defaults]
roles_path = /etc/ansible/roles
callback_whitelist = timer, profile_tasks, profile_roles
library = /usr/share/ansible-library
forks = 100
host_key_checking = False
jinja2_extensions = jinja2.ext.do,jinja2.ext.i18n,jinja2.ext.loopcontrols
filter_plugins = /usr/share/ansible/plugins/filter
# Mitogen strategy plugin
strategy_plugins = /etc/ansible/plugins/mitogen-0.2.9/ansible_mitogen/plugins/strategy
strategy = mitogen_linear
# Facts optimization
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /etc/ansible/facts_cache
fact_caching_timeout = 86400
[ssh_connection]
pipelining = True
ssh_args = -C -o ControlMaster=auto -o ControlPersist=1d
[inventory]
[privilege_escalation]
[paramiko_connection]
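After editing a config like this, it is worth confirming that the values you care about are actually where you expect them. `ansible-config dump --only-changed` reports what Ansible itself resolved. As a lighter check that needs only awk, the sketch below reads one key from one section of the file. The helper name `ini_get` is hypothetical, and it handles only simple `key = value` lines like the ones above.

```shell
#!/bin/sh
# ini_get FILE SECTION KEY
# Prints the value of KEY inside [SECTION] of an INI-style file.
ini_get() {
  awk -v section="[$2]" -v key="$3" '
    /^\[/ { current = $0; next }           # remember the current section header
    current == section {
      split($0, kv, "=")
      k = kv[1]; gsub(/^[ \t]+|[ \t]+$/, "", k)
      if (k == key) {
        # Everything after the first "=" is the value (it may contain "=").
        v = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", v)
        print v
      }
    }' "$1"
}
```

For example, `ini_get /etc/ansible/ansible.cfg ssh_connection pipelining` should print `True` for the configuration above.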