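A SaltStack master often has many minions connected to it. The program below analyzes which minions executed a job successfully, which failed, and which did not return at all.
Script notes:
1. The script first prints the job id, the command name, and other related information for the job, followed by the execution flow and results, which match what you would see when running the command by hand. Finally it prints every failed task and every task that did not return, and runs them again. Note that because I did not examine the actual failure scenarios, the detection of failed tasks is crude. In addition, a minion that is not connected is also counted as "not returned" and is run again; strictly speaking, a job should not be re-run against a minion that is not connected.
2. The program first forks a child process to execute the salt command. Once that salt command has finished, the program executes the job a second time on the minions that failed or did not return.
3. The script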
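The bookkeeping described in these notes (record the target minions when the job starts, drop each minion as its return arrives, collect the failures) can be sketched as a pure function that needs no running master. This is only an illustration, not the script itself: `track_job`, the minion names, and the sample events are made up here, and treating a non-zero `retcode` as a failure is an assumption about how failed states are reported. The pattern also accepts minion ids that are not IP addresses.

```python
import re

# Tags follow the pattern salt/job/<jid>/new and salt/job/<jid>/ret/<minion_id>.
NEW_TAG = re.compile(r'^salt/job/(\d+)/new$')
RET_TAG = re.compile(r'^salt/job/(\d+)/ret/(.+)$')


def track_job(events):
    """Return (jid, not_returned, failed) for the first job seen in events.

    events is any iterable of dicts shaped like {'tag': ..., 'data': {...}}.
    """
    jid = None
    pending = []
    failed = []
    for ev in events:
        tag, data = ev['tag'], ev['data']
        if jid is None:
            # Wait for the first "new job" event and remember its targets.
            new = NEW_TAG.match(tag)
            if new:
                jid = new.group(1)
                pending = list(data['minions'])
            continue
        ret = RET_TAG.match(tag)
        if not ret or ret.group(1) != jid:
            continue  # not a return for the job being tracked
        minion = ret.group(2)
        if minion in pending:
            pending.remove(minion)
        # Assumption: a non-zero retcode counts as a failure as well.
        if not data.get('success', False) or data.get('retcode', 0) != 0:
            if minion not in failed:
                failed.append(minion)
    return jid, pending, failed


# Hypothetical events, for illustration only.
sample = [
    {'tag': 'salt/job/20151208025319005896/new',
     'data': {'minions': ['web1', 'web2', 'db1']}},
    {'tag': 'salt/job/20151208025319005896/ret/web1',
     'data': {'success': True, 'retcode': 0}},
    {'tag': 'salt/job/20151208025319005896/ret/web2',
     'data': {'success': True, 'retcode': 2}},
]
jid, not_returned, failed = track_job(sample)
print(jid, not_returned, failed)
```

With these sample events, `db1` never returns and `web2` reports a non-zero retcode, so they end up in `not_returned` and `failed` respectively.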
import salt.utils.event
import re
import signal, time
import sys
import os

def single_handler(target):
    # Replace the current (child) process with a salt run against a single minion.
    os.execl('/usr/bin/salt', 'salt', target, 'state.sls', 'os')
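def handler(num1, num2):
    # SIGCLD handler: called when the child that ran the first salt command exits.
    #signal.signal(signal.SIGCLD,signal.SIG_IGN)
    print('We are in signal handler')
    # Use .get() so that the handler does not crash if no job event was ever seen.
    not_returned = record.get(jid, [])
    failed = failedrecord.get(jid, [])
    print('Job Not Ret: '+str(not_returned))
    print(' Job Failed: '+str(failed))
    print('all done...')
    # Re-run the job on every minion that reported a failure.
    for item in failed:
        #print item
        try:
            pid = os.fork()
            if pid == 0:
                single_handler(item)
        except OSError:
            print('we exec. '+ item +' error!')
    # Re-run the job on every minion that did not return.
    for item in not_returned:
        #print item
        try:
            print('fork ok ' + item)
            pid = os.fork()
            if pid == 0 :
                single_handler(item)
        except OSError:
            print('we exec. '+item + ' error!')
    sys.stdout.flush()
    os._exit(0)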
fd = open('/tmp/record', 'w+')
#sys.stdout = fd
#sys.stderr = fd
signal.signal(signal.SIGCLD, handler)
#fd = open('/var/log/record', 'w+')
# Redirect stdout and stderr to the record file; the exec'd salt processes inherit them.
os.dup2(fd.fileno(), sys.stdout.fileno())
os.dup2(fd.fileno(), sys.stderr.fileno())
#sys.stdout = fd
#sys.stderr = fd
try:
    pid = os.fork()
    if pid == 0:
        # Child: wait so that the parent is already listening for events, then run salt.
        time.sleep(2)
        try:
            os.execl('/usr/bin/salt', 'salt', '*', 'state.sls', 'os')
        except OSError:
            print('exec error!')
            os._exit(1)
except OSError:
    print('first fork error!')
    os._exit(1)
event = salt.utils.event.MasterEvent('/var/run/salt/master')
flag=False
reg=re.compile('salt/job/([0-9]+)/new')
reg1=reg
# One process executes the command, but sleeps for a while first.
# The other process listens for events.
# With this method we can also filter events by function name.
record={}
failedrecord={}
jid = 0
#try:
for eachevent in event.iter_events(tag='salt/job',full=True):
    eachevent=dict(eachevent)
    result = reg.findall(eachevent['tag'])
    if not flag and result:
        # First "new job" event: print the job header and remember the target minions.
        flag = True
        jid = result[0]
        print(" job_id: " + jid)
        print(" Command: " + dict(eachevent['data'])['fun'] + ' ' + str(dict(eachevent['data'])['arg']))
        print(" RunAs: " + dict(eachevent['data'])['user'])
        print("exec_time: " + dict(eachevent['data'])['_stamp'])
        print("host_list: " + str(dict(eachevent['data'])['minions']))
        sys.stdout.flush()
        record[jid]=eachevent['data']['minions']
        failedrecord[jid]=[]
        # Note: this pattern only matches minion ids made of digits and dots (IP addresses).
        reg1 = re.compile('salt/job/'+jid+'/ret/([0-9.]+)')
    else:
        result = reg1.findall(eachevent['tag'])
        if result:
            # A minion returned: it is no longer pending.
            if result[0] in record[jid]:
                record[jid].remove(result[0])
            if not dict(eachevent['data'])['success']:
                failedrecord[jid].append(result[0])
#except:
#    print 'we in except'
"""
print('Job Not Ret: '+str(record[jid]))
print(' Job Failed: '+str(failedrecord[jid]))
for item in failedrecord[jid]:
    os.system('salt '+ str(item) + ' state.sls os')
for item in record[jid]:
    os.system('salt '+ str(item) + ' state.sls os')
os._exit(0)
"""
Execution result:
job_id: 20151208025319005896
Command: state.sls ['os']
RunAs: root
exec_time: 2015-12-08T02:53:19.006284
host_list: ['172.18.1.212', '172.18.1.214', '172.18.1.213', '172.18.1.211']
172.18.1.213:
----------
ID: configfilecopy
Function: file.managed
Name: /root/node3
Result: True
Comment: File /root/node3 is in the correct state
Started: 02:53:19.314015
Duration: 13.033 ms
Changes:
----------
ID: commonfile
Function: file.managed
Name: /root/commonfile
Result: True
Comment: File /root/commonfile is in the correct state
Started: 02:53:19.327173
Duration: 1.993 ms
Changes:
Summary
------------
Succeeded: 2
Failed: 0
------------
Total states run: 2
172.18.1.212:
----------
ID: configfilecopy
Function: file.managed
Name: /root/node2
Result: True
Comment: File /root/node2 is in the correct state
Started: 02:53:19.337325
Duration: 8.327 ms
Changes:
----------
ID: commonfile
Function: file.managed
Name: /root/commonfile
Result: True
Comment: File /root/commonfile is in the correct state
Started: 02:53:19.345787
Duration: 1.996 ms
Changes:
Summary
------------
Succeeded: 2
Failed: 0
------------
Total states run: 2
172.18.1.211:
----------
ID: configfilecopy
Function: file.managed
Name: /root/node1
Result: True
Comment: File /root/node1 is in the correct state
Started: 02:53:19.345017
Duration: 12.741 ms
Changes:
----------
ID: commonfile
Function: file.managed
Name: /root/commonfile
Result: True
Comment: File /root/commonfile is in the correct state
Started: 02:53:19.357873
Duration: 1.948 ms
Changes:
Summary
------------
Succeeded: 2
Failed: 0
------------
Total states run: 2
172.18.1.214:
Minion did not return. [Not connected]
We are in signal handler
Job Not Ret: ['172.18.1.214']
Job Failed: []
all done...
fork ok 172.18.1.214
172.18.1.214:
Minion did not return. [Not connected]
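In the output above, 172.18.1.214 is not connected, yet the job is run on it a second time, which is exactly the case the notes say should be avoided. A minimal sketch of one way to filter such minions out before retrying follows. The helper names `split_not_returned` and `retry_command` are hypothetical, and the list of connected minions is assumed to be obtained separately, for example from the output of `salt-run manage.up`.

```python
def split_not_returned(not_returned, connected):
    """Split the minions that did not return into (retry, skip).

    Minions that are connected are worth retrying; the rest are skipped.
    """
    connected = set(connected)
    retry = [m for m in not_returned if m in connected]
    skip = [m for m in not_returned if m not in connected]
    return retry, skip


def retry_command(minion, state='os'):
    """Build the argument list for re-running the state on one minion."""
    return ['/usr/bin/salt', minion, 'state.sls', state]


# Values taken from the run shown above; the connected list is assumed.
connected = ['172.18.1.211', '172.18.1.212', '172.18.1.213']
retry, skip = split_not_returned(['172.18.1.214'], connected)
print('retry:', retry)
print('skip (not connected):', skip)
for minion in retry:
    print(retry_command(minion))
```

Each list returned by `retry_command` could be passed to `subprocess.call` instead of forking and calling `os.execl` by hand; with the values above nothing is retried, because the only minion that did not return is not connected.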