There seem to be more and more posts on the forums about jobs ‘stuck’
in the Running state, and I have been investigating this problem for a
client recently, so I thought I would summarise some of the
troubleshooting techniques I use. This posting expands on the article I
wrote a few years ago about agent_exec.
The problem is usually expressed in the form of ‘DA shows my job is
running but I know it’s not’. First of all, DA shows a job as ‘Running’
whenever it finds a job whose a_special_app attribute is set to
‘agentexec’. Since agent_exec sets this attribute when it starts a job
and clears it when the job has finished, under normal circumstances this
is a fairly accurate reflection of whether a job is running or not.
However if the agent_exec processes are interrupted before clearing
the attribute (if the box is rebooted or the content server hangs for
instance) then the job object can be left with a_special_app =
‘agentexec’ and DA shows the job as running.
Of course agent_exec attempts to deal with such a situation.
Every time it wakes up to perform some processing it first runs a
‘garbage_collect_jobs’ routine. You won’t see much evidence of this in
the logs unless you turn on agent_exec tracing (see my job scheduler
article for details on how to do this). You will get the following lines
when there is nothing to garbage collect:
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] garbage_collect_jobs
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_exec: execquery,s0,F,
SELECT ALL r_object_id, a_last_invocation, ...
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_get: getlastcoll,s0
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_next: next,s0,q0
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_exec: close,s0,q0
Basically agent_exec runs the following query:
SELECT ALL
r_object_id, a_last_invocation,
a_last_completion, a_special_app
FROM dm_job
WHERE ( ( (a_last_invocation IS NOT NULLDATE)
AND (a_last_completion IS NULLDATE))
OR (a_special_app = 'agentexec'))
AND (i_is_reference = 0 OR i_is_reference is NULL)
AND (i_is_replica = 0 OR i_is_replica is NULL)
If jobs are returned from this query and agent_exec cannot match a
job with an existing running job, it will clean up the job object,
unsetting a_special_app and setting a_last_invocation to the current
time.
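The matching step described above can be sketched in a few lines of Python. This is only an illustration of the logic, not agent_exec's actual code: the JobRow class, the find_dead_jobs function, the running_ids set and the object IDs are all hypothetical, and None stands in for NULLDATE.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class JobRow:
    """One row as returned by the garbage-collection query (hypothetical model)."""
    r_object_id: str
    a_last_invocation: Optional[str]  # None stands in for NULLDATE
    a_last_completion: Optional[str]  # None stands in for NULLDATE
    a_special_app: str


def looks_in_progress(job: JobRow) -> bool:
    """Mirror the WHERE clause: started but never completed, or flagged 'agentexec'."""
    started_not_finished = (
        job.a_last_invocation is not None and job.a_last_completion is None
    )
    return started_not_finished or job.a_special_app == "agentexec"


def find_dead_jobs(rows: Iterable[JobRow], running_ids: Set[str]) -> List[JobRow]:
    """Return jobs that look in progress but match no job known to be running."""
    return [
        job for job in rows
        if looks_in_progress(job) and job.r_object_id not in running_ids
    ]


# Made-up object IDs, for illustration only.
rows = [
    JobRow("0800000180000101", "2008-01-17 13:00:00", None, "agentexec"),  # really running
    JobRow("0800000180000102", "2008-01-16 02:00:00", "2008-01-16 02:05:00", "agentexec"),  # stale flag
    JobRow("0800000180000103", "2008-01-15 02:00:00", "2008-01-15 02:01:00", ""),  # idle
]
dead = find_dead_jobs(rows, running_ids={"0800000180000101"})
print([job.r_object_id for job in dead])  # ['0800000180000102']
```

Only the second job is reported: it still carries the ‘agentexec’ flag but nothing is running it, which is exactly the case that leaves DA showing a job as running.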
Here is some typical trace output in the agentexec.log file when I set
the a_special_app attribute of dm_LogPurge to agentexec. This output
shows that garbage collection is the source of the infamous message:
Detected while processing dead job dm_LogPurge: The job object
indicated the job was in progress, but the job was not actually running.
It is likely that the dm_agent_exec utility was stopped while the job
was in progress.
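If you want to find out which jobs have been cleaned up in this way, a log scan along the following lines may help. It is a minimal sketch that assumes the phrase ‘Detected while processing dead job’, the job name and the colon all appear on one line, and that job names contain no colon; the dead_job_names function is hypothetical.

```python
import re
from typing import List

# Captures the job name between the fixed phrase and the first colon after it.
DEAD_JOB = re.compile(r"Detected while processing dead job (.+?):")


def dead_job_names(log_text: str) -> List[str]:
    """Return the name of every job reported as dead in the given log text."""
    return DEAD_JOB.findall(log_text)


sample = (
    "Detected while processing dead job dm_LogPurge: The job object\n"
    "indicated the job was in progress, but the job was not actually running.\n"
)
print(dead_job_names(sample))  # ['dm_LogPurge']
```

In practice you would pass in the contents of agentexec.log rather than the sample string.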
DMCL tracing dm_agent_exec
Examining the agentexec trace is usually enough to figure out where
the problem lies; however, in extreme cases it is useful to look at the
dmcl trace for the agentexec process to troubleshoot further. In
principle you can do this by setting the dmcl.ini trace_file parameter
to an existing directory on the Content Server. However, this has the
disadvantage of turning on tracing for all dmcl processes on the Content
Server, i.e. all jobs and methods.
What we really want to do is isolate the agentexec process from all
others, and in this section I tell you how. I present the steps along
with explanations for a typical Windows server. The same principle
applies to *nix servers, usually with a suitable change of folder paths.
First force the agent exec to stop. You can do this
by killing the main agent_exec process repeatedly. The Content Server
will detect that the agent exec has died and try to restart it; however,
there is a limit to the number of times this will happen (seems to be 5
by default). Eventually you get the following message in the Content
Server log and the dm_agent_exec stays dead:
Thu Jan 17 13:35:37 2008 984000
[DM_SESSION_W_AGENT_EXEC_FAILURE_EXCEED]warning: "The failure limit of
the agent exec program has exceeded. It will not be restarted again.
Please correct the problem and restart the server."
Copy the agent_exec executable to a separate directory. Copy the program file %DM_HOME%\bin\dm_agent_exec.exe to a new directory, e.g. c:\Documentum\agentexec.
Copy the dmcl.ini. Copy the main dmcl.ini file in c:\windows to c:\Documentum\agentexec. Now edit the file and add the following lines:
trace_level = 10
trace_file = c:\Documentum\agentexec
We are going to take advantage of the fact that the first place the
dmcl looks for the dmcl.ini is in the current working directory.
Start the agent_exec from the command line. Use the following syntax:
dm_agent_exec -docbase_name docbase -docbase_owner dmadmin -trace_level 1
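Because the trick relies on the current working directory, the process has to be started from the directory holding the copied dmcl.ini. If you prefer to script the launch, a rough Python sketch follows. The agent_exec_command and start_agent_exec functions are hypothetical helpers; the arguments are the ones shown above, and the working directory is assumed to hold both the copied executable and the edited dmcl.ini.

```python
import subprocess
from pathlib import Path
from typing import List


def agent_exec_command(exe: Path, docbase: str, owner: str,
                       trace_level: int = 1) -> List[str]:
    """Build the argument list for the command line shown above."""
    return [
        str(exe),
        "-docbase_name", docbase,
        "-docbase_owner", owner,
        "-trace_level", str(trace_level),
    ]


def start_agent_exec(workdir: Path, docbase: str, owner: str) -> subprocess.Popen:
    """Start the copied executable with workdir as its current working directory,
    so that the dmcl.ini in workdir is the one that gets picked up."""
    exe = workdir / "dm_agent_exec.exe"
    return subprocess.Popen(agent_exec_command(exe, docbase, owner), cwd=str(workdir))


# Only build and print the command here; nothing is started.
cmd = agent_exec_command(Path("dm_agent_exec.exe"), "docbase", "dmadmin")
print(" ".join(cmd))
```

Calling start_agent_exec with the directory used in the steps above (c:\Documentum\agentexec) would launch the process; passing cwd is the part that matters.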
Agent exec logging and trace output will continue to appear in
%DOCUMENTUM%\dba\log\agentexec\agentexec.log; however, a number of dmcl
trace files will also be created in the c:\Documentum\agentexec directory.
One of these (probably the largest) will be the dmcl trace for the main
agent_exec process. Remember that agent_exec works by forking off a new
dm_agent_exec process to manage each running job, and each of these
processes will have its own dmcl trace file.
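To pick out the likely main trace file, you can rank the files in the trace directory by size. The sketch below is one way to do that in Python. It makes no assumption about how the dmcl names its trace files, so it simply skips the copied executable and dmcl.ini; the files_by_size function and the file names in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

# Files placed in the directory by the earlier steps; these are not traces.
NOT_TRACES = {"dm_agent_exec.exe", "dmcl.ini"}


def files_by_size(directory: Path) -> List[Tuple[str, int]]:
    """Return (name, size in bytes) for each candidate trace file, largest first."""
    files = [
        (p.name, p.stat().st_size)
        for p in directory.iterdir()
        if p.is_file() and p.name.lower() not in NOT_TRACES
    ]
    return sorted(files, key=lambda item: item[1], reverse=True)


# Demonstration against a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "dmcl.ini").write_text("trace_level = 10\n")
    (d / "trace_a.txt").write_text("x" * 500)
    (d / "trace_b.txt").write_text("x" * 50)
    ranked = files_by_size(d)
    print(ranked[0][0])  # trace_a.txt
```

The largest file is only a candidate: a long-running job can produce a bigger trace than the main process, so check the contents before relying on it.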
When you have finished tracing the agentexec you will need to kill
the command line process and restart the Content Server (if anyone knows
how to force the content server to restart the agentexec after the
failure limit has been reached I’d love to know).
Conclusion
With a clear understanding of how agent_exec works and with the trace
output available it should be possible to troubleshoot and resolve just
about any job scheduler related problem.
Reposted from: http://robineast.wordpress.com/2008/01/17/troubleshooting-agent_exec-garbage-collection/