$ GOTO START_PROC
$!
$! NOTICE: This .COM file was written as a workaround for customers having
$! problems with queue crashes. It is not officially part of the supported TCPIP
$! product. Use it at your own risk!
$!
$! SYS$SPECIFIC:[TCPIP$SMTP]TCPIP$RESTART_SMTPQ.COM monitors the SMTP execution
$! queue(s) and restarts them if they have crashed. It also checks the SMTP
$! symbiont processes' PAGFILCNT values against a parametric limit and stops and
$! restarts the queue(s) if a PAGFILCNT is less than the limit. This catches a
$! memory leak before it causes a queue to crash with an "Insufficient memory"
$! error. TCPIP$RESTART_SMTPQ.COM also accepts a list of failure statuses; if
$! any of them is found in the symbiont log, it stops and restarts the queue(s).
$!
$!
$! Before TCPIP$RESTART_SMTPQ restarts crashed execution queues it checks the
$! generic queue for pending jobs. It takes the oldest pending job in the
$! generic queue and puts it on hold, on the assumption that it is the job that
$! caused the queue crash. Only once this job is on hold will it restart the
$! queue(s). If TCPIP$RESTART_SMTPQ stops the queue(s) itself because the
$! symbiont is close to crashing due to insufficient memory, no job is put on
$! hold: there has been no symbiont crash, and no particular job in the queue
$! is to blame for the memory leak. The same is true if we stop the queue
$! because one of the selected failure statuses was found in the log. If SMTP
$! is shut down when TCPIP$RESTART_SMTPQ runs, it will not restart the
$! execution queues. To decide whether SMTP is shut down or running,
$! TCPIP$RESTART_SMTPQ checks whether SYS$LIBRARY:TCPIP$SMTP_MAILSHR.EXE is
$! installed. If it is, TCPIP$RESTART_SMTPQ assumes that SMTP is running and
$! that an execution queue in the stopped state has crashed and so needs to be
$! restarted.
$!
$!
$! Note on multiple execution queue setups:
$!
$! This .COM file has been updated to work with multiple execution queue setups
$! of up to nine execution queues (i.e., it supports up to
$! TCPIP SET CONFIG /QUEUE=9). This is handled in a fairly simplistic way. When
$! checking whether the execution queues are stopped, if there are multiple
$! queues we wait until *all* of them are stopped before we restart the queues.
$! If we didn't do this, the scheme to spot the culprit job in the queue
$! wouldn't work: the culprit job, having crashed one execution queue, could
$! have made its way into another execution queue, so the job at the head of
$! the list in the generic queue would *not* in fact be the job to blame. We can
$! only be sure that the culprit job is the one at the head of the generic queue
$! when all the queues have stopped, so we must allow the culprit job to crash
$! all of the queues before trying to find it, put it on hold and restart the
$! queues.
$!
$! If we find any one execution queue below the PAGFILCNT limit, we stop
$! *all* the execution queues. This is because the restart procedure described
$! above won't restart the execution queues until all of them have stopped.
$!
$!
$! Caveats:
$!
$!
$! 1) The function that stops the queue if a certain error status is
$! signalled in the log still doesn't work for multiple execution queue SMTP
$! setups. It only checks the topmost version of the log file, which isn't
$! what you want when you're running more than one execution queue.
$!
$! 2) The scheme of assuming that the oldest job in the queue is the one that
$! caused a symbiont crash only works if the SMTP generic queue is set up to do
$! FIFO scheduling. This is done with the INIT/QUEUE/SCHED=NOSIZE command. Older
$! versions of UCX START MAIL did not set this, so check your SMTP generic
$! queue and do INIT/QUEUE/SCHED=NOSIZE if it hasn't already been done.
$!
$! - Remember that INIT/QUEUE/SCHED=NOSIZE is done on the generic queue, not
$!   the execution queue.
$!
$! - Stop mail and ensure that there are no pending mail jobs in the queue
$!   before doing this, because changing queue scheduling can have
$!   unpredictable results if there are pending jobs in the queue when the
$!   command is issued. From the DCL Help for the INIT/QUEUE/SCHED command:
$!
$!       If you enter this command while there are pending jobs in any
$!       queue, its effect on future jobs is unpredictable.
$!
$!
$!
$! How to use this COM file:
$!
$! To get it running you submit it once and it resubmits itself every so often.
$! The time between submits is a parameter.
$!
$! TCPIP$RESTART_SMTPQ.COM has four parameters. The first is the name of the
$! batch queue to which it is to resubmit itself. You probably want to take the
$! default for this parameter, which is SYS$BATCH. The second parameter is the
$! time interval between submits. It should be in standard VMS delta time
$! format. The third parameter is the minimum value for the symbiont process's
$! PAGFILCNT that TCPIP$RESTART_SMTPQ will allow before it considers the queue
$! "about to crash due to a memory leak". The fourth is a list of failure
$! statuses that you want TCPIP$RESTART_SMTPQ.COM to stop and restart the queue
$! for if found in the symbiont log. For example, if you want
$! TCPIP$RESTART_SMTPQ to run in SYS$BATCH, resubmit itself every five minutes,
$! stop and restart the SMTP execution queue if the PAGFILCNT goes under 1000
$! pages, and stop and restart the SMTP execution queue if the error
$! "%LIB-F-INSEF, insufficient event flags" appears in the log file, you'd do
$! this:
$!
$!
$! $ SUBMIT/NOPRINT SYS$SPECIFIC:[TCPIP$SMTP]TCPIP$RESTART_SMTPQ.COM -
$!     /PARAM=("","+00:05:00.00",1000,"LIB-F-INSEF")/USER=SYSTEM
$!
$! You should submit it as the SYSTEM user because it needs privileges.
$!
$! The defaults for the four parameters are:
$!
$!    P1   Batch queue           "SYS$BATCH"
$!    P2   Resubmit interval     "+00:10:00.00" (ten minutes)
$!    P3   Minimum PAGFILCNT     1000 (1000 pages left)
$!    P4   Termination_srchstr   blank (no search string)
$!
$! P4 is really only a text string that we search the log for. If you
$! want to check for multiple strings, just separate the strings with
$! a comma, e.g.:
$! $ SUBMIT.../PARAM=("","",1000,"LIB-F-INSEF,SYSTEM-F-LINKDISCON")...
$! Most users of this procedure won't need to set P4. It is also not
$! supported for multiple execution queue setups.
$!
$!
$!
$START_PROC:
$ SMTP_EXEC_Q = "TCPIP$SMTP_''f$getsyi("NODENAME")'_01"
$ SMTP_GEN_Q = "TCPIP$SMTP_''f$getsyi("NODENAME")'_00"
$ SMTP_SYMBIONT = "SMTP_''f$getsyi("NODENAME")'_01"
$ IF P1 .EQS. ""
$ THEN
$   SUB_Q = "SYS$BATCH"
$ ELSE
$   SUB_Q = P1
$ ENDIF
$
$ IF P2 .EQS. ""
$ THEN
$   SUB_TIME = "+00:10:00.00"
$ ELSE
$   SUB_TIME = P2
$ ENDIF
$
$ IF P3 .EQS. ""
$ THEN
$   PAGFILCNT_MIN = 1000
$ ELSE
$   PAGFILCNT_MIN = P3
$ ENDIF
$
$ IF P4 .EQS. ""
$ THEN
$   TERMINATE_SRCHSTR = ""
$ ELSE
$   TERMINATE_SRCHSTR = P4
$ ENDIF
$
$ SHO SYM SUB_Q
$ SHO SYM SUB_TIME
$ SHO SYM PAGFILCNT_MIN
$ SHO SYM TERMINATE_SRCHSTR
$
$ RESUBMIT_SELF = "TRUE"
$
$ SET PROC/PRIV=ALL
$!
$! Count the number of execution queues.
$ TEMP_QCOUNT = 0
$ EXEC_QCOUNT = 0
$!
$ CHECK_QUEUES:
$ TEMP_QCOUNT = TEMP_QCOUNT + 1
$ SMTP_EXEC_Q = "TCPIP$SMTP_''f$getsyi("NODENAME")'_0''TEMP_QCOUNT'"
$ SHO SYM SMTP_EXEC_Q
$ CANCEL_DUMMY = F$GETQUIW("CANCEL_OPERATION")
$ Q_PRESENT = F$GETQUIW("DISPLAY_QUEUE","QUEUE_NAME","''SMTP_EXEC_Q'","ALL_JOBS")
$ IF Q_PRESENT .EQS. ""
$ THEN
$   EXEC_QCOUNT = TEMP_QCOUNT - 1
$   GOTO CHECK_QUEUES_DONE
$ ENDIF
$ GOTO CHECK_QUEUES
$!
$ CHECK_QUEUES_DONE:
$ WRITE SYS$OUTPUT "There are ''EXEC_QCOUNT' SMTP execution queues."
$!
$! Count the number of STOPPED execution queues.
$ TEMP_QCOUNT = 0
$ STOPPED_EXEC_QCOUNT = 0
$ Q_DOWN = "FALSE"
$!
$ CHECK_STOPPED_QUEUES:
$ TEMP_QCOUNT = TEMP_QCOUNT + 1
$ IF TEMP_QCOUNT .GT. EXEC_QCOUNT THEN GOTO CHECK_STOPPED_QUEUES_DONE
$ SMTP_EXEC_Q = "TCPIP$SMTP_''f$getsyi("NODENAME")'_0''TEMP_QCOUNT'"
$ SHO SYM SMTP_EXEC_Q
$ CANCEL_DUMMY = F$GETQUIW("CANCEL_OPERATION")
$ Q_DOWN = F$GETQUIW("DISPLAY_QUEUE","QUEUE_STOPPED","''SMTP_EXEC_Q'","ALL_JOBS")
$ ! Queue not found: stop counting. (Jumping back to CHECK_QUEUES_DONE
$ ! would reset the counters and loop forever.)
$ IF Q_DOWN .EQS. ""
$ THEN
$   GOTO CHECK_STOPPED_QUEUES_DONE
$ ENDIF
$ IF Q_DOWN .EQS. "TRUE"
$ THEN
$   STOPPED_EXEC_QCOUNT = STOPPED_EXEC_QCOUNT + 1
$ ENDIF
$ GOTO CHECK_STOPPED_QUEUES
$!
$ CHECK_STOPPED_QUEUES_DONE:
$ WRITE SYS$OUTPUT "There are ''STOPPED_EXEC_QCOUNT' stopped SMTP execution queues."
$ SMTP_EXEC_Q = "TCPIP$SMTP_''f$getsyi("NODENAME")'_01"
$!
$CHECK_ALL_Q_DOWN:
$ IF STOPPED_EXEC_QCOUNT .EQ. EXEC_QCOUNT
$ THEN
$   WRITE SYS$OUTPUT "All of the SMTP execution queues are stopped."
$   IF F$FILE("SYS$LIBRARY:TCPIP$SMTP_MAILSHR.EXE","KNOWN")
$   THEN
$     GEN_Q_DUMMY = F$GETQUI("DISPLAY_QUEUE","QUEUE_NAME",-
          "''SMTP_GEN_Q'","WILDCARD")
$     FIRST_PASS = "TRUE"
$     NOACCESS = -
          F$GETQUI("DISPLAY_JOB", "JOB_INACCESSIBLE",, "ALL_JOBS,PENDING_JOBS")
$     JNUM = -
          F$GETQUI("DISPLAY_JOB", "ENTRY_NUMBER",, "FREEZE_CONTEXT,PENDING_JOBS")
$     sho sym jnum
$     !
$     ! Don't bother searching for the culprit job if we stopped the Q(s) ourselves.
$     IF F$SEARCH("TCPIP$RESTART_SMTPQ_FLAG.TXT") .NES. ""
$     THEN
$       OPCOM_MSG = "TCPIP$RESTART_SMTPQ found queue stopped purposefully by me."
$       REQUEST "''OPCOM_MSG'"
$       DELETE TCPIP$RESTART_SMTPQ_FLAG.TXT;*
$       GOTO END_ENTRY_LOOP
$     ENDIF
$!
$ENTRY_LOOP:
$!******************************************************************************
$     IF JNUM .NES. ""
$     THEN
$       JNAME = F$GETQUI("DISPLAY_JOB", -
            "JOB_NAME",, "FREEZE_CONTEXT, PENDING_JOBS")
$       UNAME = F$GETQUI("DISPLAY_JOB", -
            "USERNAME",, "FREEZE_CONTEXT, PENDING_JOBS")
$       ANAME = F$GETQUI("DISPLAY_JOB", -
            "ACCOUNT_NAME",, "FREEZE_CONTEXT, PENDING_JOBS")
$       JSUBTIME = F$GETQUI("DISPLAY_JOB", -
            "SUBMISSION_TIME",, "FREEZE_CONTEXT, PENDING_JOBS")
$       JFILENAME = F$GETQUI("DISPLAY_FILE", -
            "FILE_SPECIFICATION",, -
            "FREEZE_CONTEXT, PENDING_JOBS")
$       !
$       ! Remove the file version from the job's filename so that when we COPY
$       ! later no problems will arise if the destination file already exists.
$       JFILENAME = -
            F$PARSE("''JFILENAME'",,,"DEVICE","SYNTAX_ONLY") + -
            F$PARSE("''JFILENAME'",,,"DIRECTORY","SYNTAX_ONLY") + -
            F$PARSE("''JFILENAME'",,,"NAME","SYNTAX_ONLY") + -
            F$PARSE("''JFILENAME'",,,"TYPE","SYNTAX_ONLY")
$       NEW_JFILENAME = JFILENAME + "_SAVE"
$       !
$       ! Look for a text file too. Its name will be exactly the same as the
$       ! control file's but with _TEXT tacked onto the file TYPE.
$       ! If there is a text file (there only will be if the text is big
$       ! enough) then save the fact.
$       TEMP_FILENAME = JFILENAME + "_TEXT"
$       IF F$SEARCH ("''TEMP_FILENAME'") .EQS. ""
$       THEN
$         JTEXTFILENAME = ""
$         NEW_JTEXTFILENAME = ""
$       ELSE
$         JTEXTFILENAME = TEMP_FILENAME
$         NEW_JTEXTFILENAME = JTEXTFILENAME + "_SAVE"
$       ENDIF
$
$       sho sym jnum
$       sho sym jname
$       sho sym jfilename
$       sho sym jtextfilename
$       sho sym new_jfilename
$       sho sym new_jtextfilename
$       IF FIRST_PASS
$       THEN
$         FIRST_PASS = "FALSE"
$         HOLD_JNUM = JNUM
$         HOLD_JNAME = JNAME
$         HOLD_JFILENAME = JFILENAME
$         HOLD_NEW_JFILENAME = NEW_JFILENAME
$         HOLD_JTEXTFILENAME = JTEXTFILENAME
$         HOLD_NEW_JTEXTFILENAME = NEW_JTEXTFILENAME
$         HOLD_JSUBTIME = JSUBTIME
$         HOLD_COMP_JSUBTIME = F$CVTIME(HOLD_JSUBTIME)
$       ENDIF
$       COMP_JSUBTIME = F$CVTIME(JSUBTIME)
$       IF COMP_JSUBTIME .LTS. HOLD_COMP_JSUBTIME
$       THEN
$         HOLD_JNUM = JNUM
$         HOLD_JNAME = JNAME
$         HOLD_JFILENAME = JFILENAME
$         HOLD_NEW_JFILENAME = NEW_JFILENAME
$         HOLD_JTEXTFILENAME = JTEXTFILENAME
$         HOLD_NEW_JTEXTFILENAME = NEW_JTEXTFILENAME
$         HOLD_JSUBTIME = JSUBTIME
$         HOLD_COMP_JSUBTIME = F$CVTIME(HOLD_JSUBTIME)
$       ENDIF
$       NOACCESS = F$GETQUI("DISPLAY_JOB", -
            "JOB_INACCESSIBLE",, "ALL_JOBS,PENDING_JOBS")
$       JNUM = F$GETQUI("DISPLAY_JOB", -
            "ENTRY_NUMBER",, "FREEZE_CONTEXT,PENDING_JOBS")
$       sho sym jnum
$       GOTO ENTRY_LOOP
$     ENDIF ! IF JNUM .NES. ""
$ END_ENTRY_LOOP:
$     !
$     ! At this point we've looped through all the pending jobs and found the
$     ! earliest one, which we're guessing is the culprit. If there were
$     ! no pending jobs then none of the hold_... symbols will be defined.
$     ! But if the hold_... symbols are defined then we have a job to put on
$     ! hold.
$     IF F$TYPE(HOLD_JNUM) .NES. ""
$     THEN
$       write sys$output "candidate: ''HOLD_JNUM' ''HOLD_JNAME' ''HOLD_JSUBTIME'"
$       !
$       ! Copy the control file for safe keeping
$       IF F$SEARCH("''HOLD_JFILENAME'") .NES. ""
$       THEN COPY/LOG 'HOLD_JFILENAME 'HOLD_NEW_JFILENAME
$       ELSE WRITE SYS$OUTPUT "File ''HOLD_JFILENAME' does not exist"
$       ENDIF
$       !
$       ! If there's a text file save it too.
$       IF F$SEARCH("''HOLD_JTEXTFILENAME'") .NES. ""
$       THEN COPY/LOG 'HOLD_JTEXTFILENAME 'HOLD_NEW_JTEXTFILENAME
$       ENDIF
$       !
$       ! Put the entry on hold
$       SET ENTRY/HOLD 'HOLD_JNUM
$       !
$       ! Inform operator of what we've done.
$       OPCOM_MSG = "TCPIP$RESTART_SMTPQ Holding Entry: ''HOLD_JNUM', Job: ''HOLD_JNAME'"
$       REQUEST "''OPCOM_MSG'"
$     ENDIF
$
$     ON ERROR THEN GOTO RESTART_ERROR
$     RESUBMIT_SELF = "TRUE"
$     TCPIP START MAIL
$     OPCOM_MSG = "TCPIP$RESTART_SMTPQ Restarted the SMTP queues"
$     REQUEST "''OPCOM_MSG'"
$     GOTO SMTP_CONT1
$RESTART_ERROR:
$     ERR_MSG = F$MESSAGE($STATUS)
$     OPCOM_MSG = "TCPIP$RESTART_SMTPQ Error restarting SMTP queues: ''ERR_MSG'"
$     REQUEST "''OPCOM_MSG'"
$     RESUBMIT_SELF = "FALSE"
$SMTP_CONT1:
$   ENDIF ! IF TCPIP$SMTP_MAILSHR is installed
$ ELSE
$   !
$   ! Execution queue(s) running. Check whether they need to be stopped because
$   ! a symbiont has gone below the desired number of free page file count blocks.
$   !
$   ! First, get the symbiont process's PAGFILCNT
$   WRITE SYS$OUTPUT "Queue(s) running."
$   ON ERROR THEN CONTINUE
$!
$! Count the number of symbiont processes at or above the PAGFILCNT_MIN limit
$   TEMP_QCOUNT = 0
$   SYMB_PROC_QCOUNT = 0
$!
$CHECK_SYMB_MEM:
$   TEMP_QCOUNT = TEMP_QCOUNT + 1
$   IF TEMP_QCOUNT .GT. EXEC_QCOUNT THEN GOTO CHECK_SYMB_MEM_DONE
$   SMTP_SYMBIONT = "SMTP_''f$getsyi("NODENAME")'_0''TEMP_QCOUNT'"
$   SHO SYM SMTP_SYMBIONT
$   CTX = ""
$   TEMP = F$CONTEXT ("PROCESS", CTX, "USERNAME", "SYSTEM","EQL")
$   TEMP = F$CONTEXT ("PROCESS", CTX, "PRCNAM", "''SMTP_SYMBIONT'","EQL")
$   PID = F$PID(CTX)
$   ! If no matching symbiont process was found, PID is blank and F$GETJPI
$   ! would report on *this* batch process instead, so don't count it as leaking.
$   IF PID .EQS. ""
$   THEN
$     WRITE SYS$OUTPUT "No process named ''SMTP_SYMBIONT' found"
$     SYMB_PROC_QCOUNT = SYMB_PROC_QCOUNT + 1
$     GOTO CHECK_SYMB_MEM
$   ENDIF
$   MEM_LEFT = F$GETJPI("''PID'", "PAGFILCNT")
$   WRITE SYS$OUTPUT -
        F$TIME()+" PID of ''SMTP_SYMBIONT' "+pid+" Pages left: "+F$STRING(MEM_LEFT)
$   !
$   ! If the page file count is at or above the limit then increment the count
$   IF MEM_LEFT .GE. PAGFILCNT_MIN THEN SYMB_PROC_QCOUNT = SYMB_PROC_QCOUNT + 1
$   GOTO CHECK_SYMB_MEM
$!
$ CHECK_SYMB_MEM_DONE:
$   !
$   ! If the page file count for any process is less than the limit then
$   ! stop all the queues.
$   IF SYMB_PROC_QCOUNT .LT. EXEC_QCOUNT
$   THEN
$     WRITE SYS$OUTPUT -
        "At least one of the symbiont processes has gone below the PAGFILCNT limit. Stopping all SMTP execution queues"
$     GOSUB STOP_ALL_QUEUES
$     !
$     ! Create flag file to indicate queue(s) stopped purposefully
$     IF F$SEARCH("TCPIP$RESTART_SMTPQ_FLAG.TXT") .NES. ""
$     THEN DELETE TCPIP$RESTART_SMTPQ_FLAG.TXT;*
$     ENDIF
$
$     CREATE TCPIP$RESTART_SMTPQ_FLAG.TXT
This file is created when SYS$SPECIFIC:[TCPIP$SMTP]TCPIP$RESTART_SMTPQ.COM
stops the SMTP execution queues because it has detected that the PAGFILCNT of
one of the symbiont processes is lower than the requested minimum (P3).
Please don't delete this file. TCPIP$RESTART_SMTPQ.COM will delete it later.
$     !
$     ! Record this in operator's log.
$     OPCOM_MSG = "TCPIP$RESTART_SMTPQ Stopping SMTP execution queues. " +-
          "Less than minimum requested PAGFILCNT remaining."
$     REQUEST "''OPCOM_MSG'"
$   ELSE
$     ! NEED TO LOOK AT THIS: multiple exec qs won't work.
$!    Need to scan all logs for currently active qs?
$     !
$     ! If we find any of the strings in P4 in the log then we will also stop
$     ! the queue.
$     IF TERMINATE_SRCHSTR .NES. ""
$     THEN
$!      NEED loop here to search each EXEC_QCOUNT version of the log file
$       SEARCH SYS$SPECIFIC:[TCPIP$SMTP]TCPIP$SMTP_LOGFILE.LOG 'TERMINATE_SRCHSTR
$       SEARCH_STATUS = $STATUS
$       IF SEARCH_STATUS .EQ. 1
$       THEN
$         GOSUB STOP_ALL_QUEUES
$         !
$         ! Create flag file to indicate queue(s) stopped purposefully
$         IF F$SEARCH("TCPIP$RESTART_SMTPQ_FLAG.TXT") .NES. ""
$         THEN DELETE TCPIP$RESTART_SMTPQ_FLAG.TXT;*
$         ENDIF
$
$         CREATE TCPIP$RESTART_SMTPQ_FLAG.TXT
This file is created when SYS$SPECIFIC:[TCPIP$SMTP]TCPIP$RESTART_SMTPQ.COM
stops the SMTP execution queues because it has found in the log one of the
errors that the user wants us to stop the queue for (P4).
Please don't delete this file. TCPIP$RESTART_SMTPQ.COM will delete it later.
$         !
$         ! Record this in operator's log.
$         OPCOM_MSG = "TCPIP$RESTART_SMTPQ Stopping SMTP execution queues. " +-
              "Found terminating error in log file."
$         REQUEST "''OPCOM_MSG'"
$       ENDIF
$     ENDIF
$   ENDIF
$ ENDIF ! If execution queue is stopped
$!
$ IF RESUBMIT_SELF
$ THEN
$   PURGE/keep=2 SYS$LOGIN:TCPIP$RESTART_SMTPQ.LOG
$   SUBMIT SYS$SPECIFIC:[TCPIP$SMTP]TCPIP$RESTART_SMTPQ.COM/AFTER="''SUB_TIME'"-
        /NOPRINT/QUEUE='SUB_Q/PARAM=("''SUB_Q'","''SUB_TIME'","''PAGFILCNT_MIN'","''TERMINATE_SRCHSTR'")
$ ENDIF
$!
$ EXIT
$!
$STOP_ALL_QUEUES:
$ WRITE SYS$OUTPUT "Stopping all SMTP execution queues"
$ TEMP_QCOUNT = 0
$ STOP_NEXT_QUEUE:
$ TEMP_QCOUNT = TEMP_QCOUNT + 1
$ IF TEMP_QCOUNT .GT. EXEC_QCOUNT THEN GOTO STOP_ALL_QUEUES_DONE
$ CUR_EXEC_Q = "TCPIP$SMTP_''f$getsyi("NODENAME")'_0''TEMP_QCOUNT'"
$ STOP/QUEUE/NEXT 'CUR_EXEC_Q
$ GOTO STOP_NEXT_QUEUE
$!
$STOP_ALL_QUEUES_DONE:
$ RETURN