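I have a bash shell script that launches about 70 instances of a Python application. Each Python instance runs TensorFlow 2.0, waking up once an hour to do some work. The script works fine when run from a user shell, but when run from cron it core dumps after the 36th instance of the job.
I have the shell script set up to use fully qualified paths, and I have verified that the environment is the same in both cases.
This runs on a 36-core machine on AWS running Ubuntu: #56-Ubuntu SMP Thu Nov 7 16:15:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
There seems to be a limit on the number of "tasks" that cron can run.
Is there a setting that changes the number of tasks allowed under cron?
Here is the crontab entry: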
*/5 * * * * /myscripts/watchdog.sh >> /myscripts/watchdog.log 2>&1
So it runs every 5 minutes and checks on the running processes. If they are not running, it starts them.
#!/bin/bash
# https://serverfault.com/questions/710847/how-to-apply-memory-limits-to-all-cron-jobs
# checking the cron ulimit
# systemctl status cron
# more /etc/pam.d/cron
# talking about /etc/security/limits.conf
export PATH=/runner/venv/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
/bin/echo "##################### watchdog.sh running now #####################"
/bin/date
export LANG=C.UTF-8
export USER=ubuntu
export HOME=/home/ubuntu
export MAIL=/var/mail/ubuntu
export SHELL=/bin/bash
export LOGNAME=ubuntu
# https://unix.stackexchange.com/questions/162104/how-to-change-the-kernel-max-pid-number
# pid_max is 4194304 for 64 bit
if grep -q 56000 /proc/sys/kernel/pid_max; then
    /bin/echo "/proc/sys/kernel/pid_max = 56000"
else
    /bin/echo 56000 | sudo tee /proc/sys/kernel/pid_max
fi
# https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt
if grep -q 48000 /sys/fs/cgroup/pids/user.slice/user-1000.slice/pids.max; then
    /bin/echo "/sys/fs/cgroup/pids/user.slice/user-1000.slice/pids.max = 48000"
else
    /bin/echo 48000 | /usr/bin/sudo tee /sys/fs/cgroup/pids/user.slice/user-1000.slice/pids.max
fi
export DEPLOY_ENV="system_one"
export VIRTUAL_ENV="/runner/venv"
hash -r
# see https://stackoverflow.com/questions/51256738/multiple-instances-of-python-running-simultaneously-limited-to-35
#export OPENBLAS_NUM_THREADS=1
#export OMP_NUM_THREADS=1
export AEP="/runner/analyzerengine"
export PID_FILE_DIR="/runner/pids"
export OUT_FILE_DIR="/runner/out"
while read -r producer; do
    # Strip carriage returns in case producer_list.txt has Windows line endings.
    producer="$(/bin/echo "$producer" | /bin/sed 's/\r//g')"
    export PIDFILE="${PID_FILE_DIR}/${producer}.pid"
    /bin/echo "Checking producer=$producer in file $PIDFILE"
    # The instance counts as running if its PID file exists and ps still finds that PID.
    if [ -e "${PIDFILE}" ] && [ "$(/bin/ps -o pid= -p "$(/bin/sed 's/ //g' < "${PIDFILE}")")" ] ; then
        /bin/echo "${producer} process PID check OK (running) on $(/bin/date) ."
    else
        /bin/echo "Restarting ${producer} process on $(/bin/date)..."
        /bin/echo "executing: ${VIRTUAL_ENV}/bin/python ${AEP}/runnerCode.py --producer=${producer} --deployment=${DEPLOY_ENV} &> ${OUT_FILE_DIR}/${producer}.log &"
        "${VIRTUAL_ENV}/bin/python" "${AEP}/runnerCode.py" --producer="${producer}" --deployment="${DEPLOY_ENV}" > "${OUT_FILE_DIR}/${producer}.log" &
        # Record the PID of the background job that was just started.
        /bin/echo $! > "${PIDFILE}"
        /bin/chmod 644 "${OUT_FILE_DIR}/${producer}.log"
        /bin/chmod 644 "${PIDFILE}"
        /bin/echo "...done."
    fi
done < "${AEP}/producer_list.txt"
Running the command: $ systemctl status cron
produces the following output:
cron.service - Regular background program processing daemon
   Loaded: loaded (/lib/systemd/system/cron.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-11-24 16:59:41 UTC; 2 days ago
     Docs: man:cron(8)
 Main PID: 1191 (cron)
    Tasks: 5391 (limit: 5529)
   CGroup: /system.slice/cron.service
           ├─ 1191 /usr/sbin/cron -f
           ├─40750 /runner/venv/bin/python /runner/analyzerengine/runnerCode.py --producter=customer_A --deployment=system_one
           ├─40791 /runner/venv/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
           ...
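Only 36 processes get started from this script. When I run the script as a user (username = ubuntu), I can start all 70 processes without trouble. Clearly, some limit somewhere is not set correctly.
Since each instance of runnerCode.py spawns hundreds of threads (something built into TensorFlow that I cannot control), I needed to set /proc/sys/kernel/pid_max to 56000 and /sys/fs/cgroup/pids/user.slice/user-1000.slice/pids.max to 48000.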
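To see how close a control group is to its limit, the two pids controller files can be read back directly. The following is a minimal sketch, assuming the cgroup-v1 layout used above; show_pids_usage is a hypothetical helper name and is not part of watchdog.sh.

```shell
#!/bin/bash
# show_pids_usage CGROUP_DIR
# Hypothetical helper (not part of watchdog.sh): prints "DIR: current/max"
# for one pids cgroup directory, or returns 1 if the files are not readable.
show_pids_usage() {
    local cgroup_dir="$1"
    local current max
    # pids.current and pids.max are the cgroup-v1 pids controller files
    # described in the kernel pids.txt document linked from the script.
    if [ ! -r "${cgroup_dir}/pids.current" ] || [ ! -r "${cgroup_dir}/pids.max" ]; then
        echo "${cgroup_dir}: pids controller files not readable" >&2
        return 1
    fi
    current="$(cat "${cgroup_dir}/pids.current")"
    max="$(cat "${cgroup_dir}/pids.max")"
    echo "${cgroup_dir}: ${current}/${max}"
}
```

For example, show_pids_usage /sys/fs/cgroup/pids/user.slice/user-1000.slice would print the current and maximum task counts for that user slice.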
Is there some setting in systemctl that needs to be changed to allow more processes to run?
Thanks in advance!
It turned out that I also needed to set the pid limit for the cron job. This can be done as follows:
/bin/echo 48000 | /usr/bin/sudo tee /sys/fs/cgroup/pids/system.slice/cron.service/pids.max
This sets the limit on the cron service's control group to 48000, so the setup no longer hits the thread limit.
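As a rough sanity check on why the default limit was hit, the numbers in the systemctl status output can be turned into an estimate. This is only a sketch: the function names are made up for illustration, and the task count also includes cron itself and helper processes, so the per-instance figure is approximate.

```shell
#!/bin/bash
# Rough estimate of the task (process + thread) budget the cron cgroup needs.
# Inputs come from the `systemctl status cron` output in the question:
# 5391 tasks were in use when 36 instances had started.

# tasks_per_instance TOTAL_TASKS INSTANCES -> tasks per instance, rounded up
tasks_per_instance() {
    local total="$1" instances="$2"
    echo $(( (total + instances - 1) / instances ))
}

# required_tasks PER_INSTANCE TARGET_INSTANCES -> tasks needed for the target count
required_tasks() {
    local per_instance="$1" target="$2"
    echo $(( per_instance * target ))
}

per="$(tasks_per_instance 5391 36)"   # 150
need="$(required_tasks "$per" 70)"    # 10500
echo "about ${per} tasks per instance, about ${need} tasks for 70 instances"
```

Roughly 10500 tasks for 70 instances is well above the 5529 limit shown in the status output and comfortably below 48000. Note that a value written under /sys/fs/cgroup does not survive a reboot, so it has to be reapplied; systemd's TasksMax= setting for cron.service may be a more durable way to get the same effect.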