I have tried searching online and debugging this, unfortunately to no avail.
I created a simple PySpark application (dockerized) that I am trying to run in an Argo Workflow. The application just creates a DataFrame and prints it (that's all), and it runs fine when I deploy it manually in the Kubernetes cluster. However, when I run the same Docker image via an Argo Workflow, in the same namespace and cluster, I get the KerberosAuthException below. Can anyone point me in the right direction? I am not using Kerberos anywhere in my application.
hello-world-hp2z4: py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
hello-world-hp2z4: : org.apache.hadoop.security.KerberosAuthException: failure to login: using ticket cache file: FILE:/tmp/krb5cc_0 javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
hello-world-hp2z4: at jdk.security.auth/com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:67)
hello-world-hp2z4: at jdk.security.auth/com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:134)
hello-world-hp2z4: at java.base/javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
hello-world-hp2z4: at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:679)
hello-world-hp2z4: at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:677)
hello-world-hp2z4: at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
hello-world-hp2z4: at java.base/javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:677)
As I said, this only happens when I run it through Argo; otherwise the application runs perfectly well standalone in Kubernetes. Any help would be greatly appreciated! Below are the kubectl describe outputs for the plain pod (test-spark-pod) and the Argo-managed pod (hello-world-bzzkf):
Name: test-spark-pod
Namespace: posas-accsecana-argowf-qa
Priority: 600000000
Priority Class Name: application-default
Service Account: default
Node: kworker-be-intg-iz1-bs017/10.242.8.5
Start Time: Tue, 12 Nov 2024 10:21:42 +0100
Labels: <none>
Annotations: cni.projectcalico.org/containerID: a0957c48cfb01b4d155a2fa1a2ac52b269b1858085d8fae82cc05acba4bcf70b
cni.projectcalico.org/podIP: 100.67.81.137/32
cni.projectcalico.org/podIPs: 100.67.81.137/32
kubernetes.io/limit-ranger:
LimitRanger plugin set: cpu, ephemeral-storage, memory request for container test-spark-container01; cpu, ephemeral-storage, memory limit ...
Status: Running
SeccompProfile: RuntimeDefault
IP: 100.67.81.137
IPs:
IP: 100.67.81.137
Containers:
test-spark-container01:
Container ID: containerd://bddd04b8311e340e8eef70747ddf12f0028553960d54d0f6a9608540e25eb124
Image: docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest
Image ID: docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos@sha256:546428e6d40b9cee30e017da38c922a2e67390ab63161ed3dfa4f19000977b21
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 12 Nov 2024 10:33:56 +0100
Finished: Tue, 12 Nov 2024 10:34:06 +0100
Ready: False
Restart Count: 7
Limits:
cpu: 2
ephemeral-storage: 10Gi
memory: 13Gi
Requests:
cpu: 200m
ephemeral-storage: 300Mi
memory: 1Gi
Environment: <none>
Mounts:
/app/tmp/spark from spark-tmp-volume (rw)
/tmp from tmp-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-442gc (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
spark-tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-442gc:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 17m default-scheduler Successfully assigned posas-accsecana-argowf-qa/test-spark-pod to kworker-be-intg-iz1-bs017
Normal Pulled 17m kubelet Successfully pulled image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest" in 69ms (69ms including waiting)
Normal Pulled 16m kubelet Successfully pulled image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest" in 77ms (77ms including waiting)
Normal Created 16m (x4 over 17m) kubelet Created container test-spark-container01
Normal Started 16m (x4 over 17m) kubelet Started container test-spark-container01
Normal Pulled 16m kubelet Successfully pulled image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest" in 56ms (56ms including waiting)
Normal Pulling 15m (x5 over 17m) kubelet Pulling image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest"
Normal Pulled 15m (x2 over 17m) kubelet Successfully pulled image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest" in 70ms (70ms including waiting)
Warning BackOff 2m22s (x63 over 17m) kubelet Back-off restarting failed container test-spark-container01 in pod test-spark-pod_posas-accsecana-argowf-qa(f0fedf94-1a04-449e-a298-449bb356292b)
Name: hello-world-bzzkf
Namespace: posas-accsecana-argowf-qa
Priority: 600000000
Priority Class Name: application-default
Service Account: default
Node: kworker-be-intg-iz1-bs017/10.242.8.5
Start Time: Tue, 12 Nov 2024 09:47:50 +0100
Labels: mam_brand=any
mam_dc=bs
mam_stage=qa
workflows.argoproj.io/completed=true
workflows.argoproj.io/controller-instanceid=posas-accsecana-argowf-qa
workflows.argoproj.io/workflow=hello-world-bzzkf
Annotations: cni.projectcalico.org/containerID: 79c1431e821c7bc1166c10eed57f85a7242d59de0b49d735ad3f19efafc98649
cni.projectcalico.org/podIP:
cni.projectcalico.org/podIPs:
kubectl.kubernetes.io/default-container: main
kubernetes.io/limit-ranger:
LimitRanger plugin set: ephemeral-storage request for container wait; ephemeral-storage limit for container wait; cpu, ephemeral-storage, ...
workflows.argoproj.io/node-id: hello-world-bzzkf
workflows.argoproj.io/node-name: hello-world-bzzkf
Status: Failed
SeccompProfile: RuntimeDefault
IP: 100.67.81.91
IPs:
IP: 100.67.81.91
Controlled By: Workflow/hello-world-bzzkf
Init Containers:
init:
Container ID: containerd://37acc8242db4c6bf9143b06b003760a40bd2f4165e15929f78569bf75cde4ece
Image: cr.mam.dev/internal/mf/commons/argoexec:latest
Image ID: cr.mam.dev/internal/mf/commons/argoexec@sha256:20a7f519ee4d825e5ae4d2693e7fb69f6f16f64fcab605b6400b86afb1a78362
Port: <none>
Host Port: <none>
Command:
argoexec
init
--loglevel
info
--log-format
text
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 12 Nov 2024 09:47:52 +0100
Finished: Tue, 12 Nov 2024 09:47:52 +0100
Ready: True
Restart Count: 0
Limits:
cpu: 500m
ephemeral-storage: 10Gi
memory: 512Mi
Requests:
cpu: 500m
ephemeral-storage: 300Mi
memory: 512Mi
Environment:
ARGO_POD_NAME: hello-world-bzzkf (v1:metadata.name)
ARGO_POD_UID: (v1:metadata.uid)
GODEBUG: x509ignoreCN=0
ARGO_WORKFLOW_NAME: hello-world-bzzkf
ARGO_WORKFLOW_UID: eb02ba92-2b43-4c64-9f1e-85747bb27a34
ARGO_INSTANCE_ID: posas-accsecana-argowf-qa
ARGO_CONTAINER_NAME: init
ARGO_TEMPLATE: {"name":"whalesay","inputs":{},"outputs":{},"metadata":{"labels":{"mam_brand":"any","mam_dc":"bs","mam_stage":"qa"}},"container":{"name":"","image":"docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest","command":["python3","pyspark_script.py"],"resources":{},"volumeMounts":[{"name":"tmp-volume","mountPath":"/tmp"},{"name":"spark-tmp-volume","mountPath":"/app/tmp/spark"}]}}
ARGO_NODE_ID: hello-world-bzzkf
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 0001-01-01T00:00:00Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
/var/run/argo from var-run-argo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pqw7g (ro)
Containers:
wait:
Container ID: containerd://b742467b6b267cc6c82c2a02b821d980491a7113c6cebaaeeb4243cf9fd9f480
Image: cr.mam.dev/internal/mf/commons/argoexec:latest
Image ID: cr.mam.dev/internal/mf/commons/argoexec@sha256:20a7f519ee4d825e5ae4d2693e7fb69f6f16f64fcab605b6400b86afb1a78362
Port: <none>
Host Port: <none>
Command:
argoexec
wait
--loglevel
info
--log-format
text
State: Terminated
Reason: Error
Message: pods "hello-world-bzzkf" is forbidden: User "system:serviceaccount:posas-accsecana-argowf-qa:default" cannot patch resource "pods" in API group "" in the namespace "posas-accsecana-argowf-qa"
Exit Code: 1
Started: Tue, 12 Nov 2024 09:47:53 +0100
Finished: Tue, 12 Nov 2024 09:47:58 +0100
Ready: False
Restart Count: 0
Limits:
cpu: 500m
ephemeral-storage: 10Gi
memory: 512Mi
Requests:
cpu: 500m
ephemeral-storage: 300Mi
memory: 512Mi
Environment:
ARGO_POD_NAME: hello-world-bzzkf (v1:metadata.name)
ARGO_POD_UID: (v1:metadata.uid)
GODEBUG: x509ignoreCN=0
ARGO_WORKFLOW_NAME: hello-world-bzzkf
ARGO_WORKFLOW_UID: eb02ba92-2b43-4c64-9f1e-85747bb27a34
ARGO_INSTANCE_ID: posas-accsecana-argowf-qa
ARGO_CONTAINER_NAME: wait
ARGO_TEMPLATE: {"name":"whalesay","inputs":{},"outputs":{},"metadata":{"labels":{"mam_brand":"any","mam_dc":"bs","mam_stage":"qa"}},"container":{"name":"","image":"docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest","command":["python3","pyspark_script.py"],"resources":{},"volumeMounts":[{"name":"tmp-volume","mountPath":"/tmp"},{"name":"spark-tmp-volume","mountPath":"/app/tmp/spark"}]}}
ARGO_NODE_ID: hello-world-bzzkf
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 0001-01-01T00:00:00Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
/mainctrfs/app/tmp/spark from spark-tmp-volume (rw)
/mainctrfs/tmp from tmp-volume (rw)
/tmp from tmp-dir-argo (rw,path="0")
/var/run/argo from var-run-argo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pqw7g (ro)
main:
Container ID: containerd://bf8da42bcf0b4a210e5fbb8206d2a11660641c2bf7c06f1adebe00c0d04e122b
Image: docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest
Image ID: docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos@sha256:546428e6d40b9cee30e017da38c922a2e67390ab63161ed3dfa4f19000977b21
Port: <none>
Host Port: <none>
Command:
/var/run/argo/argoexec
emissary
--loglevel
info
--log-format
text
--
python3
pyspark_script.py
State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 12 Nov 2024 09:47:54 +0100
Finished: Tue, 12 Nov 2024 09:47:57 +0100
Ready: False
Restart Count: 0
Limits:
cpu: 2
ephemeral-storage: 10Gi
memory: 13Gi
Requests:
cpu: 200m
ephemeral-storage: 300Mi
memory: 1Gi
Environment:
ARGO_CONTAINER_NAME: main
ARGO_TEMPLATE: {"name":"whalesay","inputs":{},"outputs":{},"metadata":{"labels":{"mam_brand":"any","mam_dc":"bs","mam_stage":"qa"}},"container":{"name":"","image":"docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest","command":["python3","pyspark_script.py"],"resources":{},"volumeMounts":[{"name":"tmp-volume","mountPath":"/tmp"},{"name":"spark-tmp-volume","mountPath":"/app/tmp/spark"}]}}
ARGO_NODE_ID: hello-world-bzzkf
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 0001-01-01T00:00:00Z
ARGO_PROGRESS_FILE: /var/run/argo/progress
ARGO_PROGRESS_PATCH_TICK_DURATION: 1m0s
ARGO_PROGRESS_FILE_TICK_DURATION: 3s
Mounts:
/app/tmp/spark from spark-tmp-volume (rw)
/tmp from tmp-volume (rw)
/var/run/argo from var-run-argo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pqw7g (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
var-run-argo:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
tmp-dir-argo:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
spark-tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-pqw7g:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m57s default-scheduler Successfully assigned posas-accsecana-argowf-qa/hello-world-bzzkf to kworker-be-intg-iz1-bs017
Normal Pulling 2m57s kubelet Pulling image "cr.mam.dev/internal/mf/commons/argoexec:latest"
Normal Pulled 2m56s kubelet Successfully pulled image "cr.mam.dev/internal/mf/commons/argoexec:latest" in 299ms (299ms including waiting)
Normal Created 2m56s kubelet Created container init
Normal Started 2m56s kubelet Started container init
Normal Pulling 2m55s kubelet Pulling image "cr.mam.dev/internal/mf/commons/argoexec:latest"
Normal Pulled 2m55s kubelet Successfully pulled image "cr.mam.dev/internal/mf/commons/argoexec:latest" in 98ms (98ms including waiting)
Normal Created 2m55s kubelet Created container wait
Normal Started 2m55s kubelet Started container wait
Normal Pulling 2m55s kubelet Pulling image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest"
Normal Pulled 2m54s kubelet Successfully pulled image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest" in 93ms (93ms including waiting)
Normal Created 2m54s kubelet Created container main
Normal Started 2m54s kubelet Started container main
# pyspark_script.py
import os
from pyspark.sql import SparkSession
print("Starting PySpark application...")
print(os.environ['JAVA_HOME'])
# Create a Spark session
spark = SparkSession.builder \
    .appName('pyspark-kerberos') \
    .master('local[2]') \
    .config('spark.executor.instances', 2) \
    .config('spark.executor.cores', 2) \
    .config('spark.executor.memory', '5g') \
    .config('spark.jars.ivy', '/app/tmp/spark') \
    .getOrCreate()
# spark.sparkContext.setLogLevel("DEBUG")
# Sample DataFrame
data = [("Alice", 29), ("Bob", 31), ("Cathy", 27)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Print the DataFrame
df.show()
# Stop the Spark session
spark.stop()
I was able to solve this, so I am posting the answer here as well:
It was indeed a problem with the Unix principal when Spark runs inside a Docker container. However, I had already tried adding a username in the Docker image, as well as the other suggestions here on Stack Overflow, and none of them seemed to work.
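The failure mode can be reproduced outside the JVM. Hadoop's UnixLoginModule resolves the current UID to a login name, and when the container runs as a UID that has no /etc/passwd entry, the lookup yields no name, which surfaces as the "invalid null input: name" NullPointerException in the stack trace above. A minimal Python sketch of the equivalent lookup (the function name is mine; assumes a Linux-like environment):

```python
import pwd

def resolve_user(uid: int):
    """Return the login name for a UID, or None when the UID has no
    passwd entry -- the same condition that makes Hadoop's
    UnixLoginModule fail with "invalid null input: name"."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None

print(resolve_user(0))          # UID 0 ("root") exists in virtually every image
print(resolve_user(123456789))  # an unassigned UID resolves to None
```

Running this inside the failing container (as whatever UID Argo starts the main container with) shows whether the UID resolves to a name at all.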
I got the hint for the fix by reading this; it looks like the Argo container cannot supply a username to the Spark Docker container. So I added the following under the template's container in the Argo workflow YAML:
securityContext:
runAsUser: 1000
runAsGroup: 3000
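For context, the securityContext sits at the template level, so the template ends up looking roughly like this (a sketch only: workflow and template names are illustrative, the image and command are taken from the pod description above, and the chosen UID should be one that is resolvable inside the image):

```yaml
# Sketch of an Argo Workflow template with the securityContext applied.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: whalesay
  templates:
    - name: whalesay
      securityContext:        # applies to the pod created for this template
        runAsUser: 1000
        runAsGroup: 3000
      container:
        image: docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest
        command: [python3, pyspark_script.py]
```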
And voilà! It got past the KerberosAuthException.