
how to check Federated Learning Job is working? #419

Open
Sergiossrr opened this issue Oct 12, 2023 · 8 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@Sergiossrr commented Oct 12, 2023

What happened:
I followed "Using Federated Learning Job in Surface Defect Detection Scenario". The last step says: "After the job completed, you will find the model generated on the directory /model in $EDGE1_NODE and $EDGE2_NODE."
So how can I check whether the job is completed or still working? Running kubectl get federatedlearningjob surface-defect-detection only shows NAME and AGE.

Environment:
openEuler 22.03 LTS
kubernetes v1.21.1
kubeedge v1.14.2
edgemesh v1.14.0
sedna v0.6.0

Sedna Version
$ kubectl get -n sedna deploy gm -o jsonpath='{.spec.template.spec.containers[0].image}'
# paste output here

$ kubectl get -n sedna ds lc -o jsonpath='{.spec.template.spec.containers[0].image}'
# paste output here
Kubernetes Version
$ kubectl version
# paste output here
KubeEdge Version
$ cloudcore --version
# paste output here

$ edgecore --version
# paste output here

CloudSide Environment:

Hardware configuration
$ lscpu
# paste output here
OS
$ cat /etc/os-release
# paste output here
Kernel
$ uname -a
# paste output here
Others

EdgeSide Environment:

Hardware configuration
$ lscpu
# paste output here
OS
$ cat /etc/os-release
# paste output here
Kernel
$ uname -a
# paste output here
Others
@Sergiossrr Sergiossrr added the kind/bug Categorizes issue or PR as related to a bug. label Oct 12, 2023
@Sergiossrr (Author)

@MooreZheng @JoeyHwong-gk please give me some suggestions

@JoeyHwong-gk (Contributor)

Hi there,

In the provided example, the federated learning job simulates independent training on two separate edge nodes using their respective training data. The federated learning process combines these independent models' weights on the cloud, achieving the requirements of federated learning.

After job completion, you can find the merged model's weights in the /model directory on either edge node ($EDGE1_NODE or $EDGE2_NODE). This merged model results from federated learning, combining contributions from both nodes. You can compare this merged model with models trained individually to observe differences in performance.

Hope this helps! Feel free to ask if you have further questions.
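The checks described above can be sketched as a few commands (a sketch only; the exact status fields depend on your Sedna CRD version, and the names come from the tutorial in this thread):

```shell
# Cloud side: the default kubectl columns only show NAME and AGE, so query
# the job's status subresource directly to see its phase and conditions.
kubectl get federatedlearningjob surface-defect-detection -o jsonpath='{.status}'
kubectl describe federatedlearningjob surface-defect-detection

# Edge side ($EDGE1_NODE or $EDGE2_NODE): after completion the merged
# weights should appear under /model.
ls -lh /model
```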

@Sergiossrr (Author)

Thanks for your reply @JoeyHwong-gk. Besides, I have two questions.

Q1: Which version is recommended for the train and aggregation images? I use isula instead of docker. After directly running isula pull images:v0.3.0, the pods' status is OutOfMemory, and after pulling v0.5.0, the pods' status is CrashLoopBackOff.
Q2: How can I check that the job is running? In other words, how can I check the logs while the job is running? Or is waiting until the /model directory contains the model's weights my only option?

@JoeyHwong-gk (Contributor)

Q1: Which version is recommended for the train and aggregation images? I use isula instead of docker. After directly running isula pull images:v0.3.0, the pods' status is OutOfMemory, and after pulling v0.5.0, the pods' status is CrashLoopBackOff.
Q2: How can I check that the job is running? In other words, how can I check the logs while the job is running? Or is waiting until the /model directory contains the model's weights my only option?

For Q1:
I cannot confirm the cause of the issues without more information. The containers are automatically pushed to Docker Hub, and all versions should be available. However, I strongly recommend using the latest version, v0.5.1, as it might contain crucial bug fixes and improvements.

For Q2:
Certainly, you can check the running logs from any node (both the cloud-side server and the edge nodes) without waiting for completion. The logs can provide real-time information about the job's progress and any errors that might occur during execution. You don't need to wait until the model's weights appear in the /model directory to access the logs.

Feel free to review the logs for insights into the job's status and any potential issues. If you encounter specific errors in the logs, please provide those details for further assistance.
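A minimal sketch of such log checks, using the worker pod name that appears later in this thread (substitute the names `kubectl get pods` reports in your own cluster; the aggregation pod name is a placeholder):

```shell
# Find the worker pods the job created (train workers on the edge nodes,
# the aggregation worker on the cloud side).
kubectl get pods -o wide | grep surface-defect-detection

# Follow the logs of a train worker and of the aggregation worker.
kubectl logs -f surface-defect-detection-train-llw2c
kubectl logs -f <aggregation-pod-name>

# If a pod is stuck (Pending, ImagePullBackOff, CrashLoopBackOff, ...),
# the Events section of describe usually explains why.
kubectl describe pod surface-defect-detection-train-llw2c
```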

@JoeyHwong-gk (Contributor)

/assign @jaypume

@Sergiossrr (Author)

Thanks for your reply @JoeyHwong-gk @jaypume.

For Q1:
The train pods still do not work.
When I pull images:v0.5.1, the error message is "fetch and parse manifest failed". After switching to v0.4.0, the aggregation pod runs, but the train pod errors out with status ImagePullBackOff.

Relevant information is as follows:
kubectl logs surface-defect-detection-train-llw2c
container "train-worker" in pod "surface-defect-detection-train-llw2c" is waiting to start: trying and failing to pull image

On EDGE1_NODE, isula ps

CONTAINER ID    IMAGE                                                                   COMMAND                 CREATED         STATUS          PORTS   NAMES                           
72a491b1f348    kubeedge/pause:3.1                                                      "/pause"                27 minutes ago  Up 27 minutes           k8s_POD_surface-defect-detection-train-n6t5n_default_b6c0235c-f382-4740-a22a-1974910049d0_0
d8131b19f7df    f64c26f478a3d054ef86062baf4692e3188297632a5487c70d1ff03398d73a1c        "sedna-lc"              4 hours ago     Up 4 hours              k8s_lc_lc-hwxx9_sedna_f59e93da-c6cd-41c6-8a52-581a43dee5f7_0
61209b5d9b6a    kubeedge/pause:3.1                                                      "/pause"                4 hours ago     Up 4 hours              k8s_POD_lc-hwxx9_sedna_f59e93da-c6cd-41c6-8a52-581a43dee5f7_0
5bee722df55d    f51171e9ee03b0367d8914f615a3b46583411b4dfc6d4f09011c90aea02845f5        "edgemesh-agent"        4 hours ago     Up 4 hours              k8s_edgemesh-agent_edgemesh-agent-w79rf_kubeedge_25770be1-8420-4a19-8add-e80b8bf98f38_0
d7f50545bb10    kubeedge/pause:3.1                                                      "/pause"                4 hours ago     Up 4 hours              k8s_POD_edgemesh-agent-w79rf_kubeedge_25770be1-8420-4a19-8add-e80b8bf98f38_0
eb111d22af29    a6c0cb5dbd21197123942b3469a881f936fd7735f2dc9a22763b6f777f24345e        "/opt/bin/flanneld..."  7 hours ago     Up 7 hours              k8s_kube-flannel-edge_kube-flannel-edge-ds-4xgjx_kube-flannel_f43bd8eb-171c-4882-b05a-a1de33b9bdc0_7
2a96a3e38f14    kubeedge/pause:3.1                                                      "/pause"                7 hours ago     Up 7 hours              k8s_POD_kube-flannel-edge-ds-4xgjx_kube-flannel_f43bd8eb-171c-4882-b05a-a1de33b9bdc0_0
e4598be3a11c    5dade4ce550b85d4a56054bc8d74e72350f46613129145c28dd7fa39ccf2c6be        "/docker-entrypoin..."  10 hours ago    Up 10 hours             k8s_mqtt_mqtt-kubeedge_default_d2c774f6-c412-4a38-8aba-08f953c2009c_0
a2ceb1ec5118    kubeedge/pause:3.1                                                      "/pause"                10 hours ago    Up 10 hours             k8s_POD_mqtt-kubeedge_default_d2c774f6-c412-4a38-8aba-08f953c2009c_0

and kubectl edit pod surface-defect-detection-train-llw2c

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-10-17T15:48:13Z"
  generateName: surface-defect-detection-train-
  labels:
    federatedlearningjob.sedna.io/name: surface-defect-detection
    federatedlearningjob.sedna.io/uid: 0b8097ad-83cd-4be6-afc8-ec1b27a60b3e
    federatedlearningjob.sedna.io/worker-type: train
  name: surface-defect-detection-train-llw2c
  namespace: default
  ownerReferences:
  - apiVersion: sedna.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: FederatedLearningJob
    name: surface-defect-detection
    uid: 0b8097ad-83cd-4be6-afc8-ec1b27a60b3e
  resourceVersion: "71706"
  uid: 9008081f-f045-4c28-8eb4-e483eff2be87
spec:
  containers:
  - env:
    - name: batch_size
      value: "32"
    - name: learning_rate
      value: "0.001"
    - name: epochs
      value: "2"
    - name: DATA_PATH_PREFIX
      value: /home/data
    - name: PARTICIPANTS_COUNT
      value: "2"
    - name: NAMESPACE
      value: default
    - name: MODEL_NAME
      value: surface-defect-detection-model
    - name: DATASET_NAME
      value: edge2-surface-defect-detection-dataset
    - name: LC_SERVER
      value: http://localhost:9100
    - name: AGG_PORT
      value: "7363"
    - name: AGG_IP
      value: surface-defect-detection-aggregation.default
    - name: WORKER_NAME
      value: trainworker-c98rq
    - name: TRAIN_DATASET_URL
      value: /home/data/data/2.txt
    - name: JOB_NAME
      value: surface-defect-detection
    - name: TRANSMITTER
      value: ws
    - name: MODEL_URL
      value: /home/data/model
    image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0
    imagePullPolicy: IfNotPresent
    name: train-worker
    resources:
      limits:
        memory: 2Gi
      requests:
        memory: 2Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/data/
      name: sedna-default-volume-name
    - mountPath: /home/data/data/
      name: dataz
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-ffhfg
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  nodeName: kubenode2
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - hostPath:
      path: /
      type: Directory
    name: sedna-default-volume-name
  - hostPath:
      path: /data/
      type: Directory
    name: dataz
  - name: kube-api-access-ffhfg
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T14:50:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T14:50:20Z"
    message: 'containers with unready status: [train-worker]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T14:50:20Z"
    message: 'containers with unready status: [train-worker]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T14:50:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0
    imageID: ""
    lastState: {}
    name: train-worker
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0"
        reason: ImagePullBackOff
  hostIP: 192.168.xxx.xxx
  phase: Pending
  podIP: 192.168.xxx.xxx
  podIPs:
  - ip: 192.168.xxx.xxx
  qosClass: Burstable
  startTime: "2023-10-17T14:50:20Z"

@JoeyHwong-gk (Contributor)

I apologize for the inconvenience you faced with pulling the images. It's possible that network issues caused the problem. As an alternative, I recommend trying to build the containers directly using the build_image.sh script. This way, you can bypass potential network-related problems and create the containers locally.
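A rough sketch of that local build (assuming the script lives under the examples directory of the Sedna repository; check your checkout for the exact path and the tags it produces, and note that isulad-based edge nodes need the resulting images available to their own container runtime):

```shell
# Clone the Sedna repo and build the example images locally instead of
# pulling them from Docker Hub.
git clone https://github.com/kubeedge/sedna.git
cd sedna/examples
bash build_image.sh
```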

@dsj-kaiyue


Hello, I encountered the same issue while trying example 4: Collaboratively Train Yolo-v5 Using MistNet on the COCO128 Dataset. The pods for the edge node show CrashLoopBackOff, and when I run kubectl describe pods yolo-v5-train-897dd, no events are displayed. I am using Docker image version v0.4.3. Can you tell me how to resolve this issue, please?
(Four screenshots attached, dated 2023-12-01.)
