The problem I have is that Ray will not distribute work across my workers.
Collectively I have 16 cores: 8 CPUs on each of two Ubuntu AWS EC2 instances.
However, when I launch my Ray cluster and submit my Python script, it only distributes over 8 cores, as only 8 PIDs show up as being utilized.
It's also worth noting that I'm unable to access the Ray dashboard on the EC2 instance; I only have this information from printing the PIDs being used.
How do I get my script to be executed by all 16 CPUs, and therefore show 16 PIDs being used to run it?
This is my script:
import os
import ray
import time
import xgboost
from xgboost.sklearn import XGBClassifier

def printer():
    print("INSIDE WORKER " + str(time.time()) + " PID : " + str(os.getpid()))

# decorators allow futures to be created for parallelization
@ray.remote
def func_1():
    # model = XGBClassifier()
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count

@ray.remote
def func_2():
    # model = XGBClassifier()
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count

@ray.remote
def func_3():
    count = 0
    for i in range(100000000):
        count += 1
    printer()
    return count

def main():
    # model = XGBClassifier()
    start = time.time()
    results = []
    ray.init(address='auto')
    # append function futures
    for i in range(10):
        results.append(func_1.remote())
        results.append(func_2.remote())
        results.append(func_3.remote())
    # run in parallel and get the aggregated list
    a = ray.get(results)
    b = 0
    # add all values in the list together
    for j in range(len(a)):
        b += a[j]
    print(b)
    # time to complete
    end = time.time()
    print(end - start)

if __name__ == '__main__':
    main()
This is my config:
# A unique identifier for the head node and workers of this cluster.
cluster_name: basic-ray-123454

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers. min_workers defaults to 0.
max_workers: 2  # two worker nodes in addition to the head node
min_workers: 2

# Cloud-provider specific configuration.
provider:
    type: aws
    region: eu-west-2
    availability_zone: eu-west-2a
    file_mounts_sync_continuously: False

auth:
    ssh_user: ubuntu
    ssh_private_key: /home/user/.ssh/aws_ubuntu_test.pem

head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxa6b31fd2c
    KeyName: aws_ubuntu_test
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 200

worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxx26a6b31fd2c
    KeyName: aws_ubuntu_test

file_mounts: {
    "/home/ubuntu": "/home/user/RAY_AWS_DOCKER/ray_example_2_4/conda_env.yaml"
}

setup_commands:
    - echo "start initialization_commands"
    - sudo apt-get update
    - sudo apt-get upgrade -y
    - sudo apt-get install -y python-setuptools
    - sudo apt-get install -y build-essential curl unzip psmisc
    - pip install --upgrade pip
    - pip install ray[all]
    - echo "all files :"
    - ls
    # - conda install -c conda-forge xgboost

head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
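Regarding the dashboard, one thing I tried (not sure whether it's related to the distribution problem) was binding it to all interfaces in the head start command, since by default it only listens on localhost; `--dashboard-host` is a standard `ray start` flag, and the dashboard defaults to port 8265, which the instance's security group also has to allow:

head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --dashboard-host=0.0.0.0 --autoscaling-config=~/ray_bootstrap_config.yaml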