

Rough notes on Ray

Just a bunch of snippets that we used while trying to deploy a local Ray cluster. We could not get the second node to connect to the cluster, even though no error was reported.

From https://www.anaconda.com/products/individual-b
 wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
 bash ./Anaconda3-2021.05-Linux-x86_64.sh
 source ~/anaconda3/bin/activate
 conda create --name ray.3.7 python=3.7.10;
 conda activate ray.3.7;
 conda install --name ray.3.7 pip;
 pip install ray==1.1.0
 pip install gym pandas torch "ray[default,rllib,serve,tune]==1.1.0";
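
A quick sanity check of the installation (our own addition, not part of the original notes) is to print the Ray version and start a throwaway local instance inside the ray.3.7 environment:

 ray --version
 python -c "import ray; ray.init(); print(ray.cluster_resources()); ray.shutdown()"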

 ON WORKERS:
 From https://docs.docker.com/engine/install/ubuntu/
 sudo apt-get remove docker docker-engine docker.io containerd runc;
 sudo apt-get update;
 sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release;
 curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
 echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
 sudo apt-get update
 sudo apt-get install docker-ce docker-ce-cli containerd.io
 sudo addgroup --system docker
 sudo adduser $USER docker
 newgrp docker
 sudo systemctl restart docker
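
To confirm that Docker is usable by the non-root user on each worker (a generic check we are adding here, not something recorded in the original notes):

 docker info
 docker run --rm hello-world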

 Uninstall Docker Engine
 Uninstall the Docker Engine, CLI, and Containerd packages:
 sudo apt-get purge docker-ce docker-ce-cli containerd.io
 Images, containers, volumes, or customized configuration files on your host are not automatically removed. To delete all images, containers, and volumes:
 sudo rm -rf /var/lib/docker
 sudo rm -rf /var/lib/containerd
 You must delete any edited configuration files manually.


 2021-06-09 03:56:34,453    WARNING services.py:1740 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 7963275264 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
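
To see how much space /dev/shm actually has on a node before starting Ray (a hedged suggestion on our part):

 df -h /dev/shm

The run_options entry in the YAML file below passes --shm-size=10.24gb to docker run for exactly this reason.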

 ON MASTER:
 ray up office.yaml
 ray up -vvvvvv office.yaml
 ray exec office.yaml 'ray status'
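
A few other standard cluster-launcher subcommands that can help when debugging (we did not record these in the original notes, but they belong to the same ray CLI):

 ray attach office.yaml    # open a shell on the head node
 ray monitor office.yaml   # follow the autoscaler/monitor output
 ray down office.yaml      # tear the cluster down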

The YAML file (office.yaml) that we used was the following:

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
    image: "rayproject/ray-ml:1.1.0" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: ["--shm-size=10.24gb"]  # Extra options to pass into "docker run"

provider:
    type: local
    head_ip: 192.168.1.14
    ## head_ip: YOUR_HEAD_NODE_HOSTNAME
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips: [192.168.1.70]
    ## worker_ips: [WORKER_NODE_1_HOSTNAME, WORKER_NODE_2_HOSTNAME, ... ]
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"
    # cache_stopped_nodes: False

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: tux
    #ssh_user: YOUR_USERNAME
    # Optional if an ssh private key is necessary to ssh to the cluster.
    ssh_private_key: ~/.ssh/id_rsa

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
min_workers: 20

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
max_workers: 20
# The default behavior for manually managed clusters is
# min_workers == max_workers == len(worker_ips),
# meaning that Ray is started on all available nodes of the cluster.
# For automatically managed clusters, max_workers is required and min_workers defaults to 0.

initial_workers: 20

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" tag (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379
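
Since the second node never joined, a generic way to inspect a worker directly (not something we recorded at the time; the user and IP below come from the YAML above) is to ssh to it and check the Ray container:

 ssh tux@192.168.1.70
 docker ps --filter name=ray_container
 docker logs ray_container
 docker exec -it ray_container ray status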

Do not use the snap version of Docker for Ray on Ubuntu 20.04 LTS

If you are trying to deploy a local Ray cluster on Ubuntu machines and you are getting the following error:

Shared connection to 192.168.1.74 closed.
     Running docker exec ray_container printenv HOME
       Full command is ssh -tt -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_71415f9f14/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s tux@192.168.1.74 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec ray_container printenv HOME)'
 Shared connection to 192.168.1.74 closed.
     Running docker cp /tmp/ray_tmp_mount/default/~/ray_bootstrap_config.yaml ray_container:/home/ray/ray_bootstrap_config.yaml
       Full command is ssh -tt -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_71415f9f14/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s tux@192.168.1.74 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker cp /tmp/ray_tmp_mount/default/~/ray_bootstrap_config.yaml ray_container:/home/ray/ray_bootstrap_config.yaml)'
 lstat /tmp/ray_tmp_mount/default/~: no such file or directory
 Shared connection to 192.168.1.74 closed.
 2021-06-09 11:41:09,299    INFO node_provider.py:93 -- ClusterState: Writing cluster state: ['192.168.1.70', '192.168.1.74']

You might need to remove the snap version of Docker and reinstall it following the official Docker instructions, as shown below.
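
If Docker was installed through snap, removing that version first should look something like this (hedged, adjust to your setup):

snap list | grep docker
sudo snap remove docker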

# From https://docs.docker.com/engine/install/ubuntu/
sudo apt-get remove docker docker-engine docker.io containerd runc;
sudo apt-get update;
sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release;

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io

sudo addgroup --system docker
sudo adduser $USER docker
newgrp docker
sudo systemctl restart docker

sign_and_send_pubkey: signing failed for RSA from agent: agent refused operation

When trying to ssh to a machine using a public key, we got the following error:

ssh '[email protected]'
sign_and_send_pubkey: signing failed for RSA "/home/tux/.ssh/id_rsa" from agent: agent refused operation

The problem was that the local .ssh folder had the wrong permissions. To fix it, we issued the following commands:

chmod 700 ~/.ssh;
chmod 600 ~/.ssh/*;
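
If the agent still refuses the key after fixing the permissions, re-adding the identity to the ssh-agent is another step worth trying (our addition, not part of the original fix):

ssh-add -D
ssh-add ~/.ssh/id_rsa
ssh-add -l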


Reset a WatchGuard Firebox M200

  1. Power on the WatchGuard Firebox M200.
  2. Wait until the Arm indicator (shield symbol) turns solid green.
    The indicator is on the front of the device (it is the second light in the column of three lights).
  3. Press and hold the Reset button on the front of the device.
    The Reset button is right next to the three lights.
    Do not release the button until step 8.
  4. After five seconds, the Arm indicator will turn red.
  5. Continue to hold the Reset button even if the Arm indicator is not lit.
  6. After approximately 40 seconds, the Arm indicator should start to flash green.
  7. Continue to hold the Reset button while the Arm indicator flashes green once per second.
  8. Once the Arm indicator starts to flash green twice per second, release the Reset button.
  9. Wait until the Arm indicator starts to flash red again.
  10. Press and hold the Reset button for five seconds to reboot the device.
  11. The Firebox will restart with factory-default settings.

After you perform the reset procedure, the Firebox will be reset to factory-default settings.
Any saved backup images will be deleted from the Firebox.
Interface 0 will be enabled as an external interface, as a DHCP client.
Interface 1 will be enabled as a trusted interface with the IP address 10.0.1.1, and a DHCP server will be enabled.
The default admin and status management user accounts will be available with their default passphrases (the admin passphrase is readwrite).
The Web Setup Wizard will start automatically when you log in to Fireware Web UI.
The device will be discoverable by the WSM Quick Setup Wizard.
The device will be discoverable as a new FireCluster member (if the device supports FireCluster).
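
The Fireware Web UI is normally served over HTTPS on port 8080 of the trusted interface, so after the reset the login page should be at the address below (our assumption based on the defaults above, not part of the original procedure):

https://10.0.1.1:8080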