Deploy a new application to an HPC cluster

The previous lab introduced CycleCloud Projects as a means to customize a cluster template. In this lab we'll use a project to install a custom application on every node of a cluster. This lab builds on our foundation by showing how projects can be used for complex cluster configuration tasks such as customizing a VM during its configuration process.

Please send questions or comments to the Azure CycleCloud PM team.

Goals

In this lab you will learn how to:

Use CycleCloud Projects for installing a custom application in a cluster.
Stage application installation files in the blobs directory of a project.
Write a script that is executed on each and every cluster node as it boots.
Stage configuration files into every cluster node.
Upload a CycleCloud Project into a storage locker.
Start a new cluster that uses the new project.

Pre-requisites

Standard lab prerequisites
Lab 1 and Lab 2, or have a valid Azure CycleCloud installation with the CycleCloud CLI (cyclecloud command) configured

Customizing Cluster Nodes

When provisioning a VM as a cluster node there are often configuration steps that need to be performed during the VM boot-up process. It ranges from something as simple as setting up application path environment variables to more complicated tasks such as binding a node to an Active Directory domain. Furthermore, while Azure CycleCloud supports the use of custom images with baked-in applications, it is not unusual to install applications as part of the node preparation stage. Delegating these steps to the preparation stage reduces the tedium of creating custom images for every permutation of application and application version, especially in development environments.

This lab illustrates how you could use CycleCloud Projects to install Salmon, a popular bioinformatics application that is used for quantifying RNA in RNA sequencing experiments.

CycleCloud Projects

A CycleCloud Project consists of three main components — Templates, Specs, and Blobs.

Templates define the architecture of the CycleCloud cluster. The describe how the nodes of a cluster are laid out and how each node is configured.

Specs (short for specifications) define the configuration of each node. These configuration steps are captured in scripts. Zero or more specs may be assigned to each node in a cluster template.

Blobs are collections of files that can be accessed by all cluster nodes configured to use a spec of a project. You usually store application installers and sample test data in blobs.

The CycleCloud Projects reference page dives into greater detail on the concepts and examples of this powerful yet flexible configuration system.

Installing Salmon using CycleCloud Projects

A quick note about installing and setting up Salmon

The recommended process for installing Salmon is using Bioconda, a Conda channel for packaging bioinformatics tools. Azure CycleCloud comes with an Anaconda project/cluster type that makes this process simple. However, for the purposes of demonstrating the CycleCloud Project system we will set up and install Salmon from scratch.

Creating a new project for Salmon

Use the cyclecloud project init command to initialize a new project. If you have completed Lab2, you should have a cyclecloud_projects sub-directory in your home directory (if not, you can create it using the mkdir command). You could use that as the base directory for all of your projects. Go into that sub-directory and initialize a new project named salmon. As before, specify azure-storage as the default locker. Once the project is created, go into the new project directory.

ellen@Azure:~$ cd cyclecloud_projects/
ellen@Azure:~/cyclecloud_projects$ ls
azurecyclecloud_labs
ellen@Azure:~/cyclecloud_projects$ cyclecloud project init salmon
Project 'salmon' initialized in /home/ellen/cyclecloud_projects/salmon
Default locker: azure-storage
ellen@Azure:~/cyclecloud_projects$ cd salmon

A newly initialized project directory contains three sub-directories, one for each component of a CycleCloud Project: blobs, specs, and templates. A project.ini file that defines project properties is also created:

ellen@Azure:~/cyclecloud_projects/salmon$ ls
blobs  project.ini  specs  templates
ellen@Azure:~/cyclecloud_projects/salmon$

Staging the Salmon installer

Download the Salmon installation file into the blobs directory (this lab uses Salmon version 0.11.2):

ellen@Azure:~/cyclecloud_projects/salmon$ cd blobs
ellen@Azure:~/cyclecloud_projects/salmon/blobs$ wget -q https://github.com/COMBINE-lab/salmon/releases/download/v0.11.2/salmon-0.11.2-linux_x86_64.tar.gz
ellen@Azure:~/cyclecloud_projects/salmon/blobs$ ls
salmon-0.11.2-linux_x86_64.tar.gz
ellen@Azure:~/cyclecloud_projects/salmon/blobs$ cd ..
ellen@Azure:~/cyclecloud_projects/salmon$

Open the project.ini file in an editor Cloud Shell editor (or another editor):
```
ellen@Azure:~/cyclecloud_projects/salmon$ code project.ini
```

Edit the file so that it looks like this:

[project]
version = 1.0.0
name = salmon
type = application

[blobs]
files = salmon-0.11.2-linux_x86_64.tar.gz

The above edits defined the following:
- This project is of type application
- The manifest of the blobs directory should contain the file salmon-0.11.2-linux_x86_64.tar.gz
Save and exit the editor.

Adding a cluster-init script to the default spec

The specs directory contains specs of a project and a project can have one or more specs. For example, a project may have a default spec that contains configurations steps meant to be performed on every node of a cluster, and also a master spec that contains scripts meant to only be invoked on the cluster's headnode.

A default spec is automatically created in each new CycleCloud project. Within each spec are two sub-directories: chef and cluster-init

ellen@Azure:~/cyclecloud_projects/salmon$ cd specs/
ellen@Azure:~/cyclecloud_projects/salmon/specs$ ls
default
ellen@Azure:~/cyclecloud_projects/salmon/specs$ cd default/
ellen@Azure:~/cyclecloud_projects/salmon/specs/default$ ls
chef  cluster-init
ellen@Azure:~/cyclecloud_projects/salmon/specs/default$

The chef directory is created to hold Chef cookbooks and recipes that can be used in configuring nodes. The Chef reference page provides an excellent overview of Chef, but using Chef in CycleCloud Projects will not be covered in this lab.

The cluster-init directory contains three sub-directories:

ellen@Azure:~/cyclecloud_projects/salmon/specs/default$ ls cluster-init/
files  scripts  tests
ellen@Azure:~/cyclecloud_projects/salmon/specs/default$

files/: Small files that are downloaded into each and every cluster node using this spec.
scripts/: Collections of scripts that are executed on each node using this spec when the node boots. If there are multiple scripts in this directory they are executed in alphanumerical order of their filename.
tests/: Test scripts used to validate the successful deployment of a spec. (Tests will not be covered in this lab)

To illustrate the use of the scripts directory, we will create an install script for Salmon:

Create a new script named 10.install_salmon.sh inside the default/cluster-init/scripts directory.

Note: We use "here documents" to simplify the creation of new files. Simply copy all the lines starting with the cat command up to and including the EOF line, then paste it into Cloud Shell and press Enter. Alternatively, you can use a text editor to create and edit the file.

ellen@Azure:~/cyclecloud_projects/salmon/specs/default$ cd cluster-init/scripts/
ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init/scripts$ cat << EOF > ./10.install_salmon.sh
#!/bin/bash
set -ex

# make a /mnt/resource/apps directory
# Azure VMs that have ephemeral storage have that mounted at /mnt/resource. If that does not exist this command will create it.
mkdir -p /mnt/resource/apps

# Create tempdir
tmpdir=$(mktemp -d)

# download salmon installer into tempdir and unpack it into the apps directory
pushd $tmpdir
SALMON_VERSION=0.11.2
jetpack download "salmon-${SALMON_VERSION}-linux_x86_64.tar.gz"
tar xf salmon-${SALMON_VERSION}-linux_x86_64.tar.gz -C /mnt/resource/apps

# make the salmon install dir readable by all
chmod -R a+rX /mnt/resource/apps/salmon-${SALMON_VERSION}-linux_x86_64 

#clean up
popd
rm -rf $tmpdir
EOF
ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init/scripts$

The contents of the scripts directory should now look like this:

ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init/scripts$ ls
10.install_salmon.sh  README.txt
ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init/scripts$

Using the `cluster-init/files` directory to stage files on every node

It is sometimes very useful to stage small files on every node of a cluster. Examples of these are application configuration parameters or license files. Here we will use cluster-init to stage environment files in /etc/profile.d of each node so that the salmon binary will be in the path of every user.

Create a new script 20.add_salmon_to_path.sh in the scripts directory of the default spec salmon/specs/default/cluster-init/scripts/:

ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init/scripts$ cat << EOF > ./20.add_salmon_to_path.sh
#! /bin/bash

# create a symlink to the salmon directory
SALMON_VERSION=0.11.2
ln -s /mnt/resource/apps/salmon-${SALMON_VERSION}-linux_x86_64 /mnt/resource/apps/salmon

# add the profile files into /etc/profile.d
cp $CYCLECLOUD_SPEC_PATH/files/salmon.sh /etc/profile.d/
cp $CYCLECLOUD_SPEC_PATH/files/salmon.csh /etc/profile.d/

# make sure both are readable and executable
chmod a+rx /etc/profile.d/salmon.*
EOF
ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init/scripts$

Next, navigate to the sibling files/ directory and create two files, salmon.sh and salmon.csh:

ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init/scripts$ cd ../files
ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init/files$ cat << EOF > ./salmon.sh
#!/bin/sh
export PATH=$PATH:/mnt/resource/apps/salmon/bin
EOF
ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init/files$ cat << EOF > ./salmon.csh
#!/bin/csh
setenv PATH $PATH\:/mnt/resource/salmon/bin
EOF
ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init/files$

Review the files in the salmon project's default spec:

ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init/files$ cd ..
ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init$ ls scripts/
10.install_salmon.sh  20.add_salmon_to_path.sh  README.txt
ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init$ ls files/
README.txt  salmon.csh  salmon.sh
ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init$

Uploading the CycleCloud project into the storage locker

One of the steps in setting up a new Azure CycleCloud installation is the creation of an Azure storage account and an accompanying blob container. This container is the "Locker" that the CycleCloud server uses to stage CycleCloud projects for cluster nodes. CycleCloud cluster nodes orchestrated by this CycleCloud server are configured to download CycleCloud projects from this locker as part of the boot-up process of the node.

A storage account and container was created as part of the ARM installation process in Lab 1. To see this locker, use the cyclecloud locker list command:
```
ellen@Azure:~/cyclecloud_projects/salmon/specs/default/cluster-init$ cd
ellen@Azure:~$ cyclecloud locker list
azure-storage (az://cyclecloudcbekjhvzjrzswz/cyclecloud)
ellen@Azure:~$
```
In this example, the storage account name is cyclecloudcbekjhvzjrzswz, and the blob container name is cyclecloud.

The cyclecloud project upload command packages up the contents of the project and uploads it into the locker. To do this it needs to have credentials to access the blob container associated with the locker. You could create a SAS key for the container and use that, but for the purpose of this lab the service principal from Lab 1 will be used.

Edit the cyclecloud configuration file ~/.cycle/config.ini:
```
ellen@Azure:~$ code ~/.cycle/config.ini
ellen@Azure:~$
```
Add the section below, with subscription_id, tenant_id, application_id, application_secret matching those in the service principal used in Lab

Also replace the storage account name cyclecloudcbekjhvzjrzswz with the output of the cyclecloud locker list command:
```
[pogo azure-storage]
type = az
subscription_id = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
tenant_id = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
application_id = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
application_secret = xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
matches = az://cyclecloudcbekjhvzjrzswz/cyclecloud
```
You can locate your Subscription ID using the Azure CLI (az command) to list the accounts: az account list -o table
Your ~/.cycle/config.ini should now look something like this

Upload the project from its directory

ellen@Azure:~$ cd ~/cyclecloud_projects/salmon/
ellen@Azure:~/cyclecloud_projects/salmon$ cyclecloud project upload
Uploading to az://cyclecloudcbekjhvzjrzswz/cyclecloud/projects/salmon/1.0.0 (100%)
Uploading to az://cyclecloudcbekjhvzjrzswz/cyclecloud/projects/salmon/blobs (100%)
Upload complete!
ellen@Azure:~/cyclecloud_projects/salmon$

Create a new Cluster with the Salmon Project

Having uploaded the salmon project into the CycleCloud locker, you can now create a new cluster in CycleCloud and specify that each node should use the salmon:default spec. In this lab we will use Grid Engine as the base scheduler.

From the Cluster page of your Azure CycleCloud web portal, use the "+" symbol in the bottom-left-hand corner of the page to add a new Grid Engine cluster:
Give the cluster a descriptive name and complete the Required Settings as done previously.
Navigate to the Advanced Settings section. Under the Software section, click on the "Browse" button for "Master Cluster-Init" which will open a file selector dialog:
You will see a folder named salmon/. Open it by double-clicking it. Then open the 1.0.0/ folder. Finally, select the default/ folder by clicking on it once and pressing the "Select" button on the bottom of the dialog window. After pressing "Select" the file selector dialog will close. This selects the default spec of version 1.0.0 of the project salmon.
Repeat this selection step for "Execute Cluster-Init". When you're done, the Advanced Settings page should look like this:

Save the cluster and start it. When the master node turns green, log into it and verify that salmon has been installed:

ellen@Azure:~$ ssh ellen@40.117.78.137

 __        __  |    ___       __  |    __         __|
(___ (__| (___ |_, (__/_     (___ |_, (__) (__(_ (__|
        |

Cluster: SalmonCluster
Version: 7.5.0
Run List: recipe[cyclecloud], role[central_manager], role[application_server], role[sge_master_role], role[scheduler], role[monitor], recipe[cluster_init]
[ellen@ip-0A000404 ~]$ salmon
salmon v0.11.2

Usage:  salmon -h|--help or
        salmon -v|--version or
        salmon -c|--cite or
        salmon [--no-version-check] <COMMAND> [-h | options]

Commands:
    index Create a salmon index
    quant Quantify a sample
    alevin single cell analysis
    swim  Perform super-secret operation
    quantmerge Merge multiple quantifications into a single file

Run a parallel Salmon job on the cluster

Having created a CycleCloud cluster with Salmon application, you can now run Salmon jobs in parallel, utilizing autoscalable cluster compute resources. Please follow the below instructions to execute a sample Salmon job on your cluster. (borrowed from Getting started with Salmon guide)

While on the master node, download sample input data (transcriptome and sequencing data for Arabidopsis thaliana) into $HOME directory:
```
[cycleadmin@ip-0A000404 ~]$ curl https://bigcompbasestore.blob.core.windows.net/public/salmon_data.tar | tar x
```

Build an index on the transcriptome with salmon index command:

[cycleadmin@ip-0A000404 arabidopsis]$ salmon index -t athal.fa.gz -i athal_index

Submit parallel Salmon jobs salmon_quant.sh for all downloaded sequencing datasets to SGE scheduler queue:

[cycleadmin@ip-0A000404 arabidopsis]$ wget https://raw.githubusercontent.com/azurebigcompute/cyclecloud_tutorials/master/Lab3/scripts/salmon_quant.sh
[cycleadmin@ip-0A000404 arabidopsis]$ chmod +x salmon_quant.sh
[cycleadmin@ip-0A000404 arabidopsis]$ for fn in data/DRR0161?? ; do qsub -V -cwd ./salmon_quant.sh $fn ; done
Your job 1 ("salmon_quant.sh") has been submitted
Your job 2 ("salmon_quant.sh") has been submitted
Your job 3 ("salmon_quant.sh") has been submitted
Your job 4 ("salmon_quant.sh") has been submitted
Your job 5 ("salmon_quant.sh") has been submitted
Your job 6 ("salmon_quant.sh") has been submitted
Your job 7 ("salmon_quant.sh") has been submitted
Your job 8 ("salmon_quant.sh") has been submitted

After a short while you should observe in Clusters menu as CycleCloud is rolling out a cluster of execute nodes (8 cores in total):

Monitor the jobs in CycleCloud Jobs menu:

Observe as your jobs change the status from queued/waiting to running and then finish and disappear from the queue. After another while you should see your Salmon cluster automatically winding down:

Congratulations! You have completed the Lab 3 tutorial.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial.md

Tutorial.md

Deploy a new application to an HPC cluster

Goals

Pre-requisites

Customizing Cluster Nodes

CycleCloud Projects

Installing Salmon using CycleCloud Projects

A quick note about installing and setting up Salmon

Creating a new project for Salmon

Staging the Salmon installer

Adding a cluster-init script to the default spec

Using the `cluster-init/files` directory to stage files on every node

Uploading the CycleCloud project into the storage locker

Create a new Cluster with the Salmon Project

Run a parallel Salmon job on the cluster

Files

Tutorial.md

Latest commit

History

Tutorial.md

File metadata and controls

Deploy a new application to an HPC cluster

Goals

Pre-requisites

Customizing Cluster Nodes

CycleCloud Projects

Installing Salmon using CycleCloud Projects

A quick note about installing and setting up Salmon

Creating a new project for Salmon

Staging the Salmon installer

Adding a cluster-init script to the default spec

Using the cluster-init/files directory to stage files on every node

Uploading the CycleCloud project into the storage locker

Create a new Cluster with the Salmon Project

Run a parallel Salmon job on the cluster

Using the `cluster-init/files` directory to stage files on every node