# LASSO-O Singularity Instructions
Follow these instructions if you will be running LASSO-O on an HPC cluster.
For the purposes of these instructions, we are using commands for the
Slurm or PBS schedulers. Actual commands may be different at your institution
depending upon the scheduler used.
## Setting up Singularity environment on your HPC cluster
Once in your compute node shell with the singularity module loaded,
you can run LASSO-O using the following instructions.
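As a rough sketch of what that setup can look like (the module name and interactive-shell command are assumptions that vary by site):
```bash
# Request an interactive shell on a compute node (Slurm shown; `qsub -I`
# is the rough PBS equivalent). Partition/account flags are site-specific.
$ srun --nodes=1 --ntasks=1 --pty bash

# Load the Singularity module; the exact module name/version varies by cluster.
$ module load singularity

# Confirm the singularity command is available in your shell.
$ singularity --version
```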
## Running LASSO-O via Singularity
You can start LASSO-O by executing the `run.sh` script, located in the run-lasso-o_shcu folder, via a scheduler command.
Below we provide examples for the Slurm and PBS schedulers.
These commands instruct the scheduler to run LASSO-O with one node
and one CPU. Currently, the LASSO-O container executes each process
in sequence. Later we will add support for multiple cores.
Until then, LASSO-O will take a while to run depending upon the number of simulations
you are processing. In addition, the first time you run the lasso-o_shcu container,
the container runtime will download the container image, which will also take a few minutes.
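Because Singularity caches that downloaded image, you can optionally point the cache at a filesystem with sufficient space before the first run. This is not required by the provided scripts, and the path below is only an example:
```bash
# Optional: relocate the Singularity image/cache directory (example path only).
$ export SINGULARITY_CACHEDIR=/scratch/$USER/.singularity
```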
<div style="background-color: #F9F5D2; border: 1px solid grey; margin: 10px; padding: 10px;">
<strong>NOTE: </strong>
If you do not use a scheduler to invoke the script, it will run on the login node,
which may be killed per host policy if it runs for too long.
</div>
When your job has completed, you may view the outputs created in the
`run-lasso-o_shcu/data/outputs` folder using the notebooks provided.
See the [notebooks/README.md](notebooks/README.md) file for more
information.
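For example, once the job has finished you can quickly confirm that output files were produced before opening the notebooks:
```bash
# List the generated output files (run from the directory containing run-lasso-o_shcu).
$ ls run-lasso-o_shcu/data/outputs
```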
#### Start via Slurm
```bash
$ srun --verbose --nodes=1 --ntasks=1 --cpus-per-task=1 ./run.sh
```
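If you would rather submit the Slurm job in the background than run `srun` interactively, a minimal batch-script sketch is shown below. This script is not part of the repository; the file name, time limit, and any account or partition directives are assumptions to adapt for your site.
```bash
#!/bin/bash
#SBATCH --job-name=lasso-o
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=08:00:00
# Add site-specific directives (e.g., --account, --partition) as required.

# Run LASSO-O from the run-lasso-o_shcu folder (the submission directory).
./run.sh
```
You would then submit it from the run-lasso-o_shcu folder with, for example, `sbatch slurm_sub.sh` (a hypothetical file name).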
##### Check job status
> To check the status of your job, list the queue for your user ID:
```bash
$ squeue -u [user_name]
```
> You can also get more information about the running job using the scontrol command
> and the jobID printed out from the squeue command:
```bash
$ scontrol show jobid -dd [jobID]
```
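> After the job finishes, `scontrol` may no longer report it. If Slurm accounting is enabled at your site,
> `sacct` can summarize completed jobs (a sketch, assuming accounting is available):
```bash
$ sacct -j [jobID] --format=JobID,JobName,State,Elapsed,ExitCode
```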
#### Start via PBS
PBS jobs must be started with a batch script. To run LASSO-O, first edit
the `pbs_sub.sh` file to use the appropriate parameters for your
environment. In particular, the account name, group_list, and QoS
parameters should be changed, but other parameters may also be adjusted
as needed.
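For example, the directives most likely to need site-specific values look like the following (the values shown are placeholders, not working defaults):
```bash
#PBS -A <your_project_account>
#PBS -W group_list=<your_group>
#PBS -l qos=<your_qos>
```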
After you have edited the batch script, you should be able to submit a batch job via:
```bash
$ qsub pbs_sub.sh
```
##### Check job status
> To check the status of your job, list the queue for your user ID:
```bash
$ qstat -u [user_name]
```
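> Similar to `scontrol` on Slurm, you can get more detailed information about a specific PBS job
> with `qstat -f` and the jobID reported when the job was submitted or listed:
```bash
$ qstat -f [jobID]
```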
For reference, the `pbs_sub.sh` batch script provided with run-lasso-o_shcu looks like the following:
```bash
#!/bin/bash
### Job Name
#PBS -N lasso-o
### Account name (project code)
#PBS -A arm
### Specify the group name under which the job is to run.
### If not set, the group_list defaults to the primary group
### of the user under which the job will be run.
#PBS -W group_list=cades-arm
### Max time limit
#PBS -l walltime=01:00:00
### Queue
#PBS -q arm_high_mem
### Merge & persist std output and std error files
#PBS -j oe
#PBS -k eod
### Request 1 node with 1 processor per node
#PBS -l nodes=1:ppn=1
### A QoS is a classification that determines what kind of resources your job can use.
#PBS -l qos=std

./run.sh
```