MITgcm cluster facility

Frequently asked questions  
 
Q: What sort of performance can I get?

Q: What useful commands are there for checking job status?

Q: How come my files on the scratch file systems were deleted?

Q: Is there a parallel Matlab on the cluster?

Q: I am getting a strange error message, what should I do?

Q: How do I modify my MANPATH variable to include the PBS help pages?

Q: What are the names of the different machines in the cluster?

Q: What options do I need in the MITgcm genmake file for compiling?

Q: I am using the Myrinet MPI with PBS and sometimes my jobs don't terminate when I expect. Instead they run to the PBS time limit and then stop. What's happening?

Q: I am getting a license manager error using the Portland Group compilers (pgf77, pgf90, etc.). What should I do?

Q: What are useful environment variable/shell startup (.cshrc) settings for using the cluster?

Q: Why are my sessions getting killed?

Q: Where do I compile code?

Q: How do I submit a job?

Q: How long does each job run?

 

Q: What sort of performance can I get?

Q: What useful commands are there for checking job status?

qstat -n    Lists the nodes on which jobs are running.
qstat -a    Shows all running PBS jobs.

Q: How come my files on the scratch file systems were deleted?

A tidying program runs nightly to keep the scratch file systems below 80% utilization. It deletes the oldest files first. If you need to keep datasets on the cluster for long periods, use the Unix command touch to reset their file age to a recent date, which marks them as files to retain. However:

* The scratch file systems are never backed up, so they are not intended for permanent archival.
* If too many files are marked for retention with touch, the tidying system will still remove the oldest of them.

The scratch directories from which files are automatically deleted are shown on the Hardware Layout diagram. The only user directories on the cluster that are backed up regularly are the home space directories.
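
As a hedged illustration of the touch approach, the following sketch refreshes the modification time of every file under a dataset directory. The function name refresh_dataset and the example path are hypothetical, and the sketch uses Bourne shell syntax rather than tcsh. It does not change the caveats above: the files are still not backed up.

```shell
# Refresh the modification time of every regular file under a directory so
# that the nightly tidy-up treats the files as recent.
# Argument: the dataset directory (a placeholder; pass your own scratch path).
refresh_dataset() {
    find "$1" -type f -exec touch {} +
}

# Example call (hypothetical path):
#   refresh_dataset /scratch/myuser/run01
```

Running this occasionally on a small number of important directories is the intended use; marking everything defeats the purpose, since the oldest marked files are still removed.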

Q: Is there a parallel Matlab on the cluster?

Coming soon.

Q: I am getting a strange error message, what should I do?

Feel free to e-mail the error message to cg01-admin@techsquare.com. It is also often very useful to paste the error message into a search engine such as Google to see whether it is a well-known problem.

Q: How do I modify my MANPATH variable to include the PBS help pages?

The usual way to do this is to edit your shell startup settings file. For the default shell, /bin/tcsh, edit the file .tcshrc in your home directory so that it contains a line like the following:
setenv MANPATH /usr/man:/usr/X11R6/man:/usr/local/man:/usr/pbs/man

Q: What are the names of the different machines in the cluster?

The Hardware Layout section includes a diagram showing the names of the machines in the cluster. The technical specifications of the machines are presented in textual form in the Technical Specs section.

Q: What options do I need in the MITgcm genmake file for compiling?

The genmake options for compiling MITgcm on the myrinet-3 cluster with the Portland Group compiler and the Myrinet-optimized MPI are shown below. Note that the compilation should be run on a myrinet-3 cluster node.

MITgcm genmake options for myrinet-3 nodes with the Portland Group compiler and the Myrinet-optimized MPI:

case cg01+pgi:
    set LN         = ( '/bin/ln -s' )
    set CPP        = ( '/lib/cpp -traditional -P' )
    set INCLUDES   = ( '-I/usr/local/pkg/mpi/mpi-1.2.4..8a-gm-1.5/pgi/include' )
    set DEFINES    = ( ${DEFINES} '-DWORDLENGTH=4' )
    set FC         = ( '/usr/local/pkg/mpi/mpi-1.2.4..8a-gm-1.5/pgi/bin/mpif77' )
    set FFLAGS     = ( '-byteswapio -r8 -Mnodclchk -Mextend' )
#    set LIBS       = ( '-L/home/cnh/src/gm-1.4/libgm')
    set FOPTIM     = ( '-tp p6 -v -O2 -Munroll -Mvect=cachesize:512000,transform -Kieee' )
    set LINK       = ( '/usr/local/pkg/mpi/mpi-1.2.4..8a-gm-1.5/pgi/bin/mpif77' )
    breaksw

Q: I am using the Myrinet MPI with PBS and sometimes my jobs don't terminate when I expect. Instead they run to the PBS time limit and then stop. What's happening?

If the timeout given to the --gm-kill option of the mpirun.ch_gm command is too low (for example, --gm-kill 1), the startup script for MPI execution may hang even though the compute processes have terminated. To work around this problem, use a larger timeout (for example, --gm-kill 5). The timeout needed increases with the number of processes; for thirty-two processes, a value of 5 seems to work robustly.
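
For illustration, a launch line with the larger timeout might be wrapped as follows. This is a sketch in Bourne shell syntax, not the cluster's own launch script: the function name run_mpi, the process count, and the executable name are placeholders, and the -np option is assumed to be accepted by the local mpirun.ch_gm in the usual mpirun fashion.

```shell
# Launcher command; override MPIRUN if mpirun.ch_gm is not on your PATH.
MPIRUN="${MPIRUN:-mpirun.ch_gm}"

# Start an MPI program with a --gm-kill timeout of 5.
# Arguments: number of processes, then the executable (both placeholders).
run_mpi() {
    "$MPIRUN" --gm-kill 5 -np "$1" "$2"
}

# Example call inside a job script (hypothetical executable name):
#   run_mpi 32 ./mitgcmuv
```

If jobs with more processes still hang at shutdown, raising the value passed to --gm-kill further is the first thing to try.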

Q: I am getting a license manager error using the Portland Group compilers (pgf77, pgf90, etc.). What should I do?

If the environment variable PGI is not set correctly, the Portland Group compilers (pgf77, pgf90) and the MPI compilation scripts based on these compilers give the error message:

LICENSE MANAGER PROBLEM: Cannot find license file

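To fix this, set the PGI environment variable to the location of the Portland Group compilers, for example with the line setenv PGI /usr/local/pgi in your shell startup file (see the shell startup settings below).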

Q: What are useful environment variable/shell startup (.cshrc) settings for using the cluster?

Environment variable name and c-shell setting syntax, followed by its meaning:

setenv PATH ${PATH}:/usr/pbs/bin
    Sets the command search path to include the PBS commands.

setenv MANPATH /usr/man:/usr/X11R6/man:/usr/local/man:/usr/pbs/man
    Sets the man page search path to include the PBS help pages.

setenv PGI /usr/local/pgi
    Sets the location of the Portland Group compilers, needed for compilation to work with pgf77 or pgf90.

 

Q: Why are my sessions getting killed?

Jobs submitted through PBS take priority over programs run outside PBS. When a PBS session starts on a node it will kill any user programs already running on that node.

Q: Where do I compile code?

Code must be compiled on one of the myrinet cluster machines. The machine names are shown in the Hardware Layout section. To log on to a machine, use either the secure shell command (ssh) or the remote shell command (rsh). To run on the myrinet-3 and myrinet-4 clusters, compile on a myrinet-3 or myrinet-4 machine. Before compiling, check that the processor is free: run top to make sure that it is not being used for a job.

Q: How do I submit a job?

Submit jobs from cg01, not from an individual myrinet machine. Log in to cg01 and run the command /usr/pbs/bin/qsub script_name.pbs. To avoid typing the full path of PBS commands such as qsub, add the following line to your shell startup file (.cshrc) in your home directory:

setenv PATH ${PATH}:/usr/pbs/bin/

Q: How long does each job run?

The maximum queue time is currently set to 7200 seconds, so each run must be set up to terminate within 2 hours. A simulation that needs longer can be split across several jobs by editing your script so that it resubmits itself automatically; example scripts and data files can be found here.
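
The resubmission idea can be sketched roughly as follows. This is an illustrative outline only, not one of the site's example scripts: the file names run_state.txt and run_segment.pbs, the TOTAL_SEGMENTS variable, and both helper functions are hypothetical. It assumes the job is run with /bin/sh (requested with the PBS -S directive, since the default login shell is tcsh) and that the model can restart from the output of the previous segment.

```shell
#!/bin/sh
#PBS -S /bin/sh
#PBS -l walltime=02:00:00

# Path to qsub as given in this FAQ; can be overridden for testing.
QSUB="${QSUB:-/usr/pbs/bin/qsub}"

# Print the number of the next segment to run. The counter file holds the
# number of the last completed segment; a missing file means none yet.
next_segment() {
    last=0
    if [ -f "$1" ]; then
        last=$(cat "$1")
    fi
    echo $((last + 1))
}

# Record a completed segment and resubmit the script if more remain.
# Arguments: counter file, completed segment, total segments, script name.
finish_segment() {
    echo "$2" > "$1"
    if [ "$2" -lt "$3" ]; then
        "$QSUB" "$4"
    fi
}

# Main body: runs only inside a PBS job, where PBS_JOBID is set.
if [ -n "${PBS_JOBID:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    seg=$(next_segment run_state.txt)
    # Run one segment of the simulation here. It must restart from the
    # previous segment's output and finish well within the time limit.
    finish_segment run_state.txt "$seg" "${TOTAL_SEGMENTS:-4}" run_segment.pbs
fi
```

The first job would be submitted once from cg01 with qsub run_segment.pbs; each job then queues its successor until the counter reaches the total. Keeping the counter in a file rather than in an environment variable lets a failed segment be rerun by hand without losing track of progress.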