ACM


The ACM cluster (Applied and Computational Mathematics) is operated by several members in the Mathematics department at FSU.

 
Hardware

ACM consists of nodes with two distinct architectures, each with its own queue.

  • 1 headnode:
    • 4 GB of RAM
    • 2 AMD Opteron 2216 (2.4GHz) Dual Core processors
    • Gigabit Ethernet interface
    • 4x DDR InfiniBand Network Interface
    • 1.5 TB of usable file storage
    • No GPU
  • 4 compute nodes (16 cores), designated ACM_Tesla, for GPU computing tests:
    • 4 GB of RAM
    • 2 AMD Opteron 2216 (2.4GHz) Dual Core processors (4 cores total)
    • Gigabit Ethernet interface
    • 4x DDR InfiniBand Network Interface
    • Tesla GPU
  • 17 compute nodes (68 cores), designated ACM_GTS, for general-purpose computing:
    • 4 GB of RAM
    • 2 AMD Opteron 2216 (2.4GHz) Dual Core processors (4 cores total)
    • Gigabit Ethernet interface
    • 4x DDR InfiniBand Network Interface
    • NVIDIA 8500 GTS GPU
 
Software
  • GCC 4.1.2
  • PGI 7.1.5 (32-bit and 64-bit) Compiler Suite: including HPF, F90, F77, C, C++
  • OFED (InfiniBand management) 1.3 including both GCC and PGI versions of:
    • OpenMPI 1.2.5
    • MVAPICH 1.0.0 (InfiniBand-optimized MPICH)
    • MVAPICH2 1.0.2 (InfiniBand-optimized MPICH2)
  • NVIDIA CUDA compilers 1.1
  • MATLAB R2008a

Because the various MPI implementations share a common naming scheme for programs and environment variables, users must select their MPI implementation with the appropriate setup script to avoid name conflicts. The setup scripts are located in /etc/mpivars, are available for bash and csh, and follow the naming convention <implementation>_<compiler>-<version>.(c)sh.

For example, to use OpenMPI 1.2.5 with the GCC compilers, a user running the bash shell would type:

source /etc/mpivars/OpenMPI_GCC-1.2.5.sh

A user running the csh shell would instead type:

source /etc/mpivars/OpenMPI_GCC-1.2.5.csh

It is advisable to log out after switching MPI implementations to reset environment variables. To change your default MPI implementation, use the mpi-selector-menu command.
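
The selection step can be sketched as a small bash helper. This is only an illustration: the file name pattern is inferred from the OpenMPI_GCC-1.2.5 example above, and mpivars_path and use_mpi are hypothetical function names, not commands installed on ACM.

```shell
#!/bin/bash
# Hypothetical helpers; not part of the ACM installation.

# mpivars_path IMPLEMENTATION COMPILER VERSION [EXT]
# Build the setup script path. The pattern is inferred from the
# OpenMPI_GCC-1.2.5.sh example; EXT defaults to sh (use csh for csh).
mpivars_path() {
    local impl="$1" compiler="$2" version="$3" ext="${4:-sh}"
    printf '/etc/mpivars/%s_%s-%s.%s\n' "$impl" "$compiler" "$version" "$ext"
}

# use_mpi IMPLEMENTATION COMPILER VERSION
# Source the matching bash setup script if it is readable,
# otherwise report the missing path and return non-zero.
use_mpi() {
    local script
    script="$(mpivars_path "$1" "$2" "$3" sh)"
    if [ -r "$script" ]; then
        . "$script"
    else
        echo "no MPI setup script at $script" >&2
        return 1
    fi
}
```

With these functions defined in the current shell, use_mpi OpenMPI GCC 1.2.5 would have the same effect as the source command shown above, and it fails with a message if the requested combination is not installed.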

 
Filesystems

On the headnode, the local filesystem provides 1.4TB of usable shared storage in a RAID10 configuration. DSC home directories are linked to ~/fshome on the headnode ONLY. The compute nodes are configured to NFS mount the cluster home directories, and additionally have ~40 GB of local scratch space available for temporary usage. Access to ACM is restricted to machines within DSC's network. Users outside the network must transfer files through the PAMD machine.
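
A job that uses the local scratch space should work in its own directory and clean up afterwards. The sketch below shows one way to do that in bash. It rests on assumptions: this page does not name the scratch mount point, so SCRATCH_ROOT falls back to TMPDIR or /tmp, and run_in_scratch is a hypothetical helper name.

```shell
#!/bin/bash
# ASSUMPTION: the scratch mount point is not given on this page.
# Set SCRATCH_ROOT to the real location; TMPDIR or /tmp is used otherwise.
SCRATCH_ROOT="${SCRATCH_ROOT:-${TMPDIR:-/tmp}}"

# run_in_scratch CMD [ARGS...]
# Run a command inside a private scratch directory, then remove the
# directory and return the command's exit status.
run_in_scratch() {
    local workdir status
    workdir="$(mktemp -d "$SCRATCH_ROOT/job.XXXXXX")" || return 1
    ( cd "$workdir" && "$@" )
    status=$?
    rm -rf "$workdir"
    return "$status"
}
```

Because the directory is deleted when the command finishes, any results that need to be kept must be copied back to the NFS-mounted home directory by the command itself.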

 
Job Scheduling

The ACM headnode serves as the gateway, file server, subnet manager, and license manager for the cluster, so it is essential that users do NOT bog it down with resource-intensive tasks. Instead, submit resource-intensive jobs through SGE, ACM's preferred queuing system. On ACM, SGE is configured with two queues, ACM_Tesla and ACM_GTS, representing the two architectures. Both queues support batch jobs and interactive logins. The ACM cluster also participates in the Condor opportunistic scheduling system. Condor jobs can be submitted from the ACM headnode and are executed as resources become available.

Example: Submitting Batch Jobs

Batch jobs are submitted to the appropriate queue with the qsub command. At the time of this writing, the SGE queues are:

  • acm_gts for machines with the 8500 GTS cards
  • acm_tesla for machines with the single-precision Tesla cards
  • acm_tesla_double for machines with the double-precision Tesla cards (when available)

For example, to submit a serial job to the acm_tesla queue:

qsub -q acm_tesla myjob.sh
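
This page does not show the contents of myjob.sh. As a minimal sketch, assuming a standard SGE setup, a serial job script might look like the following. The job name serial_test and the program ./my_program are placeholders, and the lines starting with #$ are embedded qsub options.

```shell
#!/bin/bash
# myjob.sh: hypothetical example job script (not taken from this page).
# Embedded qsub options follow: run under bash, start in the submission
# directory, set a job name, and merge stderr into stdout.
#$ -S /bin/bash
#$ -cwd
#$ -N serial_test
#$ -j y

# Print which job this is and where it runs.
# JOB_ID is set by SGE; the fallback covers runs outside the scheduler.
job_banner() {
    echo "Job ${JOB_ID:-interactive} running on $(uname -n)"
}

job_banner

# Replace this with the real workload; ./my_program is a placeholder.
if [ -x ./my_program ]; then
    ./my_program
else
    echo "placeholder: ./my_program not found, nothing to run"
fi
```

The queue is left out of the script on purpose, so the same file can be sent to any queue with the -q option on the qsub command line.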

To submit parallel MPI jobs, it is necessary to specify a parallel environment. For a list of available parallel environments type:

qconf -spl

At the time of this writing, the available parallel environments are:

  • make
  • make_gts_fu
  • make_gts_rr
  • make_tesla_fu
  • make_tesla_rr

These environments allow a user to specify a preferred machine architecture and slot assignment strategy. The make_gts* and make_tesla* environments assign slots on the nodes with GTS cards or Tesla cards, respectively. The *rr (round robin) environments attempt to distribute jobs uniformly across compute nodes, while the *fu (fill-up) environments attempt to saturate each node in turn. The make (default) environment includes all machines (both Tesla and GTS nodes) and assigns jobs in a round robin fashion.

To submit a job to a parallel environment, add the -pe flag to the qsub command:

qsub -pe [pe_name] [numslots] [jobname]

For example, to submit a job called myjob.sh to the make_gts_fu environment with 16 processes, type:

qsub -pe make_gts_fu 16 myjob.sh
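
The way architecture and slot strategy combine into a parallel environment name can be sketched in bash. The functions pe_name and pe_submit_cmd are hypothetical helpers written for this illustration. They only print the command line and never call qsub, so the result can be checked before submitting.

```shell
#!/bin/bash
# Hypothetical helpers; they print a command line and do not call qsub.

# pe_name ARCH STRATEGY
# Map an architecture (gts or tesla) and a slot strategy (rr or fu)
# to one of the parallel environment names listed above.
pe_name() {
    case "$1" in
        gts|tesla) ;;
        *) echo "unknown architecture: $1" >&2; return 1 ;;
    esac
    case "$2" in
        rr|fu) ;;
        *) echo "unknown strategy: $2" >&2; return 1 ;;
    esac
    printf 'make_%s_%s\n' "$1" "$2"
}

# pe_submit_cmd ARCH STRATEGY NUMSLOTS JOBSCRIPT
# Print the qsub command line for a parallel job.
pe_submit_cmd() {
    local pe
    pe="$(pe_name "$1" "$2")" || return 1
    printf 'qsub -pe %s %s %s\n' "$pe" "$3" "$4"
}
```

For example, pe_submit_cmd gts fu 16 myjob.sh prints the same command as the example above, while pe_submit_cmd tesla rr 8 myjob.sh would spread 8 slots round robin across the Tesla nodes.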

Example: Interactive Login

Please do not use ssh to access nodes directly to test code or run jobs. Instead, use the interactive login feature, qlogin, so the scheduler can manage resources optimally.

At the time of this writing, the SGE queues are:

  • acm_gts for machines with the 8500 GTS cards
  • acm_tesla for machines with the single-precision Tesla cards
  • acm_tesla_double for machines with the double-precision Tesla cards (when available)

For example, if you would like a session on a machine with a Tesla GPU, type:

qlogin -q acm_tesla
 
FAQs
Double Precision Tesla Card
  • Where is the card installed?
    • The card is installed in ACM000, which is part of the ACM cluster. This node runs a special version of the NVIDIA driver and the CUDA libraries distributed with it, which differ from those installed on every other node.
  • How do I run programs on this card?
    • Because the card was distributed with a version of CUDA that differs from the one on every other node, you must first log into this node by typing qlogin -q acm_tesla_double. You can then compile and test your code there. Keep all code and binaries separate from those created and used on other nodes, as serious incompatibilities may arise.
  • Does this card have any known issues?
    • The alignedTypes program included with the NVIDIA SDK either creates runaway processes or crashes the node when running the RGBA and RGBA_2 codes. DO NOT TRY RUNNING THIS PROGRAM.
  • HELP! The machine locked up!
    • As this is a development version, there are bound to be some bugs that crash the machine or make it unresponsive. Unfortunately, when this happens you will need to contact one of the system admins to have the node reset (email us at ops.sc.fsu.edu).
 
Benchmarks

The standard benchmark for comparing compute clusters is HPL (High Performance LINPACK). Unfortunately, HPL has not been extended to GPU resources, so GPU performance cannot be compared directly with CPU performance. The theoretical peak performance of the CPUs can be calculated with a simple formula:

Rtheoretical = cores x clock x FLOPs/cycle
  = 84 x 2.4 GHz x FLOPs/cycle
  = 403 to 806 GFLOPs

where FLOPs/cycle is between 2 and 4. Using the standard HPL tests, the ACM cluster has achieved an Rmax of about 300 GFLOPs using the CPUs alone. Generally, it is difficult to achieve 60%-70% of Rtheoretical, so this number is expected. To give a sense of scale, in October 2007 FSU's HPC achieved 2.2 TFLOPs with the same tests.
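
The arithmetic above can be reproduced with a short bash and awk sketch. The function names peak_gflops and efficiency_pct are made up for this illustration. The inputs are the figures quoted on this page: 84 cores, 2.4 GHz, 2 to 4 FLOPs/cycle, and about 300 GFLOPs measured.

```shell
#!/bin/bash
# Hypothetical helpers that redo the calculation from this page.

# peak_gflops CORES CLOCK_GHZ FLOPS_PER_CYCLE
# Theoretical peak in GFLOPs: cores x clock x FLOPs/cycle.
peak_gflops() {
    awk -v c="$1" -v f="$2" -v p="$3" 'BEGIN { printf "%.1f\n", c * f * p }'
}

# efficiency_pct ACHIEVED_GFLOPS PEAK_GFLOPS
# Achieved performance as a whole-number percentage of the peak.
efficiency_pct() {
    awk -v a="$1" -v t="$2" 'BEGIN { printf "%.0f\n", 100 * a / t }'
}

low="$(peak_gflops 84 2.4 2)"
high="$(peak_gflops 84 2.4 4)"
echo "Theoretical peak: $low to $high GFLOPs"
echo "300 GFLOPs is $(efficiency_pct 300 "$low")% of the lower bound"
echo "300 GFLOPs is $(efficiency_pct 300 "$high")% of the upper bound"
```

The measured efficiency therefore depends strongly on which FLOPs/cycle figure applies to these processors, which this page leaves as a range.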

 
Resource Utilization

System utilization can be viewed in Ganglia.