Beowulf Cluster: trapeza.cfm.brown.edu


Hardware

The trapeza cluster consists of one master node, trapeza.cfm.brown.edu, sixteen slave nodes (node00, ... node15), a Netgear Ethernet switch with 16 10/100 ports and 2 gigabit fiber ports, a 16-port Myrinet switch, and a 16-port Belkin KVM switch.

Currently all nodes have an Asus A7V motherboard (VIA KT133 chipset) and an AMD K7 Thunderbird 1000MHz CPU. Each slave node has 256MB of 133MHz RAM, an Intel EtherExpress Pro/100+ network interface, a Myrinet 2000 SAN/PCI network interface, and a floppy drive. The master node has 768MB of 133MHz RAM, an Intel Gigabit Ethernet Adapter, an Intel EtherExpress Pro/100+ network interface, a CDROM drive, a floppy drive, a 40GB system disk, and a RAID-1 array of mirrored 80GB drives that provides space for the user applications running on the cluster and for any data they need or generate.


Software

The trapeza cluster is running the Scyld Beowulf release 2.0 (27bz-7), based on Red Hat Linux 6.2. Scyld Beowulf uses a system called "bproc" to manage processes running on the slave nodes.

Although Scyld Beowulf comes with a version of MPICH that runs over TCP/IP (the beompi package, installed in /usr/mpi-beowulf), trapeza also has mpich-gm, an MPICH build that takes advantage of the Myrinet hardware and software. It is installed in /usr/gmpi (/usr/local/mpich-gm).

The Portland Group (PGI) Fortran and C compilers are available on trapeza. They are installed in /usr/pgi (/usr/local/pgi3.2). HTML documentation is available online under /usr/pgi/doc/.

The Toolworks TotalView debugger is installed in /usr/totalview (/usr/local/toolworks/totalview.4.1.0-4/linux-x86), though it may require some more system setup and configuration before it can be used for MPI jobs on the cluster.

PBS (the Portable Batch System) is in the process of being set up to place MPI jobs appropriately on the slave nodes. This will eliminate the need to manually select the nodes and ports used by an MPI job, and will allow jobs to be queued when insufficient node resources are available at the time of submission.

Trapeza is accessible via SSH; users in the CFM NIS domain have login access.

Trapeza is new, promises exciting performance, and is undergoing continued refinement. Please send support questions to support@cfm.brown.edu.


User Setup

New users on trapeza may need to adjust their login environment in order to run applications on the cluster. Scripts on trapeza provide the environment variables and execution paths needed to access the installed software. If a user's ~/.profile or ~/.login file sets PATH or LM_LICENSE_FILE to a constant string, some of these default settings may be lost. The shared login scripts are located in /etc/profile.d, and users can find examples there of how to add components to their execution and licensing paths.
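For example, a Bourne-shell ~/.profile should append to these variables rather than assign a constant string. A minimal sketch (the /usr/pgi/bin location is an assumption; see /etc/profile.d for the authoritative settings):

    # ~/.profile (sh/bash): extend PATH instead of replacing it,
    # so the defaults set up by /etc/profile.d are preserved
    PATH=$PATH:/usr/gmpi/bin:/usr/pgi/bin
    export PATH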

CFM users' home directories are automatically mounted on trapeza when they log in, but files there will not be available to MPI jobs running on the cluster. Instead, a Cluster Home directory is provided on the RAID array for each user. The environment variable CHOME is set at login time to point to the user's /chome directory.

The program executable and all data must be copied into the CHOME directory for the program to run, and all output must be written to paths relative to the CHOME directory.
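For example (a sketch; the program and data file names are hypothetical):

    # stage the executable and its input data into the cluster home directory
    cp myapp input.dat $CHOME
    cd $CHOME
    # run from here so that output files land under $CHOME as well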


Building and Running MPI Jobs

The tools for building and running MPI jobs using mpich-gm are under /usr/gmpi (/usr/local/mpich-gm). In particular, the commands mpicc, mpif77 and mpif90 in /usr/gmpi/bin supply the switches and libraries needed to build mpich-gm applications with the PGI compilers. The script /usr/gmpi/bin/mpirun can be used to start mpich-gm jobs; it in turn calls /usr/gmpi/bin/mpirun.ch_gm. The mpiman command opens an Xman browser on the mpich-gm manual pages.
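For example, to compile an MPI program with the PGI-backed wrappers (a sketch; the source and output file names are hypothetical):

    # C source
    /usr/gmpi/bin/mpicc -O2 -o $CHOME/myapp myapp.c
    # Fortran 90 source
    /usr/gmpi/bin/mpif90 -O2 -o $CHOME/myapp myapp.f90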

In order to use mpirun or mpirun.ch_gm, the user needs to provide a cluster configuration file. The default location of this file is ~/.gmpi/conf, but since home directories are not visible on the cluster slave nodes, the conf file will need to be placed somewhere under $CHOME and specified to mpirun. This can be done either by setting GMPICONF to the desired file path, or by using the "-f" or "--gm-f" command line option to mpirun. Specifying the gmpi conf file on the command line overrides GMPICONF. If the path specified by GMPICONF or the --gm-f option does not start with a slash ('/'), the file is found relative to the current directory.
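For example (a sketch; the conf file name under $CHOME is hypothetical):

    # point mpirun at a conf file kept under the cluster home directory
    # (alternatively, pass the path with the -f or --gm-f mpirun option)
    export GMPICONF=$CHOME/gmpi.conf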

The gmpi conf file consists of one line giving the number of nodes listed in the rest of the file, followed by one line per node containing the node name and the Myrinet GM port ID number the job will use to communicate between slave nodes. Here is an example listing all sixteen nodes in the cluster:

    16
    node00 4
    node01 4
    node02 4
    node03 4
    node04 4
    node05 4
    node06 4
    node07 4
    node08 4
    node09 4
    node10 4
    node11 4
    node12 4
    node13 4
    node14 4
    node15 4

The node name must be unqualified (not nodeXX.beonet or nodeXX.myrinet), since it needs to match the gm_hostname on the node. The gmpi configuration file may need to be edited at run time to select nodes that are not already in use.

The port ID must be a value from the set [2 4 5 6 7]. Since this cluster has only single-CPU nodes, mpich-gm jobs on trapeza should typically use only one GM port per node. Port IDs 0 and 1 are used by the GM software, and port ID 3 is used by TCP/IP. In preparation for the deployment of PBS, ports 5, 6 and 7 have been reserved, leaving only ports 2 and 4 available for manual mpirun jobs.

The "-np" mpirun command line option specifies how many nodes from the configuration to use. There is a "--gm-r" option to allocate nodes starting from the end of the gmpi conf list.

Help on options specific to mpirun.ch_gm (and not described in the mpirun manual) can be listed by running "mpirun --gm-h".

"gmpilaunch" is a new script that can be used instead of mpirun.ch_gm for starting MPI jobs. It will also be usable with PBS when that queuing system becomes available. GM ports 5, 6, and 7 are currently available through gmpilaunch, but gmpilaunch takes care of the node and GM port assignments transparently to the user. The user may specify a "machines" file, a number of nodes to run on, or both. The default "machines" file is /usr/gmpi/share/machines.LINUX, which lists each node once.

If the "--np" or "-n" option is used and the number of nodes requested is smaller than the number of machines in the "-f" option machines file or with the "-m" option machines list, then vacant machines will be allocated first, and processes will only be placed on a non-vacant machine if there were not enough vacant machines available. The "--help" or "-h" option will print a summary of the available options for gmpilaunch.

"beostatus" is an x-client application that may be run on trapeza (displaying on your desktop) to easily monitor node usage. More detailed information is available from the command line program "beostat".


Beowulf/MPI/Myrinet/PBS Web Sites


Beowulf Clustering

The Beowulf Project http://www.beowulf.org/
Beowulf Underground : Current Articles http://www.beowulf-underground.org/
Scyld Computing Corporation http://www.scyld.com/
NPACI Rocks http://rocks.npaci.edu/
Beowulf cluster Mini-HowTo http://www.fysik.dtu.dk/CAMP/cluster-howto.html


MPI/MPICH/PBS

MPI Home Page http://www-unix.mcs.anl.gov/mpi/
MPICH Home Page http://www-unix.mcs.anl.gov/mpi/mpich/indexold.html
EPCC Training and Education Course Materials http://www.epcc.ed.ac.uk/epcc-tec/documents/coursemat.html
Portable Batch System http://www.openpbs.org/


Myrinet

Myricom Home Page http://www.myri.com/
Myrinet at OSC http://www.osc.edu/~djohnson/myrinet/


Other Clustering variants

Heterogeneous MPI http://www.ens-lyon.fr/~mercierg/mpi.html
Pm2 High Perf's Home Page http://www.ens-lyon.fr/~rnamyst/pm2.html
BIP Messages Software http://lhpca.univ-lyon1.fr/software/distrib.html
GAMMA Project http://www.disi.unige.it/project/gamma/
MOSIX http://www.cs.huji.ac.il/mosix/


Cluster Examples

Duke CS Cluster Computing Lab http://www.cs.duke.edu/ari/
The UNH Parallel and Distributed Computing Laboratory http://www.cs.unh.edu/~rdr/galaxy.html
UDel SAMson Beowulf Cluster http://www.bartol.udel.edu/Mri/sam/
FHCRC Biomath Server Page http://queenbee.fhcrc.org/


Network and NFS Performance

The Public Netperf Homepage http://www.netperf.org/
ClusterNFS Sourceforge Homepage http://ClusterNFS.sourceforge.net/




