Category: High-Performance Computing
2013-08-23 01:46:54
This document describes the steps necessary to quickly set up a cluster with IBM System x rack-mounted servers. Although the examples given in this document are specific to iDataPlex hardware (because that's the most common server type used for clusters), the basic instructions apply to any x86_64, IPMI-controlled, rack-mounted servers.
this document is meant to get you going as quickly as possible and therefore only goes through the most common scenario. for additional scenarios and setup tasks, see .
This configuration will have a single dx360 management node with 167 other dx360 servers as nodes. The OS deployed will be Red Hat Enterprise Linux 6.2, x86_64 edition. Here is a diagram of the racks:
In our example, the management node is known as 'mgt', the node names are n1-n167, and the domain will be 'cluster'. We will use the BMCs in shared mode, so each BMC shares the NIC over which its node's operating system communicates with the xCAT management node. This is called the management LAN. We will use subnet 172.16.0.0 with a netmask of 255.240.0.0 (/12) for it. (This provides an IP address range of 172.16.0.1 - 172.31.255.254.) We will use the following subsets of this range for:
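The address range quoted above follows directly from the subnet and netmask. As a quick sanity check, the sketch below recomputes it with plain bash arithmetic. The function names (ip2int, int2ip, host_range) are made up for this illustration and are not part of xCAT.

```shell
#!/bin/bash
# Convert a dotted quad to an integer (splits the argument on ".").
ip2int() { local IFS=.; set -- $1; echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 )); }

# Convert an integer back to a dotted quad.
int2ip() { echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"; }

# Print the first and last usable host address for a subnet and netmask.
host_range() {
  local net mask first last
  net=$(ip2int "$1")
  mask=$(ip2int "$2")
  first=$(( (net & mask) + 1 ))                          # network address + 1
  last=$(( ((net & mask) | (~mask & 0xFFFFFFFF)) - 1 ))  # broadcast address - 1
  echo "$(int2ip "$first") - $(int2ip "$last")"
}

host_range 172.16.0.0 255.240.0.0   # prints: 172.16.0.1 - 172.31.255.254
```

The same function can be used to confirm that any sub-range you pick for nodes, BMCs, or the dynamic range falls inside the management LAN.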
the network is physically laid out such that port number on a switch is
equal to the u position number within a column, like this:
here is a summary of the steps required to set up the cluster and what this document will take you through:
install one of the supported distros on the management node (mn). it is recommended to ensure that dhcp, bind (not bind-chroot), httpd, nfs-utils, and perl-xml-parser are installed. (but if not, the process of installing the xcat software later will pull them in, assuming you follow the steps to make the distro rpms available.)
Hardware requirements for your xCAT management node depend on your cluster size and configuration. An xCAT management node or service node that is dedicated to running xCAT to install a small cluster (< 16 nodes) should have at least 4-6 gigabytes of memory; for a medium-size cluster, 6-8 gigabytes; and for a large cluster, 16 gigabytes or more. Keeping swapping to a minimum should be a goal.
for a list of supported os and hardware, refer to .
note: you can skip this step in xcat 2.8.1 and above, because xcat does it automatically when it is installed.
to disable selinux manually:
echo 0 > /selinux/enforce
sed -i 's/^SELINUX=.*$/SELINUX=disabled/' /etc/selinux/config
note: you can skip this step in xcat 2.8 and above, because xcat does it automatically when it is installed.
The management node provides many services to the cluster nodes, but the firewall on the management node can interfere with this. If your cluster is on a secure network, the easiest thing to do is to disable the firewall on the management node:
for rh:
service iptables stop
chkconfig iptables off
if disabling the firewall completely isn't an option, configure iptables to allow the following services on the nic that faces the cluster: dhcp, tftp, nfs, http, dns.
for sles:
SuSEfirewall2 stop
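If the firewall has to stay up on a Red Hat management node, the services listed above can be opened on the cluster-facing NIC only. The following is a minimal sketch that prints candidate iptables rules rather than applying them. The NIC name eth1, the function name cluster_fw_rules, and the port list (DHCP 67/udp, TFTP 69/udp, DNS 53, HTTP 80/tcp, rpcbind 111, NFS 2049) are assumptions of this sketch; NFSv3 helper daemons such as mountd may use additional ports that are not covered here.

```shell
#!/bin/bash
# Print (do not apply) iptables rules that accept the cluster services on one NIC.
cluster_fw_rules() {
  local nic=$1 spec proto port
  # proto:port pairs for DHCP, TFTP, DNS (udp and tcp), HTTP, rpcbind, NFS
  for spec in udp:67 udp:69 udp:53 tcp:53 tcp:80 udp:111 tcp:111 udp:2049 tcp:2049; do
    proto=${spec%%:*}   # text before the colon
    port=${spec##*:}    # text after the colon
    echo "iptables -I INPUT -i $nic -p $proto --dport $port -j ACCEPT"
  done
}

cluster_fw_rules eth1
```

Review the printed rules first; once they look right for your site, they can be run as root and saved with "service iptables save".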
the xcat installation process will scan and populate certain settings from the running configuration. having the networks configured ahead of time will aid in correct configuration. (after installation of xcat, all the networks in the cluster must be defined in the xcat networks table before starting to install cluster nodes.) when xcat is installed on the management node, it will automatically run makenetworks to create an entry in the networks table for each of the networks the management node is on. additional network configurations can be added to the xcat networks table manually later if needed.
the networks that are typically used in a cluster are:
in our example, we only deal with the management network because:
for more information, see .
configure the cluster facing nic(s) on the management node. for example edit the following files:
[rh]: /etc/sysconfig/network-scripts/ifcfg-eth1
[sles]: /etc/sysconfig/network/ifcfg-eth1
DEVICE=eth1
ONBOOT=yes
BOOTPROTO=static
IPADDR=172.20.0.1
NETMASK=255.240.0.0
If the public-facing NIC on your management node is configured by DHCP, you may want to set PEERDNS=no in the NIC's config file to prevent dhclient from rewriting /etc/resolv.conf. This is important if you will be configuring DNS on the management node (via makedns, covered later in this document) and want the management node itself to use that DNS. In this case, set PEERDNS=no in each /etc/sysconfig/network-scripts/ifcfg-* file that has BOOTPROTO=dhcp.
On the other hand, if you want dhclient to configure /etc/resolv.conf on your management node, then don't set PEERDNS=no in the NIC config files.
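To find the NIC config files that still need the PEERDNS setting, a small check along the following lines may help. It is only a sketch: the function name find_peerdns_gaps is hypothetical, and the default directory is the Red Hat location named above.

```shell
#!/bin/bash
# List ifcfg files that use DHCP but do not set PEERDNS=no.
find_peerdns_gaps() {
  local dir=${1:-/etc/sysconfig/network-scripts} f
  for f in "$dir"/ifcfg-*; do
    [ -f "$f" ] || continue          # skip when the glob matched nothing
    if grep -Eqi '^BOOTPROTO="?dhcp' "$f" && ! grep -Eqi '^PEERDNS="?no' "$f"; then
      echo "$f"
    fi
  done
}

find_peerdns_gaps
```

An empty result means every DHCP-configured interface already has PEERDNS=no (or that no interface uses DHCP).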
the xcat management node hostname should be configured before installing xcat on the management node. the hostname or its resolvable ip address will be used as the default master name in the xcat site table, when installed. this name needs to be the one that will resolve to the cluster-facing nic. short hostnames (no domain) are the norm for the management node and all cluster nodes. node names should never end in "-enx" for any x.
to set the hostname, edit /etc/sysconfig/network to contain, for example:
HOSTNAME=mgt
If you run the hostname command, it should return the same:
# hostname
mgt
ensure that at least the management node is in /etc/hosts:
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
###
172.20.0.1 mgt mgt.cluster
when using the management node to install compute nodes, the timezone configuration on the management node will be inherited by the compute nodes. so it is recommended to setup the correct timezone on the management node. to do this on rhel, see . the process is similar, but not identical, for sles. (just google it.)
You can also optionally set up the MN as an NTP server for the cluster. see .
It is not required, but recommended, that you create a separate file system for the /install directory on the management node. The size should be at least 30 GB to allow space for several install images (a single distro DVD image is already several gigabytes).
note: in xcat 2.8 and above, you do not need to restart the management node. simply restart the cluster-facing nic, for example: ifdown eth1; ifup eth1
for xcat 2.7 and below, though it is possible to restart the correct services for all settings, the simplest step would be to reboot the management node at this point.
it is recommended that spanning tree be set in the switches to portfast or edge-port for faster boot performance. please see the relevant switch documentation as to how to configure this item.
It is also recommended that LLDP be enabled in the switches so that switch and port information can be collected for the compute nodes during the discovery process.
note: this step is necessary if you want to use xcat's automatic switch-based discovery (described later on in this document) for ipmi-controlled rack-mounted servers (including idataplex) and flex chassis. if you have a small cluster and prefer to use the sequential discover method (described later) or manually enter the macs for the hardware, you can skip this section. although you may want to still set up your switches for management so you can use xcat tools to manage them, as described in .
xcat will use the ethernet switches during node discovery to find out which switch port a particular mac address is communicating over. this allows xcat to match a random booting node with the proper node name in the database. to set up a switch, give it an ip address on its management port and enable basic snmp functionality. (typically, the snmp agent in the switches is disabled by default.) the easiest method is to configure the switches to give the snmp version 1 community string called "public" read access. this will allow xcat to communicate to the switches without further customization. (xcat will get the list of switches from the table.) if you want to use snmp version 3 (e.g. for better security), see the example below. with snmp v3 you also have to set the user/password and authproto (default is 'md5') in the table.
if for some reason you can't configure snmp on your switches, you can use sequential discovery or the more manual method of entering the nodes' macs into the database. see for a description of your choices.
xcat supports many switch types, such as bnt and cisco. here is an example of configuring snmp v3 on the cisco switch 3750/3650:
1. first, user should switch to the configure mode by the following commands:
[root@x346n01 ~]# telnet xcat3750
trying 192.168.0.234...
connected to xcat3750.
escape character is '^]'.
user access verification
password:

xcat3750-1>enable
password:

xcat3750-1#configure terminal
enter configuration commands, one per line. end with cntl/z.
xcat3750-1(config)#
2. configure the snmp-server on the switch:
switch(config)# access-list 10 permit 192.168.0.20 # 192.168.0.20 is the IP of the MN
switch(config)# snmp-server group xcatadmin v3 auth write v1default
switch(config)# snmp-server community public ro 10
switch(config)# snmp-server community private rw 10
switch(config)# snmp-server enable traps license?
3. configure the snmp user id (assuming a user/pw of xcat/passw0rd):
switch(config)# snmp-server user xcat xcatadmin v3 auth sha passw0rd access 10
4. check the snmp communication to the switch :
yum install net-snmp net-snmp-utils
snmpwalk -v 3 -u xcat -a SHA -A passw0rd -X cluster -l authNoPriv 192.168.0.234 .1.3.6.1.2.1.2.2.1.2
Later, this document explains how to make sure the switch and switches tables are set up correctly.
there are two options for installation of xcat:
pick either one, but not both.
if not able to, or not wishing to, use the live internet repository, choose this option.
go to the site and download the level of xcat tarball you desire. go to the page and download the latest snap of the xcat dependency tarball. (the latest snap of the xcat dependency tarball will work with any version of xcat.)
copy the files to the management node (mn) and untar them:
mkdir /root/xcat2
cd /root/xcat2
tar jxvf xcat-core-2.*.tar.bz2 # or core-rpms-snap.tar.bz2
tar jxvf xcat-dep-*.tar.bz2
point yum to the local repositories for xcat and its dependencies:
cd /root/xcat2/xcat-dep//
./mklocalrepo.sh
cd /root/xcat2/xcat-core
./mklocalrepo.sh
[sles 11]:
zypper ar file:///root/xcat2/xcat-dep/sles11/xcat-dep
zypper ar file:///root/xcat2/xcat-core xcat-core
you can check a zypper repository using "zypper lr -d", or remove a zypper repository using "zypper rr".
[sles 10.2 ]:
zypper sa file:///root/xcat2/xcat-dep/sles10/xcat-dep
zypper sa file:///root/xcat2/xcat-core xcat-core
you can check a zypper repository using "zypper sl -d", or remove a zypper repository using "zypper sd".
when using the live internet repository, you need to first make sure that name resolution on your management node is at least set up enough to resolve sourceforge.net. then make sure the correct repo files are in /etc/yum.repos.d:
to get the current official release:
cd /etc/yum.repos.d wget
to get the deps package:
wget
for example:
wget
to setup to use sles with zypper:
[sles11]:
zypper ar -t rpm-md xcat-core
zypper ar -t rpm-md/ xcat-dep
[sles10.2 ]:
zypper sa xcat-core
zypper sa/ xcat-dep
xCAT depends on several packages that come from the Linux distro. Follow this section to create the repository of the OS on the management node.
see the following documentation:
[rh]: use yum to install xcat and all the dependencies:
yum clean metadata
yum install xCAT
[sles]: use zypper to install xcat and all the dependencies:
zypper install xCAT
Note: In xCAT 2.8.2 and above, xCAT supports cloning new nodes from a pre-installed/pre-configured node; we call this provisioning method sysclone. It leverages the open source tool SystemImager. If you will be installing stateful (diskful) nodes using the sysclone provmethod, you need to install systemimager and all the dependencies (using sysclone is optional):
[rh]: use yum to install systemimager and all the dependencies:
yum install systemimager-server
[sles]: use zypper to install systemimager and all the dependencies:
zypper install systemimager-server
add xcat commands to the path by running the following:
source /etc/profile.d/xcat.sh
check to see the database is initialized:
tabdump site
The output should be similar to the following:
key,value,comments,disable
"xcatdport","3001",,
"xcatiport","3002",,
"tftpdir","/tftpboot",,
"installdir","/install",,
.
.
.
if the tabdump command does not work, see .
if you need to update the xcat rpms later:
to update xcat:
[rh]:
yum clean metadata
yum update '*xCAT*'
[sles]:
zypper refresh
zypper update -t package '*xCAT*'
note: this will not apply updates that may have been made to some of the xcat deps packages. (if there are brand new deps packages, they will get installed.) in most cases, this is ok, but if you want to make all updates for xcat rpms and deps, run the following command. this command will also pick up additional os updates.
[rh]:
yum update
[sles]:
zypper refresh
zypper update
note: if you are updating from xcat 2.7.x (or earlier) to xcat 2.8 or later, there are some additional migration steps that need to be considered:
all networks in the cluster must be defined in the networks table. when xcat was installed, it ran makenetworks, which created an entry in this table for each of the networks the management node is connected to. now is the time to add to the networks table any other networks in the cluster, or update existing networks in the table.
for a sample networks setup, see the following example:
The password that will be assigned to root when a node is installed should be set in the passwd table. You can modify this table using tabedit. To change the default password for root on the nodes, change the system line. To change the password to be used for the BMCs, change the ipmi line.
tabedit passwd
#key,username,password,cryptmethod,comments,disable
"system","root","cluster",,,
"ipmi","userid","passw0rd",,,
to get the hostname/ip pairs copied from /etc/hosts to the dns on the mn:
chdef -t site forwarders=1.2.3.4,1.2.5.6
(contents of /etc/resolv.conf on the MN, so that it uses its own DNS:)
search cluster
nameserver 172.20.0.1
makedns -n
for more information about name resolution in an xcat cluster, see .
you usually don't want your dhcp server listening on your public (site) network, so set site.dhcpinterfaces to your mn's cluster facing nics. for example:
chdef -t site dhcpinterfaces=eth1
then this will get the network stanza part of the dhcp configuration (including the dynamic range) set:
makedhcp -n
the ip/mac mappings for the nodes will be added to dhcp automatically as the nodes are discovered.
Nothing to do here: the TFTP server is set up by xCAT during the management node install.
makeconservercf
if you want to run a discovery process, a dynamic range must be defined in the networks table. it's used for the nodes to get an ip address before xcat knows their mac addresses.
in this case, we'll designate 172.20.255.1-172.20.255.254 as a dynamic range:
chdef -t network 172_16_0_0-255_240_0_0 dynamicrange=172.20.255.1-172.20.255.254
several xcat database tables must be filled in while setting up an idataplex cluster. to make this process easier, xcat provides several template files in /opt/xcat/share/xcat/templates/e1350/. these files contain regular expressions that describe the naming patterns in the cluster. with xcat's regular expression support, one line in a table can define one or more attribute values for all the nodes in a node group. (for more information on xcat's database regular expressions, see .) to load the default templates into your database:
cd /opt/xcat/share/xcat/templates/e1350/
for i in *csv; do tabrestore $i; done
these templates contain entries for a lot of different node groups, but we will be using the following node groups:
in our example, ipmi, idataplex, 42perswitch, and compute will all have the exact same membership because all of our idataplex nodes have those characteristics.
the templates automatically define the following attributes and naming conventions:
for a description of the attribute names in bold above, see the .
if these conventions don't work for your situation, you can either:
now you can use the power of the templates to define the nodes quickly. by simply adding the nodes to the correct groups, they will pick up all of the attributes of that group:
nodeadd n[001-167] groups=ipmi,idataplex,42perswitch,compute,all
nodeadd n[001-167]-bmc groups=84bmcperrack
nodeadd switch1-switch4 groups=switch
to see the list of nodes you just defined:
nodels
to see all of the attributes that the combination of the templates and your nodelist have defined for a few sample nodes:
lsdef n100,n100-bmc,switch2
this is the easiest way to verify that the regular expressions in the templates are giving you attribute values you are happy with. (or, if you modified the regular expressions, that you did it correctly.)
if not using a terminal server, sol is recommended, but not required to be configured. to instruct xcat to configure sol in installed operating systems on dx340 systems:
chdef -t group -o compute serialport=1 serialspeed=19200 serialflow=hard
for dx360-m2 and newer use:
chdef -t group -o compute serialport=0 serialspeed=115200 serialflow=hard
Since the mapping between the xCAT node names and IP addresses has been added to the hosts table by the e1350 template, you can run the makehosts xCAT command to create the /etc/hosts file from the xCAT hosts table. (You can skip this step if creating /etc/hosts manually.)
makehosts switch,idataplex,ipmi
verify the entries have been created in the file /etc/hosts. for example your /etc/hosts should look like this:
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
###
172.20.0.1 mgt mgt.cluster
172.20.101.1 n1 n1.cluster
172.20.101.2 n2 n2.cluster
172.20.101.3 n3 n3.cluster
172.20.101.4 n4 n4.cluster
172.20.101.5 n5 n5.cluster
172.20.101.6 n6 n6.cluster
172.20.101.7 n7 n7.cluster
.
.
.
add the node ip mapping to the dns.
makedns
xcat supports 3 approaches to discover the new physical nodes and define them to xcat database:
this is a simple approach in which you give xcat a range of node names to be given to the discovered nodes, and then you power the nodes on sequentially (usually in physical order), and each node is given the next node name in the noderange.
with this approach, xcat assumes the nodes are plugged into your ethernet switches in an orderly fashion. so it uses each node's switch port number to determine where it is physically located in the racks and therefore what node name it should be given. this method requires a little more setup (configuring the switches and defining the switch table). but the advantage of this method is that you can power all of the nodes on at the same time and xcat will sort out which node is which. this can save you a lot of time in a large cluster.
If you don't want to use either of the automatic discovery processes, just follow the manual discovery process.
choose just one of these options and follow the corresponding section below (and skip the other two).
note: this feature is only supported in xcat 2.8.1 and higher.
sequential discovery means the new nodes will be discovered one by one. the nodes will be given names from a 'node name pool' in the order they are powered on.
specify the node name pool by giving a noderange to the nodediscoverstart command:
nodediscoverstart noderange=n[001-010]
the value of noderange should be in the xcat format.
note: other node attributes can be given to nodediscoverstart so that xcat will assign those attributes to the nodes as they are discovered. we aren't showing that in this document, because we already predefined the nodes, the groups they are in, and several attributes (provided by the e1350 templates). if you don't want to predefine nodes, you can give more attributes to nodediscoverstart and have it define the nodes. see the for details.
at this point you can physically power on the nodes one at a time, in the order you want them to receive their node names.
there are additional nodediscover commands you can run during the discovery process. see their for more details.
nodediscoverstatus
nodediscoverls -t seq -l
nodediscoverstop
note: the sequential discovery process will be stopped automatically when all of the node names in the node name pool are used up.
This method of discovery assumes that you have the nodes plugged into your Ethernet switches in an orderly fashion. So we use each node's switch port number to determine where it is physically located in the racks and therefore what node name it should be given.
to use this discovery method, you must have already configured the switches as described in
the table templates already put group-oriented regular expression entries in the switch table. use lsdef for a sample node to see if the switch and switchport attributes are correct. if not, use chdef or tabedit to change the values.
if you configured your switches to use snmp v3, then you need to define several attributes in the switches table. assuming all of your switches use the same values, you can set these attributes at the group level:
tabch switch=switch switches.snmpversion=3 switches.username=xcat switches.password=passw0rd switches.auth=sha
Prerequisite: the dynamic DHCP range must be configured before you power on the nodes.
if you have a few nodes which were not discovered by sequential discovery or switch discovery, you could find them in discoverydata table. the undiscovered nodes are identified as 'undef' method in discoverydata table.
display the undefined nodes with nodediscoverls command:
nodediscoverls -t undef
uuid                                  node   method  mtm      serial
61e5f2d7-0d59-11e2-a7bc-3440b5bedbb1  undef  undef   786310x  1052ef1
fc5f8852-cb97-11e1-8d59-e41f13eeb1ba  undef  undef   7914b2a  06dvac9
96656f17-6482-e011-9954-5cf3fc317f68  undef  undef   7377d2c  99a2007
If you want to manually assign an 'undefined' node to a specific free node name, use the nodediscoverdef command (available in xCAT 2.8.2 or higher).
For example, if you have a free node n10 and you want to assign it to the undefined node whose UUID is '61e5f2d7-0d59-11e2-a7bc-3440b5bedbb1', run the following command:
nodediscoverdef -u 61e5f2d7-0d59-11e2-a7bc-3440b5bedbb1 -n n10
After the manual definition, the 'node name' and 'discovery method' attributes of the undefined node will be changed. You can display the change with the nodediscoverls command:
# nodediscoverls
uuid                                  node   method  mtm      serial
61e5f2d7-0d59-11e2-a7bc-3440b5bedbb1  n10    manual  786310x  1052ef1
fc5f8852-cb97-11e1-8d59-e41f13eeb1ba  undef  undef   7914b2a  06dvac9
96656f17-6482-e011-9954-5cf3fc317f68  undef  undef   7377d2c  99a2007
You can also run 'lsdef n10' to see that the MAC address and MTM have been updated in the node definition. If a next task such as bmcsetup has been set in the chain table, the task chain will continue to run after the nodediscoverdef command completes.
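When several nodes are left undefined, typing one nodediscoverdef command per UUID gets tedious. The sketch below reads nodediscoverls-style output on stdin and prints a suggested nodediscoverdef command for each 'undef' row, pairing rows with the free node names given as arguments. The helper name suggest_defs and the free names n10 and n11 are hypothetical, and the pairing is purely positional, so check which physical machine each UUID belongs to before running anything it prints.

```shell
#!/bin/bash
# Print one "nodediscoverdef" command per undefined entry read from stdin.
# Arguments: free node names, used in order.
suggest_defs() {
  awk -v names="$*" '
    BEGIN { n = split(names, free, " ") }
    toupper($1) != "UUID" && $2 == "undef" && ++i <= n {
      printf "nodediscoverdef -u %s -n %s\n", $1, free[i]
    }'
}

# Sample input copied from the listing above; on a real MN you would pipe in
# the output of "nodediscoverls -t undef" instead.
suggest_defs n10 n11 <<'EOF'
uuid node method mtm serial
61e5f2d7-0d59-11e2-a7bc-3440b5bedbb1 undef undef 786310x 1052ef1
fc5f8852-cb97-11e1-8d59-e41f13eeb1ba undef undef 7914b2a 06dvac9
96656f17-6482-e011-9954-5cf3fc317f68 undef undef 7377d2c 99a2007
EOF
```

With two free names and three undefined rows, only the first two rows get a suggestion; the third is left for a later pass.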
if you want to update node firmware when you discover the nodes, follow the steps in before continuing.
if you want to automatically deploy the nodes after they are discovered, follow the steps in before continuing. (but if you are new to xcat we don't recommend this.)
to initiate any of the 3 discover methods, walk over to systems and hit the power buttons. for the sequential discovery method power the nodes on in the order that you want them to be given the node names. wait a short time (about 30 seconds) between each node to ensure they will contact xcatd in the correct order. for the switch and manual discovery processes, you can power on all of the nodes at the same time.
on the mn watch nodes being discovered by:
tail -f /var/log/messages
look for the dhcp requests, the xcat discovery requests, and the "
a quick summary of what is happening during the discovery process is:
after a successful discovery process, the following attributes will
be added to the database for each node. (you can verify this by running
lsdef
if you cannot discover the nodes successfully, see the next section .
if at some later time you want to force a re-discover of a node, run:
makedhcp -d
and then reboot the node(s).
when the bmcsetup process completes on each node (about 5-10 minutes), xcat genesis will drop into a shell and wait indefinitely (and change the node's currstate attribute to "shell"). you can monitor the progress of the nodes using:
watch -d 'nodels ipmi chain.currstate|xcoll'
before all nodes complete, you will see output like:
====================================
n1,n10,n11,n75,n76,n77,n78,n79,n8,n80,n81,n82,n83,n84,n85,n86,n87,n88,n89,n9,n90,n91
====================================
shell

====================================
n31,n32,n33,n34,n35,n36,n37,n38,n39,n4,n40,n41,n42,n43,n44,n45,n46,n47,n48,n49,n5,n50,n51,n52,n53,n54,n55,n56,n57,n58,n59,n6,n60,n61,n62,n63,n64,n65,n66,n67,n68,n69,n7,n70,n71,n72,n73,n74
====================================
runcmd=bmcsetup
when all nodes have made it to the shell, xcoll will just show that the whole nodegroup "ipmi" has the output "shell":
====================================
ipmi
====================================
shell
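The check that every node has reached the shell state can also be scripted instead of watched by eye. The following is a rough sketch: the helper name all_in_state is hypothetical, and it assumes that nodels prints one "node: value" line per node when given a table attribute. The polling loop only runs where the nodels command is actually available.

```shell
#!/bin/bash
# Succeed only if every "node: state" line on stdin reports the wanted state.
all_in_state() {
  awk -v want="$1" '
    { seen = 1 }
    $2 != want { bad = 1 }
    END { exit (seen && !bad) ? 0 : 1 }'
}

# Poll every 30 seconds until all nodes in the ipmi group are in the shell state.
if command -v nodels >/dev/null 2>&1; then
  until nodels ipmi chain.currstate | all_in_state shell; do
    sleep 30
  done
  echo "all nodes in group ipmi have reached the genesis shell"
fi
```

Empty input counts as "not ready", so a failing nodels call does not end the loop early.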
when the nodes are in the xcat genesis shell, you can ssh or psh to any of the nodes to check anything you want.
at this point, the bmcs should all be configured and ready for hardware management. to verify this:
# rpower ipmi stat | xcoll
====================================
ipmi
====================================
on
to get the remote console working for each node, some uefi hardware settings must have specific values. first check the settings, and if they aren't correct, then set them properly. this can be done via the asu utility. the settings are slightly different, depending on the hardware type:
Create a file called asu-show with contents:
show uefi.com1activeafterboot
show uefi.serialportsharing
show uefi.serialportaccessmode
show uefi.remoteconsoleredirection
and create a file called asu-set with contents:
set uefi.com1activeafterboot enable
set uefi.serialportsharing enable
set uefi.serialportaccessmode dedicated
set uefi.remoteconsoleredirection enable
Create a file called asu-show with contents:
show devicesandioports.com1activeafterboot
show devicesandioports.serialportsharing
show devicesandioports.serialportaccessmode
show devicesandioports.remoteconsole
and create a file called asu-set with contents:
set devicesandioports.com1activeafterboot enable
set devicesandioports.serialportsharing enable
set devicesandioports.serialportaccessmode dedicated
set devicesandioports.remoteconsole enable
Then for both types of machines, use the pasu tool to check these settings:
pasu -b asu-show ipmi | xcoll # or you can check just one node and assume the rest are the same
if the settings are not correct, then set them:
pasu -b asu-set ipmi | xcoll
for alternate ways to set the asu settings, see .
now the remote console should work. verify it on one node by running:
rcons
Verify that you can see the genesis shell prompt (after hitting Enter). To exit rcons, type Ctrl-Shift-E (all together), then "c", then ".".
you are now ready to choose an operating system and deployment method for the nodes....
there are two options to install your nodes as stateful (diskful) nodes:
this section describes the process for setting up xcat to install nodes; that is how to install an os on the disk of each node.
The copycds command copies the contents of the Linux distro media to
/install/
copycds/rhel6.2-*-server-x86_64-dvd1.iso
copycds /dev/dvd # or whatever the device name of your dvd drive is
tip: if this is the same distro version as your management node, create a .repo file in /etc/yum.repos.d with content similar to:
[local-rhels6.2-x86_64]
name=xcat local rhels 6.2
baseurl=file:/install/rhels6.2/x86_64
enabled=1
gpgcheck=0
This way, if you need some additional rpms on your MN at a later time, you can simply install them using yum. Or if you are installing other software on your MN that requires some additional rpms from the distro, they will automatically be found and installed.
The copycds command also automatically creates several osimage definitions in the database that can be used for node deployment. To see them:
lsdef -t osimage # see the list of osimages lsdef -t osimage# see the attributes of a particular osimage
from the list above, select the osimage for your distro, architecture, provisioning method (in this case install), and profile (compute, service, etc.). although it is optional, we recommend you make a copy of the osimage, changing its name to a simpler name. for example:
lsdef -t osimage -z rhels6.2-x86_64-install-compute | sed 's/^[^ ]\+:/mycomputeimage:/' | mkdef -z
this displays the osimage "rhels6.2-x86_64-install-compute" in a format that can be used as input to mkdef, but on the way there it uses sed to modify the name of the object to "mycomputeimage".
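To see what the rename step does without touching the xCAT database, the sed expression can be tried on a hand-written stanza first. The stanza below is a shortened, made-up stand-in for real "lsdef -t osimage -z" output, and the pattern uses the GNU sed "\+" (one or more) form, so only the line that starts in the first column and ends with a colon is rewritten.

```shell
#!/bin/bash
# A shortened, hand-written stand-in for "lsdef -t osimage -z" output.
stanza='# <xCAT data object stanza file>

rhels6.2-x86_64-install-compute:
    objtype=osimage
    profile=compute
    provmethod=install'

# Rename only the object-name line; comment and attribute lines do not match.
printf '%s\n' "$stanza" | sed 's/^[^ ]\+:/mycomputeimage:/'
```

The indented attribute lines are left alone because they begin with a space, which is why the pattern anchors on a run of non-space characters at the start of the line.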
initially, this osimage object points to templates, pkglists, etc. that are shipped by default with xcat. and some attributes, for example otherpkglist and synclists, won't have any value at all because xcat doesn't ship a default file for that. you can now change/fill in any that you want. a general convention is that if you are modifying one of the default files that an osimage attribute points to, copy it into /install/custom and have your osimage point to it there. (if you modify the copy under /opt/xcat directly, it will be over-written the next time you upgrade xcat.)
but for now, we will use the default values in the osimage definition and continue on. (if you really want to see examples of modifying/creating the pkglist, template, otherpkgs pkglist, and sync file list, see the section . most of the examples there can be used for stateful nodes too.)
using a postinstall script ( you could also use the updatenode method):
mkdir /install/postscripts/data cp/install/postscripts/data
create the postscript updatekernel:
vi /install/postscripts/updatekernel
add the following lines to the file
#!/bin/bash
rpm -Uivh data/kernel-*rpm
change the permission on the file
chmod 755 /install/postscripts/updatekernel
add the script to the postscripts table and run the install:
chdef -p -t group -o compute postscripts=updatekernel
rnetboot compute
after the initial install of the distro onto nodes, if you want to update the distro on the nodes (either with a few updates or a new sp) without reinstalling the nodes:
copycds/rhel6.3-*-server-x86_64-dvd1.iso
Or, for just a few updated rpms, you can copy the updated rpms from the distributor into a directory under /install and run createrepo in that directory.
chdef -t osimage rhels6.2-x86_64-install-compute -p pkgdir=/install/rhels6.3/x86_64
Note: The above command will add a 2nd repo to the pkgdir attribute. This is only supported for xCAT 2.8.2 and above. For earlier versions of xCAT, omit the -p flag to replace the existing repo directory with the new one.
updatenode compute -P ospkgs
This section describes how to install or configure a diskful node (which we call the golden-client) and capture an osimage from it. The osimage can then be used to clone other nodes.
note: this support is available in xcat 2.8.2 and above.
If you want to use the sysclone provisioning method, you need a golden-client. This lets you customize and tweak the golden-client's configuration according to your needs and verify its proper operation, so that once the image is captured and deployed, the new nodes behave in the same way as the golden-client.
to install a golden-client, follow the section .
to install the systemimager rpms onto the golden-client:
[rh]:
mkdir -p /install/post/otherpkgs/rhels6.3/x86_64/xcat
cd /install/post/otherpkgs/rhels6.3/x86_64/xcat
tar jxvf xcat-dep-*.tar.bz2
[rh]:
chdef -t osimage -ootherpkglist=/opt/xcat/share/xcat/install/rh/sysclone.rhels6.x86_64.otherpkgs.pkglist chdef -t osimage -o -p otherpkgdir=/install/post/otherpkgs/rhels6.3/x86_64 rpower reset # you could also use the updatenode method
[fedora/centos]
Use imgcapture to capture an osimage from the golden-client.
imgcapture-t sysclone -o
tip: when imgcapture is run, it pulls an osimage from the
golden-client, and creates an osimage definition on xcat management
node. use lsdef -t osimage
the nodeset command tells xcat what you want to do next with this node, rsetboot tells the node hardware to boot from the network for the next boot, and powering on the node using rpower starts the installation process:
nodeset compute osimage=mycomputeimage
rsetboot compute net
rpower compute boot
tip: when nodeset is run, it processes the kickstart or autoyast template associated with the osimage, plugging in node-specific attributes, and creates a specific kickstart/autoyast file for each node in /install/autoinst. if you need to customize the template, make a copy of the template file that is pointed to by the osimage.template attribute and edit that file (or the files it includes).
it is possible to use the wcons command to watch the installation process for a sampling of the nodes:
wcons n1,n20,n80,n100
or rcons to watch one node
rcons n1
additionally, nodestat may be used to check the status of a node as it installs:
nodestat n20,n21
n20: installing man-pages - 2.39-10.el5 (0%)
n21: installing prep
note: the percentage complete reported by nodestat is not necessarily reliable.
you can also watch nodelist.status until it changes to "booted" for each node:
nodels compute nodelist.status | xcoll
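as a rough sketch of how that check could be scripted, the function below counts the nodes whose status is not yet "booted". it assumes the plain (non-xcoll) output format "node: status" from nodels. the function name count_not_booted and the 30-second poll interval are illustrative and not part of xcat.

```shell
#!/bin/sh
# count_not_booted: read "node: status" lines on stdin and print how many
# nodes are not yet "booted". (hypothetical helper, not an xcat command.)
count_not_booted() {
    awk -F': *' 'NF >= 2 && $2 != "booted" { n++ } END { print n + 0 }'
}

# example with canned output; on a real management node you would pipe in
#   nodels compute nodelist.status
printf 'n1: booted\nn2: installing\nn3: booted\n' | count_not_booted

# polling loop (commented out because it needs a live xcat installation):
# while [ "$(nodels compute nodelist.status | count_not_booted)" -gt 0 ]; do
#     sleep 30
# done
```

a node with an empty status field is counted as not booted, which is usually what you want while an install is still in progress.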
once all of the nodes are installed and booted, you should be able to ssh to all of them from the mn (without a password), because xcat should have automatically set up the ssh keys (if the postscripts ran successfully):
xdsh compute date
if there are problems, see .
note: this section describes how to create a stateless image using the genimage command to install a list of rpms into the image. as an alternative, you can also capture an image from a running node and create a stateless image out of it. see for details.
the copycds command copies the contents of the linux distro media to
/install/
copycds /rhel6.2-server-20080430.0-x86_64-dvd.iso
copycds /dev/dvd # or whatever the device name of your dvd drive is
tip: if this is the same distro version as your management node, create a .repo file in /etc/yum.repos.d with content similar to:
[local-rhels6.2-x86_64]
name=xcat local rhels 6.2
baseurl=file:/install/rhels6.2/x86_64
enabled=1
gpgcheck=0
this way, if you need some additional rpms on your mn at a later time, you can simply install them using yum. or, if you are installing other software on your mn that requires some additional rpms from the distro, they will automatically be found and installed.
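a minimal sketch of generating such a .repo file from a script is shown below. the function name write_local_repo and the file name local-<os>-<arch>.repo are assumptions made for illustration. the demonstration writes into a scratch directory rather than /etc/yum.repos.d.

```shell
#!/bin/sh
# write_local_repo: create a yum .repo file pointing at the copied distro.
# $1 = directory to write into (/etc/yum.repos.d on a real mn)
# $2 = distro directory name under /install (e.g. rhels6.2)
# $3 = architecture (e.g. x86_64)
write_local_repo() {
    repodir=$1 os=$2 arch=$3
    cat > "$repodir/local-$os-$arch.repo" <<EOF
[local-$os-$arch]
name=xcat local $os
baseurl=file:/install/$os/$arch
enabled=1
gpgcheck=0
EOF
}

# demonstrate against a scratch directory so nothing under /etc is touched
scratch=$(mktemp -d)
write_local_repo "$scratch" rhels6.2 x86_64
cat "$scratch/local-rhels6.2-x86_64.repo"
```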
note: to use an osimage as your provisioning method, you need to be running xcat 2.6.6 or later.
the provmethod attribute of your nodes should contain the name of the osimage object definition that is being used for those nodes. the osimage object contains paths for pkgs, templates, kernels, etc. if you haven't already, run copycds to copy the distro rpms to /install. default osimage objects are also defined when copycds is run. to view the osimages:
lsdef -t osimage # see the list of osimages
lsdef -t osimage # see the attributes of a particular osimage
from the list found above, select the osimage for your distro, architecture, provisioning method (install, netboot, statelite), and profile (compute, service, etc.). although it is optional, we recommend you make a copy of the osimage, changing its name to a simpler name. for example:
lsdef -t osimage -z rhels6.3-x86_64-netboot-compute | sed 's/^[^ ]\+:/mycomputeimage:/' | mkdef -z
this displays the osimage "rhels6.3-x86_64-netboot-compute" in a format that can be used as input to mkdef, but on the way there it uses sed to modify the name of the object to "mycomputeimage".
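to see what the sed step does without touching the xcat database, the sketch below applies an expression of the form s/^[^ ]\+:/newname:/ to a canned stanza. the sample stanza is abbreviated and illustrative only, and rename_stanza is a hypothetical helper name. note that \+ is a gnu sed extension.

```shell
#!/bin/sh
# rename_stanza: rewrite the object-name line of a stanza read from stdin.
# $1 = new object name. only a line that starts in column 1 and ends its
# first word with ":" is changed; indented attribute lines are left alone.
rename_stanza() {
    sed "s/^[^ ]\\+:/$1:/"
}

# canned sample of what "lsdef -t osimage -z" might print (attributes
# abbreviated and illustrative only)
sample='# <xcat data object stanza file>

rhels6.3-x86_64-netboot-compute:
    objtype=osimage
    provmethod=netboot'

printf '%s\n' "$sample" | rename_stanza mycomputeimage
```

because the attribute lines are indented and the comment line has a space after the "#", only the object-name line matches.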
initially, this osimage object points to templates, pkglists, etc. that are shipped by default with xcat. and some attributes, for example otherpkglist and synclists, won't have any value at all because xcat doesn't ship a default file for that. you can now change/fill in any that you want. a general convention is that if you are modifying one of the default files that an osimage attribute points to, copy it into /install/custom and have your osimage point to it there. (if you modify the copy under /opt/xcat directly, it will be over-written the next time you upgrade xcat.)
you likely want to customize the main pkglist for the image. this is the list of rpms or groups that will be installed from the distro. (other rpms that they depend on will be installed automatically.) for example:
mkdir -p /install/custom/netboot/rh
cp -p /opt/xcat/share/xcat/netboot/rh/compute.rhels6.x86_64.pkglist /install/custom/netboot/rh
vi /install/custom/netboot/rh/compute.rhels6.x86_64.pkglist
chdef -t osimage mycomputeimage pkglist=/install/custom/netboot/rh/compute.rhels6.x86_64.pkglist
the goal is to install the fewest number of rpms that still provides the function and applications that you need, because the resulting ramdisk will use real memory in your nodes.
also, check to see if the default exclude list excludes all files and directories you do not want in the image. the exclude list enables you to trim the image after the rpms are installed into the image, so that you can make the image as small as possible.
cp /opt/xcat/share/xcat/netboot/rh/compute.exlist /install/custom/netboot/rh
vi /install/custom/netboot/rh/compute.exlist
chdef -t osimage mycomputeimage exlist=/install/custom/netboot/rh/compute.exlist
make sure nothing is excluded in the exclude list that you need on the node. for example, if you require perl on your nodes, remove the line "./usr/lib/perl5*".
linuximage.pkgdir is the name of the directory where the distro packages are stored. it can be set to multiple paths, separated by ",". the first path is the value of osimage.pkgdir and must be the os base pkg directory path, for example pkgdir=/install/rhels6.2/x86_64,/install/updates/rhels6.2/x86_64 . the os base pkg path contains default repository data. for the other pkg path(s), you must make sure repository data is present; if it is not, use the "createrepo" command to create it.
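one way to verify this before running genimage is sketched below: split the pkgdir value on commas and look for repodata/repomd.xml in each path. check_pkgdirs is a hypothetical helper, and the scratch directories merely stand in for the real paths under /install.

```shell
#!/bin/sh
# check_pkgdirs: for each comma-separated path in a pkgdir value, report
# whether yum repository metadata (repodata/repomd.xml) is present.
check_pkgdirs() {
    old_ifs=$IFS
    IFS=,
    for dir in $1; do
        if [ -f "$dir/repodata/repomd.xml" ]; then
            echo "ok: $dir"
        else
            echo "missing repodata: $dir (run createrepo there)"
        fi
    done
    IFS=$old_ifs
}

# demonstration with scratch directories standing in for /install/...
base=$(mktemp -d)
mkdir -p "$base/rhels6.2/x86_64/repodata" "$base/updates/rhels6.2/x86_64"
: > "$base/rhels6.2/x86_64/repodata/repomd.xml"
check_pkgdirs "$base/rhels6.2/x86_64,$base/updates/rhels6.2/x86_64"
```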
if you have additional os updates rpms (rpms may be from the os website, or the additional os distro) that you also want installed, make a directory to hold them, create a list of the rpms you want installed, and add that information to the osimage definition:
mkdir -p /install/updates/rhels6.2/x86_64
cd /install/updates/rhels6.2/x86_64
cp /myrpms/* .
if there is no repository data in the directory, you can run "createrepo" to create it:
createrepo .
the createrepo command is in the createrepo rpm, which for rhel is in the 1st dvd, but for sles is in the sdk dvd.
note: when the management node is rhels6.x, and the otherpkgs repository data is for rhels5.x, we should run createrepo with "-s md5". such as:
createrepo -s md5 .
...
myrpm1
myrpm2
myrpm3
chdef -t osimage mycomputeimage pkgdir=/install/rhels6.2/x86_64,/install/updates/rhels6.2/x86_64 pkglist=/install/custom/install/rh/compute.rhels6.x86_64.pkglist
if you add more rpms at a later time, you must run createrepo again.
note: after the above setting,
if you have additional rpms (rpms not in the distro) that you also want installed, make a directory to hold them, create a list of the rpms you want installed, and add that information to the osimage definition:
mkdir -p /install/post/otherpkgs/rh/x86_64
cd /install/post/otherpkgs/rh/x86_64
cp /myrpms/* .
createrepo .
note: when the management node is rhels6.x, and the otherpkgs repository data is for rhels5.x, we should run createrepo with "-s md5". such as:
createrepo -s md5 .
myrpm1
myrpm2
myrpm3
chdef -t osimage mycomputeimage otherpkgdir=/install/post/otherpkgs/rh/x86_64 otherpkglist=/install/custom/netboot/rh/compute.otherpkgs.pkglist
if you add more rpms at a later time, you must run createrepo again. the createrepo command is in the createrepo rpm, which for rhel is in the 1st dvd, but for sles is in the sdk dvd.
if you have multiple sets of rpms that you want to keep separate to keep them organized, you can put them in separate sub-directories in the otherpkgdir. if you do this, you need to do the following extra things, in addition to the steps above:
xcat/xcat-core/xcatsn
xcat/xcat-dep/rh6/x86_64/conserver-xcat
there are some examples of otherpkgs.pkglist in
/opt/xcat/share/xcat/netboot/
note: the otherpkgs postbootscript should by default be associated with every node. use lsdef to check:
lsdef node1 -i postbootscripts
if it is not, you need to add it. for example, add it for all of the nodes in the "compute" group:
chdef -p -t group compute postbootscripts=otherpkgs
postinstall scripts for diskless images are analogous to postscripts
for diskful installation. the postinstall script is run by genimage
near the end of its processing. you can use it to do anything to your
image that you want done every time you generate this kind of image. in
the script you can install rpms that need special flags, or tweak the
image in some way. there are some examples shipped in
/opt/xcat/share/xcat/netboot/
chdef -t osimage mycomputeimage postinstall=/install/custom/netboot/rh/compute.postinstall
note: this is only supported for stateless nodes in xcat 2.7 and above.
sync lists contain a list of files that should be sync'd from the management node to the image and to the running nodes. this allows you to have 1 copy of config files for a particular type of node and make sure that all those nodes are running with those config files. the sync list should contain a line for each file you want sync'd, specifying the path it has on the mn and the path it should be given on the node. for example:
/install/custom/syncfiles/compute/etc/motd -> /etc/motd
/etc/hosts -> /etc/hosts
if you put the above contents in /install/custom/netboot/rh/compute.synclist, then:
chdef -t osimage mycomputeimage synclists=/install/custom/netboot/rh/compute.synclist
for more details, see .
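before pointing an osimage at a synclist, it can help to confirm that every source path actually exists on the mn. the sketch below handles only the simple "src -> dest" form shown in the example; check_synclist is a hypothetical helper, and the scratch files stand in for real config files.

```shell
#!/bin/sh
# check_synclist: read a synclist file and report source paths that do not
# exist on this machine. only the simple "src -> dest" form is handled.
check_synclist() {
    while IFS= read -r line; do
        [ -n "$line" ] || continue        # skip blank lines
        src=${line%% -> *}                # text before " -> "
        [ -e "$src" ] || echo "missing source: $src"
    done < "$1"
}

# demonstration: one source file exists, the other does not
scratch=$(mktemp -d)
mkdir -p "$scratch/etc"
echo "welcome" > "$scratch/etc/motd"
cat > "$scratch/compute.synclist" <<EOF
$scratch/etc/motd -> /etc/motd
$scratch/etc/ntp.conf -> /etc/ntp.conf
EOF
check_synclist "$scratch/compute.synclist"
```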
you can configure any noderange to use this osimage. in this example, we define that the whole compute group should use the image:
chdef -t group compute provmethod=mycomputeimage
now that you have associated an osimage with nodes, if you want to list a node's attributes, including the osimage attributes all in one command:
lsdef node1 --osimage
there are other attributes that can be set in your osimage definition. see the for details.
if you are building an image for a different os/architecture than is on the management node, you need to follow this process: . note: different os in this case means, for example, rhel 5 vs. rhel 6. if the difference is just an update level/service pack (e.g. rhel 6.0 vs. rhel 6.3), then you can build it on the mn.
if the image you are building is for nodes that are the same os and architecture as the management node (the most common case), then you can follow the instructions here to run genimage on the management node.
run genimage to generate the image based on the mycomputeimage definition:
genimage mycomputeimage
before you pack the image, you have the opportunity to change any files in the image that you want to, by cd'ing to the rootimgdir (e.g. /install/netboot/rhels6/x86_64/compute/rootimg). although, instead, we recommend that you make all changes to the image via your postinstall script, so that it is repeatable.
the genimage command creates /etc/fstab in the image. if you want to, for example, limit the amount of space that can be used in /tmp and /var/tmp, you can add lines like the following to it (either by editing it by hand or via the postinstall script):
tmpfs /tmp tmpfs defaults,size=50m 0 2
tmpfs /var/tmp tmpfs defaults,size=50m 0 2
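if you add these entries from a script, it is worth making the edit idempotent so that re-running genimage does not duplicate them. the following is a sketch only: add_tmpfs_limit is a hypothetical helper, and how the image root directory is obtained is left to the caller (here it is passed as an explicit argument).

```shell
#!/bin/sh
# add_tmpfs_limit: append a size-limited tmpfs entry to the image's fstab,
# unless an entry for that mount point already exists.
# $1 = image root directory, $2 = mount point, $3 = size (e.g. 50m)
add_tmpfs_limit() {
    fstab=$1/etc/fstab
    if ! awk -v mp="$2" '$2 == mp { found = 1 } END { exit !found }' "$fstab"; then
        echo "tmpfs $2 tmpfs defaults,size=$3 0 2" >> "$fstab"
    fi
}

# demonstration against a scratch "image root"
root=$(mktemp -d)
mkdir -p "$root/etc"
echo "proc /proc proc rw 0 0" > "$root/etc/fstab"
add_tmpfs_limit "$root" /tmp 50m
add_tmpfs_limit "$root" /var/tmp 50m
add_tmpfs_limit "$root" /tmp 50m     # second call for /tmp is a no-op
cat "$root/etc/fstab"
```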
but probably an easier way to accomplish this is to create a postscript to be run when the node boots up with the following lines:
logger -t xcat "$0: begin"
mount -o remount,size=50m /tmp/
mount -o remount,size=50m /var/tmp/
logger -t xcat "$0: end"
assuming you call this postscript settmpsize, you can add this to the list of postscripts that should be run for your compute nodes by:
chdef -t group compute -p postbootscripts=settmpsize
now pack the image to create the ramdisk:
packimage mycomputeimage
the kerneldir attribute in linuximage table is used to
assign one directory to hold the new kernel to be installed into the
stateless/statelite image. its default value is /install/kernels, you need to create the directory named
this example assumes you have the kernel in rpm format in /tmp and that kerneldir is not set (so it takes the default value, /install/kernels).
this procedure assumes you are using xcat 2.6.1 or later. the rpm names are an example and you can substitute your level and architecture. the kernel will be installed directly from the rpm package.
the kernel rpm package is usually named kernel-
cp /tmp/kernel-2.6.32.10-0.5.x86_64.rpm /install/kernels/
createrepo /install/kernels/
usually, the kernel files for sles are separated into two parts: kernel-
kernel-ppc64-base-2.6.27.19-5.1.x86_64.rpm
kernel-ppc64-2.6.27.19-5.1.x86_64.rpm
2.6.27.19-5.1.x86_64 is not the kernel version; 2.6.27.19-5-x86_64 is the kernel version. follow this naming convention to determine the kernel version.
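the naming convention can be expressed with shell parameter expansion. the sketch below derives both strings from an rpm file name, following only the single example given here: the kernel version drops the last dotted component of the release and appends the architecture, while the rpm version keeps the full release. sles_kernel_versions is a hypothetical helper; check its result against your actual kernel packages.

```shell
#!/bin/sh
# sles_kernel_versions: derive the kernel version and the rpm version from
# a sles kernel rpm file name, following the convention in the text.
sles_kernel_versions() {
    rpmfile=$(basename "$1" .rpm)   # kernel-ppc64-base-2.6.27.19-5.1.x86_64
    arch=${rpmfile##*.}             # x86_64
    noarch=${rpmfile%.*}            # kernel-ppc64-base-2.6.27.19-5.1
    release=${noarch##*-}           # 5.1
    rest=${noarch%-*}               # kernel-ppc64-base-2.6.27.19
    version=${rest##*-}             # 2.6.27.19
    echo "kernelver=${version}-${release%.*}-${arch}"
    echo "rpmver=${version}-${release}"
}

sles_kernel_versions /tmp/kernel-ppc64-base-2.6.27.19-5.1.x86_64.rpm
# prints:
#   kernelver=2.6.27.19-5-x86_64
#   rpmver=2.6.27.19-5.1
```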
after the kernel version is determined for sles, then:
cp /tmp/kernel-ppc64-base-2.6.27.19-5.1.x86_64.rpm /install/kernels/
cp /tmp/kernel-ppc64-2.6.27.19-5.1.x86_64.rpm /install/kernels/
run genimage/packimage to update the image with the new kernel: (use sles as example)
since the kernel version is different from the rpm package version, the -g flag needs to be specified on the genimage command for the rpm version of kernel packages.
genimage -i eth0 -n ibmveth -o sles11.1 -p compute -k 2.6.27.19-5-x86_64 -g 2.6.27.19-5.1
the kernel drivers in the stateless initrd are used for the devices during the netboot. if you are missing one or more kernel drivers for specific devices (especially for the network device), the netboot process will fail. xcat offers two approaches to add additional drivers to the stateless initrd during the running of genimage.
genimage-n
generally, the genimage command has a default driver list which will
be added to the initrd. but if you specify the '-n' flag, the default
driver list will be replaced with your
the default driver list:
rh-x86: tg3 bnx2 bnx2x e1000 e1000e igb mlx_en virtio_net be2net
rh-ppc: e1000 e1000e igb ibmveth ehea
sles-x86: tg3 bnx2 bnx2x e1000 e1000e igb mlx_en be2net
sles-ppc: tg3 e1000 e1000e igb ibmveth ehea be2net
note: with this approach, xcat will search for the drivers in the rootimage. you need to make sure the drivers have been included in the rootimage before generating the initrd. you can install the drivers manually in an existing rootimage (using chroot) and run genimage again, or you can use a postinstall script to install drivers to the rootimage during your initial genimage run.
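a quick way to confirm that the drivers are present in the rootimage is to look for their module files under lib/modules. the sketch below is an assumption-laden helper (missing_drivers is a made-up name), and it matches on file names only, so a driver whose module file is named differently from the driver would be reported as missing.

```shell
#!/bin/sh
# missing_drivers: print each named driver that has no kernel module file
# under <rootimg>/lib/modules. $1 = rootimg directory, rest = driver names.
missing_drivers() {
    rootimg=$1
    shift
    for drv in "$@"; do
        # match drv.ko as well as compressed variants such as drv.ko.gz
        if [ -z "$(find "$rootimg/lib/modules" -name "$drv.ko*" 2>/dev/null)" ]; then
            echo "$drv"
        fi
    done
}

# demonstration against a scratch directory standing in for the rootimg
img=$(mktemp -d)
mkdir -p "$img/lib/modules/2.6.32/kernel/drivers/net"
: > "$img/lib/modules/2.6.32/kernel/drivers/net/tg3.ko"
: > "$img/lib/modules/2.6.32/kernel/drivers/net/igb.ko"
missing_drivers "$img" tg3 igb be2net
```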
refer to the doc .
nodeset compute osimage=mycomputeimage
(if you need to update your diskless image sometime later, change your osimage attributes and the files they point to accordingly, and then rerun genimage, packimage, nodeset, and boot the nodes.)
now boot your nodes...
rsetboot compute net
rpower compute boot
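the rebuild-and-reboot sequence described above (genimage, packimage, nodeset, then boot) can be wrapped in a small function. this is only a sketch: refresh_image and the run variable are made up for illustration, and by default the function just prints the commands instead of executing them.

```shell
#!/bin/sh
# refresh_image: print (or, with run=1, execute) the command sequence for
# rebuilding a stateless image and rebooting the nodes that use it.
# $1 = osimage name, $2 = noderange
refresh_image() {
    for cmd in "genimage $1" "packimage $1" "nodeset $2 osimage=$1" \
               "rsetboot $2 net" "rpower $2 boot"; do
        if [ "${run:-0}" = 1 ]; then
            $cmd || return 1      # stop at the first failing step
        else
            echo "$cmd"
        fi
    done
}

refresh_image mycomputeimage compute    # dry run: only prints the commands
```

setting run=1 in the environment makes the function execute each step and stop at the first failure, which avoids rebooting nodes onto an image that failed to build.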
this section gives some examples of using key commands and command combinations in useful ways. for any xcat command, typing 'man
in this configuration, a handy convenience group would be the lower systems in the chassis, the ones able to read temperature and fanspeed. in this case, the odd systems would be on the bottom, so to do this with a regular expression:
# nodech '/n.*[13579]$' groups,=bottom
or explicitly
chdef -p n1-n9,n11-n19,n21-n29,n31-n39,n41-n49,n51-n59,n61-n69,n71-n79,n81-n89,n91-n99,n101-n109,n111-n119,n121-n129,n131-n139,n141-n149,n151-n159,n161-n167 groups="bottom"
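if you prefer an explicit list of only the odd-numbered nodes, it can be generated rather than typed. the helper below (odd_noderange, a hypothetical name) builds a comma-separated list from a prefix and the highest node number.

```shell
#!/bin/sh
# odd_noderange: print a comma-separated list of odd-numbered node names.
# $1 = name prefix, $2 = highest node number
odd_noderange() {
    seq 1 2 "$2" | sed "s/^/$1/" | paste -sd, -
}

odd_noderange n 9
# prints: n1,n3,n5,n7,n9
```

for this cluster, odd_noderange n 167 yields 84 node names, and the result could be used as the noderange when setting the group.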
we can list discovered and expanded versions of attributes (actual vpd should appear instead of *) :
# nodels n97 nodepos.rack nodepos.u vpd.serial vpd.mtm
n97: nodepos.u: a-13
n97: nodepos.rack: 2
n97: vpd.serial: ********
n97: vpd.mtm: *******
you can also list all the attributes:
# lsdef n97
object name: n97
    arch=x86_64
    .
    groups=bottom,ipmi,idataplex,42perswitch,compute,all
    .
    .
    .
    rack=1
    unit=a1
xcat provides parallel commands and the sinv (inventory) command, to analyze the consistency of the cluster. see
combining the use of in-band and out-of-band utilities with the xcoll utility, it is possible to quickly analyze the level and consistency of firmware across the servers:
mgt# rinv n1-n3 mprom|xcoll
====================================
n1,n2,n3
====================================
bmc firmware: 1.18
the bmc does not have the bios version, so to do the same for that, use psh:
mgt# psh n1-n3 dmidecode|grep "bios information" -a4|grep version|xcoll
====================================
n1,n2,n3
====================================
version: i1e123a
to update the firmware on your nodes, see .
to do this, see .
xcat has several utilities to help manage and monitor the mellanox ib network. see .
if the configuration is louder than expected (idataplex chassis should nominally have a fairly modest noise impact), find the nodes with elevated fanspeed:
# rvitals bottom fanspeed|sort -k 4|tail -n 3
n3: psu fan3: 2160 rpm
n3: psu fan4: 2240 rpm
n3: psu fan1: 2320 rpm
in this example, the fanspeeds are pretty typical. if fan speeds are
elevated, there may be a thermal issue. in a dx340 system, if near
10,000 rpm, there is probably either a defective sensor or misprogrammed
power supply.
to find the warmest detected temperatures in a configuration:
# rvitals bottom temp|grep domain|sort -t: -k 3|tail -n 3
n3: domain b therm 1: 46 c (115 f)
n7: domain a therm 1: 47 c (117 f)
n3: domain a therm 1: 49 c (120 f)
change tail to head in the above examples to seek the slowest fans/lowest temperatures. currently, an idataplex chassis without a planar tray in the top position will report '0 c' for domain b temperatures.
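instead of sorting, you may want to list only the readings at or above a threshold. the sketch below compares the celsius value numerically with awk, which sidesteps the text comparison that sort performs without -n. it assumes the "node: sensor: value" line layout shown in the example output; hot_readings is a hypothetical helper name.

```shell
#!/bin/sh
# hot_readings: read rvitals-style lines ("node: sensor: 49 c (120 f)") on
# stdin and print those whose celsius value is at or above the threshold.
# $1 = threshold in degrees c
hot_readings() {
    awk -F': *' -v limit="$1" '$3 + 0 >= limit { print }'
}

# canned sample; on a real mn you would pipe in: rvitals bottom temp
printf '%s\n' \
    'n3: domain b therm 1: 46 c (115 f)' \
    'n7: domain a therm 1: 47 c (117 f)' \
    'n3: domain a therm 1: 49 c (120 f)' | hot_readings 47
```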
for more options, see rvitals manpage: