The following three sections show excerpts from system log files:
Startup messages following Production Server installation (taken from kern.log)
Formation of a new Production Server cluster (taken from daemon.log)
Recovery of an existing Production Server cluster (taken from daemon.log)
These messages track normal cluster startup operations; therefore, in addition to providing some level of assurance that cluster formation and recovery operations are proceeding in an orderly fashion, they also provide a starting point for troubleshooting cluster-related problems.
Example A-1 shows a portion of the startup messages displayed during a reboot of the first cluster member system after installing Production Server.
This information is also sent to the kern.log file.
>>> boot
.
.
.
jumping to bootstrap code
Digital UNIX boot - Wed May 28 17:05:23 EDT 1997
Loading vmunix ...
.
.
.
pci0 at nexus
eisa0 at pci0
ace0 at eisa0
ace1 at eisa0
lp0 at eisa0
fdi0 at eisa0
fd0 at fdi0 unit 0
cirrus0 at eisa0
cirrus0: Cirrus Logic CL-GD5428 (SVGA) 512 Kbytes
pci2000 at pci0 slot 8
isp0 at pci2000 slot 0
isp0: QLOGIC ISP1020A
isp0: Firmware revision 5.19 (loaded by console)
scsi0 at isp0 slot 0
rz0 at scsi0 target 0 lun 0 (LID=0) (DEC RZ28M (C) DEC 0568) (Wide16)
rz1 at scsi0 target 1 lun 0 (LID=1) (DEC RZ29B (C) DEC 0007) (Wide16)
rz5 at scsi0 target 5 lun 0 (LID=2) (DEC RRD45 (C) DEC 1645)
pza0 at pci2000 slot 1
pza0 firmware version: DEC F01 A10
scsi1 at pza0 slot 0
rz9 at scsi1 target 1 lun 0 (LID=3) (DEC RZ26 (C) DEC 392A)
rz10 at scsi1 target 2 lun 0 (LID=4) (DEC RZ26 (C) DEC 392A)
processor at scsi1 target 6 lun 7 (LID=12) (DEC ASE DEC L01 A10 TMV2) (Wide16)
pza1 at pci2000 slot 2
pza1 firmware version: DEC L01 A10
scsi2 at pza1 slot 0
rz18 at scsi2 target 2 lun 0 (LID=13) (DEC RZ26N (C) DEC 0744)
rz19 at scsi2 target 3 lun 0 (LID=14) (DEC RZ26N (C) DEC 0616)
processor at scsi2 target 6 lun 7 (LID=22) (DEC ASE DEC L01 A10 TMV2) (Wide16)
pza2 at pci2000 slot 3
pza2 firmware version: DEC F01 A10
scsi3 at pza2 slot 0
pza3 at pci2000 slot 4
pza3 firmware version: DEC F01 A10
scsi4 at pza3 slot 0
mchan0: Module revision = 33E [1]
mchan0: jumpered as VH1 configuration
mchan0 at pci0 slot 11
tu0: DECchip 21040: Revision: 2.3
tu0 at pci0 slot 13
tu0: DEC TULIP (10Mbps) Ethernet Interface, hardware address: 08-00-2B-E5-F8-0A
tu0: console mode: selecting 10BaseT (UTP) port: half duplex
gpc0 at eisa0
Created FRU table binary error log packet
kernel console: ace0
dli: configured
clubase: configured [2]
dlmsl: configured [3]
drd: configured. [4]
cnxagent: configured [5]
dlm: configured. [6]
memory channel thread init [7]
rm_sw_init: begin MC initialization.
rm_boot_am_i_alone: entered
checking for existing memory channel nodes [8]
rm_slave_init
rm_get_proto: returning vers = 1
slave unit boot phase 0: checking cables [9]
slave unit boot phase 1: request data ...
slave unit boot phase 2: get lock data from all nodes
slave unit boot phase 3: update request ...
memory channel software inited - node 1 on mc0 [10]
rm_get_proto: returning vers = 1
ccomsub: state change detected via remote node 0
ccomsub: configured [11]
mcnet: configured
memory channel - adding node 0
RM member change callback: no change in member bitmap 0x3
ADVFS: using 1153 buffers containing 9.00 megabytes of memory
starting LSM
Checking local filesystems
/sbin/ufs_fsck -p
.
.
.
Streams autopushes configured
Initializing the ASE Availability Manager [12]
AM found a host at bus 1 target 6, lun 7
AM found a host at bus 2 target 6, lun 7
Configuring network
hostname: clu14.abc.def.com [13]
.
.
.
/usr/sbin/drd_dma: Peer-to-peer DMA is NOT sure to work between [14]
    scsi and MEMORY CHANNEL controllers
/usr/sbin/drd_dma: Peer-to-peer DMA over MEMORY CHANNEL is NOT enabled.
ONC portmap service started
Cluster member started
Starting ASE ... [15]
Initializing the ASE Availability Manager
ASE logger started (/usr/sbin/aselogger)
ASE agent started (/usr/sbin/aseagent)
ASE member started
.
.
.
cnxagent: Get MC information reports hubless [16]
cnxagent: added node mcclu13
cnxagent: mcclu14 is now a cluster member [17]
dlm_agent: resuming lock activity
.
.
.
Network Time Service started
cnxagent: resuming
.
.
.
Printer service started
The system is ready.
The messages highlighted in Example A-1 indicate the following:
[1] The three mchan lines indicate that a device probe has found the MEMORY CHANNEL adapter and determined its revision number. This adapter is jumpered as VH1, indicating that it is part of a virtual hub. (The messages indicate whether a MEMORY CHANNEL adapter is jumpered as VH0 or VH1 (virtual hub) or connects to a MEMORY CHANNEL hub.)
[2] The cluster component is initializing.
[3] The Distributed Lock Manager (DLM) Session Layer component is initializing.
[4] Distributed raw disk (DRD) is initializing.
[5] The connection manager is initializing.
[6] The DLM is initializing.
[7] The general-purpose MEMORY CHANNEL thread has completed initialization.
[8] The system is looking for other nodes connected to the MEMORY CHANNEL that may be either running or in the process of booting.
[9] This system is the second to boot (slave) and initialize MEMORY CHANNEL code.
[10] The initialization of low-level MEMORY CHANNEL software is complete.
[11] The cluster communication subsystem is initializing.
[12] The ASE availability manager driver is initializing. The hardware probes for shared buses and reports any active hosts found.
[13] The system prints its hostname (the output from /sbin/hostname).
[14] The drd_dma program checks the hardware configuration to determine whether the system can use peer-to-peer DMA, and prints the result.
[15] The ASE daemons are started.
[16] The cnxagent subsystem reports that the cluster is operating in a virtual hub configuration.
[17] The system is identified as a cluster member.
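As a rough aid when reading a saved copy of these startup messages, the following Python sketch checks whether each cluster component reported itself as configured and extracts the MEMORY CHANNEL jumper setting. The function name summarize_startup, the EXPECTED component list, and the regular expressions are assumptions derived only from the lines shown in Example A-1; other releases or configurations may print different text.

```python
import re

# Components whose "configured" lines appear in Example A-1.
# This list is an assumption, not an authoritative inventory.
EXPECTED = ["clubase", "dlmsl", "drd", "cnxagent", "dlm", "ccomsub"]

CONFIGURED_RE = re.compile(r"^(\w+): configured\.?")
JUMPER_RE = re.compile(r"^mchan\d+: jumpered as (\S+) configuration")
CALLOUT_RE = re.compile(r"\s*\[\d+\]\s*$")  # manual callouts such as [2]


def summarize_startup(lines):
    """Return (jumper setting or None, expected components not seen)."""
    seen = set()
    jumper = None
    for raw in lines:
        line = CALLOUT_RE.sub("", raw.strip())
        m = CONFIGURED_RE.match(line)
        if m:
            seen.add(m.group(1))
        m = JUMPER_RE.match(line)
        if m:
            jumper = m.group(1)
    missing = [name for name in EXPECTED if name not in seen]
    return jumper, missing


# Lines copied from Example A-1.
sample = """\
mchan0: Module revision = 33E [1]
mchan0: jumpered as VH1 configuration
clubase: configured [2]
dlmsl: configured [3]
drd: configured. [4]
cnxagent: configured [5]
dlm: configured. [6]
ccomsub: configured [11]
""".splitlines()

jumper, missing = summarize_startup(sample)
print("MEMORY CHANNEL jumper:", jumper)
print("components not seen:", missing)
```

A non-empty second value only means the expected line was not found in the text supplied; it does not by itself prove that the component failed to initialize.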
See the TruCluster Software Products Administration manual for descriptions of important messages generated by TruCluster products.
Example A-2 shows messages from the daemon.log file related to the formation of a new Production Server cluster.
May 17 17:49:33 mcclu5 cnxpingd: starting [1]
May 17 17:49:33 mcclu5 cnxagentd: starting
May 17 17:49:34 mcclu5 cnxmond: changed alias with : /sbin/ifconfig mc0 alias 10.0.0.42 netmask 255.255.255.0 [2]
May 17 17:49:34 mcclu5 cnxmgrd: starting [3]
.
.
.
May 17 17:50:14 mcclu5 cnxmgrd: attempting cluster recovery/formation [4]
May 17 17:50:14 mcclu5 cnxmgrd: recovery, considering mcclu5
May 17 17:50:14 mcclu5 cnxmgrd: node mcclu5, cluster incarn 0, update_seq 0
May 17 17:50:14 mcclu5 cnxmgrd: node mcclu5 not a member
May 17 17:50:14 mcclu5 cnxmgrd: forming a cluster
May 17 17:50:14 mcclu5 cnxmgrd: completed cluster recovery/formation
.
.
.
May 17 17:50:18 mcclu5 cnxmgrd: starting join operation for mcclu5 [5]
May 17 17:50:18 mcclu5 cnxmgrd: join, getting status from mcclu5
May 17 17:50:18 mcclu5 cnxmgrd: node mcclu5, cluster incarn 0, update_seq 0
May 17 17:50:18 mcclu5 cnxmgrd: adding node mcclu5
May 17 17:50:18 mcclu5 cnxmgrd: update complete, summary follows
May 17 17:50:18 mcclu5 cnxmgrd: members are:
May 17 17:50:18 mcclu5 cnxmgrd: mcclu5
May 17 17:50:18 mcclu5 cnxmgrd: timed out are:
May 17 17:50:18 mcclu5 cnxmgrd: none
May 17 17:50:18 mcclu5 cnxmgrd: finished join operation, update_seq 2
.
.
.
May 17 17:50:18 mcclu5 xntpd[668]: xntpd version 1.3
May 17 17:51:45 mcclu5 ASE: local HSM Notice: member mcclu8 is UP [6]
.
.
.
May 17 17:51:52 mcclu5 cnxmgrd: starting join operation for mcclu8 [7]
May 17 17:51:52 mcclu5 cnxmgrd: join, getting status from mcclu8
May 17 17:51:52 mcclu5 cnxmgrd: node mcclu8, cluster incarn 0, update_seq 0
May 17 17:51:52 mcclu5 cnxmgrd: adding node mcclu8
May 17 17:51:53 mcclu5 cnxmgrd: update complete, summary follows
May 17 17:51:53 mcclu5 cnxmgrd: members are:
May 17 17:51:53 mcclu5 cnxmgrd: mcclu5
May 17 17:51:53 mcclu5 cnxmgrd: mcclu8
May 17 17:51:53 mcclu5 cnxmgrd: timed out are:
May 17 17:51:53 mcclu5 cnxmgrd: none
May 17 17:51:53 mcclu5 cnxmgrd: finished join operation, update_seq 3
.
.
.
May 17 17:52:04 mcclu5 ASE: local Director Notice: agent on mcclu8 came ONLINE [8]
.
.
.
May 17 17:54:54 mcclu5 cnxmond: node mcclu8 timed out [9]
May 17 17:55:30 mcclu5 cnxmgrd: intend to remove node mcclu8
May 17 17:55:54 mcclu5 cnxmgrd: starting removal operation
May 17 17:55:54 mcclu5 cnxmgrd: removing node mcclu8
May 17 17:55:55 mcclu5 cnxmgrd: update complete, summary follows
May 17 17:55:55 mcclu5 cnxmgrd: members are:
May 17 17:55:55 mcclu5 cnxmgrd: mcclu5
May 17 17:55:55 mcclu5 cnxmgrd: timed out are:
May 17 17:55:55 mcclu5 cnxmgrd: none
May 17 17:55:55 mcclu5 cnxmgrd: finished removal operation, update_seq 4
[1] The connection manager monitor daemon, cnxmond, starts the ping daemon, cnxpingd, and the agent daemon, cnxagentd, on system mcclu5.
[2] The connection manager monitor daemon (cnxmond) on system mcclu5 acquires a spinlock on the MEMORY CHANNEL bus (mc0) and registers the cluster_cnx service alias (10.0.0.42).
[3] As a result of acquiring the spinlock and registering the cluster_cnx alias, the connection manager starts the director daemon (cnxmgrd) on system mcclu5.
[4] The director tries to find systems that are eligible to become cluster members. If the director finds eligible systems, these systems become members and the cluster is formed from them. In this case, no other eligible systems were found; therefore, the cluster is formed with no other members.
[5] Systems are added to the cluster as members one at a time. Because system mcclu5 is up and running, it is considered first and added as a member.
[6] System mcclu8 is detected as up and running.
[7] System mcclu8 becomes a cluster member. The cluster now has two members, mcclu5 and mcclu8.
[8] The ASE director detects a new agent on system mcclu8.
[9] The director detects a system failure and removes the failed system from the cluster membership. The cluster now has only one member, mcclu5.
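The membership changes described for Example A-2 can be followed mechanically by replaying the cnxmgrd "adding node" and "removing node" messages. The Python sketch below is one possible way to do that; the function name replay_membership and the regular expressions are assumptions based solely on the message text shown in Example A-2, not on any documented log format.

```python
import re

# Patterns inferred from the daemon.log excerpt in Example A-2 (assumed).
ADD_RE = re.compile(r"cnxmgrd: adding node (\S+)")
REMOVE_RE = re.compile(r"cnxmgrd: removing node (\S+)")


def replay_membership(lines):
    """Return a list of (node, action, members_after) tuples."""
    members = set()
    history = []
    for line in lines:
        m = ADD_RE.search(line)
        if m:
            members.add(m.group(1))
            history.append((m.group(1), "added", sorted(members)))
            continue
        m = REMOVE_RE.search(line)
        if m:
            members.discard(m.group(1))
            history.append((m.group(1), "removed", sorted(members)))
    return history


# Lines copied from Example A-2.
sample = [
    "May 17 17:50:18 mcclu5 cnxmgrd: adding node mcclu5",
    "May 17 17:51:52 mcclu5 cnxmgrd: adding node mcclu8",
    "May 17 17:55:30 mcclu5 cnxmgrd: intend to remove node mcclu8",
    "May 17 17:55:54 mcclu5 cnxmgrd: removing node mcclu8",
]

history = replay_membership(sample)
for node, action, members in history:
    print(node, action, "members now:", ", ".join(members))
```

The "intend to remove node" message is deliberately not treated as a removal here, because in the excerpt the membership summary changes only after the "removing node" message.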
Example A-3 shows an excerpt from the daemon.log file related to the recovery of an existing Production Server cluster.
May 17 18:18:03 mcclu5 cnxmond: changed alias with : /sbin/ifconfig mc0 alias 10.0.0.42 netmask [1]
    255.255.255.0
May 17 18:18:03 mcclu5 cnxmgrd: starting
May 17 18:18:43 mcclu5 cnxmond: recovery delay completed
May 17 18:18:43 mcclu5 cnxmgrd: attempting cluster recovery/formation
May 17 18:18:43 mcclu5 cnxmgrd: recovery, considering mcclu5
May 17 18:18:43 mcclu5 cnxmgrd: node mcclu5, cluster incarn 0000000000080e90, update_seq 2 [2]
May 17 18:18:43 mcclu5 cnxmgrd: found member of cluster incarnation 0000000000080e90
May 17 18:18:43 mcclu5 cnxmgrd: node is mcclu5, update_seq is 2
May 17 18:18:43 mcclu5 cnxmgrd: same update_seq 2, node mcclu5
May 17 18:18:43 mcclu5 cnxmgrd: using mcclu5
May 17 18:18:43 mcclu5 cnxmgrd: recovering a cluster [3]
May 17 18:18:43 mcclu5 cnxmgrd: starting recovery update
May 17 18:18:43 mcclu5 cnxmgrd: update complete, summary follows
May 17 18:18:43 mcclu5 cnxmgrd: members are:
May 17 18:18:43 mcclu5 cnxmgrd: mcclu5
May 17 18:18:43 mcclu5 cnxmgrd: timed out are:
May 17 18:18:43 mcclu5 cnxmgrd: none
May 17 18:18:43 mcclu5 cnxmgrd: finished recovery update, update_seq 3
May 17 18:18:43 mcclu5 cnxmgrd: completed cluster recovery/formation
[1] A director is selected and cluster recovery is started.
[2] The director finds system mcclu5, which was a member of a cluster (note the nonzero value of cluster incarn).
[3] The cluster created in Example A-2 is recovered.
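The difference between formation (Example A-2) and recovery (Example A-3) shows up in the cluster incarn field of the cnxmgrd status message: zero for a node that was not a member, nonzero for a node that was. The Python sketch below extracts that field. The function names and the regular expression are hypothetical, and treating the incarnation value as hexadecimal is an assumption based only on the digits shown in Example A-3.

```python
import re

# Pattern inferred from the status lines in Examples A-2 and A-3 (assumed).
STATUS_RE = re.compile(
    r"cnxmgrd: node ([^,\s]+), cluster incarn ([0-9a-fA-F]+), update_seq (\d+)"
)


def parse_status(line):
    """Return (node, incarnation, update_seq) or None if the line differs."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    node, incarn, seq = m.groups()
    # Assumption: the incarnation is printed as a hexadecimal number.
    return node, int(incarn, 16), int(seq)


def was_member(line):
    """True if the status line reports a nonzero cluster incarnation."""
    status = parse_status(line)
    return status is not None and status[1] != 0


# One status line from each example.
formation = ("May 17 17:50:14 mcclu5 cnxmgrd: node mcclu5, "
             "cluster incarn 0, update_seq 0")
recovery = ("May 17 18:18:43 mcclu5 cnxmgrd: node mcclu5, "
            "cluster incarn 0000000000080e90, update_seq 2 [2]")

results = [(parse_status(line), was_member(line))
           for line in (formation, recovery)]
for status, member in results:
    print(status[0], "previous member:", member)
```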