A Cluster-Related Messages in System Log Files

The following three sections show excerpts from system log files in the /var/adm/syslog.dated/date directories:

Startup messages following Production Server installation (taken from kern.log)

Formation of a new Production Server cluster (taken from daemon.log)

Recovery of an existing Production Server cluster (taken from daemon.log)

These messages track normal cluster startup operations. They give some assurance that cluster formation and recovery are proceeding in an orderly fashion, and they provide a baseline for comparison when you troubleshoot cluster-related problems.
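Because the log files live in per-date subdirectories, a small helper can save time when locating the one to read. The following Python sketch is only an illustration: the function name newest_log is hypothetical, and because the naming scheme of the date directories is not described here, the sketch assumes that the most recently modified copy of the file is the one of interest.

```python
import os

def newest_log(base_dir, log_name):
    """Return the path of the most recently modified log_name found in
    any immediate subdirectory of base_dir, or None if there is none."""
    candidates = []
    for entry in os.listdir(base_dir):
        path = os.path.join(base_dir, entry, log_name)
        if os.path.isfile(path):
            # Assumption: modification time identifies the current log.
            candidates.append((os.path.getmtime(path), path))
    if not candidates:
        return None
    return max(candidates)[1]
```

For example, newest_log("/var/adm/syslog.dated", "daemon.log") would return the path of the latest daemon.log under that directory, if one exists.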

A.1 Startup Messages Following Production Server Installation

Example A-1 shows a transcript of a portion of the startup messages displayed during a reboot of the first cluster member system after installing Production Server. This information is also sent to /var/adm/syslog.dated/date/kern.log. Callouts in this example highlight messages relevant to cluster installation.

Example A-1: Startup Messages Related to Cluster Installation

>>> boot

.
.
.
jumping to bootstrap code
 
Digital UNIX boot - Wed May 28 17:05:23 EDT 1997
 
Loading vmunix ...

.
.
.
pci0 at nexus
eisa0 at pci0
ace0 at eisa0
ace1 at eisa0
lp0 at eisa0
fdi0 at eisa0
fd0 at fdi0 unit 0
cirrus0 at eisa0
cirrus0: Cirrus Logic CL-GD5428 (SVGA) 512 Kbytes
pci2000 at pci0 slot 8
isp0 at pci2000 slot 0
isp0: QLOGIC ISP1020A
isp0: Firmware revision 5.19 (loaded by console)
scsi0 at isp0 slot 0
rz0 at scsi0 target 0 lun 0 (LID=0) (DEC     RZ28M    (C) DEC 0568) (Wide16)
rz1 at scsi0 target 1 lun 0 (LID=1) (DEC     RZ29B    (C) DEC 0007) (Wide16)
rz5 at scsi0 target 5 lun 0 (LID=2) (DEC     RRD45   (C) DEC  1645)
pza0 at pci2000 slot 1
pza0 firmware version: DEC  F01  A10   
scsi1 at pza0 slot 0
rz9 at scsi1 target 1 lun 0 (LID=3) (DEC     RZ26     (C) DEC 392A)
rz10 at scsi1 target 2 lun 0 (LID=4) (DEC     RZ26     (C) DEC 392A)
processor at scsi1 target 6 lun 7 (LID=12) (DEC ASE DEC  L01  A10 TMV2) (Wide16)
pza1 at pci2000 slot 2
pza1 firmware version: DEC  L01  A10   
scsi2 at pza1 slot 0
rz18 at scsi2 target 2 lun 0 (LID=13) (DEC     RZ26N    (C) DEC 0744)
rz19 at scsi2 target 3 lun 0 (LID=14) (DEC     RZ26N    (C) DEC 0616)
processor at scsi2 target 6 lun 7 (LID=22) (DEC ASE DEC  L01  A10 TMV2) (Wide16)
pza2 at pci2000 slot 3
pza2 firmware version: DEC  F01  A10   
scsi3 at pza2 slot 0
pza3 at pci2000 slot 4
pza3 firmware version: DEC  F01  A10   
scsi4 at pza3 slot 0
mchan0: Module revision = 33E   [1]
mchan0: jumpered as VH1 configuration
mchan0 at pci0 slot 11               
tu0: DECchip 21040: Revision: 2.3
tu0 at pci0 slot 13
tu0: DEC TULIP (10Mbps) Ethernet Interface, hardware address: 08-00-2B-E5-F8-0A
tu0: console mode: selecting 10BaseT (UTP) port: half duplex
gpc0 at eisa0
Created FRU table binary error log packet
kernel console: ace0
dli: configured
clubase: configured   [2]
dlmsl: configured   [3]
drd: configured.   [4]
cnxagent: configured   [5]
dlm: configured.   [6]
memory channel thread init   [7]
rm_sw_init: begin MC initialization.
rm_boot_am_i_alone: entered
checking for existing memory channel nodes   [8]
rm_slave_init
rm_get_proto: returning vers = 1
slave unit boot phase 0: checking cables   [9]
slave unit boot phase 1: request data ...
slave unit boot phase 2: get lock data from all nodes
slave unit boot phase 3: update request ...
memory channel software inited - node 1 on mc0   [10]
rm_get_proto: returning vers = 1
ccomsub: state change detected via remote node 0
ccomsub: configured   [11]
mcnet: configured
memory channel - adding node 0
RM member change callback: no change in member bitmap 0x3
ADVFS: using 1153 buffers containing 9.00 megabytes of memory
starting LSM
Checking local filesystems
/sbin/ufs_fsck -p

.
.
.
Streams autopushes configured
Initializing the ASE Availability Manager   [12]
AM found a host at bus 1 target 6, lun 7  
AM found a host at bus 2 target 6, lun 7  
Configuring network
hostname: clu14.abc.def.com   [13]

.
.
.
/usr/sbin/drd_dma: Peer-to-peer DMA is NOT sure to work between   [14]
               scsi and MEMORY CHANNEL controllers
/usr/sbin/drd_dma: Peer-to-peer DMA over MEMORY CHANNEL is NOT enabled.
ONC portmap service started
Cluster member started
Starting ASE ...   [15]
        Initializing the ASE Availability Manager
        ASE logger started (/usr/sbin/aselogger)
        ASE agent started (/usr/sbin/aseagent)
ASE member started

.
.
.
cnxagent: Get MC information reports hubless   [16]
cnxagent: added node mcclu13
cnxagent: mcclu14 is now a cluster member   [17]
dlm_agent: resuming lock activity

.
.
.
Network Time Service started
cnxagent: resuming

.
.
.
Printer service started
The system is ready.

The messages highlighted in Example A-1 indicate the following:

[1] The three mchan lines indicate that a device probe has found the MEMORY CHANNEL adapter and determined its revision number. This adapter is jumpered as VH1, indicating that it is part of a virtual hub. (These messages indicate whether a MEMORY CHANNEL adapter is jumpered for a virtual hub, as VH0 or VH1, or connects to a MEMORY CHANNEL hub.)

[2] The cluster component is initializing.

[3] The Distributed Lock Manager (DLM) Session Layer component is initializing.

[4] Distributed raw disk (DRD) is initializing.

[5] The connection manager is initializing.

[6] The DLM is initializing.

[7] The general-purpose MEMORY CHANNEL thread has completed initialization.

[8] The system is looking for other nodes connected to the MEMORY CHANNEL that may be either running or in the process of booting.

[9] This system is the second to boot (slave) and initialize MEMORY CHANNEL code.

[10] The initialization of low-level MEMORY CHANNEL software is complete.

[11] The cluster communication subsystem is initializing.

[12] The ASE availability manager driver is initializing. The hardware probes for shared buses and reports any active hosts found.

[13] The system prints its hostname (the output from /sbin/hostname).

[14] The drd_dma utility checks the hardware configuration to determine whether the system can use peer-to-peer DMA, and prints the result.

[15] The ASE daemons are started.

[16] The cnxagent subsystem reports that the cluster is operating in a virtual hub configuration.

[17] The system is identified as a cluster member.

See the TruCluster Software Products Administration manual for descriptions of important messages generated by TruCluster products.
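The "configured" messages in Example A-1 lend themselves to a quick automated check. The following Python sketch shows one possible way to report which cluster components did not announce themselves in a set of startup messages. The function name missing_components and the EXPECTED list are illustrative and not part of the product; the list is simply the set of component names that appear with "configured" in Example A-1. The sketch assumes each message ends with "configured", optionally followed by a period, and tolerates any prefix that the log file may add before the component name.

```python
import re

# Component names taken from the "configured" lines in Example A-1.
EXPECTED = ["clubase", "dlmsl", "drd", "cnxagent", "dlm", "ccomsub"]

# Matches "<component>: configured" at the end of a line, with an
# optional trailing period (as in "drd: configured.").
CONFIGURED = re.compile(r"(\w+): configured\.?\s*$")

def missing_components(lines, expected=EXPECTED):
    """Return the expected components with no 'configured' message."""
    seen = set()
    for line in lines:
        match = CONFIGURED.search(line)
        if match:
            seen.add(match.group(1))
    return [name for name in expected if name not in seen]
```

An empty result means every expected component reported in; any names returned are a reasonable place to start troubleshooting.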

A.2 Formation of a New Cluster

Example A-2 shows messages from the daemon.log file related to the formation of a new Production Server cluster.

Example A-2: Log File Showing Formation of a New Cluster

May 17 17:49:33 mcclu5 cnxpingd: starting [1]
May 17 17:49:33 mcclu5 cnxagentd: starting
May 17 17:49:34 mcclu5 cnxmond: changed alias with : /sbin/ifconfig mc0 alias 10.0.0.42 netmask 255.255.255.0 [2]
May 17 17:49:34 mcclu5 cnxmgrd: starting [3]

.
.
.
May 17 17:50:14 mcclu5 cnxmgrd: attempting cluster recovery/formation [4]
May 17 17:50:14 mcclu5 cnxmgrd: recovery, considering mcclu5
May 17 17:50:14 mcclu5 cnxmgrd: node mcclu5, cluster incarn 0, update_seq 0
May 17 17:50:14 mcclu5 cnxmgrd: node mcclu5 not a member
May 17 17:50:14 mcclu5 cnxmgrd: forming a cluster
May 17 17:50:14 mcclu5 cnxmgrd: completed cluster recovery/formation

.
.
.
May 17 17:50:18 mcclu5 cnxmgrd: starting join operation for mcclu5 [5]
May 17 17:50:18 mcclu5 cnxmgrd: join, getting status from mcclu5
May 17 17:50:18 mcclu5 cnxmgrd: node mcclu5, cluster incarn 0, update_seq 0
May 17 17:50:18 mcclu5 cnxmgrd: adding node mcclu5
May 17 17:50:18 mcclu5 cnxmgrd: update complete, summary follows
May 17 17:50:18 mcclu5 cnxmgrd:   members are:
May 17 17:50:18 mcclu5 cnxmgrd:    mcclu5
May 17 17:50:18 mcclu5 cnxmgrd:   timed out are:
May 17 17:50:18 mcclu5 cnxmgrd:    none
May 17 17:50:18 mcclu5 cnxmgrd: finished join operation, update_seq 2

.
.
.
May 17 17:50:18 mcclu5 xntpd[668]: xntpd version 1.3
May 17 17:51:45 mcclu5 ASE: local HSM Notice: member mcclu8 is UP [6]

.
.
.
May 17 17:51:52 mcclu5 cnxmgrd: starting join operation for mcclu8 [7]
May 17 17:51:52 mcclu5 cnxmgrd: join, getting status from mcclu8
May 17 17:51:52 mcclu5 cnxmgrd: node mcclu8, cluster incarn 0, update_seq 0
May 17 17:51:52 mcclu5 cnxmgrd: adding node mcclu8
May 17 17:51:53 mcclu5 cnxmgrd: update complete, summary follows
May 17 17:51:53 mcclu5 cnxmgrd:   members are:
May 17 17:51:53 mcclu5 cnxmgrd:    mcclu5
May 17 17:51:53 mcclu5 cnxmgrd:    mcclu8
May 17 17:51:53 mcclu5 cnxmgrd:   timed out are:
May 17 17:51:53 mcclu5 cnxmgrd:    none
May 17 17:51:53 mcclu5 cnxmgrd: finished join operation, update_seq 3

.
.
.
May 17 17:52:04 mcclu5 ASE: local Director Notice: agent on mcclu8 came ONLINE [8]

.
.
.
May 17 17:54:54 mcclu5 cnxmond: node mcclu8 timed out [9]
May 17 17:55:30 mcclu5 cnxmgrd: intend to remove node mcclu8
May 17 17:55:54 mcclu5 cnxmgrd: starting removal operation
May 17 17:55:54 mcclu5 cnxmgrd: removing node mcclu8
May 17 17:55:55 mcclu5 cnxmgrd: update complete, summary follows
May 17 17:55:55 mcclu5 cnxmgrd:   members are:
May 17 17:55:55 mcclu5 cnxmgrd:    mcclu5
May 17 17:55:55 mcclu5 cnxmgrd:   timed out are:
May 17 17:55:55 mcclu5 cnxmgrd:    none
May 17 17:55:55 mcclu5 cnxmgrd: finished removal operation, update_seq 4

[1] The connection manager monitor daemon, cnxmond, starts the ping daemon, cnxpingd, and agent daemon, cnxagentd, on system mcclu5.

[2] The connection manager monitor daemon (cnxmond) on system mcclu5 acquires a spinlock on the MEMORY CHANNEL bus (mc0) and registers the cluster_cnx service alias (10.0.0.42).

[3] As a result of acquiring the spinlock and registering the cluster_cnx alias, the connection manager starts the director daemon (cnxmgrd) on system mcclu5.

[4] The director tries to find systems that are eligible to become cluster members. If the director finds eligible systems, these systems become members and the cluster is formed from them. In this case, no other eligible systems were found; therefore, the cluster is formed with no other members.

[5] Systems are added to the cluster as members one at a time. Since system mcclu5 is up and running, it is considered first and added as a member.

[6] System mcclu8 is detected as up and running.

[7] System mcclu8 becomes a cluster member. The cluster now has two members, mcclu5 and mcclu8.

[8] The ASE director detects a new agent on system mcclu8.

[9] The director detects a system failure and removes the failed system from the cluster membership. The cluster now has only one member, mcclu5.

A.3 Recovery of an Existing Cluster

Example A-3 shows an excerpt from the daemon.log file related to the recovery of an existing Production Server cluster.

Example A-3: Log File Showing Recovery of an Existing Cluster

May 17 18:18:03 mcclu5 cnxmond: changed alias with : /sbin/ifconfig mc0 alias 10.0.0.42 netmask 255.255.255.0 [1]
May 17 18:18:03 mcclu5 cnxmgrd: starting
May 17 18:18:43 mcclu5 cnxmond: recovery delay completed
May 17 18:18:43 mcclu5 cnxmgrd: attempting cluster recovery/formation
May 17 18:18:43 mcclu5 cnxmgrd: recovery, considering mcclu5
 
May 17 18:18:43 mcclu5 cnxmgrd: node mcclu5, cluster incarn 0000000000080e90, update_seq 2 [2]
May 17 18:18:43 mcclu5 cnxmgrd: found member of cluster incarnation 0000000000080e90
May 17 18:18:43 mcclu5 cnxmgrd: node is mcclu5, update_seq is 2
May 17 18:18:43 mcclu5 cnxmgrd: same update_seq 2, node mcclu5
May 17 18:18:43 mcclu5 cnxmgrd: using mcclu5
May 17 18:18:43 mcclu5 cnxmgrd: recovering a cluster [3]
May 17 18:18:43 mcclu5 cnxmgrd: starting recovery update
May 17 18:18:43 mcclu5 cnxmgrd: update complete, summary follows
May 17 18:18:43 mcclu5 cnxmgrd:   members are:
May 17 18:18:43 mcclu5 cnxmgrd:    mcclu5
May 17 18:18:43 mcclu5 cnxmgrd:   timed out are:
May 17 18:18:43 mcclu5 cnxmgrd:    none
May 17 18:18:43 mcclu5 cnxmgrd: finished recovery update, update_seq 3
May 17 18:18:43 mcclu5 cnxmgrd: completed cluster recovery/formation

[1] A director is selected and cluster recovery is started.

[2] The director finds system mcclu5, which was a member of a cluster (note the nonzero value of cluster incarn).

[3] The cluster created in Example A-2 is recovered.
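The difference between forming a new cluster and recovering an existing one shows up in the cluster incarn value: it is 0 in Example A-2 and nonzero in Example A-3. The Python sketch below extracts that value from a cnxmgrd status line. The function name node_status is hypothetical, and treating the incarnation as a hexadecimal number is an assumption based on the value 0000000000080e90 shown in the excerpt.

```python
import re

# Matches "node <name>, cluster incarn <value>, update_seq <n>".
NODE_STATUS = re.compile(
    r"node (\S+), cluster incarn ([0-9a-fA-F]+), update_seq (\d+)"
)

def node_status(line):
    """Parse a cnxmgrd node status line; return None if it does not match."""
    match = NODE_STATUS.search(line)
    if not match:
        return None
    # Assumption: the incarnation is printed in hexadecimal.
    incarnation = int(match.group(2), 16)
    return {
        "node": match.group(1),
        "incarnation": incarnation,
        "update_seq": int(match.group(3)),
        # A nonzero incarnation means the node belonged to a cluster before.
        "was_member": incarnation != 0,
    }
```

A line with was_member set to True suggests that the director will recover the earlier cluster instead of forming a new one.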