6 Troubleshooting

This chapter describes the following problems, which you might encounter during installation, and suggests corrective actions:

Setting logging levels (PS, AS, MC)

Kernel build fails (PS, AS, MC)

Cannot ping members across the primary network (PS, AS)

MEMORY CHANNEL cables are crossed (PS, MC)

System cannot join the cluster (PS)

ASE validation fails (PS, AS)

The drd_ivp utility cannot determine ASE membership (PS)

Inconsistent view of shared SCSI devices (PS, AS)

6.1 Setting Logging Levels (PS, AS, MC)

For Production Server and Available Server, you can set the asemgr logging level to Informational, which increases the number of messages written to /var/adm/syslog.dated/date/daemon.log.

For Production Server and MEMORY CHANNEL, you can use the mchan_debug attribute in the /etc/sysconfigtab file to generate verbose MEMORY CHANNEL error messages. Set the attribute as shown in the following example:

mchan:
      mchan_debug=1

You must reboot the system in order for the mchan_debug change to take effect. The additional debug information, when included in a problem report, can help your DIGITAL service representative diagnose problems.
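Before rebooting, it can be worth confirming that the attribute landed in the mchan stanza and not in a neighboring one. The following is a minimal sketch, assuming the stanza layout shown in the example above (a stanza name followed by a colon, then indented attribute lines). The function name mchan_debug_value is hypothetical and is not part of the product.

```shell
# mchan_debug_value FILE
# Print the value of mchan_debug found in the mchan: stanza of a
# sysconfigtab-format FILE. Print nothing if the attribute is not set there.
# Hypothetical helper, shown for illustration only.
mchan_debug_value() {
    awk '
        /^[^ \t#].*:[ \t]*$/ {             # stanza header such as "mchan:"
            sub(/:[ \t]*$/, "")
            stanza = $0
            next
        }
        stanza == "mchan" && /^[ \t]+mchan_debug[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")       # drop everything up to "="
            sub(/[ \t]+$/, "")             # drop trailing blanks
            print
        }
    ' "$1"
}

# Example use:
#   mchan_debug_value /etc/sysconfigtab
```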

6.2 Kernel Build Fails (PS, AS, MC)

After prompting for configuration options, the installation procedure attempts to build a new kernel using the doconfig utility. If the newly configured kernel cannot be built, the installation procedure displays the following message:

*** WARNING ***
An error has occurred during system configuration.  A partial listing
of the error log file (./errs) follows:

.
.
.
*** NOTE ***
The customized kernel for this machine could not be successfully
created.  One possible problem could be kernel layered products
that might be incompatible with the operating system.  This
script will now automatically attempt to build a kernel using the
operating system only.
Is this ok? (y/n) [y]:

If the rebuild is still unsuccessful, the installation procedure displays the following message:

*** NOTE ***
A new kernel for this machine could not be successfully created.
 
Unable to build the new kernel. Please perform the following actions:
 
        o Run "doconfig" to build a good kernel.
        o Move the new kernel to /.
        o Before rebooting make sure that the MEMORY CHANNEL IP 
          addresses for all cluster members are recorded in each member's
          /etc/hosts file.
        o Reboot the system.

For information on building, tuning, and debugging kernels, see the DIGITAL UNIX System Administration, System Configuration and Tuning, and Kernel Debugging manuals.

6.3 Cannot Ping Members Across the Primary Network (PS, AS)

The primary network for Production Server is the MEMORY CHANNEL subnet; the primary network for Available Server is the network attached to the interface specified during installation. (Section 2.1 has a description of each product's primary network interface.)

If a member system does not respond to the ping command, do the following:

Check that each member's primary interface is configured UP and that the appropriate interface-related entries are present in that member's /etc/rc.config and /etc/hosts files.
In the following example, host clu14's primary network interface is mc0:
```
# ifconfig mc0
mc0: flags=863<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX>
     inet 10.0.0.2 netmask ffffff00 broadcast 10.0.0.255 ipmtu 8008
```
The interface is configured UP, and has the following NETDEV_n and IFCONFIG_n entries in the member's /etc/rc.config file:
```
# egrep "mc0|10.0.0.2" /etc/rc.config
NETDEV_1="mc0"
IFCONFIG_1="10.0.0.2 netmask 255.255.255.0"
```
The interface's host entry in /etc/hosts associates the IP address assigned to the IFCONFIG entry with the IP name assigned to the CLUSTER_NET entry:
```
# rcmgr get CLUSTER_NET
mcclu14
# grep mcclu14 /etc/hosts
10.0.0.2     mcclu14.abc.def.com     mcclu14
```

Make sure that the following entries are in each member system's /etc/hosts:
- An entry for each member system's IP name and IP address on the cluster's primary network.
- The IP host addresses used by critical network services such as BIND, NIS, and NTP.
- For Production Server, the MEMORY CHANNEL IP address of the connection manager service (cluster_cnx), which must be host number 42 on the MEMORY CHANNEL subnet. (The clu_ivp utility checks for the presence of the cluster_cnx service but does not verify its IP address.)
- For systems with more than one network interface, the IP host names and addresses used to communicate with cluster members and clients through those network interfaces. For example, a Production Server cluster has a conventional Ethernet or FDDI network in addition to its MEMORY CHANNEL subnet; an Available Server ASE often has a secondary network as a backup.
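
The /etc/hosts checks described in this section can be partly scripted. The following is a rough sketch rather than a supported tool: both function names are hypothetical, and the second function assumes that the connection manager entry is named cluster_cnx in /etc/hosts and that the MEMORY CHANNEL subnet uses a 255.255.255.0 netmask, so that the host number is the last octet of the address.

```shell
# Hypothetical helpers; neither is part of the TruCluster software.

# hosts_ip_for NAME FILE
# Print the IP address of the first entry in a hosts-format FILE that
# lists NAME as a host name or alias. Print nothing if there is none.
hosts_ip_for() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                 # strip comments
        {
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }
    ' "$2"
}

# cnx_host_ok FILE
# Succeed only if FILE has a cluster_cnx entry whose last octet is 42.
# Assumption: the entry is named cluster_cnx and the subnet uses a
# 255.255.255.0 netmask, so the host number is the last octet.
cnx_host_ok() {
    cnx_ip=$(hosts_ip_for cluster_cnx "$1")
    [ "${cnx_ip##*.}" = "42" ]
}

# Example use on a member system:
#   hosts_ip_for "$(rcmgr get CLUSTER_NET)" /etc/hosts
#   cnx_host_ok /etc/hosts || echo "check the cluster_cnx entry"
```

The address printed by hosts_ip_for can then be compared by eye with the address in the member's IFCONFIG_n entry.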

6.4 MEMORY CHANNEL Cables Are Crossed (PS, MC)

Each system in a failover-capable cluster must have identically configured MEMORY CHANNEL adapters.

For a physical hub configuration, if the primary adapter is plugged, for example, into the primary hub's linecard in slot 3, the alternate adapter must be plugged into the alternate hub's linecard in slot 3. (The slot location determines the adapter's node ID, and the node IDs must be identical among all cluster members.)

If the MEMORY CHANNEL adapters are not connected properly, the system can panic with the following message:

rm_check_cables: cables are crossed

In a two-system, virtual-hub cluster, the jumper settings determine the node IDs. A system's primary and alternate adapters must be jumpered identically (either as VH0 or VH1). See the TruCluster Software Products Hardware Configuration manual for information on configuring MEMORY CHANNEL adapters. See Section 3.14 for information on setting up a tie-breaker disk for virtual-hub clusters.

6.5 System Cannot Join the Cluster (PS)

If cnxshow indicates that a system is unable to join the cluster, perform the following checks:

Use the ps ag command to verify that the portmap and cfgmgr processes are running. These processes, while not specific to clusters, must be running in order for the cluster to operate. For example:
```
# ps ag | egrep "portmap|cfgmgr" | grep -v egrep
 
224 ??   I    0:04.77 /usr/sbin/portmap
244 ??   I    0:00.01 /sbin/cfgmgr
```

Check initialization and error messages (for example, the daemon.log and kern.log files, and the uerf utility). See Appendix A for examples of startup, cluster formation, and cluster recovery messages.
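
The process check in the first step can also be written as a small filter that names whichever daemon is absent. This is a sketch only: the function name missing_daemons is hypothetical, and it assumes that the command path is the fifth column of the ps ag output, as in the example above.

```shell
# missing_daemons
# Read "ps ag" output on standard input and print the name of each
# required daemon (portmap, cfgmgr) that does not appear in it.
# Assumes the command path is the fifth whitespace-separated column.
# Hypothetical helper, shown for illustration only.
missing_daemons() {
    awk '
        $5 ~ /(^|\/)portmap$/ { have_portmap = 1 }
        $5 ~ /(^|\/)cfgmgr$/  { have_cfgmgr = 1 }
        END {
            if (!have_portmap) print "portmap"
            if (!have_cfgmgr)  print "cfgmgr"
        }
    '
}

# Example use:
#   ps ag | missing_daemons
```

No output means that both processes were found.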

6.6 ASE Validation Fails (PS, AS)

If either the clu_ivp utility or the drd_ivp utility (Production Server only) reports that the available server environment (ASE) validation checks failed, run the asemgr utility with the -d and -h options on one system in each ASE to ensure that all ASE member systems are up and running. For example:

# asemgr -dh
        Member Status
 
Member:                   Host Status:    Agent Status:  
mcclu6                    UP              RUNNING        
mcclu7                    UP              RUNNING

See asemgr(8) for more information on these options.
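
In an ASE with many members, the status listing can be filtered so that only problem members are shown. The following sketch assumes that each member line has exactly three whitespace-separated columns, as in the example above, and treats any value other than UP and RUNNING as a problem, because the other status strings are not listed in this chapter. The function name ase_members_not_ready is hypothetical.

```shell
# ase_members_not_ready
# Read "asemgr -dh" output on standard input and print the name of each
# member whose host status is not UP or whose agent status is not RUNNING.
# Title and header lines are skipped because they do not have exactly
# three columns. Hypothetical helper, shown for illustration only.
ase_members_not_ready() {
    awk 'NF == 3 && ($2 != "UP" || $3 != "RUNNING") { print $1 }'
}

# Example use:
#   asemgr -dh | ase_members_not_ready
```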

Because each Available Server installation consists of a single ASE, the following applies only to Production Server installations.

All members in an ASE must have the same ASE ID. You can use the rcmgr get ASE_ID command to check the ASE identifier (ASE_ID) of each system. For example:

# rcmgr get ASE_ID
1
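
If you record the value reported by each member, a short filter can confirm that they all agree. This is a sketch under the assumption that you have collected the values by hand into lines of the form "member ASE_ID"; the function name ase_ids_consistent and the input format are hypothetical.

```shell
# ase_ids_consistent
# Read lines of the form "member ASE_ID" on standard input.
# Succeed only if at least one line was read and every member
# reports the same ASE_ID. Hypothetical helper, for illustration only.
ase_ids_consistent() {
    awk '
        NF == 2 { ids[$2] = 1 }
        END {
            n = 0
            for (id in ids) n++            # count distinct ASE_ID values
            exit (n == 1 ? 0 : 1)
        }
    '
}

# Example use, with values gathered from "rcmgr get ASE_ID" on each member:
#   printf 'mcclu6 1\nmcclu7 1\n' | ase_ids_consistent && echo consistent
```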

To change a system's ASE_ID, follow these steps:

If DRD services are configured, delete all services on the system.

Shut the system down to single-user mode.

Set the ASE_ID value. In the following example, the rcmgr command is used to set the ASE_ID value to 2 to match the ASE_ID assigned to the other members in the ASE:
```
# rcmgr set ASE_ID 2
```

Halt and reboot the system.

Add any DRD services that were deleted.

6.7 The drd_ivp Utility Cannot Determine ASE Membership (PS)

If the drd_ivp utility is run (either manually or as part of the clu_ivp utility) prior to defining the available server environment (ASE) member list, it can report that it is unable to determine ASE membership. For example:

# drd_ivp
 
Cluster Configuration Information
 
Hostname               ASE_ID   BSSD   BSSD    DRD    Lic
                                 Reg   Resp   Conf    Reg
----------------------------------------------------------
mcclu6                      0    Yes    Yes    Yes    Yes
mcclu7                      0    Yes    Yes    Yes    Yes
 
DRD configuration validation tests succeeded.
Unable to determine which nodes are in the same ASE
as node mcclu6.   Verify that node mcclu6 is up and that it
has the ASE_ID parameter in its '/etc/rc.config' file.
Verify that mcclu6 is registered as a member of an ASE.
Unable to determine which nodes are in the same ASE
as node mcclu7.   Verify that node mcclu7 is up and that it
has the ASE_ID parameter in its '/etc/rc.config' file.
Verify that mcclu7 is registered as a member of an ASE.
Failed to validate ASE_ID values.

Use the asemgr utility to populate the ASE member list. Then rerun either the clu_ivp utility or the drd_ivp utility to check that the systems are registered as members of the ASE.

The TruCluster Software Products Administration manual provides more information on troubleshooting the DRD subsystem.

6.8 Inconsistent View of Shared SCSI Devices (PS, AS)

If the member systems connected to a shared SCSI bus have inconsistent views of the devices on the bus (all ASE members must have identical numbers for shared buses and devices), do the following:

Make sure that all shared SCSI cables are connected and terminated as described in the TruCluster Software Products Hardware Configuration manual.

For systems that support the bus_probe_algorithm console variable, check that its value is set to new (see Section 2.3).

Verify that the shared SCSI buses are numbered equivalently on each system. As mentioned in Chapter 5, you can run the clu_ivp utility on each system and compare the output to check whether all systems have the same view of shared SCSI buses and devices. If you discover an inconsistency, do the following on the affected system or systems:
1. Run the /var/ase/sbin/ase_fix_config utility, described in Section 3.7, and adjust the bus numbering.
2. Build a new kernel using the doconfig -c HOSTNAME command.
3. Move the new kernel to /vmunix.
4. Reboot the system.