This appendix contains a partial list of important messages generated by the TruCluster software. These messages have an Alert severity level and are included in the daemon.log file unless otherwise noted.
A message with an Alert severity level indicates that a critical condition
exists and needs the immediate attention of a system manager.
Log file entries specify the following information:
Time of event
Name of the local system
Component identifier
Member on which the event was generated
Daemon that generated the event
Event severity level
Message text
If the daemon that generated an event is disconnected from the available server environment (ASE) logger daemon, and the message arrived after the disconnect, the ASE logger daemon may not be able to identify the daemon that sent the message. In this case, the source of the event is specified as "unknown client." For example:
Aug 31 11:34:35 staff1 DECsafe: unknown client Info: ASE_INQ_SERVICES Reply from Director seq: 12 ch: 3 ASE_OK
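The fields listed above can be pulled out of a log line mechanically. The following Python sketch parses an entry of the form shown in the example; the regular expression is an assumption inferred from that single sample, not a definitive grammar for daemon.log entries.

```python
import re

# Field layout inferred from the single sample entry above; this is an
# assumption, not a definitive grammar for daemon.log lines.
LOG_PATTERN = re.compile(
    r"^(?P<time>\w{3}\s+\d+\s[\d:]{8})\s+"        # time of event
    r"(?P<host>\S+)\s+"                            # name of the local system
    r"(?P<component>[^:]+):\s+"                    # component identifier
    r"(?P<daemon>.+?)\s+"                          # daemon that generated the event
    r"(?P<severity>Alert|Error|Info|Warning):\s+"  # event severity level
    r"(?P<message>.*)$"                            # message text
)

def parse_entry(line):
    """Split one log line into its fields, or return None if it does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None
```

A parser like this makes it easy to filter the log for Alert-level entries from a particular daemon.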
Messages that specify AseUtility as the daemon that generated the message were produced by a command or daemon unrelated to the TruCluster software. For example, the following messages were produced by the Logical Storage Manager (LSM) software:
AseUtility Error: voldisk: Volume daemon is not accessible
AseUtility Error: voldisk define of rz19 failed
AseUtility Error: voldisk: Device rz19: define failed: Device path invalid
The ASE action scripts capture output from the commands that they execute. This output is sent to the logger daemon. If the action script fails, the command output is logged as errors. See the appropriate software documentation for information on errors not related to the TruCluster software.
The following sections describe some of the Alert messages generated by the TruCluster software.
This section describes some Alert messages generated by the available server environment (ASE) agent daemon.
Can't stop service <service> for failed device. rebooting!
A device has failed and the agent cannot stop the service associated with the failed device. If a stop fails, umount may have failed because a file is open locally on the NFS file system. If the service is relocated to another member and later relocated back to the original member, and the member's cache for the file system was not flushed because of the failed umount, the stale cache could be flushed when the service restarts on the original member and cause file corruption. To prevent this, the ASE agent daemon reboots the local node.
Member <member> cut off from net
The member is disconnected from the network.
Member <member> is not available
The member that was running the director is not answering pings over the network or over the SCSI bus; therefore, it is considered unavailable.
device access failure on <device> from <host>
The specified device cannot be reached from the specified host.
AM can't access <device> on <host> on reservation reset
The reservation for the specified device on the specified host has been lost. This could happen if a SCSI reset occurred. Usually if this occurs, the device can be rereserved. However, in this case, the ASE agent daemon cannot open the device special file for the specified device, so the reservation cannot be re-established.
AM failed to rereserve <device> on <host>
The disk reservation was lost because of a SCSI reset, and the ASE agent daemon was unable to rereserve the device.
AM reports a lost reservation for <device> on <host>
The reservation for the specified device was lost on the specified host. The reservation may have been taken when the ASE director daemon started the service on a different host.
Can't fetch new configuration data base!
The ASE agent daemon stored a new configuration database, but cannot fetch the new database. The ASE agent daemon exits, and the system manager must resolve the problem.
Network is partitioned between local host and <remote_host>
The ASE agent daemon has discovered, through the host status monitor (HSM) daemon, that the local member is separated from the specified remote host because of a network partition. The system manager may have to resolve this condition, which could be caused by bad cable routing. The ASE agent daemon logs an Alert message for each member that is cut off from the partitioned member.
Cut off from net and can't stop services. reboot!
The ASE agent daemon has been cut off from the network; therefore, it is stopping all of the services currently running on the member so they can be started on another member. If a stop fails, umount may have failed because a file is open locally on the NFS file system. If the service is relocated to another member and later relocated back to the original member, and the member's cache for the file system was not flushed because of the failed umount, the stale cache could be flushed when the service restarts on the original member and cause file corruption. To prevent this, the ASE agent daemon reboots the local node.
Possible security breach attempt: connect tried from unknown <remote_host>
Possible security breach attempt: connect request from nonmember <remote_host>
A process on a nonmember system tried to connect to the ASE agent daemon. For security purposes, the ASE agent daemon's connection maintenance code refuses connection requests from systems that are not in the ASE agent daemon's current member list. One of the previous Alert messages is logged if a connection request is received from a nonmember system.
main: fatal error...
The ASE agent daemon encountered an error from which it could not recover and exited. This Alert message is logged, in addition to more detailed Alert messages that describe the reason that the ASE agent daemon exited.
possible device failure: <device>
The ASE agent daemon tried to start a service but discovered that the devices used by that service are unreachable.
This section describes some Alert messages generated by the available server environment (ASE) director daemon.
Lost connection to the HSM... exiting
The ASE director daemon exited because it lost its connection to the ASE host status monitor (HSM) daemon.
Possible security breach attempt: connect tried from unknown <remote_host>
Possible security breach attempt: connect request from nonmember <remote_host>
A process on a nonmember system tried to connect to the ASE director daemon. For security purposes, the ASE director daemon's connection maintenance code refuses connection requests from systems that are not in the ASE director daemon's current member list. One of the previous Alert messages is logged if a connection request is received from a nonmember system.
Unable to start service <service>
The ASE director daemon cannot start the specified service. If a service is restricted to run on a subset of the members, this message indicates that it cannot run on any of those members. Check the appropriate daemon.log event logging file for more information.
Unable to stop service <service> due to a timeout. The service is in an unknown state.
The ASE director daemon timed out waiting for the ASE agent daemon to reply to a stop service request.
Cannot contact local agent... exiting
The ASE director daemon exited because it could not contact its local agent.
Network connection down... exiting
The ASE director daemon exited because its network connection was not available.
Received message from agent which is not in the config database
The ASE director daemon received a message from a nonmember agent.
Can't ping my agent, exiting...
The ASE agent daemon on the member that is running the ASE director daemon is not registered with the portmap daemon.
Unable to start service <service> on <host>.
A service relocation failed.
Cannot start service <service>.
After a device failure, a service cannot be started on any potential member.
Member <member> is not available.
A member is not answering pings over the network or the SCSI bus; therefore, it is considered unavailable.
Service <service> cannot be run on any available members.
The ASE director daemon cannot start the specified service. If a service is restricted to run on a subset of the members, this message indicates that it cannot run on any of those members. Check the appropriate daemon.log event logging file for more information.
Unable to stop service <service> due to a timeout. The service is in an unknown state.
One of the stop scripts did not return an exit value within its timeout period, and the stop action may not have completed. It is important to ensure that the service is completely stopped before continuing.
A member has an invalid IP address. ASE members are on different subnets.
The Internet Protocol (IP) address must be a valid address on the same subnet as the other ASE members.
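As a quick sanity check, a subnet comparison along these lines can confirm whether a candidate address shares a subnet with the existing members. This is an illustrative sketch, not part of the TruCluster software; the netmask default is an assumption, so substitute the one your ASE actually uses.

```python
import ipaddress

def on_same_subnet(candidate, members, netmask="255.255.255.0"):
    """Return True if candidate shares a subnet with every member address.

    The netmask default is an assumption; pass your ASE's real netmask.
    """
    def net_of(ip):
        # strict=False lets us build the network from a host address
        return ipaddress.ip_network(f"{ip}/{netmask}", strict=False)

    want = net_of(candidate)
    return all(net_of(m) == want for m in members)
```

Running this against the addresses of all ASE members flags the misconfigured one before the director logs this Alert.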
Can't ping agent on <member>
The ASE agent daemon on the specified member is not registered with the portmap daemon.
Can't open channel to agent on <member>
The ASE director daemon cannot establish a connection with the agent on the specified member.
This section describes some Alert messages generated by the available server environment (ASE) host status monitor (HSM) daemon.
Network ping to host <host> is working but SCSI ping is not
A problem exists in all of the SCSI bus paths between the host specified in the message and the member that reported the message. Check the cabling between systems and disks on the shared buses.
Network ping to host <host> is working and now SCSI ping is also working
The condition described in the first ASE HSM daemon message has been cleared. SCSI pings can now be sent between the hosts on at least one of the shared buses.
This section describes some Alert messages generated by the asemgr utility.
Test of alert script
This message is generated when you choose the "Test the error alert script" item from the asemgr utility's Managing the ASE menu.
Bad return code from ****
This message is generated when a routine returns an unexpected return code; it indicates a bug in the TruCluster software. If it occurs, contact your field service representative.
Net partition - cannot find a director.
The asemgr utility cannot find the ASE director daemon because of a network partition.
Unable to translate host <host> to an IP address
A routine cannot map a member host name to an Internet Protocol (IP) address. There could be a problem with the /etc/hosts file or with Berkeley Internet Name Domain (BIND).
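To reproduce the lookup outside the asemgr utility, a minimal resolver check such as the following can help isolate whether /etc/hosts or BIND is at fault. This sketch uses the standard resolver interface and is not code from the TruCluster software.

```python
import socket

def translate_host(hostname):
    """Return the IP address for hostname, or None if the lookup fails.

    Uses the system resolver, so it exercises the same /etc/hosts and
    BIND configuration the cluster utilities depend on.
    """
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None
```

If this returns None for a member host name, fix name resolution before retrying the asemgr operation.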
Could not allocate database
Could not malloc
These messages occur if a malloc operation fails. They indicate that the system is running out of memory or swap space.
Configuration database is corrupted (Invalid length of ASE version)
BUG NOTICE: Exit before finishing unmarshal_tree
Something is wrong with the ASE database (for example, it has been corrupted).
This section describes some Alert messages generated by the MEMORY CHANNEL subsystem.
memory channel - alternate on-line
In a redundant MEMORY CHANNEL configuration, the alternate MEMORY CHANNEL interconnect has come on line. This message is printed only when the alternate comes on line after MEMORY CHANNEL software initialization.
switching from mc<number> to mc<number>
The cluster is failing over from the primary MEMORY CHANNEL interconnect to the secondary MEMORY CHANNEL interconnect.
rm_sw_init: can't fail over from mc<number> to mc<number>
The cluster cannot fail over to the secondary MEMORY CHANNEL interconnect due to hardware problems with the secondary MEMORY CHANNEL interconnect.
requesting memory channel failover, node <node>
A member system is requesting other member systems to fail over to the secondary MEMORY CHANNEL interconnect.
memory channel - checking cables
The MEMORY CHANNEL subsystem is checking that the primary MEMORY CHANNEL interconnect is plugged into the same hub on all member systems.
memory channel failover request from node <node>
A MEMORY CHANNEL failover request has been received from the specified member system.
rm_boot_request_init: didn't switch
The cluster cannot fail over to the secondary MEMORY CHANNEL interconnect due to hardware problems with the secondary MEMORY CHANNEL interconnect.
memory channel node <node> already cluster member, crashing node <node>
A node that has been identified as a cluster member is requesting cluster membership. The MEMORY CHANNEL subsystem will shut it down to restore consistency.
memory channel - failed initialization
A hardware problem has prevented MEMORY CHANNEL subsystem initialization.
received a request from node <node> to failover
The specified node has requested a failover to the secondary MEMORY CHANNEL interconnect.
rm_failover_rmerror_request: can't fail over from mc<number> to mc<number>
Failover to the secondary MEMORY CHANNEL interconnect is not possible, probably due to a member system's not being able to access the secondary MEMORY CHANNEL interconnect.
rmerror_get_errcnt_kl:crashing node <node>
The specified MEMORY CHANNEL node is unresponsive and is being shut down.
rmerror_free_errcnt_lk: Too many retries, node <node> must be down
rmerror_init:Error_count = <number> unit = <number> Err_reg = <value> Node = <node>
A MEMORY CHANNEL error interrupt has been received and error recovery is in progress.
rmerror_init:crashing node <node>
The specified node is unresponsive and is being shut down.
rmerror_state_change: unit = <number> Err_reg = <value> node = <node>
A state change has been received, indicating that another member system has joined or left the cluster.
rmerror_state_change: failed to failover
The cluster made an unsuccessful attempt to fail over from the primary MEMORY CHANNEL interconnect to the secondary MEMORY CHANNEL interconnect. It is likely that a member system cannot access the secondary MEMORY CHANNEL interconnect.
rmerror_railover: Node = <node> Flag = <value> Action = <value>
The MEMORY CHANNEL subsystem has requested a failover to the secondary MEMORY CHANNEL interconnect.
rmerror_failover: no alternate mc to fail over to
No functional secondary MEMORY CHANNEL interconnect is available for failover.
rmerror_failover: negative error count
Failover has been simultaneously initiated on multiple member systems. This is an informational message.
rmerror_failover_1:crashing node <node>
The specified MEMORY CHANNEL node is unresponsive and is being shut down.
rmerror_failover: not every node can failover
The cluster aborted a failover to the secondary MEMORY CHANNEL interconnect, because not all member systems could fail over to it.
rmerror_failover_2: crashing node <node>
The specified MEMORY CHANNEL node is unresponsive and is being shut down.
checking for existing memory channel nodes
The MEMORY CHANNEL subsystem is looking for other nodes connected to the MEMORY CHANNEL interconnect that may be either running or in the process of booting.
unresponsive mc nodes - waiting for node mask
A node connected to the MEMORY CHANNEL interconnect is not responding to boot requests. The MEMORY CHANNEL subsystem is waiting for the node to boot.
crashing unresponsive node <node>
The node indicated in the message did not respond to repeated boot requests. It may be hung, so the MEMORY CHANNEL software attempts to crash it to allow cluster formation to progress. This crashing ... message is usually preceded by several unresponsive mc nodes ... messages.
booting as primary memory channel node
This MEMORY CHANNEL node is the first node to boot and initialize its MEMORY CHANNEL subsystem.
memory channel software inited - node <node>
Initialization of low-level MEMORY CHANNEL software is complete.
requesting memory channel interrupt, node <node>
This MEMORY CHANNEL node has requested an interrupt from another node, which has already initialized its low-level MEMORY CHANNEL software. This is the first step a node takes to initialize its MEMORY CHANNEL software when another MEMORY CHANNEL node is already initialized.
requesting memory channel update interrupt, node <node>
This MEMORY CHANNEL node has requested an update interrupt from another node, which has already initialized its low-level MEMORY CHANNEL software. This is the second step a node takes to initialize its MEMORY CHANNEL software when another MEMORY CHANNEL node is already initialized.
memory channel status request from node <node>
A MEMORY CHANNEL node is looking for other existing MEMORY CHANNEL nodes.
memory channel request from node <node>
This MEMORY CHANNEL node is responding to an interrupt from another node, which is attempting to initialize its low-level MEMORY CHANNEL software. This is the first step a node takes to initialize its MEMORY CHANNEL software when another MEMORY CHANNEL node is already initialized.
memory channel update request from node <node>
This MEMORY CHANNEL node is responding to an update interrupt from another node, which is attempting to initialize its low-level MEMORY CHANNEL software. This is the second step a node takes to initialize its MEMORY CHANNEL software when another MEMORY CHANNEL node is already initialized.
memory channel - adding node <node>
Low-level MEMORY CHANNEL software is adding another node.
memory channel - removing node <node>
Low-level MEMORY CHANNEL software is removing a node.
memory channel node <node> timed out, hardware does not see it
A node is not responding. It will be removed.
memory channel thread init
The general-purpose MEMORY CHANNEL thread has completed initialization.
This section describes some Alert messages generated by the distributed lock manager (DLM) subsystem. These messages are logged to the console and kernel log in the /usr/adm/syslog.dated file. Some, as noted, are also logged to the user's terminal.
dlm_subsys_configure: can't init lkid table
Either the system has an insufficient amount of memory or the number of locks allocated at boot time (indicated by the dlm_locks_cfg kernel attribute) is too large. To fix this problem, either decrease the value of the dlm_locks_cfg kernel attribute in the /etc/sysconfigtab file and reboot, or add more memory to the system.
dlm_subsys_configure: can't init rsb table
Either the system has an insufficient amount of memory or the size of the DLM resource hash table (indicated by the rhash_size kernel attribute) is too large. To fix this problem, either decrease the value of the rhash_size kernel attribute in the /etc/sysconfigtab file and reboot, or add more memory to the system.
dlm_subsys_configure: can't init pdb table
Either the system has an insufficient amount of memory or the size of the process descriptor block hash table (indicated by the pdb_hash_size kernel attribute) is too large. To fix this problem, either decrease the value of the pdb_hash_size kernel attribute in the /etc/sysconfigtab file and reboot, or add more memory to the system.
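The fix for all three dlm_subsys_configure messages above follows the same pattern: lower the named attribute in /etc/sysconfigtab and reboot. A stanza along these lines shows the shape of the change; the stanza name shown and the values are illustrative assumptions, so consult your system documentation for the attributes' actual subsystem name and valid ranges.

```
# /etc/sysconfigtab fragment (illustrative; the "dlm:" stanza name and
# the values shown are assumptions, not documented defaults)
dlm:
        dlm_locks_cfg = 4096
        rhash_size = 1024
        pdb_hash_size = 512
```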
dlm_subsys_configure: can't start timeoutq
The DLM cannot start the DLM timeout queue thread. To fix this problem, reboot the member system. If the problem recurs, contact your DIGITAL support representative.
dlm_subsys_configure: dlm_hab_configure failed
The DLM cannot configure its habitat. To fix this problem, reboot the member system. If the problem recurs, contact your DIGITAL support representative.
dlm: configured
The DLM subsystem has been configured successfully at boot time.
dlm_subsys_configure: configure failed
DLM configuration has failed on this member system.
dlm_create_lock: pid <value> copyout err of lkid <value>
The dlm_lock or dlm_quelock function cannot return a lock ID to the buffer specified in the function call. This message is sent to the console, system log, and the user's terminal. To fix this problem, check the application program that called the function and ensure that it passes a valid buffer address. See the TruCluster Production Server Application Programming Interfaces manual for more information about DLM functions.
dlm_create_lock COMP_LOCK: pid <value> IVLOCKID lkid <value> uaddr <value>
The dlm_lock function has attempted to use an invalid lock ID. This can occur when the function was interrupted by a signal and, before it resumed, the application that called the function dequeued the lock or corrupted the lock ID in its signal handler. This message is sent to the console, system log, and the user's terminal. See the TruCluster Production Server Application Programming Interfaces manual for more information about DLM functions.
dlm_create_lock: pid <value> err while copying out valblk for lkid <value>
dlm_convert_lock: valblk copyout fault: lkid <value> kvalbp <value> uvalb_p <value>
The dlm_lock, dlm_quelock, dlm_cvt, or dlm_quecvt function cannot return the resource's value block to the buffer specified in the function call. This message is sent to the console, system log, and the user's terminal. To fix this problem, check the application program that called the function and ensure that it passes a valid buffer address. See the TruCluster Production Server Application Programming Interfaces manual for more information about DLM functions.
dlm_collect: pid <value> DLM_EFAULT notf_entry
dlm_collect: pid <value> DLM_EFAULT ngot
The dlm_notify function cannot return the blocking notification routine parameter or hint to the buffer specified in the function call. This message is sent to the console, system log, and the user's terminal. To fix this problem, check the application program that called the function and ensure that it passes a valid buffer address. See the TruCluster Production Server Application Programming Interfaces manual for more information about DLM functions.
This section describes some Alert messages generated by the distributed raw disk (DRD) subsystem. DRD messages are logged to the kern.log file unless otherwise noted.
drd_configure_subsys: failed in drd_driver_configure
One of the subcomponents of DRD was unable to initialize. There will usually be an accompanying error message providing more detail on the cause of the initialization error. Verify that all hardware components are operational.
drd_configure: drd-maphash-size, invalid size.
drd_bp_pool_configure: bogus tunables, using default.
An invalid value was specified for a tunable parameter. See drd(7) for a description of the tunable parameters.
drd_configure: subsystem unconfiguration not yet supported.
The DRD subsystem cannot be dynamically unconfigured.
drd_map_delete: can't delete, drain failed.
The underlying physical device driver failed to complete outstanding I/O operations. Check the system error logs for driver-specific errors.
drd_map_add: LMF PAK not registered.
The required product license has not been registered. Use the lmf command to register the appropriate Product Authorization Key (PAK).
drd_map_add: rejecting map on validation errors.
A corrupt or invalid map entry has been received. The underlying device driver type or device type may not be supported or operational.
drd_resolve_map: Can't find server for DRD drd3.
The DRD subsystem has repeatedly tried to determine which node within the cluster is the server of the specified disk. The retry timeout limit has been reached, and the DRD subsystem is returning an error to the calling application. Use the asemgr utility to verify that the service is operational. This error could indicate that stale DRD device special files are being accessed.
drd_map_rpc_add: attempt to replace local map with remote.
drd_map_rpc_add: rejecting new remote map.
The server of a DRD disk has received a new map entry, indicating that another node also believes that it is the disk's server. At any given time, there should be only one server. Check for the occurrence of any errors from the available server environment (ASE) subsystem that may determine the cause of this problem.
drd_map_rpc_add: drd_map_add() failed with 23.
The error number specifies an error return status when attempting to add a new DRD map entry. One of the validation checks has failed, or a DRD server is unable to open the underlying device.
bss_open: device type not supported with drd.
The underlying driver type or device type does not meet DRD's validation requirements.
bss_rm_init: register_RM_member_callback failed
bss_rm_init: get_RM_information failed
bss_rm_init_sync: RM_GET_CHARACTERISTICS failed
bsc_rm_init: get_RM_information failed
bsc_rm_init: register_RM_member_callback failed
The underlying MEMORY CHANNEL subsystem is returning an error status to the DRD. Check for related error messages and verify that the MEMORY CHANNEL hardware is operational.
bss_subsys_configure: invalid bssd count
This indicates that the argument passed to the bssd command is not within the acceptable range. See bssd(8) for details.
bss_subsys_configure: bssd not restartable, reboot needed.
The bssd daemon cannot be killed and restarted; see bssd(8) for details. This message indicates that the bssd daemon has been killed off and an attempt has been made to restart it. To restart the bssd daemon, reboot the system.
bss_subsys_configure: bssd already running.
An attempt was made to start a second bssd daemon while one bssd daemon was already running. Only one instance of the bssd daemon is allowed to run at a time.
Another BSSD is already running, exiting.
An attempt was made to start a second bssd daemon while one bssd daemon was already running. Only one instance of the bssd daemon is allowed to run at a time. This message appears in the daemon.log file.
Daemon not restartable. Reboot required.
The bssd daemon has been killed off and an attempt is being made to restart it. The bssd daemon cannot be restarted; you must reboot the system to make DRD operational again. This message appears in the daemon.log file.
Can't register with portmap.
The portmapper daemon may have been killed off, which prevents the bssd daemon from establishing its network connection. This message appears in the daemon.log file.
Returned from kernel call, exiting.
The bssd daemon is exiting. This informational message appears in the daemon.log file.
DRD is NOT loaded and configured. Unable to open DRD control device
The DRD subsystem has not been initialized or has failed to initialize. Check the kern.log file for related error messages. This message appears in the daemon.log file.
Failed ioctl to set socket descriptor.
The kernel portion of DRD has rejected the bssd daemon's attempts to connect. Check the kern.log file for related error messages. This message appears in the daemon.log file.