This appendix contains a partial list of important messages generated by the TruCluster software. These messages have an Alert severity level and are included in the daemon.log file unless otherwise noted.
A message with an Alert severity level indicates that a critical condition
exists and needs the immediate attention of a system manager.
Log file entries specify the following information:
Time of event
Name of the local system
Component identifier
Member on which the event was generated
Daemon that generated the event
Event severity level
Message text
If the daemon that generated an event is disconnected from the available server environment (ASE) logger daemon, and the message arrived after the disconnect, the ASE logger daemon may not be able to identify the daemon that sent the message. In this case, the source of the event is specified as "unknown client." For example:
Aug 31 11:34:35 staff1 DECsafe: unknown client Info: ASE_INQ_SERVICES Reply from Director seq: 12 ch: 3 ASE_OK
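The fields listed above can be pulled out of a log line mechanically. The following Python sketch parses an entry of the form shown in the example; the regular expression is an assumption inferred from that single sample, not a definitive grammar for daemon.log entries.

```python
import re

# Field layout inferred from the single sample entry above; this is an
# assumption, not a definitive grammar for daemon.log lines.
LOG_PATTERN = re.compile(
    r"^(?P<time>\w{3}\s+\d+\s[\d:]{8})\s+"        # time of event
    r"(?P<host>\S+)\s+"                            # name of the local system
    r"(?P<component>[^:]+):\s+"                    # component identifier
    r"(?P<daemon>.+?)\s+"                          # daemon that generated the event
    r"(?P<severity>Alert|Error|Info|Warning):\s+"  # event severity level
    r"(?P<message>.*)$"                            # message text
)

def parse_entry(line):
    """Split one log line into its fields, or return None if it does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None
```

A parser like this makes it easy to filter the log for Alert-level entries from a particular daemon.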
Messages that specify AseUtility as the daemon that generated the message were produced by a command or daemon unrelated to the TruCluster software. For example, the following messages were produced by the Logical Storage Manager (LSM) software:
AseUtility Error: voldisk: Volume daemon is not accessible
AseUtility Error: voldisk define of rz19 failed
AseUtility Error: voldisk: Device rz19: define failed: Device path invalid
The ASE action scripts capture output from the commands that they execute. This output is sent to the logger daemon. If the action script fails, the command output is logged as errors. See the appropriate software documentation for information on errors not related to the TruCluster software.
The following sections describe some of the Alert messages generated by the TruCluster software.
This section describes some Alert messages generated by the available server environment (ASE) agent daemon.
Can't stop service <service> for failed device. rebooting!
A device has failed and the agent cannot stop the service associated with the failed device. If a stop fails, umount may have failed because a file is open locally on the NFS file system. If the service is relocated to another member and later relocated back to the original member, and the member's cache for the file system was not flushed because of the failed umount, the stale cache could be flushed when the service restarts on the original member and cause file corruption. To prevent this, the ASE agent daemon reboots the local node.
Member <member> cut off from net
The member is disconnected from the network.
Member <member> is not available
The member that was running the director is not answering pings over the network or over the SCSI bus; therefore, it is considered unavailable.
device access failure on <device> from <host>
The specified device cannot be reached from the specified host.
AM can't access <device> on <host> on reservation reset
The reservation for the specified device on the specified host has been lost. This could happen if a SCSI reset occurred. Usually if this occurs, the device can be rereserved. However, in this case, the ASE agent daemon cannot open the device special file for the specified device, so the reservation cannot be re-established.
AM failed to rereserve <device> on <host>
The disk reservation was lost because of a SCSI reset, and the ASE agent daemon was unable to rereserve the device.
AM reports a lost reservation for <device> on <host>
The reservation for the specified device was lost on the specified host. The reservation may have been taken when the ASE director daemon started the service on a different host.
Can't fetch new configuration data base!
The ASE agent daemon stored a new configuration database, but cannot fetch the new database. The ASE agent daemon exits, and the system manager must resolve the problem.
Network is partitioned between local host and <remote_host>
The ASE agent daemon has discovered, through the host status monitor (HSM) daemon, that the local member is separated from the specified remote host because of a network partition. The system manager may have to resolve this condition, which could be caused by bad cable routing. The ASE agent daemon logs an Alert message for each member that is cut off from the partitioned member.
Cut off from net and can't stop services. reboot!
The ASE agent daemon has been cut off from the network; therefore, it is stopping all of the services currently running on the member so they can be started on another member. If a stop fails, umount may have failed because a file is open locally on the NFS file system. If the service is relocated to another member and later relocated back to the original member, and the member's cache for the file system was not flushed because of the failed umount, the stale cache could be flushed when the service restarts on the original member and cause file corruption. To prevent this, the ASE agent daemon reboots the local node.
Possible security breach attempt: connect tried from unknown <remote_host>
Possible security breach attempt: connect request from nonmember <remote_host>
A process on a nonmember system tried to connect to the ASE agent daemon. For security purposes, the ASE agent daemon's connection maintenance code refuses connection requests from systems that are not in the ASE agent daemon's current member list. One of the previous Alert messages is logged if a connection request is received from a nonmember system.
main: fatal error...
The ASE agent daemon encountered an error from which it could not recover and exited. This Alert message is logged, in addition to more detailed Alert messages that describe the reason that the ASE agent daemon exited.
possible device failure: <device>
The ASE agent daemon tried to start a service but discovered that the devices used by that service are unreachable.
This section describes some Alert messages generated by the available server environment (ASE) director daemon.
Lost connection to the HSM... exiting
The ASE director daemon exited because it lost its connection to the ASE host status monitor (HSM) daemon.
Possible security breach attempt: connect tried from unknown <remote_host>
Possible security breach attempt: connect request from nonmember <remote_host>
A process on a nonmember system tried to connect to the ASE director daemon. For security purposes, the ASE director daemon's connection maintenance code refuses connection requests from systems that are not in the ASE director daemon's current member list. One of the previous Alert messages is logged if a connection request is received from a nonmember system.
Unable to start service <service>
The ASE director daemon cannot start the specified service. If a service is restricted to run on a subset of the members, this message indicates that it cannot run on any of those members. Check the appropriate daemon.log event logging file for more information.
Unable to stop service <service> due to a timeout. The service is in an unknown state.
The ASE director daemon timed out waiting for the ASE agent daemon to reply to a stop service request.
Cannot contact local agent... exiting
The ASE director daemon exited because it could not contact its local agent.
Network connection down... exiting
The ASE director daemon exited because its network connection was not available.
Received message from agent which is not in the config database
The ASE director daemon received a message from a nonmember agent.
Can't ping my agent, exiting...
The ASE agent daemon on the member that is running the ASE director daemon is not registered with the portmap daemon.
Unable to start service <service> on <host>.
A service relocation failed.
Cannot start service <service>.
After a device failure, a service cannot be started on any potential member.
Member <member> is not available.
A member is not answering pings over the network or the SCSI bus; therefore, it is considered unavailable.
Service <service> cannot be run on any available members.
The ASE director daemon cannot start the specified service. If a service is restricted to run on a subset of the members, this message indicates that it cannot run on any of those members. Check the appropriate daemon.log event logging file for more information.
Unable to stop service <service> due to a timeout. The service is in an unknown state.
One of the stop scripts did not return an exit value within its timeout period, and the stop action may not have completed. It is important to ensure that the service is completely stopped before continuing.
A member has an invalid IP address. ASE members are on different subnets.
The Internet Protocol (IP) address must be a valid address on the same subnet as the other ASE members.
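As a quick sanity check, a subnet comparison along these lines can confirm whether a candidate address shares a subnet with the existing members. This is an illustrative sketch, not part of the TruCluster software; the netmask default is an assumption, so substitute the one your ASE actually uses.

```python
import ipaddress

def on_same_subnet(candidate, members, netmask="255.255.255.0"):
    """Return True if candidate shares a subnet with every member address.

    The netmask default is an assumption; pass your ASE's real netmask.
    """
    def net_of(ip):
        # strict=False lets us build the network from a host address
        return ipaddress.ip_network(f"{ip}/{netmask}", strict=False)

    want = net_of(candidate)
    return all(net_of(m) == want for m in members)
```

Running this against the addresses of all ASE members flags the misconfigured one before the director logs this Alert.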
Can't ping agent on <member>
The ASE agent daemon on the specified member is not registered with the portmap daemon.
Can't open channel to agent on <member>
The ASE director daemon cannot establish a connection with the agent on the specified member.
This section describes some Alert messages generated by the available server environment (ASE) host status monitor (HSM) daemon.
Network ping to host <host> is working but SCSI ping is not
A problem exists in all of the SCSI bus paths between the host specified in the message and the member that reported the message. Check the cabling between systems and disks on the shared buses.
Network ping to host <host> is working and now SCSI ping is also working
The condition described in the first ASE HSM daemon message has been cleared. SCSI pings can now be sent between the hosts on at least one of the shared buses.
This section describes some Alert messages generated by the asemgr utility.
Test of alert script
This message is generated when you choose the "Test the error alert script" item from the asemgr utility's Managing the ASE menu.
Bad return code from ****
This message is generated when a routine returns an unexpected return code; it indicates a bug in the TruCluster software. If it occurs, contact your field service representative.
Net partition - cannot find a director.
The asemgr utility cannot find the ASE director daemon because of a network partition.
Unable to translate host <host> to an IP address
A routine cannot map a member host name to an Internet Protocol (IP) address. There could be a problem with the /etc/hosts file or with Berkeley Internet Name Domain (BIND).
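To reproduce the lookup outside the asemgr utility, a minimal resolver check such as the following can help isolate whether /etc/hosts or BIND is at fault. This sketch uses the standard resolver interface and is not code from the TruCluster software.

```python
import socket

def translate_host(hostname):
    """Return the IP address for hostname, or None if the lookup fails.

    Uses the system resolver, so it exercises the same /etc/hosts and
    BIND configuration the cluster utilities depend on.
    """
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None
```

If this returns None for a member host name, fix name resolution before retrying the asemgr operation.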
Could not allocate database
Could not malloc
These messages occur if a malloc operation fails. They indicate that the system is running out of memory or swap space.
Configuration database is corrupted (Invalid length of ASE version)
BUG NOTICE: Exit before finishing unmarshal_tree
Something is wrong with the ASE database (for example, it has been corrupted).
This section describes some Alert messages generated by the MEMORY CHANNEL subsystem.
memory channel - alternate on-line
In a redundant MEMORY CHANNEL configuration, the alternate MEMORY CHANNEL interconnect has come on line. This message is printed only when the alternate comes on line after MEMORY CHANNEL software initialization.
switching from mc<number> to mc<number>
The cluster is failing over from the primary MEMORY CHANNEL interconnect to the secondary MEMORY CHANNEL interconnect.
rm_sw_init: can't fail over from mc<number> to mc<number>
The cluster cannot fail over to the secondary MEMORY CHANNEL interconnect due to hardware problems with the secondary MEMORY CHANNEL interconnect.
requesting memory channel failover, node <node>
A member system is requesting other member systems to fail over to the secondary MEMORY CHANNEL interconnect.
memory channel - checking cables
The MEMORY CHANNEL subsystem is checking that the primary MEMORY CHANNEL interconnect is plugged into the same hub on all member systems.
memory channel failover request from node <node>
A MEMORY CHANNEL failover request has been received from the specified member system.
rm_boot_request_init: didn't switch
The cluster cannot fail over to the secondary MEMORY CHANNEL interconnect due to hardware problems with the secondary MEMORY CHANNEL interconnect.
memory channel node <node> already cluster member, crashing node <node>
A node that has been identified as a cluster member is requesting cluster membership. The MEMORY CHANNEL subsystem will shut it down to restore consistency.
memory channel - failed initialization
A hardware problem has prevented MEMORY CHANNEL subsystem initialization.
received a request from node <node> to failover
The specified node has requested a failover to the secondary MEMORY CHANNEL interconnect.
rm_failover_rmerror_request: can't fail over from mc<number> to mc<number>
Failover to the secondary MEMORY CHANNEL interconnect is not possible, probably due to a member system's not being able to access the secondary MEMORY CHANNEL interconnect.
rmerror_get_errcnt_kl:crashing node <node>
The specified MEMORY CHANNEL node is unresponsive and is being shut down.
rmerror_free_errcnt_lk: Too many retries, node <node> must be down
rmerror_init:Error_count = <number> unit = <number> Err_reg = <value> Node = <node>
A MEMORY CHANNEL error interrupt has been received and error recovery is in progress.
rmerror_init:crashing node <node>
The specified node is unresponsive and is being shut down.
rmerror_state_change: unit = <number> Err_reg = <value> node = <node>
A state change has been received, indicating that another member system has joined or left the cluster.
rmerror_state_change: failed to failover
The cluster made an unsuccessful attempt to fail over from the primary MEMORY CHANNEL interconnect to the secondary MEMORY CHANNEL interconnect. It is likely that a member system cannot access the secondary MEMORY CHANNEL interconnect.
rmerror_railover: Node = <node> Flag = <value> Action = <value>
The MEMORY CHANNEL subsystem has requested a failover to the secondary MEMORY CHANNEL interconnect.
rmerror_failover: no alternate mc to fail over to
No functional secondary MEMORY CHANNEL interconnect is available for failover.
rmerror_failover: negative error count
Failover has been simultaneously initiated on multiple member systems. This is an informational message.
rmerror_failover_1:crashing node <node>
The specified MEMORY CHANNEL node is unresponsive and is being shut down.
rmerror_failover: not every node can failover
The cluster aborted a failover to the secondary MEMORY CHANNEL interconnect, because not all member systems could fail over to it.
rmerror_failover_2: crashing node <node>
The specified MEMORY CHANNEL node is unresponsive and is being shut down.
checking for existing memory channel nodes
The MEMORY CHANNEL subsystem is looking for other nodes connected to the MEMORY CHANNEL interconnect that may be either running or in the process of booting.
unresponsive mc nodes - waiting for node mask
A node connected to the MEMORY CHANNEL interconnect is not responding to boot requests. The MEMORY CHANNEL subsystem is waiting for the node to boot.
crashing unresponsive node <node>
The node indicated in the message did not respond to repeated boot requests. It may be hung, so the MEMORY CHANNEL software attempts to crash it to allow cluster formation to progress. This crashing ... message is usually preceded by several unresponsive mc nodes ... messages.
booting as primary memory channel node
This MEMORY CHANNEL node is the first node to boot and initialize its MEMORY CHANNEL subsystem.
memory channel software inited - node <node>
Initialization of low-level MEMORY CHANNEL software is complete.
requesting memory channel interrupt, node <node>
This MEMORY CHANNEL node has requested an interrupt from another node, which has already initialized its low-level MEMORY CHANNEL software. This is the first step a node takes to initialize its MEMORY CHANNEL software when another MEMORY CHANNEL node is already initialized.
requesting memory channel update interrupt, node <node>
This MEMORY CHANNEL node has requested an update interrupt from another node, which has already initialized its low-level MEMORY CHANNEL software. This is the second step a node takes to initialize its MEMORY CHANNEL software when another MEMORY CHANNEL node is already initialized.
memory channel status request from node <node>
A MEMORY CHANNEL node is looking for other existing MEMORY CHANNEL nodes.
memory channel request from node <node>
This MEMORY CHANNEL node is responding to an interrupt from another node, which is attempting to initialize its low-level MEMORY CHANNEL software. This is the first step a node takes to initialize its MEMORY CHANNEL software when another MEMORY CHANNEL node is already initialized.
memory channel update request from node <node>
This MEMORY CHANNEL node is responding to an update interrupt from another node, which is attempting to initialize its low-level MEMORY CHANNEL software. This is the second step a node takes to initialize its MEMORY CHANNEL software when another MEMORY CHANNEL node is already initialized.
memory channel - adding node <node>
Low-level MEMORY CHANNEL software is adding another node.
memory channel - removing node <node>
Low-level MEMORY CHANNEL software is removing a node.
memory channel node <node> timed out, hardware does not see it
A node is not responding. It will be removed.
memory channel thread init
The general-purpose MEMORY CHANNEL thread has completed initialization.
This section describes some Alert messages generated by the distributed lock manager (DLM) subsystem. These messages are logged to the console and kernel log in the /usr/adm/syslog.dated file. Some, as noted, are also logged to the user's terminal.
dlm_subsys_configure: can't init lkid table
Either the system has an insufficient amount of memory or the number of locks allocated at boot time (indicated by the dlm_locks_cfg kernel attribute) is too large. To fix this problem, either decrease the value of the dlm_locks_cfg kernel attribute in the /etc/sysconfigtab file and reboot, or add more memory to the system.
dlm_subsys_configure: can't init rsb table
Either the system has an insufficient amount of memory or the size of the DLM resource hash table (indicated by the rhash_size kernel attribute) is too large. To fix this problem, either decrease the value of the rhash_size kernel attribute in the /etc/sysconfigtab file and reboot, or add more memory to the system.
dlm_subsys_configure: can't init pdb table
Either the system has an insufficient amount of memory or the size of the process descriptor block hash table (indicated by the pdb_hash_size kernel attribute) is too large. To fix this problem, either decrease the value of the pdb_hash_size kernel attribute in the /etc/sysconfigtab file and reboot, or add more memory to the system.
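The fix for all three dlm_subsys_configure messages above follows the same pattern: lower the named attribute in /etc/sysconfigtab and reboot. A stanza along these lines shows the shape of the change; the stanza name shown and the values are illustrative assumptions, so consult your system documentation for the attributes' actual subsystem name and valid ranges.

```
# /etc/sysconfigtab fragment (illustrative; the "dlm:" stanza name and
# the values shown are assumptions, not documented defaults)
dlm:
        dlm_locks_cfg = 4096
        rhash_size = 1024
        pdb_hash_size = 512
```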
dlm_subsys_configure: can't start timeoutq
The DLM cannot start the DLM timeout queue thread. To fix this problem, reboot the member system. If the problem recurs, contact your DIGITAL support representative.
dlm_subsys_configure: dlm_hab_configure failed
The DLM cannot configure its habitat. To fix this problem, reboot the member system. If the problem recurs, contact your DIGITAL support representative.
dlm: configured
The DLM subsystem has been configured successfully at boot time.
dlm_subsys_configure: configure failed
DLM configuration has failed on this member system.
dlm_create_lock: pid <value> copyout err of lkid <value>
The dlm_lock or dlm_quelock function cannot return a lock ID to the buffer specified in the function call. This message is sent to the console, system log, and the user's terminal. To fix this problem, check the application program that called the function and ensure that it passes a valid buffer address. See the TruCluster Production Server Application Programming Interfaces manual for more information about DLM functions.
dlm_create_lock COMP_LOCK: pid <value> IVLOCKID lkid <value> uaddr <value>
The dlm_lock function has attempted to use an invalid lock ID. This can occur when the function was interrupted by a signal and, before it resumed, the application that called the function dequeued the lock or corrupted the lock ID in its signal handler. This message is sent to the console, system log, and the user's terminal. See the TruCluster Production Server Application Programming Interfaces manual for more information about DLM functions.
dlm_create_lock: pid <value> err while copying out valblk for lkid <value>
dlm_convert_lock: valblk copyout fault: lkid <value> kvalbp <value> uvalb_p <value>
The dlm_lock, dlm_quelock, dlm_cvt, or dlm_quecvt function cannot return the resource's value block to the buffer specified in the function call. This message is sent to the console, system log, and the user's terminal. To fix this problem, check the application program that called the function and ensure that it passes a valid buffer address. See the TruCluster Production Server Application Programming Interfaces manual for more information about DLM functions.
dlm_collect: pid <value> DLM_EFAULT notf_entry
dlm_collect: pid <value> DLM_EFAULT ngot
The dlm_notify function cannot return the blocking notification routine parameter or hint to the buffer specified in the function call. This message is sent to the console, system log, and the user's terminal. To fix this problem, check the application program that called the function and ensure that it passes a valid buffer address. See the TruCluster Production Server Application Programming Interfaces manual for more information about DLM functions.
This section describes some Alert messages generated by the distributed raw disk (DRD) subsystem. DRD messages are logged to the kern.log file unless otherwise noted.
drd_configure_subsys: failed in drd_driver_configure
One of the subcomponents of DRD was unable to initialize. There will usually be an accompanying error message providing more detail on the cause of the initialization error. Verify that all hardware components are operational.
drd_configure: drd-maphash-size, invalid size.
drd_bp_pool_configure: bogus tunables, using default.
An invalid value was specified for a tunable parameter. See drd(7) for a description of the tunable parameters.
drd_configure: subsystem unconfiguration not yet supported.
The DRD subsystem cannot be dynamically unconfigured.
drd_map_delete: can't delete, drain failed.
The underlying physical device driver failed to complete outstanding I/O operations. Check the system error logs for driver-specific errors.
drd_map_add: LMF PAK not registered.
The required product license has not been registered. Use the lmf command to register the appropriate Product Authorization Key (PAK).
drd_map_add: rejecting map on validation errors.
A corrupt or invalid map entry has been received. The underlying device driver type or device type may not be supported or operational.
drd_resolve_map: Can't find server for DRD drd3.
The DRD subsystem has repeatedly tried to determine which node within the cluster is the server of the specified disk. The retry timeout limit has been reached, and the DRD subsystem is returning an error to the calling application. Use the asemgr utility to verify that the service is operational. This error could indicate that stale DRD device special files are being accessed.
drd_map_rpc_add: attempt to replace local map with remote.
drd_map_rpc_add: rejecting new remote map.
The server of a DRD disk has received a new map entry, indicating that another node also believes that it is the disk's server. At any given time, there should be only one server. Check for the occurrence of any errors from the available server environment (ASE) subsystem that may determine the cause of this problem.
drd_map_rpc_add: drd_map_add() failed with 23.
The error number specifies an error return status when attempting to add a new DRD map entry. One of the validation checks has failed, or a DRD server is unable to open the underlying device.
bss_open: device type not supported with drd.
The underlying driver type or device type does not meet DRD's validation requirements.
bss_rm_init: register_RM_member_callback failed
bss_rm_init: get_RM_information failed
bss_rm_init_sync: RM_GET_CHARACTERISTICS failed
bsc_rm_init: get_RM_information failed
bsc_rm_init: register_RM_member_callback failed
The underlying MEMORY CHANNEL subsystem is returning an error status to the DRD. Check for related error messages and verify that the MEMORY CHANNEL hardware is operational.
bss_subsys_configure: invalid bssd count
This indicates that the argument passed to the bssd command is not within the acceptable range. See bssd(8) for details.
bss_subsys_configure: bssd not restartable, reboot needed.
The bssd daemon cannot be killed and restarted; see bssd(8) for details. This message indicates that the bssd daemon has been killed off and an attempt has been made to restart it. To restart the bssd daemon, reboot the system.
bss_subsys_configure: bssd already running.
An attempt was made to start a second bssd daemon while one bssd daemon was already running. Only one instance of the bssd daemon is allowed to run at a time.
Another BSSD is already running, exiting.
An attempt was made to start a second bssd daemon while one bssd daemon was already running. Only one instance of the bssd daemon is allowed to run at a time. This message appears in the daemon.log file.
Daemon not restartable. Reboot required.
The bssd daemon has been killed off and an attempt is being made to restart it. The bssd daemon cannot be restarted; you must reboot the system to make DRD operational again. This message appears in the daemon.log file.
Can't register with portmap.
The portmapper daemon may have been killed off, which prevents the bssd daemon from establishing its network connection. This message appears in the daemon.log file.
Returned from kernel call, exiting.
The bssd daemon is exiting. This informational message appears in the daemon.log file.
DRD is NOT loaded and configured. Unable to open DRD control device
The DRD subsystem has not been initialized or has failed to initialize. Check the kern.log file for related error messages. This message appears in the daemon.log file.
Failed ioctl to set socket descriptor.
The kernel portion of DRD has rejected the bssd daemon's attempts to connect. Check the kern.log file for related error messages. This message appears in the daemon.log file.