Wednesday, March 12, 2008

10g RAC: Increase CSS Misscount

Anyone who runs Oracle 10g RAC with multipathing may run into a problem where Oracle Clusterware restarts the server when a storage path changes on the SAN, for example when an SP path changes on an EMC array.

Example: ocssd log

ERROR: clssnmDiskPingMonitorThread: voting device access hanging (45010 miliseconds)

This happens because I/O latency to the voting disk is greater than the default IOT (an internal I/O timeout interval).
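To see how long the voting disk accesses actually hung, the durations can be pulled out of the ocssd log. The following is only a minimal sketch: the function name `max_voting_hang_ms` is made up for this post, and the log file is passed as an argument because its location depends on the installation.

```shell
#!/bin/sh
# Hypothetical helper: print the largest "voting device access hanging"
# duration (in milliseconds) found in the CSS daemon log given as $1.
# Prints nothing if the log contains no such message.
max_voting_hang_ms() {
    grep 'voting device access hanging' "$1" |
        sed -n 's/.*hanging (\([0-9][0-9]*\) .*/\1/p' |
        sort -n | tail -1
}
```

Call it as `max_voting_hang_ms <path to ocssd.log>` on each node and keep the largest value.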

Note: common causes of this latency are:

- HBA cards with a Link Down Timeout greater than the default misscount
- Bad cables to the SAN/storage array that affect I/O latencies
- SAN switch (like Brocade) failover latency greater than the default misscount
- EMC Clariion Array when trespassing the SP to the backup SP takes longer than the default misscount
- EMC PowerPath path error detection and I/O repost and redirect taking longer than the default misscount
- NetApp Cluster (CFO) failover latency greater than the default misscount
- Sustained high CPU load which affects the CSSD disk ping monitoring thread
- Poor SAN network configuration that creates latencies in the I/O path

But that is not the point of this post; I want to show how to increase misscount. First, though, consider the following before changing misscount from its default value:

- Customers drive SLA and cluster availability. The customer ultimately defines Service Levels and availability for the cluster. Before recommending any change to misscount, the full impact of that change should be described and the impact on cluster availability measured.
- Customers may have timeout and retry logic in their applications. Many customers have timeout and retry logic embedded in their applications. Delaying reconfiguration may cause 'artificial' timeouts of the application, reconnect failures and subsequent logon storms.
- Misscount timeout values are version dependent and are subject to change. As we have seen, misscount calculations vary between releases and between versions within a release. Creating a false dependency on the misscount calculation in one version may not be appropriate for later versions.
- Internal I/O timeout interval (IOT) algorithms may change in later releases. As stated above, there exists a direct relationship between the internal I/O timeout interval and misscount. This relationship is subject to change in later releases.
- An increase in misscount to compensate for I/O latencies directly affects reconfiguration times for network failures. The network heartbeat is the primary indicator of connectivity within the cluster. Misscount is the tolerance level of missed 'check ins' that trigger cluster reconfiguration. Increasing misscount will prolong the time to take corrective action in the event of network failure or other anomalies affecting the availability of a node in the cluster. This directly affects cluster availability.
- Changing misscount to work around voting disk latencies will need to be reverted when the underlying disk latency is corrected; misscount needs to be set back to the default. The customer needs to document the change and set the parameter back to the default when the underlying storage I/O latency is resolved.
- Do not change default misscount values if you are running Vendor Clusterware along with Oracle Clusterware. The default values for misscount should not be changed when using vendor clusterware. Modifying misscount in this environment may cause clusterwide outages and potential corruptions.
- Changing the misscount parameter incurs a clusterwide outage. As noted below, the customer will need to schedule a clusterwide outage to make this change.
- Changing misscount should not be used to compensate for poor configurations or faulty hardware.
- Cluster and RDBMS availability are directly affected by high misscount settings.

(MetaLink: Note 294430.1)


Note:
Before making the change, we can read the current misscount value with the ocrdump command-line tool.

$ ocrdump
$ grep -A 1 miss OCRDUMPFILE
[SYSTEM.css.misscount]
UB4 (10) : 60
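If only the number is wanted (for example, to record the old value before changing it), the two lines shown above can be reduced to the value itself. This is a small sketch that assumes the dump looks exactly like the excerpt above, with the value on the line right after the key; the function name `ocr_misscount` is invented for this post.

```shell
#!/bin/sh
# Hypothetical helper: print the misscount value from an ocrdump text file.
# Assumes the value line ("UB4 (10) : 60") directly follows the key line.
# $1 is the dump file; defaults to OCRDUMPFILE in the current directory.
ocr_misscount() {
    awk '/^\[SYSTEM\.css\.misscount\]/ {
             getline                  # move to the value line
             sub(/^.*:[ \t]*/, "")    # strip everything up to the last colon
             print
             exit
         }' "${1:-OCRDUMPFILE}"
}
```

Run `ocrdump` first, then `ocr_misscount` in the same directory.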

Steps To Modify The CSS Misscount
---------------------------------

1) Shut down CRS on all but one node
2) Execute crsctl as root to modify the misscount:
$ORA_CRS_HOME/bin/crsctl set css misscount n
where n is the maximum I/O latency to the voting disk + 1 second

$ ./crsctl set css misscount 120

3) Reboot the node where adjustment was made
4) Start all other nodes that were shut down in step 1
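The value of n in step 2 is the maximum I/O latency to the voting disk + 1 second, while the ocssd log reports hangs in milliseconds. Here is a rough sketch of that arithmetic. Rounding the latency up to whole seconds before adding 1 is my own assumption, and the function name `misscount_for_latency_ms` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: given the worst observed voting disk latency in
# milliseconds ($1), print a candidate misscount in seconds.
# Assumption: round the latency UP to whole seconds, then add 1 second.
misscount_for_latency_ms() {
    echo $(( ($1 + 999) / 1000 + 1 ))
}
```

For the 45010 ms hang in the log example above, `misscount_for_latency_ms 45010` prints 47. Treat the result as a starting point to discuss with Oracle Support, not as a value to apply blindly.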

The following are only relevant on 10.2.0.1 ... with Patch 4896338.
In addition to MissCount, CSS now has two more parameters:
1) reboottime (default 3 seconds) - the amount of time allowed for a node
to complete a reboot after the CSS daemon has been evicted (i.e. how
long it takes for the machine to completely shut down when you do a
reboot).
2) disktimeout (default 200 seconds) - the maximum amount of time allowed
for a voting file I/O to complete; if this time is exceeded the voting
disk will be marked as offline. Note that this is also the amount of
time that will be required for initial cluster formation, i.e. when no
nodes have previously been up and in a cluster.

$CRS_HOME/bin/crsctl set css reboottime r [-force] (r is seconds)
$CRS_HOME/bin/crsctl set css disktimeout d [-force] (d is seconds)


NOTE:
Customers should not modify CSS settings unless guided by either Oracle support or Oracle development to do so.

Enjoy!
