Pat Donlin
2014-02-27 20:17:01 UTC
I submitted a patch on Nov 19, 2013 that fixed lanplus retry. With that
patch, a retry of payload type IPMI_PAYLOAD_TYPE_IPMI first removes the
request from the queue before resending the message. I have since been
able to determine why that fix works. More importantly, I have resolved
other lanplus retry problems in which assertion panics were hit on
certain retry operations. A new replacement patch that resolves both
types of retry bug follows.
The first bug: a payload of type IPMI_PAYLOAD_TYPE_IPMI sent by
ipmi_lanplus_send_payload() is retryable, but I found in testing that
the retry did not remove the previous request entry from the request
list. If the original message timed out and a second message was sent,
the second reply would not match the right entry on the list, because
both requests carry the same command and sequence number. Removing the
first request from the list before resending resolves this. The
consequence of leaving the stale entry in place was random errors.
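
To illustrate the first bug, here is a minimal sketch of a pending-request
list keyed on command and sequence number. The struct and the helper names
(pending_req, add_req, find_req, remove_req, resend) are hypothetical
stand-ins for this note, not ipmitool's actual definitions, and the sketch
assumes new entries are appended at the tail while lookup scans from the
head.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pending-request entry; not ipmitool's real struct. */
struct pending_req {
    uint8_t cmd;
    uint8_t seq;
    int attempt;                 /* which send created this entry */
    struct pending_req *next;
};

/* Append a new entry at the tail of the list. */
static struct pending_req *add_req(struct pending_req **head,
                                   uint8_t cmd, uint8_t seq, int attempt)
{
    struct pending_req *e = malloc(sizeof *e);
    struct pending_req **pp = head;

    if (e == NULL)
        return NULL;
    e->cmd = cmd;
    e->seq = seq;
    e->attempt = attempt;
    e->next = NULL;
    while (*pp != NULL)
        pp = &(*pp)->next;
    *pp = e;
    return e;
}

/* Return the first entry matching cmd/seq, scanning from the head. */
static struct pending_req *find_req(struct pending_req *head,
                                    uint8_t cmd, uint8_t seq)
{
    for (; head != NULL; head = head->next)
        if (head->cmd == cmd && head->seq == seq)
            return head;
    return NULL;
}

/* Unlink and free the first entry matching cmd/seq, if any. */
static void remove_req(struct pending_req **head, uint8_t cmd, uint8_t seq)
{
    struct pending_req **pp = head;

    while (*pp != NULL) {
        if ((*pp)->cmd == cmd && (*pp)->seq == seq) {
            struct pending_req *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
        pp = &(*pp)->next;
    }
}

/* Retry path: drop the stale entry first, then queue the new one. */
static struct pending_req *resend(struct pending_req **head,
                                  uint8_t cmd, uint8_t seq, int attempt)
{
    assert(head != NULL);
    remove_req(head, cmd, seq);
    return add_req(head, cmd, seq, attempt);
}
```

In this model, calling add_req() twice with the same cmd and seq leaves two
entries, and find_req() returns the stale first one. Going through resend()
leaves a single entry, so the reply is matched against the request that was
actually retried.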
The second bug occurs when waiting for a message response times out
while ipmi_lanplus_send_payload() is sending one of the
IPMI_PAYLOAD_TYPE_ values RMCP_OPEN_REQUEST, RAKP_1, or RAKP_3. In
testing, whenever the message timed out on any of these three payload
types, ipmitool would hit an assertion panic on retry because the
session_state was wrong. The timeout could be due to the message never
reaching the BMC, the BMC never acting on or responding to the message,
or the reply packet being dropped (it is UDP after all). If the BMC had
acted on the message but the reply was not received, the BMC state
would have advanced, and a retry of any of these three commands would
fail. It is not knowable at retry time whether the BMC acted on the
message. The solution is, upon message timeout, to retry all three
commands in sequence. This has proven reliable and does not result in
assertions or any unexpected BMC behavior. Should the original message
response eventually arrive very late, it is simply discarded.
These problems were hard to reproduce in testing until we found a
moderately slow BMC and had separate sessions direct a fusillade of
nmap operations at it, and then ran simple ipmitool commands. This
loaded the network and the BMC enough to cause lengthy delays and
outright packet drops. The general approach of the second fix is to
return a timeout error code back up through ipmi_lanplus_open, where
the whole sequence can be retried.
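
The shape of the second fix can be sketched as a toy model: each handshake
step reports a timeout to its caller, and the caller restarts the whole
three-step sequence rather than resending one step. Everything named here
(enum step, enum result, send_fn, open_once, open_with_retry, and the
fake_bmc simulation) is hypothetical and written for this note, not taken
from the patch. The simulated BMC also assumes that a fresh open request
restarts the handshake on its side.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three session-setup steps. */
enum step { STEP_OPEN = 0, STEP_RAKP1 = 1, STEP_RAKP3 = 2, STEP_COUNT = 3 };
enum result { RES_OK, RES_TIMEOUT, RES_ERROR };

/* Hypothetical transport hook: send one step and wait for its reply. */
typedef enum result (*send_fn)(enum step s, void *ctx);

/* Run the three steps once; stop at the first timeout or error. */
static enum result open_once(send_fn send, void *ctx)
{
    int s;

    for (s = STEP_OPEN; s < STEP_COUNT; s++) {
        enum result r = send((enum step)s, ctx);
        if (r != RES_OK)
            return r;
    }
    return RES_OK;
}

/* On timeout, restart the whole sequence instead of one step. */
static enum result open_with_retry(send_fn send, void *ctx, int max_tries)
{
    enum result r = RES_TIMEOUT;
    int i;

    assert(send != NULL && max_tries > 0);
    for (i = 0; i < max_tries && r == RES_TIMEOUT; i++)
        r = open_once(send, ctx);
    return r;
}

/* Simulated BMC: acts on every message but loses one chosen reply. */
struct fake_bmc {
    int drop_on_call;   /* 1-based call whose reply is lost, 0 = none */
    int calls;          /* messages received so far */
    int expected;       /* step the BMC expects next */
};

static enum result fake_send(enum step s, void *ctx)
{
    struct fake_bmc *b = ctx;

    b->calls++;
    if (s == STEP_OPEN)
        b->expected = STEP_OPEN;   /* assumption: open restarts handshake */
    if ((int)s != b->expected)
        return RES_ERROR;          /* out-of-order step is rejected */
    b->expected = (int)s + 1;      /* BMC state advances ... */
    if (b->calls == b->drop_on_call)
        return RES_TIMEOUT;        /* ... even when the reply is lost */
    return RES_OK;
}
```

With the reply to the second message lost, open_with_retry(fake_send, &b, 3)
succeeds after five messages: two from the first attempt and three from the
restart. Resending only the timed-out step would hit RES_ERROR in this
model, since the simulated BMC has already advanced past it.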