[Ipmitool-devel] lanplus fixes

Discussion:

Pat Donlin

2014-02-27 20:17:01 UTC

I had submitted a patch back on Nov 19, 2013 regarding a fix to lanplus
retry. This had resolved a problem whereby a retry of a payload type of
IPMI_PAYLOAD_TYPE_IPMI first removed the request from the queue before
going back for a retry of the message. I have been able to determine why
this fix works correctly. More importantly I have been able to resolve
other retry problems in lanplus where assertion panics were hitting on
certain retry operations. A new, replacement patch for resolving both
of these types of retry bugs follows.

The first bug,where the ipmi_lanplus_send_payload() is sending a payload
type of IPMI_PAYLOAD_TYPE_IPMI is retryable, however I found in testing
that it did not remove the previous request entry from the list of
requests chain. If the original message had timed out, a second message
sent, the second reply would not match up to the right entry on the list
as the req command and sequence numbers are the same. By first removing
the first request from the chain this resolves it. The consequence of
not removing the stale entry was random errors.

The second bug is when waiting for a message response times out during
the ipmi_lanplus_send_payload types IPMI_PAYLOAD_TYPE(s)
RCMP_OPEN_REQUEST, RAKP1, RAKP_3. In various testing where the message
timed out on either of these three payload types, ipmitool would
assertion panic upon retry as the session_state was wrong. The timeout
could be due to the message never getting to the BMC, the BMC never
acting/responding to the message, or the reply message packet dropped
(it is UDP after all). If the BMC had acted on the message but the reply
was not received, the BMC state would had advanced, and a retry of any
of these three commands would error. It is not knowable at retry time if
the BMC had acted on the message or not. The solution is upon message
timeout failure, retry all three commands in the sequence. This has
shown to be reliable and does not result in assertions or any unexpected
BMC behaviors. Should the original message response eventually arrive
very late, it is just discarded.

The testing for these problems was elusive until we found a moderately
slow BMC and had separate sessions direct a fusillade of nmap operations
on the BMC, then run simple ipmitool commands. This caused sufficient
loading of the network and BMC to cause lengthy delays and outright
packet drops. The general approach on the second fix is to return a
timeout error code back through ipmi_lanplus_open where the sequence can
be retried.

Zdenek Styblik

2014-03-03 07:59:07 UTC

Permalink

Post by Pat Donlin
I had submitted a patch back on Nov 19, 2013 regarding a fix to lanplus
retry. This had resolved a problem whereby a retry of a payload type of
IPMI_PAYLOAD_TYPE_IPMI first removed the request from the queue before going
back for a retry of the message. I have been able to determine why this fix
works correctly. More importantly I have been able to resolve other retry
problems in lanplus where assertion panics were hitting on certain retry
operations. A new, replacement patch for resolving both of these types of
retry bugs follows.
The first bug,where the ipmi_lanplus_send_payload() is sending a payload
type of IPMI_PAYLOAD_TYPE_IPMI is retryable, however I found in testing that
it did not remove the previous request entry from the list of requests
chain. If the original message had timed out, a second message sent, the
second reply would not match up to the right entry on the list as the req
command and sequence numbers are the same. By first removing the first
request from the chain this resolves it. The consequence of not removing
the stale entry was random errors.
The second bug is when waiting for a message response times out during the
ipmi_lanplus_send_payload types IPMI_PAYLOAD_TYPE(s) RCMP_OPEN_REQUEST,
RAKP1, RAKP_3. In various testing where the message timed out on either of
these three payload types, ipmitool would assertion panic upon retry as the
session_state was wrong. The timeout could be due to the message never
getting to the BMC, the BMC never acting/responding to the message, or the
reply message packet dropped (it is UDP after all). If the BMC had acted on
the message but the reply was not received, the BMC state would had
advanced, and a retry of any of these three commands would error. It is not
knowable at retry time if the BMC had acted on the message or not. The
solution is upon message timeout failure, retry all three commands in the
sequence. This has shown to be reliable and does not result in assertions or
any unexpected BMC behaviors. Should the original message response
eventually arrive very late, it is just discarded.
The testing for these problems was elusive until we found a moderately slow
BMC and had separate sessions direct a fusillade of nmap operations on the
BMC, then run simple ipmitool commands. This caused sufficient loading of
the network and BMC to cause lengthy delays and outright packet drops. The
general approach on the second fix is to return a timeout error code back
through ipmi_lanplus_open where the sequence can be retried.

Hello Pat,

please, fix the following and I'll give it a commit, hm?

~~~
+ if ((payload->payload_type == IPMI_PAYLOAD_TYPE_IPMI) && entry)
+ ipmi_req_remove_entry( entry->rq_seq, entry->req.msg.cmd);
~~~

Missing '{ ... }'

~~~
+ int rc, retry;
~~~

Please, split it.

~~~
+ //session->v2_data.session_state = LANPLUS_STATE_PRESESSION;
~~~

I see no reason why to keep this around.

Thanks,
Z.

Pat Donlin

2014-03-03 13:17:54 UTC

Permalink

lanplus retry assertion patch, resubmitted with corrections per Zdenek.
Regards,

Pat

Pat Donlin

2014-03-03 15:12:06 UTC

Permalink

Zdenek, apologies for the inconvenience as I am resubmitting the lanplus
patch again, with an additional change from prior patch. To avoid
confusion I renamed this lanplus_retry_assertion_2. Further testing
caught an omission in lanplus.c/ipmi_lanplus_open_session() where the
return code should be 0,1, or 2 but some errors were returning -1. Now
with this patch the function returns 0, 1, or 2 for error codes, in line
with the _RAKP1 and _RAKP3 functions. With this final fix, the open
session errors retry correctly but also on non-timeout errors will exit
with error correctly.
Regards,

Pat