Pat Donlin
2014-04-17 13:40:59 UTC
Funada,
I have seen similar retry failures in the lanplus driver, and have
submitted fixes which are presently part of the 1.8.14-rc1 build. The
specific problem you describe is likely caused by the prior request not
being cleaned up before the retry request starts. The last 3 lines added in
the diff below show the call that removes the old request from the list.
In general I am an advocate of increasing the default timeout for
lanplus from 1 second to at least 4 seconds. My experience with Intel
Romley and Grantley platforms has shown a steady increase in load on
BMCs and more frequent occasions where the BMC simply cannot respond
within 1 second.
Regards,
Pat Donlin
Principal Engineer
SGI
diff -c lanplus.c.orig lanplus.c.leakfix
*** lanplus.c.orig 2014-01-15 07:48:36.000000000 -0600
--- lanplus.c.leakfix 2014-01-15 07:48:43.000000000 -0600
***************
*** 2099,2104 ****
--- 2099,2105 ----
uint8_t * msg_data;
int msg_length;
struct ipmi_session * session = intf->session;
+ struct ipmi_rq_entry * entry = NULL;
int try = 0;
int xmit = 1;
time_t ltime;
***************
*** 2123,2129 ****
/*
* Build an IPMI v1.5 or v2 command
*/
- struct ipmi_rq_entry * entry;
struct ipmi_rq * ipmi_request =
payload->payload.ipmi_request.request;
lprintf(LOG_DEBUG, "");
--- 2124,2129 ----
***************
*** 2304,2309 ****
--- 2304,2312 ----
if (rsp)
break;
+ // req timed out, remove entry
+ if ((payload->payload_type == IPMI_PAYLOAD_TYPE_IPMI) && entry)
+ ipmi_req_remove_entry( entry->rq_seq, entry->req.msg.cmd);
}
/* only timeout if time exceeds the timeout value */
1. regarding libipmitool library (sarath azad)
2. Implementation of lanplus for retry (Kazuyuki Funada)
3. [BMR #81324] PigeonPoint Systems various patches (Dmitry Bazhenov)
----------------------------------------------------------------------
------------------------------
Message: 2
Date: Fri, 11 Apr 2014 09:25:58 +0000
Subject: [Ipmitool-devel] Implementation of lanplus for retry
Hello. I'm new to this list and have a question about the current retry implementation in lanplus.
Our firmware developers, including me, have run into an issue where the target controller occasionally returns a 0xc1 response to the "Get Chassis Status" command.
We found that the issue occurs when the target controller cannot respond within 1 second and ipmitool retries sending the packet. From ipmitool's debug output, we understand the sequence to be as follows.
- the user issued a command (netfn=0x00 command=0x01) with ipmitool using the lanplus I/F
- ipmitool sent a command (netfn=0x06 command=0x01) and added it to the list with seq#2
(the target controller did not respond within 1 second)
- ipmitool sent the command again and added it to the list with seq#2
- the target controller sent the response for the 1st command
- ipmitool received it and removed the 1st entry (seq#2)
- ipmitool sent a command (netfn=0x2c command=0x00) and added it to the list with seq#3
- the target controller sent the response for the 2nd command (the retry)
- ipmitool received it and removed the 2nd entry (seq#2)
- ipmitool sent the user command (netfn=0x00 command=0x01) and added it to the list with seq#4
- the target controller sent the response for the 3rd command (seq#3), and it was a 0xc1 response
- ipmitool received it and removed the 3rd entry (seq#3)
- ipmitool returned this 0xc1 response to the user even though it was not the response to the user command (seq#4).
I think ipmitool should remove the 1st entry itself before adding the retried command to the list.
I also checked the source code of version 1.8.13 and found that "lanplus.c" has no code for this purpose, whereas "lan.c" does.
My question is whether the current implementation of "lanplus.c" is correct.
I also know that we can avoid this problem by using the "-N" option, and we will use it for the time being.
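
The first part of the sequence above (send, timeout, retry, single response) can be reproduced with a small model of the pending-request list. This is only a sketch under stated assumptions, not ipmitool code: `struct req_entry`, `req_add`, `req_remove`, `req_count` and `simulate_retry` are hypothetical names, and the list is a plain singly linked list matched on sequence number and command.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the driver's pending-request list. */
struct req_entry {
    uint8_t rq_seq;              /* sequence number of the request */
    uint8_t cmd;                 /* command byte of the request */
    struct req_entry *next;
};

static struct req_entry *head = NULL;

/* Append a pending request to the end of the list. */
static void req_add(uint8_t seq, uint8_t cmd)
{
    struct req_entry *e = malloc(sizeof(*e));
    if (e == NULL)
        return;
    e->rq_seq = seq;
    e->cmd = cmd;
    e->next = NULL;

    struct req_entry **pp = &head;
    while (*pp != NULL)
        pp = &(*pp)->next;
    *pp = e;
}

/* Remove the first entry matching seq and cmd; return 1 if one was found. */
static int req_remove(uint8_t seq, uint8_t cmd)
{
    for (struct req_entry **pp = &head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->rq_seq == seq && (*pp)->cmd == cmd) {
            struct req_entry *dead = *pp;
            *pp = dead->next;
            free(dead);
            return 1;
        }
    }
    return 0;
}

/* Count the entries still pending. */
static size_t req_count(void)
{
    size_t n = 0;
    for (const struct req_entry *e = head; e != NULL; e = e->next)
        n++;
    return n;
}

/*
 * Replay "send, timeout, retry, one response arrives" and return how many
 * entries are left on the list afterwards. remove_on_timeout selects
 * whether the timed-out entry is dropped before the retry is queued.
 */
size_t simulate_retry(int remove_on_timeout)
{
    while (head != NULL)                 /* start from an empty list */
        req_remove(head->rq_seq, head->cmd);

    req_add(2, 0x01);                    /* first transmission, seq#2 */
    /* ... no response arrives within the timeout ... */
    if (remove_on_timeout)
        req_remove(2, 0x01);             /* cleanup before the retry */
    req_add(2, 0x01);                    /* retry, queued again as seq#2 */

    req_remove(2, 0x01);                 /* response to the first send arrives */
    return req_count();
}
```

In this model, `simulate_retry(0)` leaves one stale seq#2 entry on the list, which is the entry that later absorbs the delayed retry response, while `simulate_retry(1)` leaves none.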
Best Regards,
Kazuyuki Funada
------------------------------