[Ipmitool-devel] Implementation of lanplus for retry (Kazuyuki Funada)

Discussion:

Pat Donlin

2014-04-17 13:40:59 UTC

Funada,

I have seen similar retry failures in the lanplus driver, and have
submitted fixes which are presently part of the 1.8.14-rc1 build. The
specific problem you describe is likely the lack of cleanup on the prior
request before starting the retry request. The last 3 lines added to the
diff below highlights the call to clean up the old request off the list.

In general I am an advocate of increasing the default timeout for
lanplus from 1 second to at least 4 seconds. My experience with Intel
Romley and Grantley platforms has shown a steady increase in load on
BMCs and more frequent occasions where the BMC simply cannot respond
within 1 second.

Regards,

Pat Donlin
Principal Engineer
SGI

diff -c lanplus.c.orig lanplus.c.leakfix
*** lanplus.c.orig 2014-01-15 07:48:36.000000000 -0600
--- lanplus.c.leakfix 2014-01-15 07:48:43.000000000 -0600
***************
*** 2099,2104 ****
--- 2099,2105 ----
uint8_t * msg_data;
int msg_length;
struct ipmi_session * session = intf->session;
+ struct ipmi_rq_entry * entry = NULL;
int try = 0;
int xmit = 1;
time_t ltime;
***************
*** 2123,2129 ****
/*
* Build an IPMI v1.5 or v2 command
*/
- struct ipmi_rq_entry * entry;
struct ipmi_rq * ipmi_request =
payload->payload.ipmi_request.request;

lprintf(LOG_DEBUG, "");
--- 2124,2129 ----
***************
*** 2304,2309 ****
--- 2304,2312 ----

if (rsp)
break;
+ // req timed out, remove entry
+ if ((payload->payload_type ==
IPMI_PAYLOAD_TYPE_IPMI) && entry)
+ ipmi_req_remove_entry( entry->rq_seq,
entry->req.msg.cmd);
}

/* only timeout if time exceeds the timeout value */

1. regarding libipmitool library (sarath azad)
2. Implementation of lanplus for retry (Kazuyuki Funada)
3. [BMR #81324] PigeonPoint Systems various patches (Dmitry Bazhenov)
----------------------------------------------------------------------
------------------------------
Message: 2
Date: Fri, 11 Apr 2014 09:25:58 +0000
Subject: [Ipmitool-devel] Implementation of lanplus for retry
Content-Type: text/plain; charset="iso-2022-jp"
Hello. I'm a newbie in this list and I have a question about current implementation of lanplus for retry.
Our firmware developers including me have faced a issue that target controller occasionally returns 0xc1 response to "Get Chassis Status" command.
We found the issue happened when target controller could not respond within 1 second and ipmitool retried sending packet. We checked debugging output of ipmitool and understood the mechanism was below.
- user issued command(netfn=0x00 command=0x01) by ipmitool using lanplus I/F
- ipmitool sent command(netfn=0x06 command=0x01) and added it to list with seq#2
(target controller did not respond within 1 second)
- ipmitool sent command again and added it to list with seq#2
- target controller sent response for 1st command
- ipmitool received it and removed 1st entry(seq#2)
- ipmitool sent command(netfn=0x2c command=0x00) and added it to list with seq#3
- target controller sent response for 2nd command(retry)
- ipmitool received it and removed 2nd entry(seq#2)
- ipmitool sent user command(netfn=0x00 command=0x01) and added it to list with seq#4
- target controller sent response for 3rd command(seq#3) and it had 0xc1 response
- ipmitool received it and removed 3rd entry(seq#3)
- ipmitool returned this 0xc1 response to user even though it was not for user command(seq#4).
I guess ipmitool should remove 1st entry by itself before adding retried command to list.
I also checked source code of version 1.8.13 and found "lanplus.c" does not have a code for the purpose but "lan.c" has it.
My question is whether current implementation of "lanplus.c" is correct or not.
I also know we can avoid this problem by using "-N" option and we will use it for a while.
Best Regards,
Kazuyuki Funada
------------------------------

Hank Bruning

2014-04-17 14:06:22 UTC

Permalink

Pat,
An increase from one to four seconds is a band aid on a much bigger problem.
I would advocate your request be rejected. The IPMB bus loading that causes
a timeout varies second to second.
System Managers, especially ATCA and Open Computer Project use a feed back
loop to sense when the timeouts start to occur.

If a malicious piece of software is pounding the IPMI device in /dev it's a
denial of service attack and 4 seconds will not help you from the LAN
interface.
It's one of the reasons that linux /dev ipmi access is not required by Open
Compute Project.

Use a feed back loop to sense when the bus is busy.

Hank Bruning
JBlade
IPMI Hemi Architect

Post by Pat Donlin
Funada,
I have seen similar retry failures in the lanplus driver, and have
submitted fixes which are presently part of the 1.8.14-rc1 build. The
specific problem you describe is likely the lack of cleanup on the prior
request before starting the retry request. The last 3 lines added to the
diff below highlights the call to clean up the old request off the list.
In general I am an advocate of increasing the default timeout for
lanplus from 1 second to at least 4 seconds. My experience with Intel
Romley and Grantley platforms has shown a steady increase in load on
BMCs and more frequent occasions where the BMC simply cannot respond
within 1 second.
Regards,
Pat Donlin
Principal Engineer
SGI
diff -c lanplus.c.orig lanplus.c.leakfix
*** lanplus.c.orig 2014-01-15 07:48:36.000000000 -0600
--- lanplus.c.leakfix 2014-01-15 07:48:43.000000000 -0600
***************
*** 2099,2104 ****
--- 2099,2105 ----
uint8_t * msg_data;
int msg_length;
struct ipmi_session * session = intf->session;
+ struct ipmi_rq_entry * entry = NULL;
int try = 0;
int xmit = 1;
time_t ltime;
***************
*** 2123,2129 ****
/*
* Build an IPMI v1.5 or v2 command
*/
- struct ipmi_rq_entry * entry;
struct ipmi_rq * ipmi_request =
payload->payload.ipmi_request.request;
lprintf(LOG_DEBUG, "");
--- 2124,2129 ----
***************
*** 2304,2309 ****
--- 2304,2312 ----
if (rsp)
break;
+ // req timed out, remove entry
+ if ((payload->payload_type ==
IPMI_PAYLOAD_TYPE_IPMI) && entry)
+ ipmi_req_remove_entry( entry->rq_seq,
entry->req.msg.cmd);
}
/* only timeout if time exceeds the timeout value */

implementation of lanplus for retry.

Our firmware developers including me have faced a issue that target

controller occasionally returns 0xc1 response to "Get Chassis Status"
command.

We found the issue happened when target controller could not respond

within 1 second and ipmitool retried sending packet. We checked debugging
output of ipmitool and understood the mechanism was below.

- user issued command(netfn=0x00 command=0x01) by ipmitool using lanplus

I/F

- ipmitool sent command(netfn=0x06 command=0x01) and added it to list

with seq#2

(target controller did not respond within 1 second)
- ipmitool sent command again and added it to list with seq#2
- target controller sent response for 1st command
- ipmitool received it and removed 1st entry(seq#2)
- ipmitool sent command(netfn=0x2c command=0x00) and added it to list

with seq#3

- target controller sent response for 2nd command(retry)
- ipmitool received it and removed 2nd entry(seq#2)
- ipmitool sent user command(netfn=0x00 command=0x01) and added it to

list with seq#4

- target controller sent response for 3rd command(seq#3) and it had 0xc1

response

- ipmitool received it and removed 3rd entry(seq#3)
- ipmitool returned this 0xc1 response to user even though it was not

for user command(seq#4).

I guess ipmitool should remove 1st entry by itself before adding retried

command to list.

I also checked source code of version 1.8.13 and found "lanplus.c" does

not have a code for the purpose but "lan.c" has it.

My question is whether current implementation of "lanplus.c" is correct

or not.

I also know we can avoid this problem by using "-N" option and we will

use it for a while.

Best Regards,
Kazuyuki Funada
------------------------------

------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/NeoTech
_______________________________________________
Ipmitool-devel mailing list
https://lists.sourceforge.net/lists/listinfo/ipmitool-devel

Kazuyuki Funada

2014-04-18 00:08:54 UTC

Permalink

Hi Pat,

I understood the fix of this problem was already included in rc version and will be released as 1.8.14 officially.

Thanks,
Kazuyuki Funada

Harshad Prabhu (hprabhu)

2014-04-22 17:52:58 UTC

Permalink

Lan.c had the fix for this and lanplus did not

This is the lan.c fix which needs to similar in lanplus also

http://sourceforge.net/p/ipmitool/mailman/message/28435582/

-----Original Message-----
From: Kazuyuki Funada <funada-***@necst.nec.co.jp>
Date: Friday, April 18, 2014 at 5:38 AM
To: "ipmitool-***@lists.sourceforge.net"
<ipmitool-***@lists.sourceforge.net>
Subject: Re: [Ipmitool-devel] Implementation of lanplus for retry
(Kazuyuki Funada)

Hi Pat,

I understood the fix of this problem was already included in rc version
and will be released as 1.8.14 officially.

Thanks,
Kazuyuki Funada

1. regarding libipmitool library (sarath azad)
2. Implementation of lanplus for retry (Kazuyuki Funada)
3. [BMR #81324] PigeonPoint Systems various patches (Dmitry Bazhenov)
----------------------------------------------------------------------
------------------------------
Message: 2
Date: Fri, 11 Apr 2014 09:25:58 +0000
Subject: [Ipmitool-devel] Implementation of lanplus for retry
Content-Type: text/plain; charset="iso-2022-jp"
Hello. I'm a newbie in this list and I have a question about current
implementation of lanplus for retry.
Our firmware developers including me have faced a issue that target
controller occasionally returns 0xc1 response to "Get Chassis Status"
command.
We found the issue happened when target controller could not respond
within 1 second and ipmitool retried sending packet. We checked
debugging output of ipmitool and understood the mechanism was below.
- user issued command(netfn=0x00 command=0x01) by ipmitool using lanplus I/F
- ipmitool sent command(netfn=0x06 command=0x01) and added it to list with seq#2
(target controller did not respond within 1 second)
- ipmitool sent command again and added it to list with seq#2
- target controller sent response for 1st command
- ipmitool received it and removed 1st entry(seq#2)
- ipmitool sent command(netfn=0x2c command=0x00) and added it to list with seq#3
- target controller sent response for 2nd command(retry)
- ipmitool received it and removed 2nd entry(seq#2)
- ipmitool sent user command(netfn=0x00 command=0x01) and added it to list with seq#4
- target controller sent response for 3rd command(seq#3) and it had 0xc1 response
- ipmitool received it and removed 3rd entry(seq#3)
- ipmitool returned this 0xc1 response to user even though it was not
for user command(seq#4).
I guess ipmitool should remove 1st entry by itself before adding retried command to list.
I also checked source code of version 1.8.13 and found "lanplus.c" does
not have a code for the purpose but "lan.c" has it.
My question is whether current implementation of "lanplus.c" is correct or not.
I also know we can avoid this problem by using "-N" option and we will use it for a while.
Best Regards,
Kazuyuki Funada
------------------------------

--------------------------------------------------------------------------
----
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/NeoTech
_______________________________________________
Ipmitool-devel mailing list
https://lists.sourceforge.net/lists/listinfo/ipmitool-devel

---------------------------------------------------------------------------
---
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/NeoTech