Balancing Your Interrupts

We had a weird one this week. The MRTG graphs were showing excessive CPU usage on the 10 servers that run our core systems. These servers handle about 30 million requests per day and were recently moved to new hardware. In the haste of the move I’d really just been treating them as the hardware equivalent of what they were moved from. The truth was, I went from 4 cores to 8 cores. I probably should have seen this issue on 4 cores, but the golden rule, “if it ain’t broke, don’t fix it,” kept me from touching things.

Basically, what we saw was a huge pileup of usage on CPU0 while CPU1-7 sat relatively idle. I immediately assumed we had a problem with PHP or some component under it: MySQL, curl, something. None of it made much sense, as I was pretty sure they were all multithreaded and should make use of all of our CPUs.

I whipped up a quick and dirty shell script to run under watch so we could see the distribution of Apache processes across our cores.


#!/bin/bash
# Count processes matching $1 per CPU (psr = processor the process last ran on).
/bin/ps -eo pid,psr,comm | grep -- "$1" | awk '{print $2}' | sort -n | uniq -c | sort -rn

Basically just a wrapper for ps so you don’t have to deal with the escapes with watch.

Run it like this and watch the spread at the default 2-second interval:

root@L1COXMLP001# watch ./cpu_spread httpd

The output typically looked something like this (process count on the left, CPU number on the right):

     10 0
      1 6
      2 7
      1 3
      4 2
      1 1
      1 5
      1 4

And top output looked like this:

 

Cpu0  : 19.1%us,  4.3%sy,  0.0%ni, 75.8%id,  0.4%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu1  :  0.6%us,  0.3%sy,  0.0%ni, 98.9%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  1.0%us,  0.2%sy,  0.0%ni, 98.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.5%us,  0.3%sy,  0.0%ni, 96.9%id,  2.4%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.4%us,  0.2%sy,  0.0%ni, 99.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.5%us,  0.2%sy,  0.0%ni, 99.2%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.5%us,  0.2%sy,  0.0%ni, 99.2%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.6%us,  0.2%sy,  0.0%ni, 99.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

 

Both indicated that we were heavily weighted toward CPU0. We did some heavy load testing with the code, swapping out components and eventually focusing mostly on curl_multi. Under load the balance looked good, but as the load faded back we again saw a strong preference for CPU0.

I ran this problem by our head Windows guy, and he immediately pointed at CPU interrupts, telling me that all my packets were inherently going to be handled on CPU0 and that this was probably the source of my issue. He pointed me at this article to help translate to my OS.

I wouldn’t have believed that my kernel was at fault. We’ve spent the better part of the last 5 years not worrying about kernel issues when operating our systems. But as I walked through the info in that link, I realized that we were indeed handling all eth0 interrupts on CPU0, as evidenced by the output of cat /proc/interrupts below:

 

          CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
  0: 1933309583          0          0          0          0          0          0          0    IO-APIC-edge  timer
  1:          0          0          0          0          0          0          0          0    IO-APIC-edge  i8042
  8:          0          0          0          0          0          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0          0          0          0          0   IO-APIC-level  acpi
 14:          0          0          0          0          0          0          0          0    IO-APIC-edge  libata
 15:          0          0          0          0          0          0          0          0    IO-APIC-edge  libata
 66:          0          0          0          0          0          0          0          0   IO-APIC-level  uhci_hcd:usb1
 74:          0          0          0          0          0          0          0          0   IO-APIC-level  uhci_hcd:usb2
 82:          0          0          0          0          0          0          0          0   IO-APIC-level  uhci_hcd:usb3
 90:         18          0          0          0          0          0          0          0   IO-APIC-level  ehci_hcd:usb4
106: 2898261048          0          0          0          0          0          0          0         PCI-MSI  eth0
169:       4072    3384603          0   81206560    1052812      61543     236521        210   IO-APIC-level  ioc0
NMI:    1596732      81288      98747      66461      59269      67638      63367      71652 
LOC: 1932382093 1932382033 1932381958 1932381883 1932381804 1932381733 1932381658 1932381583 
ERR:          0
MIS:          0

 

You can clearly see that every eth0 interrupt (IRQ 106) is being handled on CPU0.
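Reading those columns by eye gets old quickly, so here is a minimal sketch that turns one device's row into a per-CPU share. The function name irq_share and its optional file argument are my own inventions for illustration, and it assumes the device name is the last field on the line, as it is in the output above.

```shell
# Sketch: print each CPU's count and share of the interrupts for one device.
# irq_share is a hypothetical helper name. The optional second argument lets
# you point it at a saved copy of /proc/interrupts instead of the live file.
irq_share() {
    local dev="$1" file="${2:-/proc/interrupts}"
    awk -v dev="$dev" '
        NR == 1 { ncpu = NF; next }      # header row: one field per CPU
        $NF == dev {                     # assumes the device name is the last field
            total = 0
            for (i = 2; i <= ncpu + 1; i++) total += $i
            for (i = 2; i <= ncpu + 1; i++)
                printf "CPU%d %s %.1f%%\n", i - 2, $i, (total ? 100 * $i / total : 0)
        }
    ' "$file"
}
```

Run as `irq_share eth0`, it prints one line per CPU. On a box in the state shown above, the CPU0 line would carry 100.0% and the rest 0.0%.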

So the solution, according to the link cited above, is to get to a kernel version at or above 2.6.24.3. This was concerning, as the stock kernel upgrade for CentOS had blown out our ethernet interfaces entirely on the last yum update. There is apparently very little development happening on the bnx2 drivers. Regardless, I went to kernel.org and pulled down the source for the lowest numbered stable version above 2.6.24.3. We ended up with 2.6.27.59. The process, for those who haven’t done it in a while, boils down to this:

1. cd /usr/src/kernel
2. wget "http://www.kernel.org/pub/linux/kernel/v2.6/longterm/v2.6.27/linux-2.6.27.59.tar.bz2"
3. tar xvfj linux-2.6.27.59.tar.bz2
4. cd linux-2.6.27.59
5. make clean && make mrproper
6. make menuconfig
7. make clean
8. make -j 7 bzImage
9. make -j 7 modules
10. make -j 7 modules_install
11. make install 
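
Before committing to a rebuild on other boxes, it may be worth scripting a check of the running kernel against the 2.6.24.3 threshold mentioned above. This is only a sketch: version_ge is a made-up helper name, it assumes a GNU sort recent enough to support -V, and it assumes the distro suffix on `uname -r` starts at the first hyphen. Distro kernels often carry backported patches, so the number alone is a hint rather than proof.

```shell
# Sketch: succeed if version $1 is at or above version $2.
# version_ge is a hypothetical helper; it relies on GNU sort's -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

running="$(uname -r)"
running="${running%%-*}"    # drop a distro suffix after the first hyphen (assumed format)
if version_ge "$running" 2.6.24.3; then
    echo "kernel $running is at or above 2.6.24.3"
else
    echo "kernel $running is older than 2.6.24.3"
fi
```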

After this, it’s good policy to edit your /boot/grub/menu.lst and change the timeout setting to 30 seconds or so, just in case something goes horrifically wrong. That gives you time, especially if you’re remote, to intervene at the boot menu. Once the steps above are done, you just need to reboot and choose the kernel you want to boot.
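
If you have several servers to touch, the timeout change can be scripted. The following is a rough sketch, assuming a GRUB legacy config with a line of the form `timeout=5` or `timeout 5`. The helper name set_grub_timeout is mine, not anything GRUB ships.

```shell
# Sketch: set the menu timeout in a GRUB legacy menu.lst-style file.
# set_grub_timeout is a hypothetical helper; sed keeps a .bak copy of the file.
set_grub_timeout() {
    local file="$1" seconds="$2"
    # Match "timeout=N" or "timeout N" at the start of a line and rewrite it.
    sed -i.bak "s/^timeout[ =][ =]*[0-9][0-9]*/timeout=${seconds}/" "$file"
}
```

Usage would be something like `set_grub_timeout /boot/grub/menu.lst 30`. If the file also has a hiddenmenu line, you will probably want to comment that out by hand so the menu is actually displayed during the timeout.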

Ours went smoothly and we booted without any problems. Once up, our stats looked like this:

root@L1COXMLP001# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
  0:         40          0          0          0          0          0          0          0   IO-APIC-edge      timer
  1:          0          0          0          0          0          0          0          0   IO-APIC-edge      i8042
  8:          0          0          0          0          1          0          0          0   IO-APIC-edge      rtc0
  9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
 14:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide0
 15:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide1
 18:         16         20         16         14         16         20         17         15   IO-APIC-fasteoi   uhci_hcd:usb1
 19:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 21:          1          1          2          2          0          1          1          1   IO-APIC-fasteoi   uhci_hcd:usb2
 23:          3          2          2          3          4          1          1          3   IO-APIC-fasteoi   ehci_hcd:usb4
4341:    2847208    2847351    2847206    2847296    2847296    2847335    2701240    2701364   PCI-MSI-edge      eth0
4342:       2170        549       2175        829        832        570     322285     322167   PCI-MSI-edge      ioc0
NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:   27885246   27885314   27885219   27885125   27885031   27884938   27884844   27884744   Local timer interrupts
RES:      11888      11107      14026      10673      14827      13929      13509      15878   Rescheduling interrupts
CAL:        444        462        448        495        460        479        491        256   function call interrupts
TLB:      37769      36178      60347      32592      58644      55544      36587      55544   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
ERR:          0
root@L1COXMLP001#

Output of top:

Cpu0  :  5.8%us,  1.5%sy,  0.0%ni, 92.3%id,  0.3%wa,  0.0%hi,  0.1%si,  0.0%st
Cpu1  :  5.8%us,  1.5%sy,  0.0%ni, 92.3%id,  0.3%wa,  0.0%hi,  0.1%si,  0.0%st
Cpu2  :  5.7%us,  1.5%sy,  0.0%ni, 92.4%id,  0.3%wa,  0.0%hi,  0.1%si,  0.0%st
Cpu3  :  5.7%us,  1.5%sy,  0.0%ni, 92.4%id,  0.3%wa,  0.0%hi,  0.1%si,  0.0%st
Cpu4  :  5.8%us,  1.5%sy,  0.0%ni, 92.3%id,  0.3%wa,  0.0%hi,  0.1%si,  0.0%st
Cpu5  :  5.7%us,  1.5%sy,  0.0%ni, 92.4%id,  0.3%wa,  0.0%hi,  0.1%si,  0.0%st
Cpu6  :  5.8%us,  1.4%sy,  0.0%ni, 89.9%id,  2.7%wa,  0.0%hi,  0.1%si,  0.0%st
Cpu7  :  5.8%us,  1.5%sy,  0.0%ni, 89.8%id,  2.7%wa,  0.0%hi,  0.1%si,  0.0%st

This is clearly the balance we were looking for. I guess the moral of the story is: don’t assume the distros are doing exactly what you need when it comes to core kernel performance.


2 Comments

  1. IRQ Guy says:

    From the IRQBalance daemon documentation:

    “For the Networking interrupt class, it is essential that the interrupt goes to one and one core only. The implementation of the Linux TCP/IP stack will then use this property to get some major efficiencies in its operation.” https://irqbalance.org/documentation.html

    So it’s wrong to assume that an “even” distribution of eth0 interrupts among CPU cores is desirable. It looks better intuitively, but the irqbalance daemon developers say it’s not as efficient as assigning the task to “one core, and one core only.”
