Performance tools:
mpstat
iostat
ioping
collectl
htop
ps
vmstat
netstat (ss) / nfsstat
iotop
1. CPU
High CPU utilization usually points to a misbehaving process, a memory leak, applications overloading the system, or latency in storage protocols (NFS, SMB, or iSCSI reads and writes to disk).
1.1 Identify the processes that use the most CPU / MEM
Use mpstat -P ALL to see utilization per CPU and spot the busy ones.
# mpstat -P ALL
CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
all 10.68 2.12 4.22 1.58 0.00 0.08 0.00 0.00 0.00 81.33
0 4.98 1.16 2.18 2.21 0.00 0.13 0.00 0.00 0.00 89.34
1 18.06 3.34 6.63 1.16 0.00 0.01 0.00 0.00 0.00 70.79
2 15.59 3.10 6.74 0.82 0.00 0.05 0.00 0.00 0.00 73.69
3 16.57 2.96 5.76 0.73 0.00 0.01 0.00 0.00 0.00 73.97
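To pick out the busy CPUs from a long mpstat report, the output can be filtered on the %idle column. This is a minimal sketch, not part of sysstat: the function name busy_cpus and its default threshold of 80 are assumptions made for illustration, and it finds the columns by their header names so a leading timestamp does not matter.

```shell
# busy_cpus [MAX_IDLE]: read mpstat output on stdin and print "CPU %idle"
# for every CPU whose %idle is below MAX_IDLE (hypothetical default: 80).
busy_cpus() {
    awk -v max_idle="${1:-80}" '
        # Header row: remember which columns hold the CPU id and %idle.
        /%idle/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "CPU")   cpu_col = i
                if ($i == "%idle") idle_col = i
            }
            next
        }
        # Data rows: skip the "all" aggregate, report CPUs below the threshold.
        cpu_col && idle_col && NF >= idle_col && $cpu_col != "all" &&
        ($idle_col + 0) < max_idle {
            print $cpu_col, $idle_col
        }
    '
}
```

Example use:
# mpstat -P ALL | busy_cpus 75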
htop helps find which processes use the most CPU and / or memory.
The htop column headers are clickable, so it is easy to sort by CPU, MEM, Command, etc.
For more options, see the htop man page.
Use lsof to identify files opened by certain processes.
List open files by process (example tgt)
# lsof -c tgt
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
tgtd 517 root cwd DIR 8,0 4096 2 /
tgtd 517 root rtd DIR 8,0 4096 2 /
tgtd 517 root txt REG 8,0 300016 419498 /usr/bin/tgtd
tgtd 517 root mem REG 8,0 2047384 396560 /usr/lib/libc-2.19.so
tgtd 517 root mem REG 8,0 14648 396525 /usr/lib/libdl-2.19.so
tgtd 517 root mem REG 8,0 149301 396575 /usr/lib/libpthread-2.19.so
tgtd 517 root mem REG 8,0 160026 396544 /usr/lib/ld-2.19.so
List open files by user (example root)
# lsof -u root
Use ps for a complete list of processes and their commands:
# ps auxefw
The output is fairly long, so pipe it through grep when searching for specific processes or users.
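When the goal is the heaviest CPU consumers rather than one known process, the ps output can be sorted instead of grepped. A minimal sketch, assuming the usual "ps aux" layout where %CPU is the third column; the name top_cpu and its default of 5 rows are made up for this example.

```shell
# top_cpu [N]: read "ps aux" style output on stdin, keep the header line
# and print the N rows with the highest %CPU (column 3). N defaults to 5.
top_cpu() {
    n="${1:-5}"
    # Pass the header through unchanged, then sort what is left.
    IFS= read -r header
    printf '%s\n' "$header"
    sort -k3,3nr | head -n "$n"
}
```

Example use:
# ps aux | top_cpu 10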
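1.2 Once the troubling process has been identified, check whether it can be killed or restarted. Use ‘kill’ to send a specific signal to a process: SIGSTOP pauses it (SIGCONT resumes it), while signal 9 (SIGKILL) terminates it immediately and cannot be caught.
# kill -s SIGSTOP 519
# kill -9 519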
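SIGKILL gives the process no chance to clean up, so it is usually better to try SIGTERM first and escalate only if the process does not exit. A minimal sketch of that escalation; the function name stop_pid and the 5 second grace period are assumptions for this example, not an existing tool.

```shell
# stop_pid PID [GRACE_SECONDS]: send SIGTERM, wait up to GRACE_SECONDS
# (hypothetical default: 5) for the process to exit, then send SIGKILL.
stop_pid() {
    pid="$1"
    grace="${2:-5}"
    # Ask politely first; fail if the PID does not exist or is not ours.
    kill -TERM "$pid" 2>/dev/null || return 1
    while [ "$grace" -gt 0 ]; do
        # kill -0 sends no signal, it only checks that the PID can be signalled.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        grace=$((grace - 1))
    done
    # Still there after the grace period: force termination.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

Example use:
# stop_pid 519 10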
2. Memory
vmstat is a good tool for troubleshooting memory utilization.
A few examples:
# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 631948 13532 114992 0 0 15 2 16 35 0 0 99 1 0
# vmstat -a
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free inact active si so bi bo in cs us sy id wa st
1 0 0 632080 101472 124512 0 0 15 2 16 35 0 0 99 1 0
# vmstat -d
disk- ------------reads------------ ------------writes----------- -----IO------
total merged sectors ms total merged sectors ms cur sec
fd0 0 0 0 0 0 0 0 0 0 0
sda 4843 1484 249642 156650 1378 2059 42232 27756 0 39
sdb 1276 226 9198 1180 38 0 189 6 0 1
sdd 1269 229 7456 1326 38 0 189 3 0 1
sdc 1254 229 7394 980 45 0 189 20 0 0
For all options:
# vmstat --help
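For memory pressure the si and so columns (memory swapped in from and out to disk) are the ones to watch; sustained non-zero values mean the system is swapping. A minimal sketch that filters vmstat samples for swap traffic; the function name swapping is an assumption for this example. Note that the first sample vmstat prints is an average since boot.

```shell
# swapping: read vmstat output on stdin and print the samples that show
# any swap-in (si) or swap-out (so) activity.
swapping() {
    awk '
        # Column header row: locate the si and so columns by name.
        $1 == "r" {
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        # Sample rows start with a number; report those with swap traffic.
        si && so && $1 ~ /^[0-9]+$/ && ($si > 0 || $so > 0) {
            print "swap activity: si=" $si " so=" $so
        }
    '
}
```

Example use (5 samples, 1 second apart):
# vmstat 1 5 | swapping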
3. Disks
The iostat command generates two types of reports, the CPU Utilization report and the Device Utilization report.
iostat is a very good tool for troubleshooting disk issues such as low IOPS or throughput.
# iostat
Linux 3.14.2-1-ARCH (server1) 09/18/14 _x86_64_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.04 0.00 0.11 0.57 0.00 99.28
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 1.25 24.84 4.25 124913 21392
sdb 0.26 0.91 0.02 4599 94
sdd 0.26 0.74 0.02 3728 94
sdc 0.26 0.74 0.02 3697 94
sde 0.03 0.19 0.00 960 0
sdg 0.03 0.19 0.00 960 0
sdf 0.03 0.19 0.00 960 0
sdh 0.02 0.18 0.00 924 0
%user - Show the percentage of CPU utilization that occurred while executing at the user level (application).
%nice - Show the percentage of CPU utilization that occurred while executing at the user level with nice priority.
%system - Show the percentage of CPU utilization that occurred while executing at the system level (kernel).
%iowait - Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
%steal - Show the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
%idle - Show the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
The disk transfer rates in the report above are shown in kilobytes. Depending on the sysstat version and options, the same columns may instead be reported in blocks (Blk_read/s, Blk_wrtn/s, Blk_read, Blk_wrtn).
tps - number of transfers (I/O requests) per second issued to the device
kB_read/s / kB_wrtn/s - amount of data read from / written to the device per second
kB_read / kB_wrtn - total amount of data read / written
The following two columns appear only in the extended report (iostat -x):
%util - Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. For devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
await - The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in the queue and the time spent servicing them.
# man iostat
%util and await are the most important indicators of an I/O bottleneck.
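On a host with many disks it helps to filter the extended report down to the devices that are close to saturation. A minimal sketch, assuming the "iostat -x" layout where the device table starts with a "Device" header row containing a %util column; the function name hot_disks and the default limit of 90 are made up for this example. The await column is not used here because its name differs between sysstat versions.

```shell
# hot_disks [LIMIT]: read "iostat -x" output on stdin and print
# "device %util" for every device at or above LIMIT (hypothetical default: 90).
hot_disks() {
    awk -v limit="${1:-90}" '
        # The CPU block is not a device table: stop matching until the next header.
        /^avg-cpu/ { col = 0; next }
        # Device table header: find the %util column by name.
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # Device rows: report the ones at or above the limit.
        col && NF >= col && ($col + 0) >= limit { print $1, $col }
    '
}
```

Example use (3 reports, 5 seconds apart):
# iostat -x 5 3 | hot_disks 80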
Disk performance is determined by the RAID type, the number of disks in the array, the average IOPS per drive (determined by the rotational speed) and the I/O workload (the mix of reads and writes).
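These factors can be combined into a rough estimate of what an array can deliver, using the common rule of thumb: usable IOPS = (disks x IOPS per disk) / (read fraction + write fraction x write penalty). The write penalties normally quoted are 1 for RAID 0, 2 for RAID 1/10, 4 for RAID 5 and 6 for RAID 6. This is only a sketch: the function name raid_iops is made up, the per-disk IOPS figure is something you supply, and controller caches are ignored.

```shell
# raid_iops DISKS IOPS_PER_DISK READ_PERCENT WRITE_PENALTY
# Print the estimated usable IOPS of the array, rounded down.
raid_iops() {
    awk -v n="$1" -v iops="$2" -v rd="$3" -v pen="$4" 'BEGIN {
        wr = 100 - rd                      # write share of the workload, in percent
        # Each write costs "pen" back-end I/Os, each read costs one.
        printf "%d\n", (n * iops * 100) / (rd + wr * pen)
    }'
}
```

Example use (8 disks at 150 IOPS each, 70% reads, RAID 5):
# raid_iops 8 150 70 4
631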
Sometimes databases running on NFS exports spend their time trawling the filesystem (indexing, scanning, etc.) instead of doing useful reads or writes.
iotop
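iotop watches the I/O usage information output by the Linux kernel (requires 2.6.20 or later) and displays a table of current I/O usage by processes or threads on the system. The -o option shows only processes or threads actually doing I/O, and -a shows accumulated I/O instead of bandwidth.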
# iotop -o -a
The collectl utility is a system monitoring tool that records or displays specific operating system data for one or more sets of subsystems. Any set of the subsystems, such as CPU, Disks, Memory or Sockets, can be included in or excluded from data collection.
Used without options it shows a general view of the CPU, disk and network utilization.
# collectl
waiting for 1 second sample...
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter ctxsw KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut
0 0 30 60 0 0 0 0 0 0 0 0
0 0 47 78 0 0 0 0 0 2 0 1
0 0 32 78 0 0 0 0 0 4 0 1
Subsystems are selected with the -s option. The following lowercase letters generate summary data, which is the total of ALL data for a particular type:
b - buddy info (memory fragmentation)
c - cpu
d - disk
f - nfs
i - inodes
j - interrupts by CPU
l - lustre
m - memory
n - network
s - sockets
t - tcp
x - interconnect (currently supported: Infiniband and Quadrics)
y - slabs
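The letters above are combined into a single string, for example "cdn" for CPU, disk and network. A minimal sketch of a wrapper that checks the string against the list above before calling collectl; the function name collectl_brief and its defaults are assumptions for this example, and it relies on collectl's -s (subsystems) and -c (sample count) options.

```shell
# collectl_brief [SUBSYS] [COUNT]: run collectl for the given summary
# subsystem letters (default "cdn") and stop after COUNT samples (default 5).
collectl_brief() {
    subsys="${1:-cdn}"
    count="${2:-5}"
    # Reject anything that is not one of the summary letters listed above.
    case "$subsys" in
        *[!bcdfijlmnstxy]*)
            echo "unknown subsystem letter in: $subsys" >&2
            return 2
            ;;
    esac
    collectl -s"$subsys" -c "$count"
}
```

Example use (CPU, disk and memory, 10 samples):
# collectl_brief cdm 10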