神刀安全网

DDNet Server Statistics with ServerStatus, RRDtool and Nim

2016-05-14

About a month ago I set up statistics for the official DDNet servers. My motivations for this are:

  1. Monitor the servers more easily
  2. Get notified about server problems
  3. Have nice graphs to look at

The software was chosen mainly to keep resource usage low, a general principle for DDNet, since we run on cheap VPSes all around the world with limited CPU and memory. The rest of this post explores the three major tools used, their purpose in our solution, and their performance impact:

  1. ServerStatus: Gather live server statistics
  2. RRDtool: Record and graph data
  3. Nim: Favorite programming language for performance and readability

Gathering live server statistics with ServerStatus

We’ve been running BotoX’s ServerStatus to get live server statistics for some time now. It works quite well to quickly notice major server problems like a high load or incoming (D)DoS attacks, provided that you keep an eye out for it.

We use ServerStatus by running its simple Python client on each server to gather interesting information. The client transmits that data over TCP to the C/C++ server, which aggregates it into a JSON file. The JavaScript frontend of our Status page fetches and displays this JSON file every two seconds.
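As a rough illustration of what a consumer of that JSON file does, here is a minimal Python sketch. The field names (`updated`, `servers`, `type`, `name`, `cpu`, `load`) are taken from the Nim programs further down in this post; the sample values, the second domain name and the helper `summarize` are invented for the example and are not part of ServerStatus.

```python
import json

# Hypothetical sample mimicking the aggregated JSON file; all values are invented.
SAMPLE = """
{
  "updated": "1463212800",
  "servers": [
    {"name": "DDNet.tw", "type": "ddnet.tw", "cpu": 87, "load": 1.2},
    {"name": "DDNet RUS", "type": "rus.ddnet.tw", "cpu": 5, "load": 0.1}
  ]
}
"""

def summarize(stats_text):
    """Return (updated timestamp, {domain: cpu}) from the stats JSON text."""
    stats = json.loads(stats_text)
    updated = int(stats["updated"])  # the timestamp is stored as a string
    cpu_by_domain = {s["type"]: s["cpu"] for s in stats["servers"]}
    return updated, cpu_by_domain

updated, cpu = summarize(SAMPLE)
print(updated, cpu)
```

Comparing `updated` between two fetches is enough to tell whether the file has changed since the last read.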

On a regular Saturday morning the end result looks like this:

[Image: screenshot of the DDNet Status page]

At a quick glance you notice that the DDNet.tw server has high CPU usage (which is totally normal since it runs some hefty cron jobs every 20 minutes) and that DDNet RUS is receiving a small DoS attack of just 1.6 MB/s (unfortunately also totally normal). Apart from that everything looks fine.

ServerStatus footprint, calculated from Linux /proc statistics as follows (in Nim):

    import os, strutils, posix, strfmt

    # CPU and memory usage of process, based on PROC(5) and
    # http://stackoverflow.com/a/16736599

    let
      pid = paramStr(1)
      # Clock ticks per second
      frequency = sysconf(SC_CLK_TCK)
      # Size of memory page in bytes
      pagesize  = sysconf(SC_PAGESIZE)

      uptime    = readFile("/proc/uptime").split[0].parseFloat
      fields    = readFile("/proc/" & pid & "/stat").split

      # Amount of time in user mode
      utime     = fields[13].parseInt
      # Amount of time in kernel mode
      stime     = fields[14].parseInt
      # Amount of children time in user mode
      cutime    = fields[15].parseInt
      # Amount of children time in kernel mode
      cstime    = fields[16].parseInt
      # Time process started after boot
      starttime = fields[21].parseInt

      # Resident Set Size: number of pages in memory
      rssmem    = fields[23].parseInt

      totaltime = utime + stime + cutime + cstime
      seconds   = uptime - (starttime / frequency)

      cpuusage  = 100 * (totaltime / frequency) / seconds
      memusage  = rssmem * pagesize / 1_000_000

    echo interp"${cpuusage:.2f} % CPU ${memusage:.2f} MB Memory"

    Part                CPU      Memory
    Client              0.14 %   3.70 MB
    Server (9 clients)  0.58 %   0.18 MB
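For readers who do not use Nim, the same calculation can be sketched in Python as a pure function. This is only an illustration of the formula, not the script used for the measurements: the stat line below is invented, and the function name `footprint` and its parameters are made up for the example.

```python
def footprint(stat_line, uptime_s, ticks_per_s, page_size):
    """Return (cpu_percent, rss_megabytes) for one /proc/<pid>/stat line."""
    fields = stat_line.split()
    utime, stime, cutime, cstime = (int(fields[i]) for i in (13, 14, 15, 16))
    starttime = int(fields[21])   # clock ticks after boot when the process started
    rss_pages = int(fields[23])   # resident set size in pages

    total_s = (utime + stime + cutime + cstime) / ticks_per_s
    alive_s = uptime_s - starttime / ticks_per_s
    return 100 * total_s / alive_s, rss_pages * page_size / 1_000_000

# Invented stat line: 24 fields, only the ones used above carry meaning.
fields = ["0"] * 24
fields[13:17] = ["300", "100", "0", "0"]   # utime, stime, cutime, cstime
fields[21] = "10000"                        # starttime
fields[23] = "500"                          # rss in pages
cpu, mem = footprint(" ".join(fields), uptime_s=1100.0, ticks_per_s=100, page_size=4096)
print(f"{cpu:.2f} % CPU {mem:.2f} MB Memory")
```

On a real Linux system the last two arguments would come from `os.sysconf("SC_CLK_TCK")` and `os.sysconf("SC_PAGE_SIZE")`, and the uptime from the first number in /proc/uptime.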

Recording and graphing data with RRDtool

I hadn’t used RRDtool for about 7 years, but it’s still an excellent tool to record data into a fixed-size round robin database. Three RRDtool functions matter for us: create to create the database, update to add a new value to be aggregated into the database, and graph to render the database into a beautiful graph.

CPU, network and memory are the most important resources for me, so their usage should be recorded. Let us use network traffic as an example and create a database:

First we need to think about what data we want to record in the RRD:

    1 sample = 30 seconds
     1 day  =   2880 samples =   6 * 480 pixels, each pixel is 03:00 min
     7 days =  20160 samples =  42 * 480 pixels, each pixel is 21:00 min
    49 days = 141120 samples = 147 * 960 pixels, each pixel is 73:30 min
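The arithmetic behind this plan is easy to check. The following Python sketch recomputes it from the step size and the two numbers per archive (samples per pixel and number of pixels); the function name `archive_plan` is made up for the example.

```python
STEP = 30  # seconds per sample, as in the plan above

def archive_plan(steps_per_row, rows):
    """Return (days covered, samples covered, 'MM:SS' per row) for one archive."""
    samples = steps_per_row * rows
    days = samples * STEP / 86400
    minutes, seconds = divmod(steps_per_row * STEP, 60)
    return days, samples, f"{minutes:02d}:{seconds:02d}"

plans = [archive_plan(steps, rows) for steps, rows in [(6, 480), (42, 480), (147, 960)]]
for plan in plans:
    print(plan)
```

The pairs (6, 480), (42, 480) and (147, 960) are exactly the last two numbers of the RRA definitions used when creating the database.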

Then we can use this to create the actual database file:

    rrdtool create ddnet.tw-net.rrd `# File name` \
      --step 30 `# Interval in seconds with which data is fed` \
      DS:network_rx:GAUGE:60:0:U `# Data source receiving` \
      DS:network_tx:GAUGE:60:0:U `# DS Sending` \
      RRA:AVERAGE:0.5:6:480 `# Round robin archive for 1 day` \
      RRA:AVERAGE:0.5:42:480 `# RRA for 7 days` \
      RRA:AVERAGE:0.5:147:960 `# RRA for 49 days`

If you’re curious about what exactly happens here, you can find more information in rrdcreate(1).

The resulting ddnet.tw-net.rrd file is just 32 KB in size and will forever stay that exact size. (All our databases together are just 1 MB.) New data in each round robin archive simply overwrites the oldest data. A disadvantage of RRDtool is that you need to think ahead and plan what data you want to store.
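The fixed size follows directly from the round robin layout. The Python class below is a toy model of a single archive, written only to illustrate the idea of averaging a few samples into one slot and overwriting the oldest slot; it is not how RRDtool is implemented and ignores heartbeats, unknown values and the xff setting. The class name `RoundRobinArchive` is invented.

```python
class RoundRobinArchive:
    """Toy model of one RRA: averages `steps` samples into one of `rows` slots."""

    def __init__(self, steps, rows):
        self.steps = steps
        self.rows = [None] * rows   # fixed size, allocated once
        self.next_row = 0
        self.pending = []

    def update(self, value):
        self.pending.append(value)
        if len(self.pending) == self.steps:
            # Consolidate with AVERAGE and overwrite the oldest slot.
            self.rows[self.next_row] = sum(self.pending) / self.steps
            self.next_row = (self.next_row + 1) % len(self.rows)
            self.pending = []

rra = RoundRobinArchive(steps=2, rows=3)
for v in [1, 3, 5, 7, 9, 11, 13, 15]:
    rra.update(v)
print(rra.rows)
```

After eight updates the archive still holds three slots: the fourth average has replaced the first one.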

The next step is to put new data into our little database, which we should do every 30 seconds:

rrdtool update ddnet.tw-net.rrd N:42:1234

Super simple! 42 is our network_rx value and 1234 is the value for network_tx. These values are consolidated with the AVERAGE function and then stored in their respective archives.

Once we have enough values we can finally create the graph, for example for 1 day:

    rrdtool graph ddnet.tw-net-1d.png --rigid --base 1000 \
      --width 419 --height 150 --logarithmic --units=si -a PNG \
      `# Calculation over last day only` \
      --vertical-label "Bytes/s" --start now-1d \
      `# Fetch data from RRD file` \
      DEF:network_rx=ddnet.tw-net.rrd:network_rx:AVERAGE \
      DEF:network_tx=ddnet.tw-net.rrd:network_tx:AVERAGE \
      `# Calculate aggregates based on data` \
      VDEF:network_rx_a=network_rx,AVERAGE \
      VDEF:network_rx_m=network_rx,MAXIMUM \
      VDEF:network_rx_c=network_rx,LAST \
      VDEF:network_rx_s=network_rx,TOTAL \
      VDEF:network_tx_a=network_tx,AVERAGE \
      VDEF:network_tx_m=network_tx,MAXIMUM \
      VDEF:network_tx_c=network_tx,LAST \
      VDEF:network_tx_s=network_tx,TOTAL \
      `# Draw area graph in light colors` \
      AREA:network_tx#fee8c8: \
      AREA:network_rx#e0e0e0: \
      `# Draw clear area outline on top` \
      LINE1:network_tx#e34a33:"out" \
      `# Print aggregate values to legend` \
      GPRINT:network_tx_a:"avg\: %6.2lf %sB" \
      GPRINT:network_tx_m:"max\: %6.2lf %sB" \
      GPRINT:network_tx_c:"cur\: %6.2lf %sB" \
      GPRINT:network_tx_s:"sum\: %6.2lf %sB\n" \
      `# Other area outline` \
      LINE1:network_rx#636363:"in " \
      GPRINT:network_rx_a:"avg\: %6.2lf %sB" \
      GPRINT:network_rx_m:"max\: %6.2lf %sB" \
      GPRINT:network_rx_c:"cur\: %6.2lf %sB" \
      GPRINT:network_rx_s:"sum\: %6.2lf %sB\n"

As always, the manual of rrdgraph explains the possibilities.

I’m using RRDtool 1.6.0 instead of 1.4.8 because I very much prefer its density of x-axis labels. Here are the outputs of our database:

RRDtool 1.4.8: [Image: graph rendered with RRDtool 1.4.8]

RRDtool 1.6.0: [Image: graph rendered with RRDtool 1.6.0]

RRDtool footprint:

    Part                Runtime  Memory
    Create (9 servers)  0.01 s   1.74 MB
    Graph (9 servers)   1.99 s   2.82 MB

Putting it together with Nim

To aggregate the raw data from ServerStatus into 30-second packets I use a small Nim program. It automatically creates new databases when a new server is added and keeps them updated:

    import common, json, osproc, os, times, strutils, tables

    type
      Data = object
        network_rx, network_tx: BiggestInt
        cpu, memory_used, memory_total, swap_used, swap_total: BiggestInt
        load: float

    const freq = 30 # report new data to rrd every 30 seconds

    var
      lastUpdated: BiggestInt = 0
      dataTable = initTable[string, Data]()
      count = 0
      countTable = initTable[string, int]()

    proc rrdCreate(file, dataSources: string) =
      discard execCmd(rrdtool & " create " & file & " --step " & $freq & " " & dataSources &
        " RRA:AVERAGE:0.5:6:480 RRA:AVERAGE:0.5:42:480 RRA:AVERAGE:0.5:147:960")

    proc rrdUpdate(file: string, values: varargs[string, `$`]) =
      var valuesString = ""
      for value in values:
        if valuesString.len > 0:
          valuesString.add ":"
        valuesString.add value
      discard execCmd(rrdtool & " update " & file & " N:" & valuesString)

    proc updateServer(server: JsonNode) =
      let domain = server["type"].str

      for name, value in dataTable.mgetOrPut(domain, Data()).fieldPairs:
        when value is BiggestInt:
          value += server[name].num
        elif value is float:
          value += server[name].fnum
        else:
          error "Unhandled type in Data object"

      inc countTable.mgetOrPut(domain, 0)

      # Only save data if we got 30 values in the expected time span
      if count == freq and countTable[domain] == freq:
        let data = dataTable[domain]
        if data == Data():
          dataTable.del(domain)
          return
        else:
          dataTable[domain] = Data()

        let
          fileNet = (rrdDir / domain) & "-net.rrd"
          fileCpu = (rrdDir / domain) & "-cpu.rrd"
          fileMem = (rrdDir / domain) & "-mem.rrd"

        if not existsFile fileNet:
          fileNet.rrdCreate("DS:network_rx:GAUGE:60:0:U DS:network_tx:GAUGE:60:0:U")
        if not existsFile fileCpu:
          fileCpu.rrdCreate("DS:cpu:GAUGE:60:0:100 DS:load:GAUGE:60:0:U")
        if not existsFile fileMem:
          fileMem.rrdCreate("DS:memory_used:GAUGE:60:0:U DS:memory_total:GAUGE:60:0:U DS:swap_used:GAUGE:60:0:U DS:swap_total:GAUGE:60:0:U")

        fileNet.rrdUpdate(data.network_rx div freq, data.network_tx div freq)
        fileCpu.rrdUpdate(min(data.cpu div freq, 100), data.load / freq)
        fileMem.rrdUpdate(data.memory_used div freq, data.memory_total div freq, data.swap_used div freq, data.swap_total div freq)

    proc updateAllServers =
      let statsJson = parseFile statsJsonFile
      let newUpdated = parseBiggestInt statsJson["updated"].str

      if newUpdated <= lastUpdated:
        return

      inc count

      for server in statsJson["servers"]:
        try:
          updateServer(server)
        except:
          discard

      if count == freq:
        count = 0
        for val in countTable.mvalues:
          val = 0

    while true:
      let startTime = epochTime()

      updateAllServers()

      sleep(int(epochTime() - startTime + 1) * 1000) # every second
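The core of the monitor is the accumulate-then-average step. The Python sketch below isolates just that idea for a single value of a single server. It is a simplification made for illustration: the class name `Accumulator` is invented, the readings are made up, and unlike the Nim program it does not discard windows in which a server missed some samples.

```python
FREQ = 30  # number of one-second samples per RRD update, as in the Nim monitor

class Accumulator:
    """Sums per-second readings for one server and emits their average."""

    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, value):
        """Add one reading; return the integer average every FREQ readings."""
        self.total += value
        self.count += 1
        if self.count < FREQ:
            return None
        average = self.total // FREQ   # mirrors `div freq` in the Nim code
        self.total = 0
        self.count = 0
        return average

acc = Accumulator()
results = [acc.add(v) for v in range(60)]   # invented readings 0..59
emitted = [r for r in results if r is not None]
print(emitted)
```

Sixty readings produce exactly two averages, one per 30-second window, which is the rate at which the databases expect new values.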

The final graphs can be seen on the DDNet Server Statistics page.

Nim monitor footprint:

    Part                 CPU      Memory
    Monitor (9 servers)  0.03 %   0.89 MB

Alerts through Cron and Mail

Now we certainly have nice graphs, but automated alerts about suspicious events would be even better, for example:

  • Network traffic over 2 MB/s for 4 min
  • Memory and swap over 90% for 4 min
  • CPU over 90% for 21 min
  • Load over 10 for 21 min
  • Server unreachable for 1 hour

To check these conditions a cron job is run regularly and thanks to a MAILTO entry a mail is sent when an alert has been triggered.

We can get a single value out of the database and onto standard output using PRINT:

    rrdtool graph x -s-4min \
      DEF:v=ddnet.tw-net.rrd:network_rx:AVERAGE \
      VDEF:vm=v,AVERAGE \
      PRINT:vm:%lf

The alert program itself is written in Nim as well and merely gets a few values from the databases and checks if the limits are exceeded:

    import common, osproc, os, json, strutils

    proc get(file, value, time: string): float =
      let (output, errorCode) = execCmdEx(rrdtool & " graph x -s -" & time & " DEF:v=" & file & ":" & value & ":AVERAGE VDEF:vm=v,AVERAGE PRINT:vm:%lf")

      if errorCode != 0:
        raise newException(ValueError, "Error code " & $errorCode & " from rrdtool: " & output)

      result = output.splitLines[^2].parseFloat

    if paramCount() != 1:
      echo "alert [1d|7d|49d]"
      quit 1

    let statsJson = parseFile statsJsonFile
    for server in statsJson["servers"]:
      let
        domain = server["type"].str
        name = server["name"].str
        fileNet = (rrdDir / domain) & "-net.rrd"
        fileCpu = (rrdDir / domain) & "-cpu.rrd"
        fileMem = (rrdDir / domain) & "-mem.rrd"

      template alert(s: string) = echo name, " ", s

      case paramStr(1)
      of "1d":
        if fileNet.get("network_rx", "4min") + fileNet.get("network_tx", "4min") > 2_000_000:
          alert "network traffic over 2 MB/s for 4 min"

        if fileMem.get("memory_used", "4min") + fileMem.get("swap_used", "4min") > 0.9 * (fileMem.get("memory_total", "4min") + fileMem.get("swap_total", "4min")):
          alert "memory and swap over 90% for 4 min"

      of "7d":
        if fileCpu.get("cpu", "21min") > 90.0:
          alert "CPU over 90% for 21 min"
        if fileCpu.get("load", "21min") > 10.0:
          alert "Load over 10 for 21 min"

      of "49d":
        let network_rx = fileNet.get("network_rx", "4410")
        if network_rx != network_rx: # NaN
          alert "unreachable for 1 hour"

      else:
        echo "unknown parameter ", paramStr(1)
        quit 1
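Two details of the alert program are worth isolating: picking the printed value out of the command output, and treating NaN as "no data". The Python sketch below shows both under stated assumptions. The output strings are assumed shapes (an image-size line followed by the printed value), not captured from a real run, and the function names are invented. It also merges the traffic and reachability checks into one function for brevity, whereas the Nim program checks them over different time windows.

```python
import math

def parse_print_output(output):
    """Return the value on the last non-empty line of the command output."""
    lines = [line for line in output.splitlines() if line.strip()]
    return float(lines[-1])

def check_network(rx, tx, limit=2_000_000):
    """Return alert messages for one server given average rx/tx in bytes/s."""
    if math.isnan(rx):
        return ["unreachable"]
    if rx + tx > limit:
        return ["network traffic over 2 MB/s"]
    return []

# Assumed output shapes: an image-size line followed by the PRINT value.
busy = parse_print_output("0x0\n1900000.000000\n")
missing = parse_print_output("0x0\nnan\n")
print(check_network(busy, 200_000.0), check_network(missing, 0.0))
```

The `network_rx != network_rx` comparison in the Nim code is the same NaN test that `math.isnan` performs here: NaN is the only float that is not equal to itself.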

Nim alert footprint:

    Part               Runtime  Memory
    Alert (9 servers)  0.25 s   2.12 MB

Conclusion

Live view of the last day of the DDNet.tw web server, also hosting this blog:

[Images: three live graphs of the DDNet.tw server]

You can see the full graphs on the DDNet Server Statistics page. As usual you can find the entire source code in our git repository.

All in all the system uses very few resources, puts out some nice graphs and alerts me automatically about the problems defined above.

    Part                 Runtime  CPU      Memory
    Client               -        0.14 %   3.70 MB
    Server (9 clients)   -        0.58 %   0.18 MB
    Create (9 servers)   0.01 s   -        1.74 MB
    Graph (9 servers)    1.99 s   -        2.82 MB
    Monitor (9 servers)  -        0.03 %   0.89 MB
    Alert (9 servers)    0.25 s   -        2.12 MB
