spaceblog

ruminations on network management

I just finished reading the notes from the 2003 and 2004 BoFs on config management at LISA, which I discovered via a post on the cfengine list announcing a new list, config-mgmt. It got me inspired, thinking again about my own plans for network and config management, so I thought I’d scribble down some of the ideas before they dissolve back into the ether.

(this turned out a lot longer than I expected it to)

  • DCML looks like it’ll be useful as a storage format; I hacked up something similar called “fengshui” (though you won’t find the code (directly) there), which was an XMLish description of our DC that I used to plan the move last September. The tinkering I’ve done since has tried to extend that XML to hold more information, both for rendering the layout and for generating cfengine and nagios configuration from the single specification. DCML looks like it’s gone further than that (or at least, it plans to), so I’ll probably start looking at it to build my own tools on top of.

  • I’ve been really impressed with the design of SCons: you describe the build in a clear and practical way, with the power of a complete programming language behind it to extend the build as you need to. In configuration management, I see hiding the specifics of the tools that do the job behind a simple API as a requirement for the uptake of such tools, and I think something similar to SCons could act as a worker agent that supplements, augments, or even replaces tools like cfengine (though I’m not discounting enhancing cfengine itself; it is already a fair way along the road to being the tool that I am imagining), in which the platform-specific work is left to the tool and the description of the intended result can be easily specified by the operator.

    There’s a problem in that last item – with so many different target applications that one could conceive of managing with a Super Config Tool, how do you create a simple interface that still gives the operator enough flexibility to achieve goals you haven’t yet considered, without breaking out of the system? It’s brought up in the proceedings linked above that admins often skip the task of “shoehorning” the configuration into the management tool if they know they can do it faster on the bare metal.

    Here’s an example: I create my tool with features for configuring various MTA setups (SMTP AUTH, secondary MX, various forms of sender and recipient checking) and target it at postfix, and it becomes a simple matter of creating a config for, say, cfengine, that does the usual work (checking for and installing the package, ensuring it is running, setting configuration variables in the correct config file, adding supplementary files, etc.: all standard and trivial operations with cfengine). But another person using this tool may want to configure sendmail with it, and I’m pretty damn sure all those high-level concepts are possible with sendmail, so you’d better be able to make it happen with this tool too.

    Config steps that happen outside of the application (e.g. checking that the process is running) tend to be common across an operating system (i.e. observing the process table, or running service $daemon status), but configuring an application to have a particular behaviour is where the game becomes more complicated. It is unlikely that there is a 1-1 mapping of configuration options and possible values for each of the concepts above in both postfix and sendmail. How would you then translate a combination of those into a configuration for either without hardcoding the results? We’ve already decided that we want to allow the admin to do things we haven’t even thought of yet.

    At this point, I think compromising on the middle ground of being able to specify the application and configure it specifically is the way to go; in my case I’ll choose postfix and within my tool describe what options need to be set and what to set them to, and allow the sendmail guy the same; the tool itself will know how to translate specific key/value pairs into postfix or sendmail configuration files. (There’s a rough sketch of what I mean below, after this list.)

    I should add, there are tools like Plesk and whatnot that give this high-level interface in which the operator can enable or disable general features, but the drawback (specifically with Plesk) is that it is tailored to only one product, which makes it less attractive for administrators looking to integrate with their existing infrastructure. (Additionally, Plesk requires the installation of Plesk-branded applications, which is even less desirable: not only are you forced to use their specific choice of application, you’re also tied to the release schedule of the management tool vendor and not your operating system vendor.) Tools which allow operators to manage any application, and even multiple types of application, are going to gain more mindshare as they won’t be subject to “religious” opinions about the applications they manage.

    Cfengine does this well, I think, but the main barrier users face is the steep learning curve and the feeling that instructing cfengine to do something takes longer than doing it yourself. That’s a tough nut to crack; I can’t see an obvious answer to the problem without getting caught in the Plesk trap, but I’ve found in my own experience that using a macro language to simplify common operations has been a real win: we use M4 at Anchor to generate constructs ranging from edits of configuration files to default arguments for copy commands and a high-level interface to testing and installing packages. The drawback here is that I had to learn M4 to do this, and it’s not pretty (but it felt good, like riding a rollercoaster :-) A simpler macro language in cfengine itself would be a win, or (and I entertain thoughts of an all-powerful management tool again) generating configuration from a higher-level management tool would also be good. I see a tie-in with DCML again; my early experiments with my bastardised XML “fengshui” also attempted to automatically generate cfengine inputs, with promising amounts of success (nothing sufficient for production though).

  • Monitoring and data collection are essential parts of the equation; nagios and cricket are the tools I use, and currently it’s a lot of work to keep the nagios configuration in sync with the rest of the network. I experimented with M4 again to generate this config, with less success than with cfengine; the variability of service tests meant the macros had less value than they did with cfengine, and hence I’d end up writing almost as many macros as services, which is not a big win.

    I experimented with autogenerating the nagios config from fengshui, though, and that turned out better; nagios’s object inheritance model maps onto the structure of the XML more closely than onto plain macros, which meant the service definitions could be described by the XML easily.

    Cricket is a different beast, but supplements nagios in areas that would otherwise be unmonitored: usage over time (bandwidth, disk usage, system load) in cricket gives more meaningful results than a spot check, and its ability to send traps when it detects an anomaly is useful.

    Collating all this monitoring data currently means keeping a few tabs open in the browser and keeping an eye on them, and listening out for notifications via jabber and/or SMS, which feels like I’m not getting the complete picture. Nagios has a plugin that will allow a client to poll it using XML-RPC for the same data that the user sees in the browser, and the data cricket collects can be retrieved easily by a third-party tool. It should then be simple enough to have a daemon that collects data from nagios and cricket together and allows the interface of our Super Management Tool to display this information to the user, combining plots of load and disk usage with response times of the webserver to give a clearer view of the network.

    Once all this data is presentable, the tool can perform analysis to determine the cause of current faults (within limits, of course, but I’m still dancing with my thoughts here) and potentially identify faults before they occur. (It’s simple to give an ETA on a full disk based on current trends; I think other patterns will emerge to indicate other potential faults. There’s a trivial sketch of the disk ETA idea below, after this list.) There’s also the possibility of recording and playback of faults; say when you return to the office from lunch and you want to see what happened in the leadup to a service outage, you’d rewind the whole network a few hours and watch what happened, just like you were on CSI:Delaware watching a security tape for clues.

    Of course there’s plenty of opportunity here for OpenGL eye candy, too…
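
To make the MTA example above a bit more concrete, here’s a rough sketch of the kind of translation layer I’m imagining, in Python for want of a better choice. The high-level option names and the mapping function are made up for illustration; the postfix parameter names are real, but which ones the tool would actually drive is entirely up for grabs, and a sendmail backend would just be another function with the same interface.

    # Hypothetical sketch: translate high-level MTA options into postfix
    # main.cf key/value pairs. Option names and structure are invented for
    # illustration only; the postfix parameter names themselves are real.

    HIGH_LEVEL_OPTIONS = {
        "smtp_auth": True,
        "secondary_mx_for": ["example.org"],
        "reject_unknown_senders": True,
    }

    def to_postfix(options):
        """Translate high-level options into postfix main.cf settings."""
        settings = {}
        if options.get("smtp_auth"):
            settings["smtpd_sasl_auth_enable"] = "yes"
        if options.get("secondary_mx_for"):
            settings["relay_domains"] = ", ".join(options["secondary_mx_for"])
        if options.get("reject_unknown_senders"):
            settings["smtpd_sender_restrictions"] = "reject_unknown_sender_domain"
        return settings

    def render_main_cf(settings):
        """Render the settings as main.cf lines."""
        return "\n".join("%s = %s" % (key, value)
                         for key, value in sorted(settings.items()))

    print(render_main_cf(to_postfix(HIGH_LEVEL_OPTIONS)))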

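And since I mentioned the full-disk ETA: a minimal sketch, assuming nothing cleverer than a least-squares line through recent (timestamp, bytes used) samples, extrapolated out to the disk’s capacity. The sample data at the bottom is made up.

    # Minimal sketch: estimate when a disk fills by fitting a straight line to
    # (timestamp, bytes_used) samples and extrapolating to capacity.

    def eta_disk_full(samples, capacity_bytes):
        """Return the estimated timestamp at which usage reaches capacity,
        or None if usage isn't growing."""
        n = float(len(samples))
        sum_t = sum(t for t, _ in samples)
        sum_u = sum(u for _, u in samples)
        sum_tt = sum(t * t for t, _ in samples)
        sum_tu = sum(t * u for t, u in samples)
        denom = n * sum_tt - sum_t * sum_t
        if denom == 0:
            return None
        slope = (n * sum_tu - sum_t * sum_u) / denom   # bytes per second
        intercept = (sum_u - slope * sum_t) / n
        if slope <= 0:
            return None                                # not growing, no ETA
        return (capacity_bytes - intercept) / slope    # timestamp of "full"

    # e.g. hourly samples, growing ~1GB/hour, on a 100GB filesystem
    samples = [(0, 50e9), (3600, 51e9), (7200, 52e9), (10800, 53e9)]
    print("%.1f hours until full" % (eta_disk_full(samples, 100e9) / 3600.0))
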
One thing I haven’t tackled yet is the testing and rollout process; most shops define their own, but it’s rarely supported fluidly by the tools. At least in my experience it’s the most difficult thing to get a regular process in place for: most of the time straight rollouts just work, and the problem of doing a quick on-the-metal hack versus doing it right also comes into play. I guess I’ll sit on this one a while and see if I can’t come up with some ideas.

systems url of the day

traceroute.org links to the traceroute CGIs for major ISPs in a lot of countries. What would be cooler is a script that scrapes the traceroutes from all of them to build a picture of how your subnet looks to the rest of the net.
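
A very rough sketch of what such a scraper might look like, in Python; the lookup URLs, the query parameter, and the hop-matching regex are all placeholders, since every site on traceroute.org has its own form and output format, so a real version would need a per-site table.

    # Rough sketch: ask several public traceroute CGIs about one target and
    # count the hop lines each reports. The URLs and query parameter are
    # placeholders; real sites each have their own forms and output formats.

    import re
    import urllib.request

    TARGET = "192.0.2.1"   # the address you want the world's view of
    SITES = [
        "https://lg.example-isp-one.net/cgi-bin/traceroute?host=",
        "https://tools.example-isp-two.com/trace?target=",
    ]

    hop_line = re.compile(r"^\s*\d+\s+\S+.*$", re.MULTILINE)   # " 3  router.example ..."

    for site in SITES:
        try:
            with urllib.request.urlopen(site + TARGET, timeout=30) as response:
                page = response.read().decode("utf-8", "replace")
        except OSError as err:
            print("failed to fetch from %s: %s" % (site, err))
            continue
        hops = hop_line.findall(page)
        print("%s -> %d hop lines seen" % (site, len(hops)))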

best read of the day

So I’m mucking around with DejaGNU, trying to build a sensible test suite for filtergen (and by sensible, I don’t mean sensible, but instead “sufficiently different to that of GCC that the documentation and examples are really not helping”).

Basically, I can’t get my test program to be executed properly: stdout carries some basic verification output to confirm that my scanner and parser are behaving correctly, whilst stderr captures all the warnings and errors – but the harness doesn’t like this: every time a message gets written to stderr, the harness crashes. This seems to contradict the behaviour expected from reading /usr/share/dejagnu/dg.exp and this message, in which one expects to be able to test the warnings, errors, and output of a program.
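
For contrast, this is roughly the behaviour I was expecting out of a harness: stdout and stderr captured separately and both available for checking, with output on stderr not being treated as a failure in itself. A trivial illustration in Python rather than DejaGNU/Tcl, with a made-up invocation and made-up expected strings:

    # Trivial illustration of the behaviour I expected: stdout and stderr are
    # captured separately, both are checked, and writing to stderr is not
    # itself an error. The invocation and expected strings are made up.

    import subprocess

    result = subprocess.run(
        ["./filtergen", "--parse-only", "tests/example.fg"],   # hypothetical flags
        capture_output=True, text=True)

    assert "parsed ok" in result.stdout       # verification output on stdout
    assert "warning:" in result.stderr        # warnings expected on stderr
    assert result.returncode == 0             # exit status checked explicitly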

So eventually, after poring over the gcc testsuite libraries, trying to work out how it all sticks together, I find myself in /usr/share/dejagnu/remote.exp, looking at local_exec, and I see this gem:

# Tcl's exec is a pile of crap. It does two very inappropriate things
# firstly, it has no business returning an error if the program being
# executed happens to write to stderr. Secondly, it appends its own
# error messages to the output of the command if the process exits with
# non-zero status.

EUREKA!

The rest of this function is comic gold; here’s another beautiful piece of prose (though certainly not the best nor the limit of entertainment one can find in this function).

# Uuuuuuugh. Now I'm getting really sick.
# If we didn't get an EOF, we have to kill the poor defenseless program.
# However, Tcl has no kill primitive, so we have to execute an external
# command in order to execute the execution. (English. Gotta love it.)