import site.body

The Manhattan Project

For a number of years I have been working on what I call the Manhattan Project, an attempt to build a collection of cooperating tools to run small-scale container systems. It was mainly intended as a stepping stone from existing server/VM based approaches to an IaaS or PaaS/serverless model.

There are a number of conscious design choices, listed below. Some of these may seem odd given the current state of the art in container tooling, but they make sense in the context of the project's original design.

Features/Design Choices

  • Split core non-overlapping functionality into separate apps:
      • network allocation
      • network provisioning
      • scheduling
      • container handling
      • image handling
  • No persistent daemons required (fully optional)
  • 'static' job scheduling
  • Ability to convert to simple 'dynamic' scheduling
  • Fully self hosted
  • Zero or minimal installation
  • Management via ssh
  • Authoritative cluster state in version control
  • Attempt to be container-solution agnostic
  • Software defined high availability designed around Ceph's CRUSH maps
  • Be simple
  • Text files for everything

My intended use case is to take a collection of servers where I have an ssh login and turn them into container hosts without having to do much work to them. Ideally these would be booted off live ISOs or other read-only storage (possibly a network boot) to make management even simpler.

Couple this with a RADIUS server or some other PAM plugin, and a very basic small-scale grid with the ability to limit or charge customers could be built. While likely not a good idea, it is entertaining to envision how this could be assembled from older Linux mechanisms that would be familiar to most long-time admins.

Many parts of the project have started to come together after several rewrites (choice of technology comes into play here, as does having to build some pieces from scratch), and it is slowly weaving into a cohesive whole.

As an example of how this is intended to work, most of the container management centres around a program called fission, which mainly syncs the local state out to the servers, starting and stopping containers and generally doing job management. As part of its duties it may call out to itabot, a separate project for IP management, to allocate IPs and build access policies based heavily on tagging and a bit of set theory.

When designing around the filesystem and making the filesystem the state of the program, several interesting things fell out of the design. One of these was the use of symlinks for allocating jobs to nodes and for general one-way linkages. This also proved useful for applications that have a core template and multiple instances of that template: by doing variable expansion on things like hostnames, it becomes very easy to spin up another 10 instances of a job with a simple for loop around the ln command.
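
As a rough sketch of the idea (the directory layout here is illustrative, not fission's actual one), allocating ten instances of a templated job to a node is just ten symlinks:

    # jobs/ holds the template, nodes/<host>/ holds what is allocated to that host;
    # the instance number in the link name gets expanded into hostnames and the like.
    for i in $(seq 1 10); do
        ln -s ../../jobs/webapp "nodes/node01/webapp-$i"
    done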

The use of text files allows rapid editing with a text editor, simple lookups and edits with grep/find and sed, or scripting with a heavier programming language. Using YAML for the configs makes things much simpler to script, and while performance does suffer, simple benchmarks show that parsing 10k files only takes a second or two; given the scale this is intended to operate at, that is significantly more nodes than I intend to handle. One other advantage of this format is that the files can be extended for third-party programs without having to accommodate them in the main app (the advantage of being schemaless), allowing arbitrary metadata to be added to any file and extending its use in unforeseen ways. This may prove to work against me, but time will tell.
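
As a loose illustration of what such a file might hold (the field names are invented for this example, not fission's actual schema), a job file with some third-party metadata riding along could look like:

    # nodes/node01/webapp-1.yaml (hypothetical)
    name: webapp-1
    image: example.org/webapp:latest
    ports:
      - 8080
    # anything a third-party tool wants to stash can live alongside the core fields
    x-billing:
      customer: acme
      rate-per-hour: 0.02

Finding every job belonging to a customer is then just grep -rl 'customer: acme' nodes/.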

The splitting into separate programs is mainly due to wanting to split the IP metadata and allocation tool off into its own program/product, as it is generally useful in its own right to be able to attach metadata to any IP and look it up at a later date (and not only for IPs you are actively managing). This is also reminiscent of older software such as mail delivery programs that had known 'extension points', or hard points as I call them: specific places in a program that can shell out to an external program, allowing implementations to be swapped in and out. This would let you swap itabot out for something that pulls IP information from AWS or another cloud provider (or just plug those providers into itabot).
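
A minimal sketch of what a hard point amounts to in practice (the config file and variable names here are hypothetical): the calling program reads a command from a text file and shells out to it, so swapping implementations is just editing one line.

    # hardpoints.conf (hypothetical) might contain: ip_allocate=/usr/local/bin/my-ip-tool
    alloc_cmd=$(sed -n 's/^ip_allocate=//p' hardpoints.conf)
    ip=$($alloc_cmd "$job_name")   # could point at itabot, a cloud-API script, or anything else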

The decision to go with 'static' scheduling is one of simplicity and the desire to reduce the number of services required to run the grid to zero: by default you should only need ssh access to a kernel capable of running containers. If you desire more dynamic scheduling, it would be easy to load each host with a program that reports in to a central job periodically. That job would perform heartbeats and, if a machine goes down, check out the latest job repository, update the allocations, execute the changes and push the repo back to master.
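
A rough sketch of that central job, assuming git holds the cluster state (the repository name and the rebalance step are placeholders for whatever actually rewrites the allocations):

    # hypothetical heartbeat/reschedule loop
    git clone ssh://repohost/cluster-state.git state
    cd state
    for nodedir in nodes/*; do
        host=$(basename "$nodedir")
        if ! ssh -o ConnectTimeout=5 "$host" true; then
            ./rebalance.sh "$host"   # stand-in: move this node's job symlinks elsewhere
        fi
    done
    git commit -am "reallocate jobs from unreachable nodes"
    git push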

Ideally each user of the grid should have a separate ssh login, but in some cases this may not work (eg ssh access to a shared shell server at a company). As such, some form of locking is required to prevent multiple users from overwriting each other's changes and to ensure consistent state. At this point the intent is to leverage the graph abilities of version control and have the client push its revision hash to the remote server. On writing the hash it is a simple matter to check the hash already stored there: if your revision is a descendant of it, the write is safe; if it is not, you may need to pull and merge the newer changes before you can push. Note that this hash only gets updated on the servers you are actually working on, which means that concurrent updates that do not overlap should proceed without issue. Some more formal checking needs to take place to ensure this is safe, as there are some potentially hairy edge cases.
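
With git as the backing store the descendant check is a single command (a minimal sketch; the path holding the node's current revision is hypothetical):

    old=$(ssh node01 cat /var/lib/grid/current-rev)   # hash last written to this node
    mine=$(git rev-parse HEAD)
    if git merge-base --is-ancestor "$old" "$mine"; then
        echo "safe to push"
    else
        echo "pull and merge first"
    fi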

The hard points come back into play when attempting to be independent of the version control and container implementations: by specifying the required commands in a separate file it should be easy to swap them in and out, and an initial check of the required functionality (descendant check, listing uncommitted files) has shown that it exists in all the version control software I checked. This seems to be one of the simpler abstractions in the entire program and makes me wonder why so many people tie themselves to git when a bit of extra work makes the same tooling applicable to anything.
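
Such a file might look something like this (the format is invented for the example; the git commands themselves are real):

    # vcs.hardpoints (hypothetical format)
    is_descendant = git merge-base --is-ancestor {old} {new}
    uncommitted   = git status --porcelain
    # mercurial, fossil and friends have equivalents that slot in the same way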

While designing this solution I intended to leverage as much existing infrastructure as possible to avoid rewriting some hairy pieces. This included the 'remoting' API, which could have been fulfilled via HTTP or some similar protocol (protobufs over TCP, for example), but that would mean running a persistent daemon and handling user accounts and authentication. A simple fix was to leverage ssh, and potentially ssh certificates (for which I have a separate project called lockpick, also minimal to install on a server and accessible via ssh). This allows interfacing with existing infrastructure and leveraging existing skills and setups.
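
In practice the remoting layer can then be as simple as plain ssh invocations (the paths here are placeholders, not the actual layout):

    # push a node's slice of the state and ask it to apply it
    rsync -a state/nodes/node01/ node01:/var/lib/grid/jobs/
    ssh node01 '/var/lib/grid/bin/apply-jobs'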

One thing I did note when looking at Kubernetes deployments is that there is a minimum number of nodes required to build what I would call an 'admin certified' cluster, one that is ready for customers and has the required amount of redundancy. This commitment seems to be between 3 and 5 machines, which can be a significant initial investment when all I want to do is run a couple of jobs on a couple of machines I have free (that may disappear tomorrow) and that may be spread across multiple cloud providers. Ideally I should require zero remote nodes as a minimum (just manage jobs locally and quickly build test environments), effectively a one machine minimum, and scale out from there. At this point it seems the decision to move to a heavier solution such as Kubernetes would be driven by features/standardization, or by having enough jobs and people working on the cluster to warrant moving up a notch, and less by node count. Note from above that file count (and therefore job/node count) will be a limiting factor for scaling out, and it is likely that around 100 jobs or about 15 nodes is the point at which you would want to scale up.

It may be interesting to have Kubernetes images ready to go for this tool, so that when you are at the point of wanting to scale to a bigger solution and have acquired enough nodes, you could simply deploy it with a small set of commands and then begin migrating jobs to the new cluster before cutting it live. This once again seems like a simple, straightforward thing to implement and should consist of baking a few images.

One innovation that has come up while developing this solution is the use of Ceph's CRUSH maps, which are used for software defined placement of storage among a set of nodes, each with different weights and a position in a hierarchy. By modelling the nodes that run jobs as a tree of folders (with the config file leaves inheriting variables from each directory above them), it is easy to build a hierarchy that, while a little less flexible, matches what Ceph feeds into its implementation. The actual rule engine is a 4-instruction bytecode stack machine that allows placement decisions to be influenced and policies such as 'at least 2 machines locally and optionally one or more remote, in that order' to be expressed easily. This makes custom placement policies fairly easy to write (though the design and readability of such things is open for debate, and is intended to be addressed via a couple of preset profiles to take the burden off most people). This 'software defined high availability' is intended to be the killer feature of the stack, and will hopefully be implemented elsewhere once it has proven itself.
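
A sketch of what such a folder hierarchy could look like (the names and the .vars convention are illustrative, not the actual on-disk format):

    nodes/
      site-home/
        site.vars          # eg location=local, weight=3; inherited by everything below
        rack-1/
          node01.yaml
          node02.yaml
      site-cloud/
        site.vars          # eg location=remote, weight=1
        node51.yaml

A policy like the one above then walks this tree in much the same way CRUSH walks its bucket hierarchy.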

If this all sounds similar to Ansible then you would not be far off the mark; it is heavily inspired by that design, hoping to learn from some of its mistakes while remaining an easy tool to deploy and wield.

At the end of the day this has been an interesting project that has spawned some interesting tech and served as the basis for a lot of the documentation on doger.io (and a reason to maintain that site for all these years). While I don't see it getting popular in any form, for the niche role I want it for (running a bunch of jobs on arbitrary machines with little setup) it appears to be a very, very nice fit.