FarmShare+1

This is meant to be a design document for the next version of FarmShare. Hopefully we can fill it in with as much design guidance and implementation detail as possible.

One key factor that we want to use to drive the design is our users' needs. The system needs to be a certain way because it benefits the users, not because "this is how we do things" or "we've always done it this way".

Discoverability needs to be a higher priority than similarity to the previous system. Even if a user has used the system before and says, "I used to do X in Y way", it's not important to keep it working the same way if they can easily figure out how to do X the new way.

Many of the users are looking for newer software when they log in to FarmShare. So we need to track Ubuntu more closely, following Ubuntu's six-month release schedule.

Many of the users are looking for more performance than their laptop. As a baseline, we can take a 2015 MacBook Air or 13" MacBook Pro. So when a user logs in to FarmShare+1, they should have resources available that exceed their MacBook. With approximately 3000 unique users per month (but only a couple of hundred unique per day?), we need something on the order of 200x the MacBook power in our cluster. Ten new R630s would probably be enough to start (assuming we then add the 10 new Dells we already have). Interestingly, local disk is not a priority; I think the order is RAM first, then CPU.

If a new MacBook is about 2 cores and about 8GB, a 24-core, 128GB-RAM machine can support approximately 12 users (cores are the limiting resource; RAM alone would allow 16). So 10 such machines can support approximately 100 simultaneous users, even if each is using slightly more than a MacBook's worth of resources. At maybe $5k per standard Dell node, that's $50k.
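
To make that arithmetic explicit, here is a quick sketch; the per-machine figures are the assumptions from the paragraph above, not benchmarks:

    # Back-of-the-envelope check of the numbers above. The MacBook and
    # Dell-node figures are the assumptions stated in the text.
    MACBOOK_CORES, MACBOOK_RAM_GB = 2, 8
    NODE_CORES, NODE_RAM_GB = 24, 128
    NODE_COST_USD = 5000
    NODES = 10

    # Users per node is limited by whichever resource runs out first:
    # cores here (24/2 = 12), since RAM alone would allow 128/8 = 16.
    users_per_node = min(NODE_CORES // MACBOOK_CORES, NODE_RAM_GB // MACBOOK_RAM_GB)

    print(f"{users_per_node} users/node, {NODES * users_per_node} total "
          f"(~100 with headroom), ${NODES * NODE_COST_USD:,}")
    # -> 12 users/node, 120 total (~100 with headroom), $50,000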

If I had to rank the user requirements, I would probably order them as: software, resources, discoverability. They need software that's different from their laptop's (whether Free Software or proprietary), and/or resources bigger than their laptop's, and they also need to be able to figure out how to do that.

After the design meets the user needs, there are also admin needs to consider. Mainly we need to be able to easily build new machines, reboot the machines, manage the configuration of the machines, and also track performance metrics and collect logs for security. And we're actually really close there, with the main problem being the "technical debt" we have in the form of the oldest software parts of FarmShare: AFS, the Stanford Puppet modules, and the other AFS-based tools (Tripwire, log rotation, filter-syslog).

storage

The new system really hinges on not having AFS home directories, and on not having any system utilities or configuration depend on AFS.

It's important that the storage system not be the same system where user workloads run; that would not make for a stable system. So we need some additional storage hardware for the new home directories, in addition to keeping AFS available.

I think it's very reasonable to shoot for a default quota of 10GB per user. At 3k active users, that's only about 30TB of allocated storage, and in practice less, as not everyone will bump up against the quota limit. But we probably want to plan for something like 100 IOPS per user (if we're trying to equate to an old laptop), or really more like an SSD's worth of IOPS per user if we're trying to match a modern MacBook. So we're talking 30TB, but at say 0.3M IOPS? Unfortunately nothing cheap comes to mind for those requirements. If budget is the main issue, we have to give up the high-availability requirement.
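
Written out, with the per-user quota and IOPS targets being the assumptions stated above:

    # Storage demand implied by the figures above.
    USERS = 3000
    QUOTA_GB = 10          # proposed default quota per user
    IOPS_PER_USER = 100    # "old laptop" target; an SSD-class target is far higher

    capacity_tb = USERS * QUOTA_GB / 1000
    total_iops = USERS * IOPS_PER_USER
    print(f"~{capacity_tb:.0f}TB allocated, ~{total_iops / 1e6:.1f}M IOPS")
    # -> ~30TB allocated, ~0.3M IOPS: capacity is cheap, the IOPS
    #    target is what pushes the design toward SSD caching.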

Let's face it, we haven't had an HA fileserver for FarmShare in many years, and things have held up OK, with only a couple of downtimes. So a SuperMicro-based ZFS/NFS box with a few SSDs could probably match our 0.3M-IOPS (cached) and 30TB-usable requirement for about $20k, and without requiring a separate backend storage network.
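
For a sense of scale, here is one hypothetical way such a box could pencil out; every drive count, size, and IOPS figure below is an illustrative assumption, not a quote or a spec:

    # One hypothetical layout for the ZFS/NFS box.
    DRIVE_TB = 4                # nearline HDDs for capacity
    VDEV_WIDTH, PARITY = 8, 2   # RAIDZ2 vdevs: 6 data + 2 parity drives
    VDEVS = 2                   # 16 HDDs total
    SSD_IOPS, SSDS = 80000, 4   # SSDs for read cache / intent log

    usable_tb = VDEVS * (VDEV_WIDTH - PARITY) * DRIVE_TB * 0.8  # ~20% headroom
    cached_iops = SSDS * SSD_IOPS
    print(f"~{usable_tb:.0f}TB usable, ~{cached_iops / 1e6:.2f}M cached IOPS")
    # -> ~38TB usable, ~0.32M cached IOPS: in the ballpark of the targets.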

network

If we're getting new hardware, we can start all over with the network. Provision a whole new VLAN and IP range, and probably keep the central firewall (with a new vsys). The key is not new network hardware, just a new VLAN/IP range that fits all the FarmShare machines.

I think it's OK to put the IPMI controllers on the same VLAN as the public IPs; just give them shadow-net IPs. At some point there was a lot of concern about FarmShare users having network access to the machines' IPMI controllers, but I think that concern is overblown. If it is a requirement, then we need a separate management network, which costs more.

misc

Logging: throw out the entire old logging infrastructure. Start with the stock modern Ubuntu rsyslog config and just add a Splunk destination (in stock rsyslog that's a single forwarding rule pointed at a Splunk syslog input).

Logwatch and "root mail" (a.k.a. sysadmins keeping an eye on stuff): throw out all the filter-syslog / newsyslog / AFS stuff and just run logwatch or build some Splunk dashboards.

Tripwire: throw out the entire existing Tripwire infrastructure from 15 years ago and go with modern OSSEC, if necessary.

batch jobs: Maybe we need to move to SLURM. Should be fine. The only question is which execution hosts to use. Can we spec new ones?
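
If SLURM is the pick, the default allocation can be sized to the MacBook baseline above. A minimal sketch of a submission from Python (users would more likely write a plain batch script; the partition name and wrapped command here are hypothetical):

    # Sketch: submit a job sized to the "one MacBook" baseline.
    import subprocess

    subprocess.run(
        [
            "sbatch",
            "--partition=normal",        # hypothetical partition name
            "--cpus-per-task=2",         # the 2-core MacBook baseline
            "--mem=8G",                  # ...and its 8 GB of RAM
            "--time=02:00:00",
            "--wrap=python analyze.py",  # hypothetical user workload
        ],
        check=True,  # raise if sbatch rejects the job
    )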

configuration management: we can use the Stanford central Puppet infrastructure, but we'd have to start over from scratch without the existing Stanford modules in order to meet the logging/Tripwire/Splunk goals above, as most of them depend on AFS. This is where the challenge is; I'd estimate the build-system plus configuration-management effort at about half a person-year total.

package management

We need to have the regular upstream Ubuntu repos. Then we need an easy way to add additional repos and the packages from them; the Stanford repo is just an example. Then we need a way to add packages outside the OS, so that would be a separate /share/sw (or similar) tree, just like we have /farmshare/software now. Perhaps we standardize on Lmod + EasyBuild + fpm(?) for the latter.

  • http://blog.ajdecon.org/building-scientific-dev-environments-with-easybuild-and-module2pkg/
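
If we do go the EasyBuild route, each out-of-OS package is described by an easyconfig, which is itself Python syntax. A minimal hypothetical sketch (all names, versions, and URLs below are made up for illustration):

    # Hypothetical minimal EasyBuild easyconfig for a tool installed
    # outside the OS packages; every value here is illustrative.
    easyblock = 'ConfigureMake'   # standard ./configure && make build

    name = 'exampletool'
    version = '1.0'

    homepage = 'https://example.org/exampletool'
    description = "Example package built outside the OS package system"

    toolchain = {'name': 'dummy', 'version': 'dummy'}  # system compiler

    source_urls = ['https://example.org/downloads']
    sources = ['%(name)s-%(version)s.tar.gz']

    moduleclass = 'tools'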

Ruth's suggestions

  • re-branding: maybe we call it the SRCC "student environment" or something
  • hand off the cardinal systems to another group; leave those as-is
  • retire the ryes if we can just put a GPU into each new machine
  • is it still the case that all fully-sponsored SUNetIDs need access to this system, or should access be more limited?
  • can we get a faculty committee involved in scheduling/policy determinations?