Often when DevOps is discussed, we speak about bringing development practices into Operations teams – i.e. storing desired state in git, automating host provisioning, controlled release of new features, etc.
While our Operations team are working hard to make this a reality for Mountain Warehouse, the other side of the coin is that Development should take responsibility for the systems that they are building – not just build and “chuck over the wall” for Ops to worry about.
Where we were
A (fair) few years ago the IT team was small enough that everyone did a bit of everything – some support, some server administration, network administration, etc. As we’ve grown, we now have a dedicated support team to handle incoming requests, an Operations team worrying about the network and server setup, and several Development teams building our systems.
Because we came from a small-team background, one developer (often the team lead) quite often became the crutch that kept everything working for that team.
What didn’t work for us
We tried several different staff members in an “Application Support” role – the rest of the company or IT support would escalate issues to this role, and they were responsible for working out the severity. What we came to realise is that you need a really broad understanding of the company and its systems to do that first stage of investigation and triage. We also ended up with an extra layer that slowed down our response time, so that when team leads were cornered by other members of the business they often still had no idea there was an issue.
What works for us – rotating developers “on duty”
Every six weeks we pick someone from the team to be the first port of call for all queries that come in from support or the relevant business teams. They also keep an eye on the various monitoring systems that we have in place, check for messages in RabbitMQ error queues, investigate Sentry e-mails, etc.
Their job is to make sure that there is a timely response to each of these events – it’s up to them to do an initial investigation, see if there’s a quick fix and, if there isn’t, raise it for the rest of the team (or handle it themselves if they’re not too busy). They should also be able to prioritise how quickly we should attempt a fix and discuss “fix it twice” options so that the same category of problem won’t come up again.
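As a concrete illustration, the error-queue check lends itself to a small script against RabbitMQ’s management HTTP API. This is a minimal sketch, not our actual tooling – it assumes the management plugin on its default port (15672) and a hypothetical naming convention where error queues end in “.error”:

```python
import base64
import json
import urllib.request


def find_error_queues(queues, suffix=".error"):
    """Given the queue list from GET /api/queues, return the error queues
    that have messages waiting, as (name, message_count) pairs."""
    return [
        (q["name"], q["messages"])
        for q in queues
        if q["name"].endswith(suffix) and q.get("messages", 0) > 0
    ]


def fetch_queues(base_url="http://localhost:15672", user="guest", password="guest"):
    """Fetch all queues from the RabbitMQ management API (assumed credentials)."""
    req = urllib.request.Request(base_url + "/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Whoever is on duty runs this (or a scheduled job does) and chases
    # anything it prints.
    for name, count in find_error_queues(fetch_queues()):
        print(f"{name}: {count} message(s) waiting")
```

In practice we’d hang this off a scheduled job and alert rather than print, but the shape of the check is the same.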
The ultimate goal is for them to pass a system on to the next person that runs more smoothly than the one they received from the previous person.
What’s hard about it?
- From a regular developer’s perspective, where your work is usually fairly well defined day-to-day, being “on duty” can feel more stressful, especially the first time you’re on the rota. We’ve found this does reduce with practice.
- There can be a lot to hand over to the next person so there’s a bit of inefficiency when the rota changes.
- We can often end up fixing a problem only for a similar one to recur later. Everyone should be taking a fix-it-twice approach, but that involves more training and team lead involvement.
What works about it?
- When a developer has finished their time on duty, the feedback is that they understand their team’s systems much better – and that is reflected in their ability to build on them.
- Developers are more able to predict what will cause problems in production and we have a pretty low ticket failure rate because of that (i.e. when we release a ticket and the system degrades because of it).
- It allows the Lead Developers to spend more time working with the team to get new projects out the door – they’re no longer such a bottleneck or single point of call for that information.
- All developers get regular exposure to other members of the business.
- A lot of errors are spotted very quickly and fixes put out before anyone notices.