DevOpsDays London 2018
DevOpsDays London was awesome! The talks were all really interesting and informative and there were some really great conversations I was lucky to be part of.
I've collected a Twitter Moment of the tweets I felt summed up the conference, and have embedded some of them in the below posts.
I was very, very impressed with the inclusivity aspect of the conference. As a privileged, well-abled, fairly social, mixed-but-visibly-white male, I'm incredibly fortunate as a person, and am not very often challenged by being at events like this.
I was really impressed with the signers who made the talks accessible to deaf attendees, as well as the use of live captioning, which helped those with hearing impairments and those who may not have English as their first language. In a couple of cases, it even helped someone who had been too busy tweeting to hear what was just said!
It was also pretty cool to have stickers on our name badges which would allow us to specify how we wanted to initiate conversations. The options were:
- "only speak to me if I approach you"
- "speak to me if I know you"
- "speak to me, even if I don't know you"
It was nice to see a range of the stickers in use at the conference, and although I'm pretty comfortable jumping into conversations with people, I'm aware that not everyone will be, and it was nice to see it tackled this way, in a very visually obvious manner.
Another visual cue was the use of different colour lanyards as a way to let attendees signal their preferences about being photographed. I liked having the option of "only if you ask my permission first", which not only helps cut down the risk of being caught yawning in the background of a photo, but also covers cases where you're just not comfortable having a photo taken at that time. I found it interesting that there were even some sponsors with this colour, proving that even those people you'd expect to want photos taken may not always!
I also found it good to see another visual cue for determining attendees' pronouns, to prevent you from assuming how they identify.
It was also really cool to see gender-neutral toilets in a number of places. Although I still stuck to what is traditionally "the men's room", I found it was good to have the option, as well as helping to make everyone more comfortable with the idea.
Since the conference, I've been much more aware of comments made by people, the use of pronouns, and inclusivity in general, which is a positive personal outcome that I've been really happy about. I've been looking to share these practices with my family, colleagues and friends, but there's still a good way to go!
The open spaces idea was something I'd not seen before at a conference - I'd first thought it was meant to be an unconference of more talk topics, but quickly found and understood it was all about giving attendees a chance to have a more structured discussion. I found this was great for topics that attendees wanted to crowdsource ideas or get a debate going. It also seemed like there were some participants who had a lot of knowledge they wanted to share, but maybe didn't have the confidence or the amount of content to fit a full conference talk, so the open spaces were a good space to utilise for knowledge-sharing.
I found it really interesting that the spaces were built around the idea of being very free to move around in, recommending that if you weren't getting enough out of the session, you moved on to a different space. With so many going on, it helped you not get "stuck" in a conversation you didn't find useful, as well as helping you not feel committed to a single session.
As a speaker, I was invited to have drinks and canapes on Wednesday night with the fellow speakers and organisers, which was a good way to meet the personalities behind the faces I'd be seeing on stage. As I'm always a bit socially awkward, it was a nice icebreaker and a good way to ease into some networking before the full conference.
Neurodiversity and the Essence of DevOps
Jeff spoke primarily about how we need to realise that empathy is important, even more so as we move towards DevOps transformations. Given the movement to DevOps is all about bridging the gap between development and operations teams, we need to hold empathy as a fundamental part of our organisation's design.
Jeff Sussna's talk was without a doubt one of the most inspiring talks I've heard in a while - DevOps as EmpathyOps made so much sense #devopsdays— Alex E (@quasi_quasar) September 20, 2018
“Don’t assume because people are been doing a thing for a long time, they can not do different things under the right conditions” by @jeffsussna #DevOpsDays #DevOpsDaysLDN @DevOpsDaysLDN pic.twitter.com/5dOMe2LbIJ— Andres Guisado (@andresguisado) September 20, 2018
The main take away was that we need to ensure that we seek wisdom in everyone - whether that's their experience, thoughtfulness, or something else of value they can contribute.
How to leverage AWS features to secure and centrally monitor your accounts
Kate from The Guardian spoke about how their teams manage their own AWS accounts. They operate a model where each team has their own dev, staging and production accounts which has meant that they've now got 44 accounts managed in a very much decentralised fashion. These 14 development teams' accounts are policed with help from a single Information Security team, which means it's not quite feasible that the InfoSec team can manage the accounts on their own.
How do you know if you are cross-functional, or barely cross functional?
Kate mentioned that teams have the build, deploy and support ownership, so they should also own the security of their accounts and the software that supports it. However, that requires extra expertise and responsibility that developers need to learn from the operations side, which is where tools like Trusted Advisor and Inspector come in.
AWS Trusted Advisor
Trusted Advisor is a free product which can help report against your AWS accounts and find areas of noncompliance such as S3 buckets being public.
It can also be used to flag open or overly permissive Security Group rules, warn about uneven Availability Zone distribution, buckets that are not versioned, or access keys that have been exposed, e.g. by accidentally pushing them to a public repository.
Inspector can get quite expensive, but it serves a different purpose, as an integrated AWS product for managing security scanning.
Kate warned that there can be a number of false positives, describing a time where they received some warnings about a number of unencrypted outbound connections to port 80. After digging into this, it turned out the connections were to the EC2 Metadata endpoints, which is definitely not a security hole!
Capital One Cloud Custodian
Although not mentioned in the talk, I'd like to add in here a quick note about Cloud Custodian, an Open Source product built by my overlords at Capital One US, which helps create rules and helps track down compliance across your AWS (and some Azure) resources. It helps us manage all of our UK and US AWS accounts, helping us hit both regulatory and internal compliance targets.
As a developer, I've found it really useful for keeping me on my toes and ensuring that our resources abide by internal compliance rules. I've especially been a fan of its ability to automagically remediate findings, which means that you don't personally need to do anything until e.g. your next deployment, which should include the remediation steps.
Death by Dashboards
Kate went on to speak about dashboards, and that "when you have too many, build one more", pointing out that dashboards that aren't immediately useful aren't useful.
Using the example of tracking down overly permissive Security Groups, Kate described how determining the same information in the AWS UI would require multiple steps. Creating a dashboard that presents all the required information in one place makes it much easier to see and action. This is especially true as, in this example, the VPC that the Security Group is in could be locked down, which may reduce the risk of the permissive access.
Kate mentioned that you need to make your dashboard provide specific, timely and actionable feedback.
In this example, it'd be more useful to find out more specifics, such as whether that Security Group is in use. For instance, if only a very low number of servers is affected, it can be deprioritised, whereas if it's something that affects your whole fleet, you should get on it sooner rather than later.
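To make the prioritisation idea a bit more concrete, here's a minimal sketch of how a dashboard could order its findings. It isn't from Kate's talk: the `SecurityGroupFinding` type, its fields and the example group IDs are all made up for illustration, and in practice the data would come from your AWS accounts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SecurityGroupFinding:
    """A hypothetical dashboard row for one overly permissive rule."""
    group_id: str            # illustrative identifier, not a real group
    open_port: int           # port that is open too widely
    attached_instances: int  # how many servers currently use the group


def prioritise(findings: List[SecurityGroupFinding]) -> List[SecurityGroupFinding]:
    """Put findings that affect the most servers first; unused groups sink to the bottom."""
    return sorted(findings, key=lambda f: f.attached_instances, reverse=True)


findings = [
    SecurityGroupFinding("sg-example-a", 22, 2),
    SecurityGroupFinding("sg-example-b", 3306, 0),
    SecurityGroupFinding("sg-example-c", 22, 150),
]

for finding in prioritise(findings):
    print(finding.group_id, finding.open_port, finding.attached_instances)
```

Sorting by the number of attached instances is only one possible signal. You might also want to weigh up whether the surrounding VPC is locked down.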
Making your feedback timely is a very difficult line to tread, making sure that you're not finding you've been breached a week after the fact, but also not overwhelming your users with constant data and alerting, resulting in alarm fatigue.
The data also needs to be actionable - if a user can't work out what they need to do with the information, it's a bit useless. Kate asked whether, with the data available, you can automagically recommend corrective actions or best practices that should be followed.
If you're building internal tooling, your users are in the same building and literally can't run away from you - make sure that you spend time working out what they want out of a dashboard. You need to build the tool they actually need, not the tool you think they need.
One of Kate's top tips was to get feedback from both the most junior and most senior person on the team, and build it for the most junior person, while retaining functionality for power users.
Love that @KateAWhalen highlighted that your most junior team members are an asset, not a liability, when it comes to software resilience: dashboards, alerts, etc should be designed to help *everyone*, and different context levels challenge what is “intuitive”! #devopsdays— Denise Yu (@deniseyu21) September 20, 2018
And, unfortunately, no one will read the docs! So try and make sure it's as user-friendly as possible, to the point where it shouldn't need any supporting documentation.
Kate mentioned that in the cases of rules within Security Groups, the most permissive is applied when there are duplicates of the same port, which is an interesting gotcha to keep in mind!
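As a rough illustration of this gotcha, the sketch below models ingress rules as plain `(port, CIDR)` pairs. It's a simplified model rather than the AWS API, and the rules and addresses are made up. Because Security Group rules only ever allow traffic, a connection gets through if any rule for that port matches, so the widest rule effectively wins.

```python
import ipaddress
from typing import List, Tuple

# (port, CIDR) pairs standing in for ingress rules; the values are made up.
Rule = Tuple[int, str]


def is_allowed(rules: List[Rule], port: int, source_ip: str) -> bool:
    """Return True if any rule for the port covers the source address."""
    address = ipaddress.ip_address(source_ip)
    return any(
        rule_port == port and address in ipaddress.ip_network(cidr)
        for rule_port, cidr in rules
    )


rules = [
    (22, "10.0.0.0/8"),  # intended: internal access only
    (22, "0.0.0.0/0"),   # duplicate port, open to the world
]

# An external address gets in, despite the narrower rule for the same port.
print(is_allowed(rules, 22, "203.0.113.7"))
```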
Another gotcha Kate mentioned was that when specifying permissions around Principal.AWS: *, you're actually allowing any AWS users access, not just users for your account!
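To show what that looks like in a policy document, here's a small sketch that flags `Allow` statements with a wildcard principal. The function name and the example bucket ARN are my own, and this is only a rough check: it ignores any `Condition` block, which may narrow the access in practice.

```python
from typing import Any, Dict, List


def wildcard_allow_statements(policy: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the Allow statements whose principal is a wildcard."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]

    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        # The principal may be the bare string "*" or a mapping such as {"AWS": "*"}.
        aws = principal.get("AWS") if isinstance(principal, dict) else principal
        values = aws if isinstance(aws, list) else [aws]
        if "*" in values:
            flagged.append(statement)
    return flagged


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "*"},  # anyone on AWS, not just your own account
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*",  # made-up bucket
        }
    ],
}

print(len(wildcard_allow_statements(policy)))
```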
Kate also reminded us that a system should only be able to access the information and resources it actually needs, instead of the convenience of allowing access "just in case it'll need it in the future".
Kate spoke about their use of AWS StackSets as a method to create/update/delete resources across accounts, which helps with InfoSec teams being able to push out cross-account requirements. She mentioned that it requires an overpowered administrator IAM role in each account, but after that it's fairly simple. She also noted that it's important to communicate changes to your accounts, with enough time to align to the new standards, especially if it will affect production services.
Security auditing is such an important aspect in the cloud, as @KateAWhalen is telling us. InSpec from @chef is one great way to manage the compliance of your instances, and @CapitalOneTech has built https://t.co/sa3ZArKV72 to fine-grain manage infrastructure #devopsdays— Jamie Tanna (@JamieTanna) September 20, 2018
As mentioned, using something like InSpec can also be great, which I wrote about in my blog post Notes from the AWS + Chef Dev Day Roadshow in London.
Euan spoke about all things Production, and how things can and will go wrong, and it's just something we have to deal with.
He spoke about how we should never be "comfortable" getting alerts. For instance, if you're receiving emails that your backup system has taken a backup, but are never validating whether the backup is around the right size, or that all the required files are uploaded, you may as well bin the alert!
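As a sketch of what validating a backup could look like, the function below checks the size against an expected value and confirms the required files were uploaded. It isn't from Euan's talk: the function name, the 20% tolerance and the file names are all assumptions for illustration.

```python
from typing import Iterable, List, Set


def backup_problems(
    size_bytes: int,
    expected_bytes: int,
    uploaded_files: Iterable[str],
    required_files: Set[str],
    tolerance: float = 0.2,  # assumed threshold, tune for your own backups
) -> List[str]:
    """Return a list of problems; an empty list means the backup looks sane."""
    problems = []
    lower = expected_bytes * (1 - tolerance)
    upper = expected_bytes * (1 + tolerance)
    if not lower <= size_bytes <= upper:
        problems.append(f"size {size_bytes} is outside the expected range")
    missing = sorted(required_files - set(uploaded_files))
    if missing:
        problems.append("missing files: " + ", ".join(missing))
    return problems


# Only alert when there is something to act on.
print(backup_problems(100, 1000, ["db.sql"], {"db.sql", "uploads.tar"}))
```

The point is that the alert fires when validation fails, rather than a "backup completed" email arriving every night regardless.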
Euan also spoke about planning for failure, and how you need to make sure that you know what you'll do when things break. Euan's advice was to write your runbooks as if you've been called at 2am and just want to know what you need to do to restore service and get back to sleep. You don't want a massive wall of text, you want something actionable that can be digested by a groggy brain and will tell you exactly what you need to do to restore service.
And as good as your existing runbooks are, what happens if the system that hosts your runbooks breaks? Is there a runbook for where the backup documentation platform is?
Once you have the documentation ready you need to follow the processes to the letter, and ensure the documentation is correct and that your team(s) are comfortable with steps required. Additionally, this needs to be done regularly! There's no point doing it once as your services will change over time, which means that so will the processes and documentation to recover them.
Euan also spoke about Disaster Recovery exercises, and that they're very important to test regularly, in full! He recounted a time where the Financial Times had an issue trying to fail over to their Disaster Recovery region, where their failover process required the live system to be available so it could replicate data - woops!
They have also had unplanned DR occurrences in production, which led them to find there were many services in Production that weren't able to work with the DR region. It took 5 days to recover back to live, after which their DR region then went down! Luckily this happened once they were live in their primary region again, but it just shows that you need to know what happens when enacting Disaster Recovery, and how you fail over, before it's too late! Planned Disaster Recovery is the only way to do it safely.
Euan advised having a central space to report changes and problems, such as a Slack channel for incidents, where incidents can then break-out into a separate Slack channel to talk in a focussed manner. This also helps reduce alert spam, and keep the conversation focussed and just for those who need to be there.
When things go wrong, Euan's advice is to take a deep breath - you need to calm yourself down so you're acting rationally and not panicking. It's not the end of the world (caveats available for safety-critical systems). The first thing to discover is what's the actual impact? Is space low on a dev server? Or is AWS down for all of their customers? Or could it just be a really slow hotel WiFi connection that a customer is viewing from?
Once you know the impact, check the basics first, like whether the instance(s) are actually up or that the load balancer is routing traffic to the instance(s).
Something very important is to not be afraid to ask for help - you may be checking things you're not comfortable with, and if it's late at night you may not be thinking straight. But even if it isn't, it can be hard to get your mind straight during an incident.
To prevent undue interruptions from other teams and people within the business, Euan recommended placing an Incident Manager in charge, who can field questions and keep you focussed on what needs to be done to restore service.
One great piece of advice he had was that "if you think you're overcommunicating, it's probably just the right amount". Keeping everyone updated on pretty much exactly what you're doing and thinking is really positive, and it'll help others keep calmer knowing you're looking into it. Euan also makes sure to stress that you can and should take breaks - tired people don't think good, and you don't want to cause more issues. Don't necessarily feel obligated to stick around all the time (caveats, again) because there may be others who can come on call to help you swap out for some rest.
"*Especially* during long-running and urgent production issues, it's important to ensure that your team take rest breaks to ensure they are performing at their best." (ish)@efinlay24 at #DevOpsDays LDN pic.twitter.com/AHx73vAjnf— Alex Yates (@_AlexYates_) September 20, 2018
And once you've restored service, give yourself a pat on the back. Well done for braving the storm, and get some well-deserved rest. If it was out of hours, you shouldn't rush back to work the next day, but instead make sure that you're in the best physical and mental state to resume your duties. Conducting a Learning Review (post-mortem) is a really beneficial process, ensuring that you can find out what did and did not work well from all those involved, and even if the Incident Report is not made public, it'll help your teams assess how to improve their actions for next time.
Euan highlighted a writeup from Travis CI in April, where they shared, in a digestible format, what the issue was, what they learned, and how they're fixing it. I'll also highlight GitLab.com's database incident in 2017 as another great example of a detailed post-mortem.
As an aside, this highlights that letting people have access to production from their machines can always be dangerous! You can put as many barriers in place, but at the end of the day, if you can get to production, it's likely you'll break it. We have a model where production access requires extra steps, to help make it much less likely that we'll be there, and that when we are, we're even more careful.
Bright Screens, Blue Days: Developing Self-Care Tech
Velvet's talk spoke about the implicit assumptions we make in our tech and how we should be considering our users more carefully. They used an example of a Mental Health application which allows the user to take notes, then determines from written content whether the user is in a "good" or "bad" period. However, Velvet mentioned that actually, the user could be grieving and therefore writing something that could be categorised as a "bad" period, but grieving is a healthy process, and therefore you're miscategorising them.
10 Practical Steps Towards Creating an Extraordinary Team
Unfortunately I didn't get any notes, but you can watch the talk on YouTube.
Taking the 3 Ways of DevOps on the Road
Unfortunately I didn't get any notes, but you can watch the talk on YouTube.
What is Cloud Native, and why should I care?
Unfortunately I didn't get any notes, but you can watch the talk on YouTube.
Bash is testing
Matt took us through some of the ways that you can use Bash as a means to test branch points. Although there are even testing frameworks around for Bash, such as bats, I'm still adamant that by the time you feel you need to reach for something like this, you should instead be nuking the Bash script from orbit and replacing it with something in a (cue flame war) real language.
Overengineering Your Personal Website - How I Learn Things Best
I spoke about this very website, and the "monstrosity" it is - I'll be writing a follow-up post about it in more depth, but in the mean time, check out the talk on YouTube.
DevOpsDays London was the largest group I've ever spoken to totalling around 400 people in the audience, but as I got on stage, the waves of nerves completely disappeared and I started speaking. I've done a number of talks before, so although I knew what I was getting myself into, I still had nerves. And not just because I don't like public speaking, but also because I found writing an Ignite Talk really hard! Having a set of auto-incrementing slides was a big change from my usual approach of taking time where I needed to, and adjusting pacing where I needed to, rather than having to keep on-point and on-slide very strictly. There were a couple of times where I was waiting for the slide to transition, indicating I'd missed some content because in run-throughs I was slightly over time on each slide. But I must've got the main pieces of information across, so that's the important bit.
I also made a massive faux pas and missed the original cut-off for the slides and transcript submission, as I'd misread the dates on the email. I was on holiday when a chase email came in, so I rushed the content the weekend I got home, which only gave me a few days until the conference itself. I then looked to refine the content down, which I found a really uncomfortable process because I made the mistake of thinking that the transcript needed to be word-for-word. The panic caused by this made it quite a painful few days leading up to the conference, and stupidly I didn't ask about it, which was a massive mistake, because when I mentioned it at the speaker dinner, I was told that it didn't need to be exact, just an approximation to help with signing and closed captioning.
It was still a great experience, and I'm really glad I was invited to attend and that my talk went down so well!
Fargate - why and when?
In the first open spaces session, we spoke about using Fargate as a completely managed service, allowing you to push Docker images as-and-when you want, without having to worry about the underlying node, removing another set of patching requirements for your fleet.
One participant spoke about how they have 9 Fargate Tasks which cost them a total of $1000/month. They noted that although it's a fair bit of money, as with most managed solutions you're paying for convenience and theoretically infinite scalability.
Working in a regulated environment, we at Capital One UK need to make sure that we have an exit strategy available (such as in the case where AWS ramps up prices suddenly), so I asked about their approach with Fargate. The response was that you could move back to something like Amazon ECS, but they would more likely go for managed Kubernetes, which has a standard API layer and reduces a lot of Cloud Vendor lock-in. One comment was that if you're using more than a couple of Fargate Tasks, it's worth starting to think about moving to Kubernetes. One rebuttal was that moving to Kubernetes would only give you the ability to manage secrets more easily.
One complaint about Fargate was the very specific naming requirements for IAM roles, which could make it difficult and would most likely require one AWS Account per environment.
There were questions around security:
- Is there the ability to SSH onto the Fargate node? No
- Is it possible to have privileged containers? Yes, but requires some configuration
- How do you actually security scan containers?
One question was around the monitoring of the Fargate host, and checking that it was still running okay. I mentioned, in the form of a counter-question, that you wouldn't do the same with a managed RDS cluster, as one of the perks of having a managed service is not needing to worry, and instead letting AWS work on the hard stuff.
DevOps in a regulated industry
We spoke around the various difficulties of working in a regulated environment, down to paper trail, extra hoops to jump through, and various other difficulties.
Something that shocked me was how the majority of people in the room mentioned that their companies allowed developers access to production servers. At Capital One UK, we practice the belief that a developer should only ever need SSH access in a disaster scenario - and even then, it wouldn't be them personally. In the case that a non-automated change needed to be enacted, the development team would create e.g. an emergency change with the steps that the implementer would need to go through, command-by-command. This would then get run by a trusted person who has break-glass ability to access the box in an audited fashion.
The idea is that we shouldn't get used to being able to SSH on and check log files, or interact with production data, even in a read-only capacity, as anything that we need to access should already be available to us in e.g. Amazon CloudWatch. This ensures that we'll always be able to self-serve (which fits in well with Damon's Ops-as-a-Service ideal), and will never need to touch Production directly. If we did need to break glass to access log files we don't usually have access to, we then have the conversation about why we don't usually have access, and determine if we need to set up log collection for them in the future.
One rebuttal was questioning if this is a bit of a step back, but as mentioned, this actually means we'll be more self-sufficient, and means it's much less likely for us to need to get onto a production box, ensuring we're less likely to be exposed to sensitive data. Obviously, there would be the chance to "break glass" and gain SSH access if the incident required it, for instance an error state that wasn't catered for.
Another conversation was about configuration drift if there's production access, to which I mentioned how Configuration Management tools such as Chef can provide you with a regular "check-in" to replace any manual drift with the expected configuration. One participant spoke about allowing production access to developers, but once the system detected a login, it would set itself to terminate as soon as the connection dropped, ensuring any changes they made would not persist for production traffic.
One participant mentioned that they see an issue around the culture, where developers are happy to pair and get stuck into new technology, but infrastructure engineers think "this is what I know, this is what I'm doing".
Another comment was around the difficulty of gaining sign-off for the Cloud where most services are on-premises. This fed into a conversation about having "approved services" for Cloud/On-Premises and the speed it takes to take advantage of managed solutions, which reduce developer+operations workload, but can be complicated for the regulators to understand or approve.
Talking about various compliance targets and hitting the regulators' checklists, I mentioned Cloud Custodian. I didn't mention it, but also thought about how using a platform such as Chef Automate/InSpec for compliance monitoring is also a good idea, which I covered briefly in my blog post Notes from the AWS + Chef Dev Day Roadshow in London.
Speaking about compliance, one comment was about how in the scheme of things (for a small business), the chance of having compliance issues is actually smaller than that of your company folding.
Finally we spoke around having common architectures, such as having consistent testing strategies, infrastructure architectures and use of services, which then makes internal design reviews much more of a box-ticking exercise. This makes it easier for engineers to get stuff done, and easier for compliance and audit sign-off.
What's wrong with a good monolith?
Coming from working on a monolithic Identity Service (due to use of a commercial off-the-shelf solution), I was interested to find why anyone else would want to use a monolith themselves.
Monoliths are a much easier route to a Minimum Viable Product, and if you're looking to hit market quicker, it's a much quicker route. Once you have some business value to split down to microservices, such as the maintenance burden or scalability of that monolith, then you can tackle splitting them. Starting with microservices can be easier if you've used Domain Driven Design, but will still slow you down as instead of having to deploy and scale a single service, you now have n number of services to manage. It was pointed out that it's much better to have a monolith than a number of badly designed microservices!
Robert, one of the participants, has also written up this conversation in Well Architected Monoliths are Okay, with the key idea being to build your monolith in an event-driven style, which replaces internal function calls with an out-of-memory event-queue to provide clean domain boundaries by design. This means that while the codebase is monolithic, the application is built in a non-monolithic fashion, with clear separation between components and the ability to refactor them out nicely. I really like this pattern, and agree that until scaling starts to be an issue, it would make more sense than lots of microservices, which then require different development + deployment practices.
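To give a feel for the pattern, here's a minimal sketch of two components in one codebase that only communicate through events. It's my own illustration rather than Robert's design: the `EventBus` class, the topic name and the component functions are hypothetical, and I've used an in-process queue as a stand-in for the external event queue described above.

```python
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque, Dict, List, Tuple

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """A simple queue; components talk through events, not direct calls."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self._queue: Deque[Tuple[str, Event]] = deque()

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        self._queue.append((topic, event))

    def drain(self) -> None:
        """Deliver queued events until none are left."""
        while self._queue:
            topic, event = self._queue.popleft()
            for handler in self._handlers[topic]:
                handler(event)


bus = EventBus()
sent_emails: List[str] = []


# "Orders" component: knows nothing about email.
def place_order(order_id: str) -> None:
    bus.publish("order.placed", {"order_id": order_id})


# "Notifications" component: only knows the shape of the event.
def send_confirmation(event: Event) -> None:
    sent_emails.append(f"confirmation for {event['order_id']}")


bus.subscribe("order.placed", send_confirmation)
place_order("order-1")
bus.drain()
print(sent_emails)
```

Because neither component imports the other, moving one of them out into its own service later should mostly be a case of swapping the bus for a real queue.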
And like with all services, creating a good monolith requires a good app design, as it would with a (micro)services-oriented architecture.
As I felt I'd gotten all I wanted out of the Monolith discussion, I left and went to this open space - but as I was a little late I had missed a fair bit.
There were a few discussions around how to manage the monitoring aspect, especially in terms of measuring how often cold starts occur and what the impact is, which is one of the larger problems with serverless.
Both an architectural and cost-saving choice was to make everything asynchronous where possible - instead of blocking while FunctionA waits for FunctionB to respond, let FunctionC handle the response to the original request.
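As a sketch of that hand-off, the example below uses two in-memory queues as stand-ins for whatever queue or topic service you'd actually use. The function bodies and payloads are made up, and in a real deployment the platform would trigger each function rather than them being called by hand.

```python
from collections import deque
from typing import Deque, Dict, List

requests: Deque[Dict[str, str]] = deque()   # stands in for a request queue or topic
responses: Deque[Dict[str, str]] = deque()  # stands in for a response queue or topic
handled: List[str] = []


def function_a(payload: str) -> str:
    """Send the request and return immediately instead of waiting for B."""
    requests.append({"payload": payload})
    return "accepted"


def function_b() -> None:
    """Process one request and publish the result for someone else to pick up."""
    request = requests.popleft()
    responses.append({"result": request["payload"].upper()})


def function_c() -> None:
    """Handle the response that function_a never waited for."""
    response = responses.popleft()
    handled.append(response["result"])


status = function_a("resize image")
function_b()  # triggered by the request queue in a real deployment
function_c()  # triggered by the response queue in a real deployment
print(status, handled)
```

The cost saving comes from `function_a` finishing straight away, so you're not paying for it to sit idle while the work happens elsewhere.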
Another cost-saving opportunity is the removal of idle time wherever possible - including on your build agents. Instead of having an always-on agent burning a hole in your pocket with very little build usage (especially out-of-hours) you can instead have them spun up on-demand, e.g. with AWS CodeBuild.
Tracing can be helped through the use of correlation IDs that are sent between requests and responses, which can help you follow user journeys through the whole stack.
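Here's a minimal sketch of how a correlation ID could be passed along, assuming it travels in a request header. The header name `X-Correlation-ID`, the two services and the log format are all assumptions for illustration, not something prescribed in the discussion.

```python
import uuid
from typing import Dict, List

HEADER = "X-Correlation-ID"  # assumed header name
log_lines: List[str] = []


def with_correlation_id(headers: Dict[str, str]) -> Dict[str, str]:
    """Reuse the caller's correlation ID, or mint one at the edge."""
    outgoing = dict(headers)
    outgoing.setdefault(HEADER, str(uuid.uuid4()))
    return outgoing


def log(service: str, headers: Dict[str, str], message: str) -> None:
    """Prefix every log line with the ID so the lines can be joined up later."""
    log_lines.append(f"{headers[HEADER]} {service}: {message}")


def backend(headers: Dict[str, str]) -> None:
    headers = with_correlation_id(headers)
    log("backend", headers, "looked up user")


def frontend(headers: Dict[str, str]) -> None:
    headers = with_correlation_id(headers)
    log("frontend", headers, "received request")
    backend(headers)  # pass the same ID downstream


frontend({})
print("\n".join(log_lines))
```

Searching your logs for a single ID should then give you the whole journey across both services.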
The final advice was to look at using something like the Serverless Framework to help simplify your functions, and to make it easier to work in a more cloud-agnostic and consistent manner, as well as exposing other tools for easier deployments.
One thing I'm looking to learn more about is the QA side of serverless - how do you test the integration points as well as the component-level functions themselves?
Who Broke Prod? Growing a Culture of Blameless Failure
Emma described coming into the office on a Monday after tackling a high severity incident over the weekend and being asked "Who broke prod?". Although it's a seemingly innocuous question which may have had the best intentions, it still aims to put blame on someone. Emma described how she'd not once thought about who was to blame; her only concern was to restore service and then get back to her daughter's birthday party!
Emma's key takeaway point was that failure is inevitable - we just have to embrace that we're fallible creatures and that we make mistakes; the sooner we can accept this, the better! She spoke about how the business as a whole needs to be comfortable with the "brutal transparency" of failure. She spoke about how you should cover your walls with screens displaying the state of your systems, with the inevitable greens and reds to show whether things are in a good state or, more likely, not. Most interestingly, she described how this shouldn't be limited to just your engineering teams, but should also e.g. hold your sales team to account for not making their targets, and share the client feedback loop, as well as any broken builds/deploys/production services.
By making the state of your business visible, you'll not only make people more comfortable with the fact that things will go wrong, but you'll also be able to start to "know your normal". This is a practice we feel quite heavily in my team at Capital One, where after standup every day, we check the monitoring dashboards to determine how the Identity Service is performing. Because we do it every day, we're now comfortable with what "normal" looks like, and in the cases that we see some weirdness in our graphs, we can then start to look at what we can do to remediate it, if needed.
Emma noted that learnings that fed into DevOps from lean and agile came from the manufacturing business, where there is a level of predictability and safety that can be guaranteed. However, in software, we just don't have those assurances and therefore have to embrace the fallibility of us as humans.
Emma reminded us that failure is drilled into us at a biological level to prevent us from taking those (potentially life-threatening) risks again. Because of this, we need to be able to respond to the negative feedback better and stop turning into an armadillo of emotions. When receiving feedback, Emma recommends saying "thank you", and unpacking the emotional response afterwards, where you have more time to digest and react to it.
Emma mentioned that we should have "improvement katas" which help us build up our resilience, as we need to be able to accept and re-live our failures so we can learn from them. One question we should be asking is "how can I react better next time?". Emma's response to asking this question each time is to write her answer on a post-it note and stick it on her monitor. That means the next time there's an issue, she'll have the reminder of what to do better right in front of her.
One recommendation is to try and have a second pair of eyes if at all possible, just like when you're writing the code. Having a second person when debugging systems allows for you to formulate your train of thought out loud which can help you realise when you're not quite making sense, as well as having the other person calling out when you're going down the wrong route. And even if you can't have someone to look over your shoulder, there's always Rubber Duck Debugging.
While incidents are ongoing, Emma recommended using a single channel for overall incident updates, which makes the learning review easier, as all the conversations will be in one place. It's especially important to make sure you don't have e.g. a #managers-only channel where everyone is actually speaking their mind. A side channel like that goes against the level of transparency which can make or break you when in crisis mode; conversations need to be public and open, and need the necessary details from everyone involved to help resolve the incident.
Emma also talked about the renaming of a "post-mortem" to a "learning review", a subtle rename that can make a big difference to those attending who may be feeling like it's their fault, especially with the term "post-mortem" stemming from the investigation to find the cause/blame for a person's death. But she also mentioned that if you're trying to tackle blame at this point, then you're doing it far too late, because blame happens much earlier.
One of the biggest issues with a blame culture is becoming defensive. People who aren't comfortable taking the blame won't necessarily share all the details when they're diagnosing, which can lead to missing important details to what could have gone wrong. You need to make sure you have brutal transparency around the state of the system, ensuring that every decision and every action is shared, so everyone is aware of what steps are being taken. Don't worry or care about how others are looking at your debugging method!
And finally, organisations need to stop punishing people for experimenting with new features or trying new processes. If all they're going to do is then blame them when things go wrong, they'll decide to play safer and as an organisation we'll suffer from stifled innovation. Alternatively, you may just start losing staff, as they're not feeling supported or able to attempt to make changes.
Using the term "we" instead of "I", "you", "them" can also help build trust, and shows a level of care - by actively changing the way we communicate, it shows that "we're all in this together".
Managing people and other horror stories
Code is easy, people are hard
If there's one thing I took away from Ramez' talk, it was that the manager is Batman's butler Alfred, who does a tonne of work behind the scenes to prepare and support Batman for whatever missions he's undertaking. The manager is not separate from the team; they're a fully embedded colleague with the same stake in the mission and goals as those who report into them.
A manager's role is not only to plan the work on behalf of the team, but also to be on the hook for the delivery of those projects. Whereas an individual contributor is looking to deliver piece(s) of functionality, the manager is responsible for the overall delivery of that component. That means that the manager's goals are not just to deliver everything that their team delivers, but also to meet their own personal goals, because accountability for team delivery is their expected day job!
Ramez stressed the fact that management is a different role to being an individual contributor - and that if you don't understand that, you're doomed. Thinking that moving to a manager role is the "same again" is a mistake; instead of having responsibility over functionality, you have responsibility for people.
Ramez shared a few commandments of management that he's picked up:
- Thou shalt manage a Team: A manager needs to nurture the team and to build and maintain their spirit to help them excel and keep them going
- Thou shalt give them a Reason to exist: As Ramez mentions, "a team is a collection of individuals guided by a common purpose". Without a mission statement for the team or a vision of what you're going to achieve, you can never push forwards and truly grow.
- Thou shalt Serve your team: The term "servant leadership" applies perfectly here, where the manager is there to make sure the team work well, are delivering on target, are happy and are following the career trajectory that they want. A manager has no results without them; "only when the team is awesome, are you awesome".
Ramez touched upon a time where his manager wasn't noticing issues in the team, nor proactively tackling the team drag, which led to the team underperforming.
And leading into his Alfred point, Ramez described how tennis as a sport is exciting because it's a show. It's all about keeping the game tense, always wondering whose point is next, and who will win. But the most important people on the court aren't the players, but the ball collectors - without them, there would be lots of uninteresting waiting for players to collect the balls. Managers are the ball collectors - they help keep that excitement and rush towards the common goal.
Why are Distributed Systems so hard? A network partition survival guide
Denise's talk was a really enjoyable experience, teaching us all about how to manage networking issues when using distributed systems. To top this great talk off, it was mostly narrated through cat drawings, so I'd heavily recommend a watch.
Denise spoke about how with the software we write, it's okay to build a monolith. Scaling a monolithic application is possible by throwing more compute at it. But with a database, you can only do so much with more hardware. Larger datasets across many tables and usages need to be optimised for query speed as well as special considerations for backup/restoration and replication.
But if you decide to build it with a distributed model, each team/application can tune their database more easily and selectively.
Denise briefly touched upon "The 8 Fallacies of Distributed Computing", but most importantly the fact that the network is unreliable.
Denise then led into the CAP theorem and the common recommendation to "choose two". But Denise disputed that, sharing with us how it's impossible to choose just any two - we can't sacrifice partition tolerance, as nothing distributed will be immune. If your system never has to tolerate a partition in the network, it's inherently not distributed.
We learned a little about the differences between the options (and trade-offs associated with them), and how different distributed systems manage their options.
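To make the trade-off a little more concrete for myself, here's a minimal sketch (mine, not from Denise's talk) of how a single replica might behave during a partition, depending on whether it favours consistency or availability. The `Replica` class, its `mode` values and the `Unavailable` exception are all hypothetical names for illustration; real systems are far more nuanced than this.

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """Raised when a consistency-favouring replica refuses to answer."""


@dataclass
class Replica:
    mode: str  # "CP" favours consistency, "AP" favours availability
    data: dict = field(default_factory=dict)
    partitioned: bool = False  # True when we can't reach the other replicas

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # Can't confirm the write with peers, so refuse rather than diverge
            raise Unavailable("cannot reach the other replicas during a partition")
        # AP (or a healthy network): accept the write locally
        self.data[key] = value

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot guarantee the latest value during a partition")
        # AP: answer with what we have, which may be stale
        return self.data.get(key)
```

Either way the partition still happens; the only choice is whether the replica stops answering or risks answering with stale data.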
Denise reminded us that software and hardware failure is inevitable - networks will have blips, a switch or router between the nodes may get accidentally unplugged, or a "noisy neighbour" may steal your resources. This is all unavoidable, so you need to decide what makes the most sense operationally.
Tickets and Silos Ruin Everything
Damon took us on an entertaining journey from what looked like a simple production change to a huge ticket queue and many hours of person time to get it resolved. I'd thoroughly recommend watching it!
With "conventional wisdom" as Damon mentioned, the result from a painful change like this would be "we need better tools" or "we need more people", or maybe even that there needs to be "more discipline and attention to detail" or, worse, "maybe we need more change reviews/approvals".
He additionally spoke about how we spend a lot of time making sure the software itself is good but don't actually invest in supporting all the steps that happen after the deployment.
Damon went on to talk about the four horsemen of the Operations apocalypse; silos, ticket queues, toil and low trust.
Silos are a way of working that means everyone has their own section of knowledge, and in order to get anything done, you'll always need something from someone else. It encourages disconnect due to different context and processes between teams and people, and the mismatch of quality of work can result in lots of rework.
Ticket queues are used as a way to manage silos, but instead make the process much more difficult. Because there is quite a disconnect between the two parties, there is lower motivation to help someone out and, as Jeff would put it, there's reduced empathy. It's also a much more expensive process: cycle times are longer, even in an Agile environment, there's increased risk due to the variation in quality, and there's the overhead of managing the ticket queues themselves.
Damon went on to discuss "toil", which is all the manual, repetitive, automatable, "devoid of enduring value" work that we have to do, such as rehydrating instances with security updates or pushing bug fix updates. However, when we're actually engineering software, it's the complete opposite, where we have to apply creative thinking, can't easily automate our jobs, and are delivering value with everything we do. Damon mentioned that toil can be tackled by tracking the amount of time spent on it, setting limits on the amount that a team can spend on it, and investing resources in efforts to reduce it, prioritising teams who are over their toil limits.
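As a rough illustration of the "track it, limit it, prioritise the teams over the limit" idea, here's a small sketch. The function name, the team names and the 50% default limit are my own assumptions for the example, not figures from Damon's talk.

```python
def teams_over_toil_limit(hours_by_team, limit=0.5):
    """Return {team: toil_fraction} for teams above the limit, worst first.

    hours_by_team maps a team name to (toil_hours, total_hours) for a period.
    The 0.5 default limit is an arbitrary illustrative value.
    """
    over = {}
    for team, (toil_hours, total_hours) in hours_by_team.items():
        if total_hours <= 0:
            continue  # no recorded time, nothing to compare
        fraction = toil_hours / total_hours
        if fraction > limit:
            over[team] = fraction
    # Worst offenders first, so they can be prioritised for investment
    return dict(sorted(over.items(), key=lambda item: item[1], reverse=True))


# Made-up numbers for a 40-hour week
week = {"payments": (30, 40), "identity": (10, 40), "search": (24, 40)}
priorities = teams_over_toil_limit(week)
```

With those made-up numbers, `priorities` lists "payments" first and "search" second, while "identity" stays under the limit and is left out.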
Finally, low trust can hurt incident management and response times. Escalating decisions away from the people with the context is never the right choice, as it makes an arbitrary choice much more difficult. We need those closest to the decisions to make them.
He mentioned that we can remove our silos by creating cross-functional teams. He spoke about two different models, starting with Netflix's engineering ideas, where the teams are empowered and there is no dev and ops split, which means that everyone is on call for their work. He then compared this to Google where there is very much a split, with teams of Site Reliability Engineers separate to engineering teams. The job of these SREs is to ensure applications run smoothly, requiring strict handoff documentation and enforcing the concept of "error budget" which results in the engineering team having to pick up the application again. SRE as a concept is quite interesting coming from someone in the first category, and I'm definitely going to be looking into it further. Damon noted that these organisations both iterate at high-quality and high-velocity scales, but have very different responsibility models.
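Since "error budget" was new to me, here's my understanding of the arithmetic as a small sketch: the budget is the amount of unreliability your availability target leaves you over a given window. The function names, the 30-day window and the example figures are my own assumptions, not something Damon presented.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of downtime allowed by an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Positive means budget is left; negative means it has been overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# A 99.9% target over 30 days leaves roughly 43 minutes of budget
budget = error_budget_minutes(0.999)
overspent = budget_remaining(0.999, downtime_minutes=50) < 0
```

In the model Damon described, overspending the budget is what results in the application going back to the engineering team.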
Damon described an Ops-as-a-Service guide design, with the ability to self-serve your Operations processes. For instance, you should build platforms that allow pull-based workflows, that can be consumed as needed in either responsibility model. This still sits in the remit of the delivery team, as they're closest to the code and the platform.
Andrew Bean - Five hops to DevOps – Changing the culture around software deployment in the public sector
Andrew spoke about the work he's done to start the DevOps transformation at The British Geological Survey.
Starting with adding in some Automated Build + Deployment capabilities and pushing through to Infrastructure as Code and Microservice architecture, Andrew has started to move the organisation to a better place. As you'd expect, it's easier to start with DevOps in mind, rather than retrofitting it, but Andrew mentioned he'd had some positive first steps.
Being "the DevOps person" in the organisation has been tough for Andrew, who explained that "DevOps is not one person's role": it requires too much time and is "too big". Having champions across the business helps, as it increases buy-in, as well as providing shared responsibility to enable more of the team to have ownership and drive to deliver their DevOps transformation.
Noting that DevOps is "different things to different people", Andrew said that it doesn't matter, and shouldn't be a barrier to getting started because you don't feel like you'll "do DevOps right". Although there's some overhead, as long as it improves the metrics you care about, it's a success!
Louise Paling - Agile Software Development: A Lego Star Wars Story
Although I've seen Louise's talk before, I'm still bowled over by what an absolutely awesome analogy for Agile it is - I'll say nothing other than watch it yourself.
Philipp Krenn - Building Distributed Systems in Distributed Teams
Philipp from Elastic had the recommendation to design the delivery of your distributed systems in a distributed manner. The most important thing, though, is to shape it around the values of making it work for you all.
Working in a distributed and/or remote fashion is also an important cultural approach to follow, noting that your employees are adults and should be trusted as such. Managers shouldn't need to know what you're doing right there and then, and your work should be flexible around everything in your life. You need to do the right thing for you but also for the company, which I feel was put quite well by my manager after my Ruptured Appendix last year, where he told me not to come back to work until I was fully healed and ready for it. If I came back too early, I may need more time off if it got too much, resulting in a worse experience for the team, management, but most importantly - my health. But if I waited a little longer and made sure that I was really fully ready mentally and physically, it'd be better for everyone in the long run.
But distributed teams don't always work, with two main problems. The first is timezones, which Elastic has a great way of managing. Instead of always having calls at e.g. UK time, they rotate meetings around timezones, meaning that it'll be at a friendly time for everyone at least once in a rotation, which helps "make it bearable for everyone".
Next is the issue around communications failure - working asynchronously can leave a lot of communication over written mediums, which can be difficult with language barriers, misinterpretations and incorrect assumptions on meaning.
One interesting metric to keep an eye on, Philipp mentioned, is the number of new staff joining compared to the number leaving. In Elastic's case, 130 employees joined compared to 9 leaving; "we must be doing something alright".
Richard Marshall - Life with Kube
Richard took us on his journey of using Kubernetes through the medium of Emoji, which was a hilarious and well-planned talk. As someone not really in-the-know with Kubernetes, I couldn't relate much to the talk, but found it interesting and engaging nonetheless!
"I'm a junior, where should I start with DevOps?"
This was a great session to hear about the experiences of juniors getting into the business from a group of ~50 juniors as well as a number of non-juniors and their approach for how they help prepare their new team members. It was especially interesting with a new set of grads joining us only the week before, so I was interested to hear what other new-ish juniors were thinking and how they'd like to get started.
I started off the space by prompting with my "learn by doing" attitude that I'd spoken about in my Ignite talk. I found it most useful to get stuck in, as well as trying the "good" and "bad" approaches to a problem, but others may not. As echoed by a participant, trying this all out in the safety of your own projects is best, as your company may not yet trust you to play around with things. One suggestion for companies is to have some sort of sandbox account(s) which allow you to either click a "revert to factory settings" button, or allow the user to horribly break things but then try and fix it themselves.
Pairing was called out as a great way to impart knowledge, especially as it forces the person driving to share their thoughts out loud for the navigator.
A few participants mentioned that we should be careful not to alienate our juniors by describing terminology with further terminology. This is something I've been trying to keep in mind when talking to everyone, not just juniors, which I've written about in my article Context is key: thinking about your audience.
With so much in the tech stack now, where would you even start?
There were several answers from the group:
- Take a given commit - how does that get to production? What tools and processes are needed? What are all the components that need to be in place for that to get to production? How do each of them get to a place where they're prod-ready?
- What is the best for the user/business to learn about? E.g. your paying customers won't care how your brochure site is built, but would care about the core application you develop for them
- Focus on the most high-touch areas of the codebase by looking through Merge Requests and finding out what's happening most often. Is there a lot of code change? Configuration for e.g. Chef, Puppet? Terraform, CloudFormation, etc? Focusing there will give you a quicker head start
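If trawling through Merge Requests by hand sounds daunting, one rough stand-in (my own suggestion, not something raised in the session) is to count which top-level areas of the repository change most often. This sketch assumes you've captured the output of `git log --name-only --pretty=format:`, which lists the files touched by each commit; the function name is hypothetical.

```python
from collections import Counter


def high_touch_areas(git_log_output, top=5):
    """Count changed files per top-level directory, most-touched first."""
    counts = Counter()
    for line in git_log_output.splitlines():
        path = line.strip()
        if not path:
            continue  # blank lines separate commits
        # Group by top-level directory; files in the repo root count as themselves
        area = path.split("/", 1)[0]
        counts[area] += 1
    return counts.most_common(top)


# A tiny made-up log covering three commits
sample = "src/app.py\nsrc/db.py\n\nterraform/main.tf\n\nsrc/app.py\nREADME.md\n"
busiest = high_touch_areas(sample, top=1)
```

For the made-up log above, `busiest` is `[("src", 3)]`, which would suggest starting with the application code rather than the Terraform.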
One comment was to remember that Configuration Management is (in a number of places) being replaced by containers which have their config pre-baked. Depending on your tech stack, it could be that picking up something like Chef may not be as helpful if you're moving away from it. Also, if your team just consumes cookbooks, instead of writing them, it may not add too much extra to start with learning how it all works.
In terms of what learning resources should be available, there were a few options from the group:
- Why not have a reading list ready to start off with, i.e. a list of bookmarks or an RSS feed?
- Stephen Mann's blog has some great reading on Web Apps
- I've found Command-Line Murder Mystery to be good for getting started with Command Line tooling and learning some of the basics of shell scripting in a fun, project-based way
- Drawing diagrams that map the hardware to the cloud layout
- If working on a single application, start with the architecture for the application, then start growing outwards, filling in the gaps with everything else?
- One participant mentioned that they learn best in visual mediums, and that it can definitely help with thinking about all the infrastructure. And it's much easier to visualise with diagrams than a big wall of text!
- Give them small tasks that give them shorter cycles to positive feedback
- If there's lots of terminology, can you create a glossary for new starters? Or is it actually better if they create it themselves, as they go?
- Learn Chef Rally has a tonne of resources on DevOps in general that are recommended (even for someone who's supposedly already "doing DevOps")
- Learning systems such as Linux Academy or ACloudGuru can be useful, as well as internal learning systems that can share the context for internal usage models
Talking about complexity, one comment was around a tool like Terraform not being the barrier to entry, but instead the underlying AWS knowledge. This links in with terminology, and that you'll go from one unknown phrase to another.
There's loads of resources for developers getting into the operations side, but not vice versa - what can be done to help that?
Every company has its own way of using e.g. AWS, which makes it hard to go for AWS accreditation and then go into a company, when they'll be using it in ways that aren't standard.
One positive reminder from a participant to all juniors who are questioning their learning ability is:
Instead of thinking about what you've not learnt, instead think about what you've just learnt!
And finally, remember that you're not going to know everything, and it'll take time to start feeling like you're really knowledgeable. We all suffer from Imposter Syndrome, but you can ground yourself by reminding yourself about all the stuff you've learnt up until now.
The role of universities
There was a comment around how universities don't really prepare you for working in a DevOps environment, and a discussion around the role of universities and whether they even should be teaching this.
I rebutted by saying that realistically, I learned more of the job skills at hackathons, where I would be playing around with building systems and deploying them somewhere so we'd be able to integrate with e.g. sponsor APIs and webhooks.
Regarding a comment about university group projects being good for teamwork skills, I gave my opinion on how actually the group work modules were very unrepresentative of working life. In university, some students are just trying to get a passing grade, whereas when you're working everyone is wanting to deliver, and they're mostly working on things they're excited by. You're all aligned to a common goal, so you're more likely to want to excel at what you're working on.
Guest lectures were mentioned as a route to get more students gaining industry insight, as it'll help engage students by giving them persons in the tech industry who would be able to share "best practices". However, again, I felt that I, for example, didn't actually "get Agile" until I was practicing it. It was all well and good having workshops and people speaking about it, but until I'd done it in a real environment it didn't help.
I've expanded on my thoughts about University preparing you for work (in general, not specifically around DevOps) in My Path from School to University to Work, which I'd recommend reading.
Another comment on this was again reminding that DevOps is a cultural change, and is about the breaking down of silos to encourage better collaboration, which means that it's more the people skills than the technical ones.
One participant mentioned that they were being taught Pascal at University, but by the time they came to market, the jobs weren't Pascal. That being said, the programming languages you use are a "conduit to programming", as one participant nicely put it - it's not that you can only ever write in that language. Although I didn't echo this at the time, I definitely feel like the Java I wrote at University was very different to the code I write in industry and this would likely be true for the DevOps practices, too.
Where is the QA in DevOps?
If it works, it's quality
DevOps as a cultural shift is all about shared responsibilities across development and operations, but at the same time it should also include quality ownership. Automation should, in a number of places, make it possible for quality gates to be achieved in an automated way, to allow fast delivery and help make it even easier for a commit being pushed to hit production.
One participant shared that at the Financial Times, there are no longer dedicated Quality Engineering/Assurance teams, but there is an embedded Software Engineer in Test who acts in more of a testing enablement role. They will help the team think in the testing mindset and how to write their tests which gives the team full ownership of their quality. This then allows them to perform more exploratory testing and to look to the more difficult problems such as the integration between multiple components.
One participant also mentioned that their approach to team dynamics is to have the developers write the unit tests, and then the developer + quality engineer would pair on creating the component-level functional acceptance tests. "Test packs" that verify critical behaviour are also created to help check that regressions are avoided.
Not wanting to get "left behind" or become "obsolete", one participant raised the struggle of sustaining a "learning culture" when it is difficult to find the time to learn new skills, let alone get the existing work done. I've heard it said before that developers are always busy, but are still able to spend time learning new techniques and tooling, whereas quality engineers don't seem to find that time. Is it maybe that developers are used to jumping on new frameworks every other day? Or could it just be that developers have the drive to try something new - if they e.g. wanted to pick up Scala, maybe they'd just spend some time building a project in Scala. There wasn't really a group answer to the question, more that it was something to reflect on. What was decided was that regardless of the role that is looking to self-improve, the company should be paying for it as a way to invest in their staff, and to make sure they're being the best they can be!
We also spoke a little about quality gates and how we approach them:
- One participant mentioned that they have manual grading from their QA team that then defines whether they're ready or not
- One participant spoke about having a release environment which mirrored production, aside from Release Candidates for the newest component(s) that were looking to deploy together. Once ready for release, the Release Candidates will be made full release versions, and they'll be deployed to production.
- At Capital One, we have a number of quality gates - the team own the component-level quality gates, often unit + integration tests, monitoring for drops in code coverage, and functional acceptance and performance tests while testing in a development environment where all nearest neighbours are stubbed. Once "ready for release", we'll bake a release version, and deploy it into our integrated PreProd environment with Production-like controls and integration with everything as it would be in Production. This then gets "system tested" to ensure that new artefacts aren't bringing in any regressions, after which they can be deployed into production. However, we don't purely mirror production, as there can be a little time delay between PreProd and Prod.
- And another mentioned that "if it works, it's quality", with the idea that if a commit passes its automated testing in the automated build pipeline, then it should be good enough to go to production, placing higher responsibility on automation
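To show what I mean by a chain of automated quality gates, here's a minimal sketch where each gate is a named check and the pipeline stops at the first failure. The gate names and the shape of the `build` dictionary are hypothetical, and this isn't how any of the pipelines mentioned above are actually implemented.

```python
def run_quality_gates(gates, build):
    """Run gates in order; return (passed, name_of_failed_gate_or_None)."""
    for name, check in gates:
        if not check(build):
            return False, name  # stop at the first failing gate
    return True, None


# Hypothetical component-level gates, in the order they'd run
gates = [
    ("unit tests", lambda b: b["unit_tests_passed"]),
    ("coverage has not dropped", lambda b: b["coverage"] >= b["previous_coverage"]),
    ("acceptance tests", lambda b: b["acceptance_tests_passed"]),
]

build = {
    "unit_tests_passed": True,
    "coverage": 81.0,
    "previous_coverage": 84.5,
    "acceptance_tests_passed": True,
}
passed, failed_gate = run_quality_gates(gates, build)
```

Here the build is rejected at the "coverage has not dropped" gate, so the acceptance tests never run. In the "if it works, it's quality" view, passing every gate in a list like this would be enough to go to production.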
I'd also recommend a read of DevTestOps According to the Experts on mabl.com which I've seen doing the rounds today, which digs into it a little more.
Code Reusage with Microservices
This open space was concerned with how we can share code amongst microservices; both application code and infrastructure configuration.
We spoke a little about how to manage configuration changes - I spoke about the idea of having immutable infrastructure. This means that we shouldn't make any changes in-place, but instead only make them on a new set of infrastructure. This is made much easier by having everything in configuration management, but can as easily be done using e.g. an EC2 instance's user data script.
One alternative was to use a tool like Consul, or rsync changes across, which can help push changes in-place, but then you risk being unable to revert any changes that have been pushed out.
We spoke a little about canary/blue-green deployments and how just having the stack up doesn't mean the application is ready for traffic as it may need instance warming. At the same time, these deployments need to be aware of e.g. connection limits to databases or the number of licenses available for monitoring the services.
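As a sketch of those two checks together, here's roughly what a "can we cut over yet?" decision might look like for a blue-green deployment. The function name, the `warm` flag on each instance and the connection figures are all hypothetical; a real deployment tool would have its own health and capacity checks.

```python
def can_cut_over(new_instances, old_instances, conns_per_instance, db_max_connections):
    """Check the new stack is warm and both stacks fit within the database's limit."""
    # An empty new stack is never ready, and every instance must be warmed up
    all_warm = bool(new_instances) and all(i["warm"] for i in new_instances)
    # During the cut-over both stacks hold database connections at the same time
    total_connections = (len(new_instances) + len(old_instances)) * conns_per_instance
    return all_warm and total_connections <= db_max_connections


blue = [{"warm": True}, {"warm": True}]
green = [{"warm": True}, {"warm": False}]  # one instance is still warming up

ready = can_cut_over(green, blue, conns_per_instance=20, db_max_connections=100)
```

In this example `ready` is `False`: the stack is up and the connections would fit, but one green instance hasn't finished warming.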
Like with many things, there's the trade-off between doing the "right thing" versus the "get it done right now" option.
One participant had an issue where they had a monolithic "shared library" which contained a lot of shared code, but each of their modules depended on it, which meant that refactorings or new features required careful planning in order to prevent breakage across multiple components. A solution shared by another participant was to split the library into multiple smaller libraries with their own domain scope, and which could be updated more independently.
One pattern that I've seen colleagues use for infrastructure code-reuse is using a tool like Troposphere, which creates Python bindings for CloudFormation. This allows you to share your CloudFormation Templates using Python packaging, allowing for code reuse in common Python ways. One alternative was to use Terraform Modules. However, whichever way you do it, one attendee mentioned that it's likely going to be made generic to the point it's a maintenance burden and hard to use anywhere. A counter argument is that yes, it will be generic, but it can be manageable using Terraform Workspaces.
We spoke briefly around updating configuration for applications, with the two main options being immutable infrastructure and continually refreshing infrastructure. Although I'm a huge fan of having configuration hard-versioned and immutable, helping you confirm that once a set of infrastructure is up, it won't change, I could also see some use cases for updating an instance once running.
This led to a conversation around secrets, and at what point do you recycle them? I shared how we currently pick up secrets at application initialisation, again enforcing immutability of configuration, and making it a conscious decision to restart / redeploy the application if secrets are rotated.
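A minimal sketch of the "read secrets once at initialisation" approach is below. It's an illustration rather than our actual implementation: the `Secrets` class, the `load_secrets` function and the `DB_PASSWORD`/`API_TOKEN` names are made up for the example, and reading from environment variables is just an assumption to keep it self-contained.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Secrets:
    """Secrets captured at start-up; frozen so they can't be swapped at runtime."""
    db_password: str
    api_token: str


def load_secrets(env=os.environ):
    """Read secrets once at start-up; raises KeyError if any are missing."""
    return Secrets(db_password=env["DB_PASSWORD"], api_token=env["API_TOKEN"])


# Called once when the application starts, e.g. secrets = load_secrets()
secrets = load_secrets({"DB_PASSWORD": "example-password", "API_TOKEN": "example-token"})
```

Because the object is frozen, picking up a rotated secret means constructing a new one, which in practice means restarting or redeploying the application.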
We also had a quick discussion around autogenerating code, such as swagger-codegen, and I mentioned that we'd spoken about it only the previous month in my team but decided that we'd not go this route because there was no traceability or consistency with code style and quality when using autogenerated code.
I want to say another huge thanks to the organisers, it was a really great couple of days, and I'm very much looking forward to coming next year, too!