Smashing Podcast Episode 42 With Jeff Smith: What Is DevOps?

42 min read
Smashing Podcast, Career, Smashing Events
Share on Twitter, LinkedIn

About The Author

Drew is a Staff Engineer specialising in Frontend at Snyk, as well as being a co-founder of Notist and the small content management system Perch. Prior to this, … More about Drew ↬

Email Newsletter

Weekly tips on front-end & UX.
Trusted by 182,000+ folks.

In this episode, we’re talking about DevOps. What is it, and is it a string to add to your web development bow? Drew McLellan talks to expert Jeff Smith to find out.

Show Notes

Jeff on Twitter
Jeff’s book Operations Anti-Patterns, DevOps Solutions
Attainable DevOps

Weekly Update

Bridging The Gap Between Designers And Developers written by Matthew Talebi
Useful React APIs For Building Flexible Components With TypeScript written by Gaurav Khanna
Smart CSS Solutions For Common UI Challenges written by Cosima Mielke
Tips And Tricks For Evaluating UX/UI Designers written by Nataliya Sambir
Solving CLS Issues In A Next.js-Powered E-Commerce Website written by Arijit Mondal

Transcript

Drew McLellan: He’s a DevOps practitioner that focuses on attainable levels of DevOps implementations, regardless of where you are in your journey. He’s director of production operations at digital advertising platform Centro, as well as being a public speaker, sharing his DevOps knowledge with audiences all around the globe. He’s the author of the book, Operations Anti-Patterns, DevOps Solutions for Manning Publishing, which shows how to implement DevOps techniques in the kind of imperfect environments most developers work in. So we know he’s an expert in DevOps, but did you know George Clooney regards him as the best paper airplane maker of a generation? My Smashing friends, please welcome Jeff Smith. Hi Jeff. How are you?

Jeff Smith: I’m smashing, Drew, how you doing?

Drew: I’m good. Thank you. That’s good to hear. So I wanted to talk to you today about the subject of DevOps, which is one of your main key area. Many of our listeners will be involved in web and app development, but maybe only have a loose familiarity with what happens on the operations side of things. I know those of us who might work in larger companies will have whole teams of colleagues who are doing ops. We’re just thankful that whatever it is they do, they’re doing it well. But we hear DevOps mentioned more and more, and it feels like one of those things that as developers, we should really understand. So Jeff, what is DevOps?

Jeff: So if you ask 20 people what DevOps is, you might get 20 different answers. So I will give you my take on it, all right, and know that if you’re at a conference and you mention this, you could get into a fist fight with someone. But for me, DevOps is really about that relationship between, and we focus on dev and ops, but really that inter team relationship and how we go about structuring our work and more importantly, structuring our goals and incentives to make sure that they’re aligned so that we are working towards a common goal. And a lot of the core ideas and concepts from DevOps come from the old world where dev and ops were always adversarial, where there was this constant conflict. And when you think about it, it’s because of the way those two teams are incentivized. One team is incentivized to push changes. Another team is incentivized to keep stability, which means fewer changes.

Jeff: When you do that, you create this inherent conflict and everything spills out from there. So DevOps is really about aligning those teams and goals so that we are working towards a common strategy, but then also adopting practices from both sides, so that dev understands more about ops and ops understands more about dev, as a way to gain and share empathy with each other so that we understand the perspective of where the other person is coming from.

Jeff: But then also to enhance our work. Because again, if I understand your perspective and take that into account in my work, it’s going to be a lot more beneficial for each of us. And there’s a lot that ops can learn from developers in terms of automation and how we go about approaching things so that they’re easily reproducible. So it’s this blending and skills. And what you’re seeing now is that this applies to different group combinations, so you’re hearing things like DevSecOps, DevSecFinOps, DevSecFinHROps. It’s just going to keep growing and growing and growing. So it’s really a lesson that we can stamp out across the organization.

Drew: So it’s taking some of the concepts that we understand as developers and spreading our ideas further into the organization, and at the same time learning what we can from the operations to try and move everyone forward.

Jeff: Absolutely, yes. And another aspect of ops, and you had mentioned it a little bit in the intro, is we think it’s just for these larger organizations with dedicated ops teams and things like that, but one thing to think about is ops is happening in your organization, regardless of the size. It’s just a matter of it’s you doing it, or if there’s a separate team doing it, but somehow you’re deploying code. Somehow you’ve got a server out there running somewhere. So ops exist somewhere in your organization, regardless of the size. The question is, who is doing it? And if it’s a single person or a single group then DevOps might even be even more particularly salient for you, as you need to understand the types of things that ops does.

Drew: As professional developers, how important do you think it is for us to have a good grasp of what DevOps is and what it means to implement?

Jeff: I think it’s super important, especially at this phase of the DevOps journey. And the reason I think it’s important is that one, I think we’re always more efficient, again, when we understand what our counterparts are doing. But the other thing is to be able to take operational concerns into account during your design development and implementation of any technology. So one thing that I’ve learned in my career is that even though I thought developers were masters of the universe and understood everything that had to do with computers, turns out that’s not actually the case. Turns out there’s a lot of things that they outsource to ops in terms of understanding, and sometimes that results in particular design choices or implementation choices that may not be optimal for a production deployment.

Jeff: They might be fine in development and testing and things like that, but once you get to production, it’s a little bit of a different ballgame. So not to say that they need to own that entire set of expertise, but they at least need to know enough to know what they don’t know. So they know when to engage ops early, because that’s a common pattern that we see is development makes a choice. I won’t even say make a choice because they’re not even cognizant that it’s a choice, but there’s something that happens that leads to a suboptimal decision for ops and development was completely unaware. So just having a bit more knowledge about ops, even if it’s just enough to say, maybe we should bring ops in on this to get their perspective before we go moving forward. That could save a lot of time and energy and stability, obviously, as it relates to whatever products you’re releasing.

Drew: I see so many parallels with the way that you’re talking about the relationship between dev and ops as we have between design and dev, where you’ve got designers working on maybe how an interface works and looks and having a good understanding of how that’s actually going to be built in the development role, and bringing developers in to consult can really improve the overall solution just by having that clear communication and an understanding of what each other does. Seems like it’s that same principle played out with DevOps, which is really, really good to hear.

Drew: When I think of the things I hear about DevOps, I hear terms like Kubernetes, Docker, Jenkins, CircleCI. I’ve been hearing about Kubernetes for years. I still don’t have any idea what it is, but from what you’re saying, it seems that DevOps isn’t just about … We’re not just talking about tools here, are we? But more about processes and ways of communicating on workflows, is that right?

Jeff: Absolutely. So my mantra for the last 20 years has always been people process tools. You get people to buy into the vision. From there, you define whatever your process is going to look like to achieve that vision. And then you bring on tools that are going to model whatever your process is. So I always put tools at the tail end of the DevOps conversation, mainly because if you don’t have that buy-in, then it doesn’t matter. I could come up with the greatest continuous deployment pipeline ever, but if people aren’t bought into the idea of shipping every change straight to production, it doesn’t matter, right? What good is the tool? So those tools are definitely part of the conversation, only because they’re a standardized way to meet some common goals that we’ve defined.

Jeff: But you’ve got to make sure that those goals that are being defined make sense for your organization. Maybe continuous deployment doesn’t make sense for you. Maybe you don’t want to ship every single change the minute it comes out. And there are plenty of companies and organizations and reasons why you wouldn’t want to do that. So maybe something like a continuous deployment pipeline doesn’t make sense for you. So while the tools are important, it’s more important to focus on what it is that’s going to deliver value for your organization, and then model and implement the tools that are necessary to achieve that.

Jeff: But don’t go online and find out what everyone’s doing and be like, oh, well, if we’re going to do DevOps, we got to switch to Docker and Kubernetes because that’s the tool chain. No, that’s not it. You may not need those things. Not everyone is Google. Not everyone is Netflix. Stop reading posts from Netflix and Google. Please just stop reading them. Because it gets people all excited and they’re like, well this is what we got to do. And it’s like, well, they’re solving very different problems than the problems that you have.

Drew: So if say I’m starting a new project, maybe I’m a startup business, creating software as a service product. I’ve got three developers, I’ve got an empty Git repo and I’ve got dreams of IPOs. To be all in on a DevOps approach to building this product, what are the names of the building blocks that I should have in place in terms of people and processes and where do I start?

Jeff: So in your specific example, the first place I would start with is punting on most of it as much as possible and using something like Heroku or something to that effect. Because you get so excited about all this AWS stuff, Docker stuff, and in reality, it’s so hard just to build a successful product. The idea that you are focusing on the DevOps portion of it is like, well I would say outsource as much of that stuff as possible until it actually becomes a pain point. But if you’re at that point where you’re saying okay, we’re ready to take this stuff in house and we’re ready to take it to the next level. I would say the first place to start is, where are your pain points? what are the things that are causing you problems?

Jeff: So for some people it’s as simple as automated testing. The idea that hey, we need to run tests every time someone makes a commit, because sometimes we’re shipping stuff that’s getting caught by unit tests that we’ve already written. So then maybe you start with continuous integration. Maybe your deployments are taking hours to complete and they’re very manual, then that’s where you focus and you say like, okay, what automation do we need to be able to make this a one button click affair? But I hate to prescribe a general, this is where you start, just because your particular situation and your particular pain points are going to be different. And the thing is, if it’s a pain point, it should be shouting at you. It should be absolutely shouting at you.

Jeff: It should be one of those things where someone says, oh, what sucks in your organization? And it should be like, oh, I know exactly what that is. So when you approach it from that perspective, I think the next steps become pretty apparent to you in terms of what in the DevOps toolbox you need to unpack and start working with. And then it becomes these minimal incremental changes that just keep coming and you notice that as you get new capabilities, your appetite for substandard stuff becomes very small. So you go from like, oh yeah, deploys take three hours and that’s okay. You put some effort into it and next thing you know, in three weeks, you’re like, man, I cannot believe the deployment is still taking 30 minutes. How do we get this down from 30 minutes? Your appetite becomes insatiable for improvement. So things just sort of spill out from there.

Drew: I’ve been reading your recent book and that highlights what you call the four pillars of DevOps. And none of them is tools, as mentioned, but there are these four main areas of focus, if you like, for DevOps. I noticed that the first one of those is culture, I was quite surprised by that, firstly, because I was expecting you to be talking about tools more and we now understand why, but when it comes to culture, it just seems like a strange thing to have at the beginning. There’s a foundation for a technical approach. How does the culture affect how successful DevOps implementation can be within an organization?

Drew: … how successful DevOps implementation can be within an organization.

Jeff: Culture is really the bedrock of everything when you think about it. And it’s important because culture, and we get into this a little bit deeper in the book, but culture really sets the stage for norms within the organization. Right. You’ve probably been at a company where, if you submitted a PR with no automated testing, that’s not a big thing. People accept it and move on.

Jeff: But then there’s other orgs where that is a cardinal sin. Right. Where if you’ve done that, it’s like, “Whoa, are you insane? What are you doing? There’s no test cases here.” Right. That’s culture though. That is culture that is enforcing that norm to say like, “This is just not what we do.”

Jeff: Anyone can write a document that says we will have automated test cases, but the culture of the organization is what enforces that mechanism amongst the people. That’s just one small example of why culture is so important. If you have an organization where the culture is a culture of fear, a culture of retribution. It’s like if you make a mistake, right, that is sacrilege. Right. That is tantamount to treason. Right.

Jeff: You create behaviors in that organization that are adverse to anything that could be risky or potentially fail. And that ends up leaving a lot of opportunity on the table. Whereas if you create a culture that embraces learning from failure, embraces this idea of psychological safety, where people can experiment. And if they’re wrong, they can figure out how to fail safely and try again. You get a culture of experimentation. You get an organization where people are open to new ideas.

Jeff: I think we’ve all been at those companies where it’s like, “Well, this is just the way it’s done. And no one changes that.” Right. You don’t want that because the world is constantly changing. That’s why we put culture front and center, because a lot of the behaviors within an organization exist because of the culture that exists.

Jeff: And the thing is, cultural actors can be for good or ill. Right. What’s ironic, and we talk about this in the book too, is it doesn’t take as many people as you think to change the organizational culture. Right. Because most people, there’s detractors, and then there’s supporters, and then there’s fence sitters when it comes to any sort of change. And most people are fence sitters. Right. It only takes a handful of supporters to really tip the scales. But in the same sense, it really only takes a handful of detractors to tip the scales either.

Jeff: It’s like, it doesn’t take much to change the culture for the better. And if you put that energy into it, even without being a senior leader, you can really influence the culture of your team, which then ends up influencing the culture of your department, which then ends up influencing the culture of the organization.

Jeff: You can make these cultural changes as an individual contributor, just by espousing these ideas and these behaviors loudly and saying, “These are the benefits that we’re getting out of this.” That’s why I think culture has to be front and fore because you got to get everyone bought into this idea and they have to understand that, as an organization, it’s going to be worthwhile and support it.

Drew: Yeah. It’s got to be a way of life, I guess.

Jeff: Exactly.

Drew: Yeah. I’m really interested in the area of automation because through my career, I’ve never seen some automation that’s been put in place that hasn’t been of benefit. Right. I mean, apart from the odd thing maybe where something’s automated and it goes wrong. Generally, when you take the time to sit down and automate something you’ve been doing manually, it always saves you time and it saves you headspace, and it’s just a weight off your shoulders.

Drew: In taking a DevOps approach, what sort of things would you look to automate within your workflows? And what gains would you expect to see from that over completing things manually?

Jeff: When it comes to automation, to your point, very seldom is there a time where automation hasn’t made life better. Right. The rub that people encounter is finding the time to build that automation. Right. And usually, at my current job, for us it’s actually the point of the request. Right. Because at some point you have to say, “I’m going to stop doing this manually and I’m going to automate it.”

Jeff: And it may have to be the time you get a request where you say, “You know what? This is going to take two weeks. I know we normally turn it around in a couple of hours, but it’s going to take two weeks because this is the request that gets automated.” In terms of identifying what you automate. At Central, I use the process where basically, I would sample all of the different types of requests that came in over a four week period, let’s say. And I would categorize them as planned work, unplanned work, value add work, toil work. Toil being work that’s not really useful, but for some reason, my organization has to do it.

Jeff: And then identifying those things that are like, “Okay, what is the low hanging fruit that we can just get rid of if we were to automate this? What can we do to just simplify this?” And some of the criteria was the risk of the process. Right. Automated database failovers are a little scary because you don’t do them that often. And infrastructure changes. Right. We say, “How often are we doing this thing?” If we’re doing it once a year, it may not be worth automating because there’s very little value in it. But if it’s one of those things that we’re getting two, three times a month, okay, let’s take a look at that. All right.

Jeff: Now, what are the things that we can do to speed this up? And the thing is, when we talk about automation, we instantly jumped to, “I’m going to click a button and this thing’s just going to be magically done.” Right. But there are so many different steps that you can do in automation if you feel queasy. Right. For example, let’s say you’ve got 10 steps with 10 different CLI commands that you would normally run. Your first step of automation could be as simple as, run that command, or at least show that command. Right. Say, “Hey, this is what I’m going to execute. Do you think it’s okay?” “Yes.” “Okay. This is the result I got. Is it okay for me to proceed?” “Yes.” “Okay. This is the result I got.” Right.

Jeff: That way you’ve still got a bit of control. You feel comfortable. And then after 20 executions, you realize you’re just hitting, yes, yes, yes, yes, yes, yes. You say, “All right. Let’s chain all these things together and just make it all one.” It’s not like you’ve got to jump into the deep end of, click it and forget it right off the rip. You can step into this until you feel comfortable.

Jeff: Those are the types of things that we did as part of our automation effort was simply, how do we speed up the turnaround time of this and reduce the level of effort on our part? It may not be 100% day one, but the goal is always to get to 100%. We’ll start with small chunks that we’ll automate parts of it that we feel comfortable with. Yes. We feel super confident that this is going to work. This part we’re a little dicey on, so maybe we’ll just get some human verification before we proceed.

Jeff: The other thing that we looked at in terms of we talk about automation, but is what value are we adding to a particular process? And this is particularly salient for ops. Because a lot of times ops serves as the middleman for a process. Then their involvement is nothing more than some access thing. Right. It’s like, well, ops has to do it because ops is the only person that has access.

Jeff: Well, it’s like, well, how do we outsource that access so that people can do it? Because the reality is, it’s not that we’re worried about developers having production access. Right. We’re worried about developers having unfettered production access. And that’s really a safety thing. Right. It’s like if my toolbox has only sharp knives, I’m going to be very careful about who I give that out to. But if I can mix up the toolbox with a spoon and a hammer so that people can choose the right tool for the job, then it’s a lot easier to loan that out.

Jeff: For example, we had a process where people needed to run ad hoc Ruby scripts in production, for whatever reason. Right. Need to clean up data, need to correct some bad record, whatever. And that would always come through my team. And it’s like, well, we’re not adding any value to this because I can’t approve this ticket. Right. I have no idea. You wrote the software, so what good is it me sitting over your shoulder and going, “Well, I think that’s safe”? Right. I didn’t add any value to typing it in because I’m just typing exactly what you told me to type. Right.

Jeff: And worst case, and at the end of it, I’m really just a roadblock for you because you’re submitting a ticket, then you’re waiting for me to get back from lunch. I’m back from lunch, but I’ve got these other things to work on. We said, “How do we automate this so that we can put this in the hands of developers while at the same time addressing any of these audit concerns that we might have?”

Jeff: We put it in a JIRA workflow, where we had a bot that would automate executing commands that were specified in the JIRA ticket. And then we could specify in the JIRA ticket that it required approval from one of several senior engineers. Right.

Jeff: It makes more sense that an engineer is approving another engineer’s work because they have the context. Right. They don’t have to sit around waiting for ops. The audit piece is answered because we’ve got a clear workflow that’s been defined in JIRA that is being documented as someone approves, as someone requested. And we have automation that is pulling that command and executing that command verbatim in the terminal. Right.

Jeff: You don’t have to worry about me mistyping it. You don’t have to worry about me grabbing the wrong ticket. That increased the turnaround time for those tickets, something like tenfold. Right. Developers are unblocked. My team’s not tied up doing this. And all it really took was a week or two week investment to actually develop the automation and the permissioning necessary to get them access for it.

Jeff: Now we’re completely removed from that. And development is actually able to outsource some of that functionality to lower parts of the organization. They’ve pushed it to customer care. It’s like now when customer care knows that this record needs to be updated for whatever, they don’t need development. They can submit their standard script that we’ve approved for this functionality. And they can run it through the exact same workflow that development does. It’s really a boon all around.

Jeff: And then it allows us to push work lower and lower throughout the organization. Because as we do that, the work becomes cheaper and cheaper because I could have a fancy, expensive developer running this. Right. Or I can have a customer care person who’s working directly with the customer, run it themselves while they’re on the phone with a customer correcting an issue.

Jeff: Automation I think, is key to any organization. And the final point I’ll say on that is, it also allows you to export expertise. Right. Now, I may be the only person that knows how to do this if I needed to do a bunch of commands on the command line. But if I put this in automation, I can give that to anyone. And people know what the end result is, but they don’t need to know all the intermediate steps. I have increased my value tenfold by pushing it out to the organization and taking my expertise and codifying it into something that’s exportable.

Drew: You talked about automating tasks that are occurring frequently. Is there an argument for also automating tasks that happen so infrequently that it takes a developer quite a long time to get back up to speed with how it should work? Because everybody’s forgotten. It’s been so long. It’s been a year, maybe nobody has done it before. Is there an argument for automating those sorts of things too?

Jeff: That’s a tough balancing act. Right. And I always say take it by a case by case basis. And the reason I say that is, one of the mantras in DevOps is if something painful, do it more often. Right. Because the more often you do it, the more muscle memory it becomes and you get to work out and iron out those kinks.

Jeff: The issue that we see with automating very infrequent tasks is that the landscape of the environment tends to change in between executions of that automation. Right. What ends up happening is your code makes particular assumptions about the environment and those assumptions are no longer valid. So the automation ends up breaking anyways.

Drew: And then you’ve got two problems.

Jeff: Right. Right. Exactly. Exactly. And you’re like, “Did I type it wrong? Or is this? No, this thing is actually broke.” So-

Jeff: Typing wrong or is this no, this thing is actually broke. So when it comes to automating infrequent tasks, we really take it by a case by case basis to understand, well, what’s the risk if this doesn’t work, right. If we get it wrong, are we in a bad state or is it just that we haven’t finished this task? So if you can make sure that this would fail gracefully and not have a negative impact, then it’s worth giving a shot in automating it. Because at the very least, then you have a framework of understanding of what should be going on because at the very least, someone’s going to be able to read the code and understand, all right, this is what we were doing. And I don’t understand why this doesn’t work anymore, but I have a clear understanding of what was supposed to happen at least based at design time when this was written.

Jeff: But if you’re ever in a situation where failure could lead to data changes or anything like that, I usually err on the side of caution and keep it manual only because if I have an automation script, if I find some confluence document that’s three years old that says run this script, I tend to have a hundred percent confidence in that script and I execute it. Right. Whereas if it’s a series of manual steps that was documented four years ago, I’m going to be like, I need to do some verification here. Right? Let me step through this a little bit and talk to a few people. And sometimes when we design processes, it’s worthwhile to force that thought process, right? And you have to think about the human component and how they’re going to behave. And sometimes it’s worth making the process a little more cumbersome to force people to think should I be doing this now?

Drew: Are there other ways of identifying what should be automated through sort of monitoring your systems and measuring things? I mean, I think about DevOps and I think about dashboards as one of the things, nice graphs. And I’m sure there’s a lot more to those dashboards than just looking pretty, but it’s always nice to have pretty looking dashboards. Are there ways of measuring what a system’s up to, to help you to make those sorts of decisions?

Jeff: Absolutely. And that sort of segues into the metrics portion of cams, right, is what are the things that we are tracking in our systems to know that they are operating efficiently? And one of the common sort of pitfalls of metrics is we look for errors instead of verifying success. And those are two very different practices, right? So something could flow through the system and not necessarily error out, but not necessarily go through the entire process the way it should. So if we drop a message on a message queue, there should be a corresponding metric that says, “And this message was retrieved and processed,” right? If not, right, you’re going to quickly have an imbalance and the system doesn’t work the way it should. I think we can use metrics as a way to also understand different things that should be automated as we get into those bad states.

Jeff: Right? Because a lot of times it’s a very simple step that needs to be taken to clean things up, right? For people that have been ops for a while, right, the disc space alert, everyone knows about that. Oh, we’re filled up with disc. Oh, we forgot it’s month end and billing ran and billing always fills up the logs. And then VAR log is consuming all the disc space, so we need to run a log rotate. Right? You could get woken up at three in the morning for that, if that’s sort of your preference. But if we sort of know that that’s the behavior, our metrics should be able to give us a clue to that. And we can simply automate the log rotate command, right? Oh, we’ve reached this threshold, execute the log rotate command. Let’s see if the alert clears. If it does, continue on with life. If it doesn’t, then maybe we wake someone up, right.

Jeff: You’re seeing this a lot more with infrastructure automation as well, right, where it’s like, “Hey, are our requests per second are reaching our theoretical maximum. Maybe we need to scale the cluster. Maybe we need to add three or four nodes to the load balancer pool.” And we can do that without necessarily requiring someone to intervene. We can just look at those metrics and take that action and then contract that infrastructure once it goes below a particular threshold, but you got to have those metrics and you got to have those hooks into your monitoring environment to be able to do that. And that’s where the entire metrics portion of the conversation comes in.

Jeff: Plus it’s also good to be able to share that information with other people because once you have data, you can start talking about things in a shared reality, right, because busy is a generic term, but 5,200 requests per second is something much more concrete that we can all reason about. And I think so often when we’re having conversations about capacity or anything, we use these hand-wavy terms, when instead we could be looking at a dashboard and giving very specific values and making sure that everyone has access to those dashboards, that they’re not hidden behind some ops wall that only we have access to for some unknown reason.

Drew: So while sort of monitoring and using metrics as a decision-making tool for the businesses is one aspect of it, it sounds like the primary aspect is having the system monitor itself, perhaps, and to respond maybe with some of these automations as the system as a whole gives itself feedback on onto what’s happening.

Jeff: Absolutely. Feedback loops are a key part of any real system design, right, and understanding the state of the system at any one time. So while it’s easy in the world where everything is working fine, the minute something goes bad, those sorts of dashboards and metrics are invaluable to have, and you’ll quickly be able to identify things that you have not instrumented appropriately. Right. So one of the things that we always talk about in incident management is what questions did you have for the system that couldn’t be answered, right. So what is it, or you’re like, “Oh man, if we only knew how many queries per second were going on right now.” Right.

Jeff: Well, okay. How do we get that for next time? How do we make sure that that’s radiated somewhere? And a lot of times it’s hard when you’re thinking green field to sit down and think of all the data that you might want at any one time. But when you have an incident, it becomes readily apparent what data you wish you had. So it’s important to sort of leverage those incidents and failures and get a better understanding of information that’s missing so that you can improve your incident management process and your metrics and dashboarding.

Drew: One of the problems we sometimes face in development is that teammate members, individual team members hold a lot of knowledge about how a system works and if they leave the company or if they’re out sick or on vacation, that knowledge isn’t accessible to the rest of the team. It seems like the sort of DevOps approach to things is good at capturing a lot of that operational knowledge and building it into systems. So that sort of scenario where an individual has got all the information in their head that doesn’t happen so much. Is that a fair assessment?

Jeff: It is. I think we’ve probably, I think as an industry we might have overstated its efficacy. And the only reason I say that is when our systems are getting so complicated, right? Gone are the days where someone has the entire system in their head and can understand it from beginning to end. Typically, there’s two insidious parts of it. One, people typically focus on one specific area and someone doesn’t have the whole picture, but what’s even more insidious is that we think we understand how the system works. Right. And it’s not until an incident happens that the mental model that we have of the system and the reality of the system come into conflict. And we realize that there’s a divergence, right? So I think it’s important that we continuously share knowledge in whatever form is efficient for folks, whether it be lunch and learns, documentation, I don’t know, presentations, anything like that to sort of share and radiate that knowledge.

Jeff: But we also have to prepare and we have to prepare and define a reality where people may not completely understand how the system works. Right. And the reason I think it’s important that we acknowledge that is because you can make a lot of bad decisions thinking you know how the system behaves and being 100% wrong. Right. So having the wherewithal to understand, okay, we think this is how the system works. We should take an extra second to verify that somehow. Right. I think that’s super important in these complicated environments in these sprawling complex microservice environments. Whereas it can be very, it’s easy to be cavalier if you think, oh yeah, this is definitely how it works. And I’m going to go ahead and shut the service down because everything’s going to be fine. And then everything topples over. So just even being aware of the idea that, you know what, we may not know a hundred percent how this thing works.

Jeff: So let’s take that into account with every decision that we make. I think that’s key. And I think it’s important for management to understand the reality of that as well because for management, it’s easy for us to sit down and say, “Why didn’t we know exactly how this thing was going to fail?” And it’s like, because it’s complicated, right, because there’s 500 touch points, right, where these things are interacting. And if you change one of them, it changes the entire communication pattern. So it’s hard and it’s not getting any easier because we’re getting excited about things like microservices. We’re getting excited about things like Kubernetes. We’re giving people more autonomy and these are just creating more and more complicated interfaces into these systems that we’re managing. And it’s becoming harder and harder for anyone to truly understand them in their entirety.

Drew: We’ve talked a lot about a professional context, big organizations and small organizations too. But I know many of us work on smaller side projects or maybe we volunteer on projects and maybe you’re helping out someone in the community or a church or those sorts of things. Can a DevOps approach benefit those smaller projects or is it just really best left to big organizations to implement?

Jeff: I think DevOps can absolutely benefit those smaller projects. And specifically, because I think sort of some of the benefits that we’ve talked about get amplified in those smaller projects. Right? So exporting of expertise with automation is a big one, right? If I am… Take your church example, I think is a great one, right? If I can build a bunch of automated tests suites to verify that a change to some HTML doesn’t break the entire website, right, I can export that expertise so that I can give it to a content creator who has no technical knowledge whatsoever. Right. They’re a theologian or whatever, and they just want to update a new Bible verse or something, right. But I can export that expertise so that they know that I know when I make this content change, I’m supposed to run this build button.

Jeff: And if it’s green, then I’m okay. And if it’s red, then I know I screwed something up. Right. So you could be doing any manner of testing in there that is extremely complicated. Right. It might even be something as simple as like, hey, there’s a new version of this plugin. And when you deploy, it’s going to break this thing. Right. So it has nothing to do with the content, but it’s at least a red mark for this content creator to say “Oh, something bad happened. I shouldn’t continue. Right. Let me get Drew on the phone and see what’s going on.” Right. And Drew can say, “Oh right. This plugin is upgraded, but it’s not compatible with our current version of WordPress or whatever.” Right. So that’s the sort of value that we can add with some of these DevOps practices, even in a small context, I would say specifically around automation and specifically around some of the cultural aspects too.

Jeff: Right? So I’ve been impressed with the number of organizations that are not technical that are using get to make changes to everything. Right. And they don’t really know what they’re doing. They just know, well, this is what we do. This is the culture. And I add this really detailed commit message here. And then I push it. They are no better than us developers. They know three get commands, but it’s the ones they use over and over and over again. But it’s been embedded culturally and that’s how things are done. So everyone sort of rallies around that and the people that are technical can take that pattern.

Jeff: … around that and the people that are technical can take that pattern and leverage it into more beneficial things that might even be behind the scenes that they don’t necessarily see. So I think there’s some value, definitely. It’s a matter of how deep you want to go, even with the operations piece, right? Like being able to recreate a WordPress environment locally very easily, with something like Docker. They may not understand the technology or anything, but if they run Docker Compose Up or whatever, and suddenly they’re working on their local environment, that’s hugely beneficial for them and they don’t really need to understand all the stuff behind it. In that case, it’s worthwhile, because again, you’re exporting that expertise.

Drew: We mentioned right at the beginning, sort of putting off as much sort of DevOps as possible. You mentioned using tools like Heroku. And I guess that sort of approach would really apply here on getting started with, with a small project. What sort things can platforms like Heroku offer? I mean, obviously, I know you’re not a Heroku expert or representative or anything, but those sorts of platforms, what sort of tools are they offering that would help in this context?

Jeff: So for one, they’re basically taking that operational context for you and they’re really boiling it down into a handful of knobs and levers, right? So I think what it offers is one, it offers a very clear set of what we call the yellow brick road path, where it’s like, “If you go this route, all of this stuff is going to be handled for you and it’s going to make your life easier. If you want to go another route, you can, but then you got to solve for all this stuff yourself.” So following the yellow brick road route helps because one, they’re probably identifying a bunch of things that you hadn’t even thought of. So if you’re using their database container or technology, guess what? You’re going to get a bunch of their metrics for free. You’re going to get a lot of their alerting for free. You didn’t do anything. You didn’t think anything. It’s just when you need it, it’s there. And it’s like, “Oh wow, that’s super are helpful.”

Jeff: Two, when it comes to performance sizing and flexibility, this becomes very easy to sort of manage because the goal is, you’re a startup that’s going to become wildly successful. You’re going to have hockey stick growth. And the last thing you necessarily really want to be doing is figuring out how to optimize your code for performance, while at the same time delivering new features. So maybe you spend your way out of it. You say, “Well, we’re going to go up to the next tier. I could optimize my query code, but it’s much more efficient for me to be spending time building this next feature that’s going to bring in this new batch of users, so let’s just go up to the next tier,” and you click button and you move on.

Jeff: So being able to sort of spend your way out of certain problems, I think it’s hugely beneficial because tech debt gets a bad rap, but tech debt is no different than any debt. It’s the trade off of acquiring something now and dealing with the pain later. And that’s a strategic decision that you have to make in every organization. So unchecked tech debt is bad, right? But tech debt generally, I think, is a business choice and Heroku and platforms like that enable you to make that choice when it comes to infrastructure and performance.

Drew: You’ve written a book, Operations, Anti-Patterns, DevOps Solutions, for Manning. I can tell it’s packed with years of hard-earned experience. The knowledge sort of just leaps out from the page. And I can tell it’s been a real labor of love. It’s packed full of information. Who’s your sort of intended audience for that book? Is it mostly those who are already working in DevOps, or is it got a broader-

Jeff: It’s got a broader… So one of the motivations for the book was that there were plenty of books for people that we’re already doing DevOps. You know what I mean? So we were kind of talking to ourselves and high-fiving each other, like, “Yeah, we’re so advanced. Awesome.” But what I really wanted to write the book for were people that were sort of stuck in these organizations. I don’t want to use the term stuck. That’s unfair, but are in these organizations that maybe aren’t adopting DevOps practices or aren’t at the forefront of technology, or aren’t necessarily cavalier about blowing up the way they do work today, and changing things.

Jeff: I wanted to write it to them, mainly individual contributors and middle managers to say like, “You don’t need to be a CTO to be able to make these sorts of incremental changes, and you don’t have to have this whole sale revolution to be able to gain some of the benefits of DevOps.” So it was really sort of a love letter to them to say like, “Hey, you can do this in pieces. You can do this yourself. And there’s all of these things that you may not think are related to DevOps because you’re thinking of it as tools and Kubernetes.” Not every organization… If you were for this New York State, like the state government, you’re not going to just come in and implement Kubernetes overnight. Right? But you can implement how teams talk to each other, how they work together, how we understand each other’s problems, and how we can address those problems through automation. Those are things that are within your sphere of influence that can improve your day to day life.

Jeff: So it was really a letter to those folks, but I think there’s enough data in there and enough information for people that are in a DevOps organization to sort of glean from and say like, “Hey, this is still useful for us.” And a lot of people, I think identify quickly by reading the book, that they’re not in a DevOps organization, they just have out a job title change. And that happens quite a bit. So they say like, “Hey, we’re DevOps engineers now, but we’re not doing these sorts of practices that are talked about in this book and how do we get there?”

Drew: So it sounds like your book is one of them, but are there other resources that people looking to get started with DevOps could turn to? Are there good places to learn this stuff?

Jeff: Yeah. I think DevOps For Dummies by Emily Freeman is a great place to start. It really does a great job of sorting of laying out some of the core concepts and ideas, and what it is we’re striving for. So that would be a good place to start, just to sort of get a lay of the land. I think the Phoenix Project is obviously another great source by Gene Kim. And that is great, that sort of sets the stage for the types of issues that not being in a DevOps environment can create. And it does a great job of sort of highlighting these patterns and personalities that occur that we see in all types of organizations over and over again. I think it does a great job of sort of highlighting those. And if you read that book, I think you’re going to end up screaming at the pages saying, “Yes, yes. This. This.” So, that’s another great place.

Jeff: And then from there, diving into any of the DevOps handbook. I’m going to kick myself for saying this, but the Google SRE Handbook was another great place to look. Understand that you’re not Google, so don’t feel like you’ve got to implement everything, but I think a lot of their ideas and strategies are sound for any organization, and are great places where you can sort of take things and say like, “Okay, we’re, we’re going to make our operations environment a little more efficient.” And that’s, I think going to be particularly salient for developers that are playing an ops role, because it does focus on a lot of the sort of programmatic approach to solving some of these problems.

Drew: So, I’ve been learning all about DevOps. What have you been learning about lately, Jeff?

Jeff: Kubernetes, man. Yeah. Kubernetes has been a real sort of source of reading and knowledge for us. So we’re trying to implement that at Centro currently, as a means to sort of further empower developers. We want to take things a step further from where we’re at. We’ve got a lot of automation in place, but right now, when it comes to onboarding a new service, my team is still fairly heavily involved with that, depending on the nature of the service. And we don’t want to be in that line of work. We want developers to be able to take an idea from concept to code to deployment, and do that where the operational expertise is codified within the system. So, as you move through the system, the system is guiding you. So we think Kubernetes is a tool that will help us do that.

Jeff: It’s just incredibly complicated. And it’s a big piece to sort of bite off. So figuring out what do deployments look like? How do we leverage these operators inside Kubernetes? What does CICD look like in this new world? So there’s been a lot of reading, but in this field, you’re constantly learning, right? It doesn’t matter how long you’ve been in it, how long you’ve been doing it, you’re an idiot in some aspect of this field somewhere. So, it’s just something you kind of adapt to

Drew: Well, hats off as I say, even after all these years, although I sort of understand where it sits in the stack, I still really don’t have a clue what Kubernetes is doing.

Jeff: I feel similar sometimes. It feels like it’s doing a little bit of everything, right? It is the DNS of the 21st century.

Drew: If you, the listener, would like to hear more from Jeff, you can find him on Twitter, where he’s at dark and nerdy, and find his book and links to past presentations and blog posts at his site, attainabledevops.com. Thanks for joining us today, Jeff. Did you have any parting words?

Jeff: Just keep learning, just get out there, keep learning and talk to your fellow peers. Talk, talk, talk. The more you can talk to the people that you work with, the better understanding, the better empathy you’ll generate for them, and if there’s someone in particular in the organization you hate, make sure you talk to them first.

About The Author

Email Newsletter

Show Notes

Weekly Update

Transcript

Further Reading