Nathan Harvey, a Cloud Developer Advocate at Google, is at the forefront of the DevOps movement. What is DevOps, you ask? AH-HAH!! Exactly. That is part of the reason I went to hear him speak, because the definitions I had been hearing did not make sense to me. They all just sounded like what Agile leaders have been preaching for years, and in some cases like backward steps.
Here is the definition he presented, as copied from GitHub: “A cultural and professional movement, focused on how we build and operate high velocity organizations, born from the experiences of its practitioners.” He is involved in fashioning this definition. After his talk, we spoke, and Harvey agreed with my statement that the definition could just as easily be a definition of Agile!
In fact, Harvey told me the term was coined at an Agile conference, in what is known as an Open Space, where conference attendees can propose topics and generate informal discussions about them. The person who proposed the topic, Harvey said, asked how to get the Operations side of IT businesses using Agile, sensing it was primarily used by development teams. Harvey stated during his excellent introductory talk that the problem arose not from Agile in principle, but from Agile in practice. Agile is supposed to be a mindset applied across the organization, and it has been used in a wide range of workstreams besides software (long before the Agile Manifesto for Software Development was published). I have applied it to Operations teams, the teams charged with keeping information systems up and running.
In a truly Agile organization, Dev and Ops personnel would be on the same cross-functional teams. Because most companies don’t do that, they create a contrived conflict—I started to say “natural conflict,” but managers actively create this problem—in which Dev is incentivized to release stuff but Ops is incentivized to keep bugs off the “Production” servers, as Harvey explained. Besides the structural change to cross-functionality, building incentives around customer satisfaction and quality instead would eliminate this dichotomy.
The name DevOps is “unfortunate,” Harvey said, because the problem extends beyond just those two organizations. His point was that product managers and other business managers push both for faster delivery and higher stability, as may vendors and customers. All need to be brought into a mindset where the two are compatible, and secondary to the more important goals I just mentioned. His suggested metrics address the divide in more detail: delivery lead time (from commitment to the change, through deployment), deployment frequency, time to restore service, and deployment fail rate. Imagine if the Dev and Ops groups were jointly held to all of those, assuming you couldn’t make them one group.
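To make those four metrics concrete, here is a minimal sketch of how a team might compute them from its own deployment log. It is my illustration, not something Harvey showed: the `Deployment` record, its field names, and the `delivery_metrics` function are hypothetical, and I am assuming time to restore is measured from the failed deployment to the moment service came back.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One deployment record (hypothetical shape, for illustration only)."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did the deployment cause a failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def delivery_metrics(deployments: list[Deployment], period_days: float) -> dict:
    """Compute the four metrics over a reporting period of `period_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")

    # Delivery lead time: commitment to the change, through deployment.
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]

    failures = [d for d in deployments if d.failed]

    # Time to restore: assumed here to run from the failed deployment
    # until service was restored.
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]

    return {
        "median_lead_time_hours": median(lead_hours),
        "deploys_per_day": len(deployments) / period_days,
        "fail_rate": len(failures) / len(deployments),
        "mean_time_to_restore_hours": (
            sum(restore_hours) / len(restore_hours) if restore_hours else None
        ),
    }


# Made-up sample data: four deployments over four days, one of which failed.
start = datetime(2019, 1, 14, 9, 0)
day, hour = timedelta(days=1), timedelta(hours=1)
sample = [
    Deployment(start, start + 4 * hour),
    Deployment(start + day, start + day + 2 * hour,
               failed=True, restored_at=start + day + 3 * hour),
    Deployment(start + 2 * day, start + 2 * day + 6 * hour),
    Deployment(start + 3 * day, start + 3 * day + 8 * hour),
]
metrics = delivery_metrics(sample, period_days=4)
print(metrics)
```

The point of putting all four numbers in one report is the one made above: a single group, or two groups held jointly accountable, sees speed and stability side by side.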
The basic principles of DevOps, as outlined by Harvey, start with “Continuous Integration.” This refers to the practice of nearly continuously adding changes into the main “trunk” of software code rather than processing them through separate “branches” representing waterfall stages. Taking continuous integration to its logical extreme, don’t use branches at all, he said. Always build to the trunk, at most using only short-lived feature branches, as in less than a day.
To a waterfall project manager or a conventional Ops admin, this sounds like a recipe for chaos and disaster. But it has been accomplished, through a combination of “Automated Testing,” where software robots do more routine quality testing than humans ever could, and a “Shift Left on Quality.” If you envision any new-product development process as a flow chart, it usually starts with something like “Identify Product Strategy” or “Requirements Gathering” on the far left, and “Testing” and “Deployment” on the far right. Harvey was saying quality needs to be addressed in all of the left-hand boxes, too, instead of waiting until Testing. I was arguing for that philosophy when I was a waterfall project manager at Microsoft.
An example Harvey gave is when an automated test “goes red.” Upper managers and salespeople may push to deliver anyway, if they have made the common but irrational mistake of promising delivery by a certain date, as if they are armed with infallible crystal balls. Harvey said you must stop new work and fix the build. (And not just by “commenting out” the offending code, kicking the can down the road.) This is equivalent, he said in our conversation, to Japanese managers encouraging automobile workers to stop the entire production line if they spot a problem. This didn’t translate well to America, he said, because the Japanese managers would come to the person and thank them, while in America the managers would berate them. Teach developers to write “defensive code,” he said in his talk, with the implication that managers must support this, too. Again, I have encouraged this since my waterfall days, recognizing that bad quality both harms customer satisfaction and is costlier to go back and fix than to build in.
It is impossible to achieve 100% uptime of any system, and the closer you get, the more expensive a given level is to maintain, he said. Harvey recommended working across the organization to create an “error budget,” the percentage difference between 100% and your uptime target. I took this to mean if your uptime goal was “four nines” or 99.99%, then 0.01% is your error budget.
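To see what that 0.01% means in practice, here is a small sketch of the arithmetic. The function names and the 30-day window are my own assumptions for illustration; Harvey did not present code.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in an assumed 30-day window


def error_budget_minutes(uptime_target_percent: float, window_minutes: float) -> float:
    """Allowed downtime in the window: (100% - target) applied to the window length."""
    budget_fraction = (100.0 - uptime_target_percent) / 100.0
    return budget_fraction * window_minutes


def budget_remaining_minutes(
    uptime_target_percent: float, window_minutes: float, downtime_minutes: float
) -> float:
    """Error budget left after the downtime already incurred (negative if overspent)."""
    return error_budget_minutes(uptime_target_percent, window_minutes) - downtime_minutes


# "Four nines" over 30 days leaves only a few minutes of allowed downtime.
four_nines = error_budget_minutes(99.99, MINUTES_PER_30_DAYS)
print(round(four_nines, 2))  # 4.32

# "Three nines" is ten times more forgiving.
three_nines = error_budget_minutes(99.9, MINUTES_PER_30_DAYS)
print(round(three_nines, 2))  # 43.2
```

Seeing the budget in minutes makes the cost argument easier to have across the organization: each extra nine cuts the allowed downtime by a factor of ten.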
Toward that end, he recommended you establish service level indicators (SLIs) that define what “successful enough” means for metrics like those listed earlier. These become the basis for service level objectives (SLOs), which are customer-focused targets for the SLIs. Naturally, the SLOs should be stricter than the service level agreements (SLAs) provided to customers. I noted in my head that I had learned all of this in my ITIL training years ago, so it should not be news to an Ops professional, but a lot of people were jotting these down.
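As a rough sketch of how the three terms relate, the snippet below measures an availability SLI and compares it to an internal SLO and a looser SLA. The percentages, the request-based indicator, and the function names are all assumptions I made up for illustration, not figures from the talk.

```python
SLA_PERCENT = 99.9    # what is promised to customers (assumed value)
SLO_PERCENT = 99.95   # internal target, deliberately stricter than the SLA (assumed value)

# The internal objective should leave headroom before the customer promise is broken.
assert SLO_PERCENT > SLA_PERCENT


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: percentage of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * good_requests / total_requests


def service_status(sli_percent: float) -> str:
    """Classify a measured SLI against the SLO and the SLA."""
    if sli_percent >= SLO_PERCENT:
        return "meeting SLO"
    if sli_percent >= SLA_PERCENT:
        return "below SLO, within SLA"
    return "SLA breached"


print(service_status(availability_sli(99_970, 100_000)))
print(service_status(availability_sli(99_920, 100_000)))
print(service_status(availability_sli(99_800, 100_000)))
```

The middle state is the useful one: missing the SLO is an early warning to the team while the promise to the customer is still intact.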
The other question I came to the meeting with was, as I told him, “Isn’t this what we were doing 25 years ago?” The time frame maybe should have been longer, but my point was that in the early days of computers the same people designed, built, and maintained both hardware and its software. He agreed. Only when companies broke into silos was the problem introduced, so DevOps is a case of going back to the future.
It also is just part of Agile, specifically a movement to address partial implementations of the Agile mindset. What he said at the end of the talk was nearly word-for-word what I have been telling managers for nearly 20 years, first in relation to better meeting and teamwork practices, and then Agile: “The only way you’re going to get there is to go back and do the work.” You can’t just attend presentations and read books to improve. The work is “hard, it’s messy,” as he said, but worth it.
Source: Harvey, N. (2019), “DevOps—Improving Software Delivery and Operations,” Triangle DevOps Meetup presentation, 1/16/19, Morrisville, N.C.
Harvey’s suggested readings:
- Accelerate, by Nicole Forsgren, Jez Humble, and Gene Kim, from IT Revolution Press.
- Site Reliability Engineering, by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, from O’Reilly Media.
- State of DevOps Report, published by Puppet and Splunk.