Ethical engineering on the path to excellent systems with Liz Fong-Jones of Google and Honeycomb.io
Liz Fong-Jones is a 15-year Site Reliability Engineer. She’s a former Googler and a current SRE and observability advocate at Honeycomb.io.
Liz evangelizes “Building Excellent Systems” that look out both for the people who use the system and for the people who work on it.
In this fascinating episode, she talks with Ledge about wide-ranging ethical issues in engineering, and about how it’s a business advantage to say you’re an ethical engineer.
Ledge: Liz, it’s so cool to have you on. Thank you for making time for us.
Liz: Thanks for having me, Ledge.
Ledge: For anybody who’s not familiar with you and your work, would you mind giving me a background story? A little introduction.
Liz: Sure! I have been a site reliability engineer, or DevOps engineer or whatever the heck you call it, a systems engineer of some flavor, for the past fifteen years.
I started my career working in video games and academic publishing and spent about eleven years at Google. I recently quit over ethical reasons and switched to working for a startup.
Now, in addition to doing some advisory work on my own, I am primarily focused on helping people run systems that don’t burn them out. So, kind of thinking about what makes a production system excellent.
Ledge: There’s a lot to go into there. Obviously, you’re talking to a lot of freelancers and I know that the product and client involvement is important. People want to believe in what they’re doing and support their own value sets through the work that they put out there. And I think there is a lot of creative integrity that goes in there.
Please talk about that. Are there heuristics to even think about? How do you check what you’re working on to make sure that it matches where you want to be and what you want to do?
Liz: One of the fascinating topics that I went through with every company I was interviewing at after I left Google was asking them the question, are there categories of customers that you wouldn’t take on? What are your criteria for declining a client?
That goes both for companies as well as for individual people. You have to think about, would I feel comfortable if the software I was building was used as part of a lethal weapon?
A less extreme example – am I comfortable with working for a company that is practicing discrimination or exclusion of certain categories of people? Or more mundane decisions that we make around, how strongly am I going to advocate that if I’m building an app that I make it accessible to people using screen readers, or people that have tactile issues with using like a touch screen?
There are a lot of areas that are kind of interesting ethical choices because they are where we express our values.
Ledge: Wow! There’s three big questions right there. But, how do you even get into that if you’re sort of a person that isn’t familiar with evaluating your work in that way?
I think we all, maybe, have a sense that we would like to work for a thing that aligns well with our values but you actually have distilled this into a question set or something from your own experience that kind of goes, “Hey, I do want to know about that.”
How can you advise people to ask those questions well and properly?
Liz: A lot of it boils down to the questions that privacy reviewers ask at larger companies. Like, who is the intended user of this, who is a bystander who could potentially be an unintentional user of it or unintentionally impacted by this?
And asking those questions that go beyond the immediate impact of your work. Understanding, what’s the broader business strategy? Why is someone coming to you for your expertise?
There was a story recently about the engineers at a company called Clarifai that were doing kind of image recognition work. It turns out that the engineers in the company weren’t told it was being used by the Pentagon.
You have to be prepared to ask those deeper questions. Not just what am I building, but who are the indirect users? Where is this going? Where is this leading?
Ledge: It takes a lot of, like, personal and mental fortitude to risk your income and make those choices and be the boat rocker. Is that something that came naturally to you, or did you have to evolve to that point? Because it just takes a lot of strength to stand up and ask.
Liz: I think it’s a larger issue for folks who are working salaried for companies, where they could potentially lose their full-time job.
There are plenty of opportunities that are out there and, in fact, it’s a business advantage to say that you’re an ethical engineer. It’s a business advantage to advertise, “These are the things I want to work on, these are the things that I don’t work on.”
Because people know that, if they give you work and that’s aligned with your mission, you’re going to do an excellent job and you’re going to be passionate about it. And people seek out those opportunities.
Whereas, if you start working in a company as a full-time employee, you don’t necessarily have a lot of direct say over it, aside from striking or walking out.
Ledge: Right. Both of which you have experience with. The only place I can think that I’ve seen that come up is someone who can go, “I’m a certified ethical hacker. I’m sort of a white hat.”
But you’re advocating to expand that into the broader set of, like I said, it applies to anybody? Yes, it’s about engineers in the context that we’re all talking but there’s certainly no reason that it couldn’t be the bookkeeper, or the financial person, or HR. Any of those things, they’re going to touch all these areas. It’s like a more human kind of business exposing a whole layer that maybe we haven’t talked about before.
Liz: I totally agree with you. Although I do think that engineers, if we’re going to call ourselves engineers, we actually have engineering professional responsibilities. That people who are mechanical engineers have a duty to not cause harm with the structures that they build, to make sure that they’re engineered safely.
We need to think about that if we’re going to call ourselves software or computer engineers.
Ledge: Yes. That’s a good point.
Liz: Speaking of the subject of going beyond and looking at the implications of the software that we build, I think that there’s also an area that I’m super interested in which is the idea of building excellent systems.
I think that building excellent systems has to look out for users but also has to look out for the people who run the systems, who are often us. Or in the case of people who are working freelance, who are going to be the people who are kind of your longer-term clients. The people who have to maintain the stuff that you set up for them. How do you set up the durable process so that you have long-term happy customers and not just short-term?
Ledge: Talk about that. What’s an excellent system? The hallmarks of. How do I know that I’m excellent or not?
Liz: An excellent system is, in part, a system that minimizes complexity. It reduces the amount of overhead of stuff that is incomprehensible or hard to understand, and you’re designing it to have appropriate degrees of visibility. So someone can look under the hood without having to necessarily rebuild the whole car.
But I also think that there’s a piece with our customers of having to instill in them that they are going to be the long-term owners of this. That they have to ask you now. Kind of instill that culture of curiosity to be able to ask the questions, rather than being handed a box and then, “Okay, here you go.” “Well, I guess I don’t know how it works. I guess I’m going to come back to you in six or 12 months asking, ‘Hey Ledge, how did you build this?’”
Ledge: That happens all the time. We often really want to educate people and you’ve got to make…
I come from maybe more of a business and organizational background and I think a lot about business continuity. Are we making choices that are sustainable, absolutely, or ethical in the right things? And also, let’s say we have a fiduciary duty to the people around us in our business; we’ve got to make choices that at least can help drive the going concern here.
By making and shortcutting decisions that are cheaper, or don’t take care of people, or build things that are total technical debt disasters waiting to happen, are we really making good choices for the business?
They’re markedly not excellent systems if you don’t put the time and energy into that.
Liz: It’s definitely reducing complexity where you can, using patterns that people understand well. Like not inventing your custom framework if something off the shelf will do.
Documenting. Like, here’s what to do to stop the bleeding if something does go wrong. And then, here are the common places to start debugging the system.
All of these are various choices that you can make in how we architect this thing and how we design our culture around it.
If we’re coming in at a more senior level, it’s also thinking about how we can model good behavior for people, as well as just giving people code.
Ledge: Just not a lot of hours left in the day if you do all of those things at the same time.
I completely agree with you that that is incumbent upon leadership. And the leadership that anybody can take in their given role. Lead from the bottom, lead from the middle, lead from the top. Different types of things.
If you’re in a resource-constrained environment there’s going to be a lot of pressure to think about allocating capital, human or otherwise. How do you advise people to make that balance when it just feels like every minute we spend is so valuable because we’re burning cash like crazy?
Liz: Everything has to reduce to that argument you had about managing business risks. Can you quantify the risks? Can you talk about, what would happen? How likely is this to happen? What’s the blast radius?
These are all great questions that can help you really narrow down what’s a priority.
For instance, in this most recent talk that I wrote, I use this analogy of a bridge that’s falling apart. Cars are falling through the roadbed. Maybe it needs an earthquake retrofit for when the big one comes in twenty years. Which one are you going to fix first?
Well, you’re probably going to fix first the fact that cars are falling off through the roadbed because there’s no safety rail.
So, figuring out, what’s important to fix? What really impacts the bottom-line versus what’s a nice-to-have? What are customers actually complaining about? What are people churning because of?
What are employees quitting because of? That’s also a risk to consider. Your system is not going to run itself. We aspire to that but, at the end of the day, humans run our systems and if they all quit then your system won’t run.
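[Illustration, not from the episode: the questions Liz lists here (how likely is it, and what is the blast radius) can be turned into a rough ranking. The sketch below is one minimal way to do that, assuming you can put approximate numbers on each risk. The `Risk` class, the `prioritize` helper, and every figure in the example are hypothetical and chosen only to mirror the bridge analogy.]

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """A single business risk with rough, hand-estimated numbers (assumptions)."""
    name: str
    likelihood_per_year: float  # estimated chance of it happening in a given year, 0..1
    blast_radius: float         # estimated share of users or revenue affected, 0..1

    @property
    def expected_impact(self) -> float:
        # Simple expected-value score: how likely times how widespread.
        return self.likelihood_per_year * self.blast_radius


def prioritize(risks):
    """Return risks sorted from highest to lowest expected impact."""
    return sorted(risks, key=lambda r: r.expected_impact, reverse=True)


# Hypothetical numbers for illustration only.
risks = [
    Risk("earthquake retrofit", likelihood_per_year=0.05, blast_radius=1.0),
    Risk("no safety rail", likelihood_per_year=0.9, blast_radius=0.2),
    Risk("on-call burnout and attrition", likelihood_per_year=0.5, blast_radius=0.3),
]

ranked = prioritize(risks)
for risk in ranked:
    print(f"{risk.name}: {risk.expected_impact:.2f}")
```

[A single multiplied score is a deliberate simplification. It is mainly useful for starting the conversation about which fix comes first, and it treats people quitting as a risk in the same list as technical failures.]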
Ledge: Absolutely. You get this idea that, I mean, most companies are spending 60 to 80% of their total cost load on humans and what do we do to make sure we get return on investment?
And in an ethical way too. So we can’t treat people like cogs and just run them into the ground. The highest way to get ROI is going to be to work 24 hours a day. You also can’t do that because there’s down.
Liz: And yet companies try to put people on call for 24 hours straight. It ends really badly when you have people who are tired and fatigued making bad decisions.
That’s a business risk to highlight: if you don’t have this idea of focusing on production excellence, one day it’s going to result in a catastrophic outage.
Ledge: Right. Or drop a table in production or whatever. Everybody has their manifestation of that where like, “Oh, there’s no undo button because I was tired.” And, three days later into my production problem, I’m making radical errors.
We’ve all seen that happen. And I imagine in the SRE seat you’ve seen some insane things cause unreliability. Any good stories there?
Liz: I can think of a fair number of them although client confidentiality is important.
There’s another area that we haven’t really touched on, which is kind of the dark debt concept. We understand the technical debt that you can see, but it’s harder to understand that people are spending the first 20 or 30 minutes, after they get paged, staring at a wall of dashboards or reading playbooks that are out of date. Things where they don’t have the appropriate level of visibility into the system before they even can start fixing it.
That’s an insidious form of debt that people don’t necessarily think of, because it adds to every outage that you have but it’s not necessarily in your face with eyes blinking.
Ledge: Yeah, I’ve never heard that. I totally understand what you’re talking about. Dark debt. That’s a great phrase, it’s like the dark energy of the universe that adds all the weight. Yeah, and repetitive problems that we could address maybe with, I don’t know, documentation spike.
You have to pay those off en masse because, I think, they do have, if you will, the highest interest rate of any of the problems. And, you’re right, they bleed you to death because you weren’t paying attention to them.
Liz: Totally! A common metric that I use to think about is, when you have something that alerts a human being you have to ask the question, “Was it actually urgent? Is this something that we’re running into over and over or is this actually a new problem? Did we do something about it or is it just something that we could do nothing about? Did we actually try to address some of the leading causes of it?”
And I say causes, plural, because stuff is maybe triggered by one thing, but you have to look at everything that’s happening. And you talked about the question of how, rather than why, did this happen, kind of at the service level. If you instead ask the question of how did this happen, over and over, that generates a lot more insight into how we can make our systems more resilient.
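[Illustration, not from the episode: the questions Liz asks about each alert can be recorded and tallied. The sketch below is a minimal version, assuming someone fills in a short review after every page. The `PageReview` record, its field names, the `summarize` helper, and the sample alerts are all hypothetical.]

```python
from dataclasses import dataclass


@dataclass
class PageReview:
    """After-the-fact review of one alert that reached a human (hypothetical record)."""
    alert: str
    was_urgent: bool      # "Was it actually urgent?"
    was_actionable: bool  # "Did we do something about it?"
    seen_before: bool     # "Is this something that we're running into over and over?"


def summarize(reviews):
    """Return the fraction of pages that were urgent, actionable, and repeats."""
    total = len(reviews)
    if total == 0:
        return {"urgent": 0.0, "actionable": 0.0, "repeat": 0.0}
    return {
        "urgent": sum(r.was_urgent for r in reviews) / total,
        "actionable": sum(r.was_actionable for r in reviews) / total,
        "repeat": sum(r.seen_before for r in reviews) / total,
    }


# Made-up sample data for illustration.
reviews = [
    PageReview("disk full", was_urgent=True, was_actionable=True, seen_before=True),
    PageReview("daily job error", was_urgent=False, was_actionable=False, seen_before=True),
    PageReview("checkout latency", was_urgent=True, was_actionable=True, seen_before=False),
    PageReview("daily job error", was_urgent=False, was_actionable=False, seen_before=True),
]

summary = summarize(reviews)
print(summary)
```

[A low urgent or actionable fraction alongside a high repeat fraction is one rough sign that alerts are interrupting people without giving them anything to fix.]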
Ledge: Absolutely! That makes a lot of sense. I’m reminded of when I had to write all kinds of cron jobs and alerts and everything. And you go, “I really ought not to send this email unless it links directly to documentation that allows someone to fix it.” We’ve all had log fatigue and Slack fatigue, and it’s really easy to go, “Send that to /dev/null and auto-trash those emails and mute that Slack channel,” because it’s completely overwhelming.
It is, I guess, incumbent upon SREs in general to make sure that you provide an environment where people feel empowered to be able to fix a thing when it does happen.
Liz: Exactly. I know of environments that didn’t have production excellence, where people felt unsafe turning off alerts because someone was saying, “Oh my god, this broke ten years ago and we caught it because of this alert.” You have to look at the signal-to-noise ratio. You have to look at, do people feel empowered to run the system as it is today and actually make it better?
Ledge: Yeah, and like, “What the hell is that error?” “Hey, that happens every day at noon.”
It’s sometimes the canary in the mine. We aren’t doing root cause analysis on this so it comes up every day, and now it brought down the entire system. All of a sudden, then people care.
It’s also incumbent on us in that seat to translate some of this stuff into a metric that maybe business users or people in decision-making positions, capital allocation positions, could go, “It’s really, really important. It’s not showing up in a line item on your income statement but it is. You just can’t tell.”
Liz: This ties to the first thing that we were talking about. When you start speaking up about these production excellence items, that kind of lends itself naturally towards, in the long-term, talking about these ethical risks as well of, sure, it’s going to boost your bottom line in the short-term but what about employee morale? What about employees who feel that they were tricked?
You can start having these conversations, not just about dark debt but also about ethical issues.
Ledge: Isn’t the dark debt in all of our organizational decisions as well?
I think one thing that we can trick ourselves into doing is drawing these glorious pictures of our organization and our culture. Sticks and boxes, and look at this communication, and this is so great.
I think of it, maybe from some of my sales and marketing experience, the funnel is leaking no matter what. The question is, where is the leak? And if you don’t know where the leak is, chances are pretty good that it’s actually worse than you think. And not flagging it, the unknown unknowns or the ones you willfully ignore in your cultural development, is going to cause the biggest payoff later when people do leave.
It’s sort of when Atlas shrugs, if you will. The people that are actually doing the work all take off for utopia.
What do you recommend on the cultural development front? It’s so easy to ignore this stuff and it is the most expensive dark organizational debt.
Liz: It’s helpful to have a practice of… Have you heard of the concept of a pre-mortem?
Ledge: Predicting, I’m guessing, all the things that could go wrong?
Liz: Yeah. It’s kind of an interesting exercise in terms of asking yourself, “Okay, we’ve launched the thing that we’re working on. The news reports say that it’s a disaster. Why is that?”
You can’t necessarily predict everything but you can think about, what are some of the potential things out there. And then think about, not necessarily patching each one of them but instead thinking about, what are our structural processes for encouraging fixing those issues before they become crises?
So, talking about unknown unknowns, you can at least use your known unknowns to kind of bootstrap the techniques that you need to use to approach that either organizational or technical complexity. It doesn’t matter which, really.
Can you ask those questions? Do people feel safe raising their hand?
The Boeing thing is going to be a huge case study. I can see it in all of the business school textbooks twenty years from now. Where the engineers said, “Hey, wait a second, this 2.5 degree change is actually turning into a 5 degree change.”
Ledge: It’s the same way for those of us old enough to remember the Challenger. And the great body of case study work that came out of saying groupthink is bad. And then the work that comes out around psychological safety that says, our technology and the solutions we build are going to mirror the organizational complexity in which they were built.
And it’s sort of like how pets and their owners start to look the same. That does happen. And the question is, does that align with a set of values that we can all be honest about when we’re launching an organization that has a vision and mission?
Liz: The other cool thing here is thinking as well about, not only is it safe for people to raise concerns, but also is it safe for everyone you’re bringing onto the team to raise concerns?
It doesn’t make sense to build a diverse team and then not listen to their concerns. Nor does it make sense to say, “My team is not diverse. Everyone feels perfectly comfortable raising their concerns if they’re in the room, but the people outside the room are not listened to.”
That’s an area of organizational risk as well.
Ledge: There’s so much opportunity for diversity of thought. There’s as much opportunity for giving lip service to diversity as if it’s some kind of destination. That, “Wow, yay, we’re diverse now because we can measure a metric.” But what do you do with that very, very powerful tool once you maybe have checked off some of the KPI boxes?
Finishing thoughts, Liz. What do you want to have 10,000 freelancers know from your experience that can make their career path more rewarding?
Liz: I think that it’s important to vet your clients. Understand what is it you’re building. Who is it going to directly and indirectly impact? And including, in the set of people that are impacted, who are the people running the systems? Because the people running the systems long-term are definitely in that set of people who have to live with the decisions that you make as well.
Think about, how are you de-complexifying things? How are you listing out the risks that you’re aware of? And mitigating some of that by thinking about, how do we expose all these signals that we need and how do we make people safe to raise these concerns about, “I think that was not enough.” Or, “Hey, can we understand how these two components interact?”
All these ranges of problems that we can address earlier rather than when we fail.
Ledge: All of which bring you more complexity that you can then break down into less complexities. It looks like our job is never finished.
Liz: Yep! The job is never finished. But we have an especially high-leverage industry. We can always address problems as they arise and try to get ahead of them.
Ledge: Liz, thank you for the insights. I know you’re extremely busy and that you just started new things. So, best of luck in all of that. It is awesome to have you on.