June 25, 2019 · 19 min read

Engineering management tips for worldwide Ops, Dev, and Leadership with Alex Newman of Intuition Machines

Scaling an engineering organization is challenging enough when you’re colocated, and when your whole team is developers. What happens when you also introduce data science and 24/7 SRE? That’s three entirely different types of work. Now distribute those workers around the world and imagine the complexity.

Ledge sits down with Alex Newman, serial founder, engineer, and management thinker, to discuss how he and his co-founders at Intuition Machines think about and solve these problems. The key word? Evolution.

This episode is chock full of great value-driven techniques like blame-free process, accountability and autonomy, and ownership.

Alex Newman

Co-founder of hCaptcha and Intuition Machines

Alex Newman is the cofounder of hcaptcha.com and imachine.com. Alex has been working in open source for 30 years and continues to work on OSS today. He’s founded successful companies in the past such as ohmdata and was an early employee at Cloudera. Alex is passionate about building happy teams which trust each other.

Ledge: Alright. Alex, thanks for joining us. Why don’t you give a quick intro of yourself and the company.

Alex: I’m Alex Newman. I run the technical team here at Intuition Machines / hCaptcha. I have a background in distributed systems and running dev teams at hCaptcha and Intuition Machines.

We are focused on providing machine learning services to the world. So, democratizing access to machine learning.

Ledge: Fantastic. Really exciting. You and I, off mic, were talking about the ways in which you’ve had to innovate on worldwide team management around different types of product delivery and sub-product delivery. Different teams. Different places.

I just wondered if you’d maybe dive into that because I know a lot of our clients and a lot of our listeners are in the same position. Just trying to make sense of delivering product in different ways across different groups of people.

Alex: The way I think about our team division is three-fold. As you pointed out, we do have a remote team. About half of our team is located here in California – most of the management – and the other half of the team is distributed across the world: Finland, Ohio, Texas, Florida. So that does present some type of challenge.

Additionally, we also operate three types of technical teams. We operate 24/7 systems, so that means we have an operations team, an SRE group. We write a lot of code so we have a developer group. We also do machine learning, so we have staff scientists and research projects going on all the time.

To accommodate these remote schedules and these varied team tasks, we’ve had to be pretty disciplined about how we manage projects. Obviously, when you’re working in the same office with someone you can work in very tight iteration loops. But what we rely on, since we have so much remote, is we infuse a lot of trust on both sides of the development platform. We put a lot of trust in developers, and hopefully they put a lot of trust in their management – to the point of actually pushing for vulnerability.

We provide a huge amount of autonomy. We lean a lot on process to maximize the amount of autonomy we give to the dev teams.

Finally, we really focus on ownership. I think the term that people use now is extreme ownership by the teams of the problems they’re working on. This is infused across all of the management and the engineering teams.

These are high-level values, so it would probably be difficult for your listeners to know exactly how we apply them, so hopefully we can get to some of those examples.

In the meantime, the other thing I’d like to point out is, there are not just these values for our teams, there are also these different types of teams.

The way you think about running an operations team versus a dev team is an interesting place to start. Obviously, we want accountability for people’s actions. It’s really important for the ownership and autonomy to work. But when you’re operating an operations team you have to be very careful about how and where blame is applied.

There’s been a lot of research on what blame-free operations look like, and we follow those types of values for our rotating operations staff. But on the dev team side, what we do – and this goes back to that ownership thing – is we really try to push the accountability at a team level. Making people realize that the minimal unit of people working on things is two people, not individuals. That way there’s at least a group responsible for the successes and failures of those actions.

This is a little bit different. On the operations side, it’s a blameless process. On the dev side, we push towards team ownership. Finally, on the management side we try to be very upfront. When we see four, five, six type failures that led to the larger failure, that’s a sign that really the management needs to take ownership of these types of issues – even though people on the dev team or the operations team might feel as though they are the ones who take responsibility.

That’s kind of not only how we apply our values, but how we tactically apply our values to the right unit of people to maximize the effectiveness of each individual team. I’m happy to provide more specific examples for all of them, but I think that’s, at a high level, how we think about responsibility and allocation of work, and it trickles down to everything we do.

Ledge: I would call that highly evolved. Maybe a couple minutes about, okay, you got here from being ultimately a couple of founders and having to think about scale and building out a business. There are mandates that are necessary to deliver for customers, and then engineers and operators et cetera in teams need to get to that point.

What’s the evolutionary picture and path of doing that? How do you even know that you need it? There must have been symptoms and results, and trial and error. What did that look like?

Alex: Yeah, evolution is the right word to use. My co-founder and I have founded a series of companies, and I like to say that I made every mistake along the way before I came to this strategy. So mostly as a result of failing over and over and over again is where we got here.

A lot of the motivations and the ways we organize things have been kind of inspired by larger movements that are going along, but if you look at where we were, we have increased by roughly an order of magnitude. I’ve worked at companies that have doubled in a year. We’ve gone from a couple of people to over a dozen people now. Even that is growing at a rate that is quite difficult.

One of the reasons why we pushed this particular model while we were growing was mostly as a result of allowing those teams to move independently while they were being trained, and also for us as an organization to move on multiple fronts at one time.

Another way of saying this is, what’s the alternative to the thing? The alternative to autonomy is trying to micromanage or manage your workers, which to some extent can help. It can also help with learning and bringing people up to speed faster, but it means that now your executives or your higher level people are spending all that time providing that management, rather than focusing on looking to customers.

So a lot of this stuff isn’t so much like, “Wow, aren’t we smart? Let’s adopt these things to do it better,” you can actually think of it as, “Wow, we’re moving so quickly. Customers really want to pay us money to deliver these products. How do we move in all of these directions?”

The answer to that was really focusing on making sure that the right skin in the game was at the right level. So, now when executives are dealing with customers and the customer says, “I need this feature in three or four days, I should have mentioned this earlier,” it’s actually the executive who’s negotiating the contract who is responsible for implementing that feature if it is a triage action and it doesn’t fit in the individual sprint.

That’s how we get the appropriate levels of ownership. If you’re going to agree to do something outside of the traditional process, then you’re going to be the one who’s on the hook for doing it.

So the question is, well, why do we do it that way? It’s because we didn’t used to do it that way and it caused a huge amount of problems. Then we realized, oh, what’s causing this problem? The reason why this problem was caused was the person who was agreeing to do something for a customer doesn’t have any skin in the game. They’re not the ones coding it. There’s no… Even though they’re coders and they could code it, their natural impulse is to please the customer and not to figure out, okay, how are we actually going to get this done? Do we have the resources? What are all the individual steps?

It was that realization: figuring out who is responsible for what. Figuring out how to get that responsibility in the right place. Okay, now that we’ve got the responsibility in the right place, can we just give the person implementing that responsibility enough power to get it done on their own? That’s where the values came from.

We evolve the same way you evolve in nature, right? Nature beats you over the head and you either succeed at changing or you die. So far, we’ve stayed alive by being willing to adapt to these things that we’re learning from our outside environment.

Ledge: I’d imagine that you don’t have resources just sitting around waiting for the next emergent customer issue to be sold upon, so how do you deal with… Let’s presume that most of the time for any team member is pre-allocated, and they’re already doing something of value for some customer. How do you figure out how to pull from that bench and essentially reallocate that time to what is now dubbed as the next most important thing?

You can imagine quickly where that becomes like, oh, well we have six number one priorities, which one is really number one?

Alex: Part of it is, we did build a triage team which is the only team that works outside of a sprint schedule. The triage team is currently my co-founder and I. The people that negotiate the contracts are the people who will triage emergencies.

That is different than the on-call staff that might get woken if some bit of infrastructure, something like that breaks. The triage team does have the capability to deputize people who are working on sprint, bring them off sprint. There’s a whole process for how they go off-sprint, how they announce it in the standup channel, how we track how much time is being spent off-sprint.

Overall, the way that we prioritize these last-minute things is, once again, really pushing back on the person who’s agreeing to do these last-minute things with the knowledge that there’s a 75% chance they’re going to be coding it.

I should also mention, we’re a little bit of a weird organization because we’re already half-a-dozen people and everyone in my company codes. I don’t think this is normal for most organizations. Most organizations by now would have product managers and project managers. That’s something that we might certainly get to, but we are in a unique situation where anyone on our team could help triage customer issues.

So, what we’ve done is we’ve said, okay, the person who’s going to be agreeing to deal with these triage issues, they’re actually going to be the people working on it. That way, we won’t agree to anything crazy.

If something does come up that’s existential to the business, we can pull people off of high priority tasks. We actually have a developer-on-call program, in which one developer every sprint allocates some amount of their time towards technical debt. They’re usually the first person we will pull off their sprint.

But, normally people are coding independent of the emergencies going on. What ends up happening is development gets slowed down.

We’re lucky enough to be in a company that’s been overwhelmed by customer interest. I think that if someone was earlier in the process and still working on initial product market fit, that tradeoff would probably not be the right one to make. They should probably focus on business. But in a world where you have more customers and customer opportunities than you have engineering capacity, the tradeoff is to basically limit that inflow of growth and to do the right thing for the customers that you have by triaging them.

I’m sorry if that was a very long, complicated answer to a simple question but I hope it gave some value.

Ledge: All answers to all questions tend to be complicated when there’s a good answer. The reality is, it depends.

This particular model I imagine was not the model before, may not be the model tomorrow. What heuristics and what ways do you monitor… How do you know when you need to go back to the drawing board when you’re 16 people, or 20 people? How will you know, and what metrics do you watch, in order to get to that point?

Alex: That’s a great question. It definitely won’t be the way that we’re going forever. We’re already trying to build up our triage team and help our executives on the biz side.

I think, abstractly, when you think about these types of issues of allocation of resources and building teams and all this stuff, we have a lot of good techniques that we can adapt from the scrum community or the sprint community, the agile community. Doing these things of where things fell over and when they went down.

In general, it’s pretty normal for us to bring people from the developer/operations staff over to triage activities. In addition, we also try to push developers more and more to deal directly with customers, to keep the executives unblocked.

I think, at a high level the way you get there is just, going back to what I was saying before, this trust thing. I can’t help but think that if the developers and the management have the right relationship, your developers will give you things that smell like you should be doing this, right?

So, if your developers are complaining that you’re agreeing to things that are unreasonable and that they don’t feel like the sprint schedule is reasonable, the right thing to do is to push that ownership to the person who’s going to have to do the work. So if they’re complaining that you’re bringing features that are unreasonable, bring that developer into the customer. Make sure that they’re in agreement of what’s going on. Just eliminating all the telephone games that you see in an organization I think is a big deal.

We can get to some of the detailed ways that we try to push the transparency out of people, but I think that… There’s this ancient phrase, ‘Those the gods wish to destroy, they first make arrogant.’

I think that, what you see with most organizations is not that they don’t know that they have to do these things, or they might even know that it’s a possibility, the real issue with organizations is recognizing internally where they’re weak and being honest about that. Writing it down. Measuring failure in an unemotional way.

If you can do those types of things and just stay connected to the psychological literature, stay connected to other leaders, these actual solutions will just kind of pop out.

I have to admit, now that I’m saying it out loud, I don’t think any of these ideas are mine. I’ve been a big fan of the ideas that we see coming out of the military and now entering the civilian programming sector about how they run teams, how leadership flows from them. Embracing that community and other communities that have well trained leaders or leadership teams will suggest techniques to try.

So then the question is, how do you know if they’re succeeding or failing? I think that goes back to the values of the organization. If you’re in growth mode, then all that matters is growth and you should measure that as success and failure. If you’re in honing and getting better and expansion mode, then come up with metrics that match that.

On a quarterly or monthly basis, the management should do an analysis, what are our high-level goals and how do we communicate them most effectively to the people?

I know a lot of this stuff sounds like high level stuff, but it’s actually quite actionable. When I talk about making sure that responsibility, the right skin in the game, is properly allocated, how do you know when you’re doing it right? How do you know when you’re doing it wrong?

That actually is quite simple. If the right people don’t have ownership of the problem, it won’t get done effectively. If the right people have ownership of the problem, it’ll get done quickly and it will get done effectively without much oversight. If you find yourself having to put in a lot of oversight to get things done, that’s a sign that you have not allocated appropriate skin in the game to the individuals solving problems.

I hate to harp on this stuff over and over again, but this comes from being in engineering environments where this isn’t the case. I’ve worked at some pretty impressive large companies, worked at companies that have grown from five people to multibillion dollar companies, and I can tell you that the departments, teams and groups that were successful were the ones that really felt like they had full ownership, full autonomy and a real vision of how that product was going to be used.

So we’re really stealing those good ideas that the high-performance organizations were having, and trying to bring them back into our enterprise.

Ledge: What’s your hire/fire philosophy? Not everybody makes the cut. How do you deal with that? Adding people into a highly functioning organization carries a greater and greater deal of risk of mis-hire.

Alex: I don’t know if we talked about the 70 people I rejected before I filled a particular role – actually I think it was like 60 people, something like that.

We put a lot of trust in the interviewing process. We lean pretty heavily on short conversations and take-home homework. The real goal for us is to build as accurate a work simulator as we can. Basically, to build the environment that they would be working with, and see how they work in that environment.

I don’t want to say that I give people actual work that we have to get done – I’d be a liar to say that in previous jobs I haven’t used that mechanism to get good ideas of how to solve problems – but they are as close to real as we can make them. I mean as close as we can get to the real problems we are facing today. Part of that is just because as your organization changes your needs change as well.

I personally also really like teams of people who are more careful than me, and that’s kind of part of the bar that I have for the homework problem; is this person going to make me more conscientious by working with them?

The reason for that is not because I’m very careless or anything, it’s when you think about assembling a team, what you’re doing is you’re assembling a team of varied personalities. In our case, we hire people from all over the world and we have to not only get a very diverse team here in the United States, but across the world. So understanding those cultural differences, understanding those personality differences, understanding how all those will flow together is part of our interviewing process.

However, on the code side, we’ve been pushing more and more to make that process as impersonal as possible. So the actual code review portion, I think we’re moving to even eliminating names from the reviewer who is reviewing the code.

This gives us three axes that we’re making decisions from. The executives say, “Okay, how does this personality or coder type fit in with the team of coders?” The advocate or interviewer who is doing that initial once-through review is really saying, “Are they going to raise the carefulness that we want on this team?” Because carefulness is something that gets harder and harder as you grow. Then finally with the homework assignment side, we really do want to have a standard evaluation mechanism without resending the same homework problem over and over again – that obviously would have deficiencies. So we’re really pushing for that to be transparent. Like, if this was code in our code base, would we feel comfortable merging it right then and there?

It’s interesting. We’ve had some people where we’re like, “Hey, do we like this code? Do we want to hire them?” And they’re like, “Yeah. I think so. They’re pretty good. I don’t know.” And they were like, “Well, would we merge it into our code base as is?” And people were like, “Oh, of course not.” Then we say, “Well, we wrote the homework assignment as if they were supposed to write production worthy code. This isn’t production worthy code so it’s pushed back.”

That mechanism has made things pretty easy and clear-cut. From a hiring strategy perspective, we do like to have teams rather than individuals. This can make onlining new regions a little bit tricky, but usually good people have friends that they want to bring along with them. That’s been our experience as well.

So kind of this trifecta… I feel like there’s a lot of trifectas in our talk, but this trifecta of, do they have this personality trait or this superpower that we need right now? The executives feel like they fit well in the team, and finally this unbiased homework problem, we found to have a great impact and results for hiring our remote team.

One caveat being – and this is advice for the viewer – in the past we probably asked homework assignments that were a little bit more open-ended, a little bit easier to get done in a minimal way, and my advice is actually time-bound the interview and make sure that you ask something harder. Give them enough so that there’s no way they’ll be able to complete all of it.

It’s much less painful to have someone not get all the way through but do a good job on it, and still hire them, than to have someone get all the way through the assignment and not be sure if you want to hire them.

So, definitely lean on the side of the harder problem, but be respectful to your interviewee by time-bounding it. That way they don’t blow 10 days trying to pass an interview.

Ledge: Fantastic. Let’s see. I’ll finish with one question. Knowing what you know now, what do you wish you would have done different at the beginning?

Alex: It would be nice to have this code base now. I think the biggest mistake that I’ve made in general is, when we were first hiring we really focused on getting people who had had successful startup exits, very well credentialed, top schools. A lot of them had had very large exits before. Just really getting people who I knew personally who were kind of in that top tier.

Knowing what I know now, I would have focused a lot more early on diversity. I think diversity is one of the most powerful things you can hire for. People who think differently from each other, and have different experiences, complement each other a lot better than people who kind of all code the same.

That’s not to say that their experience, good schools and good jobs aren’t valuable, but ultimately if you want a team to perform it’s very important that you focus on building the right team and not hiring the right individuals.

That was a painful lesson for us to learn, and I think that it’s something that people are starting to realize more and more in our industry: how important all sorts of diversity are. I don’t want to sound like a PC billboard. I think my definition of diversity might be more broad than other people’s, but really pushing for these happy, diverse teams. I think if we would have done that earlier we would have been in an even better space than we are today.

Ledge: Fantastic. Well, Alex, congrats on the success so far. No doubt, more down the road. Thanks for spending time with us today.

Alex: It was lovely talking to you. I hope it was fun.

Ledge: Absolutely.

Alex: Ciao.