We talk with Kelly Stevens of OneSpace about the importance of data-driven decision-making, especially in today’s modern architectures.
Ledge: Hey, welcome everybody to the Gun.io Frontier Podcast. I’m Ledge, VP of Services. Today, I am joined by Kelly Stevens who is the VP of Engineering for OneSpace.
OneSpace is a SaaS platform that combines consumer search insights, performance monitoring tools, and content optimization to help brands achieve the perfect digital shelf by responding to market changes and making product page updates with speed and at scale.
Kelly is an established technology executive with a passion for data and using analytics to uncover opportunities and drive business decisions. We’re excited to talk to her today.
Kelly: Hello, how is it going?
Ledge: Thank you for joining us. It’s good to have you. Off-camera, off-mic, you and I talked a little bit about your journey. I thought it to be a fantastic story.
Growing up from the developer track into the technology leadership track, I think that’s something that appeals to a lot of the people both on the client side and on the freelancer side, folks that we work with.
One thing you said that really struck me was that, as a leader, first of all, you weren’t really an expert or weren’t allowed to be an expert in any one thing because you had to have a broad swath of subject matter expertise; and that one important part of evolving into a VP of Engineering role was really learning about what was actually important to the success of the team.
You had many potential priorities and there were some things that you found could be ignored early on and some things that couldn’t. And, at the end of that journey when you finally made that full evolution ─ at least, in today’s point because none of us are fully evolved yet ─ what matters in the end?
I wonder if you could talk about that from the picture of your journey.
Kelly: What matters? At the start of it, coming from being an engineer who kind of had my area of expertise to evolving to my leadership vision, I think the biggest challenge is recognizing that you can’t be an expert in everything. That’s what matters.
And then, what matters is learning enough that you’re knowledgeable across a multitude of disciplines within engineering. So I oversee IT and product engineering and their solution engineering.
And so, it’s being aware and educated in all of those areas. But, then, it’s knowing enough about each area so that you can lead the team but recognizing that you don’t have to be an expert.
That’s hard because when you come from an area where you are an expert, it’s very uncomfortable, sometimes, when you feel like you don’t know all the answers.
Ledge: Did you find it was easy or difficult to assemble people to fill in the gaps? Did you know what the gaps were as you were scaling the company?
Kelly: No. You don’t always know your gaps until you’re in a top position. We’ve always been fortunate that we’ve had a small team. We’ve had a small team of incredibly smart engineers and incredibly smart people. We’ve always kind of classified ourselves as scrappy.
If you’re talking about a startup world, you need to be scrappy because you don’t have all the money to invest in all the different resources.
Early on, I don’t think we knew what we didn’t know. You never know what you don’t know. We didn’t know what gaps we had so we made the best with what we had along the way.
Sometimes, that required getting my hands dirty and getting down to the details of things and asking other team members to go into a deep dark hole to learn something new or to just peel that onion. If you’re investigating an issue, you have to get deep down to the deepest darkest layers and try everything you can.
But, over time, you do learn your gaps: the need for DevOps and monitoring your system and recognizing that that was a huge gap that we had and we didn’t have the expertise in-house to fill that. We’ve learned over time and filled those roles.
Ledge: Right! I love this. You talked about playing whack-a-mole for finding errors and fixing things in production, and then maybe evolving to a data-driven approach to specifically logging errors at every level of a distributed services system.
You might say, “Hey, there’s a 500 error. Where did that come from?” and it’s not being able to know that and sort of digging in by hand versus maybe even preemptively knowing that.
How has that changed both technically and culturally for your team?
It sounds like maybe there were some old habits and behaviors that needed to be broken.
Kelly: I think I’ve been in the position for almost five years. We were a monolithic architecture ─ a single .NET application ─ and pretty easy to debug. And when there was an error in the system, it pretty much told you exactly what line it was on, and that was very easy to fix.
Over time, we grew. We had a single engineer who was kind of in charge of monitoring our database. When he departed the company, it became very evident that he was kind of keeping the ship afloat behind the scenes and all these issues started to kind of bubble up to the surface.
We started seeing more errors. We started seeing more disruption in the system and I felt very blind at that moment. I had no idea where to look.
“Hey, the website is loading slow.”
“Well, great! What’s actually causing it to load slow and why?”
We didn’t really have any insights there. So we slowly started building and monitoring ─ our database seemed like a hot-button place to look at.
To fast forward to now, we are a ─ I wouldn’t call it a “microservices architecture” but we’re definitely a services architecture. We’re distributed. We have probably fifty to a hundred servers running in our environment, running different services; and we have five different applications into our platform that are used by a variety of different users. And it is very difficult now to trace issues without the proper logging in place.
Two years ago, when we launched this new architecture, we quickly learned the importance of that and then, we kind of had to play catch up over the last two years.
We’ve invested a lot of time in both the tooling and our development practices to make sure that we’re tracing errors all the way through the systems so that, when we get a 500 error, we can more easily diagnose where it came from. Unlike with the monolithic architecture, you don’t get those nice little “Hey, on line 302, in this file, there was an error” messages.
That exists somewhere in the hierarchy of error logging but they get lost along the way.
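One common way to keep an error traceable across service boundaries is to tag every log line with a shared request ID. The sketch below is only an illustration using Python's standard library, not OneSpace's implementation; the names `request_id`, `handle_order`, and `charge_card` are hypothetical, and the two functions stand in for separate services.

```python
import contextvars
import io
import logging
import uuid

# Hypothetical names throughout; this is not any particular vendor's API.
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


def build_logger(stream):
    """Create a logger whose lines all start with the request ID."""
    logger = logging.getLogger("tracing_sketch")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(request_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def charge_card(logger, amount):
    # Stand-in for a downstream service: it fails and logs under the same ID.
    logger.error("payment declined for amount=%s", amount)
    raise RuntimeError("payment declined")


def handle_order(logger, amount, incoming_id=None):
    # Reuse the caller's ID if one was passed along, otherwise mint a new one.
    rid = incoming_id or uuid.uuid4().hex
    token = request_id.set(rid)
    try:
        logger.info("order received")
        charge_card(logger, amount)
        return 200, rid
    except RuntimeError:
        logger.exception("returning 500")
        return 500, rid
    finally:
        request_id.reset(token)


if __name__ == "__main__":
    buffer = io.StringIO()
    status, rid = handle_order(build_logger(buffer), 42, incoming_id="req-123")
    print(status, rid)
    print(buffer.getvalue())
```

Searching the aggregated logs for `req-123` would then surface both the 500 response and the downstream failure that caused it.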
Ledge: Right. So you need to aggregate all those errors into one standard type of tooling.
What specifically? What’s the best practice on tooling as far as you’re concerned? What has worked for you?
Kelly: There are three aspects to it. There’s the actual logging in the code; and then, we’re using Sumo Logic as one of the tools to feed out all the raw logs; and then, we have another tool that is a visualization tool which is kind of like the BI tool on top of all of the logs plus all of the other monitoring we have in the system that’s available through AWS, SQL Server, and MongoDB. These all feed into Datadog and Datadog provides those visualizations.
We have all different types of dashboards. We have live dashboards. We have historical dashboards. We have very specific dashboards for different things you want to look at.
But the two that we really look at are the live and the historical dashboards. The live alerts us to how our current system is performing as of right now. Are we experiencing any issues?
Then Datadog has the ability to automate alerts to us. Of course, we have it up on a TV in front of us so we can see. Things start flashing red when there are issues or we see certain graphs that typically are flat and they’re suddenly spiking.
We have these visual reminders on top of automated email alerts and our on-call person getting a text message. But, then, we also have the historical graphs. Those come into play so we see an issue on the board that looks out of the ordinary. But is it really out of the ordinary or is this just a pattern that we’re seeing?
Sometimes, patterns are good and, sometimes, you discover patterns are really bad.
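The question of whether a spike is "really out of the ordinary" can be made concrete by comparing the live value against a historical baseline. Here is a minimal sketch of that idea; the three-sigma rule and the sample numbers are assumptions for illustration, not how Datadog or OneSpace actually evaluate their graphs.

```python
from statistics import mean, stdev


def is_out_of_ordinary(history, current, sigmas=3.0):
    """Return True when `current` sits more than `sigmas` standard
    deviations above the mean of the historical samples."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + sigmas * spread


# Hypothetical response times (ms) for the same hour on previous days.
same_hour_history = [120, 135, 128, 140, 131, 125, 138]

print(is_out_of_ordinary(same_hour_history, 150))  # False: within normal range
print(is_out_of_ordinary(same_hour_history, 400))  # True: worth an alert
```

Comparing against the same hour on earlier days, rather than against the last few minutes, is what separates a recurring daily pattern from a genuine anomaly.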
At some point in time, you see that there’s a problem occurring. You know it’s growing. You know you’re going to reach kind of a boiling point of the system just crashing. But you don’t really know how to articulate what the problem is and you don’t know how to quantify it.
And when you’re back to the startup world where you’re scrappy and you have very few resources, it’s hard to ask the leadership team or the exec team, “Hey, we need to invest time in tech debt” or “We need to invest time in this performance area” because they don’t necessarily see the impact of it. They don’t see the immediate relief. They’re not seeing the errors behind the scenes.
They don’t see it until there’s a problem. They don’t see it until it’s a complete database failure and the system is down for three days which happened to us two years ago.
Ledge: Right. The real story is that ─ and you only know that after it cost you a lot of money. You talked about sort of from the leadership seat that you had a sixth sense that things were going to fall apart but you couldn’t quantify it without the proper tooling.
Kelly: Correct! During the time leading up to when we had what I call “classified catastrophic database failure” where our system was down for three days, we had all kinds of signals along the way.
Sites were loading slower. We were seeing more errors. We were getting more reports of issues from our users. We could see things happening in SQL Server but we didn’t have a way to quantify it. We didn’t have a way to pinpoint exactly what the issue was until it became so noticeable that you had to address it.
That was kind of the launching point and it got everybody’s attention. We had to put together a plan. How were we going to prevent this in the future?
That’s why we spent the last two years investing in all these tools. Now, when we do have that sixth sense, we can go back and say, “Okay, what are we seeing from history? What is the performance of our system over time? Are we seeing something change significantly out of the ordinary or is this just a blip?” And, now, we have ways to trace it back to “Are we seeing increased usage of our platform?” which, in turn, could impact things.
In that case, maybe that’s okay or “Is there a specific query that’s growing in execution time?” We can see those things now. And, now, we can try to get ahead of them before they become problems. It’s great to see and have that insight which, in turn, when we’re doing planning for the year or a quarterly planning and it’s “Hey, we need time for technical debt,” one, we have something; we have data that we can point to, to say why we want to do it.
But, also, when it comes to the end of the quarter or the end of the year and we invest all the time and money in tech debt and performance improvements, we actually have dashboards that reflect the progress that we’ve made.
And that’s probably been the most exciting part. We actually get to see the impact of what we’ve done.
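The question "Is there a specific query that's growing in execution time?" amounts to fitting a trend line to historical timings. The sketch below uses a plain least-squares slope; the query names, the sample data, and the 5 ms per day cutoff are all made up for illustration.

```python
def slope_per_sample(values):
    """Least-squares slope of `values` against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def growing_queries(daily_ms_by_query, min_slope_ms=5.0):
    """Names of queries whose daily average time grows by at least
    `min_slope_ms` milliseconds per day."""
    return sorted(
        name
        for name, series in daily_ms_by_query.items()
        if slope_per_sample(series) >= min_slope_ms
    )


# Hypothetical daily average execution times in milliseconds.
timings = {
    "monthly_report": [100, 110, 120, 130, 140],
    "user_lookup": [50, 52, 49, 51, 50],
}

print(growing_queries(timings))  # ['monthly_report']
```

A number like "this query gets 10 ms slower every day" is the kind of data point that can back a request for tech debt time in quarterly planning.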
Ledge: I bet that’s true. It’s got to be really good for team culture. It’s probably not the easiest thing to explain to the people in the CFO office and the COO office, etc., to pay for these things like, “Why should we invest in this?”
What would your advice be to someone who hasn’t hit the catastrophic failure yet but probably has that sense and is unable to articulate?
What you’d like to do in any business ─ even a scrappy startup ─ would be “I just need enough investment to avoid the three-day downtime that has a six-figure revenue impact” or whatever it is.
What’s the advice there on getting ahead of that in a way that you can communicate to the check writers?
Kelly: I’ll speak a little bit from the business perspective because I feel like I’m involved in that level.
My advice for any startup or really anybody in this type of position is that we have to make tough decisions every day and you do have to find that balance; and, sometimes, you have to make difficult ones.
But always make sure that you’re making informed decisions. If you’re truly saying, “Hey, we need to make this tradeoff and we’re not going to invest in these areas,” make sure that, if you’re not the technical person, you are educated in what that means: “What are the actual impacts? What is the worst that can happen?” And be aware of them.
That’s kind of where we have arrived in our organization when I work with our CEO and COO. It’s just making sure that they’re aware of the worst that can happen and making sure that everybody is on the same page and accepting that risk.
If you’re comfortable with the risk, then, by all means, make those difficult choices. But if you don’t understand what the risk is or you’re not comfortable with it, then, that’s where you kind of have to have that discussion and figure out what your next steps are to solve it.
Ledge: So that catastrophic problem probably cost a great deal of revenue and heartache and time and things of that nature. Would that more than have paid for doing the right things first?
Kelly: That’s a difficult question. Our platform is unique.
Ledge: Aren’t they all?
Kelly: During that outage, I would say that, revenue-wise, it wasn’t as significant of an impact as some businesses would have. Our end users are mainly internal users and our freelancers who are completing the work on our platform on behalf of our clients.
When we’re down, it slightly delays that project but the uniqueness of the work ─ that was a risk that we knew was there.
So why weren’t we investing a lot of time and why was this hard to sell to the executive team or to the general team as to why we would invest time on technical debt or this type of thing versus building new features?
Well, the new features are needed in order to onboard the new clients but the worst-case scenario which is the system being down wasn’t so terrible for our business.
It’s not great. Nobody wants that to happen but the risk there was worth that.
Now, we hit that and then you realize, okay, it’s not fun, if only because of everyone’s time invested in the distraction.
And so, there’s another inherent risk that you don’t always identify or quantify but that’s just really the distraction to the actual business and the team.
So you may have had these plans. You’ve deprioritized tech debt. You’ve deprioritized performance improvements and you’ve focused on the features. But, now, you hit that point where you have to deal with it.
While the system was down for three days, there was also the time leading up to it before it officially went down, and there was the time after it spent trying to diagnose exactly what happened, coming up with a plan to fix it, and trying to prevent it in the future.
Sometimes, it ends up being more costly in the long run to ignore these things even if you’re going to accept the risk.
Ledge: Right. And that risk calculation is really all about what this data gathering mindset provides you. So it’s the ability to say, “We’re over Threshold A” and that probably leads to a problem down the road. And maybe if multiple thresholds are being breached at the same time, we’d probably have an active production problem right now.
So you can tell the difference, then, maybe between problems that are imminent disasters versus things that we should keep on the list at that mid-level priority.
Kelly: Before, without the data, you’re just kind of going off of feeling. You’re going off of what isn’t a fact. But when you have data, it’s factual. It’s right there and you can’t ignore it.
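The threshold reasoning in this exchange (one breach suggests trouble down the road, several at once suggest an active production problem) could be sketched roughly as follows. The metric names, limits, and severity labels are hypothetical.

```python
def classify(metrics, thresholds):
    """Return (severity, breached_metric_names).

    One breached threshold is something to watch; two or more at the
    same time is treated as a likely active incident.
    """
    breached = sorted(
        name
        for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    )
    if len(breached) >= 2:
        level = "active incident"
    elif len(breached) == 1:
        level = "watch"
    else:
        level = "ok"
    return level, breached


# Hypothetical limits and a hypothetical current snapshot.
limits = {"cpu_percent": 85, "error_rate_percent": 2, "p95_latency_ms": 800}
snapshot = {"cpu_percent": 91, "error_rate_percent": 0.4, "p95_latency_ms": 950}

print(classify(snapshot, limits))
# ('active incident', ['cpu_percent', 'p95_latency_ms'])
```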
Ledge: How do you divide your resources human time-wise across dealing with technical debt versus dealing with new features? Have you found a way to split time with the same individuals working on each type of thing to get broader coverage and sort of knowledge transfer or do you go the other way and say, “There are some people who are really good bug crushers and some people who are good at new features”? How do you thread that?
Kelly: That’s a very interesting question and something that we actually still struggle with as a team and an organization. We have tried a lot of different things over time.
To one of your questions, are some people good bug crushers and some people good feature developers?
One hundred percent, absolutely! There’s a special type of developer out there who is comfortable navigating the unknown, who is comfortable working in large systems and they’re not worried whether or not they’re an expert in everything. They’re not worried if they know all the technologies. They just enjoy the fun of getting in there, tracing things, and learning.
And that’s one type of developer and, I’d say, a special type of developer. I’ve only encountered a couple of them over time, at least, in people that I’ve worked with.
And then, there are great feature developers. To date, as I’ve said, I’ve encountered only a couple of people who are really good bug crushers.
In our organization, they inherently have been promoted into leadership positions because, I think, of the nature of how they work so they don’t have as much time to crush those things.
Our team now is composed of great feature developers and we’re kind of struggling on the bug front and whatnot because our system is distributed across so many different technologies and so many different layers.
How we split time is a discussion. It is a constant fight ─ no matter what even with all this data ─ to fight for the time to work on tech debt and performance.
We do have a decent-sized bug backlog that has been kind of plaguing us for a long time. Once you get the buy-in, what we have found successful over the last couple of quarters, in order to commit the time but also showcase that we’re actually making progress, is to just dedicate 25% of our time to bugs; and we use the Agile process.
So when we forecast our points, it kind of takes 25% off of the point balance for this general bug bucket. We don’t point bugs because bugs are very hard to point. They can be one point, based on what our team estimates, or they can be fifty points. You just have no idea how long and how big bugs are.
And so, we just kind of time box them. We say, “Hey, engineers, we think it’s going to take an hour to work on this bug.” If someone is exceeding that hour, then that’s the point at which you say, “Hey, I’m finding that this bug is much bigger or harder to diagnose than I thought” then we can make the proper decision from there whether or not it’s something that you want to spend time on.
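The arithmetic behind the bug bucket and the time box is simple enough to write down. This is a rough sketch only; the 40-point forecast is an invented figure, and the one-hour default mirrors the example in the conversation rather than a fixed rule.

```python
def plan_sprint(forecast_points, bug_share=0.25):
    """Split a sprint forecast into feature points and an unpointed bug bucket."""
    bug_points = forecast_points * bug_share
    return {
        "feature_points": forecast_points - bug_points,
        "bug_bucket_points": bug_points,
    }


def over_time_box(hours_spent, time_box_hours=1.0):
    """True once a bug has used more than its time box, which is the
    cue to stop and decide whether it is still worth pursuing."""
    return hours_spent > time_box_hours


print(plan_sprint(40))     # {'feature_points': 30.0, 'bug_bucket_points': 10.0}
print(over_time_box(0.5))  # False
print(over_time_box(2.5))  # True
```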
With some bugs, you have no choice. They’re critical bugs. You have to solve them. But other bugs are, in some cases, nice-to-haves, or maybe you think one is high priority or something you really want to address. But once you start learning that “Hey, someone is not going to be able to solve this in a day,” people can quickly go, “Yes, that’s not that big of a deal.”
Ledge: That’s when it becomes a feature.
Kelly: Exactly! Or we redo the whole concept.
Ledge: Absolutely! As you moved to distributed services, did you find more and more that the bugs became more abstracted into the ops layer?
I’ve heard other people discuss that, in fact, DevOps becomes more and more recursively important because what we perceive as a bug and a performance issue moves more and more into the inter-services ops layer. Have you found that to be the case?
Kelly: Absolutely! It one hundred percent has and I’m trying to quantify the importance here. I mean, there are just so many moving parts in distributed systems like this and so many different technologies and the proper monitoring and making sure that you, from a hosting perspective, are provisioned correctly to ─
You’re not just monitoring a couple of web servers on a load balancer and kind of monitoring their performance. You’re now monitoring each individual service and making sure that they’re provisioned properly.
Is it time for you to up the memory on the machine or use a different type of machine?
We use AWS. You have all kinds of choices of memory optimized and storage optimized depending on your purposes on top of just all the messaging technology underneath and monitoring that and making sure that’s implemented correctly.
I think that’s one of the most difficult things across having distributed services. It’s making sure everything is in sync and everything is on the right versions because you always want to use the latest and greatest of a new version or a new package of something.
But if you realize that you’re using this new package over here but it’s now not compatible with the other one, the other packages, you need to upgrade everything. It’s that type of learning that you don’t realize out of the gate. There’s more maintenance involved than you would ever imagine.
At first, you think, oh, okay, great! We’re going to just have the server over here for this service and this server here. And, suddenly, you wake up, one day, and you’ve got fifty servers all running different packages and now you need to make them all in sync.
We just went through that kind of pain point and I won’t call it a “nightmare” but it’s ─ You think, okay, I’ll just upgrade this package.
Great! But, then, again, you just find all these layers that aren’t compatible or you need to make little tweaks to the code to make it work.
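Spotting the kind of package drift described here can start with a simple inventory comparison. The sketch below assumes you can already collect a mapping of server to installed package versions; the server names, package names, and versions are invented.

```python
from collections import defaultdict


def find_version_drift(inventory):
    """Given {server: {package: version}}, return
    {package: {version: [servers]}} for every package that is
    installed at more than one version."""
    seen = defaultdict(lambda: defaultdict(list))
    for server, packages in inventory.items():
        for package, version in packages.items():
            seen[package][version].append(server)
    return {
        package: {version: sorted(servers) for version, servers in versions.items()}
        for package, versions in seen.items()
        if len(versions) > 1
    }


# Hypothetical inventory of three service hosts.
inventory = {
    "orders-01": {"messaging-client": "2.1.0", "json-lib": "9.0.1"},
    "search-01": {"messaging-client": "3.0.0", "json-lib": "9.0.1"},
    "billing-01": {"messaging-client": "2.1.0", "json-lib": "9.0.1"},
}

print(find_version_drift(inventory))
# {'messaging-client': {'2.1.0': ['billing-01', 'orders-01'], '3.0.0': ['search-01']}}
```

A report like this only shows where versions differ; whether two versions are actually incompatible still has to be checked against the packages' own release notes.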
Ledge: Right ─ which brings all kinds of regression problems and things of that nature, too.
We could go on forever. I appreciate the expertise and I want to respect your time.
As a closing statement for all those budding technology leaders out there, what would you want them to know?
Kelly: There’s a time and a place for tech debt and performance and you will always have to fight for the time. It’s not easy as a technology leader. You always want to invest as much time as you can in it but you do have to realize the balances of the business.
And so, make sure that you do your best to get your data, get your facts. It’s a much easier fight and discussion to have when you have the facts in place and the data there.
Just recognize that it’s a balance and you can’t get it all. But just find the right place and the right time to ask for it.
Ledge: Fantastic! Thank you, Kelly. We appreciate your time. It’s good spending time with you. We will make sure everybody is looking out for OneSpace and trying to get those digital shelves in order.
Kelly: Yes. Thank you so much.