In this episode of Smooth Scaling, Jose Quaresma talks with Ryan Sherlock, Senior Director of Engineering at Intercom, about the realities of scaling databases in a fast-growing SaaS product. Ryan shares Intercom’s journey from a single MySQL database through Aurora, proxies, and per-customer scaling patterns—and what eventually pushed the team toward PlanetScale. The conversation also explores Intercom’s heartbeat-based approach to incident detection and response, focusing on customer impact rather than infrastructure metrics.
Ryan Sherlock is Senior Director of Engineering at Intercom in Dublin, where he leads the core technologies and infrastructure groups that power Intercom’s AI-first customer service platform. Through talks and writing on the Intercom engineering blog, he shares practical playbooks on scaling infrastructure and engineering enablement, running high-leverage incident response, and using heartbeat metrics to tie reliability directly to real customer outcomes rather than just server graphs. Outside Intercom, he serves on the board of the Rails Foundation, helping steward the future of the Ruby on Rails ecosystem. Before moving into tech leadership, Ryan spent several years as a professional cyclist, an experience he wrote about in “Why you should have skin in the engineering game”, and one that still shapes how he thinks about risk, ownership, and reliability in software.
Episode transcript (auto-generated):
Jose
Hello and welcome to the Smooth Scaling Podcast, where we speak with industry experts to uncover how to design, build, and run scalable and resilient systems. I'm your host, Jose Quaresma, and today we had a really awesome conversation with Ryan Sherlock, who's a director of engineering at Intercom, and they are a leading AI agent and help desk solution for customer service. Ryan walked us through their evolving journey of database scaling at Intercom in the last 10 years, as well as their unique strategy around incident response and leveraging heartbeat metrics to detect and react to incidents early. I really enjoyed it and learned a lot from it. I hope you do too. Welcome, Ryan. It's great to have you on.
Ryan
Thanks. Thanks. It's great to be here.
Jose
And I would like to start straight into hearing from you and your experience. So you joined Intercom nine years ago, right? Can you walk us through the setup that was in place at the time, and any initial scaling pains that you started feeling back then?
Ryan
I think when I joined Intercom, a large theme was actually scaling pains. We'd been running a couple of years, but we had hit this rocket-shaped growth. And in Intercom, we really value product engineering and solving for our customers, and at that time we potentially hadn't thought long enough about some of the infrastructure we'd need to actually support that scaling. So the first couple of years after I joined, it actually was a lot of firefighting. I remember my first day: the area of the app responsible for our user model—in Intercom, we allow our customers to send targeted messages to their customers based on basically any criteria, and we had a large system to deal with this—was basically going down every day. It was a system built, at that point, on MongoDB and Elasticsearch. So my first day was sitting down and just watching the fires going on. Now, fortunately, over the next year or two, we were able to get on top of all of that, and since then it's been much more proactive scaling. But the very beginning was a fiery start to my Intercom career.
Jose
Usually a lot of those scaling pains point towards the database and the transaction limits that you start facing. Can you tell us a little bit about the considerations and the work at that time?
Ryan
No, so I can even roll back a little bit earlier. As I was saying, we tried to build simple and then layer on complexity when necessary. Intercom started as—or the database layer started as—a simple RDS MySQL database. We called it MainDB. We still have a MainDB. MainDB has now over a thousand tables. So that one, when we started in 2011, it worked fine. Then 2013, we started having incidents. The first thing that we do is, it's in the textbook: you vertically scale. So instead of whatever size instance we had from AWS, we quadrupled it. And then more runway. And fortunately, the business is going well. And then we went to the next bit of the textbook. And it's horizontal scaling. So later, we will do per-customer sharding and things like that. But at this point, it was just something simple. We'd have a particular piece of the product, and we'd be like, okay, we think there'll be a bunch of tables in this piece of the product. Let's just add another database for that and put all those tables in there. So in that way, we were kind of splitting up the load. And more runway. But as the business continued to grow, we needed to find more ways of handling the scale.
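For readers who want a concrete picture of that per-product split: in a modern Rails app the same idea can be expressed with Rails' built-in multiple-database support. This is only a rough sketch with hypothetical names—not Intercom's actual code, and the API itself arrived well after the period Ryan describes.

```ruby
# A rough sketch of a per-product database split using Rails 6+'s built-in
# multiple-database support. Names are hypothetical, not Intercom's actual code.

# config/database.yml would gain a second entry (e.g. "conversations_db")
# alongside the primary "main_db".

class ConversationsRecord < ApplicationRecord
  self.abstract_class = true
  # Every model in this product area reads and writes against the extra database.
  connects_to database: { writing: :conversations_db, reading: :conversations_db }
end

# Tables moved out of MainDB simply re-parent onto the new base class.
class Conversation < ConversationsRecord
end

class ConversationPart < ConversationsRecord
end
```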
And one thing that luckily we landed on quite early, and that massively reduced the number of queries we were sending off to the database, was Identity Cache. We're a large Ruby on Rails monolith, as is Shopify. And Shopify had a gem called Identity Cache, which was basically a way of caching read queries to the database. So we went from near-daily outages back to normal. Our databases were super healthy.
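For context, Shopify's IdentityCache gem is open source, and its read-through caching looks roughly like this. A minimal sketch with hypothetical model and field names, assuming a Memcached-style cache store is configured:

```ruby
# Gemfile: gem "identity_cache" (requires a configured cache backend, e.g. Memcached)
class User < ApplicationRecord
  include IdentityCache

  cache_index :email, unique: true             # cached secondary lookup
  cache_has_many :conversations, embed: :ids   # cached association ids
end

# Reads go through the cache and fall back to MySQL on a miss;
# cache entries are invalidated automatically when the record changes.
user  = User.fetch(42)                        # cached version of User.find(42)
user  = User.fetch_by_email("a@example.com")  # cached version of find_by(email: ...)
convs = user.fetch_conversations              # cached association fetch
```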
And another thing at this time: we were on pretty basic MySQL at this point, but we were pretty close with AWS—and we still are. And in 2014, they announced a new database product that looked magical, which would become Aurora. So in early 2016, we switched across to Aurora. And then again, that gave us another period of scalability. So as time went by, from the very simple setup at the start, we layered on more complexity, more complexity. And as the app grew, we layered on more.
So at a certain point, our scale got to a stage where we were running over 16,000 Rails processes, each Rails process would want a connection to the database, and Aurora supported at most 16K connections. That meant we ended up rolling out ProxySQL as a way of scaling that. If we had a web-serving instance with, say, a thousand processes running on it, we'd run one layer of ProxySQL locally on that host, so maybe we'd take a thousand connections down to 24. And then we'd have a remote layer of ProxySQL as well, which would further scale that down so that we were able to stay well within Aurora's 16K hard limit.
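To make the connection arithmetic concrete (with illustrative numbers, not Intercom's actual fleet sizes), the two proxy layers multiplex roughly like this:

```ruby
# Back-of-the-envelope connection math for a two-layer ProxySQL setup.
# All numbers are illustrative, not Intercom's real fleet sizes.
hosts              = 16      # application hosts
processes_per_host = 1_000   # Rails processes per host
aurora_limit       = 16_000  # Aurora's hard connection cap

direct_connections = hosts * processes_per_host   # 16,000 -- already at the cap

local_pool_size = 24                              # local ProxySQL pool per host
after_local     = hosts * local_pool_size         # 384 connections leaving the hosts
after_remote    = 100                             # remote ProxySQL layer pools them again (assumed)

puts "direct: #{direct_connections}, after local proxies: #{after_local}, to Aurora: ~#{after_remote}"
```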
But doing that added complexity. And some of our gnarliest incidents until about two years ago were related to these various proxy layers and a database getting into metastable states, and at times us basically having to reboot Intercom—which meant going to a lot of the instances, restarting them, and then getting a recovery. So this takes us up to 2019-ish. And at that point, I think it was a year or two earlier at re:Invent, AWS had mentioned that Intercom had one of the largest tables in all of AWS Aurora. This is really bad. You should never be the largest of anything.
When you have a vendor like AWS and the scale that they operate at, and then you find out you're the biggest of something, yeah, that sounds like a problem. And the particular table in question was our conversations table. So every conversation between a user and a teammate would have an entry in there. And we had tens of billions of rows there. And we had a lot of complexity in the indexes. So if, let's say, there was one gigabyte of data, there would be like three gigabytes of indexes. And that all meant we basically couldn't migrate the table at all. It had been years since we'd been able to migrate it.
So that finally put us down the journey of actually doing application sharding. And the way we did this was, for each customer, we would create a schema in a set of clusters. So basically each customer would have a database for their high-scale data. What that bought us was, instead of having one massive table that we weren't able to migrate, we'd have hundreds of thousands of much smaller tables that we could migrate—it would just take a long time. So it was relatively quick to roll out. And then that bought us another five years of runway, up until a project I started two years ago.
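As a rough illustration of that per-customer routing (Intercom's real implementation was custom-built at far larger scale; the names here are hypothetical), Rails' horizontal-sharding API expresses the same idea:

```ruby
# Rough sketch of per-customer shard routing using Rails 6.1+ horizontal sharding.
# Shard and method names are hypothetical; Intercom's implementation was custom.
class ShardedRecord < ApplicationRecord
  self.abstract_class = true
  connects_to shards: {
    default:   { writing: :shard_default },
    shard_one: { writing: :shard_one },
    shard_two: { writing: :shard_two }
  }
end

class Conversation < ShardedRecord
end

# Resolve the customer's workspace to its shard, then wrap the work so every
# query in the block hits that customer's database.
def with_customer_shard(workspace, &block)
  ShardedRecord.connected_to(shard: workspace.shard_name.to_sym, role: :writing, &block)
end

with_customer_shard(current_workspace) do
  Conversation.where(state: "open").count
end
```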
Jose
And when was that—because I know that now you do run in different regions—when was it that you also kind of... have you always been running in different regions in AWS, or is that something that has happened somewhat concurrently with some of the things you're sharing?
Ryan
It was about five years ago that we introduced an EU region. So with our regions, if you're a customer of ours, your workspace and your data will reside in a single region. We don't run data across multiple regions. And the driver behind having an EU region was that we had many customers who wanted their data and everything related to it fully based in the EU. So that's what drove us to build it out.
And the way our infrastructure works across the various regions is that it's largely a carbon copy of our US East 1 region, which is by far the largest region. The EU region is basically the same. There are some things that we optimize from a cost point of view, where there'd be some background queues where in the US region we would have one cluster of workers per queue, while in the EU we may have some shared queues for lower-priority work. But anything that's high priority—anything directly related to a customer talking to their end user—would all be its own infrastructure.
Jose
Sorry, I kind of interrupted you there with a regional question, but you were sharing from the database evolution and you were at the kind of the customer sharding step, right? And what happened then?
Ryan
It was coming up to about this time two years ago, and Aurora had been a magical piece of software that gave us years of scaling. But there were several things impacting us that we wanted to change. One of them, and I referred to it earlier, was the connection limits and the ProxySQL layer. That layer was the root cause of most of Intercom's major incidents up until two years ago. And for many of our customers, anytime Intercom is down, their business is down. So we take this really, really seriously, and it was untenable to stay in a situation like that.
The next thing was database maintenance. With Aurora, whenever you wanted to upgrade the database, it was basically downtime—a maintenance window. And our customers are global. We do have highs and lows—the low point is quieter than, say, 3 p.m. Irish time on a Monday, when traffic is higher—but we have customers all over the world, and any point in the day is, for some set of customers, the most important time of their day. And also, I sat in on every database maintenance window, always along with one of the engineers. And they kind of suck. They suck for us, they suck for Intercom generally, they suck for our customers. We basically wanted to architect our way out of that.
And then the final bit was the native sharding. We had built out this per-customer sharding—a per-customer schema, a per-customer database—and it sounds really cool, but it actually did introduce a lot of complexity. Anytime we want to do a database migration—let's say we just want to create a new table—instead of one migration, it'd be 400,000 migrations that we would be running. And it's doable, it's fine, but there was a lot of complexity pulled in, and we could start to see that it was impacting engineering velocity, so we wanted to move away from it.
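To give a feel for what "one migration becomes 400,000" means in practice, here is a heavily simplified sketch of fanning the same DDL out across every per-customer schema. All of the helper names are hypothetical, not Intercom's real tooling.

```ruby
# Heavily simplified sketch of fanning one schema change across every
# per-customer schema. CustomerSchema, with_connection and MigrationFailure
# are hypothetical helpers.
CREATE_TABLE_SQL = <<~SQL
  CREATE TABLE IF NOT EXISTS conversation_tags (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    app_id BIGINT NOT NULL,
    tag VARCHAR(255) NOT NULL,
    INDEX index_tags_on_app_id_and_tag (app_id, tag)
  )
SQL

CustomerSchema.find_each do |schema|      # ~400,000 schemas to walk
  schema.with_connection { |conn| conn.execute(CREATE_TABLE_SQL) }
rescue StandardError => e
  # At this fan-out something is always mid-failover; record and retry later.
  MigrationFailure.create!(schema_name: schema.name, error: e.message)
end
```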
Now, when we looked at doing our own application sharding, we did look at Vitess. Vitess is the software YouTube built and later open-sourced to scale MySQL. Slack use it, HubSpot use it, as do many large organizations that use it to scale MySQL. At that point in time—and I think this decision is still right—we felt that it was too early for a company like us to take it on and build a team around it. You'd typically run it on Kubernetes, and we weren't running Kubernetes in production. We thought the learning curve would be too steep for us to get up and running without potential issues for our customers.
But in 2023, that actually changed. There was a company called PlanetScale, and they do a managed version of Vitess. It seemed to tick all of our boxes. So we kicked off, started doing some prototyping, and it looked good. And then we kicked off the migration about this time two years ago. The migration, like all things, took longer than we expected—there was significant complexity in us unraveling from our per-customer databases back into something more standard. There was some nifty, novel engineering work, and at Intercom we try not to do novel engineering work from an infrastructure perspective, but unfortunately, in this case, we had to.
Jose
So you're saying that in this setup, you went away from per-customer schema. Can you tell a bit more about that?
Ryan
Yeah. So for our high-scale data, previously each customer would have their own database—their comments table, the conversations table, message threads, a lot of data around the customer itself. In PlanetScale, what we have, to the app, is logically one very large table. The Vitess layer does the magic of converting a query from the Rails monolith into the correct query to access the right shard within a very large cluster.
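In practice that means the app keeps issuing ordinary ActiveRecord queries, and as long as they include the sharding key—the column backing the table's vindex, in Vitess terms—VTGate can send each one to a single shard. A sketch with an assumed app_id (workspace id) sharding key:

```ruby
# With Vitess/PlanetScale, the app sees one logical conversations table.
# Assuming app_id (the workspace id) is the sharding key, a query that includes
# it is routed by VTGate to exactly one of the underlying shards.
Conversation.where(app_id: workspace.id, state: "open").limit(50)
# => SELECT ... FROM conversations WHERE app_id = ? AND state = 'open' LIMIT 50
#    VTGate hashes app_id through the table's vindex and targets a single shard.

# A query without the sharding key still works, but becomes a "scatter" query
# that fans out to every shard -- fine occasionally, best avoided on hot paths.
Conversation.where(state: "open").count
```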
So for example, previously we had five Aurora clusters, and across those five Aurora clusters we had 400,000 workspaces. Now, to the application, we have one table, except it's spread across 128 clusters. And with those clusters, we're back at a place where we have many different levers we can pull for scale. We can scale up the instances, we can add more readers, and with a few clicks of a button and a bit of clock time, we can reshard. We can decide, hey, 128 isn't enough, we've grown enough that we want to go to 256 shards. And those are just a few clicks away, without a lot of engineering from, say, the teams in my group.
Jose
Yeah. And no longer the need for the proxy layer either.
Ryan
Yeah, the proxy layer is gone. Their equivalent is something called VTGates. And it's a very nifty piece of software where in Vitess, we can now do basically maintenance-free upgrades. So in Aurora, we'd have upgraded maybe once per year because it was an hour of downtime for our customers. And now, since we've rolled across, I honestly can't tell you how many times we've upgraded. It's probably 20 or 30 times. It's maybe more.
And what this layer can do, when you're doing an upgrade, is you can just upgrade one of the replicas, then upgrade the other replica. And then on the VTGate layer, when everything is set up, it will pause writes for a split second. It will queue them up. Then it will point off to one of the upgraded replicas. And then it will release the writes again. And then the instance that was the writer gets upgraded. It's pretty magical. It sidestepped two of the major problems that we had.
Jose
I can see how that really simplified a lot of the work that you were doing and, I guess, increased the engineering speed again, right? With removing all those dependencies—or at least you get that for free from that system, right?
Ryan
Yeah, one of the problems that we hit is complexity in code. When we per-product sharded, what that meant is we had several databases that would need to be kept in sync in a single request. For example, in the past, if we were creating a new comment, we would need to update the conversation, we'd need to update another model called message threads, which would be in a different database, and then create another model called comments, which is in another database. And now you're opening multiple transactions across multiple databases.
So we're rolling back away from that. One of the things, getting back to this, with the way we can scale on Vitess, is we expect to put all of our potentially shardable data back into this very large production database. And then over time, once we have the conversation and the message thread and the comment all within the same database, we can remove complexity from the code, which also brings down the latency of our requests as well.
Jose
How much of a runway do you think you have now, right? So if you're looking forward, do you have an idea of, is there a clear limit that you can think of from a transaction perspective? Or did you say, okay, yes, we think that in two or three years we'll have to do something again if we continue with our growth. Is that something? How do you think about that?
Ryan
I was in All Hands a couple of months ago and our CEO, Owen, he was putting up some graphs on the growth of various aspects of the business. And I was sitting there being like, oh my. I'm really thankful that we kicked off this piece of work. Because, which is a way of answering the question, I think it's largely limitless. There's always going to be things that we hit. There'll always be these small things. But I don't think there is anything, at least within our databases, that we have to do to solve scaling. There are companies that are running 100x our data on this exact setup. And as long as we have a large enough credit card to pay AWS for the instances, we can scale largely infinitely.
Jose
That's a good place to be from that perspective, right?
Ryan
Yeah, much less stressful.
Jose
I guess it has already been quite the journey from this specific perspective. Do you have any advice? If there's a startup right now that's still small but expecting to scale, knowing what you know now and what you had to do, what advice would you give them?
Ryan
I think we now run over 2 million queries per second. So a lot of the things I talked about, particularly towards the end, only really become problems when you've really scaled. And you probably shouldn't do a lot of this sort of architecting until it actually is a problem, because you probably should be trying to find product-market fit.
Now, the only exception to this, I think, is if you're starting a startup and you have data that is in any way shardable. So for instance, it could be a customer. Then make sure every table that is related to that customer has the customer ID—or whatever your equivalent is—on that table, and have every index start with that customer ID. You're adding a very, very small amount of complexity, but it basically means that if you do hit massive growth, then going and sharding and doing some of the stuff that I talked about earlier is all much, much simpler.
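In Rails terms, that advice looks something like the migration below—hypothetical table and column names, with the customer/workspace id first in every index:

```ruby
# Sketch of the advice as a Rails migration (hypothetical table and columns):
# every tenant-owned table carries the customer id, and every index leads with it,
# so the data stays trivially shardable by customer later on.
class CreateConversations < ActiveRecord::Migration[7.1]
  def change
    create_table :conversations do |t|
      t.bigint :app_id,  null: false    # the customer/workspace id -- the future sharding key
      t.bigint :user_id, null: false
      t.string :state,   null: false, default: "open"
      t.timestamps
    end

    add_index :conversations, [:app_id, :user_id]
    add_index :conversations, [:app_id, :state, :created_at]
  end
end
```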
Some of the parts that were complex for us in moving across to PlanetScale were that we had lots of tables that didn't have indexes in the appropriate places. So even before we could move stuff across, we'd have to go and run big migrations. And these tables are very large, so there were weeks of wall-clock time waiting for all of this. So I think that's one small thing that can sidestep a lot of problems, but doesn't actually slow you down from building products that will hopefully scale.
Jose
Yeah. Awesome. Thank you for that. And I would like us to maybe change the topic and direction a little bit here, because one of the things that I've also heard you talk about and read about is that, at Intercom, your incident response strategy has a strong focus on heartbeat metrics. Can you tell us a bit about what those are and how Intercom uses them?
Ryan
Yeah, basically it's a way for us to monitor customer outcomes and not just the systems—well, we also monitor the systems as well. But our customers hire Intercom for a couple of jobs, and it largely boils down to our customers being able to talk to their users. And in all of our regions, we're at a scale where we have really good data. You can see the seasonality across a day, across a week. We know what we expect to see for the number of comments being created, the number of Fin answers being created. We have a really good idea of what that should look like.
So if we ever see a deviation from expectations, then we say the heartbeat metric has fired, and that instantly kicks off our incident response. So for instance, if the heartbeat metric dropped off by, say, 50%, an incident would be created, an incident commander would be paged in, the relevant person on the team would be paged in—or if it was out of hours, the out-of-hours on-call would be paged in.
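The mechanics are simple to sketch. Intercom's real version is built on Datadog monitors, so the code below is only a toy illustration with made-up names and thresholds:

```ruby
# Toy illustration of a heartbeat check (Intercom's real version lives in
# Datadog monitors; every name and threshold here is made up). Compare the
# current rate of a customer outcome -- e.g. comments created per minute --
# against the expected rate for this minute of the week.
def heartbeat_fired?(metric, now: Time.now.utc)
  observed = metric.rate_over_last(60)                     # events per minute right now
  expected = metric.seasonal_baseline(wday: now.wday,      # learned from weeks of history
                                      minute: now.hour * 60 + now.min)
  observed < expected * 0.5                                # fire below 50% of expected
end

if heartbeat_fired?(CommentsCreatedMetric.new)
  Incident.declare!(severity: :major, source: "heartbeat:comments_created")
end
```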
And we'd have automation that would roll back production by 20 minutes: if there have been any deployments in the last 20 minutes, we automatically roll back. And then we have additional tooling layered on. We use Incident.io and they have AI SRE. So at the same time, it'll be working in the background, trying to pull out any information. Does it look like any of the other incidents that we've had? Are there any issues that have just been opened around this? Again, so that the initial responders have a bit of context when they come along.
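And the automated response that follows is roughly this shape—again a hypothetical sketch, not Intercom's actual tooling:

```ruby
# Hypothetical sketch of the automated response: when a heartbeat fires,
# roll back any deploy that shipped in the preceding 20 minutes.
ROLLBACK_WINDOW_SECONDS = 20 * 60

def auto_rollback!(fired_at: Time.now.utc)
  recent = Deploys.since(fired_at - ROLLBACK_WINDOW_SECONDS)
  return if recent.empty?               # nothing shipped recently -- probably not a bad deploy
  return unless Deploys.rollback_safe?  # e.g. blocked mid database switchover (see later in the episode)

  Deploys.rollback_to(recent.first.previous_revision)
end
```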
And there are a few examples where we saw this. A few months ago, there was an incident where some code rolled out. Within 30 seconds, the heartbeat metric fired. It initiated a rollback. And our rollbacks are about two to three minutes—that's how quickly we can roll back our code. We want it faster and we're actively working on making it faster. So that whole incident was about two minutes; our customers would have seen two minutes of impact. I remember those sorts of incidents three, four, five years ago, and it would have been a 10 or 11 minute incident.
And the heartbeat—oh yeah—for the initial response. I'd been an incident commander for years. And the first thing that you do when you see the page come in and you're about to join the incident call is you want to understand: what is the customer impact? Because that really sets up the next set of things. If you know that the customer impact is really bad, then you'll instantly want to go to our status page product, update the status page, and make sure our customers know that we're on it. And the heartbeat metric is the first dashboard that ever gets opened. That really scratched that itch for me as an incident commander.
And then once we had that, we saw the signal was so strong that we rolled it out into more and more things, and we increased the sensitivity as well. Initially, we would only create an incident if it had been down for something like three minutes. But it got so good that we now do it at 30 seconds. We could probably do it at even less than that at this point.
Jose
And what are some of the metrics there? Is it all around customer outcomes? So is it also in terms of Fin responses to tickets or conversations? Is it comments from the human agents? Is that kind of all those sorts of metrics?
Ryan
Yeah, so we have various database tables that we would be able to monitor.
Jose
Or the writes to various tables. Okay, yeah. I like that.
Ryan
So we'd have tables associated with Fin answering questions, or various metrics around Fin answering questions—so we'd see any deviation there. Or then, like I mentioned the comments table earlier—the comments table does a lot of heavy lifting here. So we'd see a disruption in comments coming in from the mobile messenger or the web messenger or the teammate app itself.
Or another one is we do a lot of tracking around how the teammate moves around our application in general. And if we see a deviation on that, again, we know something's up, and we'll figure it out.
Years ago, the database view would be the first one I would look at: if the databases are having a problem, maybe it's the databases, and if there are no queries going to the database at all, you also know there's a problem—it might be the database failing, but either way you know something's wrong. But the heartbeat view has gone up in priority for me now.
Jose
I see. I think it's quite interesting as well that you've made the choice to roll back by default once it kicks in, and then you can do the investigation, right? I think that alone probably saves quite a few minutes, because otherwise, if it's just people looking at things and having to open a few pages and see what was deployed or not, then you're already using up two or three minutes there, right?
Ryan
No, exactly. Over the years, there have been so many incident reviews where the question would be: so why didn't you roll back? Or why didn't you roll back quicker? Or, let's say a pull request went through that was the cause of the problem, someone would try to do a revert, and then they'd let it go through a pipeline and things like this. From when you click merge to being in production is about 13 minutes. So it's fast, but two minutes is faster than 13 minutes.
So with any of these things, we tried to set up the humans and the first responders with all of the tools. But if there are things that we know work really well and we can get a bot to do it, then we get the bot to do it. And we've made sure that rollbacks at Intercom are an always-safe thing to do. And if there are occasions—and sometimes there are—if you're switching between databases or something like that, we just block that and we don't allow rollbacks at that point. But that's super rare and typically that would be some of the infrastructure engineers running that.
Jose
That's very interesting. Thank you. And maybe continuing on the incident response theme—I would love to get into a bit more detail. We did have recently, a few weeks ago, a big incident with AWS in their US East 1 region. And I know that it impacted you. So can you take us through what that looked like for you?
Ryan
Yeah, sure. It's a pretty good example of how our heartbeat metrics work in reality. I think it was 7:48 and 30 seconds when AWS first had issues. Our internal incident fired and the on-call for the service was paged at 7:49 UTC, and it also paged in an incident commander immediately.
So we were able to—the first responder, they were able to see that there was basically a cliff drop in our core metrics. And several minutes after that, she paged in the critical response. So basically, she marked the incident as critical. And it's our highest level. Typically, we have zero to one of these per year. But she pretty quickly understood that this is different from other ones.
And at 7:54, while I was making espresso for my wife and me, I got paged. I looked at the critical one and I was like, really? Really? It's quite rare. And then I went upstairs, and then I saw that we were going to have a day of it.
Jose
Yeah. And can you tell us a bit more—as much as you can share—about how you handled it from an infrastructure perspective?
Ryan
When I went upstairs, I verified that the heartbeat metrics were broken. So that's a Datadog dashboard. We also, just to make sure that it's not a metrics issue, when we create that channel, we link into Honeycomb, which we use for high-cardinality observability. So I click on that link. I also see it's broken. So then you're like, okay, it's really broken.
I go to the exceptions dashboard and then I see it's Seahorse. And then one of my colleagues who was just joining the call as well is like, yeah, that's the AWS client. I was surprised because I didn't recognize Seahorse and I've been in hundreds of incidents. So then with a quick look at X, I realized this isn't us. This is something much larger than us.
And then you started just paging in various people that you knew would be able to help over the next while as you try to mitigate.
And it did impact Intercom. So DynamoDB is one of our core data stores. So we use it—the primary use case for it is storing the key-value data related to our customers and users. So for example, let's say you were a retail business and you had customers, maybe you would track what's in their basket, what tier they're signed up for, how much is their typical market spend, a bunch of things like that. We have a big JSON blob of all of that data and we use DynamoDB for that. So it's absolutely core to how Intercom works. And it was effectively down.
And our ability to log into AWS was down. Our ability to create cases was down. There were a lot of things that were just down. So until DynamoDB returned, which took about two and a half hours, we were fully down in the US East region.
So that was the stage where there wasn't really very much we could do as an engineering response team. But then for the next 10 or 11 hours, once DynamoDB was back—AWS still had issues—we were able to actually do engineering work to get the app up and operational to some degree.
Jose
Thank you for sharing. And during that time, Europe was—your European instance was just running, or did you have a few dependencies across?
Ryan
No, the EU region was fine. There was some impact, but it was due to third-party integrations that were themselves affected by the same incident.
One of the interesting things from that point on—aside from the first two and a half hours—and a reason our monolithic architecture turned out to be kind of interesting, is that we have, again focusing on the US region, many, many clusters. We have 300 or 400 clusters that do various different work—for example, web fleets or various shapes of async worker.
When the incident started just before 8, that was a relatively low point in how far our infrastructure was scaled. And then when DynamoDB went down, we scaled down all of our infrastructure to basically our nightly minimum sizes.
And the long tail part of the AWS incident was the inability to get new EC2 instances. So we were effectively at our low point in scale. But then our traffic—the European traffic first, and then US traffic later—they were all coming online. So we had this web fleet that would be at 20% of max peak capacity trying to do all of the work.
So some customers that use Intercom were likely down themselves, or maybe traffic would be a little bit lower on a day like that. But we basically didn't have enough boxes for the messenger or boxes for the teammate app. What was nice about our monolithic architecture, though, is that all of the EC2 instances are running the same thing.
So we were then able to take, say, a bunch of backfill workers or just async, less-priority, less-important work, and we'd be able to point to them and say, okay, you're now a web box.
Jose
Okay, you're now a web box.
Ryan
And they will reply, I'm now a web box. And they go and start serving web traffic. So we were able to get enough instances from that point of view to service that.
But then the next bottleneck that we hit was that VTGates layer that I mentioned earlier, like the proxy layer. It scales up and down. It's a lot of infrastructure at peak, and it goes up and down. So it was in a low place as well, because we effectively weren't making any MySQL queries. And it's a very different shape of box, and we had difficulty getting any instances for that.
So our vendor PlanetScale, they were able to help and do some things that were able to mitigate some of it, but it was the bottleneck. At this point, we had enough web boxes to serve all of our customer traffic, but there wasn't enough capacity in the proxy layer to serve it. So basically, database queries were slower.
So the way we saw it in the metrics, and the way we saw it in the heartbeat metrics, was just a depression. You could see that Intercom worked—it was just slow. And when things are slower, people do less stuff, so you see less of everything.
And then we also turned off various commenting-type things. We were pretty ruthless about cutting out features that didn't serve the core customer product until we were able to get capacity again.
Jose
And was there any kind of—I think you shared some of the impact—were there any learnings from it? Would you do something, or are you trying now to do something different overall in your architecture and infrastructure because of the experience?
Ryan
Yeah, certainly with the VTGates layer. What we changed is that it can only scale up. So it went from—I'm going to make up numbers here—let's say 40 instances to 120, and we just set it at 120. And as our peaks get more peaky and we add in more instances, we will scale up and not scale back down.
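The policy itself is tiny—essentially a ratchet on the fleet's minimum size. A sketch with the same made-up numbers:

```ruby
# Illustrative sketch of the "only scale up" policy: the VTGate fleet's minimum
# capacity ratchets upward with each new peak and is never lowered automatically,
# so a region-wide recovery never starts from a trough.
def new_vtgate_minimum(current_min:, last_peak_instances:)
  [current_min, last_peak_instances].max
end

new_vtgate_minimum(current_min: 40, last_peak_instances: 120) # => 120
```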
So in all of our infra, that was largely the only bottleneck we weren't able to work our way around. So we're going to make sure that if there's an incident of this shape again, then it's just not going to affect us.
And another thing—it's less related to our infra—but when we had this issue, from the initial impact, the absolute first error, until our status page was updated was about 10 minutes. And during an incident like this, those minutes are critical.
We also want to proactively reach out to our customers in other ways. And when we were not able to log into AWS and get at other data in AWS, we were not able to send some of those proactive messages to our customers in the way we would have wanted to. So we've had to build mechanisms that are completely orthogonal to any of our infrastructure to be able to reach out to our customers. If all of this is on AWS, then we need something on Google or somewhere else that's completely independent of it.
That was for the first two and a half hours. Once our app was working again, we were able to put a banner within our app to be able to say, hey, there's increased latency, click here to look at the status page, and then we kept that updated then.
Jose
Which you do also provide as a service to your Intercom customers, right?
Ryan
Yeah, of course.
Jose
They can have that. And I do see that your integration with Incident.io also allows for integrating with the status page and helping with that communication as well, right? So that's definitely important. As you're saying, it's only 10 minutes, but it's a very important 10 minutes, right? So if you can get that communication out earlier, then yeah, I see that.
Ryan
Yeah, and actually on that, a recent thing that we've done—because of that 10 minutes—is our customers want faster. They see a problem. They quickly want to understand: is it the VPN in their office? Is it a local issue? Because there are plenty of times when there are local issues. And so they want to be able to go to our status page. Those 10 minutes or six minutes or five minutes can feel like an eternity if you've got a thousand teammates trying to do work.
So on our status page now, we link off to a public-facing version of our internal heartbeat dashboard. So now, as soon as there is an incident, if you see anything going wrong at all, you can click on this link and you'll be able to see a Datadog dashboard that is very similar to what we see. Again, this is trying to short-circuit that question of whether it's us.
And then also, if this drops off, we communicate to our customers that we're working on this. We know about this. We're working on it. We may not have updated our status page yet.
Jose
Very cool. Thank you for that. It's brilliant insight there, and thank you for sharing that. I think we're getting close to wrapping up. I would just like to end with a couple of rapid-fire questions for you. And the first one would be if you have any kind of book, podcast, thought leader that you follow that you would recommend to the audience.
Ryan
Yeah. So there's a book. It's Accelerate: The Science of Lean Software and DevOps. So when I joined Intercom nine years ago, Intercom did a lot of things radically different from my previous companies. And it was about two years later that Accelerate came out. And finally, there was a book version of what we do in Intercom that I could give people—either to candidates looking into Intercom or to people who were joining. So I could give them this book and not have to point them to a bunch of Intercom blog posts or internal docs or something like this. So yeah, I've bought many, many, many copies of that book.
Jose
And would you have any—or do you have any professional advice that you would give to your younger self? Or it could be your younger self or someone just starting right now in this career. Any advice that you would share?
Ryan
Yeah, I think it's: become an expert in the core tools that you use. So if something is really crucial to how you're going to do your job, get really good at it. And anytime I've not done this, it's come back to bite me.
For example, when I joined Intercom, I joined as a product engineer. For the first month or two, I didn't really learn Ruby on Rails. We had a little bit of Java in Intercom at the time, and I was doing some Java stuff. And what I should have done is day one—or preferably day minus 14—is go right into it.
Luckily, I noticed this. And then I devoured basically every book I could find, every website, every tutorial, and then read through those whilst with our codebase open and learning, okay, so I'm learning this concept about, say, ActiveRecord. Okay, how do we use it in Intercom's codebase? And I spent a lot of time doing that.
And I think that was something that set up my career in Intercom really well. Because I quickly went from being 10th percentile in this to 90th percentile in not very long. And it really stood to me.
Jose
Yeah. And I would imagine now that I think that's probably a good use case for AI as well, kind of from a learning perspective and trying to learn the codebase and all that. So that example that you were telling—where are we using this kind of case, right, or this way of doing things—then you can probably pretty easily get to that answer with some help from AI, right?
Ryan
No, it's incredible. It short-circuits so much stuff that I've done in the past.
Jose
It's amazing. Very cool. And so, very last question. To you, scalability is?
Ryan
The discipline of building systems that reward success instead of punishing it. So more users, more engineers, more data, whatever. Whenever that's growing, it should mean success, not stress and incidents and things like that.
But having said that, there's also a balance. We don't want to over-engineer things here. We don't want to build a perfect system that will scale infinitely and then you don't have any customers. So I think anytime you see scaling issues, first have a little celebration that you're alive and you're doing well enough to have scaling issues, and then get back to actually solving the problem.
Jose
Nice. Wonderful, Ryan. I think that's a really good way to wrap up the episode. Thanks a lot for coming by.
And that's it for this episode of the Smooth Scaling Podcast. Thank you so much for listening. If you enjoyed, consider subscribing and perhaps share it with a friend or colleague. If you want to share any thoughts or comments with us, send them to smoothscaling@queue-it.com. This podcast is researched by Joseph Thwaites, produced by Perseu Mandillo and brought to you by Queue-it, your virtual waiting room partner. I'm your host, José Quaresma. Until next time, keep it smooth, keep it scalable.
[This transcript was auto-generated and may contain errors.]