Amazon CSO Steve Schmidt: AI is changing security, but defenders are still ahead of attackers

Amazon chief security officer Steve Schmidt (Credit: Amazon)

There are very few companies on the planet that are bigger targets for both script kiddies and military hacking units than Amazon, which tracks an enormous amount of data on consumer spending habits and is home to AWS, the oldest and largest cloud infrastructure service provider. As the old saying goes, the teams tasked with keeping all that data under wraps have to be perfect every day, but those trying to steal that data need only get lucky once.

Steve Schmidt was AWS's CISO for 12 years before he was elevated to the top security job at Amazon in 2022, shortly after former AWS CEO Andy Jassy took over as Amazon CEO. His teams were responsible for protecting AWS during a period of extreme growth, when best practices for cloud security were being developed on the fly, and they helped AWS win the trust of the U.S. government in 2013 by building secure cloud infrastructure for the CIA, arguably the most important customer win in AWS history.

The world has changed a lot since then: Attacks on critical public and private infrastructure by countries such as China, Russia, and North Korea have skyrocketed, and artificial intelligence threatens to automate cyberattacks. The good news, for now, is that generative AI is helping defenders more than it is changing the capabilities of attackers, according to Schmidt, but there's no guarantee that lasts.

In a recent interview with Runtime, Schmidt discussed the frequency and style of the attacks Amazon fends off every day, the capabilities that cybersecurity defenders can bring to bear thanks to generative AI, and the potential that machine-generated code could lead to the rise of a new class of software vulnerabilities.

This interview has been edited and condensed for clarity.

Over the last couple of years we've seen a real increase in nation-state attacks on big tech companies. And obviously Amazon's right in the crosshairs of all that. I was wondering if you could give me a little sense of what you're seeing right now and how you've adjusted your tactics in response to those trends.

Schmidt: Let's start off with some statistics, because I think they're really illustrative of the problem. We operate a threat intelligence collection platform that we call MadPot internally, which is a global honeypot network. So [across] all of our AWS regions around the world, we observe about a billion potential threat interactions every day. And interestingly, when we launch a new sensor into that infrastructure, it takes about 90 seconds before that sensor is discovered by somebody, and within three minutes, we see people trying to break into it.
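
The numbers Schmidt cites come from exactly this kind of measurement: a sensor that listens where no legitimate traffic should arrive and timestamps whatever shows up. Below is a minimal, hypothetical sketch of that idea in Python; it is not MadPot, the port is arbitrary, and a real sensor would capture far more than a timestamp.

```python
import socket
import time
from datetime import datetime, timezone

# Minimal honeypot-style sensor: bind to a port, accept any connection,
# and log how long after startup each probe arrives. Illustrative sketch
# only; not Amazon's MadPot implementation.
LISTEN_ADDR = ("0.0.0.0", 2222)  # hypothetical port; real sensors cover many

def run_sensor() -> None:
    started = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(LISTEN_ADDR)
        srv.listen()
        print(f"sensor up on {LISTEN_ADDR}, waiting for probes...")
        while True:
            conn, peer = srv.accept()
            elapsed = time.monotonic() - started
            stamp = datetime.now(timezone.utc).isoformat()
            # Record the probe; a production sensor would also capture the
            # payload and feed it into a central threat-intel pipeline.
            print(f"{stamp} probe from {peer[0]}:{peer[1]} "
                  f"{elapsed:.0f}s after startup")
            conn.close()

if __name__ == "__main__":
    run_sensor()
```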

Do you assume that those are automated attempts to break in?

In most cases, yes, they are, and you can tell the difference with reasonable fidelity between an automated break-in attempt and something that's being crafted by a human. There are differences in behavior, differences in tools. The way an attack progresses looks different.
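
One crude way to make that distinction, offered here purely as an illustration rather than anything Amazon describes using, is to look at the pacing of attempts: scripted tooling tends to fire at fast, metronomically regular intervals, while a human operator is slower and more erratic. The thresholds in the sketch below are invented for the example.

```python
from statistics import mean, pstdev

def looks_automated(event_times: list[float]) -> bool:
    """Rough heuristic: classify a sequence of attempt timestamps (seconds)
    as automated if the attempts are fast and very regular.
    Thresholds are illustrative, not production values."""
    if len(event_times) < 3:
        return False
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    # Scripted tooling: sub-second spacing with very little variance.
    # Human-driven activity: slower, irregular pacing.
    return mean(gaps) < 1.0 and pstdev(gaps) < 0.2

# Example: a burst of attempts 0.3s apart reads as automated;
# attempts spaced seconds to minutes apart with uneven gaps read as human.
print(looks_automated([0.0, 0.3, 0.6, 0.9, 1.2]))   # True
print(looks_automated([0.0, 7.5, 21.0, 58.0]))      # False
```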

I think [there are] two major sets. You've got the bulk criminal actors who are looking for machines they can compromise because they want to install things like crypto mining software on those boxes; they're looking for high-volume targeting with relatively low exploitation effort and automated installation of their toolkits once they get in the front door.

The other side of that coin is the low-volume but high-acuity actors, whether that's the Chinese government or the Russians, or whomever else. They tend to be much, much more targeted, much lower volume, and I'd say they're a lot quieter than the big criminal actors are. They use some of the same tools because a sophisticated adversary will do two things: Number one is they will try and look like everybody else, so if they can use a toolkit that's being used by a criminal group, they'll do so. And secondarily, any sensible actor will use the least sophisticated technique they can to get into a machine. You save your really good stuff for the super hard targets.

I think you see significantly more interactions now that progress into automated exploitation attempts. We would see the same level of scanning many years ago, but people weren't as rapid with their efforts to break into systems. It used to be a day or two before someone tried to break into a box after they did their sweep and collected the IPs that were responding. Now it's literally three minutes or less before we see break-in attempts, which is pretty concerning.

Do you think the rise of generative AI technologies, and the ease with which people are able to use them at this point, is leading to that level of speed and that volume of widespread automated attacks?

What we see is that threat actors are leveraging AI, but typically not for automated exploitation. Where we're seeing a lot of AI use is in automating phishing emails, social-engineering attacks, and other types of malicious activity where the interaction with a human needs to be more realistic.

If you think back a few years, you'd get a ton of phishing emails, but they were so obviously bogus, right? It was pretty easy to discard them. Now they're getting a lot closer to reality, and it's making it harder and harder for a normal person to be able to sort through what's real and what's not.

The other thing is AI is lowering the barrier to entry for adversaries. Activities that previously required significant technical expertise can now be conducted by less sophisticated actors. Malicious code snippets no longer require deep programming knowledge; you can generate those using AI tools and craft them for a particular exploitation effort.

I think, however, it's really important not to overstate the threat here. There's a tendency on some folks' parts to view AI as some kind of magical capability that suddenly transforms cybersecurity. The reality is a lot more nuanced. It is certainly an accelerant to existing techniques, but I don't think yet that it's something that fundamentally changes the nature of attacks.

We have not yet seen any real end-to-end automation attacks where someone could simply tell an AI system, "I want to attack this target," with any sophistication in plain English and have it execute the complete exploitation chain. AI changes how code is created; it doesn't change how the code works. While certain attacks may be simpler to deploy and therefore more numerous, the foundation of how we detect and respond to these events tends to remain the same right now.

So maybe that's a good place to pivot towards what's happening on the defensive side with some of these newer generative AI tools. Can you give me some sense of what Amazon is doing internally to protect itself using some of these emerging technologies?

The good news, of course, is that AI is proving to be a really powerful tool for defenders. Everything that we've seen so far has shown us that we are far more effective using AI as defenders than the adversaries are using AI for attacks right now. Does that mean it's going to change over time? Who knows, we'll see. But for once, the defenders have the advantage, at least for a brief period of time.

We started this process by applying AI to revolutionize the way we do application security reviews. An application security review is how we look at any service or feature that launches in AWS, for example, and make sure it meets our security bar, so that customers can trust it when they give us their data to process, handle, and store. Application security reviews have historically been a human-driven process that could add months to development cycles, and that time, frankly, makes developers unhappy because they don't get to launch their thing quickly.

To put this in perspective, our security teams review thousands of applications each month, conducting deep analyses that involve code reviews, architectural assessments, and threat modeling. And these reviews used to generate significant bottlenecks in our development pipeline; the security reviews are in the critical path for most teams at Amazon, meaning you can't launch until the security review is done.

So what we decided to do is apply LLMs trained specifically on our historical security reviews. We trained the models on the before and the after, and on all of the changes that were required to fix the vulnerabilities discovered during those previous reviews. The model now has very specific knowledge of those issues, so it can identify similar problems across new applications and flag them for human review.

The results here are super promising. They've done a lot for us. Not only has this significantly accelerated our security validation process while maintaining a high bar, but it also helps us address something that I call the situational-knowledge gap. [That] is where senior engineers with years of experience typically find more vulnerabilities than newer team members. Now, because they're using a model trained on our historical app-sec reviews, new team members can leverage the collective knowledge of all of our prior senior engineers, reviews, and findings as embedded in our AI systems.
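
To make that workflow concrete, here is a hypothetical sketch of how an LLM-assisted review pass might be wired up: historical findings and their fixes go into the prompt alongside a new diff, and the model is asked only to flag candidate issues for a human reviewer. The helper names and the call_review_model placeholder are assumptions for illustration; they do not reflect Amazon's internal system.

```python
import json

# Hypothetical sketch of an LLM-assisted application-security review.
# `call_review_model` stands in for whatever fine-tuned model endpoint a
# team actually uses; nothing here reflects Amazon's internal tooling.
HISTORICAL_FINDINGS = [
    {"issue": "SQL built by string concatenation from user input",
     "fix": "switched to parameterized queries"},
    {"issue": "bucket policy allowed public read of customer data",
     "fix": "restricted the policy to the service role"},
]

def build_review_prompt(diff: str) -> str:
    """Combine prior findings (the 'before and after' signal described in
    the interview) with a new diff, asking the model to flag similar
    issues for a human reviewer rather than approve anything itself."""
    history = "\n".join(
        f"- Past issue: {f['issue']} | Fix: {f['fix']}" for f in HISTORICAL_FINDINGS
    )
    return (
        "You are assisting an application security review.\n"
        f"Known historical issues and their fixes:\n{history}\n\n"
        "Review the following change and list any similar issues, each with "
        "a severity and the lines involved. Output JSON only.\n\n"
        f"DIFF:\n{diff}"
    )

def call_review_model(prompt: str) -> str:
    """Placeholder for a real model invocation (e.g. an internal
    fine-tuned endpoint). Returns canned output so the sketch runs."""
    return json.dumps([{"severity": "high",
                        "issue": "query string concatenation",
                        "lines": "12-14"}])

if __name__ == "__main__":
    diff = 'query = "SELECT * FROM orders WHERE id = " + request.args["id"]'
    findings = json.loads(call_review_model(build_review_prompt(diff)))
    for f in findings:
        print(f"[{f['severity']}] {f['issue']} (lines {f['lines']}) -> needs human review")
```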

The other piece that's really changing the way we do security business is the incident-response process. We've implemented generative AI to automate parts of our IR workflow right now, and what used to take hours of manual correlation analysis can now be accomplished in minutes, allowing our teams to focus on the more complex and critical parts of security issues.

I think that what we will see shortly is the next step where agents can actually take remediation actions automatically, meaning we can define a set of parameters under which it is safe for an AI agent to take an action that used to take a human [intervention].
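
A natural way to express "a set of parameters under which it is safe for an agent to act" is an explicit allow-list of actions, each with bounds on its parameters, checked before anything executes; everything outside the list escalates to a human. The actions and limits below are invented for the sketch and are not Amazon's.

```python
from dataclasses import dataclass

# Sketch of a guardrail layer for agent-driven remediation: the agent may
# only execute actions on an allow-list whose parameters fall inside
# pre-approved bounds; anything else escalates to a human reviewer.
# Actions and limits are illustrative only.

@dataclass
class ProposedAction:
    name: str
    params: dict

ALLOWED_ACTIONS = {
    # action name -> validator over its parameters
    "rotate_access_key": lambda p: p.get("key_age_days", 0) > 90,
    "quarantine_host": lambda p: p.get("environment") == "test",
    "block_ip": lambda p: p.get("source") == "threat_intel_high_confidence",
}

def is_safe(action: ProposedAction) -> bool:
    validator = ALLOWED_ACTIONS.get(action.name)
    return validator is not None and validator(action.params)

def handle(action: ProposedAction) -> str:
    if is_safe(action):
        return f"auto-executing {action.name} with {action.params}"
    return f"escalating {action.name} to on-call human reviewer"

print(handle(ProposedAction("rotate_access_key", {"key_age_days": 200})))
print(handle(ProposedAction("quarantine_host", {"environment": "prod"})))
print(handle(ProposedAction("delete_database", {"name": "orders"})))
```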

From where I sit, the biggest, most successful application of generative AI at the moment is code generation. But obviously, the more code you generate, the more security problems you're going to have. How do you prevent those issues from getting into production, and how do you balance encouraging developers internally to use some of these technologies with being aware of the pitfalls?

A system like Q Developer can actually help us in this space, because it's been specifically trained not to do things that humans typically get wrong. If you think about writing software in C, there are certain functions you should never use because they are unsafe from a security perspective. Q Developer will never propose those functions, because it has been told in the training process, "never do this, because it's bad."

Humans continue to write that code. Why? Because it's what they were originally taught to do, or because they were copying somebody else's code which had that error, or things like that. So I think you will see many fewer of what I'd call unforced errors in the code-generation process, and more importantly, when you use a foundation model that has been trained on good-quality previous code and new code, you will get better results out the other end.
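
The C functions in question are well known: gets, strcpy, strcat, and sprintf all write into buffers without bounds checking, which is why banned-function lists exist. A code generator can be trained never to emit them, and the same rule can be enforced mechanically on human-written code; the toy scanner below (written in Python for brevity) illustrates that enforcement side and is not Amazon's tooling.

```python
import re

# Toy scanner that flags classically unsafe C functions -- the kind of
# call a security-trained code generator is taught never to emit.
# The banned list mirrors common guidance; it is illustrative only.
BANNED = {
    "gets": "fgets",
    "strcpy": "strncpy or strlcpy",
    "strcat": "strncat or strlcat",
    "sprintf": "snprintf",
}

def scan_c_source(source: str) -> list[str]:
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for func, replacement in BANNED.items():
            if re.search(rf"\b{func}\s*\(", line):
                findings.append(
                    f"line {lineno}: {func}() is unsafe; prefer {replacement}"
                )
    return findings

example = """
char buf[16];
gets(buf);
strcpy(buf, user_input);
"""
for finding in scan_c_source(example):
    print(finding)
```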

Even with training on historical code that has been vetted, and given the tendency of these models to hallucinate, to make things up, to invent things out of thin air, is there a possibility that they could also generate insecure code? Code that is insecure in a new way that hasn't been seen before in your training?

It is statistically possible, sure. One of the reasons we don't just have the LLM generate the code internally and put it into production without a human reviewing it is that LLMs are not perfect, so we want a human set of eyes on it in review.

The important part, though, is not only that there's a human reviewing and fixing or approving things, but that you take the output of that review and put it back into the training process, so you get a feedback loop. It says, "OK, here's what should have been produced instead of what was." That gets seeded into the training, and the model gets better in the future.
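
Mechanically, that feedback loop can be as simple as capturing each pair of model output and human-corrected result and appending it to the dataset used for the next training or fine-tuning run. The record shape and file name below are illustrative conventions, not a description of Amazon's pipeline.

```python
import json
from pathlib import Path

# Sketch of the review feedback loop: when a human reviewer corrects
# model-generated code, the pair is appended to a fine-tuning dataset so
# the next training run learns "what should have been produced instead."
# File name and record shape are illustrative conventions only.
FEEDBACK_FILE = Path("review_feedback.jsonl")

def record_correction(prompt: str, model_output: str, human_fixed: str) -> None:
    record = {
        "prompt": prompt,
        "rejected": model_output,   # what the model produced
        "accepted": human_fixed,    # what the reviewer approved instead
    }
    with FEEDBACK_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

record_correction(
    prompt="Write a handler that reads a user id from the request",
    model_output='id = request.args["id"]; db.query("SELECT ... " + id)',
    human_fixed='id = request.args["id"]; db.query("SELECT ... WHERE id = ?", (id,))',
)
print(f"appended 1 correction to {FEEDBACK_FILE}")
```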
