Fixing Cloud Security with AWS Lambda

How can you secure the AWS cloud using AWS Lambda? We spoke to Lily Chau from Roku at BSides SF about her experience and innovative approach to tackling security issues in AWS environments. From deploying IAM roles to creating impactful playbooks with AWS Lambda, Lily shared her take on automating remediation processes. We spoke about the challenges of managing cloud security with tools like CSPM and CNAPP, and how Lily and her team took a different approach that goes beyond traditional methods to achieve real-time remediation.

Questions asked
00:00 Introduction
01:56 A bit about Lily
02:27 What is Auto Remediation?
03:56 Example of Auto Remediation
05:19 CSPMs and Auto Remediation
06:58 Make Auto Remediation in Cloud work for you
09:49 Where to get started with Auto Remediation?
11:52 What defines a High Impact Playbook?
12:58 Auto Remediation for Lateral Movement
14:35 What is running in the background?
16:41 What skillset is required?  
19:08 The Fun Section

Lily Chau: [00:00:00] Every AWS account, we deploy two IAM roles. One is the read only security auditor role. So for read only configuration of organization. And then we have the security tagger role to be able to tag instances that are invariant or non compliant. So this tagging is how we are keeping track of invariance, which is much cheaper than actually sending it to the database.

So if you see something that's non compliant, tag it. If you remediate, tag that you remediated.

Ashish Rajan: CSPM and CNAPP are not the only tools that would help solve all your cloud security challenges. Some of you probably already knew this. Some of you would think, wait, that's not true. But let me explain. I had this great conversation with Lily from Roku, and we were talking about auto remediation.

Some of you have probably heard of auto remediation in the past; for some of you, this may be a new concept. But the idea is that you're able to automatically remediate something which is a non compliant issue in cloud. Now, the challenge normally comes from the fact that CSPM or CNAPP tools will take you all the way to the point of creating a Jira ticket or creating a Slack notification, but what all of us security people [00:01:00] truly want is for that thing to not exist, to not be non compliant.

So Lily and her teammates developed a set of auto remediation Lambda functions that they talk about in their talk here at BSidesSF. Lily and I had a great conversation about her approach to auto remediation, which came out of frustration with Jira tickets not being actioned. How can you make it easier for developers to solve these problems?

So she spoke about that at a talk here at BSidesSF and we had a great conversation about this very topic. We spoke about the challenges of the cultural shift required for auto remediation. What triggers it? What are some of the simple examples you would want to be coding in this space as well? All that and a lot more in this conversation with Lily from Roku. I hope you get to enjoy this conversation.

If you're looking into auto remediation, or you know someone who is working on an auto remediation problem, share this with them. But as always, if this is the second time or the third time you're listening to Cloud Security Podcast, I would really appreciate it if you give us a follow or subscribe. If you're listening to us on iTunes or Spotify, enjoy the episode.

I'll see you next one. Welcome to Cloud Security Podcast. Today, we have Lily and Lily, could you [00:02:00] tell us about yourself, where you are these days and what are you doing in cloud security?

Lily Chau: So I'm Lily. Just think of me as your friendly neighborhood security janitor at Roku.

Ashish Rajan: Oh, I'm Lily.

Lily Chau: Over 10 years of experience in the industry.

I fix things, I break things, sometimes I pretend I'm a developer. And a bit of a relic from the past, I used to run PlatypusCon, which is a hands-on, workshop-only hacker conference based in Sydney.

Ashish Rajan: And now you're working on auto remediation.

Lily Chau: Yes.

Ashish Rajan: It's a very heavy AWS term usually, like I didn't hear a lot about that in Azure or GCP, but I'm sure they do that.

Yeah. But for people who may not have heard of that, how do you describe auto remediation?

Lily Chau: Let me take a step back a bit. When tackling security by AWS we can take two approaches. So one is security by secure defaults and guardrails. And the other track is remediation for any non compliant or non invariant resources.

Part one is like your dream foundation of AWS. So you have your organizations with CloudTrail, GuardDuty, least [00:03:00] privilege IAM roles. You have SCPs, you have thousands of infrastructure as code, secure by default templates, frameworks, modules. So if there's any issues, it's in the code, which you can scan for, block, correct, or maybe even ask AI for a remedial solution.

The other part is what happens when someone wants to spin up something that bypasses your infrastructure as code workflows. So if they're gonna insist on doing that, because maybe they're new to AWS, Or they are more familiar with Docker, Kubernetes, or they just find Service Mesh really hard, which it is.

Ashish Rajan: Which it is, yes. So

Lily Chau: It's okay, you can still do that. So let's make sure if you're spinning up an EC2 that you're spinning it up securely. And while they learn Service Mesh and plan to migrate to Service Mesh, they don't do anything that would open up security issues that we can't live with. So the auto remediation part is auto remediating issues that are introduced by manual actions.

Ashish Rajan: Like what would be an example? It doesn't have to be a complicated example. What's a [00:04:00] simple example of what you called non compliant, something that then has to be fixed after?

Lily Chau: So a super simple one is an EC2 that is spun up manually. So obviously, because it's manual, it's prone to security issues.

Other ones might be manually deleting an S3 bucket or a load balancer. It might not seem that scary at first, but if your S3 bucket actually has a domain that is managed by infrastructure as code, you've basically created a subdomain takeover vulnerability.

Ashish Rajan: I feel like worthwhile calling out your point about deleting a load balancer or S3 bucket, these are also things that are not an easy conversation to have in any, most organizations, I feel like they'll be like, okay, like someone's going to, did you just say you're going to delete S3 Bucket? Did you just say you're going to delete my load balancer? Is there a cultural shift required to even have these conversations?

Lily Chau: I wouldn't say it's a cultural shift.

More of a frustration. You have all these findings from all these different tools [00:05:00] and it just increases year after year, which looks really bad on us. Yeah, it just doesn't decrease and the issues are actually the same every time. Once you start collecting the metrics to say, Hey, our mean time to remediate is very bad.

So if you're not going to fix it, maybe give me a shot at fixing it for you.

Ashish Rajan: Yeah. And would you say, because to your point about the tools as well, because you're talking about AWS or Azure or GCP, CSPM comes up quite often. The CSPMs and CNAPPs of the world obviously are claiming that they can solve a lot of these problems.

They'll do a Jira ticket, they can solve the whole world's problems. At some point, it may feel like they were not helpful in this context.

Lily Chau: Generally, a lot of CSPMs stop their automation at filing the Jira ticket or pushing the workflow to Slack. So it's you find an issue, file a ticket, find an issue, push to Slack.

Not really what we wanted to do. We want to go beyond and fix the issue. There are some cases where they have started using AI to suggest remedial actions. [00:06:00] But the problem with any vendor is that they can't really customize it for every organization. Organizations vary wildly in their deployments.

They can use CloudFormation, Terraform, Pulumi, GoPulumi, Python. So you need to cater your remediation for those deployments. It's very hard for a CSPM to do that. It's hard for us

Ashish Rajan: to even do that as well. Can you imagine an actual tool to do that as well? Yeah. So the CSPMs of the world that are stopping at that Jira ticket or Slack message, the challenge over there is that it's not truly remediation at that point in time.

But I think what we're talking about over here is the fact that, Oh, okay. Now that I find out that, okay, I have an S3 bucket or load balancer or something else that needs to be brought back to normality, whatever that normal for that organization is. Yeah. It doesn't matter if it's like a Python code or whatever.

It should be like brought back to normal. That would be auto remediation.

Lily Chau: Yeah. Yeah. Okay.

Ashish Rajan: And to your point, the frustrating part is now I have a collection of Jira tickets that are all in backlog rather than being fixed.

Lily Chau: Yeah.

Ashish Rajan: Okay, fair. So that was [00:07:00] the motivation. I imagine we'll go to the technical detail of this as well as to what was it in the background and what are some of the examples of the one that you picked as a starting point and walk down?

I'm curious about your approach as well, because the reason I said cultural shift is that I think there was a whole wave of auto remediation in the CSPM space, but that disappeared really quickly because we were like, Oh, I can't do that. Security is really bad. So what was your approach?

Maybe people can learn from that as a there is a way to do auto remediation, but This is what worked for me. What would be that method that works for you?

Lily Chau: So normally when people buy a cloud tool you get all excited, you get millions of findings and then that's it. You get stuck.

You don't know what to remediate. What do you remediate that will actually make an impact? And then you look around and ask around what are you doing with these cloud tools? And then you start to realize that the only thing people are filing tickets for and jumping up and down about is public S3 buckets.

So then the problem is we don't know what to [00:08:00] fix. We don't know how to prioritize. Or if we do know how to prioritize, it's we know how to fix it for one issue, but not necessarily the whole organization. So I think we started with making sure that all of our playbooks were all high impact. So they solve a lot of the problems of the company and they revolve around cost optimization, reducing attack surface, and reducing blast radius.

So how we approach that is, and it takes like six months to a year for each playbook, first we say, okay, we're not tolerating this configuration anymore.

Ashish Rajan: Okay. You're drawing a line in the sand. This is it.

Lily Chau: Yeah. We make an announcement and then we send an email to all of the account admins and say, these are all your resources that are violating this new policy, and you have 60 days to fix it.

Okay. Or review it, at the very least. If not, then at this date, we're going to fix it for you.

Ashish Rajan: Oh, okay.

Lily Chau: Yeah. And then, as you get closer to that deadline, You remind them week by week, Hey, just a reminder, you still have X amount of [00:09:00] resources violating this policy. Just remember to review because the date is coming.

And then the day arrives and we say we are going to remediate, but we don't actually remediate. Yeah. We don't want to break anything. So behind the scenes, we just keep hounding and hounding. Hey, please review. Please review. 'Cause they may be false positives. Yeah.

Please. Please. And then eventually, yeah, they get remediated, we tune out all the false positives, and we give it 30 days to observe for any breakages. Then we do it all again with the next playbook.
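The rollout cadence described above (announce, 60 days to fix or review, weekly reminders, then 30 days of observation after remediation) can be sketched as a small date calculation. This is only an illustration: the function names `rollout_schedule` and `observation_ends` are hypothetical, and the weekly cadence starting one week after the announcement is an assumption.

```python
from datetime import date, timedelta


def rollout_schedule(announced: date, grace_days: int = 60) -> dict:
    """Key dates for one playbook: announcement, weekly reminders, deadline."""
    deadline = announced + timedelta(days=grace_days)
    reminders = []
    day = announced + timedelta(days=7)
    # One reminder per week until the deadline is reached.
    while day < deadline:
        reminders.append(day)
        day += timedelta(days=7)
    return {"announced": announced, "reminders": reminders, "deadline": deadline}


def observation_ends(remediated_on: date, observe_days: int = 30) -> date:
    """Last day of the watch period for breakages after remediation."""
    return remediated_on + timedelta(days=observe_days)
```

Keeping the observation window tied to the actual remediation date, not the announced deadline, matches the point that remediation does not really happen on the deadline itself.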

Ashish Rajan: That is interesting, because I think 60 days is a good mark. That is good enough in sprint terms as well; that's enough time for them to plan ahead for it.

Yeah. If they want to proactively work on it. But I love the approach of drawing a hard line in the sand: hey, this is a new policy, and anything people can do to not be breaking the policy is a good idea. In terms of examples, what were some of the first few things you went for? Because I love what you said about high impact, reducing blast radius, and it may be different things for different people.

Do you [00:10:00] find that in terms of the ease of doing it and the high impact part, was there any? Was there anything that came to you guys for, and it may or may not be relevant, but I'm just curious as to, is there a simple example that people can look at for doing auto remediation themselves?

Lily Chau: So the simplest, yeah, is if an EC2 is manually created. So starting at that, and then if it is, then you quarantine it via security groups to restrict ingress traffic and you attach an IAM policy to restrict access to other AWS resources.
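A minimal sketch of that quarantine step, assuming the boto3 EC2 and IAM clients are passed in by the caller. The function name, the tag key `security:quarantined`, and all IDs are hypothetical; the quarantine security group and the restrictive policy are assumed to exist already, and Roku's actual implementation is not public.

```python
def quarantine_instance(ec2, iam, instance_id, quarantine_sg_id,
                        role_name, restrict_policy_arn):
    """Isolate a manually created EC2 instance.

    ec2 and iam are boto3 clients (or test doubles with the same methods).
    """
    # Replace every security group on the instance with the quarantine
    # group, which is assumed to restrict ingress traffic.
    ec2.modify_instance_attribute(InstanceId=instance_id,
                                  Groups=[quarantine_sg_id])
    # Attach a restrictive policy to the role the instance uses, limiting
    # its access to other AWS resources.
    iam.attach_role_policy(RoleName=role_name, PolicyArn=restrict_policy_arn)
    # Record the action on the instance itself (hypothetical tag key).
    ec2.create_tags(Resources=[instance_id],
                    Tags=[{"Key": "security:quarantined", "Value": "true"}])
```

Inside a Lambda handler, `ec2` and `iam` would typically be `boto3.client("ec2")` and `boto3.client("iam")`; passing them in keeps the function easy to test without AWS credentials.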

Ashish Rajan: Yeah. Okay. How would you know that it's a manually created one?

Lily Chau: Previously, we used to look at the user agent to infer from the CloudTrail logs if it was manual.

Ashish Rajan: Yeah.

Lily Chau: We look for events initiated by employees. Wherever the identity ends with something like @roku.com, we know it's by a person. So there's a responsible owner we can nag, and it makes the event actionable. The other thing we check for is in the CloudTrail logs, there's a new [00:11:00] read only flag that is set.

So if that is set, we know that it's a mutating action. So a create, modify, delete type of event.
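The two checks described here (a human identity and a non read only event) can be sketched against a CloudTrail record as follows. The `readOnly` and `userIdentity` fields are part of the CloudTrail record format; the function names and the `@example.com` domain are hypothetical, and the assumption that the employee email appears as the session or user name depends on how federation is set up.

```python
import re


def responsible_owner(event: dict, employee_domain: str):
    """Return the employee identity behind a CloudTrail event, or None."""
    identity = event.get("userIdentity", {})
    candidates = [identity.get("userName", ""),
                  identity.get("principalId", ""),
                  identity.get("arn", "")]
    for value in candidates:
        # The user or session name is the last ":" or "/" separated part.
        last = re.split(r"[:/]", value)[-1]
        if last.lower().endswith(employee_domain.lower()):
            return last
    return None


def is_manual_mutation(event: dict, employee_domain: str) -> bool:
    """True for create/modify/delete events initiated by a person."""
    # Raw records carry a boolean; some APIs return the string "false".
    if str(event.get("readOnly")).lower() != "false":
        return False
    return responsible_owner(event, employee_domain) is not None
```

Events where `readOnly` is missing are treated as non mutating here, which is a conservative choice for a playbook that is not supposed to break anything.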

Ashish Rajan: Oh, yeah. If read only equals false. Yes, it's a yeah. Okay, fair. Okay. Yeah, because that there's a default now on CloudTrail as well. Yeah. Oh, okay. Also, and if you if the EC2 event is listed over there, then it's most likely someone has actually created this.

Yes. Whereas if it's like a Terraform or something else, it would not be there, or it would be different if it was Terraform instead.

Lily Chau: This is looking at after the fact, after the deployment is spun up, you look at the CloudTrail logs to see the event that is happening and okay, read only flag equals false.

Ashish Rajan: Was the approach to create a database of these first or I would think that you can't just go back for one EC2. It would need to be big enough of an impact to go across the organization. So people who are thinking of doing this, they probably should think about not just one EC2 in one account.

To your point about high impact.

Lily Chau: Yeah.

Ashish Rajan: How would you describe it as an example of high impact, like across a large footprint or?

Lily Chau: Every AWS account, we deploy two IAM roles. One [00:12:00] is the read only security auditor role, so for read only configuration of organization. And then we have the security tagger role, to be able to tag instances that are invariant or non compliant. So this tagging is how we are keeping track of invariance, which is much cheaper than actually saving it to the database. So if you see something that's non compliant, tag it. If you remediate, tag that you remediated.
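A minimal sketch of tagging as state tracking, assuming a boto3 EC2 client obtained through the security tagger role is passed in. The tag key scheme (`security:<playbook>`) and the function names are hypothetical; only the idea of tagging non compliant and remediated resources comes from the conversation.

```python
from datetime import datetime, timezone

VALID_STATUSES = ("non-compliant", "remediated")


def compliance_tags(playbook: str, status: str, now=None) -> list:
    """Build the tags that record a playbook finding on a resource."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    now = now or datetime.now(timezone.utc)
    return [
        {"Key": f"security:{playbook}", "Value": status},
        {"Key": f"security:{playbook}:updated",
         "Value": now.strftime("%Y-%m-%dT%H:%M:%SZ")},
    ]


def tag_instances(ec2, instance_ids, playbook: str, status: str) -> None:
    """Apply the tags with a boto3 EC2 client (or a compatible test double)."""
    ec2.create_tags(Resources=list(instance_ids),
                    Tags=compliance_tags(playbook, status))
```

Because the same key is overwritten when the status moves from `non-compliant` to `remediated`, the resource carries its own latest state and no extra table is needed for this part.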

Ashish Rajan: But how would you find them collectively in one place?

Is there a service for it in AWS that collectively shows you, across all your accounts, which tags are in violation?

Lily Chau: For one account, yeah, you can look at the tags to pull out all of the EC2s. Yeah. I guess for our case, it's a little different because all of our resources we save into a database. Oh. And it's actually very similar to the open source StreamAlert, from what I gather from people, where any configurations that you care about, you just grab it, put it in a database, and then you can do an SQL query to find those tags.

Okay.
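As an illustration of the "put it in a database and run an SQL query" idea, here is a sketch using an in-memory SQLite database. The table name `resource_tags`, its columns, and the sample rows are all hypothetical; the real schema was not described.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource_tags ("
    "account_id TEXT, region TEXT, resource_id TEXT, "
    "tag_key TEXT, tag_value TEXT)"
)
conn.executemany(
    "INSERT INTO resource_tags VALUES (?, ?, ?, ?, ?)",
    [
        ("111111111111", "us-east-1", "i-0aaa", "security:manual-ec2", "non-compliant"),
        ("111111111111", "us-east-1", "i-0bbb", "security:manual-ec2", "remediated"),
        ("222222222222", "us-west-2", "i-0ccc", "security:manual-ec2", "non-compliant"),
    ],
)


def find_tagged(conn, tag_key: str, tag_value: str) -> list:
    """Return (account, region, resource) rows carrying the given tag."""
    cur = conn.execute(
        "SELECT account_id, region, resource_id FROM resource_tags "
        "WHERE tag_key = ? AND tag_value = ? "
        "ORDER BY account_id, resource_id",
        (tag_key, tag_value),
    )
    return cur.fetchall()


print(find_tagged(conn, "security:manual-ec2", "non-compliant"))
```

A single query like this answers the cross-account question directly, which is the part that tags alone do not give you.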

Ashish Rajan: Yeah. Fair. And you can just collect the [00:13:00] information. We were talking about this before we started recording about the whole lateral movement as well. We went with a simple example, maybe adding a bit more complexity to it. How would you do this for a possible lateral movement?

Lily Chau: For lateral movement, we'd need to do some pre work. So first we need to create a few IAM mappings, so between a principal and a resource. So a principal is like an IAM user, a role, a user group. A resource is another IAM user, or it could be an EC2, an EKS, a Lambda. Every time you see an admin IAM permission over a resource, you create a mapping.

Every time you see an impersonate permission from a principal to a resource, you create a mapping, and then you can chain these together to figure out if an EC2 that is associated with a low privilege IAM role points to another IAM, which points to another IAM, which points to another with admin IAM permissions.

Now, when I say admin IAM permissions, what I mean is, create user, create policy, wildcard, IAM permissions. Because as you can guess, EC2 has no reason [00:14:00] to create an IAM user. Yeah. That's what we're trying to detect. That's our pre work.
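The chaining described above is a graph search. A minimal sketch, assuming the mappings have already been reduced to a dictionary of "can impersonate" edges and a set of nodes holding admin IAM permissions; the node names, the `shortest_admin_chain` function, and the hop limit are hypothetical.

```python
from collections import deque


def shortest_admin_chain(edges: dict, admin_nodes: set, start: str,
                         max_hops: int = 10):
    """Breadth-first search from a resource to any node with admin IAM rights.

    edges maps a principal or resource to the principals it can impersonate.
    Returns the shortest chain as a list of nodes, or None if there is none.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in admin_nodes:
            return path
        if len(path) - 1 >= max_hops:
            continue  # do not follow chains longer than max_hops
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


edges = {
    "ec2:i-0123": ["role/app"],
    "role/app": ["role/deploy"],
    "role/deploy": ["role/iam-admin"],
}
print(shortest_admin_chain(edges, {"role/iam-admin"}, "ec2:i-0123"))
```

The `seen` set keeps the search from looping when two roles can assume each other.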

Ashish Rajan: Yeah.

Lily Chau: And then our remediation playbook is we pull all of the resources that have the admin IAM permission.

And pull all of the resources that are chained to have admin IAM permissions. So this chaining could be for two hops, three hops, 10 hops. Now that we have those, we need to double check in the CloudTrail logs or in IAM Access Analyzer that those permissions were not used legitimately over the past year.

And if not, we just strip them out in a new policy, attach the new policy, detach the old policy.
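A sketch of the "strip them out in a new policy" step, working on the policy JSON only. The list of sensitive actions follows the definition given earlier (create user, create policy, wildcard IAM permissions); the function names are hypothetical, and `Deny` and `NotAction` statements are deliberately left untouched as a simplifying assumption.

```python
import copy
from fnmatch import fnmatchcase

SENSITIVE_ACTIONS = ("iam:CreateUser", "iam:CreatePolicy")


def grants_sensitive(action: str) -> bool:
    """True if an action pattern such as "iam:*" covers a sensitive action."""
    pattern = action.lower()
    return any(fnmatchcase(s.lower(), pattern) for s in SENSITIVE_ACTIONS)


def strip_admin_actions(policy: dict) -> dict:
    """Return a copy of the policy without Allow entries for admin IAM actions."""
    cleaned = copy.deepcopy(policy)
    statements = cleaned.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    kept = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow" or "Action" not in stmt:
            kept.append(stmt)  # Deny and NotAction statements stay as they are
            continue
        actions = stmt["Action"]
        if isinstance(actions, str):
            actions = [actions]
        remaining = [a for a in actions if not grants_sensitive(a)]
        if remaining:
            stmt["Action"] = remaining
            kept.append(stmt)
    cleaned["Statement"] = kept
    return cleaned
```

The cleaned document would then be created as a new policy, attached, and the old one detached, for example with the boto3 IAM calls `create_policy`, `attach_role_policy`, and `detach_role_policy`. If no statements remain, detaching the old policy is enough.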

Ashish Rajan: Ah, interesting. Most use cases will be simpler. The lateral movement thing, the chaining of IAM roles, is a very interesting one. It's almost that's where the complexity and large cloud footprint would come in as well.

So I'm glad you guys are doing this. Is there a tool for this now or what is running in the background because we didn't even talk about that. What is running in the background to make all this happen?

Lily Chau: Not much, actually, because we are tagging anything that's not compliant and [00:15:00] tagging things that are remediated.

And if they're deleted, then okay, then they're gone. Yeah. So it's really just tagging. And then initially we had a database backup to make sure nothing went wrong. Yeah. But for the IAM, we store all those mappings in an RDS, which is probably not the cheapest solution, but it's what I'm more familiar with.

Ashish Rajan: So are you guys using like a Lambda as well to do the auto remediation part?

Lily Chau: Yes.

Ashish Rajan: Okay. So tagging helps you collect the data in a way for lack of better way. It transforms it. Okay. This is what I care about. And then the Lambda kicks into, Oh, I see the ones tagged as S3 or violated or whatever. But this is the action that, We're talking about a lot of lambdas, right?

Yeah. How many lambdas are you at now? So

Lily Chau: it's, yeah, it's one lambda per playbook. I would say about 10 good ones, a few doozies.

Ashish Rajan: It goes back to what you were saying earlier. Would it be focusing on an S3 bucket, like a high impact, high risk problem? Yeah. That is your playbook. That is your lambda. But how would you get that across all your AWS accounts? [00:16:00]

Lily Chau: We have the one big lambda that iterates across all AWS accounts and all regions. But then they branch out to different Lambdas. One Lambda to check for misconfiguration for EC2, another Lambda to check for three misconfigurations on EBS, and another Lambda that checks for misconfigurations on the Elastic IPs.

They just branch out. And they perform their own checks, which is scan and tag, remediate.
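The fan-out pattern described here (one dispatcher that walks accounts and regions, then hands each combination to a per-playbook Lambda) can be sketched as follows. The function names in `PLAYBOOKS` and the payload shape are hypothetical; the only AWS call used is the Lambda client's `invoke` with asynchronous `InvocationType="Event"`.

```python
import json

# Hypothetical playbook name -> Lambda function name mapping.
PLAYBOOKS = {
    "ec2": "playbook-ec2",
    "ebs": "playbook-ebs",
    "elastic-ip": "playbook-elastic-ip",
}


def build_jobs(account_ids, regions, playbooks=None) -> list:
    """One job per (account, region, playbook) combination."""
    playbooks = PLAYBOOKS if playbooks is None else playbooks
    return [
        {"function": function_name,
         "payload": {"account_id": account, "region": region, "playbook": name}}
        for account in account_ids
        for region in regions
        for name, function_name in playbooks.items()
    ]


def dispatch(lambda_client, jobs) -> int:
    """Invoke each playbook Lambda asynchronously; returns the job count."""
    for job in jobs:
        lambda_client.invoke(
            FunctionName=job["function"],
            InvocationType="Event",  # fire and forget
            Payload=json.dumps(job["payload"]).encode("utf-8"),
        )
    return len(jobs)
```

Each playbook Lambda would then assume its own role in the target account and run its scan, tag, remediate steps for that one account and region.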

Ashish Rajan: And what about when new accounts get added? Is this an organization? When new accounts are added,

Lily Chau: Yeah.

Ashish Rajan: Lambdas are already deployed

Lily Chau: yes. Yes. Yeah. So it's part of our normal workflow where it's a new account, CloudTrail, GuardDuty, security tagger role, security auditor role, and these 10 different IAM roles specific for each playbook.

Ashish Rajan: Now that we know the, okay, we have moving parts like tagging, which I think is a fairly, I think everyone would know what tagging would be in an AWS context. Lambda, I think we go into that territory of coding a bit more as well. I'm not a great coder, like as much as I would like to claim that I can write a Python or a Lambda function.

I'm copying a lot from Stack [00:17:00] Overflow and maybe making tweaks here and there. And to make it what I want to, it may not be the best program in the world. For people who are listening or watching this and want to understand, okay, I guess I'm making a Lambda function. What's the skill set level are we looking at?

And are you open sourcing what you're doing so I can copy paste what you're doing?

Lily Chau: With ChatGPT, we're all better developers than before.

Ashish Rajan: Actually, yeah, maybe, yeah, maybe ChatGPT is trying to change that. Yeah. Yeah, I can ask her to write me a Lambda function in Python.

Lily Chau: Yeah, absolutely.

Ashish Rajan: And I can make that a lot more.

Is that what you guys were using as well? I'm sure you got it. It definitely helped. Yeah, it definitely helped. Fair. What would you say the skill set should be then for someone who's thinking about going down this path of doing this in their organization?

Lily Chau: Definitely some sort of SDK knowledge. If you're going to use AWS SDKs for Python, then yeah, know Python.

Yeah. If you want to use Node.js, know Node.js. But beyond that, you should also know the language that your cloud deployment platforms are using. So if the majority of your resources are spun up by CloudFormation, you need to make sure you know CloudFormation, 'cause that's where you [00:18:00] need to make the fix.

Yeah. But if it's Terraform, yeah, you need to know a bit of Terraform to make the fix in Terraform. Yeah.

Ashish Rajan: Yeah. Are you making changes directly in the template as well? Because we spoke about the manually created ones. You're making changes in CloudFormation templates and stuff as well?

Lily Chau: Yeah, especially for the S3 manual delete event.

So because that event will open up your environment to a potential subdomain takeover vulnerability, once your S3 bucket is manually deleted, we need to check in Route 53 and in the CloudFront distribution if the domain name for the S3 bucket exists.

Ashish Rajan: Yeah.

Lily Chau: If so, we need to delete them. So, you know, if you're not using infrastructure as code, just delete them by CLI.

But if your domains are being managed by IaC, you can't delete them by CLI 'cause it would just revert itself. And your subdomain takeover vulnerability just keeps opening up and up. Yeah. So you need to make sure you apply the fix in your Terraform instead.
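The Route 53 side of that check can be sketched as a pure function over record sets shaped like the output of the Route 53 `ListResourceRecordSets` API. The function names and the hostname parsing are assumptions made for illustration; CloudFront origins would need a similar pass, and the sketch only reports dangling names, since the actual fix belongs in Terraform when the records are managed there.

```python
def bucket_from_target(target: str):
    """Return the bucket name if a DNS target looks like an S3 endpoint."""
    host = target.rstrip(".").lower()
    if not host.endswith(".amazonaws.com"):
        return None
    labels = host.split(".")
    for i, label in enumerate(labels):
        if label == "s3" or label.startswith("s3-"):
            # Everything before the "s3" label is the bucket name.
            return ".".join(labels[:i]) or None
    return None


def dangling_records(record_sets: list, live_buckets: set) -> list:
    """Names of DNS records that point at S3 buckets which no longer exist."""
    dangling = []
    for record in record_sets:
        name = record["Name"].rstrip(".").lower()
        alias = record.get("AliasTarget", {}).get("DNSName", "")
        bucket = None
        if alias.lower().startswith("s3-website"):
            # For S3 website aliases the bucket is named after the record.
            bucket = name
        else:
            for rr in record.get("ResourceRecords", []):
                bucket = bucket_from_target(rr["Value"])
                if bucket:
                    break
        if bucket and bucket not in live_buckets:
            dangling.append(name)
    return dangling
```

Any name this returns is a candidate for subdomain takeover, because someone else could create a bucket with the missing name and serve content under the domain.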

Ashish Rajan: So use ChatGPT to get the lambdas that you might want to create.

But also look at CloudFormation or Terraform as the places where you can make an impact. Is there a plan to open source what you guys have [00:19:00] done?

Lily Chau: I would say keep an eye out on the Roku github.com page and something will be published there, once all the approvals are checked. Okay,

Ashish Rajan: cool. I'll put a link in there as well.

Those are most of the technical questions I had, I've got three fun questions as well.

Lily Chau: Okay.

Ashish Rajan: So first one being, when you're not working on cloud security, auto remediation, what do you spend most time on?

Lily Chau: Swimming.

Ashish Rajan: Swimming. Okay. Nice.

Lily Chau: If I can find somewhere where I can just lay on the beach, then that. Oh,

Ashish Rajan: nice. Yeah.

Which would be a lot in San Fran, San Jose?

Lily Chau: Not a lot, but I have a pool with a great mountain view.

Ashish Rajan: Oh, that, that definitely would help. Yeah. What is something that you're proud of that is not on your social media?

Lily Chau: I think it's probably changed, but I think looking back I think I'm actually most proud of PlatypusCon.

Ashish Rajan: oh, wow. Okay, fair. And final question. What is your favorite cuisine or restaurant that you can share?

Lily Chau: Japanese but like sashimi, like all the raw stuff. Yeah. Yeah.

Ashish Rajan: All the raw stuff.

Lily Chau: Favorite restaurant. So in SF, there's Dumpling Home, which has the best [00:20:00] dumplings. Oh,

Ashish Rajan: wow. Okay.

Lily Chau: But the lines are always very long.

It's always one hour wait minimum. They have a sister store that they opened up recently, which I'm not going to tell you because they have no lines and I want to save that.

Ashish Rajan: Fair. It's for off camera. It's for off camera. But no, thank you for sharing that. And where can people find and connect with you online to know more about this place?

Lily Chau: So I'm on Twitter. @snailpea is P E A. Yeah. That's the only thing I use really. Okay. Fair.

Ashish Rajan: I'll tag that in the shownotes as well, but thank you so much for coming on the show. I really appreciate this. Thank you so much.

Thank you for listening or watching this episode of Cloud Security Podcast. We have been running for the past five years, so I'm sure we haven't covered everything cloud security yet.

And if there's a particular cloud security topic that we can cover for you in an interview format on Cloud Security Podcast, or make a training video on tutorials on Cloud Security Bootcamp, definitely reach out to us on info@cloudsecuritypodcast.tv. By the way, if you're interested in AI and cybersecurity, as many cybersecurity leaders are, you might be interested in our sister AI Cybersecurity Podcast, which I run [00:21:00] with former CSO of Robinhood, Caleb Sima, where we talk about everything AI and cybersecurity.

How can organizations deal with cybersecurity on AI systems, AI platforms, whatever AI has to bring next as an evolution of ChatGPT, and everything else continues. If you have any other suggestions, definitely drop them on info@cloudsecuritypodcast.tv. I'll drop that in the description and the show notes as well so you can reach out to us easily.

Otherwise, I will see you in the next episode. Peace.