In this episode, Ashish spoke with Kushagra Sharma, Staff Cloud Security Engineer, to delve into the complexities of managing Identity Access Management (IAM) at scale. Drawing on his experiences from [Booking.com](http://booking.com/) and other high-scale environments, Kushagra shares insights into scaling IAM across thousands of AWS accounts, creating secure and developer-friendly permission boundaries, and navigating the blurred lines of the shared responsibility model.
They discuss why traditional IAM models often fail at scale and the necessity of implementing dynamic permission boundaries, baseline strategies, and Terraform-based solutions to keep up with ever-evolving cloud services. Kushagra also explains how to approach IAM in multi-cloud setups, the challenges of securing managed services, and the importance of finding a balance between security enforcement and developer autonomy.
Questions asked:
00:00 Introduction
02:31 A bit about Kushagra
03:29 How large can the scale of AWS accounts be?
03:49 IAM Challenges at scale
06:50 What is a permission boundary?
07:53 Permission Boundary at Scale
13:07 Creating dynamic permission boundaries
18:34 Cultural challenges of building dev friendly security
23:05 How has the shared responsibility model changed?
25:22 Different levels of customer shared responsibility
29:28 Shared Responsibility for MultiCloud
34:05 Making service enablement work at scale
43:07 The Fun Section
--------------------------------------------------------------------------------
📱Cloud Security Podcast Social Media📱
🛜 Website: https://cloudsecuritypodcast.tv/
🧑🏾💻 Cloud Security Bootcamp - https://www.cloudsecuritybootcamp.com/
✉️ Cloud Security Newsletter - https://www.cloudsecuritynewsletter.com/
Twitter: / cloudsecpod
LinkedIn: / cloud-security-podcast
Kushagra Sharma: [00:00:00] But then when I was working at Booking.com, we had a scale of over 3,500 AWS accounts. For example, today we have S3 asterisk, which means all actions in S3. Three years down the line, AWS releases a new feature. They add new actions to it. Yep. If we reviewed S3 asterisk even a month back, it doesn't mean that today it stands true, because there might be new sub-permissions added under the same
IAM namespace. And if you don't do your threat modeling recurringly or you don't have processes around it, I don't have an answer for how to solve it, but I have an answer for how you can prevent the blast radius, right? Some people at the company came up with the thought that you should learn to say no to the wrong things.
Then only you can say yes to the right things, right?
Ashish Rajan: Is GenAI helping with any of this?
Kushagra Sharma: In the industry I would say it's in a very premature stage, especially when it comes to security.
Ashish Rajan: Have you been struggling with scaling your identity access management permissions? Giving service access to people quickly in large AWS accounts? Like I'm talking thousands, not just 1,000, 2,000, 4,000 AWS accounts. So if you are someone who's probably struggling with scaling identity access management, this is the [00:01:00] episode for you. We had a chance to talk to Kushagra Sharma, who has been a speaker at fwd:cloudsec for the past couple of years.
And we spoke about the identity access management challenges that he solved when he was working for Booking.com. Now, in this conversation, we go into the scaling part: where do you even start, having a baseline, a Terraform module, enabling services, and where the blurry lines of shared responsibility are when it comes to services that you want to give developers access to in a developer-friendly-first way, but you may not always have the right permissions from the cloud provider, or the right guidelines from it.
Or there are some gray areas that some may talk about. All that and a lot more in this episode of Cloud Security Podcast. If you know someone who's probably struggling with identity access management and wants to scale this is probably a good one to try and start leveling up from level zero. If you're starting today as well, we spoke about the initial stages of working with one or two accounts all the way up to going for thousands of AWS accounts, and even if you are multi cloud as well.
So if you know someone who's in that stage, definitely share this episode with them. And if you're here for a second or third time, or probably the fifth or sixth time, I would really appreciate it. If you're [00:02:00] watching this on YouTube or LinkedIn, definitely drop us a subscribe or a follow, but if you're listening to us on iTunes or Spotify, it would be super awesome if you can drop us a review or rating. It definitely helps more people find out about Cloud Security Podcast and the awesome work we're doing, which is helping people like yourself and other cloud security professionals out there do an awesome job in cloud security across all the popular cloud service providers and Kubernetes.
Hello and welcome to another episode of Cloud Security Podcast. I've got Kushagra Sharma with me. Hey man, thanks for coming on the show.
Kushagra Sharma: Thanks for having me here.
Ashish Rajan: I am excited to have this conversation with you, man.
We're talking about scaling IAM and maybe before we get into all of this, could you just share a bit about yourself, man?
Kushagra Sharma: I'm Kushagra Sharma. I'm a staff cloud security engineer, previously worked with Booking.com, and was also associated with fintech scale-ups and the consulting industry. I started as an IAM engineer, developing access control for turnstiles at football stadiums, then transitioned a bit into a DevOps role, and then security came naturally into it. Then, as you hear DevSecOps being said, that was my role in a couple of assignments that I took over. Then I thought, yeah, security is my forte, cloud security being like the trending [00:03:00] field in there. I did some heavy migration from DC to cloud, primarily on AWS and GCP.
And then it was all about securing the cloud.
Ashish Rajan: In terms of the scale of the accounts, like just to give the audience a bit more context on how large are we talking? Because a lot of people always look at, if you look at a lot of the tutorials and everything, people talk about one account, one AWS account, one GCP account.
That's not the reality. Like in the scale of how you have seen it, how large the number of accounts are.
Kushagra Sharma: So if you asked me this question perhaps five or 10 years back, I would say 50 accounts, a hundred accounts is like good scale, but then when I was working at Booking.com, we had a scale of over 3,500 AWS accounts.
Yeah. There you go. And it was growing exponentially. So thousands of accounts is now, my definition of a large scale environment, I haven't seen something bigger than that, but happy to see that. Of course.
Ashish Rajan: Yeah, and I'm glad you call it out as well, because I think it's very easy for people to think, oh, when people say large, it's just 20, 30, 50, 100, and obviously [00:04:00] being at that scale, and even the scale of double digits and triple digits, there's obviously challenges of IAM.
And could you share some examples of IAM challenges people face at scale?
Kushagra Sharma: Yeah, sure. What I've seen most customers doing wrong, and wrong is a harsh word, but what happens is once you start scaling, most security teams have this notion that security ops teams or someone needs to approve your IAM requests.
So you raise a request, someone reviews the permission set, and then it gets approved and available to use. Even if you have JIT flows, there's this approval step, right? And when you deal with 3,500 accounts or thousands of accounts, then you have more than thousands of entities. And you cannot have this approval step, or like a human approving it, in the back of the whole process.
Even if you automate that to some extent, you still have quite a lot of operational overhead. So at that time there's always a notion, approvals, red tape, security, maybe creating friction, which, doesn't end well. So you need to very well think it through of how you want to define your IAM strategy, whether you want to do IAM for your developers or you want your developers to do IAM themselves, of course, within a secure perimeter or [00:05:00] boundary, as you call it.
Ashish Rajan: Because a lot of people are in that IAM user stage, specifically if you're talking about AWS. A lot of people, I feel like they're trying to start on the whole, oh, I'm going IAM user. Then you move to the SAML part, I'm going to do single sign-on now. Okay, everyone's single sign-on, MFA. And then the next maturity after that is when you start seeing people do just-in-time provisioning as well, even for admin.
And we're not even going into the whole access control part. We're just basically talking about IAM just now, just to get access, just make sure Ashish gets access to the Google account as well. But then there is a whole aspect of every re:Invent, every AWS Summit. And now we're in this world of AI.
So it sounds like there are things happening every week. Does that add more challenges? Obviously I spoke about the maturity of the different IAM pieces, but having new permissions or new services and change of roles, does that also add up at that scale?
Kushagra Sharma: Yeah, that's a very good point. And definitely.
So back in the day, when AWS launched, reviewing permissions was, you know, S3, EC2s, and it was within your comfort zone. Then there was a huge introduction of new services. Now we have more than a hundred, 150 [00:06:00] services, talking specifically about AWS. Yeah, so even as security teams you don't have the know-how of each of the services and all of the hundred actions grouped within a service namespace, right?
So how do you even review, as a human, all the services, right? So there needs to be like a call that, hey, there should be some boundaries or controls or permission sets, or a perimeter, as you call it, so you know the blast radius of what a user can do in the environment. Another interesting thing, because you mentioned it here, is, for example, today we have S3 asterisk, which means all actions in S3.
Three years down the line, AWS releases a new feature. They add new actions to it, right? If you reviewed S3 asterisk even a month back, it doesn't mean that today it stands true, because there might be new sub-permissions added under the same IAM namespace. And if you don't do your threat modeling recurringly or you don't have processes around it, I don't have an answer for how to solve it, but I have an answer for how you can prevent the blast radius, right.
Ashish Rajan: Yeah. I guess that's a good segue into talking about boundaries as well. Cause a couple of talks that you did revolve around the whole permission boundary in AWS. What is it for people who have [00:07:00] not worked with permission boundaries yet?
Kushagra Sharma: So it was an interesting thing that we found: hey, there are IAM policies, which basically define the permissions that your IAM entity can get.
But then there's an interesting feature in AWS called the AWS permission boundary, which basically dictates what an identity can get in terms of permissions. So the maximum permission set, regardless of which identity-based policy you attach to it. So for example, if I have a boundary that says IAM asterisk allow, and then I deny IAM create user, which basically says that if a user attaches, and maybe the user is not familiar with AWS, they just attach a policy of IAM asterisk or full IAM access, and they have this boundary, they'll still be able to perform all the IAM actions. But the deny that says deny create user, that would still be applicable. So it's basically adding like a perimeter to your IAM entity, capping whichever permission sets the developer or project team assigns to it, with the boundary.
So that's why it's called like a boundary as such.
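[Editor's note: for readers who want to see the pattern Kushagra just described in code, a minimal Terraform sketch might look like the block below. The policy name, trust policy and role name are illustrative assumptions, not an actual configuration from the episode.]

```hcl
# Illustrative only: a boundary that allows all IAM actions but blanket-denies
# iam:CreateUser, mirroring the spoken example above.
resource "aws_iam_policy" "developer_boundary" {
  name = "developer-permission-boundary" # hypothetical name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "AllowIam"
        Effect   = "Allow"
        Action   = "iam:*"
        Resource = "*"
      },
      {
        Sid      = "DenyUserCreation"
        Effect   = "Deny"
        Action   = "iam:CreateUser"
        Resource = "*"
      }
    ]
  })
}

# The boundary is attached as permissions_boundary, not as an identity policy.
# Even a role that later gets "iam:*" attached stays capped by the deny above.
resource "aws_iam_role" "team_role" {
  name = "example-team-role" # hypothetical name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
  permissions_boundary = aws_iam_policy.developer_boundary.arn
}
```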
Ashish Rajan: How does that work at scale then?
Kushagra Sharma: It was interesting at Booking.com, what we did was two things. First is that we started to come [00:08:00] up with dynamic permission boundaries. And when you hear dynamic, you'll be like, what's that? Not every account is the same.
Not every environment is the same. Different compliance requirements have different asks, for example PCI. So what we did is, we started to generate these boundaries at runtime based on the context that you get in terms of environment, compliance scope or account-level exceptions, so on and so forth.
And then this boundary was unique for every account, but it was materialized at runtime. And interestingly, what this boundary contained was basically saying, hey, these are the no-go actions. So creating IAM users, creating internet gateways, maybe accepting VPC peering.
So all the things that you think are actions that your developer should never do, or never touch, deny that. Blanket. So regardless if they attach full permissions, a full administrative control policy, to their IAM entity, you're still covered by that, right? But why not use SCPs, man? What happens is that right now I gave three examples, but when you start dealing with hundreds of services, the length of your policy increases drastically.
So one of the challenges with SCPs as well as permission boundaries is first, the [00:09:00] character limit, right? So you need to find a balance between these two. Then secondly, what we realized is that changing an SCP might have a very drastic effect on the environment, right?
And you don't want to touch them that often.
At least that's what my belief is. So what we did is that things that don't need frequent changes, let's say things that are non-negotiable, you put into SCPs. So block the usage of IMDSv1, deny the usage of the root account at all, put them in SCPs, because these are the things where you don't anticipate exceptions or deviations.
Things where you need the dynamic nature, which I spoke about briefly, those you put into the permission boundary, because here you can materialize account-level exceptions, organizational unit level exceptions, and exceptions per service, where your GRC gives you a green light or there's a risk acceptance, for example.
So the dynamic things or things that need frequent changes, put them into permission boundaries; things that don't need change, the non-negotiable controls, put them in SCPs. So this way you're also balancing the character limitation of AWS at the same time.
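[Editor's note: a hedged sketch of what the "non-negotiable" SCP side of that split could look like in Terraform. The two statements, requiring IMDSv2 at launch and denying the root user, follow the examples mentioned above; the policy name and the OU target are placeholders.]

```hcl
# Illustrative SCP holding the non-negotiables: require IMDSv2 at launch
# time and deny any use of the root user.
resource "aws_organizations_policy" "non_negotiables" {
  name = "non-negotiable-guardrails" # hypothetical name
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "RequireImdsV2"
        Effect   = "Deny"
        Action   = "ec2:RunInstances"
        Resource = "arn:aws:ec2:*:*:instance/*"
        Condition = {
          StringNotEquals = { "ec2:MetadataHttpTokens" = "required" }
        }
      },
      {
        Sid      = "DenyRootUser"
        Effect   = "Deny"
        Action   = "*"
        Resource = "*"
        Condition = {
          StringLike = { "aws:PrincipalArn" = "arn:aws:iam::*:root" }
        }
      }
    ]
  })
}

# Attached high up in the org; the OU ID below is a placeholder.
resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.non_negotiables.id
  target_id = "ou-xxxx-exampleou"
}
```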
Ashish Rajan: Yeah. And I guess it sounds like you guys have done some work on the beginning stages as well, because I imagine people who are either listening or watching to this, [00:10:00] and if they are in that thousands of accounts.
Just by putting a dynamic Terraform template to apply a permission boundary, the answer sounds like you guys had to find the balance between if we still need SCPs for an absolute not allowed in any account, doesn't really matter because there's also a limited set of permissions you can even have on that as well without impacting a large environment.
And then you have to have that stage two almost for the Terraform that the dynamic templates.
Kushagra Sharma: Yep, that's true. And as I said, as the services increase, you need to think about it, right? Do I allow all the services by default or do I maintain like a safelisting strategy? So that's another topic, which eats into the character limit of your policies a lot.
Because if you start safelisting all AWS services one by one, you immediately end up with a list of a hundred services that your organization uses. But if you're not doing that at the cost of security, then you don't have this problem. It's finding what your organization does versus how your IAM entities or IAM permission sets are set up there.
Yeah. It's a matter of finding the balance, as you said.
Ashish Rajan: So does it still apply, obviously because the [00:11:00] overall conversation that we have right now here is about scaling IAM and access control across a large, like a fairly large environment, do you find that with the dynamic permission, are you still doing SAML or are we doing IAM user?
Cause I guess it's worthwhile calling that out. Like where are we in that? I guess it may be a bit more granular, but we have the SCP. Yeah. We have IAM permission boundaries. Is that being applied to an IAM user or are we doing SAML? Cause how does that scale as well?
Kushagra Sharma: So this is purely IAM roles and SAML via Federation.
So first thing, don't use IAM users. So everyone listening here, never use them. Have an SCP to deny IAM users, no matter what. If your vendor is pushing for one, push them back: hey, this is not the best practice. Yeah. In our organization, we barely have IAM users. Maybe a couple max for deployment purposes or break-glass purposes, but nothing apart from that.
Yeah. So here it's more that everything is federation and IAM roles to start with. And what we did interestingly, and what I touched upon earlier, is that we as a [00:12:00] security team didn't want to do IAM for development teams. We don't want to be the ones reviewing their access requests or permissions or what permissions they're granting.
We centrally wanted to define this boundary, which is safe, and then enforce it in a way that, hey, you can create your own IAM entities as long as you have this boundary. That's the end goal. What's interesting is that AWS provides you with a condition key, which you can enforce via an SCP, saying that whenever I create a role, it needs to have this boundary attached.
So we just enforced it using an SCP: you need to have the boundary attached to every IAM entity you create.
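[Editor's note: the enforcement described here relies on the iam:PermissionsBoundary condition key. A sketch of such an SCP is below; the boundary policy name in the ARN is a placeholder, and the exact set of denied actions an organization would include will vary.]

```hcl
# Illustrative SCP that makes the boundary mandatory: role creation and
# boundary changes are denied unless the request carries the expected
# boundary, and removing the boundary is denied outright.
resource "aws_organizations_policy" "require_boundary" {
  name = "require-permission-boundary" # hypothetical name
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyCreateWithoutBoundary"
        Effect = "Deny"
        Action = [
          "iam:CreateRole",
          "iam:PutRolePermissionsBoundary"
        ]
        Resource = "*"
        Condition = {
          StringNotLike = {
            "iam:PermissionsBoundary" = "arn:aws:iam::*:policy/developer-permission-boundary"
          }
        }
      },
      {
        Sid      = "DenyBoundaryRemoval"
        Effect   = "Deny"
        Action   = "iam:DeleteRolePermissionsBoundary"
        Resource = "*"
      }
    ]
  })
}
```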
Ashish Rajan: Also, otherwise they can't do anything if they don't attach the permission boundary. To your point, it's an overall governance, security control, rather than just trusting the fact that, oh yeah, I told Ashish to use a permission boundary.
He's going to make sure every time, it's trust, but verify, as they say.
Kushagra Sharma: Exactly. So today, if you look at our infrastructure, or in the companies I worked in, all entities have this permission boundary attached, and there are no exceptions to it. So when you get to enforce it, that solves your problem, because yeah, if you just put it out as guidance, then yeah, some might use it.
Some might forget about it. Some might not [00:13:00] find it like, useful.
Ashish Rajan: Yeah.
Kushagra Sharma: That's enforced. So you cannot update a role policy. You cannot create a role unless you pass this boundary. End of question.
Ashish Rajan: In terms of what's involved in the whole dynamic permission boundary, it sounds like obviously I need to build some Terraform modules for it.
But to put it back into the AWS cloud context, are you creating boundaries at an account level, user level? Cause we're still talking about over 4,000 accounts here.
Kushagra Sharma: Let me start with a bit of the journey we had at Booking.com especially, when we started like securing or building the security foundation.
Okay. We were like, hey, how do you make the account secure before you sort of hand it over to developers and project teams? So then there was an obvious old-school method: baseline the accounts, have a baseline strategy, which applies even in today's world. But then we were like, what does the baseline consist of, right?
So it started with a Terraform repository where we started to put all the same controls, basic default roles, tooling integration, so on and so forth, prevent certain actions. But then when we heard about the permission boundary, that sort of formed our baseline. So we started defining a permission boundary [00:14:00] saying, hey, these are the non-negotiable actions.
They should be blocked. These are the services we allow in production or different environments, so these should be in the allow list. When you run, for example, an EC2, they should use the golden images from our bakery pipeline, so on and so forth. So we were putting all those conditions in the permission boundary, and it grew very organically.
Then we started to put in the same defaults, enforce the account-level S3 public access block and everything, package this into a module and deploy that into every account before it's handed over to the project teams. And this whole deployment mechanism, with the help of our platform teams, was a rolling release.
So it's not that you baseline it once, it's baselined every other time. So every week we have two or three version releases of the same boundary or the baseline, and it was working fine, right? And then we had a number of challenges. So I'll talk through some of them. First was, of course, the scale.
When we started to add more and more to the boundary, of course, we were running out of character limits. So we started using wildcards, not reducing the security scope or the security coverage, but trying to be very smart with the characters. Second is then we had requirements. Hey, for example, if you have PCI requirements, and the PCI [00:15:00] environment is very restricted compared to your normal production one.
So of course there was this thing: hey, if I start making this restrictive, it would be for all environments. So how do you make it environment-aware and cater a bit dynamically to the environment itself? Then the third one was like, hey, we are not allowed to use, for example, this service, but this specific team has an exception to do so for experiment purposes.
How do you handle exceptions? And if I grant an exception, I don't want that to be an org-wide exception. I just want that exception to be in the selected accounts that the team is requesting it for, right? So how do you do account-level exceptions? And then we were like, okay, the baseline strategy is good to start.
Back then we didn't have so many accounts. It was still growing, but we needed to rethink, right? And then we came up with this idea of using Terraform dynamic statements, different boilerplates: let's define a boilerplate of what we want to do. So here's a file. Here's the list of actions that we want to deny.
Here's the list of actions we want to allow, here's the list of exceptions that would eventually get passed in, the compliance scopes, the different inputs to the boundary. And then we were materializing this at runtime. [00:16:00] So what we did is that when you run Terraform plan, it of course runs against the state in that specific account, right?
So we added this thing where we pass the account number, all of these inputs based on the compliance scope, the exceptions, and so on. And using this, it would generate that boundary from this boilerplate. So if you think about it, all the 3,000, 4,000 accounts you have, each one of them might have a different boundary, but all of them are managed centrally via one single repository.
That's the baseline.
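[Editor's note: a simplified sketch, under stated assumptions, of how a boundary can be materialized per account at plan time with Terraform dynamic statements, as described above. The variable names, the exceptions file and the action lists are hypothetical; the point is the context-driven generation, not the specifics.]

```hcl
variable "account_id" {
  type = string
}

variable "compliance_scopes" {
  type    = list(string) # e.g. ["pci"]
  default = []
}

locals {
  # Per-account exceptions, e.g. reviewed and merged via pull request.
  exceptions      = yamldecode(file("${path.module}/exceptions.yaml"))
  account_allowed = lookup(local.exceptions, var.account_id, [])

  base_denies = [
    "iam:CreateUser",
    "ec2:CreateInternetGateway",
    "ec2:AcceptVpcPeeringConnection",
  ]

  # Extra denies when the account is in PCI scope (illustrative action).
  pci_denies = contains(var.compliance_scopes, "pci") ? ["s3:PutBucketPolicy"] : []

  # Final deny list minus any account-level exceptions.
  deny_actions = setsubtract(
    concat(local.base_denies, local.pci_denies),
    local.account_allowed
  )
}

data "aws_iam_policy_document" "boundary" {
  statement {
    sid       = "AllowByDefault"
    effect    = "Allow"
    actions   = ["*"]
    resources = ["*"]
  }

  dynamic "statement" {
    for_each = length(local.deny_actions) > 0 ? [1] : []
    content {
      sid       = "ContextualDenies"
      effect    = "Deny"
      actions   = local.deny_actions
      resources = ["*"]
    }
  }
}

resource "aws_iam_policy" "baseline_boundary" {
  name   = "baseline-permission-boundary" # hypothetical name
  policy = data.aws_iam_policy_document.boundary.json
}
```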
Ashish Rajan: Oh, because as a developer, when I request, say, I need a new dev AWS account or whatever, there's still a baseline that's been defined, let's just call it the global boundary for lack of a better word. And that's, yeah, so there's a global boundary that says, oh, it's all okay.
Don't use AI, I don't know why, but just say don't use AI 'cause Ashish doesn't know how to use it. But you can use EC2, you can use S3, you can make changes there, but you can't create IAM users. But then if I have a PCI exception, I can go in at that specific account level. That's the dynamic part, right?
Kushagra Sharma: Exactly. And it was interesting because this repository wasn't one that only security could touch. It was open, like [00:17:00] change management tools. So if a developer is good with Terraform and has the required approvals, he can create a pull request, a merge request, and say that, hey, I want to change it.
So it was also educating people that, hey, this is what's happening in your accounts. And using this whole baseline thing, we were also deploying central security services, GuardDuty, enabling CloudTrail, for example, because sometimes what happens is there's a lot of duplication of work just because your teams don't know what the other team is doing or what's managed centrally.
Some teams wanted to create CloudTrail, to have better auditability, which is really good. But then they didn't know the fact that, Hey, there's an org wide trail configured here at your company, which is doing the same thing. So you're duplicating the feeds. Just because they didn't have visibility into it.
Ashish Rajan: So they would get visibility into that as well, that, oh, they just need to tap into the feed rather than go, hey, where's my CloudTrail? I need to have my own CloudTrail.
Kushagra Sharma: So that's why, I think, this year I also talked a bit about the shared responsibility model within our own company, where you define these boundaries or have some visibility, because not every developer might know what the security team is doing. Because when security works, they keep it very close within the team, and it forms a gray area for the development [00:18:00] team as to what they're actually deploying, what they're protecting. Because of how the whole cloud platforms are structured, it's very different.
Because for example, if I push an SCP, my developers might not have visibility into it, because it's inherited by the account but applied at the org level. It's all about educating or making it very transparent in terms of how your org is set up, what are the controls that are applicable at each of these different levels, as I call it.
So yeah, it was quite an interesting journey and it's still not complete. So it's an ongoing thing. You need to keep refining, be on your toes, see when new services are launched, do threat modeling, go back to previously enabled services. And it's a never-ending process, basically.
Ashish Rajan: It'll be good to talk about how the responsibility kind of shifts, because it sounds like you guys have figured out a developer-friendly way of doing security. A few months ago we had an episode on the security baseline, how to create that, with two speakers who did a good job speaking about that at length. The one thing I didn't get a chance to cover back then, and I'm curious to hear the answer from you, is the shift of responsibility as it comes [00:19:00] in with the baseline. 'Cause a lot of
security teams in general, sometimes, are not the first ones to go and build Terraform templates. They are primarily focused on, hey, I'm going to look at the CSPM alerts coming in, I'm going to look at, I don't know, whatever the next best thing is for me to be able to proactively reduce the number of false positive alerts that are coming in, or my IDS/IPS or whatever. I guess as these accounts grew and as you guys started building more Terraform templates, more baseline, how was the shift in responsibility from you being the security team that's driving a lot of the security controls, to now me being a developer who can just go, actually, you know what, Kushagra, I need access to be able to use an AI service in AWS, and oh yeah, just create a pull request and get that added to your own particular account?
How was that transition? Because I feel like there's a lot of cultural context that needs to be added there as well. So just the technical part sounds easy enough, I'm like, yeah, sure, I'll put a pull request in. Were there [00:20:00] cultural challenges that you came across as well?
And how did you guys deal with it?
Kushagra Sharma: While we were developing this baselining thing and all, moving to the cloud was also very new to a lot of people, because that's when the cloud journey started. Most organizations build up like a Cloud Center of Excellence, a CCoE, after they are a couple of years into that stage, but for us it was like, no, let's build it right now.
And let's start telling people how to onboard to the cloud. So behind everything that was happening, there was also like a very simple process to start with of how do you want to move to cloud? Here's the intake sort of request where you say, Hey, I want to build this. This is my architecture diagram in a set format.
These are the services I want to use.
Ashish Rajan: Yeah,
Kushagra Sharma: How do I do it? And then there were, of course, people in the company, like the platform teams, the architects that you have, who were helping them: this is how you do it in the best way possible, so on and so forth. And then they were passing on this information: hey, you want to use this service?
It's not permitted. You want to experiment with it, you have the necessary approvals from risk compliance or GRC, for example, then there's a single piece of code where you need to make a PR. And it wasn't that security teams couldn't do the PR on their behalf. It was more to get them [00:21:00] into the habit of reading the code and understanding what's there, right? So it came very naturally that they started to read through the repository. Sometimes they would even make an MR with suggestions that, hey, if you change this, the deployment time would reduce by X percentage, which was really good, right?
Because it was development teams collaborating with security. And as I speak, you might be like, this is too good to be true, but it was the case, right? It was very beneficial to establish your CCoE. You might not be very mature in your cloud journey, but have at least a very brief process of how you want to intake these requests and how you evaluate them, because a lot of times when you're doing migrations, and I was part of some of them, you just lift and shift, and that's where you lose the capabilities of the cloud.
That's where you start using IAM users because you want static keys to be just embedded into your application somewhere. And that's where you lose the whole security as well as the benefits of the cloud.
Ashish Rajan: Yeah. So to your point, there's the cultural shift over there of being able to almost train the security people to be okay with that as well.
But also having the buy-in that we're not lifting and shifting, we're basically transforming, as people ideally would like to [00:22:00] do it.
Kushagra Sharma: Exactly. And to be honest, back in the day, there were some teams who wanted to use new services, and even as security, we weren't very familiar with those services.
We worked together with them. We were like, hey, this is your playground, because we also have a different environment, I talked about the different baseline variants that we have. So we were like, hey, this is the safe zone, go try, experiment. We also want to see which logs are getting generated while you work on the service, which alerts are getting fired by the CSPM or different services.
Rather than doing it the other way around. So reverse engineering how you can build security controls, seeing how they deploy. Because some areas were even gray for us, where there are newer services or things where you don't have much guidance from the cloud providers. So it's a matter of communicating and presenting security as, we are not here to just say no or yes, but we are here to reach that middle ground. Like you come one step, we come two steps, and we find the middle ground to make it happen in the shortest frame of time.
Ashish Rajan: Probably also an example where security is being an enabler rather than the blocker as well.
Kushagra Sharma: Yeah, exactly. Some people at the company came up with the thought that you should learn to say no to the wrong things; then only can you say yes to the right things, yeah. It's [00:23:00] about getting that message across, not just security said no, security said yes, because that's creating like a wrong notion within your company.
Ashish Rajan: Obviously since the time you guys migrated into the cloud, you went through that journey and the cultural change that happened as well. The complexity of cloud has evolved quite a bit too. And you were referring to shared responsibility earlier, how has shared responsibility shifted now?
And I think you had some thoughts on this in your talk as well, so I'll definitely link the talk in the notes and description. What changed in the responsibility that we share with our cloud service provider, and even within the organization as well?
Kushagra Sharma: When you look at the general shared responsibility model, back when all the cloud service providers started advocating it like maybe a decade back.
It was a very simple diagram: this is what we manage, which is the data centers, the host layer and all; you manage the configuration of the service, your data. So it was very simple guidance which made sense. But now that we are a decade later, that diagram still stands as it is.
It didn't change despite how many new services there are or how the whole cloud landscape evolved. Now we have managed services, for example, [00:24:00] and with managed services there's again this notion that if it's managed, then we don't have to do or care about anything, but no, you can still expose data, for example, because these managed services again rely on 10 different services, which might not be managed, right?
Definitely, there needs to be a change in terms of providers not abstracting away a lot of information but being very specific. At the same time, I realize the challenges they have in being very specific about that. But that shouldn't stop us as customers from defining our own responsibility model of what we think is the perfect way.
And of course you can get feedback from your cloud provider, push them on that: hey, this is what we think about it, and go from there. Because the landscape has changed a lot with all the new AI services and these new services getting added, and threat modeling is not the same as it used to be like five or 10 years back in time.
Ashish Rajan: And to keep it even more simple as well, when a lot of us, people like you and I, were migrating services to cloud, all we dealt with was either EC2, S3 or databases being moved across. Now we have Kubernetes, containers, serverless, and [00:25:00] now AI on top of that as well. So even the kind of technology or the technical context has also quite shifted.
So it's not just, to your point, a managed service or a PaaS, but it's also the fact that, oh, my infrastructure is quite different as well. Do I have serverless Lambda or do I have serverless Kubernetes? There are so many variations to it as well. And I think sometimes it gets confusing to the point where, oh, how much has it shifted?
But did you find that even on the customer side as well? I think on your talk, you spoke about different levels of customer shared responsibility as well. Would you mind sharing those as well then?
Kushagra Sharma: Yeah, sure. So yeah, just to give a bit of context, on the customer side it was also the case that sometimes developers would be like, oh, the security team must have controls at the SCP level which would prevent anything from being public.
I'm like, sure, yes, but there are corner cases to it, and you cannot prevent everything. You can prevent them maybe on the infra layer. And then, as I said before, some teams thought that, hey, we need CloudTrail, but then they were like, oh, maybe there's a central trail or maybe there's not. Then sometimes there's this question of whether I'm supposed to secure [00:26:00] it or whether it's the security team's responsibility, and when you have such confusion, that's when you have security gaps.
So this led us to decide as a security team, hey, let's be an advocate, let's make our own shared responsibility model. Because of all the controls I spoke about until now, we have preventative controls, detective controls, also shift-left controls that are in your CI/CD pipeline. So how do we provide a picture to the end user who's owning the service in your company?
That, hey, this is what the security team is doing, and this is what you're supposed to do, what's your share of, sort of, responsibility. The responsibility model was very simple, four slabs. The first slab was organization-wide controls, which we basically call non-negotiable controls. So all your SCPs, regional restrictions, some sort of service restrictions, central service enablement, like GuardDuty, CloudTrail, WAF, Shield, whatnot.
They go in there. So these are the things that are managed centrally by security. These are non-negotiable and you don't have to touch them or worry about it. It's there, right? So these are
Ashish Rajan: SCPs or permission boundaries?
Kushagra Sharma: These are SCPs. So it's the org level.
So no [00:27:00] permission boundaries, just like core, hardcore, non-negotiable things.
So you don't have to worry about it. There is central audit logging, so you don't have to worry about enabling CloudTrail or GuardDuty, it's all centrally managed. And then we had the second layer, which is called organizational unit level controls, or like group-level controls, where basically we had some sort of logical grouping, like, for example, this is the PCI OU.
So there are these controls that are specific to this OU in line with PCI requirements. There is this specific OU for sandbox environments, which needs some more relaxed controls. Environment-specific controls. At the same time, we were putting controls in terms of cross-access paths.
So for example, your dev shouldn't communicate with your production environment, or your pre-prod shouldn't be able to touch PCI because that's a connected PCI flow. So access control patterns, so on, so forth, at the grouping level. Then there was a third level, which was more on the monitoring side.
So monitoring as well as the baseline side. Something I didn't mention is that when we were making these permission boundaries, there was a time where PCI requirements were so strict that we had to create a flavor of the boundary. By flavor, I mean a variant of the boundary. For [00:28:00] PCI, we created a new flavor called the PCI-flavor permission boundary, which was very restrictive, very catered to the PCI environment.
So this third layer is basically creating different flavors. So if you think your boundary would deviate so drastically that you wouldn't be able to manage it with the Terraform logic or all the grouping logic or the boilerplate logic,
yeah, you go with a new variant of it; we call them flavors.
So we have, for example, a sandbox flavor, a PCI flavor, some specific business unit flavors like fintech, for example. So this layer was basically creating these flavors and putting controls very specific to these. And at each of these blocks that I mentioned, or the sub-blocks, we had like markers saying this is what the security team is managing.
This is what the platform teams are managing. And this is what the end user should be managing. So like small legends or markers on each of the specific blocks. And then the fourth layer we had was more of the monitoring as well as the controls that you bake into your CI/CD pipeline, like the shift-left controls.
So no matter what you deploy, these are the alerts that you might get if you deviate from something, because we also have this interesting concept: whatever you prevent, you also monitor. Just [00:29:00] because you're preventing them doesn't mean you stop monitoring
Ashish Rajan: them. Ah, yeah. If the prevention is turned off, how do you even know it's been turned off?
Kushagra Sharma: Exactly. So there should be some monitoring there. So we always try to maintain a parity between our preventative as well as detective checks because that keeps you in order. And also it's validating your preventative controls, right? Yeah. If you're seeing something that you prevented, it could be that the control is no longer, sufficient or it's no longer effective or it was never effective if you didn't test it.
So it's also keeping you in check in that regard.
Ashish Rajan: Do you feel that works for multi-cloud as well? Because obviously I know we've been focusing a lot on the whole AWS side, but how does this scale out? Because another change that has happened since the migration, or at least the initial wave of migration, has been that now we have multi-cloud as a new standard as well.
So does this apply in that context as well?
Kushagra Sharma: So the shared responsibility model in general does apply because it's more cloud-agnostic, basically saying, hey, publish what you have in terms of your layered security approach, because it might be clear to a security engineer but not to a software developer in your company.
On the other hand, when you start building these [00:30:00] controls, then of course these are not cloud-agnostic in terms of preventative controls. For example, SCPs might not translate into VPC Service Controls or something on GCP. But at the same time, you need to have control categories, for example, no usage of static keys.
So on AWS, it would be an IAM user; on GCP, a service account with a static key, right?
Ashish Rajan: Yes.
Kushagra Sharma: So you need to map them to a central agnostic layer, which is your control requirement. And then you start building out the controls.
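[Editor's note: one hedged way to picture that "central agnostic layer" is a control catalog that maps each requirement to the per-cloud mechanisms implementing it. The Terraform locals below are purely illustrative, and the GCP entries assume the standard organization policy constraints.]

```hcl
# Hypothetical control catalog: a cloud-agnostic requirement mapped to the
# provider-specific mechanisms that implement it. Naming is illustrative.
locals {
  control_catalog = {
    "no-static-credentials" = {
      description = "No long-lived access keys for humans or workloads"
      aws         = "SCP / boundary deny on iam:CreateUser and iam:CreateAccessKey"
      gcp         = "Org policy constraint iam.disableServiceAccountKeyCreation"
    }
    "no-public-object-storage" = {
      description = "Object storage must never be publicly readable"
      aws         = "Account-level S3 Block Public Access in the baseline"
      gcp         = "Org policy constraint storage.publicAccessPrevention"
    }
  }
}
```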
Ashish Rajan: Oh, so your Terraform dynamic permission boundary templates would also evolve based on whether it's Google Cloud or AWS Cloud you're applying them to, but the four levels that you spoke about would still be technically applicable.
Kushagra Sharma: Yeah, exactly. So on the technical side, it's different, but yeah, this whole layered approach to security in terms of the shared responsibility model would still be applicable.
Ashish Rajan: Interesting. And would you say, because another thing I found interesting in the conversation that you had in your presentation was also around the whole idea of A, enabling services, but also gray areas that exist between us and the cloud provider as well.
Maybe if you want to start with the gray areas first. It's [00:31:00] funny, every time people talk about shared responsibility, it's, oh, I know exactly what it is, but some of the things you gave as examples hit me hard. So if you're going to share some examples for people to realize what they miss out on with shared responsibility, that'll be awesome.
Kushagra Sharma: All right. So yeah, it's interesting. One of the examples I highlighted for AWS was service-linked roles. For the audience that doesn't know service-linked roles, these are default roles managed by AWS, which basically, as advocated by AWS, are there for the service to run smoothly in your AWS accounts. For example, AWS Glue might have a couple of service-linked roles that are there in your account for Glue to run successfully, right?
And now I, as a customer, when I open my account, I see that these are the two roles managed by AWS in my account, right? Which is fair, they might have identities in my account. But then, as I just described with my shared responsibility model, my org-level controls, which are the non-negotiable controls, are SCPs.
I would be of the view that, hey, I'm a customer, I define my controls, so these service-linked roles are bound by that, right? Because that's the boundary or the perimeter I was talking about. But there's a fine print that [00:32:00] service-linked roles don't fall under SCPs. So to sum it up, it's an IAM entity managed by a cloud provider, present in your account, but the customer has no control over it.
So the cloud provider does have its reasons. But then in terms of shared responsibility, this should be an explicit responsibility mentioned by the cloud provider, right?
Ashish Rajan: Yeah.
Kushagra Sharma: We are managing this role, we are the ones responsible for it should anything go wrong. Because as a security engineer, I don't have the peace of mind seeing these IAM entities, not in my control, present in my account or in all my thousands of accounts.
Ashish Rajan: I was gonna say, the first time I heard about this example from you, the first thought that came to mind was the Capital One breach that happened ages ago, where the AWS employee was the person who basically got access to the server, whatever.
Because I obviously never went into the detail of it, apart from knowing it's an instance profile role, blah, blah. But technically it could just be a service account, a service-linked role as well. If they have access to it, we don't know how that works. It's like a black box.
Kushagra Sharma: Exactly. And I spoke with a lot of people, and some people were like, hey, if you're using a cloud provider, then you should trust them. And I'm like, sure, I would trust [00:33:00] them, but it's trust but verify. And if there was an explicit declaration of it, I would still be fine. So that's why, even in the talk I did, I was like, it's defined, but the lines are very blurry, and it's basically left to the perception of how the user understands it.
So I might perceive it in a very different manner than my colleagues or different customers out there.
Ashish Rajan: Some people may also look at this from a perspective of what's the probability of this being misused? So then we go into a risk conversation for, is this a risk that is acceptable or is this a risk that we want to manage?
But then again, how would you manage it because you don't have any control on it anyways?
Kushagra Sharma: The only risk acceptance is trust your cloud provider, which in current days with all the supply chain attacks and different vectors.
Ashish Rajan: Not every organization may be comfortable with that as well.
Kushagra Sharma: Exactly.
And this is just one example. These are the small things that make you ponder, hey, is the responsibility model enough? Or do you expect cloud providers to provide more granularity into it? Because there's an interesting concept: I think Google is doing the shared fate responsibility model,
okay, saying no matter what happens, it's a shared fate, right? So they are helping customers with certain things to improve security, so on and so forth. So it's a step forward. [00:34:00] But at the same time, I expect cloud providers to be a bit more explicit or work with customers to address these challenges.
Ashish Rajan: This reminds me of the conversations I had ages ago in the beginning, when we were doing a lot of migration. A lot of questions were around, how do we know that the encryption in transit is happening on the AWS side? That used to be, I'm like, it's never written anywhere that it's actually encrypted in transit.
You just assume that, yeah, I guess since it's in the AWS network or the Azure network or GCP network, it's already encrypted in transit, right? Encryption at rest as well. I'm like, I don't know, I'm just making an assumption here. And a lot of us were making an assumption. We don't ask that question anymore.
So I think when you give me that example, the first thing is, yeah, we have assumed a few things that we take for granted. Which kind of also brings it back to what you were talking about with the example of S3 with a star: you may have reviewed it six months ago, but re:Invent has happened since then.
There are a lot more services, and what's spoken about after a big conference is usually the big service announcements, not the little ones. So you don't even hear about, oh, [00:35:00] now you can do this in S3, or you have this new feature that's available in EC2, but because you picked the star option, it never worked out or it never came up in your threat assessment.
In terms of being a developer-friendly, developer-first kind of company as well, how do you make service enablement work at scale? Because new services keep coming up, but then also, to what you said with the example of S3 star, I reviewed it six months ago, am I reviewing it every month now?
Because I don't know when AWS may release a new service. What was the approach for service enablement?
Kushagra Sharma: There's not a bulletproof answer to it, but how we try to do it is that when we enable a service, we are of course looking at which actions should be denied. Review the whole long list of actions and block the things which you think are not right for your environment.
Second is, for things that are allowed, put conditions. A very basic example: if you're spinning up EC2s, use baked images. So just putting a condition in there. So this is a review cycle you need to do at least once, also to see the logging sources, what policies you want to deploy, which controls.
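[Editor's note: a sketch of the "allow with conditions" idea for an enabled service, here forcing EC2 launches onto baked images via the ec2:Owner condition key. The account ID is a placeholder and the statement is illustrative, not the configuration discussed in the episode.]

```hcl
# Illustrative "allow with conditions" pattern for an enabled service: EC2 is
# allowed, but launches must use images owned by a hypothetical central
# image-bakery account.
data "aws_iam_policy_document" "ec2_enablement" {
  statement {
    sid       = "AllowEc2"
    effect    = "Allow"
    actions   = ["ec2:*"]
    resources = ["*"]
  }

  statement {
    sid       = "OnlyGoldenImages"
    effect    = "Deny"
    actions   = ["ec2:RunInstances"]
    resources = ["arn:aws:ec2:*::image/*"]
    condition {
      test     = "StringNotEquals"
      variable = "ec2:Owner"
      values   = ["111122223333"] # placeholder: image-bakery account ID
    }
  }
}
```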
But coming to your question, when AWS [00:36:00] does a new announcement, they are not letting customers know that, hey, these are the actions we have added. But there are, I don't remember off the top of my mind, some monitors which the cloud security community built, which are basically monitoring these and sending an alert or an RSS feed saying, hey, this is a new service.
These are the actions that were added, these were deprecated. So as a security practitioner, you need to be on top of things constantly. Have review cycles. So set a frequency for when you want to revisit the service, because of course you cannot jump ad hoc every time AWS does an announcement, because then the team would need to have hundreds or thousands of employees.
So have a set frequency, have a process for how you want to review the service, do the threat modeling. And when you do the threat modeling, look into the things that changed, not the things that you already reviewed, because that would save you time. So that's how we are doing it in one fashion. But at the same time, when you talk about building controls,
if you're using a CSPM or some other third-party tooling, they are proactive in building out new controls. And that's when you realize, hey, now they're talking about S3 Access Grants, this didn't exist when the S3 review was done. So maybe it's something that triggers the whole process of [00:37:00] defining policies, or reviewing the out-of-the-box policies that you're getting from the CSPM. Because I always say this to my colleagues: think of CSPMs as an additional intelligence feed, because they are doing the same work you are trying to do, and they are building out all these out-of-the-box policies, which you can leverage in your environment.
And never ignore them, see what they are changing as well, because if I see the release notes of my CSPM, I'm like, hey, they added this bunch of policies, that means something changed, right? So it's a matter of monitoring all the feeds and having a process for how you want to redo your threat modeling when something changes, or having constant iterations of it.
Ashish Rajan: I think another example I remember you gave was around the whole read-only policy as well. It's not that easy, it's not that every new service gets added to the read-only IAM policy right away.
Kushagra Sharma: So yeah, while I was explaining this challenge back in my talk, I also said that if you look at AWS, for example, they have this whole review process, which they talked about at re:Inforce this year, where whenever a new service is launched, they have review cycles. But then we were looking at the AWS read-only policy managed by AWS, which has a list of all the services
that they allow. [00:38:00] For example, there was a service called Bedrock, a GenAI service launched, I think, last year, and it was only four or five months later that it got reflected into that read-only policy. Which means even their internal processes, I don't know how they work exactly, but it took four to five months, maybe six, seven months, to get that into place, which tells you how complicated IAM has been. Because there's another interesting point to it, given that you mentioned that for customers we have an IAM policy size limit, as we discussed.
But if you look at this read-only policy by AWS, it has every service asterisk and it's like a list of 200 services. And it's actually nine times the permissible size limit for a customer. Okay, so of course it's AWS, so they have control over their policy size. But if you think about the customer, I would love to have the liberty to define such huge policies, right?
Because it solves a lot of pain points. That's how complicated IAM in general is.
Ashish Rajan: Going back to the whole service enablement piece, in terms of the components, these are quite complex as well. Do you have a whole CI/CD pipeline? It's not just the AWS services anymore.
Are you going to use open source, [00:39:00] PaaS, SaaS? How do you enable across those kinds of things? Cause you mentioned you had some shift-left thing. Like how would you try to model a CI/CD pipeline? There are so many questions up there at scale as well, cause everyone wants to use their own version of a CI/CD pipeline as well.
Kushagra Sharma: Yeah. So what we also do is, if you're launching a new service and enabling it, then it's very easy to enforce all the shift-left controls, which is blocking things before they are even deployed. There are open source tools like Checkov, which was later acquired by different companies.
So there's Bridgecrew and all of that, which you can use if you're using Terraform or CDK.
Ashish Rajan: Yeah,
Kushagra Sharma: Block things before they even get deployed. So when you run Terraform plan, it says, hey, you're failing these things because it's not meeting the configuration. The interesting challenge is that if you want to introduce a new control for existing services, you would see that it would block deployments, and then you might hamper production.
So how do you do that in a rolling fashion? So for example, today I want to block a certain configuration which, I mean, tens of teams have already deployed, and they would be blocked when they try to roll out a feature release to the service. How do you do it there? What we do is we again have very fixed time periods: hey, you have X days to [00:40:00] remediate it; until then, we will only enable it in like soft-fail mode.
So we're enforcing shift left, but in soft-fail mode. So they get a warning every time they try to deploy into production that, hey, this isn't a secure configuration, you have until this date to fix it. And once that deadline has passed, then you start enforcing it in hard-fail mode, and then you actually start blocking deployments. Because, as I said, we are very focused on developer experience.
And we just don't want to go one day, just because we saw an article, and start blocking things, of course, if it's not business critical, and start blocking the deployments. So it's more about giving them time to also remediate, because sometimes, for some of the resources, even to turn on a security configuration you need to redeploy the resource.
I remember there was an interesting case where we said, hey, redeploy this thing with the security configuration, and that recreated the cluster, deleting all the data. Of course, we had backups, but you need to be very careful when you do some of these things, because different services work in different ways.
So it's all in the timing and soft fail, and slowly transitioning into like enforcement.
Ashish Rajan: So you could just be doing a patch upgrade, but go up one major upgrade versus a minor upgrade. Huge difference in how the functionality works as well. You need time to even [00:41:00] test if your application would continue to work the same way with the new major upgrade that's coming in.
Kushagra Sharma: Yep, precisely, to all of that. We talked about a lot of layers of controls and all, but if you think from a developer's perspective, he would be swamped with all these alerts, and that could cause alert fatigue if he has to act on every minor thing. So you also need to define the severities of the policies and the SLA timelines to remediate very properly.
Because if you start marking everything as urgent while they are not, then at one point of time, you lose them.
Ashish Rajan: Yeah.
Kushagra Sharma: It's also how you communicate, how you ship these with actionable insights or remediation plans and everything using your alerting platform or whatever you had out there to send the right message, give them the time to breathe and look over all these issues that are out there.
Ashish Rajan: Totally random question. Is Gen AI helping with any of this?
Kushagra Sharma: In the industry, I would say it's in a very premature stage, especially when it comes to security. Even if GenAI is recommending me something which has a security touch, I would still review it twice.
Ashish Rajan: Just to verify it.
Kushagra Sharma: Yeah, because it's security. There are certain use cases, for example threat modeling, where you want to give it standard architecture diagrams and [00:42:00] say, what's your opinion, and review the output of it. But there needs to be like an engineer, an analyst who's in the security department reviewing that output, right?
Ashish Rajan: Yeah.
Kushagra Sharma: You cannot rely on it, like you can for certain use cases that are not security. It's interesting, but I haven't seen something that really excites me to jump into it and put it into production. I think we are not there yet, at least in my opinion.
Ashish Rajan: It definitely is fascinating for me. I love this conversation because we started off talking about identity access management, and we covered the software aspect of identity access management, the shared responsibility as well.
Because as much as we would like it to be all technical, with IAM role permissions kind of solving it, at the end of the day there's a cultural aspect to it being accepted. There's a whole layer of how you enable new services, permissions for new services. So going back to your dynamic permission boundary, if a new service has been reviewed, you would now have a permission boundary that can be updated for those new services that are being used as well.
To bring it back full circle.
Kushagra Sharma: Indeed. So you just go back, add it to your permission boundary, and then it's allowed for everyone in scope to use. So [00:43:00] yeah, that's
Ashish Rajan: awesome, man. I love this conversation. I love how we brought it back full circle on scaling IAM as well.
That's most of the technical questions I had, man. I've got three non-technical questions for you as well, so people get to know a bit about you. First one being, what do you spend most time on when you're not trying to solve the IAM challenges of the world?
Kushagra Sharma: I like biking a lot. I'm based in the Netherlands, so there's a culture of biking.
It's very nice in the evenings just to take your bike on a route, go there, and on the way back maybe stop at a restaurant, have some food, come back. It works very nicely. That's how I keep myself busy, usually, a bit as well.
Ashish Rajan: I always wonder, do people actually throw their bike in the river?
Kushagra Sharma: I know, they don't throw them. They fall due to some circumstances.
Ashish Rajan: I'm surprised how many cars are parked so close to the river and they don't reverse into the river. Yeah. It's a river, right? It's like a lake or something. It's a river.
Kushagra Sharma: Yeah, it's quite deep, the canals. And I cannot park there to be honest, if you ask me today, but I think the locals here are very skilled.
Ashish Rajan: What is your favorite cuisine or restaurant that you can share with us?
Kushagra Sharma: Okay. So originally I'm from India and I still have the bias for Indian food. So my favorite cuisine is still [00:44:00] Indian. My favorite food or dish recently that I started to stumble upon is a dish called Paneer Khurchan. I've not heard about it.
Ashish Rajan: No, what is it? So it's paneer with what? As in?
Kushagra Sharma: Paneer Khurchan. So it's like a gravy. Okay, so of course with the paneer, and then the gravy is with a certain type of lentil that's slow-cooked over time, and then they mix it with the other spices and all. It felt like a very fusion kind of dish when I first had it, but actually it's quite a common dish in some parts of India.
Really? Yeah. In Amsterdam, you can find it, I think, in just one restaurant. It's at Kalashtar, in the south of Amsterdam. I definitely recommend trying it to anyone who has tried Indian cuisine and wants something like added spice to it, basically.
Ashish Rajan: Oh, wow. There you go. And for people who don't know what paneer is, it's like cottage cheese, I think that's probably the closest example. Thank you so much for doing this, man. Where can people find you on the internet to connect with you and talk more about scaling IAM across a large multi-cloud environment as well, man?
Kushagra Sharma: Yeah. So my LinkedIn is out there. Just send me a message, a connection request, and we can take it from there.
As I said, everything is open on my talks over there. So you'll find what I [00:45:00] spoke about.
Ashish Rajan: Apparently, Kushagra is very open on the internet, so you'll find everything. But I'll still add his social links to the show notes as well, so you can connect with him. But dude, thank you so much for doing this. I really appreciate the conversation as well.
I'm sure people will enjoy this.
Thank you for listening and watching this episode of Cloud Security Podcast. We have been running for the past five years, so I'm sure we haven't covered everything in cloud security, and if there's a particular cloud security topic that we can cover for you in an interview format on Cloud Security Podcast, or make a training video on in tutorials on Cloud Security Bootcamp, definitely reach out to us at info at cloudsecuritypodcast.tv. By the way, if you're interested in AI and cybersecurity, as many cybersecurity leaders are, you might be interested in our sister podcast called AI Cybersecurity Podcast, which I run with former CSO of Robinhood, Caleb Sima, where we talk about everything AI and cybersecurity.
How can organizations deal with cybersecurity on AI systems, AI platforms, whatever AI has to bring next as the evolution of ChatGPT and everything else continues. If you have any other suggestions, definitely drop them at info at CloudSecurityPodcast.tv. I'll drop that [00:46:00] in the description and the show notes as well.
So you can reach out to us easily. Otherwise, I will see you in the next episode. Peace.