In this episode, Meg Ashby, a senior cloud security engineer, shares how her team tackled AWS’s centralized VPC interface endpoints, a design often seen as an anti-pattern. She explains how they turned this unconventional approach into a cost-efficient and scalable solution, all while maintaining granular controls and network visibility. She shares why centralized VPC endpoints are considered an AWS anti-pattern, how to implement granular IAM controls in a centralized model, and the challenges of monitoring and detecting VPC endpoint traffic.
Questions asked:
00:00 Introduction
02:48 A bit about Meg Ashby
03:44 What are VPC interface endpoints?
05:26 Egress and Ingress for Private Networks
08:21 Reason for using VPC endpoints
14:22 Limitations when using centralised endpoint VPCs
19:01 Marrying VPC endpoint and IAM policy
21:34 VPC endpoint specific conditions
27:52 Is this solution for everyone?
38:16 Does VPC endpoint have logging?
41:24 Improvements for the next phase
Thank you to our episode sponsor Wiz. Cloud Security Podcast listeners can also get a free cloud security health scan by going to wiz.io/csp
Meg Ashby: [00:00:00] We didn't really have any desire to really try and restrict any of that traffic to AWS services. For us, it was our internal services need to use AWS services, so we're just gonna trust that, right? We've run on AWS. AWS is a friend, so all great for us. We basically allow our clients to a certain extent to create their own lambda function code, and then we host it in one of our VPCs, which as a security professional, you might be like, whoa, that's like, why do you do that?
Just because you might, in your organization, have or want to move towards a little bit of a less traditional architecture or pattern. It doesn't always have to be this. Either I go down the formal vendor supported like golden happy trails path or I'm trucking it on my own through the wilderness.
Ashish Rajan: Trailblazer, yes. If you use centralized VPC interface endpoints in [00:01:00] AWS, then this is the episode for you. There are some gaps that you probably should be aware of, not just from a detection side, monitoring side, but also the fact that this is technically an anti-pattern from AWS. Yes, you heard that right. If you are someone who's trying to do network security in the AWS world, usually the recommendation is, hey, if you want a private network to be able to access the internet, you should be able to use something called VPC interface endpoints to connect those services.
As great as that sounds, the recommendation seems to be to have individual VPC endpoints for individual VPCs instead of a central VPC. So in this conversation, I had Meg Ashby from Alloy and she was kind enough to come and share her experience on how the anti-pattern became an actual good pattern, which is cost economical as well for them.
In this episode with Meg, you'll get to hear about why the centralized VPC path, why was the decision made there? What are some of the challenges, some limitations, detection, the gaps that are there you might have to face if you are going down that centralized VPC architecture and whether the centralized VPC architecture is [00:02:00] scalable, especially when VPC endpoints can only allow for one policy, but how do you do that for entire organization?
All that and a lot more in this episode. If you know someone who's working on network security and perhaps on the path to use a centralized VPC interface endpoint for their AWS network architecture, definitely share this episode with them. But if you are here for a second or third time, and if you're watching this on YouTube or Linkedin, I would super appreciate if you can give us a subscribe or a follow because that helps more people find us on the internet.
But also if you are listening to this on Apple or Spotify, if you could drop us a review or rating, it would definitely help us spread the word as well. So thank you so much for your support all these years. I appreciate all the time you spend with us on the podcast and also saying hello to us at conferences as well.
Hello and welcome to another episode of Cloud Security Podcast. Today I've got Meg and before I get into the conversation, cause I'm super excited about this. Meg, would you mind giving an introduction about yourself, where you are at the moment, what's your cloud security journey like?
Meg Ashby: Yeah, thanks for having me, Ashish.
I'm Meg Ashby. I do cloud security for a [00:03:00] fintech company in New York called Alloy. Previous to working at Alloy, I actually accidentally started my career in a cloud security and security generalist role at Marcus by Goldman Sachs. And then two years ago, moved over to Alloy to do their cloud security function.
Right now, it's still a really small team with my colleague and I, so we do everything from IAM, malware, patching, network security, all of our tooling, yeah, based solely in AWS and love doing it. Awesome.
Ashish Rajan: And something that has been a focus, at least based on the conversation that we had about the whole fwd:cloudsec talk that you have, which I'll put a link for in the show notes and stuff as well.
You spoke a lot about VPC interface endpoints. What are they for people who probably have not dug deep into the networking side of AWS? What is a VPC interface endpoint?
Meg Ashby: At a high level, interface VPC endpoints are a way for your workloads [00:04:00] running in a VPC to be able to securely and privately access both AWS services and certain partner workloads without going over the public internet.
So for example, we can keep it really simple. If you have an EC2 instance running some kind of app for your company and you need to connect to something like SQS, which is a great example, without an interface endpoint, that traffic would go over the internet. And by utilizing these interface PrivateLink endpoints, you can basically send that traffic through the AWS backbone.
Ashish Rajan: All right. So if I had primarily, I want a private network or the resources in a private network, but I want the private network to be able to reach an internet endpoint for lack of a better word, would that be a fair summary?
Meg Ashby: To a certain extent, because when we're looking at third party services, so i.e. things that are not AWS, they have to formally support this sort of connectivity [00:05:00] pattern publicly. There are a few companies such as Datadog, which do publicly expose their PrivateLink endpoints. So anyone could use those to send traffic to Datadog endpoints over PrivateLink. But in general, if your random endpoint you're trying to hit over the internet doesn't have a PrivateLink connectivity option, unfortunately, that is a prerequisite for non-AWS services.
Ashish Rajan: Oh, okay. And what do people generally do? I've heard of NAT gateways and other patterns as well. So what does ingress egress normally look like then for private networks when you're designing it?
Meg Ashby: Yeah. So typically when we think about ingress, I would say the pattern I see most often is the pretty generic one where you would have things like your WAF, or your application load balancer, some kind of ingress point that basically fronts your applications in a public subnet, or something that is exposed to the internet, which then, acts as your bodyguard [00:06:00] for your application, and then forwards the ingress traffic into your private workloads.
In terms of the egress side, we see a few different patterns. One that I see most often and what is actually, I will say, most commonly represented in official AWS documentation is, starting at the most simple level, you've got like your NAT gateways or your internet gateways that provide the egress traffic.
So for your private workloads we might have the NAT gateways in the public subnets, and it just goes straight out through the NAT gateways. So like each set of compute would have its own NAT gateway. It's like a pretty common egress pattern. Similarly, when you want to get more sophisticated, we see other organizations, in that same pattern, putting things like either proxies or third party appliances between that private host and the NAT gateway to do either like content inspection or again [00:07:00] proxying if we're using a proxy instance.
And then once we get out to like really larger organizations, I've seen the pattern quite a bit where even for your like service to service communications, let's say you've got two compute clusters, both in different regions and they want to communicate with each other. Bigger organizations sometimes will even have a kind of internal, a transit mechanism, either through like transit gateways to not only facilitate that sort of connectivity, but they even sometimes will put third party appliances or their own custom logic to even inspect or route that sort of traffic as well.
Ashish Rajan: And what made you guys go down the VPC endpoint path?
Meg Ashby: Yeah, so for us, it was a bit of a two fold thing. What I said before, the larger organizations might tend to do this. Right now, we would not say that we're in a position where that's a need to have for us. And our desires on, how we decide [00:08:00] what we're going to do, it's what's right for us. And for us, the two main things we're looking at oftentimes is a financial cost for those sorts of solutions. So i.e. if you want to put proxies between your internal traffic, you got to pay for those proxies. We didn't really want to do that. And secondly, the care and feeding required of any sort of solution.
Again, if you're going to have your own custom solution, that's going to route, your internal traffic through third party appliances or your own logic, right? Someone's got to be there thinking about monitoring it to make sure it's not dropping traffic. So for us, we don't do that internal service to service pattern.
But what we do instead, thinking about the interface endpoints, that's something that we have embraced utilizing. One, because the care and feeding, it's a managed offering by AWS, so as soon as we set it up once, assuming AWS doesn't have an outage, no care and feeding required. So for us, yeah, that's [00:09:00] why we moved towards utilizing the interface endpoints.
Ashish Rajan: And you guys went down the centralized VPC endpoint path as well, which is I would've thought it was an anti pattern if people look at AWS. Why was that the case?
Meg Ashby: Yeah, I totally get where you're coming from. It's definitely not a pattern that is widely documented in all of the reference architectures, but I can tell you a little bit about, like, why we chose to do that.
So taking a step back, just a little bit of a baseline, we have many VPCs that have a bunch of different workloads in them, and we decided for our security posture, what we wanted to start doing was monitoring our egress traffic. So we wanted to be looking at what domains our applications are hitting. Part of our functionality, just as our product is, we connect to a lot of different data sources and we'll ingest data on behalf of our clients, send the data back out to them.
So we have a lot of 3rd party connections, right? Both to our clients and to the data providers that [00:10:00] our clients are wanting to utilize. And so for us, it was important to start getting visibility into where that traffic is going. So when we were looking at, we probably want to start putting these, network appliances in line with, how our traffic is egressing, right?
So we can inspect that. So we were looking at two different options in terms of those network appliances. What we ended up doing was basically hosting those in a centralized location, and then we updated all of our routing to, instead of sending our traffic through those NAT gateways that were in each workload, send it to the centralized area where we had the firewall appliances.
However, once we actually got that in place and, connectivity working, everything flowing, we started realizing, when we're looking at all the traffic that these appliances were getting, that we were seeing a lot of traffic to things like the AWS services. And going back to what we said before, we didn't really have any [00:11:00] desire to really try and restrict any of that traffic to AWS services.
For us, it was our internal services need to use AWS services, so we're just gonna trust that, right? We run on AWS. AWS is a friend, so all great for us. So we wanted to start putting that traffic through interface endpoints. So they're going over the AWS backbone and they're not traversing over the internet and therefore through our firewall appliance. However, what I talk about a bit in our presentation is AWS makes a lot of money on interface endpoints, particularly because they have a running cost per hour. So there's a little bit of math involved of, how much traffic do I need to send over this interface endpoint per hour in order to balance out the cost of the firewall appliance processing it, the egress out to the internet, even irrespective of security concerns. And what we realized [00:12:00] is, we as a security team wanted to use interface endpoints to keep that traffic private, but the CFO probably would not be super pleased if they saw our AWS bill spiking.
So that is what led us to the centralized VPC endpoint model, which is where you can have kind of your VPC endpoints shared across many different VPCs and workloads instead of each workload having its own set of interface endpoints.
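The break-even math Meg describes can be sketched roughly as below. The rates are placeholder assumptions rather than quoted AWS prices, the function names are hypothetical, and the sketch deliberately ignores the peering or transit gateway charges that a centralized design would also incur.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def endpoint_monthly_cost(endpoints, azs, hourly_rate, gb_processed, per_gb_rate):
    """Hourly charge per endpoint per AZ, plus a per-GB data processing charge."""
    fixed = endpoints * azs * hourly_rate * HOURS_PER_MONTH
    return fixed + gb_processed * per_gb_rate


def compare_models(vpcs, services, azs, hourly_rate, per_gb_rate, gb_processed):
    """Return (decentralized, centralized) monthly cost for the same total traffic."""
    decentralized = endpoint_monthly_cost(
        vpcs * services, azs, hourly_rate, gb_processed, per_gb_rate
    )
    centralized = endpoint_monthly_cost(
        services, azs, hourly_rate, gb_processed, per_gb_rate
    )
    return decentralized, centralized


if __name__ == "__main__":
    # Placeholder inputs: 10 workload VPCs, 5 AWS services, 2 AZs per endpoint.
    dec, cen = compare_models(
        vpcs=10, services=5, azs=2,
        hourly_rate=0.01, per_gb_rate=0.01, gb_processed=1000,
    )
    print(f"decentralized: {dec:.2f}/month, centralized: {cen:.2f}/month")
```

In this sketch only the fixed hourly part scales with the number of VPCs, which is why low-traffic environments tend to favor sharing endpoints.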
Ashish Rajan: Doesn't that make the policy part harder or does it make it easier?
Meg Ashby: I will say in terms of managerial side of things, if you're not on the security team, it probably makes it easier because you have fewer policies to deal with. But you're right, from the security point of view, what you lose then is that you have many workloads essentially sharing, and we can go back to the example I love, which is SQS, one SQS VPC endpoint, and one VPC endpoint can only have one VPC endpoint [00:13:00] policy. So the typical thought is when I had N number of SQS VPC endpoints, I could have each policy being really distinct for the workload and what resources that workload needs to access. But when you move to a centralized model, right, now you have one policy and you need to think about, okay, what level of granular controls or permissioning am I giving up in order to utilize that sort of centralized model?
And that was basically the theme of why I wanted to talk about how you can maintain your granular controls in that centralized interface endpoint model.
Ashish Rajan: So are there limitations in this? And obviously a lot of people just go, and I'm guilty of this as well.
Before I started introducing myself to Azure and GCP, I used to believe that AWS's documentation is right. Like the example that I normally come down to is. If you use a third party, AWS recommends you to have external ID as the one way to [00:14:00] communicate using IAM role. Now, after learning Azure and GCP, I realized, wait, that's because they forced the vendors on the other side to also use AWS because you can't get an external ID if you're not using AWS.
So to your point about, now that we know the recommended pattern is actually not to use VPC endpoints, as a central VPC endpoint, what you said, one would logically think isn't that because there's only one policy now, I may have a lot of different environments, dev, staging, testing, and not just one application, multiple applications as well.
How practical is it to do it at scale?
Meg Ashby: When we think about, how we could implement those sorts of granular controls, we viewed it as four places that you could do those controls. So a little bit of, if you're not familiar with the centralized VPC endpoint architecture, how it works is in order to enable that connectivity, you one have to have connectivity between where your endpoints are hosted and the workloads that want to use them.
So assuming you've got either [00:15:00] VPC peering or some kind of transit gateway networking connectivity set up between the two places, your workload environments and where the endpoints are hosted. How the magic works in terms of actually facilitating the DNS side of things is that instead of utilizing the default enable private DNS, which is the default check mark on VPC endpoints, which is not compatible with the centralized pattern.
Instead, basically what you do is you take the DNS into your own hands and create a private hosted zone, which has an alias record to that VPC endpoint. And then you share that private hosted zone with all of your, what I would call subscribing VPCs, all of those environments, VPCs, or however you have them hosted.
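As a minimal sketch of the DNS step described here, the request payloads might look like the following. The shapes follow the Route 53 `ChangeResourceRecordSets` and `AssociateVPCWithHostedZone` APIs as I understand them; the helper names and every ID or DNS name are hypothetical placeholders, and nothing is sent to AWS.

```python
def alias_change_batch(service_dns_name, endpoint_dns_name, endpoint_hosted_zone_id):
    """Build a change batch that aliases the service name to the shared endpoint."""
    return {
        "Comment": "Alias the service hostname to the centralized interface endpoint",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": service_dns_name,
                    "Type": "A",
                    "AliasTarget": {
                        # Hosted zone ID and DNS name reported for the endpoint itself.
                        "HostedZoneId": endpoint_hosted_zone_id,
                        "DNSName": endpoint_dns_name,
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ],
    }


def association_requests(private_zone_id, region, subscribing_vpc_ids):
    """One association request per subscribing VPC; leave a VPC out to exclude it."""
    return [
        {"HostedZoneId": private_zone_id, "VPC": {"VPCRegion": region, "VPCId": vpc_id}}
        for vpc_id in subscribing_vpc_ids
    ]


if __name__ == "__main__":
    batch = alias_change_batch(
        "sqs.us-east-1.amazonaws.com",                     # private hosted zone name
        "vpce-EXAMPLE.sqs.us-east-1.vpce.amazonaws.com",   # placeholder endpoint DNS name
        "ZEXAMPLE",                                        # placeholder hosted zone ID
    )
    print(batch["Changes"][0]["ResourceRecordSet"]["Name"])
    print(association_requests("ZPRIVATE", "us-east-1", ["vpc-dev", "vpc-staging"]))
```

Each dictionary could then be passed to the matching Route 53 client call; keeping production out of the `subscribing_vpc_ids` list is the coarse, VPC-level control discussed next.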
And so that's one place that you can put those controls, right? If you, for example, are like, I'm okay with having dev staging and pre [00:16:00] prod or load testing all share VPC endpoints, but for production, I really want to keep them separate, right? So it becomes quite simple then. Don't associate your production VPCs with that private hosted zone, and it won't even know that it exists.
But that's a really coarse control. Explicitly on the Route 53 private hosted zone side, you have to associate it at the VPC level. So if you have something like, yeah, we, for whatever reason, have chosen to have one jumbo VPC and just a bunch of subnets, that's how we do our controls.
So that's not going to work for you. Even if you need more nuance, VPC is pretty broad. So we don't think for us that's like a great place to put those kinds of controls. The second place, which is what we have chosen to do, is to actually use some pretty snazzy IAM condition keys in the VPC endpoint policy itself to maintain that central place [00:17:00] of granularity of controls. But if that doesn't work for you, the third place, furthest down the pattern, that you could put these controls is actually on the resource policy of the individual resources accessed through the VPC endpoint. For us, it just turned out that there were too many resources, and going back to the care and feeding, maintaining that all of those resources always have up to date policies for us was too much care and feeding.
So we went with maintaining the controls on the VPC endpoint policy itself.
Ashish Rajan: But then we can only have one policy.
Meg Ashby: Yes. Yes. And that's very true. But what we discovered is that there are certain condition keys that you can use within the IAM policy that allow you to identify the network location basically of where the traffic came from.
So there are a few caveats, but [00:18:00] basically what this allows us to do is it will allow us to apply granular IAM permissions, anything that you can do on a resource based policy, but to traffic that originated from specific network locations. So whether that's specific VPC IDs or CIDR ranges for things like subnets.
Ashish Rajan: Even though you only have one VPC endpoint policy, you're almost like marrying that together with an IAM policy for what Ashish and Meg can do in the AWS environment, as long as they're coming from a known VPC or a known private CIDR range, they should be able to access the resource. Is that how you're looking at this?
Meg Ashby: Yeah, so we can go ahead and give like a specific example, like why we even wanted to consider doing this, why don't we just leave our centralized VPC endpoints and not even mess with the policies, right? Like security, you've got other things to worry about. Why do we care? And a little bit about, why we decided to do this is we [00:19:00] particularly have one VPC where we enable a collaborative, basically Lambda creation process with our clients.
So we basically allow our clients to a certain extent to create their own Lambda function code. And then we host it in one of our VPCs, which, as a security professional, you might be like, whoa, that's like, why do you do that? There's a little bit more context behind that. And we have a lot of, other controls that validate the code that's being deployed, right?
Ashish Rajan: And the level of trust is well there as well. Yeah.
Meg Ashby: But yeah, at the end of the day, right? Like we recognize this is a sensitive VPC, right? We really want to be really restrictive on what resources that VPC can access. So for us though, again, going back to the cost was, we didn't really want to have those sets of VPC endpoints just for this one environment.
So we wanted to have the ability that [00:20:00] would say, traffic that's coming from this VPC will only allow it to look at these very restricted amounts of resources or these certain actions, even if the IAM policy is a little bit more permissive on the role itself. And for us, it was mostly just a defense in depth mechanism and a bit of a balancing exercise, right?
Like we could have very easily just said, you get your own VPC endpoints. And we'll eat the cost for it. And, but for us, what we found is that there are two ways that we can still maintain the like VPC layer controls, which we can definitely talk a little bit more into the specific implementation of, but that was the impetus for us all the way wanting to really invest in trying to maintain the ability to do those sorts of granular controls in the centralized VPC endpoint architecture.
Ashish Rajan: Yeah. And 'cause I think to your point about giving the client access to create a Lambda function and having security controls around it, sounds like it [00:21:00] allows the flexibility to scale it without almost like a white glove service for every client that wants to work. That's not a scalable solution at that point in time. I think you were talking about the VPC endpoint specific conditions I would love for you to dig into that as well a bit, because I feel like we've layered the architecture for people where they understand, okay, so this is the reason why you went down the centralized VPC interface endpoint path.
And you obviously have married up the two together. So what were the VPC specific things? Cause I almost feel an example is very well suited to this point to kind of share the implementation of it.
Meg Ashby: If we're looking at the VPC endpoint policy, there are two IAM condition keys that we leaned heavily on, which are EC2 instance source VPC and EC2 instance source private IPv4. Yeah, it's a mouthful. But what those are, it's for credentials that come from EC2 instances. So this is important. So if you're running your workloads on raw EC2s, or you have things [00:22:00] that are like an EKS worker, or EKS, the main node, like node roles, or ECS, like node roles.
This condition key will be available, and they are exactly what they say they are. Which is the VPC ID and that IP address, the private IP address that request came from. And that's why it's very much tied to EC2 credentials. So for example, if you have an application running on a pure EC2 instance, that request will have this private IPv4 condition key present.
So you can use that in terms of applying controls based off of like specific subnet CIDRs, for example. One thing to note is if you are going to use the IPv4 condition key, you need to use it also with the source VPC condition key, because obviously your private IP addresses are not unique. But that also opens up [00:23:00] a whole other can of worms, right?
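A VPC endpoint policy statement along these lines might look like the sketch below. It assumes the two keys Meg names correspond to the global condition keys `aws:Ec2InstanceSourceVpc` and `aws:Ec2InstanceSourcePrivateIPv4`; the helper names, VPC ID, CIDR, and queue ARN are hypothetical, and this is not Alloy's actual policy.

```python
import json


def ec2_source_statement(sid, vpc_id, subnet_cidrs, actions, resources):
    """Allow the given actions only for EC2 credentials from one VPC and CIDR set."""
    return {
        "Sid": sid,
        "Effect": "Allow",
        "Principal": "*",
        "Action": actions,
        "Resource": resources,
        "Condition": {
            # Both keys are used together: private IPs repeat across VPCs.
            "StringEquals": {"aws:Ec2InstanceSourceVpc": vpc_id},
            "IpAddress": {"aws:Ec2InstanceSourcePrivateIPv4": subnet_cidrs},
        },
    }


def endpoint_policy(statements):
    """Wrap statements into the single policy document an endpoint can carry."""
    return {"Version": "2012-10-17", "Statement": list(statements)}


if __name__ == "__main__":
    policy = endpoint_policy([
        ec2_source_statement(
            sid="SensitiveVpcRestrictedQueue",
            vpc_id="vpc-0example",                      # placeholder VPC ID
            subnet_cidrs=["10.20.1.0/24"],              # placeholder subnet CIDR
            actions=["sqs:SendMessage"],
            resources=["arn:aws:sqs:us-east-1:111122223333:example-queue"],
        )
    ])
    print(json.dumps(policy, indent=2))
```

Because a centralized endpoint has only one policy, each subscribing VPC would get its own statement in the same document.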
Like we are running modern compute architectures, right? Not everyone's running their apps purely on pure EC2 instances. So when we think about like other types of compute, so things like a Lambda function, right? Things like your EKS pods, right? They are actually not using EC2 based credentials. They are getting their own IAM roles.
And again, ECS tasks as well, those sorts of requests that come from their own task specific roles will not have either of these condition keys available because those credentials are not coming from EC2 itself. So you might be like, whoa, right? That's a problem. So the second half of the talk that I gave went into our own decisions on how do we implement almost like a stopgap for that sort of shortcoming in those two IAM condition keys.
So how we [00:24:00] ended up doing it was a tag based solution on the role itself, because again, these credentials very explicitly are coming from IAM roles. If they were coming from EC2 based like instance profiles, they would fall into category one, where we have those condition keys for VPC ID and private IPv4.
However, if we're in IAM role land, what we ended up doing was utilizing a more generic global condition key, which would basically check certain tags on the IAM role itself, and we had it set so those tags would align to basically the VPC or the subnet that those applications were running in.
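The tag-based fallback could be sketched as follows, assuming the "more generic global condition key" is `aws:PrincipalTag/<key>`. The tag key `network-vpc`, the helper name, and the ARNs are hypothetical; the real tag scheme is not given in the conversation.

```python
import json


def role_tag_statement(sid, tag_key, tag_value, actions, resources):
    """Allow the given actions only for principals whose role carries the tag."""
    return {
        "Sid": sid,
        "Effect": "Allow",
        "Principal": "*",
        "Action": actions,
        "Resource": resources,
        "Condition": {
            # The tag value stands in for the VPC or subnet the workload runs in.
            "StringEquals": {f"aws:PrincipalTag/{tag_key}": tag_value}
        },
    }


if __name__ == "__main__":
    statement = role_tag_statement(
        sid="LambdaRolesFromSensitiveVpc",
        tag_key="network-vpc",          # hypothetical tag key
        tag_value="vpc-0example",       # placeholder VPC ID stored as the tag value
        actions=["sqs:ReceiveMessage"],
        resources=["arn:aws:sqs:us-east-1:111122223333:example-queue"],
    )
    print(json.dumps(statement, indent=2))
```

This only holds up if the tags themselves are trustworthy, so who is allowed to create or change that tag on a role matters as much as the statement.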
Ashish Rajan: Out of curiosity, would I be wrong in thinking that couldn't some of this could be done by SCPs as well?
Meg Ashby: Yes. So there definitely are more [00:25:00] formalized ways, especially when we're looking at the tag based solution, right? How do we enforce that those tags are always present, they're always well formed, right? Tags natively don't have any sorts of like super strong controls on them. What's going to stop someone from going in and saying Oh, you're looking for this tag key.
I'm just going to make it like testy test one, two, three. Which is not the value we're looking for. So AWS does provide things like tag policies that kind of operate right on that same sort of organizational pattern that like you mentioned, like SCPs. So that's definitely like an option of like, how can you enforce that sort of?
Ashish Rajan: Because I guess so much relies on the fact that tagging is there and the source VPC, source IPv4 is all available for, and with the tag, right tags as well. It's almost, to what you called out, quite crucial pieces of the VPC endpoint policy puzzle.
Meg Ashby: And I definitely agree. With that, and this is [00:26:00] probably, not a solution that will ever make it into the formal AWS documentation, and, I respect that, but I think what we're trying to get at here is just because you might, in your organization, have or want to move towards a little bit of a less traditional architecture or pattern, in whatever context that may be, it doesn't always have to be this, either I go down the formal vendor supported golden happy trails path, or I'm trucking it on my own through the wilderness.
Ashish Rajan: Trailblazer, yes.
Meg Ashby: You are not like hacking away with your machete like fighting like lions and elephants every step of the way. There is this middle ground that there is a way for you to meet your own organization's kind of best practice, right? That may not always be the AWS certified best practice, but it might be, for your organization, for us, that was cost and care and feeding were some of our biggest [00:27:00] concerns. But for your organization, that may be different if you're willing to do a little bit of the work either in compensating controls or other mechanisms to make it work for you.
Ashish Rajan: The reason I call that out is because a lot of people, to what you said, would go down the path of what AWS recommends because either it's AWS professional services or someone else said this is the right path.
This is what you have to go down the path of. But I would also add that there's a huge population of cloud security folks out there who are building unique solutions on their own, which whether it's in documentation or not. So I don't think it's the wrong way. I think it was really interesting to hear your perspective on how you solved the problem, which was towards a goal that you wanted to achieve.
Now, bring this all together. Is it for everyone or is it because it sounds like it would be something that is potentially a good place to start as well for people who have not done this before.
Meg Ashby: Yeah. Just seconding the point that you said before, like. Why would somebody consider moving to this?
If you got a huge amount of traffic, right? We moved to this model because the running [00:28:00] cost per hour was too much for us, right? But there is a break even point, right? At some point, if you have a certain amount of traffic, financially you are not going to see as many pain points with that, right?
If your load of traffic is astronomical, if you're in that bucket, you're probably fine. Keep your VPC endpoints decentralized and enjoy your day. But for the rest of us, I would say, particularly when you're looking at potentially moving to this model, what I would be most concerned about is, you have to be quite aware, especially at the IAM policy, which condition keys are appropriate for which type of resources.
So for example, there's a really lovely IAM condition key that's I believe just generically like source VPC, which you're like, wow, that sounds great. That sounds like exactly what we want, right? Unfortunately, it is not [00:29:00] applicable for VPC endpoint policies. It's only applicable for the policies on the resources themselves.
For example, like if you're accessing SQS, you can use that on the SQS policy, but not on the VPC endpoint policy. However, AWS will not stop you from putting that condition key into the VPC endpoint policy, right? It just will sit there as a little lame duck that's not matching any traffic, right?
Because it's not appropriate for that layer. So that's like a gotcha that kind of got us at the beginning. That combined with the fact that VPC endpoint policy evaluation is eventually consistent, right? You might update your policy, and then try and send some traffic through it, and it is still evaluating on the old version of the policy, and it's not clear exactly when the policy is officially propagated.
So that led to some confusion when we first implemented things because it looked like these [00:30:00] condition keys were working, but they really were just using the old policy for five, eight, 10 minutes. And then, once it fully propagated, we realized like, this is not doing anything. Like, why isn't this working?
There's no error messages, but it's just giving like behavior we weren't expecting. So that was really frustrating at the beginning, so yes, being aware of what IAM condition keys are appropriate for VPC endpoint policies is like one big like gotcha that I would say is something I would go back and want to warn myself about.
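Since nothing rejects an inapplicable condition key, a small pre-deployment check could at least flag keys a team has learned do not behave as expected in endpoint policies. This is a hedged sketch: the function names are hypothetical, and the list of suspect keys is supplied by the caller (here `aws:SourceVpc`, based only on Meg's account above), not taken from any AWS-published list.

```python
def condition_keys(policy):
    """Collect every condition key used anywhere in a policy document."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    keys = set()
    for statement in statements:
        for operator_block in statement.get("Condition", {}).values():
            # Condition key names are compared case-insensitively.
            keys.update(key.lower() for key in operator_block)
    return keys


def flag_suspect_keys(policy, suspect_keys):
    """Return the suspect keys that appear in the policy, sorted for stable output."""
    suspects = {key.lower() for key in suspect_keys}
    return sorted(condition_keys(policy) & suspects)


if __name__ == "__main__":
    draft_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",
            "Action": "sqs:*",
            "Resource": "*",
            "Condition": {"StringEquals": {"aws:SourceVpc": "vpc-0example"}},
        }],
    }
    print(flag_suspect_keys(draft_policy, ["aws:SourceVpc"]))
```

A check like this does not replace testing with real traffic after the policy has propagated; it only catches the known gotchas earlier.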
Ashish Rajan: In the bigger scheme of things, also the time it takes for something to propagate as well to what you said, you may apply something, so it sounds like there are two takeaways here. One obviously is the fact that the conditions only apply at a certain level or certain layer. Which layer is, does that match for you specifically?
And going for a generic source VPC, try and find the one that is actually applicable to the resource that you're trying to work with, along with the private IP source that you have. And [00:31:00] the fact that it takes time for any changes to get applied, which kind of makes me think of the whole detection and monitoring part as well.
How do you even detect if it's working? So how are you simulating the actual scenario for it to know that, yeah, it's working and it's not working. That would be a, I can't even, I'm curious. What was the process there?
Meg Ashby: Yeah, totally. And I will say generically VPC endpoints, just because of where they operate, they themselves don't have any logs, right?
So how the permissions are evaluated. I know, it's quite interesting, right? So your VPC endpoint policy, we'll just go back to SQS, right? EC2 makes a request through the endpoint. The actual evaluation of whether that was allowed through the VPC endpoint policy is actually done on the individual resource.
So it will actually make it all the way to the SQS and then it will evaluate, the SQS resource policy [00:32:00] in line with what the VPC endpoint policy was and then send that back to the EC2 instance.
Ashish Rajan: Oh, okay. Wow, so you won't even get stopped in the beginning when you send the request.
It's only when you hear back from SQS.
Meg Ashby: Yes, it's actually evaluated right on. So the request will hit SQS itself and then will be evaluated. So that's a little bit interesting. So the VPC endpoint and its policy do not log, right? Like it does not have, things that will say, oh, this was denied or this wasn't.
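The evaluation Meg describes, where the endpoint policy and the SQS resource policy are both considered at the service, can be illustrated with a deliberately simplified model. This is a sketch only: the function names are hypothetical, only `StringEquals` conditions are handled, and real AWS policy evaluation also involves identity policies, SCPs, and other layers that are left out here.

```python
def statement_matches(statement, action, context):
    """Return True if a simplified policy statement applies to the request."""
    actions = statement.get("Action", [])
    if isinstance(actions, str):
        actions = [actions]
    if action not in actions and "*" not in actions:
        return False
    # Assumption: only StringEquals conditions are modelled in this sketch.
    conditions = statement.get("Condition", {}).get("StringEquals", {})
    for key, expected in conditions.items():
        # A key that is absent from the request context never matches.
        if context.get(key) != expected:
            return False
    return True


def evaluate(policy, action, context):
    """Return 'Deny', 'Allow', or 'ImplicitDeny' for one policy document."""
    matched = [s for s in policy["Statement"] if statement_matches(s, action, context)]
    if any(s["Effect"] == "Deny" for s in matched):
        return "Deny"
    if any(s["Effect"] == "Allow" for s in matched):
        return "Allow"
    return "ImplicitDeny"


def request_allowed(endpoint_policy, resource_policy, action, context):
    """Toy rule: the request passes only if both policies allow it."""
    return (
        evaluate(endpoint_policy, action, context) == "Allow"
        and evaluate(resource_policy, action, context) == "Allow"
    )
```

In this model a condition key that is never populated for the request simply fails to match, with no error raised, which is one way a policy can behave unexpectedly without producing any message.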
Ashish Rajan: Yeah. Okay.
Meg Ashby: So, going back to really basic AWS stuff, you do have CloudTrail. For interface endpoints, the logging will say something like: access denied because the VPC endpoint policy did not have an allow, if you're going with an allow-listing approach, or it will say it was because there was a deny in the VPC endpoint policy, if there was a deny statement.
However, it does get a little bit challenging then, because [00:33:00] outside of CloudTrail there's a disconnect between the CloudTrail API call and what we're doing really on a networking layer, right? Like the traffic itself, right? There's a little bit of a disconnect there. So it's not the easiest thing in the world.
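One way to act on those CloudTrail entries is to filter already-exported records for denials that mention the endpoint policy. The sketch below is an assumption-heavy illustration: `errorCode`, `errorMessage`, and `eventName` are standard CloudTrail record fields, but the sample records, the account ID, and the exact message wording being matched are hypothetical and should be checked against real events in your own trail.

```python
# Hypothetical sample records; real CloudTrail events carry many more fields.
SAMPLE_RECORDS = [
    {
        "eventName": "SendMessage",
        "errorCode": "AccessDenied",
        "errorMessage": "User: arn:aws:sts::111122223333:assumed-role/app/i-0abc "
                        "is not authorized to perform: sqs:sendmessage because "
                        "no VPC endpoint policy allows the sqs:sendmessage action",
    },
    {
        "eventName": "SendMessage",
        "errorCode": "AccessDenied",
        "errorMessage": "User: arn:aws:sts::111122223333:assumed-role/app/i-0abc "
                        "is not authorized to perform: sqs:sendmessage with an "
                        "explicit deny in a VPC endpoint policy",
    },
    {
        "eventName": "ReceiveMessage",
        "errorCode": "AccessDenied",
        "errorMessage": "not authorized because no identity-based policy allows the action",
    },
    {"eventName": "SendMessage"},  # successful call, no error fields
]


def endpoint_policy_denials(records):
    """Return (event_name, reason) pairs for denials that mention a VPC endpoint policy."""
    results = []
    for record in records:
        if record.get("errorCode") != "AccessDenied":
            continue
        message = record.get("errorMessage", "").lower()
        if "vpc endpoint policy" not in message:
            continue
        # Assumed wording: an explicit Deny statement vs. no matching Allow.
        reason = "explicit deny" if "explicit deny" in message else "no matching allow"
        results.append((record.get("eventName"), reason))
    return results
```

Splitting the two reasons mirrors the distinction Meg draws between an allow-listing policy that had no matching allow and a policy with a deny statement.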
And then you add in like the centralized model with maybe some really funky condition keys and it gets challenging. So one thing that we're actually looking at investigating a little bit more, and hopefully your organization is using VPC flow logs, is this: by default, you get version 2, which has a set of pretty standard VPC flow log fields in it, which I think is what most people are familiar with, but there are actually newer versions of VPC flow logs that allow you to add more custom fields to your VPC flow logs.
And in particular, there are two fields that we're particularly interested in exploring, [00:34:00] which I think is like packet destination address and then destination address.
Ashish Rajan: Okay.
Meg Ashby: And what the documentation at least is saying is that the difference between those two fields will be basically if there were intermediary ENIs that the traffic went through. So if you're saying EC2 accessing SQS, like one of those fields will have the ENI for SQS. And the other one will have, what was that intermediary. And so for us, we're hoping, will that intermediary hop have the ENI for the VPC endpoint?
And then we could start rolling up the monitoring on the volume of traffic itself and if it was allowed or denied by utilizing those newer fields in VPC flow logs.
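The comparison Meg is describing can be sketched as a small parser over a custom-format flow log line. The field names `dstaddr` and `pkt-dstaddr` are real VPC flow log fields; the particular log format chosen here, the sample addresses, and the helper names are assumptions for illustration. Whether the intermediary actually shows up as the VPC endpoint's ENI is exactly the open question raised above, so treat this as something to validate, not a confirmed detection.

```python
# Assumed custom log format; pick whichever fields your flow log is configured with.
LOG_FORMAT = "${interface-id} ${srcaddr} ${dstaddr} ${pkt-srcaddr} ${pkt-dstaddr} ${action}"
FIELDS = [field.strip("${}") for field in LOG_FORMAT.split()]


def parse_record(line):
    """Split one space-separated flow log line into a dict keyed by field name."""
    return dict(zip(FIELDS, line.split()))


def went_through_intermediary(record):
    """True when the flow's destination differs from the packet's original destination."""
    return record["dstaddr"] != record["pkt-dstaddr"]


if __name__ == "__main__":
    # Hypothetical lines: the second has a different packet-level destination.
    lines = [
        "eni-0aaa 10.0.1.5 10.0.9.20 10.0.1.5 10.0.9.20 ACCEPT",
        "eni-0aaa 10.0.1.5 10.0.9.20 10.0.1.5 203.0.113.7 ACCEPT",
    ]
    for line in lines:
        record = parse_record(line)
        print(record["dstaddr"], record["pkt-dstaddr"], went_through_intermediary(record))
```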
Ashish Rajan: So the v2 of VPC flow logs only gives you source and destination. And usually those are IP addresses, but this is not even that, 'cause I think if you are within a VPC and you make a call to an [00:35:00] endpoint, is that logged in the VPC's flow log as well, or is it not even logged there?
Meg Ashby: It would be logged right now on the ENI for let's say your EC2 instance. But what you're not going to see is right now the traffic, it won't have any awareness of going through the intermediary.
Ashish Rajan: ENI to SQS.
Meg Ashby: So you're losing the visibility of can I tell if the traffic actually went through the VPC endpoint? Or is it just saying it went from EC2 to SQS and it actually went over the internet? Or, did it go through the NAT gateway or,
Ashish Rajan: so it's not even logged as well?
Meg Ashby: It would, but it wouldn't, with v2, you're not getting the visibility of how it got there. Like it
Ashish Rajan: Oh okay. Sorry. Yeah, I get you.
Meg Ashby: Yeah. It's EC2 to SQS, but you don't know, did it actually use the VPC endpoint or did it just?
Ashish Rajan: Just go straight through,
Meg Ashby: go through the internet, [00:36:00] or if you are really complex and you have a mismatch of various SQS.
Ashish Rajan: Sounds like it would've been a problem even if we were not using centralized VPC and VPC endpoint and we were still using the AWS recommended method, you would still have the same problem.
Meg Ashby: Yes. I think to a certain extent you will still have the visibility problem of, did I go through the NAT gateway?
Or if you do have multiple endpoints, right, did it use this one or that one? Yeah, overall, I would say I still view it as an open question. There's not, in my experience, a great way to tie together your VPC flow logs, which really operate on a very different layer or level, with something like the endpoint policy or IAM policies in general on the resources or your VPC.
Ashish Rajan: Yeah. 'cause to your point, I like, I probably would put this on the, I guess on the side of AWS for this, because I feel like they're the ones who are [00:37:00] providing the service for VPC endpoint. They're the ones who are providing the network capability. They're the ones managing the network.
We just define what CIDR range we need. Everything else is theirs. That's the reason they went down the path. It's funny how it works because, I guess, seven, eight years ago they didn't even have VPC flow logs. It's only because of people like me and others who came from an on-premise world, where it was normal for us to have something like a source and destination.
Like the whole Wireshark world that got us created was because you could tap into the traffic. But there was no way for you to tap into the traffic for a VPC. I feel like this is that next level of maturity, and maybe to what you said, v3 is the answer here, where potentially people would be able, 'cause I don't even think people realize, like for me, most people who are using VPC endpoint and in terms of logging, does VPC endpoint have logging by itself?
Meg Ashby: No. No, there's not logging. You have CloudWatch metrics for how many bytes were processed and active connections, but yeah, the endpoints themselves don't.
Ashish Rajan: Yeah, cause I'm thinking from like the next part of the question that I had around detection and monitoring, detection part sounds [00:38:00] already quite difficult because it's already almost like to what you said, you're someone is sitting through and making sure that, Hey, it's checked on the SQS side on CloudTrail side, is a policy available or not, if there was no policy, period.
The traffic is just flowing through. You don't even know if it's like coming from a known private IP, known VPC. Like now you're in this kind of world of, I would imagine, like people who don't use a centralized model, they probably have the bigger complexity of which VPCs to begin with.
Meg Ashby: Yeah. And this is where your puzzle starts looking like one of those candy salads. Are you familiar with that trend? Everyone just pours something in and then you get something out. Going back to something we chatted about earlier: yeah, on the individual resources themselves, like on the SQS queue, you can define conditions like source VPC, you can say where it came from, right?
But then if you add your SQS queue IAM resource policy, you have your Sour Patch [00:39:00] Kids, which, yeah, you get the really great sour flavor, but maybe somebody doesn't like the blue ones. So then you're like, ah, okay, then I'll put some M&Ms in, we'll add some VPC endpoint policies, which gives you something else.
But it is a bit of a challenge for sure, with all the different places you can put controls and at what layer. Like, how do you put all of these different things together in a way that makes a really delicious candy salad and not one that's both sour and sweet and nutty and chocolate in a really gross way?
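To make the resource-side control concrete, here is a minimal sketch of an SQS queue policy that uses the `aws:SourceVpc` condition key Meg mentions. The helper name, queue ARN, and VPC ID are hypothetical. Two caveats are assumptions worth verifying in your own environment: which VPC the key reports when the endpoint lives in a centralized VPC, and the fact that a `Deny` with `StringNotEquals` also matches requests where the key is absent, such as calls that do not come through a VPC endpoint at all.

```python
import json


def queue_policy(queue_arn, vpc_id):
    """Build a queue policy that denies SendMessage unless it arrives from vpc_id."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideExpectedVpc",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Negated operators also match when the key is missing,
                # so non-endpoint traffic is denied by this statement too.
                "Condition": {"StringNotEquals": {"aws:SourceVpc": vpc_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical identifiers for illustration only.
    policy = queue_policy(
        "arn:aws:sqs:us-east-1:111122223333:orders",
        "vpc-0123456789abcdef0",
    )
    print(json.dumps(policy, indent=2))
```

Layering this with a VPC endpoint policy is where the "candy salad" problem appears: each layer is reasonable alone, but the combined behavior has to be tested end to end.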
Ashish Rajan: It reminds me of, my wife is a big Harry Potter fan and we went to Harry Potter World. One of the, I feel like we've gone to all of them, where they give you butter beer, I think it's called, the Harry Potter drink. And it's like just the weirdest drink in the world. It's just literally sugar, but they call it beer for kids.
And you see all the adults having three or four glasses of it, but it's literally the sugar syrup you're just drinking up. But it reminds me of that, the example that you called out about the candy salad as well. It's how [00:40:00] it's packaged. 'Cause I also wonder about a lot of people who feel they are covered by the fact that, oh, I have a private network with a VPC endpoint that goes to, let's just use your example of SQS.
They may not have put a policy on SQS to say that it needs to come from this source VPC. They would just say, which is what happens in most cases: oh, in my private VPC, I've called out, just contact this SQS endpoint. That's it. There's no policy on the other end. Sometimes people just go down that path.
So detection and monitoring would be something I would definitely, would urge people to look into from that perspective. So this is really a good way for me to bring this back into the conversation for, now we've spoken about detection monitoring, spoken about the challenges with implementation.
We've spoken about how someone would even go about doing this. In terms of the next stage, you've called out that you want to look into the v3 for VPC as well for monitoring. Is there anything else that you're looking for in terms of improvement for this? Whether it's from the AWS side, like I totally feel this is from the AWS side.
Hopefully v3 addresses it. [00:41:00] Are there other things that you're thinking or considering that would be great improvements?
Meg Ashby: Yeah, so back on the networking and like, how do you monitor it? There's actually another service or capability that's provided by AWS. That's called Network Access Analyzer. And basically, if you're not familiar, it's not new.
It came out in 2021, but I find in my years, I've not ever come across people chatting much about it. But what it allows you to do is define allowed or green network paths. So you could say something like, I always want the EC2s or my EKS cluster to access SQS through this VPC endpoint. What it will tell you is, oh, if you have, for example, EKS clusters that are accessing SQS, but they can actually access it over the internet, like I didn't associate the VPC with the Route 53 private hosted zone in the [00:42:00] centralized model, because without associating the private hosted zone your access would still be over the internet if there was a NAT gateway available.
So the Network Access Analyzer, I basically view it as almost like a Config type of service, but for your networking layer, but generally,
Ashish Rajan: posture management for your networking layer.
Meg Ashby: Yeah.
Ashish Rajan: so if you define a path and if that VPC behaves any differently, you get notified.
Meg Ashby: I would say it's not so much of a behaves, but is the network connectivity there?
If that makes sense.
Ashish Rajan: Like a, testing if this is an alive connection kind of a thing?
Meg Ashby: I would say almost like a layer above that.
Ashish Rajan: Like an example to your point, you've defined what the paved road for that is, and it will make sure that paved path is active, or would it make sure that the paved path is the only path that is followed?
Meg Ashby: Yeah, so we would lean towards the latter. I want to say all of my connections to SQS go through [00:43:00] this VPC endpoint. And so if there are connections that can go not through the VPC endpoint, that would be what they call a finding, which kind of is more on the, I don't know if you would call that detection or monitoring.
I feel like to me, that's both.
Ashish Rajan: I would lean more on the detection side. I think the way I see detection is that picking up things which we find are either the anti pattern or for lack of a better word. Oh this should not happen. So I want to detect the fact that S3 bucket is public because my green path is no S3 bucket is public.
Meg Ashby: Yeah. I think that's a great point. It's the detecting the anti pattern. So like how it works is you basically define all of your like green or your yes patterns and then it will look to see, okay, is anything not complying with your green patterns.
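The "define your green patterns, then flag anything else" idea can be modelled in a few lines. This is a conceptual toy, not the Network Access Analyzer API: the `Path` type, the `*` wildcard, and the resource names are all hypothetical, and the real service reasons about network reachability from your configuration, not a hand-written list of paths.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class Path:
    source: str       # where the connection starts, e.g. a cluster name
    destination: str  # the service being reached
    via: str          # the hop the traffic can take to get there


def matches(path, pattern):
    """A pattern field of '*' matches anything; otherwise fields must be equal."""
    return all(want in ("*", got) for want, got in zip(astuple(pattern), astuple(path)))


def findings(reachable_paths, green_patterns):
    """Return every reachable path that no green pattern accounts for."""
    return [
        path
        for path in reachable_paths
        if not any(matches(path, pattern) for pattern in green_patterns)
    ]


if __name__ == "__main__":
    # Green pattern: anything may reach SQS, but only via the central endpoint.
    green = [Path("*", "sqs", "vpce-central")]
    reachable = [
        Path("eks-cluster-a", "sqs", "vpce-central"),
        Path("eks-cluster-b", "sqs", "nat-gateway"),
    ]
    for finding in findings(reachable, green):
        print("finding:", finding)
```

The second path stands in for the case Meg describes, where a cluster can still reach SQS over the internet because the private hosted zone was never associated.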
Ashish Rajan: Interesting. But can it enforce it as well?
And it's funny, I was literally talking to someone about IAM Access Analyzer. And it sounds like Network Access Analyzer is also something I should look into now. But does it enforce it?
Meg Ashby: No. Inherently, it's [00:44:00] just.
Ashish Rajan: Or finding, basically, it tells you that, hey, there's an anti pattern, someone should look into this.
Meg Ashby: Yeah, totally. And then I'm sure, off of that, like any other AWS solution, you can write a CloudWatch event with a trigger to a Lambda that will make everything nice and sparkly again. Perfect. Left as an exercise to the listener.
Ashish Rajan: Yeah. You've heard this here first, folks. That's most of the technical questions I had for this particular conversation.
I will also share the link for your talk, and it's already on the website for fwd:cloudsec as well as on YouTube. So I'll definitely share the link over here and in the show notes in the description as well. I still have three questions, fun questions, for you. Non-technical. First one being, what do you spend most time on when you're not trying to solve the centralized VPC problems of the world?
Meg Ashby: When I'm not at my computer and hopefully not on call, and even sometimes when I am on call, I love spending my time taking ballet classes around New York City. Oh nice. If you are ever in the city, in the ballet scene, you might see me even when I'm on call with my backpack in the corner of the studio, ready to respond to any [00:45:00] network incidents, elegantly, and definitely in a leotard.
Ashish Rajan: Second question, what is something that you're proud of that is not on your social media?
Meg Ashby: This is maybe gonna sound really silly, but when I was in high school, I was like a part-time nanny for some children, and over the summer, I taught a kid how to ride a bike, which maybe sounds really silly, but this bike was too big. There were no training wheels. It was a lot of me running and pulling these handlebars, like through a field, saying keep pedaling.
But eventually the kid did learn how to ride the bike. And I feel like that is probably one of my biggest accomplishments.
Ashish Rajan: Wow. It's not easy. It's not. And I guess it's like the equivalent of teaching someone to fish versus fishing for them kind of a thing, now that the kid can just ride bikes, any kinds of bike, I imagine.
I don't know about the motorcycle, but definitely at least a bike, a push bike.
Meg Ashby: Yeah, for sure. And I think it also parallels a lot of people's either security or cloud learning, or, even today, [00:46:00] AI journeys, which is like sometimes when you're doing things that are new and they're greenfield, right?
Like sometimes the training is not there, right? Like sometimes the way you have to learn is Oh, I'm just going to go into the console and try and deploy, right? Like a language model and then try and see, can I secure it, right? Sometimes you have to learn things the hard way, which sometimes can be like a bit scary.
I had to teach them how to ride the bike the hard way. We didn't have training wheels, right? The bike was too big, but it was what we had, right? You gotta make do sometimes with what's been given to you.
Ashish Rajan: Yeah, 100%. And final question. What's your favorite cuisine or restaurant that you can share with us?
Meg Ashby: I really like Thai Villa. Our office is like south of Union Square. So that's like more Union Square ish area. But it's really delicious. The inside is absolutely beautiful. So if you're ever in New York. Highly recommend trying to get a reservation.
Ashish Rajan: Assuming that's traditional Thai food?
Meg Ashby: Yeah.
Ashish Rajan: That's all the questions that I had for you, Meg, but where can people find you [00:47:00] on the internet to connect with you and talk more about centralized VPC and perhaps help with them solving their centralized VPC challenges?
Meg Ashby: Yeah, I would say the best place to find me publicly, I would say, actually right now it's on LinkedIn.
My legal name, Marguerite, in parentheses, Meg Ashby, is the best place to find me publicly but yeah, that's where I would recommend people get in contact with me.
Ashish Rajan: Awesome. And I'll put that in the show notes as well. But thank you so much for your time.
I enjoyed so much knowing the gaps that we have with VPC and how we can have benefits from anti patterns of AWS as well. So thank you so much for sharing that as well. And
thank you for listening or watching this episode of Cloud Security Podcast. We have been running for the past five years, so I'm sure we haven't covered everything cloud security yet and if there's a particular cloud security topic that we can cover for you in an interview format on cloud security podcast, or make a training video on tutorials on cloud security bootcamp, definitely reach out to us on info at cloud security podcast.
By the way, if you're interested in AI and cybersecurity, as many [00:48:00] cybersecurity leaders are, you might be interested in our sister AI Cybersecurity Podcast which I run with former CSO of Robinhood, Caleb Sima, where we talk about everything AI and cybersecurity. How can organizations deal with cybersecurity on AI systems, AI platforms, whatever AI has to bring next as an evolution of ChatGPT, and everything else continues.
If you have any other suggestions, definitely drop them on info at CloudSecurityPodcast.tv. I'll drop that in the description and the show notes as well so you can reach out to us easily. Otherwise, I will see you in the next episode. Peace.