Security for AI/ML Models in AWS

Episode Description

What We Discuss with Mike Chambers:

  • 00:00 Introduction
  • 06:59 What is AI/ML?
  • 13:50 Privacy Risk & Machine Learning
  • 24:03 Lifecycle of Machine Learning
  • 32:33 ML in AWS – How to get started?
  • 38:49 Components and Services to Consider
  • 45:03 CI/CD for ML
  • 51:14 AI Services for Security
  • 54:08 Open Source for AWS ML
  • 56:47 Maturity in ML
  • 1:01:11 Learn more about ML/AI in AWS

THANKS, Mike Chambers!

If you enjoyed this session with Mike Chambers, let him know by clicking on the link below and sending him a quick shout-out on LinkedIn:

Click here to thank Mike Chambers on LinkedIn!

Click here to let Ashish know about your number one takeaway from this episode!

And if you want us to answer your questions on one of our upcoming weekly Feedback Friday episodes, drop us a line at ashish@kaizenteq.com.

Resources from This Episode:

  • Tools & services discussed during the interview

Ashish Rajan: I know that sometimes when I get people on the stream, the audience already knows them, but take some time to introduce yourself for people who may not have heard of you before.


So how did you get into the whole AWS Hero and machine learning space? What was your entry into this?


Mike Chambers: Well, that’s like the entire podcast right there. All right, where do we go? So my name is Mike Chambers. Hello! I’m into machine learning and into AWS, and yes, I was very, very fortunate to be invited into the AWS Hero program, which sort of pops me up onto people’s radars, I guess, so that people know that I’m doing this stuff.


So I’ve been producing training material and online stuff and YouTube stuff and live streaming stuff about machine learning and AWS and all kinds of things for quite some time. And yeah, AWS kind of support me in what I’m doing. I don’t work for AWS, but they support me and they help give me connections into places.


And they tell me little secrets about what’s coming up in the future.


Ashish Rajan: Sweet. So that is where the whole AWS Hero thing comes in. So you’re the AWS Hero for machine learning and AI.


Mike Chambers: Oh yeah, I am. I’m one of them. I’m not the only one, I’d like to say; I’m one of them. So there’s a few of us around the world.


So each [00:01:00] quarter, I think it is, they reach out to people in the community that they’ve seen doing outreach, talking and blogging and all that kind of stuff. And they invite you in; it’s an invite-only program. And so, yeah, I was fortunate enough, a year and a half ago maybe.


I don’t know, it’s been a whirlwind to be in this program.


Ashish Rajan: That’s awesome. And I think it’s probably a good segue into my question as well, because I think a lot of people get confused by AI/ML. I would love to say that I know what AI/ML is. What is it? Because I imagine people hear a lot about it. What does it mean for you?


Mike Chambers: Yeah. Okay. So let’s park AI just for a second, because that’s a very difficult term to define; it hasn’t really been officially given a definition as such.


People could argue against that. Machine learning, that is something that we can definitely talk about. And unhelpfully, maybe, to start with: it’s the closest thing that you, as a data scientist or as an IT professional, will come to something like magic. It is amazing.


It’s the ability for computers to learn. It genuinely is that. Now, it’s not actually magic, and we do know [00:02:00] how it works. And obviously we apply it to things in our day-to-day lives. Under the hood, what is it? It’s the ability for machines to recognize patterns in data. That’s far less exciting than saying it’s like magic, but yeah. We’re all kind of familiar with the different kinds of use cases for machine learning, but under the hood, what it’s doing is just recognizing patterns in data and learning.


Ashish Rajan: Right. And so AI, then, because it doesn’t have a definition...


Mike Chambers: Well, so AI, it’s more of a pop culture kind of term, really. I mean, artificial intelligence: if you went back 20 years, 40 years, and talked to people about what AI was, it was still a term that was being used back then.


People would just look over the horizon of what was technically possible at the time. And so you’d have expert systems. I don’t know if any of us remember expert systems, where essentially the computer was able to take inputs, and really they were if-then statements, just lots of them, saying, well, if you’ve got this, this and this, then you must be one of these. It’s relatively simple by today’s [00:03:00] standards.


But way back, it was amazing that computers could do that. And that was sort of seen as the emergence of artificial intelligence. And people have been talking about this for ages. Now people talk about artificial intelligence being, again, over the horizon: this idea that Black Mirror comes about.


And we end up with sentient machines and things like that. Now, that’s not to say that it hasn’t been turned into and used as a marketing term, which it absolutely has, but really, truly, I think artificial intelligence is just this unattainable goal that we’re constantly striving towards. I quite like that, because we are constantly moving forward with machine learning and constantly achieving new things.


So it fits in the world. But machine learning really is where the driver is.


Ashish Rajan: So the business use case examples that we see all around us are technically machine learning examples rather than AI examples, right?


Mike Chambers: Yeah. Yeah. I think if we’re seeing actual practical things that are actually happening, obviously we know how they work, and they are based in real science and real data and machine learning.


Ashish Rajan: And what’s a good [00:04:00] business use case? I guess, where do you see people use machine learning quite a bit?


Mike Chambers: Sure. Well, gosh, everywhere. It’s an enormous subject, and I think that’s one of the things that we need to address right up front as well. It is an enormous subject, and we’ll probably come back to that a few times during this discussion. It’s used all over the place. It’s used in places that you are probably quite familiar with.


So, home assistants, Google, Alexa and all this kind of stuff. And I’ve already pressed the mute button on mine, so it didn’t trigger when I said that. Doing the voice recognition, but also doing the sentiment analysis and understanding the language of what you’re asking for: those are very powerful, in-front-of-you examples of what machine learning can do. But there are also things like recommendation engines, things telling you, well, you bought this, you might also be interested in that.


But then there’s fraud detection, and getting into other, more behind-the-scenes kind of stuff: medical applications, where they’re making medical diagnoses from data. It is applied in many, many places, and it will eventually be applied pretty much everywhere.


Ashish Rajan: Right. And I think, to your point, there are even [00:05:00] sub-categories within it, right?


I think the whole Alexa and Siri thing is, what, natural language processing, right? And at almost a 10,000-foot view, they’re all just doing the if-then statements, I guess.


Mike Chambers: Well, so I think that’s a little unfair on machine learning.


It’s a bit more than that. So yeah, you’ve got different categories, like you said. You’ve got natural language processing, and you’ve got image classification, image recognition, that kind of stuff. You’ve got clustering, you’ve got classification, you’ve got lots of different types of things.


At its core, the reason why it’s different from regular development (let’s say regular development is these if-then statements) is that what we do is get the computer itself, or an algorithm, to write the code for us. So let’s step through a simple scenario and maybe get some terminology right.


So half the battle with understanding any topic, right, is the terminology. So let’s get a few terms in play. We start out with data, and we call that training data. That [00:06:00] training data is a wealth of data that we already have. We then take that and we provide it to an algorithm.


An algorithm is a piece of code that a data scientist or machine learning engineer has already written, which knows how to look for patterns in that data. The process that the algorithm goes through is called training. It does a training process and it produces a model. And the model essentially is a piece of executable code that will do the thing that we need.


But we didn’t write that; the algorithm wrote that. That’s what’s really exciting about this. The algorithm wrote this model, broadly speaking. We might not even know how it works, but we know that it does work and we can use it. So we don’t necessarily look inside it and go, oh, there’s the if-then statement, or anything like that.


It works? Good. Excellent. Well, let’s move it into production and use it to recognize someone’s voice or recognize an image or do something like that. And that’s the kind of magical-feeling part of it: there’s something in there that works. And that’s really exciting. I don’t know how, but...
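
To make the flow described above concrete (training data goes into an algorithm, training produces a model, the model is then used for inference), here is a minimal Python sketch. It is a toy and not how production models are built: the `train` function, the single-threshold rule and the transaction amounts are all invented for illustration, and real models are far richer than one learned threshold. The only point is that the rule comes out of the data rather than being written by hand.

```python
# Toy illustration only: a hypothetical "algorithm" that learns one threshold
# from labelled training data and returns a "model" (a function).

def train(training_data):
    """Return a model using the threshold with the fewest training errors."""
    candidates = sorted({amount for amount, _ in training_data})
    best_threshold, best_errors = None, len(training_data) + 1
    for threshold in candidates:
        # Count how many labelled examples this candidate rule gets wrong.
        errors = sum((amount >= threshold) != label for amount, label in training_data)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors

    def model(amount):
        # Inference: apply the learned rule to data the model has not seen.
        return amount >= best_threshold

    return model


# Made-up training data: (transaction amount, was it flagged?)
training_data = [(12, False), (30, False), (45, False), (900, True), (1200, True)]
model = train(training_data)
print(model(20), model(1000))
```

Nobody wrote "flag anything at or above 900" by hand; the training step picked that rule because it fits the labelled examples best.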


Ashish Rajan: It’s certainly interesting, because as you mentioned that, [00:07:00] it reminded me that when we were trying to work on an ML project, we had the whole conversation about supervised learning and unsupervised learning as well.


I mean, there’s a few more layers to it as well, but I think it’s an interesting question that came from Vineet: as machine learning uses patterns of data and prediction, how can we address the risk to privacy from a security perspective?


Mike Chambers: Okay. Yeah. So, very interesting topic. There’s two different sides.


There’s two different parts of that that I think are interesting. You’ve got the training side of things, so the data that’s required to do the training of the model in the first place, and then you’ve got the inference side of things, making an inference from new data, so inferring, making a prediction.


So both the start and the end have a touchpoint with data. From a privacy and security point of view, let’s say at the far end, when we have our actual outcome, then largely speaking, the way that we would deal with that is very similar to any other kind of application, because at that point it’s an application that’s running.


Now, we didn’t write it, but it’s an application that’s running. So the data that’s coming into the system, we would need to [00:08:00] secure it like any other data that’s coming into a system. I think that’s fairly well understood. The other end of the spectrum is where things get a little bit more interesting.


And also there are ethics questions that come in there, which are related, I think, to this. So, in a normal mainstream, I don’t know what you’d call it, non-machine-learning workload, the if-then statement kind of stuff (I like how we’re calling this the if-then development cycle), we can, let’s say, keep developers away from production data.


I like developers, don’t get me wrong. Some of my best friends are developers. But essentially we keep developers away from that data, because they don’t want to have to think about the security of the data they’re playing around with. They just want to write code. That’s their focus.


So we remove that need for them to be in a highly secure environment or whatever. With a machine learning problem, that’s slightly different, right? Because data analysts and machine learning engineers kind of have to work on production data to make the model. So there are some concerns around that, and around providing secure environments.


How can we do that? I [00:09:00] think probably one of the prime ways is by making online and cloud environments for them to be able to do that work. So we tend not to do that work as much in a local environment, pulling down vast, very valuable and potentially sensitive datasets to local development environments.


We do it up in the cloud somewhere. And it also makes sense because we’re talking about vast quantities of data, so just from an efficiency point of view, that works out as well. You then also have this ethics question. Do you actually own the data that you’re building a model from? If you don’t own the data, is that ethically and legally problematic?


And there are lots of questions which are way more complicated than I’m qualified to talk about, but they are very front and center in huge discussions and debates in this area as well. So yeah, lots of things to talk about there, but from a security perspective, I think the prime thing is: get the development up in the cloud.


Ashish Rajan: I think just to add to that as well, and by the way, great examples, I love the answers that you gave. To your point about the normal if-then, and I’m going to use that term as well, there’s the regular if-then [00:10:00] development life versus the algorithm-building machine learning life. In the if-then world there are a lot more controls.


There have been a lot more data access conversations happening for years. With machine learning, I think it’s an interesting one, because in one of the projects that I was involved with, I remember we spoke about masking data: hey, mask my name, mask Mike’s name, the same way.


But unfortunately, machine learning sometimes doesn’t work like that. To your point, and you can correct me if I’m wrong, since you’re obviously working more in this space than I am: from what I was told, for the algorithm to understand the data properly, it requires good data.


And I mean, we can go into the whole data cleaning side separately, but I think it was really interesting to hear that if you mask the data, it doesn’t know if it’s Mike or Ashish, so the next time Mike or Ashish comes in, they just come in as numbers. It’s like the other side of it: to address privacy laws, or a lack of privacy, you can mask data if you want to.


But then your algorithm is kind of pointless, because it’s [00:11:00] working on numbers; it isn’t working off real information. So I’m curious to hear your thoughts on that.


Mike Chambers: Yeah. So what you’re saying there is absolutely correct. Now, there’s a deeper level to that as well, but you’re right. Let’s put a scenario in place where we’re not talking about images and all that kind of stuff; we’re talking about tabular data, fraud detection. With fraud detection, a bank will have vast quantities of transactions that they already have. Some of them will already have been flagged as fraudulent, because they have been doing this business for quite some time, and they know some transactions have been fraudulent, but they don’t really know how to spot them in the future.


And that’s because when you look at it, it’s got loads and loads of different data points in it. Each of those records, both fraudulent and non-fraudulent, has many, many data points, and they will be things like someone’s name and their address and the time of purchase and the amount of the purchase and what they purchased and all this kind of stuff.


And so we would then label that data, and that comes to your point about supervised training, by the way. So that’s what [00:12:00] supervised machine learning training is: it’s where we have a dataset where we already know these examples, and we want the computer to learn what that looks like.


So it can detect it in the future. Now, when we as humans look at it, we might bring our own biases to it and all kinds of things, but essentially we can’t really see the pattern, because we’re human and we’re flesh and we’re flawed things. We’re not purely statistical in our analysis.


So we want to get the machine learning to do it. That’s the bottom line of a supervised machine learning problem, and that’s what fraud detection might look like. Now, to speak to your point about masking that data: what you need here, and in machine learning projects in general, is to get domain experts involved.


People who actually understand the data, and also understand machine learning, honestly, to be able to help with that kind of thing. Now, in that particular case, taking that person’s name and masking it, or even completely removing it, is probably okay. And ethically that’s probably the right thing to do as well.


If we start to say, well, if you’re called Frank, then you’re more likely to be fraudulent, then there’s probably something wrong with that. It doesn’t sound right. [00:13:00] I’m not an expert in fraud. But take their address out of it? Well, maybe that starts to be something slightly problematic, because you are talking about: was this transaction made in Scotland by someone who lives in Frankfurt? Okay, so address does matter, not to mention all the socioeconomic and other kinds of stuff as well, potentially. So you kind of do need that, probably (again, not a fraud expert), in order for the algorithm to be able to actually make the right kind of determination.


So that’s the sort of illustration of why anonymizing data or masking data isn’t necessarily going to work.
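
A minimal sketch of the trade-off discussed here, assuming a hypothetical fraud record with `name`, `amount`, `home_country` and `purchase_country` fields (none of these names come from the episode, and the key is a placeholder). The name is replaced with a keyed hash, since it should carry no predictive value, while the location signal is kept so a model can still see that a purchase happened away from home. Whether this split is right for real fraud data is a question for a domain expert.

```python
import hashlib
import hmac

# Hypothetical key: in practice this would live in a secrets manager,
# outside the environment where models are trained.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymise(value: str) -> str:
    """Replace an identifier with a keyed hash: linkable, but not readable."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def prepare_record(raw: dict) -> dict:
    """Drop the direct identifier but keep the location signal."""
    return {
        "customer_id": pseudonymise(raw["name"]),
        "amount": raw["amount"],
        "home_country": raw["home_country"],
        "purchase_country": raw["purchase_country"],
        # Derived feature: was the purchase made away from home?
        "purchase_away_from_home": raw["home_country"] != raw["purchase_country"],
    }


record = prepare_record({
    "name": "Frank Example",
    "amount": 250.0,
    "home_country": "DE",
    "purchase_country": "GB",
})
print(record)
```

The same name always maps to the same `customer_id`, so records can still be grouped per customer without the training environment ever seeing the name itself.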


Ashish Rajan: Sure. And this was a good point as well, because people who are listening to this episode are probably wondering: what levers do I have to pull from a data and privacy perspective, whether it’s data access or the context of, do you really need my first name, last name and address? To your point, use cases may vary; maybe fraud detection does need it, but I don’t know.


I’m just trying to think of an example where a recommendation engine [00:14:00] may not, especially if you just say, hey, you bought this t-shirt, you should probably buy these trousers as well. That doesn’t even require my address, my name and everything; it may be happy with me being a number. And it might just be a recognition that people who bought the t-shirt have usually also bought the pants as well.


So maybe I don’t care what your name is, or who you are and where you’re from, but the statistics say that nine out of 10 people who bought this t-shirt have also bought these pants, so most likely you would like them. Is that a fair simplification of it?


Mike Chambers: I’m interested in trying to pretend to be the domain expert in some of these businesses, but I don’t know so much about the retail space. I mean, obviously we can make educated guesses, but I’m not an actual expert in the data of a retail space.


However, you could imagine quite readily that people who have bought this t-shirt who live in the northern half of the country might want to buy this, and people who buy this t-shirt who live in the southern part of the country, where the weather is different, might want to buy something different.


That’s absolutely a thing; I could imagine that that’s a thing. So then maybe [00:15:00] the data masking side of things is slightly more nuanced than that, and maybe we deal with it in data pre-processing. So there is a conversation to be had about the data cleaning and data pre-processing maybe happening in a separate environment.


So we say, okay, we’re not going to give you the actual street address of this person, but we’ll give you this state, or we’ll give you this suburb or something, which anonymizes them somewhat. But then someone’s got to do that, and some system’s got to do that, and there’s got to be security around that.


Ashish Rajan: Yeah, and I think that’s a good one.


It’s a good segue into the whole machine learning space, because we spoke about the business use cases and we spoke about the data privacy thing as well, but I also wanted to get into the whole building of a machine learning thing, and how security gets involved in there.


Because this is definitely the customer use case, and day to day it’s a challenge that security folks are going to face as well. It’s the whole machine learning lifecycle thing; that was an interesting term that I came across. Would you be able to explain what the machine learning lifecycle is, and the different types of machine learning?


Mike Chambers: Oh gosh. Yes. So yeah. All right. So I mean, I guess [00:16:00] we’ve talked at a high level there about a typical kind of flow. You start out with data, usually vast quantities of data, and put it through an algorithm. You do your training and get your model, then make your inference. What we’ll then find is that we can measure the effectiveness of that inference, take the outputs and the results of it, feed them back into the cycle and start to refine our model. Making it more accurate would potentially be one thing.


And I’m using accuracy in a very broad sense there, basically meaning make it better. Accuracy is a specific term, but we can just make it better. The other thing as well is that situations change, right? Take what we’ve talked about there with the retail sector: fashions change, people decide different things.


So we can’t just have one model that we made six years ago still providing the same business value today. So again, that machine learning cycle just has to keep going around, to adapt and get more data in from other places, and use the same data that [00:17:00] we’ve been using, and cycle it round and round.
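
One way to picture that cycle is a check that measures a deployed model against recent labelled outcomes and flags it for retraining when it has drifted. The sketch below is a simplified assumption, not a description of any particular product: `accuracy` here means the plain fraction of correct predictions, and the 0.9 threshold, the stale rule and the sample data are all made up.

```python
def accuracy(model, labelled_examples):
    """Fraction of recent examples the model still predicts correctly."""
    correct = sum(model(x) == label for x, label in labelled_examples)
    return correct / len(labelled_examples)


def needs_retraining(model, recent_examples, minimum_accuracy=0.9):
    """Flag the model once its measured accuracy drops below the threshold."""
    return accuracy(model, recent_examples) < minimum_accuracy


def stale_model(amount):
    # A rule learned some time ago: flag transactions of 900 or more.
    return amount >= 900


# Made-up recent feedback: smaller transactions are now being confirmed as fraud.
recent_examples = [(20, False), (400, True), (500, True), (1000, True)]

print(accuracy(stale_model, recent_examples))
print(needs_retraining(stale_model, recent_examples))
```

In a real pipeline the "retrain" decision would kick off a fresh training run on newer data, which is where the compute and workflow costs come from.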


So it does mean that it’s quite an intensive industry, if you like. Machine learning in general uses quite a lot of compute, and to constantly retrain takes quite a lot of compute and data flow and workflow to make that happen. So that is one of the reasons why it’s quite different from software development, again, the if-then kind of software development.


I think that industry has done quite well. If you are a software developer, you know how it works. You turn up to work at a software development agency, they’re using some kind of Git repository and some kind of pipeline, and they test it and they put it into production. I’m being very simplistic about it, but that’s basically what happens, right?


And they might use different flavors of Git repositories and they might use different flavors of pipelines, but essentially they’re doing the same thing, or they should be anyway. And if someone who works in shop A goes and works in shop B, then they will see a similar kind of thing happening.


In machine learning, there are very different kinds of projects. There are different kinds of challenges. There isn’t [00:18:00] really a one-size-fits-all for that. And I think we’re still kind of catching up with what some of the possibilities are. There are lots of different solutions out there. They all require different kinds of security constraints and security controls and processes.


And it just makes it kind of complex. If I’m honest, I don’t have a super good news story for you here, like, you know what, just get this service or do this, and everything’s going to be fine. Everything kind of has to be treated on its own at the moment, anyway, until somebody comes up with some standard way of delivering this stuff.


So. 


Ashish Rajan: Right. And it kind of goes well with a comment Rama had over here: is there a standard best practice to begin with? I know it differs by use case, but I wonder where to begin. And actually, before that, there’s another great comment: my observation is that a machine learning project starts with the problem and then moves forward, but security and privacy are handled last, or at a late stage, because there’s no set standard for them.


Mike Chambers: Yeah. Well, that’s the age-old problem of [00:19:00] security as well, right? People get really excited about delivering a solution, and then someone says, how are we going to secure this now? And so, yeah, we need to be thinking about security throughout, always. And I know that I don’t need to persuade anybody on this podcast of that.


But you’re right, you start off with the problem that you want to solve. That’s actually a really insightful point, and something that a lot of people end up getting confused about with machine learning as well. With machine learning, you don’t collect a large amount of data, run the machine learning and see what happens.


You do actually have to decide what it is that you’re trying to achieve. What question do you have? Then you can figure out the areas of data that will probably be useful, not necessarily, but probably, and then work at it that way. And yeah, look, in a proper robust environment, the security of that has to be thought about from the beginning as well.


Age-old problem.


Ashish Rajan: All right. Yeah. And I think, to your point, another security thing that is worth looking at is the data scientists that you work with, I guess. Usually, nine out of 10 times, they are not employees of your [00:20:00] company. So you kind of have to figure out a way to provide them access from outside; you treat them as, quote unquote, contractors.


So there’s that, there’s that challenge of data as well as access. 


Mike Chambers: Yeah. That’s a very good point, actually. I hadn’t really thought of that so specifically, but you’re right. I mean, if you’re working on a very large machine learning project, you might find that you are struggling to find resources, right?


So we know this is an industry where we are really struggling to find enough people to do the jobs right now. And we know that there is a huge problem coming, given the projections about the amount of machine learning that’s going to be used in the day-to-day application space, in business, adding social value and all the rest of it.


We don’t have enough people currently going through the training processes to fill those roles. So it’s a huge problem, and also a huge opportunity if anybody’s interested in getting into the space. So yes, to your point, at the moment lots of agencies would be used for that kind of work, and yes, you’re giving them access to the data.


But at some point you have to trust, and I know that’s a [00:21:00] problem, but you kind of just have to, right? So it depends how far you want to go, whether you bring those people into a physical space that you control, so they’re being physically monitored. More and more, that’s becoming a luxury that most people can’t afford.


So having your data in a cloud environment makes sense on two levels. Like I’ve said before, with the vast quantity of data, you can afford to put it there. You can also afford to keep it there, and you don’t have to shuttle it around over networks all the time, which takes a long time.


So that’s good, but you can also put security controls around it and say, okay, here is an environment that you can actually use to interact with it. So we’re not using tools on our actual local machine; we’re using IDEs and platforms in the cloud to do this. So that makes sense on many levels.


And it means that you can at least put security controls around that part of it. Can you stop someone from taking a photograph of that screen? No, you can’t. And what monitoring can you put in place for data exfiltration and the rest of it? There is some, but at some point, yeah, you kind of have to trust your data [00:22:00] scientists.


Ashish Rajan: There’s no data leakage prevention there. Yeah, it’s a hard one. And to your point, a lot of the questions are not even being asked, so you don’t even know that there are problems at this stage. At this point, a lot of people are focused on building the algorithm that helps their machine learning, because it’s all about: let’s get the data and see what we get, because this could be an experimental project to begin with.


There might not be anything useful that comes out of it. So it’s definitely an interesting point. Quickly addressing one more comment that came through as well: dive deep into the Web3, crypto and NFT world. Oh my God, man. We’re talking about AI and machine learning, and Martinez is already on Web 3.0.


Mike Chambers: Absolutely fantastic. Send some our way.


Ashish Rajan: So, kind of going back to Rama’s question as well.


We spoke about what people do with it in the business use case context. We spoke about the lifecycle as well. Now I want to take a step back and go: okay, I’m ready to build one. Mike, you’ve sold me on the idea of machine learning. So say I’m trying to build this in, I guess, AWS.


What would be some of the components? Obviously I would require data, but what are some of the [00:23:00] other moving parts that people should think of as good standard practice as well? It kind of goes into what Rama was asking here: where do I start? What happens in the beginning?


Mike Chambers: Let’s talk about AWS services, right? So there are some AWS services which are clear and obvious candidates for this kind of thing. It’s also what I know about. So yes, there are going to be analogies with this in other platforms and places, but as I’ve already described, I think cloud works really well for this generally.


And AWS has obviously got some services which work well in this space. So the first one is S3. Storing data, that’s what you need. When I’m talking about data, we’re talking at the very least terabytes or petabytes of data for a real, commercial, high-grade model.


You’re talking about as much data as you can get, usually. So storing it somewhere really cheaply, like S3, makes a lot of sense. And with S3, you’ve got the security controls in place there as well. So you can prevent people from being able to get access to it from the outside world, and you can allow your applications on the inside of your [00:24:00] AWS environment to get access to it.
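
As a rough illustration of the kind of S3 controls mentioned here, the sketch below builds two settings in plain Python: a Block Public Access configuration, and a bucket policy that denies any request not made over TLS. The bucket name is hypothetical and the TLS rule is an extra, commonly used control rather than something stated in the episode. Nothing here calls AWS; with boto3 these structures would typically be passed to the S3 client's `put_public_access_block` and `put_bucket_policy` calls.

```python
import json

BUCKET = "example-training-data"  # hypothetical bucket name

# Block Public Access settings: keep the bucket unreachable from the outside world.
public_access_block = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

# Bucket policy: deny every request that does not use TLS.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

# put_bucket_policy expects the policy as a JSON string.
policy_json = json.dumps(bucket_policy, indent=2)
print(policy_json)
```

Access for applications inside the AWS environment would then be granted separately, for example through IAM roles scoped to this bucket.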


So it’s fairly well understood and fairly well known. And I’m sure that if you’re talking about AWS more generally in terms of security, then that’s not the first time you’ve heard about that. Then it’s about how we can actually process that data in the cloud. Now, there are oodles of different ways that you can do that, right?


So I’m not going to list them all out. The one that I’ll point out straight away, though, which probably speaks most to the kinds of workflows that we’ve been talking about, is SageMaker Studio. Now, SageMaker is an entire collection of tools and APIs and documentation and code samples and all kinds of things, and containers, that help data scientists and machine learning engineers to actually leverage cloud, and leverage the AWS cloud.


And to be able to take the code that they’ve already written and add to it, to make it cloud scale relatively easily. That’s kind of what SageMaker is. A lot of people get a bit scared away from it. I totally understand that, because they look at other AWS services, like S3, and say, what does it do? It stores things. 


Excellent. I get it. Like, [00:25:00] yeah, I know. And if the S3 team is watching, then I understand it’s much more complex than that under the hood, but essentially it stores things. EC2, what does it do? It’s just like, we get it. So you can understand these things. But SageMaker, actually, it’s many, many things. 


It’s not really a service. So when people peel the lid open on it, they look at it and go, that’s too complicated. And they put the lid back. I’m encouraging people not to do that. Just take the bit, which is useful to you. Anyway, there’s a wider conversation there about machine learning. I want to come back to that point, but. 


So inside of SageMaker, there is a service, a thing called SageMaker Studio. And what that is, is a JupyterLab environment, which is fully managed, allows you to process data, and hooks nicely into existing AWS services. So this is the environment that you can use that’s fully hosted in cloud, with your defined security controls around it. 


And you can allow people to process data and interact with other AWS services and do what they need to do, but in a controlled environment. And I think if you were to [00:26:00] start out with a machine learning project, and this is a machine learning project where you are building your own models, then that is a fantastic place to start and you might branch off and go into other places, but that’s a fantastic place to start. 


Ashish Rajan: And so we have SageMaker, we have an S3 bucket. And then, because Redshift seems to keep coming up quite often as well, and you didn’t mention Redshift, I’m curious: is that not considered a component that goes into this? 


Mike Chambers: So, so Redshift is your data warehouse, right? 


So this is your petabyte scale data warehouse. And if you want to, so, yeah, this is one of those things where there’s like 17 different ways to do the same thing. You have to decide which way you want to go. Now, from a Redshift perspective, I would suggest that if you’re already using that data warehousing for large scale analytics, and I’m talking large scale analytics: if you’re passing through a million rows, then you’re not big enough yet. 


But if you’re passing through hundreds of billions, or hundreds of millions, of rows, then sure, then you’re at petabyte scale, that’s fine. You can start [00:27:00] to use that to do your dashboarding and reporting and analytics and all that kind of stuff. Now, you can also bolt into that. That could be your data source, I suppose, for machine learning. 


And because of the way that machine learning is on the model creation side of things, it’s very asynchronous, right? It’s an offline activity. So putting it into S3 is probably the cheapest place. So if it was in Redshift, you’d probably bring it out of there. The other place is EMR, so Elastic MapReduce. 


If you’ve got an EMR cluster, then, because that’s a Spark environment anyway, you’ve got Spark ML, Spark machine learning. And so you’ve got the ability to actually run machine learning workloads directly inside of EMR. And you can also link EMR into SageMaker as well. And you can also export data from EMR into S3. So there’s lots of possibilities. 


So if you already have one of those things, like Redshift or EMR, then sure, work at building your machine learning workflow into that. Many places don’t, so that’s, I guess, why the default start is S3. 
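Bringing data out of Redshift into S3 for offline training is typically done with Redshift's `UNLOAD` command. The sketch below only builds the SQL text; the table, bucket prefix, and role ARN are made-up placeholders, and running the statement against a cluster is left out.

```python
def build_unload_sql(query: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift UNLOAD statement that writes query results to S3 as Parquet."""
    # UNLOAD takes the SELECT as a quoted string, so single quotes inside
    # the query have to be doubled.
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


if __name__ == "__main__":
    # Hypothetical table, bucket, and role names.
    print(
        build_unload_sql(
            "SELECT * FROM sales WHERE region = 'EU'",
            "s3://example-bucket/exports/sales_",
            "arn:aws:iam::123456789012:role/ExampleUnloadRole",
        )
    )
```

Parquet is chosen here only because it is compact and widely readable by training tooling; CSV would work with a different format clause.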


Ashish Rajan: Right. Okay, cool. Yeah. Cause I think I’ve definitely heard of S3 [00:28:00] buckets and the whole data lake idea. 


A couple of companies were built on the whole concept of, let’s build a data lake, and that would be step one towards doing some kind of data algorithm. But I’m curious now: if those are some of the moving parts from an AWS service perspective, and most people over here are maybe from a security background, what are the other services in terms of networking, identity, compute, workload, things like that? 


What would some of the thinking be over there at that point? 


Mike Chambers: Yeah. So from that perspective, I’m very much the, sort of the out of the box, kind of what AWS does. And I guess, because I am the first person in this series on AWS, maybe I get to talk about this as well. So inside of AWS, how do you secure things in general? 


And that will also play to machine learning workloads. So the fundamental starting block is identity and access management. So you have IAM, and this is very much about controlling who in your [00:29:00] organization has access into AWS. And it is somewhat about what services in AWS can talk to other services in AWS. 


There’s a very rich, policy-driven kind of ecosystem there for being able to define what that security landscape looks like. So that’s very important, and you can get very in-depth with that, and you need to have some kind of a process over there, I think, at a really fine-grained level. So that’s that. Then on the actual workload side of things, actual servers running, doing things, which could be running machine learning projects or otherwise, we’re talking about VPC. 
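As an illustration of what fine-grained can mean in IAM, here is a minimal sketch of an identity policy that lets a role list and read only one prefix of a training-data bucket. The bucket and prefix names are hypothetical, and the function only returns the policy document rather than attaching it to anything.

```python
def read_only_training_data_policy(bucket: str, prefix: str) -> dict:
    """Allow listing and reading objects under a single prefix, and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListOnlyTheTrainingPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys that start with the prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjectsUnderThePrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }
```

Because IAM denies by default, anything not listed here (writes, deletes, other prefixes) stays unavailable to the role unless another policy grants it.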


So virtual private cloud, which is essentially a network and it, you get to define what that network looks like. You get to define public and private subnets. You get to define the routing and you get to define sort of stateless firewalls that are configured to manage the traffic between different components inside of that, that’s very important as well. 
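Defining what the network looks like usually starts with carving the VPC address range into subnets. The sketch below uses only Python's standard `ipaddress` module to plan public and private blocks; the CIDR range and counts are example values, and the split into "public" and "private" is just a label here, since in a real VPC that distinction comes from routing.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, new_prefix: int, public_count: int, private_count: int) -> dict:
    """Split a VPC CIDR into equally sized blocks and label them public or private."""
    network = ipaddress.ip_network(vpc_cidr)
    blocks = list(network.subnets(new_prefix=new_prefix))
    needed = public_count + private_count
    if needed > len(blocks):
        raise ValueError(f"{vpc_cidr} only holds {len(blocks)} /{new_prefix} subnets")
    return {
        "public": [str(b) for b in blocks[:public_count]],
        "private": [str(b) for b in blocks[public_count:needed]],
    }


if __name__ == "__main__":
    print(plan_subnets("10.0.0.0/16", 24, public_count=2, private_count=2))
```

For a machine learning workload, the notebook and training instances would normally sit in the private blocks, with only load balancers or bastions in the public ones.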


So those two at the very, very base level. Above that there are a million and one other options, [00:30:00] of course: web application firewalls, security monitoring. And also, let’s bring it back to machine learning for a second. Interestingly, AWS use machine learning themselves to actually help enable some of the products and services that they have. 


One of them, for example, just to pick one out: GuardDuty. GuardDuty is an intrusion detection system that works at the infrastructure layer, which uses machine learning, and, they also say, artificial intelligence, to monitor the metadata of traffic flowing around inside of your AWS account and look for potential compromise. 


So is there someone minting NFTs, for example, when that’s not what you were planning to do? Or is there someone sending spam, or is there someone mining Bitcoin? It should be able to detect that kind of thing and alert you to the fact that: we don’t think that’s what you’re doing in your account. 


We think that’s what’s happening here. Here, have a look at this. So that’s the use of machine learning there as an actual service, so you don’t need to know anything about it. You just plug it in and use it. 
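Once findings like these arrive, a team still has to decide which ones to look at first. The following is a minimal triage sketch over finding records shaped like GuardDuty's (`Id`, `Type`, `Severity`). The sample findings, the severity numbers, and the threshold are invented for illustration; only the finding type strings follow GuardDuty's naming.

```python
def triage(findings, keywords=("CryptoCurrency", "Spambot"), min_severity=4.0):
    """Return IDs of findings that match a keyword and severity floor, most severe first."""
    matches = [
        f for f in findings
        if f["Severity"] >= min_severity and any(k in f["Type"] for k in keywords)
    ]
    matches.sort(key=lambda f: f["Severity"], reverse=True)
    return [f["Id"] for f in matches]


# Hand-written sample data, not real findings.
SAMPLE_FINDINGS = [
    {"Id": "a1", "Type": "CryptoCurrency:EC2/BitcoinTool.B!DNS", "Severity": 8.0},
    {"Id": "b2", "Type": "Recon:EC2/PortProbeUnprotectedPort", "Severity": 2.0},
    {"Id": "c3", "Type": "Backdoor:EC2/Spambot", "Severity": 5.0},
]

if __name__ == "__main__":
    print(triage(SAMPLE_FINDINGS))
```

The detection itself stays inside the managed service; code like this only sorts what the service has already flagged.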


Ashish Rajan: Right. Sweet. And that’s a good segue into one of the questions that [00:31:00] came in from Brandon. Mike, can you speak a bit about Zelkova? 


What’s the origin or the use case, and where is it headed? AWS Zelkova. I think Zelkova and Tiros were the two services for data security that were announced by Amazon. And I’m like, ooh, I haven’t heard about it for a long time. So I honestly had to Google it as well, like, oh, okay. 


Mike Chambers: I’m going to say I did Google it. Cause I saw that in the chat and I was like, honestly, I’m not sure, not sure what that is. 


Ashish Rajan: For Amazon, AWS Zelkova, or Tiros, T-I-R-O-S, it was like an announcement made by them ages ago. And I never heard anything about it after that. So Brandon, if you have some information yourself, yeah, feel free to share it, but I think it was like an internal 


AWS project, if my memory serves me right. They said they were going to open source it, but that never happened. Or it was used by some other software in the back end. But Brandon, you can correct me if I’m wrong. And 


Mike Chambers: I use quite a bit, and it’s not in my day to day vernacular for the moment. 


There are some other kinds of services that may be spun out of this, [00:32:00] but I wouldn’t be able to, hand on heart, say that that’s what it was. So yeah, I like it. It also appears to be some kind of plant because I’m good. Yeah, 


Ashish Rajan: you can blame it on, apparently, most names of services in AWS being plants or animals or fishes or something. So, 


Mike Chambers: so it depends on the era. There’s an entire conversation that can be had around that. You’ve got everything from S3, and then you’ve got your Elastic Beanstalk, and then you’ve got Fargate. 


So it’s amazing. Yeah, I, there is no, I don’t wish to try and overlay any kind of rhyme or reason to AWS naming convention. I just go. Yeah, fair enough. 


Ashish Rajan: So, so Brandon just came back as well? No was nothing. We’re hoping we, I mean, yeah, I think, I don’t think, yeah, it’d be really interesting to hear if we actually hear anything about it in late in the field, but that was a great question. 


Mike Chambers: It’s every day from now on, I’m going to hear about this thing. That’s what it was, right. That’s what we were talking about. So I’ll have to, 


Ashish Rajan: yeah. And Brandon, if you find out about it, I’d love to hear about it as well, man. I’ve definitely been quite keen to hear about it. I’ve got another question from Phani as well. 


What technology can we learn, for people who are starting off [00:33:00] new, I guess, in the AI/ML space? I will probably park this one, because I’ve got this question right at the end as well. So I’ll come back to this. But before that, because we were kind of going through the AWS components that were involved in this: 


To your point about the if-then world, where we had the whole CI/CD pipeline, and all these other things that were done with GuardDuty and all that: are they all still applicable to a machine learning project? I’m just thinking of a project like what we spoke about with the retail store. 


I just want to make a recommendation engine and I’ve got a bunch of data from the last 10 years in an S3 bucket. Are there components like CI/CD and all that, that get involved in this? 


Mike Chambers: Yes. I think, to my point, if you were to ask me, is there CI/CD for machine learning, generally, I’m going to contentiously say no, but that’s because there are many different solutions. 


So there’s not just one, there are different things. I think if you look at recent, and by recent I mean the last year or two, blog posts from AWS and also industry in general, and service announcements and stuff, and you look [00:34:00] for the word pipeline and something to do with machine learning, 


You’ll find probably at least four completely different pipelines that can be used at different parts of the overall life cycle. And so you have to sort of choose the one that makes sense to you, or maybe choose multiple of them. Cause you can pipeline the data pre-processing, you can pipeline the training part, you can pipeline experiments. 


Because the other part of this, right: when I glibly said before that an algorithm writes a model, that is such a gross oversimplification. What really happens is the algorithm produces a model, it’s rubbish, and we throw it away, and then we do it again. 


It’s also rubbish and we throw that away and then we do it again. The process to actually train takes a long time, because we’re trying many different iterations of slightly different configuration options and stuff like that. So orchestrating all of that is a pipeline in its own right. And then you’ve got the inference side, and pipelines there as well. 
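The train, discard, and retry loop described above can be sketched in a few lines. This is a toy illustration, not how any AWS service implements it: `train_and_score` is a hypothetical stand-in for a real training job, and its scoring formula is made up so that the example runs instantly.

```python
import itertools


def train_and_score(learning_rate: float, depth: int) -> float:
    """Stand-in for a real training job; returns a made-up validation score."""
    # Toy objective that peaks at learning_rate=0.1 and depth=6.
    return 1.0 - abs(learning_rate - 0.1) - 0.05 * abs(depth - 6)


def run_experiments(learning_rates, depths):
    """Try every configuration, keep all results, and report the best one."""
    results = []
    for lr, depth in itertools.product(learning_rates, depths):
        score = train_and_score(lr, depth)
        results.append({"learning_rate": lr, "depth": depth, "score": score})
    best = max(results, key=lambda r: r["score"])
    return best, results


if __name__ == "__main__":
    best, results = run_experiments([0.01, 0.1, 0.5], [3, 6, 9])
    print(f"tried {len(results)} configurations, best: {best}")
```

In a real project each call to the training function could take hours, which is why the orchestration around this loop ends up being a pipeline of its own.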


So it gets really, really complex. Like, who knew machine learning was complex? [00:35:00] So there’s a bunch of things now inside of the SageMaker world, and if you look inside SageMaker Studio, which kind of surfaces some of this stuff, you’ve got SageMaker Pipelines, which is essentially a set of APIs that can help with the orchestration of that training process, that bit in the middle. 


And I think that’s probably the closest thing which is akin to what we’re talking about in CI/CD. Sometimes these things as well are underpinned by an actual pipeline, so CodePipeline, which is essentially CI/CD tooling that you can use in the machine learning world as well. 
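The CI/CD analogy is easiest to see as a sequence of steps with a quality gate at the end. The sketch below is plain Python and does not use the SageMaker Pipelines or CodePipeline APIs; the step functions and the scoring rule are hypothetical placeholders chosen so the example is runnable.

```python
def preprocess(rows):
    """Drop missing values."""
    return [r for r in rows if r is not None]


def train(data):
    """Toy 'model': just the mean of the data."""
    return sum(data) / len(data)


def evaluate(model, data, tolerance=1.0):
    """Fraction of points within `tolerance` of the model's prediction."""
    return sum(1 for x in data if abs(x - model) <= tolerance) / len(data)


def run_pipeline(raw_rows, min_score):
    """Run the steps in order and only 'register' the model if it passes the gate."""
    steps = []
    data = preprocess(raw_rows)
    steps.append("preprocess")
    model = train(data)
    steps.append("train")
    score = evaluate(model, data)
    steps.append("evaluate")
    steps.append("register" if score >= min_score else "reject")
    return {"steps": steps, "score": score}


if __name__ == "__main__":
    print(run_pipeline([1.0, None, 2.0, 3.0], min_score=0.8))
```

The gate plays the same role that a failing test plays in a conventional build: a model that does not meet the bar never reaches deployment.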


So there are bits and pieces like that. I do, however, want to take this problem space as well. Okay. So you’re working in retail. You’re wanting to build a recommendation engine. There are other ways that you can do it as well inside of AWS. Without having to go to that level of getting into tweaking algorithms and producing models, you can actually use pre-existing services. 


So there is a layer inside of AWS, and it’s important to know that this exists from a security perspective as well as practically speaking: there’s this [00:36:00] layer of what they call AI services. I appreciate that I had this sort of thing about AI, but they call them AI services, and they call them that to differentiate them from the ML stuff that we’ve been largely talking about with SageMaker. 


And these are services which are API driven. An if-then developer who has knowledge about being able to integrate with APIs would easily integrate with these. And so these can be secured by regular kinds of policy inside of AWS. And in the case of Personalize, which is a service which does exactly what you’re talking about, you throw a bunch of your data at it and you can get recommendations out of it. 


It’s a fully managed version of everything that we just spoke about. It’d be a sensible idea, especially at the beginning of the project, to see if that is suitable for what you need. There are a whole bunch of other products like that as well, for image recognition. So if you want to build image recognition into your application, then 


That’s Rekognition, with a K, as a service. Throw images into it. You can go into the AWS console now and [00:37:00] actually throw images into it and it will give you a sample of the data that it would provide. It’s very interactive, to see how it would actually work. Comprehend, text recognition stuff, or text-to-speech stuff, all of this stuff. 


It also exists. It’s all ML powered and all of the hard work’s been done for you already. So if it fits your use case, you should probably go. 
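To give a feel for what integrating with one of these APIs involves, here is a small sketch that post-processes a label-detection response. The `Labels`, `Name`, and `Confidence` fields follow the shape Rekognition's `DetectLabels` operation returns; the sample response itself is hand-written, and the 90 percent cutoff is an arbitrary choice for the example.

```python
def confident_labels(response: dict, min_confidence: float = 90.0) -> list:
    """Keep only the label names the service is reasonably sure about."""
    return [
        label["Name"]
        for label in response.get("Labels", [])
        if label["Confidence"] >= min_confidence
    ]


# Hand-written sample in the shape of a label-detection response.
SAMPLE_RESPONSE = {
    "Labels": [
        {"Name": "Person", "Confidence": 99.2},
        {"Name": "Dog", "Confidence": 61.5},
        {"Name": "Face", "Confidence": 97.8},
    ]
}

if __name__ == "__main__":
    print(confident_labels(SAMPLE_RESPONSE))
```

The application code is ordinary dictionary handling; the model behind the response is entirely the service's concern.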


Ashish Rajan: Actually that’s a good one. I didn’t realize there were services that were almost like a PaaS, I guess, for lack of a better word, as platforms that 


Mike Chambers: are exactly the right word. 


Yes. That’s exactly what it is. And most of them actually are pre-trained machine learning models. So literally I could take your image, I could put it into Rekognition, and it would say that it’s a person, it would find your face, it would tell me that you’re happy, because you currently are, and it would give me an approximation of your age and all this kind of stuff, and it already exists. 


So, a whole bunch of pre-trained models. Take the audio from this podcast, stick it into Transcribe, and it would bring out audio to text, speech to text. Then put it back [00:38:00] into Polly, and make it so that we have a version of this podcast where all of our voices are replaced by automated AI voices, in a different language with Translate. 


So you can do all of these things just with pre-trained models that are just there. If you know how to use an API, you can bolt this into your application. Wow. 
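The podcast-dubbing idea is a chain of three calls: speech to text, text to translated text, translated text back to speech. The sketch below captures only that shape. The three callables are injected so the example runs without any cloud access; in practice each might wrap Transcribe, Translate, and Polly respectively, but the wrappers are not shown and the fakes below are purely illustrative.

```python
from typing import Callable


def dub_episode(
    audio: bytes,
    target_language: str,
    transcribe: Callable[[bytes], str],
    translate: Callable[[str, str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Chain speech-to-text, translation, and text-to-speech into one step."""
    text = transcribe(audio)
    translated = translate(text, target_language)
    return synthesize(translated)


if __name__ == "__main__":
    # Fake stand-ins so the chain can be exercised locally.
    fake_transcribe = lambda audio: audio.decode("utf-8")
    fake_translate = lambda text, lang: f"[{lang}] {text}"
    fake_synthesize = lambda text: text.encode("utf-8")
    print(dub_episode(b"hello", "fr", fake_transcribe, fake_translate, fake_synthesize))
```

Passing the services in as functions also makes the chain easy to test and to secure, since each wrapper can run under its own narrowly scoped role.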


Ashish Rajan: And that kind of makes me think: are there AI services for security? A lot of people would have a lot of security logs to go through, and to a large extent it’s pattern recognition or running queries on them as well. 


Or, to your point, fraud detection could be part of it: hey, Ashish did a transaction from, I don’t know, Australia; two seconds later he did a transaction from India; and another two seconds later he did a transaction from Scotland. Clearly he isn’t moving that fast. So things like that, I’m sure you can teach an algorithm. 
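The Australia, India, Scotland example is often called an impossible-travel check, and a rule-based version needs no machine learning at all. The sketch below computes great-circle distance with the haversine formula and flags consecutive transactions whose implied speed exceeds a limit. The coordinates, timestamps, and the 1000 km/h limit are illustrative assumptions, and a managed fraud service would use far richer signals than this.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def impossible_travel(transactions, max_speed_kmh=1000.0):
    """Return (earlier_id, later_id) pairs that imply travel faster than the limit."""
    ordered = sorted(transactions, key=lambda t: t["time"])
    flagged = []
    for prev, curr in zip(ordered, ordered[1:]):
        distance = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
        hours = (curr["time"] - prev["time"]).total_seconds() / 3600
        # Same location is never suspicious; zero elapsed time over a distance always is.
        if distance > 0 and (hours == 0 or distance / hours > max_speed_kmh):
            flagged.append((prev["id"], curr["id"]))
    return flagged


START = datetime(2022, 1, 1, 12, 0, 0)
SAMPLE_TRANSACTIONS = [
    {"id": "t1", "time": START, "lat": -33.87, "lon": 151.21},                        # Sydney
    {"id": "t2", "time": START + timedelta(seconds=2), "lat": 19.08, "lon": 72.88},   # Mumbai
    {"id": "t3", "time": START + timedelta(seconds=4), "lat": 55.95, "lon": -3.19},   # Edinburgh
]

if __name__ == "__main__":
    print(impossible_travel(SAMPLE_TRANSACTIONS))
```

A learned model earns its keep when the patterns are subtler than a fixed speed limit, for example when legitimate customers use VPNs or shared accounts.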


So are there AI services, quote unquote, that you can, I guess, use more for security 


Mike Chambers: use cases. If nothing else has come out of this podcast, the fact that you’re going to say from now on AI services, quote unquote that’s perfect. But yes, in [00:39:00] answer to your question. 


Yes. So you have a fraud detection service. There is actually a fraud detection service as part of those pre-trained models, where you can supply your data and it will analyze that for fraud. Also, generally, the broad category is anomaly detection. So you’re looking for anomalies in something. 


So you’ve got a bunch of services called Lookout for... fantastic naming. So you’ve got Lookout for Vision, where you can basically feed in images. And one of the examples that they’ll give with this is feeding in images off a production line. And if one of the products coming down the production line has got a fault in it, or a scratch on it, or it doesn’t look like the rest, then it’ll get flagged. 


It’s just looking at something that looks out of place. You’ve also got Lookout for Metrics. So if you’ve got numbers coming out of something, which could be latency times coming out of a load balancer, for example, then it can detect that anomaly. Lots of things. And I just want to point this out as well. 


If anyone is sort of sitting there thinking, but I could do that, like, I, as a [00:40:00] person, could look at metrics and find that: yes, you can. But what we’ve done here is we’re able to teach the algorithm, or rather we’re able to produce a model, let’s get the terminology right, to do this in an automated way, at a scale that humans can’t do. Can I teach you to recognize the difference between a dog and a cat? 


You probably already can. You’re very good. But we can also teach an algorithm to do it, a model to do it, and get that done at a scale that we can’t do, because we are flesh and we like to go to sleep and the rest of it. So that’s kind of where machine learning comes in. It’s not at the point yet where it’s going to take over the world and start a robot revolution. 


It’ll get there. At the moment, what it’s mostly about is automating things and making things happen at scale so that we can derive that value out of the data that we’ve got. 
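The load balancer latency example can be illustrated with a very small statistical detector. This sketch uses the modified z-score based on the median and the median absolute deviation, a standard robust technique chosen here for simplicity; it is not a description of how Lookout for Metrics works internally. The latency values and the 3.5 cutoff are example assumptions.

```python
import statistics


def find_anomalies(values, threshold=3.5):
    """Return indexes whose modified z-score (median/MAD based) exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        # All values are (nearly) identical; nothing can be scored.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


# Made-up latency samples in milliseconds, with one obvious spike.
LATENCIES_MS = [20, 22, 19, 21, 20, 23, 21, 500, 22, 20]

if __name__ == "__main__":
    print(find_anomalies(LATENCIES_MS))
```

A person could spot the 500 ms spike by eye; the point made above is that a detector like this, or a trained model, can watch thousands of such series around the clock.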


Ashish Rajan: Interesting. So maybe another segue into this: in most of the if-then kind of life, there are usually only certain scenarios that are covered by AWS services from a security perspective. 


And you kind of [00:41:00] have to go down the path of using open source tooling for doing some kind of security around it. Now, I don’t know if these kinds of use cases exist in the ML world, but I’m assuming services from AWS are usually enough? Because it sounds like SageMaker already covers the Jupyter part as well, so you kind of have that covered from there. 


So what are some of the open source tools that you hear of in the ML space that people end up using, which fill gaps that are left by AWS? 


Mike Chambers: Well, so ML in general is full of open source, right? And everybody’s using everybody else’s open source software. If people haven’t heard of Jupyter notebooks, by the way, and not everybody has heard of Jupyter notebooks, at some point do yourself a favor and go and find out about them. 


Wiki pages of running code. There you go. They’re awesome. Yeah, you can use them. You don’t have to be a data scientist or machine learning engineer. You can just run Python scripts on your own machine. That’s still awesome. Right? So that’s that. In terms of open source in general, you hear things like TensorFlow, PyTorch, MXNet. 


These are frameworks which are [00:42:00] used for building machine learning projects. They’re all open source and they’re all heavily contributed to by the big players in the marketplace. TensorFlow is basically owned by Google. MXNet is highly contributed to, and basically owned by, Amazon, et cetera, et cetera. 


So the entire industry is powered by open source, generally, for making models. From a security perspective, there probably are some sort of niche things in there from those projects. But broadly speaking, yeah, the services that AWS have, with those security things around them, are fine. Where you would want to add to it, potentially, is if you want to manage things more holistically. 


So things like IAM and CloudWatch and monitoring and GuardDuty and Security Hub, they all work nicely. But sometimes you want to wrap that up into something in a bigger context; then sure, you start to look to other SIEM products. So if you’ve got open-source SIEM products that you want to use, that’s fine, or commercial ones that you want to use, then there’s that element of it. 


And at the fundamental level, I think that the controls are there. 


Ashish Rajan: And to your point, it’s fairly similar to, I [00:43:00] mean, I don’t want to compare it to the if-then world, cause that sounds a bit like comparing apples to potatoes. 


Mike Chambers: I guess we also don’t want to say that if then development is easy or anything, like it’s also, it has its own challenges, right? 


I, that hasn’t been how it sounded, but obviously machine learning is better. Yeah. 


Ashish Rajan: So to your point, then, how does the whole curve look for maturity in this? I’m almost thinking that a lot of us may have had little input into an ML project. Some people may have been involved end to end. 


What are some of the maturity scales for an ML project, out of curiosity? Like, what does level one look like, and what does level, I don’t know, 25 look like, I guess? 


Mike Chambers: Yeah. Sure. Okay. So context with this is really important. And I said I wanted to come back to this, and this is me now coming back to it: machine learning is huge. 


It’s enormous. It is an entire slice of information technology. So when people lift the lid and look at it for the first time, and I’m not just talking about SageMaker, I’m talking about everything, it’s enormous and it overwhelms people. And that’s because you [00:44:00] have people at university who are standing at whiteboards and coming up with new ways of looking at data and algorithms and mathematics that will make my head explode. 


And I don’t even understand what they’re talking about. I don’t even know what the symbol that you just wrote even means, that type of stuff. And then you’ve got all the way down to the way that you build silicon to actually make this stuff work. That literally is all of it. And everybody who works in those different disciplines is needed to do this. 


Right. And I’ve had conversations with people who are professors at university who are wrangling around with algorithms that just make my head hurt, and I go, wow. I just don’t know how you do that. That’s amazing. Thank you for doing that. And they look at me and they’re like, you’re putting algorithms to work on infrastructure at scale. 


And it sounds ridiculous as I’m saying it. It’s like, yes, I’m just using some web services and I’m making it happen. I get it, right, cause it’s what I do. But to them, what I do is amazing, and to me, what they do is amazing. I still think it’s laughable, I’m just using some APIs, but anyway, [00:45:00] everybody is necessary in that chain. 


So in terms of maturity, and what you were talking about in terms of where you start and all that type of stuff: let’s say you’re a regular developer right now. Choose the bit which makes sense to you, the bit that interests you. Choose the bit which aligns to the business problem you’re trying to solve. 


And if that’s using an API service, then just go and use that because it’s already built and you’ll sort of get to see some of the capabilities and maybe you’ll want to move to a different part, but you could probably quite comfortably become an expert in using that and provide value to your business there. 


If you want to go deeper, then take someone’s algorithm that’s already been written. That’s what most people end up doing in machine learning. And there are well understood, well known algorithms for doing classification problems, or image recognition problems, or object detection problems, all that kind of stuff. 


Don’t reinvent the wheel, because it’s been done already, much like you’d use a library of code. It’s okay. It’s not cheating. This is what we do. You take one of those. You might want to [00:46:00] extend it slightly, maybe, but otherwise just train your models from that. If you want to, at some point, then sure. 


You can go and write your own algorithms, but you’ll disappear into a black hole and you’ll probably never emerge, but you can do that too. So there’s all different places here that you can get involved and there’s all different places that you can start. But you’ll need to rely on everybody else. 


Who’s in the chain as well. Does that make sense? Does that 


Ashish Rajan: answer the question? I think it does. To your point, it’s a growing space as well. So we may not have even experienced it all yet, and maybe there’s another layer of maturity over here that we just haven’t reached. 


Mike Chambers: Oh, and will we ever? This very neatly comes back to the usefulness of AI. AI is still a thing that’s still over the horizon. We’re still constantly trying to strive for the next level of innovation. So machine learning has been around since the 1960s; people were putting together things in the 1970s. 


They couldn’t do very much because the computers back then were [00:47:00] woeful compared to what we have now. So imagine, five years from now, with the increase in capacity and the thought that’s being driven through all of the innovation which is happening right now. We are nowhere near finished. We started a very, very long time ago and we have much, much more to do before we call this done. 


Ashish Rajan: Yup. Yup. And that’s a good segue into Phani’s question as well. And my question is: where does one start learning about this space? What’s a good place to start? I would love to show your course as well, but I’ll let you. Sure. 


Mike Chambers: I mean, if only there was a really good course about machine learning. 


Yeah. It looks like I have put a course together about machine learning. Clearly, I’m going to mention that. It’s very much focused on doing machine learning inside of AWS, and also is there to try and help you get to certification level with that as well. So there is that. Outside of that, the real key thing is to pick the place that you’re interested in learning, in that area that I talked about, in the vast quantity. If you’re really interested in maths, then go to a university somewhere and get some books and enjoy yourself. 


[00:48:00] And you have my full respect. If you want to do something more in this sort of MLOps space, which is probably more common and more popular, then there’s lots of resources on YouTube as well. The thing I would suggest that you do is that you choose a project to work on on your own. Have a go at doing something yourself. Find something that you are interested in, outside of the world of technology potentially. 


So if you’re interested in sports or if you’re interested in horticulture or something, then find something where you can get data yourself. That’s data that you understand, and then put a machine learning project together around it. Try and make some predictions about stuff that you understand, because we’ve mentioned domain expertise back towards the beginning of this conversation. 


That’s really important for machine learning projects. So if you’re working on something yourself, then it’s important to understand the data and then work around with it, get some classification now of some data that about. That’s how I started. I I really enjoy image classification stuff because it’s very visual. 


It’s writing [00:49:00] painful and I enjoy making things out of Lego. And so my pet project, which is a continual theme in the background is that I’m working on a Lego brick detection. I’m ultimately building the Lego brick sorting machine. It’s a, it’s an ongoing project. It will probably never be finished because I’m constantly applying the new things that I learn and new things that industry comes up with to the concept of Lego brick detection. 


That’s my thing. What’s your thing. 


Ashish Rajan: Yeah, that’s such a good way to put it. That was also the last, the last of the questions that I had. And I’ve worked, we have few people growing up, people enjoyed it. So very insightful. Thank you. This was pretty awesome by the way. I really appreciate you kind of hanging out with me cause I’m sure there’s a lot more questions coming in. Cause Martinez kinda came back saying as out here in Georgia session and so brilliant session. So where can people find you and connect with you on this? 


Mike Chambers: Yeah, sure. So social media is the best thing to do, of course. So LinkedIn is probably the best place. If you look for Mike Chambers on LinkedIn, that’s probably the best. I am also on Twitter as well, but that’s probably slightly more frivolous. If you are interested in sort of Lego bricks, all the machines and [00:50:00] reverts gives them. 


That’s probably where you find that. So LinkedIn is probably the best place. I’m also on YouTube as well. I should say that I have a YouTube channel, which has got quite a lot of content, which was put together and streamed a lot during re:Invent, with re:Invent announcements. Prior to that, we did an entire series where we actually built a machine learning project from scratch, using basically live stream inputs as to, what do you want to build? 


Okay, let’s build this. And that’s what it was called: let’s build something, I think, is what it was called. And so we went all the way through and built an image classification algorithm and a full-on web app, essentially, by the end of that. All the code’s up on GitHub if you want to follow along with that. Look out for more stuff in that space and more stuff to come. 


And 


Ashish Rajan: I’ll leave a link for that in the show notes as well. I think we’re finally coming from Martinez. I missed getting live fish and very happy mesh get a big plus AWS month. Also Brenda here, my chamber live also. So thank you. Both is is definitely following your content. So I’m sure you’ll hear from my team as soon as well. 


We had for this episode, but thanks everyone for your time. And we’ll talk to him, talk to you in the next week and type of, sort of AWS security, but thanks [00:51:00] Mike. I, hopefully we can get happy again, but really we should hanging 


Mike Chambers: out. That’d be fantastic. It’s been an absolute pleasure. Thanks very much. 


Ashish Rajan: Thanks Mike. Thanks everyone else was about fear by.