Machine learning is transforming many industries, including healthcare. Artificial intelligence is playing a pivotal role in saving lives and improving patient outcomes. And it is easier than you may think to start applying AI models to medical imaging.

We just posted a course on the YouTube channel that will teach you how to build and evaluate medical AI models with TensorFlow.

Dr. Jason Adleberg teaches this course. He is a radiologist in New York City and a skilled programmer, making him the perfect instructor to guide you through this course.

You will use TensorFlow to evaluate chest x-rays.

In this hands-on course, you will learn how to build and evaluate AI models using TensorFlow, one of the most popular and powerful machine learning frameworks. The course is structured into two parts, offering both theoretical knowledge and practical application.

Part 1: Building and Training TensorFlow Models

This section starts with the basics, guiding you step-by-step to build and train a simple yet effective TensorFlow model. You will learn the fundamental concepts of TensorFlow, gain insights into model architecture, and discover various techniques to optimize model performance. Dr. Adleberg's expertise will help you grasp the essentials of medical AI model development.

Part 2: Evaluating Medical AI Models

Once you have mastered model building, you'll learn how to evaluate your models. In this part of the course, you'll explore key metrics like AUC (Area Under the Curve), sensitivity, and specificity. These metrics play a vital role in assessing model accuracy and reliability, particularly in clinical settings.

Here are the sections in the course, covering the two parts above.

  • Getting started with Google Colab
  • Facts about Chest X-Rays
  • Defining a Problem
  • Preparing the Data
  • Training the Model
  • Running the Model
  • Evaluating Performance
  • Stats: Histogram, Sensitivity & Specificity
  • Stats: AUC Curve
  • Saving our Model

Watch the full course on the YouTube channel (1-hour watch).

Course Transcript (autogenerated)

Machine learning is being used to save lives in the medical industry.

In this course, you will learn how to build and evaluate AI models with TensorFlow.

This is a great real-world project for improving your machine learning skills.


Jason Adleberg teaches this course.

He is a radiologist in New York City and also a programmer.

So let's start learning.

Hey everyone, my name is Jason.

I'm a doctor and computer programmer in New York City.

And today we're going to talk about how to build and evaluate medical AI models with TensorFlow.

This tutorial today will have two parts.

The first part is going to be building and training a really simple TensorFlow model.

And the second part is going to be going through the statistics, the evaluation of our model, and we'll talk about metrics like AUC, sensitivity, specificity.

This stuff will be really useful, especially if we're interested in deploying this in the clinical space.

I wanted to give a big shout out to Dr. Walter Wiggins for inspiring this tutorial.

Here's his Twitter and here's mine.

And with that, let's get started.

All right, so today we're going to be using Google Colab.

Google Colab is a really cool website through which you can run different parts of Python code.

If you've never used it before, it's relatively easy.

We'll start by clicking connect up here.

And then basically in this tutorial, all these different blocks of code here, these are known as cells, and we can click the play button here to the left of it to get everything up and running.

This first cell will just be us downloading a whole bunch of things, so it's a good one to get started with.

All right, so today we're going to be working with chest x-rays.

And chest x-rays are the most common imaging study performed in hospitals, in emergency departments, in outpatient settings.

It's usually one of the very first things that doctors want to know about if you're not feeling well.

Now, there are a number of things and structures you can see on a chest x-ray, so let's just go over them real quickly.

Here's a normal chest x-ray.

You can see the lungs.

You can see the heart here in the middle.

You can see the aorta coming off of the heart and supplying blood to the rest of the body.

You can see a number of skeletal structures.

So, for example, here's your collarbone or your clavicle.

You can see all the ribs here on both sides.

You can see the vertebra here in the middle.

This line here is called your diaphragm.

It goes around like that.

And this separates your chest, your thorax, from your abdomen.

Underneath this line is your liver.

Your spleen is over here.

And this little kind of air bubble sitting in here, this is your stomach, which in this case just has a little bit of air in it.

And again, this is a normal chest x-ray.

Today, we're going to be using a pretty big data set of chest x-rays from the NIH.

And this is an open source data set of a few thousand chest x-rays, which happens to have eight different labels.

So, here's the eight different labels available to us in this data set.

And these represent eight pretty commonly seen things on x-rays.

This is not everything that you can see that can go wrong in a chest x-ray, but this is some of the more common things in the world.

And let's just go through them real quickly.

Here we have atelectasis.

This is when a little piece of the lung kind of deflates a little bit.

So, that is up here.

Basically, this wedge-shaped thing up here in the right upper lobe.

Here we have cardiomegaly.

Cardiomegaly is when you have a really big heart.

Specifically, it's when like this length of the heart is more than 50% the length of rib-to-rib.

And this is a sign of heart disease.
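The 50% rule of thumb described above can be written as a small check. This is only a sketch: the function name, the argument names, and the use of pixels as the unit are assumptions made for illustration, not code from the course.

```python
def is_cardiomegaly(heart_width, chest_width, threshold=0.5):
    """Return True when the heart is wider than `threshold` of the chest.

    Both widths must use the same unit (for example, pixels on the x-ray).
    """
    if chest_width <= 0:
        raise ValueError("chest_width must be positive")
    return heart_width / chest_width > threshold


# Hypothetical measurements, in pixels.
print(is_cardiomegaly(heart_width=160, chest_width=280))  # ratio is about 0.57
print(is_cardiomegaly(heart_width=120, chest_width=280))  # ratio is about 0.43
```

The model built later never sees these measurements directly. It has to learn the same idea from labeled images.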

Here we have a pleural effusion.

That's when you have something that's basically sitting right outside the lung in what's called the pleural space, which is not really supposed to be full of anything.

This one here, this is an infiltrate.

And that's when you have something sitting inside the alveoli of the lungs.

It's not supposed to be there.

There's a few different things that can do that, but most of the time this means that you have a pneumonia.

In this x-ray, we have a mass.

It's this rounded density up here in the left upper lobe apex or on top of the lung.

A mass is something that's more than three centimeters in diameter.

And with the mass, you know, again, it depends on the context, but generally we're sort of worried here about some sort of tumor.

This is a nodule, and a nodule is a mass that's smaller than three centimeters.

So this is a little bit harder to see.

Here's pneumonia.

A pneumonia is an infection inside of your alveoli.

Again, this is not like mutually exclusive with infiltrate.

And then finally, last but not least, this is a pneumothorax.

And you know how in this one we said you can sometimes have some fluid sitting in the pleural space.

Well, here you have air in the pleural space.

And this is also known as a collapsed lung.

I bring all this up because some of these conditions are a little bit easier to see and some are a little bit harder to see.

For example, you can bet that a mass, which again is more than three centimeters, will be easier for us to see than a nodule.

And so it should be easier for a computer to see this than to see this, to see a nodule.

You know, basically how good our model is is going to depend partly on the technology that we're using, but partly on the data itself, generally speaking.

You know, if you have more data, that's better.

If you have a higher diversity of data, like different types of nodules, et cetera, the model will work a little bit better.

But the quality of the labels really matters as well.

So now that we've looked at the eight different findings, the eight labels available to us in our data set, let's choose one for the AI model.

So I'm going to choose cardiomegaly.

And let's take a look at the data to actually see kind of a little bit more about the different images that we have.

So I'm going to click here on the folder button and click here on the medical AI folder and here on the images.

And here's just a random one we'll open up.

You can see that this is an example of a chest x-ray up here.

Also in the data that we downloaded, so there's all the images there.

And then here is a giant spreadsheet of all of the images that we have available to us with all the labels on them.

So for example, here are, you know, 100 or so images with the labels right there in that column.

I'm going to grab the path for that file.

And then put that in here.

And then here's just another way that we can make sure that Python, that our Google Colab, can see everything.

This is showing just the first five rows which happen to be atelectasis.

Okay, next I'm going to look for the rows of this column where the label equals our finding.

And then we'll do the same thing for where the finding is no finding at all.

These are going to serve as negatives to us.

We want to make sure that we have enough images to train a model with here.

So for this line here, I'm just going to go ahead and make sure that we have enough examples of positive cases.

So this is showing us that we have 146 examples of cardiomegaly that we can use today.

That's pretty good.
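The filtering step described above (keep the rows whose label matches the finding, then the rows labeled as no finding) might look roughly like the sketch below. The course works with a data frame in Colab; this version uses Python's built-in csv module so it runs anywhere. The column names, file names, and the tiny sample spreadsheet are all made up for illustration.

```python
import csv
import io

# Hypothetical stand-in for the course's label spreadsheet.
SAMPLE_CSV = """filename,label
00001.png,Atelectasis
00002.png,Cardiomegaly
00003.png,No Finding
00004.png,Cardiomegaly
00005.png,No Finding
00006.png,Mass
"""

FINDING = "Cardiomegaly"


def rows_with_label(rows, label):
    """Keep only the rows whose label column equals `label`."""
    return [row for row in rows if row["label"] == label]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
positives = rows_with_label(rows, FINDING)
negatives = rows_with_label(rows, "No Finding")

print(len(positives), "positive cases")
print(len(negatives), "negative cases")
```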

Now one concept for building AI models is that you want to separate the data into a training set and a testing set.

Usually you do about 80% for training and 20% for testing.

What does that mean? So, you know, the AI model is going to get taught, is going to learn what's what from our training data set.

But in order to tell if it's working or not, we want to show it images that it's never seen before.

And that's going to be our testing data set, which is the other 20% of all those images.

So I'm just going to manually define that right now, just like this.

And then I'm going to go ahead and spell out exactly the number of images that we're going to use for our training data set.

So I'm just going to say that that number is 80% of what we have to work with today.

And the same thing for our test data set.

Let's print those out just to make sure that they make sense.

And as you can see, we'll use 116 positive cases for our training data set and 29 positive cases for our testing data set.

Here, you can see we have quite a lot, we have a lot of negative examples.

So our limiting factor today is going to be the number of positive cases.

For this example, today, we're going to do a 50-50 split where we want to have an equal number of positive and negative cases for our training and for our testing.

So we're just going to spell that out right here.

Right here, I'm just going to put together the rows that are going to go in our training data set.

That will look like this.

And the same thing for our testing data set.


Okay, great.
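As a rough sketch of the split described above: take 80% of the positive cases for training and 20% for testing, then pair each set with the same number of negative cases. The function names are hypothetical, and the exact slicing used in the course notebook may differ.

```python
TRAIN_FRACTION = 0.8
TEST_FRACTION = 0.2


def split_counts(n_positive):
    """Number of positive cases to use for training and for testing."""
    return int(n_positive * TRAIN_FRACTION), int(n_positive * TEST_FRACTION)


def balanced_split(positives, negatives):
    """Pick equal numbers of positive and negative rows for each set."""
    n_train, n_test = split_counts(len(positives))
    train = positives[:n_train] + negatives[:n_train]
    test = (positives[n_train:n_train + n_test]
            + negatives[n_train:n_train + n_test])
    return train, test


print(split_counts(146))  # the transcript reports 116 and 29
```

Because positive cases are the limiting factor, the negative rows are simply truncated to match. Most of the negatives go unused.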

All right, so now that we know how many images we have to work with, now we'll just move them to different folders.

And we'll see a few examples of the images with Python to make sure we're ready to go.

So now we're just going to make some directories.

Here is our root directory with all the images in it.

And then this is how I'm just choosing to make some new directories.

So we'll make one like this for our finding.

We'll make one for our test data set.

We'll make one for our train data set.

And then we're going to do the same thing, but just make negative folders instead of the positive folders too.

And negative just means that there's no finding at all.

Like that.
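One way to create the folder layout described above is sketched below. The folder names follow the transcript (train and test, each with positive and negative subfolders), but the exact paths in the course notebook are not shown, so treat the layout and the function name as assumptions.

```python
import tempfile
from pathlib import Path

FINDING = "cardiomegaly"


def make_split_dirs(root, finding):
    """Create train/test folders, each with positive and negative subfolders."""
    created = []
    for split in ("train", "test"):
        for label in ("positive", "negative"):
            folder = Path(root) / finding / split / label
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created


# A temporary directory stands in for the notebook's images folder.
with tempfile.TemporaryDirectory() as root:
    for folder in make_split_dirs(root, FINDING):
        print(folder.relative_to(root))
```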

Now we're going to go ahead and just move those files over.

And we're going to do this basically by iterating through like the rows in our data frame that we care about.

So we pay a little bit of attention to exactly how many images we have to move over.

But here's where, for example, the training positive examples happen to be located.

They are like this, as it's defined in our CSV file.

Here's where we're going to move it to.

And this is, again, the way that we defined it in the code block above.

And then this line here, this actually does the moving.

We'll try it real quick just to make sure it works.

If it does, we can double check by clicking here.

We're going into images.

Waiting a minute.

Going to cardiomegaly.

And you can see that now we've made some test folders, some train folders, and this should be a whole bunch of the images here, right? Now I'm just going to copy and paste this.

And instead of doing the positives in our training data set, I'm going to do the positives in our testing data set.

Because our training data set is 80% of all of this, this is more or less like the 80th percent row of all of the positive things that we have.

And then again, we're going to move this to the testing part.

Just going to tweak that right there.

And then we're going to do the same thing for our negatives.

So now I'm just going to again copy and paste those lines.

I'm just going to change positives to negatives.

Positive to negative.

Here and here.
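The move step above can be sketched as a single helper that is called once per folder, instead of copying and pasting the loop four times. The helper name and the placeholder file names are assumptions, and the empty files below only stand in for real images.

```python
import shutil
import tempfile
from pathlib import Path


def move_images(filenames, source_dir, target_dir):
    """Move each named file from source_dir into target_dir."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    for name in filenames:
        shutil.move(str(Path(source_dir) / name), str(target_dir / name))


with tempfile.TemporaryDirectory() as root:
    images = Path(root) / "images"
    images.mkdir()
    names = [f"{i:05d}.png" for i in range(5)]
    for name in names:
        (images / name).touch()  # empty placeholder files, not real x-rays

    # First 80% of the rows go to training, the rest to testing.
    n_train = int(len(names) * 0.8)
    move_images(names[:n_train], images, images / "train" / "positive")
    move_images(names[n_train:], images, images / "test" / "positive")

    print(sorted(p.name for p in (images / "train" / "positive").iterdir()))
    print(sorted(p.name for p in (images / "test" / "positive").iterdir()))
```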


I'll go ahead and just copy this here.


Now that we've moved everything over, let's just use Python to show everything directly in this notebook, just to make sure that what we're doing makes sense and is appropriate.

So I'm going to define two arrays.

And then this is something that we declared in the first block of code when we started today.

But let's just like, for emphasis, describe that.

Right here, we're just going to show like smaller versions of the pictures that we're loading up.

And then we'll just show like six examples.

And this is the way that we're going to load it into Python itself.

Image.open(image_path).resize((image_width, image_height)), like that, and then positive_images.append.

And this is basically like a helper function that we declared up at the very top again.

Image just like that.

And I'll be sure that I spelled it correctly.

Just going to do the same thing again for the negative images.

So again, those are the images that have no findings at all.

Like that.

Now that we've actually loaded everything, we'll go ahead and actually just like show everything with matplotlib.

And so we're going to go ahead and show just six images.

So these will be our six images that have cardiomegaly.

That will go there.

And then we're gonna do the same thing.

I'm copy and pasting this for negative images.

So I'm just going to tweak that.

I'm going to tweak this.

And we're not using, again, like the whole point with the negative images is to show ones that have no finding at all.

So let's go ahead and click play.

Here you can see that it's showing us six examples of cardiomegaly from like the folder that it's created.

And to me, these all look like definite cases of cardiomegaly.

You can see that the heart here is for sure it's enlarged.

It's more than half of the chest.

And here these look like cases with no findings.

So here the heart is normal size and there's not really much else going on as well.

Okay, so now we have visualized our data.

We've moved everything around and we know that we want to build an AI model for cardiomegaly.

In this part, we're going to actually build the TensorFlow model.

Now there are lots of different ways to approach model building and we can spend an hour on this topic alone.

But basically here, I want to talk about two different concepts for this part.

So the first is called transfer learning.

And the second concept is called data augmentation.

We're going to use both of those things today.

All right, so this first line here, this is going to have us load our model directly through TensorFlow itself.

And we're just going to define the size of the image that it's going to be working with, which we kind of talked about earlier, but these are going to be kind of smaller, basically scaled down versions of our chest x-ray.

This three here is saying that basically it's a three channel image, so like a color image, right, like red, green, and blue.

Now it's true that, you know, our chest x-rays are actually just all shades of gray, but we're just going to put that in there because it's a little bit easier.

And then this one here, include top equals false.

We'll come back to this, but this is basically going to let us customize our model to do what we want to do here.

This should be weights plural.

And now we're using again the ImageNet network here.

But what we're basically going to do is, on top of all this, for the last layer of our model, we're going to have it spit out whether it thinks there's cardiomegaly or not.

So we're going to just manually define that here.

This is basically talking about exactly what type of output we want our model to put out.

So this is just getting that last layer, and here we're going to go ahead and define this by saying that we want to say either basically yes or no, positive or negative.

So this is the way that we can do that.

This is just basically some stuff to, again, help us with our big task, which is saying yes or no, cardiomegaly or not cardiomegaly.

All these decisions here, we could talk about this for such a long time, but I'm going to skip over exactly some of the specifics here.

And if there's something to put an x in there, right.

Okay, as for the model, so now we have this model with a slightly customized last layer.

And now we're just going to basically go ahead and compile it.

Here we're just going to say that we're interested in the accuracy, basically, so how accurate is our model in between, saying whether something is positive or negative, and that's sort of the metric that we're going to use to help us figure out if our model is working or not.

Oh, and then one other thing is that we should put an equal sign right in there, and that would help us as well.

Okay, so now that this is all done, now we're going to go ahead and just define a bunch of things that are basically just kind of helping point our model to the right information.

So if you remember from earlier, all of our images are hiding out right there.

The directory where everything is located for the imaging for the training stuff is like this.

For testing, it's just about the same, but like this, change this here.

And then this kind of keeps going.

So the way we have our data structured is that we have like sub-folders in our training directory that are called positive, one called negative, which we're going to put in right here.

That looks like this.

Like this.

And then the same thing also exists just for our test data.

So I'm going to make these changes here, here, here, and there.

I'll click play right there.

Okay, so now the next concept I want to talk about is something called data augmentation.

So as we discussed earlier, we have a bunch of images available to us with cardiomegaly, but it's not like a ton.

There are some ways that we can basically kind of cleverly create more training data for ourselves, and that's called data augmentation.

So what we're going to use here is basically a really cool concept called an image data generator.

And this is basically something that's going to look at all the images that we have and kind of tweak them a little bit so that we're kind of generating like more data from the data we already have.

Now there's a lot of different ways that you can augment data, that you can kind of tweak data around, but these are the ones we'll just happen to use today.

And, you know, if you're playing around with this sort of on your own, you're welcome to, you know, kind of experiment with different ideas here.

So what are we doing? So first of all, we're just kind of, all these methods here basically are generating extra images that are slightly tweaked from our images.

So specifically, this is going to generate images that are slightly rotated one way or the other, that are slightly shifted, like stretched out, that are sheared, and that are zoomed in as well.

One thing that's important with data augmentation is that it's not just going to be zoomed in.

One thing that's important with augmentation for medical data is that, you know, there's kind of some different thoughts on this, but when it comes to flipping images, a chest x-ray is never really going to be flipped upside down, right? When you take a chest x-ray, you're always right side up. Your stomach's always going to be below your lungs.

So if we were to flip the data upside down, that wouldn't necessarily really be helpful for our AI model.

In regard to horizontal flipping, you know, I think there's not really like a consensus on this when it comes to medical imaging or specifically chest x-rays.

It's true that our bodies are not completely symmetrical, even though we like to pretend they are, so your heart's a little bit more on your left side.

There is a condition where your heart can be more on the right side, but it's not super common.

So I'm just going to say that we're not going to flip things horizontally so that we don't kind of confuse our model, but that's a choice that you could make if you wanted to.

For our test data, we definitely don't want to augment, don't want to mess with our test data at all.

So this is just a line where, you know, we are creating an image generator, but all that we're doing here is just like redefining the numbers that are inside of our, of each pixel, basically.

So this is, it's not really actually doing anything, right? This is basically saying that instead of each pixel going from like zero to 255, it's going to be between zero and one.
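The augmentation choices discussed above can be summarized as two sets of keyword arguments in the style of Keras' ImageDataGenerator. The argument names are the real Keras ones, but every numeric value here is an assumed placeholder, since the transcript does not state the values used in the course. The sketch is kept free of TensorFlow so it runs anywhere. In a notebook, the dictionaries would be unpacked into the generator, for example ImageDataGenerator(**TRAIN_AUGMENTATION).

```python
# Keyword arguments in the style of Keras' ImageDataGenerator.
# The argument names are real; every numeric value is an assumed placeholder.
TRAIN_AUGMENTATION = {
    "rescale": 1.0 / 255,        # map pixel values from 0..255 to 0..1
    "rotation_range": 10,        # rotate by up to 10 degrees either way
    "width_shift_range": 0.1,    # shift left or right by up to 10% of the width
    "height_shift_range": 0.1,   # shift up or down by up to 10% of the height
    "shear_range": 0.1,          # shear intensity
    "zoom_range": 0.1,           # zoom in or out by up to 10%
    "horizontal_flip": False,    # hearts sit slightly to the left; keep that
    "vertical_flip": False,      # chest x-rays are taken right side up
}

# Test images are only rescaled, never augmented.
TEST_PREPROCESSING = {"rescale": 1.0 / 255}


def rescale_pixel(value, rescale=1.0 / 255):
    """Apply the same rescaling to a single 0..255 pixel value."""
    return value * rescale


print(rescale_pixel(0), rescale_pixel(255))
```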

All right, let's go ahead and click there.

That works.

So that's image augmentation.

And now this here is something that we need to use in order to get our model to train, called a train and test generator.

And this is basically the way that we do it here.

So from directory, which we already have, target size and then we're just going to do the same thing for testing right below.

And then, you know, of course, we just need to kind of tweak some of these things here for this particular part.

We'll go ahead and just define the number of steps that we're going to use for this, which is basically because we have a batch size of one.

It happens to be the same, it happens to be just all the images that we're using.

So, I mean, this is one way of doing it, like the way that our data is structured.

So again, train steps equals the length of that folder times two.

And then for testing, we're going to go ahead and say that it's like this.

And then we actually typed this in wrong here.

So let's go ahead and just fix this real quick.

That goes like that.

And then this goes like this.

And you can see here that when we click, so it's found 232 images for the training part and 60 images for the testing part.

And again, that is pretty close to 80% and 20% split.
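The step count mentioned above follows from simple arithmetic: with a batch size of one, the number of steps per epoch equals the number of images. A minimal sketch, with a hypothetical helper name:

```python
import math


def steps_per_epoch(n_images, batch_size):
    """Batches needed for one full pass over n_images."""
    return math.ceil(n_images / batch_size)


BATCH_SIZE = 1
N_TRAIN_POSITIVE = 116  # positive training cases reported in the transcript

# Equal numbers of positive and negative images, so the total is doubled.
train_steps = steps_per_epoch(N_TRAIN_POSITIVE * 2, BATCH_SIZE)
print(train_steps)  # 232, matching the image count the generator found
```

With a larger batch size, the same helper gives fewer steps per epoch, and rounding up keeps the last partial batch.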

All right, so we have everything set up.

We have our model ready to go.

We have our data ready to go.

This is the cool part.

Now we're finally going to run our model and let it train.

So in order to do that, this is the method that we're going to use.

We're going to use what's called

We're going to point it to the train generator that we talked about earlier.

We're just going to mention exactly the number of steps per epoch.

And epoch is basically one scan through all the data.

So in this particular example, we're just going to ask it to look at all the pictures 20 times in order to do the training.

And then we're also pointing it to the validation set.

That's our test set, the 20% that we sectioned off that it has not seen before.

And when we click play, you can see that I spelled epochs wrong.

And then finally, it's going to actually start to do its thing.

Now, the way that we formatted it for this tutorial today, it shouldn't really take too long to go through all the data.

But this will take about a few minutes or so.

So we're just going to click play and then basically step back.

We'll come back in a few minutes.

All right, so we're back.

It's been three minutes of training, which is really not that much time at all.

But that's OK.

That's enough for us to understand some of the concepts that we're working with today.

So now what we're going to do is we're going to see exactly how good of a job it did over time.

So basically, we're just going to plot how the accuracy changed.

And this is basically the accuracy for the training set, which should go up over time.

And then this is the accuracy for the validation set, the testing data set, which is basically the accuracy for the images that, you know, it hasn't seen before.

So we hope that this goes up.

But, you know, this is why we had to keep it separate, because this will actually show us like if it's doing a good job or not.

I do the same thing for loss.

And basically, loss is sort of another kind of abstract way of thinking about how good a job it's doing.

So loss is basically, you know, you want your accuracy to be going up over time, and you want your loss to be going down over time.

And loss is basically sort of a mathematical way of talking about, you know, the way that the network should look and the way that the network does look as it currently is.
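To make the idea of loss a little more concrete, here is a sketch of binary cross-entropy, a common loss for yes-or-no models. The transcript does not say which loss the course uses, so this choice is an assumption. The labels and scores are made up.

```python
import math


def binary_cross_entropy(labels, predictions, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted scores."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]
unsure = [0.5, 0.5, 0.5, 0.5]

print(binary_cross_entropy(labels, confident_and_right))
print(binary_cross_entropy(labels, unsure))
```

The confident, correct predictions give the lower value, which is why loss should fall as training improves the model.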

So I'm just going to plot using matplotlib, some different stuff related to the loss and related to the accuracy as it changes over time.

We'll do the same thing here, but for the validation data set.

And again, this is going to be the training and test accuracy as it changes over time.

I'll just put this in here like this.

Let's click play.

And this is actually pretty encouraging.

So, you know, as this thing ran for a few minutes, only after a few minutes, it started to figure out what was what, what was cardiomegaly and what wasn't.

So the training data set, we expect that to go up.

We expect that to get more accurate over time.

But the testing data set we hope gets more accurate.

And as you can see, it did.

So when it first started, it had like a 50-50 chance.

But after just a few minutes of training, it got to be roughly 70% or so accurate.

And, you know, that's better than flipping a coin.

I'm also going to go ahead and just do the same thing for loss, which again, that's another way of thinking about how good a job our model is doing.

And so instead of accuracy, we're just going to plot loss.

And I'm just going to like change this here so that we have a slightly better idea of what's going on.

So here's our loss.

And as you can see, this goes down quite a lot over time.

We could still, you know, we could zoom in on this to see a little bit more, but that's kind of the basic idea here.

All right, so now that we've trained a model, let's see how well it does.

And this section will basically have two parts.

So first, we'll just play around with a few images to see like what the model thinks.

And then we'll systematically look through all of the images with statistics, thinking about things like sensitivity, specificity, and AUC, or area under the curve.

You know, I think that kind of what distinguishes applications of AI in medicine is attention to all these details, because I think that, you know, these metrics are really important when it comes to thinking about whether, you know, this is really something that's going to ultimately help people.

All right, so let's start off with just like two different helper methods.

This first one here is just going to be a little helper method to load up an image like this.

We're going to resize it to fit the size that our model requires, which looks like that.

And we have to do an actual parentheses here.

From here, it's going to load it into a numpy array.

This is something we have to use just again to convert it from like zero to 255.

That is, each pixel is going to go from a value between zero and 255 to a value between zero and one.

This is just making sure that the array has three different values for like red, green, and blue.

Even though it's true, we aren't really using that, because x-rays are usually, hopefully, all in gray.

Okay, and then this line here.

This is what's going to return.

So this line model.predict.

This is how we're going to interact with our model to actually return a value between zero and one, where zero means that the model thinks there's no finding, and one means it thinks there is a finding, which here is cardiomegaly.

We're going to do another helper method here, and this is sort of like a sanity check for us.

So what this is ultimately going to do is this is going to show us an image.

It's going to show us like the actual file path that's associated with it.

Let's go ahead and load this up here.

And then, because our model returns us a value between zero and one, what we're going to do is we're going to define some cutoff point, which here I'm going to say is 0.5, where if the prediction is above 0.5, then the model's going to think that it's positive, right? So in this case, the model's going to think that it has cardiomegaly.
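The cutoff rule described above is small enough to sketch directly. The function name and the label strings are hypothetical. The 0.5 cutoff, and the choice to treat only scores above it as positive, come from the transcript.

```python
CUTOFF = 0.5


def guess_from_score(score, cutoff=CUTOFF):
    """Turn a model score between 0 and 1 into a positive/negative guess."""
    return "positive" if score > cutoff else "negative"


for score in (0.12, 0.5, 0.87):
    print(score, guess_from_score(score))
```

Moving the cutoff up or down trades missed cases against false alarms, which is what the sensitivity and specificity metrics measure.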

And this will make a little bit more sense in a minute once we actually like use this method, but this is just basically showing us all of the different information about the image as we're doing the prediction on it.

So this is the way that I'm going to just access all of that information.

Guess, plus, score, okay.

And then once we're down here, this will basically just use matplotlib to go ahead and get all of that up and running.

Let's go ahead and click play there.

Let's make sure that we added all of the plus signs.

And again, you know, one thing that's really important is that we want to be systematic in the way that we evaluate all of our data.

So what we're going to do here is we're going to iterate through all of the pictures, all the images in our test set, and basically try to just systematically, like in an organized way, figure out what it thought about everything.

This is going to be really important when it comes to getting things like sensitivity and specificity, which I'll come back to in one second.

But basically what we're going to do here is we're going to go through all of the negative images, so all the images in which there is no finding.

It was labeled, that is, as no finding.

And we're just going to do predictions on every single one of them.

So we have this array called results array.

And basically what we're going to do is we're just going to create this kind of like results array with all of the information that we care about.

So each row in the results array is going to be the file name, like the image itself, whether or not it had a label of being positive or negative.

These are all the negative images here, so it's going to have a negative label.

The guess, so what it happened to think.

And then the confidence, which again is that number between 0 and 1, where we're using like a 0.5 cutoff for positive versus negative, or cardiomegaly versus not cardiomegaly.

That first part was looking at all the negative images.

And now we're going to do the same thing with all the positive images.

I'm just going to change this down here, because again, these are like positive labels.

So at this point, we'll have some array with a whole bunch of stuff in it, with a whole bunch of predictions for all of the positive and negative test images.

What I'm going to do here is I'm just going to sort this array on basically that last column, which is like the confidence column.

So basically it's going to show us in order what it really thought was cardiomegaly, and then it's going to show us what it really thought was not cardiomegaly, what was no finding at all.
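As a rough sketch of the loop just described, the results array can be built and sorted like this. The file names, the `fake_predict` stand-in for the model, and the `build_results` helper are all hypothetical names made up for illustration; the notebook in the video uses its own prediction method and real test images.

```python
# Hypothetical stand-in for the trained model: returns a confidence between 0 and 1.
def fake_predict(file_name):
    scores = {
        "neg_01.png": 0.12,
        "neg_02.png": 0.61,
        "pos_01.png": 0.93,
        "pos_02.png": 0.44,
    }
    return scores[file_name]


def build_results(negative_files, positive_files, predict, cutoff=0.5):
    """Return rows of (file name, label, guess, confidence), most confident first."""
    results = []
    for label, files in (("negative", negative_files), ("positive", positive_files)):
        for file_name in files:
            confidence = predict(file_name)
            # Above the cutoff counts as a cardiomegaly guess.
            guess = "positive" if confidence > cutoff else "negative"
            results.append((file_name, label, guess, confidence))
    # Sort on the last column (confidence), highest first.
    results.sort(key=lambda row: row[-1], reverse=True)
    return results


results = build_results(
    ["neg_01.png", "neg_02.png"],
    ["pos_01.png", "pos_02.png"],
    fake_predict,
)
for row in results:
    print(row)
```

The rows at the top are what the model most strongly thought was cardiomegaly, and the rows at the bottom are what it most strongly thought was no finding.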

We're going to create a data frame from all these results here.

And then so that we can be kind of organized, we are going to create an actual list of column names too, that would be helpful.

File path, file name, label, guess, confidence.

And then once we click this here, it's going to go through and basically just make a guess on all the different images in our test set.

Okay, let's scroll.

Again, this array that we created, or this data frame that we created, let's just take a sneak peek to make sure that this makes sense.


And here's the first five rows, right? So here's just five images in our data set where these are the five images that our model thought had the most cardiomegaly.

All right, so that helper method that we did earlier, where it was like a sanity check to see if our model was actually working, let's go ahead and actually make a call to that.

So first, we're just going to grab a random number, like from the test set.

And then we're going to grab a random row from our data frame, which is like all the predictions on the test set.

And this is that helper method that we used earlier, right? So here's a random picture.

You can see that it was labeled as having cardiomegaly, and the model guessed there was cardiomegaly.

Here's an example where this was labeled as not having cardiomegaly.

I agree, I think that's a normal heart.

And our model also thought there was no cardiomegaly.

Here's an example where it was labeled as having cardiomegaly, and our model got it wrong.

Here's an example where the model said that there was cardiomegaly, and it was labeled also as cardiomegaly.

And just like checking it out, I think that's a pretty big heart, so I have to agree with that.

We can click this button a whole bunch of different times to see whether or not it was accurate or not.

We can also use some numbers, like again, sensitivity and specificity, AUC, to help us also get a better feel for if it's doing a good job or not.

So we'll get to that in one second, but let's just go ahead and show the entire data frame right now.

Or rather, let's show every fifth row in the data frame, just to get another feel for what it had.

So this is going to show the name of the file, whether or not it was labeled as positive or negative, what the model guessed, and basically the confidence that it had.

So just looking at this really quick kind of overview, it looks like it did a pretty decent job, but some of the ones here in the middle I wasn't so sure about.

So in other words, you know, when it was really confident that there was cardiomegaly, it got it right.

When it was really confident that there was not cardiomegaly, it got it right.

But some of these ones in the middle, it kind of was a little bit iffy.

So now I want to show the same information in a histogram format, and this is good because this is going to kind of start to get us to think about some of those statistics that we mentioned earlier.

So this is the way that I'm going to build the histogram.

I'm going to grab the same thing here for the negative labels.

And then this is going to use the matplotlib histogram function.

That was kind of a mouthful there, but just some more stuff setting up this histogram chart.

X-axis, title, confidence, scores for different images.

And make a legend.


And here you can start to see exactly how good of a job it did.

We'll go ahead and just scoot this legend a little bit out of the way.

And this is a little bit better, right? So you can see that when the model was pretty confident, it got it right on both sides.

So every example close to a one, it got it right.

Every example close to a zero, it got it right.

But in the middle, it was kind of iffy.
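To make the idea behind that histogram concrete without any plotting library, here is a minimal sketch that counts how many confidence scores fall into each bin, separately for the negative and positive labels. The scores and the `bin_counts` helper are made up for illustration; the video draws the real chart with matplotlib.

```python
def bin_counts(confidences, num_bins=5):
    """Count how many confidence scores (0 to 1) land in each equal-width bin."""
    counts = [0] * num_bins
    for confidence in confidences:
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(confidence * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


# Made-up confidence scores, grouped by the label each image came with.
negative_scores = [0.05, 0.12, 0.48, 0.61]
positive_scores = [0.44, 0.55, 0.91, 0.97]

neg_counts = bin_counts(negative_scores)
pos_counts = bin_counts(positive_scores)

# Print a tiny text histogram: one row per bin, one column per label.
for i, (neg, pos) in enumerate(zip(neg_counts, pos_counts)):
    low, high = i / 5, (i + 1) / 5
    print(f"{low:.1f}-{high:.1f}  negative: {'#' * neg:<4} positive: {'#' * pos}")
```

In this toy data the two ends are clean and the two labels only overlap in the middle bins, which is the same pattern as in the chart.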

Overall, this is actually pretty decent, though, in my opinion.

So now let's go ahead and look at a confidence score.

So now let's think about, you know, whether...

So earlier we had mentioned that, you know, again, the model was returning a value between zero and one.

And we were using like 50% or 0.5 as the cutoff.

So if it was above 0.5, we would say that it was positive, that the model thought that there was cardiomegaly.

And if the model returned a number between zero and 0.5, we would think that it had no finding.

Let's see if maybe that was the best number for it to use.

So here we're going to create a helper function called createWithCutoff that basically is going to kind of just redraw our histogram, but now we're going to talk about false negatives, false positives, true negatives, and true positives.

So this line here, this is saying, let's find everything where it was labeled as positive.

And the model thought that it was positive, meaning that the confidence value was more than the cutoff.

This is going to give us an array of like exactly all of the confidence values that are above our cutoff.

Let's do the same thing with false positive, true negative, and false negative.

So for false positive, that means that the real label was actually negative, but we thought it was positive.

For true negative, that means that the real label was negative, and our model said that it was lower than the cutoff.

And I'm going to go ahead and just fix this right here before I forget.

And then again for false negatives, finally, that's going to mean that, you know, this was labeled as positive, but the model accidentally thought it was less than our cutoff value.
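Those four categories can be sketched in a few lines. This is only an illustration with made-up scores and a hypothetical `split_with_cutoff` helper, not the createWithCutoff function typed in the video, which works on the data frame and also draws the chart.

```python
def split_with_cutoff(rows, cutoff):
    """Group confidence scores into TP, FP, TN, and FN lists.

    rows is a list of (label, confidence) pairs, where label is
    "positive" or "negative".
    """
    groups = {"TP": [], "FP": [], "TN": [], "FN": []}
    for label, confidence in rows:
        predicted_positive = confidence > cutoff
        if label == "positive":
            groups["TP" if predicted_positive else "FN"].append(confidence)
        else:
            groups["FP" if predicted_positive else "TN"].append(confidence)
    return groups


# Made-up (label, confidence) pairs standing in for the test-set predictions.
rows = [
    ("negative", 0.05), ("negative", 0.12), ("negative", 0.48), ("negative", 0.61),
    ("positive", 0.44), ("positive", 0.55), ("positive", 0.91), ("positive", 0.97),
]

groups = split_with_cutoff(rows, cutoff=0.5)
for name, scores in groups.items():
    print(name, scores)
```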

Here we're just going to make another histogram, but we're just going to kind of tweak it a little bit.

So now instead of having two different colors, we're going to have four different colors, one for each of those four categories we talked about.

It's going to be pretty similar otherwise.

And then all this stuff here is going to be basically just the same as it was last time.

So I won't go through basically all this again.

Just going to type another title in, so confidence scores for different images.

This line here, this is going to draw like a vertical line that helps us kind of differentiate where that cutoff value is, and I'll come back to that in a second as to why we care about that.

We're going to put this in the upper right.

All right, and now this part here is pretty important.

So now we're going to calculate the sensitivity.

So sensitivity basically is something that we use in medicine a lot to see if we should rule something out.

So like a screening test, generally, you want to have high sensitivity for that.

And the formula for sensitivity is true positives over true positives plus false negatives.

So another way of saying this is that something that's really sensitive has a low number of false negatives.

It might have a lot of false positives, so it might just think that like everything is positive, but at least we're avoiding false negatives that way.

And then, you know, the general scheme of things is that once we have a confirmatory test, then we want high specificity.

So for that, it's kind of the other way around where we want there to be basically a low false positive rate.
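Written out as code, the two formulas look like this. The counts in the example are invented numbers, and the function names are just for illustration.

```python
def sensitivity(true_positives, false_negatives):
    """True positives over true positives plus false negatives."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """True negatives over true negatives plus false positives."""
    return true_negatives / (true_negatives + false_positives)


# Invented counts: 50 images with the finding, 50 without.
sens = sensitivity(true_positives=40, false_negatives=10)
spec = specificity(true_negatives=45, false_positives=5)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Fewer false negatives pushes sensitivity toward 1, and fewer false positives pushes specificity toward 1.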

So, you know, something that's like an example of this, like if you remember, you know, in 2020, when we wanted to know more about whether someone had COVID, you could use like the screening test.

Sometimes that would be like the self swab test or a spit test.

I guess that was a thing.

But ultimately, the PCR test was the most specific.

So that was like the more confirmatory test.

A lot of the screening tests would have high false positives, but the PCR test was a lot more accurate and more specific.

Here, I'm just going to actually like spell it out on the thing itself.

All right.

And here, we're going to go ahead and basically say that we want a cutoff of 0.5.

So let's try this out.


And as you can see, now that we've defined a cutoff of 0.5, now we can start to think about true positives, false positives, true negatives and false negatives.

So here you can see that if we use a confidence level of 0.5, we have a relatively low rate of false positives, but higher amount of false negatives.

Let's see about if we change this value to be something like that.

How does that change our sensitivity and specificity? Well, if we lower that, then we have higher sensitivity, but we have worse specificity.
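Here is a small sketch of that trade-off on made-up scores. The 0.3 cutoff is just an example of a lower value, not necessarily the one used in the video, and `sens_spec` is a hypothetical helper.

```python
def sens_spec(rows, cutoff):
    """Return (sensitivity, specificity) for (label, confidence) pairs at a cutoff."""
    tp = sum(1 for label, c in rows if label == "positive" and c > cutoff)
    fn = sum(1 for label, c in rows if label == "positive" and c <= cutoff)
    tn = sum(1 for label, c in rows if label == "negative" and c <= cutoff)
    fp = sum(1 for label, c in rows if label == "negative" and c > cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up (label, confidence) pairs standing in for the test-set predictions.
rows = [
    ("negative", 0.05), ("negative", 0.12), ("negative", 0.48), ("negative", 0.61),
    ("positive", 0.44), ("positive", 0.55), ("positive", 0.91), ("positive", 0.97),
]

at_half = sens_spec(rows, cutoff=0.5)
at_lower = sens_spec(rows, cutoff=0.3)
print("cutoff 0.5:", at_half)
print("cutoff 0.3:", at_lower)
```

On this toy data, dropping the cutoff from 0.5 to 0.3 catches every positive image, but twice as many negative images get flagged as positive.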

And basically, the big picture here is that a lot of tests in medicine kind of struggle with this.

Like, you know, how do we define a cutoff value for whether or not someone has a disease? For example, if you have diabetes, you can get a test called an A1C that basically looks at the amount of sugar in your blood over time.

And, you know, what should the cutoff be for someone who has diabetes versus someone who doesn't have diabetes? Now, the answer to that is 6.5%.

And a lot of testing went in to figure out what's the best value for that particular test.

But here we're kind of thinking about a similar thing.

I mean, what is the cutoff for whether or not our model thinks you have cardiomegaly or don't have cardiomegaly? And the kind of important concept here is that, you know, we just arbitrarily chose 0.5, but that might not actually be the very best value.

And, you know, ultimately, if our goal is to build something that can look at a chest x-ray and diagnose a disease, we want to be totally sure that whatever cutoff value it has for saying yes or no, we want to be sure that it's the very, very best cutoff value.

So for this next part, we're going to talk about ROC, or the receiver operating characteristic curve, which is related to AUC, or area under the curve.

And this concept is basically how we can kind of figure out what is the very best cutoff point for this particular test, for this particular AI model, which again is going to say yes or no.

So let's build a method to create an AUC curve.

And basically what an AUC curve is going to do is it's going to do every single possible cutoff between 0 and 1.

It's going to calculate the sensitivity and specificity for all those possible values and classifications.

And that's going to give us a lot more information through which we can then figure out what's the best cutoff value.

So what we're going to do here is we're going to feed this thing all of the guesses that it made earlier.

So we're going to have it go through each one and basically now create that AUC curve.

So this is basically just going through all of the guesses that it had made.

And we'll take it that way.

Alright, so now we're just going to return basically all of the numbers of true positives, false positives, true negatives, and false negatives.

We're going to manually calculate the sensitivity, which we talked about earlier.

So we'll go ahead and do it this way, like this.

Specificity is going to be like this.


Alright, so this is sort of the first step.

And then now that we've created this function, AUC curve, sorted results.

Now basically from here, we're just going to create a line that basically strings together like all of these different sensitivities and specificities for all these different possible cutoff values.

And now that we have all these different values as well, we're going to make sure that we calculate what's called the area under the curve, which basically like numerically represents how good of a model this is.

Or in other words, it's basically how close is our AUC curve to being perfect.
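A minimal sketch of that whole idea, under some assumptions: the scores are made up, the cutoffs are tried in steps of 0.01, and the area is added up with the trapezoid rule. The function names are hypothetical and this is not the exact code typed in the video.

```python
def roc_curve_points(rows, steps=100):
    """Try every cutoff from 0 to 1 and return (1 - specificity, sensitivity) points."""
    positives = [c for label, c in rows if label == "positive"]
    negatives = [c for label, c in rows if label == "negative"]
    points = []
    for i in range(steps + 1):
        cutoff = i / steps
        sens = sum(c > cutoff for c in positives) / len(positives)
        spec = sum(c <= cutoff for c in negatives) / len(negatives)
        points.append((1 - spec, sens))
    points.sort()  # left to right along the x-axis
    return points


def area_under_curve(points):
    """Add up the trapezoids between neighboring points on the curve."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2
    return area


# Made-up (label, confidence) pairs standing in for the test-set predictions.
rows = [
    ("negative", 0.05), ("negative", 0.12), ("negative", 0.48), ("negative", 0.61),
    ("positive", 0.44), ("positive", 0.55), ("positive", 0.91), ("positive", 0.97),
]

points = roc_curve_points(rows)
auc = area_under_curve(points)
print(f"AUC = {auc:.4f}")  # 0.8125 for this toy data
```

A perfect model would put every positive image above every negative image and get an area of 1, while random guessing would land near 0.5.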

So here's just some code to create a line plot.

Alright, there's the rest of the code there.

And now once we go ahead and run, you can see that our AUC curve is actually pretty good.

It has a pretty high area of 0.923.

One thing that's kind of funky about this curve, about this whole graph, is that you notice the x-axis here is kind of like backwards, like one minus specificity.

The reason for that is basically, you know, as sensitivity goes up, as a test becomes more sensitive, it gets less specific.

So as the number of false negatives goes down, the number of false positives is going to go up.

So here, you know, like this point on the curve up here, this test is like super sensitive, which really just means that it doesn't really have that many false negatives.

But you can see there's going to be a lot of false positives, so it's not going to be as specific as it could be.

Overall, this is a pretty good ROC.

And you know, given that we only trained this model for like three minutes, and we only fed it like 100 images instead of like thousands of images, this is pretty encouraging.

You know, if we were to redo this experiment with like a way more powerful model, with like way more images, and with a lot more training time, so instead of three minutes, something like maybe, you know, overnight or whatever, there's no reason why the area under this ROC curve couldn't get way closer to one.

But again, 0.923 is already like pretty good as it is.

If you want to go ahead and save your model, that's pretty easy to do with the code base that we're using.

You can just do

We'll save it to the content part.

And we'll just go ahead and put it like this so that you can find it maybe a little bit more easily.

You also might want to zip all this stuff up.

And what's really cool about Colab is that you can do exactly that by starting a line with an exclamation point, which in Colab allows us to run shell commands.

So if you want to go ahead and try all this here, we'll click play.

And in a minute, you'll see that the model is basically saved to here and zipped up.

It took like two minutes or so for it to zip everything up there, but anyway, just click on this icon here.

And then here's our model which is about a gigabyte or so big.

Click on those three dots, download, and then that'll save it to your computer in a minute or so.

Alright, well that just about wraps up this presentation.

If you have any questions, feel free to reach me at that Twitter account above or comment on this video below.

And I hope you enjoyed that and thanks for your attention.

Thanks for your time.