DevOps is one of the highest-paying roles you can get at a software company. And even if you aren't working as a DevOps engineer, knowing how it works will make you a more productive developer.

We just published a DevOps Engineering course on the freeCodeCamp.org YouTube channel. Learn all about DevOps in this comprehensive course for beginners with three technical tutorials.

Colin Chartier created this course. He is the co-founder and CEO at LayerCI.

You will learn what DevOps is, continuous integration, continuous deployment strategies, and application performance management. Because many DevOps practices are commonly used in programming and web development, it is important to understand the key terms and technologies.

This course often references the MERN (MongoDB, Express JS, React JS, Node JS) technology stack. You will get a series of talks with technology recommendations based on these best practices. There will be several programming examples embedded in these talks. As long as you know the absolute basics of coding and the web, you'll have no problem following along.

Here are the sections in this course:

Unit 1 - Code Review Automation

  • Lesson 1 - What is DevOps?
  • Lesson 2 - What is Test Driven Development (TDD)?
  • Lesson 3 - What is Continuous Integration (CI)? w/ CI setup TUTORIAL
  • Lesson 4 - What is code coverage?
  • Lesson 5 - Linting best practices
  • Lesson 6 - Ephemeral environments  w/ setup TUTORIAL

Unit 2 - Deployment Strategies

  • Lesson 7 - Virtual Machines (VMs) vs. Containers
  • Lesson 8 - Rolling deployments  
  • Lesson 9 - Blue/green deployments w/ Continuous Deployment setup TUTORIAL
  • Lesson 10 - What is autoscaling?
  • Lesson 11 - What is service discovery?

Unit 3 - Application Performance Management (APM)

  • Lesson 12 - What is log aggregation?
  • Lesson 13 - Vital production metrics

Watch the full course below or on the freeCodeCamp.org YouTube channel (2-hour watch).

Full Transcript

(Note: autogenerated)

This beginner's DevOps course is your first step toward a DevOps engineering role. It is taught by the CEO and co-founder of LayerCI.

The goal of this course is for regular developers and regular engineering practitioners to learn fundamental DevOps concepts so that they can move toward the DevOps engineering role. We'll also be talking about DevOps broadly in the introduction, but beyond that, we'll primarily be talking about the engineering side of things.

DevOps is a methodology that helps engineering teams build products better by continuously integrating user feedback.

And if you google DevOps and look for pictures, you'll often see ones like this. It really helps to understand how DevOps is different than the traditional way of thinking about software development. Back in the day, software was developed much like things would be developed in a factory: the input would be programming, and the output would be a product that you could put on a CD and sell to users.

But since the advent of the Internet and of continuously updatable software, it's become really easy to launch things, get user feedback, and integrate that into the current product instead of making a new version of the product. So websites like Facebook continuously upgrade instead of requiring you to buy a new version of Facebook, unlike old games like SimCity, which would require you to buy a new version of SimCity. And that idea is really formalized by DevOps. The sections are: planning, where you take a set of features that you want to build and you work with your team to make some specifications for what those features might look like;

you code them, so developers on your team will build out these features so that they can be released. And of course, they're built. For a website, you might take the source code and bundle it into JavaScript that a user's browser could run. For a video game, you might make releases for various different versions: versions that run on Linux, versions that run on Windows, and versions that run in the browser. Then you take these built artifacts and you test them. Testing is both automatic and manual: automatic testing is usually colloquially known as continuous integration, and manual testing is colloquially known as quality assurance (QA).

And then after it's tested, and the stakeholders have all given their feedback, it's released. With continuous deployment strategies, releasing and deploying all happens automatically after a change is known to be good. There's a lot of automation that can be done here in larger teams - there are popular tools like Spinnaker by Netflix that we'll talk about in later talks. But the core idea is that you want to take the software and send it to your users in a way that they don't notice if there are problems. So if there's an experimental UI change, you might show it to a small percentage of users and get their feedback before you show it broadly. Again, for a company like Facebook, which has billions of users, even if 1% of their users complain, they'll get tens of millions of emails. After the release is built, it's deployed. Deploying means it's released to your users: for a website, it would mean it's publicly accessible on the internet. For a CD-ROM, you'd bundle your software onto a disc and distribute that. For a mobile release, you'd build the artifact and submit it to the App Store, and then the App Store would review it and publish a new update that your users could download.

And then you operate it. Operating is primarily things like scaling - making sure that enough resources exist for the load, adding more servers as required - configuring things, dealing with architectural problems, and monitoring. So as your users use your software, and especially as they submit things, start jobs, and create posts on your forums, you want to make sure that those operations are all healthy.

And then finally, you take all this feedback and you put it back into the planning stage. So the planning stage takes all the user feedback, and all of the things that the operations and deployment teams learned about deploying and scaling the product, and uses that to build out new features, fix bugs, and make new versions of the backend and new versions of the architecture. And then it just continues in the cycle. This is what people mean when they say "our company uses DevOps," or "our company is tech forward," or "our company is digitally transformed." They mean that instead of taking a set of requirements and building one artifact which is then shipped, it's a continuous cycle of taking feedback - usually in these two-week Scrum cycles - and producing software that users actually want to use, that they've had some say in producing. DevOps engineering is another common part of DevOps. Beyond just the methodology, which is something that maybe the technical leaders and CEO would care about, there's a subfield of DevOps engineering, and this is usually what engineers mean

when they say DevOps, and that's usually what job postings mean when they say DevOps. So if a job posting is asking for a DevOps engineer, they're not asking for someone that can plan and code. They're mostly asking for someone that can build, test, release, deploy, and monitor.

So the three pillars of DevOps engineering are pull request automation, deployment automation, and application performance management. We'll get into specifics about those, but the idea is: pull request automation helps developers build things faster, and helps them understand whether their proposed change is good, faster. Deployment automation helps you deploy your code in a way that users don't complain about. Again, Facebook has lots of deployment automation, because if they just threw their code out into the void every time a developer made a change, there'd be hundreds of millions of complaints. And application performance management is automation around making sure that things are healthy: automatically detecting downtime, automatically waking someone up if the site goes down overnight, automatically rolling things back if there's a problem. We'll get into the specifics of all of these in future talks.

The first pillar, which I mentioned, pull request automation, has primarily to do with the developer feedback cycle. Developers share work with each other by proposing these atomic sets of changes called pull requests. And by atomic I mean they're full-featured on their own; they don't require other things to run first. If a developer proposes a pull request, they should be expecting that the change is good, and as far as they can tell, the change fulfills some business requirements. And then what they have to do is get through some gates. For organizations, the goal of pull request automation is to make sure that developers can tell very quickly whether their change is good or not. So for example, if you're working on a website and a developer proposes a change that adds a typo, that's something that can easily be automatically detected. If you set up a typo gate that says no change may go in if it contains a typo, that would be an easy way to make sure that developers get automatic feedback about their changes. When people say pull requests, as of 2021, they usually mean Git. Git is a technology originally popularized by Linux, and it helps developers make these sorts of changes and share them with each other. A pull request is usually reviewed by at least one other programmer in something called a code review, where the other programmer will tell the proposing programmer about code style, architectural problems, scaling problems - subjective things that can't easily be automated. But that process of review can also be greatly facilitated by a DevOps technology stack. DevOps automation can help with things like ephemeral environments, linting, and all of the other automations that we'll get into. After the code review has been done, usually an engineering manager or product manager in charge of the functionality being proposed will give feedback. So if you create a new button on a website, you'd like the designer that designed the button and the product manager that requested the button to both give feedback, because if the button is phrased poorly, if it's placed poorly, if it's not mobile-responsive, those are all problems that would require another merge request. It would be great if the original merge request fulfilled all of the requirements the first time it was proposed. And so usually, non-technical people will give feedback on pull requests as necessary.

So what can be automated for a DevOps engineer? You can automate things like automated test running, per-change ephemeral environments, automated security scanning, and notifications to reviewers - getting the right people to review the change at the right time. The end goal of all this automation is that a developer should be able to propose a change and get it merged the same day they propose it. That's a huge organizational benefit, because it means that critical bugs can be very quickly fixed, merged, and deployed without needing a special process. It also means that developers aren't bogged down in bureaucracy: they can propose changes, and once they get through all the gates, the change will be deployed; there aren't additional special gates that they need to discover. So for example, if the proper gates and automations have been set up, a developer should be able to change a web page without having to ask everyone in the company whether this web page is used in certain workflows or not. By virtue of passing the tests and passing the QA review, it's assumed that the new change is good. And if a problem does arise, a new gate can be added to the automation so that the problem doesn't occur in the future.

The second pillar is deployment automation. A famous post from 2000 by a co-founder of Stack Overflow places "Can you make a build in one step?" as the second most important question for a development organization, and things haven't really changed since then.

The efficiency of the build process isn't the only goal of deployment automation, however. Other goals include the deployment strategies I talked about: canary deployments, where you want to show a feature to a few users at a time; starting new versions of your application without causing downtime - if you have to shut off your website before upgrading it and then turn on the new version, the visitors that visit the website in the middle of the upgrade will notice downtime, so there are clever deployment strategies you can use to avoid that; and finally, rolling back versions in case something goes wrong.

It's easy to overcomplicate deployments. Many companies have complex internal platforms for building and distributing releases. Broadly, success in deployment automation is finding the appropriate deployment tools to fulfill business goals and configuring them. In an ideal world, there should be little to no custom code for deploying. Off-the-shelf solutions like Spinnaker and Harness are wonderful places to start for this sort of thing.

Finally, application performance management: even the best code can be hamstrung by operational errors. There's a famous case where a user put a bunch of spaces at the end of their post on Stack Overflow and brought down Stack Overflow, which is a very popular developer website, because Stack Overflow hadn't deployed their code in a way that would deal well with a bunch of whitespace - a bunch of space characters at the end of a post. Even with the best code, and even with the simplest things like a message board, it's easy to have faults that make it to production and are only uncovered by users. So application performance management ensures that metrics like how long it's taking for requests to be processed and how many servers are being used - all of those key health metrics - are being measured. And if there's a problem, like if all of the requests to the landing page are suddenly taking a long time, the appropriate people can be notified automatically, instead of an engineer discovering on Twitter that their website is down.

Logging: as a program executes, it produces logs, and the logs generally have information about the state of things. It's useful to be able to map logs, like "a user visited the website," back to information about that user. What was their IP address? What was their username? What resource did they access, and what resources were used for fulfilling that access? So if the server had to load something from a database, and the database was slow, it's useful to be able to say: the user had a slow experience because their request was fulfilled slowly, and the request was fulfilled slowly because it was fulfilled from the database slowly. Mapping these requests all the way down to their constituent components is very useful.
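To make that concrete, here's a minimal sketch of the kind of structured log line that makes this mapping possible - the field names are illustrative, not from the course:

```javascript
// Log one JSON object per request so a log aggregator can tie the user,
// the resource, and the slow database call back together later.
function logRequest(req, durations) {
  console.log(JSON.stringify({
    time: new Date().toISOString(),
    user: req.user && req.user.name,   // who made the request
    ip: req.ip,                        // where it came from
    resource: req.path,                // what they accessed
    totalMs: durations.total,          // how slow the user's experience was
    dbMs: durations.db,                // how much of that was the database
  }));
}
```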

Monitoring: again, I mentioned metrics and automatically alerting people, but monitoring is taking the logs and metrics - how slow things are, how much memory is left - and deciding what to do. So if there's a bunch of load, you might decide based on the metrics to automatically scale the number of servers, adding more web servers as they're being used. Based on the logs, if there are errors, you might want to automatically file tickets for engineers to look into them. And if there's a downtime, you might want to call the person on call so that they wake up and take care of it - they can drop everything; they have a pager, so to speak. And that's alerting. Alerting is when a fault is detected - some trigger has occurred based on the metrics, some number of requests are too slow, things are unhealthy, users are going to notice degraded performance - and someone should be notified or some action should be taken. A new product shouldn't dive into DevOps engineering all at once. All of what I've talked about are end goals for really large organizations like Netflix and Facebook;

developers should add automation as the situation requires it. So take a new startup with no users, building a website:

pillars two and three are essentially useless. Outages won't be noticed by anyone - something like a downtime doesn't necessarily matter. You don't even necessarily need to run automated tests. A useful stack for someone there would be something like Netlify, Vercel, or our product, where you can get staging environments to collaborate with other developers. But that's about as far as you care testing-wise: you just get an environment for every proposed change, and you can play around with it yourself to see, in a manual QA setting, whether it's good or not. Now take a team building an app for 10 enterprise users. Enterprise users are much more sensitive to downtime, so test coverage and business-hours alerting should be priorities. For logging, log aggregation, and error collection, there are popular tools like Sentry and Codecov; for automated test running, there are tools like Bitrise and CircleCI (Bitrise is known for mobile testing); and for alerting there's a famous tool called PagerDuty that keeps track of who should be notified if there's a downtime. So during business hours, you might assign someone to be the person that isn't supposed to take any meetings for the day - if there's a downtime, they will drop everything and solve the problem.

And a social media app like Reddit might be using a large combination of things: Sentry for catching errors in the website; Elasticsearch, Logstash, and Kibana (the ELK stack) as a popular way of collecting and looking at logs; Pingdom to check whether certain pages are taking too long to respond; LaunchDarkly for feature flags, so you can say whether a feature is enabled for some group of users or not - should the new landing page be shown to users in North America or in Europe? And Terraform lets you automate the deployment process: given a set of servers and a set of things that need to run on the servers, Terraform will help you automatically create a plan to ensure that the right things are running in the right places.
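As a rough illustration of the feature-flag idea (this is a sketch of the concept, not how LaunchDarkly is actually implemented - the function and field names are made up):

```javascript
// Decide whether a user sees the new landing page, based on region and a rollout percentage.
const crypto = require('crypto');

function showNewLandingPage(user, rolloutPercent, allowedRegions) {
  if (!allowedRegions.includes(user.region)) return false;         // e.g. North America only
  const hash = crypto.createHash('sha256').update(user.id).digest();
  const bucket = hash[0] % 100;                                     // stable 0-99 bucket per user
  return bucket < rolloutPercent;                                   // e.g. 5 means "roughly 5% of users"
}

// showNewLandingPage({ id: 'user-123', region: 'NA' }, 5, ['NA']);
```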

And the conclusion of all of this is that DevOps engineering is vital for developer teams. Without being cognizant of its three pillars, customers will have a confusing and disappointing experience: things will go down, things won't scale properly, things will be slow. So it's really important to keep the three pillars in mind as you're scaling an engineering organization, or if you're being hired as a DevOps engineer. New products don't need to automate very much. However, as the product matures and gets more users, it's more and more important to automate DevOps engineering and to dedicate more resources to it.

Welcome to Unit 1 on code review automation. Let's talk about testing, which is going to be really vital baseline information for when we talk about continuous integration and other code review automation topics. So, test-driven development is a coding methodology where tests are written before the code is written. And we're going to explain tests and test-driven development in terms of coffee makers - so enjoy this picture of a nice coffee maker as we continue.

Test-driven development has been around for a long time. It was popularized in the early 2000s. The idea is simple, but it requires knowledge of how things came to be for it to really make sense.

So, historically, common words in software development, like quality assurance (QA) and unit test, have roots in factories building physical products. If you were running a factory building coffee makers, you would test that they worked at varying levels of completion.

So unit tests ensure individual components work on their own: does the heater work? Does the tank hold water? Integration tests ensure a few components work together: does the heater heat the water in the tank?

System (end-to-end) tests ensure everything works together: does the coffee maker brew a cup of coffee?

Acceptance tests happen after the product is launched and sent to customers: are they satisfied with the result? Are they confused by the button layout, or breaking the coffee maker within their warranty period?

All of these tests have software analogies. It's useful to know which components break in order to diagnose a problem, but it's also useful to know that the whole system is working correctly - because even if every individual component works on its own, if your coffee maker doesn't heat water with its heater once it's assembled, that's going to be a problem when it comes to making coffee.
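In software terms, the same layers might look something like this minimal Jest-style sketch - the coffee-maker classes are a toy model invented here just to show the shape of each kind of test:

```javascript
// Toy coffee-maker model, only here to give the tests something to exercise.
class Heater { heat(temp) { return temp + 80; } }
class Tank { constructor() { this.temperature = 20; } }
class CoffeeMaker {
  constructor() { this.heater = new Heater(); this.tank = new Tank(); }
  heatTank() { this.tank.temperature = this.heater.heat(this.tank.temperature); }
  brew() { this.heatTank(); return { volume: 250 }; }
}

// Unit test: one component on its own.
test('the heater heats', () => {
  expect(new Heater().heat(20)).toBeGreaterThan(90);
});

// Integration test: a few components together.
test('the heater heats the water in the tank', () => {
  const maker = new CoffeeMaker();
  maker.heatTank();
  expect(maker.tank.temperature).toBeGreaterThan(90);
});

// End-to-end test: the whole system brews a cup.
test('the machine brews a cup of coffee', () => {
  expect(new CoffeeMaker().brew().volume).toBeGreaterThan(0);
});
```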

That's really the idea of testing. But let's get into test-driven development, which is the methodology built on top of testing that's become so popular in the past 10 or 20 years. Most developers that aren't using test-driven development have a similar workflow: they choose something to work on - in our picture of DevOps, they'd find something to work on in the planning phase - they build it, so they write code and make a build from that code, and then they test it. So they'd write small scripts that make sure that their code is working correctly. If you're making a function that adds two numbers, you might pass it two and two and expect the result to be four, and that would be a good indication that your function was working correctly.

So steps one and three, as it turns out, are very connected. The tests written at the end essentially codify the specification. What is success for building a coffee maker? It should heat up in five seconds, so write a test for that. It should brew coffee of sufficient strength, so write a test for that, and so on. Test-driven development uses the similarity of steps one and three to flip this process. First, developers choose something to work on, and then they write the tests before writing the code - tests that are currently failing because the specification isn't satisfied. And then they write code until all of the specifications they wrote in step two are satisfied. So they might make a testing regimen that would pass if the coffee maker succeeded, and then build the cheapest coffee maker which satisfies that testing regimen.

And the end result is the same: the software is built, it's tested, and it matches the specifications. But in a lot of cases it's significantly easier to write code if you write the tests first, because you know what you're building, and it forces you to think about which things are important to work on and which things can be put into a later set of changes.
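Using the add-two-numbers example from above, the test-first workflow might look like this sketch (Jest-style; the test comes first and fails until the implementation beneath it exists):

```javascript
// Step 1: codify the specification as a (initially failing) test.
test('add returns the sum of its two inputs', () => {
  expect(add(2, 2)).toBe(4);
});

// Step 2: write the simplest code that satisfies the specification.
function add(a, b) {
  return a + b;
}
```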

So this was a very quick video to discuss testing. In the next video, we'll talk about continuous integration, which is really the DevOps continuation of this idea. See you there.

So we've talked about testing, where developers write scripts that make sure that their code continues working way off into the future, years after they've written it. And that leads us into our discussion of CI, which is really one of the big topics that people talk about in a DevOps context. CI stands for continuous integration. It refers to developers continuously pushing small changes to a central repository numerous times per day, and those changes being verified by automated software that runs the tests the programmers have defined.

So we've gone over what tests are. Now let's talk about why a company would use CI.

Well, CI is really the first step in automating DevOps. Imagine the very simplest scenario, where a single developer is making a program that'll be used by a small group of users.

That developer makes the original program, releases it, and the project slowly builds traction.

Now, imagine that developer has a critical bug a year later.

And they go back to the old code and they say, gee, this is really bad code - I've become a better programmer since a year ago; I don't really understand what's going on here. But that's really how development works. Programmers get better year after year, and they have to read and understand the bad code that they wrote just a year ago. And the only way to be confident making changes to that legacy code, which might just be a year old, is to have CI.

CI improves developer speed, because new changes can be made confidently, without having to worry about breaking existing functionality, as long as the tests pass. CI also reduces customer churn: problems in the software are much less likely to occur if you have comprehensive tests that run automatically. As long as you get those checkmarks, you can be reasonably sure that the core features of your application will continue working. So how would you integrate CI into your development process? First, let's talk about the common branch-based development process that many development teams use. Developers work on a feature branch. They'll take the files that are most current - the ones shown to customers at a specific point in time - and branch off of them. So they'll make a new copy of the files to work on their feature independently of all of the other developers working on things, and make changes to the various components. So this feature makes a change to the mobile app and to the website.

And then they'll push that branch back to the repository, which is usually something like GitHub, GitLab, or Bitbucket. That repository will run CI - the CI will be configured on the repository side - and it'll run all of the tests that the programmer has defined. Then the results of those tests will be attached to the pull request. The pull request is the developer asking to take their code and merge it into the central repository that users will be shown. So you take the feature branch here,

and you put it at the end of all of the other commits that are being shown to users. So this commit is now the one that will be shown to users, and the next time there is a deployment, the features that the programmer made will be visible to users. And the best part is it doesn't cost you anything: central Git repositories like GitHub, GitLab, and Bitbucket mostly have generous free tiers, even for organizations, minus some security and access control permissions features that you might need as you scale up. And CI providers like LayerCI, GitHub Actions, and GitLab Pipelines all have generous free offerings as well. LayerCI is really made for people working on websites, so that's maybe something to consider, but if you're really early on in your project's lifecycle, it doesn't really matter which CI provider you use. If there's one thing to take away from the discussion of CI, it's that CI is a vital tool - it's really the first thing that should be automated in most pull request automation schemes, because it's so easy. Developers should be writing these tests regardless, and if you don't run the tests automatically, slowly people will break things without realizing that they're breaking them, and users will notice those broken things.

And following best practices like feature branches and CI is a really easy way to scale a developer team. With just CI, a developer team can easily scale from one to 10 developers. At some point in there, you'll have to start worrying about other pull request automation topics, like the ones we'll cover in the next section.

We've talked a lot about theory, but let's get practical for a little bit, just to round out our understanding of how these DevOps concepts work. Let's look at what setting up CI looks like for an actual repository.


This is the live chat example. It's an open source version of Slack that's used as a demo repository throughout LayerCI's internal documentation. Let's say, for this open source version of Slack, we'd like to run tests every time a developer proposes changes, so that in the pull requests tab we'd be able to know whether a change was good. In particular, let's say a developer was changing the color of the website.

In the main website, after you log in, the top bar and sidebar are purple. Perhaps the customer has requested that the color be blue instead.

If we asked a developer on our team to make this change, they would go to the necessary design file and edit the color.

In this case, there are two colors to change.

If the developer opened this pull request, it'd be very difficult for us to review their change.

Without a CI system, all we can see is the file change and the description of the commit. So we can see that they've edited main.css and that they've changed these color values. But it's very hard to understand the ramifications of this, and it's especially hard to understand whether this will have negative side effects for existing users - especially for changes that are less trivial than just changing a color.

For this pull request, if I were asked to review it, I would have to pull these changes onto my local developer machine, run the script locally, and then evaluate the changes locally. Or I could ask the developer to set up a screen sharing session and walk me through the changes. Both of these add a lot of friction to the development process. It'd be better if I could evaluate their changes without needing any involvement at all, entirely through a web interface.

This is where continuous integration helps. Continuous integration allows developers to set up comprehensive tests, so that if something doesn't work anymore after a proposed change, it shows right in the pull request.

Let's close this change for now, and look at the repository to understand how to set up CI.

In this repository, one of the services is called Cypress, and it's an end-to-end testing service.

It contains several configurations. And these configurations interact with the page with a fake browser.

For example, this test enters a username and password, logs in, and then ensures that the user is actually logged in.

This test goes to the message area, enters a random message, and ensures that the message has actually been submitted - that it's viewable in the main chat area.

With enough end-to-end tests, you can be reasonably confident that a chat system like this one continues working.
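For a sense of what such a test looks like, here's a minimal Cypress-style sketch - the selectors and credentials are illustrative, not the repository's actual code:

```javascript
describe('chat', () => {
  it('logs in and posts a message', () => {
    cy.visit('/login');
    cy.get('input[name=username]').type('test-user');
    cy.get('input[name=password]').type('test-password');
    cy.get('button[type=submit]').click();
    cy.contains('test-user');                  // we actually landed in the app

    const message = `hello ${Date.now()}`;     // a random message
    cy.get('[data-testid=message-input]').type(`${message}{enter}`);
    cy.contains(message);                      // the message shows up in the chat area
  });
});
```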

So we'd like to run these tests every time a developer proposes a change. To do so, we'll have to install a plugin into GitHub, set up the server to run after every pull request, and run these tests against the new server.

To do that, let's set up LayerCI.

For our use cases, it's easy to just install it directly onto our GitHub account.

We can now install it onto our GitHub repository.

And now it's listed here. This means that we've successfully installed LayerCI onto this repository.

However, nothing will happen yet, because there are no configuration files. We need to set up a configuration file for this repository that will start the whole stack and then run the tests in Cypress as required.

Let's do that now. Because our repository is Docker Compose based, let's use the Docker Compose example as a starting point.

Here, we're going to install Docker, which is a containerization technology - we'll talk more about containers versus virtual machines later on in this set of talks - and Docker Compose, which is a way of running multiple containers at the same time. These concepts will become clearer later on in this talk.

We copy the repository files into the test runner.

We build all of the services, we start all of the services and then we deploy the pipeline.

Let's skip that part for now; we'll talk about it in the deployment section of this DevOps course.

And after all the services are started, let's run tests.

Luckily, I've already set up a script for this, so I can copy my configuration.

So to recap, what this configuration will do is install the necessary software (in this case, Docker and Docker Compose), copy the repository files, build all of the microservices, start them all locally within the test runner, and then run our tests against them.

So now that we've installed LayerCI onto our repository, all we have to do is add this configuration, and we'll have set up CI for it.

So let's click Add File,

We'll name it "Layerfile" - this is how LayerCI's configuration files are named; other CI providers will have different file names, of course.

we'll copy our configuration.

And we'll commit the file.

So now that we've set up CI, we can see that there's a dot next to the commit name.

And that dot turns into a checkmark when the tests have passed. This means that every time a developer pushes new code, our source code management tool will show a success metric - namely, whether the tests have passed or not - automatically. They won't have to run the tests themselves, and the reviewer won't have to trust that the original developer has actually tested that the change works.

So let's go back to our original proposed change of changing the colors in production from purple to blue.

Here, we're going to make our change and reopen the pull request. But because we've configured a CI provider for it, we'll be able to see that the tests are running automatically directly in the pull request view itself.

Now, when our developer asks us for a review, it'll be much easier for us to tell whether the change has negatively affected our customers' workflows. In particular, because we've configured Cypress and LayerCI to check that logging in and posting messages still work, we'll know that for this change - even though many files might have been changed - the core workflows still work, which gives us a degree of confidence that nothing terribly bad has happened with the code.

So we can look at the file change for our first idea of what the developer has done.

And then we can view what the CI is doing. So if we open the relevant pipeline,

we'll see that the tests are in the process of running: the new version of the application has been built and started within the CI runner, and the tests are running one by one. Here, it's tested that you can post chat messages within our alternative Slack's chat page, that the landing page loads, and that logging in works correctly.

So now, within our pull request view, we'll be able to see a big checkmark here, which shows that all of the relevant CI checks have passed. And you can even enforce, within GitHub or other source code management platforms, that certain checks must pass - so you can require that all CI checks pass before a change can be merged. That makes sure that developers are never reviewing code that's so obviously broken that it's breaking your tests. And you don't only have to run end-to-end tests here: you can also run linters, unit tests, and the other kinds of tests we talk about throughout this series of talks. And now that I'm happy with the change - I've reviewed the files, and I see that the CI has passed - I can merge it with a great deal more confidence than if I didn't have this automation in place.

That's it for setting up CI in an applied setting. Let's get back to theory for a little bit.

Continuing on the topic of testing and continuous integration, let's talk about code coverage. Code coverage quantitatively measures how comprehensive the tests for a code base are. You might think that you have enough tests to find all of the common bugs and to really check all of the functionality of your app, but it's hard to put a number on it unless you're measuring code coverage. This is what a code coverage graph looks like from a popular tool. Each of these squares represents a file, and the color represents how much of that file is covered by tests. Bright green means 100% of the file is tested, and bright red means none of the file is tested - so that would be a priority: a file that should either be tested or excluded from the measurement.

So let's say you're taking over an existing code base, it's relatively large at 100,000 lines of code.

Over the years, it's been adopted by a couple hundred users, and you're expected to maintain it and add features without harming those users. So the first place you look is the unit tests, which we discussed earlier. But they weren't really prioritized by the previous maintainers, so there's a mishmash of libraries and naming conventions, and it's kind of hard to tell which tests are testing which files and which files need to be tested. Before you write any new features, you'd like an objective way to measure how sensitive certain parts of the codebase are to being changed. If something has very comprehensive tests, you'll be much less scared to make changes and add features that touch that part of the code than if there's a part of the code that doesn't have tests. So this is where code coverage really shines: you've got a complicated code base that has existing users, and you'd like to enforce, in an objective way, that tests are written so that things aren't broken. So, getting into the first code of this whole series, let's look at this JavaScript function, which I will make bigger.

So this is a very simple function, if a bit contrived.

It takes a number.

And it defines a few variables. It loops up to that number, pushing strings into a results list.

And then every 50 elements, it pushes a special string into the results list.
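The exact code isn't reproduced in this transcript, but a function matching the description might look roughly like this (the names are illustrative):

```javascript
function buildList(n) {
  const results = [];
  for (let i = 0; i < n; i++) {
    results.push(`item ${i}`);
    if (i % 50 === 49) {
      results.push('--- separator ---');   // the special string, every 50 elements
    }
  }
  return results;
}
```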

So this whole function is 10 lines of code, but not all 10 lines are equal. There are really three kinds of lines in a program like this. There are syntax lines, like the closing braces, that don't actually have any code in them - they're simply syntactic constructs for the programmer's benefit. It doesn't even make sense to test these, because how would you test whether a semicolon existed or not?

There are logic lines, like this one, which actually have side effects. By side effects, I mean that these lines, if you removed them, would change the behavior of the program.

And there are branch lines, like this one, which change the flow of the program. For loops and if statements are constructs that change the order of the commands that run: this if statement, if it evaluates to true, would run this line, and if it didn't evaluate to true, it wouldn't. So to reiterate, the three kinds of lines are: the syntactical ones that don't do anything, the actual logic ones that have effects, and the branch ones that change which lines of code execute.

And code coverage is usually defined as line coverage: the ratio of the non-syntax lines which are executed by tests over the total number of non-syntax lines.

So again, consider this test. If you expect that the function should work with the input 2, and you manually calculate what the function should return for the input 2, this would be a unit test for your function. But since you're only executing it on the input 2, this if statement - which requires an input of at least 50 for its body to execute - wouldn't run. So you'd be testing this line, this line would execute, and this line would also execute. You'd be executing five out of six non-syntax lines, and that would be 83% test coverage. So just the single test gets us most of the way to understanding our function and understanding its problems.
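Continuing with the hypothetical buildList function from the sketch above, that single unit test might be written like this (Jest-style):

```javascript
test('buildList returns one string per element', () => {
  // Input 2 never reaches the if-statement body, so 5 of the 6 non-syntax lines run: 83% line coverage.
  expect(buildList(2)).toEqual(['item 0', 'item 1']);
});
```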

A related concept is called branch coverage. Instead of measuring how many lines of code are executed, it measures groups of lines. In our example above, there are really three branches: the main body of the function, the body of the for loop, and the body of the if statement. The main body will always execute. The body of the for loop will only execute if i is less than n, so you need n to be greater than or equal to one for those lines to execute. And the if statement's body will only execute if i is greater than or equal to 49.

And so branch coverage is how many of these individual branches are executed by a test - you'd like to know how many of all of the branches are tested. This is useful because if the first line of a branch executes, then the rest of that branch will always execute, so treating each line as an individual thing that needs to be tested doesn't mean as much as taking the bodies of these statements as the things that need to be tested.

And if you measured the test with branch coverage, you'd see that two of the three branches are executed during the test.

So when should you care about line coverage and branch coverage? We've already discussed one scenario, where you've inherited an existing code base, but it's important in many different situations. In general, you should measure and optimize for code coverage if any of the following are true. Your product has users, and those users might leave if they're affected by bugs - in which case it's important to measure code coverage because it lets you work with your team to improve the coverage and reduce the number of bugs. You're working with developers that aren't immediately trustworthy, like contractors or interns: you're bringing them into your code base and they need to make changes on some fixed timescale, like a four-month internship, so they can't immediately become experts in the entire code base, and you'd like them to be able to make changes without worrying too much about things breaking.

Or, you're working on a very large code base with many individually testable components. Code coverage analysis can complement test-driven development, which we talked about in the previous talk, to make sure that everyone on the team is generally working on important things, and that the things they make won't break in the future.

It's a common mistake in code review automation to make things too rigid before the product has enough users. If you force developers to get 100% branch coverage - so, to write two to five unit tests for every function - it's going to make them much slower at developing features that users will actually notice. Remember that tests are never viewed by users; the only thing that users care about is the stability of the system. So if you have an MVP, or a product that doesn't have very many active users yet, it might not be worth it to measure or optimize for branch coverage until those users care a lot about stability.

And when writing unit tests and other types of tests, an important thing to keep in mind is that developers are solidifying the implementations of features that they might have to throw out. If you build a feature and it ends up not being something that your users actually want, it's always a better idea to throw out that feature than to keep building on it long term. But if a developer builds a feature and writes many tests for it to improve the code coverage of that feature, they'll be much more likely not to throw it out, because they'll feel a sense of ownership and a sense of sunk cost in having built this feature and made it good, so to speak. So it's important not to over-optimize for these things before they're important. It's a subjective call, but really, you'll notice when your users start complaining about stability.

So organizationally, there are some common policies related to code coverage.

The first one is useful when you inherit a code base, and the policy is that code coverage must not decrease. This is one of the easiest policies to automate, and it's especially useful if you're taking over an existing code base, as I mentioned. The idea is that the code coverage ratio should never decrease. If the current code has 75% of its lines tested, and your new change introduces 40 lines of code, at least 30 of those lines will need to be tested. Otherwise your change's code coverage would be less than 75% (30 out of 40 is exactly 75%), and you'd be decreasing the average code coverage.
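Here's a tiny sketch of the arithmetic behind that policy, using the numbers from the example above (the function name is made up):

```javascript
// Returns true if merging the change keeps overall line coverage from decreasing.
function coverageDoesNotDecrease(existing, change) {
  const before = existing.covered / existing.total;
  const after = (existing.covered + change.covered) / (existing.total + change.total);
  return after >= before;
}

// 75,000 of 100,000 lines covered today; a 40-line change with 30 covered lines stays at 75%.
coverageDoesNotDecrease({ covered: 75000, total: 100000 }, { covered: 30, total: 40 }); // true
// Only 20 covered lines (50% of the change) would drag the average down.
coverageDoesNotDecrease({ covered: 75000, total: 100000 }, { covered: 20, total: 40 }); // false
```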

As with most code coverage policies, this will increase stability - there will be fewer bugs because things will be better tested - at the expense of developer speed. Developers will have to write some complicated tests, and they might spend a lot of time building testing infrastructure, so features will be shipped less quickly if you make this sort of policy and enforce it with code review automation.

An unfortunate side effect of this policy is that changes that are harder to test, such as integrations, will be less likely to be worked on by developers. Developers are incentivized by their paycheck and by their manager to ship features quickly - to make many features per Scrum cycle. If certain features are harder to test, because they require internet connectivity or connected third-party APIs, those features will be harder to make and harder to test, and so developers will be less likely to make them, regardless of whether they're important to the users or not. So it might be useful to have an exemptions policy in place for things like third-party integrations if your organization decides to go for this "code coverage must not decrease" policy.

Another useful policy is code owners for test files. If you've used code coverage automation to keep code well tested, it's often beneficial to define code owners for the tests themselves. This means that developers can change implementation details without formal reviews, but for logic changes - since the tests define what success means for a function or an algorithm - changing the tests for a new implementation would need to be approved by a senior developer or manager.

In GitHub, a CODEOWNERS file for this might contain a line like `*.spec.js @engineering-manager-username` - .spec.js is a common JavaScript test-file naming convention, and the @ entry is the engineering manager's GitHub username. If there's a file called CODEOWNERS which contains this, then the engineering manager will need to approve any change that changes a test, which is probably a good policy.

So if you're working in a large code base, especially with test-driven development,

or if you're hiring interns or contractors, or if your users are especially sensitive to bugs and you're afraid that they'll have a bad experience if even small bugs make it to them, it might be in your team's best interest to install a code coverage measurement tool. At the time of writing, the three most common ones in the open source world are Codecov, Coveralls, and Code Climate.

So we've talked about testing, and we've talked about continuous integration. Those are really the initial things that are set up in a DevOps code review automation pipeline. But the problem is that they require the developers to be on board - and of course, developers are probably busy building features and might not necessarily want to write tests or improve test coverage. So let's talk about linting, which is something that approximates testing but doesn't need the developers to spend any time on it.

Linters are programs that look at a program's source code and find problems automatically. They're a common feature of pull request automation, because they ensure that "obvious" bugs do not make it to production ("obvious" here in quotes). As an example of linting, let's again look at a JavaScript program - a very simple one that you should understand even if you don't know JavaScript.

It defines a variable, var x equals five, and defines a function but continues on the same line after the open bracket, which is generally considered bad practice. It uses let for the second variable and defines it with the same name as the first one, so this is just confusing and wouldn't be called good code - a code reviewer would mention this in a code review.

And then it says while x is less than 100, console.log x, and then it closes the while loop on this line, and it messes up the indentation - these three lines should be indented for consistency - and then it closes the function. Finally, you should realize that this while loop goes forever: x isn't incremented in the body of the while loop. So just by looking at the code statically, without running it in any environment or even looking at it in a browser, you can tell that this loop will run forever, and that's probably something the programmer didn't intend.
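The exact snippet from the video isn't reproduced in this transcript, but it looks roughly like this (reconstructed for illustration, bad formatting and all):

```javascript
var x = 5
function logSomeNumbers() { let x = 0   // shadows the outer x - confusing
while (x < 100) {
console.log(x)                          // x is never incremented, so this loops forever
}                                       // these lines also lost their indentation
}
```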

So much of this feedback could be automated. A set of rules like "don't shadow variables" (never name a variable in an inner scope with the same name as a variable in an outer scope) could be applied to each proposed change, so that human reviewers don't have to waste effort leaving code style comments. Tools that maintain and run such rule lists are called linters.
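For instance, an ESLint configuration encoding a couple of the rules mentioned here might look like this sketch (a hypothetical .eslintrc.js, not the course's actual config):

```javascript
module.exports = {
  extends: 'eslint:recommended',
  rules: {
    'no-shadow': 'error',                      // inner variables may not reuse an outer variable's name
    'no-unmodified-loop-condition': 'error',   // flags loops whose condition never changes
  },
};
```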

Relatedly, another class of code review feedback has to do with code style. It's easy for code reviewers to waste time pointing out stylistic choices like tabs versus spaces, or camelCase versus snake_case. These discussions bring no value to end users - your customers don't care what case your code is written in - and ultimately they just serve to cause resentment and missed deadlines within engineering teams. If a review takes an extra couple of hours because of comments like this, that's a couple of hours the programmer could have been focusing on another feature.

So engineering organizations should eventually adopt and maintain a global style guide.

But in most cases, just starting with something like the Google Style Guide, which is open source and available at this link, is a great starting point.

These guides often come with linter configurations, which help everything stay stylistically similar. And some programming languages, like Python and Go, come with their own style guides and automation - like PEP 8 in the case of Python - that make it easy for developers using those languages to stay in a unified style.

An organizational thing you can do for code style is to "nit," which stands for nitpicking. Instead of blocking at the code review stage, if there's code style feedback, it might be better for code reviewers to leave small review comments called nits. So they'd say "nit: this shouldn't be styled this way."

This is great because it allows the reviewers to merge something with a few pieces of feedback, so that hours don't have to be spent on a small piece of refactoring that could be done at a later point in time.

Once the style guide is adopted, it's possible to configure tools to automatically format code to follow the style guide - such tools are called auto-formatters. In the programming language Go, which we use at LayerCI, a command such as the following would use the standard formatter, gofmt (the one that comes with Go), to clean up all of the source files in the repository. So we'd use the GNU find command to find the files that have a .go extension, and we'd exec gofmt on them. This takes all of the source files and formats them with the style guide so that they all pass it.

And of course, if your CI system is running tests automatically every time code is pushed, the code could be automatically linted as well. Programmers shouldn't have to wait for a human reviewer to tell them whether the code is linted and styled appropriately. In most cases, it's cheap and convenient to run linting and formatting automatically with a CI system.

So an easy solution is to get another checkmark. To get something set up quickly, it's a good start to make lint act the same as running a unit test in CI: add an X if the code isn't linted properly, and then the developer can very quickly get stylistic feedback without needing to talk to another human or waste their reviewer's time getting this sort of feedback.

A CI configuration for that might look like this: copy the project files, run the linting script, and then if the linting script fails, the whole pipeline fails. This approach stops reviewers from nitpicking style - "it passed the linter" is a perfectly reasonable response to an overly zealous code reviewer. So even simple automation like this can improve the development speed of entire development teams. It also stops reviewers from having to give style feedback at all: if all of the checks for code review pass - it passes all of the linters, the commit is stylistically okay - the reviewer might still leave some feedback for future reference, but they shouldn't be blocking commits from getting to production because of small stylistic choices that aren't even in the linter.

A better long-term solution is to set up a commit-back bot, which is a common idea that shows up all over the place in code review automation. In this specific example, it might look like this: if the code is not linted, run ESLint with the --fix flag (ESLint is the linter for JavaScript), which goes through all of the source files and, for each of them, applies the linting rules and fixes any stylistic errors. Then it creates a commit - an additional set of file changes on top of what the developer is proposing - creates a new branch with a "-linted" suffix, and pushes to that branch. So if the developer pushes unlinted code, the bot automatically creates a commit which lints everything and pushes it to a new branch, so that if the developer's code was known to be good, the reviewer could simply merge the linted branch instead of the developer's original one. And then we'd fail the pipeline with "lint failed."

This means that the unlinted version can't be merged, but the linted one - assuming all the feedback that isn't lint-related is addressed - could be merged. So we'd have two branches: the one the developer is proposing and the linted one. The code reviewer would look at the one that wasn't linted and say whether it was good or not - whether the logic of the commit was good - and if it was, then the reviewer in GitHub could merge the linted branch instead of the one they were asked to review. This branch would be the same as the original one with an additional commit on top of it.
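As a rough sketch of that bot's logic (hypothetical - a real implementation would run inside your CI provider and handle git credentials properly):

```javascript
// Lint the branch; if it fails, push an auto-fixed "-linted" branch and fail the pipeline.
const { execSync } = require('child_process');

function lintOrCommitBack(branch) {
  try {
    execSync('npx eslint .');                 // throws if there are lint errors (non-zero exit)
    return true;                              // already clean: let the pipeline pass
  } catch {
    execSync('npx eslint . --fix');           // auto-fix whatever ESLint can fix
    execSync(`git checkout -b ${branch}-linted`);
    execSync('git commit -am "Auto-fix lint errors"');
    execSync(`git push origin ${branch}-linted`);
    return false;                             // fail the pipeline: the unlinted branch shouldn't merge
  }
}
```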

So, some examples of linters for various programming languages: for JavaScript, the standard as of 2021 is ESLint, and TypeScript also now uses ESLint. Python has Pylint and Flake8. C++ is much more subjective, but a common choice is Google's cpplint from the Google style guide mentioned above. Go comes with a formatter called gofmt, which acts somewhat like a linter, although there are additional libraries available for rules beyond that. Java has Checkstyle and FindBugs - maybe older options, but there are a lot of choices for languages like Java. And Ruby has RuboCop and Pronto, which we've seen users commonly use.

Java, JavaScript, C#, and many other languages can also be linted with SonarQube, which is a popular static analysis framework commonly used at larger enterprises; it has an open source version that is a good place to start for ten-developer teams that would like to set up static analysis.

And finally, there's a startup called DeepSource that we've talked to, startup to startup, and they're doing all sorts of interesting stuff with static analysis as well. Static analysis is just the practice of looking at source code without running it and finding bugs. So I'd encourage you to look at DeepSource as well.

So, in comparison to most other code automation tools, linters are exceptionally easy to set up. Any team with more than one developer should almost immediately set up a linter to catch obvious bugs like infinite loops: just by looking at the code, the linter would be able to tell you whether there was a common programmatic error.

Automatic linting comes standard with many code editors, so it would be wise to teach developers how to configure their code editors to use the existing linting rules that your team has set up in the CI automation. That way the developers don't have to wait to push their code to get this feedback - they can get the yellow squigglies directly in their editor.

And for teams working on earlier products, linters can help avoid writing unit tests at all. Instead of relying on a test suite, you can often rely on static analysis to find common bugs - like the code not compiling at all, infinite loops, or stylistic problems. This helps small teams, before their product has many users, get feedback without needing to lock things in with tests.

Right, so that's it for linting and code style. We'll see you in the next video.

Let's finish up our discussion of code review automation by talking about ephemeral environments, which are really the latest and greatest when it comes to doing code reviews and helping developers get their changes merged.

Ephemeral environments are temporary environments that contain a self-contained version of the entire application, generally for every feature branch. They're often spun up by a Slack bot, or automatically on every commit using DevOps platforms like LayerCI itself or Heroku.

Ephemeral environments are overtaking traditional CI platforms as the most valuable DevOps code review experience. Because these environments are made on every change, all of the stakeholders, not just developers but also the product people and the designers, can review a change without needing to set up a developer environment or ask to screen share with the developer that proposed it.

So for a more concrete example, let's say a developer is changing something on a website: the front end, the back end, or some other component. And they'd like to get feedback on the proposed change. A code reviewer would look at the code and might not understand what the visual ramifications of that change are. But with ephemeral environments, within the code review view itself, the reviewer would just have to click that button there.

I'll zoom in. So within GitHub, this is what the reviewer would see. They'd see the description and the code change, but also a button to view the ephemeral environment. And when they click that button, it wakes up a version of the website specifically with this proposed change in it, so that the reviewer can actually take a look at things and see whether the change is working well, both visually and workflow-wise.

In general, ephemeral environments sit halfway between development environments and staging environments. At the extreme, staging is entirely replaced by ephemeral environments in something called continuous staging.

What are the benefits of ephemeral environments? Well, the most common reason to adopt an ephemeral environment workflow is that it accelerates the software development lifecycle. Developers can review the results of changes visually, instead of needing to exclusively give feedback on the code change itself. Additionally, developers can share their work with non-technical collaborators such as designers as easily as sharing a link to the proposed version.

So you could post a Slack message like this, saying "could you go to this link and give me feedback," instead of needing to set up a Zoom call and share your screen to get the other person to look at your proposed changes.

The hardest part of setting up ephemeral environments is dealing with state, so dealing with things like databases and microservices. By their nature, ephemeral environments are temporary: they're isolated from production environments and really only last as long as a pull request does. A reviewer should be able to delete a resource in a review, so they should be able to see whether, say, deleting a user still works, without fear of that affecting the production environment. In an early implementation of ephemeral environments, it might make sense to connect API servers with read-only permissions to a staging database. So if you're using AWS, you might have an IAM role that has read-only access to the database. But in that case you wouldn't be able to sign up for the service, for example, because that would require database writes. The end goal should be to have a fresh copy of the database for every commit, so every time a developer proposes a change, they get a new database specifically for their environment that they can do whatever they want in.

An ideal ephemeral database has three attributes. First, it's pre-populated: it contains representative, anonymized data. To pass security audits, PII (personally identifiable information) must be scrubbed from databases used in ephemeral environments. Second, it's undoable: if data is deleted in the course of a review, it should be easy to reset the database to its original state. This is also crucial for running destructive end-to-end tests, which we'll get into later. And third, it's migrated: the database should use the schema currently used in production. It's not very useful to know whether something works with an old version of the schema. One of the most common classes of problems uncovered by ephemeral environments is broken or non-performant database migrations.

Another hard problem to solve with ephemeral environments is the lifecycle: when do you create them, and when do you destroy them? The classic approach is to tie the lifecycle of an ephemeral environment to the lifecycle of a pull request. So if the developer opens a pull request, create an environment for them and keep it running 24/7 until the developer deletes the environment.

The biggest factor to consider there is cost. If each ephemeral environment costs 10% of what production costs, so it's ten times cheaper, and you have 30 open pull requests, you'd be quadrupling your monthly costs. So, you know, that's an expensive developer tool.

Another approach is to create a ChatOps bot that allows creating new environments for a specific branch with a specific timeout. So, for example, the user could type a bot command in the GitHub issue description and that would create an environment, or in Slack the user could do the same thing.

This requires the environment to be provisioned at the time that it's requested, which can be slow, and it's again hard to tell when to delete these. The best approach is to create an ephemeral environment for every change, similar to the pull request workflow, but hibernate them as soon as they're provisioned.

There are only a few providers that do this. One is Heroku, which, with Heroku Review Apps, can turn environments on and off. And the other one is LayerCI, shameless plug, I suppose.

So as users use the environments in LayerCI, they'll be hibernated when they're not in use. You can automate this yourself with memory snapshotting, but it's somewhat involved, so this might be something that's better left to a third party.

And back to that idea of continuous staging.

The idea is to merge staging, ephemeral environments, and the CI pipeline altogether. This is kind of what LayerCI itself primarily sells to our users. As your ephemeral environments become more powerful and easier to create, they approach and overtake many aspects of traditional continuous integration pipelines. If you can set up the website, the back end, and the database, then it's relatively easy to run tests on top of that, because tests are usually much easier to set up than the entire back end. At its logical conclusion, this concept becomes continuous staging, where CI, CD, and ephemeral environments form a single pipeline: a single base sets up all of the requirements for everything, and then that forks off into the unit tests, but also the server, but also the review environment, but also the linters; everything comes from that common base.

If you're going to make ephemeral environments yourself, you should probably budget about a month of engineering time to set them up. If your production environment has many different microservices and many different databases, it'll be relatively difficult to set up an ephemeral environment flow. Large companies like Facebook have set this up for their internal pull requests.

But they have entire teams of infrastructure software engineers that do this. So if you're a smaller company, you might want to stick to a hosted service, again like LayerCI, instead of making it yourself, at least until you have about 20 developers.

And to avoid having to micromanage starting and stopping environments, it's easiest to use a hosted provider. If you're doing just front end development, some popular choices are Vercel and Netlify. But if you're doing full stack deployments, really the only choices available right now are LayerCI itself and Heroku Review Apps. There are some options available in many source code platforms, like GitLab's environments feature, but it's not really a true ephemeral environments feature. So you should explore all of those options and make an informed decision.

So that was ephemeral environments. And that concludes our discussion of code review automation.

Because pull request automation is such a core part of DevOps engineering, let's do another applied tutorial here.

In this example, we'll be setting up ephemeral environments the same way we talked about before, using a hosted platform for the sake of simplicity.

So because we've already set up CI for this repository, we already have our layerfile, which is our CI configuration. However, many CI providers, including LayerCI, Heroku, and others, can set up ephemeral environments, which are small production deployments you can use to evaluate changes live as a reviewer. Again, let's say we're changing the color, this time back from blue to purple, and we'd like someone to be able to efficiently review our change, not just by looking at the test results, but also by looking at the ephemeral environment to do manual QA. In this case, it's actually very easy to set up. Let's go to our web microservice.

Let's create a new file.

Here, we'll make another layerfile so that they run in parallel. We'll say FROM our base layerfile,

and we'll say EXPOSE WEBSITE,

and we'll expose the website running inside the runner itself. LayerCI has this EXPOSE WEBSITE directive, but many other providers have similar functionality that you can set up.
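As a rough sketch, the new layerfile described here is only a couple of lines. Treat the inheritance line and the port below as assumptions for illustration; check your provider's documentation for the exact syntax:

```
# Hypothetical sketch of the second layerfile for the ephemeral environment.
# The FROM path and the port are assumptions; EXPOSE WEBSITE is the LayerCI
# directive mentioned above.
FROM ../Layerfile                      # inherit the base layerfile that builds and starts the app
EXPOSE WEBSITE http://localhost:3000   # expose the web server running inside the runner
```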

And let's jump right to creating a pull request for it.

So here, our code reviewer would not only see the test results; as you can see, those are here, from the initial layerfile.

Let's look at the actual graph to understand better what's going on. Here we have our tests running in the main layer file.

The main layer file has again built all of the services started all of the services. And now it's running our Cypress tests the same way it did in the CI chapter.

But after the Cypress tests run, we'll have a second environment, which is inheriting from the first and that second environment will have a clickable link that can be used for manual QA.

So let's see that here.

The snapshot is done being taken of the tests, which means that the ephemeral environment can start being built.

And now you can see that it's built a staging server button, and you can connect to it here.

So in our actual pull request, now that all of the tests and CI services have passed, we can click the ephemeral environment button as soon as it appears.

We could click the main layerfile details, we could click the services/web ephemeral environment, and we could click View website. What this does is wake up the pipeline which we initially set up to run our tests, but it forwards an internet-visible link to the web server inside. So here we've created a fresh environment specifically for this test, and we can see that the test has run and sent the message.

And we can evaluate this change so we can test that creating channels works for example.

And that in the test channel, it's still possible to send messages.

This means that you don't need 100% test coverage to be able to understand the nuances of a change. For every pull request, you'll be able to spin up a new environment automatically, and then wake up that environment when a review needs to be completed.

And now that we're satisfied that the environment works correctly, we can merge the pull request.

And from now on, all changes which edit the website can be manually reviewed, so the reviewer can check that things work, but also a QA team or a designer or a product manager can check that the change actually changes what it's supposed to. That's it for ephemeral environments. Let's get back to theory.

Welcome to DevOps Academy deployments. In this one, we'll be talking about foundational concepts. Primarily, when you talk about deploying, you're talking about VMs and you're talking about containers, and containers are often also known as Docker. So let's talk about the difference between those two before we talk about deploying anything.

When people talk about DevOps deployments, they're usually talking about deploying to Linux: a large portion of all deployments are to Linux servers, and containers are really only defined in terms of Linux in production, as of right now.

So with all that in mind, let's talk about Linux in the abstract. What Linux really helps you do is take care of four things when you're writing programs. First, it takes care of memory. Programs need memory to do things; memory is also known as RAM. And since you only have a finite amount of it, Linux itself needs to figure out which programs get which sections of memory, so which RAM sticks have which programs running on them.

Linux also takes care of processors. So if you're running two things in parallel, Linux will make sure that the right amount of processors are dedicated to both.

If you've ever run very computationally intensive tasks on a laptop, you might notice that your browser gets laggy. That's because it's not getting enough processor time. So if you're running production workloads, Linux needs to make sure that every program is getting its fair share of processor time to run the actual program.

Then there's disk. Linux takes the files of all programs and allocates space on the disk for them. You might have multiple disks, you might have both spinning disks and solid state drives, and you might even have disks shared across networks. Linux takes care of all of that and makes sure that the right files are on the right disks, and that programs have access to those files.

And finally, there's devices. Even beyond disk, memory, and CPU, there are things like GPUs, which you see often for machine learning, and things like network cards, which you use for connecting to the internet.

Linux needs to take these individual resources and allocate them to processes. So if you have five processes trying to connect to the internet at the same time, but only one network card, Linux needs to make sure that the right messages are sent to the right websites upstream. And the responses are sent to the right programs downstream.

So in a diagram, this is what that would look like. So here we have three programs, Chrome, Notepad, and Spotify. And they're all running in Linux. So this is assuming you have a Linux server running these three programs.

And here you have the four shared resources. So Chrome asks for CPU, and Linux will allocate some of the CPU time to Chrome, and it'll also allocate some to Notepad and some to Spotify. And similarly for all the other shared resources.

So this is great, but there's too much sharing going on. What I mean by that is that programs know about each other. If you had a program that expected a file in your home directory called file.txt, it could create that file, but another program could delete or read that file. So files can be read across programs, and that means those programs can communicate with each other, which isn't always what you want.

So, for example, programmers often use different versions of Python. Python is a popular programming language, and there are two popular versions in use: one is Python 2 and one is Python 3, but they're both called Python. So if the file at /usr/bin/python is a Python 2 executable and you try to run a Python 3 program with it, that program would error, because you'd be using the wrong version of Python to run it. However, some of your programs might need Python 2, and some of them might need Python 3. So here there's cross-talk between programs, in that they're both reading /usr/bin/python but they expect there to be different files there. That's what I mean by programs oversharing sometimes: they need different versions of files at the same place, and so you can't really run both programs at the same time.

Similarly, two web servers might listen on port 80. That's how websites allow you to connect to them. So if you're running two web servers that both expect port 80 to be open, the first one will start correctly, and the second one will crash, saying that port 80 is already in use. These sorts of resource-sharing problems are really where virtual machines and containers shine. They allow you to separate resources like files and ports between programs, so that programs can't step on each other's toes.

So here, if you were running your three programs in containers, it would look remarkably similar. Chrome would be running, but it would be running within a container, and that container would be talking to Linux, which would then allocate the container resources. And similarly for the other programs.

Now, this might not make sense yet. But let's talk about what actually happens when you put something in a container like this. So what happens when there's a container between a program and Linux?

The big change is that each program will get its own version of shared resources like files and network ports.

The container running Chrome might create a file at ~/chrome/cache, while the container running Notepad could try to read that file and see that it doesn't exist. They get different copies of all of the system's files, so they can't talk amongst each other or have conflicting Python versions.

Similarly, if you had two web servers running that both expected to be able to open port 80, one would be able to open port 80 in its container, and the other would be able to open port 80 in its own container. So you can have two programs both thinking that they are the only program listening on port 80, but really they'd be isolated within their own containers.

In Linux, containers work by creating namespaces, which are a Linux feature that groups shared resources together. If you had five processes running together within a Docker container, they'd still be running within Linux itself, but they would not see the other processes, the ones on the main Linux machine. So within the container, if you ran ps aux and counted how many lines of output there were (ps aux is how you see the running processes on a Linux machine), you might see 10, which means there were 10 processes visible to you within the container. But within Linux itself, you'd see hundreds of processes running, including the 10 from this container. So the containers are kind of sandboxed, or namespaced, into a single group of processes, where the processes can't see the files outside of the container, or the processes, or the network ports outside of the container. They only see what's inside their own container.
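You can see that namespacing for yourself, assuming Docker is installed, by comparing the process list on the host with the process list inside a throwaway container:

```bash
# On the host, ps aux lists every process on the machine:
ps aux | wc -l          # typically prints a few hundred lines on a server

# Inside a container, the same command only sees the container's own namespace:
docker run --rm busybox ps aux
# PID   USER     TIME  COMMAND
#     1 root     0:00  ps aux
```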

So essentially what's happening is the programs are asking "what are the contents of /usr/bin/python?", from our example before, and instead of answering truthfully, Linux is answering with the contents of another file. So the container asks "what is /usr/bin/python?", and Docker, if you were using Docker for your containers, would respond with the contents of a file like /var/lib/docker/overlayfs/1/usr/bin/python, which is a totally separate file in the global system. So each container has its own view of the files.

This little deception allows programs to run in parallel, because Linux would respond with different files for each container. One container could have Python pointing at a Python 2 executable, and another container could have Python pointing at a Python 3 executable.

So if that's how containers work, then how do VMs work? How are they different?

Well, VMs are very similar to emulators. If you've ever seen someone running an older video game on a modern computer, they're using a VM.

So the idea for containers was to provide a fake Linux.

The programs within the container don't really know they're running within a container. They see files, but the files are simply pointing at a different place within the real Linux installation. The idea for VMs is to produce fake versions one level below that: to produce fake versions of the CPU, RAM, disk, and devices.

The VM equivalent of Docker is called a hypervisor. It's the program which is in charge of creating the VMs. So when a VM is running something, it corresponds to an instance of the hypervisor within Linux.

The hypervisor might lie to the VM and say there's one SSD attached, one drive that has 50 gigabytes of capacity. But then when the VM writes to that drive, the data would instead go to a file on the host; it wouldn't go to a real drive. So when the VM itself is writing to its drive, it's actually writing through to this file, which is very similar to the file-mapping deception that the container had. But there are some practical differences. The first is that VMs are very powerful. You can use them to run other operating systems, such as macOS or Windows, and different hardware configurations: you can emulate a GameCube or an Apple II within a Linux hypervisor.

In containers, it's the processes that are being lied to; they must still be the sort of thing that would run within Linux itself. But in VMs, there's a nested operating system that generally doesn't know it's not talking to the real hardware. When that OS writes things to its drive, for example, those writes are sent to a file in Linux instead of a physical drive.

So when a process writes to the operating system within the VM, that operating system sends the write to a drive, or what it thinks is a drive, but that write goes through Linux, and Linux actually maps it to a file.

Various benchmarks show that CPUs in VMs are about 10 to 20% slower than in containers. VMs also usually use 50 to 100% more storage, because they need all of the files that an operating system needs, duplicated; containers only need the application files. And finally, VMs use about 200 megabytes more memory for the operating system itself. Again, containers don't need a whole operating system, because it's the processes that are being lied to. So VMs use more memory, they're slower, and they need more storage.

So given these performance benefits, it looks like containers are almost always a better choice, and in most cases they are. However, there are a few cases where VMs are a better choice. Again, VMs means virtual machines.

If you run untrusted, user-supplied code, it's difficult to be confident that it can't escape a container. This has gotten better in recent years, but it's long been a contentious point. Virtual machines are much older and much more mature, so if you're running untrusted code, it's usually a good idea to put it within a VM.

If you're running a Windows or macOS script, so a script that only runs on another operating system, you'd need to use a VM for similar reasons. Or if you're running an old video game that doesn't run in Linux, and you'd like to run it on a Linux computer, you'd need a VM. And vice versa: if you're on another operating system and you'd like to run a Linux program in Windows, for example, you'd usually have to use a VM to run it.

And finally, you can emulate hardware devices like graphics cards with a VM.

So if you're testing that your graphics card works correctly, you could emulate the response that it would give and then test that the operating system is working as expected.

So that's the big difference between VMs and containers. And these are really the two things that you often deploy. So let's go into actual deployment strategies in the next talk. I'll see you there.

Let's keep talking about deployments. Rolling deployments are one of the most popular deployment strategies, and we'll talk about the pros and cons of different deployment strategies throughout this section. Rolling deployments work by starting a new version of the application, sending traffic to the new version to make sure everything's okay, then shutting off the old version, and repeating that until all instances of the old version have been replaced by instances of the new version. I realize I've said "version" many times, so let's look at pictures that will help illustrate the point.

This is the MERN app. MERN stands for MongoDB, Express JS, React, and Node JS. Here, the user's web browser connects to both the front end and the back end, where the front end is the stuff that the user sees, and the back end is the services that provide connections to the database. So if you log in, you're connecting to the back end; if you're just viewing the landing page, you're connecting to the front end. Let's say your app has enough traffic that users will notice if it goes down for a little while. How would you push a new version of the application without causing downtime? This is where rolling deployments come in. The high-level algorithm for a rolling deployment looks like this: create an instance of the new version of the back end, say; wait until it's up, so keep trying to connect to it until you get a satisfactory response; then delete an old instance and route the traffic to the new one. If any instances of the old version still exist, go back to step one and repeat.

In our MERN example, we'd initially see three instances of the initial version and one instance of the new version, and we'd repeat the process until we had three instances of the new version and one instance of the initial version. So here, all of the instances are the back end.

We add an instance of the new back end, and we turn off an instance of the old back end, and we keep repeating that. So as time goes on, the red ones replace the pink ones, and after a few loops of this, the only ones remaining are red: we've added a red one, removed a pink one, added a red one, removed a pink one, where red is the newest version of the application.

So what are the benefits of rolling deployments over other ways of deploying things? Well, they're well supported. Rolling deployments are relatively straightforward to implement, and in most cases they're natively supported by orchestrators. If you've heard of Kubernetes, for example, Kubernetes helps you with this, and AWS's Elastic Beanstalk also supports rolling deployments.

They don't have huge bursts. In another deployment strategy, which we'll talk about, if you had three instances of the back end, you'd need to start six in total to deploy the new version, and then you'd turn off the old three. So the amount of things running doubles for the duration of the deployment, which might be difficult if you have a finite number of servers, for example. It's also not uncommon for services like databases to limit the number of connections. So if you had six instances of the back end connecting to the database, that might be too much load on the database, and that could cause problems.

And rolling deployments are easily reverted. If in the course of an upgrade you notice problems, it's usually easy to reverse the rolling deployment by just going in the opposite direction: removing a red one, adding a pink one. Being able to go in the opposite direction to roll back is an important characteristic of deploying, because things always go wrong.

The downside of rolling deployments is that they can be slow to run. If you have 100 replicas, you're replacing one at a time, and each replacement takes 20 seconds, it would take 2,000 seconds to replace all of the instances, which is quite a long time for a deployment.

This can be mitigated by increasing the number of services being turned on and shut off at a time, which is sometimes called a burst limit, or a rolling deployment size.
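To make that concrete, here's a hedged sketch of what the rolling strategy and its burst limit look like as a Kubernetes Deployment; the names, replica count, and image tag are made up for illustration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 4
  selector:
    matchLabels:
      app: backend
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # the "burst limit": start at most 1 extra new-version pod at a time
      maxUnavailable: 0    # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: backend
          image: example/backend:v2   # bumping this tag triggers the rolling deployment
```

Re-applying the manifest with a new image tag makes Kubernetes start one new pod, wait for it to become ready, terminate one old pod, and repeat until every replica runs the new version.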

The other problem is API compatibility, which is the biggest problem with rolling deployments. If you add a new API endpoint to your back end and consume it in your front end, then since you're not switching them both at the same time, you might have version one of your back end serving a request from version two of your front end. That API wouldn't exist yet, so there'd be errors visible to the user for the duration of the deployment. This can be mitigated with complicated routing techniques, but it's generally better to make APIs backwards compatible, so make version two of the front end compatible with version one of the back end. Rolling deployments are relatively simple to understand and generally well supported. If your users mind when there's downtime, it's an excellent first step to deploy using a rolling deployment strategy. The key programming consideration is to ensure that services can consume both the old version and the new version of other services' APIs. If this contract is violated, users might see errors for the duration of the deployment. Let's talk more about deployment strategies, and we'll go into blue/green deployments.

Another deployment strategy people often see is the blue/green deployment. To set up a blue/green deployment, teams need to disambiguate which services will be deployed on every push and which services will be shared across versions of the application. I'll explain a little bit more of what I mean by that in the next section. A database server would be a shared resource: multiple versions of the app would connect to the server at the same time, and a standard deployment would generally not upgrade or modify the database.

In our MERN example, all of the other services are cluster resources: new versions of them would be deployed on every prod push.

So in a blue/green deployment strategy where you're upgrading the JavaScript of the MERN app, this is what that would look like. There's a blue version and a green version of the application, where each is a fully standalone stack, but where each connects to a shared database, and the database is not part of blue or green; it's a shared resource used by both. Blue/green deployments are so called because they maintain two separate clusters, one named blue and one named green, by convention. If the current version of the application is deployed to blue, we deploy the new version to green and use it as a staging environment to ensure that the new version of the app works correctly before sending users to it.

After we're confident that the new version of the software works correctly, we'd move production load over from blue to green, and then repeat the cycle in the opposite direction. So here, we started with the users being sent to blue, which contains version one of the application. Then we verify that version two works, and after we're certain that it does, we route users to version two. Version one is then unused, so we can shut it off and replace it with version three, make sure it works, and then switch user traffic over to it, over and over again. On the benefits side, blue/green deployments are conceptually very easy to understand. To set them up, you just have to create two identical production environments and send requests to either one or the other, which is relatively simple with services like Amazon Elastic Load Balancing.
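If you were doing that switch yourself with a reverse proxy instead of a managed load balancer, a minimal sketch might look like this; the host names are made up:

```nginx
# nginx acting as the traffic switch between the blue and green clusters
upstream app {
    server blue.internal.example.com:8080;     # current production cluster
    # server green.internal.example.com:8080;  # swap the comments and reload nginx
                                               # to cut traffic over to green
}

server {
    listen 80;
    location / {
        proxy_pass http://app;
    }
}
```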

They're also quite powerful. Longer-running tasks, like downloads, can continue running in the old version of the application after traffic is switched over to the new version. So if a user has an established connection to green, and you've switched everyone else over to blue, that connection can continue finishing whatever it was doing. If a user is watching a video that's being downloaded in its entirety, which might take minutes, that can continue even during a prod push. Additionally, blue/green deployments can be extended to many different workflows, which we'll discuss.

There are a few notable drawbacks to blue/green deployments. It's difficult to deploy a hotfix, for example to revert a change, because the old cluster might still be running longer-running tasks and be unavailable to switch to. So if you have version one of the application, you switch over to version two, and you realize version two is having problems, you might want to push version three very quickly to address those problems. But the cluster that hosted version one would be the only place you could deploy version three to, so you wouldn't be able to do that. It's also finicky to transfer load between the clusters if resources autoscale, which we'll talk about later on: if load is transferred all at once, the new cluster might not have enough resources allocated to serve the surge of requests, because requests went all at once from nothing to peak production load.

And finally, if one cluster modifies the shared service, like adding a column to a table in the database, it may affect the other cluster despite it not being the live one.

So here are some common extensions to blue/green deployments. As I mentioned, they're very extensible, and many teams set up advanced workflows around blue/green deployments to improve stability and deployment velocity. The first idea is a natural extension of blue/green deployments, which I call rainbow deployments, although I don't think there's a standard term for them. Instead of only having two clusters, some teams keep an arbitrary number of clusters: blue, green, red, yellow, and so on. This is useful when you're running very long-running tasks. If you're working on a distributed web scraper and your scraping tasks take days, for example, you might need your clusters to last until the last job is finished, to ensure things continue working as expected. So with rainbow deployments, you'd keep around all of the clusters that are still processing tasks. Or if you're doing something like video encoding for long videos, you don't want to shut off the cluster that's in the middle of encoding a long video, because that work would have to be redone.

In a rainbow deployment, old clusters would only be shut off after all of their long-running jobs are done processing.

Some teams rely heavily on manual QA and don't use continuous deployment. They're often building desktop or mobile apps, which need to be published on longer release cycles. So if example.com is being routed to the blue cluster, it would be relatively simple to deploy a new version of the application to the green cluster and point new.example.com to that. With this setup, the new version of the app could be tested against the production database, in the very environment that will soon become production. Such tests are often called acceptance tests, because they're happening in production, with production data, with no privileged access to the code base. So for a game, you might have the new release, or the APIs for the new release, available, and have your QA testers test that. And after the QA testers give the go-ahead, you can point the game client to the new version and then switch the labels of the two clusters.

Another useful add-on to blue/green deployments, and really to deployments in general, is called the canary deployment. If the new version of your app contains subjective changes, such as edits to the UI, it might be ill-advised to push them to all users at once. Facebook has billions of users, so if even 1% of their users complained about a change, that would be an overwhelming amount of feedback. The changes may break users' workflows and need to be modified or rolled back in response to user feedback. So in the context of a blue/green deployment, a canary deployment would be an extension which routes maybe 5% of user traffic to the new version of the application and checks that those users don't have negative feedback before switching the rest of the users over. So if blue is version one and green is version two, we'd have 95% going to version one and 5% going to version two.

We'd wait to see if anyone on version two complained. If not, then we'd route everyone to version two, shut off version one, and put version three in that cluster.
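Sticking with the nginx sketch from the blue/green discussion, that 95/5 split is just a weighted upstream; again, the host names are placeholders:

```nginx
upstream app {
    server blue.internal.example.com:8080  weight=95;  # version one
    server green.internal.example.com:8080 weight=5;   # version two, the canary
}
```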

So blue/green deployments are a powerful and extensible deployment strategy that works well for teams that are deploying a few times per day. The strategy only really starts being problematic in continuous deployment scenarios, where there are many services being deployed many times per day.

Alright, let's keep talking about deployment. And I'll see you in the next talk.

Continuous deployment can sound daunting, but in many cases it's not actually as difficult as it might seem. Let's take a look back at our current example and how it's deployed.

So, in our README, we've helpfully added this little line, which is how we're currently deploying to production.

If we look at our hosted version, which is hosted at this domain, we can see that the color is still purple, despite having changed the color to blue in the previous video.

The reason that it's still purple is that we haven't pushed to production; we haven't pushed the new version of the code, which contains the blue color.

And oftentimes, requiring human intervention to deploy is simply unfeasible, especially as products scale. So let's run the deployment process manually first, and then let's talk about how to automate it with a continuous deployment system.

So here, we'll use a terminal and simply run the command directly from the README.

This developer computer comes with the SSH key required to deploy; otherwise it would be difficult to disseminate this SSH key to all developers who needed the ability to deploy new versions of code.

Here we can see that it's using Docker Compose to rebuild. We'll talk more about how to set up a Dockerfile later on.

Now, if we refresh the page,

we can see that the deployment has created a new version of the application, which is blue, so it's picked up the color change which was merged in the previous commit.

For continuous deployment, we'd like it to run on merges to the main branch; we don't want to deploy feature branches before they've been reviewed. For that, we can set up a very simple configuration.

We could write this configuration file in any directory, but let's write it in the API directory for now.

So we'll create another layerfile.

We'll inherit from the testing layerfile to make sure that the deployment runs after tests have passed.

And we'll only run the deployment if the branch is the main branch.

If it is the main branch, we'd like to set up a secret, use that secret for our SSH key, and then use that SSH key to run the deployment script.

Let's do that now.

And then, hopefully, that gives us the directives we need to expose the SSH key.

For now, we're exposing the SSH key, which is used to authenticate with the production machine, within the CI process itself.

One other thing we need to do is change the key file's permissions to be more restrictive. This is required for SSH, but it might not be required for other deployment processes.

So now that we have our SSH key within the CI server, and these steps run after tests have passed, all we have to do is copy our command and run it as if it were part of the CI process.

All this configuration does is wait for tests to pass, check that the branch is the main branch, and then use the SSH key to deploy a new version of the application.
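Stripped of any particular CI provider's syntax, the deployment step we just configured boils down to shell logic like this; the variable names, host, and remote command are placeholders rather than the exact ones used in the video:

```bash
#!/bin/sh
# Only deploy on the main branch, after tests have already passed.
if [ "$CI_BRANCH" != "main" ]; then
  echo "Not on main; skipping deployment."
  exit 0
fi

# The SSH key is injected as a CI secret; SSH requires restrictive permissions on it.
echo "$DEPLOY_SSH_KEY" > /tmp/deploy_key
chmod 600 /tmp/deploy_key

# Run the same command a developer would otherwise run by hand from the README.
ssh -i /tmp/deploy_key deploy@prod.example.com './deploy.sh'
```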

Let's create a new pull request with these changes and see how that looks.

We can see that, as before, the ephemeral environment and CI services are being built, but this API service is also being built, the API being the directory which contains our continuous deployment process. Let's take a look at what the pipeline actually looks like in structure.

so here we can see

that the application has been built successfully and is being started, just as in the regular CI-without-CD process. So we're running our continuous integration step, but not our continuous deployment step, in this layerfile.

And we can see the tests are running.

As usual, the test running process requires starting a fake browser. So this takes around 30 seconds.

And then after the tests pass, we'll see that the deployment process runs. So the tests lead to the second level of the graph; these levels are usually called the build stages in a CI/CD system.

So here, we can see

that the step was skipped because the branch was not the main branch, which is exactly what we wanted. However, now if we merge this pull request,

we'll create a new merge commit on the main branch, and we'll run the CI process once again.

Here.

And because this is the main branch, the deployment process itself will be running. So let's take a look at what that looks like.

We're simply loading the environment to run the command right now.

And here, we can see that the deployment is running within ci itself. So instead of needing to run it as an individual developer, you can simply run this SSH command within ci. And this idea of deploying automatically from a CI process is called continuous deployment.

So let's work through that whole process end to end once just to make sure that it's clear on the deployment automation side of things.

So let's change the color, again, for the main landing page, just to make sure that it's visible if a change gets pushed correctly.

And again, we'll change the two colors.

And we'll create a new pull request.

And now our reviewer will have a lot of information about whether this change is good or not. So the reviewer will be able to see both the files changed.

So they'll see all we've done is change a few colors,

we'll be able to look at the CI process itself.

So they'll be able to see that tests are running.

So in particular, they'll see that the application builds and starts successfully, and that tests run against it.

they'll be able to look at an ephemeral environment within minutes of me creating this new change.

And if they approve the change, it'll be shown to users in a short period of time, this whole process will only take about a minute, with the longest part being these automated browser tests.

One by one, these steps should become green.

Again, this is the base, this is the ephemeral environment, this is the continuous deployment process, and this is the aggregate status. The aggregate status is the one that the administrators of the GitHub repository might mark as required, so that the commit can only be merged and shown to users if all of the checks pass. So here we can see that everything has passed. Let's take a look at the ephemeral environment just to double check that the color is the one we want.

We can see from the ephemeral environment that we've changed the color to this rose-ish red. Perhaps this is the color that was desired, so we'll say that this is correct. And the test has passed and successfully posted a message, so we know that the functionality of the application has continued to work after this change.

After we merge it, there'll be an end-to-end test and deploy process for this merge commit.

If we take a look at that,

we'll see that, because we merged to the main branch, the deployment is already running. In production, we're creating a production build, and in a short period of time, the production server should have the latest version of our application running on it. So here it's restarting the production instance.

And the snapshot is being taken. So everything has succeeded; we've successfully pushed. If we go to our website, it is now the shade of red that we changed it to. And that's what an end-to-end CI/CD and ephemeral environment pipeline generally looks like, at a very high level.

Alright, so let's talk more about deployment automation in the next section.

We've talked about deployment strategies, but that's not the only thing in deployments. Deployment strategies help you reduce downtime and deploy in a way that doesn't affect your users. But another key consideration for deployment is making sure that there are enough resources for your containers or VMs, so that if there's a large burst of users, your application doesn't go down.

So let's say you're building a CI system. This hits close to home because, of course, I work at LayerCI, a CI company. Your users would push code, you'd have to spin up runners to run tests against that code, and you'd see bursts of traffic during the users' business hours and significantly less traffic outside of those business hours. For a peak load of 10,000 concurrent runs, you'd need at least 10,000 runners provisioned.

However, at night, outside of peak hours, you wouldn't really need all 10,000 runners, most of them would sit idle.

So your usage might look like this, which is also indicative of a lot of applications: your lowest point is maybe 500 runners required, and your highest point is 10,000 runners required. So you need 20 times more workers at the highest point in the day than at the lowest point in the day.

In an ideal world, you'd be able to create or destroy these runners as necessary: during peak hours you'd create new ones, and in off-peak hours you'd destroy them. That's the idea behind autoscaling.

It's only possible to create and destroy workers like this because of cloud providers. At their enormous scale, it's possible to offer servers cheaply on small one-hour leases. The most popular technology at the time of this post (or this video) is AWS EC2 spot instances, which act exactly like cloud-hosted VMs, with large discounts if you provision them for short periods of time. Another popular technology for autoscaling is Kubernetes horizontal pod autoscaling, which sounds daunting. But since many providers provide Kubernetes out of the box, you can just assume that if you're using Kubernetes and containers, you'll get autoscaling if you configure it correctly. Just to illustrate, if you're using Microsoft Azure as your cloud provider, there are resources for autoscaling VMs and for containers; if you're using AWS, there are again resources for VMs and containers; and if you're using Google Cloud, there are resources for VMs and containers. Autoscaling is usually discussed on the timeline of one-hour chunks of work. If you took the concept of autoscaling to its limit, you'd get serverless: resources that are quickly started and used on the timeline of milliseconds, so one to 100 milliseconds.
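As an illustration, an autoscaling policy for the runner example above might look like this Kubernetes HorizontalPodAutoscaler sketch. The names and thresholds are made up, and a real CI system would more likely scale on queue depth than on CPU:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ci-runners
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ci-runners
  minReplicas: 500       # the overnight baseline from the example above
  maxReplicas: 10000     # the business-hours peak
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add runners when average CPU crosses 70%
```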

For example, a web server might not need to exist at all until a visitor requests the page. Instead, it could be spun up specifically for that request, serve the page and then shut back down.

That's exactly the idea behind serverless. It's almost like taking autoscaling, so provisioning resources as they're required, and doing it very quickly on very small time intervals.

Serverless is primarily used for services that are somewhat fast to start and stateless. You wouldn't run something like a CI job or a CI run within a serverless framework, but you might run something like a web server or a notification service.

Autoscaling is primarily used for services that are slower to start or require state. You'd likely run a CI job within an autoscaled VM or container, and not within a serverless container.

As of 2021, the distinction between the models is becoming quite blurred. Serverless containers are becoming popular, and they often run for upwards of an hour. Serverless containers act exactly like containers, but they're created and turned off in a serverless manner, so in response to a trigger.

Within a few years, it's likely that serverless and autoscaling will converge into a single unified interface. So I'm excited about that; that's going to be the future of deployment.

And that ends our discussion of auto scaling and serverless. I'll see you in the next talk.

Another key concept in deployment automation is service discovery.

A database might be at one IP address, say 10.1.1.1 (chosen arbitrarily), while the web server would be at another IP address, say 10.1.1.2, on port 8080.

And they'd have to discover each other, because the web server needs to talk to the database, and the database might have calls to the web server. This gets even more complicated as you add more copies of your web server or add entirely new services. Again, let's consider the MERN app from elsewhere in the DevOps Academy series.

So you have a web browser, the user themselves is visiting your website. They're connecting to your front end, and they're connecting to your backend making API calls. And your back end is connecting to a database.

Here, there are three services that need to be discovered. The browser needs to learn that example.com corresponds to the front end and that example.com/api corresponds to the back end. And the back end needs to learn that the database is at 10.1.1.3, for example. So the back end needs to know the IP address and port of the database, and the browser needs to know the IP address and port of the back end and the front end. In the very simplest configuration, everything is manually configured: the back end and front end are at static IPs and given host names within DNS (the Domain Name System, which is the mapping of example.com to an IP address on the internet), and the back end is configured to connect to MongoDB at a specific port. So your DNS configuration (this is the CloudFlare configuration page, CloudFlare being a DNS provider) would look like this: if the user visits example.com, send them to this IP address.

And if they visit api.example.com, send them to that IP address. So this is all manually configured; we've just manually put the IP addresses in.

And then within the back end, we'd read an environment variable. Environment variables are a dictionary of key-value pairs that are easily set when you're deploying things. So you'd say: connect to the host given by the environment variable that specifies where MongoDB is,

and then connect on port 27017, which is the default MongoDB port.

And then when you're starting the back end, you just have to specify the IP address that MongoDB is running at.
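In a Node back end, that manual configuration is just a couple of environment variable reads; the variable names here are illustrative:

```js
// backend/db.js - manual service discovery via environment variables
const { MongoClient } = require("mongodb");

const host = process.env.MONGO_HOST || "127.0.0.1"; // set at deploy time
const port = process.env.MONGO_PORT || "27017";     // MongoDB's default port

async function connect() {
  const client = new MongoClient(`mongodb://${host}:${port}`);
  await client.connect();
  return client.db("app");
}

module.exports = { connect };
```

You'd then start the back end with something like `MONGO_HOST=10.1.1.3 node server.js`.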

This configuration is completely fine for simple products. It's difficult to mess up, it's relatively secure, and it doesn't overcomplicate things; you can go pretty far with a simple configuration. Most products could launch an MVP without any service discovery at all.

But you'll know that you need to start caring about service discovery when you see one of the following. You need zero-downtime deployments: you can't hard-code things like this if you want to do rolling deployments, because you can't easily automate where the arrows point to; you can't automatically change the IP address with such a simple deployment strategy. You have more than a couple of microservices: it's going to get hard to remember where they all are. Or you're deploying to several environments: if you have a developer environment, a staging environment, ephemeral environments, and production environments that all have different IP addresses, it's going to get pretty unwieldy to set the IP addresses all over the place. So let's focus on zero-downtime deployments, because they're illustrative of the broader problem. Before that, though, let's talk about reverse proxies, which are another crucial system design and DevOps concept. The idea of a zero-downtime deployment is simple. As we've seen, you start a new version of the back end and front end, you wait until they're up, and then you shut off the old versions. This happens in both rolling and blue/green deployments. However, it's difficult to update the IP addresses in DNS itself. If our rolling deployments required changing these values directly in DNS, that wouldn't work very well, for various reasons. In particular, DNS can take a long time to propagate: users in countries other than the United States, for example, might take days to see the new IP address, and they'd still be trying to connect to the old version.

The solution is to add a web server that acts as the gateway to the front end and back end; we'd be able to change where it points to without changing the DNS configuration itself. Web servers like these are called reverse proxies, and they're really crucial for setting up zero-downtime deployments, and for service discovery itself.

So taking our MERN app and adding this level of complexity, the user's web browser would instead connect to the reverse proxy. The user would ask the DNS system "where is example.com?", and the DNS system would respond with "oh, it's here": the IP address of the reverse proxy. And then the reverse proxy would take the user's request and send it to the appropriate service, the front end or the back end, depending on what the user asked to connect to. And from there, everything else would be the same.

So if you're running a deployment, like a rolling deployment, the proxy could choose which of v1 or v2 of the front end or back end to send the user's request to, and that would just be by changing a configuration file.

And then after your deployment is done, you could turn off version one, and the reverse proxy could route traffic entirely to version two. A straightforward approach is to store the service IPs in a hash table. Implicitly, in the process we just described, we assumed that our reverse proxy would be able to know the IPs of the new versions of our apps, which is exactly the statement of service discovery. And instead of needing to manually tell our reverse proxy where the front end and back end live (where is the IP address of version two of our back end?), it would be convenient if we could automate it.

When the new versions come online, they can update the values for the keys "backend" and "frontend" with their own IPs in this hash table, and then the reverse proxy can watch for changes to the table and use that for routing decisions.

For a very concrete example, which is about as close as we're going to get to code in this set of videos, let's look at this nginx configuration. nginx is a very popular reverse proxy. And it's very commonly used in large tech companies.

And it lets you define where various host names go. If you pointed example.com to the nginx reverse proxy (again, in the picture this would be nginx), the user would think they're connecting to your website, but they'd actually be sending the request to nginx, asking for example.com. nginx would take their request and forward it to your actual front end. So nginx just has to learn where the IP for your front end is, and that's what this configuration would do. So we're telling nginx directly:

take this key from this file, and then use confd, which reads from a hash table and updates the configuration file, and then send the user there. So all you need is a key-value store which has a key for the IP of the current front end version. And then to run your rolling deployment, you'd start a new front end version, check that it was alive, and then change the key in the hash table to point to the new one. confd would pick up that change, replace this value with the new IP of version two of the front end, and then reload nginx, which would change the arrow to the new version of the front end, like this.

That was a lot to deal with, so let's back up a bit. All you'd need to do is update your front end to set the front end key in the hash table to the front end's IP, and then make your back end do the same for the back end location. That way, when the new version of the application starts, it would update the key in the table, and then nginx would start routing users to the new version of the application.

This is what proxy passing means in nginx; that's the proxy_pass directive you see. But this is all very complicated. It's just an illustrative point: if you were to implement this yourself, how would you do it?
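For reference, the rendered nginx snippet that a tool like confd maintains could look roughly like this; the IP and port are placeholders that confd would keep in sync with the key-value store:

```nginx
server {
    listen 80;
    server_name example.com;

    location / {
        # confd rewrites this address and reloads nginx whenever the
        # front end key changes in the key-value store
        proxy_pass http://10.1.1.7:3000;
    }
}
```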

The most common thing used in industry is service discovery using DNS itself. We thought of DNS before as the slow protocol that might take days to propagate changes across the network, but you can also run DNS locally.

And that's the industry standard.

So let's talk about DNS a little bit.

The idea of DNS is just to map host names to IPs. When you visit layerci.com, for example, the global DNS system will first map the name layerci.com to its addresses, which at the time of this video are 104.21.79.86 and 172.67.169.106, just arbitrary computers connected to the internet. And you can use the dig command on a website to see what those addresses are. So this is saying: for the key layerci.com, the values are these two.

And usually when people mention DNS, they mean the global service, so visiting websites on the internet. However, as I mentioned, it's possible to run DNS internally. It would be ideal if, in our nginx configuration, we could specify http://frontend and then have "frontend" resolve to the IP of our front end service. That way we wouldn't have to change anything except for the DNS configuration.
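As a sketch of that idea (not from the video), an nginx configuration could point proxy_pass at a plain service name and let an internal DNS server resolve it; the resolver address 10.0.0.2 and the name "frontend" are assumptions:

```nginx
server {
    listen 80;
    server_name example.com;

    # Internal DNS server that knows where "frontend" currently lives.
    # valid=10s makes nginx re-resolve the name every 10 seconds.
    resolver 10.0.0.2 valid=10s;

    location / {
        # Using a variable forces nginx to resolve "frontend" at request
        # time instead of only once at startup.
        set $frontend_upstream http://frontend;
        proxy_pass $frontend_upstream;
    }
}
```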

That's exactly how DNS-based service discovery works: you configure your services to send their DNS queries to a server you control, and that server returns the current IPs for your services.

So instead of saying mongodb:// followed by the IP address of the MongoDB process, you just say mongodb://mongo, where mongo is a key in the key-value pairs of the DNS that you control.
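In a MERN app, that's just a change to the connection string. Here's a hedged Node.js sketch, where the hostname mongo and the database name app are hypothetical:

```js
// connect.js - assumes the official "mongodb" driver is installed
const { MongoClient } = require('mongodb');

// Without service discovery you would hard-code an IP that changes on
// every deployment, e.g. 'mongodb://10.1.2.3:27017'.
// With DNS-based service discovery, the name stays stable:
const client = new MongoClient('mongodb://mongo:27017');

async function main() {
  await client.connect();                      // resolves "mongo" via your DNS
  const users = client.db('app').collection('users');
  console.log(await users.countDocuments());   // trivial query to prove it works
  await client.close();
}

main().catch(console.error);
```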

Of course, it's not trivial to deploy your own DNS server. In practice, although there are popular options like CoreDNS, the most likely thing you'd do is use a cloud provider's or Kubernetes' internal solution.

So the end result you'd get is something like this: the user's web browser would connect to nginx, thinking it was the website. nginx would ask the DNS provider, "where is the API right now?" The cloud provider would respond with the IP address that, given the deployments (blue/green or rolling), we currently want users to visit when they want to reach the API. And then

this would correspond to version one or version two of the back end. The proxy would forward the request there, the request would be fulfilled, and then the response would go back to the proxy and back to the user.

So the conclusion of all of this is that service discovery is tricky, but vitally important as a foundational building block for these deployment strategies, and for deployment automation in general. If you configure service discovery in an appropriate manner for your deployments (DNS-based in a Kubernetes cluster, for example), it makes it significantly easier for developers to have microservices that talk to each other. Instead of a developer having to write "connect to MongoDB at this IP" and then deal with where MongoDB actually is, they can simply say "connect to MongoDB at mongodb://mongo", and then you, as the DevOps platform engineer, can make sure that mongo always points to the right place, the right IP address. By decoupling the application logic from the deployment logic, you'll help the developers on your team build faster, and you'll be able to deploy more easily. So that's it for deployments. Let's go on to the next and final pillar, which is application performance management.

There aren't that many general topics in application performance management, so this section will be a little bit shorter. We'll go into more detail in future sections in the DevOps Academy, but just for this introductory video series, let's talk about two core concepts. The first is log aggregation: a way of collecting and tagging application logs from many different services into a single dashboard that can easily be searched. Log aggregation is one of the first systems that has to be built out in an application performance management setup. Just as a reminder, application performance management is the part of the DevOps lifecycle where things have been built and deployed, and you need to make sure that they're continuously working, that they have enough resources allocated to them, and that errors aren't being shown to users.

In most production deployments, there are many related events that emit logs across services. At Google, a single search might hit five different services before being returned to the user. If you got unexpected search results, that might mean a logic problem in any of the five services. Log aggregation helps companies like Google diagnose problems in production: they build a single dashboard where they can map every request to a unique ID. So if you search something, your search will get a unique ID, and then every time that search passes through a different service, that service will connect that ID to what it's currently doing.

This is the essence of a good log aggregation platform: efficiently collect logs from everywhere that emits them, and make them easily searchable in the case of a fault. Again, this is our main app: the user's web browser connects to the front end and the back end, and the back end then connects to a database.

If the user told us "the page turned all white and printed an error message", we would be hard pressed to diagnose the problem with our current stack; the user would need to manually send us the error, and we'd need to match it with the relevant logs in the other three services. Let's take a look at ELK, a popular open source log aggregation stack named after its three components: Elasticsearch, Logstash, and Kibana.

If we installed it in our MERN app, we'd get three new services. So the user's web browser, again, would connect to our front end and back end, the back end would connect to Mongo, and all of these services (the browser, the front end, the back end, and Mongo) would send logs to Logstash.

And then the way that these three components of ELK (Elasticsearch, Logstash, and Kibana) work is that all of the other services send logs to Logstash. Logstash takes these logs, which are text emitted by the application. For example, when you visit a web page, the web page might log "this visitor accessed this page at this time", and that's an example of a log message. Those logs would be sent to Logstash, which would extract things from them. So for that log message, "user did thing at time", it would extract the time, extract the message, and extract the user, and include those all as tags. The message would become an object of tags plus a message, so that you could search the logs easily. You could say, find all of the requests made by a specific user.

But Logstash doesn't store things itself; it stores them in Elasticsearch, which is an efficient database for querying text. And Elasticsearch exposes the results to Kibana,

and Kibana is a web server that connects to Elasticsearch and allows administrators (you as the DevOps person, or other people on your team such as the on-call engineer) to view the logs in production whenever there's a major fault.

So you, as the administrator, would connect to Kibana, and Kibana would query Elasticsearch for logs matching whatever you wanted. You could say, "hey Kibana, in the search bar, I want to find errors", and Kibana would ask Elasticsearch to find the messages which contain the string "error". Elasticsearch would then return results that had been populated by Logstash, and Logstash would have assembled those results from the logs of all of the other services. If you visited a web page, this might be the sort of log that is emitted.

And it might be processed into an object like this. So it has a timestamp in a simple date and time format that's the same for all messages emitted by all of the different services, a service field saying which service submitted the log, and the message, the actual content of the log.
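The exact object isn't shown in the transcript, but a processed entry of that shape might look roughly like this (the field names are an assumption about how Logstash is configured):

```json
{
  "@timestamp": "2021-03-01T14:21:07.000Z",
  "service": "backend",
  "message": "user 1234 visited /dashboard"
}
```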

And the processor, Logstash itself, would often be connected to the internet so that JavaScript in the browser can catch errors and send them to Logstash, although there are additional services like Sentry that might be better suited for that. How would we use ELK to diagnose a production problem? Well, let's say a user says "I saw error code 1234567 when I tried to do this." With ELK set up, we'd go to Kibana, enter 1234567 in the search bar, and press enter. That would show us the logs that corresponded to that error, and one of them might say "internal server error, returning 1234567". We'd see that the service that emitted that log was the back end, and we'd see what time that log was emitted at. So we could go to the time in that log, look at the messages above and below it in the back end, and then we could see a better picture of what happened for the user's request.

And we'd be able to repeat this process going to other services until we found what actually caused the problem for the user.

The final piece of the puzzle is ensuring that logs are only visible to administrators. As logs can contain sensitive information like tokens, it's important that only authenticated users can access them; you wouldn't want to expose Kibana to the internet without some way of authenticating. My favorite way of doing this is to add a reverse proxy like nginx (again, our friend nginx) and then have its auth_request mechanism check that the user is logged in. So in our back end, we could add something like this, which simply returns a successful status if the user visits example.com/auth_request and they're an admin, and an unauthorized status if they're not. And then we could configure nginx, as mentioned in previous videos, to have these location blocks: the /private location would connect to the /auth endpoint, and then we could make sure, if this was /logs for example, that the user was logged in. Because of this auth_request directive, if the user visits /logs and they're not an administrator, they wouldn't be able to access the logs. Alternatively, Elasticsearch itself is run by a company called Elastic, and they have a paid version which contains something called X-Pack that facilitates this as well. So you can go for either a reverse proxy which authenticates users, or the paid version of the application.
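As a sketch of the nginx side of that setup (the backend path /auth-check and the Kibana address are assumptions, not taken from the video):

```nginx
# Internal-only location that asks our backend whether the user is an admin.
location = /auth {
    internal;
    proxy_pass http://backend:3000/auth-check;  # returns 200 for admins, 401 otherwise
    proxy_pass_request_body off;                # the auth check doesn't need the body
    proxy_set_header Content-Length "";
}

# Only requests that pass the auth subrequest reach Kibana.
location /logs/ {
    auth_request /auth;
    proxy_pass http://kibana:5601/;
}
```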

As an aside, you can use log aggregation as an extra test. In your CI pipelines, where you want to tell if code is good or not, you can repurpose your log aggregation stack to ensure that no warnings or errors occur while the tests run. If your end-to-end test looks like this (you start your stack, you start your logging stack, you run your tests with npm run test), you could add an extra step which queries Elasticsearch for logs matching "error", and you could make sure that

there are no logs that printed "error". Then, even if all of your tests pass, if an error is occurring somewhere, you'll catch it; that error might be important despite the tests passing. So this adds a free extra check to your CI stack.
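One hedged way to implement that extra step is a tiny Node 18+ script (run as an ES module) that hits Elasticsearch's _count API after the tests finish; the index pattern logs-* and the field message are assumptions about your Logstash setup:

```js
// check-logs.mjs - run with `node check-logs.mjs` after `npm run test`
const res = await fetch('http://localhost:9200/logs-*/_count?q=message:error');
const { count } = await res.json();

if (count > 0) {
  console.error(`Found ${count} log lines containing "error" during the test run`);
  process.exit(1);   // fail the CI job even though the tests themselves passed
}
console.log('No error logs emitted during the tests');
```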

And there are a few examples of log aggregation platforms. There's Elasticsearch, Logstash, and Kibana, which we talked about. Fluentd is another popular open source choice. There's Datadog, a hosted offering which is very commonly used at larger enterprises, and there's LogDNA, another hosted offering. The cloud providers also provide logging facilities, like AWS CloudWatch Logs. So log aggregation is a key tool for diagnosing problems in production. It's relatively simple to install a turnkey solution like ELK or CloudWatch, and it makes diagnosing and triaging problems in production significantly easier. That's it for log aggregation. I'll see you in the next talk.

The last topic we're going to talk about is metric aggregation. Metrics are simply data points that tell you how healthy production is. So as you can see on the screen, things like CPU usage, memory usage, disk I/O, and file system fullness are all important production metrics that you might care about. If log aggregation is the first tool to set up for production monitoring, metrics monitoring would be the second. They're both indispensable for finding production faults and debugging performance and stability problems.

Log aggregation primarily deals with text (logs are textual, of course). In contrast, metric aggregation deals with numbers: how long did something take? Is memory being used?

It's frighteningly difficult to understand what's going on in a production system. Netflix, for example, measures 2.5 billion different time series to monitor the health of their production deployments. Successful metric monitoring means being able to automatically notify the necessary teams when something goes wrong in production.

Let's keep looking at open source implementations of DevOps tools to keep things general. Prometheus, a tool originally developed at SoundCloud, is one of the most popular metrics servers, and this is what it looks like. It's similarly structured: the inputs are sent to the retrieval component. Things like nodes would send how much disk usage they have to the metrics server, but also how long services are taking; ELK itself could parse numbers out of logs and send them to Prometheus. Prometheus figures out what to retrieve from which services by using service discovery, from the previous video. It then takes those measurements and stores them in a time series database, the equivalent for numbers of what Elasticsearch is for text, and that's stored on the Prometheus server node itself.
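To make the retrieval side concrete, a minimal prometheus.yml sketch might look like this; the job names and ports are assumptions, and in a Kubernetes cluster you'd typically replace the static targets with Prometheus's built-in service discovery (kubernetes_sd_configs):

```yaml
# prometheus.yml (sketch) - Prometheus scrapes each target's /metrics
# endpoint every scrape_interval and stores the samples locally.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'backend'
    static_configs:
      - targets: ['backend:3000']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
```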

And then finally, there's a front end, so other services can query Prometheus to do things. One thing you might want to do is, if there's something terribly wrong, like your website being down, connect to PagerDuty, or email someone, or send someone a text message with Twilio, to page the on-call engineer and tell them that something is wrong.

But you might also want to query metrics to get a view like this one, and that's what PromQL is used for. Grafana is the view (that dark view with the graphs), and it's a common way of viewing these time series. But you can make your own, and you can make APIs, and there are many other front ends that connect to Prometheus.

The diagram above is daunting, but it's quite similar to the architecture that we discussed for log aggregation frameworks. There are four key components, like I mentioned: the retrieval component, the time series database that actually stores the measurements, the alert manager, and the web UI.

So what sorts of metrics should we collect? Well, there's a lot of subjectivity about which metrics are important based on what your product does and who your users are, but here are a few ideas for what you might store in something like Prometheus.

First, request fulfillment times. These are very useful for understanding when systems are getting overloaded, or whether a newly pushed change has negatively impacted performance.

These fulfillment times are often parsed out of logs using a regular expression, for example, or taken out of a field in a database.

For a website or REST API, a common request fulfillment time would be time to response.

That way slow web pages could be discovered and identified in production.

A related metric that is very indicative of problems is request counts. If there's a huge spike in requests per second, it's very likely that at least a few production systems will have trouble scaling. Watching request counts can also be used to detect and mitigate attacks like denial of service attacks, which are when attackers send many malicious requests to services in production.

The last common metric across many types of companies is server resources. Here are a few examples. Database size and maximum database size: if you have two terabytes of disk for your database and you're 1.5 terabytes in, you might want to alert someone to increase the amount of disk available for the database, or to delete things that are unused. Web server memory: if your web server is taking a lot of requests per second and doing a lot of processing, it might require more memory; if it runs out of memory, it would crash, and your users wouldn't be able to access your website anymore.

Network throughput: if you're downloading or uploading many things, you can saturate your network, and that would also cause degraded performance. A final one is TLS certificate expiry time. The lock in the browser uses TLS certificates to show whether the connection is secure or not, and these certificates are used all over the place internally. They cause problems if they're not measured and alerted on. For example, Google Voice had an outage in 2021 because Google, of all companies, wasn't measuring when their TLS certificates would expire, and that caused an outage a few months ago.
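As a hedged example of how one of these turns into an automatic alert, a Prometheus alerting rule might look like this; the metric names are hypothetical and depend on which exporters you run:

```yaml
# alerts.yml (sketch) - loaded by Prometheus and routed by the alert manager
groups:
  - name: capacity
    rules:
      - alert: DatabaseDiskAlmostFull
        expr: db_disk_used_bytes / db_disk_total_bytes > 0.75
        for: 15m                      # only fire if it stays above 75% for 15 minutes
        labels:
          severity: page
        annotations:
          summary: "Database disk is more than 75% full"
```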

Production faults very rarely look like "no users can access anything." There's often a gradual ramp: certain APIs take longer and longer, and then eventually everything breaks. Quartile analysis is an easy way to pare down production statistics into something actionable. A website might measure how long it takes for users to fully load the landing page, to notice when there's a very obvious production issue. With quartile analysis, you'd split request times into many different buckets: how long did the slowest 1% of requests take? How long did the slowest 5% take? How long did the slowest 25% take? So if your landing page is slower when users are logged in than when they're logged out, just by visiting without a logged-in user you might not notice that the web page is very slow. But users that are logged in would show up in the slowest-1%-of-requests bucket, and you'd see that those users are having a degraded experience.
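In Prometheus, that kind of bucketed latency analysis is usually written with histogram_quantile. Here's a sketch, assuming your backend exports a request-duration histogram named http_request_duration_seconds:

```
# Response time of the slowest 1% of requests over the last 5 minutes
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# The same idea for the slowest 5% and 25%
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.75, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```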

In one example, stackoverflow.com itself was notified of an outage because their landing page was taking a long time to respond to requests, due to a specific post that was published to Stack Overflow. For metrics analysis, there are many common production tools. There's Prometheus and Grafana, as we mentioned. There's Datadog again, not only for log aggregation but for metrics aggregation as well. There's New Relic, which is, I would say, maybe the old reliable option. And again, the cloud providers have their own versions of this: AWS CloudWatch metrics, Google Cloud Monitoring, and Azure Monitor metrics.

That's it for application performance management. Thanks for watching.