Genetics - freeCodeCamp.org

The Novel Coronavirus Epidemic in China: How to Help Researchers Using Sequence Alignment on 2019-nCoV with MAFFT

freeCodeCamp — Mon, 27 Jan 2020 11:04:41 +0000

By Shen Huang

Novel Coronavirus (2019-nCoV) is a deadly virus that seems to have originated in Wuhan, China. As of January 26, the virus has already caused 76 deaths.

As a coronavirus targeting human respiratory systems, 2019-nCoV is highly infectious – especially during wet and cold seasons.

When people sneeze, they can shoot out respiratory system-related pathogens at a high speed. These can infect humans in many ways – most often through contact with the mouth, nose, and eyes.

To avoid infections, you should avoid outdoor activities – especially in crowded areas. It's also important to sanitize your hands often and not to rub your eyes with your hands.

I'm in China, and my plans for Lunar New Year are now ruined. So I decided to stay home and create this tutorial on how to obtain genetic sequence data of 2019-nCoV and perform a Sequence Alignment on it with MAFFT.

I hope this article raises your interest in bioinformatics research, so you can help scientists fight these viral outbreaks.

What is Sequence Alignment? And what is MAFFT?

Sequence Alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may reveal functional, structural or evolutionary relationships between the sequences. A recent publication used sequence alignment with MAFFT to suggest cross-species transmission from snake to human.

MAFFT (Multiple Alignment using Fast Fourier Transform) is a multiple sequence alignment program published in 2002. You can use it to perform sequence alignment for RNA sequences. Coronaviruses are, for example, viruses with a single-stranded RNA enveloped in a shell derived from the cell membranes of the host.

Where Can You Obtain RNA Sequence Data?

The latest update of 2019-nCoV can be found on NGDC (National Genomics Data Center of China). In this tutorial, we will analyze the 2019-nCoV virus and the SARS-CoV virus found inside the NCBI (National Center for Biotechnology Information) data bank.

SARS-CoV, infamously known as SARS (Severe Acute Respiratory Syndrome), resulted in 774 deaths in 17 reported countries around the year 2003.

Example RNA Sequence Data from NCBI

I have copied and pasted the data into a file with the name of the virus. It should look something like the data in the screenshot above, with an index number followed by codes in a batch size of 10, for a total of 60 codes per line, separated by spaces.

How to Perform Sequence Alignment on 2019-nCoV with MAFFT

First, you need to install MAFFT. You can install it via Anaconda with the following command:

conda install mafft

Manual installation instructions for different operating systems can be found on the MAFFT official website.

MAFFT is fairly easy to use, but it processes data in a special format. You'll need to preprocess your obtained data so that it can be aligned by MAFFT.

Here's the Python script that does this:

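import re
import sys

output = ""
for filename in sys.argv[1:]:
    # Read the raw sequence text copied from NCBI.
    with open(filename) as infile:
        data = infile.read()
    # Keep only the letters a, t, c, g and newlines; this strips
    # the index numbers and the spaces between batches.
    data = re.sub(r"[^atcg\n]", "", data)
    # Each sequence gets a header line: ">" followed by its name.
    output += ">" + filename + "\n" + data + "\n"
print(output)
# Write all sequences into one file for MAFFT to read.
with open("SEQUENCES.txt", "w") as outfile:
    outfile.write(output)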

You can save the above Python code into a file called "preprocess.py", inside the same folder as your virus RNA data. Then you can run the following bash command in that folder to preprocess the data.

python3 preprocess.py 2019-nCoV_HKU-SZ-002a_2020 icSARS-C7-MA

The output file, "SEQUENCES.txt", should now contain each sequence preceded by a header line: a ">" followed by the virus name. The white space and index numbers are stripped off.
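If you want to sanity-check the preprocessed file before aligning, you can read it back and look at the sequence names and lengths. The following is only a minimal sketch: `parse_fasta` is a hypothetical helper name, and it assumes the simple layout produced by the script above (a ">" header line followed by lines of bases).

```python
def parse_fasta(text):
    """Map each '>' header name to its sequence with line breaks removed."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


if __name__ == "__main__":
    # Tiny made-up input, only to show the shape of the result.
    example = ">seq_a\natcg\nggta\n\n>seq_b\nttaa\n"
    for seq_name, seq in parse_fasta(example).items():
        print(seq_name, len(seq))
```

To check your own data, pass the contents of "SEQUENCES.txt" to the function instead of the made-up string.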

Now you can perform Sequence Alignment with MAFFT in your Terminal with the following steps:

Locate your working folder.
Call "mafft" inside your terminal.
For input file, put "SEQUENCES.txt".
For output file, put "output.txt".
Select "1" for "Clustal format" as your output format.
Select "1" for "auto" as your strategy.
Leave all other arguments blank.

Here's a gif of me running this in my terminal:

After you hit enter, you just need to wait for MAFFT to align your RNA codes.
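The interactive prompts can also be skipped by passing options on the command line. The sketch below wraps that call in Python. It assumes that `mafft` is on your PATH and that your MAFFT version accepts the `--auto` and `--clustalout` options; `build_mafft_command` and `run_mafft` are hypothetical helper names, not part of MAFFT.

```python
import subprocess


def build_mafft_command(input_path):
    """Build the argument list for a non-interactive MAFFT run."""
    # --auto lets MAFFT pick a strategy; --clustalout asks for Clustal format.
    return ["mafft", "--auto", "--clustalout", input_path]


def run_mafft(input_path, output_path):
    """Run MAFFT and save the alignment it prints to standard output."""
    with open(output_path, "w") as out:
        subprocess.run(build_mafft_command(input_path), stdout=out, check=True)


if __name__ == "__main__":
    print(" ".join(build_mafft_command("SEQUENCES.txt")))
    # Uncomment once MAFFT is installed:
    # run_mafft("SEQUENCES.txt", "output.txt")
```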

The finished product should look something like the output below:

Note that the "-" is used to shift the codes and "*" is used to highlight similarities between the sequences.
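To make the meaning of the "*" line concrete, here is a small sketch that rebuilds such a marker line from sequences that are already aligned. It is an illustration only, not how MAFFT computes its output; `conservation_line` is a hypothetical name, and the two short sequences are made up.

```python
def conservation_line(aligned):
    """Return '*' for columns where every sequence has the same base, else ' '."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for i in range(length):
        column = {seq[i] for seq in aligned}
        # A column counts as conserved only if it holds one base and no gap.
        marks.append("*" if len(column) == 1 and "-" not in column else " ")
    return "".join(marks)


if __name__ == "__main__":
    rows = ["atg-cg", "atgacg"]
    for row in rows:
        print(row)
    print(conservation_line(rows))
```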

Congratulations, you have just learned how to perform Sequence Alignment with MAFFT! Now you can play with the gene code and take advantage of the alignment information however you like.

Help Wuhan fight off this deadly disease as a developer, data scientist, and more:

https://github.com/wuhan2020/wuhan2020

A bit more about me: I'm a developer who's into all kinds of things. I've written some other fun tutorials like these:

How to create beautiful LANTERNS that ARRANGE THEMSELVES into words

How to drop LEPRECHAUN-HATS into your website with COMPUTER VISION

Want me to write a tutorial about something? Let me know. Happy coding.

The Computer Science of Evolution: an Introduction to Genetic Algorithms

freeCodeCamp — Thu, 11 Apr 2019 20:49:44 +0000

By Ben Mmari

Being a computer scientist with an interest in evolution and biological processes, the topic of genetic algorithms, and more broadly, evolutionary computation is to me what a candy shop is to a 5-year-old: Heaven. The mere possibility of being able to merge two of my interests in such a seamless manner has been extremely exhilarating, and it would be wrong for me to keep this knowledge and excitement all to myself.

So in an attempt to test out some of my learnings thus far, and share my findings with the rest of the world, I have decided to put together a series of articles on this topic.

In this post, I will provide a brief introduction to genetic algorithms and explain how they imitate the same natural processes that have been taking place on Earth for billions of years.

Life on Earth

Over the past 3.5 billion years, mother nature, father time, evolution and natural selection have collaborated to produce all of the specialized forms of life that we see on earth today: the carnivorous Venus Flytrap plant; the ocean-dwelling Atlantic Flying Fish; echolocation-using bats; long-necked giraffes; super-quick cheetahs; dancing Honeybees; and of course, yours truly, the street smart Homo sapiens.

The Venus Flytrap is a carnivorous plant that primarily feasts on insects and arachnids.

Some bats use echolocation to navigate and hunt prey. Contrary to popular belief, bats are not blind; a group of bats known as flying foxes actually have better eyesight than humans.

Flying Fish cannot fly in the same way that birds do. However, these fish can make powerful, self-propelled leaps out of the water, where their long wing-like fins enable them to glide for considerable distances above the water’s surface.

Needless to say, life on Earth is one of the most successful experiments ever run in our universe, if not the most successful; and judging from the impressive outcomes of this experiment, evolution is clearly onto something.

Recently, we humans, just one of the many end products of this process, realized that we could also take advantage of this ingenious approach to progressive problem solving. Since the 1950s, computer scientists, geneticists, mathematicians, and biologists have attempted to mimic these biological processes through computer simulations, with the aim of efficiently producing optimal solutions for difficult, non-trivial problems.

One of the first books I came across that sparked my interest in the field of evolutionary biology was The Blind Watchmaker, by Richard Dawkins. In this book, Richard Dawkins explains how complex mechanisms like echolocation (a process that bats use to navigate, hunt and forage, also known as bio-sonar), complex structures like spiderwebs (which spiders use to attract and catch their prey), and complex instruments like the human eye (those two spherical objects that you are currently using to read this article) are simply the result of thousands, if not millions of years of evolution and adaptation.

The progressive evolution of the human eye. What started off as simple photosensitive cells evolved into a complex instrument that we often take completely for granted. The first animals with anything resembling an eye lived about 550 million years ago. And, according to one [scientist’s](https://www.pbs.org/wgbh/evolution/library/01/1/l_011_01.html) calculations, it would only take 364,000 years for a camera-like eye to evolve from a light-sensitive patch.

Even though these marvels of nature give the impression that they were built with a purpose from the get-go (i.e by a conscious ‘maker’), they are actually just a result of iterations upon iterations of trial and error, bundled up with ever-changing selection pressure (i.e a change in climate, habitat, or the behaviour and capabilities of predators or prey). So while they may look and behave like the outcome of precise, forward-thinking engineering, they are actually the result of a completely blind process, a process that does not know beforehand what the perfect ‘solution’ would be.

What are genetic algorithms and why do we need them?

Genetic algorithms are a technique, based on fundamental biological processes, for generating high-quality solutions to optimization and search problems. These algorithms are used in situations where the possible range of solutions is very large, and where the more basic approaches to problem-solving like exhaustive search/brute force would consume too much time and effort.

The traveling salesman problem asks the following question: “Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?” It is an NP-hard problem in combinatorial optimization.

We can use genetic algorithms to provide high-quality solutions to this problem, at a much lower cost than the more primitive problem-solving techniques, like exhaustive search, which would require you to permute through all possible solutions.
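To get a feel for why exhaustive search becomes impractical, it helps to count the routes. The sketch below assumes a symmetric problem (the distance from A to B equals the distance from B to A) with a fixed starting city, in which case there are (n - 1)! / 2 distinct round trips. `count_routes` is a hypothetical name used only for this illustration.

```python
import math


def count_routes(n_cities):
    """Number of distinct round trips through n_cities with symmetric distances."""
    if n_cities < 3:
        raise ValueError("need at least 3 cities for a meaningful round trip")
    # Fix the starting city, order the rest, and halve for direction.
    return math.factorial(n_cities - 1) // 2


if __name__ == "__main__":
    for n in (5, 10, 15, 20):
        print(n, "cities:", count_routes(n), "routes")
```

For 10 cities that is 181,440 routes; for 20 cities it is already about 6 × 10^16, which is why a search that only samples the space becomes attractive.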

How do genetic algorithms work?

A genetic algorithm works by iterating through a number of steps, up until it reaches a predefined termination point. Each iteration of the genetic algorithm produces a new generation of possible solutions, which, in theory, should be an improvement on the previous generation.

The steps are as follows:

Create an initial population of N possible solutions (the primordial soup)

The first step of the algorithm is to create an initial group of solutions that serve as the base solutions in generation 0. Each solution in this initial population will carry a set of chromosomes, which are made up of a collection of genes, where each gene is assigned to one of the possible variables of the problem. It is important that the solutions in the initial population are created with randomly assigned genes, in order to have a high degree of genetic variation.
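As a rough sketch of this step for the traveling salesman problem, each candidate solution can be a random ordering of the cities, so that each position in the route plays the role of a gene. The function name `random_population`, the `seed` parameter, and the city labels are assumptions made for this example.

```python
import random


def random_population(cities, size, seed=None):
    """Create `size` random routes, each a shuffled copy of the city list."""
    rng = random.Random(seed)
    # rng.sample with k == len(cities) returns a random permutation.
    return [rng.sample(cities, len(cities)) for _ in range(size)]


if __name__ == "__main__":
    for route in random_population(["A", "B", "C", "D"], size=5, seed=42):
        print(route)
```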

Rank the solutions of the population by fitness (survival of the fittest, part 1).

In this step, the algorithm needs to be able to determine what makes one solution more ‘fit’ than another solution. This is determined by the fitness function. The aim of the fitness function is to evaluate the genetic viability of the solutions within the population, placing those with the most viable, favorable & superior genetic traits at the top of the list.

In the traveling salesman problem, the fitness function could be a calculation of the total distance traveled by the solution, where a shorter distance equates to higher fitness.
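A minimal sketch of such a fitness function, assuming each city has (x, y) coordinates and that distance is the straight line between them. The names `route_length` and `rank_by_fitness` and the sample coordinates are hypothetical.

```python
import math


def route_length(route, coords):
    """Total length of the round trip that visits cities in `route` order."""
    total = 0.0
    for i, city in enumerate(route):
        next_city = route[(i + 1) % len(route)]  # wrap around to the origin
        total += math.dist(coords[city], coords[next_city])
    return total


def rank_by_fitness(population, coords):
    """Sort routes so that the shortest (fittest) comes first."""
    return sorted(population, key=lambda route: route_length(route, coords))


if __name__ == "__main__":
    square = {"A": (0, 0), "B": (0, 1), "C": (1, 1), "D": (1, 0)}
    routes = [["A", "C", "B", "D"], ["A", "B", "C", "D"]]
    for route in rank_by_fitness(routes, square):
        print(route, round(route_length(route, square), 3))
```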

Cull the weaker solutions (survival of the fittest, part 2)

In this step, the algorithm removes the less fit solutions from the population. The ‘fittest’ does not necessarily mean the strongest, the fastest or the fiercest, as humans usually tend to assume. Survival of the fittest simply means that the better equipped an organism is to survive in its environment, the more likely it is to live long enough to reproduce and spread its genes onto the next generation.

Ranking and culling (steps 2 and 3) are collectively known as selection.
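One simple way to sketch the culling step is truncation: keep a fixed fraction of an already ranked population. This is only one of several selection schemes, and both the name `cull` and the `survival_rate` parameter are assumptions for this example.

```python
def cull(ranked_population, survival_rate=0.5):
    """Keep the top fraction of a population that is sorted fittest-first."""
    # Always keep at least two solutions so that breeding remains possible.
    keep = max(2, int(len(ranked_population) * survival_rate))
    return ranked_population[:keep]


if __name__ == "__main__":
    ranked = ["best", "good", "ok", "poor", "bad", "worst"]
    print(cull(ranked, survival_rate=0.5))
```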

Breed the stronger solutions (survival of the fittest, part 3)

The remaining solutions are then paired with each other in order to mate and produce offspring. During this process, in its most basic form, each parent contributes a percentage of its genes to each of its offspring (in nature it is a 50/50 split), where P1(G)% + P2(G)% = 100%. The process of determining which of the parents’ genes will be inherited by the offspring is known as crossover.
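In its simplest form, crossover can be sketched as a single cut: the child takes the first part of its genes from one parent and the rest from the other. The sketch below uses strings of genes; `crossover` and its `share` parameter (the fraction contributed by the first parent) are hypothetical names. Note that for problems where a solution must be a valid ordering, such as the traveling salesman problem, a plain cut like this could repeat or drop cities, so an order-preserving variant would be needed there.

```python
def crossover(parent1, parent2, share=0.5):
    """Build a child from the first `share` of parent1 and the rest of parent2."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same number of genes")
    cut = round(len(parent1) * share)
    return parent1[:cut] + parent2[cut:]


if __name__ == "__main__":
    print(crossover("AAAA", "BBBB"))        # 50/50 split
    print(crossover("AAAA", "BBBB", 0.75))  # 75/25 split
```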

Mutate the genes of the offspring (mutation)

The offspring will contain a percentage of the ‘mother’s’ genes and a percentage of the ‘father’s’ genes, and occasionally there will be a ‘mutation’ of one or more of these genes. A mutation is essentially a genetic abnormality, a copying error which causes one or more of the offspring’s genes to differ from the genes it inherited from its parents. In genetic algorithms, a mutation will in some cases increase the fitness of the offspring, and in other cases reduce it.

It is important to note that there does not need to be a mutation with each offspring. The required mutation frequency can also be a parameter of the algorithm.
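A minimal sketch of mutation with a configurable frequency: each gene is independently replaced by a random one with probability `rate`. The names `mutate`, `alphabet`, and `rate` are assumptions for this illustration, and the genes are modelled as characters of a string.

```python
import random


def mutate(genes, alphabet, rate=0.01, rng=random):
    """Replace each gene with a random one from `alphabet` with probability `rate`."""
    return "".join(
        rng.choice(alphabet) if rng.random() < rate else gene
        for gene in genes
    )


if __name__ == "__main__":
    rng = random.Random(7)
    print(mutate("aaaaaaaaaaaaaaaaaaaa", "abcd", rate=0.2, rng=rng))
```

A rate of 0 leaves every offspring untouched, while a rate of 1 replaces every gene and effectively discards what was inherited.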

In genetic algorithms, selection, crossover, and mutation are known as genetic operators.

Termination

Steps 2 to 5 will be repeated up until a predefined termination point. This termination point can be one of the following:

Maximum time/resource allocation reached.
Fixed number of generations have passed.
The fitness of the dominant solution cannot be surpassed by any future generations.
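Putting the steps together, the loop below is a rough sketch of a complete genetic algorithm that tries to evolve a target string from random characters. It stops when either a maximum number of generations has passed or a solution with the highest possible fitness is found. Everything here is an assumption made for illustration: the function names, the parameter defaults, the truncation selection, and the single-cut crossover. It also assumes a target of at least two characters and a population of at least two.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def fitness(candidate, target):
    """Number of positions where the candidate matches the target."""
    return sum(a == b for a, b in zip(candidate, target))


def evolve(target, pop_size=100, keep=0.2, mutation_rate=0.05,
           max_generations=500, seed=0):
    """Return (best_solution, generations_used)."""
    rng = random.Random(seed)
    # Step 1: random initial population.
    population = ["".join(rng.choice(ALPHABET) for _ in target)
                  for _ in range(pop_size)]
    for generation in range(max_generations + 1):
        # Step 2: rank by fitness, fittest first.
        population.sort(key=lambda c: fitness(c, target), reverse=True)
        best = population[0]
        # Termination: perfect solution found or generation budget used up.
        if fitness(best, target) == len(target) or generation == max_generations:
            return best, generation
        # Step 3: cull the weaker solutions.
        survivors = population[:max(2, int(pop_size * keep))]
        # Steps 4 and 5: breed and mutate until the population is refilled.
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, len(target))
            child = mother[:cut] + father[cut:]
            child = "".join(
                rng.choice(ALPHABET) if rng.random() < mutation_rate else gene
                for gene in child
            )
            children.append(child)
        population = survivors + children


if __name__ == "__main__":
    solution, generations = evolve("hello world")
    print(repr(solution), "after", generations, "generations")
```

Because the survivors are carried over unchanged, the best fitness in this sketch never decreases from one generation to the next.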

Solution convergence

Global optimum

In the ideal situation, the fittest solution will have the highest fitness value possible, i.e it will be the optimal solution, meaning that there will be no need to continue with the algorithm and produce further generations.

Local optimum

In some cases, if the parameters of the algorithm are not reasonable, the population may tend towards a premature convergence upon a less optimal solution, which is not the global optimum that we are after, but rather a local one. Once here, continuing the algorithm and producing further generations may be futile.

Global optimum vs local optimum

What would happen if there were no mutations?

At first glance, mutations may seem like an unnecessary, irrelevant part of the process. But without this fundamental aspect of randomness, evolution by natural selection would be completely restricted to the genetic variety set by the initial population, and there would be no new traits introduced into the population after that. This would severely hinder nature’s problem-solving capabilities, and life on earth would not be able to ‘adapt’ to its environment, at least not physically.

If this was the case in our genetic algorithm, at some point in our simulation, the future generations of the population would not be able to explore part of the solution space that their predecessors did not explore. A simulation without any mutations would severely restrict the genetic variation within the population, and in most cases — depending on the initial population — prevent us from ever reaching a global optimum.

Without mutations, we wouldn’t have mutants, and without mutants, we wouldn’t have the X-men franchise.

What would happen if the population size was not large enough?

I was recently at the Jukani Wildlife Sanctuary in Plettenberg, where I had the privilege of meeting a white tiger. He was a truly majestic animal. He was large, he looked ferocious, and he was also 80% blind and getting progressively worse as the years went by.

Why was he blind? Because he is a product of generations of inbreeding. These white tigers are only produced when two tigers that carry a recessive gene controlling coat color are bred together. Thus, in order to ensure the continuation of these tigers in captivity, people have been breeding these tigers within a very limited population in order to either show them off at circuses, parade them at zoos, or keep them as household pets.

But one of the negative effects of inbreeding is that you severely limit the genetic variation within the species, which progressively increases the chances that harmful recessive traits will be passed onto the offspring.

The white tiger that I met at the Jukani Wildlife Sanctuary in April 2019. He looks majestic, but he is suffering.

Even in the wild, inbreeding can still be a massive problem. Over the past few decades, the rhino population in Southern Africa has been significantly impacted due to poaching, and if the population size reaches a low enough number it would mean that maintaining the genetic diversity of these threatened rhino species would be extremely difficult. So even if poaching doesn’t completely lead them to extinction, inbreeding could.

Photo by redcharlie on [Unsplash](https://unsplash.com/photos/xtvo0ffGKlI?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

Charles (Carlos) the II of Spain.

“The Habsburg King Carlos II of Spain was sadly degenerated with an enormous misshapen head. His Habsburg jaw stood so much out that his two rows of teeth could not meet; he was unable to chew. His tongue was so large that he was barely able to speak. His intellect was similarly disabled.”

The Habsburg King Charles II of Spain. His father was his mother’s uncle, making Charles their son, great-nephew and first-cousin respectively.

‘Inbreeding’ in our genetic algorithm essentially means the breeding of solutions that have a very similar genetic makeup, which, thankfully, in this case, would not result in offspring with a predisposition to any physical abnormalities. But if the population is very small and all of the solutions share a very similar genetic makeup, then the fitness of the future generations of the population will be severely restricted, meaning that it could take much longer to converge upon a globally optimal solution, if we even get there at all.

Inbreeding is not always a bad thing; it just depends on which stage of the simulation you are in. In very advanced stages of the simulation, as the population converges towards a global or local optimum, it is very hard to avoid inbreeding, because, in some cases, many of the dominant solutions will be very similar to each other, and thus will share a lot of the same genetic traits.

Wrapping up

Alright, that should cover the basics. If you have any questions, requests, or genetic mutations to contribute, please leave a comment below.

In the next post, we will delve into some code as we look at how each of the genetic operators outlined above plays out in the world of programming. I used the Ruby programming language for the software simulation that I worked on, and in it, I show how in only a few generations, a genetic algorithm can produce a predefined word or phrase from an initial collection of complete and utter gibberish. All of the code will be hosted on GitHub.

Programming the genome with CRISPR

freeCodeCamp — Sun, 18 Feb 2018 18:19:25 +0000

By Josh McMenemy

How scientists edit genomes with the help of computers

CRISPR (pronounced “crisper”) is part of a bacterial immune system evolved to ‘remember’ and remove invading viral DNA.

Its name is short for ‘Clustered Regularly Interspaced Short Palindromic Repeats’. But despite its mouthful of an acronym and complex biological origins, its engineering application is straightforward. To get started, there is only one protein you need to understand — Cas9.

Cas9 searches for a specified DNA sequence and cuts it by breaking both strands of the DNA molecule. This protein is useful to researchers because they can ‘program’ it to target any DNA sequence. An sgRNA (‘single guide’ RNA) molecule determines the sequence that Cas9 binds to. RNA is a biological molecule, similar to DNA, that can bind to proteins and DNA.

sgRNAs are short sequences with a constant region and variable region. The constant region attaches the sgRNA to the Cas9 protein. The variable region causes Cas9 to bind to the DNA sequence that complements it (see the diagram below).

The Cas9 protein bound to the DNA when the PAM sequence is on the forward (top) strand. The bold sequence is the target sequence, the green sequence is the sgRNA, and the three blue characters are the PAM. The triangles show where Cas9 will cut the DNA.

Making sgRNA is cheap and fast. This allows researchers to quickly set up a Cas9 experiment that cuts any DNA sequence. Well, not actually any sequence. There is a small constraint: the target sequence must be flanked by the correct PAM (protospacer adjacent motif) — a short sequence of DNA.

Streptococcus pyogenes is an infectious species of bacteria. In the version of Cas9 it produces, the PAM motif is ‘NGG’, where N is any nucleotide (the ‘letters’ that make up DNA).

Luckily, the motif ‘NGG’ occurs roughly once every 42 basepairs in the human genome. This means that researchers can find a target site near almost every sequence of interest.

Depending on the experimental set up, these cuts in the DNA can either cause a random change or a precise change to the DNA sequence (more on this later).

Before jumping into writing this program, I recommend studying the Cas9 diagram below.

The Cas9 protein bound to a DNA sequence when the PAM sequence is on the reverse (bottom) strand.

Note that DNA and RNA have a directionality based on their chemical structure. One end of the molecule is referred to as the 5′ (‘five-prime’) end, and the other is referred to as the 3′ (‘three-prime’) end. This is important, because the sequence 5′-AGG-3′ is not the same as 3′-AGG-5′.

By convention, DNA and RNA sequences are assumed to be written 5′ to 3′ unless otherwise marked. Sequences read in the 5′ to 3′ direction are called ‘forward’ sequences. Sequences read the other way (3′ to 5′) are called ‘reverse’ sequences. This is an arbitrary convention.
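Because the two strands of DNA pair up in opposite directions, the sequence of the opposite strand, written 5′ to 3′, is the reverse complement of the sequence you are looking at. A minimal sketch, assuming uppercase input containing only A, C, G and T (`reverse_complement` is a name chosen for this example):

```python
# Translation table mapping each base to its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand of `seq`, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AGG"))  # the strand opposite 5'-AGG-3'
```

This is why an ‘NGG’ motif on one strand shows up as ‘CCN’ when you read the other strand.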

The diagram above shows an example of Cas9 bound when the PAM is on the reverse (bottom) strand.

Your first CRISPR program

The scenario

A scientist has a DNA sequence of interest and wants a list of all CRISPR targets contained in the sequence. Finding every target by hand is tedious and error-prone.

The scientist wants a simple program where they can input a DNA sequence and have all possible Cas9 target sites returned. The scientist would also like the cut position and PAM sequence for each target site.

EXAMPLE INPUT (from Figure 1): 'CCACGGTTTCTGTAGCCCCATACTTTGGATG'

EXAMPLE OUTPUT:

[
  {
    'cut_pos': 6,
    'pam_seq': 'TGG',
    'target_seq': 'GTATGGGGCTACAGAAACCG',
    'strand': 'reverse'
  },
  {
    'cut_pos': 22,
    'pam_seq': 'TGG',
    'target_seq': 'GTTTCTGTAGCCCCATACTT',
    'strand': 'forward'
  }
]

First, how do we find CRISPR targets in the sequence? Remember that the Cas9 protein can bind anywhere there is a ‘NGG’ motif.

The first step is to loop through the sequence looking for matches. When the program finds an ‘NGG’ match, the cut position is three positions before the start of the PAM site, since that is where Cas9 cuts the DNA.

Then, we want to record the twenty basepairs before the PAM as the target sequence. Sounds good?

Well, the algorithm described above would actually miss about half of all CRISPR sites — because DNA is double stranded. This means if a ‘CCN’ is the sequence on the forward strand, then ‘NGG’ is the sequence on the reverse strand.

The program must also search for ‘CCN’ using similar logic for the reverse strand.

Example program
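The following is a minimal sketch of such a program, written to reproduce the example input and output from the scenario above. It is one possible implementation under stated assumptions, not a reference tool: the names `find_targets` and `reverse_complement` are chosen for this example, positions are 0-based indexes into the forward strand, the target is assumed to be 20 basepairs long, and sites too close to the ends of the sequence to fit a full target are skipped.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand of `seq`, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_targets(seq, target_len=20):
    """Find Cas9 target sites with an NGG PAM on either strand."""
    seq = seq.upper()
    targets = []
    # Forward strand: the target sits just before an NGG PAM.
    # A lookahead is used so that overlapping PAMs are all found.
    for match in re.finditer(r"(?=([ACGT]GG))", seq):
        pam_start = match.start()
        if pam_start >= target_len:
            targets.append({
                "cut_pos": pam_start - 3,
                "pam_seq": match.group(1),
                "target_seq": seq[pam_start - target_len:pam_start],
                "strand": "forward",
            })
    # Reverse strand: NGG there appears as CCN on the forward strand,
    # and the target follows the CCN when read left to right.
    for match in re.finditer(r"(?=(CC[ACGT]))", seq):
        pam_end = match.start() + 3
        if pam_end + target_len <= len(seq):
            targets.append({
                "cut_pos": pam_end + 3,
                "pam_seq": reverse_complement(match.group(1)),
                "target_seq": reverse_complement(seq[pam_end:pam_end + target_len]),
                "strand": "reverse",
            })
    return sorted(targets, key=lambda t: t["cut_pos"])


if __name__ == "__main__":
    for site in find_targets("CCACGGTTTCTGTAGCCCCATACTTTGGATG"):
        print(site)
```

Running it on the example input returns the two sites listed in the example output: the reverse-strand site with a cut position of 6 and the forward-strand site with a cut position of 22.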

Not all CRISPR targets are equal

When CRISPR was first catching on, researchers would often pull up a sequence on their computer and pick targets by hand. Designing the optimal sgRNA has now become much more complex. Below are brief introductions to this complexity.

Off-targets

Researchers soon realized that Cas9 would sometimes bind and cut at loci that did not exactly match the target sequence. These off-target cuts would cause unintended changes in a researcher’s experiment (or potentially a patient’s genome in the case of a therapy).

To design a good guide, a program must look at the entire genome (which is approximately 3 billion nucleotides for humans) to calculate an off-target score. Researchers have also recently engineered the Cas9 protein to have less off-target activity.
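As a heavily simplified illustration of the idea, the sketch below scans a sequence for sites that differ from a guide by at most a few mismatches. It is not how real off-target scoring works: it only looks at the forward strand, ignores the PAM, and treats every mismatch equally, and the names `mismatches` and `near_matches` are hypothetical. A practical tool would need an indexed search over the whole genome.

```python
def mismatches(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def near_matches(guide, genome, max_mismatches=3):
    """List (position, site, mismatch_count) for sites close to the guide."""
    hits = []
    for start in range(len(genome) - len(guide) + 1):
        site = genome[start:start + len(guide)]
        distance = mismatches(guide, site)
        if distance <= max_mismatches:
            hits.append((start, site, distance))
    return hits


if __name__ == "__main__":
    # Short made-up sequences, only to show the shape of the result.
    for hit in near_matches("ACGT", "ACGTTTACGA", max_mismatches=1):
        print(hit)
```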

Knockout

When Cas9 binds, it creates a cut by making a double-strand break in the DNA molecule. Most of the time, a cell can repair this break through a biochemical pathway (called non-homologous end joining, or NHEJ).

This pathway is not always perfect, and sometimes when Cas9 cuts, the repair process makes a small insertion or deletion in the DNA sequence. In a protein-coding region of DNA, these small insertions and deletions cause a frameshift mutation — which will often disrupt the protein’s function.
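To see why a small deletion is so disruptive, recall that a protein-coding sequence is read three bases (one codon) at a time. The sketch below splits a short made-up sequence into codons before and after deleting a single base; `codons` is a hypothetical helper, and any incomplete codon at the end is dropped.

```python
def codons(seq):
    """Split a sequence into complete three-base codons, reading from the start."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


if __name__ == "__main__":
    original = "ATGGCCTAA"
    edited = original[:3] + original[4:]  # delete one base after the first codon
    print(codons(original))
    print(codons(edited))
```

After the deletion, every codon downstream of the edit is read differently, even though only one base changed.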

Researchers will often knock out a gene to figure out how a protein affects a specific cell function or phenotype. Creating a knockout edit adds extra constraints to the sgRNA design, because now the guide must land in the coding region of the gene.

Editing

Instead of knocking out a gene, there are many times a scientist wants to make a precision edit. This is especially useful when trying to correct a disease-causing mutation. The best way to do this is still being researched. Most methods involve adding an extra donor piece of DNA.

On-target score

Some sgRNA sequences will cause Cas9 to cut better than others. Researchers have compared cutting efficiency across thousands of Cas9 targets to create predictive models of a sgRNA’s cutting efficiency.

Microsoft even supports an open source repository for ‘Machine Learning-Based Predictive Modeling of CRISPR/Cas9 guide efficiency’.

Other CRISPR-Cas systems

Researchers have discovered CRISPR-Cas systems in other bacteria. These other systems have different PAMs.

Final notes

Hope you learned something new! If you want to learn more about the biology, medical applications, commercial applications, or ethical implications of CRISPR-Cas genome engineering, then I recommend reading A Crack in Creation by Jennifer Doudna and Samuel Sternberg. Jennifer Doudna is one of the original discoverers of CRISPR’s underpinnings.

About the Author

I was previously an undergraduate researcher in the Gersbach Lab at Duke University, and I am currently a Software Engineer at Synthego.