combinatorics - freeCodeCamp.org

Permutation vs Combination: What is the Difference Between the Permutation Formula and the Combination Formula?

freeCodeCamp — Sat, 21 Dec 2019 21:46:28 +0000

By Neil Kakkar

Here's the short version.

Let's take ringing bells in a church as an example.

A permutation is an ordering of the bells. You're figuring out the best order to ring them in.

A combination is the choice of bells. You're choosing the bells to ring. If you have too many bells, you'd first choose them, and then think about ordering them.

This gives rise to the familiar identity: (n P r) = (n C r) * r!

The way to order r items out of n is to first choose r items out of n, and then order those r items (r! ways).

And, this means (n P r) = n! / (n-r)! and (n C r) = n! / ( (n-r)! * r! )

But do you want to know how to remember this forever?

I'm a big fan of first principles thinking. To understand a problem, get to the core of it, and reason up from there.

Not doing this is usually the source of confusion: if I don't understand how things work, I don't know where to hang the concepts. My mental framework isn't complete, so I decide to just remember it.

As you can imagine, this isn't ideal. So, from time to time, I indulge myself in an exercise of deriving things from the source, and building intuition for how things work.

This time around, we're building intuition for permutations and combinations.

For example, do you know why the formula for a combination is (n C r)? Where did this come from? And why are factorials used here?

Let's begin at the source. Factorials, Permutations, and Combinations were born out of mathematicians playing together, much like how Steve Jobs and Steve Wozniak founded Apple playing together in their garage.

Just like how Apple became a full fledged profitable company, the simple factorial, !, became the atom of an entire field of mathematics: combinatorics.

Forget everything, let's start thinking from the bottom up.

The first known interesting use case came from Churches in the 17th century.

Have you wondered how the bells are rung in churches? There's a machine that "rings" them in order. We switched to machines because the bells are too big. Also, there are tons of bells.

How did people figure out the best sequence to ring them in? What if they wanted to switch things up? How could they find the best sound? Each bell tower had up to 16 bells!

You couldn't change how quickly you could ring a bell - the machines only rang one bell every second. The only thing you could do was change the order of the bells. So, this challenge was about figuring out the best order.

Could we, on the way, also find out all the possible orders? We want to know all possible orders to figure out if it's worth trying them all.

A bell ringer, Fabian Stedman, took up this challenge.

He started with 2 bells. What are the different orderings he could ring these bells in?[1]

1 and 2. or 2 and 1.

This made sense. There was no other way.

How about with 3 bells?

1, 2, and 3.
1, 3, and 2.

Then starting with the second bell,

2, 1, and 3.
2, 3, and 1.

Then starting with the third bell,

3, 1, and 2.
3, 2, and 1.

Total, 6.

He then realised this was very similar to two bells!

If he fixed the first bell, then the number of ways to order the remaining two bells was always two.

How many ways could he fix the first bell? Any of the 3 bells could be the one!

Okay, he went on. He then reached 5 bells.

This is when he realized doing things by hand is unwieldy. You only have so much time in the day. You've got to ring bells; you can't be stuck writing out all the possible orderings. Was there a way to figure this out quickly?

He went back to his insight.

If he had 5 bells, and he fixed the first bell, all he had to do was figure out how to order 4 bells.

For 4 bells? Well, if he had 4 bells, and he fixed the first bell, all he had to do was figure out how to order 3 bells.

And he knew how to do this!

So, ordering of 5 bells = 5 * ordering of 4 bells.

Ordering of 4 bells = 4 * ordering of 3 bells

Ordering of 3 bells = 3 * ordering of 2 bells.

.. You see the pattern, don't you?

Fun Fact: This is the key for a programming technique called recursion.

He did too. Although, it took him much longer, since no one near him had already discovered this.[2]

Thus, he figured out that the ordering of 5 bells = 5 * 4 * 3 * 2 * 1.
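Stedman's reduction translates almost word for word into a recursive function. This is only a small sketch to illustrate the idea; the function name `orderings` is made up for this example.

```python
def orderings(bells: int) -> int:
    """Number of ways to order `bells` distinct bells."""
    if bells <= 1:
        # One bell (or none) can only be rung one way.
        return 1
    # Fix the first bell (`bells` choices), then order the rest.
    return bells * orderings(bells - 1)


print(orderings(3))  # 6, the six orderings listed by hand above
print(orderings(5))  # 120 = 5 * 4 * 3 * 2 * 1
```

Each call hands a smaller version of the same problem to itself until it reaches a case it can answer directly.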

This ordering formula, in 1808, came to be known as the factorial.

We think of the factorial notation as the base, but the idea existed long before it had a name. It was only when the French mathematician Christian Kramp noticed it being used in a few places that he named it the factorial.

This ordering of bells is called a permutation.

A Permutation is an ordering of items.

When learning something, I think it helps to look at things from every different angle, to solidify understanding.

What if we tried to derive the formula above directly, without trying to reduce the problem to a smaller number of bells?

We have 5 spaces, right?

How many ways can we choose the first bell? 5, because that's the number of bells we have.

The second bell? Well, we used up one bell when we placed it in the first position, so we have 4 bells left.

The third bell? Well, we've chosen the first two, so there are only 3 bells left to choose from.

The fourth bell? Only 2 bells left, so 2 options.
The fifth bell? Only 1 left, so 1 option.

And there we have it, the total number of orderings is 5 * 4 * 3 * 2 * 1

Thus, we have our first general formula.

The number of ways to order N items is N!
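If you'd rather check this by brute force than trust the argument, here is a quick sketch using `itertools.permutations` from Python's standard library, which writes out every ordering explicitly.

```python
from itertools import permutations
from math import factorial

# The six orderings of three bells, in the same order as the list above.
three_bells = list(permutations([1, 2, 3]))
print(three_bells)
# [(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]

# Counting the orderings one by one agrees with N! for small N.
for n in range(1, 7):
    count = sum(1 for _ in permutations(range(n)))
    assert count == factorial(n)
```

Listing orderings only works for small N, which is exactly why the formula is worth having.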

The Permutation

Now, we're faced with a different problem. The king ordered new bells to be made for every church. Some are nice, some are okay, some will make you go deaf. But every one is unique. Each makes its own sound. A deafening bell surrounded by nice bells can sound majestic.

But, our bell tower still holds 5 bells, so we need to figure out the best ordering out of 8 bells that the skilled bell makers made.

Using the above logic, we can proceed.

For the first bell, we can choose any of the 8 bells.

For the second bell, we can choose any of the remaining 7 bells... and so on.

In the end, we get 8 * 7 * 6 * 5 * 4 possible orderings of 8 bells in 5 spaces.
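Here is a short sketch of the same count done two ways: multiplying the shrinking number of choices, and listing every arrangement with `itertools.permutations`. The variable names are just for illustration.

```python
from itertools import permutations

bells_made = 8
tower_slots = 5

# Multiply the number of choices for each slot: 8 * 7 * 6 * 5 * 4.
by_formula = 1
for choices_left in range(bells_made, bells_made - tower_slots, -1):
    by_formula *= choices_left

# Brute force: list every way to fill 5 slots from 8 distinct bells.
by_counting = sum(1 for _ in permutations(range(bells_made), tower_slots))

print(by_formula, by_counting)  # 6720 6720
```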

If you're familiar with the formula version of (n P r), which is n! / (n-r)!, don't worry, we'll derive that soon enough, too!

One bad way to derive it is to multiply both the numerator and denominator by 3! in our example above:

we get (8 * 7 * 6 * 5 * 4) * (3 * 2 * 1) / (3 * 2 * 1) = 8! / 3!.

But this doesn't help us understand why this formula works. Before we get there, let's have a look at choosing things, or the Combination.

The Combination

Now that we know how to order things, we can figure out how to choose things!

Let's consider the same problem. There's a belltower with 5 bells, and you have 8 bells. However, right now, you don't want to figure out the order of bells (remember that's what a permutation is).

Instead, you want to choose the 5 best bells, and let someone else with better taste in music figure out the ordering. In effect, we're breaking the problem down into two parts: First, we figure out which bells to choose. Next, we figure out how to order the chosen bells.

How do you choose the bells? This is the "combination" from permutations and combinations.

The combination is a selection. You're being selective. You're choosing 5 bells out of 8 your craftsmen have made.

Since we know how to order bells, we're going to use this information to figure out how to choose bells. Sounds impossible? Wait till you see the beautiful math involved.

Let's imagine all the bells are in a line.

Before finding all the ways to choose the bells, let's focus on one way to choose bells.

One way is to choose any 5 at random. This doesn't help us solve the problem much, so let's try another way.

We put the bells in a line, and choose the first 5. This is one way to choose the bells.

Notice that, even if we switch positions of the first 5 bells, the choice doesn't change. They're still the same one way to choose 5 unique bells.

This is true for the last three bells as well.

Now, the beautiful math trick: for this one way to choose the 5 bells, what are all the orderings of 8 bells where we choose exactly these 5 bells? From the line-up above, it's all the orderings of the 5 bells (5!) combined with all the orderings of the remaining three bells (3!).

Thus, for every single way to choose 5 bells, we have (5! * 3!) orderings of 8 bells.

What are the total possible orderings of 8 bells? 8!.

Remember, for each choice of first 5 bells, we have (5! * 3!) orderings of 8 bells which give the same choice.

Then, if we multiply the number of ways to choose the first 5 bells with all the possible orderings of one choice, we should get the total number of orderings.

Ways to choose 5 bells * orderings of one choice = Total orderings

So,

Ways to choose 5 bells = the total possible orderings / total orderings of one choice.

In math, that becomes:

(8 C 5) = 8! / ( 5! * 3!)

Lo and behold, we've found an intuitive explanation for how to choose 5 things out of 8.
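The grouping argument can be checked directly. The sketch below lists all 8! orderings, groups them by which bells land in the first 5 positions, and confirms that every group has the same size. Numbering the bells 0 to 7 is just a convenience for the example.

```python
from collections import Counter
from itertools import permutations
from math import factorial

# Every ordering of 8 bells, grouped by which bells land in the first 5 spots.
groups = Counter(frozenset(order[:5]) for order in permutations(range(8)))

orderings_per_choice = set(groups.values())
print(orderings_per_choice)  # {720}, i.e. 5! * 3!
print(len(groups))           # 56 distinct choices of 5 bells

assert orderings_per_choice == {factorial(5) * factorial(3)}
assert len(groups) == factorial(8) // (factorial(5) * factorial(3))
```

Because every choice accounts for exactly 720 orderings, dividing the 40,320 total orderings by 720 gives the number of choices.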

Now, we can generalize this. If we have N things, and we want to choose R of them, it means we draw a line at R.

Which means the remaining items will be N-R. So, for one choice of R items, we have R! * (N-R)! orderings which give the same R items.

For all ways to choose R items, we have N! / (R! * (N-R)!) possibilities.

The number of ways to choose r items out of n is (n C r) = n! / (r! * (n-r)!)

In colloquial terms, (n C r) is also pronounced n choose r, which helps solidify the idea that combinations are for choosing items.

The Permutation - revisited

With the combination done and dusted, let's come back to Part 2 of our job. Our dear friend chose the best 5 bells by figuring out all possible combinations of 5 bells.

It's our job now to find the perfect melody by figuring out the number of orderings.

But, this is the easy bit. We already know how to order 5 items. It's 5!, and we're done.

So, to permute (order) 5 items out of 8, we first choose 5 items, then order the 5 items.

In other words,

(8 P 5) = (8 C 5) * 5!

And if we expand the formula, (8 P 5) = (8! / ( 5! * 3!)) * 5!

(8 P 5) = 8! / 3!.

And, we've come full circle to our original formula, derived properly.

The number of ways to order r items out of n is (n P r) = n! / (n-r)!
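Python's standard library has both counts built in as `math.perm` and `math.comb` (available from Python 3.8), so the formulas and the identity can be spot-checked for small values. This is a sanity check, not a proof.

```python
from math import comb, factorial, perm  # comb and perm need Python 3.8+

for n in range(0, 10):
    for r in range(0, n + 1):
        # (n P r) = n! / (n-r)!
        assert perm(n, r) == factorial(n) // factorial(n - r)
        # (n C r) = n! / (r! * (n-r)!)
        assert comb(n, r) == factorial(n) // (factorial(r) * factorial(n - r))
        # Choose first, then order: (n P r) = (n C r) * r!
        assert perm(n, r) == comb(n, r) * factorial(r)

print(comb(8, 5), perm(8, 5))  # 56 6720
```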

Difference between permutation and combination

I hope this makes the difference between permutations and combinations crystal clear.

Permutations are orderings, while combinations are choices.

To order N elements, we found two intuitive ways to figure out the answer. Both lead to the answer, N!.

In order to permute 5 out of 8 elements, you first need to choose the 5 elements, then order them. You choose using (8 C 5), then order the 5 using 5!.

And the intuition for choosing R out of N is figuring out all the orderings (N!) and dividing by orderings where the first R and last N-R remain the same (R! and (N-R)!).

And, that's all there is to permutations and combinations.

Every advanced permutation and combination uses this as a base. Combination with replacement? Same idea. Permutation with identical items? Same idea, only the number of orderings changes, since some items are identical.
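As a small taste of those two cases, here is a sketch that counts them by brute force and compares against the standard closed forms. The formulas in the comments are the usual textbook results rather than something derived in this post, and the word "BELL" and the numbers 5 and 3 are arbitrary examples.

```python
from itertools import combinations_with_replacement, permutations
from math import comb, factorial

# Orderings with identical items: "BELL" has two identical L's, so the
# 4! orderings collapse into 4! / 2! distinct ones.
distinct = set(permutations("BELL"))
assert len(distinct) == factorial(4) // factorial(2)  # 12

# Combinations with replacement: choose 3 bells from 5 types, repeats allowed.
# The standard count is C(n + r - 1, r).
choices = list(combinations_with_replacement(range(5), 3))
assert len(choices) == comb(5 + 3 - 1, 3)  # 35
```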

If you're interested, we can go into the complex cases in another example. Let me know on Twitter.


End notes

[1] This is how I imagine he figured things out. Don't take it as a lesson in history.
[2] Indian mathematicians had already discovered this in the 12th century, centuries before him.

An introduction to clustering algorithms

freeCodeCamp — Tue, 28 Mar 2017 16:44:07 +0000

By Peter Gleeson

Take a look at the image below. It’s a collection of bugs and creepy-crawlies of different shapes and sizes. Take a moment to categorize them by similarity into a number of groups.

This isn’t a trick question. Start with grouping the spiders together.

Images via Google Image Search, labelled for reuse

Done? While there’s not necessarily a “correct” answer here, it’s most likely you split the bugs into four clusters. The spiders in one cluster, the pair of snails in another, the butterflies and moth into one, and the trio of wasps and bees into one more.

That wasn’t too bad, was it? You could probably do the same with twice as many bugs, right? If you had a bit of time to spare — or a passion for entomology — you could probably even do the same with a hundred bugs.

For a machine though, grouping ten objects into however many meaningful clusters is no small task, thanks to a mind-bending branch of maths called combinatorics, which tells us that there are 115,975 different possible ways you could have grouped those ten insects together.

Had there been twenty bugs, there would have been over fifty trillion possible ways of clustering them.

With a hundred bugs — there’d be many times more solutions than there are particles in the known universe.

How many times more? By my calculation, approximately five hundred million billion billion times more. In fact, there are more than four million billion googol solutions (what’s a googol?).

For just a hundred objects.
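The counts quoted here are known as Bell numbers (a name this article doesn't use): the number of ways to split a set of distinct objects into non-empty groups. Below is a minimal sketch that computes them with the Bell triangle; the function name `bell_number` is made up for this example.

```python
def bell_number(n: int) -> int:
    """Number of ways to split n distinct objects into non-empty groups."""
    row = [1]  # first row of the Bell triangle
    for _ in range(n - 1):
        # Each new row starts with the last entry of the previous row...
        next_row = [row[-1]]
        # ...and each further entry adds the entry above-left to the one before it.
        for value in row:
            next_row.append(next_row[-1] + value)
        row = next_row
    return row[-1]


print(bell_number(10))  # 115975
print(bell_number(20))  # 51724158235372
print(len(str(bell_number(100))), "digits for a hundred objects")
```

Python's integers have no size limit, so the hundred-object case is computed exactly.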

Almost all of those solutions would be meaningless — yet from that unimaginable number of possible choices, you pretty quickly found one of the very few that clustered the bugs in a useful way.

We humans take for granted how good we are at categorizing and making sense of large volumes of data pretty quickly. Whether it’s a paragraph of text, images on a screen, or a sequence of objects, humans are generally fairly efficient at making sense of whatever data the world throws at us.

Given that a key aspect of developing A.I. and machine learning is getting machines to quickly make sense of large sets of input data, what shortcuts are there available?

Here, you can read about three clustering algorithms that machines can use to quickly make sense of large datasets. This is by no means an exhaustive list — there are other algorithms out there — but they represent a good place to start!

You’ll find for each a quick summary of when you might use them, a brief overview of how they work, and a more detailed, step-by-step worked example. I believe it helps to understand an algorithm by actually carrying it out yourself.

If you’re really keen, you’ll find the best way to do this is with pen and paper. Go ahead — nobody will judge!

Three suspiciously neat clusters, with K = 3

K-means clustering

Use when...

…you have an idea of how many groups you’re expecting to find a priori.

How it works

The algorithm randomly assigns each observation into one of k categories, then calculates the mean of each category. Next, it reassigns each observation to the category with the closest mean before recalculating the means. This step repeats over and over until no more reassignments are necessary.

Worked Example

Take a group of 9 football (or ‘soccer’) players who have each scored a certain number of goals this season (say in the range 3–30). Let’s divide them into separate clusters, say three.

Step 1 requires us to randomly split the players into three groups and calculate the means of each.

Group 1
  Player A (5 goals),
  Player B (20 goals),
  Player C (11 goals)
Group Mean = (5 + 20 + 11) / 3 = 12 goals

Group 2
  Player D (5 goals),
  Player E (3 goals),
  Player F (19 goals)
Group Mean = 9 goals

Group 3
  Player G (30 goals),
  Player H (3 goals),
  Player I (15 goals)
Group Mean = 16 goals

Step 2: For each player, reassign them to the group with the closest mean. E.g., Player A (5 goals) is assigned to Group 2 (mean = 9). Then recalculate the group means.

Group 1 (Old Mean = 12 goals)
  Player C (11 goals)
New Mean = 11 goals

Group 2 (Old Mean = 9 goals)
  Player A (5 goals),
  Player D (5 goals),
  Player E (3 goals),
  Player H (3 goals)
New Mean = 4 goals

Group 3 (Old Mean = 16 goals)
  Player G (30 goals),
  Player I (15 goals),
  Player B (20 goals),
  Player F (19 goals)
New Mean = 21 goals

Repeat Step 2 over and over until no player changes group and the group means stop changing. For this somewhat contrived example, one more pass is enough: Player I moves to Group 1, and after that nobody moves. Stop! You have now formed three clusters from the dataset!

Group 1 (Old Mean = 11 goals)
  Player C (11 goals),
  Player I (15 goals)
Final Mean = 13 goals

Group 2 (Old Mean = 4 goals)
  Player A (5 goals),
  Player D (5 goals),
  Player E (3 goals),
  Player H (3 goals)
Final Mean = 4 goals

Group 3 (Old Mean = 21 goals)
  Player G (30 goals),
  Player B (20 goals),
  Player F (19 goals)
Final Mean = 23 goals
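The worked example above can be replayed in a few lines of Python. This is a minimal one-dimensional sketch, not a general K-means implementation: it starts from the same arbitrary split as the example, and the variable names are only for illustration.

```python
goals = {"A": 5, "B": 20, "C": 11, "D": 5, "E": 3,
         "F": 19, "G": 30, "H": 3, "I": 15}

# Step 1: the same arbitrary starting split used in the worked example.
groups = [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"]]


def mean(players):
    return sum(goals[p] for p in players) / len(players)


means = [mean(g) for g in groups]  # 12.0, 9.0, 16.0

while True:
    # Step 2: move every player to the group with the closest mean.
    new_groups = [[] for _ in means]
    for player, scored in goals.items():
        closest = min(range(len(means)), key=lambda k: abs(scored - means[k]))
        new_groups[closest].append(player)

    if new_groups == groups:
        break  # nobody moved, so the clusters are final

    # Recalculate the means; an emptied group keeps its old mean.
    groups = new_groups
    means = [mean(g) if g else m for g, m in zip(groups, means)]

print(groups)  # [['C', 'I'], ['A', 'D', 'E', 'H'], ['B', 'F', 'G']]
print(means)   # [13.0, 4.0, 23.0]
```

The loop stops when a full pass reassigns nobody, which matches the stopping rule described in the overview.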

With this example, the clusters could correspond to the players’ positions on the field — such as defenders, midfielders and attackers.

K-means works here because we could have reasonably expected the data to fall naturally into these three categories.

In this way, given data on a range of performance statistics, a machine could do a reasonable job of estimating the positions of players from any team sport — useful for sports analytics, and indeed any other purpose where classification of a dataset into predefined groups can provide relevant insights.

Finer details

There are several variations on the algorithm described here. The initial method of ‘seeding’ the clusters can be done in one of several ways.

Here, we randomly assigned every player into a group, then calculated the group means. This causes the initial group means to tend towards being similar to one another, which ensures greater repeatability.

An alternative is to seed the clusters with just one player each, then start assigning players to the nearest cluster. The returned clusters are more sensitive to the initial seeding step, reducing repeatability in highly variable datasets.

However, this approach may reduce the number of iterations required to complete the algorithm, as the groups will take less time to diverge.

An obvious limitation to K-means clustering is that you have to provide a priori assumptions about how many clusters you’re expecting to find.

There are methods to assess the fit of a particular set of clusters. For example, the Within-Cluster Sum-of-Squares is a measure of the variance within each cluster.

The ‘better’ the clusters, the lower the overall WCSS.
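To make that concrete, here is a small sketch that computes the WCSS for the football example, comparing the random starting split with the final clusters. The function name `wcss` is just for this example.

```python
def wcss(clusters):
    """Within-cluster sum of squares for a list of clusters of numbers."""
    total = 0.0
    for values in clusters:
        centre = sum(values) / len(values)
        # Squared distance of every point from its own cluster's mean.
        total += sum((v - centre) ** 2 for v in values)
    return total


random_start = [[5, 20, 11], [5, 3, 19], [30, 3, 15]]
final_clusters = [[11, 15], [5, 5, 3, 3], [30, 20, 19]]

print(wcss(random_start))    # 632.0
print(wcss(final_clusters))  # 86.0
```

One caveat: WCSS always shrinks as you add more clusters, so it is most useful for comparing clusterings with the same number of groups.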

Hierarchical clustering

Use when...

…you wish to uncover the underlying relationships between your observations.

How it works

A distance matrix is computed, where the value of cell (i, j) is a distance metric between observations i and j.

Then, pair the closest two observations and calculate their average. Form a new distance matrix, merging the paired observations into a single object.

From this distance matrix, pair up the closest two observations and calculate their average. Repeat until all observations are grouped together.

Worked example

Here’s a super-simplified dataset about a selection of whale and dolphin species. As a trained biologist, I can assure you we normally use much more detailed datasets for things like reconstructing phylogeny.

For now though, we’ll just look at the typical body lengths for these six species. We’ll be using just two repeated steps.

Species              Initials  Length(m)
Bottlenose Dolphin   BD              3.0
Risso's Dolphin      RD              3.6
Pilot Whale          PW              6.5
Killer Whale         KW              7.5
Humpback Whale       HW             15.0
Fin Whale            FW             20.0

Step 1: compute a distance matrix between each species. Here, we’ll use the Euclidean distance — how far apart are the data points?

Read this exactly as you would a distance chart in a road atlas. The difference in length between any pair of species can be looked up by reading the value at the intersection of the relevant row and column.

    BD   RD   PW   KW   HW
RD  0.6                    
PW  3.5  2.9               
KW  4.5  3.9  1.0          
HW 12.0 11.4  8.5  7.5     
FW 17.0 16.4 13.5 12.5  5.0

Step 2: Pair up the two closest species. Here, this will be the Bottlenose & Risso’s Dolphins, with an average length of 3.3m.

Repeat Step 1 by recalculating the distance matrix, but this time merge the Bottlenose & Risso’s Dolphins into a single object with length 3.3m.

    [BD, RD]    PW    KW    HW
PW       3.2
KW       4.2   1.0
HW      11.7   8.5   7.5
FW      16.7  13.5  12.5   5.0

Next, repeat Step 2 with this new distance matrix. Here, the smallest distance is between the Pilot & Killer Whales, so we pair them up and take their average — which gives us 7.0m.

Then, we repeat Step 1 — recalculate the distance matrix, but now we’ve merged the Pilot & Killer Whales into a single object of length 7.0m.

          [BD, RD]  [PW, KW]    HW
[PW, KW]       3.7
HW            11.7       8.0
FW            16.7      13.0   5.0

Next, repeat Step 2 with this distance matrix. The smallest distance (3.7m) is between the two merged objects — so now merge them into an even bigger object, and take the average (which is 5.2m).

Then, repeat Step 1 and compute a new distance matrix, having merged the Bottlenose & Risso’s Dolphins with the Pilot & Killer Whales.

   [[BD, RD] , [PW, KW]]    HW
HW                   9.8    
FW                  14.8   5.0

Next, repeat Step 2. The smallest distance (5.0m) is between the Humpback & Fin Whales, so merge them into a single object, and take the average (17.5m).

Then, it’s back to Step 1 — compute the distance matrix, having merged the Humpback & Fin Whales.

         [[BD, RD] , [PW, KW]]
[HW, FW]                  12.3

Finally, repeat Step 2 — there is only one distance (12.3m) in this matrix, so pair everything into one big object. Now you can stop! Look at the final merged object:

[[[BD, RD],[PW, KW]],[HW, FW]]

It has a nested structure (think JSON), which allows it to be drawn up as a tree-like graph, or 'dendrogram'.
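The two repeated steps can be sketched in Python as follows. This mirrors the simplified procedure used in the worked example (a merged object's length is the plain average of the two objects it came from), and the variable names are only for illustration.

```python
lengths = {"BD": 3.0, "RD": 3.6, "PW": 6.5, "KW": 7.5, "HW": 15.0, "FW": 20.0}

# Each object is a (label, length) pair; a merged label is a nested list.
objects = [(name, length) for name, length in lengths.items()]
merge_distances = []

while len(objects) > 1:
    # Step 1: compare every pair and find the closest two objects.
    pairs = [(abs(objects[i][1] - objects[j][1]), i, j)
             for i in range(len(objects))
             for j in range(i + 1, len(objects))]
    distance, i, j = min(pairs)

    # Step 2: merge them into one object whose length is the average of the two.
    merged = ([objects[i][0], objects[j][0]],
              (objects[i][1] + objects[j][1]) / 2)
    merge_distances.append(round(distance, 2))
    objects = objects[:i] + [merged] + objects[i + 1:j] + objects[j + 1:]

tree = objects[0][0]
print(tree)             # [[['BD', 'RD'], ['PW', 'KW']], ['HW', 'FW']]
print(merge_distances)  # [0.6, 1.0, 3.7, 5.0, 12.35]
```

The last distance comes out as 12.35 rather than 12.3 because the code keeps the unrounded average of 5.15m, where the worked example rounds it to 5.2m.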

It reads in much the same way a family tree might. The nearer two observations are on the tree, the more similar or closely-related they are taken to be.

A no-frills dendrogram generated at R-Fiddle.org (http://www.r-fiddle.org/#)

The structure of the dendrogram gives insight into how the dataset is structured.

In this example, there are two main branches, with Humpback Whale and Fin Whale on one side, and the Bottlenose Dolphin/Risso’s Dolphin and Pilot Whale/Killer Whale on the other.

In evolutionary biology, much larger datasets with many more specimens and measurements are used in this way to infer taxonomic relationships between them.

Outside of biology, hierarchical clustering has applications in data mining and machine learning contexts.

The cool thing is that this approach requires no assumptions about the number of clusters you’re looking for.

You can split the returned dendrogram into clusters by “cutting” the tree at a given height. This height can be chosen in a number of ways, depending on the resolution at which you wish to cluster the data.

For instance, looking at the dendrogram above, if we draw a horizontal line at height = 10, we’d intersect the two main branches, splitting the dendrogram into two sub-graphs. If we cut at height = 2, we’d be splitting the dendrogram into three clusters.

Finer details

There are essentially three aspects in which hierarchical clustering algorithms can differ from the one given here.

Most fundamental is the approach — here, we have used an agglomerative process, whereby we start with individual data points and iteratively cluster them together until we’re left with one large cluster.

An alternative (but more computationally intensive) approach is to start with one giant cluster, and then proceed to divide the data into smaller and smaller clusters until you’re left with isolated data points.

There is also a range of methods that can be used to calculate the distance matrices. For many purposes, the Euclidean distance (think Pythagoras’ Theorem) will suffice, but there are alternatives that may be more applicable in some circumstances.

Finally, the linkage criterion can also vary. Clusters are linked according to how close they are to one another, but the way in which we define ‘close’ is flexible.

In the example above, we measured the distances between the means (or ‘centroids’) of each group and paired up the nearest groups. However, you may want to use a different definition.

For example, each cluster is made up of several discrete points. You could define the distance between two clusters to be the minimum (or maximum) distance between any of their points — as illustrated in the figure below.

There are still other ways of defining the linkage criterion, which may be suitable in different contexts.

Red/Blue: centroid linkage; Red/Green: minimum linkage; Green/Blue: maximum linkage
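Here is a small sketch of the three linkage criteria named in the caption, written for one-dimensional data like the whale lengths. The function names are made up for this example, and the two clusters are taken from the worked example.

```python
def centroid_linkage(a, b):
    """Distance between the means of two clusters."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


def minimum_linkage(a, b):
    """Smallest distance between any point in a and any point in b."""
    return min(abs(x - y) for x in a for y in b)


def maximum_linkage(a, b):
    """Largest distance between any point in a and any point in b."""
    return max(abs(x - y) for x in a for y in b)


dolphins = [3.0, 3.6]      # Bottlenose and Risso's Dolphins
small_whales = [6.5, 7.5]  # Pilot and Killer Whales

print(round(centroid_linkage(dolphins, small_whales), 1))  # 3.7
print(round(minimum_linkage(dolphins, small_whales), 1))   # 2.9
print(round(maximum_linkage(dolphins, small_whales), 1))   # 4.5
```

The same two clusters are 2.9m, 3.7m or 4.5m apart depending on the criterion, which is why the choice can change the order in which clusters merge.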

Graph Community Detection

Use when...

…you have data that can be represented as a network, or ‘graph’.

How it works

A graph community is very generally defined as a subset of vertices which are more connected to each other than with the rest of the network.

Various algorithms exist to identify communities, based upon more specific definitions. Algorithms include, but are not limited to: Edge Betweenness, Modularity-Maximisation, Walktrap, Clique Percolation, Leading Eigenvector…

Worked example

Graph theory, or the mathematical study of networks, is a fascinating branch of mathematics that lets us model complex systems as an abstract collection of ‘dots’ (or vertices) connected by ‘lines’ (or edges).

Perhaps the most intuitive case-studies are social networks.

Here, the vertices represent people, and edges connect vertices who are friends/followers. However, any system can be modelled as a network if you can justify a method to meaningfully connect different components.

Some of the more innovative applications of graph theory to clustering include feature extraction from image data, and analysing gene regulatory networks.

As an entry-level example, take a look at this quickly put-together graph. It shows the eight websites I most recently visited, linked according to whether their respective Wikipedia articles link out to one another.

You could assemble this data manually, but for larger-scale projects, it’s much quicker to write a Python script to do the same. Here’s one I wrote earlier.

Graph plotted with ‘igraph’ package for R version 3.3.3

The vertices are colored according to their community membership, and sized according to their centrality. See how Google and Twitter are the most central?

Also, the clusters make pretty good sense in the real-world (always an important performance indicator).

The yellow vertices are generally reference/look-up sites; the blue vertices are all used for online publishing (of articles, tweets, or code); and the red vertices include YouTube, which was of course founded by former PayPal employees. Not bad deductions for a machine.

Aside from being a useful way to visualize large systems, the real power of networks comes from their mathematical analysis. Let’s start by translating our nice picture of the network into a more mathematical format. Below is the adjacency matrix of the network.

         GH Gl  M  P  Q  T  W  Y
GitHub    0  1  0  0  0  1  0  0  
Google    1  0  1  1  1  1  1  1
Medium    0  1  0  0  0  1  0  0
PayPal    0  1  0  0  0  1  0  1
Quora     0  1  0  0  0  1  1  0
Twitter   1  1  1  1  1  0  0  1
Wikipedia 0  1  0  0  1  0  0  0
YouTube   0  1  0  1  0  1  0  0

The value at the intersection of each row and column records whether there is an edge between that pair of vertices.

For instance, there is an edge between Medium and Twitter (surprise, surprise!), so the value where their rows/columns intersect is 1. Similarly, there is no edge between Medium and PayPal, so the intersection of their rows/columns returns 0.

Encoded within the adjacency matrix are all the properties of this network — it gives us the key to start unlocking all manner of valuable insights.

For a start, summing any column (or row) gives you the degree of each vertex — i.e., how many others it is connected to. This is commonly denoted with the letter k.

Likewise, summing the degrees of every vertex and dividing by two gives you L, the number of edges (or ‘links’) in the network. The number of rows/columns gives us N, the number of vertices (or ‘nodes’) in the network.
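These quantities fall straight out of the adjacency matrix. The sketch below stores the matrix above as a list of lists and reads off k, L and N; the variable names follow the letters used in the text.

```python
sites = ["GitHub", "Google", "Medium", "PayPal",
         "Quora", "Twitter", "Wikipedia", "YouTube"]

A = [
    [0, 1, 0, 0, 0, 1, 0, 0],  # GitHub
    [1, 0, 1, 1, 1, 1, 1, 1],  # Google
    [0, 1, 0, 0, 0, 1, 0, 0],  # Medium
    [0, 1, 0, 0, 0, 1, 0, 1],  # PayPal
    [0, 1, 0, 0, 0, 1, 1, 0],  # Quora
    [1, 1, 1, 1, 1, 0, 0, 1],  # Twitter
    [0, 1, 0, 0, 1, 0, 0, 0],  # Wikipedia
    [0, 1, 0, 1, 0, 1, 0, 0],  # YouTube
]

N = len(A)                    # number of vertices
k = [sum(row) for row in A]   # degree of each vertex
L = sum(k) // 2               # every edge is counted twice in the degree sum

print(dict(zip(sites, k)))
print(N, L)  # 8 14
```

Google (degree 7) and Twitter (degree 6) have the highest degrees, consistent with their being the most central vertices in the plot.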

Knowing just k, L, N and the value of each cell in the adjacency matrix A lets us calculate the modularity of any given clustering of the network.

Say we’ve clustered the network into a number of communities. We can use the modularity score to assess the ‘quality’ of this clustering.

A higher score will show we’ve split the network into ‘accurate’ communities, whereas a low score suggests our clusters are more random than insightful. The image below illustrates this.

Modularity serves as a measure of the ‘quality’ of a partition.

Modularity can be calculated using the formula below:

$$M = \frac{1}{2L} \sum_{i,j=1}^{N} \left( A_{ij} - \frac{k_i k_j}{2L} \right) \delta(c_i, c_j)$$

That’s a fair amount of math, but we can break it down bit by bit and it’ll make more sense.

M is of course what we’re calculating — modularity.

1/2L tells us to divide everything that follows by 2L, i.e., twice the number of edges in the network. So far, so good.

The Σ symbol tells us we’re summing up everything to the right, and lets us iterate over every row and column in the adjacency matrix A.

For those unfamiliar with sum notation, the i, j = 1 and the N work much like nested for-loops in programming. In Python, you’d write it as follows:

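N = 8  # number of rows/columns in A (eight vertices in our network)
total = 0  # called `total` so it doesn't shadow Python's built-in sum()
for i in range(N):      # Python counts from 0, so this covers all N rows
    for j in range(N):  # ...and all N columns
        ans = 0  #stuff with i and j as indices goes here
        total += ans  # inside the inner loop, so every (i, j) term is added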

So what is #stuff with i and j in more detail?

Well, the bit in brackets tells us to subtract (k_i k_j) / 2L from A_ij.

A_ij is simply the value in the adjacency matrix at row i, column j.

The values of k_i and k_j are the degrees of vertices i and j, found by adding up the entries in row i and column j respectively. Multiplying these together and dividing by 2L gives us the expected number of edges between vertices i and j if the network were randomly shuffled up.

Overall, the term in the brackets reveals the difference between the network’s real structure and the expected structure it would have if randomly reassembled.

Playing around with the values shows that it returns its highest value when A_ij = 1, and (k_i k_j) / 2L is low. This means we see a higher value if there is an ‘unexpected’ edge between vertices i and j.

Finally, we multiply the bracketed term by whatever the last few symbols refer to.

The δ(c_i, c_j) is the fancy-sounding but totally harmless Kronecker-delta function. Here it is, explained in Python:

def kroneckerDelta(ci, cj):
    if ci == cj:
        return 1
    else:
        return 0

kroneckerDelta("A","A")
#returns 1

kroneckerDelta("A","B")
#returns 0

Yes — it really is that simple. The Kronecker-delta function takes two arguments, and returns 1 if they are identical, otherwise, zero.

This means that if vertices i and j have been put in the same cluster, then δ(c_i, c_j) = 1. Otherwise, if they are in different clusters, the function returns zero.

As we are multiplying the bracketed term by this Kronecker-delta function, we find that for the nested sum Σ, the outcome is highest when there are lots of ‘unexpected’ edges connecting vertices assigned to the same cluster.

As such, modularity is a measure of how well-clustered the graph is into separate communities.

Dividing by 2L bounds the upper value of modularity at 1. Modularity scores near to or below zero indicate the current clustering of the network is really no use. The higher the modularity, the better the clustering of the network into separate communities.
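Putting the pieces together, here is a minimal sketch of the modularity calculation for the eight-website network. The partition called `guess` is an assumption: it is read loosely from the colour description earlier and may not match the plotted communities exactly.

```python
A = [
    [0, 1, 0, 0, 0, 1, 0, 0],  # GitHub
    [1, 0, 1, 1, 1, 1, 1, 1],  # Google
    [0, 1, 0, 0, 0, 1, 0, 0],  # Medium
    [0, 1, 0, 0, 0, 1, 0, 1],  # PayPal
    [0, 1, 0, 0, 0, 1, 1, 0],  # Quora
    [1, 1, 1, 1, 1, 0, 0, 1],  # Twitter
    [0, 1, 0, 0, 1, 0, 0, 0],  # Wikipedia
    [0, 1, 0, 1, 0, 1, 0, 0],  # YouTube
]


def modularity(A, communities):
    """communities[i] is the label of the community that vertex i belongs to."""
    N = len(A)
    k = [sum(row) for row in A]  # degree of each vertex
    two_L = sum(k)               # twice the number of edges
    total = 0.0
    for i in range(N):
        for j in range(N):
            if communities[i] == communities[j]:  # the Kronecker delta
                total += A[i][j] - k[i] * k[j] / two_L
    return total / two_L


# Hypothetical partition, in the same vertex order as the rows of A.
guess = ["blue", "yellow", "blue", "red", "yellow", "blue", "yellow", "red"]

print(modularity(A, guess))           # positive (1/14, about 0.071)
print(modularity(A, ["all"] * 8))     # one giant cluster: zero, up to rounding
print(modularity(A, list(range(8))))  # every vertex alone: negative
```

The two extreme clusterings score at or below zero, which fits the description of low scores as "really no use".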

By maximising modularity, we can find the best way of clustering the network.

Notice that we have to pre-define how the graph is clustered to find out how ‘good’ that clustering actually is.

Unfortunately, employing brute force to try out every possible way of clustering the graph to find which has the highest modularity score would be computationally impossible beyond a very limited sample size.

Combinatorics tells us that for a network of just eight vertices, there are 4140 different ways of clustering them. A network twice the size would have over ten billion possible ways of clustering the vertices.

Doubling the network again (to a very modest 32 vertices) would give 128 septillion possible ways, and a network of eighty vertices would be cluster-able in more ways than there are atoms in the observable universe.
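Those counts are the Bell numbers, which count the ways of splitting n labelled items into non-empty groups. If you want to check them yourself, here is a small sketch using the Bell triangle. The function name `bell_number` is my own.

```python
def bell_number(n):
    """Number of ways to partition n labelled items into non-empty groups."""
    if n == 0:
        return 1
    row = [1]  # first row of the Bell triangle
    for _ in range(n - 1):
        # Each new row starts with the last entry of the previous row...
        new_row = [row[-1]]
        # ...and each further entry adds the entry to its left
        # to the entry above that one.
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]  # the last entry of row n is the n-th Bell number

print(bell_number(8))   # 4140
print(bell_number(16))  # 10480142147
print(bell_number(32))
print(len(str(bell_number(80))), "digits for 80 vertices")
```

Python integers have no size limit, so the last two lines run instantly even though the results are enormous. Counting the clusterings is cheap; trying each one is not.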

Instead, we have to turn to a heuristic method that does a reasonably good job at estimating the clusters that will produce the highest modularity score, without trying out every single possibility.

This is an algorithm called Fast-Greedy Modularity-Maximization, and it’s somewhat analogous to the agglomerative hierarchical clustering algorithm described above. Instead of merging according to distance, ‘Mod-Max’ merges communities according to changes in modularity.

Here’s how it goes:

Begin by initially assigning every vertex to its own community, and calculating the modularity of the whole network, M.

Step 1 requires that for each community pair linked by at least a single edge, the algorithm calculates the resultant change in modularity ΔM if the two communities were merged into one.

Step 2 then takes the pair of communities whose merger produces the largest ΔM (the biggest increase in modularity), and merges them. Calculate the new modularity M for this clustering, and keep a record of it.

Repeat steps 1 and 2, each time merging the pair of communities with the largest ΔM, then recording the new clustering pattern and its associated modularity score M.

Stop when all the vertices are grouped into one giant cluster. Now the algorithm checks the records it kept as it went along, and identifies the clustering pattern that returned the highest value of M. This is the returned community structure.
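The steps above can be sketched in plain Python. To be clear, this is a naive illustration and not the optimised ‘fast’ version: it recomputes the full modularity for every candidate merge, which picks the same pair as comparing ΔM but is far slower. The example graph and the function names `modularity`, `linked` and `greedy_mod_max` are my own assumptions.

```python
def modularity(adj, labels):
    """Modularity M of a clustering; labels[i] is the community of vertex i."""
    n = len(adj)
    k = [sum(row) for row in adj]  # vertex degrees
    two_L = sum(k)                 # every edge is counted twice
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:  # Kronecker delta
                total += adj[i][j] - k[i] * k[j] / two_L
    return total / two_L

def linked(adj, labels, a, b):
    """True if at least one edge joins community a to community b."""
    n = len(adj)
    return any(
        adj[i][j]
        for i in range(n) if labels[i] == a
        for j in range(n) if labels[j] == b
    )

def greedy_mod_max(adj):
    labels = list(range(len(adj)))  # every vertex starts in its own community
    history = [(modularity(adj, labels), labels[:])]
    while len(set(labels)) > 1:
        communities = sorted(set(labels))
        best = None  # (modularity after the merge, merged labels)
        # Step 1: score every pair of communities joined by an edge.
        for idx, a in enumerate(communities):
            for b in communities[idx + 1:]:
                if not linked(adj, labels, a, b):
                    continue
                trial = [a if c == b else c for c in labels]
                m = modularity(adj, trial)
                if best is None or m > best[0]:
                    best = (m, trial)
        if best is None:  # disconnected graph: no linked pairs remain
            break
        # Step 2: apply the best merge and keep a record of it.
        labels = best[1]
        history.append((best[0], labels[:]))
    # Return the recorded clustering with the highest modularity.
    return max(history, key=lambda record: record[0])

# Hypothetical example graph: two triangles joined by one edge (2-3).
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
best_m, best_labels = greedy_mod_max(A)
print(best_labels)       # [0, 0, 0, 3, 3, 3]
print(round(best_m, 3))  # 0.357
```

On this toy graph the algorithm keeps merging until everything is in one cluster, but the best record it finds along the way is the split into the two triangles.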

Finer details

Whew! That was computationally intensive, at least for us humans.

Graph theory is a rich source of computationally challenging, often NP-hard problems — yet it also has incredible potential to provide valuable insights into complex systems and datasets.

Just ask Larry Page, whose eponymous PageRank algorithm — which helped propel Google from start-up to basically world domination in less than a generation — was based entirely in graph theory.

Community detection is a major focus of current research in graph theory, and there are plenty of alternatives to Modularity-Maximization, which, while useful, does have some drawbacks.

For a start, its agglomerative approach often sees small, well-defined communities swallowed up into larger ones. This is known as the resolution limit — the algorithm will not find communities below a certain size.

Another challenge is that rather than having one distinct, easy-to-reach global peak, the Mod-Max approach actually tends to produce a wide ‘plateau’ of many similar high modularity scores — making it somewhat difficult to truly identify the absolute maximum score.

Other algorithms use different ways to define and approach community detection.

Edge-Betweenness is a divisive algorithm, starting with all vertices grouped in one giant cluster. It proceeds to iteratively remove the edges with the highest betweenness (the ‘bridges’ that the most shortest paths between vertices pass through), until all vertices are left isolated. This produces a hierarchical structure, with similar vertices closer together in the hierarchy.

Another algorithm is Clique Percolation, which takes into account possible overlap between graph communities.

Yet another set of algorithms is based on random walks across the graph, and then there are spectral clustering methods which start delving into the eigendecomposition of the adjacency matrix and other matrices derived therefrom. These ideas are used in feature extraction in, for example, areas such as computer vision.

It’d be well beyond the scope of this article to give each algorithm its own in-depth worked example. Suffice to say that this is an active area of research, providing powerful methods to make sense of data that even a generation ago would have been extremely difficult to process.

Conclusion

Hopefully this article has informed and inspired you to better understand how machines can make sense of data. The future is a rapidly changing place, and many of those changes will be driven by what technology becomes capable of in the next generation or two.

As outlined in the introduction, machine learning is an extraordinarily ambitious field of research, in which massively complex problems require solving in as accurate and as efficient a way as possible. Tasks that come naturally to us humans require innovative solutions when taken on by machines.

There’s still plenty of progress to be made, and whoever contributes the next breakthrough idea will no doubt be generously rewarded. Maybe someone reading this article will be behind the next powerful algorithm?

All great ideas have to start somewhere!