David Karolyi - freeCodeCamp.org

Diffie-Hellman: The Genius Algorithm Behind Secure Network Communication

David Karolyi — Mon, 11 May 2020 18:53:11 +0000

Let's start with a quick thought experiment.

You have a network of 3 computers, used by Alice, Bob, and Charlie. All 3 participants can send messages, but just in a way that all other clients who connected to the network can read it. This is the only possible communication form between participants.

If Alice sends a message through the wires, both Bob and Charlie get it. In other words, Alice cannot send a direct message to Bob without Charlie receiving it as well.

But Alice wants to send a confidential message to Bob and doesn't want Charlie to be able to read it.

Seems impossible with these strict rules, right? The beautiful thing that this problem is solved in 1976 by Whitfield Diffie and Martin Hellman.

This is a simplified version of the real world, but we face the same problem when communicating through the biggest network that's ever existed.

Usually, you are not directly connected to the internet, but you are part of a local smaller network, called Ethernet.

This smaller network can be wired or wireless (Wi-Fi), but the base concept remains. If you send a signal through the network this signal can be read by all other clients connected to the same network.

Once you emit a message to your bank's server with your credit card information, all other clients in the local network will get the message, including the router. It will then forward it to the actual server of the bank. All other clients will ignore the message.

But what if there is a malicious client in the network who won't ignore your confidential messages, but read them instead? How is it possible you still have money on your bank account?

Encryption

It's kind of clear at this point that we need to use some kind of encryption to make sure that the message is readable for Alice and Bob, but complete gibberish for Charlie.

Encrypting information is done by an encryption algorithm, which takes a key (for example a string) and gives back an encrypted value, called ciphertext. The ciphertext is just a completely random-looking string.

It's important that the encrypted value (ciphertext) can be decrypted only with the original key. This is called a symmetric-key algorithm because you need the same key for decrypting the message as it was encrypted with. There are also asymmetric-key algorithms, but we don't need them right now.

To make it easier to understand this, here is a dummy encryption algorithm implemented in JavaScript:

function encrypt(message, key) {
    return message.split("").map(character => {
        const characterAsciiCode = character.charCodeAt(0)
        return String.fromCharCode(characterAsciiCode+key.length)
    }).join("");
}

In this function, I mapped each character into another character based on the length of the given key.

Every character has an integer representation, called ASCII code. There is a dictionary that contains all characters with its code, called the ASCII table. So we incremented this integer by the length of the key:

Character mapping

Decrypting the ciphertext is pretty similar. But instead of addition, we subtract the key length from every character in the ciphertext, so we get back the original message.

function decrypt(cipher, key) {
    return cipher.split("").map(character => {
        const characterAsciiCode = character.charCodeAt(0)
        return String.fromCharCode(characterAsciiCode-key.length)
    }).join("");
}

Finally here is the dummy encryption in action:

const message = "Hi Bob, here is a confidential message!";
const key = "password";

const cipher = encrypt(message, key);
console.log("Encrypted message:", cipher);
// Encrypted message: Pq(Jwj4(pmzm(q{(i(kwvnqlmv|qit(um{{iom)

const decryptedMessage = decrypt(cipher, key);
console.log("Decrypted message:", decryptedMessage);
// Decrypted message: Hi Bob, here is a confidential message!

We applied some degree of encryption to the message, but this algorithm was only useful for demonstration purposes, to get a sense of how symmetric-key encryption algorithms behave.

There are a couple of problems with this implementation besides handling corner cases and parameter types poorly.

First of all every 8 character-long key can decrypt the message which was encrypted with the key "password". We want an encryption algorithm to only be able to decrypt a message if we give it the same key that the message was encrypted with. A door lock that can be opened by every other key isn't that useful.

Secondly, the logic is too simple – every character is shifted the same amount in the ASCII table, which is too predictable. We need something more complex to make it harder to find out the message without the key.

Thirdly, there isn't a minimal key length. Modern algorithms work with at least 128 bit long keys (~16 characters). This significantly increases the number of possible keys, and with this the secureness of encryption.

Lastly, it takes too little time to encrypt or decrypt the message. This is a problem because it doesn't take too much time to try out all possible keys and crack the encrypted message.

This is hand in hand with the key length: An algorithm is secure if I as an attacker want to find the key, then I need to try a large number of key combinations and it takes a relatively long time to try a single combination.

There is a wide range of symmetric encryption algorithms that addressed all of these claims, often used together to find a good ratio of speed and secureness for every situation.

The more popular symmetric-key algorithms are Twofish, Serpent, AES (Rijndael), Blowfish, CAST5, RC4, TDES, and IDEA.

If you want to learn more about cryptography in general check out this talk.

Key exchange

It looks like we reduced the original problem space. With encryption, we can create a message which is meaningful for parties who are eligible to read the information, but which is unreadable for others.

When Alice wants to write a confidential message, she would pick a key and encrypt her message with it and send the ciphertext through the wires. Both Bob and Charlie would receive the encrypted message, but none of them could interpret it without Alice's key.

Now the only question to answer is how Alice and Bob can find a common key just by communicating through the network and prevent Charlie from finding out that same key.

If Alice sends her key directly through the wires Charlie would intercept it and would be able to decrypt all Alice's messages. So this is not a solution. This is called the key exchange problem in computer science.

Diffie–Hellman key exchange

This cool algorithm provides a way of generating a shared key between two people in such a way that the key can't be seen by observing the communication.

As a first step, we'll say that there is a huge prime number, known to all participants, it's public information. We call it "p" or modulus.

There is also another public number called "g" or base, which is less than p.

Don't worry about how these numbers are generated. For the sake of simplicity let's just say Alice picks a very big prime number (p) and a number which is considerably less than p. She then sends them through the wires without any encryption, so all participants will know these numbers.

Example: To understand this through an example, we'll use small numbers. Let's say p=23 and g=5.

As a second step both Alice (a) and Bob (b) will pick a secret number, which they won't tell anybody, it's just locally living in their computers.

Example: Let's say Alice picked 4 (a=4), and Bob picked 3 (b=3).

As a next step, they will do some math on their secret numbers, they will calculate:

the base (g) in the power of their secret number,
and take the calculated number's modulo to p.
Call the result A (for Alice) and B (for Bob).

Modulo is a simple mathematical statement, and we use it to find the remainder after dividing one number by another. Here is an example: 23 mod 4 = 3, because 23/4 is 5 and 3 remains.

Maybe it's easier to see all of this in code:

// base
const g = 5;
// modulus
const p = 23;

// Alice's randomly picked number
const a = 4;
// Alice's calculated value
const A = Math.pow(g, a)%p;

// Do the same for Bob
const b = 3;
const B = Math.pow(g, b)%p;

console.log("Alice's calculated value (A):", A);
// Alice's calculated value (A): 4
console.log("Bob's calculated value (B):", B);
// Bob's calculated value (B): 10

Now both Alice and Bob will send their calculated values (A, B) through the network, so all participants will know them.

As a last step Alice and Bob will take each other's calculated values and do the following:

Alice will take Bob's calculated value (B) in the power of his secret number (a),
and calculate this number's modulo to p and will call the result s (secret).
Bob will do the same but with Alice's calculated value (A), and his secret number (b).

At this point, they successfully generated a common secret (s), even if it's hard to see right now. We will explore this in more detail in a second.

In code:

// Alice calculate the common secret
const secretOfAlice = Math.pow(B, a)%p;
console.log("Alice's calculated secret:", secretOfAlice);
// Alice's calculated secret: 18

// Bob will calculate
const secretOfBob = Math.pow(A, b)%p;
console.log("Bob's calculated secret:", secretOfBob);
// Bob's calculated secret: 18

As you can see both Alice and Bob got the number 18, which they can use as a key to encrypt messages. It seems magic at this point, but it's just some math.

Let's see why they got the same number by splitting up the calculations into elementary pieces:

The process as an equation

In the last step, we used a modulo arithmetic identity and its distributive properties to simplify nested modulo statements.

So Alice and Bob have the same key, but let's see what Charlie saw from all of this. We know that p and g are public numbers, available for everyone.

We also know that Alice and Bob sent their calculated values (A, B) through the network, so that can be also caught by Charlie.

Charlie's perspective

Charlie knows almost all parameters of this equation, just a and b remain hidden. To stay with the example, if he knows that A is 4 and p is 23, g to the power of a can be 4, 27, 50, 73, ... and infinite other numbers which result in 4 in the modulo space.

He also knows that only the subset of these numbers are possible options because not all numbers are an exponent of 5 (g), but this is still an infinite number of options to try.

This doesn't seem too secure with small numbers. But at the beginning I said that p is a really large number, often 2000 or 4000 bits long. This makes it almost impossible to guess the value of a or b in the real world.

The common key Alice and Bob both possess only can be generated by knowing a or b, besides the information that traveled through the network.

If you're more visual, here is a great diagram shows this whole process by mixing buckets of paint instead of numbers.

_source: Wikipedia_

Here p and g shared constants represented by the yellow "Common paint". Secret numbers of Alice and Bob (a, b) is "Secret colours", and "Common secret" is what we called s.

This is a great analogy because it's representing the irreversibility of the modulo operation. As mixed paints can't be unmixed to their original components, the result of a modulo operation can't be reversed.

Summary

Now the original problem can be solved by encrypting messages using a shared key, which was exchanged with the Diffie-Hellman algorithm.

With this Alice and Bob can communicate securely, and Charlie cannot read their messages even if he is part of the same network.

Thanks for reading this far! I hope you got some value from this post and understood some parts of this interesting communication flow.

If it was hard to follow the math of this explanation, here is a great video to help you understand the algorithm without math, from a higher level.

If you liked this post, you may want to follow me on Twitter to find some more exciting resources about programming and software development.

Web scraping for web developers: a concise summary

David Karolyi — Wed, 13 Feb 2019 17:59:30 +0000

Knowing one approach to web scraping may solve your problem in the short term, but all methods have their own strengths and weaknesses. Being aware of this can save you time and help you to solve a task more efficiently.

Numerous resources exist, which will show you a single technique for extracting data from a web page. The reality is that multiple solutions and tools can be used for that.

What are your options to programmatically extract data from a web page?

What are the pros and cons of each approach?

How to use cloud services to increase the degree of automation?

This guide meant to answer these questions.

I assume you have a basic understanding of browsers in general, HTTP requests, the DOM (Document Object Model), HTML, CSS selectors, and Async JavaScript.

If these phrases sound unfamiliar, I suggest checking out those topics before continue reading. Examples are implemented in Node.js, but hopefully you can transfer the theory into other languages if needed.

Static content

HTML source

Let’s start with the simplest approach.

If you are planning to scrape a web page, this is the first method to try. It requires a negligible amount of computing power and the least time to implement.

However, it only works if the HTML source code contains the data you are targeting. To check that in Chrome, right-click the page and choose View page source. Now you should see the HTML source code.

It’s important to note here, that you won’t see the same code by using Chrome’s inspect tool, because it shows the HTML structure related to the current state of the page, which is not necessarily the same as the source HTML document that you can get from the server.

Once you find the data here, write a CSS selector belonging to the wrapping element, to have a reference later on.

To implement, you can send an HTTP GET request to the URL of the page and will get back the HTML source code.

In Node, you can use a tool called CheerioJS to parse this raw HTML and extract the data using a selector. The code looks something like this:

const fetch = require('node-fetch');
const cheerio = require('cheerio');

const url = 'https://example.com/';
const selector = '.example';

fetch(url)
  .then(res => res.text())
  .then(html => {
    const $ = cheerio.load(html);
    const data = $(selector);
    console.log(data.text());
  });

Dynamic content

In many cases, you can’t access the information from the raw HTML code, because the DOM was manipulated by some JavaScript, executed in the background. A typical example of that is a SPA (Single Page Application), where the HTML document contains a minimal amount of information, and the JavaScript populates it at runtime.

In this situation, a solution is to build the DOM and execute the scripts located in the HTML source code, just like a browser does. After that, the data can be extracted from this object with selectors.

Headless browsers

This can be achieved by using a headless browser. A headless browser is almost the same thing as the normal one you are probably using every day but without a user interface. It’s running in the background and you can programmatically control it instead of clicking with your mouse and typing with a keyboard.

A popular choice for a headless browser is Puppeteer. It is an easy to use Node library which provides a high-level API to control Chrome in headless mode. It can be configured to run non-headless, which comes in handy during development. The following code does the same thing as before, but it will work with dynamic pages as well:

const puppeteer = require('puppeteer');

async function getData(url, selector){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const data = await page.evaluate(selector => {
    return document.querySelector(selector).innerText;
  }, selector);
  await browser.close();
  return data;
}

const url = 'https://example.com';
const selector = '.example';
getData(url,selector)
  .then(result => console.log(result));

Of course, you can do more interesting things with Puppeteer, so it is worth checking out the documentation. Here is a code snippet which navigates to a URL, takes a screenshot and saves it:

const puppeteer = require('puppeteer');

async function takeScreenshot(url,path){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.screenshot({path: path});
  await browser.close();
}

const url = 'https://example.com';
const path = 'example.png';
takeScreenshot(url, path);

As you can imagine, running a browser requires much more computing power than sending a simple GET request and parsing the response. Therefore execution is relatively costly and slow. Not only that but including a browser as a dependency makes the deployment package massive.

On the upside, this method is highly flexible. You can use it for navigating around pages, simulating clicks, mouse moves, and keyboard events, filling out forms, taking screenshots or generating PDFs of pages, executing commands in the console, selecting elements to extract its text content. Basically, everything can be done that is possible manually in a browser.

Building just the DOM

You may think it’s a little bit of overkill to simulate a whole browser just for building a DOM. Actually, it is, at least under certain circumstances.

There is a Node library, called Jsdom, which will parse the HTML you pass it, just like a browser does. However, it isn’t a browser, but a tool for building a DOM from a given HTML source code, while also executing the JavaScript code within that HTML.

Thanks to this abstraction, Jsdom is able to run faster than a headless browser. If it’s faster, why don’t use it instead of headless browsers all the time?

Quote from the documentation:

People often have trouble with asynchronous script loading when using jsdom. Many pages load scripts asynchronously, but there is no way to tell when they’re done doing so, and thus when it’s a good time to run your code and inspect the resulting DOM structure. This is a fundamental limitation.

… This can be worked around by polling for the presence of a specific element.

This solution is shown in the example. It checks every 100 ms if the element either appeared or timed out (after 2 seconds).

It also often throws nasty error messages when some browser feature in the page is not implemented by Jsdom, such as: “Error: Not implemented: window.alert…” or “Error: Not implemented: window.scrollTo…”. This issue also can be solved with some workarounds (virtual consoles).

Generally, it’s a lower level API than Puppeteer, so you need to implement certain things yourself.

These things make it a little messier to use, as you will see in the example. Puppeteer solves all these things for you behind the scenes and makes it extremely easy to use. Jsdom for this extra work will offer a fast and lean solution.

Let’s see the same example as previously, but with Jsdom:

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

async function getData(url,selector,timeout) {
  const virtualConsole = new jsdom.VirtualConsole();
  virtualConsole.sendTo(console, { omitJSDOMErrors: true });
  const dom = await JSDOM.fromURL(url, {
    runScripts: "dangerously",
    resources: "usable",
    virtualConsole
  });
  const data = await new Promise((res,rej)=>{
    const started = Date.now();
    const timer = setInterval(() => {
      const element = dom.window.document.querySelector(selector)
      if (element) {
        res(element.textContent);
        clearInterval(timer);
      }
      else if(Date.now()-started > timeout){
        rej("Timed out");
        clearInterval(timer);
      }
    }, 100);
  });
  dom.window.close();
  return data;
}

const url = "https://example.com/";
const selector = ".example";
getData(url,selector,2000).then(result => console.log(result));

Reverse engineering

Jsdom is a fast and lightweight solution, but it’s possible even further to simplify things.

Do we even need to simulate the DOM?

Generally speaking, the webpage that you want to scrape consists of the same HTML, same JavaScript, same technologies you’ve already know. So, if you find that piece of code from where the targeted data was derived, you can repeat the same operation in order to get the same result.

If we oversimplify things, the data you’re looking for can be:

part of the HTML source code (as we saw in the first paragraph),
part of a static file, referenced in the HTML document (for example a string in a javascript file),
a response for a network request (for example some JavaScript code sent an AJAX request to a server, which responded with a JSON string).

All of these data sources can be accessed with network requests. From our perspective, it doesn’t matter if the webpage uses HTTP, WebSockets or any other communication protocol, because all of them are reproducible in theory.

Once you locate the resource housing the data, you can send a similar network request to the same server as the original page does. As a result, you get the response, containing the targeted data, which can be easily extracted with regular expressions, string methods, JSON.parse etc…

With simple words, you can just take the resource where the data is located, instead of processing and loading the whole stuff. This way the problem, showed in the previous examples, can be solved with a single HTTP request instead of controlling a browser or a complex JavaScript object.

This solution seems easy in theory, but most of the times it can be really time-consuming to carry out and requires some experience of working with web pages and servers.

A possible place to start researching is to observe network traffic. A great tool for that is the Network tab in Chrome DevTools. You will see all outgoing requests with the responses (including static files, AJAX requests, etc…), so you can iterate through them and look for the data.

This can be even more sluggish if the response is modified by some code before being rendered on the screen. In that case, you have to find that piece of code and understand what’s going on.

As you see, this solution may require way more work than the methods featured so far. On the other hand, once it’s implemented, it provides the best performance.

This chart shows the required execution time, and the package size compared to Jsdom and Puppeteer:

These results aren’t based on precise measurements and can vary in every situation, but shows well the approximate difference between these techniques.

Cloud service integration

Let’s say you implemented one of the solutions listed so far. One way to execute your script is to power on your computer, open a terminal and execute it manually.

This can become annoying and inefficient very quickly, so it would be better if we could just upload the script to a server and it would execute the code on a regular basis depending on how it’s configured.

This can be done by running an actual server and configuring some rules on when to execute the script. Servers shine when you keep observing an element in a page. In other cases, a cloud function is probably a simpler way to go.

Cloud functions are basically containers intended to execute the uploaded code when a triggering event occurs. This means you don’t have to manage servers, it’s done automatically by the cloud provider of your choice.

A possible trigger can be a schedule, a network request, and numerous other events. You can save the collected data in a database, write it in a Google sheet or send it in an email. It all depends on your creativity.

Popular cloud providers are Amazon Web Services(AWS), Google Cloud Platform(GCP), and Microsoft Azure and all of them has a function service:

They offer some amount of free usage every month, which your single script probably won’t exceed, unless in extreme cases, but please check the pricing before use.

If you are using Puppeteer, Google’s Cloud Functions is the simplest solution. Headless Chrome’s zipped package size (~130MB) exceeds AWS Lambda’s limit of maximum zipped size (50MB). There are some techniques to make it work with Lambda, but GCP functions support headless Chrome by default, you just need to include Puppeteer as a dependency in package.json.

If you want to learn more about cloud functions in general, do some research on serverless architectures. Many great guides have already been written on this topic and most providers have an easy to follow documentation.

Summary

I know that every topic was a bit compressed. You probably can’t implement every solution just with this knowledge, but with the documentation and some custom research, it shouldn’t be a problem.

Hopefully, now you have a high-level overview of techniques used for collecting data from the web, so you can dive deeper into each topic accordingly.