reverse engineering - freeCodeCamp.org

How to Reverse Engineer a Website – a Guide for Developers

Abdurrahman Rajab — Wed, 13 Nov 2024 16:07:01 +0000

While using one of your favorite websites, you might have often thought, "What if this website had this particular functionality? That would be great!"

If you have ever had such thoughts, this article is for you. In it, you'll learn how websites communicate with servers and get data and how to work backwards to understand how that website functions.

You will also see how to add functionality to a website or use its APIs to recreate it yourself. You will use a simple demo website in the article to do that. The website contains some sales data grabbed by a remote API. In the demo, you will use the website to see what APIs have been used to get the data and how to use the sales data API.

If you understand how to access the data API on this website, you can use the same methodology to access this data on any website you like.

Prerequisites:

This article should be accessible to anyone who knows the basics of programming. You will see examples in JavaScript, but you can use these techniques in your favorite language. Having some basic knowledge of how the web works will also be helpful.

You need to install the project from GitHub and run it to experiment with this tutorial.

What is an API?

Application programming interfaces (APIs) allow two computer programs to communicate. You request the data you want to use in your project from an API, and the API fetches it for you.

These APIs can be local (like Windows APIs, Web APIs, and so on) or remote (like the APIs that developers provide through the internet, such as the Weather API and website APIs).

This article will focus on remote APIs, since developers often use this approach on modern websites. Websites use APIs to display results based on a response.

Some companies provide access to their APIs so you can develop on top of them, but not all of them do. Sometimes, an official API might not provide the functionality or design you want. So, first, you should look at what a site offers and use it to create the features you want.

In this tutorial, you will learn how to understand and explore the APIs behind a website so you can use them in your projects. You will first learn how APIs work, then explore what reverse engineering means. Then, you'll use a demo website and an example through Postman, where you will use an API to get some data from the website. You'll be able to use this data anywhere you want.

How Do APIs Work?

The structure of an API contains two levels: the client and the server. The client requests data from the API, and the server provides it. This technology has been around for a long time and is now standardized.

The client starts by connecting to the correct endpoint and providing the information the server needs. The server checks this data, processes the request, and returns a response that tells the client the outcome.

Here is a simple drawing showing this process:

API requests and responses usually have a similar structure, which is:

Request:

Endpoints: the target URL of the API.
Methods: tell the server what to do with data, like get the data, update it, delete it, and so on.
Parameters: extra details you provide to the server to narrow down the request, like the topic, category, and so on.
Headers: these key-value pairs provide information about the client, authentication, and more.
Body: this is the actual data the client sends to the server, such as the fields of a form or a JSON payload.

Response:

Status code: this three-digit HTTP status code tells the client about the server's result.
Header: this is similar to the request but has the server's information. It could be setting cookies or other details.
Body: has the actual data from the server to use.
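
To make the request anatomy above concrete, here is a minimal sketch in JavaScript that assembles the parts of a request into one object. The helper name `describeRequest` and the `category` parameter are hypothetical and only serve as illustration; the endpoint is the demo URL used later in this article.

```javascript
// Assemble the parts of an HTTP request (endpoint, method, parameters,
// headers, body) into one plain object.
// `describeRequest` and the `category` parameter are illustrative only.
function describeRequest({ endpoint, method = "GET", params = {}, headers = {}, body }) {
  const url = new URL(endpoint);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, value); // parameters end up in the query string
  }
  return {
    url: url.toString(),
    method,
    headers: { ...headers },
    // A GET request normally has no body, so it stays null here.
    body: body === undefined ? null : JSON.stringify(body),
  };
}

const salesRequest = describeRequest({
  endpoint: "http://localhost:3000/api/sales",
  params: { category: "books" },
  headers: { accept: "application/json" },
});

console.log(salesRequest);
```

The response has the mirror-image shape: a status code, headers, and a body.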

Now that you know a bit about APIs and HTTP requests, you can reverse engineer a website.

What is Reverse Engineering?

Reverse engineering is the art of analyzing a system to understand how developers built it. It helps you figure out how it functions so you can improve or hack it.

Some people use reverse engineering to crack programs. Others use it to customize them or even add extra functionalities.

As for websites, the reverse engineering process will help you understand what APIs a site has and how it's using them. It enables you to write your program based on the site's APIs.

Sometimes, reverse engineering can be used to find bugs, crack software, or even use an API without permission. Website developers tend to prevent that by providing an official API for their website, setting limits for API usage, and detecting any unauthorized use.

For this reason, when you start to reverse engineer any program, you will need to consider the terms of use and the legal side of your work so that you’re not doing anything illegal or unethical.

How to Reverse Engineer a Website

To reverse engineer a website, you need to do two things: first, you need to explore the website to see how it works and learn what kind of data and endpoints it provides. Second, you need to set some assumptions about how it works and try to validate the assumptions.

A simple assumption would be that after logging into the website, the website receives authentication information through API requests. Getting this information will allow you to use the website APIs without the need to log in every time.

To validate this assumption, you'll need to investigate the requests sent and received by the website. Then you'll need to send the requests yourself from an external source, like your terminal through cURL or an HTTP client like Postman.

I have created a demo website that we'll reverse engineer. You'll run the website on your computer and then reverse-engineer it. The website shows you a simple login page and has some customer data. Your goal will be to get the customer's recent sales data.

Here are a couple screenshots of the website and what you have:

Page 1: Login:

Page 2: The website data

Explore the Website

The first step in reverse engineering is to explore the website and see how it works. To do this, you'll use Chrome developer tools to check the requests sent by the website and see how they affect it. You'll also look for data received by those requests and see how you can use them.

At the same time you'll need to filter the requests since some of them don't send or receive data, but they get various files that the website uses, like CSS files or images.

Chrome developer tools help you analyze and understand a website, showing you the HTML elements, network, and storage that the website uses.
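
The Network panel can also export everything it recorded as a HAR file (a JSON log of requests and responses), which makes the filtering described above easy to script. The sketch below assumes the standard HAR layout (`log.entries[].request.url` and `log.entries[].response.content.mimeType`); the sample entries and the helper name `findJsonCalls` are made up for illustration.

```javascript
// Keep only the entries whose response looks like JSON, which is usually
// where the API calls are. Static files (CSS, images) are dropped.
function findJsonCalls(har) {
  return har.log.entries
    .filter((entry) => (entry.response.content.mimeType || "").includes("json"))
    .map((entry) => `${entry.request.method} ${entry.request.url}`);
}

// Hypothetical sample data in HAR shape.
const sampleHar = {
  log: {
    entries: [
      {
        request: { method: "GET", url: "http://localhost:3000/styles.css" },
        response: { content: { mimeType: "text/css" } },
      },
      {
        request: { method: "GET", url: "http://localhost:3000/logo.png" },
        response: { content: { mimeType: "image/png" } },
      },
      {
        request: { method: "GET", url: "http://localhost:3000/api/sales" },
        response: { content: { mimeType: "application/json" } },
      },
    ],
  },
};

const jsonCalls = findJsonCalls(sampleHar);
console.log(jsonCalls);
```

In the dev tools themselves, the Fetch/XHR filter in the Network panel does a similar job interactively.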

Check the Sales

Your target is to check the website's sales, so you need to log in to access the website and go to the dashboard page to check the sales.

On the sales page, you will do the following:

First, open Chrome dev tools (by either pressing F12 or right-clicking anywhere on the page and choosing Inspect) to see what kind of APIs the website provides.

Then, after opening the dev tools, you need to go to the network tab to check the network requests and see what the website sends to the server.

The network tab shows you the requests sent from the website to the server and how the server has responded to them.

As you can see in the previous image, you have an empty network in the dev tools. An empty network happens when you open the dev tools after the website sends the calls. A refresh (F5) on the website will be enough to check the calls.

In the following image, you can see the requests sent from the website to the server. If you analyze the request names, you will find one of the requests called sales, which is the one that has sales data. You can open the call and see the result of it.

If you click on the call, you will see that you have the headers, cookies, responses, and so on.

These tabs will help you understand the result of the call, the origin, and the request and response of the request. If you go to the response tab, you can see the sales data as JSON, which is what the website uses.

Right now, since you have this call, you can use it in the browser to get the result. To do this, you need to use the fetch function from JavaScript. This approach will help you see the function's result and how it works.

A simple way to do that is to right-click on the call, then go to “Copy,” and choose “Copy as fetch”. This copies the request as a JavaScript fetch call that you can reuse, with all of the headers and the body included in the copied text.

Here is the code of the fetch:

let fetchResult = fetch("http://localhost:3000/api/sales", {
  "headers": {
    "accept": "*/*",
    "accept-language": "en,tr-TR;q=0.9,tr;q=0.8,en-US;q=0.7,ar;q=0.6,it;q=0.5",
    "sec-ch-ua": "\"Chromium\";v=\"124\", \"Google Chrome\";v=\"124\", \"Not-A.Brand\";v=\"99\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin"
  },
  "referrer": "http://localhost:3000/dashboard",
  "referrerPolicy": "strict-origin-when-cross-origin",
  "body": null,
  "method": "GET",
  "mode": "cors",
  "credentials": "include"
})

You can consume this call by using this code:

fetchResult.then(res => res.json()).then(console.log)

Here is the result of the fetch:

As you can see, you were able to use the API to get the sales results and explore them. You can now use this data anywhere on the current website and fetch the sales API programmatically.

Doing this fetch through the browser's developer tools will assume that you are doing it in the website's name. This will add extra headers to the API request, like the current website hostname in the headers and current cookies attached to the API request.

But what if you're going to use the API outside the same website? You might wish to use the API on your website or a server. By using it on your website, I mean getting the sales data and showing it as a widget on your website, or even getting these data to store them on your server and process them to do data mining.

Using the data outside the host website would require a different hostname to be attached to the headers and other cookies connected to the API request. For these cases, you will need to use Postman, an HTTP request-testing software that helps you test, explore, and read API request data.

How to Use the API on Your Website

Since you got the API endpoint from the previous section, which was the following:

http://localhost:3000/api/sales

You might expect to use this URL and fetch the data you used before – and have it work immediately in Postman and be able to use it on your website. But it doesn’t work like this, since this fetch request does not contain any data about the authorization.

You can try it in Postman yourself to see the error. Postman provides two ways to write the fetch URL: the first one is to write the call in the Postman UI, and the other one is to import the API call to try it (as you can see in the image below).

To import the fetch request to Postman, first you need to click on the call from the network tab and copy it as:

cURL (Bash), then paste it to Postman. Copying the call as cURL (Bash) will allow you to get the headers and all related data from the website, like cookies, and so on.
URL only, which will have the URL without having extra data.

For this article, you will copy it as a URL and paste it into Postman. You'll do that to get a clean API call without the connected headers and cookies. Then, you'll click on the send button in Postman to make the request and get the result from the API.

When you click on send, you will see that you've gotten an unauthorized message from the API body and a status code of 200. The unauthorized message happened because you did not log into the website from Postman, and the status code is due to the API design. Some APIs might return a status code 401: unauthorized, which you might encounter on other websites.
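
Because this API reports the problem in the body while still answering with 200, a script cannot rely on the status code alone. The following sketch checks both. The exact wording and shape of the error body are assumptions, since the demo's response is only shown in a screenshot, and `looksUnauthorized` is a hypothetical helper name.

```javascript
// Decide whether a response means "not logged in".
// Some APIs signal this with a 401 status, others (like the demo) with a
// message in the body while the status is still 200.
function looksUnauthorized(status, bodyText) {
  if (status === 401) return true;
  // Assumed message format: any body mentioning "unauthorized".
  return /unauthori[sz]ed/i.test(bodyText);
}

console.log(looksUnauthorized(200, '{"message":"Unauthorized"}')); // true
console.log(looksUnauthorized(401, ""));                           // true
console.log(looksUnauthorized(200, '[{"id":1}]'));                 // false
```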

Being unauthorized means that you are not logged in and do not have permission to use this specific API. Some APIs are public, which you can use without any API key or extra details. Other APIs need authorization in terms of using a username and password or even a key provided by the API provider.

In this example, we are using a private API which needs authorization. With it, you can only get the data that you are allowed to access.

How to Get the Authorization and Explore it Through the Website

Based on the previous section, when you tried the API call, you got a message telling you that you are unauthorized. The request needs authorization.

So you need to establish an assumption about authorization by saying that you can get the authorization information from the login page in the demo website. For the assumption here, you might think about how the website is using the username and password to record the session. Usually websites record the session through cookies.

After getting logged in, you will be able to use the API call to get the data you want.

To try that, you can go through the following steps:

To figure out what the website is doing to get authorized, you can either:

Try copying the request as cURL (Bash) and seeing what extra cookies and options you get
Try logging in, getting the data from the login, and sending it to the protected call

You need to figure out how the website login function works and what data is being sent to the server to verify the API request and get authorized. So, you should check the login page and analyze the requests.

Here are the steps:

Go to the login page
Open the dev tools
Go to the network tab
Click on the login button

Before you click on login, enable the preserve log option so that the log is kept when you browse through different pages or the website redirects you.

As you can see from the image above, you got a login call from the website. You need to explore the call and see what results you get from it. Here is the explanation of the data from the response:

The response:

In the above image, you can see that you got a message that just says “ok”, which does not provide much detail. Right now, you need to check the headers and the cookies to see what the server sent and if you can use the server headers for authentication.

If you check the headers, you can find a response header called set-cookie, which is responsible for setting a cookie on your machine. This one has a loggedin=true value, indicating a log-in flag that the website could use.
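
If you want to read such a header in a script, a small parser is enough for simple cases. This is a minimal sketch: the `Path` and `HttpOnly` attributes in the sample value are assumptions (only `loggedin=true` is visible in the demo), and `parseSetCookie` is a hypothetical helper, not a built-in function.

```javascript
// Split a Set-Cookie header value into the cookie's name, its value, and
// the remaining attributes (Path, HttpOnly, and so on).
function parseSetCookie(headerValue) {
  const [pair, ...attributes] = headerValue.split(";").map((part) => part.trim());
  const separator = pair.indexOf("=");
  return {
    name: pair.slice(0, separator),
    value: pair.slice(separator + 1),
    attributes,
  };
}

const loginCookie = parseSetCookie("loggedin=true; Path=/; HttpOnly");
console.log(loginCookie.name, loginCookie.value); // loggedin true
console.log(loginCookie.attributes);
```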

You will see the same value when you go to the cookies tab.

Here, you might think that having a cookie sent with the “sales” request header could authorize the request. To double-check that, you can open the sales request from the dev tools and see what extra details the request headers have, from headers or cookies:

If you go to the cookies tab, you will notice the request did send the same cookies:

To ensure the cookies are the reason, you can return to the Postman call and add a cookie to test the call.

You need to do the following:

Open the headers tab
Add cookie as header
Send the request
Check the result

As you can see, you got the data back, which means that the server authorized the request, and you can access the data. Getting the data confirms the hypothesis you set in the beginning: the cookie set at login is what authorizes the request.
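
The same check can be scripted outside the browser. Below is a minimal sketch for Node.js 18 or later, where `fetch` is built in and, unlike in browsers, a `cookie` header can be set by hand. It assumes the demo server is running on `localhost:3000` and that the cookie is exactly `loggedin=true`, as seen in the dev tools. The helper names are hypothetical.

```javascript
// Turn an object like { loggedin: "true" } into a Cookie header value.
function cookieHeader(cookies) {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join("; ");
}

// Build the options for the sales request, with the cookie attached.
function salesRequestOptions(cookies) {
  return {
    method: "GET",
    headers: { accept: "application/json", cookie: cookieHeader(cookies) },
  };
}

// Send the request. `fetchImpl` can be swapped out for testing.
async function fetchSales(cookies, fetchImpl = fetch) {
  const response = await fetchImpl(
    "http://localhost:3000/api/sales",
    salesRequestOptions(cookies)
  );
  return response.json();
}

console.log(salesRequestOptions({ loggedin: "true" }));

// With the demo server running, you could then call:
// fetchSales({ loggedin: "true" }).then(console.log);
```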

Checking the sales with cURL (Bash)

A more straightforward way to do this would be to copy the request as cURL (Bash), which brings all the options to Postman. Then, you need to analyze the options and see which headers the request sent for authorization.

You can check out the following image, which has the URL pasted as cURL (Bash):

In the image, you can see that you have added 12 extra headers, and you can check them and analyze them. Sometimes, you might find an authorization header. Other times, you might have other token headers that you need to consider.
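
If you prefer to inspect the copied command in code rather than in Postman, the headers can be pulled out with a regular expression. This is a rough sketch that only handles the simple `-H 'name: value'` form. The sample command is shortened and made up for illustration, and header values containing escaped quotes would need a real shell parser.

```javascript
// Extract the -H 'name: value' headers from a "Copy as cURL (Bash)" string.
function headersFromCurl(command) {
  const found = {};
  const pattern = /-H '([^:']+):\s*([^']*)'/g;
  for (const match of command.matchAll(pattern)) {
    found[match[1].toLowerCase()] = match[2];
  }
  return found;
}

// Hypothetical, shortened example of a copied command.
const copiedCommand = [
  "curl 'http://localhost:3000/api/sales'",
  "-H 'accept: */*'",
  "-H 'cookie: loggedin=true'",
].join(" ");

const curlHeaders = headersFromCurl(copiedCommand);
console.log(curlHeaders);

// Headers that typically carry authorization data.
const authRelated = Object.keys(curlHeaders).filter((name) =>
  ["cookie", "authorization"].includes(name)
);
console.log(authRelated);
```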

When you notice the header responsible for authorization, you need to go back to the website and analyze it from the beginning to check which endpoint provides the related authorization. You did it the hard way at the beginning so that you understand how to authenticate even when the credential is a token or something else that is more challenging to figure out than a cookie.

As you will see in the next section, website authentication is getting more complicated daily, and you must be ready to try all of the methods.

Next Steps: Authorization and Authentication

Authentication and security are significant issues. As you noticed on the website, you had to use the cookies to show authentication, which would be valid for some websites.

Other websites might have more advanced encrypted methods to authenticate and authorize. For those situations, basic knowledge and curiosity will help you explore and use the APIs from the website.

Some websites use the OAuth standards to authorize, saving a token on the website to send requests. As you move forward and reverse engineer more websites, you will notice the different patterns and will be able to understand them and become better at this work.

Wrapping Up

This article was for educational purposes, which is why we used a clean website to help you see things quickly.

In real-world examples, things are more complicated, and you'll need to explore them more. But the main principles stay similar in all situations: one endpoint provides the authorization/authentication data, and another provides the related data.

Reverse engineering is not easy and requires a fair bit of patience, dedication, and persistence. As you can see, understanding the website takes a lot of time. Not all websites have clean API calls, and some have the calls mixed in with the many other files the website needs, such as CSS files, scripts, or even images. All you need is to be patient and try to think outside the box.

If you like this article, subscribe to my newsletter and follow me on Twitter.

How I solved a simple CrackMe challenge with the NSA’s Ghidra

freeCodeCamp — Wed, 20 Mar 2019 15:30:50 +0000

By Denis Nuțiu

Hello!

I’ve been playing recently a bit with Ghidra, which is a reverse engineering tool that was recently open sourced by the NSA. The official website describes the tool as:

A software reverse engineering (SRE) suite of tools developed by NSA’s Research Directorate in support of the Cybersecurity mission.

I’m at the beginning of my reverse engineering career, so I didn’t do anything advanced. I don’t know what features to expect from a professional tool like this, so if you’re looking to read about advanced Ghidra features, this is likely not the article for you.

In this article I will try to solve a simple CrackMe challenge that I’ve found on the website root-me. The challenge I’m solving is called ELF - CrackPass. If you want to give it a try by yourself, then you should consider not reading this article because it will spoil the challenge for you.

Let’s get started! I open up Ghidra and create a new Project which I call RootMe.

Then I import the challenge file by dragging it to the project folder. I will go with the defaults.

After being presented with some info about the binary file, I press OK, select the file, and double click it. This opens up Ghidra’s code browser utility and asks if I want to analyse the file, then I press Yes and go on with the defaults.

The Code Browser is quite convenient. In the left panel we get to see the disassembly view and in the right panel the decompile view.

Ghidra shows us directly the ELF header info and the entry point of the binary. After double clicking the entry point, the disassembler view jumps to the entry function.

Now we can successfully identify the main function, which I rename to main. It would be nice if the tool would attempt to automatically detect the main function and rename it accordingly.

Before analyzing the main function, I wanted to change its signature. I changed the return type to int and corrected the parameters’ type and name. This change has taken effect in the decompile view, which is cool!

Highlighting a line in the decompile view also highlights it in the assembly view.

Let’s explore the FUN_080485a5 function, which I’ll rename to CheckPassword.

The contents of the CheckPassword function can be found below. I’ve copied the code directly from Ghidra’s decompile view, which is a neat feature that many tools of this type lack! Being able to copy assembly and code is a nice to have feature.

void CheckPassword(char *param_1)
{
  ushort **ppuVar1;
  int iVar2;
  char *pcVar3;
  char cVar4;
  char local_108c [128];
  char local_100c [4096];
  cVar4 = *param_1;
  if (cVar4 != 0) {
    ppuVar1 = __ctype_b_loc();
    pcVar3 = param_1;
    do {
      if ((*(byte *)(*ppuVar1 + (int)cVar4) & 8) == 0) {
        puts("Bad password !");
                    /* WARNING: Subroutine does not return */
        abort();
      }
      cVar4 = pcVar3[1];
      pcVar3 = pcVar3 + 1;
    } while (cVar4 != 0);
  }
  FUN_080484f4(local_100c,param_1);
  FUN_0804851c(s_THEPASSWORDISEASYTOCRACK_08049960,local_108c);
  iVar2 = strcmp(local_108c,local_100c);
  if (iVar2 == 0) {
    printf("Good work, the password is : \n\n%s\n",local_108c);
  }
  else {
    puts("Is not the good password !");
  }
  return;
}

After taking a look at the code, I’ve come to the following conclusions. The block with the if checks if the user has provided a password and inspects the provided password to check if it’s a valid character or something. I’m not exactly sure what it’s checking for, but here’s what __ctype_b_loc()’s documentation says:

The __ctype_b_loc() function shall return a pointer into an array of characters in the current locale that contains characteristics for each character in the current character set. The array shall contain a total of 384 characters, and can be indexed with any signed or unsigned char (i.e. with an index value between -128 and 255). If the application is multi-threaded, the array shall be local to the current thread.

Anyways, that block of code is not really worth the time, because it doesn’t modify our password in any way, it just verifies it. So we can skip this kind of verification.
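
For what it is worth, here is one plausible reading of that loop. In glibc on little-endian x86, the mask 8 applied to the low byte of a `__ctype_b_loc()` table entry corresponds to the "alphanumeric" flag, so the loop would reject any password containing a character that is not a letter or a digit. This interpretation is an assumption, not something stated in the challenge. A JavaScript sketch of the same check, assuming the default C locale and a hypothetical helper name:

```javascript
// Model of the verification loop: every character must be a letter or digit.
// In the real program, a failing character prints "Bad password !" and aborts.
function passesCharacterCheck(password) {
  for (const character of password) {
    if (!/^[A-Za-z0-9]$/.test(character)) {
      return false;
    }
  }
  return true;
}

console.log(passesCharacterCheck("abc123"));  // true
console.log(passesCharacterCheck("abc-123")); // false
```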

The next function called is FUN_080484f4. Looking at its code, we can tell that it’s just a custom memcopy implementation. Instead of copying the C code from the decompiler view, I copied the assembly code — yes, this is fun.

*************************************************************
*                           FUNCTION
*************************************************************
undefined  FUN_080484f4 (undefined4  param_1 , undefined4  p
undefined         AL:1
undefined4        Stack[0x4]:4   param_1          XREF[1]:     080484f8 (R)
undefined4        Stack[0x8]:4   param_2          XREF[1]:     080484fb (R)
FUN_080484f4                                      XREF[1]:     CheckPassword:080485f5 (c)
080484f4 55              PUSH       EBP
080484f5 89  e5          MOV        EBP ,ESP
080484f7 53              PUSH       EBX
080484f8 8b  5d  08      MOV        EBX ,dword ptr [EBP  + param_1 ]
080484fb 8b  4d  0c      MOV        ECX ,dword ptr [EBP  + param_2 ]
080484fe 0f  b6  11      MOVZX      EDX ,byte ptr [ECX ]
08048501 84  d2          TEST       DL,DL
08048503 74  14          JZ         LAB_08048519
08048505 b8  00  00      MOV        EAX ,0x0
         00  00
LAB_0804850a                                      XREF[1]:     08048517 (j)
0804850a 88  14  03      MOV        byte ptr [EBX  + EAX *0x1 ],DL
0804850d 0f  b6  54      MOVZX      EDX ,byte ptr [ECX  + EAX *0x1  + 0x1 ]
         01  01
08048512 83  c0  01      ADD        EAX ,0x1
08048515 84  d2          TEST       DL,DL
08048517 75  f1          JNZ        LAB_0804850a
LAB_08048519                                      XREF[1]:     08048503 (j)
08048519 5b              POP        EBX
0804851a 5d              POP        EBP
0804851b c3              RET

Comment: param_1 is dest, param_2 is src. 08048501 checks if src is null and if it is it returns else it initializes EAX (index, current_character) with 0. The next instructions move bytes into EBX (dest) from EDX (src). The loop stops when EDX is null.
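
The loop described in the comment can be modelled in a few lines of JavaScript, with each statement mapped to the instruction it stands for. This is only a sketch for following the logic: the helper `toCString` and the returned byte count are additions for illustration, not part of the binary.

```javascript
// Encode a string as a null-terminated byte array, like a C string.
function toCString(text) {
  const bytes = new Uint8Array(text.length + 1); // last byte stays 0
  for (let i = 0; i < text.length; i++) {
    bytes[i] = text.charCodeAt(i) & 0xff;
  }
  return bytes;
}

// Model of FUN_080484f4: copy bytes from src to dest until a null byte.
// Returning the number of copied bytes is a convenience of this sketch.
function customCopy(dest, src) {
  let dl = src[0];        // MOVZX EDX, byte ptr [ECX]
  if (dl === 0) return 0; // TEST DL,DL / JZ LAB_08048519
  let eax = 0;            // MOV EAX, 0x0
  do {
    dest[eax] = dl;       // MOV byte ptr [EBX + EAX], DL
    dl = src[eax + 1];    // MOVZX EDX, byte ptr [ECX + EAX + 1]
    eax += 1;             // ADD EAX, 0x1
  } while (dl !== 0);     // TEST DL,DL / JNZ LAB_0804850a
  return eax;
}

const destination = new Uint8Array(8).fill(0xff);
const copiedCount = customCopy(destination, toCString("abc"));
console.log(copiedCount, Array.from(destination.slice(0, 4))); // 3 and [97, 98, 99, 255]
```

Note that, read this way, the loop exits before storing the terminating null byte, which is why the fourth byte of the destination is untouched in the output.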

And the other function, FUN_0804851c, generates the password from the “THEPASSWORDISEASYTOCRACK” string. Looking at the decompiled view, we can roughly see how this function works. If we didn’t have that, we would need to manually analyze every assembly instruction from the function to understand what it does.

Then, we compare the previously generated password with the password that we got from the user (the first argument, argv[1]). If it matches, the program says good job and prints it, else it prints an error message.

From this basic analysis, we can conclude that if we patch the program in various places, we can get it to spit out the password without us needing to reverse any C function and write code. Patching the program means changing some of its instructions.

Let’s see what we have to patch:

At address 0x0804868c we patch the JNS instruction into a JMP. And voilà, the change is reflected in the decompiler view. The ptrace result check is bypassed.

{
  ptrace(PTRACE_TRACEME,0,1,0);
  if (argc != 2) {
    puts("You must give a password for use this program !");
                    /* WARNING: Subroutine does not return */
    abort();
  }
  CheckPassword(argv[1]);
  return 0;
}

At address 0x080485b8 we patch the JZ instruction into a JMP. We bypass that password verification block we saw earlier.

void CheckPassword(undefined4 param_1)
{
  int iVar1;
  char local_108c [128];
  char local_100c [4096];
  CustomCopy(local_100c,param_1);
  GeneratePassword(s_THEPASSWORDISEASYTOCRACK_08049960,local_108c);
  iVar1 = strcmp(local_108c,local_100c);
  if (iVar1 == 0) {
    printf("Good work, the password is : \n\n%s\n",local_108c);
  }
  else {
    puts("Is not the good password !");
  }
  return;
}

At address 0x0804861e we patch JNZ to JZ. This inverts the if/else condition. Since we don’t know the password, we’re going to submit a random password that is not equal to the generated one, thus executing the printf on the else block.

void CheckPassword(undefined4 param_1)
{
  int iVar1;
  char local_108c [128];
  char local_100c [4096];
  CustomCopy(local_100c,param_1);
  // constructs the password from the strings and stores it in
  // local_108c
  GeneratePassword(s_THEPASSWORDISEASYTOCRACK_08049960,local_108c);
  iVar1 = strcmp(local_108c,local_100c);
  if (iVar1 == 0) { // passwords are equal
    puts("Is not the good password !");
  }
  else {
    printf("Good work, the password is : \n\n%s\n",local_108c);
  }
  return;
}
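
At the byte level, each of these patches is tiny. For short conditional jumps on x86, the opcode is a single byte (0x74 for JZ, 0x75 for JNZ, 0x79 for JNS, 0xEB for an unconditional short JMP) followed by a one-byte offset, so swapping the instruction means changing one byte. The sketch below shows the idea on a made-up two-byte buffer. The `75 f1` bytes are borrowed from the copy function's listing, not from the patched addresses, and whether the patched instructions in this binary use the short encoding is an assumption.

```javascript
// Opcodes of the short (rel8) forms of the jumps mentioned in the patches.
const SHORT_JUMP_OPCODES = { JZ: 0x74, JNZ: 0x75, JNS: 0x79, JMP: 0xeb };

// Return a copy of `bytes` where the jump at `offset` is replaced.
// Throws if the byte at `offset` is not the expected instruction.
function patchShortJump(bytes, offset, from, to) {
  if (bytes[offset] !== SHORT_JUMP_OPCODES[from]) {
    throw new Error(`expected ${from} at offset ${offset}`);
  }
  const patched = Uint8Array.from(bytes);
  patched[offset] = SHORT_JUMP_OPCODES[to]; // the jump offset byte is untouched
  return patched;
}

const originalBytes = Uint8Array.from([0x75, 0xf1]); // JNZ rel8
const patchedBytes = patchShortJump(originalBytes, 0, "JNZ", "JZ");
console.log(Array.from(patchedBytes, (byte) => byte.toString(16))); // [ '74', 'f1' ]
```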

That’s all!

Now we run the program. In other tools we just save the file and it works, but in Ghidra it seems that we need to export it.

To export the program, we go to File -> Export Program (O). We change the format to binary and click OK.

I get the exported program on my desktop, but I couldn’t manage to run it. After trying to read its header with the readelf -h command, I get the following output:

root@DESKTOP:/mnt/c/users/denis/Desktop# readelf -h Crack.bin
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x8048440
  Start of program headers:          52 (bytes into file)
  Start of section headers:          2848 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         7
  Size of section headers:           40 (bytes)
  Number of section headers:         27
  Section header string table index: 26
readelf: Error: Reading 1080 bytes extends past end of file for section headers
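
The error message can be checked against the header fields printed above: 27 section headers of 40 bytes each are 1080 bytes, and they are supposed to start at offset 2848, so the file would need to be at least 3928 bytes long. Here is a small sketch of that arithmetic. The actual size of the exported file is not given in the output, so the value used below is only a hypothetical example.

```javascript
// Compute where the section header table ends, from the ELF header fields.
function sectionHeaderTableEnd({ sectionHeaderOffset, sectionHeaderSize, sectionHeaderCount }) {
  return sectionHeaderOffset + sectionHeaderSize * sectionHeaderCount;
}

// Values taken from the readelf output above.
const elfHeader = {
  sectionHeaderOffset: 2848,
  sectionHeaderSize: 40,
  sectionHeaderCount: 27,
};

const requiredSize = sectionHeaderTableEnd(elfHeader);
console.log(requiredSize); // 3928

// Hypothetical file size, only to show the comparison readelf makes.
const hypotheticalFileSize = 3000;
console.log(hypotheticalFileSize < requiredSize); // true: the table would not fit
```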

Shame. It looks like Ghidra has messed up the file header… and, right now I don’t want to manually fix headers. So I fired up another tool and applied the same patches to the file, saved it, ran it with a random argument and validated the flag.

Conclusions

Ghidra is a nice tool with a lot of potential. In its current state, it’s not that great but it works. I’ve also encountered a weird scrolling bug while running it on my laptop.

The alternatives would be to pay $$ for other tools of this kind, make your own tools, or work with free but not so user friendly tools.

Let’s hope that once the code is released, the community will start doing fixes and improve Ghidra.

Thanks for reading!

How I Reverse Engineered A Chrome Extension To Write My Own Flask App

freeCodeCamp — Fri, 02 Feb 2018 17:05:34 +0000

By Tushar Agrawal

Basically, if I have no intention of using a service then I won’t bother reverse-engineering it. — Jon Lech Johansen

As evident from my bio, I am crazy about music and pretty much anything related to it. And I believe that music videos, if well-directed, are possibly the best way to feel the inherent soul of music.

So, it all began with me watching the music video of a song “Heavydirtysoul” by Twenty One Pilots. The music video was so dope I didn’t even care for the lyrics. It was only after I listened to it a few times that I realized I didn’t get much of the lyrics except the chorus part.

This is something that is an actual problem for many ESL (English as a Second Language) speakers. You can’t enjoy a song to its fullest if you don’t get the lyrics.

It was then that I thought of something: what if I could play the lyrics of a song alongside the music videos (much like subtitles)? It would be awesome if I could create subtitle files for my music videos and then play it on my video player!

Initial Approach and finding Musixmatch

I then began a comprehensive search for sites or APIs that could provide me the lyrics for a song. And as expected, I found a dozen sites that provided the lyrics. Cool… isn’t it?

Nah. Because, what I really needed was timed lyrics, much like a subtitle for a movie. I wanted the lyrics text to sync with the current video frame on the screen. After much searching, I was unable to find any such service.

It was only after a week that someone told me to use Musixmatch, a Chrome extension that embeds lyrics on YouTube videos. So, yeah, there was someone out there who was already doing what I had thought about. It sounded like most of the other well thought so-called new ideas I had...and I was just a step away from fetching SubRip “srt” subtitle files for my favorite music videos.

And the hacking started…

I already had a bit of experience working with the chrome developer tools (thanks to Node.js and front end designing). So I put on my hacker glasses and fired up Chrome Dev tools. I switched to the network tab and began to look for any text file that could contain the lyrics.

Snapshot of developer tools with YouTube video playing

But I was analyzing requests on a page that was playing YouTube videos, so I had plenty of requests. And since the extension was fetching lyrics, the request must have something to do with the Musixmatch domain.

So I filtered using the keyword ‘musix’, looked patiently for my file, and finally found it: lyrics along with timestamps. I noted the URL of that request and, frankly, it all seemed like gibberish to me. Anyway, I copied the URL string as it was and pasted it into the URL bar, and voilà, I got the lyrics.

So, the only thing left was to find out how the URL was being built and what the parameters were.

Request URL

Parameters and what?

After all the analyzing and filtering, I finally ended up with this. A long URL with a bunch of unknown parameters.

Parameters for the URL

I needed to dig deeper to actually understand the importance of each parameter. At a glance, it was clear that the only parameters that actually mattered were res and v. Others were just for house-keeping stuff. Then I began to explore the options and ended up wasting an hour just to find that the parameter v is nothing but the YouTube Video Id.

For example, the Video Id or v for a YouTube video with the URL https://www.youtube.com/watch?v=ZQeq_T_2VE8 is ZQeq_T_2VE8. Now that I had unveiled the mystery of v, I thought it would take me hardly another hour to find out about res, but boy was I wrong.
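As a minimal sketch of that step, the Video Id can be read from the query string with Python's standard urllib.parse module. The helper name video_id_from_url is only illustrative and is not part of the original script.

```python
from urllib.parse import urlparse, parse_qs


def video_id_from_url(url):
    """Return the value of the `v` query parameter of a YouTube watch URL."""
    query = parse_qs(urlparse(url).query)  # e.g. {"v": ["ZQeq_T_2VE8"]}
    values = query.get("v")
    if not values:
        raise ValueError("no 'v' parameter in URL: " + url)
    return values[0]


print(video_id_from_url("https://www.youtube.com/watch?v=ZQeq_T_2VE8"))
# prints ZQeq_T_2VE8
```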

The curious case of the parameter ‘res’

An hour of deep analysis and research gave me nothing. A little later, I realized that the URL worked even when I changed a few letters. I kept digging and, by the end of three hours, I figured out that the letters in the string didn’t mean anything. They were just inserted randomly.

A typical value of res : 90rt120b114xz70xv82w85vv90a94hn90vb102av86

So I was done with the letters, but the numeric values were still alien to me. The next thing I could think of was applying a bit of reverse engineering to analyze the numbers.

I began by removing all the letters, as they didn’t mean anything, and the first thing I noticed was that the count of the remaining numeric values was fixed at 11. I tried it with many other videos, but the count remained constant.

Suddenly, it struck me: the Video Id, the v we discussed earlier, also had 11 characters. However, each character in v could be a letter, a digit, or even a ‘-’ or ‘_’, unlike res, which had only numbers.

So, I tried the most obvious mapping from a character to a numeric value, ASCII, and voilà, that was it. Each character of the Video Id was replaced by its ASCII code, and letters were randomly inserted between the numbers to make the whole string look more random, I guess.
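The mapping described above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the original script: the function names are made up, and the choice of one or two lowercase filler letters between numbers is only a guess based on the sample value of res shown earlier. Note that the filler also keeps neighbouring numbers from running together.

```python
import random
import re
import string


def decode_res(res):
    """Drop the filler letters and read each remaining number as an ASCII code."""
    codes = re.findall(r"\d+", res)
    return "".join(chr(int(code)) for code in codes)


def encode_res(video_id, rng=random):
    """Build a res-like string: ASCII codes separated by random filler letters.

    Assumption: 1 or 2 lowercase letters between numbers, as in the sample.
    """
    parts = []
    for ch in video_id:
        parts.append(str(ord(ch)))
        filler_length = rng.randint(1, 2)
        parts.append("".join(rng.choice(string.ascii_lowercase)
                             for _ in range(filler_length)))
    return "".join(parts[:-1])  # no filler after the last number


res = encode_res("ZQeq_T_2VE8")
print(res)              # the filler letters differ on every run
print(decode_res(res))  # prints ZQeq_T_2VE8
```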

At this point, I was delighted. After all, I had learned about all the parameters and was only a step away from writing my own handy script to download the lyrics file in “srt” format. Just to be sure, I checked with different videos and there seemed to be no issue whatsoever. I also shared the URL with one of my friends (yeah, a music lover).

I got a quick reply and it said “What is it? There’s nothing”. I crosschecked the URL and it was working fine on my browser.

Who was the culprit? :P

I don’t get sent anything strange like underwear. I get sent cookies. :P — Jennifer Aniston

Cookie field in the Request Headers

I fired up the developer tools again and then copied the link for a new song. It again worked and then I switched to an incognito tab and pasted that same URL. It didn’t work.

My experience with CTF (Capture The Flag) contests immediately told me that it had something to do with the cookies. That’s the most likely case if a URL works in one browser window but not in another.

I switched to the developer console and saw that a cookie was indeed being sent by the browser. To be sure, I analyzed the request many times, and it finally occurred to me that the cookie being sent was the same one the Musixmatch server had sent in an earlier response. Also, each cookie is valid only for a certain period of time.

So, I wrote a Python script using urllib that first gets the cookie from a normal HTTP response, since the cookie works across the domain. Then the cookie, along with the other parameters, was sent in an HTTP request and we got the lyrics... Finally!!

Preparing the parameters for a successful request

Here is the Python code for all the steps discussed above. The code first generates the parameters and then makes a request to get the cookies. The URL is then prepared using the parameters. Next, the cookie is set in the request headers along with other header fields like ‘Host’ and ‘User-agent’ to make the request look more authentic.
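A minimal sketch of that flow, written for Python 3's urllib.request, might look like the following. It is not the original script: the two URLs are hypothetical placeholders (the real endpoint and its housekeeping parameters are not given here), the function names are made up, and only the v and res parameters are included.

```python
import urllib.request
from urllib.parse import urlencode

# Hypothetical placeholders: the real endpoint is not shown in this article.
COOKIE_SOURCE_URL = "https://example.com/"
LYRICS_ENDPOINT = "https://example.com/lyrics"


def cookie_header_from(set_cookie_values):
    """Turn a list of Set-Cookie header values into one Cookie header value."""
    # Keep only the leading "name=value" pair; drop attributes such as Path.
    pairs = [value.split(";", 1)[0].strip() for value in set_cookie_values]
    return "; ".join(pairs)


def fetch_cookie(url=COOKIE_SOURCE_URL):
    """Make a plain request and collect the cookies the server sets."""
    with urllib.request.urlopen(url) as response:
        return cookie_header_from(response.headers.get_all("Set-Cookie") or [])


def build_lyrics_request(video_id, res, cookie):
    """Prepare (but do not send) the lyrics request with its headers."""
    query = urlencode({"v": video_id, "res": res})
    headers = {
        "Cookie": cookie,
        "User-Agent": "Mozilla/5.0",  # browser-like value, an assumption
    }
    # urllib fills in the Host header from the URL when the request is sent.
    return urllib.request.Request(LYRICS_ENDPOINT + "?" + query, headers=headers)
```

To send it, you would pass the cookie returned by fetch_cookie() into build_lyrics_request() and hand the resulting request to urllib.request.urlopen(). Since each cookie is only valid for a limited time, fetching a fresh one before each lyrics request is the simplest approach.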

Parsing the raw timed lyrics into srt format

Now, the only task left was to convert the raw timed lyrics data into a proper srt (SubRip Text) format. Here is what the Musixmatch lyrics format looked like.

HTTP Response for the lyrics

Below is the proper format for an srt file. These files contain formatted lines of plain text in groups separated by a blank line. Subtitles are numbered sequentially, starting at 1, as shown in the example below.

1
00:00:00,350 --> 00:00:03,450
71 buildings exploded
or caught fire.

2
00:00:03,490 --> 00:00:05,020
Elliot, tell me what it is
that you think he did.

3
00:00:05,060 --> 00:00:06,930
Sorry.
I don't know if I can say.
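To make the format concrete, here is a small sketch that writes srt blocks from timed cues. The input shape, a list of (start_seconds, end_seconds, text) tuples, is an assumption made for illustration and is not the Musixmatch response format.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (srt uses a comma before the milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues):
    """Build srt text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)  # a blank line separates consecutive blocks


print(to_srt([
    (0.35, 3.45, "71 buildings exploded\nor caught fire."),
    (5.06, 6.93, "Sorry.\nI don't know if I can say."),
]))
```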

It sounded like a whole lot of work was required, as the data had yet to be properly formatted. But if you have the required data and a knowledge of Python, all it takes is a simple script to handle the data, and that’s exactly what I wrote. The HTML tags annoyed me a bit during parsing, but guess what, there is an awesome library just for HTML parsing, which made the whole process very easy. No points for guessing the library’s name: HTMLParser :-).
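As a rough illustration of the tag-stripping step, here is a sketch using HTMLParser from Python 3's html.parser module (in Python 2 the module itself was called HTMLParser). The class and function names are made up, and the sample markup is invented rather than taken from a real response.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text content, ignoring every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text of `markup` with all HTML tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags("<p>I don&#39;t know <b>why</b></p>"))
# prints I don't know why
```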

Final words

So, I put this script together with some modifications and, with a simple front end on a Flask server, I had my own lyrics-fetching interface, possibly the only one of its kind in the whole world!!

By the way, if you are into music, have a look at Musixmatch. It is really awesome. This exercise was just for educational purposes and wasn’t used in any way to violate Musixmatch’s copyright.