Amr Hesham - freeCodeCamp.org

How to Run SQL-Like Queries on C/C++ Files

Amr Hesham — Thu, 02 May 2024 19:35:48 +0000

Hello everyone! I'm a Software engineer who's interested in low-level programming, compilers, and tool development.

At the end of 2023, I published my first article on freeCodeCamp about how I created a SQL-like Language to run queries on local Git repositories. If you want a bit more context, give it a read.

At the start of 2024, the project got bigger and bigger with more features and amazing contributors, and I started to think: what if I could run SQL-like queries not only on .git files but on any kind of local and remote data?

In my last article, How to Run SQL-Like Queries on Files, I explained the internal design of the GitQL SDK components, how to use the SDK with any kind of data, and how I implemented the FileQL project.

In this article, I will explain how I used the GitQL SDK to implement the ClangQL (Clang Query Language) project, which is a tool that helps you run SQL-like queries on local C/C++ files.

How I Came Up with the ClangQL Project

As I mentioned in my past articles, the GitQL SDK can run SQL-like queries on any local or remote structured data. Also, the compiler parses your code into an AST (Abstract Syntax Tree) data structure. So the question that jumped into my mind was, why not run the query on the Abstract Syntax Tree?

There were no limitations I could think of for implementing this idea, so I started to think about the two main requirements for using GitQL: creating the Data Schema to describe the table structures and column types, and implementing the Data Provider component to provide the data, which in our case is the AST information, and map it to the Engine format.

The Data Schema for the C/C++ Code

You can think of the Data Schema as the place where we put structure and relationships of our data – for example, which tables we have, and for each table what columns they contain, and finally the types of each column.

This information is very useful when you're performing type checking and detecting if the user has written the wrong column name, for example, which is not defined in the selected table they want to use.

In our case, the tables can be classes, structs, enumerations, functions, variables and any other data that can be read from C++ such as macros and so on. But I decided to start simple with functions and variables only, then I planned to add other kinds.

So for the functions table, let's define what columns we need to include. The columns and types are not hard to guess, so let's take a normal function as an example. It has a name as Text, a return type as Text, the number of parameters as Integer, and other C++ flags as Booleans: for example, whether it is a virtual function (is_virtual), a pure virtual function (is_pure_virtual), or a static function (is_static).

So to create a Data Schema you need to define two things: what tables you have, and what columns there are in this table. For example, in the functions table it will look like this:

lazy_static! {
    pub static ref TABLES_FIELDS_NAMES: HashMap<&'static str, Vec<&'static str>> = {
        let mut map = HashMap::new();
        map.insert(
            "functions",
            vec![
                "name",
                "signature",
                "args_count",
                "return_type",
                "class_name",
                "is_method",
                "is_virtual",
                "is_pure_virtual",
                "is_static",
                "is_const",
                "has_template",
                "access_modifier",
                "is_variadic",
                "file",
                "line",
                "column",
                "offset",
            ],
        );
        map
    };
}

You also need to define the expected data type for each column:

lazy_static! {
    pub static ref TABLES_FIELDS_TYPES: HashMap<&'static str, DataType> = {
        let mut map = HashMap::new();
        map.insert("name", DataType::Text);
        map.insert("type", DataType::Text);
        map.insert("signature", DataType::Text);
        map.insert("args_count", DataType::Integer);
        map.insert("return_type", DataType::Text);
        map.insert("class_name", DataType::Text);
        map.insert("is_method", DataType::Boolean);
        map.insert("is_virtual", DataType::Boolean);
        map.insert("is_pure_virtual", DataType::Boolean);
        map.insert("is_static", DataType::Boolean);
        map.insert("is_const", DataType::Boolean);
        map.insert("has_template", DataType::Boolean);
        map.insert("access_modifier", DataType::Integer);
        map.insert("is_variadic", DataType::Boolean);
        map
    };
}
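To make the role of the two maps more concrete, here is a minimal sketch of how a checker could use them to reject an unknown column and look up a column's type. The `DataType` enum is a simplified stand-in for the SDK's type, and `sample_schema` and `resolve_column` are hypothetical helpers written for this illustration, not part of GitQL or ClangQL.

```rust
use std::collections::HashMap;

// Simplified stand-in for the SDK's DataType (assumption: three variants only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Text,
    Integer,
    Boolean,
}

type FieldNames = HashMap<&'static str, Vec<&'static str>>;
type FieldTypes = HashMap<&'static str, DataType>;

// Hypothetical helper: builds a tiny schema with a reduced `functions` table.
pub fn sample_schema() -> (FieldNames, FieldTypes) {
    let mut names = HashMap::new();
    names.insert("functions", vec!["name", "args_count", "is_virtual"]);

    let mut types = HashMap::new();
    types.insert("name", DataType::Text);
    types.insert("args_count", DataType::Integer);
    types.insert("is_virtual", DataType::Boolean);

    (names, types)
}

// Hypothetical helper: returns the column's type, or a message explaining
// why the table or column is not valid.
pub fn resolve_column(
    names: &FieldNames,
    types: &FieldTypes,
    table: &str,
    column: &str,
) -> Result<DataType, String> {
    let columns = names
        .get(table)
        .ok_or_else(|| format!("Table `{}` is not defined", table))?;
    if !columns.iter().any(|name| *name == column) {
        return Err(format!("Field `{}` is not defined in {} table", column, table));
    }
    types
        .get(column)
        .copied()
        .ok_or_else(|| format!("Field `{}` has no registered type", column))
}
```

With this sketch, a misspelled column such as retrun_type produces an error before any C/C++ file is parsed.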

Now let's move on to the most exciting part: the Data Provider.

The Data Provider for the C/C++ Code

The data provider component is used to tell the engine how to load the target data – for example from where and on which thread – and provide this data in a format that is known by our GitQL Engine. So how can we extract that information from our C/C++ code?

Well, we need to get the AST after parsing the C/C++ code. So the first option is to write a C/C++ parser to parse the files and provide the AST. But this option has some problems here: it'll require a lot of hard work, as C++ is a large language. To write a parser from scratch means you need to support every new feature, and handle errors, and so on.

The other option is to take a well-written C/C++ parser from a compiler that provides its parser as a library and use it to provide the AST. After some searching, I found that the Clang compiler is well-designed and provides its parser as a library that you can use to build other tools, such as code formatters and linters.

LibClang exposes Clang's parser through a C API, so I used its bindings for the Rust programming language to parse the source file into a TranslationUnit. This is the root node that contains information about classes, functions, and so on.

LibClang provides more than one way to visit the TranslationUnit and all of its children. One of them is the clang_visitChildren function. It takes a pointer to a visitor function that receives each node and its parent and returns a flag. Using this flag, you control whether the traversal breaks, continues to the next node, or walks inside the current node.

For example if you are visiting the Class or Struct node and want to visit the methods inside them, you need to return CXChildVisit_Recurse – and clang_visitChildren will provide the methods for you. But if you want to just read class info then you need to return CXChildVisit_Continue to continue to other nodes. Using those flags in the wrong way can lead to performance issues and visiting many nodes that aren't useful.
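To illustrate how the three flags steer a traversal, here is a small sketch on a toy tree. It does not use libclang: `VisitResult`, `Node`, `visit_children`, and `collect_methods` are hypothetical stand-ins that only mimic the Break, Continue, and Recurse behavior described above.

```rust
// Toy stand-in for the visitor flags (not the real libclang type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum VisitResult {
    Break,    // stop the whole traversal
    Continue, // skip this node's children and move to the next sibling
    Recurse,  // descend into this node's children
}

// Toy stand-in for an AST node.
pub struct Node {
    pub kind: &'static str,
    pub name: &'static str,
    pub children: Vec<Node>,
}

// Walks the children of `parent`, obeying the flag returned by `visitor`.
// Returns false if the traversal was stopped by `Break`.
pub fn visit_children<F>(parent: &Node, visitor: &mut F) -> bool
where
    F: FnMut(&Node) -> VisitResult,
{
    for child in &parent.children {
        match visitor(child) {
            VisitResult::Break => return false,
            VisitResult::Continue => {}
            VisitResult::Recurse => {
                if !visit_children(child, visitor) {
                    return false;
                }
            }
        }
    }
    true
}

// Collects method names: recurse into classes, continue past everything else.
pub fn collect_methods(root: &Node) -> Vec<&'static str> {
    let mut methods = Vec::new();
    visit_children(root, &mut |node: &Node| match node.kind {
        "class" => VisitResult::Recurse,
        "method" => {
            methods.push(node.name);
            VisitResult::Continue
        }
        _ => VisitResult::Continue,
    });
    methods
}
```

Because the visitor returns Continue for nodes it is not interested in, their subtrees are never walked, which is where the performance difference comes from.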

So to get a function's info, we need to call clang_visitChildren and pass a pointer to our data so that the visitor can save the information it extracts. For example:

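// FunctionNode (name assumed here) is the struct that stores one function's information
let mut functions: Vec<FunctionNode> = Vec::new();
let data = &mut functions as *mut Vec<FunctionNode> as *mut c_void;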

let cursor = clang_getTranslationUnitCursor(translation_unit);
clang_visitChildren(cursor, visit_children, data);

We passed visit_children, which points to the function that extracts the C/C++ function's information. It will look like this:

extern "C" fn visit_children(
    cursor: CXCursor,
    parent: CXCursor,
    data: *mut c_void,
) -> CXChildVisitResult {

    let cursor_kind = clang_getCursorKind(cursor);
    if cursor_kind == CXCursor_FunctionDecl
        || cursor_kind == CXCursor_CXXMethod
        || cursor_kind == CXCursor_FunctionTemplate
    {
        let function_name = clang_getCursorSpelling(cursor);
        let function_type = clang_getCursorType(cursor);
        let result_type = clang_getResultType(function_type);
        let arguments_count = clang_getNumArgTypes(function_type);

        // ... Extracting more and more information

        return CXChildVisit_Continue
    }

    CXChildVisit_Recurse
}
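The visitor above receives our vector only as an untyped data pointer, so it has to cast it back before it can store anything. Here is a minimal, libclang-free sketch of that round trip. `FunctionInfo`, `record_function`, `drive_callback`, and `collect` are hypothetical names used only for this illustration, and `drive_callback` merely plays the role of a C API that forwards the pointer untouched.

```rust
use std::ffi::c_void;

// Hypothetical record type; the real project stores many more fields.
#[derive(Debug, PartialEq)]
pub struct FunctionInfo {
    pub name: String,
    pub args_count: i32,
}

// A C-style callback: it only receives an untyped pointer to the caller's data.
pub extern "C" fn record_function(args_count: i32, data: *mut c_void) {
    // SAFETY: the caller guarantees that `data` points to a live
    // Vec<FunctionInfo> that nothing else accesses during this call.
    let functions = unsafe { &mut *(data as *mut Vec<FunctionInfo>) };
    let name = format!("function_{}", functions.len());
    functions.push(FunctionInfo { name, args_count });
}

// Stand-in for a C API that calls back into our code and forwards `data` as-is.
pub fn drive_callback(callback: extern "C" fn(i32, *mut c_void), data: *mut c_void) {
    for args_count in 0..3 {
        callback(args_count, data);
    }
}

// Creates the vector, hands it over as a raw pointer, and reads the result.
pub fn collect() -> Vec<FunctionInfo> {
    let mut functions: Vec<FunctionInfo> = Vec::new();
    let data = &mut functions as *mut Vec<FunctionInfo> as *mut c_void;
    drive_callback(record_function, data);
    functions
}
```

The cast is only sound while the vector outlives the traversal, which is why the vector is created before the call and read after it returns.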

Also, if you want to refactor or build advanced searching tools on top of ClangQL, you'll need to get the source code location. For example, where exactly does the function you're searching for exist – on which file and line?

So to get them from Clang, you can use the code below. It provides the file name, line, column, and offset of the selected node:

let cursor_location = clang_getCursorLocation(cursor);

let mut file: CXFile = std::ptr::null_mut();
let mut line: u32 = 0;
let mut column: u32 = 0;
let mut offset: u32 = 0;

clang_getFileLocation(
    cursor_location,
    &mut file,
    &mut line,
    &mut column,
    &mut offset,
);

let file_name = clang_getFileName(file);
let file_name_str = CStr::from_ptr(clang_getCString(file_name)).to_string_lossy();

The source code of visit_children is too large to include because, as you can see, the function node contains a lot of information. So you can check the full and updated code for all visitors from this file in the ClangQL repository: DataProviderFile.

The LibClang creators provide clear documentation on how to walk through the Translation Unit and extract the needed data.

So now we have our Data Schema and Provider, and we can perform a query like SELECT * FROM functions. The result will look like this:

The result of running a query to select all function information from one file

So after that I decided to name the project ClangQL which stands for Clang Query Language. Now I'm working on extracting more and more important information from the AST (feel free to contribute).

You can find the full source code with all customizations in the ClangQL repository.

Conclusion

You can check out the ClangQL project as a full sample created in only three files.

If you liked the project, you could give it a star ⭐ on GitQL and ClangQL.

You can check out the website for how to download and use the project on different operating systems.

The project is not done yet – this is just the start. Everyone is welcome to join and contribute to the project and suggest ideas or report bugs.

You can sponsor my work on GitHub ❤️.

Thanks for reading!

How to Run SQL-Like Queries on Files

Amr Hesham — Tue, 12 Mar 2024 12:33:46 +0000

Hello everyone! I'm a Software engineer who is interested in low-level programming, compilers, and tool development.

At the end of 2023, I published my first article on freeCodeCamp about how I created a SQL-like Language to run queries on local Git repositories. If you want a bit more context, give it a read.

In this article, I will take you on a journey of updating the design of the GitQL project to be used also as an SDK. I will also explain how I used it to implement the FileQL project, which is a tool to run the SQL-like query on local files.

The First Use Case for this Idea

My first idea was to be able to use the same features of GitQL to build FileQL, which is a tool that allows you to run queries on a local file system.

Following that, everyone can use the GitQL project as an SDK to build their XQL. For example, LogQL, WeatherQL, CodeQL, AudioQL, BookQL, and so on.

How I Started to Think About the GitQL SDK

The first question was: what would be different between GitQL and FileQL? This part could be dynamic, depending on the data format and how to read it.

The answer was two components. Let's go over them in the following sections.

The first component is the Data Schema

In each SQL-like query, we need to perform some checks to make sure that everything is valid. For example, in a query like SELECT UPPER(name), commit_count + 1 FROM branches, we need to perform the following checks:

Check that there is a table with name branches.
The field name has the type text, so it can be passed to the function UPPER without any problems.
The field commit_count has the type integer, so we can use it with the plus operator and another integer.

These checks can be implemented if we are aware of the table names, field names, and types. This information was static in the GitQL project, but now, when I want to convert it to an SDK, I need to make it dynamic so any SDK user can modify it depending on their own data.

So, I encapsulated all the needed info in a component called DataSchema, and once the user passes it to the SDK, all checks will work correctly.
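As a rough sketch of how the three checks listed earlier could run against such a schema, the code below type-checks the two expressions from the example query. The `Schema`, `Expr`, and `type_of` definitions are simplified, hypothetical stand-ins and do not mirror the SDK's actual types.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Text,
    Integer,
}

// Simplified stand-in for the DataSchema component.
pub struct Schema {
    pub tables: HashMap<&'static str, Vec<&'static str>>,
    pub types: HashMap<&'static str, DataType>,
}

pub fn branches_schema() -> Schema {
    let mut tables = HashMap::new();
    tables.insert("branches", vec!["name", "commit_count"]);
    let mut types = HashMap::new();
    types.insert("name", DataType::Text);
    types.insert("commit_count", DataType::Integer);
    Schema { tables, types }
}

// A tiny expression tree covering only what the example query needs.
pub enum Expr {
    Field(&'static str),
    Integer(i64),
    Upper(Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
}

// Returns the type of `expr`, or an error message if any check fails.
pub fn type_of(expr: &Expr, schema: &Schema, table: &str) -> Result<DataType, String> {
    match expr {
        Expr::Integer(_) => Ok(DataType::Integer),
        Expr::Field(name) => {
            // Check 1: the table must exist and contain the field.
            let fields = schema
                .tables
                .get(table)
                .ok_or_else(|| format!("Table `{}` is not defined", table))?;
            if !fields.iter().any(|field| field == name) {
                return Err(format!("Field `{}` is not defined in {} table", name, table));
            }
            schema
                .types
                .get(*name)
                .copied()
                .ok_or_else(|| format!("Field `{}` has no registered type", name))
        }
        // Check 2: UPPER accepts only text.
        Expr::Upper(argument) => match type_of(argument, schema, table)? {
            DataType::Text => Ok(DataType::Text),
            other => Err(format!("UPPER expects Text but got {:?}", other)),
        },
        // Check 3: the plus operator needs an integer on both sides.
        Expr::Add(lhs, rhs) => {
            let left = type_of(lhs, schema, table)?;
            let right = type_of(rhs, schema, table)?;
            if left == DataType::Integer && right == DataType::Integer {
                Ok(DataType::Integer)
            } else {
                Err(format!("Operator `+` expects Integer but got {:?} and {:?}", left, right))
            }
        }
    }
}
```

Because the checker only reads from the schema, swapping in a different schema changes what is valid without touching the checking code.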

The second component is the Data Provider

Once we have defined the DataSchema component to make it easier to perform checks on data, we have to move to the next question: how can we provide the data to the GitQL Engine?

In GitQL, we have static functions to provide the data from .git files, but in the SDK, we don't only work with .git files, and we should support working with any kind of data.

So, the idea is to define an interface between the GitQL Engine and the SDK user to provide any kind of data in the needed format for the Engine. This component is called DataProvider, and I will explain the implementation details in the next section.

The Design and Implementation of the GitQL SDK

The goal is to allow the SDK user to pass their own definition of Data Schema and Provider and integrate them easily with the other GitQL components such as Tokenizer, Parser, Checker, Functions, and Engine.

How to design the Data Schema

The data schema should contain two kinds of information. Firstly, it should define the correct tables and field names, and secondly, it should specify the data types for those fields.

For example, in the case of FileQL, the correct table and field names are:

pub static ref TABLES_FIELDS_NAMES: HashMap<&'static str, Vec<&'static str>> = {
    let mut map = HashMap::new();
    map.insert(
        "files",
        vec!["path", "parent", "extension", "is_dir", "is_file", "size"],
    );
    map
};

Here, we define only one table called files, which has six fields: path, parent, extension, is_dir, is_file, and size.

In the other map, we define the correct data type for each field. For example:

pub static ref TABLES_FIELDS_TYPES: HashMap<&'static str, DataType> = {
    let mut map = HashMap::new();
    map.insert("path", DataType::Text);
    map.insert("parent", DataType::Text);
    map.insert("extension", DataType::Text);
    map.insert("is_dir", DataType::Boolean);
    map.insert("is_file", DataType::Boolean);
    map.insert("size", DataType::Integer);
    map
};

Then, we create an instance of Schema and construct it from the two maps like this:

let schema = Schema {
    tables_fields_names: TABLES_FIELDS_NAMES.to_owned(),
    tables_fields_types: TABLES_FIELDS_TYPES.to_owned(),
};

How to design the Data Provider

The goal of the Data Provider component is to load the data and map them into the GitQL Engine object structure, so we can define it as an interface with a single function:

pub trait DataProvider {
    fn provide(
        &self,
        env: &mut Environment,
        table: &str,
        fields_names: &[String],
        titles: &[String],
        fields_values: &[Box<dyn Expression>],
    ) -> GitQLObject;
}

The SDK user can implement this interface for their own kind of data and make it work with different data.

Also, you can control how many threads you need and what extra parameters you want. For example, in FileQL I implemented it with the name FileDataProvider, and passed the base path to search as a parameter.

You can also implement it in any other way. For example, you could write an APIDataProvider that loads the data from a server and maps it into a GitQLObject. You could also implement it as a LogDataProvider, and so on. The main idea is the same – just provide the data to the engine.
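To show the shape of such an implementation, here is a deliberately simplified sketch. The `SimpleDataProvider` trait, the `Row` alias, and this in-memory `LogDataProvider` are hypothetical: the real DataProvider trait takes more parameters and returns a GitQLObject, as shown in the trait definition above.

```rust
use std::collections::HashMap;

// Simplified stand-in for the engine's row format (assumption).
pub type Row = HashMap<String, String>;

// Reduced version of the provider interface: one function that returns rows.
pub trait SimpleDataProvider {
    fn provide(&self, table: &str, fields_names: &[String]) -> Vec<Row>;
}

// Hypothetical provider that serves log lines kept in memory.
pub struct LogDataProvider {
    pub lines: Vec<(String, String)>, // (level, message)
}

impl SimpleDataProvider for LogDataProvider {
    fn provide(&self, table: &str, fields_names: &[String]) -> Vec<Row> {
        // This sketch knows a single table; anything else yields no rows.
        if table != "logs" {
            return Vec::new();
        }
        self.lines
            .iter()
            .map(|(level, message)| {
                let mut row = Row::new();
                for field in fields_names {
                    let value = match field.as_str() {
                        "level" => level.clone(),
                        "message" => message.clone(),
                        _ => continue, // unknown fields are skipped here
                    };
                    row.insert(field.clone(), value);
                }
                row
            })
            .collect()
    }
}
```

Only the requested fields are copied into each row, so the provider does no more work than the query asks for.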

How to use the SDK Components together

The GitQL SDK has four main components, and each one can be used for many purposes. However, all of them can be used and integrated easily with each other to run the SQL-like query on your data.

First of all, there is the GitQL CLI component, which contains the required functions to deal with the command line interface, such as the arguments parser, diagnostic reporter, and table render.

Next, there is the GitQL AST component. This component contains the required structures for the SDK, such as the AST nodes, functions, schema, data types, and values.

There is also the GitQL Parser component, which is used to perform lexical, syntax, and semantic analysis on the query. It takes the SQL-like query as a string. If everything is correct, it returns an AST node. Otherwise, it returns a compile-time error message as a string.

Lastly, there is the GitQL Engine component. The Engine component contains the Engine and DataProvider, so it takes your implementation of the DataProvider and the AST and evaluates each node on the data. In the end, it returns the data as a result or a runtime error as a string.

After adding the GitQL SDK crates to your project and configuring the Data Schema and Provider for your data, we can start using the GitQL SDK:

let mut env = Environment::new(schema);
let query = ...;

let mut reporter = DiagnosticReporter::default();
let tokenizer_result = tokenizer::tokenize(query.to_owned());
let tokens = tokenizer_result.ok().unwrap();
if tokens.is_empty() {
    return;
}

let parser_result = parser::parse_gql(tokens, &mut env);
if parser_result.is_err() {
    let diagnostic = parser_result.err().unwrap();
    reporter.report_diagnostic(&query, *diagnostic);
    return;
}

let query_node = parser_result.ok().unwrap();
let provider: Box<dyn DataProvider> = Box::new(FileDataProvider::new(base_path.to_owned()));
let evaluation_result = engine::evaluate(&mut env, &provider, query_node);

The code above takes the query as a string and processes it until getting the evaluation result from the engine:

Create an Environment instance using the DataSchema to track types.
Create an instance of DiagnosticReporter to use for error reporting.
Pass the query to the tokenizer to convert the string into a list of tokens.
Pass the list of tokens to the parser to convert it into a tree data structure (the AST).
Create an instance of your DataProvider and pass it with the tree to the engine.
The engine returns the evaluation result which is an error or data.
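The same staged flow can be sketched with toy stages that each return a Result, so the first failure stops the pipeline. The `tokenize`, `parse`, `evaluate`, and `run` functions here are hypothetical miniatures that only understand queries such as SELECT 1 + 2. They are not the SDK's functions and only illustrate how the stages chain.

```rust
// Stage 1: split the query into tokens.
pub fn tokenize(query: &str) -> Result<Vec<String>, String> {
    let tokens: Vec<String> = query.split_whitespace().map(str::to_string).collect();
    if tokens.is_empty() {
        return Err("Empty query".to_string());
    }
    Ok(tokens)
}

// Stage 2: parse `SELECT <int> (+ <int>)*` into the list of integer operands.
pub fn parse(tokens: &[String]) -> Result<Vec<i64>, String> {
    let first = tokens
        .first()
        .ok_or_else(|| "Empty token list".to_string())?;
    if !first.eq_ignore_ascii_case("SELECT") {
        return Err(format!("Expected SELECT but got `{}`", first));
    }
    // After SELECT the tokens must alternate: number, `+`, number, ...
    if tokens.len() % 2 != 0 {
        return Err("Expected a number at the end of the query".to_string());
    }
    let mut operands = Vec::new();
    for (index, token) in tokens[1..].iter().enumerate() {
        if index % 2 == 0 {
            let value = token
                .parse::<i64>()
                .map_err(|_| format!("Expected a number but got `{}`", token))?;
            operands.push(value);
        } else if token.as_str() != "+" {
            return Err(format!("Expected `+` but got `{}`", token));
        }
    }
    Ok(operands)
}

// Stage 3: evaluate the parsed query.
pub fn evaluate(operands: &[i64]) -> i64 {
    operands.iter().sum()
}

// Chains the stages; `?` returns early with the first error message.
pub fn run(query: &str) -> Result<i64, String> {
    let tokens = tokenize(query)?;
    let operands = parse(&tokens)?;
    Ok(evaluate(&operands))
}
```

Returning the error instead of unwrapping keeps the decision about how to report it in one place, at the caller.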

Those components are not new at all, besides Data Schema and Provider, and you can enjoy reading about the design and implementation details in the first article.

This is almost all you need to make the project work, but you can add more customization and extra components, such as CLI arguments. The final result will be like this:

Demo for FileQL project running on local files

You can find the full source code with all customizations in the FileQL repository.

Conclusion

You can check the FileQL project as a full sample created in only three files.

If you liked the project, you could give it a star ⭐ on GitQL and FileQL.

You can check the website for how to download and use the project on different operating systems.

The project is not done yet – this is just the start. Everyone is welcome to join and contribute to the project and suggest ideas or report bugs.

Thanks for reading!

How I Created a SQL-like Language to Run Queries on Local Git Repositories

Amr Hesham — Thu, 26 Oct 2023 17:00:00 +0000

Hello everyone! I'm a Software engineer who's interested in low-level programming, compilers, and tool development.

Three months ago I decided to learn the Rust programming language and build a Git client that focuses on simplicity and productivity.

I started to think about how I could build the Git client to provide some unique and useful features.

For example, I like the analysis page on GitHub that tells you how many commits each developer has made and how many lines they've inserted or deleted. But what if I want to get this analysis for some period of time, or order everything by inserted lines and not number of commits? Or order them by how many commits were made by week or month?

You can add a custom sorting option for the client, right? But I started thinking about how I could make it more dynamic. This motivated me to wonder if I could run SQL-like queries on the local .git files so I could query any information I wanted.

So imagine if you could run a query like this on your local git repositories:

SELECT name, COUNT(name) AS commit_num FROM commits GROUP BY name ORDER BY commit_num DESC LIMIT 10

I have implemented this idea with a project I made called GQL (Git Query Language). And in this article, I'm going to show you how I designed and implemented the functionality.

How Can You Take a SQL-like Query and Run it on .git Files?

The first idea I had was to use SQLite. But there were some problems I couldn't resolve.

For example, I couldn't customize the syntax, and I didn't want to read .git files and store them on a SQLite database and then perform the query. I wanted everything to run on the fly.

I also wanted to be able to use not only the SELECT, DELETE, and UPDATE commands but also provide commands related to Git like push, pull, and so on.

I've created different tools like compilers before, so why not create a SQL-like language from scratch and make it perform queries on the fly and see if it works?

How I Designed and Implemented a Query Language from Scratch

I wanted to start small by only supporting the SELECT command without advanced features such as aggregations, grouping, joining, and so on.

So I planned to parse the query into a data structure that would make it easy to perform validation and evaluation on it (like type checking and displaying helpful error messages if anything went wrong). After that, I would pass this data structure to the evaluator that would apply the query on my .git files.

Choosing a data structure to use

The best data structure for this case is an Abstract Syntax Tree (AST). This is a very common data structure used in compilers because it's flexible and makes it easy to traverse the tree and compose nodes inside others.

Also in this case, I didn't need to keep all the information about the query, only the information that is needed for the next steps (this is why it's called Abstract).
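As a small illustration of why a tree fits well, here is a sketch of AST node types for a query like SELECT name FROM branches WHERE commit_count > 10 LIMIT 5. The `Expr` and `Statement` enums, `sample_query`, and `count_nodes` are hypothetical and much smaller than the node types GQL really uses.

```rust
// Expressions can nest inside each other, which is what makes this a tree.
#[derive(Debug, PartialEq)]
pub enum Expr {
    Field(String),
    Integer(i64),
    GreaterThan(Box<Expr>, Box<Expr>),
}

// Each statement keeps only what later steps need: no whitespace, no keywords.
#[derive(Debug, PartialEq)]
pub enum Statement {
    Select { fields: Vec<String>, table: String },
    Where(Expr),
    Limit(usize),
}

// Builds the tree for:
// SELECT name FROM branches WHERE commit_count > 10 LIMIT 5
pub fn sample_query() -> Vec<Statement> {
    vec![
        Statement::Select {
            fields: vec!["name".to_string()],
            table: "branches".to_string(),
        },
        Statement::Where(Expr::GreaterThan(
            Box::new(Expr::Field("commit_count".to_string())),
            Box::new(Expr::Integer(10)),
        )),
        Statement::Limit(5),
    ]
}

// Traversal is a plain recursive match over the node kinds.
pub fn count_nodes(expr: &Expr) -> usize {
    match expr {
        Expr::Field(_) | Expr::Integer(_) => 1,
        Expr::GreaterThan(lhs, rhs) => 1 + count_nodes(lhs) + count_nodes(rhs),
    }
}
```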

Deciding what validation to perform

The most important validation in this case would be type checking to make sure each value is valid and used in the correct place.

For example, what if the query wanted to multiply text by other text – would this be valid?

SELECT "ONE" * "TWO"

The multiplication operator expects both sides to be a number. So in this case, I wanted to inform the user that their query is invalid and try to help them understand the problem as much as possible.

So how would that work? When I see an operator like *, I need to check both sides to see whether the values have valid types for this operator. If not, I report a message like this:

SELECT "ONE" * "TWO"
-------------^

ERROR: Operator `*` expects both sides to be Number type but got Text.
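A check like this, together with the caret line, could be sketched as follows. The `check_multiply` helper and its `operator_position` parameter (a zero-based character index of the operator in the query) are assumptions made for this example. The real implementation tracks positions in its tokens.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Text,
    Number,
}

// Hypothetical helper: validates the operand types of `*`. On failure it
// returns the query, a caret line pointing at the operator, and the message.
pub fn check_multiply(
    query: &str,
    operator_position: usize,
    left: DataType,
    right: DataType,
) -> Result<DataType, String> {
    if left == DataType::Number && right == DataType::Number {
        return Ok(DataType::Number);
    }
    // Report the first operand that has the wrong type.
    let offending = if left != DataType::Number { left } else { right };
    Err(format!(
        "{}\n{}^\n\nERROR: Operator `*` expects both sides to be Number type but got {:?}.",
        query,
        "-".repeat(operator_position),
        offending
    ))
}
```

Printing the dashes up to the operator's position is what lines the caret up under the `*` in the query.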

Besides operators, I knew that I needed to check whether each identifier was a table, field, alias, or function name, or whether it was undefined. For example, I needed to report an error if the query used a field that was not one of the two fields in a branches table like the one below:

Branches {
   Text name,
   Number commit_count,
}

So I created a table that contained representations for all tables and fields so I could easily perform type checking. If the user tried to select a field which was undefined in this schema, then it reported an error:

SELECT invalid_field_name FROM branches
-------------^

Error: Field `invalid_field_name` is not defined in branches table.

I had to make sure the same checks would be performed on conditions, function names, and arguments. Then, if everything was properly defined and had the correct types, the AST would be valid and we could go to the next step.

What happens after validating the Abstract Syntax Tree?

After making sure everything was valid, it was time to evaluate the query and fetch the result.

To do that, I just traversed the syntax tree and evaluated each node. After finishing, I should have the correct result in a list.

Let's go through that process step by step to see how it works.

For example, in a query like this:

SELECT * FROM branches WHERE name LIKE "%/main" ORDER BY commit_count LIMIT 5

The AST representation will look like this:

AbstractSyntaxTree {
  Select(*, "branches") 
  Where(Like(name, "%/main"))
  OrderBy(commit_count)
  Limit(5) 
}

Now we need to traverse and evaluate each node but in a specific order. We don't just go start to end or end to start because we need to do this in the same order that SQL would do it to get the same result.

For example in SQL, the WHERE statement must be executed before GROUP BY, and HAVING must be executed after.

In the above example, everything is in the correct order to execute, so let's see what each statement will do.

Select(*, "branches")

This will select all the fields from the table with the name branches and push them to a list – let's call it objects. But how can I select them from the local repository?

All information about commits, branches, tags, and so on is stored by Git on files inside a folder called .git in each repository. One option is to write a full parser from scratch to extract the needed information. But using a library to do this instead worked for me.

I decided to use the libgit2 library to perform this task. It's a pure C implementation of the Git core methods, so you can read all the information you need. To use it from Rust, there is a crate (Rust library) called git2, created by the official Rust team, so you can get the branch information easily like this:

let local_branches = repo.branches(Some(BranchType::Local));
let remote_branches = repo.branches(Some(BranchType::Remote));
let local_and_remote_branches = repo.branches(None);

and then iterate over each branch to get its information and store it like this:

for branch in local_and_remote_branches {
   // Extract information from branch and store it
}

Now we end up with a list of all branches that we'll use in the next steps.

Where(Like(name, "%/main"))

This will filter the objects list and keep only the items that match the condition – in our case, the branches whose name ends with "/main".
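For a rough idea of how a LIKE condition can be evaluated, here is a minimal matcher sketch in which % matches any run of characters (including none) and _ matches exactly one character. The `like` and `like_chars` functions are hypothetical, case-sensitive, and written for clarity rather than speed. They are not taken from the GQL source.

```rust
// Matches `text` against a SQL LIKE pattern.
pub fn like(text: &str, pattern: &str) -> bool {
    let text: Vec<char> = text.chars().collect();
    let pattern: Vec<char> = pattern.chars().collect();
    like_chars(&text, &pattern)
}

// Recursive matcher over character slices.
pub fn like_chars(text: &[char], pattern: &[char]) -> bool {
    match pattern.split_first() {
        // An empty pattern matches only an empty text.
        None => text.is_empty(),
        // `%`: try every possible number of skipped characters.
        Some((&'%', rest)) => (0..=text.len()).any(|skip| like_chars(&text[skip..], rest)),
        // `_`: consume exactly one character, whatever it is.
        Some((&'_', rest)) => !text.is_empty() && like_chars(&text[1..], rest),
        // Any other character must match literally.
        Some((&expected, rest)) => {
            text.first() == Some(&expected) && like_chars(&text[1..], rest)
        }
    }
}
```

With this matcher, the filter step keeps a branch only when like(name, "%/main") returns true.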

OrderBy(commit_count)

This sorts the objects list by the value of the field commit_count.

Limit(5)

This takes only the first five items and removes the rest from the objects list.
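Putting the last three steps together, the evaluation of this query could be sketched like this on an in-memory list. The `Branch` struct and `run_query` function are hypothetical. The real engine evaluates AST nodes instead of hard-coding the steps, and here the LIKE condition is reduced to a simple suffix check.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Branch {
    pub name: String,
    pub commit_count: u32,
}

// Applies the statements in SQL order: filter, then sort, then limit.
pub fn run_query(mut objects: Vec<Branch>) -> Vec<Branch> {
    // Where(Like(name, "%/main")): keep only names ending with "/main".
    objects.retain(|branch| branch.name.ends_with("/main"));
    // OrderBy(commit_count): ascending order, as ORDER BY does by default.
    objects.sort_by_key(|branch| branch.commit_count);
    // Limit(5): keep the first five items and drop the rest.
    objects.truncate(5);
    objects
}
```

The order matters: limiting before filtering or sorting would return a different set of rows.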

That's it! And now we end up with a valid result, which you can see below:

The examples below are valid and run correctly:

SELECT 1
SELECT 1 + 2
SELECT LEN("Git Query Language")
SELECT "One" IN ("One", "Two", "Three")
SELECT "Git Query Language" LIKE "%Query%"

SELECT commit_count FROM branches WHERE commit_count BETWEEN 0 .. 10

SELECT * FROM refs WHERE type = "branch"
SELECT * FROM refs ORDER BY type

SELECT * FROM commits
SELECT name, email FROM commits
SELECT name, email FROM commits ORDER BY name DESC
SELECT name, email FROM commits WHERE name LIKE "%gmail%" ORDER BY name
SELECT * FROM commits WHERE LOWER(name) = "amrdeveloper"
SELECT name FROM commits GROUP By name
SELECT name FROM commits GROUP By name having name = "AmrDeveloper"

SELECT * FROM branches
SELECT * FROM branches WHERE is_head = true
SELECT name, LEN(name) FROM branches

SELECT * FROM tags
SELECT * FROM tags OFFSET 1 LIMIT 1

How to support running on multiple repositories at the same time

After I published GQL, I got amazing feedback from people. I also got some feature requests, like wanting support for multiple repositories and filtering by repository path.

I thought this was a great idea, because I could get analysis for multiple projects and also because I could do it on multiple threads. It didn't seem like it would be very hard to implement, either.

So after finishing the validation step for the AST, it's time for the evaluation step. But instead of evaluating the query once, the engine evaluates it once for each repository and then merges all the results back into one list.
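The evaluate-per-repository-then-merge idea can be sketched with one thread per repository. The `evaluate_repository` function is a hypothetical placeholder that returns fake rows instead of reading a real repository, and `evaluate_all` shows only the spawn, join, and merge pattern, not GQL's actual threading code.

```rust
use std::thread;

// Stand-in for evaluating the validated query against one repository.
// A real implementation would open the repository at `path` and read from it.
pub fn evaluate_repository(path: &str) -> Vec<String> {
    vec![format!("{}: main", path), format!("{}: develop", path)]
}

// Evaluates every repository on its own thread and merges the results.
pub fn evaluate_all(paths: Vec<String>) -> Vec<String> {
    let handles: Vec<thread::JoinHandle<Vec<String>>> = paths
        .into_iter()
        .map(|path| thread::spawn(move || evaluate_repository(&path)))
        .collect();

    // Joining in spawn order keeps the merged list deterministic.
    let mut merged = Vec::new();
    for handle in handles {
        merged.extend(handle.join().expect("evaluation thread panicked"));
    }
    merged
}
```

Statements such as ORDER BY or LIMIT would then need to run on the merged list, not per repository, to give the same result as a single evaluation.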

But what about supporting the ability to filter by repository path?

That was pretty easy. Do you remember the branches table schema? All I needed to do was introduce a new field with name repository_path to represent the repository local path for this branch and introduce it to other tables too.

So the final schema will look like this:

Branches {
   Text name,
   Number commit_count,
   Text repository_path,
}

Now we can run a query that uses this field:

SELECT * FROM branches WHERE repository_path LIKE "%GQL"

And that's it! 😉

Thanks for reading!

If you liked the project, you can give it a star ⭐ on github.com/AmrDeveloper/GQL.

You can check the website github.io/GQL for how to download and use the project on different operating systems.

The project is not done yet – this is just the start. Everyone is welcome to join and contribute to the project and suggest ideas or report bugs.