linguistics - freeCodeCamp.org

What are Context Free Grammars?

freeCodeCamp — Tue, 14 Jan 2020 09:39:14 +0000

By Aditya

Have you ever noticed that, when you are writing code in a text editor like VS code, it recognizes things like unmatched braces? And it also sometimes warns you, with an irritating red highlight, about the incorrect syntax that you have written?

If not, then think about it. That is after all a piece of code. How can you write code for such a task? What would be the underlying logic behind it?

These are the kinds of questions that you will face if you have to write a compiler for a programming language. Writing a compiler is not an easy task. It is bulky job that demands a significant amount of time and effort.

In this article, we are not going to talk about how to build compilers. But we will talk about a concept that is a core component of the compiler: Context Free Grammars.

Introduction

All the questions we asked earlier represent a problem that is significant to compiler design called Syntax Analysis. As the name suggests, the challenge is to analyze the syntax and see if it is correct or not. This is where we use Context Free Grammars. A Context Free Grammar is a set of rules that define a language.

Here, I would like to draw a distinction between Context Free Grammars and grammars for natural languages like English.

Context Free Grammars or CFGs define a formal language. Formal languages work strictly under the defined rules and their sentences are not influenced by the context. And that's where it gets the name context free.

Languages such as English fall under the category of Informal Languages since they are affected by context. They have many other features which a CFG cannot describe.

Even though CFGs cannot describe the context in the natural languages, they can still define the syntax and structure of sentences in these languages. In fact, that is the reason why the CFGs were introduced in the first place.

In this article we will attempt to generate English sentences using CFGs. We will learn how to describe the sentence structure and write rules for it. To do this, we will use a JavaScript library called Tracery which will generate sentences on the basis of rules we defined for our grammar.

Before we dive into the code and start writing the rules for the grammar, let's just discuss some basic terms that we will use in our CFG.

Terminals: These are the characters that make up the actual content of the final sentence. These can include words or letters depending on which of these is used as the basic building block of a sentence.

In our case we will use words as the basic building blocks of our sentences. So our terminals will include words such as "to", "from", "the", "car", "spaceship", "kittens" and so on.

Non Terminals: These are also called variables. These act as a sub language within the language defined by the grammar. Non terminals are placeholders for the terminals. We can use non terminals to generate different patterns of terminal symbols.

In our case we will use these Non terminals to generate noun phrases, verb phrases, different nouns, adjectives, verbs and so on.

Start Symbol: a start symbol is a special non terminal that represents the initial string that will be generated by the grammar.

Now that we know the terminology let's start learning about the grammatical rules.

While writing grammar rules, we will start by defining the set of terminals and a start state. As we learned before, that start symbol is a non-terminal. This means it will belong to the set of non-terminals.

T: ("Monkey", "banana", "ate", "the")
S: Start state.

And the rules are:

S --> nounPhrase verbPhrase
nounPhrase --> adj nounPhrase | adj noun
verbPhrase --> verb nounPhrase
adjective  --> the
noun --> Monkey | banana
verb --> ate

The above grammatical rules may seem somewhat cryptic at first. But if we look carefully, we can see a pattern that is being generated out of these rules.

A better way to think about the above rules is to visualise them in the form of a tree structure. In that tree we can put S in the root and nounPhrase and verbPhrase can be added as children of the root. We can proceed in the same way with nounPhrase and verbPhrase too. The tree will have terminals as its leaf nodes because that is where we end these derivations.

In the above image we can see that S (a nonterminal) derives two non terminals NP(nounPhrase) and VP(verbPhrase). In the case of NP, it has derived two non terminals, Adj and Noun.

If you look at the grammar, NP could also have chosen Adj and nounPhrase. While generating text, these choices are made randomly.

And finally the leaf nodes have terminals which are written in the bold text. So if you move from left to right, you can see that a sentence is formed.

The term often used for this tree is a Parse Tree. We can create another parse tree for a different sentence generated by this grammar in a similar way.

Now let's proceed further to the code. As I mentioned earlier, we will use a JavaScript library called Tracery for text generation using CFGs. We will also write some code in HTML and CSS for the front-end part.

The Code

Let's start by first getting the tracery library. You can clone the library from GitHub here. I have also left the link to the GitHub repository by galaxykate at the end of the article.

Before we use the library we will have to import it. We can do this simply in an HTML file like this.

<html>
    <head>
        <script src="tracery-master/js/vendor/jquery-1.11.2.min.js">script>
        <script src="tracery-master/tracery.js">script>
        <script src="tracery-master/js/grammars.js">script>
        <script src='app.js'>script>
    head>

html>

I have added the cloned tracery file as a script in my HTML code. We will also have to add JQuery to our code because tracery depends on JQuery. Finally, I have added app.js which is the file where I will add rules for the grammar.

Once that is done, create a JavaScript file where we will define our grammar rules.

var rules = {
        "start": ["#NP# #VP#."],
        "NP": ["#Det# #N#", "#Det# #N# that #VP#", "#Det# #Adj# #N#"],
        "VP": ["#Vtrans# #NP#", "#Vintr#"],
        "Det": ["The", "This", "That"],
        "N": ["John Keating", "Bob Harris", "Bruce Wayne", "John Constantine", "Tony Stark", "John Wick", "Sherlock Holmes", "King Leonidas"],
        "Adj": ["cool", "lazy", "amazed", "sweet"],
        "Vtrans": ["computes", "examines", "helps", "prefers", "sends", "plays with", "messes up with"],
        "Vintr": ["coughs", "daydreams", "whines", "slobbers", "appears", "disappears", "exists", "cries", "laughs"]
    }

Here you will notice that the syntax for defining rules is not much different from how we defined our grammar earlier. There are very minor differences such as the way non-terminals are defined between the hash symbols. And also the way in which different derivations are written. Instead of using the "|" symbol for separating them, here we will put all the different derivations as different elements of an array. Other than that, we will use the semicolons instead of arrows to represent the transition.

This new grammar is a little more complicated than the one we defined earlier. This one includes many other things such as Determiners, Transitive Verbs and Intransitive Verbs. We do this to make the generated text look more natural.

Let's now call the tracery function "createGrammar" to create the grammar we just defined.

let grammar = tracery.createGrammar(rules);

This function will take the rules object and generate a grammar on the basis of these rules. After creating the grammar, we now want to generate some end result from it. To do that we will use a function called "flatten".

let expansion = grammar.flatten('#start#');

It will generate a random sentence based on the rules that we defined earlier. But let's not stop there. Let's also build a user interface for it. There's not much we will have to do for that part – we just need a button and some basic styles for the interface.

In the same HTML file where we added the libraries we will add some elements.

<html>
    <head>
        <title>Weird Sentencestitle>
        <link rel="stylesheet" href="style.css"/>
        <link href="https://fonts.googleapis.com/css?family=UnifrakturMaguntia&display=swap" rel="stylesheet">
        <link href="https://fonts.googleapis.com/css?family=Harmattan&display=swap" rel="stylesheet">

        <script src="tracery-master/js/vendor/jquery-1.11.2.min.js">script>
        <script src="tracery-master/tracery.js">script>
        <script src="tracery-master/js/grammars.js">script>
        <script src='app.js'>script>
    head>
    <body>
        <h1 id="h1">Weird Sentencesh1>
        <button id="generate" onclick="generate()">Give me a Sentence!button>
        <div id="sentences">

        div>
    body>
html>

And finally we will add some styles to it.

body {
    text-align: center;
    margin: 0;
    font-family: 'Harmattan', sans-serif;
}

#h1 {
    font-family: 'UnifrakturMaguntia', cursive;
    font-size: 4em;
    background-color: rgb(37, 146, 235);
    color: white;
    padding: .5em;
    box-shadow: 1px 1px 1px 1px rgb(206, 204, 204);
}

#generate {
    font-family: 'Harmattan', sans-serif;
    font-size: 2em;
    font-weight: bold;
    padding: .5em;
    margin: .5em;
    box-shadow: 1px 1px 1px 1px rgb(206, 204, 204);
    background-color: rgb(255, 0, 64);
    color: white;
    border: none;
    border-radius: 2px;
    outline: none;
}

#sentences p {
    box-shadow: 1px 1px 1px 1px rgb(206, 204, 204);
    margin: 2em;
    margin-left: 15em;
    margin-right: 15em;
    padding: 2em;
    border-radius: 2px;
    font-size: 1.5em;
}

We will also have to add some more JavaScript to manipulate the interface.

let sentences = []
function generate() {
    var data = {
        "start": ["#NP# #VP#."],
        "NP": ["#Det# #N#", "#Det# #N# that #VP#", "#Det# #Adj# #N#"],
        "VP": ["#Vtrans# #NP#", "#Vintr#"],
        "Det": ["The", "This", "That"],
        "N": ["John Keating", "Bob Harris", "Bruce Wayne", "John Constantine", "Tony Stark", "John Wick", "Sherlock Holmes", "King Leonidas"],
        "Adj": ["cool", "lazy", "amazed", "sweet"],
        "Vtrans": ["computes", "examines", "helps", "prefers", "sends", "plays with", "messes up with"],
        "Vintr": ["coughs", "daydreams", "whines", "slobbers", "appears", "disappears", "exists", "cries", "laughs"]
    }

    let grammar = tracery.createGrammar(data);
    let expansion = grammar.flatten('#start#');

    sentences.push(expansion);

    printSentences(sentences);
}

function printSentences(sentences) {
    let textBox = document.getElementById("sentences");
    textBox.innerHTML = "";
    for(let i=sentences.length-1; i>=0; i--) {
        textBox.innerHTML += ""+sentences[i]+"
"
    }
}

Once you have finished writing the code, run your HTML file. It should look something like this.

Every time you click the red button it will generate a sentence. Some of these sentences might not make any sense. This is because, as I said earlier, CFGs cannot describe the context and some other features that natural languages possess. It is used only to define the syntax and structure of the sentences.

You can check out the live version of this here.

Conclusion

If you have made it this far, I highly appreciate your resilience. It might be a new concept for some of you, and others might have learnt about it in their college courses. But still, Context Free Grammars have interesting applications that range widely from Computer Science to Linguistics.

I have tried my best to present the main ideas of CFGs here, but there is a lot more that you can learn about them. Here I have left links to some great resources:

Context Free Grammars by Daniel Shiffman.
Context Free Grammars Examples by Fullstack Academy
Tracery by Galaxykate

Exploring the Linguistics Behind Regular Expressions

freeCodeCamp — Mon, 20 Nov 2017 15:38:09 +0000

By Alaina Kafkes

How a linguistic breakthrough ended up in code

_Image Credit: [xkcd](https://xkcd.com/" rel="noopener" target="blank" title=")

Regular expressions inspire fear in new and experienced programmers alike. When I first saw a regular expression — often abbreviated as “regex” — I remember feeling dizzy from looking at the litany of parentheses, asterisks, letters, and numbers. Regular expressions seemed nonsensical, impenetrable.

I expected regular expressions to crop up again in my upper-level computer science coursework — maybe by then I’d finally feel ready to tackle them — but I encountered them in an introductory class that I had put off until my senior year. The purpose of this course was to draw students who had never written a line of code into CS by introducing them to concepts like cryptography, human-computer interaction, machine learning — you know, only the latest and greatest of tech buzzwords.

I didn’t attend more than a handful of lectures, but one of the assignments stuck with me. I had to write an essay about a famous computer scientist or academic whose work impacted computer science. I chose Noam Chomsky.

Little did I know that learning about Chomsky would drag me down a rabbit hole back to regular expressions, and then magically cast regular expressions into something that fascinated me. What enchanted me about regular expressions was the homonymous linguistic concept that powered them.

I hope to spellbind you, too, with the linguistics behind regular expressions, a a backstory unknown to most programmers. Though I won’t teach you how to use regular expressions in any particular programming language, I hope that my linguistic introduction will inspire you to dive deeper into how regular expressions work in your programming language of choice.

To begin, let’s return to Chomsky: what does he have to do with regular expressions? Hell, what does he even have to do with computer science?

A Computer Scientist By Accident

Wikipedia christens Noam Chomsky as a linguist, philosopher, cognitive scientist, historian, social critic, and political activist, but not as a computer scientist. Because he is so highly regarded in all of these fields, his indirect contributions to the field of computer science often fall by the wayside.

The more I researched Chomsky’s academic work, the more accidental Chomsky’s foray into computing seemed. This affirmed my belief that all fields — even those that appear disparate from computer science — have something to offer to computing and the tech industry.

His contributions to the field of linguistics in particular exemplify the impact of interdisciplinary research on computer science. The Chomsky hierarchy transformed the code that computer scientists, software engineers, and hobbyists write today.

Yes, it was this hierarchy that brought regular expressions to computer science. But, before we can understand the jump from Chomsky to regular expressions, I’ll outline the Chomsky hierarchy.

Linguistic Law & Order

The Chomsky hierarchy is an ordering of formal grammars — think syntactic rules for formal languages — such that each grammar exists as a proper subset of the grammars above it in the hierarchy. Some formal languages have stricter grammars than others, so Chomsky sought to organize formal grammars into his eponymous hierarchy.

I briefly mentioned that formal grammars are syntactic rules: rules that give all possible valid phrases for a given formal language. Grammars provide the rules that build languages. In linguist-speak, a language’s formal grammar provides a framework with which nonterminals (input or intermediate string values) can be converted into terminals (output string values).

To elucidate this new vocabulary, I’ll walk through an example of converting a set of nonterminals into terminals using a made-up formal grammar. Let’s say that our pretend formal language, Parseltongue, has the following formal grammar:

Terminals: {s, sh, ss}
Nonterminals: {snake, I, am}
Production rules: {I → sh, am → s, snake → ss}

Using the production rules, I can convert the input sentence “I am snake” into “sh s ss.” This conversion happens piece by piece: “I am snake” → “sh am snake” → “sh s snake” → “sh s ss.”

As my Parseltongue example illustrates, formal grammars parse strings of nonterminals into terminal-only strings — grammatically correct phrases. But formal grammars act not only as generators of a language, but also recognizers of whether a string fits the formal grammar. Whereas the example string “I am a snake” can be fully converted into terminals, the string “I am not a snake” cannot be written in Parseltongue because the nonterminal “not” cannot be translated into a Parseltongue terminal.

To re-emphasize something I stated earlier: formal grammars generate formal languages. That means that, by creating a hierarchy of formal grammars, Chomsky also categorized languages themselves.

With that sobering introduction, let’s look at the four formal grammars in Chomsky’s hierarchy. From most to least strict, they are:

Regular grammars, which retain no past state knowledge from input string to output string
Context-free grammars, which retain only recent state knowledge from input string to output string
Context-sensitive grammars, which keep all past state knowledge from input string to output string
Unrestricted (or recursively enumerable) grammars, which have all state knowledge and thus can create every output string imaginable from a given input string

What is this “state knowledge” that I speak of? Think of knowledge in terms of scope. Regular grammars, for example, have no knowledge of the string’s past states in their “scope” in the process of converting an input string into an output string. This suggests that once the grammar makes an individual conversion of nonterminal to terminal (plus a series of zero or more nonterminals), the grammar “forgets” the previous state of the string.

On the other hand, unrestricted grammars hold onto every possible state of the string-in-translation. Context-free and context-sensitive grammars fall somewhere in the middle.

If you’re looking for a more detailed explanation of the grammars in the Chomsky hierarchy, you’ll have to take a peek at automata theory. I’ll focus on the grammar that brings us back to regular expressions, fittingly called the regular grammar.

On the Regular Expressions

Regular expressions and regular grammars are equivalent. They communicate the same set of syntactic rules, albeit using different formalisms, and both produce the same regular languages.

In linguistics, a regular expression is recursively defined as follows:

The empty set is a regular expression.
The empty string is a regular expression.
For any character x in the input alphabet, x is a regular expression that produces the regular language {x}.
Alternation: If x and y are regular expressions, then x | y is a regular expression. For example, the regular expression 0|1 produces the regular language {0,1}.
Concatenation: If x and y are regular expressions, then x • y is a regular expression. For example, the regular expression 0•1 produces the regular language {01}.
Repetition (also known as Kleene star): If x and y are regular expressions, then x is a regular expression. For example, the regular language `0•1produces the regular language{0, 01, 011, 0111, ...}`, ad infinitum.

A regular grammar is composed of rules like those of Parseltongue. Just as a regular grammar can be utilized to parse an input string into an output string, a regular expression converts strings quite similarly. You can see examples of this parsing for the alternation, concatenation, and repetition operations — or, to use my prior analogy, rules — that regular expressions adopt.

Let’s return to our friend Noam Chomsky for a moment. According to his hierarchy of grammars, regular grammars retain no information about intermediate steps in converting from an input string to an output string. What does this tell us about regular expressions?

The “forgetfulness” of regular grammars implies that translations in one part of the string do not impact how the other nonterminals in the string are translated in future steps. There is no coordination between different parts of the string in the creation of the output string.

Looking at the linguistics behind regular grammars gives us insight into why programmers first brought regular expressions into code. Although I’ve only discussed formal grammars as generators and recognizers of language, the fact that regular grammars convert input string to output string piece by piece makes them pattern-matchers. In programming, regular expressions use production rules to convert an input string — a pattern — into a regular language — a set of strings that match that pattern.

But I would have never written this blog post if programming language creators implemented regular expressions exactly as they are defined in the field of linguistics. Computational regular expressions are a far cry from their linguistic precursor, but the linguistic regular expressions that I covered provide a useful framework for understanding regular expressions in code.

Two Regular Expressions, Both Alike in Dignity

Hereafter, I will use the term regular expression to mean a linguistic regular expression and the term regex to signify a programmatic regular expression. In the wild, both linguistic and programmatic regular expressions are referred to as “regular expressions” even though they are quite different from one another — how confusing!

The difference between regular expressions and regexes stems from how they are used. Regular expressions — or regular grammars — are part of formal language theory, which exists to describe shared elements of natural languages — languages that evolved over time without human premeditation. Linguists use regular expressions for theoretical purposes, like the categorization of formal grammars in the Chomsky hierarchy. Regular expressions help linguists understand the languages that humans speak.

Regexes, on the other hand, are utilized by everyday programmers who want to search for strings that match a given pattern. While regular expressions are theoretical, regexes are pragmatic. Programming languages are formal languages: languages designed by people (here, programmers) for specific purposes. As you might imagine, programming language creators augmented the functionality of regexes in code. Let’s examine these enhancements.

Remember that regular expressions have three operations: alternation, concatenation, and repetition. I’m no regex expert — regexpert? — but all it takes is a peek at the regular expression Wikipedia page to notice that regexes implement more than just three operations.

For example, using POSIX regex syntax, the pattern .ork matches all four-character strings that end with the three characters “ork.” That period is more powerful than simple alternation, concatenation, and repetition, right?

Nope. Truth be told, even the fanciest of regex metacharacters — characters that invoke a regex operation — derive from regular expression operations. Assuming that the twenty-six lowercase letters of the alphabet are the only characters in the regular grammar, the regex pattern .ork could be written using only regular expression operations as [a|b|c|...|z]ork.

Though the sheer volume of metacharacters suggests that regex has a more powerful set of operations than regular expressions themselves, metacharacters are merely shortcuts for various permutations of the operations that define regular expressions. Regex metacharacters provide a programmer-friendly abstraction for common combinations of alternation, concatenation, and repetition.

So far, I’ve portrayed regexes as regular expressions with amazing shortcuts and clear-cut use cases. However, as you may recall from Chomsky’s hierarchy, regular grammars have the strictest rules and no scope. Luckily, regexes have a little more leeway than their linguistic precursor, thereby bestowing them with more practical power.

Breaking the Regular Grammar Rules

Recall that, according to the the Chomsky hierarchy, regular grammars retain no knowledge in converting an input string to an output string. Since regular expressions are equivalent to regular grammars, this means that regular expressions also have no memory of the intermediate states of a string as it changed from input to output. It also means that translating a nonterminal in one part of a regular expressions has no bearing on the translation of a nonterminal in another part of the expression.

For regexes, it’s a different story. Regexes violate this key regular grammar characteristic by supporting the ability to backreference. Backreferencing allows the programmer to parenthetically separate a subsection of a regular expression and refer to it using a metacharacter. To give an example, the pattern (la)\1 matches “lala” by employing the \1 metacharacter to repeat the search for “la.”

Because different parts of the string cannot influence one another in regular expressions, backreferencing gives regexes a lot more power than their predecessor. More importantly, backreferencing facilitates practical uses of regex such as searching for typos in which the same word was accidentally typed twice in a row. Pragmatism gives insight into why regular expressions were tweaked to create regexes in programming.

Another feature that increases the functionality of regex is the ability to alter the greediness of the matching. Different quantifiers — categories of regex patterns — can look similar but match drastically different parts of a string. A greedy quantifier () will attempt to match as much of the string as possible, whereas a reluctant quantifier (?) will try to match the minimum amount of characters in the string. Given the string “abcorgi,” the pattern `.corgiwould match the entire string but the pattern.?corgi` would only match “bcorgi.”

A possessive quantifier (+) also attempts to match as much of the string as possible, but, unlike the greedy quantifier, it will not backtrack to previous characters in the string in order to find the largest possible match. Given the string “abcorgi,” the patterns .*corgi and .+corgi would match the entire string. Though possessive and greedy qualifiers will often produce the same result, possessive qualifiers tend to be more efficient because they avoid backtracking.

Because quantifiers are metacharacters, they can technically be built from alternation, concatenation, and repetition: the three operations of regular expressions. However, quantifiers create a simple abstraction that allows programmers to quickly specify what type of match they would like.

Conclusion & Further Reading

What a journey we’ve undertaken! We learned about Chomsky and his eponymous hierarchy, then dove deeper into regular grammars. From regular grammars, we explored the linguistic definition of a regular expression. Finally, we used the differences between regular expressions and regexes to motivate how programmers use regex today.

Although I trace the history of regular expressions from Chomsky to modern programming languages, this blog post is not the end of the regex story. If you’d like to learn more about linguistic and computational regular expressions, I have some motivating questions for you.

What is automata theory and how does it relate to the Chomsky hierarchy?
How are regex implemented? What are the tradeoffs of various regex algorithms?
When is it appropriate to use regexes instead of built-in string match and manipulation libraries?

I also have a list of resources that I used to study up on the linguistic and computational elements of regular expressions. Happy regex-ing!

Enjoy what you read? Spread the love by liking and sharing this piece. Have thoughts or questions? Reach out to me on Twitter or in the comments below. Thank you Miles Hinson for proofreading this piece!