Natural Language Processing with spaCy & Python

Natural language processing, or NLP, is a branch of linguistics that seeks to parse human language in a computer system. spaCy is a popular Python library used for NLP.

We just published an NLP and spaCy course on the freeCodeCamp.org YouTube channel. In the course you will learn all about natural language processing and how to apply it to real-world problems using the Python spaCy library.

Dr. W.J.B. Mattingly created this course. Dr. Mattingly is a Postdoctoral Fellow at the Smithsonian Institution's Data Science Lab. He is also an excellent teacher.

Dr. Mattingly created a series of Jupyter Notebooks to go along with the course. They are hosted at the course's website.

Because NLP is such a complex problem for computers, it requires a complex solution. The answer has been found in artificial neural networks, also known as ANNs, or neural nets for short. Newer architectures, such as transformer models, push the field further. You will learn about these methods in this course.

Here are the sections in this course:

Course Introduction
Intro to NLP
How to Install spaCy
spaCy Containers
Linguistic Annotations
Named Entity Recognition
Word Vectors
Pipelines
EntityRuler
Matcher
Custom Components
RegEx (Basics)
RegEx (Multi-Word Tokens)
Applied spaCy Financial NER

Watch the course below or on the freeCodeCamp.org YouTube channel (3-hour watch).

Transcript

(autogenerated)

In this course you will learn all about natural language processing and how to apply it to real-world problems using the spaCy library.

Dr. Mattingly is extremely knowledgeable in this area, and he's an excellent teacher.

Hi, and welcome to this video.

My name is Dr. William Mattingly, and I specialize in multilingual natural language processing. I come to NLP from a humanities perspective. I have my PhD in medieval history, but I use spaCy on a regular basis for all of my NLP needs.

So what you're going to get out of this video over the next few hours is a basic understanding of what natural language processing, or NLP, is, and also how to apply it to domain-specific problems, or problems that exist within your own area of expertise.

I happen to use this all the time to analyze historical documents, or financial documents for my own personal investments.

Over the next few hours, you're going to learn a lot about NLP, language as a whole, and most importantly, the spaCy library.

I like the spaCy library because it's easy to use, and it's also easy to implement general solutions to general problems with the off-the-shelf models that are already available to you.

I'm going to walk you through, in part one of this video series, how to get the most out of spaCy with these off-the-shelf features.

In part two, we're going to start tackling some of the features that don't exist in off the shelf models.

And I'm going to show you how to use rules-based pipes, or components, in spaCy to actually solve domain-specific problems in your own area, from the EntityRuler to the Matcher to actually injecting robust, complex regular expression, or RegEx, patterns, and a custom spaCy component that doesn't actually exist at the moment.

I'm going to be showing you all that in part two, so that in part three, we can take the lessons that we learned in part one and part two, and actually apply them to solve a very kind of common problem that exists in NLP and that is information extraction from financial documents.

So finding things that are of relevance, such as stocks, markets, indexes and stock exchanges.

If you join me over the next few hours, you will leave this lesson with a good understanding of spaCy, a good understanding of the off-the-shelf components that are there, and a way to take those off-the-shelf components and apply them to your own domain.

If you join me in this video and you like it, please let me know in the comments down below, because I am interested in making a second part to this video that will explore not only the rules-based aspects of spaCy, but also the machine learning-based aspects of spaCy.

That would mean teaching you how to train your own models to do your own things, such as training a dependency parser or training a named entity recognizer, things which are not covered in this video.

Nevertheless, if you join me for this one and you like it, you will find part two much easier to understand.

So sit back, relax, and let's jump into what NLP is, what kind of things you can do with NLP, such as information extraction, what the spaCy library is, and how this course will be laid out.

If you like this video, also consider subscribing to my channel, Python Tutorials for Digital Humanities, which is linked in the description down below.

Even if you're not a digital humanist like me, you will find these Python tutorials useful because they take Python and make it accessible to students of all levels, specifically those who are beginners.

I walk you through not only the basics of Python, but also, step by step, some of the more common libraries that you need.

A lot of the channel deals with texts or text based problems.

But other content deals with things like machine learning, and image classification and OCR, all in Python.

So before we begin with spaCy, I think we should spend a little bit of time talking about what NLP, or natural language processing, actually is.

Natural Language Processing is the process by which we try to get a computer system to understand and parse and extract human language oftentimes with raw text.

There are a couple different areas of natural language processing.

There's named entity recognition, part-of-speech tagging, syntactic parsing, text categorization (also known as text classification), coreference resolution, and machine translation.

Adjacent to NLP is another computational linguistics field called natural language understanding, or NLU. This is where we train computer systems to do things like relation extraction, semantic parsing, question answering (this is where bots really come into play), summarization, sentiment analysis, and paraphrasing.

NLP and NLU are used by a wide array of industries, from the finance industry all the way through to law and academia, with researchers trying to do information extraction from texts.

Within NLP, there are a couple of different applications.

The first and probably the most important is information extraction.

This is the process by which we try to get a computer system to extract information that we find relevant to our own research or needs.

So for example, as we're going to see in part three of this video, when we apply spaCy to the financial sector, a person interested in finances might need NLP to go through and extract things like company names, stocks, and indexes, things that are referenced within news articles from Reuters to the New York Times to the Wall Street Journal.

This is an example of using NLP to extract information.

A good way to think about NLP's application in this area is that it takes in some unstructured data, in this case raw text, and extracts structured data, or metadata, from it.

So it finds the things that you want it to find and extracts them for you.

Now, while there are ways to do this with gazetteers and list matching, using an NLP framework like spaCy, which I'll talk about in just a second, has certain advantages, the main one being that you can use and leverage things that have been parsed syntactically or semantically.

So things like the part of speech of a word, its dependencies, and its coreference are things that the spaCy framework allows you to get off the shelf, and also train into machine learning models and work into pipelines with rules.
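The list-matching (gazetteer) approach mentioned above can be sketched in plain Python to show what "unstructured text in, structured data out" looks like. This is only a rough illustration: the gazetteer entries, the labels, and the sample sentence are made up for the example, and this is not how spaCy itself works.

```python
import re

# Hypothetical gazetteer: surface form -> label. These entries are
# illustrative assumptions, not data from the course.
GAZETTEER = {
    "Apple": "COMPANY",
    "Microsoft": "COMPANY",
    "Dow Jones": "INDEX",
    "New York Stock Exchange": "STOCK_EXCHANGE",
}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, matched_text, label) tuples found by list matching."""
    # Try longer names first so multi-word entries win over shorter ones.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [
        (m.start(), m.end(), m.group(), gazetteer[m.group()])
        for m in pattern.finditer(text)
    ]


raw_text = "Apple rose on the New York Stock Exchange while the Dow Jones fell."
for start, end, name, label in extract_entities(raw_text):
    print(start, end, name, label)
```

A list like this has no way to tell Apple the company from a sentence that simply starts with the word "Apple", which is where the syntactic and semantic information described above becomes useful.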

So that's kind of one aspect of NLP.

And one way it's used.

Another way it's used is to read in data and classify it.

This is known as text categorization.

And we see that on the left hand side of this image, text categorization or text classification.

Text classification, and we include sentiment analysis in this for the most part as well, is a way we take information into a computer system, again unstructured data or raw text, and classify it in some way.

You've actually seen this at work for many decades now with spam detection. Spam detection is nearly perfect, though it needs to be continually updated.

But for the most part, it is a solved problem.

The reason why you have emails that automatically go to your spam folder is that there's a machine learning model that sits on the back end of your email server.

What it does is look at each email, see if it fits the pattern of what it has seen as spam before, and assign it a spam label.

This is known as classification.
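To make the idea of "fitting the pattern of what it has seen as spam before" concrete, here is a toy naive Bayes text classifier in plain Python. Naive Bayes is my choice for the sketch; the video does not say which model real mail servers use, and the four training emails below are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability under naive Bayes."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data, purely for illustration.
training_emails = [
    ("win free money now", "spam"),
    ("free prize claim your money", "spam"),
    ("meeting agenda for monday", "not_spam"),
    ("please review the attached report", "not_spam"),
]
model = train(training_emails)
print(classify("claim your free prize", model))
print(classify("review the monday report", model))
```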

This is also used by researchers, especially in the legal industry. Lawyers oftentimes receive hundreds of thousands of documents, if not millions of documents, and they don't necessarily have the human time to go through and analyze every single document verbatim.

It is important to kind of get a quick umbrella sense of the documents without actually having to go through and read them page by page.

And so what lawyers will oftentimes do is use NLP to do classification and information extraction. They will find keywords that are relevant to their case, or they will find documents that are classified according to the relevant fields of their case.

And that way, they can take a million documents and reduce them down to maybe only a handful, maybe a thousand, that they have to read verbatim.

This is a real world application of NLP or natural language processing.

And both of these tasks can be achieved through the spaCy framework.

spaCy is a framework for doing NLP.

Right now, as of 2021, it's only available, I believe, in Python. I think there is a community that's working on an application with R, but I don't know that for certain.

But spaCy is one of many NLP frameworks that Python has available.

If you're interested in looking at all of them, you can explore things like NLTK, the Natural Language Toolkit, and Stanza, which I believe is coming out of a program at Stanford.

There are many out there, but I find spaCy to be the best of all of them for a couple of different reasons.

Reason one is that they provide off-the-shelf models that benchmark very well, meaning they run very quickly.

And they also have very good accuracy metrics, such as precision, recall, and F-score.

And I'm not going to talk too much about the way we measure machine learning accuracy right now, but know that they are quite good.
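For readers who want a quick reference for those three metrics, here is a small sketch of how they are usually computed from counts of correct and incorrect predictions. The F-score shown is the balanced F1 variant, which is an assumption on my part since the video only says "F score", and the counts in the example are invented.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from raw prediction counts."""
    predicted = true_positives + false_positives  # everything the model flagged
    actual = true_positives + false_negatives     # everything it should have flagged
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 entities found correctly, 2 wrong, 4 missed.
p, r, f = precision_recall_f1(true_positives=8, false_positives=2, false_negatives=4)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```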

Second, spaCy has the ability to leverage current natural language processing methods, specifically transformer models, usually known collectively as BERT models, even though that's not entirely accurate. It allows you to use an off-the-shelf transformer model.

And third, it provides the framework for doing custom training relatively easily compared to these other NLP frameworks that are out there.

Finally, the fourth reason why I picked spaCy over other NLP frameworks is that it scales well.

spaCy was designed by Explosion AI, and the entire purpose of spaCy is to work at scale. By at scale, we mean working with large quantities of documents efficiently, effectively, and accurately.

spaCy scales well because it can process hundreds of thousands of documents with relative ease in a relatively short period of time, especially if you stick with more rules-based pipes, which we're going to talk about in part two of this video.
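The video does not show code at this point, but one place where this design shows up is spaCy's nlp.pipe method, which processes many texts as a batched stream. The sketch below assumes spaCy v3 is installed and uses a blank English pipeline so that no model download is needed; the three texts and the batch size are arbitrary.

```python
import spacy

# A blank English pipeline only tokenizes, so it needs no downloaded model.
nlp = spacy.blank("en")

texts = [
    "The first document is short.",
    "The second document is a little bit longer.",
    "The third one mentions Berlin.",
]

# nlp.pipe streams the texts through the pipeline in batches
# instead of making one separate nlp(text) call per document.
docs = list(nlp.pipe(texts, batch_size=2))
for doc in docs:
    print(len(doc), doc.text)
```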

So those are the two things you really need to know about NLP and spaCy in general. We're going to talk about spaCy in depth as we explore it both through this video and through the free textbook I provide to go along with this video, which is located at spacy.pythonhumanities.com.

It should be linked in the description down below. This video and the textbook are meant to work in tandem.

Some stuff that I cover in the video might not necessarily be in the textbook because it doesn't lend itself well to text representation.

And the same goes for the opposite: some stuff that I don't have the time to cover verbatim in this video, I cover in a little bit more depth in the book.

I think that you should try to use both of these. What I would recommend is doing one pass through this whole video: watch it in its entirety and get an umbrella sense of everything that spaCy can do and everything that we're going to cover.

I would then go back and try to replicate each stage of this process in a separate window or on a separate screen, and try to follow along in code. Then I would go back through a third time, watch the first part where I talk about what we're going to be doing, and try to do it on your own without looking at the textbook or the video.

If you can do that by your third pass, you'll be in very good shape to start using spaCy to solve your own domain-specific problems.

NLP is a complex field and applying NLP is really complex.

But fortunately, frameworks like spaCy make this process a lot easier.

I encourage you to spend a few hours in this video and get to know spaCy. I think you're going to find that you can do things that you didn't think possible in relatively short order.

So sit back, relax, and enjoy this video series on spaCy.

In order to use spaCy, you're first going to have to install spaCy.

Now there's a few different ways to do this.

Depending on your environment and your operating system, I recommend going to spacy.io/usage and entering the setup that you're working with.

So whether you're using macOS, Windows, or Linux, in this very handy user interface you can go through and select the different features that matter most to you.

I'm working with Windows, I'm going to be using pip in this case, and I'm going to be doing everything on the CPU.

And I'm going to be working with English.

So I've established all of those different parameters.

And it goes through and tells me exactly how to install it using pip in the terminal.

So I encourage you to pause the video right now and go ahead and install spaCy however you want to.

I'm going to be walking through how to install it within the Jupyter Notebook that we're going to be moving to in just a second.

I want you to not work with the GPU at all.

Working with spaCy on the GPU requires a lot more understanding of what the GPU is used for, specifically in training machine learning models.

It requires you to have CUDA installed correctly.

It requires a couple of other things that I don't really have the time to get into in this video, but that we'll be addressing in a more advanced spaCy tutorial video.

So for right now, I recommend selecting your OS, selecting either pip or conda, and then selecting CPU.

And since you're going to be working through this video with English texts, I encourage you to select English right now and go ahead and just install or download the en_core_web_sm model.

This is the small model.

I'll talk about that in just a second.

So the first thing we're going to do in our Jupyter Notebook is use the exclamation mark to indicate in the cell that this is a terminal command. We're going to say pip install spacy. Your output when you execute this cell is going to look a little different than mine.

I already have spaCy installed in this environment.

And so mine goes through and looks like this. Yours, instead of saying requirement already satisfied, will be printing out the different things that it's installing in order to install spaCy and all of its dependencies.

The next thing that you're going to do is, again, follow the instructions: you're going to run python -m spacy download, and then the model that you want to download.

So let's go ahead and do that right now.

So let's go ahead and say python -m spacy download.

This is a spaCy terminal command.

And we're going to download en_core_web_sm. Again, I already have this model downloaded, so on my end, the output is going to look a little different than it's going to look on your end as it prints in the Jupyter Notebook.

And if we give it a just a second, everything will go through, and it says that it's collected it, it's downloading it.

And we are all very happy now.

And so now that we've got spaCy installed correctly, and we've got the small model downloaded correctly, we can go ahead and start actually using spaCy and make sure everything's correct.

The first thing we're going to do is import the spaCy library as you would with any other Python library.

If you're not familiar with this, a library is simply a set of classes and functions that you can import into a Python script so that you don't have to write a whole bunch of extra code.

Libraries are massive collections of classes and functions that you can call.

So when we import spaCy, we're importing the whole spaCy library. When the cell runs like this, we know that spaCy has imported correctly. As long as you're not getting an error message, everything was imported fine.

The next thing that we need to do is make sure that en_core_web_sm, our small English model, was downloaded correctly.

So the next thing that we need to do is we need to create an NLP object.

I'm going to be talking a lot more about this as we move forward.

Right now, this is just troubleshooting to make sure that we've installed spaCy correctly and we've downloaded our model correctly.

So we're going to use the spacy.load command.

This is going to take one argument: a string that corresponds to the model that you've installed.

In this case, en_core_web_sm.

And if you execute this cell and you have no errors, you have successfully installed spaCy and downloaded the en_core_web_sm model correctly.
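The troubleshooting step described above can be sketched as a single cell. The check_spacy_setup helper is a hypothetical name of mine, not part of spaCy; it relies on the fact that downloaded spaCy models are installed as ordinary Python packages, so the standard library can look them up by name. The install commands in the comments are the ones given in the video.

```python
import importlib.util


def check_spacy_setup(model_name="en_core_web_sm"):
    """Report whether the spaCy library and the model package can be found."""
    return {
        "spacy": importlib.util.find_spec("spacy") is not None,
        model_name: importlib.util.find_spec(model_name) is not None,
    }


status = check_spacy_setup()
print(status)

if all(status.values()):
    import spacy

    # spacy.load takes the name of the installed model as a string.
    nlp = spacy.load("en_core_web_sm")
    print(nlp.pipe_names)
else:
    # Terminal commands from the video (prefix each with "!" in a Jupyter cell):
    #   pip install spacy
    #   python -m spacy download en_core_web_sm
    print("spaCy or the model is missing; run the install commands in the comments.")
```

In the video the check is simply the import followed by spacy.load; the extra lookup here only makes the failure case easier to read.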

So go ahead, take your time, and get all this stuff set up.

Pause the video if you need to, and then pop back, and we're going to start actually working through the basics of spaCy.

I'm now going to move into an overview of what's within spaCy, why it's useful, and some of the basic features of it that you need to be familiar with.

And I'm going to be working from the Jupyter Notebook that I talked about in the introduction to this video.

If we scroll down to the bottom of chapter one, the basics of spaCy, and get past the install section, we get to this section on containers.

So what are containers? Well, containers within spaCy are objects that contain a large quantity of data about a text.

There are several different containers that you can work with.

In spaCy, there's the Doc, the DocBin, Example, Language, Lexeme, Span, SpanGroup, and Token. We're going to be dealing with the Lexeme a little bit in this video series.

And we're going to be dealing with the language container a little bit in this video series.

But really, the three big things that we're going to be talking about again and again are the doc, the span, and the token.

And I think when you first come to spaCy, there's a little bit of a learning curve about what these things are, what they do, and how they are structured hierarchically.

And for that reason I've created this, in my opinion, kind of easy to understand image of what different containers are.

So if you think about spaCy as a pyramid, a hierarchical system, we've got all these different containers structured around the doc object. Your doc container, or your doc object, contains a whole bunch of metadata about the text that you pass to the spaCy pipeline, which we're going to see in practice in just a few minutes.

The doc object contains a bunch of different things.

It contains attributes.

And these attributes can be things like sentences.

So if you iterate over doc.sents, you can actually access all the different sentences found within that doc object.

If you iterate over each individual item, or index, in your doc object, you can get individual tokens.

Tokens are going to be things like words or punctuation marks, something within your sentence or text that has a self-contained, important value, either syntactically or semantically.

So this is going to be things like words, a comma, a period, a semicolon, a quotation mark, things like this. These are all going to be your tokens.

And we're going to see how tokens are a little different than just splitting words up with traditional string methods in Python.
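As a small preview of that difference, the sketch below iterates over doc.sents and over the doc itself, then compares the tokens with a plain str.split. It assumes spaCy v3 and, to avoid needing a downloaded model, uses a blank English pipeline with the rule-based sentencizer component standing in for the sentence boundaries a full model would provide. The sample text is made up.

```python
import spacy

nlp = spacy.blank("en")
# Assumption: the rule-based sentencizer supplies sentence boundaries here,
# so doc.sents works without the en_core_web_sm model.
nlp.add_pipe("sentencizer")

text = "The United States is a country, and Berlin is a city. Both are places."
doc = nlp(text)

# Iterating over doc.sents yields one span per sentence.
for sent in doc.sents:
    print("SENTENCE:", sent.text)

# Iterating over the doc yields tokens; punctuation becomes its own token.
tokens = [token.text for token in doc]
# A plain whitespace split leaves punctuation glued to the words.
words = text.split()

print(len(tokens), tokens)
print(len(words), words)
```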

The next thing that you should be kind of familiar with are spans.

So spans are important because they kind of exist within and without of the doc object.

So unlike the token, which is an index of the doc object, a span can be a single token, but it can also be a sequence of multiple tokens. We're going to see that at play.

So imagine you had spans grouped by category. Maybe group one is places.

A single token there might be a city like Berlin. But span group two could be something like full proper names.

So of people, for example.

So this could be, as we're going to see, Martin Luther King. This would be a sequence of tokens, a sequence of three different items in the sentence that make up one span, or one self-contained item.

So Martin Luther King would be a person, which is a sequence of individual tokens.
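The Berlin and Martin Luther King examples can be sketched directly with spaCy's Span class. This assumes spaCy v3 and a blank English pipeline; the sentence is invented, and the labels PERSON and GPE are assigned by hand here for illustration rather than predicted by a model.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Martin Luther King visited Berlin.")

# A span is defined by start and end token indexes into the doc (end exclusive).
person = Span(doc, 0, 3, label="PERSON")  # three tokens
place = Span(doc, 4, 5, label="GPE")      # a single token

for span in (person, place):
    print(span.text, span.label_, len(span), [token.text for token in span])
```

Slicing the doc, as in doc[0:3], also returns a span, just without a label attached.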

If that doesn't make sense right now, this image will be reinforced as we go through and learn more about spaCy in practice.

For right now, I want you to just understand that the doc object is the thing around which all of spaCy sits. This is going to be the object that you create.

This is going to be the object that contains all the metadata that you need to access.

And this is going to be the object that you try to essentially improve with different custom components, factories, and pipelines as you go through and do more advanced things with spaCy.

We're going to now see in just a few seconds how that doc object is kind of similar to the text itself, but how it's also very, very different and much more powerful.

We're now going to be moving on to chapter two of this textbook, which is going to deal with getting used to the in-depth features of spaCy.

If you want, pause the video or keep this notebook or this book open, separate from this video, and follow along as we go through and explore it in live coding.

We're going to be talking about a few different things as we explore chapter two. This will be a lot longer than chapter one. We're going to be not only importing spaCy, but actually going through and loading up a model and creating a doc object around that model, so that we can work with the container in practice.

And then we're going to see how that container stores a lot of different features or metadata attributes about the text.

And while they look the same on the surface, they're actually quite different.

So let's go ahead and work within our same Jupyter Notebook, where we've imported spaCy and we have already created the NLP object.

The first thing that I want to do is open up a text to start working with. Within this repo, we've got a data folder.

Within this data sub folder, I've got a couple different Wikipedia openings, I've got one on MLK that we're going to be using a little later in this video.

And then I have one on the United States. This is wiki_us.

That's going to be what we work with right now.

So let's use our with statement and open up data/wiki_us.txt.

We're going to just read that in as f.

And then we're going to create this text object, which is going to be equal to f.read().

And now that we've got our text object created, let's go ahead and see what this looks like.

So let's print text.

And we see that it's a standard Wikipedia article kind of follows that same introductory format and it's about four or five paragraphs long with a lot of the features left in such as the brackets that delineate some kind of a footnote.

We're not going to worry too much about cleaning this up right now, because we're interested not so much in cleaning our data as in just starting to work with the doc object in spaCy.

So the first thing that you want to do is you're going to want to create a doc object