Harshit Tyagi - freeCodeCamp.org

How to Start Building Projects with LLMs

Harshit Tyagi — Mon, 30 Sep 2024 18:46:25 +0000

If you’re an aspiring AI professional, becoming an LLM engineer offers an exciting and promising career path.

But where should you start? What should your trajectory look like? How should you learn?

In one of my previous posts, I laid out the complete roadmap to become an AI / LLM Engineer. Reading this article will give you insights into the types of skills you’ll need to acquire and how to start learning.

The Best Way to Learn is to BUILD!

As Andrej Karpathy puts it:

Andrej emphasizes that you should build concrete projects, and explain everything you learn in your own words. (He also instructs us to only compare ourselves to a younger version of ourselves – never to others.)

And I agree – building projects is the best way to not just learn but really grok these concepts. It also sharpens the skills you need to think about cutting-edge use cases.

But the main challenge with this learning philosophy is that good projects can be hard to find.

And that’s the problem I am trying to resolve. I want to help people, including myself, discover and build practical and real-world projects that help you develop skills that are worth showcasing in your portfolio.

Here’s What We’ll Cover:

What Should Be Your First Project?
Project #1: YouTube Video Summarizer
Project #2 preview: Multi-purpose Customer Service Bot
Project #3 preview: RAG-Powered Support Bot
Conclusion

What Should Be Your First Project?

If you’re a beginner who knows basic to intermediate programming, your initial projects should showcase that you can comfortably build applications with LLMs.

They should demonstrate that:

you know what APIs are
you know how to consume them
you know how to build products that people actually want to use

Building a chatbot provides a great starting point, but at this point everyone has developed one. And there are many solutions for easy Streamlit based prototypes. So, you need to develop something that’s actually usable and has the potential to reach a wider audience.

I’d suggest building a chatbot for WhatsApp or Discord or Telegram. Build a chatbot which solves a problem people struggle with, a problem that companies have started to build solutions for.

If I had to pick a good and, arguably, the most common AI project that every company has started to work on, it would be RAG-powered chatbots.

But before you get to building RAG-powered bots, you should start building something slightly more basic but practical with LLMs.

To kick things off, let’s start by building a YouTube Summariser.

Project #1: Summarise YouTube Videos

We’ll build the first part of this project in this tutorial: the core functionality of a YouTube video summariser tool.

Our bot will:

Receive the YouTube URL.
Validate if the URL is correct.
Retrieve the transcript of the video
Use an LLM to analyze and summarize the video’s content.
Return the summary to the user.

Setup and Requirements

For this project, we’ll code the core functionality in a Jupyter Notebook using the following Python packages:

langchain-together — for the LLM using the LangChain <> Together AI integration
langchain-community — for specific data loaders
langchain — for programming with LLMs
pytube — for fetching video info
youtube-transcript-api — for youtube video transcript

We’ll use the Llama 3.1 model offered as an API by Together AI.

Together AI is a cloud platform that offers open source models as inference APIs, so you can use them without worrying about the underlying infrastructure.

Let’s start by installing these:

!pip install --upgrade --quiet langchain
!pip install --quiet langchain-community
!pip install --upgrade --quiet langchain-together
!pip install youtube_transcript_api
!pip install pytube

Now let’s set up our LLM:

## setting up the language model
from langchain_together import ChatTogether
import api_key  # local api_key.py file that stores your Together AI key in a variable named `api`

llm = ChatTogether(api_key=api_key.api,temperature=0.0, 
                   model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo")

The next step is to process the YouTube videos as a data source. For this we’ll need to understand the concept of document loaders.

Introduction to Document Loaders

Document loaders provide a unified interface to load data from various sources into a standardized Document format.

They automatically extract and attach relevant metadata to the loaded content.
The metadata can include source information, timestamps, or other contextual data that can be valuable for downstream processing.
LangChain offers loaders for CSV, PDF, HTML, JSON, and even specialized loaders for sources like YouTube transcripts or GitHub repositories, as listed in their integrations page.

Categories of Document Loaders

Document loaders in LangChain can be broadly categorized into two types:

1. File Type-Based Loaders

Parse and load documents based on specific file formats
Examples include: CSV, PDF, HTML, Markdown

2. Data Source-Based Loaders

Retrieve data from various external sources
Load the data into Document objects
Examples include: YouTube, Wikipedia, GitHub

Integration Capabilities

LangChain’s document loaders can integrate with almost any file format you might need.
They also support many third-party data sources.

For our project, we’ll use the YoutubeLoader to get the transcripts in the required format.

YoutubeLoader from LangChain to Get Transcript:

## import the YouTube document loader from LangChain
from langchain_community.document_loaders import YoutubeLoader

video_url = 'https://www.youtube.com/watch?v=gaWxyWwziwE'
loader = YoutubeLoader.from_youtube_url(video_url, add_video_info=False)
data = loader.load()

Process the YouTube Transcript

Display raw transcript content
Use the LLM to summarize and extract key points from the transcript:

# show the extracted page content
data[0].page_content

The page_content attribute contains the complete transcript as shown in the output below:

Now that we have the transcript, we simply need to pass this to the LLM we configured above along with the prompt to summarise.

First, let’s understand a simple method:

Langchain offers the invoke() method to which you need to pass the system message and the user or human message.

The system message is essentially the instructions for the LLM on how it is supposed to process the human request.

And the human message is simply what we want the LLM to do.

# This code creates a list of messages for the language model:
# 1. A system message with instructions on how to summarize the video transcript
# 2. A human message containing the actual video transcript

# The messages are then passed to the language model (llm) for processing
# The model's response is stored in the 'ai_msg' variable and returned

messages = [
    (
        "system", 
        """Read through the entire transcript carefully.
           Provide a concise summary of the video's main topic and purpose.
           Extract and list the five most interesting or important points from the transcript. For each point: State the key idea in a clear and concise manner.

        - Ensure your summary and key points capture the essence of the video without including unnecessary details.
        - Use clear, engaging language that is accessible to a general audience.
        - If the transcript includes any statistical data, expert opinions, or unique insights, prioritize including these in your summary or key points.""",
    ),
    ("human", data[0].page_content),
]
ai_msg = llm.invoke(messages)
ai_msg

But this method won’t work when you have more variables and when you want a more dynamic solution.

For this, LangChain offers PromptTemplate:

A PromptTemplate in LangChain is a powerful tool that helps in creating dynamic prompts for large language models (LLMs). It allows you to define a template with placeholders for variables that can be filled in with actual values at runtime.

This helps in managing and reusing prompts efficiently, ensuring consistency and reducing the likelihood of errors in prompt creation.

A PromptTemplate consists of:

Template String: The actual prompt text with placeholders for variables.
Input Variables: A list of variables that will be replaced in the template string at runtime.

# Set up a prompt template for summarizing a video transcript using LangChain

# Import necessary classes from LangChain
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain  # importing LLMChain from the package root is deprecated

# Define a PromptTemplate for summarizing video transcripts
# The template includes instructions for the AI model on how to process the transcript
product_description_template = PromptTemplate(
    input_variables=["video_transcript"],
    template="""
    Read through the entire transcript carefully.
           Provide a concise summary of the video's main topic and purpose.
           Extract and list the five most interesting or important points from the transcript. 
           For each point: State the key idea in a clear and concise manner.

        - Ensure your summary and key points capture the essence of the video without including unnecessary details.
        - Use clear, engaging language that is accessible to a general audience.
        - If the transcript includes any statistical data, expert opinions, or unique insights, 
        prioritize including these in your summary or key points.

    Video transcript: {video_transcript}    """
)
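To make the placeholder idea concrete, here is a minimal plain-Python sketch of what a prompt template does. It is an analogy built on str.format, not LangChain's implementation, and TEMPLATE and render_prompt are names made up for this illustration.

```python
# Plain-Python analogy for a prompt template (illustrative only, not LangChain code).
TEMPLATE = (
    "Summarise the transcript below in three sentences.\n\n"
    "Video transcript: {video_transcript}"
)


def render_prompt(template: str, **variables: str) -> str:
    """Fill every {placeholder} in the template with the matching keyword value."""
    return template.format(**variables)


prompt = render_prompt(
    TEMPLATE,
    video_transcript="Hello and welcome to the channel.",
)
print(prompt)
```

The template text is written once and reused, and only the variable part changes per video.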

How to Use LLMChain / LCEL for Summarization

A chain is a sequence of steps that consists of a language model, PromptTemplate, and an optional output parser.

Create an LLMChain with the custom prompt template
Generate a summary of the video transcript using the chain
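Conceptually, a chain is just function composition: the output of one step becomes the input of the next. The toy sketch below shows that idea in plain Python. Everything in it (make_chain, build_prompt, fake_llm, parse_output) is hypothetical, and fake_llm is a stand-in that never calls a real model.

```python
def build_prompt(inputs: dict) -> str:
    """Step 1: turn the input variables into a prompt string."""
    return "Summarise this transcript: " + inputs["video_transcript"]


def fake_llm(prompt: str) -> str:
    """Step 2: stand-in for a model call; it only reports how much it read."""
    return f"  SUMMARY ({len(prompt.split())} words of prompt read)  "


def parse_output(raw: str) -> str:
    """Step 3: optional output parser; here it just trims whitespace."""
    return raw.strip()


def make_chain(*steps):
    """Compose the steps so each one receives the previous step's output."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


chain = make_chain(build_prompt, fake_llm, parse_output)
result = chain({"video_transcript": "hello world"})
print(result)
```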

Here, we are using LLMChain, but you can also use the LangChain Expression Language (LCEL) to do this.

## create the chain from the LLM and the prompt template
chain = LLMChain(llm=llm, prompt=product_description_template)

# Run the chain with the video transcript
summary = chain.invoke({
    "video_transcript": data[0].page_content
})

This returns the summary as a dictionary whose text key contains the response in Markdown format.

summary['text']

The raw response will look like this:

To see the Markdown formatted response:

from IPython.display import Markdown, display

display(Markdown(summary['text']))

And there you go:

So, the core functionality of our YouTube summariser is now working.

But this only works in your Jupyter Notebook. To make it more accessible, we need to deploy this functionality on WhatsApp.

How to serve the YT summariser on WhatsApp

For this, we’d need to serve our YT summarisation functionality as an API endpoint for which we are going to use Flask. You can also use FastAPI.

Now we’ll turn all the code in the Jupyter notebook into functions. So, add a function to check if it is a valid youtube URL, then define the summarise function that is basically a compilation of what we wrote in the Jupyter notebook.
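One possible way to write the URL check is sketched below, using only the standard library. The list of accepted hosts and the URL shapes it allows are assumptions for illustration, so adjust them to the links you expect your users to send.

```python
from urllib.parse import urlparse, parse_qs

# Hosts accepted by this sketch (an assumption, extend as needed).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def is_youtube_url(url) -> bool:
    """Return True if the text looks like a link to a single YouTube video."""
    if not url:
        return False
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    if host not in YOUTUBE_HOSTS:
        return False
    if host == "youtu.be":
        # Short links carry the video id in the path: https://youtu.be/<id>
        return len(parsed.path.strip("/")) > 0
    # Regular links carry the video id in the query string: /watch?v=<id>
    return parsed.path == "/watch" and bool(parse_qs(parsed.query).get("v"))


print(is_youtube_url("https://www.youtube.com/watch?v=gaWxyWwziwE"))
print(is_youtube_url("hello there"))
```

A check like this only inspects the shape of the link. It does not confirm that the video exists or that it has a transcript.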

You can configure our endpoint in the following manner:

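from flask import Flask, request
from twilio.twiml.messaging_response import MessagingResponse

app = Flask(__name__)

@app.route('/summary', methods=['POST'])
def summary():
    # Twilio sends the incoming WhatsApp message as form data in the 'Body' field
    url = request.form.get('Body', '').strip()
    print(url)
    if is_youtube_url(url):
        response = summarise(url)
    else:
        response = "please check if this is a correct youtube video url"
    print(response)
    # Wrap the reply in TwiML so Twilio can send it back to the chat
    resp = MessagingResponse()
    msg = resp.message()
    msg.body(response)
    return str(resp)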

Once your app.py is ready with your Flask API, run the Python script, and you should have your server running locally on your system.

The next step is to make your local server connect with WhatsApp, and that’s where we’ll use Twilio.

Twilio allows us to implement this handshake by offering a WhatsApp sandbox to test your bot. You can follow the steps in this guide here to build this connection.

I got the connection established:

Now, we can start testing our WhatsApp bot:

Amazing!

I explain all the steps in detail in my project-based course on Building LLM-powered WhatsApp Chatbots.

It’s a 3-project course that contains two other more complex projects. I’ll give you a brief summary of those other projects here so you can try them out for yourselves. And if you’re interested, you can check out the course as well.

Project #2 — Build a Bot that Can Handle Different Types of User Queries

This bot acts as a customer service representative for an airline. It can answer questions related to flight status, baggage inquiries, ticket booking, and more. It uses Langchain’s Router and LLM models to dynamically generate responses based on the user’s input.

Different prompt templates are defined for various customer queries, such as flight status, baggage inquiries, and complaints.
Based on the query, the router selects the appropriate template and generates a response.
Twilio then sends the response back to the WhatsApp chat.
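To picture the routing step, here is a deliberately simplified sketch. The course project uses LangChain's router with an LLM to pick the template, while this toy version swaps in plain keyword matching so it runs without any model. The template texts, keyword lists, and the route function are all made up for illustration.

```python
# One prompt template per query type (illustrative wording).
PROMPT_TEMPLATES = {
    "flight_status": "You are an airline agent. Give the flight status for: {query}",
    "baggage": "You are a baggage specialist. Help with: {query}",
    "complaint": "You are a customer-care agent. Apologise and resolve: {query}",
}

# Keywords that stand in for the LLM-based routing decision.
KEYWORDS = {
    "flight_status": ("status", "delayed", "on time", "departure"),
    "baggage": ("baggage", "luggage", "suitcase"),
    "complaint": ("complaint", "refund", "unhappy"),
}

DEFAULT_TEMPLATE = "You are a helpful airline assistant. Answer: {query}"


def route(query: str) -> str:
    """Pick the first template whose keywords appear in the query and fill it in."""
    text = query.lower()
    for name, words in KEYWORDS.items():
        if any(word in text for word in words):
            return PROMPT_TEMPLATES[name].format(query=query)
    return DEFAULT_TEMPLATE.format(query=query)


print(route("My suitcase is lost"))
print(route("Is flight AI101 delayed?"))
```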

Project #3 — RAG-Powered Support Bot

This chatbot answers questions related to airline services using a document-based system. The document is converted into embeddings, which are then queried using Langchain’s RAG system to generate responses. Companies want developers these days who have these skills, so this is an especially practical project.

The guidelines/rules document is embedded using FAISS and HuggingFace models.
When a user submits a question, the RAG system retrieves relevant information from the document.
The system then generates a response using a pre-trained LLM and sends it back via Twilio.
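The retrieve-then-generate flow can be sketched without any external library. In the toy version below, word counts stand in for the HuggingFace embeddings and a sorted list stands in for the FAISS index, so treat it as an illustration of the idea rather than the project's code. The document chunks and all function names are invented for the example.

```python
import math
import re
from collections import Counter

# Pretend these are chunks of the airline's guidelines document.
DOCUMENT_CHUNKS = [
    "Each passenger may check in one bag of up to 23 kg free of charge.",
    "Tickets can be cancelled for a full refund within 24 hours of booking.",
    "Online check-in opens 48 hours before departure.",
]


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list, k: int = 1) -> list:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_rag_prompt(question: str, chunks: list) -> str:
    """Put the retrieved context in front of the question for the LLM."""
    context = "\n".join(retrieve(question, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


top = retrieve("How many kg can my bag weigh?", DOCUMENT_CHUNKS)
print(build_rag_prompt("How many kg can my bag weigh?", DOCUMENT_CHUNKS))
```

The prompt built at the end is what you would pass to the LLM, which then writes the final answer.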

These 3 projects will get you started so you can continue experimenting and learning more about AI engineering.

Customer Support is the most funded category in AI because it reduces the cost instantly if AI can handle communication with disgruntled users.

So, we build bots that can handle different types of queries, as well as intelligent RAG-powered bots that have access to proprietary documents to provide up-to-date information to users.

That’s why I created this project-based course to help you start building with LLMs.

Check out the course preview here:

And to thank you for reading this guide, you can use the code FREECODECAMP to get a 20% discount on my course.

I want to make this accessible to everyone who is sincere about building with AI, so I’ve priced it affordably at $14.99 USD.

Conclusion

In this tutorial, we focused on building a fun YouTube video summarizer tool that is served on WhatsApp.

The bot's core functionality includes:

Receiving a YouTube URL
Validating the URL
Retrieving the video transcript
Using an LLM to summarize the content
Returning the summary to the user

We used a number of Python packages including langchain-together, langchain-community, langchain, pytube, and youtube-transcript-api.

The project uses the Llama 3.1 model via Together AI's API.

We built the core summarisation functionality in two ways:

LangChain's invoke() method with system and human messages
PromptTemplate and LLMChain for more dynamic solutions

To make the tool accessible via WhatsApp:

The functionality is served as an API endpoint using Flask
Twilio is used to connect the local server with WhatsApp
A WhatsApp sandbox is used for testing the bot

To continue building further projects, check out the course.

It is a beginner track course where you start from learning to build with LLMs, then apply those skills to build 3 different types of LLM applications. Not just that – you learn to serve your applications as WhatsApp chatbots.

What to Know Before Taking Google's Machine Learning or Data Science Course

Harshit Tyagi — Wed, 27 Oct 2021 20:03:29 +0000

Whether you decide to take Andrew Ng’s Machine Learning and Deep Learning course on YouTube or any Data Science bootcamp, you will need a certain degree of mathematical and statistical knowledge.

This will not only help you understand basic ML/DS concepts, but it will also help you make a long-lasting, robust career as a data professional.

This is a short and precise guide for all self-taught devs and beginners in the field of Data Science and Machine Learning.

There's a common question that pops up in all my training programs, LinkedIn courses, videos on YouTube, or newsletters. It's that when people start learning DS/ML, after a certain point they feel lost in mathematics or statistics and sometimes programming.

And I have always recommended learning or refreshing some mathematical concepts that underpin ML as it helps you build intuition which keeps you curious throughout your learning journey.

To back up this claim, here are the prerequisites and prework Google recommends before taking their Machine Learning Crash Course:

Google ML course prerequisites

I’d recommend that you go through this article first and then look up all the links one by one and use this blog as a reference.

After going through the complete list of concepts and skills that are mentioned in the Google article, I also went through several books (Deep Learning by Ian Goodfellow, Deep Learning with Python by Francois Chollet, and several others).

From them, I tried to distill the essentials into three branches that you'll need to build a solid foundation for a career as a Data Analyst/Scientist/ML Engineer.

Following are the three pillars along with a list of concepts that make for a good starter program:

From my course here.

Programming for Complete Beginners in Data Science and Machine Learning

Programming means telling a computer predefined rules that help it process input data and then get the results.

Machine learning, on the other hand, is giving the machine the results and data to find the rules that best approximate the relationship between the data and the results.
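A small sketch can make this contrast concrete. In the first half we hand the computer the rule for converting Celsius to Fahrenheit. In the second half we hand it only data and results, and let a least-squares line fit recover the rule. The temperature example and the function names are my own choices for illustration.

```python
# Programming: we write the rule ourselves.
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


# Machine learning: we provide data and results, and the rule is estimated.
def fit_line(xs, ys):
    """Ordinary least squares for a line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


celsius = [-10, 0, 10, 20, 30]                              # data
fahrenheit = [celsius_to_fahrenheit(c) for c in celsius]    # results
slope, intercept = fit_line(celsius, fahrenheit)
print(slope, intercept)
```

The fitted slope and intercept come out at roughly 1.8 and 32, which is the rule we wrote by hand in the first function.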

Programming offers that base platform which you can use to automate, verify, and solve problems of any scale.

The next question is which language should you learn?

Since most of the courses, libraries, and books are written to support Python infrastructure, I recommend learning Python and so does Google’s guide. Which language you use is a personal choice and a lot of it depends on the type of problem you’re trying to solve.

Most beginners prefer Python as it is the best way to develop end-to-end projects and there is a very large community of fellow developers who can help you. Chances are that ~90% of the problems that you’ll encounter in your journey (especially in the beginning phase) are already solved and documented for you.

1. Essential Python Programming for Machine Learning

Most data roles are programming-based except for a few like business intelligence, market analysis, and product analysis.

I am going to focus on technical data jobs that require expertise in at least one programming language. I personally prefer Python over any other language because of its versatility and ease of learning – hands-down a good pick for developing end-to-end projects.

Here are some of the topics/libraries you should study for data science/ML:

Common data structures (data types, lists, dictionaries, sets, tuples), writing functions, logic, control flow, searching and sorting algorithms, object-oriented programming, and working with external libraries.
Writing Python scripts to extract, format, and store data into files or back to databases.
Handling multi-dimensional arrays, indexing, slicing, transposing, broadcasting and pseudorandom number generation using NumPy.
Performing vectorized operations using scientific computing libraries like NumPy.
Manipulate data with Pandas – series, dataframe, indexing in a dataframe, comparison operators, merging dataframes, mapping, and applying functions.
Wrangling data using pandas – checking for null values, imputing it, grouping data, describing it, performing exploratory analysis, and so on.
Data Visualization using Matplotlib – the API hierarchy, adding styles, color, and markers to a plot, knowledge of various plots and when to use them, line plots, bar plots, scatter plots, histograms, boxplots, and seaborn for more advanced plotting.

2. Essential Mathematics for Data Science and Machine Learning

There are practical reasons why Math is essential for folks who want a career as an ML practitioner, Data Scientist, or Deep Learning Engineer.

Use linear algebra to represent data

An image from the course: https://www.wiplane.com/p/foundations-for-data-science-ml

ML is inherently data-driven – data is at the heart of machine learning. We can think of data as vectors, objects that adhere to arithmetic rules. This leads us to understand how rules of linear algebra operate over arrays of data.
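Here is a minimal sketch of what treating data as vectors means in practice, written in plain Python so that each rule is visible. The house features and weights are invented numbers, and in real projects you would use NumPy for these operations.

```python
import math


def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]


def scale(c, v):
    """Multiply every component of v by the scalar c."""
    return [c * a for a in v]


def dot(u, v):
    """Dot product: sum of pairwise products."""
    return sum(a * b for a, b in zip(u, v))


def l1_norm(v):
    return sum(abs(a) for a in v)


def l2_norm(v):
    return math.sqrt(dot(v, v))


# Each observation is a vector of features: [bedrooms, area in square metres]
house_a = [3.0, 120.0]
house_b = [2.0, 80.0]
weights = [10.0, 1.5]  # a made-up linear model, one weight per feature

print(add(house_a, house_b))   # combined features
print(dot(weights, house_a))   # a linear model's prediction is a dot product
print(l2_norm([3, 4]))         # length of the vector (3, 4)
```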

Use calculus to train ML models

Image from the course: https://www.wiplane.com/p/foundations-for-data-science-ml

Model training does not happen “automatically”. Calculus is what drives the learning of most ML and DL algorithms.

One of the most commonly used optimization algorithms (gradient descent) is an application of partial derivatives.

A model is a mathematical representation of certain beliefs and assumptions. It is said to learn (approximate) the process (linear, polynomial, and so on) by which the data was generated in the first place, and then make predictions based on that learned process.
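To show how partial derivatives drive learning, here is a minimal gradient descent sketch that fits a line y = w * x + b by minimising the mean squared error. The data, learning rate, and number of epochs are assumptions picked so that this tiny example converges.

```python
def gradient_descent(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w * x + b by following the negative gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Partial derivatives of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Take a small step against the gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated by the process y = 2x + 1

w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

The model starts with no knowledge of the process (w = 0, b = 0) and ends up very close to the slope of 2 and intercept of 1 that generated the data.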

Important topics include:

Basic algebra – variables, coefficients, equations, and functions (linear, exponential, logarithmic, and so on).
Linear algebra – scalars, vectors, tensors, norms (L1 & L2), dot product, types of matrices, linear transformation, representing linear equations in matrix notation, solving linear regression problem using vectors and matrices.
Calculus – derivatives and limits, derivative rules, chain rule (for backpropagation algorithm), partial derivatives (to compute gradients), the convexity of functions, local/global minima, the math behind a regression model, applied math for training a model from scratch.

3. Essential Statistics for Data Science

Every organisation today is striving to become data-driven. To achieve that, Analysts and Scientists need to be able to use data in different ways in order to drive decision making.

Describing data — from data to insights

Data always comes in raw and ugly. The initial exploration tells you what’s missing, how the data is distributed, and what’s the best way to clean it to meet the end goal.

In order to answer the questions you've defined, descriptive statistics enables you to transform each observation in your data into insights that make sense.

Quantifying uncertainty

Furthermore, the ability to quantify uncertainty is a valuable skill that is highly regarded at any data company. Knowing the chances of success in any experiment/decision is crucial for all businesses.

Here are a few of the main staples of statistics that constitute the bare minimum:

Image from the lecture on Poisson distribution: https://www.wiplane.com/p/foundations-for-data-science-ml

Estimates of location – mean, median and other variants of these.
Estimates of variability
Correlation and covariance
Random variables – discrete and continuous
Data distributions – PMF, PDF, CDF
Conditional probability – Bayesian statistics
Commonly used statistical distributions – Gaussian, Binomial, Poisson, Exponential.
Important theorems – Law of large numbers and Central limit theorem.
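A few of these staples can be tried right away with Python's built-in statistics module. The sketch below covers estimates of location, an estimate of variability, and covariance and correlation. The numbers are made up, and the covariance and correlation helpers are written by hand to keep every step visible.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Estimates of location
mean = statistics.mean(data)
median = statistics.median(data)

# Estimate of variability (population standard deviation)
pstdev = statistics.pstdev(data)


def covariance(xs, ys):
    """Population covariance of two equally long samples."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.pstdev(xs) * statistics.pstdev(ys))


hours = [1, 2, 3, 4]
score = [2, 4, 6, 8]  # perfectly linear in hours

print(mean, median, pstdev)
print(covariance(hours, score), correlation(hours, score))
```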

Every beginner-level data science enthusiast should focus on these three pillars before diving into any core data science or core ML course.

How to Learn these Foundational DS and ML Concepts

I created a learning roadmap that you can find here. It tells you what to learn and is loaded with resources, courses, and programs that you can check out.

But there are a few inconsistencies in the recommended resources and the roadmap that I charted out.

Problems with Data Science or ML Courses

Every data science course that I listed in that article requires students to have a decent understanding of Programming, Math, or Statistics. For example, the most famous course on ML by Andrew Ng also relies heavily on the understanding of vector algebra and calculus.
Most courses that cover Math and Statistics for Data Science are just a checklist of concepts required for DS/ML with no explanation on how they are applied and how they are programmed into a machine.
There are exceptional resources to dive deep into Math but most of us are not made for it and you don't need to be a gold medalist to learn data science.

Bottom line: a resource that covers just enough applied math, statistics, and programming to get started with data science or ML is missing.

Wiplane Academy — wiplane.com

https://www.wiplane.com

So, I decided to give in and develop the course myself. I have spent months designing and developing a curriculum that will provide a solid foundation for your career as a…

Data Analyst
Data Scientist
Or an ML Practitioner/Engineer

Here's the course – Foundations for Data Science or ML — First Steps to learn Data Science and ML

It's a comprehensive yet compact and affordable course that not only covers all the essentials, pre-requisites, and pre-work but also explains how each concept is used computationally and programmatically (in Python).

And that’s not it – I will keep updating the course content every month based on your inputs. Learn more here.

How to Train BPE, WordPiece, and Unigram Tokenizers from Scratch using Hugging Face

Harshit Tyagi — Mon, 18 Oct 2021 22:27:40 +0000

If you've had some experience with NLP, you probably know that tokenization is at the helm of any NLP pipeline.

Tokenization is often regarded as a subfield of NLP but it has its own story of evolution. And now it underpins many state-of-the-art NLP models.

This post is all about training tokenizers from scratch by leveraging Hugging Face’s tokenizers package.

Before we get to the fun part of training and comparing the different tokenizers, I want to give you a brief summary of the key differences between the algorithms.

The main difference lies in the choice of character pairs to merge and the merging policy that each of these algorithms uses to generate the final set of tokens.

BPE Algorithm – a Frequency-based Model

Byte Pair Encoding uses the frequency of subword patterns to shortlist them for merging.

The drawback of using frequency as the driving factor is that you can end up having ambiguous final encodings that might not be useful for the new input text.

So there is still room for improvement when it comes to generating unambiguous tokens.
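To see the frequency-driven merging in action, here is a toy sketch of the BPE training loop in plain Python. It follows the commonly described procedure (count adjacent symbol pairs, merge the most frequent one, repeat) on a tiny made-up word-frequency table, and it is not the Hugging Face implementation.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Start from characters and greedily merge the most frequent pair each round."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(vocab)
```

Notice that the first merge is decided by a tie between two equally frequent pairs, which is a small example of how frequency alone can leave the choice ambiguous.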

Unigram Algorithm – a Probability-based Model

Next we have the Unigram model that approaches solving the merging problem by calculating the likelihood of each subword combination rather than picking the most frequent pattern.

It calculates the probability of every subword token and then drops it based on a loss function that is explained in this research paper.

Based on a certain threshold of the loss value, you can then trigger the model to drop the bottom 20-30% of the subword tokens.

Unigram is a completely probabilistic algorithm that chooses both the pairs of characters and the final decision to merge (or not) in each iteration based on probability.
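The probability side of Unigram is easiest to see at tokenization time. Given a probability for each subword token, the sketch below uses dynamic programming (Viterbi) to find the most likely way to split a word. The token probabilities are invented for the example, and the pruning step that drops low-value tokens during training is left out.

```python
import math


def best_segmentation(word, token_probs):
    """Return the most probable split of word into known tokens, with its log-probability."""
    n = len(word)
    best_score = [0.0] + [-math.inf] * n  # best log-probability of word[:i]
    back = [0] * (n + 1)                  # start index of the last token in that best split
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(token_probs[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        return None, -math.inf  # the word cannot be built from the vocabulary
    tokens, end = [], n
    while end > 0:
        start = back[end]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best_score[n]


# Made-up token probabilities that sum to 1
token_probs = {
    "h": 0.05, "u": 0.05, "g": 0.05, "s": 0.10,
    "hu": 0.15, "ug": 0.20, "gs": 0.15, "hug": 0.25,
}

tokens, log_prob = best_segmentation("hugs", token_probs)
print(tokens, math.exp(log_prob))
```

Here "hug" + "s" wins with probability 0.25 × 0.10 = 0.025, narrowly ahead of "hu" + "gs" at 0.15 × 0.15 = 0.0225.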

WordPiece Algorithm

With the release of BERT in 2018, there came a new subword tokenization algorithm called WordPiece which can be considered an intermediary of BPE and Unigram algorithms.

WordPiece is also a greedy algorithm that leverages likelihood instead of count frequency to merge the best pair in each iteration but the choice of characters to pair is based on count frequency.

So, it is similar to BPE in terms of choosing characters to pair and similar to Unigram in terms of choosing the best pair to merge.
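The difference in the merge decision can be shown on a toy vocabulary. The sketch below compares BPE's choice (the most frequent pair) with a likelihood-style score that is commonly used to describe WordPiece: the pair count divided by the product of the counts of its two parts. The score formula and the word frequencies are assumptions for illustration, and the "##" continuation prefix is ignored to keep things short.

```python
from collections import Counter


def pair_and_symbol_counts(vocab):
    """Count symbols and adjacent symbol pairs, weighted by word frequency."""
    pair_counts, symbol_counts = Counter(), Counter()
    for symbols, freq in vocab.items():
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts, symbol_counts


def best_pair_by_frequency(vocab):
    """BPE-style choice: the pair that occurs most often."""
    pair_counts, _ = pair_and_symbol_counts(vocab)
    return max(pair_counts, key=pair_counts.get)


def best_pair_by_likelihood(vocab):
    """WordPiece-style choice: count(pair) / (count(first) * count(second))."""
    pair_counts, symbol_counts = pair_and_symbol_counts(vocab)
    return max(
        pair_counts,
        key=lambda p: pair_counts[p] / (symbol_counts[p[0]] * symbol_counts[p[1]]),
    )


vocab = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

print(best_pair_by_frequency(vocab))
print(best_pair_by_likelihood(vocab))
```

The frequency rule picks ("u", "g") because it appears 20 times. The likelihood-style score picks ("g", "s") instead, because "s" almost never appears without "g" in front of it.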

With the algorithmic differences covered, I tried to implement each of these algorithms (not from scratch) to compare the output generated by each of them.

How to Train the BPE, Unigram, and WordPiece Algorithms

Now, in order to have an unbiased comparison of outputs, I didn’t want to use pre-trained algorithms as that would bring size, quality, and the content of the dataset into the picture.

One way could be to code these algorithms from scratch using the research papers and then test them out. This is a good approach in order to truly understand the workings of each algorithm but you might end up spending weeks doing that.

I instead used Hugging Face’s tokenizers package that offers the implementation of all of today’s most used tokenizers. It also allowed me to train these models from scratch on my choice of dataset and then tokenize the input string of my own choice.

How to Get the Training Datasets

I chose two different datasets to train these models. One is a free book from Gutenberg which serves as a small dataset and the other one is the wikitext-103 which is 516M of text.

In the Colab, you can download the datasets first and unzip them (if required):

!wget http://www.gutenberg.org/cache/epub/16457/pg16457.txt

!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip

!unzip wikitext-103-raw-v1.zip

Import the Required Models and Trainers

Going through the documentation, you’ll find that the main API of the package is the class Tokenizer.

You can then instantiate any tokenizer with your choice of model (BPE/ Unigram/ WordPiece).

Here, I imported the main class, all the models I wanted to test, and their trainers, as I want to train these models from scratch.

## importing the tokenizer and subword BPE trainer
from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram, WordLevel, WordPiece
from tokenizers.trainers import BpeTrainer, WordLevelTrainer, \
                                WordPieceTrainer, UnigramTrainer

## a pretokenizer to segment the text into words
from tokenizers.pre_tokenizers import Whitespace

How to Automate Training and Tokenization

Since I need to perform somewhat similar processes for three different models, I broke the processes into 3 functions. I’ll only need to call these functions for each model and my job here will be done.

So, what are these functions?

Step 1 - Prepare the tokenizer

Preparing the tokenizer requires us to instantiate the Tokenizer class with a model of our choice. But since we have four models (I added a simple Word-level algorithm as well) to test, we’ll write if/else cases to instantiate the tokenizer with the right model.

To train the instantiated tokenizer on the small and large datasets, we will also need to instantiate a trainer, in our case these would be BpeTrainer, WordLevelTrainer, WordPieceTrainer, and UnigramTrainer.

The instantiation and training will need us to specify some special tokens. These are tokens for unknown words and other special tokens that we’ll need to use later on to add to our vocabulary.

You can also specify other training arguments, such as vocabulary size or minimum frequency, here.

unk_token = "<UNK>"  # token for unknown words
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>"]  # special tokens

def prepare_tokenizer_trainer(alg):
    """
    Prepares the tokenizer and trainer with unknown & special tokens.
    """
    if alg == 'BPE':
        tokenizer = Tokenizer(BPE(unk_token = unk_token))
        trainer = BpeTrainer(special_tokens = spl_tokens)
    elif alg == 'UNI':
        tokenizer = Tokenizer(Unigram())
        trainer = UnigramTrainer(unk_token= unk_token, special_tokens = spl_tokens)
    elif alg == 'WPC':
        tokenizer = Tokenizer(WordPiece(unk_token = unk_token))
        trainer = WordPieceTrainer(special_tokens = spl_tokens)
    else:
        tokenizer = Tokenizer(WordLevel(unk_token = unk_token))
        trainer = WordLevelTrainer(special_tokens = spl_tokens)

    tokenizer.pre_tokenizer = Whitespace()
    return tokenizer, trainer

We’ll also need to add a pre-tokenizer to split our input into words. Without a pre-tokenizer, we might get tokens that overlap several words: for instance, we could get a "there is" token since those two words often appear next to each other.

Using a pre-tokenizer will ensure no token is bigger than a word returned by the pre-tokenizer.

This function will return the tokenizer and its trainer object which we can use to train the model on a dataset.

Here, we are using the same pre-tokenizer (Whitespace) for all the models. You can choose to test it with others.
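To get a feel for what the pre-tokenizer contributes, here is a minimal standard-library sketch of whitespace-and-punctuation splitting. It assumes the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation), which is how the Hugging Face documentation describes the Whitespace pre-tokenizer. `whitespace_pre_tokenize` is a hypothetical helper written for illustration, not part of the library.

```python
import re

# Assumed pattern: runs of word characters, or runs of non-space punctuation.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

def whitespace_pre_tokenize(text):
    """Split text into word-level pieces (hypothetical helper for illustration)."""
    return WORD_OR_PUNCT.findall(text)

pieces = whitespace_pre_tokenize("there is a deep-learning tutorial!")
print(pieces)
```

Because every piece is at most one word long, a model trained on these pieces can never learn a token such as "there is" that spans two words.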

Step 2 - Train the tokenizer

After preparing the tokenizers and trainers, we can start the training process.

Here’s a function that will take the file(s) on which we intend to train our tokenizer along with the algorithm identifier.

‘WLV’ - Word Level Algorithm
‘WPC’ - WordPiece Algorithm
‘BPE’ - Byte Pair Encoding
‘UNI’ - Unigram

def train_tokenizer(files, alg='WLV'):
    """
    Takes the files and trains the tokenizer.
    """
    tokenizer, trainer = prepare_tokenizer_trainer(alg)
    tokenizer.train(files, trainer) # training the tokenizer
    tokenizer.save("./tokenizer-trained.json")
    tokenizer = Tokenizer.from_file("./tokenizer-trained.json")
    return tokenizer

This is the main function that we’ll need to call for training the tokenizer. It will first prepare the tokenizer and trainer and then start training the tokenizers with the provided files.

After training, it saves the model in a JSON file, loads it from the file, and returns the trained tokenizer to start encoding the new input.

Step 3 - Tokenize the input string

The last step is to start encoding the new input strings and compare the tokens generated by each algorithm.

Here, we’ll be writing a nested for loop to train each model on the smaller dataset first followed by training on the larger dataset and tokenizing the input string as well.

Input string - “This is a deep learning tokenization tutorial. Tokenization is the first step in a deep learning NLP pipeline. We will be comparing the tokens generated by each tokenization model. Excited much?!😍”

small_file = ['pg16457.txt']
large_files = [f"./wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]

def tokenize(input_string, tokenizer):
    """
    Encodes the input string with the trained tokenizer.
    """
    return tokenizer.encode(input_string)

tokens_dict = {}  # tokens generated by each algorithm (last dataset wins)

input_string = "This is a deep learning tokenization tutorial. Tokenization is the first step in a deep learning NLP pipeline. We will be comparing the tokens generated by each tokenization model. Excited much?!😍"

for files in [small_file, large_files]:
    print(f"========Using vocabulary from {files}=======")
    for alg in ['WLV', 'BPE', 'UNI', 'WPC']:
        trained_tokenizer = train_tokenizer(files, alg)
        output = tokenize(input_string, trained_tokenizer)
        tokens_dict[alg] = output.tokens
        print("----", alg, "----")
        print(output.tokens, "->", len(output.tokens))

And here's the output:

Analysis of the output:

Looking at the output, you’ll see differences in how the tokens were generated, which in turn led to a different number of tokens for each algorithm.

A simple word level algorithm created 35 tokens no matter which dataset it was trained on.
The BPE algorithm created 55 tokens when trained on the smaller dataset and 47 when trained on the larger dataset. This shows that it was able to merge more pairs of characters when trained on a larger dataset.
The Unigram model created a similar number of tokens (68 and 67) with both datasets. But you can see the difference in the generated tokens:

With the larger dataset, merging came closer to generating tokens that are better suited to encoding the real-world English words that we often use.

WordPiece created 52 tokens when trained on the smaller dataset and 48 when trained on the larger dataset. The generated tokens have a double hash (##) to denote the use of a token as a prefix/suffix.

All three subword algorithms generated fewer and better subword tokens when trained on the larger dataset.

How to Compare the Tokens

To compare the tokens, I stored the output of each algorithm in a dictionary and I’ll turn it into a dataframe to view the differences in tokens better.

Since the number of tokens generated by each model is different, I’ve added a <PAD> token to make the data rectangular and fit a dataframe.

<PAD> is basically nan in the dataframe.

import pandas as pd

max_len = max(len(tokens_dict['UNI']), len(tokens_dict['WPC']), len(tokens_dict['BPE']))
diff_bpe = max_len - len(tokens_dict['BPE'])
diff_wpc = max_len - len(tokens_dict['WPC'])

tokens_dict['BPE'] = tokens_dict['BPE'] + ['<PAD>']*diff_bpe
tokens_dict['WPC'] = tokens_dict['WPC'] + ['<PAD>']*diff_wpc

del tokens_dict['WLV']

df = pd.DataFrame(tokens_dict)
df.head(10)

Here's the output:

You can also look at the difference in tokens using sets:
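A minimal sketch of that set comparison is shown here. It uses two short hypothetical token lists in place of the real `tokens_dict` entries, so the values are made up for illustration and are not the output of the trained tokenizers.

```python
# Hypothetical token lists standing in for tokens_dict['BPE'] and tokens_dict['WPC'].
bpe_tokens = ["This", "is", "a", "deep", "learning", "token", "ization"]
wpc_tokens = ["This", "is", "a", "deep", "learning", "token", "##ization"]

bpe_set, wpc_set = set(bpe_tokens), set(wpc_tokens)

only_bpe = bpe_set - wpc_set   # tokens produced only by BPE
only_wpc = wpc_set - bpe_set   # tokens produced only by WordPiece
shared = bpe_set & wpc_set     # tokens both algorithms agree on

print("Only BPE:", sorted(only_bpe))
print("Only WordPiece:", sorted(only_wpc))
print("Shared:", sorted(shared))
```

With the real dictionary, you would pass `tokens_dict['BPE']` and `tokens_dict['WPC']` instead, and drop any padding entries before building the sets.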

To check out the code, head over to this Colab notebook.

Closing Thoughts and Next Steps

Based on the kind of tokens generated, WPC does seem to generate subword tokens that are more commonly found in the English language – but don’t hold me to this observation.

These algorithms are slightly different from each other and do a somewhat similar job of developing a decent NLP model. But much of the performance depends on the use case of your language model, the vocabulary size, speed and other factors.

This concludes our examination of tokenization algorithms. The next step to deep dive into this is to understand what embeddings are, how tokenization plays a vital role in creating these embeddings, and how they affect a model’s performance.

A further advancement to these algorithms is the SentencePiece algorithm which is a wholesome approach to the whole tokenization problem. But much of this problem is alleviated by HuggingFace, and even better – they have all the algorithms implemented in a single GitHub repo.

References and Notes

If you have questions about my analysis or any of my work in this post, I highly encourage you to check out these resources for a precise understanding of the workings of each algorithm:

Subword regularization: Improving Neural Network Translation Models with Multiple Subword Candidates by Taku Kudo
Neural Machine Translation of Rare Words with Subword Units - Research paper that discusses different segmentation techniques based on the BPE compression algorithm.
Hugging Face’s tokenizer package.

The Evolution of Tokenization – Byte Pair Encoding in NLP

Harshit Tyagi — Tue, 05 Oct 2021 15:26:44 +0000

Natural Language Processing may have come a little late to the AI game, but companies like Google and OpenAI are working wonders with NLP techniques these days.

These companies have released state-of-the-art language models like BERT, GPT-2, and GPT-3. And GitHub Copilot and OpenAI Codex are among the popular applications that have been in the news lately.

As someone who has had very limited exposure to NLP, I decided to take it up as an area of research so I can learn more about it. My next few articles and videos will focus on sharing what I learn after dissecting some important components of NLP.

Main Components of NLP

NLP systems have three main components that help machines understand natural language:

Tokenization
Embedding
Model architectures

Top Deep Learning models like BERT, GPT-2, and GPT-3 all share the same components but with different architectures that distinguish one model from another.

In this article (and the notebook that accompanies it), we are going to focus on the basics of the first component of an NLP pipeline which is tokenization. It's an often overlooked concept, but it is a field of research in itself.

We have come so far from the traditional NLTK tokenization process. And though we have state-of-the-art algorithms for tokenization, it's always a good practice to understand its evolution and how we got to where we are now.

So, here's what we'll cover:

What is tokenization?
Why do we need a tokenizer?
Types of tokenization – Word, Character, and Subword.
Byte Pair Encoding Algorithm - a version of which is used by most NLP models these days.

The next part of this tutorial will dive into more advanced (or enhanced versions of Byte Pair Encoding) algorithms:

Unigram Algorithm
WordPiece – BERT transformer
SentencePiece – End-to-End tokenizer system

What is Tokenization?

Tokenization is the process of representing raw text in smaller units called tokens. These tokens can then be mapped with numbers to further feed to an NLP model.

Here's an overly simplified example of what a tokenizer does:

## read the text and enumerate the tokens in the text
with open('example.txt', 'r') as f:
    text = f.read()  # read a text file

words = text.split(" ")  # split the text on spaces

tokens = {v: k for k, v in enumerate(words)}  # generate a word to index mapping

Here, we have simply mapped every word in the text to a numerical index. This is, of course, a very simple example and we have not considered grammar, punctuation, or compound words (like test, test-ify, test-ing, and so on).

So we need a more technical and accurate definition of tokenization for our work here. To take into account all punctuation and every related word, we need to start working at the character level.

There are multiple applications of tokenization. One of the use cases comes from compiler design where you might need to parse computer programs to convert raw characters into keywords of a programming language.

In deep learning, tokenization is the process of converting a sequence of characters into a sequence of tokens which further needs to be converted into a sequence of numerical vectors that can be processed by a neural network.

Why do we need a Tokenizer?

The need for a tokenizer came from the question "How can we make machines read?"

A common way of processing textual data is to define a set of rules in a dictionary and then look up that fixed dictionary of rules. But this method can only go so far, and we want machines to learn these rules from the text that they read.

Now, machines don't know any language, nor do they understand sound or phonetics. They need to be taught from scratch and in such a way that they can read any language that's out there.

Quite a task, right?

Humans learn a language by connecting sound to the meaning and then we learn to read and write in that language. Machines can't do that, so they need to be given the most basic units of text to start processing the text.

That's where tokenization comes into play. It breaks down the text into smaller units called "tokens".

And there are different ways of tokenizing text which is what we'll learn now.

Different ways to tokenize text

To make the deep learning model learn from the text, we need a two-step process:

Tokenize – decide the algorithm we'll use to generate the tokens.
Encode the tokens to vectors
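The two steps can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: whitespace splitting stands in for the tokenization algorithm, and one-hot vectors stand in for the learned embeddings a real model would use.

```python
text = "deep learning is fun and deep learning is useful"

# Step 1: tokenize (here, naive whitespace splitting)
tokens = text.split()

# Build a vocabulary: each distinct token gets an integer id
# in order of first appearance
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

# Step 2: encode each token as a one-hot vector of length len(vocab)
ids = [vocab[tok] for tok in tokens]
one_hot = [[1 if i == idx else 0 for i in range(len(vocab))] for idx in ids]

print(vocab)
print(ids)
print(one_hot[0])
```

Repeated words such as "deep" map to the same id, so the vocabulary has six entries even though the text contains nine tokens.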

Word-based tokenization

As the first step suggests, we need to decide how to convert text into small tokens. A simple and straightforward method that most of us would propose is to use word-based tokens, splitting the text by spaces.

Problems with Word tokenizer

There's a high risk of missing words in the training data. With word tokens, your model won't recognize the variants of words that were not part of the data on which the model was trained.

So, if your model has seen foot and ball in the training data but the final text has football, the model won't be able to recognize the word and it will be treated as an <UNK> (unknown) token.
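Here is a small sketch of that failure mode. The training sentence, the `encode` helper, and the choice of id 0 for the unknown token are all assumptions made for illustration.

```python
UNK = "<UNK>"

# Vocabulary learned from a (hypothetical) training text
training_text = "the foot hit the ball"
vocab = {UNK: 0}
for word in training_text.split():
    vocab.setdefault(word, len(vocab))

def encode(sentence):
    """Map each word to its id, falling back to the <UNK> id for unseen words."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]

seen = encode("the foot hit the ball")
unseen = encode("the football hit")
print(seen)
print(unseen)
```

Even though "foot" and "ball" are both in the vocabulary, "football" is mapped to the unknown id, so the model loses all information about that word.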

Similarly, punctuation poses another problem. For example, let or let's will need individual tokens which is an inefficient solution. This will require a huge vocabulary to make sure you've thought of every variant of the word.

Even if you add a lemmatizer to solve this problem, you're adding an extra step in your processing pipeline.

It's also tough to handle slang and abbreviations. We use lots of slang and abbreviations in text these days, such as "FOMO", "LOL", "tl;dr" and so on. What do we do for these words?

Finally, what if the language doesn't use spaces for segmentation? For a language like Chinese, which doesn't use spaces for word separation, this tokenizer will fail completely.

After encountering these problems, researchers looked into another approach which involved tokenizing all the characters.

Character-based tokenization

To resolve the problems associated with word-based tokenization, data scientists tried an alternative approach of character-by-character tokenization.

This did solve the problem of missing words, since we are now dealing with characters that can be encoded using ASCII or Unicode. The model could now generate an embedding for any word.

Every character, whether it was a space, apostrophe, colon, or whatever can now be assigned a symbol to generate a sequence of vectors.

But this approach had its own cons.

Problems with character-based models

First, this approach requires more computing resources. Character-based models will treat each character as a token. And more tokens means more input computations to process each token which in turn requires more compute resources.

For example, for a 5-word long sentence, you may need to process 30 tokens instead of 5 word-based tokens.
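A quick way to see the difference in sequence length is to count both kinds of tokens for the same sentence. The sentence below is just an arbitrary example.

```python
sentence = "Tokenizers split text into pieces"

word_tokens = sentence.split()   # one token per word
char_tokens = list(sentence)     # one token per character, spaces included

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
```

The character-level sequence is several times longer than the word-level one, and every extra token is extra work for the model.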

Also, it narrows down the number of NLP tasks and applications. With long sequences of characters, you can only use a certain type of neural network architecture.

This limits the type of NLP tasks we can perform. For applications like entity recognition or text classification, character-based encoding might turn out to be an inefficient approach.

Finally, there's a risk of learning incorrect semantics. Working with characters could generate incorrect spellings of words. Also, with no inherent meaning, learning with characters is like learning with no meaningful semantics.

What's fascinating is that for such a seemingly simple task, multiple algorithms have been written to find the optimal tokenization policy.

After understanding the pros and cons of these tokenization methods, it makes sense to look for an approach that offers a middle route. We'll want one that preserves the semantics with limited vocabulary that can generate all the words in the text on merging.

Subword Tokenization

With character-based models, we risk losing the semantic features of the word. And with word-based tokenization, we need a very large vocabulary to encompass all the possible variations of every word.

So, the goal was to develop an algorithm that could:

Retain the semantic features of the token, that is information per token.
Tokenize without demanding a very large vocabulary with a finite set of words.

To solve this problem, you can think of breaking down the words based on a set of prefixes and suffixes. For example, we can write a rule-based system to identify subwords like "##s", "##ing", "##ify", "un##" and so on, where the position of the double hash denotes prefix and suffixes.

So, a word like "unhappily" is tokenized using subwords like "un##", "happ", and "##ily".
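A rule-based splitter of this kind can be sketched as follows. The prefix and suffix lists and the `rule_based_subwords` helper are hypothetical, and the function strips at most one prefix and one suffix, so treat it as an illustration of the idea and not a usable tokenizer.

```python
# Hypothetical rule set; a real system would need many more affixes per language.
PREFIXES = ["un", "re"]
SUFFIXES = ["ily", "ing", "ify", "s"]

def rule_based_subwords(word):
    """Split off at most one known prefix and one known suffix."""
    pieces = []
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            pieces.append(prefix + "##")      # trailing ## marks a prefix
            word = word[len(prefix):]
            break
    suffix_piece = None
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            suffix_piece = "##" + suffix      # leading ## marks a suffix
            word = word[:-len(suffix)]
            break
    pieces.append(word)
    if suffix_piece:
        pieces.append(suffix_piece)
    return pieces

print(rule_based_subwords("unhappily"))
print(rule_based_subwords("testing"))
```

The hand-written affix lists are the weak point: they have to be maintained separately for every language, which is the limitation discussed next.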

The model only learns relatively few subwords and then puts them together to create other words. This solves the problems of memory requirement and effort required to create a large vocabulary.

Problems with the subword tokenization algorithm:

First of all, some of the subwords that are created as per the defined rules may never appear in your text to tokenize and may end up occupying extra memory.

Also, for every language, we'll need to define a different set of rules to create subwords.

To alleviate this problem, in practice, most modern tokenizers have a training phase that identifies the recurring text in the input corpus and creates new subword tokens. For rare patterns, we stick to word-based tokens.

Another important factor that plays a vital role in this process is the size of the vocabulary that the user sets. A large vocabulary size allows for more common words to be tokenized, whereas a smaller vocabulary requires more subwords to be created to build every word in the text without using the <UNK> token.

Striking the right balance for your application is key here.

Byte Pair Encoding (BPE) Algorithm

BPE was originally a data compression algorithm that you use to find the best way to represent data by identifying the common byte pairs. We now use it in NLP to find the best representation of text using the smallest number of tokens.

Here's how it works:

Add an identifier (</w>) at the end of each word to identify the end of a word and then calculate the word frequency in the text.
Split the word into characters and then calculate the character frequency.
From the character tokens, for a predefined number of iterations, count the frequency of the consecutive byte pairs and merge the most frequently occurring byte pairing.
Keep iterating until you have reached the iteration limit (set by you) or until you have reached the token limit.

Let's go through each step (in the code) for some sample text. For coding this, I have taken help from Lei Mao's very minimalistic blog on BPE. I encourage you to check it out!

Step 1: Add word identifiers and calculate word frequency

Here's our sample text:

"There is an 80% chance of rainfall today. We are pretty sure it is going to rain."

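import collections

## define the text first
text = "There is an 80% chance of rainfall today. We are pretty sure it is going to rain."

## get the word frequency and add the end of word (</w>) token
## at the end of each word
words = text.strip().split(" ")
print(f"Vocabulary size: {len(words)}")

## store each word as space-separated characters followed by </w>
word_freq_dict = collections.defaultdict(int)
for word in words:
    word_freq_dict[" ".join(word) + " </w>"] += 1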

Step 2: Split the word into characters and then calculate the character frequency

char_freq_dict = collections.defaultdict(int)
for word, freq in word_freq_dict.items():
    chars = word.split()
    for char in chars:
        char_freq_dict[char] += freq

char_freq_dict

Step 3: Merge the most frequently occurring consecutive byte pairings

import re

## create all possible consecutive pairs
pairs = collections.defaultdict(int)
for word, freq in word_freq_dict.items():
    chars = word.split()
    for i in range(len(chars)-1):
        pairs[chars[i], chars[i+1]] += freq

Step 4 - Iterate n times to find the best (in terms of frequency) pairs to encode and then concatenate them to find the subwords

It is better at this point to structure our code into functions. This means that we need to perform the following steps:

Find the most frequently occurring byte pairs in each iteration.
Merge these tokens.
Recalculate the character tokens' frequency with the new pair encoding added.
Keep doing this until there are no more pairs or you reach the end of the for loop.
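The steps above can be sketched as a small self-contained loop. This is a simplified illustration and not the notebook's code: the helper names `get_pairs`, `merge_pair`, and `learn_bpe` are hypothetical, the merge is done symbol by symbol, and the number of merges (5) is an arbitrary choice.

```python
import collections

def get_pairs(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in word_freqs.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Rewrite every word so that the chosen pair becomes a single symbol."""
    merged = {}
    for word, freq in word_freqs.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = " ".join(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Greedily merge the most frequent pair, num_merges times at most."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pairs(word_freqs)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
    return merges, word_freqs

text = "There is an 80% chance of rainfall today. We are pretty sure it is going to rain."
# Each word becomes space-separated characters followed by the </w> marker.
word_freqs = dict(collections.Counter(" ".join(w) + " </w>" for w in text.split()))

merges, final_words = learn_bpe(word_freqs, num_merges=5)
print(merges)
print(list(final_words)[:5])
```

Ties between equally frequent pairs are broken by whichever pair was counted first, so the exact merge order can differ between implementations.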

For detailed code, you should check out my Colab notebook.

Here’s a trimmed output of those 4 steps:

So as we iterate with each best pair, we merge (concatenate) the pair. You can see that as we recalculate the frequency, the original character token frequency is reduced and the new paired token frequency pops up in the token dictionary.

If you look at the number of tokens created, it first increases because we create new pairings – but the number starts to decrease after a number of iterations.

Here, we started with 25 tokens, went up to 31 tokens in the 14th iteration, and then came down to 16 tokens in the 50th iteration. Interesting, right?

How to improve the BPE algorithm

BPE algorithm is a greedy algorithm, which means that it tries to find the best pair in each iteration. And there are some limitations to this greedy approach.

So of course there are pros and cons of the BPE algorithm, too.

The final tokens will vary depending upon the number of iterations you have run. This also causes another problem: we now can have different tokens for a single text, and thus different embeddings.

To address this issue, multiple solutions have been proposed. But the one that stood out was a unigram language model that added subword regularization (a new method of subword segmentation) training that calculates the probability for each subword token to choose the best option using a loss function. We'll talk more about this in upcoming articles.

Do we use BPE in BERTs or GPTs?

Models like BERT or GPT-2 use some version of the BPE or the unigram model to tokenize the input text.

BERT included a new algorithm called WordPiece. It is similar to BPE, but has an added layer of likelihood calculation to decide whether the merged token will make the final cut.
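One commonly described formulation of that likelihood step (an assumption here, since this article does not spell it out) scores each candidate pair as its frequency divided by the product of the frequencies of its two parts. Pairs whose parts rarely appear apart then win over pairs that are merely frequent. A minimal sketch with made-up counts:

```python
# Hypothetical symbol and pair counts, made up for illustration.
symbol_freq = {"t": 100, "h": 60, "q": 5, "u": 40}
pair_freq = {("t", "h"): 30, ("q", "u"): 5}

def wordpiece_score(pair):
    """Pair frequency normalized by the frequencies of its two parts."""
    left, right = pair
    return pair_freq[pair] / (symbol_freq[left] * symbol_freq[right])

bpe_choice = max(pair_freq, key=pair_freq.get)          # raw frequency, as in BPE
wordpiece_choice = max(pair_freq, key=wordpiece_score)  # normalized score

print("BPE would merge:", bpe_choice)
print("WordPiece-style score would merge:", wordpiece_choice)
```

With these counts, plain frequency picks ("t", "h"), while the normalized score picks ("q", "u") because "q" almost never occurs without "u".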

Summary

In this blog, you've learned how a machine starts to make sense of language by breaking down the text into very small units.

Now, there are many ways to break text down and so it becomes important to compare one approach with another.

We started off by understanding tokenization by splitting the English text by spaces – but not every language is written the same way (that is using spaces to denote segmentation). So then we looked at splitting by character to generate character tokens.

The problem with characters was the loss of semantic features from the tokens at the risk of creating incorrect word representations or embeddings.

To get the best of both worlds, we looked at subword tokenization which was more promising. And finally we looked at the BPE algorithm to implement subword tokenization.

We'll look more into the next steps and advanced tokenizers like WordPiece, SentencePiece, and how to work with the HuggingFace tokenizer next week.

References and Notes

My post is actually an accumulation of the following papers and blogs that I encourage you to read:

Neural Machine Translation of Rare Words with Subword Units - Research paper that discusses different segmentation techniques based on the BPE compression algorithm.
GitHub repo on Subword NMT(Neural Machine Translation) - supporting code for the above paper.
Lei Mao’s blog on Byte Pair Encoding - I used the code in his blog to implement and understand BPE myself.
How Machines read - a blog by Cathal Horan.

Use Python, SpaCy, and Streamlit to Build a Structured Financial Newsfeed

Harshit Tyagi — Thu, 23 Sep 2021 17:01:09 +0000

One of the very interesting and widely used applications of Natural Language Processing is Named Entity Recognition (NER).

Getting insights from raw and unstructured data is of vital importance. Uploading a document and getting the important bits of information from it is called information retrieval.

Information retrieval has always been a major task and challenge in NLP. And we can use NER (or NEL — Named Entity Linking) in several domains like finance, drug research, e-commerce, and more for information retrieval purposes.

In this tutorial post, I’ll show you how you can leverage NEL to develop a custom stock market news feed that lists down the buzzing stocks on the internet.

Pre-requisites

There are no real pre-requisites as such. It would be helpful if you had some familiarity with Python and the basic tasks of NLP like tokenization, POS tagging, dependency parsing, and so on.

I’ll cover the important bits in more detail, so even if you’re a complete beginner you’ll be able to wrap your head around what’s going on.

So, let’s get on with it! Follow along and you’ll have a minimal stock news feed that you can start researching by the end of this tutorial.

What you’ll need to get started:

Google Colab for initial testing and exploration of data and the SpaCy library.
VS Code (or any editor) to code the Streamlit application.
Source of stock market information (news) on which we’ll perform NER and later NEL.
A virtual Python environment (I am using conda) along with libraries like Pandas, SpaCy, Streamlit, Streamlit-Spacy (if you want to show some SpaCy renders.)

Goals of the Project

The goal of this project is to learn and apply Named Entity Recognition to extract important entities (publicly traded companies in our example) and then link each entity with some information using a knowledge base (Nifty500 companies list).

We’ll get the textual data from RSS feeds on the internet and extract the names of buzzing stocks. We'll then pull their market price data to test the authenticity of the news before taking any position in those stocks.

Note: NER may not be a state-of-the-art problem but it has many applications in the industry.

Let's move on to Google Colab for experimentation and testing.

To get some reliable authentic stock market news, I’ll be using the Economic Times and Money Control RSS feeds for this tutorial. But you can also use/add your country’s RSS feeds or Twitter/Telegram (groups) data to make your feed more informative/accurate.

The opportunities are immense. This tutorial should serve as a stepping stone to apply NEL to build apps in different domains solving different kinds of information retrieval problems.

If you go on to look at the RSS feed, it looks something like this:

Our goal is to get the textual headlines from this RSS feed and then we’ll use SpaCy to extract the main entities from the headlines.

The headlines are present inside the </code> tag of the XML here. Firstly, we need to capture the entire XML document and we can use the <code>**requests**</code> library to do that. Make sure you have these packages installed in your runtime environment in colab. You can run the following command to install almost any package right from a colab’s code cell: <pre><code class="lang-shell">!pip install <package_name> </code></pre> Send a <code>GET</code> request at the provided link to capture the XML doc. <pre><code class="lang-python">import requests resp = requests.get("https://economictimes.indiatimes.com/markets/stocks/rssfeeds/2146842.cms") </code></pre> Run the cell to check what you get in the response object. It should give you a successful response with HTTP code 200 as follows: Now that you have this response object, we can pass its content to the BeautifulSoup class to parse the XML document as follows: <pre><code class="lang-python">from bs4 import BeautifulSoup soup = BeautifulSoup(resp.content, features='xml') soup.findAll('title') </code></pre> This should give you all the headlines inside a Python list: Awesome – we have the textual data out of which we will extract the main entities (which are publicly traded companies in this case) using NLP. It’s time to put NLP into action. <h2 id="heading-step-2-how-to-extract-entities-from-the-headlines">Step 2: How to extract entities from the headlines</h2> This is the exciting part. We’ll be using a pre-trained core language model from the <code>**spaCy**</code> library to extract the main entities in a headline. Let's talk a little more about spaCy and the core models. <a target="_blank" href="https://spacy.io/">spaCy</a> is an open-source NLP library that processes textual data at a superfast speed. It is the leading library in NLP research which is being used in enterprise-grade applications at scale. spaCy is well-known for scaling with the problem. 
And it supports more than 64 languages and works well with both TensorFlow and PyTorch. Talking about core models, spaCy has two major classes of pre-trained language models that are trained on different sizes of textual data to give us state-of-the-art inferences. <ol> <li>Core Models — for general-purpose basic NLP tasks. </li> <li>Starter Models — for niche applications that require transfer learning. We can leverage the model’s learned weights to fine-tune our custom models without having to train the model from scratch. </li> </ol> Since our use case is basic in this tutorial, we are going to stick with the <code>en_core_web_sm</code> core model pipeline. So, let’s load this into our notebook: <pre><code class="lang-javascript">nlp = spacy.load("en_core_web_sm") </code></pre> Note: Colab already has this downloaded for us, but if you try to run it in your local system, you’ll have to download the model first using the following command: <pre><code class="lang-javascript">python -m spacy download en_core_web_sm </code></pre> <code>en_core_web_sm</code> is basically an English pipeline optimized for CPU which has the following components: <ul> <li>tok2vec — token to vector s(performs tokenization on the textual data), </li> <li>tagger — adds relevant metadata to each token. spaCy makes use of some statistical models to predict the part of speech (POS) of each token. More in the <a target="_blank" href="https://spacy.io/models/en">documentation</a>. </li> <li>parser — dependency parser establishes relationships among the tokens. </li> <li>Other components include senter, ner, attribute_ruler, and lemmatizer. </li> </ul> Now, to test what this model can do for us, I’ll pass a single headline through the instantiated model and then check the different parts of a sentence. 
<pre><code class="lang-javascript"># make sure you extract the text out of <title> tags processed_hline = nlp(headlines[4].text) </code></pre> The pipeline performs all the tasks from tokenization to NER. Here we have the tokens first: You can look at the tagged part of speech using the <code>pos_</code> attribute. Each token is tagged with some metadata. For example, Trade is a Proper Noun, Setup is a Noun, <code>:</code> is punctuation, so on, and so forth. The entire list of Tags is given <a target="_blank" href="https://spacy.io/models/en">here</a>. And then, you can look at how they are related by looking at the dependency graph using the <code>dep_</code> attribute: Here, Trade is a Compound, Setup is Root, Nifty is an appos (Appositional modifier). Again, all the syntactic tags can be found <a target="_blank" href="https://spacy.io/models/en">here</a>. You can also visualize the relationship dependencies among the tokens using the following displacy <code>render()</code> method: <pre><code class="lang-python">spacy.displacy.render(processed_hline, style='dep', jupyter=True, options={'distance': 120}) </code></pre> which will give this graph: <h3 id="heading-entity-extraction">Entity extraction</h3> And to look at the important entities of the sentence, you can pass <code>**'ent’**</code> as style in the same code: We have different tags for different entities like the day has DATE, and Glasscoat has GPE which can be Countries/Cities/States. We are primarily looking for entities that have the ORG tag that’ll give us Companies, agencies, institutions, and so on. We are now capable of extracting entities from the text. Let’s get down to extracting the organizations from all the headlines using ORG entities. 
<pre><code class="lang-python">ent.py companies = [] for title in headlines: doc = nlp(title.text) for token in doc.ents: if token.label_ == 'ORG': companies.append(token.text) else: pass </code></pre> This will return a list of all the companies as follows: So easy, right? That’s the magic of spaCy now! The next step is to look up all these companies in a knowledge base to extract the right stock symbol for that company. Then we'll use libraries like yahoo-finance to extract their market details like price, return, and so on. <h2 id="heading-step-3-named-entity-linking">Step 3 — Named Entity Linking</h2> Learning about what stocks are buzzing in the market and getting their details on your dashboard are the goals for this project. We have the company names, but in order to get their trading details, we’ll need the company’s trading stock symbol. Since I am extracting the details and news of Indian Companies, I am going to use an external database of <a target="_blank" href="https://www1.nseindia.com/products/content/equities/indices/nifty_500.htm">Nifty 500 companies (a CSV file).</a> For every company, we’ll look it up in the list of companies using pandas, and then we’ll capture the stock market statistics using the <a target="_blank" href="https://pypi.org/project/yfinance/">yahoo-finance</a> library. 
<pre><code class="lang-python">import pandas as pd
import yfinance as yf

# stocks_df is assumed to be the Nifty 500 CSV already loaded into a pandas DataFrame

## collect various market attributes of a stock
stock_dict = {
    'Org': [],
    'Symbol': [],
    'currentPrice': [],
    'dayHigh': [],
    'dayLow': [],
    'forwardPE': [],
    'dividendYield': []
}
info_keys = ['currentPrice', 'dayHigh', 'dayLow', 'forwardPE', 'dividendYield']

## for each company look it up and gather all market data on it
for company in companies:
    try:
        # literal (non-regex) match of the entity text against the company names
        matches = stocks_df[stocks_df['Company Name'].str.contains(company, regex=False)]
        if matches.empty:
            continue
        symbol = matches['Symbol'].values[0]
        org_name = matches['Company Name'].values[0]
        stock_info = yf.Ticker(symbol + ".NS").info
        # read every field before appending so a missing key cannot leave the lists uneven
        values = [stock_info[key] for key in info_keys]
        stock_dict['Org'].append(org_name)
        stock_dict['Symbol'].append(symbol)
        for key, value in zip(info_keys, values):
            stock_dict[key].append(value)
    except Exception:
        # skip companies whose lookup or market data request fails
        continue

## create a dataframe to display the buzzing stocks
pd.DataFrame(stock_dict)
</code></pre> One thing that you should notice here is that I’ve added a “.NS” after each stock symbol before passing it to the <code>Ticker</code> class of the <code>yfinance</code> library. This is because Indian NSE stock symbols are stored with a <code>.NS</code> suffix in <code>yfinance</code>. And the buzzing stocks would turn up in a dataframe like below: Voilà! Isn’t this great? Such a simple yet profound app that could point you in the right direction with the right stocks. Now to make it more accessible, we can create a web application out of the code that we have just written using Streamlit. <h2 id="heading-step-4-how-to-build-a-web-app-using-streamlit">Step 4 — How to build a web app using Streamlit</h2> It’s time to move to an editor and create a new project and virtual environment for the NLP application. Getting started with Streamlit is super easy for such demo data applications. 
Make sure you have Streamlit installed. <pre><code class="lang-bash">pip install streamlit
</code></pre> Now, let’s create a new file called app.py and start writing functional code to get the app ready. Import all the required libraries at the top like this: <pre><code class="lang-python">import pandas as pd
import requests
import spacy
import streamlit as st
from bs4 import BeautifulSoup
import yfinance as yf
</code></pre> Add a title to your application: <pre><code class="lang-python">st.title('Buzzing Stocks :zap:')
</code></pre> Test your app by running <code>streamlit run app.py</code> in your terminal. It should open up an app in your web browser. I have added some extra functionality to capture data from multiple sources. Now, you can add an RSS feed URL of your choice into the application and the data will be processed and the trending stocks will be displayed in a dataframe. To get access to the entire code base, you can check out my repository here: <a target="_blank" href="https://github.com/dswh/NER_News_Feed">https://github.com/dswh/NER_News_Feed</a> If you want to follow me step-by-step, watch me code this application here: <div class="embed-wrapper"> </div> You can add multiple styling elements, different data sources, and other types of processing to make it more efficient and useful. My app in its current state looks like the image in the banner. <h2 id="heading-next-steps">Next Steps</h2> Instead of picking a financial use case, you can also pick any other application of your choice – healthcare, e-commerce, research, and many others. All industries require documents to be processed and important entities to be extracted and linked. Try out another idea. A simple idea is extracting all the important entities of a research paper and then creating a knowledge graph of it using the Google Search API. Also, if you want to take the stock news feed app to another level, you can add some trading algorithms to generate buy and sell signals as well. 
I encourage you to go wild with your imagination. <h3 id="heading-how-you-can-connect-with-me">How you can connect with me</h3> If you liked this post and would like to see more of such content, you can subscribe to <a target="_blank" href="https://dswharshit.substack.com/publish/settings#twitter-account">my newsletter</a> or <a target="_blank" href="https://www.youtube.com/channel/UCH-xwLTKQaABNs2QmGxK2bQ">my YouTube channel</a> where I’ll keep sharing such useful and quick projects that one can build. If you’re someone who is just getting started with programming or want to get into data science or ML, you can check out my course at <a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml">WIP Lane Academy</a>. </article> <article> <h1> How Machine Learning Uses Linear Algebra to Solve Data Problems </h1> Harshit Tyagi — Wed, 01 Sep 2021 15:47:54 +0000 Machines or computers only understand numbers. And these numbers need to be represented and processed in a way that lets machines solve problems by learning from the data instead of learning from predefined instructions (as in the case of programming). All types of programming use mathematics at some level. Machine learning involves programming data to learn the function that best describes the data. The problem (or process) of finding the best parameters of a function using data is called model training in ML. Therefore, in a nutshell, machine learning is programming to optimize for the best possible solution – and we need math to understand how that problem is solved. The first step towards learning Math for ML is to learn linear algebra. Linear Algebra is the mathematical foundation that solves the problem of representing data as well as computations in machine learning models. It is the math of arrays — technically referred to as vectors, matrices and tensors. 
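To make the idea of model training concrete, here is a minimal sketch that uses linear algebra to find the best parameters of a straight line. The toy data and the variable names are assumptions made up for illustration; the only library call is NumPy's least-squares solver.

```python
import numpy as np

# Hypothetical toy data generated from y = 2x + 1 (illustration only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix: one column for the feature and one column of ones for the intercept
X = np.column_stack([x, np.ones_like(x)])

# "Training" here means finding the parameters that minimize the squared error
params, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
slope, intercept = params

print(slope, intercept)
```

The data is stored as an array, the model is a vector of parameters, and the fit itself is a matrix computation, which is the sense in which linear algebra represents both the data and the computation.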
<h2 id="heading-common-areas-of-application-linear-algebra-in-action">Common Areas of Application — Linear Algebra in Action</h2> Source: <a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml">https://www.wiplane.com/p/foundations-for-data-science-ml</a> In the ML context, all major phases of developing a model have linear algebra running behind the scenes. Important areas of application that are enabled by linear algebra are: <ul> <li>data and learned model representation </li> <li>word embeddings </li> <li>dimensionality reduction </li> </ul> <h3 id="heading-data-representation">Data Representation</h3> The fuel of ML models, that is data, needs to be converted into arrays before you can feed it into your models. The computations performed on these arrays include operations like matrix multiplication (dot product). This further returns the output that is also represented as a transformed matrix/tensor of numbers. <h3 id="heading-word-embeddings">Word embeddings</h3> Don’t worry about the terminology here – it is just about representing high-dimensional data (think of a huge number of variables in your data) with a lower-dimensional vector. Natural Language Processing (NLP) deals with textual data. Dealing with text means comprehending the meaning of a large corpus of words. Each word represents a different meaning which might be similar to another word. Vector embeddings in linear algebra allow us to represent these words more efficiently. <h3 id="heading-eigenvectors-svd">Eigenvectors (SVD)</h3> Finally, concepts like eigenvectors allow us to reduce the number of features or dimensions of the data while keeping the essence of all of them using something called principal component analysis. 
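As a rough sketch of how eigenvectors reduce dimensions, the snippet below runs principal component analysis by hand on a tiny, made-up dataset. The numbers and variable names are assumptions chosen for illustration, and real projects would more likely reach for a library implementation.

```python
import numpy as np

# Hypothetical toy data: 5 samples with 2 strongly correlated features
data = np.array([
    [2.0, 1.9],
    [1.0, 1.1],
    [3.0, 3.2],
    [4.0, 3.9],
    [5.0, 5.1],
])

# Center the data, then compute the 2x2 covariance matrix of the features
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# The last eigenvector points along the direction of largest variance
top_component = eigenvectors[:, -1]

# Project every 2-D sample onto that single direction (2 features become 1)
projected = centered @ top_component

# Share of the total variance kept by the one remaining dimension
explained = eigenvalues[-1] / eigenvalues.sum()
print(projected.shape, explained)
```

Because the two features move together, one new feature (a linear combination of the originals) keeps almost all of the variance.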
<h2 id="heading-from-data-to-vectors">From Data to Vectors</h2> Source: <a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml">https://www.wiplane.com/p/foundations-for-data-science-ml</a> Linear algebra basically deals with vectors and matrices (different shapes of arrays) and operations on these arrays. In NumPy, a vector is basically a 1-dimensional array of numbers, but geometrically it has both magnitude and direction. Source: <a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml">https://www.wiplane.com/p/foundations-for-data-science-ml</a> Our data can be represented using a vector. In the figure above, one row in this data is represented by a feature vector which has 3 elements or components representing 3 different dimensions. A vector with n entries belongs to an n-dimensional vector space, and in this case we can see 3 dimensions. <h2 id="heading-deep-learning-tensors-flowing-through-a-neural-network">Deep Learning — Tensors Flowing Through a Neural Network</h2> We can see linear algebra in action across all the major applications today. Examples include sentiment analysis on a LinkedIn or a Twitter post (embeddings), detecting a type of lung infection from X-ray images (computer vision), or any speech-to-text bot (NLP). All of these data types are represented by numbers in tensors. We run vectorized operations to learn patterns from them using a neural network. It then outputs a processed tensor which in turn is decoded to produce the final inference of the model. Each phase performs mathematical operations on those data arrays. 
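The flow described above can be sketched with plain NumPy arrays. This is only an illustration under stated assumptions: the layer sizes are arbitrary, the weights are random rather than trained, and the names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input tensor: a batch of 4 examples with 3 features each
batch = rng.normal(size=(4, 3))

# Randomly initialized (untrained) weights for a tiny two-layer network
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

# Layer 1: matrix multiplication followed by a ReLU non-linearity
hidden = np.maximum(0, batch @ W1 + b1)

# Layer 2: another matrix multiplication gives one score per class
logits = hidden @ W2 + b2

# Softmax turns the scores into probabilities (shifted for numerical stability)
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Decode the output tensor into a final prediction per example
predictions = probs.argmax(axis=1)
print(probs.shape, predictions)
```

Every step is an operation on arrays: the input, the weights, the intermediate activations, and the output are all tensors.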
<h2 id="heading-dimensionality-reduction-vector-space-transformation">Dimensionality Reduction — Vector Space Transformation</h2> Source: <a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml">https://www.wiplane.com/p/foundations-for-data-science-ml</a> When it comes to embeddings, you can basically think of an n-dimensional vector being replaced with another vector that belongs to a lower-dimensional space. The lower-dimensional vector is more meaningful and it reduces the computational cost. For example, here a 3-dimensional vector is replaced by a vector in a 2-dimensional space. But you can extrapolate it to a real-world scenario where you have a very large number of dimensions. Reducing dimensions doesn’t mean dropping features from the data. Instead, it's about finding new features that are linear functions of the original features and preserving the variance of the original features. Finding these new variables (features) translates to finding the principal components (PCs). This in turn comes down to solving an eigenvector and eigenvalue problem. <h3 id="heading-recommendation-engines-making-use-of-embeddings">Recommendation Engines — Making use of embeddings</h3> You can think of an embedding as a 2D plane being embedded in a 3D space, and that’s where this term comes from. You can think of the ground you are standing on as a 2D plane that is embedded into this space in which you live. Just to give you a real-world use case to relate to all of this discussion on vector embeddings, all applications that are giving you personalized recommendations are using vector embedding in some form. For example, the above is a graphic from Google’s course on recommendation systems where we are given this data on different users and their preferred movies. Some users are kids and others are adults, some movies are all-time classics while others are more artistic. 
Some movies are targeted towards a younger audience while movies like Memento are preferred by adults. Now, we not only need to represent this information in numbers but also need to find new lower-dimensional vector representations that capture all these features well. A very quick way to understand how we can pull off this task is by understanding something called Matrix Factorization, which allows us to break a large matrix down into smaller matrices. Ignore the numbers and colors for now and just try to understand how we have broken down one big matrix into two smaller ones. For example, here this 4×5 matrix (4 rows and 5 features) was broken down into two matrices, one that's 4×2 and the other that's 2×5. We basically have new lower-dimensional vectors for users and movies. And this allows us to plot them in a 2D vector space. Here you’ll see that user #1 and the movie Harry Potter are closer and user #3 and the movie Shrek are closer. The concept of a dot product (matrix multiplication) of vectors tells us more about the similarity of two vectors. And it has applications in correlation/covariance calculation, linear regression, logistic regression, PCA, convolutions, PageRank and numerous other algorithms. <h3 id="heading-industries-where-linear-algebra-is-used-heavily">Industries where Linear Algebra is used heavily</h3> By now, I hope you are convinced that linear algebra is driving the ML initiatives in a host of areas today. If not, here is a list to name a few: <ul> <li>Statistics </li> <li>Chemical Physics </li> <li>Genomics </li> <li>Word Embeddings — neural networks/deep learning </li> <li>Robotics </li> <li>Image Processing </li> <li>Quantum Physics </li> </ul> <h2 id="heading-how-much-linear-algebra-should-you-know-to-get-started-with-ml-dl">How much Linear Algebra should you know to get started with ML / DL?</h2> Now, the important question is how you can learn to program these concepts of linear algebra. 
The answer is that you don’t have to reinvent the wheel. You just need to understand the basics of vector algebra computationally, and then learn to program those concepts using NumPy. NumPy is a scientific computation package that gives us access to all the underlying concepts of linear algebra. It is fast as it runs compiled C code and it has a large number of mathematical and scientific functions that we can use. <h3 id="heading-recommended-resources">Recommended resources</h3> <ul> <li><a target="_blank" href="https://www.youtube.com/watch?v=kjBOesZCoqc&list=PL0-GT3co4r2y2YErbmuJw2L5tW4Ew2O5B">Playlist on Linear Algebra by 3Blue1Brown</a> — very engaging visualizations that explain the essence of linear algebra and its applications. Might be a little too hard for beginners. </li> <li><a target="_blank" href="https://www.deeplearningbook.org/">Book on Deep Learning by Ian Goodfellow & Yoshua Bengio</a> — a fantastic resource for learning ML and applied math. Give it a read, though some folks may find it too technical and notation-heavy to begin with. </li> </ul> <a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml">Foundations of Data Science & ML —</a> I have created a course that gives you enough understanding of Programming, Math (Basic Algebra, Linear Algebra & Calculus) and Statistics. A complete package for first steps to learning DS/ML. 👉 You can use the code <code>FREECODECAMP10</code> to get 10% off. Check out the course outline here: <div class="embed-wrapper"> </div> </article> <article> <h1> Programming, Math, and Statistics You Need to Know for Data Science and Machine Learning </h1> Harshit Tyagi — Fri, 20 Aug 2021 03:21:20 +0000 At the start of this year, I published a mind map on the <a target="_blank" href="https://www.freecodecamp.org/news/data-science-learning-roadmap/">Data Science learning roadmap (shown below)</a>. 
Many people found the roadmap useful, my article got translated into different languages, and a large number of folks thanked me for publishing it. Everything was good until a few developers pointed out that there are too many resources and many of them are expensive. Python programming was the only branch that had a number of really good courses but it ends right there for beginners. A few important questions on foundational data science struck me: <ul> <li>What should you do after learning how to code? Are there topics that help you strengthen your foundations for data science? </li> <li>What if you hate math and the tutorials out there are either too basic or too deep? Could I recommend a compact yet comprehensive course on Math and Statistics? </li> <li>How much Math is enough to start learning how ML algorithms work? </li> <li>What are some essential statistics topics to get started with data analysis or data science? </li> </ul> You can find answers to a lot of these questions in the book <a target="_blank" href="https://www.deeplearningbook.org/">Deep Learning</a> by Ian Goodfellow and Yoshua Bengio. But that book is a bit too technical and math heavy for many. So in this article, I'll lay out some of the first steps you should take to learn Data Science or Machine Learning. <h2 id="heading-the-three-pillars-of-data-science-and-machine-learning">The Three Pillars of Data Science and Machine Learning</h2> Source: <a target="_blank" href="wiplane.com">wiplane.com</a> If you go through the prerequisites or pre-work of any ML/DS course, you’ll find a combination of programming, math, and statistics. 
Here is what <a target="_blank" href="https://developers.google.com/machine-learning/crash-course/prereqs-and-prework">Google recommends</a> that you do before taking an ML course: Google's recommended Python skills for Data Science and Machine Learning Google's recommended Math and Statistics skills for ML and DS (<a target="_blank" href="https://developers.google.com/machine-learning/crash-course/prereqs-and-prework">Source</a>) Let's go through these essential skills in a bit more detail to see what you need to learn to get into Data Science and Machine Learning. <h2 id="heading-essential-programming-skills-for-data-science-and-machine-learning">Essential Programming Skills for Data Science and Machine Learning</h2> Most data roles are programming-based, except for a few like business intelligence, market analysis, product analyst, and others. I am going to focus on technical data jobs that require expertise in at least one programming language. I personally prefer Python over any other language because of its versatility and how relatively easy it is to learn. It is, hands down, a good pick for developing end-to-end projects. <h3 id="heading-topics-and-libraries-to-know-for-data-science">Topics and libraries to know for data science:</h3> <ul> <li>Common data structures (data types, lists, dictionaries, sets, tuples), writing functions, logic, control flow, searching and sorting algorithms, object-oriented programming, and working with external libraries. </li> <li>Writing Python scripts to extract, format, and store data into files or back to databases. </li> <li>Handling multi-dimensional arrays, indexing, slicing, transposing, broadcasting and pseudorandom number generation using NumPy. </li> <li>Performing vectorized operations using scientific computing libraries like NumPy. </li> <li>Manipulating data with Pandas — series, dataframe, indexing in a dataframe, comparison operators, merging dataframes, mapping and applying functions. 
</li> <li>Wrangling data using Pandas — checking for null values, imputing them, grouping data, describing it, performing exploratory analysis, and so on. </li> <li>Data Visualization using Matplotlib — the API hierarchy, how to add styles, color, and markers to a plot, knowledge of various plots and when to use them, line plots, bar plots, scatter plots, histograms, boxplots, and Seaborn for more advanced plotting. </li> </ul> <h2 id="heading-essential-mathematics-for-data-science-and-machine-learning">Essential Mathematics for Data Science and Machine Learning</h2> There are <a target="_blank" href="https://towardsdatascience.com/practical-reasons-to-learn-mathematics-for-data-science-1f6caec161ea">practical reasons why math is essential</a> for folks who want a career as an ML practitioner, Data Scientist, or a Deep Learning Engineer. <h3 id="heading-youll-use-linear-algebra-to-represent-data">You'll Use Linear Algebra to Represent Data</h3> An image from the lecture on Vector Norms (<a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml">from this course</a>) ML is inherently data-driven. Data is at the heart of machine learning. We can think of data as vectors — objects that adhere to arithmetic rules. This leads us to understand how rules of linear algebra operate over arrays of data. <h3 id="heading-youll-use-calculus-to-train-ml-models">You'll Use Calculus to Train ML Models</h3> Model training doesn't happen “automatically”. Calculus drives the learning of most ML and DL algorithms. One of the most commonly used optimization algorithms — gradient descent — is an application of partial derivatives. A model is a mathematical representation of certain beliefs and assumptions. It learns (approximately) the process (linear, polynomial, and so on) that generated the data in the first place. It then makes predictions based on that learned process. 
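To see partial derivatives at work, here is a minimal sketch of gradient descent fitting a straight line. The toy data, learning rate, and number of steps are assumptions picked for illustration, not tuned values.

```python
import numpy as np

# Hypothetical toy data generated from y = 3x + 2 (illustration only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

# Model: y_hat = w * x + b, starting from arbitrary parameters
w, b = 0.0, 0.0
learning_rate = 0.05

for _ in range(2000):
    error = (w * x + b) - y
    # Partial derivatives of the mean squared error with respect to w and b
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient to reduce the error
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)
```

Each iteration asks how the error changes when `w` or `b` changes slightly, and nudges both parameters in the direction that lowers it.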
<h3 id="heading-important-math-topics-to-know-for-data-science-and-machine-learning">Important Math Topics to Know for Data Science and Machine Learning:</h3> <ul> <li>Basic algebra — variables, coefficients, equations, functions — linear, exponential, logarithmic, and so on. </li> <li>Linear Algebra — scalars, vectors, tensors, Norms (L1 & L2), dot product, types of matrices, linear transformation, representing linear equations in matrix notation, solving linear regression problem using vectors and matrices. </li> <li>Calculus — derivatives and limits, derivative rules, chain rule (for backpropagation algorithm), partial derivatives (to compute gradients), convexity of functions, local/global minima, math behind a regression model, applied math for training a model from scratch. </li> </ul> <h2 id="heading-essential-statistics-for-data-science-and-machine-learning">Essential Statistics for Data Science and Machine Learning</h2> Every organisation today is striving to become data-driven. To achieve that, data analysts and scientists need to put their data to use in different ways in order to drive their decision making. <h3 id="heading-how-to-describe-data-from-data-to-insights">How to describe data — from data to insights</h3> Data always comes in raw and ugly. The initial exploration tells you what’s missing, how the data is distributed, and what’s the best way to clean it to meet the end goal. In order to answer the defined questions, descriptive statistics enables you to transform each observation in your data into insights that make sense. <h3 id="heading-how-to-quantify-uncertainty">How to quantify uncertainty</h3> You also need to be able to quantify uncertainty, and this is an extremely valuable skill that is highly regarded at any data company. Knowing the chances of success in any experiment/decision is critical for all businesses. 
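As one small illustration of quantifying uncertainty, the sketch below builds an approximate 95% confidence interval for a success rate using the normal approximation. The experiment counts are made up for this example, and the approximation is only reasonable when the sample is fairly large.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical experiment: 120 successes out of 1000 trials
successes, trials = 120, 1000
p_hat = successes / trials

# z value for a two-sided 95% interval (roughly 1.96)
z = NormalDist().inv_cdf(0.975)

# Standard error of a proportion under the normal approximation
standard_error = sqrt(p_hat * (1 - p_hat) / trials)

lower = p_hat - z * standard_error
upper = p_hat + z * standard_error
print(f"estimated rate {p_hat:.3f}, 95% interval ({lower:.3f}, {upper:.3f})")
```

Reporting the interval alongside the point estimate tells decision makers how much the observed rate could move if the experiment were repeated.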
<h3 id="heading-basic-statistics-to-know-for-data-science-and-machine-learning">Basic statistics to know for Data Science and Machine Learning:</h3> <ul> <li>Estimates of location — mean, median and other variants of these. </li> <li>Estimates of variability </li> <li>Correlation and covariance </li> <li>Random variables — discrete and continuous </li> <li>Data distributions — PMF, PDF, CDF </li> <li>Conditional probability — Bayesian statistics </li> <li>Commonly used statistical distributions — Gaussian, Binomial, Poisson, Exponential. </li> <li>Important theorems — Law of large numbers and Central limit theorem. </li> </ul> Image from the lecture on Poisson distribution (<a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml">from this course</a>) <ul> <li>Inferential Statistics — A more practical and advanced branch of statistics that helps in designing hypothesis testing experiments, pushes us to understand the meaning of metrics deeply and at the same time helps us in quantifying the significance of the results. </li> <li>Important tests — Student’s t-Test, Chi-Square test, ANOVA test, and so on. </li> </ul> And there you have it. Every beginner-level data science enthusiast should focus on these three pillars before diving into any core data science or ML courses. <h2 id="heading-resources-to-learn-data-science-and-machine-learning-fundamentals">Resources to Learn Data Science and Machine Learning Fundamentals</h2> <a target="_blank" href="https://www.freecodecamp.org/news/data-science-learning-roadmap/">https://www.freecodecamp.org/news/data-science-learning-roadmap/</a> <a target="_blank" href="https://towardsdatascience.com/data-science-learning-roadmap-for-2021-84f2ba09a44f">My learning roadmap</a> also told you what to learn, and it was loaded up with resources, courses, and programs that you can take to learn those skills. But there are a few inconsistencies in the recommended resources and the roadmap that I had charted out. 
And many people were searching for a compact, comprehensive, yet affordable course. <h3 id="heading-problems-with-data-science-or-ml-courses">Problems with Data Science or ML Courses</h3> <ol> <li>Every data science course that I recommended in that article required that you have a decent understanding of Programming, Math, or Statistics. For example, <a target="_blank" href="https://www.youtube.com/watch?v=PPLop4L2eGk&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN">the most famous course on ML by Andrew Ng</a> also relies heavily on students' understanding of vector algebra and calculus. </li> <li>Most courses that cover Math and Statistics for Data Science are just a checklist of concepts required for DS/ML with no explanation on how they are applied and how they are programmed into a machine. </li> <li>There are exceptional resources to dive deep into Math but most of us are not made for it and you don't need to be a gold medalist in math to learn data science. </li> </ol> Bottom line: a resource that covers just enough applied math or statistics or programming to get started with data science or ML is missing. <h3 id="heading-wiplane-academy-wiplanecom">Wiplane Academy — wiplane.com</h3> So, I decided to give in and do it all myself. I have spent the last 3 months developing a curriculum that will provide a solid foundation for your career as a <ul> <li>Data Analyst </li> <li>Data Scientist </li> <li>ML Practitioner/Engineer </li> </ul> Hence, here I present you the <a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml">Foundations for Data Science or ML</a> — <a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml">First Steps to learn Data Science and ML</a> That's me when I decided to launch. It's a comprehensive yet compact and affordable course that not only covers all the essentials, pre-requisites, and pre-work but also explains how each concept is used computationally and programmatically (in Python). 
And that’s not it – I will keep updating the course content every month based on your input. Learn more <a target="_blank" href="https://www.wiplane.com/p/foundations-for-data-science-ml">here</a>. <h4 id="heading-early-bird-offer">Early Bird Offer!</h4> I am stoked to launch the pre-sale of this course as I am currently in the process of recording and editing the final bits of 2–3 modules (15-20 lectures). These will also be live by the first week of September. Grab the early bird offer, only valid until August 30th, 2021. </article> <article> <h1> Machine Learning Specialisation Courses for Advanced ML Practitioners </h1> Harshit Tyagi — Fri, 18 Jun 2021 15:43:56 +0000 Today, you don’t need to go to a university or a college to pursue a career in machine learning or any data-driven domain. But you do need a plan and a <a target="_blank" href="https://www.freecodecamp.org/news/data-science-learning-roadmap/">roadmap</a> to guide your studies. Once you have charted your own learning roadmap with a goal in mind, the next step is to screen the right set of courses that fit well into your roadmap. Then you can start building your foundation around those courses. In this article, I wanted to share a few advanced-level specializations and courses that are on my list and that can help you with your search for the right course or the right career track. So, here you go: <h2 id="heading-machine-learning-in-production-specialization-by-andrew-nghttpswwwdeeplearningaiprogrammachine-learning-engineering-for-production-mlops"><a target="_blank" href="https://www.deeplearning.ai/program/machine-learning-engineering-for-production-mlops/">Machine Learning in Production Specialization by Andrew Ng</a></h2> Discussions on topics like ML engineering and MLOps are helping to standardize practices that productionize ML models in the form of tools and techniques. And these novel ML engineering methods are not very well documented and aren't widely taught. 
With this specialization, thanks to a few pioneers of the field, you learn how to write production-ready, ML-powered applications. The specialization will help you: <ul> <li>Design an ML production system end-to-end </li> <li>Build data pipelines by gathering, cleaning, and validating datasets. </li> <li>Establish a model baseline and continuously improve a productionized ML application. </li> <li>Apply best practices and progressive delivery techniques to maintain and monitor a continuously operating production system. </li> </ul> This is a 5-month long specialization with 4 courses targeting every phase of ML Engineering. You can go to individual courses on <a target="_blank" href="https://www.coursera.org/specializations/machine-learning-engineering-for-production-mlops?">Coursera</a> and audit them to have free access. Andrew has done a remarkable job in democratizing Machine Learning for everyone, and this specialization might be able to do the same for ML Engineering. I’ll write a detailed review on this course once I complete the specialization. Let me know how you find it if you finish it before I do. <h2 id="heading-course-on-transformers-from-hugging-facehttpshuggingfacecocoursechapter1"><a target="_blank" href="https://huggingface.co/course/chapter1">Course on Transformers from Hugging Face🤗</a></h2> For all you deep learning enthusiasts, there is a new course on transformers from Hugging Face. It looks very promising as I really enjoyed going through <a target="_blank" href="https://course.fast.ai/">Practical Deep Learning for Coders</a> from the same author (Sylvain Gugger). Anyone who is looking to dive deeper into training NLP models should definitely check this course out. The course curriculum covers: <ul> <li>Introduction to transformers, fine-tuning pre-trained models, and sharing the models and tokenizers. </li> <li>The Datasets and Tokenizers libraries. 
</li> <li>How to develop specialized architectures, speed up training, and write custom training loops. </li> </ul> The amazing part is that the course is absolutely free. Do it at your own pace and try to build a complete application instead of just a trained model. The best way to learn something is by using it in your own project to solve your own problems. <h2 id="heading-mlops-series-from-made-with-ml-by-goku-mohandashttpsmadewithmlcommlops"><a target="_blank" href="https://madewithml.com/#mlops">MLOps Series from Made with ML by Goku Mohandas</a></h2> I discovered Goku’s content a month ago and I am amazed by the amount of effort he is putting into developing this course on <a target="_blank" href="https://towardsdatascience.com/what-is-mlops-everything-you-must-know-to-get-started-523f2d0b8bd8">MLOps</a>. It takes a lot of work to actually hone your skills for each phase of MLOps because there are so many moving parts in an actual product (as compared to a model). Goku has divided the course into a number of subsections: <ul> <li>Product </li> <li>Data </li> <li>Modeling </li> <li>Scripting </li> <li>APIs and interfaces </li> <li>Testing & Reproducibility </li> <li>Production Systems </li> </ul> I highly recommend following his course – it is absolutely free and you can interact with like-minded people by being a part of the community. Note: A common mistake beginners (myself included) make is that they keep hopping from one course to another without actually completing the deliverables of any single one. This is fine if you are exploring, but when you set your sights on a course, try to build something out of it rather than simply following the lectures and doing the spoon-fed exercises. 
<h2 id="heading-interesting-read-of-the-week-a-project-of-ones-ownhttppaulgrahamcomownhtml"><a target="_blank" href="http://paulgraham.com/own.html">Interesting read of the week - A Project of One’s Own</a></h2> Last week I read the essay A Project of One’s Own by Paul Graham (the co-founder of Y Combinator). It not only validated but strengthened my notions about working on your own projects voluntarily, even if it is a side gig. It is an excellent essay on the importance of working on ambitious projects that you choose yourself. Working on projects of your own, with complete control and by choice, gives you a sense of freedom and sparks your curiosity to dive deeper. It teaches you a lot more than any job that someone else has told you to do. Thanks for reading! If this tutorial was helpful, you should check out my data science and machine learning courses on <a target="_blank" href="https://www.wiplane.com/">Wiplane Academy</a>. They are comprehensive yet compact and help you build a solid foundation of work to showcase. </article> <article> <h1> Machine Learning Systems Book Recommendations – Learn How to Build and Understand ML Systems </h1> Harshit Tyagi — Fri, 07 May 2021 21:07:57 +0000 <blockquote> “Good friends, good books, and a sleepy conscience: this is the ideal life.” ― Mark Twain </blockquote> I hope you’re reading this blog in your pjs looking forward to a rejuvenating and healthy weekend. I have been working on multiple projects lately, from creating Machine Learning Engineering and Machine Learning Operations courses to developing end-to-end ML systems at scale. And I have realized that I am often either revisiting a book that I’ve read or referring to a book that I just skimmed through but never got the chance to really read. This week, I want to share with you the books that I personally feel every ML enthusiast and practitioner should read to get a sense of the breadth of ideas and depth of this field. 
It is a short and crisp list covering a majority of ML topics. It should be useful both for beginners getting started and intermediate-level professionals wanting to understand the intricacies of engineering successful ML systems. So, here we go... <h2 id="heading-1-hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow-2nd-editionhttpslearningoreillycomlibraryviewhands-on-machine-learning9781492032632"><a target="_blank" href="https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/">#1 — Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, 2nd Edition</a></h2> By Aurélien Géron This book is simply a work of art. I highly recommend not just reading this book but also coding along with the author. The book is divided into two parts — the first part is focused on the fundamentals of machine learning and covers all the major classic ML algorithms. It has just the right amount of mathematical explanation and Python code to actually start developing models. The second part focuses on neural networks and deep learning. I have read the whole book, and I went through a few chapters two or three times in order to get the concepts right and do the exercises. Reading tip for this one: Spend 2–3 days (or more if needed) with each chapter if you’re spending 2–3 hours a day learning actively. <h2 id="heading-2-machine-learning-engineeringhttpwwwmlebookcomwikidokuphp"><a target="_blank" href="http://www.mlebook.com/wiki/doku.php">#2 — Machine Learning Engineering</a></h2> By Andriy Burkov Andriy has done it again. This book explains each phase of the ML Systems Lifecycle and is a complete and concise resource for anyone who intends to build scalable ML-powered applications. The book is a compilation of engineering challenges and best practices to make ML work in production. Andriy explains how you should look to plan a project, why projects might fail, and how to approach every step. 
Here are the sections in this book: <ul> <li><a target="_blank" href="http://bit.ly/MLEbook-Chapter2">Before the Project Starts</a> </li> <li><a target="_blank" href="http://bit.ly/MLEbook-Chapter3">Data Collection and Preparation</a> </li> <li><a target="_blank" href="http://bit.ly/MLEbook-Chapter4">Feature Engineering</a> </li> <li><a target="_blank" href="http://bit.ly/MLEbook-Chapter5">Supervised Model Training</a> </li> <li><a target="_blank" href="http://bit.ly/MLEbook-Chapter7">Model Evaluation</a> </li> <li><a target="_blank" href="http://bit.ly/MLEbook-Chapter8">Model Deployment</a> </li> <li><a target="_blank" href="http://bit.ly/MLEbook-Chapter9">Model Serving, Monitoring, and Maintenance</a>. </li> </ul> His first book, The Hundred-Page Machine Learning Book, was a great success and the same can be said about this one as well. <h2 id="heading-3-practical-deep-learning-for-cloud-mobile-and-edgehttpslearningoreillycomlibraryviewpractical-deep-learning9781492034858"><a target="_blank" href="https://learning.oreilly.com/library/view/practical-deep-learning/9781492034858/">#3 — Practical Deep Learning for Cloud, Mobile, and Edge</a></h2> By Anirudh Koul, Siddha Ganju, and Meher Kasam The book follows the practical advice that you should learn by doing. It's a hands-on guide to building Deep Learning applications for the cloud, mobile browsers, and edge devices. I am currently reading this book and I am surprised that I didn’t stumble upon it before. Every chapter helps you build an application end-to-end. Each application targets a subdomain of deep learning, a different serving method, or techniques to optimize experimentation using TensorFlow. It's a must-read for people already familiar with deep learning. This book helps you dive deeper and learn by building a set of cool projects. 
<h2 id="heading-4-building-machine-learning-pipelineshttpslearningoreillycomlibraryviewbuilding-machine-learning9781492053187"><a target="_blank" href="https://learning.oreilly.com/library/view/building-machine-learning/9781492053187/">#4 — Building Machine Learning Pipelines</a></h2> By Hannes Hapke and Catherine Nelson After reading a number of case studies on how organizations like Spotify and Airbnb are using TF Extended to improve their ML platforms, I started learning about TFX. It can really help you optimize the development of end-to-end pipelines. The book explains techniques to set up ML pipelines right through from data ingestion to pipeline orchestration using Airflow or Kubeflow. TFX along with TF offers tools for every step of the process. This is an advanced-level read, and you should indulge only after you are done reading the top two recommendations. <h2 id="heading-interesting-read-of-the-week">Interesting Read of the Week</h2> This is a slightly unusual recommendation compared to what I usually write about. But I couldn’t resist sharing it with you because of the sheer quality of the work here. Do you understand how an internal combustion engine works? How all of these parts come together to power your vehicles and machines? Well, I have come across the best possible explanation of the functionality of all the basic engine parts. <h3 id="heading-read-the-article-internal-combustion-engine-by-bartosz-ciechanowskihttpsciechanowskiinternal-combustion-engine"><a target="_blank" href="https://ciechanow.ski/internal-combustion-engine/">Read the article: Internal Combustion Engine by Bartosz Ciechanowski</a></h3> The beautifully designed, 360 degree, in-action illustrations along with the explanation not only help you understand combustion engines, but for my part it definitely inspired me to work harder on my art. This made me question whether our education system is doing enough to inspire us or are they just getting away with “teaching” us. 
<h3 id="heading-thanks-for-reading">Thanks for reading!</h3> That’s it for this week. I don’t want to bog you down with a plethora of random ML books. Feel free to reach out if you have any thoughts, recommendations, or questions. If this tutorial was helpful, you should check out my data science and machine learning courses on <a target="_blank" href="https://www.wiplane.com/">Wiplane Academy</a>. They are comprehensive yet compact and help you build a solid foundation of work to showcase. </article> <article> <h1> What is MLOps? Machine Learning Operations Explained </h1> Harshit Tyagi — Fri, 26 Mar 2021 23:35:24 +0000 In this article, I'll teach you about Machine Learning Operations, which is like DevOps for Machine Learning. Until recently, all of us were learning about the standard software development lifecycle (SDLC). It goes from requirement elicitation to designing to development to testing to deployment, and all the way down to maintenance. We were (and still are) studying the waterfall model, iterative model, and agile models of software development. Now, we are at a stage where almost every organisation is trying to incorporate Machine Learning (ML) – often called Artificial Intelligence – into their product. This new requirement of building ML systems adds to and reforms some principles of the SDLC, giving rise to a new engineering discipline called Machine Learning Operations, or MLOps. And this new term is creating a buzz and has given rise to new job profiles. Here we’ll talk about: <ul> <li>What is MLOps? </li> <li>What problems does MLOps solve? </li> <li>What skills do you need for MLOps? </li> </ul> Keep reading and I'll explain each in detail. <h2 id="heading-what-is-mlops">What is MLOps?</h2> If you look MLOps up on Google Trends, you'll see that it is a relatively new discipline. It has emerged because more organizations are trying to integrate ML systems into their products and platforms. 
<h3 id="heading-heres-how-id-define-mlops">Here’s how I’d define MLOps:</h3> MLOps is an engineering discipline that aims to unify ML systems development (dev) and ML systems deployment (ops) in order to standardize and streamline the continuous delivery of high-performing models in production. <h3 id="heading-why-mlops">Why MLOps?</h3> Until recently, we were dealing with manageable amounts of data and a very small number of models at a small scale. That is changing now, and we are embedding decision automation in a wide range of applications. This generates a lot of technical challenges that come from building and deploying ML-based systems. In order to understand MLOps, we must first understand the ML systems lifecycle. The lifecycle involves several different teams of a data-driven organization. From start to finish, the following teams chip in: <ul> <li>Business development or Product team — defining business objective(s) with KPIs. </li> <li>Data Engineering — data acquisition and preparation. </li> <li>Data Science — architecting ML solutions and developing models. </li> <li>IT or DevOps — complete deployment setup, monitoring alongside scientists. </li> </ul> Here is a very simplified representation of the ML lifecycle. Teams at Google have been doing a lot of research on the technical challenges that come with building ML-based systems. A NeurIPS paper on <a target="_blank" href="https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf">Hidden Technical Debt in Machine Learning Systems</a> shows that developing models is just a very small part of the whole process. There are many other processes, configurations, and tools that have to be integrated into the system. To streamline this entire system, we have this new Machine Learning engineering culture. The system involves everyone from higher management with minimal technical skills to Data Scientists to DevOps and ML Engineers. 
<h2 id="heading-what-problems-does-mlops-solve">What Problems Does MLOps Solve?</h2> Managing such systems at scale is not an easy task, and there are numerous bottlenecks that need to be taken care of. Following are the major challenges that teams are up against: <ul> <li>There is a shortage of Data Scientists who are good at developing and deploying scalable web applications. There is a new profile of ML Engineers on the market these days that aims to serve this need. It is a sweet spot at the intersection of Data Science and DevOps. </li> <li>Changing business objectives in the model —There are many dependencies with the data continuously changing, maintaining performance standards of the model, and ensuring AI governance. It’s hard to keep up with the continuous model training and evolving business objectives. </li> <li>Communication gaps between technical and business teams with a hard-to-find common language to collaborate. Most often, this gap becomes the reason that big projects fail. </li> <li>Risk assessment — there is a lot of debate going on around the black-box nature of such ML/DL systems. Often models tend to drift away from what they were initially intended to do. Assessing the risk/cost of such failures is a very important and meticulous step. For example, the cost of an inaccurate video recommendation on YouTube would be much lower compared to flagging an innocent person for fraud and blocking their account, and declining their loan applications. </li> </ul> <h2 id="heading-what-skills-do-you-need-for-mlops">What Skills Do You Need for MLOps?</h2> At this point, I’ve already given a lot of insights into the bottlenecks of the system and how MLOps solves each of those. You can discover the skills you need to target from those challenges. Following are the key skills you need to focus on: <h3 id="heading-1-framing-ml-problems-from-business-objectives">1. 
Framing ML problems from business objectives</h3> Machine learning systems development typically starts with a business goal or objective. It can be a simple goal of reducing the percentage of fraudulent transactions below 0.5%, or it can be building a system to detect skin cancer in images labeled by dermatologists. These objectives often have certain performance measures, technical requirements, budgets for the project, and KPIs (Key Performance Indicators) that drive the process of monitoring the deployed models. <h3 id="heading-2-architect-ml-and-data-solutions-for-the-problem">2. Architect ML and data solutions for the problem</h3> After the objectives are clearly translated into ML problems, the next step is to start searching for appropriate input data and the kinds of models to try for that kind of data. Searching for data is one of the most strenuous tasks. It is a process with several parts: <ul> <li>You need to look for any available relevant dataset, </li> <li>Check the credibility of the data and its source. </li> <li>Is the data source compliant with regulations like GDPR? </li> <li>How to make the dataset accessible? </li> <li>What is the type of source — static (files) or real-time streaming (sensors)? </li> <li>How many sources are to be used? </li> <li>How to build a data pipeline that can drive both training and optimization once the model is deployed in the production environment? </li> <li>What cloud services will you use? </li> </ul> <h3 id="heading-3-data-preparation-and-processing-part-of-data-engineering">3. Data preparation and processing — part of data engineering.</h3> Data preparation includes tasks like feature engineering, cleaning (formatting, checking for outliers, imputations, rebalancing, and so on), and then selecting the set of features that contribute to the output of the underlying problem. 
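To make two of these cleaning tasks concrete, here is a minimal sketch in plain Python of median imputation and a simple outlier check based on the interquartile range (IQR). The `amounts` column and both helper functions are hypothetical and only illustrate the idea; a real pipeline would usually rely on a data library and on rules agreed with the domain experts.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical transaction amounts with two missing entries and one extreme value.
amounts = [12.0, 15.5, None, 14.0, 13.5, 980.0, None, 16.0]

cleaned = impute_median(amounts)   # missing values filled with the median
flagged = iqr_outliers(cleaned)    # candidates to inspect, not to drop blindly
print(cleaned)
print(flagged)
```

Whether a flagged value is an error or a genuine (and important) observation is a judgment call, which is why the sketch only reports candidates instead of removing them.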
You need to design a complete pipeline and then code it to produce clean and compatible data that'll be fed to the next phase of model development. An important part of deploying such pipelines is to choose the right combination of cloud services and architecture that is performant and cost-effective. For example, if you have a lot of data movement and huge amounts of data to store, you can look to build data lakes using AWS S3 and AWS Glue. You might want to practice building a few different kinds of pipelines (Batch vs Streaming) and try to deploy those pipelines on the cloud. <h3 id="heading-4-model-training-and-experimentation-data-science">4. Model training and experimentation — data science</h3> As soon as your data is prepared, you move on to the next step of training your ML model. Now, the initial phase of training is iterative with a bunch of different types of models. You will be narrowing down to the best solution using several quantitative measures like accuracy, precision, recall, and more. You can also use qualitative analysis of the model which accounts for the mathematics that drives that model or, simply put, the explainability of the model. I have this complete list of tasks that you can read on training ML models: <div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://towardsdatascience.com/task-cheatsheet-for-almost-every-machine-learning-project-d0946861c6d0">https://towardsdatascience.com/task-cheatsheet-for-almost-every-machine-learning-project-d0946861c6d0</a></div> Now, you’ll be running a lot of experiments with different types of data and parameters. Another challenge that data scientists face while training models is reproducibility. This can be solved by versioning your models and data. You can add version control to all the components of your ML systems (mainly data and models) along with the parameters. 
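As a rough illustration of what versioning data together with parameters means, the sketch below derives a short ID from the dataset's bytes and the training parameters, so that two runs share an ID only when both are identical. The `fingerprint` helper, the sample data, and the parameter names are all hypothetical, and this is only the underlying idea, not the API of any particular tool.

```python
import hashlib
import json


def fingerprint(data_bytes, params):
    """Return a short, deterministic ID for one (data, parameters) combination."""
    digest = hashlib.sha256()
    digest.update(data_bytes)
    # sort_keys makes the JSON text independent of dict insertion order.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]


# Hypothetical dataset snapshots and training parameters.
data_v1 = b"age,income,label\n34,52000,0\n45,81000,1\n"
data_v2 = data_v1 + b"29,39000,0\n"
params = {"model": "logistic_regression", "C": 1.0, "seed": 42}

run_a = fingerprint(data_v1, params)
run_b = fingerprint(data_v1, {"seed": 42, "C": 1.0, "model": "logistic_regression"})
run_c = fingerprint(data_v2, params)               # same parameters, new data
run_d = fingerprint(data_v1, {**params, "C": 0.5})  # same data, new parameter

print(run_a == run_b)  # same data and parameters give the same ID
print(run_a != run_c and run_a != run_d)
```

Keeping track of such IDs by hand for every dataset, model, and parameter set gets tedious quickly.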
This is now very easy to accomplish with the development of open-source tools like <a target="_blank" href="https://dvc.org/">DVC</a> and <a target="_blank" href="https://cml.dev/">CML</a>. Other tasks include: <ul> <li>Testing a model by writing unit tests for model training. </li> <li>Checking the model against baselines, simpler models, and across different dimensions. </li> <li>Scaling the model training using distributed systems, hardware accelerators, and scalable analysis. </li> </ul> <h3 id="heading-5-building-and-automating-ml-pipelines">5. Building and automating ML pipelines</h3> You should build your ML pipelines keeping in mind the following tasks: <ul> <li>Identify system requirements — parameters, compute needs, triggers. </li> <li>Choose an appropriate cloud architecture — hybrid or multi-cloud. </li> <li>Construct training and testing pipelines. </li> <li>Track and audit the pipeline runs. </li> <li>Perform data validation. </li> </ul> <h3 id="heading-6-deploying-models-to-the-production-system">6. Deploying models to the production system</h3> There are mainly two ways of deploying an ML model: <ul> <li>Static deployment or embedded model — where the model is packaged into installable application software and is then deployed. For example, an application that offers batch-scoring of requests. </li> <li>Dynamic deployment — where the model is deployed using a web framework like FastAPI or Flask and is offered as an API endpoint that responds to user requests. </li> </ul> Within dynamic deployment, you can use different methods: <ul> <li>deploying on a server (a virtual machine) </li> <li>deploying in a container </li> <li>serverless deployment </li> <li>model streaming — instead of REST APIs, all of the models and application code are registered on a stream processing engine like Apache Spark, Apache Storm, and Apache Flink. </li> </ul> Following are the considerations: <ul> <li>Ensuring that proper documentation and testing scores are met. 
</li> <li>Revalidating the model's accuracy. </li> <li>Performing explainability checks. </li> <li>Ensuring that all governance requirements have been met. </li> <li>Checking the quality of any data artifacts </li> <li>Load testing — compute resource usage. </li> </ul> <h3 id="heading-7-monitor-optimize-and-maintain-models">7. Monitor, optimize and maintain models</h3> Not only do you need to keep an eye on the performance of the models in production but you also need to ensure good and fair governance. Governance here means adding control measures to ensure that the models deliver on their responsibilities to all the stakeholders, employees, and users that are affected by them. As part of this phase, we need data scientists and DevOps engineers to maintain the whole system in production by performing the following tasks: <ul> <li>Keeping track of performance degradation and business quality of model predictions. </li> <li>Setting up logging strategies and establishing continuous evaluation metrics. </li> <li>Troubleshooting system failures and introduction of biases. </li> <li>Tuning the model performance in both training and serving pipelines deployed in production. </li> </ul> <h3 id="heading-further-recommended-reading">Further recommended reading</h3> This article was all about MLOps which is not a job profile but an ecosystem of several stakeholders. If you are someone who works at the crossover of ML and Software Engineering (DevOps), you might be a good fit for startups and mid-size organizations that are looking for people who can handle such systems end-to-end. ML Engineer is the position that serves this sweet spot and it's what aspiring candidates should be targeting. Following are a few resources that you can look at: <ul> <li>[Book]: Andriy Burkov’s book on <a target="_blank" href="http://www.mlebook.com/wiki/">Machine Learning Engineering</a>. 
</li> <li>[Book]: <a target="_blank" href="https://learning.oreilly.com/library/view/introducing-mlops/9781492083283/">Introducing MLOps by O’Reilly Media</a>. </li> <li>You can also aim for certification programs like the ones below: </li> </ul> <div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://cloud.google.com/certification/machine-learning-engineer">https://cloud.google.com/certification/machine-learning-engineer</a></div> <div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://aws.amazon.com/certification/certified-machine-learning-specialty/?ch=sec&sec=rmg&d=1">https://aws.amazon.com/certification/certified-machine-learning-specialty/?ch=sec&sec=rmg&d=1</a></div> You can also watch the video version of this blog here: <div class="embed-wrapper"> </div> If this tutorial was helpful, you should check out my data science and machine learning courses on <a target="_blank" href="https://www.wiplane.com/">Wiplane Academy</a>. They are comprehensive yet compact and help you build a solid foundation of work to showcase. </article> <article> <h1> Tableau Tutorial – How to Build Your Own COVID Tracker Dashboard </h1> Harshit Tyagi — Thu, 18 Mar 2021 18:15:46 +0000 I don’t use Tableau for my data science work, but I have done a couple of mini-projects to help me review the interface and learn what the hype is all about. So yesterday, I decided to create a complete dashboard using Tableau. I wanted to compare the ease of building, the time it took to complete the project, and the quality of the dashboard. 
So I chose to base it on the number of Novel Coronavirus cases in the world, since I'd built a similar <a target="_blank" href="https://towardsdatascience.com/building-covid-19-analysis-dashboard-using-python-and-voila-ee091f65dcbb">dashboard displaying COVID cases using Python, Jupyter Notebook, and Voila</a>. <h2 id="heading-pre-requisites-for-this-quick-tutorial">Pre-requisites for this quick tutorial</h2> There's nothing major – just make sure you have <a target="_blank" href="https://public.tableau.com/en-us/s/download">Tableau public installed</a>. To better understand the stark difference between the two approaches – that is, building a <a target="_blank" href="https://covid-19-voila-dashboard.herokuapp.com/">dashboard</a> using programming versus building it with Tableau – just skim through my <a target="_blank" href="https://towardsdatascience.com/building-covid-19-analysis-dashboard-using-python-and-voila-ee091f65dcbb">article on Building a COVID-19 interactive dashboard from Jupyter Notebooks</a> or watch the video <a target="_blank" href="https://youtu.be/FngV4VdYrkA">here</a>. You can view my Python-based dashboard <a target="_blank" href="https://covid-19-voila-dashboard.herokuapp.com/">here</a>. Let’s start building… <h1 id="heading-how-to-find-a-good-data-source">How to Find a Good Data Source</h1> The first step is to find a credible data source given the seriousness of the topic we’ve picked up. For this, we are going to leverage the <a target="_blank" href="https://github.com/CSSEGISandData/COVID-19">COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University</a>¹. This is maintained by a number of contributors from the University and is updated on a regular basis. 
There are many different types of datasets, but to keep things simple for now, we are going to use the country-specific data giving us the latest number of different types of cases (active, confirmed, deaths, recovered) for different countries/regions in the world. Here is the raw link to the file: <a target="_blank" href="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/web-data/data/cases_country.csv">https://raw.githubusercontent.com/CSSEGISandData/COVID-19/web-data/data/cases_country.csv</a> It is a CSV file that looks like this: <h1 id="heading-how-to-load-the-data-into-tableau">How to Load the Data into Tableau</h1> There are several ways of loading data into Tableau, including: <ul> <li>Uploading files from your local machine — Excel, CSV, text, JSON, PDF, Spatial, and so on. </li> <li>Connecting to data stored on a server — you can directly load data from Tableau Server, Google Cloud Storage/Analytics, MS SQL Server, and others. You can use already available data connectors for these. </li> <li>You can also connect to sources you’ve connected to before. </li> </ul> In our case, we want to load the raw CSV file available on GitHub directly to Tableau. For this purpose, we can use a CSV web <a target="_blank" href="https://basic-csv-wdc.herokuapp.com/">data connector</a> developed by Keshia Rose. Here's the link to the connector: <a target="_blank" href="https://basic-csv-wdc.herokuapp.com/">https://basic-csv-wdc.herokuapp.com/</a> And these are the steps to load the data: <ul> <li>Under the Connect pane, click on <code>Web Data Connector</code>. </li> <li>Add the connector URL in the field that pops up and hit <code>Enter</code>. 
</li> </ul> <ul> <li>Now, add the link to the raw CSV file in the search field and click on <code>Get Data!</code>.</li> </ul> It will take a few seconds to load the data and then you can click on <code>Update now</code> to finally peek at the data available in the file: <h1 id="heading-how-to-explore-the-data-in-tableau">How to Explore the Data in Tableau</h1> Tableau presents the data in a very intuitive manner. We can learn about the basic attributes of the data and their types right from the preview and metadata. From the preview, we can find out about the features we have in the dataset that further define the questions we are interested in answering about the problem at hand. From the metadata view, we can find out about the data types (categorical/quantitative/DateTime, and so on) of those features that tell us how we can analyze those features in combination with others. Clicking on the metadata view displays the columns along with their names and types: It’s important to learn the meaning of the features and their data types. Tableau shows each column’s data type with an icon: <code>#</code> denotes a numerical data type, <code>Abc</code> denotes a categorical/string data type, and <code>🌐</code> denotes geographical values. Apart from these, there are also icons for DateTime, clusters, and booleans. This should help us understand what we can do with this dataset. Since the data is already clean and formatted, we can skip the wrangling part and move on to define what we want from this analysis. So, let’s move on to the next step. <h1 id="heading-how-to-define-questions-based-on-the-columns">How to Define Questions Based on the Columns</h1> Based on the features we have and their data types, we can look to answer the following straightforward questions: <ul> <li>What is the current number of COVID cases in the world (total active, confirmed, deaths)? </li> <li>What is the current state of countries — if we can visualize this in one frame? 
</li> <li>Which are the most affected countries in terms of the number of cases and mortality rate? </li> </ul> You can add and define more or different questions, but I am going to walk you through these for now. It’s time to get down to answering these questions. <h1 id="heading-how-the-tableau-interface-works">How the Tableau Interface Works</h1> Here’s a quick tour of the Tableau interface. → At the bottom, you’ll see a number of icons. These let you: <ul> <li>check the connected data source </li> <li>add new sheets </li> <li>add new dashboards </li> <li>add new stories. </li> </ul> → Click on Sheet 1, which is created for us by default. In the picture above, I’ve annotated only the important parts of the interface. We can do most of the analysis by dragging and dropping features into columns and rows. <h1 id="heading-how-to-create-visualizations-in-tableau">How to Create Visualizations in Tableau</h1> We’ll now go through the questions one by one and create a dedicated sheet to analyze the data and answer each of them. <h2 id="heading-1-total-number-of-cases">#1 Total Number of Cases</h2> To answer this, we are going to make use of the following columns: <ul> <li>Confirmed </li> <li>Deaths </li> <li>Active </li> </ul> Now, Tableau knows that these are quantitative measures and adds a default aggregator (SUM in this case) as soon as you drag and drop any one of these. You can change the aggregator at any point from the Marks pane. To visualize the total (SUM) number of cases, simply drag each of the above features and put them in the Columns field at the top. <blockquote> At any point, if anything goes wrong, you can use <code>Cmd/Ctrl + Z</code> to undo it. </blockquote> Furthermore, you can change the color of each of the bars using Marks in the left pane. You can also play around with the font, text color, shadow, and more by right-clicking on the data visualization you want to format. 
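For comparison with the programming route, the same SUM aggregation can be sketched in a few lines of plain Python. The sample rows below are made up, the column names are assumed to match the ones used in this tutorial, and `column_totals` is a hypothetical helper that skips blank cells instead of failing on them.

```python
import csv
import io

# Made-up sample rows; the column names are assumed to match the tutorial's dataset.
SAMPLE = """Country_Region,Confirmed,Deaths,Active
Atlantis,1200,30,400
Borduria,800,12,
Cascadia,500,5,95
"""


def column_totals(csv_text, columns):
    """Sum the requested numeric columns, skipping blank cells."""
    totals = {name: 0 for name in columns}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in columns:
            value = row[name].strip()
            if value:
                totals[name] += int(float(value))
    return totals


totals = column_totals(SAMPLE, ["Confirmed", "Deaths", "Active"])
print(totals)
```

To run this against the real file, you would download the raw CSV first and pass its text to `column_totals`. The aggregation itself is short; the bar chart and its formatting are where the extra coding effort goes.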
Here’s what my formatted visualization looks like after a few changes (color and width). → Decent enough for the amount of effort we’ve put in. It would have taken a lot more time and effort to code this. Awesome, let’s move on to the next part. <h2 id="heading-2-world-map-that-displays-the-covid-cases-in-each-countryregion">#2 World Map that Displays the COVID Cases in Each Country/Region</h2> Since we have geospatial dimensions in the data, we can look to plot the numbers on a world map to visualize the situation in each country with reference to our variable of choice. I am going to plot the number of cases (confirmed, active, and deaths) on the world map using Latitude and Longitude columns. These are generated by Tableau from the Lat/Long variables, and are italicized in the Tables pane. How to do that: <ul> <li>The first step is to add a new sheet by clicking the icon adjacent to <code>Sheet 1</code> </li> <li>Drag the Longitude (generated) and drop it in Columns </li> <li>Drag Latitude (generated) and drop it in Rows. After doing this, you’ll have a blank world map in the main view. </li> <li>To add the names of countries, drop the <code>Country Region</code> column on the details box in the Marks pane. Doing this will produce the symbols map with country names showing in the tooltip. </li> <li>Now, we have a <code>Show Me</code> pane on the right top that shows us all the visualizations that you can use. The charts that are greyed out are not applicable and when you hover over them, it will tell you what all types of columns you need to make that chart applicable. Do it for the world map and you’ll learn that we need at least 1 geospatial dimension, 0 or more dimensions, and 0 or 1 measure. </li> <li>It’s time to add the measure, that is the variable that we want to visualize. I am choosing the number of confirmed cases. Drag and drop the Confirmed column on the Label box in the Marks pane. 
</li> </ul> You can also add other variables to the details box if you want to add them to the insights. Here’s what my symbols map looks like: Feel free to play around with the other map, add colors, or format what you want to see on the map. <h2 id="heading-3-most-affected-countries">#3 Most Affected Countries</h2> The total numbers and world map can only give you a brief overview of the pandemic. So, let’s dive a little deeper to see which countries are most affected in terms of confirmed cases, deaths, and mortality rate, and which countries have high recovery rates. These data are very simple to plot. Here are the steps: <ul> <li>Add a new sheet. </li> <li>Drag and drop the <code>Country Region</code> feature into Columns. </li> <li>Drag and drop the <code>SUM(Confirmed)</code> into Rows. You’ll have a bar chart ready for you in the main view with countries on the X-axis and the number of Confirmed cases on the Y-axis. </li> <li>Since we are supposed to look at the most affected countries, we need to sort the data, and Tableau makes it very easy for us. All we need to do is click on the <code>Sort descending</code> icon in the taskbar at the top. </li> <li>With all the bars aligned in descending order, we now simply want to pick a few that are above a certain threshold – let’s say top 10. Hold your cursor in a clicked state and drag it over the number of bars you want to shortlist. </li> <li>Hover over the shortlisted bars and click on Keep Only in the pop-up that appears. This will give you an uncluttered chart. </li> <li>You can turn on the labels from the taskbar or drop SUM (Confirmed) onto the Label box. </li> </ul> And again, you can add colors, format as you like, annotate, and do more with these data. Here are the charts that I created using the above steps: <blockquote> Don’t forget to rename your sheets as per their use case. 
</blockquote> <h2 id="heading-how-to-create-a-dashboard-out-of-these-sheets">How to Create a Dashboard out of These Sheets</h2> With enough visualizations and numbers, we can now dump them all on one screen to create a quick interactive dashboard out of it. This final step is very simple – all you need to do is click on the <code>New Dashboard</code> icon at the bottom. This will create an empty dashboard view, prompting you to drop the sheets you want to appear in your dashboard from the left pane. You can drag and drop the sheets to the dashboard and then position them to make your dashboard look insightful and appealing. Here’s my final dashboard: If you want to make changes to any of the visualizations, you can go back to that sheet and the changes will be reflected automatically in the dashboard. <h2 id="heading-how-to-share-your-dashboard">How to Share your Dashboard</h2> You can save all of your changes to your notebooks/dashboard on Tableau’s public server by creating your own personal account. Saving the dashboard will create a public link that you can share with your fellow analysts, collaborators, or friends. You can look at my dashboard here: <a target="_blank" href="https://public.tableau.com/profile/harshit.tyagi#!/vizhome/covid_book/Dashboard1">https://public.tableau.com/profile/harshit.tyagi#!/vizhome/covid_book/Dashboard</a>. <h1 id="heading-conclusion">Conclusion</h1> After building this dashboard using Tableau, I compared it with the amount of effort it took me to create the same using Python and Jupyter Notebook. I tried to score the two methodologies on different metrics on a scale of 1 - 5, where 5 is the best and 1 is the worst: Tableau turns out to be a clear winner here! I can say that Tableau seems to be a wise and time-efficient choice at least for these kinds of scenarios. <blockquote> Disclaimer: It may be incorrect to compare a programming language with a Data Analysis software. 
This is a fun comparison that applies only to this type of dashboard-building task. This is my personal opinion based on my experience, and you should find the best tool for yourself. </blockquote> <h2 id="heading-live-project">Live Project</h2> If you want to work on something similar yet advanced, you should check out my live project on <a target="_blank" href="https://www.manning.com/liveproject/predicting-disease-outbreaks-with-time-series-analysis?utm_source=harshit&utm_medium=affiliate&utm_campaign=liveproject_tyagi_predicting_3_11_21&a_aid=harshit&a_bid=f5119f17">Manning</a>. <h2 id="heading-video-version-of-this-blog">Video version of this blog!</h2> <div class="embed-wrapper"> </div> If this tutorial was helpful, you should check out my data science and machine learning courses on <a target="_blank" href="https://www.wiplane.com/">Wiplane Academy</a>. They are comprehensive yet compact and help you build a solid foundation of work to showcase. Citation(s): [1]: Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis. 2020;20(5):533–534. doi: 10.1016/S1473-3099(20)30120-1 </article> <article> <h1> How to Read a Research Paper – A Guide to Setting Research Goals, Finding Papers to Read, and More </h1> Harshit Tyagi — Thu, 04 Mar 2021 19:01:49 +0000 If you work in a scientific field, you should try to build a deep and unbiased understanding of that field. This not only educates you in the best possible way but also helps you envision the opportunities in your space. A research paper is often the culmination of a wide range of deep and authentic practices surrounding a topic. When writing a research paper, the author thinks critically about the problem, performs rigorous research, evaluates their processes and sources, organizes their thoughts, and then writes. These genuinely-executed practices make for a good research paper. 
If you’re struggling to build a habit of reading papers (like I am) on a regular basis, I’ve tried to break down the whole process. I've talked to researchers in the field, read a bunch of papers and blogs from distinguished researchers, and jotted down some techniques that you can follow. Let’s start off by understanding what a research paper is and what it is NOT! <h2 id="heading-what-is-a-research-paper">What is a Research Paper?</h2> A research paper is a dense and detailed manuscript that compiles a thorough understanding of a problem or topic. It offers a proposed solution and further research along with the conditions under which it was deduced and carried out, the efficacy of the solution and the research performed, and potential loopholes in the study. A research paper is written not only to provide an exceptional learning opportunity but also to pave the way for further advancements in the field. These papers help other scholars germinate the thought seed that can either lead to a new world of ideas or an innovative method of solving a longstanding problem. <h2 id="heading-what-research-papers-are-not">What Research Papers are NOT</h2> There is a common notion that a research paper is a well-informed summary of a problem or topic written by means of other sources. But you shouldn't mistake it for a book or an opinionated account of an individual’s interpretation of a particular topic. <h2 id="heading-why-should-you-read-research-papers">Why Should You Read Research Papers?</h2> What I find fascinating about reading a good research paper is that you can draw on a profound study of a topic and engage with the community on a new perspective to understand what can be achieved in and around that topic. I work at the intersection of instructional design and data science. Learning is part of my day-to-day responsibilities. If the source of my education is flawed or inefficient, I’d fail at my job in the long term. 
This applies to many other jobs in science with a special focus on research. There are three important reasons to read a research paper: <ol> <li>Knowledge — Understanding the problem through the eyes of someone who has probably spent years solving it and has taken care of all the edge cases that you might not think of at the beginning. </li> <li>Exploration — Whether you have a pinpointed agenda or not, there is a very high chance that you will stumble upon an edge case or a shortcoming that is worth following up. With persistent efforts over a considerable amount of time, you can learn to use that knowledge to make a living. </li> <li>Research and review — One of the main reasons for writing a research paper is to further the development in the field. Researchers read papers to review them for conferences or to do a literature survey of a new field. For example, <a target="_blank" href="http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf">Yann LeCun</a>’s paper on integrating domain constraints into backpropagation set the foundation of modern computer vision back in 1989. After decades of research and development work, we have come so far that we're now perfecting solutions to problems like object detection and optimizing autonomous vehicles. </li> </ol> Not only that, with the help of the internet, you can extrapolate all of these reasons or benefits onto multiple business models. The outcome can be an innovative state-of-the-art product, an efficient service model, a career as a content creator, or a dream job where you are solving problems that matter to you. <h2 id="heading-goals-for-reading-a-research-paper-what-should-you-read-about">Goals for Reading a Research Paper — What Should You Read About?</h2> The first thing to do is to figure out your motivation for reading the paper. There are two main scenarios that might lead you to read a paper: <ol> <li>Scenario 1 — You have a well-defined agenda/goal and you are deeply invested in a particular field. 
For example, you’re an NLP practitioner and you want to learn how GPT-4 has given us a breakthrough in NLP. This is always a nice scenario to be in as it offers clarity. </li> <li>Scenario 2 — You want to keep abreast of the developments in a host of areas, say <a target="_blank" href="https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology">how a new deep learning architecture has helped us solve a 50-year-old biological problem of understanding protein structures.</a> This is often the case for beginners or for people who consume their daily dose of news from research papers (yes, they exist!). </li> </ol> If you’re an inquisitive beginner with no starting point in mind, start with scenario 2. Shortlist a few topics you want to read about until you find an area that you find intriguing. This will eventually lead you to scenario 1. <h3 id="heading-ml-reproducibility-challenge">ML Reproducibility Challenge</h3> In addition to these generic goals, if you need an end goal for your habit-building exercise of reading research papers, you should check out the <a target="_blank" href="https://openreview.net/group?id=ML_Reproducibility_Challenge/2020">ML reproducibility challenge.</a> You’ll find top-class papers from world-class conferences that are worth diving deep into and whose results are worth reproducing. They conduct this challenge twice a year and they have one coming up in <a target="_blank" href="https://paperswithcode.com/rc2020">Spring 2021.</a> You should study the past three versions of the challenge, and I’ll write a detailed post on what to expect, how to prepare, and so on. Now you must be wondering – how can you find the right paper to read? 
<h2 id="heading-how-to-find-the-right-paper-to-read">How to Find the Right Paper to Read</h2> In order to get some ideas around this, I reached out to my friend, <a target="_blank" href="https://scholar.google.com/citations?user=zd0-SNQAAAAJ&hl=en&oi=ao">Anurag Ghosh</a>, who is a researcher at Microsoft. Anurag has been working at the crossover of computer vision, machine learning, and systems engineering. His website: https://anuragxel.github.io/ Here are a few of his tips for getting started: <ul> <li>Always pick an area you're interested in. </li> <li>Read a few good books or detailed blog posts on that topic and start diving deep by reading the papers referenced in those resources. </li> <li>Look for seminal papers around that topic. These are papers that report a major breakthrough in the field and offer a new method or perspective with a huge potential for subsequent research in that field. Check out papers from <a target="_blank" href="https://blog.acolyer.org/">the morning paper</a> or the <a target="_blank" href="https://www.thecvf.com/?page_id=413#Helmholtz">CVF</a> test of time award/Helmholtz Prize (if you're interested in computer vision). </li> <li>Check out books like Computer Vision: Algorithms and Applications by Richard Szeliski and look for the papers referenced there. </li> <li>Build a sense of community. Find people who share similar interests, and join groups/subreddits/Discord channels where such activities are promoted. 
</li> </ul> In addition to these invaluable tips, there are a number of web applications that I’ve shortlisted that help me narrow my search for the right papers to read: <ul> <li><a target="_blank" href="https://www.reddit.com/r/MachineLearning/">r/MachineLearning</a> — there are many researchers, practitioners, and engineers who share their work along with the papers they've found useful in achieving those results.</li> </ul> <ul> <li><a target="_blank" href="http://www.arxiv-sanity.com/top">Arxiv Sanity Preserver</a> — built by Andrej Karpathy to accelerate research. It is a repository of 142,846 papers from computer science, machine learning, systems, AI, Stats, CV, and so on. It also offers a bunch of filters, powerful search functionality, and a discussion forum to make for a super useful research platform.</li> </ul> <ul> <li><a target="_blank" href="https://research.google/">Google Research</a> — the research teams at Google are working on problems that have an impact on our everyday lives. They share their publications for individuals and teams to learn from, contribute to, and expedite research. They also have a Google AI blog that you can check out.</li> </ul> <h2 id="heading-how-to-read-a-research-paper">How to Read a Research Paper</h2> After you have stocked your to-read list, then comes the process of reading these papers. Remember that NOT every paper is useful to read and we need a mechanism that can help us quickly screen papers that are worth reading. To tackle this challenge, you can use this <a target="_blank" href="http://ccr.sigcomm.org/online/files/p83-keshavA.pdf">Three-Pass Approach by S. Keshav</a>. This approach proposes that you read the paper in three passes instead of starting from the beginning and diving in deep until the end. 
<h3 id="heading-the-three-pass-approach">The three pass approach</h3> <ol> <li>The first pass — is a quick scan to capture a high-level view of the paper. Read the title, abstract, and introduction carefully followed by the headings of the sections and subsections and lastly the conclusion. It should take you no more than 5–10 mins to figure out if you want to move to the second pass. </li> <li>The second pass — is a more focused read without checking for the technical proofs. You take down all the crucial notes, underline the key points in the margins. Carefully study the figures, diagrams, and illustrations. Review the graphs, mark relevant unread references for further reading. This helps you understand the background of the paper. </li> <li>The third pass — reaching this pass denotes that you’ve found a paper that you want to deeply understand or review. The key to the third pass is to reproduce the results of the paper. Check it for all the assumptions and jot down all the variations in your re-implementation and the original results. Make a note of all the ideas for future analysis. It should take 5–6 hours for beginners and 1–2 hours for experienced readers. </li> </ol> <h2 id="heading-tools-and-software-to-keep-track-of-your-pipeline-of-papers">Tools and Software to Keep Track of Your Pipeline of Papers</h2> If you’re sincere about reading research papers, your list of papers will soon grow into an overwhelming stack that is hard to keep track of. Fortunately, we have software that can help us set up a mechanism to manage our research. Here are a bunch of them that you can use: <ul> <li><a target="_blank" href="https://www.mendeley.com/?interaction_required=true">Mendeley</a> [not free] — you can add papers directly to your library from your browser, import documents, generate references and citations, collaborate with fellow researchers, and access your library from anywhere. 
This is mostly used by experienced researchers.</li> </ul> <ul> <li><a target="_blank" href="https://www.zotero.org/">Zotero</a> [free & open source] — Along the same lines as Mendeley but free of cost. You can make use of all the features but with limited storage space.</li> </ul> <ul> <li>Notion — this is great if you are just starting out and want to use something lightweight with the option to organize your papers, jot down notes, and manage everything in one workspace. It might not stand anywhere in comparison with the above tools but I personally feel comfortable using Notion and I have created <a target="_blank" href="https://www.notion.so/My-paper-pipeline-ec3ff02ce9c641d2953f6cdbc431a55a">this board</a> to keep track of my progress for now. You can duplicate it.</li> </ul> <h2 id="heading-symptoms-of-reading-a-research-paper">⚠️ Symptoms of Reading a Research Paper</h2> Reading a research paper can turn out to be frustrating, challenging, and time-consuming, especially when you’re a beginner. You might face the following common symptoms: <ul> <li>Feeling dumb for not understanding a thing a paper says. </li> <li>Finding yourself pushing too hard to understand the math behind those proofs. </li> <li>Beating your head against the wall to wrap it around the number of acronyms used in the paper. Just kidding, you’ll have to look up those acronyms every now and then. </li> <li>Being stuck on one paragraph for more than an hour. </li> </ul> Here’s a complete list of emotions that you might undergo as explained by Adam Ruben in <a target="_blank" href="https://www.sciencemag.org/careers/2016/01/how-read-scientific-paper">this article</a>. <h2 id="heading-key-takeaways">Key Takeaways</h2> We should be all set to dive right in. 
Here’s a quick summary of what we have covered here: <ul> <li>A research paper is an in-depth study that offers a detailed explanation of a topic or problem along with the research process, proofs, explained results, and ideas for future work. </li> <li>Read research papers to develop a deep understanding of a topic/problem. Then you can either review papers as part of being a researcher, explore the domain and the kind of problems to build a solution or startup around it, or you can simply read them to keep abreast of the developments in your domain of interest. </li> <li>If you’re a beginner, start with exploration to soon find your path to goal-oriented research. </li> <li>In order to find good papers to read, you can use websites like arxiv-sanity and Google Research, and subreddits like r/MachineLearning. </li> <li>Reading approach — Use the three-pass method to screen a paper and then read it in depth. </li> <li>Keep track of your research, notes, and developments by using tools like Zotero/Notion. </li> <li>This can get overwhelming in no time. Make sure you start off easy and increase your load progressively. </li> </ul> Remember: Art is not a single method or step done over a weekend but a process of accomplishing remarkable results over time. You can also watch the video on this topic on my <a target="_blank" href="https://www.youtube.com/channel/UCH-xwLTKQaABNs2QmGxK2bQ">YouTube channel</a>: <div class="embed-wrapper"> </div> Feel free to respond to this blog or comment on the video if you have some tips, questions, or thoughts! If this tutorial was helpful, you should check out my data science and machine learning courses on <a target="_blank" href="https://www.wiplane.com/">Wiplane Academy</a>. They are comprehensive yet compact and help you build a solid foundation of work to showcase. 
</article> <article> <h1> How to Get Your First Freelancing Client or Project </h1> Harshit Tyagi — Thu, 21 Jan 2021 19:08:41 +0000 It’s been almost three years since I left my full-time job as a Data Engineer and started freelancing as a full-time gig. My professional growth took a steep upward turn as I was hustling to get my first client or a new project. More than growth or money, I wanted to find where my interests truly were. Not to mention, working to make my own ideas come to life gave me immense joy. One question that I often encounter with regards to my profession is, <blockquote> How did you manage to get your first freelancing projects? </blockquote> Because everyone knows that getting your first gig is arguably the hardest part of freelancing. This blog post and video (at the bottom) is about building a strategy to start your freelancing career. <h2 id="heading-why-freelancing">Why freelancing?</h2> Who doesn’t want to set up a side income if you’ve got the time? Maybe you want to save up for a car or maybe you want to take a break year. That’s where freelancing comes in. Not to mention, you can become “your own boss”, work from a beach house in shorts, and at the same time, there is no limit to how much you can earn. You possess control of your growth. So, here you go! <h2 id="heading-five-steps-to-follow-to-get-your-first-freelancing-project">Five Steps to Follow to Get Your First Freelancing Project</h2> As I see it, there are five steps you need to follow that'll help you get your first gig and establish a freelancing career. Here they are: <ol> <li>Identify your skill as a service </li> <li>Define your ideal client or market </li> <li>Build your portfolio and profiles </li> <li>Market your services to clients </li> <li>Capture the results/gaps, analyze the output at each step, and attune your approach to keep growing. 
</li> </ol> <h2 id="heading-step-1-identify-your-skills-as-a-service">Step 1 — Identify your skills as a service</h2> The skills section of your résumé simply lists your knowledge in terms of technologies and techniques. But you have to identify what you can do with each of those skills in the real world. Make it explicit so that the client understands what you can do for them. Your skills could be: <ul> <li>Web development </li> <li>Graphic design </li> <li>Digital marketing </li> <li>Data Analysis </li> </ul> Your services could be: <ul> <li>Building high-performance end-to-end websites and web applications. </li> <li>Creating beautiful illustrations and digital artworks for websites, videos, and posters. </li> <li>Helping individuals and SMEs in promoting their products and services online. </li> <li>Crunching data to uncover patterns and answer important business-centric questions to make informed decisions. </li> </ul> You can see the dos and don'ts at each step in this infographic: Most people struggle with taking the first step. They are not sure if they are ready yet. I believe you’ll never be ready. Just start! Here’s a tweet from Gumroad’s founder, Sahil Lavingia: <div class="embed-wrapper"> <blockquote class="twitter-tweet"> <a href="https://twitter.com/shl/status/1350102029675290630"></a> </blockquote> <script defer="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></div> For the rest of the article, I am going to focus on web development and data science services, but you can apply these principles to almost any service that you have to offer. Tips: <ul> <li>Look for people who have made a successful career out of freelancing. Study and replicate their process. </li> <li>Learn to use the right phrasing: tell your potential clients what you do and why you are better than the rest. 
</li> </ul> <h2 id="heading-step-2-define-your-ideal-client-or-market">Step 2 — Define your ideal client or market</h2> This whole blog post basically tries to answer the question, “how do you land your first client?” There is no freelancing career without clients, but that doesn’t mean you should reach out or pitch your services to just anyone. This won't do you much good, and instead will add to your frustration. Since you have your niche from the first step, now do some research on who needs that kind of service. Examples of a few clients may include: <ul> <li>Web development — Local stores, fitness instructors, institutes, hotels, creators (hot right now) </li> <li>Data Analysis — Data-driven startups/orgs </li> <li>Developing ML models and apps (DevOps)— Platform as a Service and Software as a Service organizations. </li> </ul> Who are not ideal clients (at least for a first client): <ul> <li>Franchise businesses </li> <li>MNCs, heavily guarded corporates operating in stealth mode for their clients (ZS, E&Y, Deloitte, and so on) </li> </ul> Pro-tips: <ul> <li>Look for individually owned businesses for web development. </li> <li>For data science projects and opportunities, first make sure you have a good grip on your fundamentals. Use platforms like codementor to help students with their projects. </li> </ul> <h2 id="heading-step-3-build-your-portfolio-and-profiles">Step 3— Build your portfolio and profiles</h2> After determining your niche and your target clients, it’s time to set up your shop. In this case that can be your portfolio website (highly recommended) or profiles on platforms like LinkedIn, Upwork, Toptal, or AngelList. <h3 id="heading-portfolio-website">Portfolio website</h3> An important thing to note here is that you want someone to trust you with their business. How can you make someone trust you? You can start by telling your professional story. Portfolio websites are one way to do just that. 
If you don’t have a story to tell yet, create one by working for yourself. You can build sample websites using commonly used themes available online, or you can propose to work on a project for a prospective client for free. Most people mistake a portfolio website for an online résumé. A portfolio website is not for you but for the clients that you want to target. It should showcase your services through: <ul> <li>Work samples — websites/applications, reports you have created, conference talks (if any), and so on. If you don’t have any samples, create a few! Develop dashboards/websites, host them, and show that you can deliver high-quality work. </li> <li>Testimonials/recommendations — ask your former colleagues/bosses/clients to write you a testimonial. TIP: if you haven’t worked anywhere before, do 1–2 free projects for prospective clients and then ask them to write a testimonial. I have done 2–3 free projects where the person referred me to another party. </li> <li>Write Blogs — this is something that has worked well for me. It is a way to establish your credibility. The blogs should contain relevant and authentic content through which the client can learn something about their business. Start by giving away! </li> </ul> Example ideas for blog posts: <ul> <li>How to reach a wider audience with a website </li> <li>What type of business metrics your Dashboard/Reports should contain </li> </ul> <h3 id="heading-profiles">Profiles</h3> Apart from having a website, you should also be socially present on professional platforms like LinkedIn. This will help you look for and connect with prospective clients. It can also give you work ideas, let you post updates, and help you promote yourself. You can use the same principles we just discussed above when you're developing your social media profiles. A good profile can increase your odds of getting an opportunity for various reasons: <ul> <li>It shows how seriously you take your work. 
</li> <li>It can display your work samples and proficiency in skills to offer a service in your niche. </li> <li>It can add to your credibility if you have testimonials or recommendations written by an authentic colleague/partner/ex-boss. </li> </ul> <h3 id="heading-freelancing-platforms">Freelancing platforms</h3> <a target="_blank" href="https://www.upwork.com/freelancers/~015a6822a75be60fd8?viewMode=1&s=1110580759050571776">Upwork profile</a> In addition to LinkedIn, there are a number of platforms that host a complete freelancing ecosystem with clients posting their needs and freelancers bidding to do that work. A few of the major platforms to look for freelancing projects/work include: <ul> <li>Upwork </li> <li>Toptal </li> <li>AngelList (for jobs at startups) </li> </ul> So, how can you be successful as a freelancer on these platforms? I’d say there is not one but multiple aspects of a profile that you have to get right. Your success relies on the following: <ul> <li>Your proficiency in the skills you have listed on the platform. </li> <li>How well-curated your profile is. </li> <li>Your proposal for the job. It tells the client how good a fit you are for the advertised job. </li> <li>Lastly, luck! It plays a very small and rare role if you get the first three aspects right. </li> </ul> Here are a few tips to build an attractive and authentic profile: <ul> <li>Review projects that you’d be interested in applying for. Note down the keywords and skills that these clients use to describe their needs (skills that you possess, of course). Add these skills (strengths) to your profile so that it connects you with the relevant projects. List up to 10 skills. </li> <li>Upload a professional picture along with a short and succinct bio that describes your niche/services. </li> <li>Highlight your best work in the portfolio. 
</li> <li>List your certifications if you have any. They add weight to your profile. </li> <li>Be consistent with your skills, complete all sections of the profile, be concise and straightforward, and proofread each section. </li> </ul> <h2 id="heading-step-4-pitch-to-clients-outreach">Step 4 — Pitch to clients (outreach)</h2> It’s time to get down to business. If you can sell your services and show that you're well-suited to the clients’ requirements and budget, you might win the opportunity. But first, you need to find clients to pitch to. This is where you need to work on your visibility and outreach. There are a number of ways you can do that: <ul> <li>Reach out to clients on platforms like LinkedIn — this is what worked for my niche! </li> <li>Use freelancing platforms — Upwork (general), Toptal (engineers), codementor (if you’re an expert), and AngelList. </li> <li>If you want to go one step ahead, use <a target="_blank" href="https://ads.google.com/intl/en_in/home/">Google Adwords</a> (advertise your services) or create a Facebook group for selling services in your niche, in your physical location (city/state). </li> </ul> My approach (not necessarily good for your niche!): <ul> <li>I do a lot of research to find organizations that align well with my niche. I mainly use <a target="_blank" href="https://www.linkedin.com/in/tyagiharshit/">LinkedIn</a>, <a target="_blank" href="https://twitter.com/tyagi_harshit24">Twitter</a>, and Google search (browsing) for this. </li> <li>I categorized the shortlisted companies (~50) based on their domain (fintech, healthcare, Ed-tech) and created a template message, kind of like a cover letter, to go with the projects that they had in the pipeline along with my own ideas. </li> <li>I used to send samples of my work in each domain. If there was nothing to show, I’d start a new project in that domain and send them my GitHub repo to tell them what I was working on. 
</li> </ul> I landed my first freelancing client via Udacity in 2016. It was because I was one of their alumni and they launched a new platform. My proposal suited the client’s needs and I got hired. I consider myself fortunate in that regard. It got easier after that. <h2 id="heading-step-5-capture-the-data-analyse-them-and-attune-your-process">Step 5 — Capture the data, analyse them, and attune your process</h2> You might land your first project in a day, in a month, or in 6 months. Either way, an integral part of the process is to keep improving. You might fail at your first attempt, but use that failure to get better at it. Capture the data: <ul> <li>The number of projects/clients you reached out to. How many responded, got interested, rejected, or went ahead. </li> <li>Why your proposal got rejected. Request a comment from the client. </li> <li>People who are successful in your niche, what are they doing differently? </li> <li>What’s new in your niche? How are people operating? </li> </ul> Analyze the data: <ul> <li>The gap between the requirements of the project and your portfolio — see what’s missing. </li> <li>Understand each data point you’ve captured, and identify the type of clients you had a better chance of converting. </li> <li>Create a separate category of clients for whom you have to bridge a gap between their requirements and your expertise. </li> </ul> Attune your process: <ul> <li>Start working on new projects to attain new skills or master the ones you have listed. </li> <li>Restructure, polish and attune your profile to the client’s needs. </li> <li>Re-write the proposal emphasizing their needs and your services along with your work samples and numbers/statistics (if applicable). </li> </ul> If you are spending 2 hours a day on creating proposals and pitching to clients, then spend at least 3–4 hours on polishing your skill. Your strategy will only work if you have that curiosity to learn and build every day. 
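The capture and analyze steps above can be kept as simple as a few counters. Here is a minimal sketch in Python, assuming you log how many clients you contacted, how many responded, how many showed interest, and how many hired you. The class name OutreachLog, its fields, and the sample numbers are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class OutreachLog:
    """Hypothetical counters for one round of client outreach."""
    contacted: int
    responded: int
    interested: int
    hired: int

    def rates(self):
        """Return percentage rates for each stage of the outreach funnel."""
        def pct(part, whole):
            # Guard against division by zero when nothing has been logged yet.
            return round(100 * part / whole, 1) if whole else 0.0

        return {
            "response_rate": pct(self.responded, self.contacted),
            "interest_rate": pct(self.interested, self.responded),
            "conversion_rate": pct(self.hired, self.contacted),
        }


# Made-up numbers for illustration only.
log = OutreachLog(contacted=50, responded=12, interested=5, hired=1)
print(log.rates())
```

Comparing these rates across domains or proposal templates is one way to see where the gap between client requirements and your pitch is largest.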
Stick to these principles and it’s only a matter of time before you land your first freelancing client. What are you waiting for? Let’s get started! You can also watch the video on this topic on my YouTube channel. If this tutorial was helpful, you should check out my data science and machine learning courses on <a target="_blank" href="https://www.wiplane.com/">Wiplane Academy</a>. They are comprehensive yet compact and help you build a solid foundation of work to showcase. </article> <article> <h1> Data Science Learning Roadmap </h1> Harshit Tyagi — Tue, 12 Jan 2021 00:24:30 +0000 Although nothing really changes but the date, a new year fills everyone with the hope of starting things afresh. If you add in a bit of planning, some well-envisioned goals, and a learning roadmap, you'll have a great recipe for a year full of growth. This post intends to strengthen your plan by providing you with a learning framework, resources, and project ideas to help you build a solid portfolio of work showcasing expertise in data science. Just a note: I've prepared this roadmap based on my personal experience in data science. This is not the be-all and end-all learning plan. You can adapt this roadmap to better suit any specific domain or field of study that interests you. Also, this was created with Python in mind as I personally prefer it. <h2 id="heading-what-is-a-learning-roadmap">What is a learning roadmap?</h2> A learning roadmap is an extension of a curriculum. It charts out a multi-level skills map with details about what skills you want to hone, how you will measure the outcome at each level, and techniques to further master each skill. My roadmap assigns weights to each level based on the complexity and commonality of its application in the real world. I have also added an estimated time for a beginner to complete each level with exercises and projects. 
Here is a pyramid that depicts the high-level skills in order of their complexity and application in the industry. (Figure: Data science tasks in the order of complexity.) This will mark the base of our framework. We’ll now dive deep into each of these strata to complete our framework with more specific, measurable details. Specificity comes from examining the critical topics in each layer and the resources needed to master those topics. You can measure the knowledge gained by applying the learned topics to a number of real-world projects. I’ve added a few project ideas, portals, and platforms that you can use to measure your proficiency. <blockquote> Important NOTE: Take it one day at a time, one video/blog/chapter a day. It is a wide spectrum to cover. Don’t overwhelm yourself! </blockquote> Let’s deep dive into each of these strata, starting from the bottom. <h2 id="heading-1-how-to-learn-about-programming-or-software-engineering">1. How to Learn About Programming or Software Engineering</h2> (Estimated time: 2-3 months) First, make sure you have sound programming skills. Every data science job description will ask for programming expertise in at least one language. <h3 id="heading-specific-programming-topics-to-know-include">Specific programming topics to know include:</h3> <ul> <li>Common data structures (data types, lists, dictionaries, sets, tuples), writing functions, logic, control flow, searching and sorting algorithms, object-oriented programming, and working with external libraries. </li> <li>SQL scripting: Querying databases using joins, aggregations, and subqueries </li> <li>Comfort using the Terminal, version control in Git, and using GitHub </li> </ul> <h3 id="heading-resources-to-learn-python">Resources to learn Python:</h3> <ul> <li><a target="_blank" href="https://www.learnpython.org/">learnpython.org</a> [free]— a free resource for beginners. It covers all the basic programming topics from scratch. 
You get an interactive shell to practice those topics side-by-side. </li> <li><a target="_blank" href="https://www.kaggle.com/learn/python">Kaggle</a> [free]— a free and interactive guide to learning python. It is a short tutorial covering all the important topics for data science. </li> <li><a target="_blank" href="https://www.freecodecamp.org/learn/scientific-computing-with-python/python-for-everybody/">Python certifications on freeCodeCamp</a> [free] – freeCodeCamp offers several certifications based on Python, such as scientific computing, data analysis, and machine learning. </li> <li><a target="_blank" href="https://www.youtube.com/watch?v=rfscVS0vtbw">Python Course by freecodecamp on YouTube</a> [free] — This is a 5-hour course that you can follow to practice the basic concepts. </li> <li><a target="_blank" href="https://www.youtube.com/watch?v=HGOBQPFzWKo">Intermediate python</a> [free]— Another free course by Patrick featured on freecodecamp.org. </li> <li><a target="_blank" href="https://www.coursera.org/specializations/python">Coursera Python for Everybody Specialization</a> [fee] — this is a specialization encompassing beginner-level concepts, python data structures, data collection from the web, and using databases with python. </li> </ul> <h3 id="heading-resources-for-learning-git-and-github">Resources for learning Git and GitHub</h3> <ul> <li>Guide <a target="_blank" href="https://www.atlassian.com/git">for Git</a> and <a target="_blank" href="https://lab.github.com/">GitHub</a> [free]: complete these tutorials and labs to develop a firm grip over version control. It will help you further in contributing to open-source projects. 
</li> <li>Here's a <a target="_blank" href="https://www.freecodecamp.org/news/git-and-github-crash-course/">Git and GitHub crash course</a> on the freeCodeCamp YouTube channel </li> </ul> <h3 id="heading-resources-for-learning-sql">Resources for learning SQL</h3> <ul> <li>Here's a <a target="_blank" href="https://www.freecodecamp.org/news/sql-and-databases-full-course/">course on SQL and Databases</a> on the freeCodeCamp YouTube channel </li> <li><a target="_blank" href="https://www.kaggle.com/learn/intro-to-sql">Intro to SQL</a> and <a target="_blank" href="https://www.kaggle.com/learn/advanced-sql">Advanced SQL</a> on Kaggle. </li> <li>freeCodeCamp now has a <a target="_blank" href="https://www.freecodecamp.org/learn/relational-database/">free interactive SQL course</a>. </li> </ul> <h3 id="heading-measure-your-expertise-by-solving-a-lot-of-problems-and-building-at-least-2-projects">Measure your expertise by solving a lot of problems and building at least 2 projects:</h3> <ul> <li>Solve a lot of problems here: <a target="_blank" href="https://www.hackerrank.com/">HackerRank</a> (beginner-friendly) and <a target="_blank" href="https://leetcode.com/">LeetCode</a> (solve easy or medium-level questions) </li> <li>Data extraction from a website/API endpoints — try to write Python scripts for extracting data from webpages that allow scraping like soundcloud.com. Store the extracted data in a CSV file or a SQL database. </li> <li>Games like rock-paper-scissors, spin a yarn, hangman, dice rolling simulator, tic-tac-toe, and so on. </li> <li>Simple web apps like a YouTube video downloader, website blocker, music player, plagiarism checker, and so on. </li> </ul> Deploy these projects on GitHub pages or simply host the code on GitHub so that you learn to use Git. <h2 id="heading-2-how-to-learn-about-data-collection-and-wrangling-cleaning">2. 
How to Learn About Data Collection and Wrangling (Cleaning)</h2> (Estimated time: 2 months) A significant part of data science work is centered around finding apt data that can help you solve your problem. You can collect data from different legitimate sources — scraping (if the website allows), APIs, Databases, and publicly available repositories. Once you have data in hand, an analyst will often find themself cleaning dataframes, working with multi-dimensional arrays, using descriptive/scientific computations, and manipulating dataframes to aggregate data. Data are rarely clean and formatted for use in the “real world”. Pandas and NumPy are the two libraries that are at your disposal to go from dirty data to ready-to-analyze data. As you start feeling comfortable writing Python programs, feel free to start taking lessons on using libraries like pandas and <a target="_blank" href="https://towardsdatascience.com/numpy-essentials-for-data-science-25dc39fae39">numpy</a>. <h3 id="heading-resources-to-learn-about-data-collection-and-cleaning">Resources to learn about data collection and cleaning:</h3> <ul> <li><a target="_blank" href="https://www.youtube.com/watch?v=r-uOLxNrNk8">freeCodeCamp course on learning Numpy, Pandas, matplotlib, and seaborn</a> [free]. </li> <li>Practical tutorial on <a target="_blank" href="https://www.hackerearth.com/practice/machine-learning/data-manipulation-visualisation-r-python/tutorial-data-manipulation-numpy-pandas-python/tutorial/">data manipulation with NumPy and Pandas in Python</a> from HackerEarth. </li> <li><a target="_blank" href="https://www.kaggle.com/learn/pandas">Kaggle pandas tutorial</a> [free] — A short and concise hands-on tutorial that will walk you through commonly used data manipulation skills. </li> <li><a target="_blank" href="https://www.kaggle.com/learn/data-cleaning">Data Cleaning course by Kaggle</a>. 
</li> <li><a target="_blank" href="https://www.coursera.org/learn/python-data-analysis?specialization=data-science-python">Coursera course on Introduction to Data Science in Python</a> — This is the first course in the <a target="_blank" href="https://www.coursera.org/specializations/data-science-python">Applied Data Science with Python Specialization.</a> </li> </ul> <h3 id="heading-data-collection-project-ideas">Data collection project Ideas:</h3> <ul> <li>Collect data from a website/API (open for public consumption) of your choice, and transform the data to store it from different sources into an aggregated file or table (DB). Example APIs include <a target="_blank" href="https://developers.themoviedb.org/3">TMDB</a>, <a target="_blank" href="https://www.quandl.com/tools/python">quandl</a>, <a target="_blank" href="https://developer.twitter.com/en/docs">Twitter API</a>, and so on. </li> <li>Pick <a target="_blank" href="https://towardsdatascience.com/data-repositories-for-almost-every-type-of-data-science-project-7aa2f98128b">any publicly available dataset</a> and define a set of questions that you’d want to pursue after looking at the dataset and the domain. Wrangle the data to find out answers to those questions using Pandas and NumPy. </li> </ul> <h2 id="heading-3-how-to-learn-about-exploratory-data-analysis-business-acumen-and-storytelling">3. How to Learn About Exploratory Data Analysis, Business Acumen, and Storytelling</h2> (Estimated time: 2–3 months) The next stratum to master is data analysis and storytelling. Drawing insights from the data and then communicating the same to management in simple terms and visualizations is the core responsibility of a Data Analyst. The storytelling part requires you to be proficient with data visualization along with excellent communication skills. 
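As a small illustration of drawing an insight from data, here is a minimal sketch of a first exploratory pass with pandas. The toy sales table, its column names, and the question being asked (which region earns the most on average) are all invented for this example, and filling missing values with the median is just one simple choice among many.

```python
import pandas as pd

# Hypothetical toy dataset: column names and values are invented for illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "east", "south", "east"],
        "revenue": [120.0, 80.0, None, 200.0, 100.0, 180.0],
    }
)

# Step 1: count missing values per column before doing anything else.
missing_counts = sales.isna().sum()

# Step 2: fill the missing revenue with the column median (an assumption,
# not the only reasonable strategy).
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].median())

# Step 3: answer a business-style question: which region earns the most on average?
avg_by_region = (
    sales.groupby("region")["revenue"].mean().sort_values(ascending=False)
)

print(missing_counts)
print(avg_by_region)
```

The result of the last step is the kind of number you would then turn into a chart and a one-sentence takeaway for a non-technical audience.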
<h3 id="heading-specific-exploratory-data-analysis-and-storytelling-topics-to-learn-include">Specific exploratory data analysis and storytelling topics to learn include:</h3> <ul> <li>Exploratory data analysis — defining questions, handling missing values, outliers, formatting, filtering, univariate and multivariate analysis. </li> <li>Data visualization — plotting data using libraries like matplotlib, seaborn, and plotly. Know how to choose the right chart to communicate the findings from the data. </li> <li>Developing dashboards — a good percent of analysts only use Excel or a specialized tool like Power BI and Tableau to build dashboards that summarise/aggregate data to help management make decisions. </li> <li>Business acumen: Work on asking the right questions to answer, ones that actually target the business metrics. Practice writing clear and concise reports, blogs, and presentations. </li> </ul> <h3 id="heading-resources-to-learn-more-about-data-analysis">Resources to learn more about data analysis:</h3> <ul> <li>Learn data analysis with Python in this <a target="_blank" href="https://www.freecodecamp.org/news/learn-data-analysis-with-python-course/">free course on the freeCodeCamp YouTube channel</a>. </li> <li><a target="_blank" href="https://www.coursera.org/learn/data-analysis-with-python">Data Analysis with Python</a> — by IBM on Coursera. The course covers wrangling, exploratory analysis, and simple model development using python. </li> <li><a target="_blank" href="https://www.kaggle.com/learn/data-visualization">Data Visualization</a> — by Kaggle. Another interactive course that lets you practice all the commonly used plots. 
</li> <li>Build product sense and business acumen with these books: <a target="_blank" href="https://www.amazon.com/Measure-What-Matters-Google-Foundation/dp/0525536221/ref=sr_1_1?crid=1A9SIXXP7S2P8&dchild=1&keywords=measure+what+matters&qid=1610323490&s=books&sprefix=measure%2Cstripbooks%2C365&sr=1-1">Measure what matters</a>, <a target="_blank" href="https://www.amazon.com/Decode-Conquer-Answers-Management-Interviews/dp/0615930417/ref=sr_1_1?s=books&ie=UTF8&qid=1530848101&sr=1-1&keywords=decode+and+conquer">Decode and conquer</a>, <a target="_blank" href="https://www.amazon.com/Cracking-PM-Interview-Product-Technology/dp/0984782818/ref=sr_1_1?s=books&ie=UTF8&qid=1530848116&sr=1-1&keywords=cracking+the+pm+interview">Cracking the PM interview</a>. </li> </ul> <h3 id="heading-data-analysis-project-ideas">Data analysis project ideas</h3> <ul> <li>Exploratory analysis on <a target="_blank" href="https://towardsdatascience.com/hitchhikers-guide-to-exploratory-data-analysis-6e8d896d3f7e">movies dataset to find the formula to create profitable movies</a> (use it as inspiration), use datasets from healthcare, finance, WHO, past census, Ecommerce, and so on. </li> <li>Build dashboards (jupyter notebooks, excel, <a target="_blank" href="https://public.tableau.com/en-gb/gallery/?tab=viz-of-the-day&type=viz-of-the-day">tableau</a>) using the resources provided above. </li> </ul> <h2 id="heading-4-how-to-learn-about-data-engineering">4. How to Learn About Data Engineering</h2> (Estimated time: 4–5 months) Data engineering underpins the R&D teams by making clean data accessible to research engineers and scientists at big data-driven firms. It is a field in itself and you may decide to skip this part if you want to focus on just the statistical algorithm side of the problems. Responsibilities of a data engineer comprise building an efficient data architecture, streamlining data processing, and maintaining large-scale data systems. 
Engineers use Shell (CLI), SQL, and Python/Scala to create ETL pipelines, automate file system tasks, and optimize database operations for high performance. Another crucial skill is implementing these data architectures, which demands proficiency with cloud service providers like AWS, Google Cloud Platform, Microsoft Azure, and others. <h3 id="heading-resources-to-learn-data-engineering">Resources to learn Data Engineering:</h3> <ul> <li><a target="_blank" href="https://www.udacity.com/course/data-engineer-nanodegree--nd027">Data Engineering Nanodegree by Udacity</a> — as far as a compiled list of resources is concerned, I have not come across a better-structured course on data engineering that covers all the major concepts from scratch. </li> <li><a target="_blank" href="https://www.coursera.org/specializations/gcp-data-machine-learning">Data Engineering, Big Data, and Machine Learning on GCP Specialization</a> — You can complete this specialization offered by Google on Coursera that walks you through all the major APIs and services offered by GCP to build a complete data solution. </li> </ul> <h3 id="heading-data-engineering-project-ideascertifications-to-prepare-for">Data Engineering project ideas/certifications to prepare for:</h3> <ul> <li><a target="_blank" href="https://aws.amazon.com/certification/certified-machine-learning-specialty/">AWS Certified Machine Learning (300 USD)</a> — A proctored exam offered by AWS. It adds some weight to your profile (it doesn’t guarantee anything, though) and requires a decent understanding of AWS services and ML. </li> <li><a target="_blank" href="https://cloud.google.com/certification/data-engineer">Professional Data Engineer</a> — Certification offered by GCP. This is also a proctored exam and assesses your ability to design data processing systems, deploy machine learning models in a production environment, and ensure solution quality and automation. 
</li> </ul> <h2 id="heading-5-how-to-learn-about-applied-statistics-and-mathematics">5. How to Learn About Applied Statistics and Mathematics</h2> (Estimated time: 4–5 months) Statistical methods are a central part of data science. Almost all data science interviews predominantly focus on descriptive and inferential statistics. People often start coding machine learning algorithms without a clear understanding of the underlying statistical and mathematical methods that explain how those algorithms work. This, of course, isn't the best way to go about it. <h3 id="heading-topics-you-should-focus-on-in-applied-statistics-and-math">Topics you should focus on in Applied Statistics and math:</h3> <ul> <li>Descriptive statistics — summarizing data is powerful, though not always sufficient. Learn about estimates of location (mean, median, mode, weighted statistics, trimmed statistics) and variability to describe the data. </li> <li>Inferential statistics — designing hypothesis tests, A/B tests, defining business metrics, analyzing the collected data and experiment results using confidence intervals, p-values, and alpha values. </li> <li>Linear algebra and single- and multivariate calculus to understand loss functions, gradients, and optimizers in machine learning. 
</li> </ul> <h3 id="heading-resources-to-learn-about-statistics-and-math">Resources to learn about Statistics and math:</h3> <ul> <li><a target="_blank" href="https://www.freecodecamp.org/news/free-statistics-course/">Learn college-level statistics</a> in this free 8-hour course on the freeCodeCamp YouTube channel, a foundation course to help you start thinking statistically. </li> <li><a target="_blank" href="https://www.amazon.com/Practical-Statistics-Data-Scientists-Essential/dp/149207294X/ref=sr_1_1?crid=QOOZP96ISCU4&dchild=1&keywords=practical+statistics+for+data+scientists&qid=1610247485&s=books&sprefix=practical+stat%2Cstripbooks%2C362&sr=1-1">[Book] Practical Statistics for Data Scientists</a> (highly recommend) — A thorough guide on all the important statistical methods along with clean and concise applications/examples. </li> <li><a target="_blank" href="https://www.amazon.com/Naked-Statistics-Stripping-Dread-Data/dp/1480590185">[Book] Naked Statistics</a> — a non-technical but detailed guide to understanding the impact of statistics on our routine events, sports, recommendation systems, and many more instances. </li> <li><a target="_blank" href="https://www.udacity.com/course/intro-to-descriptive-statistics--ud827">Intro to Descriptive Statistics</a> — offered by Udacity. Consists of video lectures explaining widely used measures of location and variability (standard deviation, variance, median absolute deviation). </li> <li><a target="_blank" href="https://www.udacity.com/course/intro-to-inferential-statistics--ud201">Inferential Statistics, Udacity</a> — the course consists of video lectures that educate you on drawing conclusions from data that might not be immediately obvious. It focuses on developing hypotheses and using common tests such as t-tests, ANOVA, and regression. 
</li> <li>And here's a <a target="_blank" href="https://www.freecodecamp.org/news/statistics-for-data-science/">guide to statistics for data science</a> to help you get started down the right path. </li> </ul> <h3 id="heading-statistics-project-ideas">Statistics project ideas:</h3> <ul> <li>Solve the exercises provided in the courses above and then try to go through a number of public datasets where you can apply these statistical concepts. Ask questions like “Is there sufficient evidence to conclude that the mean age of mothers giving birth in Boston is over 25 years of age at the 0.05 level of significance?” </li> <li>Try to design and run small experiments with your peers/groups/classes by asking them to interact with an app or answer a question. Run statistical methods on the collected data once you have a good amount of data after a period of time. This might be very hard to pull off but should be very interesting. </li> <li>Analyze stock or cryptocurrency prices and design hypotheses around the average return or any other metric. Determine if you can reject the null hypothesis or fail to do so using critical values. </li> </ul> <h2 id="heading-6-how-to-learn-about-machine-learning-and-ai">6. How to Learn About Machine Learning and AI</h2> (Estimated time: 4–5 months) After grilling yourself and going through all the major aforementioned concepts, you should now be ready to get started with the fancy ML algorithms. There are three major types of learning: <ol> <li>Supervised Learning — includes regression and classification problems. Study simple linear regression, multiple regression, polynomial regression, naive Bayes, logistic regression, KNNs, tree models, ensemble models. Learn about evaluation metrics. </li> <li>Unsupervised Learning — Clustering and dimensionality reduction are the two widely used applications of unsupervised learning. Dive deep into PCA, K-means clustering, hierarchical clustering, and Gaussian mixtures. 
</li> <li>Reinforcement learning (can skip*) — helps you build self-rewarding systems. Learn to optimize rewards, use the TF-Agents library, create Deep Q-networks, and so on. </li> </ol> The majority of ML projects need you to master a number of tasks that I’ve explained in <a target="_blank" href="https://towardsdatascience.com/task-cheatsheet-for-almost-every-machine-learning-project-d0946861c6d0">this blog</a>. <h3 id="heading-resources-to-learn-about-machine-learning">Resources to learn about Machine Learning:</h3> <ul> <li>Here's a free full course on <a target="_blank" href="https://www.freecodecamp.org/news/machine-learning-with-scikit-learn-full-course/">Machine learning in Python with ScikitLearn</a> on the freeCodeCamp YouTube channel. </li> <li>[book] <a target="_blank" href="https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646/ref=sr_1_1?dchild=1&keywords=Hands-On+Machine+Learning+with+Scikit-Learn%2C+Keras%2C+and+TensorFlow%2C+2nd+Edition&qid=1610253356&sr=8-1">Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition</a> — one of my all-time favorite books on machine learning. It covers not only the theoretical mathematical derivations but also the implementation of algorithms through examples. You should solve the exercises given at the end of each chapter. </li> <li><a target="_blank" href="https://www.coursera.org/learn/machine-learning">Machine Learning Course by Andrew Ng</a> — the go-to course for anyone trying to learn machine learning. Hands down! </li> <li><a target="_blank" href="https://www.kaggle.com/learn/intro-to-machine-learning">Introduction to Machine Learning</a> — Interactive course by Kaggle. </li> <li><a target="_blank" href="https://www.kaggle.com/learn/intro-to-game-ai-and-reinforcement-learning">Intro to Game AI and Reinforcement Learning</a> — another interactive course on Kaggle on reinforcement learning. 
</li> </ul> <h3 id="heading-deep-learning-specialization-by-deeplearningaihttpswwwdeeplearningaideep-learning-specialization"><a target="_blank" href="https://www.deeplearning.ai/deep-learning-specialization/">Deep Learning Specialization by deeplearning.ai</a></h3> For those of you who are interested in further diving into deep learning, you can start off by completing this specialization offered by deeplearning.ai and the Hands-On book. This is not as important from a data science perspective unless you are planning to solve a computer vision or NLP problem. Deep learning deserves a dedicated roadmap of its own. I’ll create that with all the fundamental concepts soon. <h2 id="heading-track-your-learning-progress">Track your learning progress</h2> I’ve also created a learning tracker for you on Notion. You can customize it to your needs and use it to track your progress and keep easy access to all the resources and your projects. <a target="_blank" href="https://www.notion.so/Data-Science-learning-tracker-0d3c503280d744acb1b862a1ddd8344e">Find the learning tracker here</a>. Also, here's the video version of this blog: <h3 id="heading-data-science-with-harshithttpswwwyoutubecomcdatasciencewithharshitsubconfirmation1"><a target="_blank" href="https://www.youtube.com/c/DataSciencewithHarshit?sub_confirmation=1">Data Science with Harshit</a></h3> This is just a high-level overview of the wide spectrum of data science. You might want to deep dive into each of these topics and create a low-level concept-based plan for each of the categories. If this tutorial was helpful, you should check out my data science and machine learning courses on <a target="_blank" href="https://www.wiplane.com/">Wiplane Academy</a>. They are comprehensive yet compact and help you build a solid foundation of work to showcase. 
</article> <article> <h1> How to Get Started with Algorithmic Trading in Python </h1> Harshit Tyagi — Mon, 04 Jan 2021 17:56:40 +0000 When I was working as a Systems Development Engineer at an Investment Management firm, I learned that to succeed in quantitative finance you need to be good with mathematics, programming, and data analysis. <a target="_blank" href="https://www.freecodecamp.org/news/algorithmic-trading-in-python/">Algorithmic or Quantitative trading</a> can be defined as the process of designing and developing statistical and mathematical trading strategies. It is an extremely sophisticated area of finance. So, the question is how do you get started with Algorithmic Trading? I am going to walk you through five essential topics that you should study in order to pave your way into this fascinating world of trading. I personally prefer Python as it offers the right degree of customization, ease and speed of development, testing frameworks, and execution speed. Because of this, all these topics are focused on <a target="_blank" href="https://medium.com/datadriveninvestor/getting-starting-with-algorithmic-trading-with-python-1ae169cc1705">Python for Trading</a>. <h2 id="heading-1-learn-python-programminghttpswwwfreecodecamporglearn">1. Learn <a target="_blank" href="https://www.freecodecamp.org/learn/">Python Programming</a></h2> In order to have a flourishing career in Data Science in general, you need solid fundamentals. Whichever language you choose, you should thoroughly understand certain topics in that language. 
Here’s what you should look to master in the Python ecosystem for data science: <ul> <li><a target="_blank" href="https://towardsdatascience.com/ideal-python-environment-setup-for-data-science-cdb03a447de8">Environment Setup</a> — this includes creating a virtual environment, installing required packages, and <a target="_blank" href="https://towardsdatascience.com/the-complete-guide-to-jupyter-notebooks-for-data-science-8ff3591f69a4">working with Jupyter notebooks</a> or Google Colab. </li> <li>Data Structures — some of the most important pythonic data structures are lists, dictionaries, NumPy arrays, tuples, and sets. I’ve collected a <a target="_blank" href="https://medium.com/p/python-fundamentals-for-data-science-6c7f9901e1c8">few examples</a> in the linked article for you to learn these. </li> <li>Object-Oriented Programming — As a quant analyst, you should make sure you are good at writing well-structured code with proper classes defined. You must learn to use objects and their methods while using external packages like Pandas, NumPy, SciPy, and so on. </li> </ul> The freeCodeCamp curriculum also offers a certification in <a target="_blank" href="https://www.freecodecamp.org/learn/data-analysis-with-python/data-analysis-with-python-course/">Data Analysis with Python</a> to help you get started with the basics. <h2 id="heading-learn-how-to-crunch-financial-data">2. Learn How to Crunch Financial Data</h2> Data analysis is a crucial part of finance. Besides learning to handle dataframes using Pandas, there are a few specific topics that you should pay attention to while dealing with trading data. <h3 id="heading-how-to-exploring-data-using-pandas">How to explore data using Pandas</h3> One of the most important packages in the Python data science stack is undoubtedly Pandas. You can accomplish almost all major tasks using the functions defined in the package. 
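As a rough illustration of a few of those functions, the sketch below builds a small price table and applies label-based selection (<code>loc</code>), position-based selection (<code>iloc</code>), expression filtering (<code>query</code>), and a descriptive summary (<code>describe</code>). The dates, prices, and volumes are made up for the example.

```python
import pandas as pd

# Hypothetical end-of-day data: all values are invented for illustration.
prices = pd.DataFrame(
    {"close": [10.0, 11.0, 9.0, 12.0], "volume": [100, 150, 90, 200]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"]),
)

# Label-based slice on the DatetimeIndex (both end points are included).
first_two_days = prices.loc["2021-01-04":"2021-01-05"]

# Position-based selection: the last row, regardless of its label.
last_row = prices.iloc[-1]

# Expression-based filtering on column values.
busy_days = prices.query("volume > 120")

# Descriptive statistics (count, mean, std, min, quartiles, max) per column.
summary = prices.describe()

print(first_two_days)
print(busy_days)
print(summary.loc["mean"])
```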
Focus on creating dataframes, filtering (<code>loc</code>, <code>iloc</code>, <code>query</code>), descriptive statistics (summary), join/merge, grouping, and subsetting. <h3 id="heading-how-to-deal-with-time-series-data">How to deal with time-series data</h3> Trading data is all about time-series analysis. You should learn to resample or reindex the data to change its frequency, for example from minutes to hours or from end-of-day OHLC data to end-of-week data. For example, you can convert a 1-minute time series into a 3-minute time series using the resample function: <pre><code class="lang-python"># Assumes df_1min is a DataFrame with a DatetimeIndex and OPEN/HIGH/LOW/CLOSE columns.
df_3min = df_1min.resample('3min', label='left').agg(
    {'OPEN': 'first', 'HIGH': 'max', 'LOW': 'min', 'CLOSE': 'last'}
)
</code></pre> <h2 id="heading-3-how-to-write-fundamental-trading-algorithms">3. How to Write Fundamental Trading Algorithms</h2> A career in quantitative finance requires a solid understanding of statistical hypothesis testing and mathematics. A good grip on concepts like multivariate calculus, linear algebra, and probability theory will help you lay a good foundation for designing and writing algorithms. You can start by calculating moving averages on stock price data, writing simple algorithmic strategies like a moving average crossover or a mean reversion strategy, and learning about relative strength trading. After taking this small yet significant leap of practicing and understanding how basic statistical algorithms work, you can look into the more sophisticated areas of machine learning techniques. These require a deeper understanding of statistics and mathematics. Here are two books you can start with: <ul> <li><a target="_blank" href="http://www.amazon.com/gp/product/0470284889/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0470284889&linkCode=as2&tag=quant0f-20">Quantitative Trading: How to build your own Algorithmic Trading Business</a> —By Dr. 
Ernest Chan </li> <li>Book on <a target="_blank" href="http://www.amazon.com/gp/product/0956399207/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0956399207&linkCode=as2&tag=quant0f-20">Algorithmic Trading and DMA</a> — By Barry Johnson </li> </ul> And here are a couple of courses that will help you get started with Python for Trading and that cover most of the topics that I’ve captured here: <ul> <li><a target="_blank" href="https://quantra.quantinsti.com/course/python-for-trading?utm_source=harshit_tyagi&utm_medium=affiliate&utm_campaign=python_finance_article">Python for Trading by Multi Commodity Exchange offered by Quantra</a> </li> <li><a target="_blank" href="https://www.freecodecamp.org/news/algorithmic-trading-using-python-course/">Algorithmic Trading with Python</a> – a free 4-hour course from Nick McCullum on the freeCodeCamp YouTube channel </li> </ul> You can get 10% off the Quantra course by using my code HARSHIT10. <h2 id="heading-4-learn-about-backtesting">4. Learn About Backtesting</h2> Once you are done coding your trading strategy, you can’t simply put it to the test in the live market with actual capital, right? The next step is to expose this strategy to a stream of historical trading data, which would generate trading signals. The executed trades would then accrue an associated profit or loss (P&L), and the accumulation of all the trades would give you the total P&L. This is called backtesting. Backtesting requires you to be well-versed in many areas, like mathematics, statistics, software engineering, and market microstructure. Here are some concepts you should learn to get a decent understanding of backtesting: <ul> <li>You can start by understanding technical indicators. Explore the Python package called TA-Lib to use these indicators. </li> <li>Employ momentum indicators like parabolic SAR, and try to calculate the transaction cost and slippage. 
</li> <li>Learn to plot cumulative strategy returns and study the overall performance of the strategy. </li> <li>A very important concept that affects the performance of the backtest is bias. You should learn about optimization bias, look-ahead bias, psychological tolerance, and survivorship bias. </li> </ul> <h2 id="heading-5-performance-metrics-how-to-evaluate-trading-strategies">5. Performance Metrics — How to Evaluate Trading Strategies</h2> It’s important for you to be able to explain your strategy concisely. If you don’t understand your strategy, chances are that any external change, such as a new regulation or a regime shift, will make it behave abnormally. Once you understand the strategy well, the following performance metrics can help you learn how good or bad the strategy actually is: <ul> <li>Sharpe Ratio — heuristically characterises the risk/reward ratio of the strategy. It quantifies the return you can accrue for the level of volatility undergone by the equity curve. </li> <li>Volatility — quantifies the “risk” related to the strategy. The Sharpe ratio also embodies this characteristic. Higher volatility of an underlying asset often leads to higher risk in the equity curve, and that results in smaller Sharpe ratios. </li> <li>Maximum Drawdown — the largest overall peak-to-trough percentage drop on the equity curve of the strategy. Maximum drawdowns are often studied in conjunction with momentum strategies, which are prone to them. Learn to calculate it using the <code>numpy</code> library. </li> <li>Capacity/Liquidity — determines the scalability of the strategy to further capital. Many funds and investment management firms suffer from these capacity issues when strategies increase in capital allocation. </li> <li>CAGR (compound annual growth rate) — measures the average rate of a strategy’s growth over a period of time. 
It is calculated by the formula: <code>(cumulative strategy returns)^(252 / number of trading days) - 1</code> </li> </ul> <h2 id="heading-further-resources">Further Resources</h2> This article served as a suggested curriculum to help you get started with algorithmic trading. It is a good list of concepts to master. Now, the question is: what resources can help you get up to speed with these topics? Here are a few classic books and useful courses with assignments and exercises that I found helpful: <ul> <li>[Course] <a target="_blank" href="https://quantra.quantinsti.com/course/python-for-trading?utm_source=harshit_tyagi&utm_medium=affiliate&utm_campaign=python_finance_article">Python for Trading Course by Multi Commodity Exchange offered by Quantra</a> [PromoCode: HARSHIT10] </li> <li>[Book] <a target="_blank" href="http://www.amazon.com/gp/product/0470284889/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0470284889&linkCode=as2&tag=quant0f-20">Quantitative Trading: How to Build Your Own Algorithmic Trading Business</a> — Ernest Chan </li> <li>[Course] <a target="_blank" href="https://quantra.quantinsti.com/courses?utm_source=harshit_tyagi&utm_medium=affiliate&utm_campaign=python_finance_article">Dr. Ernest Chan’s trading courses on the Quantra Platform</a> </li> <li>[Book] <a target="_blank" href="https://www.amazon.in/Python-Finance-Yves-Hilpisch/dp/1491945281">Python for Finance — Yves Hilpisch</a> </li> <li>[Journals]: <a target="_blank" href="http://arxiv.org/archive/q-fin">arXiv</a>, <a target="_blank" href="http://onlinelibrary.wiley.com/journal/10.1111/%28ISSN%291467-9965">Wiley’s Mathematical Finance</a>, <a target="_blank" href="http://www.risk.net/type/journal/source/journal-of-computational-finance">computational finance</a>. 
</li> </ul> <h3 id="heading-data-science-with-harshithttpswwwyoutubecomcdatasciencewithharshitsubconfirmation1"><a target="_blank" href="https://www.youtube.com/c/DataSciencewithHarshit?sub_confirmation=1">Data Science with Harshit</a></h3> <div class="embed-wrapper"> </div> With this channel, I am planning to roll out a couple of <a target="_blank" href="https://towardsdatascience.com/hitchhikers-guide-to-learning-data-science-2cc3d963b1a2?source=---------8------------------">series covering the entire data science space</a>. Here is why you should subscribe to the <a target="_blank" href="https://www.youtube.com/channel/UCH-xwLTKQaABNs2QmGxK2bQ">channel</a>: <ul> <li>These series will cover quality tutorials on each of the in-demand topics and subtopics, like <a target="_blank" href="https://towardsdatascience.com/python-fundamentals-for-data-science-6c7f9901e1c8?source=---------5------------------">Python fundamentals for Data Science</a>. </li> <li>Explained <a target="_blank" href="https://towardsdatascience.com/practical-reasons-to-learn-mathematics-for-data-science-1f6caec161ea?source=---------9------------------">Mathematics and derivations</a> of why we do what we do in ML and Deep Learning. </li> <li><a target="_blank" href="https://www.youtube.com/watch?v=a2pkZCleJwM&t=2s">Podcasts with Data Scientists and Engineers</a> at Google, Microsoft, Amazon, etc., and CEOs of big data-driven companies. </li> <li><a target="_blank" href="https://towardsdatascience.com/building-covid-19-analysis-dashboard-using-python-and-voila-ee091f65dcbb?source=---------2------------------">Projects and instructions</a> to implement the topics learned so far. 
Learn about new certifications, bootcamps, and resources to crack those certifications, like this <a target="_blank" href="https://youtu.be/yapSsspJzAw">TensorFlow Developer Certificate Exam by Google.</a> </li> </ul> If this tutorial was helpful, you should check out my data science and machine learning courses on <a target="_blank" href="https://www.wiplane.com/">Wiplane Academy</a>. They are comprehensive yet compact and help you build a solid foundation of work to showcase. </article> </main></body></html>
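
As a companion to the performance metrics in section 5 of the trading article above, here is a minimal numpy sketch of CAGR, volatility, the Sharpe ratio, and maximum drawdown. It is an illustration under stated assumptions, not the author's implementation: the function names and the sample returns are hypothetical, the risk-free rate defaults to zero, and annualizing with the square root of 252 is a common convention rather than something the article specifies. The CAGR function follows the article's formula, reading "cumulative strategy returns" as the final growth factor of the equity curve.

```python
import numpy as np

TRADING_DAYS = 252  # trading days per year, as in the article's CAGR formula


def cagr(daily_returns):
    """(cumulative strategy returns)^(252 / number of trading days) - 1."""
    r = np.asarray(daily_returns, dtype=float)
    growth = np.prod(1.0 + r)  # final equity divided by starting equity
    return growth ** (TRADING_DAYS / r.size) - 1.0


def annualized_volatility(daily_returns):
    """Sample standard deviation of daily returns, scaled to one year (assumed convention)."""
    r = np.asarray(daily_returns, dtype=float)
    return r.std(ddof=1) * np.sqrt(TRADING_DAYS)


def sharpe_ratio(daily_returns, daily_risk_free=0.0):
    """Annualized mean excess return per unit of volatility (risk-free rate is an assumption)."""
    excess = np.asarray(daily_returns, dtype=float) - daily_risk_free
    return np.sqrt(TRADING_DAYS) * excess.mean() / excess.std(ddof=1)


def max_drawdown(daily_returns):
    """Largest peak-to-trough drop of the equity curve, as a negative fraction."""
    r = np.asarray(daily_returns, dtype=float)
    # Start the equity curve at 1.0 so a loss on the first day counts as a drawdown.
    equity = np.concatenate(([1.0], np.cumprod(1.0 + r)))
    running_peak = np.maximum.accumulate(equity)
    return (equity / running_peak - 1.0).min()


# Hypothetical daily strategy returns, generated only to exercise the functions.
rng = np.random.default_rng(seed=0)
sample_returns = rng.normal(loc=0.0005, scale=0.01, size=TRADING_DAYS)

print("CAGR:", cagr(sample_returns))
print("Volatility:", annualized_volatility(sample_returns))
print("Sharpe ratio:", sharpe_ratio(sample_returns))
print("Max drawdown:", max_drawdown(sample_returns))
```

In practice you would pass the daily returns produced by your own backtest instead of the random sample. Capacity and liquidity are left out because they depend on market data that a return series alone does not contain.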

Harshit Tyagi - freeCodeCamp.org

How to Start Building Projects with LLMs

The Best Way to Learn is to BUILD!

Here’s What We’ll Cover:

What Should Be Your First Project?

Project #1: Summarise YouTube Videos

Setup and Requirements

Introduction to Document Loaders

Categories of Document Loaders

Integration Capabilities

YoutubeLoader from LangChain to Get Transcript:

Process the YouTube Transcript

For this, LangChain offers PromptTemplate:

How to Use LLMChain / LCEL for Summarization

How to serve the YT summariser on WhatsApp

Project #2 — Build a Bot that Can Handle Different Types of User Queries

Project #3 — RAG-Powered Support Bot

Conclusion

What to Know Before Taking Google's Machine Learning or Data Science Course

Programming for Complete Beginners in Data Science and Machine Learning

1. Essential Python Programming for Machine Learning

2. Essential Mathematics for Data Science and Machine Learning

Use linear algebra to represent data

Use calculus to train ML models

3. Essential Statistics for Data Science

Describing data — from data to insights

Quantifying uncertainty

How to Learn these Foundational DS and ML Concepts

Problems with Data Science or ML Courses

Wiplane Academy — wiplane.com

How to Train BPE, WordPiece, and Unigram Tokenizers from Scratch using Hugging Face

BPE Algorithm – a Frequency-based Model

Unigram Algorithm – a Probability-based Model

WordPiece Algorithm

How to Train the BPE, Unigram, and WordPiece Algorithms

How to Train the Datasets

Import the Required Models and Trainers

How to Automate Training and Tokenization

Step 1 - Prepare the tokenizer

Step 2 - Train the tokenizer

Step 3 - Tokenize the input string

Analysis of the output:

How to Compare the Tokens

Closing Thoughts and Next Steps

References and Notes

Connect with me

The Evolution of Tokenization – Byte Pair Encoding in NLP

Main Components of NLP

What is Tokenization?

Why do we need a Tokenizer?

Different ways to tokenize text

Word-based tokenization

Problems with Word tokenizer

Character-based tokenization

Problems with character-based models

Subword Tokenization

Problems with the subword tokenization algorithm:

Byte Pair Encoding (BPE) Algorithm

Step 1: Add word identifiers and calculate word frequency

Step 2: Split the word into characters and then calculate the character frequency

Step 3: Merge the most frequently occurring consecutive byte pairings

Step 4 - Iterate n times to find the best (in terms of frequency) pairs to encode and then concatenate them to find the subwords

How to improve the BPE algorithm

Do we use BPE in BERTs or GPTs?

Summary

References and Notes

Use Python, SpaCy, and Streamlit to Build a Structured Financial Newsfeed

Pre-requisites

What you’ll need to get started:

Goals of the Project

Step 1: How to extract the trending stocks news data