Bhavishya Pandit - freeCodeCamp.org

Which Tools to Use for LLM-Powered Applications: LangChain vs LlamaIndex vs NIM

Bhavishya Pandit — Mon, 21 Oct 2024 17:34:21 +0000

If you’re considering building an application powered by a Large Language Model, you may wonder which tool to use.

Well, two well-established frameworks—LangChain and LlamaIndex—have gained significant attention for their unique features and capabilities. But recently, NVIDIA NIM has emerged as a new player in the field, adding its tools and functionalities to the mix.

In this article, we'll compare LangChain, LlamaIndex, and NVIDIA NIM to help you determine which framework best fits your specific use case.

Introduction to LangChain

According to LangChain’s official docs, “LangChain is a framework for developing applications powered by language models”.

We can elaborate a bit on that and say that LangChain is a versatile framework designed for building data-aware and agent-driven applications. It offers a collection of components and pre-built chains that simplify working with large language models (LLMs) like GPT.

Whether you're just starting or you’re an experienced developer, LangChain supports both quick prototyping and full-scale production applications.

You can use LangChain to simplify the entire development cycle of an LLM application:

Development: LangChain offers open-source building blocks, components and third-party integrations for building applications.
Production: LangSmith, a tool from LangChain, helps monitor and evaluate chains for continuous optimization and deployment.

Deployment: You can use LangGraph Cloud to turn your LLM applications into production-ready APIs.

LangChain offers several open-source libraries for development and production purposes. Let’s take a look at some of them.

LangChain Components

LangChain Components are high-level APIs that simplify working with LLMs. You can compare them with Hooks in React and functions in Python.

These components are designed to be intuitive and easy to use. A key component is the LLM interface, which seamlessly connects to providers like OpenAI, Cohere, and Hugging Face, allowing you to effortlessly query models.

In this example, we utilize the langchain_google_vertexai library to interact with Google’s Vertex AI, specifically leveraging the Gemini 1.5 Flash model. Let’s break down what the code does:

from langchain_google_vertexai import ChatVertexAI

llm = ChatVertexAI(model="gemini-1.5-flash")
llm.invoke(
  "Who was Napoleon?"
)

Importing the ChatVertexAI Class:

The first step is to import the ChatVertexAI class, which allows us to communicate with the Google Vertex AI platform. This library is part of the LangChain ecosystem, designed to integrate large language models (LLMs) seamlessly into applications.

Instantiating the LLM (Large Language Model):

llm = ChatVertexAI(model="gemini-1.5-flash")

Here, we create an instance of the ChatVertexAI class. We specify the model we want to use, which in this case is Gemini 1.5 Flash. This version of Gemini is optimized for fast responses while still maintaining high-quality language generation.

Sending a Query to the Model:

llm.invoke("Who was Napoleon?")

Finally, we use the invoke method to send a question to the Gemini model. In this example, we ask the question, “Who was Napoleon?”. The model processes the query and provides a response, which would typically include information about Napoleon’s identity, historical significance, and key accomplishments.

Chains

LangChain defines Chains as “sequences of calls”. To understand how chains work, we need to know what LCEL is.

LCEL stands for LangChain Expression Language, which is a declarative way to easily compose chains together – that’s it. LCEL just helps us multiple combine chains in long chains.

LangChain supports two types of chains

LCEL Chains: In this case, LangChain offers a higher-level constructor method. But all that is being done under the hood is constructing a chain with LCEL.

For example, create_stuff_documents_chain is an LCEL Chain that takes a list of documents and formats them all into a prompt, then passes that prompt to an LLM. It passes ALL documents, so you should make sure it fits within the context window of the LLM you are using.
Legacy Chains: Legacy Chains are constructed by subclassing from a legacy Chain class. These chains do not use LCEL under the hood but are the standalone classes.

For example, APIChain: this chain uses an LLM to convert a query into an API request, then executes that request, gets back a response, and then passes that request to an LLM to respond.

Legacy Chains were standard practices before LCEL. Once all the legacy chains get an LCEL alternative, they will become obsolete and unsupported.

LangChain Quickstart

!pip install -U langchain-google-genai

%env GOOGLE_API_KEY="your-api-key"

from langchain_google_genai import ChatGoogleGenerativeAI

1. Using LangChain with Google's Gemini Pro Model

This code demonstrates how to integrate Google’s Gemini Pro model with LangChain for natural language processing tasks.

pip install -U langchain-google-genai

First, install the langchain-google-genai package, which allows you to interact with Google’s Generative AI models via LangChain. The -U flag ensures you get the latest version.

2. Setting Up Your API Key

%env GOOGLE_API_KEY="your-api-key"

You need to authenticate your API requests. Use your Google API key by setting it as an environment variable. This ensures secure communication with Google’s services.

3. Accessing the Gemini Pro Model

from langchain_google_genai import ChatGoogleGenerativeAI

The ChatGoogleGenerativeAI class is imported from the langchain-google-genai package. We instantiate it, specifying Gemini Pro—a powerful version of Google’s generative models known for producing high-quality language outputs.

4. Querying the Model

llm = ChatGoogleGenerativeAI(model="gemini-pro")
llm.invoke("Who was Alexander the Great?")

Finally, you invoke the model by passing a query. In this example, the query is asking for information about Alexander the Great. The model will return a detailed response, such as his historical background and significance.

Introduction to LlamaIndex

LlamaIndex, previously known as GPT Index, is a data framework tailored for large language model (LLM) applications. Its core purpose is to ingest, organize, and access private or domain-specific data, offering a suite of tools that simplify the integration of such data into LLMs.

Simply put, LLMs are very strong models but they don't work as well when used with smaller data bundles. LlamaIndex helps us in integrating our data into LLMS to serve specific needs.

LlamaIndex works using several components together. Let's take a look at them one by one.

Data Connectors

LlamaIndex supports ingesting data from multiple sources, such as APIs, PDFs, and SQL databases. These connectors streamline the process of integrating external data into LLM-based applications.

from llama_index.core import VectorStoreIndex, download_loader

from llama_index.readers.google import GoogleDocsReader

gdoc_ids = ["your-google_doc-id"]
loader = GoogleDocsReader()

documents = loader.load_data(document_ids=gdoc_ids)
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
query_engine.query("Where did the author go to school?")

This code uses LlamaIndex to load and query Google Docs. It imports necessary classes, specifies Google Doc IDs, and loads the document content using GoogleDocsReader. The content is then indexed as vectors with VectorStoreIndex, allowing for efficient querying. Finally, it creates a query engine to retrieve answers from the indexed documents based on natural language questions, such as "Where did the author go to school?"

Data Indexing

The framework organizes ingested data into intermediate formats designed to optimize how LLMs access and process information, ensuring both efficiency and performance.

Indexes are built from documents. They are used to build Query Engines and Chat Engines which enable question & answer and chat over the data. In LlamaIndex indexes store data in Node objects. According to the docs:

“A Node corresponds to a chunk of text from a Document. LlamaIndex takes in Document objects and internally parses/chunks them into Node objects.” (Source)

Engines

LlamaIndex includes various engines for interacting with the data via natural language. These include engines for querying knowledge, facilitating conversational interactions, and data agents that enhance LLM-powered workflows.

Advantages of LlamaIndex

Makes it easy to bring in data from different sources, such as APIs, PDFs, and databases like SQL/NoSQL, to be used in applications powered by large language models (LLMs).
Lets you store and organize private data, making it ready for different uses, while smoothly working with vector stores and databases.
Comes with a built-in query feature that allows you to get smart, data-driven answers based on your input.

LlamaIndex Quickstart

In this section, you’ll learn how to use LlamaIndex to create a queryable index from a collection of documents and interact with OpenAI’s API for querying purposes. This is the code to do this:

pip install llama-index
%env OPENAI_API_KEY = "your-api-key"

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

Now let’s break it down step by step:

1. Install the LlamaIndex Package

pip install llama-index

You start by installing the llama-index package, which provides tools for building vector-based document indices that allow for natural language queries.

2. Set the OpenAI API Key

%env OPENAI_API_KEY = "your-api-key"

Here, the OpenAI API key is set as an environment variable to authenticate and allow communication with OpenAI’s API. Replace "your-api-key" with your actual API key.

3. Importing Necessary Components

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

The VectorStoreIndex class is used to create an index that will store vectors representing the document contents, while the SimpleDirectoryReader class is used to read documents from a specified directory.

4. Loading Documents

documents = SimpleDirectoryReader("data").load_data()

The SimpleDirectoryReader loads documents from the directory named "data". The load_data() method reads the contents and processes them so they can be used to create the index.

5. Creating the Vector Store Index

index = VectorStoreIndex.from_documents(documents)

A VectorStoreIndex is created from the documents. This index converts the text into vector embeddings that capture the semantic meaning of the text, making it easier for AI models to interpret and query.

6. Building the Query Engine

query_engine = index.as_query_engine()

The query engine is created by converting the vector store index into a format that can be queried. This engine is the component that allows you to run natural language queries against the document index.

7. Querying the Engine

response = query_engine.query("Who is the protagonist in the story?")

Here, a query is made to the index asking for the protagonist of the story. The query engine processes the request and uses the document embeddings to retrieve the most relevant information from the indexed documents.

8. Displaying the Response

Finally, the response from the query engine, which contains the answer to the query, is printed.

Make sure your directory structure looks like this:

|----- main.py
|----- data
|----- Matilda.txt

NVIDIA NIM

Nvidia has recently launched their own set of tools for developing LLM applications called NIM. NIM stands for “Nvidia Inference Microservice”.

NVIDIA NIM is a collection of simple tools (microservices) that help quickly set up and run AI models on the cloud, in data centres, or on workstations.

NIMs are organized by model type. For instance, NVIDIA NIM for large language models (LLMs) makes it easy for businesses to use advanced language models for tasks like understanding and processing natural language.

How NIMs Work

When you first set up a NIM, it checks your hardware and finds the best version of the model from its library to match your system.

If you have certain NVIDIA GPUs (listed in the Support Matrix), NIM will download and use an optimized version of the model with the TRT-LLM library for fast performance. For other NVIDIA GPUs, it will download a non-optimized model and run it using the vLLM library.

So the main idea is to provide optimized AI models for faster local development and a cloud environment to host it for production.

Features of NIM

NIM simplifies the process of running AI models by handling technical details like execution engines and runtime operations for you. It’s also the fastest option, whether using TRT-LLM, vLLM, or other methods.

NIM offers the following high-performance features:

Scalable Deployment: It performs well and can easily grow from handling a few users to millions without issues.
Advanced Language Model Support: NIM comes with pre-optimized engines for various of the latest language model designs.
Flexible Integration: Adding NIM to your existing apps and workflows is easy. Developers can use an OpenAI API-compatible system with extra NVIDIA features for more capabilities.
Enterprise-Grade Security: NIM prioritizes security by using safetensors, continuously monitoring for vulnerabilities (CVEs), and regularly applying security updates.

NIM Quickstart

1. Generate an NGC API key

An NGC API key is required to access NGC resources and a key can be generated here: https://org.ngc.nvidia.com/setup/personal-keys.

2. Export the API key

export NGC_API_KEY=

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:

echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

4. List available NIMs

ngc registry image list --format_type csv nvcr.io/nim/*

5. Launch NIM

The following command launches a Docker container for the llama3-8b-instruct model. To launch a container for a different NIM, replace the values of Repository and Latest_Tag with values from the previous image list command and change the value of CONTAINER_NAME to something appropriate.

# Choose a container name for bookkeeping
export CONTAINER_NAME=Llama3-8B-Instruct

# The container name from the previous ngc registgry image list command
Repository=nim/meta/llama3-8b-instruct
Latest_Tag=1.1.2

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/${Repository}:${Latest_Tag}"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

6. Usecase: OpenAI Completion Request

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
prompt = "Once upon a time"
response = client.completions.create(
    model="meta/llama3-8b-instruct",
    prompt=prompt,
    max_tokens=16,
    stream=False
)
completion = response.choices[0].text
print(completion)

Which Tool to Use?

So you may be wondering: which of these should you use for your specific use case? Well, the answer to this depends on what you’re working on.

LangChain is an excellent choice if you're looking for a versatile framework to integrate multiple tools or build intelligent agents that can handle several tasks simultaneously.

But if your main focus is smart search and data retrieval, LlamaIndex is the better option, as it specializes in indexing and retrieving information for LLMs, making it ideal for deep data exploration. While LangChain can manage indexing and retrieval, LlamaIndex is optimized for these tasks and offers easier data ingestion with its plugins and connectors.

On the other hand, if you're aiming for high-performance model deployment, NVIDIA NIM is a great solution. NIM abstracts the technical details, offers fast inference with tools like TRT-LLM and vLLM, and provides scalable deployment with enterprise-grade security.

So, for apps needing indexing and retrieval, LlamaIndex is recommended. For deploying LLMs at scale, NIM is a powerful choice. Otherwise, LangChain alone is sufficient for working with LLMs.

Conclusion

In this article, we compared three powerful tools for working with large language models: LangChain, LlamaIndex, and NVIDIA NIM. We explored each tool’s unique strengths, such as LangChain's versatility for integrating multiple components, LlamaIndex's focus on efficient data indexing and retrieval, and NVIDIA NIM's high-performance model deployment capabilities.

We discussed key features like scalability, ease of integration, and optimized performance, showing how these tools address different needs within the AI ecosystem.

While each tool has its challenges, such as handling complex infrastructures or optimizing for specific tasks, LangChain, LlamaIndex, and NVIDIA NIM offer valuable solutions for building and scaling AI-powered applications.

How to Build a RAG Pipeline with LlamaIndex

Bhavishya Pandit — Fri, 30 Aug 2024 13:30:49 +0000

Large Language Models are everywhere these days – think ChatGPT – but they have their fair share of challenges.

One of the biggest challenges faced by LLMs is hallucination. This occurs when the model generates text that is factually incorrect or misleading, often based on patterns it has learned from its training data. So how can Retrieval-Augmented Generation, or RAG, help mitigate this issue?

By retrieving relevant information from a more vast, wider knowledge base, RAG ensures that the LLM's responses are grounded in real-world facts. This significantly reduces the likelihood of hallucinations and improves the overall accuracy and reliability of the generated content.

What is Retrieval Augmented Generation (RAG)?

RAG is a technique that combines information retrieval with language generation. Think of it as a two-step process:

Retrieval: The model first retrieves relevant information from a large corpus of documents based on the user's query.
Generation: Using this retrieved information, the model then generates a comprehensive and informative response.

Why use LlamaIndex for RAG?

LlamaIndex is a powerful framework that simplifies the process of building RAG pipelines. It provides a flexible and efficient way to connect retrieval components (like vector databases and embedding models) with generation components (like LLMs).

Some of the key benefits of using Llama-Index include:

Modularity: It allows you to easily customize and experiment with different components.
Scalability: It can handle large datasets and complex queries.
Ease of use: It provides a high-level API that abstracts away much of the underlying complexity.

What You'll Learn Here:

In this article, we will delve deeper into the components of a RAG pipeline and explore how you can use LlamaIndex to build these systems.

We will cover topics such as vector databases, embedding models, language models, and the role of LlamaIndex in connecting these components.

Understanding the Components of a RAG Pipeline

Here's a diagram that'll help familiarize you with the basics of RAG architecture:

This diagram is inspired by this article. Let's go through the key pieces.

Components of RAG

Retrieval Component:

Vector Databases: These databases are optimized for storing and searching high-dimensional vectors. They are crucial for efficiently finding relevant information from a vast corpus of documents.
Embedding Models: These models convert text into numerical representations or embeddings. These embeddings capture the semantic meaning of the text, allowing for efficient comparison and retrieval in vector databases.

A vector is a mathematical object that represents a quantity with both magnitude (size) and direction. In the context of RAG, embeddings are high-dimensional vectors that capture the semantic meaning of text. Each dimension of the vector represents a different aspect of the text's meaning, allowing for efficient comparison and retrieval.

Generation Component:

Language Models: These models are trained on massive amounts of text data, enabling them to generate human-quality text. They are capable of understanding and responding to prompts in a coherent and informative manner.

The RAG Flow

Query Submission: A user submits a query or question.
Embedding Creation: The query is converted into an embedding using the same embedding model used for the corpus.
Retrieval: The embedding is searched against the vector database to find the most relevant documents.
Contextualization: The retrieved documents are combined with the original query to form a context.
Generation: The language model generates a response based on the provided context.

LamaIndex

LlamaIndex plays a crucial role in connecting the retrieval and generation components. It acts as an index that maps queries to relevant documents. By efficiently managing the index, LlamaIndex ensures that the retrieval process is fast and accurate.

Prerequisites

We will be using Python and IBM watsonx via LlamaIndex in this article. You should have the following on your system before getting started:

Python 3.9+
IBM watsonx project and API key
Curiosity to learn

Let's Get Started!

In this article, we will be using LlamaIndex to make a simple RAG Pipeline.

Let's create a virtual environment for Python using the following command in your terminal: python -m venv venv . This will create a virtual environment (venv) for your project. If you are a Windows user you can activate it using .\venv\Scripts\activate, and Mac users can activate it with source venv/bin/activate.

Now let's install the packages:

pip install wikipedia llama-index-llms-ibm llama-index-embeddings-huggingface

Once these packages are installed, you will need watsonx.ai's API key as well. This in turn will help you use LLMs via LlamaIndex.

To learn about how to get your watsonx.ai API keys, click here. You need the project ID and API Key to be able to work on the "Generation" aspect of RAG. Having them will help you make LLM calls through watsonx.ai.

import wikipedia

# Search for a specific page
page = wikipedia.page("Artificial Intelligence")

# Access the content
print(page.content)

Now let's save the page content to a text document. We are doing it so that we can access it later. You can do this using the below code:

import os

# Create the 'Document' directory if it doesn't exist
if not os.path.exists('Document'):
    os.mkdir('Document')

# Open the file 'AI.txt' in write mode with UTF-8 encoding
with open('Document/AI.txt', 'w', encoding='utf-8') as f:
    # Write the content of the 'page' object to the file
    f.write(page.content)

Now we'll be using watsonx.ai via LlamaIndex. It will help us generate responses based on the user's query.

Note: Make sure to replace the parameters WATSONX_APIKEY and project_id with your values in the below code:

import os
from llama_index.llms.ibm import WatsonxLLM
from llama_index.core import SimpleDirectoryReader, Document


# Define a function to generate responses using the WatsonxLLM instance
def generate_response(prompt):
    """
    Generates a response to the given prompt using the WatsonxLLM instance.

    Args:
        prompt (str): The prompt to provide to the large language model.

    Returns:
        str: The generated response from the WatsonxLLM.
    """

    response = watsonx_llm.complete(prompt)
    return response

# Set the WATSONX_APIKEY environment variable (replace with your actual key)
os.environ["WATSONX_APIKEY"] = 'YOUR_WATSONX_APIKEY'  # Replace with your API key

# Define model parameters (adjust as needed)
temperature = 0
max_new_tokens = 1500
additional_params = {
    "decoding_method": "sample",
    "min_new_tokens": 1,
    "top_k": 50,
    "top_p": 1,
}

# Create a WatsonxLLM instance with the specified model, URL, project ID, and parameters
watsonx_llm = WatsonxLLM(
    model_id="meta-llama/llama-3-1-70b-instruct",
    url="https://us-south.ml.cloud.ibm.com",
    project_id="YOUR_PROJECT_ID",
    temperature=temperature,
    max_new_tokens=max_new_tokens,
    additional_params=additional_params,
)

# Load documents from the specified directory
documents = SimpleDirectoryReader(
    input_files=["Document/AI.txt"]
).load_data()

# Combine the text content of all documents into a single Document object
combined_documents = Document(text="\n\n".join([doc.text for doc in documents]))

# Print the combined document
print(combined_documents)

Here's a breakdown of the parameters:

temperature = 0: This setting makes the model generate the most likely text sequence, leading to a more deterministic and predictable output. It's like telling the model to stick to the most common words and phrases.
max_new_tokens = 1500: This limits the generated text to a maximum of 1500 new tokens (words or parts of words).
additional_params:
- decoding_method = "sample": This means the model will generate text randomly based on the probability distribution of each token.
- min_new_tokens = 1: Ensures that at least one new token is generated, preventing the model from repeating itself.
- top_k = 50: This limits the model's choices to the 50 most likely tokens at each step, making the output more focused and less random.
- top_p = 1: This sets the nucleus sampling probability to 1, meaning all tokens with a probability greater than or equal to the top_p value will be considered.

You can tweak these parameters for experimentation and see how they affect your response. Now we'll be building and loading a vector store index from the given document. But first, let's understand what it is.

Understanding Vector Store Indexes

A vector store index is a specialized data structure designed to efficiently store and retrieve high-dimensional vectors. In the context of the Llama Index, these vectors represent the semantic embeddings of documents.

Key characteristics of vector store indexes:

High-dimensional vectors: Each document is represented as a high-dimensional vector, capturing its semantic meaning.
Efficient retrieval: Vector store indexes are optimized for fast similarity search, allowing you to quickly find documents that are semantically similar to a given query.
Scalability: They can handle large datasets and scale efficiently as the number of documents grows.

How Llama Index uses vector store indexes:

Document Embedding: Documents are first converted into high-dimensional vectors using a language model like Llama.
Index Creation: The embeddings are stored in a vector store index.
Query Processing: When a user submits a query, it is also converted into a vector. The vector store index is then used to find the most similar documents based on their embeddings.
Response Generation: The retrieved documents are used to generate a relevant response.

In the below code, you'll come across the word "chunk". A chunk is a smaller, manageable unit of text extracted from a larger document. It's typically a paragraph or a few sentences long. They are used to make the retrieval and processing of information more efficient, especially when dealing with large documents.

By breaking down documents into chunks, RAG systems can focus on the most relevant parts and generate more accurate and concise responses.

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import VectorStoreIndex, load_index_from_storage
from llama_index.core import Settings
from llama_index.core import StorageContext

def get_build_index(documents, embed_model="local:BAAI/bge-small-en-v1.5", save_dir="./vector_store/index"):
    """
    Builds or loads a vector store index from the given documents.

    Args:
        documents (list[Document]): A list of Document objects.
        embed_model (str, optional): The embedding model to use. Defaults to "local:BAAI/bge-small-en-v1.5".
        save_dir (str, optional): The directory to save or load the index from. Defaults to "./vector_store/index".

    Returns:
        VectorStoreIndex: The built or loaded index.
    """

    # Set index settings
    Settings.llm = watsonx_llm
    Settings.embed_model = embed_model
    Settings.node_parser = SentenceSplitter(chunk_size=1000, chunk_overlap=200)
    Settings.num_output = 512
    Settings.context_window = 3900

    # Check if the save directory exists
    if not os.path.exists(save_dir):
        # Create and load the index
        index = VectorStoreIndex.from_documents(
            [documents], service_context=Settings
        )
        index.storage_context.persist(persist_dir=save_dir)
    else:
        # Load the existing index
        index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=save_dir),
            service_context=Settings,
        )
    return index

# Get the Vector Index
vector_index = get_build_index(documents=documents, embed_model="local:BAAI/bge-small-en-v1.5", save_dir="./vector_store/index")

This is the last part of RAG: we create a query engine with metadata replacement and sentence transformer reranking. Bruh! What is a re-ranker now?

A re-ranker is a component that reorders the retrieved documents based on their relevance to the query. It uses additional information, such as semantic similarity or context-specific factors, to refine the initial ranking provided by the retrieval system. This helps ensure that the most relevant documents are presented to the user, leading to more accurate and informative responses.

from llama_index.core.postprocessor import MetadataReplacementPostProcessor, SentenceTransformerRerank

def get_query_engine(sentence_index, similarity_top_k=6, rerank_top_n=2):
    """
    Creates a query engine with metadata replacement and sentence transformer reranking.

    Args:
        sentence_index (VectorStoreIndex): The sentence index to use.
        similarity_top_k (int, optional): The number of similar nodes to consider. Defaults to 6.
        rerank_top_n (int, optional): The number of nodes to rerank. Defaults to 2.

    Returns:
        QueryEngine: The query engine.
    """

    postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
    rerank = SentenceTransformerRerank(
        top_n=rerank_top_n, model="BAAI/bge-reranker-base"
    )
    engine = sentence_index.as_query_engine(
        similarity_top_k=similarity_top_k, node_postprocessors=[postproc, rerank]
    )
    return engine

# Create a query engine with the specified parameters
query_engine = get_query_engine(sentence_index=vector_index, similarity_top_k=8, rerank_top_n=5)

# Query the engine with a question
query = 'What is Deep learning?'
response = query_engine.query(query)
prompt = f'''Generate a detailed response for the query asked based only on the context fetched:
            Query: {query}
            Context: {response}

            Instructions:
            1. Show query and your generated response based on context.
            2. Your response should be detailed and should cover every aspect of the context.
            3. Be crisp and concise.
            4. Don't include anything else in your response - no header/footer/code etc
            '''
response = generate_response(prompt)
print(response.text)

'''
OUTPUT - 
Query: What is Deep learning? 

Deep learning is a subset of artificial intelligence that utilizes multiple layers of neurons between the network's inputs and outputs to progressively extract higher-level features from raw input data. 
This technique allows for improved performance in various subfields of AI, such as computer vision, speech recognition, natural language processing, and image classification. 
The multiple layers in deep learning networks are able to identify complex concepts and patterns, including edges, faces, digits, and letters.
The reason behind deep learning's success is not attributed to a recent theoretical breakthrough, but rather the significant increase in computer power, particularly the shift to using graphics processing units (GPUs), which provided a hundred-fold increase in speed. 
Additionally, the availability of vast amounts of training data, including large curated datasets, has also contributed to the success of deep learning.
Overall, deep learning's ability to analyze and extract insights from raw data has led to its widespread application in various fields, and its performance continues to improve with advancements in technology and data availability. '''

How to Fine-Tune the Pipeline

Once you've built a basic RAG pipeline, the next step is to fine-tune it for optimal performance. This involves iteratively adjusting various components and parameters to improve the quality of the generated responses.

How to Evaluate the Pipeline's Performance

To assess the pipeline's effectiveness, you can use metrics like:

Accuracy: How often does the pipeline generate correct and relevant responses?
Relevance: How well do the retrieved documents match the query?
Coherence: Is the generated text well-structured and easy to understand?
Factuality: Are the generated responses accurate and consistent with known facts?

Iterate on the Index Structure, Embedding Model, and Language Model

You can experiment with different index structures (for example flat index, hierarchical index) to find the one that best suits your data and query patterns. Consider using different embedding models to capture different semantic nuances. Fine-tuning the language model can also improve its ability to generate high-quality responses.

Experiment with Different Hyperparameters

Hyperparameters are settings that control the behaviour of the pipeline components. By experimenting with different values, you can optimize the pipeline's performance. Some examples of hyperparameters include:

Embedding dimension: The size of the embedding vectors
Index size: The maximum number of documents to store in the index
Retrieval threshold: The minimum similarity score for a document to be considered relevant

Real-World Applications of RAG

RAG pipelines have a wide range of applications, including:

Customer support chatbots: Providing informative and helpful responses to customer inquiries
Knowledge base search: Efficiently retrieving relevant information from large document collections
Summarization of large documents: Condensing lengthy documents into concise summaries
Question answering systems: Answering complex questions based on a given corpus of knowledge

RAG Best Practices and Considerations

To build effective RAG pipelines, consider these best practices:

Data quality and preprocessing: Ensure your data is clean, consistent, and relevant to your use case. Preprocess the data to remove noise and improve its quality.
Embedding model selection: Choose an embedding model that is appropriate for your specific domain and task. Consider factors like accuracy, computational efficiency, and interpretability.
Index optimization: Optimize the index structure and parameters to improve retrieval efficiency and accuracy.
Ethical considerations and biases: Be aware of potential biases in your data and models. Take steps to mitigate bias and ensure fairness in your RAG pipeline.

Conclusion

RAG pipelines offer a powerful approach to leveraging large language models for a variety of tasks. By carefully selecting and fine-tuning the components of an RAG pipeline, you can build systems that provide informative, accurate, and relevant responses.

Key points to remember:

RAG combines information retrieval and language generation.
Llama-Index simplifies the process of building RAG pipelines.
Fine-tuning is essential for optimizing pipeline performance.
RAG has a wide range of real-world applications.
Ethical considerations are crucial in building responsible RAG systems.

As RAG technology continues to evolve, we can expect to see even more innovative and powerful applications in the future. Till then, let's wait for the future to unfold!

A Beginner's Guide to LLMs – What's a Large-Language Model and How Does it Work?

Bhavishya Pandit — Thu, 15 Aug 2024 19:41:19 +0000

ChatGPT was released in November 2022. Since then, we’ve witnessed rapid advancements in the field of AI and technology.

But did you know that the journey of AI chatbots began way back in 1966 with ELIZA? ELIZA was not as sophisticated as today’s models like GPT, but it marked the beginning of the exciting path that led us to where we are now.

Language is the essence of human interaction, and in the digital age, teaching machines to understand and generate language has become a cornerstone of artificial intelligence.

The models we interact with today—such as GPT, Llama3, Gemini, and Claude—are known as Large Language Models (LLMs). This is because they are trained on vast datasets of text, enabling them to perform a wide range of language-related tasks.

But what exactly are LLMs, and why is there so much hype surrounding them?

In this article, you'll learn what LLMs are and what is the hype all about.

What Are LLMs?

Large Language Models are AI models trained on vast amounts of text data to understand, generate, and manipulate human language. They are based on deep learning architectures like transformers, which allow them to process and predict text in a way that mimics human understanding.

In simpler terms, an LLM is a computer program that has been trained on many examples to differentiate between an apple and a Boeing 787 – and to be able to describe each of them.

Before they're ready for use and can answer your questions, LLMs are trained on massive datasets. Realistically, a program cannot conclude anything from a single sentence. But after analyzing, say, trillions of sentences, it's able to build a logic to complete sentences or even generate its own.

How to Train an LLM

Here’s how the training process works:

Data Collection: The first step involves gathering millions (or even billions) of text documents from diverse sources, including books, websites, research papers, and social media. This extensive dataset serves as the foundation for the model’s learning process.
Learning Patterns: The model analyzes the collected data to identify and learn patterns in the text. These patterns include grammar rules, word associations, contextual relationships, and even some level of common sense. By processing this data, the model begins to understand how language works.
Fine-Tuning: After the initial training, the model is fine-tuned for specific tasks. This involves adjusting the model’s parameters to optimize its performance for tasks such as translation, summarization, sentiment analysis, or question-answering.
Evaluation and Testing: Once trained, the model is rigorously tested against a series of benchmarks to evaluate its accuracy, efficiency, and reliability. This step ensures that the model performs well in real-world applications.

After the training process is completed, the models are heavily tested on a series of benchmarks for accuracy, efficiency, security, and so on.

Applications of LLMs

LLMs have a wide range of applications, from content generation to prediction and a lot more.

Content Creation:

Writing Assistance: Tools like Grammarly utilize LLMs to provide real-time suggestions for improving grammar, style, and clarity in writing. Whether you’re drafting an email or writing a novel, LLMs can help you polish your text.
Automated Storytelling: AI models can now generate creative content, from short stories to full-length novels. These models can emulate the style of famous authors or even create entirely new literary styles.

Customer Service:

Chatbots: Many companies deploy AI-powered chatbots that can understand and respond to customer inquiries in real time. These chatbots can handle a wide range of tasks, from answering frequently asked questions to processing orders.
Personal Assistants: Virtual assistants like Siri and Alexa use LLMs to interpret and respond to voice commands, providing users with information, reminders, and entertainment on demand.

Healthcare:

Medical Record Summarization: LLMs can assist healthcare professionals by summarizing patient records, making it easier to review critical information and make informed decisions.
Diagnostic Assistance: AI models can analyze patient data and medical literature to assist doctors in diagnosing diseases and recommending treatments.

Research and Education:

Literature Review: LLMs can sift through vast amounts of research papers to provide concise summaries, identify trends, and suggest new research directions.
Educational Tools: AI-powered tutors can offer personalized learning experiences by adapting to a student’s progress and needs. These tools can provide instant feedback and tailored study plans.

Entertainment:

Game Development: LLMs are used to create more dynamic and responsive characters in video games. These AI-driven characters can engage with players more realistically and interactively.
Music and Art Generation: AI models are now capable of composing music, generating artwork, and even writing scripts for movies, pushing the boundaries of creative expression.

Challenges with LLMs

While LLMs are powerful, they are not without their challenges. ChatGPT has over 150 Million monthly users, this gives us an idea about how big the impact of AI is. But new technologies pose some challenges too.

Bias and Fairness:

LLMs learn from the data they are trained on, which can include biases present in society. This can lead to biased or unfair outcomes in their predictions or responses. Addressing this requires careful dataset curation and algorithm adjustments to minimize bias.

Data Privacy:

LLMs may inadvertently learn and retain sensitive information from the data they are trained on, raising privacy concerns. There’s ongoing research on how to make LLMs more privacy-preserving.

Resource Intensive:

Training LLMs require immense computational power and large datasets, which can be costly and environmentally taxing. Efforts are being made to create more efficient models that require less energy and data.

Interpretability:

LLMs are often seen as "black boxes," meaning it’s challenging to understand exactly how they arrive at certain conclusions. Developing methods to make AI more interpretable and explainable is an ongoing research area.

Coding with LLMs: A Replicate Example

For those of you who like to get your hands dirty with code, here’s a quick example of how to use an LLM with the Replicate library.

Replicate is a Python package that simplifies the process of running machine learning models in the cloud. It provides a user-friendly interface to access and utilize a vast collection of pre-trained models from the Replicate platform.

With Replicate, you can easily:

Run models directly from your Python code or Jupyter notebooks.
Access various model types, including image generation, text generation, and more.
Leverage powerful cloud infrastructure for efficient model execution.
Integrate AI capabilities into your applications without the complexities of model training and deployment.

Here’s a simple code snippet to generate text using Meta's llama3-70b-instruct model. Llama 3 is one of the latest open-source large language models developed by Meta. It's designed to be highly capable, versatile, and accessible, allowing users to experiment, innovate, and scale their AI applications.

import os
import replicate # pip install replicate

# Get your token from -> https://replicate.com/account/api-tokens
os.environ["REPLICATE_API_TOKEN"] = "TOKEN"
api = replicate.Client(api_token=os.environ["REPLICATE_API_TOKEN"])

# Running llama3 model using replicate
output = api.run(
    "meta/meta-llama-3-70b-instruct",
        input={"prompt": 'Hey how are you?'}
    )

# Printing llama3's response
for item in output:
    print(item, end="")

Explanation:

We first save the replicate token using the os package as an environment variable.
Then we use the Llama3 70b-instruct model to give a response based on our prompt. You can customize the output by changing the prompt.

And what is a prompt? A prompt is essentially a text-based instruction or query given to an AI model. It's like providing a starting point or direction for the AI to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way.

For example:

"Write a poem about a robot exploring the ocean."
"Translate 'Hello, how are you?' into Spanish."
"Explain quantum computing in simple terms."

These are all prompts that guide the AI to produce a specific output.

Using Meta's llama-3-70b-instruct, you can build various tools around applications that are mentioned in this article. Tweak the prompts based on your use case and you will be ready to go! ⚡️

Conclusion

In this article, we explored the world of Large Language Models, providing a high-level understanding of how they work and their training process. We delved into the core concepts of LLMs, including data collection, pattern learning, and fine-tuning, and discussed the extensive applications of LLMs across various industries.

While LLMs offer immense potential, they also come with challenges such as bias, privacy concerns, resource demands, and interpretability. Addressing these challenges is crucial as AI continues to evolve and integrate more deeply into our lives.

We also provided a glimpse into how you can start working with LLMs using the Replicate library, showing that even complex models like Llama3 70b-instruct can be accessible to developers with the right tools.

Bhavishya Pandit - freeCodeCamp.org

Which Tools to Use for LLM-Powered Applications: LangChain vs LlamaIndex vs NIM

Table of Contents:

Introduction to LangChain

LangChain Components

Chains

LangChain Quickstart

1. Using LangChain with Google's Gemini Pro Model

2. Setting Up Your API Key

3. Accessing the Gemini Pro Model

4. Querying the Model

Introduction to LlamaIndex

Data Connectors

Data Indexing

Engines

Advantages of LlamaIndex

LlamaIndex Quickstart

1. Install the LlamaIndex Package

2. Set the OpenAI API Key

3. Importing Necessary Components

4. Loading Documents

5. Creating the Vector Store Index

6. Building the Query Engine

7. Querying the Engine

8. Displaying the Response

NVIDIA NIM

How NIMs Work

Features of NIM

NIM Quickstart

1. Generate an NGC API key

2. Export the API key

3. Docker login to NGC

4. List available NIMs

5. Launch NIM

Which Tool to Use?

Conclusion

How to Build a RAG Pipeline with LlamaIndex

Table of Contents:

What is Retrieval Augmented Generation (RAG)?

Why use LlamaIndex for RAG?

What You'll Learn Here:

Understanding the Components of a RAG Pipeline

Components of RAG

The RAG Flow

LamaIndex

Prerequisites

Let's Get Started!

Understanding Vector Store Indexes

How to Fine-Tune the Pipeline

How to Evaluate the Pipeline's Performance

Iterate on the Index Structure, Embedding Model, and Language Model

Experiment with Different Hyperparameters

Real-World Applications of RAG

RAG Best Practices and Considerations

Conclusion

A Beginner's Guide to LLMs – What's a Large-Language Model and How Does it Work?

What Are LLMs?

How to Train an LLM

Applications of LLMs

Content Creation:

Customer Service:

Healthcare:

Research and Education:

Entertainment:

Challenges with LLMs

Bias and Fairness:

Data Privacy:

Resource Intensive:

Interpretability:

Coding with LLMs: A Replicate Example

Conclusion