Chaitanya Rahalkar - freeCodeCamp.org

How to Build Your Own Local AI: Create Free RAG and AI Agents with Qwen 3 and Ollama

Chaitanya Rahalkar — Tue, 06 May 2025 16:16:20 +0000

The landscape of Artificial Intelligence is rapidly evolving, and one of the most exciting trends is the ability to run powerful Large Language Models (LLMs) directly on your local machine.

This shift away from reliance on cloud-based APIs offers significant advantages in terms of privacy, cost-effectiveness, and offline accessibility. Developers and enthusiasts can now experiment with and deploy sophisticated AI capabilities without sending data externally or incurring API fees.

This tutorial serves as a practical, hands-on guide to harnessing this local AI power. It focuses on leveraging the Qwen 3 family of LLMs, a state-of-the-art open-source offering from Alibaba, combined with Ollama, a tool that dramatically simplifies running LLMs locally.

Prerequisites

Before diving into this tutorial, you should have a foundational understanding of Python programming and be comfortable using the command line or terminal. Make sure you have Python 3 installed on your system.

While prior experience with AI or Large Language Models (LLMs) is beneficial, it's not essential, as I’ll introduce and explain core concepts like Retrieval-Augmented Generation (RAG) and AI agents throughout the guide.

Local AI Power with Qwen 3 and Ollama
- Ollama: Your Local LLM Gateway
- Tutorial Roadmap

How to Set Up Your Local AI Lab
How to Build a Local RAG System with Qwen 3
How to Create Local AI Agents with Qwen 3
Advanced Considerations and Troubleshooting
Conclusion and Next Steps

Local AI Power with Qwen 3 and Ollama

Running LLMs locally addresses several key concerns associated with cloud-based AI services.

Privacy is paramount – data processed locally never leaves the user's machine.
Cost is another major factor – utilizing open-source models and tools like Ollama eliminates API subscription fees and pay-per-token charges, making advanced AI accessible to everyone.
Local execution enables offline functionality – crucial for applications where internet connectivity is unreliable or undesirable.

Ollama: Your Local LLM Gateway

Ollama acts as a bridge, making the power of models like Qwen 3 accessible on local hardware. It's a command-line tool that simplifies the download, setup, and execution of various open-source LLMs across macOS, Linux, and Windows.

Ollama handles the complexities of model configuration and GPU utilization, providing a straightforward interface for developers and users. It also exposes an OpenAI-compatible API endpoint, allowing seamless integration with popular frameworks like LangChain.

Tutorial Roadmap

This tutorial will guide you through the process of:

Setting up a local AI environment: Installing Ollama and selecting/running appropriate Qwen 3 models.
Building a local RAG system: Creating a system that allows chatting with personal documents using Qwen 3, Ollama, LangChain, and ChromaDB for vector storage.
Creating a basic local AI agent: Developing a simple agent powered by Qwen 3 that can utilize custom-defined tools (functions).

How to Set Up Your Local AI Lab

The first step is to prepare your local machine with the necessary tools and models.

Install Ollama

Ollama provides the simplest path to running LLMs locally.

Linux / macOS: Open a terminal and run the official installation script:
```
  curl -fsSL https://ollama.com/install.sh | sh
```
Windows: Download the installer from the Ollama website (https://ollama.com/download) and follow the setup instructions.

After installation, verify it by opening a new terminal window and running:

ollama --version

Ollama typically stores downloaded models in ~/.ollama/models on macOS and /usr/share/ollama/.ollama/models on Linux/WSL.

Choose Your Qwen 3 Model

Selecting the right Qwen 3 model is crucial and depends on your intended task and available hardware, primarily system RAM and GPU VRAM. Running larger models requires more resources but generally offers better performance and reasoning capabilities.

Qwen 3 offers two main architectures available through Ollama:

Dense Models: (like qwen3:0.6b, qwen3:4b, qwen3:8b, qwen3:14b, qwen3:32b) These models activate all their parameters during inference. Their performance is predictable, but resource requirements scale directly with parameter count.
Mixture-of-Experts (MoE) Models: (like qwen3:30b-a3b) These models contain many "expert" sub-networks but only activate a small fraction for each input token. This allows them to achieve the performance characteristic of their large total parameter count (for example, 30 billion) while having inference costs closer to their smaller active parameter count (for example, 3 billion). They offer a compelling balance of capability and efficiency, especially for reasoning and coding tasks.

Recommendation for this tutorial: For the examples that follow, qwen3:8b strikes a good balance between capability and resource requirements for many modern machines. If resources are more constrained, qwen3:4b is a viable alternative. The MoE model qwen3:30b-a3b offers excellent performance, especially for coding and reasoning, and runs surprisingly well on systems with 16GB+ VRAM due to its sparse activation.

Pull and Run Qwen 3 with Ollama

Once you’ve chosen a model, you’ll need to download it (pull it) via Ollama.

Pull the model: Open the terminal and run (replace qwen3:8b with the desired tag):

ollama pull qwen3:8b

This command downloads the model weights and configuration.

Run interactively (optional test): To chat directly with the model from the command line:

ollama run qwen3:8b

Type prompts directly into the terminal. Use /bye to exit the session. Other useful commands within the interactive session include /? for help and /set parameter (for example, /set parameter num_ctx 8192) to temporarily change model parameters for the current session. Use ollama list outside the session to see downloaded models.

Run as a server: For integration with Python scripts (using LangChain), Ollama needs to run as a background server process, exposing an API. Open a separate terminal window and run:

ollama serve

Keep this terminal window open while running the Python scripts. This command starts the server, typically listening on http://localhost:11434, providing an OpenAI-compatible API endpoint.

Set Up Your Python Environment

A dedicated Python environment is recommended for managing dependencies.

Create a virtual environment:

python -m venv venv

Activate the environment:

macOS/Linux: source venv/bin/activate
Windows: venv\Scripts\activate

Install necessary libraries:

pip install langchain langchain-community langchain-core langchain-ollama chromadb sentence-transformers pypdf python-dotenv unstructured[pdf] tiktoken

langchain, langchain-community, langchain-core: The core LangChain framework for building LLM applications.
langchain-ollama: Specific integration for using Ollama models with LangChain.
chromadb: The local vector database for storing document embeddings.
sentence-transformers: Used for an alternative local embedding method (explained later).
pypdf: A library for loading PDF documents.
python-dotenv: For managing environment variables (optional but good practice).
unstructured[pdf]: An alternative, powerful document loader, especially for complex PDFs.
tiktoken: Used by LangChain for token counting.

The local setup involves coordinating several independent components: Ollama itself, the specific Qwen 3 model weights, the Python environment, and various libraries like LangChain and ChromaDB. Ensuring compatibility between these pieces and correctly configuring parameters (like Ollama's context window size or selecting a model appropriate for the available VRAM) is key to a smooth experience.

While this modularity offers flexibility – allowing components like the LLM or vector store to be swapped – it also means the initial setup requires careful attention to detail. This tutorial aims to provide clear steps and sensible defaults to minimize potential friction points.

How to Build a Local RAG System with Qwen 3

Retrieval-Augmented Generation (RAG) is a powerful technique that enhances LLMs by providing them with external knowledge.

Instead of relying solely on its training data, the LLM can retrieve relevant information from a specified document set (like local PDFs) and uses that information to answer questions. This significantly reduces "hallucinations" (incorrect or fabricated information) and allows the LLM to answer questions about specific, private data without needing retraining.

The core RAG process involves:

Loading and splitting documents into manageable chunks.
Converting these chunks into numerical representations (embeddings) using an embedding model.
Storing these embeddings in a vector database for efficient searching.
When a query comes in, embedding the query and searching the vector database for the most similar document chunks.
Providing these relevant chunks (context) along with the original query to the LLM to generate an informed answer.

Let's build this locally using Qwen 3, Ollama, LangChain, and ChromaDB.

Step 1: Prepare Your Data

Create a directory named data in the project folder. Place the PDF document that you intend to query into this directory. For this tutorial, using a single, primarily text-based PDF (like a research paper or a report) for simplicity.

mkdir data
# Copy your PDF file into the 'data' directory
# e.g., cp ~/Downloads/some_paper.pdf./data/mydocument.pdf

If you don’t have a PDF readily available that you’d like to use, you can download a sample PDF (the Llama 2 paper) for this tutorial using the following command in your terminal:


wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

This command creates the data directory and downloads the PDF, saving it as llama2.pdf inside the data directory. If you prefer to use your own document, place your PDF file into the data directory and update the filename in the subsequent Python code.

Step 2: Load Documents in Python

Use LangChain's document loaders to read the PDF content. PyPDFLoader is straightforward for simple PDFs. UnstructuredPDFLoader (requires unstructured[pdf]) can handle more complex layouts but has more dependencies.

# rag_local.py
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader # Or UnstructuredPDFLoader

load_dotenv() # Optional: Loads environment variables from.env file

DATA_PATH = "data/"
PDF_FILENAME = "mydocument.pdf" # Replace with your PDF filename

def load_documents():
    """Loads documents from the specified data path."""
    pdf_path = os.path.join(DATA_PATH, PDF_FILENAME)
    loader = PyPDFLoader(pdf_path)
    # loader = UnstructuredPDFLoader(pdf_path) # Alternative
    documents = loader.load()
    print(f"Loaded {len(documents)} page(s) from {pdf_path}")
    return documents

# documents = load_documents() # Call this later

Step 3: Split Documents

Large documents need to be split into smaller chunks suitable for embedding and retrieval. The RecursiveCharacterTextSplitter attempts to split text semantically (at paragraphs, sentences, and so on) before resorting to fixed-size splits. chunk_size determines the maximum size of each chunk (in characters), and chunk_overlap specifies how many characters should overlap between consecutive chunks to maintain context.

# rag_local.py (continued)
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_documents(documents):
    """Splits documents into smaller chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )
    all_splits = text_splitter.split_documents(documents)
    print(f"Split into {len(all_splits)} chunks")
    return all_splits

# loaded_docs = load_documents()
# chunks = split_documents(loaded_docs) # Call this later

Step 4: Choose and Configure Embedding Model

Embeddings transform text into vectors (lists of numbers) such that semantically similar text chunks have vectors that are close together in multi-dimensional space.

Option A (Recommended for Simplicity): Ollama Embeddings

This approach uses Ollama to serve a dedicated embedding model. nomic-embed-text is a capable open-source model available via Ollama.

First, ensure the embedding model is pulled:

ollama pull nomic-embed-text

Then, use OllamaEmbeddings in Python:

# rag_local.py (continued)
from langchain_ollama import OllamaEmbeddings

def get_embedding_function(model_name="nomic-embed-text"):
    """Initializes the Ollama embedding function."""
    # Ensure Ollama server is running (ollama serve)
    embeddings = OllamaEmbeddings(model=model_name)
    print(f"Initialized Ollama embeddings with model: {model_name}")
    return embeddings

# embedding_function = get_embedding_function() # Call this later

Option B (Alternative): Sentence Transformers

This uses the sentence-transformers library directly within the Python script. It requires installing the library (pip install sentence-transformers) but doesn't need a separate Ollama process for embeddings. Models like all-MiniLM-L6-v2 are fast and lightweight, while all-mpnet-base-v2 offers higher quality.

# Alternative embedding function using Sentence Transformers
from langchain_community.embeddings import HuggingFaceEmbeddings

def get_embedding_function_hf(model_name="all-MiniLM-L6-v2"):
     """Initializes HuggingFace embeddings (runs locally)."""
     embeddings = HuggingFaceEmbeddings(model_name=model_name)
     print(f"Initialized HuggingFace embeddings with model: {model_name}")
     return embeddings

embedding_function = get_embedding_function_hf() # Use this if choosing Option B

For this tutorial, we’ll use Option A (Ollama Embeddings with nomic-embed-text) to keep the toolchain consistent.

Step 5: Set Up Local Vector Store (ChromaDB)

ChromaDB provides an efficient way to store and search vector embeddings locally. Using a persistent client ensures the indexed data is saved to disk and can be reloaded without re-processing the documents every time.

# rag_local.py (continued)
from langchain_community.vectorstores import Chroma

CHROMA_PATH = "chroma_db" # Directory to store ChromaDB data

def get_vector_store(embedding_function, persist_directory=CHROMA_PATH):
    """Initializes or loads the Chroma vector store."""
    vectorstore = Chroma(
        persist_directory=persist_directory,
        embedding_function=embedding_function
    )
    print(f"Vector store initialized/loaded from: {persist_directory}")
    return vectorstore

embedding_function = get_embedding_function()
vector_store = get_vector_store(embedding_function) # Call this later

Step 6: Index Documents (Embed and Store)

This is the core indexing step where document chunks are converted to embeddings and saved in ChromaDB. The Chroma.from_documents function is convenient for the initial creation and indexing. If the database already exists, subsequent additions can use vectorstore.add_documents.

# rag_local.py (continued)

def index_documents(chunks, embedding_function, persist_directory=CHROMA_PATH):
    """Indexes document chunks into the Chroma vector store."""
    print(f"Indexing {len(chunks)} chunks...")
    # Use from_documents for initial creation.
    # This will overwrite existing data if the directory exists but isn't a valid Chroma DB.
    # For incremental updates, initialize Chroma first and use vectorstore.add_documents().
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_function,
        persist_directory=persist_directory
    )
    vectorstore.persist() # Ensure data is saved
    print(f"Indexing complete. Data saved to: {persist_directory}")
    return vectorstore

#... (previous function calls)
vector_store = index_documents(chunks, embedding_function) # Call this for initial indexing

To load an existing persistent database later:

embedding_function = get_embedding_function()
vector_store = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)

Step 7: Build the RAG Chain

Now, assemble the components into a LangChain Expression Language (LCEL) chain. This involves initializing the Qwen 3 LLM via Ollama, creating a retriever from the vector store, defining a suitable prompt, and chaining them together.

A critical parameter when initializing ChatOllama for RAG is num_ctx. This defines the context window size (in tokens) that the LLM can handle. Ollama's default (often 2048 or 4096 tokens) might be too small to accommodate both the retrieved document context and the user's query/prompt.

Qwen 3 models (8B and larger) support much larger context windows (for example, 128k tokens), but practical limits depend on your available RAM/VRAM. Setting num_ctx to a value like 8192 or higher is often necessary for effective RAG.

# rag_local.py (continued)
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

def create_rag_chain(vector_store, llm_model_name="qwen3:8b", context_window=8192):
    """Creates the RAG chain."""
    # Initialize the LLM
    llm = ChatOllama(
        model=llm_model_name,
        temperature=0, # Lower temperature for more factual RAG answers
        num_ctx=context_window # IMPORTANT: Set context window size
    )
    print(f"Initialized ChatOllama with model: {llm_model_name}, context window: {context_window}")

    # Create the retriever
    retriever = vector_store.as_retriever(
        search_type="similarity", # Or "mmr"
        search_kwargs={'k': 3} # Retrieve top 3 relevant chunks
    )
    print("Retriever initialized.")

    # Define the prompt template
    template = """Answer the question based ONLY on the following context:
{context}

Question: {question}
"""
    prompt = ChatPromptTemplate.from_template(template)
    print("Prompt template created.")

    # Define the RAG chain using LCEL
    rag_chain = (
        {"context": retriever, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
    )
    print("RAG chain created.")
    return rag_chain

#... (previous function calls)
vector_store = get_vector_store(embedding_function) # Assuming DB is already indexed
rag_chain = create_rag_chain(vector_store) # Call this later

The effectiveness of the RAG system hinges on the proper configuration of each component. The chunk_size and chunk_overlap in the splitter affect what the retriever finds. Your choice of embedding_function must be consistent between indexing and querying. The num_ctx parameter for the ChatOllama LLM must be large enough to hold the retrieved context and the prompt itself. A poorly designed prompt template can also lead the LLM astray. Make sure you carefully tune these elements for optimal performance.

Step 8: Query Your Documents

Finally, invoke the RAG chain with a question related to the content of the indexed PDF.

# rag_local.py (continued)

def query_rag(chain, question):
    """Queries the RAG chain and prints the response."""
    print("\nQuerying RAG chain...")
    print(f"Question: {question}")
    response = chain.invoke(question)
    print("\nResponse:")
    print(response)

# --- Main Execution ---
if __name__ == "__main__":
    # 1. Load Documents
    docs = load_documents()

    # 2. Split Documents
    chunks = split_documents(docs)

    # 3. Get Embedding Function
    embedding_function = get_embedding_function() # Using Ollama nomic-embed-text

    # 4. Index Documents (Only needs to be done once per document set)
    # Check if DB exists, if not, index. For simplicity, we might re-index here.
    # A more robust approach would check if indexing is needed.
    print("Attempting to index documents...")
    vector_store = index_documents(chunks, embedding_function)
    # To load existing DB instead:
    # vector_store = get_vector_store(embedding_function)

    # 5. Create RAG Chain
    rag_chain = create_rag_chain(vector_store, llm_model_name="qwen3:8b") # Use the chosen Qwen 3 model

    # 6. Query
    query_question = "What is the main topic of the document?" # Replace with a specific question
    query_rag(rag_chain, query_question)

    query_question_2 = "Summarize the introduction section." # Another example
    query_rag(rag_chain, query_question_2)

Run the complete script (python rag_local.py). Make sure that the ollama serve command is running in another terminal. The script will load the PDF, split it, embed the chunks using nomic-embed-text via Ollama, store them in ChromaDB, build the RAG chain using qwen3:8b via Ollama, and finally execute the queries. It’ll print the LLM's responses based on the document content.

How to Create Local AI Agents with Qwen 3

Beyond answering questions based on provided text, LLMs can act as the reasoning engine for AI agents. Agents can plan sequences of actions, interact with external tools (like functions or APIs), and work towards accomplishing more complex goals assigned by the user.

Qwen 3 models were specifically designed with strong tool-calling and agentic capabilities. While Alibaba provides the Qwen-Agent framework, this tutorial will continue using LangChain for consistency and because its integration with Ollama for agent tasks is more readily documented in the provided materials.

We will build a simple agent that can use a custom Python function as a tool.

Step 1: Define Custom Tools

Tools are standard Python functions that the agent can choose to execute. The function's docstring is crucial, as the LLM uses it to understand what the tool does and what arguments it requires. LangChain's @tool decorator simplifies wrapping functions for agent use.

# agent_local.py
import os
from dotenv import load_dotenv
from langchain.agents import tool
import datetime

load_dotenv() # Optional

@tool
def get_current_datetime(format: str = "%Y-%m-%d %H:%M:%S") -> str:
    """
    Returns the current date and time, formatted according to the provided Python strftime format string.
    Use this tool whenever the user asks for the current date, time, or both.
    Example format strings: '%Y-%m-%d' for date, '%H:%M:%S' for time.
    If no format is specified, defaults to '%Y-%m-%d %H:%M:%S'.
    """
    try:
        return datetime.datetime.now().strftime(format)
    except Exception as e:
        return f"Error formatting date/time: {e}"

# List of tools the agent can use
tools = [get_current_datetime]
print("Custom tool defined.")

Step 2: Set Up the Agent LLM

Instantiate the ChatOllama model again, using a Qwen 3 variant suitable for tool calling. The qwen3:8b model should be capable of handling simple tool use cases.

It's important to note that tool calling reliability with local models served via Ollama can sometimes be less consistent than with large commercial APIs like GPT-4 or Claude. The LLM might fail to recognize when a tool is needed, hallucinate arguments, or misinterpret the tool's output. Starting with clear prompts and simple tools is recommended.

# agent_local.py (continued)
from langchain_ollama import ChatOllama

def get_agent_llm(model_name="qwen3:8b", temperature=0):
    """Initializes the ChatOllama model for the agent."""
    # Ensure Ollama server is running (ollama serve)
    llm = ChatOllama(
        model=model_name,
        temperature=temperature # Lower temperature for more predictable tool use
        # Consider increasing num_ctx if expecting long conversations or complex reasoning
        # num_ctx=8192
    )
    print(f"Initialized ChatOllama agent LLM with model: {model_name}")
    return llm

# agent_llm = get_agent_llm() # Call this later

Step 3: Create the Agent Prompt

Agents require specific prompt structures that guide their reasoning and tool use. The prompt typically includes placeholders for user input (input), conversation history (chat_history), and the agent_scratchpad. The scratchpad is where the agent records its internal "thought" process, the tools it decides to call, and the results (observations) it gets back from those tools. LangChain Hub provides pre-built prompts suitable for tool-calling agents.

# agent_local.py (continued)
from langchain import hub

def get_agent_prompt(prompt_hub_name="hwchase17/openai-tools-agent"):
    """Pulls the agent prompt template from LangChain Hub."""
    # This prompt is designed for OpenAI but often works well with other tool-calling models.
    # Alternatively, define a custom ChatPromptTemplate.
    prompt = hub.pull(prompt_hub_name)
    print(f"Pulled agent prompt from Hub: {prompt_hub_name}")
    # print("Prompt Structure:")
    # prompt.pretty_print() # Uncomment to see the prompt structure
    return prompt

# agent_prompt = get_agent_prompt() # Call this later

Step 4: Build the Agent

The create_tool_calling_agent function combines the LLM, the defined tools, and the prompt into a runnable unit that represents the agent's core logic.

# agent_local.py (continued)
from langchain.agents import create_tool_calling_agent

def build_agent(llm, tools, prompt):
    """Builds the tool-calling agent runnable."""
    agent = create_tool_calling_agent(llm, tools, prompt)
    print("Agent runnable created.")
    return agent

# agent_runnable = build_agent(agent_llm, tools, agent_prompt) # Call this later

Step 5: Create the Agent Executor

The AgentExecutor is responsible for running the agent loop. It takes the agent runnable and the tools, invokes the agent with the input, parses the agent's output (which could be a final answer or a tool call request), executes any requested tool calls, and feeds the results back to the agent until a final answer is reached. Setting verbose=True is highly recommended during development to observe the agent's step-by-step execution flow.

# agent_local.py (continued)
from langchain.agents import AgentExecutor

def create_agent_executor(agent, tools):
    """Creates the agent executor."""
    agent_executor = AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=True # Set to True to see agent thoughts and tool calls
    )
    print("Agent executor created.")
    return agent_executor

# agent_executor = create_agent_executor(agent_runnable, tools) # Call this later

Step 6: Run the Agent

Invoke the agent executor with a user query that should trigger the use of the defined tool.

# agent_local.py (continued)

def run_agent(executor, user_input):
    """Runs the agent executor with the given input."""
    print("\nInvoking agent...")
    print(f"Input: {user_input}")
    response = executor.invoke({"input": user_input})
    print("\nAgent Response:")
    print(response['output'])

# --- Main Execution ---
if __name__ == "__main__":
    # 1. Define Tools (already done above)

    # 2. Get Agent LLM
    agent_llm = get_agent_llm(model_name="qwen3:8b") # Use the chosen Qwen 3 model

    # 3. Get Agent Prompt
    agent_prompt = get_agent_prompt()

    # 4. Build Agent Runnable
    agent_runnable = build_agent(agent_llm, tools, agent_prompt)

    # 5. Create Agent Executor
    agent_executor = create_agent_executor(agent_runnable, tools)

    # 6. Run Agent
    run_agent(agent_executor, "What is the current date?")
    run_agent(agent_executor, "What time is it right now? Use HH:MM format.")
    run_agent(agent_executor, "Tell me a joke.") # Should not use the tool

Running python agent_local.py (with ollama serve active) will execute the agent. The verbose=True setting will print output resembling the ReAct (Reasoning and Acting) framework, showing the agent's internal "Thoughts" on how to proceed, the "Action" it decides to take (calling a specific tool with arguments), and the "Observation" (the result returned by the tool).

Building reliable agents with local models presents unique challenges. The LLM's ability to correctly interpret the prompt, understand when to use tools, select the right tool, generate valid arguments, and process the tool's output is critical.

Local models, especially smaller or heavily quantized ones, might struggle with these reasoning steps compared to larger, cloud-based counterparts. If the qwen3:8b model proves unreliable for more complex agentic tasks, consider trying qwen3:14b or the efficient qwen3:30b-a3b if hardware permits.

For highly complex or stateful agent workflows, exploring frameworks like LangGraph, which offers more control over the agent's execution flow, might be beneficial.

Advanced Considerations and Troubleshooting

Running LLMs locally offers great flexibility but also introduces specific configuration aspects and potential issues.

Controlling Qwen 3's Thinking Mode with Ollama

Qwen 3's unique hybrid inference allows switching between a deep "thinking" mode for complex reasoning and a faster "non-thinking" mode for general chat. While frameworks like Hugging Face Transformers or vLLM might offer explicit parameters (enable_thinking), the primary way to control this when using Ollama appears to be through "soft switches" embedded in the prompt.

Append /think to the end of a user prompt to encourage step-by-step reasoning, or /no_think to request a faster, direct response. You can do this via the Ollama CLI or potentially within the prompts sent via the API/LangChain.

# Example using LangChain's ChatOllama
from langchain_ollama import ChatOllama

llm_think = ChatOllama(model="qwen3:8b")
llm_no_think = ChatOllama(model="qwen3:8b") # Could also set system prompt

# Invoke with prompt modification
response_think = llm_think.invoke("Solve the equation 2x + 5 = 15 /think")
print("Thinking Response:", response_think)

response_no_think = llm_no_think.invoke("What is the capital of France? /no_think")
print("Non-Thinking Response:", response_no_think)

# Alternatively, set via system message (might be less reliable turn-by-turn)
llm_system_no_think = ChatOllama(model="qwen3:8b", system="/no_think")
response_system = llm_system_no_think.invoke("What is 2+2?")
print("System No-Think Response:", response_system)

Note that the persistence of these tags across multiple turns in a conversation might require careful prompt management.

Managing Context Length (`num_ctx`)

The context window (num_ctx) determines how much information (prompt, history, retrieved documents) the LLM can consider at once. Qwen 3 models (8B+) support large native context lengths (for example, 128k tokens), but Ollama often defaults to a much smaller window (like 2048 or 4096). For RAG or conversations requiring memory of earlier turns, this default is often insufficient.

Set num_ctx when initializing ChatOllama or OllamaLLM in LangChain:

# Example setting context window to 8192 tokens
llm = ChatOllama(model="qwen3:8b", num_ctx=8192)

Be mindful that larger num_ctx values significantly increase RAM and VRAM consumption. But setting it too low can lead to the model "forgetting" context or even entering repetitive loops. Choose a value that balances the task requirements with hardware capabilities.

Hardware Limitations and VRAM

Running LLMs locally is resource-intensive.

VRAM: A dedicated GPU (NVIDIA or Apple Silicon) with sufficient VRAM is highly recommended for acceptable performance. The amount of VRAM dictates the largest model size that can run efficiently. Refer to the table in Section 2 for estimates.
RAM: System RAM is also crucial, especially if the model doesn't fit entirely in VRAM. Ollama can utilize system RAM as a fallback, but this is significantly slower.
Quantization: Ollama typically serves quantized models (for example., 4-bit or 5-bit), which reduce the model size and VRAM requirements significantly compared to full-precision models, often with minimal performance degradation for many tasks. The tags like :4b, :8b usually imply a default quantization level.

If performance is slow or errors occur due to resource constraints, consider:

Using a smaller Qwen 3 model (like 4B instead of 8B).
Ensuring Ollama is correctly detecting and utilizing the GPU (check Ollama logs or system monitoring tools).
Closing other resource-intensive applications.

Conclusion and Next Steps

This tutorial gave you a practical walkthrough for setting up your local AI environment using the powerful and open Qwen 3 LLM family with the user-friendly Ollama tool.

If you’ve followed these steps, you should have successfully:

Installed Ollama and downloaded/run Qwen 3 models locally.
Built a functional Retrieval-Augmented Generation (RAG) pipeline using LangChain and ChromaDB to query local documents.
Created a basic AI agent capable of reasoning and utilizing custom Python tools.

Running these systems locally unlocks significant advantages in privacy, cost, and customization, making advanced AI capabilities more accessible than ever. The combination of Qwen 3's performance and open license with Ollama's ease of use creates a potent platform for experimentation and development.

Official Resources:

Qwen 3: GitHub, Documentation
Ollama: Website, Model Library, GitHub
LangChain: Python Documentation
ChromaDB: Documentation
Sentence Transformers: Documentation

By leveraging these powerful, free, and open-source tools, you can continue to push the boundaries of what's possible with AI running directly on your own hardware.

How to Create a Python SIEM System Using AI and LLMs for Log Analysis and Anomaly Detection

Chaitanya Rahalkar — Fri, 07 Mar 2025 17:28:11 +0000

In this tutorial, we’ll build a simplified, AI-flavored SIEM log analysis system using Python. Our focus will be on log analysis and anomaly detection.

We’ll walk through ingesting logs, detecting anomalies with a lightweight machine learning model, and even touch on how the system could respond automatically.

This hands-on proof-of-concept will illustrate how AI can enhance security monitoring in a practical, accessible way.

What Are SIEM Systems?
Prerequisites
Setting Up the Project
How to Implement Log Analysis
How to Build the Anomaly Detection Model
Testing and Visualizing Results
Automated Response Possibilities
Conclusion

What Are SIEM Systems?

Security Information and Event Management (SIEM) systems are the central nervous system of modern security operations. A SIEM aggregates and correlates security logs and events from across an IT environment to provide real-time insights into potential incidents. This helps organizations detect threats faster and respond sooner.

These systems pull together huge volumes of log data — from firewall alerts to application logs — and analyze them for signs of trouble. Anomaly detection in this context is crucial, and unusual patterns in logs can reveal incidents that might slip past static rules. For example, a sudden spike in network requests might indicate a DDoS attack, while multiple failed login attempts could point to unauthorized access attempts.

AI takes SIEM capabilities a step further. By leveraging advanced AI models (like large language models), an AI-powered SIEM can intelligently parse and interpret logs, learn what “normal” behavior looks like, and flag the “weird” stuff that warrants attention.

In essence, AI can act as a smart co-pilot for analysts, spotting subtle anomalies and even summarizing findings in plain language. Recent advancements in large language models allow SIEMs to reason over countless data points much like a human analyst would — but with far greater speed and scale. The result is a powerful digital security assistant that helps cut through the noise and focus on real threats.

Prerequisites

Before we dive in, make sure you have the following:

Python 3.x installed on your system. The code examples should work in any recent Python version.
Basic familiarity with Python programming (looping, functions, using libraries) and an understanding of logs (for example, what a log entry looks like) will be helpful.
Python libraries: We’ll use a few common libraries that are lightweight and don’t require special hardware:
- pandas for basic data handling (if your logs are in CSV or similar format).
- numpy for numeric operations.
- scikit-learn for the anomaly detection model (specifically, we’ll use the IsolationForest algorithm).
A set of log data to analyze. You can use any log file (system logs, application logs, and so on) in plain text or CSV format. For demonstration, we’ll simulate a small log dataset so you can follow along even without a ready-made log file.

Note: If you don’t have the libraries above, install them via pip:

pip install pandas numpy scikit-learn

Setting Up the Project

Let’s set up a simple project structure. Create a new directory for this SIEM anomaly detection project and navigate into it. Inside, you can have a Python script (for example, siem_anomaly_demo.py) or a Jupyter Notebook to run the code step by step.

Make sure your working directory contains or can access your log data. If you’re using a log file, it might be a good idea to place a copy in this project folder. For our proof-of-concept, since we will generate synthetic log data, we won’t need an external file — but in a real scenario you would.

Project setup steps:

Initialize the environment – If you prefer, create a virtual environment for this project (optional but good practice):
```
 python -m venv venv
 source venv/bin/activate  # On Windows use "venv\Scripts\activate"
```
Then install the required packages in this virtual environment.
Prepare a data source – Identify the log source you want to analyze. This could be a path to a log file or database. Ensure you know the format of the logs (for example, are they comma-separated, JSON lines, or plain text?). For illustration, we will fabricate some log entries.
Set up your script or notebook – Open your Python file or notebook. We’ll start by importing the necessary libraries and setting up any configurations (like random seeds for reproducibility).

By the end of this setup, you should have a Python environment ready to run our SIEM log analysis code, and either a real log dataset or the intention to simulate data along with me.

Implementing Log Analysis

In a full SIEM system, log analysis involves collecting logs from various sources and parsing them into a uniform format for further processing. Logs often contain fields like timestamp, severity level, source, event message, user ID, IP address, and so on. The first task is to ingest and preprocess these logs.

1. Log Ingestion

If your logs are in a text file, you can read them in Python. For example, if each log entry is a line in the file, you could do:

with open("my_logs.txt") as f:
    raw_logs = f.readlines()

If the logs are structured (say, CSV format with columns), Pandas can greatly simplify reading:

import pandas as pd
df = pd.read_csv("my_logs.csv")
print(df.head())

This will give you a DataFrame df with your log entries organized in columns. But many logs are semi-structured (for example, components separated by spaces or special characters). In such cases, you might need to split each line by a delimiter or use regex to extract fields. For instance, imagine a log line:

2025-03-06 08:00:00, INFO, User login success, user: admin

This has a timestamp, a log level, a message, and a user. We can parse such lines with Python’s string methods:

logs = [
    "2025-03-06 08:00:00, INFO, User login success, user: admin",
    "2025-03-06 08:01:23, INFO, User login success, user: alice",
    "2025-03-06 08:02:45, ERROR, Failed login attempt, user: alice",
    # ... (more log lines)
]
parsed_logs = []
for line in logs:
    parts = [p.strip() for p in line.split(",")]
    timestamp = parts[0]
    level = parts[1]
    message = parts[2]
    user = parts[3].split(":")[1].strip() if "user:" in parts[3] else None
    parsed_logs.append({"timestamp": timestamp, "level": level, "message": message, "user": user})

# Convert to DataFrame for easier analysis
df_logs = pd.DataFrame(parsed_logs)
print(df_logs.head())

Running the above on our sample list would output something like:

            timestamp  level                 message   user
0  2025-03-06 08:00:00   INFO    User login success   admin
1  2025-03-06 08:01:23   INFO    User login success   alice
2  2025-03-06 08:02:45  ERROR  Failed login attempt   alice
...

Now we have structured the logs into a table. In a real scenario, you would continue parsing all relevant fields from your logs (for example, IP addresses, error codes, and so on) depending on what you want to analyze.

2. Preprocessing and Feature Extraction

With the logs in a structured format, the next step is to derive features for anomaly detection. Raw log messages (strings) by themselves are hard for an algorithm to learn from directly. We often extract numeric features or categories that can be quantified. Some examples of features could be:

Event counts: number of events per minute/hour, number of login failures for each user, and so on.
Duration or size: if logs include durations or data sizes (for example, file transfer size, query execution time), those numeric values can be directly used.
Categorical encoding: log levels (INFO, ERROR, DEBUG) could be mapped to numbers, or specific event types could be one-hot encoded.

For this proof-of-concept, let’s focus on a simple numeric feature: the count of login attempts per minute for a given user. We’ll simulate this as our feature data.

In a real system, you would compute this by grouping the parsed log entries by time window and user. The goal is to get an array of numbers where each number represents "how many login attempts occurred in a given minute." Most of the time this number will be low (normal behavior), but if a particular minute saw an unusually high number of attempts, that’s an anomaly (possibly a brute-force attack).

To simulate, we’ll generate a list of 50 values representing normal behavior, and then append a few values that are abnormally high:

import numpy as np

# Simulate 50 minutes of normal login attempt counts (around 5 per minute on average)
np.random.seed(42)  # for reproducible example
normal_counts = np.random.poisson(lam=5, size=50)

# Simulate anomaly: a spike in login attempts (e.g., an attacker tries 30+ times in a minute)
anomalous_counts = np.array([30, 40, 50])

# Combine the data
login_attempts = np.concatenate([normal_counts, anomalous_counts])
print("Login attempts per minute:", login_attempts)

When you run the above, login_attempts might look like:

Login attempts per minute: [ 5  4  4  5  5  3  5  ...  4 30 40 50]

Most values are in the single digits, but at the end we have three minutes with 30, 40, and 50 attempts – clear outliers. This is our prepared data for anomaly detection. In a real log analysis, this kind of data might come from counting events in your logs over time or extracting some metric from the log content.

Now that our data is ready, we can move on to building the anomaly detection model.

How to Build the Anomaly Detection Model

To detect anomalies in our log-derived data, we’ll use a machine learning approach. Specifically, we’ll use an Isolation Forest – a popular algorithm for unsupervised anomaly detection.

The Isolation Forest works by randomly partitioning the data and isolating points. Anomalies are those points that get isolated (separated from others) quickly, that is, in fewer random splits. This makes it great for identifying outliers in a dataset without needing any labels (we don’t have to know in advance which log entries are “bad”).

Why Isolation Forest?

It’s efficient and works well even if we have a lot of data.
It doesn’t assume any specific data distribution (unlike some statistical methods).
It gives us a straightforward way to score anomalies.

Let’s train an Isolation Forest on our login_attempts data:

from sklearn.ensemble import IsolationForest

# Prepare the data in the shape the model expects (samples, features)
X = login_attempts.reshape(-1, 1)  # each sample is a 1-dimensional [count]

# Initialize the Isolation Forest model
model = IsolationForest(contamination=0.05, random_state=42)
# contamination=0.05 means we expect about 5% of the data to be anomalies

# Train the model on the data
model.fit(X)

A couple of notes on the code:

We reshaped login_attempts to a 2D array X with one feature column because scikit-learn requires a 2D array for training (fit).
We set contamination=0.05 to give the model a hint that roughly 5% of the data might be anomalies. In our synthetic data we added 3 anomalies out of 53 points, which is ~5.7%, so 5% is a reasonable guess. (If you don’t specify contamination, the algorithm will choose a default based on assumption or use a default 0.1 in some versions.)
random_state=42 just ensures reproducibility.

At this point, the Isolation Forest model has been trained on our data. Internally, it has built an ensemble of random trees that partition the data. Points that are hard to isolate (that is, in the dense cluster of normal points) end up deep in these trees, while points that are easy to isolate (the outliers) end up with shorter paths.

Next, we’ll use this model to identify which data points are considered anomalous.

Testing and Visualizing Results

Now comes the exciting part: using our trained model to detect anomalies in the log data. We’ll have the model predict labels for each data point and then filter out the ones flagged as outliers.

# Use the model to predict anomalies
labels = model.predict(X)
# The model outputs +1 for normal points and -1 for anomalies

# Extract the anomaly indices and values
anomaly_indices = np.where(labels == -1)[0]
anomaly_values = login_attempts[anomaly_indices]

print("Anomaly indices:", anomaly_indices)
print("Anomaly values (login attempts):", anomaly_values)

In our case, we expect the anomalies to be the large numbers we inserted (30, 40, 50). The output might look like:

Anomaly indices: [50 51 52]
Anomaly values (login attempts): [30 40 50]

Even without knowing anything about “login attempts” specifically, the Isolation Forest recognized those values as out-of-line with the rest of the data.

This is the power of anomaly detection in a security context: we don’t always know what a new attack will look like, but if it causes something to drift far from normal patterns (like a user suddenly making 10 times more login attempts than usual), the anomaly detector shines a spotlight on it.

Visualizing the results

In a real analysis, it’s often useful to visualize the data and the anomalies. For instance, we could plot the login_attempts values over time (minute by minute) and highlight the anomalies in a different color.

In this simple case, a line chart would show a mostly flat line around 3-8 logins/min with three huge spikes at the end. Those spikes are our anomalies. You could achieve this with Matplotlib if you’re running this in a notebook:

import matplotlib.pyplot as plt

plt.plot(login_attempts, label="Login attempts per minute")
plt.scatter(anomaly_indices, anomaly_values, color='red', label="Anomalies")
plt.xlabel("Time (minute index)")
plt.ylabel("Login attempts")
plt.legend()
plt.show()

For text-based output as we have here, the printed results already confirm that the high values were caught. In more complex cases, anomaly detection models also provide an anomaly score for each point (for example, how far it is from the normal range). Scikit-learn’s IsolationForest, for example, has a decision_function method that yields a score (where lower scores mean more abnormal).

For simplicity, we won’t delve into the scores here, but it’s good to know you can retrieve them to rank anomalies by severity.

With the anomaly detection working, what can we do when we find an anomaly? That leads us to thinking about automated responses.

Automated Response Possibilities

Detecting an anomaly is only half the battle — the next step is responding to it. In enterprise SIEM systems, automated response (often associated with SOAR – Security Orchestration, Automation, and Response) can dramatically reduce reaction time to incidents.

What could an AI-powered SIEM do when it flags something unusual? Here are some possibilities:

Alerting: The simplest action is to send an alert to security personnel. This could be an email, a Slack message, or creating a ticket in an incident management system. The alert would contain details of the anomaly (for example, “User alice had 50 failed login attempts in 1 minute, which is abnormal”). GenAI can help here by generating a clear natural-language summary of the incident for the analyst.
Automated mitigation: More advanced systems might take direct action. For instance, if an IP address is showing malicious behavior in logs, the system could automatically block that IP on the firewall. In our login spike example, the system might temporarily lock the user account or prompt for additional authentication, under the assumption that it might be a bot attack. AI-based SIEMs today can indeed trigger predefined response actions or even orchestrate complex workflows when certain threats are detected (refer to AI SIEM: How SIEM with AI/ML is Revolutionizing the SOC | Exabeam for more information).
Investigation support: Generative AI could also be used to automatically gather context. For example, upon detecting the anomaly, the system could pull related logs (surrounding events, other actions by the same user or from the same IP) and provide an aggregated report. This saves the analyst from manually querying multiple data sources.

It’s important to implement automated responses carefully — you don’t want the system to overreact to false positives. A common strategy is a tiered response: low-confidence anomalies might just log a warning or send a low-priority alert, whereas high-confidence anomalies (or combinations of anomalies) trigger active defense measures.

In practice, a AI-powered SIEM would integrate with your infrastructure (via APIs, scripts, and so on) to execute these actions. For our Python PoC, you could simulate an automated response by, say, printing a message or calling a dummy function when an anomaly is detected. For example:

if len(anomaly_indices) > 0:
    print(f"Alert! Detected {len(anomaly_indices)} anomalous events. Initiating response procedures...")
    # Here, you could add code to disable a user or notify an admin, etc.

While our demonstration is simple, it’s easy to imagine scaling this up. The SIEM could, for instance, feed anomalies into a larger generative model that assesses the situation and decides on the best course of action (like a chatbot Ops assistant that knows your runbooks). The possibilities for automation are expanding as AI becomes more sophisticated.

Conclusion

In this tutorial, we built a basic AI-powered SIEM component that ingests log data, analyzes it for anomalies using a machine learning model, and identifies unusual events that could represent security threats.

We started by parsing and preparing log data, then used an Isolation Forest model to detect outliers in a stream of login attempt counts. The model successfully flagged out-of-norm behavior without any prior knowledge of what an “attack” looks like – it purely relied on deviations from learned normal patterns.

We also discussed how such a system could respond to detected anomalies, from alerting humans to automatically taking action.

Modern SIEM systems augmented with AI/ML are moving in this direction: not only do they detect issues, but they also help triage and respond to them. Generative AI further enhances this by learning from analysts and providing intelligent summaries and decisions, effectively becoming a tireless assistant in the Security Operations Center.

For next steps and improvements:

You can try this approach on real log data. For example, take a system log file and extract a feature like “number of error logs per hour” or “bytes transferred per session” and run anomaly detection on that.
Experiment with other algorithms like One-Class SVM or Local Outlier Factor for anomaly detection to see how they compare.
Incorporate a simple language model to parse log lines or to explain anomalies. For instance, an LLM could read an anomalous log entry and suggest what might be wrong (“This error usually means the database is unreachable”).
Extend the features: in a real SIEM, you’d use many signals at once (failed login counts, unusual IP geolocation, rare process names in logs, and so on). More features and data can improve the context for detection.

How to Build a Real-Time Intrusion Detection System with Python and Open-Source Libraries

Chaitanya Rahalkar — Tue, 21 Jan 2025 14:28:32 +0000

An Intrusion Detection System (IDS) is like a security camera for your network. Just as security cameras help identify suspicious activities in the physical world, an IDS will monitor your network to help detect any potential cyber attacks and security breaches.

By the end of this tutorial, you will know how an IDS works and be able to build your own real-time network monitoring system using Python.

Understanding the Types of IDS
How to Setup Your Development Environment
Building the Core IDS Components
Ideas to Extend the IDS
Security Considerations
Testing the IDS on Mock Data
Wrapping Up

Understanding the Types of IDS

Before we jump into the coding part, let’s understand the types of IDS:

Network-based IDS (NIDS): This system monitors network traffic for suspicious activity.
Host-based IDS (HIDS): This system monitors system logs and file changes on individual hosts and is not directly deployed in the network.
Signature-based IDS: This system is either in the network or on the host and identifies attack patterns based on known patterns.
Anomaly-based IDS: This system identifies unusual behavior using heuristics and prediction algorithms that are trained on previously seen attack patterns.

For this tutorial, you will be building a hybrid system that combines signature-based and anomaly-based detection systems to monitor network traffic.

How to Setup Your Development Environment

Let’s start by setting up our Python environment (I’m using Python 3) and installing the following prerequisites:

pip install scapy
pip install python-nmap
pip install numpy
pip install sklearn

Building the Core IDS Components

Our IDS will comprise of four main components:

A packet capture system
Traffic analysis module
A detection engine
An alert system

Building the Packet Capture Engine

Let’s start with the packet capture engine. We’ll use Scapy for this. Scapy is a networking library that allows us to perform network and network-related operations using Python.

First, we’ll define our PacketCapture class that will serve as the basis of our IDS.

from scapy.all import sniff, IP, TCP
from collections import defaultdict
import threading
import queue

class PacketCapture:
    def __init__(self):
        self.packet_queue = queue.Queue()
        self.stop_capture = threading.Event()

    def packet_callback(self, packet):
        if IP in packet and TCP in packet:
            self.packet_queue.put(packet)

    def start_capture(self, interface="eth0"):
        def capture_thread():
            sniff(iface=interface,
                  prn=self.packet_callback,
                  store=0,
                  stop_filter=lambda _: self.stop_capture.is_set())

        self.capture_thread = threading.Thread(target=capture_thread)
        self.capture_thread.start()

    def stop(self):
        self.stop_capture.set()
        self.capture_thread.join()

Let’s quickly walk through the code and understand what these functions do. For this, you will be using threading and queues to efficiently process and capture network packets.

The init method initializes the class by creating a queue.Queue to store captured packets and a threading Event to control when the packet capture should stop. The packet_callback method acts as a handler for each captured packet and checks if the packet contains both IP and TCP layers. If so, it adds it to the queue for further processing.

The start_capture method begins capturing packets on a specified interface (defaulting to eth0 to capture packets from the Ethernet interface). Run ifconfig to understand the available interfaces and select the appropriate interface from the list.

The function spawns a separate thread to run Scapy’s sniff function, which continuously monitors the interface for packets. The stop_filter parameter ensures the capture stops when the stop_capture event is triggered.

The stop method stops the capture by setting the stop_capture event and waits for the thread to finish execution, ensuring the process terminates cleanly. This design allows for seamless real-time packet capturing without blocking the main thread.

Building the Traffic Analysis Module

Now, let’s write the traffic analysis module. This module will process captured packets and extract relevant features.

class TrafficAnalyzer:
    def __init__(self):
        self.connections = defaultdict(list)
        self.flow_stats = defaultdict(lambda: {
            'packet_count': 0,
            'byte_count': 0,
            'start_time': None,
            'last_time': None
        })

    def analyze_packet(self, packet):
        if IP in packet and TCP in packet:
            ip_src = packet[IP].src
            ip_dst = packet[IP].dst
            port_src = packet[TCP].sport
            port_dst = packet[TCP].dport

            flow_key = (ip_src, ip_dst, port_src, port_dst)

            # Update flow statistics
            stats = self.flow_stats[flow_key]
            stats['packet_count'] += 1
            stats['byte_count'] += len(packet)
            current_time = packet.time

            if not stats['start_time']:
                stats['start_time'] = current_time
            stats['last_time'] = current_time

            return self.extract_features(packet, stats)

    def extract_features(self, packet, stats):
        return {
            'packet_size': len(packet),
            'flow_duration': stats['last_time'] - stats['start_time'],
            'packet_rate': stats['packet_count'] / (stats['last_time'] - stats['start_time']),
            'byte_rate': stats['byte_count'] / (stats['last_time'] - stats['start_time']),
            'tcp_flags': packet[TCP].flags,
            'window_size': packet[TCP].window
        }

In this code section, we define the TrafficAnalyzer class to analyze network traffic. Here we track connection flows and calculate statistics for packets in real time. We use the defaultdict data structure in Python to manage connections and flow statistics by organizing data by unique flows.

The __init__ method initializes two attributes: connections, which stores lists of related packets for each flow, and flow_stats, which stores aggregated statistics for each flow, such as packet count, byte count, start time, and the time of the most recent packet.

The analyze_packet method processes each packet. If the packet contains IP and TCP layers, it extracts the source and destination IPs and ports, forming a unique flow_key to identify the flow. It updates the statistics for the flow by incrementing the packet count, adding the packet’s size to the byte count, and setting or updating the start and last time of the flow. Eventually, it calls extract_features to calculate and return additional metrics.

The extract_features method computes detailed characteristics of the flow and the current packet. These include the packet size, flow duration, packet rate, byte rate, TCP flags, and the TCP window size. These metrics are quite useful to identify patterns, anomalies, or potential threats in network traffic.

Building the Detection Engine

Now we will define our detection engine that will implement both the signature as well as the anomaly-based detection mechanisms:

from sklearn.ensemble import IsolationForest
import numpy as np

class DetectionEngine:
    def __init__(self):
        self.anomaly_detector = IsolationForest(
            contamination=0.1,
            random_state=42
        )
        self.signature_rules = self.load_signature_rules()
        self.training_data = []

    def load_signature_rules(self):
        return {
            'syn_flood': {
                'condition': lambda features: (
                    features['tcp_flags'] == 2 and  # SYN flag
                    features['packet_rate'] > 100
                )
            },
            'port_scan': {
                'condition': lambda features: (
                    features['packet_size'] < 100 and
                    features['packet_rate'] > 50
                )
            }
        }

    def train_anomaly_detector(self, normal_traffic_data):
        self.anomaly_detector.fit(normal_traffic_data)

    def detect_threats(self, features):
        threats = []

        # Signature-based detection
        for rule_name, rule in self.signature_rules.items():
            if rule['condition'](features):
                threats.append({
                    'type': 'signature',
                    'rule': rule_name,
                    'confidence': 1.0
                })

        # Anomaly-based detection
        feature_vector = np.array([[
            features['packet_size'],
            features['packet_rate'],
            features['byte_rate']
        ]])

        anomaly_score = self.anomaly_detector.score_samples(feature_vector)[0]
        if anomaly_score < -0.5:  # Threshold for anomaly detection
            threats.append({
                'type': 'anomaly',
                'score': anomaly_score,
                'confidence': min(1.0, abs(anomaly_score))
            })

        return threats

This code defines a hybrid system that combines the signature-based and anomaly-based detection methods. We use the Isolation Forest model to detect anomalies and also use pre-defined rules for identifying specific attack patterns. If you would like to know more about how the Isolation Forest model works, check out this article.

In this code snippet, the train_anomaly_detector method trains the Isolation Forest model using a dataset of normal traffic features. This enables the model to differentiate typical traffic patterns from anomalies.

The detect_threats method evaluates network traffic features for potential threats using two approaches:

Signature-Based Detection: It iteratively goes through each of the predefined rules, applying the rule’s condition to the traffic features. If a rule matches, a signature-based threat is recorded with high confidence.
Anomaly-Based Detection: It processes the feature vector (packet size, packet rate, and byte rate) through the Isolation Forest model to calculate an anomaly score. If the score indicates unusual behavior, the detection engine triggers it as an anomaly and produces a confidence score proportional to the anomaly’s severity.

Finally, we return the aggregated list of identified threats with their respective annotation (either signature or anomaly), the rule or score that triggered the anomaly, and a confidence score that suggests how likely it is that the identified pattern is a threat.

Building the Alert System

Now let’s build the last component of our IDS which is the alert system. It will process and log detected threats in a structured way. You will also have the option to extend the system to include additional notification mechanisms like Slack, Jira tickets, and so on

import logging
import json
from datetime import datetime

class AlertSystem:
    def __init__(self, log_file="ids_alerts.log"):
        self.logger = logging.getLogger("IDS_Alerts")
        self.logger.setLevel(logging.INFO)

        handler = logging.FileHandler(log_file)
        formatter = logging.Formatter(
            '%(asctime)s - %(levelname)s - %(message)s'
        )
        handler.setFormatter(formatter)
        self.logger.addHandler(handler)

    def generate_alert(self, threat, packet_info):
        alert = {
            'timestamp': datetime.now().isoformat(),
            'threat_type': threat['type'],
            'source_ip': packet_info.get('source_ip'),
            'destination_ip': packet_info.get('destination_ip'),
            'confidence': threat.get('confidence', 0.0),
            'details': threat
        }

        self.logger.warning(json.dumps(alert))

        if threat['confidence'] > 0.8:
            self.logger.critical(
                f"High confidence threat detected: {json.dumps(alert)}"
            )
            # Implement additional notification methods here
            # (e.g., email, Slack, SIEM integration)

The init method sets up a logger named IDS_Alerts with an INFO logging level to capture alert information. It writes logs to a specified file, ids_alerts.log by default. A FileHandler directs logs to the file, while the Formatter ensures the logs follow a consistent format.

The generate_alert method is responsible for creating structured alert entries. Each alert includes key information such as the timestamp of detection, the type of threat, the source and destination IPs involved, the confidence level of the detection, and additional threat-specific details. These alerts are logged as WARNING level messages in JSON format.

If the confidence level of a detected threat is high (greater than 0.8), the alert is escalated and logged as a CRITICAL level message. Note that this method is designed to be extensible, allowing for additional notification mechanisms, such as sending alerts via email or integrating with third-party systems like Slack or SIEM solutions.

Putting it All Together

Now let’s integrate all the components together into our fully functional IDS solution:

class IntrusionDetectionSystem:
    def __init__(self, interface="eth0"):
        self.packet_capture = PacketCapture()
        self.traffic_analyzer = TrafficAnalyzer()
        self.detection_engine = DetectionEngine()
        self.alert_system = AlertSystem()

        self.interface = interface

    def start(self):
        print(f"Starting IDS on interface {self.interface}")
        self.packet_capture.start_capture(self.interface)

        while True:
            try:
                packet = self.packet_capture.packet_queue.get(timeout=1)
                features = self.traffic_analyzer.analyze_packet(packet)

                if features:
                    threats = self.detection_engine.detect_threats(features)

                    for threat in threats:
                        packet_info = {
                            'source_ip': packet[IP].src,
                            'destination_ip': packet[IP].dst,
                            'source_port': packet[TCP].sport,
                            'destination_port': packet[TCP].dport
                        }
                        self.alert_system.generate_alert(threat, packet_info)

            except queue.Empty:
                continue
            except KeyboardInterrupt:
                print("Stopping IDS...")
                self.packet_capture.stop()
                break

if __name__ == "__main__":
    ids = IntrusionDetectionSystem()
    ids.start()

In this code, the IntrusionDetectionSystem class sets up its core components: PacketCapture for capturing packets from a network interface, TrafficAnalyzer for extracting and analyzing packet features, DetectionEngine for identifying threats using both signature-based and anomaly-based methods, and AlertSystem for logging and escalating detected threats. The interface parameter specifies the network interface to monitor, defaulting to eth0 (the generally named ethernet interface on most systems).

The start function initiates the IDS. It begins by starting packet capture on the specified interface and enters a loop to continuously process incoming packets. For each packet captured, the system extracts its features using the TrafficAnalyzer and analyzes them for potential threats using the DetectionEngine. If any threats are detected, the system generates detailed alerts through the AlertSystem.

The system runs in a loop until interrupted by either of the two key exceptions: queue.Empty, which occurs if no packets are available for processing, and KeyboardInterrupt, which stops the IDS gracefully by halting packet capture and exiting the loop.

Ideas to Extend the IDS

To enhance or extend the IDS, you can consider designing or implementing the following features / improvements:

Machine Learning enhancements: You can enhance the IDS capabilities by incorporating deep learning models like Auto Encoders for anomaly detection and using RNNs for sequential pattern analysis. This will improve the system’s ability to identify complex and evolving threats by leveraging advanced feature engineering.
Performance optimizations: You can optimize the IDS using PyPy for faster execution, packet sampling to handle high-traffic networks, and parallel processing to scale the system efficiently.
Integration capabilities: You can extend the IDS by considering support for a REST API for remote monitoring, enabling seamless interaction with external systems.

Security Considerations

When deploying the IDS, note that the system is a proof-of-concept and is not intended for production use-cases. Also keep the following things in mind:

Run the system with appropriate permissions (root/admin required for packet capture)
Secure the alert logs and implement proper log rotation
Regularly update signature rules and retrain anomaly detection models
Monitor system resource usage, especially in high-traffic environments
Implement proper access controls for the IDS configuration and alerts

Testing the IDS on Mock Data

To validate the functionality of your IDS, you can test it using mock data that will simulate real-world network traffic. This will allow you to observe how the system processes packets, analyzes traffic, and generates alerts without requiring a live network environment.

Use the following function to test the IDS:

from scapy.all import IP, TCP

def test_ids():
    # Create test packets to simulate various scenarios
    test_packets = [
        # Normal traffic
        IP(src="192.168.1.1", dst="192.168.1.2") / TCP(sport=1234, dport=80, flags="A"),
        IP(src="192.168.1.3", dst="192.168.1.4") / TCP(sport=1235, dport=443, flags="P"),

        # SYN flood simulation
        IP(src="10.0.0.1", dst="192.168.1.2") / TCP(sport=5678, dport=80, flags="S"),
        IP(src="10.0.0.2", dst="192.168.1.2") / TCP(sport=5679, dport=80, flags="S"),
        IP(src="10.0.0.3", dst="192.168.1.2") / TCP(sport=5680, dport=80, flags="S"),

        # Port scan simulation
        IP(src="192.168.1.100", dst="192.168.1.2") / TCP(sport=4321, dport=22, flags="S"),
        IP(src="192.168.1.100", dst="192.168.1.2") / TCP(sport=4321, dport=23, flags="S"),
        IP(src="192.168.1.100", dst="192.168.1.2") / TCP(sport=4321, dport=25, flags="S"),
    ]

    ids = IntrusionDetectionSystem()

    # Simulate packet processing and threat detection
    print("Starting IDS Test...")
    for i, packet in enumerate(test_packets, 1):
        print(f"\nProcessing packet {i}: {packet.summary()}")

        # Analyze the packet
        features = ids.traffic_analyzer.analyze_packet(packet)

        if features:
            # Detect threats based on features
            threats = ids.detection_engine.detect_threats(features)

            if threats:
                print(f"Detected threats: {threats}")
            else:
                print("No threats detected.")
        else:
            print("Packet does not contain IP/TCP layers or is ignored.")

    print("\nIDS Test Completed.")

if __name__ == "__main__":
    test_ids()

This will test the system against a variety of attacks like SYN flooding and port scanning.

Wrapping Up

Now you know how to build a basic intrusion detection system with Python and a few open-source libraries! This IDS demonstrates some core concepts of network security and real-time threat detection.

Keep in mind that this tutorial is for educational purposes only. There are professionally designed enterprise-grade systems like Snort and Suricata that can handle advanced threats and large-scale deployments.

I hope you gained insights into network security fundamentals and learned how Python can be used to build practical security solutions.

How to Build a Real-time Network Traffic Dashboard with Python and Streamlit

Chaitanya Rahalkar — Fri, 03 Jan 2025 23:08:28 +0000

Have you ever wanted to visualize your network traffic in real-time? In this tutorial, you will be learning how to build an interactive network traffic analysis dashboard with Python and Streamlit. Streamlit is an open-source Python framework you can use to develop web applications for data analysis and data processing.

By the end of this tutorial, you will know how to capture raw network packets from the NIC (Network Interface Card) of your computer, process the data, and create beautiful visualizations that will update in real-time.

Why is Network Traffic Analysis Important?
Prerequisites
How to Setup your Project
How to Build the Core Functionalities
How to Create the Streamlit Visualizations
How to Capture the Network Packets
Putting Everything Together
Future Enhancements
Conclusion

Why is Network Traffic Analysis Important?

Network traffic analysis is a critical requirement in enterprises where networks form the backbone of nearly every application and service. At the core of it, we have analysis of network packets that involves monitoring the network, capturing all the traffic (ingress and egress), and interpreting these packets as they flow through a network. You can use this technique to identify security patterns, detect anomalies, and ensure the security and efficiency of the network.

This proof-of-concept project that we’ll work on in this tutorial is particularly useful since it helps you visualize and analyze network activity in real-time. And this will allow you to understand how troubleshooting issues, performance optimizations, and security analysis is done in enterprise systems.

Prerequisites

Python 3.8 or a newer version installed on your system.
A basic understanding of computer networking concepts.
Familiarity with the Python programming language and its widely used libraries.
Basic knowledge of data visualization techniques and libraries.

How to Setup your Project

To get started, create the project structure and install the necessary tools with Pip with the following commands:

mkdir network-dashboard
cd network-dashboard
pip install streamlit pandas scapy plotly

We will be using Streamlit for the dashboard visualizations, Pandas for the data processing, Scapy for network packet capturing and packet processing, and finally Plotly for plotting charts with our collected data.

How to Build the Core Functionalities

We will be putting all of the code in a single file named dashboard.py. Firstly, let’s start by importing all the elements we will be using:

import streamlit as st
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from scapy.all import *
from collections import defaultdict
import time
from datetime import datetime
import threading
import warnings
import logging
from typing import Dict, List, Optional
import socket

Now let’s configure logging by setting up a basic logging configuration. This will be used for tracking events and running our application in debug mode. We have currently set the logging level to be INFO, meaning that events with level INFO or higher will be displayed. If you are not familiar with logging in Python, I’d recommend checking out this documentation piece that goes in-depth.

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

Next, we’ll build our packet processor. We’ll implement the functionality of processing our captured packets in this class.

class PacketProcessor:
    """Process and analyze network packets"""

    def __init__(self):
        self.protocol_map = {
            1: 'ICMP',
            6: 'TCP',
            17: 'UDP'
        }
        self.packet_data = []
        self.start_time = datetime.now()
        self.packet_count = 0
        self.lock = threading.Lock()

    def get_protocol_name(self, protocol_num: int) -> str:
        """Convert protocol number to name"""
        return self.protocol_map.get(protocol_num, f'OTHER({protocol_num})')

    def process_packet(self, packet) -> None:
        """Process a single packet and extract relevant information"""
        try:
            if IP in packet:
                with self.lock:
                    packet_info = {
                        'timestamp': datetime.now(),
                        'source': packet[IP].src,
                        'destination': packet[IP].dst,
                        'protocol': self.get_protocol_name(packet[IP].proto),
                        'size': len(packet),
                        'time_relative': (datetime.now() - self.start_time).total_seconds()
                    }

                    # Add TCP-specific information
                    if TCP in packet:
                        packet_info.update({
                            'src_port': packet[TCP].sport,
                            'dst_port': packet[TCP].dport,
                            'tcp_flags': packet[TCP].flags
                        })

                    # Add UDP-specific information
                    elif UDP in packet:
                        packet_info.update({
                            'src_port': packet[UDP].sport,
                            'dst_port': packet[UDP].dport
                        })

                    self.packet_data.append(packet_info)
                    self.packet_count += 1

                    # Keep only last 10000 packets to prevent memory issues
                    if len(self.packet_data) > 10000:
                        self.packet_data.pop(0)

        except Exception as e:
            logger.error(f"Error processing packet: {str(e)}")

    def get_dataframe(self) -> pd.DataFrame:
        """Convert packet data to pandas DataFrame"""
        with self.lock:
            return pd.DataFrame(self.packet_data)

This class will build our core functionality and has several utility functions that will be used for processing the packets.

Network packets are categorized into two at transport level (TCP and UDP) and the ICMP protocol at the network level. If you are unfamiliar with the concepts of TCP/IP, I recommend checking out this article on freeCodeCamp News.

Our constructor will keep track of all packets seen that are categorized into these TCP/IP protocol type buckets that we defined. We’ll also take note of the packet capture time, the data captured, and the number of packets captured.

We’ll also be leveraging a thread lock to ensure that only one packet is processed at a single time. This can be further extended to enable the project to have parallel packet processing.

The get_protocol_name helper function helps us get the correct type of the protocol based on their protocol numbers. To give some background on this, the Internet Assigned Numbers Authority (IANA) assigns standardized numbers to identify different protocols in a network packet. As and when we see these numbers in the parsed network packet, we’ll know what kind of protocol is being used in the packet currently intercepted. For the scope of this project, we’ll be mapping to only TCP, UDP and ICMP (Ping). If we encounter any other type of packet, we’ll categorize it as OTHER().

The process_packet function handles our core functionality that will process these individual packets. If the packet contains an IP layer, it will take note of the source and destination IP addresses, protocol type, packet size, and time elapsed since the start of packet capturing.

For packets with specific transport layer protocols (like TCP and UDP), we will capture the source and destination ports along with TCP flags for TCP packets. These extracted details will be stored in memory in the packet_data list. We will also keep track of the packet_count as and when these packets are processed.

The get_dataframe function helps us to convert the packet_data list into a Pandas data-frame that will then be used for our visualization.

How to Create the Streamlit Visualizations

Now it’s time for us to build our interactive Streamlit Dashboard. We will define a function called create_visualization in the dashboard.py script (outside of our packet processing class).

def create_visualizations(df: pd.DataFrame):
    """Create all dashboard visualizations"""
    if len(df) > 0:
        # Protocol distribution
        protocol_counts = df['protocol'].value_counts()
        fig_protocol = px.pie(
            values=protocol_counts.values,
            names=protocol_counts.index,
            title="Protocol Distribution"
        )
        st.plotly_chart(fig_protocol, use_container_width=True)

        # Packets timeline
        df['timestamp'] = pd.to_datetime(df['timestamp'])
        df_grouped = df.groupby(df['timestamp'].dt.floor('S')).size()
        fig_timeline = px.line(
            x=df_grouped.index,
            y=df_grouped.values,
            title="Packets per Second"
        )
        st.plotly_chart(fig_timeline, use_container_width=True)

        # Top source IPs
        top_sources = df['source'].value_counts().head(10)
        fig_sources = px.bar(
            x=top_sources.index,
            y=top_sources.values,
            title="Top Source IP Addresses"
        )
        st.plotly_chart(fig_sources, use_container_width=True)

This function will take the data frame as input and will help us plot three charts / graphs:

Protocol Distribution Chart: This chart will display the proportion of different protocols (for example,TCP, UDP, ICMP) in the captured packet traffic.
Packets Timeline Chart: This chart will show the number of packets processed per second over a time period.
Top Source IP Addresses Chart: This chart will highlight the top 10 IP addresses that sent the most packets in the captured traffic.

The protocol distribution chart is simply a pie chart of the protocol counts for the three different types (along with OTHER). We use the Streamlit and Plotly Python tools to plot these charts. Since we also noted the timestamp since the packet capture started, we will use this data to plot the trend of packets captured over time.

For the second chart, we will do a groupby operation on the data and get the number of packets captured in each second (S stands for seconds), and then finally we will plot the graph.

Finally, for the third chart, we will count the distinct source IPs observed and the plot a chart of the IP counts to show the top 10 IPs.

How to Capture the Network Packets

Now, let’s build the functionality to allow us to capture network packet data.

def start_packet_capture():
    """Start packet capture in a separate thread"""
    processor = PacketProcessor()

    def capture_packets():
        sniff(prn=processor.process_packet, store=False)

    capture_thread = threading.Thread(target=capture_packets, daemon=True)
    capture_thread.start()

    return processor

This is a simple function that instantiates the PacketProcessor class and then uses the sniff function in the scapy module to start capturing the packets.

We use threading here to allow us to capture packets independently from the main program flow. This ensures that the packet capturing operation does not block other operations like updating the dashboard in real-time. We also return the created PacketProcessor instance so that it can be used in our main program.

Putting Everything Together

Now let’s stitch all these pieces together with our main function that will act as the driver function for our program.

def main():
    """Main function to run the dashboard"""
    st.set_page_config(page_title="Network Traffic Analysis", layout="wide")
    st.title("Real-time Network Traffic Analysis")

    # Initialize packet processor in session state
    if 'processor' not in st.session_state:
        st.session_state.processor = start_packet_capture()
        st.session_state.start_time = time.time()

    # Create dashboard layout
    col1, col2 = st.columns(2)

    # Get current data
    df = st.session_state.processor.get_dataframe()

    # Display metrics
    with col1:
        st.metric("Total Packets", len(df))
    with col2:
        duration = time.time() - st.session_state.start_time
        st.metric("Capture Duration", f"{duration:.2f}s")

    # Display visualizations
    create_visualizations(df)

    # Display recent packets
    st.subheader("Recent Packets")
    if len(df) > 0:
        st.dataframe(
            df.tail(10)[['timestamp', 'source', 'destination', 'protocol', 'size']],
            use_container_width=True
        )

    # Add refresh button
    if st.button('Refresh Data'):
        st.rerun()

    # Auto refresh
    time.sleep(2)
    st.rerun()

This function will also instantiate the Streamlit dashboard, and integrate all of our components together. We first set the page title of our Streamlit dashboard and then initialize our PacketProcessor. We use the session state in Streamlit to ensure that only one instance of packet capturing is created and the state of it is retained.

Now, we will dynamically get the dataframe from the session state every time the data is processed and begin to display the metrics and the visualizations. We will also display the recently captured packets along with information like the timestamp, source and destination IPs, protocol, and size of the packet. We will also add the ability for the user to manually refresh the data from the dashboard while we also automatically refresh it every two seconds.

Let’s finally run the program with the following command:

sudo streamlit run dashboard.py

Note that you will have to run the program with sudo since the packet capturing capabilities require administrative privileges. If you are on Windows, open your terminal as Administrator and then run the program without the sudo prefix.

Give it a moment for the program to start capturing packets. If everything goes right, you should see something like this:

These are all the visualizations that we just implemented in our Streamlit dashboard program.

Future Enhancements

With that, here are some future enhancement ideas that you can use to extend the functionalities of the dashboard:

Add machine learning capabilities for anomaly detection
Implement geographical IP mapping
Create custom alerts based on traffic analysis patterns
Add packet payload analysis options

Conclusion

Congratulations! You have now successfully built a real-time network traffic analysis dashboard with Python and Streamlit. This program will provide valuable insights into network behavior and can be extended for various use cases, from security monitoring to network optimization.

With that, I hope you learnt some basics about network traffic analysis as well as a bit of Python programming. Thanks for reading!

How to Build a Honeypot in Python: A Practical Guide to Security Deception

Chaitanya Rahalkar — Thu, 19 Dec 2024 16:58:45 +0000

In cybersecurity, a honeypot is a decoy system that’s designed to attract and then detect potential attackers attempting to compromise the system. Just like a pot of honey sitting out in the open would attract flies.

Think of these honeypots as security cameras for your system. Just as a security camera helps us understand who's trying to break into a building and how they're doing it, these honeypots will help you understand who's trying to attack your system and what techniques they're using.

By the end of this tutorial, you'll be able to write a demo honeypot in Python and understand how honeypots work.

Understanding the Types of Honeypots
How to Set Up Your Development Environment
How to Build the Core Honeypot
How to Analyze Honeypot Data
Security Considerations
Conclusion

Understanding the Types of Honeypots

Before we start designing our own honeypot, let’s quickly understand their different types:

Production Honeypots: These types of honeypots are placed in an actual production environment and are used to detect actual security attacks. They are typically simple in design, easy to maintain and deploy, and offer limited interaction to reduce risk.
Research Honeypots: These are more complex systems set up by security researchers to study attack patterns, perform empirical analysis on these patterns, collect malware samples, and understand new attack techniques that aren’t discovered previously. They often emulate entire operating systems or networks rather than behaving like an application in the production environment.

For this tutorial, we will be building a medium-interaction honeypot that logs connection attempts and basic attacker behavior.

How to Set Up Your Development Environment

Let’s begin by setting up your development environment in Python. Run the following commands:

import socket
import sys
import datetime
import json
import threading
from pathlib import Path

# Configure logging directory
LOG_DIR = Path("honeypot_logs")
LOG_DIR.mkdir(exist_ok=True)

We will be sticking to the built in libraries so won’t be needing to install any external dependencies. We will be storing our logs in the honeypot_logs directory.

How to Build the Core Honeypot

Our basic honeypot will be comprised of three components:

A network listener that accepts connections
A logging system to record activities
A basic emulation service to interact with attackers

Now let’s begin by initializing the core Honeypot class:

class Honeypot:
    def __init__(self, bind_ip="0.0.0.0", ports=None):
        self.bind_ip = bind_ip
        self.ports = ports or [21, 22, 80, 443]  # Default ports to monitor
        self.active_connections = {}
        self.log_file = LOG_DIR / f"honeypot_{datetime.datetime.now().strftime('%Y%m%d')}.json"

    def log_activity(self, port, remote_ip, data):
        """Log suspicious activity with timestamp and details"""
        activity = {
            "timestamp": datetime.datetime.now().isoformat(),
            "remote_ip": remote_ip,
            "port": port,
            "data": data.decode('utf-8', errors='ignore')
        }

        with open(self.log_file, 'a') as f:
            json.dump(activity, f)
            f.write('\n')

    def handle_connection(self, client_socket, remote_ip, port):
        """Handle individual connections and emulate services"""
        service_banners = {
            21: "220 FTP server ready\r\n",
            22: "SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.1\r\n",
            80: "HTTP/1.1 200 OK\r\nServer: Apache/2.4.41 (Ubuntu)\r\n\r\n",
            443: "HTTP/1.1 200 OK\r\nServer: Apache/2.4.41 (Ubuntu)\r\n\r\n"
        }

        try:
            # Send appropriate banner for the service
            if port in service_banners:
                client_socket.send(service_banners[port].encode())

            # Receive data from attacker
            while True:
                data = client_socket.recv(1024)
                if not data:
                    break

                self.log_activity(port, remote_ip, data)

                # Send fake response
                client_socket.send(b"Command not recognized.\r\n")

        except Exception as e:
            print(f"Error handling connection: {e}")
        finally:
            client_socket.close()

This class has a lot of important information in it, so let’s go over each function one by one.

The __init__ function records the ip and port numbers on which we’ll host the honeypot, as well as the path / filename of the log file. We will also be maintaining a record of the total number of active connections we have to the honeypot.

The log_activity function is going to receive the information about the IP, the data, and the port to which the IP attempted a connection. Then we’ll append this information to our JSON-formatted log file.

The handle_connection function is going to mimic these services that will be running on the different ports we have. We will have the honeypot running on ports 21, 22, 80 and 443. These services are for FTP, SSH, HTTP and the HTTPS protocol, respectively. So any attacker attempting to interact with the honeypot should expect these services on these ports.

To mimic the behavior of these services, we’ll use the service banners that they use in reality. This function will first send the appropriate banner when the attacker connects, and then receive the data and log it. The honeypot will also send a fake response “Command not recognized” back to the attacker.

Implement the Network Listeners

Now let’s implement the network listeners that will be handling the incoming connections. For this, we’ll be using simple socket programming. If you aren’t aware of how socket programming works, check out this article that explains some concepts related to it.

def start_listener(self, port):
    """Start a listener on specified port"""
    try:
        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        server.bind((self.bind_ip, port))
        server.listen(5)

        print(f"[*] Listening on {self.bind_ip}:{port}")

        while True:
            client, addr = server.accept()
            print(f"[*] Accepted connection from {addr[0]}:{addr[1]}")

            # Handle connection in separate thread
            client_handler = threading.Thread(
                target=self.handle_connection,
                args=(client, addr[0], port)
            )
            client_handler.start()

    except Exception as e:
        print(f"Error starting listener on port {port}: {e}")

The start_listener function will start the server and listen on the provided port. The bind_ip for us is going to be 0.0.0.0 which indicates that the server will be listening on all network interfaces.

Now, we will handle each new connection in a separate thread, since there could be instances where multiple attackers attempt to interact with the honeypot or an attacking script or tool is scanning the honeypot. If you aren’t aware of how threading works, you can check out this article that explains threading and concurrency in Python.

Also, make sure to put this function in the core Honeypot class.

Run the Honeypot

Now let’s create the main function that will start our honeypot.

def main():
    honeypot = Honeypot()

    # Start listeners for each port in separate threads
    for port in honeypot.ports:
        listener_thread = threading.Thread(
            target=honeypot.start_listener,
            args=(port,)
        )
        listener_thread.daemon = True
        listener_thread.start()

    try:
        # Keep main thread alive
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        print("\n[*] Shutting down honeypot...")
        sys.exit(0)

if __name__ == "__main__":
    main()

This function instantiates the Honeypot class and starts the listeners for each of our defined ports (21,22,80,443) as a separate thread. Now, we’ll keep our main thread that is running our actual program alive by putting it in an infinite loop. Put this all together in a script and run it.

Write the Honeypot Attack Simulator

Now let’s try to simulate some attack scenarios and target our honeypot so that we can collect some data in our JSON log file.

This simulator will help us demonstrate a few important aspects about honeypots:

Realistic attack patterns: The simulator will simulate common attack patterns like port scanning, brute force attempts, and service-specific exploits.
Variable intensity: The simulator will adjust the intensity of the simulation to test how your honeypot handles different loads.
Several attack types: It will demonstrate different types of attacks that real attackers might attempt, helping you understand how your honeypot responds to each.
Concurrent connections: The simulator will use threading to test how your honeypot handles multiple simultaneous connections.

# honeypot_simulator.py

import socket
import time
import random
import threading
from concurrent.futures import ThreadPoolExecutor
import argparse

class HoneypotSimulator:
    """
    A class to simulate different types of connections and attacks against our honeypot.
    This helps in testing the honeypot's logging and response capabilities.
    """

    def __init__(self, target_ip="127.0.0.1", intensity="medium"):
        # Configuration for the simulator
        self.target_ip = target_ip
        self.intensity = intensity

        # Common ports that attackers often probe
        self.target_ports = [21, 22, 23, 25, 80, 443, 3306, 5432]

        # Dictionary of common commands used by attackers for different services
        self.attack_patterns = {
            21: [  # FTP commands
                "USER admin\r\n",
                "PASS admin123\r\n",
                "LIST\r\n",
                "STOR malware.exe\r\n"
            ],
            22: [  # SSH attempts
                "SSH-2.0-OpenSSH_7.9\r\n",
                "admin:password123\n",
                "root:toor\n"
            ],
            80: [  # HTTP requests
                "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n",
                "POST /admin HTTP/1.1\r\nHost: localhost\r\nContent-Length: 0\r\n\r\n",
                "GET /wp-admin HTTP/1.1\r\nHost: localhost\r\n\r\n"
            ]
        }

        # Intensity settings affect the frequency and volume of simulated attacks
        self.intensity_settings = {
            "low": {"max_threads": 2, "delay_range": (1, 3)},
            "medium": {"max_threads": 5, "delay_range": (0.5, 1.5)},
            "high": {"max_threads": 10, "delay_range": (0.1, 0.5)}
        }

    def simulate_connection(self, port):
        """
        Simulates a connection attempt to a specific port with realistic attack patterns
        """
        try:
            # Create a new socket connection
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.settimeout(3)

            print(f"[*] Attempting connection to {self.target_ip}:{port}")
            sock.connect((self.target_ip, port))

            # Get banner if any
            banner = sock.recv(1024)
            print(f"[+] Received banner from port {port}: {banner.decode('utf-8', 'ignore').strip()}")

            # Send attack patterns based on the port
            if port in self.attack_patterns:
                for command in self.attack_patterns[port]:
                    print(f"[*] Sending command to port {port}: {command.strip()}")
                    sock.send(command.encode())

                    # Wait for response
                    try:
                        response = sock.recv(1024)
                        print(f"[+] Received response: {response.decode('utf-8', 'ignore').strip()}")
                    except socket.timeout:
                        print(f"[-] No response received from port {port}")

                    # Add realistic delay between commands
                    time.sleep(random.uniform(*self.intensity_settings[self.intensity]["delay_range"]))

            sock.close()

        except ConnectionRefusedError:
            print(f"[-] Connection refused on port {port}")
        except socket.timeout:
            print(f"[-] Connection timeout on port {port}")
        except Exception as e:
            print(f"[-] Error connecting to port {port}: {e}")

    def simulate_port_scan(self):
        """
        Simulates a basic port scan across common ports
        """
        print(f"\n[*] Starting port scan simulation against {self.target_ip}")
        for port in self.target_ports:
            self.simulate_connection(port)
            time.sleep(random.uniform(0.1, 0.3))

    def simulate_brute_force(self, port):
        """
        Simulates a brute force attack against a specific service
        """
        common_usernames = ["admin", "root", "user", "test"]
        common_passwords = ["password123", "admin123", "123456", "root"]

        print(f"\n[*] Starting brute force simulation against port {port}")

        for username in common_usernames:
            for password in common_passwords:
                try:
                    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                    sock.settimeout(2)
                    sock.connect((self.target_ip, port))

                    if port == 21:  # FTP
                        sock.send(f"USER {username}\r\n".encode())
                        sock.recv(1024)
                        sock.send(f"PASS {password}\r\n".encode())
                    elif port == 22:  # SSH
                        sock.send(f"{username}:{password}\n".encode())

                    sock.close()
                    time.sleep(random.uniform(0.1, 0.3))

                except Exception as e:
                    print(f"[-] Error in brute force attempt: {e}")

    def run_continuous_simulation(self, duration=300):
        """
        Runs a continuous simulation for a specified duration
        """
        print(f"\n[*] Starting continuous simulation for {duration} seconds")
        print(f"[*] Intensity level: {self.intensity}")

        end_time = time.time() + duration

        with ThreadPoolExecutor(
            max_workers=self.intensity_settings[self.intensity]["max_threads"]
        ) as executor:
            while time.time() < end_time:
                # Mix of different attack patterns
                simulation_choices = [
                    lambda: self.simulate_port_scan(),
                    lambda: self.simulate_brute_force(21),
                    lambda: self.simulate_brute_force(22),
                    lambda: self.simulate_connection(80)
                ]

                # Randomly choose and execute an attack pattern
                executor.submit(random.choice(simulation_choices))
                time.sleep(random.uniform(*self.intensity_settings[self.intensity]["delay_range"]))

def main():
    """
    Main function to run the honeypot simulator with command-line arguments
    """
    parser = argparse.ArgumentParser(description="Honeypot Attack Simulator")
    parser.add_argument("--target", default="127.0.0.1", help="Target IP address")
    parser.add_argument(
        "--intensity",
        choices=["low", "medium", "high"],
        default="medium",
        help="Simulation intensity level"
    )
    parser.add_argument(
        "--duration",
        type=int,
        default=300,
        help="Simulation duration in seconds"
    )

    args = parser.parse_args()

    simulator = HoneypotSimulator(args.target, args.intensity)

    try:
        simulator.run_continuous_simulation(args.duration)
    except KeyboardInterrupt:
        print("\n[*] Simulation interrupted by user")
    except Exception as e:
        print(f"[-] Simulation error: {e}")
    finally:
        print("\n[*] Simulation complete")

if __name__ == "__main__":
    main()

We have a lot going on in this simulation script, so let’s break it down one by one. I’ve also added comments for every function and operation to make this a bit more readable in the code.

We first have our utility class called the HoneypotSimulator. In this class, we have the __init__ function that sets up the basic configuration for our simulator. It takes two parameters: a target IP address (defaulting to localhost) and an intensity level (defaulting to "medium").

We also define three important components: the target ports to probe (common services like FTP, SSH, HTTP), attack patterns specific to each service (like login attempts and commands), and intensity settings that control how aggressive our simulation will be through thread counts and timing delays.

The simulate_connection function handles individual connection attempts to a specific port. It creates a socket connection, tries to get any service banners (like SSH version information), and then sends appropriate attack commands based on the service type. We have added error handling for common network issues and also added realistic delays between commands to mimic human interaction.

Our simulate_port_scan function acts like a reconnaissance tool, that will systematically chec each port in our target list. It's similar to how tools like nmap work – going through ports one by one to see what services are available. For each port, it calls the simulate_connection function and adds small random delays to make the scan pattern look more natural.

The simulate_brute_force function maintains lists of common usernames and passwords, attempting different combinations against services like FTP and SSH. For each attempt, it creates a new connection, sends the login credentials in the correct format for that service, and then closes the connection. This helps us to test how well the honeypot detects and logs credential stuffing attacks.

The run_continuous_simulation function runs for a specified duration, randomly choosing between different attack types like port scanning, brute force, or specific service attacks. It uses Python's ThreadPoolExecutor to run multiple attacks simultaneously based on the specified intensity level.

Finally, we have the main function that provides the command-line interface for the simulator. It uses argparse to handle command-line arguments, letting users specify the target IP, intensity level, and duration of the simulation. It creates an instance of the HoneypotSimulator class and manages the overall execution, including proper handling of user interruptions and errors.

After putting the simulator code in a separate script, run it with the following command:

# Run with default settings (medium intensity, localhost, 5 minutes)
python honeypot_simulator.py

# Run with custom settings
python honeypot_simulator.py --target 192.168.1.100 --intensity high --duration 600

Since we are running the honeypot as well as the simulator on the same machine locally, the target will be localhost. But it can be something else in a real scenario or if you are running the honeypot in a VM or a different machine – so make sure you confirm the IP before running the simulator.

How to Analyze Honeypot Data

Let’s quickly write a helper function that will allow us to analyze all the data collected by the Honeypot. Since we’ve stored this in a JSON log file, we can conveniently parse it using the built-in JSON package.

import datetime
import json

def analyze_logs(log_file):
    """Enhanced honeypot log analysis with temporal and behavioral patterns"""
    ip_analysis = {}
    port_analysis = {}
    hourly_attacks = {}
    data_patterns = {}

    # Track session patterns
    ip_sessions = {}
    attack_timeline = []

    with open(log_file, 'r') as f:
        for line in f:
            try:
                activity = json.loads(line)
                timestamp = datetime.datetime.fromisoformat(activity['timestamp'])
                ip = activity['remote_ip']
                port = activity['port']
                data = activity['data']

                # Initialize IP tracking if new
                if ip not in ip_analysis:
                    ip_analysis[ip] = {
                        'total_attempts': 0,
                        'first_seen': timestamp,
                        'last_seen': timestamp,
                        'targeted_ports': set(),
                        'unique_payloads': set(),
                        'session_count': 0
                    }

                # Update IP statistics
                ip_analysis[ip]['total_attempts'] += 1
                ip_analysis[ip]['last_seen'] = timestamp
                ip_analysis[ip]['targeted_ports'].add(port)
                ip_analysis[ip]['unique_payloads'].add(data.strip())

                # Track hourly patterns
                hour = timestamp.hour
                hourly_attacks[hour] = hourly_attacks.get(hour, 0) + 1

                # Analyze port targeting patterns
                if port not in port_analysis:
                    port_analysis[port] = {
                        'total_attempts': 0,
                        'unique_ips': set(),
                        'unique_payloads': set()
                    }
                port_analysis[port]['total_attempts'] += 1
                port_analysis[port]['unique_ips'].add(ip)
                port_analysis[port]['unique_payloads'].add(data.strip())

                # Track payload patterns
                if data.strip():
                    data_patterns[data.strip()] = data_patterns.get(data.strip(), 0) + 1

                # Track attack timeline
                attack_timeline.append({
                    'timestamp': timestamp,
                    'ip': ip,
                    'port': port
                })

            except (json.JSONDecodeError, KeyError) as e:
                continue

    # Analysis Report Generation
    print("\n=== Honeypot Analysis Report ===")

    # 1. IP-based Analysis
    print("\nTop 10 Most Active IPs:")
    sorted_ips = sorted(ip_analysis.items(), 
                       key=lambda x: x[1]['total_attempts'], 
                       reverse=True)[:10]
    for ip, stats in sorted_ips:
        duration = stats['last_seen'] - stats['first_seen']
        print(f"\nIP: {ip}")
        print(f"Total Attempts: {stats['total_attempts']}")
        print(f"Active Duration: {duration}")
        print(f"Unique Ports Targeted: {len(stats['targeted_ports'])}")
        print(f"Unique Payloads: {len(stats['unique_payloads'])}")

    # 2. Port Analysis
    print("\nPort Targeting Analysis:")
    sorted_ports = sorted(port_analysis.items(),
                         key=lambda x: x[1]['total_attempts'],
                         reverse=True)
    for port, stats in sorted_ports:
        print(f"\nPort {port}:")
        print(f"Total Attempts: {stats['total_attempts']}")
        print(f"Unique Attackers: {len(stats['unique_ips'])}")
        print(f"Unique Payloads: {len(stats['unique_payloads'])}")

    # 3. Temporal Analysis
    print("\nHourly Attack Distribution:")
    for hour in sorted(hourly_attacks.keys()):
        print(f"Hour {hour:02d}: {hourly_attacks[hour]} attempts")

    # 4. Attack Sophistication Analysis
    print("\nAttacker Sophistication Analysis:")
    for ip, stats in sorted_ips:
        sophistication_score = (
            len(stats['targeted_ports']) * 0.4 +  # Port diversity
            len(stats['unique_payloads']) * 0.6   # Payload diversity
        )
        print(f"IP {ip}: Sophistication Score {sophistication_score:.2f}")

    # 5. Common Payload Patterns
    print("\nTop 10 Most Common Payloads:")
    sorted_payloads = sorted(data_patterns.items(),
                            key=lambda x: x[1],
                            reverse=True)[:10]
    for payload, count in sorted_payloads:
        if len(payload) > 50:  # Truncate long payloads
            payload = payload[:50] + "..."
        print(f"Count {count}: {payload}")

You can place this in a separate script file and call the function on the JSON logs. This function will provide us comprehensive insights from the JSON file based on the data collected.

Our analysis begins by grouping the data into several categories like IP-based statistics, port targeting patterns, hourly attack distributions, and payload characteristics. For every IP, we are tracking total attempts, first and last seen times, targeted ports and unique payloads. This will help us build unique profiles for attackers.

We also examine port-based attack patterns here that monitor for most frequently targeted ports, and by how many unique attackers. We also perform an attack sophistication analysis that helps us identify targeted attackers, considering factors like ports targeted and unique payloads used. This analysis is used for separating simple scanning activities and sophisticated attacks.

Temporal analysis helps us to identify patterns in hourly attack attempts revealing patterns in attack timing and potential automated targeting campaigns. Finally, we publish commonly seen payloads to identify commonly seen attack strings or commands.

Security Considerations

While deploying this honeypot, make sure you consider the following security measures:

Run your honeypot in an isolated environment. Typically inside a VM, or on your local machine that is behind a NAT and a firewall.
Run the honeypot with minimal system privileges (typically not as root) to reduce risk if compromised.
Be cautious with collected data if you plan to ever deploy it as a production-grade or research honeypot as it may contain malware or sensitive information.
Implement robust monitoring mechanisms to detect attempts to break out of the honeypot environment.

Conclusion

With this we have built our honeypot, written a simulator to simulate attacks for our honeypot and analyzed the data from our honeypot logs to make a few simple inferences. It is an excellent way to understand both offensive as well as defensive security concepts. You can consider building upon this to create more complex detection systems and think of adding features like:

Dynamic service emulation based on attack behavior
Integration with threat intelligence systems that will perform better inference analysis of these collected honeypot logs
Gather even comprehensive logs beyond the IP, port and network data through advanced logging mechanisms
Add machine learning capabilities to detect attack patterns

Remember that even though honeypots are powerful security tools, they should be a part of a comprehensive defensive security strategy, not the only line of defense.

I hope you learnt about how honeypots work, what is their purpose as well as a bit of Python programming as well!

Building a Simple Web Application Security Scanner with Python: A Beginner's Guide

Chaitanya Rahalkar — Thu, 12 Dec 2024 15:38:02 +0000

In this article, you are going to learn to create a basic security tool that can be helpful in identifying common vulnerabilities in web applications.

I have two goals here. The first is to empower you with the skills to develop tools that can help enhance the overall security posture of your websites. The second is to help you practice some Python programming.

In this guide, you will be building a Python-based security scanner that can detect XSS, SQL injection, and sensitive PII (Personally Identifiable Information).

Types of Vulnerabilities

Generally, we can categorize web security vulnerabilities into the following buckets (for even more buckets, check the OWASP Top 10):

SQL injection: A technique where attackers are able to insert malicious SQL code into SQL queries through unvalidated inputs, allowing them to modify / read database contents.
Cross-Site Scripting (XSS): A technique where attackers inject malicious JavaScript in trusted websites. This allows them to execute the JavaScript code in the context of the browser and steal sensitive information or perform unauthorized operations.
Sensitive information exposure: A security issue where an application unintentionally reveals sensitive data like passwords, API keys and so on through logs, insecure storage, and other vulnerabilities.
Common security misconfigurations: Security issues that occurs due to improper configuration of web servers – like default credentials for administrator accounts, enabled debug mode, publicly available administrator dashboards with weak credentials, and so on.
Basic authentication weaknesses: Security issues that occur due to lapses in password policies, user authentication processes, improper session management, and so on.

Prerequisites
Setting Up Our Development Environment
Building our Core Scanner Class
Implementing the Crawler
Designing and Implementing the Security Checks
Implementing the Main Scanning Logic
Extending the Security Scanner
Wrapping Up

Prerequisites

To follow along with this tutorial, you will be needing:

Python 3.x
Basic understanding of HTTP protocols
Basic understanding of web applications
Basic understanding of how XSS, SQL injection, and basic security attacks work

Setting Up Our Development Environment

Let’s install our required dependencies with the following command:

pip install requests beautifulsoup4 urllib3 colorama

We’ll use these dependencies in our code file:

# Required packages
import requests
from bs4 import BeautifulSoup
import urllib.parse
import colorama
import re
from concurrent.futures import ThreadPoolExecutor
import sys
from typing import List, Dict, Set

Building our Core Scanner Class

Once you have the dependencies, it’s time to write the core scanner class.

This class will serve as our main class that will handle the web security scanning functionality. It will track our visited pages and also store our findings.

We have the normalize_url function that we’ll use to ensure that you don’t rescan URLs that have already been seen before. This function will essentially remove the HTTP GET parameters from the URL. For example, https://example.com/page?id=1 will become https://example.com/page after normalizing it.

class WebSecurityScanner:
    def __init__(self, target_url: str, max_depth: int = 3):
        """
        Initialize the security scanner with a target URL and maximum crawl depth.

        Args:
            target_url: The base URL to scan
            max_depth: Maximum depth for crawling links (default: 3)
        """
        self.target_url = target_url
        self.max_depth = max_depth
        self.visited_urls: Set[str] = set()
        self.vulnerabilities: List[Dict] = []
        self.session = requests.Session()

        # Initialize colorama for cross-platform colored output
        colorama.init()

    def normalize_url(self, url: str) -> str:
        """Normalize the URL to prevent duplicate checks"""
        parsed = urllib.parse.urlparse(url)
        return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"

Implementing the Crawler

The first step in our scanner is to implement a web crawler that will discover pages and URLs in a given target application. Make sure you’re writing these functions in our WebSecurityScanner class.

def crawl(self, url: str, depth: int = 0) -> None:
    """
    Crawl the website to discover pages and endpoints.

    Args:
        url: Current URL to crawl
        depth: Current depth in the crawl tree
    """
    if depth > self.max_depth or url in self.visited_urls:
        return

    try:
        self.visited_urls.add(url)
        response = self.session.get(url, verify=False)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all links in the page
        links = soup.find_all('a', href=True)
        for link in links:
            next_url = urllib.parse.urljoin(url, link['href'])
            if next_url.startswith(self.target_url):
                self.crawl(next_url, depth + 1)

    except Exception as e:
        print(f"Error crawling {url}: {str(e)}")

This crawl function helps us perform a depth-first crawl of a website. It will explore all pages of a website while staying within the specified domain.

For example, if you plan to use this scanner on https://google.com, the function will first get all the URLs and then one-by-one check if they belong to the specified domain (that is, google.com). If so, it will recursively continue to scan the seen URL up to a specified depth which is supplied with the depth parameter as an argument to the function. We also have some exception handling to make sure we handle errors smoothly and report any errors during crawling.

Designing and Implementing the Security Checks

Now let’s finally get to the juicy part and implement our security checks. We’ll start first with SQL Injection.

SQL Injection Detection Check

def check_sql_injection(self, url: str) -> None:
    """Test for potential SQL injection vulnerabilities"""
    sql_payloads = ["'", "1' OR '1'='1", "' OR 1=1--", "' UNION SELECT NULL--"]

    for payload in sql_payloads:
        try:
            # Test GET parameters
            parsed = urllib.parse.urlparse(url)
            params = urllib.parse.parse_qs(parsed.query)

            for param in params:
                test_url = url.replace(f"{param}={params[param][0]}", 
                                     f"{param}={payload}")
                response = self.session.get(test_url)

                # Look for SQL error messages
                if any(error in response.text.lower() for error in 
                    ['sql', 'mysql', 'sqlite', 'postgresql', 'oracle']):
                    self.report_vulnerability({
                        'type': 'SQL Injection',
                        'url': url,
                        'parameter': param,
                        'payload': payload
                    })

        except Exception as e:
            print(f"Error testing SQL injection on {url}: {str(e)}")

This function essentially performs basic SQL injection checks by testing the URL against common SQL injection payloads and looking for error messages that might hint at a security vulnerability.

Based on the error message received after performing a simple GET request on the URL, we check whether that message is a database error or not. If it is, we use the report_vulnerability function to report that as a security issue in our final report that this script will generate. For the sake of this example, we are selecting a few commonly tested SQL injection payloads, but you can extend this to test even more.

XSS (Cross-Site Scripting) Check

Now let’s implement the second security check for XSS payloads.

def check_xss(self, url: str) -> None:
    """Test for potential Cross-Site Scripting vulnerabilities"""
    xss_payloads = [
        "",
        "",
        "javascript:alert('XSS')"
    ]

    for payload in xss_payloads:
        try:
            # Test GET parameters
            parsed = urllib.parse.urlparse(url)
            params = urllib.parse.parse_qs(parsed.query)

            for param in params:
                test_url = url.replace(f"{param}={params[param][0]}", 
                                     f"{param}={urllib.parse.quote(payload)}")
                response = self.session.get(test_url)

                if payload in response.text:
                    self.report_vulnerability({
                        'type': 'Cross-Site Scripting (XSS)',
                        'url': url,
                        'parameter': param,
                        'payload': payload
                    })

        except Exception as e:
            print(f"Error testing XSS on {url}: {str(e)}")

This function, just like the SQL injection tester, uses a set of common XSS payloads and applies the same idea. But the key difference here is that we are looking for our injected payload to appear unmodified in our response rather than looking for an error message.

If you are able to see our injected payload, most likely it will be executed in the context of the victim’s browser as a reflected XSS attack.

Sensitive Information Exposure Check

Now let’s implement our final check for sensitive PII.

def check_sensitive_info(self, url: str) -> None:
    """Check for exposed sensitive information"""
    sensitive_patterns = {
        'email': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
        'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
        'api_key': r'api[_-]?key[_-]?([\'"|`])([a-zA-Z0-9]{32,45})\1'
    }

    try:
        response = self.session.get(url)

        for info_type, pattern in sensitive_patterns.items():
            matches = re.finditer(pattern, response.text)
            for match in matches:
                self.report_vulnerability({
                    'type': 'Sensitive Information Exposure',
                    'url': url,
                    'info_type': info_type,
                    'pattern': pattern
                })

    except Exception as e:
        print(f"Error checking sensitive information on {url}: {str(e)}")

This function uses a set of predefined Regex patterns to search for PII like emails, phone numbers, SSNs, and API keys (that are prefixed with api-key-).

Just like the previous two functions, we use the response text for the URL and our Regex patterns to find these PIIs in the response text. If we do find any, we report them with the report_vulnerability function. Make sure to have all these functions defined in the WebSecurityScanner class.

Implementing the Main Scanning Logic

Let’s finally stitch everything together by defining the scan and report_vulnerability function in the WebSecurityScanner class:

def scan(self) -> List[Dict]:
    """
    Main scanning method that coordinates the security checks

    Returns:
        List of discovered vulnerabilities
    """
    print(f"\n{colorama.Fore.BLUE}Starting security scan of {self.target_url}{colorama.Style.RESET_ALL}\n")

    # First, crawl the website
    self.crawl(self.target_url)

    # Then run security checks on all discovered URLs
    with ThreadPoolExecutor(max_workers=5) as executor:
        for url in self.visited_urls:
            executor.submit(self.check_sql_injection, url)
            executor.submit(self.check_xss, url)
            executor.submit(self.check_sensitive_info, url)

    return self.vulnerabilities

def report_vulnerability(self, vulnerability: Dict) -> None:
    """Record and display found vulnerabilities"""
    self.vulnerabilities.append(vulnerability)
    print(f"{colorama.Fore.RED}[VULNERABILITY FOUND]{colorama.Style.RESET_ALL}")
    for key, value in vulnerability.items():
        print(f"{key}: {value}")
    print()

This code defines our scan function which will essentially invoke the crawl function and recursively start crawling the website. With multithreading, we will apply all three security checks on the visited URLs.

We have also defined the report_vulnerability function which will effectively print our vulnerability to the console and also store them in our vulnerabilities array.

Now let’s finally use our scanner by saving it as scanner.py:

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python scanner.py ")
        sys.exit(1)

    target_url = sys.argv[1]
    scanner = WebSecurityScanner(target_url)
    vulnerabilities = scanner.scan()

    # Print summary
    print(f"\n{colorama.Fore.GREEN}Scan Complete!{colorama.Style.RESET_ALL}")
    print(f"Total URLs scanned: {len(scanner.visited_urls)}")
    print(f"Vulnerabilities found: {len(vulnerabilities)}")

The target URL will be supplied as a system argument and we will get the summary of URLs scanned and vulnerabilities found at the end of our scan. Now let’s discuss how you can extend the scanner and add more features.

Extending the Security Scanner

Here are some ideas to extend this basic security scanner into something even more advanced:

Add more vulnerability checks like CSRF detection, directory traversal, and so on.
Improve reporting with an HTML or PDF output.
Add configuration options for scan intensity and scope of searching (specifying the depth of scans through a CLI argument).
Implementing proper rate limiting.
Adding authentication support for testing URLs that require session-based authentication.

Wrapping Up

Now you know how to build a basic security scanner! This scanner demonstrates a few core concepts of Web Security.

Keep in mind that this tutorial should only be used for educational purposes. There are several professionally designed enterprise-grade applications like Burp Suite and OWASP Zap that can check for hundreds of security vulnerabilities at a much larger scale.

I hope you learned the basics of web security and a bit of Python programming as well.

Chaitanya Rahalkar - freeCodeCamp.org

How to Build Your Own Local AI: Create Free RAG and AI Agents with Qwen 3 and Ollama

Prerequisites

Table of Contents

Local AI Power with Qwen 3 and Ollama

Ollama: Your Local LLM Gateway

Tutorial Roadmap

How to Set Up Your Local AI Lab

Install Ollama

Choose Your Qwen 3 Model

Pull and Run Qwen 3 with Ollama

Set Up Your Python Environment

How to Build a Local RAG System with Qwen 3

Step 1: Prepare Your Data

Step 2: Load Documents in Python

Step 3: Split Documents

Step 4: Choose and Configure Embedding Model

Option A (Recommended for Simplicity): Ollama Embeddings

Option B (Alternative): Sentence Transformers

Step 5: Set Up Local Vector Store (ChromaDB)

Step 6: Index Documents (Embed and Store)

Step 7: Build the RAG Chain

Step 8: Query Your Documents

How to Create Local AI Agents with Qwen 3

Step 1: Define Custom Tools

Step 2: Set Up the Agent LLM

Step 3: Create the Agent Prompt

Step 4: Build the Agent

Step 5: Create the Agent Executor

Step 6: Run the Agent

Advanced Considerations and Troubleshooting

Controlling Qwen 3's Thinking Mode with Ollama

Managing Context Length (num_ctx)

Hardware Limitations and VRAM

Conclusion and Next Steps

How to Create a Python SIEM System Using AI and LLMs for Log Analysis and Anomaly Detection

Table of Contents

What Are SIEM Systems?

Prerequisites

Setting Up the Project

Implementing Log Analysis

1. Log Ingestion

2. Preprocessing and Feature Extraction

How to Build the Anomaly Detection Model

Testing and Visualizing Results

Visualizing the results

Automated Response Possibilities

Conclusion

How to Build a Real-Time Intrusion Detection System with Python and Open-Source Libraries

Table of Contents

Understanding the Types of IDS

How to Setup Your Development Environment

Building the Core IDS Components

Building the Packet Capture Engine

Building the Traffic Analysis Module

Building the Detection Engine

Building the Alert System

Putting it All Together

Ideas to Extend the IDS

Security Considerations

Testing the IDS on Mock Data

Wrapping Up

How to Build a Real-time Network Traffic Dashboard with Python and Streamlit

Table of Contents

Why is Network Traffic Analysis Important?

Prerequisites

How to Setup your Project

How to Build the Core Functionalities

How to Create the Streamlit Visualizations

How to Capture the Network Packets

Putting Everything Together

Future Enhancements

Conclusion

How to Build a Honeypot in Python: A Practical Guide to Security Deception

Table of Contents

Understanding the Types of Honeypots

How to Set Up Your Development Environment

How to Build the Core Honeypot

Implement the Network Listeners

Run the Honeypot

Managing Context Length (`num_ctx`)