<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Chaitanya Rahalkar - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Chaitanya Rahalkar - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 24 May 2026 22:24:00 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/chaitanyarahalkar/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Build Your Own Local AI: Create Free RAG and AI Agents with Qwen 3 and Ollama ]]>
                </title>
                <description>
                    <![CDATA[ The landscape of Artificial Intelligence is rapidly evolving, and one of the most exciting trends is the ability to run powerful Large Language Models (LLMs) directly on your local machine. This shift away from reliance on cloud-based APIs offers sig... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-local-ai/</link>
                <guid isPermaLink="false">681a35d4d5da806f5e9467b1</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Chaitanya Rahalkar ]]>
                </dc:creator>
                <pubDate>Tue, 06 May 2025 16:16:20 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746545253944/58b04b54-e443-4804-bedd-3290bfda5bb7.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The landscape of Artificial Intelligence is rapidly evolving, and one of the most exciting trends is the ability to run powerful Large Language Models (LLMs) directly on your local machine.</p>
<p>This shift away from reliance on cloud-based APIs offers significant advantages in terms of privacy, cost-effectiveness, and offline accessibility. Developers and enthusiasts can now experiment with and deploy sophisticated AI capabilities without sending data externally or incurring API fees.</p>
<p>This tutorial serves as a practical, hands-on guide to harnessing this local AI power. It focuses on leveraging the Qwen 3 family of LLMs, a state-of-the-art open-source offering from Alibaba, combined with Ollama, a tool that dramatically simplifies running LLMs locally.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before diving into this tutorial, you should have a foundational understanding of Python programming and be comfortable using the command line or terminal. Make sure you have Python 3 installed on your system.</p>
<p>While prior experience with AI or Large Language Models (LLMs) is beneficial, it's not essential, as I’ll introduce and explain core concepts like Retrieval-Augmented Generation (RAG) and AI agents throughout the guide.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-local-ai-power-with-qwen-3-and-ollama">Local AI Power with Qwen 3 and Ollama</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-ollama-your-local-llm-gateway">Ollama: Your Local LLM Gateway</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tutorial-roadmap">Tutorial Roadmap</a></p>
</li>
</ul>
</li>
</ol>
<ol start="2">
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-your-local-ai-lab">How to Set Up Your Local AI Lab</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-install-ollama">Install Ollama</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-choose-your-qwen-3-model">Choose Your Qwen 3 Model</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-pull-and-run-qwen-3-with-ollama">Pull and Run Qwen 3 with Ollama</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-set-up-your-python-environment">Set Up Your Python Environment</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-a-local-rag-system-with-qwen-3">How to Build a Local RAG System with Qwen 3</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-step-1-prepare-your-data">Step 1: Prepare Your Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-load-documents-in-python">Step 2: Load Documents in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-split-documents">Step 3: Split Documents</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-choose-and-configure-embedding-model">Step 4: Choose and Configure Embedding Model</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-set-up-local-vector-store-chromadb">Step 5: Set Up Local Vector Store (ChromaDB)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-index-documents-embed-and-store">Step 6: Index Documents (Embed and Store)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-build-the-rag-chain">Step 7: Build the RAG Chain</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-8-query-your-documents">Step 8: Query Your Documents</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-local-ai-agents-with-qwen-3">How to Create Local AI Agents with Qwen 3</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-step-1-define-custom-tools">Step 1: Define Custom Tools</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-set-up-the-agent-llm">Step 2: Set up the Agent LLM</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-create-the-agent-prompt">Step 3: Create the Agent Prompt</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-build-the-agent">Step 4: Build the Agent</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-create-the-agent-executor">Step 5: Create the Agent Executor</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-run-the-agent">Step 6: Run the Agent</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-advanced-considerations-and-troubleshooting">Advanced Considerations and Troubleshooting</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-controlling-qwen-3s-thinking-mode-with-ollama">Controlling Qwen 3's Thinking Mode with Ollama</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-managing-context-length-numctx">Managing Context Length (num_ctx)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-hardware-limitations-and-vram">Hardware Limitations and VRAM</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion-and-next-steps">Conclusion and Next Steps</a></p>
</li>
</ol>
<h2 id="heading-local-ai-power-with-qwen-3-and-ollama">Local AI Power with Qwen 3 and Ollama</h2>
<p>Running LLMs locally addresses several key concerns associated with cloud-based AI services.</p>
<ul>
<li><p>Privacy is paramount – data processed locally never leaves the user's machine.</p>
</li>
<li><p>Cost is another major factor – utilizing open-source models and tools like Ollama eliminates API subscription fees and pay-per-token charges, making advanced AI accessible to anyone with suitable hardware.</p>
</li>
<li><p>Local execution enables offline functionality – crucial for applications where internet connectivity is unreliable or undesirable.</p>
</li>
</ul>
<h3 id="heading-ollama-your-local-llm-gateway">Ollama: Your Local LLM Gateway</h3>
<p>Ollama acts as a bridge, making the power of models like Qwen 3 accessible on local hardware. It's a command-line tool that simplifies the download, setup, and execution of various open-source LLMs across macOS, Linux, and Windows.</p>
<p>Ollama handles the complexities of model configuration and GPU utilization, providing a straightforward interface for developers and users. It also exposes an OpenAI-compatible API endpoint, allowing seamless integration with popular frameworks like LangChain.</p>
<h3 id="heading-tutorial-roadmap">Tutorial Roadmap</h3>
<p>This tutorial will guide you through the process of:</p>
<ol>
<li><p><strong>Setting up a local AI environment:</strong> Installing Ollama and selecting/running appropriate Qwen 3 models.</p>
</li>
<li><p><strong>Building a local RAG system:</strong> Creating a system that allows chatting with personal documents using Qwen 3, Ollama, LangChain, and ChromaDB for vector storage.</p>
</li>
<li><p><strong>Creating a basic local AI agent:</strong> Developing a simple agent powered by Qwen 3 that can utilize custom-defined tools (functions).</p>
</li>
</ol>
<h2 id="heading-how-to-set-up-your-local-ai-lab">How to Set Up Your Local AI Lab</h2>
<p>The first step is to prepare your local machine with the necessary tools and models.</p>
<h3 id="heading-install-ollama">Install Ollama</h3>
<p>Ollama provides the simplest path to running LLMs locally.</p>
<ul>
<li><p><strong>Linux / macOS:</strong> Open a terminal and run the official installation script:</p>
<pre><code class="lang-bash">  curl -fsSL https://ollama.com/install.sh | sh
</code></pre>
</li>
<li><p><strong>Windows:</strong> Download the installer from the Ollama website (<a target="_blank" href="https://ollama.com/download">https://ollama.com/download</a>) and follow the setup instructions.</p>
</li>
</ul>
<p>After installation, verify it by opening a new terminal window and running:</p>
<pre><code class="lang-bash">ollama --version
</code></pre>
<p>Ollama typically stores downloaded models in <code>~/.ollama/models</code> on macOS and <code>/usr/share/ollama/.ollama/models</code> on Linux/WSL.</p>
<h3 id="heading-choose-your-qwen-3-model">Choose Your Qwen 3 Model</h3>
<p>Selecting the right Qwen 3 model is crucial and depends on your intended task and available hardware, primarily system RAM and GPU VRAM. Larger models require more resources but generally offer better output quality and reasoning capabilities.</p>
<p>Qwen 3 offers two main architectures available through Ollama:</p>
<ul>
<li><p><strong>Dense Models:</strong> (like <code>qwen3:0.6b</code>, <code>qwen3:4b</code>, <code>qwen3:8b</code>, <code>qwen3:14b</code>, <code>qwen3:32b</code>) These models activate all their parameters during inference. Their performance is predictable, but resource requirements scale directly with parameter count.</p>
</li>
<li><p><strong>Mixture-of-Experts (MoE) Models:</strong> (like <code>qwen3:30b-a3b</code>) These models contain many "expert" sub-networks but only activate a small fraction for each input token. This allows them to achieve the performance characteristic of their large total parameter count (for example, 30 billion) while having inference costs closer to their smaller <em>active</em> parameter count (for example, 3 billion). They offer a compelling balance of capability and efficiency, especially for reasoning and coding tasks.</p>
</li>
</ul>
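<p>To make the hardware trade-off concrete, here is a back-of-the-envelope sketch of how much memory the weights alone might need. It assumes 4-bit quantized weights (roughly 0.5 bytes per parameter), which is an assumption rather than a guarantee for any particular model tag, and it ignores the context cache and runtime overhead. The function name and constant are illustrative only.</p>

```python
# Back-of-the-envelope estimate of weight memory; all figures are approximations.

BYTES_PER_PARAM_4BIT = 0.5  # assumption: 4-bit quantized weights


def estimate_weight_memory_gb(total_params_billions, bytes_per_param=BYTES_PER_PARAM_4BIT):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return total_params_billions * bytes_per_param


# Dense 8B model: all 8 billion parameters are stored and used for every token.
dense_8b = estimate_weight_memory_gb(8)

# MoE 30B-A3B: only about 3B parameters are active per token, but all ~30B
# still have to be loaded, so memory follows the total parameter count.
moe_30b = estimate_weight_memory_gb(30)

print(f"qwen3:8b      ~{dense_8b:.1f} GB of weights")
print(f"qwen3:30b-a3b ~{moe_30b:.1f} GB of weights")
```

<p>Under these assumptions, the MoE model saves compute per token rather than memory, so its total parameter count is the number to compare against your available RAM or VRAM.</p>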
<p><strong>Recommendation for this tutorial:</strong> For the examples that follow, <code>qwen3:8b</code> strikes a good balance between capability and resource requirements for many modern machines. If resources are more constrained, <code>qwen3:4b</code> is a viable alternative. The MoE model <code>qwen3:30b-a3b</code> offers excellent performance, especially for coding and reasoning, and runs surprisingly well on systems with 16GB+ VRAM due to its sparse activation.</p>
<h3 id="heading-pull-and-run-qwen-3-with-ollama">Pull and Run Qwen 3 with Ollama</h3>
<p>Once you’ve chosen a model, you’ll need to download it (pull it) via Ollama.</p>
<p><strong>Pull the model:</strong> Open the terminal and run (replace <code>qwen3:8b</code> with the desired tag):</p>
<pre><code class="lang-bash">ollama pull qwen3:8b
</code></pre>
<p>This command downloads the model weights and configuration.</p>
<p><strong>Run interactively (optional test):</strong> To chat directly with the model from the command line:</p>
<pre><code class="lang-bash">ollama run qwen3:8b
</code></pre>
<p>Type prompts directly into the terminal. Use <code>/bye</code> to exit the session. Other useful commands within the interactive session include <code>/?</code> for help and <code>/set parameter &lt;name&gt; &lt;value&gt;</code> (for example, <code>/set parameter num_ctx 8192</code>) to temporarily change model parameters for the current session. Use <code>ollama list</code> outside the session to see downloaded models.</p>
<p><strong>Run as a server:</strong> For integration with Python scripts (using LangChain), Ollama needs to run as a background server process, exposing an API. Open a <em>separate</em> terminal window and run:</p>
<pre><code class="lang-bash">ollama serve
</code></pre>
<p>Keep this terminal window open while running the Python scripts. This command starts the server, typically listening on <code>http://localhost:11434</code>, providing an OpenAI-compatible API endpoint.</p>
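<p>Before running the later scripts, it can help to confirm that the server is actually reachable. The following is a minimal sketch using only the standard library. It assumes the default address above and queries Ollama's <code>/api/tags</code> endpoint, which lists locally available models. The helper name <code>list_local_models</code> is illustrative.</p>

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address mentioned above


def list_local_models(base_url=OLLAMA_URL, timeout=2.0):
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON
        return None
    return [model.get("name") for model in payload.get("models", [])]


models = list_local_models()
if models is None:
    print("Ollama server not reachable. Is `ollama serve` running?")
else:
    print("Available models:", models)
```

<p>If the model you pulled earlier does not appear in the list, pull it again before continuing.</p>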
<h3 id="heading-set-up-your-python-environment">Set Up Your Python Environment</h3>
<p>A dedicated Python environment is recommended for managing dependencies.</p>
<p><strong>Create a virtual environment:</strong></p>
<pre><code class="lang-bash">python -m venv venv
</code></pre>
<p><strong>Activate the environment:</strong></p>
<ul>
<li><p>macOS/Linux: <code>source venv/bin/activate</code></p>
</li>
<li><p>Windows: <code>venv\Scripts\activate</code></p>
</li>
</ul>
<p><strong>Install necessary libraries:</strong></p>
<pre><code class="lang-bash">pip install langchain langchain-community langchain-core langchain-ollama chromadb sentence-transformers pypdf python-dotenv unstructured[pdf] tiktoken
</code></pre>
<ul>
<li><p><code>langchain</code>, <code>langchain-community</code>, <code>langchain-core</code>: The core LangChain framework for building LLM applications.</p>
</li>
<li><p><code>langchain-ollama</code>: Specific integration for using Ollama models with LangChain.</p>
</li>
<li><p><code>chromadb</code>: The local vector database for storing document embeddings.</p>
</li>
<li><p><code>sentence-transformers</code>: Used for an alternative local embedding method (explained later).</p>
</li>
<li><p><code>pypdf</code>: A library for loading PDF documents.</p>
</li>
<li><p><code>python-dotenv</code>: For managing environment variables (optional but good practice).</p>
</li>
<li><p><code>unstructured[pdf]</code>: An alternative, powerful document loader, especially for complex PDFs.</p>
</li>
<li><p><code>tiktoken</code>: Used by LangChain for token counting.</p>
</li>
</ul>
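<p>As a quick sanity check after installation, the sketch below reports which of the packages listed above are installed in the active environment. It relies only on the standard library's <code>importlib.metadata</code>. The helper name <code>installed_versions</code> is illustrative.</p>

```python
from importlib import metadata

# Distribution names from the pip install command above
REQUIRED = [
    "langchain", "langchain-community", "langchain-core", "langchain-ollama",
    "chromadb", "sentence-transformers", "pypdf", "python-dotenv",
    "unstructured", "tiktoken",
]


def installed_versions(packages=REQUIRED):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name:24} {version or 'NOT INSTALLED'}")
```

<p>Run it with the virtual environment activated. A package reported as missing usually means <code>pip</code> installed into a different interpreter.</p>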
<p>The local setup involves coordinating several independent components: Ollama itself, the specific Qwen 3 model weights, the Python environment, and various libraries like LangChain and ChromaDB. Ensuring compatibility between these pieces and correctly configuring parameters (like Ollama's context window size or selecting a model appropriate for the available VRAM) is key to a smooth experience.</p>
<p>While this modularity offers flexibility – allowing components like the LLM or vector store to be swapped – it also means the initial setup requires careful attention to detail. This tutorial aims to provide clear steps and sensible defaults to minimize potential friction points.</p>
<h2 id="heading-how-to-build-a-local-rag-system-with-qwen-3">How to Build a Local RAG System with Qwen 3</h2>
<p>Retrieval-Augmented Generation (RAG) is a powerful technique that enhances LLMs by providing them with external knowledge.</p>
<p>Instead of relying solely on its training data, the LLM can retrieve relevant information from a specified document set (like local PDFs) and use that information to answer questions. This significantly reduces "hallucinations" (incorrect or fabricated information) and allows the LLM to answer questions about specific, private data without needing retraining.</p>
<p>The core RAG process involves:</p>
<ol>
<li><p>Loading and splitting documents into manageable chunks.</p>
</li>
<li><p>Converting these chunks into numerical representations (embeddings) using an embedding model.</p>
</li>
<li><p>Storing these embeddings in a vector database for efficient searching.</p>
</li>
<li><p>When a query comes in, embedding the query and searching the vector database for the most similar document chunks.</p>
</li>
<li><p>Providing these relevant chunks (context) along with the original query to the LLM to generate an informed answer.</p>
</li>
</ol>
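<p>The five steps above can be sketched as a toy, dependency-free pipeline. This is not how the tutorial's real system works internally: the "embedding" here is a plain word-count vector that only matches exact words, whereas a real embedding model captures meaning. All names (<code>embed</code>, <code>retrieve</code>, and so on) are illustrative.</p>

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Steps 1-3: split into chunks, embed them, and store (text, vector) pairs.
chunks = [
    "Ollama runs large language models on your own machine.",
    "ChromaDB stores embeddings so similar chunks can be found quickly.",
    "A virtual environment keeps Python dependencies isolated.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query, k=1):
    """Step 4: embed the query and return the k most similar chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Step 5: place the retrieved context and the question into a prompt for the LLM.
question = "Where are embeddings stored?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

<p>In the real system, the embedding model, the vector database, and the LLM call replace <code>embed</code>, <code>index</code>, and the final <code>print</code>.</p>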
<p>Let's build this locally using Qwen 3, Ollama, LangChain, and ChromaDB.</p>
<h3 id="heading-step-1-prepare-your-data">Step 1: Prepare Your Data</h3>
<p>Create a directory named <code>data</code> in the project folder. Place the PDF document that you intend to query into this directory. For simplicity, this tutorial uses a single, primarily text-based PDF (like a research paper or a report).</p>
<pre><code class="lang-bash">mkdir data
<span class="hljs-comment"># Copy your PDF file into the 'data' directory</span>
<span class="hljs-comment"># e.g., cp ~/Downloads/some_paper.pdf ./data/mydocument.pdf</span>
</code></pre>
<p>If you don’t have a PDF readily available that you’d like to use, you can download a sample PDF (the Llama 2 paper) for this tutorial using the following command in your terminal:</p>
<pre><code class="lang-bash">
wget --user-agent <span class="hljs-string">"Mozilla"</span> <span class="hljs-string">"https://arxiv.org/pdf/2307.09288.pdf"</span> -O <span class="hljs-string">"data/llama2.pdf"</span>
</code></pre>
<p>This command downloads the PDF and saves it as <code>llama2.pdf</code> inside the <code>data</code> directory you created earlier (<code>wget -O</code> does not create missing directories). If you use this sample, set the filename in the subsequent Python code to <code>llama2.pdf</code>. If you prefer to use your own document, place your PDF file into the <code>data</code> directory and update the filename accordingly.</p>
<h3 id="heading-step-2-load-documents-in-python">Step 2: Load Documents in Python</h3>
<p>Use LangChain's document loaders to read the PDF content. <code>PyPDFLoader</code> is straightforward for simple PDFs. <code>UnstructuredPDFLoader</code> (requires <code>unstructured[pdf]</code>) can handle more complex layouts but has more dependencies.</p>
<pre><code class="lang-python"><span class="hljs-comment"># rag_local.py</span>
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> langchain_community.document_loaders <span class="hljs-keyword">import</span> PyPDFLoader <span class="hljs-comment"># Or UnstructuredPDFLoader</span>

load_dotenv() <span class="hljs-comment"># Optional: Loads environment variables from a .env file</span>

DATA_PATH = <span class="hljs-string">"data/"</span>
PDF_FILENAME = <span class="hljs-string">"mydocument.pdf"</span> <span class="hljs-comment"># Replace with your PDF filename</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_documents</span>():</span>
    <span class="hljs-string">"""Loads documents from the specified data path."""</span>
    pdf_path = os.path.join(DATA_PATH, PDF_FILENAME)
    loader = PyPDFLoader(pdf_path)
    <span class="hljs-comment"># loader = UnstructuredPDFLoader(pdf_path) # Alternative</span>
    documents = loader.load()
    print(<span class="hljs-string">f"Loaded <span class="hljs-subst">{len(documents)}</span> page(s) from <span class="hljs-subst">{pdf_path}</span>"</span>)
    <span class="hljs-keyword">return</span> documents

<span class="hljs-comment"># documents = load_documents() # Call this later</span>
</code></pre>
<h3 id="heading-step-3-split-documents">Step 3: Split Documents</h3>
<p>Large documents need to be split into smaller chunks suitable for embedding and retrieval. The <code>RecursiveCharacterTextSplitter</code> attempts to split text semantically (at paragraphs, sentences, and so on) before resorting to fixed-size splits. <code>chunk_size</code> determines the maximum size of each chunk (in characters), and <code>chunk_overlap</code> specifies how many characters should overlap between consecutive chunks to maintain context.</p>
<pre><code class="lang-python"><span class="hljs-comment"># rag_local.py (continued)</span>
<span class="hljs-keyword">from</span> langchain_text_splitters <span class="hljs-keyword">import</span> RecursiveCharacterTextSplitter

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">split_documents</span>(<span class="hljs-params">documents</span>):</span>
    <span class="hljs-string">"""Splits documents into smaller chunks."""</span>
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=<span class="hljs-number">1000</span>,
        chunk_overlap=<span class="hljs-number">200</span>,
        length_function=len,
        is_separator_regex=<span class="hljs-literal">False</span>,
    )
    all_splits = text_splitter.split_documents(documents)
    print(<span class="hljs-string">f"Split into <span class="hljs-subst">{len(all_splits)}</span> chunks"</span>)
    <span class="hljs-keyword">return</span> all_splits

<span class="hljs-comment"># loaded_docs = load_documents()</span>
<span class="hljs-comment"># chunks = split_documents(loaded_docs) # Call this later</span>
</code></pre>
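<p>To see how <code>chunk_size</code> and <code>chunk_overlap</code> interact, here is a deliberately simplified fixed-size splitter. It is not the algorithm <code>RecursiveCharacterTextSplitter</code> uses (that one prefers natural boundaries first). It only illustrates the sliding window that the two parameters describe. The function name is illustrative.</p>

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Simplified splitter: each chunk starts (chunk_size - chunk_overlap) characters after the previous one."""
    if chunk_overlap < 0 or chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

<p>Each chunk repeats the last four characters of the previous one, which is how overlap keeps a sentence that straddles a boundary retrievable from either side.</p>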
<h3 id="heading-step-4-choose-and-configure-embedding-model">Step 4: Choose and Configure Embedding Model</h3>
<p>Embeddings transform text into vectors (lists of numbers) such that semantically similar text chunks have vectors that are close together in multi-dimensional space.</p>
<h4 id="heading-option-a-recommended-for-simplicity-ollama-embeddings">Option A (Recommended for Simplicity): Ollama Embeddings</h4>
<p>This approach uses Ollama to serve a dedicated embedding model. <code>nomic-embed-text</code> is a capable open-source model available via Ollama.</p>
<p>First, ensure the embedding model is pulled:</p>
<pre><code class="lang-bash">ollama pull nomic-embed-text
</code></pre>
<p>Then, use <code>OllamaEmbeddings</code> in Python:</p>
<pre><code class="lang-python"><span class="hljs-comment"># rag_local.py (continued)</span>
<span class="hljs-keyword">from</span> langchain_ollama <span class="hljs-keyword">import</span> OllamaEmbeddings

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_embedding_function</span>(<span class="hljs-params">model_name=<span class="hljs-string">"nomic-embed-text"</span></span>):</span>
    <span class="hljs-string">"""Initializes the Ollama embedding function."""</span>
    <span class="hljs-comment"># Ensure Ollama server is running (ollama serve)</span>
    embeddings = OllamaEmbeddings(model=model_name)
    print(<span class="hljs-string">f"Initialized Ollama embeddings with model: <span class="hljs-subst">{model_name}</span>"</span>)
    <span class="hljs-keyword">return</span> embeddings

<span class="hljs-comment"># embedding_function = get_embedding_function() # Call this later</span>
</code></pre>
<h4 id="heading-option-b-alternative-sentence-transformers">Option B (Alternative): Sentence Transformers</h4>
<p>This uses the <code>sentence-transformers</code> library directly within the Python script. It requires installing the library (<code>pip install sentence-transformers</code>) but doesn't need a separate Ollama process for embeddings. Models like <code>all-MiniLM-L6-v2</code> are fast and lightweight, while <code>all-mpnet-base-v2</code> offers higher quality.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Alternative embedding function using Sentence Transformers</span>
<span class="hljs-keyword">from</span> langchain_community.embeddings <span class="hljs-keyword">import</span> HuggingFaceEmbeddings

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_embedding_function_hf</span>(<span class="hljs-params">model_name=<span class="hljs-string">"all-MiniLM-L6-v2"</span></span>):</span>
     <span class="hljs-string">"""Initializes HuggingFace embeddings (runs locally)."""</span>
     embeddings = HuggingFaceEmbeddings(model_name=model_name)
     print(<span class="hljs-string">f"Initialized HuggingFace embeddings with model: <span class="hljs-subst">{model_name}</span>"</span>)
     <span class="hljs-keyword">return</span> embeddings

embedding_function = get_embedding_function_hf() <span class="hljs-comment"># Use this if choosing Option B</span>
</code></pre>
<p>For this tutorial, we’ll use Option A (Ollama Embeddings with <code>nomic-embed-text</code>) to keep the toolchain consistent.</p>
<h3 id="heading-step-5-set-up-local-vector-store-chromadb">Step 5: Set Up Local Vector Store (ChromaDB)</h3>
<p>ChromaDB provides an efficient way to store and search vector embeddings locally. Using a persistent client ensures the indexed data is saved to disk and can be reloaded without re-processing the documents every time.</p>
<pre><code class="lang-python"><span class="hljs-comment"># rag_local.py (continued)</span>
<span class="hljs-keyword">from</span> langchain_community.vectorstores <span class="hljs-keyword">import</span> Chroma

CHROMA_PATH = <span class="hljs-string">"chroma_db"</span> <span class="hljs-comment"># Directory to store ChromaDB data</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_vector_store</span>(<span class="hljs-params">embedding_function, persist_directory=CHROMA_PATH</span>):</span>
    <span class="hljs-string">"""Initializes or loads the Chroma vector store."""</span>
    vectorstore = Chroma(
        persist_directory=persist_directory,
        embedding_function=embedding_function
    )
    print(<span class="hljs-string">f"Vector store initialized/loaded from: <span class="hljs-subst">{persist_directory}</span>"</span>)
    <span class="hljs-keyword">return</span> vectorstore

embedding_function = get_embedding_function()
vector_store = get_vector_store(embedding_function) <span class="hljs-comment"># Call this later</span>
</code></pre>
<h3 id="heading-step-6-index-documents-embed-and-store">Step 6: Index Documents (Embed and Store)</h3>
<p>This is the core indexing step where document chunks are converted to embeddings and saved in ChromaDB. The <code>Chroma.from_documents</code> function is convenient for the initial creation and indexing. If the database already exists, subsequent additions can use <code>vectorstore.add_documents</code>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># rag_local.py (continued)</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">index_documents</span>(<span class="hljs-params">chunks, embedding_function, persist_directory=CHROMA_PATH</span>):</span>
    <span class="hljs-string">"""Indexes document chunks into the Chroma vector store."""</span>
    print(<span class="hljs-string">f"Indexing <span class="hljs-subst">{len(chunks)}</span> chunks..."</span>)
    <span class="hljs-comment"># Use from_documents for initial creation.</span>
    <span class="hljs-comment"># This will overwrite existing data if the directory exists but isn't a valid Chroma DB.</span>
    <span class="hljs-comment"># For incremental updates, initialize Chroma first and use vectorstore.add_documents().</span>
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_function,
        persist_directory=persist_directory
    )
    vectorstore.persist() <span class="hljs-comment"># Only needed for chromadb versions before 0.4; newer versions persist automatically</span>
    print(<span class="hljs-string">f"Indexing complete. Data saved to: <span class="hljs-subst">{persist_directory}</span>"</span>)
    <span class="hljs-keyword">return</span> vectorstore

<span class="hljs-comment">#... (previous function calls)</span>
vector_store = index_documents(chunks, embedding_function) <span class="hljs-comment"># Call this for initial indexing</span>
</code></pre>
<p>To load an existing persistent database later:</p>
<pre><code class="lang-python">embedding_function = get_embedding_function()
vector_store = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)
</code></pre>
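<p>To avoid re-indexing on every run, a script can decide between indexing and loading. The sketch below uses a simple heuristic: if the persistence directory is missing or empty, nothing has been indexed yet. This is an assumption, not a ChromaDB feature, and the helper name <code>needs_indexing</code> is hypothetical.</p>

```python
import os


def needs_indexing(persist_directory):
    """Return True when the directory is missing or empty, i.e. nothing has been indexed yet."""
    if not os.path.isdir(persist_directory):
        return True
    return len(os.listdir(persist_directory)) == 0


# Hypothetical wiring with the functions defined earlier in rag_local.py:
# if needs_indexing(CHROMA_PATH):
#     chunks = split_documents(load_documents())
#     vector_store = index_documents(chunks, embedding_function)
# else:
#     vector_store = get_vector_store(embedding_function)
print(needs_indexing("chroma_db"))
```

<p>Because a non-empty directory is only a proxy for a complete index, delete the directory and re-index whenever you change the source document or the embedding model.</p>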
<h3 id="heading-step-7-build-the-rag-chain">Step 7: Build the RAG Chain</h3>
<p>Now, assemble the components into a LangChain Expression Language (LCEL) chain. This involves initializing the Qwen 3 LLM via Ollama, creating a retriever from the vector store, defining a suitable prompt, and chaining them together.</p>
<p>A critical parameter when initializing <code>ChatOllama</code> for RAG is <code>num_ctx</code>. This defines the context window size (in tokens) that the LLM can handle. Ollama's default (often 2048 or 4096 tokens) might be too small to accommodate both the retrieved document context and the user's query/prompt.</p>
<p>Qwen 3 models (8B and larger) support much larger context windows (for example, 128k tokens), but practical limits depend on your available RAM/VRAM. Setting <code>num_ctx</code> to a value like 8192 or higher is often necessary for effective RAG.</p>
<pre><code class="lang-python"><span class="hljs-comment"># rag_local.py (continued)</span>
<span class="hljs-keyword">from</span> langchain_ollama <span class="hljs-keyword">import</span> ChatOllama
<span class="hljs-keyword">from</span> langchain_core.prompts <span class="hljs-keyword">import</span> ChatPromptTemplate
<span class="hljs-keyword">from</span> langchain_core.runnables <span class="hljs-keyword">import</span> RunnablePassthrough
<span class="hljs-keyword">from</span> langchain_core.output_parsers <span class="hljs-keyword">import</span> StrOutputParser

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_rag_chain</span>(<span class="hljs-params">vector_store, llm_model_name=<span class="hljs-string">"qwen3:8b"</span>, context_window=<span class="hljs-number">8192</span></span>):</span>
    <span class="hljs-string">"""Creates the RAG chain."""</span>
    <span class="hljs-comment"># Initialize the LLM</span>
    llm = ChatOllama(
        model=llm_model_name,
        temperature=<span class="hljs-number">0</span>, <span class="hljs-comment"># Lower temperature for more factual RAG answers</span>
        num_ctx=context_window <span class="hljs-comment"># IMPORTANT: Set context window size</span>
    )
    print(<span class="hljs-string">f"Initialized ChatOllama with model: <span class="hljs-subst">{llm_model_name}</span>, context window: <span class="hljs-subst">{context_window}</span>"</span>)

    <span class="hljs-comment"># Create the retriever</span>
    retriever = vector_store.as_retriever(
        search_type=<span class="hljs-string">"similarity"</span>, <span class="hljs-comment"># Or "mmr"</span>
        search_kwargs={<span class="hljs-string">'k'</span>: <span class="hljs-number">3</span>} <span class="hljs-comment"># Retrieve top 3 relevant chunks</span>
    )
    print(<span class="hljs-string">"Retriever initialized."</span>)

    <span class="hljs-comment"># Define the prompt template</span>
    template = <span class="hljs-string">"""Answer the question based ONLY on the following context:
{context}

Question: {question}
"""</span>
    prompt = ChatPromptTemplate.from_template(template)
    print(<span class="hljs-string">"Prompt template created."</span>)

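    <span class="hljs-comment"># Join retrieved chunks into plain text so the prompt gets their content, not raw Document objects</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">format_docs</span>(<span class="hljs-params">docs</span>):</span>
        <span class="hljs-keyword">return</span> <span class="hljs-string">"\n\n"</span>.join(doc.page_content <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> docs)

    <span class="hljs-comment"># Define the RAG chain using LCEL</span>
    rag_chain = (
        {<span class="hljs-string">"context"</span>: retriever | format_docs, <span class="hljs-string">"question"</span>: RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )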
    print(<span class="hljs-string">"RAG chain created."</span>)
    <span class="hljs-keyword">return</span> rag_chain

<span class="hljs-comment">#... (previous function calls)</span>
<span class="hljs-comment"># vector_store = get_vector_store(embedding_function) # Call this later (assumes the DB is already indexed)</span>
<span class="hljs-comment"># rag_chain = create_rag_chain(vector_store) # Call this later</span>
</code></pre>
<p>The effectiveness of the RAG system hinges on the proper configuration of each component. The <code>chunk_size</code> and <code>chunk_overlap</code> in the splitter affect what the retriever finds. Your choice of <code>embedding_function</code> must be consistent between indexing and querying. The <code>num_ctx</code> parameter for the <code>ChatOllama</code> LLM must be large enough to hold the retrieved context and the prompt itself. A poorly designed prompt template can also lead the LLM astray. Make sure you carefully tune these elements for optimal performance.</p>
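<p>To get a feel for whether a given <code>num_ctx</code> leaves enough room, you can run a rough budget check before querying. The sketch below is only an estimate: it assumes about 4 characters per token (a coarse heuristic, not Qwen 3's actual tokenizer), and the helper names and default values are hypothetical.</p>
<pre><code class="lang-python"># Rough budget check: will the retrieved chunks and the prompt fit in num_ctx?
# ASSUMPTION: ~4 characters per token is a coarse heuristic for English text,
# not Qwen 3's real tokenizer. Treat the result as an estimate only.
CHARS_PER_TOKEN = 4  # hypothetical average, tune for your data

def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1

def context_headroom(chunks, question, template_overhead=50, reserve_for_answer=1024, num_ctx=8192):
    """Returns the estimated tokens left after the prompt and an answer reserve (negative means overflow)."""
    prompt_tokens = template_overhead + estimate_tokens(question)
    prompt_tokens += sum(estimate_tokens(chunk) for chunk in chunks)
    return num_ctx - prompt_tokens - reserve_for_answer

# Three retrieved chunks of about 1,000 characters each (k=3, as in the retriever above)
chunks = ["x" * 1000] * 3
headroom = context_headroom(chunks, "What is the main topic of the document?")
print(f"Estimated headroom: {headroom} tokens")
</code></pre>
<p>A negative result suggests that you should raise <code>num_ctx</code>, retrieve fewer chunks, or use a smaller <code>chunk_size</code>.</p>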
<h3 id="heading-step-8-query-your-documents">Step 8: Query Your Documents</h3>
<p>Finally, invoke the RAG chain with a question related to the content of the indexed PDF.</p>
<pre><code class="lang-python"><span class="hljs-comment"># rag_local.py (continued)</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">query_rag</span>(<span class="hljs-params">chain, question</span>):</span>
    <span class="hljs-string">"""Queries the RAG chain and prints the response."""</span>
    print(<span class="hljs-string">"\nQuerying RAG chain..."</span>)
    print(<span class="hljs-string">f"Question: <span class="hljs-subst">{question}</span>"</span>)
    response = chain.invoke(question)
    print(<span class="hljs-string">"\nResponse:"</span>)
    print(response)

<span class="hljs-comment"># --- Main Execution ---</span>
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    <span class="hljs-comment"># 1. Load Documents</span>
    docs = load_documents()

    <span class="hljs-comment"># 2. Split Documents</span>
    chunks = split_documents(docs)

    <span class="hljs-comment"># 3. Get Embedding Function</span>
    embedding_function = get_embedding_function() <span class="hljs-comment"># Using Ollama nomic-embed-text</span>

    <span class="hljs-comment"># 4. Index Documents (Only needs to be done once per document set)</span>
    <span class="hljs-comment"># Check if DB exists, if not, index. For simplicity, we might re-index here.</span>
    <span class="hljs-comment"># A more robust approach would check if indexing is needed.</span>
    print(<span class="hljs-string">"Attempting to index documents..."</span>)
    vector_store = index_documents(chunks, embedding_function)
    <span class="hljs-comment"># To load existing DB instead:</span>
    <span class="hljs-comment"># vector_store = get_vector_store(embedding_function)</span>

    <span class="hljs-comment"># 5. Create RAG Chain</span>
    rag_chain = create_rag_chain(vector_store, llm_model_name=<span class="hljs-string">"qwen3:8b"</span>) <span class="hljs-comment"># Use the chosen Qwen 3 model</span>

    <span class="hljs-comment"># 6. Query</span>
    query_question = <span class="hljs-string">"What is the main topic of the document?"</span> <span class="hljs-comment"># Replace with a specific question</span>
    query_rag(rag_chain, query_question)

    query_question_2 = <span class="hljs-string">"Summarize the introduction section."</span> <span class="hljs-comment"># Another example</span>
    query_rag(rag_chain, query_question_2)
</code></pre>
<p>Run the complete script (<code>python rag_local.py</code>). Make sure that the <code>ollama serve</code> command is running in another terminal. The script will load the PDF, split it, embed the chunks using <code>nomic-embed-text</code> via Ollama, store them in ChromaDB, build the RAG chain using <code>qwen3:8b</code> via Ollama, and finally execute the queries. It’ll print the LLM's responses based on the document content.</p>
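<p>If the script fails with a connection error, the Ollama server is probably not running. As a minimal sketch, you could check for it before building the chain. This assumes Ollama listens on its default address (<code>http://localhost:11434</code>) and uses its <code>/api/tags</code> endpoint, which lists local models. The helper name <code>is_ollama_running</code> is hypothetical.</p>
<pre><code class="lang-python">import urllib.error
import urllib.request

def is_ollama_running(base_url="http://localhost:11434", timeout=2.0):
    """Returns True if a server answers at base_url + /api/tags, otherwise False."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error all count as "not running"
        return False

if not is_ollama_running():
    print("Ollama does not appear to be running. Start it with `ollama serve`.")
</code></pre>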
<h2 id="heading-how-to-create-local-ai-agents-with-qwen-3">How to Create Local AI Agents with Qwen 3</h2>
<p>Beyond answering questions based on provided text, LLMs can act as the reasoning engine for AI agents. Agents can plan sequences of actions, interact with external tools (like functions or APIs), and work towards accomplishing more complex goals assigned by the user.</p>
<p>Qwen 3 models were specifically designed with strong tool-calling and agentic capabilities. While Alibaba provides the Qwen-Agent framework, this tutorial will continue using LangChain for consistency and because its Ollama integration for agent tasks is well documented.</p>
<p>We will build a simple agent that can use a custom Python function as a tool.</p>
<h3 id="heading-step-1-define-custom-tools">Step 1: Define Custom Tools</h3>
<p>Tools are standard Python functions that the agent can choose to execute. The function's docstring is crucial, as the LLM uses it to understand what the tool does and what arguments it requires. LangChain's <code>@tool</code> decorator simplifies wrapping functions for agent use.</p>
<pre><code class="lang-python"><span class="hljs-comment"># agent_local.py</span>
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> langchain.agents <span class="hljs-keyword">import</span> tool
<span class="hljs-keyword">import</span> datetime

load_dotenv() <span class="hljs-comment"># Optional</span>

<span class="hljs-meta">@tool</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_current_datetime</span>(<span class="hljs-params">format: str = <span class="hljs-string">"%Y-%m-%d %H:%M:%S"</span></span>) -&gt; str:</span>
    <span class="hljs-string">"""
    Returns the current date and time, formatted according to the provided Python strftime format string.
    Use this tool whenever the user asks for the current date, time, or both.
    Example format strings: '%Y-%m-%d' for date, '%H:%M:%S' for time.
    If no format is specified, defaults to '%Y-%m-%d %H:%M:%S'.
    """</span>
    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">return</span> datetime.datetime.now().strftime(format)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-keyword">return</span> <span class="hljs-string">f"Error formatting date/time: <span class="hljs-subst">{e}</span>"</span>

<span class="hljs-comment"># List of tools the agent can use</span>
tools = [get_current_datetime]
print(<span class="hljs-string">"Custom tool defined."</span>)
</code></pre>
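<p>To see why the docstring and type hints matter, it helps to look at the information a function exposes about itself. The sketch below uses only Python's standard <code>inspect</code> module to build a plain-text description of a function. It is an illustration of the idea, not how LangChain's <code>@tool</code> decorator is implemented, and both <code>describe_tool</code> and <code>shout</code> are hypothetical names.</p>
<pre><code class="lang-python">import inspect

def describe_tool(func):
    """Builds a plain-text summary (name, signature, docstring) of a function."""
    signature = inspect.signature(func)
    doc = inspect.getdoc(func) or "(no docstring)"
    return f"{func.__name__}{signature}\n{doc}"

def shout(text: str, times: int = 1):
    """Returns text in upper case, repeated the given number of times."""
    return " ".join([text.upper()] * times)

# This summary is roughly the kind of information a model needs to pick a tool and fill in its arguments
print(describe_tool(shout))
</code></pre>
<p>A function with a vague or missing docstring gives the model much less to work with, which tends to show up as skipped tool calls or badly formed arguments.</p>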
<h3 id="heading-step-2-set-up-the-agent-llm">Step 2: Set Up the Agent LLM</h3>
<p>Instantiate the <code>ChatOllama</code> model again, using a Qwen 3 variant suitable for tool calling. The <code>qwen3:8b</code> model should be capable of handling simple tool use cases.</p>
<p>It's important to note that tool calling reliability with local models served via Ollama can sometimes be less consistent than with large commercial APIs like GPT-4 or Claude. The LLM might fail to recognize when a tool is needed, hallucinate arguments, or misinterpret the tool's output. Starting with clear prompts and simple tools is recommended.</p>
<pre><code class="lang-python"><span class="hljs-comment"># agent_local.py (continued)</span>
<span class="hljs-keyword">from</span> langchain_ollama <span class="hljs-keyword">import</span> ChatOllama

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_agent_llm</span>(<span class="hljs-params">model_name=<span class="hljs-string">"qwen3:8b"</span>, temperature=<span class="hljs-number">0</span></span>):</span>
    <span class="hljs-string">"""Initializes the ChatOllama model for the agent."""</span>
    <span class="hljs-comment"># Ensure Ollama server is running (ollama serve)</span>
    llm = ChatOllama(
        model=model_name,
        temperature=temperature <span class="hljs-comment"># Lower temperature for more predictable tool use</span>
        <span class="hljs-comment"># Consider increasing num_ctx if expecting long conversations or complex reasoning</span>
        <span class="hljs-comment"># num_ctx=8192</span>
    )
    print(<span class="hljs-string">f"Initialized ChatOllama agent LLM with model: <span class="hljs-subst">{model_name}</span>"</span>)
    <span class="hljs-keyword">return</span> llm

<span class="hljs-comment"># agent_llm = get_agent_llm() # Call this later</span>
</code></pre>
<h3 id="heading-step-3-create-the-agent-prompt">Step 3: Create the Agent Prompt</h3>
<p>Agents require specific prompt structures that guide their reasoning and tool use. The prompt typically includes placeholders for user input (<code>input</code>), conversation history (<code>chat_history</code>), and the <code>agent_scratchpad</code>. The scratchpad is where the agent records its internal "thought" process, the tools it decides to call, and the results (observations) it gets back from those tools. LangChain Hub provides pre-built prompts suitable for tool-calling agents.</p>
<pre><code class="lang-python"><span class="hljs-comment"># agent_local.py (continued)</span>
<span class="hljs-keyword">from</span> langchain <span class="hljs-keyword">import</span> hub

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_agent_prompt</span>(<span class="hljs-params">prompt_hub_name=<span class="hljs-string">"hwchase17/openai-tools-agent"</span></span>):</span>
    <span class="hljs-string">"""Pulls the agent prompt template from LangChain Hub."""</span>
    <span class="hljs-comment"># This prompt is designed for OpenAI but often works well with other tool-calling models.</span>
    <span class="hljs-comment"># Alternatively, define a custom ChatPromptTemplate.</span>
    prompt = hub.pull(prompt_hub_name)
    print(<span class="hljs-string">f"Pulled agent prompt from Hub: <span class="hljs-subst">{prompt_hub_name}</span>"</span>)
    <span class="hljs-comment"># print("Prompt Structure:")</span>
    <span class="hljs-comment"># prompt.pretty_print() # Uncomment to see the prompt structure</span>
    <span class="hljs-keyword">return</span> prompt

<span class="hljs-comment"># agent_prompt = get_agent_prompt() # Call this later</span>
</code></pre>
<h3 id="heading-step-4-build-the-agent">Step 4: Build the Agent</h3>
<p>The <code>create_tool_calling_agent</code> function combines the LLM, the defined tools, and the prompt into a runnable unit that represents the agent's core logic.</p>
<pre><code class="lang-python"><span class="hljs-comment"># agent_local.py (continued)</span>
<span class="hljs-keyword">from</span> langchain.agents <span class="hljs-keyword">import</span> create_tool_calling_agent

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">build_agent</span>(<span class="hljs-params">llm, tools, prompt</span>):</span>
    <span class="hljs-string">"""Builds the tool-calling agent runnable."""</span>
    agent = create_tool_calling_agent(llm, tools, prompt)
    print(<span class="hljs-string">"Agent runnable created."</span>)
    <span class="hljs-keyword">return</span> agent

<span class="hljs-comment"># agent_runnable = build_agent(agent_llm, tools, agent_prompt) # Call this later</span>
</code></pre>
<h3 id="heading-step-5-create-the-agent-executor">Step 5: Create the Agent Executor</h3>
<p>The <code>AgentExecutor</code> is responsible for running the agent loop. It takes the agent runnable and the tools, invokes the agent with the input, parses the agent's output (which could be a final answer or a tool call request), executes any requested tool calls, and feeds the results back to the agent until a final answer is reached. Setting <code>verbose=True</code> is highly recommended during development to observe the agent's step-by-step execution flow.</p>
<pre><code class="lang-python"><span class="hljs-comment"># agent_local.py (continued)</span>
<span class="hljs-keyword">from</span> langchain.agents <span class="hljs-keyword">import</span> AgentExecutor

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_agent_executor</span>(<span class="hljs-params">agent, tools</span>):</span>
    <span class="hljs-string">"""Creates the agent executor."""</span>
    agent_executor = AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=<span class="hljs-literal">True</span> <span class="hljs-comment"># Set to True to see agent thoughts and tool calls</span>
    )
    print(<span class="hljs-string">"Agent executor created."</span>)
    <span class="hljs-keyword">return</span> agent_executor

<span class="hljs-comment"># agent_executor = create_agent_executor(agent_runnable, tools) # Call this later</span>
</code></pre>
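<p>The loop the executor runs can be sketched in plain Python. This is a conceptual illustration only, not LangChain's actual <code>AgentExecutor</code> code: the scripted "model" is a hypothetical stand-in for the LLM, and all names here are made up for the example.</p>
<pre><code class="lang-python"># Conceptual sketch of an agent loop. This is NOT LangChain's AgentExecutor code;
# the scripted "model" below is a hypothetical stand-in for the LLM.
import datetime

def get_time(fmt="%H:%M"):
    """Toy tool: returns the current time as a string."""
    return datetime.datetime.now().strftime(fmt)

TOOLS = {"get_time": get_time}

def scripted_model(scratchpad):
    """Stand-in for the LLM: requests a tool first, then answers using the observation."""
    if not scratchpad:
        return {"type": "tool_call", "name": "get_time", "args": {"fmt": "%H:%M"}}
    last_observation = scratchpad[-1]["observation"]
    return {"type": "final", "output": f"The time is {last_observation}."}

def run_agent_loop(model, tools, max_steps=5):
    """Calls the model repeatedly, executing requested tools until it returns a final answer."""
    scratchpad = []  # Record of tool calls and their results
    for _ in range(max_steps):
        decision = model(scratchpad)
        if decision["type"] == "final":
            return decision["output"]
        # Run the requested tool and feed the result back on the next iteration
        observation = tools[decision["name"]](**decision["args"])
        scratchpad.append({"call": decision, "observation": observation})
    return "Stopped: step limit reached."

print(run_agent_loop(scripted_model, TOOLS))
</code></pre>
<p>The step limit matters with local models, which can get stuck calling the same tool repeatedly. <code>AgentExecutor</code> offers a comparable safeguard through its <code>max_iterations</code> parameter.</p>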
<h3 id="heading-step-6-run-the-agent">Step 6: Run the Agent</h3>
<p>Invoke the agent executor with a user query that should trigger the use of the defined tool.</p>
<pre><code class="lang-python"><span class="hljs-comment"># agent_local.py (continued)</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_agent</span>(<span class="hljs-params">executor, user_input</span>):</span>
    <span class="hljs-string">"""Runs the agent executor with the given input."""</span>
    print(<span class="hljs-string">"\nInvoking agent..."</span>)
    print(<span class="hljs-string">f"Input: <span class="hljs-subst">{user_input}</span>"</span>)
    response = executor.invoke({<span class="hljs-string">"input"</span>: user_input})
    print(<span class="hljs-string">"\nAgent Response:"</span>)
    print(response[<span class="hljs-string">'output'</span>])

<span class="hljs-comment"># --- Main Execution ---</span>
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    <span class="hljs-comment"># 1. Define Tools (already done above)</span>

    <span class="hljs-comment"># 2. Get Agent LLM</span>
    agent_llm = get_agent_llm(model_name=<span class="hljs-string">"qwen3:8b"</span>) <span class="hljs-comment"># Use the chosen Qwen 3 model</span>

    <span class="hljs-comment"># 3. Get Agent Prompt</span>
    agent_prompt = get_agent_prompt()

    <span class="hljs-comment"># 4. Build Agent Runnable</span>
    agent_runnable = build_agent(agent_llm, tools, agent_prompt)

    <span class="hljs-comment"># 5. Create Agent Executor</span>
    agent_executor = create_agent_executor(agent_runnable, tools)

    <span class="hljs-comment"># 6. Run Agent</span>
    run_agent(agent_executor, <span class="hljs-string">"What is the current date?"</span>)
    run_agent(agent_executor, <span class="hljs-string">"What time is it right now? Use HH:MM format."</span>)
    run_agent(agent_executor, <span class="hljs-string">"Tell me a joke."</span>) <span class="hljs-comment"># Should not use the tool</span>
</code></pre>
<p>Running <code>python agent_local.py</code> (with <code>ollama serve</code> active) will execute the agent. With <code>verbose=True</code>, the executor prints each step of the loop: the tool the agent decides to invoke along with its arguments, the result the tool returns, and the final answer. This mirrors the reason, act, observe cycle of the ReAct (Reasoning and Acting) framework.</p>
<p>Building reliable agents with local models presents unique challenges. The LLM's ability to correctly interpret the prompt, understand when to use tools, select the right tool, generate valid arguments, and process the tool's output is critical.</p>
<p>Local models, especially smaller or heavily quantized ones, might struggle with these reasoning steps compared to larger, cloud-based counterparts. If the <code>qwen3:8b</code> model proves unreliable for more complex agentic tasks, consider trying <code>qwen3:14b</code> or the efficient <code>qwen3:30b-a3b</code> if hardware permits.</p>
<p>For highly complex or stateful agent workflows, exploring frameworks like LangGraph, which offers more control over the agent's execution flow, might be beneficial.</p>
<h2 id="heading-advanced-considerations-and-troubleshooting">Advanced Considerations and Troubleshooting</h2>
<p>Running LLMs locally offers great flexibility but also introduces specific configuration aspects and potential issues.</p>
<h3 id="heading-controlling-qwen-3s-thinking-mode-with-ollama">Controlling Qwen 3's Thinking Mode with Ollama</h3>
<p>Qwen 3's unique hybrid inference allows switching between a deep "thinking" mode for complex reasoning and a faster "non-thinking" mode for general chat. While frameworks like Hugging Face Transformers or vLLM might offer explicit parameters (<code>enable_thinking</code>), the primary way to control this when using Ollama appears to be through "soft switches" embedded in the prompt.</p>
<p>Append <code>/think</code> to the end of a user prompt to encourage step-by-step reasoning, or <code>/no_think</code> to request a faster, direct response. You can do this via the Ollama CLI or potentially within the prompts sent via the API/LangChain.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Example using LangChain's ChatOllama</span>
<span class="hljs-keyword">from</span> langchain_ollama <span class="hljs-keyword">import</span> ChatOllama

llm_think = ChatOllama(model=<span class="hljs-string">"qwen3:8b"</span>)
llm_no_think = ChatOllama(model=<span class="hljs-string">"qwen3:8b"</span>) <span class="hljs-comment"># Could also set system prompt</span>

<span class="hljs-comment"># Invoke with prompt modification</span>
response_think = llm_think.invoke(<span class="hljs-string">"Solve the equation 2x + 5 = 15 /think"</span>)
print(<span class="hljs-string">"Thinking Response:"</span>, response_think.content) <span class="hljs-comment"># .content holds the reply text</span>

response_no_think = llm_no_think.invoke(<span class="hljs-string">"What is the capital of France? /no_think"</span>)
print(<span class="hljs-string">"Non-Thinking Response:"</span>, response_no_think.content)

<span class="hljs-comment"># Alternatively, set via a system message (might be less reliable turn-by-turn)</span>
response_system = llm_no_think.invoke([
    (<span class="hljs-string">"system"</span>, <span class="hljs-string">"/no_think"</span>),
    (<span class="hljs-string">"human"</span>, <span class="hljs-string">"What is 2+2?"</span>),
])
print(<span class="hljs-string">"System No-Think Response:"</span>, response_system.content)
</code></pre>
<p>Note that the persistence of these tags across multiple turns in a conversation might require careful prompt management.</p>
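<p>One simple way to manage this is to set the switch explicitly on every turn instead of relying on an earlier tag. The helper below is a minimal sketch (the name <code>with_thinking_mode</code> is hypothetical): it removes any switch already at the end of the prompt and appends the one you want.</p>
<pre><code class="lang-python">SWITCHES = ("/think", "/no_think")

def with_thinking_mode(prompt, think):
    """Returns the prompt ending in exactly one soft switch, replacing any switch already present."""
    text = prompt.rstrip()
    for switch in SWITCHES:
        if text.endswith(switch):
            text = text[: -len(switch)].rstrip()
    suffix = "/think" if think else "/no_think"
    return f"{text} {suffix}"

print(with_thinking_mode("Solve the equation 2x + 5 = 15", think=True))
print(with_thinking_mode("What is the capital of France? /think", think=False))
</code></pre>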
<h3 id="heading-managing-context-length-numctx">Managing Context Length (<code>num_ctx</code>)</h3>
<p>The context window (<code>num_ctx</code>) determines how much information (prompt, history, retrieved documents) the LLM can consider at once. Qwen 3 models (8B+) support large native context lengths (for example, 128k tokens), but Ollama often defaults to a much smaller window (like 2048 or 4096). For RAG or conversations requiring memory of earlier turns, this default is often insufficient.</p>
<p>Set <code>num_ctx</code> when initializing <code>ChatOllama</code> or <code>OllamaLLM</code> in LangChain:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Example setting context window to 8192 tokens</span>
llm = ChatOllama(model=<span class="hljs-string">"qwen3:8b"</span>, num_ctx=<span class="hljs-number">8192</span>)
</code></pre>
<p>Be mindful that larger <code>num_ctx</code> values significantly increase RAM and VRAM consumption. But setting it too low can lead to the model "forgetting" context or even entering repetitive loops. Choose a value that balances the task requirements with hardware capabilities.</p>
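<p>Much of that extra memory goes to the key-value (KV) cache, which grows linearly with <code>num_ctx</code>. The sketch below applies the usual back-of-the-envelope formula for transformer models. The architecture values (36 layers, 8 KV heads, head size 128, 16-bit cache entries) are illustrative assumptions, so check the model card for the model you actually run. Real usage also depends on how the runtime stores the cache.</p>
<pre><code class="lang-python">def kv_cache_bytes(num_ctx, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimates KV-cache size: one key and one value vector per layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_ctx

# ASSUMPTION: illustrative architecture values (36 layers, 8 KV heads, head size 128,
# 16-bit cache). Check the model card of the model you actually run.
for num_ctx in (2048, 8192, 32768):
    size_gib = kv_cache_bytes(num_ctx, num_layers=36, num_kv_heads=8, head_dim=128) / 1024**3
    print(f"num_ctx={num_ctx}: about {size_gib:.2f} GiB of KV cache")
</code></pre>
<p>Under these assumptions, quadrupling <code>num_ctx</code> quadruples the cache, on top of the memory the model weights already occupy.</p>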
<h3 id="heading-hardware-limitations-and-vram">Hardware Limitations and VRAM</h3>
<p>Running LLMs locally is resource-intensive.</p>
<ul>
<li><p><strong>VRAM:</strong> A GPU with sufficient memory (a dedicated NVIDIA card, or Apple Silicon with its unified memory) is highly recommended for acceptable performance. The amount of VRAM dictates the largest model size that can run efficiently. Refer to the table in Section 2 for estimates.</p>
</li>
<li><p><strong>RAM:</strong> System RAM is also crucial, especially if the model doesn't fit entirely in VRAM. Ollama can utilize system RAM as a fallback, but this is significantly slower.</p>
</li>
<li><p><strong>Quantization:</strong> Ollama typically serves quantized models (for example, 4-bit or 5-bit), which significantly reduce model size and VRAM requirements compared to full-precision models, often with minimal quality degradation for many tasks. Size tags like <code>:4b</code> and <code>:8b</code> name the parameter count and usually come with a default quantization level.</p>
</li>
</ul>
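<p>As a rough illustration of why quantization matters, the weights alone take about (parameters × bits per weight ÷ 8) bytes. The sketch below is an approximation with a hypothetical helper name. Real quantized files are usually somewhat larger, because some tensors stay at higher precision, and the runtime needs additional memory for the context.</p>
<pre><code class="lang-python">def weight_size_gb(num_params_billion, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_params_billion * bits_per_weight / 8

for bits in (16, 8, 4):
    print(f"8B parameters at {bits}-bit: about {weight_size_gb(8, bits):.1f} GB")
</code></pre>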
<p>If performance is slow or errors occur due to resource constraints, consider:</p>
<ul>
<li><p>Using a smaller Qwen 3 model (like 4B instead of 8B).</p>
</li>
<li><p>Ensuring Ollama is correctly detecting and utilizing the GPU (check Ollama logs or system monitoring tools).</p>
</li>
<li><p>Closing other resource-intensive applications.</p>
</li>
</ul>
<h2 id="heading-conclusion-and-next-steps">Conclusion and Next Steps</h2>
<p>This tutorial gave you a practical walkthrough for setting up your local AI environment using the powerful and open Qwen 3 LLM family with the user-friendly Ollama tool.</p>
<p>If you’ve followed these steps, you should have successfully:</p>
<ol>
<li><p>Installed Ollama and downloaded/run Qwen 3 models locally.</p>
</li>
<li><p>Built a functional Retrieval-Augmented Generation (RAG) pipeline using LangChain and ChromaDB to query local documents.</p>
</li>
<li><p>Created a basic AI agent capable of reasoning and utilizing custom Python tools.</p>
</li>
</ol>
<p>Running these systems locally unlocks significant advantages in privacy, cost, and customization, making advanced AI capabilities more accessible than ever. The combination of Qwen 3's performance and open license with Ollama's ease of use creates a potent platform for experimentation and development.</p>
<p><strong>Official Resources:</strong></p>
<ul>
<li><p><strong>Qwen 3:</strong> <a target="_blank" href="https://github.com/QwenLM/Qwen3">GitHub</a>, <a target="_blank" href="https://qwen.readthedocs.io/en/latest/">Documentation</a></p>
</li>
<li><p><strong>Ollama:</strong> <a target="_blank" href="https://ollama.com/">Website</a>, <a target="_blank" href="https://ollama.com/library">Model Library</a>, <a target="_blank" href="https://github.com/ollama/ollama">GitHub</a></p>
</li>
<li><p><strong>LangChain:</strong> <a target="_blank" href="https://python.langchain.com/docs/get_started/introduction">Python Documentation</a></p>
</li>
<li><p><strong>ChromaDB:</strong> <a target="_blank" href="https://docs.trychroma.com/">Documentation</a></p>
</li>
<li><p><strong>Sentence Transformers:</strong> <a target="_blank" href="https://www.sbert.net/">Documentation</a></p>
</li>
</ul>
<p>By leveraging these powerful, free, and open-source tools, you can continue to push the boundaries of what's possible with AI running directly on your own hardware.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create a Python SIEM System Using AI and LLMs for Log Analysis and Anomaly Detection ]]>
                </title>
                <description>
                    <![CDATA[ In this tutorial, we’ll build a simplified, AI-flavored SIEM log analysis system using Python. Our focus will be on log analysis and anomaly detection. We’ll walk through ingesting logs, detecting anomalies with a lightweight machine learning model, ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-a-python-siem-system-using-ai-and-llms/</link>
                <guid isPermaLink="false">67cb2cab98825a8b61ca9121</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #cybersecurity ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Chaitanya Rahalkar ]]>
                </dc:creator>
                <pubDate>Fri, 07 Mar 2025 17:28:11 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1741368457380/900d7d5b-cffc-4175-b5a5-4d7361ea383d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this tutorial, we’ll build a simplified, AI-flavored SIEM log analysis system using Python. Our focus will be on log analysis and anomaly detection.</p>
<p>We’ll walk through ingesting logs, detecting anomalies with a lightweight machine learning model, and even touch on how the system could respond automatically.</p>
<p>This hands-on proof-of-concept will illustrate how AI can enhance security monitoring in a practical, accessible way.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-are-siem-systems">What Are SIEM Systems?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-setting-up-the-project">Setting Up the Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-implementing-log-analysis">How to Implement Log Analysis</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-the-anomaly-detection-model">How to Build the Anomaly Detection Model</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-testing-and-visualizing-results">Testing and Visualizing Results</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-automated-response-possibilities">Automated Response Possibilities</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-are-siem-systems">What Are SIEM Systems?</h2>
<p>Security Information and Event Management (SIEM) systems are the central nervous system of modern security operations. A SIEM aggregates and correlates security logs and events from across an IT environment to provide real-time insights into potential incidents. This helps organizations detect threats faster and respond sooner.</p>
<p>These systems pull together huge volumes of log data, from firewall alerts to application logs, and analyze them for signs of trouble. Anomaly detection is crucial here because unusual patterns in logs can reveal incidents that might slip past static rules. For example, a sudden spike in network requests might indicate a DDoS attack, while multiple failed login attempts could point to unauthorized access attempts.</p>
<p>AI takes SIEM capabilities a step further. By leveraging advanced AI models (like large language models), an AI-powered SIEM can intelligently parse and interpret logs, learn what “normal” behavior looks like, and flag the “weird” stuff that warrants attention.</p>
<p>In essence, AI can act as a smart co-pilot for analysts, spotting subtle anomalies and even summarizing findings in plain language. Recent advancements in large language models allow SIEMs to reason over countless data points much like a human analyst would — but with far greater speed and scale. The result is a powerful digital security assistant that helps cut through the noise and focus on real threats.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before we dive in, make sure you have the following:</p>
<ul>
<li><p>Python 3.x installed on your system. The code examples should work in any recent Python version.</p>
</li>
<li><p>Basic familiarity with Python programming (looping, functions, using libraries) and an understanding of logs (for example, what a log entry looks like) will be helpful.</p>
</li>
<li><p>Python libraries: We’ll use a few common libraries that are lightweight and don’t require special hardware:</p>
<ul>
<li><p><a target="_blank" href="https://pandas.pydata.org/">pandas</a> for basic data handling (if your logs are in CSV or similar format).</p>
</li>
<li><p><a target="_blank" href="https://numpy.org/">numpy</a> for numeric operations.</p>
</li>
<li><p><a target="_blank" href="https://scikit-learn.org/">scikit-learn</a> for the anomaly detection model (specifically, we’ll use the IsolationForest algorithm).</p>
</li>
</ul>
</li>
<li><p>A set of log data to analyze. You can use any log file (system logs, application logs, and so on) in plain text or CSV format. For demonstration, we’ll simulate a small log dataset so you can follow along even without a ready-made log file.</p>
</li>
</ul>
<p><strong>Note:</strong> If you don’t have the libraries above, install them via pip:</p>
<pre><code class="lang-bash">pip install pandas numpy scikit-learn
</code></pre>
<h2 id="heading-setting-up-the-project">Setting Up the Project</h2>
<p>Let’s set up a simple project structure. Create a new directory for this SIEM anomaly detection project and navigate into it. Inside, you can have a Python script (for example, <code>siem_anomaly_demo.py</code>) or a Jupyter Notebook to run the code step by step.</p>
<p>Make sure your working directory contains or can access your log data. If you’re using a log file, it might be a good idea to place a copy in this project folder. For our proof-of-concept, since we will generate synthetic log data, we won’t need an external file — but in a real scenario you would.</p>
<p><strong>Project setup steps:</strong></p>
<ol>
<li><p><strong>Initialize the environment</strong> – If you prefer, create a virtual environment for this project (optional but good practice):</p>
<pre><code class="lang-bash"> python -m venv venv
 <span class="hljs-built_in">source</span> venv/bin/activate  <span class="hljs-comment"># On Windows use "venv\Scripts\activate"</span>
</code></pre>
<p> Then install the required packages in this virtual environment.</p>
</li>
<li><p><strong>Prepare a data source</strong> – Identify the log source you want to analyze. This could be a path to a log file or database. Ensure you know the format of the logs (for example, are they comma-separated, JSON lines, or plain text?). For illustration, we will fabricate some log entries.</p>
</li>
<li><p><strong>Set up your script or notebook</strong> – Open your Python file or notebook. We’ll start by importing the necessary libraries and setting up any configurations (like random seeds for reproducibility).</p>
</li>
</ol>
<p>By the end of this setup, you should have a Python environment ready to run our SIEM log analysis code, and either a real log dataset or a plan to simulate the data as we do below.</p>
<h2 id="heading-implementing-log-analysis">Implementing Log Analysis</h2>
<p>In a full SIEM system, log analysis involves collecting logs from various sources and parsing them into a uniform format for further processing. Logs often contain fields like timestamp, severity level, source, event message, user ID, IP address, and so on. The first task is to ingest and preprocess these logs.</p>
<h3 id="heading-1-log-ingestion"><strong>1. Log Ingestion</strong></h3>
<p>If your logs are in a text file, you can read them in Python. For example, if each log entry is a line in the file, you could do:</p>
<pre><code class="lang-python"><span class="hljs-keyword">with</span> open(<span class="hljs-string">"my_logs.txt"</span>) <span class="hljs-keyword">as</span> f:
    raw_logs = f.readlines()
</code></pre>
<p>If the logs are structured (say, CSV format with columns), Pandas can greatly simplify reading:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
df = pd.read_csv(<span class="hljs-string">"my_logs.csv"</span>)
print(df.head())
</code></pre>
<p>This will give you a DataFrame <code>df</code> with your log entries organized in columns. But many logs are semi-structured (for example, components separated by spaces or special characters). In such cases, you might need to split each line by a delimiter or use regex to extract fields. For instance, imagine a log line:</p>
<pre><code class="lang-python"><span class="hljs-number">2025</span><span class="hljs-number">-03</span><span class="hljs-number">-06</span> <span class="hljs-number">08</span>:<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, INFO, User login success, user: admin
</code></pre>
<p>This has a timestamp, a log level, a message, and a user. We can parse such lines with Python’s string methods:</p>
<pre><code class="lang-python">logs = [
    <span class="hljs-string">"2025-03-06 08:00:00, INFO, User login success, user: admin"</span>,
    <span class="hljs-string">"2025-03-06 08:01:23, INFO, User login success, user: alice"</span>,
    <span class="hljs-string">"2025-03-06 08:02:45, ERROR, Failed login attempt, user: alice"</span>,
    <span class="hljs-comment"># ... (more log lines)</span>
]
parsed_logs = []
<span class="hljs-keyword">for</span> line <span class="hljs-keyword">in</span> logs:
    parts = [p.strip() <span class="hljs-keyword">for</span> p <span class="hljs-keyword">in</span> line.split(<span class="hljs-string">","</span>)]
    timestamp = parts[<span class="hljs-number">0</span>]
    level = parts[<span class="hljs-number">1</span>]
    message = parts[<span class="hljs-number">2</span>]
    user = parts[<span class="hljs-number">3</span>].split(<span class="hljs-string">":"</span>)[<span class="hljs-number">1</span>].strip() <span class="hljs-keyword">if</span> <span class="hljs-string">"user:"</span> <span class="hljs-keyword">in</span> parts[<span class="hljs-number">3</span>] <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>
    parsed_logs.append({<span class="hljs-string">"timestamp"</span>: timestamp, <span class="hljs-string">"level"</span>: level, <span class="hljs-string">"message"</span>: message, <span class="hljs-string">"user"</span>: user})

<span class="hljs-comment"># Convert to DataFrame for easier analysis</span>
df_logs = pd.DataFrame(parsed_logs)
print(df_logs.head())
</code></pre>
<p>Running the above on our sample list would output something like:</p>
<pre><code class="lang-python">            timestamp  level                 message   user
<span class="hljs-number">0</span>  <span class="hljs-number">2025</span><span class="hljs-number">-03</span><span class="hljs-number">-06</span> <span class="hljs-number">08</span>:<span class="hljs-number">00</span>:<span class="hljs-number">00</span>   INFO    User login success   admin
<span class="hljs-number">1</span>  <span class="hljs-number">2025</span><span class="hljs-number">-03</span><span class="hljs-number">-06</span> <span class="hljs-number">08</span>:<span class="hljs-number">01</span>:<span class="hljs-number">23</span>   INFO    User login success   alice
<span class="hljs-number">2</span>  <span class="hljs-number">2025</span><span class="hljs-number">-03</span><span class="hljs-number">-06</span> <span class="hljs-number">08</span>:<span class="hljs-number">02</span>:<span class="hljs-number">45</span>  ERROR  Failed login attempt   alice
...
</code></pre>
<p>Now we have structured the logs into a table. In a real scenario, you would continue parsing all relevant fields from your logs (for example, IP addresses, error codes, and so on) depending on what you want to analyze.</p>
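<p>Splitting on commas breaks as soon as a message itself contains a comma. As a minimal sketch of the regex approach mentioned earlier, the pattern below assumes the same sample format (<code>timestamp, LEVEL, message, user: name</code>). The names <code>FIELDS</code>, <code>LOG_PATTERN</code>, and <code>parse_line</code> are illustrative, and the pattern would need adjusting for your own log format.</p>

```python
import re

# Field names, in the same order as the capture groups below
FIELDS = ("timestamp", "level", "message", "user")

# Assumed format: "YYYY-MM-DD HH:MM:SS, LEVEL, free-text message, user: name"
LOG_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\s*"  # timestamp
    r"([A-Z]+),\s*"                                  # log level
    r"(.*?)"                                         # message (may contain commas)
    r"(?:,\s*user:\s*(\S+))?$"                       # optional trailing user field
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


sample_lines = [
    "2025-03-06 08:02:45, ERROR, Failed login attempt, user: alice",
    "2025-03-06 08:03:00, WARN, Disk full, retrying, user: bob",
    "not a log line",
]
for line in sample_lines:
    print(parse_line(line))
```

<p>Lines that do not match return <code>None</code>, so you can count or log them separately instead of letting one malformed entry crash the whole ingestion step.</p>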
<h3 id="heading-2-preprocessing-and-feature-extraction"><strong>2. Preprocessing and Feature Extraction</strong></h3>
<p>With the logs in a structured format, the next step is to derive features for anomaly detection. Raw log messages (strings) by themselves are hard for an algorithm to learn from directly. We often extract numeric features or categories that can be quantified. Some examples of features could be:</p>
<ul>
<li><p><strong>Event counts:</strong> number of events per minute/hour, number of login failures for each user, and so on.</p>
</li>
<li><p><strong>Duration or size:</strong> if logs include durations or data sizes (for example, file transfer size, query execution time), those numeric values can be directly used.</p>
</li>
<li><p><strong>Categorical encoding:</strong> log levels (INFO, ERROR, DEBUG) could be mapped to numbers, or specific event types could be one-hot encoded.</p>
</li>
</ul>
<p>For this proof-of-concept, let’s focus on a simple numeric feature: the count of login attempts per minute for a given user. We’ll simulate this as our feature data.</p>
<p>In a real system, you would compute this by grouping the parsed log entries by time window and user. The goal is to get an array of numbers where each number represents "how many login attempts occurred in a given minute." Most of the time this number will be low (normal behavior), but if a particular minute saw an unusually high number of attempts, that’s an anomaly (possibly a brute-force attack).</p>
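<p>As a rough sketch of that grouping step, the snippet below counts entries per user per minute with Pandas. The small <code>df_logs</code> table is made-up sample data in the same shape as the parsed logs above, and the <code>minute</code> and <code>attempts</code> column names are assumptions for illustration.</p>

```python
import pandas as pd

# Hypothetical parsed log entries (same columns as the parsed logs above)
df_logs = pd.DataFrame({
    "timestamp": [
        "2025-03-06 08:00:00", "2025-03-06 08:01:23",
        "2025-03-06 08:02:05", "2025-03-06 08:02:30", "2025-03-06 08:02:45",
    ],
    "message": [
        "User login success", "User login success",
        "Failed login attempt", "Failed login attempt", "Failed login attempt",
    ],
    "user": ["admin", "alice", "alice", "alice", "alice"],
})

# Convert strings to datetimes, then truncate each one to the start of its minute
df_logs["timestamp"] = pd.to_datetime(df_logs["timestamp"])
df_logs["minute"] = df_logs["timestamp"].dt.floor("min")

# Count how many entries each user produced in each one-minute window
attempts_per_minute = (
    df_logs.groupby(["minute", "user"])
    .size()
    .reset_index(name="attempts")
)
print(attempts_per_minute)
```

<p>The <code>attempts</code> column is the kind of numeric feature an anomaly detector can work with. In practice you would probably filter to login events first and choose a window size that suits your traffic.</p>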
<p>To simulate, we’ll generate a list of 50 values representing normal behavior, and then append a few values that are abnormally high:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Simulate 50 minutes of normal login attempt counts (around 5 per minute on average)</span>
np.random.seed(<span class="hljs-number">42</span>)  <span class="hljs-comment"># for reproducible example</span>
normal_counts = np.random.poisson(lam=<span class="hljs-number">5</span>, size=<span class="hljs-number">50</span>)

<span class="hljs-comment"># Simulate anomaly: a spike in login attempts (e.g., an attacker tries 30+ times in a minute)</span>
anomalous_counts = np.array([<span class="hljs-number">30</span>, <span class="hljs-number">40</span>, <span class="hljs-number">50</span>])

<span class="hljs-comment"># Combine the data</span>
login_attempts = np.concatenate([normal_counts, anomalous_counts])
print(<span class="hljs-string">"Login attempts per minute:"</span>, login_attempts)
</code></pre>
<p>When you run the above, <code>login_attempts</code> might look like:</p>
<pre><code class="lang-python">Login attempts per minute: [ <span class="hljs-number">5</span>  <span class="hljs-number">4</span>  <span class="hljs-number">4</span>  <span class="hljs-number">5</span>  <span class="hljs-number">5</span>  <span class="hljs-number">3</span>  <span class="hljs-number">5</span>  ...  <span class="hljs-number">4</span> <span class="hljs-number">30</span> <span class="hljs-number">40</span> <span class="hljs-number">50</span>]
</code></pre>
<p>Most values are in the single digits, but at the end we have three minutes with 30, 40, and 50 attempts – clear outliers. This is our prepared data for anomaly detection. In a real log analysis, this kind of data might come from counting events in your logs over time or extracting some metric from the log content.</p>
<p>Now that our data is ready, we can move on to building the anomaly detection model.</p>
<h2 id="heading-how-to-build-the-anomaly-detection-model">How to Build the Anomaly Detection Model</h2>
<p>To detect anomalies in our log-derived data, we’ll use a machine learning approach. Specifically, we’ll use an Isolation Forest – a popular algorithm for unsupervised anomaly detection.</p>
<p>The Isolation Forest works by randomly partitioning the data and isolating points. Anomalies are those points that get isolated (separated from others) quickly, that is, in fewer random splits. This makes it great for identifying outliers in a dataset without needing any labels (we don’t have to know in advance which log entries are “bad”).</p>
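<p>To build some intuition for "isolated in fewer random splits", here is a toy one-dimensional illustration. It is not scikit-learn's implementation: the helper <code>isolation_depth</code> and the sample counts are made up for this sketch. It simply counts how many random cuts it takes until a chosen value is separated from all the other values.</p>

```python
import random


def isolation_depth(values, target, rng):
    """Count random splits until `target` is alone (or only duplicates of it remain)."""
    remaining = list(values)
    depth = 0
    while len(set(remaining)) > 1:
        # Pick a random cut point between the current minimum and maximum
        split = rng.uniform(min(remaining), max(remaining))
        # Keep only the side of the cut that still contains the target
        if target < split:
            remaining = [v for v in remaining if v < split]
        else:
            remaining = [v for v in remaining if v >= split]
        depth += 1
    return depth


rng = random.Random(0)
counts = [3, 4, 4, 5, 5, 5, 6, 6, 7, 50]  # one obvious outlier
trials = 500

avg_normal = sum(isolation_depth(counts, 5, rng) for _ in range(trials)) / trials
avg_outlier = sum(isolation_depth(counts, 50, rng) for _ in range(trials)) / trials

print("Average splits to isolate 5: ", avg_normal)
print("Average splits to isolate 50:", avg_outlier)
```

<p>The outlier needs noticeably fewer splits on average than a value inside the dense cluster, and that difference in path length is what the real algorithm turns into an anomaly score.</p>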
<p>Why Isolation Forest?</p>
<ul>
<li><p>It’s efficient and works well even if we have a lot of data.</p>
</li>
<li><p>It doesn’t assume any specific data distribution (unlike some statistical methods).</p>
</li>
<li><p>It gives us a straightforward way to score anomalies.</p>
</li>
</ul>
<p>Let’s train an Isolation Forest on our <code>login_attempts</code> data:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> IsolationForest

<span class="hljs-comment"># Prepare the data in the shape the model expects (samples, features)</span>
X = login_attempts.reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)  <span class="hljs-comment"># each sample is a 1-dimensional [count]</span>

<span class="hljs-comment"># Initialize the Isolation Forest model</span>
model = IsolationForest(contamination=<span class="hljs-number">0.05</span>, random_state=<span class="hljs-number">42</span>)
<span class="hljs-comment"># contamination=0.05 means we expect about 5% of the data to be anomalies</span>

<span class="hljs-comment"># Train the model on the data</span>
model.fit(X)
</code></pre>
<p>A couple of notes on the code:</p>
<ul>
<li><p>We reshaped <code>login_attempts</code> to a 2D array <code>X</code> with one feature column because scikit-learn requires a 2D array for training (<code>fit</code>).</p>
</li>
<li><p>We set <code>contamination=0.05</code> to tell the model that roughly 5% of the data might be anomalies. In our synthetic data we added 3 anomalies out of 53 points, which is ~5.7%, so 5% is a reasonable guess. (If you don’t specify contamination, current scikit-learn versions default to <code>"auto"</code>, which uses the threshold from the original Isolation Forest paper. Versions before 0.22 defaulted to 0.1.)</p>
</li>
<li><p><code>random_state=42</code> just ensures reproducibility.</p>
</li>
</ul>
<p>At this point, the Isolation Forest model has been trained on our data. Internally, it has built an ensemble of random trees that partition the data. Points that are hard to isolate (that is, in the dense cluster of normal points) end up deep in these trees, while points that are easy to isolate (the outliers) end up with shorter paths.</p>
<p>Next, we’ll use this model to identify which data points are considered anomalous.</p>
<h2 id="heading-testing-and-visualizing-results">Testing and Visualizing Results</h2>
<p>Now comes the exciting part: using our trained model to detect anomalies in the log data. We’ll have the model predict labels for each data point and then filter out the ones flagged as outliers.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Use the model to predict anomalies</span>
labels = model.predict(X)
<span class="hljs-comment"># The model outputs +1 for normal points and -1 for anomalies</span>

<span class="hljs-comment"># Extract the anomaly indices and values</span>
anomaly_indices = np.where(labels == <span class="hljs-number">-1</span>)[<span class="hljs-number">0</span>]
anomaly_values = login_attempts[anomaly_indices]

print(<span class="hljs-string">"Anomaly indices:"</span>, anomaly_indices)
print(<span class="hljs-string">"Anomaly values (login attempts):"</span>, anomaly_values)
</code></pre>
<p>In our case, we expect the anomalies to be the large numbers we inserted (30, 40, 50). The output might look like:</p>
<pre><code class="lang-python">Anomaly indices: [<span class="hljs-number">50</span> <span class="hljs-number">51</span> <span class="hljs-number">52</span>]
Anomaly values (login attempts): [<span class="hljs-number">30</span> <span class="hljs-number">40</span> <span class="hljs-number">50</span>]
</code></pre>
<p>Even without knowing anything about “login attempts” specifically, the Isolation Forest recognized those values as out-of-line with the rest of the data.</p>
<p>This is the power of anomaly detection in a security context: we don’t always know what a new attack will look like, but if it causes something to drift far from normal patterns (like a user suddenly making 10 times more login attempts than usual), the anomaly detector shines a spotlight on it.</p>
<h3 id="heading-visualizing-the-results"><strong>Visualizing the results</strong></h3>
<p>In a real analysis, it’s often useful to visualize the data and the anomalies. For instance, we could plot the <code>login_attempts</code> values over time (minute by minute) and highlight the anomalies in a different color.</p>
<p>In this simple case, a line chart would show a mostly flat line around 3-8 logins/min with three huge spikes at the end. Those spikes are our anomalies. You could achieve this with Matplotlib if you’re running this in a notebook:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

plt.plot(login_attempts, label=<span class="hljs-string">"Login attempts per minute"</span>)
plt.scatter(anomaly_indices, anomaly_values, color=<span class="hljs-string">'red'</span>, label=<span class="hljs-string">"Anomalies"</span>)
plt.xlabel(<span class="hljs-string">"Time (minute index)"</span>)
plt.ylabel(<span class="hljs-string">"Login attempts"</span>)
plt.legend()
plt.show()
</code></pre>
<p>For text-based output as we have here, the printed results already confirm that the high values were caught. In more complex cases, anomaly detection models also provide an anomaly score for each point (for example, how far it is from the normal range). Scikit-learn’s IsolationForest, for example, has a <code>decision_function</code> method that yields a score (where lower scores mean more abnormal).</p>
<p>For simplicity, we won’t delve into the scores here, but it’s good to know you can retrieve them to rank anomalies by severity.</p>
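<p>If you do want to rank anomalies by severity, a minimal sketch could look like the following. It rebuilds a similar synthetic dataset so that it runs on its own (the exact counts are an assumption and will differ from the earlier example), then sorts the minutes by their <code>decision_function</code> score.</p>

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Rebuild a synthetic dataset: 50 normal minutes plus three spikes at the end
rng = np.random.default_rng(42)
login_attempts = np.concatenate([rng.poisson(lam=5, size=50), [30, 40, 50]])
X = login_attempts.reshape(-1, 1)

model = IsolationForest(contamination=0.05, random_state=42).fit(X)

# Lower scores are more abnormal; negative scores fall on the anomalous side
scores = model.decision_function(X)

# Sort minute indices from most to least anomalous and show the top five
ranked = np.argsort(scores)
for idx in ranked[:5]:
    print(f"minute {idx}: {login_attempts[idx]} attempts, score {scores[idx]:.3f}")
```

<p>Ranking by score lets an analyst start with the most extreme minutes rather than treating every flagged point as equally urgent.</p>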
<p>With the anomaly detection working, what can we do when we find an anomaly? That leads us to thinking about automated responses.</p>
<h2 id="heading-automated-response-possibilities">Automated Response Possibilities</h2>
<p>Detecting an anomaly is only half the battle — the next step is responding to it. In enterprise SIEM systems, automated response (often associated with SOAR – Security Orchestration, Automation, and Response) can dramatically reduce reaction time to incidents.</p>
<p>What could an AI-powered SIEM do when it flags something unusual? Here are some possibilities:</p>
<ul>
<li><p><strong>Alerting:</strong> The simplest action is to send an alert to security personnel. This could be an email, a Slack message, or creating a ticket in an incident management system. The alert would contain details of the anomaly (for example, “User <em>alice</em> had 50 failed login attempts in 1 minute, which is abnormal”). GenAI can help here by generating a clear natural-language summary of the incident for the analyst.</p>
</li>
<li><p><strong>Automated mitigation:</strong> More advanced systems might take direct action. For instance, if an IP address is showing malicious behavior in logs, the system could automatically block that IP on the firewall. In our login spike example, the system might temporarily lock the user account or prompt for additional authentication, under the assumption that it might be a bot attack. AI-based SIEMs today can indeed trigger predefined response actions or even orchestrate complex workflows when certain threats are detected (refer to <a target="_blank" href="https://www.exabeam.com/explainers/siem/ai-siem-how-siem-with-ai-ml-is-revolutionizing-the-soc/#:~:text=automatically%20trigger%20alerts%2C%20implement%20predefined,even%20orchestrate%20complex%20response%20workflows">AI SIEM: How SIEM with AI/ML is Revolutionizing the SOC | Exabeam</a> for more information).</p>
</li>
<li><p><strong>Investigation support:</strong> Generative AI could also be used to automatically gather context. For example, upon detecting the anomaly, the system could pull related logs (surrounding events, other actions by the same user or from the same IP) and provide an aggregated report. This saves the analyst from manually querying multiple data sources.</p>
</li>
</ul>
<p>It’s important to implement automated responses carefully — you don’t want the system to overreact to false positives. A common strategy is a tiered response: low-confidence anomalies might just log a warning or send a low-priority alert, whereas high-confidence anomalies (or combinations of anomalies) trigger active defense measures.</p>
<p>In practice, an AI-powered SIEM would integrate with your infrastructure (via APIs, scripts, and so on) to execute these actions. For our Python PoC, you could simulate an automated response by, say, printing a message or calling a dummy function when an anomaly is detected. For example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> len(anomaly_indices) &gt; <span class="hljs-number">0</span>:
    print(<span class="hljs-string">f"Alert! Detected <span class="hljs-subst">{len(anomaly_indices)}</span> anomalous events. Initiating response procedures..."</span>)
    <span class="hljs-comment"># Here, you could add code to disable a user or notify an admin, etc.</span>
</code></pre>
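<p>The tiered response idea described above could be sketched as a small function that maps an anomaly score to an action. The thresholds and action names here are hypothetical placeholders, and the function assumes a <code>decision_function</code>-style score where lower means more anomalous. You would tune the cut-offs against your own false-positive rate.</p>

```python
def choose_response(score, warn_below=0.0, alert_below=-0.05, act_below=-0.15):
    """Map an anomaly score (lower = more anomalous) to a response tier.

    All thresholds are illustrative assumptions, not recommended values.
    """
    if score < act_below:
        return "lock_account"      # high confidence: active defense
    if score < alert_below:
        return "alert_analyst"     # medium confidence: notify a human
    if score < warn_below:
        return "log_warning"       # low confidence: just record it
    return "no_action"


for example_score in (0.10, -0.01, -0.10, -0.30):
    print(example_score, choose_response(example_score))
```

<p>Keeping the decision in one function makes it easy to review, test, and adjust the thresholds without touching the detection code.</p>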
<p>While our demonstration is simple, it’s easy to imagine scaling this up. The SIEM could, for instance, feed anomalies into a larger generative model that assesses the situation and decides on the best course of action (like a chatbot Ops assistant that knows your runbooks). The possibilities for automation are expanding as AI becomes more sophisticated.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, we built a basic AI-powered SIEM component that ingests log data, analyzes it for anomalies using a machine learning model, and identifies unusual events that could represent security threats.</p>
<p>We started by parsing and preparing log data, then used an Isolation Forest model to detect outliers in a stream of login attempt counts. The model successfully flagged out-of-norm behavior without any prior knowledge of what an “attack” looks like – it purely relied on deviations from learned normal patterns.</p>
<p>We also discussed how such a system could respond to detected anomalies, from alerting humans to automatically taking action.</p>
<p>Modern SIEM systems augmented with AI/ML are moving in this direction: not only do they detect issues, but they also help triage and respond to them. Generative AI further enhances this by learning from analysts and providing intelligent summaries and decisions, effectively becoming a tireless assistant in the Security Operations Center.</p>
<p>For next steps and improvements:</p>
<ul>
<li><p>You can try this approach on real log data. For example, take a system log file and extract a feature like “number of error logs per hour” or “bytes transferred per session” and run anomaly detection on that.</p>
</li>
<li><p>Experiment with other algorithms like One-Class SVM or Local Outlier Factor for anomaly detection to see how they compare.</p>
</li>
<li><p>Incorporate a simple language model to parse log lines or to explain anomalies. For instance, an LLM could read an anomalous log entry and suggest what might be wrong (“This error usually means the database is unreachable”).</p>
</li>
<li><p>Extend the features: in a real SIEM, you’d use many signals at once (failed login counts, unusual IP geolocation, rare process names in logs, and so on). More features and data can improve the context for detection.</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Real-Time Intrusion Detection System with Python and Open-Source Libraries ]]>
                </title>
                <description>
                    <![CDATA[ An Intrusion Detection System (IDS) is like a security camera for your network. Just as security cameras help identify suspicious activities in the physical world, an IDS will monitor your network to help detect any potential cyber attacks and securi... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-real-time-intrusion-detection-system-with-python/</link>
                <guid isPermaLink="false">678faf10f366e60cf6e7b6d0</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #cybersecurity ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Chaitanya Rahalkar ]]>
                </dc:creator>
                <pubDate>Tue, 21 Jan 2025 14:28:32 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1737469496956/6cb12a90-de25-46da-aafc-bbd5048d0411.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>An Intrusion Detection System (IDS) is like a security camera for your network. Just as security cameras help identify suspicious activities in the physical world, an IDS will monitor your network to help detect any potential cyber attacks and security breaches.</p>
<p>By the end of this tutorial, you will know how an IDS works and be able to build your own real-time network monitoring system using Python.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-understanding-the-types-of-ids">Understanding the Types of IDS</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-setup-your-development-environment">How to Set Up Your Development Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-building-the-core-ids-components">Building the Core IDS Components</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-building-the-packet-capture-engine">Building the Packet Capture Engine</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-building-the-traffic-analysis-module">Building the Traffic Analysis Module</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-building-the-detection-engine">Building the Detection Engine</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-building-the-alert-system">Building the Alert System</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-putting-it-all-together">Putting It All Together</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-ideas-to-extend-the-ids">Ideas to Extend the IDS</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-security-considerations">Security Considerations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-testing-the-ids-on-mock-data">Testing the IDS on Mock Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-understanding-the-types-of-ids">Understanding the Types of IDS</h2>
<p>Before we jump into the coding part, let’s understand the types of IDS:</p>
<ol>
<li><p><strong>Network-based IDS (NIDS)</strong>: This system monitors network traffic for suspicious activity.</p>
</li>
<li><p><strong>Host-based IDS (HIDS)</strong>: This system monitors system logs and file changes on individual hosts and is not directly deployed in the network.</p>
</li>
<li><p><strong>Signature-based IDS</strong>: This system can sit in the network or on a host and detects attacks by matching activity against a database of known attack signatures.</p>
</li>
<li><p><strong>Anomaly-based IDS</strong>: This system learns a baseline of normal behavior, using heuristics or machine learning models, and flags activity that deviates from that baseline.</p>
</li>
</ol>
<p>For this tutorial, you will be building a hybrid system that combines signature-based and anomaly-based detection systems to monitor network traffic.</p>
<h2 id="heading-how-to-setup-your-development-environment">How to Set Up Your Development Environment</h2>
<p>Let’s start by setting up our Python environment (I’m using Python 3) and installing the following prerequisites:</p>
<pre><code class="lang-bash">pip install scapy
pip install python-nmap
pip install numpy
pip install scikit-learn
</code></pre>
<h2 id="heading-building-the-core-ids-components">Building the Core IDS Components</h2>
<p>Our IDS will consist of four main components:</p>
<ol>
<li><p>A packet capture system</p>
</li>
<li><p>A traffic analysis module</p>
</li>
<li><p>A detection engine</p>
</li>
<li><p>An alert system</p>
</li>
</ol>
<h3 id="heading-building-the-packet-capture-engine">Building the Packet Capture Engine</h3>
<p>Let’s start with the packet capture engine. We’ll use Scapy for this. Scapy is a packet manipulation library that lets us capture, inspect, and craft network packets from Python.</p>
<p>First, we’ll define our <code>PacketCapture</code> class that will serve as the basis of our IDS.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> scapy.all <span class="hljs-keyword">import</span> sniff, IP, TCP
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> defaultdict
<span class="hljs-keyword">import</span> threading
<span class="hljs-keyword">import</span> queue

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">PacketCapture</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        self.packet_queue = queue.Queue()
        self.stop_capture = threading.Event()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">packet_callback</span>(<span class="hljs-params">self, packet</span>):</span>
        <span class="hljs-keyword">if</span> IP <span class="hljs-keyword">in</span> packet <span class="hljs-keyword">and</span> TCP <span class="hljs-keyword">in</span> packet:
            self.packet_queue.put(packet)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">start_capture</span>(<span class="hljs-params">self, interface=<span class="hljs-string">"eth0"</span></span>):</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">capture_thread</span>():</span>
            sniff(iface=interface,
                  prn=self.packet_callback,
                  store=<span class="hljs-number">0</span>,
                  stop_filter=<span class="hljs-keyword">lambda</span> _: self.stop_capture.is_set())

        self.capture_thread = threading.Thread(target=capture_thread)
        self.capture_thread.start()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">stop</span>(<span class="hljs-params">self</span>):</span>
        self.stop_capture.set()
        self.capture_thread.join()
</code></pre>
<p>Let’s quickly walk through the code and understand what these functions do. The class uses a thread and a queue so that packets can be captured without blocking the rest of the program.</p>
<p>The <code>__init__</code> method initializes the class by creating a <code>queue.Queue</code> to store captured packets and a <code>threading.Event</code> to control when the packet capture should stop. The <code>packet_callback</code> method acts as a handler for each captured packet and checks if the packet contains both IP and TCP layers. If so, it adds it to the queue for further processing.</p>
<p>The <code>start_capture</code> method begins capturing packets on a specified interface (defaulting to <code>eth0</code>, a common name for the first Ethernet interface on Linux). Run <code>ifconfig</code> (or <code>ip link</code> on modern Linux systems) to list the available interfaces and pick the one you want to monitor.</p>
<p>The function spawns a separate thread to run Scapy’s <code>sniff</code> function, which continuously monitors the interface for packets. The <code>stop_filter</code> parameter is evaluated each time a packet arrives and ends the capture once the <code>stop_capture</code> event is set.</p>
<p>The <code>stop</code> method stops the capture by setting the <code>stop_capture</code> event and waits for the thread to finish execution, so the process terminates cleanly. Because <code>stop_filter</code> only runs when a packet arrives, <code>stop</code> can block on an idle interface until the next packet is seen. This design allows real-time packet capturing without blocking the main thread.</p>
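<p>The consumer side of the queue is not shown above. As a minimal sketch of how the main thread might drain it, the example below replaces Scapy with a stand-in producer thread (<code>fake_capture</code> is a made-up helper) so that it runs without network access or root privileges. The queue and event play the same roles as <code>packet_queue</code> and <code>stop_capture</code> in the class.</p>

```python
import queue
import threading

packet_queue = queue.Queue()
stop_capture = threading.Event()


def fake_capture():
    """Stand-in for the sniff thread: enqueue dummy 'packets', then signal stop."""
    for i in range(5):
        packet_queue.put(f"packet-{i}")
    stop_capture.set()


producer = threading.Thread(target=fake_capture)
producer.start()

processed = []
# Keep draining until the producer has stopped AND nothing is left in the queue
while not (stop_capture.is_set() and packet_queue.empty()):
    try:
        # A timeout keeps the loop responsive instead of blocking forever
        packet = packet_queue.get(timeout=0.1)
    except queue.Empty:
        continue
    processed.append(packet)

producer.join()
print(processed)
```

<p>Using <code>get</code> with a timeout means the consumer periodically rechecks the stop flag, so it can exit even when no more packets arrive.</p>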
<h3 id="heading-building-the-traffic-analysis-module">Building the Traffic Analysis Module</h3>
<p>Now, let’s write the traffic analysis module. This module will process captured packets and extract relevant features.</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">TrafficAnalyzer</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        self.connections = defaultdict(list)
        self.flow_stats = defaultdict(<span class="hljs-keyword">lambda</span>: {
            <span class="hljs-string">'packet_count'</span>: <span class="hljs-number">0</span>,
            <span class="hljs-string">'byte_count'</span>: <span class="hljs-number">0</span>,
            <span class="hljs-string">'start_time'</span>: <span class="hljs-literal">None</span>,
            <span class="hljs-string">'last_time'</span>: <span class="hljs-literal">None</span>
        })

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">analyze_packet</span>(<span class="hljs-params">self, packet</span>):</span>
        <span class="hljs-keyword">if</span> IP <span class="hljs-keyword">in</span> packet <span class="hljs-keyword">and</span> TCP <span class="hljs-keyword">in</span> packet:
            ip_src = packet[IP].src
            ip_dst = packet[IP].dst
            port_src = packet[TCP].sport
            port_dst = packet[TCP].dport

            flow_key = (ip_src, ip_dst, port_src, port_dst)

            <span class="hljs-comment"># Update flow statistics</span>
            stats = self.flow_stats[flow_key]
            stats[<span class="hljs-string">'packet_count'</span>] += <span class="hljs-number">1</span>
            stats[<span class="hljs-string">'byte_count'</span>] += len(packet)
            current_time = packet.time

            <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> stats[<span class="hljs-string">'start_time'</span>]:
                stats[<span class="hljs-string">'start_time'</span>] = current_time
            stats[<span class="hljs-string">'last_time'</span>] = current_time

            <span class="hljs-keyword">return</span> self.extract_features(packet, stats)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract_features</span>(<span class="hljs-params">self, packet, stats</span>):</span>
        <span class="hljs-comment"># The first packet of a flow has zero duration, so guard the rate divisions</span>
        duration = float(stats[<span class="hljs-string">'last_time'</span>] - stats[<span class="hljs-string">'start_time'</span>])
        <span class="hljs-keyword">return</span> {
            <span class="hljs-string">'packet_size'</span>: len(packet),
            <span class="hljs-string">'flow_duration'</span>: duration,
            <span class="hljs-string">'packet_rate'</span>: stats[<span class="hljs-string">'packet_count'</span>] / duration <span class="hljs-keyword">if</span> duration &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>,
            <span class="hljs-string">'byte_rate'</span>: stats[<span class="hljs-string">'byte_count'</span>] / duration <span class="hljs-keyword">if</span> duration &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>,
            <span class="hljs-string">'tcp_flags'</span>: packet[TCP].flags,
            <span class="hljs-string">'window_size'</span>: packet[TCP].window
        }
</code></pre>
<p>In this code section, we define the <code>TrafficAnalyzer</code> class to analyze network traffic. It tracks connection flows and calculates per-flow statistics as packets arrive. Both attributes use Python’s <code>defaultdict</code>, keyed by flow, so a new flow gets an empty entry the first time it is seen.</p>
<p>The <code>__init__</code> method initializes two attributes: <code>connections</code>, which is intended to store lists of related packets for each flow (it is declared here but not populated in this snippet), and <code>flow_stats</code>, which stores aggregated statistics for each flow, such as packet count, byte count, start time, and the time of the most recent packet.</p>
<p>The <code>analyze_packet</code> method processes each packet. If the packet contains both IP and TCP layers, it extracts the source and destination IPs and ports, forming a unique <code>flow_key</code> to identify the flow. It updates the statistics for the flow by incrementing the packet count, adding the packet’s size to the byte count, and setting or updating the start and last time of the flow. Finally, it calls <code>extract_features</code> to calculate and return additional metrics. Packets without both layers are ignored, and the method implicitly returns <code>None</code> for them.</p>
<p>The <code>extract_features</code> method computes detailed characteristics of the flow and the current packet. These include the packet size, flow duration, packet rate, byte rate, TCP flags, and the TCP window size. These metrics are quite useful to identify patterns, anomalies, or potential threats in network traffic.</p>
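<p>As a rough illustration of the flow bookkeeping described above, here is a minimal sketch that uses plain tuples and hand-picked timestamps instead of Scapy packets. The helper names <code>new_stats</code> and <code>update_flow</code> and the sample values are assumptions made for this example. The point to notice is that the first packet of a flow has a duration of zero, so the rate calculations need a guard.</p>

```python
from collections import defaultdict

def new_stats():
    """Fresh counters for a flow that has not been seen before."""
    return {"packet_count": 0, "byte_count": 0, "start_time": None, "last_time": None}

flow_stats = defaultdict(new_stats)

def update_flow(flow_key, size, timestamp):
    """Update per-flow counters and return rate features (hypothetical helper)."""
    stats = flow_stats[flow_key]
    stats["packet_count"] += 1
    stats["byte_count"] += size
    if stats["start_time"] is None:
        stats["start_time"] = timestamp
    stats["last_time"] = timestamp

    duration = stats["last_time"] - stats["start_time"]
    # Avoid dividing by zero on the first packet of a flow
    packet_rate = stats["packet_count"] / duration if duration > 0 else 0.0
    byte_rate = stats["byte_count"] / duration if duration > 0 else 0.0
    return {"flow_duration": duration, "packet_rate": packet_rate, "byte_rate": byte_rate}

# Flow key: (source IP, destination IP, source port, destination port)
key = ("192.168.1.1", "192.168.1.2", 1234, 80)
first = update_flow(key, size=60, timestamp=10.0)
second = update_flow(key, size=140, timestamp=12.0)
print(first)
print(second)
```

<p>Returning a rate of zero for a single-packet flow is one possible convention. You could also skip feature extraction until a flow has at least two packets.</p>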
<h3 id="heading-building-the-detection-engine">Building the Detection Engine</h3>
<p>Now we will define our detection engine, which implements both signature-based and anomaly-based detection:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> IsolationForest
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">DetectionEngine</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        self.anomaly_detector = IsolationForest(
            contamination=<span class="hljs-number">0.1</span>,
            random_state=<span class="hljs-number">42</span>
        )
        self.signature_rules = self.load_signature_rules()
        self.training_data = []

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_signature_rules</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-keyword">return</span> {
            <span class="hljs-string">'syn_flood'</span>: {
                <span class="hljs-string">'condition'</span>: <span class="hljs-keyword">lambda</span> features: (
                    features[<span class="hljs-string">'tcp_flags'</span>] == <span class="hljs-number">2</span> <span class="hljs-keyword">and</span>  <span class="hljs-comment"># SYN flag</span>
                    features[<span class="hljs-string">'packet_rate'</span>] &gt; <span class="hljs-number">100</span>
                )
            },
            <span class="hljs-string">'port_scan'</span>: {
                <span class="hljs-string">'condition'</span>: <span class="hljs-keyword">lambda</span> features: (
                    features[<span class="hljs-string">'packet_size'</span>] &lt; <span class="hljs-number">100</span> <span class="hljs-keyword">and</span>
                    features[<span class="hljs-string">'packet_rate'</span>] &gt; <span class="hljs-number">50</span>
                )
            }
        }

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train_anomaly_detector</span>(<span class="hljs-params">self, normal_traffic_data</span>):</span>
        self.anomaly_detector.fit(normal_traffic_data)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">detect_threats</span>(<span class="hljs-params">self, features</span>):</span>
        threats = []

        <span class="hljs-comment"># Signature-based detection</span>
        <span class="hljs-keyword">for</span> rule_name, rule <span class="hljs-keyword">in</span> self.signature_rules.items():
            <span class="hljs-keyword">if</span> rule[<span class="hljs-string">'condition'</span>](features):
                threats.append({
                    <span class="hljs-string">'type'</span>: <span class="hljs-string">'signature'</span>,
                    <span class="hljs-string">'rule'</span>: rule_name,
                    <span class="hljs-string">'confidence'</span>: <span class="hljs-number">1.0</span>
                })

        <span class="hljs-comment"># Anomaly-based detection</span>
        feature_vector = np.array([[
            features[<span class="hljs-string">'packet_size'</span>],
            features[<span class="hljs-string">'packet_rate'</span>],
            features[<span class="hljs-string">'byte_rate'</span>]
        ]])

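        <span class="hljs-comment"># Only score once the model has been fitted;</span>
        <span class="hljs-comment"># score_samples raises NotFittedError on an untrained model</span>
        <span class="hljs-keyword">if</span> hasattr(self.anomaly_detector, <span class="hljs-string">'estimators_'</span>):
            anomaly_score = self.anomaly_detector.score_samples(feature_vector)[<span class="hljs-number">0</span>]
            <span class="hljs-keyword">if</span> anomaly_score &lt; <span class="hljs-number">-0.5</span>:  <span class="hljs-comment"># Threshold for anomaly detection</span>
                threats.append({
                    <span class="hljs-string">'type'</span>: <span class="hljs-string">'anomaly'</span>,
                    <span class="hljs-string">'score'</span>: anomaly_score,
                    <span class="hljs-string">'confidence'</span>: min(<span class="hljs-number">1.0</span>, abs(anomaly_score))
                })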

        <span class="hljs-keyword">return</span> threats
</code></pre>
<p>This code defines a hybrid system that combines the signature-based and anomaly-based detection methods. We use the Isolation Forest model to detect anomalies and also use pre-defined rules for identifying specific attack patterns. If you would like to know more about how the Isolation Forest model works, check out <a target="_blank" href="https://medium.com/@corymaklin/isolation-forest-799fceacdda4">this</a> article.</p>
<p>In this code snippet, the <code>train_anomaly_detector</code> method trains the Isolation Forest model using a dataset of normal traffic features. This enables the model to differentiate typical traffic patterns from anomalies.</p>
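<p>To make the training step more concrete, here is a small sketch that fits an Isolation Forest on synthetic data and compares the scores of a typical and an unusual sample. The traffic values are made-up placeholders, not measurements, and the feature order (packet size, packet rate, byte rate) simply mirrors the feature vector used by the detection engine.</p>

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic "normal" traffic: [packet_size, packet_rate, byte_rate]
# (placeholder values chosen only for illustration)
normal_traffic = np.column_stack([
    rng.normal(500, 50, 500),
    rng.normal(20, 3, 500),
    rng.normal(10_000, 1_000, 500),
])

model = IsolationForest(contamination=0.1, random_state=42)
model.fit(normal_traffic)

typical = np.array([[500, 20, 10_000]])
unusual = np.array([[40, 5_000, 200_000]])

# score_samples returns negative values; lower means more anomalous
typical_score = model.score_samples(typical)[0]
unusual_score = model.score_samples(unusual)[0]
print(f"typical: {typical_score:.3f}, unusual: {unusual_score:.3f}")
```

<p>Because lower scores mean more unusual samples, a cut-off such as the <code>-0.5</code> used in the detection engine is a tuning choice. It is worth checking it against scores from your own baseline traffic.</p>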
<p>The <code>detect_threats</code> method evaluates network traffic features for potential threats using two approaches:</p>
<ol>
<li><p><strong>Signature-Based Detection</strong>: It iterates over the predefined rules and applies each rule’s condition to the traffic features. If a rule matches, a signature-based threat is recorded with a fixed confidence of 1.0.</p>
</li>
<li><p><strong>Anomaly-Based Detection</strong>: It passes the feature vector (packet size, packet rate, and byte rate) to the Isolation Forest model to calculate an anomaly score. Lower scores indicate more unusual traffic: if the score falls below -0.5, the detection engine flags the packet as an anomaly and derives a confidence value from the magnitude of the score, capped at 1.0.</p>
</li>
</ol>
<p>Finally, the method returns the list of identified threats. Each entry is labeled with its type (signature or anomaly), the rule or score that triggered the detection, and a confidence value that suggests how likely it is that the identified pattern is a threat.</p>
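<p>The signature rules can also be tried out on their own with hand-written feature dictionaries. The sketch below reuses the same two conditions. The constants <code>SYN</code> and <code>ACK</code>, the helper <code>match_signatures</code>, and the sample feature values are assumptions for this example. It also shows that comparing the flags to <code>2</code> matches only packets with the SYN bit alone, so a SYN-ACK does not trigger the rule.</p>

```python
SYN = 0x02  # TCP SYN flag bit
ACK = 0x10  # TCP ACK flag bit

signature_rules = {
    "syn_flood": lambda f: f["tcp_flags"] == SYN and f["packet_rate"] > 100,
    "port_scan": lambda f: f["packet_size"] < 100 and f["packet_rate"] > 50,
}

def match_signatures(features):
    """Return the names of all rules whose condition holds (hypothetical helper)."""
    return [name for name, rule in signature_rules.items() if rule(features)]

syn_burst = {"tcp_flags": SYN, "packet_size": 60, "packet_rate": 250.0}
syn_ack = {"tcp_flags": SYN | ACK, "packet_size": 60, "packet_rate": 10.0}

print(match_signatures(syn_burst))
print(match_signatures(syn_ack))
```

<p>Note that the small, fast SYN packets match both rules here. That overlap is expected with thresholds this simple, and it is one reason real rule sets usually look at more context, such as the number of distinct destination ports.</p>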
<h3 id="heading-building-the-alert-system">Building the Alert System</h3>
<p>Now let’s build the last component of our IDS: the alert system. It will process and log detected threats in a structured way. You can also extend it with additional notification mechanisms such as Slack messages, Jira tickets, and so on.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> logging
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AlertSystem</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, log_file=<span class="hljs-string">"ids_alerts.log"</span></span>):</span>
        self.logger = logging.getLogger(<span class="hljs-string">"IDS_Alerts"</span>)
        self.logger.setLevel(logging.INFO)

        handler = logging.FileHandler(log_file)
        formatter = logging.Formatter(
            <span class="hljs-string">'%(asctime)s - %(levelname)s - %(message)s'</span>
        )
        handler.setFormatter(formatter)
        self.logger.addHandler(handler)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_alert</span>(<span class="hljs-params">self, threat, packet_info</span>):</span>
        alert = {
            <span class="hljs-string">'timestamp'</span>: datetime.now().isoformat(),
            <span class="hljs-string">'threat_type'</span>: threat[<span class="hljs-string">'type'</span>],
            <span class="hljs-string">'source_ip'</span>: packet_info.get(<span class="hljs-string">'source_ip'</span>),
            <span class="hljs-string">'destination_ip'</span>: packet_info.get(<span class="hljs-string">'destination_ip'</span>),
            <span class="hljs-string">'confidence'</span>: threat.get(<span class="hljs-string">'confidence'</span>, <span class="hljs-number">0.0</span>),
            <span class="hljs-string">'details'</span>: threat
        }

        self.logger.warning(json.dumps(alert))

        <span class="hljs-keyword">if</span> threat[<span class="hljs-string">'confidence'</span>] &gt; <span class="hljs-number">0.8</span>:
            self.logger.critical(
                <span class="hljs-string">f"High confidence threat detected: <span class="hljs-subst">{json.dumps(alert)}</span>"</span>
            )
            <span class="hljs-comment"># Implement additional notification methods here</span>
            <span class="hljs-comment"># (e.g., email, Slack, SIEM integration)</span>
</code></pre>
<p>The <code>__init__</code> method sets up a logger named <code>IDS_Alerts</code> with an <code>INFO</code> logging level to capture alert information. It writes logs to a specified file, <code>ids_alerts.log</code> by default. A <code>FileHandler</code> directs logs to the file, while the <code>Formatter</code> ensures the logs follow a consistent format.</p>
<p>The <code>generate_alert</code> method is responsible for creating structured alert entries. Each alert includes key information such as the timestamp of detection, the type of threat, the source and destination IPs involved, the confidence level of the detection, and additional threat-specific details. These alerts are logged as <code>WARNING</code> level messages in JSON format.</p>
<p>If the confidence level of a detected threat is high (greater than 0.8), the alert is escalated and logged as a <code>CRITICAL</code> level message. Note that this method is designed to be extensible, allowing for additional notification mechanisms, such as sending alerts via email or integrating with third-party systems like Slack or SIEM solutions.</p>
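<p>To see the escalation logic without touching the file system, the following sketch logs to an in-memory buffer instead of a file. The function names <code>build_alert_logger</code> and <code>log_alert</code> and the logger name are made up for this example. The behavior mirrors the description above: every threat is logged at <code>WARNING</code>, and threats with a confidence above 0.8 are logged again at <code>CRITICAL</code>.</p>

```python
import io
import json
import logging

def build_alert_logger(stream):
    """Create a logger that writes to the given stream (hypothetical helper)."""
    logger = logging.getLogger("IDS_Alerts_Demo")  # name chosen for this sketch
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called more than once
    logger.propagate = False  # keep demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
    logger.addHandler(handler)
    return logger

def log_alert(logger, threat, threshold=0.8):
    """Log every threat as WARNING and escalate high-confidence ones to CRITICAL."""
    payload = json.dumps(threat)
    logger.warning(payload)
    if threat.get("confidence", 0.0) > threshold:
        logger.critical("High confidence threat detected: %s", payload)

buffer = io.StringIO()
demo_logger = build_alert_logger(buffer)
log_alert(demo_logger, {"type": "anomaly", "confidence": 0.55})
log_alert(demo_logger, {"type": "signature", "rule": "syn_flood", "confidence": 1.0})

lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

<p>Clearing existing handlers before adding a new one is a design choice for this sketch. Since <code>logging.getLogger</code> returns the same logger object for the same name, creating the alert system more than once would otherwise write each alert multiple times.</p>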
<h3 id="heading-putting-it-all-together">Putting it All Together</h3>
<p>Now let’s integrate all the components together into our fully functional IDS solution:</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">IntrusionDetectionSystem</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, interface=<span class="hljs-string">"eth0"</span></span>):</span>
        self.packet_capture = PacketCapture()
        self.traffic_analyzer = TrafficAnalyzer()
        self.detection_engine = DetectionEngine()
        self.alert_system = AlertSystem()

        self.interface = interface

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">start</span>(<span class="hljs-params">self</span>):</span>
        print(<span class="hljs-string">f"Starting IDS on interface <span class="hljs-subst">{self.interface}</span>"</span>)
        self.packet_capture.start_capture(self.interface)

        <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
            <span class="hljs-keyword">try</span>:
                packet = self.packet_capture.packet_queue.get(timeout=<span class="hljs-number">1</span>)
                features = self.traffic_analyzer.analyze_packet(packet)

                <span class="hljs-keyword">if</span> features:
                    threats = self.detection_engine.detect_threats(features)

                    <span class="hljs-keyword">for</span> threat <span class="hljs-keyword">in</span> threats:
                        packet_info = {
                            <span class="hljs-string">'source_ip'</span>: packet[IP].src,
                            <span class="hljs-string">'destination_ip'</span>: packet[IP].dst,
                            <span class="hljs-string">'source_port'</span>: packet[TCP].sport,
                            <span class="hljs-string">'destination_port'</span>: packet[TCP].dport
                        }
                        self.alert_system.generate_alert(threat, packet_info)

            <span class="hljs-keyword">except</span> queue.Empty:
                <span class="hljs-keyword">continue</span>
            <span class="hljs-keyword">except</span> KeyboardInterrupt:
                print(<span class="hljs-string">"Stopping IDS..."</span>)
                self.packet_capture.stop()
                <span class="hljs-keyword">break</span>

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    ids = IntrusionDetectionSystem()
    ids.start()
</code></pre>
<p>In this code, the <code>IntrusionDetectionSystem</code> class sets up its core components: <code>PacketCapture</code> for capturing packets from a network interface, <code>TrafficAnalyzer</code> for extracting and analyzing packet features, <code>DetectionEngine</code> for identifying threats using both signature-based and anomaly-based methods, and <code>AlertSystem</code> for logging and escalating detected threats. The <code>interface</code> parameter specifies the network interface to monitor, defaulting to <code>eth0</code>, the traditional name of the first Ethernet interface on Linux. Other systems often use different names (for example, <code>en0</code> on macOS or <code>enp0s3</code> on newer Linux distributions), so adjust it to match your machine.</p>
<p>The <code>start</code> function initiates the IDS. It begins by starting packet capture on the specified interface and enters a loop to continuously process incoming packets. For each packet captured, the system extracts its features using the <code>TrafficAnalyzer</code> and analyzes them for potential threats using the <code>DetectionEngine</code>. If any threats are detected, the system generates detailed alerts through the <code>AlertSystem</code>.</p>
<p>The loop handles two exceptions. <code>queue.Empty</code> is raised when no packet arrives within the one-second timeout, and it simply sends the loop back to wait for the next packet. <code>KeyboardInterrupt</code> (Ctrl+C) stops the IDS gracefully by halting packet capture and exiting the loop.</p>
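<p>The polling pattern used by the main loop can be tried in isolation with nothing but the standard library. In this sketch a producer thread stands in for the packet capture thread, and strings stand in for packets. The names <code>producer</code>, <code>consume</code>, and <code>stop_event</code> are assumptions for the example. Unlike the IDS loop, which runs until you press Ctrl+C, this version exits once the producer has finished and the queue is drained.</p>

```python
import queue
import threading

packet_queue = queue.Queue()
stop_event = threading.Event()
processed = []

def producer():
    """Stand-in for the capture thread: enqueue a few fake packets."""
    for i in range(3):
        packet_queue.put(f"packet-{i}")
    stop_event.set()  # signal that no more packets will arrive

def consume():
    """Poll the queue with a timeout so the loop never blocks forever."""
    while True:
        try:
            item = packet_queue.get(timeout=0.1)
        except queue.Empty:
            # Nothing arrived in time: stop only if the producer is done
            # and the queue is fully drained, otherwise keep waiting
            if stop_event.is_set() and packet_queue.empty():
                break
            continue
        processed.append(item)

worker = threading.Thread(target=producer)
worker.start()
consume()
worker.join()
print(processed)
```

<p>Using a timeout on <code>get</code> is what keeps the loop responsive. A blocking <code>get()</code> without a timeout would wait indefinitely on an idle network and delay shutdown.</p>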
<h2 id="heading-ideas-to-extend-the-ids">Ideas to Extend the IDS</h2>
<p>To enhance or extend the IDS, you can consider designing or implementing the following features / improvements:</p>
<ol>
<li><p><strong>Machine Learning enhancements:</strong> You can enhance the IDS capabilities by incorporating deep learning models such as autoencoders for anomaly detection and RNNs for sequential pattern analysis. Combined with richer feature engineering, this can improve the system’s ability to identify complex and evolving threats.</p>
</li>
<li><p><strong>Performance optimizations</strong>: You can optimize the IDS using PyPy for faster execution, packet sampling to handle high-traffic networks, and parallel processing to scale the system efficiently.</p>
</li>
<li><p><strong>Integration capabilities</strong>: You can extend the IDS by considering support for a REST API for remote monitoring, enabling seamless interaction with external systems.</p>
</li>
</ol>
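<p>As one example of the performance ideas above, packet sampling can be as simple as analyzing every n-th packet. This is a minimal sketch of that idea, not a tuned implementation. The function name <code>sample_every_nth</code> is made up, and integers stand in for packets.</p>

```python
import itertools

def sample_every_nth(packets, n):
    """Yield every n-th item from an iterable (1-in-n systematic sampling)."""
    return itertools.islice(packets, n - 1, None, n)

# Integers stand in for captured packets in this sketch
sampled = list(sample_every_nth(range(1, 21), 5))
print(sampled)
```

<p>Keep in mind that sampling lowers the packet rate the analyzer observes, so rate-based thresholds would need to be scaled accordingly, and short bursts may be missed entirely.</p>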
<h2 id="heading-security-considerations">Security Considerations</h2>
<p>When deploying the IDS, note that the system is a proof-of-concept and is not intended for production use-cases. Also keep the following things in mind:</p>
<ul>
<li><p>Run the system with appropriate permissions (root/admin required for packet capture)</p>
</li>
<li><p>Secure the alert logs and implement proper log rotation</p>
</li>
<li><p>Regularly update signature rules and retrain anomaly detection models</p>
</li>
<li><p>Monitor system resource usage, especially in high-traffic environments</p>
</li>
<li><p>Implement proper access controls for the IDS configuration and alerts</p>
</li>
</ul>
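<p>For the log rotation point, Python’s standard library already ships a <code>RotatingFileHandler</code>. The sketch below writes to a temporary directory and uses a deliberately tiny size limit so that rotation is easy to observe. The logger name, the limits, and the test messages are placeholder values for this example.</p>

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "ids_alerts.log")

rotating_logger = logging.getLogger("IDS_Alerts_Rotating")  # name chosen for this sketch
rotating_logger.setLevel(logging.INFO)
rotating_logger.propagate = False

# Roll over at roughly 1 KB and keep at most 3 old files
# (tiny limits, used here only to make rotation visible)
handler = RotatingFileHandler(log_path, maxBytes=1_000, backupCount=3)
handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
rotating_logger.addHandler(handler)

for i in range(100):
    rotating_logger.warning("test alert %d", i)
handler.close()

log_files = sorted(os.listdir(log_dir))
print(log_files)
```

<p>In a real deployment you would pick much larger limits and also restrict who can read the log directory, since alerts may contain internal IP addresses.</p>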
<h2 id="heading-testing-the-ids-on-mock-data">Testing the IDS on Mock Data</h2>
<p>To validate the functionality of your IDS, you can test it using mock data that will simulate real-world network traffic. This will allow you to observe how the system processes packets, analyzes traffic, and generates alerts without requiring a live network environment.</p>
<p>Use the following function to test the IDS:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> scapy.all <span class="hljs-keyword">import</span> IP, TCP

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_ids</span>():</span>
    <span class="hljs-comment"># Create test packets to simulate various scenarios</span>
    test_packets = [
        <span class="hljs-comment"># Normal traffic</span>
        IP(src=<span class="hljs-string">"192.168.1.1"</span>, dst=<span class="hljs-string">"192.168.1.2"</span>) / TCP(sport=<span class="hljs-number">1234</span>, dport=<span class="hljs-number">80</span>, flags=<span class="hljs-string">"A"</span>),
        IP(src=<span class="hljs-string">"192.168.1.3"</span>, dst=<span class="hljs-string">"192.168.1.4"</span>) / TCP(sport=<span class="hljs-number">1235</span>, dport=<span class="hljs-number">443</span>, flags=<span class="hljs-string">"P"</span>),

        <span class="hljs-comment"># SYN flood simulation</span>
        IP(src=<span class="hljs-string">"10.0.0.1"</span>, dst=<span class="hljs-string">"192.168.1.2"</span>) / TCP(sport=<span class="hljs-number">5678</span>, dport=<span class="hljs-number">80</span>, flags=<span class="hljs-string">"S"</span>),
        IP(src=<span class="hljs-string">"10.0.0.2"</span>, dst=<span class="hljs-string">"192.168.1.2"</span>) / TCP(sport=<span class="hljs-number">5679</span>, dport=<span class="hljs-number">80</span>, flags=<span class="hljs-string">"S"</span>),
        IP(src=<span class="hljs-string">"10.0.0.3"</span>, dst=<span class="hljs-string">"192.168.1.2"</span>) / TCP(sport=<span class="hljs-number">5680</span>, dport=<span class="hljs-number">80</span>, flags=<span class="hljs-string">"S"</span>),

        <span class="hljs-comment"># Port scan simulation</span>
        IP(src=<span class="hljs-string">"192.168.1.100"</span>, dst=<span class="hljs-string">"192.168.1.2"</span>) / TCP(sport=<span class="hljs-number">4321</span>, dport=<span class="hljs-number">22</span>, flags=<span class="hljs-string">"S"</span>),
        IP(src=<span class="hljs-string">"192.168.1.100"</span>, dst=<span class="hljs-string">"192.168.1.2"</span>) / TCP(sport=<span class="hljs-number">4321</span>, dport=<span class="hljs-number">23</span>, flags=<span class="hljs-string">"S"</span>),
        IP(src=<span class="hljs-string">"192.168.1.100"</span>, dst=<span class="hljs-string">"192.168.1.2"</span>) / TCP(sport=<span class="hljs-number">4321</span>, dport=<span class="hljs-number">25</span>, flags=<span class="hljs-string">"S"</span>),
    ]

    ids = IntrusionDetectionSystem()

    <span class="hljs-comment"># Simulate packet processing and threat detection</span>
    print(<span class="hljs-string">"Starting IDS Test..."</span>)
    <span class="hljs-keyword">for</span> i, packet <span class="hljs-keyword">in</span> enumerate(test_packets, <span class="hljs-number">1</span>):
        print(<span class="hljs-string">f"\nProcessing packet <span class="hljs-subst">{i}</span>: <span class="hljs-subst">{packet.summary()}</span>"</span>)

        <span class="hljs-comment"># Analyze the packet</span>
        features = ids.traffic_analyzer.analyze_packet(packet)

        <span class="hljs-keyword">if</span> features:
            <span class="hljs-comment"># Detect threats based on features</span>
            threats = ids.detection_engine.detect_threats(features)

            <span class="hljs-keyword">if</span> threats:
                print(<span class="hljs-string">f"Detected threats: <span class="hljs-subst">{threats}</span>"</span>)
            <span class="hljs-keyword">else</span>:
                print(<span class="hljs-string">"No threats detected."</span>)
        <span class="hljs-keyword">else</span>:
            print(<span class="hljs-string">"Packet does not contain IP/TCP layers or is ignored."</span>)

    print(<span class="hljs-string">"\nIDS Test Completed."</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    test_ids()
</code></pre>
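<p>This feeds the analyzer and detection engine a handful of crafted packets that resemble normal traffic, a SYN flood, and a port scan. Keep in mind that the Isolation Forest has to be fitted with <code>train_anomaly_detector</code> before it can score packets, and that each mock flow here contains only a single packet, so the rate-based signature rules are unlikely to fire unless you send many packets per flow.</p>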
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>Now you know how to build a basic intrusion detection system with Python and a few open-source libraries! This IDS demonstrates some core concepts of network security and real-time threat detection.</p>
<p>Keep in mind that this tutorial is for educational purposes only. There are professionally designed enterprise-grade systems like Snort and Suricata that can handle advanced threats and large-scale deployments.</p>
<p>I hope you gained insights into network security fundamentals and learned how Python can be used to build practical security solutions.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Real-time Network Traffic Dashboard with Python and Streamlit ]]>
                </title>
                <description>
                    <![CDATA[ Have you ever wanted to visualize your network traffic in real-time? In this tutorial, you will be learning how to build an interactive network traffic analysis dashboard with Python and Streamlit. Streamlit is an open-source Python framework you can... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-real-time-network-traffic-dashboard-with-python-and-streamlit/</link>
                <guid isPermaLink="false">67786dec9c66c24e89239f0a</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ networking ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #cybersecurity ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Chaitanya Rahalkar ]]>
                </dc:creator>
                <pubDate>Fri, 03 Jan 2025 23:08:28 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735280432228/33730b4a-6424-48b0-a7bf-ef029663fb90.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Have you ever wanted to visualize your network traffic in real-time? In this tutorial, you will learn how to build an interactive network traffic analysis dashboard with Python and <code>Streamlit</code>. <code>Streamlit</code> is an open-source Python framework you can use to develop web applications for data analysis and data processing.</p>
<p>By the end of this tutorial, you will know how to capture raw network packets from the NIC (Network Interface Card) of your computer, process the data, and create beautiful visualizations that will update in real-time.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-is-network-traffic-analysis-important">Why is Network Traffic Analysis Important?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-setup-your-project">How to Set Up Your Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-the-core-functionalities">How to Build the Core Functionalities</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-the-streamlit-visualizations">How to Create the Streamlit Visualizations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-capture-the-network-packets">How to Capture the Network Packets</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-putting-everything-together">Putting Everything Together</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-future-enhancements">Future Enhancements</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-is-network-traffic-analysis-important">Why is Network Traffic Analysis Important?</h2>
<p>Network traffic analysis is a critical requirement in enterprises where networks form the backbone of nearly every application and service. At the core of it, we have analysis of network packets that involves monitoring the network, capturing all the traffic (ingress and egress), and interpreting these packets as they flow through a network. You can use this technique to identify security patterns, detect anomalies, and ensure the security and efficiency of the network.</p>
<p>The proof-of-concept project that we’ll work on in this tutorial is particularly useful because it helps you visualize and analyze network activity in real-time. It will also help you understand how troubleshooting, performance optimization, and security analysis are done in enterprise systems.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>Python 3.8 or a newer version installed on your system.</p>
</li>
<li><p>A basic understanding of <a target="_blank" href="https://www.freecodecamp.org/news/computer-networking-how-applications-talk-over-the-internet/">computer networking concepts</a>.</p>
</li>
<li><p>Familiarity with the <a target="_blank" href="https://www.freecodecamp.org/news/ultimate-beginners-python-course/">Python programming language</a> and its widely used libraries.</p>
</li>
<li><p>Basic knowledge of <a target="_blank" href="https://www.freecodecamp.org/news/learn-data-visualization-in-this-free-17-hour-course/">data visualization</a> techniques and libraries.</p>
</li>
</ul>
<h2 id="heading-how-to-setup-your-project">How to Set Up Your Project</h2>
<p>To get started, create the project directory and install the necessary packages with <code>pip</code> using the following commands:</p>
<pre><code class="lang-bash">mkdir network-dashboard
<span class="hljs-built_in">cd</span> network-dashboard
pip install streamlit pandas scapy plotly
</code></pre>
<p>We will be using <code>Streamlit</code> for the dashboard visualizations, <code>Pandas</code> for the data processing, <code>Scapy</code> for network packet capturing and packet processing, and finally <code>Plotly</code> for plotting charts with our collected data.</p>
<h2 id="heading-how-to-build-the-core-functionalities">How to Build the Core Functionalities</h2>
<p>We will put all of the code in a single file named <code>dashboard.py</code>. First, let’s import everything we will be using:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> streamlit <span class="hljs-keyword">as</span> st
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> plotly.express <span class="hljs-keyword">as</span> px
<span class="hljs-keyword">import</span> plotly.graph_objects <span class="hljs-keyword">as</span> go
<span class="hljs-keyword">from</span> scapy.all <span class="hljs-keyword">import</span> *
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> defaultdict
<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> threading
<span class="hljs-keyword">import</span> warnings
<span class="hljs-keyword">import</span> logging
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Dict, List, Optional
<span class="hljs-keyword">import</span> socket
</code></pre>
<p>Now let’s set up a basic logging configuration. We’ll use it to track events while the application runs. The logging level is set to <code>INFO</code>, meaning that events with level <code>INFO</code> or higher will be displayed, while <code>DEBUG</code> messages will not. If you are not familiar with logging in Python, I’d recommend checking out <a target="_blank" href="https://docs.python.org/3/library/logging.html">the official logging documentation</a>, which covers it in depth.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Configure logging</span>
logging.basicConfig(
    level=logging.INFO,
    format=<span class="hljs-string">'%(asctime)s - %(levelname)s - %(message)s'</span>
)
logger = logging.getLogger(__name__)
</code></pre>
<p>Next, we’ll build our packet processor. We’ll implement the functionality of processing our captured packets in this class.</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">PacketProcessor</span>:</span>
    <span class="hljs-string">"""Process and analyze network packets"""</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        self.protocol_map = {
            <span class="hljs-number">1</span>: <span class="hljs-string">'ICMP'</span>,
            <span class="hljs-number">6</span>: <span class="hljs-string">'TCP'</span>,
            <span class="hljs-number">17</span>: <span class="hljs-string">'UDP'</span>
        }
        self.packet_data = []
        self.start_time = datetime.now()
        self.packet_count = <span class="hljs-number">0</span>
        self.lock = threading.Lock()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_protocol_name</span>(<span class="hljs-params">self, protocol_num: int</span>) -&gt; str:</span>
        <span class="hljs-string">"""Convert protocol number to name"""</span>
        <span class="hljs-keyword">return</span> self.protocol_map.get(protocol_num, <span class="hljs-string">f'OTHER(<span class="hljs-subst">{protocol_num}</span>)'</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_packet</span>(<span class="hljs-params">self, packet</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
        <span class="hljs-string">"""Process a single packet and extract relevant information"""</span>
        <span class="hljs-keyword">try</span>:
            <span class="hljs-keyword">if</span> IP <span class="hljs-keyword">in</span> packet:
                <span class="hljs-keyword">with</span> self.lock:
                    packet_info = {
                        <span class="hljs-string">'timestamp'</span>: datetime.now(),
                        <span class="hljs-string">'source'</span>: packet[IP].src,
                        <span class="hljs-string">'destination'</span>: packet[IP].dst,
                        <span class="hljs-string">'protocol'</span>: self.get_protocol_name(packet[IP].proto),
                        <span class="hljs-string">'size'</span>: len(packet),
                        <span class="hljs-string">'time_relative'</span>: (datetime.now() - self.start_time).total_seconds()
                    }

                    <span class="hljs-comment"># Add TCP-specific information</span>
                    <span class="hljs-keyword">if</span> TCP <span class="hljs-keyword">in</span> packet:
                        packet_info.update({
                            <span class="hljs-string">'src_port'</span>: packet[TCP].sport,
                            <span class="hljs-string">'dst_port'</span>: packet[TCP].dport,
                            <span class="hljs-string">'tcp_flags'</span>: packet[TCP].flags
                        })

                    <span class="hljs-comment"># Add UDP-specific information</span>
                    <span class="hljs-keyword">elif</span> UDP <span class="hljs-keyword">in</span> packet:
                        packet_info.update({
                            <span class="hljs-string">'src_port'</span>: packet[UDP].sport,
                            <span class="hljs-string">'dst_port'</span>: packet[UDP].dport
                        })

                    self.packet_data.append(packet_info)
                    self.packet_count += <span class="hljs-number">1</span>

                    <span class="hljs-comment"># Keep only last 10000 packets to prevent memory issues</span>
                    <span class="hljs-keyword">if</span> len(self.packet_data) &gt; <span class="hljs-number">10000</span>:
                        self.packet_data.pop(<span class="hljs-number">0</span>)

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            logger.error(<span class="hljs-string">f"Error processing packet: <span class="hljs-subst">{str(e)}</span>"</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_dataframe</span>(<span class="hljs-params">self</span>) -&gt; pd.DataFrame:</span>
        <span class="hljs-string">"""Convert packet data to pandas DataFrame"""</span>
        <span class="hljs-keyword">with</span> self.lock:
            <span class="hljs-keyword">return</span> pd.DataFrame(self.packet_data)
</code></pre>
<p>This class provides our core functionality, along with several utility functions for processing packets.</p>
<p>We group packets into three protocols: TCP and UDP, which operate at the transport layer, and ICMP, which operates at the network layer. If you are unfamiliar with the concepts of TCP/IP, I recommend checking out <a target="_blank" href="https://www.freecodecamp.org/news/what-is-tcp-ip-layers-and-protocols-explained/">this</a> article on freeCodeCamp News.</p>
<p>Our constructor defines the mapping from protocol numbers to these three protocol names. It also records the capture start time and initializes the list of captured packet data and the packet counter.</p>
<p>We’ll also use a thread lock to ensure that only one thread touches the shared packet data at a time. A background capture thread (which we’ll create later) appends to the list while the dashboard reads from it, and the lock keeps those two operations from interleaving. The same lock would also let you extend the project to process packets in parallel.</p>
<p>The <code>get_protocol_name</code> helper function helps us get the correct type of the protocol based on their protocol numbers. To give some background on this, the Internet Assigned Numbers Authority (IANA) assigns standardized numbers to identify different protocols in a network packet. As and when we see these numbers in the parsed network packet, we’ll know what kind of protocol is being used in the packet currently intercepted. For the scope of this project, we’ll be mapping to only TCP, UDP and ICMP (Ping). If we encounter any other type of packet, we’ll categorize it as <code>OTHER(&lt;protocol_num&gt;)</code>.</p>
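<p>As a quick illustration of these standardized numbers, Python’s <code>socket</code> module exposes them as constants. The sketch below is not part of the dashboard. The standalone <code>protocol_name</code> function is a hypothetical helper that mirrors the lookup-with-fallback idea of <code>get_protocol_name</code>.</p>
<pre><code class="lang-python">import socket

# The standard library exposes the same IANA protocol numbers as constants
protocol_map = {
    socket.IPPROTO_ICMP: 'ICMP',  # 1
    socket.IPPROTO_TCP: 'TCP',    # 6
    socket.IPPROTO_UDP: 'UDP',    # 17
}

def protocol_name(protocol_num):
    """Hypothetical helper: same lookup-with-fallback as get_protocol_name."""
    return protocol_map.get(protocol_num, f'OTHER({protocol_num})')

print(protocol_name(6))    # TCP
print(protocol_name(17))   # UDP
print(protocol_name(47))   # OTHER(47); 47 is GRE, which we don't map
</code></pre>
<p>Any protocol number that isn’t in the map falls through to the <code>OTHER(...)</code> label, so unexpected traffic still shows up in the charts instead of being dropped.</p>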
<p>The <code>process_packet</code> function handles our core functionality that will process these individual packets. If the packet contains an IP layer, it will take note of the source and destination IP addresses, protocol type, packet size, and time elapsed since the start of packet capturing.</p>
<p>For packets with specific transport layer protocols (like TCP and UDP), we will capture the source and destination ports along with TCP flags for TCP packets. These extracted details will be stored in memory in the <code>packet_data</code> list. We will also keep track of the <code>packet_count</code> as and when these packets are processed.</p>
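<p>The class caps <code>packet_data</code> at 10,000 entries by calling <code>pop(0)</code> on a list, which has to shift every remaining element. If that ever becomes a bottleneck, one alternative (not what the code above does) is a <code>collections.deque</code> with a <code>maxlen</code>. Here’s a minimal sketch, using a small hypothetical cap so the effect is easy to see:</p>
<pre><code class="lang-python">from collections import deque

MAX_PACKETS = 5  # small cap for the demo; the class above uses 10000

# A deque with maxlen drops the oldest entry automatically when full
recent_packets = deque(maxlen=MAX_PACKETS)

for packet_number in range(8):
    recent_packets.append({'id': packet_number, 'size': 60 + packet_number})

kept_ids = [p['id'] for p in recent_packets]
print(len(recent_packets))  # 5
print(kept_ids)             # [3, 4, 5, 6, 7]
</code></pre>
<p>If you adopt this approach, convert the deque with <code>list(...)</code> before building the DataFrame so that pandas receives a plain list of dictionaries.</p>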
<p>The <code>get_dataframe</code> function converts the <code>packet_data</code> list into a pandas DataFrame, which we’ll then use for our visualizations.</p>
<h2 id="heading-how-to-create-the-streamlit-visualizations">How to Create the Streamlit Visualizations</h2>
<p>Now it’s time for us to build our interactive Streamlit dashboard. We will define a function called <code>create_visualizations</code> in the <code>dashboard.py</code> script (outside of our packet processing class).</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_visualizations</span>(<span class="hljs-params">df: pd.DataFrame</span>):</span>
    <span class="hljs-string">"""Create all dashboard visualizations"""</span>
    <span class="hljs-keyword">if</span> len(df) &gt; <span class="hljs-number">0</span>:
        <span class="hljs-comment"># Protocol distribution</span>
        protocol_counts = df[<span class="hljs-string">'protocol'</span>].value_counts()
        fig_protocol = px.pie(
            values=protocol_counts.values,
            names=protocol_counts.index,
            title=<span class="hljs-string">"Protocol Distribution"</span>
        )
        st.plotly_chart(fig_protocol, use_container_width=<span class="hljs-literal">True</span>)

        <span class="hljs-comment"># Packets timeline</span>
        df[<span class="hljs-string">'timestamp'</span>] = pd.to_datetime(df[<span class="hljs-string">'timestamp'</span>])
        df_grouped = df.groupby(df[<span class="hljs-string">'timestamp'</span>].dt.floor(<span class="hljs-string">'S'</span>)).size()
        fig_timeline = px.line(
            x=df_grouped.index,
            y=df_grouped.values,
            title=<span class="hljs-string">"Packets per Second"</span>
        )
        st.plotly_chart(fig_timeline, use_container_width=<span class="hljs-literal">True</span>)

        <span class="hljs-comment"># Top source IPs</span>
        top_sources = df[<span class="hljs-string">'source'</span>].value_counts().head(<span class="hljs-number">10</span>)
        fig_sources = px.bar(
            x=top_sources.index,
            y=top_sources.values,
            title=<span class="hljs-string">"Top Source IP Addresses"</span>
        )
        st.plotly_chart(fig_sources, use_container_width=<span class="hljs-literal">True</span>)
</code></pre>
<p>This function will take the data frame as input and will help us plot three charts / graphs:</p>
<ol>
<li><p>Protocol Distribution Chart: This chart will display the proportion of different protocols (for example, TCP, UDP, ICMP) in the captured packet traffic.</p>
</li>
<li><p>Packets Timeline Chart: This chart will show the number of packets processed per second over a time period.</p>
</li>
<li><p>Top Source IP Addresses Chart: This chart will highlight the top 10 IP addresses that sent the most packets in the captured traffic.</p>
</li>
</ol>
<p>The protocol distribution chart is a pie chart of the packet counts for the three protocols (along with <code>OTHER</code>). We use <code>Streamlit</code> and <code>Plotly</code> to draw these charts. Because we recorded a timestamp for every captured packet, we can also plot how the number of captured packets changes over time.</p>
<p>For the second chart, we floor each timestamp to the second and do a <code>groupby</code> operation on the result to get the number of packets captured in each second (<code>S</code> stands for seconds). Then we plot those counts as a line graph. Recent pandas releases deprecate the uppercase alias in favor of lowercase <code>s</code>, so you may see a warning about it.</p>
<p>Finally, for the third chart, we count the packets sent by each source IP and then plot the 10 most frequent sources as a bar chart.</p>
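<p>If the pandas operations feel opaque, the same two aggregations can be sketched with only the standard library. The sample records below are made up for illustration and are shaped like the entries in <code>packet_data</code>. This is an equivalent of what <code>create_visualizations</code> computes, not a replacement for it.</p>
<pre><code class="lang-python">from collections import Counter
from datetime import datetime

# Hypothetical sample records shaped like the entries in packet_data
packets = [
    {'timestamp': datetime(2024, 12, 27, 10, 0, 0, 120000), 'source': '10.0.0.5'},
    {'timestamp': datetime(2024, 12, 27, 10, 0, 0, 870000), 'source': '10.0.0.5'},
    {'timestamp': datetime(2024, 12, 27, 10, 0, 1, 300000), 'source': '10.0.0.9'},
    {'timestamp': datetime(2024, 12, 27, 10, 0, 1, 450000), 'source': '10.0.0.5'},
    {'timestamp': datetime(2024, 12, 27, 10, 0, 1, 990000), 'source': '10.0.0.7'},
]

# Equivalent of flooring to the second: drop the microseconds
per_second = Counter(p['timestamp'].replace(microsecond=0) for p in packets)

# Equivalent of value_counts().head(10): packets per source, busiest first
top_sources = Counter(p['source'] for p in packets).most_common(10)

for second, count in sorted(per_second.items()):
    print(second.time(), count)
print(top_sources[0])  # ('10.0.0.5', 3)
</code></pre>
<p>The first second contains two packets and the next one contains three, which is exactly the series the “Packets per Second” line chart would draw for this data.</p>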
<h2 id="heading-how-to-capture-the-network-packets">How to Capture the Network Packets</h2>
<p>Now, let’s build the functionality to allow us to capture network packet data.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">start_packet_capture</span>():</span>
    <span class="hljs-string">"""Start packet capture in a separate thread"""</span>
    processor = PacketProcessor()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">capture_packets</span>():</span>
        sniff(prn=processor.process_packet, store=<span class="hljs-literal">False</span>)

    capture_thread = threading.Thread(target=capture_packets, daemon=<span class="hljs-literal">True</span>)
    capture_thread.start()

    <span class="hljs-keyword">return</span> processor
</code></pre>
<p>This is a simple function that instantiates the <code>PacketProcessor</code> class and then uses the <code>sniff</code> function in the <code>scapy</code> module to start capturing the packets.</p>
<p>We use threading here to allow us to capture packets independently from the main program flow. This ensures that the packet capturing operation does not block other operations like updating the dashboard in real-time. We also return the created <code>PacketProcessor</code> instance so that it can be used in our main program.</p>
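<p>To make the threading pattern concrete without needing <code>scapy</code> or root access, here’s a minimal sketch under some assumptions: <code>SharedBuffer</code> is a hypothetical stand-in for <code>PacketProcessor</code>, and <code>fake_capture</code> simulates <code>sniff</code> invoking a callback once per packet.</p>
<pre><code class="lang-python">import threading

class SharedBuffer:
    """Hypothetical stand-in for PacketProcessor: a list guarded by a lock."""

    def __init__(self):
        self.items = []
        self.lock = threading.Lock()

    def add(self, item):
        with self.lock:
            self.items.append(item)

    def snapshot(self):
        # Copy under the lock so the reader never sees a half-updated list
        with self.lock:
            return list(self.items)

def fake_capture(buffer, count):
    """Simulates sniff() calling a callback once per packet."""
    for packet_number in range(count):
        buffer.add(packet_number)

buffer = SharedBuffer()
capture_thread = threading.Thread(target=fake_capture, args=(buffer, 1000), daemon=True)
capture_thread.start()

partial = buffer.snapshot()  # the main thread can read while capture is running
capture_thread.join()        # only possible here because the fake capture ends
final = buffer.snapshot()

print(len(final))  # 1000
</code></pre>
<p>In the dashboard, the capture thread has no natural end, so we never join it. Marking it as a daemon thread lets the program exit without waiting for it.</p>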
<h2 id="heading-putting-everything-together">Putting Everything Together</h2>
<p>Now let’s stitch all these pieces together with our <code>main</code> function that will act as the driver function for our program.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    <span class="hljs-string">"""Main function to run the dashboard"""</span>
    st.set_page_config(page_title=<span class="hljs-string">"Network Traffic Analysis"</span>, layout=<span class="hljs-string">"wide"</span>)
    st.title(<span class="hljs-string">"Real-time Network Traffic Analysis"</span>)

    <span class="hljs-comment"># Initialize packet processor in session state</span>
    <span class="hljs-keyword">if</span> <span class="hljs-string">'processor'</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> st.session_state:
        st.session_state.processor = start_packet_capture()
        st.session_state.start_time = time.time()

    <span class="hljs-comment"># Create dashboard layout</span>
    col1, col2 = st.columns(<span class="hljs-number">2</span>)

    <span class="hljs-comment"># Get current data</span>
    df = st.session_state.processor.get_dataframe()

    <span class="hljs-comment"># Display metrics</span>
    <span class="hljs-keyword">with</span> col1:
        st.metric(<span class="hljs-string">"Total Packets"</span>, len(df))
    <span class="hljs-keyword">with</span> col2:
        duration = time.time() - st.session_state.start_time
        st.metric(<span class="hljs-string">"Capture Duration"</span>, <span class="hljs-string">f"<span class="hljs-subst">{duration:<span class="hljs-number">.2</span>f}</span>s"</span>)

    <span class="hljs-comment"># Display visualizations</span>
    create_visualizations(df)

    <span class="hljs-comment"># Display recent packets</span>
    st.subheader(<span class="hljs-string">"Recent Packets"</span>)
    <span class="hljs-keyword">if</span> len(df) &gt; <span class="hljs-number">0</span>:
        st.dataframe(
            df.tail(<span class="hljs-number">10</span>)[[<span class="hljs-string">'timestamp'</span>, <span class="hljs-string">'source'</span>, <span class="hljs-string">'destination'</span>, <span class="hljs-string">'protocol'</span>, <span class="hljs-string">'size'</span>]],
            use_container_width=<span class="hljs-literal">True</span>
        )

    <span class="hljs-comment"># Add refresh button</span>
    <span class="hljs-keyword">if</span> st.button(<span class="hljs-string">'Refresh Data'</span>):
        st.rerun()

    <span class="hljs-comment"># Auto refresh</span>
    time.sleep(<span class="hljs-number">2</span>)
    st.rerun()
</code></pre>
<p>This function will also instantiate the <code>Streamlit</code> dashboard, and integrate all of our components together. We first set the page title of our <code>Streamlit</code> dashboard and then initialize our <code>PacketProcessor</code>. We use the session state in <code>Streamlit</code> to ensure that only one instance of packet capturing is created and the state of it is retained.</p>
<p>On every rerun of the script, we fetch the latest dataframe from the processor stored in the session state and display the metrics and the visualizations. We also display the 10 most recently captured packets along with information like the timestamp, source and destination IPs, protocol, and size of the packet. The user can refresh the data manually with a button, and the dashboard also refreshes itself automatically every two seconds.</p>
<p>Let’s finally run the program with the following command:</p>
<pre><code class="lang-bash">sudo streamlit run dashboard.py
</code></pre>
<p>Note that you will have to run the program with <code>sudo</code> since the packet capturing capabilities require administrative privileges. If you are on Windows, open your terminal as Administrator and then run the program without the <code>sudo</code> prefix.</p>
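<p>If you’d like the script to warn you early instead of failing inside the capture thread, a best-effort check could look like the sketch below. The <code>has_capture_privileges</code> helper is hypothetical and not part of the dashboard code. It only checks for root on Unix-like systems, so treat it as a hint rather than a guarantee.</p>
<pre><code class="lang-python">import os

def has_capture_privileges():
    """Best-effort check for root on Unix-like systems.

    os.geteuid() does not exist on Windows, so we return True there and let
    the capture attempt itself report any permission problem.
    """
    if hasattr(os, "geteuid"):
        return os.geteuid() == 0
    return True

if not has_capture_privileges():
    print("Warning: not running as root; packet capture will likely fail.")
</code></pre>
<p>You could call this helper at the top of <code>main</code> and surface the warning in the dashboard before starting the capture.</p>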
<p>Give it a moment for the program to start capturing packets. If everything goes right, you should see something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735279281523/34802db4-7982-4c0f-a591-c2d5ca1e1f08.png" alt="A network traffic analysis dashboard shows a pie chart with protocol distribution: TCP (48.7%), UDP (47.5%), and ICMP (3.8%). Below is a line graph displaying packets per second over time with several noticeable peaks. Total packets are 6743, and capture duration is 118.63 seconds." class="image--center mx-auto" width="2556" height="1242" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735279285726/246a5af6-2d15-49fa-9132-8103be79ce3a.png" alt="A dark-themed dashboard showing a bar chart of top source IP addresses and a table of recent packets with details like timestamp, source, destination, protocol, and size." class="image--center mx-auto" width="2551" height="1108" loading="lazy"></p>
<p>These are all the visualizations that we just implemented in our <code>Streamlit</code> dashboard program.</p>
<h2 id="heading-future-enhancements">Future Enhancements</h2>
<p>With that, here are some future enhancement ideas that you can use to extend the functionalities of the dashboard:</p>
<ol>
<li><p>Add machine learning capabilities for anomaly detection</p>
</li>
<li><p>Implement geographical IP mapping</p>
</li>
<li><p>Create custom alerts based on traffic analysis patterns</p>
</li>
<li><p>Add packet payload analysis options</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Congratulations! You have now successfully built a real-time network traffic analysis dashboard with Python and <code>Streamlit</code>. This program will provide valuable insights into network behavior and can be extended for various use cases, from security monitoring to network optimization.</p>
<p>With that, I hope you learnt some basics about network traffic analysis as well as a bit of Python programming. Thanks for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Honeypot in Python: A Practical Guide to Security Deception ]]>
                </title>
                <description>
                    <![CDATA[ In cybersecurity, a honeypot is a decoy system that’s designed to attract and then detect potential attackers attempting to compromise the system. Just like a pot of honey sitting out in the open would attract flies. Think of these honeypots as secur... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-honeypot-with-python/</link>
                <guid isPermaLink="false">676450c555ae44f950ee9c1f</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ software development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #cybersecurity ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Chaitanya Rahalkar ]]>
                </dc:creator>
                <pubDate>Thu, 19 Dec 2024 16:58:45 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734581440876/9b4a1d00-6185-4666-94cc-97131eed03fa.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In cybersecurity, a honeypot is a decoy system that’s designed to attract and then detect potential attackers attempting to compromise the system, just like a pot of honey sitting out in the open would attract flies.</p>
<p>Think of these honeypots as security cameras for your system. Just as a security camera helps us understand who's trying to break into a building and how they're doing it, these honeypots will help you understand who's trying to attack your system and what techniques they're using.</p>
<p>By the end of this tutorial, you'll be able to write a demo honeypot in Python and understand how honeypots work.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-understanding-the-types-of-honeypots">Understanding the Types of Honeypots</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-your-development-environment">How to Set Up Your Development Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-the-core-honeypot">How to Build the Core Honeypot</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-implement-the-network-listeners">Implement the Network Listeners</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-run-the-honeypot">Run the Honeypot</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-write-the-honeypot-attack-simulator">Write the Honeypot Attack Simulator</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-analyze-honeypot-data">How to Analyze Honeypot Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-security-considerations">Security Considerations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-understanding-the-types-of-honeypots">Understanding the Types of Honeypots</h2>
<p>Before we start designing our own honeypot, let’s quickly understand their different types:</p>
<ol>
<li><p>Production Honeypots: These types of honeypots are placed in an actual production environment and are used to detect actual security attacks. They are typically simple in design, easy to maintain and deploy, and offer limited interaction to reduce risk.</p>
</li>
<li><p>Research Honeypots: These are more complex systems set up by security researchers to study attack patterns, perform empirical analysis on these patterns, collect malware samples, and understand new attack techniques that haven’t been discovered previously. They often emulate entire operating systems or networks rather than behaving like an application in the production environment.</p>
</li>
</ol>
<p>For this tutorial, we will be building a medium-interaction honeypot: one that emulates service banners and simple responses, and logs connection attempts and basic attacker behavior.</p>
<h2 id="heading-how-to-set-up-your-development-environment">How to Set Up Your Development Environment</h2>
<p>Let’s begin by setting up your development environment in Python. Start your script with the following imports and setup code:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> socket
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> threading
<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path

<span class="hljs-comment"># Configure logging directory</span>
LOG_DIR = Path(<span class="hljs-string">"honeypot_logs"</span>)
LOG_DIR.mkdir(exist_ok=<span class="hljs-literal">True</span>)
</code></pre>
<p>We’ll stick to Python’s built-in libraries, so you won’t need to install any external dependencies. We will be storing our logs in the <code>honeypot_logs</code> directory.</p>
<h2 id="heading-how-to-build-the-core-honeypot">How to Build the Core Honeypot</h2>
<p>Our basic honeypot consists of three components:</p>
<ol>
<li><p>A network listener that accepts connections</p>
</li>
<li><p>A logging system to record activities</p>
</li>
<li><p>A basic emulation service to interact with attackers</p>
</li>
</ol>
<p>Now let’s begin by initializing the core Honeypot class:</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Honeypot</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, bind_ip=<span class="hljs-string">"0.0.0.0"</span>, ports=None</span>):</span>
        self.bind_ip = bind_ip
        self.ports = ports <span class="hljs-keyword">or</span> [<span class="hljs-number">21</span>, <span class="hljs-number">22</span>, <span class="hljs-number">80</span>, <span class="hljs-number">443</span>]  <span class="hljs-comment"># Default ports to monitor</span>
        self.active_connections = {}
        self.log_file = LOG_DIR / <span class="hljs-string">f"honeypot_<span class="hljs-subst">{datetime.datetime.now().strftime(<span class="hljs-string">'%Y%m%d'</span>)}</span>.json"</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">log_activity</span>(<span class="hljs-params">self, port, remote_ip, data</span>):</span>
        <span class="hljs-string">"""Log suspicious activity with timestamp and details"""</span>
        activity = {
            <span class="hljs-string">"timestamp"</span>: datetime.datetime.now().isoformat(),
            <span class="hljs-string">"remote_ip"</span>: remote_ip,
            <span class="hljs-string">"port"</span>: port,
            <span class="hljs-string">"data"</span>: data.decode(<span class="hljs-string">'utf-8'</span>, errors=<span class="hljs-string">'ignore'</span>)
        }

        <span class="hljs-keyword">with</span> open(self.log_file, <span class="hljs-string">'a'</span>) <span class="hljs-keyword">as</span> f:
            json.dump(activity, f)
            f.write(<span class="hljs-string">'\n'</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handle_connection</span>(<span class="hljs-params">self, client_socket, remote_ip, port</span>):</span>
        <span class="hljs-string">"""Handle individual connections and emulate services"""</span>
        service_banners = {
            <span class="hljs-number">21</span>: <span class="hljs-string">"220 FTP server ready\r\n"</span>,
            <span class="hljs-number">22</span>: <span class="hljs-string">"SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.1\r\n"</span>,
            <span class="hljs-number">80</span>: <span class="hljs-string">"HTTP/1.1 200 OK\r\nServer: Apache/2.4.41 (Ubuntu)\r\n\r\n"</span>,
            <span class="hljs-number">443</span>: <span class="hljs-string">"HTTP/1.1 200 OK\r\nServer: Apache/2.4.41 (Ubuntu)\r\n\r\n"</span>
        }

        <span class="hljs-keyword">try</span>:
            <span class="hljs-comment"># Send appropriate banner for the service</span>
            <span class="hljs-keyword">if</span> port <span class="hljs-keyword">in</span> service_banners:
                client_socket.send(service_banners[port].encode())

            <span class="hljs-comment"># Receive data from attacker</span>
            <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
                data = client_socket.recv(<span class="hljs-number">1024</span>)
                <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> data:
                    <span class="hljs-keyword">break</span>

                self.log_activity(port, remote_ip, data)

                <span class="hljs-comment"># Send fake response</span>
                client_socket.send(<span class="hljs-string">b"Command not recognized.\r\n"</span>)

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">f"Error handling connection: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">finally</span>:
            client_socket.close()
</code></pre>
<p>This class has a lot of important information in it, so let’s go over each function one by one.</p>
<p>The <code>__init__</code> function records the IP address and port numbers on which we’ll host the honeypot, as well as the path and filename of the log file. It also creates an <code>active_connections</code> dictionary that can be used to keep track of open connections to the honeypot.</p>
<p>The <code>log_activity</code> function is going to receive the information about the IP, the data, and the port to which the IP attempted a connection. Then we’ll append this information to our JSON-formatted log file.</p>
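<p>Because <code>log_activity</code> appends one JSON object per line, the log is a sequence of independent JSON records rather than a single JSON document. The sketch below shows that round trip under some assumptions: the entries and the <code>honeypot_demo.json</code> filename are made up, and a temporary directory is used so nothing touches your real logs.</p>
<pre><code class="lang-python">import json
import tempfile
from pathlib import Path

# Hypothetical entries shaped like the ones log_activity writes
entries = [
    {"timestamp": "2024-12-19T16:00:00", "remote_ip": "203.0.113.7", "port": 22, "data": "root\n"},
    {"timestamp": "2024-12-19T16:00:02", "remote_ip": "203.0.113.7", "port": 21, "data": "USER admin\r\n"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    log_file = Path(tmp_dir) / "honeypot_demo.json"

    # Append one JSON object per line, as log_activity does
    for entry in entries:
        with open(log_file, 'a') as f:
            json.dump(entry, f)
            f.write('\n')

    # Each line parses on its own, so the file can be read incrementally
    with open(log_file) as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]["remote_ip"])
</code></pre>
<p>Note that calling <code>json.load</code> on the whole file would fail once it holds more than one record, so read it line by line as shown here.</p>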
<p>The <code>handle_connection</code> function is going to mimic these services that will be running on the different ports we have. We will have the honeypot running on ports 21, 22, 80 and 443. These services are for FTP, SSH, HTTP and the HTTPS protocol, respectively. So any attacker attempting to interact with the honeypot should expect these services on these ports.</p>
<p>To mimic the behavior of these services, we’ll use the service banners that they use in reality. This function will first send the appropriate banner when the attacker connects, and then receive the data and log it. The honeypot will also send a fake response “<em>Command not recognized</em>” back to the attacker.</p>
<h3 id="heading-implement-the-network-listeners">Implement the Network Listeners</h3>
<p>Now let’s implement the network listeners that will be handling the incoming connections. For this, we’ll be using simple socket programming. If you aren’t aware of how socket programming works, <a target="_blank" href="https://www.freecodecamp.org/news/socket-programming-in-python">check out this article</a> that explains some concepts related to it.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">start_listener</span>(<span class="hljs-params">self, port</span>):</span>
    <span class="hljs-string">"""Start a listener on specified port"""</span>
    <span class="hljs-keyword">try</span>:
        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        server.bind((self.bind_ip, port))
        server.listen(<span class="hljs-number">5</span>)

        print(<span class="hljs-string">f"[*] Listening on <span class="hljs-subst">{self.bind_ip}</span>:<span class="hljs-subst">{port}</span>"</span>)

        <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
            client, addr = server.accept()
            print(<span class="hljs-string">f"[*] Accepted connection from <span class="hljs-subst">{addr[<span class="hljs-number">0</span>]}</span>:<span class="hljs-subst">{addr[<span class="hljs-number">1</span>]}</span>"</span>)

            <span class="hljs-comment"># Handle connection in separate thread</span>
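            client_handler = threading.Thread(
                target=self.handle_connection,
                args=(client, addr[<span class="hljs-number">0</span>], port),
                daemon=<span class="hljs-literal">True</span>  <span class="hljs-comment"># don't let open connections block shutdown</span>
            )
            client_handler.start()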

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Error starting listener on port <span class="hljs-subst">{port}</span>: <span class="hljs-subst">{e}</span>"</span>)
</code></pre>
<p>The <code>start_listener</code> function will start the server and listen on the provided port. The <code>bind_ip</code> for us is going to be <code>0.0.0.0</code> which indicates that the server will be listening on all network interfaces.</p>
<p>Now, we will handle each new connection in a separate thread, since there could be instances where multiple attackers attempt to interact with the honeypot or an attacking script or tool is scanning the honeypot. If you aren’t aware of how threading works, you can <a target="_blank" href="https://www.freecodecamp.org/news/concurrency-in-python/">check out this article</a> that explains threading and concurrency in Python.</p>
<p>Also, make sure to put this function in the core <code>Honeypot</code> class.</p>
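<p>If you want to try the listener-and-banner idea in isolation first, here’s a minimal, self-contained sketch. It is not the honeypot itself: it binds to <code>127.0.0.1</code> on port <code>0</code> (which asks the operating system for any free port, so no special privileges are needed), serves a single connection, and then plays the part of the attacker by connecting and reading the banner. The <code>serve_one_client</code> helper is a hypothetical name used only here.</p>
<pre><code class="lang-python">import socket
import threading

BANNER = b"220 FTP server ready\r\n"

# Port 0 asks the OS for any free port, so no root privileges are needed
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def serve_one_client():
    """Accept a single connection, send the banner, then close it."""
    client, _addr = server.accept()
    with client:
        client.sendall(BANNER)

listener = threading.Thread(target=serve_one_client)
listener.start()

# Play the part of the attacker: connect and read until the server closes
chunks = []
with socket.create_connection(("127.0.0.1", port), timeout=5) as conn:
    while True:
        chunk = conn.recv(1024)
        if not chunk:
            break
        chunks.append(chunk)
received = b"".join(chunks)

listener.join()
server.close()
print(received)
</code></pre>
<p>The client sees exactly the bytes a scanner would record when it fingerprints the service, which is why realistic banners matter for a honeypot.</p>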
<h3 id="heading-run-the-honeypot">Run the Honeypot</h3>
<p>Now let’s create the <code>main</code> function that will start our honeypot.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    honeypot = Honeypot()

    <span class="hljs-comment"># Start listeners for each port in separate threads</span>
    <span class="hljs-keyword">for</span> port <span class="hljs-keyword">in</span> honeypot.ports:
        listener_thread = threading.Thread(
            target=honeypot.start_listener,
            args=(port,)
        )
        listener_thread.daemon = <span class="hljs-literal">True</span>
        listener_thread.start()

    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># Keep main thread alive</span>
        <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
            time.sleep(<span class="hljs-number">1</span>)
    <span class="hljs-keyword">except</span> KeyboardInterrupt:
        print(<span class="hljs-string">"\n[*] Shutting down honeypot..."</span>)
        sys.exit(<span class="hljs-number">0</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<p>This function instantiates the <code>Honeypot</code> class and starts a listener for each of our defined ports (21, 22, 80, 443) in a separate thread. Because the listener threads are daemon threads, they don’t keep the program running on their own, so we keep the main thread alive in an infinite loop until you press Ctrl+C. Put this all together in a script and run it.</p>
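<p>As a variation on the keep-alive loop, here is a minimal sketch that uses a <code>threading.Event</code> to tell a background thread when to stop. The names <code>worker</code>, <code>stop_event</code>, and <code>run_demo</code> are hypothetical and are not part of the honeypot code. The worker is only a stand-in for a listener loop.</p>

```python
import threading

def worker(stop_event, results):
    # Stand-in for a listener loop: keep running until asked to stop.
    while not stop_event.is_set():
        # wait() sleeps for up to 0.05s but returns early once the event is set
        stop_event.wait(0.05)
    results.append("stopped")

def run_demo():
    stop_event = threading.Event()
    results = []
    t = threading.Thread(target=worker, args=(stop_event, results), daemon=True)
    t.start()
    # In the honeypot, this call would go in the KeyboardInterrupt handler
    stop_event.set()
    t.join(timeout=2)
    return results, t.is_alive()
```

<p>With this pattern, the main thread could call <code>stop_event.wait()</code> instead of sleeping in a loop, and worker threads get a chance to clean up before the program exits.</p>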
<h3 id="heading-write-the-honeypot-attack-simulator">Write the Honeypot Attack Simulator</h3>
<p>Now let’s try to simulate some attack scenarios and target our honeypot so that we can collect some data in our JSON log file.</p>
<p>This simulator will help us demonstrate a few important aspects about honeypots:</p>
<ol>
<li><p>Realistic attack patterns: The simulator will simulate common attack patterns like port scanning, brute force attempts, and service-specific exploits.</p>
</li>
<li><p>Variable intensity: The simulator will adjust the intensity of the simulation to test how your honeypot handles different loads.</p>
</li>
<li><p>Several attack types: It will demonstrate different types of attacks that real attackers might attempt, helping you understand how your honeypot responds to each.</p>
</li>
<li><p>Concurrent connections: The simulator will use threading to test how your honeypot handles multiple simultaneous connections.</p>
</li>
</ol>
<pre><code class="lang-python"><span class="hljs-comment"># honeypot_simulator.py</span>

<span class="hljs-keyword">import</span> socket
<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">import</span> random
<span class="hljs-keyword">import</span> threading
<span class="hljs-keyword">from</span> concurrent.futures <span class="hljs-keyword">import</span> ThreadPoolExecutor
<span class="hljs-keyword">import</span> argparse

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">HoneypotSimulator</span>:</span>
    <span class="hljs-string">"""
    A class to simulate different types of connections and attacks against our honeypot.
    This helps in testing the honeypot's logging and response capabilities.
    """</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, target_ip=<span class="hljs-string">"127.0.0.1"</span>, intensity=<span class="hljs-string">"medium"</span></span>):</span>
        <span class="hljs-comment"># Configuration for the simulator</span>
        self.target_ip = target_ip
        self.intensity = intensity

        <span class="hljs-comment"># Common ports that attackers often probe</span>
        self.target_ports = [<span class="hljs-number">21</span>, <span class="hljs-number">22</span>, <span class="hljs-number">23</span>, <span class="hljs-number">25</span>, <span class="hljs-number">80</span>, <span class="hljs-number">443</span>, <span class="hljs-number">3306</span>, <span class="hljs-number">5432</span>]

        <span class="hljs-comment"># Dictionary of common commands used by attackers for different services</span>
        self.attack_patterns = {
            <span class="hljs-number">21</span>: [  <span class="hljs-comment"># FTP commands</span>
                <span class="hljs-string">"USER admin\r\n"</span>,
                <span class="hljs-string">"PASS admin123\r\n"</span>,
                <span class="hljs-string">"LIST\r\n"</span>,
                <span class="hljs-string">"STOR malware.exe\r\n"</span>
            ],
            <span class="hljs-number">22</span>: [  <span class="hljs-comment"># SSH attempts</span>
                <span class="hljs-string">"SSH-2.0-OpenSSH_7.9\r\n"</span>,
                <span class="hljs-string">"admin:password123\n"</span>,
                <span class="hljs-string">"root:toor\n"</span>
            ],
            <span class="hljs-number">80</span>: [  <span class="hljs-comment"># HTTP requests</span>
                <span class="hljs-string">"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"</span>,
                <span class="hljs-string">"POST /admin HTTP/1.1\r\nHost: localhost\r\nContent-Length: 0\r\n\r\n"</span>,
                <span class="hljs-string">"GET /wp-admin HTTP/1.1\r\nHost: localhost\r\n\r\n"</span>
            ]
        }

        <span class="hljs-comment"># Intensity settings affect the frequency and volume of simulated attacks</span>
        self.intensity_settings = {
            <span class="hljs-string">"low"</span>: {<span class="hljs-string">"max_threads"</span>: <span class="hljs-number">2</span>, <span class="hljs-string">"delay_range"</span>: (<span class="hljs-number">1</span>, <span class="hljs-number">3</span>)},
            <span class="hljs-string">"medium"</span>: {<span class="hljs-string">"max_threads"</span>: <span class="hljs-number">5</span>, <span class="hljs-string">"delay_range"</span>: (<span class="hljs-number">0.5</span>, <span class="hljs-number">1.5</span>)},
            <span class="hljs-string">"high"</span>: {<span class="hljs-string">"max_threads"</span>: <span class="hljs-number">10</span>, <span class="hljs-string">"delay_range"</span>: (<span class="hljs-number">0.1</span>, <span class="hljs-number">0.5</span>)}
        }

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">simulate_connection</span>(<span class="hljs-params">self, port</span>):</span>
        <span class="hljs-string">"""
        Simulates a connection attempt to a specific port with realistic attack patterns
        """</span>
        <span class="hljs-keyword">try</span>:
            <span class="hljs-comment"># Create a new socket connection</span>
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.settimeout(<span class="hljs-number">3</span>)

            print(<span class="hljs-string">f"[*] Attempting connection to <span class="hljs-subst">{self.target_ip}</span>:<span class="hljs-subst">{port}</span>"</span>)
            sock.connect((self.target_ip, port))

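            <span class="hljs-comment"># Get banner if any (not every service sends one before the client speaks)</span>
            <span class="hljs-keyword">try</span>:
                banner = sock.recv(<span class="hljs-number">1024</span>)
                print(<span class="hljs-string">f"[+] Received banner from port <span class="hljs-subst">{port}</span>: <span class="hljs-subst">{banner.decode(<span class="hljs-string">'utf-8'</span>, <span class="hljs-string">'ignore'</span>).strip()}</span>"</span>)
            <span class="hljs-keyword">except</span> socket.timeout:
                print(<span class="hljs-string">f"[-] No banner received from port <span class="hljs-subst">{port}</span>"</span>)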

            <span class="hljs-comment"># Send attack patterns based on the port</span>
            <span class="hljs-keyword">if</span> port <span class="hljs-keyword">in</span> self.attack_patterns:
                <span class="hljs-keyword">for</span> command <span class="hljs-keyword">in</span> self.attack_patterns[port]:
                    print(<span class="hljs-string">f"[*] Sending command to port <span class="hljs-subst">{port}</span>: <span class="hljs-subst">{command.strip()}</span>"</span>)
                    sock.send(command.encode())

                    <span class="hljs-comment"># Wait for response</span>
                    <span class="hljs-keyword">try</span>:
                        response = sock.recv(<span class="hljs-number">1024</span>)
                        print(<span class="hljs-string">f"[+] Received response: <span class="hljs-subst">{response.decode(<span class="hljs-string">'utf-8'</span>, <span class="hljs-string">'ignore'</span>).strip()}</span>"</span>)
                    <span class="hljs-keyword">except</span> socket.timeout:
                        print(<span class="hljs-string">f"[-] No response received from port <span class="hljs-subst">{port}</span>"</span>)

                    <span class="hljs-comment"># Add realistic delay between commands</span>
                    time.sleep(random.uniform(*self.intensity_settings[self.intensity][<span class="hljs-string">"delay_range"</span>]))

            sock.close()

        <span class="hljs-keyword">except</span> ConnectionRefusedError:
            print(<span class="hljs-string">f"[-] Connection refused on port <span class="hljs-subst">{port}</span>"</span>)
        <span class="hljs-keyword">except</span> socket.timeout:
            print(<span class="hljs-string">f"[-] Connection timeout on port <span class="hljs-subst">{port}</span>"</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">f"[-] Error connecting to port <span class="hljs-subst">{port}</span>: <span class="hljs-subst">{e}</span>"</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">simulate_port_scan</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-string">"""
        Simulates a basic port scan across common ports
        """</span>
        print(<span class="hljs-string">f"\n[*] Starting port scan simulation against <span class="hljs-subst">{self.target_ip}</span>"</span>)
        <span class="hljs-keyword">for</span> port <span class="hljs-keyword">in</span> self.target_ports:
            self.simulate_connection(port)
            time.sleep(random.uniform(<span class="hljs-number">0.1</span>, <span class="hljs-number">0.3</span>))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">simulate_brute_force</span>(<span class="hljs-params">self, port</span>):</span>
        <span class="hljs-string">"""
        Simulates a brute force attack against a specific service
        """</span>
        common_usernames = [<span class="hljs-string">"admin"</span>, <span class="hljs-string">"root"</span>, <span class="hljs-string">"user"</span>, <span class="hljs-string">"test"</span>]
        common_passwords = [<span class="hljs-string">"password123"</span>, <span class="hljs-string">"admin123"</span>, <span class="hljs-string">"123456"</span>, <span class="hljs-string">"root"</span>]

        print(<span class="hljs-string">f"\n[*] Starting brute force simulation against port <span class="hljs-subst">{port}</span>"</span>)

        <span class="hljs-keyword">for</span> username <span class="hljs-keyword">in</span> common_usernames:
            <span class="hljs-keyword">for</span> password <span class="hljs-keyword">in</span> common_passwords:
                <span class="hljs-keyword">try</span>:
                    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                    sock.settimeout(<span class="hljs-number">2</span>)
                    sock.connect((self.target_ip, port))

                    <span class="hljs-keyword">if</span> port == <span class="hljs-number">21</span>:  <span class="hljs-comment"># FTP</span>
                        sock.send(<span class="hljs-string">f"USER <span class="hljs-subst">{username}</span>\r\n"</span>.encode())
                        sock.recv(<span class="hljs-number">1024</span>)
                        sock.send(<span class="hljs-string">f"PASS <span class="hljs-subst">{password}</span>\r\n"</span>.encode())
                    <span class="hljs-keyword">elif</span> port == <span class="hljs-number">22</span>:  <span class="hljs-comment"># SSH</span>
                        sock.send(<span class="hljs-string">f"<span class="hljs-subst">{username}</span>:<span class="hljs-subst">{password}</span>\n"</span>.encode())

                    sock.close()
                    time.sleep(random.uniform(<span class="hljs-number">0.1</span>, <span class="hljs-number">0.3</span>))

                <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
                    print(<span class="hljs-string">f"[-] Error in brute force attempt: <span class="hljs-subst">{e}</span>"</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_continuous_simulation</span>(<span class="hljs-params">self, duration=<span class="hljs-number">300</span></span>):</span>
        <span class="hljs-string">"""
        Runs a continuous simulation for a specified duration
        """</span>
        print(<span class="hljs-string">f"\n[*] Starting continuous simulation for <span class="hljs-subst">{duration}</span> seconds"</span>)
        print(<span class="hljs-string">f"[*] Intensity level: <span class="hljs-subst">{self.intensity}</span>"</span>)

        end_time = time.time() + duration

        <span class="hljs-keyword">with</span> ThreadPoolExecutor(
            max_workers=self.intensity_settings[self.intensity][<span class="hljs-string">"max_threads"</span>]
        ) <span class="hljs-keyword">as</span> executor:
            <span class="hljs-keyword">while</span> time.time() &lt; end_time:
                <span class="hljs-comment"># Mix of different attack patterns</span>
                simulation_choices = [
                    <span class="hljs-keyword">lambda</span>: self.simulate_port_scan(),
                    <span class="hljs-keyword">lambda</span>: self.simulate_brute_force(<span class="hljs-number">21</span>),
                    <span class="hljs-keyword">lambda</span>: self.simulate_brute_force(<span class="hljs-number">22</span>),
                    <span class="hljs-keyword">lambda</span>: self.simulate_connection(<span class="hljs-number">80</span>)
                ]

                <span class="hljs-comment"># Randomly choose and execute an attack pattern</span>
                executor.submit(random.choice(simulation_choices))
                time.sleep(random.uniform(*self.intensity_settings[self.intensity][<span class="hljs-string">"delay_range"</span>]))

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    <span class="hljs-string">"""
    Main function to run the honeypot simulator with command-line arguments
    """</span>
    parser = argparse.ArgumentParser(description=<span class="hljs-string">"Honeypot Attack Simulator"</span>)
    parser.add_argument(<span class="hljs-string">"--target"</span>, default=<span class="hljs-string">"127.0.0.1"</span>, help=<span class="hljs-string">"Target IP address"</span>)
    parser.add_argument(
        <span class="hljs-string">"--intensity"</span>,
        choices=[<span class="hljs-string">"low"</span>, <span class="hljs-string">"medium"</span>, <span class="hljs-string">"high"</span>],
        default=<span class="hljs-string">"medium"</span>,
        help=<span class="hljs-string">"Simulation intensity level"</span>
    )
    parser.add_argument(
        <span class="hljs-string">"--duration"</span>,
        type=int,
        default=<span class="hljs-number">300</span>,
        help=<span class="hljs-string">"Simulation duration in seconds"</span>
    )

    args = parser.parse_args()

    simulator = HoneypotSimulator(args.target, args.intensity)

    <span class="hljs-keyword">try</span>:
        simulator.run_continuous_simulation(args.duration)
    <span class="hljs-keyword">except</span> KeyboardInterrupt:
        print(<span class="hljs-string">"\n[*] Simulation interrupted by user"</span>)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"[-] Simulation error: <span class="hljs-subst">{e}</span>"</span>)
    <span class="hljs-keyword">finally</span>:
        print(<span class="hljs-string">"\n[*] Simulation complete"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<p>We have a lot going on in this simulation script, so let’s break it down piece by piece. I’ve also added comments for every function and operation to make the code a bit more readable.</p>
<p>We first have our utility class called <code>HoneypotSimulator</code>. In this class, the <code>__init__</code> function sets up the basic configuration for our simulator. It takes two parameters: a target IP address (defaulting to <code>127.0.0.1</code>, that is, localhost) and an intensity level (defaulting to "medium").</p>
<p>We also define three important components: the target ports to probe (common services like FTP, SSH, HTTP), attack patterns specific to each service (like login attempts and commands), and intensity settings that control how aggressive our simulation will be through thread counts and timing delays.</p>
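<p>To make the intensity settings concrete, here is a small sketch of how a delay is drawn from the configured range. The dictionary is copied from the simulator, while <code>pick_delay</code> is a hypothetical helper added only for illustration.</p>

```python
import random

# Copied from the simulator's intensity_settings
INTENSITY_SETTINGS = {
    "low": {"max_threads": 2, "delay_range": (1, 3)},
    "medium": {"max_threads": 5, "delay_range": (0.5, 1.5)},
    "high": {"max_threads": 10, "delay_range": (0.1, 0.5)},
}

def pick_delay(intensity):
    """Return a random delay in seconds within the range for this intensity."""
    low, high = INTENSITY_SETTINGS[intensity]["delay_range"]
    return random.uniform(low, high)
```

<p>A higher intensity therefore means both more worker threads and shorter pauses between actions.</p>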
<p>The <code>simulate_connection</code> function handles individual connection attempts to a specific port. It creates a socket connection, tries to get any service banners (like SSH version information), and then sends appropriate attack commands based on the service type. We have added error handling for common network issues and also added realistic delays between commands to mimic human interaction.</p>
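<p>If you want to try the banner-grabbing step in isolation, the following sketch may help. It assumes nothing about the honeypot itself: <code>start_banner_server</code> is a hypothetical throwaway server on an OS-assigned localhost port, and <code>grab_banner</code> is a hypothetical helper that mirrors the connect-and-receive logic.</p>

```python
import socket
import threading

def start_banner_server(banner):
    """Start a throwaway TCP server on an ephemeral localhost port (test helper)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve_once():
        # Accept a single client, send the banner, then shut everything down
        client, _ = server.accept()
        with client:
            client.sendall(banner)
        server.close()

    threading.Thread(target=serve_once, daemon=True).start()
    return port

def grab_banner(host, port, timeout=3):
    """Connect, read up to 1024 bytes, and return the decoded banner ('' if none)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        try:
            data = sock.recv(1024)
        except socket.timeout:
            return ""
    return data.decode("utf-8", "ignore").strip()
```

<p>Using <code>with</code> on the socket ensures it is closed even if an error occurs partway through the exchange.</p>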
<p>Our <code>simulate_port_scan</code> function acts like a reconnaissance tool that systematically checks each port in our target list. It's similar to how tools like <code>nmap</code> work: going through ports one by one to see what services are available. For each port, it calls the <code>simulate_connection</code> function and adds small random delays to make the scan pattern look more natural.</p>
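<p>For comparison, a bare-bones TCP connect scan can be sketched with <code>connect_ex</code>, which returns an error code instead of raising an exception. The function name <code>scan_ports</code> is hypothetical, and this is only a simplified illustration of the idea, not how <code>nmap</code> is implemented.</p>

```python
import socket

def scan_ports(host, ports, timeout=0.5):
    """Return the subset of ports that accept a TCP connection."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success and an errno value otherwise
            if sock.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports
```

<p>Only run a scan like this against machines you own or are explicitly allowed to test, such as your local honeypot.</p>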
<p>The <code>simulate_brute_force</code> function maintains lists of common usernames and passwords and tries every combination against services like FTP and SSH. For each attempt, it creates a new connection, sends the login credentials (as <code>USER</code>/<code>PASS</code> commands for FTP, and as a simplified <code>username:password</code> line for SSH), and then closes the connection. This helps us test how well the honeypot logs brute-force login attempts.</p>
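<p>The nested loops over usernames and passwords can also be expressed with <code>itertools.product</code>. In this sketch, the two lists are copied from the simulator, while <code>credential_pairs</code> and <code>ftp_login_commands</code> are hypothetical helpers used only to show what gets generated.</p>

```python
from itertools import product

# Copied from simulate_brute_force
common_usernames = ["admin", "root", "user", "test"]
common_passwords = ["password123", "admin123", "123456", "root"]

def credential_pairs(usernames, passwords):
    """Return an iterator over every (username, password) combination."""
    return product(usernames, passwords)

def ftp_login_commands(username, password):
    """Build the two FTP command lines sent for one login attempt."""
    return [f"USER {username}\r\n".encode(), f"PASS {password}\r\n".encode()]
```

<p>With four usernames and four passwords, this yields 16 attempts per brute-force run.</p>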
<p>The <code>run_continuous_simulation</code> function runs for a specified duration, randomly choosing between different attack types like port scanning, brute force, or specific service attacks. It uses Python's <code>ThreadPoolExecutor</code> to run multiple attacks simultaneously based on the specified intensity level.</p>
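<p>One thing to be aware of with <code>ThreadPoolExecutor</code> is that <code>submit</code> returns a future, and exceptions raised in a worker only surface when you ask that future for its result. The sketch below shows this pattern with a hypothetical <code>fake_attack</code> function standing in for the real simulation methods.</p>

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_attack(name):
    """Hypothetical stand-in for simulate_port_scan, simulate_brute_force, etc."""
    return f"{name} done"

def run_batch(task_names, max_workers=5):
    """Run all tasks in a thread pool and return their results, sorted."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fake_attack, name) for name in task_names]
        for future in as_completed(futures):
            # result() re-raises any exception that occurred in the worker thread
            results.append(future.result())
    return sorted(results)
```

<p>Leaving the <code>with</code> block waits for all submitted tasks to finish, which is why a simulation can run slightly longer than the requested duration.</p>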
<p>Finally, we have the <code>main</code> function that provides the command-line interface for the simulator. It uses <code>argparse</code> to handle command-line arguments, letting users specify the target IP, intensity level, and duration of the simulation. It creates an instance of the <code>HoneypotSimulator</code> class and manages the overall execution, including proper handling of user interruptions and errors.</p>
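<p>You can check how the arguments are parsed without running a simulation by passing a list to <code>parse_args</code>. The parser below repeats the options from the script, and <code>build_parser</code> is a hypothetical wrapper added for this example.</p>

```python
import argparse

def build_parser():
    """Recreate the simulator's command-line options."""
    parser = argparse.ArgumentParser(description="Honeypot Attack Simulator")
    parser.add_argument("--target", default="127.0.0.1", help="Target IP address")
    parser.add_argument(
        "--intensity",
        choices=["low", "medium", "high"],
        default="medium",
        help="Simulation intensity level",
    )
    parser.add_argument("--duration", type=int, default=300, help="Simulation duration in seconds")
    return parser

# Parse an explicit argument list instead of reading sys.argv
args = build_parser().parse_args(["--intensity", "high", "--duration", "600"])
```

<p>Here <code>args.target</code> keeps its default, while <code>args.duration</code> is converted to an integer by <code>type=int</code>.</p>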
<p>After putting the simulator code in a separate script, run it with one of the following commands:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Run with default settings (medium intensity, localhost, 5 minutes)</span>
python honeypot_simulator.py

<span class="hljs-comment"># Run with custom settings</span>
python honeypot_simulator.py --target <span class="hljs-number">192.168</span><span class="hljs-number">.1</span><span class="hljs-number">.100</span> --intensity high --duration <span class="hljs-number">600</span>
</code></pre>
<p>Since we are running the honeypot as well as the simulator on the same machine locally, the target will be <code>localhost</code>. But it can be something else in a real scenario or if you are running the honeypot in a VM or a different machine – so make sure you confirm the IP before running the simulator.</p>
<h2 id="heading-how-to-analyze-honeypot-data">How to Analyze Honeypot Data</h2>
<p>Let’s quickly write a helper function that will allow us to analyze all the data collected by the honeypot. Since we’ve stored it in a log file with one JSON object per line, we can conveniently parse it using Python’s built-in <code>json</code> module.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> json

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">analyze_logs</span>(<span class="hljs-params">log_file</span>):</span>
    <span class="hljs-string">"""Enhanced honeypot log analysis with temporal and behavioral patterns"""</span>
    ip_analysis = {}
    port_analysis = {}
    hourly_attacks = {}
    data_patterns = {}

    <span class="hljs-comment"># Track session patterns</span>
    ip_sessions = {}
    attack_timeline = []

    <span class="hljs-keyword">with</span> open(log_file, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
        <span class="hljs-keyword">for</span> line <span class="hljs-keyword">in</span> f:
            <span class="hljs-keyword">try</span>:
                activity = json.loads(line)
                timestamp = datetime.datetime.fromisoformat(activity[<span class="hljs-string">'timestamp'</span>])
                ip = activity[<span class="hljs-string">'remote_ip'</span>]
                port = activity[<span class="hljs-string">'port'</span>]
                data = activity[<span class="hljs-string">'data'</span>]

                <span class="hljs-comment"># Initialize IP tracking if new</span>
                <span class="hljs-keyword">if</span> ip <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> ip_analysis:
                    ip_analysis[ip] = {
                        <span class="hljs-string">'total_attempts'</span>: <span class="hljs-number">0</span>,
                        <span class="hljs-string">'first_seen'</span>: timestamp,
                        <span class="hljs-string">'last_seen'</span>: timestamp,
                        <span class="hljs-string">'targeted_ports'</span>: set(),
                        <span class="hljs-string">'unique_payloads'</span>: set(),
                        <span class="hljs-string">'session_count'</span>: <span class="hljs-number">0</span>
                    }

                <span class="hljs-comment"># Update IP statistics</span>
                ip_analysis[ip][<span class="hljs-string">'total_attempts'</span>] += <span class="hljs-number">1</span>
                ip_analysis[ip][<span class="hljs-string">'last_seen'</span>] = timestamp
                ip_analysis[ip][<span class="hljs-string">'targeted_ports'</span>].add(port)
                ip_analysis[ip][<span class="hljs-string">'unique_payloads'</span>].add(data.strip())

                <span class="hljs-comment"># Track hourly patterns</span>
                hour = timestamp.hour
                hourly_attacks[hour] = hourly_attacks.get(hour, <span class="hljs-number">0</span>) + <span class="hljs-number">1</span>

                <span class="hljs-comment"># Analyze port targeting patterns</span>
                <span class="hljs-keyword">if</span> port <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> port_analysis:
                    port_analysis[port] = {
                        <span class="hljs-string">'total_attempts'</span>: <span class="hljs-number">0</span>,
                        <span class="hljs-string">'unique_ips'</span>: set(),
                        <span class="hljs-string">'unique_payloads'</span>: set()
                    }
                port_analysis[port][<span class="hljs-string">'total_attempts'</span>] += <span class="hljs-number">1</span>
                port_analysis[port][<span class="hljs-string">'unique_ips'</span>].add(ip)
                port_analysis[port][<span class="hljs-string">'unique_payloads'</span>].add(data.strip())

                <span class="hljs-comment"># Track payload patterns</span>
                <span class="hljs-keyword">if</span> data.strip():
                    data_patterns[data.strip()] = data_patterns.get(data.strip(), <span class="hljs-number">0</span>) + <span class="hljs-number">1</span>

                <span class="hljs-comment"># Track attack timeline</span>
                attack_timeline.append({
                    <span class="hljs-string">'timestamp'</span>: timestamp,
                    <span class="hljs-string">'ip'</span>: ip,
                    <span class="hljs-string">'port'</span>: port
                })

            <span class="hljs-keyword">except</span> (json.JSONDecodeError, KeyError, ValueError):
                <span class="hljs-comment"># Skip lines with invalid JSON, missing fields, or a malformed timestamp</span>
                <span class="hljs-keyword">continue</span>

    <span class="hljs-comment"># Analysis Report Generation</span>
    print(<span class="hljs-string">"\n=== Honeypot Analysis Report ==="</span>)

    <span class="hljs-comment"># 1. IP-based Analysis</span>
    print(<span class="hljs-string">"\nTop 10 Most Active IPs:"</span>)
    sorted_ips = sorted(ip_analysis.items(), 
                       key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>][<span class="hljs-string">'total_attempts'</span>], 
                       reverse=<span class="hljs-literal">True</span>)[:<span class="hljs-number">10</span>]
    <span class="hljs-keyword">for</span> ip, stats <span class="hljs-keyword">in</span> sorted_ips:
        duration = stats[<span class="hljs-string">'last_seen'</span>] - stats[<span class="hljs-string">'first_seen'</span>]
        print(<span class="hljs-string">f"\nIP: <span class="hljs-subst">{ip}</span>"</span>)
        print(<span class="hljs-string">f"Total Attempts: <span class="hljs-subst">{stats[<span class="hljs-string">'total_attempts'</span>]}</span>"</span>)
        print(<span class="hljs-string">f"Active Duration: <span class="hljs-subst">{duration}</span>"</span>)
        print(<span class="hljs-string">f"Unique Ports Targeted: <span class="hljs-subst">{len(stats[<span class="hljs-string">'targeted_ports'</span>])}</span>"</span>)
        print(<span class="hljs-string">f"Unique Payloads: <span class="hljs-subst">{len(stats[<span class="hljs-string">'unique_payloads'</span>])}</span>"</span>)

    <span class="hljs-comment"># 2. Port Analysis</span>
    print(<span class="hljs-string">"\nPort Targeting Analysis:"</span>)
    sorted_ports = sorted(port_analysis.items(),
                         key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>][<span class="hljs-string">'total_attempts'</span>],
                         reverse=<span class="hljs-literal">True</span>)
    <span class="hljs-keyword">for</span> port, stats <span class="hljs-keyword">in</span> sorted_ports:
        print(<span class="hljs-string">f"\nPort <span class="hljs-subst">{port}</span>:"</span>)
        print(<span class="hljs-string">f"Total Attempts: <span class="hljs-subst">{stats[<span class="hljs-string">'total_attempts'</span>]}</span>"</span>)
        print(<span class="hljs-string">f"Unique Attackers: <span class="hljs-subst">{len(stats[<span class="hljs-string">'unique_ips'</span>])}</span>"</span>)
        print(<span class="hljs-string">f"Unique Payloads: <span class="hljs-subst">{len(stats[<span class="hljs-string">'unique_payloads'</span>])}</span>"</span>)

    <span class="hljs-comment"># 3. Temporal Analysis</span>
    print(<span class="hljs-string">"\nHourly Attack Distribution:"</span>)
    <span class="hljs-keyword">for</span> hour <span class="hljs-keyword">in</span> sorted(hourly_attacks.keys()):
        print(<span class="hljs-string">f"Hour <span class="hljs-subst">{hour:<span class="hljs-number">02</span>d}</span>: <span class="hljs-subst">{hourly_attacks[hour]}</span> attempts"</span>)

    <span class="hljs-comment"># 4. Attack Sophistication Analysis</span>
    print(<span class="hljs-string">"\nAttacker Sophistication Analysis:"</span>)
    <span class="hljs-keyword">for</span> ip, stats <span class="hljs-keyword">in</span> sorted_ips:
        sophistication_score = (
            len(stats[<span class="hljs-string">'targeted_ports'</span>]) * <span class="hljs-number">0.4</span> +  <span class="hljs-comment"># Port diversity</span>
            len(stats[<span class="hljs-string">'unique_payloads'</span>]) * <span class="hljs-number">0.6</span>   <span class="hljs-comment"># Payload diversity</span>
        )
        print(<span class="hljs-string">f"IP <span class="hljs-subst">{ip}</span>: Sophistication Score <span class="hljs-subst">{sophistication_score:<span class="hljs-number">.2</span>f}</span>"</span>)

    <span class="hljs-comment"># 5. Common Payload Patterns</span>
    print(<span class="hljs-string">"\nTop 10 Most Common Payloads:"</span>)
    sorted_payloads = sorted(data_patterns.items(),
                            key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>],
                            reverse=<span class="hljs-literal">True</span>)[:<span class="hljs-number">10</span>]
    <span class="hljs-keyword">for</span> payload, count <span class="hljs-keyword">in</span> sorted_payloads:
        <span class="hljs-keyword">if</span> len(payload) &gt; <span class="hljs-number">50</span>:  <span class="hljs-comment"># Truncate long payloads</span>
            payload = payload[:<span class="hljs-number">50</span>] + <span class="hljs-string">"..."</span>
        print(<span class="hljs-string">f"Count <span class="hljs-subst">{count}</span>: <span class="hljs-subst">{payload}</span>"</span>)
</code></pre>
<p>You can place this in a separate script file and call the function on the JSON log file. It prints a report that summarizes the collected data from several angles.</p>
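<p>If you don’t have real logs yet, you can experiment with a small sample file. The sketch below assumes the format that <code>analyze_logs</code> reads: one JSON object per line with <code>timestamp</code>, <code>remote_ip</code>, <code>port</code>, and <code>data</code> fields. The entries, the file name, and the helper functions are all made up for illustration.</p>

```python
import datetime
import json
import os
import tempfile

def write_sample_log(path):
    """Write two made-up log entries, one JSON object per line."""
    entries = [
        {"timestamp": "2025-01-01T10:15:00", "remote_ip": "127.0.0.1", "port": 21, "data": "USER admin"},
        {"timestamp": "2025-01-01T10:15:02", "remote_ip": "127.0.0.1", "port": 21, "data": "PASS admin123"},
    ]
    with open(path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

def read_log(path):
    """Parse each line into a dict and convert the timestamp to a datetime."""
    records = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            record["timestamp"] = datetime.datetime.fromisoformat(record["timestamp"])
            records.append(record)
    return records

def demo():
    """Write the sample file to a temporary directory and read it back."""
    path = os.path.join(tempfile.mkdtemp(), "sample_honeypot.json")
    write_sample_log(path)
    return read_log(path)
```

<p>You could pass the same sample file to <code>analyze_logs</code> to see what the report looks like before pointing it at real data.</p>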
<p>Our analysis begins by grouping the data into several categories like IP-based statistics, port targeting patterns, hourly attack distributions, and payload characteristics. For every IP, we are tracking total attempts, first and last seen times, targeted ports and unique payloads. This will help us build unique profiles for attackers.</p>
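<p>The per-IP bookkeeping can be written a little more compactly with <code>collections.defaultdict</code>. This is a sketch rather than a drop-in replacement: <code>build_ip_profiles</code> is a hypothetical function, and it assumes each record is a dict with the same four fields used above. Unlike the original loop, it uses <code>min</code> and <code>max</code> so the first and last seen times stay correct even if log lines are out of order.</p>

```python
from collections import defaultdict

def build_ip_profiles(records):
    """Group records (dicts with remote_ip, port, data, timestamp) by source IP."""
    profiles = defaultdict(lambda: {
        "total_attempts": 0,
        "first_seen": None,
        "last_seen": None,
        "targeted_ports": set(),
        "unique_payloads": set(),
    })
    for record in records:
        ip = record["remote_ip"]
        profile = profiles[ip]
        ts = record["timestamp"]
        profile["total_attempts"] += 1
        # min/max keep the time window correct regardless of line order
        profile["first_seen"] = ts if profile["first_seen"] is None else min(profile["first_seen"], ts)
        profile["last_seen"] = ts if profile["last_seen"] is None else max(profile["last_seen"], ts)
        profile["targeted_ports"].add(record["port"])
        profile["unique_payloads"].add(record["data"].strip())
    return dict(profiles)
```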
<p>We also examine port-based attack patterns, which show the most frequently targeted ports and how many unique attackers hit each one. The attack sophistication analysis assigns each IP a simple score based on the number of ports it targeted and the number of unique payloads it sent. This score is a rough way to separate simple scanning activity from more varied, deliberate attacks.</p>
<p>Temporal analysis counts attack attempts per hour, which can reveal patterns in attack timing and hint at automated targeting campaigns. Finally, we print the most common payloads to identify frequently used attack strings or commands.</p>
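<p>To see how the sophistication score behaves, you can pull the weighting into a small function. The weights 0.4 and 0.6 come from <code>analyze_logs</code>, while the function name and the two example attackers are hypothetical.</p>

```python
def sophistication_score(num_ports, num_payloads, port_weight=0.4, payload_weight=0.6):
    """Weighted sum of port diversity and payload diversity."""
    return num_ports * port_weight + num_payloads * payload_weight

# A scanner that touches 4 ports with a single payload
scanner = sophistication_score(num_ports=4, num_payloads=1)

# A brute-forcer that hits 1 port with 16 different payloads
brute_forcer = sophistication_score(num_ports=1, num_payloads=16)
```

<p>Note that the score is unbounded and grows with payload volume, so a noisy brute-force run scores higher than a quiet scan. Treat it as a rough ranking heuristic, not as a measure of attacker skill.</p>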
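<p>Both the hourly distribution and the payload ranking are counting problems, so <code>collections.Counter</code> is a natural fit. The two helper functions below are hypothetical and assume ISO 8601 timestamp strings like the ones parsed by <code>analyze_logs</code>.</p>

```python
from collections import Counter
import datetime

def hourly_distribution(timestamps):
    """Count events per hour of day (0-23) from ISO 8601 timestamp strings."""
    return Counter(datetime.datetime.fromisoformat(ts).hour for ts in timestamps)

def top_payloads(payloads, n=10):
    """Return the n most frequent non-empty payloads as (payload, count) pairs."""
    cleaned = (p.strip() for p in payloads)
    return Counter(p for p in cleaned if p).most_common(n)
```

<p><code>most_common</code> replaces the manual sort-and-slice step, and stripping whitespace first means <code>"USER admin\r\n"</code> and <code>"USER admin"</code> are counted as the same payload.</p>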
<h2 id="heading-security-considerations">Security Considerations</h2>
<p>While deploying this honeypot, make sure you consider the following security measures:</p>
<ol>
<li><p>Run your honeypot in an isolated environment, typically inside a VM or on a local machine that sits behind a NAT and a firewall.</p>
</li>
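<li><p>Run the honeypot with minimal system privileges (typically not as root) to reduce risk if it is compromised. Keep in mind that on most Unix-like systems, binding to ports below 1024 (such as 21, 22, 80, and 443) requires elevated privileges by default, so you may need to use higher port numbers when running as an unprivileged user.</p>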
</li>
<li><p>Be cautious with collected data if you plan to ever deploy it as a production-grade or research honeypot as it may contain malware or sensitive information.</p>
</li>
<li><p>Implement robust monitoring mechanisms to detect attempts to break out of the honeypot environment.</p>
</li>
</ol>
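<p>As a small aid for the second point, you could check at startup whether the process is running as root and which of your ports normally need elevated privileges. This is only a sketch: both function names are hypothetical, and the 1024 threshold reflects the default behavior on most Unix-like systems.</p>

```python
import os

def running_as_root():
    """Return True when the process has root privileges on a Unix-like system."""
    # os.geteuid is not available on Windows, so treat that case as "not root"
    geteuid = getattr(os, "geteuid", None)
    return geteuid is not None and geteuid() == 0

def privileged_ports(ports):
    """Return the ports that typically require elevated privileges to bind."""
    return [port for port in ports if port in range(1, 1024)]
```

<p>For example, the honeypot could print a warning when <code>running_as_root()</code> is true, or suggest alternative port numbers when <code>privileged_ports</code> returns a non-empty list.</p>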
<h2 id="heading-conclusion">Conclusion</h2>
<p>With this, we have built our honeypot, written a simulator to launch attacks against it, and analyzed the data from the honeypot logs to make a few simple inferences. It is an excellent way to understand both offensive and defensive security concepts. You can build on this to create more complex detection systems by adding features like:</p>
<ol>
<li><p>Dynamic service emulation based on attack behavior</p>
</li>
<li><p>Integration with threat intelligence systems that will perform better inference analysis of these collected honeypot logs</p>
</li>
<li><p>More comprehensive logging, beyond the IP, port, and network data, through advanced logging mechanisms</p>
</li>
<li><p>Machine learning capabilities to detect attack patterns</p>
</li>
</ol>
<p>Remember that even though honeypots are powerful security tools, they should be a part of a comprehensive defensive security strategy, not the only line of defense.</p>
<p>I hope you learned how honeypots work, what their purpose is, and a bit of Python programming as well!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Building a Simple Web Application Security Scanner with Python: A Beginner's Guide ]]>
                </title>
                <description>
                    <![CDATA[ In this article, you are going to learn to create a basic security tool that can be helpful in identifying common vulnerabilities in web applications. I have two goals here. The first is to empower you with the skills to develop tools that can help e... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-web-application-security-scanner-with-python/</link>
                <guid isPermaLink="false">675b035a23cb72fd28dab52d</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ python projects ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #cybersecurity ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Chaitanya Rahalkar ]]>
                </dc:creator>
                <pubDate>Thu, 12 Dec 2024 15:38:02 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1733929791562/042042e3-56e2-4185-be19-2a0f5fa15d25.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this article, you are going to learn to create a basic security tool that can be helpful in identifying common vulnerabilities in web applications.</p>
<p>I have two goals here. The first is to empower you with the skills to develop tools that can help enhance the overall security posture of your websites. The second is to help you practice some Python programming.</p>
<p>In this guide, you will be building a Python-based security scanner that can detect XSS, SQL injection, and sensitive PII (Personally Identifiable Information).</p>
<h3 id="heading-types-of-vulnerabilities">Types of Vulnerabilities</h3>
<p>Generally, we can categorize web security vulnerabilities into the following buckets (for even more buckets, check the <a target="_blank" href="https://owasp.org/www-project-top-ten/">OWASP Top 10</a>):</p>
<ul>
<li><p><strong>SQL injection</strong>: A technique where attackers insert malicious SQL code into SQL queries through unvalidated inputs, allowing them to read or modify database contents.</p>
</li>
<li><p><strong>Cross-Site Scripting (XSS)</strong>: A technique where attackers inject malicious JavaScript into trusted websites. This allows them to execute the JavaScript code in the context of the victim’s browser and steal sensitive information or perform unauthorized operations.</p>
</li>
<li><p><strong>Sensitive information exposure</strong>: A security issue where an application unintentionally reveals sensitive data like passwords, API keys and so on through logs, insecure storage, and other vulnerabilities.</p>
</li>
<li><p><strong>Common security misconfigurations</strong>: Security issues that occur due to improper configuration of web servers, like default credentials for administrator accounts, enabled debug mode, publicly available administrator dashboards with weak credentials, and so on.</p>
</li>
<li><p><strong>Basic authentication weaknesses</strong>: Security issues that occur due to lapses in password policies, user authentication processes, improper session management, and so on.</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-setting-up-our-development-environment">Setting Up Our Development Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-building-our-core-scanner-class">Building our Core Scanner Class</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-implementing-the-crawler">Implementing the Crawler</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-designing-and-implementing-the-security-checks">Designing and Implementing the Security Checks</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-sql-injection-detection-check">SQL Injection Detection Check</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-xss-cross-site-scripting-check">XSS (Cross-Site Scripting) Check</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-sensitive-information-exposure-check">Sensitive Information Exposure Check</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-implementing-the-main-scanning-logic">Implementing the Main Scanning Logic</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-extending-the-security-scanner">Extending the Security Scanner</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along with this tutorial, you will need:</p>
<ul>
<li><p>Python 3.x</p>
</li>
<li><p>Basic understanding of the HTTP protocol</p>
</li>
<li><p>Basic understanding of web applications</p>
</li>
<li><p>Basic understanding of how XSS, SQL injection, and basic security attacks work</p>
</li>
</ul>
<h2 id="heading-setting-up-our-development-environment">Setting Up Our Development Environment</h2>
<p>Let’s install our required dependencies with the following command:</p>
<pre><code class="lang-bash">pip install requests beautifulsoup4 urllib3 colorama
</code></pre>
<p>We’ll use these dependencies in our code file:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Required packages</span>
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">from</span> bs4 <span class="hljs-keyword">import</span> BeautifulSoup
<span class="hljs-keyword">import</span> urllib.parse
<span class="hljs-keyword">import</span> colorama
<span class="hljs-keyword">import</span> re
<span class="hljs-keyword">from</span> concurrent.futures <span class="hljs-keyword">import</span> ThreadPoolExecutor
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> List, Dict, Set
</code></pre>
<h2 id="heading-building-our-core-scanner-class">Building our Core Scanner Class</h2>
<p>Once you have the dependencies, it’s time to write the core scanner class.</p>
<p>This class will serve as our main class that will handle the web security scanning functionality. It will track our visited pages and also store our findings.</p>
<p>The class also defines a <code>normalize_url</code> helper that you can use to avoid rescanning URLs you have already seen. It keeps only the scheme, host, and path, which removes the HTTP GET parameters from the URL. For example, <code>https://example.com/page?id=1</code> becomes <code>https://example.com/page</code> after normalization.</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">WebSecurityScanner</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, target_url: str, max_depth: int = <span class="hljs-number">3</span></span>):</span>
        <span class="hljs-string">"""
        Initialize the security scanner with a target URL and maximum crawl depth.

        Args:
            target_url: The base URL to scan
            max_depth: Maximum depth for crawling links (default: 3)
        """</span>
        self.target_url = target_url
        self.max_depth = max_depth
        self.visited_urls: Set[str] = set()
        self.vulnerabilities: List[Dict] = []
        self.session = requests.Session()

        <span class="hljs-comment"># Initialize colorama for cross-platform colored output</span>
        colorama.init()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">normalize_url</span>(<span class="hljs-params">self, url: str</span>) -&gt; str:</span>
        <span class="hljs-string">"""Normalize the URL to prevent duplicate checks"""</span>
        parsed = urllib.parse.urlparse(url)
        <span class="hljs-keyword">return</span> <span class="hljs-string">f"<span class="hljs-subst">{parsed.scheme}</span>://<span class="hljs-subst">{parsed.netloc}</span><span class="hljs-subst">{parsed.path}</span>"</span>
</code></pre>
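<p>As a quick illustration of what normalization does, here is a standalone sketch of the same logic applied to a few sample URLs. The sample URLs are made up, and the function is written outside the class only so that it can be run on its own.</p>

```python
from urllib.parse import urlparse


def normalize_url(url: str) -> str:
    """Keep only scheme, host, and path; the query string and fragment are dropped."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"


sample_urls = [
    "https://example.com/page?id=1",
    "https://example.com/page?id=2#top",
    "https://example.com/other",
]

# URLs that differ only in their query string collapse to one entry
unique_pages = {normalize_url(u) for u in sample_urls}
print(sorted(unique_pages))
```

<p>Keep in mind that the injection checks later in this guide read the GET parameters from each URL, so normalization is useful for de-duplicating pages, not for the URLs you pass to those checks.</p>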
<h2 id="heading-implementing-the-crawler">Implementing the Crawler</h2>
<p>The first step in our scanner is to implement a web crawler that will discover pages and URLs in a given target application. Make sure you’re writing these functions in our <code>WebSecurityScanner</code> class.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">crawl</span>(<span class="hljs-params">self, url: str, depth: int = <span class="hljs-number">0</span></span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-string">"""
    Crawl the website to discover pages and endpoints.

    Args:
        url: Current URL to crawl
        depth: Current depth in the crawl tree
    """</span>
    <span class="hljs-keyword">if</span> depth &gt; self.max_depth <span class="hljs-keyword">or</span> url <span class="hljs-keyword">in</span> self.visited_urls:
        <span class="hljs-keyword">return</span>

    <span class="hljs-keyword">try</span>:
        self.visited_urls.add(url)
        response = self.session.get(url, verify=<span class="hljs-literal">False</span>)
        soup = BeautifulSoup(response.text, <span class="hljs-string">'html.parser'</span>)

        <span class="hljs-comment"># Find all links in the page</span>
        links = soup.find_all(<span class="hljs-string">'a'</span>, href=<span class="hljs-literal">True</span>)
        <span class="hljs-keyword">for</span> link <span class="hljs-keyword">in</span> links:
            next_url = urllib.parse.urljoin(url, link[<span class="hljs-string">'href'</span>])
            <span class="hljs-keyword">if</span> next_url.startswith(self.target_url):
                self.crawl(next_url, depth + <span class="hljs-number">1</span>)

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Error crawling <span class="hljs-subst">{url}</span>: <span class="hljs-subst">{str(e)}</span>"</span>)
</code></pre>
<p>This <code>crawl</code> function performs a depth-first crawl of a website. It follows the links on each page while staying within the target site, up to the configured maximum depth.</p>
<p>For example, if you run this scanner on <code>https://google.com</code>, the function first collects all the links on the page and then checks, one by one, whether each resolved URL starts with the target URL (that is, <code>https://google.com</code>). If it does, the function recursively crawls that URL. The <code>depth</code> parameter tracks the current depth, and crawling stops once it exceeds <code>max_depth</code>, which is set in the constructor. The exception handling ensures that a failure on one page is reported without stopping the rest of the crawl.</p>
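<p>One thing to be aware of is that a <code>startswith</code> comparison matches URL prefixes, not hosts, so a look-alike host such as <code>https://example.com.evil.test</code> would pass a check against <code>https://example.com</code>. A stricter alternative is to compare parsed hostnames. The sketch below is one way to do that. The helper name <code>is_same_host</code> is hypothetical and not part of the scanner above.</p>

```python
from urllib.parse import urlparse


def is_same_host(target_url: str, candidate_url: str) -> bool:
    """Return True only if the candidate is an HTTP(S) URL on the target's host."""
    target = urlparse(target_url)
    candidate = urlparse(candidate_url)
    # Skip non-web schemes such as mailto: or javascript:
    if candidate.scheme not in ("http", "https"):
        return False
    return candidate.hostname == target.hostname


target = "https://example.com"
lookalike = "https://example.com.evil.test/login"

print(lookalike.startswith(target))      # the prefix check accepts the look-alike host
print(is_same_host(target, lookalike))   # the hostname comparison rejects it
```

<p>If you adopt something like this, you would call it in place of the <code>startswith</code> test inside the crawl loop.</p>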
<h2 id="heading-designing-and-implementing-the-security-checks">Designing and Implementing the Security Checks</h2>
<p>Now let’s finally get to the juicy part and implement our security checks. We’ll start first with SQL Injection.</p>
<h3 id="heading-sql-injection-detection-check">SQL Injection Detection Check</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">check_sql_injection</span>(<span class="hljs-params">self, url: str</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-string">"""Test for potential SQL injection vulnerabilities"""</span>
    sql_payloads = [<span class="hljs-string">"'"</span>, <span class="hljs-string">"1' OR '1'='1"</span>, <span class="hljs-string">"' OR 1=1--"</span>, <span class="hljs-string">"' UNION SELECT NULL--"</span>]

    <span class="hljs-keyword">for</span> payload <span class="hljs-keyword">in</span> sql_payloads:
        <span class="hljs-keyword">try</span>:
            <span class="hljs-comment"># Test GET parameters</span>
            parsed = urllib.parse.urlparse(url)
            params = urllib.parse.parse_qs(parsed.query)

            <span class="hljs-keyword">for</span> param <span class="hljs-keyword">in</span> params:
                test_url = url.replace(<span class="hljs-string">f"<span class="hljs-subst">{param}</span>=<span class="hljs-subst">{params[param][<span class="hljs-number">0</span>]}</span>"</span>, 
                                     <span class="hljs-string">f"<span class="hljs-subst">{param}</span>=<span class="hljs-subst">{payload}</span>"</span>)
                response = self.session.get(test_url)

                <span class="hljs-comment"># Look for SQL error messages</span>
                <span class="hljs-keyword">if</span> any(error <span class="hljs-keyword">in</span> response.text.lower() <span class="hljs-keyword">for</span> error <span class="hljs-keyword">in</span> 
                    [<span class="hljs-string">'sql'</span>, <span class="hljs-string">'mysql'</span>, <span class="hljs-string">'sqlite'</span>, <span class="hljs-string">'postgresql'</span>, <span class="hljs-string">'oracle'</span>]):
                    self.report_vulnerability({
                        <span class="hljs-string">'type'</span>: <span class="hljs-string">'SQL Injection'</span>,
                        <span class="hljs-string">'url'</span>: url,
                        <span class="hljs-string">'parameter'</span>: param,
                        <span class="hljs-string">'payload'</span>: payload
                    })

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">f"Error testing SQL injection on <span class="hljs-subst">{url}</span>: <span class="hljs-subst">{str(e)}</span>"</span>)
</code></pre>
<p>This function performs basic SQL injection checks by substituting common SQL injection payloads into each GET parameter of the URL and looking for error messages that might hint at a security vulnerability.</p>
<p>After each GET request, we check whether the response body contains database-related keywords such as <code>sql</code> or <code>mysql</code>. If it does, we use the <code>report_vulnerability</code> function to record it as a security issue in the final report that this script generates. Because this is a simple keyword match, a page that merely mentions one of these words will also be flagged, so treat each finding as a lead to verify by hand. For the sake of this example, we are using a few commonly tested SQL injection payloads, but you can extend the list to test even more.</p>
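<p>The string replacement used to build <code>test_url</code> works for simple URLs, but it can silently do nothing when the original value is percent-encoded, because <code>parse_qs</code> returns decoded values that no longer appear verbatim in the URL. A more robust option is to rebuild the query string from the parsed parameters. Here is a minimal sketch, where <code>build_test_url</code> is a hypothetical helper name:</p>

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def build_test_url(url: str, param: str, payload: str) -> str:
    """Return a copy of url with one query parameter replaced by the payload."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query, keep_blank_values=True)
    params[param] = [payload]
    # urlencode percent-encodes the payload; doseq=True expands list values
    new_query = urlencode(params, doseq=True)
    return urlunparse(parsed._replace(query=new_query))


test_url = build_test_url("https://example.com/item?id=1&cat=2", "id", "' OR 1=1--")
print(test_url)
```

<p>The other parameters are left untouched, so each request changes exactly one input at a time.</p>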
<h3 id="heading-xss-cross-site-scripting-check">XSS (Cross-Site Scripting) Check</h3>
<p>Now let’s implement the second security check for XSS payloads.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">check_xss</span>(<span class="hljs-params">self, url: str</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-string">"""Test for potential Cross-Site Scripting vulnerabilities"""</span>
    xss_payloads = [
        <span class="hljs-string">"&lt;script&gt;alert('XSS')&lt;/script&gt;"</span>,
        <span class="hljs-string">"&lt;img src=x onerror=alert('XSS')&gt;"</span>,
        <span class="hljs-string">"javascript:alert('XSS')"</span>
    ]

    <span class="hljs-keyword">for</span> payload <span class="hljs-keyword">in</span> xss_payloads:
        <span class="hljs-keyword">try</span>:
            <span class="hljs-comment"># Test GET parameters</span>
            parsed = urllib.parse.urlparse(url)
            params = urllib.parse.parse_qs(parsed.query)

            <span class="hljs-keyword">for</span> param <span class="hljs-keyword">in</span> params:
                test_url = url.replace(<span class="hljs-string">f"<span class="hljs-subst">{param}</span>=<span class="hljs-subst">{params[param][<span class="hljs-number">0</span>]}</span>"</span>, 
                                     <span class="hljs-string">f"<span class="hljs-subst">{param}</span>=<span class="hljs-subst">{urllib.parse.quote(payload)}</span>"</span>)
                response = self.session.get(test_url)

                <span class="hljs-keyword">if</span> payload <span class="hljs-keyword">in</span> response.text:
                    self.report_vulnerability({
                        <span class="hljs-string">'type'</span>: <span class="hljs-string">'Cross-Site Scripting (XSS)'</span>,
                        <span class="hljs-string">'url'</span>: url,
                        <span class="hljs-string">'parameter'</span>: param,
                        <span class="hljs-string">'payload'</span>: payload
                    })

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">f"Error testing XSS on <span class="hljs-subst">{url}</span>: <span class="hljs-subst">{str(e)}</span>"</span>)
</code></pre>
<p>This function, just like the SQL injection tester, uses a set of common XSS payloads and applies the same idea. The key difference is that we look for our injected payload to appear unmodified in the response rather than looking for an error message.</p>
<p>If the response contains the injected payload unmodified, the application is likely reflecting input without escaping it, and the payload may execute in the victim’s browser as a reflected XSS attack.</p>
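<p>To see why the "unmodified" part matters, compare a response that echoes the payload as-is with one that HTML-escapes it first. The sketch below classifies the two cases using the standard library’s <code>html.escape</code>. The function name <code>classify_reflection</code> and the sample response bodies are made up for illustration.</p>

```python
import html


def classify_reflection(payload: str, body: str) -> str:
    """Classify how a payload shows up in a response body."""
    if payload in body:
        return "raw"        # reflected unmodified: worth investigating
    if html.escape(payload) in body:
        return "escaped"    # reflected, but neutralized by HTML escaping
    return "absent"


payload = "<script>alert('XSS')</script>"

vulnerable_body = f"<p>You searched for: {payload}</p>"
safe_body = f"<p>You searched for: {html.escape(payload)}</p>"

print(classify_reflection(payload, vulnerable_body))
print(classify_reflection(payload, safe_body))
print(classify_reflection(payload, "<p>No results</p>"))
```

<p>Only the "raw" case corresponds to what <code>check_xss</code> reports. An "escaped" result suggests the application encodes its output in that context.</p>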
<h3 id="heading-sensitive-information-exposure-check">Sensitive Information Exposure Check</h3>
<p>Now let’s implement our final check for sensitive PII.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">check_sensitive_info</span>(<span class="hljs-params">self, url: str</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-string">"""Check for exposed sensitive information"""</span>
    sensitive_patterns = {
        <span class="hljs-string">'email'</span>: <span class="hljs-string">r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'</span>,
        <span class="hljs-string">'phone'</span>: <span class="hljs-string">r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'</span>,
        <span class="hljs-string">'ssn'</span>: <span class="hljs-string">r'\b\d{3}-\d{2}-\d{4}\b'</span>,
        <span class="hljs-string">'api_key'</span>: <span class="hljs-string">r'api[_-]?key[_-]?([\'"|`])([a-zA-Z0-9]{32,45})\1'</span>
    }

    <span class="hljs-keyword">try</span>:
        response = self.session.get(url)

        <span class="hljs-keyword">for</span> info_type, pattern <span class="hljs-keyword">in</span> sensitive_patterns.items():
            matches = re.finditer(pattern, response.text)
            <span class="hljs-keyword">for</span> match <span class="hljs-keyword">in</span> matches:
                self.report_vulnerability({
                    <span class="hljs-string">'type'</span>: <span class="hljs-string">'Sensitive Information Exposure'</span>,
                    <span class="hljs-string">'url'</span>: url,
                    <span class="hljs-string">'info_type'</span>: info_type,
                    <span class="hljs-string">'pattern'</span>: pattern
                })

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Error checking sensitive information on <span class="hljs-subst">{url}</span>: <span class="hljs-subst">{str(e)}</span>"</span>)
</code></pre>
<p>This function uses a set of predefined regex patterns to search for PII like emails, phone numbers, and SSNs, as well as API keys (an <code>api_key</code>-style label followed by a quoted alphanumeric string of 32 to 45 characters).</p>
<p>Just like the previous two functions, we fetch the response text for the URL and run our regex patterns against it. If we find any matches, we report them with the <code>report_vulnerability</code> function. Make sure all of these functions are defined in the <code>WebSecurityScanner</code> class.</p>
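<p>To get a feel for what these patterns catch, you can try them on a sample string without making any HTTP requests. The sketch below reuses the email, phone, and SSN patterns from the function above. The helper name <code>find_sensitive</code> and the sample text are made up for illustration.</p>

```python
import re

SENSITIVE_PATTERNS = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}


def find_sensitive(text: str) -> dict:
    """Return the matched strings for each pattern that hits at least once."""
    findings = {}
    for info_type, pattern in SENSITIVE_PATTERNS.items():
        matches = [m.group(0) for m in re.finditer(pattern, text)]
        if matches:
            findings[info_type] = matches
    return findings


sample = "Contact admin@example.com or call 555-123-4567. SSN on file: 123-45-6789."
print(find_sensitive(sample))
```

<p>Unlike <code>check_sensitive_info</code>, this sketch returns the matched text rather than the pattern, which makes it easier to judge whether a hit is real or a false positive.</p>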
<h2 id="heading-implementing-the-main-scanning-logic">Implementing the Main Scanning Logic</h2>
<p>Let’s finally stitch everything together by defining the <code>scan</code> and <code>report_vulnerability</code> functions in the <code>WebSecurityScanner</code> class:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">scan</span>(<span class="hljs-params">self</span>) -&gt; List[Dict]:</span>
    <span class="hljs-string">"""
    Main scanning method that coordinates the security checks

    Returns:
        List of discovered vulnerabilities
    """</span>
    print(<span class="hljs-string">f"\n<span class="hljs-subst">{colorama.Fore.BLUE}</span>Starting security scan of <span class="hljs-subst">{self.target_url}</span><span class="hljs-subst">{colorama.Style.RESET_ALL}</span>\n"</span>)

    <span class="hljs-comment"># First, crawl the website</span>
    self.crawl(self.target_url)

    <span class="hljs-comment"># Then run security checks on all discovered URLs</span>
    <span class="hljs-keyword">with</span> ThreadPoolExecutor(max_workers=<span class="hljs-number">5</span>) <span class="hljs-keyword">as</span> executor:
        <span class="hljs-keyword">for</span> url <span class="hljs-keyword">in</span> self.visited_urls:
            executor.submit(self.check_sql_injection, url)
            executor.submit(self.check_xss, url)
            executor.submit(self.check_sensitive_info, url)

    <span class="hljs-keyword">return</span> self.vulnerabilities

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">report_vulnerability</span>(<span class="hljs-params">self, vulnerability: Dict</span>) -&gt; <span class="hljs-keyword">None</span>:</span>
    <span class="hljs-string">"""Record and display found vulnerabilities"""</span>
    self.vulnerabilities.append(vulnerability)
    print(<span class="hljs-string">f"<span class="hljs-subst">{colorama.Fore.RED}</span>[VULNERABILITY FOUND]<span class="hljs-subst">{colorama.Style.RESET_ALL}</span>"</span>)
    <span class="hljs-keyword">for</span> key, value <span class="hljs-keyword">in</span> vulnerability.items():
        print(<span class="hljs-string">f"<span class="hljs-subst">{key}</span>: <span class="hljs-subst">{value}</span>"</span>)
    print()
</code></pre>
<p>This code defines our <code>scan</code> function, which invokes the <code>crawl</code> function to recursively crawl the website. It then uses a thread pool to run all three security checks on every visited URL.</p>
<p>We have also defined the <code>report_vulnerability</code> function, which prints each vulnerability to the console and stores it in our <code>vulnerabilities</code> list.</p>
<p>Finally, add the following entry point and save the file as <code>scanner.py</code>:</p>
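<p>A detail worth knowing about <code>executor.submit</code> is that an exception raised inside a task is stored on the returned future and only surfaces when you call <code>result()</code>. The checks above catch their own exceptions, so this is not a problem here, but if you add checks that return their findings instead, you could collect them as in the following sketch. The names <code>run_checks</code>, <code>demo_check</code>, and <code>broken_check</code> are stand-ins for illustration and do not make any network requests.</p>

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_checks(urls, checks, max_workers=5):
    """Run every check on every URL and gather findings and errors."""
    findings, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which check and URL each future belongs to
        futures = {
            executor.submit(check, url): (check.__name__, url)
            for url in urls
            for check in checks
        }
        for future in as_completed(futures):
            name, url = futures[future]
            try:
                findings.extend(future.result())
            except Exception as exc:
                errors.append((name, url, str(exc)))
    return findings, errors


def demo_check(url):
    """Stand-in check that reports one finding per URL."""
    return [{"type": "demo", "url": url}]


def broken_check(url):
    """Stand-in check that always fails."""
    raise RuntimeError(f"cannot reach {url}")


findings, errors = run_checks(
    ["https://example.com/a", "https://example.com/b"],
    [demo_check, broken_check],
)
print(len(findings), len(errors))
```

<p>Collecting results in the main thread also means the shared list is only modified from one place.</p>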
<pre><code class="lang-python"><span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    <span class="hljs-keyword">if</span> len(sys.argv) != <span class="hljs-number">2</span>:
        print(<span class="hljs-string">"Usage: python scanner.py &lt;target_url&gt;"</span>)
        sys.exit(<span class="hljs-number">1</span>)

    target_url = sys.argv[<span class="hljs-number">1</span>]
    scanner = WebSecurityScanner(target_url)
    vulnerabilities = scanner.scan()

    <span class="hljs-comment"># Print summary</span>
    print(<span class="hljs-string">f"\n<span class="hljs-subst">{colorama.Fore.GREEN}</span>Scan Complete!<span class="hljs-subst">{colorama.Style.RESET_ALL}</span>"</span>)
    print(<span class="hljs-string">f"Total URLs scanned: <span class="hljs-subst">{len(scanner.visited_urls)}</span>"</span>)
    print(<span class="hljs-string">f"Vulnerabilities found: <span class="hljs-subst">{len(vulnerabilities)}</span>"</span>)
</code></pre>
<p>The target URL is supplied as a command-line argument, and the script prints a summary of the URLs scanned and vulnerabilities found at the end of the scan. Now let’s discuss how you can extend the scanner and add more features.</p>
<h2 id="heading-extending-the-security-scanner">Extending the Security Scanner</h2>
<p>Here are some ideas to extend this basic security scanner into something even more advanced:</p>
<ol>
<li><p>Add more vulnerability checks like CSRF detection, directory traversal, and so on.</p>
</li>
<li><p>Improve reporting with an HTML or PDF output.</p>
</li>
<li><p>Add configuration options for scan intensity and scope of searching (specifying the depth of scans through a CLI argument).</p>
</li>
<li><p>Implement proper rate limiting.</p>
</li>
<li><p>Add authentication support for testing URLs that require session-based authentication.</p>
</li>
</ol>
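<p>As a starting point for the configuration idea, here is a minimal sketch of parsing the crawl depth from the command line with <code>argparse</code>. The <code>--max-depth</code> flag name and the <code>parse_args</code> helper are assumptions for illustration and are not part of the scanner above.</p>

```python
import argparse


def parse_args(argv=None):
    """Parse the target URL and an optional crawl depth."""
    parser = argparse.ArgumentParser(description="Basic web security scanner")
    parser.add_argument("target_url", help="Base URL to scan")
    parser.add_argument(
        "--max-depth",
        type=int,
        default=3,
        help="Maximum depth for crawling links (default: 3)",
    )
    return parser.parse_args(argv)


# Passing a list makes the parser easy to try without a real command line
args = parse_args(["https://example.com", "--max-depth", "2"])
print(args.target_url, args.max_depth)
```

<p>You could then pass <code>args.max_depth</code> as the <code>max_depth</code> argument when constructing <code>WebSecurityScanner</code>.</p>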
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>Now you know how to build a basic security scanner! This scanner demonstrates a few core concepts of web security.</p>
<p>Keep in mind that this scanner should only be used for educational purposes. There are several professionally designed tools like Burp Suite and OWASP ZAP that can check for hundreds of security vulnerabilities at a much larger scale.</p>
<p>I hope you learned the basics of web security and a bit of Python programming as well.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
