<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Surya Teja Appini - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Surya Teja Appini - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Tue, 12 May 2026 15:09:22 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/appinisurya/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Choose the Right LLM for Your Projects: A Guide to Effective Model Benchmarking ]]>
                </title>
                <description>
                    <![CDATA[ When you start building with LLMs, it quickly becomes clear that not all models behave the same. One model may excel at creative writing but struggle with technical precision. Another might be thoughtful yet verbose. A third could be fast and efficie... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/choose-the-right-llm-for-your-projects-benchmarking-guide/</link>
                <guid isPermaLink="false">690e27b0ca4947acbefbd755</guid>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Benchmark ]]>
                    </category>
                
                    <category>
                        <![CDATA[ evaluation metrics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ benchmarking ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Surya Teja Appini ]]>
                </dc:creator>
                <pubDate>Fri, 07 Nov 2025 17:09:04 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762534383880/404f27c6-2995-4daa-bcac-c61b10e93abc.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When you start building with LLMs, it quickly becomes clear that not all models behave the same. One model may excel at creative writing but struggle with technical precision. Another might be thoughtful yet verbose. A third could be fast and efficient yet less consistent. So how do you choose the right one for your task?</p>
<p>This guide walks you through a comprehensive workflow for evaluating and selecting the best LLM for your needs. It’s designed for developers who want more than API demos. You’ll see how to design, test, and compare models using real examples and meaningful metrics.</p>
<p>By the end, you’ll understand not only <em>how</em> to benchmark models but <em>why</em> each step matters.</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-public-benchmarks-arent-enough">Why Public Benchmarks Aren’t Enough</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-define-the-task-and-metrics">Step 1: Define the Task and Metrics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-prepare-the-data-and-generate-outputs">Step 2: Prepare the Data and Generate Outputs</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-automate-evaluation-with-a-judge-llm">Step 3: Automate Evaluation with a Judge LLM</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-analyze-visualize-and-interpret">Step 4: Analyze, Visualize, and Interpret</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-iterate-and-scale-up">Step 5: Iterate and Scale Up</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-preparing-a-test-dataset">Preparing a Test Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cloud-providers-and-apis-for-llm-access">Cloud Providers and APIs for LLM Access</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-common-pitfalls-to-avoid">Common Pitfalls to Avoid</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-run-this-end-to-end">How to Run This End to End</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion-turning-evaluation-into-insight">Conclusion: Turning Evaluation into Insight</a></p>
</li>
</ul>
<h2 id="heading-why-public-benchmarks-arent-enough">Why Public Benchmarks Aren’t Enough</h2>
<p>Public leaderboards like MMLU, HumanEval, and HellaSwag show how well models perform on general tests, but they don’t reflect the nuances of your real-world application. A model that scores 90% on reasoning might still fail to produce factual or brand-aligned answers for your domain.</p>
<p>For example, if you’re building a customer review summarizer, your goal isn’t just correctness: it’s also tone, style, and reliability. You might value concise responses with minimal hallucination over creative but inconsistent writing.</p>
<p>That’s why you need a <strong>custom benchmark</strong> that mirrors your actual inputs and quality expectations.</p>
<h2 id="heading-step-1-define-the-task-and-metrics">Step 1: Define the Task and Metrics</h2>
<p>To start, you’ll want to decide up front what success looks like for your application by translating a product need (for example, a short, factual review summarizer) into measurable criteria such as accuracy, factuality, conciseness, latency, and cost. Clear, specific goals make the rest of the pipeline meaningful and comparable.</p>
<p><strong>Example task:</strong> Summarize user reviews into concise, factual one-liners.</p>
<h3 id="heading-key-metrics">Key Metrics</h3>
<ul>
<li><p><strong>Accuracy:</strong> Does the summary reflect the correct information?</p>
</li>
<li><p><strong>Factuality:</strong> Does it avoid hallucinations?</p>
</li>
<li><p><strong>Conciseness:</strong> Is it short yet meaningful?</p>
</li>
<li><p><strong>Latency:</strong> How long does it take per query?</p>
</li>
<li><p><strong>Cost:</strong> How much do API tokens add up per 1,000 requests?</p>
</li>
</ul>
<p>These metrics will help balance technical trade-offs and real-world constraints.</p>
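<p>One way to keep these criteria from drifting later is to pin them down in a small config that every step of the pipeline reads. The sketch below is illustrative: the field names and target values are assumptions to adapt, not recommendations.</p>
<pre><code class="lang-python"># Illustrative benchmark config; field names and targets are assumptions
BENCHMARK_CONFIG = {
    "task": "Summarize user reviews into concise, factual one-liners.",
    "metrics": {
        "accuracy":    {"scale": "1-5 judge score", "target": 4.0},
        "factuality":  {"scale": "1-5 judge score", "target": 4.5},
        "conciseness": {"max_words": 12},
        "latency":     {"max_seconds_per_query": 2.0},
        "cost":        {"max_usd_per_1k_requests": 1.50},
    },
}
</code></pre>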
<h2 id="heading-step-2-prepare-the-data-and-generate-outputs">Step 2: Prepare the Data and Generate Outputs</h2>
<p>Now, we’ll build a small but representative test set and generate candidate outputs from each model you plan to evaluate. The purpose is to create comparable inputs and collect the raw outputs you will later score and analyze.</p>
<p><strong>Requirements:</strong> Python 3.9+ and <code>pandas</code> installed (<code>pip install pandas</code>).</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

reviews = [
    <span class="hljs-string">"The camera quality is great but the battery dies fast."</span>,
    <span class="hljs-string">"Love the design and performance, but it's overpriced."</span>,
    <span class="hljs-string">"Fast processor, poor sound quality, average screen."</span>
]
references = [
    <span class="hljs-string">"Good camera, poor battery."</span>,
    <span class="hljs-string">"Excellent design but expensive."</span>,
    <span class="hljs-string">"Fast but weak audio and display."</span>
]

<span class="hljs-comment"># Build a tiny DataFrame for quick iteration</span>
df = pd.DataFrame({<span class="hljs-string">"review"</span>: reviews, <span class="hljs-string">"reference"</span>: references})
print(<span class="hljs-string">"Sample data:"</span>)
print(df.head())  <span class="hljs-comment"># sanity check: confirm shape/columns</span>
</code></pre>
<p>Now, generate responses using multiple LLMs through <strong>OpenRouter</strong>, which exposes many providers behind a single OpenAI-compatible API.</p>
<p><strong>Requirements:</strong> An OpenRouter API key in place of <code>YOUR_KEY</code>, the <code>openai</code> Python SDK installed (<code>pip install openai</code>), and access to the models you plan to test.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> openrouter
<span class="hljs-keyword">import</span> time

<span class="hljs-comment"># Initialize API client</span>
client = openrouter.Client(api_key=<span class="hljs-string">"YOUR_KEY"</span>)

<span class="hljs-comment"># Replace these placeholders with whichever providers/models you can access</span>
models = [<span class="hljs-string">"model-A"</span>, <span class="hljs-string">"model-B"</span>, <span class="hljs-string">"model-C"</span>]

results = {}
<span class="hljs-keyword">for</span> model <span class="hljs-keyword">in</span> models:
    print(<span class="hljs-string">f"Evaluating <span class="hljs-subst">{model}</span>..."</span>)
    start = time.time()
    outputs = []
    <span class="hljs-keyword">for</span> review <span class="hljs-keyword">in</span> reviews:
        <span class="hljs-comment"># Keep the prompt identical across models to reduce bias</span>
        res = client.completions.create(
            model=model,
            messages=[{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: <span class="hljs-string">f"Summarize this review: <span class="hljs-subst">{review}</span>"</span>}]
        )
        outputs.append(res.choices[<span class="hljs-number">0</span>].message.content.strip())
    <span class="hljs-comment"># Store both the raw outputs and a coarse latency figure</span>
    results[model] = {<span class="hljs-string">"outputs"</span>: outputs, <span class="hljs-string">"latency"</span>: time.time() - start}

print(<span class="hljs-string">"Model outputs generated."</span>)
</code></pre>
<p><strong>Tip:</strong> Even a handful of examples per model can reveal consistent behavior patterns.</p>
<h2 id="heading-step-3-automate-evaluation-with-a-judge-llm">Step 3: Automate Evaluation with a Judge LLM</h2>
<p>In this step, our goal is to replace slow, inconsistent manual labeling with a repeatable, programmatic judging step. We’ll use a fixed judge model and a short rubric so you can get machine-readable scores that reflect qualitative criteria like tone, clarity, and factuality.</p>
<p>Before we go on, let’s clear something up: what is a model-as-a-judge? A model-as-a-judge (MAAJ) uses one LLM to grade outputs from another LLM against task-specific criteria. By prompting the judge with a clear, consistent rubric, you get structured scores that are repeatable and machine-readable. This is useful for aggregation, tracking, and visualization.</p>
<p>We use a fixed rubric because it minimizes drift between runs, and JSON because it makes the output easy to parse and analyze programmatically.</p>
<p>Here are some tips for reliable judging:</p>
<ul>
<li><p>Use a judge model that follows instructions well and keep it fixed across evaluation runs.</p>
</li>
<li><p>Calibrate the rubric: start with 2–4 criteria and a simple numeric scale (for example, 1–5).</p>
</li>
<li><p>Avoid self-judging: prefer a judge from a different provider or model family where possible to reduce shared biases.</p>
</li>
<li><p>For tie-breakers or fine-grained comparisons, consider pairwise judgments (ask the judge to pick the better of two candidates) and convert preferences into scores; a minimal sketch appears at the end of this step.</p>
</li>
</ul>
<p><strong>Requirements:</strong> An API key for your judge provider and the official SDK (for example <code>pip install openai</code>).</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> OpenAI
<span class="hljs-keyword">import</span> json

client = OpenAI(api_key=<span class="hljs-string">"YOUR_API_KEY"</span>)  <span class="hljs-comment"># or load from env var</span>

<span class="hljs-comment"># Clear rubric keeps the judge consistent across runs</span>
PROMPT = <span class="hljs-string">"""
You are grading summaries on a scale of 1-5 for:
1. Correctness (alignment with the reference)
2. Conciseness (brevity and clarity)
3. Helpfulness (coverage of key points)
Return a JSON object with the scores.
"""</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">evaluate</span>(<span class="hljs-params">candidate, reference</span>):</span>
    <span class="hljs-comment"># Provide both reference and candidate to the judge</span>
    msg = <span class="hljs-string">f"Reference: <span class="hljs-subst">{reference}</span>\nCandidate: <span class="hljs-subst">{candidate}</span>"</span>
    response = client.chat.completions.create(
        model=<span class="hljs-string">"judge-model"</span>,  <span class="hljs-comment"># keep the judge fixed for fair comparisons</span>
        response_format={<span class="hljs-string">"type"</span>: <span class="hljs-string">"json_object"</span>},  <span class="hljs-comment"># request parseable JSON (supported by OpenAI-style chat APIs)</span>
        messages=[
            {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: PROMPT},
            {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: msg}
        ]
    )
    <span class="hljs-comment"># Judge returns JSON; parse into a Python dict</span>
    <span class="hljs-keyword">return</span> json.loads(response.choices[<span class="hljs-number">0</span>].message.content)
</code></pre>
<p>In the code above,</p>
<ul>
<li><p>The rubric in <code>PROMPT</code> defines scoring dimensions (for example: correctness, conciseness, helpfulness). The judge is instructed to return a JSON object.</p>
</li>
<li><p>For each candidate and its reference, the judge receives both strings and applies the rubric.</p>
</li>
<li><p>The judge’s JSON output is parsed with <code>json.loads(...)</code> and aggregated per model to compute averages or distributions.</p>
</li>
</ul>
<p>You can loop through models to automatically gather structured scores.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> statistics

<span class="hljs-keyword">for</span> model, data <span class="hljs-keyword">in</span> results.items():
    scores = [evaluate(cand, ref) <span class="hljs-keyword">for</span> cand, ref <span class="hljs-keyword">in</span> zip(data[<span class="hljs-string">"outputs"</span>], references)]
    results[model][<span class="hljs-string">"scores"</span>] = scores

    avg = {k: statistics.mean([s[k] <span class="hljs-keyword">for</span> s <span class="hljs-keyword">in</span> scores]) <span class="hljs-keyword">for</span> k <span class="hljs-keyword">in</span> scores[<span class="hljs-number">0</span>]}
    print(<span class="hljs-string">f"\n<span class="hljs-subst">{model}</span> Average Scores:"</span>)
    <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> avg.items():
        print(<span class="hljs-string">f"  <span class="hljs-subst">{k}</span>: <span class="hljs-subst">{v:<span class="hljs-number">.2</span>f}</span>"</span>)
</code></pre>
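<p>For tie-breakers, pairwise judging layers neatly on top of the same fixed judge. The sketch below reuses the <code>client</code> from the code above; the prompt wording and the single-letter reply convention are assumptions to adapt to your judge model:</p>
<pre><code class="lang-python">PAIRWISE_PROMPT = (
    "You compare two candidate summaries against a reference. "
    "Reply with exactly one letter: A if candidate A is better, B otherwise."
)

def judge_pair(candidate_a, candidate_b, reference):
    # Same fixed judge model as the pointwise rubric
    msg = (
        f"Reference: {reference}\n"
        f"Candidate A: {candidate_a}\n"
        f"Candidate B: {candidate_b}"
    )
    response = client.chat.completions.create(
        model="judge-model",
        messages=[
            {"role": "system", "content": PAIRWISE_PROMPT},
            {"role": "user", "content": msg},
        ],
    )
    # Normalize the reply to a single "A" or "B"
    return response.choices[0].message.content.strip().upper()[:1]

# Example: judge_pair(results["model-A"]["outputs"][0],
#                     results["model-B"]["outputs"][0], references[0])
</code></pre>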
<h2 id="heading-step-4-analyze-visualize-and-interpret">Step 4: Analyze, Visualize, and Interpret</h2>
<p>Our goal here is to turn raw numbers and judge scores into actionable insight. Visualization exposes trade-offs (cost vs. quality vs. latency), highlights variance and outliers, and helps you pick the model that best fits your constraints.</p>
<p>What to visualize and why:</p>
<ul>
<li><p>Latency bars: compare average response time per model. Good for quick performance triage.</p>
</li>
<li><p>Cost bars: cost per 1,000 requests. Makes budget trade-offs visible.</p>
</li>
<li><p>Quality distributions: box plots or histograms of judge scores. Shows variance and outliers.</p>
</li>
<li><p>Quality vs cost scatter: quickly surfaces Pareto-efficient choices.</p>
</li>
<li><p>Confusion matrices: for classification tasks. Shows where models disagree with ground truth.</p>
</li>
<li><p>Radar charts: helpful when comparing 3 to 6 metrics across models at once.</p>
</li>
</ul>
<p>The code below builds a simple bar chart from a <code>results</code> dictionary: <code>models_list</code> provides x-axis labels and <code>latencies</code> maps to the bar heights in seconds. You can replicate the pattern for cost or judge-based scores by swapping the y-values.</p>
<p><strong>Requirements:</strong> <code>matplotlib</code> installed (<code>pip install matplotlib</code>).</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

latencies = [results[m][<span class="hljs-string">'latency'</span>] <span class="hljs-keyword">for</span> m <span class="hljs-keyword">in</span> results]
models_list = list(results.keys())

plt.bar(models_list, latencies)  <span class="hljs-comment"># simple bar chart; add styling if needed</span>
plt.title(<span class="hljs-string">'Model Latency Comparison'</span>)
plt.ylabel(<span class="hljs-string">'Seconds'</span>)
plt.show()
</code></pre>
<p>The chart is embedded here for reference:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762499865751/d3ebd40e-e8f4-44c8-b953-612d1604a8ca.png" alt="Bar chart comparing model latency." class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Figure: Model latency comparison (seconds per batch).</p>
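<p>The same pattern extends to a quality-vs-cost scatter, reusing <code>results</code> and the <code>statistics</code> import from Step 3. The cost figures below are placeholders, and the <code>"Correctness"</code> key assumes that is what your judge's JSON returned; adjust both to your setup:</p>
<pre><code class="lang-python"># Hypothetical cost per 1,000 requests; substitute your real figures
costs = {"model-A": 0.50, "model-B": 1.20, "model-C": 2.00}

for m, data in results.items():
    # Assumes the judge's JSON used a "Correctness" key; adjust to your rubric
    quality = statistics.mean(s["Correctness"] for s in data["scores"])
    plt.scatter(costs[m], quality)
    plt.annotate(m, (costs[m], quality))
plt.xlabel("Cost per 1,000 requests (USD)")
plt.ylabel("Average correctness (1-5)")
plt.title("Quality vs. Cost")
plt.show()
</code></pre>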
<p><strong>Reflection Question:</strong> Which metric matters most for your use case: accuracy, speed, or cost?</p>
<h2 id="heading-step-5-iterate-and-scale-up">Step 5: Iterate and Scale Up</h2>
<p>At this point, we’ll move from small experiments to a repeatable, automated evaluation pipeline that can run at scale, track regression, and integrate with monitoring and CI. This step is about operationalizing the evaluation so you can confidently detect when a model update helps or harms your product.</p>
<p>Evaluation flow (high level):</p>
<ol>
<li><p><strong>Dataset (JSONL)</strong>: a versioned test set with metadata (category, difficulty).</p>
</li>
<li><p><strong>Prompt templates</strong>: standardized prompts or templates applied uniformly across models.</p>
</li>
<li><p><strong>Model runners</strong>: parallel execution across a pool of models (cloud APIs or local hosts).</p>
</li>
<li><p><strong>Judge + Metrics</strong>: compute structured scores (judge JSON) and classical metrics (accuracy, F1).</p>
</li>
<li><p><strong>Storage &amp; dashboards</strong>: persist results, visualize trends, alert on regressions.</p>
</li>
</ol>
<p>Having this explicit flow helps you choose tooling. Below are two representative frameworks and how they map to the flow so you can see which stages they help with.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762499923039/a8fb6a38-5046-488e-8c15-3783c8d5dab9.png" alt="Circular diagram showing Data feeding Models, Models feeding Judge, Judge producing Insights, Insights leading to Refinement, and Refinement feeding back into Data." class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Figure: Evaluation pipeline – Data → Models → Judge → Insights → Refinement → Data</p>
<p>Some examples that map to the flow are:</p>
<ul>
<li><p><strong>AWS FMEval</strong>: focuses on large-scale evaluation and experiment tracking. It covers dataset adapters, parallel model runners, built-in metrics, and native integration with AWS experiment storage and dashboards. Use it when your data lives on cloud storage and you want tight Bedrock or AWS integration for production evaluation runs.</p>
</li>
<li><p><strong>LangChain Eval</strong>: focuses on tight integration with application pipelines. It covers prompt templates, judge and metric hooks, and easy programmatic evaluators that plug directly into LangChain-based model runners. Use it when your evaluation should be embedded in development pipelines or when you already use LangChain for orchestration.</p>
</li>
</ul>
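<p>The snippet below sketches what an FMEval-style run looks like. Treat it as schematic rather than drop-in: the library's actual import paths, class names, and runner setup vary by version, so map these steps onto the current FMEval documentation:</p>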
<pre><code class="lang-python"><span class="hljs-keyword">from</span> fmeval <span class="hljs-keyword">import</span> DataConfig, ModelRunner, EvaluationSet

cfg = DataConfig(dataset_uri=<span class="hljs-string">"s3://your-dataset/reviews.jsonl"</span>)  <span class="hljs-comment"># JSONL test set</span>
runner = ModelRunner(model_id=<span class="hljs-string">"model-id"</span>)       <span class="hljs-comment"># pick a model to evaluate</span>
eval_set = EvaluationSet(config=cfg, runner=runner)
<span class="hljs-comment"># Run evaluation with a simple metric; swap in your custom metric as needed</span>
eval_set.evaluate(metric=<span class="hljs-string">"accuracy"</span>)
<span class="hljs-comment"># Persist results for dashboards or regression tracking</span>
eval_set.save(<span class="hljs-string">"./results.json"</span>)
</code></pre>
<p>You’ll want to schedule evaluations and track drift regularly – for example, nightly or weekly evals on a fixed test set. Send an alert when a model update drops a score or increases latency beyond a threshold.</p>
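<p>A minimal sketch of such a check, assuming you persist per-run aggregates as JSON with <code>avg_score</code> and <code>avg_latency</code> fields (hypothetical names) and using illustrative thresholds:</p>
<pre><code class="lang-python">import json

# Illustrative thresholds; tune to your product's tolerances
THRESHOLDS = {"score_drop": 0.3, "latency_increase_s": 1.0}

def check_regression(baseline_path, current_path):
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)
    alerts = []
    if baseline["avg_score"] - current["avg_score"] &gt; THRESHOLDS["score_drop"]:
        alerts.append("quality regression")
    if current["avg_latency"] - baseline["avg_latency"] &gt; THRESHOLDS["latency_increase_s"]:
        alerts.append("latency regression")
    return alerts  # wire this into email, Slack, or a CI failure
</code></pre>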
<h2 id="heading-preparing-a-test-dataset">Preparing a Test Dataset</h2>
<p>A well-prepared test dataset is the foundation of reliable model evaluation. Here are a few best practices, followed by a concrete example:</p>
<ul>
<li><p>Reflect real use cases: use authentic data from your domain such as customer queries, logs, or user reviews.</p>
</li>
<li><p>Diversify examples: include easy, typical, and edge-case scenarios to measure robustness.</p>
</li>
<li><p>Expert annotation: have domain experts provide clear reference outputs or ground truth labels.</p>
</li>
<li><p>Keep it separate: ensure the test dataset is not reused from training or fine-tuning.</p>
</li>
<li><p>Update regularly: add new examples to reflect changing user behavior or data drift.</p>
</li>
<li><p>Version everything: track dataset versions, annotation changes, and evaluation notes.</p>
</li>
<li><p>Quality over quantity: start small but ensure examples are accurate and representative.</p>
</li>
</ul>
<h3 id="heading-small-jsonl-test-set">Small JSONL test set</h3>
<p>Create a line-delimited JSON (JSONL) file where each line is a JSON object with two required fields: <code>input</code> (the prompt) and <code>reference</code> (the expected output). This simple, tooling-friendly format is accepted by most evaluation frameworks and is easy to version, diff, and slice.</p>
<p>Optionally add metadata fields such as <code>category</code>, <code>difficulty</code>, or <code>source</code> to enable filtered analysis and targeted slicing during evaluation.</p>
<pre><code class="lang-python">{<span class="hljs-string">"input"</span>: <span class="hljs-string">"The camera quality is great but the battery dies fast."</span>, <span class="hljs-string">"reference"</span>: <span class="hljs-string">"Good camera, poor battery."</span>}
{<span class="hljs-string">"input"</span>: <span class="hljs-string">"Love the design and performance, but it's overpriced."</span>, <span class="hljs-string">"reference"</span>: <span class="hljs-string">"Excellent design but expensive."</span>}
{<span class="hljs-string">"input"</span>: <span class="hljs-string">"Fast processor, poor sound quality, average screen."</span>, <span class="hljs-string">"reference"</span>: <span class="hljs-string">"Fast but weak audio and display."</span>}
</code></pre>
<p>Helper script to produce JSONL:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> json
samples = [
    {<span class="hljs-string">"input"</span>: <span class="hljs-string">"The camera quality is great but the battery dies fast."</span>, <span class="hljs-string">"reference"</span>: <span class="hljs-string">"Good camera, poor battery."</span>},
    {<span class="hljs-string">"input"</span>: <span class="hljs-string">"Love the design and performance, but it's overpriced."</span>, <span class="hljs-string">"reference"</span>: <span class="hljs-string">"Excellent design but expensive."</span>},
    {<span class="hljs-string">"input"</span>: <span class="hljs-string">"Fast processor, poor sound quality, average screen."</span>, <span class="hljs-string">"reference"</span>: <span class="hljs-string">"Fast but weak audio and display."</span>}
]
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"reviews_test.jsonl"</span>, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> f:
    <span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> samples:
        f.write(json.dumps(row) + <span class="hljs-string">"\n"</span>)
print(<span class="hljs-string">"Wrote reviews_test.jsonl"</span>)
</code></pre>
<p>You can add fields like <code>category</code> or <code>difficulty</code> to filter and slice results later.</p>
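<p>For example, with a <code>difficulty</code> field added to each row, slicing becomes a one-liner with pandas:</p>
<pre><code class="lang-python">import pandas as pd

df = pd.read_json("reviews_test.jsonl", lines=True)
# Assumes each row carries a "difficulty" metadata field
hard_cases = df[df["difficulty"] == "hard"]
print(f"{len(hard_cases)} hard examples")
</code></pre>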
<p>Even a compact, well-designed test set can highlight major model differences and guide better deployment decisions.</p>
<h2 id="heading-cloud-providers-and-apis-for-llm-access">Cloud Providers and APIs for LLM Access</h2>
<p>Before you can benchmark different large language models, you need reliable ways to access them. Most LLMs are hosted behind APIs or cloud platforms that expose standard interfaces for sending prompts and receiving outputs. Choosing the right provider affects not only <em>which</em> models you can test, but also your results for latency, throughput, and cost.</p>
<p>Now, we’ll look at some of the main options for accessing LLMs. These range from commercial APIs like OpenAI and Anthropic, to open-source options like Hugging Face, and enterprise platforms like AWS Bedrock and Azure OpenAI.</p>
<p>Understanding these platforms will help you design realistic benchmarks that reflect the infrastructure you’ll actually deploy in production.</p>
<ul>
<li><p><strong>OpenAI and Anthropic:</strong> Reliable APIs offering strong reasoning and creative models.</p>
</li>
<li><p><strong>Google Gemini and Cohere:</strong> Strong multimodal and enterprise-friendly options.</p>
</li>
<li><p><strong>OpenRouter:</strong> Simplifies access to multiple providers with a single API key.</p>
</li>
<li><p><strong>Hugging Face:</strong> Great for open-source experimentation and deployment flexibility.</p>
</li>
<li><p><strong>AWS Bedrock and Azure OpenAI:</strong> Enterprise-grade platforms with security, compliance, and scalability.</p>
</li>
</ul>
<p>Use a unified testing approach for flexible experiments and a production cloud provider when you need compliance and scalability.</p>
<p>Once you’ve decided where to source your models, you can run consistent benchmarks across providers using a unified API interface. This helps make sure your comparisons reflect real deployment conditions.</p>
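<p>In practice, that interface can be as thin as one helper targeting any OpenAI-compatible endpoint. The function below is an illustrative sketch (its name and parameters are ours, not a library convention):</p>
<pre><code class="lang-python">from openai import OpenAI

def get_completion(base_url, api_key, model, prompt):
    # Works with any OpenAI-compatible endpoint, e.g. OpenRouter or a local server
    client = OpenAI(base_url=base_url, api_key=api_key)
    res = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return res.choices[0].message.content
</code></pre>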
<h2 id="heading-common-pitfalls-to-avoid">Common Pitfalls to Avoid</h2>
<p>Below are five common mistakes, why they matter, and what to do instead. Keep this checklist handy when designing experiments or reviewing results.</p>
<p><strong>1. Using the same model as both generator and judge</strong><br>Shared biases inflate scores and hide errors. Instead, you can use a separate judge (different provider, family, or size) and keep the judge fixed across runs.</p>
<p><strong>2. Relying only on aggregate numbers</strong><br>Averages hide tone, factuality issues, and edge-case failures. Instead, you should maintain a curated error-analysis set and do periodic manual spot checks.</p>
<p><strong>3. Ignoring latency and cost</strong><br>A high-scoring model may be too slow or expensive for production SLAs. Instead, you can track latency distributions and projected monthly cost alongside quality metrics.</p>
<p><strong>4. Not versioning datasets or prompts</strong><br>Silent changes break comparability and reproducibility. Make sure you store datasets and prompt templates in version control and log run metadata and data hashes for every evaluation.</p>
<p><strong>5. Overfitting to the test set</strong><br>Repeated tuning on a tiny set undermines generalization. Instead, keep a holdout set, rotate or refresh samples, and expand the dataset over time.</p>
<h2 id="heading-conclusion-turning-evaluation-into-insight">Conclusion: Turning Evaluation into Insight</h2>
<p>Benchmarking helps you score models as well as understand them. Through this workflow, you’ve seen how to:</p>
<ol>
<li><p>Define tasks and meaningful metrics.</p>
</li>
<li><p>Generate model outputs programmatically.</p>
</li>
<li><p>Evaluate using a judge model for consistency.</p>
</li>
<li><p>Visualize trade-offs to make data-driven choices.</p>
</li>
</ol>
<p>As models evolve, your benchmarking pipeline becomes a living system. It helps you track progress, validate improvements, and justify decisions with evidence.</p>
<p>Choosing an LLM is no longer guesswork. It’s now a structured experiment grounded in real data. Each iteration builds intuition and confidence. Over time, you’ll know not just which model performs best, but <em>why</em>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build Your Own Private Voice Assistant: A Step-by-Step Guide Using Open-Source Tools ]]>
                </title>
                <description>
                    <![CDATA[ Most commercial voice assistants send your voice data to cloud servers before responding. By using open‑source tools, you can run everything directly on your phone for better privacy, faster responses, and full control over how the assistant behaves.... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/private-voice-assistant-using-open-source-tools/</link>
                <guid isPermaLink="false">690bcbbc8abe1e0a5b05e0be</guid>
                
                    <category>
                        <![CDATA[ Voice ]]>
                    </category>
                
                    <category>
                        <![CDATA[ voice assistants ]]>
                    </category>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Personalization  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tool calling ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ on-device ai ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Surya Teja Appini ]]>
                </dc:creator>
                <pubDate>Wed, 05 Nov 2025 22:12:12 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762380694991/10687751-7aec-4d78-8af8-1f76edc28afd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most commercial voice assistants send your voice data to cloud servers before responding. By using open‑source tools, you can run everything directly on your phone for better privacy, faster responses, and full control over how the assistant behaves.</p>
<p>In this tutorial, I’ll walk you through the process step-by-step. You don’t need prior experience with machine learning models, as we’ll build up the system gradually and test each part as we go. By the end, you will have a fully local mobile voice assistant powered by:</p>
<ul>
<li><p>Whisper for Automatic Speech Recognition (ASR)</p>
</li>
<li><p>Machine Learning Compiler (MLC) LLM for on-device reasoning</p>
</li>
<li><p>System Text-to-Speech (TTS) using built-in Android TTS</p>
</li>
</ul>
<p>Your assistant will be able to:</p>
<ul>
<li><p>Understand your voice commands offline</p>
</li>
<li><p>Respond to you with synthesized speech</p>
</li>
<li><p>Perform tool calling actions (such as controlling smart devices)</p>
</li>
<li><p>Store personal memories and preferences</p>
</li>
<li><p>Use Retrieval-Augmented Generation (RAG) to answer questions from your own notes</p>
</li>
<li><p>Perform multi-step agentic workflows such as generating a morning briefing and optionally sending the summary to a contact</p>
</li>
</ul>
<p>This tutorial focuses on Android using Termux (the terminal environment for Android) for a fully local workflow.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-system-overview">System Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-requirements">Requirements</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-test-microphone-and-audio-playback-on-android">Step 1: Test Microphone and Audio Playback on Android</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-install-and-run-whisper-for-asr">Step 2: Install and Run Whisper for ASR</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-install-a-local-llm-with-mlc">Step 3: Install a Local LLM with MLC</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-local-text-to-speech-tts">Step 4: Local Text-to-Speech (TTS)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-the-core-voice-loop">Step 5: The Core Voice Loop</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-tool-calling-make-it-act">Step 6: Tool Calling (Make It Act)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-memory-and-personalization">Step 7: Memory and Personalization</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-8-retrieval-augmented-generation-rag">Step 8: Retrieval-Augmented Generation (RAG)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-9-multi-step-agentic-workflow">Step 9: Multi-Step Agentic Workflow</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion-and-next-steps">Conclusion and Next Steps</a></p>
</li>
</ul>
<h2 id="heading-system-overview"><strong>System Overview</strong></h2>
<p>This diagram shows how your voice moves through the assistant: speech in → transcription → reasoning → action → spoken reply.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762319872832/7b52b715-79c0-4c92-b431-b84c49ba7299.png" alt="7b52b715-79c0-4c92-b431-b84c49ba7299" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This pipeline describes the core flow:</p>
<ul>
<li><p>You speak into the microphone.</p>
</li>
<li><p>Whisper converts audio into text.</p>
</li>
<li><p>The local LLM interprets your request.</p>
</li>
<li><p>The assistant may call tools (for example, send notifications or create events).</p>
</li>
<li><p>The response is spoken aloud using the device’s Text-to-Speech system.</p>
</li>
</ul>
<h3 id="heading-key-concepts-used-in-this-tutorial">Key Concepts Used in This Tutorial</h3>
<ul>
<li><p><strong>Automatic Speech Recognition (ASR):</strong> Converts your speech into text. We use Whisper or Faster‑Whisper.</p>
</li>
<li><p><strong>Local Large Language Model (LLM):</strong> A reasoning model running on your phone using the MLC engine.</p>
</li>
<li><p><strong>Text‑to‑Speech (TTS):</strong> Converts text back to speech. We use Android’s built‑in system TTS.</p>
</li>
<li><p><strong>Tool Calling:</strong> Allows the assistant to perform actions (for example, sending a notification or creating an event).</p>
</li>
<li><p><strong>Memory:</strong> Stores personalized facts the assistant learns during conversation.</p>
</li>
<li><p><strong>Retrieval‑Augmented Generation (RAG):</strong> Lets the assistant reference your documents or notes.</p>
</li>
<li><p><strong>Agent Workflow:</strong> A multi‑step chain where the assistant uses multiple abilities together.</p>
</li>
</ul>
<h2 id="heading-requirements">Requirements</h2>
<p>What you should already be familiar with:</p>
<ul>
<li><p>Basic command line usage (running commands, navigating directories)</p>
</li>
<li><p>Very basic Python (calling a function, editing a <code>.py</code> script)</p>
</li>
</ul>
<p>You do <strong>not</strong> need to have:</p>
<ul>
<li><p>Machine learning experience</p>
</li>
<li><p>A deep understanding of neural networks</p>
</li>
<li><p>Prior experience with speech or audio models</p>
</li>
</ul>
<p>Here are the tools and technologies you’ll need to follow along:</p>
<ul>
<li><p>An Android phone with Snapdragon 8+ Gen 1 or newer recommended (older devices will still work, but responses may be slower)</p>
</li>
<li><p>Termux</p>
</li>
<li><p>Python 3.9+ inside Termux</p>
</li>
<li><p>Enough free storage (at least 4–6 GB) to store the model and audio files</p>
</li>
</ul>
<p><strong>Why these requirements matter:</strong></p>
<p>Whisper and Llama models run on-device, so the phone must handle real‑time compute. MLC optimizes models for your device's GPU / NPU, so newer processors will run faster and cooler. And system TTS and Termux APIs let the assistant speak and interact with the phone locally.</p>
<p>If your phone is older or mid‑range, switch the model in Step 3 to <code>Phi-3.5-Mini</code> which is smaller and faster.</p>
<p>We’ll start by setting up your Android environment with Termux, Python, media access, and storage permissions so later steps can record audio, run models, and speak.</p>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># In Termux</span>
pkg update &amp;&amp; pkg upgrade -y
pkg install -y python git ffmpeg termux-api
termux-setup-storage  <span class="hljs-comment"># grant storage permission</span>
</code></pre>
<h2 id="heading-step-1-test-microphone-and-audio-playback-on-android">Step 1: Test Microphone and Audio Playback on Android</h2>
<p><strong>What this step does:</strong> Verifies that your device microphone and speakers work correctly through Termux before connecting them to the voice assistant.</p>
<p>On-device assistants need reliable access to the microphone and speakers. On Android, Termux provides utilities to record audio and play media. This avoids complex audio dependencies and works on more devices.</p>
<p>These commands let you quickly test your microphone and audio playback without writing any code. This is useful to verify that your device permissions and audio paths are working before introducing Whisper or TTS.</p>
<ul>
<li><p><code>termux-microphone-record</code> records from the device microphone to a <code>.wav</code> file</p>
</li>
<li><p><code>termux-media-player</code> plays audio files</p>
</li>
<li><p><code>termux-tts-speak</code> speaks text using the system TTS voice (fast fallback)</p>
</li>
</ul>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Start a 4 second recording</span>
termux-microphone-record -f <span class="hljs-keyword">in</span>.wav -l <span class="hljs-number">4</span> &amp;&amp; termux-microphone-record -q

<span class="hljs-comment"># Play back the captured audio</span>
termux-media-player play <span class="hljs-keyword">in</span>.wav

<span class="hljs-comment"># Speak text via system TTS (fallback if you do not install a Python TTS)</span>
termux-tts-speak <span class="hljs-string">"Hello, this is your on-device assistant running locally."</span>
</code></pre>
<h2 id="heading-step-2-install-and-run-whisper-for-asr">Step 2: Install and Run Whisper for ASR</h2>
<p><strong>What this step does:</strong> Converts recorded speech into text so the language model can understand what you said.</p>
<p>Whisper listens to your audio recording and converts it into text. Smaller versions like <code>tiny</code> or <code>base</code> run faster on most phones and are good enough for everyday commands.</p>
<p>Install Whisper:</p>
<pre><code class="lang-python">pip install openai-whisper
</code></pre>
<p>If you run into installation issues, you can use Faster‑Whisper instead:</p>
<pre><code class="lang-python">pip install faster-whisper
</code></pre>
<p>Below is a small Python script that takes the recorded audio file and turns it into text. It tries Whisper first, and if that isn’t available, it will automatically fall back to Faster‑Whisper.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Convert recorded speech to text (asr_transcribe.py)</span>
<span class="hljs-keyword">import</span> sys

<span class="hljs-comment"># Try Whisper, fallback to Faster-Whisper if needed</span>
<span class="hljs-keyword">try</span>:
    <span class="hljs-keyword">import</span> whisper
    use_faster = <span class="hljs-literal">False</span>
<span class="hljs-keyword">except</span> Exception:
    use_faster = <span class="hljs-literal">True</span>

<span class="hljs-keyword">if</span> use_faster:
    <span class="hljs-keyword">from</span> faster_whisper <span class="hljs-keyword">import</span> WhisperModel
    model = WhisperModel(<span class="hljs-string">"tiny.en"</span>)
    segments, info = model.transcribe(sys.argv[<span class="hljs-number">1</span>])
    text = <span class="hljs-string">" "</span>.join(s.text <span class="hljs-keyword">for</span> s <span class="hljs-keyword">in</span> segments)
    print(text.strip())
<span class="hljs-keyword">else</span>:
    model = whisper.load_model(<span class="hljs-string">"tiny.en"</span>)
    result = model.transcribe(sys.argv[<span class="hljs-number">1</span>], fp16=<span class="hljs-literal">False</span>)
    print(result[<span class="hljs-string">"text"</span>].strip())
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Record 4 seconds and transcribe</span>
termux-microphone-record -f <span class="hljs-keyword">in</span>.wav -l <span class="hljs-number">4</span> &amp;&amp; termux-microphone-record -q
python asr_transcribe.py <span class="hljs-keyword">in</span>.wav
</code></pre>
<h2 id="heading-step-3-install-a-local-llm-with-mlc">Step 3: Install a Local LLM with MLC</h2>
<p><strong>What this step does:</strong> Installs and tests the on-device reasoning model that will generate responses to transcribed speech.</p>
<p>MLC compiles transformer models to mobile GPUs and Neural Processing Units, enabling on-device inference. You will run an instruction-tuned model with 4-bit or 8-bit weights for speed.</p>
<p>Install the command-line interface like this:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Clone and install Python bindings (for scripting) and CLI</span>
git clone https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm
pip install -r requirements.txt
pip install -e python
</code></pre>
<p>We will use <strong>Llama 3 8B Instruct q4</strong> because it offers strong reasoning while still running on many recent Android devices. If your phone has less memory or you want faster responses, you can swap in <strong>Phi-3.5 Mini</strong> (about 3.8B) without changing any code.</p>
<p>Download a mobile-optimized model (the exact subcommand and model ID can vary across MLC releases, so check the current docs if this fails):</p>
<pre><code class="lang-bash">mlc_llm download Llama-3-8B-Instruct-q4f16_1
</code></pre>
<p>We will use a short Python script to send text to the model and print the response. This lets us verify that the model is installed correctly before we connect it to audio.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Local LLM text generation (local_llm.py)</span>
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
<span class="hljs-keyword">import</span> sys

engine = MLCEngine(model=<span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>)
prompt = sys.argv[<span class="hljs-number">1</span>] <span class="hljs-keyword">if</span> len(sys.argv) &gt; <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"Hello"</span>
resp = engine.chat([{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: prompt}])
<span class="hljs-comment"># The engine may return different structures across versions</span>
reply_text = resp.get(<span class="hljs-string">"message"</span>, resp) <span class="hljs-keyword">if</span> isinstance(resp, dict) <span class="hljs-keyword">else</span> str(resp)
print(reply_text)
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">python local_llm.py <span class="hljs-string">"Summarize this in one sentence: building a local voice assistant on Android"</span>
</code></pre>
<h2 id="heading-step-4-local-text-to-speech-tts">Step 4: Local Text-to-Speech (TTS)</h2>
<p><strong>What this step does:</strong> Turns the model’s text responses into spoken audio so the assistant can talk back.</p>
<p>It uses the built-in Android Text-to-Speech voice and requires no additional Python packages.</p>
<pre><code class="lang-python">termux-tts-speak <span class="hljs-string">"Hello, I am running entirely on your device."</span>
</code></pre>
<p>This is the voice output method we will use throughout the tutorial.</p>
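<p>Later steps trigger speech from Python scripts, so it helps to wrap this command in a tiny helper. Below is a minimal sketch (a hypothetical <code>speak.py</code>, not part of Termux itself):</p>
<pre><code class="lang-python"># Hypothetical helper (speak.py): shells out to Android's system TTS via Termux
import subprocess

def speak(text: str) -&gt; None:
    subprocess.run(["termux-tts-speak", text], check=True)

if __name__ == "__main__":
    speak("Hello, I am running entirely on your device.")
</code></pre>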
<h2 id="heading-step-5-the-core-voice-loop">Step 5: The Core Voice Loop</h2>
<p><strong>What this step does:</strong> Connects speech recognition, language model reasoning, and speech synthesis into a single interactive conversation loop.</p>
<p>This loop ties together recording, transcription, response generation, and playback.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Core voice loop tying ASR + LLM + TTS (voice_loop.py)</span>
<span class="hljs-keyword">import</span> subprocess, os

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run</span>(<span class="hljs-params">cmd</span>):</span> <span class="hljs-keyword">return</span> subprocess.check_output(cmd).decode().strip()

print(<span class="hljs-string">"Listening..."</span>)
subprocess.run([<span class="hljs-string">"termux-microphone-record"</span>, <span class="hljs-string">"-f"</span>, <span class="hljs-string">"in.wav"</span>, <span class="hljs-string">"-l"</span>, <span class="hljs-string">"4"</span>]) ; subprocess.run([<span class="hljs-string">"termux-microphone-record"</span>, <span class="hljs-string">"-q"</span>])
text = run([<span class="hljs-string">"python"</span>, <span class="hljs-string">"asr_transcribe.py"</span>, <span class="hljs-string">"in.wav"</span>])
reply = run([<span class="hljs-string">"python"</span>, <span class="hljs-string">"local_llm.py"</span>, text])
<span class="hljs-keyword">try</span>:
    subprocess.run([<span class="hljs-string">"python"</span>, <span class="hljs-string">"speak_xtts.py"</span>, reply]); subprocess.run([<span class="hljs-string">"termux-media-player"</span>, <span class="hljs-string">"play"</span>, <span class="hljs-string">"out.wav"</span>])
<span class="hljs-keyword">except</span>:
    subprocess.run([<span class="hljs-string">"termux-tts-speak"</span>, reply])
</code></pre>
<p>Run:</p>
<pre><code class="lang-python">python voice_loop.py
</code></pre>
<h2 id="heading-step-6-tool-calling-make-it-act">Step 6: Tool Calling (Make It Act)</h2>
<p><strong>What this step does:</strong> Enables the assistant to perform actions – not just reply – by calling real functions on your device.</p>
<p>Tool calling lets the assistant perform actions, not just answer. You show the model which tools exist and how to call them; when it recognizes an action request, it emits a small JSON instruction, and your program intercepts that JSON and runs the corresponding function.</p>
<p><strong>Example use case:</strong></p>
<p>You say: <em>"Schedule a meeting tomorrow at 3 PM with John."</em></p>
<p>The assistant:</p>
<ol>
<li><p>Transcribes what you said.</p>
</li>
<li><p>Detects that this is not a question, but an action request.</p>
</li>
<li><p>Calls the <code>add_event()</code> function with the correct parameters.</p>
</li>
<li><p>Confirms: <em>"Okay, I scheduled that."</em></p>
</li>
</ol>
<p>Here’s the structure of how tool calls will work:</p>
<ul>
<li><p>Define Python functions such as <code>add_event</code>, <code>control_light</code></p>
</li>
<li><p>Provide a schema for the model to output when it wants to call a tool</p>
</li>
<li><p>Detect that schema in the LLM output and execute the function</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># Tool calling functions (tools.py)</span>
<span class="hljs-keyword">import</span> json

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">add_event</span>(<span class="hljs-params">title: str, date: str</span>) -&gt; dict:</span>
    <span class="hljs-comment"># Replace with actual calendar integration</span>
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"ok"</span>, <span class="hljs-string">"title"</span>: title, <span class="hljs-string">"date"</span>: date}

TOOLS = {
    <span class="hljs-string">"add_event"</span>: add_event,
}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_tool</span>(<span class="hljs-params">call_json: str</span>) -&gt; str:</span>
    <span class="hljs-string">"""call_json: '{"tool":"add_event","args":{"title":"Dentist","date":"2025-11-10 10:00"}}'"""</span>
    data = json.loads(call_json)
    name = data[<span class="hljs-string">"tool"</span>]
    args = data.get(<span class="hljs-string">"args"</span>, {})
    <span class="hljs-keyword">if</span> name <span class="hljs-keyword">in</span> TOOLS:
        result = TOOLS[name](**args)
        <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"tool_result"</span>: result})
    <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"error"</span>: <span class="hljs-string">"unknown tool"</span>})
</code></pre>
<p>Prompt the model to use tools:</p>
<pre><code class="lang-python"><span class="hljs-comment"># LLM wrapper enabling tool use (llm_with_tools.py)</span>
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
<span class="hljs-keyword">import</span> json, sys

SYSTEM = (
    <span class="hljs-string">"You can call tools by emitting a single JSON object with keys 'tool' and 'args'. "</span>
    <span class="hljs-string">"Available tools: add_event(title:str, date:str). "</span>
    <span class="hljs-string">"If no tool is needed, answer directly."</span>
)

engine = MLCEngine(model=<span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>)
user = sys.argv[<span class="hljs-number">1</span>]
resp = engine.chat([
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM},
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: user},
])
print(resp.get(<span class="hljs-string">"message"</span>, resp) <span class="hljs-keyword">if</span> isinstance(resp, dict) <span class="hljs-keyword">else</span> str(resp))
</code></pre>
<p>And then glue it together:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Run LLM with tool call detection (run_with_tools.py)</span>
<span class="hljs-keyword">import</span> subprocess, json
<span class="hljs-keyword">from</span> tools <span class="hljs-keyword">import</span> run_tool

user = <span class="hljs-string">"Add a dentist appointment next Thursday at 10"</span>
raw = subprocess.check_output([<span class="hljs-string">"python"</span>, <span class="hljs-string">"llm_with_tools.py"</span>, user]).decode().strip()

<span class="hljs-comment"># If the model returned a JSON tool call, run it</span>
<span class="hljs-keyword">try</span>:
    data = json.loads(raw)
    <span class="hljs-keyword">if</span> isinstance(data, dict) <span class="hljs-keyword">and</span> <span class="hljs-string">"tool"</span> <span class="hljs-keyword">in</span> data:
        print(<span class="hljs-string">"Tool call:"</span>, data)
        print(run_tool(raw))
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">"Assistant:"</span>, raw)
<span class="hljs-keyword">except</span> Exception:
    print(<span class="hljs-string">"Assistant:"</span>, raw)
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">python run_with_tools.py
</code></pre>
<h2 id="heading-step-7-memory-and-personalization">Step 7: Memory and Personalization</h2>
<p><strong>What this step does:</strong> Allows the assistant to remember personal information you share so conversations feel continuous and adaptive.</p>
<p>A helpful assistant should feel like it learns alongside you. Memory allows the system to keep track of small details you mention naturally in conversation.</p>
<p>Without memory, every conversation starts from scratch. With memory, your assistant can remember personal facts (for example, birthdays, favorite music), your routines, device settings, or notes you mention in conversation. This unlocks more natural interactions and enables personalization over time.</p>
<p>You can start with a simple key-value store and expand over time. Your program reads memory before inference and writes back new facts after.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Simple key-value memory store (memory.py)</span>
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path

MEM_PATH = Path(<span class="hljs-string">"memory.json"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">mem_load</span>():</span>
    <span class="hljs-keyword">return</span> json.loads(MEM_PATH.read_text()) <span class="hljs-keyword">if</span> MEM_PATH.exists() <span class="hljs-keyword">else</span> {}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">mem_save</span>(<span class="hljs-params">mem</span>):</span>
    MEM_PATH.write_text(json.dumps(mem, indent=<span class="hljs-number">2</span>))

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">remember</span>(<span class="hljs-params">key: str, value: str</span>):</span>
    mem = mem_load()
    mem[key] = value
    mem_save(mem)
</code></pre>
<p>Use memory in the loop:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Voice loop with memory loading and updating (voice_loop_with_memory.py)</span>
<span class="hljs-keyword">import</span> subprocess, json
<span class="hljs-keyword">from</span> memory <span class="hljs-keyword">import</span> mem_load, remember

<span class="hljs-comment"># 1) Record and transcribe</span>
subprocess.run([<span class="hljs-string">"termux-microphone-record"</span>, <span class="hljs-string">"-f"</span>, <span class="hljs-string">"in.wav"</span>, <span class="hljs-string">"-l"</span>, <span class="hljs-string">"4"</span>]) 
subprocess.run([<span class="hljs-string">"termux-microphone-record"</span>, <span class="hljs-string">"-q"</span>]) 
user_text = subprocess.check_output([<span class="hljs-string">"python"</span>, <span class="hljs-string">"asr_transcribe.py"</span>, <span class="hljs-string">"in.wav"</span>]).decode().strip()

<span class="hljs-comment"># 2) Load memory and add as system context</span>
mem = mem_load()
SYSTEM = <span class="hljs-string">"Known facts: "</span> + json.dumps(mem)

<span class="hljs-comment"># 3) Ask the model</span>
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
engine = MLCEngine(model=<span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>)
resp = engine.chat([
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM},
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: user_text},
])
reply = resp.get(<span class="hljs-string">"message"</span>, resp) <span class="hljs-keyword">if</span> isinstance(resp, dict) <span class="hljs-keyword">else</span> str(resp)
print(<span class="hljs-string">"Assistant:"</span>, reply)

<span class="hljs-comment"># 4) Very simple pattern: if the user said "remember X is Y", store it</span>
<span class="hljs-keyword">if</span> user_text.lower().startswith(<span class="hljs-string">"remember "</span>) <span class="hljs-keyword">and</span> <span class="hljs-string">" is "</span> <span class="hljs-keyword">in</span> user_text:
    k, v = user_text[<span class="hljs-number">9</span>:].split(<span class="hljs-string">" is "</span>, <span class="hljs-number">1</span>)
    remember(k.strip(), v.strip())
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">python voice_loop_with_memory.py
</code></pre>
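<p>As a quick check: thanks to the pattern at the end of the loop, saying <em>"remember my birthday is March 3"</em> writes straight into the store. A minimal sketch for inspecting it afterwards, assuming you run it from the same folder as <code>memory.json</code>:</p>
<pre><code class="lang-python"># Inspect what the assistant has remembered so far
from memory import mem_load

print(mem_load())  # for example: {"my birthday": "March 3"}
</code></pre>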
<h2 id="heading-step-8-retrieval-augmented-generation-rag">Step 8: Retrieval-Augmented Generation (RAG)</h2>
<p><strong>What this step does:</strong> Lets the assistant search your offline notes or documents at answer time, improving accuracy for personal tasks.</p>
<p>To use RAG, we first install a lightweight vector database, then add documents to it, and later query it when answering questions.</p>
<p>A language model cannot magically know details about your life, your work, or your files unless you give it a way to look things up.</p>
<p><a target="_blank" href="https://www.freecodecamp.org/news/learn-rag-fundamentals-and-advanced-techniques/">Retrieval-Augmented Generation (RAG)</a> bridges that gap. RAG allows the assistant to search your own stored data at query time. This means the assistant can answer questions about your projects, home details, travel plans, studies, or any personal documents you store completely offline.</p>
<p>In short, the assistant grounds its answers in your actual notes instead of relying only on what the model absorbed during training.</p>
<p>Install the vector store:</p>
<pre><code class="lang-python">pip install chromadb
</code></pre>
<p>Add and search your notes:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Local vector DB indexing and querying (rag.py)</span>
<span class="hljs-keyword">from</span> chromadb <span class="hljs-keyword">import</span> Client

client = Client()
notes = client.create_collection(<span class="hljs-string">"notes"</span>)

<span class="hljs-comment"># Add your documents (repeat as needed)</span>
notes.add(documents=[<span class="hljs-string">"Contractor quote was 42000 United States Dollars for the extension."</span>], ids=[<span class="hljs-string">"q1"</span>]) 

<span class="hljs-comment"># Query the local vector database</span>
results = notes.query(query_texts=[<span class="hljs-string">"extension quote"</span>], n_results=<span class="hljs-number">1</span>)
context = results[<span class="hljs-string">"documents"</span>][<span class="hljs-number">0</span>][<span class="hljs-number">0</span>]
print(context)
</code></pre>
<p>Use retrieved context in responses:</p>
<pre><code class="lang-python"><span class="hljs-comment"># LLM answering using retrieved context (llm_with_rag.py)</span>
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
<span class="hljs-keyword">from</span> chromadb <span class="hljs-keyword">import</span> Client

engine = MLCEngine(model=<span class="hljs-string">"Llama-3-8B-Instruct-q4f16_1"</span>)
client = Client()
notes = client.get_or_create_collection(<span class="hljs-string">"notes"</span>)

question = <span class="hljs-string">"What was the quoted amount for the home extension?"</span>
res = notes.query(query_texts=[question], n_results=<span class="hljs-number">2</span>)
ctx = <span class="hljs-string">"\n"</span>.join([d[<span class="hljs-number">0</span>] <span class="hljs-keyword">for</span> d <span class="hljs-keyword">in</span> res[<span class="hljs-string">"documents"</span>]])

SYSTEM = <span class="hljs-string">"Use the provided context to answer accurately. If missing, say you do not know.\nContext:\n"</span> + ctx
ans = engine.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": question},
    ],
)
print(ans.choices[0].message.content)
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">python rag.py
python llm_with_rag.py
</code></pre>
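<p>Adding one document at a time gets tedious fast. The sketch below bulk-indexes every <code>.txt</code> file from a <code>notes/</code> folder into the same collection; the folder name and the one-file-per-document chunking are assumptions you should adapt to your own notes:</p>
<pre><code class="lang-python"># Bulk-index plain-text notes into the persistent collection (illustrative)
from pathlib import Path
import chromadb

client = chromadb.PersistentClient(path="chroma_db")  # same store rag.py uses
notes = client.get_or_create_collection("notes")

for path in Path("notes").glob("*.txt"):
    # The file name doubles as a stable ID, so re-indexing updates instead of duplicating
    notes.upsert(documents=[path.read_text()], ids=[path.stem])
</code></pre>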
<h2 id="heading-step-9-multi-step-agentic-workflow">Step 9: Multi-Step Agentic Workflow</h2>
<p><strong>What this step does:</strong> Combines listening, reasoning, memory, and tool usage into a multi-step routine that runs automatically.</p>
<p>Now that the assistant can listen, respond, remember facts, and call tools, we can combine those abilities into a small routine that performs several steps automatically.</p>
<p><strong>Practical example: "Morning Briefing" on your phone</strong></p>
<p>Goal: when you say <em>"Give me my morning briefing and text it to my partner"</em>, the assistant will:</p>
<ol>
<li><p>Read today's agenda from a local file,</p>
</li>
<li><p>summarize it,</p>
</li>
<li><p>speak it aloud, and</p>
</li>
<li><p>send the summary via SMS using Termux.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762319593253/99e670d4-4934-47ce-a164-f0f7880ea80f.png" alt="Multi-step morning briefing workflow with retrieval, summary, speech output, and SMS action." class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><em>Diagram: Multi-step morning briefing workflow with retrieval, summary, speech output, and SMS action.</em></p>
<h3 id="heading-prepare-your-agenda-file">Prepare your agenda file</h3>
<p>This file stores your events for the day. You can edit it manually, generate it, or sync it later if you want.</p>
<p>Create <code>agenda.json</code> in the same folder:</p>
<pre><code class="lang-python">{
  <span class="hljs-string">"2025-11-03"</span>: [
    {<span class="hljs-string">"time"</span>: <span class="hljs-string">"09:30"</span>, <span class="hljs-string">"title"</span>: <span class="hljs-string">"Standup meeting"</span>},
    {<span class="hljs-string">"time"</span>: <span class="hljs-string">"13:00"</span>, <span class="hljs-string">"title"</span>: <span class="hljs-string">"Lunch with Priya"</span>},
    {<span class="hljs-string">"time"</span>: <span class="hljs-string">"16:30"</span>, <span class="hljs-string">"title"</span>: <span class="hljs-string">"Gym"</span>}
  ]
}
</code></pre>
<p>Phone-integrated tools for this workflow:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Phone-integrated agent tools (tools_phone.py)</span>
<span class="hljs-keyword">import</span> json, subprocess, datetime
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path

AGENDA_PATH = Path(<span class="hljs-string">"agenda.json"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_today_agenda</span>():</span>
    today = datetime.date.today().isoformat()
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> AGENDA_PATH.exists():
        <span class="hljs-keyword">return</span> []
    data = json.loads(AGENDA_PATH.read_text())
    <span class="hljs-keyword">return</span> data.get(today, [])

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">send_sms</span>(<span class="hljs-params">number: str, text: str</span>) -&gt; dict:</span>
    <span class="hljs-comment"># Requires Termux:API and SMS permission</span>
    subprocess.run([<span class="hljs-string">"termux-sms-send"</span>, <span class="hljs-string">"-n"</span>, number, text])
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"sent"</span>, <span class="hljs-string">"to"</span>: number}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">notify</span>(<span class="hljs-params">title: str, content: str</span>) -&gt; dict:</span>
    subprocess.run([<span class="hljs-string">"termux-notification"</span>, <span class="hljs-string">"--title"</span>, title, <span class="hljs-string">"--content"</span>, content])
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"notified"</span>}
</code></pre>
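<p>Note that <code>load_today_agenda</code> keys into the file by today's ISO date, so the date in <code>agenda.json</code> must match the current date on your phone. A quick sanity check before wiring the tools into the agent:</p>
<pre><code class="lang-python"># Confirm the agenda file is found and keyed to today's date
import datetime
from tools_phone import load_today_agenda

print(datetime.date.today().isoformat())  # the key agenda.json needs, e.g. 2025-11-03
print(load_today_agenda())                # should list today's events, not []
</code></pre>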
<p>Create the agent routine:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Multi-step morning briefing agent (agent_morning.py)</span>
<span class="hljs-keyword">import</span> json, subprocess, os
<span class="hljs-keyword">from</span> mlc_llm <span class="hljs-keyword">import</span> MLCEngine
<span class="hljs-keyword">from</span> tools_phone <span class="hljs-keyword">import</span> load_today_agenda, send_sms, notify

PARTNER_PHONE = os.environ.get(<span class="hljs-string">"PARTNER_PHONE"</span>, <span class="hljs-string">"+15551234567"</span>)

TOOLS = {
    <span class="hljs-string">"send_sms"</span>: send_sms,
    <span class="hljs-string">"notify"</span>: notify,
}

SYSTEM = (
  <span class="hljs-string">"You assist on a phone. You may emit a single-line JSON when an action is needed "</span>
  <span class="hljs-string">"with keys 'tool' and 'args'. Available tools: send_sms(number:str, text:str), "</span>
  <span class="hljs-string">"notify(title:str, content:str). Keep messages concise. If no tool is needed, answer in plain text."</span>
)

MODEL = "Llama-3-8B-Instruct-q4f16_1"
engine = MLCEngine(model=MODEL)

agenda = load_today_agenda()
agenda_text = "\n".join(f"{e['time']} - {e['title']}" for e in agenda) or "No events for today."

user_request = "Give me my morning briefing and text it to my partner."

<span class="hljs-comment"># 1) Ask LLM for a 2-3 sentence summary to speak</span>
summary = engine.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "Summarize this agenda in 2-3 sentences for a morning briefing:"},
        {"role": "user", "content": agenda_text},
    ],
)
summary_text = summary.choices[0].message.content
print("Briefing:\n", summary_text)

<span class="hljs-comment"># 2) Speak locally (prefer XTTS, fallback to system TTS)</span>
<span class="hljs-keyword">try</span>:
    subprocess.run([<span class="hljs-string">"python"</span>, <span class="hljs-string">"speak_xtts.py"</span>, summary_text], check=<span class="hljs-literal">True</span>)
    subprocess.run([<span class="hljs-string">"termux-media-player"</span>, <span class="hljs-string">"play"</span>, <span class="hljs-string">"out.wav"</span>]) 
<span class="hljs-keyword">except</span> Exception:
    subprocess.run([<span class="hljs-string">"termux-tts-speak"</span>, summary_text])

<span class="hljs-comment"># 3) Ask LLM whether to send SMS and with what text, using tool schema</span>
resp = engine.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"User said: '{user_request}'. Partner phone is {PARTNER_PHONE}. Summary: {summary_text}"},
    ],
)
msg = resp.choices[0].message.content

<span class="hljs-comment"># 4) If the model requested a tool, execute it</span>
<span class="hljs-keyword">try</span>:
    data = json.loads(msg)
    <span class="hljs-keyword">if</span> isinstance(data, dict) <span class="hljs-keyword">and</span> data.get(<span class="hljs-string">"tool"</span>) <span class="hljs-keyword">in</span> TOOLS:
        <span class="hljs-comment"># Auto-fill phone number if missing</span>
        <span class="hljs-keyword">if</span> data[<span class="hljs-string">"tool"</span>] == <span class="hljs-string">"send_sms"</span> <span class="hljs-keyword">and</span> <span class="hljs-string">"number"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> data.get(<span class="hljs-string">"args"</span>, {}):
            data.setdefault(<span class="hljs-string">"args"</span>, {})[<span class="hljs-string">"number"</span>] = PARTNER_PHONE
        result = TOOLS[data[<span class="hljs-string">"tool"</span>]](**data.get(<span class="hljs-string">"args"</span>, {}))
        print(<span class="hljs-string">"Tool result:"</span>, result)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">"Assistant:"</span>, msg)
<span class="hljs-keyword">except</span> Exception:
    print(<span class="hljs-string">"Assistant:"</span>, msg)
</code></pre>
<p><strong>Run it now:</strong></p>
<pre><code class="lang-python">export PARTNER_PHONE=+<span class="hljs-number">15551234567</span>
python agent_morning.py
</code></pre>
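<p>For reference, step 4 expects the model's entire reply to be a single JSON line naming a tool. A hypothetical well-formed reply would parse like this:</p>
<pre><code class="lang-python">import json

# Illustrative example of the one-line tool request the SYSTEM prompt describes
reply = '{"tool": "send_sms", "args": {"number": "+15551234567", "text": "Standup 09:30, lunch 13:00, gym 16:30."}}'
data = json.loads(reply)
print(data["tool"], data["args"]["number"])  # prints: send_sms +15551234567
</code></pre>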
<p>This example is realistic on Android because it uses Termux utilities you already installed: local TTS for speech output, <code>termux-sms-send</code> for messaging, and <code>termux-notification</code> for a quick on-device confirmation. You can extend it with a Home Assistant tool later if you have a local server (for example, to toggle lights or set thermostat scenes).</p>
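<p>As a taste of that extension, here is a hypothetical Home Assistant tool you could drop into <code>TOOLS</code>. It assumes a server reachable at <code>homeassistant.local:8123</code>, a long-lived access token created in your Home Assistant profile, and the <code>requests</code> package (<code>pip install requests</code>); adjust all three to your setup:</p>
<pre><code class="lang-python"># Hypothetical Home Assistant tool: toggle a light over the local network
import requests

HA_URL = "http://homeassistant.local:8123"  # your server address (assumption)
HA_TOKEN = "YOUR_LONG_LIVED_TOKEN"          # create one under your HA user profile

def toggle_light(entity_id: str) -&gt; dict:
    resp = requests.post(
        f"{HA_URL}/api/services/light/toggle",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={"entity_id": entity_id},
        timeout=10,
    )
    return {"status": resp.status_code}
</code></pre>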
<h2 id="heading-conclusion-and-next-steps">Conclusion and Next Steps</h2>
<p>Building a fully local voice assistant is an incremental process. Each step you added – speech recognition, text generation, memory, retrieval, and tool execution – unlocked new capabilities and moved the system closer to behaving like a real assistant.</p>
<p>You built a fully local voice assistant on your phone with:</p>
<ul>
<li><p>On-device Automatic Speech Recognition with Whisper (Faster-Whisper as a fallback)</p>
</li>
<li><p>On-device reasoning with an MLC-compiled Large Language Model</p>
</li>
<li><p>Local Text-to-Speech (XTTS when available, falling back to the system TTS)</p>
</li>
<li><p>Tool calling for real actions</p>
</li>
<li><p>Memory and personalization</p>
</li>
<li><p>Retrieval-Augmented Generation for document-based knowledge</p>
</li>
<li><p>A simple agent loop for multi-step work</p>
</li>
</ul>
<p>From here you can add:</p>
<ul>
<li><p>Wake word detection (for example, Porcupine or open wake-word models; see the sketch after this list)</p>
</li>
<li><p>Device-specific integrations (for example, Home Assistant, smart lighting)</p>
</li>
<li><p>Richer memory schemas, plus adapters for calendars and contacts</p>
</li>
</ul>
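<p>To make the first item concrete, here is a minimal wake-word sketch using Picovoice Porcupine (<code>pip install pvporcupine pvrecorder</code>). It assumes a free Picovoice access key and the built-in "porcupine" keyword; whether the microphone backend runs inside your Termux setup is something to verify on your device:</p>
<pre><code class="lang-python"># Minimal wake-word listener sketch with Porcupine (assumptions noted above)
import pvporcupine
from pvrecorder import PvRecorder

porcupine = pvporcupine.create(access_key="YOUR_ACCESS_KEY", keywords=["porcupine"])
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

try:
    while True:
        if porcupine.process(recorder.read()) &gt;= 0:
            print("Wake word detected - kick off the voice loop here")
except KeyboardInterrupt:
    recorder.stop()
    porcupine.delete()
</code></pre>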
<p>Your data never leaves your device, and you control every part of the stack. This is a private, customizable assistant you can expand however you like.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
