Making LLMs smarter with Dynamic Knowledge Access
This guide shows how to use Retrieval Augmented Generation (RAG) to enhance a large language model (LLM). RAG is the process of enabling an LLM to reference context outside of its initial training data before generating its response. Training a model that is useful for your own domain-specific purposes can be extremely expensive in both time and computing power, so RAG is a cost-effective way to extend the capabilities of an existing LLM instead.
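At a high level, RAG wraps a retrieval step around the model call. The sketch below is purely illustrative (the `embed`, `vector_store`, and `llm` objects are hypothetical placeholders, not part of any library used later); the rest of this guide builds the real version with LlamaIndex.

```python
# Illustrative only: the retrieve-then-generate flow that RAG adds around an LLM.
# `embed`, `vector_store`, and `llm` are hypothetical stand-ins for the real
# components we'll wire up with LlamaIndex later in this guide.
def answer_with_rag(question: str) -> str:
    # 1. Embed the user's question into the same vector space as our documents
    query_vector = embed(question)

    # 2. Retrieve the most relevant document chunks from a vector store
    context_chunks = vector_store.top_k(query_vector, k=4)

    # 3. Ask the LLM to answer using only the retrieved context
    prompt = f"Context:\n{context_chunks}\n\nAnswer the question: {question}"
    return llm.generate(prompt)
```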
Prerequisites
- uv - for Python dependency management
- The Nitric CLI
- (optional) An AWS account
Getting started
We'll start by creating a new project using Nitric's Python starter template.
If you want to take a look at the finished code, it can be found here.
```bash
nitric new llama-rag py-starter
cd llama-rag
```
Next, let's install our base dependencies, then add the llama-index libraries. We'll be using LlamaIndex as it makes building RAG applications extremely simple and supports running our own local Llama 3.2 models.
```bash
# Install the base dependencies
uv sync

# Add Llama index dependencies
uv add llama-index llama-index-embeddings-huggingface llama-index-llms-llama-cpp
```
We'll organize our project structure like so:
```text
+-- common/
|   +-- __init__.py
|   +-- model_parameters.py
+-- model/
|   +-- Llama-3.2-1B-Instruct-Q4_K_M.gguf
+-- services/
|   +-- api.py
+-- .gitignore
+-- .python-version
+-- build_query_engine.py
+-- pyproject.toml
+-- python.dockerfile
+-- python.dockerignore
+-- nitric.yaml
+-- README.md
```
Setting up our LLM
Before we start writing any code, we'll want to download the model into our project. For this project we'll be using Llama 3.2 with the Q4_K_M quantization.
```bash
mkdir model
cd model
curl -OL https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
cd ..
```
Now that we have our model, we can load it into our code in common/model_parameters.py. We'll also define our embed model using a recommended model from Hugging Face, and create a prompt template to use with our query engine. The template helps reduce hallucinations: if the model can't find the answer in the provided context, it will respond with "I'm not sure" rather than pretending it knows.
```python
from llama_index.core import ChatPromptTemplate
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP

# Load the locally stored Llama model
llm = LlamaCPP(
    model_url=None,
    model_path="./model/Llama-3.2-1B-Instruct-Q4_K_M.gguf",
    temperature=0.7,
    verbose=False,
)

# Load the embed model from Hugging Face
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-large-en-v1.5",
    trust_remote_code=True
)

# Set the location that we will persist our embeds
persist_dir = "query_engine_vectors"

# Create the prompt query templates to sanitise hallucinations
text_qa_template = ChatPromptTemplate.from_messages([
    (
        "system",
        "If the context is not useful, respond with 'I'm not sure'.",
    ),
    (
        "user",
        (
            "Context information is below.\n"
            "---------------------\n"
            "{context_str}\n"
            "---------------------\n"
            "Given the context information and not prior knowledge "
            "answer the question: {query_str}\n."
        )
    ),
])
```
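If you'd like to sanity-check the model files before building the index, a quick sketch like the one below (run from the project root, assuming the parameters above live in common/model_parameters.py) can confirm that both the Llama model and the embedding model load and respond:

```python
from common.model_parameters import llm, embed_model

# Ask the local Llama model for a short completion
print(llm.complete("Briefly, what is retrieval augmented generation?"))

# Confirm the embedding model produces a vector
vector = embed_model.get_text_embedding("hello world")
print(f"Embedding dimensions: {len(vector)}")
```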
Building a Query Engine
The next step is to embed our context into a vector index that the LLM can draw on. For this example we will embed the Nitric documentation. It's open source on GitHub, so we can clone it into our project.
git clone https://github.com/nitrictech/docs.git nitric-docs
We can then create our embeddings in build_query_engine.py and store them locally.
```python
from common.model_parameters import llm, embed_model, persist_dir

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings

# Set global settings for llama index
Settings.llm = llm
Settings.embed_model = embed_model

# Load data from the documents directory
loader = SimpleDirectoryReader(
    # The location of the documents you want to embed
    input_dir="./nitric-docs/",
    # Set the extension to what format your documents are in
    required_exts=[".mdx"],
    # Search through documents recursively
    recursive=True
)
docs = loader.load_data()

# Embed the docs into the Llama model
index = VectorStoreIndex.from_documents(docs, show_progress=True)

# Save the query engine index to the local machine
index.storage_context.persist(persist_dir)
```
You can then run this using the following command. It will output the embeddings into your persist_dir.
uv run build_query_engine.py
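Before wiring this into an API, you can optionally check that the persisted index loads back correctly. A minimal sketch (say, in a throwaway check_index.py, assuming the files above) mirrors what the API service will do in the next step:

```python
from common.model_parameters import llm, embed_model, text_qa_template, persist_dir

from llama_index.core import StorageContext, load_index_from_storage, Settings

# Use the same models that were used to build the index
Settings.llm = llm
Settings.embed_model = embed_model

# Load the index we just persisted and run a one-off query against it
storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine(text_qa_template=text_qa_template)
print(query_engine.query("What is Nitric?"))
```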
Creating an API for querying our model
With our LLM ready for querying, we can create an API in services/api.py to handle prompts.
```python
import os

from common.model_parameters import embed_model, llm, text_qa_template, persist_dir

from nitric.resources import api
from nitric.context import HttpContext
from nitric.application import Nitric
from llama_index.core import StorageContext, load_index_from_storage, Settings

# Set global settings for llama index
Settings.llm = llm
Settings.embed_model = embed_model

main_api = api("main")

@main_api.post("/prompt")
async def query_model(ctx: HttpContext):
    # Pull the data from the request body
    query = str(ctx.req.data)

    print(f"Querying model: \"{query}\"")

    # Get the model from the stored local context
    if os.path.exists(persist_dir):
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)

        # Get the query engine from the index, and use the prompt template for sanitisation.
        query_engine = index.as_query_engine(
            streaming=False,
            similarity_top_k=4,
            text_qa_template=text_qa_template
        )
    else:
        print("model does not exist")
        ctx.res.success = False
        return ctx

    # Query the model
    response = query_engine.query(query)

    ctx.res.body = f"{response}"

    print(f"Response: \n{response}")

    return ctx

Nitric.run()
```
Test it locally
Now that you have an API defined, we can test it locally. Run nitric start, then make a request to the API either through the Nitric Dashboard or another HTTP client like cURL.
curl -X POST http://localhost:4001/prompt -d "What is Nitric?"
This should produce an output similar to:
Nitric is a cloud-agnostic framework designed to aid developers in building full cloud applications, including infrastructure. It is a declarative cloud framework with common resources like APIs, websockets, databases, queues, topics, buckets, and more. The framework provides tools for locally simulating a cloud environment, to allow an application to be tested locally, and it makes it possible to interact with resources at runtime. It is a lightweight and flexible framework that allows developers to structure their projects according to their preferences and needs. Nitric is not a replacement for IaC tools like Terraform but rather introduces a method of bringing developer self-service for infrastructure directly into the developer application. Nitric can be augmented through use of tools like Pulumi or Terraform and even be fully customized using such tools. The framework supports multiple programming languages, and its default deployment engines are built with Pulumi. Nitric provides tools for defining services in your project's `nitric.yaml` file, and each service can be run independently, allowing your app to scale and manage different workloads efficiently. Services are the heart of Nitric apps, they're the entrypoints to your code. They can serve as APIs, websockets, schedule handlers, subscribers and a lot more.
Get ready for deployment
Now that it's tested locally, we can get our project ready for containerization. The default Python dockerfile uses python3.11-bookworm-slim as its base container image, which doesn't have the dependencies needed to load the Llama model. All we need to do is update the dockerfile to use python3.11-bookworm (the non-slim version) instead.
Update line 2:
```diff
-FROM ghcr.io/astral-sh/uv:python3.11-bookworm-slim AS builder
+FROM ghcr.io/astral-sh/uv:python3.11-bookworm AS builder
```
And line 18:
```diff
-FROM python:3.11-slim-bookworm
+FROM python:3.11-bookworm
```
When you're ready to deploy the project, create a new Nitric stack file that will target AWS:
nitric stack new dev aws
Update the stack file nitric.dev.yaml with the appropriate AWS region and memory allocation to handle the model:
```yaml
provider: nitric/aws@1.14.0
region: us-east-1
config:
  # How services will be deployed by default. If you have other services not running models,
  # you can add them here too so they don't use the same configuration
  default:
    lambda:
      # Set the memory to 6GB to handle the model, this automatically sets additional CPU allocation
      memory: 6144
      # Set a timeout of 30 seconds (this is the most API Gateway will wait for a response)
      timeout: 30
      # We add more storage to the lambda function, so it can store the model
      ephemeral-storage: 1024
```
We can then deploy using the following command:
nitric up
Testing on AWS works the same way as testing locally; we'll just use cURL to make a request to the API URL that was output at the end of the deployment.
curl -X POST {your AWS endpoint URL here}/prompt -d "What is Nitric?"
Once you're finished querying the model, you can destroy the deployment using nitric down.
Summary
In this project we've successfully augmented an LLM using Retrieval Augmented Generation (RAG) with LlamaIndex and Nitric. You can modify this project to use any LLM, tune the prompt template for more specific responses, or swap in context documents that suit your own requirements. We could also extend this project to maintain context between requests using WebSockets, for more of a chat-like experience with the model.
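As a rough pointer for that extension, the sketch below combines LlamaIndex's chat engine (which keeps conversational memory) with a Nitric websocket. Treat it as a starting point only: the in-memory chat_engines dictionary is illustrative and won't survive across Lambda instances, and the websocket resource, handler signature, and socket.send call are assumptions to verify against the Nitric Python SDK documentation.

```python
# Sketch only: per-connection chat memory over a Nitric websocket.
# The websocket resource, handler signature, and socket.send call are
# assumptions to confirm against the Nitric Python SDK docs.
from common.model_parameters import embed_model, llm, persist_dir

from nitric.resources import websocket
from nitric.application import Nitric
from llama_index.core import StorageContext, load_index_from_storage, Settings

Settings.llm = llm
Settings.embed_model = embed_model

# Load the persisted index once at startup
storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
index = load_index_from_storage(storage_context)

socket = websocket("chat")

# Illustrative in-memory store of chat engines, keyed by connection id.
# A real deployment would persist chat history somewhere shared.
chat_engines = {}

@socket.on("message")
async def on_message(ctx):
    connection_id = ctx.req.connection_id

    # Create a chat engine with memory for this connection if we haven't yet
    if connection_id not in chat_engines:
        chat_engines[connection_id] = index.as_chat_engine(chat_mode="context")

    response = chat_engines[connection_id].chat(str(ctx.req.data))

    # Send the model's reply back over the websocket (assumed API)
    await socket.send(connection_id, f"{response}".encode())

Nitric.run()
```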