
Optimizing Generative AI Applications with Fireworks AI and MongoDB for Peak Performance

Optimize AI with MongoDB & Fireworks AI: Faster inference, reduced costs, and efficient RAG applications.

Use cases: Gen AI, Model Performance Tuning

Industries: Financial Services

Products: MongoDB Atlas

Partners: Fireworks AI, LangChain

In today's AI-driven landscape, delivering high performance is only half the battle. True success lies in striking the right balance between performance and total cost of ownership (TCO). Enterprises are constantly challenged to optimize the price/performance ratio, ensuring that every gain in speed or efficiency translates into tangible cost benefits. Enter the strategic partnership of MongoDB and Fireworks AI: a combination that empowers organizations to elevate performance while conserving resources. By blending MongoDB's robust data management capabilities with Fireworks AI's model optimization tools, this collaboration improves latency and throughput while minimizing operational costs.

This article will address the following topics:

  • Methods for leveraging MongoDB and Fireworks AI to enhance performance and reduce TCO.

  • Strategies for caching queries and responses in MongoDB Atlas, which later help optimize retrieval-augmented generation (RAG) when using generative AI models.

  • Fine-tuning Small Language Models (SLMs) on the responses of a Large Language Model (LLM) to achieve faster performance with comparable response quality.

  • Techniques employed on the Fireworks AI platform to fine-tune models, accelerate inference, and decrease hardware requirements for LLMs.

  • A case study featuring a credit card recommendation system, demonstrating quantifiable improvements in latency, memory utilization, and cost-effectiveness.

  • Best practices for deploying and scaling these solutions in production environments.

By the end of this article, you will have a comprehensive, actionable strategy for achieving better AI performance at lower cost, supported by practical examples and performance metrics.

MongoDB is renowned for its flexible schema, efficient indexing, and distributed architecture, allowing organizations to scale their data infrastructure on demand. When paired with Fireworks AI's cutting-edge model tuning capabilities, enterprises can achieve remarkable efficiency and performance gains without financial strain.

Introducing FireOptimizer and FireAttention: Core Components

Central to this optimized approach is the FireOptimizer framework, which integrates MongoDB's flexible, high-performance database with Fireworks AI's model tuning prowess.

This framework accelerates batch inferencing through solutions like FireAttention, a method that enhances request processing on existing hardware, optimizing resource utilization. Furthermore, Parameter-Efficient Fine-Tuning (PEFT), including methods like LoRA and QLoRA, fine-tunes models to complete specialized tasks efficiently using trace data or labeled data, enhancing portability and reducing the computational footprint.


Figure 1. FireOptimizer Architecture for Adaptive Optimization and High-Quality Inference

The key benefits that can be achieved are:

  • Faster Inference: With FireOptimizer’s adaptive speculative execution, production workloads across various models saw up to 3x latency improvements, ensuring applications are always highly responsive.

  • Hassle-Free Optimization: FireOptimizer automates the complex optimization process, so you can focus on building your application while the platform handles the rest.

Let’s dive deeper into the FireOptimizer framework.

Large Language Models (LLMs) typically generate output one word (or token) at a time, which can make responses slow, especially for long outputs. To accelerate this process, a technique called speculative decoding (also referred to as speculative execution) is used: a smaller, faster model, known as a draft model, quickly proposes several possible words or phrases in advance while the main LLM is still processing. The main LLM then evaluates these guesses and retains only the accurate ones.

The limitation of this traditional approach is that most draft models are trained on generic data (e.g., public conversations). While this is effective for general tasks, the draft model's accuracy, or "hit rate," drops significantly in specialized scenarios such as coding or financial analysis.

In contrast, Fireworks AI enhances this approach through adaptive speculative execution. Instead of using a one-size-fits-all draft model, it uses models that are customized to specific domains or user profiles. By using this adaptive approach, it optimizes predictions based on domain-specific characteristics. This method improves accuracy, increases hit rates (e.g., from 29% to 76% in code generation tasks), and reduces inference costs, leading to latency improvements of up to 3x.
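To make the mechanism concrete, here is a toy, self-contained sketch of the speculative decoding loop. The "models" are deliberate stand-ins (a hash-based next-token rule rather than real networks), so this illustrates only the propose/verify/accept control flow, not Fireworks AI's actual implementation:

import random

class ToyLM:
    """Stand-in for a language model over a pretend 50-token vocabulary.
    next_token is a deterministic function of the context; the draft model
    adds noise so it only approximates the target, as in practice."""
    def __init__(self, noise=0.0, seed=0):
        self.noise = noise
        self.rng = random.Random(seed)

    def next_token(self, context):
        base = hash(tuple(context)) % 50
        if self.rng.random() < self.noise:
            return (base + 1) % 50  # occasional "wrong" guess
        return base

def speculative_decode(target, draft, prompt, k=4, max_tokens=20):
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        # 1) The cheap draft model proposes k tokens ahead.
        ctx = list(tokens)
        proposed = []
        for _ in range(k):
            t = draft.next_token(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2) The expensive target model verifies the proposals and keeps
        #    the longest prefix it agrees with.
        ctx = list(tokens)
        for t in proposed:
            if target.next_token(ctx) != t:
                break
            ctx.append(t)
        tokens = ctx
        # 3) The target always emits one token itself, guaranteeing progress
        #    even when every draft proposal is rejected.
        tokens.append(target.next_token(tokens))
    return tokens

target = ToyLM()                   # deterministic "ground truth" model
draft = ToyLM(noise=0.2, seed=42)  # imperfect, faster approximation
print(speculative_decode(target, draft, prompt=[1, 2, 3]))

The higher the draft model's hit rate, the more tokens are accepted per expensive verification pass, which is exactly why Fireworks AI's domain-tuned draft models translate into latency gains.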

Proprietary LLM Serving Stack: Mastering Long-Context Workloads

Processing long prompts (8K-32K tokens) is critical for applications like document analysis and code completion, but it often slows systems to a crawl. Fireworks AI's FireAttention V2 tackles this, delivering 12x faster processing for long-context tasks through:

  • Optimized Attention Scaling: Reduces computational overhead when handling lengthy inputs.

  • Multi-Host Deployment: Distributes workloads efficiently across GPUs.

  • Advanced Kernels: Streamlines operations for faster execution.

With FireAttention V3, Fireworks AI extends its lead to AMD’s MI300 GPUs, offering a cost-efficient alternative to NVIDIA. Performance highlights include:

  • 1.4x–1.8x Higher Throughput: For models like LLaMA 8B and 70B, compared to standard setups.

  • Up to 5.5x Speed Gains: In low-latency scenarios, outperforming vLLM running on the same AMD hardware.

This leap comes from architectural tweaks like redesigned attention kernels and smarter memory use, proving that AMD GPUs are now formidable players in LLM inference.

With the above capabilities, FireAttention can significantly enhance the fine-tuning of Small Language Models by enabling efficient processing of long-context inputs (8K-32K tokens), which are often critical for domain-specific tasks like document analysis or multi-turn dialogue. Its optimized attention mechanisms reduce computational overhead, allowing faster training cycles and larger batch sizes. With support for multi-host GPU deployment and advanced attention kernels, it streamlines workloads across hardware, improving throughput, lowering latency, and making fine-tuning both faster and more affordable.

Efficiency at Scale: Adaptive Resource Optimization

Fireworks AI’s optimizations extend beyond adaptive speculation. Three critical techniques maximize throughput and cost-efficiency:

  • Adaptive Caching: Reuses frequent computations (e.g., repetitive API calls) to skip redundant work, slashing latency by 30–50% for high-traffic workloads.

  • Customizable Quantization: Balances 4-/8-bit precision against model quality, doubling speeds while retaining >99% accuracy for tasks like batch processing (see the sketch after this list).

  • Disaggregated Serving: Tailors hardware allocation to workload type—host multiple lightweight model copies (e.g., 8 GPUs → 8 parallel instances) or shard large models (e.g., LLaMA 70B) across GPUs for complex tasks.
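To illustrate the quantization bullet above, here is a minimal NumPy sketch of 8-bit uniform quantization. It is conceptual only: Fireworks AI's production quantization is far more sophisticated (per-channel scales, specialized kernels, quality-aware calibration), and none of the names below come from their stack:

import numpy as np

def quantize_int8(weights):
    """Store weights as int8 plus a single float scale factor."""
    scale = np.abs(weights).max() / 127.0  # map the largest weight to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)  # a pretend weight vector
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(f"max abs error: {np.abs(w - w_hat).max():.5f}")   # small relative to weight scale
print(f"memory: {w.nbytes} -> {q.nbytes} bytes (4x smaller than fp32)")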

Smaller, efficient models present unique opportunities for fine-tuning, allowing specialized adaptation while maintaining resource efficiency. Research in this space is constant, as delivering fine-tuned or optimized SLMs to various mediums (cloud, on-device, dedicated hardware) is becoming extremely competitive. These fine-tuning techniques can be categorized as follows:

1. Additive fine-tuning

This category includes approaches that introduce additional trainable parameters to the existing pre-trained model without modifying its original weights.

  • Adapters: These methods involve inserting small, trainable layers, known as adapter modules, between the layers of the pre-trained model. These adapters learn task-specific transformations, enabling adaptation to new tasks without altering the pre-trained parameters.

  • Soft Prompts: These are trainable vector embeddings appended to the input sequence. They serve as guidance cues, influencing the model’s behavior towards the desired task.

  • Prefix Tuning: This technique involves adding a trainable prefix to the input sequence. This prefix learns task-specific information without requiring modifications to the core model architecture.

2. Reparameterization-based fine-tuning

This approach involves reparameterizing existing model weights using low-rank approximations to reduce the number of trainable parameters.

  • Low-Rank Adaptation (LoRA): LoRA approximates weight updates in the attention layers of a pre-trained model using low-rank matrices. This significantly decreases the number of trainable parameters.

  • Quantized LoRA (QLoRA): QLoRA enhances LoRA by incorporating quantization techniques, further reducing memory usage and computational costs.

3. Selective fine-tuning

This category focuses on selectively fine-tuning specific parameters of the pre-trained model, resulting in greater computational efficiency.

  • BitFit: This method fine-tunes only the bias terms of the pre-trained model, greatly improving computational efficiency.

  • DiffPruning: This technique identifies and prunes away parameter updates that contribute minimally to the model's performance, thereby reducing the number of trainable parameters.

4. Layer-freezing strategies

These strategies involve strategically freezing certain layers of the pre-trained model while fine-tuning others, optimizing the adaptation process.

  • Freeze and Reconfigure (FAR): FAR involves freezing specific layers of the pre-trained model and fine-tuning the remaining layers to optimize model adaptation.

  • FishMask: This technique employs a mask to selectively freeze or fine-tune layers, optimizing adaptation for specific tasks.

The most popular of these techniques are the reparameterization-based ones, LoRA and QLoRA. PEFT techniques adapt large pre-trained models to new tasks by adjusting only a small fraction of their parameters. This approach helps prevent overfitting, especially with smaller datasets, and significantly reduces computational and memory requirements compared to full fine-tuning. PEFT also mitigates catastrophic forgetting in LLMs. These techniques enable efficient model customization without requiring full retraining, making them ideal for resource-constrained environments.

Adopting PEFT LoRA techniques, combined with the availability of both trace data (data generated from the model's interactions) and labeled data (data explicitly annotated for specific tasks), enables fine-tuning smaller models to achieve high performance on specific tasks without extensive computational resources.
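To make this concrete, here is a minimal sketch of a LoRA setup using the open-source Hugging Face peft library. This is illustrative only: when you submit a fine-tuning job to Fireworks AI (as shown later in this article), the platform applies the equivalent configuration for you, and the model name below assumes access to the gated Llama weights on Hugging Face.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model to adapt (stand-in; assumes gated-model access on Hugging Face).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# LoRA freezes the base weights and learns small low-rank update matrices,
# here only on the attention query/value projections.
config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # scaling applied to the updates
    target_modules=["q_proj", "v_proj"],  # attention layers to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters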

To highlight the practical applications, we turn to a MongoDB credit card application demo that illustrates how MongoDB can be used for credit scoring with predictive analytics, explaining credit scoring results with generative AI, and recommending credit cards using a RAG approach with an LLM. For this fine-tuning example, we focus on simplifying credit rating explanations with LLMs. The application encompasses several feature scopes: user profile generation, product recommendations, and reranking with summarization tasks. Details of the application design and source code can be found on the MongoDB Credit Card Application Solution Library page.

LLMs can cause slower response times due to the complex calculations they must perform over billions of parameters. For example, generating credit card recommendations may require multiple LLM queries, resulting in a total response time of 10 to 20 seconds (with each query taking 5 or more seconds). Additionally, LLMs can be difficult and expensive to productionize and scale for millions of users.

By leveraging Small Language Models (SLMs), enterprises can achieve significant gains in processing speed and cost-efficiency. SLMs require less computational power, making them ideal for devices with limited resources, while providing faster responses and lowering operational costs.

PEFT methods such as Low-Rank Adaptation (LoRA) significantly improve efficiency by optimizing only a subset of parameters. This approach reduces memory requirements and cuts down on operational costs. The synergy with MongoDB enhances data handling and facilitates efficient model tuning processes.

MongoDB is integral here, providing seamless data management and real-time integration that improve operational efficiency. By storing trace data as JSON and enabling efficient retrieval and storage, MongoDB adds substantial value to the fine-tuning process. It also doubles as a caching layer, avoiding unnecessary LLM invocations on repeated requests for the same data.

A critical aspect of the credit card application demo is explaining credit scores or credit ratings to the customer in clear, everyday language. Frontier LLMs like Meta's LLaMA 3.1-405B can generate these explanations using factors such as user profile parameters, model input features, and the feature importances of the model used to predict the customer's alternative credit score or rating. However, an out-of-the-box SLM cannot perform these tasks consistently, since it lacks the parameter count needed for effective reasoning and explanation. To achieve the desired outcome, we fine-tune an SLM on the Fireworks AI fine-tuning platform, following the process described above.
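As an illustration, the sketch below shows how such an explanation prompt might be assembled from the user profile, the predicted rating, and the scoring model's feature importances. The field names and the build_explanation_prompt helper are hypothetical, invented for this example rather than taken from the demo's source code:

def build_explanation_prompt(profile: dict, importances: dict, rating: str) -> str:
    """Combine the user profile, feature importances, and predicted rating
    into a prompt asking the model for a plain-language explanation."""
    # Keep only the most influential factors so the prompt stays focused.
    top = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:5]
    factors = "\n".join(f"- {name}: importance {score:.2f}" for name, score in top)
    return (
        f"The customer's alternative credit rating is '{rating}'.\n"
        f"Customer profile: {profile}\n"
        f"Most influential factors:\n{factors}\n"
        "Explain this credit rating to the customer in clear, everyday "
        "language, avoiding technical jargon."
    )

prompt = build_explanation_prompt(
    profile={"age": 33, "monthly_income": 4200, "utility_payments_on_time": 0.96},
    importances={"utility_payments_on_time": 0.41, "income_stability": 0.27, "debt_ratio": 0.18},
    rating="Good",
)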

The following steps show, step by step, how to use the platform to fine-tune an SLM.


Figure 2. LLM/SLM Fine Tuning Process

The fine-tuning process starts with the collection of relevant, task-specific data. As shown in Figure 2, MongoDB Atlas, being a general-purpose database, can be leveraged to cache LLM/SLM responses for a specific user based on their inputs to the credit card application (the web UI lets the user simulate a credit profile). Below is a sample Python snippet demonstrating how to set up a decorator that caches LLM/SLM responses in MongoDB Atlas:

class mdbcache:
    """Decorator that caches LLM/SLM responses in a MongoDB Atlas collection.

    Assumes `ccol` (the Atlas cache collection) is defined at application
    startup, e.g. ccol = client["bfsi-genai"]["cc_cache"].
    """
    def __init__(self, function):
        self.function = function

    def __call__(self, *args, **kwargs):
        # Use the stringified arguments as the cache key.
        key = str(args) + str(kwargs)
        ele = ccol.find_one({"key": key})
        if ele:
            return ele["response"]
        # Cache miss: invoke the model and store the response.
        value = self.function(*args, **kwargs)
        ccol.insert_one({"key": key, "response": value})
        return value

@mdbcache
def invoke_llm(prompt):
    """
    Invoke the language model with the given prompt, with caching. The
    llm.invoke method can invoke either an LLM or an SLM depending on the
    Fireworks model ID provided at the start of the application.

    Args:
        prompt (str): The prompt to pass to the LLM.
    """
    response = llm.invoke(prompt)
    return response

As shown in the diagram, the training dataset required for fine-tuning can be generated with a simulator. Here, we simulate user profiles with a stratified sampling approach, selecting an equal number of samples for each of the three credit ratings: Good, Normal, and Poor. For this demo, we generated roughly 1,300 sample responses.
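The actual simulator ships with the solution library; as a stand-in, the following minimal sketch shows the stratified-sampling idea, drawing an equal number of synthetic profiles per rating (the profile fields here are hypothetical):

import random

RATINGS = ["Good", "Normal", "Poor"]

def simulate_profiles(n_per_rating=450, seed=7):
    """Stratified sampling: an equal number of synthetic user profiles per
    credit rating, so the fine-tuning set is balanced across all classes."""
    rng = random.Random(seed)
    profiles = []
    for rating in RATINGS:
        for _ in range(n_per_rating):
            profiles.append({
                "rating": rating,
                "monthly_income": rng.randint(1500, 12000),
                "utility_payments_on_time": round(rng.uniform(0.5, 1.0), 2),
            })
    rng.shuffle(profiles)
    return profiles

print(len(simulate_profiles()))  # 1350 balanced samples, in line with the demo's ~1,300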

Now, the generated responses have to be transformed into the format the Fireworks AI platform expects for fine-tuning. The file "cc_cache.jsonl" used below in the fine-tuning process can be generated by running the following code:

import json

import pandas as pd
from pymongo import MongoClient

# Read the cached prompt/response pairs from MongoDB Atlas.
client = MongoClient("mongodb+srv://<uid>:<pwd>@bfsi-demo.2wqno.mongodb.net/?retryWrites=true&w=majority")
df = pd.DataFrame.from_records(client["bfsi-genai"]["cc_cache"].find({}, {"_id": 0}))

# Strip the decorator's key formatting so only the raw prompt text remains.
df["prompt"] = df["key"].apply(lambda x: x.strip("(").strip('"').strip(")").strip("\\"))
del df["key"]
df["response"] = df["response"].apply(lambda x: x.strip())

# Transform the cache into the chat-message format expected by Fireworks AI.
messages = []
for _, item in df.iterrows():
    messages.append({
        "messages": [
            {"role": "user", "content": item["prompt"].strip(" \\")},
            {"role": "assistant", "content": item["response"]},
        ]
    })

with open("cc_cache.jsonl", "w") as f:
    for item in messages:
        f.write(json.dumps(item) + "\n")
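Each line of the resulting cc_cache.jsonl holds one chat-formatted training record of the form {"messages": [{"role": "user", "content": "<prompt>"}, {"role": "assistant", "content": "<cached LLM explanation>"}]}, which is the conversational format the Fireworks AI fine-tuning service consumes.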

Now that we have prepared the dataset and generated our "cc_cache.jsonl" file, we can start fine-tuning the pre-trained "llama-v3p1-8b-instruct" model by following the steps below.

1. Install the firectl command-line tool:

   pip install firectl

2. Log in to the Fireworks AI platform:

   firectl login

3. Upload the prepared dataset:

   firectl create dataset <dataset_name> cc_cache.jsonl

4. Launch the supervised fine-tuning job (LoRA rank 8, one epoch):

   firectl create sftj --base-model accounts/fireworks/models/llama-v3p1-8b-instruct --dataset <dataset_name> --output-model ccmodel --lora-rank 8 --epochs 1

5. Monitor the fine-tuning job from the Fireworks AI dashboard.

Figure 3. Monitoring the fine-tuning process

6. Deploy the fine-tuned model:

   firectl deploy ccmodel

The seamless integration of MongoDB and Fireworks.AI streamlines the process of enhancing AI model performance in a cost-effective and efficient manner. This is evident in the step-by-step guide to fine-tuning provided above.

After deploying the model on the Fireworks AI platform as a serverless API, the model ID (models/ft-m88hxaga-pi11m) shown in Figure 3 can be used to invoke the fine-tuned SLM with the language model framework of your choice.
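For example, with LangChain (one of the partner frameworks used in this demo), invoking the deployed model might look like the sketch below. It assumes the langchain-fireworks package is installed, a FIREWORKS_API_KEY environment variable is set, and <account_id> is replaced with your Fireworks account:

from langchain_fireworks import ChatFireworks

# Point LangChain at the fine-tuned SLM deployed as a serverless API.
slm = ChatFireworks(
    model="accounts/<account_id>/models/ft-m88hxaga-pi11m",
    temperature=0,
)

response = slm.invoke("Explain this customer's 'Good' credit rating in plain, everyday language.")
print(response.content)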

Applying the above fine-tuning steps to the SLM in the context of the alternative credit card product recommendation process yielded impressive results, with a tremendous improvement in the response time of the process:

1. Response Time Improvement: Originally, the response time with a large LLM was around 5 seconds per query. With the fine-tuned SLM, responses return in a fraction of a second, roughly a 19x reduction in latency. The timing decorator below was used to measure the inference times reported in the table that follows:

import time

class tiktok:
    """
    Decorator to time the execution of a function and log the time taken.
    """
    def __init__(self, function):
        self.function = function

    def __call__(self, *args, **kwargs):
        start = time.time()
        value = self.function(*args, **kwargs)
        end = time.time()
        print(f"Time taken for {self.function.__name__}: {end - start} seconds")
        return value

@tiktok
@mdbcache
def invoke_llm(prompt):
    """
    Invoke the language model with the given prompt, with caching and timing.
    The llm.invoke method can invoke either an LLM or an SLM based on the
    Fireworks model ID initialized.

    Args:
        prompt (str): The prompt to pass to the LLM.
    """
    ...
| Model | Inference Time 1 (s) | Inference Time 2 (s) | Inference Time 3 (s) | Average Time (s) |
| --- | --- | --- | --- | --- |
| llama-v3p1-405b-instruct | 5.5954 | 7.5936 | 4.9121 | 6.0337 |
| SLM - fine-tuned llama-v3p1-8b | 0.3554 | 0.0480 | 0.0473 | 0.1502 |

2. Memory Reduction: LLMs typically require substantial memory, often on the order of 8 x 80 GB of VRAM. An SLM can operate efficiently with around 16 GB of VRAM, a roughly 97.5% reduction in memory usage.

3. Hardware Reduction: Deploying LLMs often necessitates high-end GPUs or multiple servers. With an SLM, the hardware requirements can be significantly reduced, potentially allowing deployment on standard CPUs or a single server, leading to a drastic reduction in hardware costs.

The integration of MongoDB with Fireworks AI significantly enhances the efficiency and cost-effectiveness of training SLMs. This partnership makes advanced AI solutions more accessible and economically viable, ensuring future-proof AI investments. Key value propositions of this collaboration include:

  • Reduced Total Cost of Ownership (TCO): Fireworks AI's LoRA and QLoRA techniques minimize computational resource needs by applying low-rank adaptation and quantization, which lowers the costs associated with fine-tuning SLMs. MongoDB supports this with its distributed architecture, flexible schema, and efficient indexing to scale data infrastructure on demand and minimize storage costs, reducing both capital and operational expenditures (CapEx and OpEx).

  • Integrated Data and AI Workflows: MongoDB facilitates real-time data integration, enabling instantaneous data availability for AI models, thereby enhancing operational efficiency. This integration with Fireworks AI's fine-tuning tools ensures a streamlined workflow that keeps AI models continuously updated and relevant, improving decision-making and delivering accurate insights.

  • Robust Retrieval Augmented Generation (RAG) Solutions: The seamless combination of MongoDB Atlas and Fireworks AI streamlines the creation of RAG frameworks, enhancing data storage and retrieval. MongoDB Atlas provides a scalable platform for embedding storage, while Fireworks AI offers managed LLM/SLM hosting and other features. Together, they enable scalable and intelligent systems that improve the user experience with relevant information retrieval (a minimal retrieval sketch follows this list).
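As a sketch of the retrieval half of such a RAG flow, the snippet below runs an Atlas Vector Search query with PyMongo. It assumes a collection whose documents carry an embedding field covered by a vector index named vector_index; the collection, field, and index names are placeholders for this example:

from pymongo import MongoClient

client = MongoClient("mongodb+srv://<uid>:<pwd>@<cluster>.mongodb.net")
col = client["bfsi-genai"]["card_products"]

def retrieve(query_embedding, k=5):
    """Return the k most relevant documents for a query embedding."""
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",         # Atlas Vector Search index name
                "path": "embedding",             # field holding the vectors
                "queryVector": query_embedding,  # embedding of the user query
                "numCandidates": 100,
                "limit": k,
            }
        },
        {"$project": {"_id": 0, "card_name": 1, "description": 1}},
    ]
    return list(col.aggregate(pipeline))

The retrieved documents would then be passed as context to the Fireworks-hosted LLM/SLM to generate the final recommendation.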

Organizations adopting this strategy can achieve accelerated AI performance, resource savings, and future-proof solutions—driving innovation and competitive advantage across different sectors.

Authors:

  • Wei You Pan

  • Ashwin Gangadhar

  • Peyman Parsi

  • Andrea Alaman Calderon

  • Benny Chen

  • Ayaan Momin
