How to Ship LLMs in Production?

Shikanime Deva
7 min read · Dec 18, 2023

Large language models (LLMs) have intrigued me for quite some time. Initially skeptical, I eventually felt the pull to dive into the technology myself rather than just reading articles or examining shiny examples. I wanted to build a small-scale reproduction of what I would deploy for my clients in a production environment. While OpenAI is the best-known option, there are several alternatives such as PaLM on Google Vertex AI, a platform I was already familiar with and had worked on previously. Other alternatives exist as well, such as open models on Hugging Face like Falcon and StableLM. Ultimately, I settled on PaLM because it had recently been released and came with a viable serverless infrastructure. For document embedding, however, I relied on Sentence Transformers, which I’ll explain in more detail later.

My primary focus was to develop a proof of concept that would first run in a notebook, allowing rapid experimentation with the technology. That’s why I chose Langchain, a popular library with numerous integrations, prompt templates, and ingestion mechanisms, such as loading PDFs from Google Drive into a vector store. It has been truly impressive to see so many people outside the tech sphere embracing Python and ML tools. As someone who has lived through dependency nightmares, unresolved pip installations, and system dependency issues, I can appreciate the ease of use that Langchain provides.

Today, developers have no idea what they are doing

Returning to Langchain, it can be likened to Keras for TensorFlow, but for conversing with LLMs or enabling them to carry out various tasks. The library is divided into several modules (a short usage sketch follows the list):

  1. LLMs: This module enables interaction with different types of LLMs using a common interface.
  2. Chains: Here, we delve into the core concept of Langchain, which makes the library truly magical. Chains provide an abstraction that users interact with and that can be serialized to JSON, similar to Hugging Face’s pipeline.
  3. Memory: Preserving previous messages and context is vital for reproducing a ChatGPT-like interface. Memory allows a discussion to carry greater context.
  4. Agent: This is the most advanced concept within Langchain. While it still leverages conversation as a network of information, it also facilitates interaction with multiple conversations or external APIs. Since LLMs have no built-in notion of ground truth, it becomes necessary to incorporate external logic and factual information into the system, for use cases such as fact-checking news.
Langchain components illustration by Syed Hyder Ali Zaidi
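
To make these modules concrete, here is a minimal sketch of how they compose in a notebook, assuming Vertex AI credentials are already configured; the prompts are only examples:

from langchain.llms import VertexAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

# LLM module: a common interface over the underlying model (PaLM on Vertex AI here)
llm = VertexAI()

# Memory module: keeps previous turns so the chain can answer with context
memory = ConversationBufferMemory()

# Chain module: wires the LLM and the memory into a single callable abstraction
conversation = ConversationChain(llm=llm, memory=memory)

print(conversation.predict(input="What is a vector store, in one sentence?"))
print(conversation.predict(input="And why would I pair one with an LLM?"))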

For my experiment, I used all of those modules except Agents in my notebook, despite encountering several dependency and authentication issues. The library is enjoyable to use, and I wanted to share my experience with others, including colleagues at SFEIR and friends, to show that we can have our own GPT-like models at home. My goal was therefore to build a cost-efficient, completely serverless service while maintaining control over the entire stack, avoiding reliance on magical SaaS solutions. This service would act as my librarian, ingesting all the books, papers, and documentation I had stored in a Google Drive and quizzing my recollection of them.

from langchain.llms import VertexAI
from langchain.vectorstores import Chroma
from langchain.embeddings import VertexAIEmbeddings
from langchain.document_loaders import GoogleDriveLoader
from langchain.chains import ConversationalRetrievalChain

# Load every document from the Drive folder, embed it and index it in Chroma
loader = GoogleDriveLoader(
    folder_id="1daABjn2QXHMFUK_LUvVRlbUdTTc8nOWe",
    recursive=True,
)
embeddings = VertexAIEmbeddings()
vectorstore = Chroma.from_documents(loader.load(), embeddings)

# Wire PaLM and the retriever into a conversational question-answering chain
llm = VertexAI()
qa = ConversationalRetrievalChain.from_llm(
    llm,
    vectorstore.as_retriever(),
    return_source_documents=True,
)
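
For reference, this is how the chain can be exercised directly in the notebook; the question is only an example, and the chat history starts empty:

result = qa(
    {"question": "What does this library say about error budgets?", "chat_history": []}
)
print(result["answer"])
for doc in result["source_documents"]:
    print(doc.metadata.get("source"))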

The first challenge I needed to address was where to store the vector representations of the embeddings. Initially, I used Chroma, a fantastic database for experimentation. Eventually, however, I discovered Deep Lake, an open-source library that, as the name suggests, works much like a data lake (think Iceberg) by separating storage from compute. This was incredibly convenient, as it eliminated the need for a costly vector database cluster.

During the ingestion process, which involved approximately 9,000 pages, I ran into a limitation with PaLM: its quota allowed a bucket of only 60 calls per minute, and given the pricing, it could become expensive if I scaled the system to a company-wide dataset. To address this, I opted for a “lighter” solution by loading Sentence Transformers, an open-source library of pretrained embedding models that fits comfortably within a memory footprint of less than 4 GB, file system included (a ballpark figure).

from langchain.vectorstores import DeepLake
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()
vectorstore = DeepLake(
    dataset_path="gcs://sfeir-hivemind-deep-lake-datasets/books/",
    embedding_function=embeddings,
)
vectorstore.add_documents(loader.load())

One of the significant challenges I encountered during my experimentation was how to manage stateful conversations. Since serverless environments are designed to be stateless, preserving the conversation state across multiple function invocations can be quite tricky. However, I found a simple yet effective solution to this problem by leveraging Firestore, a serverless document database provided by Google Cloud.

To implement this solution, I created a Firestore collection for storing conversation documents. Each document represented a conversation and contained relevant metadata along with the messages exchanged. Firestore’s document-centric approach made it easy to manage and query conversations based on various parameters such as user ID and session ID.
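
As an illustration, here is a minimal sketch of what such a lookup could look like with the Firestore client; the collection name matches the one used below, while the field names and ordering are assumptions about the document layout:

from google.cloud import firestore

db = firestore.Client()

# Hypothetical query: fetch the most recent conversations for one user
conversations = (
    db.collection("chat_history")
    .where("user_id", "==", "user-123")
    .order_by("created_at", direction=firestore.Query.DESCENDING)
    .limit(10)
    .stream()
)
for doc in conversations:
    print(doc.id, doc.to_dict().get("session_id"))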

When a new message arrived, I would retrieve the conversation document from Firestore, update it with the new message, and then store it back into Firestore. This approach ensured that the conversation state was preserved across function invocations, enabling a seamless and context-aware conversation experience.

from langchain.memory.chat_message_histories import FirestoreChatMessageHistory

# session_id, user_id and question come from the incoming request
# (the QuestionAnsweringRequest handled by the service below)
chat_message_history = FirestoreChatMessageHistory(
    collection_name="chat_history",
    session_id=request.session_id,
    user_id=request.user_id,
)
result = qa(
    {"question": request.question, "chat_history": chat_message_history.messages}
)
chat_message_history.add_user_message(request.question)
chat_message_history.add_ai_message(result["answer"])

Now, let’s move on to the core component: the inference service. In a production system with a generous budget, I would have chosen a Google Vertex AI Endpoint, since the service provides more than a simple endpoint, including monitoring and sampling. However, since this solution is not serverless and involves deploying costly permanent machines, it wasn’t the ideal fit for my needs. Instead, I packaged the service with Docker and opted for Cloud Run. Cloud Run had certain technical limitations for my use case, such as cold start issues (ML Docker containers are often several gigabytes in size) and the lack of GPU support unless deploying on GKE Anthos, but I employed a few tricks to mitigate these challenges. One such trick was using the startup probe’s initial delay together with in-memory volumes to avoid unexpected OOM kills, because the Cloud Run file system is backed by instance memory. Additionally, the maximum memory limit of 32 GiB posed a potential challenge for loading a 40-billion-parameter model, and the cold start of loading such a model would be equally challenging and of questionable cost effectiveness.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sfeir-hivemind
  annotations:
    run.googleapis.com/execution-environment: gen2
    run.googleapis.com/launch-stage: BETA
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/startup-cpu-boost: "true"
    spec:
      containers:
        - image: europe-docker.pkg.dev/sfeir-ml-labs/sfeir-hivemind-containers/sfeir-hivemind
          name: bento
          ports:
            - containerPort: 3000
          resources:
            limits:
              memory: 4Gi
          livenessProbe:
            httpGet:
              path: /livez
              port: 3000
          startupProbe:
            httpGet:
              path: /readyz
              port: 3000
            initialDelaySeconds: 60
          volumeMounts:
            - mountPath: /home/bentoml/bentoml
              name: bentoml
            - mountPath: /home/bentoml/.cache/torch
              name: torch-cache
      timeoutSeconds: 600
      volumes:
        - name: bentoml
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
        - name: torch-cache
          emptyDir:
            medium: Memory
            sizeLimit: 512Mi

While the spectrum of ways to serve an ML model continues to evolve, it’s worth mentioning the role of BentoML, an HTTP/gRPC serving library that facilitates building, optimizing, and shipping machine learning models with Docker. BentoML, similar to but considerably more complete than other serving libraries like TorchServe, simplifies the deployment, management (OpenAPI), and monitoring (Prometheus metrics, health and readiness probes) of machine learning models in production environments. I didn’t use BentoML to its full capacity, but so far the serving part is feature complete at the infrastructure level for a production use case.

import os

import bentoml
from bentoml.io import JSON

from sfeir.hivemind.runner import VertexAIRunnable
from sfeir.hivemind.schema import (
    QuestionAnsweringRequest,
    QuestionAnsweringResponse,
    SourceDocument,
)

vertexai_runner = bentoml.Runner(
    VertexAIRunnable,
    name="sfeir-hivemind",
    runnable_init_params={
        "project": os.environ.get("VERTEX_AI_PROJECT"),
        "location": os.environ.get("VERTEX_AI_LOCATION"),
        "dataset_uri": os.environ["DEEP_LAKE_DATASET_URI"],
    },
)

svc = bentoml.Service("sfeir-hivemind", runners=[vertexai_runner])


@svc.api(
    input=JSON(pydantic_model=QuestionAnsweringRequest),
    output=JSON(pydantic_model=QuestionAnsweringResponse),
)
async def predict(qa: QuestionAnsweringRequest) -> QuestionAnsweringResponse:
    result = await vertexai_runner.predict.async_run({"question": qa.question})
    return QuestionAnsweringResponse(
        question=result["question"],
        answer=result["answer"],
        source_documents=[
            SourceDocument(
                page_content=doc.page_content,
                source=doc.metadata["source"],
                title=doc.metadata["title"],
                page=doc.metadata["page"],
            )
            for doc in result["source_documents"]
        ],
    )
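
The VertexAIRunnable itself is not shown above; here is a hedged sketch of what it could look like, assuming it simply wraps the ConversationalRetrievalChain built earlier. The class body is an illustration, not the actual sfeir.hivemind implementation:

import bentoml
from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import VertexAI
from langchain.vectorstores import DeepLake


class VertexAIRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("cpu",)
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self, project: str, location: str, dataset_uri: str) -> None:
        # Rebuild the retrieval chain inside the runner process
        embeddings = HuggingFaceEmbeddings()
        vectorstore = DeepLake(
            dataset_path=dataset_uri,
            embedding_function=embeddings,
            read_only=True,
        )
        llm = VertexAI(project=project, location=location)
        self.chain = ConversationalRetrievalChain.from_llm(
            llm,
            vectorstore.as_retriever(),
            return_source_documents=True,
        )

    @bentoml.Runnable.method(batchable=False)
    def predict(self, payload: dict) -> dict:
        # Delegate to the chain; chat history defaults to empty if absent
        return self.chain(
            {
                "question": payload["question"],
                "chat_history": payload.get("chat_history", []),
            }
        )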

In conclusion, my experimentation with language models using Langchain, PaLM, and other supporting libraries proved to be a fascinating journey. Despite some challenges with dependencies, authentication, and resource limitations, I was able to build a “cost-efficient” serverless service, with a 90-second cold start, that allowed me to retain control over the entire stack.

When it comes to integrating into an MLOps pipeline, for example for zero-shot classification, I would opt for a more performant and efficient approach. From my experience, I would recommend loading the language model directly into the processing pipeline rather than relying on a network API or high-level frameworks like Langchain; this eliminates the overhead of external calls and avoids unneeded complexity that could become limiting later on.
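
As a minimal sketch of that approach, assuming a Hugging Face NLI model is acceptable for the task, the model can be loaded once and reused for every record in the batch job; the model name and labels here are only examples:

from transformers import pipeline

# Load the model once at pipeline start-up instead of calling a hosted endpoint per record
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
)

labels = ["billing", "technical support", "sales"]
result = classifier(
    "My invoice shows a charge I don't recognize.",
    candidate_labels=labels,
)
print(result["labels"][0], result["scores"][0])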

However, it’s important to note that loading the LLM directly into the processing pipeline requires careful consideration of resource allocation, memory management, and performance monitoring. You need to ensure that the infrastructure can handle the model’s size and complexity without compromising system stability.

