This article was contributed by Vedrana Vidulin, Head of Responsible AI Unit at Intellias (LinkedIn).
As artificial intelligence becomes increasingly integrated into smart devices, embedded systems, and edge computing, the ability to run language models locally – without reliance on cloud services – is paramount. This capability unlocks new opportunities across industries by reducing latency, enhancing data privacy, and enabling offline functionality. LiteLLM provides a practical solution for bridging the gap between the power of large language models (LLMs) and the constraints of resource-limited hardware.
Deploying LiteLLM, an open-source LLM gateway, on embedded Linux enables the execution of lightweight AI models in environments with limited resources. Functioning as a flexible proxy server, LiteLLM offers a unified API interface that accepts OpenAI-style requests, allowing developers to interact with both local and remote models using a consistent and user-friendly format. This guide provides a step-by-step walkthrough, from installation to performance tuning, to help you build a reliable and lightweight AI system on an embedded Linux distribution.
Setup Checklist
Before you begin, ensure you have the following:
- A device running a Linux-based operating system (Debian is recommended) with sufficient computational resources to handle LLM operations.
- Python 3.7 or higher installed on the device.
- Access to the internet for downloading necessary packages and models.
Step-by-Step Installation
Step 1: Install LiteLLM
First, ensure your device is up-to-date and prepared for installation. Then, install LiteLLM in a clean, isolated environment.
Update the package lists to access the latest software versions:
sudo apt update
Check if pip (Python Package Installer) is installed:
pip --version
If pip is not installed, install it using:
sudo apt-get install python3-pip
Using a virtual environment is highly recommended. Check if venv is installed:
dpkg -s python3-venv | grep "Status: install ok installed"
If venv is not installed, install it using:
sudo apt install python3-venv -y
Create and activate a virtual environment:
python3 -m venv litellm_env
source litellm_env/bin/activate
Install LiteLLM along with its proxy server component using pip:
pip install 'litellm[proxy]'
Use LiteLLM within this environment. To deactivate the virtual environment, type deactivate.
Step 2: Configure LiteLLM
With LiteLLM installed, the next step is to define its operation through a configuration file. This file specifies the language models to be used and the endpoints through which they will be served.
Navigate to a suitable directory and create a configuration file named config.yaml:
mkdir ~/litellm_config
cd ~/litellm_config
nano config.yaml
In config.yaml, specify the models you intend to use. For example, to configure LiteLLM to interface with a model served by Ollama:
model_list:
  - model_name: codegemma
    litellm_params:
      model: ollama/codegemma:2b
      api_base: http://localhost:11434
This configuration maps the model name codegemma to the codegemma:2b model served by Ollama at http://localhost:11434.
Step 3: Serve Models with Ollama
To run your AI model locally, you’ll use Ollama, a tool specifically designed for hosting large language models (LLMs) directly on your device without relying on cloud services.
Install Ollama using the following command:
curl -fsSL https://ollama.com/install.sh | sh
This command downloads and runs the official installation script, automatically starting the Ollama server.
Once installed, load the AI model you want to use. For example, to pull the codegemma:2b model:
ollama pull codegemma:2b
After the model is downloaded, the Ollama server will listen for requests, ready to generate responses from your local setup.
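Before wiring Ollama into LiteLLM, you can verify that it is actually serving models. The sketch below assumes Ollama's default port (11434) and its standard /api/tags model-listing endpoint, and uses only the Python standard library:

```python
import json
import urllib.request

# Sanity check: ask the local Ollama server which models are installed.
# Assumes Ollama's default port (11434) and its /api/tags endpoint.
OLLAMA_URL = "http://localhost:11434/api/tags"

def installed_models(url=OLLAMA_URL):
    """Return the list of locally installed model names, or None if the
    server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except OSError:
        return None

models = installed_models()
if models is None:
    print("Ollama server not reachable at", OLLAMA_URL)
else:
    print("codegemma:2b pulled:", "codegemma:2b" in models)
```

If the pull succeeded, codegemma:2b should appear in the returned list.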
Step 4: Launch the LiteLLM Proxy Server
With both the model and configuration ready, start the LiteLLM proxy server, which makes your local AI model accessible to applications.
Launch the server using the following command:
litellm --config ~/litellm_config/config.yaml
The proxy server will initialize and expose endpoints defined in your configuration, allowing applications to interact with the specified models through a consistent API.
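Because the proxy speaks the OpenAI wire format, you can exercise it without any client library. The sketch below presumes the proxy runs on its default localhost:4000 and that codegemma is the model_name from your config.yaml; it posts a raw chat request using only the standard library:

```python
import json
import urllib.request

# Raw HTTP request to the proxy's OpenAI-compatible chat endpoint.
# Assumes the proxy is on localhost:4000 and "codegemma" matches a
# model_name entry in config.yaml.
def chat(prompt, base_url="http://localhost:4000"):
    payload = {
        "model": "codegemma",
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]
    except OSError:
        # Proxy not running, or the model failed to respond.
        return None

reply = chat("Say hello in one word.")
print(reply if reply is not None else "LiteLLM proxy not reachable")
```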
Step 5: Test the Deployment
Confirm everything works as expected by writing a simple Python script that sends a test request to the LiteLLM server, saving it as test_script.py:
import openai

client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")
response = client.chat.completions.create(
    model="codegemma",
    messages=[{"role": "user", "content": "Write me a Python function to calculate the nth Fibonacci number."}],
)
print(response)
Run the script using:
python test_script.py
If the setup is correct, you’ll receive a response from the local model, confirming that LiteLLM is running.
Optimizing LiteLLM Performance on Embedded Devices
To ensure fast and reliable performance on embedded systems, choose the right language model and adjust LiteLLM’s settings to match your device’s limitations.
Choosing the Right Language Model
Not all AI models are suitable for devices with limited resources. Opt for compact, optimized models designed for such environments:
- DistilBERT: A distilled version of BERT, retaining over 95% of BERT’s performance with 66 million parameters. Ideal for text classification, sentiment analysis, and named entity recognition.
- TinyBERT: With approximately 14.5 million parameters, TinyBERT is designed for mobile and edge devices, excelling in question answering and sentiment classification.
- MobileBERT: Optimized for on-device computations, MobileBERT has 25 million parameters and achieves nearly 99% of BERT’s accuracy, making it ideal for real-time mobile applications.
- TinyLlama: A compact model with approximately 1.1 billion parameters, TinyLlama balances capability and efficiency for real-time natural language processing in resource-constrained environments.
- MiniLM: A compact transformer model with approximately 33 million parameters, MiniLM is effective for semantic similarity and question answering, particularly in scenarios requiring rapid processing on limited hardware.
Selecting a model that fits your setup ensures smooth performance, fast responses, and efficient resource utilization.
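If you serve models through Ollama as in Step 3, swapping in a lighter model is mostly a configuration change. As a sketch, assuming the model is published in the Ollama library under the tag tinyllama:

```yaml
model_list:
  - model_name: tinyllama
    litellm_params:
      model: ollama/tinyllama
      api_base: http://localhost:11434
```

After pulling the model with ollama pull tinyllama and restarting the proxy, clients can request it by the model name tinyllama without any other code changes.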
Configure Settings for Better Performance
Fine-tuning LiteLLM settings can substantially boost performance on limited hardware.
Restrict the Number of Tokens: Shorter responses mean faster results. Limit the maximum number of tokens in responses to reduce memory and computational load. In LiteLLM, set the max_tokens parameter when making API calls. For example:
import openai

client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")
response = client.chat.completions.create(
    model="codegemma",
    messages=[{"role": "user", "content": "Write me a Python function to calculate the nth Fibonacci number."}],
    max_tokens=500,
)
print(response)
Adjusting max_tokens keeps replies concise and reduces the load on your device.
Manage Simultaneous Requests: Too many concurrent requests can overwhelm even optimized models. Limit the number of queries LiteLLM processes together using the max_parallel_requests setting. For example:
litellm --config ~/litellm_config/config.yaml --num_requests 5
This setting distributes the load evenly, ensuring device stability even during peak demand.
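The proxy-side limit is one lever; you can also cap concurrency on the client side. The sketch below is a generic pattern rather than a LiteLLM API: an asyncio semaphore bounds how many requests are in flight at once, with a stub coroutine standing in for a real call to the proxy at http://localhost:4000:

```python
import asyncio

# Client-side throttle (generic pattern, not a LiteLLM API): a semaphore
# caps how many requests run at once, so a burst of work cannot swamp
# the device. `make_request` is a stub for a real call to the proxy.
peak = 0      # highest number of simultaneously running tasks observed
active = 0    # tasks currently running

async def make_request(i):
    global peak, active
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)  # pretend to wait on the model
    active -= 1
    return f"response {i}"

async def bounded_gather(coro_fn, n_tasks, limit):
    # Allow at most `limit` coroutines past the semaphore at a time.
    sem = asyncio.Semaphore(limit)
    async def run(i):
        async with sem:
            return await coro_fn(i)
    return await asyncio.gather(*(run(i) for i in range(n_tasks)))

results = asyncio.run(bounded_gather(make_request, n_tasks=20, limit=5))
print(len(results), "responses; peak concurrency:", peak)
```

Even if callers queue up twenty requests at once, only five ever reach the proxy simultaneously, which keeps memory use predictable on constrained hardware.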
Additional Best Practices:
- Secure Your Setup: Implement security measures like firewalls and authentication to protect the server from unauthorized access.
- Monitor Performance: Use LiteLLM’s logging capabilities to track usage, performance, and potential issues.
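On the security point, LiteLLM's proxy supports a master key in the config's general_settings; clients must then present that value as their API key. A minimal sketch (the key value here is a placeholder you should replace with a long random secret):

```yaml
general_settings:
  master_key: sk-replace-with-a-long-random-secret
```

After restarting the proxy with this config, pass the same value as api_key in your client (instead of "anything"); requests without it should be rejected.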
LiteLLM makes running language models locally possible, even on low-resource devices. By acting as a lightweight proxy with a unified API, it simplifies integration and reduces overhead. With the right setup and lightweight models, you can deploy responsive, efficient AI solutions on embedded systems for prototypes or production environments.
Summary
Running LLMs on embedded devices doesn’t require heavy infrastructure or proprietary services. LiteLLM offers a streamlined, open-source solution for deploying language models with ease, flexibility, and performance, even on devices with limited resources. With the right model and configuration, you can power real-time AI features at the edge, supporting everything from smart assistants to secure local processing.
Join Our Community
We’re continuously exploring the future of tech, innovation, and digital transformation at Intellias — and we invite you to be part of the journey.
- Visit our Intellias Blog for industry insights, trends, and expert perspectives.
- This article was written by Vedrana Vidulin, Head of Responsible AI Unit at Intellias. Connect with Vedrana through her LinkedIn page.