Deploy Lightweight LLMs on Embedded Linux with LiteLLM

This article was contributed by Vedrana Vidulin, Head of Responsible AI Unit at Intellias (LinkedIn).

As artificial intelligence becomes increasingly integrated into smart devices, embedded systems, and edge computing, the ability to run language models locally – without reliance on cloud services – is paramount. This capability unlocks new opportunities across industries by reducing latency, enhancing data privacy, and enabling offline functionality. LiteLLM provides a practical solution for bridging the gap between the power of large language models (LLMs) and the constraints of resource-limited hardware.

Deploying LiteLLM, an open-source LLM gateway, on embedded Linux enables the execution of lightweight AI models in environments with limited resources. Functioning as a flexible proxy server, LiteLLM offers a unified API interface that accepts OpenAI-style requests, allowing developers to interact with both local and remote models using a consistent and user-friendly format. This guide provides a step-by-step walkthrough, from installation to performance tuning, to help you build a reliable and lightweight AI system on an embedded Linux distribution.

Setup Checklist

Before you begin, ensure you have the following:

  • A device running a Linux-based operating system (Debian is recommended) with sufficient computational resources to handle LLM operations.
  • Python 3.7 or higher installed on the device.
  • Access to the internet for downloading necessary packages and models.

Step-by-Step Installation

Step 1: Install LiteLLM

First, ensure your device is up to date and prepared for installation. Then, install LiteLLM within a clean, isolated environment.

Update the package lists to access the latest software versions:

sudo apt update

Check if pip (the Python package installer) is installed:

pip --version

If pip is not installed, install it using:

sudo apt-get install python3-pip

Using a virtual environment is highly recommended. Check if venv is installed:

dpkg -s python3-venv | grep "Status: install ok installed"

If venv is not installed, install it using:

sudo apt install python3-venv -y

Create and activate a virtual environment:

python3 -m venv litellm_env
source litellm_env/bin/activate

Install LiteLLM along with its proxy server component using pip:

pip install 'litellm[proxy]'

Use LiteLLM within this environment. To deactivate the virtual environment, type deactivate.

Step 2: Configure LiteLLM

With LiteLLM installed, the next step is to define its operation through a configuration file. This file specifies the language models to be used and the endpoints through which they will be served.

Navigate to a suitable directory and create a configuration file named config.yaml:

mkdir ~/litellm_config
cd ~/litellm_config
nano config.yaml

In config.yaml, specify the models you intend to use. For example, to configure LiteLLM to interface with a model served by Ollama:

model_list:
  - model_name: codegemma
    litellm_params:
      model: ollama/codegemma:2b
      api_base: http://localhost:11434

This configuration maps the model name codegemma to the codegemma:2b model served by Ollama at http://localhost:11434.
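The same file can register several models behind one endpoint. As an illustrative sketch (the tinyllama entry is hypothetical and assumes you have also pulled that model in Ollama):

```yaml
model_list:
  - model_name: codegemma
    litellm_params:
      model: ollama/codegemma:2b
      api_base: http://localhost:11434
  - model_name: tinyllama          # hypothetical second entry
    litellm_params:
      model: ollama/tinyllama
      api_base: http://localhost:11434
```

Clients then pick a model simply by passing its model_name in the request, while LiteLLM routes it to the right backend.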

Step 3: Serve Models with Ollama

To run your AI model locally, you'll use Ollama, a tool specifically designed for hosting large language models (LLMs) directly on your device without relying on cloud services.

Install Ollama using the following command:

curl -fsSL https://ollama.com/install.sh | sh

This command downloads and runs the official installation script, automatically starting the Ollama server.

Once installed, load the AI model you want to use. For example, to pull the codegemma:2b model:

ollama pull codegemma:2b

After the model is downloaded, the Ollama server will listen for requests, ready to generate responses from your local setup.

Step 4: Launch the LiteLLM Proxy Server

With both the model and configuration ready, start the LiteLLM proxy server, which makes your local AI model accessible to applications.

Launch the server using the following command:

litellm --config ~/litellm_config/config.yaml

The proxy server will initialize and expose the endpoints defined in your configuration, allowing applications to interact with the specified models through a consistent API.

Step 5: Test the Deployment

Confirm everything works as expected with a simple Python script that sends a test request to the LiteLLM server. Save it as test_script.py:

import openai

client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")
response = client.chat.completions.create(
    model="codegemma",
    messages=[{"role": "user", "content": "Write me a Python function to calculate the nth Fibonacci number."}],
)
print(response)

Run the script using:

python test_script.py

If the setup is correct, you'll receive a response from the local model, confirming that LiteLLM is running.
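The proxy can take a few seconds to start, so test scripts can race it. A small readiness probe helps; this is an illustrative sketch (not part of LiteLLM itself) that simply polls the proxy's base URL until it answers:

```python
# Poll the LiteLLM proxy until it answers any HTTP request, so a test
# script doesn't race the server's startup. Illustrative helper only.
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, attempts: int = 5, delay: float = 1.0) -> bool:
    """Return True once base_url answers an HTTP request, False after retries."""
    for _ in range(attempts):
        try:
            urllib.request.urlopen(base_url, timeout=2)
            return True
        except urllib.error.HTTPError:
            return True  # the server answered, even if with an error status
        except (urllib.error.URLError, OSError):
            time.sleep(delay)
    return False

if wait_for_server("http://localhost:4000", attempts=3, delay=1.0):
    print("LiteLLM proxy is up")
else:
    print("Proxy not reachable; is the litellm command still running?")
```

Call wait_for_server() before sending the test request so a slow startup isn't mistaken for a broken deployment.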

Optimizing LiteLLM Performance on Embedded Devices

To ensure fast and reliable performance on embedded systems, choose the right language model and adjust LiteLLM’s settings to match your device’s limitations.

Choosing the Right Language Model

Not all AI models are suitable for devices with limited resources. Opt for compact, optimized models designed for such environments:

  • DistilBERT: A distilled version of BERT, retaining over 95% of BERT’s performance with 66 million parameters. Ideal for text classification, sentiment analysis, and named entity recognition.
  • TinyBERT: With approximately 14.5 million parameters, TinyBERT is designed for mobile and edge devices, excelling in question answering and sentiment classification.
  • MobileBERT: Optimized for on-device computations, MobileBERT has 25 million parameters and achieves nearly 99% of BERT’s accuracy, making it ideal for real-time mobile applications.
  • TinyLlama: A compact model with approximately 1.1 billion parameters, TinyLlama balances capability and efficiency for real-time natural language processing in resource-constrained environments.
  • MiniLM: A compact transformer model with approximately 33 million parameters, MiniLM is effective for semantic similarity and question answering, particularly in scenarios requiring rapid processing on limited hardware.

Selecting a model that fits your setup ensures smooth performance, fast responses, and efficient resource utilization.
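To translate the parameter counts above into hardware requirements, a back-of-the-envelope estimate is useful: weight memory is roughly the parameter count times the bytes per parameter (2 for fp16), plus some runtime overhead. The sketch below is a heuristic only, not a LiteLLM feature; real usage varies with quantization and context length:

```python
# Rough RAM estimate for candidate models: parameters x bytes per parameter,
# plus ~20% overhead for runtime buffers. A heuristic, not an exact figure.
def estimate_memory_mb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate megabytes needed to hold the model weights in memory."""
    return num_params * bytes_per_param * 1.2 / (1024 ** 2)

for name, params in [("TinyBERT", 14.5e6), ("MiniLM", 33e6), ("TinyLlama", 1.1e9)]:
    print(f"{name}: ~{estimate_memory_mb(params):.0f} MB at fp16")
```

A board with 1 GB of free RAM is comfortable with the BERT-family models above, while TinyLlama-class models call for a few gigabytes or a quantized build.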

Configure Settings for Better Performance

Fine-tuning LiteLLM settings can substantially boost performance on limited hardware.

Restrict the Number of Tokens: Shorter responses mean faster results. Limit the maximum number of tokens in responses to reduce memory and computational load. In LiteLLM, set the max_tokens parameter when making API calls. For example:

import openai

client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")
response = client.chat.completions.create(
    model="codegemma",
    messages=[{"role": "user", "content": "Write me a Python function to calculate the nth Fibonacci number."}],
    max_tokens=500,
)
print(response)

Adjusting max_tokens keeps replies concise and reduces the load on your device.
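When choosing a max_tokens value, it helps to check that the prompt plus the response budget still fits the model's context window. A minimal sketch using the common four-characters-per-token rule of thumb (a heuristic, not LiteLLM's actual tokenizer; the 2048-token window is an assumption):

```python
# Estimate whether prompt + response budget fit the context window.
# The chars/4 ratio is a rough rule of thumb for English text.
def rough_token_count(text: str) -> int:
    return max(1, len(text) // 4)

def fits_budget(prompt: str, max_tokens: int, context_window: int = 2048) -> bool:
    """True if the prompt plus the response budget fits the context window."""
    return rough_token_count(prompt) + max_tokens <= context_window

prompt = "Write me a Python function to calculate the nth Fibonacci number."
print(fits_budget(prompt, max_tokens=500))  # True: short prompt, modest budget
```

If the check fails, either shorten the prompt or lower max_tokens before sending the request.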

Manage Simultaneous Requests: Too many concurrent requests can overwhelm even optimized models. Limit the number of queries LiteLLM processes together using the max_parallel_requests setting. For example:

litellm --config ~/litellm_config/config.yaml --num_requests 5

This setting distributes the load evenly, ensuring device stability even during peak demand.
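The same idea can be applied on the client side as a complement to the proxy limit: cap how many requests your application keeps in flight at once. A minimal sketch using a semaphore (send_request here is a stand-in for a real call to the LiteLLM endpoint):

```python
# Cap concurrent requests from this client with a semaphore, so the
# embedded device's proxy is never flooded. Illustrative sketch only.
import threading
import time

MAX_IN_FLIGHT = 5
slots = threading.Semaphore(MAX_IN_FLIGHT)
lock = threading.Lock()
in_flight = 0
peak = 0

def send_request(prompt):
    global in_flight, peak
    with slots:  # blocks while MAX_IN_FLIGHT requests are already running
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.05)  # placeholder for the actual HTTP call to the proxy
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=send_request, args=(f"q{i}",)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"peak concurrent requests: {peak}")  # never exceeds MAX_IN_FLIGHT
```

Requests beyond the cap simply queue on the semaphore rather than piling up on the device.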

Additional Best Practices:

  • Secure Your Setup: Implement security measures like firewalls and authentication to protect the server from unauthorized access.
  • Monitor Performance: Use LiteLLM’s logging capabilities to track usage, performance, and potential issues.

LiteLLM makes running language models locally possible, even on low-resource devices. By acting as a lightweight proxy with a unified API, it simplifies integration and reduces overhead. With the right setup and lightweight models, you can deploy responsive, efficient AI solutions on embedded systems for prototypes or production environments.

Summary

Running LLMs on embedded devices doesn’t require heavy infrastructure or proprietary services. LiteLLM offers a streamlined, open-source solution for deploying language models with ease, flexibility, and performance, even on devices with limited resources. With the right model and configuration, you can power real-time AI features at the edge, supporting everything from smart assistants to secure local processing.

Join Our Community

We’re continuously exploring the future of tech, innovation, and digital transformation at Intellias — and we invite you to be part of the journey.

  • Visit our Intellias Blog for industry insights, trends, and expert perspectives.
  • This article was written by Vedrana Vidulin, Head of Responsible AI Unit at Intellias. Connect with Vedrana through her LinkedIn page.
