Today, I had some free time to play around with an on-prem deployment of an LLM (large language model). Why, you ask? I’ve got a hunch that some of my clients will want this capability in the near future, and I want a proof of concept up and running, ready for a demo :]
I didn’t want to spend too much time on this demo (it’s a Saturday evening, after all), so making it run on localhost and not messing with services, storage, and all of the other production concerns is OK for now…
A bit more on the technical side: I specifically used the Llama3.2 model, deployed through the alpine/llama3.2 image (which runs Ollama under the hood). The setup was seamless, and I was impressed by how straightforward it was to get everything up and running. Once everything was in place, I was ready to test the model’s capabilities.
The simple docker run command I used:
docker run --hostname=<Just a Random Name> --user=0 --mac-address=<Find your own mac addr> --env=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin --env=SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt --env=OLLAMA_HOST=0.0.0.0 --network=bridge -p 11434:11434 --restart=no --runtime=runc -d alpine/llama3.2:latest
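Once the container is up, a quick sanity check is to ask Ollama which models it has available (the port matches the -p mapping above):

curl http://localhost:11434/api/tags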
To play around, I interacted with the model using Postman to send a simple POST request to the /api/chat endpoint. To actually use it in any useful way, you’re gonna want to script the interaction instead; curl and parsing the JSON response isn’t difficult at all (a sketch follows after the prompt body below).
The prompt body:
{
  "model": "llama3.2:latest",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "sing the first song that you think about"
    }
  ],
  "stream": false
}
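As mentioned, curl works just as well as Postman. A minimal sketch, assuming the container is running on localhost and the body above is saved as prompt.json (the filename is my own choice here):

curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d @prompt.json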
The response I got:
{"model":"llama3.2:latest","created_at":"2025-01-04T17:05:09.353285845Z","message":{"role":"assistant","content":"I'm just a language model, I don't have personal experiences or emotions, but I can sing a popular song for you!\n\nThe song that came to mind is \"YMCA\" by the Village People! Here's my attempt:\n\n(Singing)\nYoung man, there's no need to feel down\nI said, young man, pick yourself off the ground\nI said, young man, 'cause you're in a new town\nThere's no need to be unhappy\n\nYoung man, you're in New York City\nThe lights are fast and the sound is loud\nYou can go for a stroll through Central Park\nIt makes me feel happy inside\n\nGive me an Y!\nGive me an M!\nGive me a C!\nGive me an A!\n\nWhat do we do?\nWe get down tonight!\nGive me an S!\nGive me a V!\nGive me a C!\nGive me an E!\nGive me an I!\nGive me an F!\nGive me a T!\nGive me a N!\nGive me a I!\nGive me a G!\nGive me a H!\n\nWhat do we do?\nWe get down tonight!\nEverybody's dancing\nFeeling alive\nHappiness all around\n\nCome on everybody, let's sing it out loud!\nYoung man, there's no need to feel down\nI said, young man, pick yourself off the ground\nI said, young man, 'cause you're in a new town\nThere's no need to be unhappy!\n\nHow was that?"},"done_reason":"stop","done":true,"total_duration":332970391683,"load_duration":27643043,"prompt_eval_count":39,"prompt_eval_duration":11853000000,"eval_count":303,"eval_duration":321088000000}
*Note – the lyrics are wrong, but close enough… I’m not looking for accuracy here, and the prompt engineering is minimal.
Possible improvements:
Looking forward, there are several optimizations I plan to explore to enhance performance. The key one is utilizing a GPU to accelerate inference, which can significantly reduce response times. Experimenting with different models could also offer improvements in both accuracy and efficiency. I’ll find the time to work on the homelab a bit more; I might get myself a new GPU to actually make everything run a bit faster.
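On the GPU side, Docker can pass an NVIDIA card through with the --gpus flag. A minimal sketch, assuming the NVIDIA Container Toolkit is installed on the host and that the image actually ships with GPU support (I haven’t verified this for alpine/llama3.2):

docker run --gpus=all --env=OLLAMA_HOST=0.0.0.0 --network=bridge -p 11434:11434 -d alpine/llama3.2:latest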
Projects I think I’m going to start some time soon:
- building a simple chatbot app, just to have a private version of ChatGPT (I know open-webui exists… just let me have some fun, will ya?) – a bare-bones sketch follows after this list
- building a scraper that stores relevant data in some S3 bucket for later use
- deploying VMware Private AI Foundation – I want to see how well the solution stacks up against a homebrew one
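For the chatbot idea, the bare-bones starting point is just a shell loop around the same /api/chat endpoint. A minimal single-turn sketch (assumes jq is installed; it sends no conversation history, which a real chatbot would need, and it will break on messages containing quotes):

while read -r -p "you> " msg; do
  curl -s http://localhost:11434/api/chat \
    -d "{\"model\": \"llama3.2:latest\", \"messages\": [{\"role\": \"user\", \"content\": \"$msg\"}], \"stream\": false}" \
    | jq -r '.message.content'
done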
If you’ve got a good implementation of something like this, feel free to contact me on LinkedIn 🙂