Run a Local LLM API Server with vLLM (OpenAI-Compatible, Fast, and Simple)

2025-12-25 5 min read AI Development MLOps

Running vLLM Locally: Serve an OpenAI-Compatible API on Your Machine: Checklist

  • Create an isolated Python environment with uv
  • Install vllm with an auto-selected Torch backend
  • Verify the CLI is available and inspect options
  • Launch a local API server with a small instruct model
  • Smoke-test the OpenAI-compatible endpoint with curl

Running vLLM Locally: Serve an OpenAI-Compatible API on Your Machine: Introduction

If you want a local LLM API server that feels like OpenAI’s API but runs on your own box, vLLM is one of the cleanest ways to do it. It’s optimized for throughput and latency, and it exposes an OpenAI-compatible interface so your apps can switch from hosted models to local inference with minimal code changes.

This walkthrough uses uv for fast, reproducible Python env management and spins up vllm serve with Qwen/Qwen2.5-1.5B-Instruct—a lightweight model that’s practical for local testing.

Validation: By the end, you’ll have a localhost API you can hit with OpenAI-style requests. Next up: environment setup.

Running vLLM Locally: Serve an OpenAI-Compatible API on Your Machine: Prerequisites

  • Linux/macOS shell (WSL works too)
  • Python 3 installed (python3)
  • uv installed (see docs: https://docs.astral.sh/uv/)
  • Enough RAM/VRAM for the model you choose (1.5B is relatively friendly)
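
A quick check that the tooling is in place:

uv --version
python3 --version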

Validation: If uv --version and python3 --version work, you’re good. Next: create the venv.

Running vLLM Locally: Serve an OpenAI-Compatible API on Your Machine: Create the Virtual Environment (uv)

Run this exact sequence:

cd ~/venv
uv venv --python python3 --seed $PWD/vllm
source vllm/bin/activate
cd ~/venv/vllm

What’s happening here:

  • uv venv --python python3 creates a clean virtualenv using your system Python, and $PWD/vllm is the target path, so the environment lives at ~/venv/vllm.
  • --seed pre-installs seed packages (pip, plus setuptools and wheel on older Pythons) into the new environment.
  • source vllm/bin/activate activates the environment so subsequent installs land in it.

Validation: Your shell prompt should show the venv activated, and which python should point inside ~/venv/vllm. Next: install vLLM.

Running vLLM Locally: Serve an OpenAI-Compatible API on Your Machine: Install vLLM (Torch Backend Auto)

Inside the activated venv:

uv pip install vllm --torch-backend=auto

Notes for real-world setups:

  • --torch-backend=auto tells uv to pick a suitable PyTorch build for your hardware (a CUDA wheel matching your driver if an NVIDIA GPU is detected, otherwise a CPU build).
  • If you’re on NVIDIA GPUs, make sure your CUDA drivers are healthy; vLLM benefits a lot from GPU acceleration (see the quick check after this list).
  • First run may download model weights from Hugging Face when you start the server.
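
If you're on an NVIDIA machine, a quick driver sanity check (assuming the NVIDIA driver utilities are installed) looks like this:

# Should list your GPU(s) and the driver/CUDA version they support
nvidia-smi

# Optional: confirm the Torch build that uv installed can actually see the GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"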

Validation: python -c "import vllm; print(vllm.__version__)" should work. Next: confirm the CLI.

Running vLLM Locally: Serve an OpenAI-Compatible API on Your Machine: Verify vLLM CLI

vllm --help

This confirms:

  • The vllm entrypoint is installed correctly
  • You can see subcommands like serve and available flags
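
To see the flags specific to the server subcommand before launching it, you can also run:

vllm serve --help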

Validation: If help text prints, CLI wiring is correct. Next: launch the server.

Running vLLM Locally: Serve an OpenAI-Compatible API on Your Machine: Start the Local API Server

Run:

vllm serve Qwen/Qwen2.5-1.5B-Instruct

What you get:

  • A local HTTP server (by default, on http://localhost:8000)
  • OpenAI-compatible endpoints (commonly /v1/chat/completions, /v1/completions, /v1/models)
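
Recent vLLM builds also expose a plain /health endpoint, which is handy as a readiness probe:

# Returns HTTP 200 once the server is ready to accept requests
curl -i http://localhost:8000/health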

Common useful flags (optional, but good to know):

  • --host 0.0.0.0 to expose on your LAN
  • --port 8000 to change the port
  • --dtype auto to let vLLM choose precision
  • --max-model-len ... if you need to cap context for memory reasons
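
For example, combining a few of these (the values here are just illustrative):

vllm serve Qwen/Qwen2.5-1.5B-Instruct \
  --host 127.0.0.1 \
  --port 8000 \
  --dtype auto \
  --max-model-len 4096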

Validation: You should see logs indicating the model loaded and the server is listening. Next: quick API smoke test.

Running vLLM Locally: Serve an OpenAI-Compatible API on Your Machine: Smoke Test with curl (OpenAI-Compatible)

List models:

curl http://localhost:8000/v1/models

Chat completion (OpenAI-style):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "Write a one-liner about vLLM."}
    ],
    "temperature": 0.2
  }'

If you’re integrating with an OpenAI SDK, you typically just point the base URL at http://localhost:8000/v1 and set a dummy API key (the server doesn’t check it unless you start it with --api-key).
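
For example, the OpenAI Python SDK (v1+) picks up its endpoint and key from environment variables, so pointing an existing app at the local server can be as simple as:

# The key value is arbitrary unless you started vllm serve with --api-key
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=dummy-key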

Validation: If you get a JSON response with an assistant message, your local API is live. Next: an optional CLI client, then some practical tips.

Running vLLM Locally: Serve an OpenAI-Compatible API on Your Machine: Completions using zapgpt

You can also use zapgpt for easier access.

First, install it with uv tool install zapgpt (or run it without installing via uv tool run zapgpt), then use it like this:

> zapgpt -p local --url localhost:8000 -q -lm

📦 Available OpenAI Models:
                     Model List
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ ID                         ┃ Created             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ Qwen/Qwen2.5-1.5B-Instruct │ 2025-12-25 23:43:21 │
└────────────────────────────┴─────────────────────┘

> zapgpt -p local --url localhost:8000 -q -m Qwen/Qwen2.5-1.5B-Instruct "How is weather in india now a days?"
As an AI developed by Alibaba Cloud, I'm not able to provide real-time information or live updates about current weather conditions. To get the most accurate and up-to-date information regarding weather conditions in India (or anywhere else), you should check reliable sources such as local meteorological services, news websites, or official government weather forecasts, which frequently monitor and update their data based on available sensor readings from around the world.

Running vLLM Locally: Serve an OpenAI-Compatible API on Your Machine: Practical Tips (Performance + Ops)

  • Model downloads: the first serve may take time while weights download; subsequent runs are faster.
  • GPU vs CPU: CPU works for testing, but GPU is where vLLM shines.
  • Concurrency: vLLM is built for batching and throughput, so it handles multiple simultaneous requests well (see the quick test after this list).
  • Networking: --host 0.0.0.0 exposes the server to your whole network; bind to 127.0.0.1 unless you actually need LAN access.
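
A rough way to see the batching in action is to fire several requests at the same chat endpoint at once:

# Send 4 concurrent requests; vLLM batches them server-side
for i in 1 2 3 4; do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct",
         "messages": [{"role": "user", "content": "Reply with one word: ready?"}]}' &
done
wait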

Validation: With these tweaks, you’ll see more stable latency and fewer “why is it slow” moments. Next: wrap it up.

Running vLLM Locally: Serve an OpenAI-Compatible API on Your Machine: Conclusion

Running vllm serve is one of the fastest ways to get a local, OpenAI-compatible LLM API up and running:

  • Use uv to keep the environment clean and reproducible
  • Install vLLM with --torch-backend=auto for sane defaults
  • Launch with vllm serve Qwen/Qwen2.5-1.5B-Instruct
  • Test via /v1/models and /v1/chat/completions

Next steps: try a larger model, tune performance flags, or wire the endpoint into your app by swapping the OpenAI base URL to http://localhost:8000/v1.
