AI/Ollama: Difference between revisions
{| class="wikitable"
!scope='col'| Variable
!scope='col'| Value
!scope='col'| Impact
|-
!scope='row' style='text-align:left' | <code>OLLAMA_FLASH_ATTENTION</code>
| <code>1</code> || Enables flash attention, which reduces memory use as the context grows.
|-
!scope='row' style='text-align:left' | <code>OLLAMA_KV_CACHE_TYPE</code>
| <code>q8_0</code> or <code>q4_0</code> || Compresses the '''short-term memory''' cache. <code>q8_0</code> saves space with almost no quality loss; <code>q4_0</code> saves even more space.
|-
!scope='row' style='text-align:left' | <code>OLLAMA_NUM_PARALLEL</code>
| <code>1</code> || '''Crucial for 32GB RAM.''' Limits Ollama to one task at a time to prevent '''Out of Memory''' crashes when using a 20B model.
|-
!scope='row' style='text-align:left' | <code>OLLAMA_KEEP_ALIVE</code>
| <code>30m</code> || Keeps the 20B model in your RAM for 30 minutes after use so you don't have to wait ~20 seconds for it to '''reload''' every time.
|-
!scope='row' style='text-align:left' | <code>OLLAMA_NUM_CTX</code>
| <code>16384</code> to <code>32768</code> || '''The most important setting.''' Controls the '''brain capacity''' (context window). <code>32k</code> is standard for Claude Code but uses <code>~3GB</code> more RAM than the default <code>4k</code>.
|-
!scope='row' style='text-align:left' | <code>OLLAMA_NUM_GPU</code>
| <code>999</code> || Forces Ollama to offload as many layers as possible to your Intel Arc iGPU instead of the slower CPU.
|}
Revision as of 01:32, 1 March 2026
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gpt-oss:20b
ollama --version
ollama ls
curl -fsSL https://claude.ai/install.sh | bash
ollama launch claude --model gpt-oss:20b
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export OLLAMA_NUM_CTX=32768
export OLLAMA_KEEP_ALIVE=5m
claude --model gpt-oss:20b
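To sanity-check the endpoint outside Claude Code, a minimal standard-library sketch against Ollama's OpenAI-compatible route — the base URL and model name here are assumptions matching the setup above:

```python
# Minimal stdlib sketch of a request to Ollama's OpenAI-compatible endpoint.
# Base URL and model name are assumptions matching the setup above.
import json
import urllib.request

def build_request(model, prompt, base="http://localhost:11434"):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    # Setting data makes this a POST request.
    return urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_request("gpt-oss:20b", "Say hello in one word.")
# With the server running, urllib.request.urlopen(req) sends it.
```

If this round-trips, Claude Code's failures are a client-side configuration issue rather than an Ollama one.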
Optimization
References