Dual A100 + Ollama Production Delivery Handbook

张开发
2026/4/15 7:10:38 · 15 min read


## 1. About This Document

This handbook guides the team through deploying, operating, inspecting, troubleshooting, upgrading, and rolling back Ollama on an Ubuntu 20.04 host with two A100 GPUs. It is meant to be a directly executable production runbook.

Intended audience:

- Operations engineers
- Platform engineers
- AI application developers
- Project / delivery owners

The goal is not to explain "what Ollama is" but to ensure the team can:

- correctly deploy two Ollama instances, one per GPU (GPU0 and GPU1)
- keep models resident and automatically prewarmed
- round-robin requests from the Python side with failover
- handle daily inspection, troubleshooting, upgrades, and rollback

## 2. Delivery Goals

### 2.1 Service goals

- `ollama-gpu0.service` running
- `ollama-gpu1.service` running
- `ollama-prewarm.service` executes successfully
- the default `ollama.service` disabled, to avoid port conflicts

### 2.2 Resource goals

- GPU0 serves port 11434
- GPU1 serves port 11435
- both A100s accept inference requests concurrently
- models stay resident to reduce cold-start jitter

### 2.3 Client goals

- the Python client round-robins across both instances
- the client probes instance liveness
- the client fails over on errors
- the client supports prewarming and resident (keep-alive) calls

### 2.4 Operations goals

- services start on boot
- logs are traceable
- health checks are available
- upgrade and rollback are supported
- daily inspection is supported

## 3. Architecture

The delivery uses a dual-instance, dual-port layout rather than one instance managing both GPUs:

- `ollama-gpu0`: bound to `CUDA_VISIBLE_DEVICES=0`, listening on `0.0.0.0:11434`
- `ollama-gpu1`: bound to `CUDA_VISIBLE_DEVICES=1`, listening on `0.0.0.0:11435`
- `ollama-prewarm`: runs after both instances start and loads the model into VRAM ahead of traffic
- Python layer: treats 11434 and 11435 as an instance pool, dispatches requests round-robin, and fails over automatically when an instance errors

The rationale is straightforward: boundaries are clear, problems are easy to localize, throughput-first workloads benefit, and a failure on one card does not drag down the other.

## 4. Directory Layout

Recommended directory structure:

```
/data/ollama/models        # model directory
/etc/systemd/system/       # systemd unit directory
/usr/local/bin/            # script directory
```

Files delivered:

```
/etc/systemd/system/ollama-gpu0.service
/etc/systemd/system/ollama-gpu1.service
/etc/systemd/system/ollama-prewarm.service
/usr/local/bin/ollama-prewarm.sh
/usr/local/bin/ollama-healthcheck.sh
/usr/local/bin/ollama_bi_gpu_client.py
/data/ollama/models
```

## 5. Prerequisites

Confirm the following before deploying.

### 5.1 System

- Ubuntu 20.04
- systemd available and working
- network access to the model registry, or models already available offline

### 5.2 GPU

Run:

```shell
nvidia-smi
nvidia-smi -L
```

Confirm both A100s are detected, the driver works, and GPU indices or UUIDs can be obtained.

### 5.3 Ollama installed

Run:

```shell
which ollama
ls -l /usr/local/bin/ollama
```

Expected path: `/usr/local/bin/ollama`. If the binary lives elsewhere, adjust `ExecStart` in the unit files below accordingly.

### 5.4 The `ollama` user exists

Run:

```shell
id ollama
```

If the user does not exist, create a service user first.
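The GPU check in section 5.2 can also be scripted for repeatable preflight runs. A minimal Python sketch (the `parse_gpu_list` / `preflight` helper names and the exact `nvidia-smi -L` line format are assumptions for illustration, not part of the delivered scripts):

```python
import re
import subprocess

def parse_gpu_list(nvidia_smi_l_output: str) -> list[str]:
    """Parse `nvidia-smi -L` output ("GPU 0: <name> (UUID: ...)") into GPU names."""
    gpus = []
    for line in nvidia_smi_l_output.splitlines():
        m = re.match(r"GPU (\d+): (.+?) \(UUID: .+\)", line.strip())
        if m:
            gpus.append(m.group(2))
    return gpus

def preflight() -> bool:
    """Return True when exactly two GPUs are visible, as this setup requires."""
    out = subprocess.run(["nvidia-smi", "-L"],
                         capture_output=True, text=True).stdout
    gpus = parse_gpu_list(out)
    print(f"detected {len(gpus)} GPU(s): {gpus}")
    return len(gpus) == 2
```

Running `preflight()` on the target host should report both A100s before you proceed to section 6.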
## 6. Deployment Steps

### 6.1 Create the model directory and set permissions

```shell
sudo mkdir -p /data/ollama/models
sudo chown -R ollama:ollama /data/ollama
sudo chmod 755 /data
sudo chmod 755 /data/ollama
sudo chmod 755 /data/ollama/models
```

Verify the directory chain:

```shell
namei -l /data/ollama/models
```

Verify the `ollama` user can write:

```shell
sudo -u ollama bash -lc '
cd /data/ollama/models
touch .perm_test
rm -f .perm_test
echo ok
'
```

If this prints `ok`, the directory permissions are in place.

### 6.2 Write `ollama-gpu0.service`

```ini
[Unit]
Description=Ollama GPU0 Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Environment=OLLAMA_HOST=0.0.0.0:11434
Environment=CUDA_VISIBLE_DEVICES=0
Environment=OLLAMA_MODELS=/data/ollama/models
Environment=OLLAMA_KEEP_ALIVE=-1
Environment=OLLAMA_FLASH_ATTENTION=1
Environment=OLLAMA_KV_CACHE_TYPE=q8_0
Environment=OLLAMA_MAX_LOADED_MODELS=1
Environment=OLLAMA_NUM_PARALLEL=4
Environment=OLLAMA_MAX_QUEUE=1024
Environment=OLLAMA_CONTEXT_LENGTH=8192

[Install]
WantedBy=multi-user.target
```

### 6.3 Write `ollama-gpu1.service`

```ini
[Unit]
Description=Ollama GPU1 Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Environment=OLLAMA_HOST=0.0.0.0:11435
Environment=CUDA_VISIBLE_DEVICES=1
Environment=OLLAMA_MODELS=/data/ollama/models
Environment=OLLAMA_KEEP_ALIVE=-1
Environment=OLLAMA_FLASH_ATTENTION=1
Environment=OLLAMA_KV_CACHE_TYPE=q8_0
Environment=OLLAMA_MAX_LOADED_MODELS=1
Environment=OLLAMA_NUM_PARALLEL=4
Environment=OLLAMA_MAX_QUEUE=1024
Environment=OLLAMA_CONTEXT_LENGTH=8192

[Install]
WantedBy=multi-user.target
```

### 6.4 Write the prewarm script `ollama-prewarm.sh`

```bash
#!/usr/bin/env bash
set -e
MODEL="${1:-gemma3}"
for port in 11434 11435; do
  echo "[prewarm] checking ${port}"
  curl -sf "http://127.0.0.1:${port}/api/version" > /dev/null
  echo "[prewarm] loading ${MODEL} on ${port}"
  curl -sf "http://127.0.0.1:${port}/api/generate" \
    -d "{\"model\":\"${MODEL}\",\"keep_alive\":-1}" > /dev/null
  echo "[prewarm] verifying ${MODEL} on ${port}"
  curl -sf "http://127.0.0.1:${port}/api/ps"
done
```

Make it executable:

```shell
sudo chmod +x /usr/local/bin/ollama-prewarm.sh
```
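The prewarm call in `ollama-prewarm.sh` can equally be issued from Python, which is convenient from CI or from the application itself. A standard-library-only sketch (the `prewarm_payload` / `prewarm` helper names are hypothetical, not part of the delivered scripts):

```python
import json
import urllib.request

def prewarm_payload(model: str) -> bytes:
    """JSON body that loads `model` and pins it resident (keep_alive=-1),
    mirroring the curl call in the prewarm script."""
    return json.dumps({"model": model, "keep_alive": -1}).encode("utf-8")

def prewarm(endpoint: str, model: str) -> None:
    """POST a generate request with no prompt so the model is loaded into VRAM."""
    req = urllib.request.Request(
        f"{endpoint}/api/generate",
        data=prewarm_payload(model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        resp.read()

if __name__ == "__main__":
    # Warm both instances of the dual-port layout.
    for endpoint in ("http://127.0.0.1:11434", "http://127.0.0.1:11435"):
        prewarm(endpoint, "gemma3")
```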
### 6.5 Write the prewarm unit `ollama-prewarm.service`

```ini
[Unit]
Description=Prewarm Ollama Models
After=ollama-gpu0.service ollama-gpu1.service
Wants=ollama-gpu0.service ollama-gpu1.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ollama-prewarm.sh gemma3

[Install]
WantedBy=multi-user.target
```

### 6.6 Disable the default single-instance service

```shell
sudo systemctl disable --now ollama || true
```

Purpose: free port 11434, avoid mixing the default service with the dual-instance layout, and keep service boundaries clean.

### 6.7 Reload and start the services

```shell
sudo systemctl daemon-reload
sudo systemctl enable --now ollama-gpu0
sudo systemctl enable --now ollama-gpu1
sudo systemctl enable ollama-prewarm
sudo systemctl start ollama-prewarm
```

## 7. Parameter Design

Recommended parameters for this delivery:

- `OLLAMA_KEEP_ALIVE=-1`
- `OLLAMA_FLASH_ATTENTION=1`
- `OLLAMA_KV_CACHE_TYPE=q8_0`
- `OLLAMA_MAX_LOADED_MODELS=1`
- `OLLAMA_NUM_PARALLEL=4`
- `OLLAMA_MAX_QUEUE=1024`
- `OLLAMA_CONTEXT_LENGTH=8192`

What each one does:

- `OLLAMA_KEEP_ALIVE=-1` — keeps the model resident, avoiding repeated unload/reload cycles and their cold-start latency.
- `OLLAMA_FLASH_ATTENTION=1` — lowers memory overhead in large-context scenarios and improves stability when contexts grow.
- `OLLAMA_KV_CACHE_TYPE=q8_0` — shrinks the KV-cache footprint, balancing quality against VRAM.
- `OLLAMA_MAX_LOADED_MODELS=1` — one primary model per instance; avoids multi-model VRAM contention and behavioral complexity.
- `OLLAMA_NUM_PARALLEL=4` — raises per-instance concurrency, suiting the dual-instance, throughput-first scenario. Note that concurrency and context length are coupled, not independent: instance pressure scales roughly with `OLLAMA_NUM_PARALLEL × OLLAMA_CONTEXT_LENGTH`.
- `OLLAMA_MAX_QUEUE=1024` — buffers short bursts and reduces fast failures under transient spikes. Note that the queue is a buffer, not a performance amplifier.
- `OLLAMA_CONTEXT_LENGTH=8192` — caps the per-request context budget, which suits throughput-first services. If a workload genuinely needs very long contexts, tune that service separately rather than raising the global default.

## 8. Start/Stop and Operations Commands

### 8.1 Start

```shell
sudo systemctl start ollama-gpu0
sudo systemctl start ollama-gpu1
sudo systemctl start ollama-prewarm
```

### 8.2 Stop

```shell
sudo systemctl stop ollama-prewarm
sudo systemctl stop ollama-gpu0
sudo systemctl stop ollama-gpu1
```

### 8.3 Restart

```shell
sudo systemctl restart ollama-gpu0
sudo systemctl restart ollama-gpu1
sudo systemctl restart ollama-prewarm
```

### 8.4 Status

```shell
systemctl status ollama-gpu0 --no-pager -l
systemctl status ollama-gpu1 --no-pager -l
systemctl status ollama-prewarm --no-pager -l
```

### 8.5 Logs

```shell
journalctl -u ollama-gpu0 --no-pager --follow
journalctl -u ollama-gpu1 --no-pager --follow
journalctl -u ollama-prewarm --no-pager --follow
```
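Section 7's point that instance pressure scales with `OLLAMA_NUM_PARALLEL × OLLAMA_CONTEXT_LENGTH` can be made concrete with a small helper. This is a rough upper-bound heuristic for comparing configurations, not an exact VRAM formula:

```python
def worst_case_token_budget(num_parallel: int, context_length: int) -> int:
    """Upper bound on simultaneously resident context tokens per instance:
    every parallel slot filled with a maximum-length context."""
    return num_parallel * context_length

# The shipped config: 4 parallel slots x 8192-token contexts = 32768 tokens.
# Doubling either knob doubles this bound, which is why the two must be
# tuned together rather than independently.
```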
## 9. Daily Inspection Handbook

Run an inspection once a day, focusing on the following items.

### 9.1 Service status

```shell
systemctl is-active ollama-gpu0
systemctl is-active ollama-gpu1
```

Both should return `active`.

### 9.2 API reachability

```shell
curl http://127.0.0.1:11434/api/version
curl http://127.0.0.1:11435/api/version
```

### 9.3 Model runtime status

```shell
curl http://127.0.0.1:11434/api/ps
curl http://127.0.0.1:11435/api/ps
```

Confirm both instances have a running model, the model has not been unloaded unexpectedly, and the context configuration matches expectations.

### 9.4 GPU

```shell
ollama ps
watch -n 1 nvidia-smi
```

Watch for: the model showing `100% GPU` in `ollama ps`; VRAM usage on both cards; inference activity on both cards.

### 9.5 Directory permissions

```shell
namei -l /data/ollama/models
```

Confirm the permission chain has not been changed.

## 10. Health Check Handbook

Run the health check directly:

```shell
/usr/local/bin/ollama-healthcheck.sh
```

or check manually:

```shell
curl http://127.0.0.1:11434/api/version
curl http://127.0.0.1:11435/api/version
curl http://127.0.0.1:11434/api/ps
curl http://127.0.0.1:11435/api/ps
```

Pass criteria: the `version` endpoint responds, the `ps` endpoint shows the model, and the model state is normal with no unexpected unloads.

## 11. Troubleshooting Handbook

The most common problems and how to handle them.

### 11.1 Service fails to start: port already in use

Typical error:

```
bind: address already in use
```

Steps:

```shell
sudo ss -lntp | grep 11434
sudo ss -lntp | grep 11435
sudo lsof -i :11434 -P -n
sudo lsof -i :11435 -P -n
```

If the default `ollama.service` is running:

```shell
sudo systemctl disable --now ollama
```

Free the ports if necessary:

```shell
sudo fuser -k 11434/tcp
sudo fuser -k 11435/tcp
```

### 11.2 Service fails to start: insufficient directory permissions

Typical error:

```
permission denied: ensure path elements are traversable
```

Steps:

```shell
sudo mkdir -p /data/ollama/models
sudo chown -R ollama:ollama /data/ollama
sudo chmod 755 /data
sudo chmod 755 /data/ollama
sudo chmod 755 /data/ollama/models
namei -l /data/ollama/models
```

Verify:

```shell
sudo -u ollama bash -lc '
cd /data/ollama/models
touch .perm_test
rm -f .perm_test
echo ok
'
```

### 11.3 First token is very slow

Check whether: the instance was never prewarmed; `keep_alive=-1` is missing; the model keeps getting unloaded.

```shell
curl http://127.0.0.1:11434/api/ps
curl http://127.0.0.1:11435/api/ps
```

Prewarm manually if needed:

```shell
/usr/local/bin/ollama-prewarm.sh gemma3
```

### 11.4 Low throughput under load

Check whether: the Python client actually round-robins both instances; all requests are hitting a single port; `OLLAMA_NUM_PARALLEL` is too low; `OLLAMA_CONTEXT_LENGTH` is too high; CPU offload is occurring.

```shell
ollama ps
watch -n 1 nvidia-smi
```

### 11.5 503 overloaded

Cause: requests exceed the instance's current capacity; the queue is full or the instance is overloaded. Suggestions: confirm the caller balances traffic across instances; check whether concurrency is too high or contexts too large; adjust `OLLAMA_NUM_PARALLEL` or `OLLAMA_MAX_QUEUE` if necessary.
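The handbook references `/usr/local/bin/ollama-healthcheck.sh` without listing its contents. As a reference, a hypothetical Python equivalent of the manual checks in sections 9–11 might look like this (the script actually shipped with the delivery may differ):

```python
import json
import urllib.request

ENDPOINTS = ("http://127.0.0.1:11434", "http://127.0.0.1:11435")

def get_json(url: str) -> dict:
    """Fetch a JSON document from an Ollama HTTP endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

def instance_healthy(version: dict, ps: dict) -> bool:
    """Healthy = version endpoint answered and at least one model is loaded."""
    return bool(version.get("version")) and len(ps.get("models", [])) > 0

def check_all() -> bool:
    ok = True
    for ep in ENDPOINTS:
        try:
            healthy = instance_healthy(get_json(f"{ep}/api/version"),
                                       get_json(f"{ep}/api/ps"))
        except OSError:
            healthy = False  # connection refused / timeout -> instance down
        print(f"{ep}: {'OK' if healthy else 'FAIL'}")
        ok = ok and healthy
    return ok

if __name__ == "__main__":
    raise SystemExit(0 if check_all() else 1)
```

The nonzero exit code on failure makes the check easy to wire into cron or a monitoring agent.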
## 12. Upgrade Handbook

### 12.1 Pre-upgrade preparation

Record the current version:

```shell
ollama --version
```

Back up the unit files:

```shell
sudo cp /etc/systemd/system/ollama-gpu0.service /etc/systemd/system/ollama-gpu0.service.bak
sudo cp /etc/systemd/system/ollama-gpu1.service /etc/systemd/system/ollama-gpu1.service.bak
sudo cp /etc/systemd/system/ollama-prewarm.service /etc/systemd/system/ollama-prewarm.service.bak
```

Back up the scripts:

```shell
sudo cp /usr/local/bin/ollama-prewarm.sh /usr/local/bin/ollama-prewarm.sh.bak
sudo cp /usr/local/bin/ollama-healthcheck.sh /usr/local/bin/ollama-healthcheck.sh.bak
sudo cp /usr/local/bin/ollama_bi_gpu_client.py /usr/local/bin/ollama_bi_gpu_client.py.bak
```

### 12.2 Upgrade steps

Stop the services:

```shell
sudo systemctl stop ollama-prewarm
sudo systemctl stop ollama-gpu0
sudo systemctl stop ollama-gpu1
```

After replacing the `ollama` binary, run:

```shell
sudo systemctl daemon-reload
sudo systemctl start ollama-gpu0
sudo systemctl start ollama-gpu1
sudo systemctl start ollama-prewarm
```

### 12.3 Post-upgrade verification

```shell
ollama --version
systemctl status ollama-gpu0 --no-pager -l
systemctl status ollama-gpu1 --no-pager -l
/usr/local/bin/ollama-healthcheck.sh
ollama ps
```

## 13. Rollback Handbook

If the upgrade misbehaves, roll back as follows.

### 13.1 Stop the current services

```shell
sudo systemctl stop ollama-prewarm
sudo systemctl stop ollama-gpu0
sudo systemctl stop ollama-gpu1
```

### 13.2 Restore the old configuration and scripts

```shell
sudo cp /etc/systemd/system/ollama-gpu0.service.bak /etc/systemd/system/ollama-gpu0.service
sudo cp /etc/systemd/system/ollama-gpu1.service.bak /etc/systemd/system/ollama-gpu1.service
sudo cp /etc/systemd/system/ollama-prewarm.service.bak /etc/systemd/system/ollama-prewarm.service
sudo cp /usr/local/bin/ollama-prewarm.sh.bak /usr/local/bin/ollama-prewarm.sh
sudo cp /usr/local/bin/ollama-healthcheck.sh.bak /usr/local/bin/ollama-healthcheck.sh
sudo cp /usr/local/bin/ollama_bi_gpu_client.py.bak /usr/local/bin/ollama_bi_gpu_client.py
```

### 13.3 Restore the old Ollama binary

Overwrite the current binary with the backed-up executable.

### 13.4 Reload and start

```shell
sudo systemctl daemon-reload
sudo systemctl start ollama-gpu0
sudo systemctl start ollama-gpu1
sudo systemctl start ollama-prewarm
```

### 13.5 Verify the rollback

```shell
ollama --version
systemctl status ollama-gpu0 --no-pager -l
systemctl status ollama-gpu1 --no-pager -l
/usr/local/bin/ollama-healthcheck.sh
```
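The delivered `ollama_bi_gpu_client.py` is not listed in this document. To make the round-robin-with-failover design from section 3 concrete, here is a minimal hypothetical sketch; it is not the actual `ProductionOllamaPool` implementation, and the class and method names are illustrative only:

```python
import itertools
import json
import urllib.request

class RoundRobinPool:
    """Minimal round-robin instance pool with failover (illustrative sketch)."""

    def __init__(self, endpoints: list[str]):
        self.endpoints = list(endpoints)
        self._cycle = itertools.cycle(range(len(self.endpoints)))

    def pick_order(self) -> list[str]:
        """Next endpoint in rotation first, the rest as failover candidates."""
        start = next(self._cycle)
        n = len(self.endpoints)
        return [self.endpoints[(start + i) % n] for i in range(n)]

    def generate(self, model: str, prompt: str) -> dict:
        last_err = None
        for ep in self.pick_order():  # try each instance in rotation
            try:
                req = urllib.request.Request(
                    f"{ep}/api/generate",
                    data=json.dumps({"model": model, "prompt": prompt,
                                     "stream": False, "keep_alive": -1}).encode(),
                    headers={"Content-Type": "application/json"},
                )
                with urllib.request.urlopen(req, timeout=120) as resp:
                    return json.load(resp)
            except OSError as err:
                last_err = err  # instance down -> fail over to the next one
        raise RuntimeError(f"all instances failed: {last_err}")
```

The rotation alone spreads load across 11434 and 11435; failover is simply "keep walking the rotation until one instance answers".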
## 14. Python Integration

On the Python side, use an instance pool; never hard-code a call path to a single port.

Startup example:

```python
from ollama_bi_gpu_client import ProductionOllamaPool

pool = ProductionOllamaPool(endpoints=[
    "http://127.0.0.1:11434",
    "http://127.0.0.1:11435",
])
pool.prewarm("gemma3")
resp = pool.generate("gemma3", "用中文介绍一下向量数据库。")  # "Introduce vector databases, in Chinese."
print(resp.get("response", ""))
```

Integration principles:

- all business traffic goes through the instance pool
- never hard-code a single port
- switch instances automatically when a liveness probe fails
- keep `keep_alive=-1` on all critical requests

## 15. Pre-launch Checklist

Service layer:

- `ollama-gpu0.service` running
- `ollama-gpu1.service` running
- `ollama-prewarm.service` executes successfully
- default `ollama.service` disabled

Path layer:

- `which ollama` path confirmed
- `/data/ollama/models` exists
- the `ollama` user has read/write access
- the parent directory chain is traversable

API layer:

- `/api/version` on 11434 responds
- `/api/version` on 11435 responds
- `/api/ps` on 11434 shows the model
- `/api/ps` on 11435 shows the model

GPU layer:

- `ollama ps` shows `100% GPU`
- both cards show VRAM usage
- both cards show inference activity

Client layer:

- Python client round-robins
- Python client probes liveness
- Python client fails over
- both instances receive traffic under load tests

Tuning layer:

- `load_duration` drops clearly after prewarming
- `prompt_eval_duration` shows no abnormal spikes
- no sustained `503 overloaded`
- parameters converged to a stable range

## 16. Closing

That closes the loop on this dual-A100 Ollama delivery handbook. It is not a pile of commands but an executable plan covering deployment, configuration, permissions, start/stop, prewarming, inspection, troubleshooting, upgrade, rollback, and Python integration. The real value of a delivery is not that you got a service running; it is that whoever receives this document can keep it running by following it.
