# OPS Unified Management: Solution Design Document

> Created: 2025-12-18
> Status: Pending implementation

---

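## 1. Background and Current State

### 1.1 Organizational Structure
- **Companies**: 5, each with a different line of business
- **R&D teams**: 2, based in Chengdu and Beijing
- **Developers**: about 15, geographically dispersed, some working from home
- **Ops ownership**: the ops staff belong to one company but serve all five

### 1.2 Infrastructure
- **Servers**: about 30
- **Cloud providers**: mainly Alibaba Cloud and Tencent Cloud
- **Existing tools**: JumpServer (underused), a monitoring system, Jenkins

### 1.3 Current Pain Points
| Pain point | Description |
|------|------|
| Conflicting response priorities | Several companies raise requests at once, with no rule for which to handle first |
| Blurred permission/security boundaries | Data and systems are not clearly isolated between companies |
| Difficult cross-site collaboration | The Chengdu and Beijing teams struggle to coordinate |
| Tedious server and account management | **The biggest pain point**: scattered keys, shared keys, reused passwords |

### 1.4 Why JumpServer Was Not Adopted
- Poor experience: multiple hops make connections unstable
- AI tools and automation need direct connections to servers, which the bastion-host model does not provide
- Servers cannot connect to each other directly, which adds more hops

---
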
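## 2. Target State

1. **One entry point for everything**: a unified platform with a global view
2. **Isolated per company, unified in view**: resources are logically isolated, while the overall owner keeps a global view
3. **Automation first**: permissions sync automatically when personnel change
4. **Compatible with AI tools**: direct connections, with no multi-hop latency

---
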
## 3. Solution

### 3.1 Overall Architecture: Headscale Mesh Networking

```
              Headscale controller
  (identity authentication and node discovery)
                        │
    ┌───────────────────┼───────────────────┐
    ▼                   ▼                   ▼
Server A ◄────────► Server B ◄────────► Developer
100.64.0.1          100.64.0.2          100.64.0.100

Key property: all nodes connect directly, peer to peer; the controller does not relay traffic
```

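### 3.2 Why Headscale

| Aspect | Traditional bastion host | Headscale mesh |
|--------|-----------|---------------|
| Connection model | All traffic relayed through the bastion host | Direct peer-to-peer |
| Server-to-server access | Requires multiple hops | Direct |
| AI tool support | Poor experience | Native support |
| Latency | High | Low |
| Security | Depends on the bastion host | Private network isolation + ACL |
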
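### 3.3 User/Namespace Layout

```
# Development environments
company-a-dev → Company A development servers + Company A developers
company-b-dev → Company B development servers + Company B developers
company-c-dev → Company C development servers + Company C developers
company-d-dev → Company D development servers + Company D developers
company-e-dev → Company E development servers + Company E developers

# Production environments
company-a-prod → Company A production servers
company-b-prod → Company B production servers
company-c-prod → Company C production servers
company-d-prod → Company D production servers
company-e-prod → Company E production servers

# Management roles
ops → Ops staff (can access everything)
cicd → Jenkins (accesses production only, for releases)
```

---
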
## 4. Technical Implementation

### 4.1 Headscale Deployment

#### Directory Structure
```bash
mkdir -p /opt/headscale/{config,data}
```

#### config.yaml
```yaml
# Key names vary between Headscale releases; check them against the
# config-example.yaml shipped with the version you deploy.
server_url: https://hs.yourdomain.com:443
listen_addr: 0.0.0.0:8080
metrics_listen_addr: 0.0.0.0:9090

ip_prefixes:
  - 100.64.0.0/10

database:
  type: sqlite
  sqlite:
    path: /var/lib/headscale/db.sqlite

acl_policy_path: /etc/headscale/acl.yaml
```

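Because `ip_prefixes` hands out node addresses from 100.64.0.0/10, a quick range check can catch typos when recording node IPs. The following is a minimal bash sketch, not part of Headscale or Tailscale: `ip_to_int` and `in_cgnat_range` are hypothetical helper names, and the input is assumed to be a canonical dotted-quad address without leading zeros.

```bash
#!/usr/bin/env bash
# Sketch: test whether an IPv4 address lies inside 100.64.0.0/10,
# the prefix configured under ip_prefixes.
# ip_to_int and in_cgnat_range are illustrative names, not Headscale commands.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed (exit status 0) when the address falls in 100.64.0.0/10.
in_cgnat_range() {
    local ip_int base mask
    ip_int=$(ip_to_int "$1")
    base=$(ip_to_int "100.64.0.0")
    mask=$(( (0xFFFFFFFF << 22) & 0xFFFFFFFF ))   # a /10 keeps the top 10 bits
    (( (ip_int & mask) == base ))
}

for ip in 100.64.0.1 100.64.0.100 10.0.0.5; do
    if in_cgnat_range "$ip"; then
        echo "$ip: inside 100.64.0.0/10"
    else
        echo "$ip: outside 100.64.0.0/10"
    fi
done
```
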
#### docker-compose.yml
```yaml
version: '3'
services:
  headscale:
    image: headscale/headscale:latest
    container_name: headscale
    restart: unless-stopped
    ports:
      - "8080:8080"
      - "9090:9090"
    volumes:
      - ./config:/etc/headscale
      - ./data:/var/lib/headscale
    command: serve
```

#### Nginx Reverse Proxy
```nginx
server {
    listen 443 ssl;
    server_name hs.yourdomain.com;

    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location / {
        proxy_pass http://127.0.0.1:8080;
        # Upgrade (WebSocket-style) proxying needs HTTP/1.1 to the upstream
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

### 4.2 ACL Configuration

```yaml
# /opt/headscale/config/acl.yaml

groups:
  group:ops: ["ops"]
  group:cicd: ["cicd"]
  group:all-prod:
    - "company-a-prod"
    - "company-b-prod"
    - "company-c-prod"
    - "company-d-prod"
    - "company-e-prod"

acls:
  # Developers can access only their own company's development environment
  - action: accept
    src: ["company-a-dev"]
    dst: ["company-a-dev:*"]
  - action: accept
    src: ["company-b-dev"]
    dst: ["company-b-dev:*"]
  - action: accept
    src: ["company-c-dev"]
    dst: ["company-c-dev:*"]
  - action: accept
    src: ["company-d-dev"]
    dst: ["company-d-dev:*"]
  - action: accept
    src: ["company-e-dev"]
    dst: ["company-e-dev:*"]

  # Ops can access everything
  - action: accept
    src: ["group:ops"]
    dst: ["*:*"]

  # CI/CD can access production (port 22 only)
  - action: accept
    src: ["group:cicd"]
    dst: ["group:all-prod:22"]

  # Production servers of the same company can reach each other
  - action: accept
    src: ["company-a-prod"]
    dst: ["company-a-prod:*"]
  - action: accept
    src: ["company-b-prod"]
    dst: ["company-b-prod:*"]
  - action: accept
    src: ["company-c-prod"]
    dst: ["company-c-prod:*"]
  - action: accept
    src: ["company-d-prod"]
    dst: ["company-d-prod:*"]
  - action: accept
    src: ["company-e-prod"]
    dst: ["company-e-prod:*"]
```

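The per-company rules in this ACL all follow one pattern, so they could be generated from a list of company codes rather than maintained by hand. A minimal bash sketch follows; the `COMPANIES` array and the `emit_company_rules` function are hypothetical names, and the script only prints the repeated `acls` entries for review. It does not write acl.yaml or talk to Headscale.

```bash
#!/usr/bin/env bash
# Sketch: print the repeated same-company ACL entries for each company.
# COMPANIES and emit_company_rules are illustrative names only.

COMPANIES=(a b c d e)

# Print one accept rule per company for the given environment suffix (dev or prod).
emit_company_rules() {
    local env="$1" c
    for c in "${COMPANIES[@]}"; do
        printf '  - action: accept\n'
        printf '    src: ["company-%s-%s"]\n' "$c" "$env"
        printf '    dst: ["company-%s-%s:*"]\n' "$c" "$env"
    done
}

echo "  # Developers: own company's development environment only"
emit_company_rules dev
echo "  # Production servers: same-company access only"
emit_company_rules prod
```

Adding a sixth company would then mean extending `COMPANIES` and regenerating, with the `group:all-prod` list updated separately.
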
### 4.3 User Creation Commands

```bash
# Development environments
docker exec -it headscale headscale users create company-a-dev
docker exec -it headscale headscale users create company-b-dev
docker exec -it headscale headscale users create company-c-dev
docker exec -it headscale headscale users create company-d-dev
docker exec -it headscale headscale users create company-e-dev

# Production environments
docker exec -it headscale headscale users create company-a-prod
docker exec -it headscale headscale users create company-b-prod
docker exec -it headscale headscale users create company-c-prod
docker exec -it headscale headscale users create company-d-prod
docker exec -it headscale headscale users create company-e-prod

# Management roles
docker exec -it headscale headscale users create ops
docker exec -it headscale headscale users create cicd
```

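These twelve commands differ only in the user name, so a loop can keep the list consistent with the naming scheme in section 3.3. This is a sketch under assumptions: `list_users` is a hypothetical helper, and the script only prints the commands (a dry run) so they can be reviewed before anything is executed.

```bash
#!/usr/bin/env bash
# Sketch: derive the 12 user names and print the matching create commands.
# list_users is an illustrative helper; nothing here contacts Headscale.

list_users() {
    local c env
    for env in dev prod; do
        for c in a b c d e; do
            echo "company-${c}-${env}"
        done
    done
    echo "ops"
    echo "cicd"
}

# Dry run: print each command instead of executing it.
# -it is dropped because a scripted call has no interactive TTY.
list_users | while read -r user; do
    echo "docker exec headscale headscale users create ${user}"
done
```
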
### 4.4 Generating an AuthKey

```bash
# Example: generate a reusable key for company-a-dev
docker exec -it headscale headscale preauthkeys create \
    --user company-a-dev \
    --expiration 720h \
    --reusable
```

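A 720-hour expiration equals 30 days, which is worth tracking once a key rotation routine is defined. The sketch below computes the date on which such a key expires. It assumes GNU `date` (as found on typical Linux servers; the BSD/macOS `date` uses different flags), and `key_expiry_date` is a hypothetical helper, not a Headscale command.

```bash
#!/usr/bin/env bash
# Sketch: compute the UTC date on which a preauthkey expires.
# Assumes GNU date; key_expiry_date is an illustrative helper name.

# Usage: key_expiry_date <creation date YYYY-MM-DD> <lifetime in hours>
key_expiry_date() {
    local created="$1" hours="$2" start_epoch
    start_epoch=$(date -u -d "$created" +%s)             # midnight UTC of the creation date
    date -u -d "@$(( start_epoch + hours * 3600 ))" +%F  # add the lifetime, print YYYY-MM-DD
}

echo "720h = $(( 720 / 24 )) days"
echo "A key created on 2025-12-18 expires on $(key_expiry_date 2025-12-18 720)"
```
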
### 4.5 Client Onboarding

#### Servers (Linux)
```bash
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up --login-server https://hs.yourdomain.com --authkey <key>
```

#### Developer Machines
```bash
# After installing the Tailscale client on Mac/Windows
tailscale up --login-server https://hs.yourdomain.com --authkey <key>
```

---

## 5. Implementation Checklist

### Phase 1: Preparation
- [ ] 1.1 Provision the Headscale controller server (1 core, 1 GB RAM, public IP)
- [ ] 1.2 Prepare the domain name and SSL certificate
- [ ] 1.3 Inventory the servers (30 machines, tagged by company and environment)
- [ ] 1.4 Inventory the staff (15 people, tagged by company and location)

### Phase 2: Deploy Headscale
- [ ] 2.1 Create the directory structure
- [ ] 2.2 Create config.yaml
- [ ] 2.3 Create acl.yaml
- [ ] 2.4 Create docker-compose.yml
- [ ] 2.5 Start the service
- [ ] 2.6 Configure Nginx + HTTPS
- [ ] 2.7 Verify that the service is reachable

### Phase 3: Create Users and AuthKeys
- [ ] 3.1 Create the development users (5)
- [ ] 3.2 Create the production users (5)
- [ ] 3.3 Create the management users (ops, cicd)
- [ ] 3.4 Generate a preauthkey for each user
- [ ] 3.5 Verify the ACL configuration

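For item 3.5, it may help to write the intended access rules down as an executable table before testing real connections. The sketch below models the intent of the ACL in section 4.2 using only the user naming scheme. `may_access` and `check` are hypothetical helpers that mirror the policy on paper; they do not read acl.yaml or query Headscale, so real connectivity still has to be tested between enrolled nodes.

```bash
#!/usr/bin/env bash
# Sketch: model of the intended ACL, keyed on user names.
# may_access and check are illustrative; they do not call Headscale.

# Usage: may_access <source user> <destination user> <port>
may_access() {
    local src="$1" dst="$2" port="$3"
    case "$src" in
        ops)
            return 0 ;;                                        # ops reaches everything
        cicd)
            [[ "$dst" == company-*-prod && "$port" == 22 ]] ;;  # port 22 on production only
        company-*)
            [[ "$src" == "$dst" ]] ;;                           # same company, same environment
        *)
            return 1 ;;                                        # unknown users get nothing
    esac
}

check() {
    if may_access "$1" "$2" "$3"; then
        echo "allow $1 -> $2:$3"
    else
        echo "deny  $1 -> $2:$3"
    fi
}

check company-a-dev company-a-dev 22
check company-a-dev company-b-dev 22
check company-a-dev company-a-prod 22
check cicd company-c-prod 22
check cicd company-c-prod 443
check ops company-e-prod 3306
```

Each `deny` line is a connection that should fail when tried from a real node, and each `allow` line one that should succeed.
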
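### Phase 4: Server Onboarding
- [ ] 4.1 Pilot with 1-2 development servers
- [ ] 4.2 Onboard the remaining development servers in bulk
- [ ] 4.3 Onboard the production servers
- [ ] 4.4 Onboard the Jenkins server
- [ ] 4.5 Produce a server IP mapping table

### Phase 5: Developer Onboarding
- [ ] 5.1 Write the developer onboarding guide
- [ ] 5.2 Have ops staff try it first
- [ ] 5.3 Batch 1: core developers in Chengdu (3-5 people)
- [ ] 5.4 Batch 2: core developers in Beijing (3-5 people)
- [ ] 5.5 Batch 3: all remaining developers
- [ ] 5.6 Verify that AI tools work as expected

### Phase 6: Parallel Run (1-2 weeks)
- [ ] 6.1 Keep public port 22 open
- [ ] 6.2 Collect feedback
- [ ] 6.3 Resolve issues
- [ ] 6.4 Monitor the stability of the Headscale service

### Phase 7: Cutover
- [ ] 7.1 Confirm that everyone has adapted
- [ ] 7.2 Close public port 22 on the servers
- [ ] 7.3 Retire the old SSH keys
- [ ] 7.4 Update the Jenkins deployment configuration
- [ ] 7.5 Decide what to do with JumpServer

### Phase 8: Documentation and Standards
- [ ] 8.1 Update the ops documentation
- [ ] 8.2 Define the permission request process
- [ ] 8.3 Define the key rotation mechanism

---
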
## 6. Time Estimate

| Phase | Effort |
|------|--------|
| Preparation | 1 day |
| Deploy Headscale | Half a day |
| Create users and configuration | Half a day |
| Server onboarding | 1-2 days |
| Developer onboarding | 2-3 days |
| Parallel run | 1-2 weeks |
| Cutover | 1 day |

**Total: about 2-3 weeks to complete the full cutover**

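The total can be cross-checked by summing the table. The sketch below works in half-day units so that bash integer arithmetic is enough, and it assumes that one week means five working days, which the table does not state. Under that assumption the range comes to 11-18 working days, which matches the stated total when the longer phases finish near their lower bounds.

```bash
#!/usr/bin/env bash
# Sketch: sum the per-phase estimates in half-day units (1 day = 2 units).
# Assumption: 1 week = 5 working days = 10 half-days.

# Each entry is "phase:min:max", with min and max in half-days.
PHASES=(
    "Preparation:2:2"
    "Deploy Headscale:1:1"
    "Create users and configuration:1:1"
    "Server onboarding:2:4"
    "Developer onboarding:4:6"
    "Parallel run:10:20"
    "Cutover:2:2"
)

min_total=0
max_total=0
for entry in "${PHASES[@]}"; do
    IFS=: read -r _ lo hi <<< "$entry"   # the phase name is not needed for the sum
    min_total=$(( min_total + lo ))
    max_total=$(( max_total + hi ))
done

# Convert half-days back to working days (both totals are even here).
echo "Estimated effort: $(( min_total / 2 ))-$(( max_total / 2 )) working days"
```
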
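
---

## 7. Follow-up Plans (Layers 2 and 3)

Once the Headscale network is in place, the following can proceed:

### Layer 2: Unified Identity
- Set up LDAP/KeyCloak as the single identity source
- Connect JumpServer, Jenkins, monitoring, and Git to the identity source
- Provision accounts on onboarding and revoke them on offboarding in one step

### Layer 3: Multi-Cloud Account Governance
- Manage multi-cloud resources with Terraform
- Tighten cloud console permissions
- Tag resources by company for cost allocation

---
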
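## 8. Risks and Mitigations

| Risk | Mitigation |
|------|----------|
| Headscale controller outage | Already-connected nodes can still reach each other; only new nodes joining are affected |
| ACL misconfiguration | Validate in a test environment first, then open access gradually |
| Developer resistance | Communicate thoroughly during the parallel run; collect feedback and improve |
| No access in an emergency | Keep the public port open on 1-2 servers as an emergency entry point |

---
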
## 9. Command Quick Reference

```bash
# List all nodes
docker exec -it headscale headscale nodes list

# List all users
docker exec -it headscale headscale users list

# List the nodes of one user
docker exec -it headscale headscale nodes list -u company-a-dev

# Delete a node
docker exec -it headscale headscale nodes delete -i <node_id>

# Validate the ACL (subcommands differ between Headscale versions; confirm with `headscale --help`)
docker exec -it headscale headscale policy validate /etc/headscale/acl.yaml

# List preauthkeys
docker exec -it headscale headscale preauthkeys list -u company-a-dev
```

---

*End of document*