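# OPS Unified Management Plan - Headscale Networking Implementation Plan

> **Task ID**: 4448
> **Version**: v2.0
> **Last updated**: 2025-12-18
> **Document status**: Detailed design

---

## Table of Contents

1. [Project Background and Goals](#1-project-background-and-goals)
2. [Technical Solution Overview](#2-technical-solution-overview)
3. [Network Architecture Design](#3-network-architecture-design)
4. [Infrastructure Planning](#4-infrastructure-planning)
5. [Headscale Server Deployment](#5-headscale-server-deployment)
6. [Client Onboarding](#6-client-onboarding)
7. [Access Control and Security Policy](#7-access-control-and-security-policy)
8. [DNS and Service Discovery](#8-dns-and-service-discovery)
9. [Monitoring and Alerting](#9-monitoring-and-alerting)
10. [Operations Management Standards](#10-operations-management-standards)
11. [Failure Recovery and Disaster Recovery](#11-failure-recovery-and-disaster-recovery)
12. [Implementation Plan and Milestones](#12-implementation-plan-and-milestones)
13. [Risk Assessment and Mitigation](#13-risk-assessment-and-mitigation)
14. [Appendix](#14-appendix)

---
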
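## 1. Project Background and Goals

### 1.1 Background

As the business grows, the operations team faces the following challenges:

- **Multi-cloud, multi-region footprint**: Servers are spread across Alibaba Cloud, Tencent Cloud, AWS, and other cloud platforms, plus several physical data centers.
- **Complex network isolation**: Managing isolation between environments (production, testing, development) is complicated.
- **Hard-to-manage VPNs**: Traditional VPN solutions (OpenVPN, IPSec) are complex to configure and costly to maintain.
- **Secure access requirements**: Internal services must be reachable securely and conveniently while meeting compliance requirements.
- **Low operational efficiency**: Cross-network operations are tedious and lack a single entry point.

### 1.2 Goals

| Dimension | Goal | Acceptance criterion |
|-----------|------|----------------------|
| Connectivity | P2P direct connections between all nodes | Latency between any two nodes < 50 ms (same region) |
| Security | Zero-trust network architecture | All traffic encrypted, identity-based authentication |
| Usability | One-step access to the internal network | Client installation and configuration < 5 minutes |
| Scalability | Fast scale-out | New node onboarding < 10 minutes |
| High availability | Highly available control plane | SLA 99.9% |

### 1.3 Scope

- All production servers
- Testing and staging servers
- Work devices of operations and development staff
- CI/CD build nodes
- Infrastructure such as databases and caches

---
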
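## 2. Technical Solution Overview

### 2.1 Why Headscale

| Solution | Pros | Cons | Best suited for |
|----------|------|------|-----------------|
| **Headscale** | Open source and self-hosted, WireGuard-based, P2P direct connections, lightweight | Relatively young ecosystem | Teams that need full control |
| Tailscale | Mature commercial support | Coordination data crosses national borders, high cost | Small teams getting started quickly |
| OpenVPN | Mature and stable | Complex configuration, lower performance | Traditional enterprises |
| ZeroTier | Easy to use | Free tier has many limits | Small deployments |

**Core reasons for choosing Headscale**:

1. **Data sovereignty**: All coordination data is stored on our own servers.
2. **Predictable cost**: Fully open source, no subscription fees.
3. **WireGuard advantages**: Modern cryptography, low latency, high performance.
4. **Mesh networking**: Nodes communicate directly with no central forwarding; traffic falls back to a DERP relay only when NAT traversal fails.
5. **Tailscale client compatibility**: The mature Tailscale clients can be used as-is.

### 2.2 Architecture Diagram
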
```
                              ┌─────────────────────────────────────────────────────────┐
                              │                        Internet                         │
                              └──────────────────────────┬──────────────────────────────┘
                                                         │
                              ┌──────────────────────────┴──────────────────────────────┐
                              │                                                         │
                    ┌─────────▼─────────┐                              ┌────────────────▼────────────────┐
                    │   Headscale HA    │                              │       DERP Relay Servers        │
                    │   Control Plane   │                              │      (Beijing/Shanghai/HK)      │
                    │                   │                              │                                 │
                    │ ┌───────────────┐ │                              │  ┌─────────┐     ┌─────────┐    │
                    │ │   Headscale   │ │                              │  │ DERP-BJ │     │ DERP-SH │    │
                    │ │    Primary    │ │                              │  └─────────┘     └─────────┘    │
                    │ └───────────────┘ │                              │  ┌─────────┐                    │
                    │ ┌───────────────┐ │                              │  │ DERP-HK │                    │
                    │ │  PostgreSQL   │ │                              │  └─────────┘                    │
                    │ │     (HA)      │ │                              └─────────────────────────────────┘
                    │ └───────────────┘ │
                    └─────────┬─────────┘
                              │ Coordination
                              │
        ┌─────────────────────┼─────────────────────┬─────────────────────┐
        │                     │                     │                     │
        ▼                     ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  Production   │     │    Staging    │     │  Development  │     │   Operator    │
│    Servers    │     │    Servers    │     │    Servers    │     │    Devices    │
│               │     │               │     │               │     │               │
│ ┌───────────┐ │     │ ┌───────────┐ │     │ ┌───────────┐ │     │ ┌───────────┐ │
│ │ Tailscale │ │◄───►│ │ Tailscale │ │◄───►│ │ Tailscale │ │◄───►│ │ Tailscale │ │
│ │   Agent   │ │ P2P │ │   Agent   │ │ P2P │ │   Agent   │ │ P2P │ │  Client   │ │
│ └───────────┘ │     │ └───────────┘ │     │ └───────────┘ │     │ └───────────┘ │
└───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘
   100.65.x.x            100.66.x.x            100.68.x.x            100.70.x.x
```

### 2.3 Core Components

| Component | Function | Deployment location | HA strategy |
|-----------|----------|---------------------|-------------|
| Headscale Server | Coordination service, key distribution, ACL management | Cloud host | Primary/standby + PostgreSQL HA |
| DERP Relay | Relay service used when NAT traversal fails | Multiple regions | Redundant nodes |
| Tailscale Client | Client agent | All nodes | Start on boot |
| Admin UI | Web management interface | Same host as Headscale | - |

---

## 3. Network Architecture Design

### 3.1 IP Address Plan

Addresses are drawn from the CGNAT range `100.64.0.0/10` and divided by environment and purpose:

```
100.64.0.0/10 (total address space: 4,194,304 addresses)
│
├── 100.64.0.0/16 - Reserved (management)
│   ├── 100.64.0.0/24 - Headscale control plane
│   ├── 100.64.1.0/24 - DERP relay servers
│   └── 100.64.2.0/24 - Monitoring infrastructure
│
├── 100.65.0.0/16 - Production
│   ├── 100.65.1.0/24 - Web servers
│   ├── 100.65.2.0/24 - API servers
│   ├── 100.65.3.0/24 - Database servers
│   ├── 100.65.4.0/24 - Cache servers
│   ├── 100.65.5.0/24 - Message queue servers
│   ├── 100.65.10.0/24 - Kubernetes masters
│   ├── 100.65.12.0/23 - Kubernetes workers
│   └── 100.65.100.0/24 - Production bastion hosts
│
├── 100.66.0.0/16 - Staging
│   ├── 100.66.1.0/24 - Application servers
│   ├── 100.66.2.0/24 - Database servers
│   └── 100.66.10.0/24 - Kubernetes cluster
│
├── 100.67.0.0/16 - Testing
│   ├── 100.67.1.0/24 - Application servers
│   ├── 100.67.2.0/24 - Database servers
│   └── 100.67.100.0/24 - CI/CD build nodes
│
├── 100.68.0.0/16 - Development
│   ├── 100.68.1.0/24 - Development servers
│   └── 100.68.2.0/24 - Development databases
│
├── 100.70.0.0/16 - Operator devices
│   ├── 100.70.1.0/24 - Senior operations engineers
│   ├── 100.70.2.0/24 - Operations engineers
│   └── 100.70.10.0/24 - On-call staff
│
├── 100.71.0.0/16 - Developer devices
│   ├── 100.71.1.0/24 - Backend developers
│   ├── 100.71.2.0/24 - Frontend developers
│   └── 100.71.3.0/24 - Mobile developers
│
└── 100.80.0.0/16 - External partners
    └── 100.80.1.0/24 - Third-party vendors
```
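
An address plan like this is easy to get subtly wrong, for instance with overlapping ranges or a prefix that is not aligned to its mask (a /23 must start on an even third octet, so `100.65.11.0/23` is not a valid network). The following is a minimal sketch, using only Python's standard `ipaddress` module, of how the plan could be sanity-checked. The `check_plan` helper and the `ENVIRONMENTS` dictionary are illustrative names, not part of Headscale.

```python
import ipaddress

# The CGNAT range the whole plan must fit into.
SUPERNET = ipaddress.ip_network("100.64.0.0/10")

# Top-level allocations copied from the plan above (names are illustrative).
ENVIRONMENTS = {
    "infra": "100.64.0.0/16",
    "prod": "100.65.0.0/16",
    "staging": "100.66.0.0/16",
    "testing": "100.67.0.0/16",
    "dev": "100.68.0.0/16",
    "ops-team": "100.70.0.0/16",
    "dev-team": "100.71.0.0/16",
    "partners": "100.80.0.0/16",
}


def check_plan(plan):
    """Return a list of human-readable problems found in a {name: cidr} plan."""
    problems = []
    networks = {}
    for name, cidr in plan.items():
        try:
            # Strict parsing: host bits must be zero, so misaligned prefixes fail.
            net = ipaddress.ip_network(cidr)
        except ValueError as exc:
            problems.append(f"{name}: {cidr} is not a valid network ({exc})")
            continue
        if not net.subnet_of(SUPERNET):
            problems.append(f"{name}: {cidr} is outside {SUPERNET}")
        networks[name] = net
    # Pairwise overlap check between all successfully parsed networks.
    names = sorted(networks)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if networks[a].overlaps(networks[b]):
                problems.append(f"{a} overlaps {b}")
    return problems


if __name__ == "__main__":
    for problem in check_plan(ENVIRONMENTS):
        print(problem)
```

Running the same check over the per-role /24 and /23 ranges of one environment would catch overlaps inside that environment as well.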

### 3.2 Namespace Design

Headscale uses Users (formerly Namespaces) for logical isolation:

| User | Purpose | IP range | Administrator |
|------|---------|----------|---------------|
| `infra` | Infrastructure services | 100.64.0.0/16 | ops-admin |
| `prod` | Production servers | 100.65.0.0/16 | ops-admin |
| `staging` | Staging environment | 100.66.0.0/16 | ops-admin |
| `testing` | Testing environment | 100.67.0.0/16 | qa-admin |
| `dev` | Development environment | 100.68.0.0/16 | dev-admin |
| `ops-team` | Operator devices | 100.70.0.0/16 | ops-admin |
| `dev-team` | Developer devices | 100.71.0.0/16 | dev-admin |
| `partners` | External partners | 100.80.0.0/16 | ops-admin |

### 3.3 Node Naming Convention

```
<environment>-<role>-<region>-<sequence>

Examples:
- prod-web-bj-001       Production, Beijing, web server #1
- prod-db-sh-001        Production, Shanghai, database #1
- staging-api-bj-001    Staging, Beijing, API server #1
- ops-laptop-zhangsan   Laptop of operations engineer Zhang San
```
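
A naming convention is only useful if it is enforced at registration time. Below is a minimal sketch of a validator for the pattern above. The allowed environment names, the two-letter region code, the three-digit sequence, and the device types (`laptop`, `desktop`, `phone`) are assumptions inferred from the examples, not rules stated elsewhere in this document.

```python
import re

# Servers: <environment>-<role>-<region>-<sequence>, e.g. prod-web-bj-001.
# Assumed: four environments, two-letter region codes, three-digit sequence.
SERVER_NAME = re.compile(
    r"^(?P<env>prod|staging|testing|dev)-(?P<role>[a-z0-9]+)-(?P<region>[a-z]{2})-(?P<seq>\d{3})$"
)

# Personal devices: <team>-<device>-<owner>, e.g. ops-laptop-zhangsan.
# Assumed: the device types listed here are hypothetical.
DEVICE_NAME = re.compile(
    r"^(?P<team>ops|dev)-(?P<device>laptop|desktop|phone)-(?P<owner>[a-z0-9]+)$"
)


def classify(name):
    """Return ("server" | "device", fields) for a valid name, else None."""
    match = SERVER_NAME.match(name)
    if match:
        return "server", match.groupdict()
    match = DEVICE_NAME.match(name)
    if match:
        return "device", match.groupdict()
    return None


if __name__ == "__main__":
    for candidate in ["prod-web-bj-001", "ops-laptop-zhangsan", "Prod_Web_01"]:
        print(candidate, "->", classify(candidate))
```

A check like this could run in the onboarding script before `tailscale up` so that badly named nodes never reach the tailnet.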

### 3.4 DERP Relay Network

Self-hosted DERP servers provide a reliable relay path when NAT traversal fails:

| Node | Region | Public IP | Ports | Notes |
|------|--------|-----------|-------|-------|
| derp-bj-01 | Beijing | x.x.x.x | 443/3478 | Alibaba Cloud, primary node |
| derp-sh-01 | Shanghai | x.x.x.x | 443/3478 | Tencent Cloud, backup node |
| derp-hk-01 | Hong Kong | x.x.x.x | 443/3478 | AWS, overseas node |
| derp-sg-01 | Singapore | x.x.x.x | 443/3478 | Southeast Asia node |

---

## 4. Infrastructure Planning

### 4.1 Server Resource Plan

#### 4.1.1 Headscale Control Plane

| Component | Spec | Count | Description |
|-----------|------|-------|-------------|
| Headscale Primary | 4 vCPU / 8 GB RAM / 100 GB SSD | 1 | Primary control node |
| Headscale Standby | 4 vCPU / 8 GB RAM / 100 GB SSD | 1 | Hot standby node |
| PostgreSQL Primary | 4 vCPU / 16 GB RAM / 500 GB SSD | 1 | Database primary |
| PostgreSQL Replica | 4 vCPU / 16 GB RAM / 500 GB SSD | 1 | Database replica |
| Admin UI | 2 vCPU / 4 GB RAM / 50 GB SSD | 1 | Management interface |

#### 4.1.2 DERP Relay Servers

| Region | Spec | Bandwidth | Count |
|--------|------|-----------|-------|
| Beijing | 2 vCPU / 4 GB RAM / 50 GB disk | 100 Mbps | 1 |
| Shanghai | 2 vCPU / 4 GB RAM / 50 GB disk | 100 Mbps | 1 |
| Hong Kong | 2 vCPU / 4 GB RAM / 50 GB disk | 100 Mbps | 1 |
| Singapore | 2 vCPU / 4 GB RAM / 50 GB disk | 100 Mbps | 1 |

### 4.2 Network Requirements

#### 4.2.1 Headscale Server Ports

| Port | Protocol | Purpose | Source |
|------|----------|---------|--------|
| 443 | TCP | HTTPS API & gRPC | All clients |
| 80 | TCP | HTTP redirect | All clients |
| 50443 | TCP | Management API (optional) | Management network |

#### 4.2.2 DERP Server Ports

| Port | Protocol | Purpose | Source |
|------|----------|---------|--------|
| 443 | TCP | HTTPS DERP | All clients |
| 3478 | UDP | STUN | All clients |
| 80 | TCP | HTTP redirect | All clients |

#### 4.2.3 Tailscale Client Ports

| Port | Protocol | Purpose | Direction |
|------|----------|---------|-----------|
| 41641 | UDP | WireGuard direct connection | Inbound/outbound |
| 443 | TCP | DERP relay | Outbound |
| 3478 | UDP | STUN | Outbound |

### 4.3 Domain and Certificate Plan

| Domain | Purpose | Certificate |
|--------|---------|-------------|
| hs.ops.company.com | Headscale API | Let's Encrypt wildcard |
| admin.hs.ops.company.com | Admin UI | Let's Encrypt |
| derp-bj.ops.company.com | Beijing DERP | Let's Encrypt |
| derp-sh.ops.company.com | Shanghai DERP | Let's Encrypt |
| derp-hk.ops.company.com | Hong Kong DERP | Let's Encrypt |

---

## 5. Headscale Server Deployment

### 5.1 System Preparation

```bash
# Operating system: Ubuntu 22.04 LTS / Rocky Linux 9
# (the commands below use apt; use the dnf equivalents on Rocky Linux)

# Set the time zone
timedatectl set-timezone Asia/Shanghai

# Update the system
apt update && apt upgrade -y

# Install required tools
apt install -y curl wget vim htop net-tools jq unzip

# Disable swap (for containerized deployments)
swapoff -a
sed -i '/swap/d' /etc/fstab

# Set kernel parameters
cat >> /etc/sysctl.conf << EOF
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1
net.core.rmem_max = 2500000
net.core.wmem_max = 2500000
EOF
sysctl -p

# Raise file descriptor limits
cat >> /etc/security/limits.conf << EOF
* soft nofile 65535
* hard nofile 65535
root soft nofile 65535
root hard nofile 65535
EOF
```

### 5.2 PostgreSQL High-Availability Deployment

#### 5.2.1 Install the PostgreSQL Primary

```bash
# Install PostgreSQL 15
# (Ubuntu 22.04 ships PostgreSQL 14; version 15 requires the PGDG apt repository)
apt install -y postgresql-15 postgresql-contrib-15

# Configure PostgreSQL.
# Write a drop-in under conf.d instead of overwriting postgresql.conf, which on
# Debian/Ubuntu also carries data_directory, hba_file and the conf.d include.
cat > /etc/postgresql/15/main/conf.d/headscale.conf << 'EOF'
listen_addresses = '*'
port = 5432
max_connections = 200
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 10MB
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
max_parallel_maintenance_workers = 2

# Replication settings
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
hot_standby = on
EOF

# Configure access control
cat > /etc/postgresql/15/main/pg_hba.conf << 'EOF'
local   all             postgres                                peer
local   all             all                                     peer
host    all             all             127.0.0.1/32            scram-sha-256
host    all             all             ::1/128                 scram-sha-256
host    replication     replicator      <standby_ip>/32         scram-sha-256
host    headscale       headscale       <headscale_ip>/32       scram-sha-256
host    headscale       headscale       <headscale_standby_ip>/32 scram-sha-256
EOF

# Create the database and users
sudo -u postgres psql << 'EOF'
CREATE USER headscale WITH PASSWORD 'your_secure_password_here';
CREATE DATABASE headscale OWNER headscale;
GRANT ALL PRIVILEGES ON DATABASE headscale TO headscale;

CREATE USER replicator WITH REPLICATION PASSWORD 'replicator_password';
EOF

systemctl restart postgresql
systemctl enable postgresql
```

#### 5.2.2 Configure the PostgreSQL Replica

```bash
# Stop PostgreSQL
systemctl stop postgresql

# Empty the data directory
rm -rf /var/lib/postgresql/15/main/*

# Copy the data from the primary
sudo -u postgres pg_basebackup -h <primary_ip> -U replicator -p 5432 \
  -D /var/lib/postgresql/15/main -Fp -Xs -P -R

# Start the replica
systemctl start postgresql
```

### 5.3 Headscale Installation and Configuration

#### 5.3.1 Binary Installation

```bash
# Download the release (0.23.0 is used as the example)
HEADSCALE_VERSION="0.23.0"
wget -O /tmp/headscale.deb \
  "https://github.com/juanfont/headscale/releases/download/v${HEADSCALE_VERSION}/headscale_${HEADSCALE_VERSION}_linux_amd64.deb"

# Install
dpkg -i /tmp/headscale.deb

# Or use Docker
docker pull headscale/headscale:0.23.0
```

#### 5.3.2 Headscale Configuration File

```yaml
# /etc/headscale/config.yaml
---
server_url: https://hs.ops.company.com:443
# Nginx (section 5.5) terminates TLS on 443 and proxies to this address
listen_addr: 127.0.0.1:8080
metrics_listen_addr: 127.0.0.1:9090
grpc_listen_addr: 0.0.0.0:50443
grpc_allow_insecure: false

# Private key paths
private_key_path: /var/lib/headscale/private.key
noise:
  private_key_path: /var/lib/headscale/noise_private.key

# IP address prefixes
prefixes:
  v4: 100.64.0.0/10
  v6: fd7a:115c:a1e0::/48
  allocation: sequential

# Database (PostgreSQL)
database:
  type: postgres
  postgres:
    host: <postgresql_host>
    port: 5432
    name: headscale
    user: headscale
    pass: your_secure_password_here
    max_open_conns: 100
    max_idle_conns: 10
    conn_max_idle_time_secs: 3600
    ssl: disable  # enabling "require" is recommended in production

# DERP
derp:
  server:
    enabled: false  # standalone DERP servers are used instead
    region_id: 999
    region_code: "headscale"
    region_name: "Headscale Embedded DERP"
    stun_listen_addr: "0.0.0.0:3478"
  # Empty: the default Tailscale DERP map is not used
  urls: []
  # Local DERP map, see section 5.4.3
  paths:
    - /etc/headscale/derp.json
  auto_update_enabled: true
  update_frequency: 24h

# Disable the update check; expire inactive ephemeral nodes
disable_check_updates: true
ephemeral_node_inactivity_timeout: 30m

# Node update check interval
node_update_check_interval: 10s

# DNS
dns:
  magic_dns: true
  base_domain: ts.company.local
  nameservers:
    global:
      - 10.0.0.1    # internal DNS
      - 223.5.5.5   # Alibaba DNS (fallback)
  search_domains:
    - company.local
  extra_records:
    - name: "grafana.ts.company.local"
      type: "A"
      value: "100.64.0.10"
    - name: "prometheus.ts.company.local"
      type: "A"
      value: "100.64.0.11"

# Unix socket
unix_socket: /var/run/headscale/headscale.sock
unix_socket_permission: "0770"

# TLS (leave empty when TLS terminates at the reverse proxy)
tls_cert_path: ""
tls_key_path: ""

# Logging
log:
  format: json
  level: info

# ACL policy
policy:
  mode: file
  path: /etc/headscale/acl.json

# OIDC (optional)
oidc:
  only_start_if_oidc_is_available: true
  issuer: "https://sso.company.com/realms/ops"
  client_id: "headscale"
  client_secret: "your_oidc_client_secret"
  scope: ["openid", "profile", "email"]
  extra_params:
    domain_hint: company.com
  strip_email_domain: true
  allowed_users: []
  allowed_groups:
    - "/ops-team"
    - "/dev-team"
```

#### 5.3.3 Create the systemd Service

```ini
# /etc/systemd/system/headscale.service
[Unit]
Description=headscale coordination server
Documentation=https://github.com/juanfont/headscale
After=network-online.target postgresql.service
Wants=network-online.target
# Only valid when PostgreSQL runs on this host; remove if the database is remote
Requires=postgresql.service

[Service]
User=headscale
Group=headscale
Type=simple
Restart=always
RestartSec=5
ExecStart=/usr/bin/headscale serve
Environment="GIN_MODE=release"

# Resource limits
LimitNOFILE=65535
LimitNPROC=65535

# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/headscale /var/run/headscale

[Install]
WantedBy=multi-user.target
```

#### 5.3.4 Start the Service

```bash
# Create the user and directories
useradd -r -s /bin/false headscale
mkdir -p /var/lib/headscale /var/run/headscale /etc/headscale
chown -R headscale:headscale /var/lib/headscale /var/run/headscale

# Start the service
systemctl daemon-reload
systemctl enable headscale
systemctl start headscale

# Verify the service
systemctl status headscale
headscale version
```

### 5.4 DERP Relay Server Deployment

#### 5.4.1 DERP Server Setup

```bash
# Install Go (only needed when building from source)
wget https://go.dev/dl/go1.21.5.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.21.5.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin

# Install derper
go install tailscale.com/cmd/derper@latest

# Or use Docker
docker pull ghcr.io/tailscale/derper:latest
```

#### 5.4.2 DERP Deployment with Docker Compose

```yaml
# /opt/derper/docker-compose.yml
version: '3.8'
services:
  derper:
    image: ghcr.io/tailscale/derper:latest
    container_name: derper
    restart: always
    ports:
      - "443:443"
      - "80:80"
      - "3478:3478/udp"
    volumes:
      - ./certs:/etc/derper/certs:ro
      - ./config:/etc/derper/config:ro
    command:
      - --hostname=derp-bj.ops.company.com
      - --certmode=manual
      - --certdir=/etc/derper/certs
      - --stun
      - --stun-port=3478
      - --verify-clients=true
      - --verify-client-url=https://hs.ops.company.com/verify
    environment:
      - DERP_VERIFY_CLIENTS=true
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "3"
```

#### 5.4.3 DERP Map Configuration

Configure the DERP map on the Headscale server in `/etc/headscale/derp.json` (plain JSON, so the file itself must not contain comments):

```json
{
  "Regions": {
    "900": {
      "RegionID": 900,
      "RegionCode": "bj",
      "RegionName": "Beijing",
      "Avoid": false,
      "Nodes": [
        {
          "Name": "bj1",
          "RegionID": 900,
          "HostName": "derp-bj.ops.company.com",
          "DERPPort": 443,
          "STUNPort": 3478,
          "InsecureForTests": false
        }
      ]
    },
    "901": {
      "RegionID": 901,
      "RegionCode": "sh",
      "RegionName": "Shanghai",
      "Avoid": false,
      "Nodes": [
        {
          "Name": "sh1",
          "RegionID": 901,
          "HostName": "derp-sh.ops.company.com",
          "DERPPort": 443,
          "STUNPort": 3478,
          "InsecureForTests": false
        }
      ]
    },
    "902": {
      "RegionID": 902,
      "RegionCode": "hk",
      "RegionName": "Hong Kong",
      "Avoid": false,
      "Nodes": [
        {
          "Name": "hk1",
          "RegionID": 902,
          "HostName": "derp-hk.ops.company.com",
          "DERPPort": 443,
          "STUNPort": 3478,
          "InsecureForTests": false
        }
      ]
    }
  }
}
```
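
Because every region entry in the map has the same shape, the file can be generated from a short region list rather than edited by hand, which helps when a region such as Singapore is added later. The sketch below is one possible way to do that with the standard library; `build_derp_map` and `REGIONS` are illustrative names, and the ports and the `<code>1` node naming simply mirror the JSON above.

```python
import json

# (region_id, region_code, region_name, hostname), copied from the map above.
REGIONS = [
    (900, "bj", "Beijing", "derp-bj.ops.company.com"),
    (901, "sh", "Shanghai", "derp-sh.ops.company.com"),
    (902, "hk", "Hong Kong", "derp-hk.ops.company.com"),
]


def build_derp_map(regions, derp_port=443, stun_port=3478):
    """Build a DERP map dictionary with one node per region."""
    derp_map = {"Regions": {}}
    for region_id, code, name, hostname in regions:
        derp_map["Regions"][str(region_id)] = {
            "RegionID": region_id,
            "RegionCode": code,
            "RegionName": name,
            "Avoid": False,
            "Nodes": [
                {
                    "Name": f"{code}1",  # first node in the region
                    "RegionID": region_id,
                    "HostName": hostname,
                    "DERPPort": derp_port,
                    "STUNPort": stun_port,
                    "InsecureForTests": False,
                }
            ],
        }
    return derp_map


if __name__ == "__main__":
    # Print the map; redirect the output to /etc/headscale/derp.json to install it.
    print(json.dumps(build_derp_map(REGIONS), indent=2))
```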

### 5.5 Nginx Reverse Proxy Configuration

```nginx
# /etc/nginx/sites-available/headscale
upstream headscale {
    server 127.0.0.1:8080;
    keepalive 32;
}

server {
    listen 80;
    server_name hs.ops.company.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name hs.ops.company.com;

    # SSL
    ssl_certificate /etc/letsencrypt/live/hs.ops.company.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/hs.ops.company.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers off;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 1d;
    ssl_session_tickets off;
    ssl_stapling on;
    ssl_stapling_verify on;

    # Security headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Frame-Options DENY always;
    add_header X-Content-Type-Options nosniff always;

    location / {
        proxy_pass http://headscale;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_buffering off;
        proxy_read_timeout 86400s;
        proxy_send_timeout 86400s;
    }

    # gRPC support
    location /headscale.v1.HeadscaleService/ {
        grpc_pass grpc://127.0.0.1:50443;
        grpc_set_header Host $host;
        grpc_set_header X-Real-IP $remote_addr;
    }

    # Health check
    location /health {
        proxy_pass http://headscale/health;
        access_log off;
    }

    # Metrics (internal networks only)
    location /metrics {
        allow 10.0.0.0/8;
        allow 172.16.0.0/12;
        allow 192.168.0.0/16;
        allow 100.64.0.0/10;
        deny all;
        proxy_pass http://127.0.0.1:9090/metrics;
    }
}
```

### 5.6 Admin UI Deployment (Headscale-UI)

```yaml
# /opt/headscale-ui/docker-compose.yml
version: '3.8'
services:
  headscale-ui:
    image: ghcr.io/gurucomputing/headscale-ui:latest
    container_name: headscale-ui
    restart: always
    ports:
      - "127.0.0.1:8081:80"
    environment:
      - HS_SERVER=https://hs.ops.company.com
```

---

## 6. Client Onboarding

### 6.1 Linux Servers

#### 6.1.1 Install the Tailscale Client

```bash
# Ubuntu/Debian and RHEL/CentOS (the install script detects the distribution)
curl -fsSL https://tailscale.com/install.sh | sh

# Or install manually on Ubuntu 22.04 (jammy)
curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/jammy.noarmor.gpg | sudo tee /usr/share/keyrings/tailscale-archive-keyring.gpg >/dev/null
curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/jammy.tailscale-keyring.list | sudo tee /etc/apt/sources.list.d/tailscale.list
apt update && apt install -y tailscale
```

#### 6.1.2 Connect to Headscale

```bash
# Use a pre-auth key (recommended for servers)
tailscale up \
  --login-server https://hs.ops.company.com \
  --authkey tskey-preauth-xxxxxxxxxxxxx \
  --hostname prod-web-bj-001 \
  --advertise-tags tag:prod,tag:web \
  --accept-routes \
  --accept-dns

# Interactive login (for development machines)
tailscale up \
  --login-server https://hs.ops.company.com \
  --hostname ops-laptop-zhangsan

# Verify the connection
tailscale status
tailscale ip
```

#### 6.1.3 Automated Installation Script

```bash
#!/bin/bash
# /opt/scripts/setup-tailscale.sh

set -euo pipefail

# Configuration variables (override through the environment)
HEADSCALE_URL="${HEADSCALE_URL:-https://hs.ops.company.com}"
AUTH_KEY="${AUTH_KEY:-}"
HOSTNAME="${HOSTNAME:-$(hostname -s)}"
TAGS="${TAGS:-}"
ACCEPT_ROUTES="${ACCEPT_ROUTES:-true}"

# Logging helper
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}

# Install Tailscale unless it is already present
if command -v tailscale &> /dev/null; then
    log "Tailscale is already installed, version: $(tailscale version)"
else
    log "Installing Tailscale..."
    curl -fsSL https://tailscale.com/install.sh | sh
fi

# Build the tailscale up command as an array so arguments are never re-parsed
UP_CMD=(tailscale up --login-server "$HEADSCALE_URL")

if [ -n "$AUTH_KEY" ]; then
    UP_CMD+=(--authkey "$AUTH_KEY")
fi

if [ -n "$HOSTNAME" ]; then
    UP_CMD+=(--hostname "$HOSTNAME")
fi

if [ -n "$TAGS" ]; then
    UP_CMD+=(--advertise-tags "$TAGS")
fi

if [ "$ACCEPT_ROUTES" = "true" ]; then
    UP_CMD+=(--accept-routes --accept-dns)
fi

# Connect
log "Connecting to Headscale..."
"${UP_CMD[@]}"

# Verify the connection: an assigned IPv4 address means the node is registered
sleep 5
if TS_IP="$(tailscale ip -4 2>/dev/null)" && [ -n "$TS_IP" ]; then
    log "Connected. IP: $TS_IP"
else
    log "Connection failed, check the configuration"
    exit 1
fi
```

### 6.2 macOS/Windows Clients

#### 6.2.1 macOS

```bash
# Install with Homebrew
brew install tailscale

# Start the daemon and connect
sudo tailscaled &
tailscale up --login-server https://hs.ops.company.com

# Or use the official client
# Download: https://tailscale.com/download/mac
# After installing, change the login server in the settings
```

#### 6.2.2 Windows

```powershell
# Install with Winget
winget install tailscale.tailscale

# Install with Chocolatey
choco install tailscale

# Connect (PowerShell as Administrator)
tailscale up --login-server https://hs.ops.company.com
```

### 6.3 Mobile Devices

1. Download the official Tailscale client from the App Store or Google Play
2. Open the app and tap the settings icon
3. Select "Custom coordination server"
4. Enter: `https://hs.ops.company.com`
5. Tap "Log in" to complete authentication

### 6.4 Pre-auth Key Management

```bash
# Create a reusable pre-auth key (for automated deployment)
headscale preauthkeys create \
  --user prod \
  --reusable \
  --expiration 720h \
  --tags tag:prod,tag:automated

# Create a single-use pre-auth key
headscale preauthkeys create \
  --user ops-team \
  --expiration 24h

# List all pre-auth keys
headscale preauthkeys list --user prod

# Expire a key
headscale preauthkeys expire --user prod <key>
```

### 6.5 Automated Deployment with Ansible

```yaml
# roles/tailscale/tasks/main.yml
---
- name: Install Tailscale
  shell: curl -fsSL https://tailscale.com/install.sh | sh
  args:
    creates: /usr/bin/tailscale

- name: Start tailscaled service
  systemd:
    name: tailscaled
    state: started
    enabled: yes

- name: Check if already connected
  command: tailscale status
  register: ts_status
  ignore_errors: yes
  changed_when: false

- name: Connect to Headscale
  command: >
    tailscale up
    --login-server {{ headscale_url }}
    --authkey {{ headscale_authkey }}
    --hostname {{ inventory_hostname }}
    --advertise-tags {{ tailscale_tags | join(',') }}
    --accept-routes
    --accept-dns
  when: ts_status.rc != 0

- name: Verify connection
  command: tailscale ip -4
  register: ts_ip
  changed_when: false

- name: Display Tailscale IP
  debug:
    msg: "Tailscale IP: {{ ts_ip.stdout }}"
```

---

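## 7. Access Control and Security Policy

### 7.1 ACL Design Principles

1. **Least privilege**: Grant only the minimum permissions needed to do the job.
2. **Layered isolation**: Production, testing, and development environments are strictly isolated.
3. **Role-based**: Operations and development roles receive different permissions.
4. **Auditable**: All access can be logged and traced.

### 7.2 Detailed ACL Configuration
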
```json
// /etc/headscale/acl.json
{
  "groups": {
    "group:ops-admin": ["user:zhangsan", "user:lisi"],
    "group:ops-member": ["user:wangwu", "user:zhaoliu"],
    "group:dev-senior": ["user:dev01", "user:dev02"],
    "group:dev-junior": ["user:dev03", "user:dev04"],
    "group:qa": ["user:qa01", "user:qa02"],
    "group:dba": ["user:dba01"]
  },

  "tagOwners": {
    "tag:prod": ["group:ops-admin"],
    "tag:staging": ["group:ops-admin", "group:ops-member"],
    "tag:testing": ["group:ops-admin", "group:qa"],
    "tag:dev": ["group:ops-admin", "group:dev-senior"],
    "tag:web": ["group:ops-admin"],
    "tag:api": ["group:ops-admin"],
    "tag:db": ["group:ops-admin", "group:dba"],
    "tag:cache": ["group:ops-admin"],
    "tag:mq": ["group:ops-admin"],
    "tag:k8s": ["group:ops-admin"],
    "tag:monitoring": ["group:ops-admin"],
    "tag:bastion": ["group:ops-admin"]
  },

  "hosts": {
    "prod-bastion": "100.65.100.1",
    "staging-bastion": "100.66.100.1",
    "monitoring-server": "100.64.0.10",
    "jenkins-master": "100.67.100.1"
  },

  "acls": [
    // ===== Infrastructure rules =====
    // Every node may reach DNS
    {
      "action": "accept",
      "src": ["*"],
      "dst": ["100.64.0.1:53"]
    },

    // Every node may reach the monitoring stack
    {
      "action": "accept",
      "src": ["*"],
      "dst": ["tag:monitoring:9090,9093,3000"]
    },

    // ===== Operations administrator rules =====
    // Operations administrators may reach every service in every environment
    {
      "action": "accept",
      "src": ["group:ops-admin"],
      "dst": ["*:*"]
    },

    // ===== Operations member rules =====
    // Operations members may reach non-production environments
    {
      "action": "accept",
      "src": ["group:ops-member"],
      "dst": ["tag:staging:*", "tag:testing:*", "tag:dev:*"]
    },
    // Operations members reach production only through the bastion host
    {
      "action": "accept",
      "src": ["group:ops-member"],
      "dst": ["tag:bastion:22"]
    },

    // ===== DBA rules =====
    // DBAs may reach all databases
    {
      "action": "accept",
      "src": ["group:dba"],
      "dst": ["tag:db:3306,5432,6379,27017"]
    },
    // DBAs may reach the bastion host
    {
      "action": "accept",
      "src": ["group:dba"],
      "dst": ["tag:bastion:22"]
    },

    // ===== Senior developer rules =====
    // Senior developers may reach development, testing, and staging
    {
      "action": "accept",
      "src": ["group:dev-senior"],
      "dst": ["tag:staging:*", "tag:testing:*", "tag:dev:*"]
    },

    // ===== Junior developer rules =====
    // Junior developers may reach the development environment only
    {
      "action": "accept",
      "src": ["group:dev-junior"],
      "dst": ["tag:dev:*"]
    },

    // ===== QA rules =====
    // QA may reach testing and staging
    {
      "action": "accept",
      "src": ["group:qa"],
      "dst": ["tag:testing:*", "tag:staging:80,443,8080"]
    },

    // ===== Service-to-service rules =====
    // Production web servers may reach the API servers
    {
      "action": "accept",
      "src": ["tag:web"],
      "dst": ["tag:api:8080,8443"]
    },
    // API servers may reach databases, caches, and message queues
    {
      "action": "accept",
      "src": ["tag:api"],
      "dst": ["tag:db:3306,5432", "tag:cache:6379", "tag:mq:5672,15672"]
    },
    // Traffic inside the Kubernetes cluster
    {
      "action": "accept",
      "src": ["tag:k8s"],
      "dst": ["tag:k8s:*"]
    },

    // ===== CI/CD rules =====
    // Jenkins may reach the testing environment to deploy
    {
      "action": "accept",
      "src": ["jenkins-master"],
      "dst": ["tag:testing:22,80,443,8080"]
    },

    // ===== Default deny (implicit) =====
  ],

  // SSH rules (control Tailscale SSH)
  "ssh": [
    {
      "action": "accept",
      "src": ["group:ops-admin"],
      "dst": ["*"],
      "users": ["root", "ubuntu", "centos"]
    },
    {
      "action": "accept",
      "src": ["group:ops-member"],
      "dst": ["tag:staging", "tag:testing", "tag:dev"],
      "users": ["ubuntu", "centos"]
    }
  ],

  // Policy tests (for debugging)
  "tests": [
    {
      "src": "user:zhangsan",
      "accept": ["tag:prod:22", "tag:db:3306"]
    },
    {
      "src": "user:dev01",
      "accept": ["tag:dev:*"],
      "deny": ["tag:prod:*"]
    }
  ]
}
```
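
To reason about what a policy like this permits, it helps to model the evaluation: access is denied by default and granted as soon as one `accept` rule matches the source, the destination, and the port. The sketch below is a deliberately simplified model of that logic, not Headscale's policy engine. It ignores port ranges, the `hosts` map, CIDR matching, and SSH rules, and it assumes the caller has already expanded a node into its set of groups and tags. `parse_dst`, `is_allowed`, and `RULES` are illustrative names.

```python
def parse_dst(dst):
    """Split 'tag:db:3306,5432' into ('tag:db', {'3306', '5432'}).

    A port part of '*' is returned as None, meaning any port.
    """
    target, _, ports = dst.rpartition(":")
    return target, (None if ports == "*" else set(ports.split(",")))


def is_allowed(rules, src_identities, dst_identities, port):
    """Default deny: return True only if some accept rule matches."""
    for rule in rules:
        if rule["action"] != "accept":
            continue
        # The source matches on '*' or on any group/tag/user the node carries.
        if not any(s == "*" or s in src_identities for s in rule["src"]):
            continue
        for dst in rule["dst"]:
            target, ports = parse_dst(dst)
            target_ok = target == "*" or target in dst_identities
            port_ok = ports is None or str(port) in ports
            if target_ok and port_ok:
                return True
    return False


# Three rules copied from the policy above.
RULES = [
    {"action": "accept", "src": ["group:ops-admin"], "dst": ["*:*"]},
    {"action": "accept", "src": ["group:dba"], "dst": ["tag:db:3306,5432,6379,27017"]},
    {"action": "accept", "src": ["group:dev-junior"], "dst": ["tag:dev:*"]},
]

if __name__ == "__main__":
    # A DBA reaching PostgreSQL on a production database node.
    print(is_allowed(RULES, {"group:dba"}, {"tag:prod", "tag:db"}, 5432))
    # A junior developer reaching a production web node.
    print(is_allowed(RULES, {"group:dev-junior"}, {"tag:prod", "tag:web"}, 443))
```

The `tests` section of the policy file plays the same role inside the policy itself: it states the expected outcome for selected sources so a regression in the rules is caught when the policy is loaded.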

### 7.3 Tag Management

```bash
# Add tags to a node
headscale nodes tag -i <node_id> -t "tag:prod,tag:web"

# Show node tags
headscale nodes list

# Update tags through the API
# (Headscale 0.23 exposes nodes under /api/v1/node; older releases used /api/v1/machine)
curl -X POST https://hs.ops.company.com/api/v1/node/<node_id>/tags \
  -H "Authorization: Bearer <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["tag:prod", "tag:web", "tag:bj"]}'
```

### 7.4 Security Hardening

#### 7.4.1 Headscale Server Hardening

```bash
# 1. Firewall
ufw default deny incoming
ufw default allow outgoing
ufw allow from 10.0.0.0/8 to any port 22  # SSH from the internal network only
ufw allow 80/tcp                           # HTTP redirect
ufw allow 443/tcp                          # HTTPS
ufw allow 50443/tcp                        # gRPC (if needed)
ufw enable

# 2. fail2ban
apt install -y fail2ban
cat > /etc/fail2ban/jail.local << 'EOF'
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600

# Requires a custom filter in /etc/fail2ban/filter.d/headscale.conf
[headscale]
enabled = true
port = 443
filter = headscale
logpath = /var/log/headscale/headscale.log
maxretry = 5
bantime = 3600
findtime = 600
EOF

# 3. Disable password login
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
systemctl restart sshd

# 4. Apply updates regularly
apt update && apt upgrade -y
```

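#### 7.4.2 Client Security Configuration

```bash
# Restrict what the Tailscale interface accepts and advertises.
# Comments cannot follow a line-continuation backslash, so the flags are
# documented here:
#   --shields-up            reject inbound connections by default
#   --accept-routes=false   ignore routes advertised by other nodes
#   --advertise-routes=""   advertise no local routes
#   --exit-node=""          do not use an exit node
tailscale up \
  --shields-up \
  --accept-routes=false \
  --advertise-routes="" \
  --exit-node=""
```

---
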
## 8. DNS and Service Discovery

### 8.1 MagicDNS Configuration

Headscale's built-in MagicDNS provides automatic service discovery:

```yaml
# DNS section of config.yaml
dns:
  magic_dns: true
  base_domain: ts.company.local
  nameservers:
    global:
      - 10.0.0.1    # company internal DNS
      - 223.5.5.5   # Alibaba DNS
    # Headscale 0.23 names this key "split"; older releases used restrict_nameservers
    split:
      internal.company.com:
        - 10.0.0.1
      aws.internal:
        - 169.254.169.253
  search_domains:
    - ts.company.local
    - company.local
  extra_records:
    - name: "grafana"
      type: "A"
      value: "100.64.0.10"
    - name: "prometheus"
      type: "A"
      value: "100.64.0.11"
    - name: "jenkins"
      type: "A"
      value: "100.67.100.1"
    # Verify before use: extra_records is documented for A and AAAA records only
    - name: "gitlab"
      type: "CNAME"
      value: "prod-gitlab-bj-001"
```

### 8.2 DNS Resolution Rules

With MagicDNS enabled, names resolve as follows:

| Name format | Resolution | Example |
|---------|---------|------|
| `<hostname>` | Short name, resolved directly | `prod-web-bj-001` → `100.65.1.1` |
| `<hostname>.<user>` | Qualified by user (namespace) | `prod-web-bj-001.prod` |
| `<hostname>.<base_domain>` | Fully qualified name | `prod-web-bj-001.ts.company.local` |
| Custom record | `extra_records` entry | `grafana` → `100.64.0.10` |

### 8.3 Split DNS Configuration

Route queries for specific domains to specific DNS servers:

```yaml
dns:
  nameservers:
    restricted:
      # AWS internal domains use the AWS resolver
      "compute.internal":
        - 169.254.169.253
      "ec2.internal":
        - 169.254.169.253
      # Alibaba Cloud internal domains
      "alibaba-inc.com":
        - 100.100.2.136
        - 100.100.2.138
      # Company internal domains
      "company.internal":
        - 10.0.0.1
        - 10.0.0.2
```

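The selection rule behind split DNS can be sketched in a few lines of Python. This is only an illustration of longest-suffix matching under assumed inputs, not the resolver code that Tailscale clients actually run; `pick_nameservers`, `SPLIT` and `GLOBAL` are hypothetical names, and the addresses are copied from the configuration above and from section 8.1.

```python
def pick_nameservers(name, split, default):
    """Return the resolvers for `name`, preferring the longest matching suffix."""
    labels = name.rstrip(".").lower().split(".")
    # Try the full name first, then progressively shorter suffixes.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in split:
            return split[suffix]
    return default


# Hypothetical tables mirroring the YAML above.
SPLIT = {
    "compute.internal": ["169.254.169.253"],
    "ec2.internal": ["169.254.169.253"],
    "alibaba-inc.com": ["100.100.2.136", "100.100.2.138"],
    "company.internal": ["10.0.0.1", "10.0.0.2"],
}
GLOBAL = ["10.0.0.1", "223.5.5.5"]
```

For example, `pick_nameservers("ip-10-0-0-5.ec2.internal", SPLIT, GLOBAL)` selects the AWS resolver, while a name outside every listed domain falls back to the global servers.
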
### 8.4 Service Discovery Integration

#### 8.4.1 Integration with Consul

```hcl
# consul-config.hcl
services {
  id   = "web-prod-001"
  name = "web"
  tags = ["prod", "tailscale"]
  port = 80

  checks = [
    {
      http     = "http://prod-web-bj-001.ts.company.local/health"
      interval = "10s"
      timeout  = "2s"
    }
  ]
}
```

#### 8.4.2 Integration with Kubernetes CoreDNS

```yaml
# coredns-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    # Forward Tailscale names to MagicDNS. A dedicated server block is used
    # because some CoreDNS releases accept only one forward per server block.
    ts.company.local:53 {
        errors
        cache 30
        forward . 100.100.100.100
    }
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
```

---

## 9. Monitoring and Alerting

### 9.1 Monitoring Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        Grafana Dashboard                        │
│                  (hs-monitor.ops.company.com)                   │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                 ┌─────────────┴─────────────┐
                 │        Prometheus         │
                 │    (100.64.0.11:9090)     │
                 └─────────────┬─────────────┘
                               │
       ┌───────────────┬───────┴───────┬───────────────┐
       │               │               │               │
       ▼               ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│  Headscale  │ │    DERP     │ │  Tailscale  │ │   System    │
│   Metrics   │ │   Metrics   │ │   Metrics   │ │   Metrics   │
│    :9090    │ │    :8080    │ │  (via API)  │ │ (node_exp)  │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
```

### 9.2 Prometheus Configuration

```yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  # Headscale metrics
  - job_name: 'headscale'
    static_configs:
      - targets: ['100.64.0.1:9090']
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: headscale-primary

  # DERP server metrics
  - job_name: 'derp'
    static_configs:
      - targets:
          - 'derp-bj.ops.company.com:8080'
          - 'derp-sh.ops.company.com:8080'
          - 'derp-hk.ops.company.com:8080'

  # PostgreSQL metrics
  - job_name: 'postgresql'
    static_configs:
      - targets: ['100.64.0.2:9187']

  # All Tailscale nodes (via file-based service discovery)
  - job_name: 'tailscale-nodes'
    file_sd_configs:
      - files:
          - '/etc/prometheus/tailscale_nodes.json'
        refresh_interval: 5m
```

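The `tailscale-nodes` job reads its targets from `/etc/prometheus/tailscale_nodes.json`, which has to be generated from the node inventory. A minimal sketch of that conversion follows. It rests on several assumptions: the node dictionaries use the `givenName`, `ipAddresses` and `user.name` fields, each node runs an exporter on port 9100 (the node_exporter default), and the user name doubles as the `env` label. `build_file_sd` is a hypothetical helper, not part of Headscale or Prometheus.

```python
import json


def build_file_sd(nodes, port=9100):
    """Convert node records into Prometheus file_sd target groups."""
    groups = []
    for node in nodes:
        # Prefer the IPv4 address; IPv6 targets would need bracket syntax.
        ipv4 = [a for a in node.get("ipAddresses") or [] if ":" not in a]
        if not ipv4:
            continue  # skip nodes without a usable address
        groups.append({
            "targets": [f"{ipv4[0]}:{port}"],
            "labels": {
                "hostname": node.get("givenName", ""),
                "env": (node.get("user") or {}).get("name", ""),
            },
        })
    return groups


# Example input shaped like the assumed API response.
sample_nodes = [
    {
        "givenName": "prod-web-bj-001",
        "ipAddresses": ["100.65.1.1", "fd7a:115c:a1e0::1"],
        "user": {"name": "prod"},
    },
    {"givenName": "no-address", "ipAddresses": []},
]
file_sd_json = json.dumps(build_file_sd(sample_nodes), indent=2)
```

Writing `file_sd_json` to the path configured above, for instance from a cron job, lets Prometheus pick up new nodes at its next refresh without a restart.
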
### 9.3 Key Metrics

#### 9.3.1 Headscale Metrics

| Metric | Type | Description | Alert threshold |
|---------|-----|------|---------|
| `headscale_connected_nodes` | Gauge | Connected nodes | < expected node count × 0.9 |
| `headscale_api_requests_total` | Counter | Total API requests | - |
| `headscale_api_request_duration_seconds` | Histogram | API response time | P99 > 1s |
| `headscale_db_query_duration_seconds` | Histogram | Database query time | P99 > 500ms |

#### 9.3.2 DERP Metrics

| Metric | Type | Description | Alert threshold |
|---------|-----|------|---------|
| `derp_connections` | Gauge | Current connections | > 10000 |
| `derp_bytes_sent_total` | Counter | Bytes sent | Sudden increase > 200% |
| `derp_bytes_received_total` | Counter | Bytes received | Sudden increase > 200% |
| `derp_home_connections` | Gauge | Home connections | - |

#### 9.3.3 Node Health Metrics

| Metric | Type | Description | Alert threshold |
|---------|-----|------|---------|
| `tailscale_up` | Gauge | Node online status | = 0 |
| `tailscale_derp_latency_seconds` | Gauge | DERP latency | > 200ms |
| `tailscale_peer_count` | Gauge | Number of peers | = 0 |

### 9.4 Alert Rules

```yaml
# /etc/prometheus/rules/headscale.yml
groups:
  - name: headscale
    interval: 30s
    rules:
      # Headscale service unavailable
      - alert: HeadscaleDown
        expr: up{job="headscale"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Headscale control plane unavailable"
          description: "Headscale has been offline for more than 1 minute"

      # Many nodes offline at once
      - alert: TailscaleNodesMassOffline
        expr: |
          (count(tailscale_up == 0) / count(tailscale_up)) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 10% of nodes are offline"
          description: "{{ $value | humanizePercentage }} of nodes are currently offline"

      # Slow API responses
      - alert: HeadscaleAPILatencyHigh
        expr: |
          histogram_quantile(0.99, rate(headscale_api_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Headscale API latency is too high"
          description: "API P99 latency: {{ $value | humanizeDuration }}"

      # Database connection problems
      - alert: HeadscaleDatabaseConnectionIssues
        expr: |
          rate(headscale_db_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Headscale database connection errors"
          description: "Database error rate: {{ $value }}/s"

  - name: derp
    rules:
      # DERP service unavailable
      - alert: DERPServerDown
        expr: up{job="derp"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "DERP relay server unavailable"
          description: "DERP service on {{ $labels.instance }} is offline"

      # DERP connection count too high
      - alert: DERPConnectionsHigh
        expr: derp_connections > 8000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DERP connections approaching the limit"
          description: "{{ $labels.instance }} current connections: {{ $value }}"

  - name: nodes
    rules:
      # Single node offline
      - alert: TailscaleNodeDown
        expr: tailscale_up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tailscale node offline"
          description: "Node {{ $labels.hostname }} has been offline for more than 5 minutes"

      # Production node offline (stricter)
      - alert: ProductionNodeDown
        expr: tailscale_up{env="prod"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Production node offline"
          description: "Production node {{ $labels.hostname }} is offline"

      # Node cannot establish a direct connection
      - alert: TailscaleNoPeerConnection
        expr: tailscale_peer_count == 0 and tailscale_up == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node cannot establish P2P connections"
          description: "Node {{ $labels.hostname }} cannot connect directly to any other node"
```

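To make the `TailscaleNodesMassOffline` expression concrete, the same arithmetic can be written out in Python. This is a sketch for reasoning about the threshold, assuming one `tailscale_up` sample per node; `offline_ratio` and `mass_offline` are hypothetical names, and Prometheus evaluates the real rule itself.

```python
def offline_ratio(up_by_node):
    """Fraction of nodes whose tailscale_up sample equals 0."""
    if not up_by_node:
        return 0.0
    down = sum(1 for value in up_by_node.values() if value == 0)
    return down / len(up_by_node)


def mass_offline(up_by_node, threshold=0.1):
    """Mirror of the rule: fire only when the ratio is strictly above 10%."""
    return offline_ratio(up_by_node) > threshold


# 20 nodes, 3 of them down: 15% offline, so the condition holds.
samples = {f"node-{i:02d}": (0 if i < 3 else 1) for i in range(20)}
fires = mass_offline(samples)
```

Because the comparison is strict, exactly 10% offline (2 of 20 nodes) does not satisfy the condition; the `for: 5m` clause in the rule additionally requires it to hold continuously before the alert fires.
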
### 9.5 Grafana Dashboards

Create the following dashboards:

1. **Headscale Overview**
   - Total, online, and offline node counts
   - API request QPS and latency
   - Database connection status

2. **DERP Network**
   - Connections per DERP server
   - Traffic statistics (sent/received)
   - Regional distribution

3. **Node Health**
   - Node online-status matrix
   - Per-node latency heatmap
   - Per-node traffic statistics

4. **ACL Audit**
   - Access-denied events
   - Rule hit statistics
   - Anomalous access patterns

---

## 10. Operations and Maintenance Guidelines

### 10.1 Routine Operations

#### 10.1.1 User Management

```bash
# Create users (namespaces)
headscale users create prod
headscale users create staging
headscale users create dev

# List users
headscale users list

# Delete a user (use with caution)
headscale users destroy dev
```

#### 10.1.2 Node Management

```bash
# List all nodes
headscale nodes list

# List the nodes of one user
headscale nodes list --user prod

# Show the details of one node
headscale nodes list --identifier prod-web-bj-001

# Delete a node
headscale nodes delete --identifier <node_id>

# Rename a node (the new name is a positional argument)
headscale nodes rename --identifier <node_id> new-hostname

# Move a node to another user
headscale nodes move --identifier <node_id> --user staging

# Expire a node (forces it to re-authenticate)
headscale nodes expire --identifier <node_id>
```

#### 10.1.3 Route Management

```bash
# List all routes
headscale routes list

# Enable a route
headscale routes enable --route <route_id>

# Disable a route
headscale routes disable --route <route_id>

# Delete a route
headscale routes delete --route <route_id>
```

#### 10.1.4 API Key Management

```bash
# Create an API key
headscale apikeys create --expiration 90d

# List API keys
headscale apikeys list

# Expire an API key
headscale apikeys expire --prefix <key_prefix>
```

### 10.2 Operations Scripts

#### 10.2.1 Node Health Check Script

```bash
#!/bin/bash
# /opt/scripts/check-tailscale-health.sh

HEADSCALE_URL="https://hs.ops.company.com"
API_KEY="your_api_key"
ALERT_WEBHOOK="https://webhook.ops.company.com/alert"

# Send an alert. jq builds the JSON body so that multi-line node lists
# and special characters are escaped correctly.
send_alert() {
  jq -n --arg text "$1" '{text: $text}' |
    curl -s -X POST "$ALERT_WEBHOOK" -H "Content-Type: application/json" -d @-
}

# Fetch all nodes (one compact JSON object per line)
nodes=$(curl -s -H "Authorization: Bearer $API_KEY" \
  "${HEADSCALE_URL}/api/v1/machine" | jq -c '.machines[]')

# Check for offline nodes
offline_nodes=$(echo "$nodes" | jq -r 'select(.online == false) | .givenName')

if [ -n "$offline_nodes" ]; then
  send_alert "[Tailscale] The following nodes are offline:"$'\n'"$offline_nodes"
fi

# Check for nodes that expire within 7 days (604800 seconds)
expiring_nodes=$(echo "$nodes" | jq -r \
  'select(.expiry != "0001-01-01T00:00:00Z") |
   select((.expiry | fromdateiso8601) < (now + 604800)) |
   .givenName + " (expires: " + .expiry + ")"')

if [ -n "$expiring_nodes" ]; then
  send_alert "[Tailscale] The following nodes expire soon:"$'\n'"$expiring_nodes"
fi
```

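The expiry check above depends on timestamp parsing, which is easy to get wrong when the API returns fractional seconds. The following Python sketch shows the same filter with more tolerant parsing. It assumes RFC 3339 timestamps with a trailing `Z` and treats `0001-01-01T00:00:00Z` as "never expires", as the script does; `parse_rfc3339` and `expiring_within` are hypothetical helpers.

```python
import re
from datetime import datetime, timedelta, timezone

ZERO_TIME = "0001-01-01T00:00:00Z"  # sentinel meaning "no expiry"


def parse_rfc3339(ts):
    """Parse an RFC 3339 timestamp, normalising the fraction to 6 digits."""
    ts = ts.replace("Z", "+00:00")
    # datetime holds microseconds only, so truncate or pad the fraction.
    ts = re.sub(r"\.(\d+)", lambda m: "." + m.group(1)[:6].ljust(6, "0"), ts)
    return datetime.fromisoformat(ts)


def expiring_within(nodes, days=7, now=None):
    """Return names of nodes whose key expires before now + days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now + timedelta(days=days)
    names = []
    for node in nodes:
        expiry = node.get("expiry")
        if not expiry or expiry == ZERO_TIME:
            continue  # node never expires
        if parse_rfc3339(expiry) < cutoff:
            names.append(node["givenName"])
    return names
```

Like the jq filter, this also reports nodes that have already expired, since their expiry lies before the cutoff.
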
#### 10.2.2 Bulk Node Management Script

```python
#!/usr/bin/env python3
# /opt/scripts/headscale-manager.py

import argparse
import json
from datetime import datetime, timedelta, timezone

import requests


class HeadscaleManager:
    def __init__(self, url, api_key):
        self.url = url.rstrip('/')
        self.headers = {
            'Authorization': f'Bearer {api_key}',
            'Content-Type': 'application/json'
        }

    def get_nodes(self, user=None):
        """Return the node list, optionally filtered by user."""
        params = {}
        if user:
            params['user'] = user

        resp = requests.get(
            f'{self.url}/api/v1/machine',
            headers=self.headers,
            params=params
        )
        resp.raise_for_status()
        return resp.json().get('machines', [])

    def get_offline_nodes(self, threshold_hours=1):
        """Return nodes that have been offline longer than the threshold."""
        nodes = self.get_nodes()
        offline = []

        threshold = datetime.now(timezone.utc) - timedelta(hours=threshold_hours)

        for node in nodes:
            if not node.get('online', False):
                last_seen = datetime.fromisoformat(
                    node['lastSeen'].replace('Z', '+00:00')
                )
                if last_seen < threshold:
                    offline.append(node)

        return offline

    def bulk_tag_nodes(self, node_ids, tags):
        """Apply the same tags to several nodes."""
        results = []
        for node_id in node_ids:
            resp = requests.post(
                f'{self.url}/api/v1/machine/{node_id}/tags',
                headers=self.headers,
                json={'tags': tags}
            )
            results.append({
                'node_id': node_id,
                'success': resp.status_code == 200
            })
        return results

    def cleanup_expired_nodes(self, dry_run=True):
        """Find expired nodes and delete them unless dry_run is set."""
        nodes = self.get_nodes()
        expired = []
        now = datetime.now(timezone.utc)

        for node in nodes:
            expiry = node.get('expiry')
            # The zero time means the node has no expiry
            if expiry and expiry != '0001-01-01T00:00:00Z':
                expiry_dt = datetime.fromisoformat(expiry.replace('Z', '+00:00'))
                if expiry_dt < now:
                    expired.append(node)

        if not dry_run:
            for node in expired:
                requests.delete(
                    f'{self.url}/api/v1/machine/{node["id"]}',
                    headers=self.headers
                )

        return expired


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Headscale management tool')
    parser.add_argument('--url', required=True, help='Headscale URL')
    parser.add_argument('--api-key', required=True, help='API key')
    parser.add_argument('action', choices=['list', 'offline', 'cleanup'])
    parser.add_argument('--user', help='Filter by user')
    parser.add_argument('--dry-run', action='store_true', help='Report only, delete nothing')

    args = parser.parse_args()

    manager = HeadscaleManager(args.url, args.api_key)

    if args.action == 'list':
        nodes = manager.get_nodes(args.user)
        print(json.dumps(nodes, indent=2))
    elif args.action == 'offline':
        offline = manager.get_offline_nodes()
        print(f"Offline nodes: {len(offline)}")
        for node in offline:
            print(f"  - {node['givenName']} (last seen: {node['lastSeen']})")
    elif args.action == 'cleanup':
        expired = manager.cleanup_expired_nodes(dry_run=args.dry_run)
        print(f"Expired nodes: {len(expired)}")
        for node in expired:
            print(f"  - {node['givenName']} (expired: {node['expiry']})")
```

### 10.3 Log Management

```bash
# Headscale log location:
#   /var/log/headscale/headscale.log

# Log rotation configuration
cat > /etc/logrotate.d/headscale << 'EOF'
/var/log/headscale/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0640 headscale headscale
    sharedscripts
    postrotate
        systemctl reload headscale > /dev/null 2>&1 || true
    endscript
}
EOF

# Query structured logs (JSON format)
jq 'select(.level == "error")' /var/log/headscale/headscale.log
```

### 10.4 Backup and Restore

#### 10.4.1 Database Backup

```bash
#!/bin/bash
# /opt/scripts/backup-headscale.sh

BACKUP_DIR="/backup/headscale"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30

mkdir -p "${BACKUP_DIR}"

# PostgreSQL backup
pg_dump -h localhost -U headscale -d headscale -F c \
  -f "${BACKUP_DIR}/headscale_${DATE}.dump"

# Configuration and key backup
tar -czf "${BACKUP_DIR}/config_${DATE}.tar.gz" \
  /etc/headscale/config.yaml \
  /etc/headscale/acl.json \
  /etc/headscale/derp.json \
  /var/lib/headscale/private.key \
  /var/lib/headscale/noise_private.key

# Remove old backups
find "${BACKUP_DIR}" -type f -mtime +${RETENTION_DAYS} -delete

# Upload to S3 (optional)
aws s3 sync "${BACKUP_DIR}/" s3://backup-bucket/headscale/
```

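The retention step can also be expressed against the timestamps embedded in the file names, which is useful when file modification times are not preserved (for example after a copy to another host). The sketch below assumes the `headscale_<DATE>.dump` and `config_<DATE>.tar.gz` naming produced by the script above; `backups_to_delete` is a hypothetical helper and, unlike `find -mtime`, it never looks at the file system.

```python
from datetime import datetime, timedelta


def backups_to_delete(filenames, now, retention_days=30):
    """Return backup files whose embedded timestamp is older than the cutoff."""
    cutoff = now - timedelta(days=retention_days)
    stale = []
    for name in filenames:
        stem = name.split(".", 1)[0]      # e.g. headscale_20251101_020000
        parts = stem.split("_", 1)        # prefix, then YYYYmmdd_HHMMSS
        if len(parts) != 2:
            continue                      # not one of our backup files
        try:
            stamp = datetime.strptime(parts[1], "%Y%m%d_%H%M%S")
        except ValueError:
            continue                      # unexpected name, leave it alone
        if stamp < cutoff:
            stale.append(name)
    return stale
```

Files that do not match the naming scheme are skipped rather than deleted, which keeps the helper conservative.
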
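#### 10.4.2 Data Restore

```bash
#!/bin/bash
# /opt/scripts/restore-headscale.sh
# Usage: restore-headscale.sh /backup/headscale/headscale_<DATE>.dump

BACKUP_FILE=$1

# Derive the matching config archive written by backup-headscale.sh
# (headscale_<DATE>.dump -> config_<DATE>.tar.gz)
BACKUP_NAME=$(basename "${BACKUP_FILE}" .dump)
CONFIG_FILE="$(dirname "${BACKUP_FILE}")/config_${BACKUP_NAME#headscale_}.tar.gz"

# Stop the service
systemctl stop headscale

# Restore the database
pg_restore -h localhost -U headscale -d headscale -c "${BACKUP_FILE}"

# Restore the configuration
tar -xzf "${CONFIG_FILE}" -C /

# Start the service
systemctl start headscale

# Verify
headscale nodes list
```
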
### 10.5 Version Upgrade Procedure

```bash
#!/bin/bash
# /opt/scripts/upgrade-headscale.sh
set -e   # abort on the first failed step

NEW_VERSION=$1
BACKUP_DIR="/backup/headscale/upgrade"

echo "Upgrading Headscale to version ${NEW_VERSION}"

# 1. Back up the current version
echo "Backing up current configuration and data..."
/opt/scripts/backup-headscale.sh

# 2. Download the new version
echo "Downloading the new version..."
wget -O /tmp/headscale_new.deb \
  "https://github.com/juanfont/headscale/releases/download/v${NEW_VERSION}/headscale_${NEW_VERSION}_linux_amd64.deb"

# 3. Stop the service
echo "Stopping the Headscale service..."
systemctl stop headscale

# 4. Install the new version
echo "Installing the new version..."
dpkg -i /tmp/headscale_new.deb

# 5. Database migration (if needed)
# Headscale applies schema migrations when the server starts, so no
# separate migration command is run here; validate the config instead.
echo "Validating configuration..."
headscale configtest --config /etc/headscale/config.yaml

# 6. Start the service
echo "Starting the service..."
systemctl start headscale

# 7. Verify
echo "Verifying the upgrade..."
sleep 5
headscale version
headscale nodes list | head -5

echo "Upgrade complete!"
```

---

## 11. Failure Recovery and Disaster Recovery

### 11.1 Failure Scenarios and Recovery Plans

#### 11.1.1 Headscale Primary Node Failure

**Impact**:
- New nodes cannot join the network
- ACL policies cannot be updated
- Already-connected nodes keep communicating (direct P2P)

**Recovery steps**:

```bash
# 1. Confirm that the primary node has failed
systemctl status headscale
curl -s https://hs.ops.company.com/health

# 2. Switch to the standby node
# Update DNS or the load balancer to point at the standby node

# 3. If the database is the problem, switch to the replica
# Update the database connection in config.yaml

# 4. Restart the service
systemctl restart headscale

# 5. Verify that the service has recovered
headscale nodes list
```

#### 11.1.2 PostgreSQL Database Failure

**Recovery steps**:

```bash
# 1. If the primary has failed, promote the replica
# Run on the replica
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/15/main

# 2. Point the Headscale configuration at the new primary
sed -i 's/old_primary_ip/new_primary_ip/' /etc/headscale/config.yaml

# 3. Restart Headscale
systemctl restart headscale

# 4. Rebuild the replica
# Use pg_basebackup to resynchronise from the new primary
```

#### 11.1.3 DERP Relay Server Failure

**Impact**:
- Nodes that cannot traverse NAT and rely on this relay lose connectivity
- Nodes with direct connections are unaffected

**Recovery steps**:

```bash
# 1. Check the DERP service status
systemctl status derper
curl -s https://derp-bj.ops.company.com/derp/probe

# 2. If it cannot be recovered, remove the node from the DERP map
# Edit /etc/headscale/derp.json and remove the failed node

# 3. Wait for clients to switch to another DERP automatically,
#    or re-run the connectivity check manually on a client
tailscale netcheck
```

#### 11.1.4 Full Disaster Recovery

```bash
# 1. Prepare a new server

# 2. Install Headscale (before restoring, so the service user and paths exist)
dpkg -i headscale_latest.deb

# 3. Restore the database from backup
pg_restore -h localhost -U headscale -d headscale /backup/latest.dump

# 4. Restore the configuration files
tar -xzf /backup/config_latest.tar.gz -C /

# 5. Start the service
systemctl start headscale

# 6. Point DNS at the new server

# 7. Verify that all nodes reconnect
watch 'headscale nodes list | grep -c Online'
```

### 11.2 RTO and RPO Targets

| Scenario | RTO (recovery time objective) | RPO (recovery point objective) |
|------|-------------------|---------------------|
| Headscale single-node failure | < 5 minutes | 0 (hot standby takes over) |
| Database failure | < 15 minutes | < 1 minute (synchronous replication) |
| DERP failure | Automatic failover | N/A |
| Full disaster | < 2 hours | < 24 hours |

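Drill results can be compared against these targets mechanically. The sketch below encodes the three time-bound RTO values from the table in minutes and reports the scenarios that missed them; the scenario keys and the `rto_violations` helper are hypothetical names chosen for illustration.

```python
# RTO targets from the table above, in minutes (DERP failover is automatic).
RTO_TARGET_MINUTES = {
    "headscale_single_failure": 5,
    "database_failure": 15,
    "full_disaster": 120,
}


def rto_violations(measured_minutes):
    """Return the scenarios whose measured recovery time missed its target."""
    return {
        scenario: minutes
        for scenario, minutes in measured_minutes.items()
        # Targets are strict upper bounds ("< 5 minutes"), so equality fails.
        if minutes >= RTO_TARGET_MINUTES[scenario]
    }


# Example drill: the database failover took 22 minutes.
drill = {
    "headscale_single_failure": 3,
    "database_failure": 22,
    "full_disaster": 95,
}
missed = rto_violations(drill)
```

Here `missed` contains only the database scenario, which would then feed into the improvement actions recorded after a drill.
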
### 11.3 Regular Drills

A failure drill is recommended once per quarter:

1. **Drill scope**:
   - Primary/standby switchover
   - Database failover
   - Restore from backup
   - ACL policy rollback

2. **Drill records**:
   - Drill time and participants
   - Actual recovery time
   - Problems found and improvement actions

---

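## 12. Implementation Plan and Milestones

### 12.1 Implementation Phases

#### Phase 1: Infrastructure Preparation

| Task | Owner | Prerequisite | Deliverable |
|------|--------|---------|--------|
| Request server resources | Ops | Budget approval | Server inventory |
| Prepare domain names and certificates | Ops | Domain purchased | SSL certificates |
| Deploy highly available PostgreSQL | DBA | Servers ready | Database cluster |
| Confirm network plan | Network team | - | IP plan document |

#### Phase 2: Core Service Deployment

| Task | Owner | Prerequisite | Deliverable |
|------|--------|---------|--------|
| Deploy Headscale primary node | Ops | PostgreSQL ready | Service running |
| Configure Headscale standby node | Ops | Primary node ready | Switchover test |
| Deploy DERP relay servers | Ops | Servers ready | Multi-region DERP |
| Configure ACL policy | Security team | Service running | ACL file |
| Deploy monitoring and alerting | Ops | Service running | Grafana dashboards |

#### Phase 3: Node Onboarding

| Task | Owner | Prerequisite | Deliverable |
|------|--------|---------|--------|
| Onboard test environment | Ops | Service ready | Test nodes online |
| Onboard staging environment | Ops | Tests passed | Staging nodes online |
| Onboard production (batch 1) | Ops | Staging validated | First production nodes |
| Onboard production (batches 2-N) | Ops | Batch 1 successful | All production nodes |
| Onboard ops staff devices | Ops | Production stable | Ops devices online |
| Onboard developer devices | Dev lead | Validated by ops | Developer devices online |

#### Phase 4: Acceptance and Handover

| Task | Owner | Prerequisite | Deliverable |
|------|--------|---------|--------|
| Functional acceptance testing | QA | All nodes onboarded | Acceptance report |
| Performance and stress testing | Performance team | Functional acceptance | Performance report |
| Failure drill | Ops | Acceptance passed | Drill record |
| Documentation delivery | Ops | Drill passed | Operations manual |
| Training and handover | Ops | Documentation complete | Training record |
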
### 12.2 Milestones

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           Implementation Timeline                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  M1: Infrastructure ready                                                   │
│  ├── PostgreSQL HA deployed                                                 │
│  ├── Domain names and certificates ready                                    │
│  └── Network plan confirmed                                                 │
│                                                                             │
│  M2: Core services live                                                     │
│  ├── Headscale primary and standby running                                  │
│  ├── DERP deployed in multiple regions                                      │
│  ├── Monitoring and alerting ready                                          │
│  └── ACL policy configured                                                  │
│                                                                             │
│  M3: Test validation complete                                               │
│  ├── Test environment fully onboarded                                       │
│  ├── Staging environment onboarded                                          │
│  └── Functional acceptance passed                                           │
│                                                                             │
│  M4: Production migration complete                                          │
│  ├── All production servers onboarded                                       │
│  ├── Legacy VPN decommissioned                                              │
│  └── Ops devices onboarded                                                  │
│                                                                             │
│  M5: Project acceptance                                                     │
│  ├── Failure drill passed                                                   │
│  ├── Training and handover complete                                         │
│  └── Project formally closed                                                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

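### 12.3 Acceptance Criteria

| Item | Criterion | Verification method |
|--------|---------|---------|
| Network connectivity | Any two nodes can reach each other | ping/traceroute tests |
| Connection latency | Same-region P2P < 10ms | Tailscale ping |
| Service availability | 99.9% availability | Monitoring data |
| ACL enforcement | Policy matches the design | Security scan |
| Failure recovery | RTO below target | Failure drill |
| Performance | Supports 1000+ nodes | Stress test |
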
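The 99.9% availability criterion translates into a concrete downtime budget, which makes it easier to judge from monitoring data. A small worked calculation, assuming a 30-day measurement window (the window length is an assumption, not something this plan specifies; `downtime_budget_minutes` is a hypothetical helper):

```python
def downtime_budget_minutes(availability, days=30):
    """Allowed downtime in minutes for a given availability over `days`."""
    total_minutes = days * 24 * 60
    return (1 - availability) * total_minutes


monthly_budget = downtime_budget_minutes(0.999)          # about 43.2 minutes
yearly_budget = downtime_budget_minutes(0.999, days=365)  # about 525.6 minutes
```

In other words, a control-plane outage of roughly 43 minutes in a month already uses up the whole budget, which is why the single-node RTO in section 11.2 is set to minutes rather than hours.
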
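---

## 13. Risk Assessment and Mitigation

### 13.1 Risk Matrix

| Risk | Likelihood | Impact | Risk level | Mitigation |
|--------|-------|-----|---------|---------|
| Unstable Headscale release | Medium | High | High | Test thoroughly, prepare a rollback plan |
| High NAT traversal failure rate | Medium | Medium | Medium | Deploy DERP in multiple regions |
| Key leakage | Low | Very high | High | Key management, regular rotation |
| Performance bottleneck | Medium | Medium | Medium | Monitoring and early warning, capacity planning |
| Insufficient ops skills | Medium | Medium | Medium | Training, better documentation |
| Conflict with existing systems | Low | Medium | Low | Test thoroughly, roll out in batches |
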
### 13.2 Rollback Plan

#### 13.2.1 Server Rollback

```bash
# 1. Stop the new version
systemctl stop headscale

# 2. Reinstall the old version
dpkg -i /backup/headscale_old.deb

# 3. Restore the configuration
cp /backup/config_old.yaml /etc/headscale/config.yaml

# 4. Roll back the database if required
pg_restore -h localhost -U headscale -d headscale -c /backup/db_old.dump

# 5. Start the service
systemctl start headscale
```

#### 13.2.2 Client Rollback

```bash
# Disconnect from Headscale
tailscale down

# Restore the previous VPN configuration
# (follow the procedure of the previous VPN solution)
```

### 13.3 Emergency Contacts

| Role | Name | Contact | Responsibility |
|------|-----|---------|-----|
| Project lead | xxx | 138xxxxxxxx | Decisions, coordination |
| Technical lead | xxx | 139xxxxxxxx | Technical design |
| Ops lead | xxx | 137xxxxxxxx | Deployment and rollout |
| DBA | xxx | 136xxxxxxxx | Database operations |
| Security lead | xxx | 135xxxxxxxx | Security review |

---

## 14. Appendix

### 14.1 Glossary

| Term | Explanation |
|------|-----|
| Headscale | Open-source, self-hosted control server for Tailscale |
| Tailscale | Zero-configuration VPN solution based on WireGuard |
| WireGuard | Modern VPN protocol |
| DERP | Designated Encrypted Relay for Packets, an encrypted relay protocol |
| MagicDNS | Tailscale's automatic DNS service |
| ACL | Access Control List |
| PreAuth Key | Pre-authentication key for non-interactive onboarding |
| Mesh Network | Network in which nodes communicate with each other directly |
| NAT Traversal | Techniques for connecting through NAT |
| STUN | Session Traversal Utilities for NAT |

### 14.2 References

- [Headscale documentation](https://headscale.net/)
- [Tailscale documentation](https://tailscale.com/docs/)
- [WireGuard website](https://www.wireguard.com/)
- [Headscale on GitHub](https://github.com/juanfont/headscale)

### 14.3 Command Quick Reference

```bash
# Headscale service management
systemctl start|stop|restart|status headscale

# User management
headscale users list
headscale users create <name>
headscale users destroy <name>

# Node management
headscale nodes list
headscale nodes delete -i <id>
headscale nodes expire -i <id>
headscale nodes rename -i <id> <new_name>
headscale nodes tag -i <id> -t <tags>

# Pre-authentication keys
headscale preauthkeys create --user <user> --expiration 24h
headscale preauthkeys list --user <user>

# Route management
headscale routes list
headscale routes enable -r <route_id>

# API keys
headscale apikeys create --expiration 90d
headscale apikeys list

# Tailscale client
tailscale up --login-server <url>
tailscale down
tailscale status
tailscale ip
tailscale ping <peer>
tailscale netcheck
```

### 14.4 Configuration Templates

Configuration template files are located at:

- `/opt/templates/headscale/config.yaml.tmpl`
- `/opt/templates/headscale/acl.json.tmpl`
- `/opt/templates/derp/docker-compose.yml.tmpl`

### 14.5 Change Log

| Version | Date | Change | Author |
|------|-----|---------|--------|
| v1.0 | 2025-12-15 | Initial draft | xxx |
| v2.0 | 2025-12-18 | Detailed design completed | AI Assistant |

---

> **Maintenance note**: This document should be updated as the project progresses, and every major change should be recorded in the change log.