# OPS Unified Management - Headscale Networking Implementation Plan

> **Task ID**: 4448
> **Version**: v2.0
> **Last updated**: 2025-12-18
> **Document status**: Detailed design

---

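## Table of Contents

1. [Project Background and Goals](#1-项目背景与目标)
2. [Technical Approach Overview](#2-技术方案概述)
3. [Network Architecture Design](#3-网络架构设计)
4. [Infrastructure Planning](#4-基础设施规划)
5. [Headscale Server Deployment](#5-headscale-服务端部署)
6. [Client Onboarding](#6-客户端接入方案)
7. [Access Control and Security Policy](#7-访问控制与安全策略)
8. [DNS and Service Discovery](#8-dns-与服务发现)
9. [Monitoring and Alerting](#9-监控与告警)
10. [Operations Management Guidelines](#10-运维管理规范)
11. [Failure Recovery and Disaster Recovery](#11-故障恢复与灾备)
12. [Implementation Plan and Milestones](#12-实施计划与里程碑)
13. [Risk Assessment and Mitigation](#13-风险评估与应对)
14. [Appendix](#14-附录)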

---

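## 1. Project Background and Goals

### 1.1 Background

As the business grows, the operations team faces the following challenges:

- **Multi-cloud, multi-region footprint**: servers are spread across Alibaba Cloud, Tencent Cloud, AWS, and several physical data centers
- **Complex network isolation**: isolating the production, testing, and development environments from one another is hard to manage
- **Difficult VPN management**: traditional VPN solutions (OpenVPN, IPSec) are complex to configure and costly to maintain
- **Secure access requirements**: internal services must be reachable securely and conveniently while meeting compliance requirements
- **Low operational efficiency**: cross-network operations are tedious and there is no single entry point

### 1.2 Goals

| Dimension | Goal | Acceptance criterion |
|---------|---------|---------|
| Connectivity | Direct P2P connections between all nodes | Latency between any two nodes in the same region < 50 ms |
| Security | Zero-trust network architecture | All traffic encrypted, access based on identity |
| Usability | One-step access to the internal network | Client install and setup < 5 minutes |
| Scalability | Fast capacity expansion | New node onboarded in < 10 minutes |
| Availability | Highly available control plane | SLA 99.9% |

### 1.3 Scope

- All production servers
- Testing and staging servers
- Work devices of operations and development staff
- CI/CD build nodes
- Databases, caches, and other infrastructure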

---

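## 2. Technical Approach Overview

### 2.1 Why Headscale

| Option | Pros | Cons | Best for |
|------|-----|------|---------|
| **Headscale** | Open source and self-hosted, WireGuard-based, direct P2P, lightweight | Relatively young ecosystem | Teams that need full control |
| Tailscale | Mature commercial support | Coordination data leaves the country, high cost | Small teams getting started quickly |
| OpenVPN | Mature and stable | Complex configuration, lower performance | Traditional enterprises |
| ZeroTier | Easy to use | Many limits on the free tier | Small deployments |

**Core reasons for choosing Headscale**:

1. **Data sovereignty**: all coordination data is stored on our own servers
2. **Predictable cost**: fully open source, no subscription fees
3. **WireGuard benefits**: modern cryptography, low latency, high performance
4. **Mesh networking**: nodes communicate directly, and traffic passes through a relay only when NAT traversal fails
5. **Tailscale client compatibility**: the mature Tailscale clients can be used as-is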

### 2.2 Architecture Diagram

```
                              ┌─────────────────────────────────────────────────────────┐
                              │                    Internet                              │
                              └──────────────────────────┬──────────────────────────────┘
                                                         │
                              ┌──────────────────────────┴──────────────────────────────┐
                              │                                                          │
                    ┌─────────▼─────────┐                               ┌────────────────▼────────────────┐
                    │   Headscale HA    │                               │        DERP Relay Servers       │
                    │   Control Plane   │                               │     (Beijing/Shanghai/HK)       │
                    │                   │                               │                                 │
                    │ ┌───────────────┐ │                               │  ┌─────────┐  ┌─────────┐      │
                    │ │  Headscale    │ │                               │  │ DERP-BJ │  │ DERP-SH │      │
                    │ │  Primary      │ │                               │  └─────────┘  └─────────┘      │
                    │ └───────────────┘ │                               │       ┌─────────┐              │
                    │ ┌───────────────┐ │                               │       │ DERP-HK │              │
                    │ │  PostgreSQL   │ │                               │       └─────────┘              │
                    │ │  (HA)         │ │                               └─────────────────────────────────┘
                    │ └───────────────┘ │
                    └─────────┬─────────┘
                              │ Coordination
                              │
        ┌─────────────────────┼─────────────────────┬─────────────────────┐
        │                     │                     │                     │
        ▼                     ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  Production   │     │   Staging     │     │  Development  │     │   Operator    │
│   Servers     │     │   Servers     │     │   Servers     │     │   Devices     │
│               │     │               │     │               │     │               │
│ ┌───────────┐ │     │ ┌───────────┐ │     │ ┌───────────┐ │     │ ┌───────────┐ │
│ │ Tailscale │ │◄───►│ │ Tailscale │ │◄───►│ │ Tailscale │ │◄───►│ │ Tailscale │ │
│ │  Agent    │ │ P2P │ │  Agent    │ │ P2P │ │  Agent    │ │ P2P │ │  Client   │ │
│ └───────────┘ │     │ └───────────┘ │     │ └───────────┘ │     │ └───────────┘ │
└───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘
     100.65.x.x            100.66.x.x            100.68.x.x            100.70.x.x
```

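### 2.3 Core Components

| Component | Function | Deployment location | HA strategy |
|------|-----|---------|-----------|
| Headscale Server | Coordination service, key distribution, ACL management | Cloud VM | Active/standby + PostgreSQL HA |
| DERP Relay | Relays traffic when NAT traversal fails | Multiple regions | Redundant nodes |
| Tailscale Client | Client agent | All nodes | Starts at boot |
| Admin UI | Web management interface | Same host as Headscale | - |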

---

## 3. Network Architecture Design

### 3.1 IP Address Plan

The plan uses the CGNAT range `100.64.0.0/10`, divided by environment and purpose:

```
100.64.0.0/10 (total address space: 4,194,304 addresses)
│
├── 100.64.0.0/16    - Reserved (management)
│   ├── 100.64.0.0/24    - Headscale control plane
│   ├── 100.64.1.0/24    - DERP relay servers
│   └── 100.64.2.0/24    - Monitoring infrastructure
│
├── 100.65.0.0/16    - Production
│   ├── 100.65.1.0/24    - Web servers
│   ├── 100.65.2.0/24    - API servers
│   ├── 100.65.3.0/24    - Database servers
│   ├── 100.65.4.0/24    - Cache servers
│   ├── 100.65.5.0/24    - Message queue servers
│   ├── 100.65.10.0/24   - Kubernetes Master
│   ├── 100.65.12.0/23   - Kubernetes Worker (a /23 must start on an even third octet)
│   └── 100.65.100.0/24  - Production bastion hosts
│
├── 100.66.0.0/16    - Staging
│   ├── 100.66.1.0/24    - Application servers
│   ├── 100.66.2.0/24    - Database servers
│   └── 100.66.10.0/24   - Kubernetes cluster
│
├── 100.67.0.0/16    - Testing
│   ├── 100.67.1.0/24    - Application servers
│   ├── 100.67.2.0/24    - Database servers
│   └── 100.67.100.0/24  - CI/CD build nodes
│
├── 100.68.0.0/16    - Development
│   ├── 100.68.1.0/24    - Development servers
│   └── 100.68.2.0/24    - Development databases
│
├── 100.70.0.0/16    - Operator devices
│   ├── 100.70.1.0/24    - Senior operators
│   ├── 100.70.2.0/24    - Regular operators
│   └── 100.70.10.0/24   - On-call staff
│
├── 100.71.0.0/16    - Developer devices
│   ├── 100.71.1.0/24    - Backend developers
│   ├── 100.71.2.0/24    - Frontend developers
│   └── 100.71.3.0/24    - Mobile developers
│
└── 100.80.0.0/16    - External partners
    └── 100.80.1.0/24    - Third-party vendors
```
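
A plan like the one above only works if every block starts on its own network boundary and fits inside its parent range. The following is a minimal bash sketch for checking that; the function names `ip_to_int`, `cidr_aligned`, and `cidr_size` are illustrative and not part of Headscale or Tailscale.

```bash
#!/usr/bin/env bash

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed (exit 0) if the address part of a CIDR sits on the network boundary,
# that is, all host bits are zero.
cidr_aligned() {
    local addr=${1%/*} len=${1#*/}
    local host_bits=$(( 32 - len ))
    local mask=$(( (1 << host_bits) - 1 ))   # bits that must be zero
    (( ( $(ip_to_int "$addr") & mask ) == 0 ))
}

# Number of addresses contained in a prefix of the given length.
cidr_size() {
    echo $(( 1 << (32 - $1) ))
}
```

For example, `cidr_size 10` reproduces the total address count quoted for `100.64.0.0/10`, and `cidr_aligned 100.65.11.0/23 || echo "misaligned"` flags a /23 that starts on an odd third octet.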

### 3.2 User (Namespace) Design

Headscale uses Users (formerly Namespaces) for logical separation:

| User | Purpose | IP range | Administrator |
|-----------|-----|-------|--------|
| `infra` | Infrastructure services | 100.64.0.0/16 | ops-admin |
| `prod` | Production servers | 100.65.0.0/16 | ops-admin |
| `staging` | Staging environment | 100.66.0.0/16 | ops-admin |
| `testing` | Testing environment | 100.67.0.0/16 | qa-admin |
| `dev` | Development environment | 100.68.0.0/16 | dev-admin |
| `ops-team` | Operator devices | 100.70.0.0/16 | ops-admin |
| `dev-team` | Developer devices | 100.71.0.0/16 | dev-admin |
| `partners` | External partners | 100.80.0.0/16 | ops-admin |

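### 3.3 Node Naming Convention

```
<env>-<role>-<region>-<seq>

Examples:
- prod-web-bj-001      Production, Beijing, web server #1
- prod-db-sh-001       Production, Shanghai, database #1
- staging-api-bj-001   Staging, Beijing, API server #1
- ops-laptop-zhangsan  Laptop of operator Zhang San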
```
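
The convention above can be enforced before a node is registered. The sketch below is one possible bash check; the list of environment prefixes and the rule that the trailing three-digit sequence is optional (as in `ops-laptop-zhangsan`) are assumptions inferred from the examples, and `valid_node_name` is a hypothetical helper name.

```bash
#!/usr/bin/env bash

# Succeed if the name follows <env>-<role>-<region or owner>[-<3-digit seq>].
# The env list is an assumption based on the examples in this section.
valid_node_name() {
    local re='^(prod|staging|testing|dev|ops)-[a-z0-9]+-[a-z0-9]+(-[0-9]{3})?$'
    [[ $1 =~ $re ]]
}
```

A provisioning script could call `valid_node_name "$name" || exit 1` before running `tailscale up --hostname "$name"`.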

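### 3.4 DERP Relay Network

Self-hosted DERP servers provide a reliable relay path when NAT traversal fails:

| Node | Region | Public IP | Ports | Notes |
|------|-----|---------|-----|------|
| derp-bj-01 | Beijing | x.x.x.x | 443/3478 | Alibaba Cloud, primary node |
| derp-sh-01 | Shanghai | x.x.x.x | 443/3478 | Tencent Cloud, backup node |
| derp-hk-01 | Hong Kong | x.x.x.x | 443/3478 | AWS, overseas node |
| derp-sg-01 | Singapore | x.x.x.x | 443/3478 | Southeast Asia node |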

---

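## 4. Infrastructure Planning

### 4.1 Server Resources

#### 4.1.1 Headscale Control Plane

| Component | Spec | Count | Notes |
|------|-----|------|-----|
| Headscale Primary | 4 vCPU / 8 GB RAM / 100 GB SSD | 1 | Primary control node |
| Headscale Standby | 4 vCPU / 8 GB RAM / 100 GB SSD | 1 | Hot standby node |
| PostgreSQL Primary | 4 vCPU / 16 GB RAM / 500 GB SSD | 1 | Database primary |
| PostgreSQL Replica | 4 vCPU / 16 GB RAM / 500 GB SSD | 1 | Database replica |
| Admin UI | 2 vCPU / 4 GB RAM / 50 GB SSD | 1 | Management interface |

#### 4.1.2 DERP Relay Servers

| Region | Spec | Bandwidth | Count |
|------|-----|------|------|
| Beijing | 2 vCPU / 4 GB RAM / 50 GB disk | 100 Mbps | 1 |
| Shanghai | 2 vCPU / 4 GB RAM / 50 GB disk | 100 Mbps | 1 |
| Hong Kong | 2 vCPU / 4 GB RAM / 50 GB disk | 100 Mbps | 1 |
| Singapore | 2 vCPU / 4 GB RAM / 50 GB disk | 100 Mbps | 1 |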

### 4.2 Network Requirements

#### 4.2.1 Headscale Server Ports

| Port | Protocol | Purpose | Source |
|-----|------|-----|------|
| 443 | TCP | HTTPS API & gRPC | All clients |
| 80 | TCP | HTTP redirect | All clients |
| 50443 | TCP | Management API (optional) | Management network |

#### 4.2.2 DERP Server Ports

| Port | Protocol | Purpose | Source |
|-----|------|-----|------|
| 443 | TCP | HTTPS DERP | All clients |
| 3478 | UDP | STUN | All clients |
| 80 | TCP | HTTP redirect | All clients |

#### 4.2.3 Tailscale Client Ports

| Port | Protocol | Purpose | Direction |
|-----|------|-----|------|
| 41641 | UDP | WireGuard direct connections | Inbound/outbound |
| 443 | TCP | DERP relay | Outbound |
| 3478 | UDP | STUN | Outbound |

### 4.3 Domain and Certificate Plan

| Domain | Purpose | Certificate |
|------|-----|---------|
| hs.ops.company.com | Headscale API | Let's Encrypt wildcard |
| admin.hs.ops.company.com | Management interface | Let's Encrypt |
| derp-bj.ops.company.com | Beijing DERP | Let's Encrypt |
| derp-sh.ops.company.com | Shanghai DERP | Let's Encrypt |
| derp-hk.ops.company.com | Hong Kong DERP | Let's Encrypt |

---

## 5. Headscale Server Deployment

### 5.1 System Preparation

```bash
# OS: Ubuntu 22.04 LTS / Rocky Linux 9 (the commands below use apt; use dnf on Rocky Linux)
# Set the time zone
timedatectl set-timezone Asia/Shanghai

# Update the system
apt update && apt upgrade -y

# Install required tools
apt install -y curl wget vim htop net-tools jq unzip

# Disable swap (only for containerized deployments)
swapoff -a
sed -i '/swap/d' /etc/fstab

# Kernel parameters
cat >> /etc/sysctl.conf << EOF
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1
net.core.rmem_max = 2500000
net.core.wmem_max = 2500000
EOF
sysctl -p

# File descriptor limits
cat >> /etc/security/limits.conf << EOF
* soft nofile 65535
* hard nofile 65535
root soft nofile 65535
root hard nofile 65535
EOF
```

### 5.2 PostgreSQL High-Availability Deployment

#### 5.2.1 PostgreSQL Primary Installation

```bash
# Install PostgreSQL 15 (on Ubuntu 22.04 this requires the PGDG apt repository;
# the contrib modules ship inside the postgresql-15 package)
apt install -y postgresql-15

# Put tuning and replication settings in a conf.d drop-in so the packaged
# postgresql.conf (data_directory, hba_file, include_dir) stays intact
cat > /etc/postgresql/15/main/conf.d/headscale.conf << 'EOF'
listen_addresses = '*'
port = 5432
max_connections = 200
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 10MB
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
max_parallel_maintenance_workers = 2

# Replication
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
hot_standby = on
EOF

# Access control (replace the <...> placeholders with real addresses)
cat > /etc/postgresql/15/main/pg_hba.conf << 'EOF'
local   all             postgres                                peer
local   all             all                                     peer
host    all             all             127.0.0.1/32            scram-sha-256
host    all             all             ::1/128                 scram-sha-256
host    replication     replicator      <standby_ip>/32         scram-sha-256
host    headscale       headscale       <headscale_ip>/32       scram-sha-256
host    headscale       headscale       <headscale_standby_ip>/32 scram-sha-256
EOF

# Create the database and users (replace the example passwords)
sudo -u postgres psql << 'EOF'
CREATE USER headscale WITH PASSWORD 'your_secure_password_here';
CREATE DATABASE headscale OWNER headscale;
GRANT ALL PRIVILEGES ON DATABASE headscale TO headscale;

CREATE USER replicator WITH REPLICATION PASSWORD 'replicator_password';
EOF

systemctl restart postgresql
systemctl enable postgresql
```

#### 5.2.2 PostgreSQL Replica Configuration

```bash
# Stop PostgreSQL
systemctl stop postgresql

# Empty the data directory
rm -rf /var/lib/postgresql/15/main/*

# Copy the data from the primary (-R writes the standby connection settings)
sudo -u postgres pg_basebackup -h <primary_ip> -U replicator -p 5432 \
  -D /var/lib/postgresql/15/main -Fp -Xs -P -R

# Start the replica
systemctl start postgresql
```
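
After the replica starts, it is worth confirming that streaming replication is actually running. The sketch below wraps two standard PostgreSQL queries (`pg_is_in_recovery()` and the `pg_stat_replication` view) in bash functions; the function names are illustrative, and the commands assume local `peer` access for the `postgres` OS user as configured above.

```bash
#!/usr/bin/env bash

# Run on the standby: prints "t" when the server is replaying WAL as a replica.
standby_in_recovery() {
    sudo -u postgres psql -Atc 'SELECT pg_is_in_recovery();'
}

# Run on the primary: lists connected standbys and their streaming state.
primary_replication_status() {
    sudo -u postgres psql -Atc \
        'SELECT client_addr, state, sync_state FROM pg_stat_replication;'
}

# Exit 0 only if this host is currently a standby.
assert_standby() {
    [ "$(standby_in_recovery)" = "t" ]
}
```

On the primary, a healthy setup would be expected to show one row per replica with `state` equal to `streaming`.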

### 5.3 Headscale Installation and Configuration

#### 5.3.1 Binary Installation

```bash
# Download a release (0.23.0 shown as an example)
HEADSCALE_VERSION="0.23.0"
wget -O /tmp/headscale.deb \
  "https://github.com/juanfont/headscale/releases/download/v${HEADSCALE_VERSION}/headscale_${HEADSCALE_VERSION}_linux_amd64.deb"

# Install
dpkg -i /tmp/headscale.deb

# Or use Docker
docker pull headscale/headscale:0.23.0
```

#### 5.3.2 Headscale Configuration File

```yaml
# /etc/headscale/config.yaml
---
server_url: https://hs.ops.company.com:443
# Plain HTTP on loopback; Nginx (section 5.5) terminates TLS on port 443
listen_addr: 127.0.0.1:8080
metrics_listen_addr: 127.0.0.1:9090
grpc_listen_addr: 0.0.0.0:50443
grpc_allow_insecure: false

# Private key paths
private_key_path: /var/lib/headscale/private.key
noise:
  private_key_path: /var/lib/headscale/noise_private.key

# IP address prefixes
prefixes:
  v4: 100.64.0.0/10
  v6: fd7a:115c:a1e0::/48
  allocation: sequential

# Database (PostgreSQL)
database:
  type: postgres
  postgres:
    host: <postgresql_host>
    port: 5432
    name: headscale
    user: headscale
    pass: your_secure_password_here
    max_open_conns: 100
    max_idle_conns: 10
    conn_max_idle_time_secs: 3600
    ssl: disable  # use "require" in production

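# DERP
derp:
  server:
    enabled: false  # embedded DERP disabled; standalone DERP servers are used
    region_id: 999
    region_code: "headscale"
    region_name: "Headscale Embedded DERP"
    stun_listen_addr: "0.0.0.0:3478"
  # An empty urls list leaves out the default Tailscale-hosted DERP map
  urls: []
  # Self-hosted DERP map loaded from the local file defined in section 5.4.3
  paths:
    - /etc/headscale/derp.json
  auto_update_enabled: true
  update_frequency: 24h

# Update check and ephemeral node cleanup
disable_check_updates: true
ephemeral_node_inactivity_timeout: 30m

# Node update check interval
node_update_check_interval: 10s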

# DNS
dns:
  magic_dns: true
  base_domain: ts.company.local
  nameservers:
    global:
      - 10.0.0.1  # internal DNS
      - 223.5.5.5  # AliDNS (fallback)
  search_domains:
    - company.local
  extra_records:
    - name: "grafana.ts.company.local"
      type: "A"
      value: "100.64.0.10"
    - name: "prometheus.ts.company.local"
      type: "A"
      value: "100.64.0.11"

# Unix socket
unix_socket: /var/run/headscale/headscale.sock
unix_socket_permission: "0770"

# TLS (leave empty when a reverse proxy terminates TLS)
tls_cert_path: ""
tls_key_path: ""

# Logging
log:
  format: json
  level: info

# ACL policy
policy:
  mode: file
  path: /etc/headscale/acl.json

# OIDC (optional)
oidc:
  only_start_if_oidc_is_available: true
  issuer: "https://sso.company.com/realms/ops"
  client_id: "headscale"
  client_secret: "your_oidc_client_secret"
  scope: ["openid", "profile", "email"]
  extra_params:
    domain_hint: company.com
  strip_email_domain: true
  allowed_users: []
  allowed_groups:
    - "/ops-team"
    - "/dev-team"
```

#### 5.3.3 Create the systemd Service

```ini
# /etc/systemd/system/headscale.service
[Unit]
Description=headscale coordination server
Documentation=https://github.com/juanfont/headscale
# PostgreSQL runs on a separate host, so there is no local postgresql.service to depend on
After=network-online.target
Wants=network-online.target

[Service]
User=headscale
Group=headscale
Type=simple
Restart=always
RestartSec=5
ExecStart=/usr/bin/headscale serve
Environment="GIN_MODE=release"

# Resource limits
LimitNOFILE=65535
LimitNPROC=65535

# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
# Recreates /run/headscale (owned by the service user) on every start
RuntimeDirectory=headscale
ReadWritePaths=/var/lib/headscale

[Install]
WantedBy=multi-user.target
```

#### 5.3.4 Start the Service

```bash
# Create the user and directories (the user may already exist after a .deb install)
id headscale > /dev/null 2>&1 || useradd -r -s /bin/false headscale
mkdir -p /var/lib/headscale /var/run/headscale /etc/headscale
chown -R headscale:headscale /var/lib/headscale /var/run/headscale

# Start the service
systemctl daemon-reload
systemctl enable headscale
systemctl start headscale

# Verify the service
systemctl status headscale
headscale version
```
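
`systemctl status` only shows that the process is running. A readiness probe against the `/health` endpoint (the same path the Nginx configuration in section 5.5 proxies) gives a stronger signal. The following is a minimal sketch; `wait_for_headscale` is a hypothetical helper, and the default URL should be replaced with whatever address the listener or proxy actually serves.

```bash
#!/usr/bin/env bash

# Poll the health endpoint until curl gets a successful response
# or the attempts run out.
# Usage: wait_for_headscale [url] [attempts] [delay_seconds]
wait_for_headscale() {
    local url=${1:-https://hs.ops.company.com/health}
    local attempts=${2:-10} delay=${3:-3}
    local i
    for (( i = 1; i <= attempts; i++ )); do
        if curl -fsS --max-time 5 "$url" > /dev/null; then
            echo "headscale healthy after ${i} attempt(s)"
            return 0
        fi
        sleep "$delay"
    done
    echo "headscale did not become healthy" >&2
    return 1
}
```

A deployment script can chain it after the start command, for example `systemctl start headscale && wait_for_headscale`.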

### 5.4 DERP Relay Server Deployment

#### 5.4.1 DERP Server Setup

```bash
# Install Go (only needed when building from source)
wget https://go.dev/dl/go1.21.5.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.21.5.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin

# Install derper
go install tailscale.com/cmd/derper@latest

# Or use Docker
docker pull ghcr.io/tailscale/derper:latest
```

#### 5.4.2 DERP Deployment with Docker Compose

```yaml
# /opt/derper/docker-compose.yml
version: '3.8'
services:
  derper:
    image: ghcr.io/tailscale/derper:latest
    container_name: derper
    restart: always
    ports:
      - "443:443"
      - "80:80"
      - "3478:3478/udp"
    volumes:
      - ./certs:/etc/derper/certs:ro
      - ./config:/etc/derper/config:ro
    command:
      - --hostname=derp-bj.ops.company.com
      - --certmode=manual
      - --certdir=/etc/derper/certs
      - --stun
      - --stun-port=3478
      # Client verification is delegated to the Headscale endpoint below;
      # --verify-clients would additionally need a tailscaled running next to derper
      - --verify-client-url=https://hs.ops.company.com/verify
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "3"
```

#### 5.4.3 DERP Map Configuration

Configure the DERP map on the Headscale server and save it as `/etc/headscale/derp.json`:

```json
{
  "Regions": {
    "900": {
      "RegionID": 900,
      "RegionCode": "bj",
      "RegionName": "Beijing",
      "Avoid": false,
      "Nodes": [
        {
          "Name": "bj1",
          "RegionID": 900,
          "HostName": "derp-bj.ops.company.com",
          "DERPPort": 443,
          "STUNPort": 3478,
          "InsecureForTests": false
        }
      ]
    },
    "901": {
      "RegionID": 901,
      "RegionCode": "sh",
      "RegionName": "Shanghai",
      "Avoid": false,
      "Nodes": [
        {
          "Name": "sh1",
          "RegionID": 901,
          "HostName": "derp-sh.ops.company.com",
          "DERPPort": 443,
          "STUNPort": 3478,
          "InsecureForTests": false
        }
      ]
    },
    "902": {
      "RegionID": 902,
      "RegionCode": "hk",
      "RegionName": "Hong Kong",
      "Avoid": false,
      "Nodes": [
        {
          "Name": "hk1",
          "RegionID": 902,
          "HostName": "derp-hk.ops.company.com",
          "DERPPort": 443,
          "STUNPort": 3478,
          "InsecureForTests": false
        }
      ]
    }
  }
}
```
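
Once the DERP map is loaded, a client can confirm that the self-hosted regions are actually offered to it. The sketch below greps the output of `tailscale netcheck` for the region codes defined above; the assumed line format (`- <code>: <latency>`) reflects typical `netcheck` output and may differ between client versions, and `derp_regions_visible` is a hypothetical helper name.

```bash
#!/usr/bin/env bash

# Succeed if every given DERP region code appears in the netcheck report.
# Usage: derp_regions_visible bj sh hk
derp_regions_visible() {
    local report code
    report=$(tailscale netcheck 2>&1) || return 1
    for code in "$@"; do
        if ! grep -q -- "- ${code}:" <<< "$report"; then
            echo "region ${code} missing from netcheck report" >&2
            return 1
        fi
    done
}
```

Running `derp_regions_visible bj sh hk` on a freshly joined node is a quick way to catch a DERP map that was not picked up by Headscale.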

### 5.5 Nginx Reverse Proxy Configuration

```nginx
# /etc/nginx/sites-available/headscale
upstream headscale {
    server 127.0.0.1:8080;
    keepalive 32;
}

server {
    listen 80;
    server_name hs.ops.company.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name hs.ops.company.com;

    # TLS settings
    ssl_certificate /etc/letsencrypt/live/hs.ops.company.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/hs.ops.company.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers off;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 1d;
    ssl_session_tickets off;
    ssl_stapling on;
    ssl_stapling_verify on;

    # Security headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Frame-Options DENY always;
    add_header X-Content-Type-Options nosniff always;

    location / {
        proxy_pass http://headscale;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_buffering off;
        proxy_read_timeout 86400s;
        proxy_send_timeout 86400s;
    }

    # gRPC support
    location /headscale.v1.HeadscaleService/ {
        grpc_pass grpc://127.0.0.1:50443;
        grpc_set_header Host $host;
        grpc_set_header X-Real-IP $remote_addr;
    }

    # Health check
    location /health {
        proxy_pass http://headscale/health;
        access_log off;
    }

    # Metrics (internal networks only)
    location /metrics {
        allow 10.0.0.0/8;
        allow 172.16.0.0/12;
        allow 192.168.0.0/16;
        allow 100.64.0.0/10;
        deny all;
        proxy_pass http://127.0.0.1:9090/metrics;
    }
}
```

### 5.6 Admin UI Deployment (Headscale-UI)

```yaml
# /opt/headscale-ui/docker-compose.yml
version: '3.8'
services:
  headscale-ui:
    image: ghcr.io/gurucomputing/headscale-ui:latest
    container_name: headscale-ui
    restart: always
    ports:
      - "127.0.0.1:8081:80"
    environment:
      - HS_SERVER=https://hs.ops.company.com
```

---

## 6. Client Onboarding

### 6.1 Linux Servers

#### 6.1.1 Install the Tailscale Client

```bash
# Ubuntu/Debian/RHEL/CentOS (the install script detects the distribution)
curl -fsSL https://tailscale.com/install.sh | sh

# Or install manually from the apt repository (Ubuntu 22.04 "jammy" shown)
curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/jammy.noarmor.gpg | sudo tee /usr/share/keyrings/tailscale-archive-keyring.gpg >/dev/null
curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/jammy.tailscale-keyring.list | sudo tee /etc/apt/sources.list.d/tailscale.list
apt update && apt install -y tailscale
```

#### 6.1.2 Connect to Headscale

```bash
# Pre-authentication key (recommended for servers); create one as shown in section 6.4
tailscale up \
  --login-server https://hs.ops.company.com \
  --authkey <preauth_key> \
  --hostname prod-web-bj-001 \
  --advertise-tags tag:prod,tag:web \
  --accept-routes \
  --accept-dns

# Interactive login (for developer machines)
tailscale up \
  --login-server https://hs.ops.company.com \
  --hostname ops-laptop-zhangsan

# Verify the connection
tailscale status
tailscale ip
```

#### 6.1.3 Automated Install Script

```bash
#!/bin/bash
# /opt/scripts/setup-tailscale.sh

set -euo pipefail

# Configuration (override through environment variables)
HEADSCALE_URL="${HEADSCALE_URL:-https://hs.ops.company.com}"
AUTH_KEY="${AUTH_KEY:-}"
HOSTNAME="${HOSTNAME:-$(hostname -s)}"
TAGS="${TAGS:-}"
ACCEPT_ROUTES="${ACCEPT_ROUTES:-true}"

# Logging helper
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}

# Install only if the client is missing
if command -v tailscale &> /dev/null; then
    log "Tailscale already installed, version: $(tailscale version)"
else
    log "Installing Tailscale..."
    curl -fsSL https://tailscale.com/install.sh | sh
fi

# Build the tailscale up arguments as an array (no eval, values stay quoted)
UP_ARGS=(--login-server "$HEADSCALE_URL")

if [ -n "$AUTH_KEY" ]; then
    UP_ARGS+=(--authkey "$AUTH_KEY")
fi

if [ -n "$HOSTNAME" ]; then
    UP_ARGS+=(--hostname "$HOSTNAME")
fi

if [ -n "$TAGS" ]; then
    UP_ARGS+=(--advertise-tags "$TAGS")
fi

if [ "$ACCEPT_ROUTES" = "true" ]; then
    UP_ARGS+=(--accept-routes --accept-dns)
fi

# Connect
log "Connecting to Headscale..."
tailscale up "${UP_ARGS[@]}"

# Verify: tailscale ip -4 fails until the node has been assigned an address
sleep 5
if TS_IP=$(tailscale ip -4 2> /dev/null); then
    log "Connected. IP: ${TS_IP}"
else
    log "Connection failed, check the configuration"
    exit 1
fi
```

### 6.2 macOS/Windows Clients

#### 6.2.1 macOS

```bash
# Install with Homebrew
brew install tailscale

# Start the daemon and connect
sudo tailscaled &
tailscale up --login-server https://hs.ops.company.com

# Or use the official client
# Download: https://tailscale.com/download/mac
# After installing, change the Login Server in the settings
```

#### 6.2.2 Windows

```powershell
# Install with Winget
winget install tailscale.tailscale

# Install with Chocolatey
choco install tailscale

# Connect (PowerShell as Administrator)
tailscale up --login-server https://hs.ops.company.com
```

### 6.3 Mobile Devices

1. Download the official Tailscale client from the App Store or Google Play
2. Open the app and tap the settings icon
3. Select "Custom coordination server"
4. Enter: `https://hs.ops.company.com`
5. Tap "Log in" to complete authentication

### 6.4 Pre-Authentication Key Management

```bash
# Create a reusable pre-auth key (for automated deployment)
headscale preauthkeys create \
  --user prod \
  --reusable \
  --expiration 720h \
  --tags tag:prod,tag:automated

# Create a single-use pre-auth key
headscale preauthkeys create \
  --user ops-team \
  --expiration 24h

# List all pre-auth keys of a user
headscale preauthkeys list --user prod

# Expire a key
headscale preauthkeys expire --user prod <key>
```

### 6.5 Automated Deployment with Ansible

```yaml
# roles/tailscale/tasks/main.yml
---
- name: Install Tailscale
  shell: curl -fsSL https://tailscale.com/install.sh | sh
  args:
    creates: /usr/bin/tailscale

- name: Start tailscaled service
  systemd:
    name: tailscaled
    state: started
    enabled: yes

- name: Check if already connected
  command: tailscale status
  register: ts_status
  ignore_errors: yes
  changed_when: false

- name: Connect to Headscale
  command: >
    tailscale up
    --login-server {{ headscale_url }}
    --authkey {{ headscale_authkey }}
    --hostname {{ inventory_hostname }}
    --advertise-tags {{ tailscale_tags | join(',') }}
    --accept-routes
    --accept-dns
  when: ts_status.rc != 0
  no_log: true  # keep the pre-auth key out of the Ansible output

- name: Verify connection
  command: tailscale ip -4
  register: ts_ip
  changed_when: false

- name: Display Tailscale IP
  debug:
    msg: "Tailscale IP: {{ ts_ip.stdout }}"
```

---

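## 7. Access Control and Security Policy

### 7.1 ACL Design Principles

1. **Least privilege**: grant only the permissions needed to do the job
2. **Layered isolation**: production, testing, and development are strictly separated
3. **Role-based**: operators and developers get different permissions by role
4. **Auditable**: all access can be logged and traced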
### 7.2 Detailed ACL Configuration

```json
// /etc/headscale/acl.json
{
  // Group members are Headscale user names (no "user:" prefix)
  "groups": {
    "group:ops-admin": ["zhangsan", "lisi"],
    "group:ops-member": ["wangwu", "zhaoliu"],
    "group:dev-senior": ["dev01", "dev02"],
    "group:dev-junior": ["dev03", "dev04"],
    "group:qa": ["qa01", "qa02"],
    "group:dba": ["dba01"]
  },

  "tagOwners": {
    "tag:prod": ["group:ops-admin"],
    "tag:staging": ["group:ops-admin", "group:ops-member"],
    "tag:testing": ["group:ops-admin", "group:qa"],
    "tag:dev": ["group:ops-admin", "group:dev-senior"],
    "tag:web": ["group:ops-admin"],
    "tag:api": ["group:ops-admin"],
    "tag:db": ["group:ops-admin", "group:dba"],
    "tag:cache": ["group:ops-admin"],
    "tag:mq": ["group:ops-admin"],
    "tag:k8s": ["group:ops-admin"],
    "tag:monitoring": ["group:ops-admin"],
    "tag:bastion": ["group:ops-admin"]
  },

  "hosts": {
    "prod-bastion": "100.65.100.1",
    "staging-bastion": "100.66.100.1",
    "monitoring-server": "100.64.0.10",
    "jenkins-master": "100.67.100.1"
  },

  "acls": [
    // ===== Infrastructure rules =====
    // Every node may reach DNS
    {
      "action": "accept",
      "src": ["*"],
      "dst": ["100.64.0.1:53"]
    },

    // Every node may reach the monitoring stack
    {
      "action": "accept",
      "src": ["*"],
      "dst": ["tag:monitoring:9090,9093,3000"]
    },

    // ===== Ops admin rules =====
    // Ops admins may reach every service in every environment
    {
      "action": "accept",
      "src": ["group:ops-admin"],
      "dst": ["*:*"]
    },

    // ===== Regular operator rules =====
    // Regular operators may reach the non-production environments
    {
      "action": "accept",
      "src": ["group:ops-member"],
      "dst": ["tag:staging:*", "tag:testing:*", "tag:dev:*"]
    },
    // Regular operators reach production only through the bastion hosts
    {
      "action": "accept",
      "src": ["group:ops-member"],
      "dst": ["tag:bastion:22"]
    },

    // ===== DBA rules =====
    // DBAs may reach all databases
    {
      "action": "accept",
      "src": ["group:dba"],
      "dst": ["tag:db:3306,5432,6379,27017"]
    },
    // DBAs may reach the bastion hosts
    {
      "action": "accept",
      "src": ["group:dba"],
      "dst": ["tag:bastion:22"]
    },

    // ===== Senior developer rules =====
    // Senior developers may reach development, testing, and staging
    {
      "action": "accept",
      "src": ["group:dev-senior"],
      "dst": ["tag:staging:*", "tag:testing:*", "tag:dev:*"]
    },

    // ===== Junior developer rules =====
    // Junior developers may reach development only
    {
      "action": "accept",
      "src": ["group:dev-junior"],
      "dst": ["tag:dev:*"]
    },

    // ===== QA rules =====
    // QA may reach testing, plus the web ports of staging
    {
      "action": "accept",
      "src": ["group:qa"],
      "dst": ["tag:testing:*", "tag:staging:80,443,8080"]
    },

    // ===== Service-to-service rules =====
    // Production web servers may reach the API servers
    {
      "action": "accept",
      "src": ["tag:web"],
      "dst": ["tag:api:8080,8443"]
    },
    // API servers may reach databases, caches, and message queues
    {
      "action": "accept",
      "src": ["tag:api"],
      "dst": ["tag:db:3306,5432", "tag:cache:6379", "tag:mq:5672,15672"]
    },
    // Traffic inside the Kubernetes cluster
    {
      "action": "accept",
      "src": ["tag:k8s"],
      "dst": ["tag:k8s:*"]
    },

    // ===== CI/CD rules =====
    // Jenkins may deploy to the testing environment
    {
      "action": "accept",
      "src": ["jenkins-master"],
      "dst": ["tag:testing:22,80,443,8080"]
    },

    // ===== Default deny (implicit) =====
  ],

  // SSH rules (control Tailscale SSH)
  "ssh": [
    {
      "action": "accept",
      "src": ["group:ops-admin"],
      "dst": ["*"],
      "users": ["root", "ubuntu", "centos"]
    },
    {
      "action": "accept",
      "src": ["group:ops-member"],
      "dst": ["tag:staging", "tag:testing", "tag:dev"],
      "users": ["ubuntu", "centos"]
    }
  ],

  // Policy tests (for debugging)
  "tests": [
    {
      "src": "zhangsan",
      "accept": ["tag:prod:22", "tag:db:3306"]
    },
    {
      "src": "dev01",
      "accept": ["tag:dev:*"],
      "deny": ["tag:prod:*"]
    }
  ]
}
```
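
Independently of the `tests` section, the effect of a policy change can be spot-checked from a real client. The sketch below is a hedged smoke test that uses `nc` to try a TCP connection and compares the result with the intended outcome; `can_connect` and `expect_acl` are hypothetical helper names, and a closed port looks the same as a denied one, so the target service must be listening for an `allow` check to be meaningful.

```bash
#!/usr/bin/env bash

# Succeed if a TCP connection to host:port opens within 3 seconds.
can_connect() {
    nc -z -w 3 "$1" "$2" 2> /dev/null
}

# Compare actual reachability with the policy intent.
# Usage: expect_acl allow|deny <host> <port>
expect_acl() {
    local want=$1 host=$2 port=$3 got=deny
    if can_connect "$host" "$port"; then
        got=allow
    fi
    if [ "$got" = "$want" ]; then
        echo "PASS ${host}:${port} ${got}"
    else
        echo "FAIL ${host}:${port} want=${want} got=${got}"
        return 1
    fi
}
```

Run from a junior developer laptop, `expect_acl deny prod-db-sh-001 3306` should pass under the policy above, while `expect_acl allow` against a development host on a listening port should also pass.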

### 7.3 Tag Management

```bash
# Add tags to a node
headscale nodes tag -i <node_id> -t "tag:prod,tag:web"

# List nodes
headscale nodes list

# Set tags through the REST API (one call per node; loop over node IDs for bulk updates).
# Headscale 0.23 renamed the "machine" resource to "node".
curl -X POST https://hs.ops.company.com/api/v1/node/<node_id>/tags \
  -H "Authorization: Bearer <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["tag:prod", "tag:web", "tag:bj"]}'
```

### 7.4 Security Hardening

#### 7.4.1 Headscale Server Hardening

```bash
# 1. Firewall
ufw default deny incoming
ufw default allow outgoing
ufw allow from 10.0.0.0/8 to any port 22   # SSH from the internal network only
ufw allow 80/tcp                            # HTTP redirect
ufw allow 443/tcp                           # HTTPS
ufw allow 50443/tcp                         # gRPC (only if needed)
ufw enable

# 2. fail2ban
apt install -y fail2ban
cat > /etc/fail2ban/jail.local << 'EOF'
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600

# This jail needs a custom filter in /etc/fail2ban/filter.d/headscale.conf and
# Headscale writing to the log file below; neither exists by default
[headscale]
enabled = true
port = 443
filter = headscale
logpath = /var/log/headscale/headscale.log
maxretry = 5
bantime = 3600
findtime = 600
EOF

# 3. Disable password login
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
systemctl restart sshd

# 4. Apply updates regularly
apt update && apt upgrade -y
```

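#### 7.4.2 Client Security Settings

```bash
# Restrict what the Tailscale interface accepts and advertises.
# Comments cannot follow a trailing backslash, so the flags are explained here:
#   --shields-up           reject inbound connections by default
#   --accept-routes=false  ignore routes advertised by other nodes
#   --advertise-routes=""  advertise no local routes
#   --exit-node=""         use no exit node
tailscale up \
  --login-server https://hs.ops.company.com \
  --shields-up \
  --accept-routes=false \
  --advertise-routes="" \
  --exit-node=""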
```

---

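## 8. DNS and Service Discovery

### 8.1 MagicDNS Configuration

Headscale's built-in MagicDNS provides automatic name-based service discovery:

```yaml
# DNS section of config.yaml
dns:
  magic_dns: true
  base_domain: ts.company.local
  nameservers:
    global:
      - 10.0.0.1        # company internal DNS
      - 223.5.5.5       # AliDNS
    split:              # per-domain resolvers ("split" is the key name in the 0.23 config format)
      internal.company.com:
        - 10.0.0.1
      aws.internal:
        - 169.254.169.253
  search_domains:
    - ts.company.local
    - company.local
  extra_records:        # fully qualified names; only address records are supported
    - name: "grafana.ts.company.local"
      type: "A"
      value: "100.64.0.10"
    - name: "prometheus.ts.company.local"
      type: "A"
      value: "100.64.0.11"
    - name: "jenkins.ts.company.local"
      type: "A"
      value: "100.67.100.1"
    # gitlab: CNAME records are not supported in extra_records; use the node's
    # MagicDNS name prod-gitlab-bj-001.ts.company.local or add an A record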
```

### 8.2 DNS Resolution Rules

With MagicDNS enabled, names resolve as follows:

| Name format | Resolution | Example |
|---------|---------|------|
| `<hostname>` | Short name, resolved directly | `prod-web-bj-001` → `100.65.1.1` |
| `<hostname>.<user>` | Name qualified by user (namespace) | `prod-web-bj-001.prod` |
| `<hostname>.<base_domain>` | Fully qualified name | `prod-web-bj-001.ts.company.local` |
| Custom record | `extra_records` | `grafana` → `100.64.0.10` |
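
The rules in this table can be verified from any node that runs the Tailscale client, because MagicDNS answers on the local resolver address `100.100.100.100`. The sketch below assumes `dig` is installed; `magicdns_lookup` and `assert_resolves` are hypothetical helper names.

```bash
#!/usr/bin/env bash

# Query the local MagicDNS resolver and print the first A record, if any.
magicdns_lookup() {
    dig +short A "$1" @100.100.100.100 | head -n 1
}

# Succeed if the name resolves to the expected address.
# Usage: assert_resolves <name> <expected_ip>
assert_resolves() {
    local name=$1 want=$2 got
    got=$(magicdns_lookup "$name")
    if [ "$got" != "$want" ]; then
        echo "${name}: want ${want}, got ${got:-<none>}" >&2
        return 1
    fi
}
```

For instance, `assert_resolves prod-web-bj-001.ts.company.local 100.65.1.1` checks the fully qualified form from the table.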

### 8.3 Split DNS Configuration

Route specific domains to specific DNS servers:

```yaml
dns:
  nameservers:
    split:
      # AWS internal domains use the AWS resolver
      "compute.internal":
        - 169.254.169.253
      "ec2.internal":
        - 169.254.169.253
      # Alibaba Cloud internal domains
      "alibaba-inc.com":
        - 100.100.2.136
        - 100.100.2.138
      # Company internal domains
      "company.internal":
        - 10.0.0.1
        - 10.0.0.2
```

### 8.4 Service Discovery Integration

#### 8.4.1 Consul Integration

```hcl
# consul-config.hcl
services {
  id   = "web-prod-001"
  name = "web"
  tags = ["prod", "tailscale"]
  port = 80

  checks = [
    {
      http     = "http://prod-web-bj-001.ts.company.local/health"
      interval = "10s"
      timeout  = "2s"
    }
  ]
}
```

#### 8.4.2 Kubernetes CoreDNS Integration

```yaml
# coredns-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    # Dedicated server block for the tailnet zone: forward to MagicDNS
    # (older CoreDNS releases accept only one forward per server block)
    ts.company.local:53 {
        errors
        cache 30
        forward . 100.100.100.100
    }
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
```

---

## 9. Monitoring and Alerting

### 9.1 Monitoring Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        Grafana Dashboard                         │
│                    (hs-monitor.ops.company.com)                  │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                 ┌─────────────┴─────────────┐
                 │       Prometheus           │
                 │    (100.64.0.11:9090)      │
                 └─────────────┬─────────────┘
                               │
       ┌───────────────┬───────┴───────┬───────────────┐
       │               │               │               │
       ▼               ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│  Headscale  │ │    DERP     │ │  Tailscale  │ │   System    │
│   Metrics   │ │   Metrics   │ │   Metrics   │ │   Metrics   │
│  :9090      │ │   :8080     │ │  (via API)  │ │  (node_exp) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
```

### 9.2 Prometheus Configuration

```yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  # Headscale metrics
  - job_name: 'headscale'
    static_configs:
      - targets: ['100.64.0.1:9090']
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: headscale-primary

  # DERP server metrics
  - job_name: 'derp'
    static_configs:
      - targets:
        - 'derp-bj.ops.company.com:8080'
        - 'derp-sh.ops.company.com:8080'
        - 'derp-hk.ops.company.com:8080'

  # PostgreSQL metrics (postgres_exporter)
  - job_name: 'postgresql'
    static_configs:
      - targets: ['100.64.0.2:9187']

  # All Tailscale nodes (file-based service discovery)
  - job_name: 'tailscale-nodes'
    file_sd_configs:
      - files:
        - '/etc/prometheus/tailscale_nodes.json'
        refresh_interval: 5m
```
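
The `tailscale-nodes` job reads its targets from `/etc/prometheus/tailscale_nodes.json`. As a rough sketch of how that file could be generated, the function below converts node records (as returned by the Headscale API) into Prometheus `file_sd` target groups. The field names `ipAddresses` and `user.name`, the node_exporter port 9100, and the function name `build_file_sd` are assumptions made for illustration; check them against the API response of the deployed Headscale version.

```python
import json

NODE_EXPORTER_PORT = 9100  # assumption: node_exporter listens on its default port


def build_file_sd(nodes, port=NODE_EXPORTER_PORT):
    """Convert Headscale node records into Prometheus file_sd target groups."""
    groups = []
    for node in nodes:
        addresses = node.get("ipAddresses") or []
        # Pick the first IPv4 address; skip nodes that have none.
        ipv4 = next((a for a in addresses if ":" not in a), None)
        if ipv4 is None:
            continue
        groups.append({
            "targets": [f"{ipv4}:{port}"],
            "labels": {
                # These label names match the ones used in the alert rules.
                "hostname": node.get("givenName", ""),
                "env": (node.get("user") or {}).get("name", ""),
            },
        })
    return groups


if __name__ == "__main__":
    sample = [{
        "givenName": "prod-web-bj-001",
        "ipAddresses": ["100.64.0.21", "fd7a:115c:a1e0::15"],
        "user": {"name": "prod"},
    }]
    print(json.dumps(build_file_sd(sample), indent=2))
```

Writing the result to a temporary file and renaming it into place avoids Prometheus reading a half-written file.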

### 9.3 Key Metrics

#### 9.3.1 Headscale Metrics

| Metric | Type | Description | Alert threshold |
|---------|-----|------|---------|
| `headscale_connected_nodes` | Gauge | Number of connected nodes | < 90% of the expected node count |
| `headscale_api_requests_total` | Counter | Total API requests | - |
| `headscale_api_request_duration_seconds` | Histogram | API response time | P99 > 1s |
| `headscale_db_query_duration_seconds` | Histogram | Database query time | P99 > 500ms |

#### 9.3.2 DERP Metrics

| Metric | Type | Description | Alert threshold |
|---------|-----|------|---------|
| `derp_connections` | Gauge | Current connections | > 10000 |
| `derp_bytes_sent_total` | Counter | Bytes sent | Spike > 200% |
| `derp_bytes_received_total` | Counter | Bytes received | Spike > 200% |
| `derp_home_connections` | Gauge | Home connections | - |

#### 9.3.3 Node Health Metrics

| Metric | Type | Description | Alert threshold |
|---------|-----|------|---------|
| `tailscale_up` | Gauge | Node online status | = 0 |
| `tailscale_derp_latency_seconds` | Gauge | DERP latency | > 200ms |
| `tailscale_peer_count` | Gauge | Number of peers | = 0 |

### 9.4 Alert Rules

```yaml
# /etc/prometheus/rules/headscale.yml
groups:
  - name: headscale
    interval: 30s
    rules:
      # Headscale service unavailable
      - alert: HeadscaleDown
        expr: up{job="headscale"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Headscale control plane unavailable"
          description: "Headscale has been down for more than 1 minute"

      # Many nodes offline
      - alert: TailscaleNodesMassOffline
        expr: |
          (count(tailscale_up == 0) / count(tailscale_up)) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 10% of nodes are offline"
          description: "{{ $value | humanizePercentage }} of nodes are currently offline"

      # Slow API responses
      - alert: HeadscaleAPILatencyHigh
        expr: |
          histogram_quantile(0.99, rate(headscale_api_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Headscale API latency is too high"
          description: "API P99 latency: {{ $value | humanizeDuration }}"

      # Database connection problems
      - alert: HeadscaleDatabaseConnectionIssues
        expr: |
          rate(headscale_db_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Headscale database connection errors"
          description: "Database error rate: {{ $value }}/s"

  - name: derp
    rules:
      # DERP service unavailable
      - alert: DERPServerDown
        expr: up{job="derp"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "DERP relay server unavailable"
          description: "DERP service on {{ $labels.instance }} is down"

      # DERP connection count too high
      - alert: DERPConnectionsHigh
        expr: derp_connections > 8000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DERP connection count is approaching the limit"
          description: "{{ $labels.instance }} current connections: {{ $value }}"

  - name: nodes
    rules:
      # Single node offline
      - alert: TailscaleNodeDown
        expr: tailscale_up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tailscale node offline"
          description: "Node {{ $labels.hostname }} has been offline for more than 5 minutes"

      # Production node offline (stricter)
      - alert: ProductionNodeDown
        expr: tailscale_up{env="prod"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Production node offline"
          description: "Production node {{ $labels.hostname }} is offline"

      # Node is up but has no peers
      - alert: TailscaleNoPeerConnection
        expr: tailscale_peer_count == 0 and tailscale_up == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node cannot establish P2P connections"
          description: "Node {{ $labels.hostname }} cannot establish direct connections to other nodes"
```
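
The `HeadscaleAPILatencyHigh` rule relies on `histogram_quantile`, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that estimate for classic cumulative buckets so the P99 threshold is easier to reason about. It is a simplified model: the bucket boundaries and counts in the example are made-up illustration values, and edge cases that Prometheus handles (such as negative bounds) are ignored.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Illustration values only: 1000 requests, bounds in seconds.
    buckets = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
    p99 = histogram_quantile(0.99, buckets)
    print(f"estimated P99 = {p99:.3f}s, alert fires: {p99 > 1}")
```

Because the estimate can never exceed the highest finite bucket bound, a `> 1` threshold only fires if the histogram has buckets above 1 second.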

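### 9.5 Grafana Dashboards

Create the following dashboards:

1. **Headscale Overview**
   - Total, online, and offline node counts
   - API request QPS and latency
   - Database connection status

2. **DERP Network**
   - Connections per DERP server
   - Traffic statistics (sent/received)
   - Regional distribution

3. **Node Health**
   - Node online status matrix
   - Per-node latency heatmap
   - Per-node traffic statistics

4. **ACL Audit**
   - Access-denied events
   - Rule hit statistics
   - Anomalous access patterns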

---

## 10. Operations and Maintenance Guidelines

### 10.1 Routine Operations

#### 10.1.1 User Management

```bash
# Create users (formerly called namespaces)
headscale users create prod
headscale users create staging
headscale users create dev

# List users
headscale users list

# Delete a user (use with caution)
headscale users destroy dev
```

#### 10.1.2 Node Management

```bash
# List all nodes
headscale nodes list

# List the nodes of a specific user
headscale nodes list --user prod

# Show a single node (filter the list by hostname)
headscale nodes list | grep prod-web-bj-001

# Delete a node
headscale nodes delete --identifier <node_id>

# Rename a node (the new name is a positional argument)
headscale nodes rename --identifier <node_id> new-hostname

# Move a node to another user
headscale nodes move --identifier <node_id> --user staging

# Expire a node immediately (it must re-authenticate to rejoin)
headscale nodes expire --identifier <node_id>
```

#### 10.1.3 Route Management

```bash
# List all routes
headscale routes list

# Enable a route
headscale routes enable --route <route_id>

# Disable a route
headscale routes disable --route <route_id>

# Delete a route
headscale routes delete --route <route_id>
```

#### 10.1.4 API Key Management

```bash
# Create an API key
headscale apikeys create --expiration 90d

# List API keys
headscale apikeys list

# Expire an API key
headscale apikeys expire --prefix <key_prefix>
```

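### 10.2 Operations Scripts

#### 10.2.1 Node Health Check Script

```bash
#!/bin/bash
# /opt/scripts/check-tailscale-health.sh
set -euo pipefail

HEADSCALE_URL="https://hs.ops.company.com"
# Read the API key from the environment instead of hard-coding it
# (the variable name HEADSCALE_API_KEY is a suggestion)
API_KEY="${HEADSCALE_API_KEY:?HEADSCALE_API_KEY is not set}"
ALERT_WEBHOOK="https://webhook.ops.company.com/alert"

# Post an alert; jq builds the payload so newlines and quotes are escaped
send_alert() {
  jq -n --arg text "$1" '{text: $text}' |
    curl -fsS -X POST "$ALERT_WEBHOOK" \
      -H "Content-Type: application/json" -d @-
}

# Fetch all nodes as a stream of JSON objects
# (Headscale 0.23+ serves this list at /api/v1/node under the key .nodes)
nodes=$(curl -fsS -H "Authorization: Bearer $API_KEY" \
  "${HEADSCALE_URL}/api/v1/machine" | jq -c '.machines[]')

# Check for offline nodes
offline_nodes=$(echo "$nodes" | jq -r 'select(.online != true) | .givenName')

if [ -n "$offline_nodes" ]; then
  send_alert "[Tailscale] Offline nodes:"$'\n'"$offline_nodes"
fi

# Check for nodes that expire within 7 days (604800 seconds)
expiring_nodes=$(echo "$nodes" | jq -r \
  'select(.expiry != null and .expiry != "0001-01-01T00:00:00Z") |
   select((.expiry | sub("\\.[0-9]+Z$"; "Z") | fromdateiso8601) < (now + 604800)) |
   .givenName + " (expires: " + .expiry + ")"')

if [ -n "$expiring_nodes" ]; then
  send_alert "[Tailscale] Nodes expiring soon:"$'\n'"$expiring_nodes"
fi
```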

#### 10.2.2 Bulk Node Management Script

```python
#!/usr/bin/env python3
# /opt/scripts/headscale-manager.py
#
# Endpoint paths follow the /api/v1/machine API used in this document;
# Headscale 0.23+ renamed it to /api/v1/node (response key "nodes").

import argparse
import json
import os
import re
from datetime import datetime, timedelta, timezone

import requests

ZERO_TIME = '0001-01-01T00:00:00Z'  # Go zero time, used for "no expiry"


def parse_ts(value):
    """Parse an RFC 3339 timestamp into an aware datetime, or None if unset."""
    if not value or value == ZERO_TIME:
        return None
    # Normalize fractional seconds to 6 digits so fromisoformat accepts them
    value = re.sub(r'\.(\d+)', lambda m: '.' + m.group(1)[:6].ljust(6, '0'), value)
    return datetime.fromisoformat(value.replace('Z', '+00:00'))


class HeadscaleManager:
    def __init__(self, url, api_key, timeout=10):
        self.url = url.rstrip('/')
        self.timeout = timeout
        self.session = requests.Session()
        self.session.headers.update({
            'Authorization': f'Bearer {api_key}',
            'Content-Type': 'application/json',
        })

    def get_nodes(self, user=None):
        """Return the node list, optionally filtered by user."""
        params = {'user': user} if user else {}
        resp = self.session.get(f'{self.url}/api/v1/machine',
                                params=params, timeout=self.timeout)
        resp.raise_for_status()
        return resp.json().get('machines', [])

    def get_offline_nodes(self, threshold_hours=1):
        """Return nodes that have been offline longer than threshold_hours."""
        threshold = datetime.now(timezone.utc) - timedelta(hours=threshold_hours)
        offline = []
        for node in self.get_nodes():
            if node.get('online', False):
                continue
            last_seen = parse_ts(node.get('lastSeen'))
            # Nodes that never connected have no lastSeen; report them too
            if last_seen is None or last_seen < threshold:
                offline.append(node)
        return offline

    def bulk_tag_nodes(self, node_ids, tags):
        """Set the same tags on several nodes; returns per-node success."""
        results = []
        for node_id in node_ids:
            resp = self.session.post(
                f'{self.url}/api/v1/machine/{node_id}/tags',
                json={'tags': tags}, timeout=self.timeout)
            results.append({'node_id': node_id, 'success': resp.ok})
        return results

    def cleanup_expired_nodes(self, dry_run=True):
        """Return expired nodes; delete them unless dry_run is set."""
        now = datetime.now(timezone.utc)
        expired = []
        for node in self.get_nodes():
            expiry = parse_ts(node.get('expiry'))
            if expiry is not None and expiry < now:
                expired.append(node)

        if not dry_run:
            for node in expired:
                resp = self.session.delete(
                    f'{self.url}/api/v1/machine/{node["id"]}',
                    timeout=self.timeout)
                resp.raise_for_status()

        return expired


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Headscale management tool')
    parser.add_argument('--url', required=True, help='Headscale URL')
    # Prefer the environment variable so the key does not show up in `ps`
    parser.add_argument('--api-key', default=os.environ.get('HEADSCALE_API_KEY'),
                        help='API key (defaults to $HEADSCALE_API_KEY)')
    parser.add_argument('action', choices=['list', 'offline', 'cleanup'])
    parser.add_argument('--user', help='Filter by user')
    parser.add_argument('--dry-run', action='store_true',
                        help='Report only; do not delete anything')

    args = parser.parse_args()
    if not args.api_key:
        parser.error('an API key is required (--api-key or HEADSCALE_API_KEY)')

    manager = HeadscaleManager(args.url, args.api_key)

    if args.action == 'list':
        nodes = manager.get_nodes(args.user)
        print(json.dumps(nodes, indent=2))
    elif args.action == 'offline':
        offline = manager.get_offline_nodes()
        print(f"Offline nodes: {len(offline)}")
        for node in offline:
            print(f"  - {node['givenName']} (last seen: {node.get('lastSeen', 'never')})")
    elif args.action == 'cleanup':
        expired = manager.cleanup_expired_nodes(dry_run=args.dry_run)
        print(f"Expired nodes: {len(expired)}")
        for node in expired:
            print(f"  - {node['givenName']} (expired: {node['expiry']})")
```

### 10.3 Log Management

```bash
# Headscale log location:
#   /var/log/headscale/headscale.log

# Log rotation configuration
cat > /etc/logrotate.d/headscale << 'EOF'
/var/log/headscale/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0640 headscale headscale
    sharedscripts
    postrotate
        systemctl reload headscale > /dev/null 2>&1 || true
    endscript
}
EOF

# Query structured logs (JSON format)
jq 'select(.level == "error")' /var/log/headscale/headscale.log
```
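
For log processing inside scripts, the same filter can be expressed in Python. The sketch below mirrors the `jq` query above and assumes, as this section does, that each log line is a JSON object with a `level` field; the function name `filter_log_lines` and the sample records are illustrative.

```python
import json


def filter_log_lines(lines, level="error"):
    """Yield parsed JSON log records whose `level` field equals `level`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if isinstance(record, dict) and record.get("level") == level:
            yield record


if __name__ == "__main__":
    sample = [
        '{"level": "info", "message": "node connected"}',
        '{"level": "error", "message": "database timeout"}',
        "plain text line",
    ]
    for rec in filter_log_lines(sample):
        print(rec["message"])
```

To scan the real log, pass an open file object, for example `filter_log_lines(open("/var/log/headscale/headscale.log"))`.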

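### 10.4 Backup and Restore

#### 10.4.1 Database Backup

```bash
#!/bin/bash
# /opt/scripts/backup-headscale.sh
set -euo pipefail

BACKUP_DIR="/backup/headscale"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30

mkdir -p "${BACKUP_DIR}"

# PostgreSQL backup (custom format, restorable with pg_restore)
pg_dump -h localhost -U headscale -d headscale -F c \
  -f "${BACKUP_DIR}/headscale_${DATE}.dump"

# Configuration and key backup
# (tar fails if a listed file is missing; drop entries your version does not create)
tar -czf "${BACKUP_DIR}/config_${DATE}.tar.gz" \
  /etc/headscale/config.yaml \
  /etc/headscale/acl.json \
  /etc/headscale/derp.json \
  /var/lib/headscale/private.key \
  /var/lib/headscale/noise_private.key

# The archive contains private keys; restrict access to it
chmod 600 "${BACKUP_DIR}/config_${DATE}.tar.gz"

# Remove backups older than the retention period
find "${BACKUP_DIR}" -type f -mtime +${RETENTION_DAYS} -delete

# Upload to S3 (optional)
aws s3 sync "${BACKUP_DIR}/" s3://backup-bucket/headscale/
```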

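#### 10.4.2 Data Restore

```bash
#!/bin/bash
# /opt/scripts/restore-headscale.sh
# Usage: restore-headscale.sh /backup/headscale/headscale_<DATE>.dump
set -euo pipefail

BACKUP_FILE="${1:?usage: $0 <headscale_DATE.dump>}"

# Locate the config archive written by the same backup run:
# headscale_<DATE>.dump pairs with config_<DATE>.tar.gz
backup_name=$(basename "${BACKUP_FILE}" .dump)
CONFIG_FILE="$(dirname "${BACKUP_FILE}")/config_${backup_name#headscale_}.tar.gz"

# Stop the service
systemctl stop headscale

# Restore the database (drop existing objects first)
pg_restore -h localhost -U headscale -d headscale -c --if-exists "${BACKUP_FILE}"

# Restore the configuration
tar -xzf "${CONFIG_FILE}" -C /

# Start the service
systemctl start headscale

# Verify
headscale nodes list
```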
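
A restore needs both files from the same backup run: `headscale_<DATE>.dump` and `config_<DATE>.tar.gz`. The sketch below pairs the two by timestamp and picks the newest complete set, which can serve as a pre-flight check before a restore. The helper names `pair_backups` and `latest_complete` are hypothetical; the file naming follows the backup script in 10.4.1.

```python
import re

# Matches the dump naming used by backup-headscale.sh (date +%Y%m%d_%H%M%S).
DUMP_RE = re.compile(r"^headscale_(\d{8}_\d{6})\.dump$")


def pair_backups(filenames):
    """Map each backup timestamp to (dump, config archive or None)."""
    names = set(filenames)
    pairs = {}
    for name in names:
        match = DUMP_RE.match(name)
        if match:
            stamp = match.group(1)
            config = f"config_{stamp}.tar.gz"
            pairs[stamp] = (name, config if config in names else None)
    return pairs


def latest_complete(filenames):
    """Return the newest (dump, config) pair, or None if no pair is complete."""
    complete = {s: p for s, p in pair_backups(filenames).items() if p[1]}
    # Timestamps in YYYYMMDD_HHMMSS form sort chronologically as strings.
    return complete[max(complete)] if complete else None


if __name__ == "__main__":
    files = [
        "headscale_20251217_020000.dump", "config_20251217_020000.tar.gz",
        "headscale_20251218_020000.dump",  # config archive missing
    ]
    print(latest_complete(files))
```

In practice the file list would come from `os.listdir("/backup/headscale")`.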

### 10.5 Version Upgrade Procedure

```bash
#!/bin/bash
# /opt/scripts/upgrade-headscale.sh
# Usage: upgrade-headscale.sh <version>
set -euo pipefail

NEW_VERSION="${1:?usage: $0 <version>}"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

echo "Upgrading Headscale to version ${NEW_VERSION}"

# 1. Back up the current version
echo "Backing up current configuration and data..."
"${SCRIPT_DIR}/backup-headscale.sh"

# 2. Download the new version
echo "Downloading the new version..."
wget -O /tmp/headscale_new.deb \
  "https://github.com/juanfont/headscale/releases/download/v${NEW_VERSION}/headscale_${NEW_VERSION}_linux_amd64.deb"

# 3. Stop the service
echo "Stopping the Headscale service..."
systemctl stop headscale

# 4. Install the new version
echo "Installing the new version..."
dpkg -i /tmp/headscale_new.deb

# 5. Start the service
# Headscale applies pending database migrations automatically on startup
echo "Starting the service..."
systemctl start headscale

# 6. Verify
echo "Verifying the upgrade..."
sleep 5
headscale version
headscale nodes list | head -5

echo "Upgrade complete!"
```

---

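## 11. Failure Recovery and Disaster Recovery

### 11.1 Failure Scenarios and Recovery Procedures

#### 11.1.1 Headscale Primary Node Failure

**Impact**:
- New nodes cannot join the network
- ACL policies cannot be updated
- Already-connected nodes keep communicating (direct P2P)

**Recovery steps**:

```bash
# 1. Confirm that the primary node has failed
systemctl status headscale
curl -s https://hs.ops.company.com/health

# 2. Switch to the standby node
# Point DNS or the load balancer at the standby node

# 3. If the database is the problem, switch to the replica
# Update the database connection in config.yaml

# 4. Restart the service
systemctl restart headscale

# 5. Verify that the service has recovered
headscale nodes list
```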

#### 11.1.2 PostgreSQL Database Failure

**Recovery steps**:

```bash
# 1. If the primary has failed, promote the replica
# Run on the replica
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/15/main

# 2. Point the Headscale configuration at the new primary
sed -i 's/old_primary_ip/new_primary_ip/' /etc/headscale/config.yaml

# 3. Restart Headscale
systemctl restart headscale

# 4. Rebuild the replica
# Use pg_basebackup to resync from the new primary
```

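#### 11.1.3 DERP Relay Server Failure

**Impact**:
- Nodes that cannot traverse NAT lose relayed connectivity until their clients fail over to another DERP region
- Nodes with direct connections are unaffected

**Recovery steps**:

```bash
# 1. Check the DERP service status
systemctl status derper
curl -s https://derp-bj.ops.company.com/derp/probe

# 2. If it cannot be recovered, remove the node from the DERP map
# Edit /etc/headscale/derp.json and remove the failed node

# 3. Wait for clients to switch to another DERP automatically;
# run netcheck on a client to see which DERP regions it can reach
tailscale netcheck
```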
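
Step 2 above can be scripted. The sketch below removes a failed relay from a DERP map and drops any region left without nodes. It assumes `/etc/headscale/derp.json` follows Tailscale's DERPMap JSON layout (`Regions` keyed by region ID, each with a `Nodes` list whose entries carry a `HostName`); the function name `remove_derp_node` and the sample map are illustrative, so verify the layout against the actual file before using anything like this.

```python
import copy


def remove_derp_node(derp_map, hostname):
    """Return a copy of derp_map without the node named `hostname`.

    Regions that end up with no nodes are removed as well.
    """
    result = copy.deepcopy(derp_map)
    regions = result.get("Regions", {})
    for region_id in list(regions):
        nodes = [n for n in regions[region_id].get("Nodes", [])
                 if n.get("HostName") != hostname]
        if nodes:
            regions[region_id]["Nodes"] = nodes
        else:
            del regions[region_id]
    return result


if __name__ == "__main__":
    # Illustrative map; region IDs and node names are made up.
    derp_map = {"Regions": {
        "900": {"RegionID": 900, "RegionCode": "bj",
                "Nodes": [{"Name": "900a", "HostName": "derp-bj.ops.company.com"}]},
        "901": {"RegionID": 901, "RegionCode": "sh",
                "Nodes": [{"Name": "901a", "HostName": "derp-sh.ops.company.com"}]},
    }}
    print(sorted(remove_derp_node(derp_map, "derp-bj.ops.company.com")["Regions"]))
```

The function returns a modified copy, so the original map can be kept for restoring the relay once it is healthy again.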

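#### 11.1.4 Full Disaster Recovery

```bash
# 1. Provision a new server (with PostgreSQL and an empty headscale database)

# 2. Install Headscale
dpkg -i headscale_latest.deb

# 3. Restore the database from backup
pg_restore -h localhost -U headscale -d headscale /backup/latest.dump

# 4. Restore the configuration files and keys
tar -xzf /backup/config_latest.tar.gz -C /

# 5. Start the service
systemctl start headscale

# 6. Update DNS to point to the new server

# 7. Verify that all nodes reconnect
watch 'headscale nodes list | grep -ci online'
```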

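### 11.2 RTO and RPO Targets

| Scenario | RTO (recovery time objective) | RPO (recovery point objective) |
|------|-------------------|---------------------|
| Headscale single-node failure | < 5 minutes | 0 (hot standby takes over) |
| Database failure | < 15 minutes | < 1 minute (synchronous replication) |
| DERP failure | Automatic failover | N/A |
| Full disaster | < 2 hours | < 24 hours |

### 11.3 Regular Drills

Run a failure drill once per quarter:

1. **Drill scope**:
   - Primary/standby switchover
   - Database failover
   - Restore from backup
   - ACL policy rollback

2. **Drill records**:
   - Drill time and participants
   - Actual recovery time
   - Issues found and improvement actions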
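
Drill records are most useful when the measured recovery time is compared against the RTO targets in 11.2. A minimal sketch of that comparison follows; the scenario keys, the function name `evaluate_drill`, and the measured values in the example are illustrative, while the targets are taken from the table above.

```python
from datetime import timedelta

# RTO targets from section 11.2 (each target is a strict upper bound).
RTO_TARGETS = {
    "headscale_single_node_failure": timedelta(minutes=5),
    "database_failure": timedelta(minutes=15),
    "full_disaster": timedelta(hours=2),
}


def evaluate_drill(measured, targets=RTO_TARGETS):
    """Return {scenario: 'pass' | 'fail' | 'not tested'} for one drill."""
    report = {}
    for scenario, target in targets.items():
        actual = measured.get(scenario)
        if actual is None:
            report[scenario] = "not tested"
        elif actual < target:
            report[scenario] = "pass"
        else:
            report[scenario] = "fail"
    return report


if __name__ == "__main__":
    # Made-up measurements for illustration.
    measured = {
        "headscale_single_node_failure": timedelta(minutes=3, seconds=40),
        "database_failure": timedelta(minutes=18),
    }
    for scenario, result in evaluate_drill(measured).items():
        print(f"{scenario}: {result}")
```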

---

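## 12. Implementation Plan and Milestones

### 12.1 Implementation Phases

#### Phase 1: Infrastructure Preparation

| Task | Owner | Prerequisite | Deliverable |
|------|--------|---------|--------|
| Server resource request | Ops | Budget approval | Server inventory |
| Domain and certificate preparation | Ops | Domain purchased | SSL certificates |
| PostgreSQL high-availability deployment | DBA | Servers ready | Database cluster |
| Network plan confirmation | Network team | - | IP plan document |

#### Phase 2: Core Service Deployment

| Task | Owner | Prerequisite | Deliverable |
|------|--------|---------|--------|
| Headscale primary node deployment | Ops | PostgreSQL ready | Service running |
| Headscale standby node configuration | Ops | Primary node ready | Switchover test |
| DERP relay server deployment | Ops | Servers ready | Multi-region DERP |
| ACL policy configuration | Security team | Service running | ACL file |
| Monitoring and alerting deployment | Ops | Service running | Grafana dashboards |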

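#### Phase 3: Node Onboarding

| Task | Owner | Prerequisite | Deliverable |
|------|--------|---------|--------|
| Test environment onboarding | Ops | Service ready | Test nodes online |
| Pre-production environment onboarding | Ops | Tests passed | Pre-production nodes online |
| Production onboarding (batch 1) | Ops | Pre-production validated | First batch of production nodes |
| Production onboarding (batches 2-N) | Ops | Batch 1 successful | All production nodes |
| Ops staff device onboarding | Ops | Production stable | Ops devices online |
| Developer device onboarding | Dev team lead | Validated by Ops | Developer devices online |

#### Phase 4: Acceptance and Handover

| Task | Owner | Prerequisite | Deliverable |
|------|--------|---------|--------|
| Functional acceptance testing | QA | All nodes onboarded | Acceptance report |
| Performance stress testing | Performance team | Functional acceptance | Performance report |
| Failure drill | Ops | Acceptance passed | Drill records |
| Documentation delivery | Ops | Drill passed | Operations manual |
| Training and handover | Ops | Documentation complete | Training records |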

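### 12.2 Milestones

```
┌──────────────────────────────────────────────────────────────┐
│                   Implementation Timeline                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  M1: Infrastructure ready                                    │
│  ├── PostgreSQL HA deployed                                  │
│  ├── Domains and certificates ready                          │
│  └── Network plan confirmed                                  │
│                                                              │
│  M2: Core services live                                      │
│  ├── Headscale primary and standby running                   │
│  ├── Multi-region DERP deployed                              │
│  ├── Monitoring and alerting ready                           │
│  └── ACL policy configured                                   │
│                                                              │
│  M3: Test validation complete                                │
│  ├── Test environment fully onboarded                        │
│  ├── Pre-production environment onboarded                    │
│  └── Functional acceptance passed                            │
│                                                              │
│  M4: Production migration complete                           │
│  ├── All production servers onboarded                        │
│  ├── Legacy VPN decommissioned                               │
│  └── Ops devices onboarded                                   │
│                                                              │
│  M5: Project acceptance                                      │
│  ├── Failure drill passed                                    │
│  ├── Training and handover complete                          │
│  └── Project formally closed                                 │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```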

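### 12.3 Acceptance Criteria

| Item | Criterion | Verification method |
|--------|---------|---------|
| Network connectivity | Any two nodes can reach each other | ping/traceroute tests |
| Connection latency | Same-region P2P < 10ms | Tailscale ping |
| Service availability | 99.9% availability | Monitoring data |
| ACL enforcement | Policies match the design | Security scan |
| Failure recovery | RTO below the target time | Failure drill |
| Performance | Supports 1000+ nodes | Stress test |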
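
The 99.9% availability criterion translates into a concrete downtime budget, which makes it easier to check against monitoring data. The arithmetic is sketched below; the 30-day measurement window and the function names are assumptions chosen for illustration, since this document does not fix the measurement period.

```python
def downtime_budget_minutes(availability_pct, period_days=30):
    """Allowed downtime in minutes for a given availability target."""
    return (1 - availability_pct / 100) * period_days * 24 * 60


def measured_availability_pct(downtime_minutes, period_days=30):
    """Availability in percent, given the observed downtime in the period."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


if __name__ == "__main__":
    print(f"99.9% over 30 days allows {downtime_budget_minutes(99.9):.1f} minutes of downtime")
    print(f"60 minutes of downtime gives {measured_availability_pct(60):.3f}% availability")
```

Under these assumptions, a single 60-minute outage in a 30-day window already exceeds the budget of roughly 43 minutes.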

---

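## 13. Risk Assessment and Mitigation

### 13.1 Risk Matrix

| Risk | Likelihood | Impact | Risk level | Mitigation |
|--------|-------|-----|---------|---------|
| Unstable Headscale release | Medium | High | High | Thorough testing; prepare a rollback plan |
| High NAT traversal failure rate | Medium | Medium | Medium | Deploy multi-region DERP |
| Key leakage | Low | Critical | High | Key management; regular rotation |
| Performance bottleneck | Medium | Medium | Medium | Monitoring and early warning; capacity planning |
| Insufficient ops skills | Medium | Medium | Medium | Training; complete documentation |
| Conflict with existing systems | Low | Medium | Low | Thorough testing; phased rollout |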
### 13.2 Rollback Plan

#### 13.2.1 Server Rollback

```bash
# 1. Stop the new version
systemctl stop headscale

# 2. Reinstall the old version
dpkg -i /backup/headscale_old.deb

# 3. Restore the configuration
cp /backup/config_old.yaml /etc/headscale/config.yaml

# 4. Roll back the database if needed
pg_restore -h localhost -U headscale -d headscale -c /backup/db_old.dump

# 5. Start the service
systemctl start headscale
```

#### 13.2.2 Client Rollback

```bash
# Disconnect from Headscale
tailscale down

# Restore the previous VPN configuration
# (follow the procedure for the previous VPN solution)
```

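### 13.3 Emergency Contacts

| Role | Name | Contact | Responsibility |
|------|-----|---------|-----|
| Project owner | xxx | 138xxxxxxxx | Decisions, coordination |
| Technical lead | xxx | 139xxxxxxxx | Technical design |
| Ops lead | xxx | 137xxxxxxxx | Deployment and rollout |
| DBA | xxx | 136xxxxxxxx | Database operations |
| Security lead | xxx | 135xxxxxxxx | Security review |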

---

## 14. Appendix

### 14.1 Glossary

| Term | Definition |
|------|-----|
| Headscale | Open-source, self-hosted control server for Tailscale |
| Tailscale | Zero-configuration VPN solution built on WireGuard |
| WireGuard | Modern VPN protocol |
| DERP | Designated Encrypted Relay for Packets, an encrypted relay protocol |
| MagicDNS | Tailscale's automatic DNS service |
| ACL | Access Control List |
| PreAuth Key | Pre-authentication key for non-interactive onboarding |
| Mesh Network | Network in which nodes can communicate directly with each other |
| NAT Traversal | Technique for establishing connections across NAT |
| STUN | Session Traversal Utilities for NAT |

### 14.2 References

- [Headscale documentation](https://headscale.net/)
- [Tailscale documentation](https://tailscale.com/docs/)
- [WireGuard website](https://www.wireguard.com/)
- [Headscale GitHub](https://github.com/juanfont/headscale)

### 14.3 Command Quick Reference

```bash
# Headscale service management
systemctl start|stop|restart|status headscale

# User management
headscale users list
headscale users create <name>
headscale users destroy <name>

# Node management
headscale nodes list
headscale nodes delete -i <id>
headscale nodes expire -i <id>
headscale nodes rename -i <id> <new_name>
headscale nodes tag -i <id> -t <tags>

# Pre-authentication keys
headscale preauthkeys create --user <user> --expiration 24h
headscale preauthkeys list --user <user>

# Route management
headscale routes list
headscale routes enable -r <route_id>

# API keys
headscale apikeys create --expiration 90d
headscale apikeys list

# Tailscale client
tailscale up --login-server <url>
tailscale down
tailscale status
tailscale ip
tailscale ping <peer>
tailscale netcheck
```

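### 14.4 Configuration Templates

Configuration template files are located at:
- `/opt/templates/headscale/config.yaml.tmpl`
- `/opt/templates/headscale/acl.json.tmpl`
- `/opt/templates/derp/docker-compose.yml.tmpl`

### 14.5 Change Log

| Version | Date | Change | Author |
|------|-----|---------|--------|
| v1.0 | 2025-12-15 | Initial draft | xxx |
| v2.0 | 2025-12-18 | Detailed design completed | AI Assistant |

---

> **Document maintenance**: Keep this document up to date as the project progresses, and record every major change in the change log.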