OPS Unified Management Plan - Headscale Networking Implementation Plan

Task ID: 4448 Version: v2.0 Last updated: 2025-12-18 Document status: Detailed design

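1. Project Background and Goals

1.1 Project Background

As the business grows, the operations team faces the following challenges:

Multi-cloud, multi-region footprint: servers are spread across Alibaba Cloud, Tencent Cloud, AWS, and several physical data centers
Complex network isolation: isolating the production, testing, and development environments from one another is hard to manage
Difficult VPN management: traditional VPN solutions (OpenVPN, IPSec) are complex to configure and costly to maintain
Secure access requirements: internal services must be reachable securely and conveniently while meeting compliance requirements
Low operational efficiency: cross-network operations are tedious and have no single entry point

1.2 Project Goals

Dimension	Goal	Acceptance criteria
Connectivity	Direct P2P connections between all nodes	Latency between any two nodes < 50 ms (same region)
Security	Zero-trust network architecture	All traffic encrypted, identity-based authentication
Usability	One-step onboarding to the internal network	Client installation and configuration < 5 minutes
Scalability	Fast capacity expansion	New node onboarded < 10 minutes
High availability	Highly available control plane	SLA 99.9%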
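
To make the 99.9% SLA target concrete, the short sketch below converts an availability percentage into a downtime budget. The function name and the 30-day and 365-day periods are illustrative assumptions, not part of the plan itself.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Return the downtime budget, in minutes, for an SLA over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Assumed reporting periods: a 30-day month and a 365-day year.
monthly = allowed_downtime_minutes(99.9, 30)
yearly = allowed_downtime_minutes(99.9, 365)
print(f"30-day budget: {monthly:.1f} minutes")
print(f"365-day budget: {yearly / 60:.2f} hours")
```

At 99.9%, the control plane may be unavailable for roughly 43 minutes in a 30-day month, which is a useful bound when planning failover and maintenance windows.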

1.3 Scope

All production servers
Testing and staging servers
Operations and developer workstations
CI/CD build nodes
Databases, caches, and other infrastructure

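2. Technical Overview

2.1 Why Headscale

Solution	Pros	Cons	Best suited for
Headscale	Open source and self-hosted, WireGuard-based, direct P2P connections, lightweight	Relatively young ecosystem	Teams that need full control
Tailscale	Mature commercial support	Coordination data leaves the country, higher cost	Small teams getting started quickly
OpenVPN	Mature and stable	Complex configuration, lower performance	Traditional enterprises
ZeroTier	Easy to use	Free tier is heavily limited	Small deployments

Core reasons for choosing Headscale:

Data sovereignty: all coordination data is stored on our own servers
Predictable cost: fully open source, no subscription fees
WireGuard advantages: modern cryptography, low latency, high performance
Mesh networking: nodes talk to each other directly; traffic is relayed only when a direct path cannot be established
Tailscale client compatibility: the mature Tailscale clients can be used as-is

2.2 Technical Architecture Diagram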

                              ┌─────────────────────────────────────────────────────────┐
                              │                    Internet                              │
                              └──────────────────────────┬──────────────────────────────┘
                                                         │
                              ┌──────────────────────────┴──────────────────────────────┐
                              │                                                          │
                    ┌─────────▼─────────┐                               ┌────────────────▼────────────────┐
                    │   Headscale HA    │                               │        DERP Relay Servers       │
                    │   Control Plane   │                               │     (Beijing/Shanghai/HK)       │
                    │                   │                               │                                 │
                    │ ┌───────────────┐ │                               │  ┌─────────┐  ┌─────────┐      │
                    │ │  Headscale    │ │                               │  │ DERP-BJ │  │ DERP-SH │      │
                    │ │  Primary      │ │                               │  └─────────┘  └─────────┘      │
                    │ └───────────────┘ │                               │       ┌─────────┐              │
                    │ ┌───────────────┐ │                               │       │ DERP-HK │              │
                    │ │  PostgreSQL   │ │                               │       └─────────┘              │
                    │ │  (HA)         │ │                               └─────────────────────────────────┘
                    │ └───────────────┘ │
                    └─────────┬─────────┘
                              │ Coordination
                              │
        ┌─────────────────────┼─────────────────────┬─────────────────────┐
        │                     │                     │                     │
        ▼                     ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  Production   │     │   Staging     │     │  Development  │     │   Operator    │
│   Servers     │     │   Servers     │     │   Servers     │     │   Devices     │
│               │     │               │     │               │     │               │
│ ┌───────────┐ │     │ ┌───────────┐ │     │ ┌───────────┐ │     │ ┌───────────┐ │
│ │ Tailscale │ │◄───►│ │ Tailscale │ │◄───►│ │ Tailscale │ │◄───►│ │ Tailscale │ │
│ │  Agent    │ │ P2P │ │  Agent    │ │ P2P │ │  Agent    │ │ P2P │ │  Client   │ │
│ └───────────┘ │     │ └───────────┘ │     │ └───────────┘ │     │ └───────────┘ │
└───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘
     100.65.x.x            100.66.x.x            100.68.x.x            100.70.x.x

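2.3 Core Components

Component	Function	Deployment location	High-availability strategy
Headscale Server	Coordination service, key distribution, ACL management	Cloud host	Primary/standby + PostgreSQL HA
DERP Relay	Relay service when NAT traversal fails	Multiple regions	Redundant nodes
Tailscale Client	Client agent	All nodes	Starts on boot
Admin UI	Web management interface	Same host as Headscale	-

3. Network Architecture Design

3.1 IP Address Plan

The CGNAT range 100.64.0.0/10 is used and divided by environment and purpose: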

100.64.0.0/10 (total address space: 4,194,304 addresses)
│
├── 100.64.0.0/16    - Reserved (management)
│   ├── 100.64.0.0/24    - Headscale control plane
│   ├── 100.64.1.0/24    - DERP relay servers
│   └── 100.64.2.0/24    - Monitoring infrastructure
│
├── 100.65.0.0/16    - Production
│   ├── 100.65.1.0/24    - Web servers
│   ├── 100.65.2.0/24    - API servers
│   ├── 100.65.3.0/24    - Database servers
│   ├── 100.65.4.0/24    - Cache servers
│   ├── 100.65.5.0/24    - Message queue servers
│   ├── 100.65.10.0/24   - Kubernetes masters
│   ├── 100.65.12.0/23   - Kubernetes workers (a /23 must start on an even third octet)
│   └── 100.65.100.0/24  - Production bastion hosts
│
├── 100.66.0.0/16    - Staging
│   ├── 100.66.1.0/24    - Application servers
│   ├── 100.66.2.0/24    - Database servers
│   └── 100.66.10.0/24   - Kubernetes cluster
│
├── 100.67.0.0/16    - Testing
│   ├── 100.67.1.0/24    - Application servers
│   ├── 100.67.2.0/24    - Database servers
│   └── 100.67.100.0/24  - CI/CD build nodes
│
├── 100.68.0.0/16    - Development
│   ├── 100.68.1.0/24    - Development servers
│   └── 100.68.2.0/24    - Development databases
│
├── 100.70.0.0/16    - Operator devices
│   ├── 100.70.1.0/24    - Senior operations
│   ├── 100.70.2.0/24    - Operations
│   └── 100.70.10.0/24   - On-call staff
│
├── 100.71.0.0/16    - Developer devices
│   ├── 100.71.1.0/24    - Backend developers
│   ├── 100.71.2.0/24    - Frontend developers
│   └── 100.71.3.0/24    - Mobile developers
│
└── 100.80.0.0/16    - External partners
    └── 100.80.1.0/24    - Third-party vendors
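
An address plan like this can be checked mechanically before it is rolled out. The sketch below uses Python's standard `ipaddress` module to flag prefixes that are misaligned, fall outside 100.64.0.0/10, or overlap each other. The `check_plan` helper is hypothetical and takes only sibling prefixes (for example the /24 and /23 leaves), since a parent /16 naturally overlaps its children.

```python
import ipaddress

# The overall CGNAT range used by the plan.
TAILNET = ipaddress.ip_network("100.64.0.0/10")


def check_plan(prefixes):
    """Return a list of human-readable problems found in sibling prefixes."""
    problems = []
    nets = []
    for text in prefixes:
        try:
            # strict=True rejects prefixes with host bits set (misaligned).
            net = ipaddress.ip_network(text, strict=True)
        except ValueError as exc:
            problems.append(f"{text}: {exc}")
            continue
        if not net.subnet_of(TAILNET):
            problems.append(f"{text}: outside {TAILNET}")
        nets.append(net)
    # Pairwise overlap check between the prefixes that parsed.
    for i, a in enumerate(nets):
        for b in nets[i + 1:]:
            if a.overlaps(b):
                problems.append(f"{a} overlaps {b}")
    return problems


print(TAILNET.num_addresses)
print(check_plan(["100.65.10.0/24", "100.65.12.0/23", "100.70.1.0/24"]))
print(check_plan(["100.65.11.0/23"]))
```

A /23 that starts on an odd third octet, such as 100.65.11.0/23, is rejected because its network address has host bits set.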

3.2 Namespace Design

Headscale uses users (formerly called namespaces) for logical isolation:

User	Purpose	IP range	Administrator
`infra`	Infrastructure services	100.64.0.0/16	ops-admin
`prod`	Production servers	100.65.0.0/16	ops-admin
`staging`	Staging environment	100.66.0.0/16	ops-admin
`testing`	Testing environment	100.67.0.0/16	qa-admin
`dev`	Development environment	100.68.0.0/16	dev-admin
`ops-team`	Operator devices	100.70.0.0/16	ops-admin
`dev-team`	Developer devices	100.71.0.0/16	dev-admin
`partners`	External partners	100.80.0.0/16	ops-admin

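3.3 Node Naming Convention

<env>-<role>-<region>-<seq>

Examples:
- prod-web-bj-001      Production web server #1 in Beijing
- prod-db-sh-001       Production database #1 in Shanghai
- staging-api-bj-001   Staging API server #1 in Beijing
- ops-laptop-zhangsan  Laptop of operations engineer Zhang San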
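
A naming convention is easier to enforce when provisioning scripts validate it. The following is a minimal sketch, assuming server names use one of the four environment prefixes, a two-letter region code, and a three-digit sequence number. The pattern and the `parse_server_name` helper are illustrative only. Personal devices such as `ops-laptop-zhangsan` follow a looser form and are deliberately not matched.

```python
import re

# <env>-<role>-<region>-<seq>, e.g. prod-web-bj-001 (assumed field formats).
SERVER_NAME = re.compile(
    r"^(?P<env>prod|staging|testing|dev)"
    r"-(?P<role>[a-z0-9]+)"
    r"-(?P<region>[a-z]{2})"
    r"-(?P<seq>\d{3})$"
)


def parse_server_name(name):
    """Return the name's fields as a dict, or None if it does not conform."""
    match = SERVER_NAME.match(name)
    return match.groupdict() if match else None


print(parse_server_name("prod-web-bj-001"))
print(parse_server_name("ops-laptop-zhangsan"))
```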

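3.4 DERP Relay Network

Self-hosted DERP servers provide a reliable relay path when NAT traversal fails:

Node	Region	Public IP	Ports	Notes
derp-bj-01	Beijing	x.x.x.x	443/3478	Alibaba Cloud, primary
derp-sh-01	Shanghai	x.x.x.x	443/3478	Tencent Cloud, secondary
derp-hk-01	Hong Kong	x.x.x.x	443/3478	AWS, overseas
derp-sg-01	Singapore	x.x.x.x	443/3478	Southeast Asia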

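4. Infrastructure Planning

4.1 Server Resource Planning

4.1.1 Headscale Control Plane

Component	Specification	Count	Notes
Headscale Primary	4 vCPU, 8 GB RAM, 100 GB SSD	1	Primary control node
Headscale Standby	4 vCPU, 8 GB RAM, 100 GB SSD	1	Hot standby node
PostgreSQL Primary	4 vCPU, 16 GB RAM, 500 GB SSD	1	Database primary
PostgreSQL Replica	4 vCPU, 16 GB RAM, 500 GB SSD	1	Database replica
Admin UI	2 vCPU, 4 GB RAM, 50 GB SSD	1	Management interface

4.1.2 DERP Relay Servers

Region	Specification	Bandwidth	Count
Beijing	2 vCPU, 4 GB RAM, 50 GB disk	100 Mbps	1
Shanghai	2 vCPU, 4 GB RAM, 50 GB disk	100 Mbps	1
Hong Kong	2 vCPU, 4 GB RAM, 50 GB disk	100 Mbps	1
Singapore	2 vCPU, 4 GB RAM, 50 GB disk	100 Mbps	1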

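4.2 Network Requirements

4.2.1 Headscale Server Ports

Port	Protocol	Purpose	Source
443	TCP	HTTPS API & gRPC	All clients
80	TCP	HTTP redirect	All clients
50443	TCP	Management API (optional)	Management network

4.2.2 DERP Server Ports

Port	Protocol	Purpose	Source
443	TCP	HTTPS DERP	All clients
3478	UDP	STUN	All clients
80	TCP	HTTP redirect	All clients

4.2.3 Tailscale Client Ports

Port	Protocol	Purpose	Direction
41641	UDP	WireGuard direct connections	Inbound/outbound
443	TCP	DERP relay	Outbound
3478	UDP	STUN	Outbound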

4.3 Domain and Certificate Plan

Domain	Purpose	Certificate type
hs.ops.company.com	Headscale API	Let's Encrypt wildcard
admin.hs.ops.company.com	Management interface	Let's Encrypt
derp-bj.ops.company.com	Beijing DERP	Let's Encrypt
derp-sh.ops.company.com	Shanghai DERP	Let's Encrypt
derp-hk.ops.company.com	Hong Kong DERP	Let's Encrypt

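5. Headscale Server Deployment

5.1 System Preparation

# OS: Ubuntu 22.04 LTS / Rocky Linux 9 (the commands below use apt; use the dnf equivalents on Rocky Linux)
# Set the time zone
timedatectl set-timezone Asia/Shanghai

# Update the system
apt update && apt upgrade -y

# Install required tools
apt install -y curl wget vim htop net-tools jq unzip

# Disable swap (for containerized deployments)
swapoff -a
sed -i '/swap/d' /etc/fstab

# Set kernel parameters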
cat >> /etc/sysctl.conf << EOF
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1
net.core.rmem_max = 2500000
net.core.wmem_max = 2500000
EOF
sysctl -p

# Raise file descriptor limits
cat >> /etc/security/limits.conf << EOF
* soft nofile 65535
* hard nofile 65535
root soft nofile 65535
root hard nofile 65535
EOF

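5.2 PostgreSQL High-Availability Deployment

5.2.1 PostgreSQL Primary Installation

# Install PostgreSQL 15 (Ubuntu 22.04 ships PostgreSQL 14 by default, so this assumes the PGDG apt repository is configured)
apt install -y postgresql-15 postgresql-contrib-15

# Configure PostgreSQL through a drop-in file so the packaged postgresql.conf
# (data_directory, hba_file, include_dir, ...) is not overwritten
cat > /etc/postgresql/15/main/conf.d/headscale.conf << 'EOF'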
listen_addresses = '*'
port = 5432
max_connections = 200
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 10MB
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
max_parallel_maintenance_workers = 2

# Replication settings
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
hot_standby = on
EOF

# Configure access control
cat > /etc/postgresql/15/main/pg_hba.conf << 'EOF'
local   all             postgres                                peer
local   all             all                                     peer
host    all             all             127.0.0.1/32            scram-sha-256
host    all             all             ::1/128                 scram-sha-256
host    replication     replicator      <standby_ip>/32         scram-sha-256
host    headscale       headscale       <headscale_ip>/32       scram-sha-256
host    headscale       headscale       <headscale_standby_ip>/32 scram-sha-256
EOF

# Create the database and users
sudo -u postgres psql << 'EOF'
CREATE USER headscale WITH PASSWORD 'your_secure_password_here';
CREATE DATABASE headscale OWNER headscale;
GRANT ALL PRIVILEGES ON DATABASE headscale TO headscale;

CREATE USER replicator WITH REPLICATION PASSWORD 'replicator_password';
EOF

systemctl restart postgresql
systemctl enable postgresql

5.2.2 PostgreSQL Replica Configuration

# Stop PostgreSQL
systemctl stop postgresql

# Empty the data directory
rm -rf /var/lib/postgresql/15/main/*

# Copy the data from the primary (-R writes the standby configuration)
sudo -u postgres pg_basebackup -h <primary_ip> -U replicator -p 5432 \
  -D /var/lib/postgresql/15/main -Fp -Xs -P -R

# Start the replica
systemctl start postgresql

5.3 Headscale Installation and Configuration

5.3.1 Binary Installation

# Download a release (0.23.0 is used as an example)
HEADSCALE_VERSION="0.23.0"
wget -O /tmp/headscale.deb \
  "https://github.com/juanfont/headscale/releases/download/v${HEADSCALE_VERSION}/headscale_${HEADSCALE_VERSION}_linux_amd64.deb"

# Install
dpkg -i /tmp/headscale.deb

# Or use Docker
docker pull headscale/headscale:0.23.0

5.3.2 Headscale Configuration File

# /etc/headscale/config.yaml
---
server_url: https://hs.ops.company.com:443
# Nginx (section 5.5) terminates TLS on 443 and proxies to this local address
listen_addr: 127.0.0.1:8080
metrics_listen_addr: 127.0.0.1:9090
grpc_listen_addr: 0.0.0.0:50443
grpc_allow_insecure: false

# Private key paths
private_key_path: /var/lib/headscale/private.key
noise:
  private_key_path: /var/lib/headscale/noise_private.key

# IP address prefixes
prefixes:
  v4: 100.64.0.0/10
  v6: fd7a:115c:a1e0::/48
  allocation: sequential

# Database configuration (PostgreSQL)
database:
  type: postgres
  postgres:
    host: <postgresql_host>
    port: 5432
    name: headscale
    user: headscale
    pass: your_secure_password_here
    max_open_conns: 100
    max_idle_conns: 10
    conn_max_idle_time_secs: 3600
    ssl: disable  # enabling "require" is recommended in production

# DERP configuration
derp:
  server:
    enabled: false  # use the standalone DERP servers instead
    region_id: 999
    region_code: "headscale"
    region_name: "Headscale Embedded DERP"
    stun_listen_addr: "0.0.0.0:3478"
  urls:
    - https://hs.ops.company.com/derp.json
  paths: []
  auto_update_enabled: true
  update_frequency: 24h

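# Tailscale's default DERP map is excluded by not listing its URL under derp.urls.
# The setting below only disables Headscale's own update check.
disable_check_updates: true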
ephemeral_node_inactivity_timeout: 30m

# Node update check interval
node_update_check_interval: 10s

# DNS configuration
dns:
  magic_dns: true
  base_domain: ts.company.local
  nameservers:
    global:
      - 10.0.0.1  # internal DNS
      - 223.5.5.5  # Alibaba public DNS (fallback)
  search_domains:
    - company.local
  extra_records:
    - name: "grafana.ts.company.local"
      type: "A"
      value: "100.64.0.10"
    - name: "prometheus.ts.company.local"
      type: "A"
      value: "100.64.0.11"

# Unix socket configuration
unix_socket: /var/run/headscale/headscale.sock
unix_socket_permission: "0770"

# TLS configuration (leave empty when running behind a reverse proxy)
tls_cert_path: ""
tls_key_path: ""

# Logging
log:
  format: json
  level: info

# ACL policy
policy:
  mode: file
  path: /etc/headscale/acl.json

# OIDC configuration (optional)
oidc:
  only_start_if_oidc_is_available: true
  issuer: "https://sso.company.com/realms/ops"
  client_id: "headscale"
  client_secret: "your_oidc_client_secret"
  scope: ["openid", "profile", "email"]
  extra_params:
    domain_hint: company.com
  strip_email_domain: true
  allowed_users: []
  allowed_groups:
    - "/ops-team"
    - "/dev-team"

5.3.3 Create the systemd Service

# /etc/systemd/system/headscale.service
[Unit]
Description=headscale coordination server
Documentation=https://github.com/juanfont/headscale
# PostgreSQL runs on a separate host (section 4.1.1), so the unit does not depend on a local postgresql.service
After=network-online.target
Wants=network-online.target

[Service]
User=headscale
Group=headscale
Type=simple
Restart=always
RestartSec=5
ExecStart=/usr/bin/headscale serve
Environment="GIN_MODE=release"

# Resource limits
LimitNOFILE=65535
LimitNPROC=65535

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/headscale /var/run/headscale

[Install]
WantedBy=multi-user.target

5.3.4 Start the Service

# Create the user and directories
useradd -r -s /bin/false headscale
mkdir -p /var/lib/headscale /var/run/headscale /etc/headscale
chown -R headscale:headscale /var/lib/headscale /var/run/headscale

# Start the service
systemctl daemon-reload
systemctl enable headscale
systemctl start headscale

# Verify the service status
systemctl status headscale
headscale version

5.4 DERP Relay Server Deployment

5.4.1 DERP Server Setup

# Install Go (only needed when building from source)
wget https://go.dev/dl/go1.21.5.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.21.5.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin

# Install derper
go install tailscale.com/cmd/derper@latest

# Or use Docker
docker pull ghcr.io/tailscale/derper:latest

5.4.2 DERP Docker Compose Deployment

# /opt/derper/docker-compose.yml
version: '3.8'
services:
  derper:
    image: ghcr.io/tailscale/derper:latest
    container_name: derper
    restart: always
    ports:
      - "443:443"
      - "80:80"
      - "3478:3478/udp"
    volumes:
      - ./certs:/etc/derper/certs:ro
      - ./config:/etc/derper/config:ro
    command:
      - --hostname=derp-bj.ops.company.com
      - --certmode=manual
      - --certdir=/etc/derper/certs
      - --stun
      - --stun-port=3478
      - --verify-clients=true
      - --verify-client-url=https://hs.ops.company.com/verify
    environment:
      - DERP_VERIFY_CLIENTS=true
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "3"

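5.4.3 DERP Map Configuration

Configure the DERP map on the Headscale server in /etc/headscale/derp.json. JSON does not allow comments, so the file must start directly with the opening brace:
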
{
  "Regions": {
    "900": {
      "RegionID": 900,
      "RegionCode": "bj",
      "RegionName": "Beijing",
      "Avoid": false,
      "Nodes": [
        {
          "Name": "bj1",
          "RegionID": 900,
          "HostName": "derp-bj.ops.company.com",
          "DERPPort": 443,
          "STUNPort": 3478,
          "InsecureForTests": false
        }
      ]
    },
    "901": {
      "RegionID": 901,
      "RegionCode": "sh",
      "RegionName": "Shanghai",
      "Avoid": false,
      "Nodes": [
        {
          "Name": "sh1",
          "RegionID": 901,
          "HostName": "derp-sh.ops.company.com",
          "DERPPort": 443,
          "STUNPort": 3478,
          "InsecureForTests": false
        }
      ]
    },
    "902": {
      "RegionID": 902,
      "RegionCode": "hk",
      "RegionName": "Hong Kong",
      "Avoid": false,
      "Nodes": [
        {
          "Name": "hk1",
          "RegionID": 902,
          "HostName": "derp-hk.ops.company.com",
          "DERPPort": 443,
          "STUNPort": 3478,
          "InsecureForTests": false
        }
      ]
    }
  }
}
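
Because every region entry in the DERP map repeats the same structure, the file can be generated rather than edited by hand. The sketch below builds the same shape from a short list of regions. The `build_derp_map` helper and the one-node-per-region layout are assumptions for illustration; the field names are the ones used in the map above.

```python
import json


def build_derp_map(regions):
    """Build a DERP map dict from (region_id, code, name, hostname) tuples."""
    derp_map = {"Regions": {}}
    for region_id, code, name, hostname in regions:
        derp_map["Regions"][str(region_id)] = {
            "RegionID": region_id,
            "RegionCode": code,
            "RegionName": name,
            "Avoid": False,
            "Nodes": [
                {
                    "Name": f"{code}1",  # assumed: one node per region
                    "RegionID": region_id,
                    "HostName": hostname,
                    "DERPPort": 443,
                    "STUNPort": 3478,
                    "InsecureForTests": False,
                }
            ],
        }
    return derp_map


derp_map = build_derp_map([
    (900, "bj", "Beijing", "derp-bj.ops.company.com"),
    (901, "sh", "Shanghai", "derp-sh.ops.company.com"),
    (902, "hk", "Hong Kong", "derp-hk.ops.company.com"),
])
print(json.dumps(derp_map, indent=2))
```

Adding the Singapore relay from section 3.4 would then be a one-line change to the input list.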

5.5 Nginx Reverse Proxy Configuration

# /etc/nginx/sites-available/headscale
upstream headscale {
    server 127.0.0.1:8080;
    keepalive 32;
}

server {
    listen 80;
    server_name hs.ops.company.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name hs.ops.company.com;

    # SSL configuration
    ssl_certificate /etc/letsencrypt/live/hs.ops.company.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/hs.ops.company.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers off;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 1d;
    ssl_session_tickets off;
    ssl_stapling on;
    ssl_stapling_verify on;

    # Security headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Frame-Options DENY always;
    add_header X-Content-Type-Options nosniff always;

    location / {
        proxy_pass http://headscale;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_buffering off;
        proxy_read_timeout 86400s;
        proxy_send_timeout 86400s;
    }

    # gRPC support
    location /headscale.v1.HeadscaleService/ {
        grpc_pass grpc://127.0.0.1:50443;
        grpc_set_header Host $host;
        grpc_set_header X-Real-IP $remote_addr;
    }

    # Health check
    location /health {
        proxy_pass http://headscale/health;
        access_log off;
    }

    # Metrics (internal networks only)
    location /metrics {
        allow 10.0.0.0/8;
        allow 172.16.0.0/12;
        allow 192.168.0.0/16;
        allow 100.64.0.0/10;
        deny all;
        proxy_pass http://127.0.0.1:9090/metrics;
    }
}

5.6 Management Interface Deployment (Headscale-UI)

# /opt/headscale-ui/docker-compose.yml
version: '3.8'
services:
  headscale-ui:
    image: ghcr.io/gurucomputing/headscale-ui:latest
    container_name: headscale-ui
    restart: always
    ports:
      - "127.0.0.1:8081:80"
    environment:
      - HS_SERVER=https://hs.ops.company.com

6. Client Onboarding

6.1 Linux Servers

6.1.1 Install the Tailscale Client

# Official install script (the same command covers Ubuntu/Debian and RHEL/CentOS)
curl -fsSL https://tailscale.com/install.sh | sh

# Or install manually on Ubuntu/Debian
curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/jammy.noarmor.gpg | sudo tee /usr/share/keyrings/tailscale-archive-keyring.gpg >/dev/null
curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/jammy.tailscale-keyring.list | sudo tee /etc/apt/sources.list.d/tailscale.list
apt update && apt install -y tailscale

6.1.2 Connect to Headscale

# Use a pre-authentication key (recommended for servers)
tailscale up \
  --login-server https://hs.ops.company.com \
  --authkey tskey-preauth-xxxxxxxxxxxxx \
  --hostname prod-web-bj-001 \
  --advertise-tags tag:prod,tag:web \
  --accept-routes \
  --accept-dns

# Interactive login (for developer machines)
tailscale up \
  --login-server https://hs.ops.company.com \
  --hostname ops-laptop-zhangsan

# Verify the connection
tailscale status
tailscale ip

6.1.3 Automated Installation Script

#!/bin/bash
# /opt/scripts/setup-tailscale.sh

set -euo pipefail

# Configuration (override through environment variables)
HEADSCALE_URL="${HEADSCALE_URL:-https://hs.ops.company.com}"
AUTH_KEY="${AUTH_KEY:-}"
HOSTNAME="${HOSTNAME:-$(hostname -s)}"
TAGS="${TAGS:-}"
ACCEPT_ROUTES="${ACCEPT_ROUTES:-true}"

# Logging helper
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}

# Install Tailscale only if it is missing
if command -v tailscale &> /dev/null; then
    log "Tailscale is already installed, version: $(tailscale version)"
else
    log "Installing Tailscale..."
    curl -fsSL https://tailscale.com/install.sh | sh
fi

# Build the tailscale arguments as an array, which avoids eval and
# keeps values containing spaces or shell metacharacters intact
up_args=(up --login-server "$HEADSCALE_URL")

if [ -n "$AUTH_KEY" ]; then
    up_args+=(--authkey "$AUTH_KEY")
fi

if [ -n "$HOSTNAME" ]; then
    up_args+=(--hostname "$HOSTNAME")
fi

if [ -n "$TAGS" ]; then
    up_args+=(--advertise-tags "$TAGS")
fi

if [ "$ACCEPT_ROUTES" = "true" ]; then
    up_args+=(--accept-routes --accept-dns)
fi

# Connect
log "Connecting to Headscale..."
tailscale "${up_args[@]}"

# Verify the connection by asking for the assigned IPv4 address
sleep 5
if ts_ip="$(tailscale ip -4 2>/dev/null)" && [ -n "$ts_ip" ]; then
    log "Connected. IP: $ts_ip"
else
    log "Connection failed, check the configuration"
    exit 1
fi

6.2 macOS/Windows Clients

6.2.1 macOS

# Install with Homebrew
brew install tailscale

# Start the daemon and connect
sudo tailscaled &
tailscale up --login-server https://hs.ops.company.com

# Or use the official client
# Download: https://tailscale.com/download/mac
# After installing, change the login server in the client settings

6.2.2 Windows

# Install with Winget
winget install tailscale.tailscale

# Or install with Chocolatey
choco install tailscale

# Connect (PowerShell, run as Administrator)
tailscale up --login-server https://hs.ops.company.com

6.3 Mobile Devices

Download the official Tailscale client from the App Store or Google Play
Open the app and tap the settings icon
Select "Custom coordination server"
Enter: https://hs.ops.company.com
Tap "Log in" to complete authentication

6.4 Pre-Authentication Key Management

# Create a reusable pre-auth key (for automated deployments)
headscale preauthkeys create \
  --user prod \
  --reusable \
  --expiration 720h \
  --tags tag:prod,tag:automated

# Create a single-use pre-auth key
headscale preauthkeys create \
  --user ops-team \
  --expiration 24h

# List all pre-auth keys for a user
headscale preauthkeys list --user prod

# Expire a key
headscale preauthkeys expire --user prod <key>

6.5 Automated Deployment with Ansible

# roles/tailscale/tasks/main.yml
---
- name: Install Tailscale
  shell: curl -fsSL https://tailscale.com/install.sh | sh
  args:
    creates: /usr/bin/tailscale

- name: Start tailscaled service
  systemd:
    name: tailscaled
    state: started
    enabled: yes

- name: Check if already connected
  command: tailscale status
  register: ts_status
  ignore_errors: yes
  changed_when: false

- name: Connect to Headscale
  command: >
    tailscale up
    --login-server {{ headscale_url }}
    --authkey {{ headscale_authkey }}
    --hostname {{ inventory_hostname }}
    --advertise-tags {{ tailscale_tags | join(',') }}
    --accept-routes
    --accept-dns
  when: ts_status.rc != 0

- name: Verify connection
  command: tailscale ip -4
  register: ts_ip
  changed_when: false

- name: Display Tailscale IP
  debug:
    msg: "Tailscale IP: {{ ts_ip.stdout }}"

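7. Access Control and Security Policy

7.1 ACL Design Principles

Least privilege: grant only the minimum access needed to do the job
Layered isolation: production, testing, and development are strictly separated
Role-based: operations and development roles receive different permissions
Auditable: every access can be logged and traced

7.2 Detailed ACL Configuration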

// /etc/headscale/acl.json
{
  "groups": {
    "group:ops-admin": ["user:zhangsan", "user:lisi"],
    "group:ops-member": ["user:wangwu", "user:zhaoliu"],
    "group:dev-senior": ["user:dev01", "user:dev02"],
    "group:dev-junior": ["user:dev03", "user:dev04"],
    "group:qa": ["user:qa01", "user:qa02"],
    "group:dba": ["user:dba01"]
  },

  "tagOwners": {
    "tag:prod": ["group:ops-admin"],
    "tag:staging": ["group:ops-admin", "group:ops-member"],
    "tag:testing": ["group:ops-admin", "group:qa"],
    "tag:dev": ["group:ops-admin", "group:dev-senior"],
    "tag:web": ["group:ops-admin"],
    "tag:api": ["group:ops-admin"],
    "tag:db": ["group:ops-admin", "group:dba"],
    "tag:cache": ["group:ops-admin"],
    "tag:mq": ["group:ops-admin"],
    "tag:k8s": ["group:ops-admin"],
    "tag:monitoring": ["group:ops-admin"],
    "tag:bastion": ["group:ops-admin"]
  },

  "hosts": {
    "prod-bastion": "100.65.100.1",
    "staging-bastion": "100.66.100.1",
    "monitoring-server": "100.64.0.10",
    "jenkins-master": "100.67.100.1"
  },

  "acls": [
    // ===== Infrastructure rules =====
    // All nodes may reach DNS
    {
      "action": "accept",
      "src": ["*"],
      "dst": ["100.64.0.1:53"]
    },

    // All nodes may reach the monitoring stack
    {
      "action": "accept",
      "src": ["*"],
      "dst": ["tag:monitoring:9090,9093,3000"]
    },

    // ===== Operations administrator rules =====
    // Operations administrators may reach every service in every environment
    {
      "action": "accept",
      "src": ["group:ops-admin"],
      "dst": ["*:*"]
    },

    // ===== Operations member rules =====
    // Operations members may reach the non-production environments
    {
      "action": "accept",
      "src": ["group:ops-member"],
      "dst": ["tag:staging:*", "tag:testing:*", "tag:dev:*"]
    },
    // Operations members reach production only through the bastion hosts
    {
      "action": "accept",
      "src": ["group:ops-member"],
      "dst": ["tag:bastion:22"]
    },

    // ===== DBA rules =====
    // DBAs may reach all databases
    {
      "action": "accept",
      "src": ["group:dba"],
      "dst": ["tag:db:3306,5432,6379,27017"]
    },
    // DBAs may reach the bastion hosts
    {
      "action": "accept",
      "src": ["group:dba"],
      "dst": ["tag:bastion:22"]
    },

    // ===== Senior developer rules =====
    // Senior developers may reach development, testing, and staging
    {
      "action": "accept",
      "src": ["group:dev-senior"],
      "dst": ["tag:staging:*", "tag:testing:*", "tag:dev:*"]
    },

    // ===== Junior developer rules =====
    // Junior developers may reach the development environment only
    {
      "action": "accept",
      "src": ["group:dev-junior"],
      "dst": ["tag:dev:*"]
    },

    // ===== QA rules =====
    // QA may reach testing, plus the web ports of staging
    {
      "action": "accept",
      "src": ["group:qa"],
      "dst": ["tag:testing:*", "tag:staging:80,443,8080"]
    },

    // ===== Service-to-service rules =====
    // Web servers may reach the API servers
    {
      "action": "accept",
      "src": ["tag:web"],
      "dst": ["tag:api:8080,8443"]
    },
    // API servers may reach databases, caches, and message queues
    {
      "action": "accept",
      "src": ["tag:api"],
      "dst": ["tag:db:3306,5432", "tag:cache:6379", "tag:mq:5672,15672"]
    },
    // Traffic inside the Kubernetes cluster
    {
      "action": "accept",
      "src": ["tag:k8s"],
      "dst": ["tag:k8s:*"]
    },

    // ===== CI/CD rules =====
    // Jenkins may reach the testing environment to deploy
    {
      "action": "accept",
      "src": ["jenkins-master"],
      "dst": ["tag:testing:22,80,443,8080"]
    },

    // ===== Default deny (implicit) =====
  ],

  // SSH rules (control Tailscale SSH)
  "ssh": [
    {
      "action": "accept",
      "src": ["group:ops-admin"],
      "dst": ["*"],
      "users": ["root", "ubuntu", "centos"]
    },
    {
      "action": "accept",
      "src": ["group:ops-member"],
      "dst": ["tag:staging", "tag:testing", "tag:dev"],
      "users": ["ubuntu", "centos"]
    }
  ],

  // Policy tests (for debugging)
  "tests": [
    {
      "src": "user:zhangsan",
      "accept": ["tag:prod:22", "tag:db:3306"]
    },
    {
      "src": "user:dev01",
      "accept": ["tag:dev:*"],
      "deny": ["tag:prod:*"]
    }
  ]
}
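
A policy of this size is easy to break by referencing a tag that was never declared under `tagOwners`. The following is a small lint sketch that operates on an already-parsed policy dictionary. The `undeclared_tags` helper is hypothetical, and stripping the `//` comments before parsing is assumed to happen elsewhere.

```python
import re

# Matches "tag:name" and stops before any ":port" suffix.
TAG = re.compile(r"tag:[A-Za-z0-9_-]+")


def undeclared_tags(policy):
    """Return tags used in acls/ssh rules that are missing from tagOwners."""
    declared = set(policy.get("tagOwners", {}))
    used = set()
    for rule in policy.get("acls", []) + policy.get("ssh", []):
        for entry in rule.get("src", []) + rule.get("dst", []):
            used.update(TAG.findall(entry))
    return sorted(used - declared)


# Illustrative input: tag:api is used as a destination but never declared.
sample_policy = {
    "tagOwners": {"tag:prod": ["group:ops-admin"], "tag:web": ["group:ops-admin"]},
    "acls": [
        {"action": "accept", "src": ["tag:web"], "dst": ["tag:api:8080,8443"]},
    ],
    "ssh": [
        {"action": "accept", "src": ["group:ops-admin"], "dst": ["tag:prod"]},
    ],
}
print(undeclared_tags(sample_policy))
```

Running a check like this in CI before reloading the policy catches typos in tag names early.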

7.3 Tag Management

# Add tags to a node
headscale nodes tag -i <node_id> -t "tag:prod,tag:web"

# View node tags
headscale nodes list

# Update tags through the API (one request per node; Headscale 0.23 renamed "machine" to "node" in the API paths)
curl -X POST https://hs.ops.company.com/api/v1/node/<node_id>/tags \
  -H "Authorization: Bearer <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["tag:prod", "tag:web", "tag:bj"]}'

7.4 Security Hardening

7.4.1 Headscale Server Hardening

# 1. Firewall
ufw default deny incoming
ufw default allow outgoing
ufw allow from 10.0.0.0/8 to any port 22   # SSH from the internal network only
ufw allow 80/tcp                            # HTTP redirect
ufw allow 443/tcp                           # HTTPS
ufw allow 50443/tcp                         # gRPC (if needed)
ufw enable

# 2. fail2ban
apt install -y fail2ban
cat > /etc/fail2ban/jail.local << 'EOF'
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600

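# This jail needs a custom filter at /etc/fail2ban/filter.d/headscale.conf
# (not shown here) and a log file at the logpath below.
[headscale]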
enabled = true
port = 443
filter = headscale
logpath = /var/log/headscale/headscale.log
maxretry = 5
bantime = 3600
findtime = 600
EOF

# 3. Disable SSH password login
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
systemctl restart sshd

# 4. Apply updates regularly
apt update && apt upgrade -y

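7.4.2 Client Security Configuration

# Restrict what the Tailscale interface accepts and advertises.
# Comments cannot follow a line-continuation backslash, so the flags are explained here:
#   --shields-up           block incoming connections by default
#   --accept-routes=false  do not accept routes advertised by other nodes
#   --advertise-routes=""  do not advertise local routes
#   --exit-node=""         do not use an exit node
tailscale up \
  --shields-up \
  --accept-routes=false \
  --advertise-routes="" \
  --exit-node=""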

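8. DNS and Service Discovery

8.1 MagicDNS Configuration

Headscale's built-in MagicDNS provides automatic name resolution for nodes:

# DNS section of config.yaml
dns:
  magic_dns: true
  base_domain: ts.company.local
  nameservers:
    global:
      - 10.0.0.1        # company internal DNS
      - 223.5.5.5       # Alibaba public DNS
    # Headscale 0.23 names the split-DNS key "split"
    split:
      internal.company.com:
        - 10.0.0.1
      aws.internal:
        - 169.254.169.253
  search_domains:
    - ts.company.local
    - company.local
  extra_records:
    # Record names are fully qualified, matching section 5.3.2
    - name: "grafana.ts.company.local"
      type: "A"
      value: "100.64.0.10"
    - name: "prometheus.ts.company.local"
      type: "A"
      value: "100.64.0.11"
    - name: "jenkins.ts.company.local"
      type: "A"
      value: "100.67.100.1"
    # extra_records supports A and AAAA records only, so the gitlab alias for
    # prod-gitlab-bj-001 must be added as an A record with that node's address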

8.2 DNS Resolution Rules

With MagicDNS enabled, names resolve as follows:

Name format	Resolves via	Example
`<hostname>`	Direct lookup	`prod-web-bj-001` → `100.65.1.1`
`<hostname>.<user>`	Qualified by user (namespace)	`prod-web-bj-001.prod`
`<hostname>.<base_domain>`	Fully qualified name	`prod-web-bj-001.ts.company.local`
Custom records	extra_records	`grafana` → `100.64.0.10`

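8.3 Split DNS Configuration

Use specific DNS servers for specific domains:

dns:
  nameservers:
    # Headscale 0.23 names the split-DNS key "split"
    split:
      # AWS internal domains use the AWS resolver
      "compute.internal":
        - 169.254.169.253
      "ec2.internal":
        - 169.254.169.253
      # Alibaba Cloud internal domains
      "alibaba-inc.com":
        - 100.100.2.136
        - 100.100.2.138
      # Company internal domains
      "company.internal":
        - 10.0.0.1
        - 10.0.0.2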

8.4 Service Discovery Integration

8.4.1 Consul Integration

# consul-config.hcl
services {
  id   = "web-prod-001"
  name = "web"
  tags = ["prod", "tailscale"]
  port = 80

  checks = [
    {
      http     = "http://prod-web-bj-001.ts.company.local/health"
      interval = "10s"
      timeout  = "2s"
    }
  ]
}

8.4.2 Kubernetes CoreDNS Integration

# coredns-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        # Forward the Tailscale domain to MagicDNS
        forward ts.company.local 100.100.100.100 {
            policy sequential
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }

9. Monitoring and Alerting

9.1 Monitoring Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Grafana Dashboard                         │
│                    (hs-monitor.ops.company.com)                  │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                 ┌─────────────┴─────────────┐
                 │       Prometheus           │
                 │    (100.64.0.11:9090)      │
                 └─────────────┬─────────────┘
                               │
       ┌───────────────┬───────┴───────┬───────────────┐
       │               │               │               │
       ▼               ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│  Headscale  │ │    DERP     │ │  Tailscale  │ │   System    │
│   Metrics   │ │   Metrics   │ │   Metrics   │ │   Metrics   │
│  :9090      │ │   :8080     │ │  (via API)  │ │  (node_exp) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘

9.2 Prometheus Configuration

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
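  # Headscale metrics (section 5.3.2 binds metrics_listen_addr to 127.0.0.1, so this
  # target is reachable only if that address is changed or the endpoint is proxied)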
  - job_name: 'headscale'
    static_configs:
      - targets: ['100.64.0.1:9090']
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: headscale-primary

  # DERP server metrics
  - job_name: 'derp'
    static_configs:
      - targets:
        - 'derp-bj.ops.company.com:8080'
        - 'derp-sh.ops.company.com:8080'
        - 'derp-hk.ops.company.com:8080'

  # PostgreSQL metrics
  - job_name: 'postgresql'
    static_configs:
      - targets: ['100.64.0.2:9187']

  # All Tailscale nodes (file-based service discovery)
  - job_name: 'tailscale-nodes'
    file_sd_configs:
      - files:
        - '/etc/prometheus/tailscale_nodes.json'
        refresh_interval: 5m
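
The `file_sd_configs` job reads a JSON list of target groups, each with a `targets` array and optional `labels`. The sketch below shows one way to produce `tailscale_nodes.json` from a node inventory. The input shape (`name` and `ip` keys), the `to_file_sd` helper, and the node_exporter port 9100 are assumptions for illustration. How the inventory is obtained from Headscale is left open.

```python
import json


def to_file_sd(nodes, port=9100):
    """Convert a node inventory into Prometheus file_sd target groups."""
    groups = []
    for node in nodes:
        groups.append({
            "targets": [f"{node['ip']}:{port}"],
            "labels": {
                "hostname": node["name"],
                # The first name segment is the environment (see section 3.3).
                "env": node["name"].split("-")[0],
            },
        })
    return groups


# Hypothetical inventory; real data would come from the Headscale node list.
inventory = [
    {"name": "prod-web-bj-001", "ip": "100.65.1.1"},
    {"name": "staging-api-bj-001", "ip": "100.66.1.1"},
]
file_sd = to_file_sd(inventory)
print(json.dumps(file_sd, indent=2))
```

Writing this output to /etc/prometheus/tailscale_nodes.json on a schedule keeps the scrape targets in step with the node list, and Prometheus picks up changes at the configured refresh interval.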

9.3 Key Metrics

9.3.1 Headscale Metrics

Metric	Type	Description	Alert threshold
`headscale_connected_nodes`	Gauge	Connected nodes	< 90% of the expected node count
`headscale_api_requests_total`	Counter	Total API requests	-
`headscale_api_request_duration_seconds`	Histogram	API response time	P99 > 1 s
`headscale_db_query_duration_seconds`	Histogram	Database query time	P99 > 500 ms

9.3.2 DERP Metrics

Metric	Type	Description	Alert threshold
`derp_connections`	Gauge	Current connections	> 10000
`derp_bytes_sent_total`	Counter	Bytes sent	Spike > 200%
`derp_bytes_received_total`	Counter	Bytes received	Spike > 200%
`derp_home_connections`	Gauge	Home connections	-

9.3.3 Node Health Metrics

Metric	Type	Description	Alert threshold
`tailscale_up`	Gauge	Node online status	= 0
`tailscale_derp_latency_seconds`	Gauge	DERP latency	> 200 ms
`tailscale_peer_count`	Gauge	Peer count	= 0

9.4 Alert Rules

# /etc/prometheus/rules/headscale.yml
groups:
  - name: headscale
    interval: 30s
    rules:
      # Headscale service unavailable
      - alert: HeadscaleDown
        expr: up{job="headscale"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Headscale control plane is unavailable"
          description: "The Headscale service has been down for more than 1 minute"

      # Large share of nodes offline
      - alert: TailscaleNodesMassOffline
        expr: |
          (count(tailscale_up == 0) / count(tailscale_up)) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 10% of nodes are offline"
          description: "{{ $value | humanizePercentage }} of nodes are currently offline"

      # Slow API responses
      - alert: HeadscaleAPILatencyHigh
        expr: |
          histogram_quantile(0.99, rate(headscale_api_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Headscale API latency is too high"
          description: "API P99 latency: {{ $value | humanizeDuration }}"

      # Database connection problems
      - alert: HeadscaleDatabaseConnectionIssues
        expr: |
          rate(headscale_db_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Headscale database connection errors"
          description: "Database error rate: {{ $value }}/s"

  - name: derp
    rules:
      # DERP service unavailable
      - alert: DERPServerDown
        expr: up{job="derp"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "DERP relay server is unavailable"
          description: "DERP service on {{ $labels.instance }} is down"

      # DERP connection count too high
      - alert: DERPConnectionsHigh
        expr: derp_connections > 8000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DERP connection count is approaching the limit"
          description: "{{ $labels.instance }} current connections: {{ $value }}"

  - name: nodes
    rules:
      # Single node offline
      - alert: TailscaleNodeDown
        expr: tailscale_up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tailscale node is offline"
          description: "Node {{ $labels.hostname }} has been offline for more than 5 minutes"

      # Production node offline (stricter)
      - alert: ProductionNodeDown
        expr: tailscale_up{env="prod"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Production node is offline"
          description: "Production node {{ $labels.hostname }} is offline"

      # Node is up but has no peers
      - alert: TailscaleNoPeerConnection
        expr: tailscale_peer_count == 0 and tailscale_up == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node cannot establish P2P connections"
          description: "Node {{ $labels.hostname }} is online but has had no connected peers for 10 minutes"
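
The `TailscaleNodesMassOffline` expression divides the number of nodes reporting `tailscale_up == 0` by the total number of nodes and fires when the ratio is strictly above 0.1. The sketch below reproduces that arithmetic on a plain dictionary so the threshold can be checked by hand. The function names and sample host names are illustrative and not part of any Prometheus API.

```python
def offline_ratio(up_by_host):
    """Share of hosts whose tailscale_up value is 0, or None if there are no hosts."""
    total = len(up_by_host)
    if total == 0:
        return None
    offline = sum(1 for value in up_by_host.values() if value == 0)
    return offline / total


def mass_offline(up_by_host, threshold=0.1):
    """Mirror of the alert condition: ratio strictly greater than the threshold."""
    ratio = offline_ratio(up_by_host)
    return ratio is not None and ratio > threshold


# 20 hypothetical nodes: node-00 and node-01 are offline
fleet = {f"node-{i:02d}": (0 if i < 2 else 1) for i in range(20)}
print(offline_ratio(fleet), mass_offline(fleet))

# A third node goes offline, which pushes the ratio past 10%
fleet["node-02"] = 0
print(offline_ratio(fleet), mass_offline(fleet))
```

With exactly 10% of nodes offline the alert stays silent, because the rule uses `>` rather than `>=`.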

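9.5 Grafana Dashboards

Create the following dashboards:

Headscale Overview
- Total, online, and offline node counts
- API request QPS and latency
- Database connection status
DERP Network
- Connection count per DERP server
- Traffic statistics (sent/received)
- Regional distribution
Node Health
- Node online status matrix
- Per-node latency heatmap
- Per-node traffic statistics
ACL Audit
- Access-denied events
- Rule hit statistics
- Anomalous access patterns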

10. Operations Management

10.1 Routine Operations

10.1.1 User Management

# Create users (namespaces)
headscale users create prod
headscale users create staging
headscale users create dev

# List users
headscale users list

# Delete a user (use with caution)
headscale users destroy dev

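10.1.2 Node Management

# List all nodes
headscale nodes list

# List the nodes of a specific user
headscale nodes list --user prod

# Show a single node (the list command has no per-node filter, so filter the output)
headscale nodes list | grep prod-web-bj-001

# Delete a node
headscale nodes delete --identifier <node_id>

# Rename a node (the new name is a positional argument)
headscale nodes rename --identifier <node_id> new-hostname

# Move a node to another user
headscale nodes move --identifier <node_id> --user staging

# Expire a node immediately (the node must re-authenticate)
headscale nodes expire --identifier <node_id>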

10.1.3 Route Management

# List all routes
headscale routes list

# Enable a route
headscale routes enable --route <route_id>

# Disable a route
headscale routes disable --route <route_id>

# Delete a route
headscale routes delete --route <route_id>

10.1.4 API Key Management

# Create an API key
headscale apikeys create --expiration 90d

# List API keys
headscale apikeys list

# Expire an API key
headscale apikeys expire --prefix <key_prefix>

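10.2 Operations Scripts

10.2.1 Node Health Check Script

#!/bin/bash
# /opt/scripts/check-tailscale-health.sh
set -euo pipefail

HEADSCALE_URL="https://hs.ops.company.com"
API_KEY="your_api_key"   # placeholder; load from a secret store in production
ALERT_WEBHOOK="https://webhook.ops.company.com/alert"

# Send an alert; jq builds the JSON so newlines and quotes are escaped correctly
send_alert() {
  jq -n --arg text "$1" '{text: $text}' |
    curl -fsS -X POST "$ALERT_WEBHOOK" -H "Content-Type: application/json" -d @-
}

# Fetch all nodes, one JSON object per line
nodes=$(curl -fsS -H "Authorization: Bearer $API_KEY" \
  "${HEADSCALE_URL}/api/v1/machine" | jq -c '.machines[]')

# Check for offline nodes (also matches nodes where "online" is omitted)
offline_nodes=$(echo "$nodes" | jq -r 'select(.online != true) | .givenName')

if [ -n "$offline_nodes" ]; then
  send_alert "[Tailscale] The following nodes are offline:"$'\n'"$offline_nodes"
fi

# Check for nodes that expire within 7 days (604800 s); fractional seconds are
# stripped because fromdateiso8601 may not accept them
expiring_nodes=$(echo "$nodes" | jq -r \
  'select(.expiry != null and .expiry != "0001-01-01T00:00:00Z") |
   select((.expiry | sub("\\.[0-9]+"; "") | fromdateiso8601) < (now + 604800)) |
   .givenName + " (expires: " + .expiry + ")"')

if [ -n "$expiring_nodes" ]; then
  send_alert "[Tailscale] The following nodes are about to expire:"$'\n'"$expiring_nodes"
fi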

10.2.2 Bulk Node Management Script

#!/usr/bin/env python3
# /opt/scripts/headscale-manager.py

import argparse
import json
import re
from datetime import datetime, timedelta, timezone

import requests

ZERO_TIME = '0001-01-01T00:00:00Z'  # value Headscale reports for "never expires"


def parse_ts(value):
    """Parse an RFC 3339 timestamp from the API into an aware datetime."""
    # Drop fractional seconds: fromisoformat() rejects nanoseconds before Python 3.11
    value = re.sub(r'\.\d+', '', value).replace('Z', '+00:00')
    return datetime.fromisoformat(value)


class HeadscaleManager:
    def __init__(self, url, api_key, timeout=10):
        self.url = url.rstrip('/')
        self.timeout = timeout
        self.headers = {
            'Authorization': f'Bearer {api_key}',
            'Content-Type': 'application/json'
        }

    def get_nodes(self, user=None):
        """Return the node list, optionally filtered by user."""
        params = {'user': user} if user else {}
        resp = requests.get(
            f'{self.url}/api/v1/machine',
            headers=self.headers,
            params=params,
            timeout=self.timeout
        )
        resp.raise_for_status()
        return resp.json().get('machines', [])

    def get_offline_nodes(self, threshold_hours=1):
        """Return nodes that have been offline for longer than threshold_hours."""
        threshold = datetime.now(timezone.utc) - timedelta(hours=threshold_hours)
        offline = []
        for node in self.get_nodes():
            if node.get('online', False):
                continue
            last_seen = node.get('lastSeen')
            # Nodes that never connected have no lastSeen; report them as well
            if not last_seen or parse_ts(last_seen) < threshold:
                offline.append(node)
        return offline

    def bulk_tag_nodes(self, node_ids, tags):
        """Set the same tags on several nodes and report success per node."""
        results = []
        for node_id in node_ids:
            resp = requests.post(
                f'{self.url}/api/v1/machine/{node_id}/tags',
                headers=self.headers,
                json={'tags': tags},
                timeout=self.timeout
            )
            results.append({
                'node_id': node_id,
                'success': resp.status_code == 200
            })
        return results

    def cleanup_expired_nodes(self, dry_run=True):
        """Return expired nodes; delete them unless dry_run is True."""
        now = datetime.now(timezone.utc)
        expired = [
            node for node in self.get_nodes()
            if node.get('expiry') and node['expiry'] != ZERO_TIME
            and parse_ts(node['expiry']) < now
        ]

        if not dry_run:
            for node in expired:
                resp = requests.delete(
                    f'{self.url}/api/v1/machine/{node["id"]}',
                    headers=self.headers,
                    timeout=self.timeout
                )
                resp.raise_for_status()

        return expired


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Headscale management tool')
    parser.add_argument('--url', required=True, help='Headscale URL')
    parser.add_argument('--api-key', required=True, help='API key')
    parser.add_argument('action', choices=['list', 'offline', 'cleanup'])
    parser.add_argument('--user', help='Filter by user')
    parser.add_argument('--dry-run', action='store_true',
                        help='List expired nodes without deleting them '
                             '(without this flag, cleanup deletes them)')

    args = parser.parse_args()

    manager = HeadscaleManager(args.url, args.api_key)

    if args.action == 'list':
        nodes = manager.get_nodes(args.user)
        print(json.dumps(nodes, indent=2))
    elif args.action == 'offline':
        offline = manager.get_offline_nodes()
        print(f"Offline nodes: {len(offline)}")
        for node in offline:
            print(f"  - {node['givenName']} (last seen: {node.get('lastSeen', 'never')})")
    elif args.action == 'cleanup':
        expired = manager.cleanup_expired_nodes(dry_run=args.dry_run)
        print(f"Expired nodes: {len(expired)}")
        for node in expired:
            print(f"  - {node['givenName']} (expired: {node['expiry']})")

10.3 Log Management

# Headscale log location: /var/log/headscale/headscale.log

# Log rotation configuration
cat > /etc/logrotate.d/headscale << 'EOF'
/var/log/headscale/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0640 headscale headscale
    sharedscripts
    postrotate
        systemctl reload headscale > /dev/null 2>&1 || true
    endscript
}
EOF

# Query structured logs (JSON format)
jq 'select(.level == "error")' /var/log/headscale/headscale.log
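
When the error volume is high, a per-message count is often more useful than the raw `jq` output. The sketch below groups error entries by message. It assumes the log is written as one JSON object per line with `level` and `message` fields, which should be checked against the actual log format. The sample lines are made up for illustration.

```python
import json
from collections import Counter


def error_summary(lines, level="error"):
    """Count log entries of the given level, grouped by message text."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (for example startup banners)
        if isinstance(entry, dict) and entry.get("level") == level:
            counts[entry.get("message", "<no message>")] += 1
    return counts


# Hypothetical log lines
sample_log = [
    '{"level":"info","message":"listening"}',
    '{"level":"error","message":"database timeout"}',
    'plain text line',
    '{"level":"error","message":"database timeout"}',
    '{"level":"error","message":"invalid auth key"}',
]
for message, count in error_summary(sample_log).most_common():
    print(f"{count:>4}  {message}")
```

To run it against the real file, pass an open file handle, for example `error_summary(open("/var/log/headscale/headscale.log"))`.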

10.4 Backup and Restore

10.4.1 Database Backup

#!/bin/bash
# /opt/scripts/backup-headscale.sh
set -euo pipefail

BACKUP_DIR="/backup/headscale"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30

mkdir -p "${BACKUP_DIR}"

# PostgreSQL backup (custom format, restorable with pg_restore)
pg_dump -h localhost -U headscale -d headscale -F c \
  -f "${BACKUP_DIR}/headscale_${DATE}.dump"

# Configuration and key backup
tar -czf "${BACKUP_DIR}/config_${DATE}.tar.gz" \
  /etc/headscale/config.yaml \
  /etc/headscale/acl.json \
  /etc/headscale/derp.json \
  /var/lib/headscale/private.key \
  /var/lib/headscale/noise_private.key

# Remove old backups
find "${BACKUP_DIR}" -type f -mtime +${RETENTION_DAYS} -delete

# Upload to S3 (optional)
aws s3 sync "${BACKUP_DIR}/" s3://backup-bucket/headscale/
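
Each backup run writes two files that share a timestamp: `headscale_<timestamp>.dump` and `config_<timestamp>.tar.gz`. A restore needs both, so it can help to check for orphaned files before relying on a backup set. The following sketch does that on a list of file names. The function name is illustrative, and the patterns assume the naming used by the backup script above.

```python
import re

DUMP_RE = re.compile(r"^headscale_(\d{8}_\d{6})\.dump$")
CONFIG_RE = re.compile(r"^config_(\d{8}_\d{6})\.tar\.gz$")


def unmatched_backups(filenames):
    """Return (dump timestamps without config, config timestamps without dump)."""
    dumps = {m.group(1) for f in filenames if (m := DUMP_RE.match(f))}
    configs = {m.group(1) for f in filenames if (m := CONFIG_RE.match(f))}
    return sorted(dumps - configs), sorted(configs - dumps)


# Hypothetical directory listing
listing = [
    "headscale_20250101_020000.dump",
    "config_20250101_020000.tar.gz",
    "headscale_20250102_020000.dump",  # config archive missing for this run
    "notes.txt",
]
missing_config, missing_dump = unmatched_backups(listing)
print("dumps without config archive:", missing_config)
print("config archives without dump:", missing_dump)
```

Against the real directory, the listing would come from `os.listdir("/backup/headscale")`.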

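10.4.2 Data Restore

#!/bin/bash
# /opt/scripts/restore-headscale.sh
# Usage: restore-headscale.sh /backup/headscale/headscale_<timestamp>.dump
set -euo pipefail

BACKUP_FILE=${1:?"usage: $0 BACKUP_FILE.dump"}

# Locate the config archive written by the same backup run (config_<timestamp>.tar.gz)
BACKUP_DIR=$(dirname "${BACKUP_FILE}")
STAMP=$(basename "${BACKUP_FILE}" .dump)
STAMP=${STAMP#headscale_}
CONFIG_ARCHIVE="${BACKUP_DIR}/config_${STAMP}.tar.gz"

# Stop the service
systemctl stop headscale

# Restore the database (-c drops existing objects before recreating them)
pg_restore -h localhost -U headscale -d headscale -c "${BACKUP_FILE}"

# Restore configuration files and keys
tar -xzf "${CONFIG_ARCHIVE}" -C /

# Start the service
systemctl start headscale

# Verify
headscale nodes list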

10.5 Version Upgrade Procedure

#!/bin/bash
# /opt/scripts/upgrade-headscale.sh
set -eu

NEW_VERSION=${1:?"usage: $0 NEW_VERSION"}
BACKUP_DIR="/backup/headscale/upgrade"

echo "Upgrading Headscale to version ${NEW_VERSION}"

# 1. Back up the current version
echo "Backing up current configuration and data..."
/opt/scripts/backup-headscale.sh

# 2. Download the new version
echo "Downloading the new version..."
wget -O /tmp/headscale_new.deb \
  "https://github.com/juanfont/headscale/releases/download/v${NEW_VERSION}/headscale_${NEW_VERSION}_linux_amd64.deb"

# 3. Stop the service
echo "Stopping the Headscale service..."
systemctl stop headscale

# 4. Install the new version
echo "Installing the new version..."
dpkg -i /tmp/headscale_new.deb

# 5. Database migration
# Headscale applies pending schema migrations when the service starts,
# so no separate migration command is run here.

# 6. Start the service
echo "Starting the service..."
systemctl start headscale

# 7. Verify
echo "Verifying the upgrade..."
sleep 5
headscale version
headscale nodes list | head -5

echo "Upgrade complete!"

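11. Failure Recovery and Disaster Recovery

11.1 Failure Scenarios and Recovery Plans

11.1.1 Headscale Primary Node Failure

Impact:

New nodes cannot join the network
ACL policies cannot be updated
Already-connected nodes keep communicating (direct P2P)

Recovery steps:

# 1. Confirm the primary node failure
systemctl status headscale
curl -s https://hs.ops.company.com/health

# 2. Switch to the standby node
# Update DNS or the load balancer configuration to point to the standby

# 3. If the database is the problem, switch to the replica
# Update the database connection in config.yaml

# 4. Restart the service
systemctl restart headscale

# 5. Verify that the service has recovered
headscale nodes list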
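
Step 1 above is a judgment call: a single failed request should not trigger a switchover. The sketch below shows one way to require several consecutive failed probes before declaring the primary down. The function names, attempt count, and delay are illustrative assumptions. `http_probe` uses the `/health` URL from the steps above but is not called in the demo, which uses a stub instead.

```python
import time
import urllib.request


def http_probe(url, timeout=5):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status == 200


def primary_is_down(probe, attempts=3, delay=10):
    """Return True only when every one of `attempts` probes fails or raises."""
    for i in range(attempts):
        try:
            if probe():
                return False
        except Exception:
            pass  # network errors and non-2xx responses count as failed probes
        if i < attempts - 1:
            time.sleep(delay)
    return True


# Demo with a stubbed probe that fails twice and then succeeds
responses = iter([False, False, True])
print(primary_is_down(lambda: next(responses), delay=0))
```

A real check would pass `lambda: http_probe("https://hs.ops.company.com/health")` as the probe and proceed to step 2 only when the function returns `True`.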

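11.1.2 PostgreSQL Database Failure

Recovery steps:

# 1. If the primary fails, promote the replica
# Run on the replica (on Debian/Ubuntu, pg_ctl lives under /usr/lib/postgresql/<version>/bin)
sudo -u postgres /usr/lib/postgresql/15/bin/pg_ctl promote -D /var/lib/postgresql/15/main

# 2. Point the Headscale configuration at the new primary
sed -i 's/old_primary_ip/new_primary_ip/' /etc/headscale/config.yaml

# 3. Restart Headscale
systemctl restart headscale

# 4. Rebuild the replica
# Use pg_basebackup to resynchronize from the new primary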

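11.1.3 DERP Relay Server Failure

Impact:

Nodes that cannot traverse NAT lose connectivity until they switch to another DERP server
Nodes with direct connections are not affected

Recovery steps:

# 1. Check the DERP service status
systemctl status derper
curl -s https://derp-bj.ops.company.com/derp/probe

# 2. If it cannot be recovered, remove the server from the DERP map
# Edit /etc/headscale/derp.json and remove the failed server

# 3. Wait for clients to switch to another DERP server automatically,
# and check DERP reachability and latency from a client
tailscale netcheck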

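11.1.4 Full Disaster Recovery

# 1. Prepare a new server (with PostgreSQL installed and an empty headscale database)

# 2. Install Headscale
dpkg -i headscale_latest.deb

# 3. Restore the database from backup
pg_restore -h localhost -U headscale -d headscale /backup/latest.dump

# 4. Restore configuration files
tar -xzf /backup/config_latest.tar.gz -C /

# 5. Start the service
systemctl start headscale

# 6. Update DNS to point to the new server

# 7. Verify that all nodes reconnect
watch 'headscale nodes list | grep -c Online'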

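11.2 RTO and RPO Targets

Scenario	RTO (recovery time objective)	RPO (recovery point objective)
Headscale single-node failure	< 5 minutes	0 (hot standby takes over)
Database failure	< 15 minutes	< 1 minute (synchronous replication)
DERP failure	Automatic failover	N/A
Full disaster	< 2 hours	< 24 hours

11.3 Regular Drills

A failure drill is recommended once per quarter:

Drill scope:
- Primary/standby switchover
- Database failover
- Restore from backup
- ACL policy rollback
Drill records:
- Drill time and participants
- Actual recovery time
- Issues found and improvement actions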
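
The recovery times recorded in a drill can be compared against the RTO targets in section 11.2 mechanically. The following sketch is one way to do that. The scenario keys and the sample measurements are invented for illustration; the targets are the ones from the table (under 5 minutes, under 15 minutes, under 2 hours).

```python
# RTO targets in minutes, taken from the table in section 11.2
RTO_TARGET_MINUTES = {
    "headscale_failover": 5,
    "database_failover": 15,
    "full_disaster_recovery": 120,
}


def evaluate_drill(measured_minutes):
    """Compare measured recovery times with the targets; missing scenarios fail."""
    results = {}
    for scenario, target in RTO_TARGET_MINUTES.items():
        measured = measured_minutes.get(scenario)
        results[scenario] = {
            "target": target,
            "measured": measured,
            "passed": measured is not None and measured < target,
        }
    return results


# Hypothetical drill: the database failover overran and the full restore was not exercised
drill = {"headscale_failover": 3.5, "database_failover": 18}
for scenario, r in evaluate_drill(drill).items():
    status = "PASS" if r["passed"] else "FAIL"
    print(f"{scenario}: measured={r['measured']} target<{r['target']} min -> {status}")
```

Treating an unmeasured scenario as a failure keeps a skipped drill item visible in the record.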

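12. Implementation Plan and Milestones

12.1 Implementation Phases

Phase 1: Infrastructure Preparation

Task	Owner	Prerequisite	Deliverable
Server resource request	Ops	Budget approval	Server inventory
Domain and certificate preparation	Ops	Domain purchase	SSL certificates
PostgreSQL high-availability deployment	DBA	Servers ready	Database cluster
Network plan confirmation	Network team	-	IP plan document

Phase 2: Core Service Deployment

Task	Owner	Prerequisite	Deliverable
Headscale primary node deployment	Ops	PostgreSQL ready	Service running
Headscale standby node configuration	Ops	Primary node ready	Primary/standby switchover test
DERP relay server deployment	Ops	Servers ready	Multi-region DERP
ACL policy configuration	Security team	Service running	ACL file
Monitoring and alerting deployment	Ops	Service running	Grafana dashboards

Phase 3: Node Onboarding

Task	Owner	Prerequisite	Deliverable
Test environment onboarding	Ops	Service ready	Test nodes online
Staging environment onboarding	Ops	Tests passed	Staging nodes online
Production onboarding (batch 1)	Ops	Staging verified	First batch of production nodes
Production onboarding (batches 2-N)	Ops	Batch 1 successful	All production nodes
Operations staff device onboarding	Ops	Production stable	Operations devices online
Developer device onboarding	Development lead	Verified by Ops	Developer devices online

Phase 4: Acceptance and Handover

Task	Owner	Prerequisite	Deliverable
Functional acceptance testing	QA	All nodes onboarded	Acceptance report
Performance stress testing	Performance team	Functional acceptance	Performance report
Failure drill	Ops	Acceptance passed	Drill records
Documentation delivery	Ops	Drill passed	Operations manual
Training and handover	Ops	Documentation complete	Training records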

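12.2 Milestones

┌──────────────────────────────────────────────────────────────┐
│                   Implementation Timeline                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  M1: Infrastructure ready                                    │
│  ├── PostgreSQL HA deployed                                  │
│  ├── Domains and certificates ready                          │
│  └── Network plan confirmed                                  │
│                                                              │
│  M2: Core services live                                      │
│  ├── Headscale primary and standby running                   │
│  ├── DERP deployed in multiple regions                       │
│  ├── Monitoring and alerting ready                           │
│  └── ACL policy configured                                   │
│                                                              │
│  M3: Test validation complete                                │
│  ├── All test environments onboarded                         │
│  ├── Staging environment onboarded                           │
│  └── Functional acceptance passed                            │
│                                                              │
│  M4: Production migration complete                           │
│  ├── All production servers onboarded                        │
│  ├── Legacy VPN decommissioned                               │
│  └── Operations devices onboarded                            │
│                                                              │
│  M5: Project acceptance                                      │
│  ├── Failure drills passed                                   │
│  ├── Training and handover complete                          │
│  └── Project formally closed                                 │
│                                                              │
└──────────────────────────────────────────────────────────────┘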

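12.3 Acceptance Criteria

Item	Criterion	Method
Network connectivity	Any two nodes can reach each other	ping/traceroute tests
Connection latency	Same-region P2P < 50ms (the target set in section 1.2)	Tailscale ping
Service availability	99.9% availability	Monitoring data
ACL enforcement	Policies match the design	Security scan
Failure recovery	RTO within target	Failure drill
Performance	Supports 1000+ nodes	Stress test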
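
The 99.9% availability criterion translates into a concrete downtime budget. The arithmetic is sketched below: a 30-day window has 43,200 minutes, so 0.1% of it is about 43.2 minutes. The function names and the 30-day window are illustrative; the measurement window used for acceptance should be agreed separately.

```python
def allowed_downtime_minutes(sla_percent, days):
    """Downtime budget in minutes for the given SLA over a window of `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


def meets_sla(downtime_minutes, days, sla_percent=99.9):
    """True if the observed downtime fits inside the budget."""
    return downtime_minutes <= allowed_downtime_minutes(sla_percent, days)


budget = allowed_downtime_minutes(99.9, 30)
print(f"30-day budget at 99.9%: {budget:.1f} minutes")
print("40 minutes of downtime acceptable:", meets_sla(40, 30))
print("50 minutes of downtime acceptable:", meets_sla(50, 30))
```

The same functions can be fed with the downtime derived from the `up{job="headscale"}` series collected by Prometheus.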

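13. Risk Assessment and Mitigation

13.1 Risk Matrix

Risk	Likelihood	Impact	Risk level	Mitigation
Unstable Headscale release	Medium	High	High	Test thoroughly; prepare a rollback plan
High NAT traversal failure rate	Medium	Medium	Medium	Deploy DERP in multiple regions
Key leakage	Low	Very high	High	Key management; regular rotation
Performance bottleneck	Medium	Medium	Medium	Monitoring and early warning; capacity planning
Insufficient operations skills	Medium	Medium	Medium	Training; complete documentation
Conflict with existing systems	Low	Medium	Low	Test thoroughly; roll out in batches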

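13.2 Rollback Plan

13.2.1 Server Rollback

# 1. Stop the new version
systemctl stop headscale

# 2. Reinstall the old version
dpkg -i /backup/headscale_old.deb

# 3. Restore the configuration
cp /backup/config_old.yaml /etc/headscale/config.yaml

# 4. Roll back the database if needed
pg_restore -h localhost -U headscale -d headscale -c /backup/db_old.dump

# 5. Start the service
systemctl start headscale

13.2.2 Client Rollback

# Disconnect from Headscale
tailscale down

# Restore the previous VPN configuration
# (follow the procedure of the previous VPN solution)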

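13.3 Emergency Contacts

Role	Name	Contact	Responsibility
Project owner	xxx	138xxxxxxxx	Decisions, coordination
Technical lead	xxx	139xxxxxxxx	Technical design
Operations lead	xxx	137xxxxxxxx	Deployment and rollout
DBA	xxx	136xxxxxxxx	Database operations
Security lead	xxx	135xxxxxxxx	Security review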

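14. Appendix

14.1 Glossary

Term	Explanation
Headscale	Open-source, self-hosted implementation of the Tailscale control server
Tailscale	Zero-configuration VPN built on WireGuard
WireGuard	A modern VPN protocol
DERP	Designated Encrypted Relay for Packets, an encrypted relay protocol
MagicDNS	Tailscale's automatic DNS service
ACL	Access Control List
PreAuth Key	Pre-authentication key, used for non-interactive onboarding
Mesh Network	Network in which nodes can communicate with each other directly
NAT Traversal	Techniques for establishing connections across NAT
STUN	Session Traversal Utilities for NAT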

14.2 References

14.3 Command Quick Reference

# Headscale service management
systemctl start|stop|restart|status headscale

# User management
headscale users list
headscale users create <name>
headscale users destroy <name>

# Node management
headscale nodes list
headscale nodes delete -i <id>
headscale nodes expire -i <id>
headscale nodes rename -i <id> <new_name>
headscale nodes tag -i <id> -t <tags>

# Pre-authentication keys
headscale preauthkeys create --user <user> --expiration 24h
headscale preauthkeys list --user <user>

# Route management
headscale routes list
headscale routes enable -r <route_id>

# API keys
headscale apikeys create --expiration 90d
headscale apikeys list

# Tailscale client
tailscale up --login-server <url>
tailscale down
tailscale status
tailscale ip
tailscale ping <peer>
tailscale netcheck

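14.4 Configuration Templates

The configuration templates are located at:

/opt/templates/headscale/config.yaml.tmpl
/opt/templates/headscale/acl.json.tmpl
/opt/templates/derp/docker-compose.yml.tmpl

14.5 Change Log

Version	Date	Change	Author
v1.0	2025-12-15	Initial draft	xxx
v2.0	2025-12-18	Detailed design completed	AI Assistant

Document maintenance: this document should be updated as the project progresses, and every major change must be recorded in the change log.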
