---
name: dev-cicd
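description: CI/CD pipeline design, optimization, and troubleshooting. Targets a Gitea Actions + Go/Swift/Next.js/Docker stack. Activates automatically when the user mentions CI, CD, pipeline, workflow, build failure, or runner-related tasks.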
---
# CI/CD Pipeline Skill (dev-cicd)
## Overview
Manages the design, optimization, and troubleshooting of Gitea Actions CI/CD pipelines. Target stack:
- **Git**: Gitea (self-hosted, GitHub Actions YAML compatible)
- **Backend**: Go (Gin + GORM)
- **iOS**: Swift 6 + SwiftUI + TCA
- **Web**: Next.js (React)
- **Container**: Docker + Docker Compose
- **Registry**: Aliyun ACR
- **Runners**: self-hosted (Linux) + macos-arm64 (iOS)
---
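## Command reference
| Command | Description |
|------|------|
| `/cicd analyze` | Analyze the current workflows for optimization opportunities |
| `/cicd troubleshoot` | Diagnose why a pipeline failed |
| `/cicd template [go\|ios\|web\|docker]` | Generate a workflow template |
| `/cicd status` | Show the status of recent workflow runs |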
---
## 1. Pipeline design
### 1.1 Monorepo path filtering
The repository contains several subprojects. Use `paths` so that a push triggers only the relevant builds:
```yaml
# .gitea/workflows/ci-cd.yml (Go + Web + Docker)
on:
  push:
    branches: [develop, main]
    paths:
      - 'gateway/**'
      - 'web/**'
      - 'docker/**'
      - 'scripts/**'

# .gitea/workflows/ios-testflight.yml (iOS, separate workflow)
on:
  push:
    branches: [develop, main]
    paths:
      - 'ios/**'
```
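When a push does not trigger the expected workflow, it can help to check locally which filter a changed file falls under. The sketch below is a rough approximation that only handles the plain `<dir>/**` prefixes shown above; `workflow_for_path` is a hypothetical helper name, and real `paths` matching supports richer glob syntax than this.

```bash
# Map a changed file to the workflow whose `paths` filter would match it.
# Assumption: only the simple "<dir>/**" prefixes from this document are used.
workflow_for_path() {
  case "$1" in
    gateway/*|web/*|docker/*|scripts/*) echo "ci-cd.yml" ;;
    ios/*)                              echo "ios-testflight.yml" ;;
    *)                                  echo "none" ;;
  esac
}

# Example: list the workflows the last commit would trigger.
# git diff --name-only HEAD~1 | while read -r f; do workflow_for_path "$f"; done | sort -u
```

A file that maps to `none` (for example a top-level `README.md`) triggers neither workflow, which is the usual explanation for a "missing" run.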
### 1.2 Pipeline structure principles
```
Fast feedback first:
1. Static checks (lint/vet): seconds
2. Unit tests (test): 1-5 minutes
3. Build (build): 2-10 minutes
4. Integration tests (optional): 5-15 minutes
5. Release (deploy): 5-15 minutes
```
### 1.3 Go backend template
```yaml
jobs:
  ci:
    runs-on: self-hosted
    steps:
      - name: Checkout
        run: |
          cd ${{ github.workspace }}
          if [ -d .git ]; then
            # Reuse the existing workspace: fetch only the newest commit
            git fetch --depth 1 origin ${{ github.ref_name }}
            # FETCH_HEAD also works when the workspace was first cloned from another branch
            git reset --hard FETCH_HEAD
          else
            git clone --depth 1 --branch ${{ github.ref_name }} \
              http://xiaoqu:${{ secrets.REPO_TOKEN }}@localhost:3000/<org>/<repo>.git .
          fi
      - name: Go Vet
        run: cd gateway && go vet ./...
      - name: Go Test
        run: cd gateway && go test ./... -count=1 -timeout 120s
      - name: Go Build
        run: cd gateway && go build ./cmd/gateway/
```
### 1.4 iOS template
```yaml
jobs:
  ios:
    runs-on: macos-arm64
    if: "!contains(github.event.head_commit.message, '[skip ci]')"
    steps:
      - name: Checkout
        run: git clone --depth 1 --branch ${{ github.ref_name }} <repo-url> .
      - name: xcodegen
        run: /opt/homebrew/bin/xcodegen generate
        working-directory: ios
      - name: Test
        run: |
          # pipefail keeps a test failure from being masked by tee/tail
          set -o pipefail
          swift test 2>&1 | tee /tmp/test.log | tail -20
        working-directory: ios
      - name: Deploy TestFlight
        env:
          KEYCHAIN_PASSWORD: ${{ secrets.KEYCHAIN_PASSWORD }}
          ASC_KEY_ID: ${{ secrets.ASC_KEY_ID }}
          ASC_ISSUER_ID: ${{ secrets.ASC_ISSUER_ID }}
        run: ./scripts/ios-testflight.sh
```
### 1.5 Web (Next.js) template
```yaml
- name: Web Install
  run: cd web && npm ci --legacy-peer-deps
- name: Web Build
  run: cd web && npm run build
- name: Docker Build Web
  run: |
    docker build -t $REGISTRY/$WEB_IMAGE:${{ github.sha }} \
      -t $REGISTRY/$WEB_IMAGE:latest ./web
```
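### 1.6 Single job vs multiple jobs
| Scenario | Choice | Reason |
|------|------|------|
| Runner capacity=1 | Single job | Multiple jobs run serially and each repeats the checkout, so the pipeline gets slower |
| Multiple runners available | Multiple jobs + `needs` | Parallel speed-up |
| Different OS (Linux + macOS) | Separate workflows | Different runner labels |

**Current recommendation**: one single job on the Linux runner (Go + Web + Docker) and one single job on the macOS runner (iOS).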
---
## 2. Optimization
### 2.1 Shallow clone
```yaml
# First clone
git clone --depth 1 --branch ${{ github.ref_name }} <url> .
# Incremental fetch
git fetch --depth 1 origin ${{ github.ref_name }}
git reset --hard FETCH_HEAD
```
**Effect**: when the repository contains many binary files, clone time drops from 30s+ to 3-5s.

**Note**: if the job needs to push, run `git fetch --unshallow` first.
### 2.2 Dependency caching
This setup does not use `actions/cache`; a self-hosted runner can rely on its local disk instead:
```yaml
# Go modules: global cache on the runner
env:
  GOMODCACHE: /opt/runner-cache/go/mod
  GOCACHE: /opt/runner-cache/go/build

# npm: persist node_modules
# (a self-hosted runner keeps its workspace between runs)
- run: |
    if [ -f web/node_modules/.cache-hash ] && \
       [ "$(cat web/node_modules/.cache-hash)" = "$(md5sum web/package-lock.json | cut -d' ' -f1)" ]; then
      echo "npm cache hit, skip install"
    else
      cd web && npm ci --legacy-peer-deps
      # Record the lock file hash so the next run can skip the install
      md5sum package-lock.json | cut -d' ' -f1 > node_modules/.cache-hash
    fi

# SPM: Xcode caches to DerivedData automatically, and the self-hosted runner keeps it
```
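The lock-file check above can be factored into two small helpers so that the same pattern is reusable for other package managers. This is only a sketch: `cache_is_fresh` and `cache_mark_fresh` are hypothetical names, and `cksum` replaces `md5sum` purely because it is available on both Linux and macOS.

```bash
# Return 0 when the stamp file records the lock file's current checksum.
# Assumption: cksum is used instead of md5sum for portability only.
cache_is_fresh() {
  lockfile=$1
  stamp=$2
  [ -f "$lockfile" ] && [ -f "$stamp" ] || return 1
  [ "$(cat "$stamp")" = "$(cksum < "$lockfile")" ]
}

# Record the lock file's checksum after a successful install.
cache_mark_fresh() {
  cksum < "$1" > "$2"
}

# Example:
# if ! cache_is_fresh web/package-lock.json web/node_modules/.cache-hash; then
#   (cd web && npm ci --legacy-peer-deps)
#   cache_mark_fresh web/package-lock.json web/node_modules/.cache-hash
# fi
```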
### 2.3 Cancelling concurrent runs
Keep repeated pushes to the same branch from queuing up behind each other:
```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```
### 2.4 Conditional skipping
```yaml
# Skip automatic commits made by the CI bot
if: "!contains(github.event.head_commit.message, '[skip ci]')"
# Deploy only from the develop branch
if: github.ref == 'refs/heads/develop'
```
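To predict locally whether a commit would be skipped, the same substring test can be reproduced in shell. A minimal sketch, assuming the marker is the literal text `[skip ci]`; `should_skip_ci` is a hypothetical helper name.

```bash
# Return 0 (skip) when the commit message contains the literal "[skip ci]" marker.
should_skip_ci() {
  case "$1" in
    *"[skip ci]"*) return 0 ;;   # quoting makes the brackets literal
    *)             return 1 ;;
  esac
}

# Example:
# should_skip_ci "$(git log -1 --pretty=%B)" && echo "this commit would be skipped"
```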
### 2.5 Reusing build artifacts
```yaml
# Build once, use in deploy
- name: Build
  run: go build -o /tmp/gateway ./cmd/gateway/
- name: Docker Build
  run: |
    # Use the already compiled binary instead of recompiling inside Docker
    cp /tmp/gateway docker/
    docker build -f docker/gateway.prebuilt.Dockerfile -t $IMAGE .
```
### 2.6 Slimming the Docker context
**Problem**: `docker build` sends the entire context directory to the daemon. Without a `.dockerignore`, `node_modules` (hundreds of MB), `.next/`, `.git/`, and so on are all included, so `transferring context: 768MB` takes 30s+.

**Diagnosis**
```bash
# Check the context size (approximates what docker build sends)
du -sh --exclude=.git <project-dir>
# Check whether a .dockerignore exists
cat <project-dir>/.dockerignore 2>/dev/null || echo ".dockerignore is missing!"
```
**`/cicd analyze` required check**: for every directory that has a Dockerfile, check that a `.dockerignore` exists, and warn if it is missing.

**Standard .dockerignore template**
```
# Node.js
node_modules
.next
.turbo
coverage
# Common
.git
.gitignore
.env*
*.md
.vscode
.idea
```
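**Effect**: the Web project's context shrinks from 768MB to ~10MB, and the Docker build runs about 10x faster.

**Lesson learned**: `.gitignore` is not `.dockerignore`. Files that Git ignores may still exist in the runner workspace (for example the `node_modules` cache kept by a self-hosted runner), and `docker build` packs all of them into the context. Every subdirectory with a Dockerfile **must have a `.dockerignore`**.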
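Beyond checking that the file exists, it may be worth verifying that the heaviest entries are actually listed. The helper below is a sketch: `missing_ignore_patterns` is a hypothetical name, the default pattern list is an assumption drawn from the template above, and the check is a verbatim line match rather than a real `.dockerignore` glob evaluation.

```bash
# Print each required pattern that is not listed verbatim in the given .dockerignore.
# Assumption: the default list (node_modules, .next, .git, .env*) suits a Next.js project;
# pass other patterns as extra arguments to override it.
missing_ignore_patterns() {
  file=$1
  shift
  [ "$#" -gt 0 ] || set -- node_modules .next .git '.env*'
  for pattern in "$@"; do
    # -x: whole-line match, -F: fixed string (no regex)
    grep -qxF -e "$pattern" "$file" 2>/dev/null || echo "$pattern"
  done
}

# Example:
# missing_ignore_patterns web/.dockerignore
```

A missing file is treated the same as an empty one, so every pattern is reported.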
---
## 3. Troubleshooting
### 3.1 Decision tree
```
Pipeline failed
├── Workflow did not trigger
│   ├── Check the paths filter → the change is outside the matched paths
│   ├── Check the branch filter → the branch name does not match
│   ├── Check [skip ci] → the commit message contains the skip marker
│   └── Runner offline → check status under Gitea Admin > Runners
├── Checkout failed
│   ├── "Authentication failed" → REPO_TOKEN secret expired or invalid
│   ├── "Connection refused :3000" → Gitea service is not running
│   └── Checkout is slow → add --depth 1 for a shallow clone
├── Go build failed
│   ├── "module not found" → GOPROXY setting / go mod tidy
│   ├── "cannot find package" → go.sum is incomplete
│   └── "go: version mismatch" → Go version on the runner does not match go.mod
├── iOS build failed
│   ├── "Macro must be enabled" → add -skipMacroValidation
│   ├── "cannot find type" → xcodegen generate was not run
│   ├── "errSecInternalComponent" → unlock-keychain + set-key-partition-list
│   ├── "No signing certificate" → sign in under Xcode > Accounts and download the certificate
│   ├── "Redundant Binary Upload" → increment CURRENT_PROJECT_VERSION
│   └── "Missing required icon" → Assets.xcassets lacks the 1024x1024 icon
├── Docker build failed or slow
│   ├── "Cannot connect to daemon" → Docker Desktop is not running
│   ├── "unauthorized" / "denied" → docker login credentials expired, or ACR namespace missing
│   ├── "no space left" → docker system prune
│   ├── "transferring context: XXX MB" is slow → missing .dockerignore (node_modules is sent)
│   ├── build succeeds but push is denied → image path lacks a namespace (registry/namespace/image)
│   ├── docker compose pull times out → without arguments it also pulls postgres/redis from Docker Hub; pull only the application images
│   └── docker compose up -d also pulls → add `--no-deps gateway web` to restart only the application containers
└── Deploy failed
    ├── "Connection refused" (SSH) → SSH port or key on the target server
    ├── "health check failed" → the app starts slowly; add retries with a wait
    ├── "port already in use" → run docker compose down to stop the old containers first
    ├── "no such service: xxx" → the server compose file and the CI config disagree
    ├── health check fails but the container is running → the port in the curl URL does not match the actual service port
    ├── --no-deps skipped nginx → the health check goes through port 80 but nginx is not started
    ├── gateway has no port mapping → the prod compose file exposes no ports; check with docker exec
    └── nginx crash "upstream not allowed" → nginx.conf is mounted at /etc/nginx/nginx.conf and overrides the main config; mount it at /etc/nginx/conf.d/default.conf
```
### 3.2 Common error quick reference
| Error | Cause | Fix |
|------|------|------|
| `errSecInternalComponent` | The SSH session cannot access the Keychain | `security unlock-keychain` + `set-key-partition-list` |
| `Macro "X" must be enabled` | Swift Macros security restriction | `-skipMacroValidation` |
| `cannot find type 'Foo'` | The xcodeproj does not include the new file | `xcodegen generate` |
| `Redundant Binary Upload` | Duplicate build number | Increment `CURRENT_PROJECT_VERSION` |
| `Cloud signing permission error` | Insufficient API Key permissions or wrong Issuer ID | Use manual signing + a local profile |
| `HTTP 401 Unauthorized` (ASC API) | JWT is missing the `kid` header | `headers={"kid": KEY_ID}` |
| `No profiles for bundle id` | No distribution profile | Create and install one in Apple Developer |
| `transferring context: 768MB` | Missing .dockerignore | Create a .dockerignore that excludes node_modules/.next/.git |
| `denied: requested access` (push) | ACR image path lacks a namespace | registry/**namespace**/image |
| `docker compose pull` times out | It pulled postgres/redis from Docker Hub | `docker compose pull gateway web` pulls only the application images |
| `docker compose up -d` also times out | `up` implicitly pulls every service | `docker compose up -d --no-deps gateway web` |
| Health check fails but the container is running | curl URL port ≠ service port | Check nginx (80) vs gateway (8080); call `curl :8080/health` directly |
| nginx not started after `--no-deps` | nginx was skipped by `--no-deps` | List it explicitly: `--no-deps gateway web nginx` |
| `no such service: xxx` | The server compose file lacks the service | SSH in and inspect the actual compose file |
| gateway is healthy but curl cannot reach it | The prod compose file has no port mapping | `docker exec <container> wget -q -O- localhost:8080/health` |
| nginx `upstream not allowed` | nginx.conf is mounted at /etc/nginx/nginx.conf | Mount it at `/etc/nginx/conf.d/default.conf` instead |
| `missing icon file 120x120` | No App Icon asset | Create Assets.xcassets + AppIcon |
| `UIInterfaceOrientation` iPad | Missing iPad orientation declarations | All four orientations + `UIRequiresFullScreen` |
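
A first-pass triage of a build log can be automated by searching for the signatures in the table. The sketch below covers only four of the rows; `diagnose_log` is a hypothetical helper, and the hint texts are paraphrased from the "Fix" column.

```bash
# Read a build log on stdin and print a hint for each known error signature.
# Assumption: plain substring matching is enough for a first pass.
diagnose_log() {
  log=$(cat)
  hint() { case "$log" in *"$1"*) echo "$1 -> $2" ;; esac; }
  hint "errSecInternalComponent" "unlock the keychain and run set-key-partition-list"
  hint "Redundant Binary Upload" "increment CURRENT_PROJECT_VERSION"
  hint "denied: requested access" "add the namespace to the image path"
  hint "no such service" "compare the server compose file with the CI config"
}

# Example:
# diagnose_log < /tmp/test.log
```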
### 3.3 Debugging tips
```bash
# Check Gitea runner status
curl -s -H "Authorization: token <TOKEN>" \
  "http://<gitea>/api/v1/repos/<org>/<repo>/actions/runners"
# List recent workflow runs (the URL is quoted so the shell does not glob "?")
curl -s -H "Authorization: token <TOKEN>" \
  "http://<gitea>/api/v1/repos/<org>/<repo>/actions/runs?limit=5"
# Reproduce the CI environment locally
# Go
docker run --rm -v "$(pwd)":/app -w /app golang:1.25 go build ./cmd/gateway/
# iOS: only possible on macOS
ssh bjwework "cd ~/workspace/xiaoqu-ai/ios && swift test"
```
---
## 4. Security
### 4.1 Secrets management
```bash
# Configure secrets through the Gitea API (do not hand-edit workflow files)
curl -X PUT -H "Authorization: token <ADMIN_TOKEN>" \
  -H "Content-Type: application/json" \
  "http://<gitea>/api/v1/repos/<org>/<repo>/actions/secrets/<NAME>" \
  -d '{"data": "<VALUE>"}'
```
**Required secrets**
| Secret | Purpose | Rotation |
|--------|------|---------|
| `REPO_TOKEN` | Git clone authentication | As needed |
| `ACR_USERNAME` / `ACR_PASSWORD` | Docker image push | Every 90 days |
| `SSH_PRIVATE_KEY` | Server deployment | As needed |
| `KEYCHAIN_PASSWORD` | Unlocking macOS signing | When the password changes |
| `ASC_KEY_ID` / `ASC_ISSUER_ID` | App Store Connect | As needed |
| `FEISHU_WEBHOOK` | Notifications | Does not expire |
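### 4.2 Leak-prevention checklist
- [ ] `.gitignore` includes `.env`, `*.p8`, `*.pem`, `*.mobileprovision`
- [ ] Workflows contain no hard-coded passwords or tokens; everything goes through `${{ secrets.* }}`
- [ ] Scripts use `${VAR:?error}` to require environment variables (no default values that expose credentials)
- [ ] Docker images do not contain `.env` files (every Dockerfile has a `.dockerignore`)
- [ ] Git remote URLs contain no token (inject it via secrets)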
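
The `${VAR:?error}` item in the checklist relies on standard shell parameter expansion: if the variable is unset or empty, the shell prints the message and exits with a non-zero status. A minimal sketch, assuming a deploy script that needs the two ACR secrets; `require_deploy_env` is a hypothetical function name.

```bash
# Abort with a clear message when a required variable is unset or empty.
# ":" is the no-op builtin; it exists only to trigger the expansion.
require_deploy_env() {
  : "${ACR_USERNAME:?ACR_USERNAME must be set}"
  : "${ACR_PASSWORD:?ACR_PASSWORD must be set}"
}

# Example (top of a deploy script):
# require_deploy_env
# echo "$ACR_PASSWORD" | docker login --username "$ACR_USERNAME" --password-stdin "$REGISTRY"
```

Failing fast this way avoids the alternative of `${VAR:-default}`, which would bake a fallback credential into the script.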
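### 4.3 Pre-commit check
```bash
# Scan the files about to be committed for likely secrets
# (-z / -0 keep file names with spaces intact)
git diff --cached --name-only -z | xargs -0 grep -lE \
  '(PRIVATE KEY|password|secret|token|apikey)' 2>/dev/null
```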
---
## 5. Monitoring
### 5.1 Checking pipeline status
```bash
# Recent runs
curl -s -H "Authorization: token <TOKEN>" \
  "http://<gitea>/api/v1/repos/<org>/<repo>/actions/runs?limit=5" | \
  python3 -c "
import json, sys
for r in json.load(sys.stdin).get('workflow_runs', []):
    print(f\"{r['id']} | {r['display_title'][:40]} | {r['status']} | {r['conclusion']}\")
"
```
### 5.2 Feishu notification template
```yaml
# Success/failure notification (last step of the workflow, with if: always())
- name: Notify
  if: always()
  run: |
    STATUS="${{ job.status }}"
    EMOJI=$([ "$STATUS" = "success" ] && echo "✅" || echo "❌")
    COLOR=$([ "$STATUS" = "success" ] && echo "green" || echo "red")
    cat > /tmp/notify.json << EOF
    {
      "msg_type": "interactive",
      "card": {
        "header": {
          "title": {"tag": "plain_text", "content": "$EMOJI <App> $STATUS"},
          "template": "$COLOR"
        },
        "elements": [{
          "tag": "div",
          "text": {"tag": "lark_md", "content": "**Branch**: ${{ github.ref_name }}\n**Commit**: ${{ github.sha }}\n**Trigger**: ${{ github.event.head_commit.message }}"}
        }]
      }
    }
    EOF
    curl -s -X POST "${{ secrets.FEISHU_WEBHOOK }}" \
      -H "Content-Type: application/json" -d @/tmp/notify.json || true
```
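One caveat with the template above: a commit message that contains double quotes, backslashes, or line breaks would produce invalid JSON when interpolated directly. A possible mitigation is to escape the text first. The sketch below handles only those three cases (other control characters are left alone), and `json_escape` is a hypothetical helper name.

```bash
# Escape text for embedding inside a JSON string literal.
# Handles backslashes, double quotes, and newlines only.
json_escape() {
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# Example:
# MESSAGE=$(json_escape "$(git log -1 --pretty=%B)")
# ...then use "$MESSAGE" in the heredoc instead of the raw commit message.
```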
### 5.3 Build time tracking
Add timestamps at the start and end of the workflow:
```yaml
steps:
  - name: Start Timer
    run: echo "BUILD_START=$(date +%s)" >> $GITHUB_ENV
  # ... build steps ...
  - name: Report Duration
    if: always()
    run: |
      DURATION=$(( $(date +%s) - $BUILD_START ))
      echo "Build duration: ${DURATION}s"
```
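For longer builds, a minutes-and-seconds format is easier to scan in logs and notifications than raw seconds. A small sketch; `format_duration` is a hypothetical helper that could wrap `$DURATION` in the step above.

```bash
# Format a duration given in whole seconds as "<minutes>m<seconds>s".
format_duration() {
  printf '%dm%02ds\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

# Example:
# echo "Build duration: $(format_duration "$DURATION")"
```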
---
## 6. Runner management
### 6.1 Runner types
| Runner | Label | Purpose | Location |
|--------|------|------|------|
| xiaoqu-runner | `self-hosted` | Go + Web + Docker | Alibaba Cloud 39.104.65.241 |
| bjwework-macos | `macos-arm64` | iOS + Swift | Tailscale 100.69.230.116 |
### 6.2 Adding a runner
```bash
# 1. Get a registration token
curl -s -H "Authorization: token <ADMIN_TOKEN>" \
  "http://<gitea>/api/v1/repos/<org>/<repo>/actions/runners/registration-token"
# 2. Register
./act_runner register --no-interactive \
  --instance http://<gitea> \
  --token <TOKEN> \
  --name <NAME> \
  --labels <LABEL>:host
# 3. Start (macOS uses launchd)
launchctl load ~/Library/LaunchAgents/com.gitea.act-runner.plist
```
### 6.3 Runner health check
```bash
# Check the runner process
ssh bjwework "launchctl list | grep act-runner"
# Check the runner log
ssh bjwework "tail -20 ~/act_runner/runner.log"
# Check the runner status reported by Gitea
curl -s -H "Authorization: token <TOKEN>" \
  "http://<gitea>/api/v1/repos/<org>/<repo>/actions/runners" | \
  python3 -c "import json,sys; [print(f\"{r['name']} | {r['status']}\") for r in json.load(sys.stdin)]"
```
---
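## 7. Workflow template generation
### `/cicd analyze` checklist
When run, the command automatically scans the following items:
1. **Workflow YAML**: syntax check, path filters, concurrency cancellation, [skip ci]
2. **Docker context**: whether every directory with a Dockerfile has a `.dockerignore` (**required check**)
3. **Secrets**: whether workflows contain hard-coded credentials or paths
4. **Caching**: whether dependency caches are used (npm/Go/SPM)
5. **Shallow clone**: whether checkout uses `--depth 1`
6. **Image naming**: whether the ACR/registry path includes a namespace
```bash
# Quick scan commands
echo "=== .dockerignore check ==="
# The path is passed as $1 rather than spliced into the script, so odd file names are safe
find . -name Dockerfile -exec sh -c 'DIR=$(dirname "$1"); [ -f "$DIR/.dockerignore" ] && echo "✅ $DIR" || echo "❌ $DIR is missing .dockerignore"' _ {} \;
echo "=== Hard-coded credential check ==="
grep -rn 'password\|secret\|token' .gitea/workflows/ | grep -v 'secrets\.' | grep -v '#'
```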
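Item 6 of the checklist (image naming) can be checked mechanically as well. The sketch below treats an image reference as valid when it has at least three slash-separated parts, that is `registry/namespace/image`; `has_namespace` is a hypothetical helper and `registry.example.com` is a placeholder host.

```bash
# Return 0 when an image reference has the form registry/namespace/image[:tag].
# Tags cannot contain slashes, so counting path segments is sufficient here.
has_namespace() {
  case "$1" in
    */*/*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Example:
# has_namespace "$REGISTRY/$WEB_IMAGE" || echo "image path lacks a namespace"
```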
### `/cicd template go`
Generates a Go backend CI workflow covering vet → test → build → docker → deploy.
### `/cicd template ios`
Generates an iOS TestFlight workflow covering xcodegen → test → archive → upload → notify.
### `/cicd template web`
Generates a Next.js CI workflow covering install → build → docker → deploy.
### `/cicd template docker`
Generates a Docker multi-service build+push workflow covering ACR login → multi-image build → SSH deploy.
---
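## 8. CD pre-deployment verification checklist
**Confirm every item before each change to the deploy steps:**
```
1. Which services does the server compose file define?
   → ssh <server> "docker compose -f <file> config --services"
2. Which services does the CI deploy step start?
   → grep "up -d" .gitea/workflows/ci-cd.yml
3. Which port does the health check URL point to?
   → grep "curl.*health" .gitea/workflows/ci-cd.yml
4. Which service serves that port?
   → port 80 = nginx, port 8080 = gateway, port 3001 = web
5. Is that service in the deploy start list?
   → if the health check goes through nginx:80, the deploy must include nginx
6. Are the base services (postgres/redis) already running?
   → check with docker compose ps; do not restart them from CI
7. Is Docker Hub reachable?
   → servers in mainland China must configure a registry mirror, or pull only ACR images
```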
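Steps 1, 2, and 5 amount to a set comparison between the services the compose file defines and the services the deploy step starts. A minimal sketch of that comparison; `undefined_services` is a hypothetical helper, and its first argument is assumed to be the newline-separated output of `docker compose config --services`.

```bash
# Print every service in the deploy list that the compose file does not define.
# $1: newline-separated list of defined services
# remaining arguments: services the CI deploy step starts
undefined_services() {
  defined=$1
  shift
  for svc in "$@"; do
    printf '%s\n' "$defined" | grep -qxF -e "$svc" || echo "$svc"
  done
}

# Example:
# defined=$(ssh <server> "docker compose -f <file> config --services")
# undefined_services "$defined" gateway web nginx
```

Any name it prints would cause `no such service` at deploy time.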
**Standard deploy command template:**
```bash
# Pull only the application images (do not touch Docker Hub)
docker compose -f docker-compose.prod.yml pull gateway web
# Restart only the application containers + nginx (leave postgres/redis alone)
docker compose -f docker-compose.prod.yml up -d --no-deps gateway web nginx
# Check the gateway port directly (does not depend on nginx)
sleep 10
curl -sf http://localhost:8080/health
```
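The fixed `sleep 10` can be too short for a slow start and wastefully long for a fast one. A retry loop is one alternative, in line with the "add retries with a wait" advice in the decision tree. This is a sketch: `wait_until` is a hypothetical helper, and the attempt count and delay in the example are arbitrary.

```bash
# Retry a check command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command...>
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    # Do not sleep after the final failed attempt
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example: up to 6 attempts, 5 seconds apart
# wait_until 6 5 curl -sf http://localhost:8080/health
```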
---
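## 9. Relationship to other skills
| Skill | Collaboration point |
|------|--------|
| `dev-deploy` | `/deploy ios` runs the TestFlight deployment; `/deploy docker` runs the container deployment |
| `dev-coding` | Triggers CI once development is complete |
| `req` | `/req deploy` runs project-level batch deployment |
| `pull-request` | A PR triggers the CI checks |
| `req-test-gate` | Test gate inside CI |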