Next.js Operations

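We keep hearing about front-end operations, but what is it, and how does it differ from the work of a regular operations engineer? This article takes the most basic view of what front-end operations involves and how to do it well.

This article assumes the following setup: Yunxiao (云效) + K8s + a Next.js project.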

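1. Operations fundamentals

To handle front-end operations work, you need some prerequisite knowledge. You do not need to master it in the depth a professional operations engineer would; knowing the basic concepts and what each thing is will do. This includes, but is not limited to:

HTTPS basics
nginx basics
K8s basics
General computer fundamentals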

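1.1 nginx basics

Further reading: nginx 极简教程 (a minimal nginx tutorial), https://dunwu.github.io/nginx-tutorial/#/

What nginx does

Reverse proxy: forwards client requests to backend servers
Load balancing: distributes requests across multiple backend servers
Static file serving: serves static assets directly
SSL termination: handles HTTPS encryption and decryption

Key nginx syntax and directives

Variables

nginx provides a rich set of built-in variables and also supports custom variables: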

server {
    # Custom variable defined with `set`
    set $custom_var "hello";

    location / {
        # Commonly used built-in variables, echoed as response headers for debugging
        add_header X-Real-IP $remote_addr;
        add_header X-Forwarded-For $proxy_add_x_forwarded_for;
        add_header X-Request-URI $request_uri;
        add_header X-Args $args;
        add_header X-Host $host;
        add_header X-Scheme $scheme;
        add_header X-Server-Name $server_name;
        add_header X-Request-Time $request_time;

        # Custom variable: choose an upstream group based on the User-Agent
        # ("default" and "mobile" must exist as upstream blocks, otherwise a resolver is needed)
        set $backend_pool "default";
        if ($http_user_agent ~* "mobile") {
            set $backend_pool "mobile";
        }

        proxy_pass http://$backend_pool;
    }
}

Conditional syntax

server {
    location / {
        # Basic if statement
        if ($request_method = POST) {
            return 405;
        }

        # Regular expression match (case-insensitive)
        if ($http_user_agent ~* "(bot|crawler|spider)") {
            return 403;
        }

        # File existence check
        if (!-f $request_filename) {
            rewrite ^.*$ /index.html last;
        }

        # Combining conditions (nginx has no AND/OR, so use a flag variable)
        set $mobile "";
        if ($http_user_agent ~* "(mobile|android|iphone)") {
            set $mobile "M";
        }
        if ($args ~ "force_desktop=1") {
            set $mobile "";
        }
        if ($mobile = "M") {
            rewrite ^(.*)$ /mobile$1 last;
        }
    }
}

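Rewrite rules

server {
    # Basic rewrite (301 permanent redirect)
    rewrite ^/old-path/(.*)$ /new-path/$1 permanent;

    # Rewrite inside a location
    location /api {
        # Strip the /api prefix before proxying
        rewrite ^/api/(.*)$ /$1 break;
        proxy_pass http://backend;
    }

    # More complex rewrite rule
    location ~* ^/product/(\d+)/?$ {
        # Rewrite product pages
        rewrite ^/product/(\d+)/?$ /products.php?id=$1 last;
    }

    # Fallback for single-page applications
    location / {
        try_files $uri $uri/ /index.html;
    }

    # Rewrite that carries a query parameter
    location /search {
        # $arg_q holds the q query parameter. A $1 captured by the if regex would be
        # reset by the rewrite regex, so the built-in variable is used instead.
        # The trailing "?" stops nginx from appending the original query string.
        if ($arg_q) {
            rewrite ^/search$ /search-results.html?query=$arg_q? redirect;
        }
    }
}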

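Load balancing configuration

# Upstream server definitions
upstream nextjs_backend {
    # Load-balancing method: least_conn or ip_hash (fair requires a third-party module)
    least_conn;

    # Server definitions; max_fails and fail_timeout provide passive health checking
    server 127.0.0.1:3001 weight=3 max_fails=2 fail_timeout=30s;
    server 127.0.0.1:3002 weight=2 max_fails=2 fail_timeout=30s;
    server 127.0.0.1:3003 weight=1 backup;  # backup server

    # Keep up to 32 idle upstream connections per worker (connection reuse, not a health check)
    keepalive 32;
}

upstream nextjs_static {
    server static1.nextjs.com;
    server static2.nextjs.com;
}

server {
    location /api {
        proxy_pass http://nextjs_backend;

        # Required for upstream keepalive connections to be reused
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Load-balancing details for debugging. These values are only known after an
        # upstream has been chosen, so they go in response headers, not request headers.
        add_header X-Upstream $upstream_addr;
        add_header X-Response-Time $upstream_response_time;
    }

    location /static {
        proxy_pass http://nextjs_static;
    }
}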
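
To make the weight parameters more concrete, here is a minimal sketch of smooth weighted round-robin, the scheme nginx uses by default when no method is named. Note that the config above selects least_conn instead, which also considers active connection counts, so this is only a simplified model of how weights skew the distribution. The names Peer, pickPeer, and simulate are hypothetical helpers, not part of nginx.

```typescript
// Simplified model of smooth weighted round-robin, for illustration only.
interface Peer {
  name: string;
  weight: number;        // configured weight, as in `server ... weight=3`
  currentWeight: number; // running score, starts at 0
}

function pickPeer(peers: Peer[]): Peer {
  if (peers.length === 0) throw new Error("no peers configured");
  const totalWeight = peers.reduce((sum, p) => sum + p.weight, 0);
  // Every peer earns its weight on each round...
  for (const p of peers) p.currentWeight += p.weight;
  // ...the peer with the highest running score wins (first one wins ties)...
  let best = peers[0];
  for (const p of peers) {
    if (p.currentWeight > best.currentWeight) best = p;
  }
  // ...and the winner pays back the total, which spreads its turns out.
  best.currentWeight -= totalWeight;
  return best;
}

function simulate(weights: Record<string, number>, requests: number): string[] {
  const peers: Peer[] = Object.keys(weights).map((name) => ({
    name,
    weight: weights[name],
    currentWeight: 0,
  }));
  const order: string[] = [];
  for (let i = 0; i < requests; i++) order.push(pickPeer(peers).name);
  return order;
}

// The two primary servers from the config above (the backup server is left out).
const order = simulate({ "127.0.0.1:3001": 3, "127.0.0.1:3002": 2 }, 5);
console.log(order.join(" -> "));
```

Over any five consecutive requests, the weight=3 server is picked three times and the weight=2 server twice, and the picks are interleaved instead of sent in bursts.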

Cache control

server {
    # Static file caching
    location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
        expires 1y;
        add_header Cache-Control "public, immutable";
        add_header Vary Accept-Encoding;

        # Versioned assets: cache long when a v= parameter is present, briefly otherwise
        location ~* \.(css|js)$ {
            if ($args ~ "v=([0-9.]+)") {
                expires 1y;
            }
            if ($args !~ "v=") {
                expires 1h;
            }
        }
    }

    # API caching
    location /api/cache {
        proxy_cache api_cache;
        proxy_cache_valid 200 302 10m;
        proxy_cache_valid 404 1m;
        proxy_cache_key "$scheme$request_method$host$request_uri";

        # Expose the cache status (HIT, MISS, ...) in a response header
        add_header X-Cache-Status $upstream_cache_status;

        proxy_pass http://backend;
    }
}

# The cache zone itself is declared in the http context, which encloses the server block above
http {
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:10m inactive=60m;
}

Security-related configuration

server {
    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;
    add_header Referrer-Policy "strict-origin-when-cross-origin" always;
    add_header Content-Security-Policy "default-src 'self'; script-src 'self' 'unsafe-inline'" always;

    # Hide the nginx version
    server_tokens off;

    # Limit the request body size
    client_max_body_size 10M;

    # Block access to hidden (dot) files
    location ~ /\. {
        deny all;
        access_log off;
        log_not_found off;
    }

    # Block access to sensitive files
    location ~* \.(env|git|svn|htaccess|htpasswd)$ {
        deny all;
    }
}

Common nginx commands:

# Test the configuration file syntax
nginx -t

# Reload the configuration without interrupting service
nginx -s reload

# Check the nginx service status
systemctl status nginx

# Follow the error log
tail -f /var/log/nginx/error.log

# Follow the access log
tail -f /var/log/nginx/access.log

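1.2 K8s

Core Docker concepts

Image: a read-only template for an application
Container: a running instance of an image
Dockerfile: the instruction file used to build an image

The Dockerfile of the Next.js project, annotated: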

# Multi-stage build to keep the final image small
FROM node:20 AS base
WORKDIR /src
COPY package.json pnpm-lock.yaml /src/
RUN npm i -g pnpm && pnpm config -g set registry https://npm.nextmar.com && pnpm i

# Build stage
FROM base AS builder
ARG APP_ENV
ARG AREA
COPY . /src
RUN pnpm run build:${AREA}:${APP_ENV}

# Production image
FROM node:20
# Set the container time zone so log timestamps are in Asia/Shanghai
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
WORKDIR /app

# Create the log directory and link the Next.js cache into it
# (the directory already exists when ln runs, so the link is created inside it,
# at /app/web/logs/cache/cache)
RUN mkdir -p /app/web/logs/cache
RUN ln -s /app/.next/cache/ /app/web/logs/cache

# Copy the build output
COPY --from=builder /src/.next/standalone /app
COPY --from=builder /src/.next/static /app/.next/static
COPY --from=builder /src/public /app/public

CMD ["node","server.js"]
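
The Dockerfile copies /src/.next/standalone and starts server.js, and Next.js only produces that directory when standalone output is enabled. The project's actual config file is not shown in this article, so the following is a minimal sketch of the setting it must contain; the file name next.config.ts is an assumption (older Next.js versions use next.config.js or next.config.mjs).

```typescript
// next.config.ts (assumed file name)
const nextConfig = {
  // Emit .next/standalone with a minimal server.js and only the node_modules it needs.
  output: "standalone" as const,
};

export default nextConfig;
```

Standalone output does not include the public directory or .next/static, which is why the Dockerfile copies those two directories separately.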

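Containerization best practices:

Use multi-stage builds to reduce image size
Use .dockerignore to exclude unnecessary files
Set the correct time zone so log timestamps are not confusing
Run as a non-root user to improve security
Set sensible resource limits

Common Docker commands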

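Upload a file to dev1:

scp <local-file> username@servername:/path/filename

Download a file to your machine (run this locally; add -r for a directory):

scp username@servername:/path/filename /var/www/local_dir

Enter a Docker container

docker exec -it <container-name> bash

Copy a local file into a Docker container

docker cp <local-file> <container-name>:/home

Copy a file from a container to the host

sudo docker cp <container-name>:/<path-in-container> <host-path>

Stop and remove a container

docker stop <container-name> && docker rm <container-name>

Start a container

docker run -dit --name <container-name> -p 3000:3000 <image-repository>:1.0.1

Restart a container

docker restart fe-node-linguang

Check container status

docker ps -a | grep <container-name>

View container logs

docker logs -f <container-name>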

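2. Front-end operations in practice

2.1 Log management and monitoring

Pino logging configuration

The Next.js project uses Pino as its logger. The configuration lives in src/services/logger.ts: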

// Logger configuration, annotated.
// IS_LOCAL, isNextCompiling, LOG_LEVEL, LOG_DIR, NODE_ENV and hostname are defined elsewhere in logger.ts.
import path from 'node:path';
import pino from 'pino';

const createLoggerConfig = () => {
  if (IS_LOCAL || isNextCompiling) {
    // Local development: pretty-print to the console
    return {
      transport: {
        targets: [{
          target: 'pino-pretty',
          options: {
            colorize: true,
            translateTime: 'SYS:standard',
            ignore: 'pid,hostname',
          },
          level: 'debug',
        }],
      },
      level: 'debug',
    };
  } else {
    // Production: JSON output written to rolling files
    return {
      level: LOG_LEVEL,
      timestamp: pino.stdTimeFunctions.isoTime,
      base: {
        pid: process.pid,
        hostname,
        env: NODE_ENV,
        pod: hostname,
      },
      transport: {
        targets: [
          // Error log (level error and above)
          {
            target: 'pino-roll',
            options: {
              file: path.join(LOG_DIR, `${hostname}-error`),
              frequency: 'daily',
              size: '10m', // 10 MB; pino-roll reads a bare number as megabytes
              extension: '.log',
              dateFormat: 'yyyy-MM-dd',
              // Retention settings kept from the project config; check that the
              // pino-roll version in use supports them
              maxFiles: 140,
              maxDays: 14,
            },
            level: 'error',
          },
          // Info log (level info and above, so it also contains errors)
          {
            target: 'pino-roll',
            options: {
              file: path.join(LOG_DIR, `${hostname}-info`),
              frequency: 'daily',
              size: '10m',
              extension: '.log',
              dateFormat: 'yyyy-MM-dd',
              maxFiles: 140,
              maxDays: 14,
            },
            level: 'info',
          },
        ],
      },
    };
  }
};

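Logging best practices:

// Using the logger in application code
import { logger } from '@/services/logger';

// Log the start of a request
logger.info(`[STARTING] ${method} ${path} with RequestID: ${requestId}`);

// Log an error (pass the Error object first so Pino serializes the stack)
logger.error(err, `Failed to fetch global data [${requestId}]:`);

// Log the end of a request (same "RequestID: " format as the start line, so one grep finds both)
logger.info(`[ENDING] ${method} ${path} with RequestID: ${requestId} cost ${duration}ms`);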
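
The grep and awk commands later in this article depend on the exact wording of the start and end lines, so it can help to build those strings in one place. This is a minimal sketch under that assumption; startLine, endLine, and RequestInfo are hypothetical names, not part of the project's logger.

```typescript
// Hypothetical helpers that keep the request log format in one place.
interface RequestInfo {
  method: string;
  path: string;
  requestId: string;
}

function startLine(req: RequestInfo): string {
  return `[STARTING] ${req.method} ${req.path} with RequestID: ${req.requestId}`;
}

function endLine(req: RequestInfo, durationMs: number): string {
  // "cost <n>ms" is the token the log-analysis commands search for.
  return `[ENDING] ${req.method} ${req.path} with RequestID: ${req.requestId} cost ${Math.round(durationMs)}ms`;
}

const sample: RequestInfo = { method: "GET", path: "/api/products", requestId: "abc-123-def" };
console.log(startLine(sample));
console.log(endLine(sample, 87.4));
```

In application code the two strings would be passed to logger.info, with the duration measured between them.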

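2.2 Logging in to servers and viewing logs

Logging in to the K8s management platform

At present, all K8s logs are available in Rancher.

For production, you can also view the aggregated logs in Kibana.

After logging in to Rancher:

prod is the production container cluster and tx-dev is the development cluster. Click into either one, then:

Choose Workloads -> Deployments and find the container you want to troubleshoot, for example the Next.js project:

"Execute Shell" opens a console inside the container. "View Logs" shows everything printed to stdout, including console.log output from the front-end code and any uncaught Error output.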

Log file location and viewing

# Go to the log directory
cd /app/web/logs

# List the log files
ls -la

# Follow the error log
tail -f hostname-error-2024-01-15.log

# Follow the info log
tail -f hostname-info-2024-01-15.log

# Show the last 1000 lines
tail -n 1000 hostname-error-2024-01-15.log

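2.3 Log filtering and analysis techniques

Filtering key information with grep

# Find every log line for a specific RequestID
grep "RequestID: abc-123-def" hostname-info-2024-01-15.log

# Find error lines
grep -i "error\|exception\|failed" hostname-error-2024-01-15.log

# Find lines from a specific time window
grep "2024-01-15T10:" hostname-info-2024-01-15.log

# Find requests that took 1000 ms or longer (four or more digits)
grep "cost [0-9]\{4,\}ms" hostname-info-2024-01-15.log

# Count errors (assumes the lines contain the literal text ERROR;
# Pino's JSON output stores the level as a number, such as "level":50)
grep -c "ERROR" hostname-error-2024-01-15.log

# Find memory-related errors
grep -i "memory\|heap\|oom" hostname-error-2024-01-15.log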

More complex analysis with awk

# Analyze the distribution of API response times
awk '/cost [0-9]+ms/ {
  line = $0;
  gsub(/.*cost /, "", line);
  gsub(/ms.*/, "", line);
  time = line + 0;
  if(time < 100) fast++;
  else if(time < 500) medium++;
  else slow++;
} END {
  print "Fast(<100ms):", fast;
  print "Medium(100-500ms):", medium;
  print "Slow(>=500ms):", slow;
}' nextjs-54df689445-2wgnh-info.2025-05-25.1.log

# Count the most frequent error messages (assumes error lines contain the text ERROR)
awk '/ERROR/ {
  if (match($0, /"msg":"[^"]+"/)) {
    line = $0;
    gsub(/.*"msg":"/, "", line);
    gsub(/".*/, "", line);
    errors[line]++;
  }
} END {
  for (error in errors) print errors[error], error;
}' nextjs-54df689445-2wgnh-info.2025-05-25.1.log | sort -nr
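
The same response-time bucketing can be done in a script when awk is not available or the analysis needs to grow. This is a minimal sketch that assumes the "cost <n>ms" token from the log format above; bucketLatencies and LatencyBuckets are hypothetical names.

```typescript
// Bucket request durations parsed from "cost <n>ms" tokens, using the same thresholds as the awk script.
interface LatencyBuckets {
  fast: number;   // < 100 ms
  medium: number; // 100 to 499 ms
  slow: number;   // >= 500 ms
}

function bucketLatencies(lines: string[]): LatencyBuckets {
  const buckets: LatencyBuckets = { fast: 0, medium: 0, slow: 0 };
  for (const line of lines) {
    const match = /cost (\d+)ms/.exec(line);
    if (match === null) continue; // not a request-end line
    const ms = Number(match[1]);
    if (ms < 100) buckets.fast++;
    else if (ms < 500) buckets.medium++;
    else buckets.slow++;
  }
  return buckets;
}

const exampleLines = [
  "[ENDING] GET /a with RequestID: r1 cost 42ms",
  "[ENDING] GET /b with RequestID: r2 cost 100ms",
  "[ENDING] GET /c with RequestID: r3 cost 1234ms",
  "[STARTING] GET /d with RequestID: r4",
];
console.log(bucketLatencies(exampleLines));
```

To run it against a real file, read the file and split it on newlines before passing the lines in.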

Analyzing JSON logs with jq

# Install jq if it is missing
apt-get update && apt-get install -y jq

# Show only warn level and above (Pino levels: 30 info, 40 warn, 50 error)
jq 'select(.level >= 40)' hostname-info-2024-01-15.log

# Count log entries per level
jq -r '.level' hostname-info-2024-01-15.log | sort | uniq -c

# Find one user's activity (assumes log entries carry a userId field)
jq 'select(.userId == "12345")' hostname-info-2024-01-15.log
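
When jq cannot be installed in the container, the same filtering can be sketched in a few lines of script. The example below assumes newline-delimited JSON with Pino's numeric level field; parseLogLines, countByLevel, and atLeast are hypothetical helper names.

```typescript
// Parse newline-delimited JSON log entries and filter or count them by Pino level.
type LogEntry = { level: number; msg?: string; [key: string]: unknown };

function parseLogLines(text: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed: unknown = JSON.parse(line);
      if (typeof parsed === "object" && parsed !== null && typeof (parsed as LogEntry).level === "number") {
        entries.push(parsed as LogEntry);
      }
    } catch {
      // Skip lines that are not valid JSON (for example, stray stack traces).
    }
  }
  return entries;
}

function countByLevel(entries: LogEntry[]): Record<number, number> {
  const counts: Record<number, number> = {};
  for (const entry of entries) counts[entry.level] = (counts[entry.level] ?? 0) + 1;
  return counts;
}

// Equivalent of jq 'select(.level >= N)'.
function atLeast(entries: LogEntry[], minLevel: number): LogEntry[] {
  return entries.filter((entry) => entry.level >= minLevel);
}

const exampleLog = [
  '{"level":30,"msg":"request started"}',
  '{"level":40,"msg":"slow upstream"}',
  '{"level":50,"msg":"fetch failed"}',
].join("\n");
const exampleEntries = parseLogLines(exampleLog);
console.log(countByLevel(exampleEntries), atLeast(exampleEntries, 40).length);
```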

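Analyzing logs with Kibana

All of the methods above work on a single server. When an investigation spans multiple requests from a user, Kibana is the better choice because it aggregates the logs from every instance in one place. Kibana's search bar accepts two query languages: KQL (Kibana Query Language) and the Lucene query syntax. Both support field matching, wildcards, and boolean operators such as AND, OR, and NOT. The bracketed range, _exists_, and regular-expression forms shown below are Lucene syntax; KQL writes ranges with comparison operators (for example, status >= 400) and does not support regular expressions. Some examples: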

# Basic field queries
level: "error"
status: 500
method: "POST"

# Range queries (bracket form is Lucene syntax)
status: [400 TO 499]
response_time: >1000
timestamp: ["2024-01-15T00:00:00" TO "2024-01-15T23:59:59"]

# Wildcard queries
message: "database*"
user_agent: "*mobile*"
path: "/api/*/users"

# Boolean queries
level: "error" AND service: "nextjs"
status: 500 OR status: 502
NOT level: "debug"

# Field existence queries (Lucene syntax)
_exists_: error_code
NOT _exists_: user_id

# Regular expression queries (Lucene syntax)
message: /timeout|connection.*failed/
path: /\/api\/v[0-9]+\/.*/

Common Kibana search scenarios

1. Error log analysis

# Find all error logs
level: "error"

# Find errors in a specific time window
level: "error" AND @timestamp: ["now-1h" TO "now"]

# Find errors containing specific keywords
level: "error" AND message: (*timeout* OR *connection* OR *database*)

# Find errors from a specific service
level: "error" AND service: "nextjs" AND pod: nextjs-*

# Find errors that carry an error type (for grouping by type)
level: "error" AND error.type: *

2. Performance analysis queries

# Find slow requests (response time over 1 second)
response_time: >1000

# Check the performance of a specific API
path: "/api/products*" AND response_time: >500

# Find logs reporting high CPU usage
cpu_usage: >80

# Find abnormal memory usage
memory_usage: >512 OR message: (*memory* OR *heap* OR *oom*)

# Find slow database queries
message: "database query" AND duration: >1000

3. User behavior analysis

# Find a specific user's activity
user_id: "12345"

# Find failed login attempts
message: "*login*" AND status: (400 OR 401 OR 403)

# Find payment-related operations
path: "/api/payment*" OR message: "*payment*"

# Find requests from mobile users
user_agent: (*mobile* OR *android* OR *iphone*)

# Find users from specific regions
geo.country: "CN" OR geo.country: "US"

4. System monitoring queries

# Find service restart records
message: ("starting" OR "restart" OR "shutdown")

# Find resource shortage warnings
level: "warn" AND message: (*memory* OR *disk* OR *cpu*)

# Find network-related problems
message: (*network* OR *connection* OR *timeout* OR *refused*)

# Find container-related events
message: (*pod* OR *container* OR *kubernetes*)

# Find load balancer health checks
path: "/ping" OR path: "/health" OR message: "*health*"

Kibana aggregation query examples

1. Error count aggregation

{
  "aggs": {
    "error_by_service": {
      "terms": {
        "field": "service.keyword",
        "size": 10
      },
      "aggs": {
        "error_count": {
          "filter": {
            "term": {
              "level": "error"
            }
          }
        }
      }
    }
  }
}

2. Response time analysis

{
  "aggs": {
    "response_time_stats": {
      "stats": {
        "field": "response_time"
      }
    },
    "response_time_percentiles": {
      "percentiles": {
        "field": "response_time",
        "percents": [50, 90, 95, 99]
      }
    }
  }
}

3. Time series analysis

{
  "aggs": {
    "requests_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1m"
      },
      "aggs": {
        "avg_response_time": {
          "avg": {
            "field": "response_time"
          }
        },
        "error_rate": {
          "filter": {
            "range": {
              "status": {
                "gte": 400
              }
            }
          }
        }
      }
    }
  }
}
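
To make the percentiles aggregation above easier to interpret, here is a minimal sketch of what p50, p90, p95, and p99 mean, using the exact nearest-rank method on an in-memory list. Elasticsearch computes percentiles with an approximation algorithm over much larger data sets, so its results can differ slightly; the function name percentile and the sample values are illustrative only.

```typescript
// Nearest-rank percentile: the smallest value such that at least p% of samples are <= it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no values");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Illustrative response times in milliseconds.
const responseTimes = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000];
for (const p of [50, 90, 95, 99]) {
  console.log(`p${p}: ${percentile(responseTimes, p)}ms`);
}
```

A p95 of 1000 ms reads as "95% of requests finished in 1000 ms or less", which is usually a better alerting signal than the average because a few very slow requests do not get hidden.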

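Common Kibana Dashboard visualizations

Reference article: "kibana如何制作出好看酷炫的图表？" (How to build good-looking charts in Kibana)

Alerting and monitoring setup

1. Error rate alert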

{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": ["nextjs-logs-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-5m"
                    }
                  }
                },
                {
                  "term": {
                    "level": "error"
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 10
      }
    }
  }
}

2. Response time alert

{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["nextjs-logs-*"],
        "body": {
          "query": {
            "range": {
              "@timestamp": {
                "gte": "now-5m"
              }
            }
          },
          "aggs": {
            "avg_response_time": {
              "avg": {
                "field": "response_time"
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.aggregations.avg_response_time.value": {
        "gt": 1000
      }
    }
  }
}

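Practical Kibana tips

1. Quick time range selection

# Common time range shortcuts
Last 15 minutes: now-15m
Last 1 hour: now-1h
Last 24 hours: now-1d
Last 7 days: now-7d
This week: now/w
This month: now/M

# Custom time range
From: 2024-01-15T09:00:00
To: 2024-01-15T17:00:00

2. Filtering and sorting fields

# On the Discover page
1. Click the "+" next to a field name to add it to the table
2. Click a field value to filter on it
3. Use the "Sort" button to sort the results
4. Save the search as a "Saved Search"

3. Exporting and sharing

# Export data
1. Click "Share" on the Discover page
2. Choose "CSV Reports" to export a CSV
3. Or choose "Permalink" to share a link

# Create a Dashboard
1. Save several visualizations
2. Combine them into one Dashboard
3. Set the auto-refresh interval
4. Share the Dashboard link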

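2.4 Case studies

Case 1: troubleshooting slow page loads

# 1. Find slow requests (5000 ms or longer)
grep "cost [5-9][0-9][0-9][0-9]ms\|cost [0-9]\{5,\}ms" hostname-info-2024-01-15.log

# 2. Group slow requests by URL
# (adjust the awk field number to wherever the path appears in your log lines)
grep "cost [5-9][0-9][0-9][0-9]ms" hostname-info-2024-01-15.log | \
awk '{print $4}' | sort | uniq -c | sort -nr

# 3. Check system load during a specific time window
grep "2024-01-15T14:" hostname-info-2024-01-15.log | \
grep -E "memory|cpu|load"

Case 2: troubleshooting user login failures

# 1. Find login-related errors
grep -i "login\|auth\|token" hostname-error-2024-01-15.log

# 2. Find login attempts by a specific user
grep "userId.*12345" hostname-info-2024-01-15.log | grep -i "login"

# 3. Count login failures by reason
# (the field selected by awk depends on the JSON key order in your log lines)
grep "login.*failed" hostname-error-2024-01-15.log | \
awk -F'"' '{print $4}' | sort | uniq -c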

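3. Pipeline operations and troubleshooting

3.1 Standard troubleshooting procedure for development server failures

Step 1: quick status check

Log in to Rancher as described in section 2.2 and check the container status. If the container never finishes starting and keeps restarting in a loop, as in this state:

there are two likely causes:

The service does not expose a health check, or its health check endpoint is failing
The service itself fails to start

Step 2: detailed diagnosis

Work through the causes above one by one:

Because the container cannot start, you cannot open a terminal in it, but you can still view the last logs it printed before restarting.
Build the project locally, start it the same way the Dockerfile does, then test the health check endpoint and confirm whether the service starts normally.
If the problem cannot be reproduced locally, temporarily stop the service from starting, ask the operations team to disable the container's checks, then log in to the container directly and look through the logs.

Based on recent incidents, the usual causes are:

Environment variables are misconfigured or missing, so the service fails to start
The health check endpoint is missing or does not return the expected "ok"
The code has a bug that makes it fail and restart (this can also be build-related and needs careful investigation)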
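
The first two causes can be guarded against in code. The following is a minimal sketch, assuming the project uses the Next.js App Router and that the platform's health check expects a plain "ok" body; the route path app/api/health/route.ts, the helper missingEnv, and the variable names in REQUIRED_ENV are all hypothetical and should be replaced with the project's real ones.

```typescript
// app/api/health/route.ts (hypothetical path)

// Hypothetical list of variables the service cannot start without.
const REQUIRED_ENV = ["API_BASE_URL", "LOG_DIR"];

// Returns the names of required variables that are unset or empty.
function missingEnv(required: string[], env: Record<string, string | undefined>): string[] {
  return required.filter((name) => !env[name]);
}

// Health check handler: answers 200 with the body "ok".
export function GET(): Response {
  return new Response("ok", {
    status: 200,
    headers: { "Content-Type": "text/plain" },
  });
}

// At startup, something like the following turns a silent misconfiguration
// into an explicit message in the last logs before a restart:
//   const missing = missingEnv(REQUIRED_ENV, process.env);
//   if (missing.length > 0) throw new Error(`Missing env: ${missing.join(", ")}`);
```

When reproducing a restart loop locally, requesting this endpoint after starting node server.js is a quick way to confirm whether the health check itself is the problem.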

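3.2 Pipeline troubleshooting

Pipeline failures are easier to diagnose. Start with the error in the pipeline log and analyze it. The usual causes are:

The build script is broken, so the build fails
The code has a problem, so the build fails
The target service does not exist, so the deployment fails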

These are the pipeline configurations we currently use:

Development build:

The environment variables passed in differ from project to project.

docker build --build-arg APP_ENV=${APP_ENV} -t mirror.xmslol.cn:8443/tools/anser-node:${CI_COMMIT_ID_1}-${BUILD_NUMBER} .

docker push mirror.xmslol.cn:8443/tools/anser-node:${CI_COMMIT_ID_1}-${BUILD_NUMBER}

docker rmi mirror.xmslol.cn:8443/tools/anser-node:${CI_COMMIT_ID_1}-${BUILD_NUMBER}

Production build:

docker build --build-arg APP_ENV=${APP_ENV} -t mirror.xmslol.cn:8443/tools/anser-node:${CI_COMMIT_REF_NAME} .

docker push mirror.xmslol.cn:8443/tools/anser-node:${CI_COMMIT_REF_NAME}

docker rmi mirror.xmslol.cn:8443/tools/anser-node:${CI_COMMIT_REF_NAME}

Deploying to the development environment:

# Image name
imageName="mirror.xmslol.cn:8443/tools/anser-node:${CI_COMMIT_ID_1}-${BUILD_NUMBER}"
# Containerized deployment to the development environment
kubectl set image deployment/anser-f2 anser-node-f2="${imageName}" -n tools

Deploying to production:

# Image name
imageName="mirror.xmslol.cn:8443/tools/anser-node:${CI_COMMIT_REF_NAME}"
# Containerized deployment to production
kubectl set image deployment/anser-f2 anser-node-f2="${imageName}" -n tools

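3.3 Emergency response procedure

3.3.1 Service completely unavailable

Emergency rollback: the rollback mechanism still needs to be worked out with the operations team. We tested it on Yunxiao once and could not confirm that it works. In principle, we should be able to trigger a one-click rollback from Yunxiao.

3.3.2 Partial feature failure

Check whether the code released that day is at fault. If the failure affects what users can do, roll back first and investigate afterwards, so the impact does not spread and cause financial loss.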

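4. Monitoring and performance analysis

4.1 Grafana dashboards

Check the service status at the Grafana access address. Below is the status of one container at the time: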

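4.2 Memory monitoring and leak detection

4.2.1 RSS vs WSS

RSS (Resident Set Size)

The physical memory the process currently occupies
Includes code, data, shared libraries, and so on
Excludes memory that has been swapped out to disk

WSS (Working Set Size)

The set of memory pages the process has accessed within a given time window
Reflects the process's real memory needs more accurately
In K8s this usually refers to working_set_bytes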
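
For containers, the working set figure reported by cAdvisor (the container_memory_working_set_bytes metric) is derived from cgroup counters as total usage minus inactive file cache, floored at zero. The sketch below only illustrates that arithmetic; the function name and the sample numbers are hypothetical.

```typescript
// Working set as reported for containers: memory usage minus inactive file cache, never below zero.
function workingSetBytes(usageBytes: number, inactiveFileBytes: number): number {
  return Math.max(0, usageBytes - inactiveFileBytes);
}

const MiB = 1024 * 1024;
// Illustrative numbers: 600 MiB total usage, of which 150 MiB is reclaimable file cache.
console.log(workingSetBytes(600 * MiB, 150 * MiB) / MiB, "MiB");
```

This is why the working set can be lower than total usage: page cache the kernel can reclaim does not count against it, and it is the working set that the kubelet compares against memory limits when deciding on eviction.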

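4.2.2 Criteria for identifying a memory leak

If the RSS curve on the Grafana dashboard keeps climbing and never shows an intermittent drop, the service can be considered to have a memory leak.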
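
The criterion above can be sketched as a simple check over a series of RSS samples. This is a deliberately strict reading of "keeps climbing with no drop" and is not a substitute for looking at the dashboard; isSteadilyRising and the minSamples default are hypothetical choices.

```typescript
// Returns true when the samples never decrease and end higher than they started.
function isSteadilyRising(samples: number[], minSamples: number = 6): boolean {
  if (samples.length < minSamples) return false; // too little data to judge
  for (let i = 1; i < samples.length; i++) {
    if (samples[i] < samples[i - 1]) return false; // a dip means memory was released
  }
  return samples[samples.length - 1] > samples[0];
}

// Illustrative RSS samples in MiB, taken at a fixed interval.
console.log(isSteadilyRising([310, 325, 341, 360, 372, 390])); // climbing without a dip
console.log(isSteadilyRising([310, 340, 305, 338, 312, 335])); // sawtooth: garbage collection is reclaiming memory
```

In a Node.js service the samples could come from process.memoryUsage().rss collected on a timer, or from values exported from the Grafana panel. The sampling window should be long enough to span several garbage collection cycles, otherwise normal growth between collections will look like a leak.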

4.2.3 Memory leak troubleshooting steps

To be added.

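4.2.4 Routine monitoring checklist

Daily

Review the Grafana dashboards and confirm key metrics are normal
Check the error logs and watch for new error types
Verify that the alerting system is working
Check the backup and recovery process

Weekly

Analyze performance trends and identify potential problems
Review resource usage and plan capacity
Update monitoring rules and alert thresholds
Run a failure drill

Monthly

Carry out a full performance review and propose optimizations
Check the health of the monitoring system itself
Update operations documentation and procedures
Hold team training and knowledge sharing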

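Summary

Front-end operations is a broad technical area that requires skills ranging from infrastructure to the application layer. After reading this article, you should be able to:

Understand the core concepts: HTTPS, nginx, containerization, CI/CD, and other fundamentals
Apply the practical skills: log analysis, troubleshooting, performance monitoring, and other hands-on tasks
Think like an operator: prevention first, fast response, continuous improvement
Follow standard procedures: standardized troubleshooting and emergency response

Remember that good operations work is not only about solving problems; preventing them matters more. Keep learning new technologies and keep refining existing processes to stay effective as the technology landscape changes.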
