
Everything is ready. Here is the final summary of what was delivered.

Files to upload to the cloud server

Local packaging command:

    cd D:\sft-repair
    powershell -NoProfile -ExecutionPolicy Bypass -File scripts\package_repair_sft.ps1 -IncludeValidationData

Cloud execution order (short version)

    # 1. Initialize
    bash /root/localstorage/cloud_bootstrap_repair_sft.sh
    source /root/localstorage/adacor_env.sh

    # 2. Place the SFT HF checkpoint at /root/localstorage/models/Qwen3-4B-Instruct-2507-adacor-sft-hf/

    # 3. Convert HF to torch_dist
    bash ${PROJECT_DIR}/scripts/prepare_repair_checkpoints.sh

    # 4. Preflight check
    bash ${PROJECT_DIR}/scripts/preflight_repair_sft_a800.sh --stage all

    # 5. SwanLab monitoring
    export SWANLAB_API_KEY=<your-swanlab-api-key>
    bash ${PROJECT_DIR}/scripts/start_swanlab_watch.sh

    # 6. Train
    bash ${PROJECT_DIR}/scripts/run_repair_sft_qwen3_4b_a800.sh

    # 7. Convert back to HF
    bash ${PROJECT_DIR}/scripts/convert_repair_torch_dist_to_hf.sh

    # 8. Validation
    bash ${PROJECT_DIR}/scripts/run_repair_validation_qwen3_4b.sh

Key design decisions

1. --ref-load points to the SFT torch_dist checkpoint: repair SFT continues training from the current SFT checkpoint rather than from the base model.
2. --load/--save point to a new repair directory, so the original SFT checkpoint is not overwritten.
3. NUM_GPUS=1: fits a single A800; Ray can still schedule normally.
4. LR=1e-5, NUM_EPOCH=1: repair is a light nudge, not a retrain.
5. The SwanLab API key is not written into any script or document; export it temporarily in the cloud shell only.

The full runbook is at D:\sft-repair\docs\repair_sft_cloud_runbook.md.
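The design decisions about keeping the repair output separate from the SFT checkpoint and keeping the SwanLab key out of files could be checked before training starts. Below is a minimal sketch of two guard functions, assuming nothing about how the project scripts are written: `require_env` and `require_distinct_dirs` are hypothetical names and are not part of the delivered scripts.

```bash
#!/usr/bin/env bash
# Hypothetical pre-run guards (illustrative only; not part of the delivered scripts).

# Succeeds only if the named variable is exported and non-empty.
require_env() {
  if [ -z "$(printenv "$1")" ]; then
    echo "missing environment variable: $1" >&2
    return 1
  fi
}

# Succeeds only if the reference checkpoint dir and the save dir differ
# (a single trailing slash is ignored in the comparison).
require_distinct_dirs() {
  if [ "${1%/}" = "${2%/}" ]; then
    echo "refusing to overwrite checkpoint: ${1%/}" >&2
    return 1
  fi
}
```

One possible use is to call `require_env SWANLAB_API_KEY` just before step 5 and `require_distinct_dirs` with the reference and save directories just before step 6, aborting the run if either returns non-zero. The comparison is purely textual, so symlinked or relative paths that resolve to the same directory would not be caught.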