Expert guidance for distributed training with DeepSpeed: ZeRO optimization stages, pipeline parallelism, FP16/BF16/FP8 mixed precision, 1-bit Adam, and sparse attention. Use when training large models (7B to 175B+ parameters) across multiple GPUs where memory optimization is needed.
/plugin marketplace add zechenzhangAGI/AI-research-SKILLs
/plugin install deepspeed@zechenzhangAGI/AI-research-SKILLs
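As a rough illustration of the kind of setup this skill covers, here is a minimal sketch of ZeRO stage 2 with BF16 and CPU optimizer offload using the standard `deepspeed.initialize` API. The toy linear model, batch sizes, and learning rate are placeholders chosen for the example, not recommendations from the skill.

```python
import torch
import deepspeed

# Stand-in model for illustration; a real run would use a transformer.
model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                               # partition optimizer states + gradients
        "offload_optimizer": {"device": "cpu"},   # move optimizer states to CPU memory
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# The engine returned here handles ZeRO sharding, mixed precision,
# and gradient accumulation on behalf of the training loop.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(10):
    x = torch.randn(4, 1024, device=model_engine.device, dtype=torch.bfloat16)
    loss = model_engine(x).float().pow(2).mean()
    model_engine.backward(loss)   # engine-managed backward pass
    model_engine.step()           # optimizer step + ZeRO bookkeeping
```

Launch with the DeepSpeed launcher (e.g. `deepspeed train_sketch.py`, a hypothetical filename) so that the distributed environment is set up before `deepspeed.initialize` runs.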