zai-org

Open Source

ChatGLM2-6B

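# ChatGLM2-6B

🤗 <a href="https://huggingface.co/THUDM/chatglm2-6b" target="_blank">HF Repo</a> • 🐦 <a href="https://twitter.com/thukeg" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/abs/2103.10360" target="_blank">[GLM@ACL 22]</a> <a href="https://github.com/THUDM/GLM" target="_blank">[GitHub]</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">[GLM-130B@ICLR 23]</a> <a href="https://github.com/THUDM/GLM-130B" target="_blank">[GitHub]</a>

👋 Join our <a href="https://discord.gg/fK2dz4bg" target="_blank">Discord</a> and <a href="resources/WECHAT.md" target="_blank">WeChat</a>

📍 Try larger-scale ChatGLM models at <a href="https://www.chatglm.cn">chatglm.cn</a>.

*Read this in [English](README_EN.md)*

## GLM-4 Open-Source Models and API

We have released the latest **GLM-4** models, which reach new highs on multiple metrics. You can try them through the following channels.

+ [GLM-4 open-source models](https://github.com/THUDM/GLM-4): We have open-sourced the GLM-4-9B series, which shows clear improvements across benchmarks.
+ [Zhipu Qingyan](https://chatglm.cn/main/detail?fr=ecology_x): Try the latest GLM-4, including features such as **GLMs** and **All tools**.
+ [API platform](https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9): The new-generation API platform is online. You can try new models such as `GLM-4-0520`, `GLM-4-air`, `GLM-4-airx`, `GLM-4-flash`, `GLM-4`, `GLM-3-Turbo`, `CharacterGLM-3`, and `CogView-3` directly on the platform. Of these, `GLM-4` and `GLM-3-Turbo` support new features such as `System Prompt`, `Function Call`, `Retrieval`, and `Web_Search`.
+ [GLM-4 API open-source tutorial](https://github.com/MetaGLM/glm-cookbook/): GLM-4 API tutorials and basic applications. API questions can be raised in the tutorial repository, or you can use the [GLM-4 API AI assistant](https://open.bigmodel.cn/shareapp/v1/?share_code=sQwt5qyqYVaNh1O_87p8O) for help with common problems.

-----

## Introduction

ChatGLM**2**-6B is the second generation of the open-source Chinese-English bilingual chat model [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B). It keeps the strengths of the first generation, such as fluent conversation and a low deployment threshold, and introduces the following new features:

1. **Stronger performance**: Building on the development experience of the first-generation ChatGLM model, we fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of [GLM](https://github.com/THUDM/GLM) and was pre-trained on 1.4T Chinese and English tokens, followed by human preference alignment training. The [evaluation results](#评测结果) show that, compared with the first-generation model, ChatGLM2-6B improves substantially on datasets such as MMLU (+23%), CEval (+33%), GSM8K (+571%), and BBH (+60%), making it highly competitive among open-source models of the same size.
2. **Longer context**: Based on [FlashAttention](https://github.com/HazyResearch/flash-attention), we extended the context length of the base model from 2K in ChatGLM-6B to 32K, and trained with a context length of 8K during the dialogue stage. For longer contexts, we released the [ChatGLM2-6B-32K](https://huggingface.co/THUDM/chatglm2-6b-32k) model. [LongBench](https://github.com/THUDM/LongBench) evaluation results show that ChatGLM2-6B-32K has a clear competitive advantage among open-source models of comparable size.
3. **More efficient inference**: Based on [Multi-Query Attention](http://arxiv.org/abs/1911.02150), ChatGLM2-6B offers faster inference and lower GPU memory usage. With the official implementation, inference is 42% faster than the first generation, and under INT4 quantization the dialogue length supported by 6G of GPU memory increases from 1K to 8K.
4. **More open license**: ChatGLM2-6B weights are **fully open** for academic research, and **free commercial use is also permitted** after registering through the [questionnaire](https://open.bigmodel.cn/mla/form).

-----

The ChatGLM2-6B open-source model aims to advance large-model technology together with the open-source community. We ask developers and all users to comply with the [open-source license](MODEL_LICENSE), and not to use the open-source model, its code, or derivatives of this project for any purpose that may harm the nation or society, or for any service that has not undergone safety assessment and registration. **At present, the project team has not developed any application based on ChatGLM2-6B, including web, Android, Apple iOS, and Windows apps.**

Although every effort was made to ensure data compliance and accuracy at each stage of training, the accuracy of the output cannot be guaranteed, because ChatGLM2-6B is a relatively small model and is affected by probabilistic randomness; the model can also be misled easily. **This project assumes no responsibility for data security or public-opinion risks caused by the open-source model and code, or for any risks and liabilities arising from the model being misled, abused, disseminated, or improperly exploited.**

## Updates

**[2023/07/31]** Released the [ChatGLM2-6B-32K](https://huggingface.co/THUDM/chatglm2-6b-32k) model, which improves long-text understanding.

**[2023/07/25]** Released the [CodeGeeX2](https://github.com/THUDM/CodeGeeX2) model, which adds code pre-training on top of ChatGLM2-6B and comprehensively improves coding ability.

**[2023/07/04]** Released P-Tuning v2 and full-parameter fine-tuning scripts; see [P-Tuning](./ptuning).

## Friendly Links

Open-source projects that accelerate ChatGLM2:

* [fastllm](https://github.com/ztxz16/fastllm/): A cross-platform inference acceleration solution. Batched inference on a single GPU can reach 10000+ tokens per second, and it runs in real time on mobile devices with as little as 3G of memory (about 4~5 tokens/s on a Snapdragon 865).
* [chatglm.cpp](https://github.com/li-plus/chatglm.cpp): A quantized CPU inference acceleration solution similar to llama.cpp, enabling real-time chat on a Mac laptop.
* [ChatGLM2-TPU](https://github.com/sophgo/ChatGLM2-TPU): A TPU-accelerated inference solution that runs in real time at about 5 tokens/s on the Sophgo edge chip BM1684X (16T@FP16, 16G memory).

Open-source projects based on or using ChatGLM2-6B:

* [Chuanhu Chat](https://github.com/GaiZhenbiao/ChuanhuChatGPT): Provides an attractive, easy-to-use, feature-rich, and quickly deployable user interface for various large language models and online model APIs; supports ChatGLM2-6B.

Example projects that support online training of ChatGLM-6B and related applications:

* [Tutorial: deploying ChatGLM2-6B on Tencent Cloud](https://cloud.tencent.com/document/product/1721/104848)
* [Tutorial: deploying and fine-tuning ChatGLM2-6B](https://www.heywhale.com/mw/project/64984a7b72ebe240516ae79c)

## Evaluation Results

We evaluated the model on a selection of typical Chinese and English datasets. The following are the evaluation results of the ChatGLM2-6B model on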
[MMLU](https://github.com/hendrycks/test) (English), [C-Eval](https://cevalbenchmark.com/static/leaderboard.html) (Chinese), [GSM8K](https://github.com/openai/grade-school-math) (math), and [BBH](https://github.com/suzgunmirac/BIG-Bench-Hard) (English). A script for evaluating on C-Eval is provided in [evaluation](./evaluation/README.md).

### MMLU

| Model | Average | STEM | Social Sciences | Humanities | Others |
| ----- | ----- | ----- | ----- | ----- | ----- |
| ChatGLM-6B | 40.63 | 33.89 | 44.84 | 39.02 | 45.71 |
| ChatGLM2-6B (base) | 47.86 | 41.20 | 54.44 | 43.66 | 54.46 |
| ChatGLM2-6B | 45.46 | 40.06 | 51.61 | 41.23 | 51.24 |
| ChatGLM2-12B (base) | 56.18 | 48.18 | 65.13 | 52.58 | 60.93 |
| ChatGLM2-12B | 52.13 | 47.00 | 61.00 | 46.10 | 56.05 |

> Chat models are tested with zero-shot CoT (Chain-of-Thought); base models are tested with few-shot answer-only prompting.

### C-Eval

| Model | Average | STEM | Social Sciences | Humanities | Others |
| ----- | ----- | ----- | ----- | ----- | ----- |
| ChatGLM-6B | 38.9 | 33.3 | 48.3 | 41.3 | 38.0 |
| ChatGLM2-6B (base) | 51.7 | 48.6 | 60.5 | 51.3 | 49.8 |
| ChatGLM2-6B | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
| ChatGLM2-12B (base) | 61.6 | 55.4 | 73.7 | 64.2 | 59.4 |
| ChatGLM2-12B | 57.0 | 52.1 | 69.3 | 58.5 | 53.2 |

> Chat models are tested with zero-shot CoT; base models are tested with few-shot answer-only prompting.

### GSM8K

| Model | Accuracy | Accuracy (Chinese)* |
| ----- | ----- | ----- |
| ChatGLM-6B | 4.82 | 5.85 |
| ChatGLM2-6B (base) | 32.37 | 28.95 |
| ChatGLM2-6B | 28.05 | 20.45 |
| ChatGLM2-12B (base) | 40.94 | 42.71 |
| ChatGLM2-12B | 38.13 | 23.43 |

> All models are tested with few-shot CoT; the CoT prompts come from http://arxiv.org/abs/2201.11903
>
> \* We used a translation API to translate 500 GSM8K problems and the CoT prompts, then proofread them manually.

### BBH

| Model | Accuracy |
| ----- | ----- |
| ChatGLM-6B | 18.73 |
| ChatGLM2-6B (base) | 33.68 |
| ChatGLM2-6B | 30.00 |
| ChatGLM2-12B (base) | 36.02 |
| ChatGLM2-12B | 39.98 |

> All models are tested with few-shot CoT; the CoT prompts come from https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts

## Inference Performance

ChatGLM2-6B uses
[Multi-Query Attention](http://arxiv.org/abs/1911.02150), which improves generation speed. The average speed when generating 2000 characters compares as follows:

| Model | Inference speed (characters/s) |
| ----- | ----- |
| ChatGLM-6B | 31.49 |
| ChatGLM2-6B | 44.62 |

> Measured with the official implementation, batch size = 1, max length = 2048, bf16 precision, on an A100-SXM4-80G with PyTorch 2.0.1.

Multi-Query Attention also reduces the GPU memory used by the KV Cache during generation. In addition, ChatGLM2-6B is trained for dialogue with a Causal Mask, so the KV Cache from earlier turns can be reused in a multi-turn conversation, which further reduces memory usage. As a result, when running INT4-quantized inference on a GPU with 6GB of memory, the first-generation ChatGLM-6B runs out of memory after generating at most 1119 characters, while ChatGLM2-6B can generate at least 8192 characters.

| **Quantization level** | **Minimum GPU memory to encode length 2048** | **Minimum GPU memory to generate length 8192** |
| ----- | ----- | ----- |
| FP16 / BF16 | 13.1 GB | 12.8 GB |
| INT8 | 8.2 GB | 8.1 GB |
| INT4 | 5.5 GB | 5.1 GB |

> ChatGLM2-6B uses `torch.nn.functional.scaled_dot_product_attention`, introduced in PyTorch 2.0, for efficient attention computation. With an older PyTorch version it falls back to a naive attention implementation, and GPU memory usage may exceed the values in the table above.

We also tested the effect of quantization on model performance. The results show that the impact is within an acceptable range.

| Quantization level | Accuracy (MMLU) | Accuracy (C-Eval dev) |
| ----- | ----- | ----- |
| BF16 | 45.47 | 53.57 |
| INT4 | 43.13 | 50.30 |

## ChatGLM2-6B Examples

Compared with the first-generation model, ChatGLM2-6B has improved along several dimensions. Below are some comparison examples. More of what ChatGLM2-6B can do is waiting for you to explore!

<details><summary>Math and logic</summary>

![](resources/math.png)

</details>

<details><summary>Knowledge reasoning</summary>

![](resources/knowledge.png)

</details>

<details><summary>Long-document understanding</summary>

![](resources/long-context.png)

</details>

## Usage

### Environment Setup

First, clone this repository:

```shell
git clone https://github.com/THUDM/ChatGLM2-6B
cd ChatGLM2-6B
```

Then install the dependencies with pip:

```
pip install -r requirements.txt
```

The recommended `transformers` version is `4.30.2`, and `torch` 2.0 or later is recommended for the best inference performance.

### Calling the Model from Code

The following code calls the ChatGLM2-6B model to generate a conversation:

```python
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
>>> model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device='cuda')
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "你好", history=[])
>>> print(response)
你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。
>>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
>>> print(response)
晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法:

1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡。尽量在每天的相同时间上床,并在同一时间起床。
2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜。可以使用舒适的床上用品,并保持房间通风。
3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡。
4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量。尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐。
5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠。
6. 尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡。试着慢慢吸气,保持几秒钟,然后缓慢呼气。

如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议。
```

#### Loading the Model Locally

The code above lets `transformers` download the model implementation and parameters automatically. The complete model implementation is on the [Hugging Face Hub](https://huggingface.co/THUDM/chatglm2-6b). If your network connection is poor, downloading the parameters may take a long time or fail. In that case, download the model to local storage first and load it from there.

Downloading the model from the Hugging Face Hub requires [installing Git LFS](https://docs.github.com/zh/repositories/working-with-files/managing-large-files/installing-git-large-file-storage) first, then running

```Shell
git clone https://huggingface.co/THUDM/chatglm2-6b
```

If downloading the checkpoint from the Hugging Face Hub is slow, you can download only the model implementation

```Shell
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/THUDM/chatglm2-6b
```

then download the model parameter files manually from [here](https://cloud.tsinghua.edu.cn/d/674208019e314311ab5c/) and use them to replace the corresponding files in the local `chatglm2-6b` directory.

Once the model is downloaded, replace `THUDM/chatglm2-6b` in the code above with the path to your local `chatglm2-6b` folder to load the model locally.

The model implementation is still changing. To pin the implementation for compatibility, add the `revision="v1.0"` argument to the `from_pretrained` call. `v1.0` is currently the latest version; see the [Change Log](https://huggingface.co/THUDM/chatglm2-6b#change-log) for the full list of versions.

### Web Demo

![web-demo](resources/web-demo.gif)

Start the Gradio-based web demo with:

```shell
python web_demo.py
```

![web-demo](resources/web-demo2.gif)

Start the Streamlit-based web demo with:

```shell
streamlit run web_demo2.py
```

The web demo runs a web server and prints its address. Open the printed address in a browser to use it. In our tests, the Streamlit-based web demo runs more smoothly.

### Command-Line Demo

![cli-demo](resources/cli-demo.png)

Run [cli_demo.py](cli_demo.py) from the repository:

```shell
python cli_demo.py
```

The program starts an interactive conversation on the command line. Type an instruction and press Enter to generate a reply, type `clear` to clear the conversation history, and type `stop` to exit the program.

### API Deployment

First install the additional dependencies with `pip
install fastapi uvicorn`, then run [api.py](api.py) from the repository:

```shell
python api.py
```

By default the service listens on local port 8000 and is called with the POST method:

```shell
curl -X POST "http://127.0.0.1:8000" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'
```

The returned value is

```shell
{
  "response":"你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。",
  "history":[["你好","你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。"]],
  "status":200,
  "time":"2023-03-23 21:38:40"
}
```

Thanks to [@hiyouga]() for implementing a streaming API deployment in the OpenAI format, which can serve as the backend for any ChatGPT-based application, such as [ChatGPT-Next-Web](https://github.com/Yidadaa/ChatGPT-Next-Web). Deploy it by running [openai_api.py](openai_api.py) from the repository:

```shell
python openai_api.py
```

Example code for calling the API:

```python
import openai

if __name__ == "__main__":
    # Point the client at the local OpenAI-compatible server.
    openai.api_base = "http://localhost:8000/v1"
    openai.api_key = "none"
    # Stream the reply chunk by chunk and print it as it arrives.
    for chunk in openai.ChatCompletion.create(
        model="chatglm2-6b",
        messages=[
            {"role": "user", "content": "你好"}
        ],
        stream=True
    ):
        if hasattr(chunk.choices[0].delta, "content"):
            print(chunk.choices[0].delta.content, end="", flush=True)
```

## Low-Cost Deployment

### Model Quantization

By default, the model is loaded in FP16 precision, and running the code above requires about 13GB of GPU memory. If your GPU memory is limited, you can try loading a quantized model as follows:

```python
model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).cuda()
```

Quantization causes some loss in model quality. In our tests, ChatGLM2-6B still generates natural, fluent text under 4-bit quantization. The parameter files of the quantized model can also be downloaded manually from [here](https://cloud.tsinghua.edu.cn/d/674208019e314311ab5c/).

### CPU Deployment

If you do not have GPU hardware, you can run inference on the CPU, although it will be slower. Use the following (requires about 32GB of RAM):

```python
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).float()
```

If you do not have enough RAM, you can use the quantized model instead:

```python
model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).float()
```

Running the quantized model on a CPU requires `gcc` and `openmp`. Most Linux distributions install them by default. On Windows, tick `openmp` when installing [TDM-GCC](https://jmeubank.github.io/tdm-gcc/). The `gcc` version in our Windows test environment is `TDM-GCC 10.3.0`, and on Linux it is `gcc 11.3.0`. On MacOS, see [Q1](FAQ.md#q1).

### Mac Deployment

On Macs with Apple Silicon or an AMD GPU, the MPS backend can be used to run ChatGLM2-6B on the GPU. Follow Apple's
[official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly (the correct version number should look like 2.x.x.dev2023xxxx, not 2.x.x).

At present, MacOS only supports [loading the model locally](README.md#从本地加载模型). Change the model loading in the code to load from a local path, and use the mps backend:

```python
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps')
```

Loading the half-precision ChatGLM2-6B model requires about 13GB of memory. Machines with less memory (such as a MacBook Pro with 16GB) fall back to virtual memory on disk when free memory runs short, which slows inference severely. In that case you can use the quantized model chatglm2-6b-int4. Because the GPU quantization kernels are written in CUDA, they cannot be used on MacOS, so inference has to run on the CPU. To make full use of CPU parallelism, you also need to [install OpenMP separately](FAQ.md#q1).

Inference on a Mac can also use [ChatGLM.cpp](https://github.com/li-plus/chatglm.cpp).

### Multi-GPU Deployment

If you have several GPUs but none has enough memory to hold the full model, you can split the model across them. First install accelerate with `pip install accelerate`, then load the model as follows:

```python
from utils import load_model_on_gpus
model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
```

This deploys the model on two GPUs for inference. Change `num_gpus` to the number of GPUs you want to use. The model is split evenly by default; you can also pass a `device_map` argument to specify the split yourself.

## License

The code in this repository is open-sourced under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license. Use of the ChatGLM2-6B model weights must follow the [Model License](MODEL_LICENSE). ChatGLM2-6B weights are **fully open** for academic research, and **free commercial use is also permitted** after registering through the [questionnaire](https://open.bigmodel.cn/mla/form).

## Citation

If you find our work helpful, please consider citing the following paper. The ChatGLM2-6B paper will be published soon; stay tuned.

```
@misc{glm2024chatglm,
      title={ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools},
      author={Team GLM and Aohan Zeng and Bin Xu and Bowen Wang and Chenhui Zhang and Da Yin and Diego Rojas and Guanyu Feng and Hanlin Zhao and Hanyu Lai and Hao Yu and Hongning Wang and Jiadai Sun and Jiajie Zhang and Jiale Cheng and Jiayi Gui and Jie Tang and Jing Zhang and Juanzi Li and Lei Zhao and Lindong Wu and Lucen Zhong and Mingdao Liu and Minlie Huang and Peng Zhang and Qinkai Zheng and Rui Lu and Shuaiqi Duan and Shudan Zhang and Shulin Cao and Shuxun Yang and Weng Lam Tam and Wenyi Zhao and Xiao Liu and Xiao Xia and Xiaohan Zhang and Xiaotao Gu and Xin Lv and Xinghan Liu and Xinyi Liu and Xinyue Yang and Xixuan Song and Xunkai Zhang and Yifan An and Yifan Xu and Yilin Niu and Yuantao Yang and Yueyan Li and Yushi Bai and Yuxiao Dong and Zehan Qi and Zhaoyu Wang and Zhen Yang and Zhengxiao Du and Zhenyu Hou and Zihan Wang},
      year={2024},
      eprint={2406.12793},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
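Returning to the API deployment section above, the `curl` call against `api.py` can also be made from Python. The following is a minimal sketch using only the standard library. It assumes the server is running at the default `http://127.0.0.1:8000` and returns the JSON shape shown in that section; the helper names `build_chat_request`, `parse_chat_response`, and `chat` are hypothetical and not part of the repository.

```python
import json
from urllib import request

# Default address from the API deployment section; adjust if your server differs.
API_URL = "http://127.0.0.1:8000"


def build_chat_request(prompt, history=None, url=API_URL):
    """Build a POST request with the JSON body used in the curl example (hypothetical helper)."""
    payload = {"prompt": prompt, "history": history or []}
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_chat_response(body):
    """Extract (response, history) from the JSON text returned by the server."""
    result = json.loads(body)
    if result.get("status") != 200:
        raise RuntimeError(f"unexpected status: {result.get('status')}")
    return result["response"], result["history"]


def chat(prompt, history=None, url=API_URL, timeout=60):
    """Send one turn to a running api.py server and return (response, history)."""
    req = build_chat_request(prompt, history, url)
    with request.urlopen(req, timeout=timeout) as resp:
        return parse_chat_response(resp.read().decode("utf-8"))
```

With the server running, `response, history = chat("你好")` sends the first turn, and passing the returned `history` into the next `chat` call continues the conversation, mirroring the `history` argument of `model.chat`.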

AI & Machine Learning LLM Tools & Chat UIs

15.6K Github Stars

Open Source

CogVideo

# CogVideo & CogVideoX

[中文阅读](./README_zh.md)

[日本語で読む](./README_ja.md)

<div align="center"> <img src=resources/logo.svg width="50%"/> </div>

Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank"> 🤗 Huggingface Space</a> or <a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank"> 🤖 ModelScope Space</a>

📚 View the <a href="https://arxiv.org/abs/2408.06072" target="_blank">paper</a> and <a href="https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh" target="_blank">user guide</a>

👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/dCGfUsagrD" target="_blank">Discord</a>

📍 Visit <a href="https://chatglm.cn/video?lang=en?fr=osm_cogvideo">QingYing</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.

## Project Updates

- 🔥🔥 **News**: ```2025/03/24```: We have launched [CogKit](https://github.com/THUDM/CogKit), a fine-tuning and inference framework for the **CogView4** and **CogVideoX** series. This toolkit allows you to fully explore and utilize our multimodal generation models.
- 🔥 **News**: ```2025/02/28```: DDIM Inverse is now supported in `CogVideoX-5B` and `CogVideoX1.5-5B`. Check [here](inference/ddim_inversion.py).
- 🔥 **News**: ```2025/01/08```: We have updated the code for `Lora` fine-tuning based on the `diffusers` version model, which uses less GPU memory. For more details, please see [here](finetune/README.md).
- 🔥 **News**: ```2024/11/15```: We released the `CogVideoX1.5` model in the diffusers version. Only minor parameter adjustments are needed to continue using previous code.
- 🔥 **News**: ```2024/11/08```: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX. The CogVideoX1.5-5B series supports 10-second videos with higher resolution, and CogVideoX1.5-5B-I2V supports video generation at any resolution. The SAT code has already been updated, while the diffusers version is still under adaptation. Download the SAT version code [here](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT).
- 🔥 **News**: ```2024/10/13```: A more cost-effective fine-tuning framework for `CogVideoX-5B` that works with a single 4090 GPU, [cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory), has been released. It supports fine-tuning with multiple resolutions. Feel free to use it!
- 🔥 **News**: ```2024/10/10```: We have updated our technical report. Please click [here](https://arxiv.org/pdf/2408.06072) to view it. More training details and a demo have been added. To see the demo, click [here](https://yzy-thu.github.io/CogVideoX-demo/).
- 🔥 **News**: ```2024/10/09```: We have publicly released the [technical documentation](https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh) for CogVideoX fine-tuning on Feishu, further increasing distribution flexibility. All examples in the public documentation can be fully reproduced.
- 🔥 **News**: ```2024/9/19```: We have open-sourced the CogVideoX series image-to-video model **CogVideoX-5B-I2V**. This model takes an image as a background input and generates a video guided by the prompt, offering greater controllability. With this, the CogVideoX series now supports three tasks: text-to-video generation, video continuation, and image-to-video generation. You are welcome to try it online at [Experience](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space).
- 🔥 ```2024/9/19```: The caption model [CogVLM2-Caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption), used during CogVideoX training to convert video data into text descriptions, has been open-sourced. You are welcome to download and use it.
- 🔥 ```2024/8/27```: We have open-sourced a larger model in the CogVideoX series, **CogVideoX-5B**. We have significantly optimized the model's inference performance, greatly lowering the inference threshold. You can run **CogVideoX-2B** on older GPUs like `GTX 1080TI`, and **CogVideoX-5B** on desktop GPUs like `RTX 3060`. Please strictly follow the [requirements](requirements.txt) to update and install dependencies, and refer to [cli_demo](inference/cli_demo.py) for inference code. Additionally, the open-source license for the **CogVideoX-2B** model has been changed to the **Apache 2.0 License**.
- 🔥 ```2024/8/6```: We have open-sourced **3D Causal VAE**, used for **CogVideoX-2B**, which can reconstruct videos with almost no loss.
- 🔥 ```2024/8/6```: We have open-sourced the first model of the CogVideoX series of video generation models, **CogVideoX-2B**.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced the CogVideo video generation model (now available in the `CogVideo` branch). This is the first open-source large Transformer-based text-to-video generation model. See the [ICLR'23 paper](https://arxiv.org/abs/2205.15868) for technical details.

## Table of Contents

Jump to a specific section:

- [Quick Start](#quick-start)
    - [Prompt Optimization](#prompt-optimization)
    - [SAT](#sat)
    - [Diffusers](#diffusers)
- [Gallery](#gallery)
    - [CogVideoX-5B](#cogvideox-5b)
    - [CogVideoX-2B](#cogvideox-2b)
- [Model Introduction](#model-introduction)
- [Friendly Links](#friendly-links)
- [Project Structure](#project-structure)
    - [Quick Start with Colab](#quick-start-with-colab)
    - [Inference](#inference)
    - [finetune](#finetune)
    - [sat](#sat-1)
    - [Tools](#tools)
- [CogVideo(ICLR'23)](#cogvideoiclr23)
- [Citation](#citation)
- [Model-License](#model-license)

## Quick Start

### Prompt Optimization

Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use large models like GLM-4 (or other comparable products, such as GPT-4) to optimize the prompt. This is crucial because the model is trained with long prompts, and a good prompt directly impacts the quality of the generated video.

### SAT

**Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.**

Follow the instructions in [sat_demo](sat/README.md), which contains the inference and fine-tuning code for the SAT weights. It is the recommended starting point for improving on the CogVideoX model structure, and is aimed at researchers who want to prototype and iterate quickly.

### Diffusers

**Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.**

```
pip install -r requirements.txt
```

Then follow [diffusers_demo](inference/cli_demo.py), which explains the inference code in more detail, including the meaning of common parameters.

For more details on quantized inference, please refer to [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao/). With Diffusers and TorchAO, quantized inference is also possible, leading to memory-efficient inference as well as a speedup in some cases when compiled.
A full list of memory and time benchmarks with various settings on A100 and H100 has been published at [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao). ## Gallery ### CogVideoX-5B <table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/cf5953ea-96d3-48fd-9907-c4708752c714" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/fe0a78e6-b669-4800-8cf0-b5f9b5145b52" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/c182f606-8f8c-421d-b414-8487070fcfcb" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/7db2bbce-194d-434d-a605-350254b6c298" width="100%" controls autoplay loop></video> </td> </tr> <tr> <td> <video src="https://github.com/user-attachments/assets/62b01046-8cab-44cc-bd45-4d965bb615ec" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/d78e552a-4b3f-4b81-ac3f-3898079554f6" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/30894f12-c741-44a2-9e6e-ddcacc231e5b" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/926575ca-7150-435b-a0ff-4900a963297b" width="100%" controls autoplay loop></video> </td> </tr> </table> ### CogVideoX-2B <table border="0" style="width: 100%; text-align: left; margin-top: 20px;"> <tr> <td> <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" 
width="100%" controls autoplay loop></video> </td> <td> <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="100%" controls autoplay loop></video> </td> </tr> </table> To view the corresponding prompt words for the gallery, please click [here](resources/galary_prompt.md) ## Model Introduction CogVideoX is an open-source version of the video generation model originating from [QingYing](https://chatglm.cn/video?lang=en?fr=osm_cogvideo). The table below displays the list of video generation models we currently offer, along with their foundational information. <table style="border-collapse: collapse; width: 100%;"> <tr> <th style="text-align: center;">Model Name</th> <th style="text-align: center;">CogVideoX1.5-5B (Latest)</th> <th style="text-align: center;">CogVideoX1.5-5B-I2V (Latest)</th> <th style="text-align: center;">CogVideoX-2B</th> <th style="text-align: center;">CogVideoX-5B</th> <th style="text-align: center;">CogVideoX-5B-I2V</th> </tr> <tr> <td style="text-align: center;">Release Date</td> <th style="text-align: center;">November 8, 2024</th> <th style="text-align: center;">November 8, 2024</th> <th style="text-align: center;">August 6, 2024</th> <th style="text-align: center;">August 27, 2024</th> <th style="text-align: center;">September 19, 2024</th> </tr> <tr> <td style="text-align: center;">Video Resolution</td> <td colspan="1" style="text-align: center;">1360 * 768</td> <td colspan="1" style="text-align: center;"> Min(W, H) = 768 768 ≤ Max(W, H) ≤ 1360 Max(W, H) % 16 = 0 </td> <td colspan="3" style="text-align: center;">720 * 480</td> </tr> <tr> <td style="text-align: center;">Number of Frames</td> <td colspan="2" style="text-align: center;">Should be 16N + 1 where N <= 10 (default 81)</td> <td colspan="3" style="text-align: center;">Should be 8N + 1 where N <= 6 (default 49)</td> </tr> <tr> <td style="text-align: center;">Inference Precision</td> <td colspan="2" style="text-align: center;">BF16 
(Recommended), FP16, FP32, FP8*, INT8, Not supported: INT4</td> <td style="text-align: center;">FP16*(Recommended), BF16, FP32, FP8*, INT8, Not supported: INT4</td> <td colspan="2" style="text-align: center;">BF16 (Recommended), FP16, FP32, FP8*, INT8, Not supported: INT4</td> </tr> <tr> <td style="text-align: center;">Single GPU Memory Usage </td> <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 76GB diffusers BF16: from 10GB* diffusers INT8(torchao): from 7GB*</td> <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB diffusers FP16: 4GB minimum* diffusers INT8 (torchao): 3.6GB minimum*</td> <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB diffusers BF16 : 5GB minimum* diffusers INT8 (torchao): 4.4GB minimum* </td> </tr> <tr> <td style="text-align: center;">Multi-GPU Memory Usage</td> <td colspan="2" style="text-align: center;">BF16: 24GB* using diffusers </td> <td style="text-align: center;">FP16: 10GB* using diffusers </td> <td colspan="2" style="text-align: center;">BF16: 15GB* using diffusers </td> </tr> <tr> <td style="text-align: center;">Inference Speed (Step = 50, FP/BF16)</td> <td colspan="2" style="text-align: center;">Single A100: ~1000 seconds (5-second video) Single H100: ~550 seconds (5-second video)</td> <td style="text-align: center;">Single A100: ~90 seconds Single H100: ~45 seconds</td> <td colspan="2" style="text-align: center;">Single A100: ~180 seconds Single H100: ~90 seconds</td> </tr> <tr> <td style="text-align: center;">Prompt Language</td> <td colspan="5" style="text-align: center;">English*</td> </tr> <tr> <td style="text-align: center;">Prompt Token Limit</td> <td colspan="2" style="text-align: center;">224 Tokens</td> <td colspan="3" style="text-align: center;">226 Tokens</td> </tr> <tr> <td style="text-align: center;">Video Length</td> <td 
colspan="2" style="text-align: center;">5 seconds or 10 seconds</td> <td colspan="3" style="text-align: center;">6 seconds</td> </tr> <tr> <td style="text-align: center;">Frame Rate</td> <td colspan="2" style="text-align: center;">16 frames / second </td> <td colspan="3" style="text-align: center;">8 frames / second </td> </tr> <tr> <td style="text-align: center;">Position Encoding</td> <td colspan="2" style="text-align: center;">3d_rope_pos_embed</td> <td style="text-align: center;">3d_sincos_pos_embed</td> <td style="text-align: center;">3d_rope_pos_embed</td> <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td> </tr> <tr> <td style="text-align: center;">Download Link (Diffusers)</td> <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B">🤗 HuggingFace</a> <a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B">🤖 ModelScope</a> <a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B">🟣 WiseModel</a></td> <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V">🤗 HuggingFace</a> <a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🤖 ModelScope</a> <a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🟣 WiseModel</a></td> <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a> <a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a> <a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td> <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a> <a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a> <a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td> <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a> <a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a> <a 
href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td> </tr> <tr> <td style="text-align: center;">Download Link (SAT)</td> <td colspan="2" style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5b-SAT">🤗 HuggingFace</a> <a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🤖 ModelScope</a> <a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🟣 WiseModel</a></td> <td colspan="3" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td> </tr> </table>

**Data Explanation**

+ While testing using the diffusers library, all optimizations included in the diffusers library were enabled. Actual memory usage under this scheme has not been tested on devices other than **NVIDIA A100 / H100**. Generally, the scheme can be adapted to all devices with **NVIDIA Ampere architecture** and above. If the optimizations are disabled, memory consumption multiplies, with peak memory usage about 3 times the value in the table, but speed increases by about 3-4 times. You can selectively disable some optimizations, including:

```
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

+ For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
+ Using INT8 models significantly slows down inference. The trade-off accommodates lower-memory GPUs while keeping video quality loss minimal.
+ The CogVideoX-2B model was trained in `FP16` precision, and all CogVideoX-5B models were trained in `BF16` precision. We recommend using the precision in which the model was trained for inference.
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX. This allows the model to run on free T4 Colabs or GPUs with smaller memory! Also, note that TorchAO quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. FP8 precision requires NVIDIA H100 or newer devices and a source installation of the `torch` and `torchao` Python packages. CUDA 12.4 is recommended.
+ The inference speed tests also used the above memory optimization scheme. Without memory optimization, inference speed increases by about 10%. Only the `diffusers` version of the model supports quantization.
+ The model only supports English input; prompts in other languages can be translated into English during refinement with a large model.

## Friendly Links

We warmly welcome contributions from the community and actively contribute to the open-source community ourselves. The following works have already been adapted for CogVideoX, and we invite everyone to use them:

+ [LeMiCa](https://unicomai.github.io/LeMiCa/): A diffusion model inference acceleration solution developed by China Unicom Data Science and Artificial Intelligence Research Institute. By leveraging cache-based techniques and global denoising path optimization, LeMiCa provides efficient inference support for CogVideoX, achieving nearly 2.5x lossless acceleration while maintaining visual consistency and quality.
+ [RIFLEx-CogVideoX](https://github.com/thu-ml/RIFLEx): RIFLEx extends the video with just one line of code: `freq[k-1]=(2np.pi)/(Ls)`. The framework not only supports training-free inference, but also offers models fine-tuned based on CogVideoX. By fine-tuning the model for just 1,000 steps on original-length videos, RIFLEx significantly enhances its length extrapolation capability.
+ [CogVideoX-Fun](https://github.com/aigc-apps/CogVideoX-Fun): CogVideoX-Fun is a modified pipeline based on the CogVideoX architecture, supporting flexible resolutions and multiple launch methods.
+ [CogStudio](https://github.com/pinokiofactory/cogstudio): A separate repository for CogVideo's Gradio Web UI, which supports more fully featured Web UIs.
+ [Xorbits Inference](https://github.com/xorbitsai/inference): A powerful and comprehensive distributed inference framework, allowing you to easily deploy your own models or the latest cutting-edge open-source models with just one click.
+ [ComfyUI-CogVideoXWrapper](https://github.com/kijai/ComfyUI-CogVideoXWrapper): Use the ComfyUI framework to integrate CogVideoX into your workflow.
+ [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys): VideoSys provides a user-friendly, high-performance infrastructure for video generation, with full pipeline support and continuous integration of the latest models and techniques.
+ [AutoDL Space](https://www.codewithgpu.com/i/THUDM/CogVideo/CogVideoX-5b-demo): A one-click deployment Huggingface Space image provided by community members.
+ [Interior Design Fine-Tuning Model](https://huggingface.co/collections/bertjiazheng/koolcogvideox-66e4762f53287b7f39f8f3ba): A fine-tuned model based on CogVideoX, specifically designed for interior design.
+ [xDiT](https://github.com/xdit-project/xDiT): xDiT is a scalable inference engine for Diffusion Transformers (DiTs) on multi-GPU clusters. xDiT supports real-time image and video generation services.
+ [cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory): A cost-effective fine-tuning framework for CogVideoX, compatible with the `diffusers` version model. Supports more resolutions, and fine-tuning CogVideoX-5B can be done with a single 4090 GPU.
+ [CogVideoX-Interpolation](https://github.com/feizc/CogvideX-Interpolation): A pipeline based on the modified CogVideoX structure, aimed at providing greater flexibility for keyframe interpolation generation.
+ [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio): DiffSynth Studio is a diffusion engine.
It has restructured the architecture, including text encoders, UNet, VAE, etc., enhancing computational performance while maintaining compatibility with open-source community models. The framework has been adapted for CogVideoX.
+ [CogVideoX-Controlnet](https://github.com/TheDenk/cogvideox-controlnet): Simple ControlNet module code that includes the CogVideoX model.
+ [VideoTuna](https://github.com/VideoVerses/VideoTuna): VideoTuna is the first repo that integrates multiple AI video generation models for text-to-video, image-to-video, and text-to-image generation.
+ [ConsisID](https://github.com/PKU-YuanGroup/ConsisID): An identity-preserving text-to-video generation model based on CogVideoX-5B, which keeps the face consistent in the generated video through frequency decomposition.
+ [A Step by Step Tutorial](https://www.youtube.com/watch?v=5UCkMzP2VLE&ab_channel=SECourses): A step-by-step guide on installing and optimizing the CogVideoX1.5-5B-I2V model in Windows and cloud environments. Special thanks to [FurkanGozukara](https://github.com/FurkanGozukara) for his effort and support!

## Project Structure

This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples of the **CogVideoX** open-source model.

### Quick Start with Colab

The following projects can be run directly on free Colab T4 instances:

+ [CogVideoX-5B-T2V-Colab.ipynb](https://colab.research.google.com/drive/1pCe5s0bC_xuXbBlpvIH1z0kfdTLQPzCS?usp=sharing): CogVideoX-5B Text-to-Video Colab code.
+ [CogVideoX-5B-T2V-Int8-Colab.ipynb](https://colab.research.google.com/drive/1DUffhcjrU-uz7_cpuJO3E_D4BaJT7OPa?usp=sharing): CogVideoX-5B Quantized Text-to-Video Inference Colab code, which takes about 30 minutes per run.
+ [CogVideoX-5B-I2V-Colab.ipynb](https://colab.research.google.com/drive/17CqYCqSwz39nZAX2YyonDxosVKUZGzcX?usp=sharing): CogVideoX-5B Image-to-Video Colab code.
+ [CogVideoX-5B-V2V-Colab.ipynb](https://colab.research.google.com/drive/1comfGAUJnChl5NwPuO8Ox5_6WCy4kbNN?usp=sharing): CogVideoX-5B Video-to-Video Colab code.

### Inference

+ [cli_demo](inference/cli_demo.py): Inference code with a more detailed explanation, including the significance of common parameters.
+ [cli_demo_quantization](inference/cli_demo_quantization.py): Quantized model inference code that can run on devices with lower memory. You can also modify this code to support running CogVideoX models in FP8 precision.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Code for running VAE inference separately.
+ [space demo](inference/gradio_composite_demo): The same GUI code as used in the Huggingface Space, with frame interpolation and super-resolution tools integrated.

<div style="text-align: center;"> <img src="resources/web_demo.png" style="width: 100%; height: auto;" /> </div>

+ [convert_demo](inference/convert_demo.py): How to convert user input into long-form input suitable for CogVideoX. Since CogVideoX is trained on long texts, we need to use an LLM to transform the input text distribution to match the training data. The script defaults to using GLM-4, but it can be replaced with GPT, Gemini, or any other large language model.
+ [gradio_web_demo](inference/gradio_composite_demo): A simple Gradio web application demonstrating how to use the CogVideoX-2B / 5B model to generate videos. Similar to our Huggingface Space, you can use this script to run a simple web application for video generation.

### finetune

+ [finetune_demo](finetune/README.md): Fine-tuning scheme and details of the diffusers version of the CogVideoX model.

### sat

+ [sat_demo](sat/README.md): Contains the inference code and fine-tuning code for SAT weights. It is the recommended starting point for work that modifies the CogVideoX model structure, and researchers can use it for rapid prototyping and development.
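For the diffusers path listed under Inference (as opposed to the SAT code), the precision and memory notes from the Data Explanation section can be sketched as a small helper. This is a minimal illustration, not the repository's own code: the function names, the rule of detecting the 2B model by the substring `2b`, and the stub pipeline are assumptions made for the example. Only the three `enable_*` calls and the FP16/BF16 recommendation come from the notes themselves.

```python
from types import SimpleNamespace


def recommended_dtype(model_id: str) -> str:
    """Name the precision the notes recommend for a checkpoint.

    Assumption: ids containing "2b" are CogVideoX-2B (trained in FP16);
    everything else is treated as a 5B-family model (trained in BF16).
    """
    return "float16" if "2b" in model_id.lower() else "bfloat16"


def apply_memory_optimizations(pipe, multi_gpu: bool = False) -> list:
    """Enable the memory optimizations listed in the Data Explanation notes.

    Sequential CPU offload is skipped for multi-GPU inference, as the
    notes require. Returns the names of the optimizations applied.
    """
    applied = []
    if not multi_gpu:
        pipe.enable_sequential_cpu_offload()
        applied.append("sequential_cpu_offload")
    pipe.vae.enable_slicing()
    applied.append("vae_slicing")
    pipe.vae.enable_tiling()
    applied.append("vae_tiling")
    return applied


def make_stub_pipeline(calls: list):
    """Hypothetical stand-in for a diffusers pipeline; it only records calls."""
    vae = SimpleNamespace(
        enable_slicing=lambda: calls.append("vae.enable_slicing"),
        enable_tiling=lambda: calls.append("vae.enable_tiling"),
    )
    return SimpleNamespace(
        enable_sequential_cpu_offload=lambda: calls.append(
            "enable_sequential_cpu_offload"
        ),
        vae=vae,
    )


if __name__ == "__main__":
    calls = []
    applied = apply_memory_optimizations(make_stub_pipeline(calls), multi_gpu=True)
    print(recommended_dtype("THUDM/CogVideoX-5b"), applied)
```

With a real pipeline, the same helper would presumably be called on the object returned by the diffusers `CogVideoXPipeline.from_pretrained(...)` loader, with `torch_dtype` chosen to match `recommended_dtype`. The stub is used here only so the sketch runs without a GPU or model download.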
### Tools

This folder contains some tools for model conversion / caption generation, etc.

+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Converts SAT model weights to Huggingface model weights.
+ [caption_demo](tools/caption/README.md): Caption tool, a model that understands videos and outputs descriptions in text.
+ [export_sat_lora_weight](tools/export_sat_lora_weight.py): SAT fine-tuning model export tool, exports the SAT Lora Adapter in diffusers format.
+ [load_cogvideox_lora](tools/load_cogvideox_lora.py): Tool code for loading the diffusers version of fine-tuned Lora Adapter.
+ [llm_flux_cogvideox](tools/llm_flux_cogvideox/llm_flux_cogvideox.py): Automatically generate videos using an open-source local large language model + Flux + CogVideoX.
+ [parallel_inference_xdit](tools/parallel_inference/parallel_inference_xdit.py): Supported by [xDiT](https://github.com/xdit-project/xDiT), parallelize the video generation process on multiple GPUs.

## CogVideo(ICLR'23)

The official repo for the paper: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) is on the [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo)

**CogVideo is able to generate relatively high-frame-rate videos.** A 4-second clip of 32 frames is shown below.

![High-frame-rate sample](https://raw.githubusercontent.com/THUDM/CogVideo/CogVideo/assets/appendix-sample-highframerate.png)

![Intro images](https://raw.githubusercontent.com/THUDM/CogVideo/CogVideo/assets/intro-image.png)

<div align="center"> <video src="https://github.com/user-attachments/assets/2fa19651-e925-4a2a-b8d6-b3f216d490ba" width="80%" controls autoplay></video> </div>

The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/), where you can get hands-on practice on text-to-video generation.
*The original input is in Chinese.*

## Citation

🌟 If you find our work helpful, please leave us a star and cite our paper.

```
@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}
```

## Model-License

The code in this repository is released under the [Apache 2.0 License](LICENSE).

The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the [Apache 2.0 License](LICENSE).

The CogVideoX-5B model (Transformers module, including I2V and T2V) is released under the [CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).

AI & Machine Learning Video Editing

12.8K Github Stars

Open Source

GLM-4.5

# GLM-4.7 & GLM-4.6 & GLM-4.5

[中文阅读](./README_zh.md) | [日本語版](./README_ja.md)

<div align="center"> <img src=resources/logo.svg width="15%"/> </div>

👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> or <a href="https://discord.gg/QR7SARHRxK" target="_blank">Discord</a> community.

📖 Check out the GLM-4.7 <a href="https://z.ai/blog/glm-4.7" target="_blank">technical blog</a>, <a href="https://arxiv.org/abs/2508.06471" target="_blank">technical report(GLM-4.5)</a>, and <a href="https://zhipu-ai.feishu.cn/wiki/Gv3swM0Yci7w7Zke9E0crhU7n7D" target="_blank">Zhipu AI technical documentation</a>.

📍 Use GLM-4.7 API services on <a href="https://docs.z.ai/guides/llm/glm-4.7">Z.ai API Platform</a>.

👉 One click to <a href="https://chat.z.ai">GLM-4.7</a>.

## Model Introduction

### GLM-4.7

**GLM-4.7**, your new coding partner, comes with the following features:

- **Core Coding**: GLM-4.7 brings clear gains, compared to its predecessor GLM-4.6, in multilingual agentic coding and terminal-based tasks, including (73.8%, +5.8%) on SWE-bench, (66.7%, +12.9%) on SWE-bench Multilingual, and (41%, +16.5%) on Terminal Bench 2.0. GLM-4.7 also supports thinking before acting, with significant improvements on complex tasks in mainstream agent frameworks such as Claude Code, Kilo Code, Cline, and Roo Code.
- **Vibe Coding**: GLM-4.7 takes a big step forward in improving UI quality. It produces cleaner, more modern webpages and generates better-looking slides with more accurate layout and sizing.
- **Tool Using**: GLM-4.7 achieves significant improvements in tool use. Significantly better performance can be seen on benchmarks such as τ^2-Bench and on web browsing via BrowseComp.
- **Complex Reasoning**: GLM-4.7 delivers a substantial boost in mathematical and reasoning capabilities, achieving (42.8%, +12.4%) on the HLE (Humanity’s Last Exam) benchmark compared to GLM-4.6.

More generally, there are also significant improvements in many other scenarios, such as chat, creative writing, and role-play.

![bench](resources/bench_glm47.png)

**Interleaved Thinking & Preserved Thinking**

![thinking](resources/thinking.png)

GLM-4.7 further enhances **Interleaved Thinking** (a feature introduced in GLM-4.5) and introduces **Preserved Thinking** and **Turn-level Thinking**. By thinking between actions and staying consistent across turns, it makes complex tasks more stable and more controllable:

- **Interleaved Thinking**: The model thinks before every response and tool call, improving instruction following and the quality of generation.
- **Preserved Thinking**: In coding agent scenarios, the model automatically retains all thinking blocks across multi-turn conversations, reusing the existing reasoning instead of re-deriving from scratch. This reduces information loss and inconsistencies, and is well-suited for long-horizon, complex tasks.
- **Turn-level Thinking**: The model supports per-turn control over reasoning within a session: disable thinking for lightweight requests to reduce latency/cost, and enable it for complex tasks to improve accuracy and stability.

More details: https://docs.z.ai/guides/capabilities/thinking-mode

We also provide the lightweight 30B-A3B model GLM-4.7-Flash, offering a new option for lightweight deployment that balances performance and efficiency.

### GLM-4.6

Compared with GLM-4.5, **GLM-4.6** brings several key improvements:

- **Longer context window:** The context window has been expanded from 128K to 200K tokens, enabling the model to handle more complex agentic tasks.
- **Superior coding performance:** The model achieves higher scores on code benchmarks and demonstrates better real-world performance in applications such as Claude Code, Cline, Roo Code, and Kilo Code, including improvements in generating visually polished front-end pages.
- **Advanced reasoning:** GLM-4.6 shows a clear improvement in reasoning performance and supports tool use during inference, leading to stronger overall capability.
- **More capable agents:** GLM-4.6 exhibits stronger performance in tool use and search-based agents, and integrates more effectively within agent frameworks.
- **Refined writing:** Better aligns with human preferences in style and readability, and performs more naturally in role-playing scenarios.

We evaluated GLM-4.6 across eight public benchmarks covering agents, reasoning, and coding. Results show clear gains over GLM-4.5, with GLM-4.6 also holding competitive advantages over leading domestic and international models such as **DeepSeek-V3.1-Terminus** and **Claude Sonnet 4**.

### GLM-4.5

The **GLM-4.5** series models are foundation models designed for intelligent agents. GLM-4.5 has **355** billion total parameters with **32** billion active parameters, while GLM-4.5-Air adopts a more compact design with **106** billion total parameters and **12** billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses.

We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development.

As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of **63.2**, ranking **3rd** among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at **59.8** while maintaining superior efficiency.

For more evaluation results, showcases, and technical details, please visit our [technical report](https://arxiv.org/abs/2508.06471).

## Model Downloads

| Model | Download Links | Model Size | Precision |
|---|---|---|---|
| GLM-4.7 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.7) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.7) | 355B-A32B | BF16 |
| GLM-4.7-FP8 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.7-FP8) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.7-FP8) | 355B-A32B | FP8 |
| GLM-4.7-Flash | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.7-Flash) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.7-Flash) | 30B-A3B | BF16 |
| GLM-4.6 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.6) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.6) | 355B-A32B | BF16 |
| GLM-4.6-FP8 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.6-FP8) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.6-FP8) | 355B-A32B | FP8 |
| GLM-4.5 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5) | 355B-A32B | BF16 |
| GLM-4.5-Air | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air) | 106B-A12B | BF16 |
| GLM-4.5-FP8 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-FP8) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-FP8) | 355B-A32B | FP8 |
| GLM-4.5-Air-FP8 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air-FP8) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air-FP8) | 106B-A12B | FP8 |
| GLM-4.5-Base | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Base) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Base) | 355B-A32B | BF16 |
| GLM-4.5-Air-Base | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air-Base) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air-Base) | 106B-A12B | BF16 |

- The model code, tool parser and reasoning parser of GLM-4.5, GLM-4.6 and GLM-4.7 can be found in the implementation of [transformers](https://github.com/huggingface/transformers/tree/main/src/transformers/models/glm4_moe), [vLLM](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/glm4_moe_mtp.py) and [SGLang](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/glm4_moe.py).
- The model code of GLM-4.7-Flash can be found in the implementation of [transformers](https://github.com/huggingface/transformers/tree/main/src/transformers/models/glm4_moe_lite), [vLLM](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/glm4_moe_lite_mtp.py) and [SGLang](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/glm4_moe_lite.py).

## System Requirements

### Inference with Nvidia GPUs

We provide minimum and recommended configurations for "full-featured" model inference. The data in the table below is based on the following conditions:

1. All models use MTP layers and specify `--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4` to ensure competitive inference speed.
2. The `cpu-offload` parameter is not used.
3. Inference batch size does not exceed `8`.
4. All are executed on devices that natively support FP8 inference, ensuring both weights and cache are in FP8 format.
5. Server memory must exceed `1T` to ensure normal model loading and operation.
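
As a rough cross-check on the GPU counts in the tables of this section, the weight footprint alone can be estimated from the parameter counts listed under Model Downloads. The sketch below is a back-of-the-envelope calculation, not the authors' sizing method. It assumes 80 GB of memory per H100, 2 bytes per parameter for BF16 and 1 byte for FP8, and it rounds the GPU count up to a power of two on the assumption that tensor-parallel layouts prefer such counts. It ignores KV cache, activations, and MTP layers, so it is only a lower bound.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"BF16": 2, "FP8": 1}

# Assumption: the 80 GB H100 variant.
GPU_MEMORY_GB = 80


def weight_memory_gb(total_params_billion: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1e9 params * bytes per param)."""
    return total_params_billion * BYTES_PER_PARAM[precision]


def min_gpus_for_weights(total_params_billion: float, precision: str,
                         gpu_memory_gb: float = GPU_MEMORY_GB) -> int:
    """Smallest GPU count whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(total_params_billion, precision) / gpu_memory_gb)


def next_power_of_two(n: int) -> int:
    """Round a positive integer up to the nearest power of two."""
    return 1 << (n - 1).bit_length()


def estimated_gpu_count(total_params_billion: float, precision: str) -> int:
    """Lower-bound GPU count, rounded up to a power of two (an assumption)."""
    return next_power_of_two(min_gpus_for_weights(total_params_billion, precision))


if __name__ == "__main__":
    models = [
        ("GLM-4.5", 355, "BF16"),
        ("GLM-4.5", 355, "FP8"),
        ("GLM-4.5-Air", 106, "BF16"),
        ("GLM-4.5-Air", 106, "FP8"),
        ("GLM-4.7-Flash", 30, "BF16"),
    ]
    for name, params, precision in models:
        print(name, precision,
              weight_memory_gb(params, precision), "GB ->",
              estimated_gpu_count(params, precision), "GPUs")
```

Under these assumptions the sketch yields 16, 8, 4, 2, and 1 GPUs for the five rows, which agrees with the minimum-configuration table. The agreement is a plausibility check only; the larger counts needed for the full 128K context reflect cache memory that this estimate leaves out.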

The models can run under the configurations in the table below:

| Model | Precision | GPU Type and Count |
|---|---|---|
| GLM-4.5 | BF16 | H100 x 16 |
| GLM-4.5 | FP8 | H100 x 8 |
| GLM-4.5-Air | BF16 | H100 x 4 |
| GLM-4.5-Air | FP8 | H100 x 2 |
| GLM-4.7-Flash | BF16 | H100 x 1 |

Under the configurations in the table below, the models can utilize their full 128K context length:

| Model | Precision | GPU Type and Count |
|---|---|---|
| GLM-4.5 | BF16 | H100 x 32 |
| GLM-4.5 | FP8 | H100 x 16 |
| GLM-4.5-Air | BF16 | H100 x 8 |
| GLM-4.5-Air | FP8 | H100 x 4 |
| GLM-4.7-Flash | BF16 | H100 x 2 |

### Other Devices

- To perform fast inference on Ascend A3 devices using [xLLM](https://github.com/jd-opensource/xllm), please refer to the [Ascend NPU Deployment Guide](example/Ascend_NPU/README_zh.md).
- To run inference on AMD GPUs, please refer to the [AMD GPU Deployment Guide](example/AMD_GPU/README.md).

### Fine-tuning

The code can run under the configurations in the table below using [Llama Factory](https://github.com/hiyouga/LLaMA-Factory):

| Model | GPU Type and Count | Strategy | Batch Size (per GPU) |
|---|---|---|---|
| GLM-4.5 | H100 x 16 | Lora | 1 |
| GLM-4.5-Air | H100 x 4 | Lora | 1 |

The code can run under the configurations in the table below using [Swift](https://github.com/modelscope/ms-swift):

| Model | GPU Type and Count | Strategy | Batch Size (per GPU) |
|---|---|---|---|
| GLM-4.5 | H20 (96GiB) x 16 | Lora | 1 |
| GLM-4.5-Air | H20 (96GiB) x 4 | Lora | 1 |
| GLM-4.5 | H20 (96GiB) x 128 | SFT | 1 |
| GLM-4.5-Air | H20 (96GiB) x 32 | SFT | 1 |
| GLM-4.5 | H20 (96GiB) x 128 | RL | 1 |
| GLM-4.5-Air | H20 (96GiB) x 32 | RL | 1 |

## Quick Start

Install dependencies (sglang, vllm, etc.) according to the configuration requirements in `requirements.txt`.

### transformers

Please refer to the `trans_infer_cli.py` code in the `inference` folder.

### vLLM

```shell
vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-4.7-fp8
```

### SGLang

```shell
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-4.7-FP8 \
    --tp-size 8 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.8 \
    --served-model-name glm-4.7-fp8 \
    --host 0.0.0.0 \
    --port 8000
```

- PD-Disaggregation

The following is a simple way to implement PD-Disaggregation on a single machine with multiple GPUs, where P and D each use 4 GPUs for GLM-4.5-Air:

```shell
python -m sglang.launch_server --model-path zai-org/GLM-4.5-Air --disaggregation-mode prefill --disaggregation-ib-device mlx5_0 --tp-size 4
python -m sglang.launch_server --model-path zai-org/GLM-4.5-Air --disaggregation-mode decode --port 30001 --disaggregation-ib-device mlx5_0 --tp-size 4 --base-gpu-id 4
python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
```

### Parameter Instructions

- For GLM-4.7, `--tool-call-parser` should be set to `glm47` for both `vLLM` and `SGLang`.
- For agentic tasks with GLM-4.7, please turn on [Preserved Thinking mode](https://docs.z.ai/guides/capabilities/thinking-mode) by adding the following config (supported only by SGLang):

```
"chat_template_kwargs": {
    "enable_thinking": true,
    "clear_thinking": false
}
```

- When using `vLLM` and `SGLang`, thinking mode is enabled by default when sending requests.
To disable thinking mode, add the `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` parameter to the request.
- Both support tool calling. Please use the OpenAI-style tool description format for calls.
- For specific code, please refer to `api_request.py` in the `inference` folder.

### Evaluation

- For tool-integrated reasoning, please refer to [this doc](resources/glm_4.6_tir_guide.md).
- For search benchmarks, we designed a specific format for search tool calls in thinking mode to support search agents; please refer to [this file](resources/trajectory_search.json) for the detailed template.

## Citation

If you find our work useful in your research, please consider citing the following paper:

```bibtex
@misc{5team2025glm45agenticreasoningcoding,
      title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
      author={GLM Team and Aohan Zeng and Xin Lv and Qinkai Zheng and Zhenyu Hou and Bin Chen and Chengxing Xie and Cunxiang Wang and Da Yin and Hao Zeng and Jiajie Zhang and Kedong Wang and Lucen Zhong and Mingdao Liu and Rui Lu and Shulin Cao and Xiaohan Zhang and Xuancheng Huang and Yao Wei and Yean Cheng and Yifan An and Yilin Niu and Yuanhao Wen and Yushi Bai and Zhengxiao Du and Zihan Wang and Zilin Zhu and Bohan Zhang and Bosi Wen and Bowen Wu and Bowen Xu and Can Huang and Casey Zhao and Changpeng Cai and Chao Yu and Chen Li and Chendi Ge and Chenghua Huang and Chenhui Zhang and Chenxi Xu and Chenzheng Zhu and Chuang Li and Congfeng Yin and Daoyan Lin and Dayong Yang and Dazhi Jiang and Ding Ai and Erle Zhu and Fei Wang and Gengzheng Pan and Guo Wang and Hailong Sun and Haitao Li and Haiyang Li and Haiyi Hu and Hanyu Zhang and Hao Peng and Hao Tai and Haoke Zhang and Haoran Wang and Haoyu Yang and He Liu and He Zhao and Hongwei Liu and Hongxi Yan and Huan Liu and Huilong Chen and Ji Li and Jiajing Zhao and Jiamin Ren and Jian Jiao and Jiani Zhao and Jianyang Yan and Jiaqi Wang and Jiayi Gui and Jiayue Zhao and Jie Liu and Jijie Li
and Jing Li and Jing Lu and Jingsen Wang and Jingwei Yuan and Jingxuan Li and Jingzhao Du and Jinhua Du and Jinxin Liu and Junkai Zhi and Junli Gao and Ke Wang and Lekang Yang and Liang Xu and Lin Fan and Lindong Wu and Lintao Ding and Lu Wang and Man Zhang and Minghao Li and Minghuan Xu and Mingming Zhao and Mingshu Zhai and Pengfan Du and Qian Dong and Shangde Lei and Shangqing Tu and Shangtong Yang and Shaoyou Lu and Shijie Li and Shuang Li and Shuang-Li and Shuxun Yang and Sibo Yi and Tianshu Yu and Wei Tian and Weihan Wang and Wenbo Yu and Weng Lam Tam and Wenjie Liang and Wentao Liu and Xiao Wang and Xiaohan Jia and Xiaotao Gu and Xiaoying Ling and Xin Wang and Xing Fan and Xingru Pan and Xinyuan Zhang and Xinze Zhang and Xiuqing Fu and Xunkai Zhang and Yabo Xu and Yandong Wu and Yida Lu and Yidong Wang and Yilin Zhou and Yiming Pan and Ying Zhang and Yingli Wang and Yingru Li and Yinpei Su and Yipeng Geng and Yitong Zhu and Yongkun Yang and Yuhang Li and Yuhao Wu and Yujiang Li and Yunan Liu and Yunqing Wang and Yuntao Li and Yuxuan Zhang and Zezhen Liu and Zhen Yang and Zhengda Zhou and Zhongpei Qiao and Zhuoer Feng and Zhuorui Liu and Zichen Zhang and Zihan Wang and Zijun Yao and Zikang Wang and Ziqiang Liu and Ziwei Chai and Zixuan Li and Zuodong Zhao and Wenguang Chen and Jidong Zhai and Bin Xu and Minlie Huang and Hongning Wang and Juanzi Li and Yuxiao Dong and Jie Tang}, year={2025}, eprint={2508.06471}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2508.06471}, } ```
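
Returning to the Parameter Instructions section of this README, the two thinking-related settings can be sketched as a request body for an OpenAI-compatible chat endpoint. This is a minimal illustration under stated assumptions: the helper name `build_chat_request` and its arguments are hypothetical, and placing `chat_template_kwargs` at the top level of the JSON body reflects how the `extra_body` parameter quoted in that section is merged into the request. The served model name `glm-4.7-fp8` is taken from the launch commands in the Quick Start.

```python
import json


def build_chat_request(model: str, messages: list,
                       enable_thinking: bool = True,
                       preserve_thinking: bool = False) -> dict:
    """Build the JSON body for a chat completion request.

    enable_thinking=False mirrors the documented way to switch thinking off.
    preserve_thinking=True adds "clear_thinking": false, the Preserved
    Thinking config described for agentic tasks.
    """
    if preserve_thinking and not enable_thinking:
        raise ValueError("Preserved Thinking requires thinking to be enabled")
    chat_template_kwargs = {"enable_thinking": enable_thinking}
    if preserve_thinking:
        chat_template_kwargs["clear_thinking"] = False
    return {
        "model": model,
        "messages": messages,
        "chat_template_kwargs": chat_template_kwargs,
    }


if __name__ == "__main__":
    body = build_chat_request(
        "glm-4.7-fp8",
        [{"role": "user", "content": "Summarize this repository."}],
        enable_thinking=False,
    )
    # The body would be POSTed to the server started in the Quick Start,
    # e.g. the chat completions route on port 8000 (route name assumed).
    print(json.dumps(body, indent=2))
```

Keeping the payload construction separate from the HTTP call makes it easy to switch thinking on or off per turn, which is the per-request control that the README describes as Turn-level Thinking.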

AI & Machine Learning AI Agents ML Frameworks

4.4K Github Stars

Open Source

GLM-5

# GLM-5.1 & GLM-5

<div align="center"> <img src=resources/logo.svg width="15%"/> </div>

👋 Join our <a href="resources/WECHAT.md" target="_blank">Wechat</a> or <a href="https://discord.gg/zFMhpMRFP" target="_blank">Discord</a> community.

📖 Check out the GLM-5.1 <a href="https://z.ai/blog/glm-5.1" target="_blank">blog</a> and GLM-5 <a href="https://arxiv.org/abs/2602.15763" target="_blank">Technical report</a>.

📍 Use GLM-5.1 API services on <a href="https://docs.z.ai/guides/llm/glm-5.1">Z.ai API Platform</a>.

🔜 <a href="https://chat.z.ai">GLM-5.1</a> will be available on chat.z.ai in the coming days.

## Introduction

### GLM-5.1

GLM-5.1 is our next-generation flagship model for agentic engineering, with significantly stronger coding capabilities than its predecessor. It achieves state-of-the-art performance on SWE-Bench Pro and leads GLM-5 by a wide margin on NL2Repo (repo generation) and Terminal-Bench 2.0 (real-world terminal tasks).

![bench_51](resources/bench_51.png)

But the most meaningful leap goes beyond first-pass performance. Previous models, including GLM-5, tend to exhaust their repertoire early: they apply familiar techniques for quick initial gains, then plateau. Giving them more time doesn't help.

GLM-5.1, by contrast, is built to stay effective on agentic tasks over much longer horizons. We've found that the model handles ambiguous problems with better judgment and stays productive over longer sessions. It breaks complex problems down, runs experiments, reads results, and identifies blockers with real precision. By revisiting its reasoning and revising its strategy through repeated iteration, GLM-5.1 sustains optimization over hundreds of rounds and thousands of tool calls. The longer it runs, the better the result.

### GLM-5

We are launching GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI).
Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.

Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is a challenge due to the RL training inefficiency. To this end, we developed [slime](https://github.com/THUDM/slime), a novel **asynchronous RL infrastructure** that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvement compared to GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.

![bench](resources/bench.png)

GLM-5 is purpose-built for complex systems engineering and long-horizon agentic tasks. On our internal evaluation suite CC-Bench-V2, GLM-5 significantly outperforms GLM-4.7 across frontend, backend, and long-horizon tasks, narrowing the gap to Claude Opus 4.5.

![realworld_bench](resources/realworld_bench.png)

On [Vending Bench 2](https://andonlabs.com/evals/vending-bench-2), a benchmark that measures long-term operational capability, GLM-5 ranks \#1 among open-source models. Vending Bench 2 requires the model to run a simulated vending machine business over a one-year horizon; GLM-5 finishes with a final account balance of $4,432, approaching Claude Opus 4.5 and demonstrating strong long-term planning and resource management.

![vending_bench](resources/vending_bench.png)

## Download Model

| Model | Download Links | Model Size | Precision |
|---|---|---|---|
| GLM-5.1 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-5.1) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-5.1) | 744B-A40B | BF16 |
| GLM-5.1-FP8 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-5.1-FP8) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-5.1-FP8) | 744B-A40B | FP8 |
| GLM-5 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-5) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-5) | 744B-A40B | BF16 |
| GLM-5-FP8 | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-5-FP8) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-5-FP8) | 744B-A40B | FP8 |

## Serve GLM-5 Series Locally

### Prepare environment

vLLM, SGLang, xLLM and Ktransformers all support local deployment of the GLM-5 series models. A simple deployment guide is provided here.

+ vLLM

  Using Docker:

  ```shell
  docker pull vllm/vllm-openai:v0.20.2-cu129
  docker pull vllm/vllm-openai:v0.20.2  # For CUDA 13.0
  ```

+ SGLang

  Using Docker:

  ```bash
  docker pull lmsysorg/sglang:v0.5.11
  docker pull lmsysorg/sglang:v0.5.11-cu130  # For CUDA 13.0
  ```

### Deploy

+ vLLM

  ```shell
  vllm serve zai-org/GLM-5.1-FP8 \
      --tensor-parallel-size 8 \
      --gpu-memory-utilization 0.85 \
      --speculative-config.method mtp \
      --speculative-config.num_speculative_tokens 3 \
      --tool-call-parser glm47 \
      --reasoning-parser glm45 \
      --enable-auto-tool-choice \
      --chat-template-content-format=string \
      --served-model-name glm-5.1-fp8
  ```

  Check the [recipes](https://github.com/vllm-project/recipes/blob/main/GLM/GLM5.md) for more details.

  > Note: If you encounter tool call parsing issues with MTP enabled, please switch to the vLLM main branch to serve GLM-5.1.

+ SGLang

  ```shell
  sglang serve \
      --model-path zai-org/GLM-5.1-FP8 \
      --tp-size 8 \
      --tool-call-parser glm47 \
      --reasoning-parser glm45 \
      --speculative-algorithm EAGLE \
      --speculative-num-steps 3 \
      --speculative-eagle-topk 1 \
      --speculative-num-draft-tokens 4 \
      --mem-fraction-static 0.85 \
      --served-model-name glm-5.1-fp8 \
      --port 8000 \
      --host 0.0.0.0
  ```

  Check the [sglang cookbook](https://cookbook.sglang.io/autoregressive/GLM/GLM-5.1) for more details.

+ xLLM

  Please check the deployment guide [here](https://github.com/zai-org/GLM-5/blob/main/example/ascend.md).

+ Ktransformers

  Please check the deployment guide [here](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/GLM-5.1-Tutorial.md).

## Citation

If you find GLM-5 series model useful in your research, please cite our technical report:

```bibtex
@misc{glm5team2026glm5vibecodingagentic,
      title={GLM-5: from Vibe Coding to Agentic Engineering},
      author={GLM-5-Team and : and Aohan Zeng and Xin Lv and Zhenyu Hou and Zhengxiao Du and Qinkai Zheng and Bin Chen and Da Yin and Chendi Ge and Chenghua Huang and Chengxing Xie and Chenzheng Zhu and Congfeng Yin and Cunxiang Wang and Gengzheng Pan and Hao Zeng and Haoke Zhang and Haoran Wang and Huilong Chen and Jiajie Zhang and Jian Jiao and Jiaqi Guo and Jingsen Wang and Jingzhao Du and Jinzhu Wu and Kedong Wang and Lei Li and Lin Fan and Lucen Zhong and Mingdao Liu and Mingming Zhao and Pengfan Du and Qian Dong and Rui Lu and Shuang-Li and Shulin Cao and Song Liu and Ting Jiang and Xiaodong Chen and Xiaohan Zhang and Xuancheng Huang and Xuezhen Dong and Yabo Xu and Yao Wei and Yifan An and Yilin Niu and Yitong Zhu and Yuanhao Wen and Yukuo Cen and Yushi Bai and Zhongpei Qiao and Zihan Wang and Zikang Wang and Zilin Zhu and Ziqiang Liu and Zixuan Li and Bojie Wang and Bosi Wen and Can Huang and Changpeng Cai and Chao Yu and Chen Li and Chengwei Hu and Chenhui Zhang and Dan Zhang and Daoyan Lin and Dayong Yang and Di Wang and Ding Ai and Erle Zhu and
Fangzhou Yi and Feiyu Chen and Guohong Wen and Hailong Sun and Haisha Zhao and Haiyi Hu and Hanchen Zhang and Hanrui Liu and Hanyu Zhang and Hao Peng and Hao Tai and Haobo Zhang and He Liu and Hongwei Wang and Hongxi Yan and Hongyu Ge and Huan Liu and Huanpeng Chu and Jia'ni Zhao and Jiachen Wang and Jiajing Zhao and Jiamin Ren and Jiapeng Wang and Jiaxin Zhang and Jiayi Gui and Jiayue Zhao and Jijie Li and Jing An and Jing Li and Jingwei Yuan and Jinhua Du and Jinxin Liu and Junkai Zhi and Junwen Duan and Kaiyue Zhou and Kangjian Wei and Ke Wang and Keyun Luo and Laiqiang Zhang and Leigang Sha and Liang Xu and Lindong Wu and Lintao Ding and Lu Chen and Minghao Li and Nianyi Lin and Pan Ta and Qiang Zou and Rongjun Song and Ruiqi Yang and Shangqing Tu and Shangtong Yang and Shaoxiang Wu and Shengyan Zhang and Shijie Li and Shuang Li and Shuyi Fan and Wei Qin and Wei Tian and Weining Zhang and Wenbo Yu and Wenjie Liang and Xiang Kuang and Xiangmeng Cheng and Xiangyang Li and Xiaoquan Yan and Xiaowei Hu and Xiaoying Ling and Xing Fan and Xingye Xia and Xinyuan Zhang and Xinze Zhang and Xirui Pan and Xu Zou and Xunkai Zhang and Yadi Liu and Yandong Wu and Yanfu Li and Yidong Wang and Yifan Zhu and Yijun Tan and Yilin Zhou and Yiming Pan and Ying Zhang and Yinpei Su and Yipeng Geng and Yong Yan and Yonglin Tan and Yuean Bi and Yuhan Shen and Yuhao Yang and Yujiang Li and Yunan Liu and Yunqing Wang and Yuntao Li and Yurong Wu and Yutao Zhang and Yuxi Duan and Yuxuan Zhang and Zezhen Liu and Zhengtao Jiang and Zhenhe Yan and Zheyu Zhang and Zhixiang Wei and Zhuo Chen and Zhuoer Feng and Zijun Yao and Ziwei Chai and Ziyuan Wang and Zuzhou Zhang and Bin Xu and Minlie Huang and Hongning Wang and Juanzi Li and Yuxiao Dong and Jie Tang}, year={2026}, eprint={2602.15763}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2602.15763}, } ```

Open Source

GLM-skills

# GLM Skills

[Chinese documentation](README_zh.md)

Official skills for the [GLM](https://github.com/zai-org) family of models, designed for agent architectures including [Claude Code](https://docs.anthropic.com/en/docs/claude-code), [OpenCode](https://github.com/opencode-ai/opencode), [OpenClaw](https://github.com/openclaw/openclaw), [AutoClaw](https://autoglm.zhipuai.cn/autoclaw/), and other AI coding agents.

This repository consolidates skills originally distributed across individual model repos into a single, unified collection.

## Skills

### GLM-V (Multimodal)

| Skill | Description |
| ----- | ----------- |
| [glmv-caption](skills/glmv-caption) | Generate captions and descriptions for images, videos, and documents |
| [glmv-doc-based-writing](skills/glmv-doc-based-writing) | Write content (papers, articles, reports) based on PDF/DOCX documents |
| [glmv-grounding](skills/glmv-grounding) | Object localization with bounding-box visualization in images and videos |
| [glmv-pdf-to-ppt](skills/glmv-pdf-to-ppt) | Convert PDF documents into multi-slide HTML presentations |
| [glmv-pdf-to-web](skills/glmv-pdf-to-web) | Convert research papers into polished academic project websites |
| [glmv-prd-to-app](skills/glmv-prd-to-app) | Build full-stack web applications from PRD documents and prototypes |
| [glmv-prompt-gen](skills/glmv-prompt-gen) | Generate AI art prompts from visual references (Midjourney, SD, DALL-E, etc.) |
| [glmv-resume-screen](skills/glmv-resume-screen) | Screen and evaluate resumes against user-defined criteria |
| [glmv-stock-analyst](skills/glmv-stock-analyst) | Multi-source stock analysis and report generation for HK/A-share/US markets |
| [glmv-web-replication](skills/glmv-web-replication) | Create frontend visual replicas of existing websites |

### GLM-OCR

| Skill | Description |
| ----- | ----------- |
| [glmocr](skills/glmocr) | General text extraction from images and PDFs |
| [glmocr-formula](skills/glmocr-formula) | Extract mathematical formulas into LaTeX |
| [glmocr-handwriting](skills/glmocr-handwriting) | Recognize handwritten text |
| [glmocr-sdk](skills/glmocr-sdk) | Document parsing via GLM-OCR SDK CLI |
| [glmocr-table](skills/glmocr-table) | Extract tables into Markdown |

### GLM-Image

| Skill | Description |
| ----- | ----------- |
| [glm-image-gen](skills/glm-image-gen) | Generate high-quality images from text prompts |

### Meta

| Skill | Description |
| ----- | ----------- |
| [glm-master-skill](skills/glm-master-skill) | Discovery and installation guide for all GLM skills |

## Installation

### Method 1: Install from Clawhub (Recommended)

```bash
# Install a single skill
npx clawhub@latest install glmocr

# Install multiple skills at once
npx clawhub@latest install glmocr glmocr-table glmv-caption glm-image-gen
```

### Method 2: Clone from GitHub

```bash
git clone https://github.com/zai-org/skills.git
# Then follow each skill's SKILL.md for setup instructions
```

## API Key

Most skills require a `ZHIPU_API_KEY` environment variable. Get your key at [bigmodel.cn](https://bigmodel.cn/usercenter/proj-mgmt/apikeys).

```bash
export ZHIPU_API_KEY="your_key"
```

## License

This project is licensed under the [Apache License 2.0](LICENSE).
