openai

Open Source

plugins

# Plugins This repository contains a curated collection of Codex plugin examples. Each plugin lives under `plugins/<name>/` with a required `.codex-plugin/plugin.json` manifest and optional companion surfaces such as `skills/`, `.app.json`, `.mcp.json`, plugin-level `agents/`, `commands/`, `hooks.json`, `assets/`, and other supporting files. Highlighted richer examples in this repo include: - `plugins/figma` for `use_figma`, Code to Canvas, Code Connect, and design system rules - `plugins/notion` for planning, research, meetings, and knowledge capture - `plugins/build-ios-apps` for SwiftUI implementation, refactors, performance, and debugging - `plugins/build-macos-apps` for macOS SwiftUI/AppKit workflows, build/run/debug loops, and packaging guidance - `plugins/build-web-apps` for deployment, UI, payments, and database workflows - `plugins/expo` for Expo and React Native apps, SDK upgrades, EAS workflows, and Codex Run actions - `plugins/netlify`, `plugins/remotion`, and `plugins/google-slides` for additional public skill- and MCP-backed plugin bundles

Developer Tools JavaScript Libraries & Components AI Tools

2.5K Github Stars

Open Source

openai/whisper

# Whisper [[Blog]](https://openai.com/blog/whisper) [[Paper]](https://arxiv.org/abs/2212.04356) [[Model card]](https://github.com/openai/whisper/blob/main/model-card.md) [[Colab example]](https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb) Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. ## Approach ![Approach](https://raw.githubusercontent.com/openai/whisper/main/approach.png) A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many stages of a traditional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets. ## Setup We used Python 3.9.9 and [PyTorch](https://pytorch.org/) 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.8-3.11 and recent PyTorch versions. The codebase also depends on a few Python packages, most notably [OpenAI's tiktoken](https://github.com/openai/tiktoken) for their fast tokenizer implementation. You can download and install (or update to) the latest release of Whisper with the following command: pip install -U openai-whisper Alternatively, the following command will pull and install the latest commit from this repository, along with its Python dependencies: pip install git+https://github.com/openai/whisper.git To update the package to the latest version of this repository, please run: pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git It also requires the command-line tool [`ffmpeg`](https://ffmpeg.org/) to be installed on your system, which is available from most package managers: ```bash # on Ubuntu or Debian sudo apt update && sudo apt install ffmpeg # on Arch Linux sudo pacman -S ffmpeg # on MacOS using Homebrew (https://brew.sh/) brew install ffmpeg # on Windows using Chocolatey (https://chocolatey.org/) choco install ffmpeg # on Windows using Scoop (https://scoop.sh/) scoop install ffmpeg ``` You may need [`rust`](http://rust-lang.org) installed as well, in case [tiktoken](https://github.com/openai/tiktoken) does not provide a pre-built wheel for your platform. If you see installation errors during the `pip install` command above, please follow the [Getting started page](https://www.rust-lang.org/learn/get-started) to install Rust development environment. Additionally, you may need to configure the `PATH` environment variable, e.g. `export PATH="$HOME/.cargo/bin:$PATH"`. If the installation fails with `No module named 'setuptools_rust'`, you need to install `setuptools_rust`, e.g. by running: ```bash pip install setuptools-rust ``` ## Available models and languages There are six model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model. The relative speeds below are measured by transcribing English speech on a A100, and the real-world speed may vary significantly depending on many factors including the language, the speaking speed, and the available hardware. | Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed | |:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:| | tiny | 39 M | `tiny.en` | `tiny` | ~1 GB | ~10x | | base | 74 M | `base.en` | `base` | ~1 GB | ~7x | | small | 244 M | `small.en` | `small` | ~2 GB | ~4x | | medium | 769 M | `medium.en` | `medium` | ~5 GB | ~2x | | large | 1550 M | N/A | `large` | ~10 GB | 1x | | turbo | 809 M | N/A | `turbo` | ~6 GB | ~8x | The `.en` models for English-only applications tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models. Additionally, the `turbo` model is an optimized version of `large-v3` that offers faster transcription speed with a minimal degradation in accuracy. Whisper's performance varies widely depending on the language. The figure below shows a performance breakdown of `large-v3` and `large-v2` models by language, using WERs (word error rates) or CER (character error rates, shown in *Italic*) evaluated on the Common Voice 15 and Fleurs datasets. Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of [the paper](https://arxiv.org/abs/2212.04356), as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3. ![WER breakdown by language](https://github.com/openai/whisper/assets/266841/f4619d66-1058-4005-8f67-a9d811b77c62) ## Command-line usage The following command will transcribe speech in audio files, using the `turbo` model: ```bash whisper audio.flac audio.mp3 audio.wav --model turbo ``` The default setting (which selects the `turbo` model) works well for transcribing English. However, **the `turbo` model is not trained for translation tasks**. If you need to **translate non-English speech into English**, use one of the **multilingual models** (`tiny`, `base`, `small`, `medium`, `large`) instead of `turbo`. For example, to transcribe an audio file containing non-English speech, you can specify the language: ```bash whisper japanese.wav --language Japanese ``` To **translate** speech into English, use: ```bash whisper japanese.wav --model medium --language Japanese --task translate ``` > **Note:** The `turbo` model will return the original language even if `--task translate` is specified. Use `medium` or `large` for the best translation results. Run the following to view all available options: ```bash whisper --help ``` See [tokenizer.py](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py) for the list of all available languages. ## Python usage Transcription can also be performed within Python: ```python import whisper model = whisper.load_model("turbo") result = model.transcribe("audio.mp3") print(result["text"]) ``` Internally, the `transcribe()` method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window. Below is an example usage of `whisper.detect_language()` and `whisper.decode()` which provide lower-level access to the model. ```python import whisper model = whisper.load_model("turbo") # load audio and pad/trim it to fit 30 seconds audio = whisper.load_audio("audio.mp3") audio = whisper.pad_or_trim(audio) # make log-Mel spectrogram and move to the same device as the model mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device) # detect the spoken language _, probs = model.detect_language(mel) print(f"Detected language: {max(probs, key=probs.get)}") # decode the audio options = whisper.DecodingOptions() result = whisper.decode(model, mel, options) # print the recognized text print(result.text) ``` ## More examples Please use the [🙌 Show and tell](https://github.com/openai/whisper/discussions/categories/show-and-tell) category in Discussions for sharing more example usages of Whisper and third-party extensions such as web demos, integrations with other tools, ports for different platforms, etc. ## License Whisper's code and model weights are released under the MIT License. See [LICENSE](https://github.com/openai/whisper/blob/main/LICENSE) for further details.

Media ML Frameworks

81.1K Github Stars

Open Source

openai-agents-python

# OpenAI Agents SDK [![PyPI](https://img.shields.io/pypi/v/openai-agents?label=pypi%20package)](https://pypi.org/project/openai-agents/) The OpenAI Agents SDK is a lightweight yet powerful framework for building multi-agent workflows. It is provider-agnostic, supporting the OpenAI Responses and Chat Completions APIs, as well as 100+ other LLMs. <img src="https://cdn.openai.com/API/docs/images/orchestration.png" alt="Image of the Agents Tracing UI" style="max-height: 803px;"> > [!NOTE] > Looking for the JavaScript/TypeScript version? Check out [Agents SDK JS/TS](https://github.com/openai/openai-agents-js). ### Core concepts: 1. [**Agents**](https://openai.github.io/openai-agents-python/agents): LLMs configured with instructions, tools, guardrails, and handoffs 1. [**Sandbox Agents**](https://openai.github.io/openai-agents-python/sandbox_agents): Agents preconfigured to work with a container to perform work over long time horizons. 1. **[Agents as tools](https://openai.github.io/openai-agents-python/tools/#agents-as-tools) / [Handoffs](https://openai.github.io/openai-agents-python/handoffs/)**: Delegating to other agents for specific tasks 1. [**Tools**](https://openai.github.io/openai-agents-python/tools/): Various Tools let agents take actions (functions, MCP, hosted tools) 1. [**Guardrails**](https://openai.github.io/openai-agents-python/guardrails/): Configurable safety checks for input and output validation 1. [**Human in the loop**](https://openai.github.io/openai-agents-python/human_in_the_loop/): Built-in mechanisms for involving humans across agent runs 1. [**Sessions**](https://openai.github.io/openai-agents-python/sessions/): Automatic conversation history management across agent runs 1. [**Tracing**](https://openai.github.io/openai-agents-python/tracing/): Built-in tracking of agent runs, allowing you to view, debug and optimize your workflows 1. [**Realtime Agents**](https://openai.github.io/openai-agents-python/realtime/quickstart/): Build powerful voice agents with `gpt-realtime-2` and full agent features Explore the [examples](https://github.com/openai/openai-agents-python/tree/main/examples) directory to see the SDK in action, and read our [documentation](https://openai.github.io/openai-agents-python/) for more details. ## Get started To get started, set up your Python environment (Python 3.10 or newer required), and then install OpenAI Agents SDK package. ### venv ```bash python -m venv .venv source .venv/bin/activate # On Windows: .venv\Scripts\activate pip install openai-agents ``` For voice support, install with the optional `voice` group: `pip install 'openai-agents[voice]'`. For Redis session support, install with the optional `redis` group: `pip install 'openai-agents[redis]'`. ### uv If you're familiar with [uv](https://docs.astral.sh/uv/), installing the package would be even easier: ```bash uv init uv add openai-agents ``` For voice support, install with the optional `voice` group: `uv add 'openai-agents[voice]'`. For Redis session support, install with the optional `redis` group: `uv add 'openai-agents[redis]'`. ## Run your first Sandbox Agent [Sandbox Agents](https://openai.github.io/openai-agents-python/sandbox_agents) are new in version 0.14.0. A sandbox agent is an agent that uses a computer environment to perform real work with a filesystem, in an environment you configure and control. Sandbox agents are useful when the agent needs to inspect files, run commands, apply patches, or carry workspace state across longer tasks. ```python from agents import Runner from agents.run import RunConfig from agents.sandbox import Manifest, SandboxAgent, SandboxRunConfig from agents.sandbox.entries import GitRepo from agents.sandbox.sandboxes import UnixLocalSandboxClient agent = SandboxAgent( name="Workspace Assistant", instructions="Inspect the sandbox workspace before answering.", default_manifest=Manifest( entries={ "repo": GitRepo(repo="openai/openai-agents-python", ref="main"), } ), ) result = Runner.run_sync( agent, "Inspect the repo README and summarize what this project does.", # Run this agent on the local filesystem run_config=RunConfig(sandbox=SandboxRunConfig(client=UnixLocalSandboxClient())), ) print(result.final_output) # This project provides a Python SDK for building multi-agent workflows. ``` (_If running this, ensure you set the `OPENAI_API_KEY` environment variable_) (_For Jupyter notebook users, see [hello_world_jupyter.ipynb](https://github.com/openai/openai-agents-python/blob/main/examples/basic/hello_world_jupyter.ipynb)_) Explore the [examples](https://github.com/openai/openai-agents-python/tree/main/examples) directory to see the SDK in action, and read our [documentation](https://openai.github.io/openai-agents-python/) for more details. ## Acknowledgements We'd like to acknowledge the excellent work of the open-source community, especially: - [Pydantic](https://docs.pydantic.dev/latest/) - [Requests](https://github.com/psf/requests) - [MCP Python SDK](https://github.com/modelcontextprotocol/python-sdk) - [Griffe](https://github.com/mkdocstrings/griffe) This library has these optional dependencies: - [websockets](https://github.com/python-websockets/websockets) - [SQLAlchemy](https://github.com/sqlalchemy/sqlalchemy) - [any-llm](https://github.com/mozilla-ai/any-llm) and [LiteLLM](https://github.com/BerriAI/litellm) We also rely on the following tools to manage the project: - [uv](https://github.com/astral-sh/uv) and [ruff](https://github.com/astral-sh/ruff) - [mypy](https://github.com/python/mypy) and [Pyright](https://github.com/microsoft/pyright) - [pytest](https://github.com/pytest-dev/pytest) and [Coverage.py](https://github.com/coveragepy/coveragepy) - [MkDocs](https://github.com/squidfunk/mkdocs-material) We're committed to continuing to build the Agents SDK as an open source framework so others in the community can expand on our approach.

AI Agents ML Frameworks

27K Github Stars

Open Source

CLIP

# CLIP [[Blog]](https://openai.com/blog/clip/) [[Paper]](https://arxiv.org/abs/2103.00020) [[Model Card]](model-card.md) [[Colab]](https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb) CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. We found CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision. ## Approach ![CLIP](CLIP.png) ## Usage First, [install PyTorch 1.7.1](https://pytorch.org/get-started/locally/) (or later) and torchvision, as well as small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick: ```bash $ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0 $ pip install ftfy regex tqdm $ pip install git+https://github.com/openai/CLIP.git ``` Replace `cudatoolkit=11.0` above with the appropriate CUDA version on your machine or `cpuonly` when installing on a machine without a GPU. ```python import torch import clip from PIL import Image device = "cuda" if torch.cuda.is_available() else "cpu" model, preprocess = clip.load("ViT-B/32", device=device) image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device) text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device) with torch.no_grad(): image_features = model.encode_image(image) text_features = model.encode_text(text) logits_per_image, logits_per_text = model(image, text) probs = logits_per_image.softmax(dim=-1).cpu().numpy() print("Label probs:", probs) # prints: [[0.9927937 0.00421068 0.00299572]] ``` ## API The CLIP module `clip` provides the following methods: #### `clip.available_models()` Returns the names of the available CLIP models. #### `clip.load(name, device=..., jit=False)` Returns the model and the TorchVision transform needed by the model, specified by the model name returned by `clip.available_models()`. It will download the model as necessary. The `name` argument can also be a path to a local checkpoint. The device to run the model can be optionally specified, and the default is to use the first CUDA device if there is any, otherwise the CPU. When `jit` is `False`, a non-JIT version of the model will be loaded. #### `clip.tokenize(text: Union[str, List[str]], context_length=77)` Returns a LongTensor containing tokenized sequences of given text input(s). This can be used as the input to the model --- The model returned by `clip.load()` supports the following methods: #### `model.encode_image(image: Tensor)` Given a batch of images, returns the image features encoded by the vision portion of the CLIP model. #### `model.encode_text(text: Tensor)` Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model. #### `model(image: Tensor, text: Tensor)` Given a batch of images and a batch of text tokens, returns two Tensors, containing the logit scores corresponding to each image and text input. The values are cosine similarities between the corresponding image and text features, times 100. ## More Examples ### Zero-Shot Prediction The code below performs zero-shot prediction using CLIP, as shown in Appendix B in the paper. This example takes an image from the [CIFAR-100 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), and predicts the most likely labels among the 100 textual labels from the dataset. ```python import os import clip import torch from torchvision.datasets import CIFAR100 # Load the model device = "cuda" if torch.cuda.is_available() else "cpu" model, preprocess = clip.load('ViT-B/32', device) # Download the dataset cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False) # Prepare the inputs image, class_id = cifar100[3637] image_input = preprocess(image).unsqueeze(0).to(device) text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device) # Calculate features with torch.no_grad(): image_features = model.encode_image(image_input) text_features = model.encode_text(text_inputs) # Pick the top 5 most similar labels for the image image_features /= image_features.norm(dim=-1, keepdim=True) text_features /= text_features.norm(dim=-1, keepdim=True) similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1) values, indices = similarity[0].topk(5) # Print the result print("\nTop predictions:\n") for value, index in zip(values, indices): print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%") ``` The output will look like the following (the exact numbers may be slightly different depending on the compute device): ``` Top predictions: snake: 65.31% turtle: 12.29% sweet_pepper: 3.83% lizard: 1.88% crocodile: 1.75% ``` Note that this example uses the `encode_image()` and `encode_text()` methods that return the encoded features of given inputs. ### Linear-probe evaluation The example below uses [scikit-learn](https://scikit-learn.org/) to perform logistic regression on image features. ```python import os import clip import torch import numpy as np from sklearn.linear_model import LogisticRegression from torch.utils.data import DataLoader from torchvision.datasets import CIFAR100 from tqdm import tqdm # Load the model device = "cuda" if torch.cuda.is_available() else "cpu" model, preprocess = clip.load('ViT-B/32', device) # Load the dataset root = os.path.expanduser("~/.cache") train = CIFAR100(root, download=True, train=True, transform=preprocess) test = CIFAR100(root, download=True, train=False, transform=preprocess) def get_features(dataset): all_features = [] all_labels = [] with torch.no_grad(): for images, labels in tqdm(DataLoader(dataset, batch_size=100)): features = model.encode_image(images.to(device)) all_features.append(features) all_labels.append(labels) return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy() # Calculate the image features train_features, train_labels = get_features(train) test_features, test_labels = get_features(test) # Perform logistic regression classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1) classifier.fit(train_features, train_labels) # Evaluate using the logistic regression classifier predictions = classifier.predict(test_features) accuracy = np.mean((test_labels == predictions).astype(float)) * 100. print(f"Accuracy = {accuracy:.3f}") ``` Note that the `C` value should be determined via a hyperparameter sweep using a validation split. ## See Also * [OpenCLIP](https://github.com/mlfoundations/open_clip): includes larger and independently trained CLIP models up to ViT-G/14 * [Hugging Face implementation of CLIP](https://huggingface.co/docs/transformers/model_doc/clip): for easier integration with the HF ecosystem

ML Frameworks

33.7K Github Stars

Software by openai

plugins

openai/whisper

openai-agents-python

CLIP