
thiswillbeyourgithub

Professional software vendor delivering innovative solutions on the Softono platform. Specialized in both open-source and proprietary software development.

Total Products
2

Software by thiswillbeyourgithub

AnkiAIUtils
Open Source


# Anki AI Utils

A powerful suite of AI-powered tools to enhance your [Anki](https://en.wikipedia.org/wiki/Anki_(software)) flashcard learning experience by automatically improving the cards you struggle with, tested through medical school.

Think of it like this: every time you fail a card you get a ChatGPT explanation, a DALL-E illustration, mnemonics, etc., all while supporting your own mnemonics.

**Check out my other Anki and AI related projects on my [GitHub profile](https://github.com/thiswillbeyourgithub)!**

### Simple example

**These scripts make it so that every failed note automatically gets new fields containing explanations, mnemonics, and illustrations.** This is done in a way that respects **your own mnemonics**, can even use the [major system](https://en.wikipedia.org/wiki/Mnemonic_major_system), and has **many** more features.

For an example of the kind of workflow that AnkiIllustrator facilitates, watch [pixorise's YouTube channel](https://www.youtube.com/watch?v=QzxHpjryJHg0), which is full of great image mnemonics and particularly suited to memory palace styles of learning.

## Developer's note / call for help

This collection of scripts is the culmination of my efforts to contribute the AI features I wish had existed when I started medical school. All scripts should be working, but I released them hastily after documenting them heavily with the help of [aider](https://aider.chat/). It is possible that some aspects of the documentation are slightly off or imprecise. It is also possible that some of the scripts were slightly broken during the release process.

In any case, **by releasing this project made with love and care my hope is to motivate others to package it into addons.** I have too little time to learn how to package those scripts into addons and build the appropriate GUI, so any help is absolutely welcome. To that end, please take a look at the TODO list below and open an issue!

## Key Features

- **Adaptive Learning**: Uses [semantic similarity](https://en.wikipedia.org/wiki/Semantic_similarity) to dynamically match your cards with the most relevant examples from your training datasets. The more examples you add, the better it gets!
- **Personalized Memory Hooks**: Reuses consistent mnemonics from your custom collection, building a personalized memory system. Includes a dedicated tool to help create and manage your mnemonic library.
- **Automation Ready**: Run programmatically. For example, use cron to automatically enhance the cards you struggled with yesterday, making them easier to remember through images, mnemonics, and explanations.
- **Universal Compatibility**: Modifies Anki notes directly in-place, working seamlessly across all Anki clients (Windows, Mac, Linux, Android, iOS). Extensive logging ensures you can track changes and roll back if needed.
- **Provider Agnostic**: Supports the LLM providers and models available through LiteLLM, letting you choose the best option for your needs.
- **Infinitely Extensible**: Add as many examples as you want to your training datasets. The semantic filtering automatically picks the most relevant ones for each card.

## Tools

### Illustrator

Creates custom mnemonic images for your cards using AI image generation. It:

- Analyzes card content to identify key concepts
- Generates creative visual memory hooks
- Preserves a history of generated images
- Supports DALL-E 2, DALL-E 3, and Stable Diffusion
- Automatically formats images for optimal display (centered, proper sizing)
- Handles multiple images per card with consistent layout

Perfect for visual learners or complex topics that benefit from imagery.

For an example of the kind of workflow that AnkiIllustrator facilitates, watch [pixorise's YouTube channel](https://www.youtube.com/watch?v=QzxHpjryJHg0), which is full of great image mnemonics and particularly suited to memory palace styles of learning.

<details>
<summary>
Click to see an example
</summary>

For example, I had this French flashcard:

![](./screenshots/illustrator_fever.png)

<details>
<summary>Click here if you can't read French</summary>

Here's the note content translated to English:

```
Diagnostic criteria for simple febrile seizures:
- Age greater than 1 year
- Seizure occurring with fever above 38°C (100.4°F)
- Symmetrical motor manifestations lasting less than 15 minutes and without deficit
- Single seizure per febrile episode
- Less than 3 episodes in total
```

</details>

`illustrator.py` generated this image for me:

![](./screenshots/illustrator_fever_generated.png)

It also produced this text, which shows its thought process (note that this part is always in English, whatever the original language):

```
Topic: 'Diagnostic criteria for a simple febrile seizure'
Anchors: 'Febrile / Fever: a fireplace (sounds like febrile)'
Phonetic: '1 as T link in thai, tea, tie; 3 as M link in mow, my, meh; 8 as F link in fire, faux, few; 5 as L link in low, lee, lie'
Remarks:
* The criteria are specific and should be presented in a way that is easy to remember.
Step by step decomposition:
* Age over 1 year: a toddler holding a teacup (T for 1)
* Seizure with fever over 38°C: a thermometer with flames around it (F for 8)
* Symmetric motor manifestations lasting less than 15 minutes without deficit: a clock showing 15 minutes with two identical figurines moving symmetrically on either side
* 1 unique seizure per febrile episode: a single lightning bolt striking a fireplace (fireplace for fever)
* Less than 3 episodes in total: three thermometers, but only two showing a high temperature
Imagining:
* A toddler holding a teacup, standing next to a fireplace with a single lightning bolt striking it.
* Above the fireplace, a clock showing 15 minutes with two identical figurines moving symmetrically.
* Next to the toddler, a thermometer surrounded by flames, and three thermometers, two of which show high temperatures. Subject: 'a toddler holding a teacup next to a fireplace struck by a single lightning bolt, a clock showing 15 minutes with symmetric figurines, a flaming thermometer, and three thermometers with two showing high temperatures' Description words: 'educational, colorful, engaging, vivid, detailed' Style: 'illustration' Realism: 'semi-realistic' a toddler holding a teacup next to a fireplace struck by a single lightning bolt, a clock showing 15 minutes with symmetric figurines, a flaming thermometer, and three thermometers with two showing high temperatures, educational, colorful, engaging, vivid, detailed, illustration, semi-realistic [DATE:09/04/2024 VERSION:2.5 LLMMODEL:openai/gpt-4-0125-preview IMAGEMODEL:openai/dall-e-3] ``` </details> ### Reformulator An intelligent tool that rephrases your flashcards while preserving their core meaning and structure. It helps when: - Cards are poorly worded or unclear - You want to vary the phrasing to strengthen recall - Cards need to be more concise or natural sounding - Your preferred card format has evolved over time The tool uses LLMs to reformulate content while carefully preserving cloze deletions and media. This is especially valuable for long-term Anki users - for example, during medical school, your idea of what makes a "perfect" flashcard often evolves after a few semesters. The Reformulator lets you easily update all your older cards to match your current preferred format and style. <details> <summary> Click to see an example </summary> For example, given this poorly worded flashcard: ``` bilateral and symmetric alveolar syndrome, perihilar, often with effusion, what to consider? {{c1::APE}} ``` The reformulator would improve it to: ``` What should be considered in presence of bilateral and symmetric alveolar syndrome, perihilar, often with effusion? 
{{c1::In case of bilateral and symmetric alveolar syndrome, perihilar, often with effusion, one should consider APE.}} ``` The reformulator can also make cards more concise when needed. For example, it could reformat the same card as: ``` Key features of bilateral and symmetric alveolar syndrome, perihilar, often with effusion: - {{c1::Consider APE}} ``` The true power of the reformulator lies in its flexibility - your imagination is the only limit. It can: - Make questions grammatically complete and clear - Structure content as proper questions, bullet points, tables, or any format you prefer - Make answers self-contained by repeating key context when needed - Adjust verbosity from concise summaries to detailed explanations - Preserve exact medical terminology while improving clarity - Maintain cloze deletion format while enhancing content - Adapt to any formatting style (paragraphs, lists, diagrams, etc.) - Transform content into mnemonics, analogies, or memory palaces - Create hierarchical structures for complex topics - Generate multiple variations to strengthen recall through interleaving The reformulator is not just a tool for fixing poorly worded cards - it's a creative partner that can help you explore new ways to structure and present information. Whether you want to create visual mind maps, build memory palaces, or develop unique mnemonic systems, the reformulator can help bring your ideas to life while maintaining the integrity of your Anki cards. </details> ### Mnemonics Creator Generates memorable mnemonics tailored to your cards by: - Creating multiple mnemonic options per card - Using proven memory techniques like the [Major System](https://en.wikipedia.org/wiki/Mnemonic_major_system) - Incorporating your existing memory anchors - Preserving context and accuracy Helps create lasting memory connections, especially for numbers and sequences. 
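
To make the Major System concrete, here is a minimal sketch of the digit-to-consonant mapping it relies on. This is an illustration only and is not taken from the AnkiAIUtils code base; the function name `major_consonants` is hypothetical, and each digit is reduced to a single representative consonant even though the system allows several similar sounds per digit.

```python
# Illustrative sketch only: not the code used by the Mnemonics Creator.
# One representative consonant per digit of the Major System
# (the system also accepts close sounds, e.g. D for 1 or B for 9).
MAJOR_SYSTEM = {
    "0": "S", "1": "T", "2": "N", "3": "M", "4": "R",
    "5": "L", "6": "J", "7": "K", "8": "F", "9": "P",
}


def major_consonants(text: str) -> str:
    """Return the consonant skeleton for the digits found in `text`.

    Characters that are not plain digits 0-9 (units, spaces) are ignored.
    """
    return "".join(MAJOR_SYSTEM[ch] for ch in text if ch in MAJOR_SYSTEM)


print(major_consonants("92"))    # PN, which can be voiced as "pen"
print(major_consonants("38°C"))  # MF
print(major_consonants("15"))    # TL
```

Turning the skeleton into an actual word ("PN" into "pen") is the creative step, which is where the LLM and your own anchors come in.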
<details> <summary> Click to see an example </summary> The mnemonics made for the card above about infant fever ended up with this content in the AnkiMnemonics field: --- 1. 'Heureux Hephaistos fébrile tend sa banane unique près du feu'<br>* <b>Heureux </b> Âge supérieur à 1 an <u> Heureux évoque la maturité et donc un âge déjà avancé, supérieur à 1 an </u><br>* <b>Hephaistos fébrile </b> Survenue de la crise avec une fièvre supérieure à 38°C <u> Hephaistos évoque la fièvre du fait de son rôle de forgeron et fébrile réitère ce concept </u><br>* <b>tend sa banane unique </b> 1 unique crise par épisode fébrile donné <u> banane unique évoque une seule occurrence, ici la crise unique par épisode fébrile </u><br>* <b>près du feu </b> Moins de 3 épisodes au total <u> feu évoque la fièvre et sa proximité suggère une limite, ici moins de 3 épisodes en tout </u><br><br>2. 'Un enfant trébuche dans le feu, danse symétriquement, a un seul coup et moins de trois feux'<br>* <b>Un enfant trébuche </b> Âge supérieur à 1 an <u> l'idée d'un enfant qui commence juste à marcher évoque l'âge juste après un an </u><br>* <b>dans le feu, </b> Survenue de la crise avec une fièvre supérieure à 38°C <u> le feu évoque la chaleur, donc la fièvre </u><br>* <b>danse symétriquement, </b> Manifestations motrices symétriques <u> danser évoque le mouvement, et symétriquement évoque les deux côtés du corps bougeant de la même manière </u><br>* <b>a un seul coup </b> 1 unique crise par épisode fébrile donné <u> un seul coup évoque l'unicité de la crise pendant l'épisode fébrile </u><br>* <b>et moins de trois feux </b> Moins de 3 épisodes au total <u> moins de trois feux évoque le nombre total d'épisodes, utilisant l'analogie avec la fièvre comme feu </u><br><br>3. 
'Un enfant febrile symetrique forge une unique bulle dans la prairie'<br>* <b>Un enfant </b> Âge supérieur à 1 an <u> enfant indique que le sujet concerne un jeune individu, donc plus d'un an </u><br>* <b>febrile </b> Survenue de la crise avec une fièvre supérieure à 38°C <u> fébrile se lie à la notion de fièvre </u><br>* <b>symetrique </b> Manifestations motrices symétriques <u> directement lié à symétrique </u><br>* <b>forge </b> durant moins de 15 minutes et sans déficit <u> forger évoque une action courte et intense, comme la crise qui dure moins de 15 minutes sans laisser de séquelles </u><br>* <b>une unique </b> 1 unique crise par épisode fébrile donné <u> unique précise le nombre de crises </u><br>* <b>bulle </b> Moins de 3 épisodes au total <u> une bulle évoque quelque chose de rare et limité, semblable à moins de 3 épisodes au total </u><br>* <b>dans la prairie </b> hyperthermique <u> la prairie évoque un espace ouvert et naturel, hyperthermique évoque la chaleur comme celle du soleil sur une prairie </u><br><br>[DATE:09/04/2024 VERSION:2.1 MODEL:openai/gpt-4-0125-preview]<br><br><!--SEPARATOR--> --- </details> ### Explainer Provides clear, detailed explanations when you struggle with cards by: - Breaking down complex concepts - Highlighting key relationships - Adding helpful context - Using analogies and examples Particularly useful for understanding why you got a card wrong and filling knowledge gaps. 
<details>
<summary>
Click to see an example
</summary>

The explanation generated for the card above about infant fever ended up with this content in the AnkiExplainer field (I translated it from French to English for universal documentation):

---

* <b>EXPLANATION</b> A simple febrile seizure is characterized by its uniqueness and brevity during a febrile episode, which helps distinguish it from complex seizures or other neurological disorders.<br>* <b>MECHANISM</b> Fever can lower the seizure threshold in certain children, which explains why an elevation in body temperature can trigger a seizure in predisposed individuals.<br><br>[DATE:09/04/2024 VERSION:1.7 LLMMODEL:openai/gpt-4-0125-preview]<br><br><!--SEPARATOR-->

---

</details>

### Mnemonics Helper

A lightweight interactive CLI tool for quick mnemonic generation that:

- Takes a concept and finds semantically similar existing mnemonics
- Generates multiple new mnemonic options using LLMs
- Lets you choose from generated options with vim-style navigation
- Automatically saves selected mnemonics for future reference
- Works independently of Anki, perfect for brainstorming sessions

Unlike the Mnemonics Creator, which processes Anki cards in batch, this tool provides an interactive interface for generating mnemonics one concept at a time. These new mnemonics can automatically be added to a dataset file that can readily be used by the other tools. This lets you rapidly tailor the scripts to your own imagination.

## FAQ

<details>
<summary>
Click to read more
</summary>

### What are the core benefits of those tools?

Basically if you run these tools each evening on cards you failed that day it will steadily improve your deck quality and learning effectiveness: - Automatically enhance cards you struggle with - Save time on manual card improvements - Create stronger memory connections - Track improvements with detailed history - Preserve card structure while enhancing content ### What is the [Major System](https://en.wikipedia.org/wiki/Mnemonic_major_system)? The Major System is a powerful memory technique that converts numbers into consonant sounds, which can then be turned into memorable words. For example: - 0 = S sound (as in "sea") - 1 = T sound (as in "tea") - 2 = N sound (as in "new") - etc. This makes it easier to remember numbers by turning them into words. For example, "92" could become "pen" (P=9, N=2). You can read more about it [on wikipedia](https://en.wikipedia.org/wiki/Mnemonic_major_system) ### What are Memory Anchors? Memory anchors are existing associations you already know well that can be used to create new memories. For example, if you already strongly associate "Napoleon" with "France", you can use Napoleon as an anchor when learning new facts about French history. The tools can use your personal set of memory anchors to generate mnemonics that build on your existing knowledge. ### Which LLM providers are supported? The tools use [LiteLLM](https://docs.litellm.ai/docs/) which provides a unified interface to virtually any LLM provider including: - OpenAI - Anthropic - Google - [MiniMax](https://www.minimaxi.com/) (MiniMax-M3 — 512K context, MiniMax-M2.7, MiniMax-M2.7-highspeed — 204K context) - OpenRouter - Azure - AWS Bedrock - Local models - And many more Just specify the model in LiteLLM format (e.g. "openai/gpt-4", "anthropic/claude-3-opus", or "minimax/MiniMax-M3") and it will handle the rest. For MiniMax, place your API key in a file named `MINIMAX` inside the `API_KEYS/` directory (the tools will load it as `MINIMAX_API_KEY`). 
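
The key-file convention described above (a file named `MINIMAX` becoming the `MINIMAX_API_KEY` environment variable) can be sketched as follows. This is a guess at the general idea, not the loader the tools actually ship: the function name `load_api_keys` and the rule "upper-cased file name plus `_API_KEY`" for every file in the directory are assumptions.

```python
import os
import tempfile
from pathlib import Path


def load_api_keys(key_dir: str = "API_KEYS") -> list[str]:
    """Export each file in `key_dir` as <FILENAME>_API_KEY.

    Hypothetical helper: returns the names of the variables that were set.
    """
    loaded = []
    directory = Path(key_dir)
    if not directory.is_dir():
        return loaded
    for key_file in sorted(directory.iterdir()):
        if not key_file.is_file():
            continue
        value = key_file.read_text().strip()
        if not value:
            continue  # skip empty key files
        var_name = f"{key_file.name.upper()}_API_KEY"
        os.environ[var_name] = value
        loaded.append(var_name)
    return loaded


# Demo with a throwaway directory and a fake key.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "MINIMAX").write_text("fake-key-for-demo\n")
    loaded = load_api_keys(tmp)

print(loaded)  # ['MINIMAX_API_KEY']
```

Keeping keys in files outside the scripts means they never end up in shell history or in a committed command line.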
### What languages are supported? The tools work in any language supported by the LLM you choose to use. Since these scripts support virtually all LLM providers through LiteLLM, you can use any model that works well with your language. For example: - OpenAI's models support 100+ languages - Anthropic's Claude supports 100+ languages - You can use local models specifically trained for your language - etc. The tools will preserve all language-specific formatting, including: - Right-to-left text - Special characters and diacritics - Language-specific punctuation - etc. ### How do the Mnemonics Work? The mnemonics tools use several proven memory techniques: - [Major System](https://en.wikipedia.org/wiki/Mnemonic_major_system) for numbers - Vivid imagery and visualization - Personal memory anchors - Phonetic similarities - Humor and absurdity - Story-based connections This creates memorable associations that help strengthen recall while preserving accuracy. ### Where can I find example datasets for each tool? The `examples/` folder contains training datasets and example files for each tool. While these were originally written in French and hastily translated to English, they provide good templates for creating your own datasets. Check the Example Files section below for details on each file. ### What's the future of this project? This toolkit was developed and battle-tested while studying tens of thousands of Anki cards during medical school. It proved invaluable for maintaining and enhancing a large flashcard collection during intense study periods. However, as research commitments have grown, I now have limited time to transform these scripts into a more user-friendly package. The tools work well but need: - Packaging as a proper Anki addon - Installation via PyPI - Code deduplication and cleanup - Better documentation I'm actively looking for contributors of all skill levels to help make these tools more accessible to the wider Anki community. 
Whether you're a seasoned developer or just getting started, all contributions are welcome! I can provide guidance and direction based on extensive experience with the codebase, while you help with the technical aspects of packaging and distribution. Check out the detailed roadmap below to see what needs improving. If you're interested in helping transform these battle-tested scripts into a polished Anki addon, please don't hesitate to reach out. I'm always happy to chat and help you get started!

### Why is there code duplication across the tools?

This project evolved organically alongside my Python skills while solving real needs during medical school. Each tool was developed independently when needed, prioritizing functionality over code elegance. While they all work reliably, there's significant opportunity to unify their codebases around a common API. I can provide detailed guidance on refactoring and consolidating the code, but lack the time to implement these changes myself. Check the roadmap below if you're interested in helping streamline the codebase while preserving its battle-tested functionality.

### When Should I Use Each Tool?

- **Mnemonics Creator**: Best for memorizing numbers, sequences, lists, and abstract concepts
- **Illustrator**: Ideal for visual learners and complex topics that benefit from imagery
- **Reformulator**: Use when card wording is unclear or you want variety in phrasing. Don't worry about running it on well-formatted cards: the LLM is trained to recognize and preserve cards that already follow best practices, avoiding unnecessary changes that could disrupt your learning
- **Explainer**: Great for understanding why you got a card wrong and filling knowledge gaps
- **Mnemonics Helper**: Simple script to quickly ask an LLM to come up with new mnemonics by taking into account the [semantic similarity](https://en.wikipedia.org/wiki/Semantic_similarity) between the new subject and your previous mnemonics.

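
The semantic similarity selection mentioned for the Mnemonics Helper (and used by the other tools to pick examples) boils down to ranking stored examples by how close their embedding is to the embedding of the new subject. Below is a minimal sketch of that idea using cosine similarity. It is not the project's implementation: the real tools get embeddings from an embedding model, whereas the 3-dimensional vectors, the example texts, and the function names here are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_examples(query_vec, examples, k=2):
    """Return the texts of the k examples most similar to `query_vec`.

    `examples` is a list of (text, vector) pairs.
    """
    ranked = sorted(
        examples,
        key=lambda ex: cosine_similarity(query_vec, ex[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "embeddings", invented for this example.
examples = [
    ("fever -> fireplace", [0.9, 0.1, 0.0]),
    ("heart -> drum", [0.1, 0.9, 0.0]),
    ("seizure -> lightning bolt", [0.7, 0.0, 0.3]),
]
query = [1.0, 0.0, 0.1]  # pretend embedding of "febrile seizure"

print(top_k_examples(query, examples, k=2))
# ['fever -> fireplace', 'seizure -> lightning bolt']
```

Because only the top-ranked examples are sent to the LLM, adding more examples to a dataset widens the pool without lengthening every prompt.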
### What happens if I run a script multiple times on the same card?

For most tools (Mnemonics Creator, Illustrator, Explainer), the previous content is preserved in a collapsible HTML section using the `<details>` and `<summary>` tags. The new content appears above this section. This makes it easy to:

- See the latest generated content first
- Access previous versions by expanding the collapsible sections
- Track how the card evolved over time

The Reformulator works differently: it replaces the content of the original field directly, but saves all previous versions and metadata in a separate `AnkiReformulator` field. This preserves the card's readability while maintaining a complete history.

### How can I track which cards were modified?

Each tool meticulously tracks modifications through tags and metadata to ensure transparency and reversibility. For example, when a tool processes a card, it adds a dated tag like `AnkiIllustrator::done::02/07/2023`. This makes it easy to:

- Quickly identify which cards were modified by each tool
- Track when modifications were made
- Find cards that haven't been processed yet
- Roll back changes if needed (especially with the Reformulator)

You can use these tags in the Anki browser to assess how many cards could benefit from each tool and to review the modifications made. Note that notes for which a script failed also get a tag added to them, for example `AnkiIllustrator::failed`.

### How much does it cost to run these tools?

The cost depends on your usage patterns and which features you enable: - Start small with a few cards to get comfortable with each tool - Built-in safeguards prevent accidental overspending: - Maximum cards per run can be limited - Cost tracking per script is stored in the database - Failed API calls don't count towards your quota - You can set hard spending limits - Typical costs per card: - Reformulator: ~$0.02-0.04 (text only) - Mnemonics: ~$0.02-0.04 (text only) - Explainer: ~$0.03-0.06 (more complex reasoning) - Illustrator: ~$0.02 + image cost ($0.04-0.12 per image) The database tracks total spending per script, making it easy to budget and monitor costs. You can also use cheaper models for initial testing before scaling up to more capable ones. ### Can I use these tools on mobile? While you need to run the scripts themselves from a computer (not your phone), all changes are made directly to your Anki notes. This means: - Run the scripts from your computer/server - Sync Anki on your computer - The improved cards will appear on AnkiMobile/AnkiDroid after syncing - All generated content (reformulations, mnemonics, images, etc.) works perfectly on mobile ### Example Files The `examples/` folder contains example files to help you get started. 
Note that these examples were originally written in French (except for system prompts) and were quickly translated to English - some examples may not make perfect sense but should still demonstrate the basic usage: - `anki_ai_utils_tmux_launcher.sh`: A tmux-based launcher script I used every morning to automatically process cards I struggled with the previous day - `anchors.json`: Example memory anchors mapping file - `dataset_anchors.txt`: Training examples for memory anchor processing - `explainer_dataset.txt`: Examples for the Explainer tool - `illustrator_dataset.txt`: Training data for image generation - `illustrator_sanitize_dataset.txt`: Examples for sanitizing image prompts - `mnemonics_dataset.txt`: Training data for mnemonic generation - `reformulator_dataset.txt`: Examples for card reformulation - `string_formatting.py`: Handles cloze deletions and text formatting ### Aren't you concerned about LLM hallucinations? While hallucinations are a valid concern when using LLMs as search engines or relying on their compressed inner knowledge, these tools take a different approach that minimizes this risk: 1. **Few-shot Learning**: By providing carefully crafted examples, we guide the LLM to follow specific patterns and formats, reducing the chance of inventing information. 2. **Structured Output**: The tools enforce strict output formats that make hallucinations easier to detect and correct. 3. **Preservation of Source Material**: Rather than generating new facts, the tools focus on reformulating and enhancing existing content from your cards. 4. **Model Agnosticism**: As new, more reliable models emerge, you can easily switch to them without changing your workflow. 5. **Specialization**: By focusing on specific tasks (reformulation, mnemonic creation, etc.), we reduce the scope for hallucinations compared to general-purpose chat. While no system is perfect, this approach has proven reliable through extensive testing during medical school. 
As LLMs continue to improve, we can expect hallucinations to become increasingly rare.

### What's the format of dataset files?

Dataset files (like `explainer_dataset.txt`, `reformulator_dataset.txt`, etc.) are simple text files where messages are separated by `----`. The first message is assumed to be a system prompt, followed by alternating user and assistant messages. This format mirrors a typical LLM conversation flow while remaining easy to read and edit.

</details>

## Usage / Getting started

<details>
<summary>
Click to read more
</summary>

To get started using these scripts (and until someone comes along to help me turn them into an addon!), the steps are roughly:

0. **Note: if you don't understand these steps, I recommend teaching yourself with the help of an LLM. The steps always differ slightly depending on whether you are on macOS, Linux, or Windows, but the idea is always the same.**
1. Make sure you have Python installed and available from your terminal. The recommended Python version is 3.12.7 or a 3.11.x release because that's what I used at the time. If you try to use more recent versions you will probably encounter problems.
2. `git clone https://github.com/thiswillbeyourgithub/AnkiAIUtils/`
3. I recommend creating a Python virtual environment inside this new folder, for example with `uv venv` then `source .venv/bin/activate` (on Linux).
4. Run a script. For that, refer to the individual usage sections below.

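
Before running a script on your own dataset, it can help to check that the file parses the way you expect. The dataset format described in the FAQ (messages separated by `----`, a system prompt first, then alternating user and assistant messages) can be read with a few lines of Python. This is a minimal sketch rather than the parser the scripts actually use: the function name `parse_dataset` is hypothetical, and it assumes each `----` separator sits on its own line.

```python
import re


def parse_dataset(text: str) -> list[dict]:
    """Split a dataset into chat messages with system/user/assistant roles.

    Assumption: separators are lines containing only '----'.
    """
    parts = re.split(r"^----[ \t]*$", text, flags=re.MULTILINE)
    parts = [part.strip() for part in parts if part.strip()]
    messages = []
    for index, content in enumerate(parts):
        if index == 0:
            role = "system"
        elif index % 2 == 1:
            role = "user"
        else:
            role = "assistant"
        messages.append({"role": role, "content": content})
    return messages


# Tiny made-up dataset, just to show the shape of the result.
sample = """You rewrite flashcards as clear questions.
----
capital of France? {{c1::Paris}}
----
What is the capital of France? {{c1::Paris}}
"""

messages = parse_dataset(sample)
for message in messages:
    print(message["role"], "|", message["content"])
```

An odd number of messages (system prompt plus complete user/assistant pairs) is what this layout produces when no example is left unanswered.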
#### Reformulator The Reformulator can be run from the command line: ```bash python reformulator.py \ --query "(rated:2:1 OR rated:2:2) -is:suspended" \ --dataset_path "data/reformulator_dataset.txt" \ --string_formatting "data/string_formatting.py" \ --ntfy_url "ntfy.sh/YOUR_TOPIC" \ --main_field_index 0 \ --llm "openai/gpt-4" \ --embedding_model "openai/text-embedding-3-small" \ --max_token 4000 \ --llm_temp 0 ``` Key arguments: - `query`: Anki browser query to select cards (defaults to recently failed cards) - `dataset_path`: Example prompts for reformulation - `string_formatting`: Custom text formatting functions - `ntfy_url`: Optional notifications via ntfy.sh - `main_field_index`: Index of the field to reformulate (0 for first field) - `llm`: LLM model to use in litellm format - `embedding_model`: Model for semantic similarity search - `max_token`: Maximum tokens per query - `llm_temp`: LLM temperature (0 for consistent output) Additional options: - `--debug`: Enable debug mode - `--force`: Process cards even if already reformulated - `--print_db_then_exit`: Display database contents and exit - `--parallel`: Number of parallel processes (default 4) - `--exclude_media`: Skip cards containing media - `--mode`: Either 'reformulate' or 'reset' to restore original content. Note that the 'reset' feature is not absolutely guaranteed to work, but if things go wrong there are tons of logs on purpose to make sure you don't lose anything. 
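
If you prefer to schedule runs from Python rather than from a shell script, the command above can be assembled as an argument list. The sketch below only uses flags documented in this section; the helper name `build_reformulator_command` is hypothetical, and it assumes `--force` is a plain on/off switch.

```python
import shlex


def build_reformulator_command(query, llm="openai/gpt-4", parallel=4, force=False):
    """Return the reformulator invocation as a list of arguments.

    Hypothetical helper: paths and defaults mirror the example above.
    """
    cmd = [
        "python", "reformulator.py",
        "--query", query,
        "--dataset_path", "data/reformulator_dataset.txt",
        "--string_formatting", "data/string_formatting.py",
        "--main_field_index", "0",
        "--llm", llm,
        "--parallel", str(parallel),
    ]
    if force:
        cmd.append("--force")  # assumed to be a boolean switch
    return cmd


cmd = build_reformulator_command("(rated:2:1 OR rated:2:2) -is:suspended")
print(shlex.join(cmd))  # shell-quoted form, handy for logs or a crontab entry
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list instead of a single string avoids quoting problems with the spaces and parentheses in the Anki query.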
#### Mnemonics The Mnemonics Creator can be run from the command line: ```bash python mnemonics.py \ --field_names "body" \ --query "(rated:2:1 OR rated:2:2) -is:suspended" \ --memory_anchors_file "data/anchors.json" \ --dataset_path "data/mnemonics_dataset.txt" \ --string_formatting "data/string_formatting.py" \ --ntfy_url "ntfy.sh/YOUR_TOPIC" \ --llm "openrouter/anthropic/claude-3-sonnet" \ --embedding_model "openai/text-embedding-3-small" \ --n_mnemonic 1 ``` Key arguments: - `field_names`: Comma-separated list of note fields to analyze - `query`: Anki browser query to select cards (defaults to recently failed cards) - `memory_anchors_file`: JSON file mapping concepts to memory anchors - `dataset_path`: Example prompts for mnemonic generation - `string_formatting`: Custom text formatting functions - `ntfy_url`: Optional notifications via ntfy.sh - `llm`: LLM model to use in litellm format - `embedding_model`: Model for semantic similarity search - `n_mnemonic`: Number of mnemonics to generate per card Additional options: - `--debug`: Enable debug mode - `--force`: Process cards even if they already have mnemonics - `--note_mode`: Don't count cards of the same note twice - `--do_sync`: Sync Anki before and after processing #### Mnemonics Creator CLI The Mnemonics Creator CLI provides an interactive interface for generating mnemonics: ```bash python mnemonics_creator.py \ --top_k 100 \ --n_gen 10 \ --model "openrouter/anthropic/claude-3-sonnet" \ --embed_model "openai/text-embedding-3-small" ``` Key arguments: - `top_k`: Number of similar existing mnemonics to use as examples (default: 100) - `n_gen`: Number of new mnemonics to generate per query (default: 10) - `model`: LLM model to use in litellm format - `embed_model`: Model for semantic similarity search - `query`: Optional initial query to process - `gui`: Enable GUI interface (not yet implemented) The CLI provides an interactive interface where you can: - Enter concepts to generate mnemonics for - See 
similar existing mnemonics as context - Choose from multiple generated options - Navigate with vim-style keys (j/k) or numbers - Save selected mnemonics to your collection #### Explainer The Explainer can be run from the command line: ```bash python explainer.py \ --field_names "body" \ --query "(rated:2:1 OR rated:2:2) -is:suspended" \ --dataset_path "data/explainer_dataset.txt" \ --string_formatting "data/string_formatting.py" \ --ntfy_url "ntfy.sh/YOUR_TOPIC" \ --llm "openrouter/anthropic/claude-3-sonnet" \ --embedding_model "openai/text-embedding-3-small" \ --llm_max_token 3000 ``` Key arguments: - `field_names`: Comma-separated list of note fields to analyze - `query`: Anki browser query to select cards (defaults to recently failed cards) - `dataset_path`: Example prompts for generating explanations - `string_formatting`: Custom text formatting functions - `ntfy_url`: Optional notifications via ntfy.sh - `llm`: LLM model to use in litellm format - `embedding_model`: Model for semantic similarity search - `llm_max_token`: Maximum tokens per query Additional options: - `--debug`: Enable debug mode - `--force`: Process cards even if they already have explanations - `--note_mode`: Don't count cards of the same note twice - `--do_sync`: Sync Anki before and after processing #### Illustrator The Illustrator can be run from the command line: ```bash python illustrator.py \ --field_names "front,back" \ --query "(rated:2:1 OR rated:2:2) -is:suspended" \ --memory_anchors_file "data/anchors.json" \ --dataset_path "data/illustrator_dataset.txt" \ --dataset_sanitize_path "data/illustrator_sanitize.txt" \ --string_formatting "data/string_formatting.py" \ --ntfy_url "ntfy.sh/YOUR_TOPIC" \ --n_image 1 ``` Key arguments: - `field_names`: Comma-separated list of note fields to analyze - `query`: Anki browser query to select cards (defaults to recently failed cards) - `memory_anchors_file`: JSON file mapping concepts to memory anchors - `dataset_path`: Example prompts for image 
generation - `dataset_sanitize_path`: Examples for sanitizing unsafe prompts - `string_formatting`: Custom text formatting functions - `ntfy_url`: Optional notifications via ntfy.sh - `n_image`: Number of images to generate per card Additional options: - `--debug`: Enable debug mode - `--force`: Process cards even if they already have illustrations - `--disable_notif`: Disable ntfy.sh notifications </details> ### Roadmap <details> <summary> Click to read more </summary> <i>This TODO list is maintained automatically by [MdXLogseqTODOSync](https://github.com/thiswillbeyourgithub/MdXLogseqTODOSync)</i> <!-- BEGIN_TODO --> - turn those scripts into addons (for that, please help me for the rest of the TODO and it should make the addonification more straightforward) - ### Applies to all tools - load API keys from env variable instead - do a unique class that could be used to unify all those codes - arguments: - name (to differentiate each children: for example "illustrator") - query - output field name - template - tags_regex (to tell which tags to include in the template) - llm name - embedding_model - llm_max_token - llm_temp - tkn_warn_limit (to know when to stop) - exclude_media - exclude_version - exclude_done - n_note_limit - do_sync - callback (function like ntfy_url) - debug - parallel - force - print_db_then_exit - methods: - string_format (can be overloaded) - load_history - save_history - total_cost - execute_query (to find those cards and apply the filters given by arguments) - loop_over_notes (that check that compute_new_field is indeed declared) - addtags et removetags - note: - in the init, check that indeed there is a version attribute - make sure to use a self.lock - rewrite each script to use this class - the --help should be redirected to the class of each project - make each class use the same entrypoint - use toml instead of json? 
it allows setting comments too - use beartype everywhere for runtime type checking and code cleanliness - store all inference in a compressed sqlite db instead of a json. It gets too large - add check that we indeed removed all the done tags - actually there's no need to store the "Done" tags because all important info is stored in the field - use xml formatting for the examples - make use of <thinking> tags too - tell user how much time each answer took - add an arg to include tags or not in the LLM context for a given note, as otherwise the LLM can get confused by some acronyms - but with a regex arg to keep only the tags that match the regex. This way we can keep only a portion of them for the LLM - make it installable with a setup.py on pypi - - ### Mnemonics Creator - Add keybindings - binding e to edit a proposition - binding to restart generation - binding to enter chat mode and construct the mnemonics with the LLM - - ### Illustrator - use an llm to extract numbers - ask it to do quick transformations like turning 48h into 2 days, modifying units, etc. - add support for notes containing media like audio, images etc - add a mode without actually creating images. This could be used like a mnemonic after all. - - ### Reformulator - Add 5 to 10 examples for the LLM of how to manage media like images etc then add support for them - make it work with specific fstring template for field replacement. Otherwise it can only reformulate a single field - better: add an arg to specify the single output field, and an arg to specify a comma separated list of input fields - - ### explainer - compute all embeddings at the start, making it faster - it's actually quite terrible. 
Use one LLM call to decide which follow-up questions to ask, then another LLM call to answer each using async - save each new question answer as a <details> tag to make it easy to access on phones by touching the field - - ### Ankimnemonics - comment out the mnemonics that don't respect the rule of adding the subject first - understand why it sometimes hangs during a run - make it distinguish 'has to appear in plain' vs 'has to appear as mnemonic'? - - ### AnkiAiFilter - use an eval llm like in [wdoc](https://wdoc.readthedocs.io/en/latest/) to better filter an anki query - actually wdoc can already be used for that! Maybe it should be converted into an addon? - - ### Tagger (In project) - always prefix tags with ankitagger: but make it customizable - always sort those tags by alphabetical order - add modes: - mode "predefined": the user gives a list of tags and the LLM finds which to apply to each note given a query - mode "natural_list": where the LLM creates the list itself - loop over each note and ask it to generate tags - but also show the list of tags until now then finally loop all over again and ask the LLM to tell which tag from a list should apply after filtering via embeddings - but still allow starting from a premade list - arg for image support if media found - if the card contains an image, it should be hashed, then a cached call to a func that asks a vision model to describe the type of image, then use the embedding of this answer to suggest the appropriate tags to the LLM for classification - images should have their own tags, like "imagery", "decision tree", "classification", "table" etc <!-- END_TODO --> ## Credits This project makes heavy use of [AnkiConnect](https://git.foosoft.net/alex/anki-connect) to interact with Anki. </details>
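Since the scripts rely on AnkiConnect to talk to Anki, here is a minimal sketch of what one such call can look like. It is not taken from the AnkiAIUtils code: the helper names `build_request` and `invoke` are hypothetical, and it assumes AnkiConnect is listening on its default address `http://127.0.0.1:8765` and speaks API version 6. The query in the usage comment is the one from the Explainer example.

```python
import json
import urllib.request

# Assumption: AnkiConnect's default address; adjust if you changed it.
ANKICONNECT_URL = "http://127.0.0.1:8765"


def build_request(action: str, **params) -> dict:
    """Build the JSON payload AnkiConnect expects (API version 6)."""
    return {"action": action, "version": 6, "params": params}


def invoke(action: str, **params):
    """Send one request to a running AnkiConnect and return its result."""
    payload = json.dumps(build_request(action, **params)).encode("utf-8")
    request = urllib.request.Request(ANKICONNECT_URL, payload)
    with urllib.request.urlopen(request) as response:
        answer = json.load(response)
    # AnkiConnect replies with {"result": ..., "error": ...}.
    if answer.get("error") is not None:
        raise RuntimeError(answer["error"])
    return answer["result"]


# Hypothetical usage (requires Anki to be open with AnkiConnect installed):
# note_ids = invoke("findNotes", query="(rated:2:1 OR rated:2:2) -is:suspended")
```

Keeping the payload construction separate from the network call makes it easy to inspect or log a request before anything is sent to Anki.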

AI & Machine Learning Flash Cards & Spaced Repetition
859 Github Stars
wdoc
Open Source

wdoc

[![PyPI version](https://badge.fury.io/py/wdoc.svg)](https://badge.fury.io/py/wdoc) [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/thiswillbeyourgithub/wdoc) # wdoc <p align="center"><img src="https://github.com/thiswillbeyourgithub/wdoc/blob/main/images/icon.png?raw=true" width="512" style="background-color: transparent !important"></p> > *I'm wdoc. I solve RAG problems.* > - wdoc, imitating Winston "The Wolf" Wolf `wdoc` is a powerful RAG (Retrieval-Augmented Generation) system designed to summarize, search, and query documents across various file types. It's particularly useful for handling large volumes of diverse document types, making it ideal for researchers, students, and professionals dealing with extensive information sources. Created by a psychiatry resident who needed a way to get a definitive answer from multiple sources at the same time (audio recordings, video lectures, [Anki flashcards](https://apps.ankiweb.net/), PDFs, EPUBs, etc.). `wdoc` was born from frustration with existing RAG solutions for querying and summarizing. Note: `wdoc` was coded mostly by hand, without LLM assistance as they didn't exist at the time, Claude Code will probably be used to refactor the code as it evolves. *(The online documentation can be found [here](https://wdoc.readthedocs.io/en/stable))* * **Goal and project specifications**: `wdoc`'s goal is to create **perfectly useful** summaries and **perfectly useful** sourced answers to questions on heterogeneous corpus. It's capable of querying **tens of thousands** of documents across [various file types](#filetypes) at the same time. The project also includes an opinionated summary feature to help users efficiently keep up with large amounts of information. It uses mostly [LangChain](https://python.langchain.com/) and [LiteLLM](https://docs.litellm.ai/docs/) as backends. 
* **Current status**: **usable, tested, still under active development, tens of planned features** * I don't plan to stop reading anytime soon so if you find it promising, stick around as I have many improvements planned (see roadmap section). * **I would greatly benefit from testing by users as it's the quickest way for me to find the many minor quick-to-fix bugs.** * The main branch is more stable than the dev branch, which in turn offers more features. * Open to feature requests and pull requests. All feedback, including reports of typos, is highly appreciated. * Please open an issue before making a PR, as there may be ongoing improvements in the pipeline. * **Key Features**: * **Docker Web UI**: Easy deployment with a [Gradio-based web interface](./docker/README.md) for simplified document processing without CLI interaction. * **High recall and specificity**: it was made to find A LOT of documents using a carefully designed embedding search, then to gradually and carefully aggregate each answer using semantic batching to produce a single answer that mentions the source, pointing to the exact portion of the source document. * Use both an expensive and a cheap LLM to make recall as high as possible because we can afford fetching a lot of documents per query (via embeddings) * Supports **virtually any LLM provider**, including local ones, and even with extra layers of security for super secret stuff. * Aims to **support *any* filetype** and query from all of them at the same time (**15+** are already implemented!) * **Actually *useful* AI powered summary**: get the thought process of the author instead of nebulous takeaways. * **Actually *useful* AI powered queries**: get the **sourced** indented markdown answer to your questions instead of hallucinated nonsense. * **Extensible**: this is both a tool and a library. It was even turned into [an Open-WebUI Tool](https://openwebui.com/t/qqqqqqqqqqqqqqqqqqqq/wdoctool). 
Also available as a [Docker web UI](./docker/README.md) for easy deployment. * **Web Search**: Preliminary web search support using [DuckDuckGo](https://en.wikipedia.org/wiki/DuckDuckGo) (via the [ddgs](https://pypi.org/project/ddgs/) library). ### Table of contents - [Comprehensive reference (SKILL.md)](#comprehensive-reference) - [Explanatory diagrams](#explanatory-diagrams) - [Ultra short guide for people in a hurry](#ultra-short-guide-for-people-in-a-hurry) - [Features](#features) - [Tasks](#tasks) - [Filetypes](#filetypes) - [Walkthrough and examples](#walkthrough-and-examples) - [Getting started](#getting-started) - [Direct installation](#direct-installation) - [Experimental Docker Interface](#experimental-docker-interface) - [Scripts made with wdoc](#scripts-made-with-wdoc) - [FAQ](#faq) - [Roadmap](#roadmap) ## Comprehensive reference A single-page comprehensive reference covering every CLI argument, environment variable, filetype, and the full Python API can be found in **[SKILL.md](./SKILL.md)**. 
## Explanatory diagrams <p float="left" align="middle"> <img src="https://github.com/thiswillbeyourgithub/wdoc/blob/main/images/diagram_query.png?raw=true" alt="Query task workflow diagram showing the flow from user inputs through Raphael the Rephraser, VectorStore, Eve the Evaluator, Anna the Answerer, and recursive combining to final output" height="400"> <img src="https://github.com/thiswillbeyourgithub/wdoc/blob/main/images/diagram_summary.png?raw=true" alt="Summary task workflow diagram showing the flow from user inputs through loading & chunking, Sam the Summarizer, concatenation to wdocSummary output" height="400"> <img src="https://github.com/thiswillbeyourgithub/wdoc/blob/main/images/diagram_search.png?raw=true" alt="Search task workflow diagram showing the flow from user inputs through Raphael the Rephraser, VectorStore, Eve the Evaluator to search output" height="400"> </p> ## Ultra short guide for people in a hurry <details> <summary> Give it to me I am in a hurry! </summary> **Note: a list of examples can be found in [examples.md](https://github.com/thiswillbeyourgithub/wdoc/blob/main/wdoc/docs/examples.md)** > **TL;DR for installation:** when in doubt, use `uvx wdoc[full]`. The plain `wdoc` only ships PDF + URL/web loaders; everything else (youtube, audio, anki, office formats, logseq) lives in optional extras. `[full]` bundles all of them so you never have to think about missing dependencies. See the [Direct Installation](#direct-installation) section for the full list of extras. **Quick Start with Docker**: If you want an experimental web UI, check out the [Docker deployment guide](./docker/README.md). First, let's see how to *query* a pdf. ``` zsh link="https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf" uvx wdoc[full] --path=$link --task=query --filetype="online_pdf" --query="What does it say about alphago?" --query_retrievers='basic_multiquery' --top_k=auto_200_500 ``` * This will: 1. 
Parse what's in --path as a link to a pdf to download (otherwise the url could simply be a webpage, but in most cases you can leave it to 'auto' by default as heuristics are in place to detect the most appropriate parser). 2. Cut the text into chunks and create embeddings for each 3. Take the user query, create embeddings for it ('basic') AND ask the default LLM to generate alternative queries and embed those 4. Use those embeddings to search through all chunks of the text and get the 200 most appropriate documents 5. Pass each of those documents to the smaller LLM (default: openrouter/deepseek/deepseek-v4-flash) to tell us if the document seems appropriate given the user query 6. If more than 90% of the 200 documents are appropriate, then we do another search with a higher top_k and repeat until documents start to be irrelevant OR we hit 500 documents. 7. Then each relevant doc is sent to the strong LLM (by default, openrouter/deepseek/deepseek-v4-pro) to extract relevant info and give one answer per relevant document. 8. Then all those "intermediate" answers are 'semantic batched' (meaning we create embeddings, do hierarchical clustering, then create small batches containing several intermediate answers of similar semantics, and sort each batch in semantic order too), and each batch is combined into a single answer per batch of relevant docs (or later: per batch of batches). 9. Rinse and repeat steps 7+8 (i.e. gradually aggregate batches) until we have only one answer, which is returned to the user. Now, let's see how to summarize a pdf. ``` zsh link="https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf" uvx wdoc[full] --path=$link --task=summarize --filetype="online_pdf" ``` * This will: 1. Split the text into chunks 2. Pass each chunk into the strong LLM (by default openrouter/deepseek/deepseek-v4-pro) for a very low level (=with all details) summary. The format is markdown bullet points for each idea and with logical indentation. 3. 
When creating each new chunk, the LLM has access to the previous chunk for context. 4. All summaries are then concatenated and returned to the user * For extra large documents such as books, this summary can be recursively fed to `wdoc` using the argument --summary_n_recursion=2 for example. * Those two tasks, query and summary, can be combined with --task summarize_then_query which will summarize the document but give you a prompt at the end to ask questions in case you want to clarify things. * For more, you can read [examples.md](https://github.com/thiswillbeyourgithub/wdoc/blob/main/wdoc/docs/examples.md). * Note that there is [an official Open-WebUI Tool](https://openwebui.com/t/qqqqqqqqqqqqqqqqqqqq/wdoctool) that is even simpler to use. </details> ## Features * **15+ filetypes**: also supports combining them to load recursively or to define a complex heterogeneous corpus such as a list of files, a list of links, regex matches, youtube playlists, etc. See [Filetypes](#filetypes) and [Recursive Filetypes](#recursive-filetypes). All filetypes can be seamlessly combined in the same index, meaning you can query your anki collection at the same time as your work PDFs. It supports removing silence from audio files and youtube videos too! There is even a `ddg` filetype to search the web using [DuckDuckGo](https://en.wikipedia.org/wiki/DuckDuckGo). * **100+ LLMs and many embeddings**: Supports any LLM by OpenAI, Mistral, Claude, Ollama, Openrouter, etc. thanks to [litellm](https://docs.litellm.ai/). The list of supported embedding engines can be found [here](https://docs.litellm.ai/docs/embedding/supported_embedding) but includes at least OpenAI (or any OpenAI API compatible models), Cohere, Azure, Bedrock, NVIDIA NIM, Hugging Face, Mistral, Ollama, Gemini, Vertex, Voyage. 
* **Local and Private LLM**: When in private mode, measures are taken to make sure no data leaves your computer and goes to an LLM provider: no API keys are used, all `api_base` are user set, caches are isolated from the rest, outgoing connections are censored by overloading python sockets, etc. * **Advanced RAG to query lots of diverse documents**: 1. The documents are retrieved using embeddings 2. Then a weak LLM model ("Eve the Evaluator") is used to tell which of those documents are not relevant 3. Then the strong LLM is used to answer ("Anna the Answerer") the question using each individual remaining document. 4. Then all relevant answers are combined ("Carl the Combiner") into a single short markdown-formatted answer. Before being combined, they are batched by semantic clusters and semantic order using scipy's hierarchical clustering and leaf ordering; this makes it easier for the LLM to combine the answers in a manner that makes bottom up sense. `Eve the Evaluator`, `Anna the Answerer` and `Carl the Combiner` are the names given to each LLM in their system prompt; this way you can easily add specific additional instructions to a specific step. There's also `Sam the Summarizer` for summaries and `Raphael the Rephraser` to expand your query. 5. Each document is identified by a unique hash and the answers are sourced, meaning you know which document each piece of information in the answer comes from. * Supports a special syntax like "QE >>>> QA" where QE is a question used to filter the embeddings and QA is the actual question you want answered. * **Web Search**: Preliminary support for web search using [DuckDuckGo](https://en.wikipedia.org/wiki/DuckDuckGo). Just do `uvx wdoc web "How is Nvidia doing this month?"` * **Advanced summary**: * Instead of unusable "high level takeaway" points, compress the reasoning, arguments, thought process etc of the author into an easy to skim markdown file. * The summaries are then checked again n times for correct logical indentation etc. 
* The summary can be in the same language as the documents or directly translated. * **Many tasks**: See [Supported tasks](#Tasks). * **Trust but verify**: The answer is sourced: `wdoc` keeps track of the hash of each document used in the answer, allowing you to verify each assertion. * **Markdown formatted answers and summaries**: using [rich](https://github.com/Textualize/rich). * **Sane embeddings**: By default use sophisticated embeddings like [multi query retrievers](https://python.langchain.com/docs/how_to/MultiQueryRetriever) but also include SVM, KNN, parent retriever etc. Customizable. * **Fully documented** Lots of docstrings, lots of in code comments, detailed `--help` etc. Take a look at the [examples.md](https://github.com/thiswillbeyourgithub/wdoc/blob/main/wdoc/docs/examples.md) for a list of shell and python examples. The full help can be found in the file [help.md](https://github.com/thiswillbeyourgithub/wdoc/docs/help.md) or via `uvx wdoc --help`. I work hard to maintain an exhaustive documentation. The complete documentation in a single page is available [on the website](https://wdoc.readthedocs.io/en/latest/all_docs.html). * **Scriptable / Extensible**: You can use `wdoc` as an executable or as a library. Take a look at the scripts [below](#scripts-made-with-wdoc). There is even [an open-webui Tool](https://openwebui.com/t/qqqqqqqqqqqqqqqqqqqq/wdoctool). * **Strictly Typed**: Runtime type checking without performance penalty thanks to the incredible [beartype](https://beartype.readthedocs.io/en/latest/)! Opt out using an environment flag: `WDOC_TYPECHECKING="disabled / warn / crash" wdoc` (by default: `warn`). * **LLM (and embeddings) caching**: speed things up, as well as index storing and loading (handy for large collections). * **Good PDF parsing** PDF parsers are notoriously unreliable, so 15 (!) different loaders are used, and the best according to a parsing scorer is kept. 
Including table support via [openparse](https://github.com/Filimoa/open-parse/) (no GPU needed by default) or via [UnstructuredPDFLoader](https://python.langchain.com/docs/integrations/document_loaders/unstructured_pdfloader/). * **Langfuse support**: If you set the appropriate langfuse environment variables they will be used. See [this guide](https://langfuse.com/docs/integrations/langchain/tracing) or [this one](https://langfuse.com/docs/integrations/litellm/tracing) to learn more (Note: this is disabled if using private_mode to avoid any leaks). * **Document filtering**: based on regex for document content or metadata. * **Binary embeddings support**: Custom langchain VectorStore to use binary embeddings, leading (potentially, as it depends on the embeddings model) to [~32x better compression ratio, faster search and usually negligible accuracy loss](https://simonwillison.net/2024/Mar/26/binary-vector-search/). * **Fast**: Parallel document loading, parsing, embeddings, querying, etc. * **Shell autocompletion** using [python-fire](https://github.com/google/python-fire/blob/master/docs/using-cli.md#completion-flag) * **Notification callback**: Can be used for example to get summaries on your phone using [ntfy.sh](ntfy.sh). * **Hacker mindset**: I'm a friendly dev! Just open an issue if you have a feature request or anything else. ### Tasks * **query** give documents and asks questions about it. * **search** only returns the documents and their metadata. For anki it can be used to directly open cards in the browser. * **summarize** give documents and read a summary. The summary prompt can be found in `utils/prompts.py`. * **summarize_then_query** summarize the document then allow you to query directly about it. ### Filetypes * **anki**: any subset of an [anki](https://github.com/ankitects/anki) collection db. 
`alt` and `title` of images can be shown to the LLM, meaning that if you used [the ankiOCR addon](https://github.com/cfculhane/AnkiOCR) this information will help contextualize the note for the LLM. * **auto**: default, guess the filetype for you * **epub**: barely tested because epub is in general a poorly defined format * **json_dict**: a text file containing a single json dict. * **local_audio**: supports many file formats, can use either OpenAI's whisper or [deepgram](https://deepgram.com)'s Nova-3 model. Supports automatically removing silence etc. Note: audio that are too large for whisper (usually >25mb) are automatically split into smaller files, transcribed, then combined. Also, audio transcripts are converted to text containing timestamps at regular intervals, making it possible to ask the LLM when something was said. * **local_html**: useful for website dumps * **local_video**: extract the audio then treat it as **local_audio** * **logseq_markdown**: thanks to my other project: [LogseqMarkdownParser](https://github.com/thiswillbeyourgithub/LogseqMarkdownParser) you can use your [Logseq graph](https://github.com/logseq/logseq/) * **online_media**: use youtube_dl to try to download videos/audio, if fails try to intercept good url candidates using playwright to load the page. Then processed as **local_audio** (but works with video too). * **online_pdf**: via URL then treated as a **pdf** (see above) * **pdf**: 15 default loaders are implemented, heuristics are used to keep the best one and stop early. Table support via [openparse](https://github.com/Filimoa/open-parse/) or [UnstructuredPDFLoader](https://python.langchain.com/docs/integrations/document_loaders/unstructured_pdfloader/). Easy to add more. * **powerpoint**: .ppt, .pptx, .odp, ... * **string**: the cli prompts you for a text so you can easily paste something, handy for paywalled articles! 
* **text**: send a text content directly as path * **txt**: .txt, markdown, etc * **url**: try many ways to load a webpage, with heuristics to find the best parsed one * **word**: .doc, .docx, .odt, ... * **youtube**: text is then either from the yt subtitles / translation or even better: using whisper / deepgram. Note that youtube subtitles are downloaded with the timecode (so you can ask 'when does the author talk about such and such') but at a lower sampling frequency (instead of one timecode per second, only one per 15s). Youtube chapters are also given as context to the LLM when summarizing, which probably helps it a lot. ### Recursive Filetypes * **ddg**: does an online web search using [DuckDuckGo](https://en.wikipedia.org/wiki/DuckDuckGo). This is not an agent search, we only use `wdoc` over the urls fetched by DuckDuckGo and return the result. Only supported by `query` tasks. * **json_entries**: takes a path to a file where each line is a json **dict** that contains arguments to use when loading. Example: load several other recursive types. An example can be found in `docs/json_entries_example.json`. * **link_file**: turns a text file where each line contains a url into appropriate loader arguments. Supports any link, so for example webpages, links to pdfs and youtube links can be in the same file. Handy for summarizing lots of things! * **recursive_paths**: turns a path, a regex pattern and a filetype into all the files found recursively, each treated as the specified filetype (for example many PDFs or lots of HTML files etc). * **toml_entries**: read a .toml file. An example can be found in `docs/toml_entries_example.toml`. * **youtube playlists**: get the link for each video then process as **youtube** ## Walkthrough and examples Refer to [examples.md](https://github.com/thiswillbeyourgithub/wdoc/blob/main/wdoc/docs/examples.md). 
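To make the retrieval loop behind `--top_k=auto_200_500` (described in the short guide) more concrete, here is a minimal, self-contained sketch of the idea: fetch `top_k` candidates, let a cheap evaluator judge them, and widen the search while almost everything still looks relevant. This is an illustration only, not `wdoc`'s actual code: `adaptive_retrieve`, `search`, `is_relevant` and the doubling step are all assumptions; only the 200/500 bounds and the 90% threshold come from the description in this README.

```python
from typing import Callable, List


def adaptive_retrieve(
    search: Callable[[int], List[str]],
    is_relevant: Callable[[str], bool],
    start_k: int = 200,
    max_k: int = 500,
    threshold: float = 0.9,
) -> List[str]:
    """Return the relevant documents, widening top_k while most hits are relevant.

    `search(k)` stands in for the embedding search and `is_relevant(doc)` for
    the cheap evaluator LLM; both are hypothetical placeholders.
    """
    top_k = start_k
    while True:
        candidates = search(top_k)
        relevant = [doc for doc in candidates if is_relevant(doc)]
        ratio = len(relevant) / len(candidates) if candidates else 0.0
        # Stop once irrelevant documents show up, the cap is reached,
        # or the corpus has no more documents to offer.
        if ratio <= threshold or top_k >= max_k or len(candidates) < top_k:
            return relevant
        top_k = min(top_k * 2, max_k)  # assumed growth rule


# Toy corpus: the first 300 documents are on-topic, the rest are not.
corpus = [f"doc{i}" for i in range(1000)]
found = adaptive_retrieve(
    search=lambda k: corpus[:k],
    is_relevant=lambda doc: int(doc[3:]) < 300,
)
print(len(found))  # prints 300
```

In the toy run, the first pass (200 candidates) is entirely relevant, so the search widens to 400; only 75% of those pass the evaluator, so the loop stops and keeps the 300 relevant documents. A real implementation would presumably cache evaluator verdicts so that documents are not judged twice.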
## Getting started *`wdoc` was mainly developped and tested on python 3.13.5 but for compatibility it is installable with python version `>=3.11`. If possible, try to use python `3.13`.* ### Direct Installation 1. To install: * The recommended invocation is simply `uvx wdoc[full]` (see [uv](https://docs.astral.sh/uv/)). * You can specify the `dev` branch like so: `uvx --from git+https://github.com/thiswillbeyourgithub/wdoc@dev[full] wdoc` * If you cloned the repository and have modified the code: `uvx --from PATH/TO/WDOC[full] --refresh wdoc` * **Picking only the loaders you need**: `wdoc` ships in a modular fashion so you don't have to pull in heavy ML dependencies you won't use. Plain `wdoc` already includes the engine plus the **PDF** and **URL/web search** loaders (the most common cases). Optional extras: * `wdoc[youtube]` -- youtube videos and playlists (yt-dlp, youtube-transcript-api) * `wdoc[audio]` -- local audio/video transcription (deepgram, pydub, torchaudio, ffmpeg-python) * `wdoc[anki]` -- anki collection loading (ankipandas) * `wdoc[office]` -- word/powerpoint/epub and other office formats (unstructured[all-docs], docx2txt, pandoc) * `wdoc[logseq]` -- logseq markdown graphs * `wdoc[fasttext]` -- language detection (buggy on windows, hence optional) * `wdoc[pdftotext]` -- an additional pdf parser that needs system libs (`sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev`) * You can combine extras freely: `uvx wdoc[youtube,audio,anki]`. * `wdoc[full]` is a shortcut that includes all the loader extras above (excluding `fasttext` and `pdftotext`, which need special handling). If unsure, use `wdoc[full]` and don't worry about filetypes. * If you have problems with pdftotext or fasttext, try `uvx wdoc[full,pdftotext,fasttext]`. * If you plan on contributing, you will also need `wdoc[dev]` for the commit hooks. 
* **Claude Code users**: to give Claude Code knowledge of `wdoc`'s CLI and Python API, install the [SKILL.md](./SKILL.md) reference file: ```bash mkdir -p ~/.claude/skills/wdoc && wget -O ~/.claude/skills/wdoc/SKILL.md https://raw.githubusercontent.com/thiswillbeyourgithub/wdoc/main/SKILL.md ``` 2. Add the API key for the backend you want as an environment variable: for example `export ANTHROPIC_API_KEY="***my_key***"` 3. Launch is as easy as using `uvx wdoc --task=query --path=MYDOC [ARGS]` * If for some reason this fails, maybe try with `python -m wdoc`. And if everything fails, try with `uvx wdoc@latest`, or as last resort clone this repo and try again after `cd` inside it? Don't hesitate to open an issue. * To get shell autocompletion: if you're using zsh: `eval $(cat shell_completions/wdoc_completion.zsh)`. Also provided for `bash` and `fish`. You can generate your own with `uvx wdoc -- --completion MYSHELL > my_completion_file"`. * Don't forget that if you're using a lot of documents (notably via recursive filetypes) it can take a lot of time (depending on parallel processing too, but you then might run into memory errors). * Take a look at the [examples.md](https://github.com/thiswillbeyourgithub/wdoc/blob/main/wdoc/docs/examples.md) for a list of shell and python examples. 4. To ask questions about a local document: `uvx wdoc[office] query --path="PATH/TO/YOUR/FILE" --filetype="auto"` * If you want to reduce the startup time by directly loading the embeddings from a previous run (although the embeddings are always cached anyway): add `--saveas="some/path"` to the previous command to save the generated embeddings to a file and replace with `--loadfrom "some/path"` on every subsequent call. 5. To do an online search, the idea is `uvx wdoc --task=query --path='How is Nvidia doing this month?' --query='How is Nvidia doing this month' --filetype=ddg`. But if any of `path` or `query` is missing, we replace it by the other one. 
This can also be used like so: `uvx wdoc web 'How is Nvidia doing this month?'`. 6. For more: read the documentation at `uvx wdoc --help` ### Experimental Docker Interface You can also use the experimental docker interface to use `wdoc` in the browser (including on a smartphone!). See the [Docker README](./docker/README.md) for detailed instructions. ## Scripts made with wdoc * *More to come in [the scripts folder](./scripts/)*. * [Ntfy Summarizer](scripts/NtfySummarizer): automatically summarize a document from your android phone using [ntfy.sh](ntfy.sh). * [TheFiche](scripts/TheFiche): create summaries for specific notions directly as a [logseq](https://github.com/logseq/logseq) page. * [FilteredDeckCreator](scripts/FilteredDeckCreator): directly create an [anki](https://ankitects.github.io/) filtered deck from the cards found by `wdoc`. * [Official Open-WebUI Tool](https://openwebui.com/t/qqqqqqqqqqqqqqqqqqqq/wdoctool), hosted [here](https://github.com/thiswillbeyourgithub/openwebui_custom_pipes_filters/blob/main/tools/wdoc_tools.py). * [MediaURLFinder](scripts/MediaURLFinder) simply leverages the `find_online_media` loader helper to use `playwright` and `yt-dlp` to find all the URLs of media (videos, audio etc). This is especially useful if `yt-dlp` alone is not able to find the URL of a resource. ## FAQ <details> <summary> FAQ </summary> * **Who is this for?** * `wdoc` is for power users who want document querying on steroids, and in-depth AI powered document summaries. * **What's RAG?** * A RAG system (retrieval augmented generation) is basically an LLM powered search through a text corpus. * **Why make another RAG system? 
Can't you use any of the others?** * I'm [Olicorne](https://olicorne.org/), a psychiatry resident who needed a tool to ask medical questions from **a lot** (tens of thousands) of documents, of different types (epub, pdf, [anki](https://ankitects.github.io/) database, [Logseq](https://github.com/logseq/logseq/), website dump, youtube videos and playlists, recorded conferences, audio files, etc). Existing solutions couldn't handle this diversity and scale of content. * **Why is `wdoc` better than most RAG systems to ask questions on documents?** * It uses both a strong and a query_eval LLM. After finding the appropriate documents using embeddings, the query_eval LLM is used to filter out the documents that don't seem to be about the question, then the strong LLM answers the question based on each remaining document, then combines them all into a neat markdown answer. Also `wdoc` is very customizable. * **Can you use wdoc on `wdoc`'s documentation?** * Yes of course! `uvx wdoc --task=query --path https://wdoc.readthedocs.io/en/latest/all_docs.html` * **Why can `wdoc` also produce summaries?** * I have little free time so I needed a tailor made summary feature to keep up with the news. But most summary systems are rubbish and just try to give you the high level takeaway points, and don't handle text chunking properly. So I made my own tailor made summarizer. **The summary prompts can be found in `utils/prompts.py` and focus on extracting the arguments/reasoning/thought process of the author then use markdown indented bullet points to make it easy to read.** It's really good! The prompts dataclass is not frozen so you can provide your own prompt if you want. * **Which tasks are supported by `wdoc`?** * See [Tasks](#tasks). * **Which LLM providers are supported by `wdoc`?** * `wdoc` supports virtually any LLM provider thanks to [litellm](https://docs.litellm.ai/). 
It even supports local LLMs and local embeddings (see [examples.md](https://github.com/thiswillbeyourgithub/wdoc/blob/main/wdoc/docs/examples.md)). The list of supported embedding engines can be found [here](https://docs.litellm.ai/docs/embedding/supported_embedding) but includes at least OpenAI (or any OpenAI API compatible models), Cohere, Azure, Bedrock, NVIDIA NIM, Hugging Face, Mistral, Ollama, Gemini, Vertex, Voyage.
* **What do you use `wdoc` for?**
    * I follow heterogeneous sources to keep up with the news: youtube, website, etc. So thanks to `wdoc` I can automatically create awesome markdown summaries that end up straight into my [Logseq](https://github.com/logseq/logseq/) database as a bunch of `TODO` blocks.
    * I use it to ask technical questions to my vast heterogeneous corpus of medical knowledge.
    * I use it to query my personal documents using the `--private` argument.
    * I sometimes use it to summarize a document then go straight to asking questions about it, all in the same command.
    * I use it to ask questions about entire youtube playlists.
    * Other use cases are the reason I made the [scripts made with `wdoc` section](#scripts-made-with-wdoc).
* **What's up with the name?**
    * One of my favorite characters (and somewhat of a role model) is [Winston Wolf](https://www.youtube.com/watch?v=UeoMuK536C8), and after much hesitation I decided `WolfDoc` would be too confusing and `WinstonDoc` sounds like something micro$oft would do. Also `wd` and `wdoc` were free, whereas `doctools` was already taken. The initial name of the project was `DocToolsLLM`, a play on words between 'doctor' and 'tool'.
* **How can I improve the prompt for a specific task without coding?**
    * The prompts of the `query` task role-play as employees working for WDOC-CORP©, either as `Eve the Evaluator` (the LLM that filters out irrelevant documents), `Anna the Answerer` (the LLM that answers the question from a filtered document) or `Carl the Combiner` (the LLM that combines answers from Answerer as one). There's also `Sam the Summarizer` for summaries and `Raphael the Rephraser` to expand your query. They all take orders from you if you address them in a prompt.
* **How can I use `wdoc`'s parser for my own documents?**
    * If you are in the shell cli you can easily use `uvx wdoc parse my_file.pdf`. Add `--format=langchain_dict` to get the text and metadata as a list of dict, otherwise you will only get the text. Other formats exist including `--format=xml` to make it LLM friendly like [files-to-prompt](https://github.com/simonw/files-to-prompt).
    * If you want the document using python:
      ``` python
      from wdoc import wdoc

      my_path = "my_file.pdf"  # path to the document to parse
      list_of_docs = wdoc.parse_doc(path=my_path)
      ```
    * Another example would be to use wdoc to parse an anki deck: `uvx wdoc[anki] parse --filetype "anki" --anki_profile "Main" --anki_deck "mydeck::subdeck1" --anki_notetype "my_notetype" --anki_template "<header>\n{header}\n</header>\n<body>\n{body}\n</body>\n<personal_notes>\n{more}\n</personal_notes>\n<tags>{tags}</tags>\n{image_ocr_alt}" --anki_tag_filter "a::tag::regex::.*something.*" --format=text`
* **What should I do if my PDFs are encrypted?**
    * If you're on linux you can try running `qpdf --decrypt input.pdf output.pdf`
    * I made a quick and dirty batch script for this [in this repo](https://github.com/thiswillbeyourgithub/PDF_batch_decryptor)
* **How can I add my own pdf parser?**
    * Write a python class and add it there: `wdoc.utils.loaders.pdf_loaders['parser_name']=parser_object` then call `wdoc` with `--pdf_parsers=parser_name`. 
    * The class has to take a `path` argument in `__init__` and have a `load` method that takes no argument and returns a `List[Document]`. Take a look at the `OpenparseDocumentParser` class for an example.
* **Can `wdoc` add source citations to summaries?**
    * Yes! When summarizing documents that have page metadata (like PDFs), `wdoc` automatically adds `[p.N]` citations to bullet points, tracking which page the information came from. For multi-file summaries, citations include the filename: `[p.N, file.pdf]`. You can also use `--citation_url_template` to turn these into clickable markdown links pointing to your own document server (e.g. `--citation_url_template="https://my-site.com/docs/{source}#page={page}"`). This feature was developed with Claude Code.
    * For the query task, source documents are referenced with clickable anchor links `[N](#document-N)` in the final answer.
* **What should I do if I keep hitting rate limits?**
    * The simplest way is to add the `debug` argument. It will disable multithreading, multiprocessing and LLM concurrency. A less harsh alternative is to set the environment variable `WDOC_LLM_MAX_CONCURRENCY` to a lower value.
* **How can I run the tests?**
    * Take a look at the file `./tests/run_all_tests.sh`.
* **How can I query a text but without chunking? / How can I query a text with the full text as context?**
    * If you set the environment variable `WDOC_MAX_CHUNK_SIZE` to a very high value and use a model with enough context according to litellm's metadata, then no chunking will happen and the LLM will have the full text as context.
* **Is there a way to use `wdoc` with [Open-WebUI](https://github.com/open-webui/open-webui/)?**
    * Yes! I am maintaining an [official Open-WebUI Tool](https://openwebui.com/t/qqqqqqqqqqqqqqqqqqqq/wdoctool) which is hosted [here](https://github.com/thiswillbeyourgithub/openwebui_custom_pipes_filters/blob/main/tools/wdoc_tools.py).
* **Is there a web UI for `wdoc`?**
    * Yes! 
An [experimental Docker-based Gradio web interface](./docker/README.md) is available for easy deployment and use without command-line interaction.
* **Can I use shell pipes with `wdoc`?**
    * Yes! Data sent using shell pipes (be it for strings or binary data) will be automatically saved to a temporary file which is then passed as `--path=[temp_file]` argument. For example `cat **/*.txt | uvx wdoc --task=query`, `echo $my_url | uvx wdoc parse` or even `cat my_file.pdf | uvx wdoc parse --filetype=pdf`. For binary input it is strongly recommended to use a `--filetype` argument because `python-magic` version <=0.4.27 chokes otherwise (see [that issue](https://github.com/ahupp/python-magic/issues/261)).
* **Can the environment variables be set at runtime?**
    * Sort of. When importing `wdoc`, code in `wdoc/utils/env.py` creates a dataclass that holds the environment variables used by `wdoc`. This is done primarily to ensure runtime type checking and to ensure that when an env variable is accessed inside wdoc's code (through the dataclass) it is always compared to the environment one. If you change env variables while your code runs, the new value will be used inside `wdoc`. But that's somewhat brittle because some env variables are used to store the *default* value of some function or class; those are only read at import time, so they will be out of sync. Additionally, `wdoc` will intentionally crash if it suspects the `WDOC_PRIVATE_MODE` env var is out of sync, just to be safe. Also note that if env vars like `WDOC_LANGFUSE_PUBLIC_KEY` are found, `wdoc` will overwrite `LANGFUSE_PUBLIC_KEY` with it. This is because `litellm` (maybe others) looks for this env variable to enable `langfuse` callbacks. This whole contraption makes it possible to set env variables for a specific user when using the `open-webui` `wdoc` tool. Feedback is very welcome for this feature.
* **How can I build the autodoc using sphinx?**
    * The command I've been using is `sphinx-apidoc -o docs/source/ wdoc --force`, to call from the root of this repository.
* **Why can't I load the vectorstores in other langchain projects?**
    * In `wdoc/utils/customs/binary_faiss_vectorstore.py`, we create `BinaryFAISS` and `CompressedFAISS`. The latter is just like FAISS but with zlib compression applied to the pickled index, and the former adds binary embeddings on top, resulting in faster and more compact embeddings. If you want to disable compression altogether, use the env variable `WDOC_MOD_FAISS_COMPRESSION=false`.
* **Which python version is used in the test suite?**
    * The recommended python version is `3.12.11`.
* **Why does the online search only support the 'query' task?**
    * The way `wdoc` works for summaries is to take the "whole document", chunk it into sequential "documents" and iteratively create the summary. But if we start with several documents (say different web pages) then the "sequence" wouldn't make sense.

</details>

## Roadmap

<details>
<summary>
Click to read more
</summary>

<i>This TODO list is maintained automatically by [MdXLogseqTODOSync](https://github.com/thiswillbeyourgithub/MdXLogseqTODOSync)</i>

<!-- BEGIN_TODO -->
- ## Most urgent
    - figure out a good way to skip merging batches that are too large before trying to merge them
        - probably means adding an env var to store a max value, document it in the help.md
        - then check after batch creation if a batch is that large
        - if it is put it in a separate var, to be concatenated later with the rest of the answers
    - add more tests
        - add test for the private mode
        - add test for the testing models
        - add test for the recursive loader functions
        - add test for each loader
    - rewrite the python API to make it more useable. 
(also related to https://github.com/thiswillbeyourgithub/wdoc/issues/13)
        - pay attention to how to modify the init and main.py files
        - pay attention to how the --help flag works
        - pay attention to how the USAGE document is structured
    - support other vector databases
    - learn how to set a github action for test code coverage
    - allow anki to use anki type search queries
    - refactor the tasks to use langgraph, as it seems easier to do complex recursive tasks with it
    - use async for the langchain chains
- ### Features
    - use clusters of semantic ordering instead of just the order you dumbass
    - ability to cap the search documents by a number of tokens instead of a number of documents
    - Add prompt caching for claude
    - add a "fast summary" feature that does not use recursive summary if you care more about speed than overlapping summaries
    - count how many times each source is used, as it can be relevant to infer answer quality
    - add an html format output. It would display a nice UI with proper dropdowns for sources etc
    - if a model supports structured output we should make use of it to get the thinking and answer part. Opt in because some models hide their thoughts.
    - add an intermediate step for queries that asks the LLM for appropriate headers for the md output. Then for each intermediate answer attribute it a list of 1 to 3 headers (because a given intermediate answer can contain several pieces of information), then do the batch merge of intermediate answer per header.
        - this needs to be scalable and easy to add recursion to (because then we can do this for subheaders and so on)
        - the end goal is to have a scalable solution to answer queries about extremely large documents for impossibly vast questions
    - use apprise instead of ntfy for the scripts
    - add crawl4ai parser: https://github.com/unclecode/crawl4ai
    - Way to add the title (or all metadata) of a document to its own text. Enabled by default. 
Because this would allow searching among many documents that don't refer to the original title (for example: material safety datasheets)
        - default value is "author" "page" "title"
        - pay attention to avoid including personal info (for example use relative paths instead of absolute paths)
    - add a /save PATH command to save the chat and metadata to a json file
    - add image support printing via icat or via the other lib you found last time, would be useful for summaries etc
    - add wdoc to tldr pages
    - add an audio backend to use the subtitles from a video file directly
    - store the anki images as 'imagekeys' as the idea works for other parsers too
    - investigate asking the LLM to add leading emojis to the bullet point for improved reading
    - add a key/val arg to specify the trust we have in a doc, call it context
    - add a way to open the documents automatically, based on platform dirs etc. For ex if okular is installed, open pdfs directly at the right page
        - the best way would be to create opener.py that does a bit like loader but for all filetypes and platforms
    - use a cli selector like in mnemonics creator
        - add shortcut to sort by score or by name
        - display metadata and score in a previewer
    - add an argument --whole_text to avoid chunking (this would just increase the chunk size to a super large number I guess)
    - add apprise callback support
    - add a filetype "custom_parser" and an argument "--custom_parser" containing a path to a python file. 
Must receive a docdict and a few other things and return a list of documents
    - add bespoke-minicheck from ollama to fact check when using RAG: https://ollama.com/library/bespoke-minicheck
        - or via their API directly : https://docs.bespokelabs.ai/bespoke-minicheck/api but they don't seem to properly disclose what they do with the data
    - add a langchain code loader that uses aider to get the repomap
        - https://github.com/paul-gauthier/aider/issues/1043#issuecomment-2278486840
        - https://aider.chat/docs/scripting.html
    - add a pikepdf loader because it can be used to automatically decrypt pdfs
    - add a query_branching_nb argument that asks an LLM to identify a list of keywords from the intermediate answers, then look again for documents using this keyword and filtering via the weak llm
    - write a script that shows how to use bertopic on the documents of wdoc
    - add a retriever where the LLM answers without any context
    - add support for readabilipy for parsing html
        - https://github.com/alan-turing-institute/ReadabiliPy
    - add an obsidian loader
        - https://pypi.org/project/obsidiantools/
    - add a /chat command to the prompt, it would enable starting an interactive session directly with the llm
    - find a way to make it work with llm from simonw
    - make images an actual filetype
- ### Enhancements
    - store the available tasks in a dataclass in misc.py
    - maybe add support for docling to parse documents?
    - when querying hard stuff the number of dropped documents after batching is non negligible, we should remove those from the list of documents to display and instead store those in another variable
    - check if using html syntax is less costly and confusing to LLMs than markdown with all that indentation. Or maybe json. It would be simple to turn that into markdown afterwards.
    - check that the task search works on things other than anki
    - create a custom retriever, derived from multiquery retriever that does actual parallel requests. 
Right now it's not the case (maybe in async but I don't plan on using async for now). This retriever seems to account for a good part of the slowdown.
    - stop using your own youtube timecode parser and instead use langchain's chunk transcript format
    - implement usearch instead of faiss, it seems faster in all respects, supports quantized embeddings, I trust their langchain implementation more
        - https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.usearch.USearch.html#langchain_community.vectorstores.usearch.USearch
    - Use an env var to drop_params of litellm
    - add more specific exceptions for file loading error. One exception for all, one for batch and one for individual loader
    - use heuristics to find the best number of clusters when doing semantic reranking
    - arg to use jina v3 embeddings for semantic batching because it allows specifying tasks that seem appropriate for that
    - add an env variable or arg to overload the backend url for whisper. Then set it always for you and mention it there: https://github.com/fedirz/faster-whisper-server/issues/5
    - find a way to set a max cost above which a query crashes, probably via the price callback
    - anki_profile should be able to be a path
    - store wdoc's version and indexing timestamp in the metadata of the document
    - arg --oneoff that does not trigger the chat after replying. Allowing to not hog all the RAM if ran in multiple terminals for example through SSH
    - add a (high) token threshold above which two texts are not combined but just concatenated in the semantic order. It would avoid losing context. 
Use a --- separator
    - compute the cost of whisper and deepgram
    - use a pydantic basemodel for output instead of a dict
        - same for summaries, it should at least contain the method to substitute the sources and then back
    - investigate storing the vectors in a sqlite3 file
    - make a plugin to llm that looks like file-to-prompt from simonw
    - Always bind a user metadata to litellm for langfuse etc
    - Add more metadata to each request to make langfuse more informative
    - add a reranker to better sort the output of the retrievers. Right now with the multiquery it returns way too many and I'm thinking it might be a bad idea to just crop at top_k as I'm doing currently
    - add a status argument that just outputs the logs location and size, the cache location and size, the number of documents etc
    - add the python magic of the file as a file metadata
    - add an env var to specify the threshold for relevant document by the query eval llm
    - find a way to return the evaluations for each document also
    - move retrievers.py in an embeddings folder
    - stop using lambda functions in the chains because it makes the code barely readable
    - when doing recursive summary: tell the model that if it's really sure that there are no modifications to do: it should just reply "EXIT" and it would save time and money instead of waiting for it to copy back the exact content
    - add image parsing as base64 metadata from pdf
    - use multiple small chains instead of one large and complicated and hard to maintain
    - add an arg to bypass query combine, useful for small models
    - tell the llm to write a special message if the parsing failed or we got a 404 or paywall etc
        - catch this text and crash
    - add check that all metadata is only made of int float and str
    - move the code that filters embeddings inside the embeddings.py file
        - this way we can dynamically refilter using the chat prompt
    - task summary then query should keep in context both the full text and the summary
    - if there's only one intermediate answer, pass it 
as answer without trying to recombine
    - filter_metadata should support an OR syntax
    - add a --show_models argument to display the list of available models
    - add a way to open the documents automatically, based on platform dirs etc. For ex if okular is installed, open pdfs directly at the right page
        - the best way would be to create opener.py that does a bit like loader but for all filetypes and platforms
    - add an image filetype: it will be either OCR'd using format and/or will be captioned using a multimodal llm, for example gpt4o mini
        - nanollava is a 0.5b that probably can be used for that with proper prompting
    - add a key/val arg to specify the trust we have in a doc, call this metadata context in the prompt
    - add an arg to return just the dict of all documents and embeddings. Notably useful to debug documents
    - use a class for the cli prompt, instead of a dumb function
    - arg to disable eval llm filtering
        - just answer 1 directly if no eval llm is set
    - display the number of documents and tokens in the bottom toolbar
    - add a demo gif
    - investigate asking the LLM to add leading emojis to the bullet point for quicker reading of summaries
    - see how easy or hard it is to use an async chain
    - ability to cap the search documents by a number of tokens instead of a number of documents
    - for anki, allow using a query instead of loading with ankipandas
    - add a "try_all" filetype that will try each filetype and keep the first that works
    - add textract extractor : https://textract.readthedocs.io/en/stable/
    - write a langchain compatible tool for agents
    - add bespoke-minicheck from ollama to fact check when using RAG: https://ollama.com/library/bespoke-minicheck
        - or via their API directly : https://docs.bespokelabs.ai/bespoke-minicheck/api but they don't seem to properly disclose what they do with the data
<!-- END_TODO -->

</details>
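
The FAQ entry on adding your own pdf parser describes an interface: a class that takes a `path` argument in `__init__` and exposes a `load` method with no argument returning a `List[Document]`. Below is a minimal sketch of a class with that shape. Everything in it is an assumption made for illustration: `Document` is a local stand-in for langchain's class so the snippet runs without dependencies, `PlainTextPdfParser` is a hypothetical name that reads pre-extracted text instead of a real PDF, and the `pdf_loaders` dict is a local stand-in for `wdoc.utils.loaders.pdf_loaders`. Whether `wdoc` expects the class itself or another kind of object in that registry is not stated in the FAQ, so check `OpenparseDocumentParser` before relying on this.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class Document:
    """Local stand-in for langchain's Document (assumption, for a dependency-free sketch)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


class PlainTextPdfParser:
    """Hypothetical parser: `path` in __init__, a no-argument load() returning documents."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def load(self) -> List[Document]:
        # Placeholder extraction: a real parser would read the PDF itself.
        # Here form feeds in an already-extracted text file mark page breaks.
        raw = self.path.read_text(encoding="utf-8", errors="replace")
        pages = raw.split("\f")
        return [
            Document(page_content=page, metadata={"source": str(self.path), "page": i})
            for i, page in enumerate(pages, start=1)
            if page.strip()  # skip empty pages
        ]


# Local stand-in for the registry named in the FAQ
# (wdoc.utils.loaders.pdf_loaders); registering the class is an assumption.
pdf_loaders: Dict[str, type] = {}
pdf_loaders["plain_text"] = PlainTextPdfParser
```

With `wdoc` installed, the equivalent registration would target `wdoc.utils.loaders.pdf_loaders` and the parser would be selected with `--pdf_parsers=plain_text`, as described in the FAQ.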

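The FAQ entry on vectorstores says `CompressedFAISS` applies zlib compression to the pickled index. The sketch below shows that general technique on a toy object; it is not `wdoc`'s actual code, and the function names `dump_compressed` and `load_compressed` are hypothetical. It suggests why a loader expecting a plain pickle would presumably not read such a file: the bytes have to be decompressed before unpickling.

```python
import pickle
import zlib
from typing import Any


def dump_compressed(obj: Any) -> bytes:
    """Pickle an object, then zlib-compress the resulting bytes."""
    return zlib.compress(pickle.dumps(obj))


def load_compressed(blob: bytes) -> Any:
    """Reverse of dump_compressed: decompress first, then unpickle.

    Only unpickle data you trust, since pickle can execute arbitrary code.
    """
    return pickle.loads(zlib.decompress(blob))


# Toy stand-in for an index: repetitive data compresses well
toy_index = {"vectors": [[0.0] * 64 for _ in range(100)], "ids": list(range(100))}

raw = pickle.dumps(toy_index)      # what a plain pickle would store
blob = dump_compressed(toy_index)  # what the compressed variant would store
restored = load_compressed(blob)   # round trip back to the original object
```

Setting `WDOC_MOD_FAISS_COMPRESSION=false`, as the FAQ mentions, disables the compression step altogether.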