rs-trafilatura
rs-trafilatura is a high-performance Rust library for fast, accurate web content extraction. It is a port of the Python trafilatura and Go go-trafilatura projects, designed to extract clean, readable text from web pages while removing boilerplate, navigation, menus, and advertisements. The library features machine learning-based page type classification that distinguishes seven page types (articles, forums, products, collections, listings, documentation, and service pages), with an F1 score of 96.6 on the ScrapingHub benchmark. Based on the detected page type, it applies a specialized extraction profile optimized for specific platforms and frameworks. A built-in extraction quality predictor uses an XGBoost model to produce a confidence score, helping users decide when to fall back to LLM processing for low-confidence results. Output supports GitHub Flavored Markdown with preserved formatting for headings, lists, tables, and code blocks, alongside rich metadata extraction fro
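
The confidence-based fallback described above can be sketched as a small routing function. This is an illustration only, not the crate's API: the `Extraction` struct, the `Route` enum, the `route` function, the assumption that the score lies in the range 0.0 to 1.0, and the example threshold are all hypothetical names and values introduced here. Only the seven page types come from the description.

```rust
/// The seven page types named in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageType {
    Article,
    Forum,
    Product,
    Collection,
    Listing,
    Documentation,
    Service,
}

/// Hypothetical result shape (NOT the crate's real types).
#[derive(Debug, Clone)]
pub struct Extraction {
    pub page_type: PageType,
    pub markdown: String,
    /// Assumed here to be a confidence score in [0.0, 1.0].
    pub confidence: f32,
}

/// What the caller does with an extraction result.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    /// Keep the extracted content as-is.
    Accept,
    /// Send the page to an LLM for a second pass.
    LlmFallback,
}

/// Route to the LLM when the score is below `threshold`.
/// A NaN score is treated as low confidence.
pub fn route(extraction: &Extraction, threshold: f32) -> Route {
    if extraction.confidence.is_nan() || extraction.confidence < threshold {
        Route::LlmFallback
    } else {
        Route::Accept
    }
}
```

With a threshold of 0.7 (an arbitrary value chosen for this sketch), an extraction scored 0.92 would be accepted and one scored 0.35 would be routed to the LLM. In practice the threshold would need tuning against the cost of an LLM call and the tolerance for poor extractions.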