Global News Intelligence Platform
Global news analytics with GDELT + AI + modern data stack
Live Demo • Features • Architecture • Tech Stack • Quick Start • Cost Efficiency
Overview
A full-stack data engineering project that ingests, processes, and visualizes 100,000+ daily global news events from the GDELT Project. Includes AI chat for natural language queries and a live analytics dashboard.
By the Numbers
| Metric | Value |
|---|---|
| Cumulative Events | 20M+ processed |
| Daily Ingestion | 100K+ events/day |
| Data History | 8+ months live data |
| Languages | 100+ monitored |
| Countries | 200+ covered |
| Query Speed | <1 second |
| Monthly Cost | $0 |
What is GDELT?
The GDELT Project monitors the world's news media from nearly every country in 100+ languages, identifying people, locations, themes, and emotions driving global society.
Dashboard Preview
Home - KPIs & Trending News

Emotions - GKG Mood Analysis (NEW!)

Analytics - Actors & Countries

AI Chat - Natural Language Queries

RAG Chat - AI Analysis of World Events

Feed - Event Stream

Features
| Feature | Description |
|---|---|
| Real-Time Dashboard | Live metrics, trending news, sentiment analysis, geographic distribution |
| Emotion Analytics | GKG-powered emotion tracking: Fear, Joy, Positive/Negative, Global Mood Index |
| AI Chat Interface | Ask questions in plain English → get SQL-powered answers |
| 15-Min Updates | cron-job.org (15-min external trigger) → GitHub Actions workflow_dispatch + Dagster job runner |
| Data Quality Gates | Custom data validation prevents bad data |
| Global Coverage | Events from 200+ countries with country code mapping |
| Trend Analysis | 30-day time series, intensity tracking, actor monitoring |
| Trending Topics | AI-extracted themes from global news (GKG) |
| Dark Mode UI | Custom dark theme, responsive Plotly charts |
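As a rough illustration of the emotion tracking, the sketch below parses a GKG tone string and averages the tone over a rolling 24-hour window. It assumes the seven comma-separated values of the GKG 2.1 tone field (tone, positive score, negative score, polarity, activity reference density, self/group reference density, word count). The names `parse_tone` and `rolling_mean_tone` are hypothetical, and this is not the project's Global Mood Index formula, which this README does not specify.

```python
from datetime import datetime, timedelta
from typing import Optional

# Field order of the GKG 2.1 tone string (labels here are illustrative).
TONE_FIELDS = (
    "tone", "positive_score", "negative_score", "polarity",
    "activity_ref_density", "self_group_ref_density", "word_count",
)


def parse_tone(raw: str) -> dict[str, float]:
    """Split a comma-separated GKG tone string into named floats."""
    parts = raw.split(",")
    if len(parts) != len(TONE_FIELDS):
        raise ValueError(f"expected {len(TONE_FIELDS)} tone values, got {len(parts)}")
    return dict(zip(TONE_FIELDS, (float(p) for p in parts)))


def rolling_mean_tone(
    records: list[tuple[datetime, str]],
    now: datetime,
    window: timedelta = timedelta(hours=24),
) -> Optional[float]:
    """Average the overall tone of records seen within `window` before `now`."""
    recent = [
        parse_tone(raw)["tone"]
        for seen_at, raw in records
        if now - window <= seen_at <= now
    ]
    return sum(recent) / len(recent) if recent else None
```

Returning `None` for an empty window is a design choice in this sketch, so that a dashboard could distinguish "no data" from a neutral tone of zero.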
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│                    PRODUCTION PIPELINE ARCHITECTURE                     │
└─────────────────────────────────────────────────────────────────────────┘
                 ┌──────────────┐          ┌──────────────┐
                 │ GDELT Events │          │  GDELT GKG   │
                 └──────┬───────┘          └──────┬───────┘
                        │                         │
                        └────────────┬────────────┘
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  INGESTION (Every 15 min via cron-job.org → workflow_dispatch)          │
│  GitHub Actions → Dagster → Polars (10x faster) → custom validation     │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  TRANSFORMATION                                                         │
│  dbt Core: staging (stg_events) → marts (fct_daily, dim_actors, etc.)   │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  STORAGE & AI                                                           │
│  MotherDuck (DWH) → Voyage AI (Embeddings) → Cerebras LLM (RAG/SQL)     │
│  └── gkg_emotions: Fear, Joy, Tone, Topics                              │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  PRESENTATION                                                           │
│  Streamlit: HOME | FEED | EMOTIONS | AI Chat | ABOUT                    │
└─────────────────────────────────────────────────────────────────────────┘
Data Flow (ELT Pipeline)
- Extract: GDELT Events API + GKG Feed → Polars (10x faster than Pandas)
- Validate: Custom schema + threshold data quality checks
- Load: Deduplicated data into MotherDuck (serverless DuckDB)
- Transform: dbt models create staging views and mart tables
- Emotions: GKG data → Extract tone, fear, joy, topics (rolling 24h)
- Embed: Voyage AI generates vectors every 12 hours
- Serve: Streamlit dashboard with AI chat (SQL + RAG modes)
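The Validate step can be pictured as a gate that runs before anything is loaded. The sketch below is a minimal, stdlib-only illustration and not the project's `DataQualityValidator`: the function name `validate_batch`, the choice of required columns, and the 5% null threshold are all assumptions made for the example.

```python
from typing import Any

# Hypothetical schema: column name -> expected Python type.
# The column names follow GDELT event fields; the selection is illustrative.
REQUIRED_COLUMNS: dict[str, type] = {
    "GLOBALEVENTID": int,
    "SQLDATE": int,
    "Actor1CountryCode": str,
}


def validate_batch(rows: list[dict[str, Any]], max_null_ratio: float = 0.05) -> list[str]:
    """Return a list of problems; an empty list means the batch may be loaded."""
    if not rows:
        return ["batch is empty"]

    problems: list[str] = []
    for column, expected_type in REQUIRED_COLUMNS.items():
        # Schema check: the column must be present in every row.
        missing = sum(column not in row for row in rows)
        if missing:
            problems.append(f"{column}: missing from {missing} row(s)")
            continue

        values = [row[column] for row in rows]

        # Threshold check: too many nulls fails the batch.
        null_ratio = sum(v is None for v in values) / len(rows)
        if null_ratio > max_null_ratio:
            problems.append(f"{column}: null ratio {null_ratio:.2%} exceeds threshold")

        # Type check on the non-null values.
        wrong_type = sum(v is not None and not isinstance(v, expected_type) for v in values)
        if wrong_type:
            problems.append(f"{column}: {wrong_type} value(s) are not {expected_type.__name__}")

    return problems
```

A pipeline step could refuse to load whenever the returned list is non-empty, which keeps the decision (fail, warn, quarantine) separate from the checks themselves.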
Tech Stack
Data Engineering
| Tool | Purpose | Replaces |
|---|---|---|
| Polars | High-performance DataFrame processing (10x faster) | Pandas |
| dbt Core | SQL transformations with staging/marts pattern | Raw SQL |
| DataQualityValidator | Custom schema + threshold validation & testing | Manual checks |
| Dagster | Pipeline orchestration with asset-based design | Apache Airflow |
| DuckDB/MotherDuck | Serverless cloud OLAP warehouse | Snowflake/Redshift |
| GitHub Actions | CI/CD with workflow_dispatch (15-min via cron-job.org) + 12-hr scheduled jobs | AWS Lambda |
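An external scheduler can start a `workflow_dispatch` workflow by calling the GitHub REST API. The sketch below only builds the HTTP request, so it can be inspected without network access. The helper name `build_dispatch_request` is hypothetical, the branch name `main` is an assumption, and whether cron-job.org is configured with exactly these headers is not stated in this README.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_dispatch_request(
    owner: str,
    repo: str,
    workflow_file: str,
    token: str,
    ref: str = "main",  # assumed default branch
) -> urllib.request.Request:
    """Build the POST request that triggers a workflow_dispatch run."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/actions/workflows/{workflow_file}/dispatches"
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )


# Example: the ingestion workflow listed under .github/workflows/.
request = build_dispatch_request(
    "Mohith-akash", "Global-News-Intel-Platform", "gdelt_ingest.yml", token="<token>"
)
```

Sending it with `urllib.request.urlopen(request)` requires a token with permission to run workflows; GitHub answers a successful dispatch with `204 No Content`.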
AI/ML
| Tool | Purpose | Replaces |
|---|---|---|
| Cerebras | LLM inference (GPT-OSS 120B) | OpenAI GPT-4 |
| LlamaIndex | Text-to-SQL query engine | Custom NLP |
| Voyage AI | Vector embeddings for RAG | OpenAI Embeddings |
| MotherDuck Vectors | Native vector similarity search | Pinecone / Weaviate |
Frontend
| Tool | Purpose | Replaces |
|---|---|---|
| Streamlit | Interactive dashboard framework | Tableau / Power BI |
| Plotly | Dynamic charts and visualizations | D3.js / Chart.js |
Skills Demonstrated
- Python (Polars, Pandas, RegEx, API integration)
- SQL (Complex queries, window functions, dbt models)
- Data Quality (custom data quality validation, schema testing)
- ELT Pipelines (Extract, Load, Transform with dbt)
- CI/CD (GitHub Actions, cron scheduling)
- Vector Search (Embeddings, cosine similarity, RAG)
Quick Start
Prerequisites
- Python 3.10+
- MotherDuck Account (free tier)
- Cerebras API Key (free tier)
Installation
# Clone the repository
git clone https://github.com/Mohith-akash/Global-News-Intel-Platform.git
cd Global-News-Intel-Platform
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# or
.\venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
Configuration
Create a .env file in the project root:
MOTHERDUCK_TOKEN=your_motherduck_token
CEREBRAS_API_KEY=your_cerebras_api_key
VOYAGE_API_KEY=your_voyage_api_key # Optional: enables RAG mode
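The sketch below shows one way an application could read these variables and decide whether RAG mode is available. It is illustrative only: the `Settings` class and `load_settings` function are hypothetical names, and it assumes the `.env` file has already been loaded into the process environment by some other mechanism.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED_KEYS = ("MOTHERDUCK_TOKEN", "CEREBRAS_API_KEY")


@dataclass(frozen=True)
class Settings:
    motherduck_token: str
    cerebras_api_key: str
    voyage_api_key: Optional[str]

    @property
    def rag_enabled(self) -> bool:
        # RAG mode needs embeddings, so it depends on the optional Voyage key.
        return self.voyage_api_key is not None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from `env`, failing early if a required key is absent."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return Settings(
        motherduck_token=env["MOTHERDUCK_TOKEN"],
        cerebras_api_key=env["CEREBRAS_API_KEY"],
        voyage_api_key=env.get("VOYAGE_API_KEY") or None,
    )
```

Failing at startup with the list of missing keys tends to be easier to debug than a connection error surfacing later in the dashboard.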
Run the Dashboard
streamlit run app.py
Run the Pipeline Manually
# Polars-powered ingestion (15-min schedule)
python -m dagster job execute -f etl/pipeline_polars.py -j gdelt_ingestion_job
# Embedding generation (12-hour schedule)
python -m dagster job execute -f etl/embedding_job.py -j gdelt_embedding_job
# Run dbt models
cd dbt && dbt run
Enterprise Tools vs My Stack
This project demonstrates how to achieve enterprise-grade capabilities at zero cost:
| Enterprise Tool | Monthly Cost | My Alternative | My Cost |
|---|---|---|---|
| Databricks/Spark | ~$500 | DuckDB | $0 |
| Snowflake/BigQuery | ~$300 | MotherDuck | $0 |
| Managed Airflow | ~$300 | Dagster + GitHub Actions | $0 |
| dbt Cloud | ~$100 | dbt Core (self-hosted) | $0 |
| Pinecone/Weaviate | ~$70 | MotherDuck Vectors | $0 |
| OpenAI Embeddings | ~$50 | Voyage AI | $0 |
| OpenAI GPT-4 | ~$100 | Cerebras | $0 |
| Tableau/Power BI | ~$70 | Streamlit | $0 |
| TOTAL | ~$1,490+ | | $0 |
Key Insight: MotherDuck's native vector search eliminates the need for a separate vector database like Pinecone.
Technology Evolution
This project evolved through multiple iterations to optimize for cost and performance:
Data Warehouse
Snowflake (trial) → MotherDuck (free tier)
- Started with Snowflake trial for learning enterprise DWH
- Migrated to MotherDuck to eliminate costs while keeping SQL compatibility
AI/LLM Provider
Gemini 2.0/2.5 Flash → Groq (Llama 3.3 70B) → Cerebras (Llama 3.1 8B → GPT-OSS 120B)
- Tested Gemini models for natural language queries
- Tried Groq's fast inference with larger Llama models
- Settled on Cerebras for reliable free tier and good performance
- Moved to GPT-OSS 120B after Cerebras archived the Llama 3.1 models
RAG Embeddings
Voyage AI (embeddings) + MotherDuck (vector search)
- Voyage AI creates 1024-dim embeddings for semantic search
- MotherDuck's native array_cosine_similarity() replaces Pinecone
- Dual-mode AI: SQL for precise queries, RAG for semantic exploration
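To make the retrieval step concrete, the sketch below computes cosine similarity in plain Python and ranks stored vectors against a query vector. In the project this comparison runs inside the warehouse via `array_cosine_similarity()`; the Python functions here (`cosine_similarity`, `top_k`) are hypothetical stand-ins that show the same math, and the toy 2-dimensional vectors stand in for the 1024-dimensional embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Choice made for this sketch: treat a zero vector as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    documents: dict[str, Sequence[float]],
    k: int = 3,
) -> list[tuple[str, float]]:
    """Return the k document ids most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy example with 2-dimensional vectors.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best = top_k([1.0, 0.0], docs, k=2)
```

The top-ranked rows would then be passed to the LLM as context, which is what separates the RAG mode from the SQL mode.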
Key Learning: The best tool isn't always the most expensive; it's the one that solves your problem within constraints.
Project Structure
gdelt_project/
├── app.py                        # Streamlit dashboard entry point
├── src/                          # Core modules
│   ├── config.py                 # Configuration constants
│   ├── database.py               # Database connection
│   ├── queries.py                # SQL query functions
│   ├── ai_engine.py              # LLM/AI setup (Cerebras + LlamaIndex)
│   ├── rag_engine.py             # RAG engine (Voyage AI + vector search)
│   ├── data_processing.py        # Headline extraction
│   ├── utils.py                  # Utility functions
│   └── styles.py                 # CSS styling
├── etl/                          # Data pipeline
│   ├── pipeline_polars.py        # Polars ingestion + custom validation
│   └── embedding_job.py          # 12-hour embedding generation
├── dbt/                          # dbt transformation layer
│   ├── dbt_project.yml           # dbt configuration
│   ├── profiles.yml              # MotherDuck connection
│   └── models/
│       ├── staging/              # stg_events (cleaned data)
│       └── marts/                # fct_daily_events, dim_actors, dim_countries
├── components/                   # UI components
│   ├── render.py                 # Dashboard rendering
│   ├── ai_chat.py                # AI chat interface
│   ├── emotions.py               # GKG emotions tab
│   └── about.py                  # About page
├── requirements.txt              # Python dependencies
├── .env                          # Environment variables (not in repo)
└── .github/workflows/
    ├── gdelt_ingest.yml          # 15-min Polars ingestion
    ├── gdelt_embeddings_12hr.yml # 12-hour embedding job
    └── health_monitor.yml        # uptime checks + ntfy alerts
Future Enhancements
- [x] Add dbt transformations for advanced modeling (done)
- [x] Upgrade to Polars for faster processing (done)
- [x] Add data quality validation (done)
- [ ] Implement event clustering with ML
- [ ] Add email/Slack alerts for crisis events
- [ ] Expand AI chat with multi-turn conversations
- [ ] Add export functionality (CSV, PDF reports)
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Contact
Mohith Akash
License
This project is licensed under the MIT License - see the LICENSE file for details.
Built with curiosity • Data sourced from GDELT Project