About Global-News-Intel-Platform

AI-powered geopolitical news intelligence platform. Ingests 100K+ daily events from GDELT, stores in MotherDuck (DuckDB), orchestrates with Dagster, and features an AI chat interface with Text-to-SQL. Full data engineering stack at $0/month.


Published by

mohith-akash



🌐 Global News Intelligence Platform

Global news analytics with GDELT + AI + modern data stack

Live Demo • Features • Architecture • Tech Stack • Quick Start • Cost Efficiency

🎯 Overview

A full-stack data engineering project that ingests, processes, and visualizes 100,000+ daily global news events from the GDELT Project. Includes AI chat for natural language queries and a live analytics dashboard.

📊 By the Numbers

Metric	Value
Cumulative Events	20M+ processed
Daily Ingestion	100K+ events/day
Data History	8+ months live data
Languages	100+ monitored
Countries	200+ covered
Query Speed	<1 second
Monthly Cost	$0

What is GDELT?

The GDELT Project monitors the world's news media from nearly every country in 100+ languages, identifying people, locations, themes, and emotions driving global society.

📸 Dashboard Preview

Home - KPIs & Trending News (screenshot: Dashboard Home)

Emotions - GKG Mood Analysis (NEW!) (screenshot: Emotions Tab)

Analytics - Actors & Countries (screenshot: Dashboard Charts)

AI Chat - Natural Language Queries (screenshot: AI Chat)

RAG Chat - AI Analysis of World Events (screenshot: RAG Chat)

Feed - Event Stream (screenshot: Feed Tab)

✨ Features

Feature	Description
📊 Real-Time Dashboard	Live metrics, trending news, sentiment analysis, geographic distribution
🧠 Emotion Analytics	GKG-powered emotion tracking: Fear, Joy, Positive/Negative, Global Mood Index
🤖 AI Chat Interface	Ask questions in plain English → Get SQL-powered answers
⚡ 15-Min Updates	cron-job.org (15-min external trigger) → GitHub Actions workflow_dispatch + Dagster job runner
🔍 Data Quality Gates	Custom schema and threshold checks stop bad data before it is loaded
🌍 Global Coverage	Events from 200+ countries with country code mapping
📈 Trend Analysis	30-day time series, intensity tracking, actor monitoring
🔥 Trending Topics	AI-extracted themes from global news (GKG)
🎨 Dark Mode UI	Custom dark theme, responsive Plotly charts
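
The 15-minute trigger in the table above relies on an external scheduler calling GitHub's `workflow_dispatch` endpoint. The following is a minimal sketch of what such a call could look like, not the project's actual trigger configuration. The owner, repository, and workflow file name come from this README; the `main` branch, the token placeholder, and the helper name `build_dispatch_request` are assumptions made for illustration.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner, repo, workflow_file, token, ref="main"):
    """Build (but do not send) a workflow_dispatch request for GitHub Actions.

    `ref` is the branch or tag the workflow should run on; "main" is assumed.
    """
    url = (
        f"{GITHUB_API}/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


request = build_dispatch_request(
    owner="Mohith-akash",
    repo="Global-News-Intel-Platform",
    workflow_file="gdelt_ingest.yml",
    token="YOUR_TOKEN",  # placeholder, never commit a real token
)

# Sending is left commented out so the sketch has no side effects:
# urllib.request.urlopen(request)
```

A scheduler such as cron-job.org would issue the equivalent HTTP request on its own timer; building the request in a function keeps the URL and payload easy to inspect before anything is sent.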

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         PRODUCTION PIPELINE ARCHITECTURE                 │
└─────────────────────────────────────────────────────────────────────────┘

              ┌──────────────┐          ┌──────────────┐
              │ GDELT Events │          │  GDELT GKG   │
              └──────┬───────┘          └──────┬───────┘
                     │                         │
                     └────────────┬────────────┘
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  INGESTION (Every 15 min via cron-job.org → workflow_dispatch)          │
│  GitHub Actions → Dagster → Polars (10x faster) → custom validation      │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  TRANSFORMATION                                                          │
│  dbt Core: staging (stg_events) → marts (fct_daily, dim_actors, etc.)   │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  STORAGE & AI                                                            │
│  MotherDuck (DWH) ← Voyage AI (Embeddings) → Cerebras LLM (RAG/SQL)     │
│  └── gkg_emotions: Fear, Joy, Tone, Topics                              │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  PRESENTATION                                                            │
│  Streamlit: HOME | FEED | EMOTIONS | AI Chat | ABOUT                    │
└─────────────────────────────────────────────────────────────────────────┘

Data Flow (ELT Pipeline)

Extract: GDELT Events API + GKG Feed → Polars (10x faster than Pandas)
Validate: Custom schema + threshold data quality checks
Load: Deduplicated data into MotherDuck (serverless DuckDB)
Transform: dbt models create staging views and mart tables
Emotions: GKG data → Extract tone, fear, joy, topics (rolling 24h)
Embed: Voyage AI generates vectors every 12 hours
Serve: Streamlit dashboard with AI chat (SQL + RAG modes)
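
The Validate and Load steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's `DataQualityValidator`: the column names `GLOBALEVENTID`, `SQLDATE`, and `AvgTone` are GDELT event fields, while the specific rules, the 5% rejection threshold, and the function names are hypothetical.

```python
# Assumed schema: field name -> expected Python type.
REQUIRED_FIELDS = {"GLOBALEVENTID": int, "SQLDATE": int, "AvgTone": float}


def is_valid(row):
    """Schema check: required fields present with the expected type,
    and AvgTone inside GDELT's documented -100..100 range."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(row.get(field), expected_type):
            return False
    return -100.0 <= row["AvgTone"] <= 100.0


def validate_and_dedupe(rows, seen_ids=frozenset(), max_reject_ratio=0.05):
    """Return rows that pass validation and are not already loaded.

    Raises ValueError when too many rows fail, so a bad batch is never loaded.
    The 5% default threshold is an assumption for this sketch.
    """
    valid = [row for row in rows if is_valid(row)]
    rejected = len(rows) - len(valid)
    if rows and rejected / len(rows) > max_reject_ratio:
        raise ValueError(f"{rejected}/{len(rows)} rows failed validation")

    fresh, seen = [], set(seen_ids)
    for row in valid:
        event_id = row["GLOBALEVENTID"]
        if event_id not in seen:  # skip duplicates within and across batches
            seen.add(event_id)
            fresh.append(row)
    return fresh


batch = [
    {"GLOBALEVENTID": 1, "SQLDATE": 20240101, "AvgTone": -2.5},
    {"GLOBALEVENTID": 1, "SQLDATE": 20240101, "AvgTone": -2.5},  # duplicate
    {"GLOBALEVENTID": 2, "SQLDATE": 20240101, "AvgTone": 1.0},
]
loaded_ids = [row["GLOBALEVENTID"] for row in validate_and_dedupe(batch)]
# loaded_ids is [1, 2]
```

Failing the whole batch when the rejection ratio is too high is one possible design; it trades a skipped 15-minute window for the guarantee that a malformed upstream file never reaches the warehouse.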

🛠️ Tech Stack

Data Engineering

Tool	Purpose	Replaces
Polars	High-performance DataFrame processing (10x faster)	Pandas
dbt Core	SQL transformations with staging/marts pattern	Raw SQL
DataQualityValidator	Custom schema + threshold validation & testing	Manual checks
Dagster	Pipeline orchestration with asset-based design	Apache Airflow
DuckDB/MotherDuck	Serverless cloud OLAP warehouse	Snowflake/Redshift
GitHub Actions	CI/CD with workflow_dispatch (15-min via cron-job.org) + 12-hr scheduled jobs	AWS Lambda

AI/ML

Tool	Purpose	Replaces
Cerebras	LLM inference (GPT-OSS 120B)	OpenAI GPT-4
LlamaIndex	Text-to-SQL query engine	Custom NLP
Voyage AI	Vector embeddings for RAG	OpenAI Embeddings
MotherDuck Vectors	Native vector similarity search	Pinecone / Weaviate

Frontend

Tool	Purpose	Replaces
Streamlit	Interactive dashboard framework	Tableau / Power BI
Plotly	Dynamic charts and visualizations	D3.js / Chart.js

Skills Demonstrated

Python (Polars, Pandas, RegEx, API integration)
SQL (Complex queries, window functions, dbt models)
Data Quality (custom validation, schema testing)
ELT Pipelines (Extract, Load, Transform with dbt)
CI/CD (GitHub Actions, cron scheduling)
Vector Search (Embeddings, cosine similarity, RAG)

🚀 Quick Start

Prerequisites

Python 3.10+
MotherDuck Account (free tier)
Cerebras API Key (free tier)

Installation

# Clone the repository
git clone https://github.com/Mohith-akash/Global-News-Intel-Platform.git
cd Global-News-Intel-Platform

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
.\venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

Configuration

Create a .env file in the project root:

MOTHERDUCK_TOKEN=your_motherduck_token
CEREBRAS_API_KEY=your_cerebras_api_key
VOYAGE_API_KEY=your_voyage_api_key  # Optional: enables RAG mode
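
As a rough sketch of how these variables might be consumed, the snippet below treats the first two as required and uses the optional Voyage key to decide whether RAG mode is available. It assumes the `.env` values have already been exported into the process environment (for example by a loader such as python-dotenv, which this README does not confirm); `load_settings` and the returned keys are hypothetical names, not the project's `config.py`.

```python
import os

REQUIRED = ("MOTHERDUCK_TOKEN", "CEREBRAS_API_KEY")


def load_settings(env=None):
    """Read settings from an environment mapping (defaults to os.environ).

    Raises RuntimeError if a required variable is missing or empty.
    RAG mode is enabled only when VOYAGE_API_KEY is set.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))

    voyage_key = env.get("VOYAGE_API_KEY") or None
    return {
        "motherduck_token": env["MOTHERDUCK_TOKEN"],
        "cerebras_api_key": env["CEREBRAS_API_KEY"],
        "voyage_api_key": voyage_key,
        "rag_enabled": voyage_key is not None,
    }


# Example with an explicit mapping instead of the real environment:
settings = load_settings(
    {"MOTHERDUCK_TOKEN": "token", "CEREBRAS_API_KEY": "key"}
)
# settings["rag_enabled"] is False because VOYAGE_API_KEY is absent
```

Passing the mapping in as an argument keeps the function easy to test without touching real credentials.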

Run the Dashboard

streamlit run app.py

Run the Pipeline Manually

# Polars-powered ingestion (15-min schedule)
python -m dagster job execute -f etl/pipeline_polars.py -j gdelt_ingestion_job

# Embedding generation (12-hour schedule)
python -m dagster job execute -f etl/embedding_job.py -j gdelt_embedding_job

# Run dbt models
cd dbt && dbt run

💰 Enterprise Tools vs My Stack

This project demonstrates how to achieve enterprise-grade capabilities at zero cost:

Enterprise Tool	Monthly Cost	My Alternative	My Cost
Databricks/Spark	~$500	DuckDB	$0
Snowflake/BigQuery	~$300	MotherDuck	$0
Managed Airflow	~$300	Dagster + GitHub Actions	$0
dbt Cloud	~$100	dbt Core (self-hosted)	$0
Pinecone/Weaviate	~$70	MotherDuck Vectors	$0
OpenAI Embeddings	~$50	Voyage AI	$0
OpenAI GPT-4	~$100	Cerebras	$0
Tableau/Power BI	~$70	Streamlit	$0
TOTAL	$1,490+		$0

Key Insight: MotherDuck's native vector search eliminates the need for a separate vector database like Pinecone.

🔄 Technology Evolution

This project evolved through multiple iterations to optimize for cost and performance:

Data Warehouse

❄️ Snowflake (trial) → 🦆 MotherDuck (free tier)

Started with Snowflake trial for learning enterprise DWH
Migrated to MotherDuck to eliminate costs while keeping SQL compatibility

AI/LLM Provider

✨ Gemini 2.0/2.5 Flash → ⚡ Groq (Llama 3.3 70B) → 🧠 Cerebras (Llama 3.1 8B → GPT-OSS 120B)

Tested Gemini models for natural language queries
Tried Groq's fast inference with larger Llama models
Settled on Cerebras for reliable free tier and good performance
Moved to GPT-OSS 120B after Cerebras archived the Llama 3.1 models

RAG Embeddings

🚀 Voyage AI (embeddings) + 🦆 MotherDuck (vector search)

Voyage AI creates 1024-dim embeddings for semantic search
MotherDuck's native array_cosine_similarity() replaces Pinecone
Dual-mode AI: SQL for precise queries, RAG for semantic exploration
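
To make the vector search step concrete, here is a small pure-Python sketch of the quantity a cosine similarity function computes, plus a top-k ranking over stored vectors. It is an illustration only: the real embeddings are 1024-dimensional and the ranking runs inside MotherDuck, whereas the toy 3-dimensional vectors, the document ids, and the helper names below are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(query, documents, k=3):
    """Return the ids of the k documents most similar to the query vector."""
    scored = [
        (cosine_similarity(query, vector), doc_id)
        for doc_id, vector in documents.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy 3-dimensional "embeddings" standing in for 1024-dim vectors.
documents = {
    "flood": [1.0, 0.0, 0.0],
    "election": [0.0, 1.0, 0.0],
    "storm": [0.9, 0.1, 0.0],
}
nearest = top_k([1.0, 0.0, 0.0], documents, k=2)
# nearest is ["flood", "storm"]
```

In the warehouse the same ranking would be expressed in SQL by ordering rows on `array_cosine_similarity()` between the stored embedding column and the query embedding, which avoids pulling vectors out of the database.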

Key Learning: The best tool isn't always the most expensive; it's the one that solves your problem within your constraints.

📁 Project Structure

gdelt_project/
├── app.py                    # Streamlit dashboard entry point
├── src/                      # Core modules
│   ├── config.py             # Configuration constants
│   ├── database.py           # Database connection
│   ├── queries.py            # SQL query functions
│   ├── ai_engine.py          # LLM/AI setup (Cerebras + LlamaIndex)
│   ├── rag_engine.py         # RAG engine (Voyage AI + vector search)
│   ├── data_processing.py    # Headline extraction
│   ├── utils.py              # Utility functions
│   └── styles.py             # CSS styling
├── etl/                      # Data pipeline
│   ├── pipeline_polars.py    # 🆕 Polars ingestion + custom validation
│   └── embedding_job.py      # 🆕 12-hour embedding generation
├── dbt/                      # 🆕 dbt transformation layer
│   ├── dbt_project.yml       # dbt configuration
│   ├── profiles.yml          # MotherDuck connection
│   └── models/
│       ├── staging/          # stg_events (cleaned data)
│       └── marts/            # fct_daily_events, dim_actors, dim_countries
├── components/               # UI components
│   ├── render.py             # Dashboard rendering
│   ├── ai_chat.py            # AI chat interface
│   ├── emotions.py           # GKG emotions tab
│   └── about.py              # About page
├── requirements.txt          # Python dependencies
├── .env                      # Environment variables (not in repo)
└── .github/workflows/
    ├── gdelt_ingest.yml          # 15-min Polars ingestion
    ├── gdelt_embeddings_12hr.yml # 12-hour embedding job
    └── health_monitor.yml        # uptime checks + ntfy alerts

🔮 Future Enhancements

[x] Add dbt transformations for advanced modeling ✅ Done!
[x] Upgrade to Polars for faster processing ✅ Done!
[x] Add data quality validation ✅ Done!
[ ] Implement event clustering with ML
[ ] Add email/Slack alerts for crisis events
[ ] Expand AI chat with multi-turn conversations
[ ] Add export functionality (CSV, PDF reports)

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📬 Contact

Mohith Akash

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Built with ☕ and curiosity • Data sourced from GDELT Project
