Global-News-Intel-Platform

About Global-News-Intel-Platform

AI-powered geopolitical news intelligence platform. Ingests 100K+ daily events from GDELT, stores in MotherDuck (DuckDB), orchestrates with Dagster, and features an AI chat interface with Text-to-SQL. Full data engineering stack at $0/month.

Platforms

Web Self-hosted

Languages

Python

🌍 Global News Intelligence Platform

Global news analytics with GDELT + AI + modern data stack

Live Demo • Features • Architecture • Tech Stack • Quick Start • Cost Efficiency


🎯 Overview

A full-stack data engineering project that ingests, processes, and visualizes 100,000+ daily global news events from the GDELT Project. Includes AI chat for natural language queries and a live analytics dashboard.

📊 By the Numbers

| Metric | Value |
| --- | --- |
| Cumulative Events | 20M+ processed |
| Daily Ingestion | 100K+ events/day |
| Data History | 8+ months live data |
| Languages | 100+ monitored |
| Countries | 200+ covered |
| Query Speed | <1 second |
| Monthly Cost | $0 |

What is GDELT?

The GDELT Project monitors the world's news media from nearly every country in 100+ languages, identifying people, locations, themes, and emotions driving global society.
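
To make the 15-minute feed concrete: GDELT 2.0 publishes a small `lastupdate.txt` manifest that lists the newest Events, Mentions, and GKG files, one per line as size, MD5 hash, and URL. The sketch below parses that format. It is only an illustration, not this project's ingestion code; the `parse_lastupdate` helper, the `FeedFile` type, and the sample sizes and hashes are hypothetical.

```python
from typing import NamedTuple


class FeedFile(NamedTuple):
    size_bytes: int
    md5: str
    url: str


def parse_lastupdate(text: str) -> dict[str, FeedFile]:
    """Map a feed kind ('export', 'mentions', 'gkg') to its latest file entry."""
    files: dict[str, FeedFile] = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        size, md5, url = parts
        name_parts = url.rsplit("/", 1)[-1].split(".")
        if len(name_parts) < 2:
            continue  # file name does not look like <timestamp>.<kind>.csv.zip
        files[name_parts[1].lower()] = FeedFile(int(size), md5, url)
    return files


# Illustrative manifest text (sizes and hashes are made up for the example).
SAMPLE = """\
150383 297a16b493de7cf6ca809a7cc31d0b93 http://data.gdeltproject.org/gdeltv2/20150218230000.export.CSV.zip
318084 bb27f78ba45f69a17ea6ed7755e9f8ff http://data.gdeltproject.org/gdeltv2/20150218230000.mentions.CSV.zip
10768507 ea8dde0beb0ba98810a92db068c0ce99 http://data.gdeltproject.org/gdeltv2/20150218230000.gkg.csv.zip
"""

latest = parse_lastupdate(SAMPLE)
print(sorted(latest))
```

The `export` entry corresponds to the Events table and `gkg` to the Global Knowledge Graph, the two sources this project ingests.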


📸 Dashboard Preview

  • Home: KPIs & Trending News [screenshot]
  • Emotions: GKG Mood Analysis (NEW!) [screenshot]
  • Analytics: Actors & Countries [screenshot]
  • AI Chat: Natural Language Queries [screenshot]
  • RAG Chat: AI Analysis of World Events [screenshot]
  • Feed: Event Stream [screenshot]


✨ Features

| Feature | Description |
| --- | --- |
| 📊 Real-Time Dashboard | Live metrics, trending news, sentiment analysis, geographic distribution |
| 🧠 Emotion Analytics | GKG-powered emotion tracking: Fear, Joy, Positive/Negative, Global Mood Index |
| 🤖 AI Chat Interface | Ask questions in plain English → get SQL-powered answers |
| ⚡ 15-Min Updates | cron-job.org (15-min external trigger) → GitHub Actions workflow_dispatch + Dagster job runner |
| 🔍 Data Quality Gates | Custom data validation blocks bad data before it loads |
| 🌍 Global Coverage | Events from 200+ countries with country code mapping |
| 📈 Trend Analysis | 30-day time series, intensity tracking, actor monitoring |
| 🔥 Trending Topics | AI-extracted themes from global news (GKG) |
| 🎨 Dark Mode UI | Custom dark theme, responsive Plotly charts |

๐Ÿ—๏ธ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                         PRODUCTION PIPELINE ARCHITECTURE                 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”          โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
              โ”‚ GDELT Events โ”‚          โ”‚  GDELT GKG   โ”‚
              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜          โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                     โ”‚                         โ”‚
                     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  INGESTION (Every 15 min via cron-job.org โ†’ workflow_dispatch)                                                โ”‚
โ”‚  GitHub Actions โ†’ Dagster โ†’ Polars (10x faster) โ†’ custom validation      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                  โ”‚
                                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  TRANSFORMATION                                                          โ”‚
โ”‚  dbt Core: staging (stg_events) โ†’ marts (fct_daily, dim_actors, etc.)   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                  โ”‚
                                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  STORAGE & AI                                                            โ”‚
โ”‚  MotherDuck (DWH) โ† Voyage AI (Embeddings) โ†’ Cerebras LLM (RAG/SQL)     โ”‚
โ”‚  โ””โ”€โ”€ gkg_emotions: Fear, Joy, Tone, Topics                              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                  โ”‚
                                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  PRESENTATION                                                            โ”‚
โ”‚  Streamlit: HOME | FEED | EMOTIONS | AI Chat | ABOUT                    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
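
The gkg_emotions store in the diagram holds a rolling window of GKG tone data (24 hours, per the data flow). A minimal sketch of such a window in plain Python follows. The record shape (`seen_at`, `tone`) and the `rolling_mood` helper are assumptions for illustration; the project's actual table layout and mood formula are not shown in this README.

```python
from datetime import datetime, timedelta, timezone


def rolling_mood(records, now, window_hours=24):
    """Average the tone of records seen within the last `window_hours` hours."""
    cutoff = now - timedelta(hours=window_hours)
    recent = [r for r in records if cutoff < r["seen_at"] <= now]
    if not recent:
        return None  # nothing in the window, so no mood value
    return sum(r["tone"] for r in recent) / len(recent)


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
records = [
    {"seen_at": now - timedelta(hours=1), "tone": -2.0},
    {"seen_at": now - timedelta(hours=23), "tone": 4.0},
    {"seen_at": now - timedelta(hours=30), "tone": 9.0},  # outside the window
]
print(rolling_mood(records, now))  # 1.0
```

Only the two records inside the window contribute, so the 30-hour-old record does not skew the result.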

Data Flow (ELT Pipeline)

  1. Extract: GDELT Events API + GKG Feed → Polars (10x faster than Pandas)
  2. Validate: Custom schema + threshold data quality checks
  3. Load: Deduplicated data into MotherDuck (serverless DuckDB)
  4. Transform: dbt models create staging views and mart tables
  5. Emotions: GKG data → Extract tone, fear, joy, topics (rolling 24h)
  6. Embed: Voyage AI generates vectors every 12 hours
  7. Serve: Streamlit dashboard with AI chat (SQL + RAG modes)
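
Steps 2 and 3 (validate, then load deduplicated rows) can be sketched as below. This is plain Python rather than the Polars code the project uses, and it is not the project's DataQualityValidator. The column names come from the GDELT event schema, where GoldsteinScale ranges from -10 to +10; the `validate_batch` helper and both thresholds are assumptions.

```python
# Required columns and their expected Python types.
REQUIRED = {"GLOBALEVENTID": int, "SQLDATE": int, "GoldsteinScale": float}


def validate_batch(rows, min_rows=1, max_bad_ratio=0.05):
    """Return (clean_rows, report); raise ValueError if a threshold is breached."""
    total = len(rows)
    if total < min_rows:
        raise ValueError(f"batch too small: {total} rows")

    clean, seen_ids, bad = [], set(), 0
    for row in rows:
        # Schema check: every required column present with the right type.
        ok = all(isinstance(row.get(col), typ) for col, typ in REQUIRED.items())
        # Range check: GoldsteinScale must lie in [-10, 10].
        ok = ok and -10.0 <= row["GoldsteinScale"] <= 10.0
        if not ok:
            bad += 1
            continue
        if row["GLOBALEVENTID"] in seen_ids:
            continue  # duplicate event, keep the first occurrence only
        seen_ids.add(row["GLOBALEVENTID"])
        clean.append(row)

    # Threshold check: too many bad rows fails the whole batch.
    if total and bad / total > max_bad_ratio:
        raise ValueError(f"{bad}/{total} rows failed validation")
    report = {"total": total, "bad": bad, "duplicates": total - bad - len(clean)}
    return clean, report


rows = [
    {"GLOBALEVENTID": 1, "SQLDATE": 20240102, "GoldsteinScale": 3.4},
    {"GLOBALEVENTID": 1, "SQLDATE": 20240102, "GoldsteinScale": 3.4},   # duplicate
    {"GLOBALEVENTID": 2, "SQLDATE": 20240102, "GoldsteinScale": -5.0},
    {"GLOBALEVENTID": 3, "SQLDATE": 20240102, "GoldsteinScale": 42.0},  # out of range
]
clean, report = validate_batch(rows, max_bad_ratio=0.5)
print(len(clean), report)
```

Failing the whole batch when the bad-row ratio is too high keeps one corrupt download from reaching the warehouse, which is the point of a quality gate.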

🛠️ Tech Stack

Data Engineering

| Tool | Purpose | Replaces |
| --- | --- | --- |
| Polars | High-performance DataFrame processing (10x faster) | Pandas |
| dbt Core | SQL transformations with staging/marts pattern | Raw SQL |
| DataQualityValidator | Custom schema + threshold validation & testing | Manual checks |
| Dagster | Pipeline orchestration with asset-based design | Apache Airflow |
| DuckDB/MotherDuck | Serverless cloud OLAP warehouse | Snowflake/Redshift |
| GitHub Actions | CI/CD with workflow_dispatch (15-min via cron-job.org) + 12-hr scheduled jobs | AWS Lambda |
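
The external 15-minute trigger works by calling GitHub's REST endpoint for workflow_dispatch events. A rough sketch of that request, built with the standard library, is below. The workflow file name is taken from the project structure; the `build_dispatch_request` helper, the `main` branch name, and the token placeholder are assumptions, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, workflow_file, token, ref="main"):
    """Build the POST request that fires a workflow_dispatch event."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps({"ref": ref}).encode("utf-8")  # branch or tag to run on
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_dispatch_request(
    "Mohith-akash", "Global-News-Intel-Platform", "gdelt_ingest.yml", token="<token>"
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually send it (requires a valid token).
```

An external scheduler is presumably used because it can hit this endpoint on a fixed cadence, while the 12-hour embedding job stays on GitHub's own schedule.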

AI/ML

| Tool | Purpose | Replaces |
| --- | --- | --- |
| Cerebras | LLM inference (GPT-OSS 120B) | OpenAI GPT-4 |
| LlamaIndex | Text-to-SQL query engine | Custom NLP |
| Voyage AI | Vector embeddings for RAG | OpenAI Embeddings |
| MotherDuck Vectors | Native vector similarity search | Pinecone / Weaviate |
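
With Text-to-SQL, the model writes the query, so it can be worth checking that a generated statement is read-only before running it. The README does not say whether the project does this; the sketch below is a deliberately naive, hypothetical guard (`is_read_only` is not part of LlamaIndex or this repo) that accepts a single SELECT or WITH statement and rejects anything containing a write keyword.

```python
import re

# Keywords that indicate a write or schema change. Matching is conservative:
# a harmless query that merely mentions one of these words is rejected too.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|copy|pragma)\b",
    re.IGNORECASE,
)


def is_read_only(sql: str) -> bool:
    """Return True only for a single SELECT/WITH statement with no write keywords."""
    stmt = sql.strip().rstrip(";").strip()
    if not stmt or ";" in stmt:
        return False  # empty input or more than one statement
    if not re.match(r"(select|with)\b", stmt, re.IGNORECASE):
        return False
    return FORBIDDEN.search(stmt) is None


print(is_read_only("SELECT country, COUNT(*) FROM events GROUP BY country;"))  # True
print(is_read_only("SELECT 1; DROP TABLE events"))  # False
```

A keyword filter like this is a first line of defense only; connecting with a read-only credential is the sturdier safeguard.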

Frontend

| Tool | Purpose | Replaces |
| --- | --- | --- |
| Streamlit | Interactive dashboard framework | Tableau / Power BI |
| Plotly | Dynamic charts and visualizations | D3.js / Chart.js |

Skills Demonstrated

  • Python (Polars, Pandas, RegEx, API integration)
  • SQL (Complex queries, window functions, dbt models)
  • Data Quality (custom data quality validation, schema testing)
  • ELT Pipelines (Extract, Load, Transform with dbt)
  • CI/CD (GitHub Actions, cron scheduling)
  • Vector Search (Embeddings, cosine similarity, RAG)

🚀 Quick Start

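Prerequisites

  • Git and Python 3 with pip
  • A MotherDuck token
  • A Cerebras API key
  • A Voyage AI API key (optional, enables RAG mode)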

Installation

# Clone the repository
git clone https://github.com/Mohith-akash/Global-News-Intel-Platform.git
cd Global-News-Intel-Platform

# Create virtual environment
python -m venv venv
source venv/bin/activate      # Linux/Mac
# On Windows, run this instead:
# .\venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Configuration

Create a .env file in the project root:

MOTHERDUCK_TOKEN=your_motherduck_token
CEREBRAS_API_KEY=your_cerebras_api_key
VOYAGE_API_KEY=your_voyage_api_key  # Optional: enables RAG mode
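
For illustration, here is one way an app could read these settings at startup and decide whether RAG mode is available. It is a sketch, not the project's `src/config.py`: the `load_settings` helper and the `rag_enabled` key are hypothetical, and it assumes the variables are already in the process environment (for example, exported by the shell or loaded from `.env` by a helper library).

```python
import os

REQUIRED_KEYS = ("MOTHERDUCK_TOKEN", "CEREBRAS_API_KEY")
OPTIONAL_KEY = "VOYAGE_API_KEY"  # only needed for RAG mode


def load_settings(env=None):
    """Collect required settings and flag whether RAG mode can be enabled."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    settings = {key: env[key] for key in REQUIRED_KEYS}
    settings["rag_enabled"] = bool(env.get(OPTIONAL_KEY))
    return settings


demo = load_settings({"MOTHERDUCK_TOKEN": "md_xxx", "CEREBRAS_API_KEY": "csk_xxx"})
print(demo["rag_enabled"])  # False
```

Failing fast on a missing required key gives a clearer error than a connection failure later in the dashboard.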

Run the Dashboard

streamlit run app.py

Run the Pipeline Manually

# Polars-powered ingestion (15-min schedule)
python -m dagster job execute -f etl/pipeline_polars.py -j gdelt_ingestion_job

# Embedding generation (12-hour schedule)
python -m dagster job execute -f etl/embedding_job.py -j gdelt_embedding_job

# Run dbt models
cd dbt && dbt run

💰 Enterprise Tools vs My Stack

This project demonstrates how to achieve enterprise-grade capabilities at zero cost:

| Enterprise Tool | Monthly Cost | My Alternative | My Cost |
| --- | --- | --- | --- |
| Databricks/Spark | ~$500 | DuckDB | $0 |
| Snowflake/BigQuery | ~$300 | MotherDuck | $0 |
| Managed Airflow | ~$300 | Dagster + GitHub Actions | $0 |
| dbt Cloud | ~$100 | dbt Core (self-hosted) | $0 |
| Pinecone/Weaviate | ~$70 | MotherDuck Vectors | $0 |
| OpenAI Embeddings | ~$50 | Voyage AI | $0 |
| OpenAI GPT-4 | ~$100 | Cerebras | $0 |
| Tableau/Power BI | ~$70 | Streamlit | $0 |
| TOTAL | $1,490+ | | $0 |
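
As a quick check, the total follows directly from the approximate monthly figures listed above (the dictionary keys are just labels for this example):

```python
# Approximate monthly cost in USD for each enterprise tool in the comparison.
monthly_cost_usd = {
    "Databricks/Spark": 500,
    "Snowflake/BigQuery": 300,
    "Managed Airflow": 300,
    "dbt Cloud": 100,
    "Pinecone/Weaviate": 70,
    "OpenAI Embeddings": 50,
    "OpenAI GPT-4": 100,
    "Tableau/Power BI": 70,
}

total = sum(monthly_cost_usd.values())
print(f"${total:,}")  # $1,490
```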

Key Insight: MotherDuck's native vector search eliminates the need for a separate vector database like Pinecone.


🔄 Technology Evolution

This project evolved through multiple iterations to optimize for cost and performance:

Data Warehouse

โ„๏ธ Snowflake (trial) โ†’ ๐Ÿฆ† MotherDuck (free tier)
  • Started with Snowflake trial for learning enterprise DWH
  • Migrated to MotherDuck to eliminate costs while keeping SQL compatibility

AI/LLM Provider

✨ Gemini 2.0/2.5 Flash → ⚡ Groq (Llama 3.3 70B) → 🧠 Cerebras (Llama 3.1 8B → GPT-OSS 120B)
  • Tested Gemini models for natural language queries
  • Tried Groq's fast inference with larger Llama models
  • Settled on Cerebras for reliable free tier and good performance
  • Moved to GPT-OSS 120B after Cerebras archived the Llama 3.1 models

RAG Embeddings

🚀 Voyage AI (embeddings) + 🦆 MotherDuck (vector search)
  • Voyage AI creates 1024-dim embeddings for semantic search
  • MotherDuck's native array_cosine_similarity() replaces Pinecone
  • Dual-mode AI: SQL for precise queries, RAG for semantic exploration
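
To show what the similarity step computes, here is a small pure-Python version of cosine-similarity retrieval. In the project this ranking happens inside MotherDuck via array_cosine_similarity() over 1024-dimensional vectors; the 3-dimensional vectors and the `top_k` helper below are toy assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, docs, k=2):
    """Return the ids of the k documents most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


docs = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.7, 0.7, 0.0],
    "c": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.1, 0.0], docs))  # ['a', 'b']
```

In RAG mode the top-ranked rows would be passed to the LLM as context, while SQL mode skips retrieval and queries the tables directly.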

Key Learning: The best tool isn't always the most expensive; it's the one that solves your problem within constraints.


📁 Project Structure

gdelt_project/
├── app.py                    # Streamlit dashboard entry point
├── src/                      # Core modules
│   ├── config.py             # Configuration constants
│   ├── database.py           # Database connection
│   ├── queries.py            # SQL query functions
│   ├── ai_engine.py          # LLM/AI setup (Cerebras + LlamaIndex)
│   ├── rag_engine.py         # RAG engine (Voyage AI + vector search)
│   ├── data_processing.py    # Headline extraction
│   ├── utils.py              # Utility functions
│   └── styles.py             # CSS styling
├── etl/                      # Data pipeline
│   ├── pipeline_polars.py    # 🆕 Polars ingestion + custom validation
│   └── embedding_job.py      # 🆕 12-hour embedding generation
├── dbt/                      # 🆕 dbt transformation layer
│   ├── dbt_project.yml       # dbt configuration
│   ├── profiles.yml          # MotherDuck connection
│   └── models/
│       ├── staging/          # stg_events (cleaned data)
│       └── marts/            # fct_daily_events, dim_actors, dim_countries
├── components/               # UI components
│   ├── render.py             # Dashboard rendering
│   ├── ai_chat.py            # AI chat interface
│   ├── emotions.py           # GKG emotions tab
│   └── about.py              # About page
├── requirements.txt          # Python dependencies
├── .env                      # Environment variables (not in repo)
└── .github/workflows/
    ├── gdelt_ingest.yml          # 15-min Polars ingestion
    ├── gdelt_embeddings_12hr.yml # 12-hour embedding job
    └── health_monitor.yml        # uptime checks + ntfy alerts

🔮 Future Enhancements

  • [x] Add dbt transformations for advanced modeling ✅ Done!
  • [x] Upgrade to Polars for faster processing ✅ Done!
  • [x] Add data quality validation ✅ Done!
  • [ ] Implement event clustering with ML
  • [ ] Add email/Slack alerts for crisis events
  • [ ] Expand AI chat with multi-turn conversations
  • [ ] Add export functionality (CSV, PDF reports)

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


📬 Contact

Mohith Akash

GitHub LinkedIn


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Built with ☕ and curiosity • Data sourced from GDELT Project