go-attention

From the Frontier Research Team at takara.ai we present the first pure Go implementation of attention mechanisms and transformer layers, designed for high performance, consistency, and production reliability.

Why Go for Attention Mechanisms?

Performance Without Compromise

This implementation proves that Go can deliver production-grade performance for AI workloads:

Consistent, Predictable Performance: Single optimized code path ensures repeatable results across all input sizes
Edge-Optimized: Zero external dependencies and a minimal memory footprint make it a good fit for edge devices
Production-Ready: Comprehensive error handling, type safety, and deterministic behavior
Scalable: Efficient batched operations support high-throughput cloud deployments

Addressing Go's AI Limitations

We've solved the common concerns about Go in AI/ML:

SIMD-Style Throughput: Unrolled pure Go critical paths today; hand-written AVX2/NEON assembly is on the roadmap
Memory Efficiency: Object pools and optimized allocations reduce GC pressure
Parallel Performance: Goroutine-based parallelization for multi-core systems
Numerical Stability: Robust floating-point operations with proper error handling

Real-World Benefits

Fast Cold Starts: A pure Go binary has no external runtime dependencies to load at startup
Predictable Latency: One code path for every input size, so there are no size-dependent performance cliffs
Easy Deployment: Single binary with no external dependencies
Cost Effective: Efficient resource usage reduces cloud costs

Quick Start

Run our comprehensive examples:

# Get the module
go get github.com/takara-ai/go-attention

# Run the examples (from a checkout of the repository, in the directory containing api_examples.go)
go run api_examples.go

Performance Characteristics

Consistent Performance Across All Sizes

Our implementation uses a single, highly optimized code path that delivers predictable performance:

// Always fast, always consistent - no unpredictable performance cliffs
result, err := attention.DotProduct(v1, v2)

Performance Results (Apple M1, go test -bench=. ./attention):

Small vectors (64-256): ~17-62ns per dot product
Medium vectors (512-1024): ~250-290ns per dot product
Large vectors (4096+): ~970-1230ns per dot product
Consistent across platforms: The same pure Go code path runs on every platform Go supports; absolute timings depend on the hardware

Production-Grade Reliability

Deterministic Results: Same input always produces same output
Memory Safe: No buffer overflows or memory corruption
Error Handling: Comprehensive validation and error reporting
Type Safe: Compile-time type checks catch a class of errors before the code runs

API Documentation

For complete API documentation, see API.md.

Core Types

type Vector []float64           // Represents a 1D vector of float64 values
type Matrix []Vector           // Represents a 2D matrix of float64 values

Quick Examples

1. Basic Dot-Product Attention

The simplest form of attention mechanism. Useful for basic sequence processing tasks.

import (
    "log"

    "github.com/takara-ai/go-attention/attention"
)

// Create query-key-value setup
query := attention.Vector{1.0, 0.0, 1.0, 0.0}  // Pattern to search for
keys := attention.Matrix{
    {1.0, 0.0, 1.0, 0.0},  // Similar to query
    {0.0, 1.0, 0.0, 1.0},  // Different from query
    {0.5, 0.5, 0.5, 0.5},  // Neutral pattern
}
values := attention.Matrix{
    {1.0, 2.0},  // Value for similar key
    {3.0, 4.0},  // Value for different key
    {5.0, 6.0},  // Value for neutral key
}

// Compute attention
output, weights, err := attention.DotProductAttention(query, keys, values)
if err != nil {
    log.Fatal(err)
}

// Output will be a weighted combination of values based on query-key similarity
// Weights will show how much attention each key received

2. Multi-Head Attention

More sophisticated attention mechanism that can capture different types of relationships in parallel.

import (
    "log"

    "github.com/takara-ai/go-attention/attention"
)

// Configure multi-head attention
config := attention.MultiHeadConfig{
    NumHeads:    4,        // Number of parallel attention heads
    DModel:      64,       // Size of input/output embeddings
    DKey:        16,       // Size per head (DModel/NumHeads)
    DValue:      16,       // Size per head (DModel/NumHeads)
    DropoutRate: 0.1,      // For regularization
}

// Create the attention module
mha, err := attention.NewMultiHeadAttention(config)
if err != nil {
    log.Fatal(err)
}

// Process sequences (batched input)
batchSize, seqLen := 2, 3  // Process 2 sequences, each with 3 tokens

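// Create input matrices: batchSize*seqLen rows, each of length DModel
queries := make(attention.Matrix, batchSize*seqLen)
keys := make(attention.Matrix, batchSize*seqLen)
values := make(attention.Matrix, batchSize*seqLen)

// Allocate every row; make() on a Matrix leaves the rows nil
for i := range queries {
    queries[i] = make(attention.Vector, config.DModel)
    keys[i] = make(attention.Vector, config.DModel)
    values[i] = make(attention.Vector, config.DModel)
    // Fill with your embedding data...
}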

// Process through multi-head attention
output, err := mha.Forward(queries, keys, values)
if err != nil {
    log.Fatal(err)
}
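The config comments above state that DKey and DValue equal DModel/NumHeads. A small sketch of that relationship follows; perHeadDim is a hypothetical helper written for illustration, and whether the library enforces divisibility is an assumption not confirmed here.

```go
package main

import "fmt"

// perHeadDim is a hypothetical helper, not part of the go-attention API.
// It returns the per-head width when dModel features are split evenly
// across numHeads heads, and an error when the split is not even.
func perHeadDim(dModel, numHeads int) (int, error) {
	if dModel <= 0 || numHeads <= 0 {
		return 0, fmt.Errorf("dModel and numHeads must be positive, got %d and %d", dModel, numHeads)
	}
	if dModel%numHeads != 0 {
		return 0, fmt.Errorf("dModel %d is not divisible by numHeads %d", dModel, numHeads)
	}
	return dModel / numHeads, nil
}

func main() {
	d, err := perHeadDim(64, 4)
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // 16, the DKey and DValue used in the config above
}
```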

3. Full Transformer Layer

Complete transformer layer with self-attention and feed-forward network.

import (
    "log"

    "github.com/takara-ai/go-attention/attention"
    "github.com/takara-ai/go-attention/transformer"
)

// Configure transformer layer
config := transformer.TransformerConfig{
    DModel:      64,       // Size of token embeddings
    NumHeads:    4,        // Number of attention heads
    DHidden:     256,      // Size of feed-forward hidden layer
    DropoutRate: 0.1,      // For regularization
}

// Create transformer layer
layer, err := transformer.NewTransformerLayer(config)
if err != nil {
    log.Fatal(err)
}

// Create input sequence [seq_len × d_model]
seqLen := 3
input := make(attention.Matrix, seqLen)
for i := range input {
    input[i] = make(attention.Vector, config.DModel)
    // Fill with your embedding data...
}

// Process through transformer
output, err := layer.Forward(input)
if err != nil {
    log.Fatal(err)
}

Example Output

When running the examples, you'll see:

Dot-Product Attention:

Query: [1 0 1 0]
Attention Weights: [0.523 0.174 0.302]  // Shows focus on similar patterns
Output: [2.558 3.558]                   // Weighted combination of values

Multi-Head Attention:

Input dimensions: [2 batches × 3 tokens × 64 features]
Output shape: [6×64]

Transformer Layer:

Input shape: [3×64]
Output shape: [3×64]

Common Use Cases

Text Processing:
- Sequence-to-sequence translation
- Document summarization
- Sentiment analysis
Time Series:
- Financial forecasting
- Sensor data analysis
- Anomaly detection
Structured Data:
- Graph node embedding
- Feature interaction modeling
- Recommendation systems

Performance Features

Built-in Optimizations

Loop Unrolling: 8x unrolled dot product for maximum throughput
Memory Pooling: Object pools reduce allocation overhead in hot paths
Cache-Friendly Projections: projectVector uses row-major loop order so weight rows are read sequentially (fixes column-major stride that caused cache misses on Q/K/V and feed-forward matmuls)
Zero Dependencies: Pure Go implementation with no external requirements
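To make the loop unrolling item concrete, here is a minimal sketch of an 8x unrolled dot product in pure Go. The name dotUnrolled8 and the choice of four accumulators are assumptions for illustration; the library's actual kernel may be organized differently.

```go
package main

import "fmt"

// dotUnrolled8 is an illustrative sketch, not the library's kernel.
// It consumes 8 elements per iteration using four independent
// accumulators, then handles the remaining tail one element at a time.
func dotUnrolled8(a, b []float64) (float64, error) {
	if len(a) != len(b) {
		return 0, fmt.Errorf("length mismatch: %d vs %d", len(a), len(b))
	}
	n := len(a)
	var s0, s1, s2, s3 float64
	i := 0
	for ; i+8 <= n; i += 8 {
		// Fixed-size windows may help the compiler drop bounds checks.
		x, y := a[i:i+8:i+8], b[i:i+8:i+8]
		s0 += x[0]*y[0] + x[4]*y[4]
		s1 += x[1]*y[1] + x[5]*y[5]
		s2 += x[2]*y[2] + x[6]*y[6]
		s3 += x[3]*y[3] + x[7]*y[7]
	}
	sum := (s0 + s1) + (s2 + s3)
	for ; i < n; i++ {
		sum += a[i] * b[i]
	}
	return sum, nil
}

func main() {
	a := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	b := []float64{1, 1, 1, 1, 1, 1, 1, 1, 1, 1}
	sum, err := dotUnrolled8(a, b)
	if err != nil {
		panic(err)
	}
	fmt.Println(sum) // 55
}
```

Independent accumulators shorten the dependency chain between additions. Note that they also change the summation order, so results can differ from a simple loop in the last bits for non-integer data.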

Production Monitoring

// Enable performance monitoring and auto-tuning
config := attention.DefaultPerformanceConfig()
config.EnableMonitoring = true
config.EnableAutoTuning = true
attention.SetPerformanceConfig(config)

// Use memory pools to reduce allocations
v1 := attention.GetVectorFromPool(size)
defer attention.PutVectorToPool(v1)

// Get performance statistics
stats := attention.GetAllPerformanceStats()
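The pooling calls above can be understood through the standard library's sync.Pool. The sketch below is one plausible way to build such a pool and is not taken from the library; vectorPool, getVector, and putVector are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// vectorPool is a hypothetical stand-in for the library's pool.
// It stores pointers to slices so Put does not allocate.
var vectorPool = sync.Pool{
	New: func() interface{} {
		buf := make([]float64, 0, 64)
		return &buf
	},
}

// getVector returns a zeroed slice of the requested length, reusing
// pooled memory when the pooled buffer is large enough.
func getVector(size int) *[]float64 {
	p := vectorPool.Get().(*[]float64)
	if cap(*p) < size {
		*p = make([]float64, size)
		return p
	}
	*p = (*p)[:size]
	for i := range *p {
		(*p)[i] = 0 // clear data left by the previous user
	}
	return p
}

// putVector hands the buffer back; the caller must not use it afterwards.
func putVector(p *[]float64) {
	vectorPool.Put(p)
}

func main() {
	v := getVector(128)
	defer putVector(v)
	(*v)[0] = 1.5
	fmt.Println(len(*v), (*v)[0])
}
```

Zeroing on Get rather than on Put is a design choice in this sketch: it guarantees a clean buffer even if a caller returns one with stale data.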

Parallel Operations

// Configure parallel processing
config := attention.DefaultParallelConfig()
config.NumWorkers = runtime.NumCPU()
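How the library uses NumWorkers internally is not shown here. As a rough sketch of goroutine-based parallelism under that caveat, the example below splits a batch of rows into contiguous chunks and computes one dot product per row; parallelDots is a hypothetical function and assumes every row has the same length as the query.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelDots is an illustrative sketch, not the library's scheduler.
// Each goroutine owns a contiguous chunk of rows and writes only to
// its own output slots, so no locking is needed.
func parallelDots(q []float64, rows [][]float64, numWorkers int) []float64 {
	out := make([]float64, len(rows))
	if numWorkers < 1 {
		numWorkers = 1
	}
	chunk := (len(rows) + numWorkers - 1) / numWorkers
	var wg sync.WaitGroup
	for start := 0; start < len(rows); start += chunk {
		end := start + chunk
		if end > len(rows) {
			end = len(rows)
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			for i := start; i < end; i++ {
				for j, x := range rows[i] {
					out[i] += q[j] * x
				}
			}
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	q := []float64{1, 2}
	rows := [][]float64{{1, 1}, {2, 0}, {0, 3}}
	fmt.Println(parallelDots(q, rows, runtime.NumCPU())) // [3 2 6]
}
```

For small batches the goroutine overhead can outweigh the gain, so a real implementation would likely fall back to a sequential loop below some size threshold.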

Why This Go Implementation?

Edge Computing Excellence

Zero Dependencies: Perfect for edge devices where dependency management is crucial
Predictable Performance: Latency grows smoothly with input size, with no size-dependent performance cliffs
Memory Efficient: Minimal allocations and GC pressure
Easy Deployment: Single binary deployment

Production System Benefits

Type Safety: Compile-time guarantees prevent runtime errors
Error Handling: Comprehensive validation and error reporting
Deterministic: Same input always produces same output
Scalable: Efficient batched operations for high throughput

Real-time Processing

Consistent Latency: No unpredictable performance cliffs
Low Memory Footprint: Efficient resource usage
Fast Startup: No dependency resolution delays
Reliable: Robust error handling and recovery

Features

Efficient Dot-Product Attention: Upgraded with Scalable-Softmax (SSMax, s=1) for improved long-context performance
Multi-Head Attention: Parallel attention heads for capturing different relationships
Full Transformer Layer: Complete implementation with:
- Layer normalization
- Position-wise feed-forward networks
- Residual connections
Batched Operations: Efficient processing of multiple sequences
Production Monitoring: Built-in performance tracking and optimization

Performance Benchmarks

Run comprehensive performance benchmarks:

go test -bench=. ./attention ./transformer

This benchmarks all operations and shows performance characteristics across different input sizes on the machine where it runs.

Sample Results (Apple M1, June 2026):

BenchmarkDotProduct-8                      253631955          4.7 ns/op
BenchmarkDotProductAttention-8            12874644         93.0 ns/op
BenchmarkMultiHeadAttentionForward-8         19830         61.9 µs/op   (was ~192 µs/op, ~3.1x faster)
BenchmarkFeedForwardForward-8                10000        122.2 µs/op
BenchmarkTransformerLayerForward-8            2714        442.6 µs/op

BenchmarkMultiHeadAttentionForward improved ~3.1x after fixing projectVector loop order (#8). Feed-forward and full transformer layers benefit from the same change. On the projection hot path alone (d_model=768, d_k=96), row-major access is ~8x faster than the previous column-major stride:

BenchmarkIssue8_ProjectVectorColumnMajor/768x96-8     ~270 µs/op
BenchmarkIssue8_ProjectVectorRowMajor/768x96-8         ~34 µs/op

Reproduce with:

go test ./attention -bench='Issue8' -benchmem

Roadmap

Future improvements may include:

Real SIMD Assembly: Actual AVX2/NEON implementations for critical paths
Positional Encoding: RoPE and other positional encoding methods
Advanced Optimizations: Flash Attention, Sparse Attention variants
Training Support: Gradient computation and optimization utilities
Model Export: ONNX and other format support
GPU Acceleration: CUDA/OpenCL backends for GPU computation

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details

For research inquiries and press, please reach out to [email protected]

Transforming humanity (人類を変革する)
