go-attention
From the Frontier Research Team at takara.ai we present the first pure Go implementation of attention mechanisms and transformer layers, designed for high performance, consistency, and production reliability.
Why Go for Attention Mechanisms?
Performance Without Compromise
This implementation proves that Go can deliver production-grade performance for AI workloads:
- Consistent, Predictable Performance: Single optimized code path ensures repeatable results across all input sizes
- Edge-Optimized: Zero external dependencies and minimal memory footprint perfect for edge devices
- Production-Ready: Comprehensive error handling, type safety, and deterministic behavior
- Scalable: Efficient batched operations support high-throughput cloud deployments
Addressing Go's AI Limitations
We've solved the common concerns about Go in AI/ML:
- SIMD Support: Assembly-optimized critical paths with automatic fallbacks
- Memory Efficiency: Object pools and optimized allocations reduce GC pressure
- Parallel Performance: Goroutine-based parallelization for multi-core systems
- Numerical Stability: Robust floating-point operations with proper error handling
Real-World Benefits
- Zero Cold Starts: Pure Go implementation eliminates dependency resolution delays
- Predictable Latency: Consistent performance characteristics across all hardware
- Easy Deployment: Single binary with no external dependencies
- Cost Effective: Efficient resource usage reduces cloud costs
Quick Start
Run our comprehensive examples:
# Get the module
go get github.com/takara-ai/go-attention
# Run the examples
go run api_examples.go
Performance Characteristics
Consistent Performance Across All Sizes
Our implementation uses a single, highly optimized code path that delivers predictable performance:
// Always fast, always consistent - no unpredictable performance cliffs
result, err := attention.DotProduct(v1, v2)
Performance Results (Apple M1, go test -bench=. ./attention):
- Small vectors (64-256): ~17-62ns per dot product
- Medium vectors (512-1024): ~250-290ns per dot product
- Large vectors (4096+): ~970-1230ns per dot product
- Consistent across all hardware: Same performance characteristics on any Go-compatible platform
Production-Grade Reliability
- Deterministic Results: Same input always produces same output
- Memory Safe: No buffer overflows or memory corruption
- Error Handling: Comprehensive validation and error reporting
- Type Safe: Compile-time guarantees prevent runtime errors
API Documentation
For complete API documentation, see API.md.
Core Types
type Vector []float64 // Represents a 1D vector of float64 values
type Matrix []Vector // Represents a 2D matrix of float64 values
Quick Examples
1. Basic Dot-Product Attention
The simplest form of attention mechanism. Useful for basic sequence processing tasks.
import "github.com/takara-ai/go-attention/attention"
// Create query-key-value setup
query := attention.Vector{1.0, 0.0, 1.0, 0.0} // Pattern to search for
keys := attention.Matrix{
{1.0, 0.0, 1.0, 0.0}, // Similar to query
{0.0, 1.0, 0.0, 1.0}, // Different from query
{0.5, 0.5, 0.5, 0.5}, // Neutral pattern
}
values := attention.Matrix{
{1.0, 2.0}, // Value for similar key
{3.0, 4.0}, // Value for different key
{5.0, 6.0}, // Value for neutral key
}
// Compute attention
output, weights, err := attention.DotProductAttention(query, keys, values)
if err != nil {
log.Fatal(err)
}
// Output will be a weighted combination of values based on query-key similarity
// Weights will show how much attention each key received
2. Multi-Head Attention
More sophisticated attention mechanism that can capture different types of relationships in parallel.
import "github.com/takara-ai/go-attention/attention"
// Configure multi-head attention
config := attention.MultiHeadConfig{
NumHeads: 4, // Number of parallel attention heads
DModel: 64, // Size of input/output embeddings
DKey: 16, // Size per head (DModel/NumHeads)
DValue: 16, // Size per head (DModel/NumHeads)
DropoutRate: 0.1, // For regularization
}
// Create the attention module
mha, err := attention.NewMultiHeadAttention(config)
if err != nil {
log.Fatal(err)
}
// Process sequences (batched input)
batchSize, seqLen := 2, 3 // Process 2 sequences, each with 3 tokens
// Create input matrices [batchSize × seqLen × DModel]
queries := make(attention.Matrix, batchSize*seqLen)
keys := make(attention.Matrix, batchSize*seqLen)
values := make(attention.Matrix, batchSize*seqLen)
// Initialize your matrices with actual data...
// Process through multi-head attention
output, err := mha.Forward(queries, keys, values)
if err != nil {
log.Fatal(err)
}
3. Full Transformer Layer
Complete transformer layer with self-attention and feed-forward network.
import (
"github.com/takara-ai/go-attention/transformer"
"github.com/takara-ai/go-attention/attention"
)
// Configure transformer layer
config := transformer.TransformerConfig{
DModel: 64, // Size of token embeddings
NumHeads: 4, // Number of attention heads
DHidden: 256, // Size of feed-forward hidden layer
DropoutRate: 0.1, // For regularization
}
// Create transformer layer
layer, err := transformer.NewTransformerLayer(config)
if err != nil {
log.Fatal(err)
}
// Create input sequence [seq_len × d_model]
seqLen := 3
input := make(attention.Matrix, seqLen)
for i := range input {
input[i] = make(attention.Vector, config.DModel)
// Fill with your embedding data...
}
// Process through transformer
output, err := layer.Forward(input)
if err != nil {
log.Fatal(err)
}
Example Output
When running the examples, you'll see:
-
Dot-Product Attention:
Query: [1 0 1 0] Attention Weights: [0.523 0.174 0.302] // Shows focus on similar patterns Output: [2.558 3.558] // Weighted combination of values -
Multi-Head Attention:
Input dimensions: [2 batches × 3 tokens × 64 features] Output shape: [6×64] -
Transformer Layer:
Input shape: [3×64] Output shape: [3×64]
Common Use Cases
-
Text Processing:
- Sequence-to-sequence translation
- Document summarization
- Sentiment analysis
-
Time Series:
- Financial forecasting
- Sensor data analysis
- Anomaly detection
-
Structured Data:
- Graph node embedding
- Feature interaction modeling
- Recommendation systems
Performance Features
Built-in Optimizations
- Loop Unrolling: 8x unrolled dot product for maximum throughput
- Memory Pooling: Object pools reduce allocation overhead in hot paths
- Cache-Friendly Projections:
projectVectoruses row-major loop order so weight rows are read sequentially (fixes column-major stride that caused cache misses on Q/K/V and feed-forward matmuls) - Zero Dependencies: Pure Go implementation with no external requirements
Production Monitoring
// Enable performance monitoring and auto-tuning
config := attention.DefaultPerformanceConfig()
config.EnableMonitoring = true
config.EnableAutoTuning = true
attention.SetPerformanceConfig(config)
// Use memory pools to reduce allocations
v1 := attention.GetVectorFromPool(size)
defer attention.PutVectorToPool(v1)
// Get performance statistics
stats := attention.GetAllPerformanceStats()
Parallel Operations
// Configure parallel processing
config := attention.DefaultParallelConfig()
config.NumWorkers = runtime.NumCPU()
Why This Go Implementation?
Edge Computing Excellence
- Zero Dependencies: Perfect for edge devices where dependency management is crucial
- Predictable Performance: Consistent latency regardless of input size
- Memory Efficient: Minimal allocations and GC pressure
- Easy Deployment: Single binary deployment
Production System Benefits
- Type Safety: Compile-time guarantees prevent runtime errors
- Error Handling: Comprehensive validation and error reporting
- Deterministic: Same input always produces same output
- Scalable: Efficient batched operations for high throughput
Real-time Processing
- Consistent Latency: No unpredictable performance cliffs
- Low Memory Footprint: Efficient resource usage
- Fast Startup: No dependency resolution delays
- Reliable: Robust error handling and recovery
Features
- Efficient Dot-Product Attention: Upgraded with Scalable-Softmax (SSMax, s=1) for improved long-context performance
- Multi-Head Attention: Parallel attention heads for capturing different relationships
- Full Transformer Layer: Complete implementation with:
- Layer normalization
- Position-wise feed-forward networks
- Residual connections
- Batched Operations: Efficient processing of multiple sequences
- Production Monitoring: Built-in performance tracking and optimization
Performance Benchmarks
Run comprehensive performance benchmarks:
go test -bench=. ./attention ./transformer
This will benchmark all operations and show performance characteristics across different input sizes and hardware configurations.
Sample Results (Apple M1, June 2026):
BenchmarkDotProduct-8 253631955 4.7 ns/op
BenchmarkDotProductAttention-8 12874644 93.0 ns/op
BenchmarkMultiHeadAttentionForward-8 19830 61.9 µs/op (was ~192 µs/op, ~3.1x faster)
BenchmarkFeedForwardForward-8 10000 122.2 µs/op
BenchmarkTransformerLayerForward-8 2714 442.6 µs/op
BenchmarkMultiHeadAttentionForward improved ~3.1x after fixing projectVector loop order (#8). Feed-forward and full transformer layers benefit from the same change. On the projection hot path alone (d_model=768, d_k=96), row-major access is ~8x faster than the previous column-major stride:
BenchmarkIssue8_ProjectVectorColumnMajor/768x96-8 ~270 µs/op
BenchmarkIssue8_ProjectVectorRowMajor/768x96-8 ~34 µs/op
Reproduce with:
go test ./attention -bench='Issue8' -benchmem
Roadmap
Future improvements may include:
- Real SIMD Assembly: Actual AVX2/NEON implementations for critical paths
- Positional Encoding: RoPE and other positional encoding methods
- Advanced Optimizations: Flash Attention, Sparse Attention variants
- Training Support: Gradient computation and optimization utilities
- Model Export: ONNX and other format support
- GPU Acceleration: CUDA/OpenCL backends for GPU computation
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details
For research inquiries and press, please reach out to [email protected]
人類を変革する