Machine Learning & AI Research
Machine Learning Researcher · Pangram Labs · Feb 2026 – Present
Frontier-defining work in a new field (AI detection). I really like this job; you should join us.
Machine Learning Researcher · Block · Jan 2026 – Feb 2026
ML research for Dessa/Block. Left for Pangram a week before The Layoff.
M.S. in ML/AI - Northeastern University
(dropped out in Dec 2025 for work)
B.S. in Computer Science - University of Texas at Dallas (May 2025)
IBM Professional Data Science Certification - September 2023
Optimizer with multiple momentum buffers at different timescales. Inspired by further evidence from other groups of hierarchical structure, like that initially explored in 'Hyperbolic Space', and by multi-temporal memory in humans.
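A minimal sketch of the mechanism, assuming a plain SGD-style update; the class name, defaults, and per-scale weighting are my illustration, not the project's API:

```python
import torch

class MultiScaleMomentum(torch.optim.Optimizer):
    """Sketch: SGD with several momentum buffers decaying at different
    timescales (betas), combined into one update with per-scale weights."""

    def __init__(self, params, lr=1e-2, betas=(0.9, 0.99, 0.999), weights=None):
        weights = weights or [1.0 / len(betas)] * len(betas)
        super().__init__(params, dict(lr=lr, betas=betas, weights=weights))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    state["bufs"] = [torch.zeros_like(p) for _ in group["betas"]]
                update = torch.zeros_like(p)
                for buf, beta, w in zip(state["bufs"], group["betas"], group["weights"]):
                    buf.mul_(beta).add_(p.grad)   # momentum at this timescale
                    update.add_(buf, alpha=w)     # weighted sum across timescales
                p.add_(update, alpha=-group["lr"])
```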
Mathematical proof of the exactness of sub-graph-partitioned marginalization in factor graphs using 'port nodes'. Proves the existence of exact message-passing algorithms via hierarchical cutset conditioning for probabilistic inference.
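In my own notation (not necessarily the paper's): if a sub-graph S contains every factor that touches its internal variables x_I, then summing x_I out distributes past the remaining factors, leaving a single exact factor over the port nodes x_P:

```latex
\[
  \sum_{x_I} \prod_{a} f_a(x_a)
  = \Bigl(\prod_{a \notin S} f_a(x_a)\Bigr)
    \underbrace{\sum_{x_I} \prod_{a \in S} f_a(x_a)}_{\phi_S(x_P)}
\]
```

The sub-graph can then be replaced wholesale by φ_S(x_P) with no approximation.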
My own pretraining script, written from scratch and specialized for hackability and OOD research. Reaches tokens-per-second (tps) parity with llm.c.
Evidence that neural networks learn multiscale hierarchical structures, supported by statistical analysis of model weights showing possible multiscale structure in MLP matrices.
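One simple probe for this kind of structure, comparing singular-value decay of an MLP matrix against its sub-blocks at several scales; this is my illustration of the style of test, not necessarily the post's analysis:

```python
import numpy as np

def spectrum_slope(w):
    """Log-log slope of singular-value decay: a crude spectral signature."""
    s = np.linalg.svd(w, compute_uv=False)
    k = np.arange(1, len(s) + 1)
    return np.polyfit(np.log(k), np.log(s + 1e-12), 1)[0]

def multiscale_probe(w, levels=3):
    """Similar slopes across block scales hint at self-similar structure."""
    slopes = []
    for lvl in range(levels):
        n = 2 ** lvl
        h, v = w.shape[0] // n, w.shape[1] // n
        blocks = [w[i*h:(i+1)*h, j*v:(j+1)*v]
                  for i in range(n) for j in range(n)]
        slopes.append(np.mean([spectrum_slope(b) for b in blocks]))
    return slopes
```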
A thought experiment paired with a real experiment, exploring alternative representations for transformer internals. Built so that people, and LLMs in RL, can develop a visceral sense of what it's like to think like a transformer.
Philosophical exploration of AI consciousness and identity through dialogue with language models.
My case for closed-source AI. Enumerates the flaws in the open-source vision (its impossibilities and moral gaps), addresses common criticisms of centralization, and makes the positive case for closed source.
Economically useful AI requires continual learning; context summarization with heavily pretrained models is the most practical path because it keeps internal state interpretable and debuggable.
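A minimal sketch of the summarize-and-carry loop this implies; `complete` is a hypothetical text-completion callable and the prompts are illustrative:

```python
def run_with_summary_memory(complete, tasks, max_ctx_chars=8000):
    """Continual learning via context summarization: the only persistent
    state is `memory`, a plain-text summary you can read and debug."""
    memory = ""
    for task in tasks:
        answer = complete(f"Notes so far:\n{memory}\n\nTask:\n{task}")
        memory = complete(
            "Rewrite the notes to incorporate this exchange, keeping them "
            f"under {max_ctx_chars} characters.\n\nNotes:\n{memory}\n\n"
            f"Task:\n{task}\n\nAnswer:\n{answer}"
        )[:max_ctx_chars]
        yield answer
```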
Local learning rule where each transformer layer acts as its own momentum buffer, treating layers as compressions of past gradients. Theoretically interesting, empirically failed to converge.
Exploration of BPTT's "impossible triangle" for linear sequence modeling. Evaluates a novel surrogate gradient method for TTT modules: a partial success that validates the theory but is insufficient for training.
Final (for now) attempt at a surrogate gradient method, prioritizing composability in the abstraction. It doesn't work; includes a discussion of why such a method is needed.
Theory + two-part experiment. If scale matters for circuit learning, it should also matter for ICL; I found no diminishing returns on ICL with at least 8× more attention params per layer. Devised and trained a new MLA method using full-rank master weights decomposed to low rank (like QAT, but for rank), demonstrating low ICL loss vs. full rank with a good KV-state tradeoff.
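The QAT analogy suggests a straight-through estimator over rank: run the forward pass through a rank-truncated copy while gradients update the full-rank master weight. A sketch of my reading, not the project's code:

```python
import torch

def low_rank_ste(w: torch.Tensor, rank: int) -> torch.Tensor:
    """Forward with a rank-truncated copy of `w`; backward passes the
    gradient straight through to the full-rank master weight, analogous
    to fake-quantization in QAT."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    w_low = (u[:, :rank] * s[:rank]) @ vh[:rank, :]
    return w + (w_low - w).detach()  # value: w_low; gradient: identity in w
```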
Misguided Attention, but for vision tasks. Built by taking common optical illusions and making them "literal" to see whether models can tell the difference.
A highlight of a new paper that introduces a new abstraction, detailing how I extracted this abstraction from the paper and how it can be applied in novel ways. Also partially replicates the paper in JAX, in a modular form that can be reused in other projects.
Discussing the limitations of synthetic data and why we should be wary of the empirical benefits it provides. Also explores when exactly synthetic data does make sense and the principles behind those decisions.
A highlight of a paper that beat me to an idea I had for architectures specialized for low memory bandwidth (e.g., Macs, CPUs): using speculative expert decoding (moving the router to before the attention mechanism) to preload experts.
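The scheduling win, as I understand it: routing before attention means the chosen experts' weights can stream in from slow memory while attention computes. A sketch with hypothetical callables:

```python
def decode_step(x, router, attention, experts, fetch_async):
    """Speculative expert preloading (all names are illustrative)."""
    expert_ids = router(x)                     # route on the pre-attention state
    handle = fetch_async(experts, expert_ids)  # start loading expert weights now
    h = attention(x)                           # attention overlaps the transfer
    expert_fn = handle.wait()                  # ideally already resident
    return expert_fn(h)
```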
The objectively perfect configuration system for Python, balancing specificity with readability and re-use for research code.
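I haven't reproduced the post's system here; as a stand-in, a minimal dataclass pattern that shows the specificity/re-use balance it aims at:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TrainConfig:
    lr: float = 3e-4
    batch_size: int = 64
    seq_len: int = 1024

base = TrainConfig()
ablation = replace(base, lr=1e-4)  # explicit, typed override; base stays intact
```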
Resurrecting a benchmark in an important category and evaluating the latest models on it. Designed as a follow-up to the synthetic data blog post, to find which models are suitable for reliable data processing given the data processing inequality.
Testing whether randomized persona prompts can increase LLM output diversity and reduce entropy collapse. Benchmarked across major model families using coin flips and dice rolls; works best for Anthropic models and is less effective elsewhere, suggesting architectural approaches may be needed.
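The measurement itself is simple; a sketch with a hypothetical `complete` callable and illustrative personas:

```python
import math
import random
from collections import Counter

def outcome_entropy(samples):
    """Shannon entropy (bits) of the empirical outcome distribution."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def persona_vs_plain(complete, personas, n=100):
    prompt = "Flip a coin. Answer only 'heads' or 'tails'."
    plain = [complete(prompt) for _ in range(n)]
    persona = [complete(f"You are {random.choice(personas)}. {prompt}")
               for _ in range(n)]
    return outcome_entropy(plain), outcome_entropy(persona)  # higher = more diverse
```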
Investigation into overcoming quantization barriers in low-precision training through collective precision methods.
Mathematical derivation of a faster Min-P sampling algorithm that avoids full softmax computation through clever use of log-space operations.
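Presumably the key identity is p_i / p_max = exp(l_i - l_max), so the Min-P condition p_i >= p_base * p_max reduces to l_i >= l_max + ln(p_base), which needs no normalization. A sketch under that assumption:

```python
import numpy as np

def min_p_sample(logits: np.ndarray, p_base: float, rng=None) -> int:
    """Min-P sampling without a full softmax: filter in log space,
    then normalize only over the surviving tokens."""
    rng = rng or np.random.default_rng()
    l_max = logits.max()
    keep = logits >= l_max + np.log(p_base)  # same as p_i >= p_base * p_max
    probs = np.exp(logits[keep] - l_max)     # unnormalized but numerically stable
    probs /= probs.sum()
    return int(rng.choice(np.flatnonzero(keep), p=probs))
```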
Hyperspace cellular automata adapted to GPU cluster geometry, enabling evolution of computationally efficient organisms through local learning rules.
Comprehensive analysis of neural scaling laws and their implications for model performance and efficiency.
Deep dive into how model performance scales with parameter count across different architectures.
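For the unfamiliar, such analyses typically fit a saturating power law to (parameter count, loss) pairs; a toy fit with made-up numbers, purely illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, b, c):
    """Loss falls as N^-b toward an irreducible floor c."""
    return a * n ** (-b) + c

params = np.array([1e6, 1e7, 1e8, 1e9])  # illustrative data, not real results
losses = np.array([4.2, 3.5, 3.0, 2.7])
(a, b, c), _ = curve_fit(scaling_law, params, losses, p0=(10.0, 0.1, 2.0))
print(f"L(N) = {a:.2f} * N^(-{b:.3f}) + {c:.2f}")
```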
Proof-of-concept multithreaded Terraria clone written from scratch in Java. Features highly concurrent programming, distributed computing design, and custom graphics with LWJGL/OpenGL.
Advanced Runtime Resource Packs for Minecraft modding, enabling dynamic resource generation at runtime. Over 7 million downloads.
Novel Minecraft mod concept featuring encrypted recipes for developers. Implements public-key cryptography, hashing, and creative cybersecurity solutions.
Essential contributor to a 2-year design discussion for the FabricMC modloader API. Led theoretical development and initial implementations, culminating in the final production API.
Gradle plugin for setting up FabricMC development environments in record time. Features highly concurrent programming and high-performance I/O optimizations.