We proved that browsers can run GPU computation at near-native speed. No download. No account. No expensive hardware. Just open a page.
For decades, running computation on a graphics card meant navigating three barriers that kept most of the world locked out.
Hardware lock-in: CUDA only runs on NVIDIA GPUs. If you have a Mac, an AMD card, or an Intel integrated chip, you're out.
Setup complexity: Python environments, CUDA drivers, cuDNN versions, framework dependencies. One mismatch and nothing works.
Cost: Cloud GPU instances run $2–4/hour. A university lab without a GPU budget simply can't participate.
WebGPU is a new browser standard that gives web pages direct access to your graphics card. We showed it's fast enough for real scientific computation.
“For fitness functions with sequential dependencies — common in reinforcement learning, financial simulation, and control systems — custom compute shaders dramatically outperform framework-based GPU code.”
From the paper, Section 4.2
This isn't about replacing data centers. It's about giving GPU access to the people who never had it.
A computer science student in Lagos, Bangalore, or rural anywhere can now run real GPU-accelerated experiments on their $400 laptop. Open a browser tab, not a grant application.
Instead of slides about parallel computing, show it running live in the classroom. Every student's laptop becomes a GPU workstation. No lab setup, no admin permissions.
“To reproduce our results, open this URL.” Not “install Python 3.10.12, CUDA 12.1, cuDNN 8.9.7, match the exact driver version.” Reproducibility should be one click.
A biotech team in Nairobi running molecular screening. A fintech in São Paulo backtesting strategies. Their scientists' laptops ARE the compute cluster. No AWS required.
The technical details are in the paper. Here's the intuition.
The graphics chip in your laptop — whether it's Apple Silicon, AMD, Intel, or NVIDIA — is a parallel processor with thousands of cores. It spends most of its time idle.
WebGPU, a web standard shipping in Chrome since 2023, lets web pages run computation directly on that GPU. Not just graphics: real number crunching.
Instead of sending thousands of small tasks to the GPU one by one (as PyTorch does), we fuse the entire computation into a single GPU dispatch: one dispatch instead of 22,500. The browser adds only 48% overhead versus native code, and it's still faster than PyTorch.
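A rough mental model of why this matters. The per-dispatch overhead and per-step compute costs below are hypothetical illustrative numbers, not measurements from the paper; only the dispatch count (22,500) comes from our results.

```javascript
// Toy model: fixed launch overhead dominates when a sequential workload
// is split into one GPU dispatch per step.
const DISPATCH_OVERHEAD_US = 50; // HYPOTHETICAL fixed cost to launch one GPU task
const COMPUTE_US_PER_STEP = 2;   // HYPOTHETICAL useful work per simulation step
const STEPS = 22500;             // dispatch count reported in the paper

// Framework style: one dispatch per step, so overhead is paid 22,500 times.
const perStepDispatch = STEPS * (DISPATCH_OVERHEAD_US + COMPUTE_US_PER_STEP);

// Fused shader style: one dispatch runs the whole loop on the GPU.
const fusedDispatch = DISPATCH_OVERHEAD_US + STEPS * COMPUTE_US_PER_STEP;

console.log(`per-step dispatching: ${(perStepDispatch / 1e6).toFixed(2)} s`);
console.log(`single fused dispatch: ${(fusedDispatch / 1e6).toFixed(2)} s`);
console.log(`modeled speedup: ${(perStepDispatch / fusedDispatch).toFixed(0)}x`);
```

Even with these made-up costs, the model shows the shape of the problem: the per-step version pays the launch tax on every iteration, the fused version pays it once. The real speedups depend on actual hardware and driver overheads.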
30 independent runs per experiment. Statistical tests. Comparisons against 8 systems on 2 hardware platforms. Not hype — evidence.
This isn't theoretical. Here's what's different tomorrow.
“Reproduce our results” → Install Python 3.10.12 → Install CUDA 12.1 → Install cuDNN 8.9.7 → Match driver version → Debug for 3 hours → Maybe it works
Click this link. Results run in your browser. Verified in 30 seconds.
“I need GPU compute” → AWS account → GPU instance ($2–4/hr) → DevOps setup → $5,000/month bill → Need funding before building
Users' own GPUs do the work. Server cost: $0. Ship on day one.
“Today we'll learn parallel computing” → Show slides → Students nod → Nobody runs anything because the school has no GPU lab
“Open this URL on your laptop.” 30 students run GPU computation simultaneously. They see it, touch it, modify it.
“Run screening on patient data” → 6-month legal review to upload to cloud → HIPAA/GDPR audit → $200K contract → Finally start work
Data never leaves the laptop. GPU compute runs in the browser. Compliance by architecture, not by contract.
On the same GPU, a browser running a compute shader is 159× faster than PyTorch on sequential workloads — because PyTorch launches 22,500 separate GPU tasks where we launch one.
We tested on the same Apple M2 Pro: WebGPU at 46.2 generations per second versus PyTorch MPS at 0.29 on a 1,500-step financial simulation. Same GPU, same memory, no hardware tricks. On a Tesla T4, JAX with loop fusion gets 13× over PyTorch CUDA; fusion helps, but our hand-fused shader still leads. An ablation (fused vs. unfused shader on the same hardware) isolates 2.18× from fusion alone. PyTorch's own compiler crashes when it tries to fuse at this scale (RecursionError at 1,000 steps, out of memory at 5,000 steps).
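The headline multiplier is just the ratio of the two measured throughputs reported above:

```javascript
// Speedup implied by the throughput figures reported in this post.
const webgpuGenPerSec = 46.2;  // WebGPU compute shader, Apple M2 Pro
const pytorchGenPerSec = 0.29; // PyTorch MPS backend, same GPU

const speedup = webgpuGenPerSec / pytorchGenPerSec;
console.log(`${speedup.toFixed(0)}x`); // prints "159x"
```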
Run real GPU benchmarks on your hardware, right now, in your browser.