Run ComfyUI on Any Cloud GPU: a 5090 Today, an H100 Tomorrow, Same Setup
Here's the move almost nobody who runs ComfyUI in the cloud gets to make: rent a 5090 this afternoon to iterate on a workflow, snapshot it, then resume that exact setup on an H100 tomorrow night for a heavy video render, then drop down to a cheap 4090 on a third provider for an overnight batch. Same custom nodes. Same model checkpoints. Same workflow JSON. Nothing re-downloaded, nothing reinstalled, nothing broken.
Most people can't do that today. They're locked to one box, or one provider, or one region. The moment they chase a cheaper or more available card somewhere else, they pay the full setup tax again. And the worst part isn't the re-download. It's that when they finally rebuild on the new machine, the saved workflow that worked yesterday throws a wall of red missing-node errors, because the nodes came back at different versions than the ones it was built against.
That gap, the one between "locked to a box" and "my exact studio on whatever card is cheapest," is the whole subject of this post. I dug into how people on r/comfyui and r/StableDiffusion actually talk about moving between GPUs, and the pattern underneath all of it is that cross-provider ComfyUI isn't really a storage problem. It's a version-resolution problem. The data is the easy half. The hard half is making your environment come back the same.
TL;DR: To run ComfyUI on any cloud GPU and actually keep your setup, you need more than persistent storage. A network volume keeps your files but stays locked to one region and one provider, and a fresh instance still reinstalls custom nodes at whatever version it resolves, so your workflow won't load. Real portability means snapshotting the whole environment (custom nodes at their pinned commits, the venv, and your models) and restoring it byte-for-byte on a different GPU class or provider. That's the difference between "my data moved" and "my studio moved."
Table of contents
- Why "just use a network volume" only half-works
- The part that actually breaks: version resolution
- What moves with you vs what's lost
- How cross-provider portability works when it's done right
- A worked scenario: 5090 to H100 to a cheap batch card
- Why video gen makes this matter even more
- What you should actually do
- Where Aquanode fits
Why "just use a network volume" only half-works
If you've spent any time in cloud ComfyUI threads, you already know the standard answer to "my pod wiped my models": use a network volume. It's good advice as far as it goes. A Next Diffusion guide on running ComfyUI with a RunPod network volume walks you through it, and the upside is real: your volume retains all ComfyUI files, models, and workflows even after a pod is stopped or deleted, which saves you from re-downloading and reconfiguring everything.
That solves the most visceral pain, the one a RunPod ComfyUI + Flux guide addresses by telling you to attach a persistent volume in the first place: by default a pod's container storage is ephemeral, and once you stop or terminate the pod that storage is wiped. People lose 20GB of checkpoints and a tree of custom nodes to a single accidental terminate. So yes, persist the volume.
But here's where the network-volume answer quietly stops working. RunPod's own network volumes documentation spells out the catch that most people skim past:
"Attaching a single network volume constrains worker deployments to that volume's datacenter, which may limit GPU availability."
Read that again with the cross-provider goal in mind. Your persistence is welded to one datacenter. Data on the volume doesn't sync across regions, so the cheap 4090 that just freed up in another region, or the H100 you want for one weekend on a different provider entirely, can't see it without a manual re-transfer. You persisted your data and lost your portability in the same move.
So the network volume is a real fix for the accidental-terminate problem and a non-fix for the switch-providers problem. It's a leash with a longer lead, not freedom.
The part that actually breaks: version resolution
Let's say you accept the friction and decide to move providers the manual way. You re-download your models to the new box, you git clone your custom nodes again, you open your saved workflow. And it doesn't load.
This is the failure that surprises people, and it's worth being precise about why it happens. ComfyUI workflows are bound to specific versions of specific custom nodes. A fresh install on a new instance doesn't resolve those nodes to the versions your workflow was built against. It resolves them to whatever is current. That's the core of what a breakdown of ComfyUI custom-node version hell lays out: sharing or reproducing a workflow needs the exact version of every custom node, because changes aren't always backward compatible, and a prebuilt or freshly-installed ComfyUI may not match the pinned setup you actually had working.
The sharpest framing of this I've seen comes from that same thread of community writing, and it's the line this whole post turns on:
"The cross-provider restore is not a data problem; it's a version-resolution problem. The node version that worked on your volume isn't the version the managed service installs."
That's it. That's the thing that makes "just move your files" a lie. Your files can be perfectly intact and your workflow still won't open, because the environment around the files drifted.
And this isn't a hypothetical edge case. It already happened at scale. When ComfyUI migrated to Nodes 2.0 in late 2024, a post-mortem on the transition documented the fallout in stark terms: past workflows could not be loaded, custom nodes were wiped out, and old setups broke when custom_nodes moved and dependency resolution failed. Tens of thousands of people watched their workflow libraries go unloadable at once. Anyone who'd been carefully pinning their setup, and then rebuilt fresh on a new cloud instance after the migration, got the worst version of this.
So when a creator hesitates to switch providers, the fear isn't "will my files transfer." The fear is "will my exact setup actually come back, or will I start from zero." Version resolution is what they're afraid of, even if they don't use that phrase.
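To make "pinning" concrete, here's a minimal sketch of recording which commit each custom node is sitting on before you leave a box. It assumes your custom nodes were installed with git clone into a single custom_nodes folder. The function names and the lock-file idea are mine for illustration, not something ComfyUI ships.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable, Dict


def git_head(repo: Path) -> str:
    """Return the commit hash currently checked out in `repo` (needs git on PATH)."""
    result = subprocess.run(
        ["git", "-C", str(repo), "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def lock_custom_nodes(
    custom_nodes_dir: Path,
    resolve: Callable[[Path], str] = git_head,
) -> Dict[str, str]:
    """Map each git-cloned custom node folder to its checked-out commit.

    `resolve` is injectable so the logic can be exercised without git.
    """
    lock: Dict[str, str] = {}
    for node_dir in sorted(custom_nodes_dir.iterdir()):
        # Only folders that are git checkouts can be pinned this way;
        # nodes copied in by hand are skipped and need separate handling.
        if node_dir.is_dir() and (node_dir / ".git").exists():
            lock[node_dir.name] = resolve(node_dir)
    return lock


def write_lock(lock: Dict[str, str], path: Path) -> None:
    """Save the mapping as JSON (hypothetical file format for this sketch)."""
    path.write_text(json.dumps(lock, indent=2, sort_keys=True) + "\n")
```

Run it against your custom_nodes folder and keep the resulting JSON next to your workflow files. It doesn't restore anything on its own, but it turns "whatever I had installed" into a list you can check a new box against.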
What moves with you vs what's lost
It helps to be concrete about what's actually at stake in a provider switch. Here's the split, based on how the common workarounds behave.
| Part of your studio | Network volume (region-locked) | Managed ComfyUI service | Full-environment snapshot |
|---|---|---|---|
| Model checkpoints + LoRAs | moves, but only within one region | re-downloaded fresh | moves, byte-for-byte |
| Custom nodes (the code) | re-cloned on new instance | reinstalled at their version | moves, at your pinned commits |
| Custom node versions | drift on reinstall | drift to platform's version | preserved exactly |
| The venv / Python deps | rebuilt | platform-controlled | moves intact |
| Saved workflow JSON | moves | moves, but may not load | moves, and loads |
| Across regions | no | n/a (their region) | yes |
| Across providers | no | no, you're locked in | yes |
The row that matters is the one most workarounds get wrong: custom node versions. The data rows are easy to satisfy. Everyone can move a 6GB checkpoint with enough patience. It's the version row, plus the cross-provider row, that decides whether your workflow opens on the other side. That's the whole game.
This is also why buying your own GPU keeps coming up as the exit. One creator described paying for a managed cloud until a price bump pushed them over the edge, and the reasoning was less about hourly rate than about ownership: money spent on cloud fees is just gone, while buying your own card gets you an asset. The discussion around the 5090-purchase decision shows up across r/StableDiffusion regularly. And the managed-service markup makes the math worse: RunComfy's pricing lists an H100 tier at $4.49/hr, while the same class of card runs under $2/hr on raw providers and a 4090 can be found near $0.34/hr. A 5090 is a genuinely good asset, right up until you need an H100 for a video model for one weekend, at which point owning one card has trapped you on one card. The goal isn't to own hardware. It's to stop losing your setup every time you reach for different hardware.
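To put rough numbers on that markup, here's a quick sketch using only the hourly rates quoted above. The 48-hour weekend and the 8-hour overnight batch are assumptions I picked for illustration, and the raw H100 figure is a ceiling, since the quoted rate is "under $2/hr."

```python
# Hourly rates quoted in the paragraph above (USD per hour).
MANAGED_H100 = 4.49      # managed-service H100 tier
RAW_H100_CEILING = 2.00  # "under $2/hr", so treat this as an upper bound
RAW_4090 = 0.34          # "near $0.34/hr"


def rental_cost(rate_per_hour: float, hours: float) -> float:
    """Total rental cost in USD, rounded to cents."""
    return round(rate_per_hour * hours, 2)


WEEKEND_HOURS = 48   # assumption: one full weekend of video rendering
OVERNIGHT_HOURS = 8  # assumption: one overnight batch

managed_weekend = rental_cost(MANAGED_H100, WEEKEND_HOURS)
raw_weekend = rental_cost(RAW_H100_CEILING, WEEKEND_HOURS)
overnight_batch = rental_cost(RAW_4090, OVERNIGHT_HOURS)

print(f"Managed H100 weekend: ${managed_weekend}")        # $215.52
print(f"Raw H100 weekend:     at most ${raw_weekend}")    # $96.0
print(f"Raw 4090 overnight:   about ${overnight_batch}")  # $2.72
```

Under those assumptions the same weekend costs more than twice as much on the managed tier, and the overnight batch costs less than three dollars. That gap only helps you if switching cards doesn't cost you the setup.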
How cross-provider portability works when it's done right
So what does it take to actually run ComfyUI on any cloud GPU and keep the studio? The mechanism is straightforward to describe even if it's fiddly to build.
Instead of persisting a directory and hoping the environment around it reconstructs, you snapshot the whole ComfyUI environment as one unit. For a real ComfyUI install that means more than the workspace folder, because ComfyUI puts its repo, its venv, its custom_nodes tree, and its models under its own install path, not in your home directory. The snapshot has to capture all of it.
Concretely, the parts that have to come back together are:
- The ComfyUI install itself, at the commit you were running.
- The venv, so your Python dependencies are the exact ones, not whatever pip resolves today.
- Every custom node at its pinned commit, not the current tip of each repo.
- Your models and LoRAs, byte-for-byte.
- Your saved workflow JSONs.
Then on restore, each of those paths goes back to its original location on the new box, on a different GPU class or a different provider entirely. Because the node commits and the venv came along, the saved workflow resolves against the same versions it was built against. It opens. That's the difference between "my data moved" and "my studio moved," and it's exactly the version-resolution problem from earlier, solved by refusing to let the environment drift in the first place.
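As a rough sketch of what "snapshot as one unit, restore to the original paths" can look like, here's how you might build the commands for restic, an open-source backup tool that records absolute paths. This is an illustration under assumptions, not a description of any particular product's internals: the repository location, the /opt/ComfyUI install path, and the tag are hypothetical, and restic also expects a repository password (for example via the RESTIC_PASSWORD environment variable).

```python
from typing import List, Sequence


def backup_cmd(repo: str, paths: Sequence[str], tag: str) -> List[str]:
    """Build a `restic backup` command covering every path in one snapshot."""
    return ["restic", "-r", repo, "backup", "--tag", tag, *paths]


def restore_cmd(repo: str, tag: str, target: str = "/") -> List[str]:
    """Build a `restic restore` command for the newest snapshot with `tag`.

    restic stores absolute paths, so restoring with target "/" puts each
    file back where it originally lived.
    """
    return ["restic", "-r", repo, "restore", "latest",
            "--tag", tag, "--target", target]


# Hypothetical values for illustration only.
REPO = "/mnt/backup/comfy-repo"
INSTALL_PATH = "/opt/ComfyUI"  # repo, venv, custom_nodes, and models live under here

snapshot = backup_cmd(REPO, [INSTALL_PATH], tag="comfy-studio")
restore = restore_cmd(REPO, tag="comfy-studio")
print(" ".join(snapshot))
print(" ".join(restore))
```

You'd pass each list to subprocess.run(..., check=True) on the old box and the new box respectively. The design point is that the install path goes in whole, so the venv and the node checkouts travel together instead of being rebuilt.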
At Aquanode this is what the ComfyUI-aware snapshot does. We capture the full environment with restic (models, custom nodes at their pinned commits, the venv, and workflows) and restore it on any of nine GPU providers at raw prices. We validated it end to end: a real pause-and-resume of a ComfyUI deploy round-tripped the entire install, including the custom nodes at their commits, the venv, and a 2.13GB checkpoint, restored bit-for-bit on a different box, verified by checksum. One honest caveat so nobody is surprised: restore brings your environment back, it doesn't auto-launch the app. You relaunch ComfyUI once after the restore. Your setup is there, exactly; you press start.
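If you want to run a "verified by checksum" check yourself on any restore, the idea is simple: hash every file before you leave the old box, hash again on the new one, and compare. Here's a minimal sketch with Python's standard library. The function names are mine, and this isn't our internal validation script, just the same idea in a few lines.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so multi-GB checkpoints never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(root: Path) -> Dict[str, str]:
    """Map every file under `root` (by relative path) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restored_exactly(before: Dict[str, str], after: Dict[str, str]) -> bool:
    """True only if both boxes hold the same files with the same bytes."""
    return before == after
```

Build the manifest on the source box, save it as JSON, build it again after the restore, and compare. A single flipped byte in a checkpoint, or a file that didn't come across, shows up as a mismatch.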
A worked scenario: 5090 to H100 to a cheap batch card
Abstract is easy to nod along to, so here's the concrete version. Say you're building an image-plus-video workflow over a few days.
Day one, iteration on a 5090. You want fast feedback while you wire up nodes and tune prompts, and a 5090 at consumer-ish hourly rates is plenty for stills and quick previews. You install your custom nodes, pull your checkpoints and LoRAs, get the workflow loading clean. Before you tear down, you snapshot. The snapshot captures the install, the venv, the nodes at their commits, the models, the workflow.
Day two, the heavy render on an H100. Now you want to push a video pass that the 5090 would crawl through. You resume your snapshot onto an H100 on a different provider. Nothing re-downloads. The workflow opens because the node versions came with it. You relaunch ComfyUI, and you're rendering on the big card within minutes instead of rebuilding for an hour first. When the render's done, you snapshot again and let the expensive card go, because there's no reason to keep paying H100 rates while you sleep.
Day three, an overnight batch on a cheap 4090. You've got a queue of variations to crank through, no latency pressure, just throughput-per-dollar. You resume onto whatever 4090 is cheapest and available, possibly a third provider entirely. Same setup, same models, same nodes. You start the batch and walk away.
Three GPU classes, three providers, one studio that followed you across all of them. The only thing you re-did each time was press start. Compare that to the default loop, where each of those three steps would have meant re-downloading models, re-cloning nodes, and praying the workflow loads. Those are the eight-plus manual steps that one ComfyUI-on-cloud writeup documented, summed up verbatim as "too many manual steps, which means it's slow, error-prone, and easy to forget when you come back a week later."
Why video gen makes this matter even more
Cross-provider portability is nice-to-have for stills. For open video models it's closer to load-bearing.
WAN, LTX, Hunyuan: the open video-gen stack is genuinely GPU-heavy, and the usage shape is bursty. You don't render video continuously; you reach for a big card for a weekend, push a batch of clips, then you don't need that card again for a while. A 2026 breakdown of ComfyUI GPU choices makes the split concrete: a 5090 or 4090 wins on cost-efficiency for SDXL-class image work, but the heavy video models with their large VRAM appetite effectively force you onto an 80GB H100. That's the worst possible shape for "buy your own GPU," because you'd be buying an H100-class card to use it ten percent of the time. And it's the worst possible shape for "locked to one provider," because the big card you want might be cheapest on a different provider than the one you iterate on.
It's the best possible shape for portability, though. Rent the big card only for the burst, keep your setup the whole time, drop back to a cheap card for everything else. A persistent, portable studio is what turns "I'd love to play with video gen but the GPU math doesn't work" into "I rent the big card for the weekend and keep my setup." The heavier and burstier the workload, the more the portability pays for itself.
What you should actually do
If you run ComfyUI in the cloud and you ever switch GPUs, here's the practical shape of it, independent of which tool you use.
Stop thinking of the problem as storage. Storage is the half that's already solved by any network volume. The half that bites you is version resolution, so the question to ask of any setup is: when I restore on a different box, do my custom nodes come back at the exact commits my workflow expects, or do they reinstall at whatever's current? If it's the latter, you don't have portability, you have a faster way to start over.
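You can turn that question into a check you actually run. The sketch below assumes you recorded a mapping of custom-node folder name to commit hash on the old box, and that you can collect the same mapping on the new one. The node names and hashes here are made up for illustration.

```python
from typing import Dict, List


def find_drift(expected: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """List every way the restored custom nodes differ from the pinned set."""
    problems: List[str] = []
    for name, commit in sorted(expected.items()):
        if name not in actual:
            problems.append(f"{name}: missing")
        elif actual[name] != commit:
            problems.append(
                f"{name}: expected {commit[:8]}, found {actual[name][:8]}"
            )
    # Nodes present on the new box that were never pinned are also drift.
    for name in sorted(set(actual) - set(expected)):
        problems.append(f"{name}: not in lock file")
    return problems


# Hypothetical example: one node drifted, one extra node appeared.
pinned = {"node_a": "1111111111", "node_b": "2222222222"}
restored = {"node_a": "1111111111", "node_b": "3333333333", "node_c": "4444444444"}

for line in find_drift(pinned, restored):
    print(line)
# node_b: expected 22222222, found 33333333
# node_c: not in lock file
```

An empty list means the nodes came back at the commits your workflow was built against. Anything else tells you which node to look at before you waste time debugging a red missing-node wall.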
Second, check the boundaries before you commit. A persistence feature that only works within one provider, or one region, is fine until the day you want to chase a cheaper or more available card, which for cost-sensitive creators is most days. Region-locked and provider-locked persistence is the failure mode the network-volume crowd keeps rediscovering.
Third, for anything video-shaped, lean into renting big cards episodically instead of owning. The only reason owning wins is that switching loses your setup; remove that, and renting the right card for the moment is strictly better.
Where Aquanode fits
We built Aquanode for exactly this: keep your ComfyUI setup (models, custom nodes at their pinned commits, the venv, and your workflows) and bring it back on any GPU class or provider at raw GPU prices. Snapshot on a 5090, resume on an H100, drop to a cheap 4090 for batch, same studio throughout. Nine providers, no region lock, and the version-pinned restore is the part that makes your saved workflow actually open on the other side instead of throwing missing-node errors.
The honest bound, again, because trust matters with an audience that's been burned: restore brings the environment back, not a running app. You relaunch ComfyUI once after a resume. Everything's there, byte-for-byte; you press start. That's a narrower promise than "your running app follows you," and it's the one we can actually keep.
About the author
I'm Ansh, and I work on Aquanode. I'm not going to pretend I've shipped a thousand ComfyUI workflows, because I haven't. The founding insight here was an infrastructure one: every cloud GPU is stateless by design, and a ComfyUI setup is the most stateful thing a creator owns. I dug into r/comfyui and r/StableDiffusion to understand the real shape of the pain, and the version-resolution problem kept surfacing as the thing nobody's actually fixed. This is our attempt to fix it.
Sources
- How to Run ComfyUI on RunPod with a Network Volume, Next Diffusion (the canonical network-volume guide; volume retains files after pod delete, bound to its region)
- Network volumes documentation, RunPod (a single network volume constrains deployments to that volume's datacenter)
- Automate AI Image Workflows with ComfyUI + Flux on RunPod, RunPod (attach a persistent volume so multi-GB models survive between sessions instead of re-downloading)
- Taming ComfyUI Custom Nodes Version Hell, Extra Ordinary (restore is a version-resolution problem, not a data problem; reinstalling at a different version breaks the workflow)
- The "Truth" Behind the Massive ComfyUI Update: Transition to Nodes 2.0, kazumu (Nodes 2.0 migration; past workflows could not be loaded, custom nodes wiped out)
- One-command ComfyUI on Cloud GPUs, promptingpixels (the 8+ manual setup steps documented verbatim)
- ComfyUI on GPU Cloud 2026: RTX 5090 vs H100 for Stable Diffusion and Flux, Spheron (5090/4090 win on SDXL cost-efficiency; video models force the 80GB H100)
- RunComfy Pricing, RunComfy (the managed-service H100 markup vs raw GPU prices)
- Is it worth buying a 5090 for AI image generation?, r/StableDiffusion (the own-vs-rent decision under price pressure)