Migrations at Aquanode across VMs

So i am training a translation model for a low-resource akkadain language. But using a single gpu whole time, doesn't make sense.

TLDR

I needed to do some quality translations on my data, hence using Qwen/Qwen3-30B-A3B-Instruct-2507

I will just have migrations on my data directory: /root/data

I processed my stuff.

Took manual snapshot, and closing the instance.

migration-data-process

My training can happen on 5090, so i will continue it there.

To get the processed data here, just get the migrations. All data available on same path /root/data

Now let's start training.

Sometimes your shell may exit or ssh connection drops, that can kill the process associated with shell.

Using tmux to create a persistent process, so we can detah from shell

tmux new -s train

source .venv/bin/activate

python train.py --lr 1e-4 --batch_size 4

train-phase-1

Ctrl + B, then D (to detach, doesn't kill the process, now you can safely exit)

tmux attach -t train

Mean time, let's invoke migrations for /root choose time based on how long this training will go for.

After the desired training is done, or even if closed

Here now after getting things done, i closed my VM.

And now let's resume our training

Get the migrations for the last snapshot.

restore

It took 5 minutes for the 33GB of Data.

Acivate venv

tmux new -s train
source .venv/bin/activate
python train.py --lr 1e-5 --batch_size 4 --resume_from_checkpoint latest

train-phase-2

Uploading model to HuggingFace

hf auth login

hf upload <user-name/repo> <path to upload from>
hf upload Arpit-Bansal/Akkadian-experiments models/