I am using PyTorch for federated learning experiments. My experiments involve 50 datasets, each with its own model, so I have to run multiple ML model experiments in parallel.
The code for training an ML model is shared here:
def train(dataloader, model, loss_fn, optimizer, device):
    num_batches = len(dataloader)  # total number of observations divided by batch size
    model.train()
    model.to(device)
    total_loss = 0
    for batch, (X, y) in enumerate(dataloader):
        # X is the covariates and y is the pseudo values in the batch
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        total_loss += float(loss.item())
    total_loss /= num_batches
    return total_loss

As you can see, the PyTorch tensors X and y and the model are moved to the cuda:0 device. However, this process still takes 100% of the CPU.
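A minimal, self-contained sketch of the kind of check that confirms this placement (a dummy model and random data, not my real training objects) looks like this:

import torch
import torch.nn as nn

# Toy stand-ins for the real model and batch; only meant to confirm device placement.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 1).to(device)
X = torch.randn(32, 10, device=device)
y = torch.randn(32, 1, device=device)

print(next(model.parameters()).device)  # expected: cuda:0
print(X.device, y.device)               # expected: cuda:0 cuda:0

pred = model(X)
print(pred.device)                      # the forward pass also stays on the GPU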
I have already tried setting this configuration:
torch.set_num_threads(2)

Also, on the NVIDIA FLARE side, I have restricted the simulator to 5 threads. Still, the CPU on the server is at 100%.
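My understanding is that torch.set_num_threads() only caps PyTorch's intra-op thread pool. A more complete cap (sketch below; the values are just examples) would also set the OpenMP/MKL environment variables before torch is imported and limit the inter-op pool:

import os

# These must be set before `import torch`, otherwise they have no effect.
os.environ["OMP_NUM_THREADS"] = "2"
os.environ["MKL_NUM_THREADS"] = "2"

import torch

torch.set_num_threads(2)          # intra-op parallelism (CPU kernels)
torch.set_num_interop_threads(2)  # inter-op parallelism between operators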
controller = FederatedAvg(
    patience=args.patience,
    modeldir=args.modeldir,
    reg=args.reg,
    lr=args.lr,
    optimizer=args.optimizer,
    epochs=args.epochs,
    partition=args.partition,
    batch_size=args.batch_size,
    model=args.model,
    device=args.device,
    dataset=args.dataset,
    num_clients=args.n_parties,
    num_rounds=args.comm_round,
    seed=args.init_seed,
    logdir=args.logdir,
    run_id=args.run_id,
    arguments=args,
)

job.simulator_run(ws, n_clients=args.n_parties, threads=5, log_config=os.path.abspath(config_path))

As a result, when I run multiple ML methods concurrently, the 100% CPU usage means only one process makes progress at a time while the other runs wait for that experiment to finish. However, I need to run multiple methods concurrently.
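For context, I start the different methods as separate processes, roughly like the simplified sketch below (run_experiment.py and the method names are placeholders for my actual scripts). In practice each process saturates the server CPU, so they effectively run one after another:

import subprocess

# Placeholders: run_experiment.py and the method list stand in for my actual scripts.
methods = ["fedavg", "fedprox", "scaffold"]

procs = [
    subprocess.Popen(["python", "run_experiment.py", "--model", m])
    for m in methods
]

# Each child process pushes the server CPU to 100%,
# so the runs proceed one at a time instead of in parallel.
for p in procs:
    p.wait()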
What is the best practice to make these experiments run faster? How can I improve this?
