Comments (9)
After further communication, this specific issue turns out to be caused by the system not having enough resources: the process was killed by the OS.
The ValueError: I/O operation on closed file comes from running inside the Google Colab environment.
Closing, as that is figured out.
We did, however, find an issue with Pipe + PipeHandler:
When sending heartbeats in PipeHandler (Line 323), we use timeout=None, which is translated into the default_request_timeout of 5 seconds. As a result, send_to_peer(msg) can block the entire _try_read thread for up to 5 seconds, even though it is supposed to be a fast polling loop.
While the loop is blocked, the follow-up checking and heartbeat-sending logic is also delayed, making the system more prone to timeouts.
I will submit a PR to fix it.
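For illustration only, here is a minimal sketch of the pattern (this is not the actual NVFlare PipeHandler source; the class, the pipe object, and the intervals are made-up placeholders):

import time

# Simplified sketch of the issue described above. This is NOT the actual
# NVFlare PipeHandler code; the class, the pipe object, and the intervals
# are placeholders for illustration only.
class ToyPipeHandler:
    def __init__(self, pipe, read_interval=0.1, heartbeat_interval=5.0):
        self.pipe = pipe                      # assumed to offer receive() and send_to_peer(msg, timeout)
        self.read_interval = read_interval    # the loop is meant to poll this often
        self.heartbeat_interval = heartbeat_interval
        self._last_heartbeat = 0.0

    def _try_read(self):
        """Fast polling loop: check for incoming messages and periodically send a heartbeat."""
        while True:
            msg = self.pipe.receive()
            if msg is not None:
                self.handle(msg)

            now = time.time()
            if now - self._last_heartbeat >= self.heartbeat_interval:
                # Problem: with timeout=None the underlying send falls back to a
                # 5-second default request timeout, so this one call can stall
                # the whole polling loop (and delay later heartbeats):
                #   self.pipe.send_to_peer("heartbeat", timeout=None)
                # One possible fix: pass a short explicit timeout so a slow peer
                # cannot block the loop.
                self.pipe.send_to_peer("heartbeat", timeout=0.1)
                self._last_heartbeat = now

            time.sleep(self.read_interval)

    def handle(self, msg):
        # Placeholder for message processing.
        pass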
The model persistor is on the server side and the client runner is on the client side, so I'm not sure they are related. Can you share the client config files? There is a flag, launch_once, which should be set to True to avoid restarting the process every round.
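For reference, a minimal sketch of the relevant component in config_fed_client (assuming the job uses the Client API with SubprocessLauncher; the script name and the surrounding executor config are placeholders that depend on your setup and NVFlare version):

components = [
  {
    id = "launcher"
    path = "nvflare.app_common.launchers.subprocess_launcher.SubprocessLauncher"
    args {
      script = "python3 custom/your_training_script.py"
      launch_once = true
    }
  }
]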
@falibabaei thanks for the report!
Can you provide more information?
Your job's config_fed_server / config_fed_client / meta.json, custom client code, etc.
Thanks!
Hi,
Thank you for your reply. I set the flag launch_once=true in my config_fed_client.conf, but I still have the issue. I am at the first step of converting my code to the NVFlare version using the decorator approach, and for debugging purposes I am running it with the NVFlare simulator. Unfortunately the custom code is not in a public repository, so I cannot share the repository, but my train function is:
@flare.train
def train(input_model=None):
    """
    Train UNet model with training images (X_train) and masks (y_train)
    and user-defined parameters.
    Use test dataset only to visualise performance before final evaluation.
    X data can be 3 channel or of different dimension (UNet will automatically be adapted).
    :param X_train: train images
    :param y_train_onehot: corresponding onehot encoded train masks
    :param X_test: test images
    :param y_test_onehot: corresponding onehot encoded test masks
    :param model_path: (Path) path to which model will be saved
    :return: model saved to provided path
    """
    start = time.time()
    # Check that provided path can be used for saving model
    assert (
        type(model_path) == PosixPath
        and model_path.suffix == ".h5py"
    ), f"Provided model path '{model_path}' is not a Path ending in '.h5py'!"
    # (5) loads model from NVFlare
    for k, v in input_model.params.items():
        model.get_layer(k).set_weights(v)
    cfg["epochs"] = 1
    _logger.info("Training model...")
    history = model.fit(
        X_train,
        y_train_onehot,
        batch_size=cfg["batch_size"],
        verbose=2,
        epochs=cfg["epochs"],
        validation_data=(X_test, y_test_onehot),
        callbacks=[CustomEpochLogger()],
    )
    # print('history is ', history.history)
    metrics = {
        metric: values[-1]
        for metric, values in history.history.items()
    }
    mlflow_writer.log_metrics(
        metrics=metrics, step=input_model.current_round
    )
    duration = time.time() - start
    _logger.info(
        f"Elapsed time during model training:\t{round(duration / 60, 2)} min"
    )
    # Save trained Model
    # model.save_weights(model_path)
    model.save(model_dir)
    _logger.info(f"Model saved to '{model_path}'.")
    save_model_info(model=model, model_dir=model_dir)
    # save json config to model directory
    # cp_conf(model_dir)
    _logger.info(
        f"Saved configuration of training run to {model_dir}"
    )
    # (3) send back the model to nvflare server
    output_model = flare.FLModel(
        params={
            layer.name: layer.get_weights()
            for layer in model.layers
        },
        metrics=history.history,
    )
    return output_model


while flare.is_running():
    input_model = flare.receive()
    input_model.current_round
    print(f"current_round={input_model.current_round}")
    # (optional) print system info
    system_info = flare.system_info()
    print(f"NVFlare system info: {system_info}")
    train(input_model=input_model)
My configurations for the client and the server are attached:
config_fed_client.txt
config_fed_server.txt
@chesterxgchen I noticed that even with the flag launch_once=true, if I use the simulator with the flag -t 1, the process restarts every round.
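For reference, the command I use is roughly the following (the job folder, workspace path, and client count here are placeholders):

nvflare simulator jobs/my_job -w /tmp/nvflare_workspace -n 1 -t 1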
@chesterxgchen I did what you asked, but I do not know why I got this message after the first round:
File "custom/train_UNet.py", line 436, in main
print(f"Currently in round {current_round} out of a total of {total_rounds} rounds")
ValueError: I/O operation on closed file.
The print comes from here:
while flare.is_running():
    input_model = flare.receive()
    current_round = input_model.current_round
    total_rounds = input_model.total_rounds
    print(f"Currently in round {current_round} out of a total of {total_rounds} rounds")
Strange, I haven't seen this before. @YuanTingHsieh can you help?
@falibabaei if you don't mind, you can email me your code and we can check whether we can reproduce the problem. It shouldn't matter what data is used, correct?
@chesterxgchen Thank you. I have added you and @YuanTingHsieh to the repository. There is some data in the dataset to test the code.