Comments (12)
Hello,
We are investigating the issue that you've reported.
In order to troubleshoot this issue further, we will need you to open a support case from your AWS account.
Thanks!
from aws-fpga.
No support plan so can't create a ticket.
These are UTC times and IPs for instances where we had problems:
Jan 31 00:19:31.771 ip-12-63-201-12.ec2.internal
Jan 31 00:19:28.497 ip-12-63-14-64.ec2.internal
Jan 29 23:07:06.589 ip-12-63-11-159.ec2.internal
Jan 31 17:21:55.295 ip-11-63-1-53.ec2.internal
Jan 30 19:31:36.423 ip-11-63-11-107.ec2.internal
Happy to share account numbers if there's a private way to do so.
from aws-fpga.
The easiest path forward for us to support you on this would be for you to open an AWS support case under the AWS account where you are experiencing this issue.
That will allow us to provide support privately
from aws-fpga.
If you can also provide
- the affected instance IDs
- the region(s) you noted the issue in
- a rough timeline of when you started seeing the issue and when (if) it stopped occurring
that will assist us in identifying the issue that you are reporting
from aws-fpga.
Us-east-1 and all incidents & times as per prior post. I have no good way to correlate the historic IPs with instance IDs.
I supplemented our logging with instance IDs so will update if and when we see again (but I also changed the retry strategy so may have quashed our problem).
from aws-fpga.
Hello!
We have been investigating this issue. Were unable to resolve this issue with your new retry strategy, or have there been further failures? Is there anything else we can help with on this issue?
from aws-fpga.
The bitfile loading problem resolved with the code below. I may be seeing the same underlying problem manifest elsewhere though as I've also been having occasional communication problems with the fpga starting around the same time. Will close this issue and reopen if I can collect actionable data.
// load bitfile
for (uint i = 0; i < 10; ++i) {
const int result_check = fpga_mgmt_load_local_image_sync(0, const_cast<char*>(bitfile_name.c_str()), 100, 30, &fpga_mgmt_image_info);
if (result_check == 0) {
break;
}
if (i == 10) {
b_error = true;
error_message << "Fpga::Load: fpga_mgmt_load_local_image_sync() failed; ";
error_message << "AWS error code: " << result_check << "; ";
error_message << "AWS error message: " << fpga_mgmt_strerror_long(result_check);
return;
}
usleep(10000);
}
from aws-fpga.
Checked my logs and I have an instance ID & time (UTC).
You'll note I'm an idiot and did bad error checking (retry checking should be against 9, not 10). This was a lucky mistake though as I know the FPGA worked properly despite the error response to the load function.
The left timestamps are from datadog so not completely accurate and right timestamps where available are system clock. I see the time spent in the bitfile loading function is 1708469173623 - 1708469153936 = 19.7 seconds, which is odd because I expect the elapsed time should be 10 attempts * 100 sub attempts (function argument) * 30ms wait (function argument) = 30 seconds.
Feb 20 22:46:03.015 [fpga-xxx] 1708469153936 InstanceId: i-0919988b3ae3b675d
Feb 20 22:46:03.015 [fpga-xxx] 1708469154006 State: BOOT
Feb 20 22:46:03.015 Loading fpga with bitfile named 'agfi-xxx'
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 1
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 2
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 3
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 4
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 5
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 6
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 7
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 8
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 9
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 10
Feb 20 22:46:14.022 [fpga-xxx] 1708469173623 State: LOAD_READY
Feb 20 22:46:16.022 [fpga-xxx] 1708469175447 Command: LOAD
// load bitfile
for (uint i = 0; i < 10; ++i) {
const int result_check = fpga_mgmt_load_local_image_sync(0, const_cast<char*>(bitfile_name.c_str()), 100, 30, &fpga_mgmt_image_info);
if (result_check == 0) {
break;
}
std::cout << "WARNING: fpga_mgmt_load_local_image_sync() failed, attempt " << i + 1 << std::endl;
if (i == 10) {
b_error = true;
error_message << "Fpga::Load | fpga_mgmt_load_local_image_sync() failed; ";
error_message << "AWS error code: " << result_check << "; ";
error_message << "AWS error message: " << fpga_mgmt_strerror_long(result_check);
return;
}
usleep(10000);
}
from aws-fpga.
Hi,
I'm glad to hear that the FPGA functioned properly in this case. I'm wondering what's the returned status when the load command does not return as loaded in your retry loop. As I noticed that the function might skip the built-in retry and retry delay if the return status is neither succeed or busy (
aws-fpga/sdk/userspace/fpga_libs/fpga_mgmt/fpga_mgmt.c
Lines 538 to 564 in 863d963
fpga_mgmt
library.from aws-fpga.
I initially observed -110, which is the stderr code for a connection timeout.
Fpga::Load: fpga_mgmt_load_local_image_sync() failed; AWS error code: -110; AWS error message: A time out error is usually spurious and the request can be retried. If
from aws-fpga.
Thanks, I think that's reason the retry time observed didn't match to the calculation. But timeout can be ignored and request should be retried. Please let me know if you have any other questions.
from aws-fpga.
I did retry the fpga_mgmt function call 10x with 10ms separation between function calls and it kept failing so it doesn't look like I can handle the error appropriately in any way besides terminating the instance and starting fresh.
I'll close the issue again because terminating the instance is good enough for my use case if the incident rate doesn't go above a few times per week but it seems to me like there's some more fundamental issue to be addressed as the underlying connection timeout error doesn't make much sense to me while nothing else is happening except loading the bitfile.
from aws-fpga.
Related Issues (20)
- Using ARM processor in f1 FPGA instance HOT 16
- DDR performance drop with specific address access pattern HOT 4
- PCIM AXI W Channel hangs with wReady stuck at 0 HOT 2
- Time delay in filling up BRAM using DMA HOT 3
- Clock routing error [Place 30-838] HOT 4
- DMA read is not correct HOT 1
- [XRT] ERROR: failed to load xclbin: Invalid argument / xocl_read_axlf_helper: interface uuids do not match HOT 4
- Inserting ILA is improving the performance HOT 4
- Update HOT 8
- Name port does not exist for instance f1_inst HOT 6
- PCIM ARID all bits can be used? HOT 1
- peer xclbin download err: -11 HOT 2
- Linux Kernel 6 support for xdma HOT 4
- Does aws have u200? HOT 2
- Files missing from AMI HOT 2
- DRAM page mode HOT 1
- Accessing the AXI-Lite via the fpga_pci_attach call... HOT 1
- Debug embedded microblaze using XVC JTAG in AWS FPGA shell HOT 9
- Instance status checks failures HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from aws-fpga.