Git Product home page Git Product logo

Comments (12)

jacobmgn avatar jacobmgn commented on June 2, 2024

Hello,
We are investigating the issue that you've reported.
In order to troubleshoot this issue further, we will need you to open a support case from your AWS account.
Thanks!

from aws-fpga.

ljp0101 avatar ljp0101 commented on June 2, 2024

No support plan so can't create a ticket.

These are UTC times and IPs for instances where we had problems:

Jan 31 00:19:31.771 ip-12-63-201-12.ec2.internal
Jan 31 00:19:28.497 ip-12-63-14-64.ec2.internal
Jan 29 23:07:06.589 ip-12-63-11-159.ec2.internal

Jan 31 17:21:55.295 ip-11-63-1-53.ec2.internal
Jan 30 19:31:36.423 ip-11-63-11-107.ec2.internal

Happy to share account numbers if there's a private way to do so.

from aws-fpga.

jacobmgn avatar jacobmgn commented on June 2, 2024

The easiest path forward for us to support you on this would be for you to open an AWS support case under the AWS account where you are experiencing this issue.

That will allow us to provide support privately

from aws-fpga.

jacobmgn avatar jacobmgn commented on June 2, 2024

If you can also provide

  • the affected instance IDs
  • the region(s) you noted the issue in
  • a rough timeline of when you started seeing the issue and when (if) it stopped occurring

that will assist us in identifying the issue that you are reporting

from aws-fpga.

ljp0101 avatar ljp0101 commented on June 2, 2024

Us-east-1 and all incidents & times as per prior post. I have no good way to correlate the historic IPs with instance IDs.

I supplemented our logging with instance IDs so will update if and when we see again (but I also changed the retry strategy so may have quashed our problem).

from aws-fpga.

AWSjoeluc avatar AWSjoeluc commented on June 2, 2024

Hello!

We have been investigating this issue. Were unable to resolve this issue with your new retry strategy, or have there been further failures? Is there anything else we can help with on this issue?

from aws-fpga.

ljp0101 avatar ljp0101 commented on June 2, 2024

The bitfile loading problem resolved with the code below. I may be seeing the same underlying problem manifest elsewhere though as I've also been having occasional communication problems with the fpga starting around the same time. Will close this issue and reopen if I can collect actionable data.

   // load bitfile
   for (uint i = 0; i < 10; ++i) {
      const int result_check = fpga_mgmt_load_local_image_sync(0, const_cast<char*>(bitfile_name.c_str()), 100, 30, &fpga_mgmt_image_info);
      if (result_check == 0) {
         break;      
      }    
      if (i == 10) {
         b_error = true;
         error_message << "Fpga::Load: fpga_mgmt_load_local_image_sync() failed; ";
         error_message << "AWS error code: " << result_check << "; ";
         error_message << "AWS error message: " << fpga_mgmt_strerror_long(result_check);
         return;
      }
      usleep(10000);
   }

from aws-fpga.

ljp0101 avatar ljp0101 commented on June 2, 2024

Checked my logs and I have an instance ID & time (UTC).

You'll note I'm an idiot and did bad error checking (retry checking should be against 9, not 10). This was a lucky mistake though as I know the FPGA worked properly despite the error response to the load function.

The left timestamps are from datadog so not completely accurate and right timestamps where available are system clock. I see the time spent in the bitfile loading function is 1708469173623 - 1708469153936 = 19.7 seconds, which is odd because I expect the elapsed time should be 10 attempts * 100 sub attempts (function argument) * 30ms wait (function argument) = 30 seconds.

Feb 20 22:46:03.015 [fpga-xxx] 1708469153936 InstanceId: i-0919988b3ae3b675d
Feb 20 22:46:03.015 [fpga-xxx] 1708469154006 State: BOOT
Feb 20 22:46:03.015 Loading fpga with bitfile named 'agfi-xxx'
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 1
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 2
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 3
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 4
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 5
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 6
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 7
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 8
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 9
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 10
Feb 20 22:46:14.022 [fpga-xxx] 1708469173623 State: LOAD_READY
Feb 20 22:46:16.022 [fpga-xxx] 1708469175447 Command: LOAD

   // load bitfile
   for (uint i = 0; i < 10; ++i) {
      const int result_check = fpga_mgmt_load_local_image_sync(0, const_cast<char*>(bitfile_name.c_str()), 100, 30, &fpga_mgmt_image_info);
      if (result_check == 0) {
         break;      
      }
      std::cout << "WARNING: fpga_mgmt_load_local_image_sync() failed, attempt " << i + 1 << std::endl;
      if (i == 10) {
         b_error = true;
         error_message << "Fpga::Load | fpga_mgmt_load_local_image_sync() failed; ";
         error_message << "AWS error code: " << result_check << "; ";
         error_message << "AWS error message: " << fpga_mgmt_strerror_long(result_check);
         return;
      }
      usleep(10000);
   }

from aws-fpga.

czfpga avatar czfpga commented on June 2, 2024

Hi,

I'm glad to hear that the FPGA functioned properly in this case. I'm wondering what's the returned status when the load command does not return as loaded in your retry loop. As I noticed that the function might skip the built-in retry and retry delay if the return status is neither succeed or busy (

/** Wait until the status is "loaded" or timeout */
while (!done) {
ret = fpga_mgmt_describe_cmd(opt->slot_id, &tmp_info, 0); /** flags==0 */
status = (ret == 0) ? tmp_info.status : FPGA_STATUS_END;
if (status == FPGA_STATUS_LOADED) {
/** Sanity check the afi_id */
ret = (strncmp(opt->afi_id, tmp_info.ids.afi_id, sizeof(tmp_info.ids.afi_id))) ?
FPGA_ERR_FAIL : 0;
fail_on(ret, out, "AFI ID mismatch: requested afi_id=%s, loaded afi_id=%s",
opt->afi_id, tmp_info.ids.afi_id);
done = true;
} else if (status == FPGA_STATUS_BUSY) {
fail_on(ret = (retries >= timeout_tmp) ? -ETIMEDOUT : 0, out,
"fpga_mgmt_describe_local_image timed out, status=%s(%d), retries=%u",
FPGA_STATUS2STR(status), status, retries);
retries++;
msleep(delay_msec_tmp);
} else {
/**
* Catch error status cases here.
* -the caller can then display the error status and cause upon return.
*/
ret = tmp_info.status_q;
goto out;
}
}
) assuming you're the function coming from the fpga_mgmt library.

from aws-fpga.

ljp0101 avatar ljp0101 commented on June 2, 2024

I initially observed -110, which is the stderr code for a connection timeout.

Fpga::Load: fpga_mgmt_load_local_image_sync() failed; AWS error code: -110; AWS error message: A time out error is usually spurious and the request can be retried. If

from aws-fpga.

czfpga avatar czfpga commented on June 2, 2024

Thanks, I think that's reason the retry time observed didn't match to the calculation. But timeout can be ignored and request should be retried. Please let me know if you have any other questions.

from aws-fpga.

ljp0101 avatar ljp0101 commented on June 2, 2024

I did retry the fpga_mgmt function call 10x with 10ms separation between function calls and it kept failing so it doesn't look like I can handle the error appropriately in any way besides terminating the instance and starting fresh.

I'll close the issue again because terminating the instance is good enough for my use case if the incident rate doesn't go above a few times per week but it seems to me like there's some more fundamental issue to be addressed as the underlying connection timeout error doesn't make much sense to me while nothing else is happening except loading the bitfile.

from aws-fpga.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.