
Comments (12)

vermagit commented on August 23, 2024

NCr_v3 is SR-IOV enabled, while NCr and NCr_v2 are not. Some details on the bifurcation are here.
IB can be configured on the SR-IOV-enabled VM sizes with the OFED drivers, while the non-SR-IOV VM sizes require the ND drivers. This IB support is available appropriately on the CentOS-HPC VM images.
For Ubuntu, see the instructions here for installing both the OFED and ND drivers, as described in the docs.

In summary, the ND driver path needs to be enabled for the non-SR-IOV NCr and NCr_v2 VM sizes.
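
As a quick way to tell which path applies on a given VM (a hedged sketch, not an official procedure): on the SR-IOV-enabled sizes a Mellanox virtual function shows up in lspci, whereas on the non-SR-IOV sizes the RDMA device comes in over vmbus via the hv_network_direct module.

$ lspci | grep -i mellanox         # SR-IOV sizes: a ConnectX virtual function is listed
$ lsmod | grep hv_network_direct   # non-SR-IOV sizes: the ND vmbus module should be loaded
$ ls /sys/class/infiniband/        # either way, an IB device should appear here once the drivers are up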

vermagit commented on August 23, 2024

I've also updated the known issues section with this feedback, which should go live in a few hours.

abagshaw commented on August 23, 2024

@vermagit Thanks for the quick response! Unfortunately, this still doesn't work for me. Following the instructions for non-SR-IOV machine types on the standard Ubuntu 18.04 Gen 1 image from the Marketplace still doesn't show any IB device in lspci on an NC24r instance:

0000:00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
0000:00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
0000:00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
0000:00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
0001:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0002:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0003:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
0004:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

In /var/log/waagent.log I see:

2020/08/20 15:02:14.467584 INFO Daemon Found RDMA details. IPv4=172.16.1.6 MAC=00:15:5D:33:FF:0F
2020/08/20 15:02:14.474096 INFO Daemon RDMA: starting device processing.
2020/08/20 15:02:14.475070 INFO Daemon RDMA: provisioning Network Direct RDMA device.
2020/08/20 15:02:14.475938 INFO Daemon Updating DAPL configuration file
2020/08/20 15:02:14.476798 INFO Daemon RDMA: trying /etc/dat.conf
2020/08/20 15:02:14.477685 INFO Daemon RDMA: DAPL config is at: /etc/dat.conf
2020/08/20 15:02:14.479022 INFO Daemon RDMA: DAPL configuration is updated
2020/08/20 15:02:14.481711 INFO Daemon RDMA: failed to resolve module name. Use original name
2020/08/20 15:02:14.484483 ERROR Daemon Command: [modprobe hv_network_direct], return code: [1], result: [modprobe: FATAL: Module hv_network_direct not found in directory /lib/modules/5.3.0-1035-azure]
2020/08/20 15:02:14.485432 ERROR Daemon RDMA: failed to load module hv_network_direct
2020/08/20 15:02:14.486242 INFO Daemon RDMA: completed device processing.
2020/08/20 15:02:14.487043 INFO Daemon RDMA: device is set up
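
A hedged way to confirm what the log above is reporting, i.e. that this kernel simply does not ship the Network Direct module:

$ uname -r                                                  # 5.3.0-1035-azure in the log above
$ find /lib/modules/$(uname -r) -name 'hv_network_direct*'  # no output means the module is not present for the running kernel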

vermagit commented on August 23, 2024

It appears that support for the ND driver stack (the vmbus RDMA driver required on the non-SR-IOV VMs) was dropped in the 5.3 kernel shipped with the latest Ubuntu 18.04-LTS image in the Marketplace. This will be taken up with Canonical.

An older image with the 5.0 kernel (say Canonical UbuntuServer 18.04-LTS 18.04.202004080) has the missing module "hv_network_direct" and should work.
Ubuntu 20.04 also doesn't appear to show this issue (not tested, though).
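
For example, a test VM can be pinned to that image version with the Azure CLI (a sketch; the resource group and VM name below are placeholders):

$ az vm create \
    --resource-group myResourceGroup \
    --name nd-test-vm \
    --size Standard_NC24r \
    --image Canonical:UbuntuServer:18.04-LTS:18.04.202004080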

Thank you for reporting this issue. Please let us know here if the above workarounds work for you.

vermagit commented on August 23, 2024

This works with the Ubuntu18.04.202004080 image version, which has the 5.0.0-1036-azure kernel.

victoryang00 commented on August 23, 2024

The Ubuntu18.04.202004080 version, which has the 5.0.0-1036-azure kernel, is not working for me. It generates:
hv_network_direct_144_0: Unknown symbol ib_alloc_device (err -2)
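
If it helps with triage, the unknown-symbol failure is usually easiest to see in the kernel log and in the agent log (a hedged diagnostic sketch, not from the original report):

$ dmesg | grep -i hv_network_direct         # shows the "Unknown symbol ib_alloc_device" line when the load fails
$ grep -i rdma /var/log/waagent.log | tail  # shows waagent's attempt to load the ND module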

tbugfinder commented on August 23, 2024

Not sure what I'm missing again; however, using CentOS-HPC 7.8 Gen2 there's no Mellanox InfiniBand adapter available.

Standard NC24rs_v3:

# ibstatus
Fatal error:  device '*': sys files not found (/sys/class/infiniband/*/ports)

# lspci
0001:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
0002:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
0003:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
0004:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
38af:00:02.0 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
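
A hedged way to confirm what the replies below point to (the newer MOFED in this image no longer carries the ConnectX-3 / mlx4 driver, so nothing binds to the virtual function), assuming the MOFED userspace tools are installed:

$ ofed_info -s               # prints the installed MOFED release
$ lsmod | grep mlx4          # mlx4_core / mlx4_ib must be loaded for a ConnectX-3 Pro VF
$ ls /sys/class/infiniband/  # empty when no IB driver has bound to the VF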

jithinjosepkl commented on August 23, 2024

The latest MOFED that supports CX3-Pro is MOFED 5.0.
MOFED support matrix

Please see the notes in this document, and use a different version of the CentOS-HPC image.
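
For reference, the available CentOS-HPC image versions can be listed with the Azure CLI (a sketch; OpenLogic:CentOS-HPC is the publisher/offer pair referenced later in this thread):

$ az vm image list --publisher OpenLogic --offer CentOS-HPC --all --output table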

tbugfinder commented on August 23, 2024

Well, it's just not really expected, as that's the "only" InfiniBand-enabled V100 GPU SKU. So I'd appreciate a decent OS compatibility matrix (CentOS 7.9, CentOS 8.2, 8.3).

vermagit commented on August 23, 2024

@tbugfinder: Thanks for your feedback.
We understand that the lack of support for older MOFED in newer VM images can be frustrating, especially if your workload depends on these SKUs.
Both V100 SKUs, NCv3 and NDv2, have CX3-Pro, which is not supported by the latest Mellanox OFED (>=5.1). These latest MOFED releases are on the latest CentOS-HPC images. The document @jithinjosepkl points out above lists some VM images that can support your scenario. Do 7.6, 7.7, or 8.1 not meet your needs?
Would it be possible to build your own custom image for your scenario using the scripts in this repo (replacing the MOFED with the latest one as per the guidance here) but on the CentOS version of your choice?
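
If that route is taken, here is a minimal sketch of the MOFED swap step, assuming a CentOS 7.x VM and a MOFED 5.0-series tarball downloaded from Mellanox (the exact build in the filename is a placeholder):

$ tar xzf MLNX_OFED_LINUX-5.0-<build>-rhel7.7-x86_64.tgz
$ cd MLNX_OFED_LINUX-5.0-<build>-rhel7.7-x86_64
$ sudo ./mlnxofedinstall --add-kernel-support   # rebuilds the driver packages against the running kernel
$ sudo /etc/init.d/openibd restart              # reload the IB stack with the newly installed drivers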

tbugfinder commented on August 23, 2024

Indeed, there are two issues.
a) My bad, I didn't check the compatibility matrix.
b) Initially I started off with CentOS-7.6, which had a kernel not supporting the configuration (IIRC); after upgrading the kernel I was fine with CentOS-HPC 7.7 (Aug-2020). However, newer versions of the CentOS-HPC 7.7 image might include "breaking changes".
So that's also a learning curve on my end; now we are prepared with improved testing and can either downgrade OFED or upgrade the kernel.

abcdabcd987 commented on August 23, 2024

Could we reopen this issue? I ran into similar issues on NC24r.

With Ubuntu, the first problem I encountered was the one @abagshaw mentioned above: modprobe: FATAL: Module hv_network_direct not found. Per @vermagit's suggestion, I downgraded the image to Ubuntu18.04.202004080. Then I ran into the same problem @victoryang00 had, i.e. hv_network_direct_144_0: Unknown symbol ib_alloc_device. I was able to fix that problem by downgrading the kernel to 4.15.0-1106-azure.
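
A hedged sketch of what that kernel downgrade can look like on Ubuntu 18.04, assuming the 4.15.0-1106-azure packages are still available in the archive (package names and GRUB handling may differ):

$ sudo apt-get update
$ sudo apt-get install linux-image-4.15.0-1106-azure linux-modules-extra-4.15.0-1106-azure
$ sudo apt-mark hold linux-image-4.15.0-1106-azure   # keep unattended upgrades from replacing it
# adjust the GRUB default entry to boot the older kernel, then reboot and verify:
$ uname -r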

After downgrading the kernel, the RDMA NIC eth1 finally has an IP address. However, ibstatus hangs for a few seconds before reporting a timeout:

$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
cat: /sys/class/infiniband/mlx4_0/ports/1/gids/0: Connection timed out
default gid: unknown
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 0: <unknown>
rate: 5 Gb/sec (1X DDR)
link_layer: Ethernet

Besides the timeout, the link layer and rate are different from what the document said:

This interface allows the RDMA-capable instances to communicate over an InfiniBand (IB) network, [...], FDR rates for H16r, H16mr, and other RDMA-capable N-series virtual machines

Also ibv_rc_pingpong and other connectivity tests all failed.
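
For context, a basic two-node connectivity test with ibv_rc_pingpong looks roughly like this (a sketch; mlx4_0 is the device name from the ibstatus output above, and <server-ip> is a placeholder for the other VM's RDMA address):

$ ibv_rc_pingpong -d mlx4_0               # run first on the server-side VM
$ ibv_rc_pingpong -d mlx4_0 <server-ip>   # then on the client-side VM, pointing at the server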

Then I tried reimaging to CentOS-HPC 7.4. ibstatus worked, but ibv_rc_pingpong still wouldn't. Another problem is that once I reboot the machine, hv_network_direct fails to load; dmesg says hv_network_direct: disagrees about version of symbol vmbus_driver_unregister.

Now I have downgraded the image to OpenLogic:CentOS-HPC:7.1:7.1.201803010, and the problem is exactly the same as with Ubuntu on 4.15.0-1106-azure.

Any help? Thanks!
