Comments (174)

geerlingguy avatar geerlingguy commented on May 5, 2024 6

@elFarto ah, thank you! The detail level is perfect, and I think I'll be able to take a stab soon. Note that I've put the GPU testing on pause just to catch up on some of the other boards I'm testing... which stinks because a GTX 750 Ti just arrived, and I'm itching to test it.

But all this work isn't worth much if I can't ultimately (a) sum up the things that actually worked and show those things to people, and (b) keep my brain from fragmenting too terribly much :D

I'm still hopeful some other brave adventurers may soon receive CM4s/IO Boards so they can get to testing things too... I'm definitely hitting my limits debugging this GPU issue and while I'm glad to learn new things (never used UART before!), I'm also thinking if someone else had access to the darn CM4, they'd probably be able to debug this in an hour (whereas I'm sure it'll take an afternoon for me).

But I do like the process of learning, and I know some people unfortunately don't 'debug in the open' like this (that's why I'm doing it—so other people don't have to go through the first week of troubleshooting since I've already done it for them!).

from raspberry-pi-pcie-devices.

dtischler avatar dtischler commented on May 5, 2024 4

The suspense! :-)

geerlingguy avatar geerlingguy commented on May 5, 2024 4

@elFarto - Oh hey, when I did what I showed earlier (though your comment would seem to also work the same), I got further! I got to at least /* set up the gfx ring */! (But not as far as /* set up the compute queues - allocate horizontally across pipes */).

volkertb avatar volkertb commented on May 5, 2024 4

Your hard work is appreciated, man!

geerlingguy avatar geerlingguy commented on May 5, 2024 3

It has arrived.

But I have been full bore on a few other things today, so it will have to wait. I set it next to the CM4 IO Board so it can start getting 'familiar' with it.

sinetek avatar sinetek commented on May 5, 2024 3

Cards for the Mac market also shouldn't have that I/O section, because they don't use the whole BIOS system at all - and not even the x86 set but that was a long time ago. So Mac branded cards (they do exist..) are also an option.

geerlingguy avatar geerlingguy commented on May 5, 2024 3

Heh... I'm going to have to pause it for a little bit to get caught up on a couple other projects. I want to see this through to the bitter end, though!

clarkalastair avatar clarkalastair commented on May 5, 2024 2

Some AMD cards can be flashed to Mac/EFI mode too, if you’re that way inclined.

geerlingguy avatar geerlingguy commented on May 5, 2024 2

Dangit, same issue, so it seems I'm hitting:

[   73.734651] [drm] Not enough PCI address space for a large BAR.

It then keeps trying to initialize, but stops at [ 73.739897] [drm] Chained IB support enabled! and won't progress any further; meanwhile, the entire Pi kinda locks itself up. That message comes from here: https://github.com/raspberrypi/linux/blob/69b14a2e6d4e840c7609370dbf0bac847c3bb15c/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c#L1062

So maybe even if there's not enough address space for a large BAR, could it still work with a 'small BAR'? After all, it would be much safer to have no BAR in this time of Covid.

Alternatively, I'm building from the default branch of the raspberrypi/linux project on GitHub (rpi-5.4.y) — is it possible I need to be on a newer version? It looks like that's the latest version of that file, at least.

elFarto avatar elFarto commented on May 5, 2024 2

Yes, you can set it larger than the RAM size, since we're only allocating address space, not RAM itself, and the Pi has 32GiB worth of address space.

The "link down" issue might just be that the driver isn't waiting long enough. Currently it's hardcoded (in the pcie-brcmstb.c file) to wait 100ms for the link to establish.

Not sure on the other error, it seems to occur just before the 'register mmio base' line is printed, which seems to have something to do with the PCI BARs. Could you paste the dmesg for that boot, specifically the BAR mappings it was assigned?

elFarto avatar elFarto commented on May 5, 2024 2

That looks great, up to the lock up of course. I was a bit worried Linux might not use that address range sensibly, wrt. 32-bit addresses, but it looks like it's done the right thing.

geerlingguy avatar geerlingguy commented on May 5, 2024 2

All right, two more notes:

First, it seems to be between these two breakpoints where the failure occurs: https://github.com/raspberrypi/linux/blob/69b14a2e6d4e840c7609370dbf0bac847c3bb15c/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c#L1135-L1162 (noting that I believe this chip is a CHIP_POLARIS12).

And second, when the Pi locks up, the fan on the GPU goes slower (and stays slower) than the baseline medium speed that it runs the entire time before initialization. Possible power supply issue? I have another PCIe riser on its way—the first one I tried powers the card but I could never see it with lspci.

Edit: Another round of debugging. Here is the line where everything goes belly up: https://github.com/raspberrypi/linux/blob/69b14a2e6d4e840c7609370dbf0bac847c3bb15c/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c#L1139

err = request_firmware(&adev->gfx.mec2_fw, fw_name, adev->dev);

And the fw_name == amdgpu/polaris12_mec2_2.bin

elFarto avatar elFarto commented on May 5, 2024 2

That's odd: request_firmware just loads the file from disk, it's not uploading it to the GPU, so it's strange that this one would break when all the previous ones were fine.

Maybe try putting a printk after the request_firmware, followed by an mdelay(100); to make sure that last printk got out before the crash.

elFarto avatar elFarto commented on May 5, 2024 2

I've finally gotten around to writing up the steps for debugging the kernel over a serial connection. I think using this method is the only way we'll know for sure exactly where it's crashing, and whether it's random or just some artefact of the buffering/network connection.

https://gist.github.com/elFarto/1f9ba845e5ba3539a2c914aae1f4a1e4

geerlingguy avatar geerlingguy commented on May 5, 2024 1

All right, so I have a terminal open with dmesg --follow, and another one where I run modprobe amdgpu (as root):

[  173.558495] [drm] amdgpu kernel modesetting enabled.
[  173.558693] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x600000000 -> 0x60fffffff
[  173.558699] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x610000000 -> 0x6101fffff
[  173.558704] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x618000000 -> 0x61803ffff
[  173.558790] pci 0000:00:00.0: enabling device (0000 -> 0002)
[  173.558804] amdgpu 0000:01:00.0: enabling device (0000 -> 0002)
[  173.559150] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
[  173.559176] [drm] register mmio base: 0x18000000
[  173.559179] [drm] register mmio size: 262144
[  173.559183] [drm] PCI I/O BAR is not found.
[  173.559188] [drm] PCIE atomic ops is not supported
[  173.559201] [drm] add ip block number 0 <vi_common>
[  173.559205] [drm] add ip block number 1 <gmc_v8_0>
[  173.559209] [drm] add ip block number 2 <tonga_ih>
[  173.559213] [drm] add ip block number 3 <gfx_v8_0>
[  173.559217] [drm] add ip block number 4 <sdma_v3_0>
[  173.559221] [drm] add ip block number 5 <powerplay>
[  173.559225] [drm] add ip block number 6 <dm>
[  173.559229] [drm] add ip block number 7 <uvd_v6_0>
[  173.559233] [drm] add ip block number 8 <vce_v3_0>
[  173.805864] ATOM BIOS: 113-36764-U61
[  173.805941] [drm] UVD is enabled in VM mode
[  173.805945] [drm] UVD ENC is enabled in VM mode
[  173.805951] [drm] VCE enabled in VM mode
[  173.805976] [drm] GPU posting now...
[  173.926955] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[  173.932337] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0x610000000-0x6101fffff 64bit pref]
[  173.932346] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0x600000000-0x60fffffff 64bit pref]
[  173.932390] pci 0000:00:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[  173.932407] pci 0000:00:00.0: BAR 9: no space for [mem size 0xc0000000 64bit pref]
[  173.932412] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0xc0000000 64bit pref]
[  173.932420] amdgpu 0000:01:00.0: BAR 0: no space for [mem size 0x80000000 64bit pref]
[  173.932425] amdgpu 0000:01:00.0: BAR 0: failed to assign [mem size 0x80000000 64bit pref]
[  173.932431] amdgpu 0000:01:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[  173.932435] amdgpu 0000:01:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[  173.932440] pci 0000:00:00.0: PCI bridge to [bus 01]
[  173.932449] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6180fffff]
[  173.932460] pci 0000:00:00.0: PCI bridge to [bus 01]
[  173.932467] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6180fffff]
[  173.932473] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[  173.932516] [drm] Not enough PCI address space for a large BAR.
[  173.932523] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[  173.932542] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[  173.932570] amdgpu 0000:01:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[  173.932576] amdgpu 0000:01:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[  173.932582] [drm] Detected VRAM RAM=2048M, BAR=256M
[  173.932586] [drm] RAM width 64bits GDDR5
[  173.932754] [TTM] Zone  kernel: Available graphics memory: 1944480 KiB
[  173.932759] [TTM] Initializing pool allocator
[  173.932780] [TTM] Initializing DMA pool allocator
[  173.932854] [drm] amdgpu: 2048M of VRAM memory ready
[  173.932864] [drm] amdgpu: 2848M of GTT memory ready.
[  173.932930] [drm] GART: num cpu pages 65536, num gpu pages 65536
[  173.934178] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[  173.937749] [drm] Chained IB support enabled!

At that moment, the Pi just completely locks up. So... something going on here that's killing the Pi, maybe a power issue? I'm going to pop the card in a couple different adapters and see if I can overcome it. Otherwise it could be a driver/SoC problem, and that ain't going to be fun.
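
As a side calculation on the BAR messages in the log above, converting the sizes to MiB makes the failed resize easier to read. This is only a minimal Python sketch: the addresses and sizes are copied from the dmesg lines, and the helper name `span_size` is made up for illustration.

```python
# Sizes and addresses copied from the dmesg output above.
# span_size is an illustrative helper, not part of any driver.

MIB = 1024 * 1024


def span_size(start: int, end: int) -> int:
    """Size in bytes of an inclusive address range [start, end]."""
    return end - start + 1


# What the kernel asked for while trying to grow the BAR.
requested_window = 0xC0000000   # "BAR 9: no space for [mem size 0xc0000000 ...]"
requested_bar0 = 0x80000000     # "BAR 0: no space for [mem size 0x80000000 ...]"

# What was in place before the attempt and restored afterwards.
restored_window = span_size(0x600000000, 0x617FFFFFF)
restored_bar0 = span_size(0x600000000, 0x60FFFFFFF)

for label, size in [
    ("requested bridge window", requested_window),
    ("requested BAR 0", requested_bar0),
    ("restored bridge window", restored_window),
    ("restored BAR 0", restored_bar0),
]:
    print(f"{label:<26}{size // MIB:>5} MiB")
```

The requested BAR 0 matches the 2048M of VRAM the driver reports, while the bridge window that ends up restored is 384 MiB, one eighth of the window that was requested.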

geerlingguy avatar geerlingguy commented on May 5, 2024 1

Okay, I tried reversing the ranges:

ranges = <0x02000000 0x0 0x40000000 0x5 0x00000000 0x0 0x40000000 0x02000000 0x2 0x00000000 0x6 0x00000000 0x2 0x00000000>;

Still getting:

[    0.900648] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    0.900667] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    0.900755] brcm-pcie fd500000.pcie:      MEM 0xffffffffffffffff..0x003ffffffe -> 0x0040000000
[    0.900802] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x07ffffffff -> 0x0200000000
[    0.900858] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    0.948053] brcm-pcie fd500000.pcie: link up, 5 GT/s x1 (SSC)
[    0.948352] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    0.948367] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.948383] pci_bus 0000:00: root bus resource [mem 0x600000000-0x7ffffffff] (bus address [0x200000000-0x3ffffffff])
[    0.948433] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    0.948651] pci 0000:00:00.0: PME# supported from D0 D3hot
[    0.952103] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    0.952305] pci 0000:01:00.0: [1002:699f] type 00 class 0x030000
[    0.952420] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff 64bit pref]
[    0.952462] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x001fffff 64bit pref]
[    0.952489] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x00ff]
[    0.952516] pci 0000:01:00.0: reg 0x24: [mem 0x00000000-0x0003ffff]
[    0.952544] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[    0.952573] pci 0000:01:00.0: enabling Extended Tags
[    0.952847] pci 0000:01:00.0: supports D1 D2
[    0.952858] pci 0000:01:00.0: PME# supported from D1 D2 D3hot D3cold
[    0.952921] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:00.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
[    0.953063] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.953141] pci 0000:01:00.1: [1002:aae0] type 00 class 0x040300
[    0.953230] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[    0.953338] pci 0000:01:00.1: enabling Extended Tags
[    0.953528] pci 0000:01:00.1: supports D1 D2
[    0.956851] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    0.956891] pci 0000:00:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    0.956905] pci 0000:00:00.0: BAR 8: no space for [mem size 0x00100000]
[    0.956916] pci 0000:00:00.0: BAR 8: failed to assign [mem size 0x00100000]
[    0.956934] pci 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[    0.956971] pci 0000:01:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[    0.957005] pci 0000:01:00.0: BAR 5: no space for [mem size 0x00040000]
[    0.957015] pci 0000:01:00.0: BAR 5: failed to assign [mem size 0x00040000]
[    0.957028] pci 0000:01:00.0: BAR 6: no space for [mem size 0x00020000 pref]
[    0.957038] pci 0000:01:00.0: BAR 6: failed to assign [mem size 0x00020000 pref]
[    0.957050] pci 0000:01:00.1: BAR 0: no space for [mem size 0x00004000 64bit]
[    0.957060] pci 0000:01:00.1: BAR 0: failed to assign [mem size 0x00004000 64bit]
[    0.957071] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0100]
[    0.957081] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0100]
[    0.957094] pci 0000:00:00.0: PCI bridge to [bus 01]
[    0.957117] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    0.957214] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
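
For reference, here is a rough Python sketch of how the ranges property above breaks down. It assumes the usual PCI host bridge layout of seven cells per entry (three cells for flags plus a 64-bit PCI address, two cells for the CPU address, two cells for the size); the function name is made up, and this is not how the kernel itself parses the property.

```python
# Cells copied from the ranges property above.
CELLS = [
    0x02000000, 0x0, 0x40000000, 0x5, 0x00000000, 0x0, 0x40000000,
    0x02000000, 0x2, 0x00000000, 0x6, 0x00000000, 0x2, 0x00000000,
]


def decode_pci_ranges(cells):
    """Split a flat cell list into (flags, pci_addr, cpu_addr, size) tuples.

    Assumes 3 child address cells (flags + 64-bit PCI address),
    2 parent (CPU) address cells and 2 size cells per entry.
    """
    if len(cells) % 7:
        raise ValueError("expected a multiple of 7 cells")
    entries = []
    for i in range(0, len(cells), 7):
        flags, pci_hi, pci_lo, cpu_hi, cpu_lo, size_hi, size_lo = cells[i:i + 7]
        entries.append((
            flags,
            (pci_hi << 32) | pci_lo,
            (cpu_hi << 32) | cpu_lo,
            (size_hi << 32) | size_lo,
        ))
    return entries


for flags, pci, cpu, size in decode_pci_ranges(CELLS):
    print(f"flags={flags:#010x} cpu={cpu:#x}..{cpu + size - 1:#x} -> pci={pci:#x}")
```

The second entry decodes to CPU 0x600000000..0x7ffffffff mapped to PCI 0x200000000, which matches the second MEM line in the log. The first entry decodes to a 1 GiB window at CPU 0x500000000, but the log prints its CPU address as 0xffffffffffffffff, which suggests that address could not be translated.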

elFarto avatar elFarto commented on May 5, 2024 1

Well, at least we got something out of all that. I guess it's time to start sticking printf's into the kernel to figure out exactly where it's stopping.

geerlingguy avatar geerlingguy commented on May 5, 2024 1

Hmm... I recompiled everything with ftrace enabled, and I can see contents after mounting it:

# mount -t tracefs tracefs /sys/kernel/tracing
# cd /sys/kernel/tracing
# cat available_tracers
hwlat blk function_graph wakeup_dl wakeup_rt wakeup preemptirqsoff preemptoff irqsoff function nop

But how would I be able to trace everything that happens live when I run modprobe amdgpu?

It seems like to trace something that happens as a result of a module loading, you have to load the module, then you can start tracing it? (In this case though, it fails during the load.)

Edit: Answering for myself:

# echo function_graph > current_tracer
# echo 1 > tracing_on
# cat trace_pipe

Then in another session, run sudo modprobe amdgpu. I have a huuuuuuuuge file with all the data (10,000+ lines) from my last trace, so I'm going to need to find a way to get that shared.
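
The same steps could be scripted so the trace lands in a file straight away. A minimal Python sketch, assuming tracefs is already mounted at /sys/kernel/tracing as above and the script runs as root; the function names and the `tracing_dir` and `max_lines` parameters are illustrative.

```python
from pathlib import Path


def start_function_graph(tracing_dir="/sys/kernel/tracing"):
    """Select the function_graph tracer and switch tracing on.

    Mirrors `echo function_graph > current_tracer` and
    `echo 1 > tracing_on`. Returns the path of trace_pipe.
    """
    base = Path(tracing_dir)
    (base / "current_tracer").write_text("function_graph\n")
    (base / "tracing_on").write_text("1\n")
    return base / "trace_pipe"


def stream_to_file(pipe_path, out_path, max_lines=None):
    """Copy lines from trace_pipe to a file, flushing after each line.

    Stops after max_lines lines if given; returns the number copied.
    """
    count = 0
    with open(pipe_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line)
            dst.flush()
            count += 1
            if max_lines is not None and count >= max_lines:
                break
    return count
```

Calling `stream_to_file(start_function_graph(), "amdgpu-trace.txt")` in one session and running modprobe amdgpu in another would leave the trace in a single file that is easier to share.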

geerlingguy avatar geerlingguy commented on May 5, 2024 1

I dropped some printk() statements in the code:

        if (adev->gfx.ce_feature_version >= 46 &&
            adev->gfx.pfp_feature_version >= 46) {
                adev->virt.chained_ib_support = true;
                DRM_INFO("Chained IB support enabled!\n");
                printk(KERN_INFO "Zero\n");
        } else
                adev->virt.chained_ib_support = false;

        printk(KERN_INFO "One\n");
        snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", chip_name);
        printk(KERN_INFO "Two\n");
        err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
        if (err)
                goto out;
        printk(KERN_INFO "Three\n");
        err = amdgpu_ucode_validate(adev->gfx.rlc_fw);
        printk(KERN_INFO "Four\n");
        rlc_hdr = (const struct rlc_firmware_header_v2_0 *)adev->gfx.rlc_fw->data;
        printk(KERN_INFO "Five\n");
        adev->gfx.rlc_fw_version = le32_to_cpu(rlc_hdr->header.ucode_version);
        adev->gfx.rlc_feature_version = le32_to_cpu(rlc_hdr->ucode_feature_version);

And it looks like it's getting through that section at least...

[   45.929949] [drm] Chained IB support enabled!
[   45.929956] Zero
[   45.929960] One
[   45.929963] Two
[   45.930867] Three
[   45.930875] Four
[   45.930879] Five

Going to keep dropping in a bunch more printk's!
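
With more markers in place, a small script can report the last one that made it into the log before the lock-up. A minimal Python sketch, assuming the dmesg output has been captured as text; the function name and the regular expression are illustrative.

```python
import re

# Matches lines like "[   45.929956] Zero": a timestamp, then the message.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def last_marker(log_text, markers):
    """Return (timestamp, marker) for the last marker line seen, or None."""
    wanted = set(markers)
    found = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(2) in wanted:
            found = (float(match.group(1)), match.group(2))
    return found


# Sample copied from the output above.
SAMPLE = """\
[   45.929949] [drm] Chained IB support enabled!
[   45.929956] Zero
[   45.929960] One
[   45.929963] Two
[   45.930867] Three
[   45.930875] Four
[   45.930879] Five
"""

print(last_marker(SAMPLE, ["Zero", "One", "Two", "Three", "Four", "Five"]))
```

Since the final message may not be flushed before a hard lock-up, the result is best read as the latest point known to have been reached, not necessarily the exact line that failed.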

geerlingguy avatar geerlingguy commented on May 5, 2024 1

All right, more notes:

Using a PCE164P-N03 (version 003 or 888, either one now that I have two), it either resulted in the Pi booting with 'link down' and the fan on the GPU going to 100%, or it resulted in that kernel dump that is pictured in the previous comment.

Using my IO crest SI-PEX60016 PCIe expansion board, I get the exact same behavior that I get with the GPU plugged directly into the Pi:

  • Boots up fine, fan goes to like 40% or so
  • Can see the card via lspci (NOTE: Only seems to work in slot P2, not in slot P1 for some reason)
  • When I run modprobe amdgpu, it gets stuck at the line memset(hpd, 0, mec_hpd_size); (here in gfx_v8_0.c).

In the IO crest, it also stops at different points during the initialization process.

So I guess my next question is: why do all the generic GPU risers / external power cards seem to fail (except for the IO crest one, which seems to work with the card in slot 2 but fails in a similar non-consistent fashion as the Pi itself)?

dtischler avatar dtischler commented on May 5, 2024

Will be curious if you can allocate enough BAR for this card. Looking forward to finding out! :-)

geerlingguy avatar geerlingguy commented on May 5, 2024

Ha, well I half wonder if I'll need a CM4 8GB... which I have not ordered (I have a couple more 4GB models on the way but they aren't shipping for a couple weeks!).

geerlingguy avatar geerlingguy commented on May 5, 2024

Over in the raspberrypi/linux project, it looks like this commit (raspberrypi/linux@54db4b2) has increased the default BAR allocation to 1GB. Nice!

Rucadi avatar Rucadi commented on May 5, 2024

Cards for the Mac market also shouldn't have that I/O section, because they don't use the whole BIOS system at all - and not even the x86 set but that was a long time ago. So Mac branded cards (they do exist..) are also an option.

Hi,

I think that potentially you can patch the driver to ignore this problem.

The problem is at line 1423 of the file drivers/gpu/drm/radeon/radeon_device.c, which in the latest driver version is:

rdev->rio_mem = pci_iomap(rdev->pdev, i, rdev->rio_mem_size);

The pci_iomap is returning NULL, however, I think that you don't really need to do the iomap, since it's only needed in the case of AMD legacy cards.

In fact, I think you could just continue executing by removing this if (although, as far as I can see, it isn't what stops the driver from initializing):
if (rdev->rio_mem == NULL)
	DRM_ERROR("Unable to find PCI I/O BAR\n");

And as far as I can see, the code already handles rio_mem being NULL, as you can see in

amdgpu_atombios.c:1988
int amdgpu_atombios_init(struct amdgpu_device *adev)

It falls back to MMIO (as the I/O BAR region that the driver would use is also mapped in the BAR)

/* needed for iio ops */
if (adev->rio_mem) {
	atom_card_info->ioreg_read = cail_ioreg_read;
	atom_card_info->ioreg_write = cail_ioreg_write;
} else {
	DRM_DEBUG("PCI I/O BAR is not found. Using MMIO to access ATOM BIOS\n");
	atom_card_info->ioreg_read = cail_reg_read;
	atom_card_info->ioreg_write = cail_reg_write;
}

Also, I can see that this check is not only here, but in various places.

I can't test it myself, but I think it's worth a try.

However, I think the problem is more likely located in:

radeon_get_bios in radeon_bios.c

That function is definitely returning true, since in the log I can see that it's expecting an evergreen GPU, but failing at it.

That check is done in this part of the code:

if (!memcmp(rdev->bios + tmp, "ATOM", 4) ||
    !memcmp(rdev->bios + tmp, "MOTA", 4)) {
	rdev->is_atom_bios = true;
} else {
	rdev->is_atom_bios = false;
}

If is_atom_bios is true, the code in evergreen.c will continue initializing.

The problem is that is_atom_bios is set to false, so I think it's reading garbage. (I would love to debug it).
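
The "reading garbage" theory could be checked offline against a ROM dump. Below is a minimal Python sketch of the same signature test; the function name is made up, and `offset` stands in for `tmp` in the driver code, whose derivation isn't shown here, so the caller has to supply it. The sample bytes are synthetic, not a real ROM image.

```python
def is_atom_bios(rom: bytes, offset: int) -> bool:
    """Mirror the memcmp check: 'ATOM' or 'MOTA' at the given offset."""
    signature = rom[offset:offset + 4]
    return signature in (b"ATOM", b"MOTA")


# Synthetic example data: 16 zero bytes, the signature, then padding.
fake_rom = bytes(16) + b"ATOM" + bytes(12)

print(is_atom_bios(fake_rom, 16))  # signature present at this offset
print(is_atom_bios(fake_rom, 8))   # only zero bytes at this offset
```

If a dump taken on a known-good machine passes this test at the expected offset while the bytes the driver sees on the Pi do not, that would point at the ROM read rather than the card.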

Also, I'm pretty sure that it's not failing early because if it detects an incorrect BIOS signature or is unable to allocate the bios map, it returns false in a previous if, and it causes a return code of -EINVAL;

I don't know if this is useful or my conclusions are utter garbage, since I'm by no means an expert in this topic.

sinetek avatar sinetek commented on May 5, 2024

@Rucadi Good analysis – in practice I fear it won't be this easy.
But it might just be. It's worth also checking how the Raptor Talos II folks are handling this case.

geerlingguy avatar geerlingguy commented on May 5, 2024

Hmm, first plugins aren't going super well...

[    1.010470] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    1.010490] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    1.010547] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x067fffffff -> 0x00c0000000
[    1.010601] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    1.329328] brcm-pcie fd500000.pcie: link down

And some plugging and unplugging and rebooting later, and sometimes it just halts boot as it hits the following:

[Image: IMG_2667]

I have a couple other powered PCIe adapters I may try. And maybe just for grins, try out the plain unpowered adapter too... at least I know it works with all my other devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

Well that's better... so far so good with the plain (unpowered) adapter:

$ sudo lspci -vvv
...
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon RX 550/550X] (rev c7) (prog-if 00 [VGA controller])
	Subsystem: Sapphire Technology Limited Lexa PRO [Radeon RX 550/550X] (Lexa PRO [Radeon RX 550])
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 255
	Region 0: Memory at 640000000 (64-bit, prefetchable) [disabled] [size=256M]
	Region 2: Memory at 650000000 (64-bit, prefetchable) [disabled] [size=2M]
	Region 4: I/O ports at <unassigned> [disabled]
	Region 5: Memory at 600000000 (32-bit, non-prefetchable) [disabled] [size=256K]
	[virtual] Expansion ROM at 600040000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [200 v1] #15
	Capabilities: [270 v1] #19
	Capabilities: [2b0 v1] Address Translation Service (ATS)
		ATSCap:	Invalidate Queue Depth: 00
		ATSCtl:	Enable-, Smallest Translation Unit: 00
	Capabilities: [2c0 v1] Page Request Interface (PRI)
		PRICtl: Enable- Reset-
		PRISta: RF- UPRGI- Stopped+
		Page Request Capacity: 00000020, Page Request Allocation: 00000000
	Capabilities: [2d0 v1] Process Address Space ID (PASID)
		PASIDCap: Exec+ Priv+, Max PASID Width: 10
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [320 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [370 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us

01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
	Subsystem: Sapphire Technology Limited Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin B routed to IRQ 255
	Region 0: Memory at 600060000 (64-bit, non-prefetchable) [disabled] [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0

dtischler avatar dtischler commented on May 5, 2024

No kernel / boot errors reaching this point, when using the unpowered adapter?

geerlingguy avatar geerlingguy commented on May 5, 2024

And dmesg logs:

[    1.011261] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    1.011281] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    1.011338] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x067fffffff -> 0x00c0000000
[    1.011392] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    1.059289] brcm-pcie fd500000.pcie: link up, 5 GT/s x1 (SSC)
[    1.059578] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    1.059593] pci_bus 0000:00: root bus resource [bus 00-ff]
[    1.059610] pci_bus 0000:00: root bus resource [mem 0x600000000-0x67fffffff] (bus address [0xc0000000-0x13fffffff])
[    1.059663] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    1.059884] pci 0000:00:00.0: PME# supported from D0 D3hot
[    1.063495] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    1.063695] pci 0000:01:00.0: [1002:699f] type 00 class 0x030000
[    1.063809] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff 64bit pref]
[    1.063851] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x001fffff 64bit pref]
[    1.063879] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x00ff]
[    1.063907] pci 0000:01:00.0: reg 0x24: [mem 0x00000000-0x0003ffff]
[    1.063935] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[    1.063965] pci 0000:01:00.0: enabling Extended Tags
[    1.064241] pci 0000:01:00.0: supports D1 D2
[    1.064253] pci 0000:01:00.0: PME# supported from D1 D2 D3hot D3cold
[    1.064317] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:00.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
[    1.064459] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    1.064545] pci 0000:01:00.1: [1002:aae0] type 00 class 0x040300
[    1.064635] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[    1.064745] pci 0000:01:00.1: enabling Extended Tags
[    1.064937] pci 0000:01:00.1: supports D1 D2
[    1.068388] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    1.068430] pci 0000:00:00.0: BAR 9: assigned [mem 0x640000000-0x657ffffff 64bit pref]
[    1.068444] pci 0000:00:00.0: BAR 8: assigned [mem 0x600000000-0x6000fffff]
[    1.068463] pci 0000:01:00.0: BAR 0: assigned [mem 0x640000000-0x64fffffff 64bit pref]
[    1.068501] pci 0000:01:00.0: BAR 2: assigned [mem 0x650000000-0x6501fffff 64bit pref]
[    1.068536] pci 0000:01:00.0: BAR 5: assigned [mem 0x600000000-0x60003ffff]
[    1.068556] pci 0000:01:00.0: BAR 6: assigned [mem 0x600040000-0x60005ffff pref]
[    1.068571] pci 0000:01:00.1: BAR 0: assigned [mem 0x600060000-0x600063fff 64bit]
[    1.068607] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0100]
[    1.068618] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0100]
[    1.068632] pci 0000:00:00.0: PCI bridge to [bus 01]
[    1.068650] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6000fffff]
[    1.068666] pci 0000:00:00.0:   bridge window [mem 0x640000000-0x657ffffff 64bit pref]
[    1.068767] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
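
For anyone wondering where the "4.000 Gb/s" and "63.008 Gb/s" figures in that log come from, here is a rough sketch of the arithmetic. It assumes the usual PCIe line encodings (8b/10b for 2.5 and 5 GT/s, 128b/130b for 8 GT/s) and integer Mb/s math; the function and table names are made up for illustration and are not taken from the kernel.

```python
# Payload bits / total bits per PCIe link speed, keyed by raw rate in MT/s.
# Assumption: 8b/10b for Gen 1/2 and 128b/130b for Gen 3.
ENCODING = {
    2500: (8, 10),
    5000: (8, 10),
    8000: (128, 130),
}


def link_bandwidth_mbps(rate_mtps: int, lanes: int) -> int:
    """Usable bandwidth in Mb/s for a link, using integer arithmetic."""
    payload_bits, total_bits = ENCODING[rate_mtps]
    per_lane = rate_mtps * payload_bits // total_bits
    return per_lane * lanes


if __name__ == "__main__":
    # What the CM4 slot provides: 5 GT/s, one lane.
    print(link_bandwidth_mbps(5000, 1) / 1000, "Gb/s")
    # What the card could use: 8 GT/s, eight lanes.
    print(link_bandwidth_mbps(8000, 8) / 1000, "Gb/s")
```

So the card gets roughly 1/16 of the bandwidth it was designed for, which is expected on this board and not an error in itself.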

Always with that silly I/O BAR. Well, let's go cross-compile the kernel and see what the amdgpu driver gives us...

And to @dtischler - no, no issues so far. My power supply seems to be happy to put out enough juice at least to get things started (and get the fan on the card moving).

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

All right, recompiled the kernel, now where does that get us:

[    4.194363] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x600000000 -> 0x60003ffff
[    4.194377] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: passed res_id (0) is not a memory bar
[    4.194435] pci 0000:00:00.0: enabling device (0000 -> 0002)
[    4.194464] amdgpu 0000:01:00.0: enabling device (0000 -> 0002)
[    4.344078] brcmfmac: brcmf_fw_alloc_request: using brcm/brcmfmac43456-sdio for chip BCM4345/9
[    4.357361] brcmfmac: brcmf_c_preinit_dcmds: Firmware: BCM4345/9 wl0: May 14 2020 17:26:08 version 7.84.17.1 (r871554) FWID 01-3d9e1d87
[    4.360945] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
[    4.360991] [drm] register mmio base: 0x00000000
[    4.361000] [drm] register mmio size: 262144
[    4.361009] [drm] PCI I/O BAR is not found.
[    4.361021] [drm] PCIE atomic ops is not supported
[    4.361044] [drm] add ip block number 0 <vi_common>
[    4.361053] [drm] add ip block number 1 <gmc_v8_0>
[    4.361062] [drm] add ip block number 2 <tonga_ih>
[    4.361070] [drm] add ip block number 3 <gfx_v8_0>
[    4.361078] [drm] add ip block number 4 <sdma_v3_0>
[    4.361087] [drm] add ip block number 5 <powerplay>
[    4.361096] [drm] add ip block number 6 <dm>
[    4.361104] [drm] add ip block number 7 <uvd_v6_0>
[    4.361112] [drm] add ip block number 8 <vce_v3_0>
[    4.609386] ATOM BIOS: 113-36764-U61
[    4.609527] [drm] UVD is enabled in VM mode
[    4.609536] [drm] UVD ENC is enabled in VM mode
[    4.609549] [drm] VCE enabled in VM mode
[    4.609574] [drm] GPU posting now...
[    4.729868] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[    4.729982] amdgpu 0000:01:00.0: Direct firmware load for amdgpu/polaris12_mc.bin failed with error -2
[    4.729997] mc: Failed to load firmware "amdgpu/polaris12_mc.bin"
[    4.730341] [drm:gmc_v8_0_sw_init [amdgpu]] *ERROR* Failed to load mc firmware!
[    4.730641] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v8_0> failed -2
[    4.730653] amdgpu 0000:01:00.0: amdgpu_device_ip_init failed
[    4.730666] amdgpu 0000:01:00.0: Fatal error during GPU init
[    4.730676] [drm] amdgpu: finishing device.
[    4.730763] ------------[ cut here ]------------
[    4.730773] sysfs group 'fw_version' not found for kobject '0000:01:00.0'
[    4.730821] WARNING: CPU: 2 PID: 163 at fs/sysfs/group.c:280 sysfs_remove_group+0x94/0xa0
[    4.730826] Modules linked in: amdgpu(+) brcmfmac brcmutil sha256_generic libsha256 i2c_algo_bit ttm vc4 cec cfg80211 v3d drm_kms_helper gpu_sched rfkill bcm2835_codec(C) bcm2835_isp(C) bcm2835_v4l2(C) v4l2_mem2mem raspberrypi_hwmon bcm2835_mmal_vchiq(C) snd_soc_core videobuf2_vmalloc videobuf2_dma_contig videobuf2_memops drm snd_bcm2835(C) snd_compress videobuf2_v4l2 snd_pcm_dmaengine videobuf2_common backlight drm_panel_orientation_quirks snd_pcm videodev mc snd_timer vc_sm_cma(C) snd syscopyarea sysfillrect sysimgblt fb_sys_fops rpivid_mem uio_pdrv_genirq uio i2c_dev ip_tables x_tables ipv6
[    4.730931] CPU: 2 PID: 163 Comm: systemd-udevd Tainted: G         C        5.4.74-v8gpu+ #1
[    4.730936] Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
[    4.730944] pstate: 80000005 (Nzcv daif -PAN -UAO)
[    4.730953] pc : sysfs_remove_group+0x94/0xa0
[    4.730961] lr : sysfs_remove_group+0x94/0xa0
[    4.730966] sp : ffffffc01165b790
[    4.730971] x29: ffffffc01165b790 x28: 0000000000000000 
[    4.730981] x27: 0000000000000000 x26: ffffffc0092ad198 
[    4.730990] x25: ffffff80f0474d70 x24: ffffffc00921d440 
[    4.730998] x23: ffffffc0092ad000 x22: 00000000ffffffff 
[    4.731005] x21: ffffff80f66468a0 x20: ffffffc0091cb6a8 
[    4.731013] x19: 0000000000000000 x18: 0000000000000004 
[    4.731021] x17: 0000000000000fff x16: 0000000000000009 
[    4.731028] x15: ffffff80f6d0b890 x14: ffffff80ef04cca8 
[    4.731035] x13: 0000000000000000 x12: ffffffc010fa5000 
[    4.731043] x11: ffffffc010ea1000 x10: ffffffc010fa5958 
[    4.731051] x9 : 0000000000000000 x8 : 0000000000000003 
[    4.731058] x7 : 0000000000000163 x6 : ffffffc01165b480 
[    4.731066] x5 : 0000000000000001 x4 : ffffff80f79c3150 
[    4.731074] x3 : 0000000000000006 x2 : 0000000000000007 
[    4.731081] x1 : e6d12f7aeb88b200 x0 : 0000000000000000 
[    4.731090] Call trace:
[    4.731099]  sysfs_remove_group+0x94/0xa0
[    4.731401]  amdgpu_ucode_sysfs_fini+0x28/0x38 [amdgpu]
[    4.731692]  amdgpu_device_fini+0x424/0x46c [amdgpu]
[    4.731988]  amdgpu_driver_unload_kms+0x54/0xa8 [amdgpu]
[    4.732297]  amdgpu_driver_load_kms+0x11c/0x178 [amdgpu]
[    4.732405]  drm_dev_register+0x144/0x1c8 [drm]
[    4.732738]  amdgpu_pci_probe+0xe0/0x178 [amdgpu]
[    4.732760]  pci_device_probe+0xb8/0x180
[    4.732769]  really_probe+0xe0/0x330
[    4.732776]  driver_probe_device+0x5c/0xf0
[    4.732783]  device_driver_attach+0x74/0x80
[    4.732790]  __driver_attach+0x64/0xe0
[    4.732800]  bus_for_each_dev+0x84/0xd8
[    4.732806]  driver_attach+0x30/0x40
[    4.732812]  bus_add_driver+0x188/0x1e8
[    4.732819]  driver_register+0x64/0x110
[    4.732828]  __pci_register_driver+0x58/0x68
[    4.733152]  amdgpu_init+0x70/0x7c [amdgpu]
[    4.733165]  do_one_initcall+0x54/0x2b8
[    4.733174]  do_init_module+0x5c/0x230
[    4.733181]  load_module+0x1ddc/0x2078
[    4.733188]  __do_sys_finit_module+0xd0/0xe8
[    4.733195]  __arm64_sys_finit_module+0x28/0x38
[    4.733207]  el0_svc_common.constprop.1+0x98/0x1a0
[    4.733215]  el0_svc_handler+0x34/0xa0
[    4.733223]  el0_svc+0x8/0x204
[    4.733231] ---[ end trace d9b9d6fba13c699e ]---

from raspberry-pi-pcie-devices.

Rucadi avatar Rucadi commented on May 5, 2024

Do you have firmware-amd-graphics installed?

The error is -2 (file not found). The missing file is the binary firmware blob for the GPU, so you have to install the firmware package or add the file manually.
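
Kernel drivers report failures as negated errno values, so a quick way to decode a number like -2 is Python's standard `errno` module. This is only a lookup sketch; `describe_kernel_error` is a hypothetical helper name, and -12 and -22 are included simply as other values that commonly show up in kernel logs.

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel return code to its errno name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{code}: {name} ({os.strerror(number)})"


if __name__ == "__main__":
    for code in (-2, -12, -22):
        print(describe_kernel_error(code))
```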

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

Just tried sudo apt install -y firmware-amd-graphics after seeing this post, rebooted and... now it gets stuck during boot (no HDMI output) and the D2 activity LED just stays solid green.

So then I tried pulling the microSD card and commenting out the vc4-fkms-v3d dtoverlay in config.txt, and... it wouldn't boot.

I unplugged the card and got it to boot again, and then created /etc/modprobe.d/blacklist-amdgpu.conf with the contents blacklist amdgpu, then shut down, plugged in the card, and booted using the jumper at the end of J2 and... now it's booting all the way, so I'm going to modprobe this sucker and see if I can figure out what's going on.

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

Well that's odd, I'm also getting some MEM space allocation failures again:

[    0.945205] pci 0000:00:00.0: BAR 9: no space for [mem size 0x18000000 64bit pref]
[    0.945218] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0x18000000 64bit pref]
[    0.945231] pci 0000:00:00.0: BAR 8: assigned [mem 0x600000000-0x6000fffff]
[    0.945251] pci 0000:01:00.0: BAR 0: no space for [mem size 0x10000000 64bit pref]
[    0.945261] pci 0000:01:00.0: BAR 0: failed to assign [mem size 0x10000000 64bit pref]
[    0.945275] pci 0000:01:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[    0.945285] pci 0000:01:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[    0.945297] pci 0000:01:00.0: BAR 5: assigned [mem 0x600000000-0x60003ffff]
[    0.945317] pci 0000:01:00.0: BAR 6: assigned [mem 0x600040000-0x60005ffff pref]
[    0.945331] pci 0000:01:00.1: BAR 0: assigned [mem 0x600060000-0x600063fff 64bit]

Going to dig into that first, before I run modprobe amdgpu to see what happens at that point.

Edit: Heh, I forgot that when I copied the generated dtb files I had to re-adjust the BAR space again... oops. Doing that now, will see what happens.

Edit 2: BAR MEM space is allocated again (using 1 GB, 0x40000000). I was planning on testing 2 GB (0x80000000), but that seems unnecessary, and besides, 0x40000000 is the value that will be in the next version of the Pi kernel, so it'd be nice to confirm that works.
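
Since checking these BAR lines by eye gets tedious after a few reboots, here is a small sketch that pulls the "assigned" / "failed to assign" results out of saved dmesg text. The regular expression is written against the line format shown in this thread only, and the function name is made up for illustration.

```python
import re

# Matches lines such as:
#   pci 0000:01:00.0: BAR 0: failed to assign [mem size 0x10000000 64bit pref]
#   pci 0000:01:00.0: BAR 5: assigned [mem 0x600000000-0x60003ffff]
BAR_LINE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): (?P<status>assigned|failed to assign) "
    r"\[(?P<kind>mem|io)\s+(?P<detail>[^\]]+)\]"
)


def summarize_bars(dmesg_text: str):
    """Return (device, BAR number, assigned?, detail) for each result line."""
    results = []
    for line in dmesg_text.splitlines():
        match = BAR_LINE.search(line)
        if match:
            results.append(
                (
                    match["dev"],
                    int(match["bar"]),
                    match["status"] == "assigned",
                    match["detail"],
                )
            )
    return results


if __name__ == "__main__":
    sample = """\
[    0.945218] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0x18000000 64bit pref]
[    0.945231] pci 0000:00:00.0: BAR 8: assigned [mem 0x600000000-0x6000fffff]
[    0.945251] pci 0000:01:00.0: BAR 0: no space for [mem size 0x10000000 64bit pref]
"""
    for dev, bar, ok, detail in summarize_bars(sample):
        print(dev, f"BAR {bar}", "OK  " if ok else "FAIL", detail)
```

The "no space for" lines are skipped on purpose, because each one is always followed by a matching "failed to assign" line.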

from raspberry-pi-pcie-devices.

Rucadi avatar Rucadi commented on May 5, 2024

Well that's odd, I'm also getting some MEM space allocation failures again:

[    0.945205] pci 0000:00:00.0: BAR 9: no space for [mem size 0x18000000 64bit pref]
[    0.945218] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0x18000000 64bit pref]
[    0.945231] pci 0000:00:00.0: BAR 8: assigned [mem 0x600000000-0x6000fffff]
[    0.945251] pci 0000:01:00.0: BAR 0: no space for [mem size 0x10000000 64bit pref]
[    0.945261] pci 0000:01:00.0: BAR 0: failed to assign [mem size 0x10000000 64bit pref]
[    0.945275] pci 0000:01:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[    0.945285] pci 0000:01:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[    0.945297] pci 0000:01:00.0: BAR 5: assigned [mem 0x600000000-0x60003ffff]
[    0.945317] pci 0000:01:00.0: BAR 6: assigned [mem 0x600040000-0x60005ffff pref]
[    0.945331] pci 0000:01:00.1: BAR 0: assigned [mem 0x600060000-0x600063fff 64bit]

Going to dig into that first, before I run modprobe amdgpu to see what happens at that point.

Edit: Heh, I forgot that when I copied the generated dtb files I had to re-adjust the BAR space again... oops. Doing that now, will see what happens.

Good luck! e.e"

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

Err... upon further reading of the above log from dmesg, it's getting more BAR MEM space errors. Going to try 2 GB like I did earlier and see if that might help.

Edit: Nope, same thing, same BAR MEM space allocation failures. I might try for 4 GB instead of 2 GB...

Edit 2: Apparently 0xffffffff is the maximum value allowed for that cell of the array, as I got an error that any higher values were out of the 32-bit range. So if it won't work with 4 GB, I might be outta luck, at least assuming it is a BAR issue.

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

Reading through some mailing list messages, I found this:

Now your Polaris 10 cards have either 8GB or 4GB installed on each board and additionally to the installed memory we need 2MB for each card for the doorbell bar. Since the assignments can basically only be done as a power of two we end up with a requirement of 16GB address space for the 8GB card and 8GB address space for the 4GB.

For compatibility reasons the cards only advertise a 256MB window for the video memory BAR to the BIOS on boot and we later try to resize that to the real size of the installed memory.

Following that to its conclusion, it seems this card requires 4 GB of BAR space, which I'm providing (well, one byte less than that, dumb 32-bit integer!)... but maybe it doesn't like that there's one byte less. Or maybe it's hoping for 8 GB, which I just can't provide.

In any case:

Fortunately the driver manages to fallback to the original 256MB configuration and continues with that. That is a bit sub-optimal, but still not a real problem.

So it's something else. Going to try a powered connector and see if maybe it's a power issue.
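
The power-of-two rounding described in the quoted mailing list message can be sketched as follows. This is a simplified model (VRAM plus a 2 MB doorbell BAR, rounded up to the next power of two), not the kernel's actual resource allocator, and the function names are invented for illustration.

```python
GIB = 1 << 30
MIB = 1 << 20


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def required_address_space(vram_bytes: int, doorbell_bytes: int = 2 * MIB) -> int:
    """Address space needed for VRAM plus the doorbell BAR, under the
    simplifying assumption that the total is rounded up to a power of two."""
    return next_power_of_two(vram_bytes + doorbell_bytes)


if __name__ == "__main__":
    for vram_gib in (8, 4):
        need = required_address_space(vram_gib * GIB)
        print(f"{vram_gib} GiB card -> {need // GIB} GiB of address space")
```

This reproduces the 16 GB and 8 GB figures from the quote: the extra 2 MB is what pushes each card past its own power-of-two size.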

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

With the PCE164P-NO3 VER 006, I'm getting:

[    1.206474] brcm-pcie fd500000.pcie: link down

Also, after boot, the fan on the card goes to 100% and puts out quite a bit of air!

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

Interesting, with this other adapter (a 2 port PCIe switch), I'm not getting the link down issue, and I see:

$ lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 20)
01:00.0 PCI bridge: Pericom Semiconductor PI7C9X2G304 EL/SL PCIe2 3-Port/4-Lane Packet Switch (rev 05)
02:01.0 PCI bridge: Pericom Semiconductor PI7C9X2G304 EL/SL PCIe2 3-Port/4-Lane Packet Switch (rev 05)
02:02.0 PCI bridge: Pericom Semiconductor PI7C9X2G304 EL/SL PCIe2 3-Port/4-Lane Packet Switch (rev 05)
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon RX 550/550X] (rev c7)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]

So let's give it a go: modprobe amdgpu

[   75.713936] [drm] amdgpu kernel modesetting enabled.
[   75.714124] pci 0000:00:00.0: of_irq_parse_pci: failed with rc=-22
[   75.714153] amdgpu 0000:03:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x640000000 -> 0x64fffffff
[   75.714161] amdgpu 0000:03:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x650000000 -> 0x6501fffff
[   75.714169] amdgpu 0000:03:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x600000000 -> 0x60003ffff
[   75.714270] pci 0000:00:00.0: enabling device (0000 -> 0002)
[   75.714293] pci 0000:01:00.0: enabling device (0000 -> 0002)
[   75.714313] pci 0000:02:01.0: enabling device (0000 -> 0002)
[   75.714332] amdgpu 0000:03:00.0: enabling device (0000 -> 0002)
[   75.714806] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
[   75.714841] [drm] register mmio base: 0x00000000
[   75.714846] [drm] register mmio size: 262144
[   75.714851] [drm] PCI I/O BAR is not found.
[   75.714860] [drm] PCIE atomic ops is not supported
[   75.714882] [drm] add ip block number 0 <vi_common>
[   75.714887] [drm] add ip block number 1 <gmc_v8_0>
[   75.714892] [drm] add ip block number 2 <tonga_ih>
[   75.714897] [drm] add ip block number 3 <gfx_v8_0>
[   75.714903] [drm] add ip block number 4 <sdma_v3_0>
[   75.714908] [drm] add ip block number 5 <powerplay>
[   75.714913] [drm] add ip block number 6 <dm>
[   75.714919] [drm] add ip block number 7 <uvd_v6_0>
[   75.714924] [drm] add ip block number 8 <vce_v3_0>
[   75.972711] ATOM BIOS: 113-36764-U61
[   75.972794] [drm] UVD is enabled in VM mode
[   75.972798] [drm] UVD ENC is enabled in VM mode
[   75.972805] [drm] VCE enabled in VM mode
[   75.972852] [drm] GPU posting now...
[   76.091407] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[   76.096689] amdgpu 0000:03:00.0: BAR 2: releasing [mem 0x650000000-0x6501fffff 64bit pref]
[   76.096698] amdgpu 0000:03:00.0: BAR 0: releasing [mem 0x640000000-0x64fffffff 64bit pref]
[   76.096749] pci 0000:02:01.0: BAR 9: releasing [mem 0x640000000-0x657ffffff 64bit pref]
[   76.096756] pci 0000:01:00.0: BAR 9: releasing [mem 0x640000000-0x657ffffff 64bit pref]
[   76.096761] pci 0000:00:00.0: BAR 9: releasing [mem 0x640000000-0x657ffffff 64bit pref]
[   76.096781] pci 0000:00:00.0: BAR 9: no space for [mem size 0xc0000000 64bit pref]
[   76.096785] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0xc0000000 64bit pref]
[   76.096793] pci 0000:01:00.0: BAR 9: no space for [mem size 0xc0000000 64bit pref]
[   76.096797] pci 0000:01:00.0: BAR 9: failed to assign [mem size 0xc0000000 64bit pref]
[   76.096803] pci 0000:02:01.0: BAR 9: no space for [mem size 0xc0000000 64bit pref]
[   76.096807] pci 0000:02:01.0: BAR 9: failed to assign [mem size 0xc0000000 64bit pref]
[   76.096840] amdgpu 0000:03:00.0: BAR 0: no space for [mem size 0x80000000 64bit pref]
[   76.096845] amdgpu 0000:03:00.0: BAR 0: failed to assign [mem size 0x80000000 64bit pref]
[   76.096851] amdgpu 0000:03:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[   76.096856] amdgpu 0000:03:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[   76.096864] pci 0000:02:02.0: PCI bridge to [bus 04]
[   76.096885] pci 0000:00:00.0: PCI bridge to [bus 01-04]
[   76.096911] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6000fffff]
[   76.096933] pci 0000:00:00.0: PCI bridge to [bus 01-04]
[   76.096940] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6000fffff]
[   76.096958] pci 0000:00:00.0:   bridge window [mem 0x640000000-0x657ffffff 64bit pref]
[   76.096966] pci 0000:01:00.0: PCI bridge to [bus 02-04]
[   76.096987] pci 0000:01:00.0:   bridge window [mem 0x600000000-0x6000fffff]
[   76.096994] pci 0000:01:00.0:   bridge window [mem 0x640000000-0x657ffffff 64bit pref]
[   76.097016] pci 0000:02:01.0: PCI bridge to [bus 03]
[   76.097025] pci 0000:02:01.0:   bridge window [mem 0x600000000-0x6000fffff]
[   76.097043] pci 0000:02:01.0:   bridge window [mem 0x640000000-0x657ffffff 64bit pref]
[   76.097079] [drm] Not enough PCI address space for a large BAR.
[   76.097098] amdgpu 0000:03:00.0: BAR 0: assigned [mem 0x640000000-0x64fffffff 64bit pref]
[   76.097131] amdgpu 0000:03:00.0: BAR 2: assigned [mem 0x650000000-0x6501fffff 64bit pref]
[   76.097185] amdgpu 0000:03:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[   76.097190] amdgpu 0000:03:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[   76.097209] [drm] Detected VRAM RAM=2048M, BAR=256M
[   76.097213] [drm] RAM width 64bits GDDR5
[   76.100942] [TTM] Zone  kernel: Available graphics memory: 1944480 KiB
[   76.100948] [TTM] Initializing pool allocator
[   76.100960] [TTM] Initializing DMA pool allocator
[   76.101058] [drm] amdgpu: 2048M of VRAM memory ready
[   76.101069] [drm] amdgpu: 2848M of GTT memory ready.
[   76.101134] [drm] GART: num cpu pages 65536, num gpu pages 65536
[   76.102413] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[   76.106273] [drm] Chained IB support enabled!

A couple of differences in the output... and when I ran modprobe, I noticed the fan started spinning slower. Not sure what to make of it. But this is using an external 5V Molex-adapted-to-floppy-connector power supply. It doesn't seem the most reliable contraption in any sense, as these connectors are very cheap quality, and it's a far cry from working inside a computer with a 300W+ quality power supply :)

from raspberry-pi-pcie-devices.

elFarto avatar elFarto commented on May 5, 2024

Interesting that it stops on the 'Chained IB support enabled' message, as I've noticed your MEM and IB MEM sections have overlapping PCI address space mappings (assuming IB refers to the same thing):

[    1.011281] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    1.011338] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x067fffffff -> 0x00c0000000
[    1.011392] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
...
[    1.059610] pci_bus 0000:00: root bus resource [mem 0x600000000-0x67fffffff] (bus address [0xc0000000-0x13fffffff])

You have 2GiB of BAR space, and the last 1GiB is in the IB MEM range.
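
To make the overlap concrete, here is a small sketch that intersects the two PCI-side ranges from the dmesg lines above. The MEM window starts at PCI address 0xc0000000 and is 2 GiB long; the IB MEM window starts at 0x100000000 and is 4 GiB long. The helper name is made up for illustration.

```python
def overlap(a, b):
    """Return the inclusive (start, end) overlap of two inclusive ranges,
    or None if they do not intersect."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end) if start <= end else None


# PCI-side addresses taken from the brcm-pcie dmesg lines.
outbound_mem = (0x0_C000_0000, 0x1_3FFF_FFFF)  # MEM (BAR space)
inbound_mem = (0x1_0000_0000, 0x1_FFFF_FFFF)   # IB MEM (DMA to system RAM)

if __name__ == "__main__":
    hit = overlap(outbound_mem, inbound_mem)
    if hit is None:
        print("no overlap")
    else:
        size = hit[1] - hit[0] + 1
        print(f"overlap 0x{hit[0]:x}..0x{hit[1]:x} ({size >> 20} MiB)")
```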

Could you paste the 'ranges' and 'dma-ranges' lines from the pcie section in your device tree? I'm not sure how the IB MEM section ended up there.

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

So today for fun I tried the following:

  1. Flashed Pi OS 64-bit (full GUI) to microSD card.
  2. Cross-compiled the kernel with the amdgpu driver enabled.
  3. Booted the device.
  4. Blacklisted amdgpu modules by creating /etc/modprobe.d/blacklist-amdgpu.conf with contents blacklist amdgpu.
  5. Installed AMD firmware: sudo apt install -y firmware-amd-graphics
  6. Increased BAR space to the maximum of 4 GB (value 0xffffffff; corrected from 2 GB).
  7. Rebooted (card still not plugged in). Made sure Pi booted correctly. Then shut down.
  8. Plugged in the card via dumb 16x to 1x adapter.

The card started its 'normal' fan routine (where it spins up, stops, then spins at a nice calm rate). Sometimes it goes into 'EVIL FAN' mode where it goes max speed and I know the card didn't power up correctly.

$ lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 20)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon RX 550/550X] (rev c7)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]

And @elFarto the lines from the decompiled device tree are:

                pcie@7d500000 {
                        compatible = "brcm,bcm2711-pcie";
                        reg = < 0x00 0x7d500000 0x00 0x9310 >;
                        device_type = "pci";
                        #address-cells = < 0x03 >;
                        #interrupt-cells = < 0x01 >;
                        #size-cells = < 0x02 >;
                        interrupts = < 0x00 0x94 0x04 0x00 0x94 0x04 >;
                        interrupt-names = "pcie\0msi";
                        interrupt-map-mask = < 0x00 0x00 0x00 0x07 >;
                        interrupt-map = < 0x00 0x00 0x00 0x01 0x01 0x00 0x8f 0x04 >;
                        msi-controller;
                        msi-parent = < 0x2a >;
                        ranges = <0x02000000 0x0 0xc0000000 0x6 0x00000000 0x0 0xffffffff>;
                        dma-ranges = < 0x2000000 0x00 0x00 0x00 0x00 0x00 0xc0000000 >;
                        brcm,enable-ssc;
                        brcm,enable-l1ss;
                        phandle = < 0x2a >;
                };

I ran sudo modprobe amdgpu and dmesg --follow died on [ 133.508246] [drm] Chained IB support enabled! again.

@elFarto - Are you thinking my ranges/dma-ranges may be out of whack, maybe causing some memory addresses to be overwritten? Wouldn't be the first time (to be honest my brain kind of collapses sometimes working with this stuff).

from raspberry-pi-pcie-devices.

elFarto avatar elFarto commented on May 5, 2024

I'm not entirely sure what's going on, but I don't think those ranges are correct. Firstly 0xffffffff is 4GiB - 1, not 2GiB :). Next, based on your dmesg output way above, here's where everything gets mapped (reformatted to make it easier to read):

           CPU addresses                 PCI addresses
ranges     0x0600000000..0x067fffffff -> 0x00c0000000..0x013fffffff
dma-ranges 0x0000000000..0x00ffffffff -> 0x0100000000..0x01ffffffff

But... based on your decompiled device tree, I can't see why the dma-ranges gets pushed up to 0x01'0000'0000, since that's not what's specified (assuming you didn't change the dma-ranges in the device tree that produced that dmesg output). PhilE did say on the RasPi forums that something (firmware?) patches the device tree, so maybe that's what's happening here. Maybe you can retrieve the device tree that's actually loaded from sysfs, rather than the one on the filesystem?
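
A rough sketch of how the live values could be read on the Pi itself: the running kernel exposes the device tree it actually booted with under /sys/firmware/devicetree/base, with each property stored as raw big-endian 32-bit cells. The node path below is taken from the dmesg output (/scb/pcie@7d500000); whether that path exists on a given system is an assumption, and the function names are made up.

```python
import struct
from pathlib import Path

# Assumed locations: the sysfs device tree root, and the PCIe node path
# as it appears in the dmesg output in this thread.
DT_BASE = Path("/sys/firmware/devicetree/base")
PCIE_NODE = "scb/pcie@7d500000"


def decode_cells(raw: bytes):
    """Split a raw device tree property into big-endian 32-bit cells."""
    if len(raw) % 4:
        raise ValueError("property length is not a multiple of 4 bytes")
    return list(struct.unpack(f">{len(raw) // 4}I", raw))


def read_property_cells(name: str, node: str = PCIE_NODE, base: Path = DT_BASE):
    """Read one property of the PCIe node from the live device tree."""
    return decode_cells((base / node / name).read_bytes())


if __name__ == "__main__":
    for prop in ("ranges", "dma-ranges"):
        try:
            cells = read_property_cells(prop)
        except OSError as exc:
            print(f"{prop}: could not read ({exc})")
        else:
            print(prop, "=", " ".join(f"0x{c:08x}" for c in cells))
```

Comparing this output with the decompiled .dtb on the boot partition would show whether anything was patched at boot.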

With that said, the last device tree you've pasted has this layout (assuming the dma-ranges gets changed the same way):

           CPU addresses                 PCI addresses
ranges     0x0600000000..0x06fffffffe -> 0x00c0000000..0x01bffffffe
dma-ranges 0x0000000000..0x00ffffffff -> 0x0100000000..0x01ffffffff

Add to that the MSI target address which is either 0x0'ffff'fffc if dma-ranges start address is >= 0x01'0000'0000 or 0xf'ffff'fffc if it's less, we end up with 3 overlaps...I think.

You can have sizes over 4GB. Here's roughly how the ranges and dma-ranges fields are structured:

ranges = <0x02000000 0x0 0xc0000000  //PCI address
                     0x6 0x00000000  //CPU address
                     0x0 0xffffffff>;//Size (4GiB - 1)
                     
dma-ranges = < 0x2000000 0x00 0x00 //PCI address
			 0x00 0x00 //CPU address
			 0x00 0xc0000000 >; //Size (3GiB)

After the first field, they're paired making a 64-bit integer. You can also have multiple ranges (but not multiple dma-ranges, that's not supported on the Pi). So if you wanted an 8GiB BAR size, you could do this (not sure this one will work due to the alignment):

ranges = <0x02000000 0x0 0xc0000000  //PCI address
                     0x6 0x00000000  //CPU address
                     0x2 0x00000000>;//Size (8GiB)
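
The pairing of cells into 64-bit values can be sketched like this, assuming the 7-cell layout shown above (one flags cell, then two cells each for the PCI address, CPU address and size). `decode_range` is a hypothetical helper name.

```python
GIB = 1 << 30


def decode_range(cells):
    """Decode one 7-cell 'ranges' entry into 64-bit PCI/CPU/size values."""
    if len(cells) != 7:
        raise ValueError("expected 7 cells: flags, PCI hi/lo, CPU hi/lo, size hi/lo")
    flags, pci_hi, pci_lo, cpu_hi, cpu_lo, size_hi, size_lo = cells
    return {
        "flags": flags,
        "pci": (pci_hi << 32) | pci_lo,
        "cpu": (cpu_hi << 32) | cpu_lo,
        "size": (size_hi << 32) | size_lo,
    }


if __name__ == "__main__":
    entries = {
        "4GiB - 1": [0x02000000, 0x0, 0xC0000000, 0x6, 0x00000000, 0x0, 0xFFFFFFFF],
        "8GiB": [0x02000000, 0x0, 0xC0000000, 0x6, 0x00000000, 0x2, 0x00000000],
    }
    for label, cells in entries.items():
        r = decode_range(cells)
        print(
            f"{label}: PCI 0x{r['pci']:x} -> CPU 0x{r['cpu']:x}, "
            f"size 0x{r['size']:x} ({r['size'] / GIB:.2f} GiB)"
        )
```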

Now what we need is an address space that fits everything in, without overlapping. Maybe something like this:

ranges = <0x02000000 0x2 0x00000000  //PCI address
                     0x6 0x00000000  //CPU address
                     0x2 0x00000000>;//Size (8GiB)
                     
dma-ranges = < 0x2000000 0x00 0x00 //PCI address
			 0x00 0x00 //CPU address
			 0x00 0xc0000000 >; //Size

I have no idea if that'll work, we're using a lot of PCI address space, and I can't see any details on how much it supports, so 🤷, but nothing should be overlapping.

You could pare back the ranges size to 4GiB if that doesn't work.

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

After applying the ranges in your post (8GB) it did seem to boot, and I got:

[    0.901049] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    0.901068] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    0.901126] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x07ffffffff -> 0x0200000000
[    0.901182] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    1.218107] brcm-pcie fd500000.pcie: link down

Note that I'm on a CM4 with 4 GB of memory—can I set the BAR space larger than the system RAM?

Now when I try sudo modprobe amdgpu I get:

[   37.923447] [drm] amdgpu kernel modesetting enabled.
[   37.923669] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x600000000 -> 0x60fffffff
[   37.923675] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x610000000 -> 0x6101fffff
[   37.923708] pci 0000:00:00.0: enabling device (0000 -> 0002)
[   37.923722] amdgpu 0000:01:00.0: enabling device (0000 -> 0002)
[   37.924048] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
[   37.924062] amdgpu 0000:01:00.0: Fatal error during GPU init
[   37.926646] amdgpu: probe of 0000:01:00.0 failed with error -12

Edit: Also, after reboots sometimes the GPU fan just goes ballistic (highest speed) and I get the PCIe 'link is down' in dmesg. I have to completely power off before the card seems to go back into not-panicking mode.

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

With:

                        ranges = <0x02000000 0x2 0x00000000 0x6 0x00000000 0x2 0x00000000>;
                        dma-ranges = < 0x2000000 0x00 0x00 0x00 0x00 0x00 0xc0000000 >;

I end up getting the following in dmesg after reboot with the card connected:

[    0.900642] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    0.900661] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    0.900718] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x07ffffffff -> 0x0200000000
[    0.900774] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    0.948085] brcm-pcie fd500000.pcie: link up, 5 GT/s x1 (SSC)
[    0.948383] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    0.948399] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.948414] pci_bus 0000:00: root bus resource [mem 0x600000000-0x7ffffffff] (bus address [0x200000000-0x3ffffffff])
[    0.948466] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    0.948684] pci 0000:00:00.0: PME# supported from D0 D3hot
[    0.952283] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    0.952485] pci 0000:01:00.0: [1002:699f] type 00 class 0x030000
[    0.952600] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff 64bit pref]
[    0.952641] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x001fffff 64bit pref]
[    0.952669] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x00ff]
[    0.952696] pci 0000:01:00.0: reg 0x24: [mem 0x00000000-0x0003ffff]
[    0.952723] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[    0.952753] pci 0000:01:00.0: enabling Extended Tags
[    0.953028] pci 0000:01:00.0: supports D1 D2
[    0.953039] pci 0000:01:00.0: PME# supported from D1 D2 D3hot D3cold
[    0.953101] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:00.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
[    0.953242] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.953319] pci 0000:01:00.1: [1002:aae0] type 00 class 0x040300
[    0.953408] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[    0.953516] pci 0000:01:00.1: enabling Extended Tags
[    0.953706] pci 0000:01:00.1: supports D1 D2
[    0.957171] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    0.957211] pci 0000:00:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    0.957225] pci 0000:00:00.0: BAR 8: no space for [mem size 0x00100000]
[    0.957236] pci 0000:00:00.0: BAR 8: failed to assign [mem size 0x00100000]
[    0.957254] pci 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[    0.957291] pci 0000:01:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[    0.957325] pci 0000:01:00.0: BAR 5: no space for [mem size 0x00040000]
[    0.957335] pci 0000:01:00.0: BAR 5: failed to assign [mem size 0x00040000]
[    0.957348] pci 0000:01:00.0: BAR 6: no space for [mem size 0x00020000 pref]
[    0.957358] pci 0000:01:00.0: BAR 6: failed to assign [mem size 0x00020000 pref]
[    0.957370] pci 0000:01:00.1: BAR 0: no space for [mem size 0x00004000 64bit]
[    0.957381] pci 0000:01:00.1: BAR 0: failed to assign [mem size 0x00004000 64bit]
[    0.957391] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0100]
[    0.957402] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0100]
[    0.957414] pci 0000:00:00.0: PCI bridge to [bus 01]
[    0.957437] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    0.957535] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0

from raspberry-pi-pcie-devices.

elFarto avatar elFarto commented on May 5, 2024

Ok, I was worried about that. The ranges setting is purely 64-bit, and if there are 32-bit only BARs there are no valid addresses for them to use. So... I guess we allocate MORE BAR SPACE!:

ranges = <0x02000000 0x2 0x00000000  //PCI address
                     0x6 0x00000000  //CPU address
                     0x2 0x00000000  //Size (8GiB 64-bit only)
          0x02000000 0x0 0x00000000  //PCI address
                     0x4 0x00000000  //CPU address
                     0x0 0x80000000  //Size (2GiB 32-bit)
                     >;

Now, do we need 10GiB of BAR space? To that I answer, who are you and what have you done with the real Jeff :)

Edit: You might need to make the CPU address 0x5'0000'0000 on the second allocation; 0x4'0000'0000 is mapped to 'L2 Cached (allocating)', and I'm not sure what that is.
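
To illustrate why the 8GiB window alone left the 32-bit BARs with nowhere to go, here is a sketch that measures how much of a window sits below the 4 GiB mark on the PCI side, which is the only place a 32-bit BAR can be given an address. The window values come from the ranges discussed in this thread; the function name is made up.

```python
FOUR_GIB = 1 << 32
GIB = 1 << 30


def bytes_below_4gib(pci_start: int, size: int) -> int:
    """How many bytes of a PCI window are reachable by a 32-bit BAR."""
    end = min(pci_start + size, FOUR_GIB)
    return max(0, end - pci_start)


if __name__ == "__main__":
    windows = {
        "original (PCI 0xc0000000, 2 GiB)": (0x0_C000_0000, 2 * GIB),
        "8 GiB at PCI 0x200000000": (0x2_0000_0000, 8 * GIB),
        "extra 2 GiB at PCI 0x0": (0x0, 2 * GIB),
    }
    for label, (start, size) in windows.items():
        usable = bytes_below_4gib(start, size)
        print(f"{label}: {usable // (1 << 20)} MiB usable by 32-bit BARs")
```

With only the 8GiB window, nothing is below 4 GiB, which matches the "no space for [mem size 0x00040000]" failures on the non-64-bit BARs in the log.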

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

@elFarto - Hmm... using that value I'm still seeing:

[    0.900748] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    0.900767] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    0.900857] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x07ffffffff -> 0x0200000000
[    0.900907] brcm-pcie fd500000.pcie:      MEM 0xffffffffffffffff..0x007ffffffe -> 0x0000000000
[    0.900958] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    0.948062] brcm-pcie fd500000.pcie: link up, 5 GT/s x1 (SSC)
[    0.948364] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    0.948379] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.948394] pci_bus 0000:00: root bus resource [mem 0x600000000-0x7ffffffff] (bus address [0x200000000-0x3ffffffff])
[    0.948447] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    0.948666] pci 0000:00:00.0: PME# supported from D0 D3hot
[    0.952117] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    0.952319] pci 0000:01:00.0: [1002:699f] type 00 class 0x030000
[    0.952434] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff 64bit pref]
[    0.952475] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x001fffff 64bit pref]
[    0.952502] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x00ff]
[    0.952530] pci 0000:01:00.0: reg 0x24: [mem 0x00000000-0x0003ffff]
[    0.952557] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[    0.952586] pci 0000:01:00.0: enabling Extended Tags
[    0.952860] pci 0000:01:00.0: supports D1 D2
[    0.952870] pci 0000:01:00.0: PME# supported from D1 D2 D3hot D3cold
[    0.952934] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:00.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
[    0.953075] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.953153] pci 0000:01:00.1: [1002:aae0] type 00 class 0x040300
[    0.953242] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[    0.953351] pci 0000:01:00.1: enabling Extended Tags
[    0.953541] pci 0000:01:00.1: supports D1 D2
[    0.956858] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    0.956899] pci 0000:00:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    0.956913] pci 0000:00:00.0: BAR 8: no space for [mem size 0x00100000]
[    0.956924] pci 0000:00:00.0: BAR 8: failed to assign [mem size 0x00100000]
[    0.956942] pci 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[    0.956979] pci 0000:01:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[    0.957013] pci 0000:01:00.0: BAR 5: no space for [mem size 0x00040000]
[    0.957024] pci 0000:01:00.0: BAR 5: failed to assign [mem size 0x00040000]
[    0.957036] pci 0000:01:00.0: BAR 6: no space for [mem size 0x00020000 pref]
[    0.957047] pci 0000:01:00.0: BAR 6: failed to assign [mem size 0x00020000 pref]
[    0.957059] pci 0000:01:00.1: BAR 0: no space for [mem size 0x00004000 64bit]
[    0.957069] pci 0000:01:00.1: BAR 0: failed to assign [mem size 0x00004000 64bit]
[    0.957080] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0100]
[    0.957090] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0100]
[    0.957103] pci 0000:00:00.0: PCI bridge to [bus 01]
[    0.957126] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    0.957224] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0

BAR 0, 5, 6 and 8 still seem unhappy :(

Edit: Also tried 0x5 0x00000000 for the 2nd allocation and that didn't seem to make any difference. Still no space for those BARs.

from raspberry-pi-pcie-devices.

elFarto avatar elFarto commented on May 5, 2024

Er, something went very wrong there:

[    0.900907] brcm-pcie fd500000.pcie:      MEM 0xffffffffffffffff..0x007ffffffe -> 0x0000000000

Ok, let's try this:

ranges = <0x02000000 0x2 0x00000000  //PCI address
                     0x6 0x00000000  //CPU address
                     0x2 0x00000000  //Size (8GiB 64-bit only)
          0x02000000 0x0 0x40000000  //PCI address
                     0x5 0x00000000  //CPU address
                     0x0 0x40000000  //Size (1GiB 32-bit)
                     >;

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

Still:

[    0.900883] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    0.900902] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    0.900991] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x07ffffffff -> 0x0200000000
[    0.901041] brcm-pcie fd500000.pcie:      MEM 0xffffffffffffffff..0x003ffffffe -> 0x0040000000
[    0.901092] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    0.948043] brcm-pcie fd500000.pcie: link up, 5 GT/s x1 (SSC)
[    0.948342] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    0.948357] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.948372] pci_bus 0000:00: root bus resource [mem 0x600000000-0x7ffffffff] (bus address [0x200000000-0x3ffffffff])
[    0.948424] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    0.948642] pci 0000:00:00.0: PME# supported from D0 D3hot
[    0.952093] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    0.952294] pci 0000:01:00.0: [1002:699f] type 00 class 0x030000
[    0.952410] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff 64bit pref]
[    0.952452] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x001fffff 64bit pref]
[    0.952479] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x00ff]
[    0.952507] pci 0000:01:00.0: reg 0x24: [mem 0x00000000-0x0003ffff]
[    0.952534] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[    0.952565] pci 0000:01:00.0: enabling Extended Tags
[    0.952841] pci 0000:01:00.0: supports D1 D2
[    0.952852] pci 0000:01:00.0: PME# supported from D1 D2 D3hot D3cold
[    0.952916] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:00.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
[    0.953059] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.953136] pci 0000:01:00.1: [1002:aae0] type 00 class 0x040300
[    0.953225] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[    0.953334] pci 0000:01:00.1: enabling Extended Tags
[    0.953523] pci 0000:01:00.1: supports D1 D2
[    0.956831] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    0.956871] pci 0000:00:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    0.956885] pci 0000:00:00.0: BAR 8: no space for [mem size 0x00100000]
[    0.956896] pci 0000:00:00.0: BAR 8: failed to assign [mem size 0x00100000]
[    0.956915] pci 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[    0.956951] pci 0000:01:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[    0.956985] pci 0000:01:00.0: BAR 5: no space for [mem size 0x00040000]
[    0.956996] pci 0000:01:00.0: BAR 5: failed to assign [mem size 0x00040000]
[    0.957009] pci 0000:01:00.0: BAR 6: no space for [mem size 0x00020000 pref]
[    0.957019] pci 0000:01:00.0: BAR 6: failed to assign [mem size 0x00020000 pref]
[    0.957031] pci 0000:01:00.1: BAR 0: no space for [mem size 0x00004000 64bit]
[    0.957041] pci 0000:01:00.1: BAR 0: failed to assign [mem size 0x00004000 64bit]
[    0.957052] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0100]
[    0.957063] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0100]
[    0.957076] pci 0000:00:00.0: PCI bridge to [bus 01]
[    0.957099] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    0.957199] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

1 GB 32-bit might not be enough?
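A quick sanity check on that question, using only the sizes from the `no space for` lines in the log above. This is plain arithmetic and ignores alignment and bridge window rounding, so treat it as a rough estimate:

```python
# Sizes (in bytes) of the memory BARs that failed to assign, copied from dmesg.
failed_bars = {
    "01:00.0 BAR 5": 0x00040000,
    "01:00.0 BAR 6 (expansion ROM)": 0x00020000,
    "01:00.1 BAR 0": 0x00004000,
}
bridge_window = 0x00100000   # 00:00.0 BAR 8, the non-prefetchable bridge window
window_32bit = 0x40000000    # the 1 GiB range from the device tree

total = sum(failed_bars.values())
print(f"BARs need {total // 1024} KiB")
print(f"bridge window is {bridge_window // 1024} KiB")
print(f"32-bit range is {window_32bit // 1024} KiB")
```

The three BARs add up to well under the 1 MiB bridge window, so the size of the 32-bit range by itself does not look like the limiting factor.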

from raspberry-pi-pcie-devices.

elFarto avatar elFarto commented on May 5, 2024

Hmm, not sure. The only thing I can think of is to reverse the order of those ranges. Something is causing that smaller range to get an invalid starting address. I thought a starting address of 0 might be triggering a bug in the code. Until it's valid, nothing can be allocated from it.
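One way to read the bogus `MEM 0xffffffffffffffff..` lines: if the start address comes back as an all-ones "bad address" value (an assumption here, not something traced through the driver), then the end address printed in the log is simply start + size - 1 wrapping around in 64 bits. A small sketch of that arithmetic, with `BAD_ADDR` and `range_end` as illustrative names:

```python
MASK64 = (1 << 64) - 1
BAD_ADDR = MASK64  # assumed sentinel for "address translation failed"


def range_end(start: int, size: int) -> int:
    """Inclusive end of a range, with 64-bit wrap-around."""
    return (start + size - 1) & MASK64


# The 2 GiB and 1 GiB windows from the two attempts above.
print(f"{range_end(BAD_ADDR, 0x80000000):#012x}")
print(f"{range_end(BAD_ADDR, 0x40000000):#012x}")
```

Both results line up with the end addresses in the two logs (`0x007ffffffe` and `0x003ffffffe`), which suggests the size is parsed fine and only the CPU start address is invalid.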

from raspberry-pi-pcie-devices.

elFarto avatar elFarto commented on May 5, 2024

I have absolutely no idea what's happening. It looks like the device tree is being corrupted. A few things to try. Can you try decompiling the device tree like this:

sudo dtc -I dtb -O dts /sys/firmware/fdt

As I thought, you can retrieve the device tree file as the kernel loaded it. This should show what's actually happening. My theory is that something is patching it before it's given to the kernel, and that patching is breaking it (although it is odd how the invalid entry moves around).

Secondly, here's a single BAR mapping that puts a 3GiB range at the start of the PCI address space. Perhaps it doesn't like having two entries in the device tree:

ranges = <0x02000000 0x0 0x00000000  //PCI address
                     0x6 0x00000000  //CPU address
                     0x0 0xc0000000>;  //Size (3GiB 32-bit)

Edit: Wait, I think I see the issue. There's another 'ranges' section in the enclosing 'scb' section in the device tree. It doesn't have a mapping for 0x5'0000'0000, so that might be what's causing it. Also, the 0x6'0000'0000 mapping is only 1GiB large, so that might also be an issue.

Edit 2: Ok, I think I can see what to do. Try this in the 'scb' section:

ranges = <	0x00 0x7c000000 0x00 0xfc000000 0x00 0x3800000 
		0x00 0x40000000 0x00 0xff800000 0x00 0x800000 
		0x05 0x00000000 0x05 0x00000000 0x00 0x80000000 
		0x06 0x00000000 0x06 0x00000000 0x02 0x00000000 
		0x00 0x00000000 0x00 0x00000000 0x00 0xfc000000 >;

And this in the 'pcie' section:

ranges = <0x02000000 0x2 0x00000000  //PCI address
                     0x6 0x00000000  //CPU address
                     0x2 0x00000000  //Size (8GiB 64-bit only)
          0x02000000 0x0 0x00000000  //PCI address
                     0x5 0x00000000  //CPU address
                     0x0 0x80000000  //Size (2GiB 32-bit)
                     >;
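To illustrate why the 'scb' section matters, here is a simplified sketch that checks whether each CPU-side window in the 'pcie' ranges fits inside one of the parent 'scb' ranges. It is only a model of the idea (the kernel's real translation code is more involved), and `SCB_RANGES` and `covered_by_parent` are hypothetical names:

```python
from typing import List, Tuple

# (child address, parent address, size) entries from the proposed scb ranges
SCB_RANGES: List[Tuple[int, int, int]] = [
    (0x0_7C00_0000, 0x0_FC00_0000, 0x0_0380_0000),
    (0x0_4000_0000, 0x0_FF80_0000, 0x0_0080_0000),
    (0x5_0000_0000, 0x5_0000_0000, 0x0_8000_0000),
    (0x6_0000_0000, 0x6_0000_0000, 0x2_0000_0000),
    (0x0_0000_0000, 0x0_0000_0000, 0x0_FC00_0000),
]


def covered_by_parent(cpu_addr: int, size: int) -> bool:
    """True if [cpu_addr, cpu_addr + size) fits inside one scb range."""
    end = cpu_addr + size
    return any(child <= cpu_addr and end <= child + length
               for child, _parent, length in SCB_RANGES)


# CPU-side windows from the pcie ranges above, plus the earlier 0x4'0000'0000 try.
for cpu_addr, size in [(0x6_0000_0000, 0x2_0000_0000),
                       (0x5_0000_0000, 0x0_8000_0000),
                       (0x4_0000_0000, 0x0_8000_0000)]:
    print(f"{cpu_addr:#x} +{size:#x}: covered={covered_by_parent(cpu_addr, size)}")
```

With the proposed scb ranges, the 0x6'0000'0000 and 0x5'0000'0000 windows are covered, while a window at 0x4'0000'0000 would still have no parent mapping.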

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

Okay, so I put in the following:

        scb {
                compatible = "simple-bus";
                #address-cells = < 0x02 >;
                #size-cells = < 0x02 >;
                ranges = < 0x00 0x7c000000 0x00 0xfc000000 0x00 0x3800000 0x00 0x40000000 0x00 0xff800000 0x00 0x800000 0x05 0x00000000 0x05 0x00000000 0x00 0x80000000 0x06 0x00000000 0x06 0x00000000 0x02 0x00000000 0x00 0x00000000 0x00 0x00000000 0x00 0xfc000000 >;
                dma-ranges = < 0x00 0x00 0x00 0x00 0x04 0x00 >;
                phandle = < 0xd2 >;

                pcie@7d500000 {
                        compatible = "brcm,bcm2711-pcie";
                        reg = < 0x00 0x7d500000 0x00 0x9310 >;
                        device_type = "pci";
                        #address-cells = < 0x03 >;
                        #interrupt-cells = < 0x01 >;
                        #size-cells = < 0x02 >;
                        interrupts = < 0x00 0x94 0x04 0x00 0x94 0x04 >;
                        interrupt-names = "pcie\0msi";
                        interrupt-map-mask = < 0x00 0x00 0x00 0x07 >;
                        interrupt-map = < 0x00 0x00 0x00 0x01 0x01 0x00 0x8f 0x04 >;
                        msi-controller;
                        msi-parent = < 0x2a >;
                        ranges = < 0x02000000 0x2 0x00000000 0x6 0x00000000 0x2 0x00000000 0x02000000 0x0 0x00000000 0x5 0x00000000 0x0 0x80000000 >;
                        dma-ranges = < 0x2000000 0x00 0x00 0x00 0x00 0x00 0xc0000000 >;
                        brcm,enable-ssc;
                        brcm,enable-l1ss;
                        phandle = < 0x2a >;
                };

...

Rebooted, and here's dmesg:

[    0.900853] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    0.900873] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    0.900963] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x07ffffffff -> 0x0200000000
[    0.901014] brcm-pcie fd500000.pcie:      MEM 0x0500000000..0x057fffffff -> 0x0000000000
[    0.901069] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    1.218092] brcm-pcie fd500000.pcie: link down

D'oh. I'm going to restart a few more times and see what's up.

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

Ah, now it looks like allocations are happening:

[    0.900714] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    0.900734] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    0.900824] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x07ffffffff -> 0x0200000000
[    0.900875] brcm-pcie fd500000.pcie:      MEM 0x0500000000..0x057fffffff -> 0x0000000000
[    0.900931] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    0.948032] brcm-pcie fd500000.pcie: link up, 5 GT/s x1 (SSC)
[    0.948333] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    0.948348] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.948363] pci_bus 0000:00: root bus resource [mem 0x600000000-0x7ffffffff] (bus address [0x200000000-0x3ffffffff])
[    0.948377] pci_bus 0000:00: root bus resource [mem 0x500000000-0x57fffffff] (bus address [0x00000000-0x7fffffff])
[    0.948428] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    0.948648] pci 0000:00:00.0: PME# supported from D0 D3hot
[    0.952070] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    0.952269] pci 0000:01:00.0: [1002:699f] type 00 class 0x030000
[    0.952384] pci 0000:01:00.0: reg 0x10: [mem 0x500000000-0x50fffffff 64bit pref]
[    0.952424] pci 0000:01:00.0: reg 0x18: [mem 0x500000000-0x5001fffff 64bit pref]
[    0.952452] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x00ff]
[    0.952479] pci 0000:01:00.0: reg 0x24: [mem 0x500000000-0x50003ffff]
[    0.952507] pci 0000:01:00.0: reg 0x30: [mem 0x500000000-0x50001ffff pref]
[    0.952536] pci 0000:01:00.0: enabling Extended Tags
[    0.952811] pci 0000:01:00.0: supports D1 D2
[    0.952822] pci 0000:01:00.0: PME# supported from D1 D2 D3hot D3cold
[    0.952884] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:00.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
[    0.953028] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.953104] pci 0000:01:00.1: [1002:aae0] type 00 class 0x040300
[    0.953193] pci 0000:01:00.1: reg 0x10: [mem 0x500000000-0x500003fff 64bit]
[    0.953302] pci 0000:01:00.1: enabling Extended Tags
[    0.953492] pci 0000:01:00.1: supports D1 D2
[    0.956776] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    0.956817] pci 0000:00:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    0.956832] pci 0000:00:00.0: BAR 8: assigned [mem 0x500000000-0x5000fffff]
[    0.956850] pci 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[    0.956886] pci 0000:01:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[    0.956921] pci 0000:01:00.0: BAR 5: assigned [mem 0x500000000-0x50003ffff]
[    0.956941] pci 0000:01:00.0: BAR 6: assigned [mem 0x500040000-0x50005ffff pref]
[    0.956955] pci 0000:01:00.1: BAR 0: assigned [mem 0x500060000-0x500063fff 64bit]
[    0.956989] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0100]
[    0.956999] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0100]
[    0.957012] pci 0000:00:00.0: PCI bridge to [bus 01]
[    0.957030] pci 0000:00:00.0:   bridge window [mem 0x500000000-0x5000fffff]
[    0.957045] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    0.957143] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
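When comparing boots, it can help to pull just the BAR results out of dmesg. Below is a rough sketch that does this with a regular expression; the pattern is written against the log lines shown in this thread, and `summarize_bars` is an illustrative name, not an existing tool:

```python
import re
from typing import Dict, Tuple

# Matches the "assigned" and "failed to assign" lines; "no space for" is skipped.
BAR_LINE = re.compile(
    r"pci (?P<dev>[0-9a-f:.]+): BAR (?P<bar>\d+): "
    r"(?P<what>assigned|failed to assign) \[(?P<res>[^\]]+)\]"
)


def summarize_bars(dmesg: str) -> Dict[Tuple[str, int], Tuple[bool, str]]:
    """Map (device, BAR number) to (assigned?, resource text)."""
    result = {}
    for m in BAR_LINE.finditer(dmesg):
        key = (m.group("dev"), int(m.group("bar")))
        result[key] = (m.group("what") == "assigned", m.group("res"))
    return result


sample = """\
[    0.956832] pci 0000:00:00.0: BAR 8: assigned [mem 0x500000000-0x5000fffff]
[    0.956921] pci 0000:01:00.0: BAR 5: assigned [mem 0x500000000-0x50003ffff]
[    0.956989] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0100]
[    0.956999] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0100]
"""
for (dev, bar), (ok, res) in sorted(summarize_bars(sample).items()):
    print(dev, f"BAR {bar}", "ok" if ok else "FAILED", res)
```

In the log above, the only remaining failure is the I/O BAR 4, presumably because none of the ranges declare an I/O window.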

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

Ooh! Something different!

[   81.495740] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x600000000 -> 0x60fffffff
[   81.495755] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x610000000 -> 0x6101fffff
[   81.495755] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x500000000 -> 0x50003ffff
[   81.495790] pci 0000:00:00.0: enabling device (0000 -> 0002)
[   81.495799] amdgpu 0000:01:00.0: enabling device (0000 -> 0002)
[   81.496128] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
[   81.496157] [drm] register mmio base: 0x00000000
[   81.496158] [drm] register mmio size: 262144
[   81.496161] [drm] PCI I/O BAR is not found.
[   81.496170] [drm] PCIE atomic ops is not supported
[   81.496183] [drm] add ip block number 0 <vi_common>
[   81.496186] [drm] add ip block number 1 <gmc_v8_0>
[   81.496190] [drm] add ip block number 2 <tonga_ih>
[   81.496190] [drm] add ip block number 3 <gfx_v8_0>
[   81.496190] [drm] add ip block number 4 <sdma_v3_0>
[   81.496202] [drm] add ip block number 5 <powerplay>
[   81.496202] [drm] add ip block number 6 <dm>
[   81.496207] [drm] add ip block number 7 <uvd_v6_0>
[   81.496212] [drm] add ip block number 8 <vce_v3_0>
[   81.496249] amdgpu 0000:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x0000
[   81.496288] amdgpu 0000:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x0000
[   81.496598] [drm:amdgpu_get_bios [amdgpu]] *ERROR* Unable to locate a BIOS ROM
[   81.496605] amdgpu 0000:01:00.0: Fatal error during GPU init
[   81.496612] [drm] amdgpu: finishing device.
[   81.496638] ------------[ cut here ]------------
[   81.496644] sysfs group 'fw_version' not found for kobject '0000:01:00.0'
[   81.496677] WARNING: CPU: 2 PID: 955 at fs/sysfs/group.c:280 sysfs_remove_group+0x94/0xa0
[   81.496679] Modules linked in: amdgpu(+) i2c_algo_bit ttm backlight aes_neon_blk crypto_simd cryptd rfcomm bnep hci_uart btbcm bluetooth ecdh_generic ecc fuse 8021q garp stp llc vc4 brcmfmac brcmutil cec sha256_generic libsha256 bcm2835_codec(C) bcm2835_isp(C) bcm2835_v4l2(C) drm_kms_helper v4l2_mem2mem v3d bcm2835_mmal_vchiq(C) videobuf2_dma_contig videobuf2_vmalloc videobuf2_memops gpu_sched videobuf2_v4l2 cfg80211 videobuf2_common rfkill videodev snd_soc_core drm mc drm_panel_orientation_quirks snd_compress vc_sm_cma(C) raspberrypi_hwmon snd_pcm_dmaengine snd_pcm snd_timer snd rpivid_mem syscopyarea sysfillrect sysimgblt fb_sys_fops uio_pdrv_genirq uio i2c_dev ip_tables x_tables ipv6
[   81.496737] CPU: 2 PID: 955 Comm: modprobe Tainted: G         C        5.4.75-v8+ #3
[   81.496737] Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
[   81.496748] pstate: 80000005 (Nzcv daif -PAN -UAO)
[   81.496750] pc : sysfs_remove_group+0x94/0xa0
[   81.496755] lr : sysfs_remove_group+0x94/0xa0
[   81.496755] sp : ffffffc01168b790
[   81.496759] x29: ffffffc01168b790 x28: 0000000000000000 
[   81.496759] x27: 0000000000000000 x26: ffffffc0092d4198 
[   81.496764] x25: ffffff80e8af4d70 x24: ffffffc009244440 
[   81.496769] x23: ffffffc0092d4000 x22: 00000000ffffffff 
[   81.496772] x21: ffffff80f66468a0 x20: ffffffc0091f26a8 
[   81.496775] x19: 0000000000000000 x18: 0000000000000070 
[   81.496778] x17: 0000000000009c80 x16: 00000000000070a8 
[   81.496780] x15: ffffffffffffffff x14: 2c35356161783020 
[   81.496784] x13: 0000000000000000 x12: ffffffc010fb5000 
[   81.496787] x11: ffffffc010eb1000 x10: ffffffc010fb5958 
[   81.496792] x9 : 0000000000000000 x8 : 0000000000000003 
[   81.496794] x7 : 0000000000000177 x6 : ffffffc01168b480 
[   81.496796] x5 : 0000000000000001 x4 : ffffff80f79c3150 
[   81.496800] x3 : 0000000000000006 x2 : 0000000000000007 
[   81.496804] x1 : 42bf2ea9eb098400 x0 : 0000000000000000 
[   81.496804] Call trace:
[   81.496811]  sysfs_remove_group+0x94/0xa0
[   81.496955]  amdgpu_ucode_sysfs_fini+0x28/0x38 [amdgpu]
[   81.497081]  amdgpu_device_fini+0x424/0x46c [amdgpu]
[   81.497205]  amdgpu_driver_unload_kms+0x54/0xa8 [amdgpu]
[   81.497330]  amdgpu_driver_load_kms+0x11c/0x178 [amdgpu]
[   81.497395]  drm_dev_register+0x144/0x1c8 [drm]
[   81.497533]  amdgpu_pci_probe+0xe0/0x178 [amdgpu]
[   81.497539]  pci_device_probe+0xb8/0x180
[   81.497540]  really_probe+0xe0/0x330
[   81.497547]  driver_probe_device+0x5c/0xf0
[   81.497549]  device_driver_attach+0x74/0x80
[   81.497550]  __driver_attach+0x64/0xe0
[   81.497554]  bus_for_each_dev+0x84/0xd8
[   81.497554]  driver_attach+0x30/0x40
[   81.497560]  bus_add_driver+0x188/0x1e8
[   81.497564]  driver_register+0x64/0x110
[   81.497567]  __pci_register_driver+0x58/0x68
[   81.497694]  amdgpu_init+0x70/0x7c [amdgpu]
[   81.497698]  do_one_initcall+0x54/0x2b8
[   81.497702]  do_init_module+0x5c/0x230
[   81.497706]  load_module+0x1ddc/0x2078
[   81.497708]  __do_sys_finit_module+0xd0/0xe8
[   81.497711]  __arm64_sys_finit_module+0x28/0x38
[   81.497715]  el0_svc_common.constprop.1+0x98/0x1a0
[   81.497720]  el0_svc_handler+0x34/0xa0
[   81.497723]  el0_svc+0x8/0x204
[   81.497727] ---[ end trace 190d680300723fc8 ]---
[   81.503721] amdgpu: probe of 0000:01:00.0 failed with error -22

from raspberry-pi-pcie-devices.

elFarto avatar elFarto commented on May 5, 2024

Can you do a lspci -vvv before you load the module?

from raspberry-pi-pcie-devices.

geerlingguy avatar geerlingguy commented on May 5, 2024

$ sudo lspci -vvv
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 20) (prog-if 00 [Normal decode])
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 0
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 00000000-00000fff
	Memory behind bridge: 00000000-000fffff
	Prefetchable memory behind bridge: 0000000200000000-0000000217ffffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [48] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [ac] Express (v2) Root Port (Slot-), MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr+ NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot+ ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt+
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible+
		RootCap: CRSVisible+
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Via WAKE# ARIFwd-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [180 v1] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
	Capabilities: [240 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=8us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=1us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon RX 550/550X] (rev c7) (prog-if 00 [VGA controller])
	Subsystem: Sapphire Technology Limited Lexa PRO [Radeon RX 550/550X] (Lexa PRO [Radeon RX 550])
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 255
	Region 0: Memory at 600000000 (64-bit, prefetchable) [disabled] [size=256M]
	Region 2: Memory at 610000000 (64-bit, prefetchable) [disabled] [size=2M]
	Region 4: I/O ports at <unassigned> [disabled]
	Region 5: [virtual] Memory at 500000000 (32-bit, non-prefetchable) [size=256K]
	[virtual] Expansion ROM at 500040000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [200 v1] #15
	Capabilities: [270 v1] #19
	Capabilities: [2b0 v1] Address Translation Service (ATS)
		ATSCap:	Invalidate Queue Depth: 00
		ATSCtl:	Enable-, Smallest Translation Unit: 00
	Capabilities: [2c0 v1] Page Request Interface (PRI)
		PRICtl: Enable- Reset-
		PRISta: RF- UPRGI- Stopped+
		Page Request Capacity: 00000020, Page Request Allocation: 00000000
	Capabilities: [2d0 v1] Process Address Space ID (PASID)
		PASIDCap: Exec+ Priv+, Max PASID Width: 10
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [320 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [370 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Kernel modules: amdgpu

01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
	Subsystem: Sapphire Technology Limited Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin B routed to IRQ 255
	Region 0: Memory at 500060000 (64-bit, non-prefetchable) [disabled] [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0

from raspberry-pi-pcie-devices.

elFarto avatar elFarto commented on May 5, 2024

Hmm, I don't think we can use the 0x5'0000'0000 address range. The expansion ROM is being allocated in that space, and it seems reading from it just returns 00's. While we can remap PCI address space at will, it looks like we can't do the same with the CPU address space; that address just isn't wired to go to the PCIe device.
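For reference, the `expecting 0xaa55, got 0x0000` message in the amdgpu log is a check on the first two bytes of the expansion ROM. A minimal sketch of that check on a byte buffer follows; `has_valid_rom_signature` is a made-up helper, not the kernel's function:

```python
def has_valid_rom_signature(rom: bytes) -> bool:
    """A PCI expansion ROM image starts with bytes 0x55 0xAA (0xAA55 little-endian)."""
    return len(rom) >= 2 and int.from_bytes(rom[:2], "little") == 0xAA55


print(has_valid_rom_signature(bytes([0x55, 0xAA, 0x00])))  # header bytes present
print(has_valid_rom_signature(bytes(16)))                   # all zeros, as in the log
```

On a running system the bytes could, as far as I know, be read from the device's `rom` file under `/sys/bus/pci/devices/` after enabling it as root, but that part is untested here.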

OK, let's see if the Pi documentation is correct. It says the top 8GiB of address space is mapped to the PCIe device. You can remove the 0x5'0000'0000 entry in the scb section we added before, but keep the modified 0x6'0000'0000 one.

Use these ranges for the pcie section:

ranges = <0x02000000 0x2 0x00000000  //PCI address
                     0x6 0x00000000  //CPU address
                     0x1 0x00000000  //Size (4GiB 64-bit only)
          0x02000000 0x0 0x00000000  //PCI address
                     0x7 0x00000000  //CPU address
                     0x0 0x80000000  //Size (2GiB 32-bit)
                     >;
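
To make the cell layout above easier to check, here is a minimal Python sketch that splits a flat `ranges` list into PCI address, CPU address and size. It assumes the usual PCI device tree layout of 3 child address cells (flags plus a 64-bit PCI address), 2 parent address cells and 2 size cells; the helper name `decode_pcie_ranges` is made up for illustration.

```python
def decode_pcie_ranges(cells):
    """Split a flat list of 32-bit cells into PCI range entries.

    Assumes 3 child address cells (flags + 64-bit PCI address),
    2 parent address cells (64-bit CPU address) and 2 size cells.
    """
    entry_len = 7
    if len(cells) % entry_len:
        raise ValueError("ranges length is not a multiple of 7 cells")
    entries = []
    for i in range(0, len(cells), entry_len):
        flags, pci_hi, pci_lo, cpu_hi, cpu_lo, size_hi, size_lo = cells[i:i + entry_len]
        entries.append({
            "flags": flags,
            "pci": (pci_hi << 32) | pci_lo,
            "cpu": (cpu_hi << 32) | cpu_lo,
            "size": (size_hi << 32) | size_lo,
        })
    return entries


# The two entries suggested above.
ranges = [
    0x02000000, 0x2, 0x00000000, 0x6, 0x00000000, 0x1, 0x00000000,
    0x02000000, 0x0, 0x00000000, 0x7, 0x00000000, 0x0, 0x80000000,
]

for e in decode_pcie_ranges(ranges):
    print(f"PCI {e['pci']:#x} -> CPU {e['cpu']:#x}, size {e['size'] / 2**30:g} GiB")
```

Running it on an edited device tree before rebooting is a cheap way to catch a dropped digit or a swapped cell.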

geerlingguy commented on May 5, 2024

Okay, trying with:

ranges = <0x02000000 0x2 0x00000000  0x6 0x00000000  0x1 0x00000000
          0x02000000 0x0 0x00000000  0x7 0x00000000  0x0 0x80000000>;

Along with the earlier change in the scb ranges:

ranges = <0x00 0x7c000000  0x00 0xfc000000  0x00 0x3800000
          0x00 0x40000000  0x00 0xff800000  0x00 0x800000
          0x05 0x00000000  0x05 0x00000000  0x00 0x80000000
          0x06 0x00000000  0x06 0x00000000  0x02 0x00000000
          0x00 0x00000000  0x00 0x00000000  0x00 0xfc000000>;

BARs were assigned, and here's output after modprobe:

[   70.190591] [drm] amdgpu kernel modesetting enabled.
[   70.190786] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x600000000 -> 0x60fffffff
[   70.190792] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x610000000 -> 0x6101fffff
[   70.190797] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x700000000 -> 0x70003ffff
[   70.190827] pci 0000:00:00.0: enabling device (0000 -> 0002)
[   70.190840] amdgpu 0000:01:00.0: enabling device (0000 -> 0002)
[   70.191201] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
[   70.191227] [drm] register mmio base: 0x00000000
[   70.191230] [drm] register mmio size: 262144
[   70.191234] [drm] PCI I/O BAR is not found.
[   70.191240] [drm] PCIE atomic ops is not supported
[   70.191253] [drm] add ip block number 0 <vi_common>
[   70.191257] [drm] add ip block number 1 <gmc_v8_0>
[   70.191261] [drm] add ip block number 2 <tonga_ih>
[   70.191265] [drm] add ip block number 3 <gfx_v8_0>
[   70.191269] [drm] add ip block number 4 <sdma_v3_0>
[   70.191273] [drm] add ip block number 5 <powerplay>
[   70.191278] [drm] add ip block number 6 <dm>
[   70.191282] [drm] add ip block number 7 <uvd_v6_0>
[   70.191286] [drm] add ip block number 8 <vce_v3_0>
[   70.439057] ATOM BIOS: 113-36764-U61
[   70.439138] [drm] UVD is enabled in VM mode
[   70.439142] [drm] UVD ENC is enabled in VM mode
[   70.439149] [drm] VCE enabled in VM mode
[   70.439174] [drm] GPU posting now...
[   70.560470] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[   70.562646] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0x610000000-0x6101fffff 64bit pref]
[   70.562655] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0x600000000-0x60fffffff 64bit pref]
[   70.562699] pci 0000:00:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   70.562719] pci 0000:00:00.0: BAR 9: assigned [mem 0x600000000-0x6bfffffff 64bit pref]
[   70.562727] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x67fffffff 64bit pref]
[   70.562745] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x680000000-0x6801fffff 64bit pref]
[   70.562763] pci 0000:00:00.0: PCI bridge to [bus 01]
[   70.562771] pci 0000:00:00.0:   bridge window [mem 0x700000000-0x7000fffff]
[   70.562778] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6bfffffff 64bit pref]
[   70.562798] amdgpu 0000:01:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[   70.562804] amdgpu 0000:01:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[   70.562811] [drm] Detected VRAM RAM=2048M, BAR=2048M
[   70.562815] [drm] RAM width 64bits GDDR5
[   70.562981] [TTM] Zone  kernel: Available graphics memory: 1944444 KiB
[   70.562985] [TTM] Initializing pool allocator
[   70.563000] [TTM] Initializing DMA pool allocator
[   70.563073] [drm] amdgpu: 2048M of VRAM memory ready
[   70.563083] [drm] amdgpu: 2848M of GTT memory ready.
[   70.563135] [drm] GART: num cpu pages 65536, num gpu pages 65536
[   70.564367] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[   70.567975] [drm] Chained IB support enabled!

Sooo close (now I have all the memory available!), but still locking up at that point.
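
For anyone following along, the "all the memory" part can be read straight from the log: a BAR's size is its end address minus its start address, plus one. A small Python sketch (the helper name `bar_size` is hypothetical) using the BAR 0 ranges printed before and after the driver released and reassigned it:

```python
def bar_size(start, end):
    """Size in bytes of an inclusive [start, end] address range."""
    return end - start + 1


MIB = 2**20

# BAR 0 as first assigned, then after amdgpu released and reassigned it.
before = bar_size(0x600000000, 0x60fffffff)
after = bar_size(0x600000000, 0x67fffffff)

print(f"BAR 0 before: {before // MIB} MiB")
print(f"BAR 0 after:  {after // MIB} MiB")
```

The second figure should line up with the `BAR=2048M` that the driver reports next to the detected VRAM size.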

elmeyer commented on May 5, 2024

Very excited about this. Might I suggest attempting to use ftrace as an alternative to @elFarto's suggestion? This seems like a decent how-to: https://embeddedbits.org/tracing-the-linux-kernel-with-ftrace/

elFarto commented on May 5, 2024

It seems the Pi doesn't actually support having multiple 'ranges' mappings.

PhilE has posted a new firmware version and device tree config in the Pi forums which should help a lot. It moves the RAM mapping up to 0x2'0000'0000, which means we can have a single 8GiB BAR mapping at 0x0'0000'0000.

geerlingguy commented on May 5, 2024

@elFarto - Yeah I just saw that. I'm going to give it a try in a bit, I just have a case of the Mondays... about 50 things to do before lunch, and that's only an hour away :D

geerlingguy commented on May 5, 2024

All right, so I followed the directions in this post:

  1. Downloaded the trial firmware and replaced the files in the boot volume.

  2. Set scb ranges to:

       ranges = <0x0 0x7c000000  0x0 0xfc000000  0x0 0x03800000>,
                <0x0 0x40000000  0x0 0xff800000  0x0 0x00800000>,
                <0x6 0x00000000  0x6 0x00000000  0x2 0x00000000>,
                <0x0 0x00000000  0x0 0x00000000  0x0 0xfc000000>;
    
  3. Set pcie ranges to:

       ranges = <0x02000000 0x0 0x00000000 0x6 0x00000000
                 0x2 0x00000000>;
    

On boot, here is the dmesg output:

[    0.905022] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    0.905041] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    0.905100] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x07ffffffff -> 0x0000000000
[    0.905157] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0200000000
[    0.952149] brcm-pcie fd500000.pcie: link up, 5 GT/s x1 (SSC)
[    0.952454] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    0.952470] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.952485] pci_bus 0000:00: root bus resource [mem 0x600000000-0x7ffffffff] (bus address [0x00000000-0x1ffffffff])
[    0.952538] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    0.952762] pci 0000:00:00.0: PME# supported from D0 D3hot
[    0.956405] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    0.956606] pci 0000:01:00.0: [1002:699f] type 00 class 0x030000
[    0.956723] pci 0000:01:00.0: reg 0x10: [mem 0x600000000-0x60fffffff 64bit pref]
[    0.956764] pci 0000:01:00.0: reg 0x18: [mem 0x600000000-0x6001fffff 64bit pref]
[    0.956791] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x00ff]
[    0.956819] pci 0000:01:00.0: reg 0x24: [mem 0x600000000-0x60003ffff]
[    0.956846] pci 0000:01:00.0: reg 0x30: [mem 0x600000000-0x60001ffff pref]
[    0.956876] pci 0000:01:00.0: enabling Extended Tags
[    0.957152] pci 0000:01:00.0: supports D1 D2
[    0.957163] pci 0000:01:00.0: PME# supported from D1 D2 D3hot D3cold
[    0.957227] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:00.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
[    0.957369] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.957447] pci 0000:01:00.1: [1002:aae0] type 00 class 0x040300
[    0.957537] pci 0000:01:00.1: reg 0x10: [mem 0x600000000-0x600003fff 64bit]
[    0.957647] pci 0000:01:00.1: enabling Extended Tags
[    0.957838] pci 0000:01:00.1: supports D1 D2
[    0.961329] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    0.961369] pci 0000:00:00.0: BAR 9: assigned [mem 0x700000000-0x717ffffff 64bit pref]
[    0.961383] pci 0000:00:00.0: BAR 8: assigned [mem 0x600000000-0x6000fffff]
[    0.961401] pci 0000:01:00.0: BAR 0: assigned [mem 0x700000000-0x70fffffff 64bit pref]
[    0.961438] pci 0000:01:00.0: BAR 2: assigned [mem 0x710000000-0x7101fffff 64bit pref]
[    0.961473] pci 0000:01:00.0: BAR 5: assigned [mem 0x600000000-0x60003ffff]
[    0.961492] pci 0000:01:00.0: BAR 6: assigned [mem 0x600040000-0x60005ffff pref]
[    0.961506] pci 0000:01:00.1: BAR 0: assigned [mem 0x600060000-0x600063fff 64bit]
[    0.961540] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0100]
[    0.961551] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0100]
[    0.961563] pci 0000:00:00.0: PCI bridge to [bus 01]
[    0.961582] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6000fffff]
[    0.961597] pci 0000:00:00.0:   bridge window [mem 0x700000000-0x717ffffff 64bit pref]
[    0.961695] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
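
As an aside, the "4.000 Gb/s available" and "63.008 Gb/s" figures in that log follow from link speed, lane count and line encoding (8b/10b at 5 GT/s, 128b/130b at 8 GT/s). A rough Python sketch; the truncating integer division is my assumption about how the kernel rounds, chosen because it reproduces the logged numbers, and `link_bandwidth_mbps` is a made-up name:

```python
# Encoding efficiency per link speed in GT/s: (payload bits, total bits).
ENCODING = {
    2.5: (8, 10),
    5.0: (8, 10),
    8.0: (128, 130),
}


def link_bandwidth_mbps(gt_per_s, lanes):
    """Usable bandwidth in Mb/s for a PCIe link (truncating integer math)."""
    payload, total = ENCODING[gt_per_s]
    per_lane = int(gt_per_s * 1000) * payload // total
    return per_lane * lanes


print(link_bandwidth_mbps(5.0, 1) / 1000)   # the Pi's x1 link at 5 GT/s
print(link_bandwidth_mbps(8.0, 8) / 1000)   # the x8 link the card is capable of
```

So the card is running with roughly a sixteenth of the bandwidth it was designed for, which matters for performance but should not by itself cause a lockup.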

And after sudo modprobe amdgpu, I get:

[   39.424358] [drm] amdgpu kernel modesetting enabled.
[   39.424558] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x700000000 -> 0x70fffffff
[   39.424564] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x710000000 -> 0x7101fffff
[   39.424569] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x600000000 -> 0x60003ffff
[   39.424608] pci 0000:00:00.0: enabling device (0000 -> 0002)
[   39.424622] amdgpu 0000:01:00.0: enabling device (0000 -> 0002)
[   39.424973] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
[   39.425001] [drm] register mmio base: 0x00000000
[   39.425005] [drm] register mmio size: 262144
[   39.425009] [drm] PCI I/O BAR is not found.
[   39.425015] [drm] PCIE atomic ops is not supported
[   39.425029] [drm] add ip block number 0 <vi_common>
[   39.425033] [drm] add ip block number 1 <gmc_v8_0>
[   39.425036] [drm] add ip block number 2 <tonga_ih>
[   39.425040] [drm] add ip block number 3 <gfx_v8_0>
[   39.425044] [drm] add ip block number 4 <sdma_v3_0>
[   39.425048] [drm] add ip block number 5 <powerplay>
[   39.425052] [drm] add ip block number 6 <dm>
[   39.425056] [drm] add ip block number 7 <uvd_v6_0>
[   39.425060] [drm] add ip block number 8 <vce_v3_0>
[   39.671054] ATOM BIOS: 113-36764-U61
[   39.671131] [drm] UVD is enabled in VM mode
[   39.671135] [drm] UVD ENC is enabled in VM mode
[   39.671142] [drm] VCE enabled in VM mode
[   39.671167] [drm] GPU posting now...
[   39.792533] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[   39.794715] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0x710000000-0x7101fffff 64bit pref]
[   39.794723] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0x700000000-0x70fffffff 64bit pref]
[   39.794767] pci 0000:00:00.0: BAR 9: releasing [mem 0x700000000-0x717ffffff 64bit pref]
[   39.794785] pci 0000:00:00.0: BAR 9: assigned [mem 0x700000000-0x7bfffffff 64bit pref]
[   39.794792] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x700000000-0x77fffffff 64bit pref]
[   39.794810] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x780000000-0x7801fffff 64bit pref]
[   39.794828] pci 0000:00:00.0: PCI bridge to [bus 01]
[   39.794837] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6000fffff]
[   39.794843] pci 0000:00:00.0:   bridge window [mem 0x700000000-0x7bfffffff 64bit pref]
[   39.794864] amdgpu 0000:01:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[   39.794870] amdgpu 0000:01:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[   39.794877] [drm] Detected VRAM RAM=2048M, BAR=2048M
[   39.794881] [drm] RAM width 64bits GDDR5
[   39.795032] [TTM] Zone  kernel: Available graphics memory: 1944444 KiB
[   39.795036] [TTM] Initializing pool allocator
[   39.795048] [TTM] Initializing DMA pool allocator
[   39.795125] [drm] amdgpu: 2048M of VRAM memory ready
[   39.795135] [drm] amdgpu: 2848M of GTT memory ready.
[   39.795198] [drm] GART: num cpu pages 65536, num gpu pages 65536
[   39.796470] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[   39.800292] [drm] Chained IB support enabled!
< Pi locks up as usual >

Just noting here so I can have a consolidated list for new builds:

  1. Create /etc/modprobe.d/blacklist-amdgpu.conf with contents blacklist amdgpu
  2. Compile kernel with amdgpu driver enabled and copy that to the microSD card.
  3. Follow the instructions above (in this comment) to put in the trial firmware and also change the BAR allocations.
  4. Ensure AMD firmware is installed: sudo apt install -y firmware-amd-graphics
  5. Reboot and then follow things (tracing, dmesg, etc.) and run sudo modprobe amdgpu

geerlingguy commented on May 5, 2024

Here's the last bits that were written to screen that mentioned amdgpu: https://pastebin.com/nmXKcdNW

And the very last bits that were printed just a ms or so later, when everything completely locks up: https://pastebin.com/0rnRnbp6

geerlingguy commented on May 5, 2024

Limiting to drm:

# echo 0 > tracing_on
# echo function_graph > current_tracer
# echo drm* > set_ftrace_filter
# echo 1 > tracing_on
# cat trace_pipe

The last bit of output:

 0)               |  drm_ioctl [drm]() {
 0)   1.315 us    |    drm_dev_enter [drm]();
 0)   1.297 us    |    drm_dev_exit [drm]();
 0)   1.166 us    |    drm_dbg [drm]();
 0)               |    drm_ioctl_kernel [drm]() {
 0)   1.185 us    |      drm_dev_enter [drm]();
 0)   1.223 us    |      drm_dev_exit [drm]();
 0)   1.129 us    |      drm_ioctl_permit [drm]();
 0)               |      drm_mode_getconnector [drm]() {
 0)   2.871 us    |        drm_mode_object_find [drm]();
 0)   1.519 us    |        drm_mode_object_find [drm]();
 0)               |        drm_helper_probe_single_connector_modes [drm_kms_helper]() {
 0)   1.574 us    |          drm_modeset_acquire_init [drm]();
 0)   1.167 us    |          drm_dbg [drm]();
 0)   2.259 us    |          drm_modeset_lock [drm]();
 0)               |          drm_helper_probe_detect [drm_kms_helper]() {
 0)   1.889 us    |            drm_modeset_lock [drm]();
 0)   1.074 us    |            drm_dbg [drm]();
 0)   6.593 us    |          }
 0)               |          drm_do_get_edid [drm]() {
 0)               |            drm_get_override_edid [drm]() {
 0)   1.167 us    |              drm_load_edid_firmware [drm]();
 0)   3.518 us    |            }
 0)   5.704 us    |            drm_edid_block_valid [drm]();
 0) + 12.019 us   |            drm_edid_block_valid [drm]();
 0) * 28497.81 us |          }
 0)               |          drm_detect_hdmi_monitor [drm]() {
 0)   1.556 us    |            drm_find_cea_extension [drm]();
 0)   4.407 us    |          }
 0)               |          drm_connector_update_edid_property [drm]() {
 0)               |            drm_add_display_info [drm]() {
 0)   1.167 us    |              drm_dbg [drm]();
 0)   1.204 us    |              drm_find_cea_extension [drm]();
 0)   1.056 us    |              drm_dbg [drm]();
 0) + 10.630 us   |            }
 0)   9.130 us    |            drm_object_property_set_value [drm]();
 0)               |            drm_property_replace_global_blob [drm]() {
 0)   6.278 us    |              drm_property_create_blob.part.9 [drm]();
 0)   1.148 us    |              drm_object_property_set_value [drm]();

geerlingguy commented on May 5, 2024

Just as a note—looking at the linux/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c driver itself, right after the 'Chained IB' message is output, it looks like the next step is loading the firmware: https://github.com/raspberrypi/linux/blob/69b14a2e6d4e840c7609370dbf0bac847c3bb15c/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c#L1066-L1069

Is there any easy way I could trace just that section of code? I guess maybe throw in some trace_printk() calls?

elFarto commented on May 5, 2024

All those calls to atom_* seem to be the driver executing some bytecode. Looks like that has some extra debugging you can enable[1].

[1] drivers/gpu/drm/amd/amdgpu/atom.c

elmeyer commented on May 5, 2024

Re: the drm trace, what monitor are you using? It looks like the last thing it's doing is reading an HDMI monitor's EDID and CEA info. I do think firmware upload sounds like a more likely culprit though.

geerlingguy commented on May 5, 2024

@elFarto - I set ATOM_DEBUG=1 and recompiled, and also modified this line to set amdgpu_atom_debug to 1, and am copying that over to the Pi.

@elmeyer - I have a random little HDMI 7" display plugged in, but I don't believe I had it plugged in on the run where I dumped that info... maybe I did. Can't remember :/ — in any case, I've tried a number of times with it unplugged, and a few times with it plugged in, always fails at the same stage.

geerlingguy commented on May 5, 2024

atom-debug-output-dmesg.txt

Whee! That's a lot of debug output coming from dmesg now (see 600+ KB file above)—but anyways, you can see the last lines are here:

[   36.680235] COMPARE_PS @ 0xBA41
[   36.680239]    
[   36.680240]    
[   36.680243]    
[   36.680246]    
[   36.680249]    src1: 
[   36.680252] PS[0x00,0x1007530]

After that PS line, the system completely locks up. Going to run it again and see if it's any different.

geerlingguy commented on May 5, 2024

2nd run showed different output—the last lines were:

[   55.857326] MOVE_WS @ 0xBAD3
[   55.857331]    
[   55.857333]    
[   55.857336]    
[   55.857340]    
[   55.857343]    src: 
[   55.857347] WS[0x40]
[   55.857350] .[31:16] -> 0x09DE
[   55.857354]    
[   55.857356]    
[   55.857359]    

elFarto commented on May 5, 2024

Something isn't right with that output; we've gotten well past that first line without all the debugging. Perhaps it's producing too much and filling up the buffers and crashing before dmesg can output it. I would disable the ATOM_DEBUG for the moment, and stick to some well-placed printk lines after the Chained IB message.

While ftrace might be useful, since it's crashing we have no idea if we're seeing the site of the crash or a completely unrelated spot.

There is an option called CONFIG_DYNAMIC_DEBUG that might be useful. I haven't used it before though.

elmeyer commented on May 5, 2024

I am absolutely not an expert on this but if you are logging to disk, are we seeing different results from run to run depending on how much of the log has actually been written to disk? If this is indeed an issue, a serial cable might help?

geerlingguy commented on May 5, 2024

@elmeyer - I just have a 2nd terminal window open via SSH that's getting output straight from the Pi (no writing to disk, since the poor microSD card would probably catch fire).

elmeyer commented on May 5, 2024

Seems like it's uploading quite a few firmware binaries. Maybe a good way to triangulate would be to just drop printk's after every request_firmware and amdgpu_ucode_validate?

geerlingguy commented on May 5, 2024

@elFarto - Trying that now—I had a printk() immediately following, and didn't see its output, but I'll add an mdelay right after that printk() and see if it prints.

Edit: Ah, right you are! I got the next debug statement to print with that added delay. I'm now adding a bunch more breakpoints with delays as well, because each time I add more delays it seems the output gets further :P

geerlingguy commented on May 5, 2024

I can confirm we make it to line 1220: https://github.com/raspberrypi/linux/blob/69b14a2e6d4e840c7609370dbf0bac847c3bb15c/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c#L1220

The last statement I injected was immediately before that out: line.

After the out: line, inside the if (err) statement, I put a debug message but it doesn't print. It ends on the debug message I stuck after the if (adev->gfx.mec2_fw) {.

Don't know what to make of this...

elFarto commented on May 5, 2024

Put a printk after this line: https://github.com/raspberrypi/linux/blob/69b14a2e6d4e840c7609370dbf0bac847c3bb15c/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c#L1213

And add the address of the info variable (with a %p).

Edit: I'm thinking it's an unaligned access, although thinking about it a bit more, didn't arm64 start allowing those?

geerlingguy commented on May 5, 2024

For that line:

[   30.834344] ucodeinfo: 00000000433d4c73

And I slammed a bunch more debug statements into the surrounding function and am now seeing the code get to here: https://github.com/raspberrypi/linux/blob/69b14a2e6d4e840c7609370dbf0bac847c3bb15c/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c#L1987 (either stopping in r = adev->gfx.rlc.funcs->init(adev); or r = gfx_v8_0_mec_init(adev);).

geerlingguy commented on May 5, 2024

All right, more debugging, looks like we're stopping somewhere inside https://github.com/raspberrypi/linux/blob/69b14a2e6d4e840c7609370dbf0bac847c3bb15c/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c#L1994 (gfx_v8_0_mec_init(adev)).

geerlingguy commented on May 5, 2024

Looks like inside that function, it gets down to https://github.com/raspberrypi/linux/blob/69b14a2e6d4e840c7609370dbf0bac847c3bb15c/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c#L1352 (memset(hpd, 0, mec_hpd_size); — part of the gfx_v8_0_mec_init() function).

Note that I'm going to run this last test a few times. One of the times it did not make it this far, it seemed to bail out far earlier in the initialization process.

geerlingguy commented on May 5, 2024

Hmm... Now I'm really wondering if it could be a power issue. The fan speed drops by maybe 20% when everything seems to lock up, and on top of that I'm seeing the same test stop at a few different places in the flow.

But with a power issue, could it really completely lock up the Pi? Usually if there's a power blip (momentary, though), the Pi tends to recover gracefully.

geerlingguy commented on May 5, 2024

I just got a new powered riser in the mail today from eBay, and it at least seems to make the card behave differently, if not stably:

IMG_2804

The fan comes on the same as when it's plugged straight into the Pi, but after a bunch of reboots and hard power cycles, I just end up with a kernel panic:

IMG_2803

geerlingguy commented on May 5, 2024

AH! I got the darn PCE164P-N03 to work, after fiddling with it, changing the USB cable a couple times, and shutting down and starting up the Pi.

But alas, it hits the exact same failure mode—it hits memset(hpd, 0, mec_hpd_size); and then the Pi goes AWOL.

elFarto commented on May 5, 2024

Can you print the address of hpd? Also, I seem to remember someone had a similar problem with an Nvidia card on the pi forums.

geerlingguy commented on May 5, 2024

@elFarto Here they are:

[   45.962749] hpd: 1074274304
[   45.962754] mec_hpd_size: 32768

Edit: Also, regarding that inconsistency—sometimes it seems like the Pi bails way earlier in the process, but it seems like it's always after [drm] GPU posting now.... That probably doesn't help much, but I just find it interesting that it doesn't always make it to the memset call... but it does about half the time (and never further).

Edit 2: And I can confirm it always has the same values (well at least 3/3 tries) for hpd and mec_hpd_size.

PixlRainbow commented on May 5, 2024

Also, regarding that inconsistency—sometimes it seems like the Pi bails way earlier in the process, but it seems like it's always after [drm] GPU posting now.... That probably doesn't help much, but I just find it interesting that it doesn't always make it to the memset call... but it does about half the time (and never further).

I wonder if the GPU is starting up and sending back some data/triggering some interrupt when it's done, and the interrupt handler somewhere is crashing but we're not tracing it. Might explain the inconsistency? Not sure.

elFarto commented on May 5, 2024

After a bit of searching on the matter I think I know what the issue is. memset uses a specific instruction, DC ZVA, to speed up clearing RAM. However this instruction only works on RAM, and will fault if used on device memory/MMIO. The correct method to use for this seems to be memset_io. So you can try changing it to use that method, and/or (probably best to double check this) set the length memset uses to 63 (one less than what will trigger it to use that instruction).

On x86 there's no difference between the methods, so it would never have been spotted there.

If this is the case then it's a bug/oversight in the AMD driver. Hopefully there aren't too many of them, or you'll be making a lot of changes 😃 (and not to mention all the user space code...)

PixlRainbow commented on May 5, 2024

sounds like someone should put in a pull request for the driver repo

geerlingguy commented on May 5, 2024

I will try this out in a bit—things are a bit crazy this week so far :P

Strangely, now 3 for 3 tries, I get stuck at:

[   27.927735] [drm] amdgpu kernel modesetting enabled.
...
[   27.948611] [drm] add ip block number 7 <uvd_v6_0>
[   27.948615] [drm] add ip block number 8 <vce_v3_0>

And it locks up at that point.

Edit: And if I back out the previous change I made, I'm still getting that. Maybe I need to reset things a little.

Edit 2: Ah, I was using a bad dtb. Had to recompile it on the Pi.

Edit 3: With memset_io(hpd, 0, mec_hpd_size);, I can't even get any output to dmesg from amdgpu before the system freezes. Weird.

Edit 4: Well, I'm able to get it to the same point (that memset_io() call now) but it still dies at that point. It dies well before that point a lot of times too, so I do also wonder if something else the card does is killing the system, and the amount of time it takes for the card to do it is variable (though never progresses past the memset/memset_io()).

geerlingguy commented on May 5, 2024

@elFarto - How would I go about doing:

set the length memset uses to 63

I tried:

        /*mec_hpd_size = adev->gfx.num_compute_rings * GFX8_MEC_HPD_SIZE;*/
        mec_hpd_size = 63;

elFarto commented on May 5, 2024

memset(hpd, 0, 63);

In fact, you could probably just remove the call completely. However, given that it doesn't seem to be reliable, it does seem to indicate that there's another issue somewhere.

geerlingguy commented on May 5, 2024

So now, it is failing at this point: https://github.com/raspberrypi/linux/blob/69b14a2e6d4e840c7609370dbf0bac847c3bb15c/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c#L1907

	r = amdgpu_ring_init(adev, ring, 1024,
			&adev->gfx.eop_irq, irq_type);

amdgpu_ring_init is here, for reference.

(That's the furthest it got 3/4 times—the 3rd time it still failed at the memset(). The inconsistency is killing me!)

elmeyer commented on May 5, 2024

Just in case anyone else was wondering what MEC stands for, apparently it is the "Micro Engine Compute" or the compute command processor. Hope my continued speculation doesn't bother you, but if it does turn out to be a power issue, I could imagine that uploading the firmware and then attempting initialization kicks a significant portion of the GPU into action, causing something of a spike.

geerlingguy commented on May 5, 2024

@elmeyer - Speculation is quite welcome :)

I have been using three different externally-powered PCIe risers, and it seems like if it is a power issue, at least one of the three would work. But stranger things... The one I'm using now I'm plugging in through my nice 600W PSU's 4-pin molex connectors, and besides the fan on the GPU spinning a little more slowly (I don't know if that's normal or not—it does spin faster during initial poweron, and then settles down to a normal speed before initialization), nothing on the PSU seems to indicate it's getting a surge or hurting for electrons.

elFarto commented on May 5, 2024

One option I've been looking at is to set up kgdb (the kernel debugger) over a serial connection (it might be possible to use kdb as well). This would in theory let you step line by line through the kernel until it hit the problem. However, setting this up seems complicated. This would help remove the ambiguity of not seeing log lines: it'll either hit the breakpoint or not.

Here's a series of articles on it: https://oliveryang.net/2015/08/using-kgdb-debug-linux-kernel-1/
https://oliveryang.net/2015/08/using-kgdb-debug-linux-kernel-2/
https://oliveryang.net/2016/04/using-kgdb-debug-linux-kernel-3/

Coreforge commented on May 5, 2024

One way to make sure it's not power would be to hook up an oscilloscope to 12V power and set it to single shot capture to capture it if there's a power surge/spike. Another thing I could think of is a bad ground connection between the card and the pi with the powered risers as they usually use the shield of the cable for ground. To get around that, you could either use one power supply for both the pi and card or add an additional ground wire. My cm4 hasn't arrived yet so I can't test it with that, but the PCE164P-N03 VER006 riser worked without a problem in my PC where everything uses one PSU.

geerlingguy commented on May 5, 2024

@Coreforge there are so many little things I would like to do with a scope, but alas I don't have one. Maybe 2020 is finally the year I stop waffling on the decision to buy a good one and do it! (Note: I'm originally a software guy, and higher level, so the past few years are the first time I've dabbled in hardware... so I'm still building up my knowledge/tools, and trying to make sure I get good ones the first time, so I can have them for life. My Dad still has and uses his giant old analog oscilloscope he bought after graduating college many years ago, though he uses nice digital ones at his workplace!)

volkertb commented on May 5, 2024

So I just read about this new "AMD Smart Access Memory / Resizable BAR" patch for the Linux kernel that is currently in development. Apparently, it would allow the driver to automatically adjust the BAR size for the GPU(s) as needed. According to this article on Phoronix, the patch still needs some work. Apparently, Marek Olšák (employed by AMD) is the main developer working on this functionality.

Perhaps someone here could reach out to him to ask him to take ARM64 compatibility into account as he works on this? He's been offering support for this patch in this thread in the Phoronix forums, to users willing to try out the patch. He also seems to be gathering feedback there, so I'm sure he'd be interested in the results of any attempts to get AMD cards working on a CM4, at least while using the patch.

Anyway, thank you for continuing to raise the BAR with testing PCIe GPUs on the Raspberry Pi CM4. (I'm surprised no one else here had used that obvious pun yet. 🥁 😜)

Coreforge avatar Coreforge commented on May 5, 2024

I doubt SAM is going to work on the Pi, considering that, for now at least, it only works on newer Ryzen chips. The BAR size doesn't seem to be that big of a deal though, as the BARs can be resized in the device tree.

elFarto avatar elFarto commented on May 5, 2024

Actually the Pi is already resizing the BAR, see this snippet from one of the above log outputs:

[   39.794877] [drm] Detected VRAM RAM=2048M, BAR=2048M

I don't see any reason why the rest of the SAM patches wouldn't also work on the Pi (I believe they're more about optimisations you can make when you have access to all of the GPU's memory).
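
For anyone who wants to check this on their own logs, here is a minimal Python sketch that pulls the two numbers out of a line like the one quoted above and compares them. The function name `bar_covers_vram` is made up for illustration, and it assumes the `BAR=` figure is the size of the CPU-visible aperture, which is my reading of the log line rather than something confirmed in this thread.

```python
import re

# Matches the driver line quoted above, e.g.
# "[drm] Detected VRAM RAM=2048M, BAR=2048M"
VRAM_RE = re.compile(r"Detected VRAM RAM=(\d+)M, BAR=(\d+)M")


def bar_covers_vram(dmesg_line):
    """Return True if the reported BAR is at least as large as VRAM.

    Returns None when the line does not contain the expected message.
    (Hypothetical helper; assumes both values are in MiB as printed.)
    """
    m = VRAM_RE.search(dmesg_line)
    if m is None:
        return None
    vram_mib, bar_mib = (int(g) for g in m.groups())
    return bar_mib >= vram_mib


line = "[   39.794877] [drm] Detected VRAM RAM=2048M, BAR=2048M"
print(bar_covers_vram(line))
```

For the line from the log above, both values are 2048, so the check comes back true.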

geerlingguy avatar geerlingguy commented on May 5, 2024

This is nice—with the latest kernel build (5.10.y branch), BAR space is allocated properly without having to manage it by hand:

[    1.245667] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    1.248169] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    1.250790] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x063fffffff -> 0x00c0000000
[    1.253348] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0200000000
[    1.306332] brcm-pcie fd500000.pcie: link up, 5.0 GT/s PCIe x1 (SSC)
[    1.309124] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    1.311512] pci_bus 0000:00: root bus resource [bus 00-ff]
[    1.313972] pci_bus 0000:00: root bus resource [mem 0x600000000-0x63fffffff] (bus address [0xc0000000-0xffffffff])
[    1.316518] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    1.319188] pci 0000:00:00.0: PME# supported from D0 D3hot
[    1.325301] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    1.328081] pci 0000:01:00.0: [1002:699f] type 00 class 0x030000
[    1.330652] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff 64bit pref]
[    1.333195] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x001fffff 64bit pref]
[    1.335670] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x00ff]
[    1.338124] pci 0000:01:00.0: reg 0x24: [mem 0x00000000-0x0003ffff]
[    1.340567] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[    1.342999] pci 0000:01:00.0: enabling Extended Tags
[    1.345722] pci 0000:01:00.0: supports D1 D2
[    1.348074] pci 0000:01:00.0: PME# supported from D1 D2 D3hot D3cold
[    1.350511] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at 0000:00:00.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
[    1.353123] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    1.355660] pci 0000:01:00.1: [1002:aae0] type 00 class 0x040300
[    1.358155] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[    1.360681] pci 0000:01:00.1: enabling Extended Tags
[    1.363260] pci 0000:01:00.1: supports D1 D2
[    1.369215] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    1.371579] pci 0000:00:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    1.373952] pci 0000:00:00.0: BAR 8: assigned [mem 0x618000000-0x6180fffff]
[    1.376294] pci 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[    1.378623] pci 0000:01:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[    1.380924] pci 0000:01:00.0: BAR 5: assigned [mem 0x618000000-0x61803ffff]
[    1.383183] pci 0000:01:00.0: BAR 6: assigned [mem 0x618040000-0x61805ffff pref]
[    1.385498] pci 0000:01:00.1: BAR 0: assigned [mem 0x618060000-0x618063fff 64bit]
[    1.387760] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0100]
[    1.390012] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0100]
[    1.392265] pci 0000:00:00.0: PCI bridge to [bus 01]
[    1.394466] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6180fffff]
[    1.396674] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    1.399035] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
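
To make the address ranges above easier to read, here is a rough Python sketch that turns the `BAR N: assigned [mem ...]` lines into sizes and also reports the size of the host bridge's outbound `MEM` window. The helper names (`bar_sizes`, `outbound_window`, `human`) are invented for this example, and the regular expressions only assume the line format shown in this particular log; other kernels or drivers may print things differently.

```python
import re

# A few lines copied from the boot log above.
LOG = """\
[    1.250790] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x063fffffff -> 0x00c0000000
[    1.253348] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0200000000
[    1.376294] pci 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[    1.378623] pci 0000:01:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[    1.380924] pci 0000:01:00.0: BAR 5: assigned [mem 0x618000000-0x61803ffff]
[    1.385498] pci 0000:01:00.1: BAR 0: assigned [mem 0x618060000-0x618063fff 64bit]
"""

BAR_RE = re.compile(
    r"pci (?P<dev>[0-9a-f:.]+): BAR (?P<bar>\d+): assigned "
    r"\[mem 0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+)"
)
# Outbound window only; the "IB MEM" (inbound) line is deliberately not matched.
WINDOW_RE = re.compile(r":\s+MEM 0x([0-9a-f]+)\.\.0x([0-9a-f]+)")


def span(start_hex, end_hex):
    """Size in bytes of an inclusive [start, end] address range."""
    return int(end_hex, 16) - int(start_hex, 16) + 1


def bar_sizes(log_text):
    """Map (device, BAR number) -> size in bytes for every assigned mem BAR."""
    return {
        (m["dev"], int(m["bar"])): span(m["start"], m["end"])
        for m in BAR_RE.finditer(log_text)
    }


def outbound_window(log_text):
    """Size in bytes of the host bridge MEM window, or None if not found."""
    m = WINDOW_RE.search(log_text)
    return span(*m.groups()) if m else None


def human(n):
    """Format a byte count using binary units."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n < 1024 or unit == "GiB":
            return f"{n:g} {unit}"
        n /= 1024


print("window:", human(outbound_window(LOG)))
for (dev, bar), size in sorted(bar_sizes(LOG).items()):
    print(dev, "BAR", bar, human(size))
```

Run against these lines, the outbound window works out to 1 GiB and the GPU's BAR 0 to 256 MiB, with the smaller BARs placed after it inside the same window.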
