Show HN: Deterministic PCIe Diagnostics for GPUs on Linux

github.com

20 points by gpu_systems 2 days ago

I built a small Linux tool to deterministically verify GPU PCIe link health and bandwidth.

It reports:
- Negotiated PCIe generation and width
- Peak Host→Device and Device→Host memcpy bandwidth
- Sustained PCIe TX/RX utilization via NVML
- A rule-based verdict derived from observable hardware data only
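
For anyone curious what the link check and verdict look like in practice, here is a rough sketch, not the tool's actual code. It reads the standard sysfs attributes current_link_speed, current_link_width, max_link_speed and max_link_width from a PCI device directory such as /sys/bus/pci/devices/0000:01:00.0 (that address is only an example). The function name link_status and the OK/DEGRADED/UNKNOWN labels are made up for illustration.

```python
from pathlib import Path

# Per-lane transfer rate in GT/s mapped to PCIe generation.
_SPEED_TO_GEN = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5, 64.0: 6}


def _read(dev_dir, name):
    """Read one sysfs attribute as stripped text."""
    return (Path(dev_dir) / name).read_text().strip()


def _gen(speed_text):
    """Turn e.g. '8.0 GT/s PCIe' or '8 GT/s' into a generation number.

    Returns None if the rate is not in the table above.
    """
    return _SPEED_TO_GEN.get(float(speed_text.split()[0]))


def link_status(dev_dir):
    """Compare the negotiated link against what the device advertises.

    dev_dir is a PCI device directory in sysfs. The verdict labels are
    hypothetical and chosen only for this sketch.
    """
    cur_gen = _gen(_read(dev_dir, "current_link_speed"))
    max_gen = _gen(_read(dev_dir, "max_link_speed"))
    cur_width = int(_read(dev_dir, "current_link_width"))
    max_width = int(_read(dev_dir, "max_link_width"))

    if cur_gen is None or max_gen is None:
        verdict = "UNKNOWN"
    elif cur_gen < max_gen or cur_width < max_width:
        verdict = "DEGRADED"
    else:
        verdict = "OK"

    return {
        "cur_gen": cur_gen,
        "max_gen": max_gen,
        "cur_width": cur_width,
        "max_width": max_width,
        "verdict": verdict,
    }
```

Two caveats: many GPUs drop the link speed at idle to save power, so a check like this is only meaningful while the device is under load, and max_link_speed reflects the device's own capability, not what the slot or riser can carry.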

This exists because PCIe issues (Gen downgrades, reduced lane width, risers, bifurcation) are often invisible at the application layer and can’t be fixed by kernel tuning or async overlap.

Linux-only: it relies on sysfs and PCIe AER exposure that Windows does not provide.
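
To make the AER part concrete, here is a minimal sketch of reading the per-device AER counters, assuming a kernel built with AER support that exposes aer_dev_correctable, aer_dev_nonfatal and aer_dev_fatal in the device's sysfs directory. The function name read_aer_counters is hypothetical, and this is not necessarily how the tool itself does it.

```python
from pathlib import Path

# Per-device AER statistics files exposed by the kernel's AER driver.
AER_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")


def read_aer_counters(dev_dir):
    """Return {file_name: {error_name: count}} for the AER files present.

    Each line in these files is an error name followed by a count. The
    name may contain spaces, so split on the last whitespace only.
    Files that are missing (no AER capability or no driver support)
    are skipped rather than treated as an error.
    """
    result = {}
    for fname in AER_FILES:
        path = Path(dev_dir) / fname
        if not path.is_file():
            continue
        counters = {}
        for line in path.read_text().splitlines():
            parts = line.rsplit(None, 1)
            if len(parts) == 2 and parts[1].isdigit():
                counters[parts[0]] = int(parts[1])
        result[fname] = counters
    return result
```

Sampling these before and after a bandwidth run and looking at the difference tells you more than the absolute values, since the counters accumulate from boot.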

AuthAuth 2 days ago

This is great. Are there any features you are looking to add? Would checking for bad memory blocks be useful? I've never seen it happen on a GPU but surely it must.

wtallis 2 days ago

Is this entirely NVIDIA-specific, or can it do any diagnostics for other GPUs?

  • kimixa 2 days ago

    It's very much nvidia specific, not just using CUDA but the backing nvidia-specific management libraries.

    Though I don't think there's anything particularly device-specific they're measuring, they're using the private nvidia interfaces to do so.

    • cr125rider 2 days ago

      OP, you should call out a little more clearly that this is Nvidia-only.