nycerrrrrrrrrr 1 day ago

This might be orthogonal to the TLB miss overhead you found, but have you looked at using P2PDMA to transfer directly from the NVMe SSDs to the NIC? Not sure how the CRC calculation would play into that.

  • MrCroxx 10 hours ago

    Thank you for your reply. This is a long-running service. Without CRC validation, errors caused by partial writes could accumulate over time and affect correctness. Therefore, we adopted this approach.

    • nycerrrrrrrrrr 4 hours ago

      Depending on your hardware, you may be able to do the CRC validation in the NIC. If you're on Intel, you could also use DSA (Data Streaming Accelerator), but then you're still copying the data through DRAM.

      Just throwing out some ideas. Obviously the best solution is the one that you already have working :)

jeffbee 1 day ago

It seems that you could have reached this conclusion faster by elaborating on your use of the profiler. Don't assume that cycles are spent on instructions. Look at your IPC and drill down into what CPU-bound means for your workload. In your case, I think a standard top-down analysis would have made the virtual memory management cost jump right out.
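
For readers unfamiliar with the metric, here is a minimal sketch of the IPC check described above. It is not from the blog post: the counter values are made up, and in practice they would come from a tool such as `perf stat` (for example its `instructions`, `cycles`, and `dTLB-load-misses` events). The function names are hypothetical.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per CPU cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def misses_per_kilo_instruction(misses: int, instructions: int) -> float:
    """Event count normalized per 1,000 retired instructions (MPKI)."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000 * misses / instructions


# Hypothetical counter readings, for illustration only.
sample = {
    "instructions": 2_000_000_000,
    "cycles": 4_000_000_000,
    "dtlb_load_misses": 10_000_000,
}

sample_ipc = ipc(sample["instructions"], sample["cycles"])
sample_mpki = misses_per_kilo_instruction(
    sample["dtlb_load_misses"], sample["instructions"]
)
print(f"IPC = {sample_ipc:.2f}, dTLB load MPKI = {sample_mpki:.1f}")
```

A low IPC on a core that looks "100% busy" is a hint that cycles are going to stalls (memory, TLB walks) rather than to the instructions shown in the flame graph.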

  • MrCroxx 9 hours ago

    Thank you for your suggestion. You are absolutely right, XD. In fact, the order of events in this blog post does not match the actual order in which I debugged and analyzed the issue.

    After I identified the TLB misses and confirmed that huge pages were effective, I noticed that there were still many suspicious points in the flame graph. Interestingly, my agent attributed the effectiveness of huge pages to those suspicious points, which turned out to be unrelated to the bottleneck at the time. That sparked my curiosity.

    The structure of this blog post was mainly chosen to make the story easier to follow, while also covering the various issues I investigated in depth along the way.

    In fact, I recently switched from my previous job building data infrastructure in the cloud to an HPC-related role, so I am still not very familiar with some of the mature practices and established conclusions in the HPC world.

    So thank you very much for your suggestions. I also hope to learn about more and better methods that can help people identify root causes more quickly and accurately in complex scenarios.

MrCroxx 5 days ago

Author here. This post is a write-up of a performance-debugging rabbit hole I hit while trying to saturate NICs with NVMe reads using io_uring and RDMA.

The short version: READ_FIXED fixed the obvious per-I/O GUP overhead in a small demo, but the larger deployment still got stuck at roughly half of line rate. After ruling out io-wq backlog, request splitting, fd lookup, and CRC arithmetic, the actual wall turned out to be dTLB misses from scanning 1,028 KiB buffers backed by 4 KiB pages. Moving the read arena to hugepages brought the system close to NIC saturation.
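
A back-of-the-envelope sketch of why the page size matters here, using the buffer and page sizes quoted above. The 2 MiB hugepage size is an assumption (the common x86-64 default), and `pages_spanned` is a hypothetical helper, not code from the post.

```python
KIB = 1024


def pages_spanned(offset: int, length: int, page_size: int) -> int:
    """Number of pages touched by the byte range [offset, offset + length)."""
    if length <= 0:
        return 0
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


BUFFER = 1028 * KIB          # buffer size quoted in the post
SMALL_PAGE = 4 * KIB         # base page size quoted in the post
HUGE_PAGE = 2 * 1024 * KIB   # assumption: 2 MiB hugepages

small = pages_spanned(0, BUFFER, SMALL_PAGE)
huge = pages_spanned(0, BUFFER, HUGE_PAGE)
print(f"4 KiB pages per buffer: {small}")
print(f"2 MiB pages per buffer (aligned): {huge}")
```

Under these assumptions, one scan over a buffer needs 257 distinct translations with 4 KiB pages, versus one with an aligned 2 MiB page (two if the buffer straddles a hugepage boundary).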

The funny part is that an AI agent suggested hugepages early and got the optimization right, but its explanation was wrong. This post is mostly about reconstructing the evidence for why it worked.

I’d be very interested in feedback from people who have used AI to debug performance issues in a complex system.

  • ozgrakkurt 1 day ago

    I disagree with the AI part, because hugepages are one of the things anyone could guess would improve performance when working with a substantial amount of data.

    So anyone familiar with the space could have suggested something like that without knowing the details of the problem. Hence, it is not useful advice, IMO.

    That aside, the blog post was really cool to read and an instant favorite. I wish there were more English posts on the blog.

    I especially like the hardware-limit-based expectations, the detailed measurements, and the writing style.

    • MrCroxx 9 hours ago

      Thank you for liking this blog. I agree with your point. Actually, I’ve just recently transitioned from building data infrastructure on the cloud to taking on a high-performance computing role that truly handles massive amounts of data. So, although I’d heard about the benefits of hugepages before, I had never actually reproduced these issues in my own environment. This time, even though I initially suspected the problems were related to hugepages and the TLB, I didn’t write this blog from a seasoned perspective. Instead, I wanted to methodically investigate and eliminate all other possible issues I could think of. (Interestingly, my agent attributed the effectiveness of hugepages to the root causes of these bugs, which piqued my curiosity and drove my deeper exploration.)

      Finally, thank you very much for your appreciation, which means a lot to me. Previously, I was working on open-source projects, but now that I’ve changed jobs, I may not have the same amount of energy to contribute to open-source code as before. However, I think blogging might be a new way for me to contribute. I hope I can keep it up.

      (My English writing skills are poor, so I wrote in Chinese and used AI to translate it; I hope you don’t mind.)

      • MrCroxx 9 hours ago

        But I'm trying my best to practice it. Hope one day I can produce some solid posts in English directly. qwq