See the original CodeProject article here.
CudaPAD - Cuda Assembly Viewer
CudaPAD aids in optimizing and understanding nVidia’s Cuda kernels by displaying an on-the-fly view of the PTX/SASS that makes up the GPU kernel. Beyond showing the PTX/SASS output, it has several visual aids that help show how minor code tweaks or compiler options affect the PTX/SASS.
What is CudaPAD?
Paste your Cuda C++ code in the left window and cleaned-up assembly will show up on the right side. Every time a small change is made in the source code window, it automatically starts re-compiling (with an indicator) and displays the updated assembly on the right. In the top menu, the compiler options can be tinkered with to see how they impact the assembly.
It also has some visual helpers. It does a "diff" on the output with each change so the user can see exactly what changed. Matching the source code in the left window to the assembly on the right can be a daunting task, so I added little visual lines that connect the two. Colorful highlighting makes the code and assembly easier to read. It also has register highlighting: click on a register and every occurrence of it is highlighted in the output on the right. Raw assembly is pretty messy, so CudaPAD strips out the junk for you.
It also has some tools. Clicking on an error shows you right where the issue is in the source code, and right-clicking an error runs a Google search on it.
Introduction
What is PTX or SASS anyway? NVidia’s PTX is an intermediate language for NVidia GPUs. It is close to pure GPU assembly (SASS) but slightly abstracted. PTX is less tied to specific hardware or a hardware generation, which makes it more useful in most cases when compared to assembly. One item it abstracts is physical register numbers, which makes it easier to use than the raw assembly. PTX instructions are usually translated into one or more actual SASS hardware instructions. SASS is hardcore assembly. It is what the GPU actually runs and is directly translated into machine code. Viewing SASS code is more difficult but it does show exactly what the GPU will do. As mentioned, SASS code also works with the registers directly, so there is more control over which registers are used, but that is another item the programmer needs to keep track of and it makes SASS more difficult to work with.
Often when programming in Cuda, there is a need to view what a kernel’s PTX/SASS might look like and CudaPAD helps with this. There might be a need to view PTX/SASS for debugging, understanding what’s happening, squeezing a little more performance out of a kernel, or just for curiosity. To use the application, simply type or paste a kernel in the left panel and the right panel will display the corresponding disassembly information. Visual aids like Cuda-to-PTX code matching lines, PTX cleanup, WinDiff, and quick register highlighting are built in to help make the PTX easier to follow. Other on-the-fly information is also displayed, like register counts, memory usage, and error information.
With any piece of code, there are often several ways to perform the same thing. Sometimes, just modifying a line or two will lead to different machine instructions with better register and memory usage. Have fun and make some changes to a kernel in the left window and watch how the PTX/SASS changes on the right.
Just as a quick note. CudaPAD does not run any code. CudaPAD is only for viewing PTX, SASS, and register/memory usage.
Background
Like most of my projects, this one grew out of a personal need. For some algorithms I develop, GPU efficiency is essential. One way to help with this is by understanding the low-level mechanics and making any necessary adjustments. Before creating this app, I would often get in a loop where I would write a performance-critical kernel and then view the PTX/SASS over and over using command line tools. Doing this repetitively was time-consuming so I decided to build a quick C# app that would automate the process.
It started out as a simple app that would take a kernel in the left window and then output the PTX to the right side window. This was accomplished by basically running the same command line tools as before, mainly nvcc.exe, but now in an automated fashion in the background. I got carried away however and within a short period of time, I started adding several features including automatic re-compiling, WinDiff, visual code lines markers, compile errors, and register/memory usage.
AMD used to have a similar tool for Brook+ and this gave me the idea of having the two-window app back in 2009 when I first built it. The tool had a left window where a Brook+ kernel could be added and a right window where the assembly would output to. A button could be clicked to update the output window. AMD has had a couple of these over the years but they have since been replaced with AMD’s CodeXL.
AMD’s CodeXL and NVidia’s NSight have since replaced many tools like these; however, CudaPAD still has its place for quick, on-the-fly viewing of low-level assembly and experimentation. Both CodeXL and NSight are professional-grade free tools and are a must-have for GPU developers.
Using CudaPAD
Requirements
CudaPAD is simple to use. But before running it, make sure these system requirements are met:
- Visual Studio 2017/2019 (Express/Community editions are okay)
- NVidia’s Cuda 10
A dedicated GPU is not required since we are only compiling code and not running anything.
If the requirements are met, then simply launch the executable. When CudaPAD loads, it will have a sample kernel. The sample provides a quick place to start playing around or even a starting framework for a new kernel. Whenever the kernel on the left is edited, it will update the PTX or SASS on the right. If there is a compile error, it will show that near the bottom.
There are several features that can be enabled/disabled. All are on by default (also see Features section).
PTX/SASS View Modes
Change the drop-down textbox between PTX, SASS or SOURCE views.
PTX view – shows the PTX intermediate language output of the kernel. PTX is close to SASS hardware instructions but is slightly higher level and is less tied to a particular GPU generation. Usually, PTX instructions translate directly to SASS however sometimes there are multiple SASS instructions per PTX instruction.
SASS view – These are true assembly instructions. These types of instructions are executed directly on the GPU. Less visual information is supplied when viewing SASS than when viewing PTX; for example, the visual code lines do not show.
Raw code view – This view is mostly for debugging CudaPAD itself. Behind the covers, this app does not re-compile after every change. It only re-compiles when the code is modified and not comments or whitespace. The raw code is a stripped down version of the real code. The reason this was added was that I did not want it to keep compiling when I was adding/editing comments or adding/removing whitespace. This would not be resource friendly and would also throw off the WinDiff feature.
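The "only re-compile when real code changes" check can be sketched as below. This is a minimal illustration in Python rather than CudaPAD's actual C# code; the function names strip_comments, normalize, and needs_recompile are made up for this example, and the regular expression only handles ordinary C/C++ comments and literals.

```python
import re

# Matches string/char literals (kept as-is) or // and /* */ comments (dropped).
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'    # double-quoted string literal
    r"|'(?:\\.|[^'\\])*'"   # character literal
    r"|//[^\n]*"            # line comment
    r"|/\*.*?\*/",          # block comment
    re.DOTALL,
)


def strip_comments(source: str) -> str:
    """Replace comments with a space; leave string and char literals alone."""
    def replace(match):
        text = match.group(0)
        return " " if text.startswith("/") else text
    return _TOKEN.sub(replace, source)


def normalize(source: str) -> str:
    """Comment-free form with every run of whitespace collapsed to one space.

    Whitespace inside string literals is collapsed too, which is acceptable
    for a change check in this sketch.
    """
    return " ".join(strip_comments(source).split())


def needs_recompile(previous: str, current: str) -> bool:
    """True only when something other than comments or whitespace changed."""
    return normalize(previous) != normalize(current)
```

Comparing the normalized text of the last compiled source with the current text is enough to skip edits that only touch comments or spacing. An edit such as "a+b" to "a + b" would still trigger a compile in this sketch, which is harmless.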
In the background, CudaPAD simply compiles the kernels with the Cuda tools. The Cuda compiler then, in turn, calls a C++ compiler like Visual Studio. So to run CudaPAD, Cuda needs to be installed and most likely a C++ compiler like Visual Studio.
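As a rough illustration of that background compile, the sketch below assembles the command lines from the selected options. It is written in Python and only builds the argument lists; it does not run anything. The function name build_commands and the file name data.cu are assumptions for this example (the article only mentions data.ptx), and the exact option spelling in CudaPAD's generated batch file is not shown in the article beyond -keep, -cubin, --generate-line-info, and -O3.

```python
def build_commands(source_file, arch="sm_35", opt_level=3, want_sass=False):
    """Return the list of command lines (as argument lists) for one compile."""
    nvcc = [
        "nvcc",
        "-keep",                 # keep intermediate files such as the .ptx
        "-cubin",                # compile device code to a .cubin file
        "--generate-line-info",  # emit source line info used for the lines
        f"-arch={arch}",         # target architecture picked in the menu
        f"-O{opt_level}",        # optimization level picked in the menu
        source_file,
    ]
    commands = [nvcc]
    if want_sass:
        # nvcc names the cubin after the input file by default.
        cubin = source_file.rsplit(".", 1)[0] + ".cubin"
        commands.append(["cuobjdump", "-sass", cubin])
    return commands
```

For example, build_commands("data.cu", want_sass=True) yields one nvcc command followed by one cuobjdump command that disassembles data.cubin.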
Enabling/Disabling Features
Disabling the auto-compile is useful for making multiple changes before a compile. This can help show the changes in the diff (differencing) output over several changes. To do a manual compile, just click the green ‘start’ in the top right corner.
Under the Hood
Let's take a look at how this application works. I will present what happens when the left window is edited. This triggers a recompile and then updates the right PTX/SASS window. Here it is in steps:
- The user enters in some Cuda in the left window.
- The textbox change kicks off a short term timer. If the user should type in any more text before that timer finishes, then the timer is reset. This system prevents the compile process from firing on every keystroke and lets the user finish typing before it automatically starts.
- When the timer completes an event is raised. In this event, we check to see if there were any changes that would require a re-compile. Obviously, if a user is just editing some comments or adding/removing whitespace, then we don't need to recompile. If there are no "code" changes, then we stop here. In the dropdown box, CODE can be selected to see what this cleaned up code looks like.
- We save the Cuda textbox to a file. This will be needed later when the Cuda compiler compiles it.
- We then clear any lines on the screen as we are going to draw new ones soon.
- We then call a batch file that does most of the compiling. This batch file is generated based on the options selected in CudaPAD. If the user has the sm_35 architecture selected, then this option is appended to the nvcc line. If the user selects an optimization level of three, then -O3 is appended. If SASS output is requested, then the CuObjDump command is appended. Here is the batch file:
- Perform some cleanup in the temp folder from the last time a compile was done.
- Calls NVidia's Cuda compiler with some options:
nvcc.exe -keep -cubin --generate-line-info ...
This command compiles the Cuda file into a cubin file (device code). We also use the -keep option to keep the PTX files, as well as --generate-line-info so we know the line numbers of the source file and can draw the lines.
- If SASS is selected from the dropdown, then we run CuObjDump.exe to disassemble the cubin device file into SASS code.
- Lastly, we capture any output messages from these commands to info.txt.
- Next, we fill the info textbox that has the registers and memory utilization information.
- We extract this out from the output log info.txt file we created from the batch file.
- We then grab the global, constant, stack, and shared memory, byte counts, register spill information, register usage and general log information using RegEx.
- This info is then formatted and displayed in the informational window.
- Next, any errors/warnings are captured from the rtcof.dat file and are then formatted and then placed in the error window.
- We then grab the text from the outputted data.ptx (from nvcc.exe) and compare it to the PTX already in the window using a diff algorithm. The final result of the diff function is the new PTX with what changed in the form of comments. I chose to put the change information in comments so that if the text is copied to another program, it will still run.
- Next, we store the position of the scrollbars and caret location for the PTX/SASS window. This is needed because after we re-fill the output window with text, we are going to want to restore these.
- Next, we grab the line information from the PTX and store that. The line numbers will be needed later to draw the connecting lines. The line information is in the form of ".loc # ## #" statements. Any line information is then deleted from the PTX so that it is not displayed.
- Do some cleanup on the PTX to make it look all nice and dandy.
- Next, we draw the visual code lines.
- Previously, we saved the line number information for each location specified in the output PTX file. Example: On line 45 of the PTX we might have had a .loc 1 20 1. The 20 here would be the source line so a line would be drawn from line 20 in the source to line 45 in the PTX window.
- Next, we get the indentation for each line. This is done by counting the whitespace (spaces/tabs) before each word. This is needed so the lines start or end where the code starts instead of just at the beginning of the line.
- Using the textbox height/width plus the current scroll positions for each window plus the indentation and line number of each line, we then draw the lines.
- Finally, we restore the scroll positions and caret location.
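The first steps above (restart a short timer on every edit, compile only once typing pauses) can be sketched as a small debouncer. This is a Python illustration, not the C# timer CudaPAD uses; the class name Debouncer and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class Debouncer:
    """Fires once after `delay` seconds have passed with no further edits."""

    def __init__(self, delay):
        self.delay = delay
        self.deadline = None  # None means no edit is waiting to be compiled

    def on_change(self, now):
        # Every edit pushes the deadline back, which restarts the wait.
        self.deadline = now + self.delay

    def poll(self, now):
        """Return True exactly once when the quiet period has elapsed."""
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            return True
        return False
```

In a UI, on_change would be called from the text-changed event and poll from a periodic tick, with time.monotonic() supplying the timestamps. When poll returns True, the comment/whitespace check and the compile would follow.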
Features
Visual Code Lines
These lines match up the Cuda source code to the PTX output. They help the programmer quickly identify what Cuda code matches up with what PTX. This function can be enabled or disabled by clicking the lines icon in the top of the PTX window.
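The bookkeeping behind these lines can be sketched as follows: pull the ".loc" statements out of the PTX, remember which output line each one points at, and measure indentation so a line can start where the code starts. This is a Python sketch under assumptions: the names split_line_info and indent_width are invented here, and attaching each ".loc" to the next non-blank output line is a guess at a reasonable rule, not necessarily what CudaPAD does.

```python
import re

# ".loc <file index> <source line> <column>"; anything after that is ignored.
_LOC = re.compile(r"^\s*\.loc\s+(\d+)\s+(\d+)\s+(\d+)")


def split_line_info(ptx: str):
    """Remove .loc lines; return (clean_lines, {output_line: source_line}).

    Output line numbers are 1-based positions in the cleaned text.
    """
    clean_lines = []
    links = {}
    pending_source_line = None
    for line in ptx.splitlines():
        match = _LOC.match(line)
        if match:
            pending_source_line = int(match.group(2))
            continue  # the .loc line itself is not displayed
        if pending_source_line is not None and line.strip():
            links[len(clean_lines) + 1] = pending_source_line
            pending_source_line = None
        clean_lines.append(line)
    return clean_lines, links


def indent_width(line: str, tab_size: int = 4) -> int:
    """Number of leading columns of whitespace, with tabs expanded."""
    expanded = line.expandtabs(tab_size)
    return len(expanded) - len(expanded.lstrip())
```

Each entry in the returned mapping is one connecting line to draw, from the source line to the output line, offset horizontally by the indentation of each.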
Auto Assembly Refresh
When needed, the application will automatically re-generate the PTX code. It does not do this on each text change in the source window but rather when the stuff that matters changes. Items that do not impact the output, such as comments or spaces, are stripped from the source text before it is compared. The Auto Update function can be enabled or disabled by clicking the auto update icon in the top of the PTX window.
Built-in Diff utility
Each time the PTX output changes, a differencing algorithm runs automatically and notes what changed. The notes are added in such a way that they do not impact the runnability of the code. I decided to add the diff information inside of comments in the event the user wants to copy and paste the code. I came up with a system of using // style comments on deleted lines and a /*new*/ comment for new lines. The // comments disable the entire line while the /*new*/ does not.
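The comment-based marking can be sketched like this. Note that this Python example uses the standard library's difflib.SequenceMatcher, which is a different algorithm from the Myers diff in CudaPAD's Diff.cs, and the function name annotate_changes and the choice to append /*new*/ at the end of a line are assumptions for illustration.

```python
import difflib


def annotate_changes(old_lines, new_lines):
    """Return the new text with removed lines commented out and new lines tagged."""
    out = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            # Deleted lines stay visible but are disabled with a line comment.
            out.extend("//" + line for line in old_lines[i1:i2])
        if tag in ("insert", "replace"):
            # New lines remain valid code; the block comment only marks them.
            out.extend(line + " /*new*/" for line in new_lines[j1:j2])
        if tag == "equal":
            out.extend(new_lines[j1:j2])
    return out
```

Because both markers are ordinary comments, the annotated output can still be pasted elsewhere and assembled.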
Single-Click Multiple Highlighting (new in 2016)
Just click on any register or word in the PTX window and it will highlight all instances of that item. Click on another and it will highlight those as well with a different color. Click on any highlighted item and it will un-highlight all instances of that item. With just three clicks the following can be achieved:
Syntax Highlighting and Output Formatting
The ScintillaNET textbox control by Jacob Slusser has some convenient text highlighting abilities that visually helps when viewing code. Originally, this started out as a plain textbox, then moved to another 3rd party control and then finally to the ScintillaNET control. This results in a more colorful and cleaner looking code.
Besides the text highlighting, the text in the output window is formatted so it’s a little cleaner. Things like compiler information and header information are removed:
- remove unneeded comment
- remove unneeded id: comment
- remove empty "//" comments
- shorten __cudaparam_
- shorten labels
- remove .loc 15 lines (i.e. “.loc 3 3431 3”)
- remove "%" in front of registers (New as of Jan. 2016)
- remove "// Inline" lines (New as of Jan. 2016)
- remove .file 1 "C:\\....." (New as of Jan. 2016)
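A few of the cleanup rules in this list can be sketched with simple line filters. This Python example covers only a subset (the .loc and .file lines, empty "//" comments, "// Inline" lines, and the "%" register prefix); the function name clean_ptx is made up, and the exact text of the lines CudaPAD matches is an assumption.

```python
import re

# Directive must be followed by whitespace so ".local" is not mistaken for ".loc".
_DROP_DIRECTIVE = re.compile(r"\.(?:loc|file)\s")


def clean_ptx(ptx: str) -> str:
    """Drop bookkeeping lines and the '%' prefix on registers."""
    kept = []
    for line in ptx.splitlines():
        stripped = line.strip()
        if _DROP_DIRECTIVE.match(stripped):
            continue  # .loc / .file lines
        if stripped == "//" or stripped.startswith("// Inline"):
            continue  # empty comments and inline markers
        # Remove '%' only when it starts a name such as %r1 or %rd2.
        kept.append(re.sub(r"%(?=[A-Za-z_])", "", line))
    return "\n".join(kept)
```

The remaining rules (shortening __cudaparam_ names and labels) would follow the same pattern with additional substitutions.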
Example of highlighted and cleaned up output formatting is as follows:
Online Error/Warning Search
Often when running across an error, it is helpful to do a quick online search. I found I was often opening a browser and then copying and pasting the error into a search box. This was not efficient so I added a search online function. At the time, I think this was one of the first of its kind but since it was released in 2009, I have seen other IDEs have this.
Points of Interest
I had a little fun creating this. This is probably why so much time was put into this.
Getting the code lines to work was exciting for me. I believe the visual code lines might have been one of the first of their kind when I built this in 2009 but I am not sure. This was a wild idea I had and I was not sure if I could get it working. Drawing moving lines on the screen is not that easy as I found out as there always seemed to be some side effects. Drawing the spline was the easy part but all the miscellaneous stuff like cleaning it up was more difficult. Another difficult part was calculating the location in the text box. The textbox line height and line number must be known for each spline drawn. I’m not a graphics developer so I am just happy to get it to work! The visual lines turned out better than expected and are fun to play with.
At the time, I dreamed up many different “line” ideas to help break down the assembly but none of the others have been implemented yet:
Note: These other features have NOT been added to CudaPAD. (at least not at this time)
- Draw curved lines that show jumps. Upward jumps are in a lighter color and downward jumps are in a darker color.
- Click on a register and it would display lines where a register impacts. Dark lines for the actual places the register is used. Gray for registers it impacts. And light gray for registers it impacts after two instructions. This would have been similar to Excel’s Trace Precedents / Trace Dependents function.
- One other feature that I wanted to create but never got a chance to would have been a registers used function. This really helps understand where a kernel is maxing out on the register usage and often limits a kernel. When a register is used for the last time, it is freed after that instruction.
Advantages of Viewing PTX/SASS
Here are some advantages of viewing PTX...
- Curiosity - This is what I use it most for. Sometimes I just want to see what is going on at the lower levels and how small changes impact the code. This can be a very useful tool for trying to learn PTX/SASS and the Cuda compiler.
- Software bug- Trying to figure out that annoying bug. Is it a compiler bug or is it something with my code? Sometimes viewing the machine instructions can aid in understanding an unexpected result.
- Changing up a line or two often produces different results. When a kernel needs some performance optimization, toying with different ways of doing the same thing can produce more efficient code. One example that comes to mind: I found that using a union would always result in local memory in the PTX. This was a while ago so it might not be true anymore but here is the example:
.local .align 4 .b8 someLocMem[4];
....
st.local.s32 [someLocMem], someIntReg; <--very expensive
ld.local.f32 someFloatReg, [someLocMem]; <--very expensive
However, when using something like "int strangeInt = *(int*) &somefloat;" the output looks like this:
mov.b32 someFloatReg, someIntReg;
This is easily spotted in CudaPAD because of the quick feedback and visual markers.
- Does the code do nothing? Several times in the past, I realized that my kernel had a bug because when I changed or deleted some code nothing changed in the PTX output. I thought to myself, how could this be? The reason why PTX might not show up is that the compiler often removes useless code that does not do anything. As I found out, this is more common than I expected. This is usually caused by a bug but it could also just be pointless code. In most cases, code that is optimized out should either be removed or fixed. Noticing this can help find some hidden errors in a program.
Just as a word of caution, try not to go optimization crazy. Optimization does have its place for particular functions that get run often however optimization can make code less readable, awkward, and more difficult to maintain. Also, time should only be spent on code where a performance increase would have a large impact. There is much more on this subject that I will not get into.
Videos (updated in 2016)
Below is a quick tutorial video. The sub-menu options did not show properly in the video but I explain what I am clicking on so hopefully you can still follow along.
CudaPAD won a poster spot at the 2016 GPU Technology Conference. Even better than that it was also selected as one of the top 20! At the conference, I gave a short presentation to about 100-150 people on April 4th, 2016.
Wish List
Here are some wish list items I have that may or may not be added in the future:
- Isolate the implementation code from interface code using the bridge pattern. While the GUI and code are somewhat split into different files right now, they are not really separable. It’s often good practice to split this up.
- Add the ability to execute the code for timing purpose. Right now PTX can be visually looked at but not benchmarked.
- Add a per-line register usage counter. Basically what this would require is to keep track of how many variables are being used on each PTX line. A GPU has a fixed number of registers and knowing where the register pressure is highest can help programmers balance their code. This is something I added into my AMD GPU compiler, ASM4GCN, but have not added it here.
- Add jump lines to the PTX so one can easily see where a jump statement lands.
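The per-line register usage counter from the wish list could be approximated as below. This is a hypothetical Python sketch, not something CudaPAD implements. It assumes straight-line code (no branches or loops), treats a register as in use from its first appearance to its last, and counts PTX virtual registers, so it only hints at the physical register pressure after real allocation. The function name live_register_counts is invented for this example.

```python
import re

# PTX virtual registers such as %r1, %rd2, %f3, %p0 (not %tid.x and similar).
_REG = re.compile(r"%[A-Za-z]+\d+")


def live_register_counts(instructions):
    """For each instruction, count registers between first and last use."""
    first_seen, last_seen = {}, {}
    for index, text in enumerate(instructions):
        for reg in _REG.findall(text):
            first_seen.setdefault(reg, index)
            last_seen[reg] = index
    return [
        sum(1 for reg in first_seen
            if first_seen[reg] <= index <= last_seen[reg])
        for index in range(len(instructions))
    ]
```

The line with the highest count is where register pressure peaks under these assumptions, which is the spot worth rebalancing first.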
Similar Programs
Godbolt Compiler Explorer
In 2019, I discovered a project very similar to this one. Matt Godbolt created a tool called GCC Compiler Explorer. Some people now call this process "Godbolting" code. I use Godbolting because that is the term everyone knows these days. Matt's program, created around 2012, is web-based but shares many features. Both...
- Let you enter code in the left window, with the ASM showing up on the right.
- Automatically start re-compiling each time a change is made in the left window.
- Indicate in the top right that a compile is in progress.
- Have menus at the top to select different compiler options to compare.
- Do a "diff" on the output to see what changed.
- Work on C++, though one targets Cuda C++ and the other GCC C++.
- Have visuals that help the viewer match up the source on the left to the ASM on the right.
- Are used to optimize code, view what happens, and compare different compiler options.
- Use colorful text highlighting.
- Clean up the assembly output code (remove assembly junk).
- Show the user the offending line when a warning/error is clicked in the compiler output window.
- Allow users to highlight all matching registers to see all the matches quickly. (Note: added Jan. 2016 by me)
A tool in the ATI/AMD Brook+ GPGPU toolkit in 2008
There is not much online about this anymore, but I did find a screenshot. As I recall, it was a modified C++ language for GPU development and a set of tools created by a university. One of the tools in it was a small app where a user could enter code on the left side, and in the middle there was a button that could be pressed to update the assembly on the right side. This is where I got my idea from. It did not, however, have highlighting, compiler options (as I recall), automated updates, visual lines to match up the code on the left to the ASM on the right, diff, cleanup (as I recall), etc.
A Special Thanks to...
- Diff functionality - This is a nice drop-in C# file that provides quality diff functionality. The algorithm was originally created by Eugene Myers in 1986 and converted into C# by Matthias Hertel. The mostly un-edited source is in the file Diff.cs.
- ScintillaNET - This nice tool provides the text highlighting for this project. It is a Windows Forms control, wrapper, and bindings for the versatile Scintilla source code editing component. It really adds a lot of life to this project.
- nVidia - In 2016, CudaPAD won a poster spot at the 2016 GPU Technology Conference. Moreover, it was selected as an honorable mention (top 20). I presented it to an audience of around 100-150 people on a super large projector screen. It was a wonderful experience - one of the best I ever had.
History
- Dec 2008 – Initially built and it has remained mostly unchanged since 2020.
- Aug 2009 - Built the Cudapad.com website as a group project while at CSUEB - it remained up one year until it expired.
- Jan 2013 – Changed the code textbox to use ScintillaNET for better syntax highlighting
- Nov 2014 – Updated for NVidia Cuda 6.0/6.5
- June 2015 – Code released to the public; changed to MIT License; updated for Cuda 6.5/7.0
- Jan 2016 – Added a single-click multiple highlighting search feature; Updated for Cuda 7.0/7.5.
- Jan 2017 – Verified okay with Cuda 8.0
- Jun 2019 - Updated for Cuda 10 and Visual Studio 2017/2019