I was working on my RISC-V (dis)assembler today when I got to something rather monotonous. I was writing tests for the M extension (making sure each instruction assembled and disassembled correctly) when I thought to myself: I bet an LLM could do this.

So I decided to test my hypothesis. I gave the following input:

write tests in the following format for each of the following instructions. format: #[test]
fn multiply_high_unsigned() {
    // check assembler
    let i = assemble_line("mulhu ra,t5,s11").unwrap();
    assert_eq!(
        i,
        Instruction::MULHU(IRegister::ReturnAddress, IRegister::T5, IRegister::S11)
    );

    // check decoder
    let i2 = decode_instruction(0x03bf30b3).unwrap();
    assert_eq!(
        i2,
        Instruction::MULHU(IRegister::ReturnAddress, IRegister::T5, IRegister::S11),
    );

    // check disassembler
    let i3 = assemble_line(&disassemble_instruction(&i)).unwrap();
    assert_eq!(i, i3);
}
Instructions: mulhsu div divu rem remu mulw divw divuw remw remuw

And it worked! It even named the functions correctly. The registers in my example test were actually chosen at random, but since I didn’t say so, every generated test reused the same registers. Asking again with “can you do the same, except choose random RISC-V registers” produced the desired results.

It worked, except for the binary forms of the instructions, which were (as I expected) incorrect. Still, it saved me some time copying and pasting the rest of the code.
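For what it’s worth, the binary forms are easy to check by hand: every M-extension instruction is R-type, so the encoding is just six bit fields packed together. Here is a minimal sketch of that packing (the helper function is mine, for illustration only, not part of my assembler’s API):

/// Pack the six fields of a RISC-V R-type instruction.
fn encode_r_type(funct7: u32, rs2: u32, rs1: u32, funct3: u32, rd: u32, opcode: u32) -> u32 {
    (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode
}

fn main() {
    // mulhu ra, t5, s11: funct7=0000001, funct3=011, rd=x1, rs1=x30, rs2=x27
    assert_eq!(encode_r_type(0b0000001, 27, 30, 0b011, 1, 0b0110011), 0x03bf30b3);
}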

Other Models

The above results were from GitHub Copilot using the GPT-4o model. Out of curiosity, I decided to try some other models and see how they handled the task.

DeepSeek-R1 14b

The R1 model did some interesting things. It generated a large number of “thought” tokens about the specifics of instruction encoding, only to fall back on placeholder encodings anyway (which, to its credit, it told the user). It replaced the registers I gave it with x0, x1, and x2, and changed the format of the tests. It even rewrote my input test:

#[test]
fn mulhsu_test() {
    // Assemble instruction
    let instr_str = "mulhsu x0, x1, x2";
    let instr = assemble_line(instr_str).unwrap();

    // Decode to check hex value (hypothetical hex for illustration)
    let hex_value = 0x03b030b3; // Example hex for MULHSU
    let decoded_instr = decode_instruction(hex_value).unwrap();

    // Assert assemble and decode match
    assert_eq!(instr, decoded_instr);

    // Disassemble to check round-trip
    let disassembled_str = disassemble_instruction(decoded_instr).unwrap();
    assert_eq!(disassembled_str, instr_str);
}

It moved some things into variables, changed the comments, and renamed the function to use the instruction mnemonic. It also changed the semantics of the test: the original compared both the assembled and the decoded instruction against known-good constants, but the new test just compares them to each other, which would pass even if the assembler and decoder were wrong in the same way. The round-trip check now compares strings rather than Instructions. Interestingly, it also changed the example hex value even though the correct one was given in the input; I don’t know why it used a placeholder just one hex digit off from the value it was given.

The placeholder values for the other tests each incremented the fifth hex digit from the right (0x03b030b3, 0x03b130b3, 0x03b230b3, …); I don’t know why. I would try a version with more than 16 instructions to see what that digit does when it runs past 0xf, but generating this took long enough already.
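Viewed as arithmetic, each placeholder is simply the previous one plus 1 << 16. A quick sketch of the pattern (my own reconstruction, not from any model output):

fn main() {
    // Each placeholder adds 1 << 16 to the last, bumping the fifth hex digit.
    let placeholders: Vec<u32> = (0u32..3).map(|i| 0x03b030b3 + (i << 16)).collect();
    assert_eq!(placeholders, vec![0x03b030b3, 0x03b130b3, 0x03b230b3]);
}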

Phi4 14b

Phi4 produced output similar to GPT-4o’s, though it used different values for the decode tests (which were also wrong). Like DeepSeek, it told the user that they were examples (though it did refer to the whole instruction as an “opcode”; in RISC-V, the opcode is only the low 7 bits).

Llama 3.1 8b

Note that this model is significantly smaller than DeepSeek and Phi. There isn’t (at least on Ollama) a Llama 3 model at a size closer to 14b parameters.

Llama produced output similar to the original test, though strangely it swapped the order of t5 and s11 in the tests for div, divu, rem, remu, divw, divuw, remw, and remuw. It kept the same hex value for every test and made no mention of it. Interestingly, it also changed the mnemonics divuw and remuw to divuwa and remuwa, and mulhsu to mulhu (despite correctly naming that test multiply_high_signed_unsigned). In each instance, it still generated the correct instruction variant (e.g. Instruction::DIVUW, not Instruction::DIVUWA).

Conclusion

Didn’t you say that GitHub Copilot found a bug in your code? When are you going to get to that?

Well, the tests showed what I hadn’t implemented yet. It’s LLMs; you have to expect a little puffery.