I wanted to try some of the new AI programming tools, so I (or really Codex) wrote a new testing library: GitHub - samth/recspecs: Expect testing for Racket
It's "expect testing" which Jane Street has advocated for in the OCaml community. You use it like this:
That's a rackunit test case that passes because those two commands print that output. There's an emacs command to automatically update the expected output, so you can write the test case and then run the command to update the test output. It has lots of configuration options.
I wrote effectively none of the code, but it took a bunch (50+) of prompts to Codex to create everything.
I want to write something a bit more about it, but it's very good. It can write Racket code quite well, and for something like this it is significantly more productive than me writing code.
The biggest lesson for using AI coding tools is to write lots and lots of tests (and have it write tests). For example, the emacs integration has tests (which Codex wrote), and that made later changes to that code go much more smoothly.
> I want to write something a bit more about it, but it's very good. It can write Racket code quite well, and for something like this it is significantly more productive than me writing code.
>
> The biggest lesson for using AI coding tools is to write lots and lots of tests (and have it write tests).
I've suggested this internally at work, too. But aren't we concerned that LLM-generated test cases suffer from some kind of confirmation bias? A kind of self-fulfilling prophecy along the lines of "my code [or the LLM's code] is correct because the LLM says so"? Of course human feedback should fix these problems, but I find relying on oversight generally prone to automation fatigue and blindness. This effect may be heightened in corporate settings (as opposed to personal settings).
The perhaps-inflammatory (and less constructive) versions I wrote at work go something like this.
“My code works because the average prediction says it does!” This is like Boeing self-attesting their airplanes are “fine because we said so.”

Also: "My code is correct because AI said it is" is so circular.

Also: Having an AI model generate unit tests seems like it will give you confidence that your buggy code is correct (it's not a mind-reader, it doesn't know what the code is supposed to do).
Anyway, I've seen a lot of "LLMs make it easy to generate lots of tests" elsewhere, and I've decided I'm more interested in property-based testing, fuzzing, or similar concepts from the SMT world, to get more juice from fewer tests.
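As a tiny illustration of what I mean by getting more out of fewer tests, here is a property-style check written with plain rackunit and random inputs (no particular property-testing library assumed): one general assertion exercised over many generated cases instead of many hand-written expectations.

```racket
#lang racket
(require rackunit)

;; One general property ("reversing a list twice is the identity"),
;; checked against many random inputs rather than a few fixed examples.
(define (random-int-list)
  (build-list (random 20) (lambda (_) (random 1000))))

(for ([i (in-range 100)])
  (define xs (random-int-list))
  (check-equal? (reverse (reverse xs)) xs))
```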
Here are some more thoughts on the experience (more available in this bluesky thread).
It really works. The code base is now almost 1500 lines of code, tests, and docs. It includes emacs integration and an extensive test suite, it works with rackunit, and it has scribble docs. And I wrote effectively 0 lines of code.
It knows Racket. It knows how to write Racket programs, it knows scribble documentation, it knows Racket library APIs, it can use Racket idioms, etc. I'm sure the experience is better for Python and TypeScript, but I didn't feel held back by using a minor language. It's better at writing Racket than models were even a year ago.
It really helps to know what you want. recspecs clones an existing OCaml library; I knew it should use rackunit, and I knew how I wanted it to work. I don't think it would have worked as well for something I didn't already understand.
Tests are really important. The more tests you have, (a) the more you can trust the results and (b) the more likely it is to come back with results that are correct (because Codex runs the tests). And you can have it write a lot of tests. Even the emacs integration has tests.
Tools are really important. Having raco fmt is essential for getting nice-looking results (thanks @sorawee), and it can be used completely transparently. The same goes for testing tools, debugging tools, and so on.
The most confusing aspect for me is provenance and licensing.
IIUC a model is trained on code offered under various licenses -- some permissive, but some restrictive like GPL or BSD-3-Clause-Attribution.
Ideally the model/tool would remember the licenses, and, when it copy-pastes a sufficiently big chunk, give you the information needed to know the original license (and comply, if possible).
But AFAIK none of the models/tools do that kind of provenance tracking. Either they don't share it, or they never bothered collecting it in the first place.
So... I don't understand how this will play out, going forward.
Uber's tactic was to ignore local taxi laws and hope to grow fast enough that its customers would demand the laws not be enforced. I guess that's the plan here, too?
Plus the AI coding tools have licenses denying liability and dropping the hot potato in their users' laps. "Licenses for me, not for thee."
p.s. I don't mean to suggest it's wrong to use these tools, particularly to evaluate them. Also it's pretty likely I have some simplistic and/or wrong understanding of the issues involved.
I agree that the copyright situation is currently unsettled. The tool providers' view is that the output is not a derived work of the training data and thus there's no copyright problem. People with lots of data that they charge for (e.g., the New York Times) disagree and think that the ability of the models to reproduce their content is a sign of infringement. Other people think simply training on copyrighted data and producing output that's influenced in some way is infringement. Certainly the model providers disclaim any copyright in the outputs.
Personally, I think it's very unlikely that the third view will prevail, and I think the things I'm generating are not copies of existing programs. But I can definitely understand not wanting to use them for things you'll distribute until it's more resolved.
Although I am not a lawyer, I believe there can be a problem when either a human or an AI copies a "sufficiently big" fragment of code, even with trivial alterations like renaming variables. I don't know how "sufficiently big" is defined, but an entire non-trivial function might qualify.
As a human, I have occasionally done such copying, after checking the license. Sometimes I've added a "Portions Copyright..." notice to the file, to be safe. Sometimes I've just put a "provenance comment" in there. This gives credit. It also gives context to someone (including future-me) who might benefit from seeing related code.
I feel like this is a valuable aspect of using open source code that the new tools could and should preserve. Hell, someday maybe an "AI" would benefit from provenance/context, too.
I broadly agree with this. Currently GitHub Copilot does some of this; it will sometimes indicate that the generated code is similar to something already found on GitHub.