Copilot's rocky takeoff: GitHub ‘steals code’

Richi Jennings, Independent industry analyst, editor, and content strategist.


Microsoft’s new ML toy promises to write your code for you. It uses a “transformer” deep learning model to suck in billions of lines of code—much of it open source. And now it’s generally available, it’s a paid product.

But how can Copilot users credit the original authors or ensure they don’t violate the original licenses?

It doesn’t seem like they can. In this week’s Secure Software Blogwatch, we shut our eyes and think of the deadline.

Your humble blogwatcher curated these bloggy bits for your entertainment. Not to mention: A New Nope.


Robocode in disguise

What’s the craic? Kyle Wiggers reports—“Copilot, GitHub’s AI-powered programming assistant, is now generally available”:

Copilot remains an imperfect system
Last June, Microsoft-owned GitHub and OpenAI launched Copilot, a service that provides suggestions for whole lines of code inside development environments like Microsoft Visual Studio. Available as a downloadable extension, Copilot is powered by an AI model called Codex that’s trained on billions of lines of public code to suggest additional … code and functions [by] drawing on its knowledge base and current context.

Copilot is now available to all developers. … It’ll be free for students as well as “verified” open source contributors — starting with roughly 60,000 developers selected from the community and students in the GitHub Education program. … Developers can cycle through suggestions for Python, JavaScript, TypeScript, Ruby, Go and dozens of other programming languages and accept, reject or manually edit them. … Copilot extensions are available for Neovim and JetBrains in addition to Visual Studio Code [and] GitHub Codespaces.

[But] Copilot remains an imperfect system. [It] can produce insecure coding patterns, bugs and references to outdated APIs, or idioms reflecting the less-than-perfect code in its training data.
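To make that warning concrete, here’s a hypothetical illustration (not actual Copilot output) of the sort of insecure pattern a model trained on billions of lines of old public code can reproduce — string-built SQL — next to the version a reviewing developer should write instead:

```python
# Hypothetical example of an insecure suggestion vs. the reviewed fix.
import sqlite3

def find_user_unsafe(conn, name):
    # The kind of code a model can emit: SQL built by string
    # interpolation, injectable if `name` is attacker-controlled.
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(conn, name):
    # The reviewed version: a parameterized query.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (name,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

payload = "' OR '1'='1"                  # classic injection string
print(find_user_unsafe(conn, payload))   # leaks every row
print(find_user_safe(conn, payload))     # returns nothing
```

Both functions “work” on happy-path input, which is exactly why a plausible-looking suggestion needs human review before it ships.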

But why would you? Mathew Lodge goes deeper—“AI-Augmented Coding”:

Copilot gets that right about 43% of the time
99.999% of code is written by hand. … AI-based code completion tools are a big advance: … They write more substantial chunks of code by trying to infer the developer’s intent from code and/or comments that have come before. … Copilot from GitHub is perhaps the most well-known.

These tools came out of work on Transformers, a form of [deep learning] neural network designed for natural language processing — language translation, text summarization and similar applications. … Transformers are trained on a very large corpus of examples, and they use the resulting model to synthesize new examples.

According to GitHub, Copilot gets that right about 43% of the time, so developers must review the code the tool generates. This is more useful than it perhaps might sound — a strength of Copilot is in generating code that calls “foreign lands” such as cloud service APIs [or] for writing “boilerplate code.”
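The “boilerplate” Lodge means is the sort of small glue code every project rewrites — for instance, a retry wrapper around a flaky cloud-API call. A minimal sketch (illustrative only; names and parameters are our own, not Copilot’s):

```python
# Sketch of typical completion-tool boilerplate: retry a flaky call.
import time

def retry(fn, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call fn(), retrying up to `attempts` times on failure."""
    for i in range(attempts):
        try:
            return fn()
        except exceptions:
            if i == attempts - 1:
                raise          # out of attempts: re-raise
            time.sleep(delay)

# Simulate a call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(retry(flaky))  # "ok" on the third attempt
```

Code like this is easy to review at a glance — which is why a 43% hit rate can still be a net time-saver here, in a way it wouldn’t be for subtle business logic.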

Is it any good? David Bethune is impressed—“Flying with GitHub Copilot”:

Copilot has just saved you [from] a few bugs
Today I installed GitHub Copilot in VSCode and decided to see what it could do with Typescript. … I’m impressed! … The #1 Use Case for Copilot is helping with dumb, simple, boilerplate stuff that you could write but don’t have to.

In context, I think it’s very useful and I plan to continue trying it. But the most important takeaway here for me is that, like any copy/paste code (and you can think of Copilot as roboto copy/paste which changes the function and parameter names for you), you need to fully understand what it writes. … If not, then use the suggestion as a jumping-off point for learning … before you add it to your production code.

In contrast, if the proposed solution is obviously correct and you can write and understand it yourself, then Copilot has just saved you a lot of time — and possibly a few bugs.

To which Tom Chiverton 1 retorts:

**** that. I'll just write it myself rather than risk some weird bug from code I didn't want.

Automation for automation’s sake? locater16 can just see the ad spot:

Are you tired, worn out from days of downloading code from NPM or copy-pasting code from StackOverflow? Is your wrist starting to strain from all this manual labor? Well, why not have an AI copy/paste for you?

That's right! With GitHub Copilot you can leave the hard work of Googling your current problem and then hitting ctrl-C ctrl-V … to a machine. With GitHub Copilot it's like you'll never have to work again!

But where is the ML model getting its source from? Open-source projects, silly! w4ffl35 has deep concerns:

Were licenses violated (intentionally or otherwise) in the released Copilot model? … How is Github giving attribution where due?

[It] feels as though … Microsoft / Github stabbed its users in the back by violating our licenses. … This feels like a massive scam.

And Matthew Butterick flies in—“Copilot is stupid and wants to kill me”:

Copilot’s greatest enemy is itself
I am not your lawyer [but] I agree with those who consider Copilot to be primarily an engine for violating open-source licenses. … I expect that organizations that create software assets will have to forbid the use of Copilot and other AI-assisted tools, lest they unwittingly contaminate those software assets with license violations. … Because what other position could be defensible?

Though open-source licenses allow redistribution and modification of code, I still have to honor the specific terms of other open-source software that I use in my projects. … It’s not a mosh pit. … Not all of them are compatible. For instance … I can’t embed GPL-licensed software within my MIT-licensed projects.

If Copilot was trained on software code that was subject to an open-source license, what license might apply to the code produced by Copilot? MIT? GPL? … Open-source software has successfully coexisted with proprietary software because it plays by the same legal rules. Copilot does not. … Microsoft has imposed upon users the responsibility for determining the IP status of the code that Copilot emits, but provides none of the data they would need to do so. … Copilot’s greatest enemy is itself.

But what can you do? breakfast doesn’t like it:

I don't like it, but it's probably the future. … What this is really doing is treating an existing resource … as a free source of data to effectively crowdsource code from without credit. [It’s] arguably a form of deniable mass copyright theft.

However, Safia—@captainsafia—might be getting a call from her employer’s legal team tomorrow:

Why would I pay GitHub $100/year for an AI to tell me what code to write, when men do it for free?🤔

Meanwhile, slazzy sees the oint in the flyment:

Works great for autocomplete of companies’ private API keys. Big increase in productivity!

And finally:

Not sure if genius or garbage

Hat tip: jonbob


You have been reading Secure Software Blogwatch by Richi Jennings. Richi curates the best bloggy bits, finest forums, and weirdest websites … so you don’t have to. Hate mail may be directed to @RiCHi or ssbw@richi.uk. Ask your doctor before reading. Your mileage may vary. E&OE. 30.

Image sauce: Midland Airport (cc:by-sa; leveled and cropped)