Most computer code compilers are at risk of ‘Trojan source’ attacks in which adversaries can introduce targeted vulnerabilities into any software without being detected, according to researchers from the University of Cambridge.
The paper, Trojan Source: Invisible Vulnerabilities, detailed how weaknesses in text encoding standards such as Unicode can be exploited “to produce source code whose tokens are logically encoded in a different order from the one they are displayed.” The resulting vulnerabilities are very difficult for human code reviewers to detect, as the rendered source code looks perfectly acceptable.
Specifically, the weakness was observed in Unicode’s bi-directional (Bidi) algorithm, which handles displaying text that includes mixed scripts with different display orders, such as Arabic – which is read right to left – and English (left to right). Unicode currently defines more than 143,000 characters across 154 different language scripts.
The researchers noted that where the algorithm’s default ordering is not sufficient, Unicode provides Bidi override control characters: invisible characters that switch the display ordering of groups of characters.
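For readers unfamiliar with these control characters, the short Python sketch below lists the Bidi embedding, override and isolate code points as defined by Unicode. The constant and function names are illustrative only and do not come from the paper.

```python
import unicodedata

# Bidi embedding/override controls (U+202A..U+202E) and isolate controls
# (U+2066..U+2069). All are invisible when rendered.
BIDI_CONTROLS = "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"


def describe_bidi_controls():
    """Return (code point, Bidi class abbreviation, Unicode name) per control."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.name(ch))
        for ch in BIDI_CONTROLS
    ]


for code_point, bidi_class, name in describe_bidi_controls():
    print(code_point, bidi_class, name)
```

Each of these characters occupies a position in the stored text but draws nothing on screen, which is why a reviewer cannot see them in an ordinary editor or diff view.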
Most programming languages allow these Bidi overrides to be put in comments and strings, which developers largely ignore. This enables targeted vulnerabilities to be inserted into source code without detection.
The authors Nicholas Boucher and Ross Anderson explained: “Therefore, by placing Bidi override characters exclusively within comments and strings, we can smuggle them into source code in a manner that most compilers will accept. Our key insight is that we can reorder source code characters in such a way that the resulting display order also represents syntactically valid source code.”
“Bringing all this together, we arrive at a novel supply-chain attack on source code. By injecting Unicode Bidi override characters into comments and strings, an adversary can produce syntactically-valid source code in most modern languages for which the display order of characters presents logic that diverges from the real logic. In effect, we anagram program A into program B.”
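One way a reviewer could see the stored (logical) order rather than the displayed order is to replace each hidden control with a visible tag. The sketch below does that in Python. The sample line is hypothetical and is not an exploit from the paper; it only shows controls tucked inside a comment.

```python
import unicodedata

# Bidi embedding, override and isolate controls (U+202A..U+202E, U+2066..U+2069).
BIDI_CONTROLS = frozenset("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def reveal_logical_order(line: str) -> str:
    """Return the line in stored order, with each Bidi control shown as a tag."""
    return "".join(
        f"<{unicodedata.bidirectional(ch)}>" if ch in BIDI_CONTROLS else ch
        for ch in line
    )


# Hypothetical line of code whose comment carries hidden controls.
sample = "access = 'user'  # \u202e\u2066 check admin \u2069\u2066"
print(reveal_logical_order(sample))
```

The printed line shows the characters in the order a compiler reads them, with tags such as `<RLO>` and `<LRI>` marking where the display order would otherwise be altered.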
The researchers added that Bidi override characters persist through the copy-and-paste functions on most modern browsers, editors and operating systems. Therefore, “any developer who copies code from an untrusted source into a protected code base may inadvertently introduce an invisible vulnerability.”
While there is currently no evidence that threat actors have utilized these types of attacks, the authors warned of the need for new security controls to counter this danger. They stated: “As powerful supply-chain attacks can be launched easily using these techniques, it is essential for organizations that participate in a software supply chain to implement defenses.
“We have discussed countermeasures that can be used at a variety of levels in the software development toolchain: the language specification, the compiler, the text editor, the code repository, and the build pipeline. We are of the view that the long-term solution to the problem will be deployed in compilers.”
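As a rough illustration of a check at the code repository or build pipeline level, the Python sketch below scans files for Bidi control characters and reports where they occur. It is a minimal example under stated assumptions, not the authors’ tooling: the function names are hypothetical, and it flags every Bidi control, including those in legitimate right-to-left text.

```python
from pathlib import Path

# Bidi embedding, override and isolate controls (U+202A..U+202E, U+2066..U+2069).
BIDI_CONTROLS = frozenset("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def find_bidi_controls(text):
    """Yield (line number, column, code point) for each Bidi control in text."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield line_no, col, f"U+{ord(ch):04X}"


def scan_paths(paths):
    """Print one finding per Bidi control and return the total count."""
    findings = 0
    for path in paths:
        # Undecodable bytes are replaced so one bad file does not stop the scan.
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for line_no, col, code_point in find_bidi_controls(text):
            print(f"{path}:{line_no}:{col}: Bidi control {code_point}")
            findings += 1
    return findings
```

A pipeline step could call `scan_paths` on the changed files and fail the build when the count is non-zero, leaving a human to decide whether the flagged characters are legitimate.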
Commenting on the research, Tim Mackey, principal security strategist at the Synopsys CyRC, said: “We’ve seen a variety of novel attacks on software supply chains in 2021, and this is another example of how the trust placed in development processes can be exploited. Teams intrinsically trust their developers, but developers are human and even the best developers can’t be expected to know all the nuances of how code libraries function.
“When in doubt, they’ll search the internet for examples. Those examples might just be exactly what’s needed to solve the problem, with a result of the found code being copied into the application. While legal teams have been concerned about the potential licensing liability surrounding copied code, an attack using Unicode bidi overrides should concern security teams since that perfect code might only look perfect to the human eye, but instead contain code representing the launch point for an attack that will ultimately be distributed by the application owner.”