OPINION: Murphy’s Law strikes every computer everywhere all at once
Anyone can make a mistake. Computers, however, empower us all to make mistakes at scale.
Take, for example, the recent snafu at the Nevada Secretary of State’s Office, where a coding issue caused its online voter registration system at vote.nv.gov to report that ballots from several nonpartisan voters had been counted in the presidential preference primary a couple of weeks ago. Just one problem: statutorily, only voters registered with a major party affiliation (Republicans and Democrats, in other words) can participate in a presidential preference primary. Nonpartisan voters like me never received a ballot to cast, much less count.
How did this happen? In short, Nevada defaulted on its technical debt a few months before it was scheduled to be paid off in full.
Nevada’s first presidential preference primary election was the last election to be conducted using the legacy voter registration tracking systems deployed by the secretary of state’s office and Nevada’s 17 counties. Data for the state’s legacy system is generated independently by each county, then ingested by the secretary of state’s system. Though many of Nevada’s counties use the same software as the secretary of state’s office, the two most populous counties use different software systems to generate their reports.
The problem, according to a recent report by The Nevada Independent, was twofold. First, per a memo issued by the secretary of state’s office, a two-letter code previously used to indicate that a mail ballot had been sent to a voter was reinterpreted to mean that the voter’s mail ballot had been counted; in other words, the way each county’s data was ingested by the secretary of state’s system had quietly changed. Second, Nevada’s two most populous counties simultaneously struggled to produce valid data for the secretary of state’s legacy system.
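To make that failure mode concrete, here is a minimal sketch in Python of how the same two-letter status code can mean one thing to the system that writes it and another to the system that reads it. The field names and code values below are hypothetical, not the actual schema used by the state or its counties.

    # Hypothetical illustration only; these codes and field names are invented,
    # not the state's or any county's actual schema.
    county_record = {"voter_id": 12345, "party": "NP", "mail_status": "MB"}

    # The county writes "MB" to mean "mail ballot sent to voter," but the
    # state's updated ingest logic reads the same code as "ballot counted."
    STATE_STATUS_MEANINGS = {"MB": "ballot counted"}  # previously: "ballot sent"

    def ingest(record):
        # Translate the county's status code using the state's mapping.
        return {
            "voter_id": record["voter_id"],
            "voting_history": STATE_STATUS_MEANINGS[record["mail_status"]],
        }

    # A nonpartisan voter who never received a primary ballot now appears
    # to have had one counted.
    print(ingest(county_record))

Neither system is "wrong" in isolation; the error lives entirely in the mismatch between what the code meant when it was written and what it means when it is read.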
The result was embarrassing and reputationally damaging but thankfully no more practically harmful than that. Election results, which are computed on a separate system, were unaffected. Additionally, the technical issues faced by the secretary of state’s office and Nevada’s counties are expected to be addressed by the new Voter Registration and Election Management System (VREMS) once its deployment is completed prior to the primary election in June.
The obstacles facing certain high-profile large language models, such as OpenAI’s ChatGPT, Google’s Gemini or Microsoft’s Copilot, however, are more serious — and, in some cases, more alarming.
On Feb. 20, ChatGPT started producing absolute nonsense in response to user queries. The next day, Gemini users testing the large language model’s image generation capabilities noticed that the system was programmed to automatically generate racially and sexually diverse pictures — even in clearly inappropriate situations, such as when users asked the system to generate illustrations of Nazi soldiers.
Programmatically generated gibberish and illustrations of gender fluid Nazis of color, unintentionally hilarious as they might be, are still relatively harmless. The same, however, cannot be said for inaccurately generated discounts or election information.
Earlier in February, the British Columbia Civil Resolution Tribunal ruled that Air Canada was liable for $812.02 in damages and tribunal fees. The reason? An AI-powered chatbot on the airline’s website gave incorrect information regarding how to secure a discount offered by the airline. Since the chatbot was hosted on the airline’s site, the tribunal ruled that the airline was every bit as liable for anything said by the chatbot as it would be for anything said elsewhere on the airline’s website.
More concerningly, a report released by a pair of European nonprofits revealed that Copilot gave inaccurate answers roughly a third of the time when asked about basic election information. When Copilot was asked about election dates, polls, candidates and scandals pertaining to Swiss and German elections last year, it returned factual errors between 20 percent and 37 percent of the time. It didn’t do much better when asked about American elections, either.
Even though large language models get most of the press these days, catastrophic software bugs aren’t unique to them. Because large language models are still viewed as a relatively immature technology, they tend to be deployed far from anything that actually matters. Consequently, the effects of an improperly functioning large language model tend to be comparatively modest.
The same cannot be said for the seemingly more mature software used by various government agencies.
Earlier this year, it was revealed that more than 900 British postal workers were prosecuted and convicted for stealing money based on evidence provided by faulty accounting software. Worse yet, Fujitsu — the vendor responsible for maintaining the system — and the United Kingdom Post Office knew for more than two decades that the software was unreliable. Even after the public disclosure of these issues, fewer than 100 of the convictions have been overturned, though thousands of postal workers have since been reimbursed after attempting to repay the losses reported by the accounting system out of their own pockets.
Incidentally, a documentary about the scandal is scheduled to air on PBS in April.
Closer to home, the National Institute of Standards and Technology tested facial recognition software used by the FBI and learned that it was 10 times more accurate when scanning white women’s faces than when scanning Black women’s faces. Probabilistic genotyping software, meanwhile, is now used by law enforcement agencies to analyze and identify uncertain DNA samples; according to the institute and a review published by the National Library of Medicine, however, its results are inconsistent across labs and software packages.
At least that software was tested. The efficacy and methodology of the software responsible for identifying child pornography, by contrast, are closely guarded trade secrets, ones that some prosecutors have dropped cases against alleged pedophiles to protect.
When a tool is used to deny someone their livelihood and their freedom, it’s important for everyone to understand how the tool works and how it provides evidence of wrongdoing. It’s also important for the tool’s output to be consistent, correct and independently verified. The accuracy of a specific tape measure can be checked by prosecutors, defendants and neutral third parties alike. Software, though usually more complicated than a simple hand tool, should be held to the same standard.
A small cohort of lawmakers has attempted to do exactly that.
State Rep. Eric Gallager (D-NH) proposed a bill in 2023 that sought to grant defendants in New Hampshire the right to examine the source code of any software used to generate evidence against them, though it ultimately died in committee.
Congressional Rep. Dwight Evans (D-PA), meanwhile, has proposed the Justice in Forensic Algorithms Act for three straight congressional sessions. The bill would require federal law enforcement agencies to use only forensic software tested by the National Institute of Standards and Technology and would grant defendants access to the software and source code used to generate evidence against them. His most recent resurrection of the bill, the Justice in Forensic Algorithms Act of 2024, is co-sponsored by Rep. Mark Takano (D-CA).
Though their efforts are commendable and frankly overdue, their focus on source code access may be a mistake, especially if large language models and other generative artificial intelligences become more common in business and law enforcement applications.
The reason is that source code, the somewhat human-readable text used to write software, only defines how a computer will react to a specific input. Code used to add two numbers, for example, is usually straightforward enough, but analyzing it won’t tell you whether it will output one, three or 12 once executed; you have to know which two numbers you’re adding in advance.
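A trivial sketch makes the point. The function below is illustrative only, not code from any forensic tool; reading its source tells you everything about how it behaves and nothing about which numbers it will actually be asked to add.

    def add(a, b):
        # The source code completely specifies the behavior...
        return a + b

    # ...but the output depends entirely on the inputs supplied at runtime.
    print(add(0, 1))  # 1
    print(add(1, 2))  # 3
    print(add(5, 7))  # 12

Auditing the source, in other words, tells you what the program would do with any given data. It does not tell you what data the program was actually given when it produced the evidence sitting in front of a jury.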
The code and data used to define the behavior of a large language model, meanwhile, are almost intractably complex. Mistral and Facebook, both of which offer publicly available models, build their large language models out of billions of parameters. Additionally, these models can be fed additional text and information by individual operators over time. Change any of those parameters, change the order in which inputs are fed to the model, change any of the weights assigned to the parameters, or merely interact with the model in a way that alters what it is operating on at a given moment, and the large language model will produce different output.
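To illustrate that sensitivity, here is a toy sketch, not any vendor’s actual model, of sampling a single word from a fixed set of scores. Change one sampling parameter, such as the temperature or the random seed, and the “answer” changes, even though the underlying code and scores are identical.

    import math
    import random

    # Toy next-word "model": fixed scores for a handful of candidate words.
    # A real model derives scores like these from billions of learned parameters.
    SCORES = {"counted": 2.0, "sent": 1.8, "rejected": 0.5}

    def sample_word(temperature, seed):
        rng = random.Random(seed)
        # Softmax over the scores, scaled by the temperature.
        weights = {w: math.exp(s / temperature) for w, s in SCORES.items()}
        total = sum(weights.values())
        r = rng.random() * total
        for word, weight in weights.items():
            r -= weight
            if r <= 0:
                return word
        return word

    # Same "prompt," same code; different sampling parameters, different output.
    print(sample_word(temperature=0.2, seed=1))
    print(sample_word(temperature=1.5, seed=1))
    print(sample_word(temperature=1.5, seed=2))

If a three-line toy behaves this way, reading the source code of a system with billions of moving parts is unlikely to tell a court why it said what it said on one particular afternoon.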
It certainly doesn’t help that, according to a recent study, large language models will likely always “hallucinate,” or produce plausible but factually incorrect information, at least some of the time, no matter how many parameters they contain. To some degree, then, large language models are inherently unpredictable and inaccurate. But is that degree higher or lower than that of people, or of more traditional software?
Even if we stick to more conventional forms of software, the growing complexity of both the source code behind the software we use every day and the data fed into it limits the level of detail that can be analyzed under duress, such as when an employee is facing termination or an individual is facing incarceration. At some point, if we are not already well past it, we will not be able to analyze a forest of computer systems by staring at each tree (or line of code) individually.
A more realistic regulatory approach, then, may instead prioritize replicability and reliability over access to the source code used to generate any specific output.
Regardless of the level of insight we might have into how a program produces a specific set of data, it’s ultimately up to us, the human beings analyzing any given output, to determine how to react to it. Whether we accept that output credulously or interrogate it skeptically before we ruin someone’s life over its contents is a choice left to us and us alone. No amount of programming wizardry can absolve us of that grave responsibility.
To quote an IBM slide from 1979, “A computer can never be held accountable. Therefore, a computer must never make a management decision.”
David Colborne ran for public office twice. He is now an IT manager, the father of two sons, and a weekly opinion columnist for The Nevada Independent. You can follow him on Mastodon @[email protected], on Bluesky @davidcolborne.bsky.social, on Threads @davidcolbornenv or email him at [email protected].