AI Links, Now With Commentary

[Image: close-up of pegs on a Chutes and Ladders board.]

It’s not all ladders. Or chutes.

The world keeps spinning, doesn’t it? Nonetheless—a few things worth reading (I didn’t have AI help me find links this week. I used an RSS reader, like in ye days of olde).

“OpenAI, Google and Anthropic Are Struggling to Build More Advanced AI,” by Rachel Metz, Shirin Ghaffary, Dina Bass, and Julia Love in Bloomberg

The companies are facing several challenges. It’s become increasingly difficult to find new, untapped sources of high-quality, human-made training data that can be used to build more advanced AI systems. Orion’s unsatisfactory coding performance was due in part to the lack of sufficient coding data to train on, two people said. At the same time, even modest improvements may not be enough to justify the tremendous costs associated with building and operating new models, or to live up to the expectations that come with branding a product as a major upgrade.

My guess is folks will find ways to jump over this barrier but even if not—this is a problem for companies seeking trillion-dollar valuations, but less of a problem for me and you. At some level, I want these firms to slow down. We can’t even figure out the value of the current models, and there are tons of iterative improvements and integration points with the existing world of software (IDEs, databases, etc.) where LLMs could be useful. But right now, so much heat and light is on making the models bigger using nuclear power that no one cares about any of that. We need time to metabolize the brick we just swallowed. (Counterpoint.)

“How AI Could Break the Career Ladder,” by Molly Kinder in Bloomberg

Traditionally, the first few years of a newly accredited lawyer’s career are spent working under the tutelage of more senior lawyers, engaged in routine tasks like “document review,” basic research, drafting client communications, taking notes, and preparing briefs and other legal documents. Advances in AI-powered legal software have the potential to create vast efficiencies in these tasks, enabling their completion in a fraction of the time—and a fraction of the billable hours—that it has historically taken junior lawyers and paralegals to complete them.

This mirrors the challenge in software development: if AI can do the work of junior employees, how do we grow talent? My hunch is that the answer is slightly recursive: the systems themselves will guide people on how to upskill, and those people will be able to do more valuable work than note-taking. There’s always more work to be done, and done better, so maybe we can actually get to some of it? Would it be so bad if there were fewer personal injury lawyers advertising on the subway? I joke. But really.

“There’s No Longer Any Doubt That Hollywood Writing Is Powering AI,” by Alex Reisner in The Atlantic

I can now say with absolute confidence that many AI systems have been trained on TV and film writers’ work. Not just on The Godfather and Alf, but on more than 53,000 other movies and 85,000 other TV episodes: Dialogue from all of it is included in an AI-training data set that has been used by Apple, Anthropic, Meta, Nvidia, Salesforce, Bloomberg, and other companies. I recently downloaded this data set, which I saw referenced in papers about the development of various large language models (or LLMs). It includes writing from every film nominated for Best Picture from 1950 to 2016, at least 616 episodes of The Simpsons, 170 episodes of Seinfeld, 45 episodes of Twin Peaks, and every episode of The Wire, The Sopranos, and Breaking Bad. It even includes prewritten “live” dialogue from Golden Globes and Academy Awards broadcasts. If a chatbot can mimic a crime-show mobster or a sitcom alien—or, more pressingly, if it can piece together whole shows that might otherwise require a room of writers—data like this are part of the reason why.

OpenSubtitles is an absolute gift to the world when you want to watch some foreign film you could never see any other way, but this points out a fuzzy zone in culture: the people who like leaky intellectual-property protections tend to oppose AI gobbling up data and regurgitating it. It’s an interesting stress point in our new world. If Google builds both search indices (which we usually like) and LLMs (which make people anxious) from the same spidered info, how do we define what goes where? You tell me.

“Project: VERDAD—tracking misinformation in radio broadcasts using Gemini 1.5,” interview by Simon Willison

VERDAD tracks radio broadcasts from 48 different talk radio stations across the USA, primarily in Spanish. Audio from these stations is archived as MP3s, transcribed and then analyzed to identify potential examples of political misinformation.

The result is “snippets” of audio accompanied by the transcript, an English translation, categories indicating the type of misinformation that may be present and an LLM-generated explanation of why that snippet was selected.

These are then presented in an interface for human reviewers, who can listen directly to the audio in question, update the categories and add their own comments as well.

Every aspect of this project is fascinating. First, it makes audio legible at scale. Second, it makes foreign languages legible to English speakers. Third, it does a credible job of identifying misinformation patterns using AI (and as new patterns emerge one could revisit the entire corpus). Fourth, it makes all of this available online. 
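
To make that concrete, here’s a minimal sketch of what one pass of such a pipeline might look like, using the google-generativeai Python SDK. The prompt, category list, and snippet schema here are my illustrative guesses, not VERDAD’s actual ones:

```python
# A rough sketch of a VERDAD-style pass over one archived broadcast.
# Assumes the google-generativeai SDK; the categories and output schema
# below are hypothetical stand-ins, not the project's real ones.
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

CATEGORIES = ["election_procedures", "immigration", "public_health"]  # hypothetical

PROMPT = f"""You will hear a Spanish-language talk-radio segment.
Return a JSON list of snippets. Each snippet needs: a transcript,
an English translation, zero or more categories from {CATEGORIES},
and a short explanation of why it may contain political misinformation."""

def analyze_broadcast(mp3_path: str) -> list[dict]:
    """Transcribe, translate, and flag one archived MP3."""
    audio = genai.upload_file(mp3_path)  # Gemini 1.5 accepts audio directly
    response = model.generate_content(
        [audio, PROMPT],
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)  # snippets go to the human-review queue
```

The real system surely has more plumbing (station scheduling, retries, the review interface), but the core loop is plausibly about this short, which is rather the point.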

Five years ago, this kind of project would have cost millions of dollars of labor and development time, and would have required a large ongoing pool of translators; it would have been basically unfundable or relegated to a much smaller project performed by graduate students. Today, it is the (hard) work of a small cohort and runs in the background as code.

Counterpoint—from Hugging Face, via a Reddit post about the strangest/most unique AI models:

Disinfo4_mistral-ft-optimized-1218 is an experimental language model fine tune developed to synthesize and analyze complex narratives within the realms of continental philosophy, conspiracy theories, and political discourse. It represents the fourth iteration in the disinfo.zone dataset series, fine-tuned on the mistral-ft-optimized-1218 framework. This model, based on a 7B-parameter Mistral architecture, is specifically designed to emulate and deconstruct writing styles pertinent to its target domains. 

So that’s great. When I read these last two, I think of a world where we’ve drifted away from mass social media, into smaller pockets of online conversation, and vast wars are being fought on ghost networks between disinfo-spewing bots and huge cleanup crews of misinformation taggers.