Dire le vrai. Perspectives situées

Edited by Gilbert Willy Tio Babena

“The true, the truth, is what denouncers and whistleblowers alike are after, as are plain speakers and those who practise outing, playground ‘tattletales’ and penitents in the confessional. These tellers of truth, separated by everything, nevertheless share a common goal: to reveal information, to move from the hidden to the known, to flush out secrets, the unknown, the clandestine. Around these themes, a thousand things, from every era and every culture, could be said, and written,” writes Marie-Anne Paveau in her call for works of the mind to furnish the walls of La Villa réflexive. The present volume is the product of that epistemological experiment on the theme of truth-telling. Linked yet independent of one another, the chapters are written in a decolonial style that breaks with the codes of positivist writing to offer readers a reflexive look at truth from situated perspectives. To broaden the horizon, the volume offers a plurality of multimodal resources (hyperlinks, images, QR codes for videos and audio) that can be consulted in the digital versions as well as in print.

A book in the Réflexivités et expérimentations épistémologiques collection

Print ISBN: 978-2-925128-30-4

PDF ISBN: 978-2-925128-29-8

DOI: 10.5281/zenodo.8139735

319 pages

Cover design: Kate McDonnell

Publication date: 2023

This book is published with the support of the Faculty of Arts, Letters and Human Sciences of the University of Maroua.

***

In memoriam

Ouvrir la porte de la vérité discursive – Gilbert Willy Tio Babena

I. Qu’est-ce que la vérité? Qui dit vrai et comment?
III. Les voiles ou les falsifications de la vérité

Machine Anxiety or Why I Should Close TikTok–But Don’t

It’s Wednesday evening and I have no specific plans. I’m chilling on the sofa, scrolling through TikTok and Instagram while Below Deck Sailing Yacht is playing in the background. Even though I’m not moving, I’m tired and bored. I have multiple screens open, trying to distract myself, with no plans and no people available to save me from myself. It’s just me, my apps–and my algorithms.

Suddenly I receive an alert from my period tracker telling me that my period will most likely come on Friday, which explains why I feel so tired and why I keep craving bread even though I’m gluten intolerant. Unsurprisingly, Instagram and TikTok know this as well. Comfy clothing ads, vegan Ben & Jerry’s, pilates against general inflammation, Beyoncé’s Renaissance tour, carrot salad for hormonal balance, retinol, Margiela/Ann Demeulemeester/Rick Owens/The Row clean girl aesthetic mixed with ‘you probably have autism’ videos.

An Uber Eats pop-up reminds me that I can order Indian instead of cooking. 40-50 min away. Great, butter chicken it is.

I recently heard someone talk about an epidemic of people self-diagnosing with autism and ADHD: yes, they are exhibiting symptoms, but maybe it’s just the dopamine burnout caused by the same apps that made them self-diagnose in the first place. I wonder how much of this might be true. Are these algorithms so good at analysing our behaviour that they end up reflecting it back to us in a digested 20-second video that allows us to identify things in ourselves that we weren’t aware of? Or are we consuming this content at such volume and speed that we end up becoming what they predict us to be? In other words, are we fulfilling their prophecies, or do they know us better than we know ourselves?

Did I really want those Tabi ballet flats, or did the algorithm make me buy them? Do I have ADHD, or am I experiencing dopamine burnout? Am I having a style crisis because I am an evolving human being or because the algorithm keeps pushing me into the clean girl aesthetic while also wanting me to lean into Y2K and Rick Owens vibes but also learn how to wear a fucking hair clip correctly because that’s what the Copenhagen-influenced Amsterdam girlies are into? Am I ready to move into a cabin in the woods and live my moss girl dreams or go clubbing in Berghain, pluck my eyebrows to death and bleach my hair? Is my stomach hurting because all of this is going through my head and my screens (yes, multiple) at the same time? Because you can have it all girl, you go girl, work-life balance girl, celery juice girl. Or do I have that rare, incurable, undiagnosed disease the algorithm told me to google on WebMD?

Am I going blind and in need of glasses, or should I just listen to my mom and stare into the distance for at least 10 minutes every hour?

When I was younger, things seemed easier but also a lot more serious. Now things seem unserious and a lot more complicated. Nothing is that important anymore, but everything seems to have a thousand more layers; everything is more nuanced and complex while, at the same time, stupid. I feel very old saying that. And yet I grew up in the middle of a digital revolution. I can’t remember a time when there wasn’t a computer in my house. I remember being very little and playing with the Paintbrush app on my father’s Macintosh. His cellphone was the size of a brick, and you could hear the sound of the internet over the house phone. Yes, we had landlines. We had a set of CDs containing the Encyclopaedia Britannica instead of Google and Wikipedia. Facts seemed a lot easier to identify, and fiction was a thing left to the arts. Nobody was talking about the Pope wearing Moncler, and Trump being president would have been unimaginable.

In the era of AI and misinformation, life has never been more confusing. Facts and fiction are blended seamlessly. All information seems extremely urgent and, at the same time, irrelevant. It has made sceptics out of all of us, hyper-aware that at any time we can be deceived.

But the nature of AI has always been deceptive. Its success has always relied on its capacity to imitate, trick or replicate human language. In Alan Turing’s Computing Machinery and Intelligence, deception is placed at the centre of the test to determine a machine’s capacity to exhibit intelligent behaviour. Turing’s test proposed judging machines on their capacity to make human subjects believe they are human. So as technology advanced, AI scientists began studying humans’ reactions to the machine in order to improve its performance, building on Turing’s work. And even though deception was never the main objective, creating the illusion of intelligence rather than intelligence itself became the force driving sentient-like technologies like AI. As Simone Natale points out, “While debates have largely focused on the possibility that the pursuit of strong AI would lead to forms of consciousness similar or alternative to that of humans, where we have landed might more accurately be described as the creation of a range of technologies that provide an illusion of intelligence—in other words, the creation not of intelligent beings but of technologies that humans perceive as intelligent”. Turing named this ‘the imitation game’.
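To make the structure of that test concrete, here is a minimal, purely illustrative sketch of the imitation game as a protocol. The responder functions and the naive judge below are made up for illustration; only the shape of the setup (a hidden interlocutor, a judge who sees nothing but the transcript, a guess) reflects Turing’s description.

```python
import random

# Toy sketch of the imitation game: the judge never sees who is answering,
# only the transcript, and then has to guess "machine" or "human".

def human_responder(question: str) -> str:
    return "Honestly, I'd have to think about that."   # stand-in for a person

def machine_responder(question: str) -> str:
    return "That is an interesting question."          # a system trying to sound human

def run_round(questions, judge) -> bool:
    """Play one round; return True if the judge identifies the machine correctly."""
    is_machine = random.random() < 0.5                  # identity hidden from the judge
    responder = machine_responder if is_machine else human_responder
    transcript = [(q, responder(q)) for q in questions]
    return judge(transcript) == is_machine

if __name__ == "__main__":
    naive_judge = lambda transcript: random.random() < 0.5   # a judge who just guesses
    results = [run_round(["Do you ever get bored?"], naive_judge) for _ in range(1000)]
    print(f"judge accuracy: {sum(results) / len(results):.2f}")   # hovers around 0.50
```

A machine “wins” precisely when the judge can do no better than this coin-flip baseline, which is why the test measures the success of the illusion rather than anything about the machine’s inner life.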

As algorithms got better at imitating us and scientists got better at training them, we also became lazier at recognising them, making it easier for us to fall into the illusion.

In Deceitful Media: Artificial Intelligence and Social Life after the Turing Test, Natale states that “At the roots of technology’s association with magic lies, in fact, its opacity. Our wonder at technological innovations often derives from our failure to understand the technical means through which they work, just as our amazement at a magician’s feat depends partly on our inability to understand the trick”. Yet in my experience, knowing does not guarantee that we will not fall into the illusion. In fact, most people who enjoy magic tricks are not ignorant of how the tricks are performed, at least at a superficial level. Magic shows still attract masses of people ready to surrender to fantasy in exchange for entertainment, aware that it is not real magic. Even more, magicians themselves are avid consumers of the trickery of their colleagues. Because deep down, we all want to be believers.

Our interactions with AI are based, as with many technologies and other systems of belief, on the projections we make in the spaces left by the illusion. We project into the machine our desire to see something that confirms our expectations. We deeply want to believe that what we want to see, hear, feel, and experience is really there.

It’s not surprising that in our loneliest or most boring moments, we turn to our machines for companionship, wanting to believe in the promise of closeness, of something that reflects back to us our deepest fears, wildest dreams and general anxieties, all repackaged in a shiny wrapper of entertainment or distraction, and the promise of taking our problems away.

AI will save the world, solve climate change, inequality, work, creativity blocks, mental health!

When Eliza, one of the first chatbots, built in 1964 by Joseph Weizenbaum (also known as one of the fathers of modern AI), was put to the test with Weizenbaum’s secretary, she famously asked him to leave the room because the conversation between her and the machine had turned too personal, too intimate. You see, Eliza was programmed to emulate a non-directive psychotherapist, and Weizenbaum’s intention was to prove how superficial communication between humans and machines was. Instead, he ended up proving the opposite, sort of. The secretary ended up projecting her desire to be heard onto the machine. In psychology, projection is when ‘inside’ content is mistaken for something coming from the ‘outside’, or from the Other.
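For a sense of how little machinery that illusion required, here is a minimal Eliza-style sketch, assuming a few made-up rules rather than Weizenbaum’s actual script: nothing but pattern matching and pronoun reflection, which was enough to make someone feel heard.

```python
import re

# Minimal Eliza-style responder: no understanding, only regex rules plus
# pronoun "reflection" so the reply appears to mirror the speaker.
REFLECTIONS = {"i": "you", "my": "your", "am": "are", "me": "you", "mine": "yours"}

RULES = [
    (r"i feel (.*)", "Why do you feel {0}?"),
    (r"i want (.*)", "What would it mean to you if you got {0}?"),
    (r"my (.*)", "Tell me more about your {0}."),
    (r"(.*)", "Please go on."),                 # fallback: keep the person talking
]

def reflect(phrase: str) -> str:
    """Swap first- and second-person words so the answer mirrors the speaker."""
    return " ".join(REFLECTIONS.get(word, word) for word in phrase.lower().split())

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        match = re.match(pattern, utterance.lower().strip())
        if match:
            return template.format(*(reflect(g) for g in match.groups()))
    return "Please go on."

if __name__ == "__main__":
    print(respond("I feel nobody listens to me"))
    # -> Why do you feel nobody listens to you?
```

The projection happens entirely on the human side; the script never has to be smarter than this.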

She, too, wanted to believe.

In the summer of 2022, I graduated from the Sandberg Institute, where I did a temporary master’s program called F for Fact. The program (which was extended for two more years) focused on investigating different ways of knowing through artistic research: the blurry lines between facts and fiction, the way knowledge is produced, what knowledge is and what it is not.

One of the things you need to do to graduate is write a thesis. At the time, I wasn’t looking forward to it. My bachelor’s thesis had left me with some PTSD, and I didn’t want to sound stupid or like I was trying too hard. So I thought it would be a great idea to ask GPT2 (just released on early sign-up access) to write my thesis for me. I had always been fascinated by technology, and I was then in my Google Earth era, working on a project about the materiality of digital technologies and the internet, researching transatlantic internet cable networks and lithium mines. So it seemed only fitting to use this new technology to write my thesis for me.

What started as a simple ‘I am too lazy and insecure, let a machine do it for me’ became an exploration of how these technologies would change the way we create knowledge and whether knowledge could be generated at all. Could we outsource knowledge creation to machines? Could I ‘cheat’ my way out of the thesis? Long story short, it turns out I couldn’t. Automation was not liberation. I still needed to write it, and it probably would have been easier just to write it myself. But the process became the topic of my thesis and the object of the research itself.

AKA, I ended up writing about co-writing with AI while co-writing with AI.

Looking back, one of the most interesting parts of co-writing was that even though I went into the process thinking, ‘I’m not gonna fall for it’, at times, I ended up forgetting I was talking to algorithms. Turns out I also wanted to believe in the promise of a machine that could help me overcome my anxieties around writing. And it kind of did, just not in the way I was expecting it to.

What happened is that I needed to be extremely precise about what I wanted to write, or else the algorithms would take me to topics I didn’t want or need to address. Nowadays this is clearly exemplified by how important prompt engineers are becoming when working with AI. The capacity to get what you want from the algorithms is directly linked to the quality of the prompt. AKA what you ask is what you get, but not always what you want.
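As a toy illustration of that ‘what you ask is what you get’ point, here is a sketch contrasting a vague and a precise prompt. The `generate` function is a hypothetical placeholder for whatever model or API is actually used, not a real library call.

```python
# Hypothetical stand-in for a text-generation call; plug in a real model or API here.
def generate(prompt: str) -> str:
    raise NotImplementedError("replace with your own model call")

# A vague prompt leaves the model free to wander anywhere:
vague_prompt = "Write something about the internet."

# A precise prompt constrains subject, voice, length and exclusions,
# which is most of what "prompt engineering" amounts to in practice:
precise_prompt = (
    "Write two first-person paragraphs about the materiality of the internet, "
    "focusing on transatlantic cables and lithium mines. Do not discuss social "
    "media, and keep the tone reflective rather than technical."
)
```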

I couldn’t get what I wanted, a quick thesis. But I got what I needed, a bunch of AIs making me realise I was not as bad of a writer as I thought I was.

In the end, the thesis became a collection of texts co-written by me and a number of programs: GPT2, GPT3, Eliza and Replika. On top of that, I wrote a reflective text of my own, in which I looked back on the joys and frustrations of trying to co-write with AI, the problematic things in it (biases and all) and the need to engage with these tools with a critical eye.

I started as a sceptic, stumbled into my own projections and beliefs, and I ended up falling in love with the glitchy parts of my dear machines, which offered digital companionship and collaboration when I most needed it.

It is now a Monday evening, and while I am working on this text, I am thinking about a lecture I recently gave at an AI department of a Dutch university. There I discussed how I work with AI to co-write and collaborate on different projects such as my thesis. One of the scientists asked me whether I was afraid of AI. I answered that I was afraid of what humans could do with it. Another asked if I thought artists would be replaced by AIs and the future of human art was dead. I pointed out how with the invention of the camera, people predicted the end of painting, yet painters still paint. And in time, the camera itself became a tool for artistic production, not only documentation. It didn’t take that long before artists started experimenting with the new medium and adopting it as part of their set of tools. It will be the same with AI. Artists will always make art. And technological advancements will always go through a period of adjustment before they are normalised and demystified. When the phonograph was invented, people feared it was capturing the souls of the people it recorded. When Kindles appeared, people speculated it was the end of books. And in the face of Spotify, there’s a return to the LPs. So perhaps what will happen is that human-made art will be the equivalent of organic vegetables with a ‘made by humans’ certificate on it.

A long time ago, we dreamed of and feared the end of work with the rise of machines. We dreamed of having more free time but also of losing our jobs. Instead, we are working more than ever, having less free time and inventing more jobs.

We keep hoping that making more machines will save us from the problems other machines have created. Recently I read an article in the Guardian pointing out that big tech people and politicians constantly mention how through AI, they will be able to make smarter choices and solve big problems for the benefit of humankind. But this is in itself quite stupid. Solving climate change doesn’t need smarter minds to create new solutions. We already know what to do; we just need people ready to put economic interests aside in order to reverse decades of damage.

As I sit on my sofa, again, trying not to fall into another TikTok hole, I realise I’m getting anxious again because I’m thinking about climate change, and that shit is scary. So I ask ChatGPT to make a joke to lighten the mood.

This is what came out: “Why did the AI start worrying about climate change? Because it realised it couldn’t CTRL+Z the planet’s rising temperature.”

It also explained why it should be funny:

“This joke is funny because it combines a humorous play on words with a relevant and serious topic. The punchline revolves around the computer keyboard shortcut ‘CTRL+Z,’ which is commonly used to undo or revert actions in many software programs. In this context, it is used metaphorically to suggest that the AI realised it couldn’t undo or reverse the planet’s rising temperature caused by climate change.

The joke cleverly blends technology and the environment, highlighting the AI’s recognition of a problem that cannot be easily undone or reversed. By juxtaposing the familiar computer shortcut with the pressing issue of climate change, the joke elicits a humorous response while also highlighting the seriousness and urgency of addressing environmental concerns.”

I am suddenly very aware of the irony of using a super-polluting technology to talk about climate change. And even more, to make jokes about it. According to Bloomberg, the initial training of GPT3 consumed more energy than the average household consumes in 3 years. And I’m not even mentioning all the material resources: the minerals and water and cables, the Roombas cleaning the servers and the people making sure the Roombas are emptied and plugged in, the coders working for 11 dollars per hour, and us training them for free. The old adage: if you are not paying for it, you are the product.

It is hard not to fall for something designed to mirror us and keep us entertained. We all like looking at ourselves, and the algorithms know it. We all want to see more of what we like, especially when the world seems to be ending… once more. The one thing that truly comforts me is knowing how many times humanity has predicted the end of the world throughout history. And yet here we are.

I feel guilty from time to time because I am using technologies that I know are bad for me and the environment. But I also make a point of trying to use them critically and planting my tiny seed of resistance. Clicking ‘no’ on all cookie pop-ups. Deleting my apps now and then. Using alternative platforms and programs. Blocking all ads. Going out for a walk instead of staring at my phone. Helping an old lady cross the street and carrying her groceries. I remember Michelle Young’s words: “Power is relative. No one of us can bring about change by ourselves. But for each of us, our part is vital.”

I’ll try to listen more to my mom and to stare out of the window for ten minutes now and then.

To not WebMD my symptoms. To not buy the next thing TikTok tells me to buy, because it won’t solve all my problems.

I’ll also try not to feel so guilty and do more. To acknowledge that our relationship with technology is very intimate and intricate but also problematic. Like a codependent relationship. Maybe we should all go to therapy.

But also, like @ummsimonee said, “If you speak what you want into existence, at the very least, the Instagram algorithm will hear you.”

And my personal favourite: “Nobody knows me like the notes app does.”

This text is an adaptation of a lecture on Co-Creation with AI, given at Spui25 and organised by The Hmm.

 

Your Voice is (Not) Your Passport

In summer 2021, sound artist, engineer, musician, and educator Johann Diedrick convened a panel at the intersection of racial bias, listening, and AI technology at Pioneerworks in Brooklyn, NY. Diedrick, 2021 Mozilla Creative Media award recipient and creator of such works as Dark Matters, is currently working on identifying the origins of racial bias in voice interface systems. Dark Matters, according to Squeaky Wheel, “exposes the absence of Black speech in the datasets used to train voice interface systems in consumer artificial intelligence products such as Alexa and Siri. Utilizing 3D modeling, sound, and storytelling, the project challenges our communities to grapple with racism and inequity through speech and the spoken word, and how AI systems underserve Black communities.” And now, he’s working with SO! as guest editor for this series (along with ed-in-chief JS!). It kicked off with Amina Abbas-Nazari’s post, helping us to understand how Speech AI systems operate from a very limiting set of assumptions about the human voice. Last week, Golden Owens took a deep historical dive into the racialized sound of servitude in America and how this impacts Intelligent Virtual Assistants. Today, Michelle Pfeifer explores how some nations are attempting to draw sonic borders, despite the fact that voices are not passports.–JS

In the 1992 Hollywood film Sneakers, depicting a group of hackers led by Robert Redford performing a heist, one of the central security architectures the group needs to get around is a voice verification system. A computer screen asks for verification by voice and Robert Redford uses a “faked” tape recording that says “Hi, my name is Werner Brandes. My voice is my passport. Verify me.” The hack is successful and Redford can pass through the securely locked door to continue the heist. Looking back at the scene today it is a striking early representation of the phenomenon we now call a “deep fake” but also, to get directly at the topic of this post, the utter ubiquity of voice ID for security purposes in this 30-year-old imagined future.

In 2018, The Intercept reported that Amazon filed a patent to analyze and recognize users’ accents to determine their ethnic origin, raising suspicion that this data could be accessed and used by police and immigration enforcement. While Amazon seemed most interested in using voice data to target users with discriminatory advertising, the jump to increasing surveillance seemed frighteningly close, especially because people’s affective and emotional states are already being used for the development of voice profiling and voice prints that expand surveillance and discrimination. For example, voice prints of incarcerated people are collected and extracted to build databases of calls that include the voices of people on the other end of the line.


“Collect Calls From Prison” by Flickr User Cobalt123 (CC BY-NC-SA 2.0)

What strikes me most about these vocal identification and recognition technologies is how, for advertisers, surveillers, and police alike, their appeal seems to lie in the idea that the voice is an attractive method of accessing someone’s identity. Supposedly there are fewer possibilities to evade or obfuscate identification when it is performed via the voice. It “is seen as a solution that makes it nearly impossible for people to hide their feelings or evade their identities.” The voice here works as an identification document, as a passport. While passports can be lost or forged, accent supposedly gives access to an identity that is innate, unchanging, and tied to the body. But passports are not only identification documents. They are also media of mobility, globally unequally distributed, that allow or inhibit movement across borders. States want to know who crosses their borders, who enters and leaves their territory, increasingly so in the name of security.

What, then, when the voice becomes a passport? Voice recognition systems used in asylum administration in the Global North show what is at stake when the voice, and more specifically language and dialect, come to stand in for a person’s official national identity. Several states, including Denmark, the Netherlands, the United Kingdom, Switzerland, and Sweden, as well as Australia and Canada, have been experimenting with having the voice, or more precisely language and dialect, take on the passport’s role of identifying and excluding people.

“Passport Brochure” by Craig James (CC BY-NC 2.0)

In the 1990s—not too far from the time of Sneakers’ release—these states started to use a crude form of linguistic analysis, later termed Language Analysis for the Determination of Origin (LADO), as part of the administration of claims to asylum. In cases where people could not provide a form of identity documentation, or when those documents were considered fraudulent or inauthentic, caseworkers would look for this national identity in people’s languages and dialects. LADO analyzes acoustic and phonetic features of recorded speech samples in relation to phonetics, morphology, syntax, and lexicon, as well as intonation and pronunciation.

The problems with and assumptions behind this linguistic analysis are multiple, as linguists have pointed out and critiqued. 1) It falsely ties language to territorial and geopolitical boundaries and assumes that language is intimately tied to a place of origin, according to a language ideology that maps linguistic boundaries onto geographical boundaries. Nation-state borders on the African continent and in the Middle East were drawn by colonial powers without consideration of linguistic communities. 2) LADO treats language and dialect as static, monoglossic, and a stable index of identity. These assumptions produce the idea of a linguistic passport, in which language is supposed to function as a form of official state identification that distributes possibilities and impossibilities of movement and mobility. As a result, the voice becomes a passport, and it simultaneously functions as a border by inscribing language into territoriality. As Lawrence Abu Hamdan has written and shown through his sound artwork The Freedom of Speech Itself, LADO functions to control territory, produce national space, and attempts to establish a correlation between voice and citizenship.

Language Analysis is the Second Step in Claiming Asylum in the UK (Home Office Science: Migration Border Analysis, 2012 p.37), see also K. Wilson’s LADO: An Investigative Study

I’ll add that the very idea of a passport has a history rooted in forms of colonial governance and population control and in the modern nation-state and territorial borders. The body is intimately tied to the history of passports and biometrics. For example, German colonial administrators in South-West Africa, present-day Namibia and a German overseas colony from 1884 to 1919, instituted a pass badge system to control the mobility of Indigenous people, create an exploitable labor force, and institute and reinforce white supremacy and colonial exploitation. Media and Black Studies scholar Simone Browne uses the term “digital epidermalization” to describe how surveillance becomes inscribed and encoded on the skin. Now, it’s coming for the voice too.

In 2016 the German government took LADO a step further and started to use what it calls voice biometric software that supposedly identifies the place of origin of people who are seeking asylum. Someone’s spoken dialect is supposedly recognized and verified, on the basis of speech recordings with an average length of 25.7 seconds, by software employed by the German Federal Office for Migration and Refugees (abbreviated in German as BAMF). The dialect recognition software now used by German asylum administrators distinguishes between five large Arabic dialect groups: Levantine, Maghrebi, Iraqi, Egyptian, and Gulf Arabic. Just recently this was expanded with language models for Farsi, Dari and Pashto. There are plans to expand the software’s use to other European countries, evidenced by BAMF traveling to other countries to demonstrate it.

“voice vectors” Universal (CC0 1.0)

This “branding” of BAMF’s software stands in stark contradiction to its functionality. The software’s error rate is 20 percent. It is based on a speech sample as short as 26 seconds. People are asked to describe pictures while their speech is recorded; the software then indicates a percentage of probability for the spoken dialect and produces a score sheet that could read as follows: 74% Egyptian, 13% Levantine, 8% Gulf Arabic, 5% Other. The interpretation of the results is left to caseworkers, without clear instructions on how to weigh those percentages against each other. The discretion left to caseworkers makes it more difficult to appeal asylum decisions. According to the Federal Office, the results are supposed to give indications and clues about someone’s origin and are not a decision-making tool. However, as I have argued elsewhere, algorithmic or so-called “intelligent” bordering practices assume neutrality and objectivity and thereby conceal forms of discrimination embedded in technologies. In the case of dialect recognition, the score sheet’s indicated probabilities produce a seeming objectivity that might sway caseworkers in one direction or another. Moreover, the software encodes distinctions between who is deserving of protection and who is not; a feature of asylum and refugee protection regimes critiqued by many working in the field.
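Because the software’s internals are classified (as the next paragraph notes), the following is only an illustrative sketch of how a generic dialect classifier could turn a short recording into a score sheet of this kind. The features (averaged MFCCs), the linear model, and the labels are assumptions made for illustration, not a description of the BAMF system.

```python
import numpy as np

# Illustrative only: map a feature vector extracted from a ~26-second recording
# to a percentage per dialect label via a linear model and a softmax.
DIALECTS = ["Egyptian", "Levantine", "Gulf Arabic", "Other"]

def softmax(scores: np.ndarray) -> np.ndarray:
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def score_sheet(features: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> dict:
    """Turn acoustic features (e.g. averaged MFCCs) into a percentage per dialect."""
    probabilities = softmax(weights @ features + bias)
    return {d: round(100 * float(p)) for d, p in zip(DIALECTS, probabilities)}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_features = rng.normal(size=13)        # stand-in for MFCC averages
    fake_weights = rng.normal(size=(4, 13))    # stand-in for trained weights
    print(score_sheet(fake_features, fake_weights, np.zeros(4)))
    # prints one possible percentage breakdown over the four labels
```

Whatever numbers such a sketch prints, they arrive with the same false precision as the real score sheet: the choices of features, labels, and training data stay hidden behind percentages that caseworkers are left to interpret.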

The functionality and operations of the software are also intentionally obscured. Researcher and sound artist Pedro Oliveira addresses the many black-boxed assumptions entering the dialect recognition technology. For instance, in his work Das hätte nicht passieren dürfen he engages with the labor involved in producing sound archives and speech corpora and challenges “the idea that it might be feasible, for the purposes of biometric assessment, to divorce a sound’s materiality from its constitution as a cultural phenomenon.” Oliveira’s work counters the lack of transparency and accountability of the BAMF software. Information about its functionality is scarce. Freedom of information requests and parliamentary inquiries about the technical and algorithmic properties and training data of the software were denied, the information being classified because, as the German government argued, “the information can be used to prepare conscious acts of deception in the asylum proceeding and misuse language recognition for manipulation.” While it is not necessarily deepfakes like the one Brandes produced to get around a security system that the German authorities are worried about, the specter of manipulation of the software looms large.

The software’s poor functionality can have drastic consequences for asylum decisions. Vice reported in 2018 the story of Hajar, whose name was changed to protect his identity. Hajar’s asylum application in Germany was denied on the basis of a dialect recognition software that supposedly indicated that he was a Turkish speaker and, thus, could not be from the Autonomous Region of Kurdistan as he claimed. Hajar, who speaks the Kurdish dialect Sorani, had been instructed by BAMF to speak into a telephone receiver and describe an image in his first language. The software’s results indicated a 63% probability that Hajar speaks Turkish, and the caseworker concluded that Hajar had lied in his asylum hearings about his origin and his reasons for seeking asylum in Germany. Hajar continued to appeal the asylum decision. The software is not equipped to verify Sorani and should not have been used on Hajar in the first place.

Biometric Island, Gdansk University of Technology 2021, Image by Dawid Weber  (CC BY 3.0)

Why the voice? It seems that bureaucrats and caseworkers saw it as a way to identify people and to scale up language analysis with ease. It is also important to consider the context in which this so-called voice biometry is used. Many people who seek asylum in Germany cannot provide identity documents like passports, birth certificates, or identification cards. This is because people cannot take them along as they flee, because documents are lost or stolen on the journey, or because they are confiscated by traffickers. Many forms of documentation are also not accepted as legitimate by state authorities. Generally, language analysis is used in a hostile political context in which claims to asylum are increasingly treated with suspicion.

The voice, as a part of the body, was supposed to provide an answer to this administrative problem of states. In response to the long summer of migration in 2015, Germany hired McKinsey to overhaul its administrative processes, save money, accelerate asylum procedures, and make them more “efficient.” In July 2017, the head of the Department for Infrastructure and Information Technology of the German Federal Office for Migration and Refugees hailed the office’s new voice and dialect recognition software as “unrivaled world-wide” in its capacity to determine the region of origin of asylum seekers and to “detect inconsistencies” in narratives about their need for protection. More than identification documents, personal narratives, or other features of the body, the voice, the BAMF expert suggests, is the medium that allows for the indisputable verification of migrants’ claims to asylum, ostensibly pinpointing their place of origin.

Voice and dialect recognition technologies are established by policy makers and security industries as particularly successful tools to produce authentic evidence about the origin of asylum seekers. Asylum seekers have to sound as if they are from a region that warrants their claims to asylum, which requires the translation of voices into geographical locations. As a result, automated dialect recognition becomes more valuable than someone’s testimony. In other words, the voice, abstracted into a percentage, becomes the testimony. Here, the software, similarly to other biometric security systems, is framed as a more objective, neutral, and efficient way of identifying people’s country of origin than human decision-makers. As the German migration agency argued in 2017: “The IT supported, automated voice biometric analysis provides an independent, objective and large-scale method for the verification of the indicated origin.”

“Soundwave and Spectrogram of ‘CIRCLE’” by Lena Zipp, University of Zurich (CC BY-NC-ND 2.0)

The use of dialect recognition puts forth an understanding of the voice and language that pinpoints someone’s origin to a certain place, without a doubt and without considering someone’s movement or history. In this sense, the software inscribes a vision of a sedentary, ahistorical, static, fixed, and abstracted human into its operations. As a result, geographical borders become reinforced and policed as fixed boundaries of territorial sovereignty. This vision of the voice ignores multiple mobilities and (post)colonial histories and reinscribes the borders of nation-states that reproduce racial violence globally. Dialect recognition reproduces precarity for people seeking asylum. As I have shown elsewhere, in the absence of other forms of identification and the presence of generalized suspicion of asylum claims, accent accumulates value while the content of testimony becomes devalued. Asylum applicants are placed in a double bind, simultaneously being incited to speak during asylum procedures and having their testimony scrutinized and placed under general suspicion.

Similar to conventional passports, the linguistic passport also represents a structurally unequal and discriminatory regime that needs to be abolished. The software was framed as providing a technical solution to a political problem that intensifies the violence of borders. We need to shift to pose other questions as well. What do we want to listen to? How could we listen differently? How could we build a world in which nation-states and passports are abolished and the voice is not a passport but can be appreciated in its multiplicity, heteroglossia, and malleability? How do we want to live together on a planet increasingly becoming uninhabitable?

Featured Image: Voice Print Sample–Image from US NIST

Michelle Pfeifer is a postdoctoral fellow in Artificial Intelligence, Emerging Technologies, and Social Change at Technische Universität Dresden in the Chair of Digital Cultures and Societal Change. Their research is located at the intersections of (digital) media technology, migration and border studies, and gender and sexuality studies, and explores the role of media technology in the production of legal and political knowledge amidst struggles over mobility and movement(s) in postcolonial Europe. Michelle is writing a book titled Data on the Move: Voice, Algorithms, and Asylum in Digital Borderlands that analyses how state classifications of race, origin, and population are reformulated through the digital policing of constant global displacement.

REWIND! . . .If you liked this post, you may also dig:

“Hey Google, Talk Like Issa”: Black Voiced Digital Assistants and the Reshaping of Racial Labor–Golden Owens

Beyond the Every Day: Vocal Potential in AI Mediated Communication–Amina Abbas-Nazari

Voice as Ecology: Voice Donation, Materiality, Identity–Steph Ceraso

The Sound of What Becomes Possible: Language Politics and Jesse Chun’s 술래 SULLAE (2020)–Casey Mecija

The Sonic Roots of Surveillance Society: Intimacy, Mobility, and Radio–Kathleen Battles

Acousmatic Surveillance and Big Data–Robin James

“Hey Google, Talk Like Issa”: Black Voiced Digital Assistants and the Reshaping of Racial Labor

In summer 2021, sound artist, engineer, musician, and educator Johann Diedrick convened a panel at the intersection of racial bias, listening, and AI technology at Pioneerworks in Brooklyn, NY. Diedrick, 2021 Mozilla Creative Media award recipient and creator of such works as Dark Matters, is currently working on identifying the origins of racial bias in voice interface systems. Dark Matters, according to Squeaky Wheel, “exposes the absence of Black speech in the datasets used to train voice interface systems in consumer artificial intelligence products such as Alexa and Siri. Utilizing 3D modeling, sound, and storytelling, the project challenges our communities to grapple with racism and inequity through speech and the spoken word, and how AI systems underserve Black communities.” And now, he’s working with SO! as guest editor for this series (along with ed-in-chief JS!). It kicked off with Amina Abbas-Nazari’s post, helping us to understand how Speech AI systems operate from a very limiting set of assumptions about the human voice. Today, Golden Owens explores what happens when companies sell Black voices along with their Intelligent Virtual Assistants. Tune in for a deep historical dive into the racialized sound of servitude in America. Even though corporations aren’t trying to hear this absolutely critical information–or Black users in general–they better listen up. –JS


In October 2019, Google released an ad for their Google Assistant (GA), an intelligent virtual assistant (IVA) that initially debuted in 2016. As revealed by onscreen text and the video’s caption, the ad announced that the GA would soon have a new celebrity voice. The ten-second promotion includes a soundbite from this unseen celebrity—who states: “You can still call me your Google Assistant. Now I just sound extra fly”—followed by audio of the speaker’s laughter, a white screen, the GA logo, and a written question: “Can you guess who it is?”

Consumers quickly speculated about the person behind the voice, with many posting their guesses on Reddit. The earliest comments named Tiffany Haddish, Lizzo, and Issa Rae as prospects, with other users affirming these guesses. These women were considered the most popular contenders: two articles written about the new GA voice cited the Reddit post, with one calling these women Redditors’ most popular guesses and the other naming only them as users’ desired choices. Those who guessed Rae were proven correct. One day after the ad, Google released a longer promo revealing her as the GA’s new voice, including footage of Rae recording responses for the assistant. The ad ends with Rae repeating the “extra fly” line from the initial promo, smiling into the camera.

Google’s addition of Rae as an IVA voice option is one of several recent examples of Black people’s voices employed in this manner. Importantly, this trend toward Black-voiced IVAs deviates from the pre-established standard of these digital aides. While there are many voice options available, the default voices for IVAs are white female voices with flat dialects. This shift toward Black American voices is notable not only because of conversations about inclusion—with some Black users saying they feel more represented by these new voices—but because this influx of Black voices marks a spiritual return to the historical employment of Black people as service-providing, labor-performing entities in the United States, thus subliminally reinforcing historical biases about Black people as uniquely suited for performing this type of work.

Marketed as labor-saving devices, IVAs are programmed to assist with cooking and grocery shopping, transmit messages and reminders, and provide entertainment, among other tasks. Since the late 2010s they have also been able to operate other technologies within users’ homes: Alexa, for example, can control Roomba robotic vacuums; IVA-compatible smart plugs or smart home devices enable IVAs to control lights, locks, thermostats, and other such apparatuses. Behaviorally, IVAs are designed and expected to be on-call at all times, but not to speak or act out of turn—with programmers often directed to ensure these aides are relatable, reliable, trustworthy, and unobtrusive.

Round Grey Speaker On Brown Board, gadget, google assistant, google home (public domain)

Far from operating in a vacuum, IVAs eerily evoke the presence of and parameters set for enslaved workers and domestic servants in the U.S.—many of whom have historically been Black American women. Like IVAs, Black women servants cooked, cleaned, entertained children, and otherwise served their (predominantly white) employers, themselves operating as labor-saving devices through their performance of these labors. Employers similarly expected these women to be ever-available, occupy specific areas of the home, and obey all requests and demands—and were unsettled if not infuriated when maids did not behave according to their expectations.

White women being the default voices of IVAs has somewhat obfuscated the degree to which these aides have re-embodied and replaced the Black servants who once predominantly executed this work, but incorporating Black voices into these roles removes this veil, symbolically re-implementing Black people as labor-performing entities by having them operate as the virtual assistants who now perform much of the labor Black workers historically performed. Enabling Black people to be used as IVAs thus re-aligns Black beings with the performance of service and labor.

While Black women were far from the only demographic conscripted into domestic labor, by the 1920s they comprised a “permanent pool of servants” throughout the country, due largely to the egress of white American and immigrant women from domestic service into fields that excluded Black women (183). Black women’s prominence in domestic service was heavily reflected in early U.S. media, which overwhelmingly portrayed domestic servants not just as Black women, but as Black Mammies—domestic servant archetypes originally created to promote the myth that Black women “were contented, even happy as slaves.” Characters like Gone with the Wind’s “Mammy” pulled both from then-current associations of Black women with domestic labor and from white nostalgia for the Antebellum era, and specifically for the archetypal Mammy—marking Black women as idealized labor-performing domestics operating in service of white employers. These on-screen servants were “always around when the boss needed them…[and] always ready to lend a helping hand when times were tough” (36). Historian Donald Bogle dubbed this era of Hollywood the “Age of the Negro Servant,” referenced in this reel from the New York Times.

Cinema and television merely built from years of audible racism on the radio—America’s most prominent form of in-home entertainment in the first half of the 20th century—where Black actors also played largely servant and maid roles that demanded they speak in “distorted dialect, exaggerated intonation, rhythmic speech cadences, and particular musical instruments” in order to appear at all (143). This white-contrived portrayal of Black people is known as “Blackvoice,” and essentially functions as “the minstrel show boiled down to pure aurality” (14). These performances allowed familiar ideals of and narratives about Blackness to be communicated and recirculated on a national scale, even without the presence of Black bodies. Labor-performing Black characters like Beulah, Molasses and January, Aunt Jemima, and Amos and Andy were prominent in the Golden Age of Radio, all initially voiced by white actors. In fact, Aunt Jemima’s print advertising was just as dependent on stereotypical representations of her voice as it was on visual “Mammy” imagery.

Close up of Aunt Jemima advertising appearing in Woman’s Day in 1948.

When Black actors broke through white exclusion on the airwaves, many took over roles once voiced by white men and/or were forced by white radio producers and scriptwriters to “‘talk as white people believed Negroes talked’” so that white audiences could discern them as Black (371). This continuous realignment undoubtedly informs contemporary ideas of labor, labor performance, and laboring bodies, further promoted by the sudden influx of Black voice assistants in 2019.

Specifically, these similarities demonstrate that contemporary IVAs are intrinsically haunted by Black women slaves and servants: built in accordance with and thus inevitably evoking these laborers in their positioning, programming, and task performance. Further facilitating this alignment is the fact that advertisements for Black-voiced IVAs purposefully link well-known Black bodies with their Black voices. Excepting Apple’s Black-sounding voice options for Siri, all of the Black IVA voice options since 2019 have belonged to prominent Black American celebrities. Prior to Issa Rae, GA users could employ John Legend as their digital aide (April 2019 until March 2020). Samuel L. Jackson became the first celebrity voice option for Amazon’s Alexa in December 2019, followed by Shaquille O’Neal in July 2021.

The ads for Black-voiced IVAs thus link these disembodied aides not just to Black bodies, but to specific Black bodies as a sales tactic—bodies which signify particular images and embodiments of Blackness. The Samuel L. Jackson Alexa ad utilizes close-ups of Jackson recording lines for the IVA and of Echo speakers with Jackson’s voice emitting from them in response to users. John Legend is physically absent from the ad announcing him as the GA; however, his celebrity wife directs the GA to sing for her instead, after which she states that it is “just like the real John”—thus linking Legend’s body to the GA even without his onscreen presence. Amazon has even explicitly explored the connection between the Black-voiced IVA and the Black body, releasing a 2021 commercial called “Alexa’s Body” that saw Alexa voiced and physically embodied by Michael B. Jordan—with the main character in the commercial insinuating that he is the ideal vessel for Alexa.

By aligning these bodies with, and having them act as, labor-performing devices in service of consumers, these advertisements both re-align Blackness with labor and illuminate how these devices were always already haunted by laboring Black bodies—and especially, given the demographics of the bodies who most performed the types of labors IVAs now execute, laboring Black women’s bodies. That the majority of the Black celebrities employed as Black IVA voices are men suggests some awareness of and attempt to distance from this history and implicit haunting—an effort which itself exposes and illuminates the degree to which this haunting exists. 

In some cases, the Black people lending their voices to these IVAs also speak in a way that sonically suggests Blackness: Issa Rae’s “Now I just sound extra fly,” for example, incorporates Black American slang through the use of the word “fly.” As part of African American Vernacular English (AAVE), the term “fly” dates back to the 1970s and denotes coolness, attractiveness, and fashionableness. Because of its inclusion in Hip Hop, which has become the dominant music genre in the United States, the term, its meaning, and its racial origins are widely known amongst consumers. By using the word “fly,” Rae nods not only at these qualities but also at her own Blackness in a manner that is recognizable to a mainstream American audience. Due in part to Hip Hop’s popularity, U.S.-based media outlets, corporations, and individuals of varying races and ethnicities regularly appropriate AAVE and Black slang terms, often without regard for the culture that created them or the vernacular they stem from. The ad preceding Issa Rae’s revelation as the GA specifically invited users to align the voice with a celebrity body, and users’ predominant claims that the voice was a Black woman’s suggest that something about the voice conjured Blackness and the Black female body.

“Alexa Voice” by Stock Catalog, (CC BY 2.0)

This racial marking was also likely facilitated by how people naturally listen and respond to voices. As Nina Sun Eidsheim notes in The Race of Sound, “voices heard are ultimately identified, recognized, and named by listeners at large. In hearing a voice, one also brings forth a series of assumptions about the nature of voice” (12). This series of assumptions, Eidsheim asserts in “The Voice as Action,” is inflected by the “multisensory context” surrounding a given voice, i.e., “a composite of visual, textural, discursive, and other kinds of information” (9). While we imagine our impressions of voices as uniquely meaningful, “we cannot but perceive [them] through filters generated by our own preconceptions” (10). As a result, listening is never a neutral or truly objective practice.

For many consumers, these filters are informed by what Jennifer Lynn Stoever terms the sonic color line, “a socially constructed boundary that racially codes sonic phenomena such as vocal timbre, accents, and musical tones” (11). Where the racial color line allows white people to separate themselves from Black people on the basis of visual and behavioral differences, the sonic color line allows people “to construct and discern racial identities based on voices, sounds, and particular soundscapes” and to assign nonwhite voices with “differential cultural, social, and political value” (11). In the U.S., the sonic color line operates in tandem with the American listening ear, which “normalizes the aural tastes and standards of white elite masculinity as the singular way to interpret sonic information” (13)  and therefore marks-as-Other not only the voices and bodies of Black people, but also those of non-males and the non-elite.

Voice bubble from a 1940s print ad for Aunt Jemima Pancake mix: the sonic color line in sight and sound.

Ironically, the very listening practices which make consumers register particular voices and vocal qualities as Black also make Black voices inaccessible to Alexa and other IVAs. Scholarship on Automated Speech Recognition (ASR) systems and Speech AI observes that many Black users find it necessary to code-switch when speaking to IVAs, as the devices fail to comprehend their linguistic specificities. A study by Christina N. Harrington et al. in which Black elders used the Google Home to seek health information discovered that “participants felt that Google Home struggles to understand their communication style (e.g., diction or accent) and language (e.g., dialect) specifically due to the device being based on Standard English” (15). To address these struggles, participants switched to Standard American English (SAE), eliminating informal contractions and changing their tone and verbiage so that the GA would understand them. As one of the study’s participants states,

You do have to change your words. Yes. You do have to change your diction and yes, you have to use… It cannot be an exotic name or a name that’s out of the Caucasian round. …You have to be very clear with the English language. No ebonic (15).

This incomprehension extends to Black Americans of all ages, and to other IVAs. A study by Allison Koenecke et al. on ASR systems produced by Amazon, Google, IBM, Microsoft and Apple discovered that these entities had a harder time accurately transcribing Black speech than white speech, producing “an average word error rate (WER) of 0.35 for black speakers compared with 0.19 for white speakers” (7684). A study by Zion Mengesha et al. on the impact of these errors on Black Americans—which included participants from different regions with a range of ages, genders, socioeconomic backgrounds and education levels—discovered that many felt frustrated and othered by these mistakes, and felt further pressure to code-switch so that they would not be misunderstood. Koenecke et al. concluded that ASR systems could not understand the “phonological, phonetic, or prosodic characteristics of” AAVE (7687), and that this ignorance would make the use of these technologies more difficult for Black users—a sentiment that was echoed by participants in the study conducted by Mengesha et al., most of whom marked the technology as working better for white and/or SAE speakers (5).
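
These word error rate figures are, at bottom, edit distances: the number of word-level substitutions, insertions, and deletions needed to turn the system’s transcript into what the speaker actually said, divided by the length of the reference. A minimal sketch, using invented example sentences rather than data from either study, shows how a single dialect feature that a model “standardizes” away inflates the score:

```python
# Minimal word error rate (WER) computation: word-level Levenshtein distance
# divided by the length of the reference transcript. Illustrative only; the
# example sentences below are invented, not taken from Koenecke et al.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# An invented example of a system "standardizing" away AAVE features:
reference  = "he be working on that every day"   # what the speaker said
hypothesis = "he is working on that everyday"    # what the system transcribed
print(round(wer(reference, hypothesis), 2))       # 0.43: 2 substitutions + 1 deletion over 7 words
```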

The speech recognition errors these technologies demonstrate—which also extend to speakers in other racial and ethnic groups—illuminate the reality that despite including Black voices as IVAs, these assistant technologies are not truly built for Black people, or for any person that does not speak Standard American English. And where AAVE is largely associated with Blackness, SAE is predominantly associated with whiteness: as a dialect widely perceived to be “lacking any distinctly regional, ethnic, or socioeconomic characteristics,” it is recognized as being “spoken by the majority group or the socially advantaged group” in the United States—both groups which are solely or primarily composed of white people. SAE is so identified with whiteness that Black people who only speak Standard English are often told that they sound and/or “talk” white, and Black people who deliberately invoke SAE in professional and/or interracial settings (i.e., code switching) are described as “talking white” or using their “white voice” when doing so. That IVAs and other ASR systems have such trouble understanding AAVE and other non-standard English dialects suggests that these technologies were not designed to understand any dialect other than SAE—and thus, given SAE’s strong identification with whiteness, were designed specifically to assist, understand, and speak to white users.

Writing on this phenomenon as a woman with a non-standard accent, Sinduja Rangarajan highlights in “Hey Siri—Why Don’t You Understand More People Like Me?” that none of the IVAs currently on the market offer any American dialect that is not SAE. And while users can change their IVA’s accents, they are limited to Standard American, British, Irish, Australian, Indian, and South African—which Rangarajan rightly highlights as revealing who the IVAs think they are talking to, rather than who their user actually is. That most of these accents belong to Western, predominantly white countries (or to countries once colonized by white imperialists) strongly suggests that these devices are programmed to speak to—and perform labor for—white consumers specifically.

“Voice is Already Big”: Adobe Sayspring Founder Mark Christopher Webster Presents At Entrepreneurs Roundtable Accelerator Demo Day in April 2017 (CC BY-SA 4.0)

When considering the primary imagined and target users of IVAs, the sudden influx of Black-voiced IVAs becomes particularly insidious. Though they may indeed make some Black users feel more represented, cultivating this representation is merely a byproduct of their actual purpose. Because these technologies are not built for Black consumers, Black-voiced IVAs are meant to appeal not to Black users, but to white ones. Rae, Jackson, and the other Black celebrity voices may provide a much-needed variety in the types of voices applied to IVAs, but they primarily operate as “further examples of technology companies using Black voices to entertain white consumers while ignoring Black consumers.” Black-voiced assistants, after all, no better understand Black vernacular English than any of the other voice options for IVAs, a reality marking Black speech patterns as enjoyable but not legitimate.

By excluding Black consumers, the companies behind these IVAs insinuate that Blackness is only acceptable and worthy of consideration when operating in service of whiteness. Where Black people as consumers have been delegitimized and disregarded, Black voices as labor-saving assistants have been welcomed and deemed profitable—a reality which further emphasizes how historical constructions of Black people as labor-performing devices haunts these contemporary technologies. Tech companies reinforce historical positionings of white people as ideal consumers and Black people as consumable products—repeating historical demarcations of Blackness and whiteness in the present. 

In imagining the futures of IVAs, the companies behind them would need to reconsider how they interact—or fail to interact—with Black users. Both Samuel L. Jackson and Shaquille O’Neal, the last of the Black-celebrity-voiced IVAs still currently available to users, will be removed as Alexa voice options by September 2023, presenting an opportunity for these companies to divest. Whether or not the brands behind these IVAs take this initiative, consumers themselves can be critical of how AI technologies continue to reestablish hierarchical systems, of their own interactions with these devices, and of who these technologies are truly made for. In being critical, we can perhaps begin to envision alternative, reparative modes of AI technology—modes that serve and support more than one kind of user. 

Featured Image: Issa Rae gif from the 2017 Golden Globes

Golden Marie Owens is a PhD candidate in the Screen Cultures program at Northwestern University. Her research interests include representations of race and gender in American media and popular culture, artificial intelligence, and racialized sounds. Her doctoral dissertation, “Mechanical Maids: Digital Assistants, Domestic Spaces, and the Spectre(s) of Black Women’s Labor,” examines how intelligent virtual assistants such as Apple’s Siri and Amazon’s Alexa evoke and are haunted by Black women slaves, servants, and houseworkers in the United States. In her time at Northwestern, she has had internal fellowships through the Office of Fellowships and the Alice Kaplan Institute for the Humanities. She currently holds an MMUF Dissertation Grant through the Institute for Citizens and Scholars and a Ford Dissertation Fellowship through the National Academies of Sciences, Engineering, and Medicine.


REWIND! . . .If you liked this post, you may also dig:

Beyond the Every Day: Vocal Potential in AI Mediated Communication –Amina Abbas-Nazari 

Voice as Ecology: Voice Donation, Materiality, Identity–Steph Ceraso

Mr. and Mrs. Talking Machine: The Euphonia, the Phonograph, and the Gendering of Nineteenth Century Mechanical Speech – J. Martin Vest

Echo and the Chorus of Female Machines–AO Roberts

Black Excellence on the Airwaves: Nora Holt and the American Negro Artist Program–Chelsea Daniel and Samantha Ege

Spaces of Sounds: The Peoples of the African Diaspora and Protest in the United States–Vanessa Valdes

On Whiteness and Sound Studies–Gus Stadler

Young People and Information – A Manifesto

(Editor Alex Grech, Malta)

Young People and Information Manifesto

The State of Play with Online Information – The Issues We Want to Address

The manifesto is a primer for much-needed input and discussion among young people and the individuals and institutions whom young people perceive as being able to address issues relating to online information – and to implement improvements. Policymakers should read it, as should regulators, people working for technology firms, think tanks, and education institutions. The manifesto also calls for young people to take responsibility for the information they consume, create, and share online.

From the voices of the few can come change for many and for the generations to come.

MEDIA FREEDOMS

01 We are human. We are not data.
02 We have a socio-technical existence, and it is not for sale or exploitation.
03 We recognise that there is no such thing as free media. The price of an internet connection is not the only price we are paying to speak freely. The price of harvesting personal data for the benefit of third parties is rarely quantifiable.
04 We have the right to express ourselves freely but responsibly, and access information online without fear of censorship, surveillance, or harassment. We believe in the safeguarding of media freedoms, with a right to freedom of expression and to access information that is as free from bias as possible.
05 We believe journalism should be practised without fear or prejudice, irrespective of whether the journalist is employed by a mainstream media outlet, working as an independent investigative citizen journalist, or as a blogger. It is still possible for people on TikTok to do independent journalism.
06 We need to support citizen journalism and the role it plays in holding those in power accountable.

(if you want to read more, go to the pdf, downloadable here)

Beyond the Every Day: Vocal Potential in AI Mediated Communication 

In summer 2021, sound artist, engineer, musician, and educator Johann Diedrick convened a panel at the intersection of racial bias, listening, and AI technology at Pioneerworks in Brooklyn, NY. Diedrick, 2021 Mozilla Creative Media award recipient and creator of such works as Dark Matters, is currently working on identifying the origins of racial bias in voice interface systems. Dark Matters, according to Squeaky Wheel, “exposes the absence of Black speech in the datasets used to train voice interface systems in consumer artificial intelligence products such as Alexa and Siri. Utilizing 3D modeling, sound, and storytelling, the project challenges our communities to grapple with racism and inequity through speech and the spoken word, and how AI systems underserve Black communities.” And now, he’s working with SO! as guest editor for this series for Sounding Out! (along with ed-in-chief JS!). It starts today, with Amina Abbas-Nazari, helping us to understand how Speech AI systems operate from a very limiting set of assumptions about the human voice– are we training it, or is it actually training us?


“Hi, good morning. I’m calling in from Bangalore, India.” I’m talking on speakerphone to a man with an obvious Indian accent. He pauses. “Now I have enabled the accent translation,” he says. It’s the same person, but he sounds completely different: loud and slightly nasal, impossible to distinguish from the accents of my friends in Brooklyn.

The AI startup erasing call center worker accents: is it fighting bias – or perpetuating it? (Wilfred Chan, 24 August 2022)

This telephone interaction was recounted in The Guardian’s reporting on a Silicon Valley tech start-up called Sanas. The company provides AI-enabled technology for real-time voice modification, so that call centre workers’ voices sound more “Western”. The company describes this venture as a solution to improve communication between typically American callers and call centre workers, who might be based in countries such as the Philippines and India. Meanwhile, research has found that major companies’ AI interactive speech systems exhibit considerable racial imbalance when trying to recognise Black voices compared to white speakers. As a result, in the hopes of being better heard and understood, Google smart speaker users with regional or ethnic American accents relay that they find themselves contorting their mouths to imitate Midwestern American accents.

These instances describe racial biases present in voice interactions with AI-enabled and mediated communication systems, whereby sounding ‘Western’ entitles one to more efficient communication, better usability, or increased access to services. This is not a problem specific to AI, though. Linguistics researcher John Baugh, writing in 2002, describes how linguistic profiling is known to have resulted in housing being denied to people of colour in the US via telephone interactions. Jennifer Stoever‘s The Sonic Color Line (2016) presents a cultural and political history of the racialized body and how it both informed and was informed by emergent sound technologies. AI-mediated communication repeats and reinforces biases that pre-exist the technology itself, while also helping them become even more widely pervasive.

“pain” by Flickr user Pol Neiman (CC BY-NC-ND 2.0)

Mozilla’s commendable Common Voice project aims to ‘teach machines how real people speak’ by building an open-source, multi-language dataset of voices to improve usability for non-Western speaking or sounding voices. But singer and musicologist Nina Sun Eidsheim, in The Race of Sound (2019), describes how ‘a specific voice’s sonic potentiality [in] its execution can exceed imagination’ (7), and voices as having ‘an infinity of unrealised manifestations’ (8). Eidsheim’s sentiments describe a vocal potential, through musicality, that exists beyond ideas of accents and dialects, and vocal markers of categorised identity. As a practicing vocal performer, I recognise and resonate with Eidsheim’s ideas. I have a particular interest in extended and experimental vocality, especially gained through my time singing with Musarc Choir and working with artist Fani Parali. In these instances, I have experienced the pleasurable challenge of being asked to vocalise the mythical, animal, imagined, alien and otherworldly edges of the sonic sphere, to explore complex relations between bodies, ecologies, space and time, illuminated through vocal expression.

Joy by Flickr user François Karm, cropped by SO! (CC BY-NC 2.0)

Following from Eidsheim, and through my own vocal practice, I believe AI’s prerequisite of voices as “fixed, extractable, and measurable ‘sound object[s]’ located within the body” is over-simplistic and reductive. Voices, within systems of AI, are made to seem only as computable delineations of person, personality and identity, constrained to standardised stereotypes. By highlighting vocal potential, I offer a unique critique of the way voices are currently comprehended in AI recognition systems. When we appreciate the voice beyond the homogenous, we give it authority and autonomy, ultimately leading to a fuller understanding of the voice and its sounding capabilities.

My current PhD research, Speculative Voicing, applies thinking about the voice from a musical perspective to the sound and sounding of voices in artificially intelligent conversational systems. Here, the voice becomes an instrument of the body used to explore its sonic materiality, vocal potential and extremities of expression, rather than being comprehended in conjunction with vocal markers of identity aligning to categories of race, gender, age, etc. In turn, this opens space for the voice to be understood as a shapeshifting, morphing and malleable entity, with immense sounding potential beyond what might be considered ordinary or everyday speech. Over the long term, this opens up discussion of how experimenting with vocal potential may illuminate more diverse perspectives about our sense of self and being in relation to vocal sounding.

Vocal and movement artist Elaine Mitchener exhibits the dissolution of the voice as ‘fixed’ perfectly in her performance of Christian Marclay’s No!, which I attended one hot summer’s evening at the London Contemporary Music Festival in 2022. Marclay’s graphic score uses cut-outs from comic book strips to direct the performer to vocalise a myriad of ‘No’s.

In connection with Fraenkel Gallery’s 2021 exhibition, experimental vocalist Elaine Mitchener performs Christian Marclay’s graphic score, “No!” Image by author.

Mitchener’s rendering of the piece involved the cooperation and coordination of her entire body, carefully crafting lips, teeth, tongue, muscles and ligaments to construct each iteration of ‘No.’ Each transmutation of Mitchener’s ‘No’s came with a distinct meaning, context, and significance, contained within the vocalisation of this one simple syllable. Every utterance explored a new vocal potential, enabled by her body alone. In the context of AI-mediated communication, this way of working with the voice renders the idea of the voice as ‘fixed’ redundant. Mitchener’s vocal potential demonstrates that voices can and do exist beyond AI’s prescribed comprehension of vocal sounding.

In order to further understand how AI transcribes understandings of voice onto notions of identity, and vocal potential, I produced the practice project Polyphonic Embodiment(s) as part of my PhD research, in collaboration with Nestor Pestana, with AI development by Sitraka Rakotoniaina. The AI we created for this project is based upon a speech-to-face recognition AI that aims to tell what your face looks like from the sound of your voice. The prospective impact of this AI is deeply unsettling, as its intended applications are wide-ranging – from entertainment to security – and, as previously described, AI recognition systems are inherently biased.
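
To make the kind of system being discussed concrete: a speech-to-face model can be pictured as a voice encoder that compresses a recording into an embedding, plus a decoder that maps that embedding to a face-like image. The following minimal PyTorch sketch is purely illustrative and is not the Polyphonic Embodiment(s) code; every module name, layer size, and tensor shape is an assumption made for this example.

```python
# Illustrative sketch only (not the project's actual model): the broad shape
# of a speech-to-face pipeline -- an encoder producing an embedding from a
# spectrogram, and a decoder producing a face-like image from that embedding.
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over the time dimension
            nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, n_mels, time) -> (batch, embed_dim)
        return self.net(spectrogram)

class FaceDecoder(nn.Module):
    def __init__(self, embed_dim: int = 256, image_size: int = 64):
        super().__init__()
        self.image_size = image_size
        self.net = nn.Linear(embed_dim, 3 * image_size * image_size)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        # embedding: (batch, embed_dim) -> (batch, 3, image_size, image_size)
        img = torch.sigmoid(self.net(embedding))
        return img.view(-1, 3, self.image_size, self.image_size)

# Any modification of the input voice (pitch, timbre, a DIY device altering
# the mouth) changes the embedding, and therefore the "face" the decoder returns.
encoder, decoder = VoiceEncoder(), FaceDecoder()
fake_spectrogram = torch.randn(1, 80, 200)    # stand-in for a real recording
face = decoder(encoder(fake_spectrogram))     # tensor of shape (1, 3, 64, 64)
```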

Still from project video for Polyphonic Embodiment(s). Image by author.

This multi-modal form of comprehending voice is also a hot topic of research at major research institutions, including Oxford University and the Massachusetts Institute of Technology. We wanted to explore this AI recognition programme in conjunction with an understanding of vocal potential and the voice as a sonic material shaped by the body. As the project title suggests, the work invites people to consider the multi-dimensional nature of voice and vocal identity from an embodied standpoint. Additionally, it calls for contemplation of the relationships between voice and identity, and of individuals having multiple or evolving versions of identity. The collaboration with the custom-made AI software creates a feedback loop to reflect on how people’s vocal sounding is “seen” by AI, to contest the way voices are currently heard, comprehended and utilised by AI, and indeed the AI industry.

The video documentation for this project shows ‘facial’ images produced by the voice-to-face recognition AI when activated by my voice, modified with simple DIY voice devices. Each new voice variation, created by each device, produces a different outputted face image. Some images perhaps resemble my face (e.g. Device #8), some might be considered more masculine (e.g. Device #10), and some are just disconcerting (e.g. Device #4). The speculative nature of Polyphonic Embodiment(s) is not to suggest that people should modify their voices in interaction with AI communication systems. Rather, the simple devices work with bodily architecture and exaggerate its materiality, considering it as a flexible instrument to explore vocal potential. In turn, this sheds light on the normative assumptions contained within AI’s readings of voice and its relationships to facial image and identity construction.

Through this artistic, practice-led research I hope to evolve and augment discussion around how the sounding of voices is comprehended by different disciplines of research. Taking a standpoint from music and design practice, I believe this can contest ways of working in the realms of AI mediated communication and shape the ways we understand notions of (vocal) identity: as complex, fluid, malleable, and ultimately not reducible to Western logics of sounding.

Featured Image: Still image from Polyphonic Embodiments, courtesy of author.

— 

Amina Abbas-Nazari is a practicing speculative designer, researcher, and vocal performer. Amina has researched the voice in conjunction with emerging technology, through practice, since 2008 and is now completing a PhD in the School of Communication at the Royal College of Art, focusing on the sound and sounding of voices in artificially intelligent conversational systems. She has presented her work at the London Design Festival, Design Museum, Barbican Centre, V&A, Milan Furniture Fair, Venice Architecture Biennale, the Critical Media Lab in Switzerland, Litost Gallery in Prague, and Harvard University in the United States. She has performed internationally with choirs and regularly collaborates with artists as an experimental vocalist.


REWIND! . . .If you liked this post, you may also dig:

What is a Voice?–Alexis Deighton MacIntyre

Voice as Ecology: Voice Donation, Materiality, Identity-Steph Ceraso

Mr. and Mrs. Talking Machine: The Euphonia, the Phonograph, and the Gendering of Nineteenth Century Mechanical Speech – J. Martin Vest

One Scream is All it Takes: Voice Activated Personal Safety, Audio Surveillance, and Gender Violence–María Edurne Zuazu

Echo and the Chorus of Female Machines–AO Roberts

On Sound and Pleasure: Meditations on the Human Voice–Yvon Bonenfant

Beyond the Every Day: Vocal Potential in AI Mediated Communication 

In summer 2021, sound artist, engineer, musician, and educator Johann Diedrick convened a panel at the intersection of racial bias, listening, and AI technology at Pioneerworks in Brooklyn, NY. Diedrick, 2021 Mozilla Creative Media award recipient and creator of such works as Dark Matters, is currently working on identifying the origins of racial bias in voice interface systems. Dark Matters, according to Squeaky Wheel, “exposes the absence of Black speech in the datasets used to train voice interface systems in consumer artificial intelligence products such as Alexa and Siri. Utilizing 3D modeling, sound, and storytelling, the project challenges our communities to grapple with racism and inequity through speech and the spoken word, and how AI systems underserve Black communities.” And now, he’s working with SO! as guest editor for this series for Sounding Out! (along with ed-in-chief JS!). It starts today, with Amina Abbas-Nazari, helping us to understand how Speech AI systems operate from a very limiting set of assumptions about the human voice– are we training it, or is it actually training us?


“Hi, good morning. I’m calling in from Bangalore, India.” I’m talking on speakerphone to a man with an obvious Indian accent. He pauses. “Now I have enabled the accent translation,” he says. It’s the same person, but he sounds completely different: loud and slightly nasal, impossible to distinguish from the accents of my friends in Brooklyn.

The AI startup erasing call center worker accents: is it fighting bias – or perpetuating it? (Wilfred Chan, 24 August 2022)

This telephone interaction was recounted in The Guardian’s reporting on a Silicon Valley tech start-up called Sanas. The company provides AI enabled technology for real-time voice modification, so that call centre workers’ voices sound more “Western”. The company describes this venture as a solution to improve communication between typically American callers and call centre workers, who might be based in countries such as the Philippines and India. Meanwhile, research has found that major companies’ AI interactive speech systems exhibit considerable racial imbalance, recognising Black voices far less reliably than those of white speakers. As a result, in the hopes of being better heard and understood, Google smart speaker users with regional or ethnic American accents report that they find themselves contorting their mouths to imitate Midwestern American accents.

These instances describe racial biases present in voice interactions with AI enabled and mediated communication systems, whereby sounding ‘Western’ entitles one to more efficient communication, better usability, or increased access to services. This is not a problem specific to AI, though. Linguistics researcher John Baugh, writing in 2002, describes how linguistic profiling is known to have resulted in housing being denied to people of colour in the US via telephone interactions. Jennifer Stoever‘s The Sonic Color Line (2016) presents a cultural and political history of the racialized body and how it both informed and was informed by emergent sound technologies. AI mediated communication repeats and reinforces biases that pre-exist the technology itself, while also helping them become even more widely pervasive.

“pain” by Flickr user Pol Neiman (CC BY-NC-ND 2.0)

Mozilla’s commendable Common Voice project aims to ‘teach machines how real people speak’ by building an open-source, multi-language dataset of voices to improve usability for non-Western speaking or sounding voices. But singer and musicologist Nina Sun Eidsheim, in The Race of Sound (2019), describes how ‘a specific voice’s sonic potentiality [in] its execution can exceed imagination’ (7), and voices as having ‘an infinity of unrealised manifestations’ (8). Eidsheim’s sentiments describe a vocal potential, through musicality, that exists beyond ideas of accents and dialects, and vocal markers of categorised identity. As a practicing vocal performer, I recognise and resonate with Eidsheim’s ideas. I have a particular interest in extended and experimental vocality, especially gained through my time singing with Musarc Choir and working with artist Fani Parali. In these instances, I have experienced the pleasurable challenge of being asked to vocalise the mythical, animal, imagined, alien and otherworldly edges of the sonic sphere, to explore complex relations between bodies, ecologies, space and time, illuminated through vocal expression.

Joy by Flickr user François Karm, cropped by SO! (CC BY-NC 2.0)

Following from Eidsheim, and through my own vocal practice, I believe AI’s premise of voices as “fixed, extractable, and measurable ‘sound object[s]’ located within the body” is over-simplistic and reductive. Within systems of AI, voices are made to seem like nothing more than computable delineations of person, personality and identity, constrained to standardised stereotypes. By highlighting vocal potential, I offer a critique of the way voices are currently comprehended in AI recognition systems. When we appreciate the voice beyond the homogenous, we give it authority and autonomy, ultimately leading to a fuller understanding of the voice and its sounding capabilities.

My current PhD research, Speculative Voicing, applies thinking about the voice from a musical perspective to the sound and sounding of voices in artificially intelligent conversational systems. Here, the voice becomes an instrument of the body, used to explore its sonic materiality, vocal potential and extremities of expression, rather than being comprehended in conjunction with vocal markers of identity aligned to categories of race, gender, age, etc. In turn, this opens space for the voice to be understood as a shapeshifting, morphing and malleable entity, with immense sounding potential beyond what might be considered ordinary or everyday speech. In the longer term, this opens up discussion of how experimenting with vocal potential may illuminate more diverse perspectives about our sense of self and being in relation to vocal sounding.

Vocal and movement artist Elaine Mitchener perfectly exhibits the dissolution of the voice as ‘fixed’ in her performance of Christian Marclay’s No!, which I attended one hot summer’s evening at the London Contemporary Music Festival in 2022. Marclay’s graphic score uses cut-outs from comic book strips to direct the performer to vocalise a myriad of ‘No’s.

In connection with Fraenkel Gallery’s 2021 exhibition, experimental vocalist Elaine Mitchener performs Christian Marclay’s graphic score, “No!” Image by author.

Mitchener’s rendering of the piece involved the cooperation and coordination of her entire body, carefully crafting lips, teeth, tongue, muscles and ligaments to construct each iteration of ‘No.’ Each transmutation of Mitchener’s ‘No’s’ came with a distinct meaning, context, and significance, contained within the vocalisation of this one simple syllable. Every utterance explored a new vocal potential, enabled by her body alone. In the context of AI mediated communication, this way of working with the voice renders the idea of the voice as ‘fixed’ redundant. Mitchener’s vocal potential demonstrates that voices can and do exist beyond AI’s prescribed comprehension of vocal sounding.

In order to further understand how AI transcribes understandings of voice onto notions of identity and vocal potential, I produced the practice project Polyphonic Embodiment(s) as part of my PhD research, in collaboration with Nestor Pestana and with AI development by Sitraka Rakotoniaina. The AI we created for this project is based upon a speech-to-face recognition AI that aims to tell what your face looks like from the sound of your voice. The prospective impact of this AI is deeply unsettling: its intended applications are wide-ranging, from entertainment to security, and, as previously described, AI recognition systems are inherently biased.

Still from project video for Polyphonic Embodiment(s). Image by author.

This multi-modal form of comprehending voice is also a hot topic of research at major institutions including Oxford University and the Massachusetts Institute of Technology. We wanted to explore this AI recognition programme in conjunction with an understanding of vocal potential and of the voice as a sonic material shaped by the body. As the project title suggests, the work invites people to consider the multi-dimensional nature of voice and vocal identity from an embodied standpoint. Additionally, it calls for contemplation of the relationships between voice and identity, and of individuals having multiple or evolving versions of identity. The collaboration with the custom-made AI software creates a feedback loop for reflecting on how people’s vocal sounding is “seen” by AI, and for contesting the way voices are currently heard, comprehended and utilised by AI, and indeed by the AI industry.
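As a rough illustration of the general shape of such voice-to-face systems, the sketch below is a minimal, hypothetical Python/PyTorch example: an audio encoder compresses a voice spectrogram into an embedding, and an image decoder turns that embedding into a small face-like image. It is untrained toy code with illustrative layer sizes and a 64×64 output, not the model built for Polyphonic Embodiment(s).

```python
# Hypothetical sketch of a voice-to-face pipeline: spectrogram -> embedding -> image.
# Untrained, illustrative sizes; not the project's actual model.
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Maps a (batch, 1, mel_bins, frames) spectrogram to a 128-d voice embedding."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, spec):
        return self.fc(self.conv(spec).flatten(1))

class FaceDecoder(nn.Module):
    """Maps a voice embedding to a 3 x 64 x 64 'face' image."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(embed_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.deconv(self.fc(z).view(-1, 128, 8, 8))

# One forward pass with random "audio": different voice inputs yield different images,
# which is all the artwork's feedback loop needs in order to make its point.
spec = torch.randn(1, 1, 80, 200)            # stand-in for a mel spectrogram
face = FaceDecoder()(VoiceEncoder()(spec))   # shape: (1, 3, 64, 64)
print(face.shape)
```

Even at this toy scale, the structure makes the project’s point visible: whatever correlations such a model learns between voice and face can only be those inscribed in its training data.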

The video documentation for this project shows ‘facial’ images produced by the voice-to-face recognition AI when activated by my voice, modified with simple DIY voice devices. Each new voice variation, created by each device, produces a different outputted face image. Some images perhaps resemble my face (e.g. Device #8), some might be considered more masculine (e.g. Device #10), and some are just disconcerting (e.g. Device #4). The speculative nature of Polyphonic Embodiment(s) is not meant to suggest that people should modify their voices in interaction with AI communication systems. Rather, the simple devices work with bodily architecture and exaggerate its materiality, treating it as a flexible instrument with which to explore vocal potential. In turn, this sheds light on the normative assumptions contained within AI’s readings of voice and its relationships to facial image and identity construction.

Through this artistic, practice-led research I hope to evolve and augment discussion around how the sounding of voices is comprehended by different disciplines of research. Taking a standpoint from music and design practice, I believe this can contest ways of working in the realms of AI mediated communication and shape the ways we understand notions of (vocal) identity: as complex, fluid, malleable, and ultimately not reducible to Western logics of sounding.

Featured Image: Still image from Polyphonic Embodiment(s), courtesy of author.

— 

Amina Abbas-Nazari is a practicing speculative designer, researcher, and vocal performer. Amina has researched the voice in conjunction with emerging technology, through practice, since 2008 and is now completing a PhD in the School of Communication at the Royal College of Art, focusing on the sound and sounding of voices in artificially intelligent conversational systems. She has presented her work at the London Design Festival, Design Museum, Barbican Centre, V&A, Milan Furniture Fair, Venice Architecture Biennale, Critical Media Lab (Switzerland), Litost Gallery (Prague) and Harvard University (USA). She has performed internationally with choirs and regularly collaborates with artists as an experimental vocalist.


REWIND! . . . If you liked this post, you may also dig:

What is a Voice? – Alexis Deighton MacIntyre

Voice as Ecology: Voice Donation, Materiality, Identity – Steph Ceraso

Mr. and Mrs. Talking Machine: The Euphonia, the Phonograph, and the Gendering of Nineteenth Century Mechanical Speech – J. Martin Vest

One Scream is All it Takes: Voice Activated Personal Safety, Audio Surveillance, and Gender Violence – María Edurne Zuazu

Echo and the Chorus of Female Machines – AO Roberts

On Sound and Pleasure: Meditations on the Human Voice – Yvon Bonenfant

Bruce Sterling on the Art of Text-to-Image Generative AI

Authorized transcript of Bruce Sterling’s lecture during the TU Eindhoven conference AI for All, From the Dark Side to the Light, November 25, 2022, at Evoluon, Eindhoven, co-organized by Next Nature. Website of the event: https://www.tue.nl/en/our-university/calendar-and-events/25-11-2022-ai-for-all-from-the-dark-side-to-the-light. YouTube link of the talk: https://www.youtube.com/watch?v=UB461avEKnQ&t=3325s

It’s nice to be back in Eindhoven, a literal city of light in a technological world. I am here to discuss one of my favourite topics: artificial intelligence. The Difference Engine is a book that my colleague William Gibson and I wrote 33 years ago. The narrator of this book happens to be an artificial intelligence because we were cyberpunks at the time.

At the time we were talking to people in the press and they said: you science fiction writers like to write about computers, what if a computer started writing your novels? This was supposed to be some kind of existential threat to us. But we really liked computers. We had no fear of them, so we thought, oh, that might be amusing… why don’t we imitate a computer writing a novel? And this is the result. The book is still in print.

Here is the source of the problem: the infamous 1956 Dartmouth conference, where ambitious computer scientists from the first decade of the computer science field gathered. They decided that since they were working on thinking machines, they should take this idea seriously and try to invent some machines and systems that could actually think. With computer science, they were going to launch an imperialistic war on metaphysics, philosophy and psychology, and establish whether software really is thought, whether thought can be abstracted, and whether there are rules for talking about intelligence. At the time they wrote some nice manifestos about it. I read them, even though I was only two years old when the event happened. Our novel is halfway between this old-school artificial intelligence and today’s AI. As a long-term fan of this rather tragic branch of computer science, I can say 2022 has been the wildest year that artificial intelligence ever had. This is the first time there’s been a genuine popular craze about it. I’m going to spend the rest of my 45 minutes and 48 slides trying to tell you what the hell I think is going on.

This is artificial intelligence, the business side of it. If you lump in everything that could plausibly be called artificial intelligence, the old-school rules-based software and then the statistical style of artificial intelligence, this is all of it, and it’s pretty big. And it has never taken over anything completely. There are just areas where it applies to various sectors: aerospace, finance, pipeline stuff, data access. It’s very big. It’s a mind map from FirstMark venture capital, by Matt Turck. You’re going to be neck-deep in this. I’m not going to talk about all that; I’m going to get around to text-to-image generators. But that is generative AI, not old-school artificial intelligence, and not machine learning and deep learning in general.

I am here to talk about text-to-image generators. This is the generative AI that went wild this year; this is the generative application landscape. It is a subset of AI, a subset even of machine learning, but it is where all the heat and light are coming from right now. People are just going nuts about it.

And these are some of the technical platforms that support it: AI, machine learning, deep learning. You can see this; just look these guys up. I could spend all day gossiping about them. Some are up, and some are down. Haul out your phones, take pictures of this, get your pinky fingers going and look them up on the Internet. There’s a heck of a lot going on here.

And then these are the visual guys, the actual text-to-image generation outfits: platforms, companies, start-ups, most of them young, some of them younger than February 2022. It’s a small army of these outfits coming out of the labs and schools, working in garages, dropping out of companies, scaring up venture capital. It’s a wild scene.

Here are the platforms that are supporting them at the moment. Practically none of these companies are making any money. What they’re busy doing is trying to muscle up, beef up their platforms, and find some applications for these breakthroughs they’re having. The platforms in red are the open-source platforms, and these are the closest thing to AI for all that anybody has ever had. You can fire these up, you can look at them on the Web, you can download them from GitHub. Computer science breakthroughs are never going to be for all people. As you can see, they started back in 2021 and picked up steam in a major fashion.

These are some of the little businesses, the startups. Nine out of ten of these guys are going to die. This is not the future; this is not an overpowering wave. These are all startups, and even the ones that survive are probably going to get acquired. Look on the Internet, chase them down, follow them on social media, and read their white papers.

So what are they doing?

I’m going to talk about artwork because I have a problem here. I happen to be the art director of a technology art festival in Turin, Italy, which is where I flew from to be with you today. We know that we’re going to be getting a lot of AI art, so we may as well do an event on AI art. We’ve got to figure out what’s good and what we want to show the public of Turin; I’ve got to make aesthetic and cultural decisions about what matters. Hundreds of thousands of users have appeared in a matter of mere months, and they’ve generated literally millions of images. There are a quarter of a billion, 250 million, images on these services. You just offer them a prompt and tell them what to do. They generate stuff very quickly.

This image happens to be Amsterdam-centric; I am messing with the Amsterdam imagery there.

You can do very elaborate kind of swirly arabesque stuff.

You could do fantastic unearthly landscapes that look like black and white photography.

You can mess about with architecture or do strange 3D geometric stuff.

You could do pretty girls; those are always guaranteed to sell. There are megatons of pretty girls that have been generated, probably more pretty girls in the past year than in the entire history of pretty-girl art, because you can just do it: literally press a button and have a hundred pretty girls.

Fantasy landscapes, odd-looking 3D gamer set stuff. You can just put in the word utopia and it will build you utopias. No two will be the same.

The utopia prompt. One could do utopias all day, all night. Do you like the green ones? No, you like the blue ones. You know, it’s happy. You don’t have to say bad utopia, good utopia. There is just an endless supply. Basically infinite. I mean, it’s not infinite, because all these images are JPEGs. They’re not really paintings, not really photographs; they’re conjugations of JPEGs. What you’re seeing is something like a 256-by-256 grid of JPEG pixels, and, statistically, there are only so many ways you can vary the colour in a grid of 256 by 256. But these systems know how to do that.
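To put a number on that intuition, here is a quick back-of-the-envelope calculation in Python; the 256-by-256 size and 8-bit RGB colour depth are illustrative assumptions, not a description of any particular generator.

```python
import math

# How many distinct images can a 256 x 256 grid of 8-bit RGB pixels hold?
pixels = 256 * 256                            # 65,536 pixels in the grid
channels = 3                                  # red, green, blue: 256 levels each
log10_images = pixels * channels * math.log10(256)

# Roughly 10^473,000 possible images: a finite number, but far beyond
# anything that could ever be enumerated or stored.
print(f"possible images ~ 10^{log10_images:,.0f}")
```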

This is a Refik Anadol, who’s from Istanbul and works out of LA. He’s the world’s only truly famous artificial intelligence artist. He has been touring the world for the past four or five years doing these epically large motion graphics, mostly on buildings, using the databases of the people who hire him. Take everything in your files, turn it into an artificial intelligence landscape, broadcast it on your building, and generate a lot of traffic. If you happen to be a museum director and Refik Anadol shows up with one of his projected shows, you’d better have the museum store ready to go, because they’re going to be lined up around the block for Refik. He will give you all the artificial intelligence that you can eat, on time and under budget. And the public flat out loves it.

Let’s explain how all this works technically. What does a text-to-image generator look like if you try to take a picture of it? This would be its selfie.

What you see here is a Google tensor chip. It happens to be version three, which is already obsolete; they’re threatening to roll out number four. It’s going to be like Moore’s Law, but heavier. If you’re Google DeepMind and you’re doing AlphaZero and you’re going to beat every other chess player in the world, you’re just going to wipe the floor with all the old-school chess-playing computers, you need these babies. About 5,000 of them. Slot them in and train them on chess. Tell them nothing except the rules, and they’ll beat every other chess machine ever invented. You need 5,000 of these racked up. It’s not going to come cheap, and it will take a lot of voltage. What I don’t have here are nice, homely literary metaphors like ‘cyberspace’.

It’s like, you’ve got all these wires and all these protocols and all these messages flying around at random from node to node; you can try to understand the routing systems and the naming systems and so forth, or you can just say ‘cyberspace’. It is a metaphor, because there really is no cyberspace. All there is are wires and storage units and web browsers, built on top of colossal stacks of interacting protocols.

The police and the military really like ‘cyberspace’, and that’s not going away; it was a successful coinage. So, what’s a generator? How does it generate? This happens to be Stable Diffusion, one of the better-known generators among many similar ones. They’re not all built the same; there are different architectures. You don’t just have one machine, you’ve actually got several different ones. This kind of artificial intelligence is about deep learning, neural networks running on chips, each one of them separate; they don’t trade information with one another. On one end you’ve got the one that interprets the text. It just looks at typed text. It doesn’t read books; it literally reads alphanumeric characters, ASCII, and it breaks them up into powder. It doesn’t even look at the words so much as at the phonemes, and at the statistical probabilities of them affecting other phonemes. This has been typical of AI-style machine translation for a while, but now they’ve gotten really quite good at it. So it’s looking at whatever command it’s given and breaking that up into a kind of probabilistic dust, points in a vector space. Think of it as a sifting machine for something like flour: you put in the white flour, which has rocks and other unnecessary things in it, and you sieve back and forth. Put the words in and break the words up into little pieces of probability.

It then passes them to an image generator. In the next stage, it tries to come up with a rough consensus of what the image might be, at postage-stamp size: a little beginning, a hint as to what this image might become. After having sifted that around until it’s got a rough kind of consensus, it passes it to a second part of the system, which doesn’t concern itself with words at all. It just takes the earlier image and tries to focus it, tighten it, brighten it and make it bigger. That one passes its own version of the image to yet another one, which is bigger and more focused on prettification: it expands the image onto a bigger scale, fits it into a particular format, polishes it up, makes it look like a camera photo or a painting or a blueprint. The three of them don’t intercommunicate; they’re three separate sieves. And then the last one there is the autoencoder-decoder, which functions as an editor-publisher, and it looks at what’s come through all of this.

Pretty refined, but most of it is rubbish, nonsense. So it throws things out the window like an impatient editor, or gets rid of bad paintings like an angry gallerist, statistically comparing images to a database it has of successful paintings: this one’s obviously chaos, that one might pass. It selects a few out of the great many that have been generated, edits them down to just a handful, and turns them into actual JPEGs to present to the viewer on the website. It is astonishingly complicated, and amazing that such a whopper-jawed thing works at all. And where it came from is not text-to-image generation, but image-to-text recognition. What happened here?
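The stages described here map roughly onto the components of the open-source Stable Diffusion release, which anyone can inspect. Below is a minimal sketch using the Hugging Face diffusers library; the model id, precision and step count are illustrative assumptions, not the speaker’s setup.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the publicly released Stable Diffusion weights (illustrative model id).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The separate "sieves" are visible as distinct sub-models inside the pipeline:
print(type(pipe.tokenizer))     # splits the prompt into sub-word tokens
print(type(pipe.text_encoder))  # CLIP text encoder: tokens become embedding vectors
print(type(pipe.unet))          # denoising U-Net: refines a small noisy latent, step by step
print(type(pipe.vae))           # autoencoder: decodes the finished latent into pixels

# One prompt in, one polished image out.
image = pipe("utopia", num_inference_steps=30).images[0]
image.save("utopia.png")
```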

Several years ago, Facebook and Google tried using computers to identify what was in photographs and JPEGs. They were looking for your face, or trying to identify consumer items: basic surveillance-capitalism procedures. And then one of the engineers said: okay, we can show a photograph to our machine and it will name what’s in it; what happens if we just give it the name and ask it to produce the photograph, literally turning the box upside down? What they got was deep dreaming, a hallucinatory mess. It just didn’t make any sense. But they worked on it and refined it to some extent. It is still a crude and whopper-jawed thing: it’s literally as if I turned a recycling machine on its ear and could put in broken glass and get out Greek vases. Nobody expected this. I don’t know anybody in computer science who ever predicted the existence of a text-to-image generator. It’s just one of those bizarre lines of technical development.
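That “upside down” move can be sketched in a few lines. The toy example below uses a stock torchvision classifier and simply nudges the pixels of a noise image so that one chosen label scores higher, which is the germ of the deep-dreaming effect mentioned here; the model, target class and step count are arbitrary choices for illustration.

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights
from torchvision.utils import save_image

# A pretrained recogniser, held fixed; only the input pixels will change.
model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from noise
target_class = 980                                        # "volcano" in ImageNet

optimizer = torch.optim.Adam([image], lr=0.05)
for step in range(100):
    optimizer.zero_grad()
    score = model(image)[0, target_class]   # how "volcano-like" the classifier finds the image
    (-score).backward()                     # gradient *ascent* on that class score
    optimizer.step()
    image.data.clamp_(0, 1)                 # keep pixels in a displayable range

save_image(image, "dreamed.png")            # the result is abstract texture, not a photo
```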

You do something as simple as turning the thing upside down, and an entire industry hops out of Pandora’s box. It’s really a funny and wild thing. So, what’s wrong with it? That’s what I’m going to go into now. Or rather, to be fairer to the technology: what are its innate characteristics? What is the grain of the material? What is it good at doing, and what is it not so good at doing? If you were an art director or a museum curator, how would you judge what is a good output as opposed to the everyday stuff? There are 250 million of these images, somebody’s got to do this work, and I’m trying to help here. This is the basic problem with all forms of generative AI: they’re not normal. They don’t fulfil the aspirations of the founders of AI. They have no common sense.

Here is the ‘healthy boy eating broken glass for breakfast’ result. He looks like a really happy kid. Ask an artist to draw a child gleefully eating broken glass and that’s a horror image. But since this machine lacks any common-sense understanding, it doesn’t know what glass is. It doesn’t know what a boy is, doesn’t know what breakfast is. It’s the very opposite of an Isaac Asimov robot: no idea about possible harm. If you look at this and ask where the ethics are, well, it’s just some sieves turning text into image. It’s a Rube Goldberg machine built on a huge database connecting every possible character string to every JPEG pixel on the Internet. It’s just a balancing act between all the text on the Internet and all the images on the Internet, the Common Crawl. If you look for the AI intelligence in there, it’s like: where are the rules?

Where are the decisions? Where’s the common sense? There’s not a trace of them, not one trace. It’s just a series of photos produced by sophisticated filters, connected by equations; they’re not even wired together. It’s fantastic what they can figure out, and they have zero common sense. That’s not even in the textbook. They don’t care. They don’t compete with anything; they don’t have to. These AIs don’t have ears. They don’t have photographs. They don’t have paintings. They have a statistical relationship between text and clumps of JPEG pixels.

I heard early on from users who were trying to put their prompts into these machines that they weren’t very good at hands.

It’s like, why are they not good at hands?

You know, a hand is one of the most common things on the Internet, there are millions and millions of pictures of them.

It just doesn’t understand the geometry and doesn’t know what three dimensions look like. It knows what a picture of a hand looks like.

This is a prompt in the Dutch language. It doesn’t know what a hand looks like in three dimensions. It doesn’t have a hand. It has no skin.

Count to five. It can’t count to five. Why? It does not draw. It does not photograph. It only generates.

How about the oldest hands ever drawn? Can’t do it.

How about a foot? Can you compare a foot with a hand? No.

Right. It just sits there, generating: taking its clouds of pixels, its little probabilities, putting in a little bit of chaos, shaking it down, dropping sponges full of little coloured pixels, kind of paddling along. And it doesn’t stop in the middle of its generation.

What if I ask it to imitate a human drawing a hand? This is one of the most impressive images that I have seen from an image generator. It is unearthly. If you look carefully you will see that the paper the artist is working on is not square. I love the coffee cup. These are not mistakes. This is the actual grain of its compositional process. And there is a beauty to it. It is not a human beauty. It is a striking image that no human being could ever have dreamt of. It really has presence; it’s surreal.

For the machines that we built, this is their realism. This is what they actually ‘see’ when they are comparing the word ‘hand’ to the most probable JPEGs of hands. And if you think of hands and how fluid they are… we don’t even have a vocabulary for all the positions we can make with our hands. We’re used to them, but we don’t talk about them very effectively. This vocabulary is not in the database, because people never described hands with enough fidelity for them to be accurately rendered by a probabilistic engine.

Eventually, they’ll crack the hand problem.

And then, when you input something, they’ll just call on the thing that makes the hands, and it’ll kind of rush in from the side, powder up the hands quickly, and then retreat back. Okay, yeah.

When these systems are more refined, they won’t make these elementary errors, but they’re not errors.

This is graph paper, and you would think graph paper would be the simplest thing for a computer to do. The computer screen itself is like a graph, right? But if you look carefully, there are thousands of tiny probabilistic mistakes in these lines.

They are more obvious when you ask it to do a checkerboard. If you ask it to draw a black square, white square, black square, it gets confused. It starts doing checkers and then gets lost. Even if you ask it to draw a black and white tile floor, it loses track of where black and white are supposed to go, how many squares there are, and how they sit in 3D space, even though there are thousands of photographs of such tile floors online.

Now I’m actually going to do some creative experimentation of my own. Being a novelist, I don’t just want to give it orders to make the world’s prettiest picture. Instead, I want to see what it can say about things that humans can’t draw. What will it produce if I ask it to draw something that is beyond human capacity to draw?

For instance, the unimaginable. But the unimaginable is an oxymoron, right? I mean, you can’t draw something you can’t imagine. This thing will draw the unimaginable in a hot second.

The Undreamed-of. Stable Diffusion doesn’t care. It is perfectly happy.

The Impossibility. These are not like expressive artworks, like a Van Gogh. We’re seeing things here that humans can’t make.

Intense fascination? It doesn’t have any. It doesn’t have emotions.

The obsessive compulsion.

The self-referential.

The shocking surprise. It cannot be shockingly surprised and instead is parodying us being surprised.

The lysergic hallucination. People have a horror of going insane; computers aren’t supposed to be able to do that. But it has no trouble whatsoever with psychedelics. It can spin them out by the square kilometre.

The unthinkable. That is, images of humans being unable to think the unthinkable. It is never able to think. It will always come up with some answer.

The utterly forgotten.

In the industry, people are particularly interested in what’s called extension, or outpainting. You feed it a Hokusai, then you ask what’s just beyond the corner of the painting, and it will add something onto it. Does this look like Hokusai? Yeah, I’ve got lots of this stuff. How about a cherry tree? This excites graphic artists: it’s like, I got a free cherry tree. They don’t recognise that this thing will effortlessly extend and stretch out forever in the direction of infinite cherry trees: leftover samurai, ninjas, drums, ideograms, whatever; Heian Japan; weird Tales of Genji. It’ll regurgitate that for as long as the current is flowing through it, just indefinitely. We’ll never have screens big enough to show it all. It will never get tired of generating pastiches like this, on any scale, at any fidelity, quickly, cheaply, without ever acquiring any common sense and without ever getting tired. It will grind these probabilistic connections and spew this stuff out. There is zero creative effort in this. It does take a lot of voltage.
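Outpainting of this kind is available in the same open-source tooling: pad the canvas, mask the blank region, and let an inpainting model fill it in. The sketch below uses the diffusers inpainting pipeline; the model id, file names, sizes and prompt are illustrative assumptions.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load a publicly released inpainting model (illustrative model id).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# Paste the original picture onto the left half of a wider, empty canvas.
original = Image.open("hokusai_crop.png").convert("RGB").resize((512, 512))
canvas = Image.new("RGB", (1024, 512))
canvas.paste(original, (0, 0))

# White marks the region the model is allowed to invent: the empty right half.
mask = Image.new("L", (1024, 512), 0)
mask.paste(255, (512, 0, 1024, 512))

extended = pipe(
    prompt="cherry trees in the style of a Hokusai woodblock print",
    image=canvas,
    mask_image=mask,
    width=1024,
    height=512,
).images[0]
extended.save("hokusai_extended.png")
```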

This happens to be a Max Ernst from the 1930s entitled Europe after the Rain. Ernst did a number of these generative experiments. First he went out with his canvases and rubbed pencils on them to get suggestive forms, and then he would paint over them. Later he decided he’d just take the paint itself and toss it onto the canvas, stomp on it, then open it up like a Rorschach blot and paint over it. He’s a world-class surrealist, so what he got from this smashed-up paint was not random but suggestive kinds of imagery. He did a series of these surrealist paintings, which are some of his most successful ones. They are nearly 100 years old now, and they never look like anything else. Eventually the novelty tired Ernst. He did a number of these gimmicks and came to feel they were kind of beneath him; he had got all the benefit out of this particular trick that he was likely to get, and he moved into a different phase of his expressive career. So it’s not as if generative techniques have never entered the fine art world before.

This is Meret Oppenheim’s Breakfast in Fur, which will never be looked at the same way again. And this is something that troubles me. From now on, if you show this to anyone who is unaware of this famous artwork, almost a hundred years old, they will immediately conclude that it was generated. They will never look at it again and think: what a cool, surreal thing. It’s like she took a teacup and wrapped it up in gazelle fur. And look, she even wrapped up the spoon. And you know what? You can’t even drink out of that teacup. Think of putting tea in there, picking it up and feeling that fur in your mouth. Ooh, ooh. What a surrealist frisson. Boy, that’s super weird. Such an artist, this Meret Oppenheim. Such a form of human expression. We may have opened Pandora’s box and slammed the gate on our heritage.

There’s a quote by Simone Weil: “The beauty of the world is the mouth of a labyrinth.” It’s a warning: if you’re interested in aesthetics, if you have to curate stuff or happen to be the art director of a festival (like I am), you can’t just pick the pretty ones. The beauty of the world is the mouth of a labyrinth. Once you start taking aesthetics seriously, you enter metaphysics. The machine comes up with these labyrinths, and if you look at them, they’re statistically likely portraits of labyrinths that are not, in fact, labyrinths. A labyrinth has a place where a human goes in, and then the human is supposed to get bewildered. He takes a lot of false steps, he makes a lot of mistakes, he has to retreat often. But eventually there’s a way out the other side. He comes out and says: oh, what a cool experience. I was in the labyrinth. I thought I’d never get out, but when I did, I was really happy to, like, defeat this puzzle. The machine doesn’t know what a puzzle is. It doesn’t know what legs are. It’s just looking at all the labyrinth data that it has, which is very extensive, and it draws something that looks like a labyrinth but isn’t. And yet it’s beautiful, a beauty which is not of this world.

That’s what beauty is. The beauty of the world is the mouth of a labyrinth. This is beauty, which is not of this world and cannot be judged by the standards of beauty that we had earlier. But I know that this labyrinth is my doom. I don’t know how long I’m going to have to put up with this. I’ve been in the labyrinth of artificial intelligence since I first heard of it. I’m not too surprised that there’s suddenly a whole host of labyrinths. Thousands of them. I don’t mind. I know it’s trouble, but it’s kind of a good trouble. I don’t mind living there. I’ll build a house in the labyrinth. I’ll put a museum in it. You’re not going to stop me. I’m happy to accept the challenge. I hope you’ll have a look at it.

Transcription: Amberscript. Editing: Geert Lovink