Research Presentation by Alberto Ricca a.k.a. Bienoise
Transcription
Alberto Ricca:
Thanks a lot, I'm Alberto Ricca, also called “Bienoise”. I'm an independent sound researcher, but mainly a laptop composer. I like to say laptop composer without any kind of, you know, pride in that. That's because my main interest is in everyday technologies: how they can be used creatively, of course, and also how they silently shape our perception of the world.
So the research I am presenting today is called “Whose voice is your voice?”. It stemmed from a call for “Seismograf”, and I built a paper around a question I had about the audio compression by AI technologies that is happening today. I think I need to preface a couple of things to be perfectly clear about the research, because I understand it's probably quite technical. So, let's say a couple of things.
Audio compression is completely widespread and well known by pretty much everybody. Everybody uses MP3s; we have been using them since the '90s. And I built my thinking upon the late Jonathan Sterne, who was a brilliant researcher in the history of music, or better, the history of sound. He's the author of “The Audible Past” and of “MP3: The Meaning of a Format”. I build on both of those books, that research and that collection of ideas, because I think Jonathan asked himself some very deep questions about sound.
The first question he asked was: are these technologies transparent? Of course, the answer is no. Are those technologies innocent? Here, too, the answer is no. Talking about “The Audible Past”, talking about recordings of voices, of sounds, of instruments, he notes that the alienation of an artist's musical product is often not an unexpected byproduct of those technologies, but exactly the reason why so much money and time has been spent on developing them.
To be clear: if I can record a blues singer and then sell the records, I can make money upon the voice, the ability, the talent of that person without giving them much. The transparency is also a complete illusion, because we know very well how noisy those audio interfaces are. We are continuously asked to ignore the noise of a vinyl record, the world outside of the record itself, any kind of distortion, any kind of glitch if we talk about digital audio.
And this makes those technologies completely not transparent. This is the basis upon which I built my questions. Just to level us all out on this, and so that it's not just me talking but also the audio talking, I want to show you (show is not a good word, but it's impossible to escape this metaphor) what those sounds are.
I will not play you the crackle of a vinyl record; I mean, you know that sound. Let's talk about MP3s. The MP3 was the first widely adopted technology for compressing audio. It allowed us to share, but most of all to sell, audio easily.
What happens in an MP3? The whole spectrum, the totality of what we hear, is “scissored out” to reduce the amount of data that we have to send via the internet. What happens is quite brutal to the sound: even when we have the impression of complete clarity and complete fidelity, it's just an illusion.
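To make the “scissoring” concrete, here is a minimal sketch in Python. It is not the actual MP3 psychoacoustic model, just a crude stand-in for the idea: keep only the loudest spectral components of each frame and throw the rest away. The input file name and the keep-the-top-10% threshold are illustrative assumptions.

```python
# Crude stand-in for perceptual "scissoring" (not the real MP3 model,
# which uses psychoacoustic masking). Requires numpy, librosa, soundfile.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("input.wav", sr=None)   # hypothetical input file
S = librosa.stft(y)                          # frame-by-frame spectrum
mag = np.abs(S)

# Keep only the loudest ~10% of bins in each frame; zero out the rest.
threshold = np.quantile(mag, 0.90, axis=0, keepdims=True)
S_cut = np.where(mag >= threshold, S, 0.0)

y_out = librosa.istft(S_cut, length=len(y))
sf.write("scissored.wav", y_out, sr)
```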
It can be “good enough”, as Jonathan Sterne, I think, very well said. Good enough to be sold, good enough to be listened to on Spotify. It's completely not good enough if what you're searching for is complete fidelity to the original sound. We will see how this becomes important today, but let's just listen to a couple of things. For example, this kind of sound.
This is a scene from Moon (2009), where the protagonist is taking a shower. Of course, when we watch the movie, we're asked to ignore all the glitches that the compression is adding to the audio of the water. It becomes, in a way, transparent, because we don't care, but they are there and we're just asked to ignore them.
I like to quote Mark Fisher on this (of course we all like to quote Mark Fisher on this), because he noted very well that MP3 audio compression was the last technological advancement in audio, yet it was not a new instrument. It was a way, a faster way, to share sound.
So we moved away from creating stuff to selling stuff, or to sharing stuff. I'd like to say that it is just for sharing, but of course it's focused on selling sounds. I also believe he completely ignored Auto-Tune, but whatever. In a way, the MP3 is both a Lo-Fi technology (a Lo-Fi approach to audio that cuts something away) and also a glitch technology. I just showed you this.
I would like to let you hear a snippet from a composition I made for an album of mine of very Lo-Fi MP3s. It's on Mille Plateaux, and it's called “Most Beautiful Design”. Of course, I am referring to MP3s and floppy disks, two technologies built upon sharing, or we can say “hacked” into sharing.
So, of course, once you start using a technology to make art, to make music, every part of it becomes part of the output. In this way, MP3s are both Lo-Fi and glitch: usually Lo-Fi is more analog and glitch is more digital, but both approaches share the same attraction towards anti-capitalism and, in a way, anti-technologism. It's a paradox, because of course we are using a technology, often a very new technology, just in a different way.
But both Lo-Fi and glitch say: okay, we don't like how things sound normally. We don't like how the industry is using our sound, so we want to hack it. We want to bring out the worst of it. And it's usually a search that strives towards authenticity. That's very dear to me, because if I imagine people watching movies and hearing those glitches in every movie they watch, because Netflix is compressing, because streaming services are compressing, because Spotify is compressing, then I'm sure that those tiny glitches are in everyone's ears all day.
So they're really important, and nobody cares about them. That's why I love them: I find in them, in a way, a very comforting authenticity. And that was my research until 2023. Then I started noticing something new.
The new thing is that there are many different services, and I like the word services because they're not tools, they're services. Allow me to be a bit philosophical on this, but I feel that services are tools that inherit the sins of their creators; in this way, they're not transparent at all.
So, I started noticing that many services use AI to polish, to clean and to compress audio. What's the result of this? The first thing I noticed is that it's quite difficult to send sounds that are not a human voice speaking through WhatsApp, for example. And the result is something like this... These are, or maybe you can just try to imagine that they are, birds and a river close to my house. Complete silence, pretty much; everything is cut.
So if I want to send you an audio saying “Oh, listen to these beautiful birds, listen to the soundscape in this place”, I pretty much cannot. Of course I can: I can take an audio recorder, record it properly and send it to you, but how many people have the sensibility to do that?
Of course it's faster to just use WhatsApp, and that's what I care about. Or, even more interesting, this is a natural gas station outside Bologna. It sounds like somebody frantically repeating something. And that's what triggered my research. I started noticing those two things: the complete blanking out of, let's say, the more-than-human sounds, and an effort from the algorithm to find some kind of human sound in whatever happened, to make it resemble a human sound.
So I started wondering how I could find some deeper truth about this. I read some papers and found out that, for example, Google Meet, which is probably the most used web-conferencing application, uses Lyra, a very efficient algorithm to compress and clean audio.
I started working on Google Meet because it was interesting to be able to work in real time, and I used two different strategies. I had to, because even if the paper was clear on how the algorithm works, when you start working with AI you notice there is a bit of a black-box problem: you can access the way the algorithm works, but you don't really know how it has been trained. You don't have the corpus that the algorithm studied on. So you need to put some inputs in, see what the algorithm shits out, and start making comparisons to find something out. I did this, and I found out that, yeah, it's true: Lyra pretty much listens to what you say, transcribes it, and then re-synthesizes it.
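A minimal sketch of this probing strategy, assuming the round trip through the service has already been captured as audio files (for example by recording the far end of a call). The file names are illustrative, and the comparison is a simple band-by-band level difference, not anything taken from the Lyra paper.

```python
# Black-box probe: compare what went into the system with what came out.
# "probe_in.wav" / "probe_out.wav" are illustrative names for a captured
# round trip (e.g. a recording of the far end of a call).
import numpy as np
import librosa

x, sr = librosa.load("probe_in.wav", sr=16000)
y, _ = librosa.load("probe_out.wav", sr=16000)
n = min(len(x), len(y))        # crude alignment by truncation

X = np.abs(librosa.stft(x[:n]))
Y = np.abs(librosa.stft(y[:n]))

# Average level difference per frequency band: what the system removed
# (negative values) or invented (positive values).
diff_db = librosa.amplitude_to_db(Y + 1e-8) - librosa.amplitude_to_db(X + 1e-8)
freqs = librosa.fft_frequencies(sr=sr)
for f, d in zip(freqs[::64], diff_db.mean(axis=1)[::64]):
    print(f"{f:7.0f} Hz: {d:+6.1f} dB")
```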
And that's the reason why my research is called “Whose voice is your voice?”: because your voice is no longer your voice. It's a resynthesized simulacrum of your voice. And I wrote a small poem about it, because it's not just “Whose voice is your voice?” but also “Whose ears are your ears?”, since the way the algorithm listens to and interprets sounds is not a 1-to-1 transmission of data, as MP3 was, but an audio model made in a laboratory of the “standard” human hearing the “standard” sounds.
And the final question of the poem is: “What is the world that I remember?” I ask myself this because, if we believe R. Murray Schafer, the famous composer and acoustic ecologist, that sounds create the world, then what is the world that we are creating with the sounds that these algorithms hear? It seems very abstract, but I feel it has some reality in it. And I will try to demonstrate it, again with two different strategies, two probes; I take this idea of probes for investigating black boxes from Gabriele de Seta. It's a very apt metaphor for me.
The first one is feedbacking them: sending them a signal, then taking the output and putting it back into the system. It's a common practice in audio; it probably started with Alvin Lucier and his amazing composition “I Am Sitting in a Room”. In a way, I call this “I'm streaming in a room”. What happens when you make a system feed back into itself, of course, is that slowly, or not so slowly, the characteristics of the system come to the surface. You start losing the exact shape of what you put in, and you just get the sound of the room, the colors of the room.
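A minimal sketch of the feedback procedure. The real probe plays audio into a Meet call and records the far end; here that round trip is stubbed, purely as an illustration, with a low-bitrate Opus encode/decode via ffmpeg, and the starting file name is an assumption.

```python
# "I'm streaming in a room": pass a sound through the system, feed the
# output back in, repeat, until the system's own character surfaces.
# codec_round_trip() stands in for one pass through the real black box
# (e.g. the far end of a Meet call); here it is stubbed with ffmpeg/Opus.
import subprocess

def codec_round_trip(src: str, dst: str, bitrate: str = "6k") -> None:
    subprocess.run(["ffmpeg", "-y", "-i", src, "-c:a", "libopus",
                    "-b:a", bitrate, "tmp.opus"], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", "tmp.opus", dst], check=True)

current = "voice.wav"                  # illustrative starting material
for generation in range(50):           # each pass erodes the original a bit
    nxt = f"gen_{generation:03d}.wav"
    codec_round_trip(current, nxt)
    current = nxt
```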
What happens if I put my voice into it? It takes about 15 minutes; I'm not subjecting you to the whole thing, so let's go a bit further. I find this particularly interesting because it's not becoming noisy. It's quite clear that it's still my voice, but I never said those words. I mean, it's complete gibberish, but I never said anything like that. This is interesting in a way, but it's even more interesting 15 minutes later: a cute little song. And the funny thing is what happens if I do the same experiment with completely different material: a recording of a lake, which I chose because it has quite an articulation, an envelope, that is quite similar to a human voice, but it's completely different, of course, as it doesn't say anything.
Again, a very strong vocal quality. The only reason this happens, for me, is that those algorithms (and I'm saying something very vague now, I think) are built to just transmit a clear voice with a Western accent. I mean, I don't have time, and we don't have the listening conditions, to hear all the accent examples I had.
But I can assure you that a woman speaking Chinese is very badly represented by Opus or Lyra, while a man speaking proper English is very well represented. Another very interesting thing is that if you start screaming, or speaking softly, or showing very deep emotion, those are aberrations for these algorithms, and you get cut in some way. Kids in the background are cut, for example; they're not human for the algorithm. The message is very clear: please just use corporate speech and do business, also via WhatsApp. Don't send music.

And this is the bad news. The good news is that, of course, as with Lo-Fi and glitch, we can use this to make music. For example, in this one (I will probably play it a couple of times, because it's difficult to hear if you're not listening for this exact thing) it's fascinating how, listening to a lake, the algorithm found some music. This is amazing because it's usually not trained on music, but it can probably transmit singing, and that's what it heard in this lake. So yeah, it becomes quite weird, but it's quite productive if we start searching for what I have tried to call “metahuman sounds”: still human, of course, as we built this, but quite outside of our control. It's very difficult to imagine what you will get from these experiments, but you get something.
And if you start working on it, endlessly and very patiently, you can try to get some music from it. I'd like to finish with just a snippet of new music made with this approach. It's an extract from an unfinished training of an algorithm that is trained on anything else. So what happened? It is probably just two seconds of music in an ocean of noise. But I felt it was good music.
Thanks to Becoming Press, thanks to Mille Plateaux.
Claire Elise:
I just noticed, for example, that you might expect it to have a tendency to produce noise, but what I understood there is that there's a tendency towards producing rhythms. The algorithm seems to churn out rhythmic content rather than noisy content.
Alberto Ricca:
Often, yes. I feel that this is due to a certain window of processing... I didn't go deep on this, but I noticed that, for example, Google Meet has a certain, you know, groove.
Roberto Alonso Trillo:
That was great. I will explain later why it reminds me of a project that we made a while ago with Generative Adversarial Networks (GANs), back in the day when they were trending, before transformers. And the question is: what does the algorithm do? Because I think it does tend to converge, over time, towards a median of the spectrum of human voices that it has been trained on.
And then maybe the rhythmic element has to do with onset detection or something; it's trying to look for the beginnings of consonants or something like that. So it would be cool to maybe look into that, to see how the process of “feedbacking” into it reveals the underpinning bias that was always there.
Alberto Ricca:
Yeah, of course. What they do, I think, is twofold, as I understood it from the papers. It's probably a combination of sound cleaning [as in noise-removal algorithms], which may or may not be AI-based (today it probably is, because it's more efficient in a way).

And the other use of AI is the reconstruction of sound. That's why you usually start hearing the “scissoring”, the complete elimination of anything above a certain threshold. I mean, this is basic sonic cleaning, and the sound is clearer if you talk and use the technology as it is expected to be used: you notice that the audio the other person gets is clearer. The background noise has been cut. Your voice is probably sharper. It's probably even a bit compressed.
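As a sketch of that first, non-AI half, here is a classic spectral noise gate; the generative reconstruction half cannot be reproduced this simply. The file name and the assumption that the first second of the recording contains only background noise are illustrative.

```python
# Classic (non-AI) sonic cleaning: a spectral noise gate. Estimate a noise
# profile from a stretch assumed to be speech-free, then mute every
# time-frequency bin that does not rise clearly above that floor.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("call.wav", sr=None)   # illustrative file name
S = librosa.stft(y)                         # hop length defaults to 512
mag = np.abs(S)

noise_profile = mag[:, : sr // 512].mean(axis=1, keepdims=True)  # first ~1 s
gate = mag > 3.0 * noise_profile            # keep bins well above the floor

y_clean = librosa.istft(S * gate, length=len(y))
sf.write("call_gated.wav", y_clean, sr)
```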
And then you start feeling that something has been heard and reconstructed. I feel that there is a combination of pre-training on a lot of voices and then real-time training on your voice. The result changes if you don't speak and just try to feed back the system, or if you speak a bit first and then feed back the system, because by then it has started learning your voice.
I can give you an example of this that I find very interesting. I also tried to put the algorithm in crisis by using applause, which is a bit like a voice: I mean, it changes a lot, rapidly, and it has a lot of onsets. And I was talking underneath, but very softly. The result is that you will hear my voice louder, and you will also hear, again, some gibberish that I never said.
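One way to follow up the onset idea raised above with this applause material, sketched under the same black-box assumptions as before: compare the density of detected onsets before and after the round trip. The file names are illustrative.

```python
# Does the round trip through the codec change how rhythmic the material
# is? Compare detected onsets per second before and after.
import librosa

for name in ("applause_in.wav", "applause_out.wav"):   # illustrative names
    y, sr = librosa.load(name, sr=None)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    print(f"{name}: {len(onsets) / (len(y) / sr):.2f} onsets per second")
```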