
Our own ChatGPT implementation, Audience app | Nils - Week 9/10/11/12

During week 9, I started working on implementing our custom ChatGPT. This included communication with the GPT-4 Large Language Model, as well as the recently released DALL-E 3 image generation model. OpenAI also offers Text-to-speech (TTS) and Speech-to-text (STT) APIs. For research purposes, and to explore possible directions we can take with our concept, we decided to implement these functionalities as well.


GPT-3.5 Turbo / GPT-4

Ideate, Prototype

Initially, I used GPT-3.5 Turbo, which is the most widely used LLM (Large Language Model), but also technologically outdated compared to GPT-4. I decided to switch to GPT-4, since it is more capable of providing accurate and nuanced responses. This is important, because the play requires the AI to stay in its role, and understanding a prompt and intelligently providing a fitting answer contributes to the immersiveness of the performance. However, for testing the functionality itself rather than the quality of the output, I will use GPT-3.5 Turbo, since GPT-4 is much more expensive. As you can see in the image below, I used $0.67 of funds in 2 days with GPT-4, compared to only $0.10 in nearly a whole month of development. That is roughly 7x the cost, not even taking into account the usage time of GPT-4.
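For illustration, below is a minimal sketch of how switching between the two models could look with the OpenAI Python library (v1.x); the system prompt and question are placeholder examples, not the actual prompts used in the play.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "gpt-3.5-turbo" for cheap functional testing, "gpt-4" for the actual performance
MODEL = "gpt-3.5-turbo"

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are the AI character in a theatre play. Stay in your role."},
        {"role": "user", "content": "What do you think of the audience tonight?"},
    ],
)
print(response.choices[0].message.content)
```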


DALL-E 3

Ideate, Prototype

On November 6th, the OpenAI DevDay took place. At this event, the newest image generation model, DALL-E 3, the successor to DALL-E 2, was made available through the API. The image quality of the new model is much better and offers more detailed and photo-realistic output. A downside is the cost per generated image, but for one night of theatre, that cost is still only €1 at most, since images will only be generated occasionally. The cost also depends on the resolution of the output image.
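As an illustration, a DALL-E 3 request through the OpenAI Python library could look roughly like this; the prompt and resolution are placeholder values, and the resolution is what mainly drives the cost per image.

```python
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt="A photo-realistic wide shot of an empty theatre stage lit by a single spotlight",
    size="1024x1024",  # higher resolutions such as 1792x1024 cost more per image
    n=1,
)
print(result.data[0].url)  # temporary URL pointing to the generated image
```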

An image generated by DALL-E 3.

Text-to-speech (TTS)

Define, Prototype, Testing

The text-to-speech API allows us to give more personality to the AI. Rather than just seeing text appear on the screen, the audience is able to hear "the voice of the AI". This feature is not something we will definitely use in the final prototype, but it is valuable to explore the technology, both for the concept and for my own personal development.
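A minimal sketch of calling the TTS endpoint with the OpenAI Python library; the model, voice, and sentence are placeholder choices for illustration.

```python
from openai import OpenAI

client = OpenAI()

speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",  # one of OpenAI's preset voices; custom voices are not available
    input="Good evening. I am the voice you will be talking to tonight.",
)
speech.stream_to_file("ai_voice.mp3")  # save the audio so the application can play it back
```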


During the testing phase, the clients asked whether it was possible to train the model on recordings of their own voices. However, OpenAI does not offer this functionality at the moment, regardless of whether it is even feasible to do at a large scale.


Speech-to-Text (STT) - Whisper

Ideate, Prototype, Testing

Speech-to-text allows us to conveniently use so-called "shotgun microphones", which can be pointed at someone sitting in the audience. Their voice input is then recognized by the STT algorithm, converted to text, and placed in the ChatGPT input field. This makes it much easier to have engaging audience interaction without a theatre employee having to type all audience input themselves.
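A minimal sketch of transcribing a recorded audio file with the Whisper API via the OpenAI Python library; the file name is a placeholder for whatever the microphone recording is saved as.

```python
from openai import OpenAI

client = OpenAI()

# "audience_question.wav" stands in for the audio recorded from the shotgun microphone
with open("audience_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)  # this text is what ends up in the ChatGPT input field
```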


During development, I encountered an issue where Whisper would only work the first time the "enable microphone" button was clicked. After a lot of debugging, I found that it was not able to overwrite the previous audio input file. Instead, it would simply ignore any command to create a new file and pass the old file to Whisper instead. More information on these custom nodes is given in the Custom Unreal Engine nodes section below.


ChatGPT output streaming

Testing, Prototype, Empathize, Define

During our time in Oslo, the theatre provided us with a professional actress, which enabled us to test the MetaHumans. While running those tests, we noticed that the eyes of the MetaHuman also followed the movement of the actress's eyes. It was too noticeable to disregard, so I researched ways to keep the eye movement to a minimum. To achieve that, the text should stay in the same place on the screen. However, at that moment, all of the ChatGPT output was shown at once, rather than word for word.


What we wanted is already implemented in the web version of ChatGPT: there, the words appear with a short delay between each one, which gives the illusion that the AI is typing out the text. Using the OpenAI API and its documentation, I was able to implement this in the proof of concept. It still needed some modification on my end, because the data also had to be sent to Unreal Engine, which caused some difficulties.
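A rough sketch of how this streaming looks with the OpenAI Python library: with stream=True, the answer arrives in small chunks that can be forwarded one by one (in our case, towards Unreal Engine) instead of waiting for the full response.

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Introduce yourself to the audience."}],
    stream=True,  # the answer arrives as a sequence of small chunks
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        # in the proof of concept, each piece of text is forwarded to Unreal Engine here
        print(delta, end="", flush=True)
```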


I decided to try to combine this with the TTS algorithm, to give the AI even more personality. Doing this in a realistic way is challenging. At the moment, there is a 0.32 second delay between each word of output. No matter the length of the word, whether that is "hello" or "floccinaucinihilipilification", the delay is always 0.32 seconds. To synchronize the word streaming realistically with the text-to-speech output, that delay would need to be dynamic. Currently, the quality of the word streaming and TTS is sufficient, and in my opinion it is not necessary to spend much more time on this topic.
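To illustrate what a dynamic delay could look like, here is a small sketch that scales the pause with the word length instead of using the fixed 0.32 seconds; the scaling factor is an arbitrary example value, not something we tuned.

```python
import time

FIXED_DELAY = 0.32  # current behaviour: the same pause after every word


def dynamic_delay(word: str, seconds_per_character: float = 0.06) -> float:
    """Hypothetical delay that grows with word length, so "hello" pauses for less
    time than "floccinaucinihilipilification" and stays closer to the TTS pacing."""
    return max(0.15, len(word) * seconds_per_character)


for word in "The performance will begin shortly".split():
    print(word, end=" ", flush=True)
    time.sleep(dynamic_delay(word))  # swap in time.sleep(FIXED_DELAY) for the current behaviour
```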


Below, you can see a demonstration of the word streaming, as well as Text-to-speech.



Custom Unreal Engine nodes

Prototype

To fix the earlier-mentioned bug, I made a custom Blueprint node. It is something I had never done before, so it is a useful skill to learn to become more proficient in Unreal Engine. Implementing custom functionality in Blueprints opens up many more possibilities, since the functionality we want is not always available as a plugin or as base engine functionality.


Prototype audience interactivity app


To test whether the audience could interact with the theatre play, I created a relatively simple app which enabled the user to click a button to "raise" or "lower" their hand. Eventually, this was supposed to evolve into a wearable that would send that hand-state data to the Raspberry Pi server, but we never ended up continuing with that concept because the project took a different direction.

In the image above, you can see the data received by the TCP server. Every time a person in the audience raises or lowers their hand, a status message is logged with the number of hands raised.

In this image, a testing UI is visible with buttons for changing the hand state, input fields for the server IP address and port, as well as a log to check whether or not the device connected successfully to the server.
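For context, a minimal single-client sketch of the kind of TCP server that could produce such a log; the port number and the "RAISE"/"LOWER" message format are assumptions for illustration, not the exact protocol of the prototype.

```python
import socket

HOST, PORT = "0.0.0.0", 9000  # placeholder port; the app enters the Raspberry Pi's IP and port
hands_raised = 0

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((HOST, PORT))
    server.listen()
    conn, addr = server.accept()
    with conn:
        print(f"Device connected from {addr}")
        while True:
            data = conn.recv(1024)
            if not data:
                break
            message = data.decode().strip()  # assumed format: "RAISE" or "LOWER"
            hands_raised += 1 if message == "RAISE" else -1
            print(f"Status: {hands_raised} hands raised")
```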


Unfortunately, this concept never made it into actual development of the feature.


