
Stable Diffusion, MetaHumans (-Maya export), OpenAI - Week 5/6 - Nils

This week, I spent my time researching how to export a MetaHuman character from Unreal Engine (or rather, Quixel Bridge) to Maya. There was a lot of troubleshooting to do, which included installing, deleting and reinstalling software and waiting for downloads and exports, but eventually I got it working, and we now have a workflow for importing MetaHumans into Maya.


A fellow studio member recommended the AI model Stable Diffusion, so I started researching it and built a prototype to see whether it can be a viable tool for our concept. I will elaborate on Stable Diffusion later in this post. As with the Quixel Bridge - Maya export connection, this setup took quite some troubleshooting and configuring before it worked in Unity.


I also researched how we should go about implementing text-to-speech and translation APIs, for which there are several options, so there are trade-offs to weigh in terms of financial feasibility as well as technical capabilities.


Stable Diffusion

Prototype, Ideate

It took a few installs & uninstalls to get Stable Diffusion working on the system, but eventually we could generate AI art, using the powerful GPU in the computer.

The following command

python scripts/txt2img.py --prompt "the planet Jupiter with its moons Callisto and Europa, a spaceship flying close to the atmosphere in combat with a TIE fighter, hyperrealistic, unreal engine, cinematic lighting, ultra realistic illustration" --plms --n_iter 1 --n_samples 1

gave the image below as a result. It is neither very realistic nor accurate when compared to the prompt that was given, but the result can be improved by adjusting the parameters and working on prompt quality.

However, a downside of this generation method is that it takes some time, around 50 seconds for this image, to output the result. We are not sure if that is an acceptable wait time, or if it impacts the immersiveness too much.


A possible improvement could be using different AI models, as well as Python libraries that improve generation speed and efficiency; a rough sketch of the latter is shown below. A popular alternative to Stable Diffusion will be discussed further down.
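As an illustration, and assuming the Hugging Face diffusers library (not something we have tested yet), generation could look roughly like this; the model id, prompt and step count are placeholders, and half-precision weights plus a reduced step count usually shorten generation time on a capable GPU.

# Sketch: text-to-image with the Hugging Face "diffusers" library instead of txt2img.py.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint; any SD 1.x model id works
    torch_dtype=torch.float16,         # half precision to speed up generation on the GPU
).to("cuda")

prompt = ("the planet Jupiter with its moons Callisto and Europa, "
          "a spaceship flying close to the atmosphere, cinematic lighting")
image = pipe(prompt, num_inference_steps=25).images[0]  # fewer steps = faster, slightly rougher
image.save("jupiter.png")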


Another disadvantage of Stable Diffusion is that a powerful PC is required to generate images within an acceptable amount of time.


ChatGPT

Empathize, Ideate

We can choose which LLM (Large Language Model) we use. Most likely, GPT-3.5 Turbo will be more than sufficient for our use case, but should we need a more advanced interaction, the newer model GPT-4 is available. However, we should keep in mind that GPT-4 is 30x more expensive for input and output tokens. The count of output tokens can rise quite quickly as well, since ChatGPT sometimes gives extensive and elaborate answers.


To ensure we stay within financial limitations, we could add a system message (the API equivalent of ChatGPT's "custom instructions") stating that the output should stay within a certain character or word count.

The "4K/16K/8K/32K context" refers to the maximum number of tokens the model can handle in a single request, prompt and response combined. For our concept, the 4K context is not expected to be exceeded, since 4,000 tokens is already a lot of text, whether that is for input or output.
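A minimal sketch of what such a call could look like, assuming the official openai Python package (v1 client) with an OPENAI_API_KEY set in the environment; the system message text and max_tokens value are only examples.

# Sketch: GPT-3.5 Turbo call with a length-limiting system message and a hard token cap.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        # The system message acts as the API-side "custom instructions"
        {"role": "system", "content": "You are a character in an interactive play. "
                                      "Keep every answer under 80 words."},
        {"role": "user", "content": "Describe the storm we are flying into."},
    ],
    max_tokens=200,  # hard cap on output tokens, on top of the instruction above
)
print(response.choices[0].message.content)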


DALL-E 3

Empathize, Define

We can use this sophisticated AI image model to create dynamic content to show during the interactive experience. This image model is made by the same developer as ChatGPT, OpenAI. This means that integration with our application will be easier, since both models are reached through the same API and ChatGPT can, by default, interact with the generated image content (for example by writing the image prompts).


As expected, every time an image is generated a certain cost is added to the OpenAI bill.

This cost per image depends on the resolution that is chosen. Most likely, we will be choosing the 1024x1024 resolution, which means $0.040 per image will be paid (the price used in the example calculation below). Even if 50 images (which is on the high end) are generated every time the interactive experience is performed, it would only cost us $2.
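For reference, a minimal sketch of such an image request with the same assumed openai client as above; the prompt is just an example.

# Sketch: DALL-E 3 request at 1024x1024 (the chosen resolution determines the per-image price).
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt="a spaceship skimming Jupiter's cloud tops, cinematic lighting",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # temporary URL of the generated image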

An example of content that DALL-E 3 can generate is shown below. As can be seen, this image is of much higher quality than the output of Stable Diffusion, at a fraction of the effort and time.



Whisper

Empathize

Whisper is an API that can help us remove a step from the interactive experience, which contributes to the overall immersiveness. Instead of having to physically type a prompt into the computer, we can use a so-called "shotgun microphone" to capture the voice(s) of the audience, which is then used as the prompt.


The API also allows input in many languages, which is automatically translated (or at least understood) by ChatGPT. This takes away another barrier: the audience would otherwise be required to speak English during the interaction. Instead, they can communicate in the language they are most comfortable with, without breaking the immersiveness.


One requirement for Whisper to work properly is that only one person can be speaking at the same time.


The diagram below visualizes how the speech transcription as well as the translation is handled.
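Alongside the diagram, a minimal sketch of the two Whisper calls we would combine, again assuming the openai v1 client; the file name is hypothetical and stands in for the shotgun-microphone recording.

# Sketch: transcription keeps the original language, translation always returns English.
from openai import OpenAI

client = OpenAI()

with open("audience.wav", "rb") as audio:  # assumed recording from the shotgun microphone
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

with open("audience.wav", "rb") as audio:
    english = client.audio.translations.create(model="whisper-1", file=audio)

print(transcript.text)  # text in the spoken language
print(english.text)     # the same speech, translated to English for the LLM prompt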



Example calculation

We assume the interactive experience takes around 45 minutes.


ChatGPT-3.5 Turbo - 20 messages both ways assumed

Input: 20 messages * 71 tokens = 1,420 tokens; (1,420 / 1,000) * $0.0010 = $0.00142

Output: 20 messages * 71 tokens = 1,420 tokens; (1,420 / 1,000) * $0.0020 = $0.00284

LLM cost: $0.00426

DALL-E 3 - 20 images generated assumed, at 1024x1024 resolution

Output: 20 images * $0.040 = $0.80

Whisper (STT) - 15 minutes of speech to text assumed

Input: 15 minutes * $0.006 = $0.09

TTS - 15 minutes of text to speech assumed

Output: 15 minutes * $0.015 = $0.225

Total sum: $1.12

With these example values, a single instance of the interactive play would cost around $1.12 to fund the full AI functionality. In conclusion, the cost for this is negligible and easily within the range of available funding.
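To make it easy to redo this estimate when the assumptions change, a small helper sketch; the per-unit prices are the ones quoted above and would need updating if OpenAI changes its pricing.

# Rough per-run cost estimate for the AI services, using the assumptions from the table above.
def estimate_cost(messages=20, tokens_per_message=71, images=20,
                  stt_minutes=15, tts_minutes=15):
    llm_input = (messages * tokens_per_message / 1000) * 0.0010   # GPT-3.5 Turbo input
    llm_output = (messages * tokens_per_message / 1000) * 0.0020  # GPT-3.5 Turbo output
    dalle = images * 0.040            # DALL-E 3, 1024x1024
    whisper = stt_minutes * 0.006     # speech to text
    tts = tts_minutes * 0.015         # text to speech
    return llm_input + llm_output + dalle + whisper + tts

print(f"Estimated cost per run: ${estimate_cost():.2f}")  # ~$1.12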





