Combining AI with AR is proving to be one of the most impactful ways developers can take their AR content to the next level, balancing immersion with real-world knowledge and awareness.
Using Mattercraft's latest features, I wanted to run a quick experiment to see how the two could work together to enable everyday portraits and art to "speak" to users once scanned in WebAR.
In this demo, you can point your phone at any image of a famous person and instantly engage in a conversation with them. The system recognizes the image, uses text-to-speech, and even lip-syncs responses in multiple languages—all in near real-time. The best part? There's no need to train any image target or fine-tune a model beforehand.
Everything happens dynamically.
How it works
I've broken down the key elements below so you can re-create the experience in Mattercraft. If you have any questions about the individual steps, feel free to reach out to me on LinkedIn.
Image Recognition
The user takes a picture of a person, which is then sent to a Cloud Run function I developed on Google Cloud Platform (GCP). This function interfaces with an API I created around the Zappar image CLI tool. The API processes the image and returns an image target file (.zpt), which is dynamically added to the scene.
Below is an example of how you can capture a picture in code.
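Treat this as a rough sketch: Mattercraft exposes its own camera and snapshot APIs, so the canvas capture below, the `CLOUD_RUN_URL` endpoint and the response shape are placeholders standing in for the real implementation.

```ts
// Hypothetical endpoint for the Cloud Run function that wraps the image-training CLI.
const CLOUD_RUN_URL = "https://example-cloud-run-url/train";

async function captureAndTrain(canvas: HTMLCanvasElement): Promise<ArrayBuffer> {
  // Convert the current camera view (rendered to a canvas) into a JPEG blob.
  const blob = await new Promise<Blob>((resolve, reject) =>
    canvas.toBlob((b) => (b ? resolve(b) : reject(new Error("Capture failed"))), "image/jpeg", 0.9)
  );

  // Upload the snapshot; the backend trains it and returns the .zpt image target.
  const form = new FormData();
  form.append("image", blob, "snapshot.jpg");
  const res = await fetch(CLOUD_RUN_URL, { method: "POST", body: form });
  if (!res.ok) throw new Error(`Training request failed: ${res.status}`);

  // The returned .zpt bytes can then be loaded as a dynamic image target in the scene.
  return res.arrayBuffer();
}
```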
Using the OpenAI Vision API
Simultaneously, the picture is sent to OpenAI's Vision API (model GPT-4o). The base prompt is enhanced with a system prompt that instructs the model to role-play as the person in the image.
For example:
"You are a helpful assistant. This is a roleplaying game and people can ask questions about famous people and learn some history. You can also search the internet for information. People who use this application want to know more about you. Start with greeting as the person in the provided image and say your first and last name. In general keep your answers short. Can you return in your first answer if this person is male or female. For example: Hi I'm Will Smith, how are you doing? [male]"
I was pleasantly surprised by how accurately the model could identify people.
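For reference, here is roughly what that Vision request looks like over OpenAI's HTTP API. The request body follows OpenAI's documented chat completions format for GPT-4o with image input; the surrounding wiring (where the key lives, how the image is base64-encoded) is my own assumption, and the call should run server-side so the key stays private.

```ts
// The system prompt from above, shortened here for readability.
const SYSTEM_PROMPT =
  "You are a helpful assistant. This is a roleplaying game and people can ask " +
  "questions about famous people and learn some history. ...";

async function askAboutImage(imageBase64: string, question: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [
        { role: "system", content: SYSTEM_PROMPT },
        {
          role: "user",
          content: [
            { type: "text", text: question },
            // The camera snapshot is passed inline as a base64 data URL.
            { type: "image_url", image_url: { url: `data:image/jpeg;base64,${imageBase64}` } },
          ],
        },
      ],
    }),
  });
  const data = await res.json();
  // The reply ends with a [male]/[female] tag, which drives the voice selection later on.
  return data.choices[0].message.content as string;
}
```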
Text-to-Speech & Lip Syncing
The response generated by the model is then sent to a text-to-speech API, along with the identified gender. The resulting audio, together with the image, is then processed by a wav2lip model I found on Replicate.com (https://replicate.com/devxpy/cog-wav2lip), which lip-syncs the audio to the image. The resulting video is dynamically added to the scene in Mattercraft.
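Calling the Replicate model boils down to starting a prediction and polling until the video is ready. The sketch below uses Replicate's HTTP predictions API; the model version hash is a placeholder and the input field names ("face", "audio") are my assumption about the wav2lip model's schema, so double-check them against the model page.

```ts
// Rough sketch of driving the wav2lip model on Replicate from a server-side function.
async function lipSync(imageUrl: string, audioUrl: string): Promise<string> {
  const headers = {
    Authorization: `Token ${process.env.REPLICATE_API_TOKEN}`,
    "Content-Type": "application/json",
  };

  // Start the prediction with the portrait image and the generated TTS audio.
  const start = await fetch("https://api.replicate.com/v1/predictions", {
    method: "POST",
    headers,
    body: JSON.stringify({
      version: "WAV2LIP_MODEL_VERSION", // placeholder: copy the hash from the model page
      input: { face: imageUrl, audio: audioUrl },
    }),
  });
  let prediction = await start.json();

  // Poll until the lip-synced video has been rendered.
  while (prediction.status !== "succeeded" && prediction.status !== "failed") {
    await new Promise((r) => setTimeout(r, 2000));
    const poll = await fetch(`https://api.replicate.com/v1/predictions/${prediction.id}`, { headers });
    prediction = await poll.json();
  }
  if (prediction.status === "failed") throw new Error("Lip sync failed");

  // The output is a URL to the generated video, ready to be added to the scene.
  return prediction.output as string;
}
```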
Conversation Flow
Users can continue asking questions, with each query sent to OpenAI along with the conversation history. The initial image is used only once for lip-syncing, ensuring seamless interaction.
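Under the hood this is just a growing messages array in OpenAI's chat format. Here's a minimal sketch; `askOpenAI` is a hypothetical helper wrapping the chat completions call shown earlier.

```ts
// Minimal sketch of the conversation loop.
type Message = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical helper that posts the full messages array to the chat completions endpoint.
declare function askOpenAI(messages: Message[]): Promise<string>;

const history: Message[] = [
  { role: "system", content: "You are a helpful assistant. ..." }, // prompt from above
];

async function askFollowUp(question: string): Promise<string> {
  history.push({ role: "user", content: question });
  const answer = await askOpenAI(history); // full history gives the model context
  history.push({ role: "assistant", content: answer });
  return answer;
}
```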
UI Development
The user interface is built in Mattercraft, where I designed and styled multiple elements using CSS.
Conclusion
It was exciting to see how quickly I could prototype in Mattercraft. With the backend logic already in place, I was able to iterate rapidly on the frontend (AR) logic and assess its performance. This allowed for easy adjustments to prompts and results as needed.
Example response:
– https://webxr.be/talk-with-image/api/video/66cb7b7b818df.mp4 (result from the lip-sync model)
– https://webxr.be/talk-with-image/api/audio/66cb7b7b818df.mp3 (returned from the text-to-speech API)
– https://webxr.be/talk-with-image/api/img/66cb7b7b818df.jpg (snapshot from the camera)
Where to find Stijn
If you've got more questions for Stijn about the project or want to work with him, you can find him on LinkedIn or check out more of his creations over on X at @stspanho.