
talkinghead's Introduction

Talking Head (3D)

Demo Videos

  • I chat with Jenny and Harri. The close-up view allows you to evaluate the accuracy of lip-sync in both English and Finnish. Using GPT-3.5 and Microsoft text-to-speech.
  • A short demo of how AI can control the avatar's movements. Using OpenAI's function calling and Google TTS with the TalkingHead's built-in viseme generation.
  • Michael lip-syncs to two MP3 audio tracks using OpenAI's Whisper and TalkingHead's speakAudio method. He kicks things off with some casual talk, but then goes all out by trying to tackle an old Meat Loaf classic. 🤘 Keep rockin', Michael! 🎤😂
  • Julia and I showcase some of the features of the TalkingHead class and the test app, including the settings, some poses, and animations.

All the demo videos are real-time screen captures from a Chrome browser running the TalkingHead test web app without any post-processing.


Use Case Examples

  • Video conferencing. A video conferencing solution with real-time transcription, contextual AI responses, and voice lip-sync. The app and demo, featuring Olivia, by namnm 👍
  • Recycling Advisor 3D. Snap a photo and get local recycling advice from a talking avatar. My entry for the Gemini API Developer Competition.
  • Live Twitch adventure. Evertrail is an infinite, real-time generated world where all of your choices shape the outcome. Video clip and the app by JPhilipp 👏👏
  • Quantum physics using a blackboard. David introduces us to the CHSH game and explores the mystery of quantum entanglement. For more information about the research project, see CliqueVM.
  • Interactive Portfolio. Click the image to open the app, where you can interview the virtual persona of its developer, AkshatRastogi-1nC0re 👋

Introduction

Talking Head (3D) is a JavaScript class featuring a 3D avatar that can speak and lip-sync in real-time. The class supports Ready Player Me full-body 3D avatars (GLB), Mixamo animations (FBX), and subtitles. It also knows a set of emojis, which it can convert into facial expressions.

By default, the class uses Google Cloud TTS for text-to-speech and has built-in lip-sync support for English, Finnish, and Lithuanian (beta). New lip-sync languages can be added by creating new lip-sync language modules. It is also possible to integrate the class with an external TTS service, such as Microsoft Azure Speech SDK or ElevenLabs WebSocket API.

The class uses ThreeJS / WebGL for 3D rendering.


Talking Head class

You can download the TalkingHead modules from releases (without dependencies). Alternatively, you can import all the needed modules from a CDN:

<script type="importmap">
{ "imports":
  {
    "three": "https://cdn.jsdelivr.net/npm/[email protected]/build/three.module.js/+esm",
    "three/addons/": "https://cdn.jsdelivr.net/npm/[email protected]/examples/jsm/",
    "talkinghead": "https://cdn.jsdelivr.net/gh/met4citizen/[email protected]/modules/talkinghead.mjs"
  }
}
</script>

If you want to use the built-in Google TTS and lip-sync using Single Sign-On (SSO) functionality, give the class your TTS proxy endpoint and a function from which to obtain the JSON Web Token needed to use that proxy. Refer to Appendix B for one way to implement JWT SSO.

import { TalkingHead } from "talkinghead";

// Create the talking head avatar
const nodeAvatar = document.getElementById('avatar');
const head = new TalkingHead( nodeAvatar, {
  ttsEndpoint: "/gtts/",
  jwtGet: jwtGet,
  lipsyncModules: ["en", "fi"]
});
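
The jwtGet function passed above is not defined in this snippet. Below is a minimal sketch of one possible implementation; the endpoint path /app/jwt/get is borrowed from the index.html example later in this document, and the assumption that the proxy returns the raw token as plain text is mine, so adapt both to your own backend (see Appendix B).

// Fetch a JSON Web Token from your own SSO endpoint (sketch).
// The endpoint path and the response format are assumptions; adapt them to your backend.
async function jwtGet() {
  const response = await fetch( "/app/jwt/get", {
    credentials: "include" // Send the session/Basic auth credentials, if any
  });
  if ( !response.ok ) {
    throw new Error( "Failed to get JWT: " + response.status );
  }
  return await response.text(); // The raw JWT string
}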

FOR HOBBYISTS: If you're just looking to experiment on your personal laptop without dealing with proxies, JSON Web Tokens, or Single Sign-On, take a look at the minimal code example. Simply download the file, add your Google TTS API key, and you'll have a basic web app template with a talking head.

The following table lists all the available options and their default values:

Option Description
jwtGet Function to get the JSON Web Token (JWT). See Appendix B for more information.
ttsEndpoint Text-to-speech backend/endpoint/proxy implementing the Google Text-to-Speech API.
ttsApikey If you don't want to use a proxy or JWT, you can use Google TTS endpoint directly and provide your API key here. NOTE: I recommend that you don't use this in production and never put your API key in any client-side code.
ttsLang Google text-to-speech language. Default is "fi-FI".
ttsVoice Google text-to-speech voice. Default is "fi-FI-Standard-A".
ttsRate Google text-to-speech rate in the range [0.25, 4.0]. Default is 0.95.
ttsPitch Google text-to-speech pitch in the range [-20.0, 20.0]. Default is 0.
ttsVolume Google text-to-speech volume gain (in dB) in the range [-96.0, 16.0]. Default is 0.
ttsTrimStart Trim the viseme sequence start relative to the beginning of the audio (shift in milliseconds). Default is 0.
ttsTrimEnd Trim the viseme sequence end relative to the end of the audio (shift in milliseconds). Default is 300.
lipsyncModules Lip-sync modules to load dynamically at start-up. Limiting the number of language modules improves the loading time and memory usage. Default is ["en", "fi", "lt"]. [≥v1.2]
lipsyncLang Lip-sync language. Default is "fi".
pcmSampleRate PCM (signed 16bit little endian) sample rate used in speakAudio in Hz. Default is 22050.
modelRoot The root name of the armature. Default is Armature.
modelPixelRatio Sets the device's pixel ratio. Default is 1.
modelFPS Frames per second. Note that actual frame rate will be a bit lower than the set value. Default is 30.
modelMovementFactor A factor in the range [0,1] limiting the avatar's upper body movement when standing. Default is 1. [≥v1.2]
cameraView Initial view. Supported views are "full", "mid", "upper" and "head". Default is "full".
cameraDistance Camera distance offset for initial view in meters. Default is 0.
cameraX Camera position offset in X direction in meters. Default is 0.
cameraY Camera position offset in Y direction in meters. Default is 0.
cameraRotateX Camera rotation offset in X direction in radians. Default is 0.
cameraRotateY Camera rotation offset in Y direction in radians. Default is 0.
cameraRotateEnable If true, the user is allowed to rotate the 3D model. Default is true.
cameraPanEnable If true, the user is allowed to pan the 3D model. Default is false.
cameraZoomEnable If true, the user is allowed to zoom the 3D model. Default is false.
lightAmbientColor Ambient light color. The value can be a hexadecimal color or CSS-style string. Default is 0xffffff.
lightAmbientIntensity Ambient light intensity. Default is 2.
lightDirectColor Direction light color. The value can be a hexadecimal color or CSS-style string. Default is 0x8888aa.
lightDirectIntensity Direction light intensity. Default is 30.
lightDirectPhi Direction light phi angle. Default is 0.1.
lightDirectTheta Direction light theta angle. Default is 2.
lightSpotColor Spot light color. The value can be a hexadecimal color or CSS-style string. Default is 0x3388ff.
lightSpotIntensity Spot light intensity. Default is 0.
lightSpotPhi Spot light phi angle. Default is 0.1.
lightSpotTheta Spot light theta angle. Default is 4.
lightSpotDispersion Spot light dispersion. Default is 1.
avatarMood The mood of the avatar. Supported moods: "neutral", "happy", "angry", "sad", "fear", "disgust", "love", "sleep". Default is "neutral".
avatarMute Mute the avatar. This can be a helpful option if you want to output subtitles without audio and lip-sync. Default is false.
markedOptions Options for Marked markdown parser. Default is { mangle:false, headerIds:false, breaks: true }.
statsNode Parent DOM element for the three.js stats display. If null, don't use. Default is null.
statsStyle CSS style for the stats element. If null, use the three.js default style. Default is null.
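
For illustration, here is a constructor call that overrides several of the defaults listed above, reusing the jwtGet helper sketched earlier. The values are arbitrary examples rather than recommendations:

// Example: English-only lip-sync, head view, zoom enabled, dimmer ambient light
const head = new TalkingHead( document.getElementById('avatar'), {
  ttsEndpoint: "/gtts/",
  jwtGet: jwtGet,
  lipsyncModules: ["en"],       // Load only the English lip-sync module
  ttsLang: "en-GB",
  ttsVoice: "en-GB-Standard-A",
  cameraView: "head",
  cameraZoomEnable: true,
  lightAmbientIntensity: 1.5,
  avatarMood: "happy"
});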

Once the instance has been created, you can load and display your avatar. Refer to Appendix A for how to make your avatar:

// Load and show the avatar
try {
  await head.showAvatar( {
    url: './avatars/brunette.glb',
    body: 'F',
    avatarMood: 'neutral',
    ttsLang: "en-GB",
    ttsVoice: "en-GB-Standard-A",
    lipsyncLang: 'en'
  });
} catch (error) {
  console.log(error);
}
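
The showAvatar method also accepts an optional progress callback as its second parameter (see the methods table below). Here is a small sketch that uses it to show a loading percentage, assuming the page contains an element with the id loading:

// Show loading progress while the GLB file is being downloaded
const nodeLoading = document.getElementById('loading');
try {
  await head.showAvatar( {
    url: './avatars/brunette.glb',
    body: 'F',
    lipsyncLang: 'en'
  }, (ev) => {
    if ( ev.lengthComputable ) {
      const val = Math.min( 100, Math.round( ev.loaded / ev.total * 100 ) );
      nodeLoading.textContent = "Loading " + val + "%";
    }
  });
  nodeLoading.style.display = 'none';
} catch (error) {
  console.log(error);
}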

An example of how to make the avatar speak the contents of the text input when the speak button is clicked:

// Speak 'text' when the button 'speak' is clicked
const nodeSpeak = document.getElementById('speak');
nodeSpeak.addEventListener('click', function () {
  try {
    const text = document.getElementById('text').value;
    if ( text ) {
      head.speakText( text );
    }
  } catch (error) {
    console.log(error);
  }
});

The following table lists some of the key methods. See the source code for the rest:

Method Description
showAvatar(avatar, [onprogress=null]) Load and show the specified avatar. The avatar object must include the url for the GLB file. Optional properties are body for either male M or female F body form, lipsyncLang, ttsLang, ttsVoice, ttsRate, ttsPitch, ttsVolume, avatarMood and avatarMute.
setView(view, [opt]) Set view. Supported views are "full", "mid", "upper" and "head". The opt object can be used to set cameraDistance, cameraX, cameraY, cameraRotateX, cameraRotateY.
setLighting(opt) Change lighting settings. The opt object can be used to set lightAmbientColor, lightAmbientIntensity, lightDirectColor, lightDirectIntensity, lightDirectPhi, lightDirectTheta, lightSpotColor, lightSpotIntensity, lightSpotPhi, lightSpotTheta, lightSpotDispersion.
speakText(text, [opt={}], [onsubtitles=null], [excludes=[]]) Add the text string to the speech queue. The text can contain face emojis. Options opt can be used to set text-specific lipsyncLang, ttsLang, ttsVoice, ttsRate, ttsPitch, ttsVolume, avatarMood, avatarMute. Optional callback function onsubtitles is called whenever a new subtitle is to be written with the parameter of the added string. The optional excludes is an array of [start,end] indices to be excluded from audio but to be included in the subtitles.
speakAudio(audio, [opt={}], [onsubtitles=null]) Add a new audio object to the speech queue. In the audio object, property audio is either an AudioBuffer or an array of PCM 16bit LE audio chunks. Property words is an array of words, wtimes is an array of corresponding starting times in milliseconds, and wdurations is an array of durations in milliseconds. If the Oculus viseme IDs are known, they can be given in the optional visemes, vtimes and vdurations arrays. The object also supports optional timed callbacks using markers and mtimes. The opt object can be used to set text-specific lipsyncLang.
speakEmoji(e) Add an emoji e to the speech queue.
speakBreak(t) Add a break of t milliseconds to the speech queue.
speakMarker(onmarker) Add a marker to the speech queue. The callback function onmarker is called when the queue processes the event.
lookAt(x,y,t) Make the avatar's head turn to look at the screen position (x,y) for t milliseconds.
lookAtCamera(t) Make the avatar's head turn to look at the camera for t milliseconds.
setMood(mood) Set avatar mood.
playBackgroundAudio(url) Play background audio such as ambient sounds/music in a loop.
stopBackgroundAudio() Stop playing the background audio.
setMixerGain(speech, background) Set the amount of gain for the speech and background audio channels (see Web Audio API / GainNode for more information). Default value is 1.
playAnimation(url, [onprogress=null], [dur=10], [ndx=0], [scale=0.01]) Play a Mixamo animation file for dur seconds, playing full rounds and at least once. If the FBX file includes several animations, the parameter ndx specifies the index. Since Mixamo rigs have a scale of 100 and RPM a scale of 1, the scale factor can be used to scale the positions.
stopAnimation() Stop the current animation started by playAnimation.
playPose(url, [onprogress=null], [dur=5], [ndx=0], [scale=0.01]) Play the initial pose of a Mixamo animation file for dur seconds. If the FBX file includes several animations, the parameter ndx specifies the index. Since Mixamo rigs have a scale of 100 and RPM a scale of 1, the scale factor can be used to scale the positions.
stopPose() Stop the current pose started by playPose.
playGesture(name, [dur=3], [mirror=false], [ms=1000]) Play a named hand gesture and/or animated emoji for dur seconds with the ms transition time. The available hand gestures are handup, index, ok, thumbup, thumbdown, side, shrug. By default, hand gestures are done with the left hand. If you want the right handed version, set mirror to true. You can also use playGesture to play emojis. See Appendix D for more details. [≥v1.2]
stopGesture([ms=1000]) Stop the gesture with ms transition time. [≥v1.2]
start Start/re-start the Talking Head animation loop.
stop Stop the Talking Head animation loop.
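
As an example of combining some of these methods, the sketch below feeds speakAudio a pre-decoded audio buffer with word timings and plays a gesture while speaking. The audioBuffer variable and the timing values are placeholders; in practice they would come from your own TTS or transcription pipeline (see the Whisper and ElevenLabs examples elsewhere in this document).

// Assumes `audioBuffer` is an AudioBuffer you have already decoded,
// e.g. with head.audioCtx.decodeAudioData(arraybuffer).
head.speakAudio( {
  audio: audioBuffer,
  words: ["Hello", "there!"],   // Placeholder transcript
  wtimes: [0, 600],             // Word start times in milliseconds
  wdurations: [500, 700]        // Word durations in milliseconds
}, { lipsyncLang: "en" } );

// Wave with the right hand for three seconds while looking at the camera
head.playGesture( "handup", 3, true );
head.lookAtCamera( 3000 );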

The class has been tested on the latest Chrome, Firefox, Safari, and Edge desktop browsers, as well as on iPad.


The index.html Test App

NOTE: The index.html app was created for testing and developing the TalkingHead class. It includes various integrations with several paid services. If you only want to use the TalkingHead class in your own app, there is no need to install and configure the index.html app.

The web app index.html shows how to integrate and use the class with ElevenLabs WebSocket API, Microsoft Azure Speech SDK, OpenAI API and Gemini Pro API.

You can preview the app's UI here. Please note that since the API proxies for the text-to-speech and AI services are missing, the avatar does not speak or lip-sync, and you can't chat with it.

If you want to configure and use the app index.html, do the following:

  1. Copy the whole project to your own server.

  2. Create the needed API proxies as described in Appendix B and check/update your endpoint/proxy configuration in index.html:

// API endpoints/proxys
const jwtEndpoint = "/app/jwt/get"; // Get JSON Web Token for Single Sign-On
const openaiChatCompletionsProxy = "/openai/v1/chat/completions";
const openaiModerationsProxy = "/openai/v1/moderations";
const openaiAudioTranscriptionsProxy = "/openai/v1/audio/transcriptions";
const vertexaiChatCompletionsProxy = "/vertexai/";
const googleTTSProxy = "/gtts/";
const elevenTTSProxy = [
  "wss://" + window.location.host + "/elevenlabs/",
  "/v1/text-to-speech/",
  "/stream-input?model_id=eleven_multilingual_v2&output_format=pcm_22050"
];
const microsoftTTSProxy = [
  "wss://" + window.location.host + "/mstts/",
  "/cognitiveservices/websocket/v1"
];
  3. The test app's UI supports both Finnish and English. If you want to add another language, you need to add another entry to the i18n object.

  4. Add your own background images, videos, audio files, avatars etc. in the directory structure and update your site configuration siteconfig.js accordingly. The keys are in English, but the entries can include translations to other languages.

Licenses, attributions and notes related to the index.html web app assets:

  • The app uses Marked Markdown parser and DOMPurify XSS sanitizer.
  • Fira Sans Condensed and Fira Sans Extra Condensed fonts are licensed under the SIL Open Font License, version 1.1, available with a FAQ at http://scripts.sil.org/OFL. Digitized data copyright (c) 2012-2015, The Mozilla Foundation and Telefonica S.A.
  • Example avatar "brunette.glb" was created at Ready Player Me. The avatar is free to all developers for non-commercial use under the CC BY-NC 4.0 DEED. If you want to integrate Ready Player Me avatars into a commercial app or game, you must sign up as a Ready Player Me developer.
  • Example animation walking.fbx and the pose dance.fbx are from Mixamo, a subsidiary of Adobe Inc. Mixamo service is free and its animations/poses (>2000) can be used royalty free for personal, commercial, and non-profit projects. Raw animation files can't be distributed outside the project team and can't be used to train ML models.
  • Background view examples are from Virtual Backgrounds
  • Impulse response (IR) files for reverb effects:
    • ir-room: OpenAir, Public Domain Creative Commons license
    • ir-basement: OpenAir, Public Domain Creative Commons license
    • ir-forest (Abies Grandis Forest, Wheldrake Wood): OpenAir, Creative Commons Attribution 4.0 International License
    • ir-church (St. Andrews Church): OpenAir, Share Alike Creative Commons 3.0
  • Ambient sounds/music attributions:

NOTE: None of the assets described above are used or distributed as part of the TalkingHead class releases. If you wish to use them in your own application, please refer to the exact terms of use provided by the copyright holders.


FAQ

Why not use the free Web Speech API?

The free Web Speech API can't provide word-to-audio timestamps, which are essential for accurate lip-sync. As far as I know, there is no way to even get the Web Speech API speech synthesis as an audio file or determine its duration in advance. At some point I tried to use the Web Speech API events for synchronization, but the results were not good.

What paid text-to-speech service should I use?

It depends on your use case and budget. If the built-in lip-sync support is sufficient for your needs, I would recommend Google TTS, because it gives you up to 4 million characters for free each month. If your app needs to support multiple languages, I would consider Microsoft Speech SDK.

I would like to have lip-sync support for language X.

You have two options. First, you can implement a word-to-viseme class similar to those that currently exist for English and Finnish. See Appendix C for detailed instructions. Alternatively, you can check if Microsoft Azure TTS can provide visemes for your language and use Microsoft Speech SDK integration (speakAudio) instead of Google TTS and the built-in lip-sync (speakText).

Can I use a custom 3D model?

The class supports full-body Ready Player Me avatars. You can also make your own custom model, but it needs to have an RPM-compatible rig/bone structure and all of their blend shapes. Please refer to Appendix A and the readyplayer.me documentation for more details.

Any future plans for the project?

This is just a small side-project for me, so I don't have any big plans for it. That said, there are several companies that are currently developing text-to-3D-avatar and text-to-3D-animation features. If and when they get released as APIs, I will probably take a look at them and see if they can be used/integrated in some way to the project.


See also

[1] Finnish pronunciation, Wiktionary

[2] Elovitz, H. S., Johnson, R. W., McHugh, A., Shore, J. E., Automatic Translation of English Text to Phonetics by Means of Letter-to-Sound Rules (NRL Report 7948). Naval Research Laboratory (NRL). Washington, D. C., 1976. https://apps.dtic.mil/sti/pdfs/ADA021929.pdf


Appendix A: Create Your Own 3D Avatar

FOR HOBBYISTS:

  1. Create your own full-body avatar free at https://readyplayer.me

  2. Copy the given URL and add the following URL parameters in order to include all the needed morph targets:
    morphTargets=ARKit,Oculus+Visemes,mouthOpen,mouthSmile,eyesClosed,eyesLookUp,eyesLookDown&textureSizeLimit=1024&textureFormat=png

    Your final URL should look something like this:
    https://models.readyplayer.me/64bfa15f0e72c63d7c3934a6.glb?morphTargets=ARKit,Oculus+Visemes,mouthOpen,mouthSmile,eyesClosed,eyesLookUp,eyesLookDown&textureSizeLimit=1024&textureFormat=png

  3. Use the URL to download the GLB file to your own web server.

FOR 3D MODELERS:

You can create and use your own 3D full-body model, but it has to be Ready Player Me compatible. Their rig has a Mixamo-compatible bone structure described here:

https://docs.readyplayer.me/ready-player-me/api-reference/avatars/full-body-avatars

For lip-sync and facial expressions, you also need to have ARKit and Oculus compatible blend shapes, and a few additional ones, all listed in the following two pages:

https://docs.readyplayer.me/ready-player-me/api-reference/avatars/morph-targets/apple-arkit https://docs.readyplayer.me/ready-player-me/api-reference/avatars/morph-targets/oculus-ovr-libsync

The TalkingHead class supports both separated mesh and texture atlasing.


Appendix B: Create API Proxies with JSON Web Token (JWT) Single Sign-On (SSO)

  1. Make a CGI script that generates a new JSON Web Token with an expiration time (exp). See jwt.io for more information about JWT and libraries that best fit your needs and architecture. A minimal sketch of such a script is shown after the Apache configuration below.

  2. Protect your CGI script with some authentication scheme. Below is an example Apache 2.4 directory config that uses Basic authentication (remember to always use HTTPS/SSL!). Put your CGI script, named get, in the jwt directory.

# Restricted applications
<Directory "/var/www/app">
  AuthType Basic
  AuthName "Restricted apps"
  AuthUserFile /etc/httpd/.htpasswd
  Require valid-user
</Directory>

# JSON Web Token
<Directory "/var/www/app/jwt" >
  Options ExecCGI
  SetEnv REMOTE_USER %{REMOTE_USER}
  SetHandler cgi-script
</Directory>
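
The CGI script itself can be written in any language. Below is a minimal sketch in Node.js using the jsonwebtoken package; both the runtime and the library are assumptions on my part, and the handling of the shared secret is up to you.

#!/usr/bin/env node
// CGI script "get" in the jwt directory (sketch): issue a short-lived JWT
// for the authenticated user. Assumes Node.js and the `jsonwebtoken` package.
const jwt = require("jsonwebtoken");

const secret = process.env.JWT_SECRET || "change-me"; // Shared with the verifier
const user = process.env.REMOTE_USER || "anonymous";  // Set by Apache (see SetEnv above)

// Sign a token with a one-hour expiration time (exp claim)
const token = jwt.sign( { sub: user }, secret, { expiresIn: "1h" } );

// CGI response: headers, blank line, body
process.stdout.write("Content-Type: text/plain\r\n\r\n");
process.stdout.write(token);
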
  3. Make an External Rewriting Program script that verifies JSON Web Tokens. The script should return OK if the given token is not expired and its signature is valid. Start the script in the Apache 2.4 config. Users don't use the verifier script directly, so put it in some internal directory, not under the document root. A sketch of such a verifier is shown after the configuration below.
# JSON Web Token verifier
RewriteEngine On
RewriteMap jwtverify "prg:/etc/httpd/jwtverify" apache:apache
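
Below is a possible sketch of the verifier program, again assuming Node.js and the jsonwebtoken package. Apache's RewriteMap prg protocol sends one lookup key per line on stdin and expects one response line per key on stdout; here the key is the Authorization header value (or the raw token), and any response other than OK blocks the request.

#!/usr/bin/env node
// /etc/httpd/jwtverify (sketch): external rewriting program for RewriteMap.
// Input:  one Authorization header value or raw token per line
// Output: "OK" if the token verifies and has not expired, otherwise "FAIL"
const readline = require("readline");
const jwt = require("jsonwebtoken");

const secret = process.env.JWT_SECRET || "change-me"; // Same secret as the issuer

const rl = readline.createInterface({ input: process.stdin });
rl.on("line", (line) => {
  try {
    const token = line.replace(/^Bearer\s+/i, "").trim();
    jwt.verify( token, secret ); // Throws if the signature is invalid or the token has expired
    process.stdout.write("OK\n");
  } catch (e) {
    process.stdout.write("FAIL\n");
  }
});
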
  4. Make a proxy configuration for each service you want to use. Add the required API keys and protect the proxies with the JWT token verifier. Below are some example configs for the Apache 2.4 web server. Note that when opening a WebSocket connection (ElevenLabs, Azure) you can't add authentication headers in browser JavaScript. This problem is solved here by including the JWT token as part of the request URL. The downside is that the token might end up in server log files. This is typically not a problem as long as you are controlling the proxy server, you are using HTTPS/SSL, and the token has an expiration time.
# OpenAI API
<Location /openai/>
  RewriteCond ${jwtverify:%{http:Authorization}} !=OK
  RewriteRule .+ - [F]
  ProxyPass https://api.openai.com/
  ProxyPassReverse  https://api.openai.com/
  ProxyPassReverseCookiePath "/"  "/openai/"
  ProxyPassReverseCookieDomain ".api.openai.com" ".<insert-your-proxy-domain-here>"
  RequestHeader set Authorization "Bearer <insert-your-openai-api-key-here>"
</Location>

# Google TTS API
<Location /gtts/>
  RewriteCond ${jwtverify:%{http:Authorization}} !=OK
  RewriteRule .+ - [F]
  ProxyPass https://eu-texttospeech.googleapis.com/v1beta1/text:synthesize?key=<insert-your-api-key-here> nocanon
  RequestHeader unset Authorization
</Location>

# Microsoft Azure TTS WebSocket API (Speech SDK)
<LocationMatch /mstts/(?<jwt>[^/]+)/>
  RewriteCond ${jwtverify:%{env:MATCH_JWT}} !=OK
  RewriteRule .+ - [F]
  RewriteCond %{HTTP:Connection} Upgrade [NC]
  RewriteCond %{HTTP:Upgrade} websocket [NC]
  RewriteRule /mstts/[^/]+/(.+) "wss://<insert-your-region-here>.tts.speech.microsoft.com/$1" [P]
  RequestHeader set "Ocp-Apim-Subscription-Key" <insert-your-subscription-key-here>
</LocationMatch>

# ElevenLabs Text-to-speech WebSocket API
<LocationMatch /elevenlabs/(?<jwt>[^/]+)/>
  RewriteCond ${jwtverify:%{env:MATCH_JWT}} !=OK
  RewriteRule .+ - [F]
  RewriteCond %{HTTP:Connection} Upgrade [NC]
  RewriteCond %{HTTP:Upgrade} websocket [NC]
  RewriteRule /elevenlabs/[^/]+/(.+) "wss://api.elevenlabs.io/$1" [P]
  RequestHeader set "xi-api-key" "<add-your-elevenlabs-api-key-here>"
</LocationMatch>

Appendix C: Create A New Lip-sync Module

The steps that are common to all new languages:

  • Create a new file named lipsync-xx.mjs where xx is your language code, and place the file in the ./modules/ directory. The language module should have a class named LipsyncXx where Xx is the language code. The naming is important, because the modules are loaded dynamically based on their names (a minimal skeleton is sketched after this list).
  • The class should have (at least) the following two methods: preProcessText and wordsToVisemes. These are the methods used in the TalkingHead class.
  • The purpose of the preProcessText method is to preprocess the given text by converting symbols to words, numbers to words, and filtering out characters that should be left unspoken (if any), etc. This is often needed to prevent ambiguities between TTS and lip-sync engines. This method takes a string as a parameter and returns the preprocessed string.
  • The purpose of the wordsToVisemes method is to convert the given text into visemes and timestamps. The method takes a string as a parameter and returns a lip-sync object. The lip-sync object has three required properties: visemes, times, and durations.
    • Property visemes is an array of Oculus OVR viseme codes. Each viseme is one of the strings: 'aa', 'E', 'I', 'O', 'U', 'PP', 'SS', 'TH', 'CH', 'FF', 'kk', 'nn', 'RR', 'DD', 'sil'. See the reference images here: https://developer.oculus.com/documentation/unity/audio-ovrlipsync-viseme-reference/
    • Property times is an array of starting times, one entry for each viseme in visemes. Starting times are to be given in relative units. They will be scaled later on based on the word timestamps that we get from the TTS engine.
    • Property durations is an array of relative durations, one entry for each viseme in visemes. Durations are to be given in relative units. They will be scaled later on based on the word timestamps that we get from the TTS engine.
  • (OPTIONAL) Add the new module "xx" to lipsyncModules parameter array in the talkinghead.mjs file.
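
A minimal skeleton of such a module is sketched below for a hypothetical language code xx. The grapheme-to-viseme mapping here is only a placeholder; a real module would implement one of the approaches discussed next.

// ./modules/lipsync-xx.mjs - lip-sync module skeleton for a hypothetical language "xx"

class LipsyncXx {

  constructor() {
    // Placeholder mapping from characters to Oculus viseme codes.
    // A real module needs a proper grapheme/phoneme-to-viseme mapping.
    this.visemeMap = { 'a': 'aa', 'e': 'E', 'i': 'I', 'o': 'O', 'u': 'U' };
  }

  // Convert symbols/numbers to words and drop characters that should be left unspoken
  preProcessText(s) {
    return s.replace(/[#_*'":;]/g, '').replace(/\s+/g, ' ').trim();
  }

  // Convert words to visemes with relative times and durations
  wordsToVisemes(w) {
    const o = { visemes: [], times: [], durations: [] };
    let t = 0;
    for ( const ch of w.toLowerCase() ) {
      const viseme = this.visemeMap[ch];
      if ( viseme ) {
        o.visemes.push(viseme);
        o.times.push(t);     // Relative start time
        o.durations.push(1); // Relative duration
        t++;
      }
    }
    return o;
  }

}

export { LipsyncXx };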

The difficult part is to actually make the conversion from words to visemes. The best approach depends on the language. Here are some typical approaches to consider (not a comprehensive list):

  • Direct mapping from graphemes to phonemes to visemes. This works well for languages that have a consistent one-to-one mapping between individual letters and phonemes. This was used as the approach for the Finnish language (lipsync-fi.mjs) giving >99.9% lip-sync accuracy compared to the Finnish phoneme dictionary. Implementation size was ~4k. Unfortunately not all languages are phonetically orthographic languages.
  • Rule-based mapping. This was used as the approach for the English language (lipsync-en.mjs) giving around 80% lip-sync accuracy compared to the English phoneme dictionary. However, since the rules cover the most common words, the effective accuracy is higher. Implementation size ~12k.
  • Dictionary based approach. If neither of the previous approaches work for your language, make a search from some open source phoneme dictionary. Note that you still need some backup algorithm for those words that are not in the dictionary. The problem with phoneme dictionaries is their size. For example, the CMU phoneme dictionary for English is ~5M.
  • Neural-net approach based on transformer models. Typically this should be done on the server side, as the model size can be >50M.

TalkingHead is supposed to be a real-time class, so latency is always something to consider. It is often better to be small and fast than to aim for 100% accuracy.


Appendix D: Adding Custom Poses, Moods, Gestures, and Emojis (ADVANCED)

In the TalkingHead class, the avatar's movements are based on four data structures: head.poseTemplates, head.animMoods, head.gestureTemplates, and head.animEmojis. By using these objects, you can give your avatar its own personal body language.

In head.poseTemplates the hip position is defined as an {x, y, z} coordinate in meters, and bone rotations as Euler XYZ rotations in radians. In each pose, the avatar should have its weight on the left leg, if any, as the class automatically mirrors it for the right side. Setting the boolean properties standing, sitting, bend, kneeling, and lying helps the class make the transitions between different poses in proper steps.

head.poseTemplates["custom-pose-1"] = {
  standing: true, sitting: false, bend: false, kneeling: false, lying: false,
  props: {
    'Hips.position':{x:0, y:0.989, z:0.001}, 'Hips.rotation':{x:0.047, y:0.007, z:-0.007}, 'Spine.rotation':{x:-0.143, y:-0.007, z:0.005}, 'Spine1.rotation':{x:-0.043, y:-0.014, z:0.012}, 'Spine2.rotation':{x:0.072, y:-0.013, z:0.013}, 'Neck.rotation':{x:0.048, y:-0.003, z:0.012}, 'Head.rotation':{x:0.05, y:-0.02, z:-0.017}, 'LeftShoulder.rotation':{x:1.62, y:-0.166, z:-1.605}, 'LeftArm.rotation':{x:1.275, y:0.544, z:-0.092}, 'LeftForeArm.rotation':{x:0, y:0, z:0.302}, 'LeftHand.rotation':{x:-0.225, y:-0.154, z:0.11}, 'LeftHandThumb1.rotation':{x:0.435, y:-0.044, z:0.457}, 'LeftHandThumb2.rotation':{x:-0.028, y:0.002, z:-0.246}, 'LeftHandThumb3.rotation':{x:-0.236, y:-0.025, z:0.113}, 'LeftHandIndex1.rotation':{x:0.218, y:0.008, z:-0.081}, 'LeftHandIndex2.rotation':{x:0.165, y:-0.001, z:-0.017}, 'LeftHandIndex3.rotation':{x:0.165, y:-0.001, z:-0.017}, 'LeftHandMiddle1.rotation':{x:0.235, y:-0.011, z:-0.065}, 'LeftHandMiddle2.rotation':{x:0.182, y:-0.002, z:-0.019}, 'LeftHandMiddle3.rotation':{x:0.182, y:-0.002, z:-0.019}, 'LeftHandRing1.rotation':{x:0.316, y:-0.017, z:0.008}, 'LeftHandRing2.rotation':{x:0.253, y:-0.003, z:-0.026}, 'LeftHandRing3.rotation':{x:0.255, y:-0.003, z:-0.026}, 'LeftHandPinky1.rotation':{x:0.336, y:-0.062, z:0.088}, 'LeftHandPinky2.rotation':{x:0.276, y:-0.004, z:-0.028}, 'LeftHandPinky3.rotation':{x:0.276, y:-0.004, z:-0.028}, 'RightShoulder.rotation':{x:1.615, y:0.064, z:1.53}, 'RightArm.rotation':{x:1.313, y:-0.424, z:0.131}, 'RightForeArm.rotation':{x:0, y:0, z:-0.317}, 'RightHand.rotation':{x:-0.158, y:-0.639, z:-0.196}, 'RightHandThumb1.rotation':{x:0.44, y:0.048, z:-0.549}, 'RightHandThumb2.rotation':{x:-0.056, y:-0.008, z:0.274}, 'RightHandThumb3.rotation':{x:-0.258, y:0.031, z:-0.095}, 'RightHandIndex1.rotation':{x:0.169, y:-0.011, z:0.105}, 'RightHandIndex2.rotation':{x:0.134, y:0.001, z:0.011}, 'RightHandIndex3.rotation':{x:0.134, y:0.001, z:0.011}, 'RightHandMiddle1.rotation':{x:0.288, y:0.014, z:0.092}, 'RightHandMiddle2.rotation':{x:0.248, y:0.003, z:0.02}, 'RightHandMiddle3.rotation':{x:0.249, y:0.003, z:0.02}, 'RightHandRing1.rotation':{x:0.369, y:0.019, z:0.006}, 'RightHandRing2.rotation':{x:0.321, y:0.004, z:0.026}, 'RightHandRing3.rotation':{x:0.323, y:0.004, z:0.026}, 'RightHandPinky1.rotation':{x:0.468, y:0.085, z:-0.03}, 'RightHandPinky2.rotation':{x:0.427, y:0.007, z:0.034}, 'RightHandPinky3.rotation':{x:0.142, y:0.001, z:0.012}, 'LeftUpLeg.rotation':{x:-0.077, y:-0.058, z:3.126}, 'LeftLeg.rotation':{x:-0.252, y:0.001, z:-0.018}, 'LeftFoot.rotation':{x:1.315, y:-0.064, z:0.315}, 'LeftToeBase.rotation':{x:0.577, y:-0.07, z:-0.009}, 'RightUpLeg.rotation':{x:-0.083, y:-0.032, z:3.124}, 'RightLeg.rotation':{x:-0.272, y:-0.003, z:0.021}, 'RightFoot.rotation':{x:1.342, y:0.076, z:-0.222}, 'RightToeBase.rotation':{x:0.44, y:0.069, z:0.016}
  }
};
head.playPose("custom-pose-1");

In head.animMoods the syntax is more complex, so I suggest that you take a look at the existing moods. In anims, each leaf object is an animation loop template. Whenever a loop starts, the class iterates through the nested hierarchy of objects by following keys that match the current state (idle, talking), body form (M, F), current view (full, upper, mid, head), and/or probabilities (alt + p). The next animation will be created internally by using the animFactory method. The property delay (ms) determines how long that pose is held, dt defines durations (ms) for each part in the sequence, and vs defines the shapekeys and their target values for each part.

head.animMoods["custom-mood-1"] = {
  baseline: { eyesLookDown: 0.1 },
  speech: { deltaRate: 0, deltaPitch: 0, deltaVolume: 0 },
  anims: [
    { name: 'breathing', delay: 1500, dt: [ 1200,500,1000 ], vs: { chestInhale: [0.5,0.5,0] } },
    { name: 'pose', alt: [
      { p: 0.2, delay: [5000,20000], vs: { pose: ['side'] } },
      { p: 0.2, delay: [5000,20000], vs: { pose: ['hip'] },
        'M': { delay: [5000,20000], vs: { pose: ['wide'] } }
      },
      { delay: [5000,20000], vs: { pose: ['custom-pose-1'] } }
    ]},
    { name: 'head',
      idle: { delay: [0,1000], dt: [ [200,5000] ], vs: { headRotateX: [[-0.04,0.10]], headRotateY: [[-0.3,0.3]], headRotateZ: [[-0.08,0.08]] } },
      talking: { dt: [ [0,1000,0] ], vs: { headRotateX: [[-0.05,0.15,1,2]], headRotateY: [[-0.1,0.1]], headRotateZ: [[-0.1,0.1]] } }
    },
    { name: 'eyes', delay: [200,5000], dt: [ [100,500],[100,5000,2] ], vs: { eyesRotateY: [[-0.6,0.6]], eyesRotateX: [[-0.2,0.6]] } },
    { name: 'blink', delay: [1000,8000,1,2], dt: [50,[100,300],100], vs: { eyeBlinkLeft: [1,1,0], eyeBlinkRight: [1,1,0] } },
    { name: 'mouth', delay: [1000,5000], dt: [ [100,500],[100,5000,2] ], vs : { mouthRollLower: [[0,0.3,2]], mouthRollUpper: [[0,0.3,2]], mouthStretchLeft: [[0,0.3]], mouthStretchRight: [[0,0.3]], mouthPucker: [[0,0.3]] } },
    { name: 'misc', delay: [100,5000], dt: [ [100,500],[100,5000,2] ], vs : { eyeSquintLeft: [[0,0.3,3]], eyeSquintRight: [[0,0.3,3]], browInnerUp: [[0,0.3]], browOuterUpLeft: [[0,0.3]], browOuterUpRight: [[0,0.3]] } }
  ]
};
head.setMood("custom-mood-1");

Typical value range is [0,1] or [-1,1]. At the end of each animation, the value will automatically return to its baseline value. If the value is an array, it defines a range for a uniform/Gaussian random value (approximated using CLT). See the class method gaussianRandom for more information.

In head.gestureTemplates each property is a subset of bone rotations that will be used to override the current pose.

head.gestureTemplates["salute"] = {
  'LeftShoulder.rotation':{x:1.706, y:-0.171, z:-1.756}, 'LeftArm.rotation':{x:0.883, y:-0.288, z:0.886}, 'LeftForeArm.rotation':{x:0, y:0, z:2.183}, 'LeftHand.rotation':{x:0.029, y:-0.298, z:0.346}, 'LeftHandThumb1.rotation':{x:1.43, y:-0.887, z:0.956}, 'LeftHandThumb2.rotation':{x:-0.406, y:0.243, z:0.094}, 'LeftHandThumb3.rotation':{x:-0.024, y:0.008, z:-0.012}, 'LeftHandIndex1.rotation':{x:0.247, y:-0.011, z:-0.084}, 'LeftHandIndex2.rotation':{x:0.006, y:0, z:0}, 'LeftHandIndex3.rotation':{x:-0.047, y:0, z:0.004}, 'LeftHandMiddle1.rotation':{x:0.114, y:-0.004, z:-0.055}, 'LeftHandMiddle2.rotation':{x:0.09, y:0, z:-0.007}, 'LeftHandMiddle3.rotation':{x:0.078, y:0, z:-0.006}, 'LeftHandRing1.rotation':{x:0.205, y:-0.009, z:0.023}, 'LeftHandRing2.rotation':{x:0.109, y:0, z:-0.009}, 'LeftHandRing3.rotation':{x:-0.015, y:0, z:0.001}, 'LeftHandPinky1.rotation':{x:0.267, y:-0.012, z:0.031}, 'LeftHandPinky2.rotation':{x:0.063, y:0, z:-0.005}, 'LeftHandPinky3.rotation':{x:0.178, y:-0.001, z:-0.014}
};
head.playGesture("salute",3);

In head.animEmojis each object is an animated emoji. Note that you can also use head.playGesture to play animated emojis. This makes it easy to combine a hand gesture and a facial expression by giving the gesture and the emoji the same name.

head.animEmojis["🫤"] = { dt: [300,2000], vs: {
    browInnerUp: [0.5], eyeWideLeft: [0.5], eyeWideRight: [0.5], mouthLeft: [0.5], mouthPressLeft: [0.8], mouthPressRight: [0.2], mouthRollLower: [0.5], mouthStretchLeft: [0.7],   mouthStretchRight: [0.7]
  }
};
head.playGesture("🫤",3);

talkinghead's People

Contributors

met4citizen


talkinghead's Issues

Error: 429

Thanks for creating the new demo file, but I am getting error 429 using your code:

<title>Talking Head - MP3 example</title> <style> body, html { width:100%; height:100%; margin: 0; padding: 0; background-color: dimgray; color: white; } #avatar { display: block; position: absolute; top: 0; left: 0; right: 40%; bottom: 0; } #controls { display: flex; flex-direction: column; gap: 10px; position: absolute; top: 50px; left: Calc( 60% + 50px); right: 50px; bottom: 50px; } #load { font-family: Arial; font-size: 20px; } #json { flex: 1; background-color: lightgray; font-family: Arial; font-size: 20px; } #play { font-family: Arial; font-size: 20px; } #loading { display: block; position: absolute; top: 50px; left: 50px; width: 200px; font-family: Arial; font-size: 20px; } </style> <script type="importmap"> { "imports": { "three": "https://cdn.jsdelivr.net/npm/[email protected]/build/three.module.js/+esm", "three/examples/": "https://cdn.jsdelivr.net/npm/[email protected]/examples/", "three/addons/": "https://cdn.jsdelivr.net/npm/[email protected]/examples/jsm/", "dompurify": "https://cdn.jsdelivr.net/npm/[email protected]/+esm", "marked": "https://cdn.jsdelivr.net/npm/[email protected]/+esm", "talkinghead": "https://cdn.jsdelivr.net/gh/met4citizen/[email protected]/modules/talkinghead.mjs" } } </script> <script type="module"> import { TalkingHead } from "talkinghead"; let head; // TalkingHead instance let audio; // Audio object // Make a transcription of an audio file using OpenAI's Whisper API async function loadAudio(file) { try { const nodeJSON = document.getElementById('json'); nodeJSON.value = "Please wait..."; const nodePlay = document.getElementById('play'); nodePlay.disabled = true; // OpenAI Whisper request const form = new FormData(); form.append("file", file); form.append("model", "whisper-1"); form.append("language", "en"); form.append("response_format", "verbose_json" ); form.append("timestamp_granularities[]", "word" ); form.append("timestamp_granularities[]", "segment" ); // NOTE: Never put your API key in a client-side code unless you know // that you are the only one to have access to that code! 
const response = await fetch( "https://api.openai.com/v1/audio/transcriptions" , { method: "POST", body: form, headers: { "Authorization": "Bearer " // <- Change this } }); if ( response.ok ) { const json = await response.json(); nodeJSON.value = JSON.stringify(json, null, 4); // Fetch audio if ( json.words && json.words.length ) { var reader = new FileReader(); reader.readAsArrayBuffer(file); reader.onload = async readerEvent => { let arraybuffer = readerEvent.target.result; let audiobuffer = await head.audioCtx.decodeAudioData(arraybuffer); // TalkingHead audio object audio = { audio: audiobuffer, words: [], wtimes: [], wdurations: [], markers: [], mtimes: [] }; // Add words to the audio object json.words.forEach( x => { audio.words.push( x.word ); audio.wtimes.push( 1000 * x.start - 150 ); audio.wdurations.push( 1000 * (x.end - x.start) ); }); // Callback function to make the avatar look at the camera const startSegment = async () => { head.lookAtCamera(500); head.speakWithHands(); }; // Add timed callback markers to the audio object json.segments.forEach( x => { if ( x.start > 2 && x.text.length > 10 ) { audio.markers.push( startSegment ); audio.mtimes.push( 1000 * x.start - 1000 ); } }); // Enable play button nodePlay.disabled = false; } } } else { nodeJSON.value = 'Error: ' + response.status + ' ' + response.statusText; console.log(response); } } catch (error) { console.log(error); } } document.addEventListener('DOMContentLoaded', async function(e) { // Instantiate the class // NOTE: Text-to-speech not initialized const nodeAvatar = document.getElementById('avatar'); head = new TalkingHead( nodeAvatar, { ttsEndpoint: "https://eu-texttospeech.googleapis.com/v1beta1/text:synthesize", cameraView: "head" }); // Load and show the avatar const nodeLoading = document.getElementById('loading'); try { await head.showAvatar( { url: 'https://models.readyplayer.me/64bfa15f0e72c63d7c3934a6.glb?morphTargets=ARKit,Oculus+Visemes,mouthOpen,mouthSmile,eyesClosed,eyesLookUp,eyesLookDown&textureSizeLimit=1024&textureFormat=png', body: 'F', avatarMood: 'neutral', lipsyncLang: 'en' }, (ev) => { if ( ev.lengthComputable ) { let val = Math.min(100,Math.round(ev.loaded/ev.total * 100 )); nodeLoading.textContent = "Loading " + val + "%"; } }); nodeLoading.style.display = 'none'; } catch (error) { console.log(error); nodeLoading.textContent = error.toString(); } // File changed const nodeLoad = document.getElementById('load'); nodeLoad.addEventListener('change', function(ev) { let file = ev.target.files[0]; loadAudio(file); }); // Play button clicked const nodePlay = document.getElementById('play'); nodePlay.addEventListener('click', function() { if ( audio ) { head.speakAudio( audio ); } }); }); </script>
<textarea id="json" readonly></textarea>

Index.html

The major issue is not understanding the code. There should be separate files and an explanation of how one can use it; a lot of functions depend on each other, and it is not clearly described how everything works. The worst part is the JWT tokens, I mean, why is it so bad?

a general control of stillness

Thanks for the fantastic project! Lots of details are there, which requires lots of tedious work. The code is quite hard to digest, though. For example, I'd like to have a global control of how active/still my model's pose looks when in upper/head view. The model changes and moves too much up close in the happy mood. I'd like her to move less and with less amplitude. Any suggestions?

Poses and animations downloaded from Mixamo move the origin of the model

When using custom files downloaded from Mixamo, applying the .fbx poses/animations seems to shift the center of the model and fix the model's hip bone to the floor instead of the feet. This only affects poses and animations downloaded from Mixamo; the included dancing.fbx file doesn't have this issue.

This shift affects both the default model and any other custom model downloaded from readyplayer.me

Below is "Female Standing" applied to the default brunette.glb

(screenshot attached)

Any idea what's happening here? All the Mixamo files apply just fine in Blender or Unity.

PS: Great repository and work, by the way; this is really cool stuff.

Increase volume for openAI's tts and whisper integration

Right now, when I integrate OpenAI's TTS and Whisper for the visemes and call the startSpeaking function, the volume seems to be low. Is there any configuration required in the TalkingHead module, or do I need to tweak it with other methods?

Makeup lost?

Hi! My avatar has lost their lipstick when I include it. (Note the lipstick wasn't a new inclusion, it was added originally when setting up the avatar and getting its id.)

I kind of like the non-make-up version better, so it's not a big deal, but I wanted to let you know here. I used the mp3.html as a starting point for the inclusion. The inclusion is shown as a square at the top of the screenshot below.

makeup

cheers!

Bad mouth effect during speaking

Currently the mouth templates in the mood configurations interfere with the visemes and render bad-looking lip shapes. Should the class ignore the mouth template in the moods while the avatar is speaking?

Proper way to unload old avatar before loading new?

What's the proper way to unload the old avatar, when one wants to dynamically change to a new one during live usage? I'm currently simply nulling things -- see below -- but e.g. I want the stacked speech queue to also immediately stop etc. Thanks!

let head;
let audio;
let avatarUrl;

export async function avatarLoad(id, language, gender, mood = 'neutral') {
  const url = `/avatars/${id}.glb`;
  if (url != avatarUrl) {
    console.log('Loading avatar', id, language, gender);
    avatarUrl = url;

    const nodeAvatar = document.getElementById('avatar');

    // "Unloading" old:
    head = null;
    nodeAvatar.innerHTML = '';
    audio = null;

    head = new TalkingHead( nodeAvatar, {
      ttsEndpoint: "none",
      cameraView: "head",
      cameraRotateEnable: false,
      cameraRotateY: 0.4,
      lightAmbientColor: '#fff1d8'
    });

    try {
      await head.showAvatar( {
        url: url,
        body: gender,
        avatarMood: mood,
        lipsyncLang: language
      }, (ev) => {});
    } catch (error) {
      console.log(error);
    }

    console.log('Loading avatar done.');
  }
}

Thinking pose/mood

Hello. Back again 😄

I'm trying to get a thinking animation from mixamo or a "mood". I'm hoping that I can make this animation/mood play while waiting for the GenAI to return after getting a user request.

Unfortunately, when I try to play the animation, the characters' position completely changes and goes "off-screen".

Do you have any ideas?

I'm happy to create a sample app, and I'd be willing to donate for your time.

P.S. I need to catch up to some changes you've made since last time and push my typescript rewrite somewhere for you to take a look at.

Audio not playing in Safari

I created a website using the minimal template you provided. The website works perfectly on Google Chrome, but for some reason the avatar doesn't speak on Safari, or on iPhone in any browser. I debugged the code: the TTS API key is retrieved from the proxy properly, and the text is also going into the head.speakText function. The console logs no errors. I checked whether Apple has some kind of restriction; it turns out they do have an autoplay limitation, but I specifically went to my Safari settings and selected the option where all sites are allowed to autoplay. Still, I am facing the issue. Is there anything else that I can try?

ReadyPlayerMe alternatives?

I love the TalkingHead project but was surprised at how few options ReadyPlayerMe had, considering they were well-funded and positioned themselves as a big cross-app undertaking, as I understand it.

For instance, all of their head and eye shapes look super similar and are often hard to tell apart. And they have super small variety of clothes. There doesn't seem to be an age slider. Or a fully-free color picker for all the things (what if I need blue skin). Or a slider for a very long nose, etc. etc. And that's not mentioning the many oddities and quirks on their site (like the "Rate us" dialog popping up every few minutes, or a Locked asset saying it's Premium but there's no obvious way to buy it. It all seems like a deserted project.) In sum, I just can't do half the avatars I had planned.

I do understand of course that ReadyPlayerMe maybe only wanted for the apps themselves to create the assets and then sell them, not to have a lot of variety on their site. They also seem to have been planning to ride on the NFT hype.

I wonder what other 3D avatar maker options there would be? Something more casual than diving in to Blender and understanding the rig etc.

Maybe met4citizen, you need to start a new side project to make a BETTER avatar platform. 😄 I'd pay for more options! Do you have a donations button somewhere, by the way?

Cheers!

Custom 3D model?

How can I add a custom 3D model? What rigging should it have?

Laptop struggles, any optimizations I could try?

The head runs fine on my main PC, but when using Twitch Studio (and having my server run in the background), my laptop struggles a bit and adds flicker in the stream when the head moves.

What are some things I might try to optimize performance?

Via CSS, I use an additional blur of 0.5 and a slight background shadow on the avatar div, too. It probably doesn't help either (I'll try disabling it now). Edit: That didn't really help.

Thanks!

Volume for head.speakAudio()?

I'm exclusively using the non-TTS speakAudio with a transcript file (works great!). Is there any way to set the volume? Cheers!

On a side note, what I'd really want is for the background music to smoothly get a lowpass filter and lower its gain while the avatar speaks, in order to better hear them. The approach I used so far for that is below. It sounds epic when the audio fades back in after the speech, and I'd love to wire this up to the avatar.

export class AudioPlayerWithLowpassFilter {

  constructor(musicPath, speechPath) {
    this.audioContext = new AudioContext();
    this.lowpassFilter = this.audioContext.createBiquadFilter();
    this.lowpassFilter.type = 'lowpass';
    
    const frequencyInHerz = 600;
    this.lowpassFilter.frequency.value = frequencyInHerz;

    this.gainNode = this.audioContext.createGain();
    
    this.musicVolumeMin = 0.55;
    this.musicVolumeMax = 0.7;

    this.musicPath = musicPath;
    this.speechPath = speechPath;
    this.musicSource = null;
    this.speechSource = null;
    this.isPlaying = false;

    if (!this.speechPath) {
      this.musicVolumeMin -= 0.4;
      this.musicVolumeMax = this.musicVolumeMin;
    }

    this.gainNode.gain.value = this.musicVolumeMin;
  }

  async loadAudio(url) {
    const response = await fetch(url);
    const arrayBuffer = await response.arrayBuffer();
    return this.audioContext.decodeAudioData(arrayBuffer);
  }

  async setupSources() {
    const musicBuffer = await this.loadAudio(this.musicPath);
    this.musicSource = this.audioContext.createBufferSource();
    this.musicSource.buffer = musicBuffer;
    this.musicSource.loop = true;

    if (this.speechPath) {
      const speechBuffer = await this.loadAudio(this.speechPath);
      this.speechSource = this.audioContext.createBufferSource();
      this.speechSource.buffer = speechBuffer;
    }

    this.musicSource.connect(this.lowpassFilter).connect(this.gainNode).connect(this.audioContext.destination);

    if (this.speechPath) {
      this.speechSource.connect(this.audioContext.destination);

      this.speechSource.onended = () => {
        this.removeLowpassFilter();
      };
    }
  }

  removeLowpassFilter() {
    if (this.musicSource && this.lowpassFilter) {
      const currentTime = this.audioContext.currentTime;

      const startFrequency = this.lowpassFilter.frequency.value;
      const endFrequency = 20000;
      const rampDuration = 4;
  
      this.lowpassFilter.frequency.setValueAtTime(startFrequency, currentTime);
      this.lowpassFilter.frequency.linearRampToValueAtTime(endFrequency, currentTime + rampDuration);

      this.gainNode.gain.setValueAtTime(this.gainNode.gain.value, currentTime);
      this.gainNode.gain.linearRampToValueAtTime(this.musicVolumeMax, currentTime + rampDuration);
    }
  }

  async play() {
    if (!this.isPlaying) {
      await this.setupSources();
      this.musicSource.start();
      if (this.speechPath) { this.speechSource.start(); }
      this.isPlaying = true;
    }
  }

  stop() {
    if (this.isPlaying) {
      this.musicSource.stop();
      if (this.speechPath) { this.speechSource.stop(); }
      this.isPlaying = false;
    }
  }

  stopAudioSourcePromise(audioSource) {
    return new Promise((resolve, reject) => {
      try {
        if (audioSource && audioSource.context.state === 'running') {
          audioSource.stop();
        }
        resolve();
      } catch (error) {
        reject(error);
      }
    });
  }
  
  async stopAsync() {
    if (this.speechPath) { 
      await Promise.all([
        this.stopAudioSourcePromise(this.musicSource),
        this.stopAudioSourcePromise(this.speechSource)
      ]);
    }
    else {
      await this.stopAudioSourcePromise(this.musicSource);
    }
    this.isPlaying = false;
  }

  async setFiles(musicPath, speechPath) {
    if (this.isPlaying) {
      this.stop();
    }
    this.musicPath = musicPath;
    this.speechPath = speechPath;
  }

}

Adding Lipsyncing for German would be fantastic!

There's currently English, Lithuanian and Finnish, but it would be fantastic to also have lipsync-de.mjs! I'm currently using this library for a host to a German ChatGPT-API-powered twitch chat.

Cheers!

Viseme not lip-syncing

Hello,

Thanks for this project. I haven't found anything close to doing what I wanted.

I'm having trouble getting the avatar's lips to move. I have a backend process that will call Azure's TTS service. I store the audio (base64), an array of visemes, and words. That API also converts from nanoseconds to seconds since I saw your Azure example dividing the offsets by 10000. Using the code from the index.html file, I convert the JSON I get back to the words/visemes that your code is expecting (mapping, etc.)

I then call speakAudio providing a speak object. The audio plays and the subtitles work, but as I said, the avatar's lips do not move.

Here's a minimal sample. I use WebStorm's built-in serve functionality to host it locally.

Please let me know if you have any thoughts on what I can do to get this working. If you have a ko-fi or something, I'd be happy to donate.

Godot 4 version?

We now have a working C# wrapper for MediaPipe in Godot, and we are upgrading MediaPipe to v0.10.11 or perhaps v0.10.12.

We have XRAnimator driving a ReadyPlayerMe avatar in Godot.

I am working on the deep AI part to support German computational linguistics, and this will probably address the previously raised issue in the future.

It would be great if we could ask for your feedback as we attempt to port some of your creativity to the Godot 4 ReadyPlayerMe facial animation part to combine TTS and STT.

Hide avatar?

Is there a way to hide the avatar, but keep everything else in the scene (like props I added)?

For a special feature, I want to temporarily show an actual webcam-filmed speaking robot head instead of an avatar, but they should still be overlaid by props I added.

Cheers!

ikSolve and disappeared bones

Double-clicking on the avatar body causes hand movements toward the clicked points. But the hands and fingers disappear when the camera is too close to the subject. See the attachments.
Screen Shot 2024-05-13 at 16 27 55
Screen Shot 2024-05-13 at 16 27 16

Visemes Value Problem

Hello,

First of all, amazing project. Thank you for sharing.
I have a question regarding visemes and mouth movements. As I understand it, RPM characters have blend shapes which give us mouth movements (visemes), but I couldn't find how to set their values. I converted the project from three.js to babylon.js and am doing this matching manually.

For example for 'aa' viseme, I set 'mouthOpen': 1.0, and 'jawOpen': 0.7 or for 'U' viseme, I set 'mouthPucker': 0.8, 'mouthPressLeft': 0.4

But these are completely arbitrary values, and I am not sure how I can set the correct values for the visemes. I couldn't find any reference in your code. In my opinion, if I say "Hello, my name is John", the mouth should play the corresponding visemes, but those visemes need to be mapped to the correct values in the first place. I am not sure what I am missing. I hope I made my point.
Thank you.

ElevenLabs backend issue

Hello, I hope you're doing well. I'm facing an issue with some code. My primary programming language is Python, so I'm not very familiar with Node.js. From what I've gathered from GitHub and the existing code, it seems we need a backend server to process requests and return responses. I set up a simple Flask server for this purpose.

I also developed a backend function to handle requests and confirmed its functionality through separate testing - it worked. However, when I integrated it with the frontend, I received neither output nor error messages. Further investigation revealed that the function requires an 'alignments' dictionary detailing the start time, word, and total duration of the word in the audio. I've adjusted the backend to accommodate this.

After making these changes, I tried again and received audio output, but it was just white noise, not the expected word-related audio. I'm not using Nginx or Apache2 servers. Then I tried some further changes to the function, and now I am getting no error and no audio.

Thank you so much for your assistance in advance.

Here are some code samples:

Backend flask function:

# Imports assumed for this snippet: requests for the HTTP call, pydub for measuring
# audio length, and Flask-SocketIO for the socket handler. `socketio` and
# estimate_word_timings_enhanced() are defined elsewhere (not shown).
import base64
import io

import requests
from flask_socketio import emit
from pydub import AudioSegment


def elevenSpeak_adapted(text):
    CHUNK_SIZE = 1024
    url = "https://api.elevenlabs.io/v1/text-to-speech/EXAVITQu4vr4xnSDxMaL"

    headers = {
        "Accept": "audio/mpeg",
        "Content-Type": "application/json",
        "xi-api-key": "49ed4d0f9499e7a5d26339731f8a16cd"
    }

    data = {
        "text": text,
        "model_id": "eleven_monolingual_v1",
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.5
        }
    }

    # Request the MP3 and collect it in chunks
    chunks = b""
    response = requests.post(url, json=data, headers=headers)
    for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
        chunks += chunk

    # Keep a copy on disk for debugging
    file_name = 'output_audio.mp3'
    with open(file_name, 'wb') as audio_file:
        audio_file.write(chunks)

    # Measure the audio duration to estimate per-word timings
    audio = AudioSegment.from_file(io.BytesIO(chunks))
    duration_ms = len(audio)
    print(f"Audio duration: {duration_ms} milliseconds")

    ALIGNMENTS = estimate_word_timings_enhanced(text, duration_ms)
    ALIGNMENTS = {
        "audio": [],
        "words": ALIGNMENTS[0],
        "wtimes": ALIGNMENTS[1],
        "wdurations": ALIGNMENTS[2]
    }

    print("Audio data returned")
    return base64.b64encode(chunks).decode("utf-8"), ALIGNMENTS


@socketio.on('speak_request')
def handle_speak_request(json_data):
    text = json_data.get('text')
    print("Got direct speak request!")

    if text:
        audio_data, ALIGNMENTS = elevenSpeak_adapted(text)
        print("Returning the audio data!")
        emit('speak_response', {'audio_data': audio_data, 'ALIGNMENTS': ALIGNMENTS})

Frontend JS function:

async function elevenSpeak(s, node = null) {
  if (!elevenSocket) {
    elevenInputMsgs = [
      elevenBOS,
      {
        "text": s,
        "try_trigger_generation": true,
      }
    ];

    let url = location.protocol + '//' + document.domain + ':' + '5000';
    console.log("Sending request to " + url);
    console.log(cfg('voice-lipsync-lang'));
    console.log(cfg('voice-eleven-id'));

    // Make the connection
    elevenSocket = io.connect(url);

    // Connection opened
    elevenSocket.on("connect", function () {
      console.log("Socket is opened");
      elevenOutputMsg = null;
      while (elevenInputMsgs.length > 0) {
        // Shift each queued message only once and send it
        const msg = elevenInputMsgs.shift();
        console.log(`Sending to server through socket ${JSON.stringify(msg)}`);
        elevenSocket.emit("speak_request", msg);
      }
    });

    // New message received
    elevenSocket.on("speak_response", function (r) {
      console.log("Received message");
      console.log(r);

      // Speak audio
      if ((r.isFinal || r.normalizedAlignment) && elevenOutputMsg) {
        console.log("(r.isFinal || r.normalizedAlignment) && elevenOutputMsg");
        head.speakAudio(elevenOutputMsg, { lipsyncLang: cfg('voice-lipsync-lang') }, node ? addText.bind(null, node) : null);
        elevenOutputMsg = null;
      }

      if (!r.isFinal) {
        // New part
        if (r.alignment) {
          elevenOutputMsg = { audio: [], words: [], wtimes: [], wdurations: [] };

          // Parse chars to words
          let word = '';
          let time = 0;
          let duration = 0;
          for (let i = 0; i < r.alignment.chars.length; i++) {
            if (word.length === 0) time = r.alignment.charStartTimesMs[i];
            if (word.length && r.alignment.chars[i] === ' ') {
              elevenOutputMsg.words.push(word);
              elevenOutputMsg.wtimes.push(time);
              elevenOutputMsg.wdurations.push(duration);
              word = '';
              duration = 0;
            } else {
              duration += r.alignment.charDurationsMs[i];
              word += r.alignment.chars[i];
            }
          }
          // Add the last word if it's not empty
          if (word.length) {
            elevenOutputMsg.words.push(word);
            elevenOutputMsg.wtimes.push(time);
            elevenOutputMsg.wdurations.push(duration);
          }
        }

        // Add audio content to message
        if (r.audio && elevenOutputMsg) {
          elevenOutputMsg.audio.push(head.b64ToArrayBuffer(r.audio));
        }
      }
    });

    elevenSocket.on("disconnect", (reason) => {
      if (reason === 'io server disconnect') {
        console.log("Socket connection has been closed by the server");
      } else {
        console.warn('Connection died', reason);
      }
      elevenSocket = null;
    });

    elevenSocket.on("connect_error", (error) => {
      console.error("Connection error:", error);
    });

  } else {
    // If the socket is already open, send the message directly
    let msg = {
      "text": s,
      "try_trigger_generation": s.length > 0
    };
    elevenSocket.emit("speak_request", msg);
  }
}


Here is the output from all the console.log calls in the console:

![image](https://github.com/met4citizen/TalkingHead/assets/162581671/f2e0a534-cf4d-4663-b8f4-9adb3aa8e23f)

Please know that I removed the jwt from frontend wherever it was required or used.

I hope you can help with this issue.
Thank you so much,
Arsal
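For reference, the object that the frontend code above builds for head.speakAudio has audio chunks plus word timings in milliseconds. A minimal hand-written example of that shape (the values below are invented for illustration) looks like this:

// Illustrative speakAudio payload (timings in milliseconds, values invented for the example):
const example = {
  audio: [ /* one or more ArrayBuffers of audio data, e.g. from head.b64ToArrayBuffer(...) */ ],
  words: ["Hello", "there"],
  wtimes: [0, 450],        // start time of each word
  wdurations: [400, 350]   // duration of each word
};
// head.speakAudio(example, { lipsyncLang: "en" });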

More realism via Controlnet Stable Diffusion?

Again, fantastic project! I was wondering whether it is possible to locally hook this up to something like ControlNet Stable Diffusion to create an output layer that would make the ReadyPlayerMe avatar look like a realistic, photographic face.

I'm sure this is outside the scope of this great library but I just wanted to mention it. Imagine a real-looking avatar thanks to AI!

help

I'm sorry, this is not really an issue, but I'd like a private contact; I just want some help.

If you don't want to share any private contact info here, here is mine (Telegram):

telegram

I'm sorry, but if you don't have Telegram, could you make an account? If you delete it later I don't mind, but I really need your help because you know JS (I've contacted you before many times).

Send me a message, and when we talk, we can talk in a secret chat. Thanks!

After you add me, if I don't respond for a while, comment your name here (not your username, for privacy) and I'll search for you, because I have too many chats.

Speak Audio with PCM Buffer, no lip sync

Hi,

First of all, thank you for this great work.

I have a problem with the speakAudio function. I pass an object like this:

let object = { audio: data.buffer, words: ['Ça', 'va', 'bien', 'grâce', 'à', 'toi'], wtimes: [93.75, 262.5, 481.25, 693.75, 862.5, 1118.75], wdurations: [87.5, 187.5, 212.5, 50, 100, 56.25] };

The audio plays fine (PCM buffer), but there is no animation and no lip-sync, and no error either.

I wonder if I'm forgetting something or whether it's a bug.
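Not a definitive fix, but one thing worth checking: a sketch, assuming the class expects decoded audio (an AudioBuffer) rather than a raw PCM byte buffer (verify this against the speakAudio documentation for your version), that wraps 16-bit mono PCM in a Web Audio AudioBuffer before calling speakAudio. The 16000 Hz sample rate below is an assumption; use your real one.

// Sketch: wrap raw 16-bit mono PCM in an AudioBuffer (Web Audio API), then speak it.
const audioCtx = new AudioContext();
const int16 = new Int16Array(data.buffer);           // raw PCM samples
const float32 = new Float32Array(int16.length);
for (let i = 0; i < int16.length; i++) {
  float32[i] = int16[i] / 32768;                     // scale to -1..1
}
const audioBuffer = audioCtx.createBuffer(1, float32.length, 16000);  // assumed sample rate
audioBuffer.copyToChannel(float32, 0);

head.speakAudio({
  audio: audioBuffer,
  words: ['Ça', 'va', 'bien', 'grâce', 'à', 'toi'],
  wtimes: [93.75, 262.5, 481.25, 693.75, 862.5, 1118.75],
  wdurations: [87.5, 187.5, 212.5, 50, 100, 56.25]
});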

Attach prop?

Is there a way to parent a 3D object to the avatar? I'm not even looking to have it look like it's held in the hand or anything, just something positioned roughly in front of the person (ideally sticking to the avatar's pivot). I have the GLB objects ready.

And then also a way to remove that prop again, or swap it out for another one, ideally with the ability to set its relative position.

Background: I'm thinking of letting players add story-influencing props, like "Staff of Bad Luck", to their adventure story host.

Cheers!
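Not an official feature as far as I can tell, but since the class renders with three.js, here is a hedged sketch of one way to do it. It assumes you can get a reference to the avatar's Object3D (avatarRoot below) and that the rig exposes Mixamo-style bone names such as 'RightHand'; both are assumptions to verify against your setup.

import { GLTFLoader } from 'three/addons/loaders/GLTFLoader.js';

// Sketch: parent a prop GLB to a bone of the avatar, with a relative offset.
async function attachProp(avatarRoot, url, boneName = 'RightHand') {
  const gltf = await new GLTFLoader().loadAsync(url);
  const prop = gltf.scene;
  const bone = avatarRoot.getObjectByName(boneName);   // assumed bone name
  if (bone) {
    bone.add(prop);
    prop.position.set(0, 0.1, 0);                      // relative position, tweak as needed
  }
  return prop;
}

// Removing or swapping: keep the returned object, call prop.removeFromParent(),
// then attach the next one the same way.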

divides tts sentences

Hello there. I made this same issue before from a different account, so you may not remember it, but you told me to do this:
const dividersSentence = /[\p{Extended_Pictographic}]/ug;
It was working fine, but one day it started dividing the sentences again, so I installed the latest version and made the same change I had made in the old one, but it still didn't work. BTW, here are the Python server logs so you can see what is happening:


  • Serving Flask app 'apimain'
  • Debug mode: off
    WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
  • Running on http://127.0.0.1:6699
    Press CTRL+C to quit
    127.0.0.1 - - [21/Jun/2024 22:52:26] "GET /login HTTP/1.1" 200 -
    127.0.0.1 - - [21/Jun/2024 22:52:26] "OPTIONS /moderation_ai HTTP/1.1" 200 -
    127.0.0.1 - - [21/Jun/2024 22:52:26] "POST /moderation_ai HTTP/1.1" 200 -
    127.0.0.1 - - [21/Jun/2024 22:52:26] "GET /login HTTP/1.1" 200 -
    127.0.0.1 - - [21/Jun/2024 22:52:26] "OPTIONS /chatbot HTTP/1.1" 200 -
    Received request data: {'model': 'gpt-3.5-turbo', 'messages': [{'role': 'user', 'content': 'nick: hi there this is just a test.'}], 'temperature': 1, 'presence_penalty': 0, 'frequency_penalty': 0, 'max_tokens': 1000, 'stream': True}
    Last message: {'role': 'user', 'content': 'nick: hi there this is just a test.'}
    Content of last message: nick: hi there this is just a test. <-- my message
    typing message
    sent
    Finding response...
    Captcha button not found, proceeding without interaction.
    Matches found:
    2b: Hey there, no problem, just testing things out. Always good to make sure everything is working properly. winks <-- the response
    Only one match found: 2b: Hey there, no problem, just testing things out. Always good to make sure everything is working properly. winks <-- it returns it to talking head
    127.0.0.1 - - [21/Jun/2024 22:52:39] "POST /chatbot HTTP/1.1" 200 -
    127.0.0.1 - - [21/Jun/2024 22:52:39] "GET /login HTTP/1.1" 200 -
    127.0.0.1 - - [21/Jun/2024 22:52:39] "GET /login HTTP/1.1" 200 -
    127.0.0.1 - - [21/Jun/2024 22:52:39] "OPTIONS /gtts HTTP/1.1" 200 -
    127.0.0.1 - - [21/Jun/2024 22:52:39] "OPTIONS /moderation_ai HTTP/1.1" 200 -
    twob Hey there, no problem, just testing things out. <-- for some reason it divides it
    127.0.0.1 - - [21/Jun/2024 22:52:39] "POST /moderation_ai HTTP/1.1" 200 -
    #generates audio
    127.0.0.1 - - [21/Jun/2024 22:53:22] "POST /gtts HTTP/1.1" 200 -
    127.0.0.1 - - [21/Jun/2024 22:53:25] "GET /login HTTP/1.1" 200 -
    127.0.0.1 - - [21/Jun/2024 22:53:25] "OPTIONS /gtts HTTP/1.1" 200 -
    Always good to make sure everything is working properly. <-- divided sentence
    #generates audio
    Done.
    127.0.0.1 - - [21/Jun/2024 22:54:16] "POST /gtts HTTP/1.1" 200 -
    127.0.0.1 - - [21/Jun/2024 22:54:20] "GET /login HTTP/1.1" 200 -
    127.0.0.1 - - [21/Jun/2024 22:54:20] "OPTIONS /gtts HTTP/1.1" 200 -
    winks <-- last divided sentence
    #generates audio
    127.0.0.1 - - [21/Jun/2024 22:55:01] "POST /gtts HTTP/1.1" 200 -

So as you can see, it divides the text and then sends each piece back separately. I don't want that, because generating the audio takes a long time (about 20 s per request, since it uses a Hugging Face Coqui TTS through an API), so splitting makes it really slow to say anything, whereas an undivided sentence is read normally. Is there a way to stop it from dividing the text? (I tried const dividersSentence = /[\p{Extended_Pictographic}]/ug; but it still didn't work.) Thanks!

SpeakText

Hi, thanks for the great project. I have a silly question: I just want lip-sync (actually, just a speaking animation, no sync needed for now). I tried head.speakText("Hello how are you today"); but had no luck.

I also tried this:

var vv = new LipsyncEn();
var vis = vv.wordsToVisemes('Hello, how are you today?');
console.log(vis);

for (let j = 0; j < vis.visemes.length; j++) {
    console.log(vis.times[j]);
    head.setFixedValue("viseme_" + vis.visemes[j], vis.durations[j]);
}

I am not a JS dev; is there any solution? Thanks

How to generate other language modules

Hi, I don't know how to create files similar to lipsync-fi.mjs and lipsync-en.mjs. I want to create lip-sync modules for other languages, but I don't know how to do it. Can you provide more detailed methods or steps?

I would be deeply grateful!
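A rough skeleton of what such a module might look like, inferred from how the English module is used elsewhere on this page (wordsToVisemes returning visemes with times and durations). The class name, the preProcessText method, and the placeholder mapping are assumptions; compare against the real lipsync-en.mjs/lipsync-fi.mjs before relying on it.

// lipsync-xx.mjs -- skeleton for a hypothetical new lip-sync language module.
class LipsyncXx {

  // Optional text clean-up before viseme conversion (numbers, abbreviations, etc.)
  preProcessText(s) {
    return s;
  }

  // Convert one word into Oculus viseme codes with relative timing.
  // The returned shape mirrors what wordsToVisemes('...') returns for the English module:
  // visemes, times and durations arrays.
  wordsToVisemes(w) {
    const o = { words: w, visemes: [], times: [], durations: [] };
    // TODO: grapheme/phoneme-to-viseme rules for your language go here. As a placeholder,
    // map every character to a neutral viseme of unit length.
    for (let i = 0; i < w.length; i++) {
      o.visemes.push('aa');
      o.times.push(i);
      o.durations.push(1);
    }
    return o;
  }
}

export { LipsyncXx };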

Dynamic change tts parameters?

Hello, how can I change the ttsLang and ttsVoice parameters by clicking country flags? I tried

// onclick event
head.showAvatar({
  ttsLang: langOptions[lang]['ttsLang'],
  ttsVoice: langOptions[lang]['ttsVoice'],
  ...
}); // reload via head.showAvatar()

but no luck. I also tried

// onclick event
head.ttsLang = langOptions[lang]['ttsLang'];
head.ttsVoice = langOptions[lang]['ttsVoice'];

Thanks...
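One approach worth trying, though I'm not certain of it: if your version of speakText accepts a per-call options object that can override voice settings (check the class's API reference), you could keep the avatar loaded and only switch the options you pass per utterance. A hedged sketch, where langOptions is the same lookup table as above and the lipsyncLang field is an assumption:

// Hypothetical flag click handler: override TTS settings per call instead of reloading the avatar.
function onFlagClick(lang) {
  const opts = {
    ttsLang: langOptions[lang]['ttsLang'],
    ttsVoice: langOptions[lang]['ttsVoice'],
    lipsyncLang: langOptions[lang]['lipsyncLang']   // assumed field, if you keep one per language
  };
  head.speakText("Language switched.", opts);       // assumes speakText(text, opt) supports overrides
}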

play emojis in text

I've got some emojis embedded in the text to be spoken. I have used Azure TTS to get the audio offsets for them, and I have mixed the emojis into the word list of the speech data. How do I schedule the emoji animations? It's not obvious. I don't think I can use the speakEmoji() method directly, can I?

Thanks!

All eye movement directives not affecting the eyeballs

This may be related to the specific Ready Player Me model used, but I have found that none of the eye-related blendshapes has any impact on the eyeball position. Looking up/down/left/right etc. has no effect on the eyeballs. I suspect the eyes can only be moved by changing the quaternions of the eye "bones".

Callback for speech end

Hello,

Amazing work!
Is there any way to know that the speech has ended after calling the head.speakAudio() function?
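One pattern that seems to work, based on the speakMarker usage mentioned in a later issue on this page: queue a marker right after the audio, so its callback fires when the speech queue reaches it. A minimal sketch, assuming speakMarker accepts a callback:

// Sketch: get notified when the queued speech has finished playing.
head.speakAudio(audioObject, { lipsyncLang: "en" });
head.speakMarker(() => {
  console.log("Speech ended");   // runs when the speech queue reaches this marker
});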

Make Avatars more realistic

Hi, first of all I want to thank you for open-sourcing this great project. I have already read the other issues regarding the obstacles to making the models more realistic (using the same mesh and Mixamo-compatible bone structure as Ready Player Me, and having the ARKit and Oculus blendshapes).

But I have tried my best to create a better model from scratch and was unsuccessful. Either the structure is Mixamo-compatible but not compatible with the library (I believe it has something to do with the geometry of the model; the second it gets imported and the idle animations kick in, I can see the idle animations are not normal at all), or the lips don't move at all, even a bit, when the lip-sync gets triggered.

So I wanted to ask: is there any kind of reproducible example model from outside websites like Ready Player Me or Avaturn? (We could use their mesh and the modules and just change the bone structure and texture.)

Get current playing segment for subtitles

Hello, I am using the head.speakAudio function for talking and the head.speakMarker function to know when the speech has ended.

Is there any prebuilt function or callback to know which segment is being played right now, so that it can be shown as a subtitle?

I see there are some callbacks called onsubtitles in the code, but I didn't quite understand how to use them.

Thank you
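For what it's worth, the ElevenLabs example earlier on this page passes a callback as the third argument to speakAudio, and that callback receives the text to append as it is spoken. A minimal sketch along those lines (the exact callback signature may vary by version, and the subtitles element is an assumption):

// Sketch: show the currently spoken words as subtitles via the onsubtitles callback.
const subtitleNode = document.getElementById('subtitles');   // assumed element
head.speakAudio(
  audioObject,
  { lipsyncLang: "en" },
  (text) => { subtitleNode.textContent += text; }             // called as segments are spoken
);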

easy-speech

Fascinating project,

I was looking for avatars that I can use in LiaScript as teachers for educational content...

Maybe this is something for you: we use easy-speech as a single interface to the TTS of different browsers, and it works very well for us.

https://github.com/leaonline/easy-speech

Mood ignored?

Hi! When trying the mp3.html example and changing avatarMood from 'neutral' to e.g. 'happy' or 'angry', the avatar still remains neutral.

(This even happens if I amend the code to not cause a 'user gesture required' warning in Chrome.)

Help please?

JWT for Flask service

Hi, I am a newbie with this library. I am trying to create the proxy so I can use the library with Flask behind nginx, but I can't get it to work.

The app is the following:

from flask import Flask, render_template, request, jsonify
from flask_jwt_extended import JWTManager, create_access_token
import requests
import os  
from dotenv import load_dotenv  # To load environment variables from .env

app = Flask(__name__)
app.config['SECRET_KEY'] = 'secreto123456'
app.config['JWT_SECRET_KEY'] = 'secreto123456'
jwt = JWTManager(app)


@app.route('/token', methods=['POST'])
def token():
    access_token = create_access_token(identity="guest")
    return jsonify(access_token=access_token)


@app.route('/')
def main():
    return render_template('main.html')


@app.route('/gtts', methods=['POST'])
def gtts_proxy():
    token = request.headers.get("Authorization")
    headers = {'Authorization': token} if token else {}
    json_data = request.get_json()
    
    response = requests.post(
        'https://eu-texttospeech.googleapis.com/v1beta1/text:synthesize?key=the-api-key',
        headers=headers,
        json=json_data
    )
    
    if response.status_code != 200:
        return jsonify({"error": "Error en la solicitud a Google TTS", "details": response.json()}), response.status_code
    
    return jsonify(response.json()), response.status_code

if __name__ == '__main__':
    load_dotenv()
    app.run(debug=True, host='0.0.0.0', port=5150)


The HTML/JavaScript is the following (main.html):

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Main Page</title>
    <link rel="stylesheet" href="/static/css/styles.css">  <!-- Link your CSS file -->
</head>
<body>
  <h1>TalkingHead Avatar</h1>
  <div id="avatar"></div>
  <div id="controls">
    <input id="text" type="text" value="Hi there. How are you? I'm fine.">
    <input id="speak" type="button" value="Speak">
  </div>
  <div id="loading"></div>

  <script type="importmap">
  { "imports":
    {
      "three": "https://cdn.jsdelivr.net/npm/[email protected]/build/three.module.js/+esm",
      "three/addons/": "https://cdn.jsdelivr.net/npm/[email protected]/examples/jsm/",
      "talkinghead": "https://cdn.jsdelivr.net/gh/met4citizen/[email protected]/modules/talkinghead.mjs"
    }
  }
  </script>
    <script type="module">
        import { TalkingHead } from "talkinghead";

        async function jwtGet() {
           const response = await fetch('/token', { method: 'POST' });
           const data = await response.json();
           const token = data.access_token;
           //alert(token);  // show token
           return token;
        }

        // Initialize the avatar when the DOM has loaded
        document.addEventListener('DOMContentLoaded', async () => {

            const nodeAvatar = document.getElementById('avatar');
            const head = new TalkingHead(nodeAvatar, {
                ttsEndpoint: "/gtts/",
                jwtGet: jwtGet,
                cameraZoomEnable: true,
                cameraPanEnable: true,
                cameraView: 'full',
                lipsyncModules: ["en", "fi"]
            });

            // Load and show the avatar
            const nodeLoading = document.getElementById('loading');
            try {
                nodeLoading.textContent = "Loading...";
                await head.showAvatar( {
                    url: 'https://models.readyplayer.me/64bfa15f0e72c63d7c3934a6.glb?morphTargets=ARKit,Oculus+Visemes,mouthOpen,mouthSmile,eyesClosed,eyesLookUp,eyesLookDown&textureSizeLimit=1024&textureFormat=png',
                    body: 'F',
                    avatarMood: 'neutral',
                    ttsLang: "en-GB",
                    ttsVoice: "en-GB-Standard-A",
                    lipsyncLang: 'en'
                }, (ev) => {
                    if ( ev.lengthComputable ) {
                        let val = Math.min(100, Math.round(ev.loaded / ev.total * 100));
                        nodeLoading.textContent = "Loading " + val + "%";
                    }
                });
                nodeLoading.style.display = 'none';
            } catch (error) {
                console.log(error);
                nodeLoading.textContent = error.toString();
            }

            // Speak when clicked
            const nodeSpeak = document.getElementById('speak');
            nodeSpeak.addEventListener('click', function () {
                try {
                    const text = document.getElementById('text').value;
                    if ( text ) {
                        head.speakText( text );
                    }
                } catch (error) {
                    console.log(error);
                }
            });

        });

       
    </script>
</body>
</html>


The nginx configuration is the following:

server {
    listen 80;
    server_name mydomain.net;

    location / {
        proxy_pass http://127.0.0.1:5150;  # Port where Gunicorn is listening
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_redirect off;
    }

    # Proxy for the Google TTS API
    location /gtts/ {
        proxy_pass http://127.0.0.1:5150/gtts/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Location for serving static files
    location /static/ {
        alias /var/www/mydomain/static/;  # static files
    }
   
    location ~* \.mjs$ {
        add_header Content-Type application/javascript;
    }

    listen 443 ssl;
    ssl_certificate /etc/letsencrypt/live/mydomain.net/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/mydomain.net/privkey.pem;
   
}

I'm not an expert in Flask, but I think the nginx config file is OK?
The output page works and loads the avatar:

(screenshot attached)

But when I try to make it speak, this happens in the Google Chrome console:

(screenshot attached)

Please help me fix these errors so that I can work with this library.

demo

I love this project!

I am porting it to TypeScript & React Native. When I get it working, I'm happy to contribute it back if you are interested.

Any advice on where I could research how to build a lip-sync library for a bunch of languages (Chinese, Korean, German, Dutch, ...)?

I would love to show you a demo of how I am using it.
https://www.linkedin.com/in/thejustinmann/

Can I use OpenAI TTS?

Fantastic project, thank you!

Is it possible to use OpenAI's Text-to-Speech too? They have a superb voice feel and intonation and can move fluently across different languages (without even needing to be set to a specific language).

Thanks!
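Not built in as far as I can tell, but since the class can lip-sync externally generated audio via speakAudio, one hedged way to wire it up is to fetch the audio from OpenAI's speech endpoint yourself, decode it, and supply word timings (either estimated, or taken from a transcription step). A rough sketch: the endpoint, model and voice names are OpenAI's; everything else, including OPENAI_API_KEY and the assumption that speakAudio accepts a decoded AudioBuffer, is a placeholder to verify.

// Sketch: OpenAI TTS -> AudioBuffer -> TalkingHead speakAudio with crude, evenly spread word timings.
async function speakWithOpenAI(text) {
  const res = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      "Authorization": "Bearer " + OPENAI_API_KEY,          // placeholder for your key
      "Content-Type": "application/json"
    },
    body: JSON.stringify({ model: "tts-1", voice: "alloy", input: text })
  });
  const arrayBuffer = await res.arrayBuffer();
  const audioCtx = new AudioContext();
  const audioBuffer = await audioCtx.decodeAudioData(arrayBuffer);

  // Spread the words evenly over the clip length; real word timestamps (e.g. from Whisper) work better.
  const words = text.split(/\s+/).filter(Boolean);
  const totalMs = audioBuffer.duration * 1000;
  const step = totalMs / words.length;
  const wtimes = words.map((w, i) => i * step);
  const wdurations = words.map(() => step * 0.8);

  head.speakAudio({ audio: audioBuffer, words, wtimes, wdurations }, { lipsyncLang: "en" });
}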

Pose transitions

Hi, I was reading the code, and how the pose transition works was not immediately obvious to me. The poseAvatar object seems to pick up all the bone attributes in each animation frame, but how are the values applied to all the joints?

Auto lipsync with mp3 file *Feature Request*

Hi there,

Is there a way to feed in an MP3 file directly and have the avatar lip-sync to it?

<title>Talking Head - MP3 example</title>

<style>
  body, html { width:100%; height:100%; max-width: 800px; margin: auto; position: relative; background-color: black; color: white; }
  #avatar { display: block; width:100%; height:100%; }
  #controls { display: block; position: absolute; top: 10px; left: 10px; right: 10px; height: 50px; }
  #text { position: absolute; width: Calc( 100% - 110px ); height: 100%; top: 0; left: 0; bottom: 0; right: 110px; font-family: Arial; font-size: 20px; visibility: hidden; } /* Hide the text input as it's not needed for MP3 playback */
  #speak { display: block; position: absolute; top: 0; bottom: 0; right: 0; height: 100%; width: 100px; font-family: Arial; font-size: 20px; }
</style>

<script type="importmap">
{ "imports":
  {
    "three": "https://cdn.jsdelivr.net/npm/[email protected]/build/three.module.js/+esm",
    "three/examples/": "https://cdn.jsdelivr.net/npm/[email protected]/examples/",
    "three/addons/": "https://cdn.jsdelivr.net/npm/[email protected]/examples/jsm/",
    "dompurify": "https://cdn.jsdelivr.net/npm/[email protected]/+esm",
    "marked": "https://cdn.jsdelivr.net/npm/[email protected]/+esm",
    "talkinghead": "https://cdn.jsdelivr.net/gh/met4citizen/[email protected]/modules/talkinghead.mjs"
  }
}
</script>

<script type="module">
import { TalkingHead } from "talkinghead";
let head;
let audio = new Audio('taviet.mp3'); // URL to your MP3 file

document.addEventListener('DOMContentLoaded', async function(e) {
  const nodeAvatar = document.getElementById('avatar');
  head = new TalkingHead(nodeAvatar, {
    ttsEndpoint: "https://dummy.endpoint/tts", // Placeholder endpoint
    ttsApikey: "dummy-api-key", // Placeholder API key
    cameraView: "upper"
  });

  try {
    await head.showAvatar({
      url: 'https://models.readyplayer.me/64bfa15f0e72c63d7c3934a6.glb?morphTargets=ARKit,Oculus+Visemes,mouthOpen,mouthSmile,eyesClosed,eyesLookUp,eyesLookDown&textureSizeLimit=1024&textureFormat=png',
      body: 'F',
      avatarMood: 'neutral',
    });
  } catch (error) {
    console.log(error);
  }

  const nodeSpeak = document.getElementById('speak');
  nodeSpeak.addEventListener('click', function () {
    audio.play();
  });

});
</script>
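Regarding the feature request itself, a hedged sketch of one way to get there today: obtain word-level timings for the MP3 from OpenAI's Whisper transcription endpoint and hand both the decoded audio and the timings to speakAudio. The Whisper endpoint and its parameters are OpenAI's; the rest, including OPENAI_API_KEY and the assumption that speakAudio accepts a decoded AudioBuffer, should be verified against the class documentation.

// Sketch: MP3 file -> Whisper word timestamps -> head.speakAudio for lip-synced playback.
async function speakMp3(mp3Url) {
  // 1. Fetch and decode the MP3 (decodeAudioData detaches the buffer, so pass a copy).
  const mp3 = await (await fetch(mp3Url)).arrayBuffer();
  const audioCtx = new AudioContext();
  const audioBuffer = await audioCtx.decodeAudioData(mp3.slice(0));

  // 2. Transcribe with word-level timestamps.
  const form = new FormData();
  form.append("file", new Blob([mp3], { type: "audio/mpeg" }), "speech.mp3");
  form.append("model", "whisper-1");
  form.append("response_format", "verbose_json");
  form.append("timestamp_granularities[]", "word");
  const tr = await (await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { "Authorization": "Bearer " + OPENAI_API_KEY },   // placeholder for your key
    body: form
  })).json();

  // 3. Convert seconds to milliseconds and speak.
  const words = tr.words.map(w => w.word);
  const wtimes = tr.words.map(w => w.start * 1000);
  const wdurations = tr.words.map(w => (w.end - w.start) * 1000);
  head.speakAudio({ audio: audioBuffer, words, wtimes, wdurations }, { lipsyncLang: "en" });
}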
