AuViMi stands for audio-visual mirror. The idea is to have CLIP generate its interpretation of what your webcam sees, combined with the words that are spoken.
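At its core this means scoring live webcam frames against text with CLIP. Here is a minimal sketch of that step, assuming OpenAI's `clip` package and OpenCV; the candidate captions are hypothetical stand-ins for the transcribed speech AuViMi would feed in.

```python
import cv2
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

cap = cv2.VideoCapture(0)   # open the default webcam
ok, frame = cap.read()      # grab a single frame
cap.release()
if not ok:
    raise RuntimeError("could not read from webcam")

# Convert the BGR frame to a PIL image and preprocess it for CLIP.
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
image = preprocess(Image.fromarray(rgb)).unsqueeze(0).to(device)

# Stand-in captions; the real input would come from speech-to-text.
texts = clip.tokenize(["a person talking", "an empty room"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)
print(probs)  # CLIP's "interpretation": how well each caption matches the frame
```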
I've been lurking a bit on this project since you started posting on the deep-daze issues. It seems very cool to me, and I'm curious how you intend to solve the real-time constraint that I think you have.
Is that right? You want to generate images fast enough that a live webcam feed can serve as input to deep-daze?
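For context on what "fast enough" means here, a sketch of the loop shape under discussion; `generation_step` is a hypothetical placeholder for one CLIP-guided deep-daze optimization step, not a real API, and the timing inside it is an assumed cost.

```python
import time
import cv2

def generation_step(frame):
    """Hypothetical stand-in for one CLIP-guided optimization step."""
    time.sleep(0.5)  # assumed cost; a real GPU step is far above the frame budget

cap = cv2.VideoCapture(0)
frame_budget = 1.0 / 30.0  # ~33 ms per frame at 30 fps
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        start = time.time()
        generation_step(frame)
        elapsed = time.time() - start
        # The real-time constraint: `elapsed` must stay under `frame_budget`,
        # otherwise frames have to be dropped or queued.
        print(f"step: {elapsed:.3f}s, budget: {frame_budget:.3f}s")
except KeyboardInterrupt:
    pass
finally:
    cap.release()
```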
python app.py --run_local=1 --host=localhost --python_path=/home/nerdy/anaconda3/envs/auvimi/bin/python --mode=pic --size=512 --text="An ugly human with a face like the back end of a bulldog"