magenta / ddsp

DDSP: Differentiable Digital Signal Processing

Home Page: https://magenta.tensorflow.org/ddsp

License: Apache License 2.0

Shell 0.03% Python 77.25% Jupyter Notebook 22.61% Dockerfile 0.10%

ddsp's Issues

training in gpu

Thank you for your great work.
I want to train a model on a GPU, but I get an assertion error in 3_training.ipynb:

# Setup the session.
import os
assert "COLAB_TPU_ADDR" in os.environ, "ERROR: Not connected to a TPU runtime; please set the runtime type to 'TPU'."
TPU_ADDRESS = "grpc://" + os.environ["COLAB_TPU_ADDR"]
sess = tf.Session(TPU_ADDRESS)

How can I train DDSP on a GPU?
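One way the hard assertion could be relaxed is to fall back to a local session target when no TPU address is set. This is only a sketch: `resolve_session_target` is a hypothetical helper, not part of ddsp or the notebook, and it assumes that an empty target passed to a TF1-style `tf.Session` selects the local runtime (CPU or GPU). Whether the rest of 3_training.ipynb then runs unchanged on a GPU is not confirmed here.

```python
import os


def resolve_session_target(environ=None):
    """Return the Colab TPU address as a gRPC target if set, else ''.

    Hypothetical helper for illustration. An empty target is assumed to
    mean "use the local runtime" when passed to a TF1-style session.
    """
    environ = os.environ if environ is None else environ
    tpu_addr = environ.get("COLAB_TPU_ADDR")
    if tpu_addr:
        return "grpc://" + tpu_addr
    return ""


target = resolve_session_target()
# In the notebook this would stand in for the assert, e.g.:
#   sess = tf.Session(target)
print("session target:", repr(target))
```

With this, a missing `COLAB_TPU_ADDR` no longer aborts the cell; it simply yields an empty target.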

outputs and targets being fed in reverse order to loss functions

At the end of the call function of the Autoencoder class in ddsp/training/models.py, we have these lines, which clearly pass the target audio and the synthesized audio to loss_obj, in that order:

if training:
  for loss_obj in self.loss_objs:
    self.add_loss(loss_obj(features['audio'], audio_gen))

However, in ddsp/losses.py, all of the implemented loss functions have the following signature:

def call(self, audio, target_audio):

which means that they expect the synthesized audio first and the target audio second.

So the arguments are passed in reverse order. For most loss functions this makes no difference, which is why it has gone unnoticed until now. It is still a bug, though.
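The point that argument order only matters for some losses can be checked with a toy example. The sketch below uses plain Python lists rather than tensors; `l1_loss` and `relative_loss` are illustrative stand-ins, not functions from ddsp/losses.py.

```python
def l1_loss(audio, target_audio):
    """Mean absolute difference: symmetric in its two arguments."""
    return sum(abs(a - t) for a, t in zip(audio, target_audio)) / len(audio)


def relative_loss(audio, target_audio, eps=1e-7):
    """Mean absolute error normalized by the target: not symmetric."""
    return sum(abs(a - t) / (abs(t) + eps)
               for a, t in zip(audio, target_audio)) / len(audio)


generated = [0.5, 0.25, 1.0]
target = [1.0, 0.5, 2.0]

# Swapping the arguments leaves the symmetric loss unchanged...
print(l1_loss(generated, target), l1_loss(target, generated))
# ...but changes a loss that treats the target specially.
print(relative_loss(generated, target), relative_loss(target, generated))
```

Any loss that normalizes by, or stops gradients through, one of its inputs would behave like the second case.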

UnBound Local Error Uploading one's own Timbre Transfer Model

Hi everyone,

I have tried multiple times to upload my own audio model, without success. It always works with the existing models such as the violin and trumpet, but every time I try to upload my own model the following error occurs.

Could anyone help me please? I would really appreciate it.

Best regards

Will this library work with pytorch?

Hi, DDSP looks very useful for my project, but I'm using PyTorch rather than TensorFlow. Will it work with PyTorch?

is it possible to provide the audio sample used in ICLR paper

It's truly amazing work, and thanks for the code! I'd like to reproduce the autoencoder demo model you described in the paper. Is the violin audio sample available somewhere?

Question about conditioning f0 on MIDI pitches

Since MIDI pitches essentially correspond to the base frequencies of instrumental notes, it is intuitive to condition f0 on MIDI pitches. I also saw that you mentioned the MIDI data in the NSynth dataset. However, I could not find any use of the MIDI data. So:

Have you ever tried to condition f0 on MIDI pitches?
Is there any particular reason not to do so? (For example, maybe an instrumental note does not stay at the same pitch for its whole duration, and the model relies on the exact pitch?)

timbre_transfer demo has invalid link

The Audio Examples link in the timbre_transfer demo is invalid.

Using ddsp_prepare_tfrecord with different 'example_secs' values than default

Hi, I tried using various 'example_secs' values for ddsp_prepare_tfrecord to find a way to resynthesise complete audio recordings having different lengths. Basically I am trying to train the autoencoder with a collection of train files and then feed test audio files to see how well the autoencoder can perform copy-synthesis with unseen data.

Using the default values:
--example_secs=4
--sliding_window_hop_secs=1
the process ends without problems but of course with chunks of data sized as 4 seconds.

I tried with various other settings such as:
--example_secs=0
(to keep entire file)

--example_secs=8 \
--sliding_window_hop_secs=2 \

etc...
When creating the tfrecords file, no error message is printed and the file is created as expected.

However, when trying to read the data from the resulting tfrecords file, all settings other than the default fail with the following error:

InvalidArgumentError: Key: f0_confidence. Can't parse serialized Example.
[[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]]

Any suggestions to keep original length of data while packing in tfrecords format?

typo in documentation of frequency_impulse_response() and frequency_filter()

In core.frequency_impulse_response() and core.frequency_filter(), the documentation states:

The frequencies of the last dimension are ordered as [0, f_nyquist / (n_frames -1), ..., f_nyquist], here f_nyquist is (sample_rate / 2).

Shouldn't this be f_nyquist / (n_frequencies - 1)?

It didn't make sense to me signal-processing-wise, so I think it's a typo?
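To illustrate why n_frequencies is the natural denominator, here is a small sketch that builds n_frequencies evenly spaced bins from 0 Hz up to f_nyquist. The helper `bin_frequencies` and the example numbers (16 kHz sample rate, 5 bins) are chosen for illustration only and do not come from the ddsp code.

```python
def bin_frequencies(n_frequencies, sample_rate):
    """Evenly spaced frequencies from 0 Hz to Nyquist, inclusive.

    Requires n_frequencies >= 2 so that the step size is defined.
    """
    f_nyquist = sample_rate / 2
    step = f_nyquist / (n_frequencies - 1)
    return [i * step for i in range(n_frequencies)]


freqs = bin_frequencies(n_frequencies=5, sample_rate=16000)
print(freqs)  # [0.0, 2000.0, 4000.0, 6000.0, 8000.0]
```

The spacing depends only on how many frequency bins there are; the number of time frames does not enter the calculation.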

f0 in gansynth_subset is wrong

The latest crepe (==0.0.10) has a bug in crepe.predict, and ddsp uses it to calculate f0.
Related issue: marl/crepe#49
Where used:

ddsp/ddsp/spectral_ops.py, lines 267 to 273 in 4472de0:

  _, f0_hz, f0_confidence, _ = crepe.predict(
      audio,
      sr=sample_rate,
      viterbi=viterbi,
      step_size=crepe_step_size,
      center=False,
      verbose=0)

I noticed that the f0 in gansynth_subset, provided via tensorflow_datasets, was calculated with this bug. Fortunately, the impact of the bug is not large. I checked some examples (not all), and about 1.2% of the values in each example are wrong.

License?

Hi. What license is assigned to this project?

train_autoencoder.ipynb error

When I run this code:

data_provider = ddsp.training.data.TFRecordProvider(TRAIN_TFRECORD_FILEPATTERN)
dataset = data_provider.get_dataset(shuffle=False)
ex = next(iter(dataset))

I get an OutOfRangeError:

---------------------------------------------------------------------------
OutOfRangeError                           Traceback (most recent call last)
/tensorflow-2.1.0/python3.6/tensorflow_core/python/eager/context.py in execution_mode(mode)
   1896     ctx.executor = executor_new
-> 1897     yield
   1898   finally:

11 frames
OutOfRangeError: End of sequence [Op:IteratorGetNextSync]

During handling of the above exception, another exception occurred:

OutOfRangeError                           Traceback (most recent call last)
OutOfRangeError: End of sequence

During handling of the above exception, another exception occurred:

StopIteration                             Traceback (most recent call last)
/tensorflow-2.1.0/python3.6/tensorflow_core/python/data/ops/iterator_ops.py in next(self)
    674       return self._next_internal()
    675     except errors.OutOfRangeError:
--> 676       raise StopIteration
    677 
    678   @property

StopIteration:

I don't know whether the cause is too little training data; I only uploaded one file for training.

Cannot access tutorial 1 in colab

The file https://colab.research.google.com/github/magenta/ddsp/blob/master/ddsp/colab/tutorials/1_synth_and_effects.ipynb cannot be accessed.

very bad reconstruction if I change frame_rate and example_secs

Hi, I have been using the DDSP autoencoder for audio-to-audio reconstruction and it works really well. But when I change frame_rate from 250 to 50/100/200 and example_secs to 2/4, the reconstruction makes no sense at all. This is very strange to me; I expected DDSP to be robust to frame_rate and example_secs. I can see the loss decreasing, but the spectrogram and audio reconstructions are not good. Previously, with a frame rate of 250 and 4 seconds, I got a correlation coefficient over 0.8; now the correlation between the ground-truth and reconstructed spectrograms is 0.
I attached the reconstructed result: https://drive.google.com/file/d/1E3IMQHnQQpYT_uuRQGgdEjvbptXmCEBF/view?usp=sharing
It would be great if you could give me a hint as to why frame_rate and example_secs are so important. Or are there other parameters I overlooked that also need to be changed?

--gin_param="TFRecordProvider.frame_rate = 200" \
--gin_param='TFRecordProvider.example_secs = 2' \
--gin_param='DefaultPreprocessor.time_steps = 400' \
--gin_param='Additive.n_samples = 32000' \
--gin_param='FilteredNoise.n_samples = 32000' \
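As a sanity check on the flags above, the sequence lengths can be derived from the frame rate, the example length and the audio sample rate. This sketch assumes a 16 kHz sample rate (implied by 32000 samples for 2 seconds, but not stated in the issue); `derived_lengths` is a hypothetical helper, not a ddsp function.

```python
def derived_lengths(example_secs, frame_rate, sample_rate=16000):
    """Return (time_steps, n_samples) implied by the given settings.

    sample_rate=16000 is an assumption inferred from the flags above.
    """
    time_steps = int(example_secs * frame_rate)
    n_samples = int(example_secs * sample_rate)
    return time_steps, n_samples


print(derived_lengths(example_secs=2, frame_rate=200))  # (400, 32000)
```

Under that assumption the flags are mutually consistent, so the arithmetic alone does not explain the poor reconstruction; other settings that depend on the frame rate are not covered by this check.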

First cell of train_autoencoder.ipynb

Hi, and first of all thanks for your work.

When trying to execute the first cell of the train_autoencoder notebook, I get this output:

TensorFlow 2.x selected.
     |████████████████████████████████| 92kB 4.4MB/s 
     |████████████████████████████████| 3.1MB 14.7MB/s 
     |████████████████████████████████| 3.4MB 40.2MB/s 
     |████████████████████████████████| 368kB 64.2MB/s 
     |████████████████████████████████| 59.2MB 48kB/s 
     |████████████████████████████████| 61kB 9.8MB/s 
     |████████████████████████████████| 81kB 12.6MB/s 
     |████████████████████████████████| 235kB 51.3MB/s 
     |████████████████████████████████| 51kB 8.6MB/s 
     |████████████████████████████████| 1.2MB 58.3MB/s 
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

In the logs, I see 9 info messages, preceded by the following 2 warnings (I don't know if they are related):

warn("IPython.utils.traitlets has moved to a top-level traitlets package.")

/usr/local/lib/python2.7/dist-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.

Input dimension errors when using multiple GPUs (tutorial 3_training)

Hello. I've been working through the tutorials, running them locally on my machine. When running with a single GPU, I was able to get the 3_training.ipynb running just fine after adding in

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

When I have my second GPU enabled as well:

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')

I get an error on this line: trainer.build(next(iter(dataset))):

InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Input dimension 2 must have length of at least 256 but got: 64
	 [[node replica_1/autoencoder/processor_group/rfft (defined at /site-packages/ddsp/core.py:683) ]]
  (1) Invalid argument:  Input dimension 2 must have length of at least 256 but got: 64
	 [[node replica_1/autoencoder/processor_group/rfft (defined at /site-packages/ddsp/core.py:683) ]]
	 [[replica_1/autoencoder/processor_group/add_4/_10]]
0 successful operations.
0 derived errors ignored. [Op:__inference___call___11281]

Errors may have originated from an input operation.
Input Source operations connected to node replica_1/autoencoder/processor_group/rfft:
 replica_1/autoencoder/processor_group/frame/Reshape_4 (defined at /site-packages/ddsp/core.py:670)

Input Source operations connected to node replica_1/autoencoder/processor_group/rfft:
 replica_1/autoencoder/processor_group/frame/Reshape_4 (defined at /site-packages/ddsp/core.py:670)

Function call stack:
__call__ -> __call__

I'm running with two 2070 Supers, and I don't fully understand the whole allow_growth thing either, but I'm wondering if you may have any idea why I'm able to run on a single GPU but not both. Let me know if I can provide any more information. Thanks in advance for the help, and thanks for the awesome library!

Button Issue

I successfully cloned the repo to my Saturn Cloud Jupyter notebook, but I've run into a problem: the buttons and interactive displays don't show up in Jupyter like they do in Colab. Is there a workaround I can implement? Please help. I love what you guys are doing so far. Thanks!

With regards to a perceptual loss

Thanks for the work you've done, this work is very exciting because it's intuitive to me! I do have a question though...

It is encouraged to use the multi-resolution spectrogram loss; however, the spectrogram loss does not incorporate a number of perceptual biases:

The MEL Scale
A-Frequency weighting
Decibel scale

How come?

Furthermore, the paper mentions computing the reconstruction loss without the log scale. Given that human perception is non-linear, this choice doesn't make sense to me. Why would you compute the loss without the log scale?
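For readers less familiar with the perceptual scales named in this question, the sketch below shows two of them as plain functions: a commonly used Hz-to-mel formula and the amplitude-to-decibel conversion. It is for illustration only and is not how ddsp computes its losses; the function names and the clamping floor are assumptions of this example.

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to mels with the common 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def amplitude_to_db(amplitude, floor=1e-10):
    """Convert a linear amplitude to decibels, clamping to avoid log(0)."""
    return 20.0 * math.log10(max(amplitude, floor))


# Equal steps in Hz become progressively smaller steps in mels.
for f in (100.0, 1000.0, 8000.0):
    print(f, round(hz_to_mel(f), 1))

# Each factor of 10 in amplitude is a fixed 20 dB step.
for a in (1.0, 0.1, 0.01):
    print(a, round(amplitude_to_db(a), 1))
```

Both mappings compress large values, which is the kind of non-linearity the question is asking about.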

Can't parse serialized Example when loading data

Hey, I am trying to run the code in train_autoencoder.ipynb and I get the following error after running next(iter(dataset)):
InvalidArgumentError: Key: audio. Can't parse serialized Example.
[[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]]

I changed some of the ddsp_prepare_tfrecord settings, since my audio clips are usually shorter than 1 second:

!ddsp_prepare_tfrecord \
    --input_audio_filepatterns=$AUDIO_FILEPATTERN \
    --output_tfrecord_path=$TRAIN_TFRECORD \
    --num_shards=1 \
    --example_secs 1 \
    --sliding_window_hop_secs 0.25 \
    --alsologtostderr

I am not sure what went wrong. I can see that the dataset contains: <ParallelMapDataset shapes: {audio: (64000,), f0_confidence: (1000,), f0_hz: (1000,), loudness_db: (1000,)}, types: {audio: tf.float32, f0_confidence: tf.float32, f0_hz: tf.float32, loudness_db: tf.float32}>
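One thing worth checking is whether the shapes the data provider expects match what was written. The printed shapes (64000 audio samples, 1000 frames) correspond to 4-second examples, whereas the command above wrote 1-second examples. The sketch below does that arithmetic, assuming a 16 kHz sample rate and a 250 Hz frame rate (inferred from the printed shapes, not confirmed in the issue); `expected_shapes` is a hypothetical helper, not a ddsp function.

```python
def expected_shapes(example_secs, sample_rate=16000, frame_rate=250):
    """Feature lengths for one example; both rates are assumptions."""
    return {
        "audio": int(example_secs * sample_rate),
        "f0_hz": int(example_secs * frame_rate),
    }


written = expected_shapes(example_secs=1)      # what --example_secs 1 produces
parsed_as = {"audio": 64000, "f0_hz": 1000}    # shapes printed by the dataset

mismatched = [key for key in parsed_as if parsed_as[key] != written[key]]
print(written, mismatched)
```

If the lengths differ like this, fixed-length parsing of each record would be expected to fail with a "Can't parse serialized Example" error. Telling the data provider about the matching example length may resolve it, though that is untested here.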

Occasional trouble downloading model zip file

Hey, I'm getting an error when I try to run the upload function in the timbre transfer demo.
I run the cell, choose the file to upload, then get this error:

MessageError Traceback (most recent call last)
in ()
10 # Load audio sample here (.mp3 or .wav3 file)
11 # Just use the first file.
---> 12 filenames, audios = upload()
13 audio = audios[0]
14 audio = audio[np.newaxis, :]

3 frames
/usr/local/lib/python3.6/dist-packages/google/colab/_message.py in read_reply_from_input(message_id, timeout_sec)
104 reply.get('colab_msg_id') == message_id):
105 if 'error' in reply:
--> 106 raise MessageError(reply['error'])
107 return reply.get('data', None)
108

MessageError: RangeError: Maximum call stack size exceeded.

Pls help, thanks!

cannot upload your own model

When I try to upload my own model in timbre_transfer.ipynb, I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-e3bb52fea3e8> in <module>()
      8   model_dir = os.path.join(GCS_CKPT_DIR, 'solo_%s_ckpt' % model.lower())
      9 else:
---> 10   raise ValueError
     11 
     12 # Assumes only one checkpoint in the folder, 'model.ckpt-[iter]`.

ValueError:

training an autoencoder without z

I noticed that with both the ddsp/ddsp/training/gin/models/ae.gin and ddsp/ddsp/training/gin/models/ae_abs.gin settings, the model uses z as a latent space. To test the model's performance without z, I replaced Autoencoder.decoder = @decoders.ZRnnFcDecoder() with Autoencoder.decoder = @decoders.RnnFcDecoder(). Is that the right way to do it? I found that without z, using ae_abs.gin (which jointly learns an encoder for f0), the loss becomes NaN after around 2000 steps. I wonder whether this is caused by the missing z latent...

Issue with training model

Hi all! I'm really excited about using this library.

When trying to train my own auto-encoder model, I get the following error shown in this screenshot. Anything I might be doing wrong here?

Thanks!

noisy outcome

Hi,

Thank you for this super cool code!

I use the Colab implementation. The style transfer seems to carry over the amplitudes, but the frequencies do not seem right. My model also only trains for maybe 15-30 min. on a 3 min. source.

Details:

Everything runs without errors, but I get a lot of warnings in the section "Preprocess raw audio into TFRecord dataset" and in the section "We will now begin training."

"Preprocess raw audio into TFRecord dataset":
Warnings:

WARNING:tensorflow:From /tensorflow-2.1.0/python3.6/tensorflow_core/python/compat/v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
I0128 20:34:14.087458 139829082847104 fn_api_runner_transforms.py:490] ==================== <function annotate_downstream_side_inputs at 0x7f2bcb8d9510> ====================
I0128 20:34:14.088107 139829082847104 fn_api_runner_transforms.py:490] ==================== <function fix_side_input_pcoll_coders at 0x7f2bcb8d9620> ====================
I0128 20:34:14.088482 139829082847104 fn_api_runner_transforms.py:490] ==================== <function lift_combiners at 0x7f2bcb8d96a8> ====================
I0128 20:34:14.088626 139829082847104 fn_api_runner_transforms.py:490] ==================== <function expand_sdf at 0x7f2bcb8d9730> ====================
I0128 20:34:14.088833 139829082847104 fn_api_runner_transforms.py:490] ==================== <function expand_gbk at 0x7f2bcb8d97b8> ====================
I0128 20:34:14.089247 139829082847104 fn_api_runner_transforms.py:490] ==================== <function sink_flattens at 0x7f2bcb8d98c8> ====================
I0128 20:34:14.089420 139829082847104 fn_api_runner_transforms.py:490] ==================== <function greedily_fuse at 0x7f2bcb8d9950> ====================
I0128 20:34:14.090623 139829082847104 fn_api_runner_transforms.py:490] ==================== <function read_to_impulse at 0x7f2bcb8d99d8> ====================
I0128 20:34:14.090765 139829082847104 fn_api_runner_transforms.py:490] ==================== <function impulse_to_input at 0x7f2bcb8d9a60> ====================
I0128 20:34:14.090906 139829082847104 fn_api_runner_transforms.py:490] ==================== <function inject_timer_pcollections at 0x7f2bcb8d9bf8> ====================
I0128 20:34:14.091151 139829082847104 fn_api_runner_transforms.py:490] ==================== <function sort_stages at 0x7f2bcb8d9c80> ====================
I0128 20:34:14.091248 139829082847104 fn_api_runner_transforms.py:490] ==================== <function window_pcollection_coders at 0x7f2bcb8d9d08> ====================
I0128 20:34:14.092715 139829082847104 statecache.py:137] Creating state cache with size 100
I0128 20:34:14.093585 139829082847104 fn_api_runner.py:1538] Created Worker handler <apache_beam.runners.portability.fn_api_runner.EmbeddedWorkerHandler object at 0x7f2bcb36fef0> for environment urn: "beam:env:embedded_python:v1"

I0128 20:34:14.093782 139829082847104 fn_api_runner.py:693] Running (((((ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/DoOnce/Impulse_26)+(ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/DoOnce/FlatMap(<lambda at core.py:2530>)_27))+(ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/DoOnce/Map(decode)_29))+(ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/InitializeWrite_30))+(ref_PCollection_PCollection_18/Write))+(ref_PCollection_PCollection_19/Write)
I0128 20:34:14.110267 139829082847104 fn_api_runner.py:693] Running (((((((((ref_AppliedPTransform_Create/Impulse_3)+(ref_AppliedPTransform_Create/FlatMap(<lambda at core.py:2530>)_4))+(ref_AppliedPTransform_Create/Map(decode)_6))+(ref_AppliedPTransform_Map(_load_audio)_7))+(ref_AppliedPTransform_Map(_add_f0_estimate)_8))+(ref_AppliedPTransform_Map(_add_loudness)_9))+(ref_AppliedPTransform_FlatMap(_split_example)_10))+(ref_AppliedPTransform_Reshuffle/AddRandomKeys_12))+(ref_AppliedPTransform_Reshuffle/ReshufflePerKey/Map(reify_timestamps)_14))+(Reshuffle/ReshufflePerKey/GroupByKey/Write)
I0128 20:34:14.130905 139826085275392 prepare_tfrecord_lib.py:34] Loading 'data/audio/vocal-by-1.wav'.
WARNING:tensorflow:From /tensorflow-2.1.0/python3.6/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0128 20:34:14.580774 139826085275392 deprecation.py:506] From /tensorflow-2.1.0/python3.6/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-01-28 20:34:15.311901: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
/usr/local/lib/python3.6/dist-packages/librosa/core/time_frequency.py:1006: RuntimeWarning: divide by zero encountered in log10
  - 0.5 * np.log10(f_sq + const[3]))
I0128 20:34:52.714827 139829082847104 fn_api_runner.py:693] Running ((((((Reshuffle/ReshufflePerKey/GroupByKey/Read)+(ref_AppliedPTransform_Reshuffle/ReshufflePerKey/FlatMap(restore_timestamps)_19))+(ref_AppliedPTransform_Reshuffle/RemoveRandomKeys_20))+(ref_AppliedPTransform_Map(_float_dict_to_tfexample)_21))+(ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/ParDo(_RoundRobinKeyFn)_31))+(ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/WindowInto(WindowIntoFn)_32))+(WriteToTFRecord/Write/WriteImpl/GroupByKey/Write)
I0128 20:34:54.352579 139829082847104 fn_api_runner.py:693] Running ((WriteToTFRecord/Write/WriteImpl/GroupByKey/Read)+(ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/WriteBundles_37))+(ref_PCollection_PCollection_25/Write)
W0128 20:34:54.418104 139826085275392 tfrecordio.py:60] Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
I0128 20:34:54.581911 139829082847104 fn_api_runner.py:693] Running ((ref_PCollection_PCollection_18/Read)+(ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/PreFinalize_38))+(ref_PCollection_PCollection_26/Write)
I0128 20:34:54.591270 139829082847104 fn_api_runner.py:693] Running (ref_PCollection_PCollection_18/Read)+(ref_AppliedPTransform_WriteToTFRecord/Write/WriteImpl/FinalizeWrite_39)
I0128 20:34:54.597444 139826069305088 filebasedsink.py:294] Starting finalize_write threads with num_shards: 10 (skipped: 0), batches: 10, num_threads: 10
I0128 20:34:54.700012 139826069305088 filebasedsink.py:331] Renamed 10 shards in 0.10 seconds.

Maybe this is not a problem?

Then, when I train a model I get a lot of warnings again:

WARNING:tensorflow:From /tensorflow-2.1.0/python3.6/tensorflow_core/python/compat/v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/ddsp/training/train_util.py:189: The name tf.estimator.tpu.RunConfig is deprecated. Please use tf.compat.v1.estimator.tpu.RunConfig instead.

W0128 20:38:03.140560 139811238639488 module_wrapper.py:138] From /usr/local/lib/python3.6/dist-packages/ddsp/training/train_util.py:189: The name tf.estimator.tpu.RunConfig is deprecated. Please use tf.compat.v1.estimator.tpu.RunConfig instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/ddsp/training/train_util.py:191: The name tf.estimator.tpu.TPUConfig is deprecated. Please use tf.compat.v1.estimator.tpu.TPUConfig instead.

W0128 20:38:03.140788 139811238639488 module_wrapper.py:138] From /usr/local/lib/python3.6/dist-packages/ddsp/training/train_util.py:191: The name tf.estimator.tpu.TPUConfig is deprecated. Please use tf.compat.v1.estimator.tpu.TPUConfig instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/ddsp/training/train_util.py:199: The name tf.estimator.tpu.TPUEstimator is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimator instead.

W0128 20:38:03.141086 139811238639488 module_wrapper.py:138] From /usr/local/lib/python3.6/dist-packages/ddsp/training/train_util.py:199: The name tf.estimator.tpu.TPUEstimator is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimator instead.

INFO:tensorflow:Using config: {'_model_dir': '/content/models/ddsp-solo-instrument', '_tf_random_seed': None, '_save_summary_steps': 300, '_save_checkpoints_steps': 300, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 100, '_keep_checkpoint_every_n_hours': 1, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=300, num_shards=None, num_cores_per_replica=None, per_host_input_for_training=2, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
I0128 20:38:03.141758 139811238639488 estimator.py:216] Using config: {'_model_dir': '/content/models/ddsp-solo-instrument', '_tf_random_seed': None, '_save_summary_steps': 300, '_save_checkpoints_steps': 300, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 100, '_keep_checkpoint_every_n_hours': 1, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=300, num_shards=None, num_cores_per_replica=None, per_host_input_for_training=2, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu False
I0128 20:38:03.142008 139811238639488 tpu_context.py:221] _TPUContext: eval_on_tpu False
WARNING:tensorflow:From /tensorflow-2.1.0/python3.6/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0128 20:38:03.145722 139811238639488 deprecation.py:506] From /tensorflow-2.1.0/python3.6/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /tensorflow-2.1.0/python3.6/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0128 20:38:03.146073 139811238639488 deprecation.py:323] From /tensorflow-2.1.0/python3.6/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
I0128 20:38:03.266683 139811238639488 estimator.py:1151] Calling model_fn.
INFO:tensorflow:Running train on CPU
I0128 20:38:03.266903 139811238639488 tpu_estimator.py:3124] Running train on CPU
I0128 20:38:04.029543 139811238639488 processors.py:138] Connecting node (additive):
I0128 20:38:04.029708 139811238639488 processors.py:140] Input 0: amps
I0128 20:38:04.029782 139811238639488 processors.py:140] Input 1: harmonic_distribution
I0128 20:38:04.029845 139811238639488 processors.py:140] Input 2: f0_hz
I0128 20:38:04.095593 139811238639488 processors.py:138] Connecting node (filtered_noise):
I0128 20:38:04.095721 139811238639488 processors.py:140] Input 0: noise_magnitudes
I0128 20:38:04.194273 139811238639488 processors.py:138] Connecting node (add):
I0128 20:38:04.194403 139811238639488 processors.py:140] Input 0: filtered_noise/signal
I0128 20:38:04.194476 139811238639488 processors.py:140] Input 1: additive/signal
I0128 20:38:04.194946 139811238639488 processors.py:138] Connecting node (reverb):
I0128 20:38:04.195056 139811238639488 processors.py:140] Input 0: add/signal
I0128 20:38:04.302336 139811238639488 processors.py:157] ProcessorGroup output node (reverb)
I0128 20:38:04.933219 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack/fc/dense/kernel:0 (shape=(1, 512), dtype=<dtype: 'float32'>).
I0128 20:38:04.933395 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack/fc/dense/bias:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.933452 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack/fc/layer_normalization/gamma:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.933498 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack/fc/layer_normalization/beta:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.933540 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack/fc_1/dense/kernel:0 (shape=(512, 512), dtype=<dtype: 'float32'>).
I0128 20:38:04.933584 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack/fc_1/dense/bias:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.933624 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack/fc_1/layer_normalization_1/gamma:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.933666 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack/fc_1/layer_normalization_1/beta:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.933704 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack/fc_2/dense/kernel:0 (shape=(512, 512), dtype=<dtype: 'float32'>).
I0128 20:38:04.933745 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack/fc_2/dense/bias:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.933784 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack/fc_2/layer_normalization_2/gamma:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.933821 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack/fc_2/layer_normalization_2/beta:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.933857 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_1/fc/dense/kernel:0 (shape=(1, 512), dtype=<dtype: 'float32'>).
I0128 20:38:04.933896 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_1/fc/dense/bias:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.934019 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_1/fc/layer_normalization_3/gamma:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.934084 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_1/fc/layer_normalization_3/beta:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.934144 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_1/fc_1/dense/kernel:0 (shape=(512, 512), dtype=<dtype: 'float32'>).
I0128 20:38:04.934210 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_1/fc_1/dense/bias:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.934275 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_1/fc_1/layer_normalization_4/gamma:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.934336 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_1/fc_1/layer_normalization_4/beta:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.934394 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_1/fc_2/dense/kernel:0 (shape=(512, 512), dtype=<dtype: 'float32'>).
I0128 20:38:04.934453 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_1/fc_2/dense/bias:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.934509 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_1/fc_2/layer_normalization_5/gamma:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.934564 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_1/fc_2/layer_normalization_5/beta:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.934619 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/gru/kernel:0 (shape=(1024, 1536), dtype=<dtype: 'float32'>).
I0128 20:38:04.934680 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/gru/recurrent_kernel:0 (shape=(512, 1536), dtype=<dtype: 'float32'>).
I0128 20:38:04.934738 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/gru/bias:0 (shape=(1536,), dtype=<dtype: 'float32'>).
I0128 20:38:04.934794 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_2/fc/dense/kernel:0 (shape=(1536, 512), dtype=<dtype: 'float32'>).
I0128 20:38:04.934854 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_2/fc/dense/bias:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.934910 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_2/fc/layer_normalization_6/gamma:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.934981 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_2/fc/layer_normalization_6/beta:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.935039 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_2/fc_1/dense/kernel:0 (shape=(512, 512), dtype=<dtype: 'float32'>).
I0128 20:38:04.935099 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_2/fc_1/dense/bias:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.935154 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_2/fc_1/layer_normalization_7/gamma:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.935209 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_2/fc_1/layer_normalization_7/beta:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.935270 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_2/fc_2/dense/kernel:0 (shape=(512, 512), dtype=<dtype: 'float32'>).
I0128 20:38:04.935333 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_2/fc_2/dense/bias:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.935391 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_2/fc_2/layer_normalization_8/gamma:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.935447 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/fc_stack_2/fc_2/layer_normalization_8/beta:0 (shape=(512,), dtype=<dtype: 'float32'>).
I0128 20:38:04.935502 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/dense/kernel:0 (shape=(512, 126), dtype=<dtype: 'float32'>).
I0128 20:38:04.935561 139811238639488 models.py:230] adding trainable variable rnn_fc_decoder/dense/bias:0 (shape=(126,), dtype=<dtype: 'float32'>).
I0128 20:38:04.935617 139811238639488 models.py:230] adding trainable variable ir:0 (shape=(48000,), dtype=<dtype: 'float32'>).
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/ddsp/training/train_util.py:126: The name tf.estimator.tpu.TPUEstimatorSpec is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimatorSpec instead.

W0128 20:38:06.495219 139811238639488 module_wrapper.py:138] From /usr/local/lib/python3.6/dist-packages/ddsp/training/train_util.py:126: The name tf.estimator.tpu.TPUEstimatorSpec is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimatorSpec instead.

INFO:tensorflow:Done calling model_fn.
I0128 20:38:06.514110 139811238639488 estimator.py:1153] Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
I0128 20:38:06.515020 139811238639488 basic_session_run_hooks.py:546] Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
I0128 20:38:07.337075 139811238639488 monitored_session.py:246] Graph was finalized.
2020-01-28 20:38:07.447110: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
INFO:tensorflow:Running local_init_op.
I0128 20:38:08.580479 139811238639488 session_manager.py:504] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0128 20:38:08.617971 139811238639488 session_manager.py:507] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /content/models/ddsp-solo-instrument/model.ckpt.
I0128 20:38:10.710745 139811238639488 basic_session_run_hooks.py:613] Saving checkpoints for 0 into /content/models/ddsp-solo-instrument/model.ckpt.
INFO:tensorflow:global_step/sec: 0.166465
I0128 20:38:26.712445 139811238639488 tpu_estimator.py:2307] global_step/sec: 0.166465
INFO:tensorflow:examples/sec: 2.66345
I0128 20:38:26.713377 139811238639488 tpu_estimator.py:2308] examples/sec: 2.66345
INFO:tensorflow:global_step/sec: 0.556173
I0128 20:38:28.510447 139811238639488 tpu_estimator.py:2307] global_step/sec: 0.556173
INFO:tensorflow:examples/sec: 8.89877
I0128 20:38:28.510792 139811238639488 tpu_estimator.py:2308] examples/sec: 8.89877
INFO:tensorflow:global_step/sec: 0.587273
I0128 20:38:30.213216 139811238639488 tpu_estimator.py:2307] global_step/sec: 0.587273

...

Training the model executes with:

INFO:tensorflow:global_step/sec: 0.557365
I0128 21:08:40.238079 139811238639488 tpu_estimator.py:2307] global_step/sec: 0.557365
INFO:tensorflow:examples/sec: 8.91784
I0128 21:08:40.238506 139811238639488 tpu_estimator.py:2308] examples/sec: 8.91784
INFO:tensorflow:global_step/sec: 0.548976
I0128 21:08:42.059640 139811238639488 tpu_estimator.py:2307] global_step/sec: 0.548976
INFO:tensorflow:examples/sec: 8.78361
I0128 21:08:42.060112 139811238639488 tpu_estimator.py:2308] examples/sec: 8.78361
INFO:tensorflow:Saving checkpoints for 1000 into /content/models/ddsp-solo-instrument/model.ckpt.
I0128 21:08:42.060704 139811238639488 basic_session_run_hooks.py:613] Saving checkpoints for 1000 into /content/models/ddsp-solo-instrument/model.ckpt.
INFO:tensorflow:Loss for final step: 6.1354375.
I0128 21:08:42.603451 139811238639488 estimator.py:375] Loss for final step: 6.1354375.
INFO:tensorflow:training_loop marked as finished
I0128 21:08:42.604181 139811238639488 error_handling.py:108] training_loop marked as finished

When I upload the model in the style transfer colab, the resynthesized sample sounds like rhythmic noise / wind.

I wonder where I went wrong, and whether I need to change any of the following:
num_shards=10
num_train_steps=1000
gin_param=batch_size=16

Generated model data looks like:

I would really appreciate any leads!

Have a great day!

train_autoencoder demo: issue installing pip dependencies

Running into an issue running the block that installs dependencies:

ERROR: pydrive 1.3.1 has requirement oauth2client>=4.0.0, but you'll have oauth2client 3.0.0 which is incompatible.
ERROR: google-api-python-client 1.7.12 has requirement httplib2<1dev,>=0.17.0, but you'll have httplib2 0.12.0 which is incompatible.
ERROR: chainer 6.5.0 has requirement typing<=3.6.6, but you'll have typing 3.7.4.1 which is incompatible.
ERROR: chainer 6.5.0 has requirement typing-extensions<=3.6.6, but you'll have typing-extensions 3.7.4.2 which is incompatible.

The timbre transfer colab worked fine for me.

Thank you for your work on these colab demos. I find them super useful.

2_processor_group.ipynb error

# Processor group DAG
dag = [
  (additive, ['amps', 'harmonic_distribution', 'f0_hz']),
  (noise, ['magnitudes']),
  (add, ['additive/signal', 'noise/signal']),
  (reverb, ['ir', 'add/signal'])
]

processor_group = ddsp.processors.ProcessorGroup(dag=dag)
audio_out = processor_group.get_signal(inputs)

# Listen
play(audio_out)
specplot(audio_out)

error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-9c0cecf1091d> in <module>
      8 
      9 processor_group = ddsp.processors.ProcessorGroup(dag=dag)
---> 10 audio_out = processor_group.get_signal(inputs)
     11 
     12 # Listen

~/anaconda3/envs/hw/lib/python3.6/site-packages/ddsp/processors.py in get_signal(self, *args, **kwargs)
    113   def get_signal(self, *args: tf.Tensor, **kwargs: tf.Tensor) -> tf.Tensor:
    114     """Convert input tensors arguments into a signal tensor."""
--> 115     outputs = self.get_outputs(*args, **kwargs)
    116     signal = outputs[self.name]['signal']
    117     return signal

~/anaconda3/envs/hw/lib/python3.6/site-packages/ddsp/processors.py in get_outputs(self, dag_inputs)
    144 
    145       # Build the processor (does nothing if not the first time).
--> 146       processor.build(*[tensor.shape for tensor in inputs])
    147       # Run processor.
    148       controls = processor.get_controls(*inputs)

TypeError: build() takes 2 positional arguments but 3 were given

calling astype on python list in timbre_transfer.ipynb

audios[0:1].astype(np.float32) in the basic timbre_transfer.ipynb does not work, as audios is a list.
audios[0].astype(np.float32) works.

CUDA_ERROR_LAUNCH_FAILED when training on GPU locally

Hi, I'm trying to train a model locally (adapting the code from train_autoencoder.ipynb), and I'm getting the error in the title just before the model is supposed to start training. I will copy the complete log below. My configuration is as follows:

Tensorflow 2.1
CUDA 10.1
cudnn 7.6.5 for CUDA 10.1

2020-02-21 13:39:39.259132: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-02-21 13:39:41.110202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
I0221 13:39:43.156791  2672 train_util.py:56] Defaulting to MirroredStrategy
2020-02-21 13:39:43.164404: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-02-21 13:39:43.237886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.68GHz coreCount: 34 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-02-21 13:39:43.241122: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-02-21 13:39:43.246274: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-02-21 13:39:43.250949: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-02-21 13:39:43.253287: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-02-21 13:39:43.257189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-02-21 13:39:43.261498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-02-21 13:39:43.269133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-02-21 13:39:43.271574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2020-02-21 13:39:43.272927: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-02-21 13:39:43.275556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1558] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.68GHz coreCount: 34 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-02-21 13:39:43.278705: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-02-21 13:39:43.280447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-02-21 13:39:43.282142: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-02-21 13:39:43.283834: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-02-21 13:39:43.285671: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-02-21 13:39:43.287438: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-02-21 13:39:43.289994: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-02-21 13:39:43.291835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1700] Adding visible gpu devices: 0
2020-02-21 13:39:43.970857: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1099] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-21 13:39:43.973353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105]      0
2020-02-21 13:39:43.974871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1118] 0:   N
2020-02-21 13:39:43.976781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1244] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6306 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0221 13:39:43.974044  2672 mirrored_strategy.py:501] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0221 13:39:44.343264  2672 train_util.py:201] Building the model...
WARNING:tensorflow:From c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:1809: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0221 13:39:48.817270  3952 deprecation.py:506] From c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:1809: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2020-02-21 13:39:52.821030: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-02-21 13:39:53.103556: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-02-21 13:39:53.327462: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
I0221 13:39:54.833573  2672 train_util.py:172] Restoring from checkpoint...
I0221 13:39:54.833573  2672 train_util.py:184] No checkpoint, skipping.
I0221 13:39:54.833573  2672 train_util.py:256] Creating metrics for ListWrapper(['spectral_loss', 'total_loss'])
2020-02-21 13:40:02.551385: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-02-21 13:40:02.554137: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
Fatal Python error: Aborted

Thread 0x00000a70 (most recent call first):
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\execute.py", line 60 in quick_execute
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\function.py", line 598 in call
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\function.py", line 1741 in _call_flat
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\function.py", line 1660 in _filtered_call
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\def_function.py", line 646 in _call
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\tensorflow\python\eager\def_function.py", line 576 in __call__
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\ddsp\training\train_util.py", line 273 in train
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\gin\config.py", line 1055 in gin_wrapper
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\ddsp\training\ddsp_run.py", line 151 in main
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\absl\app.py", line 250 in _run_main
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\absl\app.py", line 299 in run
  File "c:\users\andrey\anaconda3\envs\test\lib\site-packages\ddsp\training\ddsp_run.py", line 172 in console_entry_point
  File "C:\Users\andrey\Anaconda3\envs\test\Scripts\ddsp_run.exe\__main__.py", line 7 in <module>
  File "c:\users\andrey\anaconda3\envs\test\lib\runpy.py", line 85 in _run_code
  File "c:\users\andrey\anaconda3\envs\test\lib\runpy.py", line 193 in _run_module_as_main

I can't put my finger on the problem because:

Tensorflow trains on GPU correctly with a toy example training, so it is configured correctly to work with CUDA
Tensorflow trains DDSP correctly if run on CPU

This is on a Windows system. On Ubuntu the situation was the same, except that I got the following error:
Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

Any help will be appreciated.

compute loudness error

Hey, I am trying to compute loudness from an audio clip that is only 4 seconds long, and the loudness calculation fails.

Is it because the audio is too short? Could I just call spectral_ops.compute_loudness on multiple 4-second audio samples separately?

Few questions

This is great! I'm not too experienced with ML development but follow a lot of audio ML research, and I've been thinking that this approach should be the way to do things for a good while. Looking forward to playing around with ddsp for an upcoming project.

Got a few questions...

In some places, harmonic distribution seems nearly synonymous with the amplitude distribution a(n), as a model of variations between partials' spectral magnitudes, but then it's also referenced to model spectral centroid? Can you elaborate on the difference between harmonic distribution and a(n)? I use "overtone distribution" in my code to refer to discrete frequency distributions of partials relative to a fundamental (inharmonic timbre stuff)... probably contributing to my confusion. 😛
I'll be synthesizing novel inharmonic timbres with retuned pitches, using (mostly) harmonic timbres for inputs. Remapping/interpolating f(0) seems easy with the current model. I'm wondering if it's viable to remap overtone partials to an arbitrary frequency distribution with the current model, i.e., instead of multiplying the fundamental by integers, simply multiply it by some predefined set of rational numbers/floats. As I'll be synthesizing novel timbres, I won't necessarily have training sets to provide as inputs to train an unconstrained oscillator bank via a loss function, so I'm thinking the process could just be training the current model, still limited to f(0), for the given input and then remapping partial frequencies onto inharmonic frequency sets at the additive synth while still using other features that are generated by the encoder and/or interpolated. Does that make sense, or are there any immediate issues with that idea?
You use 101 partials in the synth, which for harmonic timbres would extend past the 8 kHz Nyquist limit for any pitch above ~80 Hz. Is that just to cover the entire frequency range for any reasonable pitch? I'm also curious why you limited it to a 16 kHz sample rate: real-time constraints, faster training, or something else?
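
A quick check of the arithmetic behind the last two questions, as a minimal sketch in plain Python. It assumes 101 partials at a 16 kHz sample rate, as quoted above. The helper names and the "stretched" ratio set are hypothetical and are not part of ddsp. The same mask applies whether the ratios are integers (harmonic) or arbitrary floats (inharmonic).

```python
SAMPLE_RATE = 16000          # Hz, the rate discussed above
NYQUIST = SAMPLE_RATE / 2    # 8000 Hz
N_PARTIALS = 101             # figure quoted in the question


def max_f0_without_aliasing(n_partials, nyquist=NYQUIST):
    """Highest f0 for which the top harmonic (n_partials * f0) is still <= Nyquist."""
    return nyquist / n_partials


def audible_partials(f0_hz, ratios, nyquist=NYQUIST):
    """Partial frequencies f0 * ratio, keeping only those below Nyquist."""
    return [f0_hz * r for r in ratios if f0_hz * r < nyquist]


harmonic_ratios = list(range(1, N_PARTIALS + 1))           # 1, 2, 3, ...
stretched_ratios = [n ** 1.05 for n in harmonic_ratios]    # illustrative inharmonic set

print(round(max_f0_without_aliasing(N_PARTIALS), 1))       # f0 limit in Hz
print(len(audible_partials(440.0, harmonic_ratios)))       # partials kept at A4
print(len(audible_partials(440.0, stretched_ratios)))      # fewer survive when stretched
```

For any f0 above the printed limit, some of the 101 partials land above Nyquist and would have to be silenced or dropped, whichever ratio set is used.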

Thanks and stay safe out there! Sorry for the wall of text.

Floating point problem in compute_f0

In ddsp.spectral_ops.compute_f0, when len(audio) is 1025253 and sample_rate is 16000, n_samples, which is equivalent to 1025253 / 16000 * 16000 becomes 1025252.9999999999, which then causes the assertion further down to fail (assert n_padding % 1 == 0).
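
A minimal sketch of one way to sidestep this kind of rounding, assuming the expected length is derived from the ratio of two sample rates: keep the whole computation in integer arithmetic so that no float enters the result. `resampled_length` is a hypothetical helper, not the code in ddsp.spectral_ops.

```python
def resampled_length(n_samples: int, sample_rate: int, target_rate: int) -> int:
    """Length after resampling, rounded up, computed with integers only."""
    # Ceiling division without floats: -(-a // b) == ceil(a / b) for b > 0.
    return -(-(n_samples * target_rate) // sample_rate)


# The case from this issue: same source and target rate, so the length is unchanged.
n = resampled_length(1025253, 16000, 16000)
print(n, isinstance(n, int))
```

Because the result is always an int, a check such as `n_padding % 1 == 0` cannot fail due to floating-point error.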

Exponential decay reset after resuming training

When training is stopped and relaunched, it resumes from the last checkpoint; however, the optimizer's learning-rate schedule is always reinitialized.
This happens in the __init__ method of the class Trainer in ddsp/training/train_util.py:

    lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=learning_rate,
        decay_steps=lr_decay_steps,
        decay_rate=lr_decay_rate)

    with self.strategy.scope():
      optimizer = tf.keras.optimizers.Adam(lr_schedule)
      self.optimizer = optimizer

A new instance of Trainer is created each time ddsp_run is executed.
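
To illustrate why a reset matters, here is a plain-Python sketch of the continuous (non-staircase) exponential decay formula, lr = initial_learning_rate * decay_rate ** (step / decay_steps). The hyperparameter values are made up for illustration and are not the ddsp defaults. If the step counter feeding the schedule restarts at 0 after a resume, the learning rate jumps back to its initial value.

```python
def exponential_decay(step, initial_learning_rate, decay_steps, decay_rate):
    """Learning rate after `step` optimizer steps (continuous, non-staircase)."""
    return initial_learning_rate * decay_rate ** (step / decay_steps)


# Hypothetical hyperparameters, chosen only for illustration.
schedule = dict(initial_learning_rate=1e-3, decay_steps=10000, decay_rate=0.98)

lr_at_stop = exponential_decay(30000, **schedule)       # when training was interrupted
lr_after_reset = exponential_decay(0, **schedule)       # step counter restarted from 0
lr_after_resume = exponential_decay(30000, **schedule)  # step counter restored

print(lr_at_stop, lr_after_reset, lr_after_resume)
```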

can we just reconstruct the waveform from fundamental frequency and loudness?

Hey, I just got a reconstruction result that seems too good to be true. I have a sense that the idea behind the model is really good, but the result is still amazing to me. I used your demo autoencoder to reconstruct audio of the human voice, and the result is really good. But I can't understand how this can be achieved using only f0 and loudness information. For example, the vowels 'a' and 'e' are definitely different; how is this reflected in f0 and loudness? I thought there might be some difference between musical instruments and the human voice. I just can't understand how these features are enough.

By the way, if I want to add z as a latent space besides f0 and loudness, how can I tell the model to use it? I thought you mentioned in the paper that z may correspond to timbre information, but I couldn't find it in timbre_transfer.ipynb. Can you achieve timbre transfer without z?

train_autoencoder demo preprocess issue with tf2

Traceback (most recent call last):
  File "/usr/local/bin/ddsp_prepare_tfrecord", line 8, in <module>
    sys.exit(console_entry_point())
  File "/usr/local/lib/python3.6/dist-packages/ddsp/training/data_preparation/prepare_tfrecord.py", line 91, in console_entry_point
    tf.disable_v2_behavior()
AttributeError: module 'tensorflow_core.compat.v2' has no attribute 'disable_v2_behavior'

Idea: sample_rate agnostic demo/tutorial

I really enjoy tinkering with ddsp. It would be a bit more approachable if we could experiment more easily with 44.1kHz or the other standard audio formats. Could you perhaps make it more straightforward in the demos, or alternatively document what we should set differently to accommodate other sample rates for a whole training-synthesizing pipeline, or at least offer some best-practice advice?

The ddsp_prepare_tfrecord function, for example, is not very forgiving with custom sample rates (it asserts because of crepe's 16kHz resampling producing some decimal paddings?).

I suppose 16kHz is the default because we are stuck in the world of speech synthesis (of the '90s?), but what might be acceptable for telephony is just not acceptable for many audio use cases.

I hope you don't mind me opening an issue just because of an idea/rant, feel free to close it anytime, and keep up doing these amazing contributions to the world of audio/music!

model restore

Hey, I have a question about model save and restore. As you said here:

Saving weights in checkpoint format because saved_model requires handling variable batch size, which some synths and effects can't.

Do you mean the model also saves the variables of some synths and effects?
I am struggling with this because I'd like to do some transfer learning with a new encoder while reusing the pre-trained model's decoder weights. I found that you use tf.train.Checkpoint.restore to restore the whole model, and that trainer.restore(model_dir) restores the model during training. But this coding style seems to make it hard to restore only part of the model's weights.
Is there a way to restore only part of the pretrained model (like the decoder) during restore and pass it to the new model's decoder? Another solution I can think of is to restore the whole model and replace all the parts except the decoder, which seems really weird and might not work.

demo notebook dead link

The link to the DDSP Timbre Transfer Colab at the beginning of the notebook is dead.

Training Demo Error

I'm having trouble with the Colab Training Demo. I keep getting this error:

I0319 03:02:50.860800 140595398194944 prepare_tfrecord_lib.py:30] Loading 'data/audio/19063.wav'.
/usr/local/lib/python3.6/dist-packages/librosa/core/time_frequency.py:1006: RuntimeWarning: divide by zero encountered in log10
  - 0.5 * np.log10(f_sq + const[3]))

I'm running everything in Google Colab with the GPU runtime selected. I've tried multiple MP3s and wav files but I keep getting the error. I'm not sure how to determine what the issue is and any help is appreciated.

timbre-transfer example -> If source sample is 'long', play(gen_audio) can not be downloaded via gui

I often cannot download the result from the small soundfile GUI (the three vertical dots) in the colab when files are long (more than ~45 seconds). This happens when using upload with .wav files, with both pretrained and custom checkpoints.

With a small file (13 MB, for example) there is no problem: the download button either opens an OS dialog for downloading or immediately saves the file to the system download folder as download.wav, depending on the browser and its settings.

FYI, I have tried Chrome, Firefox and Brave on Debian Buster, and Firefox, Chrome and Safari on macOS Mojave.
Sometimes the download fails and sometimes it does not.
I tried clearing my cache, logging out of my Google account, restarting the runtime and refreshing the page, with no success. The download button in the soundfile view just results in "download failed due to network connection error", or sometimes nothing at all: no error and no OS dialog.

I have double-checked that neither my network connection nor something I am doing wrong on my system is the cause. When I hit this bug, I immediately go to another running colab (for example, one with some librosa code, or another ddsp demo from the tutorials), and there I can download results without a problem.

I would love to write the resynthesis output from model(af, ...) to the files directory, but I am not able to find any write-out or store functions in the ddsp lib.
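
As a workaround sketch for getting audio onto disk without the player widget: the Python standard library's wave module can write a 16-bit mono WAV file. This assumes the output has already been converted to a flat sequence of floats in [-1, 1] at 16 kHz. `write_wav` is a hypothetical helper, not a ddsp function, and the output path is arbitrary.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, sample_rate=16000):
    """Write mono float samples in [-1, 1] as 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    pcm = struct.pack('<%dh' % len(ints), *ints)   # little-endian int16
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)
        f.setsampwidth(2)            # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(pcm)


# Stand-in for the model output: one second of a 440 Hz sine tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
out_path = os.path.join(tempfile.gettempdir(), 'resynthesis.wav')
write_wav(out_path, tone)
print('wrote', out_path)
```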

PS. ddsp produces such nice-sounding material compared to other ML resynthesis methods I have come across. Good job, and many thanks for sharing.

missing import in train_autoencoder.ipynb?

In train_autoencoder.ipynb this exception needs tf.errors to be imported:

try:
  ex = next(iter(dataset))
except OutOfRangeError:
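
One reason a missing name like this can go unnoticed: Python only evaluates the expression in an except clause when an exception actually reaches it. The sketch below is a plain-Python analogy that uses a list instead of a tf.data dataset, an assumption made so that it runs without TensorFlow. `MissingError` is a deliberately undefined, hypothetical name standing in for the unimported OutOfRangeError.

```python
def first_item(items):
    try:
        return next(iter(items))
    except MissingError:   # never defined, like the unimported name above
        return None


# Non-empty input: no exception is raised, so the except clause is never evaluated.
print(first_item([1, 2, 3]))

# Empty input: StopIteration reaches the except clause, and looking up the
# undefined name raises NameError instead of handling the original exception.
try:
    first_item([])
except NameError as err:
    print('NameError:', err)
```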

How to deal with variable-length sequences?

Hey,

I'm struggling with getting a simple (e.g. additive) synth to run with variable length data. Say, you are trying to train an autoencoder but the data is not always exactly the same length.

Problem 1: The synth already needs n_samples at initialization. Why is this? It makes much more sense to me to have this argument when calling the synth.
Problem 2: In principle, I don't mind creating a new synth object every time I want to use it (i.e. decode), since the initialization seems fairly trivial (shouldn't produce much overhead... right?). So I could just do that and always pass in the n_samples I need at that time step. However, I'm not sure what the correct way of doing this would be. I tried passing a tensor shape as n_samples, but this leads to a crash, see below.

Short code example:

def test(inp):
    # inp is a dummy -- it just represents "something with variable length"
    osc = ddsp.synths.Additive(n_samples=tf.shape(inp)[0])
    # dummy values for amplitude/harmonics/f0
    audio = osc([[[3.], [2.], [5.]]], [[[3.], [2.], [5.]]], [[[441.], [442.], [443.]]])
    return audio

test(tf.random.normal([102]))

This leads to the following crash

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-79-5d43e4898acc> in <module>
----> 1 test(tf.random.normal([102]))

<ipython-input-78-211339ba06f2> in test(inp)
      1 def test(inp):
      2     osc = ddsp.synths.Additive(n_samples=tf.shape(inp)[0])
----> 3     audio = osc([[[3.], [2.], [5.]]], [[[3.], [2.], [5.]]], [[[441.], [442.], [443.]]])
      4     return audio

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py in __call__(self, inputs, *args, **kwargs)
    820           with base_layer_utils.autocast_context_manager(
    821               self._compute_dtype):
--> 822             outputs = self.call(cast_inputs, *args, **kwargs)
    823           self._handle_activity_regularization(inputs, outputs)
    824           self._set_mask_metadata(inputs, outputs, input_masks)

/usr/local/lib/python3.6/dist-packages/ddsp/processors.py in call(self, *args, **kwargs)
     59     """Convert input tensors arguments into a signal tensor."""
     60     controls = self.get_controls(*args, **kwargs)
---> 61     signal = self.get_signal(**controls)
     62     return signal
     63 

/usr/local/lib/python3.6/dist-packages/ddsp/synths.py in get_signal(self, amplitudes, harmonic_distribution, f0_hz)
     97         harmonic_distribution=harmonic_distribution,
     98         n_samples=self.n_samples,
---> 99         sample_rate=self.sample_rate)
    100     return signal
    101 

/usr/local/lib/python3.6/dist-packages/ddsp/core.py in harmonic_synthesis(frequencies, amplitudes, harmonic_shifts, harmonic_distribution, n_samples, sample_rate, amp_resample_method)
    405   frequency_envelopes = resample(harmonic_frequencies, n_samples)  # cycles/sec
    406   amplitude_envelopes = resample(harmonic_amplitudes, n_samples,
--> 407                                  method=amp_resample_method)
    408 
    409   # Synthesize from harmonics [batch_size, n_samples].

/usr/local/lib/python3.6/dist-packages/ddsp/core.py in resample(inputs, n_timesteps, method, add_endpoint)
    124 
    125   elif method == 'window':
--> 126     outputs = upsample_with_windows(inputs, n_timesteps, add_endpoint)
    127 
    128   else:

/usr/local/lib/python3.6/dist-packages/ddsp/core.py in upsample_with_windows(inputs, n_timesteps, add_endpoint)
    170                          n_frames, n_timesteps))
    171 
--> 172   if n_timesteps % n_intervals != 0.0:
    173     minus_one = '' if add_endpoint else ' - 1'
    174     raise ValueError(

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_ops.py in tensor_not_equals(self, other)
   1363   if ops.Tensor._USE_EQUALITY and ops.executing_eagerly_outside_functions():
   1364     if fwd_compat.forward_compatible(2019, 9, 25):
-> 1365       return gen_math_ops.not_equal(self, other, incompatible_shape_error=False)
   1366     else:
   1367       return gen_math_ops.not_equal(self, other)

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_math_ops.py in not_equal(x, y, incompatible_shape_error, name)
   6435         _ctx._context_handle, tld.device_name, "NotEqual", name,
   6436         tld.op_callbacks, x, y, "incompatible_shape_error",
-> 6437         incompatible_shape_error)
   6438       return _result
   6439     except _core._FallbackException:

TypeError: Cannot convert 0.0 to EagerTensor of dtype int32

which I assume is due to n_timesteps being a tensor in upsample_with_windows. In this case it can be fixed by converting the shape to an int() explicitly, but this won't work when using @tf.function because the shape is not known at the time the code is actually run. Any workarounds? My colleague proposed simply initializing one synth for every possible length and choosing the correct one on the fly, but this seems wasteful.
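
The one-synth-per-length fallback can at least be made lazy, so that only lengths that actually occur in the data are ever built. A minimal sketch, assuming a hypothetical `make_synth` factory that stands in for constructing the real synth (neither `make_synth` nor `get_synth` is part of ddsp):

```python
import functools


def make_synth(n_samples):
    # Hypothetical stand-in for building a real synth with a fixed
    # n_samples. Here it only records the requested length.
    return {"n_samples": n_samples}


@functools.lru_cache(maxsize=None)
def get_synth(n_samples):
    """Build a synth for this length on first use, then reuse it."""
    return make_synth(int(n_samples))


# Only the distinct lengths seen so far are instantiated.
for length in [64000, 32000, 64000]:
    synth = get_synth(length)

print(get_synth.cache_info().currsize)
```

This still keeps one object per distinct length alive, so it only helps if the number of distinct lengths is small.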

I hope this is the right place to ask this. All the tutorials seem to be using fixed-length data (e.g. always 4 seconds) but I don't think variable lengths are a particularly exotic scenario.

Using a custom trainer

I would like to use a custom training loop with ddsp_run. Can I somehow swap Trainer for a custom trainer (overriding step_fn) in the gin config file?

I assume I cannot just do

train.trainer = @my_module.MyCustomTrainer()

because then the model and strategy arguments would not get passed into the trainer's constructor here.

I suppose I could define a function like

import gin
from ddsp.training.train_util import Trainer

@gin.configurable
def get_trainer(*args, trainer_class=Trainer, **kwargs):
    return trainer_class(*args, **kwargs)

and then call it from ddsp_run instead of instantiating Trainer directly, but that seems a bit clumsy.

Is there a better way to do it?

Trouble training model without Crepe f0 estimation

Description

I am having trouble training models that don't rely on an f0 estimate from the Crepe pitch estimator. In my tests, whenever fundamental frequency estimation is part of the differentiable graph, I cannot get the additive synthesizer to converge at all.

To reproduce it, I create a batch consisting of one sample generated with the additive synth as in the synths and effects tutorial notebook. I then try overfitting an autoencoder on that one sample, with code adapted from the training on one sample notebook.

The decoder uses an additive synthesizer too, so in theory it should easily reconstruct the sample. Here is a Colab notebook that demonstrates the behavior. To make the model converge, replace f0_encoder=f0_encoder with f0_encoder=None.

Results

Original Audio

Reconstruction with an f0 encoder (3000 training steps)

After the first few training steps, the loss stops improving and stays around 18 to 19.

Reconstruction with f0 from Crepe (100 training steps)

The model converges immediately, with the loss dropping to about 3 within a short time.

Things I have tried

Passed the precalculated f0 estimate from Crepe through a dense layer with one output. Even then the model does not converge, although a simple scaling of the input should be enough for reconstruction. I tried several combinations of activation functions and rescaling.
To help avoid local minima, which could occur if the model optimizes for a multiple of the real fundamental frequency, I tried applying a coarse spectrogram loss using fewer FFT bins. I didn't see any improvement.
Initialized a fake f0 estimator so that it starts almost at the right frequency before training. This did not help either.

This is happening just trying to fit one sample. I tried fitting multiple samples too without success.
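
One way to probe how informative a spectral loss is with respect to f0 is to scan it over candidate frequencies for a pure tone. The sketch below is a toy, pure-Python stand-in (single sine, naive DFT, one short frame), not ddsp.losses.SpectralLoss; the frame length, the candidate frequencies, and all function names are assumptions chosen for illustration.

```python
import cmath
import math

SAMPLE_RATE = 16000
N = 256  # short frame keeps the naive DFT cheap
N_BINS = N // 2 + 1


def sine(freq_hz):
    """One frame of a unit-amplitude sine at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(N)]


def magnitudes(signal):
    """Magnitudes of a naive DFT over the non-negative frequency bins."""
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * n / N)
                    for n, x in enumerate(signal)))
            for k in range(N_BINS)]


def l1_spectral_loss(target_mag, candidate_mag):
    """Mean absolute difference between two magnitude spectra."""
    return sum(abs(a - b)
               for a, b in zip(target_mag, candidate_mag)) / N_BINS


target_mag = magnitudes(sine(440.0))
losses = {}
for f0 in [220.0, 400.0, 440.0, 480.0, 880.0, 1760.0]:
    losses[f0] = l1_spectral_loss(target_mag, magnitudes(sine(f0)))
    print(f0, round(losses[f0], 3))
```

If the printed values away from 440 Hz turn out to be nearly flat, the loss gives little indication of which direction to move f0, which would be consistent with the behavior described above.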

To Reproduce

Colab notebook

import time
import ddsp
from ddsp.training import (data, decoders, encoders, models, preprocessing, 
                           train_util)
import gin
import numpy as np
import tensorflow.compat.v2 as tf
import itertools

sample_rate = 16000

### Generate an audio sample using the additive synth

n_frames = 1000
hop_size = 64
n_samples = n_frames * hop_size

# Amplitude [batch, n_frames, 1].
# Make amplitude linearly decay over time.
amps = np.linspace(1.0, -3.0, n_frames, dtype=np.float32)
amps = amps[np.newaxis, :, np.newaxis]

# Harmonic Distribution [batch, n_frames, n_harmonics].
# Make harmonics decrease linearly with frequency.
n_harmonics = 20
harmonic_distribution = (
    np.ones([n_frames, 1], dtype=np.float32) *
    np.linspace(1.0, -1.0, n_harmonics, dtype=np.float32)[np.newaxis, :])
harmonic_distribution = harmonic_distribution[np.newaxis, :, :]

# Fundamental frequency in Hz [batch, n_frames, 1].
f0_hz = 440.0 * np.ones([1, n_frames, 1],dtype=np.float32)

# Create synthesizer object.
additive_synth = ddsp.synths.Additive(n_samples=n_samples,
                                      scale_fn=ddsp.core.exp_sigmoid,
                                      sample_rate=sample_rate)

# Generate some audio.
audio = additive_synth(amps, harmonic_distribution, f0_hz)

# Create a batch of data (1 example) to train on

batch = {"audio": audio, "f0_hz": f0_hz, "amplitudes": amps, "loudness_db": np.ones_like(amps)}

dataset_iter = itertools.repeat(batch)
batch = next(dataset_iter)
audio = batch['audio']
n_samples = audio.shape[1]


### Create an autoencoder

# Create Neural Networks.
preprocessor = preprocessing.DefaultPreprocessor(time_steps=n_samples)

# f0 encoder
f0_encoder = encoders.ResnetF0Encoder(size="small")


encoder = encoders.MfccTimeDistributedRnnEncoder(rnn_channels = 256, 
                                                 rnn_type = 'gru', 
                                                 z_dims = 16, 
                                                 z_time_steps=125, 
                                                 f0_encoder=f0_encoder)
# set f0_encoder=None to use Crepe

decoder = decoders.RnnFcDecoder(rnn_channels = 256,
                                rnn_type = 'gru',
                                ch = 256,
                                layers_per_stack = 1,
                                output_splits = (('amps', 1),
                                                 ('harmonic_distribution', 45)))

# Create Processors.
additive = ddsp.synths.Additive(n_samples=n_samples, 
                                sample_rate=sample_rate,
                                name='additive')

# Create ProcessorGroup.
dag = [(additive, ['amps', 'harmonic_distribution', 'f0_hz'])]

processor_group = ddsp.processors.ProcessorGroup(dag=dag,
                                                 name='processor_group')


# Loss_functions
spectral_loss = ddsp.losses.SpectralLoss(loss_type='L1',
                                         mag_weight=1.0,
                                         logmag_weight=1.0)

strategy = train_util.get_strategy()

with strategy.scope():
  # Put it together in a model.
  model = models.Autoencoder(preprocessor=preprocessor,
                             encoder=encoder,
                             decoder=decoder,
                             processor_group=processor_group,
                             losses=[spectral_loss])
  trainer = train_util.Trainer(model, strategy, learning_rate=1e-3)


### Try overfitting to the synthetic sample

# Build model, easiest to just run forward pass.

trainer.build(batch)

for i in range(3000):
  losses = trainer.train_step(dataset_iter)
  res_str = 'step: {}\t'.format(i)
  for k, v in losses.items():
    res_str += '{}: {:.2f}\t'.format(k, v)
  print(res_str)

evaluation not releasing memory

I noticed that the evaluate_or_sample function doesn't release memory between checkpoints.

For example, if I run ddsp_run in eval mode while another ddsp_run is training, the size of the evaluation process keeps growing as it loads and evaluates the checkpoints that are being generated by the training process. While killing and rerunning the eval process solves the issue, it is not an ideal solution.
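
The kill-and-rerun workaround can be automated by evaluating each checkpoint in a short-lived child process, so that whatever memory the evaluation holds is returned to the OS when the child exits. A minimal sketch under that assumption; `run_in_fresh_process` and `EVAL_CODE` are hypothetical names, and the child here only echoes the checkpoint name instead of doing a real evaluation:

```python
import subprocess
import sys


def run_in_fresh_process(code, *args):
    """Run `code` in a new Python interpreter and return its stdout.

    Memory held by the child process is released when it exits.
    """
    result = subprocess.run(
        [sys.executable, "-c", code, *args],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Hypothetical per-checkpoint job. A real one would load and evaluate
# the checkpoint whose path arrives in sys.argv[1].
EVAL_CODE = "import sys; print('evaluated', sys.argv[1])"

for ckpt in ["ckpt-100", "ckpt-200"]:
    print(run_in_fresh_process(EVAL_CODE, ckpt))
```

The cost is that the model and libraries are reloaded for every checkpoint, so this trades startup time for a bounded memory footprint.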

All alternative spectral losses throw an error

SpectralLoss:
I am trying to use alternative loss weights, but all of them throw an error except logmag_weight and mag_weight:

INFO:tensorflow:Error reported to Coordinator: in user code:

    /usr/local/lib/python3.6/dist-packages/ddsp/training/models.py:64 __call__  *
        results = super().__call__(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/ddsp/training/models.py:123 call  *
        loss = loss_obj(features['audio'], audio_gen)
    /usr/local/lib/python3.6/dist-packages/ddsp/losses.py:107 call  *
        target = diff(target_mag, axis=1)
    /usr/local/lib/python3.6/dist-packages/ddsp/spectral_ops.py:158 diff  *
        size = shape.as_list()

    AttributeError: 'list' object has no attribute 'as_list'

or in case of trying to use loudness:

INFO:tensorflow:Error reported to Coordinator: in user code:

    /usr/local/lib/python3.6/dist-packages/ddsp/training/models.py:64 __call__  *
        results = super().__call__(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/ddsp/training/models.py:123 call  *
        loss = loss_obj(features['audio'], audio_gen)
    /usr/local/lib/python3.6/dist-packages/ddsp/losses.py:138 call  *
        target = spectral_ops.compute_loudness(target_audio, n_fft=2048)
    /usr/local/lib/python3.6/dist-packages/ddsp/spectral_ops.py:209 compute_loudness  *
        s = stft_fn(audio, frame_size=n_fft, overlap=overlap, pad_end=True)
    /usr/local/lib/python3.6/dist-packages/ddsp/spectral_ops.py:61 stft_np  *
        audio = np.pad(audio, padding, 'constant')
    <__array_function__ internals>:6 pad  **
        
    /usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py:741 pad
        array = np.asarray(array)
    /usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py:85 asarray
        return array(a, dtype, copy=False, order=order)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py:749 __array__
        " array.".format(self.name))

    NotImplementedError: Cannot convert a symbolic Tensor (fn:0) to a numpy array.
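
For the first traceback, the message suggests that `diff` receives a plain Python list where it expects an object with an `as_list()` method (such as a `tf.TensorShape`). A minimal sketch of a defensive conversion that accepts both; `shape_as_list` is a hypothetical helper, not something that exists in ddsp:

```python
def shape_as_list(shape):
    """Return `shape` as a plain Python list.

    Accepts objects exposing as_list() (e.g. tf.TensorShape) as well as
    plain lists or tuples.
    """
    if hasattr(shape, "as_list"):
        return shape.as_list()
    return list(shape)


print(shape_as_list([1, 250, 1025]))
print(shape_as_list((1, 250, 1025)))
```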

Tutorial 3_training.ipynb crashes on training

Hello!

Running through the training tutorial linked from the README.md (3_training.ipynb) without any alterations in Google Colab crashes at the "Build Model" cell. Specifically, these lines:

dataset = trainer.distribute_dataset(dataset)
trainer.build(next(iter(dataset)))

Seems as though the decoder is expecting the latent variable z from the conditioning dict but it's not present. Here's the stack trace:

>>> dataset = trainer.distribute_dataset(dataset)
>>> trainer.build(next(iter(dataset)))

    /usr/local/lib/python3.6/dist-packages/ddsp/training/models.py:64 __call__  *
        results = super().__call__(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/ddsp/training/models.py:120 call  *
        audio_gen = self.decode(conditioning, training=training)
    /usr/local/lib/python3.6/dist-packages/ddsp/training/models.py:114 decode  *
        processor_inputs = self.decoder(conditioning, training=training)
    /usr/local/lib/python3.6/dist-packages/ddsp/training/decoders.py:42 call  *
        x = self.decode(conditioning)
    /usr/local/lib/python3.6/dist-packages/ddsp/training/decoders.py:85 decode  *
        inputs = [conditioning[k] for k in self.input_keys]

    KeyError: 'z'

Trouble recreating solo violin dataset

Hello!

Thank you for all your great work on this library.

I want to reproduce the solo violin experiment from the library.

I've downloaded the mp3s (the wavs are paywalled unfortunately) of the pieces performed by John Gardner from the link provided in the paper.

However, I'm getting the following error when I run ddsp_prepare_tfrecord.

 File "/home/myuser/.local/lib/python3.7/site-packages/ddsp/spectral_ops.py", line 261, in compute_f0
    assert n_padding % 1 == 0
RuntimeError: AssertionError [while running 'Map(_add_f0_estimate)']

This seems to be an issue with some of the clips, like V. Sarabande.mp3.

Thank you!

why use EmbeddingLoss?

Hi, I have a question about the embedding loss used in the autoencoder, specifically PretrainedCREPEEmbeddingLoss in ae_abs.gin. From the code, it seems to regularize the latent f0 of the original and reconstructed audio, right? I am not sure why you do this. Is it something like a cycle loss, since you compute the latent again from the reconstructed audio?
I am not sure what this loss is for. Is it used to train the CREPE model? The logic seems a little odd to me.
Thank you!

With regards to a perceptual loss

Posting this again because the issue was closed before my question was resolved: #12

@jesseengel Thanks for the response!

You didn't quite answer my second question. Let me ask it in a different way...

With regards to the autoencoder configuration...

ddsp/ddsp/training/gin/models/ae.gin

Line 40 in cd98116

SpectralLoss.mag_weight = 1.0

Is there a benefit to training on an amplitude spectrogram? It looks like it's also used in the loss in addition to the log scaled spectrogram.

The amplitude spectrogram is not meaningful from a human perspective, right?
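
To make the question concrete, here is a toy comparison of what an L1 error on linear magnitudes and an L1 error on log magnitudes each emphasize. The two-bin "spectra" are made-up values, and this is not the SpectralLoss implementation:

```python
import math

target = [1.0, 0.01]     # one loud bin, one quiet bin (made-up values)
predicted = [0.9, 0.02]

# Per-bin absolute error on the linear magnitudes.
linear_err = [abs(t - p) for t, p in zip(target, predicted)]
# Per-bin absolute error on the log magnitudes.
log_err = [abs(math.log(t) - math.log(p))
           for t, p in zip(target, predicted)]

print("linear:", linear_err)
print("log:   ", log_err)
```

In this example the linear term is dominated by the loud bin, while the log term is dominated by the quiet bin, whose relative error is larger. So the two weights control how much the loss cares about loud peaks versus quiet detail.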

Conversion to original voice while changing lyrics

What would it take to use DDSP to change the words a singer is singing in a song while still keeping the melody? So, a combination of TTS and DDSP, I would think. For example, one could feed in new lyrics (text) to an existing song like Creep (Radiohead) and have Thom say "I love feet" instead of "I'm a creep".

I think this project, and similar projects, seem the closest to actually doing this, but I haven't seen any specific mention of it. Any tips or additional info would be appreciated.

_, f0_hz, f0_confidence, _ = crepe.predict(
    audio,
    sr=sample_rate,
    viterbi=viterbi,
    step_size=crepe_step_size,
    center=False,
    verbose=0)