Comments (5)

kwuking commented on May 28, 2024

Hi, I have run tests on both CPU and GPU and have not encountered the loss explosion issue you mentioned. May I ask what versions of Python, torch, and CUDA you are using, as well as the specific GPU? This could be related to your runtime environment. Below is the log from my recent test on an A100 (a small snippet for printing these environment details follows the log):
Use GPU: cuda:0
#############################start training : long_term_forecast_weather_96_96_TimeMixer_custom_ftM_sl96_ll0_pl96_dm16_nh8_el3_dl1_df32_fc3_ebtimeF_dtTrue_Exp_0#############################
train 36696
val 5175
test 10444
iters: 100, epoch: 1 | loss: 0.3753206
speed: 0.1125s/iter; left time: 632.6095s
iters: 200, epoch: 1 | loss: 0.3650213
speed: 0.0963s/iter; left time: 531.6102s
Epoch: 1 cost time: 29.084431648254395
Epoch: 1, Steps: 286 | Train Loss: 0.4559769 Vali Loss: 0.4045496 Test Loss: 0.1651619
Validation loss decreased (inf --> 0.404550). Saving model ...
Updating learning rate to 0.01
iters: 100, epoch: 2 | loss: 0.3403140
speed: 1.5760s/iter; left time: 8407.9784s
iters: 200, epoch: 2 | loss: 0.3347130
speed: 0.0959s/iter; left time: 501.9730s
Epoch: 2 cost time: 28.343204021453857
Epoch: 2, Steps: 286 | Train Loss: 0.4250036 Vali Loss: 0.4022082 Test Loss: 0.1644919
Validation loss decreased (0.404550 --> 0.402208). Saving model ...
Updating learning rate to 0.005
iters: 100, epoch: 3 | loss: 0.3028832
speed: 1.5435s/iter; left time: 7792.9897s
iters: 200, epoch: 3 | loss: 0.3556193
speed: 0.0947s/iter; left time: 468.4591s
Epoch: 3 cost time: 27.956741094589233
Epoch: 3, Steps: 286 | Train Loss: 0.4147728 Vali Loss: 0.3992365 Test Loss: 0.1646327
Validation loss decreased (0.402208 --> 0.399237). Saving model ...
Updating learning rate to 0.0025
iters: 100, epoch: 4 | loss: 0.3041284
speed: 1.5765s/iter; left time: 7509.0167s
iters: 200, epoch: 4 | loss: 0.4556009
speed: 0.0955s/iter; left time: 445.4644s
Epoch: 4 cost time: 28.11586308479309
Epoch: 4, Steps: 286 | Train Loss: 0.4078718 Vali Loss: 0.3988088 Test Loss: 0.1617702
Validation loss decreased (0.399237 --> 0.398809). Saving model ...
Updating learning rate to 0.00125
iters: 100, epoch: 5 | loss: 0.2949091
speed: 1.5833s/iter; left time: 7088.2278s
iters: 200, epoch: 5 | loss: 0.5099045
speed: 0.0949s/iter; left time: 415.2986s
Epoch: 5 cost time: 28.256697416305542
Epoch: 5, Steps: 286 | Train Loss: 0.4033931 Vali Loss: 0.3990721 Test Loss: 0.1609277
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.000625
iters: 100, epoch: 6 | loss: 0.3733408
speed: 1.5490s/iter; left time: 6491.8874s
iters: 200, epoch: 6 | loss: 0.3319999
speed: 0.0925s/iter; left time: 378.4997s
Epoch: 6 cost time: 27.6805522441864
Epoch: 6, Steps: 286 | Train Loss: 0.4002498 Vali Loss: 0.4013985 Test Loss: 0.1616626
EarlyStopping counter: 2 out of 10
Updating learning rate to 0.0003125
iters: 100, epoch: 7 | loss: 0.6016151
speed: 1.5663s/iter; left time: 6116.3414s
iters: 200, epoch: 7 | loss: 0.4443198
speed: 0.0929s/iter; left time: 353.2984s
Epoch: 7 cost time: 27.468426942825317
Epoch: 7, Steps: 286 | Train Loss: 0.3996443 Vali Loss: 0.3977090 Test Loss: 0.1613210
Validation loss decreased (0.398809 --> 0.397709). Saving model ...
Updating learning rate to 0.00015625
iters: 100, epoch: 8 | loss: 0.3081976
speed: 1.5743s/iter; left time: 5697.4836s
iters: 200, epoch: 8 | loss: 0.3726347
speed: 0.0954s/iter; left time: 335.6610s
Epoch: 8 cost time: 27.83885407447815
Epoch: 8, Steps: 286 | Train Loss: 0.3982672 Vali Loss: 0.3980305 Test Loss: 0.1620593
EarlyStopping counter: 1 out of 10
Updating learning rate to 7.8125e-05
iters: 100, epoch: 9 | loss: 0.5365641
speed: 1.5665s/iter; left time: 5221.0898s
iters: 200, epoch: 9 | loss: 0.3724343
speed: 0.0960s/iter; left time: 310.3874s
Epoch: 9 cost time: 28.19004225730896
Epoch: 9, Steps: 286 | Train Loss: 0.3979731 Vali Loss: 0.3968561 Test Loss: 0.1611742
Validation loss decreased (0.397709 --> 0.396856). Saving model ...
Updating learning rate to 3.90625e-05
iters: 100, epoch: 10 | loss: 0.4671008
speed: 1.5777s/iter; left time: 4807.3249s
iters: 200, epoch: 10 | loss: 0.3070258
speed: 0.0971s/iter; left time: 286.0706s
Epoch: 10 cost time: 28.16199278831482
Epoch: 10, Steps: 286 | Train Loss: 0.3965665 Vali Loss: 0.3990801 Test Loss: 0.1615544
EarlyStopping counter: 1 out of 10
Updating learning rate to 1.953125e-05
iters: 100, epoch: 11 | loss: 0.6498543
speed: 1.5710s/iter; left time: 4337.5387s
iters: 200, epoch: 11 | loss: 0.3345088
speed: 0.0938s/iter; left time: 249.5560s
Epoch: 11 cost time: 27.815724849700928
Epoch: 11, Steps: 286 | Train Loss: 0.3976659 Vali Loss: 0.4000468 Test Loss: 0.1616991
EarlyStopping counter: 2 out of 10
Updating learning rate to 9.765625e-06
iters: 100, epoch: 12 | loss: 0.4381566
speed: 1.5728s/iter; left time: 3892.6526s
iters: 200, epoch: 12 | loss: 0.5621059
speed: 0.0973s/iter; left time: 231.0085s
Epoch: 12 cost time: 28.15647864341736
Epoch: 12, Steps: 286 | Train Loss: 0.3960074 Vali Loss: 0.3980559 Test Loss: 0.1616852
EarlyStopping counter: 3 out of 10
Updating learning rate to 4.8828125e-06
iters: 100, epoch: 13 | loss: 0.2753164
speed: 1.5826s/iter; left time: 3464.3073s
iters: 200, epoch: 13 | loss: 0.6397113
speed: 0.0956s/iter; left time: 199.7189s
Epoch: 13 cost time: 28.209867477416992
Epoch: 13, Steps: 286 | Train Loss: 0.3974377 Vali Loss: 0.3994200 Test Loss: 0.1616747
EarlyStopping counter: 4 out of 10
Updating learning rate to 2.44140625e-06
iters: 100, epoch: 14 | loss: 0.5507306
speed: 1.5722s/iter; left time: 2991.8295s
iters: 200, epoch: 14 | loss: 0.2794026
speed: 0.0942s/iter; left time: 169.7589s
Epoch: 14 cost time: 27.93237328529358
Epoch: 14, Steps: 286 | Train Loss: 0.3971617 Vali Loss: 0.3962648 Test Loss: 0.1616808
Validation loss decreased (0.396856 --> 0.396265). Saving model ...
Updating learning rate to 1.220703125e-06
iters: 100, epoch: 15 | loss: 0.2362812
speed: 1.5702s/iter; left time: 2539.0405s
iters: 200, epoch: 15 | loss: 0.5210038
speed: 0.0952s/iter; left time: 144.4564s
Epoch: 15 cost time: 27.919300079345703
Epoch: 15, Steps: 286 | Train Loss: 0.3972344 Vali Loss: 0.3991646 Test Loss: 0.1616832
EarlyStopping counter: 1 out of 10
Updating learning rate to 6.103515625e-07
iters: 100, epoch: 16 | loss: 0.2441869
speed: 1.5644s/iter; left time: 2082.2781s
iters: 200, epoch: 16 | loss: 0.3173642
speed: 0.0947s/iter; left time: 116.5348s
Epoch: 16 cost time: 28.07013177871704
Epoch: 16, Steps: 286 | Train Loss: 0.3973130 Vali Loss: 0.3976985 Test Loss: 0.1616816
EarlyStopping counter: 2 out of 10
Updating learning rate to 3.0517578125e-07
iters: 100, epoch: 17 | loss: 0.8181573
speed: 1.5765s/iter; left time: 1647.4591s
iters: 200, epoch: 17 | loss: 0.3754889
speed: 0.0966s/iter; left time: 91.3173s
Epoch: 17 cost time: 27.89303421974182
Epoch: 17, Steps: 286 | Train Loss: 0.3973716 Vali Loss: 0.3948824 Test Loss: 0.1616797
Validation loss decreased (0.396265 --> 0.394882). Saving model ...
Updating learning rate to 1.52587890625e-07
iters: 100, epoch: 18 | loss: 0.2578862
speed: 1.5766s/iter; left time: 1196.6736s
iters: 200, epoch: 18 | loss: 0.3357456
speed: 0.0962s/iter; left time: 63.4233s
Epoch: 18 cost time: 27.98421311378479
Epoch: 18, Steps: 286 | Train Loss: 0.3964639 Vali Loss: 0.3991940 Test Loss: 0.1616804
EarlyStopping counter: 1 out of 10
Updating learning rate to 7.62939453125e-08
iters: 100, epoch: 19 | loss: 0.3234898
speed: 1.5751s/iter; left time: 745.0206s
iters: 200, epoch: 19 | loss: 0.2907747
speed: 0.0969s/iter; left time: 36.1467s
Epoch: 19 cost time: 28.47565507888794
Epoch: 19, Steps: 286 | Train Loss: 0.3972313 Vali Loss: 0.3985853 Test Loss: 0.1616802
EarlyStopping counter: 2 out of 10
Updating learning rate to 3.814697265625e-08
iters: 100, epoch: 20 | loss: 0.6489332
speed: 1.5610s/iter; left time: 291.9142s
iters: 200, epoch: 20 | loss: 0.3143170
speed: 0.0941s/iter; left time: 8.1840s
Epoch: 20 cost time: 27.963088274002075
Epoch: 20, Steps: 286 | Train Loss: 0.3974204 Vali Loss: 0.3994423 Test Loss: 0.1616799
EarlyStopping counter: 3 out of 10
Updating learning rate to 1.9073486328125e-08
#############################testing : long_term_forecast_weather_96_96_TimeMixer_custom_ftM_sl96_ll0_pl96_dm16_nh8_el3_dl1_df32_fc3_ebtimeF_dtTrue_Exp_0#############################
test 10444
test shape: (10444, 1, 96, 21) (10444, 1, 96, 21)
test shape: (10444, 96, 21) (10444, 96, 21)
mse:0.16167980432510376, mae:0.20934458076953888
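
For reference, here is a minimal snippet for reporting the environment details asked about above (Python, torch, and CUDA versions plus the GPU model). It uses only standard library and PyTorch calls:

```python
# Report the runtime environment relevant to reproducing this issue.
import platform
import torch

print('Python:', platform.python_version())
print('torch:', torch.__version__)
print('CUDA (torch build):', torch.version.cuda)  # None on CPU-only builds
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
```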

lzmax888 commented on May 28, 2024

Could you please print all your hyperparameters? Thanks.

kwuking commented on May 28, 2024

Could you please print all your hyperparameters? Thanks.

The script we are currently using for testing is /scripts/long_term_forecast/Weather_script/TimeMixer_unify.sh.

lzmax888 commented on May 28, 2024

So, are you using the same hyperparameters as me? How do your test results look visually?

Namespace(task_name='long_term_forecast', is_training=1, model_id='weather_96_96', model='TimeMixer', data='custom', root_path='./dataset/weather/', data_path='weather.csv', features='M', target='OT', freq='h', checkpoints='./checkpoints/', seq_len=96, label_len=0, pred_len=96, seasonal_patterns='Monthly', inverse=False, top_k=5, num_kernels=6, enc_in=21, dec_in=21, c_out=21, d_model=16, n_heads=4, e_layers=3, d_layers=1, d_ff=32, moving_avg=25, factor=3, distil=True, dropout=0.1, embed='timeF', activation='gelu', output_attention=False, channel_independence=1, decomp_method='moving_avg', use_norm=1, down_sampling_layers=3, down_sampling_window=2, down_sampling_method='avg', num_workers=10, itr=1, train_epochs=20, batch_size=128, patience=10, learning_rate=0.01, des='Exp', loss='MSE', lradj='TST', pct_start=0.2, use_amp=False, comment='none', use_gpu=True, gpu=0, use_multi_gpu=False, devices='0,1', p_hidden_dims=[128, 128], p_hidden_layers=2)

I found a way to avoid the loss explosion by setting lradj to 'type1' or 'type3'.

Thanks.
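
For context, the lradj option selects the per-epoch learning-rate schedule. Below is a minimal sketch of how 'type1' and 'type3' typically behave in Time-Series-Library-derived code such as TimeMixer; the exact constants (e.g. the epoch-3 cutoff) are assumptions and may differ from the repo's utils/tools.py:

```python
import types
import torch

def adjust_learning_rate(optimizer, epoch, args):
    # Per-epoch learning-rate update, keyed on args.lradj.
    if args.lradj == 'type1':
        # Halve the learning rate after every epoch.
        lr_adjust = {epoch: args.learning_rate * (0.5 ** (epoch - 1))}
    elif args.lradj == 'type3':
        # Hold the initial rate for the first epochs, then decay by 0.9 per
        # epoch (the cutoff of 3 is an assumption, not verbatim repo code).
        lr_adjust = {epoch: args.learning_rate if epoch < 3
                     else args.learning_rate * (0.9 ** (epoch - 3))}
    else:
        return
    if epoch in lr_adjust:
        lr = lr_adjust[epoch]
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr
        print(f'Updating learning rate to {lr}')

# With learning_rate=0.01 and 'type1', the printed schedule reproduces the
# halving pattern visible in the log above (0.01, 0.005, 0.0025, ...).
args = types.SimpleNamespace(lradj='type1', learning_rate=0.01)
optimizer = torch.optim.SGD(torch.nn.Linear(4, 1).parameters(),
                            lr=args.learning_rate)
for epoch in range(1, 4):
    adjust_learning_rate(optimizer, epoch, args)
```

The 'TST' option, by contrast, typically defers to a one-cycle scheduler (note pct_start=0.2 in the namespace above), whose warm-up phase drives the learning rate up early in training, which is one plausible source of the explosion.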

kwuking commented on May 28, 2024

I found a way to avoid the loss explosion by setting lradj to 'type1' or 'type3'.

Indeed, using different schedulers can yield varying results, and factors such as the current execution environment, device model, and software versions can all have an impact. If adopting a different scheduler resolves your issue, then that is excellent.
