
orca's People

Contributors

soheil-ab


orca's Issues

Reward reached 60 then drops

Hi,

I am trying to reproduce Orca's training curve (the score curve). I simplified training to the one-actor case and found that the reward (score) can reach 60 in the early stages but then drops. It usually ends up fluctuating between roughly 3 and 38.

Did you encounter this before? Do you have any insight into why this happens?

Thank you

Question about integrating Orca into Pantheon

Hello, I want to integrate Orca into Pantheon, but I have run into some problems.
To use Orca as the sender, I commented out the mahimahi and client parts in orca-server-mahimahi.cc (lines 152-155 and 243) and used the client as before.
However, when I run them inside Pantheon I get confusing results:
pantheon_report.pdf
It seems Orca cannot run correctly inside Pantheon, although it runs successfully outside Pantheon.
I noticed you said in an earlier issue that you planned to integrate Orca into Pantheon.
Is there an easier way to do the integration, and where did my approach go wrong?
Thanks for your help!

Some questions about orca-server-mahimahi.cc

Hello, thank you very much for your contribution to the community. I would like to ask one question: what is the purpose of setting target_ratio = 1.1 * orca_info.cwnd on each pass through the slow-start stage?
You do not seem to mention this in the paper.
Thank you very much for your answer.

Here is the corresponding code snippet,

if (!slow_start_passed)
{
    // got_no_zero = 1;
    tcp_info_pre = orca_info;
    t0 = timestamp();

    target_ratio = 1.1 * orca_info.cwnd;
    ret1 = setsockopt(sock_for_cnt[i], IPPROTO_TCP, TCP_CWND,
                      &target_ratio, sizeof(target_ratio));
    if (ret1 < 0)
    {
        DBGPRINT(0, 0, "setsockopt: for index:%d flow_index:%d ... %s (ret1:%d)\n",
                 i, flow_index, strerror(errno), ret1);
        return ((void *)0);
    }

How are models loaded for evaluation?

Hello @Soheil-ab ,

Thank you for your work!
I'm currently trying to evaluate Orca under different network conditions, but I am unsure where the code loads pre-trained models from.
I see the models being saved to ./train_dir/learner0/model*.ckpt, and I see load_model and save_model in agent.py, but these functions aren't called anywhere. Where would be the right place to use load_model?
Also, replay_memory is initialized but never used; is this intentional? And how do I continue training from a previous model using this option?

TL;DR: Could you please explain how to go about loading different models to evaluate Orca's performance?

Thank you!

How can I start a 6-hour training process?

Hi @Soheil-ab ,
I want to evaluate Orca's performance by tuning the parameters of the reward function, so I need to re-train the DRL model. I followed the instructions and ran the "./orca.sh 4 44444" and "./orca.sh 1 44444" commands to that end. However, I did not find any new checkpoints in /models, and there is no new content in rl-module/log/sum-*.

Could you please tell me whether there is anything wrong with my procedure? Also, how can I verify that the training process is running smoothly?

Reproducing the overhead experiment result in the paper

In Figure 10 of the Orca paper, I see that Orca's overhead is significantly lower than Aurora's.
However, running Orca means running Cubic and the RL model at the same time, while Aurora only runs the RL model to adjust the congestion window. How does Orca achieve such low overhead: a smaller model architecture, or a longer MTP? I could not get a clue from the paper.

How can I control the length of MTP?

Hi, @Soheil-ab .
Thanks for sharing the work. Lately I have been reading the paper and the source code, but I have not found which parameter controls the Monitoring Time Period (MTP). Could you point me to the exact part of the code that controls the MTP?
Thanks!

Question about patching Orca's kernel

Hello, I am trying to use your algorithm, but I met some difficulties when patching Orca's kernel following option 1.
I use Ubuntu 14.04 on Tencent Cloud, and my Linux kernel is 3.13.0-128-generic.
My /etc/default/grub is:

# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
#GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
#GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX_DEFAULT="crashkernel=1800M-4G:128M,4G-:168M panic=5"
GRUB_CMDLINE_LINUX="console=ttyS0,9600n8 console=tty0"
GRUB_SERIAL_COMMAND="serial --speed=9600 --unit=0 --word=8 --parity=no --stop=1"

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console
GRUB_TERMINAL="console serial"

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"
GRUB_RECORDFAIL_TIMEOUT=5
GRUB_GFXPAYLOAD_LINUX=text

I can successfully perform the first two steps, but after I reboot I get:
[boot-failure screenshot]
This has troubled me for a long time. I don't know whether you could give me some suggestions.
Thanks for your help!

Three Questions in model testing

Hey, @Soheil-ab

I observed the following phenomena when using the model:

  1. In a bandwidth-step test (24 Mbps down to 12 Mbps), I found that when the RTT is large (RTT > 100 ms), throughput oscillates after the bandwidth drops.

  2. In some experiments (e.g. bw = 5 Mbps, 10 Mbps, 12 Mbps), throughput shows a 'tail-up' phenomenon.

  3. When continuing training from the original model (training trace: bandwidth 24 Mbps -> 12 Mbps), I found that the actor loss and critic loss kept rising, and the results on the test trace (bandwidth 24 Mbps -> 12 Mbps) after training were worse than the original model's.

Here are my concrete results.

Did you encounter any of the problems above during your own use or testing? If so, how did you solve them?

Thank you very much for your answer.

Questions about shared memory

Hello @Soheil-ab, sorry to bother you. I'm having some problems with shared memory. Following your program, I tried to use shared memory to communicate between C and Python, but my program crashes after a while, and removing the shared memory does not help. I did a lot of experiments and modifications but still could not solve the problem or find the cause. Have you encountered this problem? Looking forward to your help, thank you very much!

Some questions about running and version switching

@Soheil-ab Sorry to bother you, I'm a beginner. I am having some difficulties running Orca and would like to ask for your help. I followed the step-by-step installation described in the code. However, when running the sample test, a "no process found" error appears, as shown in the screenshots below.
[screenshots attached]
I have tried many ways to solve it, and my colleagues encounter the same problem. The question may be foolish, but I hope to get your help.
One more question: I read in your paper that you also implemented a purely DRL-based version. How do I switch to that version? I see that the reward in the GYM_Env_Wrapper function in envwrapper.py is fixed to 10, so I'm not sure whether that is the place to switch. Looking forward to your help, thanks!

Congestion Window Update Rule

Hi,

From the server's code, it looks like the congestion window is updated in two places:

  1. In slow start, increasing the window by a factor of 1.1
  2. In congestion avoidance, using:
    target_ratio=atoi(alpha)*orca_info.cwnd/100;

In the second case, how does this update rule relate to the window update presented in the paper, i.e. 2^(alpha) * cwnd?

Thanks for the clarification.

Any help to integrate into Pantheon?

In the paper, it seems Orca was tested with Pantheon, so is there any helper for integration with Pantheon, like a wrapper.py for Pantheon?
Thanks a lot !

How can I only get the cwnd output of DRL?

Hi, I want to test Orca in my own transport protocol.
My system has its own transport functions and an integrated Cubic; I only need the output of Orca's DRL module.
Could you tell me how to run the DRL module by itself, and the format and location of its input and output?

Thanks a lot :)

how to get the network data

Hello, I have a question about the network data.
In define.h you define "#define TCP_ORCA_INFO 46".
I checked socket.h, where 46 means SO_BUSY_POLL:
"#define SO_BUSY_POLL 46"
I cannot understand how "getsockopt( sk, SOL_TCP, TCP_ORCA_INFO, (void *)info, (socklen_t *)&tcp_info_length );" gets the data into info.
Maybe the question is very silly; I am sorry for that :)

The time to train the model

From the file model.ckpt-1283529.index, I notice that you trained the model for more than 1,000,000 steps. Could you give me some information on how long it took to train this model?

How should I run Orca without mahimahi?

I want to run the Orca server and the Orca client on two separate machines connected by a third one (a router), and emulate different network environments on the router with Linux tc and netem. How should I run Orca without mahimahi?

Competitiveness issues for ORCA

Hey, @Soheil-ab

I observed the following phenomena when using the model:
1. Orca is far less competitive than Cubic.
2. There is also a clear gap between Orca's competitiveness and Cubic's.

Here are my concrete results.

Did you encounter any of the problems above during your own use or testing? If so, how did you solve them?

Thank you very much for your answer.

Correct way to train a new Orca? (with access to a cluster)

Hi @Soheil-ab,

I had a few questions regarding training Orca.

The paper mentions you use 256 actors interacting with different environments. Can you please give some insight on the following?

  1. Does each actor get one trace over which it collects 50k samples, after which the actor dies and the learner waits for all actors to die? Does this mean your training dataset had a total of 256 generated traces?
  2. How did you distribute these actors across your pool of servers? Could you please share your workflow, or a resource to go over for this? Did you have a single coordinator distributing all the actor tasks, or did you have to manually start a learner on one machine and then signal the other machines to start their actor counterparts?
  3. If you could share your trace generator, or your methodology of trace generation, that would be very helpful too.
  4. How can we be confident that a newly trained Orca has been trained correctly?
  5. I see "/cpu" being used in the given code, but the paper mentions your training cluster includes a GPU. Did you use a GPU to train, or is it not very relevant since the networks used in Orca are not very big?

Thank you for taking the time to read and answer these questions!

Can client separately run on non-Orca-built kernel?

Hi, I'm trying to run Orca in remote mode without Mahimahi.
I have a dumb question: it seems that the client (compiled from client.c) is a simple receiving/acknowledging tool. Can it run on a Linux kernel without the Orca patch?
Thanks!
@Soheil-ab

About the reached maximum throughput of Orca?

@Soheil-ab

Hi, Soheil.
I set the following network topology:

client ---------- router------------- server

and set the bottleneck bandwidth to 128 Mbps, the RTT to 150 ms, and the queue length to 1 BDP at the router using Linux tc and netem.
I found that Orca's maximum throughput is only ~60 Mbps, while pure Cubic can reach 100+ Mbps.
Do I need to configure Orca specifically for this environment? (I have previously asked how to run Orca without mahimahi.)

cwnd control of Orca

I would like to ask you a question: trying to reproduce the case of Figure 9 in your paper, I found that cwnd reaches a maximum of 2147483648 (2^31) in this process. Could it be that target_ratio = 1.1 * orca_info.cwnd causes cwnd to grow too aggressively?
Here are my Orca and Cubic results (24 Mbps -> 12 Mbps).
Thank you very much for your answer.

Orca's performance in the 5G wireless scenario and new training proposal

Dear Soheil,
Have you ever tested Orca in a 5G wireless scenario, where bandwidth changes quickly and unpredictably? How does Orca perform in such a scenario? My second question: can we train Orca in a new, unseen environment to enhance performance there, while still running inference to guide the classic CC? In other words, I want to train Orca in real time between two MTPs (Monitoring Time Periods) and run inference at each MTP. What do you think of this proposal?

Many thanks!
Best regards,
Eric
