Comments (27)
I am seeing the same problem, but with distcc 3.1.
Original comment by [email protected]
on 18 Feb 2009 at 11:45
from distcc.
The whole process by which the monitoring programs read multiple state files
that are being created and deleted is hokey. It is possible that in the process
of taking a snapshot of the entries of the .distcc/state directory, multiple
compiles on the same host/slot pair get triggered. Note that both the text
based and the GUI based monitor suffer from the same problem, since they use
the same underlying function dcc_mon_poll.
A quick fix for this is to ignore multiple compiles on the same host/slot pair,
for which I have attached a patch below. This is not an ideal solution, but
should do the job.
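To make the workaround concrete, here is a minimal sketch in C of that kind of filter. It is not the attached patch: struct task_entry and drop_duplicate_slots are hypothetical names chosen for illustration, and the real monitor data structures carry more fields than this.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one monitor entry; the real
 * distcc task-state structure carries more fields than this. */
struct task_entry {
    char host[64];
    int slot;
};

/* Compact `entries` in place so that each host/slot pair appears at most
 * once, keeping the first occurrence. Returns the new entry count.
 * Quadratic, which is acceptable for the few entries a monitor shows. */
size_t drop_duplicate_slots(struct task_entry *entries, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int seen = 0;
        for (size_t j = 0; j < kept; j++) {
            if (entries[j].slot == entries[i].slot &&
                strcmp(entries[j].host, entries[i].host) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            entries[kept++] = entries[i];
    }
    return kept;
}
```

A filter like this only hides the symptom in the monitor; whatever wrote the conflicting state files is left untouched.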
Original comment by [email protected]
on 26 Aug 2009 at 3:58
Attachments:
I'm pretty sure that I have found the actual cause of the bug. Will upload a
patch as soon as I solve a few issues in it.
Original comment by [email protected]
on 20 Jun 2010 at 2:45
OK. So, the bug appears to be in the client, in the handling of the my_state
variable. Version 3.0 introduced the possibility of a file being processed
partly locally and partly remotely, but the state information was all still
being stored in my_state, resulting in mixed-up data. So the lock files are
created correctly, but the state files are not.
This fix simply creates a second state variable, differentiating between local
and remote, and stores state data in the appropriate one. When
dcc_write_state() is called, it uses whichever state variable was last updated.
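Roughly, the idea looks like the following sketch. It is a simplified model under my own assumptions, not the patch itself: struct task_state, note_slot, state_to_write and the three variables are hypothetical names, and the real client state holds more than a host and a slot.

```c
#include <stdio.h>

/* Hypothetical, simplified state record; not the real distcc layout. */
struct task_state {
    char host[64];
    int slot;   /* -1 means no slot allocated yet */
};

enum state_target { TARGET_LOCAL, TARGET_REMOTE };

static struct task_state local_state  = { "", -1 };
static struct task_state remote_state = { "", -1 };
static struct task_state *current_state = &remote_state;

/* Record a host/slot in the state that matches where the work runs,
 * and remember that this state was the last one updated. */
void note_slot(enum state_target target, const char *host, int slot)
{
    struct task_state *s =
        (target == TARGET_LOCAL) ? &local_state : &remote_state;
    snprintf(s->host, sizeof s->host, "%s", host);
    s->slot = slot;
    current_state = s;
}

/* Stand-in for writing the state file: returns the state that would be
 * written, which is whichever one was updated most recently. */
const struct task_state *state_to_write(void)
{
    return current_state;
}
```

With two records, noting the local preprocessing slot no longer destroys the remote host/slot that was picked first.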
It works for me, so I would love to hear if it works for you! Cheers.
PS. This is the first time that I have looked at the distcc codebase, so the
usual disclaimers of naivety apply: I may have inadvertently created a
monstrous new bug!
Original comment by [email protected]
on 21 Jun 2010 at 3:28
Attachments:
Observed the bug last night for the first time since I have been running the
patched client. Didn't see where it happened, so am not able to reproduce it.
Original comment by [email protected]
on 27 Jun 2010 at 6:40
I have been bugged by this particular bug for a long while. Finally I had the
time and energy to investigate, and found this to be the problem:
dcc_build_somewhere calls dcc_pick_host_from_list_and_lock_it, which calls
dcc_lock_one. This particular function sets my_state.slot to whatever slot is
available from whatever host is picked. However, when returning to
dcc_build_somewhere the code gets into dcc_lock_local_cpp, which again calls
dcc_lock_one for localhost. This, in turn, overwrites the previously stored
slot with the local slot chosen for preprocessing. Finally, in
dcc_lock_local_cpp, the state is committed to disk for the monitor to pick it
up, but with the remote host paired with the local slot value.
I came to this conclusion after reading jeremy's post, but my patch does less:
let my_state.slot be set just once, by the first call to
dcc_pick_host_from_list_and_lock_it. It appears to work.
Judging by how communication between the distcc host/slot allocation and the
monitor state reading is done, I think getting duplicate host+slot entries is
possible if the monitor is reading the state files while one (or more) of the
processes are modifying the files, but this should occur rarely.
Hope this helps :)
Original comment by [email protected]
on 6 Jul 2010 at 1:11
Attachments:
Hi cheepeero, I just tested your patch on rev 728 and it actually made the
symptoms worse: lots of duplicate localhost entries appeared within seconds.
Have you tested it recently? Cheers.
Original comment by [email protected]
on 10 Sep 2010 at 8:17
I will have to check what is different with rev 728; as the name suggests, my
patch was made against distcc-3.1. I'll also investigate whether the gentoo
distro applies some other patches as well.
I use distcc for building my gentoo systems; with the patch all behave well
(one single-core, 1 dual core and one with 8 logical cores). Here is a
screenshot of how it looks with my patch on the dual-core while building glib.
Another thing that might cause a difference is that I have limited on each
system the localhost slots in /etc/distcc/hosts.
Original comment by [email protected]
on 10 Sep 2010 at 9:39
Attachments:
On the previous run I had -j8 left in the gentoo configuration. Here is another
screenshot on mysql with -j10
Original comment by [email protected]
on 10 Sep 2010 at 9:44
Attachments:
Again mysql with -j14. I have checked out your trunk, but it's quite late for
me so I'll continue tomorrow :)
Original comment by [email protected]
on 10 Sep 2010 at 9:50
Attachments:
Hey. I have tested my patch against r730, and found no problem with it. The
behavior of distccmon-gnome was correct. I have no explanation for why your test
went wrong, except the obvious ones: the patch was not actually applied, or your
sources were tainted with some other patch or other modifications.
To clarify: my patch refers to the pure distcc trunk / tag 3.1; I did not apply
any of the other patches in this thread but the one I submitted, namely
distcc-3.1-fix-slots.patch
I cannot do more except instruct you to try again. Perhaps open the patch file
with a text editor and read the long comment I placed there. Read the code and
validate my fix.
I am on the watcher list of this issue and I'll monitor it. I'm sure we'll be
able to make it work :)
Original comment by [email protected]
on 12 Sep 2010 at 3:03
cheepeero, could you attach your config.h? Thanks.
Original comment by [email protected]
on 16 Sep 2010 at 10:12
Here is the config.h.
However, in the meantime I have used distcc without the software limitation in
/etc/distcc/hosts and once the localhost/0 slot got multiplied three times.
localhost pink
I have used localhost/2 pink/8 with my patch for quite a while now (see date of
my post) without bumping into this problem, so I assume this has something to
do with slot allocation, not with transmitting the correct slot to the monitor
(which is what my patch fixes).
Original comment by [email protected]
on 17 Sep 2010 at 8:07
Attachments:
Interesting. I'll have a look when I'm feeling better. Our config.h files are
the same apart from avahi. Cheers.
Original comment by [email protected]
on 18 Sep 2010 at 6:09
Hi cheepeero. I just tested your patch again with and without the /LIMIT
option and duplicate localhost entries appeared each time. I tested by
compiling the Linux kernel with make -j6 and a hosts file of "localhost prison".
The monitor reads slot information from the state files saved by the client,
there is no separate channel for transmission. So whatever appears in the
monitor is whatever the clients are 'thinking'. :)
Original comment by [email protected]
on 24 Sep 2010 at 5:46
I understand that the slot info is inferred by the client, not reported by the
server. I was referring to distcc client - distcc monitor communication through
state files.
Even so, this does not change the fact that the same member,
dcc_task_state.slot, is used both for accounting the remote compiling slot and
the local preprocessor slot, and the latter is the last write in my_state.slot
(or my_state->slot within your patch).
Let me elaborate:
At line 550 in compile.c, dcc_build_somewhere sets my_state.slot to -1
(unallocated, I suppose). A few lines below it calls
dcc_pick_host_from_list_and_lock_it, which at line 104 calls dcc_lock_one. When
it finds a free host/cpu, this function records the found i_cpu (slot) value
with dcc_note_state_slot.
When returning to dcc_build_somewhere in compile.c, the flow goes to
dcc_lock_local_cpp at line 567, which in where.c also calls dcc_lock_one, this
time with a fake "localhost" list. This time dcc_lock_one finds another i_cpu
relative to the fake "localhost" list, which overwrites the previous remote
slot selection. So the remote slot initially chosen for compiling the code is
lost in my_state at this phase, replaced by the local preprocessor slot.
I have found that these are the only two places where my_state.slot is
modified. The slot variable in my_state also seems to be used only by the
monitoring clients, not for host selection or such.
I have examined your patch, and I think it should solve this particular slot
accounting bug. I think mine would do so too, except yours would report
preprocessing on the localhost slot it is performed on, while mine would report
it on the remote slot where compilation is scheduled.
However, both our patches seem to have the same problem:
dcc_lock_local_cpp is looking at a different dcc_hostdef list (see line 194 in
where.c), namely dcc_hostdef_local_cpp, which has 8 slots by default, and is
only overridden by the option "--localslots_cpp".
Also, dcc_lock_local called at line 763 in compile.c examines a different
localhost list (dcc_hostdef_local set by "--localslots") than the normal
"localhost" host added by /etc/distcc/hosts or otherwise. This one has 4 slots
hardcoded as default.
See dcc_parse_hosts at line 468 in hosts.c or just grep the sources for the
variables I have pointed to.
Under these circumstances I feel it is normal for the monitor to display up to
4 or 8 slots for localhost. But with any of our patches, the lock system should
prevent having duplicate host+slot entries. Did you observe duplicates or just
an increased number of localhost slots?
I apologize for the long post, but I felt that the intention of my patch was
not understood - basically it's a different fix (more trivial and less clean)
than yours.
Original comment by [email protected]
on 25 Sep 2010 at 11:42
Right, that explains why I occasionally see four slots used on localhost when I
expect the limit to be two.
However, it's not the problem. When I use your patch, I still see duplicate
entries: the same localhost-slot combination listed multiple times.
Doesn't your patch force the client to create the second lock on localhost with
the same slot as the first lock found on the remote host? Which is a reversal
of the original problem that the second lock clobbered the first slot value.
Now instead of duplicate remote hosts there are duplicate localhosts.
Original comment by [email protected]
on 26 Sep 2010 at 4:24
No, my changes only stop the local preprocessing slot from overwriting the
initial remote host/slot selection. As far as I understood the distcc code, the
lock files are different from state files.
Here's a scenario without patch: somehost/6 is selected in the first call of
dcc_lock_one, and gets recorded in my_state. Next dcc_lock_local_cpp locks slot
localhost/2 for preprocessing, but only records the slot, so from now on the
state file contains somehost/2 and remains so until it gets deleted. With my
patch the second write to my_state.slot is ignored, so both the preprocessing
and the compilation phase appear to happen on somehost/6.
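The scenario can be reproduced with a toy model like the one below. This is only a sketch under stated assumptions: struct task_state, note_slot_always and note_slot_once are hypothetical names, the host is never rewritten (mirroring the "only records the slot" behaviour described above), and none of this is taken from the distcc sources.

```c
/* Hypothetical, simplified model of the client's state; not distcc code. */
struct task_state {
    const char *host;
    int slot;            /* -1 means unallocated */
};

/* Unpatched behaviour as described above: every lock overwrites the
 * recorded slot, while the host stays as first recorded. */
void note_slot_always(struct task_state *s, int slot)
{
    s->slot = slot;
}

/* Behaviour of the smaller fix: only the first allocation is recorded,
 * later writes are ignored. */
void note_slot_once(struct task_state *s, int slot)
{
    if (s->slot == -1)
        s->slot = slot;
}
```

Starting from host "somehost" and slot -1, noting 6 and then 2 with note_slot_always leaves somehost/2, while the same two calls with note_slot_once leave somehost/6.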
As far as I understand your patch, the preprocessing phase would move the whole
my_state from remote_state to local_state, and record preprocessing on
localhost/2, but I am unsure that after this phase is finished the client
records the switch from local_state to remote_state (or back) while the file is
getting compiled remotely and the localhost/2 slot is unlocked for other tasks.
This might lead to duplicating localhost/2 if it gets reused in the meantime.
Unfortunately I had to move from my previous location so I won't be able to use
distcc for a few weeks - no more pink/8 to help my poor notebook :) Perhaps
I'll find another setup to test for duplicate localhost+host entries.
After this discussion I feel your implementation is better than my idea,
because it describes more accurately what actually happens (preprocessing on
localhost etc.) I will also switch to your patch and see how it works when my
setup becomes available again.
Original comment by [email protected]
on 26 Sep 2010 at 10:21
Sorry, I said "lock" but I meant the state file. Makes me wonder if it's
really necessary to have separate lock and state files?
Hmmm, see, I thought your patch would cause it to lock and "note" somehost/6,
then lock localhost/2 but note localhost/6. Most of the time it doesn't,
because preprocessing and compiling happen on the remote host. But if the
remote fails for some reason, then falling back to localhost does cause it to
note localhost with the remote slot, causing duplication.
Anyway, if we agree that my patch design is the way to go, then we can work on
making it as good as can be and get this bug resolved. You can always run a
local distcc server and configure your client to use it. Maybe something like
"localhost/1 shadow/1" where "shadow" is the hostname of your localhost.
I'll look into the possible source of duplication using my patch soon. Cheers.
Original comment by [email protected]
on 27 Sep 2010 at 1:55
Here's a revised patch that tidies it up, but most importantly changes the
target of dcc_note_state() in dcc_wait_for_cpp(). Cheers.
Original comment by [email protected]
on 29 Sep 2010 at 2:26
Attachments:
I have tested your last patch during my last gentoo upgrade (48 packages) and
found no duplicate host+slot entries. The localhost host jumped up to 4 slots
even if my notebook has 2 logical CPUs, and I think this is due to the preset
preprocessor slots. I am unsure if some rare race conditions between the
monitor and distcc may still provoke duplicates, but it seems unlikely.
I believe the patch should be applied on the distcc trunk, since the situation
presented by the distcc monitor (text and gtk-based) with it is a lot better
than without it. It also corrects the mess created by the bug overwriting the
remote slot with the local preprocessor slot. What I like most is that I can
now see how busy my local host is with preprocessing.
I also hope distcc developers read and take into account my thoughts :)
Original comment by [email protected]
on 8 Oct 2010 at 6:05
Thanks heaps, everyone! Thanks particularly to Jeremy and Cheepeero for your
patches, review, and testing, and also to Pankaj for the earlier work-around
patch, and to jsjuni56 for first reporting the issue.
I have applied Jeremy's last patch ("distcc-r731_state.patch") to the distcc
svn repository; it is revision 732.
Original comment by [email protected]
on 8 Oct 2010 at 6:31
Original comment by [email protected]
on 8 Oct 2010 at 6:31
- Changed state: Fixed
Cool, I'm glad we could get it resolved.
Cheepero, are you interested in working with me on issue #24? I've started to
play around with it and got some promising results but I could use some help.
Cheers.
Original comment by [email protected]
on 15 Oct 2010 at 8:51
Sure, if I can. I have starred #24 :)
Original comment by [email protected]
on 15 Oct 2010 at 10:52
This bug seems to have come back as of 3.1:
distcc 3.1 x86_64-pc-linux-gnu
(protocols 1, 2 and 3) (default port 3632)
built Feb 17 2012 13:04:11
Built from source via Gentoo sys-devel/distcc-3.1-r5
Original comment by [email protected]
on 6 May 2012 at 6:36
Attachments:
- [Screenshot-distcc Monitor - [email protected]](https://storage.googleapis.com/google-code-attachments/distcc/issue-36/comment-26/Screenshot-distcc Monitor - [email protected])
3.1 is too old to have this patch. It is included in 3.2.
Original comment by [email protected]
on 7 May 2012 at 12:15