Comments (27)

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
I am seeing the same problem, but with distcc 3.1.

Original comment by [email protected] on 18 Feb 2009 at 11:45

from distcc.

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
The way the monitoring programs read multiple state files while those files are
being created and deleted is hokey.  While the monitor is taking a snapshot of
the entries in the .distcc/state directory, multiple compiles on the same
host/slot pair can get triggered.  Note that both the text-based and the
GUI-based monitors suffer from the same problem, since they use the same
underlying function, dcc_mon_poll.

A quick fix for this is to ignore multiple compiles on the same host/slot pair,
for which I have attached a patch below.  This is not an ideal solution, but
should do the job.
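
For illustration only (the attached patch itself is not reproduced in this
thread), the idea of ignoring repeated host/slot pairs in a monitor snapshot
can be sketched in Python. TaskState and drop_duplicate_slots are hypothetical
names, not distcc functions, and the sample entries are made up.

```python
from typing import NamedTuple


class TaskState(NamedTuple):
    # Hypothetical, simplified view of one state-file entry.
    host: str
    slot: int
    file: str
    phase: str


def drop_duplicate_slots(entries):
    """Keep only the first entry seen for each (host, slot) pair."""
    seen = set()
    unique = []
    for entry in entries:
        key = (entry.host, entry.slot)
        if key in seen:
            continue  # another compile on the same host/slot pair: ignore it
        seen.add(key)
        unique.append(entry)
    return unique


# Made-up snapshot in which pink/0 shows up twice.
snapshot = [
    TaskState("pink", 0, "a.c", "compile"),
    TaskState("pink", 0, "b.c", "compile"),
    TaskState("localhost", 1, "c.c", "preprocess"),
]
filtered = drop_duplicate_slots(snapshot)
```

This only hides the duplicates from the display; it does not address whatever
wrote them into the state files.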

Original comment by [email protected] on 26 Aug 2009 at 3:58

Attachments:

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
I'm pretty sure that I have found the actual cause of the bug.  Will upload a 
patch as soon as I solve a few issues in it.

Original comment by [email protected] on 20 Jun 2010 at 2:45

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
OK.  So, the bug appears to be in the client, in the handling of the my_state 
variable.  Version 3.0 introduced the possibility of a file being processed 
partly locally and partly remotely, but the state information was still all 
being stored in my_state, resulting in mixed-up data.  So the lock files are 
created correctly, but the state files are not.

This fix simply creates a second state variable, differentiating between local 
and remote, and stores state data in the appropriate one.  When 
dcc_write_state() is called, it uses whichever state variable was last updated.
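
As a rough sketch of that design (a Python model, not the patch, which is C
and is not reproduced here): one record per side, plus a pointer to whichever
record was touched last. ClientState, note_slot and write_state are
hypothetical names; write_state only stands in for dcc_write_state(), and the
host/slot values are invented.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    host: str = ""
    slot: int = -1  # -1: no slot noted yet


class ClientState:
    """Hypothetical model: one state record per side instead of a single my_state."""

    def __init__(self):
        self.local = TaskState(host="localhost")
        self.remote = TaskState()
        self.last_updated = self.remote

    def note_slot(self, side, host, slot):
        state = self.local if side == "local" else self.remote
        state.host = host
        state.slot = slot
        self.last_updated = state

    def write_state(self):
        # Stand-in for dcc_write_state(): report whichever record changed last.
        return f"{self.last_updated.host}/{self.last_updated.slot}"


client = ClientState()
client.note_slot("remote", "somehost", 6)   # remote compile slot is locked
client.note_slot("local", "localhost", 2)   # local preprocessing slot is locked
during_cpp = client.write_state()           # "localhost/2"
client.note_slot("remote", "somehost", 6)   # remote side is updated again
during_compile = client.write_state()       # "somehost/6"
```

The point of the split is that noting the local slot no longer destroys the
remote record, so the written state can switch back once the remote side is
updated again.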

It works for me, so I would love to hear if it works for you!  Cheers.

PS. This is the first time that I have looked at the distcc codebase, so the 
usual disclaimers of naivety apply: I may have inadvertently created a 
monstrous new bug!

Original comment by [email protected] on 21 Jun 2010 at 3:28

Attachments:

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Observed the bug last night for the first time since I have been running the 
patched client.  Didn't see where it happened, so am not able to reproduce it.

Original comment by [email protected] on 27 Jun 2010 at 6:40

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
I have been bugged by this particular bug for a long while. Finally I had the 
time and energy to investigate, and found this to be the problem:

dcc_build_somewhere calls dcc_pick_host_from_list_and_lock_it, which calls 
dcc_lock_one. This particular function sets my_state.slot to whatever slot is 
available from whatever host is picked. However, when returning to 
dcc_build_somewhere the code gets into dcc_lock_local_cpp, which again calls 
dcc_lock_one for localhost. This, in turn, overwrites the previously stored 
slot with the local slot chosen for preprocessing. Finally, in 
dcc_lock_local_cpp, the state is committed to disk for the monitor to pick 
up, but it now pairs the remote host with the local slot value.

I came to this conclusion after reading Jeremy's post, but my patch does less: 
it lets my_state.slot be set just once, by the first call to 
dcc_pick_host_from_list_and_lock_it. It appears to work.
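
To make the overwrite concrete, here is a small Python model (a sketch, not
distcc code). MyState, note_slot_unpatched, note_slot_once and run are
hypothetical names; the hosts and slot numbers are invented, and -1 is assumed
to mean "no slot recorded yet".

```python
class MyState:
    """Hypothetical stand-in for the single my_state record."""

    def __init__(self):
        self.host = ""
        self.slot = -1  # assumed marker for "no slot recorded yet"


def note_slot_unpatched(state, slot):
    # Every lock overwrites the slot, so the last lock taken wins.
    state.slot = slot


def note_slot_once(state, slot):
    # Set-once behaviour: only the first lock records its slot.
    if state.slot == -1:
        state.slot = slot


def run(note_slot):
    state = MyState()
    state.host = "somehost"  # remote host picked for the compile
    note_slot(state, 6)      # slot locked on the remote host
    note_slot(state, 2)      # slot locked on localhost for preprocessing
    return f"{state.host}/{state.slot}"


unpatched = run(note_slot_unpatched)  # "somehost/2": remote host, local slot
patched = run(note_slot_once)         # "somehost/6": remote slot preserved
```

In the unpatched run the monitor would see a host/slot pair that was never
actually locked, which is how two tasks can appear on the same pair.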

Judging by how communication between the distcc host/slot allocation and the 
monitor state reading is done, I think getting duplicate host+slot entries is 
possible if the monitor is reading the state files while one (or more) of the 
processes are modifying the files, but this should occur rarely.

Hope this helps :)

Original comment by [email protected] on 6 Jul 2010 at 1:11

Attachments:

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Hi cheepeero, I just tested your patch on rev 728 and it actually made the 
symptoms worse: lots of duplicate localhost entries appeared within seconds.  
Have you tested it recently?  Cheers.

Original comment by [email protected] on 10 Sep 2010 at 8:17

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
I will have to check what is different with rev 728; as the name suggests, my 
patch was made against distcc-3.1. I'll also check whether the Gentoo distro 
applies any other patches.

I use distcc for building my gentoo systems; with the patch all behave well 
(one single-core, 1 dual core and one with 8 logical cores). Here is a 
screenshot of how it looks with my patch on the dual-core while building glib.

Another thing that might cause a difference is that I have limited on each 
system the localhost slots in /etc/distcc/hosts.

Original comment by [email protected] on 10 Sep 2010 at 9:39

Attachments:

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
On the previous run I had -j8 left in the gentoo configuration. Here is another 
screenshot on mysql with -j10

Original comment by [email protected] on 10 Sep 2010 at 9:44

Attachments:

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Again mysql with -j14. I have checked out your trunk, but it's quite late for 
me so I'll continue tomorrow :)

Original comment by [email protected] on 10 Sep 2010 at 9:50

Attachments:

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Hey. I have tested my patch against r730, and found no problem with it. The 
behavior of distccmon-gnome was correct. I have no explanation for why your test 
went wrong, except the obvious ones: the patch was not actually applied, or your 
sources were tainted with some other patch or other modifications.

To clarify: my patch refers to the pure distcc trunk / tag 3.1; I did not apply 
any of the other patches in this thread but the one I submitted, namely 
distcc-3.1-fix-slots.patch

I cannot do more except instruct you to try again. Perhaps open the patch file 
with a text editor and read the long comment I placed there. Read the code and 
validate my fix.

I am on the watcher list of this issue and I'll monitor it. I'm sure we'll be 
able to make it work :)

Original comment by [email protected] on 12 Sep 2010 at 3:03

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
cheepeero, could you attach your config.h?  Thanks.

Original comment by [email protected] on 16 Sep 2010 at 10:12

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Here is the config.h.

However, in the meantime I have used distcc without the software limitation in 
/etc/distcc/hosts and once the localhost/0 slot got multiplied three times.

localhost pink

I have used localhost/2 pink/8 with my patch for quite a while now (see date of 
my post) without bumping into this problem, so I assume this has something to 
do with slot allocation, not with transmitting the correct slot to the monitor 
(which is what my patch fixes).

Original comment by [email protected] on 17 Sep 2010 at 8:07

Attachments:

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Interesting.  I'll have a look when I'm feeling better.  Our config.h files are 
the same apart from avahi.  Cheers.

Original comment by [email protected] on 18 Sep 2010 at 6:09

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Hi cheepeero.  I just tested your patch again with and without the /LIMIT 
option and duplicate localhost entries appeared each time.  I tested by 
compiling the Linux kernel with make -j6 and a hosts file of "localhost prison".

The monitor reads slot information from the state files saved by the client; 
there is no separate channel for transmission.  So whatever appears in the 
monitor is whatever the clients are 'thinking'.  :)

Original comment by [email protected] on 24 Sep 2010 at 5:46

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
I understand that the slot info is inferred by the client, not reported by the 
server. I was referring to distcc client - distcc monitor communication through 
state files.

Even so, this does not change the fact that the same member, 
dcc_task_state.slot, is used both for accounting the remote compiling slot and 
the local preprocessor slot, and the latter is the last write in my_state.slot 
(or my_state->slot within your patch).

Let me elaborate:

At line 550 in compile.c, dcc_build_somewhere sets my_state.slot to -1 
(unallocated, I suppose). A few lines below it calls 
dcc_pick_host_from_list_and_lock_it, which at line 104 calls dcc_lock_one. When 
it finds a free host/cpu, this function records the i_cpu (slot) it found 
with dcc_note_state_slot.

When returning to dcc_build_somewhere in compile.c, the flow goes to 
dcc_lock_local_cpp at line 567, which in where.c also calls dcc_lock_one, this 
time with a fake "localhost" list. This time dcc_lock_one finds another i_cpu 
relative to the fake "localhost" list, which overwrites the previous remote 
slot selection. So the remote slot initially chosen for compiling the code is 
lost in my_state at this phase, replaced by the local preprocessor slot.

I have found that these are the only two places where my_state.slot is 
modified. The slot variable in my_state also seems to be used only by the 
monitoring clients, not for host selection or such.

I have examined your patch, and I think it should solve this particular slot 
accounting bug. I think mine would do so too, except yours would report 
preprocessing on the localhost slot it is performed on, while mine would report 
it on the remote slot where compilation is scheduled.

However, both our patches seem to have the same problem:

dcc_lock_local_cpp is looking at a different dcc_hostdef list (see line 194 in 
where.c), namely dcc_hostdef_local_cpp, which has 8 slots by default, and is 
only overridden by the option "--localslots_cpp". 

Also, dcc_lock_local called at line 763 in compile.c examines a different 
localhost list (dcc_hostdef_local set by "--localslots") than the normal 
"localhost" host added by /etc/distcc/hosts or otherwise. This one has 4 slots 
hardcoded as default. 

See dcc_parse_hosts at line 468 in hosts.c or just grep the sources for the 
variables I have pointed to.
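
As a rough illustration of those two implicit localhost lists, the Python
sketch below picks the two limits out of a hosts string. It is not
dcc_parse_hosts: local_slot_limits is a hypothetical helper, the
"--localslots=N" / "--localslots_cpp=N" spelling is an assumption about how
the options are written, and the defaults of 4 and 8 are simply the values
quoted above.

```python
def local_slot_limits(hosts_spec, default_local=4, default_cpp=8):
    """Return the assumed slot limits for local compiles and local preprocessing."""
    limits = {"local": default_local, "cpp": default_cpp}
    for token in hosts_spec.split():
        if token.startswith("--localslots_cpp="):
            limits["cpp"] = int(token.split("=", 1)[1])
        elif token.startswith("--localslots="):
            limits["local"] = int(token.split("=", 1)[1])
        # Ordinary host entries such as "localhost/2" do not touch these limits.
    return limits


defaults = local_slot_limits("localhost/2 pink/8")
overridden = local_slot_limits("--localslots=2 --localslots_cpp=4 localhost/2 pink/8")
```

Note that a "localhost/2" entry leaves both limits unchanged in this model,
which matches the observation that limiting localhost in the hosts file does
not cap the preprocessing slots.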

Under these circumstances I feel it is normal for the monitor to display up to 
4 or 8 slots for localhost. But with either of our patches, the lock system 
should prevent duplicate host+slot entries. Did you observe duplicates or just 
an increased number of localhost slots?

I apologize for the long post, but I felt that the intention of my patch was 
not understood - basically it's a different fix (more trivial and less clean) 
than yours.

Original comment by [email protected] on 25 Sep 2010 at 11:42

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Right, that explains why I occasionally see four slots used on localhost when I 
expect the limit to be two.

However, it's not the problem.  When I use your patch, I still see duplicate 
entries: the same localhost-slot combination listed multiple times.

Doesn't your patch force the client to create the second lock on localhost with 
the same slot as the first lock found on the remote host?  Which is a reversal 
of the original problem that the second lock clobbered the first slot value.  
Now instead of duplicate remote hosts there are duplicate localhosts.

Original comment by [email protected] on 26 Sep 2010 at 4:24

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
No, my changes only stop the local preprocessing slot from overwriting the 
initial remote host/slot selection. As far as I understand the distcc code, the 
lock files are different from the state files.

Here's a scenario without patch: somehost/6 is selected in the first call of 
dcc_lock_one, and gets recorded in my_state. Next dcc_lock_local_cpp locks slot 
localhost/2 for preprocessing, but only records the slot, so from now on the 
state file contains somehost/2 and remains so until it gets deleted. With my 
patch the second write to my_state.slot is ignored, so both the preprocessing 
and the compilation phase appear to happen on somehost/6.

As far as I understand your patch, the preprocessing phase would move the whole 
my_state from remote_state to local_state, and record preprocessing on 
localhost/2, but I am unsure that after this phase is finished the client 
records the switch from local_state to remote_state (or back) while the file is 
getting compiled remotely and the localhost/2 slot is unlocked for other tasks. 
This might lead to duplicating localhost/2 if it gets reused in the meantime.

Unfortunately I had to move from my previous location so I won't be able to use 
distcc for a few weeks - no more pink/8 to help my poor notebook :) Perhaps 
I'll find another setup to test for duplicate localhost+host entries. 

After this discussion I feel your implementation is better than my idea, 
because it describes more accurately what actually happens (preprocessing on 
localhost etc.) I will also switch to your patch and see how it works when my 
setup becomes available again.

Original comment by [email protected] on 26 Sep 2010 at 10:21

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Sorry, I said "lock" but I meant the state file.  Makes me wonder if it's 
really necessary to have separate lock and state files?

Hmmm, see, I thought your patch would cause it to lock and "note" somehost/6, 
then lock localhost/2 but note localhost/6.  Most of the time it doesn't, 
because preprocessing and compiling happen on the remote host.  But if the 
remote fails for some reason, then falling back to localhost does cause it to 
note localhost with the remote slot, causing duplication.
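
A toy Python model of that fallback case, under the assumption that the slot is
recorded only once (by the first lock) while the host name follows wherever the
work actually ends up. noted_state is a hypothetical helper, not a distcc
function, and it collapses the local locking into a single slot for simplicity.

```python
def noted_state(remote_host, remote_slot, local_slot, remote_ok):
    """Return (lock actually held, state noted) for a set-once slot client."""
    slot = -1
    if slot == -1:
        slot = remote_slot   # first lock: the remote slot is recorded
    if slot == -1:
        slot = local_slot    # second lock: ignored, a slot is already recorded
    if remote_ok:
        return (f"{remote_host}/{remote_slot}", f"{remote_host}/{slot}")
    # Remote failed: the work falls back to localhost, but the recorded slot
    # is still the remote one.
    return (f"localhost/{local_slot}", f"localhost/{slot}")


normal = noted_state("somehost", 6, 2, remote_ok=True)
fallback = noted_state("somehost", 6, 2, remote_ok=False)
```

In the fallback case the noted state names a localhost slot that this client
never locked, so another task could legitimately be shown on the same pair.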

Anyway, if we agree that my patch design is the way to go, then we can work on 
making it as good as can be and get this bug resolved.  You can always run a 
local distcc server and configure your client to use it.  Maybe something like 
"localhost/1 shadow/1" where "shadow" is the hostname of your localhost.

I'll look into the possible source of duplication using my patch soon.  Cheers.

Original comment by [email protected] on 27 Sep 2010 at 1:55

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Here's a revised patch that tidies it up, but most importantly changes the 
target of dcc_note_state() in dcc_wait_for_cpp().  Cheers.

Original comment by [email protected] on 29 Sep 2010 at 2:26

Attachments:

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
I have tested your last patch during my last gentoo upgrade (48 packages) and 
found no duplicate host+slot entries. The localhost host jumped up to 4 slots 
even if my notebook has 2 logical CPUs, and I think this is due to the preset 
preprocessor slots. I am unsure if some rare race conditions between the 
monitor and distcc may still provoke duplicates, but it seems unlikely.

I believe the patch should be applied on the distcc trunk, since the situation 
presented by the distcc monitor (text and gtk-based) with it is a lot better 
than without it. It also corrects the mess created by the bug overwriting the 
remote slot with the local preprocessor slot. What I like most is that I can 
now see how busy my local host is with preprocessing.

I also hope distcc developers read and take into account my thoughts :)

Original comment by [email protected] on 8 Oct 2010 at 6:05

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Thanks heaps, everyone!  Thanks particularly to Jeremy and Cheepeero for your 
patches, review, and testing, and also to Pankaj for the earlier work-around 
patch, and to jsjuni56 for first reporting the issue.

I have applied Jeremy's last patch ("distcc-r731_state.patch") to the distcc 
svn repository; it is revision 732.

Original comment by [email protected] on 8 Oct 2010 at 6:31

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024

Original comment by [email protected] on 8 Oct 2010 at 6:31

  • Changed state: Fixed

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Cool, I'm glad we could get it resolved.

Cheepeero, are you interested in working with me on issue #24?  I've started to 
play around with it and got some promising results but I could use some help.  
Cheers.

Original comment by [email protected] on 15 Oct 2010 at 8:51

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
Sure, if I can. I have starred #24 :)

Original comment by [email protected] on 15 Oct 2010 at 10:52

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
This bug seems to have come back as of 3.1:

distcc 3.1 x86_64-pc-linux-gnu
  (protocols 1, 2 and 3) (default port 3632)
  built Feb 17 2012 13:04:11

Built from source via Gentoo sys-devel/distcc-3.1-r5

Original comment by [email protected] on 6 May 2012 at 6:36

Attachments:

GoogleCodeExporter avatar GoogleCodeExporter commented on July 21, 2024
3.1 is too old to have this patch.  It is included in 3.2.

Original comment by [email protected] on 7 May 2012 at 12:15
