seattletestbed / nodemanager
Remote control server for SeattleTestbed nodes
License: MIT License
Several nodes exhibit this behavior, and afterwards the NM has not been observed running.
1247549568.05:PID-26778:[INFO]:Loading config
1247549568.11:PID-26778:Traceback (most recent call last):
File "nmmain.py", line 410, in <module>
File "nmmain.py", line 349, in main
KeyError: 'publickey'
Sample node:
onelab1.info.ucl.ac.be (See http://blackbox.cs.washington.edu:4444/detailed/onelab1.info.ucl.ac.be/1247816135 for more information about the node. Ignore the mismatch on file hashes - it was compared to an older version file dict).
Need a way to wipe a vessel. It should do "stop" and then clear the vessel's log and file system.
Need an easy way for an external user to determine the names of files in their vessel
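The two requests above might be sketched as follows. This is a hypothetical illustration only: `vesseldir` and the `stop_vessel` callable are assumptions, not the actual nodemanager API.

```python
import os
import shutil

def wipe_vessel(vesseldir, stop_vessel):
    # Hypothetical sketch: do "stop" first, then clear the vessel's
    # log and file system by emptying the vessel directory.
    stop_vessel()
    for name in os.listdir(vesseldir):
        path = os.path.join(vesseldir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)

def list_vessel_files(vesseldir):
    # Hypothetical sketch: let an external user determine the names of
    # the files in their vessel, in a predictable (sorted) order.
    return sorted(os.listdir(vesseldir))
```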
runonce in the NM uses the registry via the _winreg module, which is not available on Windows Mobile. By implementing this through the Windows API, both desktop and mobile platforms would have access to it.
NodeManager probably makes calls that don't exist/work on CE which need to be addressed and updated.
This was found via the log analysis script. Several socket related tracebacks were found on node ip: 192.26.179.68, hash 1630fb12833ebe9e2ff167cad9449294, version 0.1s-beta-r4015. The two related categories and sample entries are listed below.
Category and number of entries:
waitforconn, cannot assign requested address : 2682
Sample Entry:
1282649630.65:PID-2275:ERROR: when calling waitforconn for the connection_handler: (99, 'Cannot assign requested address')
Category and number of entries:
Traceback, real_socket error 99, cannot assign requested address : 2681
Sample Entry:
1282649630.66:PID-2275:Traceback (most recent call last):
File "nmmain.py", line 269, in start_accepter
File "/home/uw_seattle/seattle/seattle_repy/emulcomm.py", line 1582, in waitforconn
File "/home/uw_seattle/seattle/seattle_repy/emulcomm.py", line 1621, in get_real_socket
error: (99, 'Cannot assign requested address')
As we allow initiating tcp/udp communication through the loopback interface, we open up some important security risks. Everything on a system that talks tcp/udp which was previously externally inaccessible due to firewalls is potentially subject to malicious traffic when running our software. There are quite a few very serious exploits that can be imagined.
The issue is more generally that the user running our software now has a source of arbitrary traffic that originates behind their firewalls, including any network border firewalls. The localhost example is just an easy one to use to convey the idea.
A good solution would be to restrict sending data to only ports 63100-63180 when the destination is the loopback or an RFC 1918 address (private network address).
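A sketch of such a check using Python's ipaddress module. The function name and return convention are assumptions; the ticket only specifies the 63100-63180 range for loopback and RFC 1918 destinations.

```python
import ipaddress

# Port range reserved for Seattle vessels, per the ticket.
ALLOWED_LOCAL_PORTS = range(63100, 63181)

def is_send_allowed(destip, destport):
    # Restrict traffic whose destination is the loopback interface or
    # an RFC 1918 (private network) address to the Seattle port range;
    # any other destination is left unrestricted.
    addr = ipaddress.ip_address(destip)
    if addr.is_loopback or addr.is_private:
        return destport in ALLOWED_LOCAL_PORTS
    return True
```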
There's currently no way to init() the servicelogger so that the servicelogger will use a non-default file size for the underlying circular logger. It may be important to allow larger log files when more information is logged, such as when #551 is addressed.
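One way init() could expose the size, sketched with a toy circular log. The real servicelogger and circular-logger interfaces differ; `CircularLog`, `maxfilesize`, and the truncate-from-the-front behavior here are assumptions for illustration.

```python
import os

class CircularLog:
    # Toy stand-in for the underlying circular logger: keeps at most
    # maxsize bytes, discarding the oldest data first.
    def __init__(self, filename, maxsize):
        self.filename = filename
        self.maxsize = maxsize

    def write(self, data):
        old = ''
        if os.path.exists(self.filename):
            with open(self.filename) as f:
                old = f.read()
        with open(self.filename, 'w') as f:
            f.write((old + data)[-self.maxsize:])

def init(logname, maxfilesize=16 * 1024):
    # Proposed change: let callers pass a non-default file size
    # through to the underlying circular logger.
    return CircularLog(logname + '.log', maxfilesize)
```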
The nmtestresetvessel_fresh.py nodemanager test occasionally fails on FreeBSD. It may fail only 1 in every 5 or 10 runs.
nmtestresetvessel_fresh.py
out:err:Seattle Traceback (most recent call last):
"nmtestresetvessel_fresh.py", line 10411, in <module>
Exception (with type 'exceptions.Exception'): After reset, vessel status was not Fresh! (Stopped)
Create some unit tests based on unittest for the code that handles threading errors.
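A minimal sketch of the kind of unittest this asks for. `run_with_thread_error_handler` is a hypothetical stand-in for the nodemanager's threading error handling, not existing code.

```python
import threading
import unittest

def run_with_thread_error_handler(target, handler):
    # Run `target` in a thread and report any exception it raises to
    # `handler`, instead of letting the thread die silently.
    def wrapper():
        try:
            target()
        except Exception as e:
            handler(e)
    t = threading.Thread(target=wrapper)
    t.start()
    t.join()

class TestThreadErrorHandling(unittest.TestCase):
    def test_exception_reaches_handler(self):
        seen = []
        def boom():
            raise ValueError("thread failed")
        run_with_thread_error_handler(boom, seen.append)
        self.assertEqual(len(seen), 1)
        self.assertIsInstance(seen[0], ValueError)

    def test_no_exception_no_report(self):
        seen = []
        run_with_thread_error_handler(lambda: None, seen.append)
        self.assertEqual(seen, [])
```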
The test ut_nm_joinsplitvessels.py seems to fail on the second run on most machines. That is, after I run preparetest.py -t on a folder, I run the command:
python utf.py -m nm
and everything passes. The second time I run the same command, the test ut_nm_joinsplitvessels.py fails with the exception:
Standard error : (Produced, Expected):
('---\nUncaught exception! Following is a full traceback, and a user traceback.\nThe user traceback excludes non-user modules. The most recent call is displayed last.\n\nFull debugging traceback:\n "repy.py", line 203, in main\n "/Users/monzum/test_dir/virtual_namespace.py", line 116, in evaluate\n "/Users/monzum/test_dir/safe.py", line 332, in safe_run\n "ut_nm_joinsplitvessels.py", line 11516, in <module>\n "ut_nm_joinsplitvessels.py", line 11331, in nmclient_signedsay\n\nUser traceback:\n "ut_nm_joinsplitvessels.py", line 11516, in <module>\n "ut_nm_joinsplitvessels.py", line 11331, in nmclient_signedsay\n\nException (with class \'.NMClientException\'): Node Manager error \'Internal Error\'\n---\n', None)
If I try to run the test just by itself, it seems to pass correctly. It is possible that this test is failing because of some other test that was previously run.
This seems related to NAT. Here is the node manager log:
1250808755.61:PID-5583:[INFO]:Loading config
1250808763.97:PID-5583:[INFO]: Trying NAT wait
1250808793.21:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808793.21:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808793.21:PID-5583:[INFO]: Trying NAT wait
1250808793.66:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808793.66:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808793.66:PID-5583:[INFO]: Trying NAT wait
1250808800.23:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808800.23:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808800.23:PID-5583:[INFO]: Trying NAT wait
1250808818.14:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808818.14:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808818.14:PID-5583:[INFO]: Trying NAT wait
1250808818.43:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808818.43:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808818.43:PID-5583:[INFO]: Trying NAT wait
1250808818.72:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808818.72:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808818.72:PID-5583:[INFO]: Trying NAT wait
1250808819.01:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808819.01:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808819.01:PID-5583:[INFO]: Trying NAT wait
1250808819.31:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808819.31:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808819.31:PID-5583:[INFO]: Trying NAT wait
1250808819.6:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808819.6:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808819.6:PID-5583:[ERROR]: cannot find a port for recvmess
1250808820.6:PID-5583:[INFO]: Trying NAT wait
1250808823.86:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808823.86:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808823.87:PID-5583:[INFO]: Trying NAT wait
1250808844.52:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808844.52:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808844.52:PID-5583:[INFO]: Trying NAT wait
1250808875.09:PID-5583:myname = NAT$4dadc4a57f24b8e38245a888258ecc87df5a60afv2:9625
1250808875.12:PID-5583:[INFO]:Started
The node manager seems to be leaking socket based file descriptors. I've seen counts of ~ 1300 on a few nodes that I checked.
The following node manager tests fail on windows:
nmtestreadvessellog.py
nmtestresetvessel.py
nmtestresetvessel_multireset.py
nmteststartstopvessel.py
See the output log from running run_tests.py -n here: http://www.pastie.org/536772
Requests with invalid sequence ids are being executed. A request with sequence id n should only be executed if n = 1 or if request n-1 has been executed. Requests with negative sequence ids should also not be executed.
The client should verify that the sequence id is greater than 0 before sending the request. For sequence ids greater than 0, validation should be done on the server side.
The following tests demonstrate this behavior:
nmtestchangeadvertise_invalidsequenceid.mix
nmtestchangeadvertise_invalidsequenceidnegative.mix
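The server-side rule above can be sketched as a single check. `lastexecuted` (the highest sequence id executed so far, 0 when none has run) is a hypothetical name for illustration, not the nodemanager's actual state variable.

```python
def is_valid_sequenceid(sequenceid, lastexecuted):
    # Reject non-positive ids outright; otherwise only accept a request
    # that immediately follows the last executed one, i.e. n == 1 when
    # nothing has run yet, or n where n-1 has already been executed.
    if sequenceid < 1:
        return False
    return sequenceid == lastexecuted + 1
```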
This ticket replaces an outdated ticket, #227.
In order to bring NAT traversal into the node manager I will:
Move the NAT traversal library and its dependencies into seattlelib
Update nmmain.py to use the natlayer
Update nmclient.repy to use the natlayer
Deploy NAT forwarders on a distributed set of nodes
Add integration tests to ensure that the forwarders remain functional
If an Internet connection cannot be detected right away on a Windows system, nmmain.py logs to the nodemanager log that it cannot detect an Internet connection and does not try to reconnect.
If using wireless, it often takes a few moments to connect to the Internet. Hence, nmmain.py never starts.
After I deployed the new nmmain.py patch for the shim on the betabox nodes, I found a critical bug in it: the advertthread failed to start, which caused the nodemanager to shut down.
This needs to be fixed as soon as possible. Please test the new nmmain.py thoroughly before it is deployed.
For the moment I have rolled back the betabox nodes to r4045.
The error seen in the nodemanager log was:
1284067720.2:PID-18790:Traceback (most recent call last):
File "nmmain.py", line 470, in <module>
File "nmmain.py", line 405, in main
File "nmmain.py", line 285, in start_advert_thread
AttributeError: 'module' object has no attribute 'advertthread'
The nodemanager is broken by r3259. Running the nodemanager directly on testbed-opensuse I get:
1260379954.06:PID-32220:[INFO]:Loading config
1260379954.06:PID-32220:Traceback (most recent call last):
File "nmmain.py", line 502, in <module>
File "nmmain.py", line 401, in main
TypeError: cannot concatenate 'str' and 'exceptions.ImportError' objects
This is probably the bad concatenation of the exception object from r3259 that I just emailed Zack about. It took a while to get to looking this far because the tests had been failing during the night because of leftover seattle processes running on the testbed machines.
There is a huge volume of advertise requests coming from each node. This is due to r3250. Here is a log of a typical node advertising.
This seems to be due to errors introduced in r3250.
Can servicelogger.mix be changed to a python script by renaming it and making the following change?
29c29,30
< include servicelookup.repy
---
> import repyhelper
> repyhelper.translate_and_import('servicelookup.repy')
I'm not sure if it's that simple or if there's something else that needs to be done, but I would appreciate it, as it would get a mix file out of my development process for seattlegeni (I don't use servicelogger in seattlegeni, but there's a chain of dependencies, I believe from repyportability, that needs it).
Crash in nmstatusmonitor perhaps makes NM stop responding.
1244352534.74:PID-13443:[ERROR]: File "/home/uw_seattle/seattle_repy/nmadvertise.py", line 791, in run
File "/home/uw_seattle/seattle_repy/misc.py", line 32, in do_sleep
File "/home/uw_seattle/seattle_repy/nonportable.py", line 295, in getruntime
File "/home/uw_seattle/seattle_repy/linux_api.py", line 190, in getSystemUptime
1244352550.81:PID-13443:[ERROR]: File "/home/uw_seattle/seattle_repy/nmstatusmonitor.py", line 108, in run
File "/home/uw_seattle/seattle_repy/statusstorage.py", line 92, in read_status
Additional info:
Node is version .1h
Software Updater memory usage is unusually high. (21420)
Node Manager is not responding to requests on port 1224.
The following error shows up in the logs of the old SU (multiple times, with a different random folder and timestamp each time):
1244695695.24:PID-21901:[Errno socket error] (-2, 'Name or service not known') http://seattle.cs.washington.edu/couvb/updatesite/0.1/metainfo
1244695695.24:PID-21901:[Errno 2] No such file or directory: '/tmp/tmpDuq8ta/metainfo'
The machine is '''planetlab2.williams.edu'''
It would be nice for the list of files returned by this API call to be sorted.
Figure out where those places are and swap them.
I don't see this as being a difficult ticket as I have a pretty good understanding of NM/repy.
Change openDHTadvertise to use the repy xmlrpc library so we may use it in repy.
The nmtestlistfilesinvessel_add_and_remove.py nodemanager test consistently fails on FreeBSD.
nmtestlistfilesinvessel_add_and_remove.py
out:err:Seattle Traceback (most recent call last):
"nmtestlistfilesinvessel_add_and_remove.py", line 10409, in <module>
Exception (with type 'exceptions.Exception'): Original and new file lists do not match:hello,
Current runs can be found at: http://blackbox.cs.washington.edu/~continuousbuild/
I see an error when running the node manager. It is understandable that this would happen when the forwarders are down, but shouldn't the node manager eventually move on instead of repeating this?
1255021313.96:PID-12619:[INFO]: Trying NAT wait
1255021373.32:PID-12619:[ERROR]: when calling waitforconn for the connection_handler: Failed to connect to a forwarder.
1255021373.32:PID-12619:Traceback (most recent call last):
File "nmmain.py", line 244, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 459, in nat_waitforconn
EnvironmentError: Failed to connect to a forwarder.
The node manager dies with InternalError if you send it data that is encrypted with a key it doesn't recognize. It should give a more intelligible error message.
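One way to produce an intelligible message is to wrap the decryption step. This is a sketch: `decrypt` stands in for the real session decryption routine, and all names here are assumptions, not the nodemanager's actual code.

```python
def decrypt_request(data, privatekey, decrypt):
    # If decryption fails -- typically because the sender used a key
    # this node does not recognize -- raise a descriptive error
    # instead of letting a bare 'Internal Error' reach the client.
    try:
        return decrypt(data, privatekey)
    except Exception as e:
        raise ValueError("Could not decrypt request; it may be "
                         "encrypted with an unrecognized key (%s)" % e)
```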
During the development of my log analysis program I found a few beta nodes that were logging exception tracebacks. This specific category of log entry has been seen about 400 times on 3 nodes since July 23rd, which was when the beta nodes were reinstalled. The nodes are using version 0.1r-beta-r3519.
An example of the log entry is below:
1281071514.45:PID-21119:Traceback (most recent call last):
File "/home/uw_seattle/seattle/seattle_repy/nmrequesthandler.py", line 75, in handle_request
File "nodemanager.repyhelpercache/session_repy.py", line 49, in session_recvmessage
ValueError: Bad message size
This error has been seen on the following beta nodes: 131.193.34.21, 210.123.39.168 and 128.112.139.28
Through seash and through nm_remote_api, I tried to show the logs. Some vessels give errors; others don't.
I also experienced ticket #293 which may be related.
jordanr@browsegood !> show log
failure: Node Manager error 'Internal Error'
Log from '169.229.50.14:1224:v30':
67.6697890759 Forwarder Started on 169.229.50.14
91.6218111515 Polling for dead connections.
118.586320162 Polling for dead connections.
failure: Node Manager error 'Internal Error'
Log from '169.229.50.7:1224:v42':
8.61007118225 Forwarder Started on 169.229.50.7
93.5640552044 Polling for dead connections.
124.011840105 Polling for dead connections.
152.649521112 Polling for dead connections.
failure: Node Manager error 'Internal Error'
failure: Node Manager error 'Internal Error'
failure: Node Manager error 'Internal Error'
failure: Node Manager error 'Internal Error'
failure: Node Manager error 'Internal Error'
failure: Node Manager error 'Internal Error'
Log from '169.229.50.6:1224:v63':
26.6492931843 Forwarder Started on 169.229.50.6
failure: Node Manager error 'Internal Error'
Failures on 9 targets: 128.208.1.135:1224:v10, 128.208.1.158:1224:v12, 128.208.1.167:1224:v4, 128.208.1.121:1224:v12, 128.208.1.217:1224:v10, 128.208.1.199:1224:v6, 128.208.1.183:1224:v8, 128.208.1.225:1224:v8, 128.208.1.157:1224:v10
Added group 'loggood' with 3 targets and 'logfail' with 9 targets
After looking at various nodes that have stopped advertising, I ran across an error being thrown that shouldn't be seen.
1273364434.65:PID-5896:[ERROR]: File "/home/uw_seattle/seattle/seattle_repy/nmadvertise.py", line 6171, in run
<type 'exceptions.TypeError'> log_last_exception() takes no arguments (1 given)
On Linux / Mac, runonce doesn't work if the files in /tmp are owned by a different user. This is presumably because other users don't have write access to them.
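One possible fix is to make the lock file name per-user, so users never contend for a /tmp file they can't write. This is a sketch; `user_lockfile` is not the actual runonce code.

```python
import getpass
import os
import tempfile

def user_lockfile(lockname):
    # Embed the current user name in the lock file path, so a lock
    # file owned by another user in the shared /tmp directory cannot
    # block (or be blocked by) this user's instance.
    return os.path.join(tempfile.gettempdir(),
                        'runonce.%s.%s' % (getpass.getuser(), lockname))
```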
Not sure if this is something that we should fix (I've noticed it only on machines that have multiple (two) SU/NM instances running).
What'll occur is the following will be logged to nodemanager logfile in /v2:
1249231081.89:PID-32504:Traceback (most recent call last):
File "/home/uw_seattle/seattle_repy/nmrequesthandler.py", line 75, in handle_request
File "/home/uw_seattle/seattle_repy/session_repy.py", line 42, in session_recvmessage
File "/home/uw_seattle/seattle_repy/emulcomm.py", line 1583, in recv
Exception: Socket closed
We should probably prevent this type of behavior. One probably shouldn't be able to start multiple instances of the Software Updater (SU) and/or the Node Manager (NM).
The nmtestreadvessellog.py nodemanager test is flaky on FreeBSD. It fails about 1 in every 5 times.
Sample failure:
nmtestreadvessellog.py
out:err:Seattle Traceback (most recent call last):
"nmtestreadvessellog.py", line 10407, in <module>
Exception (with type 'exceptions.Exception'): The log '' does not match the expected string '0.23028441581'
Attempting to download a file with an empty name ('') causes the following to be seen by the client:
Node Manager error 'Internal Error'
And the following ends up in the nodemanager log:
1260814665.76:PID-4213:Traceback (most recent call last):
File "/home/uw_seattle/seattle_repy/nmrequesthandler.py", line 92, in handle_request
File "/home/uw_seattle/seattle_repy/nmrequesthandler.py", line 227, in process_API_call
File "/home/uw_seattle/seattle_repy/nmAPI.py", line 2858, in retrievefilefromvessel
IOError: [Errno 21] Is a directory
The nodemanager already checks for empty file names in nmAPI.mix's addfiletovessel; it was probably just an oversight not to add the same check to retrievefilefromvessel and deletefileinvessel.
I'm assuming this error has to do with obtaining vessels on a node.
1247548898.88:PID-7512:Traceback (most recent call last):
File "/homes/iws/justinc/128.208.1.169/seattle_repy/nmrequesthandler.py", line 92, in handle_request
File "/homes/iws/justinc/128.208.1.169/seattle_repy/nmrequesthandler.py", line 227, in process_API_call
File "/homes/iws/justinc/128.208.1.169/seattle_repy/nmAPI.py", line 3150, in splitvessel
File "/homes/iws/justinc/128.208.1.169/seattle_repy/nmAPI.py", line 3093, in setup_vessel
OSError: [Errno 17] File exists: 'v8'
Signeddata is currently storing full messages as old metadata. This can lead to excessive memory consumption in the cases where files are being transmitted to the nodemanager. The message data should not be saved as oldmetadata, only the signature should be necessary.
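The proposed change might look like this sketch: keep only the signature plus a fixed-size digest of the message, never the full message body. The dictionary format and names are assumptions for illustration, not the real signeddata layout.

```python
import hashlib

def make_old_metadata(message, signature):
    # Store the signature and a SHA-1 digest of the message body; the
    # full message (which may be an entire transmitted file) is not
    # retained, bounding memory consumption.
    return {
        'signature': signature,
        'message_sha1': hashlib.sha1(message).hexdigest(),
    }
```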
The beta node with IP 200.0.206.203 isn't advertising but softwareupdater and nmmain still appear to be running.
There appears to be a series of entries that repeat in the node manager log, I've pasted it below:
1280812854.21:PID-3564:[INFO]: node manager is alive...
1280813106.05:PID-3564:AdvertiseError occured, continuing: 'announce error (type: DHT): filedescriptor out of range in select()'
1280813106.36:PID-3564:AdvertiseError occured, continuing: [ 'announce error (type: central): filedescriptor out of range in select()']
1280813411.89:PID-3564:AdvertiseError occured, continuing: [ 'announce error (type: DOR): Socket timed out connecting to host/port.']
The first of these started on Sat Jul 31 2010 21:01:48 according to the first timestamp (1280635308.13).
I have talked with Monzur about this on August 2nd and he has said he can take a look at this tomorrow.
While doing some testing, I found that the function nmclient_createhandle() in nmclient.repy tries to open a file called 'advertised_name', which does not exist. This causes repy files to fail when accessing node managers on the local machine.
The lines in nmclient_createhandle that cause grief are:
def nmclient_createhandle(nmIP, nmport, sequenceid = None, timestamp=True, identity = True, expirationtime = 60*60, publickey = None, privatekey = None, vesselid = None, timeout=15):

  # If nmIP is the same as the current IP, we know that we're testing
  # nmclient. First, we don't run the node manager and the nmclient on the same
  # machine under normal operations. Second, all the components communicate
  # using shim's naming system, rather than IP. During testing, we need to
  # translate the IP into the advertised name of the node manager, which is
  # stored in a file called 'advertised_name'. Added by Danny Yuxing Huang to
  # facilitate the evaluation and deployment of shims.
  if nmIP == getmyip():
    fileobj = open('advertised_name', 'r')
    (nmIP, nmport) = fileobj.read().strip().split(':')
    nmport = int(nmport)
    fileobj.close()
It looks like this was the modified nmclient.repy from the Shim implementation. In the comment it says that the file stores the ip of the machine, but the file doesn't seem to be created.
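One way to avoid the failure would be to consult 'advertised_name' only when it actually exists, falling back to the given address otherwise. A sketch under that assumption; the names here are illustrative, not the actual nmclient fix.

```python
import os

def resolve_nm_address(nmIP, nmport, getmyip):
    # Translate a local IP to the node manager's advertised name only
    # when the 'advertised_name' file is present; otherwise keep the
    # caller's IP/port instead of crashing on a missing file.
    if nmIP == getmyip() and os.path.exists('advertised_name'):
        with open('advertised_name') as fileobj:
            nmIP, portstr = fileobj.read().strip().split(':')
        nmport = int(portstr)
    return nmIP, nmport
```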
Traceback (most recent call last):
File "./seattle_repy/testprocess.py", line 389, in <module>
nmclient_rawsay(getmyip(), 1224, "GetVessels")
File "./seattle_repy/testprocess.py", line 160, in nmclient_rawsay
(response, status) = fullresponse.rsplit('\n',1)
ValueError: need more than 1 value to unpack
This error is generated by the testprocess.py script when it tries to acquire vessels (communicate with the NM).
Please see http://blackbox.cs.washington.edu:4444/detailed/planet02.hhi.fraunhofer.de/1249577253 for more information.
Perhaps in catching this ValueError, it would be useful to print the value of fullresponse along with the general error for easier debugging.
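The suggestion might look like this around the rsplit call. A sketch only: `split_response` is a hypothetical helper, not the testprocess.py code.

```python
def split_response(fullresponse):
    # The reply is expected to be "<response>\n<status>"; when the
    # status line is missing, rsplit('\n', 1) yields a single element
    # and unpacking raises ValueError. Include the raw reply in the
    # error for easier debugging.
    if '\n' not in fullresponse:
        raise ValueError("Malformed node manager reply (no status line): %r"
                         % fullresponse)
    response, status = fullresponse.rsplit('\n', 1)
    return response, status
```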
The nodemanager doesn't appear to advertise its host:port under its own node key, but rather only under the owner and user keys of the individual vessels. Advertising under the node's key is useful in some situations, such as being able to efficiently follow a moving node.
Starting nmmain.py fails on testbed-xp2 (after running nminit.py, of course), with the following backtrace:
1247440591.53:PID-604:[ERROR]: File "nmmain.py", line 408, in <module>
File "nmmain.py", line 280, in main
File "c:\Documents and Settings\cemeyer\foo2\runonce.py", line 32, in getprocesslock
File "c:\Documents and Settings\cemeyer\foo2\runonce.py", line 209, in getprocesslockmutex
File "c:\Documents and Settings\cemeyer\foo2\runonce.py", line 169, in openkey
File "c:\Documents and Settings\cemeyer\foo2\runonce.py", line 167, in openkey
<type 'exceptions.WindowsError'> Error 5 Access is denied
It may have something to do with a permissions error on the mutex (or may not). Look in getprocesslockmutex() in runonce.py.
The node manager logs don't include the exception that happened. They only include the lines of code on which the exception happened. This should be improved to show the actual exception.
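A sketch of the improvement using the standard traceback module; `logfunc` stands in for the servicelogger write call.

```python
import traceback

def log_last_exception(logfunc):
    # format_exc() includes both the source lines and the final
    # "ExceptionType: message" line, which the current logs omit.
    logfunc(traceback.format_exc())
```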
A reset on a vessel doesn't move it to a fresh state. This should be fixed.
nat_check_bi_directional() starts a listener with a callback function that echoes any data received. This can be confusing if someone unknowingly connects to it while trying to connect to a nodemanager.
Fix this to make it clear what is going on if you accidentally get this connection.
I tried to acquire expired vessels through nm_remote_api.py. It looks like the Python socket in emulcomm should be timing out, but it just hangs.
Here's the trace:
richard@satya:~/trunk/foo$ python nat_forwarder_monitor.py start
Acquiring vessels...
Traceback (most recent call last):
File "nat_forwarder_monitor.py", line 69, in <module>
success, info = nm_remote_api.initialize(hosts, 'jordanr')
File "/home/richard/trunk/foo/nm_remote_api.py", line 2996, in initialize
new_vessels = add_node_by_hostname(host)
File "/home/richard/trunk/foo/nm_remote_api.py", line 2580, in add_node_by_hostname
publickey = key[
File "/home/richard/trunk/foo/nm_remote_api.py", line 1868, in nmclient_createhandle
response = nmclient_rawsay(newhandle, 'GetVessels')
File "/home/richard/trunk/foo/nm_remote_api.py", line 1921, in nmclient_rawsay
fullresponse = nmclient_rawcommunicate(nmhandle, *args)
File "/home/richard/trunk/foo/nm_remote_api.py", line 1741, in nmclient_rawcommunicate
return session_recvmessage(thisconnobject)
File "/home/richard/trunk/foo/nm_remote_api.py", line 1619, in session_recvmessage
currentbyte = socketobj.recv(1)
File "/home/richard/trunk/foo/emulcomm.py", line 1042, in recv
datarecvd = comminfo[mycommid]['socket'].recv(bytes)
KeyboardInterrupt
There seems to be a huge downswing in the number of advertising nodes as of May 22nd. The node manager on most of these systems has over 1K sockets open in the TIME_WAIT state.
On 170.140.119.69 (a PL node), netstat -n reports there are 957 sockets open to SeattleGENI, 33 sockets open to 128.208.3.203 (testbed-ubuntu), and another 15 or so that seem to be associated with different client IPs (possibly from seash?).
The log of the system looks like:
1274775319.71:PID-4231:AdvertiseError occured, continuing: ["announce error (type: DOR): Unexpected tag 'details' while parsing response."]
1274775731.32:PID-4231:[INFO]: node manager is alive...
1274776133.23:PID-4231:AdvertiseError occured, continuing: ["announce error (type: DOR): Unexpected tag 'details' while parsing response."]
1274776331.61:PID-4231:[INFO]: node manager is alive...
1274776846.31:PID-4231:AdvertiseError occured, continuing: ['announce error (type: central): filedescriptor out of range in select()']
1274776920.74:PID-4231:AdvertiseError occured, continuing: ['announce error (type: DHT): filedescriptor out of range in select()']
1274776927.36:PID-4231:AdvertiseError occured, continuing: ['announce error (type: central): filedescriptor out of range in select()']
I tried restarting the node manager on a beta node (128.112.139.28) and am monitoring it to see if it leaks sockets.
I believe the 'details' line in the traceback above is not relevant because I ran a version of the node manager that did frequent advertisement and it didn't leak sockets. I also ran nmclient_get_vessel_dict 1000 times each from 10 threads and didn't detect a socket leak.
After running for some time, the node manager dies on my laptop. It returns code 30 which implies a timer raised an exception that wasn't caught.
Here is my node manager log. I'm unsure if the tracebacks contained in this are really pointing to problems or not.
1249162032.98:PID-17687:[INFO]:Loading config
1249162128.02:PID-17687:myname = NAT$d40aa1d38e2ce73eff7351fee1ce4ee12cf1dd72v2:1224
1249162128.05:PID-17687:[INFO]:Started
1249162313.5:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmrequesthandler.py", line 92, in handle_request
File "/Users/justincappos/test/nmrequesthandler.py", line 227, in process_API_call
File "/Users/justincappos/test/nmAPI.py", line 2788, in addfiletovessel
IOError: [Errno 21] Is a directory: 'v1//'
1249162357.69:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmrequesthandler.py", line 75, in handle_request
File "/Users/justincappos/test/session_repy.py", line 42, in session_recvmessage
File "/Users/justincappos/test/emulcomm.py", line 1595, in recv
Exception: Socket closed
1249162398.68:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmrequesthandler.py", line 75, in handle_request
File "/Users/justincappos/test/session_repy.py", line 42, in session_recvmessage
File "/Users/justincappos/test/emulcomm.py", line 1595, in recv
Exception: Socket closed
1249162732.89:PID-17687:[INFO]: node manager is alive...
1249162980.77:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmrequesthandler.py", line 92, in handle_request
File "/Users/justincappos/test/nmrequesthandler.py", line 227, in process_API_call
File "/Users/justincappos/test/nmAPI.py", line 2788, in addfiletovessel
IOError: [Errno 21] Is a directory: 'v13//'
1249163025.18:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmrequesthandler.py", line 75, in handle_request
File "/Users/justincappos/test/session_repy.py", line 42, in session_recvmessage
File "/Users/justincappos/test/emulcomm.py", line 1595, in recv
Exception: Socket closed
1249163066.05:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmrequesthandler.py", line 75, in handle_request
File "/Users/justincappos/test/session_repy.py", line 42, in session_recvmessage
File "/Users/justincappos/test/emulcomm.py", line 1595, in recv
Exception: Socket closed
1249163267.4:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmadvertise.py", line 5815, in run
File "/Users/justincappos/test/nmadvertise.py", line 5688, in advertise_announce
AdvertiseError: openDHT announce error: timed out
1249163267.63:PID-17687:[WARN]:At 1249163267.63 restarting advert...
1249163273.76:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmadvertise.py", line 5815, in run
File "/Users/justincappos/test/nmadvertise.py", line 5688, in advertise_announce
AdvertiseError: openDHT announce error: timed out
1249163273.98:PID-17687:[WARN]:At 1249163273.98 restarting advert...
1249163339.53:PID-17687:[INFO]: node manager is alive...
openDHTadvertise.py was switched to use python's stdlib xmlrpclib because of a bug in our xmlrpc library that has since been fixed. We should switch it back.
Also, port it to seattlelib.