seattletestbed / nodemanager
Remote control server for SeattleTestbed nodes
License: MIT License
Several nodes exhibit this behavior, and afterwards the NM has not been observed running.
1247549568.05:PID-26778:[INFO]:Loading config
1247549568.11:PID-26778:Traceback (most recent call last):
File "nmmain.py", line 410, in <module>
File "nmmain.py", line 349, in main
KeyError: 'publickey'
Sample node:
onelab1.info.ucl.ac.be (See http://blackbox.cs.washington.edu:4444/detailed/onelab1.info.ucl.ac.be/1247816135 for more information about the node. Ignore the mismatch on file hashes - it was compared to an older version file dict).
Need a way to wipe a vessel. It should do "stop" and then clear the vessel's log and file system.
Need an easy way for an external user to determine the names of files in their vessel
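The two requests above might be sketched as follows. This is a hypothetical illustration only: `vesseldir` and the `stop_vessel` callable are assumptions, not the actual nodemanager API.

```python
import os
import shutil

def wipe_vessel(vesseldir, stop_vessel):
    # Hypothetical sketch: do "stop" first, then clear the vessel's
    # log and file system by emptying the vessel directory.
    stop_vessel()
    for name in os.listdir(vesseldir):
        path = os.path.join(vesseldir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)

def list_vessel_files(vesseldir):
    # Hypothetical sketch: let an external user determine the names of
    # the files in their vessel, in a predictable (sorted) order.
    return sorted(os.listdir(vesseldir))
```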
runonce in the NM uses the registry via the _winreg module, which is not available on Windows Mobile. By implementing this through the Windows API, both desktop and mobile platforms would have access to it.
NodeManager probably makes calls that don't exist/work on CE which need to be addressed and updated.
This was found via the log analysis script. Several socket related tracebacks were found on node ip: 192.26.179.68, hash 1630fb12833ebe9e2ff167cad9449294, version 0.1s-beta-r4015. The two related categories and sample entries are listed below.
Category and number of entries:
waitforconn, cannot assign requested address : 2682
Sample Entry:
1282649630.65:PID-2275:ERROR: when calling waitforconn for the connection_handler: (99, 'Cannot assign requested address')
Category and number of entries:
Traceback, real_socket error 99, cannot assign requested address : 2681
Sample Entry:
1282649630.66:PID-2275:Traceback (most recent call last):
File "nmmain.py", line 269, in start_accepter
File "/home/uw_seattle/seattle/seattle_repy/emulcomm.py", line 1582, in waitforconn
File "/home/uw_seattle/seattle/seattle_repy/emulcomm.py", line 1621, in get_real_socket
error: (99, 'Cannot assign requested address')
As we allow initiating tcp/udp communication through the loopback interface, we open up some important security risks. Everything on a system that talks tcp/udp which was previously externally inaccessible due to firewalls is potentially subject to malicious traffic when running our software. There are quite a few very serious exploits that can be imagined.
The issue is more generally that the user running our software now has a source of arbitrary traffic that originates behind their firewalls, including any network border firewalls. The localhost example is just an easy one to use to convey the idea.
A good solution would be to restrict sending data to only ports 63100-63180 when the destination is the loopback or an RFC 1918 address (private network address).
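A sketch of such a check using Python's ipaddress module. The function name and return convention are assumptions; the ticket only specifies the 63100-63180 range for loopback and RFC 1918 destinations.

```python
import ipaddress

# Port range reserved for Seattle vessels, per the ticket.
ALLOWED_LOCAL_PORTS = range(63100, 63181)

def is_send_allowed(destip, destport):
    # Restrict traffic whose destination is the loopback interface or
    # an RFC 1918 (private network) address to the Seattle port range;
    # any other destination is left unrestricted.
    addr = ipaddress.ip_address(destip)
    if addr.is_loopback or addr.is_private:
        return destport in ALLOWED_LOCAL_PORTS
    return True
```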
There's currently no way to init() the servicelogger so that the servicelogger will use a non-default file size for the underlying circular logger. It may be important to allow larger log files when more information is logged, such as when #551 is addressed.
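One way init() could expose the size, sketched with a toy circular log. The real servicelogger and circular-logger interfaces differ; `CircularLog`, `maxfilesize`, and the truncate-from-the-front behavior here are assumptions for illustration.

```python
import os

class CircularLog:
    # Toy stand-in for the underlying circular logger: keeps at most
    # maxsize bytes, discarding the oldest data first.
    def __init__(self, filename, maxsize):
        self.filename = filename
        self.maxsize = maxsize

    def write(self, data):
        old = ''
        if os.path.exists(self.filename):
            with open(self.filename) as f:
                old = f.read()
        with open(self.filename, 'w') as f:
            f.write((old + data)[-self.maxsize:])

def init(logname, maxfilesize=16 * 1024):
    # Proposed change: let callers pass a non-default file size
    # through to the underlying circular logger.
    return CircularLog(logname + '.log', maxfilesize)
```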
The nmtestresetvessel_fresh.py nodemanager test occasionally fails on FreeBSD. It may fail only 1 in every 5 or 10 runs.
nmtestresetvessel_fresh.py
out:err:Seattle Traceback (most recent call last):
"nmtestresetvessel_fresh.py", line 10411, in <module>
Exception (with type 'exceptions.Exception'): After reset, vessel status was not Fresh! (Stopped)
Create some unit tests based on unittest for the code that handles threading errors.
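A minimal sketch of the kind of unittest this asks for. `run_with_thread_error_handler` is a hypothetical stand-in for the nodemanager's threading error handling, not existing code.

```python
import threading
import unittest

def run_with_thread_error_handler(target, handler):
    # Run `target` in a thread and report any exception it raises to
    # `handler`, instead of letting the thread die silently.
    def wrapper():
        try:
            target()
        except Exception as e:
            handler(e)
    t = threading.Thread(target=wrapper)
    t.start()
    t.join()

class TestThreadErrorHandling(unittest.TestCase):
    def test_exception_reaches_handler(self):
        seen = []
        def boom():
            raise ValueError("thread failed")
        run_with_thread_error_handler(boom, seen.append)
        self.assertEqual(len(seen), 1)
        self.assertIsInstance(seen[0], ValueError)

    def test_no_exception_no_report(self):
        seen = []
        run_with_thread_error_handler(lambda: None, seen.append)
        self.assertEqual(seen, [])
```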
The test ut_nm_joinsplitvessels.py seems to fail on the second run on most machines. That is, after I run preparetest.py -t on a folder, I run the command:
python utf.py -m nm
and everything passes. The second time I run the same command, the test ut_nm_joinsplitvessels.py fails with the exception:
Standard error : (Produced, Expected):
('---\nUncaught exception! Following is a full traceback, and a user traceback.\nThe user traceback excludes non-user modules. The most recent call is displayed last.\n\nFull debugging traceback:\n "repy.py", line 203, in main\n "/Users/monzum/test_dir/virtual_namespace.py", line 116, in evaluate\n "/Users/monzum/test_dir/safe.py", line 332, in safe_run\n "ut_nm_joinsplitvessels.py", line 11516, in <module>\n "ut_nm_joinsplitvessels.py", line 11331, in nmclient_signedsay\n\nUser traceback:\n "ut_nm_joinsplitvessels.py", line 11516, in <module>\n "ut_nm_joinsplitvessels.py", line 11331, in nmclient_signedsay\n\nException (with class \'.NMClientException\'): Node Manager error \'Internal Error\'\n---\n', None)
If I try to run the test just by itself, it seems to pass correctly. It is possible that this test is failing because of some other test that was previously run.
This seems related to NAT. Here is the node manager log:
1250808755.61:PID-5583:[INFO]:Loading config
1250808763.97:PID-5583:[INFO]: Trying NAT wait
1250808793.21:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808793.21:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808793.21:PID-5583:[INFO]: Trying NAT wait
1250808793.66:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808793.66:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808793.66:PID-5583:[INFO]: Trying NAT wait
1250808800.23:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808800.23:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808800.23:PID-5583:[INFO]: Trying NAT wait
1250808818.14:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808818.14:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808818.14:PID-5583:[INFO]: Trying NAT wait
1250808818.43:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808818.43:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808818.43:PID-5583:[INFO]: Trying NAT wait
1250808818.72:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808818.72:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808818.72:PID-5583:[INFO]: Trying NAT wait
1250808819.01:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808819.01:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808819.01:PID-5583:[INFO]: Trying NAT wait
1250808819.31:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808819.31:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808819.31:PID-5583:[INFO]: Trying NAT wait
1250808819.6:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808819.6:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808819.6:PID-5583:[ERROR]: cannot find a port for recvmess
1250808820.6:PID-5583:[INFO]: Trying NAT wait
1250808823.86:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808823.86:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808823.87:PID-5583:[INFO]: Trying NAT wait
1250808844.52:PID-5583:[ERROR]: when calling waitforconn for the connection_handler: list index out of range
1250808844.52:PID-5583:Traceback (most recent call last):
File "nmmain.py", line 245, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 429, in nat_waitforconn
File "nodemanager.repyhelpercache/NAT_advertisement_repy.py", line 102, in nat_forwarder_list_lookup
IndexError: list index out of range
1250808844.52:PID-5583:[INFO]: Trying NAT wait
1250808875.09:PID-5583:myname = NAT$4dadc4a57f24b8e38245a888258ecc87df5a60afv2:9625
1250808875.12:PID-5583:[INFO]:Started
The node manager seems to be leaking socket based file descriptors. I've seen counts of ~ 1300 on a few nodes that I checked.
The following node manager tests fail on windows:
nmtestreadvessellog.py
nmtestresetvessel.py
nmtestresetvessel_multireset.py
nmteststartstopvessel.py
See the output log from running run_tests.py -n here: http://www.pastie.org/536772
Requests with invalid sequence ids are being executed. A request with sequence id n should only be executed if n = 1 or if request n-1 has been executed. Requests with negative sequence ids should also not be executed.
The client should verify that the sequence id is greater than 0 before sending the request. For sequence ids greater than 0, validation should be done on the server side.
The following tests demonstrate this behavior:
nmtestchangeadvertise_invalidsequenceid.mix
nmtestchangeadvertise_invalidsequenceidnegative.mix
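The server-side rule above can be sketched as a single check. `lastexecuted` (the highest sequence id executed so far, 0 when none has run) is a hypothetical name for illustration, not the nodemanager's actual state variable.

```python
def is_valid_sequenceid(sequenceid, lastexecuted):
    # Reject non-positive ids outright; otherwise only accept a request
    # that immediately follows the last executed one, i.e. n == 1 when
    # nothing has run yet, or n where n-1 has already been executed.
    if sequenceid < 1:
        return False
    return sequenceid == lastexecuted + 1
```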
This ticket replaces an outdated ticket, #227.
In order to bring NAT traversal into the node manager I will:
Move the NAT traversal library and its dependencies into seattlelib
Update nmmain.py to use the natlayer
Update nmclient.repy to use the natlayer
Deploy NAT forwarders on a distributed set of nodes
Add integration tests to ensure that the forwarders remain functional
If an Internet connection cannot be detected right away on a Windows system, nmmain.py logs to the nodemanager log that it cannot detect an Internet connection and does not try to reconnect.
If using wireless, it often takes a few moments to connect to the Internet. Hence, nmmain.py never starts.
After I deployed the new nmmain.py patch for the shim on the betabox nodes, I found a critical bug in it: the advertthread failed to start, which caused the nodemanager to shut down.
This needs to be fixed as soon as possible. Please test the new nmmain.py thoroughly before it is deployed.
For the moment I have rolled back the betabox nodes to r4045.
The error seen in the nodemanager log was:
1284067720.2:PID-18790:Traceback (most recent call last):
File "nmmain.py", line 470, in <module>
File "nmmain.py", line 405, in main
File "nmmain.py", line 285, in start_advert_thread
AttributeError: 'module' object has no attribute 'advertthread'
The nodemanager is broken by r3259. Running the nodemanager directly on testbed-opensuse I get:
1260379954.06:PID-32220:[INFO]:Loading config
1260379954.06:PID-32220:Traceback (most recent call last):
File "nmmain.py", line 502, in <module>
File "nmmain.py", line 401, in main
TypeError: cannot concatenate 'str' and 'exceptions.ImportError' objects
This is probably the bad concatenation of the exception object from r3259 that I just emailed Zack about. It took a while to get to looking this far because the tests had been failing during the night because of leftover seattle processes running on the testbed machines.
There is a huge volume of advertise requests coming from each node. This is due to r3250. Here is a log of a typical node advertising.
This seems to be due to errors introduced in r3250.
Can servicelogger.mix be changed to a python script by renaming it and making the following change?
29c29,30
< include servicelookup.repy
---
> import repyhelper
> repyhelper.translate_and_import('servicelookup.repy')
I'm not sure if it's that simple or if there's something else that needs to be done, but I would appreciate it, as it would get a mix file out of my development process for seattlegeni (I don't use servicelogger in seattlegeni, but there's a chain of dependencies, I believe from repyportability, that needs it).
Crash in nmstatusmonitor perhaps makes NM stop responding.
1244352534.74:PID-13443:[ERROR]: File "/home/uw_seattle/seattle_repy/nmadvertise.py", line 791, in run
File "/home/uw_seattle/seattle_repy/misc.py", line 32, in do_sleep
File "/home/uw_seattle/seattle_repy/nonportable.py", line 295, in getruntime
File "/home/uw_seattle/seattle_repy/linux_api.py", line 190, in getSystemUptime
1244352550.81:PID-13443:[ERROR]: File "/home/uw_seattle/seattle_repy/nmstatusmonitor.py", line 108, in run
File "/home/uw_seattle/seattle_repy/statusstorage.py", line 92, in read_status
Additional info:
Node is version .1h
Software Updater memory usage is unusually high. (21420)
Node Manager is not responding to requests on port 1224.
The following error shows up in the logs of the old SU (multiple times, with a different random folder and timestamp each time):
1244695695.24:PID-21901:[Errno socket error] (-2, 'Name or service not known') http://seattle.cs.washington.edu/couvb/updatesite/0.1/metainfo
1244695695.24:PID-21901:[Errno 2] No such file or directory: '/tmp/tmpDuq8ta/metainfo'
The machine is '''planetlab2.williams.edu'''
It would be nice for the list of files returned by this API call to be sorted.
Figure out where those places are and swap them.
I don't see this as being a difficult ticket as I have a pretty good understanding of NM/repy.
Change openDHTadvertise to use the repy xmlrpc library so we may use it in repy.
The nmtestlistfilesinvessel_add_and_remove.py nodemanager test consistently fails on FreeBSD.
nmtestlistfilesinvessel_add_and_remove.py
out:err:Seattle Traceback (most recent call last):
"nmtestlistfilesinvessel_add_and_remove.py", line 10409, in <module>
Exception (with type 'exceptions.Exception'): Original and new file lists do not match:hello,
Current runs can be found at: http://blackbox.cs.washington.edu/~continuousbuild/
I see an error when running the node manager. It is understandable that this would happen when the forwarders are down, but shouldn't the node manager eventually move on instead of repeating this?
1255021313.96:PID-12619:[INFO]: Trying NAT wait
1255021373.32:PID-12619:[ERROR]: when calling waitforconn for the connection_handler: Failed to connect to a forwarder.
1255021373.32:PID-12619:Traceback (most recent call last):
File "nmmain.py", line 244, in start_accepter
File "nodemanager.repyhelpercache/NATLayer_rpc_repy.py", line 459, in nat_waitforconn
EnvironmentError: Failed to connect to a forwarder.
The node manager dies with InternalError if you send it data that is encrypted with a key it doesn't recognize. It should give a more intelligible error message.
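One way to produce an intelligible message is to wrap the decryption step. This is a sketch: `decrypt` stands in for the real session decryption routine, and all names here are assumptions, not the nodemanager's actual code.

```python
def decrypt_request(data, privatekey, decrypt):
    # If decryption fails -- typically because the sender used a key
    # this node does not recognize -- raise a descriptive error
    # instead of letting a bare 'Internal Error' reach the client.
    try:
        return decrypt(data, privatekey)
    except Exception as e:
        raise ValueError("Could not decrypt request; it may be "
                         "encrypted with an unrecognized key (%s)" % e)
```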
During the development of my log analysis program I found a few beta nodes that were logging exception tracebacks. This specific category of log entry has been seen about 400 times on 3 nodes since July 23rd, which was when the beta nodes were reinstalled. The nodes are using version 0.1r-beta-r3519.
An example of the log entry is below:
1281071514.45:PID-21119:Traceback (most recent call last):
File "/home/uw_seattle/seattle/seattle_repy/nmrequesthandler.py", line 75, in handle_request
File "nodemanager.repyhelpercache/session_repy.py", line 49, in session_recvmessage
ValueError: Bad message size
This error has been seen on the following beta nodes: 131.193.34.21, 210.123.39.168 and 128.112.139.28
Through seash and through nm_remote_api, I tried to show the logs. Some vessels give errors; others don't.
I also experienced ticket #293 which may be related.
jordanr@browsegood !> show log
failure: Node Manager error 'Internal Error'
Log from '169.229.50.14:1224:v30':
67.6697890759 Forwarder Started on 169.229.50.14
91.6218111515 Polling for dead connections.
118.586320162 Polling for dead connections.
failure: Node Manager error 'Internal Error'
Log from '169.229.50.7:1224:v42':
8.61007118225 Forwarder Started on 169.229.50.7
93.5640552044 Polling for dead connections.
124.011840105 Polling for dead connections.
152.649521112 Polling for dead connections.
failure: Node Manager error 'Internal Error'
failure: Node Manager error 'Internal Error'
failure: Node Manager error 'Internal Error'
failure: Node Manager error 'Internal Error'
failure: Node Manager error 'Internal Error'
failure: Node Manager error 'Internal Error'
Log from '169.229.50.6:1224:v63':
26.6492931843 Forwarder Started on 169.229.50.6
failure: Node Manager error 'Internal Error'
Failures on 9 targets: 128.208.1.135:1224:v10, 128.208.1.158:1224:v12, 128.208.1.167:1224:v4, 128.208.1.121:1224:v12, 128.208.1.217:1224:v10, 128.208.1.199:1224:v6, 128.208.1.183:1224:v8, 128.208.1.225:1224:v8, 128.208.1.157:1224:v10
Added group 'loggood' with 3 targets and 'logfail' with 9 targets
After looking at various nodes that have stopped advertising, I ran across an error being thrown that shouldn't be seen.
1273364434.65:PID-5896:[ERROR]: File "/home/uw_seattle/seattle/seattle_repy/nmadvertise.py", line 6171, in run
<type 'exceptions.TypeError'> log_last_exception() takes no arguments (1 given)
On Linux / Mac, runonce doesn't work if the files in /tmp are owned by a different user. This is presumably because other users don't have write access to them.
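One possible fix is to make the lock file name per-user, so users never contend for a /tmp file they can't write. This is a sketch; `user_lockfile` is not the actual runonce code.

```python
import getpass
import os
import tempfile

def user_lockfile(lockname):
    # Embed the current user name in the lock file path, so a lock
    # file owned by another user in the shared /tmp directory cannot
    # block (or be blocked by) this user's instance.
    return os.path.join(tempfile.gettempdir(),
                        'runonce.%s.%s' % (getpass.getuser(), lockname))
```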
Not sure if this is something that we should fix (I've noticed it only on machines that have multiple (two) SU/NM instances running).
What'll occur is the following will be logged to nodemanager logfile in /v2:
1249231081.89:PID-32504:Traceback (most recent call last):
File "/home/uw_seattle/seattle_repy/nmrequesthandler.py", line 75, in handle_request
File "/home/uw_seattle/seattle_repy/session_repy.py", line 42, in session_recvmessage
File "/home/uw_seattle/seattle_repy/emulcomm.py", line 1583, in recv
Exception: Socket closed
We should probably prevent this type of behavior. One probably shouldn't be able to start multiple instances of the Software Updater (SU) and/or the Node Manager (NM).
The nmtestreadvessellog.py nodemanager test is flaky on FreeBSD. It fails about 1 in every 5 times.
Sample failure:
nmtestreadvessellog.py
out:err:Seattle Traceback (most recent call last):
"nmtestreadvessellog.py", line 10407, in <module>
Exception (with type 'exceptions.Exception'): The log '' does not match the expected string '0.23028441581'
Attempting to download a file with an empty name ('') causes the following to be seen by the client:
Node Manager error 'Internal Error'
And the following ends up in the nodemanager log:
1260814665.76:PID-4213:Traceback (most recent call last):
File "/home/uw_seattle/seattle_repy/nmrequesthandler.py", line 92, in handle_request
File "/home/uw_seattle/seattle_repy/nmrequesthandler.py", line 227, in process_API_call
File "/home/uw_seattle/seattle_repy/nmAPI.py", line 2858, in retrievefilefromvessel
IOError: [Errno 21] Is a directory
The nodemanager already checks for empty file names in nmAPI.mix's addfiletovessel; it was probably just an oversight not to add the same check to retrievefilefromvessel and deletefileinvessel.
I'm assuming this error has to do with obtaining vessels on a node.
1247548898.88:PID-7512:Traceback (most recent call last):
File "/homes/iws/justinc/128.208.1.169/seattle_repy/nmrequesthandler.py", line 92, in handle_request
File "/homes/iws/justinc/128.208.1.169/seattle_repy/nmrequesthandler.py", line 227, in process_API_call
File "/homes/iws/justinc/128.208.1.169/seattle_repy/nmAPI.py", line 3150, in splitvessel
File "/homes/iws/justinc/128.208.1.169/seattle_repy/nmAPI.py", line 3093, in setup_vessel
OSError: [Errno 17] File exists: 'v8'
Signeddata is currently storing full messages as old metadata. This can lead to excessive memory consumption in the cases where files are being transmitted to the nodemanager. The message data should not be saved as oldmetadata, only the signature should be necessary.
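The proposed change might look like this sketch: keep only the signature plus a fixed-size digest of the message, never the full message body. The dictionary format and names are assumptions for illustration, not the real signeddata layout.

```python
import hashlib

def make_old_metadata(message, signature):
    # Store the signature and a SHA-1 digest of the message body; the
    # full message (which may be an entire transmitted file) is not
    # retained, bounding memory consumption.
    return {
        'signature': signature,
        'message_sha1': hashlib.sha1(message).hexdigest(),
    }
```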
The beta node with IP 200.0.206.203 isn't advertising but softwareupdater and nmmain still appear to be running.
There appears to be a series of entries that repeat in the node manager log, I've pasted it below:
1280812854.21:PID-3564:[INFO]: node manager is alive...
1280813106.05:PID-3564:AdvertiseError occured, continuing: 'announce error (type: DHT): filedescriptor out of range in select()'
1280813106.36:PID-3564:AdvertiseError occured, continuing: [ 'announce error (type: central): filedescriptor out of range in select()']
1280813411.89:PID-3564:AdvertiseError occured, continuing: [ 'announce error (type: DOR): Socket timed out connecting to host/port.']
The first of these started on Sat Jul 31 2010 21:01:48 according to the first timestamp (1280635308.13).
I have talked with Monzur about this on August 2nd and he has said he can take a look at this tomorrow.
While doing some testing, I found that the function nmclient_createhandle() in nmclient.repy tries to open a file called 'advertised_name', which does not exist. This causes repy files to fail when accessing node managers on the local machine.
The lines in nmclient_createhandle that cause grief are:
def nmclient_createhandle(nmIP, nmport, sequenceid = None, timestamp=True, identity = True, expirationtime = 60*60, publickey = None, privatekey = None, vesselid = None, timeout=15):

  # If nmIP is the same as the current IP, we know that we're testing
  # nmclient. First, we don't run the node manager and the nmclient on the same
  # machine under normal operations. Second, all the components communicate
  # using shim's naming system, rather than IP. During testing, we need to
  # translate the IP into the advertised name of the node manager, which is
  # stored in a file called 'advertised_name'. Added by Danny Yuxing Huang to
  # facilitate the evaluation and deployment of shims.
  if nmIP == getmyip():
    fileobj = open('advertised_name', 'r')
    (nmIP, nmport) = fileobj.read().strip().split(':')
    nmport = int(nmport)
    fileobj.close()
It looks like this was the modified nmclient.repy from the Shim implementation. In the comment it says that the file stores the ip of the machine, but the file doesn't seem to be created.
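One way to avoid the failure would be to consult 'advertised_name' only when it actually exists, falling back to the given address otherwise. A sketch under that assumption; the names here are illustrative, not the actual nmclient fix.

```python
import os

def resolve_nm_address(nmIP, nmport, getmyip):
    # Translate a local IP to the node manager's advertised name only
    # when the 'advertised_name' file is present; otherwise keep the
    # caller's IP/port instead of crashing on a missing file.
    if nmIP == getmyip() and os.path.exists('advertised_name'):
        with open('advertised_name') as fileobj:
            nmIP, portstr = fileobj.read().strip().split(':')
        nmport = int(portstr)
    return nmIP, nmport
```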
Traceback (most recent call last):
File "./seattle_repy/testprocess.py", line 389, in <module>
nmclient_rawsay(getmyip(), 1224, "GetVessels")
File "./seattle_repy/testprocess.py", line 160, in nmclient_rawsay
(response, status) = fullresponse.rsplit('\n',1)
ValueError: need more than 1 value to unpack
This error is generated by the testprocess.py script when it tries to acquire vessels (communicate with the NM).
Please see http://blackbox.cs.washington.edu:4444/detailed/planet02.hhi.fraunhofer.de/1249577253 for more information.
Perhaps in catching this ValueError, it would be useful to print the value of fullresponse along with the general error for easier debugging.
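The suggestion might look like this around the rsplit call. A sketch only: `split_response` is a hypothetical helper, not the testprocess.py code.

```python
def split_response(fullresponse):
    # The reply is expected to be "<response>\n<status>"; when the
    # status line is missing, rsplit('\n', 1) yields a single element
    # and unpacking raises ValueError. Include the raw reply in the
    # error for easier debugging.
    if '\n' not in fullresponse:
        raise ValueError("Malformed node manager reply (no status line): %r"
                         % fullresponse)
    response, status = fullresponse.rsplit('\n', 1)
    return response, status
```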
The nodemanager doesn't appear to advertise its host:port under its own node key, but rather only under the owner and user keys of the individual vessels. Advertising under the node's key is useful in some situations, such as being able to efficiently follow a moving node.
Starting nmmain.py fails on testbed-xp2 (after running nminit.py, of course), with the following backtrace:
1247440591.53:PID-604:[ERROR]: File "nmmain.py", line 408, in <module>
File "nmmain.py", line 280, in main
File "c:\Documents and Settings\cemeyer\foo2\runonce.py", line 32, in getprocesslock
File "c:\Documents and Settings\cemeyer\foo2\runonce.py", line 209, in getprocesslockmutex
File "c:\Documents and Settings\cemeyer\foo2\runonce.py", line 169, in openkey
File "c:\Documents and Settings\cemeyer\foo2\runonce.py", line 167, in openkey
<type 'exceptions.WindowsError'> Error 5 Access is denied
It may have something to do with a permissions error on the mutex (or may not). Look in getprocesslockmutex() in runonce.py.
The node manager logs don't include the exception that happened. They only include the lines of code on which the exception happened. This should be improved to show the actual exception.
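A sketch of the improvement using the standard traceback module; `logfunc` stands in for the servicelogger write call.

```python
import traceback

def log_last_exception(logfunc):
    # format_exc() includes both the source lines and the final
    # "ExceptionType: message" line, which the current logs omit.
    logfunc(traceback.format_exc())
```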
A reset on a vessel doesn't move it to a fresh state. This should be fixed.
nat_check_bi_directional() starts a listener with a callback function that echoes any data received. This can be confusing if someone unknowingly connects to it while trying to connect to a nodemanager.
Fix this to make it clear what is going on if you accidentally get this connection.
I tried to acquire expired vessels through nm_remote_api.py. It looks like the Python socket in emulcomm should be timing out, but it just hangs.
Here's the trace:
richard@satya:~/trunk/foo$ python nat_forwarder_monitor.py start
Acquiring vessels...
Traceback (most recent call last):
File "nat_forwarder_monitor.py", line 69, in <module>
success, info = nm_remote_api.initialize(hosts, 'jordanr')
File "/home/richard/trunk/foo/nm_remote_api.py", line 2996, in initialize
new_vessels = add_node_by_hostname(host)
File "/home/richard/trunk/foo/nm_remote_api.py", line 2580, in add_node_by_hostname
publickey = key[
File "/home/richard/trunk/foo/nm_remote_api.py", line 1868, in nmclient_createhandle
response = nmclient_rawsay(newhandle, 'GetVessels')
File "/home/richard/trunk/foo/nm_remote_api.py", line 1921, in nmclient_rawsay
fullresponse = nmclient_rawcommunicate(nmhandle, *args)
File "/home/richard/trunk/foo/nm_remote_api.py", line 1741, in nmclient_rawcommunicate
return session_recvmessage(thisconnobject)
File "/home/richard/trunk/foo/nm_remote_api.py", line 1619, in session_recvmessage
currentbyte = socketobj.recv(1)
File "/home/richard/trunk/foo/emulcomm.py", line 1042, in recv
datarecvd = comminfo[mycommid]['socket'].recv(bytes)
KeyboardInterrupt
There seems to be a huge downswing in the number of advertising nodes as of May 22nd. The node manager on most of these systems has over 1K sockets open in the TIME_WAIT state.
On 170.140.119.69 (a PL node), netstat -n reports there are 957 sockets open to SeattleGENI, 33 sockets open to 128.208.3.203 (testbed-ubuntu), and another 15 or so that seem to be associated with different client IPs (possibly from seash?).
The log of the system looks like:
1274775319.71:PID-4231:AdvertiseError occured, continuing: ["announce error (type: DOR): Unexpected tag 'details' while parsing response."]
1274775731.32:PID-4231:[INFO]: node manager is alive...
1274776133.23:PID-4231:AdvertiseError occured, continuing: ["announce error (type: DOR): Unexpected tag 'details' while parsing response."]
1274776331.61:PID-4231:[INFO]: node manager is alive...
1274776846.31:PID-4231:AdvertiseError occured, continuing: ['announce error (type: central): filedescriptor out of range in select()']
1274776920.74:PID-4231:AdvertiseError occured, continuing: ['announce error (type: DHT): filedescriptor out of range in select()']
1274776927.36:PID-4231:AdvertiseError occured, continuing: ['announce error (type: central): filedescriptor out of range in select()']
I tried restarting the node manager on a beta node (128.112.139.28) and am monitoring it to see if it leaks sockets.
I believe the 'details' line in the traceback above is not relevant because I ran a version of the node manager that did frequent advertisement and it didn't leak sockets. I also ran nmclient_get_vessel_dict 1000 times each from 10 threads and didn't detect a socket leak.
After running for some time, the node manager dies on my laptop. It returns code 30 which implies a timer raised an exception that wasn't caught.
Here is my node manager log. I'm unsure if the tracebacks contained in this are really pointing to problems or not.
1249162032.98:PID-17687:[INFO]:Loading config
1249162128.02:PID-17687:myname = NAT$d40aa1d38e2ce73eff7351fee1ce4ee12cf1dd72v2:1224
1249162128.05:PID-17687:[INFO]:Started
1249162313.5:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmrequesthandler.py", line 92, in handle_request
File "/Users/justincappos/test/nmrequesthandler.py", line 227, in process_API_call
File "/Users/justincappos/test/nmAPI.py", line 2788, in addfiletovessel
IOError: [Errno 21] Is a directory: 'v1//'
1249162357.69:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmrequesthandler.py", line 75, in handle_request
File "/Users/justincappos/test/session_repy.py", line 42, in session_recvmessage
File "/Users/justincappos/test/emulcomm.py", line 1595, in recv
Exception: Socket closed
1249162398.68:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmrequesthandler.py", line 75, in handle_request
File "/Users/justincappos/test/session_repy.py", line 42, in session_recvmessage
File "/Users/justincappos/test/emulcomm.py", line 1595, in recv
Exception: Socket closed
1249162732.89:PID-17687:[INFO]: node manager is alive...
1249162980.77:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmrequesthandler.py", line 92, in handle_request
File "/Users/justincappos/test/nmrequesthandler.py", line 227, in process_API_call
File "/Users/justincappos/test/nmAPI.py", line 2788, in addfiletovessel
IOError: [Errno 21] Is a directory: 'v13//'
1249163025.18:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmrequesthandler.py", line 75, in handle_request
File "/Users/justincappos/test/session_repy.py", line 42, in session_recvmessage
File "/Users/justincappos/test/emulcomm.py", line 1595, in recv
Exception: Socket closed
1249163066.05:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmrequesthandler.py", line 75, in handle_request
File "/Users/justincappos/test/session_repy.py", line 42, in session_recvmessage
File "/Users/justincappos/test/emulcomm.py", line 1595, in recv
Exception: Socket closed
1249163267.4:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmadvertise.py", line 5815, in run
File "/Users/justincappos/test/nmadvertise.py", line 5688, in advertise_announce
AdvertiseError: openDHT announce error: timed out
1249163267.63:PID-17687:[WARN]:At 1249163267.63 restarting advert...
1249163273.76:PID-17687:Traceback (most recent call last):
File "/Users/justincappos/test/nmadvertise.py", line 5815, in run
File "/Users/justincappos/test/nmadvertise.py", line 5688, in advertise_announce
AdvertiseError: openDHT announce error: timed out
1249163273.98:PID-17687:[WARN]:At 1249163273.98 restarting advert...
1249163339.53:PID-17687:[INFO]: node manager is alive...
openDHTadvertise.py was switched to use python's stdlib xmlrpclib because of a bug in our xmlrpc library that has since been fixed. We should switch it back.
Also, port it to seattlelib.