
advertiseserver's People

Contributors

aaaaalbert, choksi81, vladimir-v-diaz


Forkers

choksi81

advertiseserver's Issues

Advertise bug caused by ',' and '\' encoding

Reminder to swap out the comma/backslash encoding for serialization.

An artifact service is created when advertise_announce is called with a value that contains "\c" (among other bugs, maybe)

Dwindling node counts since 0.1.1d push

Albert reports:

I've noticed something alarming in my advertise monitoring setup though: Since October 17, we've been losing nodes at a rate of 15 per day (!). Can you please check the usual suspects (read: PlanetLab instances running Seattle) for errors/hiccups?


I've done some poking around, and this issue is similar to #1261, where a failure in the udp recvmess handler is causing the nodemanager to exit.

I first disabled central/centralv2 on my nodemanager, so that only UDP is being used. I then printed out the UDP port that was being used to receive the UDP advertise responses. I started up a process that sent endless messages to that port, and not long after, the nodemanager process died.
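For reference, a flood like the one described can be generated with a few lines of plain Python. The address and port below are placeholders for whatever the nodemanager printed as its UDP advertise-response port; this is only the shape of the reproduction, not the exact process used:

import socket

target_ip = '10.0.0.1'      # placeholder: the node's IP address
target_port = 63106         # placeholder: the UDP port the nodemanager printed

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
while True:
  # the content does not matter; the point is to keep the udp recvmess handler busy
  sock.sendto('x' * 512, (target_ip, target_port))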

Clean up repo

This repository contains a few items which I'm not sure are used anymore, such as advertiseserver.mix (single-threaded RepyV1 server), deploy_advertiseserver.sh (should be superseded at least partially by the new build script), and advertise_test_routine.repy (an attempt at an integration test?).

See what is actually deployed, remove whatever isn't.

Many nodes have a hard time advertising.

Many nodes seem to have a hard time advertising successfully with at least one of the advertise servers. We seem to have more nodes that are up and running than there are nodes that are advertising.

Below is the tail of the nodemanager log in one of our nodes:
193.174.67.186

1292020043.08:PID-29330:AdvertiseError occured, continuing: None of the advertise services could be contacted
1292020048.09:PID-29330:AdvertiseError occured, continuing: None of the advertise services could be contacted
1292020053.09:PID-29330:AdvertiseError occured, continuing: None of the advertise services could be contacted
1292020058.09:PID-29330:AdvertiseError occured, continuing: None of the advertise services could be contacted
1292020220.06:PID-29330:AdvertiseError occured, continuing: ["announce error (type: DHT): (111, 'Connection refused')"]
1292020277.42:PID-29330:[INFO]: node manager is alive...
1292020820.84:PID-29330:AdvertiseError occured, continuing: ['announce error (type: central): timed out']
1292020938.88:PID-29330:[INFO]: node manager is alive...
1292021159.32:PID-29330:AdvertiseError occured, continuing: ["announce error (type: DHT): (111, 'Connection refused')"]
1292021604.29:PID-29330:[INFO]: node manager is alive...
1292021855.59:PID-29330:AdvertiseError occured, continuing: ["announce error (type: DHT): (111, 'Connection refused')"]
1292021874.04:PID-29330:AdvertiseError occured, continuing: ["announce error (type: DOR): Unexpected tag 'details' while parsing response."]
1292022269.52:PID-29330:[INFO]: node manager is alive...
1292022286.09:PID-28038:[ERROR]:Another node manager process (pid: 29330) is running
1292022502.65:PID-29330:AdvertiseError occured, continuing: ['announce error (type: central): timed out']
1292022507.76:PID-29330:AdvertiseError occured, continuing: None of the advertise services could be contacted
1292022507.76:PID-29330:AdvertiseError occured, continuing: None of the advertise services could be contacted

As can be seen, many times none of the advertise servers could be contacted.

Add Integration test for tcp_time module and timeservers

The latest version of Seattle has a new method of getting NTP time that depends on a service we run on GENI nodes. This service is kept running on a set of nodes by a deployment manager, but we need to add an integration test to ensure the nodes are working correctly.

The attached repy code tests the time service by:

  1. verifying that a minimum number of time servers are up
  2. verifying that the times from several timeservers are within a reasonable limit of each other
  3. verifying that the time from the NTP method is within a reasonable limit of the TCP method

This test should be worked into the integration test suite.

For questions about the time service, see Zack or Eric.
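Not the attached repy code, but a rough outline of the three checks in plain Python; the server list and the get_tcp_time/get_ntp_time helpers are placeholders for whatever the tcp_time module and the deployed time servers actually provide:

TIMESERVERS = [('timeserver1.example.org', 63106), ('timeserver2.example.org', 63106)]  # placeholders
MIN_SERVERS_UP = 2     # minimum number of time servers that must respond
MAX_SKEW = 5.0         # seconds; whatever "reasonable limit" we settle on

times = []
for host, port in TIMESERVERS:
  try:
    times.append(get_tcp_time(host, port))   # placeholder helper for the tcp_time query
  except Exception:
    pass                                     # an unreachable server just reduces the count

assert len(times) >= MIN_SERVERS_UP, "too few time servers are up"
assert max(times) - min(times) <= MAX_SKEW, "time servers disagree by too much"
assert abs(get_ntp_time() - times[0]) <= MAX_SKEW, "ntp and tcp results disagree"  # placeholder helper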

Node manager should readvertise on key change.

The node manager should readvertise in opendht, etc. soon after a user changes the keys on vessels on the node. Waiting ~5 minutes is not a good solution, and the current workarounds (doing it on the website, etc.) are not good long-term fixes.

DORadvertise seems to have a bug in it

The dorputget_new.py test that resides in trunk/integrationtests/opendhtputget_repy/dorputget_new.py often fails. However, some of the failures seem to be because an IndexError is raised. Here is the traceback:

Feb 12 00:38:46 2010
Exception: DORadvertise_announce() failed
Error: IndexError
Description: list index out of range
Traceback:
File "/home/integrationtester/cron_tests/dorputget/dorputget_new.py", line 124, in main

File "/home/integrationtester/cron_tests/dorputget/DORadvertise_repy.py", line 103, in DORadvertise_announce

File "/home/integrationtester/cron_tests/dorputget/DORadvertise_repy.py", line 154, in _DORadvertise_command

File "/home/integrationtester/cron_tests/dorputget/httpretrieve_repy.py", line 275, in httpretrieve_get_string

File "/home/integrationtester/cron_tests/dorputget/httpretrieve_repy.py", line 171, in httpretrieve_open

I have attached the files from the traceback in a tarball.

Bug in current advertiseserver

There is a bug in the current advertiseserver that occurs rarely. While I was running tests to figure out #805, I found that very rarely there is an uncaught exception which causes the advertiseserver to crash. Here is the error:


Uncaught exception! Following is a full traceback, and a user traceback.
The user traceback excludes non-user modules. The most recent call is displayed last.

Full debugging traceback:
"/home/monzum/advertiseserver_deployed_current/emulcomm.py", line 684, in run
"/home/monzum/advertiseserver_deployed_current/namespace.py", line 1489, in wrapped_function
"advertiseserver_current_3368.py", line 340, in _timeout_waitforconn_callback
"advertiseserver_current_3368.py", line 5461, in handlerequest
"advertiseserver_current_3368.py", line 5426, in expire_hashtable_items

User traceback:
"advertiseserver_current_3368.py", line 340, in _timeout_waitforconn_callback
"advertiseserver_current_3368.py", line 5461, in handlerequest
"advertiseserver_current_3368.py", line 5426, in expire_hashtable_items

Exception (with type 'exceptions.ValueError'): list.remove(x): x not in list

I managed to produce this error twice.
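The traceback points at two request-handler threads racing to expire the same hashtable entry, so the second list.remove() finds the item already gone. Not the actual advertiseserver code, but the defensive pattern would look roughly like this (table, key, and item are placeholders for the server's real data structures):

expire_lock = getlock()   # Repy lock; threading.Lock() outside Repy

def expire_hashtable_item(table, key, item):
  expire_lock.acquire()
  try:
    # only remove if another thread has not already expired this entry
    if key in table and item in table[key]:
      table[key].remove(item)
  finally:
    expire_lock.release()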

openDHTadvertise.repy leaks resources

Running the program below will consume more than 10 events and 5 sockets.

include openDHTadvertise.repy

if callfunc == 'initialize':
  openDHTadvertise_announce('seattleopendhttest', 'gotit', 30)
  sleep(2)
  results = openDHTadvertise_lookup('seattleopendhttest')

Centralized advertise v2 is slow

Advertising to centralized advertise v2 times out randomly.

To replicate, you can run the following from a directory after preparetest:

>>> import repyhelper
>>> repyhelper.translate_and_import('centralizedadvertise_v2.repy')
>>> v2centralizedadvertise_announce('hello', 'hi', 1234)
>>> v2centralizedadvertise_lookup('hello')

It may take a few tries, but you will get this error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "centralizedadvertise_v2_repy.py", line 154, in v2centralizedadvertise_lookup
    sockobj = timeout_openconn(v2servername,v2serverport, timeout=10)
  File "sockettimeout_repy.py", line 263, in timeout_openconn
    realsocketlikeobject = openconn(desthost, destport, localip, localport, timeout)
  File "/home/leonwlaw/verify/emulcomm.py", line 1528, in openconn
    raise connect_exception
socket.timeout: timed out

advertiseserver crashing due to events limit exceeded

The advertise server crashed again. Here is what was in the logs, most importantly what looks like the cause of the crash:

jsamuel@satya:/home/geni/advertiseserver_deployed$ tail log.stderr
Seattle Traceback (most recent call last):
Exception (with type 'exceptions.KeyboardInterrupt'):
Exception (with type 'exceptions.KeyboardInterrupt'):
Seattle Traceback (most recent call last):
  "advertiseserver.py", line 4414, in periodic_print
Exception (with type 'exceptions.Exception'): Resource 'events' limit exceeded!!

Problem with advertise service - Error when using announce

The following code, when run via Repy, will sometimes crash.

if callfunc == 'initialize':
  advertise_announce("goldfish.zenodotus.cs.washington.edu", "1.2.3.4", 120)
  print "Advertising goldfish . . ."
  print advertise_lookup("goldfish.zenodotus.cs.washington.edu")

The error produced looks like this. Note that this will not always happen, just occasionally. I couldn't find a pattern with this, though I assume it has something to do with the DOR server?


---
Uncaught exception! Following is a full traceback, and a user traceback.
The user traceback excludes non-user modules. The most recent call is displayed last.

Full debugging traceback:
  "repy.py", line 202, in main
  "/home/sebass63/seattle/seattle_repy/virtual_namespace.py", line 116, in evaluate
  "/home/sebass63/seattle/seattle_repy/safe.py", line 311, in safe_run
  "advertise_interface_built.repy", line 4902, in <module>
  "advertise_interface_built.repy", line 4787, in advertise_announce

User traceback:
  "advertise_interface_built.repy", line 4902, in <module>
  "advertise_interface_built.repy", line 4787, in advertise_announce

Exception (with class '.AdvertiseError'): ['announce error (type: DOR): Socket timed out connecting to host/port.']

---

Seems random, but it definitely happens pretty regularly in normal use.

Deprecate old advertise server format...

We should really deprecate and remove the old legacy advertise server format. To do this, we first need to understand what software is still using the old format and update these programs.

This is related to ticket #408.

Advertise service sometimes returns empty list

The advertise service often returns an empty list when it is past the graceperiod but before the timeout.

Here is the output of a successful lookup:

monzum@TestUbuntu:~/exdisk/work/affix_library$ python test_advertise.py time_server
{'exception': [], 'returned': [(('central', 'time_server', 100, True), ['130.216.1.22:63106', '129.63.159.101:63106', '128.111.52.64:63106', '141.161.20.33:63106', '204.8.155.227:63106', '128.42.142.42:63106', '200.0.206.169:63106', '129.10.120.193:63106', '195.113.161.83:63106']), (('central_v2', 'time_server', 100, True), ['204.8.155.227:63106', '142.103.2.2:63106', '156.56.250.226:63106', '160.36.57.173:63106', '13.7.64.20:63106', '128.111.52.64:63106', '129.63.159.101:63106', '141.161.20.33:63106', '195.113.161.83:63106', '129.10.120.193:63106', '128.42.142.42:63106', '192.33.90.68:63106', '192.42.83.253:63106', '165.91.55.8:63106', '130.195.4.68:63106', '141.219.252.132:63106', '128.232.103.201:63106', '130.216.1.22:63106'])], 'aborted': []}
['130.216.1.22:63106', '129.63.159.101:63106', '128.111.52.64:63106', '141.161.20.33:63106', '204.8.155.227:63106', '128.42.142.42:63106', '200.0.206.169:63106', '129.10.120.193:63106', '195.113.161.83:63106', '142.103.2.2:63106', '156.56.250.226:63106', '160.36.57.173:63106', '13.7.64.20:63106', '192.33.90.68:63106', '192.42.83.253:63106', '165.91.55.8:63106', '130.195.4.68:63106', '141.219.252.132:63106', '128.232.103.201:63106']
Time to lookup: 0.165866136551

Here is the output of a case where it failed and timed out:

monzum@TestUbuntu:~/exdisk/work/affix_library$ python test_advertise.py time_server
{'exception': [], 'returned': [(('central_v2', 'time_server', 100, False), []), (('central', 'time_server', 100, False), [])], 'aborted': []}
[]
Time to lookup: 19.150124073

advertise_lookup leaving threads open due to sockets not timing out

Note that the summary describes what I strongly suspect; I'm not 100% certain my suspicions are correct.

Background

The transition scripts are accumulating many extra threads whose origins are not known. I think the issue may be a blocked communication in a parallelized call made in advertise_lookup for opendht. Not necessarily incorrectly, advertise_lookup gives up on the parallelized call because it takes too long. The problem is that this likely results in more and more threads in the background that are blocked on communication (and more sockets open, and more memory used).

This is all a strong hunch, aided by the fact that before the recent changes to make advertise_lookup use parallelize, if communication blocked indefinitely then so would the advertise_lookup call. I was seeing the node state transition scripts hang for hours until restarted, after first launching the new seattlegeni to blackbox and before these changes were made to advertise_lookup.

So, I think this is fundamentally an issue with opendht communication sometimes hanging indefinitely.

Details

After about 16 hours of running, the transition scripts have the following number of active threads (according to threading.active_count()) open at times when they are not doing anything.

transition_donation_to_canonical.py: 63
transition_canonical_to_onepercentmanyevents.py: 206
transition_onepercentmanyevents_to_onepercentmanyevents.py: 82

We expect to see one thread, the main thread. However, listing them all with threading.enumerate() shows that all except a single main thread are _Timer threads -- so, ultimately resulting from a call to our settimer(). These numbers appear to be growing steadily. I won't be sure if or where they level out unless I can keep the transition scripts running without a restart (e.g. for adding debugging info) for multiple days.
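For reference, the same measurement can be taken from inside a transition script with nothing but the standard threading module:

import threading

def report_threads():
  all_threads = threading.enumerate()
  timer_threads = [t for t in all_threads if t.__class__.__name__ == '_Timer']
  print "active threads:", threading.active_count()
  print "_Timer threads:", len(timer_threads)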

Attached is a list of the TCP streams open by transition_canonical_to_onepercentmanyevents.py at the moment according to lsof, totaling almost 200. I haven't checked, but my guess is that these are opendht nodes, given that they look like mostly PlanetLab nodes.

As one more thing I've checked, there is nothing being left in emulcomm.comminfo.

Reason it's bad

Multiple bad things will happen if this is not fixed. I suspect this is the cause of memory still leaking in the transition scripts (I've verified there are no nmclient handle or parallelize handle leaks). The OS may/will decide at some point to not allow the process to open more threads. The number of sockets open could become an issue if the others don't become issues first.

[Newcomer] Evaluate advertise monitoring logs

I have a RepyV1 program running in a Seattle vessel that monitors which nodes announce keys in the advertise service, and contacts these nodes to query the number of available vessels. I have a few questions about the dataset that can be answered with a bit of data mining:
How many different IP addresses did we see overall?
How many of them were private ones?
How often did we see which node?
Which nodes disappeared over time?
What names are associated with these nodes? (Hint: use "dig -x" on the list of IP addresses)
What categories do these names fall into? (Hint: Compare with known PlanetLab nodes, extract top-level domains, look for well-known ISPs/mobile operators, etc.)
Be sure to add questions you come up with yourself, too!
You might want to create graphical representations of your results. I suggest using something scriptable such as gnuplot or GNU R rather than Excel.
The lines of the logfile are formatted like this (I happened to not check "PlanetLab status" for performance reasons):
  1. Node statistics for two different advertised keys:
     nodestats, advertise key name, timestamp, advertise type-count pairs, overall count of unique nodes across all advertise types
  2. Overall statistics on advertised keys, including vessel counts:
     SUMMARY, advertise key name, timestamp, advertise type-count pairs, overall unique node count, contacted (the number of nodes contacted for the purpose of counting vessels, including nodes that didn't currently advertise), vessels (total number of vessels)
  3. Detailed vessel availability data:
     vessels, advertise key name, timestamp, and then tuples of (IP, nodeman port, round-trip latency, vessel count, PlanetLab status) for each node ever found advertising. Within tuples, fields are colon-separated; tuples themselves are separated by commas. (Some nodes might not advertise anymore but still be contactable, or the other way around. Let's see.)
  4. Details on the nodes that advertise:
     nodedetails, advertise key name, timestamp, and then pairs of IP address ':' nodemanager port, successive pairs being comma-separated.

[Newcomer] Integration test for advertise servers

We should have an integration test that monitors the status of each advertise server. In addition, it should let us know if a server is under high load.

As the advertise servers generate load logs and error logs, we might parse these log files, in addition to the output of Unix utilities like top, to see what is going on.

Ideally, we should have the same integration test be runnable for each advertise server, with a configurable path if each server is set up in different directories on the different machines.
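A very rough sketch of the reachability part of such a test; the host/port list is a placeholder, and a real test would also do an announce/lookup round trip and parse the load and error logs mentioned above:

import socket, time

ADVERTISE_SERVERS = [('advertiseserver.example.org', 10102)]   # placeholder host/port pairs

def check_server(host, port, timeout=10):
  # "up" here only means the TCP port accepts a connection within the timeout
  start = time.time()
  try:
    sock = socket.create_connection((host, port), timeout)
    sock.close()
    return True, time.time() - start
  except (socket.error, socket.timeout):
    return False, None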

DOR and opendht fail when using keys that are large dicts...

When doing opendht or DOR advertisements, if the key is a large dictionary (like a Seattle user's public key), it fails to look up the associated values (returning the empty string).

>>> adict = {0: '0asdf-1', 1: '1asdf0', 2: '2asdf1', 3: '3asdf2', 4: '4asdf3', 5: '5asdf4', 6: '6asdf5', 7: '7asdf6', 8: '8asdf7', 9: '9asdf8', 10: '10asdf9', 11: '11asdf10', 12: '12asdf11', 13: '13asdf12', 14: '14asdf13', 15: '15asdf14', 16: '16asdf15', 17: '17asdf16', 18: '18asdf17', 19: '19asdf18', 20: '20asdf19', 21: '21asdf20', 22: '22asdf21', 23: '23asdf22', 24: '24asdf23', 25: '25asdf24', 26: '26asdf25', 27: '27asdf26', 28: '28asdf27', 29: '29asdf28', 30: '30asdf29', 31: '31asdf30', 32: '32asdf31', 33: '33asdf32', 34: '34asdf33', 35: '35asdf34', 36: '36asdf35', 37: '37asdf36', 38: '38asdf37', 39: '39asdf38', 40: '40asdf39', 41: '41asdf40', 42: '42asdf41', 43: '43asdf42', 44: '44asdf43', 45: '45asdf44', 46: '46asdf45', 47: '47asdf46', 48: '48asdf47', 49: '49asdf48', 50: '50asdf49', 51: '51asdf50', 52: '52asdf51', 53: '53asdf52', 54: '54asdf53', 55: '55asdf54', 56: '56asdf55', 57: '57asdf56', 58: '58asdf57', 59: '59asdf58', 60: '60asdf59', 61: '61asdf60', 62: '62asdf61', 63: '63asdf62', 64: '64asdf63', 65: '65asdf64', 66: '66asdf65', 67: '67asdf66', 68: '68asdf67', 69: '69asdf68', 70: '70asdf69', 71: '71asdf70', 72: '72asdf71', 73: '73asdf72', 74: '74asdf73', 75: '75asdf74', 76: '76asdf75', 77: '77asdf76', 78: '78asdf77', 79: '79asdf78', 80: '80asdf79', 81: '81asdf80', 82: '82asdf81', 83: '83asdf82', 84: '84asdf83', 85: '85asdf84', 86: '86asdf85', 87: '87asdf86', 88: '88asdf87', 89: '89asdf88', 90: '90asdf89', 91: '91asdf90', 92: '92asdf91', 93: '93asdf92', 94: '94asdf93', 95: '95asdf94', 96: '96asdf95', 97: '97asdf96', 98: '98asdf97', 99: '99asdf98', '1': 2}
>>> advertise_announce(adict, '123', 1000)
>>> advertise_lookup(onepercent_manyevent_pubkey, maxvals = 10*1024*1024, lookuptype = ['DOR'])
[]
>>> advertise_lookup(onepercent_manyevent_pubkey, maxvals = 10*1024*1024, lookuptype = ['opendht'])
[]
>>> advertise_lookup(onepercent_manyevent_pubkey, maxvals = 10*1024*1024, lookuptype = ['central'])
[]

Oddly, it works just fine for small dicts...

adict = {'1':2}
advertise_announce(adict, '123', 1000)
advertise_lookup(adict)
['123']

advertise_lookup(adict, 10, 'DOR')
['123']

advertise_lookup(adict, 10, 'opendht')
['123']

Advertising two keys (in two threads) results in high failure rate

When using advertise_announce (advertise.repy) to advertise more than one key within a single program, most nodes will not succeed in advertising either key.

I've attached two files. adtest.repy is meant to be run on a group of nodes on the testbed. lookuptest.repy is used to look up which nodes have successfully advertised their keys.

I recommend using at least 4 or 5 nodes to recreate.
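The attached files are not reproduced here, but the failing pattern is roughly two announce loops started from the same program, e.g. in RepyV1 terms (the key names and timing below are placeholders, not the attached adtest.repy):

include advertise.repy

def announce_forever(key):
  # each thread keeps re-announcing its own key with a 120 second TTL
  while True:
    advertise_announce(key, getmyip(), 120)
    sleep(60)

if callfunc == 'initialize':
  settimer(0, announce_forever, ['twokeytest_a'])
  settimer(0, announce_forever, ['twokeytest_b'])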

httpretrieve.repy possibly leaks sockets.

The function httpretrieve_open() may leak sockets. It opens a connection (via a timeout_waitforconn call) but does not wrap everything done after that in a try/finally clause, so if the function is interrupted or terminated, the socket is never closed and leaks.

This particularly affects advertise.repy, which is used very often. If the graceperiod in advertise_lookup is small, then as soon as the grace period expires and one of the advertisement types has succeeded, the other advertisement attempts are terminated, which can cause httpretrieve_open() to be terminated before it finishes.
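A sketch of the close-on-error pattern meant here; the function name, the connection call, and the helper are stand-ins for whatever httpretrieve_open() actually does after opening its connection:

def httpretrieve_open_sketch(host, port, timeoutval):
  # placeholder name; stands in for the body of httpretrieve_open()
  sockobj = timeout_openconn(host, port, timeout=timeoutval)
  try:
    # send the request, read the status line and headers, build the file-like object...
    return _build_response(sockobj)    # placeholder for the remaining work
  except:
    # if this thread is interrupted or anything above fails, close before re-raising
    sockobj.close()
    raise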

Port advertise servers to RepyV2

udpadvertiseserver and advertiseserver.mix both use RepyV1. Port them to use the new API, and adapt the build scripts (currently broken due to conflicting Repy versions) accordingly.

See also #31 before you start coding!

Modify centralizedadvertise.repy to escape any comma that appears within the value string(s)

Right now, a comma separates the multiple values that can be attributed to one key in centralizedadvertise.repy. When centralizedadvertise_lookup returns the list of values, it splits on the commas so that there is a separate element in the returned list for each value. Any comma that appears within an actual value string therefore needs to be escaped, so that splitting on commas does not break apart values that happen to contain one.
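A minimal sketch of the kind of escaping meant here, assuming backslash as the escape character (the exact format is not decided by this ticket); both the client and the server would have to agree on it:

def escape_value(value):
  # escape the escape character first, then the separator
  return value.replace('\\', '\\\\').replace(',', '\\,')

def split_values(line):
  # split on commas that are not escaped, undoing the escaping as we go
  values, current, i = [], [], 0
  while i < len(line):
    if line[i] == '\\' and i + 1 < len(line):
      current.append(line[i + 1])
      i += 2
    elif line[i] == ',':
      values.append(''.join(current))
      current = []
      i += 1
    else:
      current.append(line[i])
      i += 1
  values.append(''.join(current))
  return values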

Centralized advertise v2 to contact multiple v2 servers

The centralizedadvertisev2 client is currently hardcoded to contact only centralizedadvertise_v2.poly.edu. With the upcoming multithreaded centralizedadvertisev2 server in mind, we should plan to add the ability for the client to contact multiple servers, as we will have both the single-threaded and multithreaded servers running simultaneously.

This should be done in a way such that it would be trivial to add more advertise servers.
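One possible shape for the client side; the port, the second server entry, and the per-server helper are placeholders standing in for the existing centralizedadvertise_v2 code:

# list of (host, port) pairs; adding another advertise server is a one-line change
V2_SERVERS = [('centralizedadvertise_v2.poly.edu', 10101),        # placeholder port
              ('centralizedadvertise_v2_mt.example.org', 10101)]  # placeholder second server

def v2_announce_all(key, value, ttl):
  errors = []
  for host, port in V2_SERVERS:
    try:
      _announce_to_one_server(host, port, key, value, ttl)   # placeholder for the existing per-server code
    except Exception, e:
      errors.append((host, str(e)))
  # only treat the announce as failed if every server failed
  if len(errors) == len(V2_SERVERS):
    raise Exception("all v2 advertise servers failed: " + str(errors))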

Problem with advertise server

Not sure what the problem is, but I'm seeing the logs being spammed with the following message.


1249460169.43:PID-4074:[WARN]:At 1249460169.43 restarting advert...
1249460744.19:PID-4074:Traceback (most recent call last):
  File "/home/uw_seattle/seattle_repy/nmadvertise.py", line 5942, in run
  File "/home/uw_seattle/seattle_repy/nmadvertise.py", line 5823, in advertise_announce
AdvertiseError: openDHT announce error: Socket closed
