Visionion aims to provide a web-based visualization tool for Tor metrics data.
The Tor project is primarily a system that provides users with anonymity on the internet. It also offers means of circumventing censorship, since adversaries try to block access to Tor altogether. The Tor infrastructure comprises several types of network nodes, and a large number of each.
Visualizing all parts of this network in a meaningful way is probably not possible, but insights can be drawn from combining different aspects and sources in one view. Visionion aims to integrate and visualize all available data in a generic and easily extensible fashion. These generic views can then be combined and tailored to elucidate structural patterns and hidden aspects in the data.
The project website provides the currently available data in JSON as well as an overview of metrics descriptor formats and the raw data. It's currently saved to a PostgreSQL database (SQL schema).
A user might just have some simple questions: How many relays were running in .de in the past 3 months? How much bandwidth was provided by relays running Tor version 0.2.3.x?
Such questions might just ask for a number and therefore need no visualization at all.
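Both example questions reduce to a filter followed by a count or a sum. The following is a minimal sketch in Python, assuming relay documents shaped like the import schema described further down (fields `node`, `cc`, `date`, `tsv`, `bwa`). The sample values and the helper names are made up for illustration.

```python
# Hypothetical sample documents; field names follow the importRelays schema.
relays = [
    {"node": "A", "cc": "DE", "date": "2013-05-01 00", "tsv": "023", "bwa": 1000},
    {"node": "B", "cc": "DE", "date": "2013-05-01 00", "tsv": "022", "bwa": 500},
    {"node": "C", "cc": "US", "date": "2013-05-01 00", "tsv": "023", "bwa": 2000},
]

def count_relays(docs, cc, start, end):
    """Number of distinct relays in country `cc` with start <= date < end.

    Dates in "YYYY-MM-DD HH" format sort correctly as plain strings.
    """
    return len({d["node"] for d in docs
                if d["cc"] == cc and start <= d["date"] < end})

def advertised_bandwidth(docs, tsv):
    """Sum of advertised bandwidth (B/s) over the given documents with version `tsv`."""
    return sum(d["bwa"] for d in docs if d["tsv"] == tsv)

german_relays = count_relays(relays, "DE", "2013-03-01 00", "2013-06-01 00")
bandwidth_023 = advertised_bandwidth(relays, "023")
```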
When a series of numbers is asked for, or when information needs get more complex because they involve different factors and sources of data, a visualization can be very handy to ease comprehension of the answer.
Even a simple graph of values over a period of time can be grasped much more easily than a list of numbers.
Visually combining and contextualizing different aspects of data in one place can help understand causes, effects and correlations.
When there is a problem for which not only the right answer but also the right question is unknown, a visualization can help understand what's going on in the first place.
Certainly this is an ambitious task, for which the visualization has to provide interactive controls and visual malleability.
How to achieve all this? Basically, take all network graphs and merge them into one single graph with plenty of options. Users should be able to navigate into any factor (bridge vs. relay, country, flags, Tor software version, operating system, EC2 cloud bridge or not) and learn the total relay number or advertised bandwidth or bandwidth history for their selection.
1. The most prominent use case is the timeline, with a graph representing volumes of bandwidth, number of hosts, number of clients etc. on the vertical axis.
1a. It should also be possible to layer timeline graphs for the same time period but with different subjects on top of each other, e.g. to compare consumed bandwidth and number of clients.
2. Now imagine a plane orthogonal to the graph, representing some other data at that point in time, e.g. adding a pie chart of all operating systems driving relays to the graph of Linux-driven relays.
3. Now imagine a third plane on the floor showing the geographic distribution of Linux-driven relays and how much bandwidth each of them handles, with the imaginary center of Linux-driven traffic at the intersection of the first two planes.
4. Now add markers for certain events: the day when traffic from Linux-driven relays peaked, the day it hit an all-time low, the days it plummeted, the days it spiked etc.
5. Show the biggest nodes for a given metric and their share of the total.
Item 1 represents the use case that's presently handled by the Tor metrics project graph visualizations.
Item 1a is available as a prototype at interactive graphs.
Item 2 attempts to combine different visualization techniques like timeline and pie chart. Different visualizations get rendered on different layers. Control shifts from the visualization framework to the web application.
Item 3 introduces the geographical dimension, which is not very strongly represented in the raw data but is nonetheless an interesting perspective.
Item 4 points the user in directions that might be worth exploring. It will need some analytics in the background.
Item 5 checks (de-)centralization in the infrastructure.
In a nutshell:
- Tor metrics data gets imported into a MongoDB database.
- Aggregation and indexing transform the imported data into a big fact table suitable to drive the visualization.
- The visualization framework is D3.js, supplemented by Crossfilter.
- The client-side application framework is not yet decided: either Angular.js, Knockout.js or Can.js.
- Targeted web browsers are Chrome and Firefox. Others might work as well.
Most of the visualization facets get rendered separately, on separate planes (technically DIVs).
The application prepares the joins and our eyes carry them out.
Visualization framework
D3.js is a leading data visualization framework for the web.
It keeps a strong link between the data and its visual representation, expresses it in a declarative, CSS-like style, provides an impressive set of features and renders to SVG.
Database
Since the data schema is quite flat and still in flux, a NoSQL database seems appropriate.
MongoDB was chosen because of its JavaScript support, which promises nice integration with client-side logic.
Since the complexity of the underlying data is rather limited, MongoDB's query capabilities, although less expressive than SQL, should be sufficient.
With a visualization tool, the most interesting joins are anyway those that are carried out in the eyes of the user.
The ability to store JavaScript code in MongoDB might help in the development of an analyzer toolkit.
Support for geo-data could be beneficial as well (no other NoSQL database has that so easily available AFAIK).
Web application framework
Angular.js is the likely candidate because of its declarative style and its attractive approach to routing and HTML extensions (discussion).
It integrates nicely with D3.js and with MongoDB (also here).
The Tor network comprises many different nodes. All these nodes operate, despite their different functions, from the same software, just with different configuration flags set. A single node can be in most categories at the same time and in every category over time.
Nodes are all the actors that form the network.
Nodes encompass clients, bridges and relays.
Clients are the end users, connecting to the Tor network to anonymously use the internet.
Servers are everything except clients.
Servers encompass relays and bridges.
Bridges are the nodes that clients connect to to circumvent attempts to block access to Tor.
Relays are the nodes that form the actual Tor network which provides anonymity.
Relays encompass guard nodes, middle nodes, exit nodes and directory nodes.
Guard nodes function as entry points to an anonymized route through the Tor network.
They are reached by the client either directly or, if a censor blocks them, through a bridge.
Middle nodes function as intermediary steps on that route.
Exit nodes function as exit points, leaving the Tor network and continuing to the destination on the internet.
Directory nodes provide some auxiliary services to the Tor network.
node              everything in the Tor network
  client          the users
  server          everything serving the user
    bridge        special entry points for clients that need to circumvent blocking
    relay         the actual anonymization network
      guard       entry points into the network (accessed by the client directly or via a bridge)
      middle      intermediary nodes on the anonymizing route
      exit        now anonymized, continue route to the actual destination on the internet
      directory   some auxiliary services
It's quite common that a relay is a guard node, middle node, exit node and directory mirror at the same time, and that the same node can be used as a client at any time.
Also, the node may have been configured as a bridge before or after being configured as a relay.
But there are two exceptions to the general rule:
- a node can't be a client and a server at the same time.
- a node can't be a bridge and a relay at the same time.
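The two exceptions can be expressed as a small consistency check. This is only a sketch: the function name and the lower-case role labels are chosen here for illustration and are not part of the importer.

```python
RELAY_ROLES = {"guard", "middle", "exit", "directory"}

def is_valid_combination(roles):
    """Check a set of simultaneous roles against the two exceptions above."""
    roles = set(roles)
    is_relay = bool(roles & RELAY_ROLES)
    is_bridge = "bridge" in roles
    is_client = "client" in roles
    if is_client and (is_relay or is_bridge):
        return False  # a node can't be a client and a server at the same time
    if is_bridge and is_relay:
        return False  # a node can't be a bridge and a relay at the same time
    return True
```

Any combination of the four relay roles passes, while mixing `bridge` with a relay role, or `client` with any server role, is rejected.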
A more detailed description of the different nodes and measures
clients
Tor doesn't log any data at individual clients themselves, but it logs abstract data about clients at bridges and directory mirrors.
Bridges are obvious, but directory mirrors maybe not so much.
The idea is to count network status requests per day and per country, aggregate that data for all directory mirrors, and derive the number of clients from that number.
The "time to download files over Tor" and "timeouts and failures of downloading files over Tor" parts are learned from clients run by the Tor project itself.
See https://metrics.torproject.org/formats.html for details: "Second, we describe the numerous aggregate statistics that relays publish about their usage (PDF), including byte histories, directory request statistics, connecting client statistics, bridge user statistics, cell-queue statistics, exit-port statistics, and bidirectional connection use."
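The counting idea for directory mirrors can be sketched as follows. Note that this document does not state how a client number is derived from the request total; the divisor `REQUESTS_PER_CLIENT_PER_DAY` below is an assumption made purely for illustration, as are the function name and the sample numbers.

```python
from collections import defaultdict

# ASSUMPTION: not stated in this document; illustrative value only.
REQUESTS_PER_CLIENT_PER_DAY = 10

def estimate_clients(reports):
    """Aggregate request counts over all mirrors and derive a client estimate.

    reports: iterable of (day, country, request_count), one tuple per
    directory mirror, day and country.
    """
    totals = defaultdict(int)
    for day, country, requests in reports:
        totals[(day, country)] += requests
    return {key: total / REQUESTS_PER_CLIENT_PER_DAY
            for key, total in totals.items()}

reports = [
    ("2013-05-01", "de", 300),  # mirror 1
    ("2013-05-01", "de", 200),  # mirror 2
    ("2013-05-01", "us", 100),  # mirror 1
]
estimates = estimate_clients(reports)
```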
servers
These are the documents you have per relay/bridge:
- Network status entry: There's a network status entry for every relay or bridge with some summary information.
It's a confirmation by either the directory authorities (for relays) or the bridge authority (for bridges) that the given relay/bridge information is valid.
But this summary doesn't contain, e.g., OS information or number of bytes spent on answering directory requests.
- Server descriptor: Every relay or bridge publishes a descriptor containing its contact information and capabilities to the directory authorities or bridge authority every 12 to 18 hours. This server descriptor is then referenced by digest from one or typically multiple network status entries.
- Extra-info descriptor: Statistical information about a relay or bridge is not contained in the server descriptor, but in an extra-info descriptor.
These are referenced from server descriptors by digest, with a 1:1 relationship.
See https://metrics.torproject.org/formats.html for details about "the numerous aggregate statistics that relays publish about their usage (PDF), including byte histories, directory request statistics, connecting client statistics, bridge user statistics, cell-queue statistics, exit-port statistics, and bidirectional connection use."
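The references between the three document types can be pictured as two lookups by digest. The sketch below uses heavily simplified, hypothetical documents; the key names (`descriptor_digest`, `extra_info_digest`, `platform`, `bytes_written`) are placeholders, not the field names of the actual descriptor formats.

```python
# Hypothetical, simplified stand-ins for the real descriptors.
extra_infos = {"e1": {"bytes_written": 12345}}
server_descriptors = {
    "s1": {"platform": "Tor 0.2.3.25 on Linux", "extra_info_digest": "e1"},
}
status_entries = [
    {"nickname": "relayA", "descriptor_digest": "s1"},
    {"nickname": "relayA", "descriptor_digest": "s1"},  # later status, same descriptor
]

def resolve(entry):
    """Follow the digests from a status entry to its descriptor and extra-info."""
    descriptor = server_descriptors[entry["descriptor_digest"]]
    extra_info = extra_infos.get(descriptor.get("extra_info_digest"))
    return descriptor, extra_info

descriptor, extra_info = resolve(status_entries[0])
```

Several status entries may point to the same server descriptor, while each server descriptor points to at most one extra-info descriptor.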
bridges
Bridges are simply nodes with an I-want-to-be-a-bridge bit set in their configuration.
However, whether a node is a bridge or a relay determines to some extent what data we have about that node.
For example, we don't have country information about bridges, but we have that for relays.
relays
- guard node
- middle node
- exit node
- directory mirror
Directory mirrors are just relays with an open directory port. So, the set of directory mirrors is a subset of the set of relays, and there'd be flags and all that for directory mirrors, too.
- combinations of guard, middle, exit and directory
Knowing if an exit may also be used in the guard position can be interesting. In general, comparisons between any two types of relays should be possible.
flags
- BadExit: The BadExit flag is already taken into account in the importer: a relay that has the Exit flag and the BadExit flag isn't put into the Exit category. The BadExit flag doesn't have any impact on the other types.
- Authority: Being an authority is mostly relevant for directories, if at all. It's not a very important flag.
- Fast: TODO
- Stable: TODO
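The BadExit rule described above amounts to a one-line predicate. This is a sketch of the rule as stated, not the importer's actual code; the function name is made up.

```python
def in_exit_category(flags):
    """A relay counts as an exit only if it has the Exit flag and lacks BadExit."""
    return "Exit" in flags and "BadExit" not in flags
```

For example, a relay flagged `["Exit", "Fast"]` lands in the Exit category, while one flagged `["Exit", "BadExit"]` does not.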
other dimensions
- bandwidth
Bandwidth is measured for relays and bridges in two values: bandwidth advertized and bandwidth consumed.
- probabilities
You can assign a consensus weight fraction to each relay, for any given date and hour. Then you can say that all clients used that relay for about x% of their paths, or that a particular client used that relay for a particular path with a probability of x%.
There are currently four such weights/probabilities defined for relays (this does not apply to bridges). Quoting from Onionoo's protocol specification:
"consensus_weight_fraction": Fraction of this relay's consensus weight compared to the sum of all consensus weights in the network. This fraction is a very rough approximation of the probability of this relay to be selected by clients.
"guard_probability": Probability of this relay to be selected for the guard position. This probability is calculated based on consensus weights, relay flags, and bandwidth weights in the consensus. Path selection depends on more factors, so that this probability can only be an approximation.
"middle_probability": Probability of this relay to be selected for the middle position. This probability is calculated based on consensus weights, relay flags, and bandwidth weights in the consensus. Path selection depends on more factors, so that this probability can only be an approximation.
"exit_probability": Probability of this relay to be selected for the exit position. This probability is calculated based on consensus weights, relay flags, and bandwidth weights in the consensus. Path selection depends on more factors, so that this probability can only be an approximation.
Probabilities for selecting a node in the guard/middle/exit position are calculated based on the node's consensus weight, whether it has the Guard and/or Exit flag, and the bandwidth weights in the consensus.
- autonomous systems
For visualization, autonomous systems are very similar to countries. Think of an autonomous system as a group of IP address blocks belonging to the same organization. You want to avoid that all relays in a path, or at least entry and exit, are located in the same autonomous system and thereby controlled by the same organization. And you want to avoid that a single AS/organization sees too high a percentage of Tor traffic. For example, AS39138 rrbone UG (haftungsbeschraenkt) currently sees almost 20% of Tor's exit traffic. That's about as interesting as the fact that over 30% of Tor's traffic exits from U.S. relays.
- pluggable transports
OR stands here for Onion Routing, i.e. the normal Tor protocol. A bridge that offers pluggable transports can also accept normal Tor connections, which would be summarized under OR. OR is also the default value when a bridge reports no statistics on pluggable transports, hence its large share. In addition, there is a catch-all value for the case that neither a known pluggable transport nor OR was used. That will probably happen only very rarely.
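Of the four probabilities quoted above, only `consensus_weight_fraction` is fully defined by the text: a relay's consensus weight divided by the sum of all consensus weights. A minimal sketch with made-up weights follows; the guard, middle and exit probabilities additionally depend on flags and bandwidth weights and are not reproduced here.

```python
def consensus_weight_fractions(weights):
    """Map fingerprint -> share of the summed consensus weight."""
    total = sum(weights.values())
    if total == 0:
        return {fp: 0.0 for fp in weights}
    return {fp: w / total for fp, w in weights.items()}

# Hypothetical consensus weights for three relays.
fractions = consensus_weight_fractions({"A": 50, "B": 30, "C": 20})
```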
even more
- for some rather detailed explanations see the Tor directory protocol, version 3
postponed
- performance measures
The "time to download files over Tor" and "timeouts and failures of downloading files over Tor" are learned from clients we run ourselves, coming from Torperf output files. The gathering of this data is currently being worked on, and work on its visualization is postponed.
- measuring bandwidths for types of relays
Bandwidth figures (advertized and consumed) include all types of services offered by a node. Currently they cannot be refined to the level of individual services, like bandwidth consumed by guard nodes, middle nodes etc.
In theory, we have data about consumed directory bandwidth for newer relays or bridges, but not for traffic as bridge, guard, middle, or exit node. We only have that data for the directory role, and only for a subset of relays, and deriving this data is difficult. There are privacy implications of gathering too detailed data, so we can't get more detailed data.
We also cannot simply derive these values from the data we already have, since relays can offer more than one service. E.g. relays with the Guard flag are not exclusively used in the guard position, but could also be used in the middle position and possibly also as directory server. And if relays also have the Exit flag, they'll be used less in the previously mentioned positions, but in exchange also in the exit position.
We could derive advertised or consumed guard bandwidth for types of relays from relay bandwidth similar to how we derive guard probability from consensus weight using the Guard/Exit/BadExit flag and Wgd/Wgg bandwidth weights.
I'm uncertain whether this would produce good metrics or not. We'd mix path selection probabilities with actual usage data, and I'm not sure whether we can do that. This is a fine question for an analysis task and a later extension of Visionion, but we currently don't feel confident enough to implement this in the database importer. Results might be misleading.
An additional note: We can count relays that are suitable for guard position, and we can sum up advertised and observed bandwidth of those relays.
We cannot sum up advertised and observed bandwidth of relays that have actually been used as guards. In fact, we cannot even count those relays, because a relay may have been used 20% in guard position, 30% in middle position, 40% in exit position, and 10% as directory. We don't know those fractions.
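What can be computed is therefore a count and a bandwidth sum over relays that are eligible for a position, not over relays actually used in it. A sketch with made-up documents, using the `role`, `bwa` and `bwc` fields of the import schema (the helper name is hypothetical):

```python
docs = [
    {"node": "A", "role": ["Guard", "Middle"], "bwa": 1000, "bwc": 400},
    {"node": "B", "role": ["Middle", "Exit"], "bwa": 2000, "bwc": 900},
    {"node": "C", "role": ["Guard", "Middle", "Exit"], "bwa": 500, "bwc": 100},
]

def suitable_for(docs, role):
    """Count relays eligible for `role` and sum their total bandwidths.

    The sums cover all traffic of those relays, in every position,
    not just the traffic handled in the given role.
    """
    selected = [d for d in docs if role in d["role"]]
    return {
        "count": len(selected),
        "bwa": sum(d["bwa"] for d in selected),
        "bwc": sum(d["bwc"] for d in selected),
    }

guards = suitable_for(docs, "Guard")
```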
The initial database import schema has only 3 collections for all node types: 'relay', 'bridge' and 'client'. Documents of type "guard", "middle", "exit" and "directory" will be added to the collection named "importRelays", documents of type "bridge" will be added to the collection named "importBridges", and documents of type "client" will be added to the collection named "importClients". These 3 collections contain all raw data as it is imported into the database.
importRelays
in      field  description                 type     subtype  aggregation  valuespace
+-------+------+---------------------------+--------+--------+------------+----------
bgmedr  _id    document ID                 string            [*]          fingerprint+span+date, e.g. 'fingerprint-1-YYYYMMDDHH'
bgmedr  addd   timedate the doc was added  string                         ISO 8601 extended format YYYY-MM-DDTHH:mm:ss.sssZ
bgmedr  node   node id                     string            -            Tor fingerprint
bgmedr  span   period of validity          integer           -            length of the interval this dataset describes, in hours:
                                                                          one of: 1 (default), 6, 24, 168
bgmedr  date   datetime                    string            -            start of the time span that this document describes
                                                                          format "YYYY-MM-DD HH" as defined in ISO-8601
bgmedr  nick   nickname                    string            mode         nickname of relay
 gmedr  role   roles/functions of relay    array    string   mode [*]     some of: Guard, Middle, Exit, Dir
 gmedr  flag   flags                       array    string   mode [*]     some of: Authority, BadExit, BadDirectory, Fast,
                                                                          Named, Stable, Running, Unnamed, Valid,
                                                                          V2Dir, V3Dir
b    r  bwa    bandwidth advertized        integer           mean         B/s
b    r  bwc    bandwidth consumed          integer           mean         B/s
bgmedr  tsv    Tor software version        string            mode         one of: 010, 011, 012, 020, 021, 022, 023, 024
bgmedr  osv    operating system            string            mode         one of: linux, darwin, freebsd, windows, other
     r  pbr    consensus_weight_fraction   number            mean         probability of a client picking a relay for their path
 g      pbg    guard_probability           number            mean         probability of a client picking a relay for their guard position
  m     pbm    middle_probability          number            mean         probability of a client picking a relay for their middle position
   e    pbe    exit_probability            number            mean         probability of a client picking a relay for their exit position
   e    pex    permitted exit ports        array    integer  mode         some of: 80, 443, 6667
 gmedr  as     autonomous system           integer           mode
 gmedr  cc     country code                string            mode         two-letter (ISO 3166-1 alpha-2), upper case
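The `_id` convention from the table ('fingerprint-1-YYYYMMDDHH') can be illustrated with a small helper. The function name and the sample fingerprint are made up; only the format and the allowed spans come from the table.

```python
from datetime import datetime

def make_id(fingerprint, span, start):
    """Build 'fingerprint-span-YYYYMMDDHH' from a node id, span in hours and start time."""
    if span not in (1, 6, 24, 168):
        raise ValueError("span must be one of 1, 6, 24, 168")
    return "{}-{}-{}".format(fingerprint, span, start.strftime("%Y%m%d%H"))

doc_id = make_id("ABCD1234", 1, datetime(2013, 5, 1, 14))
```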
importBridges
in      field  description                 type     subtype  aggregation  valuespace
+-------+------+---------------------------+--------+--------+------------+----------
bgmed   _id    document ID                 string            [*]          fingerprint+span+date, e.g. 'fingerprint-1-YYYYMMDDHH'
bgmed   addd   timedate the doc was added  string                         ISO 8601 extended format YYYY-MM-DDTHH:mm:ss.sssZ
bgmed   node   node id                     string            -            Tor fingerprint
bgmed   span   period of validity          integer           -            length of the interval this dataset describes, in hours:
                                                                          one of: 1 (default), 6, 24, 168
bgmed   date   datetime                    string            -            start of the time span that this document describes
                                                                          format "YYYY-MM-DD HH" as defined in ISO-8601
bgmed   nick   nickname                    string            mode         nickname of bridge
bgmed   bwa    bandwidth advertized        integer           mean         B/s
bgmed   bwc    bandwidth consumed          integer           mean         B/s
bgmed   tsv    Tor software version        string            mode         one of: 010, 011, 012, 020, 021, 022, 023, 024
bgmed   osv    operating system            string            mode         one of: linux, darwin, freebsd, windows, other
b       brp    bridge pool                 string            mode         one of: email, https, other
b       bre    bridge is in EC2 cloud      boolean           mode
b       brt    bridge pluggable transport  array    string   mode [*]     some of: obfs2, obfs3
importClients
field  description                 type     subtype  aggregation  valuespace
+------+---------------------------+--------+--------+------------+----------
_id    document ID                 string                         'client'+span+date, e.g. 'client-24-YYYYMMDDHH'
addd   timedate the doc was added  string                         ISO 8601 extended format YYYY-MM-DDTHH:mm:ss.sssZ
span   duration                    integer                        length of the time span that this dataset describes, in hours:
                                                                  one of: 24 (default), 168
date   datetime                    string                         start of the time span that this document describes
                                                                  format "YYYY-MM-DD HH" as defined in ISO-8601
cb     clients at bridges          integer           mean
cbcc   clients@bridges per country array    object   mean         {cc:integer} // an array of {countrycode : int} objects
cr     clients at relays           integer           mean
crcc   clients@relays per country  array    object   mean         {cc:integer}
cpt    bridge pluggbl.transp.used  object                         {obfs2/obfs3/OR/Unknown:integer}
cip    ip-version used             object            mode         {v4/v6:integer}
LEGEND --------------------------------------------------------------------
in indicates, for which type of node the field is relevant,
'bgmed' standing for Bridge Guard Middle Exit Directory
field name of the field in the database
description short description of the field's semantics
type as defined in 3.5 of http://datatracker.ietf.org/doc/draft-zyp-json-schema/?include_text=1
subtype if type is array, type of array content
valuespace expected values
for lists of possible values "some of" where multiple values are possible
or "one of" where possible values are mutually exclusive
[*] if the relay provides the functionality in question for at least half of the timespan in question
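The `[*]` rule can be read as a threshold on hourly observations: when several hourly documents are merged into a longer span, a role or flag is kept if it was present in at least half of them. The following is one possible reading, sketched with a made-up helper and sample data; the importer may implement it differently.

```python
def aggregate_roles(hourly_roles):
    """Keep each role that is present in at least half of the hourly documents."""
    hours = len(hourly_roles)
    counts = {}
    for roles in hourly_roles:
        for role in set(roles):
            counts[role] = counts.get(role, 0) + 1
    return sorted(role for role, n in counts.items() if 2 * n >= hours)

# Six hourly observations of one relay, to be merged into a span of 6.
six_hours = [["Guard", "Middle"]] * 3 + [["Middle"]] * 2 + [["Middle", "Exit"]]
roles = aggregate_roles(six_hours)
```

Here Guard (3 of 6 hours) and Middle (6 of 6) are kept, while Exit (1 of 6) is dropped.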
Client data is, unlike all relay and bridge data, never collected at the client nodes themselves (otherwise anonymity could be compromised). Instead, client data is derived from relay data through special means and is already aggregated into timespans when it is imported into MongoDB.
JSON schema
The above has been transformed into a JSON schema.
If the outline above and the schema get out of sync, the schema is authoritative.
For information about JSON Schema see Wikipedia and the Draft Specification.
The purpose of the schema is twofold: combined with a validator, it can provide some control over what data gets inserted into the database. Since MongoDB doesn't perform any consistency checks, this can be useful to detect if something goes wrong. More importantly, the validator can spot data that's not handled by the schema and trigger the addition of an appropriate (probably rather generic) query interface to the visualization GUI.
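Both purposes can be illustrated without a full JSON Schema validator. The sketch below checks documents against a hand-written, hypothetical subset of the relay fields; it is not the project's schema and deliberately avoids any validator library. It reports type mismatches (first purpose) and fields the schema doesn't know about (second purpose).

```python
# Hypothetical subset of the relay schema; the real JSON schema is authoritative.
RELAY_FIELDS = {"_id": str, "node": str, "span": int, "date": str, "bwa": int, "osv": str}

def check(doc, fields=RELAY_FIELDS):
    """Return (type_errors, unknown_fields) for one document."""
    type_errors = [key for key, expected in fields.items()
                   if key in doc and not isinstance(doc[key], expected)]
    unknown = sorted(set(doc) - set(fields))
    return type_errors, unknown

# "span" has the wrong type, "uptime" is not covered by the schema subset.
errors, unknown = check({"_id": "X-1-2013050114", "span": "1", "uptime": 99})
```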
Import checks
We are making assumptions about the imported data that wouldn't hurt to be checked.
The following query checks if bridges and all other types of relays are really disjoint sets:
TODO
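Until the actual database query is written, the intended check can be sketched in plain Python over documents from the two import collections. The function name and sample data are made up. Since a node may be a bridge before or after being a relay, the comparison is made per date rather than per fingerprint alone.

```python
def overlapping_nodes(relay_docs, bridge_docs):
    """Return (node, date) pairs that occur in both import collections."""
    relay_keys = {(d["node"], d["date"]) for d in relay_docs}
    bridge_keys = {(d["node"], d["date"]) for d in bridge_docs}
    return sorted(relay_keys & bridge_keys)

relay_docs = [
    {"node": "A", "date": "2013-05-01 00"},
    {"node": "B", "date": "2013-05-01 00"},
]
bridge_docs = [
    {"node": "B", "date": "2013-05-01 00"},  # violates the assumption
    {"node": "A", "date": "2013-04-01 00"},  # fine: bridge at a different time
]
overlap = overlapping_nodes(relay_docs, bridge_docs)
```

An empty result means the assumption holds for the checked documents.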
TODO this section is of questionable quality
The imported data represents the following dimensions:
node types
- 3 node types: relays, bridges, clients
- 4 relay types: guard, middle, exit, directory (not mutually exclusive)
flags
- 2 flags for relays only: stable, fast - both boolean (the others are too unimportant to aggregate)
- 1 flag for exits: permitted exit ports (3 values, not mutually exclusive)
- 3 flags for bridges only: pool (one of 3), ec2 (boolean), transport (2 values, not mutually exclusive)
bandwidths
- 2 bandwidths for all servers: advertized and consumed
probabilities
- 4 probabilities: for relay in general and for guard, middle, exit
software
- 8 software versions for tor
- 5 software versions for os
areas
- about 200 countries and many more autonomous systems
clients
- 2 types: @relays, @bridges
- 2 types per country: @relays, @bridges
- 2 flags: transport used (4 values) and ip-version used (2 values)
The fields in the 3 import collections overlap only in one case: date. That's the only clamp between all datasets. The table below tries to capture the multiple dimensions and qualities of the imported data.
"Mode" refers to the essential quality of the thing being counted.
This may be the numbers of hardware instances, software characteristics, measures of quality of service, number of users.
"Measure" documents if the numbers denote absolute values, percentages, averages etc.
Percentages don't easily compare to absolute values, and not all absolute values in one category add up to a meaningful sum because value spaces overlap.
Therefore it's important to carefully select, construct and arrange meaningful and actually comparable configurations.
"Unit" is not much different from Measure, mainly reflecting if the field is single-valued or multi-valued.
"Upper limit" denotes the upper limit of the value space. For percentages it's 100.
For each relay type it's the total number of relays. The important implication is that each relay can simultaneously belong to multiple types: the types altogether don't add up to the number of relays; their sum can be up to 4 times larger. Bridges are distinct from relays.
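A tiny made-up example shows why per-type counts must not be summed: each type's count is bounded by the number of relays, but their sum can reach up to four times that number.

```python
from collections import Counter

# Hypothetical relays and their simultaneous roles.
relays = {
    "A": ["Guard", "Middle", "Exit", "Dir"],
    "B": ["Middle"],
    "C": ["Guard", "Middle"],
}

per_type = Counter(role for roles in relays.values() for role in roles)
total_relays = len(relays)
sum_of_types = sum(per_type.values())
```

With these three relays, the per-type counts sum to 7, although only 3 relays exist.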
TYPE FIELD MODE MEASURE UNIT UPPER LIMIT
-----------------------------------------------------------------------
SERVER
server hard sum count server
osv soft sum...s count/item server
tsv soft sum...s count/item server
upt quality avg percentage 100
bwa ip sum count -
bwc ip sum count bwa
RELAY
relay hard sum count Server minus Bridge
g hard sum count < relay
m hard sum count < relay
e hard sum count < relay
d hard sum count < relay
pbr quality avg percentage 100 (but should be much less)
pbg quality avg percentage 100 (but should be much less)
pbm quality avg percentage 100 (but should be much less)
pbe quality avg percentage 100 (but should be much less)
flag soft sum...s count/item < relay
as net sum...s count/item < relay (but really much less)
pex soft sum,sum,sum count/item < relay
BRIDGE Server minus Relay
bridge hard sum count relay
brp net sum,sum,sum count/item bridge
bre hard sum count < bridge
brt soft sum,sum count/item bridge
CLIENT
cb user sum client
cr user sum client
cpt soft sum,sum,sum,sum count/item
cip soft sum,sum count/item
COUNTRY
clients
cbcc user sum client/country
crcc user sum client/country
relays
osv soft sum...s count/item/country relay
tsv soft count/item/country relay
upt quality avg percentage/country 100
bwa ip sum count/country
bwc ip sum count/country bwa
g hard sum count/country < relay
m hard sum count/country < relay
e hard sum count/country < relay
d hard sum count/country < relay
pbr quality avg percentage/country 100 (but should be much less)
pbg quality avg percentage/country 100 (but should be much less)
pbm quality avg percentage/country 100 (but should be much less)
pbe quality avg percentage/country 100 (but should be much less)
flag soft sum...s count/item/country < gmed
as net scat + sum...s count/item/country < gmed (but really much less)
Overview data on clients and relays:
We have some very general data on all relays and bridges: total count, software version, operating system version, total bandwidth provided and consumed.
Correspondingly we have quite general data on clients: how many clients in total were connected to the tor network via bridges or directly via guard nodes.
These two fit well together.
We also know which IP-version and which obfuscation techniques clients use.
But that's about it with clients and relays.
Clients:
Client data is on purpose quite sparse and we can't do much more than compare numbers of clients with the more detailed data about the relays and bridges.
For example, we will not be able to follow clients through the network.
Countries:
The most detailed view we can get on clients is their distribution by country. This is interesting since we also know from each relay the country in which it is located. And we know a lot about relays. So maybe we can construct some useful views on specific characteristics of relays and total numbers of clients by country.
Relays:
In addition to the data on relays and bridges, we have quite specific data on different types of relays (but not bridges), namely guards, middle nodes, exits and directory servers.
This data is detailed but not easy to handle.
Numbers for the different types of relays don't add up to the total number of relays since each relay can (and most often does) serve more than one purpose and implements two, three or all four types of relays besides bridges.
For each relay we know the probability with which it is part of a client's route through the network, but we would need to aggregate averages and mean deviations to add more meaning to these numbers.
We also know for most relays through which AS they are connected, but the number of distinct ASes is very large, so we first need to aggregate them to find the most used ones and to see how high the concentration is.
We then have some flags and exit port information, which again are not particularly easy to visualize (and interpret).
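The AS aggregation mentioned above boils down to counting relays per AS and ranking the result. A sketch using the `as` field of the import schema; the helper name is made up and the AS numbers are placeholder values from the private-use range.

```python
from collections import Counter

def top_as_share(docs, n):
    """Return the n autonomous systems with the most relays and their share of all relays."""
    counts = Counter(d["as"] for d in docs)
    total = sum(counts.values())
    return [(asn, count / total) for asn, count in counts.most_common(n)]

# Made-up relay documents.
docs = [{"as": 64512}, {"as": 64512}, {"as": 64512}, {"as": 64513}, {"as": 64514}]
top = top_as_share(docs, 1)
```

The same ranking could be weighted by bandwidth instead of relay count to measure traffic concentration.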
Bridges:
Last but not least, we have some data about bridges, but not as much as about relays. This is again on purpose, since bridges serve to circumvent attempts to block access to the Tor network altogether. Gathering too much information about them would make the censors' job easier.
Since bridges are distinct from relays their numbers add up to the total number of servers.
Apart from that we don't know much more than a few technicalities that don't have much impact on the rest of the network: from which bridge pool they were assigned, which transport they use and if they are hosted in the EC2 cloud.
flags
Most of the flags collected in the "relays" import collection actually serve so little purpose that we will not use them in the visualization, to avoid visual clutter and distraction and improve performance on the backend.
They will be imported into the database but will not be aggregated.
Only the flags "Fast", "Stable" and "Authority" will be aggregated for the following types of relays:
            Fast   Stable   Authority
Guard        x       x
Middle       x       x
Exit         x       x
Directory                       x
MongoDB
In proven OLAP fashion we'll aggregate all data into one big facts collection ('collections' are the MongoDB equivalent to SQL tables).
MongoDB does fit this purpose well because it allows sparsely populated collections. As a document store it also supports nested collections which comes in very handy when the data sets we retrieve from the network are not as uniform and regular as we'd like them to be. As MongoDB is a schemaless database we do not have to worry about future structural changes. When e.g. more performance data becomes available we can seamlessly add it without having to touch any of the existing documents.
MongoDB has some constraints of its own that need to be taken into account when designing the facts collection:
- no joins
(but we can work around that by visually layering query results on top of each other)
- only 64 indices per collection (equals table in SQL-speak)
(slightly easing this problem: composite indices)
- only one field in an index can be an array
(no workaround: we have to avoid arrays if they aren't really necessary)
Preparing the import tables
A few indices over the 3 import tables "relays", "bridges" and "clients" will speed up the aggregation:
- an index over "date" for "bridges" and "clients"
- an index over "date + flag" for "relays"
- an index over "cc + date" for "relays"
- an index over "date + as + role + node" for "relays" (?)
Aggregation
Aggregation of the imported data is necessary for several reasons:
- The imported server data is ordered by individual server and by date, but most of the time we will not want to look at individual servers but at all servers, or at a subset of servers sharing certain attributes during a given timespan.
- The imported data reflects only a certain view on the underlying network, highly influenced by how the data is collected. A visualization needs to provide other and more diverse perspectives, and the imported data has to be aggregated in different shapes and combinations to support the visualization accordingly. A well-prepared database is a prerequisite for a responsive and interactive visualization.
Step 1 - aggregation of imported data
In a first step imported data will be added to the facts collection.
Step 2 - consolidation and simplification
Then the facts table will be aggregated into longer timespans and other simplifications (e.g. regions) to improve retrieval performance.
Step 3 - indexing
The aggregated collections will be indexed to gain further speed advantages.
Additionally, indices over the 3 import collections are needed to facilitate generic and unforeseen queries and lookups on specific nodes.
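Step 1 can be sketched in a few lines of Python. This is only an illustration under assumptions: the import row fields (`date`, `node`, `flags`, `bwa`, `bwc`) and the function names are hypothetical, and the flag buckets are treated as mutually exclusive (a relay that is both fast and stable lands in flagFastStable only).

```python
def flag_bucket(flags):
    """Map a relay's flags to one of flagNone, flagFast, flagStable, flagFastStable."""
    name = "".join(f for f in ("Fast", "Stable") if f in flags)
    return "flag" + (name or "None")


def aggregate_relays(rows):
    """Fold per-relay import rows into one fact document per date."""
    facts = {}
    for row in rows:
        fact = facts.setdefault(row["date"], {"date": row["date"], "relays": {}})
        # Every relay counts towards "total" and towards its own flag bucket.
        for key in ("total", flag_bucket(row.get("flags", []))):
            result = fact["relays"].setdefault(key, {"count": 0, "bwa": 0, "bwc": 0})
            result["count"] += 1
            result["bwa"] += row["bwa"]
            result["bwc"] += row["bwc"]
    return list(facts.values())


# Hypothetical import rows: one row per relay per hour.
rows = [
    {"date": "2013-05-23 00", "node": "A", "flags": ["Fast", "Stable"], "bwa": 100, "bwc": 60},
    {"date": "2013-05-23 00", "node": "B", "flags": ["Fast"], "bwa": 50, "bwc": 20},
    {"date": "2013-05-23 01", "node": "A", "flags": [], "bwa": 100, "bwc": 70},
]
facts = aggregate_relays(rows)
```

Each resulting document holds one result object per column, so new columns can be added later without touching existing documents.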
step 1 : import data aggregation
tl;dr: a schematic example of a row of the resulting facts collection can be found here
A rather minimal fact table would include:
(4 relays x 2 flags + 3 nodes) x 2 bandwidths = 22 bandwidths
But we need the intermediate steps too because we also want to know these numbers for groups of nodes like all stable relays or all servers. That already leads to more than 30 bandwidth values - a rough first estimate and a very reasonable and encouraging result. But this sketch neglects a lot of information that we want to make visible, and the devil lies in the detail (-ed data sets).
An exhaustive fact table should encompass everything we know from a certain timespan, about all node types and in any dimension. We'll see how far we can get on the way.
0 _id, date, span
1 clients
total int
2 atBridges int
3 atRelays int
4 cip4 int
5 cip6 int
6 cptObfs2 int
7 cptObfs3 int
8 cptOr int
9 cptOther int
For clients this is all we know, save the clients per country which we'll tackle later.
Clients @bridges and @relays are mutually exclusive, the other fields aren't. We'll just list them one after another.
For transports we currently have 4 possible values: obfs2, obfs3, OR, other.
More transports may be developed in the future.
It seems sensible to add a result object that has fields for every possible combination of transports offered by a bridge.
The value is always the number of clients complying to the field type.
legend                            c   osv tsv bwa bwc prb pex
10  servers
      total               object  x   x   x   x   x
11  bridges
      total               object  x   x   x   x   x
12    brpEmail            object  x   x   x   x   x
13    brpHttps            object  x   x   x   x   x
14    brpOther            object  x   x   x   x   x
15    breTrue             object  x   x   x   x   x
16    brtObfs2            object  x   x   x   x   x
17    brtObfs3            object  x   x   x   x   x
18    brtObfs23           object  x   x   x   x   x
19  relays
      roleAll
        total             object  x   x   x   x   x   x
20      flagNone          object  x   x   x   x   x   x
21      flagFast          object  x   x   x   x   x   x
22      flagStable        object  x   x   x   x   x   x
23      flagFastStable    object  x   x   x   x   x   x
24    roleGuard
        total             object  x   x   x   x   x   x
25      flagNone          object  x   x   x   x   x   x
26      flagFast          object  x   x   x   x   x   x
27      flagStable        object  x   x   x   x   x   x
28      flagFastStable    object  x   x   x   x   x   x
29    roleMiddle
        total             object  x   x   x   x   x   x
30      flagNone          object  x   x   x   x   x   x
31      flagFast          object  x   x   x   x   x   x
32      flagStable        object  x   x   x   x   x   x
33      flagFastStable    object  x   x   x   x   x   x
34    roleExit
        total             object  x   x   x   x   x   x   x
35      flagNone          object  x   x   x   x   x   x   x
36      flagFast          object  x   x   x   x   x   x   x
37      flagStable        object  x   x   x   x   x   x   x
38      flagFastStable    object  x   x   x   x   x   x   x
39    roleDir
        total             object  x   x   x   x   x
40      authorityTrue     object  x   x   x   x   x
That's 31 columns about servers (10 to 40), including the most common flags. Still looks manageable. And we cover a lot of ground here since the value is not only a number like with clients but an object with several field:value pairs: count, bandwidths and software versions for all server nodes, probabilities and exit ports where applicable.
The result object in detail:
First, every object contains a field counting the number of nodes that match the field type.
Second, for each of these node types 2 bandwidth values can be calculated: advertised and consumed bandwidth.
Third, all objects contain sub-objects for Tor software version and operating system.
Finally, some node types have additional fields in their result object:
- the relays also carry a probability field
- exits also carry the permitted exit ports.
  There are 3 possible port values and every combination thereof, which makes 7 fields.
  We'll add these as a sub-object to the result object.
This elegant way of using the columns for more than one result type is possible because bandwidths, node counts, probabilities and exit ports are independent of each other. There is no way to construct a different perspective in which bandwidths and node counts don't correlate in the same way.
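The 7 exit port fields can be derived mechanically from the 3 port values. A minimal sketch, assuming the short names p4, p6 and p8 for ports 443, 6667 and 80 that are used in the country object further down; the function names are made up for illustration.

```python
from itertools import combinations

# Short keys as used in the pex sub-object: 4 as in 443, 6 as in 6667, 8 as in 80.
PORTS = {"4": 443, "6": 6667, "8": 80}


def pex_fields():
    """List every non-empty combination of the three ports as a field name."""
    keys = sorted(PORTS)
    return ["p" + "".join(c) for r in range(1, len(keys) + 1) for c in combinations(keys, r)]


def pex_field(permitted_ports):
    """Return the field an exit belongs to, or None if it permits none of the ports."""
    digits = "".join(k for k in sorted(PORTS) if PORTS[k] in permitted_ports)
    return "p" + digits if digits else None


fields = pex_fields()
```

An exit permitting ports 80 and 443 would be counted under p48 only, so the 7 fields stay mutually exclusive.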
But now we've reached the end of low hanging fruit.
Mutually non exclusive relay types
Up to now it looks like we have everything covered. Or is there a combination of type, flag and probability that we couldn't find in this table in one easy step?
The astute reader will have noticed that there is indeed a problem: guards, middles, exits and directories aren't mutually exclusive. To capture any combination thereof we would need not only 4 but 15 rows, so add 9 to 40 = 49.
Plus we wouldn't want to lose track of the flags and add another - hold your breath - 66 rows.
combinations of relay types and flags
type     g        d        gd
         m                 gmd
         gm                md
         gme               ged
         ge                gmed
         e                 med
         me                ed
flags    4        2        6
total    28       2        36
sum      66
Altogether 115 columns. Maybe we can get rid of this scary situation by stuffing the combinations of types and flags into a separate collection?
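The 15 type combinations can be enumerated mechanically as a sanity check. A minimal sketch; the single-letter abbreviations follow the table above, the function name is hypothetical.

```python
from itertools import combinations

ROLES = "gmed"  # guard, middle, exit, directory


def role_combinations(roles=ROLES):
    """Every non-empty combination of roles, e.g. 'g', 'gm', 'gmed'."""
    return ["".join(c) for r in range(1, len(roles) + 1) for c in combinations(roles, r)]


combos = role_combinations()
with_dir = [c for c in combos if "d" in c]
without_dir = [c for c in combos if "d" not in c]
```

With 4 roles there are 2^4 - 1 = 15 non-empty combinations: 7 without the directory role and 8 that include it.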
OS or Tor software versions
Adding OS or Tor software versions as further dimensions would mean blowing up the dimensionality to 37 x 5 = 185 or 37 x 8 = 296 respectively, and I can't see any scenario in which this effort would be justified. And that still leaves out the 40 combinations of OS and Tor software versions.
Probably Tor software version and OS versions are only of limited significance. I tend to add them to the result objects of the main 31 server columns sketched out above and be done with it.
13 more field:value pairs added to each result object, 5 for OS and 8 for Tor software: would that seem useful?
Maybe even cut that down and only add them to bridges and the 4 relay types, without honoring the flags?
At least theoretically interesting are the 40 possible combinations of operating system and Tor software compared with any of the other dimensions, e.g. the 8 basic node types (without flags) = 320 combinations. Not nice, but doable in a separate collection. Would that be useful?
Areas
But so far this was all peanuts compared to country and AS information.
These are enormous value spaces that - if they are not reduced - need to be at the root of a tree-like structure, not at the leaves.
Therefore we have to change perspective: we can't start from the perspective of servers and clients anymore, we have to start from the properties country and AS.
Again there are differences: while there exist about 37,000 autonomous systems, there are fewer than 200 countries - which is still a lot, but also a lot less than AS. We already have very interesting data about clients per country, which makes it mandatory to come up with a decent schema that can handle all countries. The solution is an array of country:value objects, each populated by a rather complex result object, like so:
41 countries array of objects
country cc country
cbcc int how many clients in this country connecting through bridges
crcc int how many clients in this country connecting through relays
relay int how many relays in this country
guard int how many guards in this country
middle int how many middles in this country
exit int how many exits in this country
directory int how many directories in this country
bwa int total bwa of all relays in this country
bwc int total bwc of all relays in this country
pbr float total probability of all relays in this country
pbg float total probability of all guards in this country
pbm float total probability of all middles in this country
pbe float total probability of all exits in this country
fast int how many fast relays in this country
stable int how many stable relays in this country
osv object
linux int
freebsd int
darwin int
windows int
other int
tsv object
v010 int
v011 int
v012 int
v020 int
v021 int
v022 int
v023 int
v024 int
pex object
p4 int 4 as in 443
p6 int 6 as in 6667
p8 int 8 as in 80
p46 int
p48 int
p68 int
p468 int
autosys array of objects
as int
This approach has one problem: with MongoDB the inner arrays can't be indexed if we already have an index on the outer array 'country' - and we definitely need that country index. For osv, tsv and pex this can be solved by plainly listing them: that's 16 rows. But for autonomous systems the problem is not so easily solvable since the matrix of 200 countries and all autonomous systems is, in our case, close to unmanageable. A possible workaround could be to limit the list to just the 10 or 100 AS with the most bandwidth, or probability, and one more value for the rest.
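The "top N plus rest" workaround might look like this. A rough sketch only: the input mapping, the output field names and the AS numbers in the example are hypothetical, and ranking by advertised bandwidth is just one of the criteria mentioned above.

```python
def top_n_with_rest(bwa_by_as, n=10):
    """Keep the n AS with the most bandwidth and fold all others into one 'rest' entry."""
    ranked = sorted(bwa_by_as.items(), key=lambda kv: kv[1], reverse=True)
    top = [{"as": name, "bwa": bwa} for name, bwa in ranked[:n]]
    rest = sum(bwa for _, bwa in ranked[n:])
    return top + [{"as": "rest", "bwa": rest}]


# Hypothetical advertised bandwidth per AS within one country.
example = {"AS1": 500, "AS2": 300, "AS3": 40, "AS4": 10}
autosys = top_n_with_rest(example, n=2)
```

The list length is then bounded by n + 1 per country, no matter how many autonomous systems show up in the data.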
Additionally countries could be grouped into continents, political regions (like "middle east", "EU"), by bandwidth consumption etc.
Because of their sheer number, autonomous systems also have to be analyzed on their own. To understand which of them are of significant importance to the network as a whole or to specific countries, for specific functionalities, at specific times etc., we need to aggregate them over at least the most common fields.
42 autosys array of objects one result object per AS
as string number of as (format is string because it's a name)
name string name of as
home string home country of as, jurisdiction
relay int how many relays in this AS
bwa int total bwa of all relays in this AS
bwc int total bwc of all relays in this AS
fast int how many fast relays in this AS
stable int how many stable relays in this AS
guard int how many guards in this AS
middle int how many middles in this AS
exit int how many exits in this AS
dir int how many directories in this AS
pbr float total pbr of all relays in this AS
pbg float total pbg of all guards in this AS
pbm float total pbm of all middles in this AS
pbe float total pbe of all exits in this AS
countries array of objects
cc string two-letter (ISO 3166-1 alpha-2) country code
relay int how many relays in that country in this AS
bwa int how much bwa in that country in this AS
bwc int how much bwc in that country in this AS
pbr float total probability of all relays in that country and this AS
pbg float total probability of all guards in that country and this AS
pbm float total probability of all middles in that country and this AS
pbe float total probability of all exits in that country and this AS
This is still sketchy. More input and ideas on handling AS would be welcome.
aggregate unique items
Maybe it would be useful to have a special collection called uniqueItems that contains arrays of all values that ever turned up for a given field, e.g.
- countries
- autonomous systems
- nicknames
Wouldn't it?
uptimes
A node may not be online in every part of an aggregated timespan.
We don't count servers that haven't been available for at least 30% of a timespan.
That way we count the bandwidth a little conservatively, while we are too optimistic regarding the number of available servers.
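The 30% rule can be expressed as a simple filter. A minimal sketch, assuming we know for each node the number of hours it was seen within the aggregated timespan; the names are hypothetical.

```python
MIN_UPTIME = 0.3  # a server must be available for at least 30% of the timespan


def counted_servers(hours_seen, span_hours, min_uptime=MIN_UPTIME):
    """Return the nodes that were available long enough to be counted."""
    return sorted(
        node for node, hours in hours_seen.items()
        if hours / span_hours >= min_uptime
    )


# Hypothetical availability within a 24 hour span.
seen = {"A": 24, "B": 8, "C": 7}
counted = counted_servers(seen, span_hours=24)
```

Here B (8 of 24 hours) is counted while C (7 of 24 hours) falls just below the threshold.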
scattering / spreading / evenness of distribution
So far we only examined aggregated groups of node types. To understand distribution over the individual nodes we have to collect some 10 or 100 or whatever biggest nodes in each category.
These numbers can be added to the server result objects explained above.
They can be added alongside applicable fields in the country objects, namely: relay (guard, middle, exit, dir), bandwidths, probabilities, flags.
Likewise for AS.
And we should establish some measure to indicate how even the distribution is (without having to look at individual nodes).
But this is just a reminder and a list of notes. We agreed to postpone this domain.
TODO
step 2 : consolidation and simplification
aggregations over time and space. Time quite obviously translates to the ability to watch the data from the finest level available - hourly - to an overview that shows the whole timespan available - currently 5 years - in a single view. The equivalent to zooming in and out. Space translates to the ability to group countries to meaningful regions, either continents or geopolitical regions like "arab spring".
timedate intervals / periods
The default timespan is 1 hour for relays and 24 hours for clients. At a scale of 1 pixel per default timespan we can't see the whole data on a regular display.
So far we have collected about 5 years of data, which leads to the following numbers of pixels:
           5      5 years since 2008
5 x 12     60     months
x 4        240    weeks
x 30       1800   days
// so, if 1px isn't too fine for our weary eyes, on a regular display
// we can show about half of the available data on a per day basis
x 4        7200   6 hours, quarter day
x 24       43200  hourly
We will want to zoom in and out of the data visualization and hence need to define aggregated timespans. Sensible spans could be
6h   6 hours
1d   24 hours, 1 day
1w   168 hours, 7 days, 1 week
1m   1 month, about 4 weeks, about 30.5 days
If we skip months as too coarse anyway (but actually because they are so unwieldy and irregular) we could get by with 4 possible integer values: 1, 6, 24, 168
We probably need to pre-aggregate these timespans in MongoDB (which provides map/reduce functionality and an "aggregation framework"). Maybe the Cube project (based on D3.js) can be used.
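Whatever tool ends up doing the pre-aggregation, the core step is flooring each hourly timestamp to the start of its span. A minimal sketch in plain Python, assuming naive UTC datetimes and spans given as whole hours; note that with this approach the 168 hour buckets are aligned to the Unix epoch (a Thursday), not to calendar weeks.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SPANS = (1, 6, 24, 168)  # hours
EPOCH = datetime(1970, 1, 1)


def bucket_start(dt, span_hours):
    """Floor a naive UTC datetime to the start of its span."""
    hours = int((dt - EPOCH).total_seconds() // 3600)
    return EPOCH + timedelta(hours=hours - hours % span_hours)


def group_by_span(hourly, span_hours):
    """Collect hourly values into lists keyed by the start of their span."""
    grouped = defaultdict(list)
    for dt, value in hourly.items():
        grouped[bucket_start(dt, span_hours)].append(value)
    return dict(grouped)


# Hypothetical hourly values for the first 12 hours of one day.
hourly = {datetime(2013, 5, 23, h): h for h in range(12)}
grouped = group_by_span(hourly, 6)
```

The grouped lists are left unreduced on purpose: transferred bytes would be summed per span, whereas node counts would rather be averaged.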
continents and political regions
TODO
step 3 : indexing
- import collections
relay: node+timespan to look up specific nodes
Issues
Background Information
Wikipedia has quick introductions to the meaning of mean, median and mode (the links point to the German edition).
Some material about MongoDB and OLAP
MongoDB - Materialized View/OLAP Style Aggregation and Performance (stackoverflow)
Another useful thread on stackoverflow, see especially the second answer
MongoDB OLAP with pre-aggregated cubes
DataBrewery Cubes
MongoDB OLAP
How well would the schema developed so far from the inherent characteristics of the data and the limitations of the database fit the usecases we already gathered?
Visualizing the total pbr of all relays with a certain characteristic.
For example, what's the total pbr of all relays in Germany?
. covered by the aggregations outlined above! (by the countries subcollection)
Right now, if you ask for relays running version 0.2.4, it looks up those numbers in the static JSON file that metrics exports. You cannot ask for the number of relays running version 0.2.4 on Linux, and you cannot ask for bandwidth provided by relays running version 0.2.4.
. I described the aggregation required to fulfill this usecase above and dismissed it as too expensive.
But it may be tackled again.
One could ask how many bytes per day are transported by relays running Linux.
Or, what's the probability of having a Windows relay as your entry guard.
. more software version tasks.
My question is still: is this really important on such a detailed level?
Note that you wouldn't have to aggregate by single relay or bridge, but you could aggregate all relays or bridges with the same combination of dimensions. For example, you only care about facts like "on May 23, 2013, there were 25 relays running with type Guard and Middle, with the Fast and Stable flag, with version 0.2.3.x, on OS X, in AS 1234, not permitting any ports, in Germany".
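Counting relays per combination of dimensions, rather than per fingerprint, could be sketched as follows. The field names and values are hypothetical and only mirror the dimensions named in the example above.

```python
from collections import Counter

# Hypothetical dimension names, one per attribute in the example.
DIMENSIONS = ("date", "roles", "flags", "tsv", "osv", "as", "pex", "cc")


def count_by_dimensions(relays, dimensions=DIMENSIONS):
    """Count relays sharing the same values in the given dimensions."""
    return Counter(tuple(relay[d] for d in dimensions) for relay in relays)


base = {"date": "2013-05-23", "roles": "gm", "flags": "FastStable",
        "tsv": "v023", "osv": "darwin", "as": "1234", "pex": None, "cc": "de"}
relays = [dict(base), dict(base), dict(base, cc="us", osv="linux")]

full = count_by_dimensions(relays)
per_country = count_by_dimensions(relays, ("cc",))
```

The same function answers single-dimension questions when it is given fewer dimensions, which matches the suggestion to start simple and extend later.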
When I wrote that example, I wanted to express the maximum level of detail that you'd have to keep to answer any question that anybody could ever want to ask. I wanted to say that you don't need to remember which particular relay fingerprints are behind that number. But you're right that nobody would actually want to know the answer to such a detailed question.
Starting from the other end, I suggest you start with questions touching only a single dimension: "how many relays were there in Germany?" or "how many relays were there on OS X?" And when people want to know more, like: "how many relays on OS X were there in Germany?", it would be good if the system can be extended to answer such questions. But actually extending it could be step two.
. The schema above could be extended, with not too much effort, to cover such queries. In this case I would probably add 5 OS columns for relays to the country sub-collection.
There's a large emphasis on node numbers, but really, bwa, bwc, and pbr are more important measures than the number of nodes. Here's my idea: how about you keep osv_r, tsv_r, fast_r, stable_r, and as_r and store arrays of [#nodes, bwa, bwc, pbr] for each of them? For osv_r, tsv_r, and as_r that would mean storing an array of arrays, and for fast_r and stable_r it would be just that array.
. I hope I took this into account with the new aggregation concept.
But still, visualizing the average pbr (consensus weight fraction) of all relays doesn't make much sense to me. The pbr values of all relays add up to 100%, so that the average is always 1 / #relays. What makes more sense is visualizing the total pbr of all relays with a certain characteristic. For example, what's the total pbr of all relays in Germany? That makes much more sense to me.
. covered
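A tiny worked example shows why the average carries no information while the total per characteristic does. The relay data is made up for illustration.

```python
# Hypothetical relays; the pbr values add up to 1.0 (100%).
relays = [
    {"cc": "de", "pbr": 0.5},
    {"cc": "de", "pbr": 0.25},
    {"cc": "us", "pbr": 0.125},
    {"cc": "fr", "pbr": 0.125},
]


def total_pbr(relays, **criteria):
    """Sum the pbr of all relays matching every given field=value criterion."""
    return sum(
        r["pbr"] for r in relays
        if all(r.get(field) == value for field, value in criteria.items())
    )


average = total_pbr(relays) / len(relays)  # always 1 / number of relays
germany = total_pbr(relays, cc="de")
```

The average only restates the number of relays, whereas the total for Germany says how likely it is that a relay in Germany gets picked.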
- chrome colors green and purple
- coloring of data should be readable for color blind people
- selecting countries by region, by other criteria (e.g. number of relays), on a map etc.
- visualize countries on a fisheye map, with suitable projection
- selecting time period by widget, zoom in/out, move left/right in time
- ability to change scale on vertical axis
- ensure that any field not accessible through predefined vis options is accessible through a generic interface
- combine criteria, e.g. stable and fast relays running Linux with OS version xy in country z
- combine/add/stack graphs to show complete datasets (e.g. pie charts)
- SVG export
- [future] consumed bandwidth between relays
this list is unsorted
Some useful links:
- notify the client of new fields so they can be added to the generic interface
- RESTfulness: having the URL represent the complete state of a visualization, e.g. including zoom factor, active facets, selected clipping etc.
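The idea of a URL carrying the complete visualization state can be illustrated with a small round trip. This is a sketch in Python rather than in the eventual client-side code; the base URL and the parameter names (`from`, `to`, `span`, `facets`) are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def state_to_url(base, state):
    """Serialize a visualization state into a query string (sorted for stable URLs)."""
    return base + "?" + urlencode(sorted(state.items()), doseq=True)


def url_to_state(url):
    """Parse the state back; parameters given once become plain strings."""
    parsed = parse_qs(urlsplit(url).query)
    return {key: values if len(values) > 1 else values[0] for key, values in parsed.items()}


state = {
    "from": "2013-01-01 00",
    "to": "2013-03-31 23",
    "span": "24",
    "facets": ["cc", "flags"],
}
url = state_to_url("https://example.org/visionion", state)
restored = url_to_state(url)
```

One limitation of this simple scheme: a list with a single entry comes back as a plain string, so a real implementation would need a fixed notion of which parameters are lists.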
framework
Still not sure which framework to use. Something lightweight should suffice. angular.js may be too involved. knockout.js, like angular.js, takes a declarative approach. can.js doesn't have that declarative touch but apart from that looks very promising.
datetime
Handling of date and time can get difficult with JavaScript because not every environment handles every possible datetime format equally well. Besides the ubiquitous UTC-epoch format, which is rather inaccessible to humans, we settle on "YYYY-MM-DD HH" as defined in ISO-8601 which is supported across all browsers and serves our needs just well. If D3.js doesn't provide all we need we may use the Moment.JS library which "was designed to work both in the browser and in Node.JS". For further discussion of the topic see Stackoverflow.
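On the importer side the same "YYYY-MM-DD HH" strings can be converted to and from epoch seconds with the Python standard library. A minimal sketch, assuming all timestamps are in UTC; the function names are hypothetical.

```python
from datetime import datetime, timezone

FORMAT = "%Y-%m-%d %H"  # "YYYY-MM-DD HH"


def to_epoch(text):
    """Convert 'YYYY-MM-DD HH' (assumed UTC) to seconds since the epoch."""
    parsed = datetime.strptime(text, FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def from_epoch(seconds):
    """Convert seconds since the epoch back to 'YYYY-MM-DD HH' in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(FORMAT)


one_day = to_epoch("1970-01-02 00")
round_trip = from_epoch(to_epoch("2013-05-23 15"))
```

Setting the timezone explicitly avoids results that depend on the local timezone of the machine running the import.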
An importer tool takes metrics descriptors as input and produces JSON or BSON to be imported into MongoDB. Such a tool should use Stem, which is a Python library that parses all relevant metrics descriptors. I think it even has an export function that may or may not support JSON. See Tor ticket #6171 for more details: https://trac.torproject.org/projects/tor/ticket/6171. import.py is a simple data importer that uses Stem to read consensuses and server descriptors and that prints out dicts that could be imported into MongoDB.
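The output side of such an importer could be as simple as writing one JSON document per line, which is the format mongoimport reads by default. A minimal sketch that leaves out the Stem parsing entirely; the dict fields are hypothetical and do not reflect what import.py actually prints.

```python
import json


def to_json_lines(documents):
    """Serialize dicts as newline-delimited JSON, one document per line."""
    return "".join(json.dumps(doc, sort_keys=True) + "\n" for doc in documents)


# Hypothetical relay dicts as they might come out of a descriptor parser.
relays = [
    {"date": "2013-05-23 00", "node": "A", "flags": ["Fast", "Stable"], "bwa": 100},
    {"date": "2013-05-23 00", "node": "B", "flags": [], "bwa": 50},
]
output = to_json_lines(relays)
```

Written to a file like RAWdata/relays.json, this output could then be fed to mongoimport.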
- sketches of a visualization
- more documentation of pre-import aggregation (extract from karsten's mails)
- aggregation of visualization primitives and timespans
- figure out how to control MongoDB via external scripts, particularly aggregation, indexing and status/control queries
  http://docs.mongodb.org/manual/tutorial/write-scripts-for-the-mongo-shell/
  e.g. prompt:> mongo localhost:27017/tor ~/visionion/aggregation.js
- Then a prototype visualization of some graph will be the first occasion to connect the database, the web application framework and the visualization library.
- When that's accomplished more experiments need to be conducted to see if it's really possible to have more than one D3 instance on one webpage and how they can interact.
- Then the real work on the visualizations can begin.
- tbc
On OSX:
brew install mongodb
# Start mongo db and create the database
mkdir MONGOdata
mongod --dbpath MONGOdata
# Import the data
mongoimport --db tor --collection importRelays --stopOnError --upsert --file RAWdata/relays.json
mongoimport --db tor --collection importBridges --stopOnError --upsert --file RAWdata/bridges.json
mongoimport --db tor --collection importClients --stopOnError --upsert --file RAWdata/clients.json
# start mongo shell
mongo
# ensure index over date of import collections
db.importClients.ensureIndex({date:1})
db.importBridges.ensureIndex({date:1})
db.importRelays.ensureIndex({date:1})
# run a javascript file through a new mongo shell
mongo localhost:27017/tor visionion/aggregateFacts.js
# housekeeping tasks in mongo shell
show dbs
use dbName
db.dropDatabase()
show collections
db.collectionName.remove()
db.collectionName.ensureIndex({fieldName:1}) // sorting: 1 ascending, -1 descending
db.collectionName.dropIndex("indexName")
db.collectionName.getIndexSpecs()
db.collectionName.findOne()