
Comments (7)

tomlurge commented on July 29, 2024

The notion "number of users" gets more ambiguous the longer I think about it. Does it mean the "(average) number of users simultaneously connected to the network at any given time during the given period"? Or the total number of users during one day (computed by extrapolating from some averages and rules of thumb)? Or the number of "average users" per day (which might, e.g., be users that are connected for 10 hours straight and generate 10 consensus requests during that time)? I assume it's the last one, but maybe you can clarify.

I suspect that there's a flaw in your reasoning and that you're mixing together two concepts: "number of users per day" and "average number of users at any given time during that day". These are not the same and can't be converted into each other. 10 users connected for 24 hours each make 10 users per hour and per day - and 240 hours of usage. 10 different users connected for 1 hour each per day still make 10 users - but only 10 hours of usage. 10 users each hour, all different from each other, make 240 users per day - and 240 hours of usage. From the way you explained how the number of users is computed I can't tell the difference. But it seems that the numbers in the end don't tell much about usage, only about aspects that could lead to a correct estimate of usage. The point is not so much if it's averages or absolute values,

Anyway, you're right that we can't compute the average users per hour by dividing the daily number by 24. We would at least have to know the average time a user is connected to the network. If that was 1 hour, then dividing by 24 would be correct. Likewise, if it was 2 hours we would have to divide by 12, etc.
In the absence of such knowledge we can only copy the averages from the whole day to every hour. (But I'm still not sure what these numbers really mean.)
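
The scenarios and the divide-by-24 question can be made concrete with a minimal Python sketch. It assumes hypothetical per-user session records, written as (user_id, hours_connected) pairs, which Tor does not actually provide; the function and variable names are illustrative only.

```python
def summarize(sessions):
    """Return (distinct users, total hours of usage) for one day.

    `sessions` is a hypothetical list of (user_id, hours_connected) pairs.
    """
    distinct_users = len({user_id for user_id, _ in sessions})
    usage_hours = sum(hours for _, hours in sessions)
    return distinct_users, usage_hours


def average_users_at_any_time(sessions):
    """Average number of simultaneously connected users over 24 hours."""
    _, usage_hours = summarize(sessions)
    return usage_hours / 24


# The same 10 users, each connected for all 24 hours.
always_on = [(user, 24) for user in range(10)]
# 10 different users, each connected for 1 hour.
one_hour_each = [(user, 1) for user in range(10)]
# 10 new users every hour, each connected for 1 hour.
churning = [(hour * 10 + user, 1) for hour in range(24) for user in range(10)]
# 240 different users per day, each connected for 2 hours.
two_hours_each = [(user, 2) for user in range(240)]

scenarios = {
    "always_on": always_on,
    "one_hour_each": one_hour_each,
    "churning": churning,
    "two_hours_each": two_hours_each,
}
for name, sessions in scenarios.items():
    users, hours = summarize(sessions)
    print(name, users, "users,", hours, "usage hours,",
          average_users_at_any_time(sessions), "users at any time")
```

With 1-hour sessions, 240 daily users correspond to 240 / 24 = 10 users at any time; with 2-hour sessions, to 240 / 12 = 20. Without the connection time, the daily count alone cannot be converted.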

But regarding the weekly average we're on the safe side: we sum up the numbers of 7 days and divide the result by 7, and that's the average for the week. Right?

from visionion.

kloesing commented on July 29, 2024

You're making some very good points here.

We don't have any data that allows us to distinguish users. What we really estimate is usage. We say "users" just because it's easier to understand for the masses. Obviously this makes it harder to understand for the experts.

When we say there was 1 user, that could have been a single user being connected for 24 hours. Or it could have been 10 users being connected for 2:24 hours (24 hours divided by 10). Or it could have been 10 users being connected from 00:00 to 02:24 and 0 users in the remaining 21:36 hours. We don't know, and we have no way to find out.

So, I'd say this metric can best be described as "average number of users at any given time during the day". Which means that this is the very same number for any given hour on that day, because we don't have any data saying whether there were fewer or more users at that specific hour. Just assume usage was stable over the day.
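
As a rough illustration of why these cases are indistinguishable, the sketch below computes the average number of connected users from total connected time only. The per-user minutes are hypothetical inputs (no such data exists for Tor), and `reported_users` is an illustrative name, not part of any importer.

```python
MINUTES_PER_DAY = 24 * 60


def reported_users(connected_minutes_per_user):
    """Average number of connected users over one day.

    Only the total connected time enters the result, so neither the number
    of distinct people nor the time of day can be recovered from it.
    """
    return sum(connected_minutes_per_user) / MINUTES_PER_DAY


# One user connected for the full 24 hours.
one_user_all_day = [MINUTES_PER_DAY]
# Ten users connected for 2:24 hours (144 minutes) each, whether spread
# over the day or all between 00:00 and 02:24.
ten_users_short = [144] * 10

print(reported_users(one_user_all_day))
print(reported_users(ten_users_short))
```

Both calls give the same value of 1 user, which is the sense in which the metric is an average over the day.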

Calculating the weekly number is not as trivial as you say. Your calculation is only correct because you implicitly think of the number as being an average. But yes, computing the mean like you say is correct.

Let me know if/when I should patch the Java importer to fix the divide-by-24 bug.


tomlurge commented on July 29, 2024

tl;dr : apply the patch!

It's becoming clear to me that usage and users are pretty vague terms...

Usage is not only hard to measure because of Tor-specific technicalities: e.g. consumed bandwidth seems to be a rather "hard" measure, but what does it really tell us about the use of Tor? A soap opera video download and checking one's dissident web mail may be extremely different in their bandwidth_consumption/actual_usefulness ratio. The notion of usage inherently makes some assumptions about how important an act is - but even the soap opera episode may be very important under certain circumstances. Still, consumed bandwidth gives an idea, and - most of all - the number is "hard".

Since this user calculation is basically a well-informed guess, one might go a different route than with bandwidth and calculate not what we have but what we want to have, what we expect, what makes the most sense in the context of all available data - avoiding a mismatch between offer and expectations. That way we don't have to explain and the user (of the visualization) doesn't have to think.
When I see a field named "Users" containing an integer and framed by a timespan, I expect that number to be the total of users that connected during that timespan. That of course carries some inherent assumptions about how long a user is regularly connected (probably not longer than that timespan), whether 2 usages by the same user during that timespan - one in the morning, one in the evening - constitute 1 or 2 users, etc. That's not easy to figure out.

What would I want to know? "How many dudes are actually using that stuff more or less regularly and not just for fun, total number". And also: "At any given time, how many people were online through Tor simultaneously?" These are two different concepts, corresponding quite nicely to the concepts of "users" and "usage". Since the usage aspect is already covered by the hard number of "consumed bandwidth", we could make "users" mean the total number of "active users", like Facebook and Google+ measure (or fake) their populations.
But OTOH you seem to put more value on the usage aspect of the users calculation, and maybe this is also more in line with the rest of the data: e.g. we don't sum up absolute values of consumed bandwidth day after day to an ever increasing pile. That number can shrink as well when too many relays go offline etc. But the population can also shrink when people that used to use Tor a lot start to use some other service (you never know!). Oh, this is muddy territory!

Which leads me back to looking at the basis of this calculation: what data it is derived from. And it seems to me that the link from consensus requests to average number of users is relatively strong. So: yes, seems like a plausible way to go. Please apply the patch :-)

Regarding the weekly average we seem to have the same opinion: it's an average and calculating the mean is the way to go. Right?


kloesing commented on July 29, 2024

Sent pull request. I'll post a link to new data once people.torproject.org is back online. Or I can send you the link via private mail. Whichever you prefer.

I'm not sure what to do with your comment above. Again, you raise important points. But can you be more specific about what actions we should take from here? We're clearly limited by the data that is available, and it's really hard to establish new user/usage metrics. Just to give you an idea, coming up with the existing user/usage metric took me at least 6 months, distributed over the past years. It's not at all perfect, but alternatives are surprisingly hard to build. It's unrealistic that we can establish a new metric in the near future (though I wouldn't stop you from coming up with something, ideally after visionion is deployed). I think the best we can do at this point is phrase the meaning of the existing user/usage metric as clearly as possible, so that both experts and the crowd are happy. I'm happy to take advice there. But can you be more specific about which phrasings need changing?


tomlurge commented on July 29, 2024

No need for the link right now, no email necessary.

Regarding the comments above: I do think that there are different "things" that one could reasonably expect or wish to see associated with the concepts "users" and "usage". None of them is easy to describe/define, let alone to come by in the Tor network. The concept you picked sure makes sense and seems to be derived from the data as straightforwardly as is possible with Tor. Therefore I think that the present solution is a good one. Clear naming of the value in the interface, maybe adding a tooltip, and visually grouping it with values that share the same statistical characteristics is another matter that I'll try to address in the GUI.
Above I tried to analyze more clearly which aspects the concepts "user" and "usage" have, but I failed to find one aspect that ruled them all. So I came to the conclusion that the interpretation you picked - while in principle not better or worse than others - is a good one, since it's backed by the data available (at least as much as one can hope for considering the nature of Tor).
But this is all preliminary. These concepts are not easy to nail down and I might be inclined to try to do so again when some prototype visualizations are available and I feel that things should blend together more smoothly. But only then... for now everything's fine :)


tomlurge commented on July 29, 2024

tl;dr: we have to start with a guesstimated prototypical average user and then try to find out how often that average user used Tor.

Okay, I'd like to re-enter this discussion. I have a proposal that I hope you'll find convincing and relatively straightforward.

The most important points in the discussion above are IMHO the following:

  • the "client" metric is supposed to represent numbers of distinct users. That's what one would intuitively expect and that's what clearly differentiates this metric from "bwc" (bandwidth consumed), which represents usage.
  • "users" and "usage" are both ambiguous concepts but while deriving "usage" from consumed bandwidth seems relatively straightforward, the number of "users" is derived by a rather gross extrapolation from consensus requests.

Let's start with "usage". Let's say consuming or producing information constitutes "usage". The transport of information is measured in bytes. Although this is quite straightforwardly defined, it still is a very vague metric: the bytes measured in "bwc" can stem from an email or a picture or a movie. Therefore the density of the information which these bytes represent is very different. Intuitively we might think that a movie carries as much information as 100 emails, or 1 book, or 10 essays, or 1000 pictures, or 50 comic novels. The sheer number of bytes doesn't reflect that. Also the information might be important or trivial: does the usage generated by 10 dissidents distributing scarce information from inside an oppressive regime count the same as 10 young adults socializing and chitchatting along happily? The images and connotations that we connect to the hard numbers of bandwidth consumed start to fall apart and we are left with not much more than vague assumptions, a few hints and some historic notion of growing or shrinking usage (which might as well stem from an increase in HD video consumption over Tor rather than increased intensity of Tor usage for everyday tasks). Still we use this metric and stick to the images and connotations as we won't get anything better (but maybe we should make the inherent weaknesses of this conceptualization clearer).

In contrast, the number of "users" is derived from "the number of users per day by dividing the total number of consensus requests by 10, assuming that every continuously connected client makes 10 consensus requests per day" (quoted from above), which seems shaky at best.
But again the concept of a "user" itself is heavily open to interpretation. It might be a person using Tor at a given point in time, e.g. checking on his email. But what about the same person using Tor 10 minutes later to read a website - is that the same user? What about a person checking his email 5 times a day - is that 1 user per day or 5 users? Also what about the duration of usage: does any started hour count as one hour of usage, or only hours fully spent online (every 60 minutes, sprinkled over the day, adding up to 1 hour)?
We quickly enter slippery territory, quagmires of interpretations of interpretations, if we do not base our calculation on a solid understanding of what the notion of a "user" represents. What we need first is a precise conceptualization of a "user" - and only then should we try to align the available data from Tor with that concept.

I'd say - and I'm pulling that out of my head - that a regular user is online 5 times a day for a total of 3 hours. This may seem excessive but it isn't for a) techies like us b) young folks checking Facebook every 5 minutes c) journalists that care about being snooped on by the NSA d) dissidents trying to create publicity and networking with other people. It might be too much for people that are marginalized and have only sporadic access to the internet. It is a wild guess and some empirical data would be very welcome to ground it, but at least it is a starting point, helping us to clarify what we are speaking about.

There is only one metric that can help us extract users - the number of consensus requests - and only one correlation - a continuously connected user makes a consensus request about 10 times a day. There is a problem here: a continuously connected user doesn't seem to be the most common scenario to me. Wouldn't we rather expect the typical user to connect to Tor for a specific task and then disconnect again? Embracing the concept of a user outlined above and given that a consensus request stays valid for about 2 hours, let's assume that 5 requests per day represent 1 user and 3 hours of usage. That would double the number of users and that of course is politically "interesting" but I'll leave that sort of reasoning to you ;-) We could very well just adjust the model of a user above to arrive at the established numbers.
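
A minimal sketch of the two conversions, to show where the doubling comes from. The request count of 1000 is a made-up example value, and `estimate_users` is an illustrative name rather than anything from the existing importer.

```python
def estimate_users(consensus_requests, requests_per_user_per_day):
    """Estimate daily users from the total number of consensus requests."""
    return consensus_requests / requests_per_user_per_day


requests = 1000  # made-up daily total of consensus requests

# Established rule: a continuously connected client makes 10 requests per day.
established = estimate_users(requests, 10)
# Proposed model: a typical user accounts for 5 requests per day.
proposed = estimate_users(requests, 5)

print(established, proposed, proposed / established)
```

Halving the assumed requests per user doubles the estimate, whatever the request count is.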

There is still the other problem - how to translate these daily numbers to hours, weeks and months - that was the source of this issue. I haven't even started to address it yet, but I hope to be able to solve it in the same spirit.

Hourly users IMHO cannot, as you suggest, be the same number as daily users. Intuitively that doesn't make any sense - even if the hourly numbers are based on daily metrics. The prototypical user outlined above could show the direction to take.
One possible way to calculate users per hour is as follows: if the user is on average online for 3 hours a day, then every 5 consensus requests translate to 3/24 = 1/8 user per hour. But the 5 consensus requests rather suggest that he is online 5 times a day, and therefore in the course of 5 distinct hours. 3 or 5, which is it? We could settle for 4. We are on shaky ground anyway, so why not make informed guesses? Both concepts have their merits, and both should have an influence on the outcome. That would lead to 4/24 = 1/6 user per hour.
The data contains lots of entries with only one user per day per country. Those would be rounded to 0 users per hour - which IMO makes sense: at any given hour the chance that a user was online in that country is rather small. Hourly numbers don't need to add up to daily numbers, they just need to give a correctish impression in the visualization. For a whole day, again, he is visible. All other entries would be rounded too: 1-3 round to zero, 4-8 round to 1, etc.
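
The rounding can be sketched as follows. This is only an illustration of the 1/6 proposal; the tie-breaking rule (Python's round-half-to-even on exact fractions) is an assumption chosen because it reproduces the stated ranges, and `hourly_users` is a hypothetical name.

```python
from fractions import Fraction


def hourly_users(daily_users, hours_online_per_day=4):
    """Scale a daily user count to one hour and round to a whole number.

    With the guessed 4 hours online per day this is daily_users / 6.
    Exact halves (e.g. 3/6) round to the nearest even integer.
    """
    return round(Fraction(daily_users * hours_online_per_day, 24))


for daily in range(1, 10):
    print(daily, "users per day ->", hourly_users(daily), "users per hour")
```

Under these assumptions 1-3 daily users map to 0 per hour and 4-8 map to 1.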

Monthly user numbers are tricky too. One possible way to go would be to just add them up from daily numbers. The basic numbers are absolute, not averages or means. They may be guesstimated from the total number of consensus requests, but that doesn't make them relative. Therefore they should be added up.
But we want to know how many distinct users have been online during a month. A lot of them might have used Tor every day, while others might have been online only every second day or once a week. Again this is pure speculation, not backed by any empirical data, but let's assume that the average user is online on 4 days a week.
The real number of users per month would then be computed as the average number of users per day during a month multiplied by 7/4. E.g. if there had been an average of 10 users per day (or 300 users added up over 30 days), multiplying 10 by 7/4 results in 17.5, i.e. about 18 distinct users per month.
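
The worked example can be written as a short sketch. The 4 days per week figure is the guess from the text, not an empirical value, and the function name is illustrative.

```python
def monthly_distinct_users(daily_users, days_online_per_week=4):
    """Estimate distinct users per month from a list of daily user counts.

    Takes the mean of the daily counts and scales it by
    7 / days_online_per_week, the guessed share of days a user is online.
    """
    mean_daily = sum(daily_users) / len(daily_users)
    return mean_daily * 7 / days_online_per_week


# 30 days with 10 users each (300 when simply added up).
month = [10] * 30
print(monthly_distinct_users(month))
```

For this input the estimate is 17.5, which the text rounds to about 18.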


tomlurge commented on July 29, 2024

One more thought on monthly user numbers: some users will not be using Tor for all their Internet activity, but only for certain tasks. They might use Tor for a certain period of time, like "a few days", until they have accomplished the task and then not use it for some weeks, etc. Therefore I'd multiply the monthly user number calculated above by a constant factor of 2. Again, I'm totally making up the concrete number - 2 in this case. But I think adding this multiplication operation in principle makes a lot of sense.
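
Combined with the 7/4 scaling, the extra factor would look roughly like this. Both constants are the guesses stated in the discussion, and the parameter names are hypothetical.

```python
def monthly_users(mean_daily_users, days_online_per_week=4,
                  intermittent_use_factor=2):
    """Monthly distinct-user estimate with both guessed corrections.

    7 / days_online_per_week accounts for users who are not online every
    day; intermittent_use_factor accounts for users who only use Tor for
    a few days at a time.
    """
    return mean_daily_users * 7 / days_online_per_week * intermittent_use_factor


print(monthly_users(10))
```

For an average of 10 users per day this gives 35 instead of 17.5.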

