PyZTrending is a library for determining "trending" data in a dataset using z scores - think #trending on Twitter but for any use case.
The PyZTrending architecture involves two types of entities: Documents and Tokens. As a familiar use case, tweets and hashtags on Twitter will be used to demonstrate the system. In the case of Twitter, imagine a Python class called Tweet:
from typing import List
from datetime import datetime

class Tweet:
    def __init__(self, tweet_text: str, timestamp: datetime, author: 'TwitterUser'):
        self.author = author
        self.tweet_text = tweet_text
        self.timestamp = timestamp
        self.hashtags: List[str] = self.extract_hashtags()

    def extract_hashtags(self) -> List[str]:
        # One possible implementation: treat every "#..." word in the tweet text as a hashtag.
        return [word for word in self.tweet_text.split() if word.startswith('#')]
In this use case, a Tweet is the Document and a single hashtag is the Token (a single document can contain multiple tokens). The idea is that a document is a single piece of data that contains trending data, just as the Tweet class contains hashtags. Crucial to the working of the system is that documents, and therefore tokens, have a real timestamp associated with them. PyZTrending uses real timestamps to determine what is trending, and it does not work well with other notions of time.
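For instance, a single Tweet document can yield several tokens, all of which share the document's timestamp. A minimal sketch, assuming the illustrative extract_hashtags above:

from datetime import datetime

# One document, two tokens - both share the document's timestamp.
tweet = Tweet("Loving the beach #love #summer2020", datetime(2020, 6, 1, 12, 0), author=None)
print(tweet.hashtags)   # ['#love', '#summer2020']
print(tweet.timestamp)  # 2020-06-01 12:00:00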
PyZTrending works in two parts. The first part is the historical data ingestion, which consumes multiple documents that are used as a "training" dataset (not to be confused with training in Machine Learning; this does not use Machine Learning). This part of the system will determine which tokens have a lot of exposure and which tokens are seldom seen.
Imagining Twitter again, the "#Trump" hashtag gets mentioned often, but perhaps "#Trump" was only trending during the first few months after Donald Trump announced his bid for presidency; after a while, "#Trump" became so commonplace that it could no longer be considered trending in the moment. Getting trending data by z scores helps us focus on tokens (hashtags) that are currently being disproportionately mentioned.
After historical data is provided, PyZTrending can look at new data and determine what's trending.
PyZTrending analyzes historical data on a rolling-window basis, controlled by two parameters: a window size and a step size, as demonstrated in the following diagram:
The diagram above is an example of twenty-seven minutes' worth of Tweet data. In total, there are 24 windows. Each window is 4 minutes long, and, just as importantly, the window step size is 1 minute. PyZTrending analyzes each window of time, from the earliest document to the last, and calculates a score for each token (e.g. hashtag) that is mentioned in that window. The first tweet in the diagram, for example, which contains the hashtags "love" and "summer2020", is part of the window that starts at some T=0 and ends at T=240 seconds. If token scores were calculated simply by the number of times the token appeared in a window, then the score for both "love" and "summer2020" in T(0,240) would be 1.
There is also a second tweet that falls into the first window. If that tweet also happened to contain the hashtag "love" but not "summer2020", then the T(0,240) score for "love" would be 2 while the T(0,240) score for "summer2020" would remain 1. Also notice that both the first and second tweets fall into more than just one window. More specifically, the first tweet falls into two windows: T(0,240) and T(60,300). Meanwhile, the second tweet falls into three windows: T(0,240), T(60,300), and T(120,360).
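To make the window bookkeeping concrete, here is a minimal, library-independent sketch of how a token's per-window counts could be tallied for the simple counting case. The tweet times and hashtags below are made up for illustration; only the window and step sizes come from the example above:

from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_SIZE = 240  # seconds
STEP_SIZE = 60     # seconds

# (seconds since T=0, hashtags) - toy stand-ins for the two tweets in the diagram
tweets: List[Tuple[int, List[str]]] = [
    (90, ["love", "summer2020"]),   # falls into T(0,240) and T(60,300)
    (130, ["love"]),                # falls into T(0,240), T(60,300), and T(120,360)
]

counts: Dict[Tuple[int, str], int] = defaultdict(int)
for t, hashtags in tweets:
    # Visit every window whose start is a multiple of STEP_SIZE and which contains t.
    start = (t // STEP_SIZE) * STEP_SIZE
    while start >= 0 and start + WINDOW_SIZE > t:
        for hashtag in hashtags:
            counts[(start, hashtag)] += 1
        start -= STEP_SIZE

print(counts[(0, "love")])        # 2
print(counts[(0, "summer2020")])  # 1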
The reason for having a step size parameter, rather than just starting a new window where the last one left off, is that it gives PyZTrending a more granular view of the data, showing how the frequency of tokens shifts gradually over time. This is especially useful for datasets that are not as densely concentrated as Twitter's. If overlapping windows are not desirable, the step size can be set equal to the window size, which makes all windows adjacent.
Once historical data ingestion is complete, PyZTrending has a statistical distribution of historical window scores that it can use to calculate z scores for new incoming data, comparing new time windows against previous ones.
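As a rough illustration of the underlying idea (not the library's internals), a token's z score in a new window can be computed from its historical per-window scores; the numbers below are invented for the sake of the example:

from statistics import mean, stdev

# Hypothetical per-window scores for "love" gathered during historical ingestion.
historical_scores = [1, 0, 2, 1, 0, 1, 3, 1]

# Score observed for "love" in a brand-new window.
new_window_score = 6

mu = mean(historical_scores)
sigma = stdev(historical_scores)
z_score = (new_window_score - mu) / sigma

# A large positive z score means the token is appearing far more often than it
# historically does, i.e. it is likely trending right now.
print(round(z_score, 2))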
When determining a trending hashtag based on a collection of tweets, one option is to simply count the number of times a hashtag appeared in a certain time window. In this case, we use a weighting function that adds to a counter holding the total number of times a hashtag was mentioned in a time window. To do that, we can use the following weighting function:
def hashtag_weight_scale(tweet: Tweet, hashtag: str):
    # Every hashtag occurrence contributes equally, so a window's score is a raw count.
    return 1
However, there could also be a scenario where not all hashtags should contribute equally to the trending algorithm. One example would be taking into account the user who wrote the tweet - and, by extension, their number of followers:
from typing import List
from datetime import datetime
from pyztrending import Trending

class Tweet:
    def __init__(self, tweet_text: str, timestamp: datetime, author):
        self.author = author  # expected to expose num_followers
        self.tweet_text = tweet_text
        self.timestamp = timestamp
        self.hashtags: List[str] = self.extract_hashtags()

def hashtag_weight_scale(tweet: Tweet, hashtag: str):
    # Hashtags tweeted by very popular authors count ten times as much.
    lots_of_followers: int = 1000000
    if tweet.author.num_followers > lots_of_followers:
        return 10
    return 1

trending = Trending(240, 60)
trending.add_class_support(Tweet, ..., hashtag_weight_scale)
In this scenario, how much a hashtag contributes to the trending score depends on the number of followers the tweet's author has.
Weight scales take two arguments:
- the document that is being looked at
- one of the tokens found in that document
PyZTrending expects there to be a class that represents some sort of document. The Tweet class from before will be used as a demonstration. PyZTrending is BYOC (Bring Your Own Class); to accomplish this, PyZTrending needs an interpreter in order to understand the class.
from typing import List
from datetime import datetime
from pyztrending import Trending

class Tweet:
    def __init__(self, tweet_text: str, timestamp: datetime):
        self.tweet_text = tweet_text
        self.timestamp = timestamp
        self.hashtags: List[str] = self.extract_hashtags()  # as sketched earlier

def interpret_tweet(tweet: Tweet):
    # Return the tokens in this document and the document's timestamp.
    return tweet.hashtags, tweet.timestamp

trending = Trending(240, 60)
trending.add_class_support(Tweet, interpret_tweet, ...)
The interpreter passed to add_class_support tells PyZTrending how to get the two pieces of data it needs from our data structure - the tokens and a datetime timestamp. Every Tweet instance, and indeed every document type that is provided, must be supported via an interpreter that returns a Tuple[List, datetime]. Keep in mind that even though tweet.hashtags is a list, each hashtag is treated separately.
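As a quick sketch of what the interpreter produces in practice (assuming an extract_hashtags that pulls "#..." words out of the tweet text, as in the illustrative implementation earlier):

tweet = Tweet("Loving the beach #love #summer2020", datetime(2020, 6, 1, 12, 0))
tokens, timestamp = interpret_tweet(tweet)

print(tokens)     # ['#love', '#summer2020'] - two separate tokens from one document
print(timestamp)  # 2020-06-01 12:00:00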
We could also use PyZTrending to interpret dictionaries/JSON data structures, given a way to identify and differentiate between them. The example below demonstrates this, and also shows how one might deal with different types of JSON structures:
from typing import Dict
from datetime import datetime
from pyztrending import Trending

def interpret_json_tweet(tweet: Dict):
    # Different JSON shapes store their hashtags under different keys.
    if 'type' in tweet:
        if tweet['type'] == 'tweet_type1':
            return tweet['hashtags'], datetime.fromtimestamp(tweet['timestamp'])
        elif tweet['type'] == 'tweet_type2':
            return tweet['hashtag_array'], datetime.fromtimestamp(tweet['timestamp'])
    raise TypeError(f"Unknown data structure for object {tweet}")

trending = Trending(240, 60)
trending.add_class_support(Dict, interpret_json_tweet, ...)
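Putting the pieces together, a single Trending instance can support both the Tweet class and plain dictionaries at the same time. A minimal sketch, reusing interpret_tweet and hashtag_weight_scale from earlier and assuming a trivial weight function for the JSON case:

def json_hashtag_weight_scale(tweet: Dict, hashtag: str):
    # Every hashtag in a JSON tweet contributes equally.
    return 1

trending = Trending(240, 60)  # 240-second windows stepped every 60 seconds
trending.add_class_support(Tweet, interpret_tweet, hashtag_weight_scale)
trending.add_class_support(Dict, interpret_json_tweet, json_hashtag_weight_scale)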