To design an efficient fraud detection feature for PayMo users; the following are the two steps of solution:
- Building model: Using the batch_payment data to construct efficient graph data structure
- Streaming Values: For each incoming payments searching/parsing through the model to find out the degree.
-
Representing the transactions graph: To represent the transaction the following data structures have been evaluated:
- Bloom Filters: False positive is bad
- Trees(AVL,RB): Maintaining N trees is costly with no real advantage.
- Hashmaps: O(1) access for membership queries
- Array: duh!
From the above mentioned data structures; Hashmaps provide us with a good performance while performing membership queries and to represent a graph, a hash map array is used for each node in graph.
For example consider the following graph:
The tree is now represented as
graph g =
{
'userA':{
'userC':'userC'
}
'userB':{
'userC':'userC'
}
'userC':{
'userA':'userA',
'userB':'userB'
}
}
where the graph g is a hash map array with indexes for each node and for each node as key in parent hashmap has a value of inner hashmap array containing nodes that share a edge.
batch_payment -----> Data Model
|
|
Stream at time $t_0$ -----> Fraud Detection Algorithm ----> unverified,trusted
Fairly, a simple pipeline that shall build the Data Model; which is a graph represented as hashmap. Once the data model is created, for each streaming value, the sender and receiver are identified and degree between them is determined to generate the warnings.
Effort has been made to write as many custom implementations as possible with little use of external libraries:
-
Libraries:
pip install networkx
- Python 2.7
-
Environment:
- Ubuntu 16.04
- git
- Generate data model:
For each batch_payment:
if not sender/receiver in model:
initialize node
add transaction edge between sender and receiver
update the transaction
- Fraud detection algorithm:
For each stream_payment:
findDegree(sender,receiver)
update corressponding output
Several optimizations have been made:
- A global nodes dictionary is maintained to detect unknown transactions with O(1)
- For the first degree neighborhood, a different scheme is used. The receiver is checked directly from the hashmap entry instead of checking entore graph. Guarantees O(edges in sender node) or bounded by O(max(Edges))
- O(1) insertion into the hashmap
- Building the grows linearly with the build transactions
- A Breadth first search is deployed to find the degree between the sender and receiver. Which performs: O(|V|+|E|) ~ O(max(V,E)) in entire graph
Following are the various test cases written:
- test-1-paymo-trans: Basic test provided
- test-2-unknown-transactions: Test case for unknown transactions to be identified by optimization rule
- test-3-feature2: test for degree 2 check
- test-4-feature3: extended check for degree 4
- Build step: This typically takes about 17.308 s. Measured as execution time and averaged over five runs.
- Streaming: For one second of streaming data (915 transactions) are processed within 0.0649 s. Measured as execution time over five runs.
Remarks: Additionally, the build model can be cached for the user network locally.
The algorithm employed is guaranteed to bound by the linear time. The data structures are scalable and no bottlenecks have been observed during initial testing.