
eth-scam-ml's Introduction

Ethereum-Fraud-Detection (forked and modded)

Detecting Fraudulent Blockchain Accounts on Ethereum with Supervised Machine Learning using the scikit-learn library


Introduction

Since 2021, more than 46,000 people have lost over $1 billion to cryptocurrency scams, nearly 60 times more than in 2018 [1]. The Federal Trade Commission (FTC) found that the top cryptocurrencies used to pay scammers were Bitcoin (70%), Tether (10%) and Ethereum (9%) [1]. The recent incident with FTX, a crypto exchange that misused more than $1 billion of clients' funds, makes it all the more important to stay vigilant when navigating the cryptocurrency world [2]. To help deter such fraud, we used supervised machine learning techniques, namely Logistic Regression, Naive Bayes, SVM, XGBoost, LightGBM, MLP, TabNet and Stacking, to detect and predict fraudulent Ethereum accounts. This could add business value by enhancing fraudulent-account detection features on crypto exchanges and crypto wallets, helping people navigate the cryptocurrency world confidently and safeguard their personal assets. Our objective was an F1 score above 90% for machine learning models predicting fraudulent accounts on the Ethereum blockchain.
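For reference, the F1 score named in the objective is the harmonic mean of precision and recall on the positive (fraud) class. The sketch below is a minimal, dependency-free illustration; the function name `f1_from_counts` and the confusion-matrix counts are hypothetical and are not results from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive (fraud) class from confusion-matrix counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)  # share of flagged accounts that are truly fraudulent
    recall = tp / (tp + fn)     # share of fraudulent accounts that were flagged
    return 2 * precision * recall / (precision + recall)


# Made-up counts for illustration only.
example_f1 = f1_from_counts(tp=90, fp=10, fn=10)
print(f"F1 = {example_f1:.2f}")
```

Because F1 ignores true negatives, it is a stricter yardstick than accuracy when fraudulent accounts are the minority class.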


Data

There are two data sources: Kaggle and Etherscan.

Kaggle

The Kaggle dataset is downloaded from https://www.kaggle.com/datasets/vagifa/ethereum-frauddetection-dataset and can be found in ./Data/address_data_k.csv

Etherscan

Data were mined from Etherscan at https://etherscan.io/accounts/label/phish-hack (the data has since been taken off Etherscan, but we saved our copy) and can be found in ./Data/address_data_e.csv

Combined without Time Series

Data from Kaggle and Etherscan are combined and can be found in ./Data/address_data_combined.csv
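How the two files were merged is not described here, so the following is only a sketch of one plausible approach: concatenate the rows and keep the first row seen for each address. The column names `Address` and `FLAG`, the helper `combine_rows`, and the de-duplication step are assumptions for illustration, and the in-memory CSV strings stand in for the real files.

```python
import csv
import io


def combine_rows(*sources, key="Address"):
    """Concatenate row dicts from several sources, keeping the first row per address."""
    seen = set()
    combined = []
    for rows in sources:
        for row in rows:
            # Hex addresses differ only by checksum casing, so compare in lowercase.
            address = row[key].strip().lower()
            if address in seen:
                continue
            seen.add(address)
            combined.append(row)
    return combined


# Tiny in-memory stand-ins for the two CSV files (made-up values).
kaggle_csv = "Address,FLAG\n0xAAA,0\n0xBBB,1\n"
etherscan_csv = "Address,FLAG\n0xbbb,1\n0xCCC,1\n"

kaggle_rows = list(csv.DictReader(io.StringIO(kaggle_csv)))
etherscan_rows = list(csv.DictReader(io.StringIO(etherscan_csv)))
combined = combine_rows(kaggle_rows, etherscan_rows)
print(len(combined), "unique addresses")
```

With the real files, the two `io.StringIO` objects would be replaced by open file handles for ./Data/address_data_k.csv and ./Data/address_data_e.csv.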

Time-Series

One key aspect that we realised was missing from the dataset was the time-series element. Although each observation in our data is a user account, the data was generated by aggregating individual transactions, which may have "flattened out" valuable information. The flow of Ethereum transactions is intrinsically time-series data, and properties such as the seasonality of transactions could be used in our model. This information was extracted using the 'tsfresh' library. The transaction data can be found in ./Data/Transaction_data and the newly extracted features in ./Data/new_ts_features_only.csv.
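To make the "flattening" concrete, the sketch below compares two made-up accounts whose average time between transactions is identical but whose timing patterns differ. It deliberately does not use tsfresh: it is a hand-rolled stand-in built on the standard library, and the timestamps and the helper `gaps` are hypothetical.

```python
from statistics import mean, pstdev


def gaps(timestamps):
    """Time differences between consecutive transactions, in timestamp order."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Two hypothetical accounts, timestamps in arbitrary units.
steady = [0, 100, 200, 300, 400]  # evenly spaced activity
bursty = [0, 1, 2, 3, 400]        # a burst followed by a long silence

for name, timestamps in (("steady", steady), ("bursty", bursty)):
    g = gaps(timestamps)
    # The mean gap is the kind of aggregate already in the dataset;
    # the spread of the gaps is the kind of signal the aggregate hides.
    print(name, "mean gap:", mean(g), "gap spread:", round(pstdev(g), 1))
```

Both accounts show the same mean gap, so an aggregate-only dataset cannot tell them apart, while a feature computed on the ordered series can.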

Combined with Time Series

Data from Kaggle and Etherscan, including the time-series features, can be found in ./Data/address_data_combined_ts.csv

Data Description

We started with a Kaggle dataset of 9841 observations. Each observation is a unique Ethereum account, and each variable is an aggregate statistic over all transactions performed by that account, such as total Ether value received or average time between transactions. The data also distinguishes between account-to-account transactions and account-to-smart-contract transactions. However, the dataset was highly imbalanced, with only 2179 out of 9841 accounts (22.14%) marked as fraud. To address the imbalance, we leveraged an API provided by Etherscan, a "Block Explorer and Analytics Platform for Ethereum", which allowed us to retrieve the transactions made by any given account address on the Ethereum blockchain, including the addresses labelled phish-hack on Etherscan. As a result, the number of fraudulent accounts in our dataset climbed to 4339, making the combined dataset less imbalanced (45.97% fraud).
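The imbalance figure for the Kaggle data can be reproduced from the counts given above. The sketch below assumes a binary label with 1 marking fraud; the helper `fraud_share` is hypothetical, and the label list is synthesised from the stated counts rather than read from the CSV.

```python
from collections import Counter


def fraud_share(labels):
    """Fraction of observations labelled as fraud (label value 1)."""
    counts = Counter(labels)
    return counts[1] / len(labels)


# Labels rebuilt from the stated counts: 2179 fraud out of 9841 accounts.
kaggle_labels = [1] * 2179 + [0] * (9841 - 2179)
kaggle_share = fraud_share(kaggle_labels)
print(f"Kaggle fraud share: {kaggle_share:.2%}")
```

The same helper applies to the combined dataset once its label column is loaded.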



eth-scam-ml's People

Contributors

eltontay, notalvin, pz808
