tanas0 / gitcoin_metamorphosis

This project forked from honest-protocol/gitcoin_metamorphosis

Python 100.00%

gitcoin_metamorphosis's Introduction

Gitcoin Metamorphosis Hackathon

Calling all data scientists and data analysts! The Honest Protocol is sponsoring $15K in prizes in Gitcoin’s Metamorphosis Hackathon.

Analysis of suspicious domains

This project builds a dataset around ScamSniffer's list of domains (or any website in general) in order to shed more light on scam operations.

This dataset could eventually help to:

Categorize the scams: NFTs, ICOs, exchanges, donation campaigns, airdrops, seed phrase theft...
Find links between scam operations, for example domains that share the same IP addresses
Automatically extract further information about the scam: wallet addresses, email addresses, social media accounts...
Assist in reporting the scam: geo-locate it, identify the jurisdiction it falls under, and identify third parties to report to (hosting providers, domain name registrars...)
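
As an illustration of the "links between scam operations" idea, here is a minimal sketch that inverts a domain-to-IPs mapping and keeps only the IPs shared by several domains. The function name `group_domains_by_ip` and the sample data are made up for this example (the IPs come from documentation-reserved ranges); it is not taken from main.py.

```python
from collections import defaultdict


def group_domains_by_ip(domain_ips):
    """Map each IP address to the sorted list of domains that resolve to it.

    Only IPs shared by at least two domains are returned, since those are
    the ones that hint at a common operator or hosting setup.
    """
    ip_to_domains = defaultdict(set)
    for domain, ips in domain_ips.items():
        for ip in ips:
            ip_to_domains[ip].add(domain)
    return {
        ip: sorted(domains)
        for ip, domains in ip_to_domains.items()
        if len(domains) > 1
    }


# Hypothetical sample data, not real scam domains
sample = {
    "scam-a.example": ["203.0.113.10"],
    "scam-b.example": ["203.0.113.10", "198.51.100.7"],
    "scam-c.example": ["192.0.2.55"],
}
shared = group_domains_by_ip(sample)
print(shared)
```

A shared IP is only a weak signal on its own, because unrelated sites can sit behind the same shared hosting or CDN address.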

Data collected

The data that seems relevant to me, in a rough order of importance:

Most common words in the website
Text contents of the website
IP addresses of the domain
[ ] JavaScript tags and external scripts
[ ] Subdomains and paths of the website
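
A rough sketch of how the "most common words" field could be computed with the standard library. The stopword set below is a tiny placeholder chosen for the example and the function name `most_common_words` is hypothetical; the actual script may rely on a different tokenizer and stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real run needs a fuller one
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "your", "for"}


def most_common_words(text, n=2):
    """Return the n most frequent non-stopword tokens with their counts."""
    # Lowercase and keep alphabetic runs only, which also drops punctuation
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [token for token in tokens if token not in STOPWORDS]
    return Counter(kept).most_common(n)


sample_text = "Claim your free airdrop! Connect wallet to claim the airdrop. Claim now."
top_words = most_common_words(sample_text)
print(top_words)
```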

How to run locally

This section explains how to set up the repository and run the data extraction yourself. All outputs are saved in the output directory, including a domains.csv file with the following information:

Is the website up?
IP addresses associated with the website
Does the website have a robots.txt file? (Could be used later to identify paths, or to flag suspicious behaviour such as blocking search engine indexing)
Processed text contents of the website: text with punctuation and English stopwords removed
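
To show what could be done with a robots.txt file once it has been fetched, the sketch below lists the paths in its Disallow directives and checks whether the file blocks every crawler from the site root. Both helper names are hypothetical; `urllib.robotparser` is part of the Python standard library, and the robots.txt contents are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def disallowed_paths(robots_txt):
    """Return the non-empty paths from Disallow directives, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths


def blocks_all_crawlers(robots_txt):
    """Return True if a generic crawler may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", "/")


# Invented robots.txt contents for illustration
partial = "User-agent: *\nDisallow: /admin/\nDisallow: /claim/\n"
total = "User-agent: *\nDisallow: /\n"

print(disallowed_paths(partial))
print(blocks_all_crawlers(partial), blocks_all_crawlers(total))
```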

The HTML itself of the websites will also be dumped into the output/htmls directory. It can be used for manual inspection, archiving the suspicious website, or avoiding an HTTP request in the code.
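
The "avoiding an HTTP request" use case could look roughly like the sketch below: read the dump when it exists, and only call the network otherwise. The `<domain>.html` file naming, the `load_html` helper and the `fetch` callback are assumptions for this example; check how main.py actually names the files in output/htmls before reusing it.

```python
import tempfile
from pathlib import Path


def load_html(domain, html_dir, fetch):
    """Return cached HTML for `domain`, calling `fetch(domain)` only on a miss."""
    # Assumed naming scheme: one "<domain>.html" file per domain
    cache_file = Path(html_dir) / f"{domain}.html"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html = fetch(domain)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html


# Stand-in for a real HTTP request, so the example runs offline
calls = []


def fake_fetch(domain):
    calls.append(domain)
    return "<html>placeholder</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = load_html("scam-a.example", tmp, fake_fetch)
    second = load_html("scam-a.example", tmp, fake_fetch)

print(len(calls))  # the second call is served from the cached file
```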

# Download the list of suspicious domains
wget https://raw.githubusercontent.com/scamsniffer/scam-database/main/blacklist/domains.json

# Install Python dependencies
pip install -r requirements.txt

# Run the script
python main.py

Further improvements

Save to JSON instead of CSV (better suited to array fields such as IP addresses and most common words)
Lemmatise text contents
Scrape websites that require JavaScript to be enabled
Download JS files referenced in website and scan them for further information (API calls, wallet addresses...)
Check the reputation of the IP addresses
Retrieve the history of the domain from the Internet Archive's Wayback Machine
List technologies the website is built with using Wappalyzer
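
For the idea of scanning downloaded JS files, a possible starting point is a pair of regular expressions for Ethereum-style addresses and email addresses. The patterns and the `extract_indicators` helper are illustrative assumptions: they cover only 0x-prefixed 40-hex-character addresses and a simplified email syntax, and the sample script is made up.

```python
import re

# 0x followed by exactly 40 hex characters (Ethereum-style address)
ETH_ADDRESS = re.compile(r"\b0x[a-fA-F0-9]{40}\b")
# Simplified email pattern, good enough for a first pass
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def extract_indicators(source):
    """Return de-duplicated, sorted wallet and email candidates found in text."""
    return {
        "eth_addresses": sorted(set(ETH_ADDRESS.findall(source))),
        "emails": sorted(set(EMAIL.findall(source))),
    }


# Made-up address and script contents for illustration
address = "0x" + "ab" * 20
sample_js = f'const receiver = "{address}";\n// contact: support@scam-a.example\n'

indicators = extract_indicators(sample_js)
print(indicators)
```

Matches are only candidates: a string that looks like an address may be a contract, a token or unrelated hex data, so results would still need manual review.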
