
Phishpedia's Introduction

Phishpedia: A Hybrid Deep Learning Based Approach to Visually Identify Phishing Webpages


Paper | Website | Video | Dataset | Citation

  • This is the official implementation of "Phishpedia: A Hybrid Deep Learning Based Approach to Visually Identify Phishing Webpages" (USENIX Security '21): link to paper, link to our website, link to our dataset.

  • Existing reference-based phishing detectors:

    • ❌ Lack interpretability: they only give a binary decision (legit or phish)
    • ❌ Are not robust against distribution shift, because the classifier is biased towards the phishing training set
    • ❌ Lack a large-scale phishing benchmark dataset
  • The contributions of our paper:

    • ✅ We propose a phishing identification system, Phishpedia, which has high identification accuracy and low runtime overhead, outperforming the relevant state-of-the-art identification approaches.
    • ✅ We are the first to propose a consistency-based method for phishing detection, in place of the traditional classification-based method. We investigate the consistency between a webpage's domain and its brand intention; the detected brand intention provides a visual explanation for the phishing decision.
    • ✅ Phishpedia is NOT trained on any phishing dataset, addressing the potential test-time distribution shift problem.
    • ✅ We release a 30k phishing benchmark dataset, in which each website is annotated with its URL, HTML, screenshot, and target brand: https://drive.google.com/file/d/12ypEMPRQ43zGRqHGut0Esq2z5en0DH4g/view?usp=drive_link.
    • ✅ We set up a phishing monitoring system that investigates emerging domains fed from CertStream, and we have discovered 1,704 real phishing websites, of which 1,133 are zero-days not reported by any industrial antivirus engine (VirusTotal).

Framework

Input: a URL and its screenshot
Output: Phish/Benign, and the phishing target

  • Step 1: Run the deep object detection model to get the predicted logos and input boxes (the input boxes are not used for the later prediction, only for explanation)

  • Step 2: Run the deep Siamese model

    • If the Siamese model reports no target, return Benign, None
    • Otherwise, the Siamese model reports a target: return Phish, together with the phishing target
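
A minimal sketch of this decision logic is shown below. Here detect_logos, match_brand, and get_brand_domains are hypothetical placeholders for the functionality in logo_recog.py and logo_matching.py, and the domain-consistency check reflects the description above rather than the exact code:

    import tldextract

    def phishpedia_decision(url, screenshot_path):
        # Step 1: deep object detection model finds the logo (and input box) regions.
        logo_boxes = detect_logos(screenshot_path)         # hypothetical helper
        # Step 2: deep Siamese model matches the logo against the brand targetlist.
        brand = match_brand(screenshot_path, logo_boxes)   # hypothetical helper
        if brand is None:
            return 'Benign', None
        # Consistency check: a recognized brand served from a mismatched domain is phishing.
        ext = tldextract.extract(url)
        if ext.domain + '.' + ext.suffix in get_brand_domains(brand):  # hypothetical helper
            return 'Benign', None
        return 'Phish', brand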

Project structure

- logo_recog.py: Deep Object Detection Model
- logo_matching.py: Deep Siamese Model 
- configs.yaml: Configuration file
- phishpedia.py: Main script

Instructions

Requirements:

  1. Create a local clone of Phishpedia:
git clone https://github.com/lindsey98/Phishpedia.git
  2. Set up the environment:
chmod +x ./setup.sh
export ENV_NAME="phishpedia" && ./setup.sh
conda activate phishpedia
  3. Run in bash:
python phishpedia.py --folder <folder you want to test, e.g. ./datasets/test_sites>

The testing folder should have the following structure:

test_site_1
|__ info.txt (Write the URL)
|__ shot.png (Save the screenshot)
test_site_2
|__ info.txt (Write the URL)
|__ shot.png (Save the screenshot)
......
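
For example, one test folder can be prepared like this (the URL and the screenshot file name are placeholders):

    import os
    import shutil

    site_dir = './datasets/test_sites/test_site_1'
    os.makedirs(site_dir, exist_ok=True)

    # info.txt holds the URL of the page; shot.png is its screenshot.
    with open(os.path.join(site_dir, 'info.txt'), 'w') as f:
        f.write('https://example.com/login')
    shutil.copy('my_screenshot.png', os.path.join(site_dir, 'shot.png'))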

Miscellaneous

  • In our paper, we also implement several phishing detection and identification baselines; see here
  • The logo targetlist described in our paper includes 181 brands; we have further expanded it to 277 brands in this code repository
  • For the phish discovery experiment, we obtain the feed from CertStream phish_catcher; we lower the score threshold to 40 to process more suspicious websites (readers can refer to their repo for details)
  • We use Scrapy for website crawling

Citation

If you find our work useful in your research, please consider citing our paper by:

@inproceedings{lin2021phishpedia,
  title={Phishpedia: A Hybrid Deep Learning Based Approach to Visually Identify Phishing Webpages},
  author={Lin, Yun and Liu, Ruofan and Divakaran, Dinil Mon and Ng, Jun Yang and Chan, Qing Zhou and Lu, Yiwen and Si, Yuxuan and Zhang, Fan and Dong, Jin Song},
  booktitle={30th {USENIX} Security Symposium ({USENIX} Security 21)},
  year={2021}
}

Contacts

If you have any issues running our code, you can raise an issue or send an email to [email protected], [email protected], and [email protected]


Phishpedia's Issues

Where is train_targets.txt for the Siamese retraining?

It's a really great project; I tried it to recognize some phishing sites. Now I want to add brands myself.
But when I test training the Siamese model, it reports:

FileNotFoundError: [Errno 2] No such file or directory: './src/siamese_pedia/siamese_retrain/train_targets.txt'

I downloaded expand_targetlist; where are train_targets.txt and target_dict.json?

# phishpedia/src/siamese_pedia/siamese_retrain/bit_pytorch/train.py
        train_set = GetLoader(data_root='./src/siamese_pedia/expand_targetlist',
                                            data_list='./src/siamese_pedia/siamese_retrain/train_targets.txt',
                                            label_dict='./src/siamese_pedia/siamese_retrain/target_dict.json',
                                            transform=train_tx)
        
        valid_set = GetLoader(data_root='./src/siamese_pedia/expand_targetlist',
                              data_list='./src/siamese_pedia/siamese_retrain/test_targets.txt',
                              label_dict='./src/siamese_pedia/siamese_retrain/target_dict.json',
                              transform=val_tx)

Thanks.
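
In case it helps, these two files can plausibly be regenerated from the expand_targetlist folder itself. The exact formats expected by GetLoader are an assumption here (one relative image path per line in train_targets.txt, and a brand-name-to-integer mapping in target_dict.json):

    import json
    import os

    data_root = './src/siamese_pedia/expand_targetlist'
    out_dir = './src/siamese_pedia/siamese_retrain'

    brands = sorted(d for d in os.listdir(data_root)
                    if os.path.isdir(os.path.join(data_root, d)))

    # Assumed format: brand folder name -> integer class label.
    target_dict = {brand: idx for idx, brand in enumerate(brands)}
    with open(os.path.join(out_dir, 'target_dict.json'), 'w') as f:
        json.dump(target_dict, f)

    # Assumed format: one image path per line, relative to data_root.
    with open(os.path.join(out_dir, 'train_targets.txt'), 'w') as f:
        for brand in brands:
            for img in sorted(os.listdir(os.path.join(data_root, brand))):
                if img.lower().endswith(('.png', '.jpg', '.jpeg')):
                    f.write(os.path.join(brand, img) + '\n')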

Missing keys in loading state_dict for ResNetV2 and can't load "BiT-M-R50x1.npz"

Hi, I'm trying to do the Siamese retraining.

  1. When I run the evaluate.py file, I get the following error:

Traceback (most recent call last):
  File "evaluate.py", line 155, in <module>
    model.load_state_dict(new_state_dict)
  File "/xxx/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 1483, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResNetV2:
  Missing key(s) in state_dict: "root.conv.weight", ...

And also the error of:
Unexpected key(s) in state_dict: "ean", "td", "e.fpn_lateral2.weight", "e.fpn_lateral2.bias"...

  2. The other thing is that when I load "BiT-M-R50x1.npz", I get this error:

np.load("BiT-M-R50x1.npz")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/lib/npyio.py", line 398, in load
    "Failed to interpret file %s as a pickle" % repr(file))
IOError: Failed to interpret file 'BiT-M-R50x1.npz' as a pickle

Thanks!
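
For the first error, a common cause is a mismatch between a checkpoint saved from a torch.nn.DataParallel-wrapped model (keys prefixed with "module.") and a bare model. A minimal sketch of remapping the keys before loading, assuming model is the ResNetV2 instance from the repo and the weights live under checkpoint['model'] (both are assumptions):

    import torch

    checkpoint = torch.load('path/to/checkpoint.pth.tar', map_location='cpu')
    state_dict = checkpoint.get('model', checkpoint)

    # Strip a possible "module." prefix added by DataParallel.
    state_dict = {k.replace('module.', '', 1): v for k, v in state_dict.items()}

    # model is assumed to be the ResNetV2 built from the repo's KNOWN_MODELS.
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print('missing:', missing)
    print('unexpected:', unexpected)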

Output Files

Hi,

I noticed that if I run the following command, the program will generate two files ('predict_intention.png' and 'predict.png') for each URL. Could you please explain what these two files represent, respectively?
python run.py --folder <folder you want to test e.g. phishpedia/datasets/test_sites> --results <where you want to save the results e.g. test.txt> --no_repeat

Thanks.

Optimization suggestion

Sorry for not creating a pull request, but I believe I do have an optimization suggestion.
In the file src/siamese_pedia/inference.py, at line 98, the variable logo_feat_list is a float64 array holding only bit values.
By placing:

    logo_feat_list = logo_feat_list.astype(np.uint8)

before line 98, the size of the array is reduced without affecting performance:
logo_feat_list size = 50,200,712 -> 6,275,208 bytes

This will further improve the speed of your model.
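
As a quick illustration of the saving (the array shape below is made up; the point is that uint8 uses 1 byte per element instead of 8):

    import numpy as np

    # Hypothetical 0/1 feature matrix stored as float64.
    logo_feat_list = np.random.randint(0, 2, size=(3000, 2048)).astype(np.float64)
    print(logo_feat_list.nbytes)   # 49,152,000 bytes

    logo_feat_list = logo_feat_list.astype(np.uint8)
    print(logo_feat_list.nbytes)   # 6,144,000 bytes, an 8x reduction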

Question regarding deep learning network architecture

Hi, I have noticed that your project employs a two-stage deep learning network: in the first stage, an RCNN is used to detect logos and input boxes on web pages, and in the second stage, a Siamese network is used to recognize the brands of the logos detected in the previous stage. Is that right?
I have read your paper and roughly understand the rationale behind the design of each network stage. However, I still don't quite understand why it needs to be designed as two stages rather than one. Couldn't a single-stage network handle this task? For example, if we train an object detection network (like YOLO) using only the logos in the targetlist to directly detect the brands (using the brand name as the class), and then train another network to detect input boxes simultaneously, what would be the drawbacks of this approach?

(As a beginner in this field, my question might be somewhat naive😣)

About ".pkl" file in the Logo2K+ dataset

In siamese_retrain/bit_pytorch/train.py:

        valid_set = GetLoader(
            data_root="./datasets/logo2k/Logo-2K+",
            data_list="./datasets/logo2k/val.txt",
            label_dict="./datasets/logo2k/logo2k_labeldict.pkl",
            transform=val_tx,
        )

But I can't find any .pkl files in the Logo2K+ dataset. Do I need to generate logo2k_labeldict.pkl myself?
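
If the label dict simply maps each brand folder name to an integer class index (an assumption; the format is not documented here), it can be generated from the dataset folder directly:

    import os
    import pickle

    data_root = './datasets/logo2k/Logo-2K+'
    # Note: some Logo-2K+ distributions nest brands under category folders;
    # adjust the directory walk if that is the case for your copy.
    brands = sorted(d for d in os.listdir(data_root)
                    if os.path.isdir(os.path.join(data_root, d)))
    label_dict = {brand: idx for idx, brand in enumerate(brands)}

    with open('./datasets/logo2k/logo2k_labeldict.pkl', 'wb') as f:
        pickle.dump(label_dict, f)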

Model Training Issues

Hi Lin, I have seen your paper on Phishpedia and I feel that it will help me in my research. I am working on your code and I would like to ask whether it is possible to train the two models mentioned in your paper by myself: 1) the deep object detection model; 2) the Siamese model.

If it is possible to train them by myself, could you please give me a hint how to start?

Test data

Hi Lin, there is only one test site in the test_sites folder. Where can I get more test data?

Expand targetlist

It is possible to expand the target brand list to cover new brands. Please follow these steps:

Step 1: Create a new folder under src/siamese_pedia/expand_targetlist with the new brand name
Step 2: Append {"brand name": ["domain1 for this brand", "domain2 for this brand", ...]} to src/siamese_pedia/domain_map.pkl (a pickled dictionary storing the brand-domain mappings), as in the sketch below
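
A minimal sketch of Step 2, assuming domain_map.pkl is a pickled dict of brand name to legitimate domains (the brand and domains below are placeholders):

    import pickle

    domain_map_path = './src/siamese_pedia/domain_map.pkl'

    with open(domain_map_path, 'rb') as f:
        domain_map = pickle.load(f)

    # Placeholder brand and domains for illustration.
    domain_map['newbrand'] = ['newbrand.com', 'newbrand.co.uk']

    with open(domain_map_path, 'wb') as f:
        pickle.dump(domain_map, f)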

Model evaluation issues

Hi Lin, I would like to evaluate the Phishpedia model on the dataset you provided. Hence, I would like to ask whether pipeline_eval.py is used to evaluate the whole model (Phishpedia), including accuracy and recall.

Request pip list versions

Hello, I'm trying to run the test code.

But I got an error in the torch library.
I set up my virtual environment with Python 3.8.17 / CUDA 11.0 / cuDNN 8.0.4.
The other packages were installed using "requirements.txt".

However, I got the error below:
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    ELE_MODEL, SIAMESE_THRE, SIAMESE_MODEL, LOGO_FEATS, LOGO_FILES, DOMAIN_MAP_PATH = load_config(cfg_path)
  File "/home/iis/Phishpedia/phishpedia/phishpedia_config.py", line 23, in load_config
    ELE_MODEL = config_rcnn(ELE_CFG_PATH, ELE_WEIGHTS_PATH, conf_threshold=ELE_CONFIG_THRE)
  File "/home/iis/Phishpedia/phishpedia/src/detectron2_pedia/inference.py", line 56, in config_rcnn
    predictor = DefaultPredictor(cfg)
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 288, in __init__
    checkpointer.load(cfg.MODEL.WEIGHTS)
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/detectron2/checkpoint/detection_checkpoint.py", line 52, in load
    ret = super().load(path, *args, **kwargs)
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 155, in load
    checkpoint = self._load_file(path)
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/detectron2/checkpoint/detection_checkpoint.py", line 88, in _load_file
    loaded = super()._load_file(filename)  # load native pth checkpoint
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 252, in _load_file
    return torch.load(f, map_location=torch.device("cpu"))
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

I think this error comes from a version conflict.

My installed pip packages are as below:
Package Version


absl-py 1.4.0
advertorch 0.2.3
antlr4-python3-runtime 4.9.3
anyio 4.0.0
appdirs 1.4.4
black 21.4b2
cachetools 5.3.1
certifi 2023.7.22
cffi 1.15.1
charset-normalizer 3.2.0
click 8.1.7
cloudpickle 2.2.1
contourpy 1.1.0
cryptography 38.0.4
cycler 0.11.0
detectron2 0.6+cu111
exceptiongroup 1.1.3
filelock 3.12.3
fonttools 4.42.1
future 0.18.3
fvcore 0.1.5.post20221221
google-auth 2.23.0
google-auth-oauthlib 1.0.0
grpcio 1.58.0
gspread 5.11.1
h11 0.14.0
httpcore 0.17.3
httplib2 0.22.0
httpx 0.24.1
hydra-core 1.3.2
idna 3.4
importlib-metadata 6.8.0
importlib-resources 6.0.1
iopath 0.1.9
joblib 1.3.2
kiwisolver 1.4.5
lxml 4.9.3
Markdown 3.4.4
MarkupSafe 2.1.3
matplotlib 3.7.3
mypy-extensions 1.0.0
numpy 1.24.4
oauth2client 4.1.3
oauthlib 3.2.2
omegaconf 2.3.0
opencv-python 4.8.0.76
packaging 23.1
pandas 2.0.3
pathlib 1.0.1
pathspec 0.11.2
phishpedia 0.0.0
Pillow 9.5.0
pip 23.2.1
portalocker 2.7.0
protobuf 4.24.3
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycocotools 2.0.7
pycparser 2.21
pydot 1.4.2
pyparsing 3.1.1
python-dateutil 2.8.2
python-telegram-bot 20.5
pytz 2023.3.post1
PyYAML 6.0.1
regex 2023.8.8
requests 2.31.0
requests-file 1.5.1
requests-oauthlib 1.3.1
rsa 4.9
scikit-learn 1.3.0
scipy 1.10.1
setuptools 68.2.2
six 1.16.0
sniffio 1.3.0
tabulate 0.9.0
tensorboard 2.14.0
tensorboard-data-server 0.7.1
termcolor 2.3.0
threadpoolctl 3.2.0
tldextract 3.5.0
toml 0.10.2
torch 1.9.0+cu111
torchsummary 1.5.1
torchvision 0.10.0+cu111
tqdm 4.66.1
typing_extensions 4.7.1
tzdata 2023.3
urllib3 1.26.16
Werkzeug 2.3.7
wheel 0.41.2
yacs 0.1.8
zipp 3.16.2

Could you share your package versions? Or do you know what the problem is?

Which model should I choose for Siamese training?

Now I've selected BiT-M-R50x1, is this correct?

KNOWN_MODELS = OrderedDict([
        ('BiT-M-R50x1', lambda *a, **kw: ResNetV2([3, 4, 6, 3], 1, *a, **kw)),
        ('BiT-M-R50x3', lambda *a, **kw: ResNetV2([3, 4, 6, 3], 3, *a, **kw)),
        ('BiT-M-R101x1', lambda *a, **kw: ResNetV2([3, 4, 23, 3], 1, *a, **kw)),
        ('BiT-M-R101x3', lambda *a, **kw: ResNetV2([3, 4, 23, 3], 3, *a, **kw)),
        ('BiT-M-R152x2', lambda *a, **kw: ResNetV2([3, 8, 36, 3], 2, *a, **kw)),
        ('BiT-M-R152x4', lambda *a, **kw: ResNetV2([3, 8, 36, 3], 4, *a, **kw)),
        ('BiT-S-R50x1', lambda *a, **kw: ResNetV2([3, 4, 6, 3], 1, *a, **kw)),
        ('BiT-S-R50x3', lambda *a, **kw: ResNetV2([3, 4, 6, 3], 3, *a, **kw)),
        ('BiT-S-R101x1', lambda *a, **kw: ResNetV2([3, 4, 23, 3], 1, *a, **kw)),
        ('BiT-S-R101x3', lambda *a, **kw: ResNetV2([3, 4, 23, 3], 3, *a, **kw)),
        ('BiT-S-R152x2', lambda *a, **kw: ResNetV2([3, 8, 36, 3], 2, *a, **kw)),
        ('BiT-S-R152x4', lambda *a, **kw: ResNetV2([3, 8, 36, 3], 4, *a, **kw)),
])

Thanks

Run phishpedia.py error

Hello, I ran into an issue: when I run phishpedia.py, I am told "No module named 'detectron2.config'". Actually, I can import detectron2 in Python. Moreover, I used "python -m pip install -e detectron2" to install detectron2 but faced other problems as well.
I'm puzzled by it; can you help me?

About logo classification task

Hi Lin, you mentioned the logo classification task in your paper. I would like to ask whether this classification task is meant to learn diverse feature vectors? One more question: how is this classification task implemented? Based on your paper, I think the classification model first captures the image features using ResNet and then connects a fully connected layer for classification. Is my understanding correct?
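
For reference, a minimal sketch of the architecture described in the question, i.e., a ResNet feature extractor followed by a fully connected classification head over the brand classes. This is an illustration using torchvision's ResNet-50, not the repository's exact BiT/ResNetV2 code:

    import torch
    import torch.nn as nn
    from torchvision import models

    class LogoClassifier(nn.Module):
        def __init__(self, num_brands=277):
            super().__init__()
            backbone = models.resnet50(weights=None)
            # Keep everything up to global average pooling as the feature extractor.
            self.features = nn.Sequential(*list(backbone.children())[:-1])
            # Fully connected head that maps the 2048-d feature to brand logits.
            self.head = nn.Linear(backbone.fc.in_features, num_brands)

        def forward(self, x):
            feat = self.features(x).flatten(1)   # (batch, 2048)
            return self.head(feat)               # (batch, num_brands)

    logits = LogoClassifier()(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 277])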

About the domain

Hi,

Hope you are doing well.

I have run your Phishpedia code, but now I find that you have updated it.

When I check the "phishpedia_classifier_logo" function and run the code on some benign websites, I find that "matched_domain" never includes ".com", ".cn", etc., while "tldextract.extract(url).domain + '.' + tldextract.extract(url).suffix" does include this information. So I am wondering whether there is a bug, or whether the domain.pkl also needs to be updated. I am not sure; I just want to raise the question.

Besides, I want to make sure my understanding is right: in the "phishpedia_classifier_logo" function, if the predicted target brand is not None and the domain is in the maintained domain list, then the site should be benign and the output pred_target is None rather than the real brand?

Thanks

Best

Fujiao
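
For context, a small snippet comparing the two forms mentioned above; tldextract is the library referenced in the question, and the URL is a placeholder:

    import tldextract

    url = 'https://login.example.com/signin'
    ext = tldextract.extract(url)

    print(ext.domain)                     # 'example'      (no suffix)
    print(ext.domain + '.' + ext.suffix)  # 'example.com'  (domain plus suffix)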

vt_scan error

If you are running phishpedia_main.py, you will notice that we check the VirusTotal result (the vt_scan function) once a website is reported as phishing. In the vt_scan function, I leave api_key empty for security reasons. You need to fill in your own token, i.e., create an account on VirusTotal and obtain a free API key.
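
For reference, a minimal sketch of querying the VirusTotal v3 URL endpoint with your own API key. This is a generic illustration of the public API, not the repository's vt_scan implementation, and the URL/key are placeholders:

    import base64
    import requests

    API_KEY = 'YOUR_VIRUSTOTAL_API_KEY'   # paste your own free API key here
    url = 'http://suspicious-example.com'

    # VirusTotal v3 identifies a URL by its unpadded URL-safe base64 encoding.
    url_id = base64.urlsafe_b64encode(url.encode()).decode().strip('=')
    resp = requests.get('https://www.virustotal.com/api/v3/urls/' + url_id,
                        headers={'x-apikey': API_KEY})
    print(resp.json()['data']['attributes']['last_analysis_stats'])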

idx has odd size

I am trying to run phishpedia_main.py on the alibaba example, though my runs keep failing due to a lack of memory. When inspecting the program, in siamese_pedia/inference.py, idx is set at line 105. If I understand correctly, this should hold the indexes of the top 10 brands. However, in my run, this variable has a shape of 10x2048x2048. Am I doing something wrong here? What shape is expected?
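
For comparison, this is what a top-10 selection would normally look like if each reference logo is reduced to a single similarity score before sorting (a generic sketch with made-up shapes, not the repository's inference.py code). A 10x2048x2048 result suggests the similarity was computed without reducing over the feature dimensions:

    import numpy as np

    # Hypothetical: one 2048-d feature per reference logo, plus the query logo's feature.
    logo_feat_list = np.random.rand(3000, 2048).astype(np.float32)
    query_feat = np.random.rand(2048).astype(np.float32)

    sims = logo_feat_list @ query_feat     # one score per reference logo, shape (3000,)
    idx = np.argsort(sims)[-10:][::-1]     # indices of the 10 most similar logos
    print(idx.shape)                       # (10,)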

About evaluating the Phishpedia model

Hi, I would like to evaluate the Phishpedia model as a baseline on our dataset. Do I need to train it on my dataset first, or can I directly use the model you uploaded to evaluate the accuracy, etc.? Thanks

AttributeError: module 'PIL.Image' has no attribute 'LINEAR'

Step 1: Open .../anaconda3/envs/.../lib/python3.8/site-packages/detectron2/data/transforms/transform.py in an editor and go to line 46.
Step 2: Change

 def __init__(self, src_rect, output_size, interp=Image.LINEAR, fill=0):

to

 def __init__(self, src_rect, output_size, interp=Image.BILINEAR, fill=0):
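
Alternatively, if you prefer not to edit the installed detectron2 source, restoring the removed alias before detectron2 is imported has the same effect (Image.LINEAR was an alias of Image.BILINEAR that newer Pillow releases removed):

    import PIL.Image

    # Restore the alias removed in newer Pillow versions.
    if not hasattr(PIL.Image, 'LINEAR'):
        PIL.Image.LINEAR = PIL.Image.BILINEAR

    # Import detectron2 only after the patch so transform.py sees Image.LINEAR.
    from detectron2.config import get_cfg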

Training problem on the siamese model

Hi, hope you are doing well.

I have a question about retraining all the models. From my understanding, there should be two models: 1) the RCNN model and 2) the Siamese model.

  1. Use the benign 30k dataset to train the layout classifier and get the RCNN model.
  2. Training the Siamese model takes two steps: a) train on Logo-2K+, and b) fine-tune on the 277-brand targetlist.
    2a) First use Logo-2K+ to train and get bit.pth.tar; in this step, the parameters are [bit_pretrained_dir='.', dataset='logo_2k', logdir='log', model='BiT-M-R50x1', name='log_log', save=True, weights_path=None].
    2b) After the former step, how can I use the obtained model to get the Siamese model on the 277-brand targetlist? Can I just set the parameters as [bit_pretrained_dir='.', dataset='targetlist', logdir='log', model='BiT-M-R50x1', name='log_log', save=True, weights_path='log/log_log/bit.pth.tar']? When I do, I always get an error that says "size mismatch for module.head.conv.weight: copying a param with shape torch.Size([2340, 2048, 1, 1]) from checkpoint, the shape in current model is torch.Size([277, 2048, 1, 1]).". So could you please tell me how to do 2b)?

I am confused about the file "train_siamese/bit_pytorch_train/train.py" when it loads weights on lines 205-228. If we have trained 2a) on Logo-2K+ and obtained the weights bit.pth.tar, we then delete checkpoint['model']['module.head.conv.weight'] and checkpoint['model']['module.head.conv.bias']; however, the next "try" step then reloads the initial checkpoint again... Do I miss something? Thanks.

If there are any issues in my understanding of the training process, feel free to tell me. Thanks. Have a good day.

Fujiao
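
For what it's worth, step 2b) above usually amounts to dropping the mismatched 2340-way head from the Logo-2K+ checkpoint and loading the remaining weights non-strictly into the 277-class model. A sketch under those assumptions (the checkpoint layout follows the error message quoted above, and model is assumed to be the DataParallel-wrapped 277-class ResNetV2 built from the repo's KNOWN_MODELS):

    import torch

    checkpoint = torch.load('log/log_log/bit.pth.tar', map_location='cpu')
    state_dict = checkpoint['model']

    # Drop the classification head trained for the 2340 Logo-2K+ classes.
    for key in ['module.head.conv.weight', 'module.head.conv.bias']:
        state_dict.pop(key, None)

    # strict=False keeps the freshly initialized 277-way head and loads the backbone.
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print('missing:', missing)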
