
Phishpedia's Introduction

Phishpedia: A Hybrid Deep Learning Based Approach to Visually Identify Phishing Webpages


Paper | Website | Video | Dataset | Citation

  • This is the official implementation of "Phishpedia: A Hybrid Deep Learning Based Approach to Visually Identify Phishing Webpages" (USENIX Security '21): link to paper, link to our website, link to our dataset.

  • Existing reference-based phishing detectors:

    • ❌ Lack interpretability: they only give a binary decision (legit or phish)
    • ❌ Are not robust against distribution shift, because the classifier is biased towards the phishing training set
    • ❌ Lack a large-scale phishing benchmark dataset
  • The contributions of our paper:

    • ✅ We propose a phishing identification system, Phishpedia, which has high identification accuracy and low runtime overhead, outperforming the relevant state-of-the-art identification approaches.
    • ✅ We are the first to propose a consistency-based method for phishing detection, in place of the traditional classification-based method. We investigate the consistency between a webpage's domain and its brand intention; the detected brand intention provides a visual explanation for the phishing decision.
    • ✅ Phishpedia is NOT trained on any phishing dataset, addressing the potential test-time distribution shift problem.
    • ✅ We release a 30k phishing benchmark dataset, in which each website is annotated with its URL, HTML, screenshot, and target brand: https://drive.google.com/file/d/12ypEMPRQ43zGRqHGut0Esq2z5en0DH4g/view?usp=drive_link.
    • ✅ We set up a phishing monitoring system that investigates emerging domains fed from CertStream, and we have discovered 1,704 real phishing websites, of which 1,133 are zero-days not reported by any industrial antivirus engine (VirusTotal).

Framework

Input: a URL and its screenshot
Output: Phish/Benign, and the phishing target

  • Step 1: Run the deep object detection model to get the predicted logos and input boxes (the input boxes are not used for the later prediction, only for explanation)

  • Step 2: Run the deep Siamese model

    • If the Siamese model reports no target, return Benign, None
    • Otherwise, the Siamese model reports a target: return Phish, together with the phishing target
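
A minimal sketch of this decision logic is shown below. Here detect_logos, match_brand, and get_brand_domains are hypothetical placeholders for the functionality in logo_recog.py and logo_matching.py, and the domain-consistency check reflects the description above rather than the exact code:

    import tldextract

    def phishpedia_decision(url, screenshot_path):
        # Step 1: deep object detection model finds the logo (and input box) regions.
        logo_boxes = detect_logos(screenshot_path)         # hypothetical helper
        # Step 2: deep Siamese model matches the logo against the brand targetlist.
        brand = match_brand(screenshot_path, logo_boxes)   # hypothetical helper
        if brand is None:
            return 'Benign', None
        # Consistency check: a recognized brand served from a mismatched domain is phishing.
        ext = tldextract.extract(url)
        if ext.domain + '.' + ext.suffix in get_brand_domains(brand):  # hypothetical helper
            return 'Benign', None
        return 'Phish', brand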

Project structure

- logo_recog.py: Deep Object Detection Model
- logo_matching.py: Deep Siamese Model 
- configs.yaml: Configuration file
- phishpedia.py: Main script

Instructions

Requirements:

  1. Create a local clone of Phishpedia:
git clone https://github.com/lindsey98/Phishpedia.git
  2. Set up the environment:
chmod +x ./setup.sh
export ENV_NAME="phishpedia" && ./setup.sh
conda activate phishpedia
  3. Run in bash:
python phishpedia.py --folder <folder you want to test, e.g. ./datasets/test_sites>

The testing folder should have the following structure:

test_site_1
|__ info.txt (Write the URL)
|__ shot.png (Save the screenshot)
test_site_2
|__ info.txt (Write the URL)
|__ shot.png (Save the screenshot)
......
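
For example, one test folder can be prepared like this (the URL and the screenshot file name are placeholders):

    import os
    import shutil

    site_dir = './datasets/test_sites/test_site_1'
    os.makedirs(site_dir, exist_ok=True)

    # info.txt holds the URL of the page; shot.png is its screenshot.
    with open(os.path.join(site_dir, 'info.txt'), 'w') as f:
        f.write('https://example.com/login')
    shutil.copy('my_screenshot.png', os.path.join(site_dir, 'shot.png'))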

Miscellaneous

  • In our paper, we also implement several phishing detection and identification baselines; see here
  • The logo targetlist described in our paper includes 181 brands; we have further expanded it to 277 brands in this code repository
  • For the phish discovery experiment, we obtain the feed from CertStream phish_catcher; we lower the score threshold to 40 to process more suspicious websites (readers can refer to their repo for details)
  • We use Scrapy for website crawling

Citation

If you find our work useful in your research, please consider citing our paper by:

@inproceedings{lin2021phishpedia,
  title={Phishpedia: A Hybrid Deep Learning Based Approach to Visually Identify Phishing Webpages},
  author={Lin, Yun and Liu, Ruofan and Divakaran, Dinil Mon and Ng, Jun Yang and Chan, Qing Zhou and Lu, Yiwen and Si, Yuxuan and Zhang, Fan and Dong, Jin Song},
  booktitle={30th {USENIX} Security Symposium ({USENIX} Security 21)},
  year={2021}
}

Contacts

If you have any issues running our code, you can raise an issue or send an email to [email protected], [email protected], and [email protected]


Phishpedia's Issues

Where is train_targets.txt for the Siamese retraining?

It's a really great project; I tried it to recognize some phishing sites. Now I want to add brands myself.
But when I test training the Siamese model, it reports:

FileNotFoundError: [Errno 2] No such file or directory: './src/siamese_pedia/siamese_retrain/train_targets.txt'

I downloaded expand_targetlist; where are train_targets.txt and target_dict.json?

# phishpedia/src/siamese_pedia/siamese_retrain/bit_pytorch/train.py
        train_set = GetLoader(data_root='./src/siamese_pedia/expand_targetlist',
                                            data_list='./src/siamese_pedia/siamese_retrain/train_targets.txt',
                                            label_dict='./src/siamese_pedia/siamese_retrain/target_dict.json',
                                            transform=train_tx)
        
        valid_set = GetLoader(data_root='./src/siamese_pedia/expand_targetlist',
                              data_list='./src/siamese_pedia/siamese_retrain/test_targets.txt',
                              label_dict='./src/siamese_pedia/siamese_retrain/target_dict.json',
                              transform=val_tx)

Thanks.
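
In case it helps, these two files can plausibly be regenerated from the expand_targetlist folder itself. The exact formats expected by GetLoader are an assumption here (one relative image path per line in train_targets.txt, and a brand-name-to-integer mapping in target_dict.json):

    import json
    import os

    data_root = './src/siamese_pedia/expand_targetlist'
    out_dir = './src/siamese_pedia/siamese_retrain'

    brands = sorted(d for d in os.listdir(data_root)
                    if os.path.isdir(os.path.join(data_root, d)))

    # Assumed format: brand folder name -> integer class label.
    target_dict = {brand: idx for idx, brand in enumerate(brands)}
    with open(os.path.join(out_dir, 'target_dict.json'), 'w') as f:
        json.dump(target_dict, f)

    # Assumed format: one image path per line, relative to data_root.
    with open(os.path.join(out_dir, 'train_targets.txt'), 'w') as f:
        for brand in brands:
            for img in sorted(os.listdir(os.path.join(data_root, brand))):
                if img.lower().endswith(('.png', '.jpg', '.jpeg')):
                    f.write(os.path.join(brand, img) + '\n')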

Missing keys in loading state_dict for ResNetV2 and can't load "BiT-M-R50x1.npz"

Hi, I'm trying to do the Siamese retraining.

  1. When I run the evaluate.py file, I get the following error:

Traceback (most recent call last):
  File "evaluate.py", line 155, in <module>
    model.load_state_dict(new_state_dict)
  File "/xxx/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 1483, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResNetV2:
  Missing key(s) in state_dict: "root.conv.weight", ...

And also the error of:
Unexpected key(s) in state_dict: "ean", "td", "e.fpn_lateral2.weight", "e.fpn_lateral2.bias"...

  2. The other thing is that when I load "BiT-M-R50x1.npz", I get this error:

np.load("BiT-M-R50x1.npz")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/lib/npyio.py", line 398, in load
    "Failed to interpret file %s as a pickle" % repr(file))
IOError: Failed to interpret file 'BiT-M-R50x1.npz' as a pickle

Thanks!
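
For the first error, a common cause is a mismatch between a checkpoint saved from a torch.nn.DataParallel-wrapped model (keys prefixed with "module.") and a bare model. A minimal sketch of remapping the keys before loading, assuming model is the ResNetV2 instance from the repo and the weights live under checkpoint['model'] (both are assumptions):

    import torch

    checkpoint = torch.load('path/to/checkpoint.pth.tar', map_location='cpu')
    state_dict = checkpoint.get('model', checkpoint)

    # Strip a possible "module." prefix added by DataParallel.
    state_dict = {k.replace('module.', '', 1): v for k, v in state_dict.items()}

    # model is assumed to be the ResNetV2 built from the repo's KNOWN_MODELS.
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print('missing:', missing)
    print('unexpected:', unexpected)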

Output Files

Hi,

I noticed that if I run the following command, the program will generate two files ('predict_intention.png' and 'predict.png') for each URL. Could you please explain what these two files represent, respectively?
python run.py --folder <folder you want to test e.g. phishpedia/datasets/test_sites> --results <where you want to save the results e.g. test.txt> --no_repeat

Thanks.

Optimization suggestion

Sorry for not creating a pull request, but I believe I do have an optimization suggestion.
In the file src/siamese_pedia/inference.py, at line 98, the variable logo_feat_list is a float64 array holding only bit values.
By placing:

    logo_feat_list = logo_feat_list.astype(np.uint8)

before line 98, the size of the array is reduced without affecting performance:
logo_feat_list size = 50,200,712 -> 6,275,208 bytes

This will further improve the speed of your model.
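
As a quick illustration of the saving (the array shape below is made up; the point is that uint8 uses 1 byte per element instead of 8):

    import numpy as np

    # Hypothetical 0/1 feature matrix stored as float64.
    logo_feat_list = np.random.randint(0, 2, size=(3000, 2048)).astype(np.float64)
    print(logo_feat_list.nbytes)   # 49,152,000 bytes

    logo_feat_list = logo_feat_list.astype(np.uint8)
    print(logo_feat_list.nbytes)   # 6,144,000 bytes, an 8x reduction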

Question regarding deep learning network architecture

Hi, I have noticed that your project employs a two-stage deep learning network: in the first stage, an RCNN is used to detect logos and input boxes on web pages, and in the second stage, a Siamese network is used to recognize the brands of the logos detected in the previous stage. Is that right?
I have read your paper and roughly understand the rationale behind the design of each network stage. However, I still don't quite understand why it needs to be designed as two stages rather than one. Couldn't a single-stage network handle this task? For example, if we train an object detection network (like YOLO) using only the logos in the targetlist to directly detect the brands (using the brand name as the class), and then train another network to detect input boxes simultaneously, what would be the drawbacks of this approach?

(As a beginner in this field, my question might be somewhat naive😣)

About ".pkl" file in the Logo2K+ dataset

In siamese_retrain/bit_pytorch/train.py:

        valid_set = GetLoader(
            data_root="./datasets/logo2k/Logo-2K+",
            data_list="./datasets/logo2k/val.txt",
            label_dict="./datasets/logo2k/logo2k_labeldict.pkl",
            transform=val_tx,
        )

But I can't find any .pkl files in the Logo2K+ dataset. Do I need to generate logo2k_labeldict.pkl myself?
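
If the label dict simply maps each brand folder name to an integer class index (an assumption; the format is not documented here), it can be generated from the dataset folder directly:

    import os
    import pickle

    data_root = './datasets/logo2k/Logo-2K+'
    # Note: some Logo-2K+ distributions nest brands under category folders;
    # adjust the directory walk if that is the case for your copy.
    brands = sorted(d for d in os.listdir(data_root)
                    if os.path.isdir(os.path.join(data_root, d)))
    label_dict = {brand: idx for idx, brand in enumerate(brands)}

    with open('./datasets/logo2k/logo2k_labeldict.pkl', 'wb') as f:
        pickle.dump(label_dict, f)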

Model Training Issues

Hi Lin, I have seen your paper on Phishpedia and I feel that it will help me in my research. I am working on your code and I would like to ask whether it is possible to train the two models mentioned in your paper by myself: 1) the deep object detection model; 2) the Siamese model.

If it is possible to train them by myself, could you please give me a hint how to start?

Test data

Hi Lin, there is only one test site in the test_sites folder. Where can I get more test data?

Expand targetlist

It is possible to expand the target brand list to cover new brands. Please follow these steps:

Step 1: Create a new folder under src/siamese_pedia/expand_targetlist with the new brand name
Step 2: Append {"brand name": ["domain1 for this brand", "domain2 for this brand", ...]} to src/siamese_pedia/domain_map.pkl (a pickled dictionary storing the brand-domain mappings), as in the sketch below
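
A minimal sketch of Step 2, assuming domain_map.pkl is a pickled dict of brand name to legitimate domains (the brand and domains below are placeholders):

    import pickle

    domain_map_path = './src/siamese_pedia/domain_map.pkl'

    with open(domain_map_path, 'rb') as f:
        domain_map = pickle.load(f)

    # Placeholder brand and domains for illustration.
    domain_map['newbrand'] = ['newbrand.com', 'newbrand.co.uk']

    with open(domain_map_path, 'wb') as f:
        pickle.dump(domain_map, f)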

Model evaluation issues

Hi Lin, I would like to evaluate the Phishpedia model on the dataset you provided. Hence, I would like to ask whether pipeline_eval.py is used to evaluate the whole model (Phishpedia), including accuracy and recall.

Request pip list versions

Hello, I'm trying to run the test code.

But I got an error in the torch library.
I set up my virtual environment with Python 3.8.17 / CUDA 11.0 / cuDNN 8.0.4.
The other packages were installed using "requirements.txt".

However, I got the error below:
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    ELE_MODEL, SIAMESE_THRE, SIAMESE_MODEL, LOGO_FEATS, LOGO_FILES, DOMAIN_MAP_PATH = load_config(cfg_path)
  File "/home/iis/Phishpedia/phishpedia/phishpedia_config.py", line 23, in load_config
    ELE_MODEL = config_rcnn(ELE_CFG_PATH, ELE_WEIGHTS_PATH, conf_threshold=ELE_CONFIG_THRE)
  File "/home/iis/Phishpedia/phishpedia/src/detectron2_pedia/inference.py", line 56, in config_rcnn
    predictor = DefaultPredictor(cfg)
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 288, in __init__
    checkpointer.load(cfg.MODEL.WEIGHTS)
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/detectron2/checkpoint/detection_checkpoint.py", line 52, in load
    ret = super().load(path, *args, **kwargs)
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 155, in load
    checkpoint = self._load_file(path)
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/detectron2/checkpoint/detection_checkpoint.py", line 88, in _load_file
    loaded = super()._load_file(filename)  # load native pth checkpoint
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 252, in _load_file
    return torch.load(f, map_location=torch.device("cpu"))
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/iis/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

I think this error comes from a version conflict.

My installed pip packages are as below:
Package Version


absl-py 1.4.0
advertorch 0.2.3
antlr4-python3-runtime 4.9.3
anyio 4.0.0
appdirs 1.4.4
black 21.4b2
cachetools 5.3.1
certifi 2023.7.22
cffi 1.15.1
charset-normalizer 3.2.0
click 8.1.7
cloudpickle 2.2.1
contourpy 1.1.0
cryptography 38.0.4
cycler 0.11.0
detectron2 0.6+cu111
exceptiongroup 1.1.3
filelock 3.12.3
fonttools 4.42.1
future 0.18.3
fvcore 0.1.5.post20221221
google-auth 2.23.0
google-auth-oauthlib 1.0.0
grpcio 1.58.0
gspread 5.11.1
h11 0.14.0
httpcore 0.17.3
httplib2 0.22.0
httpx 0.24.1
hydra-core 1.3.2
idna 3.4
importlib-metadata 6.8.0
importlib-resources 6.0.1
iopath 0.1.9
joblib 1.3.2
kiwisolver 1.4.5
lxml 4.9.3
Markdown 3.4.4
MarkupSafe 2.1.3
matplotlib 3.7.3
mypy-extensions 1.0.0
numpy 1.24.4
oauth2client 4.1.3
oauthlib 3.2.2
omegaconf 2.3.0
opencv-python 4.8.0.76
packaging 23.1
pandas 2.0.3
pathlib 1.0.1
pathspec 0.11.2
phishpedia 0.0.0
Pillow 9.5.0
pip 23.2.1
portalocker 2.7.0
protobuf 4.24.3
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycocotools 2.0.7
pycparser 2.21
pydot 1.4.2
pyparsing 3.1.1
python-dateutil 2.8.2
python-telegram-bot 20.5
pytz 2023.3.post1
PyYAML 6.0.1
regex 2023.8.8
requests 2.31.0
requests-file 1.5.1
requests-oauthlib 1.3.1
rsa 4.9
scikit-learn 1.3.0
scipy 1.10.1
setuptools 68.2.2
six 1.16.0
sniffio 1.3.0
tabulate 0.9.0
tensorboard 2.14.0
tensorboard-data-server 0.7.1
termcolor 2.3.0
threadpoolctl 3.2.0
tldextract 3.5.0
toml 0.10.2
torch 1.9.0+cu111
torchsummary 1.5.1
torchvision 0.10.0+cu111
tqdm 4.66.1
typing_extensions 4.7.1
tzdata 2023.3
urllib3 1.26.16
Werkzeug 2.3.7
wheel 0.41.2
yacs 0.1.8
zipp 3.16.2

Could you share your package versions? Or do you know what the problem is?

Which model should I choose for Siamese training?

Now I've selected BiT-M-R50x1, is this correct?

KNOWN_MODELS = OrderedDict([
        ('BiT-M-R50x1', lambda *a, **kw: ResNetV2([3, 4, 6, 3], 1, *a, **kw)),
        ('BiT-M-R50x3', lambda *a, **kw: ResNetV2([3, 4, 6, 3], 3, *a, **kw)),
        ('BiT-M-R101x1', lambda *a, **kw: ResNetV2([3, 4, 23, 3], 1, *a, **kw)),
        ('BiT-M-R101x3', lambda *a, **kw: ResNetV2([3, 4, 23, 3], 3, *a, **kw)),
        ('BiT-M-R152x2', lambda *a, **kw: ResNetV2([3, 8, 36, 3], 2, *a, **kw)),
        ('BiT-M-R152x4', lambda *a, **kw: ResNetV2([3, 8, 36, 3], 4, *a, **kw)),
        ('BiT-S-R50x1', lambda *a, **kw: ResNetV2([3, 4, 6, 3], 1, *a, **kw)),
        ('BiT-S-R50x3', lambda *a, **kw: ResNetV2([3, 4, 6, 3], 3, *a, **kw)),
        ('BiT-S-R101x1', lambda *a, **kw: ResNetV2([3, 4, 23, 3], 1, *a, **kw)),
        ('BiT-S-R101x3', lambda *a, **kw: ResNetV2([3, 4, 23, 3], 3, *a, **kw)),
        ('BiT-S-R152x2', lambda *a, **kw: ResNetV2([3, 8, 36, 3], 2, *a, **kw)),
        ('BiT-S-R152x4', lambda *a, **kw: ResNetV2([3, 8, 36, 3], 4, *a, **kw)),
])

Thanks

Run phishpedia.py error

Hello, I ran into an issue: when I run phishpedia.py, I am told "No module named 'detectron2.config'". Actually, I can import detectron2 in Python. Moreover, I used "python -m pip install -e detectron2" to install detectron2 but faced other problems as well.
I'm puzzled by it; can you help me?

About logo classification task

Hi Lin, you mentioned the logo classification task in your paper. I would like to ask whether this classification task is meant to learn diverse feature vectors? One more question: how is this classification task implemented? Based on your paper, I think the classification model first captures the image features using ResNet and then connects a fully connected layer for classification. Is my understanding correct?
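
For reference, a minimal sketch of the architecture described in the question, i.e., a ResNet feature extractor followed by a fully connected classification head over the brand classes. This is an illustration using torchvision's ResNet-50, not the repository's exact BiT/ResNetV2 code:

    import torch
    import torch.nn as nn
    from torchvision import models

    class LogoClassifier(nn.Module):
        def __init__(self, num_brands=277):
            super().__init__()
            backbone = models.resnet50(weights=None)
            # Keep everything up to global average pooling as the feature extractor.
            self.features = nn.Sequential(*list(backbone.children())[:-1])
            # Fully connected head that maps the 2048-d feature to brand logits.
            self.head = nn.Linear(backbone.fc.in_features, num_brands)

        def forward(self, x):
            feat = self.features(x).flatten(1)   # (batch, 2048)
            return self.head(feat)               # (batch, num_brands)

    logits = LogoClassifier()(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 277])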

About the domain

Hi,

Hope you are doing well.

I have run your Phishpedia code, but now I find that you have updated it.

When I check the "phishpedia_classifier_logo" function and run the code on some benign websites, I find that "matched_domain" never includes ".com", ".cn", etc., while "tldextract.extract(url).domain + '.' + tldextract.extract(url).suffix" does include this information. So I am wondering whether there is a bug, or whether the domain.pkl also needs to be updated. I am not sure; I just want to raise the question.

Besides, I want to make sure my understanding is right: in the "phishpedia_classifier_logo" function, if the predicted target brand is not None and the domain is in the maintained domain list, then the site should be benign and the output pred_target is None rather than the real brand?

Thanks

Best

Fujiao
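
For context, a small snippet comparing the two forms mentioned above; tldextract is the library referenced in the question, and the URL is a placeholder:

    import tldextract

    url = 'https://login.example.com/signin'
    ext = tldextract.extract(url)

    print(ext.domain)                     # 'example'      (no suffix)
    print(ext.domain + '.' + ext.suffix)  # 'example.com'  (domain plus suffix)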

vt_scan error

If you are running phishpedia_main.py, you will notice that we check the VirusTotal result (the vt_scan function) once a website is reported as phishing. In the vt_scan function, I leave api_key empty for security reasons. You need to fill in your own token, i.e., create an account on VirusTotal and obtain a free API key.
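
For reference, a minimal sketch of querying the VirusTotal v3 URL endpoint with your own API key. This is a generic illustration of the public API, not the repository's vt_scan implementation, and the URL/key are placeholders:

    import base64
    import requests

    API_KEY = 'YOUR_VIRUSTOTAL_API_KEY'   # paste your own free API key here
    url = 'http://suspicious-example.com'

    # VirusTotal v3 identifies a URL by its unpadded URL-safe base64 encoding.
    url_id = base64.urlsafe_b64encode(url.encode()).decode().strip('=')
    resp = requests.get('https://www.virustotal.com/api/v3/urls/' + url_id,
                        headers={'x-apikey': API_KEY})
    print(resp.json()['data']['attributes']['last_analysis_stats'])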

idx has odd size

I am trying to run phishpedia_main.py on the alibaba example, though my runs keep failing due to a lack of memory. When inspecting the program, in siamese_pedia/inference.py, idx is set at line 105. If I understand correctly, this should hold the indexes of the top 10 brands. However, in my run, this variable has a shape of 10x2048x2048. Am I doing something wrong here? What shape is expected?
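
For comparison, this is what a top-10 selection would normally look like if each reference logo is reduced to a single similarity score before sorting (a generic sketch with made-up shapes, not the repository's inference.py code). A 10x2048x2048 result suggests the similarity was computed without reducing over the feature dimensions:

    import numpy as np

    # Hypothetical: one 2048-d feature per reference logo, plus the query logo's feature.
    logo_feat_list = np.random.rand(3000, 2048).astype(np.float32)
    query_feat = np.random.rand(2048).astype(np.float32)

    sims = logo_feat_list @ query_feat     # one score per reference logo, shape (3000,)
    idx = np.argsort(sims)[-10:][::-1]     # indices of the 10 most similar logos
    print(idx.shape)                       # (10,)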

About evaluating the Phishpedia model

Hi, I would like to evaluate the Phishpedia model as a baseline on our dataset. Do I need to train it on my dataset first, or can I directly use the model you uploaded to evaluate the accuracy, etc.? Thanks

AttributeError: module 'PIL.Image' has no attribute 'LINEAR'

Step 1: Open .../anaconda3/envs/.../lib/python3.8/site-packages/detectron2/data/transforms/transform.py in an editor and go to line 46.
Step 2: Change

 def __init__(self, src_rect, output_size, interp=Image.LINEAR, fill=0):

to

 def __init__(self, src_rect, output_size, interp=Image.BILINEAR, fill=0):
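
Alternatively, if you prefer not to edit the installed detectron2 source, restoring the removed alias before detectron2 is imported has the same effect (Image.LINEAR was an alias of Image.BILINEAR that newer Pillow releases removed):

    import PIL.Image

    # Restore the alias removed in newer Pillow versions.
    if not hasattr(PIL.Image, 'LINEAR'):
        PIL.Image.LINEAR = PIL.Image.BILINEAR

    # Import detectron2 only after the patch so transform.py sees Image.LINEAR.
    from detectron2.config import get_cfg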

Training problem on the siamese model

Hi, hope you are doing well.

I have a question about retraining all the models. From my understanding, there should be two models: 1) the RCNN model and 2) the Siamese model.

  1. Use the benign 30k dataset to train the layout classifier and get the RCNN model.
  2. Training the Siamese model takes two steps: a) train on Logo-2K+, and b) fine-tune on the 277-brand targetlist.
    2a) First use Logo-2K+ to train and get bit.pth.tar; in this step, the parameters are [bit_pretrained_dir='.', dataset='logo_2k', logdir='log', model='BiT-M-R50x1', name='log_log', save=True, weights_path=None].
    2b) After the former step, how can I use the obtained model to get the Siamese model on the 277-brand targetlist? Can I just set the parameters as [bit_pretrained_dir='.', dataset='targetlist', logdir='log', model='BiT-M-R50x1', name='log_log', save=True, weights_path='log/log_log/bit.pth.tar']? When I do, I always get an error that says "size mismatch for module.head.conv.weight: copying a param with shape torch.Size([2340, 2048, 1, 1]) from checkpoint, the shape in current model is torch.Size([277, 2048, 1, 1]).". So could you please tell me how to do 2b)?

I am confused about the file "train_siamese/bit_pytorch_train/train.py" when it loads weights on lines 205-228. If we have trained 2a) on Logo-2K+ and obtained the weights bit.pth.tar, we then delete checkpoint['model']['module.head.conv.weight'] and checkpoint['model']['module.head.conv.bias']; however, the next "try" step then reloads the initial checkpoint again... Do I miss something? Thanks.

If there are any issues in my understanding of the training process, feel free to tell me. Thanks. Have a good day.

Fujiao
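
For what it's worth, step 2b) above usually amounts to dropping the mismatched 2340-way head from the Logo-2K+ checkpoint and loading the remaining weights non-strictly into the 277-class model. A sketch under those assumptions (the checkpoint layout follows the error message quoted above, and model is assumed to be the DataParallel-wrapped 277-class ResNetV2 built from the repo's KNOWN_MODELS):

    import torch

    checkpoint = torch.load('log/log_log/bit.pth.tar', map_location='cpu')
    state_dict = checkpoint['model']

    # Drop the classification head trained for the 2340 Logo-2K+ classes.
    for key in ['module.head.conv.weight', 'module.head.conv.bias']:
        state_dict.pop(key, None)

    # strict=False keeps the freshly initialized 277-way head and loads the backbone.
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print('missing:', missing)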
