
www24_threatadvphish's Introduction

Overview

This repository contains the resources for the paper "“Are Adversarial Phishing Webpages a Threat in Reality?” Understanding the Users’ Perception of Adversarial Webpages", accepted to the Security track of TheWebConf'24. We also created a webpage that provides more information: ThreatAdvPhish

If you use any of our resources, you are kindly invited to cite our paper:

@inproceedings{yuan2024are,
  title={{“Are Adversarial Phishing Webpages a Threat in Reality?” Understanding the Users’ Perception of Adversarial Webpages}},
  author={Yuan, Ying and Hao, Qingying and Apruzzese, Giovanni and Conti, Mauro and Wang, Gang},
  booktitle={ACM International World Wide Web Conference (TheWebConf)},
  year={2024}
}

Organization

This repository includes three main folders:

  • ML-PWD: This folder includes the implementation of our custom ML-based phishing website detector and scripts for generating APW-Lab webpages.
  • USER STUDY: This folder contains all user-study material: one .pdf file showing what our survey looks like and one subfolder with all webpages used in our study.
  • Analysis: This folder contains the components needed to replicate all analyses: the codebook, a script for calculating Cohen's kappa, a linear regression model, and a mixed-effect logistic regression model.

Contents

We explain the documents in the order of the list above, i.e., ML-PWD, USER STUDY and Analysis.

ML-PWD

This folder includes three .ipynb files, one subfolder, five .json files, two .py files, and one requirements.txt file:

  • datasets, which is a folder containing the dataset proposed in the paper Building standard offline anti-phishing dataset for benchmarking (we only include the source of three phishing webpages; the full dataset can be downloaded from the link given in that paper).
  • feature_extraction.ipynb, which is a notebook that extracts features from the HTML source of webpages.
  • RF-PWD.ipynb, which is a notebook containing the machine learning model selection, the custom ML-PWD described in our paper, and the predictions for our APW-Lab samples.
  • apw_lab_generation.ipynb, which is a notebook showing how we generate APW-Lab webpages.
  • extractor.py, which is taken from SpacePhish and extracts features from the HTML source of webpages.
  • util.py, which is a script providing functions for building the ML-PWD and generating APW-Lab samples.
  • full_feature.json, which is the feature set extracted from the dataset proposed in the paper Building standard offline anti-phishing dataset for benchmarking, used to build the custom ML-PWD.
  • addfootimg_use.json, addbackimg_use.json, addtypos_use.json and repass_use.json, which are the features extracted from APW-Lab webpages.
  • requirements.txt, which is a text file specifying the Python libraries needed to build the custom ML-PWD and generate APW-Lab webpages.
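
To give a rough idea of what extracting features from the HTML source of a webpage can look like, here is a minimal sketch using only the Python standard library. It is not the SpacePhish extractor.py and does not reproduce the feature set in full_feature.json; the class name, the function name and the four feature names are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class SimpleFeatureCounter(HTMLParser):
    """Counts a few illustrative HTML features (hypothetical, not the SpacePhish feature set)."""

    def __init__(self):
        super().__init__()
        self.features = {
            "n_images": 0,
            "n_forms": 0,
            "n_password_inputs": 0,
            "n_links": 0,
        }

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this hook.
        attrs = dict(attrs)
        if tag == "img":
            self.features["n_images"] += 1
        elif tag == "form":
            self.features["n_forms"] += 1
        elif tag == "a" and attrs.get("href"):
            # Only anchors that actually point somewhere are counted as links.
            self.features["n_links"] += 1
        elif tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.features["n_password_inputs"] += 1


def extract_features(html: str) -> dict:
    """Return the illustrative feature counts for one HTML document."""
    parser = SimpleFeatureCounter()
    parser.feed(html)
    parser.close()
    return parser.features
```

A dictionary like this, computed per webpage, is the kind of record that could be stored in a .json file and later fed to a classifier; the actual features and file layout used in this folder are defined by the notebooks and extractor.py.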

USER STUDY

This folder includes one .pdf file and one subfolder:

  • survey.pdf, which is a PDF file displaying what we show to the users, including the consent form, introduction, main questions, attention questions, demographic questions and final page.

  • experimental_webpages, which is a subfolder that includes all webpages used in our survey: Unperturbed Phishing, Legitimate, APW-Lab and APW-Wild. They are in four subfolders:

    • APW_Lab, which is a folder containing all APW-Lab webpages we generated for 15 brands.
    • APW_Wild, which is a folder containing adversarial phishing webpages in the wild, taken from Real Attackers Don't Compute Gradients.
    • Legitimate, which is a folder including the legitimate webpages of the 15 brands described in our paper.
    • Unperturbed Phishing, which is a folder containing the corresponding unperturbed phishing webpages of the 15 brands.

Analysis

This folder contains four files:

  • codebook.pdf, which is the codebook we built based on users' responses to the open-form questions.
  • codebook_kappa.ipynb, which is a notebook that calculates Cohen's kappa for the two coders.
  • linear_regression_model.r, which is an R script for analyzing the impact of demographic factors on users' detection rate.
  • mixed_effect_logistic_regression_model.r, which is an R script for analyzing the relationship between the webpage type and users' accuracy.
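
For readers unfamiliar with the agreement measure, the following is a minimal sketch of Cohen's kappa for two coders, written in plain Python. It is not the code in codebook_kappa.ipynb; the function name and the example codes are assumptions for illustration. Kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who each assigned one code per response."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must label the same non-empty set of responses")
    n = len(codes_a)

    # Observed agreement: share of responses on which both coders chose the same code.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n

    # Chance agreement: product of the two coders' marginal frequencies, summed over codes.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)

    if expected == 1.0:
        # Both coders used a single identical code everywhere; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two coders to four open-form responses.
coder_1 = ["layout", "layout", "url", "url"]
coder_2 = ["layout", "url", "url", "url"]
kappa = cohens_kappa(coder_1, coder_2)
```

With these example codes, the coders agree on three of four responses (p_o = 0.75) and the chance agreement is 0.5, so kappa is 0.5.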

Instructions

The following steps explain how to use our artifact.

  1. Download the dataset and install the requirements. We recommend creating a virtual environment for this purpose (Miniconda works well). The datasets subfolder in ML-PWD should be replaced with the downloaded dataset.
  2. Extract features and build the custom ML-PWD. Run feature_extraction.ipynb to extract features, then build the RF-based phishing website detector by executing RF-PWD.ipynb.
  3. Generate APW-Lab. Generate APW-Lab webpages by running apw_lab_generation.ipynb, extract their features with feature_extraction.ipynb, and input them to RF-PWD to test whether they evade detection.
  4. Publish the survey. Publish the survey to collect users' judgments of legitimate, unperturbed phishing, APW-Lab and APW-Wild webpages.
  5. Code the responses. Two coders code the responses based on codebook.pdf, then calculate Cohen's kappa with codebook_kappa.ipynb to check the reliability of the codebook.
  6. Regression analysis. Use linear_regression_model.r to analyze the impact of demographic factors on users' detection rate, and use mixed_effect_logistic_regression_model.r to analyze whether users' detection rate is affected by the webpage type, the users' familiarity, and the frequency of website visits.
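
As an illustration of the check in step 3, the sketch below computes the share of adversarial samples that a detector fails to flag. It is only a sketch under stated assumptions: the function name is hypothetical, and the label convention (1 for phishing, 0 for benign) is assumed here and may differ from the one used in RF-PWD.ipynb.

```python
def evasion_rate(predictions, phishing_label=1):
    """Fraction of adversarial phishing samples NOT predicted as phishing.

    `predictions` holds the detector's output for samples that are all phishing
    by construction, so every prediction other than `phishing_label` is an evasion.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    evaded = sum(1 for p in predictions if p != phishing_label)
    return evaded / len(predictions)


# Hypothetical detector outputs for four adversarial samples (assumed: 1 = phishing, 0 = benign).
example_predictions = [1, 0, 0, 1]
rate = evasion_rate(example_predictions)
```

In this made-up example two of the four samples are classified as benign, so the evasion rate is 0.5.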
