
Intern Project: Identifying Advertiser Quality from Their Websites

Intuitively, for an ad campaign, the quality of the advertiser's business itself can have a big impact on campaign performance, especially conversion performance. For a restaurant, quality may mean the food, service, and location it provides to customers; for an e-commerce website, it may mean the ease of use of the website, shipping, and customer service. An advertiser's quality can be estimated from several data sources, such as the quality of their website, their ratings on popular platforms such as Google Maps or Yelp, and/or how they rank organically in Google Search. Since estimating the quality of an advertiser does not require ads data, in this intern project we measure advertiser quality from their websites.

Our data preprocessing methods and neural network models are built on Python 3 and TensorFlow, running on CPUs and GPUs. We developed two neural network models: 1) a baseline model and 2) DOM-based models. Both approaches use the 12-layer BERT-Base model. Below, we show how to run our code to retrieve URL links and their HTMLs, clean the visible texts, and visualize DOM structures, and we show how to run our NN models to train on the data and to predict the category/rating of URL links.

Instructions

1. Setup

The required packages are listed in requirements.txt. You also need to run setup.py to install the package. For testing, click on Actions, shown above, to run a workflow.
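For example, a typical setup from a clone of the repository (assuming pip and a standard setuptools layout) is:

pip install -r requirements.txt
python setup.py install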

2. Data Preprocessing

All preprocessing methods are located in the utils folder. Here, we show how to run the code.

Get HTMLs of URL Links: get_HTMLs_from_urls.py
This script takes the path of a text file of URL links, where every line contains one URL link. It also takes as input a dataframe with information about the businesses, ordered in the same order as the URL links in the text file. It extracts the HTML contents of the URLs and stores a dataframe with the extracted HTMLs in the output directory. Note that the URL links are extracted from Yelp pages. Please check generate_input_data.ipynb to see how we save the URL links in a text file.

usage: get_HTMLs_from_urls.py [-h] [--input_path INPUT_PATH]
                              [--urllinks_path URLLINKS_PATH]
                              [--output_directory OUTPUT_DIRECTORY]

Get HTMLs from URL Links

optional arguments:
  -h, --help            show this help message and exit
  --input_path INPUT_PATH
                        The path of the data
  --urllinks_path URLLINKS_PATH
                        The path of url links
  --output_directory OUTPUT_DIRECTORY
                        The directory of final result
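For example (the file and directory names here are hypothetical):

python get_HTMLs_from_urls.py --input_path data/businesses.csv --urllinks_path data/url_links.txt --output_directory output/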

Get Visible Texts of HTMLs: HTML_to_visibletexts.py
This Python code retrieves the visible texts from HTMLs and then cleans them by removing non-ASCII characters and replacing \t and multiple spaces with a single space. It takes as input the path of the dataframe and the column name of the HTML contents. It also ignores websites that are not in English, using the Python package langdetect. It saves the final dataframe in the output directory.

usage: HTML_to_visibletexts.py [-h] [--input_path INPUT_PATH]
                               [--output_directory OUTPUT_DIRECTORY]
                               [--HTML_colname HTML_COLNAME]

Get Visible Texts from HTMLs

optional arguments:
  -h, --help            show this help message and exit
  --input_path INPUT_PATH
                        The path of the data
  --output_directory OUTPUT_DIRECTORY
                        The directory of final result
  --HTML_colname HTML_COLNAME
                        The HTML column name
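For example (the file names and column name here are hypothetical):

python HTML_to_visibletexts.py --input_path output/htmls.csv --output_directory output/ --HTML_colname html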

Visualize the DOM structure of an HTML: gen_DOM_tree.py
This code visualizes the DOM structure of a given HTML. For better visualization, we limit the depth of the tree (DOM) to 7 and the maximum number of branches to 5. You run the code by passing a URL link; the output PDF file is saved in the same directory as the Python file.

usage: gen_DOM_tree.py [-h] [--url URL]

Visualize DOM HTMLs

optional arguments:
  -h, --help  show this help message and exit
  --url URL   URL link of website

Here is an example:

python gen_DOM_tree.py --url https://www.pizzahut.com

Helper functions: utils.py
Here, we maintain several helper functions for cleaning texts, unifying categories, splitting the training and test sets, oversampling the training data, and plotting the distribution of the data. The functions in utils are listed below:

def suppress_stdout # suppresses stdout.
def suppress_sterr # suppresses stderr.
def unify_yelp_data_classes # unifies categories by mapping a list of words to a label name.
def remove_not_loaded_websites # removes websites that failed to load.
def oversampling # oversamples the training data.
def plot_classes_distribution # plots the distribution of the entries for each class.
def clean_text # cleans the input text by removing non-ASCII characters and replacing multiple spaces and \t with a single space.
def get_train_test # splits the data into training and test sets.
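As a rough sketch, a few of these helpers might be combined as follows; the argument names and return values shown are assumptions, not the actual signatures (check utils.py for those):

from utils import clean_text, get_train_test, oversampling

# Hypothetical usage; the exact signatures may differ.
text = clean_text(raw_text)             # remove non-ASCII characters, collapse \t and repeated spaces
train_df, test_df = get_train_test(df)  # split the data into training and test sets
train_df = oversampling(train_df)       # balance the training set by oversampling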

Tree library and its helper functions: tree_lib.py
This file contains several functions for converting an HTML document to tree strings, a data structure for maintaining trees, balancing the data based on the labels of the trees, and computing the stats of trees, including the number of nodes, the maximum depth, and the maximum number of branches. In generate_input_data.ipynb, we generate trees and save them in a dataframe.

3. Models

Baseline model: baseline_model.py
This code implements our baseline model. A preprocessing step prepares the data by converting texts into BERT embeddings; the model then trains and finally evaluates on the test set. Alternatively, you can pass a URL link and get a prediction of its category or rating. Here are the parameters for running the code:

usage: baseline_model.py [-h] [--tasktype {C,R}]
                         [--input_directory INPUT_DIRECTORY]
                         [--adam_lr ADAM_LR] [--n_epochs N_EPOCHS]
                         [--val_split_ratio VAL_SPLIT_RATIO]
                         [--bert_folder_path BERT_FOLDER_PATH]
                         [--bert_embedding_size BERT_EMBEDDING_SIZE]
                         [--keep_prob KEEP_PROB]
                         [--max_content_length MAX_CONTENT_LENGTH]
                         [--n_hidden_layers N_HIDDEN_LAYERS] [--url URL]
                         [--best_weight_path BEST_WEIGHT_PATH]
                         [--chrome_path CHROME_PATH]

Baseline -- Identifying Advertiser Quality from Their Websites

optional arguments:
  -h, --help            show this help message and exit
  --tasktype {C,R}      (C) Classification or (R) Regression
  --input_directory INPUT_DIRECTORY
                        Directory for train and test data
  --adam_lr ADAM_LR     Adam learning rate
  --n_epochs N_EPOCHS   Number of epochs
  --val_split_ratio VAL_SPLIT_RATIO
                        Validation split ratio
  --bert_folder_path BERT_FOLDER_PATH
                        Folder path of the BERT model
  --bert_embedding_size BERT_EMBEDDING_SIZE
                        BERT output embedding size
  --keep_prob KEEP_PROB
                        Keep probability of the dropout layers
  --max_content_length MAX_CONTENT_LENGTH
                        Maximum content length from each leaf of the DOM
  --n_hidden_layers N_HIDDEN_LAYERS
                        Number of hidden layers
  --url URL             URL link of business website
  --best_weight_path BEST_WEIGHT_PATH
                        Path to the best model weights
  --chrome_path CHROME_PATH
                        The path to the Chrome engine used by the Python package selenium
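For example, to train a classifier on a prepared dataset and then predict the category of a single website (the paths and values here are hypothetical):

python baseline_model.py --tasktype C --input_directory data/ --n_epochs 10
python baseline_model.py --tasktype C --url https://www.pizzahut.com --best_weight_path weights/baseline_best --chrome_path /usr/bin/chromedriver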

DOM-based models: DOMbased_model.py
This file contains the implementation of the DOM-based models, including the Fast DOM-Based Model (FDBM) and its modified versions. By default, running the code runs the FDBM with mean aggregation (in which the children embeddings are averaged). As with the baseline model, you can pass a URL link for prediction; otherwise (if you do not pass a URL link), the model trains on the data. Here are the parameters for running the code:

usage: DOMbased_model.py [-h] [--tasktype {C,R}]
                         [--input_directory INPUT_DIRECTORY]
                         [--adam_lr ADAM_LR] [--n_epochs N_EPOCHS] [--l2 L2]
                         [--val_split_ratio VAL_SPLIT_RATIO]
                         [--max_depth MAX_DEPTH]
                         [--bert_folder_path BERT_FOLDER_PATH]
                         [--bert_embedding_size BERT_EMBEDDING_SIZE]
                         [--embedding_size EMBEDDING_SIZE]
                         [--keep_prob KEEP_PROB]
                         [--max_content_length MAX_CONTENT_LENGTH] [--url URL]
                         [--best_weight_path BEST_WEIGHT_PATH]
                         [--chrome_path CHROME_PATH]

Fast DOM Based Model -- Identifying Advertiser Quality from Their Websites

optional arguments:
  -h, --help            show this help message and exit
  --tasktype {C,R}      (C) Classification or (R) Regression
  --input_directory INPUT_DIRECTORY
                        Directory for train and test data
  --adam_lr ADAM_LR     Adam learning rate
  --n_epochs N_EPOCHS   Number of epochs
  --l2 L2               L2 regularization factor
  --val_split_ratio VAL_SPLIT_RATIO
                        Validation split ratio
  --max_depth MAX_DEPTH
                        Maximum depth for DOM based model
  --bert_folder_path BERT_FOLDER_PATH
                        Folder path of the BERT model
  --bert_embedding_size BERT_EMBEDDING_SIZE
                        BERT output embedding size
  --embedding_size EMBEDDING_SIZE
                        DOM-based model output embedding size
  --keep_prob KEEP_PROB
                        Keep probability of the dropout layers
  --max_content_length MAX_CONTENT_LENGTH
                        Maximum content length from each leaf of the DOM
  --url URL             URL link of business website
  --best_weight_path BEST_WEIGHT_PATH
                        Path to the best model weights
  --chrome_path CHROME_PATH
                        The path to the Chrome engine used by the Python package selenium
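For example, to train a regression model on a prepared dataset and then predict the rating of a single website (the paths and values here are hypothetical):

python DOMbased_model.py --tasktype R --input_directory data/ --max_depth 7
python DOMbased_model.py --tasktype R --url https://www.pizzahut.com --best_weight_path weights/fdbm_best --chrome_path /usr/bin/chromedriver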

Demo

A demo of the FDBM predicting the category and rating of businesses given their URL links is included in the repository.

License

Apache 2.0; see LICENSE for details.

Disclaimer

This is not an officially supported Google product.

