marco-c / autowebcompat
Automatically detect web compatibility issues
License: Mozilla Public License 2.0
Running collect.py fails when starting Firefox:

```
Using TensorFlow backend.
10280
Traceback (most recent call last):
  File "/Users/amit/Documents/GitHub/autowebcompat/collect.py", line 203, in <module>
    firefox_driver = webdriver.Firefox(firefox_profile=firefox_profile, firefox_binary='tools/nightly/firefox-bin')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/firefox/webdriver.py", line 152, in __init__
    keep_alive=True)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 98, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 188, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Unable to find a matching set of capabilities
```
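This "Unable to find a matching set of capabilities" error usually points to a geckodriver/Firefox version mismatch. One thing worth trying, assuming a reasonably recent Selenium 3.x, is passing the binary through FirefoxOptions instead of the firefox_binary argument (a sketch, not a confirmed fix):

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.binary_location = 'tools/nightly/firefox-bin'  # same binary collect.py uses
firefox_driver = webdriver.Firefox(options=options)
```

Updating geckodriver to a version that supports the Nightly build is also worth checking.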
Many websites (e.g. with a carousel, or with news) change their content pretty often, but the overall structure remains the same.
If we collect many screenshots of the same website over the course of a day, or over multiple days, we might be able to better teach the network to ignore differences in content and only consider differences in structure.
See also https://groups.google.com/forum/#!topic/mozilla.compatibility/oU9eVcHSPng.
Like the other input types we already handle, we could support the color input type as well (line 104 in e3f77c7).
This might be a future enhancement.
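If we do add it, a minimal sketch of a handler (driver and elem are the objects collect.py already uses; the hex value is arbitrary):

```python
def interact_with_color_input(driver, elem):
    # Color inputs open a native picker that Selenium can't drive reliably,
    # so set the value directly and fire a change event instead.
    driver.execute_script(
        "arguments[0].value = '#ff0000';"
        "arguments[0].dispatchEvent(new Event('change'));",
        elem)
```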
The test_write_labels test in test_utils.py currently just asserts that the file exists; a new test should be written (or the existing one updated) that writes the labels, reads them back, and asserts that the read labels are the same as the written ones.
See also PR #63.
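A sketch of the round-trip test using pytest's tmpdir fixture, assuming utils.write_labels(labels, path) and utils.read_labels(path) signatures (the actual ones in utils.py may differ):

```python
import utils  # adjust the import to wherever utils.py lives


def test_write_labels(tmpdir):
    labels = {'1_firefox.png': 'y', '2_chrome.png': 'n'}  # hypothetical label data
    path = str(tmpdir.join('labels.csv'))
    utils.write_labels(labels, path)
    # The labels read back should match the labels that were written.
    assert utils.read_labels(path) == labels
```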
We should add flake8 to test-requirements.txt and write a .travis.yml to run flake8.
The flake8 configuration file should ignore line limit errors (E501).
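A sketch of what the .travis.yml could look like (the E501 ignore could also live in a flake8 section of setup.cfg):

```yaml
language: python
python:
  - "3.6"
install:
  - pip install -r test-requirements.txt
script:
  - flake8 . --ignore=E501
```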
I found an architecture called 'SimNet', by Amazon Development Services, that uses a variation of the Siamese network: two extra, shallower CNN models trained on downsampled images are used alongside the ImageNet-based one, and the final results are better.
https://arxiv.org/pdf/1709.08761.pdf
This would be a great experiment, as per issue #1.
Around line 159:

```python
elif input_type == 'radio':
    elem.click()
```
In the first browser, when the crawler finds an element, it should check if there's no other element on the page with the same attributes.
In the second browser, when we are repeating the steps, we should assert that there's a single element on the page with the given attributes.
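A sketch of such a check, assuming we build an XPath from the element's tag and attributes (the helper name is illustrative, and attribute values containing quotes would need escaping):

```python
def is_uniquely_identified(driver, tag, attributes):
    # Build an XPath like //a[@class="nav"][@href="/home"] from the attributes.
    conditions = ''.join('[@{}="{}"]'.format(name, value)
                         for name, value in attributes.items())
    xpath = '//{}{}'.format(tag, conditions)
    # First browser: require exactly one match before recording the step.
    # Second browser: assert exactly one match before repeating the step.
    return len(driver.find_elements_by_xpath(xpath)) == 1
```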
This way it is easier to see diffs, as each line is a different data point.
#41 broke the data_inconsistencies.py script. We should modify it to make it work again.
The get_dependencies.py script prints no message and takes a long time to finish (it downloads and extracts a 1.4 GB zip file). So while the script runs, the user experience is poor: the user doesn't know what's going on, and it gives the impression that the script is stuck. It can be improved by showing a progress bar while downloading the data.zip file. The progress bar can be added using a library like clint, but that also means adding an extra dependency. Still, as we polish this project we will need a proper CLI interface anyway, so it won't hurt to start adding such dependencies now.
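A sketch of the download with a progress bar, using requests together with clint's progress.bar (both would be new dependencies; the URL is the existing data.zip link):

```python
import requests
from clint.textui import progress

url = 'https://www.dropbox.com/s/7f5uok2alxz9j1r/data.zip?dl=1'
response = requests.get(url, stream=True)
total_length = int(response.headers.get('content-length', 0))

with open('data.zip', 'wb') as f:
    # progress.bar renders a textual progress bar as the chunks arrive.
    for chunk in progress.bar(response.iter_content(chunk_size=1024),
                              expected_size=(total_length / 1024) + 1):
        if chunk:
            f.write(chunk)
```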
Follow the pycodestyle guidelines:
https://pypi.python.org/pypi/pycodestyle
Write classes for the Siamese module, so that the same module can be used for different types of networks and on different datasets. This would also make it easier to parallelise experiments for quick results.
Function naming should be standardised accordingly.
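A sketch of what a class-based interface might look like, with the base network passed in as a factory so the same class works for different network types (the layer choices here are illustrative, not the project's actual architecture):

```python
from keras import backend as K
from keras.layers import Dense, Input, Lambda
from keras.models import Model


class SiameseNetwork(object):
    def __init__(self, base_network_fn, input_shape):
        # base_network_fn builds the shared CNN; swapping it swaps the network type.
        base_network = base_network_fn(input_shape)
        input_a = Input(shape=input_shape)
        input_b = Input(shape=input_shape)
        processed_a = base_network(input_a)
        processed_b = base_network(input_b)
        # Element-wise L1 distance between the two embeddings.
        distance = Lambda(lambda t: K.abs(t[0] - t[1]))([processed_a, processed_b])
        output = Dense(1, activation='sigmoid')(distance)
        self.model = Model([input_a, input_b], output)

    def compile(self, **kwargs):
        self.model.compile(**kwargs)

    def fit(self, pairs_a, pairs_b, labels, **kwargs):
        self.model.fit([pairs_a, pairs_b], labels, **kwargs)
```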
Modify the label.py script to handle multiple people performing the labeling (I would suggest simply adding a PERSON parameter and writing a labels_PERSON.csv according to this parameter). Then, write another script that generates labels.csv from all the labels_PERSON.csv files (basically storing the lines from the labels_PERSON.csv files which agree on the label).
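A sketch of the merging script, assuming each labels_PERSON.csv contains "image,label" lines (the real format used by label.py should be checked):

```python
import csv
import glob
from collections import defaultdict

# Collect every label assigned to each image across all labelers.
votes = defaultdict(set)
for path in glob.glob('labels_*.csv'):
    with open(path) as f:
        for image, label in csv.reader(f):
            votes[image].add(label)

with open('labels.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for image, labels in sorted(votes.items()):
        if len(labels) == 1:  # keep only images on which all labelers agree
            writer.writerow([image, labels.pop()])
```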
There are a few cases where the crawler was not able to take screenshots. We should figure out why and try to fix any issue that we notice.
The files under data/ are in the format WEBCOMPAT-ID_ELEMENT-ID_BROWSER.png. WEBCOMPAT-ID is the issue ID from webcompat.com, ELEMENT-ID is the ID of the element where the crawler clicked before taking the screenshot, and BROWSER is the name of the browser.
We should investigate these cases:
```
sumit@HAL9000:~/Documents/autowebcompat$ python3 label.py
Traceback (most recent call last):
  File "label.py", line 4, in <module>
    from Tkinter import Tk, Label
ModuleNotFoundError: No module named 'Tkinter'
```
This is happening because the module is being imported as Tkinter. In Python 3 it should be imported as tkinter (see this SO answer).
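If we want to keep Python 2 compatibility, the import can also be made version-agnostic:

```python
try:
    from tkinter import Tk, Label  # Python 3
except ImportError:
    from Tkinter import Tk, Label  # Python 2
```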
On doing a fresh recursive clone of autowebcompat, the data submodule fetches the .png files by filename, but inside they contain git signatures instead of the actual content. So it downloaded just 5MB of data instead of 959MB.
How to reproduce the bug:

```
git clone --recurse-submodules git@github.com:marco-c/autowebcompat.git autoweb
cat autoweb/data/7_firefox.png
```

The file is text instead of PNG binary data!
Docs: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html.
We should try using the output of the network as features, fine-tuning only the top layers, and fine-tuning all layers.
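A sketch of these variants with a keras.applications model (VGG16 here is just an example; the network and layer counts in pretrain.py may differ):

```python
from keras.applications import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Variant 1: fixed feature extractor - freeze every pretrained layer.
for layer in base.layers:
    layer.trainable = False

# Variant 2: fine-tune only the top layers - freeze everything but the last few.
# for layer in base.layers[:-4]:
#     layer.trainable = False

# Variant 3: fine-tune all layers - freeze nothing.

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
output = Dense(1, activation='sigmoid')(x)
model = Model(base.input, output)
model.compile(optimizer='adam', loss='binary_crossentropy')
```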
In collect.py, add the ability to support select tags.
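Selenium already ships a helper for select elements; a sketch of how the crawler could cycle through the options (elem would be a select WebElement the crawler found):

```python
from selenium.webdriver.support.ui import Select


def interact_with_select(elem):
    select = Select(elem)
    for i in range(len(select.options)):
        # Select each option in turn; a screenshot could be taken after each,
        # like for the other interactions.
        select.select_by_index(i)
```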
Once #78 is fixed, I think a script should be written which ensures that inconsistent data is excluded, and/or which reports what part of the collected data is actually being used for training, so that we have an idea of how successful we are at data collection and can maybe update the crawler script later on.
Although I know that label.py will only take a pair of images if both are present, I haven't seen any enforcement measures.
A unit test has to be written for the balance
function in utils.py.
See also PR #63.
Right now, Chrome is showing a scrollbar that we should get rid of, as it is simply adding noise.
Once we remove the scrollbar, the screenshots will probably not be the same size in Firefox and Chrome, so we should adjust them to be the same.
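One possible way to suppress the scrollbar before taking the screenshot is injecting CSS via JavaScript (whether this is enough likely varies per page):

```python
driver.execute_script(
    "document.documentElement.style.overflow = 'hidden';"
    "document.body.style.overflow = 'hidden';"
)
```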
Currently, we are choosing elements to interact with in a random fashion.
Instead, we should explore all possible paths (up to a maximum level that we need to decide).
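A sketch of what bounded exhaustive exploration could look like; all helpers are hypothetical stand-ins for logic collect.py already has in some form:

```python
def explore(driver, sequence, max_depth):
    if len(sequence) >= max_depth:
        return
    # Snapshot the interactable elements on the page reached via `sequence`.
    for elem_id in get_interactable_element_ids(driver):  # hypothetical helper
        replay(driver, sequence)           # hypothetical: reload and redo the steps
        interact(driver, elem_id)          # hypothetical: click/fill the element
        take_screenshot(driver, sequence + [elem_id])
        explore(driver, sequence + [elem_id], max_depth)
```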
On running python get_dependencies.py (under Python 2, which assumes ASCII source encoding by default), we get an error:

```
  File "get_dependencies.py", line 31
SyntaxError: Non-ASCII character '\xe2' in file get_dependencies.py on line 31, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
```

To solve this, the utf-8 encoding should be declared in get_dependencies.py.
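Per PEP 263, that means one of the first two lines of the file must be:

```python
# -*- coding: utf-8 -*-
```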
I just forked and cloned the repository and ran the commands described in the README. But when I run python3 pretrain.py I get the following error:

```
/usr/local/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from 'float' to 'np.floating' is deprecated. In future, it will be treated as 'np.float64 == np.dtype(float).type'.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Traceback (most recent call last):
  File "pretrain.py", line 14, in <module>
    image = utils.load_image(all_images[0])
IndexError: list index out of range
```

This might be due to missing training images; I am unable to figure out how to download them.
Right now, we only interact with elements that have an ID and store the sequence of operations as a list of element IDs.
A lot of times on websites the elements don't have an ID though. Can we find a way to repeat the same steps that we perform in a browser on another browser without having the IDs of the elements?
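One possible fallback is identifying elements by structural position rather than by ID, e.g. an index-based XPath computed in the page (a sketch; this is fragile precisely when the two browsers render different DOMs, which is part of what we are trying to detect):

```python
def absolute_xpath(driver, elem):
    # Walk up from the element, recording each tag and its index among
    # same-tag siblings, producing e.g. /html[1]/body[1]/div[2]/ul[1]/li[3]/a[1].
    return driver.execute_script("""
        var path = '';
        for (var node = arguments[0]; node && node.nodeType === 1; node = node.parentNode) {
            var index = 1;
            for (var sib = node.previousElementSibling; sib; sib = sib.previousElementSibling) {
                if (sib.tagName === node.tagName) index++;
            }
            path = '/' + node.tagName.toLowerCase() + '[' + index + ']' + path;
        }
        return path;
    """, elem)
```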
We should try to modify collect.py to run multiple instances of the browsers. Since the crawler spends most of its time waiting (for pages to fully load), running multiple instances could considerably increase the number of screenshots we can take per second.
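A sketch using a thread pool, assuming the per-site logic is factored into its own function (run_crawl_for_site is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor


def run_crawl_for_site(site):
    # Hypothetical: create a dedicated Firefox/Chrome pair for this worker and
    # run the existing crawl-and-screenshot logic for one site.
    pass


sites = ['example.com']  # in collect.py this would come from the webcompat issues

with ThreadPoolExecutor(max_workers=4) as executor:
    list(executor.map(run_crawl_for_site, sites))
```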
https://www.dropbox.com/s/7f5uok2alxz9j1r/data.zip?dl=1 is not available anymore.
Right now, the crawler is interacting with button elements, some input elements and a elements (line 74 in 205aae8).
The documentation is pretty scarce; it needs to be improved.
Maybe we could move some crawler code into a separate module under the autowebcompat directory.
While doing it, we should also take note of the sites on which the crawler fails to take screenshots.
There's a nice pytest feature that you can use instead of manually using tempfile: https://docs.pytest.org/en/latest/tmpdir.html.
See also PR #63.
Two steps here (see line 104 in e3f77c7).
The labeling can be performed using the label.py script.
This script will show you a couple of images, and then you can press:
- 'y' to label them as compatible;
- 'd' to label them as compatible with content differences (e.g. on a news site, two screenshots could be compatible even though they show different news, simply because the news shown depends on when the screenshot was taken and not on the browser);
- 'n' to label them as not compatible;
- 'RETURN' to skip them (in case you are not sure yet);
- 'ESCAPE' to terminate the current labeling session and store the current results.
More details about the three-labeling system are present in the documentation at https://github.com/marco-c/autowebcompat#labeling.
Right now, for each webcompat issue, we are just checking if we have taken screenshots for the main page. We should instead check if we have taken screenshots for all the possible operations too.
Basically, this should take care of the TODO here (line 220 in 5bb234e).
Given the associated text file with the sequence of operations, we should check whether a screenshot exists for every operation. When a screenshot doesn't exist, we should attempt to create it.
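A sketch of that check, assuming the sequence file is data/WEBCOMPAT-ID.txt with one element ID per line and screenshots follow the WEBCOMPAT-ID_ELEMENT-ID_BROWSER.png scheme described earlier (both assumptions should be verified against collect.py):

```python
import os


def missing_screenshots(issue_id, browsers=('firefox', 'chrome')):
    missing = []
    with open('data/%s.txt' % issue_id) as f:
        element_ids = [line.strip() for line in f if line.strip()]
    for element_id in element_ids:
        for browser in browsers:
            path = 'data/%s_%s_%s.png' % (issue_id, element_id, browser)
            if not os.path.exists(path):
                missing.append(path)  # candidates to re-crawl
    return missing
```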
The TarFile class can be used as a context manager, which is nicer.
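That is, instead of pairing open() with a manual close(), something like (the archive name is just an example):

```python
import tarfile

with tarfile.open('data.tar.gz') as tar:
    tar.extractall()
```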