marco-c / autowebcompat
Automatically detect web compatibility issues
License: Mozilla Public License 2.0
Running collect.py fails when starting Firefox:

```
Using TensorFlow backend.
10280
Traceback (most recent call last):
  File "/Users/amit/Documents/GitHub/autowebcompat/collect.py", line 203, in <module>
    firefox_driver = webdriver.Firefox(firefox_profile=firefox_profile, firefox_binary='tools/nightly/firefox-bin')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/firefox/webdriver.py", line 152, in __init__
    keep_alive=True)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 98, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 188, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Unable to find a matching set of capabilities
```
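This "Unable to find a matching set of capabilities" error usually points to a geckodriver/Firefox version mismatch. One thing worth trying, assuming a reasonably recent Selenium 3.x, is passing the binary through FirefoxOptions instead of the firefox_binary argument (a sketch, not a confirmed fix):

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.binary_location = 'tools/nightly/firefox-bin'  # same binary collect.py uses
firefox_driver = webdriver.Firefox(options=options)
```

Updating geckodriver to a version that supports the Nightly build is also worth checking.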
Many websites (e.g. with a carousel, or with news) change their content pretty often, but the overall structure remains the same.
If we collect many screenshots of the same website over the course of a day, or over multiple days, we might be able to better teach the network to ignore differences in content and only consider differences in structure.
See also https://groups.google.com/forum/#!topic/mozilla.compatibility/oU9eVcHSPng.
Like the other input types we already handle, we could support the color input type as well (line 104 in e3f77c7).
This might be a future enhancement.
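If we do add it, a minimal sketch of a handler (driver and elem are the objects collect.py already uses; the hex value is arbitrary):

```python
def interact_with_color_input(driver, elem):
    # Color inputs open a native picker that Selenium can't drive reliably,
    # so set the value directly and fire a change event instead.
    driver.execute_script(
        "arguments[0].value = '#ff0000';"
        "arguments[0].dispatchEvent(new Event('change'));",
        elem)
```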
The test_write_labels test in test_utils.py currently just asserts that the file exists; a new test should be written (or the existing one updated) that writes the labels, reads them back, and asserts that the read labels are the same as the written ones.
See also PR #63.
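A sketch of the round-trip test using pytest's tmpdir fixture, assuming utils.write_labels(labels, path) and utils.read_labels(path) signatures (the actual ones in utils.py may differ):

```python
import utils  # adjust the import to wherever utils.py lives


def test_write_labels(tmpdir):
    labels = {'1_firefox.png': 'y', '2_chrome.png': 'n'}  # hypothetical label data
    path = str(tmpdir.join('labels.csv'))
    utils.write_labels(labels, path)
    # The labels read back should match the labels that were written.
    assert utils.read_labels(path) == labels
```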
We should add flake8 to test-requirements.txt and write a .travis.yml to run flake8.
The flake8 configuration file should ignore line limit errors (E501).
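A sketch of what the .travis.yml could look like (the E501 ignore could also live in a flake8 section of setup.cfg):

```yaml
language: python
python:
  - "3.6"
install:
  - pip install -r test-requirements.txt
script:
  - flake8 . --ignore=E501
```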
I found an architecture called 'SimNet', by Amazon Development Services, that uses a variation of the Siamese network: two extra, shallower CNN models trained on downsampled images are used alongside the ImageNet-based one, and the final results are better.
https://arxiv.org/pdf/1709.08761.pdf
This would be a great experiment, as per issue #1.
Around line 159:

```python
elif input_type == 'radio':
    elem.click()
```
In the first browser, when the crawler finds an element, it should check if there's no other element on the page with the same attributes.
In the second browser, when we are repeating the steps, we should assert that there's a single element on the page with the given attributes.
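A sketch of such a check, assuming we build an XPath from the element's tag and attributes (the helper name is illustrative, and attribute values containing quotes would need escaping):

```python
def is_uniquely_identified(driver, tag, attributes):
    # Build an XPath like //a[@class="nav"][@href="/home"] from the attributes.
    conditions = ''.join('[@{}="{}"]'.format(name, value)
                         for name, value in attributes.items())
    xpath = '//{}{}'.format(tag, conditions)
    # First browser: require exactly one match before recording the step.
    # Second browser: assert exactly one match before repeating the step.
    return len(driver.find_elements_by_xpath(xpath)) == 1
```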
This way it is easier to see diffs, as each line is a different data point.
#41 broke the data_inconsistencies.py script. We should modify it to make it work again.
The get_dependencies.py script prints no message and takes a long time to finish (it downloads and extracts a 1.4 GB zip file). So while the script runs, the user experience is poor: the user doesn't know what's going on, and it gives the impression that the script is stuck. It can be improved by showing a progress bar while downloading the data.zip file. The progress bar can be added using a library like clint, but that also means adding an extra dependency. Still, as we polish this project we will need a proper CLI interface anyway, so it won't hurt to start adding such dependencies now.
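A sketch of the download with a progress bar, using requests together with clint's progress.bar (both would be new dependencies; the URL is the existing data.zip link):

```python
import requests
from clint.textui import progress

url = 'https://www.dropbox.com/s/7f5uok2alxz9j1r/data.zip?dl=1'
response = requests.get(url, stream=True)
total_length = int(response.headers.get('content-length', 0))

with open('data.zip', 'wb') as f:
    # progress.bar renders a textual progress bar as the chunks arrive.
    for chunk in progress.bar(response.iter_content(chunk_size=1024),
                              expected_size=(total_length / 1024) + 1):
        if chunk:
            f.write(chunk)
```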
Follow the pycodestyle guidelines:
https://pypi.python.org/pypi/pycodestyle
Write classes for the Siamese module, so that the same module can be used for different types of networks and on different datasets. This would also make it easier to parallelise experiments for quick results.
Function naming should be standardised accordingly.
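A sketch of what a class-based interface might look like, with the base network passed in as a factory so the same class works for different network types (the layer choices here are illustrative, not the project's actual architecture):

```python
from keras import backend as K
from keras.layers import Dense, Input, Lambda
from keras.models import Model


class SiameseNetwork(object):
    def __init__(self, base_network_fn, input_shape):
        # base_network_fn builds the shared CNN; swapping it swaps the network type.
        base_network = base_network_fn(input_shape)
        input_a = Input(shape=input_shape)
        input_b = Input(shape=input_shape)
        processed_a = base_network(input_a)
        processed_b = base_network(input_b)
        # Element-wise L1 distance between the two embeddings.
        distance = Lambda(lambda t: K.abs(t[0] - t[1]))([processed_a, processed_b])
        output = Dense(1, activation='sigmoid')(distance)
        self.model = Model([input_a, input_b], output)

    def compile(self, **kwargs):
        self.model.compile(**kwargs)

    def fit(self, pairs_a, pairs_b, labels, **kwargs):
        self.model.fit([pairs_a, pairs_b], labels, **kwargs)
```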
Modify the label.py script to handle multiple people performing the labeling (I would suggest simply adding a PERSON parameter and writing a labels_PERSON.csv according to this parameter). Then, write another script that generates labels.csv from all the labels_PERSON.csv files (basically storing the lines from the labels_PERSON.csv files which agree on the label).
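A sketch of the merging script, assuming each labels_PERSON.csv contains "image,label" lines (the real format used by label.py should be checked):

```python
import csv
import glob
from collections import defaultdict

# Collect every label assigned to each image across all labelers.
votes = defaultdict(set)
for path in glob.glob('labels_*.csv'):
    with open(path) as f:
        for image, label in csv.reader(f):
            votes[image].add(label)

with open('labels.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for image, labels in sorted(votes.items()):
        if len(labels) == 1:  # keep only images on which all labelers agree
            writer.writerow([image, labels.pop()])
```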
There are a few cases where the crawler was not able to take screenshots. We should figure out why and try to fix any issue that we notice.
The files under data/ are in the format WEBCOMPAT-ID_ELEMENT-ID_BROWSER.png. WEBCOMPAT-ID is the issue ID from webcompat.com, ELEMENT-ID is the ID of the element where the crawler clicked before taking the screenshot, and BROWSER is the name of the browser.
We should investigate these cases:
```
sumit@HAL9000:~/Documents/autowebcompat$ python3 label.py
Traceback (most recent call last):
  File "label.py", line 4, in <module>
    from Tkinter import Tk, Label
ModuleNotFoundError: No module named 'Tkinter'
```
This is happening because the module is being imported as Tkinter. In Python 3 it should be imported as tkinter (see this SO answer).
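If we want to keep Python 2 compatibility, the import can also be made version-agnostic:

```python
try:
    from tkinter import Tk, Label  # Python 3
except ImportError:
    from Tkinter import Tk, Label  # Python 2
```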
On doing a fresh recursive clone of autowebcompat, the data submodule fetches the .png files by filename, but inside they contain git signatures instead of the actual content. So it downloaded just 5MB of data instead of 959MB.
How to reproduce the bug:

```
git clone --recurse-submodules git@github.com:marco-c/autowebcompat.git autoweb
cat autoweb/data/7_firefox.png
```

The file is text instead of PNG binary data!
Docs: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html.
We should try using the output of the network as features, fine-tuning only the top layers, and fine-tuning all layers.
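A sketch of these variants with a keras.applications model (VGG16 here is just an example; the network and layer counts in pretrain.py may differ):

```python
from keras.applications import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Variant 1: fixed feature extractor - freeze every pretrained layer.
for layer in base.layers:
    layer.trainable = False

# Variant 2: fine-tune only the top layers - freeze everything but the last few.
# for layer in base.layers[:-4]:
#     layer.trainable = False

# Variant 3: fine-tune all layers - freeze nothing.

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
output = Dense(1, activation='sigmoid')(x)
model = Model(base.input, output)
model.compile(optimizer='adam', loss='binary_crossentropy')
```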
In collect.py, add the ability to support select tags.
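Selenium already ships a helper for select elements; a sketch of how the crawler could cycle through the options (elem would be a select WebElement the crawler found):

```python
from selenium.webdriver.support.ui import Select


def interact_with_select(elem):
    select = Select(elem)
    for i in range(len(select.options)):
        # Select each option in turn; a screenshot could be taken after each,
        # like for the other interactions.
        select.select_by_index(i)
```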
Once #78 is fixed, I think a script should be written which ensures that inconsistent data is excluded, and/or which reports what part of the collected data is actually being used for training, so that we have an idea of how successful we are at data collection and can maybe update the crawler script later on.
Although I know that label.py will only take a pair of images if both are present, I haven't seen any enforcement measures.
A unit test has to be written for the balance
function in utils.py.
See also PR #63.
Right now, Chrome is showing a scrollbar that we should get rid of, as it is simply adding noise.
Once we remove the scrollbar, the screenshots will probably not be the same size in Firefox and Chrome, so we should adjust them to be the same.
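One possible way to suppress the scrollbar before taking the screenshot is injecting CSS via JavaScript (whether this is enough likely varies per page):

```python
driver.execute_script(
    "document.documentElement.style.overflow = 'hidden';"
    "document.body.style.overflow = 'hidden';"
)
```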
Currently, we are choosing elements to interact with in a random fashion.
Instead, we should explore all possible paths (up to a maximum level that we need to decide).
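A sketch of what bounded exhaustive exploration could look like; all helpers are hypothetical stand-ins for logic collect.py already has in some form:

```python
def explore(driver, sequence, max_depth):
    if len(sequence) >= max_depth:
        return
    # Snapshot the interactable elements on the page reached via `sequence`.
    for elem_id in get_interactable_element_ids(driver):  # hypothetical helper
        replay(driver, sequence)           # hypothetical: reload and redo the steps
        interact(driver, elem_id)          # hypothetical: click/fill the element
        take_screenshot(driver, sequence + [elem_id])
        explore(driver, sequence + [elem_id], max_depth)
```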
On running python get_dependencies.py (under Python 2, which assumes ASCII source encoding by default), we get an error:

```
  File "get_dependencies.py", line 31
SyntaxError: Non-ASCII character '\xe2' in file get_dependencies.py on line 31, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
```

To solve this, the utf-8 encoding should be declared in get_dependencies.py.
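Per PEP 263, that means one of the first two lines of the file must be:

```python
# -*- coding: utf-8 -*-
```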
I just forked and cloned the repository and ran the commands described in the README. But when I run python3 pretrain.py I get the following error:

```
/usr/local/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from 'float' to 'np.floating' is deprecated. In future, it will be treated as 'np.float64 == np.dtype(float).type'.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Traceback (most recent call last):
  File "pretrain.py", line 14, in <module>
    image = utils.load_image(all_images[0])
IndexError: list index out of range
```

This might be due to missing training images; I am unable to figure out how to download them.
Right now, we only interact with elements that have an ID and store the sequence of operations as a list of element IDs.
A lot of times on websites the elements don't have an ID though. Can we find a way to repeat the same steps that we perform in a browser on another browser without having the IDs of the elements?
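One possible fallback is identifying elements by structural position rather than by ID, e.g. an index-based XPath computed in the page (a sketch; this is fragile precisely when the two browsers render different DOMs, which is part of what we are trying to detect):

```python
def absolute_xpath(driver, elem):
    # Walk up from the element, recording each tag and its index among
    # same-tag siblings, producing e.g. /html[1]/body[1]/div[2]/ul[1]/li[3]/a[1].
    return driver.execute_script("""
        var path = '';
        for (var node = arguments[0]; node && node.nodeType === 1; node = node.parentNode) {
            var index = 1;
            for (var sib = node.previousElementSibling; sib; sib = sib.previousElementSibling) {
                if (sib.tagName === node.tagName) index++;
            }
            path = '/' + node.tagName.toLowerCase() + '[' + index + ']' + path;
        }
        return path;
    """, elem)
```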
We should try to modify collect.py to run multiple instances of the browsers. Since the crawler spends most of its time waiting (for pages to fully load), running multiple instances could considerably increase the number of screenshots we can take per second.
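A sketch using a thread pool, assuming the per-site logic is factored into its own function (run_crawl_for_site is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor


def run_crawl_for_site(site):
    # Hypothetical: create a dedicated Firefox/Chrome pair for this worker and
    # run the existing crawl-and-screenshot logic for one site.
    pass


sites = ['example.com']  # in collect.py this would come from the webcompat issues

with ThreadPoolExecutor(max_workers=4) as executor:
    list(executor.map(run_crawl_for_site, sites))
```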
https://www.dropbox.com/s/7f5uok2alxz9j1r/data.zip?dl=1 is not available anymore.
Right now, the crawler is interacting with button elements, some input elements and a elements (line 74 in 205aae8).
The documentation is pretty scarce; it needs to be improved.
Maybe we could move some crawler code into a separate module under the autowebcompat directory.
While doing it, we should also take note of the sites on which the crawler fails to take screenshots.
There's a nice pytest feature that you can use instead of manually using tempfile: https://docs.pytest.org/en/latest/tmpdir.html.
See also PR #63.
Two steps here (see line 104 in e3f77c7).
The labeling can be performed using the label.py script.
This script will show you a couple of images, and then you can press:
- 'y' to label them as compatible;
- 'd' to label them as compatible with content differences (e.g. on a news site, two screenshots could be compatible even though they show different news, simply because the news shown depends on when the screenshot was taken and not on the browser);
- 'n' to label them as not compatible;
- 'RETURN' to skip them (in case you are not sure yet);
- 'ESCAPE' to terminate the current labeling session and store the current results.
More details about the three-labeling system are present in the documentation at https://github.com/marco-c/autowebcompat#labeling.
Right now, for each webcompat issue, we are just checking if we have taken screenshots for the main page. We should instead check if we have taken screenshots for all the possible operations too.
Basically, this should take care of the TODO here (line 220 in 5bb234e).
Given the associated text file with the sequence of operations, we should check whether a screenshot exists for every operation. When a screenshot doesn't exist, we should attempt to create it.
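A sketch of that check, assuming the sequence file is data/WEBCOMPAT-ID.txt with one element ID per line and screenshots follow the WEBCOMPAT-ID_ELEMENT-ID_BROWSER.png scheme described earlier (both assumptions should be verified against collect.py):

```python
import os


def missing_screenshots(issue_id, browsers=('firefox', 'chrome')):
    missing = []
    with open('data/%s.txt' % issue_id) as f:
        element_ids = [line.strip() for line in f if line.strip()]
    for element_id in element_ids:
        for browser in browsers:
            path = 'data/%s_%s_%s.png' % (issue_id, element_id, browser)
            if not os.path.exists(path):
                missing.append(path)  # candidates to re-crawl
    return missing
```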
The TarFile class can be used as a context manager, which is nicer.
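That is, instead of pairing open() with a manual close(), something like (the archive name is just an example):

```python
import tarfile

with tarfile.open('data.tar.gz') as tar:
    tar.extractall()
```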