
domain_discovery_tool_deprecated's People

Contributors

aecio, ahmadia, brittainhard, canavandl, gocesarp, jdfekete, kienpt, rshandy, soniacq, yamsgithub


domain_discovery_tool_deprecated's Issues

Conda environment not working due to elasticsearch dependency

The latest elasticsearch conda package (1.7) doesn't include the executable file to start Elasticsearch, or its name has changed. Elasticsearch doesn't start when running supervisord:

$ supervisord
2015-10-15 14:24:27,667 INFO RPC interface 'supervisor' initialized
2015-10-15 14:24:27,668 CRIT Server 'inet_http_server' running without any HTTP authentication checking
2015-10-15 14:24:27,668 INFO supervisord started with pid 16981
2015-10-15 14:24:28,669 INFO spawnerr: can't find command 'elasticsearch'
2015-10-15 14:24:28,671 INFO spawned: 'ddt' with pid 16985
2015-10-15 14:24:29,703 INFO spawnerr: can't find command 'elasticsearch'
2015-10-15 14:24:29,704 INFO success: ddt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2015-10-15 14:24:31,706 INFO spawnerr: can't find command 'elasticsearch'
2015-10-15 14:24:34,710 INFO spawnerr: can't find command 'elasticsearch'
2015-10-15 14:24:34,710 INFO gave up: elasticsearch entered FATAL state, too many start retries too quickly

Specifying an elasticsearch version <= 1.6 in the environment.yml file fixes this problem, but another problem occurs when starting the DDT services through supervisord:

$ supervisord
2015-10-15 14:28:14,162 INFO RPC interface 'supervisor' initialized
2015-10-15 14:28:14,162 CRIT Server 'inet_http_server' running without any HTTP authentication checking
2015-10-15 14:28:14,162 INFO supervisord started with pid 17341
2015-10-15 14:28:15,165 INFO spawned: 'elasticsearch' with pid 17344
2015-10-15 14:28:15,168 INFO spawned: 'ddt' with pid 17345
2015-10-15 14:28:15,600 INFO exited: ddt (exit status 1; not expected)
2015-10-15 14:28:16,725 INFO success: elasticsearch entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2015-10-15 14:28:16,726 INFO spawned: 'ddt' with pid 17450
2015-10-15 14:28:17,196 INFO exited: ddt (exit status 1; not expected)
2015-10-15 14:28:19,239 INFO spawned: 'ddt' with pid 17627
2015-10-15 14:28:19,623 INFO exited: ddt (exit status 1; not expected)
2015-10-15 14:28:23,187 INFO spawned: 'ddt' with pid 17736
2015-10-15 14:28:23,590 INFO exited: ddt (exit status 1; not expected)
2015-10-15 14:28:24,591 INFO gave up: ddt entered FATAL state, too many start retries too quickly

and in the file ./logs/ddt-stderr---supervisor-YpGRaX.log:

Traceback (most recent call last):
  File "/home/aeciosantos/workspace/domain_discovery_tool/vis/server.py", line 5, in <module>
    from crawler_model_adapter import *
  File "/home/aeciosantos/workspace/domain_discovery_tool/vis/crawler_model_adapter.py", line 2, in <module>
    from models.crawlermodel import *
  File "/home/aeciosantos/workspace/domain_discovery_tool/models/crawlermodel.py", line 22, in <module>
    from elasticsearch import Elasticsearch
ImportError: No module named elasticsearch

@ahmadia @brittainhard Do you guys know if this is a problem with the elasticsearch packages or with DDT's conda env?

Thanks @felipemoraes for reporting the problem.
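
One quick way to narrow that down is to check, inside the activated conda env, whether the client library is importable at all. A diagnostic sketch (not part of DDT):

# Diagnostic sketch: run with the ddt conda env activated to check whether
# the elasticsearch Python client is installed and importable.
try:
    import elasticsearch
    print("elasticsearch client found at %s" % elasticsearch.__file__)
except ImportError:
    print("no elasticsearch module in this environment")

If the import fails here too, the problem is in the conda packaging rather than in DDT's code.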

OutOfMemoryError on statistics page

The statistics page shows the following error when there is a large number of pages indexed.

500 Internal Server Error

The server encountered an unexpected condition which prevented it from fulfilling the request.

Traceback (most recent call last):
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 217, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 61, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/home/aeciosantos/workspace/ddt/vis/server.py", line 357, in statistics
    pages_dates = self._crawler.getPagesDates(session)
  File "/home/aeciosantos/workspace/ddt/vis/crawler_model_adapter.py", line 194, in getPagesDates
    return self._crawlerModel.getPagesDates(session)
  File "/home/aeciosantos/workspace/ddt/models/crawlermodel.py", line 1042, in getPagesDates
    return get_pages_datetimes(es_info["activeCrawlerIndex"])
  File "/home/aeciosantos/workspace/ddt/elastic/get_documents.py", line 193, in get_pages_datetimes
    items = es.search(index_name, size=100000)["hits"]["hits"]
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/client/__init__.py", line 506, in search
    params=params, body=body)
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/transport.py", line 307, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/connection/http_urllib3.py", line 89, in perform_request
    self._raise_error(response.status, raw_data)
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/connection/base.py", line 105, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
TransportError: TransportError(500, u'OutOfMemoryError[Java heap space]')
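
The traceback points at es.search(index_name, size=100000) in get_pages_datetimes, which asks Elasticsearch to materialize up to 100,000 hits in a single response. A possible fix, sketched here with elasticsearch-py's scroll-based scan() helper (this is a sketch, not DDT's actual code):

# Sketch: stream hits with the scroll API instead of requesting 100,000 of
# them in one response, which is what blows the Java heap on the server.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()

def get_pages_datetimes(index_name):
    # scan() pages through the results, keeping each response small.
    return [hit["_source"] for hit in scan(es, index=index_name)]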

Error when tagging pages as relevant/irrelevant

So the following happens if you mark an item relevant or irrelevant many times. The graph will show the URL as being irrelevant in this case:

"[["http://www.internationalgramscisociety.org/"],0.4030953753789398,2.5563493917919113,
["","Irrelevant","Relevant","Irrelevant","Relevant","Irrelevant","Relevant","Irrelevant"]]"

If I were to mark this page Relevant it wouldn't stick; it would revert back to Irrelevant.

It seems to prefer listing pages as Irrelevant rather than Relevant. This behavior persists even if "Relevant" is the first item in the tags list. This URL is marked as irrelevant in the graph:

"[["http://www.icair.org/"],-10.98856167516711,-1.8956571626087264,["Relevant","Irrelevant"]]"

This bug requires some special attention and probably needs to be fixed very quickly @yamsgithub. Let me know if you need any more info.

Allow uploading a list of URLs or a single URL (instead of a web query)

Add another collapsible panel (collapsed by default) on the left, below the Web Search panel.
It should have a text box where you can input a URL, and an upload icon that allows you to upload a file with a list of URLs. The uploaded URLs should appear in the text box.
When the list of URLs is submitted, they are processed just like the results of a web search: the pages corresponding to the URLs are downloaded, and all information required to visualize them in DDT is extracted and stored in elasticsearch. A rough sketch of this flow follows below.
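
All names in this sketch are hypothetical; DDT's real pipeline would also extract terms and other metadata before indexing. Python 2, to match the project's environment:

# Rough sketch of the proposed flow: download each submitted URL and index
# the page so it can be visualized like a web-search result.
import urllib2
from elasticsearch import Elasticsearch

es = Elasticsearch()

def add_urls(urls, index_name):
    for url in urls:
        try:
            html = urllib2.urlopen(url, timeout=10).read()
        except Exception:
            continue  # skip URLs that fail to download
        es.index(index=index_name, doc_type="page",
                 body={"url": url, "html": html})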

Bugs arising from the pin terms work

  1. When new terms are added, they disappear after the update is done.
  2. Sometimes a term added as relevant (blue) shows up as irrelevant (red) in the terms list.

Web Search Broken

The web search for domains is currently breaking for me. This is the error I see in my terminal (judging by the stack trace, the query string "microcap" is being treated as a file name by BingSearch.read_queries, which then fails to open it):

Get the top 100 results
None
java.io.FileNotFoundException: microcap (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at java.io.FileReader.<init>(FileReader.java:72)
    at BingSearch.read_queries(BingSearch.java:43)
    at BingSearch.main(BingSearch.java:165)

Allow accessing the page from the term context snippet

This involves 2 tasks:

  1. Enable shift+click on a term, which disables the mouseover on the terms and makes the snippet window persistent for the selected term
  2. Allow clicking on each term context snippet, which opens the page containing the snippet in a browser

DDT won't plot pages (master branch)

Steps to reproduce:

  1. Make a clean repository pull
  2. Run 'make'
  3. Run 'source activate ddt'
  4. Run 'supervisord'
  5. Open tool and create a new domain
  6. Issue a query
  7. Click update button
  8. Pages will be downloaded and shown in the 'page summary', but not plotted in the 2d visualization

Is this caused by the recent change to use bokeh? Or is the bokeh code still in a separate branch?

Add a modal window explaining the 2D projection

For first-time users, it's not clear what the 2D visualization is doing. We could add a help button in the corner, which opens a modal window with an explanation of the purpose of the visualization and what each dot and color in the plot means.

Add a menu bar and move some functionality to it

  • Add a menu bar at the top and move the following functionality into it to reduce the space used by the interface:
    1. Domains list
    2. Add new domain
    3. Clustering methods
    4. Model building
  • The menu bar should be fixed while scrolling.
  • It should display the name of the currently selected domain.
  • Display the name "Domain Discovery Tool" and the logo in the left corner.

Counts in page summary panel are not correct after clicking the "update" button

The panels with statistics about pages show wrong numbers after using the "update" button.
Steps to reproduce:

  • Create a new domain named "MachineLearning" and activate it.
  • Click the update button. The number of "crawled pages" is correct: 0
  • Add the following 4 URLs using the "Upload URLs" panel. Type the following text in the text box, then click "Submit".
https://en.wikipedia.org/wiki/Machine_learning
https://en.wikipedia.org/wiki/Statistical_classification
https://en.wikipedia.org/wiki/Cluster_analysis
https://en.wikipedia.org/wiki/Regression_analysis
  • Wait for the statistics to update automatically. The number of "crawled pages" is updated correctly to 4.
  • Click the update button. The number of "crawled pages" is wrong (8) and the number of "new pages" is also wrong (it shows 4 when it should be 0).
  • Filter the pages: type "tree" in the filter panel. The number of "crawled pages" is wrong (6) and the number of "new pages" is also wrong (4).

DDT - Build Model Hangs

Having created a new domain and tagged pages and terms as relevant and irrelevant, I click the "Build Model" button to generate a training data package for use with ACHE. The cursor stays busy for a long period (circa 30-45 minutes) and no content is created, even when I check the folders via the command line.

This applies to the local development instance v2.8.3 of DDT.

DDT in latest master branch cannot find static files

This regards commit f297f49.

After cloning the repo and following the installation/running instructions from the Wiki, I obtained an installation that could not find any of the static files. Here are some error logs from ./bin/ddt-dev after I refreshed the application page (localhost:8084):

127.0.0.1 - - [07/Dec/2015:12:23:41] "GET / HTTP/1.1" 200 3103 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /bootflat-2.0.4/css/bootstrap.min.css HTTP/1.1" 404 669 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /bootflat-2.0.4/css/bootflat.min.css HTTP/1.1" 404 666 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /bootstrap-datetimepicker-4.15.35/css/bootstrap-datetimepicker.min.css HTTP/1.1" 404 680 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/d3.slider.css HTTP/1.1" 404 657 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/jquery-ui.css HTTP/1.1" 404 658 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/jquery.urlive.css HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/jquery-1.10.0.min.js HTTP/1.1" 404 667 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/d3.v3.5.5.min.js HTTP/1.1" 404 664 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/jquery-ui.js HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/dropdowns-enhancement.min.css HTTP/1.1" 404 667 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/crawler-white.css HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/d3.layout.cloud.js HTTP/1.1" 404 666 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/jquery.urlive.js HTTP/1.1" 404 664 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/d3.lasso.js HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/bootstrap.min.js HTTP/1.1" 404 663 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/moment.js HTTP/1.1" 404 658 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/d3.slider.js HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/dropdowns-enhancement.js HTTP/1.1" 404 668 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/queue.min.js HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /bootstrap-datetimepicker-4.15.35/js/bootstrap-datetimepicker.min.js HTTP/1.1" 404 681 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/pageslandscape.js HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/sigslot_core.js HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/tagsgallery.js HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/dataaccess.js HTTP/1.1" 404 659 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/crawlervis.js HTTP/1.1" 404 659 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/pagesgallery.js HTTP/1.1" 404 659 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/crawlersigslots.js HTTP/1.1" 404 662 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/seedcrawlerstatslist.js HTTP/1.1" 404 665 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/snippetsviewer.js HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/wordlist.js HTTP/1.1" 404 657 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/utils.js HTTP/1.1" 404 656 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /img/nyu_stacked_black.png HTTP/1.1" 404 663 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"

I followed the installation instructions that use the Makefile (so, just make). I initialized the conda environment with

conda env create
source activate ddt

at the root of the repo.

Then I ran the app with elasticsearch in one terminal and ./bin/ddt-dev in another, both at the root of the repo.

I'll go now and check if the manual installation instructions work.

Don't assume that DDT lives on the top-level domain

We'll be serving DDT from an explorer.io page, where it will live as explorer.io/ddt. DDT contains a few absolute URL references that need to be changed to relative URLs. Alternatively, we could define an HTTP_BASE variable for you that specifies your base URL.
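
For the HTTP_BASE option, one possible shape, assuming the CherryPy stack DDT already uses (the names here are illustrative, not DDT's actual server code):

# Sketch: mount the app under a configurable base path instead of "/", so
# generated links can be made relative to HTTP_BASE.
import cherrypy

HTTP_BASE = "/ddt"  # "" when DDT is served at the domain root

class Root(object):
    @cherrypy.expose
    def index(self):
        return "DDT served under %s" % (HTTP_BASE or "/")

cherrypy.tree.mount(Root(), script_name=HTTP_BASE)
cherrypy.engine.start()
cherrypy.engine.block()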

DDT - Standalone; ddt (exit status 1; not expected)

Hi folks, I am getting some issues with the standalone deployment. The package builds (make) and I can then run ddt, but I can't open the application on localhost:8084. The terminal shows this pattern of starting and then exiting ddt roughly every three minutes. A sample is pasted below:


2016-03-16 12:10:40,956 INFO success: ddt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-03-16 12:13:00,284 INFO exited: ddt (exit status 1; not expected)
2016-03-16 12:13:01,357 INFO spawned: 'ddt' with pid 28709
2016-03-16 12:13:02,359 INFO success: ddt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-03-16 12:16:51,898 INFO exited: ddt (exit status 1; not expected)
2016-03-16 12:16:52,901 INFO spawned: 'ddt' with pid 28849
2016-03-16 12:16:53,903 INFO success: ddt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)


Log files are attached. I'm not sure if this is a local config issue with the system I'm using (Ubuntu 14.04), but I thought I'd raise it, as I checked the other issues and couldn't find anything similar. Any advice or suggestions would be much appreciated! Thanks.

ddt-stderr---supervisor-SV4uTK.log.txt

Bugs and other issues in bokeh clustering plot

  1. The URL tooltips should appear only after the mouse is stationary; tooltips appearing as the mouse moves over the circles are not optimal
  2. A non-deterministic bug where a new instance of CrawlerModel is created, leaving the _domains object unset and causing the NoneType error attached below (a defensive sketch follows the logs). The bug manifests while tagging pages
  3. The update session event should be replaced by a call to the sessionInfo method in crawlervis.js
  4. Bokeh plot doesn't show pages close to the borders #62

Here is the error:

[26/Jan/2016:10:51:38] HTTP 
Request Headers:
  Content-Length: 1364
  REFERER: http://localhost:8084/seedcrawler
  HOST: localhost:8084
  ORIGIN: http://localhost:8084
  CONNECTION: keep-alive
  Remote-Addr: 127.0.0.1
  ACCEPT: */*
  USER-AGENT: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
  X-REQUESTED-WITH: XMLHttpRequest
  ACCEPT-LANGUAGE: en-US,en;q=0.8
  Content-Type: application/x-www-form-urlencoded; charset=UTF-8
  ACCEPT-ENCODING: gzip, deflate
[26/Jan/2016:10:51:38] HTTP 
Traceback (most recent call last):
  File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 217, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 61, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/server.py", line 206, in setPagesTag
    self._crawler.setPagesTag(pages, tag, applyTagFlag, session)
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/crawler_model_adapter.py", line 111, in setPagesTag
    self._crawlerModel.setPagesTag(pages, tag, applyTagFlag, session)
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 732, in setPagesTag
    es_info = self.esInfo(session['domainId'])
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 122, in esInfo
    "activeCrawlerIndex": self._domains[domainId]['index'],
TypeError: 'NoneType' object has no attribute '__getitem__'
[26/Jan/2016:10:51:38] HTTP 
Request Headers:
  REFERER: http://localhost:8084/seedcrawler
  HOST: localhost:8084
  CONNECTION: keep-alive
  Remote-Addr: 127.0.0.1
  ACCEPT: */*
  USER-AGENT: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
  X-REQUESTED-WITH: XMLHttpRequest
  ACCEPT-LANGUAGE: en-US,en;q=0.8
  ACCEPT-ENCODING: gzip, deflate, sdch
[26/Jan/2016:10:51:38] HTTP 
Traceback (most recent call last):
  File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 217, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 61, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/server.py", line 273, in getBokehPlot
    data = self._crawler.getPages(session)
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/crawler_model_adapter.py", line 95, in getPages
    return self._crawlerModel.getPages(session)
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 627, in getPages
    es_info = self.esInfo(session['domainId'])
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 122, in esInfo
    "activeCrawlerIndex": self._domains[domainId]['index'],
TypeError: 'NoneType' object has no attribute '__getitem__'
127.0.0.1 - - [26/Jan/2016:10:51:38] "POST /setPagesTag HTTP/1.1" 500 2103 "http://localhost:8084/seedcrawler" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [26/Jan/2016:10:51:38] "GET /getBokehPlot?session=%7B%22activeProjectionAlg%22%3A%22Group+by+Similarity%22%2C%22domainId%22%3A%22AVJBRZHIIf8LCshQL9xi%22%2C%22pagesCap%22%3A%22100%22%2C%22fromDate%22%3Anull%2C%22toDate%22%3Anull%2C%22filter%22%3Anull%2C%22pageRetrievalCriteria%22%3A%22Most+Recent%22%7D HTTP/1.1" 500 2053 "http://localhost:8084/seedcrawler" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
(The same getBokehPlot request headers and traceback were logged a second time.)
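
Both tracebacks end in esInfo dereferencing self._domains while it is None. A defensive sketch (the names are taken from the traceback; the real method likely returns more fields) that fails loudly instead:

# Defensive sketch for CrawlerModel.esInfo: raise a clear error when
# _domains was never populated, instead of dying on NoneType.__getitem__.
def esInfo(self, domainId):
    if self._domains is None or domainId not in self._domains:
        raise ValueError("domain %r not loaded in this CrawlerModel" % domainId)
    return {"activeCrawlerIndex": self._domains[domainId]["index"]}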

Queries Plot Breaks with Large Queries

With very large queries, the queries plot has difficulty rendering lines. There are a number of factors that may be causing this, including conflicts with the new forwardlinks/backlinks feature.

No License

Please include a license for the project. In the DARPA catalog this project is listed as BSD-licensed, but there is no indication of that in the repo.

Add some statistics of the selected corpus

On the menu, add a tab to view statistics of the data for a selected domain, using bokeh (a minimal plotting sketch follows the list). These could include:

  1. A summary of the queries issued thus far
  2. The domains that were crawled
  3. Statistics such as the number of pages per query and pages per keyword
  4. Page and query statistics over time
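
A minimal sketch of one such plot (the data and numbers are invented, and this uses the plain bokeh plotting API rather than DDT's code):

# Illustrative sketch: a bokeh bar chart for pages-per-query, one of the
# statistics proposed above. Assumes a bokeh version with vbar support.
from bokeh.plotting import figure, output_file, show

queries = ["machine learning", "classification", "clustering"]
pages_per_query = [42, 17, 23]

p = figure(x_range=queries, title="Pages per query")
p.vbar(x=queries, top=pages_per_query, width=0.8)

output_file("domain_stats.html")
show(p)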

Clustering Graph Issues

  • There is a region along the margin where lasso selection does not work
  • Sometimes the pages close to but outside the lasso selection are tagged
  • Some pages within the lasso selection are not selected or tagged

Bokeh plot doesn't show pages close to the borders

The visualization panel sometimes doesn't show all pages that are close to the borders.
Steps to reproduce:

  • Create a new domain named "ML" and activate it.
  • Add the following 4 URLs using the "Upload URLs" panel. Type the following text in the text box, then click "Submit".
https://en.wikipedia.org/wiki/Machine_learning
https://en.wikipedia.org/wiki/Statistical_classification
https://en.wikipedia.org/wiki/Cluster_analysis
https://en.wikipedia.org/wiki/Regression_analysis
  • Click the update button. 4 pages are shown.
  • Filter the pages: type "tree" in the filter panel and search. The number of matches is 2, but depending on the resolution of the screen they will not be plotted. Try resizing your browser window to notice this. It can be seen in the attached image.
  • The image also shows (see the dashed line of the lasso selection) that there is a big gap between the borders of the plot (the dense square line) and the area where the pages are actually plotted.

(attached image: ddt-bug)
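
One possible mitigation, sketched with the plain bokeh API (the coordinates are made up; this is not DDT's plotting code): pad the data ranges explicitly so points at the extremes are not clipped at the borders.

# Sketch: compute the data extent and add 5% padding on each side so that
# extreme points stay inside the visible plot area.
from bokeh.models import Range1d
from bokeh.plotting import figure, output_file, show

xs = [-10.99, 0.40, 2.56, 7.13]
ys = [-1.90, 2.56, 0.12, -4.47]

pad_x = (max(xs) - min(xs)) * 0.05
pad_y = (max(ys) - min(ys)) * 0.05
p = figure(x_range=Range1d(min(xs) - pad_x, max(xs) + pad_x),
           y_range=Range1d(min(ys) - pad_y, max(ys) + pad_y))
p.circle(xs, ys, size=8)

output_file("padded_plot.html")
show(p)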

Bind for 0.0.0.0:8084 failed: port is already allocated

Whilst trying to restart the docker deployment of DDT I get the "port is already allocated" message.

I checked iptables and had the following entries, tried to remove the 8084 NAT entry manually (with the command shown after the listing), and recreated the images, but got the same error.

I resolved this by removing the docker images and then removing the domain_discovery_tool directory, creating a new clone and rebuilding.

~/domain_discovery_tool$ sudo iptables -t nat -L -n --line-numbers
Chain PREROUTING (policy ACCEPT)
num target prot opt source destination
1 DOCKER all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
num target prot opt source destination

Chain OUTPUT (policy ACCEPT)
num target prot opt source destination
1 DOCKER all -- 0.0.0.0/0 !127.0.0.0/8 ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT)
num target prot opt source destination
1 MASQUERADE all -- 172.17.0.0/16 0.0.0.0/0
2 MASQUERADE tcp -- 172.17.0.1 172.17.0.1 tcp dpt:8084
3 MASQUERADE tcp -- 172.17.0.1 172.17.0.1 tcp dpt:9200

Chain DOCKER (2 references)
num target prot opt source destination
1 DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8084 to:172.17.0.1:8084
2 DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:9200 to:172.17.0.1:9200

The command used to try removing the NAT entry:

sudo iptables -D DOCKER 1 -t nat
