
domain_discovery_tool_deprecated's People

Contributors

aecio, ahmadia, brittainhard, canavandl, gocesarp, jdfekete, kienpt, rshandy, soniacq, yamsgithub


domain_discovery_tool_deprecated's Issues

Conda environment not working due to elasticsearch dependency

The latest elasticsearch conda package (1.7) doesn't include the executable file to start Elasticsearch, or its name has changed. Elasticsearch doesn't start when running supervisord:

$ supervisord
2015-10-15 14:24:27,667 INFO RPC interface 'supervisor' initialized
2015-10-15 14:24:27,668 CRIT Server 'inet_http_server' running without any HTTP authentication checking
2015-10-15 14:24:27,668 INFO supervisord started with pid 16981
2015-10-15 14:24:28,669 INFO spawnerr: can't find command 'elasticsearch'
2015-10-15 14:24:28,671 INFO spawned: 'ddt' with pid 16985
2015-10-15 14:24:29,703 INFO spawnerr: can't find command 'elasticsearch'
2015-10-15 14:24:29,704 INFO success: ddt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2015-10-15 14:24:31,706 INFO spawnerr: can't find command 'elasticsearch'
2015-10-15 14:24:34,710 INFO spawnerr: can't find command 'elasticsearch'
2015-10-15 14:24:34,710 INFO gave up: elasticsearch entered FATAL state, too many start retries too quickly

Specifying an elasticsearch version <= 1.6 in the environment.yml file fixes this problem, but another problem occurs when starting the DDT services through supervisord:

$ supervisord
2015-10-15 14:28:14,162 INFO RPC interface 'supervisor' initialized
2015-10-15 14:28:14,162 CRIT Server 'inet_http_server' running without any HTTP authentication checking
2015-10-15 14:28:14,162 INFO supervisord started with pid 17341
2015-10-15 14:28:15,165 INFO spawned: 'elasticsearch' with pid 17344
2015-10-15 14:28:15,168 INFO spawned: 'ddt' with pid 17345
2015-10-15 14:28:15,600 INFO exited: ddt (exit status 1; not expected)
2015-10-15 14:28:16,725 INFO success: elasticsearch entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2015-10-15 14:28:16,726 INFO spawned: 'ddt' with pid 17450
2015-10-15 14:28:17,196 INFO exited: ddt (exit status 1; not expected)
2015-10-15 14:28:19,239 INFO spawned: 'ddt' with pid 17627
2015-10-15 14:28:19,623 INFO exited: ddt (exit status 1; not expected)
2015-10-15 14:28:23,187 INFO spawned: 'ddt' with pid 17736
2015-10-15 14:28:23,590 INFO exited: ddt (exit status 1; not expected)
2015-10-15 14:28:24,591 INFO gave up: ddt entered FATAL state, too many start retries too quickly

and in the file ./logs/ddt-stderr---supervisor-YpGRaX.log:

Traceback (most recent call last):
  File "/home/aeciosantos/workspace/domain_discovery_tool/vis/server.py", line 5, in <module>
    from crawler_model_adapter import *
  File "/home/aeciosantos/workspace/domain_discovery_tool/vis/crawler_model_adapter.py", line 2, in <module>
    from models.crawlermodel import *
  File "/home/aeciosantos/workspace/domain_discovery_tool/models/crawlermodel.py", line 22, in <module>
    from elasticsearch import Elasticsearch
ImportError: No module named elasticsearch

@ahmadia @brittainhard Do you guys know if this is a problem with the elasticsearch packages or with DDT's conda env?

Thanks @felipemoraes for reporting the problem.
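
One quick way to narrow that down is to check, inside the activated conda env, whether the client library is importable at all. A diagnostic sketch (not part of DDT):

# Diagnostic sketch: run with the ddt conda env activated to check whether
# the elasticsearch Python client is installed and importable.
try:
    import elasticsearch
    print("elasticsearch client found at %s" % elasticsearch.__file__)
except ImportError:
    print("no elasticsearch module in this environment")

If the import fails here too, the problem is in the conda packaging rather than in DDT's code.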

OutOfMemoryError on statistics page

The statistics page shows the following error when there is a large number of pages indexed.

500 Internal Server Error

The server encountered an unexpected condition which prevented it from fulfilling the request.

Traceback (most recent call last):
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 217, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 61, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/home/aeciosantos/workspace/ddt/vis/server.py", line 357, in statistics
    pages_dates = self._crawler.getPagesDates(session)
  File "/home/aeciosantos/workspace/ddt/vis/crawler_model_adapter.py", line 194, in getPagesDates
    return self._crawlerModel.getPagesDates(session)
  File "/home/aeciosantos/workspace/ddt/models/crawlermodel.py", line 1042, in getPagesDates
    return get_pages_datetimes(es_info["activeCrawlerIndex"])
  File "/home/aeciosantos/workspace/ddt/elastic/get_documents.py", line 193, in get_pages_datetimes
    items = es.search(index_name, size=100000)["hits"]["hits"]
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/client/__init__.py", line 506, in search
    params=params, body=body)
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/transport.py", line 307, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/connection/http_urllib3.py", line 89, in perform_request
    self._raise_error(response.status, raw_data)
  File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/connection/base.py", line 105, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
TransportError: TransportError(500, u'OutOfMemoryError[Java heap space]')
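
The traceback points at es.search(index_name, size=100000) in get_pages_datetimes, which asks Elasticsearch to materialize up to 100,000 hits in a single response. A possible fix, sketched here with elasticsearch-py's scroll-based scan() helper (this is a sketch, not DDT's actual code):

# Sketch: stream hits with the scroll API instead of requesting 100,000 of
# them in one response, which is what blows the Java heap on the server.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()

def get_pages_datetimes(index_name):
    # scan() pages through the results, keeping each response small.
    return [hit["_source"] for hit in scan(es, index=index_name)]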

Error when tagging pages as relevant/irrelevant

So the following happens if you mark an item relevant or irrelevant many times. The graph will show the URL as being irrelevant in this case:

"[["http://www.internationalgramscisociety.org/"],0.4030953753789398,2.5563493917919113,
["","Irrelevant","Relevant","Irrelevant","Relevant","Irrelevant","Relevant","Irrelevant"]]"

If I were to mark this page Relevant it wouldn't stick; it would revert back to Irrelevant.

It seems to prefer listing pages as Irrelevant rather than Relevant. This behavior persists even if "Relevant" is the first item in the tags list. This URL is marked as irrelevant in the graph:

"[["http://www.icair.org/"],-10.98856167516711,-1.8956571626087264,["Relevant","Irrelevant"]]"

This bug requires some special attention and probably needs to be fixed very quickly @yamsgithub. Let me know if you need any more info.

Allow uploading a list of URLs or a single URL (instead of a web query)

Add another collapsible panel (collapsed by default) on the left, below the Web Search panel.
It should have a text box where you can input a URL, and an upload icon that allows you to upload a file with a list of URLs. The uploaded URLs should appear in the text box.
When the list of URLs is submitted, they are processed just like the results of a web search: the pages corresponding to the URLs are downloaded, and all information required to visualize them in DDT is extracted and stored in elasticsearch. A rough sketch of this flow follows below.
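
All names in this sketch are hypothetical; DDT's real pipeline would also extract terms and other metadata before indexing. Python 2, to match the project's environment:

# Rough sketch of the proposed flow: download each submitted URL and index
# the page so it can be visualized like a web-search result.
import urllib2
from elasticsearch import Elasticsearch

es = Elasticsearch()

def add_urls(urls, index_name):
    for url in urls:
        try:
            html = urllib2.urlopen(url, timeout=10).read()
        except Exception:
            continue  # skip URLs that fail to download
        es.index(index=index_name, doc_type="page",
                 body={"url": url, "html": html})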

Bugs arising from the pin terms work

  1. When new terms are added, they disappear after the update is done.
  2. Sometimes a term added as relevant (blue) shows up as irrelevant (red) in the terms list.

Web Search Broken

The web search for domains is currently breaking for me. This is the error I see in my terminal (judging by the stack trace, the query string "microcap" is being treated as a file name by BingSearch.read_queries, which then fails to open it):

Get the top 100 results
None
java.io.FileNotFoundException: microcap (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at java.io.FileReader.<init>(FileReader.java:72)
    at BingSearch.read_queries(BingSearch.java:43)
    at BingSearch.main(BingSearch.java:165)

Allow accessing the page from the term context snippet

This involves 2 tasks:

  1. Enable shift+click on a term, which disables the mouseover on the terms and makes the snippet window persistent for the selected term
  2. Allow clicking on each term context snippet, which opens the page containing the snippet in a browser

DDT won't plot pages (master branch)

Steps to reproduce:

  1. Make a clean repository pull
  2. Run 'make'
  3. Run 'source activate ddt'
  4. Run 'supervisord'
  5. Open tool and create a new domain
  6. Issue a query
  7. Click update button
  8. Pages will be downloaded and shown in the 'page summary', but not plotted in the 2d visualization

Is this caused by the recent change to use bokeh? Or is the bokeh code still in a separate branch?

Add a modal window explaining the 2D projection

For first-time users, it's not clear what the 2D visualization is doing. We could add a help button in the corner, which opens a modal window with an explanation of the purpose of the visualization and what each dot and color in the plot means.

Add a menu bar and move some functionality to it

  • Add a menu bar at the top and move the following functionality into it to reduce the space used by the interface:
    1. Domains list
    2. Add new domain
    3. Clustering methods
    4. Model building
  • The menu bar should be fixed while scrolling.
  • It should display the name of the currently selected domain.
  • Display the name "Domain Discovery Tool" and the logo in the left corner.

Counts in page summary panel are not correct after clicking the "update" button

The panels with statistics about pages show wrong numbers after using the "update" button.
Steps to reproduce:

  • Create a new domain named "MachineLearning" and activate it.
  • Click the update button. The number of "crawled pages" is correct: 0
  • Add the following 4 URLs using the "Upload URLs" panel. Type the following text in the text box, then click "Submit".
https://en.wikipedia.org/wiki/Machine_learning
https://en.wikipedia.org/wiki/Statistical_classification
https://en.wikipedia.org/wiki/Cluster_analysis
https://en.wikipedia.org/wiki/Regression_analysis
  • Wait for the statistics to update automatically. The number of "crawled pages" is updated correctly to 4.
  • Click the update button. The number of "crawled pages" is wrong (8) and the number of "new pages" is also wrong (it shows 4 when it should be 0).
  • Filter the pages: type "tree" in the filter panel. The number of "crawled pages" is wrong (6) and the number of "new pages" is also wrong (4).

DDT - Build Model Hangs

Having created a new domain and tagged pages and terms as relevant and irrelevant, I click the "Build Model" button to generate a training data package for use with ACHE. The cursor stays busy for a long period (circa 30-45 minutes) and no content is created, even when I check the folders via the command line.

This applies to the local development instance v2.8.3 of DDT.

DDT in latest master branch cannot find static files

This regards commit f297f49.

After cloning the repo and following the installation/running instructions from the Wiki, I obtained an installation that could not find any of the static files. Here are some error logs from ./bin/ddt-dev after I refreshed the application page (localhost:8084):

127.0.0.1 - - [07/Dec/2015:12:23:41] "GET / HTTP/1.1" 200 3103 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /bootflat-2.0.4/css/bootstrap.min.css HTTP/1.1" 404 669 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /bootflat-2.0.4/css/bootflat.min.css HTTP/1.1" 404 666 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /bootstrap-datetimepicker-4.15.35/css/bootstrap-datetimepicker.min.css HTTP/1.1" 404 680 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/d3.slider.css HTTP/1.1" 404 657 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/jquery-ui.css HTTP/1.1" 404 658 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/jquery.urlive.css HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/jquery-1.10.0.min.js HTTP/1.1" 404 667 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/d3.v3.5.5.min.js HTTP/1.1" 404 664 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/jquery-ui.js HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/dropdowns-enhancement.min.css HTTP/1.1" 404 667 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/crawler-white.css HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/d3.layout.cloud.js HTTP/1.1" 404 666 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/jquery.urlive.js HTTP/1.1" 404 664 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/d3.lasso.js HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/bootstrap.min.js HTTP/1.1" 404 663 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/moment.js HTTP/1.1" 404 658 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/d3.slider.js HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/dropdowns-enhancement.js HTTP/1.1" 404 668 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/queue.min.js HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /bootstrap-datetimepicker-4.15.35/js/bootstrap-datetimepicker.min.js HTTP/1.1" 404 681 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/pageslandscape.js HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/sigslot_core.js HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/tagsgallery.js HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/dataaccess.js HTTP/1.1" 404 659 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/crawlervis.js HTTP/1.1" 404 659 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/pagesgallery.js HTTP/1.1" 404 659 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/crawlersigslots.js HTTP/1.1" 404 662 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/seedcrawlerstatslist.js HTTP/1.1" 404 665 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/snippetsviewer.js HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/wordlist.js HTTP/1.1" 404 657 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/utils.js HTTP/1.1" 404 656 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /img/nyu_stacked_black.png HTTP/1.1" 404 663 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"

I followed the installation instructions that use the Makefile (so, just make). I initialized the conda environment with

conda env create
source activate ddt

at the root of the repo.

Then I ran the app with elasticsearch in one terminal and ./bin/ddt-dev in another, both at the root of the repo.

I'll go now and check if the manual installation instructions work.

Don't assume that DDT lives on the top-level domain

We'll be serving DDT from an explorer.io page, where it will live as explorer.io/ddt. DDT contains a few absolute URL references that need to be changed to relative URLs. Alternatively, we could define an HTTP_BASE variable for you that specifies your base URL.
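
For the HTTP_BASE option, one possible shape, assuming the CherryPy stack DDT already uses (the names here are illustrative, not DDT's actual server code):

# Sketch: mount the app under a configurable base path instead of "/", so
# generated links can be made relative to HTTP_BASE.
import cherrypy

HTTP_BASE = "/ddt"  # "" when DDT is served at the domain root

class Root(object):
    @cherrypy.expose
    def index(self):
        return "DDT served under %s" % (HTTP_BASE or "/")

cherrypy.tree.mount(Root(), script_name=HTTP_BASE)
cherrypy.engine.start()
cherrypy.engine.block()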

DDT - Standalone; ddt (exit status 1; not expected)

Hi folks, I am getting some issues with the standalone deployment. The package builds (make) and I can then run ddt, but I can't open the application on localhost:8084. The terminal shows this pattern of starting and then exiting ddt roughly every three minutes. A sample is pasted below:


2016-03-16 12:10:40,956 INFO success: ddt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-03-16 12:13:00,284 INFO exited: ddt (exit status 1; not expected)
2016-03-16 12:13:01,357 INFO spawned: 'ddt' with pid 28709
2016-03-16 12:13:02,359 INFO success: ddt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-03-16 12:16:51,898 INFO exited: ddt (exit status 1; not expected)
2016-03-16 12:16:52,901 INFO spawned: 'ddt' with pid 28849
2016-03-16 12:16:53,903 INFO success: ddt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)


Log files are attached. I'm not sure if this is a local config issue with the system I'm using (Ubuntu 14.04), but I thought I'd raise it, as I checked the other issues and couldn't find anything similar. Any advice or suggestions would be much appreciated! Thanks.

ddt-stderr---supervisor-SV4uTK.log.txt

Bugs and other issues in bokeh clustering plot

  1. The URL tooltips should appear only after the mouse is stationary; tooltips appearing as the mouse moves over the circles are not optimal
  2. A non-deterministic bug where a new instance of CrawlerModel is created, leaving the _domains object unset and causing the NoneType error attached below (a defensive sketch follows the logs). The bug manifests while tagging pages
  3. The update session event should be replaced by a call to the sessionInfo method in crawlervis.js
  4. Bokeh plot doesn't show pages close to the borders #62

Here is the error:

[26/Jan/2016:10:51:38] HTTP 
Request Headers:
  Content-Length: 1364
  REFERER: http://localhost:8084/seedcrawler
  HOST: localhost:8084
  ORIGIN: http://localhost:8084
  CONNECTION: keep-alive
  Remote-Addr: 127.0.0.1
  ACCEPT: */*
  USER-AGENT: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
  X-REQUESTED-WITH: XMLHttpRequest
  ACCEPT-LANGUAGE: en-US,en;q=0.8
  Content-Type: application/x-www-form-urlencoded; charset=UTF-8
  ACCEPT-ENCODING: gzip, deflate
[26/Jan/2016:10:51:38] HTTP 
Traceback (most recent call last):
  File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 217, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 61, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/server.py", line 206, in setPagesTag
    self._crawler.setPagesTag(pages, tag, applyTagFlag, session)
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/crawler_model_adapter.py", line 111, in setPagesTag
    self._crawlerModel.setPagesTag(pages, tag, applyTagFlag, session)
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 732, in setPagesTag
    es_info = self.esInfo(session['domainId'])
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 122, in esInfo
    "activeCrawlerIndex": self._domains[domainId]['index'],
TypeError: 'NoneType' object has no attribute '__getitem__'
[26/Jan/2016:10:51:38] HTTP 
Request Headers:
  REFERER: http://localhost:8084/seedcrawler
  HOST: localhost:8084
  CONNECTION: keep-alive
  Remote-Addr: 127.0.0.1
  ACCEPT: */*
  USER-AGENT: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
  X-REQUESTED-WITH: XMLHttpRequest
  ACCEPT-LANGUAGE: en-US,en;q=0.8
  ACCEPT-ENCODING: gzip, deflate, sdch
[26/Jan/2016:10:51:38] HTTP 
Traceback (most recent call last):
  File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 217, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 61, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/server.py", line 273, in getBokehPlot
    data = self._crawler.getPages(session)
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/crawler_model_adapter.py", line 95, in getPages
    return self._crawlerModel.getPages(session)
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 627, in getPages
    es_info = self.esInfo(session['domainId'])
  File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 122, in esInfo
    "activeCrawlerIndex": self._domains[domainId]['index'],
TypeError: 'NoneType' object has no attribute '__getitem__'
127.0.0.1 - - [26/Jan/2016:10:51:38] "POST /setPagesTag HTTP/1.1" 500 2103 "http://localhost:8084/seedcrawler" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [26/Jan/2016:10:51:38] "GET /getBokehPlot?session=%7B%22activeProjectionAlg%22%3A%22Group+by+Similarity%22%2C%22domainId%22%3A%22AVJBRZHIIf8LCshQL9xi%22%2C%22pagesCap%22%3A%22100%22%2C%22fromDate%22%3Anull%2C%22toDate%22%3Anull%2C%22filter%22%3Anull%2C%22pageRetrievalCriteria%22%3A%22Most+Recent%22%7D HTTP/1.1" 500 2053 "http://localhost:8084/seedcrawler" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
(The same getBokehPlot request headers and traceback were logged a second time.)
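
Both tracebacks end in esInfo dereferencing self._domains while it is None. A defensive sketch (the names are taken from the traceback; the real method likely returns more fields) that fails loudly instead:

# Defensive sketch for CrawlerModel.esInfo: raise a clear error when
# _domains was never populated, instead of dying on NoneType.__getitem__.
def esInfo(self, domainId):
    if self._domains is None or domainId not in self._domains:
        raise ValueError("domain %r not loaded in this CrawlerModel" % domainId)
    return {"activeCrawlerIndex": self._domains[domainId]["index"]}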

Queries Plot Breaks with Large Queries

With very large queries, the queries plot has difficulty rendering lines. There are a number of factors that may be causing this, including conflicts with the new forwardlinks/backlinks feature.

No License

Please include a license for the project. In the DARPA catalog this project is listed as BSD-licensed, but there is no indication of that in the repo.

Add some statistics of the selected corpus

On the menu, add a tab to view statistics of the data for a selected domain, using bokeh (a minimal plotting sketch follows the list). These could include:

  1. A summary of the queries issued thus far
  2. The domains that were crawled
  3. Statistics such as the number of pages per query and pages per keyword
  4. Page and query statistics over time
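
A minimal sketch of one such plot (the data and numbers are invented, and this uses the plain bokeh plotting API rather than DDT's code):

# Illustrative sketch: a bokeh bar chart for pages-per-query, one of the
# statistics proposed above. Assumes a bokeh version with vbar support.
from bokeh.plotting import figure, output_file, show

queries = ["machine learning", "classification", "clustering"]
pages_per_query = [42, 17, 23]

p = figure(x_range=queries, title="Pages per query")
p.vbar(x=queries, top=pages_per_query, width=0.8)

output_file("domain_stats.html")
show(p)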

Clustering Graph Issues

  • There is a region along the margin where lasso selection does not work
  • Sometimes the pages close to but outside the lasso selection are tagged
  • Some pages within the lasso selection are not selected or tagged

Bokeh plot doesn't show pages close to the borders

The visualization panel sometimes doesn't show all pages that are close to the borders.
Steps to reproduce:

  • Create a new domain named "ML" and activate it.
  • Add the following 4 URLs using the "Upload URLs" panel. Type the following text in the text box, then click "Submit".
https://en.wikipedia.org/wiki/Machine_learning
https://en.wikipedia.org/wiki/Statistical_classification
https://en.wikipedia.org/wiki/Cluster_analysis
https://en.wikipedia.org/wiki/Regression_analysis
  • Click the update button. 4 pages are shown.
  • Filter the pages: type "tree" in the filter panel and search. The number of matches is 2, but depending on the resolution of the screen they will not be plotted. Try resizing your browser window to notice this. It can be seen in the attached image.
  • The image also shows (see the dashed line of the lasso selection) that there is a big gap between the borders of the plot (the dense square line) and the area where the pages are actually plotted.

(attached image: ddt-bug)
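
One possible mitigation, sketched with the plain bokeh API (the coordinates are made up; this is not DDT's plotting code): pad the data ranges explicitly so points at the extremes are not clipped at the borders.

# Sketch: compute the data extent and add 5% padding on each side so that
# extreme points stay inside the visible plot area.
from bokeh.models import Range1d
from bokeh.plotting import figure, output_file, show

xs = [-10.99, 0.40, 2.56, 7.13]
ys = [-1.90, 2.56, 0.12, -4.47]

pad_x = (max(xs) - min(xs)) * 0.05
pad_y = (max(ys) - min(ys)) * 0.05
p = figure(x_range=Range1d(min(xs) - pad_x, max(xs) + pad_x),
           y_range=Range1d(min(ys) - pad_y, max(ys) + pad_y))
p.circle(xs, ys, size=8)

output_file("padded_plot.html")
show(p)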

Bind for 0.0.0.0:8084 failed: port is already allocated

Whilst trying to restart the docker deployment of DDT I get the "port is already allocated" message.

I checked iptables and had the following entries, tried to remove the 8084 NAT entry manually (with the command shown after the listing), and recreated the images, but got the same error.

I resolved this by removing the docker images and then removing the domain_discovery_tool directory, creating a new clone and rebuilding.

~/domain_discovery_tool$ sudo iptables -t nat -L -n --line-numbers
Chain PREROUTING (policy ACCEPT)
num target prot opt source destination
1 DOCKER all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
num target prot opt source destination

Chain OUTPUT (policy ACCEPT)
num target prot opt source destination
1 DOCKER all -- 0.0.0.0/0 !127.0.0.0/8 ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT)
num target prot opt source destination
1 MASQUERADE all -- 172.17.0.0/16 0.0.0.0/0
2 MASQUERADE tcp -- 172.17.0.1 172.17.0.1 tcp dpt:8084
3 MASQUERADE tcp -- 172.17.0.1 172.17.0.1 tcp dpt:9200

Chain DOCKER (2 references)
num target prot opt source destination
1 DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8084 to:172.17.0.1:8084
2 DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:9200 to:172.17.0.1:9200

The command used to try removing the NAT entry:

sudo iptables -D DOCKER 1 -t nat
