vida-nyu / domain_discovery_tool_deprecated
Seed acquisition tool to bootstrap focused crawlers
The latest elasticsearch conda package (1.7) doesn't include the executable file to start Elasticsearch, or its name changed. Elasticsearch doesn't start when running supervisord:
$ supervisord
2015-10-15 14:24:27,667 INFO RPC interface 'supervisor' initialized
2015-10-15 14:24:27,668 CRIT Server 'inet_http_server' running without any HTTP authentication checking
2015-10-15 14:24:27,668 INFO supervisord started with pid 16981
2015-10-15 14:24:28,669 INFO spawnerr: can't find command 'elasticsearch'
2015-10-15 14:24:28,671 INFO spawned: 'ddt' with pid 16985
2015-10-15 14:24:29,703 INFO spawnerr: can't find command 'elasticsearch'
2015-10-15 14:24:29,704 INFO success: ddt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2015-10-15 14:24:31,706 INFO spawnerr: can't find command 'elasticsearch'
2015-10-15 14:24:34,710 INFO spawnerr: can't find command 'elasticsearch'
2015-10-15 14:24:34,710 INFO gave up: elasticsearch entered FATAL state, too many start retries too quickly
Specifying an elasticsearch version <= 1.6 in the environment.yml file fixes this problem, but another problem appears when starting the DDT services through supervisord:
$ supervisord
2015-10-15 14:28:14,162 INFO RPC interface 'supervisor' initialized
2015-10-15 14:28:14,162 CRIT Server 'inet_http_server' running without any HTTP authentication checking
2015-10-15 14:28:14,162 INFO supervisord started with pid 17341
2015-10-15 14:28:15,165 INFO spawned: 'elasticsearch' with pid 17344
2015-10-15 14:28:15,168 INFO spawned: 'ddt' with pid 17345
2015-10-15 14:28:15,600 INFO exited: ddt (exit status 1; not expected)
2015-10-15 14:28:16,725 INFO success: elasticsearch entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2015-10-15 14:28:16,726 INFO spawned: 'ddt' with pid 17450
2015-10-15 14:28:17,196 INFO exited: ddt (exit status 1; not expected)
2015-10-15 14:28:19,239 INFO spawned: 'ddt' with pid 17627
2015-10-15 14:28:19,623 INFO exited: ddt (exit status 1; not expected)
2015-10-15 14:28:23,187 INFO spawned: 'ddt' with pid 17736
2015-10-15 14:28:23,590 INFO exited: ddt (exit status 1; not expected)
2015-10-15 14:28:24,591 INFO gave up: ddt entered FATAL state, too many start retries too quickly
and in the file ./logs/ddt-stderr---supervisor-YpGRaX.log:
Traceback (most recent call last):
File "/home/aeciosantos/workspace/domain_discovery_tool/vis/server.py", line 5, in <module>
from crawler_model_adapter import *
File "/home/aeciosantos/workspace/domain_discovery_tool/vis/crawler_model_adapter.py", line 2, in <module>
from models.crawlermodel import *
File "/home/aeciosantos/workspace/domain_discovery_tool/models/crawlermodel.py", line 22, in <module>
from elasticsearch import Elasticsearch
ImportError: No module named elasticsearch
@ahmadia @brittainhard Do you know whether this is a problem with the elasticsearch packages or with DDT's conda env?
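Whatever the root cause, the traceback only surfaces deep inside models/crawlermodel.py. A hedged sketch of a fail-fast dependency check that DDT's entry point could run first (the helper name and the module list are illustrative, not DDT's actual code):

```python
import importlib


def missing_modules(names):
    """Return the subset of `names` that cannot be imported in this environment.

    Calling this for ["elasticsearch", "cherrypy"] at startup would report a
    broken conda env immediately, instead of letting supervisord retry-loop on
    an ImportError raised from deep inside the model code.
    """
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

The startup script could then exit with a readable message listing `missing_modules(["elasticsearch", "cherrypy"])` before launching the server.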
Thanks @felipemoraes for reporting the problem.
The page statistics show the following error when there is a large number of pages indexed.
500 Internal Server Error
The server encountered an unexpected condition which prevented it from fulfilling the request.
Traceback (most recent call last):
File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
response.body = self.handler()
File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 217, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 61, in __call__
return self.callable(*self.args, **self.kwargs)
File "/home/aeciosantos/workspace/ddt/vis/server.py", line 357, in statistics
pages_dates = self._crawler.getPagesDates(session)
File "/home/aeciosantos/workspace/ddt/vis/crawler_model_adapter.py", line 194, in getPagesDates
return self._crawlerModel.getPagesDates(session)
File "/home/aeciosantos/workspace/ddt/models/crawlermodel.py", line 1042, in getPagesDates
return get_pages_datetimes(es_info["activeCrawlerIndex"])
File "/home/aeciosantos/workspace/ddt/elastic/get_documents.py", line 193, in get_pages_datetimes
items = es.search(index_name, size=100000)["hits"]["hits"]
File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/client/utils.py", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/client/__init__.py", line 506, in search
params=params, body=body)
File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/transport.py", line 307, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/connection/http_urllib3.py", line 89, in perform_request
self._raise_error(response.status, raw_data)
File "/home/aeciosantos/.anaconda2/envs/ddt/lib/python2.7/site-packages/elasticsearch-1.6.0-py2.7.egg/elasticsearch/connection/base.py", line 105, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
TransportError: TransportError(500, u'OutOfMemoryError[Java heap space]')
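The call `es.search(index_name, size=100000)` asks Elasticsearch to materialize up to 100,000 hits in a single response, which is what exhausts the Java heap. A sketch of the batching idea, with `search_fn` standing in for a hypothetical wrapper around `es.search` that accepts an offset and a page size (the real fix could equally use `elasticsearch.helpers.scan`, which streams results via the scroll API):

```python
def iter_hits(search_fn, page_size=500):
    """Yield hits page by page instead of in one size=100000 request.

    `search_fn(offset, size)` is an assumed callable returning a list of hits,
    e.g. a thin wrapper over es.search with from_/size parameters. Fetching in
    small pages keeps each Elasticsearch response bounded, so the server never
    has to hold the whole result set in heap at once.
    """
    offset = 0
    while True:
        page = search_fn(offset, page_size)
        if not page:
            return
        for hit in page:
            yield hit
        offset += len(page)
```

For `get_pages_datetimes`, the caller would also want to request only the timestamp field rather than whole documents, which shrinks each page further.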
So which of these are being used? There are multiple instances of what look like Python scripts and bash scripts doing the same thing, e.g. https://github.com/ViDA-NYU/domain_discovery_tool/blob/master/elastic/delete.py and https://github.com/ViDA-NYU/domain_discovery_tool/blob/master/elastic/delete_index.sh
Moreover, looking at the blame, it looks like the python script is much newer than the bash script.
Any thoughts on this, @yamsgithub @aecio?
So the following happens if you mark an item relevant or irrelevant many times. The graph will show the URL as being irrelevant in this case:
"[["http://www.internationalgramscisociety.org/"],0.4030953753789398,2.5563493917919113,
["","Irrelevant","Relevant","Irrelevant","Relevant","Irrelevant","Relevant","Irrelevant"]]"
If I were to mark this page Relevant it wouldn't stick, and it would be reverted back to Irrelevant.
It seems to prefer listing pages as Irrelevant rather than Relevant. This behavior persists even if "Relevant" is the first item in the tags list. This url is marked as irrelevant in the graph.
"[["http://www.icair.org/"],-10.98856167516711,-1.8956571626087264,["Relevant","Irrelevant"]]"
This bug requires some special attention and probably needs to be fixed very quickly, @yamsgithub. Let me know if you need any more info.
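The dumps above suggest tags are stored as an append-only history like `["", "Irrelevant", "Relevant", "Irrelevant"]`. Assuming that format (which is a guess from the dumps, not confirmed against DDT's index), the graph should display the user's latest non-empty entry, rather than applying any fixed ordering that happens to prefer "Irrelevant":

```python
def effective_tag(tag_history):
    """Return the tag the graph should display for a page.

    Assumes an append-only tag history list, as suggested by the dumps in
    this issue: the most recent non-empty entry wins. With this rule, marking
    a page Relevant after several toggles sticks, because only the last entry
    matters.
    """
    for tag in reversed(tag_history):
        if tag:
            return tag
    return None  # never tagged
```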
Add another collapsible panel (collapsed by default) on the left below the Web Search panel.
It should have a text box where you can input a URL.
And an upload icon that allows you to upload a file with a list of URLs. The uploaded URLs should appear in the text box.
When the list of URLs is submitted, it is processed just like the results of a web search: the pages corresponding to the URLs are downloaded, and all information required for visualization in DDT is extracted and stored in Elasticsearch.
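A minimal sketch of the "download, extract, store" step for one uploaded URL. Only the title extraction is shown; DDT's real extraction pipeline is richer, the fetch itself would use urllib/requests before handing HTML to a parser, and `page_document` is a hypothetical helper name:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Pull the <title> text out of a fetched page (a stand-in for DDT's
    fuller extraction of text and terms)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_document(url, html):
    """Build the dict that would be indexed into Elasticsearch for one URL."""
    parser = TitleParser()
    parser.feed(html)
    return {"url": url, "title": parser.title.strip(), "html": html}
```

Each submitted URL would be fetched, passed through `page_document`, and indexed, exactly as web-search results are today.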
Right now, DDT is installed via the instructions here: https://github.com/ViDA-NYU/domain_discovery_tool/blob/master/README.md
It would be great if we could boil this down to "vagrant up" via a Vagrantfile/Salt install.
Does this file need to be tracked by the repo? My PR at VIDA-NYU/domain_discovery_tool#28 just got a merge conflict because of a merged change in this file.
improved visualization, e.g., termite
Real-time, incremental LDA (PLSA - from MIT Lincoln Labs)
The folders lda_pipeline and seed_crawler_site contain legacy code that has not been modified in a long time. Check whether they are really used, and remove what is not used anymore.
The web search for domains is currently breaking for me. This is the error I see in my terminal.
Get the top 100 results
None
java.io.FileNotFoundException: microcap (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileReader.<init>(FileReader.java:72)
at BingSearch.read_queries(BingSearch.java:43)
at BingSearch.main(BingSearch.java:165)
Allow the user to start adding the domain name without having to click on the textbox.
This involves 2 tasks:
Steps to reproduce:
Is this caused by the recent change to use bokeh? Or is the bokeh code still in a separate branch?
For first-time users, it's not clear what the 2D visualization is doing. We could add a help button in the corner that opens a modal window explaining the purpose of the visualization and what each dot and color in the plot mean.
Currently you have to scroll down the pages, which moves the visualization out of the user's view.
The panels with statistics about pages show wrong numbers after using the "update" button.
Steps to reproduce:
https://en.wikipedia.org/wiki/Machine_learning
https://en.wikipedia.org/wiki/Statistical_classification
https://en.wikipedia.org/wiki/Cluster_analysis
https://en.wikipedia.org/wiki/Regression_analysis
We already have a button for adding new domains (crawlers). Another useful feature would be to have a button for removing the domains and all the data associated with it.
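Since each domain has its own Elasticsearch index, the backend for such a button could be as simple as the sketch below. The function and field names are hypothetical (not DDT's actual API); `es_client` is assumed to expose `indices.delete(index=...)` as elasticsearch-py does:

```python
def remove_domain(es_client, domains, domain_id):
    """Delete a domain and all data associated with it.

    Drops the domain's Elasticsearch index and forgets the in-memory domain
    entry, so a subsequent lookup of `domain_id` fails cleanly instead of
    pointing at a dangling index.
    """
    info = domains.pop(domain_id, None)
    if info is None:
        raise KeyError("unknown domain: %r" % domain_id)
    es_client.indices.delete(index=info["index"])
    return info
```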
Having created a new domain and tagged pages and terms as relevant and irrelevant, I click the "Build Model" button to generate a training data package for use with ACHE. The cursor stays busy for a long period (circa 30-45 minutes) and no content is created, even when I check the folders via the command line.
This applies to the local development instance v2.8.3 of DDT
Something like Google's status messages in Gmail. It should persist when scrolling.
On master, I can't make new queries because of the following error:
Error: Could not find or load main class GoogleSearch
Any feedback on this would be appreciated.
When downloading the same page again, its tags get reset.
This regards commit f297f49.
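One way to avoid the reset is to merge the freshly downloaded document into the existing indexed one, carrying over user-assigned fields. This is a sketch of the idea only; the actual field name for tags in DDT's index is an assumption here:

```python
def merge_page(existing, fresh, preserved_fields=("tag",)):
    """Merge a re-downloaded page into its previously indexed document.

    Takes all fields from the fresh crawl, but carries over user-assigned
    fields (e.g. tags) from the existing document when the fresh one doesn't
    set them, so re-downloading a page no longer wipes its tags.
    """
    merged = dict(fresh)
    if existing:
        for field in preserved_fields:
            if field in existing and field not in fresh:
                merged[field] = existing[field]
    return merged
```

The indexing code would look up the existing document by URL, call `merge_page`, and index the result instead of overwriting blindly.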
After cloning the repo and following the installation/running instructions from the wiki, I obtained an installation that could not find any of the static files. Here are some error logs from ./bin/ddt-dev after I refreshed the application page (localhost:8084):
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET / HTTP/1.1" 200 3103 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /bootflat-2.0.4/css/bootstrap.min.css HTTP/1.1" 404 669 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/
537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /bootflat-2.0.4/css/bootflat.min.css HTTP/1.1" 404 666 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/5
37.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /bootstrap-datetimepicker-4.15.35/css/bootstrap-datetimepicker.min.css HTTP/1.1" 404 680 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like G
ecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/d3.slider.css HTTP/1.1" 404 657 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/jquery-ui.css HTTP/1.1" 404 658 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/jquery.urlive.css HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/jquery-1.10.0.min.js HTTP/1.1" 404 667 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/d3.v3.5.5.min.js HTTP/1.1" 404 664 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/jquery-ui.js HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/dropdowns-enhancement.min.css HTTP/1.1" 404 667 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/53$
.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /css/crawler-white.css HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/d3.layout.cloud.js HTTP/1.1" 404 666 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/jquery.urlive.js HTTP/1.1" 404 664 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/d3.lasso.js HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/bootstrap.min.js HTTP/1.1" 404 663 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/moment.js HTTP/1.1" 404 658 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/d3.slider.js HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/dropdowns-enhancement.js HTTP/1.1" 404 668 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.
36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/libs/queue.min.js HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /bootstrap-datetimepicker-4.15.35/js/bootstrap-datetimepicker.min.js HTTP/1.1" 404 681 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gec
ko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/pageslandscape.js HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/sigslot_core.js HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/tagsgallery.js HTTP/1.1" 404 660 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/dataaccess.js HTTP/1.1" 404 659 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/crawlervis.js HTTP/1.1" 404 659 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/pagesgallery.js HTTP/1.1" 404 659 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/crawlersigslots.js HTTP/1.1" 404 662 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/seedcrawlerstatslist.js HTTP/1.1" 404 665 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/snippetsviewer.js HTTP/1.1" 404 661 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/wordlist.js HTTP/1.1" 404 657 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /js/utils.js HTTP/1.1" 404 656 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [07/Dec/2015:12:23:41] "GET /img/nyu_stacked_black.png HTTP/1.1" 404 663 "http://localhost:8084/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
I followed the installation instructions that use the Makefile, so just make. I initialized the conda environment with conda env create and source activate ddt at the root of the repo. Then I ran the app with elasticsearch in one terminal, and ./bin/ddt-dev also at the root of the repo.
I'll go now and check if the manual installation instructions work.
We'll be serving DDT from an explorer.io page, where it will live as explorer.io/ddt. DDT contains a few absolute URL references that need to be changed to relative URLs. Alternatively, we could define an HTTP_BASE variable for you that sets your base URL.
In order to use configurable elasticsearch endpoint, we need to remove any hardcoded references to "localhost:9200" in the repository.
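A minimal sketch of funneling every call site through one configurable lookup. The environment-variable name `ELASTICSEARCH_SERVER` is an assumption; any single config source (env var, settings file, CLI flag) would do:

```python
import os


def es_endpoint(environ=None):
    """Resolve the Elasticsearch endpoint instead of hardcoding localhost:9200.

    Reads a hypothetical ELASTICSEARCH_SERVER environment variable, falling
    back to the current default. Every Elasticsearch() constructor call in the
    repo would use this one function, so the endpoint is configured in exactly
    one place.
    """
    if environ is None:
        environ = os.environ
    return environ.get("ELASTICSEARCH_SERVER", "http://localhost:9200")
```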
Hi folks, I'm getting some issues with the standalone deployment. The package builds (make) and I can then run ddt, but I can't open the application on localhost:8084. The terminal shows this pattern of ddt starting, then exiting, roughly every three minutes. A sample is pasted below:
2016-03-16 12:10:40,956 INFO success: ddt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-03-16 12:13:00,284 INFO exited: ddt (exit status 1; not expected)
2016-03-16 12:13:01,357 INFO spawned: 'ddt' with pid 28709
2016-03-16 12:13:02,359 INFO success: ddt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-03-16 12:16:51,898 INFO exited: ddt (exit status 1; not expected)
2016-03-16 12:16:52,901 INFO spawned: 'ddt' with pid 28849
2016-03-16 12:16:53,903 INFO success: ddt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Log files are attached; I'm not sure if this is a local config issue with the system I'm using (Ubuntu 14.04), but thought I'd raise it, as I checked the other issues and couldn't find anything similar. Any advice or suggestions would be much appreciated! Thanks.
So I'm looking through the code and I noticed that there is a crawlermodel.py file here: https://github.com/ViDA-NYU/domain_discovery_tool/blob/master/vis/crawlermodel.py and another, different crawlermodel.py here: https://github.com/ViDA-NYU/domain_discovery_tool/blob/master/models/crawlermodel.py
It seems to me that the one in vis/ is the stale one and the one in models/ is the real one. Can we get rid of one of these to limit confusion?
Every time the page is loaded, a yellow box is displayed on the screen. The box only disappears after the first message is shown. Ideally, the box should not be displayed if there's no message to show.
Related to issue #11.
Here is the error:
[26/Jan/2016:10:51:38] HTTP
Request Headers:
Content-Length: 1364
REFERER: http://localhost:8084/seedcrawler
HOST: localhost:8084
ORIGIN: http://localhost:8084
CONNECTION: keep-alive
Remote-Addr: 127.0.0.1
ACCEPT: */*
USER-AGENT: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
X-REQUESTED-WITH: XMLHttpRequest
ACCEPT-LANGUAGE: en-US,en;q=0.8
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
ACCEPT-ENCODING: gzip, deflate
[26/Jan/2016:10:51:38] HTTP
Traceback (most recent call last):
File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
response.body = self.handler()
File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 217, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 61, in __call__
return self.callable(*self.args, **self.kwargs)
File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/server.py", line 206, in setPagesTag
self._crawler.setPagesTag(pages, tag, applyTagFlag, session)
File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/crawler_model_adapter.py", line 111, in setPagesTag
self._crawlerModel.setPagesTag(pages, tag, applyTagFlag, session)
File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 732, in setPagesTag
es_info = self.esInfo(session['domainId'])
File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 122, in esInfo
"activeCrawlerIndex": self._domains[domainId]['index'],
TypeError: 'NoneType' object has no attribute '__getitem__'
[26/Jan/2016:10:51:38] HTTP
Request Headers:
REFERER: http://localhost:8084/seedcrawler
HOST: localhost:8084
CONNECTION: keep-alive
Remote-Addr: 127.0.0.1
ACCEPT: */*
USER-AGENT: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
X-REQUESTED-WITH: XMLHttpRequest
ACCEPT-LANGUAGE: en-US,en;q=0.8
ACCEPT-ENCODING: gzip, deflate, sdch
[26/Jan/2016:10:51:38] HTTP
Traceback (most recent call last):
File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
response.body = self.handler()
File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 217, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 61, in __call__
return self.callable(*self.args, **self.kwargs)
File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/server.py", line 273, in getBokehPlot
data = self._crawler.getPages(session)
File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/crawler_model_adapter.py", line 95, in getPages
return self._crawlerModel.getPages(session)
File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 627, in getPages
es_info = self.esInfo(session['domainId'])
File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 122, in esInfo
"activeCrawlerIndex": self._domains[domainId]['index'],
TypeError: 'NoneType' object has no attribute '__getitem__'
127.0.0.1 - - [26/Jan/2016:10:51:38] "POST /setPagesTag HTTP/1.1" 500 2103 "http://localhost:8084/seedcrawler" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
127.0.0.1 - - [26/Jan/2016:10:51:38] "GET /getBokehPlot?session=%7B%22activeProjectionAlg%22%3A%22Group+by+Similarity%22%2C%22domainId%22%3A%22AVJBRZHIIf8LCshQL9xi%22%2C%22pagesCap%22%3A%22100%22%2C%22fromDate%22%3Anull%2C%22toDate%22%3Anull%2C%22filter%22%3Anull%2C%22pageRetrievalCriteria%22%3A%22Most+Recent%22%7D HTTP/1.1" 500 2053 "http://localhost:8084/seedcrawler" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
[26/Jan/2016:10:51:38] HTTP
Request Headers:
REFERER: http://localhost:8084/seedcrawler
HOST: localhost:8084
CONNECTION: keep-alive
Remote-Addr: 127.0.0.1
ACCEPT: */*
USER-AGENT: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
X-REQUESTED-WITH: XMLHttpRequest
ACCEPT-LANGUAGE: en-US,en;q=0.8
ACCEPT-ENCODING: gzip, deflate, sdch
[26/Jan/2016:10:51:38] HTTP
Traceback (most recent call last):
File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cprequest.py", line 670, in respond
response.body = self.handler()
File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/lib/encoding.py", line 217, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "/media/data/yamuna/miniconda2/envs/ddt/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 61, in __call__
return self.callable(*self.args, **self.kwargs)
File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/server.py", line 273, in getBokehPlot
data = self._crawler.getPages(session)
File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/vis/crawler_model_adapter.py", line 95, in getPages
return self._crawlerModel.getPages(session)
File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 627, in getPages
es_info = self.esInfo(session['domainId'])
File "/media/data/yamuna/Memex/bugfix/domain_discovery_tool/models/crawlermodel.py", line 122, in esInfo
"activeCrawlerIndex": self._domains[domainId]['index'],
TypeError: 'NoneType' object has no attribute '__getitem__'
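Both tracebacks end in `self._domains[domainId]['index']` raising TypeError because `self._domains` is None (e.g. the domain list was never loaded, or the session references a deleted domain). A guard like the sketch below would turn that into an actionable error; the function and field names are illustrative, not DDT's actual API:

```python
def es_info(domains, domain_id):
    """Look up a domain's index settings, failing with a clear message.

    Replaces an unguarded `domains[domain_id]['index']` access: if the domain
    map was never loaded (None) or the session points at an unknown/deleted
    domain, raise a descriptive error instead of
    "TypeError: 'NoneType' object has no attribute '__getitem__'".
    """
    if not domains or domain_id not in domains:
        raise ValueError("unknown or unloaded domain: %r" % domain_id)
    entry = domains[domain_id]
    return {"activeCrawlerIndex": entry["index"]}
```

The server layer could then catch the ValueError and return a 4xx response telling the user to reselect a domain, rather than a 500.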
With very large queries, the queries plot has difficulty rendering lines. There are a number of factors that may be causing this, including conflicts with the new forwardlinks/backlinks feature.
Please include a license for the project. In the DARPA catalog this project is listed as BSD licensed but there is no indication of that in the repo.
On the menu add a tab to view the statistics of the data for a selected domain using bokeh. These could be:
The visualization panel sometimes doesn't show all pages that are close to the borders.
Steps to reproduce:
https://en.wikipedia.org/wiki/Machine_learning
https://en.wikipedia.org/wiki/Statistical_classification
https://en.wikipedia.org/wiki/Cluster_analysis
https://en.wikipedia.org/wiki/Regression_analysis
Whilst trying to restart the docker deployment of DDT I get the "port is already allocated" message.
I checked iptables and had the following entries; I tried to remove the 8084 nat entry manually and recreate the images, but got the same error.
I resolved this by removing the docker images and then removing the domain_discovery_tool directory, creating a new clone and rebuilding.
~/domain_discovery_tool$ sudo iptables -t nat -L -n --line-numbers
Chain PREROUTING (policy ACCEPT)
num target prot opt source destination
1 DOCKER all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
Chain INPUT (policy ACCEPT)
num target prot opt source destination
Chain OUTPUT (policy ACCEPT)
num target prot opt source destination
1 DOCKER all -- 0.0.0.0/0 !127.0.0.0/8 ADDRTYPE match dst-type LOCAL
Chain POSTROUTING (policy ACCEPT)
num target prot opt source destination
1 MASQUERADE all -- 172.17.0.0/16 0.0.0.0/0
2 MASQUERADE tcp -- 172.17.0.1 172.17.0.1 tcp dpt:8084
3 MASQUERADE tcp -- 172.17.0.1 172.17.0.1 tcp dpt:9200
Chain DOCKER (2 references)
num target prot opt source destination
1 DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8084 to:172.17.0.1:8084
2 DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:9200 to:172.17.0.1:9200
sudo iptables -D DOCKER 1 -t nat