Comments (9)
ooh! good idea!!!!
I'm actually working on integrating boilerpipe into Terra Incognita right
now - https://code.google.com/p/boilerpipe/ - this could do content
extraction on the URLs that GDELT is giving us.
/////////////////////////////
Catherine D'Ignazio
Research Assistant, MIT Media Lab Center for Civic Media
[email protected] || [email protected] || @kanarinka || +1
617 501 2441 || www.kanarinka.com ||
http://civic.mit.edu/blog/kanarinka/
On Tue, Nov 26, 2013 at 9:55 AM, rahulbot [email protected] wrote:
Another validation idea - perhaps we can test against the GDELT data? For instance, the daily downloads (http://gdelt.utdallas.edu/data/dailyupdates/?O=D) include rows like this:
277188496 20031128 200311 2003 2003.8986 IRN TEHRAN IRN 0 020 020 02 1 3.0 20 1 20 1.99778270509978 0 4 Tehran, Tehran, Iran IR IR26 35.75 51.5148 10074674 4 Tehran, Tehran, Iran IR IR26 35.75 51.5148 10074674 20131125 http://www.ansamed.info/ansamed/en/news/nations/france/2013/11/25/Nuclear-Fabius-first-sanctions-against-Iran-lifted-Dec-_9676412.html
This includes a primary location and the URL. We could test our aboutness strategy against this, no?
from cliff-annotator.
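The daily-update row quoted above is tab-separated. As a rough sketch of pulling out the pieces a test would need (column positions are an assumption from the GDELT 1.0 event file layout, with the source URL as the final field; the field names here are illustrative):

```python
# Sketch: pull the event id, date, and source URL out of one GDELT
# daily-update row. Assumes the tab-separated GDELT 1.0 layout with
# SOURCEURL as the last column; dict keys are illustrative, not GDELT's.
def parse_gdelt_row(line: str) -> dict:
    fields = line.rstrip("\n").split("\t")
    return {
        "event_id": fields[0],
        "event_date": fields[1],   # YYYYMMDD
        "source_url": fields[-1],  # URL of the article the event came from
    }
```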
Is there any documentation from GDELT on where/how they get their aboutness? Just want to know how reliable their aboutness would be to start with.
RE: boilerplate
Make sure you isolate that well in your code. In theory you'd be able to replace it with the Media Cloud extractor later because @natematias and I are trying to schedule a time with @dlarochelle to pull the extractor out of MC. They've said it performs better than others, especially with multiple languages.
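One way to keep the extractor isolated is a thin interface so the backend (boilerpipe now, the Media Cloud extractor later) can be swapped without touching callers. A sketch; these class and method names are hypothetical, not CLIFF's actual code:

```python
# Sketch: isolate content extraction behind an interface so the backend
# can be swapped later. Names here are hypothetical.
from abc import ABC, abstractmethod

class ContentExtractor(ABC):
    @abstractmethod
    def extract(self, html: str) -> str:
        """Return the article body text from raw HTML."""

class BoilerpipeExtractor(ContentExtractor):
    def extract(self, html: str) -> str:
        # would delegate to boilerpipe here; stubbed in this sketch
        raise NotImplementedError

class MediaCloudExtractor(ContentExtractor):
    def extract(self, html: str) -> str:
        # drop-in replacement once the MC extractor is pulled out
        raise NotImplementedError
```

Callers would depend only on `ContentExtractor`, so swapping backends is a one-line change at construction time.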
RE: GDELT
Their intro paper says:
Each article is subjected to fulltext geocoding from Leetaru [2012] to identify
and disambiguate all geographic references contained in each article.
That reference points to his fulltext geocoding paper, which talks about disambiguation but doesn't say anything about selecting a "primary" location. We are already doing disambiguation similar to what that paper describes.
More importantly, the intro paper also says:
The Tabari system is applied to each article in full-story mode to extract all events
contained anywhere in the article and the Tabari geocoding post-processing system
is enabled to georeference each event back to the specific city or geographic
landmark it is associated with.
This makes it clear that their system uses TABARI to pull "event" mentions out of the article and then find the location each event is in. So each article can have many events, each tied to one of the already-disambiguated mentions. I would guess this uses the closest location mentioned in the text, perhaps with some other tricks.
So again, like NYT, this isn't data about where the article is "about". We could maybe say that if the article only has one GDELT event, then that event's location is a reasonable proxy for aboutness. Dunno.
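If we did want to try the single-event proxy, the filtering is simple. A sketch, assuming events arrive as (source URL, location) pairs:

```python
# Sketch: keep only articles with exactly one GDELT event, and treat
# that event's location as a rough "aboutness" proxy. Input is assumed
# to be (source_url, location) pairs.
from collections import defaultdict

def single_event_locations(events):
    by_url = defaultdict(list)
    for url, location in events:
        by_url[url].append(location)
    # articles with exactly one event -> that event's location
    return {url: locs[0] for url, locs in by_url.items() if len(locs) == 1}
```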
Will do, and yes, I heard about that work with MC and think it sounds like a great idea.
The GDELT download includes the URL of the source article. To write this test I need to be able to extract article content from those URLs. I'm hoping if we set up an instance of your extractor as a web service then I can use the same thing....
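A minimal web-service wrapper could look like this, sketched with Python's stdlib; `extract_text` is a hypothetical stand-in for whatever the real extractor (boilerpipe or Media Cloud) exposes:

```python
# Sketch: expose a content extractor as a tiny HTTP service so the
# GDELT test harness can call it. extract_text is a placeholder for
# the real extractor backend.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

def extract_text(url: str) -> str:
    # placeholder: a real service would fetch the page and extract it
    return "extracted text for " + url

class ExtractHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        target = query.get("url", [""])[0]
        body = json.dumps({"url": target, "text": extract_text(target)})
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))

def serve(port: int = 8080) -> None:
    # e.g. GET /extract?url=http://example.com/article
    HTTPServer(("", port), ExtractHandler).serve_forever()
```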
Sweet - yes, let's get that up and running. I've been really happy with its extraction so far:
http://code.google.com/p/boilerpipe/
First test against one day put us around 80%. To be clear, that means that 80% of the time the list of countries we identify as mentioned in the article includes both actors of the GDELT event pulled from that article.
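Concretely, the metric being computed is something like this (a sketch; the record shape, detected country set plus two actor country codes, is an assumption):

```python
# Sketch of the match metric: an article "matches" when the set of
# countries we detect in its text contains both GDELT actor countries.
# The (detected_countries, actor1, actor2) record shape is assumed.
def match_rate(records):
    matches = sum(
        1 for detected, actor1, actor2 in records
        if actor1 in detected and actor2 in detected
    )
    return matches / len(records) if records else 0.0
```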
Turns out this isn't such a useful idea for benchmarking, since of course GDELT has its own error rate. With a sample of 200 events, we have a match rate around 74%. That's pretty good, which probably means we make a lot of the same disambiguation errors that they make ;-)