
cliff-annotator's Introduction

This is the source code for the Media Cloud core system. Media Cloud, a joint project of the Berkman Center for Internet & Society at Harvard University and the Center for Civic Media at MIT, is an open source, open data platform that allows researchers to answer complex quantitative and qualitative questions about the content of online media.

For more information on Media Cloud, go to mediacloud.org.

Note: Most users prefer to use Media Cloud's API and public tools to query our data instead of running their own Media Cloud instance.

The code in this repository will be of interest to those users who wish to run their own Media Cloud instance and users of the public tools who want to understand how Media Cloud is implemented.

The Media Cloud code here does three things:

  • Runs a web app that allows you to manage a set of media sources and their feeds.

  • Periodically crawls the feeds set up within the web app and downloads any new stories found in them.

  • Extracts the substantive text from the downloaded story content (minus the ads, navigation, comments, etc.) and associates a set of tags with each story based on that extracted text.

For very brief installation instructions, see INSTALL.markdown.

Please send us a note at [email protected] if you are using any of this code or if you have any questions. We are very interested in knowing who's using the code and for what.

Build Status

Pull, build, push, test

History of the Project

Print newspapers are declaring bankruptcy nationwide. High-profile blogs are proliferating. Media companies are exploring new production techniques and business models in a landscape that is increasingly dominated by the Internet. In the midst of this upheaval, it is difficult to know what is actually happening to the shape of our news. Beyond one-off anecdotes or painstaking manual content analysis, there are few ways to examine the emerging news ecosystem.

The idea for Media Cloud emerged through a series of discussions between faculty and friends of the Berkman Center. The conversations would follow a predictable pattern: one person would ask a provocative question about what was happening in the media landscape, someone else would suggest interesting follow-on inquiries, and everyone would realize that a good answer would require heavy number crunching. Nobody had the time to develop a huge infrastructure and download all the news just to answer a single question. However, there were eventually enough of these questions that we decided to build a tool for everyone to use.

Some of the early driving questions included:

  • Do bloggers introduce storylines into mainstream media or the other way around?
  • What parts of the world are being covered or ignored by different media sources?
  • Where do stories begin?
  • How are competing terms for the same event used in different publications?
  • Can we characterize the overall mix of coverage for a given source?
  • How do patterns differ between local and national news coverage?
  • Can we track news cycles for specific issues?
  • Do online comments shape the news?

Media Cloud offers a way to quantitatively examine all of these challenging questions by collecting and analyzing the news stream of tens of thousands of online sources.

Using Media Cloud, academic researchers, journalism critics, policy advocates, media scholars, and others can examine which media sources cover which stories, what language different media outlets use in conjunction with different stories, and how stories spread from one media outlet to another.

Sponsors

Media Cloud is made possible by the generous support of the Ford Foundation, the Open Society Foundations, and the John D. and Catherine T. MacArthur Foundation.

Collaborators

Past and present collaborators include Morningside Analytics, Betaworks, and Bit.ly.

License

Media Cloud is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Media Cloud is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with Media Cloud. If not, see <http://www.gnu.org/licenses/>.

cliff-annotator's People

Contributors

kanarinka, premanandchandrasekar, pypt, rahulbot


cliff-annotator's Issues

LocationScoredAboutnessStrategy - should we use it?

I wrote the LocationScoring strategy - originally it scored the headline at 3 points, the first couple of sentences at 2 points, and other mentions at 1 point. But this produced much worse results than our FrequencyMentionStrategy. So then I experimented with just adding 2 points if the mention was in the first x% of the text.
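For reference, a minimal sketch of that position-threshold variant (all class and method names here are hypothetical, and the 1/2-point values are just the ones described above, not the actual CLIFF implementation):

import java.util.HashMap;
import java.util.Map;

// Hypothetical position-weighted scorer: every mention earns 1 point, plus a 2-point
// bonus if it starts within the first `threshold` fraction of the article text.
class PositionWeightedScorer {

    static class Mention {
        final String placeName;  // the resolved place this mention maps to
        final int charIndex;     // where the mention starts in the article text
        Mention(String placeName, int charIndex) {
            this.placeName = placeName;
            this.charIndex = charIndex;
        }
    }

    /** Returns the place with the highest total score, or null if there are no mentions. */
    static String pickAboutness(Iterable<Mention> mentions, int textLength, double threshold) {
        Map<String, Integer> scores = new HashMap<>();
        int cutoff = (int) (textLength * threshold);
        for (Mention m : mentions) {
            int points = 1;                        // every mention counts once
            if (m.charIndex < cutoff) points += 2; // bonus for appearing early in the article
            scores.merge(m.placeName, points, Integer::sum);
        }
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}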

You can see the results below. At around 10% and at 25% we get some very slight improvements. This is probably just one or two articles tipping the numbers. Is that enough to justify choosing this strategy? I'll do a little more testing to see if I can get any further improvements.

FREQUENCY MENTION ---- Original strategy
11:04:52.999 [main] INFO - Aboutness gets NYT 1.0 correct
11:04:52.999 [main] INFO - Aboutness gets Huff Po 0.8333333333333334 correct
11:04:52.999 [main] INFO - Aboutness gets BBC 0.84 correct
AVERAGE: 0.89111111111111

LOCATIONSCORED MENTION --- Experimenting with weighting stuff mentioned earlier in article

Weighting top 5%
11:17:21.942 [main] INFO - Aboutness gets NYT 1.0 correct
11:17:21.942 [main] INFO - Aboutness gets Huff Po 0.875 correct
11:17:21.942 [main] INFO - Aboutness gets BBC 0.80 correct
AVERAGE: 0.89166666666667

Weighting top 10%
11:11:58.905 [main] INFO - Aboutness gets NYT 1.0 correct
11:11:58.905 [main] INFO - Aboutness gets Huff Po 0.8333333333333334 correct
11:11:58.905 [main] INFO - Aboutness gets BBC 0.88 correct
AVERAGE: 0.90444444444444

Weighting top 15%
11:32:58.135 [main] INFO - Aboutness gets NYT 0.9130434782608695 correct
11:32:58.135 [main] INFO - Aboutness gets Huff Po 0.875 correct
11:32:58.135 [main] INFO - Aboutness gets BBC 0.88 correct
AVERAGE: 0.88934782608696

Weighting top 20%
11:13:45.187 [main] INFO - Aboutness gets NYT 0.9565217391304348 correct
11:13:45.187 [main] INFO - Aboutness gets Huff Po 0.8333333333333334 correct
11:13:45.187 [main] INFO - Aboutness gets BBC 0.88 correct
AVERAGE: 0.88995169082125

Weighting places in the top 25% of the article
11:09:30.370 [main] INFO - Aboutness gets NYT 0.9565217391304348 correct
11:09:30.370 [main] INFO - Aboutness gets Huff Po 0.875 correct
11:09:30.370 [main] INFO - Aboutness gets BBC 0.88 correct
AVERAGE: 0.90384057971014

Weighting - top 30%
11:10:43.704 [main] INFO - Aboutness gets NYT 0.9565217391304348 correct
11:10:43.704 [main] INFO - Aboutness gets Huff Po 0.875 correct
11:10:43.704 [main] INFO - Aboutness gets BBC 0.84 correct
AVERAGE: 0.89050724637681

Weighting - top 40%
11:23:26.474 [main] INFO - Aboutness gets NYT 0.9565217391304348 correct
11:23:26.474 [main] INFO - Aboutness gets Huff Po 0.875 correct
11:23:26.474 [main] INFO - Aboutness gets BBC 0.84 correct
AVERAGE: 0.89050724637681

Weighting - top 50% - SAME STATS AS FREQUENCYMENTION
11:20:59.779 [main] INFO - Aboutness gets NYT 1.0 correct
11:20:59.779 [main] INFO - Aboutness gets Huff Po 0.8333333333333334 correct
11:20:59.779 [main] INFO - Aboutness gets BBC 0.84 correct
AVERAGE: 0.89111111111111

Oklahoma resolves to a city in Pennsylvania, instead of the state

Example input:

Oklahoma say Common Core tests are too costly

Example output:

{
  "status": "ok",
  "version": "1.2.0",
  "results": {
    "organizations": [],
    "places": {
      "mentions": [
        {
          "confidence": 1,
          "name": "Oklahoma",
          "countryCode": "US",
          "featureCode": "PPL",
          "lon": -79.57393,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 0,
            "string": "Oklahoma"
          },
          "stateCode": "PA",
          "featureClass": "P",
          "lat": 40.58145,
          "stateGeoNameId": "6254927",
          "id": 5204294,
          "population": 809
        }
      ],
      "focus": {
        "states": [
          {
            "name": "Pennsylvania",
            "countryCode": "US",
            "featureCode": "ADM1",
            "lon": -76.90567,
            "countryGeoNameId": "6252001",
            "score": 1,
            "stateCode": "PA",
            "featureClass": "A",
            "lat": 40.27245,
            "stateGeoNameId": "6254927",
            "id": 6254927,
            "population": 12440621
          }
        ],
        "cities": [
          {
            "name": "Oklahoma",
            "countryCode": "US",
            "featureCode": "PPL",
            "lon": -79.57393,
            "countryGeoNameId": "6252001",
            "score": 1,
            "stateCode": "PA",
            "featureClass": "P",
            "lat": 40.58145,
            "stateGeoNameId": "6254927",
            "id": 5204294,
            "population": 809
          }
        ],
        "countries": [
          {
            "name": "United States",
            "countryCode": "US",
            "featureCode": "PCLI",
            "lon": -98.5,
            "countryGeoNameId": "6252001",
            "score": 1,
            "stateCode": "00",
            "featureClass": "A",
            "lat": 39.76,
            "stateGeoNameId": "",
            "id": 6252001,
            "population": 310232863
          }
        ]
      }
    },
    "people": []
  },
  "milliseconds": 6
}

CLAVIN-Server for cities

I want to do Aboutness for cities in Terra Incognita, but it looks like our current resolution strategy is favoring districts over cities - i.e. Paris resolves to the administrative division, not the city. Same with Shanghai. It would be great to tweak this so that we favor cities.

I'm going to look into this in the next couple days, I just wanted to log it here for the moment.

indexFormatTooNewException when deploying

Sadek emailed to say:
"
After deploying the CLIFF in tomcat, if I try to parse a
text using the “parse/text?q=“ parameter, I receive this error: “Unable
to create parser org.apache.lucene.index.IndexFormatTooNewException:
Format version is not supported (resource:
MMapIndexInput(path=“/home/ripul/CLAVIN/IndexDirectory/segments.gen”)):
-3 (needs to be between -2 and -2)”.
"

test against GDELT?

Another validation idea - perhaps we can test against the GDELT data? For instance, the daily downloads include rows like this:

277188496   20031128    200311  2003    2003.8986                                           IRN TEHRAN  IRN                             0   020 020 02  1   3.0 20  1   20  1.99778270509978    0                           4   Tehran, Tehran, Iran    IR  IR26    35.75   51.5148 10074674    4   Tehran, Tehran, Iran    IR  IR26    35.75   51.5148 10074674    20131125    http://www.ansamed.info/ansamed/en/news/nations/france/2013/11/25/Nuclear-Fabius-first-sanctions-against-Iran-lifted-Dec-_9676412.html

This includes a primary location and the URL. We could test our aboutness strategy against this, no?
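If we do that, a rough harness could split each tab-delimited row and keep the action-geo name/coordinates plus the source URL for comparison against CLIFF's aboutness output. A minimal sketch - the column indices are deliberately left as parameters because they are assumptions that would need checking against the GDELT event codebook; the URL being the final field matches the row above:

// Hypothetical GDELT-row reader for a validation harness: keep only the fields we
// would compare against CLIFF's aboutness output.
class GdeltRow {
    final String actionGeoFullName;  // e.g. "Tehran, Tehran, Iran"
    final double lat;
    final double lon;
    final String sourceUrl;          // last column of the row

    GdeltRow(String actionGeoFullName, double lat, double lon, String sourceUrl) {
        this.actionGeoFullName = actionGeoFullName;
        this.lat = lat;
        this.lon = lon;
        this.sourceUrl = sourceUrl;
    }

    // Column indices are passed in rather than hard-coded: verify them against the codebook.
    static GdeltRow parse(String line, int geoNameCol, int latCol, int lonCol) {
        String[] cols = line.split("\t", -1);
        return new GdeltRow(
                cols[geoNameCol],
                Double.parseDouble(cols[latCol]),
                Double.parseDouble(cols[lonCol]),
                cols[cols.length - 1]);
    }
}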

architect for scale on a server

The architecture doesn't scale optimally right now, particularly when we think about a multicore system. We should think about changing that up. Some ideas:

  • should we switch to an asynchronous model, or is the geoparsing fast enough?
  • is the "keep an open socket" strategy right, or should each request be closed afterwards (more like HTTP)?
  • create a pool of NERs, each on its own thread (threadpool approach), and have a config var that controls how big the pool is (see the sketch below)
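As a rough sketch of that threadpool idea (the class and method names here are hypothetical stand-ins, not existing CLIFF code): create N extractor instances up front and hand each request to whichever one is free.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical fixed-size pool of NER extractor instances; the pool size would come
// from a config var. `Extractor` stands in for the real (non-thread-safe) NER object.
class NerPool {

    interface Extractor {
        String extract(String text);
    }

    private final BlockingQueue<Extractor> pool;

    NerPool(int size, Supplier<Extractor> factory) {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(factory.get());
        }
    }

    String extract(String text) throws InterruptedException {
        Extractor extractor = pool.take();   // block until an instance is free
        try {
            return extractor.extract(text);
        } finally {
            pool.put(extractor);             // hand it back for the next request
        }
    }
}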

switch to Stanford 4class model?

Our demonym performance is spotty - catching and converting some and missing others. This is due to the Stanford model we are using, not our code. To address this, we could try integrating their demonym annotator.

I tried just using a different model, which includes demonyms (their 4class model) and this was the performance:

Test                         3class       4class
GDELT                        79%          80%
NYT mentions/aboutness       84% / 89%    89% / 86%
Reuters mentions/aboutness   95% / 91%    97% / 92%

As you can see, the 4class performed better on all tests except the NYT aboutness one.

"European" geocodes to "European Mine" in Arizona

This sentence doesn't work well:
"
The prices leaked to the site are what the devices will cost for European carriers, and it’s entirely likely that there will be a difference in US and EU prices – but it’s unlikely to be a huge difference.
"

{
  "status": "ok",
  "version": "1.2.0",
  "results": {
    "organizations": [
      {
        "count": 1,
        "name": "EU"
      }
    ],
    "places": {
      "mentions": [
        {
          "confidence": 1,
          "name": "United States",
          "countryCode": "US",
          "featureCode": "PCLI",
          "lon": -98.5,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 144,
            "string": "US"
          },
          "stateCode": "00",
          "featureClass": "A",
          "lat": 39.76,
          "stateGeoNameId": "",
          "id": 6252001,
          "population": 310232863
        },
        {
          "confidence": 1,
          "name": "European Mine",
          "countryCode": "US",
          "featureCode": "MN",
          "lon": -110.76814,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 65,
            "string": "European"
          },
          "stateCode": "AZ",
          "featureClass": "S",
          "lat": 31.46426,
          "stateGeoNameId": "5551752",
          "id": 5294393,
          "population": 0
        }
      ],
      "focus": {
        "states": [
          {
            "name": "Arizona",
            "countryCode": "US",
            "featureCode": "ADM1",
            "lon": -111.50098,
            "countryGeoNameId": "6252001",
            "score": 1,
            "stateCode": "AZ",
            "featureClass": "A",
            "lat": 34.5003,
            "stateGeoNameId": "5551752",
            "id": 5551752,
            "population": 5863809
          }
        ],
        "cities": [],
        "countries": [
          {
            "name": "United States",
            "countryCode": "US",
            "featureCode": "PCLI",
            "lon": -98.5,
            "countryGeoNameId": "6252001",
            "score": 2,
            "stateCode": "00",
            "featureClass": "A",
            "lat": 39.76,
            "stateGeoNameId": "",
            "id": 6252001,
            "population": 310232863
          }
        ]
      }
    },
    "people": []
  },
  "milliseconds": 15
}

Alternate country spelling is triggering errors

For instance:

  • "This is about the place Albany." is identified as the country Albania
  • "It happened the first day of her sophomore year at Columbia." comes out as a mention of Colombia the country

This doesn't seem right, but maybe it is related to a recent change in the heuristic.

Support for parsing body text from POST request?

It seems like it might be cleaner on the URL to POST to /parse/text and pass the input string in the body of the request - is there support for that? Are there potential limits to the length of text sent in the query string?
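For illustration, the kind of request being asked about would look something like the command below; whether the servlet actually reads a form-encoded body today is exactly the open question here (the endpoint path follows the examples elsewhere in these issues):

curl -X POST --data-urlencode "q=This is some text about Springfield, Illinois." http://localhost:8080/CLIFF/parse/text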

Case insensitivity?

Hi there!

We're planning to utilize CLIFF as part of a broader project on the history of hip hop in the Twin Cities. The idea is to feed lyrics into the parser and see what sort of geographical rhyming is happening. Not quite the use case you envisioned, I imagine, but that's the beauty of FOSS, right?

Anyway, based on the lyrics we've collected/seen, many sources do not capitalize place names. From my testing it seems that CLIFF's text parser is case sensitive, and I'm wondering if there's a fairly painless way to make it case insensitive?

If you could at least point me in the right direction in the code, I can take a crack at it.

Thanks!

CLAVIN-Server - tweak ties for aboutness

If France & US get the same # of place mentions then:

  1. Should they both be returned?
  2. Should the first one mentioned be returned?

Right now the behavior is that the last one mentioned is returned, which I think we should change. @rahulbot - what do you think?

SSL connection error after Vagrant install on Windows

I'm using the vagrant install on Windows 10; installation appears to have run successfully. However, when I try to pull up the CLIFF site for a test, I get (using Chrome) an error saying "This site can’t provide a secure connection", with smaller print "ERR_SSL_PROTOCOL_ERROR".

It appears that it is trying to redirect to an SSL port, I think. Is there some sort of self-signed cert setup that needs to be done, or a way to disable the SSL redirect altogether in Tomcat? Are there some additional installation steps I need to take?

Different results between CLIFF homepage sample and my local installation

I'm getting different results in parsing a news story snippet between my local install of CLIFF and the sample tester on the CLIFF homepage. Specifically, given my input text, my local install is not returning the correct location whereas the CLIFF homepage does return the correct location.

My input text is:

TROY, N.C. (AP) -- Some people want the Confederate flag to be removed from a volunteer fire department in central North Carolina.

WFMY-TV in Greensboro reports (http://on.wfmy.com/2qEYNRx) the flag has been displayed at the Uwharrie Volunteer Fire Department in Montgomery County for more than 20 ...

This should at least resolve to Troy, North Carolina.

The results from my installation are:

{
  "milliseconds": 1318,
  "version": "2.3.0",
  "results": {
    "places": {
      "mentions": [
        {
          "featureCode": "ADM1",
          "featureClass": "A",
          "confidence": 1,
          "lon": -80.00032,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 115,
            "string": "North Carolina"
          },
          "population": 8611367,
          "stateGeoNameId": "4482348",
          "countryCode": "US",
          "name": "North Carolina",
          "stateCode": "NC",
          "id": 4482348,
          "lat": 35.50069
        },
        {
          "featureCode": "PPL",
          "featureClass": "P",
          "confidence": 1,
          "lon": -83.14993,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 0,
            "string": "TROY"
          },
          "population": 83280,
          "stateGeoNameId": "5001836",
          "countryCode": "US",
          "name": "Troy",
          "stateCode": "MI",
          "id": 5012639,
          "lat": 42.60559
        },
        {
          "featureCode": "PPL",
          "featureClass": "P",
          "confidence": 1,
          "lon": -79.79198,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 143,
            "string": "Greensboro"
          },
          "population": 285342,
          "stateGeoNameId": "4482348",
          "countryCode": "US",
          "name": "Greensboro",
          "stateCode": "NC",
          "id": 4469146,
          "lat": 36.07264
        },
        {
          "featureCode": "ADM2",
          "featureClass": "A",
          "confidence": 1,
          "lon": -77.20424,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 264,
            "string": "Montgomery County"
          },
          "population": 971777,
          "stateGeoNameId": "4361885",
          "countryCode": "US",
          "name": "Montgomery County",
          "stateCode": "MD",
          "id": 4362716,
          "lat": 39.13638
        },
        {
          "featureCode": "HTL",
          "featureClass": "S",
          "confidence": 1,
          "lon": -77.5848,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 6,
            "string": "N.C."
          },
          "population": 0,
          "stateGeoNameId": "4482348",
          "countryCode": "US",
          "name": "Hampton Inn Kinston, N.C.",
          "stateCode": "NC",
          "id": 9949124,
          "lat": 35.2448
        }
      ],
      "focus": {
        "cities": [
          {
            "score": 1,
            "featureCode": "PPL",
            "stateGeoNameId": "5001836",
            "featureClass": "P",
            "countryCode": "US",
            "name": "Troy",
            "lon": -83.14993,
            "countryGeoNameId": "6252001",
            "stateCode": "MI",
            "id": 5012639,
            "lat": 42.60559,
            "population": 83280
          },
          {
            "score": 1,
            "featureCode": "PPL",
            "stateGeoNameId": "4482348",
            "featureClass": "P",
            "countryCode": "US",
            "name": "Greensboro",
            "lon": -79.79198,
            "countryGeoNameId": "6252001",
            "stateCode": "NC",
            "id": 4469146,
            "lat": 36.07264,
            "population": 285342
          }
        ],
        "countries": [
          {
            "score": 5,
            "featureCode": "PCLI",
            "stateGeoNameId": "",
            "featureClass": "A",
            "countryCode": "US",
            "name": "United States",
            "lon": -98.5,
            "countryGeoNameId": "6252001",
            "stateCode": "00",
            "id": 6252001,
            "lat": 39.76,
            "population": 310232863
          }
        ],
        "states": [
          {
            "score": 3,
            "featureCode": "ADM1",
            "stateGeoNameId": "4482348",
            "featureClass": "A",
            "countryCode": "US",
            "name": "North Carolina",
            "lon": -80.00032,
            "countryGeoNameId": "6252001",
            "stateCode": "NC",
            "id": 4482348,
            "lat": 35.50069,
            "population": 8611367
          }
        ]
      }
    },
    "organizations": [
      {
        "name": "AP",
        "count": 1
      },
      {
        "name": "WFMY-TV",
        "count": 1
      }
    ],
    "people": []
  },
  "status": "ok"
}

which identifies the location as Troy, Michigan. However, if I submit the same text on the CLIFF homepage I get this:

{
  "status": "ok",
  "version": "2.3.0",
  "results": {
    "organizations": [
      {
        "count": 1,
        "name": "AP"
      },
      {
        "count": 1,
        "name": "WFMY-TV"
      }
    ],
    "places": {
      "mentions": [
        {
          "confidence": 1,
          "name": "North Carolina",
          "countryCode": "US",
          "featureCode": "ADM1",
          "lon": -80.000320000000002,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 115,
            "string": "North Carolina"
          },
          "stateCode": "NC",
          "featureClass": "A",
          "lat": 35.500689999999999,
          "stateGeoNameId": "4482348",
          "id": 4482348,
          "population": 8611367
        },
        {
          "confidence": 1,
          "name": "Troy",
          "countryCode": "US",
          "featureCode": "PPLA2",
          "lon": -79.894490000000005,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 0,
            "string": "TROY"
          },
          "stateCode": "NC",
          "featureClass": "P",
          "lat": 35.358469999999997,
          "stateGeoNameId": "4482348",
          "id": 4495714,
          "population": 3189
        },
        {
          "confidence": 1,
          "name": "Greensboro",
          "countryCode": "US",
          "featureCode": "PPL",
          "lon": -79.791979999999995,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 143,
            "string": "Greensboro"
          },
          "stateCode": "NC",
          "featureClass": "P",
          "lat": 36.07264,
          "stateGeoNameId": "4482348",
          "id": 4469146,
          "population": 269666
        },
        {
          "confidence": 1,
          "name": "Hampton Inn Kinston, N.C.",
          "countryCode": "US",
          "featureCode": "HTL",
          "lon": -77.584800000000001,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 6,
            "string": "N.C."
          },
          "stateCode": "NC",
          "featureClass": "S",
          "lat": 35.244799999999998,
          "stateGeoNameId": "4482348",
          "id": 9949124,
          "population": 0
        }
      ],
      "focus": {
        "states": [
          {
            "name": "North Carolina",
            "countryCode": "US",
            "featureCode": "ADM1",
            "lon": -80.000320000000002,
            "countryGeoNameId": "6252001",
            "score": 4,
            "stateCode": "NC",
            "featureClass": "A",
            "lat": 35.500689999999999,
            "stateGeoNameId": "4482348",
            "id": 4482348,
            "population": 8611367
          }
        ],
        "cities": [
          {
            "name": "Greensboro",
            "countryCode": "US",
            "featureCode": "PPL",
            "lon": -79.791979999999995,
            "countryGeoNameId": "6252001",
            "score": 1,
            "stateCode": "NC",
            "featureClass": "P",
            "lat": 36.07264,
            "stateGeoNameId": "4482348",
            "id": 4469146,
            "population": 269666
          },
          {
            "name": "Troy",
            "countryCode": "US",
            "featureCode": "PPLA2",
            "lon": -79.894490000000005,
            "countryGeoNameId": "6252001",
            "score": 1,
            "stateCode": "NC",
            "featureClass": "P",
            "lat": 35.358469999999997,
            "stateGeoNameId": "4482348",
            "id": 4495714,
            "population": 3189
          }
        ],
        "countries": [
          {
            "name": "United States",
            "countryCode": "US",
            "featureCode": "PCLI",
            "lon": -98.5,
            "countryGeoNameId": "6252001",
            "score": 4,
            "stateCode": "00",
            "featureClass": "A",
            "lat": 39.759999999999998,
            "stateGeoNameId": "",
            "id": 6252001,
            "population": 310232863
          }
        ]
      }
    },
    "people": []
  },
  "milliseconds": 40
}

The CLIFF homepage correctly identifies Troy, NC.

So my question is, why is it different? Both list version 2.3.0; mine was built within the last couple of weeks. Is it a difference in when the gazetteer index was built? I did find on my local install that in CliffLocationResolver.java, if I modify the MAX_HIT_DEPTH value from 10 to 20, I do get Troy, NC in the resolved results, right at spot 11. So on mine the location is correctly resolved but apparently not ranked/ordered high enough to get past the first hit-depth test.

So why might this be? Thanks for any insight!

Sao Paulo matches State and not City

Noting here for review later. Same problem with Rio de Janeiro. Maybe check more Brazilian city & state names and adjust the city vs. state logic in CLIFF.

Create a LocationScoredAboutnessStrategy

The idea would be to score each country mention. A regular mention would earn 1 point. A mention in the headline would earn 3 points. The first mention in the article would earn 2 points. Those numbers are fairly arbitrary.

Then the country the article is about would be the one with the most points.

See #5 (comment) for a motivating example in the hand-coded corpus, plus this idea just sort of felt right to us in earlier email discussions.

colocation logic should take states (ie. ADM1) into account

Guy reports that simple examples like this are failing:

  • “We have holdings in Wilmington,(DE|Delaware)” incorrectly assigns Wilmington to NC.
  • “I am from Arlington, (VA|Virginia)”, puts Arlington in Texas despite no mention of Texas (and it correctly identifies Virginia as a State mentioned).

He suggests taking states into account:
"Perhaps CLIFF could be reconfigured so that if it identifies a city as being inside the US, greater weight could be placed on seeing if it is in one of the states mentioned in the document"

Large Areas resolving to wrong countries

Test against the NYT corpus

We should have an automated test against a selection from the NYT Corpus. Any country we pick as the "about" country should certainly show up in the list of "locations" for any article in the corpus.

Manually adjust bad places?

"Reddit" for example is matching to Reddit Creek, CA
Not sure if this is still the case but all mentions of "Washington" were being located to WA state
[EDIT -- I checked and Washington is still being located to WA state]

Should we keep a running list of these for manually extracting later?

integrate naive "aboutness" selector

We decided for now to build a frequency-based selector that picks the single country the story is "about". We'll test the output of that against the bake-off results to see if my NewsHeuristics candidate selection strategy is better than the out-of-the-box CLAVIN one.

ParseManager.getParserInstance should resolve IndexDirectory symlinks

We store the CLAVIN IndexDirectory in /srv/cliff and symlink it to /etc/cliff2/IndexDirectory but this fails, I assume because of the following code:

File gazetteerDir = new File(PATH_TO_GEONAMES_INDEX);
if( !gazetteerDir.exists() || !gazetteerDir.isDirectory() ){
 ...
}

I solved it by using a bind mount but this is awkward as I have to create an entry in /etc/fstab.

This could be solved by resolving the symlink before checking if it's a directory. I'm not a Java Developer™ but from the docs it seems this might work (though it does seem horribly clumsy!):

File gazetteerDir = new File((new File(PATH_TO_GEONAMES_INDEX)).getCanonicalPath());

I'm sure there is a better way of doing it, but you get my drift.

long articles fail to geocode

v0.7 fails to handle long text because we're doing a GET request - it should handle both that (for testing) and POST, which is what we should actually use.

OMG São Paulo is killing me

On my local instance, Sao Paulo gets disambiguated fine:
http://localhost:8080/parseText?text=This%20is%20some%20text%20about%20New%20York%20City,%20and%20maybe%20about%20S%C3%A3o%20Paulo%20as%20well

But on civicdev CLIFF thinks Sao Paulo is a person:
http://civicdev.media.mit.edu:8080/parseText?text=This%20is%20some%20text%20about%20New%20York%20City%20and%20maybe%20about%20S%C3%A3o%20Paulo%20as%20well

Why? CLIFF sees Sao Paulo as S?o Paulo whereas my localhost sees it as São Paulo, so maybe there are character encoding issues?

Issues Building and Running CLIFF in Tomcat

I am on OS X 10.10.5.
David-Laxers-MacBook-Pro:mongo davidlaxer$ mvn -version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
Maven home: /Users/davidlaxer/Downloads/apache-maven-3.2.1
Java version: 1.8.0_05, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.5", arch: "x86_64", family: "mac"
David-Laxers-MacBook-Pro:mongo davidlaxer$

$ mvn package runs successfully in the CLIFF parent directory.
However, there are issues in the child directory stanford-entity-extractor:
David-Laxers-MacBook-Pro:stanford-entity-extractor davidlaxer$ mvn package
[INFO] Scanning for projects...
[INFO]
[INFO] Using the builder org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder with a thread count of 1
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building stanford-entity-extractor 2.3.0
[INFO] ------------------------------------------------------------------------
[WARNING] The POM for org.mediameter:common:jar:2.3.0 is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.190 s
[INFO] Finished at: 2015-08-25T18:48:12-08:00
[INFO] Final Memory: 6M/162M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project stanford-entity-extractor: Could not resolve dependencies for project org.mediameter:stanford-entity-extractor:jar:2.3.0: Failure to find org.mediameter:common:jar:2.3.0 in http://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
David-Laxers-MacBook-Pro:stanford-entity-extractor davidlaxer$

I get the same "Could not resolve dependencies for project org.mediameter:stanford-entity-extractor:jar:2.3.0" error in the CLIFF parent when I try to run the Maven Tomcat deployment.

David-Laxers-MacBook-Pro:CLIFF davidlaxer$ mvn tomcat7:deploy -DskipTests
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] CLIFF
[INFO] common
[INFO] stanford-entity-extractor
[INFO] cliff
[INFO]
[INFO] Using the builder org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder with a thread count of 1
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building CLIFF 2.3.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] >>> tomcat7-maven-plugin:2.1:deploy (default-cli) @ CLIFF >>>
[INFO]
[INFO] <<< tomcat7-maven-plugin:2.1:deploy (default-cli) @ CLIFF <<<
[INFO]
[INFO] --- tomcat7-maven-plugin:2.1:deploy (default-cli) @ CLIFF ---
[INFO] Skipping non-war project
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building common 2.3.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] >>> tomcat7-maven-plugin:2.1:deploy (default-cli) @ common >>>
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ common ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /Users/davidlaxer/CLIFF/common/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ common ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ common ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /Users/davidlaxer/CLIFF/common/src/test/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ common ---
[INFO] No sources to compile
[INFO]
[INFO] --- maven-surefire-plugin:2.16:test (default-test) @ common ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ common ---
[INFO]
[INFO] <<< tomcat7-maven-plugin:2.1:deploy (default-cli) @ common <<<
[INFO]
[INFO] --- tomcat7-maven-plugin:2.1:deploy (default-cli) @ common ---
[INFO] Skipping non-war project
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building stanford-entity-extractor 2.3.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] >>> tomcat7-maven-plugin:2.1:deploy (default-cli) @ stanford-entity-extractor >>>
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] CLIFF ............................................. SUCCESS [ 1.193 s]
[INFO] common ............................................ SUCCESS [ 1.322 s]
[INFO] stanford-entity-extractor ......................... FAILURE [ 0.047 s]
[INFO] cliff ............................................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.105 s
[INFO] Finished at: 2015-08-25T18:49:46-08:00
[INFO] Final Memory: 12M/197M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project stanford-entity-extractor: Could not resolve dependencies for project org.mediameter:stanford-entity-extractor:jar:2.3.0: Failure to find org.mediameter:common:jar:2.3.0 in http://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :stanford-entity-extractor
David-Laxers-MacBook-Pro:CLIFF davidlaxer$

A .war was created:

David-Laxers-MacBook-Pro:CLIFF davidlaxer$ find . -name '*.war' -ls
126302424 254304 -rw-r--r-- 1 davidlaxer staff 130200897 Aug 25 17:40 ./webapp/target/cliff-2.3.0.war

I copied the .war into Tomcat 8's webapps directory:

ls -l /usr/local/apache-tomcat-8.0.24/webapps/
total 254304
drwxr-xr-x@ 19 davidlaxer staff 646 Jul 1 13:23 ROOT
drwxr-xr-x 4 davidlaxer staff 136 Aug 25 18:53 cliff-2.3.0
-rw-r--r-- 1 root staff 130200897 Aug 25 18:53 cliff-2.3.0.war
drwxr-xr-x@ 55 davidlaxer staff 1870 Jul 1 13:23 docs
drwxr-xr-x@ 7 davidlaxer staff 238 Jul 1 13:23 examples
drwxr-xr-x@ 7 davidlaxer staff 238 Jul 1 13:23 host-manager
drwxr-xr-x@ 8 davidlaxer staff 272 Jul 1 13:23 manager

I can't run the CLIFF 2.3.0 server from my browser:
[screenshots of the browser error omitted]

"Bangor, Maine" resolving to Bangor in Ireland

This sentence mentions both Bangor and Maine, yet it still resolves Bangor to the city in Northern Ireland:

"
Near Bangor, Maine, 75 vehicles got tangled up in a series of chain-reaction pileups
on a snowy stretch of Interstate 95, injuring at least 17 people.
"

The algorithm should realize that Maine is there in the US and pick the Bangor in Maine because of that. Grrrr....

{
  "status": "ok",
  "version": "1.2.0",
  "results": {
    "organizations": [],
    "places": {
      "mentions": [
        {
          "confidence": 1,
          "name": "Bangor",
          "countryCode": "GB",
          "featureCode": "PPLA2",
          "lon": -5.66895,
          "countryGeoNameId": "2635167",
          "source": {
            "charIndex": 5,
            "string": "Bangor"
          },
          "stateCode": "NIR",
          "featureClass": "P",
          "lat": 54.65338,
          "stateGeoNameId": "2641364",
          "id": 2656396,
          "population": 60385
        },
        {
          "confidence": 1,
          "name": "Maine",
          "countryCode": "US",
          "featureCode": "ADM1",
          "lon": -69.24977,
          "countryGeoNameId": "6252001",
          "source": {
            "charIndex": 13,
            "string": "Maine"
          },
          "stateCode": "ME",
          "featureClass": "A",
          "lat": 45.50032,
          "stateGeoNameId": "4971068",
          "id": 4971068,
          "population": 1325518
        }
      ],
      "focus": {
        "states": [
          {
            "name": "Northern Ireland",
            "countryCode": "GB",
            "featureCode": "ADM1",
            "lon": -6.5,
            "countryGeoNameId": "2635167",
            "score": 1,
            "stateCode": "NIR",
            "featureClass": "A",
            "lat": 54.5,
            "stateGeoNameId": "2641364",
            "id": 2641364,
            "population": 1700000
          },
          {
            "name": "Maine",
            "countryCode": "US",
            "featureCode": "ADM1",
            "lon": -69.24977,
            "countryGeoNameId": "6252001",
            "score": 1,
            "stateCode": "ME",
            "featureClass": "A",
            "lat": 45.50032,
            "stateGeoNameId": "4971068",
            "id": 4971068,
            "population": 1325518
          }
        ],
        "cities": [
          {
            "name": "Bangor",
            "countryCode": "GB",
            "featureCode": "PPLA2",
            "lon": -5.66895,
            "countryGeoNameId": "2635167",
            "score": 1,
            "stateCode": "NIR",
            "featureClass": "P",
            "lat": 54.65338,
            "stateGeoNameId": "2641364",
            "id": 2656396,
            "population": 60385
          }
        ],
        "countries": [
          {
            "name": "United States",
            "countryCode": "US",
            "featureCode": "PCLI",
            "lon": -98.5,
            "countryGeoNameId": "6252001",
            "score": 1,
            "stateCode": "00",
            "featureClass": "A",
            "lat": 39.76,
            "stateGeoNameId": "",
            "id": 6252001,
            "population": 310232863
          },
          {
            "name": "United Kingdom of Great Britain and Northern Ireland",
            "countryCode": "GB",
            "featureCode": "PCLI",
            "lon": -2.69531,
            "countryGeoNameId": "2635167",
            "score": 1,
            "stateCode": "00",
            "featureClass": "A",
            "lat": 54.75844,
            "stateGeoNameId": "",
            "id": 2635167,
            "population": 62348447
          }
        ]
      }
    },
    "people": []
  },
  "milliseconds": 8
}

Put Georgetown on stopword list

CLIFF consistently geolocates Georgetown incorrectly -- typically it's the basketball team that it is trying to geolocate. The better bet is to just put it on the stopword list and ignore it, IMO. Rahul - let me know if you agree and I'll take care of this.
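If we do go the stopword route, the check itself is trivial; a hypothetical filter applied to extracted place-name strings before resolution might look like this (CLIFF's existing blacklist machinery may already look different):

import java.util.Set;

// Hypothetical stopword filter: drop extracted place names (e.g. "Georgetown") that we
// have manually decided should never be geolocated.
class PlaceStopwords {
    private static final Set<String> STOPWORDS = Set.of("georgetown");

    static boolean shouldIgnore(String extractedName) {
        return STOPWORDS.contains(extractedName.toLowerCase());
    }
}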

com.bericotech.clavin.ClavinException: Error opening gazetteer index.

Any ideas on this error?

(I added the dependency to my pom.xml)

<dependency>
    <groupId>com.bericotech</groupId>
    <artifactId>clavin</artifactId>
    <version>2.0.0</version>
</dependency>

For CLAVIN:

$ mvn install

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

For CLIFF:

$ mvn install

...
com.bericotech.clavin.ClavinException: Error opening gazetteer index.
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:218)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:242)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:802)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:53)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:67)
at com.bericotech.clavin.gazetteer.query.LuceneGazetteer.<init>(LuceneGazetteer.java:139)
at org.mediameter.cliff.ParseManager.getParserInstance(ParseManager.java:350)
at org.mediameter.cliff.ParseManager.extractAndResolve(ParseManager.java:301)
at org.mediameter.cliff.ParseManager.extractAndResolve(ParseManager.java:297)
at org.mediameter.cliff.test.places.BlacklistTest.testReddit(BlacklistTest.java:16)

"M.S." is locating to a place in Malaysia

This is a bad false positive: Valerie Berkowitz, M.S., R.D., locates "M.S." to a hotel ("M.S. Garden") in Malaysia (perhaps because of an alternate name match?).

{
  "status": "ok",
  "version": "2.1.1",
  "results": {
    "organizations": [],
    "places": {
      "mentions": [
        {
          "confidence": 1,
          "name": "M.S. Garden",
          "countryCode": "MY",
          "featureCode": "HTL",
          "lon": 103.32848,
          "countryGeoNameId": "1733045",
          "source": {
            "charIndex": 20,
            "string": "M.S."
          },
          "stateCode": "06",
          "featureClass": "S",
          "lat": 3.8115,
          "stateGeoNameId": "1733042",
          "id": 9950975,
          "population": 0
        }
      ],
      "focus": {
        "states": [
          {
            "name": "Pahang",
            "countryCode": "MY",
            "featureCode": "ADM1",
            "lon": 102.75,
            "countryGeoNameId": "1733045",
            "score": 1,
            "stateCode": "06",
            "featureClass": "A",
            "lat": 3.5,
            "stateGeoNameId": "1733042",
            "id": 1733042,
            "population": 0
          }
        ],
        "cities": [],
        "countries": [
          {
            "name": "Malaysia",
            "countryCode": "MY",
            "featureCode": "PCLI",
            "lon": 112.5,
            "countryGeoNameId": "1733045",
            "score": 1,
            "stateCode": "00",
            "featureClass": "A",
            "lat": 2.5,
            "stateGeoNameId": "",
            "id": 1733045,
            "population": 28274729
          }
        ]
      }
    },
    "people": [
      {
        "count": 1,
        "name": "Valerie Berkowitz"
      },
      {
        "count": 1,
        "name": "R.D."
      }
    ]
  },
  "milliseconds": 8
}

Bias Argument

Apologies if this is redundant but I've been digging around and can't seem to find anything on including a bias with parse requests. It would help me out greatly to be able to put in country/state codes to bias any results.

I am running a lot of content through CLIFF and often get results that don't make sense in the context of the content. For example, CLIFF returns Moscow, Russia when the content is discussing Moscow, Idaho. I'm thinking something like a list of codes (cca3, adm1, adm2) would skew the results toward GeoNames entries with those codes.

Is this already possible and I'm just not finding it?

flesh out servlet metadata

We need to do things like:

  • add the total query time to JSON servlet results
  • log the IP of the requests
    and so on before releasing v1.0 of the servlet interface

include demonyms?

The latest CLAVIN-NERD release includes a file that they use to remove demonyms found by the Stanford NER. What's the logic there? Don't we want to include them? We could scrape the Wikipedia demonym map and use that to turn demonyms into country names before they go into the disambiguation strategy.

I think because of our focus on "Aboutness" we want to include this information.
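A minimal sketch of that substitution step (the class name and the mapping entries here are illustrative; the full table would come from the scraped Wikipedia demonym list):

import java.util.HashMap;
import java.util.Map;

// Hypothetical demonym pre-processing: rewrite demonyms ("Russian", "Brazilian") to
// country names before the text goes into the disambiguation strategy.
class DemonymSubstituter {
    private final Map<String, String> demonymToCountry = new HashMap<>();

    DemonymSubstituter() {
        // Illustrative entries only; the real table would be scraped from Wikipedia.
        demonymToCountry.put("Russian", "Russia");
        demonymToCountry.put("Brazilian", "Brazil");
        demonymToCountry.put("Dutch", "Netherlands");
    }

    String substitute(String text) {
        String result = text;
        for (Map.Entry<String, String> e : demonymToCountry.entrySet()) {
            // word boundaries keep us from rewriting substrings of longer words
            result = result.replaceAll("\\b" + e.getKey() + "\\b", e.getValue());
        }
        return result;
    }
}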

400 Bad Request error

Seems to be something having to do with character encoding -- it's from this URL:
http://docs.mongodb.org/manual/reference/method/db.collection.find/

Here's a short python script to repeat it without having to run it through the extractor:

# coding=utf-8

import requests

text = u'db.collection.find() — MongoDB Manual 2.6.0 criteria document Optional. Specifies selection criteria using query operators . To return all documents in a collection, omit this parameter or pass an empty document ({}). projection document Optional. Specifies the fields to return using projection operators . To return all fields in the matching document, omit this parameter. Returns: A cursor to the documents that match the query criteria. When the find() method “returns documents,” the method is actually returning a cursor to the documents. If the projection argument is specified, the matching documents contain only the projection fields and the _id field. You can optionally exclude the _id field. Executing find() directly in the mongo shell automatically iterates the cursor to display up to the first 20 documents. Type it to continue iteration. To access the returned documents with a driver, use the appropriate cursor handling mechanism for the driver language . The projection parameter takes a document of the following form: { field1: , field2: ... } The value can be any of the following: 1 or true to include the field. The find() method always includes the _id field even if the field is not explicitly stated to return in the projection parameter. 0 or false to exclude the field. A projection cannot contain both include and exclude specifications, except for the exclusion of the _id field. In projections that explicitly include fields, the _id field is the only field that you can explicitly exclude. db.collection.find() is a wrapper for the more formal query structure that uses the $query operator. Examples ¶ Find All Documents in a Collection ¶ The find() method with no parameters returns all documents from a collection and returns all fields for the documents. For example, the following operation returns all documents in the bios collection : db.bios.find() Find Documents that Match Query Criteria ¶ To find documents that match a set of selection criteria, call find() with the parameter. The following operation returns all the documents from the collection products where qty is greater than 25: db.products.find( { qty: { $gt: 25 } } ) The following operation returns documents in the bios collection where _id equals 5: db.bios.find( { _id: 5 } ) Query Using Operators ¶ The following operation returns documents in the bios collection where _id equals either 5 or ObjectId("507c35dd8fada716c89d0013"): db.bios.find( { _id: { $in: [ 5, ObjectId("507c35dd8fada716c89d0013") ] } } ) Query for Ranges ¶ Combine comparison operators to specify ranges. The following operation returns documents with field between value1 and value2: db.collection.find( { field: { $gt: value1, $lt: value2 } } ); Query a Field that Contains an Array ¶ If a field contains an array and your query has multiple conditional operators, the field as a whole will match if either a single array element meets the conditions or a combination of array elements meet the conditions. Given a collection students that contains the following documents: { "_id" : 1, "score" : [ -1, 3 ] } { "_id" : 2, "score" : [ 1, 5 ] } { "_id" : 3, "score" : [ 5, 5 ] } The following query: db.students.find( { score: { $gt: 0, $lt: 2 } } ) Matches the following documents: { "_id" : 1, "score" : [ -1, 3 ] } { "_id" : 2, "score" : [ 1, 5 ] } In the document with _id equal to 1, the score: [ -1, 3 ] meets the conditions because the element -1 meets the $lt: 2 condition and the element 3 meets the $gt: 0 condition. 
In the document with _id equal to 2, the score: [ 1, 5 ] meets the conditions because the element 1 meets both the $lt: 2 condition and the $gt: 0 condition. Query Arrays ¶ Query for an Array Element ¶ The following operation returns documents in the bios collection where the array field contribs contains the element "UNIX": db.bios.find( { contribs: "UNIX" } ) Query an Array of Documents ¶ The following operation returns documents in the bios collection where awards array contains a subdocument element that contains the award field equal to "Turing Award" and the year field greater than 1980: db.bios.find( { awards: { $elemMatch: { award: "Turing Award", year: { $gt: 1980 } } } } ) Query Subdocuments ¶ Query Exact Matches on Subdocuments ¶ The following operation returns documents in the bios collection where the subdocument name is exactly { first: "Yukihiro", last: "Matsumoto" }, including the order: db.bios.find( { name: { first: "Yukihiro", last: "Matsumoto" } } ) The name field must match the sub-document exactly. The query does not match documents with the following name fields: { first: "Yukihiro", aka: "Matz", last: "Matsumoto" } { last: "Matsumoto", first: "Yukihiro" } Query Fields of a Subdocument ¶ The following operation returns documents in the bios collection where the subdocument name contains a field first with the value "Yukihiro" and a field last with the value "Matsumoto". The query uses dot notation to access fields in a subdocument: db.bios.find( { "name.first": "Yukihiro", "name.last": "Matsumoto" } ) The query matches the document where the name field contains a subdocument with the field first with the value "Yukihiro" and a field last with the value "Matsumoto". For instance, the query would match documents with name fields that held either of the following values: { first: "Yukihiro", aka: "Matz", last: "Matsumoto" } { last: "Matsumoto", first: "Yukihiro" } Projections ¶ The projection parameter specifies which fields to return. The parameter contains either include or exclude specifications, not both, unless the exclude is for the _id field. 
Specify the Fields to Return ¶ The following operation returns all the documents from the products collection where qty is greater than 25 and returns only the _id, item and qty fields: db.products.find( { qty: { $gt: 25 } }, { item: 1, qty: 1 } ) The operation returns the following: { "_id" : 11, "item" : "pencil", "qty" : 50 } { "_id" : ObjectId("50634d86be4617f17bb159cd"), "item" : "bottle", "qty" : 30 } { "_id" : ObjectId("50634dbcbe4617f17bb159d0"), "item" : "paper", "qty" : 100 } The following operation finds all documents in the bios collection and returns only the name field, contribs field and _id field: db.bios.find( { }, { name: 1, contribs: 1 } ) Explicitly Excluded Fields ¶ The following operation queries the bios collection and returns all fields except the the first field in the name subdocument and the birth field: db.bios.find( { contribs: 'OOP' }, { 'name.first': 0, birth: 0 } ) Explicitly Exclude the _id Field ¶ The following operation excludes the _id and qty fields from the result set: db.products.find( { qty: { $gt: 25 } }, { _id: 0, qty: 0 } ) The documents in the result set contain all fields except the _id and qty fields: { "item" : "pencil", "type" : "no.2" } { "item" : "bottle", "type" : "blue" } { "item" : "paper" } The following operation finds documents in the bios collection and returns only the name field and the contribs field: db.bios.find( { }, { name: 1, contribs: 1, _id: 0 } ) On Arrays and Subdocuments ¶ The following operation queries the bios collection and returns the last field in the name subdocument and the first two elements in the contribs array: db.bios.find( { }, { _id: 0, 'name.last': 1, contribs: { $slice: 2 } } ) Iterate the Returned Cursor ¶ The find() method returns a cursor to the results. In the mongo shell, if the returned cursor is not assigned to a variable using the var keyword, the cursor is automatically iterated up to 20 times to access up to the first 20 documents that match the query. You can use the DBQuery.shellBatchSize to change the number of iterations. See Flags and Cursor Behaviors . To iterate manually, assign the returned cursor to a variable using the var keyword. With Variable Name ¶ The following example uses the variable myCursor to iterate over the cursor and print the matching documents: var myCursor = db.bios.find( ); myCursor The following example uses the cursor method next() to access the documents: var myCursor = db.bios.find( ); var myDocument = myCursor.hasNext() ? myCursor.next() : null; if (myDocument) { var myName = myDocument.name; print (tojson(myName)); } To print, you can also use the printjson() method instead of print(tojson()): if (myDocument) { var myName = myDocument.name; printjson(myName); } With forEach() Method ¶ The following example uses the cursor method forEach() to iterate the cursor and access the documents: var myCursor = db.bios.find( ); myCursor.forEach(printjson); Modify the Cursor Behavior ¶ The mongo shell and the drivers provide several cursor methods that call on the cursor returned by the find() method to modify its behavior. Order Documents in the Result Set ¶ The sort() method orders the documents in the result set. The following operation returns documents in the bios collection sorted in ascending order by the name field: db.bios.find().sort( { name: 1 } ) sort() corresponds to the ORDER BY statement in SQL. Limit the Number of Documents to Return ¶ The limit() method limits the number of documents in the result set. 
The following operation returns at most 5 documents in the bios collection : db.bios.find().limit( 5 ) limit() corresponds to the LIMIT statement in SQL. Set the Starting Point of the Result Set ¶ The skip() method controls the starting point of the results set. The following operation skips the first 5 documents in the bios collection and returns all remaining documents: db.bios.find().skip( 5 ) The following example chains cursor methods: db.bios.find().sort( { name: 1 } ).limit( 5 ) db.bios.find().limit( 5 ).sort( { name: 1 } ) Regardless of the order you chain the limit() and the sort() , the request to the server has the structure that treats the query and the sort() modifier as a single object. Therefore, the limit() operation method is always applied after the sort() regardless of the specified order of the operations in the chain. See the meta query operators . Copyright © 2011-2014 MongoDB, Inc . Licensed under Creative Commons . MongoDB, Mongo, and the leaf logo are registered trademarks of MongoDB, Inc. ON THIS PAGE'

params = {'q':text}

r = requests.post('http://localhost:8080/CLIFF/parse/text', params=params)
print r
print r.json()

Some obvious places are not recognized

I tried "Toronto" and it returns no results.

curl http://localhost:8080/CLIFF-2.1.1/parse/text?q=Toronto

{
    "milliseconds": 2,
    "results": {
        "organizations": [],
        "people": [],
        "places": {
            "focus": {},
            "mentions": []
        }
    },
    "status": "ok",
    "version": "2.1.1"
}

I also tried this on the demo here http://cliff.mediameter.org/ and got the same result:

[screenshot of the demo result omitted]

Disable Geocode

The bulk of the work is for geocoding. Vagrant/Docker need heavy resources only because of it.

So is it possible to modify the existing model (it is pretty cool) to only support basic location identification? How would one go about that, particularly with MITIE?

Africa resolves to a city in Tunisia

Split off from #12

{
          "confidence": 1,
          "name": "Mahdia",
          "countryCode": "TN",
          "featureCode": "PPLA",
          "lon": 11.06222,
          "countryGeoNameId": "2464461",
          "source": {
            "charIndex": 14,
            "string": "Africa"
          },
          "stateCode": "15",
          "featureClass": "P",
          "lat": 35.50472,
          "stateGeoNameId": "2473574",
          "id": 2473572,
          "population": 45977
        }
