elyase / geotext
Geotext extracts country and city mentions from text
License: MIT License
"USA" is not being detected. I have to replace "USA" with "United States" for the country to be detected.
Since issue #4 has been resolved with modifications to the master branch, those changes need to be tagged and pushed to PyPI.
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 165: character maps to <undefined>
In countryInfo.txt you can see several country names with three words, e.g. United Arab Emirates, Antigua and Barbuda, Bosnia and Herzegovina, Central African Republic, ...
But because of the [ \-]? in the following regex, only country names with at most one space are detected.
Line 107 in add0334
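The effect of a single optional separator can be sketched as follows. The referenced regex line is not reproduced on this page, so both patterns below are simplified, hypothetical stand-ins and not the library's actual expression: the first allows one optional separator, the second allows a separator before every following capitalized word.

```python
import re

# Simplified stand-in (an assumption): a capitalized word, at most ONE
# optional separator, then capitalized words with nothing between them.
ONE_SEPARATOR = re.compile(r"[A-Z][a-z]+[ \-]?(?:[A-Z][a-z]+)*")

# Hypothetical alternative: a separator is allowed before each extra word.
MANY_SEPARATORS = re.compile(r"[A-Z][a-z]+(?:[ \-][A-Z][a-z]+)*")

text = "United Arab Emirates"
one = ONE_SEPARATOR.findall(text)    # the name is split after the second word
many = MANY_SEPARATORS.findall(text) # the name is kept whole

print(one)
print(many)
```

Names with lowercase connectors such as "Antigua and Barbuda" would still not match either sketch, because both patterns only accept capitalized words.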
Oops:
In [1]: from geotext import GeoText
In [2]: places1 = GeoText('I love London and Brussels.')
In [3]: places1.cities
Out[3]: ['London', 'Brussels']
In [4]: places2 = GeoText('I Love London and Brussels.')
In [5]: places2.cities
Out[5]: ['Brussels']
I have found two cities that geotext does not identify.
The cities "Ventalló" and "Sant Cugat del Vallès" exist in http://www.geonames.org, but geotext is not able to find them.
GeoText("Ventallo").cities
[]
GeoText("Sant Cugat del Vallés").cities
[]
GeoText("Alcalá de Henares").cities
['Alcalá de Henares']
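Note that the two failing queries are spelled differently from the GeoNames names ("Ventallo" without the accent, "Vallés" with an acute instead of a grave accent). One possible way to make such comparisons accent-insensitive, sketched with the standard library only, is to strip combining marks before comparing. The helper name strip_accents is hypothetical and not part of geotext.

```python
import unicodedata

def strip_accents(text):
    """Return text with diacritics removed (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

same_ventallo = strip_accents("Ventalló") == strip_accents("Ventallo")
same_valles = strip_accents("Sant Cugat del Vallès") == strip_accents("Sant Cugat del Vallés")

print(same_ventallo, same_valles)
```

Whether geotext's index could be keyed this way is an open question; the sketch only shows the normalization step.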
Hi,
I have texts extracted from certain contexts of files like:
'. Education : 05/2012 DePaul University Graduate School Master of Science in E-Commerce Technology Morgan in E-Commerce Technology Morgan State University 05/88 BS in Computer Science Technical Training or Certifications '
'Information Technology Southern New Hampshire University Expected 2015 Associate of Arts , Graphic Design Penn'
These contain the name of some university along with some city names like 'Morgan city' in the first sentence and 'New Hampshire' in the second sentence. I am using the code mentioned below to extract the city names from the text using the 'geotext' python library:
from geotext import geotext
places = GeoText(sent1)  # or sent2
print(places.cities)
I used pip install geotext for the installation on Python 3 (Anaconda 3.0) on Windows 7.
The output I am getting is ['University'] and ['University', 'University']. These are clearly not city names.
I should mention that after installation I hit some 'expecting bytes not strings' errors and 'cannot find name GeoText' errors, which I corrected manually:
I changed the import statement in __init__.py to contain geotext instead of GeoText, and
I changed the string to string.encode() for the byte array errors.
Hi, I am running single cities through country_mentions and both of them come back with only "OrderedDict([('US', 1)])"
cities = ['Melbourne', 'Bristol']
for city in cities:
    country_dict = GeoText(city.title()).country_mentions
    print(country_dict)
I understand that these are places in the US, but obviously Melbourne is pretty significant in Australia, as is Bristol in the UK. Should the Dict come back with numerous country mentions?
Thanks!
"The official ISO country code for the United Kingdom is 'GB'. The code 'UK' is reserved."
Both UK and GB are returned in my country mentions for some reason. I'm not even sure what the UK ones are from. (I'm using this on a huge file so there's no way to tell what places it's deeming as UK)
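Until the source of the 'UK' entries is known, one possible workaround is to merge the two codes after extraction. The sketch below assumes country_mentions returns a mapping of country codes to counts, as the outputs quoted in these issues suggest. The alias table and the function name merge_aliases are hypothetical, and the sample counts are made up for illustration.

```python
from collections import OrderedDict

# Hypothetical alias table: fold the reserved code 'UK' into ISO 'GB'.
ALIASES = {"UK": "GB"}

def merge_aliases(mentions):
    """Sum counts of aliased country codes under their canonical code."""
    merged = OrderedDict()
    for code, count in mentions.items():
        canonical = ALIASES.get(code, code)
        merged[canonical] = merged.get(canonical, 0) + count
    return merged

# Illustrative input only; not real output from the library.
sample = OrderedDict([("GB", 3), ("US", 2), ("UK", 1)])
merged = merge_aliases(sample)
print(merged)
```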
Is it possible to query for states?
Hi @elyase this is great work, thanks - very fast. I am encountering a few reliability issues however. Specifically, I am finding that the library is very sensitive to capitalization and punctuation (ignores lowercase, ignores countries if followed by other properly capitalized words) and that it also has trouble disambiguating between multiple places with the same name. For example:
GeoText("France Is A Country").country_mentions
>>OrderedDict()
GeoText("paris France").country_mentions
>>OrderedDict([('FR', 1)])
GeoText("Paris France").country_mentions
>>OrderedDict()
GeoText("Paris, France").country_mentions
>> OrderedDict([('FR', 1), ('US', 1)])
(Presumably because there are also American cities named Paris?)
Just wanted to flag this for future updates...thanks!
In some cases we need to extract cities from a specific country only. What do you think about implementing this, @elyase?
Windows 10, Version 1803
I am working on a chatbot assignment that tells the user the weather,
so I am using geotext to extract cities from the user input.
But I found that when the sentence is too long, it does not return the city I want.
Code:
from geotext import GeoText

def main():
    while True:
        request = input("Enter sentence containing a location: ")
        # title() capitalizes every word, not only the place names
        places = GeoText(request.title())
        print("Cities in the sentence:", places.cities)

if __name__ == '__main__':
    main()
Output:
Enter sentence containing a location: what is the weather today in kuala lumpur?
Cities in the sentence: ['Kuala Lumpur']
Enter sentence containing a location: please tell me the weather today in New York
Cities in the sentence: ['York']
Enter sentence containing a location: Washington
Cities in the sentence: ['Washington']
Enter sentence containing a location: can you please tell me the weather today in kuala lumpur?
Cities in the sentence: []
Enter sentence containing a location: can you please tell me the weather today in london?
Cities in the sentence: []
Enter sentence containing a location: please tell me the weather today in Washington
Cities in the sentence: []
Enter sentence containing a location: what is the weather in Washington?
Cities in the sentence: []
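The output above may depend on word position rather than sentence length. Because title() capitalizes every word, a matcher that joins at most two capitalized words per candidate would pair the words up from the start of the sentence, and a city is only found when it happens to land in a pair of its own. The pattern below is a simplified assumption used to illustrate this, not the library's actual regex.

```python
import re

# Assumed, simplified candidate pattern: a capitalized word, at most one
# separator, then capitalized words with no separator between them.
CANDIDATE = re.compile(r"[A-Z][a-z]+[ \-]?(?:[A-Z][a-z]+)*")

def candidates(sentence):
    """Return the candidate phrases the assumed pattern would look up."""
    return CANDIDATE.findall(sentence.title())

short = candidates("what is the weather today in kuala lumpur?")
longer = candidates("can you please tell me the weather today in kuala lumpur?")

print(short)   # 'Kuala Lumpur' forms its own pair
print(longer)  # 'Kuala' is paired with 'In', so the city is split
```

Under this assumption, passing the original text without title() would avoid the pairing, although lowercase city names would then not be matched at all.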
When I try to recognize cities with more than two words, the city is not recognized.
Examples: Rio de Janeiro, Mar del Plata, Rio das Ostras.
Hi,
I have encountered a strange issue while testing the library. When I enter the following string, it gives me back "Parsippany" as the city:
54 Manchester Rd, Parsippany , NJ 07054 07054
but for the following, I don't get any:
54 Manchester Rd Parsippany , NJ 07054 07054
The only thing different is the comma "," between Rd and Parsippany.
Any ideas?
Tests in test folder fail. Please check this out.
Hello Elyase, very glad you have created and maintained this very useful python library. I'm currently using it to help parse quite a lot of info from the USPTO. Anyway I noticed quite a few errors where the library didn't capture the city and/or country from the string. Here are some examples of strings from the source data I ran the library against where the city and/or country was not picked out. Hopefully these cases can help you improve the library.
INDIANAPOLIS INDIANA.
BARDSLEY, ENGLAND
ST. LOUIS, MO.
WHITING, INDIANA, AND CHICAGO, ILLINOIS.
PHILADELPHIA PA.
LEROY, N.Y.
LYNDONVILLE, VT.
AMENIA, N. Y.
COPPERHILL, TENN.
DETROIT AND JOSEPH CAMPAU AT THE RIVER,MICH.
IVORYTON, CONN.
ST. LOUIS, MO. CORPORATION OF MISSOURI.
OGDENSBURG, N.Y.
NEAR SHEFFIELD, ENGLAND
INDIANAPOLIS IND.
BASLE,
ST. LOUIS, MO. REPUBLISHED BY MONSANTO COMPANY,/ST. LOUIS, MO.
LABORATORY PARK DECATUR, ILL.
1006 OAZA KADOMA, KADOMA-CHO KITAKAWACHI-GUN, OSAKA,
3501 W. 48TH PLACE CHICAGO 32, ILL.
700 BROADWAY NEW YORK, N.Y.
811 WYANDOTTE KANSAS CITY, MO.
835 S. 8TH ST. ST. LOUIS 2, MO.
47/51 EXMOUTH MARKET, ROSEBERRY AVE. LONDON E.C.1, ENGLAND
1407 CUMMINGS DRIVE RICHMOND 20, VA.
Hi,
First of all thanks for your work, it works well and it's really useful.
What do you think of an option to make a case insensitive search for city/country names. I'm trying to do it for my project, I can send you a PR if you want and I succeed.
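One possible shape for such an option, sketched independently of geotext's internals, is a case-insensitive lookup of word n-grams against a set of known names, trying the longest phrase first. The tiny KNOWN_CITIES set and the function name find_cities are hypothetical; a real implementation would build the set from the GeoNames data.

```python
import re

# Hypothetical mini gazetteer, stored in lowercase for comparison.
KNOWN_CITIES = {"london", "new york", "kuala lumpur", "rio de janeiro"}
MAX_WORDS = max(len(name.split()) for name in KNOWN_CITIES)

def find_cities(text):
    """Return phrases in text that match a known city, ignoring case."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    found = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at word i first.
        for size in range(min(MAX_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + size])
            if phrase.lower() in KNOWN_CITIES:
                found.append(phrase)
                i += size
                break
        else:
            i += 1  # no phrase starting here matched
    return found

result = find_cities("can you tell me the weather in new york and rio de janeiro?")
print(result)
```

The trade-off is more false positives: ignoring case means ordinary words that happen to be place names will also match.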
from geotext import GeoText
places = GeoText("London is a great city")
places.cities
GeoText('New York, Texas, and also China').country_mentions
I am on Windows 10. Running the code fragment above throws an error:
"D:\Program Files\Python3\python.exe" D:/OneDrive/Programs/Jieba/ExtractLocation.py
Traceback (most recent call last):
  File "D:/OneDrive/Programs/Jieba/ExtractLocation.py", line 20, in <module>
    from geotext import GeoText
  File "C:\Users\Du Fei\AppData\Roaming\Python\Python36\site-packages\geotext\__init__.py", line 7, in <module>
    from .geotext import GeoText
  File "C:\Users\Du Fei\AppData\Roaming\Python\Python36\site-packages\geotext\geotext.py", line 87, in <module>
    class GeoText(object):
  File "C:\Users\Du Fei\AppData\Roaming\Python\Python36\site-packages\geotext\geotext.py", line 103, in GeoText
    index = build_index()
  File "C:\Users\Du Fei\AppData\Roaming\Python\Python36\site-packages\geotext\geotext.py", line 74, in build_index
    get_data_path('countryInfo.txt'), usecols=[4, 0], skip=1)
  File "C:\Users\Du Fei\AppData\Roaming\Python\Python36\site-packages\geotext\geotext.py", line 48, in read_table
    next(f)
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: illegal multibyte sequence
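The 'gbk' codec in this traceback (and 'charmap' in the similar report above) suggests the data file is being decoded with the platform's default encoding. A common remedy, sketched here on a throwaway file rather than on the library's own code, is to pass the encoding explicitly when opening the file. The sketch assumes the data files are UTF-8, as GeoNames dumps are; the file name and sample line are made up.

```python
import io
import os
import tempfile

sample = "ES\tAlcalá de Henares\n"  # illustrative line, not real file content

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Without encoding=..., open() falls back to the locale's preferred
    # encoding (for example gbk or cp1252 on some Windows setups).
    with io.open(path, "r", encoding="utf-8") as f:
        line = f.readline()

print(line)
```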