On a standard Ubuntu system (Ubuntu 20.04 or earlier), we first need to check whether Python is already installed:
Ubuntu version 20.04:
python3 --version
Prior Ubuntu versions:
python --version
If python3 is not installed, please run the following commands:
sudo apt-get update
sudo apt-get install python3
Once python3 is installed, we need to install the "geoip2" module with the following command [1]:
pip install geoip2
If pip (Pip Installs Packages) is not installed, please execute the following command:
sudo apt install python3-pip
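As a quick check before running the program, the following minimal sketch tests whether the geoip2 module can be imported by the current Python interpreter. The helper name is_installed is hypothetical and is not part of parseGeoLiteCityDB.py.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by this interpreter."""
    # find_spec returns None when the top-level module is not installed.
    return importlib.util.find_spec(module_name) is not None


if not is_installed("geoip2"):
    print("geoip2 is missing; install it with: pip install geoip2")
```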
We are now ready to run the code with the following command (Ubuntu 20.04):
python3 parseGeoLiteCityDB.py access.log
For prior Ubuntu versions:
python parseGeoLiteCityDB.py access.log
The program needs two files:
- access.log
- GeoLite2-City.mmdb
Both files must be present in the same directory as parseGeoLiteCityDB.py.
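The file requirement above can be verified with a minimal sketch like the one below. It assumes the two file names listed above; the function name missing_files is hypothetical and is not taken from parseGeoLiteCityDB.py.

```python
import os.path

# The two files the program expects next to the script.
REQUIRED_FILES = ("access.log", "GeoLite2-City.mmdb")


def missing_files(script_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from script_dir."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(script_dir, name))
    ]
```

Inside the script, the directory could be obtained with os.path.dirname(os.path.abspath(__file__)), and an empty list from missing_files means both files are in place.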
When executing the run command (python3 parseGeoLiteCityDB.py access.log), we can pass a different log file to get a different output.
If you want to try a different database, please change the database file name in the variable on line 15 of the code.
I could have written the code to accept both files (the access.log file and the database file) as input. However, since the homework question says "Include a command-line program to run your code against an arbitrary file", I limited the input argument to the access.log file only.
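For illustration only, accepting both files as input could look like the sketch below. It is not what parseGeoLiteCityDB.py does; the --db option and the parse_args name are hypothetical, and the default database name is the one used in this README.

```python
import argparse


def parse_args(argv=None):
    """Parse a required log file and an optional database path."""
    parser = argparse.ArgumentParser(
        description="Summarize page views by country and US state."
    )
    parser.add_argument("log_file", help="path to the access log to analyze")
    # Hypothetical option: lets the user swap the database without editing code.
    parser.add_argument(
        "--db",
        default="GeoLite2-City.mmdb",
        help="path to the GeoLite2 City database",
    )
    return parser.parse_args(argv)
```

With this design, the log file stays the only required argument, so the command shown earlier would keep working unchanged.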
If needed, please download "GeoLite2-City.mmdb" from here: https://drive.google.com/drive/folders/1Squ0xtr2QCDPoGq6TyIkS-_0HjA2yMib?usp=sharing [2]
ubuntu@ip-172-31-45-47:~/maxmind$ python3 parseGeoLiteCityDB.py access.log
Most Viewed Country:
Country :: #Most View :: "The most viewed page" (#viewCount)
============================================================
United States :: 13905 :: "/region/1" (61)
Netherlands :: 3216 :: "/search/by-lat-long/9.250043,-83.859123/filter/category_id=1;category_id=2;category_id=3;category_id=4;category_id=5;category_id=6;category_id=7;category_id=8;category_id=9?limit=10;unit=km;distance=10" (11)
China :: 1466 :: "/entry/" (9)
Germany :: 1244 :: "/entry/20252" (26)
France :: 702 :: "/entry/2299" (4)
Russia :: 658 :: "/region/659" (3)
United Kingdom :: 304 :: "/region/52" (7)
Canada :: 221 :: "/entry/6843" (3)
Mexico :: 120 :: "/region/1" (2)
Israel :: 66 :: "/site/recent.atom?entries_only=1" (11)
Most Viewed US States:
States :: #Most View :: "The most viewed page" (#viewCount)
============================================================
Washington :: 2400 :: "/region/1" (17)
Virginia :: 2278 :: "/entry/4628" (8)
California :: 410 :: "/location/most_recent_vendors.rss?location_id=5" (22)
New York :: 174 :: "/region/1" (5)
Delaware :: 171 :: "/region/1503" (2)
Michigan :: 153 :: "/region/447" (2)
Texas :: 152 :: "/region/218" (10)
Minnesota :: 131 :: "/region/13" (21)
Illinois :: 116 :: "/region/1766" (4)
New Jersey :: 96 :: "/entry/near/40.7458%2C-74.0321/filter/category_id=1;veg_level=2;allow_closed=0?limit=10;order_by=distance;address=Your+location" (4)
Summary:
Total valid IP processed: 22454
Unknown Country list: (total 3)
['193.202.255.201', '66.249.93.72', '66.249.81.72']
Unknown states found: 790
Total execution time: 13.11 seconds.
I also experimented with altered access.log files: first with the total number of lines reduced by half, and then with only the first 500 lines. My code handles the following situation correctly: "where there are less than 10 states or countries with visitors, only show those which have at least one visitor". The corresponding outputs follow:
Using 25037 lines of the access.log file:
ubuntu@ip-172-31-45-47:~/maxmind$ python3 parseGeoLiteCityDB.py access_half.log
Most Viewed Country:
Country :: #Most View :: "The most viewed page" (#viewCount)
============================================================
United States :: 5844 :: "/region/1" (36)
Netherlands :: 1272 :: "/search/by-lat-long/53.214297,-1.738481/filter/category_id=1;category_id=2;category_id=3;category_id=4;category_id=5;category_id=6;category_id=7;category_id=8;category_id=9?limit=10;unit=km;distance=10" (4)
China :: 815 :: "/site/help" (8)
Germany :: 531 :: "/entry/20252" (16)
United Kingdom :: 245 :: "/region/659" (6)
Russia :: 189 :: "/site/help" (2)
France :: 186 :: "/entry/2299" (3)
Mexico :: 104 :: "/region/1" (2)
Canada :: 34 :: "/entry/3613" (2)
Israel :: 25 :: "/site/recent.atom?entries_only=1" (4)
Most Viewed US States:
States :: #Most View :: "The most viewed page" (#viewCount)
============================================================
Washington :: 1070 :: "/region/1" (12)
Virginia :: 439 :: "/entry/20253/reviews" (5)
California :: 191 :: "/location/most_recent_vendors.rss?location_id=5" (10)
New York :: 84 :: "/region/1" (3)
Illinois :: 82 :: "/region/1766" (4)
Texas :: 72 :: "/region/218" (6)
Ohio :: 70 :: "/entry/5023" (3)
Pennsylvania :: 70 :: "/region/599" (3)
Minnesota :: 60 :: "/region/13" (18)
Florida :: 51 :: "/api-explorer/" (4)
Summary:
Total valid IP processed: 9506
Unknown Country list: (total 2)
['193.202.255.201', '66.249.93.72']
Unknown states found: 487
Total execution time: 4.86 seconds.
Using only the first 500 lines of the access.log file:
ubuntu@ip-172-31-45-47:~/maxmind$ python3 parseGeoLiteCityDB.py access_500linesOnly.log
Most Viewed Country:
Country :: #Most View :: "The most viewed page" (#viewCount)
============================================================
United States :: 162 :: "/entry/5023" (3)
Netherlands :: 42 :: "/entry/near/0%2C0/filter?unit=mile;distance=25;sort_order=ASC;page=;order_by=distance;address=34034;limit=" (1)
China :: 21 :: "/entry/15205" (1)
Switzerland :: 6 :: "/entry/15603" (2)
Germany :: 4 :: "/region/60" (1)
France :: 3 :: "/entry/656" (1)
Canada :: 2 :: "/entry/2708" (1)
Israel :: 1 :: "/site/recent.atom?entries_only=1" (1)
Most Viewed US States:
States :: #Most View :: "The most viewed page" (#viewCount)
============================================================
Washington :: 41 :: "/entry/1817" (1)
Ohio :: 5 :: "/entry/5023" (3)
California :: 3 :: "/location/view.html?location_id=174&new_query=1" (1)
Texas :: 3 :: "/region/2" (2)
Arizona :: 2 :: "/site/help" (1)
Virginia :: 1 :: "/entry/18992" (1)
Summary:
Total valid IP processed: 242
Unknown Country list: (total 1)
['193.202.255.201']
Unknown states found: 33
Total execution time: 0.14 seconds.
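The "fewer than 10" behavior seen in the outputs above can be sketched with collections.Counter. This is a minimal illustration under my own assumptions, not the code from parseGeoLiteCityDB.py: the input is assumed to be one (location, page) pair per request, and the names top_locations and format_row are hypothetical.

```python
from collections import Counter, defaultdict


def top_locations(views, limit=10):
    """Return up to `limit` rows of (location, views, top_page, top_page_views).

    `views` is an iterable of (location, page) pairs, one per request.
    Only locations that actually occur in `views` can appear in the result,
    so fewer than `limit` rows are returned when fewer locations were seen.
    """
    totals = Counter()
    pages = defaultdict(Counter)
    for location, page in views:
        totals[location] += 1
        pages[location][page] += 1

    rows = []
    for location, total in totals.most_common(limit):
        top_page, top_count = pages[location].most_common(1)[0]
        rows.append((location, total, top_page, top_count))
    return rows


def format_row(row):
    """Format one row in the style of the report lines shown above."""
    location, total, top_page, top_count = row
    return f'{location} :: {total} :: "{top_page}" ({top_count})'
```

Because Counter.most_common(limit) returns at most limit entries, no special case is needed for inputs with fewer than 10 countries or states.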
A user can mistakenly give wrong inputs when running the program; I handled those situations in my code. The different cases are as follows:
- No input file given:
ubuntu@ip-172-31-45-47:~/maxmind$ python3 parseGeoLiteCityDB.py
Please provide ONLY the 'access.log' file as the first argument.
- More than 1 input file given:
ubuntu@ip-172-31-45-47:~/maxmind$ python3 parseGeoLiteCityDB.py access_500linesOnly.log GeoLite2-City.mmdb
Please provide ONLY the 'access.log' file as the first argument.
- Typo while executing the command (spelling mistake of the access.log file):
ubuntu@ip-172-31-45-47:~/maxmind$ python3 parseGeoLiteCityDB.py access_500linesOnly.log GeoLite2-City.mmdbasdfasdf
Please provide ONLY the 'access.log' file as the first argument.
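The input checks described above could be implemented along the lines of this minimal sketch. The message text is copied from the transcripts; everything else is an assumption for illustration, including the names validate_args and main and the choice to treat a nonexistent file the same way as a wrong argument count.

```python
import os.path

USAGE = "Please provide ONLY the 'access.log' file as the first argument."


def validate_args(argv):
    """Return the log file path if argv is valid, otherwise None.

    argv is expected to look like sys.argv: the script name followed by
    exactly one argument that names an existing file.
    """
    if len(argv) != 2:  # no input file, or more than one
        return None
    log_path = argv[1]
    if not os.path.isfile(log_path):  # e.g. a misspelled file name
        return None
    return log_path


def main(argv):
    """Print the usage message and return 1 on bad input, else return 0."""
    log_path = validate_args(argv)
    if log_path is None:
        print(USAGE)
        return 1
    # The real processing of log_path would start here.
    return 0
```

A script would typically end with sys.exit(main(sys.argv)) so that bad input also produces a non-zero exit status.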
I imported the following modules:
import re
import sys
import time
import os.path
import webbrowser
import geoip2.database
All of the above modules are part of the Python 3 standard library, except geoip2.database, which is provided by the geoip2 package.
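To show how the geoip2 module fits in, here is a minimal sketch of looking up the country and state for an IP address. It is not the code from parseGeoLiteCityDB.py; the function names locate and print_locations are hypothetical. It relies on the geoip2 package's database Reader, its city() lookup, and the AddressNotFoundError it raises for addresses missing from the database.

```python
def locate(reader, ip):
    """Return (country_name, state_name) for ip using an open geoip2 reader.

    Either name may be None when the database has no value for it, which
    is one way "unknown" countries or states can arise.
    """
    response = reader.city(ip)
    country = response.country.name
    state = response.subdivisions.most_specific.name
    return country, state


def print_locations(db_path, ips):
    """Open the database at db_path and print the location of each IP."""
    # Imported here so locate() can be read and tried without geoip2 installed.
    import geoip2.database
    import geoip2.errors

    with geoip2.database.Reader(db_path) as reader:
        for ip in ips:
            try:
                print(ip, locate(reader, ip))
            except geoip2.errors.AddressNotFoundError:
                print(ip, "not found in the database")
```

For example, print_locations("GeoLite2-City.mmdb", ["66.249.93.72"]) would print the result for one of the addresses from the summary above, provided the database file is present.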
References:
[1] https://www.makeuseof.com/install-python-ubuntu/
[2] https://dev.maxmind.com/geoip/geolite2-free-geolocation-data?lang=en