ufo_sightings's Introduction

NLP Unsupervised Learning Case Study

Cindy Wong | Feli Gentle | Maureen Petterson

Introduction

UFO sightings are reported frequently across the United States. The reported objects vary in shape, and the sightings vary in duration. Using the UFO sighting database, we evaluated several characteristics of the sightings and used Natural Language Processing (NLP) to analyze the descriptions and identify what they have in common.

The data was pulled from the National UFO Reporting Center Online Database.

Data Preparation and Exploratory Data Analysis

Data Preparation

The raw data was 2.5GB and required substantial preparation prior to analysis. We downloaded a zipped JSON file that included the raw HTML for each individual sighting.

Cleaning and preparation methods included:

  • Extracting the unique observation ID, date, time, location, shape and text description of the sightings
    • Using Beautiful Soup's HTML parser to extract data contained within specific HTML tags
    • Limiting the data to about 15,000 records to keep the processing time manageable
    • Using regular expressions to extract the exact terms needed to run analysis on the different features
  • Separating the text description from the follow-up notes
  • Putting the information into a pandas DataFrame for easier analysis
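
The extraction steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample HTML, the field labels, the `parse_report` helper, and the `S0001` ID are all hypothetical, and the real report markup may be laid out differently.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical report HTML; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr><td>Occurred : 7/4/2015 21:30<br>Location: Denver, CO<br>Shape: Circle<br>Duration: 5 minutes</td></tr>
  <tr><td>Bright orange circle hovered over the park, then vanished.</td></tr>
</table>
"""


def parse_report(report_id, html):
    """Pull a few fields out of one report's raw HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td")
    # Put each <br>-separated field on its own line so the regexes stay simple.
    header = cells[0].get_text(separator="\n")
    description = cells[1].get_text(strip=True)

    def field(pattern):
        match = re.search(pattern, header)
        return match.group(1).strip() if match else None

    return {
        "id": report_id,
        "occurred": field(r"Occurred\s*:\s*(.+)"),
        "location": field(r"Location\s*:\s*(.+)"),
        "shape": field(r"Shape\s*:\s*(.+)"),
        "description": description,
    }


# Assumed input shape: a mapping of observation ID to raw HTML.
reports = {"S0001": SAMPLE_HTML}
df = pd.DataFrame([parse_report(rid, html) for rid, html in reports.items()])
```

Returning one dictionary per report keeps the parsing logic separate from the DataFrame construction, so a subset of reports can be parsed and inspected before processing the full file.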
Raw JSON data
Extracted sample report data

The cleaned-up pandas DataFrame is shown below.

Exploratory Data Analysis

The sightings described UFOs of many different shapes, including circles, chevrons, lights, and fireballs. The sightings lasted from a few seconds to many minutes.

Shapes and Duration
shapes

The time of day of the observations was also interesting. Sightings were more frequent in the early morning and evening hours, which makes sense because lights in the sky are less visible during daylight hours. It's also possible that many people mistake planets, satellites, or planes for UFOs.

timeofday
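
A time-of-day breakdown like this one can be computed with a few lines of pandas. The sketch below assumes a column named `occurred` holding month/day/year timestamps; both the column name and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical timestamps standing in for the cleaned date/time column.
df = pd.DataFrame(
    {"occurred": ["7/4/2015 21:30", "7/5/2015 05:10", "7/6/2015 21:05", "7/7/2015 13:00"]}
)

# Unparseable values become NaT instead of raising an error.
df["occurred"] = pd.to_datetime(df["occurred"], format="%m/%d/%Y %H:%M", errors="coerce")

# Number of sightings for each hour of the day, ordered by hour.
sightings_by_hour = df["occurred"].dt.hour.value_counts().sort_index()
```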

State

We counted the sightings per state. In our data, California has the most UFO sightings.

state_count
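
One way to get per-state counts is sketched below. It assumes a `location` column formatted as "City, ST"; the column name, the regular expression, and the sample rows are assumptions rather than details taken from the project.

```python
import pandas as pd

# Hypothetical locations; the last row has no state code.
df = pd.DataFrame(
    {"location": ["Denver, CO", "Los Angeles, CA", "San Diego, CA", "Unknown"]}
)

# Treat a trailing two-letter uppercase code as the state; other rows become NaN.
df["state"] = df["location"].str.extract(r",\s*([A-Z]{2})\s*$", expand=False)

# value_counts() skips NaN by default and sorts from most to least frequent.
state_counts = df["state"].value_counts()
```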

Natural Language Processing

The data was analyzed using a combination of nltk packages and sklearn's CountVectorizer/TfidfVectorizer to find the most common words within the observations. The output of the TF-IDF transformation was decomposed using two methods:

  1. Non-Negative Matrix Factorization (NMF)
  2. Singular Value Decomposition (SVD) combined with k-means

Both of these methods allowed extraction of latent topics.

The corpus (documents) was prepared using standard methods:

  • Tokenization
  • Stop words removal (standard English)
  • Lemmatization using nltk WordNetLemmatizer
  • TF-IDF vectorization to get the relative word strength
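
A rough sketch of this preparation pipeline is shown below. To stay runnable without downloading any corpus data, it uses a simple regular-expression tokenizer and scikit-learn's built-in English stop word list as stand-ins, and it leaves the lemmatizer pluggable. The `make_tokenizer` helper and the sample documents are hypothetical.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer


def identity(token):
    """Placeholder lemmatizer that returns the token unchanged."""
    return token


def make_tokenizer(lemmatize=identity):
    """Build a tokenizer that lowercases, drops stop words, and lemmatizes."""

    def tokenize(text):
        tokens = re.findall(r"[a-z]+", text.lower())
        return [lemmatize(t) for t in tokens if t not in ENGLISH_STOP_WORDS]

    return tokenize


# Hypothetical descriptions standing in for the real corpus.
docs = [
    "The lights were hovering over the lake",
    "A bright light moved over the hills",
]

# token_pattern=None because the custom tokenizer replaces the default pattern.
vectorizer = TfidfVectorizer(tokenizer=make_tokenizer(), lowercase=False, token_pattern=None)
tfidf = vectorizer.fit_transform(docs)
```

To lemmatize with nltk, pass `WordNetLemmatizer().lemmatize` as the `lemmatize` argument after fetching the WordNet data with `nltk.download("wordnet")`.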

The results from NMF and SVD + k-means are shown below.

NMF topics:
nmf

SVD + k-means topics:
svd

Summary and Key Findings

Data
