Insight Extractor

The Insight Extractor was the ML model that Considdr used to identify abstractive sentences in full text documents on the web. Considdr closed in the summer of 2020 and now we're making our model freely available to all. We'd love to hear the interesting ways people apply this model. All we ask is that you cite this repo.

Abstractive sentences are of particular value when it comes to understanding the key insights in adjacent documents. For more on this summarization approach see "Summarization by Adjacent Document."

Install

pip install insight_extractor

Notes:

  • We use TensorFlow 2.x and recommend Python 3.6 or higher.
  • Python 2 is not supported.

Using insight_extractor

v0.1.1 of insight_extractor exposes one primary function, extract_insights, which takes a list of candidate insight sentences and returns a list of prediction scores, one per sentence. Each score is the model's estimated probability that the sentence is an insight.

Input:

# given a list of input sentences
sentences = [
    'According to the most recent statistics, more than a million people a year are arrested for simple drug possession in the United States -- and more than half a million of those arrests are for marijuana possession.',
    'One study found that for cancer patients considering experimental chemotherapy, trust in their physician was one of the most important reasons they enrolled in a clinical trial -- on par with the belief that the treatment would be effective.',
    'Senate leaders were working to agree on a dual track to try the departing president at the same time it considered the agenda of the incoming one, an exercise never tried before.',
]

Insight Extraction

# import
from insight_extractor.pipeline import extract_insights

# get insight predictions
predictions = extract_insights(sentences)

# print predictions
print(predictions)

Output:

[0.7167318, 0.6289567, 0.01138071]

Notes on Interpretation

Of the three sample input sentences, we would label the first two as insights, but not the last. The model agrees: it scores the first and second sentences at roughly 72% and 63% probability of being an insight, and the third at roughly 1%.

Generally, most sentences in a given article are not insight sentences, though some are more "abstractive" than others. In practice, we found that sentences predicted with >10% probability of being an insight often have at least some abstractive value. You may want to adjust the threshold to suit your use case and tolerance for false positives.
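
A minimal sketch of applying such a threshold to the scores. The helper name filter_insights and its parameters are hypothetical and not part of the package; the scores are the sample output shown above, and the placeholder sentences stand in for the three sample inputs.

```python
def filter_insights(sentences, scores, threshold=0.10):
    """Return (sentence, score) pairs scoring above threshold, best first.

    Hypothetical helper, not part of insight_extractor. `scores` is assumed
    to hold one probability per sentence, in the same order.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    kept = [(s, float(p)) for s, p in zip(sentences, scores) if p > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


# Placeholder text standing in for the three sample sentences above.
sample_sentences = [
    "first sample sentence",
    "second sample sentence",
    "third sample sentence",
]
# Scores copied from the sample output above.
sample_scores = [0.7167318, 0.6289567, 0.01138071]

likely_insights = filter_insights(sample_sentences, sample_scores)
for sentence, score in likely_insights:
    print(f"{score:.2f}  {sentence}")
```

With the model installed, the scores would come from extract_insights(sentences) instead of the hard-coded list. Raising the threshold trades recall for fewer false positives.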

Notes

v0.1.X provides only the bare-minimum functionality of the Considdr insight model.

  1. In the actual production implementation, we took entire articles (HTML pages) as input and returned the insight sentences from each article.
  2. We leveraged the fact that multiple documents often abstract the same works and built a second, much more complex model to cluster similar insights together.
  3. We also trained various versions of our model on academic documents when we built proofs of concept for academic search engines that were interested in our technology. Citation structure enables a very clear extension of our summarization by adjacent document approach.
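
The package does not include the article-level step from item 1, so the following is only a rough sketch of how one might approximate it, not the Considdr implementation. All names here (candidate_sentences, insights_from_html, score_fn, min_words) are hypothetical, the sentence splitter is a naive regex, and the demo uses a stand-in scorer so that it runs without the model.

```python
import re
from html.parser import HTMLParser


class _ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (standard library only)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._in_paragraph = False
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def candidate_sentences(html, min_words=5):
    """Split paragraph text into sentences with a naive regex (an assumption)."""
    parser = _ParagraphText()
    parser.feed(html)
    parser.close()
    sentences = []
    for paragraph in parser.paragraphs:
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            if len(sentence.split()) >= min_words:
                sentences.append(sentence)
    return sentences


def insights_from_html(html, score_fn, threshold=0.10):
    """Score candidate sentences with score_fn and keep those above threshold."""
    sentences = candidate_sentences(html)
    if not sentences:
        return []
    scores = score_fn(sentences)
    return [(s, float(p)) for s, p in zip(sentences, scores) if p > threshold]


# Stand-in scorer for the demo only; it is NOT the insight model.
def fake_scores(batch):
    return [0.5 if "study" in s.lower() else 0.01 for s in batch]


page = (
    "<html><body><h1>Title</h1>"
    "<p>One study found that trust matters a great deal. Short one.</p>"
    "<script>var x = 1;</script>"
    "<p>Leaders met on Tuesday to discuss the agenda.</p>"
    "</body></html>"
)
print(insights_from_html(page, fake_scores))
```

In real use, score_fn would be extract_insights imported from insight_extractor.pipeline. A proper sentence tokenizer and a more careful HTML extractor would likely give better candidates than this sketch.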

Over time, we plan to update this package to better reflect the robustness of the Considdr product. Collaborators and contributors are welcome.

Acknowledgements

This insight extraction model benefitted from the hard work of many of our team members at Considdr. In particular, hand labeling thousands and thousands of sentences and cross-validating those labels across members of our team was an especially grueling effort. Thank you to Hailey Wahl, Kevin Lane, Derek Yau, and Eddie Korando for all your help here.

A special thank you to Gaurav Sood who encouraged us to share our model with the broader community and who helped walk us through best practices for packaging ML models.

We also heavily utilized the following resources in building our CNN model.

Authors

Noah Finberg and Marcus Christiansen

License

The package is released under the MIT License.
