
seekdegas

This is a simple Python 2.7 library for scraping the SEC's EDGAR resource. Companies like seekinf charge for full access to the data, even though it is freely available on the SEC's FTP server.

Right now, to use it, add it as a submodule (or install it in your Python packages directory) and call the built-in functions seekdegas.query and seekdegas.download.

The function query accepts the arguments start and end, which are integers giving the start and end years of the search (the available range begins in 1993). All other arguments are optional filters:

  • cik, sic, company: strings that filter on those specific fields (CIK number, SIC number, and the exact company name as listed on official documents, case-insensitive).
  • forms: a list of strings of form names (e.g. ['10-K','10-Q'] is a possible selection).
  • keyphrases: a list of strings to search for in the documents. Its use is not recommended, since it makes the search extremely slow.

query yields an iterator of hashtables, each of which has the keys cik, company, form, date, and url, all of which are fairly self-explanatory. date is the date listed on the filing, which may sometimes conflict with the quarter or even the year in which the filing was listed. url is the URL of the filing's text/HTML document on the FTP server.
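The filter behaviour described above can be sketched in plain Python. This is only an illustration, not the library's code: matches and filter_filings are hypothetical helpers, the sample records are made up, and the date format and URL paths shown are assumptions. The sic filter is left out because the records carry no SIC field.

```python
def matches(filing, cik=None, company=None, forms=None):
    """Return True if a filing dict passes every filter that was supplied."""
    if cik is not None and filing["cik"] != cik:
        return False
    # Company names must match exactly, but case is ignored.
    if company is not None and filing["company"].lower() != company.lower():
        return False
    if forms is not None and filing["form"] not in forms:
        return False
    return True


def filter_filings(filings, **filters):
    """Yield only the filings that satisfy the given filters."""
    for filing in filings:
        if matches(filing, **filters):
            yield filing


# Made-up records with the keys cik, company, form, date, url.
# The date format and the URL paths are assumptions for illustration.
SAMPLE = [
    {"cik": "0000000001", "company": "EXAMPLE CORP", "form": "10-K",
     "date": "2010-03-01", "url": "path/to/filing-1.txt"},
    {"cik": "0000000001", "company": "EXAMPLE CORP", "form": "8-K",
     "date": "2010-05-20", "url": "path/to/filing-2.txt"},
    {"cik": "0000000002", "company": "SAMPLE INC", "form": "10-Q",
     "date": "2011-06-15", "url": "path/to/filing-3.txt"},
]

for filing in filter_filings(SAMPLE, company="example corp", forms=["10-K", "10-Q"]):
    print(filing["company"], filing["form"], filing["date"])
```

Omitting a filter means that field is not checked at all, which mirrors how the optional arguments are described.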

The function download takes the same arguments, but it downloads the files instead of yielding the hashtables. It also accepts the strings regex and regopt, which let you download only portions of the filings by extracting them with regular expressions; these are fed directly into re.findall (see the re package documentation for more information). filepath is the name of the folder you want to save the files into, with no backslashes. It is a relative path, so the files are saved into that subdirectory of wherever you run the script. fileprefix is a string added to the beginning of each filename as it is saved. The base name format is (CIK number)-(date of filing).txt.
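A minimal sketch of the two ideas in that paragraph, extracting portions with re.findall and building the saved filename. The helpers extract_portions and build_filename are hypothetical, and the sketch assumes regopt ends up as the flags argument of re.findall; how the library actually interprets regopt is not confirmed here.

```python
import re


def extract_portions(text, regex, regopt=0):
    """Return every match of regex in text (regopt is assumed to be re flags)."""
    return re.findall(regex, text, regopt)


def build_filename(cik, date, fileprefix=""):
    """Build a name of the form <fileprefix><CIK number>-<date of filing>.txt."""
    return "{0}{1}-{2}.txt".format(fileprefix, cik, date)


# Made-up filing text, CIK and date, used only for illustration.
text = "Total Revenue: 10\ntotal revenue: 20\n"
print(extract_portions(text, r"total revenue: (\d+)", re.IGNORECASE))
print(build_filename("0000000001", "2010-03-01", fileprefix="acme_"))
```

With a single capture group, re.findall returns just the captured text for each match, which is convenient for pulling one value out of every filing.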

You can create your own functions that act on every filing searched for with the @edgar decorator. It feeds the wrapped function a single variable fdata, which is an array containing the CIK number, company name, form name, date of filing, and URL of the filing on the server (without the domain), in that order, indexed from 0. The arguments taken by the decorated function are then identical to those of the function query. The decorator iterates through the filings matched under those parameters and yields the result of the wrapped function for each one; functions that return nothing and simply perform an action (for example, download) are fine as well. If you want further arguments, those work too, as long as they are named arguments with defaults, since the implementation uses **kwargs. See the source code for details.
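The decorator pattern described above might look roughly like the following. This is a stand-in, not the library's @edgar: edgar_like is a hypothetical name, it iterates over a made-up in-memory list instead of the FTP server, it supports only the start, end and forms arguments, and it assumes the end year is inclusive and that dates begin with a four-digit year.

```python
import functools

# Made-up rows in the order described above:
# [CIK, company name, form name, date of filing, URL path without the domain].
SAMPLE_FILINGS = [
    ["0000000001", "EXAMPLE CORP", "10-K", "2010-03-01", "path/to/filing-1.txt"],
    ["0000000002", "SAMPLE INC", "8-K", "2011-06-15", "path/to/filing-2.txt"],
]


def edgar_like(func):
    """Stand-in decorator: call func(fdata, **extra) for each matching filing."""
    @functools.wraps(func)
    def wrapper(start, end, forms=None, **extra):
        for fdata in SAMPLE_FILINGS:
            year = int(fdata[3][:4])  # assumes dates start with the year
            if not (start <= year <= end):
                continue
            if forms is not None and fdata[2] not in forms:
                continue
            # Extra named arguments are passed through to the wrapped function.
            yield func(fdata, **extra)
    return wrapper


@edgar_like
def describe(fdata, sep=" | "):
    """Summarise one filing as a single line of text."""
    return sep.join([fdata[1], fdata[2], fdata[3]])


for line in describe(2010, 2011):
    print(line)
```

Because the wrapper is a generator, nothing runs until the result is iterated, so a wrapped function that only performs an action still has to be consumed (for example with a for loop).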

Keyphrase search is incredibly slow right now because each individual text file has to be opened from the FTP server and searched. The commercial resource mentioned above shows that it doesn't have to be slow, so figuring out how to speed this up is still an open problem.

Feel free to contribute; there is quite a bit of functionality that is still missing. Some of it is because I don't have enough experience with the data to know what's going on with it.

Immediate TODO list:

  • Implement XBRL search functionality - not actually sure what this is yet.
  • Make output in other formats (downloading a CSV, etc.) available as part of the library. Downloading the archival files made available by the SEC may also be useful.
  • Allow filtering by finer timeframes than entire years. Quarters at the least; are individual days necessary/useful?
  • Create a web interface for easy use.

SIC lookup data gleaned from the results of Matt Kiefer's scraper.
