scraxbrl's Introduction

ScraXBRL

SEC Edgar Scraper and XBRL Parser/Renderer

To use:

  1. Install requirements from requirements.txt
  2. Change settings.py to fit your needs. Raw scraped data will be stored in data/raw_data and
    extracted data will be stored in data/extracted_data by default.
  3. Run python main.py. This will begin the scrape and extract process.
  4. To view data in the terminal (the data must already have been scraped from the SEC EDGAR website):
    Data is stored in a pickle file. To use DataViewer, create an instance of the DataView class and pass
    the necessary parameters (ticker_symbol, filing_date, filing_type). The data is then available in
    [instance name].data, which is an OrderedDict.
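
As a rough illustration of what a pickled OrderedDict looks like once it is loaded, here is a minimal, self-contained sketch. It does not use ScraXBRL itself: the file name and the sample contents are made up for the example, and only the top-level key names ('ins', 'pre') mirror keys reported for real DataView data in the issues.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

# Hypothetical stand-in for extracted filing data; real contents differ.
sample = OrderedDict([
    ("ins", OrderedDict()),
    ("pre", OrderedDict([("roles", OrderedDict())])),
])

# Write the sample to a temporary pickle file, then read it back.
tmp_dir = tempfile.mkdtemp()
pickle_path = os.path.join(tmp_dir, "example.pickle")
with open(pickle_path, "wb") as fh:
    pickle.dump(sample, fh)

with open(pickle_path, "rb") as fh:
    loaded = pickle.load(fh)

print(type(loaded).__name__)
print(list(loaded.keys()))
```

The loaded object keeps its key order, so iterating over it visits the sections in the order they were stored.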

scraxbrl's People

Contributors

iuvoz


scraxbrl's Issues

IndexError: list index out of range

Affects:
DataView('AC','2015-09-30','10-Q')
DataView('AIW','2011-06-30','10-Q')
DataView('AI','2012-09-30','10-Q')
DataView('APLE','2014-03-31','10-Q')
DataView('ARE','2012-03-31','10-Q')
DataView('ARL','2016-06-30','10-Q')
DataView('AT','2011-06-30','10-Q')
DataView('A','2016-07-31','10-Q')
[...]

Traceback (most recent call last):
    DataView('AC','2015-09-30','10-Q')
  File "DataViewer.py", line 13, in __init__
    self.load_data()
  File "DataViewer.py", line 20, in load_data
    fpath_file = os.listdir(fpath_no_p)[0]
IndexError: list index out of range

Most likely the pickle file was never created, possibly because the XML file(s) were not parsed properly or contain errors.
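
The traceback shows the first element of os.listdir() being taken without checking that the directory has any entries. One possible guard is sketched below. The helper name first_entry_or_none is hypothetical and is not part of ScraXBRL; it only shows how an empty or missing directory could be reported instead of raising IndexError.

```python
import os
import tempfile


def first_entry_or_none(directory):
    """Return the path of the first entry in `directory`, or None if there is none."""
    if not os.path.isdir(directory):
        return None
    entries = sorted(os.listdir(directory))
    if not entries:
        return None
    return os.path.join(directory, entries[0])


# An empty directory yields None rather than an IndexError.
empty_dir = tempfile.mkdtemp()
print(first_entry_or_none(empty_dir))
```

A caller could then print a message such as "no extracted data for this filing" when the result is None.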

xbrl taxonomy

First of all thanks for sharing the library. It works perfectly.
However, I can't really figure out how it is actually working!

In the settings script, a few paths to the XBRL taxonomy are defined (lines 41-49). However, it is not clear to me whether these paths are actually used for anything. What are they for?

Cannot get requirements

When running pip install -r requirements.txt, the following shows up in the log:

URLs to search for versions for Twisted-Web==13.2.0 (from -r requirements.txt (line 4)):

XML data extraction is not working

Scraping works perfectly fine, but when XMLExtract is called to extract a file, it fails in self.build_ins(). For some files it gets past that function but then hits an error at self.extract_all_pre(), after which it fails. Control never reaches the next function, self.extract_all_calc().

KeyError

>>> from DataViewer import *
>>> a=DataView('A','2013-04-30','10-Q')
>>> a.traverse_tree('StatementOfFinancialPositionClassified')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "DataViewer.py", line 72, in traverse_tree
    base = self.data[cat]['roles'][name]
KeyError: 'StatementOfFinancialPositionClassified'
>>> a.data.keys()
['ins', 'cal', 'lab', 'pre', 'error', 'no_lineage']
>>> a.data['pre'].keys()
['xbrl_titles', 'roles']
>>> a.data['pre']['roles'].keys()
['AcquisitionOfDako', 'AcquisitionOfDakoDetailedTextualsDetails', 'AcquisitionOfDakoIntangiblesDetails', 'AcquisitionOfDakoProformaConsolidatedOperatingResultsDetails', 'AcquisitionOfDakoPurchasePriceAllocationDetails', 'AcquisitionOfDakoTables', 'CondensedConsolidatedBalanceSheetUnaudited', 'CondensedConsolidatedBalanceSheetUnauditedParenthetical', 'CondensedConsolidatedStatementOfCashFlowsUnaudited', 'CondensedConsolidatedStatementOfComprehensiveIncomeUnaudited', 'CondensedConsolidatedStatementOfComprehensiveIncomeUnauditedParenthetical', 'CondensedConsolidatedStatementOfOperationsUnaudited', 'Derivatives', 'DerivativesDetails', 'DerivativesDisclosuresAndDerivativeInstrumentAggregatedNotionalAmountsByCurrencyAndDesignationsDetails', 'DerivativesEffectOfDerivativeInstrumentsOnConsolidatedStatementOfOperationsDetails', 'DerivativesFairValueOfDerivativeInstrumentsAndConsolidatedBalanceSheetLocationDetails', 'DerivativesTables', 'DocumentAndEntityInformation', 'FairValueMeasurements', 'FairValueMeasurementsFairValueMeasuresAndImpairmentOfLongLivedAssetsDetails', 'FairValueMeasurementsFairValueOfAssetsAndLiabilitiesMeasuredOnRecurringBasisDetails', 'FairValueMeasurementsTables', 'GoodwillAndOtherIntangibleAssets', 'GoodwillAndOtherIntangibleAssetsDisclosuresAndComponentsOfPurchasedOtherIntangiblesDetails', 'GoodwillAndOtherIntangibleAssetsFiniteLivedAssetsFutureAmortizationExpenseDetails', 'GoodwillAndOtherIntangibleAssetsGoodwillAndOtherIntangibleAssetsTextualsDetails', 'GoodwillAndOtherIntangibleAssetsGoodwillRollForwardDetails', 'GoodwillAndOtherIntangibleAssetsTables', 'IncomeTaxes', 'IncomeTaxesDetails', 'Inventory', 'InventoryDetails', 'InventoryTables', 'LongTermDebt', 'LongTermDebtDetails', 'LongTermDebtLongTermDebtOtherDebtDetails', 'LongTermDebtTables', 'NetIncomePerShare', 'NetIncomePerShareDetails', 'NetIncomePerShareTables', 'NewAccountingPronouncements', 'OverviewBasisOfPresentationAndSummaryOfSignificantAccountingPolicies', 
'OverviewBasisOfPresentationAndSummaryOfSignificantAccountingPoliciesDetails', 'OverviewBasisOfPresentationAndSummaryOfSignificantAccountingPoliciesPolicies', 'Restructuring', 'RestructuringDetails', 'RestructuringIncomeStatementLocationDetails', 'RestructuringTables', 'RetirementPlansAndPostRetirementPensionPlans', 'RetirementPlansAndPostRetirementPensionPlansDetails', 'RetirementPlansAndPostRetirementPensionPlansDetailsTextual', 'RetirementPlansAndPostRetirementPensionPlansTables', 'SegmentInformation', 'SegmentInformationProfitabilityAndSegmentAssetsDetails', 'SegmentInformationReconciliationOfReportableResultsDetails', 'SegmentInformationTables', 'ShareBasedCompensation', 'ShareBasedCompensationAllocatedShareBasedCompensationExpenseDetails', 'ShareBasedCompensationFairValueAssumptionsDetails', 'ShareBasedCompensationTables', 'ShortTermDebt', 'ShortTermDebtCreditFacilityDetails', 'ShortTermDebtSeniorNotesDetails', 'StockholdersEquity', 'StockholdersEquityStockRepurchaseProgramDetails', 'StockholdersEquityStockholdersEquityDividendsDetails', 'WarrantiesAndContingencies', 'WarrantiesAndContingenciesDetails', 'WarrantiesAndContingenciesTables']
>>> a.data['pre']['roles']['AcquisitionOfDako']
OrderedDict([('title_name', None), ('tree', OrderedDict([('BusinessCombinationsAbstract', OrderedDict([('pfx', 'us-gaap'), ('sub', OrderedDict([('BusinessCombinationDisclosureTextBlock', OrderedDict([('pfx', 'us-gaap'), ('sub', OrderedDict()), ('order', 1), ('val', OrderedDict()), ('label', 'Acquisition of Dako')]))])), ('label', u'Business Combinations [Abstract]')]))])), ('from_to', [('BusinessCombinationsAbstract', 'BusinessCombinationDisclosureTextBlock', '1', 'terseLabel')]), ('root', [('us-gaap', 'BusinessCombinationsAbstract', '1', 'terseLabel')]), ('unique', [('us-gaap', 'BusinessCombinationDisclosureTextBlock', '1', 'terseLabel'), ('us-gaap', 'BusinessCombinationsAbstract', '1', 'terseLabel')])])
>>> a.traverse_tree['AcquisitionOfDako']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'instancemethod' object has no attribute '__getitem__'
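
Two things stand out in this session. The name 'StatementOfFinancialPositionClassified' is not among the role keys listed for this filing, which appears to be why the lookup raises KeyError. The final TypeError comes from subscripting traverse_tree with square brackets, whereas a method has to be called with parentheses. The sketch below shows one way to search the available role names before calling traverse_tree. The helper find_roles and the miniature data are hypothetical; only the nesting data['pre']['roles'] is taken from the session above.

```python
from collections import OrderedDict


def find_roles(data, fragment, category="pre"):
    """Return role names in data[category]['roles'] containing `fragment` (case-insensitive)."""
    roles = data.get(category, {}).get("roles", {})
    needle = fragment.lower()
    return [name for name in roles if needle in name.lower()]


# Hypothetical miniature of the structure shown in the session above.
data = OrderedDict([
    ("pre", OrderedDict([
        ("roles", OrderedDict([
            ("CondensedConsolidatedBalanceSheetUnaudited", OrderedDict()),
            ("CondensedConsolidatedBalanceSheetUnauditedParenthetical", OrderedDict()),
            ("DocumentAndEntityInformation", OrderedDict()),
        ])),
    ])),
])

matches = find_roles(data, "balancesheet")
print(matches)
```

A name returned by such a search could then be passed to traverse_tree as a normal call argument.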

Formatting issue

This affects, for example, AAN's 10-Q for 2014-06-30:

aan.traverse_tree('ProgressiveAcquisitionIntangibleAssetsAcquiredDetails')
   IndefiniteLivedIntangibleAssetsAcquiredAsPartOfBusinessCombinationTable
      IndefiniteLivedIntangibleAssetsByMajorClassAxis
         IndefiniteLivedIntangibleAssetsMajorClassNameDomain
      AcquiredIndefiniteLivedIntangibleAssetsLineItems
         IndefinitelivedIntangibleAssetsAcquired
                (u'2014-04-13', u'2014-04-14')
                53000000.0
   FiniteLivedIntangibleAssetsAcquiredAsPartOfBusinessCombinationTable
         BusinessAcquisitionAcquireeDomain
            ProgressiveFinanceHoldingsLLCMember
                2014-04-14              (u'2015-01-01', u'2015-01-31')          (u'2014-07-01', u'2014-07-31')          (u'2014-04-15', u'2014-06-30')              (u'2014-04-13', u'2014-04-14')          (u'2014-01-01', u'2014-06-30')          (u'2013-01-01', u'2013-06-30')
                138198000.0             3600000.0               22300000.0              323000.0                333000000.0             2300000.0  1325928000.0
      FiniteLivedIntangibleAssetsByMajorClassAxis
         FiniteLivedIntangibleAssetsMajorClassNameDomain
      AcquiredFiniteLivedIntangibleAssetsLineItems
         FinitelivedIntangibleAssetsAcquired1
                (u'2014-04-13', u'2014-04-14')
                14000000.0
         IntangibleAssetsAcquired
                (u'2014-04-13', u'2014-04-14')
                333000000.0
         AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife
                (u'2014-04-13', u'2014-04-14')
                P9Y

Notice how BusinessAcquisitionAcquireeDomain is indented by 6 spaces where 3 should be used.
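
For reference, depth-based indentation of a nested tree can be sketched as follows. This is not ScraXBRL's traverse_tree: the function render_tree and the example node names are made up, and the only detail borrowed from the project is that child nodes sit under a 'sub' key, as in the OrderedDict output quoted in the KeyError issue. Each level is indented by exactly one 3-space step, so no level can jump by 6.

```python
from collections import OrderedDict

INDENT = "   "  # three spaces per level, as the report expects


def render_tree(tree, depth=1):
    """Return lines for a nested tree, indenting each level by one INDENT step."""
    lines = []
    for name, node in tree.items():
        lines.append(INDENT * depth + name)
        lines.extend(render_tree(node.get("sub", {}), depth + 1))
    return lines


# Hypothetical three-level tree.
tree = OrderedDict([
    ("ExampleTable", {"sub": OrderedDict([
        ("ExampleAxis", {"sub": OrderedDict([
            ("ExampleDomain", {"sub": OrderedDict()}),
        ])}),
    ])}),
])

lines = render_tree(tree)
print("\n".join(lines))
```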
