Useful code for working with structured data
'dataprep_plots_code_pp' performs the following tasks:
- Outlier treatment (cap at the 1st and 99th percentiles)
- Missing value treatment (substitute with 0 or with the median/mean)
- Create visualizations (13 different types), all with customized fonts and other styling so that they can be inserted directly into Word reports
- Factor analysis template that uses Cronbach's alpha to check whether the responses to the survey questions measuring each construct are internally consistent
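The outlier, missing-value, and Cronbach's alpha steps listed above can be sketched as follows. This is a minimal standard-library illustration, not the code in 'dataprep_plots_code_pp'; the function names (`percentile`, `cap_outliers`, `fill_missing`, `cronbach_alpha`) and the use of `None` to mark missing values are assumptions made for the example.

```python
import statistics

def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def cap_outliers(values, lower_q=1, upper_q=99):
    """Cap observed values at the given percentiles; None entries pass through."""
    observed = sorted(v for v in values if v is not None)
    low, high = percentile(observed, lower_q), percentile(observed, upper_q)
    return [v if v is None else min(max(v, low), high) for v in values]

def fill_missing(values, strategy="median"):
    """Replace None with 0, the mean, or the median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0.0
    elif strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of scores per survey question."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]  # total score per respondent
    item_var = sum(statistics.variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))

# Usage on made-up data: cap first, then impute, following the order above.
scores = [3.0, None, 250.0, 5.0, 4.0]
cleaned = fill_missing(cap_outliers(scores), strategy="median")
```

Capping before imputing keeps the filled-in values from shifting the percentile cut-offs; the reverse order is equally easy to express if that is what the analysis calls for.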
'LDAR_pradeep_28June.ipynb' contains R code. Once the data is prepared as 'documents', it performs the following operations:
- Lowercases the text, removes stopwords, numbers and extra whitespace, and creates a document-term matrix (DTM)
- Creates the vocabulary from the DTMs
- Creates n-grams
- Applies LDA
- Identifies the top terms for each topic
- Scores the documents on topics
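To make the last four steps above concrete, here is a small Python sketch of n-gram creation, LDA, top terms per topic, and document scoring. It is not a translation of the R notebook: it swaps in a plain collapsed Gibbs sampler written with the standard library, and the names `ngrams`, `lda_gibbs`, and `top_terms`, as well as the hyperparameter defaults, are assumptions for illustration. It expects documents that are already cleaned and tokenized.

```python
import random

def ngrams(tokens, n=2):
    """Join each run of n consecutive tokens into a single term."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic, topic_word, vocab): topic proportions per document,
    word probabilities per topic, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    ndk = [[0] * n_topics for _ in docs]            # document-topic counts
    nkw = [[0] * n_words for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                             # tokens assigned to each topic
    z = []                                          # topic of every token
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)             # random initial assignment
            z[d].append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi, k = word_id[w], z[d][i]
                # Remove the token, then resample its topic given all others.
                ndk[d][k] -= 1
                nkw[k][wi] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wi] + beta) / (nk[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][wi] += 1
                nk[k] += 1
    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in ndk[d]]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(c + beta) / (nk[k] + n_words * beta) for c in nkw[k]]
        for k in range(n_topics)
    ]
    return doc_topic, topic_word, vocab

def top_terms(topic_word, vocab, n=3):
    """Return the n most probable terms for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]

# Usage on made-up tokenized documents.
docs = [["apple", "banana", "apple"], ["banana", "fruit", "apple"],
        ["car", "engine", "wheel"], ["engine", "car", "car"]]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2, n_iter=50)
terms = top_terms(topic_word, vocab, n=2)
```

Each row of `doc_topic` is the document's score on every topic and sums to 1. For real corpora a maintained LDA implementation will be far faster than this loop.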
'LDApy_pradeep_28June.ipynb' contains Python code for applying LDA for text classification. Still a work in progress (WIP):
- Removes stop words, tokenizes, lemmatizes, and creates the DTM
- Creates the vocabulary from the DTM
- Applies LDA
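The preprocessing steps above (stop-word removal, tokenization, DTM, vocabulary) might look roughly like the sketch below. It uses only the standard library; the tiny `STOP_WORDS` set and the function names are placeholders chosen for the example, and lemmatization is left out because it needs an external library.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real run would use a much fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_dtm(texts):
    """Return (dtm, vocab): one row of term counts per document, one column per term."""
    counts = [Counter(tokenize(t)) for t in texts]
    vocab = sorted({term for c in counts for term in c})
    dtm = [[c[term] for term in vocab] for c in counts]
    return dtm, vocab

# Usage on two made-up sentences.
dtm, vocab = build_dtm(["The cat sat on the mat", "A cat and a dog"])
# vocab -> ['cat', 'dog', 'mat', 'sat']
# dtm   -> [[1, 0, 1, 1], [1, 1, 0, 0]]
```

The regular expression drops digits, punctuation, and whitespace in one pass, so no separate number-removal step is needed in this sketch.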
'SIC_assignment.R' assigns Standard Industrial Classification (SIC) codes to Angel.co companies based on the textual description of each company:
- Tokenizes words, creates the vocabulary, and creates a DTM from each text document
- Calculates TF-IDF for each DTM
- Calculates the cosine similarity of each document without an SIC code with each document that has an SIC code
- Assigns an SIC code to each document without one, based on the closest company as measured by cosine similarity
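The nearest-neighbour idea above can be illustrated in Python as follows. This is a hedged sketch rather than a port of 'SIC_assignment.R': the function names, the whitespace tokenizer, the particular TF-IDF weighting (term frequency times `log(N / df) + 1`), and the example rows are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(token_docs):
    """Sparse TF-IDF vectors (dicts of term -> weight) for tokenized documents."""
    n_docs = len(token_docs)
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for doc in token_docs:
        counts = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def assign_sic(labeled, unlabeled):
    """Give each unlabeled description the SIC code of its most similar labeled one.

    labeled: list of (description, sic_code); unlabeled: list of descriptions.
    """
    docs = [d.lower().split() for d, _ in labeled]
    docs += [d.lower().split() for d in unlabeled]
    vectors = tfidf_vectors(docs)
    labeled_vecs = vectors[:len(labeled)]
    assigned = []
    for vec in vectors[len(labeled):]:
        best = max(range(len(labeled)), key=lambda i: cosine(vec, labeled_vecs[i]))
        assigned.append(labeled[best][1])
    return assigned

# Usage on made-up company descriptions.
labeled = [("online banking and payments software", "6099"),
           ("organic grocery delivery service", "5411")]
unlabeled = ["mobile payments for banking customers", "fresh grocery delivery"]
codes = assign_sic(labeled, unlabeled)
# codes -> ['6099', '5411']
```

Computing TF-IDF over labeled and unlabeled descriptions together keeps both sets in the same vector space, which is what makes the cosine comparison meaningful.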
Other data prep code for the project:
- 'crowdfunding.py'
- 'CF_dataprep_code_5June.R'
- 'Crowdfunding_2_5_master.egp'
Code from 'Women on company boards and the impact of diversity on innovation outcomes and risk taking ability of firms':
- 'womboards_4Jun.py' dataprep
- 'womboards_28May.do' regressions
- 'CSC.py' dataprep
- 'cscreport_regs.do' analysis and regressions
- 'networks.py'
- 'cluster.py'
- 'Elasticsearch_to_csv.ipynb'