pslmodels / tax-calculator

USA Federal Individual Income and Payroll Tax Microsimulation Model

Home Page: https://taxcalc.pslmodels.org

License: Other

Shell 0.71% Python 97.55% Batchfile 0.08% Tcl 0.51% Awk 0.69% Makefile 0.47%

psl-cataloged usa-federal-income payroll-taxes usa taxes law budget

tax-calculator's Introduction


Tax-Calculator

Tax-Calculator is an open-source microsimulation model for static analysis of USA federal income and payroll taxes.

We are seeking contributors and maintainers. If you are interested in joining the project as a contributor or maintainer, open a new issue and ping @MattHJensen or @jdebacker -- or just jump right in.

Complete documentation is available here.

tax-calculator's People

Contributors

sameersarkar matthjensen iliakur talumbau theofanislekkas zgrnk amy-xu zrisher xiyuw123 rkuchan jacobdr sherwinlott mmessick jdebacker cecile-murray gofroggyrun mcdeaton13 peterdsteinberg salimfurth kdd0211 bjl3 chun1211 kathleenszabo ahunter1 spauldings ahyanbusinesssolution jamiebryanhall chrisrytting parkerrogers codykallen akshaya-trivedi bfgard jwgarrison dalamar1995 mramanathan yuying27 venkatarajasekhar andriyb25 ninthavenue8 andersonfrailey darkmatterx7 ecbrown5 lnsongxf hdoupe econ02 sarinasingh wnkessler seoket1 krishan300 jeepers58 palmerlite 18418n9f2nn1n derrickchoe dansherpa bencasselman hayleefay sandboxorg mudathirlawal sakho3600 draban3 masterworm maxghenis mayoralito jtkostman gregorycrane bavery22 amitashutosh ajinkyasurve strattonstudios willrichens jtbeebe kpinkelman rkasher codejagadisha laurenav centeronbudget jwdiamond willgrimme jlt9 pbpramanik anuraaggoyal abraham-leventhal lucassz clarepan kevinxperese leovasone shekharkhetan kumar-ab tejprakash328 keshavchoudhary87 bbgupta nupurm13 arunjanacdac com1tpru smilingashutosh dc4tpru dc1-tpru rohitdeojha kusumjp sonali110691

tax-calculator's Issues

Fixing C00650

I traced the root of some discrepancies to the variable C00650.

There may be a difference in the way it gets set, but I'm afraid I don't know enough SAS to be able to tell. @feenberg or @SameerSarkar, please advise.

Relevant Python code: here
Relevant SAS code: here

That whole snippet of SAS code, as mentioned in #31, seems to raise some questions.

An API should exist to permit separate calculations for different models of the tax code

The standard use case for those interested in this package would be (I believe):

come up with an idea for a modification to the existing tax code
type some Python code that encompasses that idea
run a calculation for existing tax code over a set of data
run a calculation for the modified tax code over the same set of data
compare the results in some way

I'd like to make this process as easy as possible. As a basic first step, I think the taxcalc package should support a calculator factory function. This function will give back an object that represents a tax calculation. When this object is supplied with input data (e.g., a PUF file) and rules (i.e., tax law parameters), it can perform a set of calculations (currently specified in calculate.py). With an API like that, I think the steps above would look roughly like:

import numpy as np
import taxcalc

#Existing tax law
calc1 = taxcalc.calculator() 
# Do various things with calc1 to prep to run calculations
calc1.load("puf2.csv")

calc2 = taxcalc.calculator() 
rt7 = np.array([0.425])
# change the top tax rate
calc2._rt7 = rt7
# load the data
calc2.load("puf2.csv")

#do all work
calc1.calc()
calc2.calc()
df_diff = calc1._eitc - calc2._eitc

Clearly, the above makes other assumptions that would need to be nailed down. Would be interested in knowing if the calculator factory function makes sense.

Implement DropQ procedure

The Tax Calculator should be distributed as a typical Python package

It would be great to distribute the Tax Calculator as a standard Python package that others could use inside a typical Python workflow. As an example, consider something like this

import numpy as np
import pandas as pd
import taxcalc as tc

# Use tc package here to create new tax credit/exemption/etc
# e.g.:
cred = tc.credits.credit(...)

# Read data and do a tax calculator computation with the new credit
sample = pd.read_csv("puf.csv")
calc = tc.calc(..., data=sample)
calc.register_credit(cred)
calc.do_all()

#get desired info from resulting calculation, returned as a DataFrame
total_income_tax =  calc.e06500

So, if that kind of workflow makes sense, the questions are:

What should the package be named? (Here I'm showing taxcalc.)
Should the functions in translation.py be broken down into separate modules? (I vote 'yes'.)
What is a good interface for others to use that interacts well with the existing Python data ecosystem? (This answer will likely evolve over time.)

For now, if others are OK with a package named taxcalc as shown above, I can make the initial code changes.

Missing var labels for American Opportunities Credit

@feenberg, we are missing the American Opportunities Credit from our variable labels document. Do you have labels for these? If not, I can guess at many of them.

e87482
e87487
e87492
e87497
e87483
e87488
e87493
e87498
e87521
e87668
e87654
e87656
e87658
e87660
e87662
e87664
e87666
e10960
e87668
e87681

Functions that write output using `np.savetxt` could return DataFrames

Many of the functions in the calculator end with a few lines of code like this:

    outputs = (_sep, _txp)
    output = np.column_stack(outputs)
    np.savetxt('FilingStatus.csv', output, delimiter=',',
               header=('_sep, _txp'), fmt = '%1.3f')

This means that calling the function results in generating output to disk, which might not be what the caller wants. We can preserve the column names (in this case _sep and _txp) by creating and returning a DataFrame that consists of the given numpy arrays. The DataFrame could then be saved to disk with a similar utility function. The main idea is that it seems there is value in separating the work of the function from the saving of output.
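
A minimal sketch of that separation, assuming pandas is available. The function names `filing_status_outputs` and `save_outputs` and the sample values are hypothetical; only the column names `_sep` and `_txp` come from the snippet above.

```python
import io

import numpy as np
import pandas as pd


def filing_status_outputs(_sep, _txp):
    """Return the computed arrays as a DataFrame instead of writing a file."""
    return pd.DataFrame({'_sep': _sep, '_txp': _txp})


def save_outputs(df, target):
    """Write a DataFrame to a path or file-like object, three decimals each."""
    df.to_csv(target, index=False, float_format='%1.3f')


# Made-up values standing in for the real calculation results.
result = filing_status_outputs(np.array([1.0, 2.0]), np.array([1.0, 1.0]))

# Saving is now an explicit, optional step chosen by the caller.
buffer = io.StringIO()
save_outputs(result, buffer)
csv_text = buffer.getvalue()
```

The caller can keep working with `result` in memory and only call `save_outputs` when a file is actually wanted.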

More mysteries

There is a difference in variable C19700. I took a random taxpayer for whom it differs and looked at the variables it depends on. Below is a table that includes both C19700 and the other relevant variables:

Variable Name	SAS value	Python value
C19700	951222.5	957633.5
E19800	386900	386900
E20100	830000	830000
E20200	0	0
_posagi	1902445	1902445
_lim30	570733.5	570733.5
_lim50	386900	386900

As you can see, it isn't entirely clear where the difference is coming from.
Here is the relevant Python code.
Here is the relevant SAS code.
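
A quick arithmetic check on the table values, offered as an observation rather than a diagnosis: the Python figure equals `_lim30 + _lim50`, while the SAS figure equals half of `_posagi`. Whether the SAS code actually applies a 50%-of-AGI cap at this point is an assumption that would need to be confirmed against the SAS source.

```python
# Values copied from the table above.
posagi = 1902445
lim30 = 570733.5
lim50 = 386900
sas_c19700 = 951222.5
python_c19700 = 957633.5

# The Python result matches the plain sum of the two limits.
sum_of_limits = lim30 + lim50

# The SAS result matches 50% of positive AGI.
half_posagi = 0.5 * posagi

print(sum_of_limits == python_c19700)  # True
print(half_posagi == sas_c19700)       # True
```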

need to update requirements.txt

In the readme, we may also suggest that the best way to meet the requirements is to install Anaconda.

replace locals() update with global dictionary

I knew from somewhere that we really should not be modifying the locals() dictionary as it's part of Python's scaffolding, but didn't vocally object too much because a) there were more important things on our plate and b) I wasn't 100% sure I was right. I now have conclusive proof to back my opinion up and would like to make sure our code adheres to the standard.
Conceptually, it's an easy fix: we simply use the "y" (or whatever we decide to name it) dictionary with our tax parameter names as keys.

In practice this might take some time since we have to go through the entire script changing all the appropriate variable names to dictionary keys.

In case you're worried about performance issues associated with the additional step of looking up the dictionary name, then the key, I have two things to say:

Variable lookup is implemented very fast in Python, as it is a fundamental feature. Even with all the hundreds of lookups we'll be introducing, the overhead will be negligible compared to other bottlenecks.
To further speed things up we can pass the dictionary itself as an argument to our calculating functions. Since Python variable names are references to an object, we won't be shunting too much memory around. At the same time, having a reference to the dictionary in the local namespace of the function will speed up its lookup even further.
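
A minimal sketch of the idea, assuming a dictionary named `y` as suggested above. The function `std_deduction_total`, the key `_stded`, and its value are hypothetical placeholders, not the project's actual names or numbers.

```python
def std_deduction_total(y, num_filers):
    """Read a parameter from the dictionary passed in by the caller.

    `y` is a local name inside this function, so the only extra cost over
    a plain variable is one dictionary key lookup.
    """
    return y['_stded'] * num_filers


# Parameters live in one explicit dictionary instead of being injected
# into a function's namespace with locals().update(...).
y = {'_stded': 6100}  # placeholder value for illustration only

total = std_deduction_total(y, 3)
```

Passing `y` copies only a reference, so every function sees the same parameter values without any global state.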

Unravelling C05750

This variable affects a couple of others down the road, so fixing it should improve our performance overall.
During my investigation into the cause of Python and SAS's different results for this variable, I found at least some of the discrepancy came from C05100 (see here for the relevant Python line and here for the SAS code).
That variable in turn wholly depended on C24580. This is where things get tricky. The Python code for setting this variable (cf. this) simply checks for taxable income being greater than zero and _hasgain being equal to 1. What I thought was the equivalent SAS code, however, looks very different. In fact, it seems to me that this code never gets executed because the conditional above it is never satisfied.
@SameerSarkar, @feenberg, this sounds like a mystery only you two can shed light on.

test suite

What's the best way for me to use your test code in my branch? Should I copy the relevant code from your branch, should we have a separate test.py, or something else?

Output

Can be JSON, or whatever Dan wants.

Generating output

When testing the code for accuracy on my PC, I added a feature that exports a number of .csv files that contain data for each variable that is manipulated or created in any way within each of our functions.

I primarily used this to compare my results to the SAS code, but I am considering moving them to a new function that creates a series of output csv files for future testing (each csv file is currently contained within the function it represents, and exporting the data is a relatively slow process).

Let me clarify: I think keeping a function that creates an output does no harm as long as we don't automatically call it whenever we run the program. It might be useful for researchers to have a csv file with total output.

Any thoughts on this?

Reference IRS forms in parameters.py

I added some comments to parameters.py to provide the parameters with references to IRS forms and worksheets. I confirmed the accuracy of parameters for 2014 and labeled where to find each parameter in the same format. This is meant to make it easier for others to find/calculate the parameters from IRS forms and to make it easier for us to update the calculator in the future.

Please let me know if you have any suggestions on formatting the IRS reference comments.

Link to the file on my fork: https://github.com/Amy-Xu/Tax-Calculator/blob/master/taxcalc/parameters.py (illustrations of comments can be found at the beginning of the file)

@feenberg I changed one parameter at line 205, the threshold for additional medicare tax for widows, from 250,000 to 200,000 (_thresx[MARS==5]). Here’s the IRS page I refer to: http://www.irs.gov/uac/Newsroom/What-You-Should-Know-about-the-Additional-Medicare-Tax. Could you please check? I hope I read the right source. Thanks a lot!

Thanks,
Amy

Attribute access on a Calculator should "pass through" to Parameters and Puf objects

Now, we support the ability to create a calculator as follows:

#Create a Public Use File object
puf = PUF("puf2.csv")

#Create a Calculator
calc = Calculator(parameters=params, puf=puf)

#The Current Year is given from the Parameters and PUF objects
print(calc.current_year)

#Parameters that vary by year are scalars based on the current year
print(params.almdep)

#All the calculation happens through this interface
calc.calc_all()

#The year is incremented
calc.increment_year()

To support various kinds of analysis, it would be nice to be able to do calc.c00100 instead of calc.puf.c00100, and calc.ssmax instead of calc.parameters.ssmax.

This would be accomplished by writing a custom __getattr__ for Calculator.
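
A rough sketch of such a pass-through, under stated assumptions: the `Parameters` and `PUF` classes below are toy stand-ins (the real ones load data and track the current year), and the values are placeholders.

```python
class Parameters:
    """Toy stand-in that stores keyword arguments as attributes."""

    def __init__(self, **values):
        self.__dict__.update(values)


class PUF:
    """Toy stand-in that stores keyword arguments as attributes."""

    def __init__(self, **columns):
        self.__dict__.update(columns)


class Calculator:
    def __init__(self, parameters, puf):
        self.parameters = parameters
        self.puf = puf

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails, so
        # attributes defined on the Calculator itself are never shadowed.
        if name in ('parameters', 'puf'):
            # Avoid infinite recursion if these are not set yet.
            raise AttributeError(name)
        for source in (self.parameters, self.puf):
            if hasattr(source, name):
                return getattr(source, name)
        raise AttributeError(name)


calc = Calculator(Parameters(ssmax=115800), PUF(c00100=[50000, 72000]))
```

With this in place, `calc.ssmax` resolves through `calc.parameters` and `calc.c00100` through `calc.puf`. Parameters win if both objects define the same name, which is a design choice that would need to be agreed on.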

calculate.py should be broken up into different files

It's difficult to navigate around a 2000-line file. If we break out the functions into smaller chunks, it will help others get a "lay of the land" and reduce noise when someone is trying to understand a piece of the functionality. The overall "run everything" function (currently in test.py and repeated in test_calculate.py) would have access to each of the functions in each file.

Compitem calculation

I've been tracking down the source of a discrepancy in C04100 between what Python and SAS produce. I noticed that this variable depends on the value of _compitem, which differs for one of the taxpayers for whom C04100 values don't match. In the Python translation, this variable gets set to 0 (see here); however, in the SAS code there's a conditional for it (see this line) that sets it either to 0 or to 1. @feenberg, just to be on the safe side, is that conditional still current law? @SameerSarkar, could you confirm that I'm not missing something and that variable is in fact being set to zero?

Thanks!

C62100

This variable is in the top 20 in terms of cumulative error. In the Python code it is set with a long expression with a bunch of e-values. I looked at one taxpayer for whom C62100 was off and found that all the e-values matched up (as a matter of fact, they were all zero), as apparently did the c-values (see table below). Here is the relevant SAS code, for reference. Tagging @SameerSarkar and @feenberg.

Variable Name	SAS value	Python Value
c62100	4364782	4352722
c60260	0	0
c60000	421058	421058

What is the _hopelm variable? Should it be removed?

I noticed this commented out variable here:

#_hopelm = np.array([1200])

What is/was its purpose? Should we remove it?

Fixing C82880

My largest error is now concentrated in this variable and those derived from it.
This variable first gets set here in Python and here in SAS. Below is a table of relevant variables from a sample taxpayer.

Variable Name	SAS value	Python Value
c82880	0	88250
e00200	88250	88250
e82882	0	0
e30100	0	0
_setax	0	0
_nctcr	1	1
_sey	0	0

Clearly, in both SAS and Python, the statements lead to c82880 being assigned the value of e00200. However, we're not done with c82880: it gets reassigned here in Python and here in SAS. The weird thing is that in SAS the condition appears to be true and the statement gets executed, whereas in Python we never hit it. I'm providing a table of relevant variables for the same taxpayer below, which suggests that the reassignment should not in fact take place. @feenberg, do you think there may have been modifications to this segment recently?

Variable Name	SAS value	Python Value
_exact	0	0
e82880	0	0

Decide where to use bools

We may want to change some variable types to bool. This could make sense for existing variables like _hasgain.

If we do this, we can change

if e01000 > 0 or c23650 > 0 or e23250 > 0 or e01100 > 0 or e00650 > 0:
    _hasgain = 1
else:
    _hasgain = 0

to

_hasgain = (e01000 > 0 or c23650 > 0 or e23250 > 0 or e01100 > 0 or e00650 > 0)

and

if _hasgain == 1:

to

if _hasgain:

Parameterize hard-coded variables

For the hard-coded variables in functions.py, I came up with some naming conventions to parameterize them, which should be universal and flexible enough to apply to other parameters in parameters.py at some point.

The variable names generally have two fixed components, one for tax/tax credit category (upper case) and the other for the function of this number in calculation (lower case).

The category names so far include the following:

CTC: Child Tax Credit
ACTC: Additional Child Tax Credit
AMT: Alternative Minimum Tax
II: Individual Income (Including personal exemptions and tax brackets)
CG: Capital Gain
EITC: Earned Income Tax Credit
ETC: Education Tax Credit
FEI: Foreign Earned Income
ID: Itemized Deduction
STD: Standard Deduction
FICA: Federal Insurance Contributions Act
SS: Social Security
MED: Medicare
AMED: Additional Medicare

The possible functions of HC numbers:

c: ceiling
f: floor
t: tax
p: phaseout
rt: rate
em: exemption
ec: exclusion
thd: threshold

For example, the FICA tax rate would be _FICA_trt, t for tax and rt for rate. Similarly, prt represents a phaseout rate, crt represents a ceiling rate etc.

In addition to these two essential components, there might be sub-components to make the variable names more informative. For example, _ID_Medical_frt refers to the medical deduction floor rate under Itemized Deduction. All these sub-components should be intuitive enough that people don't need a reference to understand their meanings.

All above is what I came up with at this point. Please let me know if you have any suggestions on the format/abbreviation.

I added some hard coded values at the end of parameters.py and revised the corresponding section in functions.py. Here are the links to the modified files: functions.py (https://github.com/Amy-Xu/Tax-Calculator/blob/master/taxcalc/functions.py) and parameters.py (https://github.com/Amy-Xu/Tax-Calculator/blob/master/taxcalc/parameters.py).

Here's a list of the parameters I created:

refactor multiple logical statements with numpy reduce

http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.reduce.html
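
A small sketch of what that refactoring could look like, using `np.logical_or.reduce` from the page linked above. The variable names are borrowed from the calculator, but the arrays hold made-up values.

```python
import numpy as np

# Made-up values for three tax records.
e01000 = np.array([0, 0, 5])
c23650 = np.array([0, 10, 0])
e23250 = np.array([0, 0, 0])

# Nested form: one np.logical_or call per extra condition.
nested = np.logical_or(e01000 > 0, np.logical_or(c23650 > 0, e23250 > 0))

# Reduced form: OR together any number of boolean arrays in one call.
reduced = np.logical_or.reduce([e01000 > 0, c23650 > 0, e23250 > 0])
```

Both forms give the same boolean array; the reduced form stays flat as more conditions are added.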

SOIYR

@feenberg, I noticed that SOI year is set to 2008. Is this correct or should it also be set to 2013?

Turn PUF into pickle loader, drop global var definitions

I think it makes sense to read in the variables we define in PUF from some sort of file. We currently have JSON and pickle at our disposal, so it's not like we're short on options.
This will make the code more readable (less scrolling through tedious variable definitions) and we can always choose a file format that's human-readable so that interested parties can inspect what variables we are loading.

Additionally, just like in issue #10 it would be nice to get rid of global variable definitions and replace them with a dictionary. It would make sense to merge this dictionary with the one we get from the PUF file.

Is it meaningful to have puf=False?

In our current test.py, the run function has a keyword argument puf with a default value of True. All of our tests use this value. In ItemDed, we have this snippet of code:

    c20750 = 0.02 * _posagi
    if puf == True:
        c20400 = e20400
        c19200 = e19200
    else:
        c20400 = e20550 + e20600 + e20950
        c19200 = e19500 + e19570 + e19400 + e19550
    c20800 = np.maximum(0, c20400 - c20750)

Do we ever take the else branch here? If not, let's get rid of this check and remove the puf Boolean variable.

All params in calculate.py should be indexed by year

A good assumption is that for each parameter, some user will want it to vary by year. Therefore, we should index each param by year in calculate.py, i.e., p.param[FLPDYR - p.DEFAULT_YR].

Parameters.py should accept year-by-year inflation rates

The user of the tax calculator should be able to vary, by year, the inflation rate that indexes parameters. The latest projection from CBO for inflation (CPI-U), for instance, is 1.1% in 2015, 2.2% in 2016, 2.3% in 2017, and 2.4% thereafter. Webapp users do not need this functionality at this point.
CBO economic projections
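
One way this could work, as a sketch only. The function `project_parameter` and the base value of 1000 are hypothetical, and the code assumes that the rate listed for a year scales that year's value into the following year's value; the project may prefer a different convention.

```python
import numpy as np


def project_parameter(base_value, first_year, num_years, rates, default_rate):
    """Return one parameter value per year, compounding year-specific rates.

    Years missing from `rates` fall back to `default_rate`.
    """
    values = [float(base_value)]
    for year in range(first_year, first_year + num_years - 1):
        rate = rates.get(year, default_rate)
        values.append(values[-1] * (1.0 + rate))
    return np.array(values)


# Rates quoted above: 1.1% in 2015, 2.2% in 2016, 2.3% in 2017, 2.4% after.
cbo_rates = {2015: 0.011, 2016: 0.022, 2017: 0.023}
projected = project_parameter(1000, 2015, 5, cbo_rates, default_rate=0.024)
```

A user who wants a different inflation path only has to pass a different `rates` dictionary.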

is the Taxer function doing what is desired?

It seems like the Taxer function is doing quite a bit of logical work to set variable _a5. Then, it sets the value of _a6 based on some logical conditions and the values _a3 and _a5. Immediately after this though, there is the line:

_a6 = inc_in

The computed results (inc_out) only involve _a6 and other values from outside the function, so none of the work for variables _a1 - _a5 is used. I commented out that code and got the same answer. Am I missing something?

https://github.com/OpenSourcePolicyCenter/Tax-Calculator/blob/master/translation.py#L1738-L1740

casualty floor not set

@feenberg the SAS code for setting the floor on the casualty deduction does not seem to do anything; in particular, it doesn't seem to set the 10% of AGI floor. Am I missing something?

Starting at line 317 in the SAS file you sent over the weekend:

/* Casulty */
if e20500 gt 0 then do;
    c37703 = e20500+.1*_posagi;
    c20500 = c37703-.1*_posagi;
end;
else do;
    c37703 = 0;
    c20500 = 0;
end;
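
To make the concern concrete, here is a small NumPy sketch with made-up inputs. The first computation is a direct translation of the SAS block; the second (`c20500_floor`) is a hypothetical version that would apply a 10%-of-AGI floor, included only for contrast and not taken from any existing code.

```python
import numpy as np

# Made-up inputs for three records.
e20500 = np.array([0.0, 5000.0, 20000.0])
_posagi = np.array([80000.0, 80000.0, 80000.0])

# Direct translation of the SAS block: add 10% of AGI, then subtract it.
c37703 = np.where(e20500 > 0, e20500 + 0.1 * _posagi, 0.0)
c20500_sas = np.where(e20500 > 0, c37703 - 0.1 * _posagi, 0.0)

# Hypothetical alternative that would actually apply a 10%-of-AGI floor.
c20500_floor = np.maximum(0.0, e20500 - 0.1 * _posagi)
```

In the translated version `c20500_sas` always ends up equal to `e20500`, since the two 10% terms cancel, whereas the floored version reduces or eliminates the deduction.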

Fixing c82925

@MattHJensen I'm referencing the "old" taxcalc code here and in #44 because in these particular cases it's the same as the newer version Dan sent me. As soon as I run into a situation where it differs, I'll be sure to upload it to GitHub. I kind of want to wait to get the latest version of the SAS script so as to avoid too much shunting in and out of different SAS versions.
There's a discrepancy in c82925. The Python code and SAS code appear to be identical.
Below, as always, is a table of relevant variable values for a sample taxpayer.

Variable Name	SAS value	Python Value
c82925	0	64
_nctcr	3	3
_precrd	64	64

Get rid of repeated code

Glancing at the code I see a lot of situations where the same code is being rewritten several times in consecutive lines of code. Code repetition is dangerous for the following reasons:

It doesn't capture commonalities between procedures.
On a more practical note, imagine a situation where you decide to modify this code. Would you rather modify it once and have your change percolate everywhere appropriate or would you prefer to manually find every instance where that code is used and hope that you neither missed nor made a typo in any of them?

I propose we simply find all instances of repeated code and assign the shared expressions to variables whenever possible.
P.S.
In the case of some operations (like constructing boolean arrays), this might even give us a tiny bit of a speedup. I'm positively obsessed with speed now :)

Link output tables to calculator

@talumbau and @theofanislekkas, contingent on @feenberg's review, here is the first set of output tables we should provide. We will want to add several more options later and likely some charts, so we might want to have a page that links to each of these tables (or something similar) rather than putting them all on one page.

The first six/sixty-six output tables (variables listed below).

Plan x vars, avg by AGI decile
Plan x vars, avg by AGI group
Plan y vars, avg by AGI decile
Plan y vars, avg by AGI group
By-tax-record difference between plan y and plan x, avg by AGI decile
By-tax-record difference between plan y and plan x, avg by AGI group

The reason this could/should be sixty-six tables is that it'd be very nice to be able to see each of these tables by year, as well as see the ten-year average of the by-year tables.

@feenberg, do you think the 10-year average should be the simple average of each cell across the 10 one-year tables, or should we do a weighted average (more important for AGI groups than deciles)?

Variables for first 6 tables:

c00100 AGI
c04100 Standard Deduction
c04470 Itemized deductions
c04800 taxable income
c05200 Regular tax
c09600 Alternative Minimum Tax
*[NEED TO CREATE VAR] Non-Refundable Credits
c09200 SOI Tax (Tax before refundable credits)
*[NEED TO CREATE VAR] Refundable credits
*[NEED TO CREATE VAR] Tax after all credits

Ignore for now; @feenberg, you should have an email from me on the 15th regarding these.

AGI Groups

Below 10
10-20
20-30
30-40
40-50
50-75
75-100
100-200
200+
Total

The seventh and eighth output tables deal primarily with 'tax after all credits'

By AGI decile,

percent of tax units in each decile with a tax cut
percent of tax units in each decile with a tax increase
share of total tax change borne by each decile (percent)
average tax change ($)
@feenberg, do you think we should include an average federal tax rate percentage point change and an average federal tax rate under the proposal as in the TPC example below? (average total-tax/average AGI?)

Same by AGI group

We won't be able to create these tables until we get the tax after all credits variable sorted out

Again, it would be nice to have these for each of the ten years and an average.
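
As a starting point for the "avg by AGI group" tables, here is a pandas sketch. It assumes the AGI groups listed above are in thousands of dollars, uses made-up records, and ignores sampling weights, which the real tables would presumably need. The column name `tax` is a placeholder for the tax-after-all-credits variable that does not exist yet.

```python
import pandas as pd

# Made-up records: AGI and tax after credits, in dollars.
records = pd.DataFrame({
    'c00100': [8000, 15000, 45000, 60000, 120000, 250000],
    'tax': [0, 300, 4000, 7000, 20000, 60000],
})

# Group edges in dollars (assumed to correspond to the list in thousands).
edges = [-float('inf'), 10e3, 20e3, 30e3, 40e3, 50e3, 75e3, 100e3, 200e3,
         float('inf')]
labels = ['Below 10', '10-20', '20-30', '30-40', '40-50', '50-75',
          '75-100', '100-200', '200+']

# right=False puts an AGI of exactly 10,000 in the '10-20' group.
records['agi_group'] = pd.cut(records['c00100'], bins=edges, labels=labels,
                              right=False)

# Unweighted average of each variable within each non-empty AGI group.
by_group = records.groupby('agi_group', observed=True)[['c00100', 'tax']].mean()
```

The decile tables could follow the same pattern with `pd.qcut` in place of `pd.cut`, and the plan y minus plan x tables by running the same grouping on the by-record differences.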

Proper Readme

We should have a simple readme file in the main directory that explains who we are, how to install and use the program, etc.

The Tax Calculator should support both Python 2 and Python 3

It seems like it would be easy to meet this requirement, but I thought it would be best to be explicit. As we proceed in testing, the tax calculator should be tested in both Python 2 and Python 3 environments. Does this sound right to others?

Sphinx Documentation

What is the _cmp variable for?

The variable _cmp is only involved in comparison expressions (i.e. where(_cmp == 1...). It appears that it is never assigned to except when it is created in puf.py, where it is set to all zeros. Does anyone have any insight here? I'd like to remove it if it is not actually useful for any comparisons.

coding style

At some point we need to go through our code and make it conform to a set of style standards. PEP 8 seems like a good starting point. Is it possible to make pylint run automatically as part of our commit suite (along with Travis-CI)?

We should rename our tax records DataFrame.

Our core tax data is currently stored in a DataFrame named "puf." This is misleading since the DataFrame will always contain columns that are not on the IRS public use file (intermediate variables, for instance), and often the public use file won't be involved whatsoever.

I suggest "records" as a better name.

A Calculator should take user-defined functions that are called at calculation time

This would be a means to provide a friendly developer/user of the package with a way to add something to the model (credit, deduction, etc.). The idea is that a user may write a function and then register it with the calculator. The calculator will call it at the proper time. It could look like this:

def give_everyone_ten_grand(p):
    p.e00100 += 10000

calc = Calculator(tax_dta, default_year=91)

#Register callback here
calc.register_income_callback(give_everyone_ten_grand)

results = calc.run_all()

There are two things happening here. One is that run_all is now a method on a calculator that will call all of the functions in the designated sequence. Second is that this designated sequence would have a few spots to run any registered UDFs (user-defined functions). So, it would look like:

self.CapGains()
self.SSBenefits()
self.all_income_udfs()
self.AGI()

all_income_udfs would execute any registered income functions. We could have such a mechanism for income, deductions, credits, etc.

This capability would not be available through the web application, since it would involve execution of arbitrary code. Anyone using the package on their local machine would be able to do this though.
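
A toy version of the registration mechanism might look like the following. Everything here is a sketch: the `Calculator` class is a stand-in that only records which steps ran, and the records object is a `SimpleNamespace` holding a made-up `e00100` array.

```python
from types import SimpleNamespace

import numpy as np


class Calculator:
    """Toy stand-in showing only the callback mechanism."""

    def __init__(self, records):
        self.records = records
        self._income_callbacks = []
        self.call_log = []

    def register_income_callback(self, func):
        self._income_callbacks.append(func)

    def all_income_udfs(self):
        # Run every registered income function, in registration order.
        for func in self._income_callbacks:
            func(self.records)

    def SSBenefits(self):
        self.call_log.append('SSBenefits')

    def AGI(self):
        self.call_log.append('AGI')

    def run_all(self):
        self.SSBenefits()
        self.all_income_udfs()
        self.AGI()
        return self.records


def give_everyone_ten_grand(p):
    p.e00100 += 10000


calc = Calculator(SimpleNamespace(e00100=np.array([30000, 55000])))
calc.register_income_callback(give_everyone_ten_grand)
results = calc.run_all()
```

The same pattern (a list of callables plus one `all_..._udfs` hook in `run_all`) could be repeated for deductions and credits.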

Testing the Code

Yesterday I finished editing the code so that it accurately works on the PUF. Unfortunately, it will be a lot more difficult to test the case where the input data is not PUF, as I don't have a ready-made input file to use to compare our code's results to that of the SAS code.

@feenberg do you still have the input file that simulated the full IRS code that you used when testing your SAS code?

Is _ymax indexed correctly in the current master?

A number of parameters in the calculator are of the form:

#                    singl   joint   sep     hh      widow   sep
_phase2 = np.array([[250000, 300000, 150000, 275000, 300000, 150000],
                    [254200, 305050, 152525, 279650, 305050, 152525],
                    [258250, 309900, 154950, 284050, 309900, 154950],])

The outer dimension is for indexing by years. So _phase2[0, 1] selects the 'zeroth' budget year we are considering and, within that year, the entry for "joint".

Indexing usually looks like this:

 _phase2 = p._phase2[FLPDYR-p.DEFAULT_YR, MARS-1]

Where we subtract the default starting year from the filing year for each record.
This pattern holds throughout the calculator except for the _ymax variable. It looks like this:

#                    0kids 1kid   2kids  3+kids
_ymax = np.array([  [7970, 17530, 17530, 17530],
                    [8110, 17830, 17830, 17830],
                    [8240, 18110, 18110, 18110]])

But the indexing is different. In current master, the indexing looks like this (starting at line 1066):

   _val_ymax = np.where(np.logical_and(MARS == 2, _modagi > 0), p._ymax[
                         _ieic, FLPDYR - p.DEFAULT_YR] + p._joint[FLPDYR - p.DEFAULT_YR, _ieic], 0)
    _val_ymax = np.where(np.logical_and(_modagi > 0, np.logical_or(MARS == 1, np.logical_or(
        MARS == 4, np.logical_or(MARS == 5, MARS == 7)))), p._ymax[_ieic, FLPDYR - p.DEFAULT_YR], _val_ymax)

So, _ymax has the "year" index second. I think this is an error. Correct? @feenberg @MattHJensen
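
The difference between the two index orders can be seen with the `_ymax` array exactly as shown above. This sketch assumes that layout (years as rows, number of children as columns) is the one in master, and uses a made-up `_ieic` array.

```python
import numpy as np

#                 0kids  1kid   2kids  3+kids
_ymax = np.array([[7970, 17530, 17530, 17530],
                  [8110, 17830, 17830, 17830],
                  [8240, 18110, 18110, 18110]])

year_offset = 0                # FLPDYR - DEFAULT_YR for the first year
_ieic = np.array([0, 1, 2])    # number of qualifying children per record

# Year first, children second: reads across the first row.
year_first = _ymax[year_offset, _ieic]

# Children first, year second (as in the snippet above): reads down the
# first column, so _ieic ends up selecting a year instead.
kids_first = _ymax[_ieic, year_offset]
```

With the children-first order, a record with one child gets 8110 (the second year's zero-child value) rather than 17530, and an `_ieic` of 3 would raise an `IndexError` because the array has only three rows.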

The current model doesn't account for 10 budget years

In order to use the current code to model 10 budget years, we need data in the model for each of the 10 years. In the current package, the code is structured to handle multiple years, but we only provide one year of parameter data. If we were to attempt to index another year of data, we would get runtime errors. For example, there is a lot of code like this:

 _prexmp = XTOT * _amex[FLPDYR - DEFAULT_YR]

_amex is defined like this:

_amex = np.array([3900])

So we are OK as long as FLPDYR is the same as DEFAULT_YR. Once we advance FLPDYR, then the model would no longer work.

What is the best way to handle this situation for now (e.g. repeat the model parameters for 10 years), and is there a longer term solution that is different from the short term solution?

Vector conditionals using "np.where" are hard to understand

There must be a better way of handling logic like this:

_a5 = np.where(np.logical_and(low == 1, _a4 < 25), 13, 0)
_a5 = np.where(np.logical_and(low == 1, np.logical_and(_a4 >= 25, _a4 < 50)), 38, _a5)
_a5 = np.where(np.logical_and(low == 1, np.logical_and(_a4 >= 50, _a4 < 75)), 63, _a5)
_a5 = np.where(np.logical_and(low == 1, _a4 >= 75), 88, _a5)

_a5 = np.where(np.logical_and(med == 1, _a4 < 50), 25, _a5)
_a5 = np.where(np.logical_and(med == 1, _a4 >= 50), 75, _a5)

_a5 = np.where(inc_in == 0, 0, _a5)

One person suggested we could use “masking” (see below), but I’m hoping for something better.

x = np.array([1,2,3])
marital = np.array(['S','M','M'])
x[marital == 'S'] *= 3
print(x)
[3 2 3]

It should not be necessary to locate and read the `puf2.csv` file in order to import `taxcalc`

Module import should not depend on finding an input data file. The reading of the data should be refactored.

SOIT accuracy regarding the c10300 variable

In the SOIT() function it looks like there may be a bug, but I am not sure what the correct fix should be.

Should the line
c10300 = c10300 - c10300 - c10950 - e11451 - e11452
be
c10300 = c10300 - c10950 - e11451 - e11452 ?

and also the line
c10300 = c09200 - e09710 - e09720 - e10000 - e11601 - e11602
be
c10300 = c10300 - c09200 - e09710 - e09720 - e10000 - e11601 - e11602?

If the second line referenced above is correct, then the 7 lines of code above it are not needed, as they do not affect the value of that variable.

@feenberg and @MattHJensen, is this an error? And if so, which would be the correct fix?

Additionally, if the first suggestion is correct I think it would be easier to read and less error prone if the bulk of that function was consolidated into a single assignment of the c10300 variable instead of re-assignment, such as
c10300 = c09200 - e10000 - e59680 - ... - ... - e11601 - e11602 all at once.

_nctcr Calculation

I've been able to track some of the few variables that still have discrepancies to a difference in _nctcr.
As I look at the relevant SAS code I'm wondering what the statement sum(of xtxcr1-xtxcr10); means, in particular the of keyword. Google didn't seem to have an answer. @feenberg @MattHJensen @SameerSarkar

Variable names

I'm hoping to generate a conversation about whether we should rename our variables to be descriptive in English.

We could easily provide a dictionary for variable mapping, set by default to the PUF, and in most cases we could even use the PUF variable descriptions; for instance, earnedIncForEITC instead of E59560.

Note that many economists studying an isolated tax issue don't know tax law well enough to guess what each variable is by comparing our Python code to their internal understanding of tax law; certainly most policy analysts won't. Descriptive variable names would help both of these groups understand and contribute to our code.

Moreover, if our code is readable to the uninitiated, it could be the best place for the uninitiated to learn how tax law works, a valuable contribution in and of itself.

I'd be interested in hearing from @SameerSarkar and @copper-head if they think descriptive variable names would have sped up their understanding of the code, and I'd like to know from @feenberg what we might be sacrificing if we were to make the switch.

Understanding _cmp

I'm looking at this code and wondering if _cmp is really just a proxy for checking whether we're using the PUF file or the full disclosed one, since its values are in complementary distribution with those of puf.

Calculator, Parameters, and PUF should all support dict-like access

Many users might be most comfortable reading/writing variables like this:

puf['e00100'] += 1000

This is rarely used in the "guts" of the package, but would be expected by users familiar with pandas.
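
A minimal sketch of dict-like access, assuming the data sits in a pandas DataFrame held by the object. The `PUF` class here is a toy wrapper, not the package's actual class, and the values are made up.

```python
import numpy as np
import pandas as pd


class PUF:
    """Toy wrapper; the real PUF class may store its data differently."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, name):
        return self._data[name]

    def __setitem__(self, name, value):
        self._data[name] = value


puf = PUF(pd.DataFrame({'e00100': np.array([30000, 55000])}))

# The augmented assignment calls __getitem__, adds, then calls __setitem__.
puf['e00100'] += 1000
```

The same two methods could be added to Calculator and Parameters, delegating to whichever object owns the named variable.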