vallettea / koala

Transpose your Excel calculations into Python for better performance and scaling.

License: GNU General Public License v3.0

Python 100.00%

koala's Issues

Add util to check if a link exists between two cells
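
A minimal sketch of what such a utility could look like, assuming the dependency graph is available as a plain dict that maps each cell address to the addresses that depend on it. The name `has_link` and the dict layout are hypothetical and are not part of Koala's API.

```python
from collections import deque


def has_link(graph, source, target):
    """Return True if `target` can be reached from `source` by following edges."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for successor in graph.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False


# Hypothetical graph: A1 feeds B1, B1 feeds C1, D1 is isolated.
example_graph = {"A1": ["B1"], "B1": ["C1"], "D1": []}
```

The search is breadth-first and directed, so `has_link(example_graph, "A1", "C1")` holds while the reverse direction does not.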

Find a faster way to read graphs

An idea is to use this kind of strategy:
https://axialcorps.com/2013/09/27/dont-slurp-how-to-read-files-in-python/

The problem seems to be that we handle gzip files. An alternative solution could be https://docs.python.org/3/library/zlib.html#zlib.decompressobj, but we need to know the 'window size' (wbits), which I don't think we know in advance...
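
Both routes can be sketched with the standard library alone. The helper names below are hypothetical. The first relies on `gzip.open`, which decompresses lazily as the file is iterated. The second uses `zlib.decompressobj` with `wbits` set to `zlib.MAX_WBITS | 16`, which tells zlib to expect a gzip header and trailer; since `MAX_WBITS` is the largest window, the actual window size used at compression time should not need to be known.

```python
import gzip
import zlib


def iter_gzip_lines(path, encoding="utf-8"):
    """Yield text lines one at a time without loading the whole file."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


def iter_gzip_chunks(path, chunk_size=64 * 1024):
    """Yield decompressed byte chunks, reading the file piece by piece."""
    # MAX_WBITS | 16 selects gzip framing with the largest window size.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            data = decompressor.decompress(chunk)
            if data:
                yield data
    tail = decompressor.flush()
    if tail:
        yield tail
```

Whether either variant is actually faster for Koala's graph files would still have to be measured.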

Adapt graph serialization to Range object

See #12.

Write a tool to detect files with volatile dependent inputs

When you have an input that affects the value of a volatile, your whole calculation will be broken if you use Spreadsheet.clean_volatiles().

Detecting in advance which files are affected would therefore be a useful feature.

Investigate an Eval error

Exception: Problem evalling: cells and values in a Range must have the same size

Repair basic_evaluation

Is the clean_volatiles() cache a source of bad evaluations?

The Spreadsheet.clean_volatiles() function keeps a cache dictionary whose purpose is to reduce the number of expressions calculated when a formula is identical to one previously encountered.
The problem is that the same formula, called from a different cell, will sometimes evaluate differently.

This might lead to bad evaluations, and might explain #44.
Removing the cache might hurt performance, though.
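
The suspected failure mode can be reproduced with a toy evaluator. Everything below is a hypothetical sketch and does not reflect Koala's real cache: `=ROW()*10` stands in for any formula whose result depends on the calling cell, and the only difference between the two runs is whether the cache key includes the cell address.

```python
def evaluate(formula, row):
    """Toy evaluator: only '=ROW()*10' is supported, and it depends on the caller's row."""
    if formula != "=ROW()*10":
        raise ValueError("unsupported formula in this sketch")
    return row * 10


def evaluate_all(cells, key_includes_cell):
    """Evaluate {(address, row): formula} through a cache and return {address: value}."""
    cache = {}
    results = {}
    for (address, row), formula in cells.items():
        key = (formula, address) if key_includes_cell else formula
        if key not in cache:
            cache[key] = evaluate(formula, row)
        results[address] = cache[key]
    return results


cells = {("A1", 1): "=ROW()*10", ("A2", 2): "=ROW()*10"}
by_formula = evaluate_all(cells, key_includes_cell=False)  # A2 reuses A1's result
by_formula_and_cell = evaluate_all(cells, key_includes_cell=True)
```

Keying on the formula and the cell gives correct values but the cache is then rarely hit, which is the performance trade-off in question.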

VDB function with partials

excellib.vdb() doesn't output exactly the same result as Excel when start_period or end_period is partial (that is, a float).

Explore the idea of pure functional evaluation

The goal is to avoid calling eval() at each node of the graph, which takes quite a long time.

Offset doesn't work with Ranges

When you use the OFFSET height and width arguments so that the output is actually a Range rather than a Cell, this output most likely doesn't exist in the cellmap, which leads to errors.

Should we clean white spaces in formulas?

White spaces in formulas are a problem:

if you clean them up, text variables that include white spaces are corrupted
if you don't, the clean_volatiles() function might end up not replacing parts of a formula, since revert_rpn() (which outputs the part of the formula to replace) returns a formula without white spaces.

The current setup does not remove white spaces.
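
One possible middle ground is to strip spaces only outside string literals. The helper below is a hypothetical sketch, not existing Koala code. It assumes double-quoted Excel strings and deliberately ignores two cases that a real implementation would need to handle: quoted sheet names such as 'My Sheet'!A1, and the space used as Excel's range intersection operator.

```python
def strip_spaces_outside_strings(formula):
    """Remove spaces that are not inside a double-quoted string literal."""
    result = []
    in_string = False
    for char in formula:
        if char == '"':
            # An escaped quote ("") toggles the flag twice, so the state stays correct.
            in_string = not in_string
            result.append(char)
        elif char == " " and not in_string:
            continue
        else:
            result.append(char)
    return "".join(result)


cleaned = strip_spaces_outside_strings('=IF( A1 > 0 ; "yes please" ; "no" )')
```

With this approach the text inside `"yes please"` keeps its space while the rest of the formula matches the space-free form that revert_rpn() is described as returning.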

False circular references

Formulas like:
=(totalDecom-SUM(INDEX(FA_RecCostsDecom;1;1):INDEX(FA_RecCostsDecom;1;CA_Periods-1)))*Deprec_UOPRates
trigger an infinite loop when calculated on a cell referenced as FA_RecCostsDecom.

This is because Koala currently re-evaluates a range each time it sees it in a formula.
A good way to handle this would be to store Ranges (in the Koala sense) in a Spreadsheet.range_dict object, so that when Koala encounters a Range it already knows, it can use the values directly without re-evaluating the Range (thus avoiding the infinite loop).

2 problems though:

this means the way Ranges are initialized must be adjusted so that a Range is created in the dict as soon as its first element is inserted (otherwise the formula above wouldn't work either)
this might be a lot of effort for a few cases, since the situation might not occur very often

Check if Volatile Ranges are scanned during detect_alive()

excellib.match() needs an ExcelError as output

This line should output an ExcelError.

Add a ".XLSX" test to verify inputs/outputs

Use a single Tokenizer

Currently, 2 different tokenizers are used in Koala:

the main tokenizer, taken from Pycel, is used when constructing the graph (in koala/ast/tokenizer.pyx)
a secondary tokenizer, taken from openpyxl, is used when reading cells of type range, to be able to translate the formulas (in koala/openpyxl/tokenizer.py).

We need to merge the 2 into one to reduce complexity.

Merge CellRange and Range classes

It might be worth merging CellRange and Range into a single class, for clarity.

https://github.com/anthill/koala/blob/master/koala/ast/excelutils.py#L15
https://github.com/anthill/koala/blob/master/koala/ast/Range.py#L118

Authorize ":" tokenizer when you have inputs that influence 'INDEX' or 'OFFSET' formula

When you have inputs that can modify cells whose formulas contain INDEX or OFFSET, you don't want to pre-parse your formulas to clean the volatiles.
So you need to be able to calculate your workbook entirely (even if it takes a great amount of time).

Currently, this is not possible and leads to evaluation errors due to bad parsing of ":" characters.
A generic mode addressing this case needs to be available.

Repair Excel function tests

Be more precise with ExcelError return values

Most of the Excel error codes I've used so far are "#N/A" or "#VALUE!"; we might need more precision.
We might need to find a way to be more explicit about the errors.

Rename Volatiles

Volatile functions in Excel are functions that always trigger evaluation (see: http://www.decisionmodels.com/calcsecretsi.htm)

What we have called "volatiles" in our code are actually functions that output a reference to a cell, which is not the same thing.

For the sake of clarity, we need to rename what we call volatiles in our code.

Fix behavior without clean_volatiles()

Formulas like A1:OFFSET(A1,0,1) lead to errors when you don't clean_volatiles().

This issue supersedes #25.

RangeCore.apply_all on Range with different sizes

Our current strategy is not to fill Ranges with empty cells.
But this might lead to apply_all operations on Ranges with different sizes, raising an Exception.

We might need to consider filling the missing cell values with zeros on such occasions.
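
A rough sketch of the zero-filling idea, under the assumption that each Range is modelled as a dict keyed by the position of the cell inside the range, with empty cells simply absent. This layout and the `apply_all` signature are hypothetical and do not mirror RangeCore.apply_all.

```python
import operator


def apply_all(func, first, second, fill=0):
    """Apply `func` position by position, using `fill` where a range has no cell."""
    positions = sorted(set(first) | set(second))
    return {
        position: func(first.get(position, fill), second.get(position, fill))
        for position in positions
    }


full_range = {0: 1, 1: 2, 2: 3}
sparse_range = {0: 10, 2: 30}  # the cell at position 1 is empty
summed = apply_all(operator.add, full_range, sparse_range)
```

Zero is a natural fill value for addition, but it would change the result of operations such as division, so the fill value probably needs to depend on the operation.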

Add function to create name ranges within generated graph

Explore more calculation routes to ensure koala works

See https://github.com/anthill/engie/issues/42

Improve performance

See #16.

Set up an automatic Cython compiler on commit

Correct Evaluation

No need to remove all index

We don't need to remove every index, only those that return an address; those that return a value can stay. For the moment, we remove them all.

Cellmap inconsistency

When you prune your graph, the cellmap of the reduced graph has fewer cells than the original cellmap. But Ranges have been created with the original cellmap, so they might still reference a cell that doesn't exist in the reduced cellmap.
This problem gets solved by dumping/loading the graph, since Ranges are recreated from the reduced cellmap.

Still, such an inconsistency should be addressed.
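
As a first step, the inconsistency could at least be detected after pruning. The check below is a hypothetical sketch: it assumes each Range can be reduced to a list of cell addresses and that the cellmap is a dict keyed by address, which may differ from Koala's actual structures.

```python
def find_dangling_cells(ranges, cellmap):
    """Map each range name to the addresses it references that are absent from `cellmap`."""
    dangling = {}
    for name, addresses in ranges.items():
        missing = [address for address in addresses if address not in cellmap]
        if missing:
            dangling[name] = missing
    return dangling


# Hypothetical data: the range was built before pruning removed Sheet1!A2.
original_ranges = {"Sheet1!A1:A3": ["Sheet1!A1", "Sheet1!A2", "Sheet1!A3"]}
pruned_cellmap = {"Sheet1!A1": 1.0, "Sheet1!A3": 3.0}
report = find_dangling_cells(original_ranges, pruned_cellmap)
```

An empty report would mean every Range is consistent with the reduced cellmap.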

Set up a detailed Benchmark

Related to #17.

We need to understand exactly where we gain performance and where we simplify the graphs.
A detailed benchmark is therefore needed.

The main 3 options we've added are:

volatile cleaning
pruning (inputs selection)
outputs selection

For each of these options, we want to know:

what is the size reduction of the graph (nodes, edges)?
what is the time reduction of gen_graph?
what is the time reduction of set_value?
what is the time reduction of evaluate?
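
The timing questions above could be answered with a small helper like the one sketched here. `time_call` and `reduction` are hypothetical names; the idea is to time gen_graph, set_value and evaluate with and without each option, then compare the numbers.

```python
import time


def time_call(func, *args, repeat=3, **kwargs):
    """Call `func` `repeat` times and return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


def reduction(before, after):
    """Relative reduction: 0.25 means 25% smaller or faster than `before`."""
    return (before - after) / before


seconds, value = time_call(sum, [1, 2, 3])
```

The same `reduction` helper works for node and edge counts, so one table per option (volatile cleaning, pruning, outputs selection) can hold both size and time figures.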

print 'First evaluation', sp.evaluate('Cashflow!G187') # => outputs -2966.25862693
sp.set_value('InputData!G14', 0) # this is to avoid direct evaluation
sp.set_value('InputData!G14', 2025)
print 'Second evaluation', sp.evaluate('Cashflow!G187') # => outputs -3719.5504961

With InputData!G14 as 2025 in the .XLS,

print 'First evaluation', sp.evaluate('Cashflow!G187') # => outputs -2582.30664008
sp.set_value('InputData!G14', 0) # this is to avoid direct evaluation
sp.set_value('InputData!G14', 2025)
print 'Second evaluation', sp.evaluate('Cashflow!G187') # => outputs -2582.30663952
