cape-dataframes's People

Contributors

bendecoste, gavinuhma, justin1121, jvmncs, kjam, mortendahl, starfallprojects, yanndupis

cape-dataframes's Issues

Bring back seeds support to Spark DatePerturbation

Is your feature request related to a problem? Please describe.

The Spark DatePerturbation currently doesn't support seeding due to issues with sharing a seed across multiple executors.

Describe the solution you'd like

Ability to specify a seed for DatePerturbation.
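The desired usage might look like the sketch below. The seed keyword on the Spark DatePerturbation is hypothetical (it mirrors the seed argument that NumericPerturbation already accepts), and the import path assumes the Spark transformations module mirrors the pandas one:

from cape_privacy.spark import transformations as tfms

perturb_date = tfms.DatePerturbation(
    frequency=("YEAR", "MONTH", "DAY"),
    min=(-10, -5, -5),
    max=(10, 5, 5),
    seed=1234,  # hypothetical: not currently supported on Spark; this is the requested feature
)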

Multiple Columns / Rows Supported in Policy Selection

Is your feature request related to a problem? Please describe.
When you want to apply the same transformation across multiple columns, you have to copy and paste a bunch of code. Ideally, there would be a way to say you would like to apply the same transformation across many columns.

Describe the solution you'd like
An update to the YAML policy implementation file to make it easier to apply transformations to more than one row or column.

Describe alternatives you've considered
The current system allows you to copy and paste - however, this could end up making the YAML file quite long when we are dealing with MANY columns and it feels like repeated work.
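On the pandas side, the copy-and-paste can at least be scripted while we wait for YAML support. A minimal sketch (the column names here are illustrative):

import pandas as pd
from cape_privacy.pandas import transformations as tfms

df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "last_login": pd.to_datetime(["2020-03-01", "2020-04-01"]),
})

perturb_date = tfms.DatePerturbation(frequency="DAY", min=-3, max=3)

# apply the same transformation to many columns without repeating policy blocks
for col in ["signup_date", "last_login"]:
    df[col] = perturb_date(df[col])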

Additional context
There is some more context in the Cape Community Slack!

Calling DatePerturbation alters the original pd.Series

Describe the bug
Calling the DatePerturbation transformation modifies the original pd.Series in place rather than returning a perturbed copy.

To Reproduce

>>> import pandas as pd
>>> from cape_privacy.pandas.transformations import DatePerturbation

>>> def load_dataset():
...     return pd.DataFrame({
...         "name": ["alice", "bob"],
...         "age": [34, 55],
...         "birthdate": [pd.Timestamp(1985, 2, 23), pd.Timestamp(1963, 5, 10)],
...         "salary": [59234.32, 49324.53],
...         "ssn": ["343554334", "656564664"],
...     })

>>> df = load_dataset()
>>> perturb_date = DatePerturbation(frequency=("YEAR", "MONTH", "DAY"), min=(-10, -5, -5), max=(10, 5, 5))
>>> perturb_date(df["birthdate"])  # result not assigned, yet df["birthdate"] is modified

Expected behavior
Calling the transformation without assigning its result should leave the original pd.Series unchanged; instead, df["birthdate"] is modified in place.
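Until this is fixed, a caller-side workaround (a minimal sketch, assuming the mutation happens on the Series passed in) is to hand the transformation a copy:

# defensive copy so the original column survives the call
perturbed = perturb_date(df["birthdate"].copy())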


Pretty Print for Cape Python objects (like Client and Policy)

Is your feature request related to a problem? Please describe.
It would be nice if the Cape Python library had the ability to print out useful information for the core primitive objects, such as the Policy objects and Client objects.

Describe the solution you'd like
Policy might display the YAML file or some representation of it. The Client could display the session information.
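As a hedged illustration only (the class and attribute names below are hypothetical, not the current cape API), the objects could implement __repr__:

class Policy:
    # illustrative sketch; the real cape Policy class differs
    def __init__(self, label, yaml_source):
        self.label = label
        self.yaml_source = yaml_source

    def __repr__(self):
        # show the policy label plus a short preview of its YAML definition
        preview = "\n".join(self.yaml_source.splitlines()[:5])
        return f"<cape.Policy label={self.label!r}>\n{preview}"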

Describe alternatives you've considered
Currently, we just show an object <cape.Client object>, so we could keep it that way.

Additional context
I found this by building our initial walkthrough, so this hasn't been validated by user research. I still think it is a neat feature. :)

Exploring Snowflake Support via PyArrow & Pandas

Is your feature request related to a problem? Please describe.
We would like to eventually support workflows that use SnowflakeDB. One idea is to use Snowflake's integration with PyArrow and Pandas, both to learn more about Arrow and to support Snowflake data: https://docs.snowflake.com/en/user-guide/python-connector-pandas.html

Describe the solution you'd like
An initial test of whether this workflow is feasible would be useful, along with benchmarks for pulling data into Pandas and then applying Cape policy to it. It might also be worthwhile to dive into the library internals to see how the Query -> Arrow -> Dataframe workflow works!
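A rough feasibility sketch under stated assumptions: the connection parameters and table are placeholders, fetch_pandas_all comes from the linked Snowflake connector docs, and the Cape transformation applied afterwards is just one example:

import snowflake.connector
from cape_privacy.pandas import transformations as tfms

conn = snowflake.connector.connect(
    account="my_account",      # placeholder credentials
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
)

cur = conn.cursor()
cur.execute("SELECT name, birthdate FROM customers")  # hypothetical table
df = cur.fetch_pandas_all()  # Arrow-backed fetch straight into a pandas DataFrame

# Snowflake upper-cases unquoted column names by default
perturb_date = tfms.DatePerturbation(frequency="DAY", min=-3, max=3)
df["BIRTHDATE"] = perturb_date(df["BIRTHDATE"])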

Describe alternatives you've considered
We have explored the idea of an ODBC or JDBC layer as another way of solving this issue.

Additional context
We could pick an interesting example use case and check it out! It would also be exciting to hear about the architecture choices Snowflake made here and to explore whether we could apply policy in/to Arrow.

Required pandas version 1.0.3. conflicts with other packages

Describe the bug
On some machines, pip install cape-privacy fails because of conflicting pandas version requirements. On others, with pandas==1.0.3 installed, DatePerturbation raises:
ImportError: cannot import name 'values_from_object' from 'pandas._libs.lib' (/usr/local/lib/python3.7/dist-packages/pandas/_libs/lib.cpython-37m-x86_64-linux-gnu.so)

After uninstalling pandas 1.0.3 and installing the newest version of pandas, the method works.


Desktop (please complete the following information):

  • OS: Linux
  • OS Version: 21.04
  • Python 3.9.5
  • Installed pip packages

Please loosen up pandas and numpy versions in setup.py and requirements.txt @justin1121
I am using this package for my project.

Integrate Cape Python to work with Dask

Is your feature request related to a problem? Please describe.
We've had several users request working with Dask directly instead of Spark and Pandas. Because of its use in the Python data science community and its ease of use for out-of-core computations and parallelization of workflows, it fits well with the data science needs we are trying to address.

Describe the solution you'd like
We should see how many changes we would need to get the cape_pandas integrations working for Dask DataFrames. Matt Rocklin had a look on the webinar and pointed out that only a few lines (for example, where we explicitly wrap a returned array in pd.Series) would need to be updated for it to just work.
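A quick way to probe feasibility without any code changes is to push the existing pandas transformation through map_partitions; a minimal sketch, assuming the transformation round-trips a pandas Series cleanly (which is exactly what the issue questions):

import dask.dataframe as dd
import pandas as pd
from cape_privacy.pandas import transformations as tfms

perturb_date = tfms.DatePerturbation(frequency="DAY", min=-3, max=3)

pdf = pd.DataFrame({"birthdate": pd.date_range("2000", periods=8)})
ddf = dd.from_pandas(pdf, npartitions=2)

# apply the existing pandas transformation partition by partition
ddf["birthdate"] = ddf["birthdate"].map_partitions(
    perturb_date, meta=("birthdate", "datetime64[ns]")
)
print(ddf.compute())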

Describe alternatives you've considered
We could wait on Dask integration to prioritize other integrations; however, if it truly is as simple as changing a few returns, I would prefer we do it sooner! :)

Additional context
To hear Matt's comments, check out around the 48-minute mark of this YouTube video: https://www.youtube.com/watch?v=cIvv8EGMDY0&feature=youtu.be - I'm sure he is happy to help if we need extra guidance! 🙌

Non-pinned Versions for Supporting Libraries

Is your feature request related to a problem? Please describe.
Since many folks will use this library in conjunction with other Python libraries, it is best practice not to pin exact versions, but instead to allow any release from a tested baseline version upwards. This lets folks deploy Cape in places where they might need or use a higher or lower library version than the one we normally use.

Describe the solution you'd like
It would be nice to figure out the minimum supported versions and use >= in the requirements file so folks can have some variance in their versions. This would also solve our deployment issue on Colab! We should then also ask folks to report any bugs they find with newly released versions of libraries.
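A hedged sketch of what the setup.py side could look like; the package list and version floors below are illustrative, not the actual tested minimums:

from setuptools import find_packages, setup

setup(
    name="cape-privacy",
    packages=find_packages(),
    install_requires=[
        "pandas>=1.0",   # lower bound only, so newer releases are allowed
        "numpy>=1.18",
        "pyyaml>=5.3",
    ],
)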

Describe alternatives you've considered
We can continue to pin versions and this will be okay for folks that don't have a lot of other Python libraries installed and running.

Additional context
This came in as a feature request from a recent user interview and from my own use of Cape Python on GCP!

Pandas compatibility

Hi,

I'd like to use cape-privacy in a project; however, said project runs pandas 1.2, which is not compatible with cape-python's dependency constraint of pandas~=1.0.3.

As pandas 1.0 is now more than 1 year old, would it be possible to release a new version of cape-privacy without the strict version constraint on pandas (I'm assuming it works without changing anything in cape-python)?

Perturbation That Preserves Ordering

Is your feature request related to a problem? Please describe.
It would be nice if I could preserve ordering and still allow for minimal perturbation - for example, for time series related data.

Describe the solution you'd like
Is there a way to allow perturbation to be stateful or to take into consideration the constraints of time-series based data or other data where increasing or decreasing values are significant?

Describe alternatives you've considered
You could add another column with a simple ordered index and use that as a workaround.
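Beyond the index-column workaround, one possible stateless approach, sketched here outside of cape: add noise, then reassign the sorted noisy values by the ranks of the originals so the ordering is preserved.

import numpy as np
import pandas as pd

def order_preserving_perturb(s, scale, seed=0):
    # perturb values but keep the original ordering (sketch, not a cape API)
    rng = np.random.default_rng(seed)
    noisy = s + rng.uniform(-scale, scale, size=len(s))
    # ranks of the original values decide which sorted noisy value each row gets
    ranks = s.rank(method="first").astype(int) - 1
    return pd.Series(np.sort(noisy.to_numpy())[ranks], index=s.index)

ts = pd.Series([10.0, 12.5, 13.0, 20.0])
print(order_preserving_perturb(ts, scale=1.0))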

Additional context
This was brought up during our YouTube Live video here: https://youtube.com/watch?v=cIvv8EGMDY0&feature=youtu.be

`DatePerturbation` raises when the index doesn't contain the key 0

Describe the bug

When a pandas Series is passed that has a non-default integer index (really, one that just doesn't contain the key 0), a KeyError is raised.

To Reproduce

In [2]: import pandas as pd
   ...: from cape_privacy.pandas import transformations as tfms
   ...:
   ...: perturb_application_date = tfms.DatePerturbation(frequency="DAY", min=-3, max=3)
   ...: s = pd.Series(pd.date_range('2000', periods=12), index=list(range(1, 13)))
   ...: s
Out[2]:
1    2000-01-01
2    2000-01-02
3    2000-01-03
4    2000-01-04
5    2000-01-05
6    2000-01-06
7    2000-01-07
8    2000-01-08
9    2000-01-09
10   2000-01-10
11   2000-01-11
12   2000-01-12
dtype: datetime64[ns]

In [3]: perturb_application_date(s)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/Envs/dask-dev/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2888             try:
-> 2889                 return self._engine.get_loc(casted_key)
   2890             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-3-6365109791c4> in <module>
----> 1 perturb_application_date(s)

~/Envs/dask-dev/lib/python3.8/site-packages/cape_privacy/pandas/transformations/perturbation.py in __call__(self, x)
    111
    112         # Use equality instead of isinstance because of inheritance
--> 113         if type(x[0]) == datetime.date:
    114             x = pd.to_datetime(x)
    115             is_date_no_time = True

~/Envs/dask-dev/lib/python3.8/site-packages/pandas/core/series.py in __getitem__(self, key)
    880
    881         elif key_is_scalar:
--> 882             return self._get_value(key)
    883
    884         if (

~/Envs/dask-dev/lib/python3.8/site-packages/pandas/core/series.py in _get_value(self, label, takeable)
    989
    990         # Similar to Index.get_value, but we do not fall back to positional
--> 991         loc = self.index.get_loc(label)
    992         return self.index._get_values_for_loc(self, loc, label)
    993

~/Envs/dask-dev/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2889                 return self._engine.get_loc(casted_key)
   2890             except KeyError as err:
-> 2891                 raise KeyError(key) from err
   2892
   2893         if tolerance is not None:

KeyError: 0

Expected behavior

The same behavior as if the index did have a key 0.

Desktop (please complete the following information):

  • OS: macOS

  • OS Version: 10.14.5

  • Python Version: 3.8

  • Installed pip packages

  • cape-privacy 0.2.0

  • pandas 1.1.0

Additional context

I think cape wants something like x.iloc[0], but even that will fail if the Series is empty and has no rows. Perhaps something like if pd.api.types.is_object_dtype(x.dtype) then try a pd.to_datetime(x)? I haven't looked closely at the code.
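A hedged sketch of both suggestions (illustrative helper names, not the actual cape code):

import datetime
import pandas as pd

def looks_like_date_without_time(x):
    # positional access instead of the label-based x[0], guarded for empty input
    if len(x) == 0:
        return False
    return type(x.iloc[0]) == datetime.date

def maybe_to_datetime(x):
    # dtype-based variant: only object-dtype columns get coerced
    if pd.api.types.is_object_dtype(x.dtype):
        return pd.to_datetime(x)
    return x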

NumericPerturbation - Masking different values for same source data

Describe the bug
Is it possible to get the same masked value for an integer across all of its occurrences?
In the sample below, I define an age column where every value is 14, but after masking the values are not the same.

To Reproduce
import pandas as pd
from cape_privacy.pandas import dtypes
from cape_privacy.pandas.transformations import NumericPerturbation
df = pd.DataFrame({"age": [14,14,14,14]})
perturb_age = NumericPerturbation(dtype=dtypes.Integer, min=-10, max=10, seed=111)
df["age"] = perturb_age(df["age"])
print(df)

Expected behavior
I am trying to mask an integer to the same value for all of its occurrences.
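NumericPerturbation draws independent noise per row, so repeated values diverge. A hedged workaround sketch that keys the noise to the value instead, using plain pandas/numpy rather than a cape transformation:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [14, 14, 14, 14, 30]})

rng = np.random.default_rng(111)
# one perturbed value per distinct input, so identical inputs stay identical
mapping = {v: v + int(rng.integers(-10, 11)) for v in df["age"].unique()}
df["age_masked"] = df["age"].map(mapping)
print(df)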


Spark UserWarning for Deprecated UDF Feature

Describe the bug
When using Cape Python with PySpark on Spark 3.0.0, there is a UserWarning about type hints, noting that we are using a pandas UDF style that will be deprecated in future releases.

/usr/local/spark/python/pyspark/sql/pandas/functions.py:386: UserWarning: In Python 3.6+ and Spark 3.0+, it is preferred to specify type hints for pandas UDF instead of specifying pandas UDF type which will be deprecated in the future releases. See SPARK-28264 for more details. "in the future releases. See SPARK-28264 for more details.", UserWarning)

To Reproduce
Follow along with the IoT example notebook on a Spark 3.0.0 installation.

Expected behavior
No warning message; we should try to match Spark's latest API if that is the recommended way to use pandas UDFs with Cape Python.
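For reference, SPARK-28264 moves pandas UDFs to type-hint based definitions; a minimal generic illustration of the preferred style (not Cape's actual UDF code):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Spark 3.0 style: the UDF type is inferred from the Python type hints
@pandas_udf("double")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1.0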


Desktop (please complete the following information):

  • OS: Ubuntu
  • OS Version: 18.04.4
  • Python Version: 3.6.9
  • Installed pip packages: cape!

Additional context
I am running the latest Spark 3.0.0 package for Linux.

Tokenizer for email addresses

Is your feature request related to a problem? Please describe.
Working with data collected from online form fills, I need to mask email addresses as they are collected, but once I do this using the tokenizer built for names, I can no longer parse or analyze the domain.

Describe the solution you'd like
The tokenizer should be applied to both parts of an email address while maintaining the '@' sign. Alternatively, allow the user to determine which part of the email address requires masking (either the username or the domain).
This way analysis can still be performed on either part without compromising the identity of the user.
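A minimal sketch of the requested behavior using a keyed hash for illustration (this is not the cape Tokenizer API; the key and digest size are arbitrary):

import hashlib
import pandas as pd

def tokenize_email(email, key=b"secret-key"):
    # tokenize the local part of an email, keep the domain intact
    local, _, domain = email.partition("@")
    token = hashlib.blake2b(local.encode(), key=key, digest_size=8).hexdigest()
    return f"{token}@{domain}"

emails = pd.Series(["alice@example.com", "bob@example.com"])
print(emails.map(tokenize_email))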

Docstrings missing in some files

Is your feature request related to a problem? Please describe.
For the autogenerated API docs (https://docs.capeprivacy.com/pythonv1/readme/), we end up with some blank pages due to missing docstrings, and some of the pages are quite sparse on examples.

Describe the solution you'd like
Add the docstrings

Describe alternatives you've considered
Don't include the pages in the docs - let me know if that would make more sense

Additional context
Current blank pages:
https://docs.capeprivacy.com/pythonv1/cape_privacy.pandas/dtypes/
https://docs.capeprivacy.com/pythonv1/cape_privacy.spark/dtypes/

Current pages that could use more info/examples to bring them to the same level as similar pages:
https://docs.capeprivacy.com/pythonv1/cape_privacy.pandas/transformations/column_redact/
https://docs.capeprivacy.com/pythonv1/cape_privacy.pandas/transformations/row_redact/

Run docker builds in ci on every commit

Is your feature request related to a problem? Please describe.

Docker builds should be run on every commit so they don't accidentally get broken.

Describe the solution you'd like

Run docker builds in ci on every commit.

See what Transformations are Linked to What Data Types

Is your feature request related to a problem? Please describe.
It would be useful within EDA to see what types of transformations I can apply to what Series/Columns so that I can know how to approach the problem.

Describe the solution you'd like
There are several ways this could be implemented - it might be nice to explore a few and investigate how others have solved this problem. One way off the top of my head: use the pandas or interactive Spark session dataframe dtypes as input and display a list of possible transformations that might apply. Another idea is to pass in a Series or Spark column and have a show_available_transforms method (or something with a better name).
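A hedged sketch of that first idea; the helper name echoes the suggestion above, and the dtype-to-transformation table is purely illustrative:

import pandas as pd

# illustrative mapping from pandas dtypes to candidate Cape transformations
TRANSFORMS_BY_DTYPE = {
    "datetime64[ns]": ["DatePerturbation"],
    "int64": ["NumericPerturbation"],
    "float64": ["NumericPerturbation"],
    "object": ["Tokenizer", "ColumnRedact"],
}

def show_available_transforms(df):
    # proposed helper: suggest transformations per column based on its dtype
    return {col: TRANSFORMS_BY_DTYPE.get(str(dtype), []) for col, dtype in df.dtypes.items()}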

Describe alternatives you've considered
One can look through the documentation and the docstrings to see this.

Additional context
I can share a notebook of what I am thinking!
