Comments (8)
Hello capellino,
I don't have a definitive answer to your problem, but maybe I have some clues.
Personally, I am not using the append() function because:
- most notably, as reported in my other tickets, I didn't succeed in getting it to work,
- and also, appending one dataframe to another can follow different logics depending on the topic being studied.
More specifically, when you append one dataframe to another, it seems logical not to record rows that are present in both dataframe 1 and dataframe 2. But this means you must be able to define an "equality" operator for this data, and which information needs to be checked to confirm equality is entirely up to you.
To give you an example, in my case I have:
- index: timestamp of an occurrence
- 1st column: value of the occurrence
- 2nd column: time of recording in pystore
In my case, equality is based only on the index and the 1st column.
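Translated into pandas, such a custom equality check might look like this (a hypothetical sketch with made-up data; the column names are only for illustration):

```python
import pandas as pd

# Index: occurrence timestamp; 1st column: value; 2nd column: recording time
df = pd.DataFrame({
    "value":    [10.0, 10.0, 12.0],
    "recorded": ["t1", "t2", "t3"],   # time of recording in pystore
}, index=pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]))
df.index.name = "occurrence"

# Two rows are "equal" when both the occurrence timestamp (index) and the
# value match; the recording time is deliberately ignored.
deduped = (df.reset_index()
             .drop_duplicates(subset=["occurrence", "value"], keep="first")
             .set_index("occurrence"))
print(len(deduped))  # 2: the second 10.0 at 2021-01-01 is treated as a duplicate
```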
Well, after all this rambling, the question is: on which notion of equality is pystore's append() function based?
You can have a look at the source of collection.py:
```python
combined = dd.concat([current.data, new]).drop_duplicates(keep="last")
```
If you have a look at the 'drop_duplicates' documentation (well, I only checked the pandas one, not dask's, but I understand dask mirrors the pandas API), you will see that it identifies duplicates purely on column values, not on the index.
So rows can have different indexes but the same values: it won't care, it will simply drop the 'duplicated' rows.
Back to your example, maybe this is the reason.
You generate a string dataframe based on 3 letters, abc as far as I understand, so I am not that surprised it keeps only 3 rows.
With floats, I guess no two values are ever the same.
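That value-only matching is easy to reproduce (a minimal sketch with made-up data; the column name is arbitrary):

```python
import pandas as pd

# Two frames with distinct timestamps but the same *values*
df1 = pd.DataFrame({"letter": ["a", "b", "c"]},
                   index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]))
df2 = pd.DataFrame({"letter": ["a", "b", "c"]},
                   index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]))

combined = pd.concat([df1, df2]).drop_duplicates(keep="last")
print(len(combined))  # 3: duplicates are decided on column values only, the index is ignored
```

Six rows go in, three come out, even though all six index labels are distinct.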
To finish: in my own case, I only use the write() function and do the appending myself in a few lines.
To give you an idea of my 'appending' strategy, here it is (for financial data):
```python
# Only remove duplicates based on key columns for the given data type
# Key columns for OHLCV data are 'Timestamp', 'Open', 'High', 'Low', 'Close' & 'Volume'
# Do not keep the last row of the DataFrame currently in base (empty row)
# Do not keep the last row of the new DataFrame either (empty row)
combined = pd.concat([current[:-1].reset_index(), df[:-1].reset_index()])
combined.sort_values('Timestamp', inplace=True)
combined.drop_duplicates(subset=AC.DataTypes[cde.data_type], inplace=True)
combined.set_index(keys='Timestamp', inplace=True)

# Re-add the row with the latest timestamp as the last row
# We don't know whether the added data is the newest or the earliest, so check
if current.index[-1] > df.index[-1]:
    last_row = current.iloc[-1]
else:
    last_row = df.iloc[-1]
combined = combined.append(last_row, sort=False)
my_collection.write(item_ID, combined, overwrite=True)
```
Hope this gives you some clues.
Bests
from pystore.
Thank you very much for your suggestions. I solved it by not using the append() function at all, as you suggested!
Just a naive question: why are you using drop_duplicates? Or, maybe a better question: should we use drop_duplicates when the index is part of the data (e.g. a DatetimeIndex for a time series)?
Hi,
there is a problem with the append function.
The line:
```python
combined = dd.concat([current.data, new]).drop_duplicates(keep="last")
```
in the file collection.py should be substituted by:
```python
idx_name = current.data.index.name
combined = dd.concat([current.data, new]).reset_index().drop_duplicates(subset=idx_name, keep="first").set_index(idx_name)
```
For further explanation, please refer to:
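The effect of that fix can be sketched in plain pandas (the dask version behaves the same way; the data here is hypothetical):

```python
import pandas as pd

# The same index value appears in both the stored frame and the new frame
current = pd.DataFrame({"price": [1.0, 2.0]},
                       index=pd.to_datetime(["2021-01-01", "2021-01-02"]))
current.index.name = "Timestamp"
new = pd.DataFrame({"price": [2.5, 3.0]},
                   index=pd.to_datetime(["2021-01-02", "2021-01-03"]))
new.index.name = "Timestamp"

# Move the index into a column so drop_duplicates can deduplicate on it
idx_name = current.index.name
combined = (pd.concat([current, new])
              .reset_index()
              .drop_duplicates(subset=idx_name, keep="first")
              .set_index(idx_name))
print(len(combined))  # 3: the 2021-01-02 row from `current` wins (keep="first")
```

Deduplication now happens on the index label rather than on the row values, which is usually what you want for a time series.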
Just a naive question: why are you using drop_duplicates? Or, maybe a better question: should we use drop_duplicates when the index is part of the data (e.g. a DatetimeIndex for a time series)?
In general, pandas indexes are not required to be unique, so they can contain repeated values. Therefore you need to remove duplicated index entries yourself if unique ids are needed.
In the SO post you link, they suggest a simpler and more efficient alternative for removing duplicate indices:
df = df[~df.index.duplicated(keep='first')]
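A minimal sketch of that approach (made-up data):

```python
import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.0, 2.5]},
                  index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-02"]))

# Boolean mask: True for every repeat of an index label after its first occurrence
deduped = df[~df.index.duplicated(keep="first")]
print(deduped["price"].tolist())  # [1.0, 2.0]
```

It avoids the reset_index/set_index round trip and operates on the index directly.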
In the SO post you link, they suggest a simpler and more efficient alternative for removing duplicate indices:
df = df[~df.index.duplicated(keep='first')]
You are right. The method I mentioned is easier for me to understand, but less efficient and compact.
Related Issues (20)
- Symbols containing regex are broken with dask/fastparquet/fsspec HOT 1
- Dependabot couldn't authenticate with https://pypi.python.org/simple/
- Does append() work on OSX? HOT 3
- how to read all columns but the one use for partition
- Nested Dataframes causes exception
- collection.list_items() with metadata paremeter is showing "*** json.decoder.JSONDecodeError: Expecting value: line 1 column 198 (char 197)" HOT 1
- Append function not working
- Cause of most silent append errors HOT 3
- Multiindex and/or building minute bars HOT 1
- Is append loading the entire data into memory just to append new data ? HOT 1
- .to_pandas() error [can't read parquet file even though there is data in it when i look with parquet viewer] HOT 1
- Pystore Tutorial loading data problem
- issue reading back an item with metadata.json but no "_metadata"
- _updated in metadata use hour instead of minute
- Append lose data : by default remove duplicted indices. HOT 1
- Importing Pystore now gives ''EntryPoints' object has no attribute 'get''. HOT 1
- problem
- Strange path behaviour when using IPython terminal in Spyder
- Append ignores time series index when data is identical? HOT 1
- drop fastparquet and use pyarrow. this is required on latest versions of dask HOT 4