Comments (9)
Thanks for reaching out. Yes, wiki contains duplicates. @alexandervanrenen can you provide some information about the dataset?
I don't understand what you mean by record length. We extend each 64-bit unsigned integer key with a 64-bit value (payload). The payloads don't affect the mapping of keys to positions.
However, it is true that with larger payloads the last-mile search (e.g., binary search) would become more expensive due to extra cache misses.
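To illustrate the cache effect, here is a rough sketch (not the actual benchmark code; `Record` and `LastMileLookup` are made-up names) of a last-mile binary search over records that store the payload inline next to the key:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative record layout: 64-bit key with a 64-bit inline payload.
struct Record {
  uint64_t key;
  uint64_t payload;
};

// Last-mile binary search inside the error window [lo, hi] predicted by
// the learned model. Larger payloads mean fewer records per cache line,
// so each probe is more likely to incur a cache miss.
uint64_t LastMileLookup(const std::vector<Record>& data, size_t lo,
                        size_t hi, uint64_t key) {
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (data[mid].key < key) lo = mid + 1;
    else hi = mid;
  }
  return data[lo].payload;  // assumes `key` is present in the window
}
```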
from sosd.
Sure! The wiki dataset is based on Wikipedia. We gathered the timestamps of all edits to pages. I think they have second or millisecond granularity, which explains the collisions. You can find the data and descriptions here. If I remember correctly, we only used a subset of the English Wikipedia, as that already contained more than enough timestamps.
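If you want to verify the duplicates yourself, here is a minimal sketch, assuming the SOSD on-disk layout of a uint64 element count followed by the sorted uint64 keys (adjust if the format differs; the file path is hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
  // Hypothetical path; point this at your local copy of the dataset.
  std::ifstream in("data/wiki_ts_200M_uint64", std::ios::binary);
  uint64_t n;
  in.read(reinterpret_cast<char*>(&n), sizeof(n));
  std::vector<uint64_t> keys(n);
  in.read(reinterpret_cast<char*>(keys.data()), n * sizeof(uint64_t));

  // Adjacent equal keys in the sorted array are edits that fell into
  // the same second/millisecond bucket.
  uint64_t duplicates = 0;
  for (size_t i = 1; i < keys.size(); ++i)
    if (keys[i] == keys[i - 1]) ++duplicates;
  std::cout << duplicates << " duplicate keys\n";
}
```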
from sosd.
Thank you for your help!
I mean that each key corresponds to an index in the array. If the record length is 1, then, with the keys sorted, the positions are 0, 1, 2, 3, ...; but if the record length is not 1, say 20, then the positions are 0, 20, 40, 60, .... So even with the same primary keys, the mapping between keys and positions changes, and the performance of the learned index is quite different.
Best wishes!
from sosd.
Thank you for your help, I found the description.
from sosd.
Hi @chaohcc. No, the record length doesn't affect the positions since each record is fixed size. You can just multiply the model's prediction by the record size to get the byte offset of the record. The position of the 5th record will always be 4 (0-indexed), and its byte offset will be 4 * record_size. So the model only needs to know that the 5th record is stored at position 4.
With variable-sized records that's a different story. But even then you probably wouldn't train on the raw byte offsets but instead use another structure for the indirection, like store an array of pointers and index into that with the model.
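A quick sketch of the two cases (illustrative only, with made-up names, not SOSD code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size records: the model predicts a position; the byte offset is
// just position * record_size, so the key-to-position mapping the model
// learns is independent of the record length.
const uint8_t* FixedSizeLookup(const uint8_t* base, size_t record_size,
                               size_t predicted_position) {
  return base + predicted_position * record_size;
}

// Variable-size records: keep an array of record pointers as an
// indirection layer and let the model index into that, rather than
// training on raw byte offsets.
const uint8_t* VariableSizeLookup(
    const std::vector<const uint8_t*>& record_ptrs,
    size_t predicted_position) {
  return record_ptrs[predicted_position];
}
```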
I hope that helps.
from sosd.
Thank you, that's very helpful. I also tried variable-size records and built a mapping table with just the keys in an array instead of the raw byte offsets.
Best wishes!
from sosd.
Hi! I have another question about the dataset: I'd like more information about 'books_200M'. Is it a timestamp or some other ID? Thank you for your help!
Best wishes!
from sosd.
@RyanMarcus If I remember correctly, the books dataset contains the popularity of each book, i.e., the number of times it was accessed or bought on Amazon. No timestamps.
from sosd.
Thank you so much!
from sosd.