
bachelor-project's People

Contributors

snailed


bachelor-project's Issues

Insert ("Formal definition will follow") after the problem description

The \textit{Approximate Similarity Search Problem} concerns efficiently finding a set $A$ from a corpus $\mathcal{F}$ that is approximately similar to a query set $Q$ with respect to the \textit{Jaccard Similarity} metric $J(A,Q) = \frac{|A\cap Q|}{|A\cup Q|}$\cite{dahlgaard2017fast}\cite{fast-similarity-search}. Practical applications include searching through large corpora of high-dimensional text documents, such as plagiarism detection and website-duplication checking\cite{vassilvitskii2018}. The main bottleneck in this problem is the \textit{curse of dimensionality}. A trivial algorithm can solve the problem in $O(nd|Q|)$ time, but algorithms whose query time is linear in the dimensionality of the corpus scale poorly on high-dimensional datasets. Text documents are especially problematic in this regard, since they are often encoded using \textit{$w$-shingles} ($w$ contiguous words), which \citet{li2011hashing} show can easily reach a dimensionality upwards of $d=2^{83}$ using just $5$-shingles.\\
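To make the definitions above concrete, here is a minimal Python sketch of $w$-shingling and the Jaccard similarity $J(A,Q)$; the function names are illustrative, not taken from any cited implementation.

```python
def shingles(text: str, w: int = 5) -> set:
    """Return the set of w-shingles (w contiguous words) of a text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity J(A, Q) = |A ∩ Q| / |A ∪ Q|."""
    if not a and not b:
        return 1.0  # two empty sets are conventionally identical
    return len(a & b) / len(a | b)
```

Even this toy version hints at the dimensionality problem: each distinct $w$-tuple of words is one dimension, so the universe of possible shingles grows combinatorially with vocabulary size.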

Insert reflections on the method into the abstract

This project presents an analysis of recent advances in solving the Approximate Jaccard Similarity Search Problem, specifically how one can achieve sublinear query time using parallel bit counting as presented by Knudsen\cite{fast-similarity-search}. It contains a theoretical analysis of both the runtime and the correctness of the bit-counting algorithm, as well as an empirical comparison with existing methods. The findings include a theoretical runtime advantage of parallel bit counting over a simple, linear-time algorithm, but empirical results show that this advantage is hard to realize in practice. Furthermore, reflections are made on how the results might scale on more specialized hardware.

Describe in the introduction how word-parallelism brings the query time down to sublinear time.

Many improvements have since been presented to improve processing time, query time, and space efficiency. Notable examples include (but are not limited to) the use of \textit{tensoring}\cite{andoni2006efficient}, \textit{b-bit minwise hashing}\cite{ping2011theory}, and \textit{fast similarity sketching}\cite{dahlgaard2017fast}. Simple applications of these techniques lead to efficient querying with a constant error probability. If one wishes to achieve an even better error probability, such as $\varepsilon = o(1)$, it is standard practice within the field to use $O(\log_2(1/\varepsilon))$ independent data structures and return the best result, resulting in a query time of $O(\log_2(1/\varepsilon)(n^\rho + |Q|))$. Recent advances by \citet{fast-similarity-search} show that it is possible to achieve an even better query time by sampling these data structures from one large sketch. When querying, the similarity between these sub-sketches and a query set must be evaluated efficiently, which requires efficiently computing the cardinality (number of set bits) of a bit-string. To this end, \citet{fast-similarity-search} presents a general parallel bit-counting algorithm that computes the cardinalities of a list of bit-strings in amortized sublinear time using word-parallelism. This brings the query time down to $O((\frac{n\log_2 w}{w})^\rho \log(1/\varepsilon) + |Q|)$.\\
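To illustrate the word-parallelism that such bit counting exploits, here is the classic SWAR ("SIMD within a register") popcount for a single 64-bit word. This is a standard textbook sketch, not Knudsen's algorithm itself, which amortizes the counting over a whole list of bit-strings.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # keep Python ints to 64 bits

def popcount64(x: int) -> int:
    """Count set bits in a 64-bit word with word-parallel additions.

    Each step sums adjacent bit-fields in parallel inside the word,
    so the count takes O(log w) word operations rather than O(w).
    """
    x = x - ((x >> 1) & 0x5555555555555555)                          # 2-bit field sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)   # 4-bit field sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F                          # 8-bit field sums
    return ((x * 0x0101010101010101) & MASK64) >> 56                 # sum all bytes
```

The key point is that a $w$-bit word processes $w$ bits at once, which is exactly the source of the $\frac{\log_2 w}{w}$ factor in the query-time bound above.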

Introduce rho in the introduction


Insert more connective text between the cardinality of a bit-string and sketches

