
bachelor-project's People

Contributors

snailed


bachelor-project's Issues

Insert ("Formal definition will follow") after the problem description

The \textit{Approximate Similarity Search Problem} concerns efficiently finding a set $A$ from a corpus $\mathcal{F}$ that is approximately similar to a query set $Q$ with respect to the \textit{Jaccard Similarity} metric $J(A,Q) = \frac{|A\cap Q|}{|A\cup Q|}$\cite{dahlgaard2017fast}\cite{fast-similarity-search}. Practical applications include searching through large corpora of high-dimensional text documents, such as plagiarism detection and website-duplication checking\cite{vassilvitskii2018}. The main bottleneck in this problem is the \textit{curse of dimensionality}. A trivial algorithm can solve the problem in $O(nd|Q|)$ time, but algorithms whose query time is linear in the dimensionality of the corpus scale poorly on high-dimensional datasets. Text documents are especially problematic in this regard, since they are often encoded using \textit{$w$-shingles} ($w$ contiguous words), which \citet{li2011hashing} show can easily reach a dimensionality upwards of $d=2^{83}$ using just $5$-shingles.\\
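To make the definitions above concrete, here is a minimal Python sketch of $w$-shingling and the Jaccard similarity $J(A,Q)$; the function names are illustrative, not taken from any cited implementation.

```python
def shingles(text: str, w: int = 5) -> set:
    """Return the set of w-shingles (w contiguous words) of a text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity J(A, Q) = |A ∩ Q| / |A ∪ Q|."""
    if not a and not b:
        return 1.0  # two empty sets are conventionally identical
    return len(a & b) / len(a | b)
```

Even this toy version hints at the dimensionality problem: each distinct $w$-tuple of words is one dimension, so the universe of possible shingles grows combinatorially with vocabulary size.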

Insert reflections on the method into the abstract

This project presents an analysis of recent advances in solving the Approximate Jaccard Similarity Search Problem, specifically how one can achieve sublinear query time using parallel bit counting as presented by Knudsen\cite{fast-similarity-search}. It contains a theoretical analysis of both the runtime and the correctness of the bit-counting algorithm, as well as an empirical comparison with existing methods. The findings include a theoretical runtime advantage of parallel bit counting over a simple, linear-time algorithm, but empirical results show that this advantage is hard to realize in practice. Furthermore, reflections are made on how the results might scale on more specialized hardware.

Describe in the introduction how word-parallelism brings the query time down to sublinear time.

Many improvements have since been presented to improve processing time, query time, and space efficiency. Notable examples include (but are not limited to) the use of \textit{tensoring}\cite{andoni2006efficient}, \textit{b-bit minwise hashing}\cite{ping2011theory}, and \textit{fast similarity sketching}\cite{dahlgaard2017fast}. Simple applications of these techniques lead to efficient querying with a constant error probability. If one wishes to achieve an even better error probability, such as $\varepsilon = o(1)$, it is standard practice within the field to use $O(\log_2(1/\varepsilon))$ independent data structures and return the best result, resulting in a query time of $O(\log_2(1/\varepsilon)(n^\rho + |Q|))$. Recent advances by \citet{fast-similarity-search} show that it is possible to achieve an even better query time by sampling these data structures from one large sketch. When querying, the similarity between these sub-sketches and a query set must be evaluated efficiently, which requires efficiently computing the cardinality (number of set bits) of a bit-string. To this end, \citet{fast-similarity-search} presents a general parallel bit-counting algorithm that computes the cardinalities of a list of bit-strings in amortized sublinear time using word-parallelism. This brings the query time down to $O((\frac{n\log_2 w}{w})^\rho \log(1/\varepsilon) + |Q|)$.\\
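To illustrate the word-parallelism that such bit counting exploits, here is the classic SWAR ("SIMD within a register") popcount for a single 64-bit word. This is a standard textbook sketch, not Knudsen's algorithm itself, which amortizes the counting over a whole list of bit-strings.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # keep Python ints to 64 bits

def popcount64(x: int) -> int:
    """Count set bits in a 64-bit word with word-parallel additions.

    Each step sums adjacent bit-fields in parallel inside the word,
    so the count takes O(log w) word operations rather than O(w).
    """
    x = x - ((x >> 1) & 0x5555555555555555)                          # 2-bit field sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)   # 4-bit field sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F                          # 8-bit field sums
    return ((x * 0x0101010101010101) & MASK64) >> 56                 # sum all bytes
```

The key point is that a $w$-bit word processes $w$ bits at once, which is exactly the source of the $\frac{\log_2 w}{w}$ factor in the query-time bound above.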

Introduce rho in the introduction


Insert more connective text between the cardinality of a bit-string and sketches

