Git Product home page Git Product logo

iceceng23's Introduction

iceceng23

Repository for "Model Agnostic Knowledge Transfer Methods for Sentence Embedding Models" paper by Gunel and Amasyali.

How to repeat the experiments

  1. Download Julia
  • All experiments are done with Julia version 1.8
  1. Install dependent packages. It is better to create separate virtual environments for different minimization methods since each dependent to different packages.
  • For Entropy Minimization experiments
    • Knet, JLD2, CUDA
  • For Distance Minimization experiments
    • Flux, BSON, ...
  1. Download data
  2. Execute

Data Loading for FastText Embeddings

Loading pre-trained FastText models (especially English - 2M by 300 vectors!!) takes too much time and space. Since there is no native way of loading them in Julia, I found a 2-step way: First you load the data by using readEmbeds function. This function returns a vocabulary and a matrix (word vectors). The second step is to write it in binary format and use Mmap utility for later usages. This is a one time thing but it saves tons of time and space.

readEmbeds function requires to have vocabulary and dimension size of the intended FastText model. To do so:

  1. Open ft-model.txt with an text editor (use nano since it gives you the number of lines(vocabulary) automatically)
  2. Insert voc_size dim_size at top line
  3. Save and exit
function readEmbeds(file; threshold=0, vocabulary=Nothing, dtype=Float32)
    @warn "This method reads the word embedding matrix in column major format"
    count, dims = parse.(Int64, split(readline(file), ' '))
    words  = String[]; # words in vocabulary
    matrix = isequal(vocabulary, Nothing) ?  Array{dtype}(undef, count, dims) : Array{Float32}[];

    # p = Progress(dims, 1, "Reading embeddig file: $file") # this is for progressbar
    for (i, line) in enumerate(drop(eachline(file), 1))
        mixed = split(chomp(line), " ")
        if vocabulary == Nothing
            push!(words, mixed[1])
            matrix[i, :] .= parse.(dtype, mixed[2:end])
        elseif in(mixed[1], vocabulary)
            push!(words, mixed[1])
            push!(matrix, parse.(dtype, mixed[2:end]))
        end
        # next!(p)
    end
    words, Matrix(permutedims(matrix))
end

Use writeBinaryEmbeddings to write it to the disk.

using Mmap

function writeBinaryEmbeddings(file::String, WE::Matrix, V::Array{String})
    d, w = size(WE);
    @info "Creating .bin for Word Embeddings"
    if d > w
        @warn "Permuting the Embedding matrix to raw major form"
        WE = Matrix(permutedims(WE))
    end

    s = open(file * "_WE.bin", "w+")
    write(s, d)
    write(s, w)
    write(s, WE)
    close(s)

    @info "Creating .txt for Vocabulary"
    s = open(file * "_voc.txt", "w+")
    for word in V
        write(s, word*"\n")
    end
    close(s)
    @info "Files are written by adding '_WE.bin' to the given file name $file "
end

Use readBinaryEmbeddings to load the saved vocabulary file and word vectors.

using Mmap

function readBinaryEmbeddings(file::String; atype=Float32)
    @info "Reading Word Embedding file"
    s = open(file * "_WE.bin")   # default is read-only
    m = read(s, Int)
    n = read(s, Int)
    WE = Mmap.mmap(s, Matrix{atype}, (m,n))
    close(s)

    @info "Reading vocabulary file"
    V = readlines(file * "_voc.txt")
    return V, Matrix(permutedims(WE))
end

iceceng23's People

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.