The tda4atd from danchern97

tda4atd's Issues

Parallelization in barcode calculation

Hi there, thanks for releasing your work.

I wanted to understand the design behind the barcode generation implementation in the codebase.

queue = Queue()
number_of_splits = 2
for i, filename in enumerate(tqdm(adj_filenames, desc='Calculating barcodes')):
    barcodes = defaultdict(list)
    adj_matricies = np.load(filename, allow_pickle=True) # samples X 
    print(f"Matricies loaded from: {filename}")
    ntokens = ntokens_array[i*batch_size*DUMP_SIZE : (i+1)*batch_size*DUMP_SIZE]
    splitted = split_matricies_and_lengths(adj_matricies, ntokens, number_of_splits)
    for matricies, ntokens in tqdm(splitted, leave=False):
        p = Process(
            target=subprocess_wrap,
            args=(
                queue,
                get_only_barcodes,
                (matricies, ntokens, dim, lower_bound)
            )
        )
        p.start()
        barcodes_part = queue.get() # block until the subprocess puts its barcodes on the queue
        p.join() # release resources
        p.close() # releasing resources of ripser

Why are barcodes calculated in this synchronous, one-subprocess-at-a-time manner? As I understand it, each loaded file of matrices is split into 2 parts (why was 2 chosen for number_of_splits?), and each part is then passed to its own subprocess in turn. Because queue.get() is called right after p.start(), the loop blocks until that subprocess has finished before the next one is started. GPU memory usage is quite low (around 150MB), which makes sense as only one document is considered at a time.

Since pool.starmap is used for other parts of the calculation, is there a reason it was not used for barcode calculation? The code below, which uses starmap, is much faster and appears to be correct. Please let me know if there's anything I'm missing. Thank you!

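import multiprocessing

nworkers = 10
# One argument tuple per split; starmap unpacks each tuple into get_only_barcodes.
args = [(matrices, ntokens, dim, lower_bound) for matrices, ntokens in split_matrices]

# The context manager shuts the pool down once the results are back.
with multiprocessing.Pool(nworkers) as pool:
    all_barcodes = pool.starmap(get_only_barcodes, args)

# starmap returns results in input order, so the parts can be merged one by one.
for barcodes_part in all_barcodes:
    barcodes = unite_barcodes(barcodes, barcodes_part)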
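
One way to check the "appears to be correct" claim is to compare the starmap result against a plain sequential loop on a small input. The sketch below does that with hypothetical stand-ins: `fake_get_only_barcodes` and `fake_unite_barcodes` are not the repository's functions, only toy replacements with the same call shape, and the `pool_cls` parameter is an assumption added so the comparison can also run on a thread pool.

```python
import multiprocessing
from collections import defaultdict
from multiprocessing.pool import ThreadPool


def fake_get_only_barcodes(matrices, ntokens, dim, lower_bound):
    """Hypothetical stand-in for get_only_barcodes (not the real function)."""
    barcodes = defaultdict(list)
    for matrix, n in zip(matrices, ntokens):
        # Toy "barcode": count entries above lower_bound in the top-left n x n block.
        count = sum(1 for row in matrix[:n] for v in row[:n] if v > lower_bound)
        barcodes[dim].append(count)
    return barcodes


def fake_unite_barcodes(barcodes, barcodes_part):
    """Hypothetical stand-in for unite_barcodes: extend the per-key lists."""
    for key, values in barcodes_part.items():
        barcodes[key].extend(values)
    return barcodes


def barcodes_sequential(splits, dim, lower_bound):
    """Reference result: process the splits one after another in this process."""
    barcodes = defaultdict(list)
    for matrices, ntokens in splits:
        part = fake_get_only_barcodes(matrices, ntokens, dim, lower_bound)
        barcodes = fake_unite_barcodes(barcodes, part)
    return barcodes


def barcodes_with_starmap(splits, dim, lower_bound, nworkers=2,
                          pool_cls=multiprocessing.Pool):
    """Same computation, with the splits distributed over a pool via starmap."""
    args = [(matrices, ntokens, dim, lower_bound) for matrices, ntokens in splits]
    with pool_cls(nworkers) as pool:
        # starmap blocks until every split is done and keeps the input order.
        parts = pool.starmap(fake_get_only_barcodes, args)
    barcodes = defaultdict(list)
    for part in parts:
        barcodes = fake_unite_barcodes(barcodes, part)
    return barcodes


if __name__ == "__main__":
    # Two toy splits: (list of square matrices, list of token counts).
    demo_splits = [
        ([[[0.9, 0.1], [0.2, 0.8]]], [2]),
        ([[[0.5, 0.6], [0.7, 0.1]], [[0.0, 0.0], [0.0, 0.0]]], [2, 1]),
    ]
    expected = barcodes_sequential(demo_splits, dim=1, lower_bound=0.4)
    # Process pool and thread pool should both reproduce the sequential result.
    assert dict(barcodes_with_starmap(demo_splits, 1, 0.4)) == dict(expected)
    assert dict(barcodes_with_starmap(demo_splits, 1, 0.4, pool_cls=ThreadPool)) == dict(expected)
```

Because starmap returns results in the order of its inputs, merging the parts afterwards gives the same barcodes as the sequential loop. One caveat: the `# releasing resources of ripser` comment in the original loop hints that a fresh subprocess per split may be there to free memory, though that is only a guess. If so, `multiprocessing.Pool(nworkers, maxtasksperchild=1)` would give each task its own worker process while still running several splits at once.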
