nahmedraja / gasal2

License: Apache License 2.0

Makefile 2.22% Shell 0.66% Cuda 8.58% C 20.23% C++ 68.31%

sequence-alignment gpu-acceleration cuda-library bioinfomatics

gasal2's Introduction

GASAL2 - GPU-accelerated DNA alignment library

GASAL2 is an easy-to-use CUDA library for DNA/RNA sequence alignment algorithms. It currently supports several kinds of alignment:

local alignment
semi-global alignment
global alignment
tile-based banded alignment.

It can also reverse and/or complement any sequence independently before alignment, and report second-best scores for certain alignment types.

It is an extension of GASAL (https://github.com/nahmedraja/GASAL) and allows full overlapping of CPU and GPU execution.

List of new features:

Added traceback computation. The output is in CIGAR format
GASAL2 can now compute all types of semi-global alignments
Added expandable memory management on host side. The batches of query and target sequences are automatically enlarged if the required memory becomes larger than the allocated memory
Added kernel to reverse-complement sequences.
Cleaned up the code, fixed inconsistencies, and added a small optimization (around 9% speedup with exactly the same results)

Changes in user interface:

Changed the interface of gasal_init_streams() function
The user now has to provide MAX_QUERY_LEN instead of MAX_SEQ_LEN during compilation

Requirements

A Linux platform with CUDA toolkit 8 or higher is required, along with the usual build environment for C and C++ code. GASAL2 has been tested on NVIDIA GPUs with compute capabilities of 2.0, 3.5 and 5.0. Although lower versions of the CUDA toolkit might work, they have not been tested.

Compiling GASAL2

The library can be compiled with the following two commands:

$ ./configure.sh <path to cuda installation directory>
$ make GPU_SM_ARCH=<GPU SM architecture> MAX_QUERY_LEN=<maximum query length> N_CODE=<code for "N", e.g. 0x4E if the bases are represented by ASCII characters> [N_PENALTY=<penalty for aligning "N" against any other base>]

N_PENALTY is optional. If it is not specified, GASAL2 treats "N" as an ordinary base with the same match/mismatch scores as A, C, G or T. As a result of these commands, the include and lib directories are created, containing various .h files and libgasal.a, respectively. The user needs to include gasal_header.h in the code and link it with libgasal.a during compilation. The CUDA runtime library also has to be linked by adding the -lcudart flag. The path to the CUDA runtime library must also be specified while linking, using the -L flag.

Using GASAL2

Initialization

To use the GASAL2 alignment functions, the match/mismatch scores and gap open/extension penalties first need to be passed to the GPU. Assign the match/mismatch scores and gap open/extension penalties to the members of the gasal_subst_scores struct:

typedef struct{
	int32_t match;
	int32_t mismatch;
	int32_t gap_open;
	int32_t gap_extend;
}gasal_subst_scores;

The values are passed to the GPU by calling the gasal_copy_subst_scores() function:

void gasal_copy_subst_scores(gasal_subst_scores *subst);

A vector of gasal_gpu_storage_t is created with the following function:

gasal_gpu_storage_v gasal_init_gpu_storage_v(int n_streams);

With the help of n_streams, the user specifies the number of outstanding GPU alignment kernel launches to be performed. The return type is gasal_gpu_storage_v:

typedef struct{
	int n;
	gasal_gpu_storage_t *a;
}gasal_gpu_storage_v;

with n = n_streams and a being a pointer to the array. Each element of the array holds the data structures required by one stream. To destroy the vector, the following function is used:

void gasal_destroy_gpu_storage_v(gasal_gpu_storage_v *gpu_storage_vec);

The streams in the vector are initialized by calling:

void gasal_init_streams(gasal_gpu_storage_v *gpu_storage_vec,  int max_query_len, int max_target_len, int max_n_alns,  Parameters *params);

In GASAL2, the sequences to be aligned are contained in two batches. A sequence in query_batch is aligned to a sequence in target_batch. A batch is a concatenation of sequences. The length of a sequence must be a multiple of 8. Hence, if the length of a sequence is not a multiple of 8, N's are added at the end of the sequence. We call these redundant bases pad bases. Note that the pad bases are always N's, irrespective of whether N_PENALTY is defined or not. The gasal_init_streams() function allocates the memory required by a stream. With the help of max_batch_bytes, the user specifies the expected maximum size (in bytes) of the sequences in the two batches. host_max_batch_bytes are pre-allocated on the CPU. Similarly, gpu_max_batch_bytes are pre-allocated on the GPU. max_n_alns is the expected maximum number of sequences in a batch. If the actual required GPU memory is more than the pre-allocated memory, GASAL2 automatically allocates more memory.
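The padding rule above can be sketched in a few lines of standalone C++. This is only an illustration of the arithmetic: padded_length() and pad_sequence() are hypothetical helper names and are not part of the GASAL2 API, which performs the padding itself when a batch is filled.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: round a sequence length up to the next multiple of 8.
inline std::uint32_t padded_length(std::uint32_t len) {
    return (len + 7u) & ~7u;
}

// Hypothetical helper: return a copy of `seq` with 'N' pad bases appended
// so that its length becomes a multiple of 8.
inline std::string pad_sequence(const std::string& seq) {
    const std::uint32_t len = static_cast<std::uint32_t>(seq.size());
    std::string padded = seq;
    padded.append(padded_length(len) - len, 'N');
    assert(padded.size() % 8 == 0);  // invariant required for batch contents
    return padded;
}
```

With these helpers, a 5-base sequence occupies 8 bytes in a batch and a 9-base sequence occupies 16, so the batch sizes (in bytes) are sums of padded lengths rather than of the original lengths.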

Most GASAL2 functions operate on a Parameters object. This object holds all the information about the selected alignment options, in particular the alignment type, the default values for opening or extending gaps, etc. The Parameters object is filled like this:

Parameters *args;
args = new Parameters(0, NULL);

args->algo = <LOCAL|GLOBAL|SEMI_GLOBAL>; 
args->start_pos = <WITHOUT_START|WITH_START|WITH_TB>; //`WITHOUT_START` computes only the score and end-position. `WITH_START` computes the start-position with score and end-position. `WITH_TB` computes the score, start-position, end-position and traceback in CIGAR format.
args->isReverseComplement = <TRUE|FALSE>; //whether to reverse-complement the query sequence.
args->semiglobal_skipping_head = <QUERY|TARGET|BOTH|NONE>; //ignore gaps at the beginning of QUERY|TARGET|BOTH|NONE in semi-global alignment.
args->semiglobal_skipping_tail = <QUERY|TARGET|BOTH|NONE>; //ignore gaps at the end of QUERY|TARGET|BOTH|NONE in semi-global alignment.
args->secondBest = <TRUE|FALSE>; //whether to compute the second-best score in local and semi-global algo. The start-position (WITH_START) and traceback (WITH_TB) are only computed for the best score.

To free up the allocated memory the following function is used:

void gasal_destroy_streams(gasal_gpu_storage_v *gpu_storage_vec, Parameters *params);

The gasal_init_streams() and gasal_destroy_streams() functions internally use the cudaMalloc(), cudaMallocHost(), cudaFree() and cudaFreeHost() functions. These CUDA API functions are expensive in terms of time. Therefore, gasal_init_streams() and gasal_destroy_streams() should preferably be called only once in the program. You will find all these functions in the file ctors.cpp.

Input data preparation

The gasal_gpu_storage_t in gasal.h holds the data structures for a stream. In the following we only show those members of gasal_gpu_storage_t which should be accessed by the user. Other fields should not be modified manually and the user should rely on dedicated functions for complex operations.

typedef struct{
	...
	uint8_t *host_query_op;
	uint8_t *host_target_op;
	...
	uint32_t *host_query_batch_offsets;
	uint32_t *host_target_batch_offsets;
	uint32_t *host_query_batch_lens;
	uint32_t *host_target_batch_lens;
	uint32_t host_max_query_batch_bytes;
	uint32_t host_max_target_batch_bytes;
	gasal_res_t *host_res;
	gasal_res_t *host_res_second; 
	uint32_t host_max_n_alns;
	uint32_t current_n_alns;
	int is_free;
	...
} gasal_gpu_storage_t;

To align the sequences, the user first needs to check the availability of a stream. If is_free is 1, the user can use the current stream to perform the alignment on the GPU. To do this, the user must fill in the sequences with the following function.

uint32_t gasal_host_batch_fill(gasal_gpu_storage_t *gpu_storage, uint32_t idx, const char* data, uint32_t size, data_source SRC);

This function takes a sequence and its length, and appends it to the data structure. It also adds the necessary pad bases to ensure that the sequence length is a multiple of 8. Moreover, it takes care of allocating more memory if there is not enough room when adding the sequence. SRC is either QUERY or TARGET, depending upon which batch to fill. When executed, this function returns the offset to be filled by the user in host_target_batch_offsets or host_query_batch_offsets. The user also has to fill host_target_batch_lens or host_query_batch_lens with the original lengths of the sequences, i.e. the lengths without pad bases. The offset values include pad bases, whereas the lengths are without pad bases. The number of elements in the offset and length arrays must be equal. The offset values allow the user to express the mode of pairwise alignment, i.e. one-to-one, one-to-all, one-to-many etc., between the query and target sequences. current_n_alns must be incremented appropriately to show the current number of alignments. host_max_n_alns is initially set equal to max_n_alns in the gasal_init_streams() function. If current_n_alns exceeds host_max_n_alns, the user must call the following function to reallocate the host offset, length and result arrays.

void gasal_host_alns_resize(gasal_gpu_storage_t *gpu_storage, int new_max_alns, Parameters *params);

where new_max_alns is the new value of host_max_n_alns.
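As an illustration of how offsets and lengths can express the alignment mode, the sketch below models a one-to-all setup in standalone C++. It assumes, as the description of the offset arrays suggests, that one-to-all is expressed by storing the query once and repeating its offset for every alignment. BatchModel, AlignmentPlan and plan_one_to_all() are hypothetical names used only for this sketch. They are not GASAL2 types or functions, and the vectors stand in for the host_*_batch_offsets and host_*_batch_lens arrays.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one host batch: a concatenation of padded sequences.
struct BatchModel {
    std::string bases;  // concatenated sequences, including 'N' pad bases

    // Append `seq`, pad it to a multiple of 8, and return the offset where it starts.
    std::uint32_t append(const std::string& seq) {
        const std::uint32_t offset = static_cast<std::uint32_t>(bases.size());
        bases += seq;
        bases.append((8 - seq.size() % 8) % 8, 'N');
        assert(bases.size() % 8 == 0);
        return offset;
    }
};

// Hypothetical mirror of the four host arrays the user has to fill.
struct AlignmentPlan {
    std::vector<std::uint32_t> query_offsets, target_offsets;  // include pad bases
    std::vector<std::uint32_t> query_lens, target_lens;        // exclude pad bases
};

// One-to-all: a single query is stored once and paired with every target.
inline AlignmentPlan plan_one_to_all(BatchModel& queries, BatchModel& targets,
                                     const std::string& query,
                                     const std::vector<std::string>& target_seqs) {
    AlignmentPlan plan;
    const std::uint32_t q_off = queries.append(query);
    for (const std::string& t : target_seqs) {
        plan.query_offsets.push_back(q_off);  // same query offset every time
        plan.query_lens.push_back(static_cast<std::uint32_t>(query.size()));
        plan.target_offsets.push_back(targets.append(t));
        plan.target_lens.push_back(static_cast<std::uint32_t>(t.size()));
    }
    return plan;
}
```

The number of alignments is the common size of the four vectors. In a one-to-one setup, every query would instead be appended separately and would get its own offset.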

One can also use the gasal_host_batch_addbase to add a single base to the sequence. This takes care of memory reallocation if needed, but does not take care of padding, so this has to be used carefully.

The list of pre-processing operations (nothing, reverse, complement, reverse-complement) to be performed on the batch of sequences can be loaded into gpu_storage with the function gasal_op_fill. Its code is in interfaces.cpp. It fills host_query_op and host_target_op with an array of size host_max_n_alns, where each value is a value of the operation_on_seq enumeration (in gasal.h):

enum operation_on_seq{
	FORWARD_NATURAL,
	REVERSE_NATURAL,
	FORWARD_COMPLEMENT,
	REVERSE_COMPLEMENT,
};

By default, no operations are done on the sequences (that is, the host_query_op and host_target_op arrays are initialized to 0, which is the value of FORWARD_NATURAL).
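To make the meaning of the four operations concrete, here is a small host-side reference sketch. It is not the GPU kernel used by GASAL2. The enumeration is copied from above so that the sketch compiles without gasal.h, while complement_base() and apply_operation() are hypothetical helper names. Leaving bases other than A, C, G and T unchanged is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Copied from the enumeration above so the sketch compiles without gasal.h.
enum operation_on_seq {
    FORWARD_NATURAL,
    REVERSE_NATURAL,
    FORWARD_COMPLEMENT,
    REVERSE_COMPLEMENT,
};

// Complement of a single base. Anything other than A, C, G, T (e.g. N)
// is returned unchanged (an assumption of this sketch).
inline char complement_base(char b) {
    switch (b) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:  return b;
    }
}

// Host-side reference for what each operation does to an unpadded sequence.
inline std::string apply_operation(std::string seq, operation_on_seq op) {
    assert(op >= FORWARD_NATURAL && op <= REVERSE_COMPLEMENT);
    if (op == REVERSE_NATURAL || op == REVERSE_COMPLEMENT)
        std::reverse(seq.begin(), seq.end());
    if (op == FORWARD_COMPLEMENT || op == REVERSE_COMPLEMENT)
        std::transform(seq.begin(), seq.end(), seq.begin(), complement_base);
    return seq;
}
```

For the sequence AACG, the four operations give AACG, GCAA, TTGC and CGTT, in the order of the enumeration.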

Alignment launching

To launch the alignment, the following function is used:

void gasal_aln_async(gasal_gpu_storage_t *gpu_storage, const uint32_t actual_query_batch_bytes, const uint32_t actual_target_batch_bytes, const uint32_t actual_n_alns, Parameters *params)

The actual_query_batch_bytes and actual_target_batch_bytes specify the size of the two batches (in bytes) including the pad bases. actual_n_alns is the number of alignments to be performed. GASAL2 internally sets is_free to 0 after launching the alignment kernel on the GPU. From the performance perspective, if the average lengths of the sequences in query_batch and target_batch are not the same, then the shorter sequences should be placed in query_batch. For example, in the case of read mappers, the read sequences are contained in query_batch and the genome sequences in target_batch.

The gasal_aln_async() function returns immediately after launching the alignment kernel on the GPU. The user can perform other tasks instead of waiting for the kernel to finish. To test whether the alignment on the GPU is finished, the following function is called:

int gasal_is_aln_async_done(gasal_gpu_storage_t *gpu_storage);

If the function returns 0, the alignment on the GPU is finished and the output arrays contain valid results. Moreover, is_free is set to 1 by GASAL2. Thus, the current stream can be used for the alignment of another batch of sequences. The function returns -1 if the results are not ready. It returns -2 if the function is called on a stream in which no alignment has been launched, i.e. is_free == 1.
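The three return codes lend themselves to a simple polling loop that overlaps CPU work with the GPU kernel. The sketch below is standalone: the is_done callable is a stand-in for a call such as gasal_is_aln_async_done() on one stream, and AlnStatus, to_status() and poll_until_finished() are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <functional>

// Return codes described above: 0 = finished, -1 = not ready, -2 = nothing launched.
enum class AlnStatus { Done = 0, NotReady = -1, NotLaunched = -2 };

// Hypothetical helper: convert a raw return code into a typed status.
inline AlnStatus to_status(int rc) {
    assert(rc == 0 || rc == -1 || rc == -2);
    return static_cast<AlnStatus>(rc);
}

// Poll `is_done` until it stops reporting "not ready", running `other_work`
// between polls. Returns the final status (Done or NotLaunched).
inline AlnStatus poll_until_finished(const std::function<int()>& is_done,
                                     const std::function<void()>& other_work) {
    for (;;) {
        const AlnStatus status = to_status(is_done());
        if (status != AlnStatus::NotReady) return status;
        other_work();  // CPU-side work performed while the GPU kernel runs
    }
}
```

In a real program, other_work would typically fill the next batch for another stream, and the results would be read only after the status is Done.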

Alignment results

The structure gasal_res_t holds the results of the alignment and can be accessed manually. Its fields are the following:

struct gasal_res{
	int32_t *aln_score;
	int32_t *query_batch_end;
	int32_t *target_batch_end;
	int32_t *query_batch_start;
	int32_t *target_batch_start;
	uint8_t *cigar;
	uint32_t *n_cigar_ops;
};
typedef struct gasal_res gasal_res_t;

The outputs of the alignments are stored in the aln_score, query_batch_end, target_batch_end, query_batch_start, target_batch_start, cigar and n_cigar_ops arrays, within the host_res structure inside the gasal_gpu_storage_t structure. cigar is a byte array which contains the traceback information, in CIGAR format, of all the alignments performed. The lower 2 bits of a byte indicate the CIGAR operation:

0 = match
1 = mismatch
2 = deletion
3 = insertion

The upper 6 bits store the count of the operation in the lower two bits. The traceback information of an alignment in the cigar array is in the reverse direction. host_query_batch_offsets contains the offset of an alignment in the cigar array. n_cigar_ops contains the number of bytes in the cigar array encoding the traceback information of an alignment.
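The byte layout described above can be decoded with a short standalone routine. This is a sketch under stated assumptions: decode_cigar() is a hypothetical helper, not a GASAL2 function. The output letters ('=' for match, 'X' for mismatch, 'D' for deletion, 'I' for insertion) are chosen here for readability. Reading the bytes of one alignment from last to first is assumed to yield the forward direction.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: decode the packed traceback of one alignment into a
// forward-direction CIGAR string. `offset` is the start of the alignment in
// the cigar array and `n_ops` the number of bytes that encode it.
inline std::string decode_cigar(const std::uint8_t* cigar, std::uint32_t offset,
                                std::uint32_t n_ops) {
    assert(cigar != nullptr || n_ops == 0);
    static const char kOpChar[4] = {'=', 'X', 'D', 'I'};  // codes 0, 1, 2, 3
    std::string out;
    char run_op = 0;
    std::uint32_t run_len = 0;
    // The bytes are stored in reverse, so walk from the last byte to the first.
    for (std::uint32_t i = n_ops; i-- > 0;) {
        const std::uint8_t byte = cigar[offset + i];
        const char op = kOpChar[byte & 0x3u];     // lower 2 bits: operation
        const std::uint32_t count = byte >> 2;    // upper 6 bits: count (max 63)
        if (op != run_op && run_len > 0) {        // operation changed: flush the run
            out += std::to_string(run_len) + run_op;
            run_len = 0;
        }
        run_op = op;
        run_len += count;                         // merge adjacent bytes with the same op
    }
    if (run_len > 0) out += std::to_string(run_len) + run_op;
    return out;
}
```

Because a single byte can count at most 63 operations, longer runs span several bytes, which is why the sketch merges consecutive bytes carrying the same operation.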

In the case of the second-best result, the same applies to the fields in host_res_second. However, the start-position and traceback are only computed for the best score. Therefore, only host_res_second->aln_score, host_res_second->query_batch_end and host_res_second->target_batch_end are valid for the second-best result.

Example

The test_prog directory contains an example program which uses GASAL2 for sequence alignment on the GPU. See the README in that directory for instructions on running the program.

Citing GASAL2

GASAL2 is published in BMC Bioinformatics:

N. Ahmed, J. Lévy, S. Ren, H. Mushtaq, K. Bertels and Z. Al-ars, GASAL2: a GPU accelerated sequence alignment library for high-throughput NGS data, BMC Bioinformatics 20, 520 (2019) doi: 10.1186/s12859-019-3086-9.

Problems and suggestions

For any issues and suggestions contact Nauman Ahmed at [email protected].

gasal2's Issues

Reversal code is likely wrong

These lines count the number of N in the last word; however, if N can be used in sequences generally (which seems to be a design goal), then this is incorrect: the number of trailing Ns should be counted.

Error on run on given test data

Hello. I just installed GASAL2 on the CHPC system at University of Utah. I followed your instructions and then ran the test_prog.out on the given test data and had this error:

[GASAL CUDA ERROR] invalid argument(CUDA error no.=1). Line no. 171 in file src/res.cpp

I used this to run:

./test_prog.out -a 3 -b 3 -q 6 -r 1 -s -t -p -y "local" query_batch.fasta target_batch.fasta

I have the following modules installed:
gcc/10.2.0
cuda/11.4

and I installed with these variables:
./configure.sh /uufs/chpc.utah.edu/sys/installdir/cuda/11.4.0/
make GPU_SM_ARCH='sm_70' MAX_QUERY_LEN=300 N_CODE='0x4E' N_PENALTY=-1

Banded kernel disabled

The -y option to test_prog does not have the ability to enable the banded alignment kernel.

Is this intentional?

one-to-all alignment example

Hi, as the introduction says, GASAL2 seems to support the one-to-all alignment method. Is there an example of that? I've tried to work out the method from the test example, but it didn't help. Looking forward to your help, thank you very much!

CUDA Errors

After compiling the code with

./configure.sh $CUDA_HOME
make GPU_SM_ARCH=sm_60 MAX_QUERY_LEN=1024 N_CODE=0x4E
cd test_prog
make

I run:

./test_prog -p -y local query_batch.fasta target_batch.fasta

This raises several CUDA errors that I can suppress by commenting out the following lines:

res.cpp:157: if (res->n_cigar_ops != NULL) CHECKCUDAERROR(cudaFreeHost(res->n_cigar_ops));
res.cpp:173: if (device_cpy->cigar != NULL) CHECKCUDAERROR(cudaFree(device_cpy->cigar));
ctors.cpp:165: if (gpu_storage_vec->a[i].host_res->cigar != NULL) CHECKCUDAERROR(cudaFreeHost(gpu_storage_vec->a[i].host_res->cigar));

All of the error messages read the same:

[GASAL CUDA ERROR:] invalid argument(CUDA error no.=1). Line no. 157 in file ../src/res.cpp

It looks like there might be an issue with allocating/deallocating CIGAR storage.

CUDA memory error on some datasets when using traceback

Hello,

I have been running into an issue where GASAL2 fails with a CUDA memory error on some datasets (but not all) when I use traceback. In my analysis, it happens more often when the reference sequences are longer than 580 nt. I ran a test where I ran GASAL2 on batches of the dataset that were 10,000 sequences in size, and some fail and some succeed. If I concatenate the batches that succeed, then the larger file succeeds, which leads me to believe that there are specific sequence pairings that result in the segmentation fault. I filtered the dataset by length of reference, so that no references were longer than 580, but that did not eliminate the CUDA memory errors.

This is the command I am using to run:
./test_prog.out -a 3 -b 3 -q 6 -r 1 -s -t -y "local" ../test-data/read_file_15.fasta ../test-data/ref_file_15.fasta

This is the error that I see without running cuda-memcheck

[GASAL WARNING:] Trying to write 280 bytes while only 160 remain (query) (block size 1320000, filled 1319840 bytes).
Allocating a new block of size 2640000, total size available reaches 3960000. Doing this repeadtedly slows down the execution.
[GASAL WARNING:] actual_query_batch_bytes(1362176) > Allocated GPU memory (gpu_max_query_batch_bytes=1320000). Therefore, allocating 2640000 bytes on GPU (gpu_max_query_batch_bytes=2640000). Performance may be lost if this is repeated many times.
[GASAL WARNING:] actual_query_batch_bytes(1362176) > Allocated HOST memory for CIGAR (gpu_max_query_batch_bytes=2640000). Therefore, allocating 5280000 bytes on the host (gpu_max_query_batch_bytes=5280000). Performance may be lost if this is repeated many times.
[GASAL WARNING:] Trying to write 272 bytes while only 144 remain (query) (block size 1320000, filled 1319856 bytes).
Allocating a new block of size 2640000, total size available reaches 3960000. Doing this repeadtedly slows down the execution.
[GASAL WARNING:] actual_query_batch_bytes(1363936) > Allocated GPU memory (gpu_max_query_batch_bytes=1320000). Therefore, allocating 2640000 bytes on GPU (gpu_max_query_batch_bytes=2640000). Performance may be lost if this is repeated many times.
[GASAL CUDA ERROR:] an illegal memory access was encountered(CUDA error no.=700). Line no. 79 in file src/gasal_align.cu
srun: error: cgpu05: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=3134271.206

I am attaching a sample dataset that fails.
test-data.zip

Issue to compile test_prog

Hello,

I could successfully compile GASAL2 using configure.sh and then make, but I am having trouble compiling test_prog. Below is part of the output when executing the Makefile in the test_prog folder:

$ g++ -std=c++11 -g -c -O3 -Wall -Werror -fopenmp -I ../include -o test_prog.o test_prog.cpp
$ g++ -std=c++11 -O3 -o test_prog.out -L/usr/local/cuda-11.2/targets/x86_64-linux/lib -L../lib test_prog.o -fopenmp -lcudart -lgasal
/usr/bin/ld: ../lib/libgasal.a(host_batch.cppo): in function `gasal_host_batch_new(unsigned int, unsigned int)': /usr/local/cuda-11.2/targets/x86_64-linux/include/cuda_runtime.h:388: undefined reference to `cudaHostAlloc'
/usr/bin/ld: ../lib/libgasal.a(host_batch.cppo): in function `gasal_host_batch_new(unsigned int, unsigned int)': /home/ae/var/dl/gasal2/GASAL2-master/src/host_batch.cpp:15: undefined reference to `cudaGetErrorString'
/usr/bin/ld: ../lib/libgasal.a(host_batch.cppo): in function `gasal_host_batch_destroy(host_batch*)': /home/ae/var/dl/gasal2/GASAL2-master/src/host_batch.cpp:37: undefined reference to `cudaFreeHost'
....

It seems there is a problem when linking cudart to the object file (test_prog.o). I checked that cudart was available in the path indicated by the -L option (/usr/local/cuda-11.2/targets/x86_64-linux/lib).

What am I missing?

Thanks in advance.

N_CODE used where N_VALUE should have been

N_CODE can differ from N_VALUE because N_VALUE holds only the bottom nibble of N_CODE.

This line uses N_CODE, but it should use N_VALUE.

All-to-All Example

Is there an example available of GASAL2 performing all-to-all alignments?

CUDA errors with some datasets

Hi,

Thank you so much for your great work!

We are currently trying to evaluate the tool with test_prog.out.
However, we've seen CUDA errors for all of our datasets.
Do you know what's going on and how we can fix it?

The datasets we are using:
https://osf.io/e8ngp/?view_only=3cd84f2ece3247b69e88e3fe5edf711d

We are using the datasets under short-reads folder, there are three sets of data.
100-reads.fa and 100-targets.fa
150-reads.fa and 150-targets.fa
300-reads.fa and 300-targets.fa

We compiled the tool by

./configure.sh $CUDA_HOME
make GPU_SM_ARCH=sm_60 MAX_QUERY_LEN=1024 N_CODE=0x4E
cd test_prog
make

We ran the tool by

./test_prog.out -y local 100-reads.fa 100-targets.fa
./test_prog.out -y local 150-reads.fa 150-targets.fa
./test_prog.out -y local 300-reads.fa 300-targets.fa

The error information we got
For 150-reads.fa and 150-targets.fa

 test_env $ ./test_prog.out -y local 150-reads.fa 150-targets.fa
sa=1 , sb=4 , gapo=6 , gape=1
start_pos=0 , print_out=0 , n_threads=1
semiglobal_skipping_head=2 , semiglobal_skipping_tail=2 , algo=3
isPacked = false , secondBest = 0
query_batch_fasta_filename=150-reads.fa , target_batch_fasta_filename=150-targets.fa
Loading files....
[TEST_PROG DEBUG]: Size of read batches are: query=183062155, target=192513509. maximum_sequence_length=9006989
[TEST_PROG DEBUG]: query, mod@id=
Processing...
[GASAL CUDA ERROR:] invalid argument(CUDA error no.=1). Line no. 15 in file src/host_batch.cpp

For 300-reads.fa and 300-targets.fa

 test_env $ ./test_prog.out -y local 300-reads.fa 300-targets.fa
sa=1 , sb=4 , gapo=6 , gape=1
start_pos=0 , print_out=0 , n_threads=1
semiglobal_skipping_head=2 , semiglobal_skipping_tail=2 , algo=3
isPacked = false , secondBest = 0
query_batch_fasta_filename=300-reads.fa , target_batch_fasta_filename=300-targets.fa
Loading files....
[TEST_PROG DEBUG]: Size of read batches are: query=337217983, target=330668044. maximum_sequence_length=30260
[TEST_PROG DEBUG]: query, mod@id=
Processing...
[TEST_PROG DEBUG]: size of host_unpack_query is 168946860
[TEST_PROG DEBUG]: Number of gpu_batch in gpu_batch_arr : 2
[TEST_PROG DEBUG]: Number of gpu_storage_vecs in a gpu_batch : 1
[GASAL WARNING:] Trying to write 2352 bytes while only 880 remain (query) (block size 1560000, filled 1559120 bytes).
                 Allocating a new block of size 3120000, total size available reaches 4680000. Doing this repeadtedly slows down the execution.
[GASAL WARNING:] Trying to write 7128 bytes while only 360 remain (query) (block size 3120000, filled 3119640 bytes).
                 Allocating a new block of size 6240000, total size available reaches 10920000. Doing this repeadtedly slows down the execution.
[GASAL WARNING:] Trying to write 10536 bytes while only 5064 remain (query) (block size 6240000, filled 6234936 bytes).
                 Allocating a new block of size 12480000, total size available reaches 23400000. Doing this repeadtedly slows down the execution.
[TEST_PROG DEBUG]: Stream 0: j = 5000, seqs_done = 5000, query_batch_idx=17726200 , target_batch_idx=17386280
[GASAL WARNING:] actual_query_batch_bytes(17726200) > Allocated GPU memory (gpu_max_query_batch_bytes=1560000). Therefore, allocating 18720000 bytes on GPU (gpu_max_query_batch_bytes=18720000). Performance may be lost if this is repeated many times.
[GASAL WARNING:] Trying to write 1672 bytes while only 1344 remain (query) (block size 1560000, filled 1558656 bytes).
                 Allocating a new block of size 3120000, total size available reaches 4680000. Doing this repeadtedly slows down the execution.
[GASAL CUDA ERROR:] an illegal memory access was encountered(CUDA error no.=700). Line no. 15 in file src/host_batch.cpp

The 100 dataset seems to be working well.

The configuration of our machine:
lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                36
On-line CPU(s) list:   0-35
Thread(s) per core:    1
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
Stepping:              1
CPU MHz:               2599.980
CPU max MHz:           3300.0000
CPU min MHz:           1200.0000
BogoMIPS:              4200.19
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-17
NUMA node1 CPU(s):     18-35
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_pt spec_ctrl ibpb_support tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

nvidia-smi

Sun Dec  8 18:31:13 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 12198MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 12198MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

cat /etc/*release

CentOS Linux release 7.4.1708 (Core)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

CentOS Linux release 7.4.1708 (Core)
CentOS Linux release 7.4.1708 (Core)

We would really appreciate it if you could take a look at this issue. Thank you so much!!