
datacorrelationpredictionwithnlp's Issues

Program hangs on feature generation

Hi Immanuel,
I have cloned your code and run CorrelationPrediction, but it has been hanging on feature computation for two hours.
Could you give me some suggestions to help me debug? Thanks in advance.


System Configuration
CPU: 12 vCPU
GPU: 1 V100
OS: Ubuntu 22.04
Cuda: 11.8
Pytorch: 2.0.1+cu118
transformers: 4.28.1
simpletransformers: 0.63.11
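
A quick sanity check like the following confirms whether PyTorch actually sees the GPU in this environment; it is only a diagnostic sketch I am adding here, not part of the repository code:

# environment sanity check (diagnostic only, not part of CorrelationPrediction)
import torch
import transformers

print(f'torch: {torch.__version__}, CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'device: {torch.cuda.get_device_name(0)}')  # expect the V100 here
print(f'transformers: {transformers.__version__}')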

The code I run is the following:

from simpletransformers.classification import (
    ClassificationModel, ClassificationArgs
)
from sklearn.model_selection import train_test_split
import numpy as np
import sklearn.metrics as metrics
import pandas as pd
import random as rand
import logging

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# initialize for deterministic results
seed = 0
rand.seed(seed)
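# note: this seeds only Python's random module, which is used for the data-set
# split below; the manual_seed passed to ClassificationArgs later should take
# care of numpy/torch seeding inside simpletransformers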

# load data
path = '/root/correlations/corresults4.csv'
data = pd.read_csv(path, sep = ',')
data = data.sample(frac=1, random_state=seed)
data.columns = ['dataid', 'datapath', 'nrrows', 'nrvals1', 'nrvals2', 
                'type1', 'type2', 'column1', 'column2', 'method',
                'coefficient', 'pvalue', 'time']

# divide data into subsets (use copies so adding the label column below
# does not trigger pandas' SettingWithCopyWarning)
pearson = data[data['method']=='pearson'].copy()
spearman = data[data['method']=='spearman'].copy()
theilsu = data[data['method']=='theilsu'].copy()

# generate and print data statistics
nr_ps = len(pearson.index)
nr_sm = len(spearman.index)
nr_tu = len(theilsu.index)
print(f'#Samples for Pearson: {nr_ps}')
print(f'#Samples for Spearman: {nr_sm}')
print(f'#Samples for Theil\'s u: {nr_tu}')

# |coefficient>0.5| -> label 1
def coefficient_label(row):
  if abs(row['coefficient']) > 0.5:
    return 1
  else:
    return 0
pearson['label'] = pearson.apply(coefficient_label, axis=1)
spearman['label'] = spearman.apply(coefficient_label, axis=1)
theilsu['label'] = theilsu.apply(coefficient_label, axis=1)

rc_p = len(pearson[pearson['label']==1].index)/nr_ps
rc_s = len(spearman[spearman['label']==1].index)/nr_sm
rc_u = len(theilsu[theilsu['label']==1].index)/nr_tu
print(f'Ratio correlated - Pearson: {rc_p}')
print(f'Ratio correlated - Spearman: {rc_s}')
print(f'Ratio correlated - Theil\'s u: {rc_u}')

# split data into training and test set (random row-level split; not used below)
def def_split(data):
  x_train, x_test, y_train, y_test = train_test_split(
      data[['column1', 'column2']], data['label'],
      test_size=0.2, random_state=seed)
  train = pd.concat([x_train, y_train], axis=1)
  test = pd.concat([x_test, y_test], axis=1)
  return train, test

def ds_split(data):
  counts = data['dataid'].value_counts()
  print(f'Counts: {counts}')
  print(f'Count.index: {counts.index}')
  print(f'Count.index.values: {counts.index.values}')
  print(f'counts.shape: {counts.shape}')
  print(f'counts.iloc[0]: {counts.iloc[0]}')
  nr_vals = len(counts)
  nr_test_ds = int(nr_vals * 0.2)
  print(f'Nr. test data sets: {nr_test_ds}')
  ds_ids = counts.index.values.tolist()
  print(type(ds_ids))
  print(ds_ids)
  test_ds = rand.sample(ds_ids, nr_test_ds)
  print(f'TestDS: {test_ds}')
  def is_test(row):
    if row['dataid'] in test_ds:
      return True
    else:
      return False
  data['istest'] = data.apply(is_test, axis=1)
  train = data[data['istest'] == False]
  test = data[data['istest'] == True]
  print(f'train.shape: {train.shape}')
  print(f'test.shape: {test.shape}')
  print(train)
  print(test)
  return train[['column1', 'column2', 'label']], test[['column1', 'column2', 'label']]
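
# ds_split holds out entire data sets (grouped by dataid), so column pairs from
# the same data set never appear in both train and test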

train, test = ds_split(pearson)
train.columns = ['text_a', 'text_b', 'labels']
test.columns = ['text_a', 'text_b', 'labels']
print(train.shape)
print(test.shape)

# fine-tune RoBERTa on column-name pairs; weight=[1, 2] gives class 1 (correlated)
# twice the loss weight of class 0
model_args = ClassificationArgs(num_train_epochs=10, train_batch_size=40,
                                overwrite_output_dir=True, manual_seed=seed,
                                output_dir="root/correlations/models/")
model = ClassificationModel("roberta", "roberta-base", weight=[1, 2],
                            use_cuda=True, args=model_args)
model.train_model(train_df=train)
# ClassificationModel has no save_pretrained of its own; save the wrapped transformers model
model.model.save_pretrained("refine_model")
