
Comments (10)

loretoparisi commented on May 22, 2024

@w4nderlust ah yeah you were right, that was the question here #66
Ok I will try to change \t to , 🤕

If anyone else is having the same issue:

When on macOS:

sed -i "" $'s/,/ /g' /root/spam_dataset.csv 
sed -i "" $'s/\t/,/g' /root/spam_dataset.csv

(note the "" after the -i for in-place replacement, and the ANSI-C style quoting, since macOS sed does not recognize \t)

while on linux

sed -i "" $'s/,/ /g' /root/spam_dataset.csv 
sed -i "s/\t/,/g" /root/spam_dataset.csv

We should replace every , in the dataset with a space (as in the commands above) before converting tabs to commas; otherwise we could end up with commas inside some columns, breaking the resulting CSV file.

and if for some reason you have forgotten the header:

sed -i '' -e '1i\'$'\n''label,text' /root/spam_dataset.csv
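
As an alternative to the sed commands above, here is a minimal pandas sketch that does the same conversion; it assumes the original file is tab-separated with no header row, and the output path is made up for illustration:

import pandas as pd

# Read the tab-separated file; header=None because the original file has no header row.
df = pd.read_csv('/root/spam_dataset.csv', sep='\t', header=None, names=['label', 'text'])

# to_csv quotes any field that contains a comma, so no manual replacement is needed.
df.to_csv('/root/spam_dataset_clean.csv', index=False)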


w4nderlust commented on May 22, 2024

That's a great suggestion, a good workaround until we implement a better solution for reading TSVs and other file formats.


w4nderlust commented on May 22, 2024

Your CSV seems to be tab-separated, not comma-separated. At the moment we don't support TSV; we are working on it right now. For the time being, please use commas to separate your columns and escape/quote any commas that appear in your text, as described here.
Please confirm that this solves the problem.
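
For reference, a small sketch (with hypothetical rows and file name) of how commas inside the text column can survive in a comma-separated file: the csv writer quotes any field that contains the delimiter.

import csv

# Hypothetical rows where the text column itself contains commas.
rows = [('spam', 'Win cash now, reply YES'), ('ham', 'See you at 5, ok?')]

with open('spam_dataset.csv', 'w', newline='') as f:
    writer = csv.writer(f)              # default QUOTE_MINIMAL quotes fields containing the delimiter
    writer.writerow(['label', 'text'])
    writer.writerows(rows)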


aminaBm commented on May 22, 2024

I have the same problem; my data is comma-separated and it's still showing the same error :(


loretoparisi commented on May 22, 2024

@aminaBm are you sure that you do not have any additional , in the text column? This also happened to me before normalizing the text column and converting tabs to commas.
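
One quick way to spot such rows (a sketch, with an assumed file name) is to compare each row's field count against the header:

import csv

# Print any row whose field count differs from the header,
# which usually means an unescaped comma in the text column.
with open('spam_dataset.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            print(lineno, row)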


cuggla91 commented on May 22, 2024

I have the same issue as @aminaBm. I used df = pd.read_csv('dump_20190401.csv', escapechar='\\') to try to deal with it, but it's still an issue for me. I get this error with this code:

Code:
print('creating model')
model = LudwigModel(model_definition)
print('training model')
train_stats = model.train(data_df=df)
model.close()

Error:
creating model
training model

KeyError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3077 try:
-> 3078 return self._engine.get_loc(key)
3079 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'text'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in
2 model = LudwigModel(model_definition)
3 print('training model')
----> 4 train_stats = model.train(data_df=df)
5 model.close()

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/api.py in train(self, data_df, data_train_df, data_validation_df, data_test_df, data_csv, data_train_csv, data_validation_csv, data_test_csv, data_hdf5, data_train_hdf5, data_validation_hdf5, data_test_hdf5, data_dict, train_set_metadata_json, experiment_name, model_name, model_load_path, model_resume_path, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, gpus, gpu_fraction, use_horovod, random_seed, logging_level, debug, **kwargs)
448 use_horovod=use_horovod,
449 random_seed=random_seed,
--> 450 debug=debug,
451 )
452

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/train.py in full_train(model_definition, model_definition_file, data_df, data_train_df, data_validation_df, data_test_df, data_csv, data_train_csv, data_validation_csv, data_test_csv, data_hdf5, data_train_hdf5, data_validation_hdf5, data_test_hdf5, train_set_metadata_json, experiment_name, model_name, model_load_path, model_resume_path, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, should_close_session, gpus, gpu_fraction, use_horovod, random_seed, debug, **kwargs)
254 skip_save_processed_input=skip_save_processed_input,
255 preprocessing_params=model_definition['preprocessing'],
--> 256 random_seed=random_seed
257 )
258

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/data/preprocessing.py in preprocess_for_training(model_definition, data_df, data_train_df, data_validation_df, data_test_df, data_csv, data_train_csv, data_validation_csv, data_test_csv, data_hdf5, data_train_hdf5, data_validation_hdf5, data_test_hdf5, train_set_metadata_json, skip_save_processed_input, preprocessing_params, random_seed)
387 data_test_df,
388 preprocessing_params,
--> 389 random_seed
390 )
391 elif data_csv is not None or data_train_csv is not None:

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/data/preprocessing.py in _preprocess_df_for_training(features, data_df, data_train_df, data_validation_df, data_test_df, preprocessing_params, random_seed)
638 features,
639 preprocessing_params,
--> 640 random_seed=random_seed
641 )
642 training_set, test_set, validation_set = split_dataset_tvt(

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/data/preprocessing.py in build_dataset_df(dataset_df, features, global_preprocessing_parameters, train_set_metadata, random_seed, **kwargs)
84 dataset_df,
85 features,
---> 86 global_preprocessing_parameters
87 )
88

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/data/preprocessing.py in build_metadata(dataset_df, features, global_preprocessing_parameters)
124 ]
125 train_set_metadata[feature['name']] = get_feature_meta(
--> 126 dataset_df[feature['name']].astype(str),
127 preprocessing_parameters
128 )

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in getitem(self, key)
2686 return self._getitem_multilevel(key)
2687 else:
-> 2688 return self._getitem_column(key)
2689
2690 def _getitem_column(self, key):

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2693 # get column
2694 if self.columns.is_unique:
-> 2695 return self._get_item_cache(key)
2696
2697 # duplicate columns & possible reduce dimensionality

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
2487 res = cache.get(item)
2488 if res is None:
-> 2489 values = self._data.get(item)
2490 res = self._box_item_values(item, values)
2491 cache[item] = res

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3078 return self._engine.get_loc(key)
3079 except KeyError:
-> 3080 return self._engine.get_loc(self._maybe_cast_indexer(key))
3081
3082 indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'text'
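
The KeyError above means pandas did not produce a column named text. A minimal sanity check (a sketch, reusing the file name from the snippet above) is to inspect what was actually parsed before handing the DataFrame to Ludwig:

import pandas as pd

df = pd.read_csv('dump_20190401.csv', escapechar='\\')

# The KeyError above means 'text' is not among the parsed columns.
print(df.columns.tolist())
print(df.head())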


w4nderlust commented on May 22, 2024

I'm sorry @aminaBm and @cuggla91. Those errors are pandas errors that point to a probably malformed CSV. Unfortunately, if you can't share your data there isn't much I can do about it. Try cleaning up your CSV and/or changing the separator until you have a readable CSV, and then let me know which parameters of the pd.read_csv() function worked, as I can then try to improve Ludwig's CSV loading accordingly.
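
A hedged sketch of that kind of experiment, using the file name from the earlier snippet and a few hypothetical read_csv parameter combinations:

import pandas as pd

# Try a few common read_csv parameter combinations until the expected column appears.
candidates = [
    dict(sep=','),
    dict(sep='\t'),
    dict(sep=',', quotechar='"', escapechar='\\'),
]
for params in candidates:
    try:
        df = pd.read_csv('dump_20190401.csv', **params)
    except Exception as err:
        print(params, 'failed:', err)
        continue
    if 'text' in df.columns:
        print('readable with parameters:', params)
        break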


Kranthiteja7 commented on May 22, 2024

I also have the same problem, but I load the text data manually into a DataFrame. How can I read a CSV file that is tab-separated rather than comma-separated?


w4nderlust commented on May 22, 2024

pandas.read_csv(path, delimiter='\t')
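
Put together with the training call from earlier in the thread (a sketch, assuming the same Ludwig version as the tracebacks above; path and model_definition are placeholders):

import pandas as pd
from ludwig.api import LudwigModel

# 'path' points to the tab-separated file; 'model_definition' is the
# model definition dict used earlier in the thread.
df = pd.read_csv(path, delimiter='\t')
model = LudwigModel(model_definition)
train_stats = model.train(data_df=df)
model.close()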


rishijain07 commented on May 22, 2024

I'm also getting the same error:

INFO:ludwig.models.llm:Done.
WARNING:ludwig.utils.tokenizers:No padding token id found. Using eos_token as pad_token.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
[<ipython-input-21-3b63728b4f1d>](https://localhost:8080/#) in <cell line: 58>()
     56 
     57 model = LudwigModel(config=qlora_fine_tuning_config, logging_level=logging.INFO)
---> 58 results = model.train(dataset=df[:10])

9 frames
[/usr/local/lib/python3.10/dist-packages/ludwig/api.py](https://localhost:8080/#) in train(self, dataset, training_set, validation_set, test_set, training_set_metadata, data_format, experiment_name, model_name, model_resume_path, skip_save_training_description, skip_save_training_statistics, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, random_seed, **kwargs)
    631                 update_config_with_metadata(self.config_obj, training_set_metadata)
    632                 logger.info("Warnings and other logs:")
--> 633                 self.model = LudwigModel.create_model(self.config_obj, random_seed=random_seed)
    634                 # update config with properties determined during model instantiation
    635                 update_config_with_model(self.config_obj, self.model)

[/usr/local/lib/python3.10/dist-packages/ludwig/api.py](https://localhost:8080/#) in create_model(config_obj, random_seed)
   2060             config_obj = ModelConfig.from_dict(config_obj)
   2061         model_type = get_from_registry(config_obj.model_type, model_type_registry)
-> 2062         return model_type(config_obj, random_seed=random_seed)
   2063 
   2064     @staticmethod

[/usr/local/lib/python3.10/dist-packages/ludwig/models/llm.py](https://localhost:8080/#) in __init__(self, config_obj, random_seed, _device, **_kwargs)
    138 
    139         self.output_features.update(
--> 140             self.build_outputs(
    141                 output_feature_configs=self.config_obj.output_features,
    142                 # Set the input size to the model vocab size instead of the tokenizer vocab size

[/usr/local/lib/python3.10/dist-packages/ludwig/models/llm.py](https://localhost:8080/#) in build_outputs(cls, output_feature_configs, input_size)
    235 
    236         output_features = {}
--> 237         output_feature = cls.build_single_output(output_feature_config, output_features)
    238         output_features[output_feature_config.name] = output_feature
    239 

[/usr/local/lib/python3.10/dist-packages/ludwig/models/base.py](https://localhost:8080/#) in build_single_output(feature_config, output_features)
    123         logger.debug(f"Output {feature_config.type} feature {feature_config.name}")
    124         output_feature_class = get_from_registry(feature_config.type, get_output_type_registry())
--> 125         output_feature_obj = output_feature_class(feature_config, output_features=output_features)
    126         return output_feature_obj
    127 

[/usr/local/lib/python3.10/dist-packages/ludwig/features/text_feature.py](https://localhost:8080/#) in __init__(self, output_feature_config, output_features, **kwargs)
    308         **kwargs,
    309     ):
--> 310         super().__init__(output_feature_config, output_features, **kwargs)
    311 
    312     @classmethod

[/usr/local/lib/python3.10/dist-packages/ludwig/features/sequence_feature.py](https://localhost:8080/#) in __init__(self, output_feature_config, output_features, **kwargs)
    344     ):
    345         super().__init__(output_feature_config, output_features, **kwargs)
--> 346         self.decoder_obj = self.initialize_decoder(output_feature_config.decoder)
    347         self._setup_loss()
    348         self._setup_metrics()

[/usr/local/lib/python3.10/dist-packages/ludwig/features/base_feature.py](https://localhost:8080/#) in initialize_decoder(self, decoder_config)
    281         # Input to the decoder is the output feature's FC hidden layer.
    282         decoder_config.input_size = self.fc_stack.output_shape[-1]
--> 283         decoder_cls = get_decoder_cls(self.type(), decoder_config.type)
    284         decoder_schema = decoder_cls.get_schema_cls().Schema()
    285         decoder_params_dict = decoder_schema.dump(decoder_config)

[/usr/local/lib/python3.10/dist-packages/ludwig/decoders/registry.py](https://localhost:8080/#) in get_decoder_cls(feature, name)
     30 @DeveloperAPI
     31 def get_decoder_cls(feature: str, name: str) -> Type[Decoder]:
---> 32     return get_decoder_registry()[feature][name]
     33 
     34 

[/usr/local/lib/python3.10/dist-packages/ludwig/utils/registry.py](https://localhost:8080/#) in __getitem__(self, key)
     44         if self.parent and key not in self.data:
     45             return self.parent.__getitem__(key)
---> 46         return self.data.__getitem__(key)
     47 
     48     def __contains__(self, key: str):

