Comments (10)
@w4nderlust ah yeah you were right, that was the question here #66
Ok, I will try to change \t to , 🤕
If anyone else is having the same issue:
When on macOS:
sed -i "" $'s/,/ /g' /root/spam_dataset.csv
sed -i "" $'s/\t/,/g' /root/spam_dataset.csv
(note the "" after the -i for in-place replacement, and the ANSI-C style quoting, since macOS sed does not recognize \t)
While on Linux (GNU sed takes -i without the empty "" argument):
sed -i 's/,/ /g' /root/spam_dataset.csv
sed -i "s/\t/,/g" /root/spam_dataset.csv
We should replace every , in the dataset with a space (as an example) before converting tabs to commas; otherwise we could end up with commas inside some columns, breaking the resulting CSV file.
And if for some reason you forgot the header (macOS syntax):
sed -i '' -e '1i\'$'\n''label,text' /root/spam_dataset.csv
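For anyone who prefers not to alter the text itself, here is a minimal Python sketch of the same tab-to-comma conversion using only the standard csv module. The function name tsv_to_csv and its header parameter are made up for illustration, and it assumes the source file is UTF-8 and uses tabs only as column separators. Instead of replacing commas with spaces, it lets csv.writer quote any field that contains a comma.

```python
import csv


def tsv_to_csv(src_path, dst_path, header=None):
    """Convert a tab-separated file into a comma-separated one.

    Hypothetical helper: fields containing commas or double quotes are
    quoted by csv.writer, so the text does not have to be modified.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE: treat double quotes in the raw text as ordinary characters.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        # Default writer dialect: comma delimiter, quoting only where needed.
        writer = csv.writer(dst)
        if header is not None:
            # Optionally prepend a header row, e.g. ["label", "text"].
            writer.writerow(header)
        for row in reader:
            writer.writerow(row)
```

Something like tsv_to_csv("/root/spam_dataset.tsv", "/root/spam_dataset.csv", header=["label", "text"]) would then cover both the separator change and the missing header in one pass (the .tsv path is only an example).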
from ludwig.
That's a great suggestion, a good workaround until we implement a better solution for reading TSVs and other file formats.
from ludwig.
Your CSV seems to be tab separated and not comma separated. At the moment we don't support TSV, but we are working on it right now. For the time being, please use commas to separate your columns and escape them if they appear in your text, as described here.
Please confirm that this solves the problem.
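To illustrate what an escaped comma looks like in practice, here is a small sketch with made-up sample rows. It assumes standard CSV quoting (fields wrapped in double quotes, embedded quotes doubled), which is what Python's csv module writes and what pandas.read_csv parses by default; whether this matches the linked description exactly is an assumption.

```python
import csv
import io

# Made-up sample rows: the text column contains a comma and double quotes.
rows = [
    ["label", "text"],
    ["ham", "Sure, see you at 5"],
    ["spam", 'Claim your "prize" now'],
]

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
# label,text
# ham,"Sure, see you at 5"
# spam,"Claim your ""prize"" now"

# Reading the text back yields the original fields, commas included.
parsed = list(csv.reader(io.StringIO(csv_text)))
```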
from ludwig.
I have the same problem: my data is separated using commas and it is still showing the same error :(
from ludwig.
@aminaBm are you sure that you do not have any additional , in the text column? This also happened to me before normalizing the text column and converting tabs to commas.
from ludwig.
I have the same issue as @aminaBm. I used df = pd.read_csv('dump_20190401.csv', escapechar='\\') to try to deal with it, but somehow it is still an issue for me. I get this error for this code:
Code:
print('creating model')
model = LudwigModel(model_definition)
print('training model')
train_stats = model.train(data_df=df)
model.close()
Error:
creating model
training model
KeyError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3077 try:
-> 3078 return self._engine.get_loc(key)
3079 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'text'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
in
2 model = LudwigModel(model_definition)
3 print('training model')
----> 4 train_stats = model.train(data_df=df)
5 model.close()
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/api.py in train(self, data_df, data_train_df, data_validation_df, data_test_df, data_csv, data_train_csv, data_validation_csv, data_test_csv, data_hdf5, data_train_hdf5, data_validation_hdf5, data_test_hdf5, data_dict, train_set_metadata_json, experiment_name, model_name, model_load_path, model_resume_path, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, gpus, gpu_fraction, use_horovod, random_seed, logging_level, debug, **kwargs)
448 use_horovod=use_horovod,
449 random_seed=random_seed,
--> 450 debug=debug,
451 )
452
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/train.py in full_train(model_definition, model_definition_file, data_df, data_train_df, data_validation_df, data_test_df, data_csv, data_train_csv, data_validation_csv, data_test_csv, data_hdf5, data_train_hdf5, data_validation_hdf5, data_test_hdf5, train_set_metadata_json, experiment_name, model_name, model_load_path, model_resume_path, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, should_close_session, gpus, gpu_fraction, use_horovod, random_seed, debug, **kwargs)
254 skip_save_processed_input=skip_save_processed_input,
255 preprocessing_params=model_definition['preprocessing'],
--> 256 random_seed=random_seed
257 )
258
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/data/preprocessing.py in preprocess_for_training(model_definition, data_df, data_train_df, data_validation_df, data_test_df, data_csv, data_train_csv, data_validation_csv, data_test_csv, data_hdf5, data_train_hdf5, data_validation_hdf5, data_test_hdf5, train_set_metadata_json, skip_save_processed_input, preprocessing_params, random_seed)
387 data_test_df,
388 preprocessing_params,
--> 389 random_seed
390 )
391 elif data_csv is not None or data_train_csv is not None:
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/data/preprocessing.py in _preprocess_df_for_training(features, data_df, data_train_df, data_validation_df, data_test_df, preprocessing_params, random_seed)
638 features,
639 preprocessing_params,
--> 640 random_seed=random_seed
641 )
642 training_set, test_set, validation_set = split_dataset_tvt(
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/data/preprocessing.py in build_dataset_df(dataset_df, features, global_preprocessing_parameters, train_set_metadata, random_seed, **kwargs)
84 dataset_df,
85 features,
---> 86 global_preprocessing_parameters
87 )
88
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ludwig/data/preprocessing.py in build_metadata(dataset_df, features, global_preprocessing_parameters)
124 ]
125 train_set_metadata[feature['name']] = get_feature_meta(
--> 126 dataset_df[feature['name']].astype(str),
127 preprocessing_parameters
128 )
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2686 return self._getitem_multilevel(key)
2687 else:
-> 2688 return self._getitem_column(key)
2689
2690 def _getitem_column(self, key):
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2693 # get column
2694 if self.columns.is_unique:
-> 2695 return self._get_item_cache(key)
2696
2697 # duplicate columns & possible reduce dimensionality
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
2487 res = cache.get(item)
2488 if res is None:
-> 2489 values = self._data.get(item)
2490 res = self._box_item_values(item, values)
2491 cache[item] = res
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3078 return self._engine.get_loc(key)
3079 except KeyError:
-> 3080 return self._engine.get_loc(self._maybe_cast_indexer(key))
3081
3082 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'text'
from ludwig.
I'm sorry @aminaBm and @cuggla91. Those errors are pandas errors that probably reflect a malformed CSV. Unfortunately, if you can't share your data there isn't much I can do about it. Try cleaning up your CSV and/or changing the separator until you have a readable CSV, and then let me know which parameters of the pd.read_csv() function worked, so that I can try to improve Ludwig's CSV loading accordingly.
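As a starting point for that cleanup: the KeyError: 'text' in the traceback above is raised at dataset_df[feature['name']], which means the loaded data has no column named text. The sketch below is a rough standard-library check, not anything Ludwig provides. The helper name check_header and the candidate delimiter list are assumptions for illustration. It guesses the separator from the first line of the file and reports which expected column names are missing from the header.

```python
import csv


def check_header(path, expected_columns, candidates=(",", "\t", ";", "|")):
    """Guess the delimiter from the first line and list missing columns.

    Hypothetical helper: the guess is simply the candidate character
    that occurs most often in the first line of the file.
    """
    with open(path, newline="", encoding="utf-8") as f:
        first_line = f.readline().rstrip("\r\n")
    counts = {d: first_line.count(d) for d in candidates}
    delimiter = max(counts, key=counts.get)
    if counts[delimiter] == 0:
        # No candidate separator found: treat the line as a single column.
        delimiter = None
        header = [first_line]
    else:
        header = next(csv.reader([first_line], delimiter=delimiter))
    missing = [c for c in expected_columns if c not in header]
    return delimiter, header, missing
```

If check_header("dump_20190401.csv", ["label", "text"]) returns a tab as the delimiter, or lists text as missing, then the file either needs a different separator passed to pd.read_csv() or lacks a proper header row (label and text are the column names used earlier in this thread; substitute the names from your own model definition).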
from ludwig.
I also have the same problem, but I load the text data manually using a DataFrame. How can I change the separator of the CSV file from comma to tab?
from ludwig.
pandas.read_csv(path, delimiter='\t')
from ludwig.
I'm also getting the same error:
INFO:ludwig.models.llm:Done.
WARNING:ludwig.utils.tokenizers:No padding token id found. Using eos_token as pad_token.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-21-3b63728b4f1d> in <cell line: 58>()
56
57 model = LudwigModel(config=qlora_fine_tuning_config, logging_level=logging.INFO)
---> 58 results = model.train(dataset=df[:10])
9 frames
/usr/local/lib/python3.10/dist-packages/ludwig/api.py in train(self, dataset, training_set, validation_set, test_set, training_set_metadata, data_format, experiment_name, model_name, model_resume_path, skip_save_training_description, skip_save_training_statistics, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, random_seed, **kwargs)
631 update_config_with_metadata(self.config_obj, training_set_metadata)
632 logger.info("Warnings and other logs:")
--> 633 self.model = LudwigModel.create_model(self.config_obj, random_seed=random_seed)
634 # update config with properties determined during model instantiation
635 update_config_with_model(self.config_obj, self.model)
/usr/local/lib/python3.10/dist-packages/ludwig/api.py in create_model(config_obj, random_seed)
2060 config_obj = ModelConfig.from_dict(config_obj)
2061 model_type = get_from_registry(config_obj.model_type, model_type_registry)
-> 2062 return model_type(config_obj, random_seed=random_seed)
2063
2064 @staticmethod
/usr/local/lib/python3.10/dist-packages/ludwig/models/llm.py in __init__(self, config_obj, random_seed, _device, **_kwargs)
138
139 self.output_features.update(
--> 140 self.build_outputs(
141 output_feature_configs=self.config_obj.output_features,
142 # Set the input size to the model vocab size instead of the tokenizer vocab size
/usr/local/lib/python3.10/dist-packages/ludwig/models/llm.py in build_outputs(cls, output_feature_configs, input_size)
235
236 output_features = {}
--> 237 output_feature = cls.build_single_output(output_feature_config, output_features)
238 output_features[output_feature_config.name] = output_feature
239
/usr/local/lib/python3.10/dist-packages/ludwig/models/base.py in build_single_output(feature_config, output_features)
123 logger.debug(f"Output {feature_config.type} feature {feature_config.name}")
124 output_feature_class = get_from_registry(feature_config.type, get_output_type_registry())
--> 125 output_feature_obj = output_feature_class(feature_config, output_features=output_features)
126 return output_feature_obj
127
/usr/local/lib/python3.10/dist-packages/ludwig/features/text_feature.py in __init__(self, output_feature_config, output_features, **kwargs)
308 **kwargs,
309 ):
--> 310 super().__init__(output_feature_config, output_features, **kwargs)
311
312 @classmethod
/usr/local/lib/python3.10/dist-packages/ludwig/features/sequence_feature.py in __init__(self, output_feature_config, output_features, **kwargs)
344 ):
345 super().__init__(output_feature_config, output_features, **kwargs)
--> 346 self.decoder_obj = self.initialize_decoder(output_feature_config.decoder)
347 self._setup_loss()
348 self._setup_metrics()
/usr/local/lib/python3.10/dist-packages/ludwig/features/base_feature.py in initialize_decoder(self, decoder_config)
281 # Input to the decoder is the output feature's FC hidden layer.
282 decoder_config.input_size = self.fc_stack.output_shape[-1]
--> 283 decoder_cls = get_decoder_cls(self.type(), decoder_config.type)
284 decoder_schema = decoder_cls.get_schema_cls().Schema()
285 decoder_params_dict = decoder_schema.dump(decoder_config)
/usr/local/lib/python3.10/dist-packages/ludwig/decoders/registry.py in get_decoder_cls(feature, name)
30 @DeveloperAPI
31 def get_decoder_cls(feature: str, name: str) -> Type[Decoder]:
---> 32 return get_decoder_registry()[feature][name]
33
34
/usr/local/lib/python3.10/dist-packages/ludwig/utils/registry.py in __getitem__(self, key)
44 if self.parent and key not in self.data:
45 return self.parent.__getitem__(key)
---> 46 return self.data.__getitem__(key)
47
48 def __contains__(self, key: str):
from ludwig.