Comments (10)
I would like to know this too. Right now I am at 4% with 0.25 cents in costs, so the total could end up around 6€. I will report back once I have finished the German translation. Right now it is really slow, almost as if it were stuck at 4%.
EDIT: Looks like I hit the rate limit. After some experiments I am now down to 25 parallel calls. This is very slow, but it seems to work.
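For anyone else hitting the rate limit: besides lowering the number of parallel calls, retrying with exponential backoff is a common workaround. Below is a minimal sketch, not what the project's script does. `RateLimitError` and `call_with_backoff` are hypothetical names; swap in whatever exception your API client actually raises on a rate-limit response.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical placeholder for the client's rate-limit exception."""


def call_with_backoff(func, *args, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on RateLimitError wait and retry with growing delays."""
    for attempt in range(max_retries):
        try:
            return func(*args)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            # Exponential backoff plus jitter: roughly 1s, 2s, 4s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Each worker would then call something like `call_with_backoff(translate_item, item)` instead of calling `translate_item(item)` directly.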
from cabrita.
I have no idea how they did it for US$8. My cost was close to US$25. I did not translate to Portuguese, though.
I highly recommend translating the cleaned dataset: https://github.com/gururise/AlpacaDataCleaned
I will try to translate it into German in a few weeks when the cleaning has progressed further.
Ah, it seems I've miscalculated from the JSON structure (rows <-> instructions); thank you for the correction. I'll just run the whole translation, but I think the larger dataset will take a lot more time to fine-tune.
Dropping tqdm in favour of just counting via a callback how many futures have completed seems to double the overall speed of the threading job. There seem to be underlying issues with this library, which is used in many machine learning projects.
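A rough sketch of what counting via a callback could look like, assuming the standard `concurrent.futures` API. `run_with_counter` and `report_every` are hypothetical names, and this does not by itself show that the approach is faster than tqdm; that part is only what was observed above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_counter(func, items, max_workers=8, report_every=1000):
    """Run func over items in a thread pool, counting finished futures in a callback."""
    done = 0
    lock = threading.Lock()

    def on_done(_future):
        # Callbacks may run in worker threads, so guard the counter with a lock.
        nonlocal done
        with lock:
            done += 1
            if done % report_every == 0:
                print(f"{done}/{len(items)} completed")

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(func, item) for item in items]
        for future in futures:
            future.add_done_callback(on_done)
    # Leaving the with-block waits for all futures; results keep input order.
    return [future.result() for future in futures], done
```

With the script's names this would be called as `run_with_counter(translate_item, data, max_workers=MAX_PARALLEL_REQUESTS)`.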
I am trying this with Hindi. The generation results don't seem so good.
If you look closely at translate_data.py:
```python
with open('alpaca_data.json', 'r') as f:
    data = json.load(f)

start = 40000
end = 55000
translated_data = []
data = data[start:end]  # only this slice of the dataset is translated

with ThreadPoolExecutor(max_workers=MAX_PARALLEL_REQUESTS) as executor:
    futures = {executor.submit(translate_item, item): item for item in data}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Translating"):
        translated_data.append(future.result())
```
Only a chunk of the original instruction set is translated. You need to repeat this process by changing the start and end variables.
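Instead of editing start and end by hand for every run, the ranges could be generated in a loop. This is only a sketch; `chunk_ranges` is a hypothetical helper and not part of translate_data.py.

```python
def chunk_ranges(total, chunk_size):
    """Yield (start, end) pairs that cover range(total) without gaps or overlap."""
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)


# Example: a dataset of 10 items processed in chunks of 4.
for start, end in chunk_ranges(10, 4):
    print(start, end)
```

In the script, each pair would select `data[start:end]` from the full list, and the translated chunk would be saved before moving on to the next one, so an interrupted run does not lose earlier chunks.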
Translating the whole alpaca-lora/alpaca_data_cleaned_archive.json costs somewhere around 30-50€ (corrected from my earlier estimate of 30,000-50,000€), according to token counts calculated from 1000 randomly sampled prompts.
I'm curious: was the chunk 40000-55000 selected for translation in the project chosen for its quality, or is it just random?
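The sampling estimate can be sketched roughly as follows. `estimate_cost`, `count_tokens`, and `price_per_1k_tokens` are hypothetical names: plug in a real tokenizer and the current price of the model you use. Note that this counts input tokens only, while the API also bills the tokens of the translated output, so the real cost will be higher.

```python
import random


def estimate_cost(prompts, count_tokens, price_per_1k_tokens, sample_size=1000, seed=0):
    """Extrapolate total cost from the mean token count of a random sample."""
    if not prompts:
        return 0.0
    rng = random.Random(seed)
    sample = rng.sample(prompts, min(sample_size, len(prompts)))
    mean_tokens = sum(count_tokens(p) for p in sample) / len(sample)
    total_tokens = mean_tokens * len(prompts)
    return total_tokens / 1000 * price_per_1k_tokens
```

A whitespace split such as `lambda p: len(p.split())` can serve as a crude stand-in for `count_tokens` when no tokenizer is at hand, but it will usually undercount.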
> Translating the whole alpaca-lora/alpaca_data_cleaned_archive.json is somewhere around 30,000-50,000€ according to calculated tokens from 1000 random sampled prompts.
> I'm curious is the selected chunk 40000-55000 for translation in the project chosen for it's quality or is it just random?
I translated the complete alpaca_data.json to Arabic and it cost me $60 using GPT-3.5-turbo ($16~$18 of which were given for free by OpenAI, iirc).
> Dropping tqdm in favour of just counting via callback how many futures have been completed/not completed seems to double the overall speed of the threading job. There seems to be underlying issues with this library that is used in many machinelearning projects.
Nice. Didn't think of trying that before