I would like to deploy the ctranslate2 version of the nllb-200-distilled-600M model side by side with the current raw version, so that we can compare performance.
The ctranslate2 version appears to be much faster since it is optimized for CPU usage, and deploying it would also allow the Language team to use this model server, as they already use this model.
Event Timeline
Change 980015 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] nllb: add cpu optimized version
Using ctranslate2 with 8-bit quantization I was able to produce a model.bin of ~600MB (down from the original ~2.5GB).
In the examples I've tried so far I can't tell any difference in the quality of the results, and the smaller model would allow us to scale horizontally, since pod requirements would drop to ~1GB (unless there are any surprises).
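The size reduction mentioned above follows directly from the weight precision: float32 stores 4 bytes per parameter, int8 stores 1. A quick back-of-the-envelope check in Python (the 600M parameter count is taken from the model name):

```python
# Rough size estimate for nllb-200-distilled-600M at different precisions.
PARAMS = 600e6  # ~600 million parameters

fp32_gb = PARAMS * 4 / 1e9  # float32: 4 bytes per weight
int8_gb = PARAMS * 1 / 1e9  # int8: 1 byte per weight

print(f"float32: ~{fp32_gb:.1f}GB, int8: ~{int8_gb:.1f}GB")
```

This matches the observed ~2.5GB → ~600MB reduction (the on-disk files are slightly larger because of embeddings kept at higher precision and other metadata).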
This is the command to produce the binary:
ct2-transformers-converter --model facebook/nllb-200-distilled-600M --quantization int8 --output_dir /path/to/output/dir
I plan to deploy the attached patch and test the resources required.
I have added ctranslate2 support to the llm model server, using the code sample found on the OpenNMT forum (the same one used in MinT), as it is much faster than the one provided in the official ctranslate2 documentation. Running locally, the official sample gave me an inference time of 8s, while the MinT one took 3s.
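The faster generation path can be sketched roughly as below. This is a minimal illustration, assuming a loaded `ctranslate2.Translator` and a Hugging Face NLLB tokenizer are created elsewhere and injected; the function name `ct2_translate` is a placeholder, not the actual code in the patch:

```python
def ct2_translate(translator, tokenizer, text: str, tgt_lang: str) -> str:
    """Translate one string with a CTranslate2 NLLB model.

    `translator` is expected to behave like ctranslate2.Translator and
    `tokenizer` like a transformers NLLB tokenizer; both are loaded
    elsewhere and passed in.
    """
    # CTranslate2 works on subword tokens; the target language code
    # (e.g. "deu_Latn" for German) is given as a decoding prefix.
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = translator.translate_batch([source], target_prefix=[[tgt_lang]])
    target = results[0].hypotheses[0]
    if target and target[0] == tgt_lang:
        target = target[1:]  # drop the language token from the output
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))
```

Compared with running generation through the full Hugging Face pipeline, this calls the optimized CTranslate2 decoder directly on tokens, which is where most of the speedup comes from.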
We can now have support for the following model versions of NLLB:
- nllb-200-distilled-600M - the default model as found on Hugging Face
- nllb-200-distilled-600M - ctranslate2 version
- nllb-200-1.3B - ctranslate2 version, int8 precision
- nllb-200-3.3B - ctranslate2 version, int8 precision
The table below summarizes the available models. The options involve tradeoffs between model size, latency, and output quality.
Examples of such tradeoffs:
- if the results are good enough, we may choose to deploy the int8 version of the 600M-parameter model, since we can scale it horizontally (multiple pods)
- depending on latency and translation quality, we could deploy a quantized (int8) version of a bigger model (e.g. 1.3B params)
| model | ctranslate2 | Weight precision (data type) | Model size on disk |
| --- | --- | --- | --- |
| nllb-200-distilled-600M | No | float32 | 2.48GB |
| nllb-200-distilled-600M | Yes | float32 | 2.48GB |
| nllb-200-distilled-600M | Yes | int8 | 650MB |
| nllb-200-1.3B | Yes | int8 | 1.39GB (from initial 5.5GB) |
| nllb-200-3.3B | Yes | int8 | 3.37GB (from initial 17.6GB) |
Change 980015 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] nllb: add cpu optimized version
Change 982770 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: deploy nllb cpu version on Lift Wing
Change 982770 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: deploy nllb cpu version on Lift Wing
Change 982795 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: update llm and readability images
Change 982795 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update llm and readability images
Change 983079 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] llm: update image with sentencepiece
Change 983079 merged by jenkins-bot:
[operations/deployment-charts@master] llm: update image with sentencepiece
The model is deployed on eqiad, and a request can be made like this:
time curl -s https://inference.svc.eqiad.wmnet:30443/v1/models/nllb-200:predict -X POST -d '{"prompt": "Jazz is a music genre that originated in the African-American communities of New Orleans, Louisiana, United States, in the late 19th and early 20th centuries, with its roots in blues and ragtime. Since the 1920s Jazz Age, it has been recognized as a major form of musical expression in traditional and popular music, linked by the common bonds of African-American and European-American musical parentage. Jazz is characterized by swing and blue notes, complex chords, call and response vocals, polyrhythms and improvisation. Jazz has roots in West African cultural and musical expression, and in African-American music traditions.", "tgt_lang": "deu_Latn"}' -i -H "Host: nllb-200-cpu.llm.wikimedia.org" --http1.1 --header "Content-type: application/json"
This request takes ~30s, which is far more than the ~3s observed in offline tests. This deployment uses 1 CPU and 2GB of memory.
The model itself benefits from multiple CPUs, so we can increase the pod's CPU allocation and report the resulting latencies.
Change 983204 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: increase cpu for nllb
Change 983204 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: increase cpu for nllb
Change 983353 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] llm: set number of threads for ctranslate
Change 983353 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] llm: set number of threads for ctranslate
Change 983436 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: manually set number of threads for nllb-cpu
Change 983436 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: manually set number of threads for nllb-cpu
The resources used by the pod have been increased to 4 CPUs; a request of the above size now takes ~5s, while a request the size of a sentence takes ~1s.
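The "manually set number of threads" patches above relate to how CTranslate2 parallelizes work: `intra_threads` controls the OpenMP threads used within a single translation, while `inter_threads` controls how many translations run in parallel. A minimal sketch of wiring this to the pod's CPU allocation (the `CT2_INTRA_THREADS` variable name here is a hypothetical example, not necessarily what the deployment chart uses):

```python
import os

def ct2_thread_config(default_intra: int = 4) -> dict:
    """Build thread settings for a ctranslate2.Translator.

    intra_threads should roughly match the pod's CPU limit so one
    request can use all available cores; inter_threads stays at 1
    since the model server handles request concurrency itself.
    """
    intra = int(os.environ.get("CT2_INTRA_THREADS", default_intra))
    return {"inter_threads": 1, "intra_threads": intra}

# The returned dict would be splatted into the Translator constructor,
# e.g. ctranslate2.Translator(model_dir, device="cpu", **ct2_thread_config())
```

Matching `intra_threads` to the pod's 4 CPUs is consistent with the latency drop from ~30s to ~5s observed above.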
> this will allow the language team to use this model server
Please coordinate with us on this. MinT has an evolving, sophisticated layer of per-language pre-processing/post-processing steps. It also does structured document translation (not just plain text; mainly HTML and other formats), which requires fast, repeated inference calls with fragments of text.
It also has a language-to-model mapping mechanism and an upcoming translation memory system.
To get all of this working in a highly performant setting, there are tradeoffs in separating model inference and the above-mentioned domain logic into separate systems over the network. It is not clear whether the architecture of Lift Wing envisions just model inference, or the extra layers of domain logic in the same system or in separate systems (see my question on this in the internal Slack channel - https://wikimedia.slack.com/archives/C05F8ERE2CV/p1688981007204259 - 5 months ago).
For some of the latest models we integrated (indictrans2-en-indic, indictrans2-indic-indic), the wrapper code is supposed to live close to the model inference to make the inference complete, and sometimes this code is shareable across models. So there is another set of tradeoffs: basic per-model inference vs. smarter, complete inference with shared inference logic and domain logic. MinT itself does not own all of this domain logic - MinT is the backend for the Node.js service cxserver; the CX frontend calls cxserver, and cxserver calls MinT.
MinT is now powered by 14 different models. NLLB is one of them, but we use it only as a fallback model; other, newer models give better performance and quality for many languages.
Hey @santhosh, thank you for providing the background information on MinT. We can definitely coordinate if you have any requests for a specific model that you would like us to support.
Regarding this specific task: the ML team is working on supporting LLMs on Lift Wing, with a focus on GPU work, so we are also exploring inference optimization frameworks and engines.
At this time we are highly interested in exploring and validating the level of support these frameworks have for AMD GPUs (ROCm drivers), as we are in the process of expanding Lift Wing. Because GPU capacity is going to be limited, we also want to be able to support some models on CPUs with acceptable latencies and reasonable resource requirements. In this context we chose nllb to test with ctranslate2, since it is already being used within WMF, rather than picking an arbitrary model for this exploratory work.
Any experience or hints that your team can offer are highly valuable.