Deploy ctranslate2 version of nllb-200
Closed, Resolved | Public | 3 Estimated Story Points

Description

I would like to deploy the ctranslate2 version of the nllb-200-distilled-600M model side by side with the current raw version so that we can compare performance.
The ctranslate2 version seems to be much faster since it is optimized for CPU usage, and it would also allow the Language team to use this model server, as they are already using this model.

Event Timeline

isarantopoulos set the point value for this task to 3.
isarantopoulos triaged this task as Medium priority.
isarantopoulos moved this task from Unsorted to In Progress on the Machine-Learning-Team board.

Change 980015 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] nllb: add cpu optimized version

https://gerrit.wikimedia.org/r/980015

Using ctranslate2 with 8-bit quantization I was able to create a model.bin of ~600MB (down from ~2.5GB for the original). I can't tell any difference in output quality with the examples I've tried so far, and the smaller footprint would allow us to scale horizontally since each pod would only need about 1GB of memory (unless there are any surprises).
This is the command used to produce the binary:

ct2-transformers-converter --model facebook/nllb-200-distilled-600M --quantization int8 --output_dir /path/to/output/dir

I plan to deploy the attached patch and test the resources required.

I have added ctranslate2 support to the llm model server using the code sample found in the OpenNMT forum, which is the same approach used in MinT, as it is much faster than the one provided in the official ctranslate2 documentation. When running locally, the official one gave me an inference time of 8s, while the MinT one took 3s.
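For reference, the general ctranslate2 inference flow for NLLB looks roughly like the minimal sketch below (following the pattern from the official documentation: tokenize the source, call translate_batch with the target language as a decoding prefix, then detokenize). The faster MinT/forum variant differs mainly in the tokenization step; the model path and example text here are placeholders, not the actual inference-services code.

import ctranslate2
import transformers

# Placeholder path to the int8-converted model directory produced by
# ct2-transformers-converter (see the command above).
translator = ctranslate2.Translator("/path/to/nllb-200-distilled-600M-int8", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)

text = "Jazz is a music genre that originated in New Orleans."
# Tokenize the source into subword tokens (includes the language/EOS tokens).
source_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
# The target language code is passed as a decoding prefix.
results = translator.translate_batch([source_tokens], target_prefix=[["deu_Latn"]])
# Drop the leading target-language token before detokenizing.
target_tokens = results[0].hypotheses[0][1:]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens)))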

We can now support the following NLLB model versions.

The table below summarizes the available options. The tradeoffs between them involve model size, latency, and output quality. Examples of such tradeoffs:

  • If results are good enough, we may deploy the int8 version of the 600M-parameter model, since we can scale it horizontally (multiple pods).
  • Alternatively, deploy a quantized (int8) version of a bigger model (e.g. 1.3B params), depending on latency and translation quality.
| Model                   | ctranslate2 | Weight precision (data type) | Model size on disk           |
| ----------------------- | ----------- | ---------------------------- | ---------------------------- |
| nllb-200-distilled-600M | No          | float32                      | 2.48GB                       |
| nllb-200-distilled-600M | Yes         | float32                      | 2.48GB                       |
| nllb-200-distilled-600M | Yes         | int8                         | 650MB                        |
| nllb-200-1.3B           | Yes         | int8                         | 1.39GB (from initial 5.5GB)  |
| nllb-200-3.3B           | Yes         | int8                         | 3.37GB (from initial 17.6GB) |

Change 980015 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] nllb: add cpu optimized version

https://gerrit.wikimedia.org/r/980015

Change 982770 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy nllb cpu version on Lift Wing

https://gerrit.wikimedia.org/r/982770

Change 982770 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy nllb cpu version on Lift Wing

https://gerrit.wikimedia.org/r/982770

Change 982795 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update llm and readability images

https://gerrit.wikimedia.org/r/982795

Change 982795 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update llm and readability images

https://gerrit.wikimedia.org/r/982795

Change 983079 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] llm: update image with sentencepiece

https://gerrit.wikimedia.org/r/983079

Change 983079 merged by jenkins-bot:

[operations/deployment-charts@master] llm: update image with sentencepiece

https://gerrit.wikimedia.org/r/983079

The model is deployed in eqiad and a request can be made like this:

time curl -s https://inference.svc.eqiad.wmnet:30443/v1/models/nllb-200:predict -X POST -d '{"prompt": "Jazz is a music genre that originated in the African-American communities of New Orleans, Louisiana, United States, in the late 19th and early 20th centuries, with its roots in blues and ragtime. Since the 1920s Jazz Age, it has been recognized as a major form of musical expression in traditional and popular music, linked by the common bonds of African-American and European-American musical parentage. Jazz is characterized by swing and blue notes, complex chords, call and response vocals, polyrhythms and improvisation. Jazz has roots in West African cultural and musical expression, and in African-American music traditions.", "tgt_lang": "deu_Latn"}' -i -H "Host: nllb-200-cpu.llm.wikimedia.org"  --http1.1 --header "Content-type: application/json"

This request takes 30s, which is far too long compared to the offline tests of approximately 3s. The current deployment uses 1 CPU and 2GB of memory.
The model itself benefits from multiple CPUs, so we can increase the pod's CPU allocation and report the resulting latencies.
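ctranslate2 exposes the CPU parallelism knobs directly on the Translator constructor (inter_threads for concurrent translations, intra_threads for the OpenMP threads used per translation; OMP_NUM_THREADS is also honored). A minimal sketch of wiring this to the pod's CPU allocation, assuming an illustrative environment variable name rather than whatever the deployment chart actually sets:

import os
import ctranslate2

# CT2_INTRA_THREADS is a hypothetical variable name for this sketch;
# the idea is to keep it aligned with the pod's CPU request/limit.
intra_threads = int(os.environ.get("CT2_INTRA_THREADS", "4"))

translator = ctranslate2.Translator(
    "/path/to/nllb-200-distilled-600M-int8",  # placeholder model path
    device="cpu",
    inter_threads=1,              # batches translated in parallel
    intra_threads=intra_threads,  # OpenMP threads per translation
)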

Change 983204 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: increase cpu for nllb

https://gerrit.wikimedia.org/r/983204

Change 983204 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase cpu for nllb

https://gerrit.wikimedia.org/r/983204

Change 983353 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: set number of threads for ctranslate

https://gerrit.wikimedia.org/r/983353

Change 983353 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: set number of threads for ctranslate

https://gerrit.wikimedia.org/r/983353

Change 983436 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: manually set number of threads for nllb-cpu

https://gerrit.wikimedia.org/r/983436

Change 983436 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: manually set number of threads for nllb-cpu

https://gerrit.wikimedia.org/r/983436

The pod's resources have been increased to 4 CPUs; a request of the above size now takes ~5s, while a single-sentence request takes ~1s.

> this will allow the language team to use this model server

Please coordinate with us on this. MinT has an evolving, sophisticated layer of per-language pre-processing/post-processing steps. It also supports structured document translation (not just plain text, mainly HTML and other formats), which requires fast, repeated inference calls on fragments of text.

It also has a language-to-model mapping mechanism and an upcoming translation memory system.

To get all of this working in a highly performant setting, there are tradeoffs in separating model inference and the above-mentioned domain logic into separate systems over the network. It is not clear whether the Lift Wing architecture envisions just model inference, or also the extra layers of domain logic, in the same system or in separate systems (see my question on this in the internal Slack channel from 5 months ago: https://wikimedia.slack.com/archives/C05F8ERE2CV/p1688981007204259).

For some of the latest models we integrated (indictrans2-en-indic, indictrans2-indic-indic), the wrapper code is supposed to live close to the model inference to make the inference complete, and sometimes this code is shareable across models. So there is another set of tradeoffs: per-model basic inference versus smarter, complete inference with shared inference logic and domain logic. MinT itself does not own all of this domain logic - MinT is the backend for the Node.js service cxserver: the CX frontend calls cxserver, and cxserver calls MinT.

MinT is now powered by 14 different models. NLLB is one of them, but we are using it only as a fallback model. Other, newer models give better performance and quality for many languages.

Hey @santhosh, thank you for providing the background information on MinT. We can definitely coordinate if you have any requests for a specific model that you would like us to support.

Regarding the specific task: the ML team is working on supporting LLMs on Lift Wing with a focus on GPU work, so we are also exploring inference optimization frameworks and engines.
At this time we are highly interested in exploring and validating the level of support these frameworks have for AMD GPUs (ROCm drivers), as we are in the process of expanding Lift Wing. Because GPU capacity will be limited, we also want to be able to serve some models on CPUs with acceptable latencies and reasonable resource requirements. In this context we chose NLLB for the ctranslate2 experiments, since it is already being used within WMF, rather than picking an arbitrary model for this exploratory work.
Any experience or hints your team can offer are highly valuable.