I would like to deploy the ctranslate2 version of the nllb-200-distilled-600M model side by side with the current raw version, so that we can compare performance.
The ctranslate2 version appears to be much faster since it is optimized for CPU usage, and deploying it would also allow the Language team to use this model server, as they already use this model.
Event Timeline
Change 980015 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] nllb: add cpu optimized version
Using ctranslate2 with 8-bit quantization I was able to produce a model.bin of ~600MB (down from the original ~2.5GB).
In the examples I've tried so far I can't tell any difference in the quality of the results, and the smaller model would allow us to scale horizontally, since pod requirements would drop to ~1GB (unless there are any surprises).
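The size reduction mentioned above follows directly from the weight precision: float32 stores 4 bytes per parameter, int8 stores 1. A quick back-of-the-envelope check in Python (the 600M parameter count is taken from the model name):

```python
# Rough size estimate for nllb-200-distilled-600M at different precisions.
PARAMS = 600e6  # ~600 million parameters

fp32_gb = PARAMS * 4 / 1e9  # float32: 4 bytes per weight
int8_gb = PARAMS * 1 / 1e9  # int8: 1 byte per weight

print(f"float32: ~{fp32_gb:.1f}GB, int8: ~{int8_gb:.1f}GB")
```

This matches the observed ~2.5GB → ~600MB reduction (the on-disk files are slightly larger because of embeddings kept at higher precision and other metadata).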
This is the command to produce the binary:
ct2-transformers-converter --model facebook/nllb-200-distilled-600M --quantization int8 --output_dir /path/to/output/dir
I plan to deploy the attached patch and test the resources required.
I have added ctranslate2 support to the llm model server, using the code sample found on the OpenNMT forum (the same one used in MinT), as it is much faster than the one provided in the official ctranslate2 documentation. Running locally, the official sample gave me an inference time of 8s, while the MinT one took 3s.
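The faster generation path can be sketched roughly as below. This is a minimal illustration, assuming a loaded `ctranslate2.Translator` and a Hugging Face NLLB tokenizer are created elsewhere and injected; the function name `ct2_translate` is a placeholder, not the actual code in the patch:

```python
def ct2_translate(translator, tokenizer, text: str, tgt_lang: str) -> str:
    """Translate one string with a CTranslate2 NLLB model.

    `translator` is expected to behave like ctranslate2.Translator and
    `tokenizer` like a transformers NLLB tokenizer; both are loaded
    elsewhere and passed in.
    """
    # CTranslate2 works on subword tokens; the target language code
    # (e.g. "deu_Latn" for German) is given as a decoding prefix.
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = translator.translate_batch([source], target_prefix=[[tgt_lang]])
    target = results[0].hypotheses[0]
    if target and target[0] == tgt_lang:
        target = target[1:]  # drop the language token from the output
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))
```

Compared with running generation through the full Hugging Face pipeline, this calls the optimized CTranslate2 decoder directly on tokens, which is where most of the speedup comes from.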
We can now have support for the following model versions of NLLB:
- nllb-200-distilled-600M - the default model as found on Hugging Face
- nllb-200-distilled-600M - ctranslate2 version
- nllb-200-1.3B - ctranslate2 version, int8 precision
- nllb-200-3.3B - ctranslate2 version, int8 precision
The table below summarizes the available models. The options involve tradeoffs between model size, latency, and output quality.
Examples of such tradeoffs:
- if the results are good enough, we may choose to deploy the int8 version of the 600M-parameter model, since we can scale it horizontally (multiple pods)
- depending on latency and translation quality, we could deploy a quantized (int8) version of a bigger model (e.g. 1.3B params)
| model | ctranslate2 | Weight precision (data type) | Model size on disk |
| --- | --- | --- | --- |
| nllb-200-distilled-600M | No | float32 | 2.48GB |
| nllb-200-distilled-600M | Yes | float32 | 2.48GB |
| nllb-200-distilled-600M | Yes | int8 | 650MB |
| nllb-200-1.3B | Yes | int8 | 1.39GB (from initial 5.5GB) |
| nllb-200-3.3B | Yes | int8 | 3.37GB (from initial 17.6GB) |
Change 980015 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] nllb: add cpu optimized version
Change 982770 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: deploy nllb cpu version on Lift Wing
Change 982770 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: deploy nllb cpu version on Lift Wing
Change 982795 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: update llm and readability images
Change 982795 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update llm and readability images
Change 983079 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] llm: update image with sentencepiece
Change 983079 merged by jenkins-bot:
[operations/deployment-charts@master] llm: update image with sentencepiece
The model is deployed on eqiad, and a request can be made like this:
time curl -s https://inference.svc.eqiad.wmnet:30443/v1/models/nllb-200:predict -X POST -d '{"prompt": "Jazz is a music genre that originated in the African-American communities of New Orleans, Louisiana, United States, in the late 19th and early 20th centuries, with its roots in blues and ragtime. Since the 1920s Jazz Age, it has been recognized as a major form of musical expression in traditional and popular music, linked by the common bonds of African-American and European-American musical parentage. Jazz is characterized by swing and blue notes, complex chords, call and response vocals, polyrhythms and improvisation. Jazz has roots in West African cultural and musical expression, and in African-American music traditions.", "tgt_lang": "deu_Latn"}' -i -H "Host: nllb-200-cpu.llm.wikimedia.org" --http1.1 --header "Content-type: application/json"
This request takes ~30s, which is far more than the ~3s observed in offline tests. This deployment uses 1 CPU and 2GB of memory.
The model itself benefits from multiple CPUs, so we can increase the pod's CPU allocation and report the resulting latencies.
Change 983204 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: increase cpu for nllb
Change 983204 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: increase cpu for nllb
Change 983353 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] llm: set number of threads for ctranslate
Change 983353 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] llm: set number of threads for ctranslate
Change 983436 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: manually set number of threads for nllb-cpu
Change 983436 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: manually set number of threads for nllb-cpu
The resources used by the pod have been increased to 4 CPUs; a request of the above size now takes ~5s, while a request the size of a sentence takes ~1s.
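The "manually set number of threads" patches above relate to how CTranslate2 parallelizes work: `intra_threads` controls the OpenMP threads used within a single translation, while `inter_threads` controls how many translations run in parallel. A minimal sketch of wiring this to the pod's CPU allocation (the `CT2_INTRA_THREADS` variable name here is a hypothetical example, not necessarily what the deployment chart uses):

```python
import os

def ct2_thread_config(default_intra: int = 4) -> dict:
    """Build thread settings for a ctranslate2.Translator.

    intra_threads should roughly match the pod's CPU limit so one
    request can use all available cores; inter_threads stays at 1
    since the model server handles request concurrency itself.
    """
    intra = int(os.environ.get("CT2_INTRA_THREADS", default_intra))
    return {"inter_threads": 1, "intra_threads": intra}

# The returned dict would be splatted into the Translator constructor,
# e.g. ctranslate2.Translator(model_dir, device="cpu", **ct2_thread_config())
```

Matching `intra_threads` to the pod's 4 CPUs is consistent with the latency drop from ~30s to ~5s observed above.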
> this will allow the language team to use this model server
Please coordinate with us on this. MinT has an evolving, sophisticated layer of per-language pre-processing/post-processing steps. It also does structured document translation (not just plain text; mainly HTML and other formats), which requires fast, repeated inference calls with fragments of text.
It also has a language-to-model mapping mechanism and an upcoming translation memory system.
To get all of this working in a highly performant setting, there are tradeoffs in separating model inference and the above-mentioned domain logic into separate systems over the network. It is not clear whether the architecture of Lift Wing envisions just model inference, or the extra layers of domain logic in the same system or in separate systems (see my question on this in the internal Slack channel - https://wikimedia.slack.com/archives/C05F8ERE2CV/p1688981007204259 - 5 months ago).
For some of the latest models we integrated (indictrans2-en-indic, indictrans2-indic-indic), the wrapper code is supposed to live close to the model inference to make the inference complete, and sometimes this code is shareable across models. So there is another set of tradeoffs: basic per-model inference vs. smarter, complete inference with shared inference logic and domain logic. MinT itself does not own all of this domain logic - MinT is the backend for the Node.js service cxserver; the CX frontend calls cxserver, and cxserver calls MinT.
MinT is now powered by 14 different models. NLLB is one of them, but we use it only as a fallback model; other, newer models give better performance and quality for many languages.
Hey @santhosh, thank you for providing the background information on MinT. We can definitely coordinate if you have any requests for a specific model that you would like us to support.
Regarding this specific task: the ML team is working on supporting LLMs on Lift Wing, with a focus on GPU work, so we are also exploring inference optimization frameworks and engines.
At this time we are highly interested in exploring and validating the level of support these frameworks have for AMD GPUs (ROCm drivers), as we are in the process of expanding Lift Wing. Because GPU capacity is going to be limited, we also want to be able to support some models on CPUs with acceptable latencies and reasonable resource requirements. In this context we chose nllb to test with ctranslate2, since it is already being used within WMF, rather than picking an arbitrary model for this exploratory work.
Any experience or hints that your team can offer are highly valuable.