Create a language detection service in LiftWing
Closed, ResolvedPublic

Description

To support T99666 (Provide a service to detect which language the user is writing in), a language identification system based on a fastText model, as outlined in https://arxiv.org/pdf/2305.13820.pdf, is to be created.

Additional information

  • Given a text, the system will predict the language and also return a score indicating the confidence level.
  • Ownership of the service: @santhosh , Language-Team
  • fastText is very fast for language detection; inference is almost instant. No GPU is required.
  • The model supports detecting 201 languages. However, supporting a larger set of languages, for example all 320 languages in which a Wikipedia exists, is a future goal. This requires a good-quality dataset, and @santhosh has been looking into it. A program to prepare such a dataset is at https://github.com/santhoshtr/wikisentences, and we are currently checking whether the authors of the above-mentioned paper are interested in this exploration. The hardware setup and time required to train a new model are not that elaborate: about 2 hours on a sufficiently powerful machine is enough, and no GPU is required (a rough training sketch follows below).
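
For illustration, a minimal sketch of what training such a model with the fastText Python library could look like. The input file name, hyperparameters, and output path are assumptions for the example, not the actual training setup:

# Hedged sketch: train a fastText language-identification model from a
# file of "__label__<lang> <sentence>" lines (e.g. as prepared by a
# wikisentences-style dataset step). Values below are illustrative only.
import fasttext

model = fasttext.train_supervised(
    input="wikisentences.train.txt",  # assumed training file name
    minn=2, maxn=5,                   # character n-grams help language ID
    dim=256,
    epoch=2,
    lr=0.8,
)
model.save_model("lid-custom.bin")

# Quick sanity check: predict a label and its confidence for one sentence.
labels, scores = model.predict("Esta es una frase de ejemplo.")
print(labels[0], scores[0])  # e.g. __label__spa_Latn 0.98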

Event Timeline

Change 932828 had a related patch set uploaded (by Santhosh; author: Santhosh):

[machinelearning/liftwing/inference-services@main] Add language identification service

https://gerrit.wikimedia.org/r/932828

@santhosh Thanks for creating the task and taking the time to read the docs and create a patch! Could you also clarify the questions mentioned here?
Before deploying any models/services we need to know a few things about each model: ownership, expected usage and load, need for external data or streams, etc. Of course some of these do not apply to pre-trained models, so feel free to answer with one or two words, e.g. "Is there an expected frequency at which the model will have to be retrained with new data?" "No retraining needed, as it is a pre-trained model."

@santhosh is this model service going to be used by an existing service? Is it going to replace an old service?
I have uploaded the model to Swift (s3://wmf-ml-models/llm/langid/lid201-model.bin) and also registered the required CI/CD pipelines in this patch (already merged).

@santhosh is this model service going to be used by an existing service?

To start with, the frontend of the machine translation service (demo at https://translate.wmcloud.org) will use it. Note that the usage happens in the browser when the user enters some content, so we will need a publicly exposed API.

Is it going to replace an old service?

There is no language identification service currently to replace.

Change 932828 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] Add language identification service

https://gerrit.wikimedia.org/r/932828

Change 934559 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: Deploy langid service

https://gerrit.wikimedia.org/r/934559

Change 934559 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Deploy langid service

https://gerrit.wikimedia.org/r/934559

The langid model server has been deployed in the experimental namespace in ml-staging-codfw.
Sample call:

curl https://inference-staging.svc.codfw.wmnet:30443/v1/models/langid-model:predict -X POST -i -H "Host: langid.experimental.wikimedia.org" -d '{"text": "Lorem Ipsum is simply dummy text of the printing and typesetting industry."}'

response:

{"language":"eng_Latn","score":0.5333166718482971}

Change 934572 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] langid: get model name from env var

https://gerrit.wikimedia.org/r/934572

Thanks @isarantopoulos .

The language codes returned by the service are not in the format we usually use, i.e. two-letter ISO 639-1 codes. I will submit a patch to return language codes in that format; a rough sketch of the idea follows below.
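
The gist of it, as a hedged sketch (the lookup table below is illustrative, not the actual mapping shipped in the patch):

# Hypothetical mapping from fastText "<iso639-3>_<script>" labels to
# wiki language codes; only a few example entries are shown.
FASTTEXT_TO_WIKI = {
    "eng_Latn": "en",
    "deu_Latn": "de",
    "mal_Mlym": "ml",
}

def to_wiki_code(fasttext_label: str) -> str:
    """Return the wiki language code, falling back to the raw ISO 639-3
    part of the label when no mapping is known."""
    return FASTTEXT_TO_WIKI.get(fasttext_label, fasttext_label.split("_")[0])

print(to_wiki_code("eng_Latn"))  # -> "en"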

Change 934572 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] langid: get model name from env var

https://gerrit.wikimedia.org/r/934572

Change 937575 had a related patch set uploaded (by Santhosh; author: Santhosh):

[machinelearning/liftwing/inference-services@main] langid: Provide wiki language code also in outputs

https://gerrit.wikimedia.org/r/937575

santhosh triaged this task as Medium priority. Jul 13 2023, 5:53 AM

Change 937575 merged by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] langid: Provide wiki language code and name also in outputs

https://gerrit.wikimedia.org/r/937575

Hi @isarantopoulos I drafted the model card here: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language_Identification

What is the next step to move this service into production? Since language identification is mostly a frontend feature, we would like to have it exposed as a public web API.

@santhosh Thanks for creating the model card!
Is there a client/system that will use this at the moment? If yes, is there an estimate of the amount of traffic we should be expecting? The main reason I am asking is so that we know the scaling requirements (if any) and can validate them via load testing.

Change 965189 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: add langid in llm namespace

https://gerrit.wikimedia.org/r/965189

Change 965191 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] APIGW: add entry for llm langid LW isvc

https://gerrit.wikimedia.org/r/965191

We've started the procedure to expose the service.
The steps are as follows:

  • Update kserve on the model server to a newer version
  • Move the service from the experimental namespace to the newly created llm namespace
  • Run load testing
  • Expose the model through the API Gateway

@santhosh Thanks for creating the model card!
Is there a client/system that will use this at the moment? If yes, is there an estimate of the amount of traffic we should be expecting? The main reason I am asking is so that we know the scaling requirements (if any) and can validate them via load testing.

The first project that will use this service is MinT, in its user interface. On https://translate.wmcloud.org/, if a user enters some content to translate, we automatically detect the language. This interface is mainly used by the community to test machine translation for about 200 languages, so its traffic would be very minimal: about 100 requests per day at most.

However, we anticipate more traffic when we integrate machine translation into more places, but that is an annual goal for this year and would be an iterative rollout.

@santhosh Usually when we expose a service to the outside world (wmf cloud included) we use the API Gateway, and this is a summary of its auth options:

https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage#Authentication

In theory you could use the unauthenticated tier, but getting an OAuth token specific to the clients that you are planning to use may be best long term. Let us know :)

@elukey If I understood that documentation correctly, even if the service requires an OAuth token, anonymous users can still use it with the applicable rate limiting. Am I right?
There would be use cases where a non-MediaWiki static webpage uses this API, and this anonymous rate-limited option should be sufficient.

@elukey If I understood that documentation correctly, even if the service requires an OAuth token, anonymous users can still use it with the applicable rate limiting. Am I right?
There would be use cases where a non-MediaWiki static webpage uses this API, and this anonymous rate-limited option should be sufficient.

Exactly, yes. Anonymous access works, but we'd like fixed/known clients to authenticate so that we can better follow up with and recognize them.

Change 965936 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/puppet@production] httpbb(ml-services): add test for langid model

https://gerrit.wikimedia.org/r/965936

Change 965936 merged by Elukey:

[operations/puppet@production] httpbb: add test for langid model

https://gerrit.wikimedia.org/r/965936

I have added a new entry in the API Gateway docs.
This doesn't work yet, but it will once all the changes are deployed.

Change 965189 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add langid in llm namespace

https://gerrit.wikimedia.org/r/965189

Change 966252 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] langid: fix - python utils missing

https://gerrit.wikimedia.org/r/966252

Change 966252 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] langid: fix - python utils missing

https://gerrit.wikimedia.org/r/966252

Change 966254 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: Update Docker image for langid

https://gerrit.wikimedia.org/r/966254

Change 966254 merged by Elukey:

[operations/deployment-charts@master] ml-services: Update Docker image for langid

https://gerrit.wikimedia.org/r/966254

Change 965191 merged by jenkins-bot:

[operations/deployment-charts@master] service: Add entry for llm langid for Lift Wing in the api-gw config

https://gerrit.wikimedia.org/r/965191

The new model is deployed on Lift Wing and can be accessed through the API Gateway.
Example request:

curl https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict -X POST -d '{"text": "Some sample text in any language that we want to identify"}' -H "Content-type: application/json"

Internal endpoint:

curl -s https://inference.svc.codfw.wmnet:30443/v1/models/langid:predict -X POST -d '{"text": "Some sample text in any language that we want to identify"}' -i -H "Host: langid.llm.wikimedia.org"

Reminder: the Content-Type header can be omitted, but best practice is to define it explicitly.
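
For client code, a minimal Python sketch of the same call (anonymous, rate-limited tier; an OAuth bearer token could be added via an Authorization header if an authenticated tier is used):

# Hedged client sketch for the public Lift Wing endpoint.
import requests

resp = requests.post(
    "https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict",
    json={"text": "Some sample text in any language that we want to identify"},
    headers={"Content-Type": "application/json"},  # explicit, per the reminder
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"language": "eng_Latn", "score": 0.53, ...}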


Change 966448 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] Use language identification service, remove pycld2

https://gerrit.wikimedia.org/r/966448

Change 966486 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/grafana-grizzly@master] Add Lift Wing LangId SLO definition

https://gerrit.wikimedia.org/r/966486

Change 966486 merged by Elukey:

[operations/grafana-grizzly@master] Add Lift Wing LangId SLO definition

https://gerrit.wikimedia.org/r/966486

Change 966448 merged by Santhosh:

[mediawiki/services/machinetranslation@master] Use language identification service, remove pycld2

https://gerrit.wikimedia.org/r/966448

Change 967220 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: load test langid model

https://gerrit.wikimedia.org/r/967220

I ran some initial load tests with wrk, which is available on our deployment servers. Please forgive the poor presentation of the results; I haven't yet worked on producing an easier-to-digest report with this tool.

Inputs for these load tests include sentences ranging from 5 to 500 words each, as presented in the related patch.

wrk -c 1 -t 1 --timeout 50s -s langid.lua https://inference.svc.codfw.wmnet:30443/v1/models/langid:predict --latency -d 60 -- langid.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/langid:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.33ms    1.45ms  22.65ms   93.25%
    Req/Sec    28.37      6.95    50.00     51.17%
  Latency Distribution
     50%    5.15ms
     75%    5.68ms
     90%    6.21ms
     99%   12.41ms
  1703 requests in 1.00m, 426.34KB read
Requests/sec:     28.35
Transfer/sec:      7.10KB
thread 1 made 1704 requests and got 1703 responses
wrk -c 4 -t 2 --timeout 50s -s langid.lua https://inference.svc.codfw.wmnet:30443/v1/models/langid:predict --latency -d 60 -- langid.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/langid:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.42ms    5.88ms 186.56ms   98.38%
    Req/Sec    57.06     10.73   100.00     66.50%
  Latency Distribution
     50%    4.96ms
     75%    5.71ms
     90%    6.55ms
     99%   16.33ms
  6848 requests in 1.00m, 1.64MB read
  Non-2xx or 3xx responses: 863
Requests/sec:    113.99
Transfer/sec:     27.97KB
thread 1 made 3430 requests and got 3429 responses
thread 2 made 3419 requests and got 3419 responses

All non-2xx/3xx responses are due to rate limiting (429 error code), which means that a single pod can handle this traffic very easily and with very low latency.

By increasing the random delay between requests we are able to get results close to the rate limit (100 rps), and the p99 latency is ~12 ms across ~6k requests.

wrk -c 8 -t 1 --timeout 50s -s langid.lua https://inference.svc.codfw.wmnet:30443/v1/models/langid:predict --latency -d 60 -- langid.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/langid:predict
  1 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.26ms    3.21ms 162.16ms   98.38%
    Req/Sec    99.99     13.41   151.00     75.00%
  Latency Distribution
     50%    5.01ms
     75%    5.59ms
     90%    6.22ms
     99%   11.81ms
  5986 requests in 1.00m, 1.46MB read
  Non-2xx or 3xx responses: 58
Requests/sec:     99.65
Transfer/sec:     24.91KB
thread 1 made 5989 requests and got 5986 responses

Change 967230 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: add autoscaling for langid

https://gerrit.wikimedia.org/r/967230

Based on the above results we conclude that the service can easily withstand a high amount of traffic.
After a discussion on IRC we decided to set up autoscaling with a maximum of 3 replicas, in case we do experience a high amount of traffic.
With the related patch a new pod will be created once the current one exceeds 75% of the target (50 rps).

Change 967220 merged by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] llm: load test langid model

https://gerrit.wikimedia.org/r/967220

Change 967230 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add autoscaling for langid

https://gerrit.wikimedia.org/r/967230

Load testing was done using input text ranging from 5 to 500 words. Although 1-2 sentences would be enough for the model to detect a language, we should truncate the input text to avoid increased latencies (if any!) in case a user sends a huge amount of data in a single request.
I suggest we truncate any request to 500 words; a sketch follows below. There is nothing specific behind the 500-word figure, other than that we know the model is very fast with this kind of input and 500 words seems more than enough.
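
If we go that way, the safeguard could be as simple as the following sketch (the 500-word limit is the suggestion above; the helper name is hypothetical):

# Hypothetical pre-processing step: cap request text at 500 words before
# handing it to the model, to bound latency on very large inputs.
MAX_WORDS = 500

def truncate_words(text: str, max_words: int = MAX_WORDS) -> str:
    words = text.split()
    return text if len(words) <= max_words else " ".join(words[:max_words])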

I ran some final load testing on localhost to test the limits of the current implementation.

 wrk -c 4 -t 2 --timeout 50s -s langid.lua http://localhost:8080/v1/models/langid:predict --latency  -d 60 -- langid.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
Running 1m test @ http://localhost:8080/v1/models/langid:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.10ms    7.58ms  49.35ms   66.41%
    Req/Sec    43.91      8.78    70.00     77.39%
  Latency Distribution
     50%   13.43ms
     75%   19.58ms
     90%   26.05ms
     99%   36.96ms
  5272 requests in 1.00m, 1.10MB read
Requests/sec:     87.80
Transfer/sec:     18.72KB
thread 1 made 2635 requests and got 2634 responses
thread 2 made 2640 requests and got 2638 responses

wrk -c 20 -t 1 --timeout 50s -s langid.lua http://localhost:8080/v1/models/langid:predict --latency  -d 60 -- langid.input
thread 1 created logfile wrk_1.log created
Running 1m test @ http://localhost:8080/v1/models/langid:predict
  1 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   140.35ms    9.96ms 183.63ms   74.52%
    Req/Sec   117.56     20.72   151.00     55.69%
  Latency Distribution
     50%  140.36ms
     75%  146.10ms
     90%  151.67ms
     99%  162.95ms
  7028 requests in 1.00m, 1.46MB read
Requests/sec:    117.08
Transfer/sec:     24.97KB
thread 1 made 7045 requests and got 7028 responses


wrk -c 20 -t 4 --timeout 50s -s langid.lua http://localhost:8080/v1/models/langid:predict --latency  -d 60 -- langid.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
thread 3 created logfile wrk_3.log created
thread 4 created logfile wrk_4.log created
Running 1m test @ http://localhost:8080/v1/models/langid:predict
  4 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   140.54ms   42.74ms 277.11ms   55.95%
    Req/Sec    29.29     11.89    60.00     77.05%
  Latency Distribution
     50%  139.27ms
     75%  178.81ms
     90%  196.82ms
     99%  222.08ms
  7012 requests in 1.00m, 1.46MB read
Requests/sec:    116.72
Transfer/sec:     24.89KB
thread 1 made 1761 requests and got 1756 responses
thread 2 made 1757 requests and got 1754 responses
thread 3 made 1756 requests and got 1751 responses
thread 4 made 1756 requests and got 1751 responses

The model server can handle a great deal of load. Above 100 rps latency increases, but all requests are successful: it was able to handle 20 simultaneous connections across 4 threads in the above wrk test.

Change 968388 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2023-10-25-032936-production

https://gerrit.wikimedia.org/r/968388

Change 968388 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2023-10-31-044726-production

https://gerrit.wikimedia.org/r/968388