Running the FairSeq M2M-100 machine translation model in a CPU-only environment.
This guide describes the steps for running Facebook's FairSeq m2m_100 multilingual translation model in a CPU-only environment (as of November 2020).
Topline
As of November 2020, FairSeq m2m_100 is considered one of the most advanced machine translation models. It uses a transformer-based model to translate directly between any pair of its 100 supported languages, without routing through an intermediate language (English) as the majority of machine translation models do. The result is a big leap in translation benchmark metrics (e.g. BLEU score).
However, the problem with m2m_100 is that both the pretrained model and the source code needed to run it are written in a way that requires many high-end GPUs to perform the translation. This is prohibitively expensive for many PoC / development environments and for the majority of the developer community.
I did some research, debugged the source code of FairSeq and m2m_100, and managed to modify the source code and settings to run the m2m_100 model on a CPU-only machine. It still needs a fairly capable computer, because the m2m_100 model itself is very big (~51GB) and has to be loaded into memory. But with this modification, the cost of performing translation with m2m_100 is much lower than the original requirement of a high-performance multi-GPU machine.
Hardware Requirements
CPU: X64-based architecture
RAM: Minimum 64GB
DISK: Minimum 200GB available
GPU: None
OS: Linux (I used Ubuntu 20.04)
SWAP Memory: At least 128GB, see https://bogdancornianu.com/change-swap-size-in-ubuntu/ for setting up SWAP memory.
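If you have not set up a swap file before, the linked article walks through it. As a quick reference, a typical Ubuntu swap file setup looks roughly like the sketch below (the size and path are illustrative, adjust them to your machine):

sudo fallocate -l 128G /swapfile        # create a 128GB swap file
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # keep it after reboot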
Software Requirements
Python3
Pytorch
Because we will run on CPU only, be sure to install the “No CUDA” version from: https://pytorch.org/get-started/locally/ (the version I used is 1.7.0)
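For reference, at that time the CPU-only wheel could be installed with a command along these lines (the +cpu wheel index URL below is quoted from memory, so double-check the exact command on the PyTorch page for your Python version):

pip3 install torch==1.7.0+cpu -f https://download.pytorch.org/whl/torch_stable.html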
Sentencepiece
You can install using ‘pip3 install sentencepiece’
Fairscale
You can install using ‘pip3 install fairscale==0.1.1’
Fairseq
We need the latest (working) master branch because the latest public wheel does not yet include the m2m_100 model (as of November 2020). So you will need to clone the FairSeq source code: ‘git clone https://github.com/pytorch/fairseq.git --recursive’
Check out the same commit that I used when writing this blog to ensure the source code modifications are in the same locations and will definitely work: ‘cd fairseq && git checkout 0d03fbe’
Then install the cloned FairSeq using: ‘pip3 install --editable ./’
If you run into errors about missing required libraries, just install them using ‘pip3 install XXXXX’ and rerun the installation.
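For convenience, the installation steps above can be gathered into one small script; the import check at the end is just a sanity test I would suggest, not something FairSeq itself requires:

git clone https://github.com/pytorch/fairseq.git --recursive
cd fairseq && git checkout 0d03fbe
pip3 install --editable ./
# quick sanity check: the import should work and the CLI tools should be on PATH
python3 -c "import fairseq; print(fairseq.__version__)"
fairseq-generate --help > /dev/null && echo "fairseq-generate is installed"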
M2m_100 pretrained model weight file
I use the version that requires 4 GPUs to run (12b_last_chk_4_gpus.pt). You can download it from https://github.com/pytorch/fairseq/tree/master/examples/m2m_100. Put the pretrained weight file inside the fairseq directory.
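If you prefer the command line, a wget along these lines should also work. Note that this URL is my assumption based on the dl.fbaipublicfiles.com/m2m_100 pattern used by the other files later in this guide, so verify it against the README first:

cd fairseq
# ~51GB download; make sure you have the disk space and a stable connection
wget https://dl.fbaipublicfiles.com/m2m_100/12b_last_chk_4_gpus.pt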
Detailed Steps
The first thing I learned from debugging is that the m2m_100 model was trained using a model-parallel approach. That means the model was divided into multiple parts that can run in parallel across many GPUs. We need to specify which GPUs each part of the model should be sent to, using parameters like:
--pipeline-encoder-balance '[1,15,10]' \
--pipeline-encoder-devices '[0,1,0]' \
--pipeline-decoder-balance '[3,11,11,1]' \
--pipeline-decoder-devices '[0,2,3,0]'
Here, it is necessary to set the device and balance numbers correctly (I believe they have to match how the model was trained), otherwise the pretrained weights cannot be loaded (the tensor names contain the device and balance numbers). Another problem is that each device number is the ID of the GPU the corresponding part of the model should be sent to. FairSeq resolves the target GPU by calling torch.device() with these numbers as parameters, which is a legacy way of calling the function and does not support a CPU target at all. The newer way is to pass a string such as torch.device('cuda:0'), which can target the CPU via torch.device('cpu'), but setting ['cpu', 'cpu'] in the m2m_100 input parameters breaks many parts of FairSeq that require the parameter to be a number. And even if we install the CPU-only version of PyTorch, FairSeq will still try to use the GPU (by calling torch.cuda.XXX directly, causing exceptions to be raised). With all this information, I concluded that the easiest way is a small modification after parameter processing and weight loading that forces every balance and device onto the CPU instead of GPUs. A short illustration of the device-resolution issue follows, and after that the list of modifications I made (my added lines are marked with "added" comments):
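The snippet below is only a demonstration of the two torch.device() calling conventions, not part of the translation pipeline:

import torch

# Legacy form used by the pipeline code: a bare integer is treated as a
# CUDA device index, so there is no way to request the CPU this way.
print(torch.device(1))        # -> cuda:1

# String form: can target the CPU explicitly.
print(torch.device('cpu'))    # -> cpu

# On a CPU-only PyTorch build, anything that actually touches CUDA fails,
# typically with "Torch not compiled with CUDA enabled".
try:
    torch.zeros(1).to(torch.device(1))
except (AssertionError, RuntimeError) as err:
    print('CUDA is not usable here:', err)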
** Note that line numbers may differ depending on your commit version of FairSeq and of the fairscale library. It is better to find the locations of the changes by using the function and class names as a guide.
1. Force CPU-only in the model-parallel pipeline, regardless of the legacy device parameters parsed.
File: fairseq/model_parallel/models/pipeline_parallel_transformer/model.py
Line: 44 — Function: __init__ of class: PipelineParallelTransformerModel
class PipelineParallelTransformerModel(BaseFairseqModel):
    def __init__(self, encoder, decoder, balance, devices, chunks, checkpoint):
        devices = ['cpu' for _ in devices]  # added: force every pipeline device to the CPU
2. You also need to modify some code in the fairscale library.
** The file location depends on how you installed the library. If you installed all libraries inside a “venv” environment, you may need to look for the file inside
<venv dir>/lib/python3.8/site-packages/fairscale
In my case, I installed fairscale globally. So my fairscale is at
.local/lib/python3.8/site-packages/fairscale
File: .local/lib/python3.8/site-packages/fairscale/nn/pipe/pipe.py
Line: 318 — Function: split_module
if len(layers) == balance[j]:
    # Group buffered layers as a partition.
    partition = nn.Sequential(layers)
    if devices:
        device = devices[j]
        device = torch.device('cpu')  # added: force the partition onto the CPU
        partition.to(device)
Line: 509 — Function: __init__ of class: Pipe
if devices is None:
    devices = range(torch.cuda.device_count())
devices = ['cpu' for d in devices]  # added: map every requested device to the CPU
devices = [torch.device(d) for d in devices]
devices = cast(List[torch.device], devices)
* In the following steps, I will demonstrate the commands for translating from Thai to Indonesian. You can change the source and target languages in the configuration and filenames from ‘th’ and ‘id’ to other languages. All language pairs supported by the m2m_100 model are listed in: language_pairs.txt
1. Prepare input data. In my case, I want to translate Thai to Indonesian, so I put all my Thai input text inside ‘raw_input.th-id.th’ (placed inside the fairseq folder). Strangely, the m2m_100 translation pipeline also requires a target-language raw data file even though it will not be used. It seems m2m_100 checks and tries to encode the raw data in both the source and target languages even for a one-way translation. The target raw data file cannot be empty either, so you can put any content in there. I simply copied the source raw data into the target-language file: ‘cp raw_input.th-id.th raw_input.th-id.id’ (see the small example below).
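A hypothetical example of preparing the two raw files (the Thai sentences are only placeholders; the expected format is plain UTF-8 text with one sentence per line):

cd fairseq
cat > raw_input.th-id.th << 'EOF'
ทดสอบการแปลภาษา
สวัสดีครับ
EOF
# the unused target-side file only needs to exist and be non-empty
cp raw_input.th-id.th raw_input.th-id.id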
2. Encode the raw data using the SentencePiece model. SentencePiece is a tokenizer that works with variable-length subword pieces which can be merged back to reproduce the entire sentence. I created and ran a script for this task:
File: 1_encode.sh
wget https://dl.fbaipublicfiles.com/m2m_100/spm.128k.model

for lang in 'th' 'id'; do
  python3 scripts/spm_encode.py \
    --model spm.128k.model \
    --output_format=piece \
    --inputs=raw_input.th-id.${lang} \
    --outputs=spm.th-id.${lang}
done
3. Perform preprocessing (binarization) on our input data:
File: 2_binarization.sh
wget https://dl.fbaipublicfiles.com/m2m_100/data_dict.128k.txt

fairseq-preprocess \
  --source-lang "th" --target-lang "id" \
  --testpref spm.th-id \
  --thresholdsrc 0 --thresholdtgt 0 \
  --destdir data_bin_th_id \
  --srcdict data_dict.128k.txt --tgtdict data_dict.128k.txt
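After this step the binarized dataset and dictionaries should be in data_bin_th_id. A quick listing typically looks something like the following (exact file names may vary slightly between FairSeq versions, so treat this as a rough guide):

ls data_bin_th_id
# dict.th.txt  dict.id.txt  preprocess.log
# test.th-id.th.bin  test.th-id.th.idx  test.th-id.id.bin  test.th-id.id.idx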
4. Perform translation. Note that we cannot use --fp16 here, as the CPU does not support half-precision floating point the way GPUs do. You can also see that we need to keep the correct balance and device parameters so that the model can load the weight file correctly, before we force everything onto the CPU later in the modified source code.
File: 3_generation_m2m_100.sh
wget https://dl.fbaipublicfiles.com/m2m_100/model_dict.128k.txt
wget https://dl.fbaipublicfiles.com/m2m_100/language_pairs.txt

fairseq-generate \
  data_bin_th_id \
  --cpu \
  --batch-size 1 \
  --path 12b_last_chk_4_gpus.pt \
  --fixed-dictionary model_dict.128k.txt \
  -s "th" -t "id" \
  --remove-bpe 'sentencepiece' \
  --beam 5 \
  --task translation_multi_simple_epoch \
  --lang-pairs language_pairs.txt \
  --decoder-langtok --encoder-langtok src \
  --gen-subset test \
  --dataset-impl mmap \
  --distributed-world-size 1 --distributed-no-spawn \
  --pipeline-chunks 1 \
  --model-overrides '{"ddp_backend": "c10d", "pipeline_balance": "1, 15, 13, 11, 11, 1", "pipeline_devices": "0, 1, 0, 2, 3, 0"}' \
  --pipeline-encoder-balance '[1,15,10]' \
  --pipeline-encoder-devices '[0,1,0]' \
  --pipeline-decoder-balance '[3,11,11,1]' \
  --pipeline-decoder-devices '[0,2,3,0]' > gen_out
5. You can see the result written in gen_out:
S-2 __th__ แถวยาวไปไหม
T-2 แถวยาวไปไหม
H-2 -2.5059561729431152 Berjalan jarak jauh?
D-2 -2.5059561729431152 Berjalan jarak jauh?
P-2 -6.9086 -5.2208 -0.9001 -3.2984 -0.3153 -0.7454 -0.1531
S-1 __th__ พนักงานบริการโคตรแย่
T-1 พนักงานบริการโคตรแย่
H-1 -2.0654807090759277 Pekerja Layanan Kotor
D-1 -2.0654807090759277 Pekerja Layanan Kotor
P-1 -6.4449 -4.4084 -0.6868 -0.2564 -2.1839 -0.1741 -3.2183 -0.6242 -0.5922
S-3 __th__ อยากให้มีพนักงานเยอะกว่านี้
T-3 อยากให้มีพนักงานเยอะกว่านี้
H-3 -1.5292832851409912 Saya ingin lebih banyak karyawan.
D-3 -1.5292832851409912 Saya ingin lebih banyak karyawan.
P-3 -6.1507 -2.1494 -0.7614 -1.5158 -0.2357 -1.2831 -0.1340 -1.3752 -0.1582
S-0 __th__ ทดสอบการใช้ firseq ในการแปลภาษาจากภาษาไทยไปเป็นภาษาอินโดนีเซีย
T-0 ทดสอบการใช้ firseq ในการแปลภาษาจากภาษาไทยไปเป็นภาษาอินโดนีเซีย
H-0 -1.0547080039978027 Uji coba penggunaan firseq dalam terjemahan bahasa Indonesia ke bahasa Indonesia
D-0 -1.0547080039978027 Uji coba penggunaan firseq dalam terjemahan bahasa Indonesia ke bahasa Indonesia
P-0 -4.7732 -1.1226 -0.4373 -1.9382 -1.8837 -0.3304 -0.1301 -0.1324 -1.4030 -1.0985 -0.0928 -0.1336 -0.4800 -0.9179 -1.1600 -0.6968 -1.4860 -0.7684
2020-11-16 07:51:35 | INFO | fairseq_cli.generate | NOTE: hypothesis and token scores are output in base 2
2020-11-16 07:51:35 | INFO | fairseq_cli.generate | Translated 4 sentences (43 tokens) in 384.7s (0.01 sentences/s, 0.11 tokens/s)
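In this output, S- lines are the source sentences, T- the reference targets (here just the copied source), H-/D- the hypotheses with their scores, and P- the per-token scores. If you only want the translated sentences, a small post-processing command like the one below should work with the standard fairseq-generate output format (the columns in gen_out are tab-separated; sorting by the sentence index restores the original input order):

grep '^H-' gen_out | sort -t '-' -k 2,2 -n | cut -f 3- > hypotheses.id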
And those are all the steps we need to run m2m_100 on a CPU-only machine. Translation with such a big model is of course very slow without GPUs, but this should help people who want to try out or validate the model on their own tasks before investing in a high-performance GPU environment.