Installation

pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install inltk

Note: Make sure to pick the correct torch wheel URL for your platform and Python version, which you will find here.

iNLTK runs on CPU, which is the desired behaviour for most deep learning models in production.

The first command above installs the CPU build of PyTorch, which, as the name suggests, does not have CUDA support.

Note: inltk is currently supported only on Linux and Windows 10 with Python >= 3.6

Supported Languages

Language Code
Hindi hi
Punjabi pa
Gujarati gu
Kannada kn
Malayalam ml
Oriya or
Marathi mr
Bengali bn
Tamil ta
Urdu ur
Nepali ne
Sanskrit sa
English en

API

Setup the language

from inltk.inltk import setup

setup('<code-of-language>')  # for example, setup('hi') to use Hindi

Note: You need to run setup('<code-of-language>') when you use a language for the FIRST TIME ONLY. This will download all the necessary models required to do inference for that language.

Tokenize

from inltk.inltk import tokenize

tokenize(text, '<code-of-language>')  # where text is a string in <code-of-language>

Get Embedding Vectors

This returns a list of embedding vectors, with one 400-dimensional representation for every token in the text.

from inltk.inltk import get_embedding_vectors

vectors = get_embedding_vectors(text, '<code-of-language>')  # where text is a string in <code-of-language>

Example:

>> vectors = get_embedding_vectors('भारत', 'hi')
>> vectors[0].shape
(400,)

>> get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
[array([-0.894777, -0.140635, -0.030086, -0.669998, ...,  0.859898,  1.940608,  0.09252 ,  1.043363], dtype=float32), array([ 0.290839,  1.459981, -0.582347,  0.27822 , ..., -0.736542, -0.259388,  0.086048,  0.736173], dtype=float32), array([ 0.069481, -0.069362,  0.17558 , -0.349333, ...,  0.390819,  0.117293, -0.194081,  2.492722], dtype=float32), array([-0.37837 , -0.549682, -0.497131,  0.161678, ...,  0.048844, -1.090546,  0.154555,  0.925028], dtype=float32), array([ 0.219287,  0.759776,  0.695487,  1.097593, ...,  0.016115, -0.81602 ,  0.333799,  1.162199], dtype=float32), array([-0.31529 , -0.281649, -0.207479,  0.177357, ...,  0.729619, -0.161499, -0.270225,  2.083801], dtype=float32), array([-0.501414,  1.337661, -0.405563,  0.733806, ..., -0.182045, -1.413752,  0.163339,  0.907111], dtype=float32), array([ 0.185258, -0.429729,  0.060273,  0.232177, ..., -0.537831, -0.51664 , -0.249798,  1.872428], dtype=float32)]
>> vectors = get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
>> len(vectors)
8

Links to the embedding visualizations on Embedding Projector for all the supported languages are given in the Trained Models table below.

Predict Next ‘n’ words

from inltk.inltk import predict_next_words

predict_next_words(text , n, '<code-of-language>') 

# text --> string in <code-of-language>
# n --> number of words you want to predict (integer)

Note: You can also pass a fourth parameter, randomness, to predict_next_words. It has a default value of 0.8

Identify language

Note: If you update the version of iNLTK, you need to run reset_language_identifying_models before identifying language.

from inltk.inltk import identify_language, reset_language_identifying_models

reset_language_identifying_models() # only if you've updated iNLTK version
identify_language(text)

# text --> string in one of the supported languages

Example:

>> identify_language('न्यायदर्शनम् भारतीयदर्शनेषु अन्यतमम्। वैदिकदर्शनेषु ')
'sanskrit'
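
The example above returns a language name ('sanskrit'), whereas the other functions take a language code. A minimal sketch of a lookup between the two follows. The dictionary LANGUAGE_CODES and the helper code_for are hypothetical and not part of iNLTK; the codes come from the Supported Languages table, and the sketch assumes identify_language returns lowercase English names for the other languages as it does for Sanskrit.

```python
# Hypothetical helper, not part of iNLTK.
# Codes are copied from the Supported Languages table.
LANGUAGE_CODES = {
    "hindi": "hi",
    "punjabi": "pa",
    "gujarati": "gu",
    "kannada": "kn",
    "malayalam": "ml",
    "oriya": "or",
    "marathi": "mr",
    "bengali": "bn",
    "tamil": "ta",
    "urdu": "ur",
    "nepali": "ne",
    "sanskrit": "sa",
    "english": "en",
}


def code_for(language_name):
    """Return the language code for a name such as 'sanskrit'.

    Raises ValueError if the name is not in the Supported Languages table.
    """
    key = language_name.strip().lower()
    if key not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language: {language_name!r}")
    return LANGUAGE_CODES[key]


print(code_for("sanskrit"))  # sa
```

The returned code could then be passed as '<code-of-language>' to the other functions, provided the assumption about the returned names holds.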

Remove foreign languages

from inltk.inltk import remove_foreign_languages

remove_foreign_languages(text, '<code-of-language>')

# text --> string in one of the supported languages
# <code-of-language> --> code of the language whose words you want to retain

Example:

>> remove_foreign_languages('विकिपीडिया सभी विषयों ਇੱਕ ਅਲੌਕਿਕ ਨਜ਼ਾਰਾ ਬੱਝਾ ਹੋਇਆ ਸਾਹਮਣੇ ਆ ਖਲੋਂਦਾ ਸੀ पर प्रामाणिक और 维基百科:关于中文维基百科 उपयोग, परिवर्तन 维基百科:关于中文维基百科', 'hi')
['▁विकिपीडिया', '▁सभी', '▁विषयों', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁पर', '▁प्रामाणिक', '▁और', '▁', '<unk>', ':', '<unk>', '▁उपयोग', ',', '▁परिवर्तन', '▁', '<unk>', ':', '<unk>']

Every word that is not in the host language becomes <unk>, and ▁ signifies a space character.

Check out this notebook by Amol Mahajan, where he uses iNLTK to remove foreign characters from the iitb_en_hi_parallel corpus.
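
The token list returned by remove_foreign_languages usually needs one more step before it reads as plain text again. A minimal sketch, assuming the output has the form shown in the example above ('<unk>' for foreign words, '▁' as the space marker); the helper clean_tokens is hypothetical and not part of iNLTK:

```python
def clean_tokens(tokens, unk="<unk>", space="▁"):
    """Drop <unk> pieces and rebuild plain text from the remaining tokens."""
    kept = [t for t in tokens if t != unk]
    # '▁' marks a space, so joining the pieces and swapping it restores spacing
    text = "".join(kept).replace(space, " ")
    # Collapse the runs of spaces left behind by the removed words
    return " ".join(text.split())


# A shortened token list in the same form as the example output above
tokens = ["▁विकिपीडिया", "▁सभी", "▁", "<unk>", "▁पर"]
print(clean_tokens(tokens))  # विकिपीडिया सभी पर
```

Punctuation that sat between foreign words (such as the ':' tokens in the example above) is kept by this sketch and may need separate handling.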

Get Sentence Encoding

from inltk.inltk import get_sentence_encoding

get_sentence_encoding(text, '<code-of-language>')

Example: 

>> encoding = get_sentence_encoding('मुझे अपने देश से', 'hi')
>> encoding.shape
(400,)

get_sentence_encoding returns a 400-dimensional encoding of the sentence from the ULMFiT LM encoder for <code-of-language>, trained in the repositories linked below.

Get Sentence Similarity

from inltk.inltk import get_sentence_similarity

get_sentence_similarity(sentence1, sentence2, '<code-of-language>', cmp = cos_sim)

# sentence1, sentence2 are strings in '<code-of-language>'
# similarity of the encodings is calculated using the cmp function, whose default is cosine similarity

Example: 

>> get_sentence_similarity('मैं इन दोनों श्रेणियों के बीच कुछ भी सामान्य नहीं देखता।', 'मैंने कन्फेक्शनरी स्टोर्स पर सेब और संतरे की कीमतों की तुलना की', 'hi')
0.126698300242424

>> get_sentence_similarity('मैं इन दोनों श्रेणियों के बीच कुछ भी सामान्य नहीं देखता।', 'यहां कोई तुलना नहीं है। आप सेब की तुलना संतरे से कर रहे हैं', 'hi')
0.25467658042907715

get_sentence_similarity returns the similarity between two sentences, computed as the cosine similarity (the default comparison function) between the encoding vectors of the two sentences.
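
For reference, cosine similarity between two vectors u and v is (u · v) / (|u| |v|). The following is a pure-Python illustration of that formula only; it is not iNLTK's implementation, and the function name cosine_similarity is hypothetical. Whether a function like this can be passed as cmp depends on how cmp is called, which is an assumption not confirmed here.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# Orthogonal vectors have similarity 0; parallel vectors have similarity 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 6))  # 1.0
```

Values close to 1 indicate encodings that point in nearly the same direction, which is why the second example pair above (0.25) is scored as more similar than the first (0.13).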

Get Similar Sentences

from inltk.inltk import get_similar_sentences

get_similar_sentences(sentence, no_of_variants, '<code-of-language>', degree_of_aug = 0.1)

# where degree_of_aug is roughly the fraction of the sentence you want to augment, with a default value of 0.1

Example:

>> get_similar_sentences('मैं आज बहुत खुश हूं', 10, 'hi')
['मैं आजकल बहुत खुश हूं',
 'मैं आज काफ़ी खुश हूं',
 'मैं आज काफी खुश हूं',
 'मैं अब बहुत खुश हूं',
 'मैं आज अत्यधिक खुश हूं',
 'मैं अभी बहुत खुश हूं',
 'मैं आज बहुत हाजिर हूं',
 'मैं वर्तमान बहुत खुश हूं',
 'मैं आज अत्यंत खुश हूं',
 'मैं सदैव बहुत खुश हूं']

get_similar_sentences returns a list of length no_of_variants containing sentences that are similar to sentence.

Trained Models

| Language | Repository | Dataset used for Language modeling | Perplexity of ULMFiT LM (on validation set) | Perplexity of TransformerXL LM (on validation set) | Dataset used for Classification | Classification: Test set Accuracy | Classification: Test set MCC | Classification: Notebook for Reproducibility | ULMFiT Embeddings visualization | TransformerXL Embeddings visualization |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hindi | NLP for Hindi | Hindi Wikipedia Articles - 172k; Hindi Wikipedia Articles - 55k | 34.06; 35.87 | 26.09; 34.78 | BBC News Articles; IIT Patna Movie Reviews; IIT Patna Product Reviews | 78.75; 57.74; 75.71 | 71.61; 37.23; 59.76 | Notebook; Notebook; Notebook | Hindi Embeddings projection | Hindi Embeddings projection |
| Bengali | NLP for Bengali | Bengali Wikipedia Articles | 41.2 | 39.3 | Bengali News Articles (Soham Articles) | 90.71 | 87.92 | Notebook | Bengali Embeddings projection | Bengali Embeddings projection |
| Gujarati | NLP for Gujarati | Gujarati Wikipedia Articles | 34.12 | 28.12 | iNLTK Headlines Corpus - Gujarati | 91.05 | 86.09 | Notebook | Gujarati Embeddings projection | Gujarati Embeddings projection |
| Malayalam | NLP for Malayalam | Malayalam Wikipedia Articles | 26.39 | 25.79 | iNLTK Headlines Corpus - Malayalam | 95.56 | 93.29 | Notebook | Malayalam Embeddings projection | Malayalam Embeddings projection |
| Marathi | NLP for Marathi | Marathi Wikipedia Articles | 18 | 17.42 | iNLTK Headlines Corpus - Marathi | 92.40 | 85.23 | Notebook | Marathi Embeddings projection | Marathi Embeddings projection |
| Tamil | NLP for Tamil | Tamil Wikipedia Articles | 19.80 | 17.22 | iNLTK Headlines Corpus - Tamil | 95.22 | 92.70 | Notebook | Tamil Embeddings projection | Tamil Embeddings projection |
| Punjabi | NLP for Punjabi | Punjabi Wikipedia Articles | 24.40 | 14.03 | IndicNLP News Article Classification Dataset - Punjabi | 97.12 | 96.17 | Notebook | Punjabi Embeddings projection | Punjabi Embeddings projection |
| Kannada | NLP for Kannada | Kannada Wikipedia Articles | 70.10 | 61.97 | IndicNLP News Article Classification Dataset - Kannada | 98.87 | 98.30 | Notebook | Kannada Embeddings projection | Kannada Embeddings projection |
| Oriya | NLP for Oriya | Oriya Wikipedia Articles | 26.57 | 26.81 | IndicNLP News Article Classification Dataset - Oriya | 98.83 | 98.44 | Notebook | Oriya Embeddings Projection | Oriya Embeddings Projection |
| Sanskrit | NLP for Sanskrit | Sanskrit Wikipedia Articles | ~6 | ~3 | Sanskrit Shlokas Dataset | 84.3 (valid set) | | | Sanskrit Embeddings projection | Sanskrit Embeddings projection |
| Nepali | NLP for Nepali | Nepali Wikipedia Articles | 31.5 | 29.3 | Nepali News Dataset | 98.5 (valid set) | | | Nepali Embeddings projection | Nepali Embeddings projection |
| Urdu | NLP for Urdu | Urdu Wikipedia Articles | 13.19 | 12.55 | Urdu News Dataset | 95.28 (valid set) | | | Urdu Embeddings projection | Urdu Embeddings projection |

Note: The English model has been taken directly from fast.ai.

Effect of using Transfer Learning + Data-Augmentation from iNLTK

| Language | Repository | Dataset used for Classification | Results on using complete training set | Percentage Decrease in Training set size | Results on using reduced training set without Data Aug | Results on using reduced training set with Data Aug |
| --- | --- | --- | --- | --- | --- | --- |
| Hindi | NLP for Hindi | IIT Patna Movie Reviews | Accuracy: 57.74, MCC: 37.23 | 80% (2480 -> 496) | Accuracy: 47.74, MCC: 20.50 | Accuracy: 56.13, MCC: 34.39 |
| Bengali | NLP for Bengali | Bengali News Articles (Soham Articles) | Accuracy: 90.71, MCC: 87.92 | 99% (11284 -> 112) | Accuracy: 69.88, MCC: 61.56 | Accuracy: 74.06, MCC: 65.08 |
| Gujarati | NLP for Gujarati | iNLTK Headlines Corpus - Gujarati | Accuracy: 91.05, MCC: 86.09 | 90% (5269 -> 526) | Accuracy: 80.88, MCC: 70.18 | Accuracy: 81.03, MCC: 70.44 |
| Malayalam | NLP for Malayalam | iNLTK Headlines Corpus - Malayalam | Accuracy: 95.56, MCC: 93.29 | 90% (5036 -> 503) | Accuracy: 82.38, MCC: 73.47 | Accuracy: 84.29, MCC: 76.36 |
| Marathi | NLP for Marathi | iNLTK Headlines Corpus - Marathi | Accuracy: 92.40, MCC: 85.23 | 95% (9672 -> 483) | Accuracy: 84.13, MCC: 68.59 | Accuracy: 84.55, MCC: 69.11 |
| Tamil | NLP for Tamil | iNLTK Headlines Corpus - Tamil | Accuracy: 95.22, MCC: 92.70 | 95% (5346 -> 267) | Accuracy: 86.25, MCC: 79.42 | Accuracy: 89.84, MCC: 84.63 |
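
The percentage decrease in training set size and the accuracy gained from data augmentation can be recomputed from the figures reported above. A small sketch of that arithmetic follows; the numbers are copied from the results, while the function and variable names are illustrative only.

```python
def percent_decrease(full_size, reduced_size):
    """Percentage by which the training set shrank."""
    return 100.0 * (full_size - reduced_size) / full_size


# (language, full set size, reduced set size,
#  accuracy without Data Aug, accuracy with Data Aug), as reported above
rows = [
    ("Hindi", 2480, 496, 47.74, 56.13),
    ("Bengali", 11284, 112, 69.88, 74.06),
    ("Gujarati", 5269, 526, 80.88, 81.03),
    ("Malayalam", 5036, 503, 82.38, 84.29),
    ("Marathi", 9672, 483, 84.13, 84.55),
    ("Tamil", 5346, 267, 86.25, 89.84),
]

for language, full, reduced, acc_plain, acc_aug in rows:
    decrease = percent_decrease(full, reduced)
    gain = round(acc_aug - acc_plain, 2)  # accuracy points recovered by augmentation
    print(f"{language}: {decrease:.1f}% smaller training set, +{gain} accuracy with Data Aug")
```

For Hindi, for example, this gives an 80.0% smaller training set and a gain of 8.39 accuracy points with data augmentation.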

For more details on the implementation, or to reproduce the results, check out the respective repositories.