Installation

pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install inltk

Note: Make sure to pick the correct torch wheel URL for your platform and Python version; you can find the available wheels here.

iNLTK runs on CPU, which is the desired behaviour for most deep-learning models in production.

The first command above installs the CPU-only build of PyTorch, which, as the name suggests, does not have CUDA support.

Note: inltk is currently supported only on Linux and Windows 10 with Python >= 3.6
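
A quick way to sanity-check the install (a minimal sketch; it assumes the two pip commands above succeeded in the active environment):

import torch
import inltk  # verifies iNLTK is importable

print(torch.__version__)          # e.g. '1.3.1+cpu'
print(torch.cuda.is_available())  # False, since this is the CPU-only build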

Supported languages

Native languages

Language    Code
Hindi       hi
Punjabi     pa
Gujarati    gu
Kannada     kn
Malayalam   ml
Oriya       or
Marathi     mr
Bengali     bn
Tamil       ta
Urdu        ur
Nepali      ne
Sanskrit    sa
English     en
Telugu      te

Code Mixed languages

Language                      Script  Code
Hinglish (Hindi+English)      Latin   hi-en
Tanglish (Tamil+English)      Latin   ta-en
Manglish (Malayalam+English)  Latin   ml-en

API

Set up the language

from inltk.inltk import setup

setup('<code-of-language>')  # if you want to use Hindi, then setup('hi')

Note: You need to run setup('<code-of-language>') when you use a language for the FIRST TIME ONLY. This will download all the necessary models required to do inference for that language.

Tokenize

from inltk.inltk import tokenize

tokenize(text, '<code-of-language>')  # where text is a string in <code-of-language>
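
Example (illustrative output; the exact subword tokens depend on the language's trained SentencePiece model):

>> tokenize('मुझे अपने देश से बहुत प्यार है', 'hi')
['▁मुझे', '▁अपने', '▁देश', '▁से', '▁बहुत', '▁प्यार', '▁है']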

Get Embedding Vectors

This returns a list of embedding vectors, containing a 400-dimensional representation for every token in the text.

from inltk.inltk import get_embedding_vectors

vectors = get_embedding_vectors(text, '<code-of-language>')  # where text is a string in <code-of-language>

Example:

>> vectors = get_embedding_vectors('भारत', 'hi')
>> vectors[0].shape
(400,)

>> get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
[array([-0.894777, -0.140635, -0.030086, -0.669998, ...,  0.859898,  1.940608,  0.09252 ,  1.043363], dtype=float32), array([ 0.290839,  1.459981, -0.582347,  0.27822 , ..., -0.736542, -0.259388,  0.086048,  0.736173], dtype=float32), array([ 0.069481, -0.069362,  0.17558 , -0.349333, ...,  0.390819,  0.117293, -0.194081,  2.492722], dtype=float32), array([-0.37837 , -0.549682, -0.497131,  0.161678, ...,  0.048844, -1.090546,  0.154555,  0.925028], dtype=float32), array([ 0.219287,  0.759776,  0.695487,  1.097593, ...,  0.016115, -0.81602 ,  0.333799,  1.162199], dtype=float32), array([-0.31529 , -0.281649, -0.207479,  0.177357, ...,  0.729619, -0.161499, -0.270225,  2.083801], dtype=float32), array([-0.501414,  1.337661, -0.405563,  0.733806, ..., -0.182045, -1.413752,  0.163339,  0.907111], dtype=float32), array([ 0.185258, -0.429729,  0.060273,  0.232177, ..., -0.537831, -0.51664 , -0.249798,  1.872428], dtype=float32)]
>> vectors = get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
>> len(vectors)
8

Links to embedding visualizations on the Embedding Projector for all supported languages are given in the table below.
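
The per-token vectors can be pooled into a single fixed-size vector, for example by averaging (a minimal numpy sketch; mean pooling is just one common choice, not something iNLTK prescribes):

import numpy as np
from inltk.inltk import get_embedding_vectors

vectors = get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ', 'pa')
pooled = np.mean(np.stack(vectors), axis=0)  # one (400,) vector for the whole text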

Predict Next ‘n’ words

from inltk.inltk import predict_next_words

predict_next_words(text, n, '<code-of-language>')

# text --> a string in <code-of-language>
# n --> the number of words you want to predict (integer)

Note: You can also pass a fourth parameter, randomness, to predict_next_words; it has a default value of 0.8.
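
For example (the continuation is sampled, so the output changes across runs; the lower-randomness call just illustrates the fourth parameter):

predict_next_words('मैं आज बहुत', 5, 'hi')       # default randomness = 0.8
predict_next_words('मैं आज बहुत', 5, 'hi', 0.4)  # lower randomness gives more conservative predictions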

Identify language

Note: If you update the version of iNLTK, you need to run reset_language_identifying_models before identifying language.

from inltk.inltk import identify_language, reset_language_identifying_models

reset_language_identifying_models() # only if you've updated iNLTK version
identify_language(text)

# text --> a string in one of the supported languages

Example:

>> identify_language('न्यायदर्शनम् भारतीयदर्शनेषु अन्यतमम्। वैदिकदर्शनेषु ')
'sanskrit'

Remove foreign languages

from inltk.inltk import remove_foreign_languages

remove_foreign_languages(text, '<code-of-language>')

# text --> a string in one of the supported languages
# <code-of-language> --> code of the language whose words you want to retain

Example:

>> remove_foreign_languages('विकिपीडिया सभी विषयों ਇੱਕ ਅਲੌਕਿਕ ਨਜ਼ਾਰਾ ਬੱਝਾ ਹੋਇਆ ਸਾਹਮਣੇ ਆ ਖਲੋਂਦਾ ਸੀ पर प्रामाणिक और 维基百科:关于中文维基百科 उपयोग, परिवर्तन 维基百科:关于中文维基百科', 'hi')
['▁विकिपीडिया', '▁सभी', '▁विषयों', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁पर', '▁प्रामाणिक', '▁और', '▁', '<unk>', ':', '<unk>', '▁उपयोग', ',', '▁परिवर्तन', '▁', '<unk>', ':', '<unk>']

Every word other than those of the host language becomes <unk>, and ▁ signifies the space character.
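
If you want a cleaned plain string instead of a token list, the output can be post-processed (clean_text below is a hypothetical helper, not part of the iNLTK API):

from inltk.inltk import remove_foreign_languages

def clean_text(text, lang_code):
    tokens = remove_foreign_languages(text, lang_code)
    kept = [t for t in tokens if t != '<unk>']                 # drop foreign-word placeholders
    return ' '.join(''.join(kept).replace('▁', ' ').split())   # undo the '▁' space markers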

Check out this notebook by Amol Mahajan, where he uses iNLTK to remove foreign characters from the iitb_en_hi_parallel corpus.

Get Sentence Encoding

from inltk.inltk import get_sentence_encoding

get_sentence_encoding(text, '<code-of-language>')

Example: 

>> encoding = get_sentence_encoding('मुझे अपने देश से', 'hi')
>> encoding.shape
(400,)

get_sentence_encoding returns a 400-dimensional encoding of the sentence from the ULMFiT LM encoder of <code-of-language>, trained in the repositories linked below.
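
For instance, cosine similarity between two of these encodings can be computed directly (a numpy sketch of the idea behind get_sentence_similarity below, not the library's internal code):

import numpy as np
from inltk.inltk import get_sentence_encoding

e1 = np.asarray(get_sentence_encoding('मुझे अपने देश से', 'hi'))
e2 = np.asarray(get_sentence_encoding('मुझे भारत से', 'hi'))
cosine = float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))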

Get Sentence Similarity

from inltk.inltk import get_sentence_similarity

get_sentence_similarity(sentence1, sentence2, '<code-of-language>', cmp=cos_sim)

# sentence1 and sentence2 are strings in <code-of-language>
# similarity of the encodings is calculated with the cmp function, which defaults to cosine similarity

Example: 

>> get_sentence_similarity('मैं इन दोनों श्रेणियों के बीच कुछ भी सामान्य नहीं देखता।', 'मैंने कन्फेक्शनरी स्टोर्स पर सेब और संतरे की कीमतों की तुलना की', 'hi')
0.126698300242424

>> get_sentence_similarity('मैं इन दोनों श्रेणियों के बीच कुछ भी सामान्य नहीं देखता।', 'यहां कोई तुलना नहीं है। आप सेब की तुलना संतरे से कर रहे हैं', 'hi')
0.25467658042907715

get_sentence_similarity returns the similarity between two sentences, calculated by applying the comparison function (cosine similarity by default) to the encoding vectors of the two sentences.
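
Since cmp is pluggable, you can swap in your own comparison function; the sketch below assumes cmp receives the two 400-dimensional encoding vectors (the negative-euclidean score is an arbitrary example, not an iNLTK default):

import numpy as np
from inltk.inltk import get_sentence_similarity

def neg_euclidean(v1, v2):
    # less negative = more similar
    return -float(np.linalg.norm(np.asarray(v1) - np.asarray(v2)))

get_sentence_similarity('मैं आज बहुत खुश हूं', 'मैं आज खुश हूं', 'hi', cmp=neg_euclidean)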

Get Similar Sentences

from inltk.inltk import get_similar_sentences

get_similar_sentences(sentence, no_of_variants, '<code-of-language>', degree_of_aug=0.1)

# degree_of_aug is roughly the fraction of the sentence you want to augment; its default value is 0.1

Example:

>> get_similar_sentences('मैं आज बहुत खुश हूं', 10, 'hi')
['मैं आजकल बहुत खुश हूं',
 'मैं आज काफ़ी खुश हूं',
 'मैं आज काफी खुश हूं',
 'मैं अब बहुत खुश हूं',
 'मैं आज अत्यधिक खुश हूं',
 'मैं अभी बहुत खुश हूं',
 'मैं आज बहुत हाजिर हूं',
 'मैं वर्तमान बहुत खुश हूं',
 'मैं आज अत्यंत खुश हूं',
 'मैं सदैव बहुत खुश हूं']

get_similar_sentences returns a list of no_of_variants sentences that are similar to sentence.
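
This is handy for data augmentation, which is how the paraphrases in the "Effect of using Transfer Learning + Paraphrases from iNLTK" results below were produced (a minimal sketch; train_sentences and the choice of 2 variants per sentence are hypothetical):

from inltk.inltk import get_similar_sentences

train_sentences = ['मैं आज बहुत खुश हूं']   # your (reduced) training set
augmented = list(train_sentences)
for s in train_sentences:
    augmented.extend(get_similar_sentences(s, 2, 'hi'))  # add 2 paraphrases per sentence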

Trained Models

For each language: the repository, the dataset used for language modeling, the perplexity of the ULMFiT and TransformerXL LMs on the validation set, the dataset used for classification, the classification test-set accuracy and MCC, the notebook for reproducibility, and the ULMFiT and TransformerXL embeddings visualizations on the Embedding Projector.

Hindi (NLP for Hindi)
  LM datasets: Hindi Wikipedia Articles - 172k; Hindi Wikipedia Articles - 55k
  LM perplexity (validation set): ULMFiT 34.06 / 35.87; TransformerXL 26.09 / 34.78
  Classification datasets: BBC News Articles; IIT Patna Movie Reviews; IIT Patna Product Reviews
  Test-set accuracy: 78.75; 57.74; 75.71
  Test-set MCC: 0.71; 0.37; 0.59
  Notebooks for reproducibility: Notebook; Notebook; Notebook
  Embeddings visualization: Hindi Embeddings projection (ULMFiT); Hindi Embeddings projection (TransformerXL)

Bengali (NLP for Bengali)
  LM dataset: Bengali Wikipedia Articles
  LM perplexity (validation set): ULMFiT 41.2; TransformerXL 39.3
  Classification dataset: Bengali News Articles (Soham Articles)
  Test-set accuracy: 90.71; MCC: 0.87
  Notebook for reproducibility: Notebook
  Embeddings visualization: Bengali Embeddings projection (ULMFiT); Bengali Embeddings projection (TransformerXL)

Gujarati (NLP for Gujarati)
  LM dataset: Gujarati Wikipedia Articles
  LM perplexity (validation set): ULMFiT 34.12; TransformerXL 28.12
  Classification dataset: iNLTK Headlines Corpus - Gujarati
  Test-set accuracy: 91.05; MCC: 0.86
  Notebook for reproducibility: Notebook
  Embeddings visualization: Gujarati Embeddings projection (ULMFiT); Gujarati Embeddings projection (TransformerXL)

Malayalam (NLP for Malayalam)
  LM dataset: Malayalam Wikipedia Articles
  LM perplexity (validation set): ULMFiT 26.39; TransformerXL 25.79
  Classification dataset: iNLTK Headlines Corpus - Malayalam
  Test-set accuracy: 95.56; MCC: 0.93
  Notebook for reproducibility: Notebook
  Embeddings visualization: Malayalam Embeddings projection (ULMFiT); Malayalam Embeddings projection (TransformerXL)

Marathi (NLP for Marathi)
  LM dataset: Marathi Wikipedia Articles
  LM perplexity (validation set): ULMFiT 18; TransformerXL 17.42
  Classification dataset: iNLTK Headlines Corpus - Marathi
  Test-set accuracy: 92.40; MCC: 0.85
  Notebook for reproducibility: Notebook
  Embeddings visualization: Marathi Embeddings projection (ULMFiT); Marathi Embeddings projection (TransformerXL)

Tamil (NLP for Tamil)
  LM dataset: Tamil Wikipedia Articles
  LM perplexity (validation set): ULMFiT 19.80; TransformerXL 17.22
  Classification dataset: iNLTK Headlines Corpus - Tamil
  Test-set accuracy: 95.22; MCC: 0.92
  Notebook for reproducibility: Notebook
  Embeddings visualization: Tamil Embeddings projection (ULMFiT); Tamil Embeddings projection (TransformerXL)

Punjabi (NLP for Punjabi)
  LM dataset: Punjabi Wikipedia Articles
  LM perplexity (validation set): ULMFiT 24.40; TransformerXL 14.03
  Classification dataset: IndicNLP News Article Classification Dataset - Punjabi
  Test-set accuracy: 97.12; MCC: 0.96
  Notebook for reproducibility: Notebook
  Embeddings visualization: Punjabi Embeddings projection (ULMFiT); Punjabi Embeddings projection (TransformerXL)

Kannada (NLP for Kannada)
  LM dataset: Kannada Wikipedia Articles
  LM perplexity (validation set): ULMFiT 70.10; TransformerXL 61.97
  Classification dataset: IndicNLP News Article Classification Dataset - Kannada
  Test-set accuracy: 98.87; MCC: 0.98
  Notebook for reproducibility: Notebook
  Embeddings visualization: Kannada Embeddings projection (ULMFiT); Kannada Embeddings projection (TransformerXL)

Oriya (NLP for Oriya)
  LM dataset: Oriya Wikipedia Articles
  LM perplexity (validation set): ULMFiT 26.57; TransformerXL 26.81
  Classification dataset: IndicNLP News Article Classification Dataset - Oriya
  Test-set accuracy: 98.83; MCC: 0.98
  Notebook for reproducibility: Notebook
  Embeddings visualization: Oriya Embeddings Projection (ULMFiT); Oriya Embeddings Projection (TransformerXL)

Sanskrit (NLP for Sanskrit)
  LM dataset: Sanskrit Wikipedia Articles
  LM perplexity (validation set): ULMFiT ~6; TransformerXL ~3
  Classification dataset: Sanskrit Shlokas Dataset
  Accuracy: 84.3 (valid set); MCC: -
  Notebook for reproducibility: -
  Embeddings visualization: Sanskrit Embeddings projection (ULMFiT); Sanskrit Embeddings projection (TransformerXL)

Nepali (NLP for Nepali)
  LM dataset: Nepali Wikipedia Articles
  LM perplexity (validation set): ULMFiT 31.5; TransformerXL 29.3
  Classification dataset: Nepali News Dataset
  Accuracy: 98.5 (valid set); MCC: -
  Notebook for reproducibility: -
  Embeddings visualization: Nepali Embeddings projection (ULMFiT); Nepali Embeddings projection (TransformerXL)

Urdu (NLP for Urdu)
  LM dataset: Urdu Wikipedia Articles
  LM perplexity (validation set): ULMFiT 13.19; TransformerXL 12.55
  Classification dataset: Urdu News Dataset
  Accuracy: 95.28 (valid set); MCC: -
  Notebook for reproducibility: -
  Embeddings visualization: Urdu Embeddings projection (ULMFiT); Urdu Embeddings projection (TransformerXL)

Telugu (NLP for Telugu)
  LM dataset: Telugu Wikipedia Articles
  LM perplexity (validation set): ULMFiT 27.47; TransformerXL 29.44
  Classification datasets: Telugu News Dataset; Telugu News Andhra Jyoti
  Test-set accuracy: 95.4; 92.09
  Test-set MCC: -
  Notebooks for reproducibility: Notebook; Notebook
  Embeddings visualization: Telugu Embeddings projection (ULMFiT); Telugu Embeddings projection (TransformerXL)

Tanglish (NLP for Tanglish)
  LM dataset: Synthetic Tanglish Dataset
  LM perplexity (validation set): ULMFiT 37.50; TransformerXL -
  Classification datasets: Dravidian Codemix HASOC @ FIRE 2020; Dravidian Codemix Sentiment Analysis @ FIRE 2020
  Test-set F1 score: 0.88; 0.62
  Test-set MCC: -
  Notebooks for reproducibility: Notebook; Notebook
  Embeddings visualization: Tanglish Embeddings Projection (ULMFiT); - (TransformerXL)

Manglish (NLP for Manglish)
  LM dataset: Synthetic Manglish Dataset
  LM perplexity (validation set): ULMFiT 45.84; TransformerXL -
  Classification datasets: Dravidian Codemix HASOC @ FIRE 2020; Dravidian Codemix Sentiment Analysis @ FIRE 2020
  Test-set F1 score: 0.74; 0.69
  Test-set MCC: -
  Notebooks for reproducibility: Notebook; Notebook
  Embeddings visualization: Manglish Embeddings Projection (ULMFiT); - (TransformerXL)

Hinglish (NLP for Hinglish)
  LM dataset: Synthetic Hinglish Dataset
  LM perplexity (validation set): ULMFiT 86.48; TransformerXL -
  Classification dataset: -; Accuracy: -; MCC: -
  Notebook for reproducibility: -
  Embeddings visualization: Hinglish Embeddings Projection (ULMFiT); - (TransformerXL)

Note: The English model has been taken directly from fast.ai.

Effect of using Transfer Learning + Paraphrases from iNLTK

For each language: the repository, the dataset used for classification, results on the complete training set, the percentage decrease in training-set size, and results on the reduced training set without and with paraphrases from iNLTK.

Hindi (NLP for Hindi): IIT Patna Movie Reviews
  Complete training set: Accuracy 57.74; MCC 37.23
  Training-set reduction: 80% (2480 -> 496)
  Reduced set without paraphrases: Accuracy 47.74; MCC 20.50
  Reduced set with paraphrases: Accuracy 56.13; MCC 34.39

Bengali (NLP for Bengali): Bengali News Articles (Soham Articles)
  Complete training set: Accuracy 90.71; MCC 87.92
  Training-set reduction: 99% (11284 -> 112)
  Reduced set without paraphrases: Accuracy 69.88; MCC 61.56
  Reduced set with paraphrases: Accuracy 74.06; MCC 65.08

Gujarati (NLP for Gujarati): iNLTK Headlines Corpus - Gujarati
  Complete training set: Accuracy 91.05; MCC 86.09
  Training-set reduction: 90% (5269 -> 526)
  Reduced set without paraphrases: Accuracy 80.88; MCC 70.18
  Reduced set with paraphrases: Accuracy 81.03; MCC 70.44

Malayalam (NLP for Malayalam): iNLTK Headlines Corpus - Malayalam
  Complete training set: Accuracy 95.56; MCC 93.29
  Training-set reduction: 90% (5036 -> 503)
  Reduced set without paraphrases: Accuracy 82.38; MCC 73.47
  Reduced set with paraphrases: Accuracy 84.29; MCC 76.36

Marathi (NLP for Marathi): iNLTK Headlines Corpus - Marathi
  Complete training set: Accuracy 92.40; MCC 85.23
  Training-set reduction: 95% (9672 -> 483)
  Reduced set without paraphrases: Accuracy 84.13; MCC 68.59
  Reduced set with paraphrases: Accuracy 84.55; MCC 69.11

Tamil (NLP for Tamil): iNLTK Headlines Corpus - Tamil
  Complete training set: Accuracy 95.22; MCC 92.70
  Training-set reduction: 95% (5346 -> 267)
  Reduced set without paraphrases: Accuracy 86.25; MCC 79.42
  Reduced set with paraphrases: Accuracy 89.84; MCC 84.63

For more details on the implementation, or to reproduce the results, check out the respective repositories.