Installation¶
pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install inltk
Note: Make sure to pick the correct torch wheel URL for your platform and Python version, which you can find here.
iNLTK runs on CPU, which is the desired behaviour for most Deep Learning models in production.
The first command above installs the CPU-only build of PyTorch, which, as the name suggests, does not have CUDA support.
Note: inltk is currently supported only on Linux and Windows 10 with Python >= 3.6
Supported languages¶
Native languages¶
Language | Code |
---|---|
Hindi | hi |
Punjabi | pa |
Gujarati | gu |
Kannada | kn |
Malayalam | ml |
Oriya | or |
Marathi | mr |
Bengali | bn |
Tamil | ta |
Urdu | ur |
Nepali | ne |
Sanskrit | sa |
English | en |
Telugu | te |
Code Mixed languages¶
Language | Script | Code |
---|---|---|
Hinglish (Hindi+English) | Latin | hi-en |
Tanglish (Tamil+English) | Latin | ta-en |
Manglish (Malayalam+English) | Latin | ml-en |
API¶
Setup the language¶
from inltk.inltk import setup
setup('<code-of-language>')  # e.g. setup('hi') to use Hindi
Note: You need to run setup('<code-of-language>') when you use a language for the FIRST TIME ONLY. This will download all the necessary models required to do inference for that language.
Tokenize¶
from inltk.inltk import tokenize
tokenize(text, '<code-of-language>')  # text is a string in <code-of-language>
Get Embedding Vectors¶
This returns a list of embedding vectors, containing a 400-dimensional representation for every token in the text.
from inltk.inltk import get_embedding_vectors
vectors = get_embedding_vectors(text, '<code-of-language>')  # text is a string in <code-of-language>
Example:
>> vectors = get_embedding_vectors('भारत', 'hi')
>> vectors[0].shape
(400,)
>> get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
[array([-0.894777, -0.140635, -0.030086, -0.669998, ..., 0.859898, 1.940608, 0.09252 , 1.043363], dtype=float32), array([ 0.290839, 1.459981, -0.582347, 0.27822 , ..., -0.736542, -0.259388, 0.086048, 0.736173], dtype=float32), array([ 0.069481, -0.069362, 0.17558 , -0.349333, ..., 0.390819, 0.117293, -0.194081, 2.492722], dtype=float32), array([-0.37837 , -0.549682, -0.497131, 0.161678, ..., 0.048844, -1.090546, 0.154555, 0.925028], dtype=float32), array([ 0.219287, 0.759776, 0.695487, 1.097593, ..., 0.016115, -0.81602 , 0.333799, 1.162199], dtype=float32), array([-0.31529 , -0.281649, -0.207479, 0.177357, ..., 0.729619, -0.161499, -0.270225, 2.083801], dtype=float32), array([-0.501414, 1.337661, -0.405563, 0.733806, ..., -0.182045, -1.413752, 0.163339, 0.907111], dtype=float32), array([ 0.185258, -0.429729, 0.060273, 0.232177, ..., -0.537831, -0.51664 , -0.249798, 1.872428], dtype=float32)]
>> vectors = get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
>> len(vectors)
8
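Since get_embedding_vectors returns one 400-dimensional vector per token (eight of them for the Punjabi sentence above), a common follow-up is to mean-pool them into a single fixed-size vector. A NumPy sketch; the pooling step is not part of the iNLTK API, and the random arrays below merely stand in for real token vectors:

```python
import numpy as np

# Stand-ins for the eight float32 token vectors returned above;
# each real vector from get_embedding_vectors is 400-dimensional.
vectors = [np.random.rand(400).astype(np.float32) for _ in range(8)]

# Mean-pool the per-token vectors into one sentence-level vector.
sentence_vector = np.mean(np.stack(vectors), axis=0)
print(sentence_vector.shape)  # (400,)
```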
Links to embedding visualizations on the Embedding Projector for all the supported languages are given in the table below.
Predict Next ‘n’ words¶
from inltk.inltk import predict_next_words
predict_next_words(text, n, '<code-of-language>')
# text --> string in <code-of-language>
# n --> number of words you want to predict (integer)
Note: You can also pass a fourth parameter, randomness, to predict_next_words; its default value is 0.8.
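The randomness parameter plausibly behaves like a sampling temperature (an assumption based on common language-model sampling APIs, not stated in this document): higher values flatten the next-word distribution, lower values make predictions more deterministic. A self-contained sketch with a made-up vocabulary and scores:

```python
import numpy as np

def sample_with_temperature(logits, temperature=0.8, seed=0):
    """Sample one index from logits after scaling by a temperature.

    Lower temperature sharpens the distribution (greedier choices);
    higher temperature flattens it (more random choices).
    """
    rng = np.random.default_rng(seed)
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Made-up vocabulary and scores, purely for illustration.
vocab = ['खुश', 'दुखी', 'शांत']
idx = sample_with_temperature([2.0, 0.5, 1.0], temperature=0.8)
print(vocab[idx])
```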
Identify language¶
Note: If you update the version of iNLTK, you need to run reset_language_identifying_models before identifying language.
from inltk.inltk import identify_language, reset_language_identifying_models
reset_language_identifying_models() # only if you've updated iNLTK version
identify_language(text)
# text --> string in one of the supported languages
Example:
>> identify_language('न्यायदर्शनम् भारतीयदर्शनेषु अन्यतमम्। वैदिकदर्शनेषु ')
'sanskrit'
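Note that identify_language returns the full language name rather than the code (e.g. 'sanskrit', not 'sa'). If you need the code for follow-up calls such as tokenize, a small lookup built from the supported-languages table above works; a sketch, assuming the returned names are lowercase full names as in the example:

```python
# Map the full language name returned by identify_language back to its
# code from the supported-languages table.
LANG_CODES = {
    'hindi': 'hi', 'punjabi': 'pa', 'gujarati': 'gu', 'kannada': 'kn',
    'malayalam': 'ml', 'oriya': 'or', 'marathi': 'mr', 'bengali': 'bn',
    'tamil': 'ta', 'urdu': 'ur', 'nepali': 'ne', 'sanskrit': 'sa',
    'english': 'en', 'telugu': 'te',
}
print(LANG_CODES['sanskrit'])  # sa
```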
Remove foreign languages¶
from inltk.inltk import remove_foreign_languages
remove_foreign_languages(text, '<code-of-language>')
# text --> string in one of the supported languages
# <code-of-language> --> code of the language whose words you want to retain
Example:
>> remove_foreign_languages('विकिपीडिया सभी विषयों ਇੱਕ ਅਲੌਕਿਕ ਨਜ਼ਾਰਾ ਬੱਝਾ ਹੋਇਆ ਸਾਹਮਣੇ ਆ ਖਲੋਂਦਾ ਸੀ पर प्रामाणिक और 维基百科:关于中文维基百科 उपयोग, परिवर्तन 维基百科:关于中文维基百科', 'hi')
['▁विकिपीडिया', '▁सभी', '▁विषयों', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁पर', '▁प्रामाणिक', '▁और', '▁', '<unk>', ':', '<unk>', '▁उपयोग', ',', '▁परिवर्तन', '▁', '<unk>', ':', '<unk>']
Every word other than those of the host language becomes <unk>, and ▁ signifies the space character.
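Because the output is a list of SentencePiece-style tokens, a small post-processing step can turn it back into a cleaned string: drop the <unk> placeholders and treat ▁ as a space. A sketch; this helper is not part of iNLTK:

```python
def detokenize(tokens):
    """Join SentencePiece-style tokens into a string, dropping <unk>
    placeholders and mapping the ▁ marker back to spaces."""
    text = ''.join(tok for tok in tokens if tok != '<unk>')
    # Replace ▁ with spaces and collapse any runs of whitespace.
    return ' '.join(text.replace('▁', ' ').split())

tokens = ['▁विकिपीडिया', '▁सभी', '▁', '<unk>', '▁पर']
print(detokenize(tokens))  # विकिपीडिया सभी पर
```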
Check out this notebook by Amol Mahajan, where he uses iNLTK to remove foreign characters from the iitb_en_hi_parallel corpus.
Get Sentence Encoding¶
from inltk.inltk import get_sentence_encoding
get_sentence_encoding(text, '<code-of-language>')
Example:
>> encoding = get_sentence_encoding('मुझे अपने देश से', 'hi')
>> encoding.shape
(400,)
get_sentence_encoding returns a 400-dimensional encoding of the sentence from the ULMFiT LM encoder of <code-of-language>, trained in the repositories linked below.
Get Sentence Similarity¶
from inltk.inltk import get_sentence_similarity
get_sentence_similarity(sentence1, sentence2, '<code-of-language>', cmp=cos_sim)
# sentence1, sentence2 --> strings in <code-of-language>
# similarity of encodings is calculated using the cmp function, whose default is cosine similarity
Example:
>> get_sentence_similarity('मैं इन दोनों श्रेणियों के बीच कुछ भी सामान्य नहीं देखता।', 'मैंने कन्फेक्शनरी स्टोर्स पर सेब और संतरे की कीमतों की तुलना की', 'hi')
0.126698300242424
>> get_sentence_similarity('मैं इन दोनों श्रेणियों के बीच कुछ भी सामान्य नहीं देखता।', 'यहां कोई तुलना नहीं है। आप सेब की तुलना संतरे से कर रहे हैं', 'hi')
0.25467658042907715
get_sentence_similarity returns the similarity between two sentences, calculated as the cosine similarity (the default comparison function) between the encoding vectors of the two sentences.
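The default cmp function, cosine similarity, can be written in a few lines of NumPy. A sketch of what is computed over the two 400-dimensional sentence encodings; the helper below is illustrative, not iNLTK's internal code:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors: their dot product
    divided by the product of their Euclidean norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical directions)
print(cos_sim([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```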
Get Similar Sentences¶
from inltk.inltk import get_similar_sentences
get_similar_sentences(sentence, no_of_variants, '<code-of-language>', degree_of_aug=0.1)
# degree_of_aug --> roughly the percentage of the sentence you want to augment; default is 0.1
Example:
>> get_similar_sentences('मैं आज बहुत खुश हूं', 10, 'hi')
['मैं आजकल बहुत खुश हूं',
'मैं आज काफ़ी खुश हूं',
'मैं आज काफी खुश हूं',
'मैं अब बहुत खुश हूं',
'मैं आज अत्यधिक खुश हूं',
'मैं अभी बहुत खुश हूं',
'मैं आज बहुत हाजिर हूं',
'मैं वर्तमान बहुत खुश हूं',
'मैं आज अत्यंत खुश हूं',
'मैं सदैव बहुत खुश हूं']
get_similar_sentences returns a list of length no_of_variants containing sentences similar to sentence.
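"Roughly the percentage of the sentence you want to augment" can be pictured as selecting about that fraction of token positions for replacement. An illustrative sketch of just the selection step; the replacement itself, which iNLTK performs with its language models, is omitted:

```python
import random

def pick_tokens_to_augment(tokens, degree_of_aug=0.1, seed=0):
    """Select roughly degree_of_aug * len(tokens) token positions
    (at least one) as candidates for replacement."""
    rng = random.Random(seed)
    n = max(1, round(degree_of_aug * len(tokens)))
    return sorted(rng.sample(range(len(tokens)), n))

tokens = 'मैं आज बहुत खुश हूं'.split()
# degree_of_aug=0.2 on a 5-token sentence selects one position.
positions = pick_tokens_to_augment(tokens, degree_of_aug=0.2)
print(positions)
```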
Trained Models¶
Note: The English model has been taken directly from fast.ai.
Effect of using Transfer Learning + Paraphrases from iNLTK¶
Language | Repository | Dataset used for Classification | Results on complete training set | Percentage decrease in training set size | Results on reduced training set without Paraphrases | Results on reduced training set with Paraphrases |
---|---|---|---|---|---|---|
Hindi | NLP for Hindi | IIT Patna Movie Reviews | Accuracy: 57.74, MCC: 37.23 | 80% (2480 -> 496) | Accuracy: 47.74, MCC: 20.50 | Accuracy: 56.13, MCC: 34.39 |
Bengali | NLP for Bengali | Bengali News Articles (Soham Articles) | Accuracy: 90.71, MCC: 87.92 | 99% (11284 -> 112) | Accuracy: 69.88, MCC: 61.56 | Accuracy: 74.06, MCC: 65.08 |
Gujarati | NLP for Gujarati | iNLTK Headlines Corpus - Gujarati | Accuracy: 91.05, MCC: 86.09 | 90% (5269 -> 526) | Accuracy: 80.88, MCC: 70.18 | Accuracy: 81.03, MCC: 70.44 |
Malayalam | NLP for Malayalam | iNLTK Headlines Corpus - Malayalam | Accuracy: 95.56, MCC: 93.29 | 90% (5036 -> 503) | Accuracy: 82.38, MCC: 73.47 | Accuracy: 84.29, MCC: 76.36 |
Marathi | NLP for Marathi | iNLTK Headlines Corpus - Marathi | Accuracy: 92.40, MCC: 85.23 | 95% (9672 -> 483) | Accuracy: 84.13, MCC: 68.59 | Accuracy: 84.55, MCC: 69.11 |
Tamil | NLP for Tamil | iNLTK Headlines Corpus - Tamil | Accuracy: 95.22, MCC: 92.70 | 95% (5346 -> 267) | Accuracy: 86.25, MCC: 79.42 | Accuracy: 89.84, MCC: 84.63 |
For more details on the implementation, or to reproduce the results, check out the respective repositories.