The NLP revolution has started — My BERT implementation
As you know, AI is all the rage these days, with industry after industry poised to be affected by it.
AI can be broadly divided by data type into three areas: numbers, text, and images (including video). While dealing with numbers has been going on for quite some time, dealing with text (NLP) and with video are both very hot right now.
To be clear, NLP is a very old field, but we are now seeing massive interest in it, thanks to the convergence of data, algorithms, and computing power:
We are collecting tons and tons of data; our CPUs and GPUs are getting insanely cheap and powerful; and, last but not least, there has been a tremendous explosion of interest in the area, resulting in many advanced word embedding algorithms such as Word2Vec, GloVe, Context2Vec, ELMo, and BERT.
Word2Vec and GloVe utilize the co-occurrence of a target word and its context words, as defined by a context window, but the order of the words in a sentence is not taken into account.
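To make the context-window idea concrete, here is a toy sketch in plain Python (not the actual Word2Vec or GloVe training procedure) that counts target/context word pairs inside a symmetric window. Because only co-occurrence within the window matters, it cannot tell who bites whom:

from collections import Counter

def cooccurrence_counts(sentence, window=2):
    # Count (target, context) pairs within a symmetric window of +/- `window` words.
    tokens = sentence.lower().split()
    counts = Counter()
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(target, tokens[j])] += 1
    return counts

a = cooccurrence_counts("the dog bites the man")
b = cooccurrence_counts("the man bites the dog")
# The window-based pair counts are identical for both sentences,
# so they carry no information about who bites whom.
print(a[("dog", "bites")], b[("dog", "bites")])   # 1 1
print(a[("man", "bites")], b[("man", "bites")])   # 1 1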
Context2Vec takes the sequential relationships between words into account. Each sentence is modeled by a bidirectional RNN, and each target word obtains a contextual embedding from the hidden states of the RNN, which capture information from the words before and after it.
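A rough sketch of that idea in PyTorch is below. It is not the original Context2Vec model (which excludes the target word itself and adds further layers on top), and all the sizes and names here are made up, but it shows how a bidirectional RNN yields a target representation that depends on both sides of the sentence:

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 32, 64
embedding = nn.Embedding(vocab_size, embed_dim)
bi_lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 7))   # one sentence of 7 token ids
hidden_states, _ = bi_lstm(embedding(token_ids))   # shape: (1, 7, 2 * hidden_dim)

target_pos = 3
# The contextual embedding of the target word concatenates the forward and
# backward hidden states at its position, so it reflects words on both sides.
context_embedding = hidden_states[0, target_pos]
print(context_embedding.shape)                     # torch.Size([128])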
ELMo is quite similar to Context2Vec. The main difference is that ELMo uses language modelling to train the word embeddings, while Context2Vec adopts a Word2Vec-style objective that maps a target word to its context words. Also, ELMo is a little deeper than Context2Vec.
BERT benefits from the invention of the Transformer. Though ELMo proves effective at extracting context-dependent embeddings, BERT argues that ELMo captures context from only two directions (a left-to-right and a right-to-left RNN). BERT adopts the encoder of the Transformer, which is built from attention networks, so it can capture context from all positions at once (fully connected).
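Here is a toy single-head self-attention layer of the kind the Transformer encoder stacks. The dimensions and weights are arbitrary and this is only a sketch, not BERT itself, but it shows that every position attends to every other position in a single step, left and right alike:

import torch
import torch.nn.functional as F

seq_len, d_model = 7, 16
x = torch.randn(1, seq_len, d_model)               # token representations

w_q = torch.nn.Linear(d_model, d_model)
w_k = torch.nn.Linear(d_model, d_model)
w_v = torch.nn.Linear(d_model, d_model)

q, k, v = w_q(x), w_k(x), w_v(x)
scores = q @ k.transpose(-2, -1) / d_model ** 0.5  # (1, seq_len, seq_len)
weights = F.softmax(scores, dim=-1)                # each row sums to 1
output = weights @ v                               # context-mixed representations

# The attention matrix is dense: position 3 puts nonzero weight on every
# other position, before and after it.
print(weights[0, 3])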
Here's a very basic example of the superpowers that BERT possesses:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

text = (
    "Today, every student has a computer small enough to fit into "
    "his _. He can solve any math problem by simply pushing the "
    "computer's little _. Computers can add, multiply, divide, and "
    "_. They can also _ better than a human. Some computers are "
    "_. Others have an _ screen that shows all kinds of _ and _ "
    "figures. "
)

# Load pre-trained model with masked language model head
bert_version = 'bert-large-uncased'
model = BertForMaskedLM.from_pretrained(bert_version)

# Preprocess text
tokenizer = BertTokenizer.from_pretrained(bert_version)
tokenized_text = tokenizer.tokenize(text)
mask_positions = []
for i in range(len(tokenized_text)):
    if tokenized_text[i] == '_':
        tokenized_text[i] = '[MASK]'
        mask_positions.append(i)

# Predict missing words from left to right
model.eval()
for mask_pos in mask_positions:
    # Convert tokens to vocab indices
    token_ids = tokenizer.convert_tokens_to_ids(tokenized_text)
    tokens_tensor = torch.tensor([token_ids])
    # Call BERT to predict the token at this position
    predictions = model(tokens_tensor)[0, mask_pos]
    predicted_index = torch.argmax(predictions).item()
    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
    # Update text
    tokenized_text[mask_pos] = predicted_token

# Highlight the words BERT filled in
for mask_pos in mask_positions:
    tokenized_text[mask_pos] = "_" + tokenized_text[mask_pos] + "_"

print(' '.join(tokenized_text).replace(' ##', ''))
The result is as follows:
today , every student has a computer small enough to fit into his _bed_ . he can solve any math problem by simply pushing the computer ' s little _button_ . computers can add , multiply , divide , and _kill_ . they can also _be_ better than a human . some computers are _black_ . others have an _open_ screen that shows all kinds of _clothes_ and _even_ figures .
The BERT model in this example was pre-trained on a large corpus of books (much of it fiction) and English Wikipedia, hence the slightly dystopian word choices. Still, this is crazy: computers sounding like humans will, without a doubt, change the world.
Originally published at https://www.linkedin.com.