NLP using Neural Networks

This project applies a state-of-the-art machine learning technique, neural networks, to build a multi-class text classifier. The dataset contains 13 million board game reviews from users all around the world. Training on a dataset this large could be a problem; however, thanks to Google, we have not only TensorFlow but also a very powerful cloud notebook with free GPUs and TPUs for running machine learning models.
Google Colab: https://colab.research.google.com/notebooks/intro.ipynb#recent=true
I chose neural networks for this project for two major reasons. First, I had never done NLP using neural networks, and I wanted to go out of my way to do something that was neither taught to me in any class nor something I had experience with before. Second, I had heard all the hype about how powerful neural networks are, and what better way to explore them than this.
Check out my project video below:
Check out my live implementation of the model: HERE
Download my notebook: HERE
Check out the code on GitHub: HERE
This project is trained and tested on a large dataset of 13 million board game reviews. The resulting model is deployed as a real-time predictor on a website.
The first step is to import all the libraries needed for this project. Some of the libraries I have used are:
# Keras
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Input, Reshape, Concatenate, Conv2D, MaxPool2D, concatenate, Dropout
from keras.models import Model
from keras.layers.embeddings import Embedding
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.regularizers import l1,l2,l1_l2
# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
## Plot
import matplotlib.pyplot as plt
# NLTK
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
# Other
import re
import string
import numpy as np
import pandas as pd
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))
from google.colab import drive
drive.mount('/content/drive')
In order to process the data we first need to explore it, understand it and find patterns; only then can we prepare the data for our model. I am using pandas to read the CSV file and to apply operations on it. Pandas offers a fast yet powerful data processing library.
reviews = pd.read_csv("/content/drive/My Drive/reviews.csv", error_bad_lines=False)
reviews.head()
reviews.describe()
print(reviews[reviews['comment'].isna()])
print(reviews['rating'].value_counts())
reviews.head()
reviews['rating'].hist(bins=10)
plt.xlabel('rating of review')
plt.ylabel('number of reviews')
plt.show()
After visualising the data we can conclude the following: the comment column contains null values that must be removed, the ratings are continuous values between 1 and 10, and the reviews are heavily concentrated around ratings of 7 and 8.
Using pandas' built-in dropna() function, I remove all the rows with null values.
reviews= reviews.dropna()
reviews.head()
I am using numpy's around() function to round each double rating to its corresponding integer value. This reduces the number of classes to 10.
reviews['rating'] = np.around(reviews['rating'])
reviews['rating'].replace(0.0,1.0, inplace=True)
print(reviews['rating'].value_counts())
reviews.head()
reviews['rating'].hist(bins=10)
plt.xlabel('rating of review')
plt.ylabel('number of reviews')
plt.show()
The data is still unbalanced: there are far more reviews with ratings of 7 and 8. To feed the neural network we need to balance the data. I use pandas to extract a frame for each individual rating, sample each one down to the same size so that every rating has an equal number of reviews, and finally concatenate them into a new, balanced data frame.
sam = 20000
# Sample an equal number of reviews (sam) for each rating from 1 to 10
sampled_frames = [reviews[reviews['rating'] == float(r)].sample(sam) for r in range(1, 11)]
reviews_balanced = pd.concat(sampled_frames, axis=0)
reviews_balanced.head()
print(reviews_balanced['rating'].value_counts())
This is one of the most important parts of text preprocessing. Since everybody has a different writing style, we need to make sure the data is clean and consistent. The first step is to lowercase everything so that our model does not treat 'Hello' and 'hello' as two different features. Then we use the NLTK library to download the English stopwords and remove them from the data. Stopwords such as 'is' and 'the' carry no meaning for determining the rating of a review, so we remove them. Finally, we clean up contractions, punctuation and abbreviations using regular expressions.
def clean_text(text):
    # Lowercase, then drop stopwords and very short tokens
    text = text.lower().split()
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops and len(w) >= 3]
    text = " ".join(text)
    # Strip unwanted characters and expand common contractions
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    # Space out or remove punctuation and symbols
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    # Expand shorthand like "5k" -> "5000" and normalise a few abbreviations
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    # Collapse repeated whitespace
    text = re.sub(r"\s{2,}", " ", text)
    return text
reviews_balanced['comment'] = reviews_balanced['comment'].map(lambda text : clean_text(text))
reviews_balanced.head()
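As a quick sanity check, the cleaner can be applied to a single made-up review (the sentence below is purely illustrative):
sample = "I've played this game 3 times and it's AMAZING!"
print(clean_text(sample))  # prints the lowercased, stopword-free, cleaned-up version of the sentence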
First we need to find out how long the reviews in our dataset are. This matters because we need to tokenize the data and define certain hyperparameters for our model, which is easier once we fully understand the data. I use pandas to calculate the number of words in each comment, then create five word-count bins based on how many words each comment has. Finally, I display the number of comments that fall in each bin.
reviews_balanced['num_words'] = reviews_balanced.comment.apply(lambda x : len(x.split()))
reviews_balanced['bins']=pd.cut(reviews_balanced.num_words, bins=[0,100,300,500,800, np.inf], labels=['0-100', '100-300', '300-500','500-800' ,'>800'])
word_distribution = reviews_balanced.groupby('bins').size().reset_index().rename(columns={0:'counts'})
word_distribution.head()
We can see that the majority of our comments have a word count between 0 and 100.
We need to divide the data into training and test sets. For this I am using sklearn's train_test_split() function, which splits the data randomly according to the ratio passed as a parameter. Here I have used 0.3, giving a 70:30 split between training and test data.
x_train ,x_test, y_train, y_test = train_test_split(reviews_balanced.comment, reviews_balanced['rating'], test_size=0.3)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
Before the data can be fed into the neural network model, we need to tokenize it. Tokenization is the process of converting text into numbers, since a computer can only work with numbers. It works by building a dictionary of the most frequently used words and then replacing each word with its corresponding index.
Suppose we have :
"Hello this is rohan " - THIS IS NOT UNDERSTANDABLE BY A MACHINE
We tokenize it by creating a dictionary :
{ Hello : 1 , this : 2 , is : 3 , Rohan : 4 }
Then we replace the word by its corresponding index. So now our sentence becomes :
[1 , 2, 3, 4] - THIS IS UNDERSTANDABLE BY A MACHINE
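Here is a minimal sketch of that toy example using the Keras Tokenizer imported above (purely illustrative, not part of the main pipeline):
toy_tokenizer = Tokenizer(lower=True)
toy_tokenizer.fit_on_texts(["Hello this is rohan"])
print(toy_tokenizer.word_index)                                   # {'hello': 1, 'this': 2, 'is': 3, 'rohan': 4}
print(toy_tokenizer.texts_to_sequences(["Hello this is rohan"]))  # [[1, 2, 3, 4]]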
max_features = 10000
tokenizer = Tokenizer(num_words=max_features,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',lower = True,oov_token='OOV')
tokenizer.fit_on_texts(list(x_train) + list(x_test))  # fit on a single flat list of review strings
x_train_1 = tokenizer.texts_to_sequences(x_train)
x_test_1 = tokenizer.texts_to_sequences(x_test)
I used the Keras Tokenizer() to create a bag of words (BOW). I specified max_features, which keeps the top 10,000 words from the corpus. The filters argument drops symbols, since they have no significance for the ratings, and oov_token is used to represent out-of-vocabulary words. Then I fit the tokenizer on both the train and test data using fit_on_texts(). Finally, I convert the sentences to machine-readable sequences using texts_to_sequences().
### Padding and Tensors
Since we are using TensorFlow we need to create the input tensors. This is simple, except that our text sequences have different lengths, and a NN expects inputs of the fixed size provided at model compilation, so we need to make sure all of our tensors have equal length.
This is where padding comes in. Padding means adding zeros either at the beginning or at the end of a text sequence; the number of zeros depends on how many are needed to make all sequences the same length. I am using the Keras pad_sequences function, which needs the maximum sequence length. Since most of our data is under 100 words, I use a max_len of 100 to save training time and memory.
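As a quick illustration, post-padding the toy sequence [1, 2, 3, 4] from the tokenization example above to a length of 10 looks like this (pad_sequences is already imported above):
print(pad_sequences([[1, 2, 3, 4]], maxlen=10, padding="post"))
# [[1 2 3 4 0 0 0 0 0 0]]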
max_len = 100
x_train_1 = pad_sequences(x_train_1,max_len, padding = "post")
x_test_1 = pad_sequences(x_test_1,max_len, padding = "post")
print(x_train_1[1])
The result of padding is a tensor of length 100. The zeros were added at the end because I specified padding='post'.
The tokenizer function creates a dictionary of words which is our vocabulary.
tokenizer.word_index
vocab_size = len(tokenizer.word_index) + 1
print("The size of our vocaulary is ",vocab_size)
Since this is a multi-class classification problem, we need to find the unique labels and convert them into categorical tensors so that our NN model can understand them. I am using sklearn's LabelEncoder to convert the labels into numeric classes before turning them into categorical tensors.
(unique, counts) = np.unique(y_train, return_counts=True)
num_classes = len(counts)
print(unique)
print(num_classes)
I have 10 classes. Now I need to convert this numerical data into categorical tensors. A categorical tensor is a binary tensor with as many columns as there are classes.
In our case we have 10 classes, so a label with value 8 (class index 7 after label encoding) becomes:
[0 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 0]
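A minimal sketch of this with the to_categorical function imported above; class index 7 corresponds to the rating 8 because the label encoder maps the ratings 1-10 to classes 0-9:
print(to_categorical([7], num_classes=10))
# [[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]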
Encoder = LabelEncoder()
y_train = Encoder.fit_transform(y_train)
y_test = Encoder.transform(y_test)  # reuse the encoder fitted on the training labels
print(y_train.shape)
print(to_categorical(y_train)[1])
print(to_categorical(y_train).shape)
I have used the encoder to convert my training labels into a categorical tensor. As we can see, the data went from 1 dimension to 2 dimensions; the 10 columns indicate which class each sample belongs to.
Now comes the brain of our classifier: the neural network model. There are many neural network architectures, but for this project I am going to use a shallow NN, a deep NN, a CNN and a CNN with regularization.
For all my models I am going to use the 'softmax' activation function in the output layer. Why softmax? Because we are doing multi-class classification: softmax gives a probability for every class, and the class with the maximum probability is our prediction.
For the loss I am going to use categorical cross-entropy (keras.losses.categorical_crossentropy), since it is the loss function that pairs with softmax for multi-class prediction.
For the optimizer I am going to use Adam (keras.optimizers.Adam), one of the fastest and most accurate optimizers for classification.
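As a quick illustration of what softmax does, here is a tiny NumPy sketch (the logit values below are made up purely for this example):
logits = np.array([1.0, 2.0, 5.0])
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs)             # roughly [0.017 0.047 0.936] -- the probabilities sum to 1
print(np.argmax(probs))  # 2 -> the class with the highest probability is the prediction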
Here I have 2 dense Keras layers in a functional model. The first layer is the input layer, which receives the input and passes it on to the other layers. For text classification we need an embedding layer: it initializes random weights for every word and learns an embedding for each of them. We need to set the embedding dimension; here I am using 50, so each word is represented by a 50-dimensional vector, and the embedding layer alone has 50 * max_features trainable parameters. Then I add a Flatten layer to convert the 2-dimensional output of the embedding layer into 1 dimension. Finally, I use two dense (fully connected) layers to train and produce the prediction.
Number of nodes for dense layer 1 = 512; number of nodes for dense layer 2 (the prediction layer) = number of classes.
The activation function I am using for the hidden layer is ReLU (Rectified Linear Unit). The function returns 0 for any negative input and returns the value itself for any positive input x, so it can be written as f(x) = max(0, x).
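A one-line check of that definition (illustrative only):
vals = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(np.maximum(0, vals))  # negative inputs become 0, positive inputs pass through unchanged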
Inputs = Input(shape=(max_len, ))
embedding_layer = Embedding(max_features,50, input_length=max_len)(Inputs)
x = Flatten()(embedding_layer)
x = Dense(512, activation='relu')(x)
predictions = Dense(num_classes, activation='softmax')(x)
model1 = Model(inputs=[Inputs], outputs=predictions)
model1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model1.summary()
Here I have 4 dense Keras layers in a functional model. The model is similar to the one above, but with more dense layers, which makes the network a deep NN since it has more hidden layers. The layers have 128, 100, 32 and num_classes nodes respectively.
Inputs = Input(shape=(max_len, ))
embedding_layer = Embedding(max_features,50, input_length=max_len)(Inputs)
x = Flatten()(embedding_layer)
x = Dense(128, activation='relu')(x)
x = Dense(100, activation='relu')(x)
x = Dense(32, activation='relu')(x)
predictions = Dense(num_classes, activation='softmax')(x)
model2 = Model(inputs=[Inputs], outputs=predictions)
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.summary()
I am using a functional model to create a network with 1 embedding layer, 2 Conv2D layers and 1 dense layer. The filter sizes I used for the convolutional layers are 2 and 3, with 16 filters each, applied on 50-dimensional embeddings.
embedding_d = 50
filter_sizes = [2,3]
Num_f = 16
Inputs = Input(shape=(max_len,))
x = Embedding(max_features, embedding_d)(Inputs)
x = Reshape((max_len, embedding_d, 1))(x)
maxpool_pool = []
for i in range(len(filter_sizes)):
    conv = Conv2D(Num_f, kernel_size=(filter_sizes[i], embedding_d),
                  kernel_initializer='he_normal', activation='relu')(x)
    maxpool_pool.append(MaxPool2D(pool_size=(max_len - filter_sizes[i] + 1, 1))(conv))
x = Concatenate(axis=1)(maxpool_pool)
x = Flatten()(x)
x = Dropout(0.1)(x)
predictions = Dense(10, activation="softmax")(x)
model3 = Model(inputs=[Inputs], outputs=predictions)
model3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model3.summary()
I am using a functional model to create a network with 1 embedding layer, 2 Conv2D layers and 1 dense layer. The filter sizes for the convolutional layers are again 2 and 3, but this time with 64 filters each and 100-dimensional embeddings. Here I am also using l1 regularization on the convolutional kernels to reduce overfitting and improve accuracy.
embedding_d = 100
filter_sizes = [2,3]
Num_f = 64
Inputs = Input(shape=(max_len,))
x = Embedding(max_features, embedding_d)(Inputs)
x = Reshape((max_len, embedding_d, 1))(x)
maxpool_pool = []
for i in range(len(filter_sizes)):
    conv = Conv2D(Num_f, kernel_size=(filter_sizes[i], embedding_d), activation='relu',
                  kernel_regularizer=l1(0.0001))(x)
    maxpool_pool.append(MaxPool2D(pool_size=(max_len - filter_sizes[i] + 1, 1))(conv))
x = Concatenate(axis=1)(maxpool_pool)
x = Flatten()(x)
x = Dropout(0.1)(x)
predictions = Dense(10, activation="softmax")(x)
model4 = Model(inputs=[Inputs], outputs=predictions)
model4.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model4.summary()
Here I am training all my models on the training data. I am using a ModelCheckpoint callback to make sure the weights with the best validation accuracy are saved.
filepath="model1.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history1 = model1.fit(x_train_1, to_categorical(y_train), batch_size = 16, epochs = 4, validation_split = .1,callbacks=[checkpointer])
filepath="model2.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history2 = model2.fit(x_train_1, to_categorical(y_train), batch_size = 16, epochs = 4, validation_split = .1,callbacks=[checkpointer])
filepath="model3.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history3 = model3.fit(x_train_1, to_categorical(y_train), batch_size = 16, epochs = 4, validation_split = .1,callbacks=[checkpointer])
filepath="model4.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history4 = model4.fit(x_train_1[1:10000], to_categorical(y_train[1:10000]), batch_size = 16, epochs = 4, validation_split = .1,callbacks=[checkpointer])
Here I am using the Keras model.evaluate() function to test the accuracy of each model on the test data.
score1 = model1.evaluate(x_test_1, to_categorical(y_test), verbose=1)
print(score1)
score2 = model2.evaluate(x_test_1, to_categorical(y_test), verbose=1)
print(score2)
score3 = model3.evaluate(x_test_1, to_categorical(y_test), verbose=1)
print(score3)
score4 = model4.evaluate(x_test_1, to_categorical(y_test), verbose=1)
print(score4)
Here I am saving all the models as h5 files so that I can use them again without retraining. I also need to save my tokenizer (the vocabulary) so that any new sentence can be processed the same way before being fed into the NN.
import pickle
model1.save("/content/drive/My Drive/model1.h5")
model2.save("/content/drive/My Drive/model2.h5")
model3.save("/content/drive/My Drive/model3.h5")
model4.save("/content/drive/My Drive/model4.h5")
with open('/content/drive/My Drive/tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
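For the real-time predictor, the saved model and tokenizer can be loaded back later. The snippet below is only a minimal sketch of how that could look (the example review text is made up, and in practice the new sentence would also be passed through clean_text first):
from keras.models import load_model

loaded_model = load_model("/content/drive/My Drive/model4.h5")
with open('/content/drive/My Drive/tokenizer.pickle', 'rb') as handle:
    loaded_tokenizer = pickle.load(handle)

new_review = ["great game with deep strategy but very long setup"]
seq = pad_sequences(loaded_tokenizer.texts_to_sequences(new_review), maxlen=100, padding="post")
pred = loaded_model.predict(seq)
print(pred.argmax(axis=1)[0] + 1)  # +1 maps the class index 0-9 back to a rating of 1-10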
I am using matplotlib to visualize the accuracy and loss of the best model, which is the CNN with regularization.
# Training accuracy of the best model over the epochs
plt.plot(history4.history['accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()
# Training and validation loss over the epochs
plt.plot(history4.history['loss'])
plt.plot(history4.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, np.argmax(model4.predict(x_test_1),axis=1)))
I would say that my neural network models did not perform as well as I expected; however, the CNNs work better than the plain dense networks. Better-suited architectures such as RNNs and LSTMs exist, but due to limited computational power I was not able to train them.
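For reference, here is a minimal sketch of what a simple LSTM variant could look like, reusing the max_len, max_features and num_classes defined above (this is only an untrained sketch, not something evaluated in this project):
from keras.layers import LSTM

Inputs = Input(shape=(max_len,))
x = Embedding(max_features, 50, input_length=max_len)(Inputs)
x = LSTM(64)(x)  # recurrent layer reads the embedded word sequence in order
predictions = Dense(num_classes, activation='softmax')(x)
model5 = Model(inputs=[Inputs], outputs=predictions)
model5.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model5.summary()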