Classification of text data: using multiple feature spaces with Scikit-learn
Posted on Thu. 02 November 2017 in Machine learning
As mentioned in another post, I am currently working on a text classification task and experimenting with several feature extraction methods.
Input features for text classification
Topic-based
I have started with the regular "topic-based" methods such as latent semantic indexing/analysis (LSI/LSA), latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF).
These methods start with a term-document matrix \(T\) with documents as rows and terms as columns. In the simplest case, the value \(T_{ij}\) is simply the absolute frequency (or count) of term \(j\) in document \(i\). This value is often replaced by the so-called TF-IDF value (Term Frequency - Inverse Document Frequency), which essentially gives more importance to rare terms. Note that "terms" could be words, n-grams or possibly any other relevant unit of text.
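As a quick illustration, here is a minimal sketch on a toy corpus of my own showing how scikit-learn's vectorizers build exactly this kind of matrix:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# toy corpus, one string per document
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]

# raw counts: entry (i, j) is the frequency of term j in document i
counts = CountVectorizer().fit_transform(docs)

# TF-IDF weighting: common terms are down-weighted, rare terms get more importance
tfidf = TfidfVectorizer().fit_transform(docs)

print(counts.shape, tfidf.shape)  # both are (n_documents, n_terms) sparse matrices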
From this matrix, topic-based methods seek to discover latent factors called topics, which are linear combinations of the terms, and represent documents as linear combinations of topics. The expectation is that words found in similar documents will end up in the same topic, hoping that topics are more relevant than bare words for classification. Moreover, using \(T\) directly as input to a classifier would result in one feature per word, which can lead to a prohibitively large number of features. Topic extraction can therefore be seen as a dimensionality reduction step.
In practice, LSI uses a singular value decomposition (SVD) of \(T\), LDA is a probabilistic model over topics and documents, and NMF, well, relies on the non-negative matrix factorization of \(T\).
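All three flavours are available in scikit-learn; here is a rough sketch on a toy corpus (arbitrary settings, just to show the shared interface):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation, NMF

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]

# term-document matrix of raw counts (TF-IDF is common for LSI/NMF, LDA expects counts)
T = CountVectorizer().fit_transform(docs)

lsi_topics = TruncatedSVD(n_components=2).fit_transform(T)               # LSI/LSA via truncated SVD
lda_topics = LatentDirichletAllocation(n_components=2).fit_transform(T)  # probabilistic topic model
nmf_topics = NMF(n_components=2).fit_transform(T)                        # non-negative factorization

# each document is now described by 2 topic weights instead of one value per term
print(lsi_topics.shape, lda_topics.shape, nmf_topics.shape)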
Word embedding-based
Although topic-based approaches are standard in text analysis, I was curious about the newer so-called word embedding methods such as Facebook's FastText. These follow a rather orthogonal approach, as they seek to find a vector representation of words such that semantically similar words are represented by similar vectors (according to a given metric). To reach this goal, the broad idea is to find an embedding allowing to predict which word should occur given its context (for the continuous bag-of-words representation; the skip-gram representation swaps words and contexts). Here, context means "surrounding words", i.e. words found in a window around the word of interest. Note that I use "word" instead of "term", but this can be applied to n-grams as a unit as well.
As opposed to topic-based methods, word-embedding methods consider a more local context: for the former, similar terms are those appearing in similar documents; for the latter, similar terms appear in similar contexts within a document.
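To give a feel for these embeddings, here is a minimal sketch with Gensim's FastText on a toy corpus (parameter names such as size and iter follow the Gensim version used in this post):
from gensim.models.fasttext import FastText

# toy corpus: one list of tokens per document
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "chased", "the", "cat"],
             ["dogs", "and", "cats", "are", "pets"]]

# small skip-gram model (sg=1) producing 10-dimensional word vectors
model = FastText(sentences, sg=1, size=10, window=3, min_count=1, iter=50)

# words appearing in similar contexts end up with similar vectors
print(model.wv["cat"])               # the 10-dimensional vector for "cat"
print(model.wv.most_similar("cat"))  # nearest words in the embedding space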
The best of both worlds?
Since the two approaches seem quite complementary, I thought I would give combining their resulting features a shot. I settled on using NMF and a FastText-based document embedding.
In practice: document FastText
Word-embedding methods result in vectors for terms but not for documents. Therefore, I used a fairly simple method to get document vectors from term vectors: simply concatenate the element-wise min, max and mean of the vectors of all words in the document. For a term embedding of size \(k\), this results in a document embedding of size \(3k\). This idea was originally described here and gave seemingly good results for short documents.
The code using Gensim's FastText:
from sklearn.base import BaseEstimator
from gensim.models.fasttext import FastText

class DocumentFastText(BaseEstimator):
    def __init__(self, sentences=None, sg=0, hs=0, size=100, alpha=0.025, window=5, min_count=5,
                 max_vocab_size=None, word_ngrams=1, loss='ns', sample=0.001, seed=1, workers=3,
                 min_alpha=0.0001, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
                 min_n=3, max_n=6, sorted_vocab=1, bucket=2000000, trim_rule=None,
                 batch_words=10000, epochs=5):
        # training data is only provided through fit(), never at construction time
        sentences = None
        self.sg = sg
        self.hs = hs
        self.size = size
        self.alpha = alpha
        self.window = window
        self.min_count = min_count
        self.max_vocab_size = max_vocab_size
        self.word_ngrams = word_ngrams
        self.loss = loss
        self.sample = sample
        self.seed = seed
        self.workers = workers
        self.min_alpha = min_alpha
        self.negative = negative
        self.cbow_mean = cbow_mean
        self.hashfxn = hashfxn
        self.iter = iter
        self.null_word = null_word
        self.min_n = min_n
        self.max_n = max_n
        self.sorted_vocab = sorted_vocab
        self.bucket = bucket
        self.trim_rule = trim_rule
        self.batch_words = batch_words
        self.epochs = epochs
        # build an untrained FastText model; hyper-parameters are passed straight through
        self.fast_text = FastText(sentences, sg, hs, size, alpha, window, min_count, max_vocab_size,
                                  word_ngrams, loss, sample, seed, workers, min_alpha, negative,
                                  cbow_mean, hashfxn, iter, null_word, min_n, max_n, sorted_vocab,
                                  bucket, trim_rule, batch_words)
        self.is_fit = False

    def fit(self, text, y=None):
        # text is a list of token lists, as produced by TextPreProcessor below
        self.fast_text.build_vocab(text)
        self.fast_text.train(text, total_examples=self.fast_text.corpus_count, epochs=self.epochs)
        self.is_fit = True
        return self
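The class above only trains the model; to plug it into a pipeline as a transformer, it also needs a transform method implementing the min/max/mean concatenation described earlier. A minimal sketch of what it could look like, to be added inside DocumentFastText (my own reconstruction, assuming numpy is imported as np):
    def transform(self, text):
        # one row per document: element-wise min, max and mean of its word vectors,
        # concatenated into a single vector of size 3 * self.size
        doc_vectors = []
        for tokens in text:
            vectors = np.array([self.fast_text.wv[w] for w in tokens if w in self.fast_text.wv])
            if len(vectors) == 0:
                # document with no known word: fall back to a zero vector
                vectors = np.zeros((1, self.size))
            doc_vectors.append(np.concatenate([vectors.min(axis=0),
                                               vectors.max(axis=0),
                                               vectors.mean(axis=0)]))
        return np.array(doc_vectors)

    def fit_transform(self, text, y=None):
        return self.fit(text, y).transform(text)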
In practice: getting the inputs right
Now, my workflow is based on scikit-learn's pipelines, and FastText is implemented in the Gensim library but not in scikit-learn, as it is not general enough. For this reason, FastText was not designed to work with sklearn's convenient vectorizers and has to be fed with lists of words instead of a document-term matrix. Thus I have to find a way to:
- combine the features coming from both methods before feeding them to the classifier, which can be done easily with a FeatureUnion;
- do so starting from a different representation of the text (lists of words versus a document-term matrix);
- allow some parameters to be shared between these representations (stop words, for instance).
For this, let's first build a class which replicates the pre-processing and tokenizing steps of the vectorizer. This yields a list of words for FastText to use, while still taking into account the parameters passed to the original vectorizer, which is used with NMF.
import numpy as np
from sklearn.base import BaseEstimator

class TextPreProcessor(BaseEstimator):
    def __init__(self, vectorizer):
        # reuse the preprocessing and tokenization callables of an existing vectorizer
        self.vectorizer = vectorizer
        self.preprocess = vectorizer.build_preprocessor()
        self.tokenize = vectorizer.build_tokenizer()

    def fit(self, text, y=None):
        # stateless transformer: nothing to learn
        return self

    def transform(self, text):
        # decode, preprocess and tokenize each document into a list of words
        return np.array([self.tokenize(self.preprocess(self.vectorizer.decode(t))) for t in text])

    def fit_transform(self, text, y=None):
        return self.transform(text)
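As a quick check of the class above, wrapping a plain CountVectorizer gives token lists ready for FastText:
from sklearn.feature_extraction.text import CountVectorizer

# the preprocessor reuses the vectorizer's own lower-casing and tokenization rules
prepro = TextPreProcessor(CountVectorizer())
print(prepro.transform(["The cat sat on the mat."]))
# one list of lower-cased tokens per document, ready for FastText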
In practice: putting it all together
Then, it is mostly a matter of building and plugging the corresponding pipes together into a final pipeline:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV

# shared vectorizer: its parameters (stop words, n-grams, ...) apply to both branches
train_vectorizer = CountVectorizer()
doc_fast_text = DocumentFastText()

# embedding branch: raw text -> token lists -> FastText document vectors
fasttext_subpipe = Pipeline(steps=[('text_prepro', TextPreProcessor(train_vectorizer)),
                                   ('transfo', doc_fast_text)])

# topic branch: raw text -> document-term matrix -> NMF topic weights
nmf = NMF()
nmf_subpipe = Pipeline(steps=[('vectorizer', train_vectorizer), ('transfo', nmf)])

# concatenate both feature spaces and feed them to the classifier
feature_union = FeatureUnion([("embedding", fasttext_subpipe), ("topics", nmf_subpipe)])
classifier = GradientBoostingClassifier()
pipe = Pipeline(steps=[('feature_extraction', feature_union), ("classifier", classifier)])
I can then run my workflow, for instance a grid search over parameters. Conveniently, multiple levels of parameter nesting are handled with the "__" syntax.
params_grid = {
    "feature_extraction__embedding__transfo__size": [100, 200],
    "feature_extraction__embedding__transfo__min_count": [2, 5],
    "feature_extraction__embedding__transfo__word_ngrams": [1, 2, 3],
    "feature_extraction__topics__vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "feature_extraction__topics__vectorizer__binary": [True, False],
    "feature_extraction__topics__transfo__n_components": [100, 200],
    "feature_extraction__topics__transfo__alpha": [0, 0.25, 0.5],
    "feature_extraction__topics__transfo__l1_ratio": [0, 0.5, 1]
}

# GridSearchCV has no random_state parameter: the seed goes to the CV splitter instead
kfold_cv = StratifiedKFold(shuffle=True, random_state=RANDOM_STATE_SEED)

# best_cut_mcc_scoring, n_cores, RANDOM_STATE_SEED, train_text and train_decision
# are defined elsewhere in my workflow
gs_cv = GridSearchCV(pipe, param_grid=params_grid,
                     scoring=best_cut_mcc_scoring, cv=kfold_cv,
                     n_jobs=n_cores, verbose=10).fit(train_text, train_decision)
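Once the search has finished, the usual GridSearchCV attributes give access to the outcome:
# best parameter combination and the corresponding cross-validated score
print(gs_cv.best_params_)
print(gs_cv.best_score_)

# the pipeline refitted on the full training set, ready for new documents
best_pipe = gs_cv.best_estimator_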
And that's it.
Epilogue
Using both feature spaces as input gives improved classification results as compared to using either separately.