# Classifier Pipelines

# What

# What is a classifier pipeline?

Classifier pipelines let you configure a full classification pipeline at the state level. A pipeline's configuration falls into three categories:

- Preprocessors: apply transformations to the input text before it is fed to the vectorizer or the classifier model.
- Vectorizer: converts the text to a machine-readable vector format.
- Classifier Model: assigns a class to an input vector based on what it learned during supervised training.

classifier pipelines flow

Refer to configuration choices to learn more about the available configurations and the factors that may influence your choice among them.
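
The three stages described above can be pictured as a simple chain: preprocessors, then vectorizer, then classifier model. The following is only a minimal sketch of that flow, not the product's implementation. The `ClassifierPipeline` class, the toy vectorizer, the toy model, and the `check_balance` label are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

Preprocessor = Callable[[str], str]
Vectorizer = Callable[[str], List[float]]
Model = Callable[[List[float]], str]


@dataclass
class ClassifierPipeline:
    """Hypothetical container for the three stages (illustration only)."""
    preprocessors: List[Preprocessor]
    vectorizer: Vectorizer
    model: Model

    def classify(self, text: str) -> str:
        # 1. Preprocessors transform the raw text, in order.
        for preprocess in self.preprocessors:
            text = preprocess(text)
        # 2. The vectorizer turns text into numbers.
        # 3. The model maps the vector to a class label.
        return self.model(self.vectorizer(text))


# Toy stand-ins for a real vectorizer and model (assumptions for this sketch).
VOCAB = ["balance", "transactions"]
LABELS = ["check_balance", "show_transactions"]


def toy_vectorizer(text: str) -> List[float]:
    tokens = text.split()
    return [float(tokens.count(word)) for word in VOCAB]


def toy_model(vector: List[float]) -> str:
    # Pick the label whose keyword appears most often.
    return LABELS[vector.index(max(vector))]


pipeline = ClassifierPipeline(
    preprocessors=[str.lower],
    vectorizer=toy_vectorizer,
    model=toy_model,
)
print(pipeline.classify("Show my recent TRANSACTIONS"))  # show_transactions
```

Without the lowercasing preprocessor, `TRANSACTIONS` would not match the vocabulary entry, which is the kind of effect preprocessors are meant to smooth out.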

# How

# How to create classifier pipelines

You can view and create classifier pipelines by going to a version and clicking Classifier Pipelines on the version sidebar. By default, every version has a Default pipeline with all preprocessors enabled, the model set to fasttext, and the vectorizer set to word2vec. All states are assigned to this pipeline when the version is created. In the screenshot below, two pipelines named default and new_pipeline have been created, and default is the designated Default pipeline.

classifier pipelines page

You can create a new classifier pipeline by clicking on the Create Pipeline button and filling out the required fields. While the Name and Description fields are self-explanatory, you can refer to configuration choices to know more about available Preprocessors, Models, and Vectorizers. Note that fasttext requires word2vec and transformers requires minibert.

create classifier pipeline

There is exactly one default classifier pipeline per version. You are allowed to change the default pipeline, and if you change it, all newly created states will belong to the new default pipeline. The default pipeline cannot be deleted.
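
The rules in this section (fasttext requires word2vec, transformers requires minibert, exactly one default pipeline per version, and no deleting the default) can be expressed as a few checks. This is a sketch under stated assumptions: `PipelineConfig`, `Version`, and their methods are hypothetical and do not correspond to a documented API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Model -> vectorizer it requires, as stated in this section.
REQUIRED_VECTORIZER = {"fasttext": "word2vec", "transformers": "minibert"}

ALL_PREPROCESSORS = [
    "Lower", "Americanize", "Expand Contraction",
    "Remove Punctuation", "Lemmatize", "Num Replace",
]


@dataclass
class PipelineConfig:
    """Hypothetical pipeline record (illustration only)."""
    name: str
    model: str
    vectorizer: str
    preprocessors: List[str] = field(default_factory=list)
    description: str = ""


def validate(config: PipelineConfig) -> None:
    """Reject model/vectorizer pairs that violate the stated requirements."""
    required = REQUIRED_VECTORIZER.get(config.model)
    if required is not None and config.vectorizer != required:
        raise ValueError(
            f"{config.model} requires the {required} vectorizer, "
            f"got {config.vectorizer}"
        )


class Version:
    """Hypothetical version holding pipelines and exactly one default."""

    def __init__(self) -> None:
        default = PipelineConfig(
            "default", "fasttext", "word2vec", list(ALL_PREPROCESSORS)
        )
        self.pipelines: Dict[str, PipelineConfig] = {default.name: default}
        self.default_name = default.name

    def create_pipeline(self, config: PipelineConfig) -> None:
        validate(config)
        self.pipelines[config.name] = config

    def set_default(self, name: str) -> None:
        if name not in self.pipelines:
            raise KeyError(name)
        # Replacing the name keeps the "exactly one default" invariant.
        self.default_name = name

    def delete_pipeline(self, name: str) -> None:
        if name == self.default_name:
            raise ValueError("the default pipeline cannot be deleted")
        del self.pipelines[name]


version = Version()
version.create_pipeline(PipelineConfig("new_pipeline", "transformers", "minibert"))

try:
    version.create_pipeline(PipelineConfig("bad_pipeline", "fasttext", "tfidf"))
except ValueError as error:
    print(error)
```

The sketch does not model what happens to states assigned to a deleted non-default pipeline, since this page does not describe that behavior.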

# How to assign a classifier pipeline

You can view the current list of the state-pipeline assignments through the State Configurations sidebar. You can edit the state-pipeline assignment in two ways:

The first is to switch to edit mode on the State Configurations sidebar. Note that you cannot remove any states from the Default pipeline. Instead, to assign a state to a pipeline other than Default, select the desired pipeline in the Select a Pipeline to Edit dropdown and then assign the state to it. In the screenshot below, the show_transactions state is assigned to new_pipeline, whereas the rest are assigned to the Default pipeline.

edit state configuration sidebar

You can also edit a specific state's classifier pipeline by using the edit state dialog as shown in the screenshots below.

edit state pipelineA

edit state pipelineB
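
The assignment behavior described in this section can be modeled as a mapping from state to pipeline in which every state starts on the default pipeline and moves only by being assigned elsewhere. The sketch below is an illustration under that reading. `StateAssignments` is a hypothetical class, and the `greeting` and `check_balance` states are made-up examples alongside `show_transactions`.

```python
from typing import Dict, List


class StateAssignments:
    """Hypothetical in-memory model of state-to-pipeline assignments."""

    def __init__(self, pipelines: List[str], default_pipeline: str) -> None:
        if default_pipeline not in pipelines:
            raise ValueError("the default pipeline must be one of the pipelines")
        self.pipelines = list(pipelines)
        self.default_pipeline = default_pipeline
        self._pipeline_by_state: Dict[str, str] = {}

    def add_state(self, state: str) -> None:
        # New states always start on the current default pipeline.
        self._pipeline_by_state[state] = self.default_pipeline

    def assign(self, state: str, pipeline: str) -> None:
        # There is no "unassign": a state leaves the default pipeline
        # only by being assigned to another pipeline.
        if state not in self._pipeline_by_state:
            raise KeyError(f"unknown state: {state}")
        if pipeline not in self.pipelines:
            raise ValueError(f"unknown pipeline: {pipeline}")
        self._pipeline_by_state[state] = pipeline

    def states_for(self, pipeline: str) -> List[str]:
        return sorted(
            s for s, p in self._pipeline_by_state.items() if p == pipeline
        )


assignments = StateAssignments(["default", "new_pipeline"], default_pipeline="default")
for state in ["greeting", "check_balance", "show_transactions"]:
    assignments.add_state(state)

assignments.assign("show_transactions", "new_pipeline")
print(assignments.states_for("new_pipeline"))  # ['show_transactions']
print(assignments.states_for("default"))       # ['check_balance', 'greeting']
```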

# How to choose pipeline configurations

In the following sections, we will list the available preprocessors, vectorizers, and classification models.

Preprocessors

| Name | Notes |
| --- | --- |
| Lower | Convert text to all lowercase |
| Americanize | Convert non-American English words to their American English equivalent |
| Expand Contraction | Replace contractions with full representations |
| Remove Punctuation | Remove punctuation |
| Lemmatize | Remove word endings to attain base words |
| Num Replace | Replace numbers with a special token |
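
To make the effect of these preprocessors concrete, here is a rough standard-library sketch of four of them. It is not the product's implementation: the contraction list, the `<num>` token, and the order in which the steps run are all assumptions chosen for this example. Americanize and Lemmatize are omitted because they need word dictionaries.

```python
import re
import string

# Tiny illustrative contraction table (assumption; a real one is much larger).
CONTRACTIONS = {
    "can't": "cannot",
    "don't": "do not",
    "i'm": "i am",
    "what's": "what is",
}
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)
NUM_TOKEN = "<num>"  # hypothetical special token


def lower(text: str) -> str:
    return text.lower()


def expand_contractions(text: str) -> str:
    return CONTRACTION_PATTERN.sub(
        lambda match: CONTRACTIONS[match.group(0).lower()], text
    )


def remove_punctuation(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))


def num_replace(text: str) -> str:
    return re.sub(r"\d+", NUM_TOKEN, text)


def preprocess(text: str) -> str:
    # Order matters in this sketch: contractions are expanded before the
    # apostrophes are stripped, and numbers are replaced last so that the
    # angle brackets in NUM_TOKEN are not removed as punctuation.
    for step in (lower, expand_contractions, remove_punctuation, num_replace):
        text = step(text)
    return text


print(preprocess("I can't see the $12.50 charge from 3 days ago!"))
# i cannot see the <num> charge from <num> days ago
```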

Vectorizers

| Name | Family | Notes |
| --- | --- | --- |
| One Hot | Bag of Words | Best for straightforward problems with clear class label differences. Lightweight and fast. |
| Count | Bag of Words | Similar to the one hot model, but with weighting based on term frequency. Lightweight and fast. |
| TFIDF | Bag of Words | Simple model with weighting based on term frequency and inverse document frequency. Best used when the training data contains a large number of unimportant words. Lightweight and fast. |
| Word2Vec | Word Embedding | Predictive embedding model that is pre-trained except when used with fastText (when it also utilizes subwords). Fast but requires significant memory. |
| GloVe | Word Embedding | Pre-trained vectorizer focused on global word co-occurrence. Fast but requires significant memory. |
| Universal Sentence Encoder | Sentence Embedding | Sentence embedder pre-trained on various NLP tasks. Fast but requires significant memory. Not recommended for most use cases. |
| ELMo | Sentence Embedding | Pre-trained predictive embedding model that utilizes a Bi-LSTM for training. Fast but requires significant memory. |
| Roberta | Sentence Embedding | Large fine-tuned sentence-transformer model based on RoBERTa transformers. Increases training and inference time; GPU recommended. |
| Minibert | Sentence Embedding | Small fine-tuned sentence-transformer model based on BERT transformers. Smaller than its Roberta counterpart, with comparable performance. Most preferred among the word and sentence embedding vectorizers. GPU recommended. |
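
The difference between the three Bag of Words vectorizers is easiest to see on a toy corpus. The sketch below is a plain-Python illustration, not the code these vectorizers run on. It assumes whitespace tokenization and the textbook `log(N / df)` form of inverse document frequency, whereas production libraries typically use smoothed variants.

```python
import math
from collections import Counter
from typing import List


def build_vocab(docs: List[str]) -> List[str]:
    """Sorted list of every distinct token in the corpus."""
    return sorted({token for doc in docs for token in doc.split()})


def one_hot_vector(doc: str, vocab: List[str]) -> List[int]:
    """1 if the word occurs in the document, else 0."""
    present = set(doc.split())
    return [1 if word in present else 0 for word in vocab]


def count_vector(doc: str, vocab: List[str]) -> List[int]:
    """Number of times each word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


def tfidf_vectors(docs: List[str], vocab: List[str]) -> List[List[float]]:
    """Term count scaled by log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc.split()))
    idf = {word: math.log(n_docs / doc_freq[word]) for word in vocab}
    return [
        [count * idf[word] for count, word in zip(count_vector(doc, vocab), vocab)]
        for doc in docs
    ]


docs = [
    "show my transactions",
    "show my balance",
    "transfer money to my savings",
]
vocab = build_vocab(docs)
print(vocab)
print(one_hot_vector("transfer money money", vocab))
print(count_vector("transfer money money", vocab))

weights = tfidf_vectors(docs, vocab)
# "my" appears in every document, so its TF-IDF weight is zero.
print(weights[0][vocab.index("my")])
```

This is why TFIDF suits training data with many unimportant words: a word that appears everywhere contributes nothing to the vector, while a rarer word such as "transactions" is weighted more heavily than "show".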

Classifier Models

| Name | Training Speed | Inference Speed | Scalability | Supports Pretraining | Performance | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| fastText | Excellent | Excellent | Excellent | False | Good | An excellent fast model that can scale to be trained on huge amounts of data. It is best used with large datasets or when speed is paramount. Beware that it performs poorly on few-shot training data. |
| Logistic Regression | Excellent | Excellent | Good | True | Good | A simple linear model useful for getting a baseline performance. It is the quickest model outside of fastText and can be especially useful for getting a quick idea of the performance of vectorizers on a particular problem. |
| Support Vector Machine | Good | Excellent | Good | True | Good | An excellent all-around model, the SVM is most useful on high-dimensionality datasets with a clear label distinction. This makes it especially effective when paired with the one hot, count, or TFIDF vectorizer. |
| Multi-Layer Perceptron | Good | Good | Good | True | Good | An excellent all-around model, the MLP is useful on most datasets. It is capable of fitting more complex problems where the distinction between classes is less clear. |
| Gradient Boosted Decision Tree | Poor | Poor | Poor | True | Poor | The gradient-boosted decision tree has overall poor performance on most datasets. It is particularly difficult to generalize. It will benefit most from automatic hyper-parameter optimization (a feature not available yet). |
| Transformers | Poor | Poor | Excellent | True | Excellent | These models can perform excellently on the most complex problems and scale well to even the largest datasets. They tend to be slower at inference and training than most other models. GPU highly recommended. |
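
One way to use the ratings above is to treat them as data and filter by the properties that matter most for a given state. The following sketch simply transcribes four rating columns from the table. The `shortlist` helper and its keyword names are hypothetical, and the ordering Poor < Good < Excellent is the only logic involved.

```python
from typing import Dict, List

RATING = {"Poor": 0, "Good": 1, "Excellent": 2}

# Ratings copied from the table above.
MODELS: Dict[str, Dict[str, str]] = {
    "fastText": {"training": "Excellent", "inference": "Excellent",
                 "scalability": "Excellent", "performance": "Good"},
    "Logistic Regression": {"training": "Excellent", "inference": "Excellent",
                            "scalability": "Good", "performance": "Good"},
    "Support Vector Machine": {"training": "Good", "inference": "Excellent",
                               "scalability": "Good", "performance": "Good"},
    "Multi-Layer Perceptron": {"training": "Good", "inference": "Good",
                               "scalability": "Good", "performance": "Good"},
    "Gradient Boosted Decision Tree": {"training": "Poor", "inference": "Poor",
                                       "scalability": "Poor", "performance": "Poor"},
    "Transformers": {"training": "Poor", "inference": "Poor",
                     "scalability": "Excellent", "performance": "Excellent"},
}


def shortlist(**minimums: str) -> List[str]:
    """Return models whose ratings meet every requested minimum."""
    return [
        name
        for name, ratings in MODELS.items()
        if all(
            RATING[ratings[column]] >= RATING[minimum]
            for column, minimum in minimums.items()
        )
    ]


print(shortlist(inference="Excellent"))
# ['fastText', 'Logistic Regression', 'Support Vector Machine']
print(shortlist(performance="Excellent"))
# ['Transformers']
```

A filter like this only narrows the field. The Notes column still matters, for example fastText's weakness on few-shot training data.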
