| Title: | Real-Time Emoji Mapping for Emotional English Text |
|---|---|
| Description: | This package allows users to analyze text and classify emotions such as happiness, sadness, anger, fear, and neutrality. It combines text preprocessing, TF-IDF feature extraction, and Random Forest classification to predict emotions and map them to corresponding emojis for enhanced sentiment visualization. |
| Authors: | Yusong Zhao [aut], Fangyi Wang [aut, cre], Zisheng Qu [aut] |
| Maintainer: | Fangyi Wang <[email protected]> |
| License: | GPL-2 |
| Version: | 0.1.0 |
| Built: | 2026-06-10 09:45:03 UTC |
| Source: | https://github.com/hateprograming/text2emotion |
Evaluate a Random Forest Model on Test Data
evaluate_rf_model(rf_model, test_texts, test_labels, tfidf_model, vectorizer, stopwords, verbose = TRUE)
rf_model |
A trained 'ranger' model object. |
test_texts |
A vector of raw test texts. |
test_labels |
A factor vector of true labels. |
tfidf_model |
The TF-IDF transformer used for training. |
vectorizer |
The vectorizer used to build DTM. |
stopwords |
A character vector of stopwords. |
verbose |
Whether to print progress. Default TRUE. |
A list with test accuracy, test predictions, and aligned test data.
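The accuracy figure in the returned list can be read as the share of test texts whose predicted label matches the true label. The snippet below is only a sketch of that calculation; `compute_accuracy` is a hypothetical helper, not a function exported by this package, and the package may compute the value differently.

```r
# Hypothetical helper (not part of the package): fraction of matching labels.
compute_accuracy <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  # Compare as character so factors with different level orders still match.
  mean(as.character(predicted) == as.character(actual))
}

predicted <- factor(c("happiness", "anger", "fear", "anger"))
actual    <- factor(c("happiness", "anger", "sadness", "anger"))
compute_accuracy(predicted, actual)  # 3 of 4 labels agree
```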
This function processes a character vector of tokens and handles negations by combining the word "not" with the immediately following word (e.g., "not happy" becomes "not_happy"). This technique helps to better preserve sentiment polarity during text analysis.
handle_negation(tokens)
tokens |
A character vector of tokens (individual words). |
The negation handling procedure follows these steps:
Iterate through each token.
If a token is "not" and followed by another token, merge them into a single token separated by an underscore (e.g., "not_happy").
Skip the next token after merging to avoid duplication.
Otherwise, keep the token unchanged.
This method is especially useful in sentiment analysis tasks where the presence of negations can invert the sentiment polarity of words.
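The steps above can be sketched in base R as follows. This is an illustrative re-implementation under the assumption that only the exact token "not" triggers a merge; `merge_negations` is a hypothetical name, and the package's own `handle_negation` may differ in detail.

```r
# Hypothetical sketch of the described procedure (not the package source).
merge_negations <- function(tokens) {
  out <- character(0)
  i <- 1
  n <- length(tokens)
  while (i <= n) {
    if (tokens[i] == "not" && i < n) {
      # Merge "not" with the following token, then skip that token.
      out <- c(out, paste(tokens[i], tokens[i + 1], sep = "_"))
      i <- i + 2
    } else {
      # Any other token (including a trailing "not") is kept unchanged.
      out <- c(out, tokens[i])
      i <- i + 1
    }
  }
  out
}

merge_negations(c("i", "am", "not", "happy"))
# Returns: c("i", "am", "not_happy")
```

Because the comparison is an exact match, a token such as "nothing" is left alone, and a "not" at the very end of the input has nothing to merge with.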
A character vector of tokens with negations handled by combining "not" with the next word.
handle_negation(c("i", "am", "not", "happy"))
# Returns: c("i", "am", "not_happy")
handle_negation(c("this", "is", "not", "good", "but", "not", "terrible"))
# Returns: c("this", "is", "not_good", "but", "not_terrible")
handle_negation(c("nothing", "to", "worry", "about"))
# Returns: c("nothing", "to", "worry", "about")
This function takes input text, preprocesses it, extracts TF-IDF features using a pre-trained model, predicts the emotion using a trained classifier, and returns the result with optional emoji representation.
predict_emotion_with_emoji(text, output_type = "textemoji")
text |
Character string containing the text to analyze. |
output_type |
Type of output to return. Must be one of "emotion", "emoji", or "textemoji" (the default). |
Depending on output_type:
For "emotion": character string of predicted emotion
For "emoji": character string of corresponding emoji
For "textemoji": original text with appended emoji
The function also prints the result to console.
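How the three output types relate to one another can be sketched as below. Everything here is an assumption made for illustration: `format_emotion_output` and `emoji_map` are hypothetical names, the label spellings and emoji choices are guesses, and the sketch takes an already predicted emotion rather than running the classifier.

```r
# Hypothetical label-to-emoji table; the package's actual mapping may differ.
emoji_map <- c(
  happiness = "\U0001F604",
  sadness   = "\U0001F622",
  anger     = "\U0001F620",
  fear      = "\U0001F628",
  neutral   = "\U0001F610"
)

# Hypothetical formatter: turns a predicted label into the requested output.
format_emotion_output <- function(text, emotion,
                                  output_type = c("textemoji", "emotion", "emoji")) {
  output_type <- match.arg(output_type)
  if (!emotion %in% names(emoji_map)) {
    stop("unknown emotion: ", emotion)
  }
  emoji <- emoji_map[[emotion]]
  switch(output_type,
    emotion   = emotion,
    emoji     = emoji,
    textemoji = paste(text, emoji)
  )
}

format_emotion_output("I feel scared", "fear", "emotion")
# Returns: "fear"
```

Using `match.arg` means an unsupported `output_type` fails early with an informative error instead of silently returning `NULL` from `switch`.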
## Not run:
predict_emotion_with_emoji("I'm so happy today!")
predict_emotion_with_emoji("This makes me angry", "emoji")
predict_emotion_with_emoji("I feel scared", "emotion")
## End(Not run)
This function performs multi-stage text preprocessing, including lowercasing, HTML cleaning, punctuation normalization, contraction expansion, internet slang replacement, emoticon replacement, and final standardization.
preprocess_text(text, use_textclean = TRUE, custom_slang = NULL)
text |
A character vector of input texts. |
use_textclean |
Logical. Whether to replace internet slang and emoticons. Default TRUE. |
custom_slang |
A named character vector providing user-defined slang mappings. Optional. |
The preprocessing pipeline includes:
Lowercasing the text.
Replacing HTML entities and non-ASCII characters.
Expanding common English contractions (e.g., "I'm" -> "I am").
Replacing internet slang and emoticons if use_textclean is TRUE.
Handling additional slang defined by the user.
Normalizing repeated punctuation and whitespace.
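A reduced version of this pipeline can be sketched with base R string functions. The sketch covers only lowercasing, a few contractions, user-defined slang, and punctuation and whitespace normalization; it omits the HTML, non-ASCII, and emoticon steps. `simple_preprocess` and its small contraction table are hypothetical, and slang names are assumed to be lowercase and free of regular-expression metacharacters.

```r
# Hypothetical, reduced sketch of the pipeline (not the package source).
simple_preprocess <- function(text, custom_slang = NULL) {
  text <- tolower(text)

  # Expand a few common contractions (illustrative subset only).
  contractions <- c("i'm" = "i am", "can't" = "cannot", "don't" = "do not")
  for (pat in names(contractions)) {
    text <- gsub(pat, contractions[[pat]], text, fixed = TRUE)
  }

  # Replace user-defined slang on word boundaries.
  if (!is.null(custom_slang)) {
    for (slang in names(custom_slang)) {
      text <- gsub(paste0("\\b", slang, "\\b"), custom_slang[[slang]], text, perl = TRUE)
    }
  }

  # Collapse runs of the same punctuation mark, then runs of whitespace.
  text <- gsub("([!?.])\\1+", "\\1", text, perl = TRUE)
  text <- gsub("\\s+", " ", text, perl = TRUE)
  trimws(text)
}

simple_preprocess("I'm   feeling lit rn!!!", custom_slang = c(rn = "right now"))
# Returns: "i am feeling lit right now!"
```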
A character vector of cleaned and normalized text.
preprocess_text("I'm feeling lit rn!!!")
preprocess_text("I can't believe it... lol :)", use_textclean = TRUE)
Train a full model pipeline including text preprocessing, TF-IDF vectorization, random forest tuning, and training.
custom_slang |
A named list for custom slang replacements (optional). |
max_features |
Maximum number of features for TF-IDF vectorizer (default 10000). |
min_df |
Minimum document frequency for TF-IDF (default 2). |
max_df |
Maximum document frequency for TF-IDF (default 0.8). |
mtry_grid |
Grid of values for 'mtry' parameter to tune in random forest (default: c(5, 10, 20)). |
ntree_grid |
Grid of values for 'ntree' parameter to tune in random forest (default: c(100, 200, 300)). |
stopwords_file |
Path to the stopwords RDS file (default: "final_stopwords.rds"). |
vectorizer_file |
Path to save the trained vectorizer (default: "trained_vectorizer.rds"). |
tfidf_model_file |
Path to save the trained TF-IDF model (default: "trained_tfidf_model.rds"). |
rf_model_file |
Path to save the trained random forest model (default: "trained_rf_ranger_model.rds"). |
train_df_cache_path |
Path to cache the training data frame (default: "train_df_cached.rds"). |
A list containing the trained TF-IDF model, vectorizer, random forest model, and test accuracy.
Train a Random Forest Model with TF-IDF Features
train_rf_model(train_matrix, train_labels, ntree = 300, mtry = NULL, seed = 123, verbose = TRUE, train_df_cache_path = "train_df_cached.rds")
train_matrix |
A sparse matrix ('dgCMatrix') of training features. |
train_labels |
A factor vector of training labels. |
ntree |
Number of trees. Default 300. |
mtry |
Variables to consider at each split. If NULL, auto-selected. |
seed |
Random seed. Default 123. |
verbose |
Whether to print progress. Default TRUE. |
train_df_cache_path |
Path to cache the train data frame. Default "train_df_cached.rds". |
A trained 'ranger' model object.
train_tfidf_model(preprocessed_text, max_features = 10000, min_df = 2, max_df = 0.8)
preprocessed_text |
A character vector containing the preprocessed text. |
max_features |
The maximum number of features (terms) to include in the vocabulary. Default is 10000. |
min_df |
Minimum document frequency for terms. Default is 2 (terms must appear in at least 2 documents). |
max_df |
Maximum document frequency as a proportion of documents. Default is 0.8 (terms must appear in fewer than 80% of documents). |
A list with the following components:
The trained TF-IDF model object.
The vocabulary vectorizer used in training.
The TF-IDF sparse matrix representing the text data.
Train a TF-IDF model with customizable tokenization and vocabulary pruning.
This function performs the following steps:
1. Tokenizes the preprocessed text into words and removes stopwords.
2. Defines custom stopwords and retains important emotional function words.
3. Creates a vocabulary based on unigrams and trigrams, pruning terms based on document frequency and term count.
4. Builds the TF-IDF sparse matrix for the input text.
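To make the pruning and weighting steps concrete, the following base R sketch builds a small dense term-count matrix, drops terms by document frequency, and applies one common TF-IDF variant (raw count times log(N / df)). It is an illustration only: it uses unigrams, skips stopword handling, and the variable names and the exact weighting formula are assumptions rather than the package's implementation.

```r
# Toy corpus; min_df and max_df mirror the documented defaults.
docs <- c("i feel happy today", "i feel sad today",
          "i am not_happy", "i had happy days")
min_df <- 2
max_df <- 0.8

# Tokenize on spaces and build a documents-by-terms count matrix.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))
counts <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
colnames(counts) <- vocab

# Prune: keep terms in at least min_df documents and in fewer than
# a max_df share of all documents.
doc_freq <- colSums(counts > 0)
keep <- doc_freq >= min_df & doc_freq / length(docs) < max_df
counts <- counts[, keep, drop = FALSE]

# Weight each remaining term by its inverse document frequency.
idf <- log(length(docs) / doc_freq[keep])
tfidf <- sweep(counts, 2, idf, `*`)

colnames(tfidf)
# Returns: c("feel", "happy", "today")
```

Here "i" is removed because it occurs in every document, and terms that occur in only one document fall below `min_df`.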
preprocessed_text <- c("I'm feeling so happy today!", "I feel really excited and hopeful!")
result <- train_tfidf_model(preprocessed_text)
result$tfidf_model # Access the trained TF-IDF model
This function performs hyperparameter tuning for a Random Forest model using grid search. It searches over the grid of 'mtry' (number of variables to consider at each split) and 'ntree' (number of trees in the forest) to find the best model based on training accuracy.
tune_rf_model(train_matrix, train_labels, mtry_grid = c(5, 10, 20), ntree_grid = c(100, 200, 300), seed = 123, verbose = TRUE)
train_matrix |
A sparse matrix (class 'dgCMatrix') representing the training feature data. |
train_labels |
A factor vector representing the training labels. |
mtry_grid |
A vector of values to search for the 'mtry' parameter (number of variables to consider at each split). Default is 'c(5, 10, 20)'. |
ntree_grid |
A vector of values to search for the 'ntree' parameter (number of trees in the forest). Default is 'c(100, 200, 300)'. |
seed |
A seed value for reproducibility. Default is '123'. |
verbose |
A logical indicating whether to print progress information during the grid search. Default is 'TRUE'. |
The function trains multiple Random Forest models using different combinations of 'mtry' and 'ntree' values, and evaluates their performance based on training accuracy. The hyperparameters that give the highest accuracy are returned as the best parameters. The process uses the 'ranger' package for training the Random Forest model.
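The grid search described above can be sketched generically: enumerate every (mtry, ntree) pair, score each one, and keep the best. In the sketch, `grid_search` and `score_fn` are hypothetical; `score_fn` stands in for "fit a forest with these settings and return its accuracy", and a toy scoring function is used so the example runs without any model fitting.

```r
# Hypothetical grid search skeleton (not the package source).
grid_search <- function(mtry_grid, ntree_grid, score_fn) {
  # One row per combination of candidate values.
  grid <- expand.grid(mtry = mtry_grid, ntree = ntree_grid)
  # Score every combination; score_fn(mtry, ntree) returns an accuracy.
  grid$accuracy <- mapply(score_fn, grid$mtry, grid$ntree)
  # which.max() keeps the first row if several tie for the best score.
  best <- grid[which.max(grid$accuracy), ]
  list(mtry = best$mtry, ntree = best$ntree, accuracy = best$accuracy)
}

# Toy stand-in for model fitting: peaks at mtry = 10, ntree = 200.
toy_score <- function(mtry, ntree) {
  1 - abs(mtry - 10) / 100 - abs(ntree - 200) / 1000
}

best <- grid_search(c(5, 10, 20), c(100, 200, 300), toy_score)
best$mtry   # 10
best$ntree  # 200
```

Accuracy measured on the training data tends to favor more flexible settings, so a held-out set or out-of-bag estimate is often a more conservative basis for choosing between candidates.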
A list containing the best hyperparameters and the accuracy they achieved:
'mtry': The best number of variables to consider at each split.
'ntree': The best number of trees in the forest.
'accuracy': The accuracy achieved by the model with the best hyperparameters.