What: Raw Reddit posts from the RHMD dataset, labeled as "health mention" or "non-health mention."
Input data for training and testing the model.
Example:
"I had a heart attack last week" → Health mention.
"That movie gave me a heart attack!" → Non-health mention (figurative).
Prepares raw text for modeling by cleaning and structuring it.
What: Remove noise from text.
Steps:
URLs/User Mentions: Delete links (http://...) and user tags (@username).
Emojis: Convert emojis to text (e.g., ❤️ → heart, 😢 → crying_face).
Typos: Fix misspelled words (e.g., hedache → headache).
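The cleaning steps above can be sketched in plain Python; the emoji and typo maps here are tiny illustrative stand-ins for real resources (an emoji library, a spell checker):

```python
import re

# Illustrative lookup tables; a real pipeline would use fuller resources.
EMOJI_MAP = {"\u2764\ufe0f": "heart", "\U0001F622": "crying_face"}
TYPO_MAP = {"hedache": "headache"}

def clean_text(text: str) -> str:
    text = re.sub(r"http\S+", "", text)       # delete URLs
    text = re.sub(r"@\w+", "", text)          # delete user mentions
    for emoji, name in EMOJI_MAP.items():     # convert emojis to words
        text = text.replace(emoji, name)
    # fix known typos and normalize whitespace
    words = [TYPO_MAP.get(w, w) for w in text.split()]
    return " ".join(words)

print(clean_text("I have a hedache \U0001F622 @bob http://x.co"))
# -> "I have a headache crying_face"
```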
What: Split text into smaller units (tokens).
How: Use BERT’s tokenizer to break text into words/subwords.
"heart attack" → Tokens: ["heart", "attack"].What: Combine negation words with their following terms to retain context.
Why: Negations (e.g., "not serious") change the meaning of health mentions.
How:
Merge phrases like "not good" → "not_good".
Use regex or libraries like spaCy to detect negations.
Example:
Original: "The pain is not severe."
Processed: _"The pain is notsevere."
Two specialized models train independently to classify health mentions.
Model: BERT or RoBERTa.
Role: Understands general language patterns.
Example: Detects figurative usage in "That exam gave me a stroke!" (non-health).
Model: BioBERT (pretrained on medical texts).
Role: Focuses on medical terms (e.g., "heart attack," "stroke").
Example: Flags "I was diagnosed with a stroke" as a health mention.
What: A smaller student model (DistilBERT) learns from both teachers.
How:
Soft Labels: Teachers predict probabilities (e.g., [0.9 Health, 0.1 Non-Health]).
Loss Function: Student mimics teacher predictions while using ground-truth labels.
Total Loss = 0.7 * Task Loss + 0.3 * Distillation Loss.
Temperature Scaling: Smooth teacher predictions to help the student learn better.
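The combined loss can be sketched in plain Python (logits and the temperature T = 2.0 are illustrative; a real implementation would use PyTorch tensors and batched data):

```python
import math

def softmax(logits, T=1.0):
    # Temperature T > 1 flattens the distribution, exposing the
    # teacher's relative confidence in the non-top class.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred):
    return -sum(t * math.log(p) for t, p in zip(target, pred))

def total_loss(student_logits, teacher_logits, true_label, T=2.0):
    # Task loss: student vs. the ground-truth one-hot label.
    task = cross_entropy(true_label, softmax(student_logits))
    # Distillation loss: student vs. temperature-smoothed teacher
    # probabilities, weighted 0.7 / 0.3 as stated above.
    distill = cross_entropy(softmax(teacher_logits, T),
                            softmax(student_logits, T))
    return 0.7 * task + 0.3 * distill

loss = total_loss([2.0, 0.0], [3.0, 0.0], [1, 0])
print(loss)
```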
What: Forces the model to focus on health-related terms.
How:
Assign higher weights to tokens like "heart attack" or "stroke".
Example: In "Running caused a heart attack," the model pays extra attention to "heart attack".
Metrics:
Macro F1-Score: Balances precision and recall for imbalanced classes.
Baseline Comparison: Compare against models like BiLSTM-Senti or BERTweet.
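Macro F1 averages the per-class F1 scores without weighting by class frequency, so the minority class counts equally; a plain-Python sketch (in practice, scikit-learn's f1_score with average="macro" does the same):

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    # Unweighted mean over classes: each class counts equally.
    return sum(f1s) / len(f1s)

print(macro_f1([1, 0, 1, 1], [1, 0, 0, 1]))
# -> 0.7333... (class 0 F1 = 2/3, class 1 F1 = 4/5)
```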