What: Raw Reddit posts from the RHMD dataset, labeled as "health mention" or "non-health mention."
Input data for training and testing the model.
Example:
"I had a heart attack last week" → Health mention.
"That movie gave me a heart attack!" → Non-health mention (figurative).
Prepares raw text for modeling by cleaning and structuring it.
What: Remove noise from text.
Steps:
URLs/User Mentions: Delete links (http://...) and user tags (@username).
Emojis: Convert emojis to text (e.g., ❤️ → heart, 😢 → crying_face).
Typos: Fix misspelled words (e.g., hedache → headache).
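The cleaning steps above can be sketched in plain Python; the emoji and typo maps here are tiny illustrative stand-ins for real resources (an emoji library, a spell checker):

```python
import re

# Illustrative lookup tables; a real pipeline would use fuller resources.
EMOJI_MAP = {"\u2764\ufe0f": "heart", "\U0001F622": "crying_face"}
TYPO_MAP = {"hedache": "headache"}

def clean_text(text: str) -> str:
    text = re.sub(r"http\S+", "", text)       # delete URLs
    text = re.sub(r"@\w+", "", text)          # delete user mentions
    for emoji, name in EMOJI_MAP.items():     # convert emojis to words
        text = text.replace(emoji, name)
    # fix known typos and normalize whitespace
    words = [TYPO_MAP.get(w, w) for w in text.split()]
    return " ".join(words)

print(clean_text("I have a hedache \U0001F622 @bob http://x.co"))
# -> "I have a headache crying_face"
```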
What: Split text into smaller units (tokens).
How: Use BERT’s tokenizer to break text into words/subwords.
"heart attack" → Tokens: ["heart", "attack"].What: Combine negation words with their following terms to retain context.
Why: Negations (e.g., "not serious") change the meaning of health mentions.
How:
Merge phrases like "not good" → "not_good".
Use regex or libraries like spaCy to detect negations.
Example:
Original: "The pain is not severe."
Processed: _"The pain is notsevere."
Two specialized models train independently to classify health mentions.
Model: BERT or RoBERTa.
Role: Understands general language patterns.
Example: Detects figurative usage in "That exam gave me a stroke!" (non-health).
Model: BioBERT (pretrained on medical texts).
Role: Focuses on medical terms (e.g., "heart attack," "stroke").
Example: Flags "I was diagnosed with a stroke" as a health mention.
What: A smaller student model (DistilBERT) learns from both teachers.
How:
Soft Labels: Teachers predict probabilities (e.g., [0.9 Health, 0.1 Non-Health]).
Loss Function: Student mimics teacher predictions while using ground-truth labels.
Total Loss = 0.7 * Task Loss + 0.3 * Distillation Loss.
Temperature Scaling: Smooth teacher predictions to help the student learn better.
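The combined loss can be sketched in plain Python (logits and the temperature T = 2.0 are illustrative; a real implementation would use PyTorch tensors and batched data):

```python
import math

def softmax(logits, T=1.0):
    # Temperature T > 1 flattens the distribution, exposing the
    # teacher's relative confidence in the non-top class.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred):
    return -sum(t * math.log(p) for t, p in zip(target, pred))

def total_loss(student_logits, teacher_logits, true_label, T=2.0):
    # Task loss: student vs. the ground-truth one-hot label.
    task = cross_entropy(true_label, softmax(student_logits))
    # Distillation loss: student vs. temperature-smoothed teacher
    # probabilities, weighted 0.7 / 0.3 as stated above.
    distill = cross_entropy(softmax(teacher_logits, T),
                            softmax(student_logits, T))
    return 0.7 * task + 0.3 * distill

loss = total_loss([2.0, 0.0], [3.0, 0.0], [1, 0])
print(loss)
```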
What: Forces the model to focus on health-related terms.
How:
Assign higher weights to tokens like "heart attack" or "stroke".
Example: In "Running caused a heart attack," the model pays extra attention to "heart attack".
Metrics:
Macro F1-Score: Balances precision and recall for imbalanced classes.
Baseline Comparison: Compare against models like BiLSTM-Senti or BERTweet.
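Macro F1 averages the per-class F1 scores without weighting by class frequency, so the minority class counts equally; a plain-Python sketch (in practice, scikit-learn's f1_score with average="macro" does the same):

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    # Unweighted mean over classes: each class counts equally.
    return sum(f1s) / len(f1s)

print(macro_f1([1, 0, 1, 1], [1, 0, 0, 1]))
# -> 0.7333... (class 0 F1 = 2/3, class 1 F1 = 4/5)
```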