Methodology

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#ffffff', 'edgeLabelBackground': '#fff'}}}%%
flowchart TD
    A[Reddit Posts - RHMD Dataset] --> B[Data Preprocessing]
    subgraph Preprocessing
        B --> C[Text Cleaning: URLs, Emojis, Typos]
        C --> D[Tokenization - BERT Tokenizer]
        D --> E[Preserve Negations]
    end
    E --> F[Teacher Models]
    subgraph Teachers
        F --> G[General Context Teacher: BERT or RoBERTa]
        F --> H[Domain-Specific Teacher: BioBERT]
    end
    G --> L[Multi-Teacher Knowledge Distillation]
    H --> L
    subgraph Distillation
        L --> M[Student Model: DistilBERT]
        M --> N[Soft Labels from Teachers]
        N --> O[Loss Function: Alpha L_task + 1-Alpha L_distill]
        O --> P[Temperature Scaling]
    end
    P --> Q[Domain-Specific Attention Layer - Focus on Health Terms]
    Q --> R[Evaluation]
    subgraph Evaluation
        R --> S[Macro F1, Precision, Recall]
        R --> U[Baseline Comparison - BiLSTM-Senti, BERTweet]
    end
    style A fill:#ffebcc,stroke:#666
    style B fill:#e6f3ff,stroke:#666
    style G fill:#d4edda,stroke:#666
    style H fill:#d4edda,stroke:#666
    style M fill:#fff3cd,stroke:#666
    style Q fill:#f8d7da,stroke:#666
    style R fill:#e2e3e5,stroke:#666
```

1. Reddit Posts - RHMD Dataset

Raw posts are drawn from RHMD (Reddit Health Mention Dataset), a corpus of Reddit posts labeled for whether a disease or symptom word is used as a genuine health mention or figuratively.


2. Data Preprocessing

This stage prepares raw text for modeling by cleaning and structuring it.

a. Text Cleaning: remove URLs, emojis, and obvious typos.

b. Tokenization: split cleaned text into subword tokens with the BERT tokenizer.

c. Preserve Negations: keep negation cues (e.g., "not", "don't") intact, since they can flip the meaning of a health mention.
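As a rough sketch, the cleaning and negation-preservation steps could look like the following. The regex patterns and the negation expansion map are illustrative assumptions, not the exact rules used; tokenization itself would be delegated to a pretrained BERT tokenizer afterward.

```python
import re

# Illustrative negation map: expanding contractions keeps the "not"
# visible to the tokenizer instead of letting cleaning mangle it.
NEGATION_MAP = {"can't": "can not", "won't": "will not", "don't": "do not",
                "isn't": "is not", "didn't": "did not"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_post(text: str) -> str:
    """Strip URLs and emojis, expand negations, normalize whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    lowered = text.lower()
    for contraction, expansion in NEGATION_MAP.items():
        lowered = lowered.replace(contraction, expansion)
    return re.sub(r"\s+", " ", lowered).strip()

print(clean_post("I don't have cancer 😅 see https://example.com"))
# -> "i do not have cancer see"
```

Note that "don't" survives as "do not" rather than being stripped as noise, which is the point of step (c).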


3. Teacher Models

Two specialized teacher models are trained independently to classify health mentions.

a. General Context Teacher (BERT or RoBERTa): captures general conversational language.

b. Domain-Specific Teacher (BioBERT): captures biomedical terminology.
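One way the two teachers' predictions can be merged into a single soft-label target is to average their temperature-softened distributions. Equal weighting of the teachers and T = 2 are assumptions for illustration, not the paper's exact settings:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened probabilities; T > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_labels(general_logits, domain_logits, temperature=2.0):
    """Average the two teachers' softened distributions (equal weighting
    is an assumption; a learned or validation-tuned weight is also common)."""
    p_general = softmax(general_logits, temperature)
    p_domain = softmax(domain_logits, temperature)
    return [(g + d) / 2 for g, d in zip(p_general, p_domain)]

# Example with three hypothetical classes
soft = ensemble_soft_labels([2.0, 0.5, -1.0], [3.0, -0.5, -1.5])
```

The averaged vector is still a valid probability distribution, so it can be plugged directly into the distillation loss below.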


4. Multi-Teacher Knowledge Distillation

A DistilBERT student learns from the soft labels produced by both teachers. The training objective combines the task loss and the distillation loss as alpha * L_task + (1 - alpha) * L_distill, with temperature scaling applied to soften the teacher distributions.
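The combined objective from the diagram, alpha * L_task + (1 - alpha) * L_distill, can be sketched in scalar form as follows. The KL direction and the T^2 scaling follow Hinton et al.'s standard distillation convention and are assumptions here, as are the default alpha = 0.5 and T = 2:

```python
import math

def softmax(logits, t=1.0):
    m = max(z / t for z in logits)
    exps = [math.exp(z / t - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_probs, hard_label,
                      alpha=0.5, temperature=2.0):
    """alpha * L_task + (1 - alpha) * L_distill.

    L_task: cross-entropy of the student against the gold label.
    L_distill: KL(teacher || student) at temperature T, scaled by T^2
    so its gradient magnitude stays comparable to the task loss.
    """
    p_student = softmax(student_logits)
    l_task = -math.log(p_student[hard_label])
    p_student_t = softmax(student_logits, temperature)
    l_distill = temperature ** 2 * sum(
        q * math.log(q / p)
        for q, p in zip(teacher_probs, p_student_t) if q > 0)
    return alpha * l_task + (1 - alpha) * l_distill
```

In practice both terms would be computed batchwise on tensors, but the arithmetic per example is exactly this.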


5. Keyword Attention Layer

A domain-specific attention layer directs the student's focus toward health-related terms in each post.
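A minimal sketch of health-term-focused attention: add a fixed logit boost at positions matching a health lexicon, then renormalize with a softmax. The additive-boost form and the tiny lexicon are assumptions; a trained layer would learn these weights and use a much larger vocabulary:

```python
import math

# Illustrative health lexicon (assumption, not the paper's list).
HEALTH_TERMS = {"cancer", "fever", "headache", "diabetes", "anxiety"}

def keyword_attention(tokens, scores, boost=2.0):
    """Boost the attention logits of health-term positions,
    then renormalize so the weights still sum to 1."""
    boosted = [s + boost if tok in HEALTH_TERMS else s
               for tok, s in zip(tokens, scores)]
    m = max(boosted)
    exps = [math.exp(b - m) for b in boosted]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["i", "have", "a", "terrible", "headache", "today"]
weights = keyword_attention(tokens, [0.0] * len(tokens))
```

With uniform input scores, "headache" ends up carrying the largest attention weight, which is the intended bias.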


6. Evaluation

Performance is reported as macro F1, precision, and recall, and compared against the BiLSTM-Senti and BERTweet baselines.
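For reference, macro-averaged metrics give every class equal weight regardless of its frequency, which matters when figurative mentions are the minority class. A self-contained sketch of the computation (in practice a library such as scikit-learn would be used):

```python
def macro_scores(y_true, y_pred, labels):
    """Per-class precision/recall/F1, then the unweighted class mean."""
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```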