---
title: "ModernBERT for Patents: Faster Insights, Smarter Classification"
excerpt: "ModernBERT for complex patent classification, demonstrating >2x faster inference than standard BERT and state-of-the-art accuracy via a hierarchical loss. Introduces USPTO-3M, a public dataset of 3 million patents."
collection: portfolio
header:
  teaser: "/images/modernbert-patents/class_imbalance.png" # Top 30 class imbalance bar chart
# permalink: /portfolio/modernbert-patents/ # Optional: uncomment and adjust if needed
---

Patents are the bedrock of intellectual property, but navigating them is tough. Imagine millions of dense, technical documents filled with legal jargon: a huge challenge for lawyers, researchers, and innovators. Manually classifying these patents into specific technological categories (like 'A01B - Soil Working' vs. 'H01L - Semiconductor Devices') is slow, expensive, and crucial for tasks like prior art searches and R&D analysis. Can we automate this better and faster with modern AI?

### The Challenge: Taming the Patent Beast

Classifying patents automatically isn't easy:

1. **Scale:** Millions of patents exist, with more added daily. We need efficient solutions.
2. **Complexity:** Patents use specialized language and describe intricate inventions. Models need deep understanding.
3. **Fine-Grained Categories:** The classification system (the Cooperative Patent Classification, or CPC) is highly detailed, requiring nuanced distinctions.
4. **Data Imbalance:** Some patent categories are common while others are extremely rare, making it hard for models to learn the less frequent ones.

<!-- Generic diagram illustrating text classification -->
<p style="text-align: center;">
  <img src="/images/modernbert-patents/generic_classification_concept.png" alt="Generic diagram showing text input, processing/model, and classified output labels" style="max-width: 70%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
  <em>Figure 1: Conceptual overview of text classification: input text is processed by a model to assign predefined categories. [Image Source: Generic representation]</em>
</p>

### The Contender: ModernBERT Enters the Ring

For years, **BERT** (Bidirectional Encoder Representations from Transformers) has been a workhorse for understanding text. It's good, but technology moves fast! [2, 9]

**ModernBERT** is BERT's souped-up successor. [2, 5, 9] It incorporates newer architectural tweaks (RoPE embeddings, GeGLU activations, optimized attention) and training techniques designed for: [2, 6, 8]

* **Speed:** Faster inference, potentially 2-4x faster than older encoders. [1, 3]
* **Endurance:** Better handling of longer text sequences (up to 8192 tokens vs. BERT's 512). [2, 3, 6]
* **Efficiency:** Better performance and hardware utilization on standard GPUs. [2, 3, 5]

Our hypothesis: could ModernBERT's advantages make it a better fit for the demanding task of patent classification? [2]
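
As a concrete starting point, here's a minimal sketch of loading ModernBERT via the Hugging Face `transformers` API and checking its long-context tokenizer. Nothing here is specific to our training code; it just shows the checkpoint and context length discussed above.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the base ModernBERT checkpoint and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# ModernBERT natively supports sequences up to 8192 tokens,
# versus BERT's 512-token limit.
print(tokenizer.model_max_length)
```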

### Our Approach: Training the Specialist & Building the Dataset

We set out to answer two main questions:

1. **Direct Fine-tuning:** Can we take a general-purpose ModernBERT and simply fine-tune it on patent data to achieve good classification performance, potentially beating standard BERT? [11, 16, 21]
2. **Domain Pre-training Boost?** Patents have unique language ("said," "comprising," "wherein"). Would pre-training ModernBERT *further* on just patent text *before* the final classification fine-tuning give it an extra edge? [11]

**Introducing USPTO-3M:** To run these experiments, we needed data. We collected and publicly released **USPTO-3M**, a dataset of 3 million US patents from 2013-2015, sourced from Google BigQuery. This dataset is itself a contribution to the research community!
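
The dataset lives on the Hugging Face Hub, so it can be loaded with the `datasets` library. A minimal sketch (the split name is an assumption; check the dataset card for the actual schema):

```python
from datasets import load_dataset

# Stream USPTO-3M from the Hugging Face Hub instead of
# downloading all 3M records up front.
ds = load_dataset("MalavP/USPTO-3M", split="train", streaming=True)

# Inspect the fields of the first record (e.g., patent text, CPC labels).
for record in ds.take(1):
    print(record.keys())
```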

**The Imbalance Problem:** Like many real-world datasets, USPTO-3M has significant class imbalance. A few patent categories dominate while many are rare, following a long-tail distribution (Figures 2 and 3).

<!-- Top 30 class imbalance bar chart -->
<p style="text-align: center;">
  <img src="/images/modernbert-patents/class_imbalance.png" alt="Histogram showing the frequency of the top 30 patent classes, highlighting severe imbalance" style="max-width: 90%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
  <em>Figure 2: The top 30 patent classes (out of 665) make up almost half the dataset! This imbalance needs careful handling. (Figure adapted from project report)</em>
</p>

<!-- Log-log plot of CPC label frequencies -->
<p style="text-align: center;">
  <img src="/images/modernbert-patents/cpc_loglog.png" alt="Log-log plot showing CPC label frequencies vs. rank" style="max-width: 90%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
  <em>Figure 3: The log-log plot confirms the severe imbalance across all 665 classes, with a steep drop-off and a long tail of infrequent classes. (Figure adapted from project report)</em>
</p>

**Our Training Strategy:** We primarily fine-tuned ModernBERT by adding a classification layer on top and training it to predict the correct patent codes, a multi-label problem, since one patent can fit multiple categories [8, 11]. We used standard Binary Cross-Entropy loss; a minimal sketch of this setup follows Figure 4.

<!-- Generic diagram of fine-tuning -->
<p style="text-align: center;">
  <img src="/images/modernbert-patents/generic_finetuning_diagram.png" alt="Generic diagram showing a pre-trained model being adapted with new data/layers for a specific task" style="max-width: 80%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
  <em>Figure 4: Fine-tuning adapts a general pre-trained model (like ModernBERT) for a specialized task (like patent classification) using task-specific data. [Image Source: Generic representation]</em> [11, 16, 21]
</p>
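
To make the setup concrete, here is a minimal sketch of the multi-label fine-tuning head described above, using Hugging Face `transformers`. The 665-class count matches our label set; the example text, label indices, and sequence length are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_CLASSES = 665  # CPC subclasses in USPTO-3M

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

# problem_type="multi_label_classification" makes the model compute
# BCEWithLogitsLoss internally, so one patent can carry several labels.
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=NUM_CLASSES,
    problem_type="multi_label_classification",
)

text = "A tillage implement comprising a frame and ground-engaging tines..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

# Multi-hot target: this (hypothetical) patent belongs to two classes.
labels = torch.zeros(1, NUM_CLASSES)
labels[0, [12, 87]] = 1.0

loss = model(**inputs, labels=labels).loss  # BCE over all 665 classes
```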

### Putting it to the Test: Key Findings

We ran extensive experiments, comparing ModernBERT to standard BERT and testing our different training strategies. Here's what we found:

**Finding 1: ModernBERT is a Speed Demon!** 🚀

This was a major win. Our benchmarks (detailed in the paper) showed that when classifying a test set of 150,000 patents, **ModernBERT was over 2x faster than standard BERT** (4541 vs. 2224 samples/sec throughput) on the same high-end GPU (NVIDIA H200). This aligns with claims of ModernBERT being 2-4x faster than previous encoders. [1, 3]

* **Why this matters:** For systems handling millions of patents, this speedup translates directly to lower computational costs and faster results.
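
For reference, here's a minimal sketch of how such a throughput comparison can be measured. The batch size, sequence length, and dummy inputs are placeholders, not our exact benchmark configuration.

```python
import time
import torch
from transformers import AutoModelForSequenceClassification

def throughput(model_name: str, batch_size: int = 64, seq_len: int = 128,
               steps: int = 50) -> float:
    """Rough forward-pass samples/sec on dummy token ids."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=665).to(device).eval()
    ids = torch.randint(0, 1000, (batch_size, seq_len), device=device)
    with torch.inference_mode():
        for _ in range(5):  # warm-up iterations
            model(input_ids=ids)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(steps):
            model(input_ids=ids)
        if device == "cuda":
            torch.cuda.synchronize()
    return batch_size * steps / (time.perf_counter() - start)

print("ModernBERT:", throughput("answerdotai/ModernBERT-base"))
print("BERT:      ", throughput("bert-base-uncased"))
```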

**Finding 2: Accuracy Duel - Fine-tuning Works, Pre-training Less So (Here)**

* **Good News:** Standard fine-tuning of ModernBERT achieved accuracy (measured by F1@1) comparable to, and sometimes slightly better than, results reported for fine-tuned standard BERT (like PatentBERT).
* **Minimal Impact of Sequence Length:** Increasing sequence length from 128 to 1024 or 1536 tokens showed only marginal improvements, suggesting the crucial information often sits near the beginning of the patent text (Figure 5).
* **Pre-training Surprise:** The extra step of pre-training ModernBERT *only* on our patent dataset *before* fine-tuning *didn't* significantly improve overall results (Micro F1) in our setup, and it even slightly hurt average per-class performance (Macro F1) (Figures 6 and 7). Why? ModernBERT's initial massive pre-training may already be robust [1, 5], or our patent corpus, small relative to the initial 2 trillion tokens [2, 6], and our short pre-training phase were simply not enough to make a difference here. A sketch of this continued pre-training step follows this list.
* **Takeaway:** For this task, simply fine-tuning ModernBERT is an effective and efficient strategy.
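
For reference, a minimal sketch of what the domain pre-training step (masked language modeling on patent text) looks like with Hugging Face's `Trainer`. The tiny inline corpus and training arguments are placeholders.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# Placeholder corpus; in practice this would be USPTO-3M patent text.
texts = ["A method comprising a sensor, wherein said sensor measures..."]
tokenized_patents = [tokenizer(t, truncation=True, max_length=128)
                     for t in texts]

# Randomly masks 15% of tokens; the model learns to reconstruct them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-patent-mlm",
                           per_device_train_batch_size=32),
    train_dataset=tokenized_patents,
    data_collator=collator,
)
trainer.train()  # afterwards, fine-tune this checkpoint for classification
```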

<!-- Sequence length comparison -->
<p style="text-align: center;">
  <img src="/images/modernbert-patents/seqlen.png" alt="Micro F1 score during fine-tuning for different sequence lengths (128, 1024, 1536)" style="max-width: 70%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
  <em>Figure 5: Overall performance (Micro F1) during training is very similar across different input sequence lengths (128, 1024, 1536), with longer sequences showing only a slight edge later. (Figure adapted from project report)</em>
</p>

<!-- Pre-training vs. fine-tuning, Micro F1 -->
<p style="text-align: center;">
  <img src="/images/modernbert-patents/ptft_f1_micro.png" alt="Micro F1 score comparing pretraining+finetuning vs vanilla finetuning" style="max-width: 70%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
  <em>Figure 6: Overall performance (Micro F1) is nearly identical whether using vanilla fine-tuning or adding a domain pre-training step first. (Figure adapted from project report)</em>
</p>

<!-- Pre-training vs. fine-tuning, Macro F1 -->
<p style="text-align: center;">
  <img src="/images/modernbert-patents/ptft_f1_macro.png" alt="Macro F1 score comparing pretraining+finetuning vs vanilla finetuning" style="max-width: 70%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
  <em>Figure 7: Average per-class performance (Macro F1) was slightly lower when adding domain pre-training, suggesting vanilla fine-tuning was sufficient or even preferable here. (Figure adapted from project report)</em>
</p>

**Finding 3: Taming the Imbalance Boosts Rare Classes**

We experimented with weighting the loss function to pay more attention to rare classes ("balanced" weighting).
* **Effect:** It helped the model perform better *on average across all classes*, especially benefiting rare ones (higher macro-average precision), but slightly decreased the *overall* accuracy weighted by sample count (lower micro-average scores, not shown) (Figure 8).
* **Trade-off:** There's often a trade-off between optimizing for overall accuracy and ensuring performance across all classes, especially rare ones. A sketch of the weighting idea follows this list.
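
A minimal sketch of the "balanced" weighting idea: up-weight rare classes in the BCE loss via `pos_weight`. The inverse-frequency scheme and the dummy counts below are illustrative, not necessarily our exact formula.

```python
import torch

NUM_CLASSES = 665

# label_counts[c] = number of training patents tagged with class c
# (random here; in practice, count them over USPTO-3M).
label_counts = torch.randint(1, 100_000, (NUM_CLASSES,)).float()
n_samples = 3_000_000

# Inverse-frequency ("balanced") weights: rare classes get larger weights.
pos_weight = (n_samples - label_counts) / label_counts

# BCEWithLogitsLoss scales each class's positive term by pos_weight,
# so missing a rare class costs more than missing a common one.
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, NUM_CLASSES)   # model outputs for a batch of 8
targets = torch.zeros(8, NUM_CLASSES)
targets[:, 0] = 1.0                    # dummy multi-hot labels
print(loss_fn(logits, targets))
```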

<!-- Class weighting vs. fine-tuning, Macro Precision -->
<p style="text-align: center;">
  <img src="/images/modernbert-patents/cw_vs_ft_prec.png" alt="Macro Precision comparing class-weighted vs vanilla finetuning" style="max-width: 70%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
  <em>Figure 8: Using class weights improved the average precision across all classes (Macro Precision), particularly later in training, compared to standard fine-tuning. (Figure adapted from project report)</em>
</p>

**Finding 4: Climbing the Hierarchy for State-of-the-Art Results!** 🏆

Patent codes have a structure (Section > Class > Subclass). [15, 17, 20] Misclassifying `A01B` (Soil Working) as `A01C` (Planting) is arguably a "smaller" mistake than classifying it as `H01L` (Semiconductors), yet standard loss functions treat all mistakes equally.

We introduced a **Hierarchical Loss** function that penalizes "big jumps" in the hierarchy more than "small slips" (see the sketch after Figure 9). [4, 19]

<!-- Generic diagram of a hierarchy -->
<p style="text-align: center;">
  <img src="/images/modernbert-patents/generic_hierarchy_diagram.png" alt="Generic diagram showing a tree-like hierarchical structure" style="max-width: 60%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
  <em>Figure 9: Patent codes (like CPC) have a hierarchical structure. Our Hierarchical Loss function incorporates this knowledge, penalizing errors based on their distance in the hierarchy. [Image Source: Generic representation]</em> [15, 17]
</p>
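
Here's a minimal sketch of the idea, based on the description above: scale each class's BCE term by how far that class sits from the nearest true code in the CPC tree. The distance tiers and weights are illustrative assumptions, not our exact formulation.

```python
import torch
import torch.nn.functional as F

def cpc_distance(code_a: str, code_b: str) -> float:
    """Toy distance over CPC codes like 'A01B': 0 for the same subclass,
    0.25 for the same main class, 0.5 for the same section, 1.0 otherwise.
    (Illustrative tiers, not the paper's exact values.)"""
    if code_a == code_b:
        return 0.0
    if code_a[:3] == code_b[:3]:   # e.g. A01B vs. A01C
        return 0.25
    if code_a[0] == code_b[0]:     # e.g. A01B vs. A47J
        return 0.5
    return 1.0                     # e.g. A01B vs. H01L

def hierarchical_bce(logits: torch.Tensor, target: torch.Tensor,
                     codes: list) -> torch.Tensor:
    """Single-example BCE where each negative class is weighted by its
    hierarchy distance to the nearest true class, so 'big jumps' in the
    CPC tree cost more than 'small slips'."""
    per_class = F.binary_cross_entropy_with_logits(
        logits, target, reduction="none")
    true_codes = [codes[i] for i, t in enumerate(target) if t == 1.0]
    weights = torch.tensor([
        1.0 if t == 1.0 else min(cpc_distance(c, tc) for tc in true_codes)
        for c, t in zip(codes, target)
    ])
    return (weights * per_class).mean()

codes = ["A01B", "A01C", "A47J", "H01L"]        # tiny 4-class example
target = torch.tensor([1.0, 0.0, 0.0, 0.0])     # true class: A01B
logits = torch.tensor([0.2, 2.0, 2.0, 2.0])     # equally confident errors
print(hierarchical_bce(logits, target, codes))  # the H01L error dominates
```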

* **Result:** By combining this Hierarchical Loss with tuned training parameters (learning rate, weight decay) and slightly longer training (2 epochs), **our ModernBERT model surpassed the previous state-of-the-art F1@1 score reported by PatentBERT!**

  **Key Performance Comparison (F1 Score @ Top 1):**
  * PatentBERT (Baseline): ~65.9%
  * Our ModernBERT (Fine-tuned): ~65.9% - 66.1%
  * **Our ModernBERT (Hierarchical Loss + Tuned + 2 Epochs): 66.9%** ✨

* **Why this matters:** Understanding the *structure* of the classification problem can unlock better performance, and ModernBERT is capable of achieving SOTA results when trained carefully.
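
For clarity on the headline metric, here's a minimal sketch of one way an F1@1-style score can be computed for multi-label data: take each sample's single highest-scoring class and score that prediction against the true label set. This is our reading of the metric; the paper's exact protocol may differ in details.

```python
import torch

def f1_at_1(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Precision@1: fraction of samples whose top-1 prediction is a true
    label. Recall@1: that hit counted against each sample's label count.
    F1@1: their harmonic mean."""
    top1 = logits.argmax(dim=1)                        # (batch,)
    hit = targets[torch.arange(len(top1)), top1]       # 1.0 if top-1 is true
    precision = hit.mean().item()                      # one prediction/sample
    recall = (hit / targets.sum(dim=1)).mean().item()  # vs. # true labels
    return 2 * precision * recall / (precision + recall + 1e-12)

# Dummy batch: 4 samples, 665 classes, one true label each.
logits = torch.randn(4, 665)
targets = torch.zeros(4, 665)
targets[torch.arange(4), torch.randint(0, 665, (4,))] = 1.0
print(f1_at_1(logits, targets))
```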

### Why This Project Matters & Conclusion

This investigation demonstrates that **ModernBERT is a highly effective and significantly more efficient alternative to standard BERT for the complex task of patent classification.** [2, 9]

**Key Takeaways:**

1. **Speed & Efficiency:** ModernBERT offers substantial (>2x) inference speedups, crucial for real-world deployment. [1, 3]
2. **Strong Performance:** Standard fine-tuning yields results comparable to previous benchmarks.
3. **SOTA Potential:** By incorporating domain structure via Hierarchical Loss and careful tuning, ModernBERT can achieve state-of-the-art accuracy.
4. **Dataset Contribution:** We provide USPTO-3M, a large, valuable dataset for future research.
5. **Practical Insights:** Direct fine-tuning is often sufficient [11], sequence length had minimal impact, class weighting helps rare classes but may slightly reduce overall accuracy, domain pre-training needs careful consideration, and hierarchical loss provides an edge.

This work paves the way for faster, more accurate AI tools to help navigate the complex world of patents, potentially saving significant time and resources in legal tech, R&D, and innovation analysis.

---

**Code & Data Repository:** [**https://github.com/Malav-P/modernpatentBERT**](https://github.com/Malav-P/modernpatentBERT)
**Dataset:** [**https://huggingface.co/datasets/MalavP/USPTO-3M**](https://huggingface.co/datasets/MalavP/USPTO-3M)

### Key Technologies

* **Core Libraries:** Python, PyTorch, Hugging Face (Transformers, Datasets), Scikit-learn
* **Models:** `answerdotai/ModernBERT-base`, compared against standard BERT baselines. [1, 2, 5, 6, 8, 9]
* **Techniques:** Fine-tuning [11, 16, 21, 22], Masked Language Modeling (for the pre-training experiment) [6, 8], Multi-Label Classification [4], Binary Cross-Entropy Loss, Hierarchical Loss [4, 19], Class Weighting.
* **Evaluation:** F1@1, Precision@1, Recall@1 (top-1 metrics), Micro/Macro Averages.
* **Infrastructure:** Linux, Google BigQuery (data acquisition), NVIDIA GPUs (H100, L40S, H200 via PACE Cluster), Git.

---