Commit 4d69c31 ("mbertv1") by Gauthier Roy, 1 parent 0896133

_portfolio/modernpatentbert.md: 185 additions, 0 deletions
---
title: "ModernBERT for Patents: Faster Insights, Smarter Classification"
excerpt: "ModernBERT for complex patent classification, demonstrating >2x faster inference than traditional BERT with state-of-the-art accuracy using hierarchical loss. Introduces USPTO-3M, a large public dataset of 3 million patents."
collection: portfolio
header:
  teaser: "/images/modernbert-patents/class_imbalance.png" # Top 30 class imbalance bar chart
# permalink: /portfolio/modernbert-patents/ # Optional: uncomment and adjust if needed
---

Patents are the bedrock of intellectual property, but navigating them is tough. Imagine millions of dense, technical documents filled with legal jargon – a huge challenge for lawyers, researchers, and innovators. Manually classifying these patents into specific technological categories (like 'A01B - Soil Working' vs. 'H01L - Semiconductor Devices') is slow and expensive, yet crucial for tasks like prior art searches and R&D analysis. Can we automate this better and faster with modern AI?

### The Challenge: Taming the Patent Beast

Classifying patents automatically isn't easy:

1. **Scale:** Millions of patents exist, with more added daily. We need efficient solutions.
2. **Complexity:** Patents use specialized language and describe intricate inventions. Models need deep understanding.
3. **Fine-Grained Categories:** The classification system (like the Cooperative Patent Classification - CPC) is highly detailed, requiring nuanced distinctions.
4. **Data Imbalance:** Some patent categories are common, while others are extremely rare, making it hard for models to learn about the less frequent ones.

<!-- Generic diagram illustrating text classification -->
<p style="text-align: center;">
<img src="/images/modernbert-patents/generic_classification_concept.png" alt="Generic diagram showing text input, processing/model, and classified output labels" style="max-width: 70%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
<em>Figure 1: Conceptual overview of text classification: Input text is processed by a model to assign predefined categories. [Image Source: Generic representation]</em>
</p>

### The Contender: ModernBERT Enters the Ring

For years, **BERT** (Bidirectional Encoder Representations from Transformers) has been a workhorse for understanding text. It's good, but technology moves fast! [2, 9]

**ModernBERT** is like BERT's souped-up successor. [2, 5, 9] It incorporates newer architectural tweaks (like RoPE embeddings, GeGLU activations, and optimized attention) and training techniques designed for: [2, 6, 8]

* **Speed:** Faster processing (inference), potentially 2-4x faster than older models. [1, 3]
* **Endurance:** Better handling of longer text sequences (up to 8192 tokens vs. BERT's 512). [2, 3, 6]
* **Efficiency:** Optimized for better performance and hardware utilization on standard GPUs. [2, 3, 5]

Our hypothesis: Could ModernBERT's advantages make it a better fit for the demanding task of patent classification? [2]

### Our Approach: Training the Specialist & Building the Dataset

We set out to answer two main questions:

1. **Direct Fine-tuning:** Can we take a general-purpose ModernBERT and simply train it (fine-tune it) on patent data to achieve good classification performance, potentially beating standard BERT? [11, 16, 21]
2. **Domain Pre-training Boost?** Patents have unique language ("said," "comprising," "wherein"). Would "pre-training" ModernBERT *further* on just patent text *before* the final classification fine-tuning give it an extra edge? [11]
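The "pre-training" in question 2 is the masked-language-modeling (MLM) objective: hide a fraction of tokens and train the model to reconstruct them from context. A minimal sketch of the corruption step (the function name and parameters are ours, purely illustrative; real recipes, such as BERT's, also replace some masked positions with random or unchanged tokens):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", p=0.15, seed=0):
    """BERT-style MLM corruption: hide roughly p of the tokens and
    record the originals as the labels the model must reconstruct."""
    rng = random.Random(seed)
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            masked.append(mask_token)   # hidden position to predict
            targets.append((i, tok))    # original token is the label
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens(
    "a method comprising a sensor wherein said sensor".split())
```

Domain pre-training simply runs this objective over patent text before the classification fine-tuning begins.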
**Introducing USPTO-3M:** To run these experiments, we needed data. We collected and publicly released **USPTO-3M**, a dataset of 3 million US patents from 2013-2015, sourced from Google BigQuery. This dataset itself is a contribution to the research community!

**The Imbalance Problem:** Like many real-world datasets, USPTO-3M has a significant class imbalance. A few patent categories dominate, while many are rare, following a long-tail distribution. [Figure 2, Figure 3]

<!-- Top 30 class imbalance bar chart -->
<p style="text-align: center;">
<img src="/images/modernbert-patents/class_imbalance.png" alt="Histogram showing the frequency of the top 30 patent classes, highlighting severe imbalance" style="max-width: 90%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
<em>Figure 2: The top 30 patent classes (out of 665) make up almost half the dataset! This imbalance needs careful handling. (Figure adapted from project report)</em>
</p>

<!-- Log-log plot of CPC label frequencies -->
<p style="text-align: center;">
<img src="/images/modernbert-patents/cpc_loglog.png" alt="Log-log plot showing CPC Label Frequencies vs. Rank" style="max-width: 90%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
<em>Figure 3: The log-log plot confirms the severe imbalance across all 665 classes, with a steep drop-off and a long tail of infrequent classes. (Figure adapted from project report)</em>
</p>

**Our Training Strategy:** We primarily focused on fine-tuning ModernBERT by adding a classification layer on top and training it to predict the correct patent codes (a multi-label problem, as one patent can fit multiple categories) [8, 11]. We used standard techniques like Binary Cross-Entropy loss.

<!-- Generic diagram of fine-tuning -->
<p style="text-align: center;">
<img src="/images/modernbert-patents/generic_finetuning_diagram.png" alt="Generic diagram showing a pre-trained model being adapted with new data/layers for a specific task" style="max-width: 80%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
<em>Figure 4: Fine-tuning adapts a general pre-trained model (like ModernBERT) for a specialized task (like patent classification) using task-specific data. [Image Source: Generic representation]</em> [11, 16, 21]
</p>
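Concretely, the multi-label setup puts an independent sigmoid on each of the 665 CPC labels rather than one softmax over all of them, and sums per-label binary cross-entropy terms (in Hugging Face Transformers this corresponds to `problem_type="multi_label_classification"`). A minimal sketch of the objective itself, not the project's training code:

```python
import math

def bce_multilabel(logits, targets):
    """Binary cross-entropy over independent per-class sigmoids:
    each CPC code is its own yes/no decision, so a patent can
    belong to several classes at once."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid, not softmax
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)

# Confident correct predictions score far lower than confident wrong ones
good = bce_multilabel([8.0, -8.0, -8.0], [1, 0, 0])
bad = bce_multilabel([8.0, -8.0, -8.0], [0, 1, 1])
```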

### Putting it to the Test: Key Findings

We ran extensive experiments, comparing ModernBERT against standard BERT baselines and testing our different training strategies. Here's what we found:

**Finding 1: ModernBERT is a Speed Demon!** 🚀

This was a major win. Our benchmarks (detailed in the paper) showed that, when classifying a test set of 150,000 patents, **ModernBERT was over 2x faster than standard BERT** (4541 vs. 2224 samples/sec throughput) on the same high-end GPU (NVIDIA H200). This aligns with reported claims of ModernBERT being 2-4x faster than previous encoders. [1, 3]

* **Why this matters:** For systems handling millions of patents, this speedup translates directly into lower computational costs and faster results.
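For reference, throughput numbers like these come from a simple wall-clock measurement over batched inference. A minimal harness sketch (the paper's benchmark ran real model forward passes on GPU; `predict_fn` here is a stand-in):

```python
import time

def throughput(predict_fn, batches, batch_size):
    """Samples per second for batched inference: total samples
    processed divided by elapsed wall-clock time."""
    start = time.perf_counter()
    for batch in batches:
        predict_fn(batch)  # e.g. a model forward pass
    elapsed = time.perf_counter() - start
    return len(batches) * batch_size / elapsed

# Dummy workload just to exercise the harness
rate = throughput(lambda batch: [x * 2 for x in batch], [[1] * 32] * 10, 32)
```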

**Finding 2: Accuracy Duel - Fine-tuning Works, Pre-training Less So (Here)**

* **Good News:** Standard fine-tuning of ModernBERT achieved accuracy (measured by F1@1 score) comparable to, and sometimes slightly better than, results reported for fine-tuned standard BERT (like PatentBERT).
* **Minimal Impact of Sequence Length:** Increasing the sequence length from 128 to 1024 or 1536 tokens showed only marginal improvements, suggesting the crucial information is often near the beginning of the patent text. [Figure 5]
* **Pre-training Surprise:** The extra step of pre-training ModernBERT *only* on our patent dataset *before* fine-tuning *didn't* significantly improve overall results (Micro F1) in our setup. It even slightly hurt the average per-class performance (Macro F1). [Figure 6, Figure 7] Why? ModernBERT's initial massive pre-training might already be robust [1, 5], or our patent dataset (small relative to the initial 2 trillion training tokens [2, 6]) wasn't large enough, and the pre-training phase not long enough, to make a difference here.
* **Takeaway:** For this task, simply fine-tuning ModernBERT is an effective and efficient strategy.

<!-- Sequence length comparison -->
<p style="text-align: center;">
<img src="/images/modernbert-patents/seqlen.png" alt="Micro F1 score during fine-tuning for different sequence lengths (128, 1024, 1536)" style="max-width: 70%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
<em>Figure 5: Overall performance (Micro F1) during training is very similar across different input sequence lengths (128, 1024, 1536), with longer sequences showing only a slight edge later. (Figure adapted from project report)</em>
</p>

<!-- Pre-training vs. fine-tuning, Micro F1 -->
<p style="text-align: center;">
<img src="/images/modernbert-patents/ptft_f1_micro.png" alt="Micro F1 score comparing pretraining+finetuning vs vanilla finetuning" style="max-width: 70%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
<em>Figure 6: Overall performance (Micro F1) is nearly identical whether using vanilla fine-tuning or adding a domain pre-training step first. (Figure adapted from project report)</em>
</p>

<!-- Pre-training vs. fine-tuning, Macro F1 -->
<p style="text-align: center;">
<img src="/images/modernbert-patents/ptft_f1_macro.png" alt="Macro F1 score comparing pretraining+finetuning vs vanilla finetuning" style="max-width: 70%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
<em>Figure 7: Average per-class performance (Macro F1) was slightly lower when adding domain pre-training, suggesting vanilla fine-tuning was sufficient or even preferable here. (Figure adapted from project report)</em>
</p>
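Since the Micro vs. Macro F1 distinction drives how these comparisons read, here is a compact reminder of the difference, using the standard definitions with toy counts:

```python
def micro_macro_f1(per_class_counts):
    """Micro-F1 pools TP/FP/FN across classes (frequent classes dominate);
    macro-F1 averages each class's F1 equally (rare classes count as much).
    Input: list of (tp, fp, fn) tuples, one per class."""
    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    tp = sum(t for t, _, _ in per_class_counts)
    fp = sum(f for _, f, _ in per_class_counts)
    fn = sum(n for _, _, n in per_class_counts)
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in per_class_counts) / len(per_class_counts)
    return micro, macro

# A frequent class done well plus a rare class done poorly:
# micro still looks strong, macro exposes the rare-class failure
micro, macro = micro_macro_f1([(90, 5, 5), (1, 4, 4)])
```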

**Finding 3: Taming the Imbalance Boosts Rare Classes**

We experimented with weighting the loss function to pay more attention to rare classes ("balanced" weighting).

* **Effect:** It helped the model perform better *on average across all classes*, especially benefiting rare ones (higher macro-average precision), but slightly decreased the *overall* accuracy weighted by sample count (lower micro-average scores, not shown). [Figure 8]
* **Trade-off:** There's often a trade-off between optimizing for overall accuracy versus ensuring fairness/performance across all classes, especially rare ones.

<!-- Class weighting vs. fine-tuning, Macro Precision -->
<p style="text-align: center;">
<img src="/images/modernbert-patents/cw_vs_ft_prec.png" alt="Macro Precision comparing class-weighted vs vanilla finetuning" style="max-width: 70%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
<em>Figure 8: Using class weights improved the average precision across all classes (Macro Precision), particularly later in training, compared to standard fine-tuning. (Figure adapted from project report)</em>
</p>
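One common way to implement such "balanced" weighting is to give each class a weight proportional to its negative-to-positive ratio and pass the result to the loss (e.g. the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`). A sketch of the weight computation; the cap value and the exact scheme are illustrative assumptions, not necessarily the project's choice:

```python
from collections import Counter

def balanced_pos_weights(label_sets, num_classes, cap=100.0):
    """Per-class weight = negatives/positives, so rare classes
    contribute more to the loss; capped to avoid extreme values."""
    counts = Counter(c for labels in label_sets for c in labels)
    n_docs = len(label_sets)
    return [min((n_docs - counts.get(c, 0)) / max(counts.get(c, 0), 1), cap)
            for c in range(num_classes)]

# Toy corpus: class 0 appears in 3 of 4 docs, classes 1 and 2 in 1 each,
# so the rare classes receive the larger weights
w = balanced_pos_weights([[0], [0, 1], [0], [2]], num_classes=3)
```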

**Finding 4: Climbing the Hierarchy for State-of-the-Art Results!** 🏆

Patent codes have a structure (Section > Class > Subclass). [15, 17, 20] Misclassifying `A01B` (Soil Working) as `A01C` (Planting) is arguably a "smaller" mistake than classifying it as `H01L` (Semiconductors). Standard loss functions treat all mistakes equally.

We introduced a **Hierarchical Loss** function that penalizes "big jumps" in the hierarchy more than "small slips." [4, 19]

<!-- Generic diagram of a hierarchy -->
<p style="text-align: center;">
<img src="/images/modernbert-patents/generic_hierarchy_diagram.png" alt="Generic diagram showing a tree-like hierarchical structure" style="max-width: 60%; height: auto; display: block; margin-left: auto; margin-right: auto;">
</p>
<p style="text-align: center;">
<em>Figure 9: Patent codes (like CPC) have a hierarchical structure. Our Hierarchical Loss function incorporates this knowledge, penalizing errors based on their distance in the hierarchy. [Image Source: Generic representation]</em> [15, 17]
</p>
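To make the idea concrete: a CPC subclass symbol like `A01B` encodes section (`A`), class (`01`), and subclass (`B`), so a tree distance between two codes falls out of simple prefix matching. The sketch below scales a per-error penalty by that distance; this is an illustrative formulation, and the paper's exact loss may differ:

```python
def cpc_distance(code_a, code_b):
    """Distance in the CPC tree for subclass symbols like 'A01B':
    0 = same subclass, 1 = same class, 2 = same section, 3 = unrelated."""
    if code_a == code_b:
        return 0
    if code_a[:3] == code_b[:3]:  # A01B vs. A01C share class A01
        return 1
    if code_a[0] == code_b[0]:    # same section letter
        return 2
    return 3

def hierarchical_penalty(true_code, pred_code, base_loss=1.0):
    """Weight a misclassification by how far the wrong answer sits
    from the truth: 'big jumps' cost more than 'small slips'."""
    return base_loss * cpc_distance(true_code, pred_code)

small_slip = hierarchical_penalty("A01B", "A01C")  # sibling subclass
big_jump = hierarchical_penalty("A01B", "H01L")    # different section
```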

* **Result:** By combining this Hierarchical Loss with optimized training parameters (learning rate, weight decay) and slightly longer training (2 epochs), **our ModernBERT model surpassed the previous state-of-the-art F1@1 score reported by PatentBERT!**

**Key Performance Comparison (F1 Score @ Top 1):**

* PatentBERT (Baseline): ~65.9%
* Our ModernBERT (Fine-tuned): ~65.9% - 66.1%
* **Our ModernBERT (Hierarchical Loss + Tuned + 2 Epochs): 66.9%**

* **Why this matters:** This shows that understanding the *structure* of the classification problem can unlock better performance, and that ModernBERT is capable of achieving SOTA results when trained carefully.
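For clarity on the metric: F1@1 takes only each model's single top-scoring prediction per patent and checks it against the (possibly multiple) gold CPC codes. A sketch under one common definition (precision@1 over documents, recall@1 over gold labels); the exact formula used in the paper is an assumption here:

```python
def f1_at_1(top1_preds, gold_label_sets):
    """F1@1: one predicted code per document; a hit means the top-1
    prediction appears in that document's gold label set."""
    hits = sum(pred in gold for pred, gold in zip(top1_preds, gold_label_sets))
    precision = hits / len(top1_preds)  # one guess per document
    recall = hits / sum(len(gold) for gold in gold_label_sets)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Doc 1's top-1 hits one of its two gold codes; doc 2's top-1 misses
score = f1_at_1(["A01B", "H01L"], [{"A01B", "A01C"}, {"G06F"}])
```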

### Why This Project Matters & Conclusion

This investigation demonstrates that **ModernBERT is a highly effective and significantly more efficient alternative to standard BERT for the complex task of patent classification.** [2, 9]

**Key Takeaways:**

1. **Speed & Efficiency:** ModernBERT offers substantial (>2x) inference speedups, crucial for real-world deployment. [1, 3]
2. **Strong Performance:** Standard fine-tuning yields results comparable to previous benchmarks.
3. **SOTA Potential:** By incorporating domain structure via Hierarchical Loss and careful tuning, ModernBERT can achieve state-of-the-art accuracy.
4. **Dataset Contribution:** We provide USPTO-3M, a large, valuable dataset for future research.
5. **Practical Insights:** Direct fine-tuning is often sufficient. [11] Sequence length had minimal impact. Class weighting helps rare classes but may slightly reduce overall accuracy. Domain pre-training needs careful consideration. Hierarchical loss provides an edge.

This work paves the way for faster, more accurate AI tools to help navigate the complex world of patents, potentially saving significant time and resources in legal tech, R&D, and innovation analysis.

---

**Code & Data Repository:** [**https://github.com/Malav-P/modernpatentBERT**](https://github.com/Malav-P/modernpatentBERT)
**Dataset:** [**https://huggingface.co/datasets/MalavP/USPTO-3M**](https://huggingface.co/datasets/MalavP/USPTO-3M)

### Key Technologies

* **Core Libraries:** Python, PyTorch, Hugging Face (Transformers, Datasets), Scikit-learn
* **Models:** `answerdotai/ModernBERT-base`, compared against standard BERT baselines (PatentBERT). [1, 2, 5, 6, 8, 9]
* **Techniques:** Fine-tuning [11, 16, 21, 22], Masked Language Modeling (for the pre-training experiment) [6, 8], Multi-Label Classification [4], Binary Cross-Entropy Loss, Hierarchical Loss [4, 19], Class Weighting
* **Evaluation:** F1@1, Precision@1, Recall@1 (Top-1 metrics), Micro/Macro Averages
* **Infrastructure:** Linux, Google BigQuery (data acquisition), NVIDIA GPUs (H100, L40S, H200 via the PACE cluster), Git

---