1 Introduction
Figure 1. Evolution of patent applications and grants at IP5 offices from 2009 to 2019 (IP5, 2019).
2 Context and methods
2.1 Patent classification system
Table 1. IPC Areas of Technology.
Section | Description |
---|---|
A | Human Necessities |
B | Performing Operations; Transporting |
C | Chemistry; Metallurgy |
D | Textiles; Paper |
E | Fixed Constructions |
F | Mechanical Engineering; Lighting; Heating; Weapons; Blasting |
G | Physics |
H | Electricity |
2.2 Patents in Portugal
Figure 2. Patent applications in Portugal since 2010.
2.3 Patent classification model
Table 3. Studies related to patent classification.
Authors | Feature Engineering | Algorithm | Patent text used | Language | Dataset size | Number of classes |
---|---|---|---|---|---|---|
(Trappey et al., 2006) | Key phrase frequency based on TF-IDF | Neural Networks | full document | English | 300 training, 124 test | 9 |
(Derieux et al., 2010) | Term extraction and semantic relations | SVM | full document | English, German, French | 985 training, 2,000 test | 630 |
(Trappey et al., 2013) | Key phrase frequency based on TF-IDF | Ontology-Based Neural Network | full document | English | 333 training, 160 test | 23 |
(Zhang, 2014) | - | SVM | - | English | 5,000 | 5 |
(Wu et al., 2016) | SOM, KPCA | SVM | full document | English | 60,000 | 7 |
(Li et al., 2018) | Skip-gram | CNN | title and abstract | English | 742,097 training, 1,350 test | 637 |
(Risch & Krestel, 2019) | Domain-specific FastText word embeddings | Bi-directional GRU | title and abstract | English | ~1.7M training, ~300,000 test | 637 |
(Abdelgawad et al., 2020) | GloVe, Word2Vec, FastText | Hierarchical SVM and CNN with BOHB (Bayesian Optimization and Hyperband) | title, abstract, description, and claims | English | 75,000 training, 28,926 test | 451 |
(Lee & Hsiang, 2020) | - | BERT-Base | claims | English | 1,950,247 training, 150,000 test | 632 |
3 Data
Figure 3. Number of patents by section.
Figure 4. Frequency of classes by the number of patents.
Figure 5. Boxplot of text size by section, a) with outliers and b) without outliers.
Figure 6. Word cloud of the most frequent words by section.
4 Methodology
Figure 7. Applied methodology.
4.1 Data acquisition
Table 4. Features used in the analysis.
Feature | Description |
---|---|
id | Patent internal identification |
Title | Descriptive name of the patent |
Claims | The legal scope of the invention, including delimitations and application field |
Abstract | A brief description of the invention presented in the patent |
Section | IPC 1st level classification code |
Class | IPC 2nd level classification code |
Subclass | IPC 3rd level classification code |
Main group | IPC 4th level classification code |
Subgroup | IPC 5th level classification code |
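The five IPC levels listed in Table 4 nest inside one another, so all of them can be derived from a single full IPC symbol. The following is a minimal sketch of that decomposition, assuming symbols written in the common form `G06F 17/30`; the names `IpcCode` and `parse_ipc` and the regular expression are illustrative and are not taken from the study's pipeline.

```python
import re
from typing import NamedTuple


class IpcCode(NamedTuple):
    """The five hierarchical IPC levels of one classification symbol."""
    section: str      # 1st level, e.g. "G"
    ipc_class: str    # 2nd level, e.g. "G06"
    subclass: str     # 3rd level, e.g. "G06F"
    main_group: str   # 4th level, e.g. "G06F 17/00"
    subgroup: str     # 5th level, e.g. "G06F 17/30"


# Assumed layout: section letter A-H, two-digit class, subclass letter,
# main group number, a slash, then the subgroup number.
_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(symbol: str) -> IpcCode:
    """Split a full IPC symbol into its five levels."""
    match = _IPC_PATTERN.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"Not a full IPC symbol: {symbol!r}")
    section, cls, sub, group, subgroup = match.groups()
    subclass = f"{section}{cls}{sub}"
    return IpcCode(
        section=section,
        ipc_class=f"{section}{cls}",
        subclass=subclass,
        main_group=f"{subclass} {group}/00",
        subgroup=f"{subclass} {group}/{subgroup}",
    )


if __name__ == "__main__":
    print(parse_ipc("G06F 17/30"))
```

Each coarser level is a prefix of the finer one, which is why a classifier can be trained at the Section level and evaluated at the Class level from the same labels.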
4.2 Feature engineering
4.3 Modelling
4.4 Assessment
5 Results and discussion
Table 5. Mean F1 score (5-fold cross-validation) with different feature engineering methods.
Model | F1_weighted (%) |
---|---|
LinearSVC (baseline) | 60.8 |
CNN | 50 |
DistilBERT Multilingual | 50.1 |
BiLSTM | 57 |
ULMFiT | 57 |
BERT-Base Multilingual | 59.5 |
BERTimbau | 63.6 |
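The `F1_weighted` column name suggests a support-weighted average of per-class F1 scores. As a hedged illustration of that metric only, the sketch below computes it in plain Python on a made-up four-label toy example; the function name `weighted_f1` and the toy labels are assumptions, and the study's own evaluation code is not shown in the source.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)   # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += f1 * n_true / total
    return score


if __name__ == "__main__":
    # Hypothetical toy labels, not data from the study.
    truth = ["A", "A", "A", "G"]
    predicted = ["A", "A", "G", "G"]
    print(round(weighted_f1(truth, predicted), 4))
```

Because the weights follow class frequency, frequent sections influence this score more than rare ones, which matters for an imbalanced label distribution.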
Table 6. F1 score on the test set.
Model | F1_weighted (%) |
---|---|
LinearSVC (baseline) | 60.8 |
CNN | 50 |
DistilBERT Multilingual | 50.1 |
BiLSTM | 57 |
ULMFiT | 57 |
BERT-Base Multilingual | 59.5 |
BERTimbau | 63.6 |
Table 7. Precision, Recall and F1 score by Section.
Section | Precision | Recall | F1 |
---|---|---|---|
A | 0.7923 | 0.7 | 0.7433 |
H | 0.6805 | 0.7248 | 0.7019 |
C | 0.6515 | 0.7427 | 0.6941 |
E | 0.5981 | 0.5923 | 0.5952 |
F | 0.5446 | 0.5393 | 0.5419 |
B | 0.5291 | 0.5388 | 0.5339 |
G | 0.4903 | 0.4633 | 0.4764 |
D | 0.5098 | 0.4041 | 0.4509 |
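The F1 column in Table 7 can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. A minimal sketch, using three rows copied from the table; the function name `f1_from_pr` is illustrative, and small differences in the last decimal place are expected because the tabulated inputs are already rounded.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (precision, recall, reported F1) for sections A, H and G in Table 7.
    rows = {
        "A": (0.7923, 0.7, 0.7433),
        "H": (0.6805, 0.7248, 0.7019),
        "G": (0.4903, 0.4633, 0.4764),
    }
    for section, (precision, recall, reported) in rows.items():
        computed = f1_from_pr(precision, recall)
        print(f"{section}: computed={computed:.4f} reported={reported:.4f}")
```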
Figure 8. Most frequent words in class G06 alongside those of the most similar classes.