| 1 | +# Contributing to htmlparser |
| 2 | + |
| 3 | +## Adding New HTML Elements |
| 4 | + |
| 5 | +When adding new HTML elements to the parser, you must regenerate the element name hash tables in `src/nu/validator/htmlparser/impl/ElementName.java`. |
| 6 | + |
| 7 | +### Step 1: Add the new element constant |
| 8 | + |
| 9 | +Add a new `static final ElementName` constant for your element, following the existing pattern: |
| 10 | + |
| 11 | +```java |
| 12 | +public static final ElementName MYNEWELEMENT = new ElementName( |
| 13 | + "mynewelement", "mynewelement", |
| 14 | + // CPPONLY: NS_NewHTMLElement, |
| 15 | + // CPPONLY: NS_NewSVGUnknownElement, |
| 16 | + TreeBuilder.OTHER); |
| 17 | +``` |
| 18 | + |
| 19 | +The flags (like `TreeBuilder.OTHER`, `SPECIAL`, `SCOPING`, etc.) depend on how the element should be handled by the tree builder. |
| 20 | + |
| 21 | +### Step 2: Uncomment the code generation sections |
| 22 | + |
| 23 | +Uncomment three sections in `ElementName.java`: |
| 24 | + |
| 25 | +1. **The imports** near the top (~lines 26-39): |
| 26 | + - `java.io.*` |
| 27 | + - `java.util.*` |
| 28 | + - `java.util.regex.*` |
| 29 | + |
| 30 | +2. **`implements Comparable<ElementName>`** on the class declaration (~line 49) |
| 31 | + |
| 32 | +3. **The code generation block** marked with: |
| 33 | + `"START CODE ONLY USED FOR GENERATING CODE uncomment and run to regenerate"` |
| 34 | + That includes the `main()` method and helper functions (~lines 272-659) |
| 35 | + |
| 36 | +### Step 3: Add case to treeBuilderGroupToName() if needed |
| 37 | + |
| 38 | +If your element uses a new `TreeBuilder` group constant, add a case for it in the `treeBuilderGroupToName()` method within the code generation block. |
| 39 | + |
| 40 | +### Step 4: Compile and run |
| 41 | + |
| 42 | +Compile the project: |
| 43 | + |
| 44 | +```bash |
| 45 | +mvn compile |
| 46 | +``` |
| 47 | + |
| 48 | +Run the `ElementName` class with paths to the Gecko tag-list files: |
| 49 | + |
| 50 | +```bash |
| 51 | +java -cp target/classes nu.validator.htmlparser.impl.ElementName \ |
| 52 | + /path/to/nsHTMLTagList.h \ |
| 53 | + /path/to/SVGTagList.h |
| 54 | +``` |
| 55 | + |
| 56 | +**For Java-only builds** (not Gecko), you can use empty dummy files: |
| 57 | + |
| 58 | +```bash |
| 59 | +mkdir -p /tmp/tagfiles |
| 60 | +touch /tmp/tagfiles/nsHTMLTagList.h /tmp/tagfiles/SVGTagList.h |
| 61 | +java -cp target/classes nu.validator.htmlparser.impl.ElementName \ |
| 62 | + /tmp/tagfiles/nsHTMLTagList.h \ |
| 63 | + /tmp/tagfiles/SVGTagList.h |
| 64 | +``` |
| 65 | + |
| 66 | +> **Note:** Using empty files means the `CPPONLY` comments will all show `NS_NewHTMLUnknownElement`. For Gecko builds, use the actual files from mozilla-central: |
| 67 | +> - `parser/htmlparser/nsHTMLTagList.h` |
| 68 | +> - `dom/svg/SVGTagList.h` |
| 69 | +
|
| 70 | +### Step 5: Update the generated arrays |
| 71 | + |
| 72 | +The program outputs: |
| 73 | +1. All element constant definitions (with updated `CPPONLY` comments if using real Gecko tag files) |
| 74 | +2. The `ELEMENT_NAMES` array in level-order binary search tree order |
| 75 | +3. The `ELEMENT_HASHES` array with corresponding hash values |
| 76 | + |
| 77 | +Replace the existing `ELEMENT_NAMES` and `ELEMENT_HASHES` arrays in the file with the generated output. The arrays must stay in sync: the element at position N in `ELEMENT_NAMES` must have its hash at position N in `ELEMENT_HASHES`. |
| 78 | + |
| 79 | +### Step 6: Re-comment the code generation sections |
| 80 | + |
| 81 | +After regeneration, comment out the sections you uncommented in Step 2 to restore the file to its normal state. |
| 82 | + |
| 83 | +### Step 7: Run tests |
| 84 | + |
| 85 | +Verify your changes work correctly: |
| 86 | + |
| 87 | +```bash |
| 88 | +mvn test |
| 89 | +``` |
| 90 | + |
| 91 | +### Technical Details |
| 92 | + |
| 93 | +The hash function (`bufToHash`) computes an integer for each element name from the name's length and specific character positions; these integers must be unique across all element names. The arrays are organized as a level-order binary search tree for O(log n) lookup performance. |
| 94 | + |
| 95 | +If you encounter a hash collision (two elements with the same hash), the regeneration will report an error. Resolving it would require modifying the hash function, which has not been necessary historically. |
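
To make the level-order layout and the collision check concrete, here is a minimal, self-contained sketch. It is not the code in `ElementName.java`: the class `LevelOrderSketch`, the helpers `layOut`, `hashesOf`, and `search`, and the hash `toyHash` (a stand-in built on `String.hashCode()`, not `bufToHash`) are all hypothetical names chosen for illustration. It assumes the common implicit-tree convention in which the children of index `i` sit at `2i + 1` and `2i + 2`.

```java
import java.util.Arrays;
import java.util.Comparator;

final class LevelOrderSketch {
    // Toy stand-in hash for illustration only; NOT the project's bufToHash.
    static int toyHash(String name) {
        return name.hashCode();
    }

    // Returns the names rearranged into level order: root at index 0,
    // children of index i at 2i + 1 and 2i + 2. Throws on a hash collision.
    static String[] layOut(String[] names) {
        String[] sorted = names.clone();
        Arrays.sort(sorted, Comparator.comparingInt(LevelOrderSketch::toyHash));
        // After sorting by hash, any collision shows up as equal neighbours.
        for (int i = 1; i < sorted.length; i++) {
            if (toyHash(sorted[i - 1]) == toyHash(sorted[i])) {
                throw new IllegalStateException(
                        "Hash collision: " + sorted[i - 1] + " / " + sorted[i]);
            }
        }
        String[] out = new String[sorted.length];
        fill(sorted, out, 0, 0);
        return out;
    }

    // In-order walk of the implicit tree, consuming the sorted names one by one.
    // Returns the index of the next unused entry in `sorted`.
    private static int fill(String[] sorted, String[] out, int node, int next) {
        if (node >= out.length) {
            return next;
        }
        next = fill(sorted, out, 2 * node + 1, next);
        out[node] = sorted[next++];
        return fill(sorted, out, 2 * node + 2, next);
    }

    // Builds the parallel hash array: hashes[n] belongs to names[n].
    static int[] hashesOf(String[] names) {
        int[] hashes = new int[names.length];
        for (int i = 0; i < names.length; i++) {
            hashes[i] = toyHash(names[i]);
        }
        return hashes;
    }

    // Walks the implicit tree from the root; returns the index of key or -1.
    static int search(int[] hashes, int key) {
        int i = 0;
        while (i < hashes.length) {
            if (hashes[i] < key) {
                i = 2 * i + 2;      // key is larger: go to the right child
            } else if (hashes[i] > key) {
                i = 2 * i + 1;      // key is smaller: go to the left child
            } else {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] names = layOut(
                new String[] {"a", "b", "div", "p", "span", "table", "td"});
        int[] hashes = hashesOf(names);
        int i = search(hashes, toyHash("span"));
        System.out.println(i >= 0 ? names[i] : "not found");
    }
}
```

Because the two arrays share indices, a successful search over the hashes yields the position of the matching name directly, which is why the generated arrays have to be replaced together. Running the sketch should print `span`.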