Add HTML tokenizer #38

Implement a **streaming HTML tokenizer** in C# that works strictly **character by character** using a **state machine** and that can stop at an optional `stopDelimiter`. Reuse the code from `BaseSubTokenizer<TToken>`.

### **Public API**

The public API is handled by `BaseSubTokenizer<TToken>`, so make sure to inherit from this class:

```csharp
public sealed class HtmlTokenizer : BaseSubTokenizer<HtmlToken>
```

`BaseSubTokenizer<TToken>` declares an abstract method that must be overridden:

```csharp
internal protected override Task ParseAsync(CancellationToken ct)
```

A method that handles the `stopDelimiter` logic is already available (`TokenizeCharacters` in the snippet below), so implement the override as follows:

```csharp
    internal protected override Task ParseAsync(CancellationToken ct)
    {
        /* define additional state variables here */

        TokenizeCharacters(ct, (c) => ProcessChar(c, /* pass extra state variables here */));

        EmitPending(/* pass extra state variables here */);

        return Task.CompletedTask;
    }
```

### Similarities with XmlTokenizer

The implementation should be very similar to the `XmlTokenizer`, so mostly copy its token types. There is one major difference, though: HTML can contain CSS and JavaScript sections. Example:

```html
<!DOCTYPE html>
<html>
<head>
    <style>
        body { font-family: Arial, sans-serif; }
        .container { max-width: 600px; margin: 0 auto; }
        button { padding: 10px 15px; background: #007bff; color: white; border: none; cursor: pointer; }
    </style>
</head>
<body>
    <div class="container">
        <h1>Hello World</h1>
        <button onclick="alert('Clicked!')">Click Me</button>
    </div>
    <script>
        document.addEventListener('DOMContentLoaded', () => {
            console.log('Page loaded');
        });
    </script>
</body>
</html>
```

In this example, the content of the `style` element should be passed to the `CssTokenizer` with `</style>` as the `stopDelimiter`, and the content of the `script` element should be passed to the `TypescriptTokenizer` with `</script>` as the `stopDelimiter`.

The same mechanism already exists in the `MarkupTokenizer`, which also has inline sections that are handled by other tokenizers (see `ParseCodeInlines()`), so be sure to check that out.

You do not have to handle styling or script inside attributes; those can simply be emitted as `AttributeValue`.
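As a rough illustration of the hand-off decision, the sketch below maps an element name to the stop delimiter of its embedded section. `EmbeddedSections` and `TryGetStopDelimiter` are hypothetical names made up for this example and are not part of NTokenizers; how the sub-tokenizer is actually invoked should follow `ParseCodeInlines()`.

```csharp
using System;

// Hypothetical helper (not part of NTokenizers): decides whether the element
// that was just opened starts an embedded CSS or script section.
public static class EmbeddedSections
{
    // Returns true and the matching stop delimiter for style/script elements;
    // returns false for every other element, whose content is ordinary HTML.
    public static bool TryGetStopDelimiter(string elementName, out string stopDelimiter)
    {
        if (elementName.Equals("style", StringComparison.OrdinalIgnoreCase))
        {
            stopDelimiter = "</style>";   // content would go to the CssTokenizer
            return true;
        }

        if (elementName.Equals("script", StringComparison.OrdinalIgnoreCase))
        {
            stopDelimiter = "</script>";  // content would go to the TypescriptTokenizer
            return true;
        }

        stopDelimiter = string.Empty;
        return false;
    }
}
```

The comparison is case-insensitive because HTML element names are, so `<STYLE>` and `<style>` would be treated the same way.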

---

### **Behavior**

1. Read the input character by character from a `StreamReader`.
2. Recognize tokens and emit them via `onToken` as soon as they are fully recognized.
3. Only read ahead when it is really necessary.
4. Parse best-effort; no strict validation.
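A minimal sketch of this behavior, assuming a two-state machine that only distinguishes text from tag content. All names here (`SketchTokenKind`, `SketchTokenizer`) are invented for illustration and do not reflect the real `HtmlTokenType` or the `BaseSubTokenizer<TToken>` API; attributes, comments and closing slashes are deliberately not handled.

```csharp
using System;
using System.IO;
using System.Text;

// Illustrative token kinds only; the real token types should follow XmlTokenizer.
public enum SketchTokenKind { OpenBracket, ElementName, CloseBracket, Text }

public static class SketchTokenizer
{
    private enum State { Text, TagName }

    public static void Tokenize(TextReader reader, Action<SketchTokenKind, string> onToken)
    {
        var state = State.Text;
        var buffer = new StringBuilder();
        int next;

        // Read one character at a time; this subset needs no lookahead.
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            switch (state)
            {
                case State.Text:
                    if (c == '<')
                    {
                        // Emit buffered text as soon as the tag start is seen.
                        Flush(SketchTokenKind.Text, buffer, onToken);
                        onToken(SketchTokenKind.OpenBracket, "<");
                        state = State.TagName;
                    }
                    else
                    {
                        buffer.Append(c);
                    }
                    break;

                case State.TagName:
                    if (c == '>')
                    {
                        Flush(SketchTokenKind.ElementName, buffer, onToken);
                        onToken(SketchTokenKind.CloseBracket, ">");
                        state = State.Text;
                    }
                    else
                    {
                        buffer.Append(c);
                    }
                    break;
            }
        }

        // Best effort: emit whatever is still buffered at end of input.
        var pending = state == State.Text ? SketchTokenKind.Text : SketchTokenKind.ElementName;
        Flush(pending, buffer, onToken);
    }

    private static void Flush(SketchTokenKind kind, StringBuilder buffer, Action<SketchTokenKind, string> onToken)
    {
        if (buffer.Length == 0) return;
        onToken(kind, buffer.ToString());
        buffer.Clear();
    }
}
```

Calling `SketchTokenizer.Tokenize(new StringReader("<p>Hi</p>"), (kind, value) => Console.WriteLine($"{kind}: {value}"))` emits each token the moment its closing character is read, which is the property the list above asks for.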

---

### **Implementation details**

Take `XmlTokenizer` as an example.

**Goal**:

* Fully streaming HTML tokenizer
* Supports embedded CSS (`style`) and JavaScript (`script`) sections
* Matches the style of the XML tokenizer in the `NTokenizers` repo
* HTML elements such as `p`, `div`, etc. are simply emitted as `ElementName`.

---

### **xUnit tests**

* Place the tests in `tests\NTokenizers.Tests\HtmlTokenizerTests.cs`.
* Test all tokens.
* Assert input and output (the `Parse` method returns a string).
* Test that parsing stops correctly at a given `stopDelimiter`.
* Test `script` and `style` elements.
* Test the cancellation token.

---

### **Showcase project**

* Add a showcase project just like the XML one.
* Use HTML with a style section and a JavaScript section.

---

### **Folder structure**

```
project-root/
│
├── src/
│   └── NTokenizers/
│       └── Html/
│           ├── HtmlToken.cs
│           ├── HtmlTokenType.cs
│           └── HtmlTokenizer.cs
│
└── tests/
    ├── NTokenizers.ShowCase.Html/
    │   ├── NTokenizers.ShowCase.Html.csproj
    │   └── Program.cs
    └── NTokenizers.Tests/
        └── HtmlTokenizerTests.cs
```

* `HtmlTokenType.cs`: defines `HtmlTokenType`.
* `HtmlToken.cs`: defines `HtmlToken`.
* `HtmlTokenizer.cs`: contains the `HtmlTokenizer.Parse` method.
* `NTokenizers.ShowCase.Html.csproj`: showcase project.
* `Program.cs`: HTML showcase program logic.
* `HtmlTokenizerTests.cs`: xUnit tests for all token types, keywords, comments, operators, `stopDelimiter`, etc.

---

## Documentation

- Add Html to the lists in the root README.md (leave the example untouched).
- Add Html to the Description lists in NTokenizers.json.
- Add an html.md file to the docs folder (take json.md as inspiration for the content).
- Add html to _config.yml.

## Bonus implementation double nesting

It would be nice if the `MarkdownTokenizer` could hand off to the `HtmlTokenizer` (instead of the `XmlTokenizer`), which in turn can hand off to the `CssTokenizer`, and so on. I have not really thought this through, but if it is possible, it would be nice if this double nesting just worked. If not, we will pick it up in a later iteration.

