Description
Implement a streaming HTML tokenizer in C# that works strictly character by character using a state machine and that can stop at an optional stopDelimiter. Reuse code from BaseSubTokenizer.
Public API
The public API is handled by BaseSubTokenizer<TToken>, so make sure to inherit from this class:
public sealed class HtmlTokenizer : BaseSubTokenizer<HtmlToken>
The BaseSubTokenizer contains an abstract method, so make sure to override that one:
internal protected override Task ParseAsync(CancellationToken ct)
There is also already a method available that handles the stopDelimiter logic. So override as follows:
internal protected override Task ParseAsync(CancellationToken ct)
{
/* define additional state variables here */
TokenizeCharacters(ct, (c) => ProcessChar(c, /* pass extra state variables here */));
EmitPending(/* pass extra state variables here */);
return Task.CompletedTask;
}
Similarities with XmlTokenizer
The implementation should be very similar to the XmlTokenizer, so the token types can mostly be copied as well. There is one large difference, though: HTML can contain CSS and JavaScript sections. Example:
<!DOCTYPE html>
<html>
<head>
<style language="javascript">
body { font-family: Arial, sans-serif; }
.container { max-width: 600px; margin: 0 auto; }
button { padding: 10px 15px; background: #007bff; color: white; border: none; cursor: pointer; }
</style>
</head>
<body>
<div class="container">
<h1>Hello World</h1>
<button onclick="alert('Clicked!')">Click Me</button>
</div>
<script>
document.addEventListener('DOMContentLoaded', () => {
console.log('Page loaded');
});
</script>
</body>
</html>
In this example, the content of the style element should be passed to the CssTokenizer, with </style> as the stopDelimiter, and the content of the script element should be passed to the TypescriptTokenizer, with </script> as the stopDelimiter.
This same mechanism is already present in the MarkupTokenizer which also has inline sections that are handled by other tokenizers (see ParseCodeInlines()), so be sure to check that out.
You do not have to handle styling or script within attributes. Those can simply be emitted as AttributeValue, so nothing special is needed there.
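The hand-off to a sub-tokenizer can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual NTokenizers code: `RawTextSketch`, `StopDelimiterFor`, and `ReadUntil` are hypothetical names, and because the signatures of CssTokenizer and TypescriptTokenizer are not shown in this issue, the sketch only collects the raw text up to the stop delimiter instead of calling them. In the real implementation the stopDelimiter handling in BaseSubTokenizer would presumably take over this part.

```csharp
#nullable enable
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only; not part of NTokenizers.
public static class RawTextSketch
{
    // Returns the stop delimiter for elements whose content belongs to
    // another tokenizer, or null for ordinary elements such as p or div.
    public static string? StopDelimiterFor(string elementName) =>
        elementName.ToLowerInvariant() switch
        {
            "style" => "</style>",
            "script" => "</script>",
            _ => null
        };

    // Reads one character at a time until the stop delimiter has been
    // consumed (or the input ends) and returns the content before it.
    public static string ReadUntil(TextReader reader, string stopDelimiter)
    {
        var content = new StringBuilder();
        int next;
        while ((next = reader.Read()) != -1)
        {
            content.Append((char)next);
            if (EndsWith(content, stopDelimiter))
            {
                // Drop the delimiter itself from the collected content.
                content.Length -= stopDelimiter.Length;
                break;
            }
        }
        return content.ToString();
    }

    // Case-insensitive check whether the buffer currently ends with the suffix.
    private static bool EndsWith(StringBuilder buffer, string suffix)
    {
        if (buffer.Length < suffix.Length) return false;
        int offset = buffer.Length - suffix.Length;
        for (int i = 0; i < suffix.Length; i++)
        {
            if (char.ToLowerInvariant(buffer[offset + i]) != char.ToLowerInvariant(suffix[i]))
                return false;
        }
        return true;
    }
}
```

After the opening tag of a style or script element has been emitted, the tokenizer would look up the delimiter with `StopDelimiterFor` and let the nested tokenizer consume the reader up to that delimiter. The reader is then positioned right after `</style>` or `</script>`, so normal HTML tokenizing can continue. Note that this stand-in buffers the whole section, which a truly streaming sub-tokenizer would avoid.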
Behavior
- Read the input character by character from a StreamReader.
- Recognize and emit tokens immediately via onToken as soon as they are fully recognized.
- Only read ahead when it is really necessary.
- Parse best-effort, no strict validation.
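The behavior listed above can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not the intended implementation: `SketchTokenType`, `SketchToken`, and `SketchTokenizer` are hypothetical names, the sketch does not inherit from BaseSubTokenizer, and it ignores attributes, closing-tag slashes, comments, and the stopDelimiter. It only shows the pattern of reading one character at a time, emitting through an `onToken` callback as soon as a token is complete, and checking the cancellation token per character.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;

// Hypothetical token kinds for this sketch; the real HtmlTokenType may differ.
public enum SketchTokenType { OpeningAngleBracket, ClosingAngleBracket, ElementName, Text }

public readonly record struct SketchToken(SketchTokenType Type, string Value);

public static class SketchTokenizer
{
    private enum State { Text, TagName }

    // Reads one character at a time and emits each token as soon as it is complete.
    public static void Tokenize(TextReader reader, Action<SketchToken> onToken, CancellationToken ct = default)
    {
        var state = State.Text;
        var buffer = new StringBuilder();
        int next;
        while ((next = reader.Read()) != -1)
        {
            ct.ThrowIfCancellationRequested();
            char c = (char)next;
            switch (state)
            {
                case State.Text:
                    if (c == '<')
                    {
                        // A tag starts: emit any pending text first.
                        Flush(buffer, SketchTokenType.Text, onToken);
                        onToken(new SketchToken(SketchTokenType.OpeningAngleBracket, "<"));
                        state = State.TagName;
                    }
                    else
                    {
                        buffer.Append(c);
                    }
                    break;

                case State.TagName:
                    if (c == '>')
                    {
                        Flush(buffer, SketchTokenType.ElementName, onToken);
                        onToken(new SketchToken(SketchTokenType.ClosingAngleBracket, ">"));
                        state = State.Text;
                    }
                    else
                    {
                        buffer.Append(c);
                    }
                    break;
            }
        }

        // Best effort: emit whatever is still buffered at end of input.
        var pending = state == State.Text ? SketchTokenType.Text : SketchTokenType.ElementName;
        Flush(buffer, pending, onToken);
    }

    // Emits the buffered characters as one token and clears the buffer.
    private static void Flush(StringBuilder buffer, SketchTokenType type, Action<SketchToken> onToken)
    {
        if (buffer.Length == 0) return;
        onToken(new SketchToken(type, buffer.ToString()));
        buffer.Clear();
    }
}
```

No lookahead is needed for these two states. In a fuller tokenizer, `reader.Peek()` would be reserved for the few places where one character is not enough to decide, for example telling `<!--` apart from `<!DOCTYPE`.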
Implementation details
Take XmlTokenizer as an example.
- Goal:
- Fully streaming html tokenizer
- Supports both block-style and flow-style CSS
- Matches the style of the XML tokenizer in the NTokenizers repo
- HTML elements such as p or div, etc. will just be emitted as ElementName.
xUnit tests
- Place tests in tests\NTokenizers.Tests\HtmlTokenizerTests.cs.
- Test all tokens
- Assert input and output (The Parse method returns a string)
- Test that parsing stops correctly at a given stopDelimiter.
- Test script and style elements
- Test cancellation token
Showcase project
- Add a showcase project just like the XML one.
- Use HTML with a style and a JavaScript section
Folder structure
project-root/
|
+- src/
|  +- NTokenizers/
|     +- Html/
|        +- HtmlToken.cs
|        +- HtmlTokenizer.cs
|
+- tests/
   +- NTokenizers.ShowCase.Html/
   |  +- NTokenizers.ShowCase.Html.csproj
   |  +- Program.cs
   +- NTokenizers.Tests/
      +- HtmlTokenizerTests.cs
- HtmlTokenType.cs: defines HtmlTokenType.
- HtmlToken.cs: defines HtmlToken.
- HtmlTokenizer.cs: contains the HtmlTokenizer.Parse method.
- NTokenizers.ShowCase.Html.csproj: showcase project.
- Program.cs: HTML showcase program logic.
- HtmlTokenizerTests.cs: xUnit tests for all token types, keywords, comments, operators, stopDelimiter, etc.
Documentation
- Add Html to the lists in the root README.md (leave the example untouched)
- Add Html to the Description lists in NTokenizers.json
- Add an html.md file to the docs folder (for content inspiration, take json.md)
- Add html to _config.yml.
Bonus implementation double nesting
It would be nice if the MarkdownTokenizer could handle HTML tokenizing (instead of using the XmlTokenizer), which in turn can handle CSS tokenizing, etc. I have not really thought that through, but if it is possible, it would be nice if this double nesting just works. If not, we will pick that up in a later iteration.