Add html tokenizer #38

@crwsolutions

Implement a streaming HTML tokenizer in C# that works strictly character by character using a state machine and that can stop at an optional stopDelimiter. Reuse code from BaseSubTokenizer.

Public API

The public API is handled by BaseSubTokenizer<TToken>, so make sure to inherit from this class:

public sealed class HtmlTokenizer : BaseSubTokenizer<HtmlToken>

The BaseSubTokenizer contains an abstract method, so make sure to override that one:

internal protected override Task ParseAsync(CancellationToken ct)

There is also already a method available that handles the stopDelimiter logic. So override as follows:

    internal protected override Task ParseAsync(CancellationToken ct)
    {
        /* define additional state variables here */

        TokenizeCharacters(ct, (c) => ProcessChar(c, /* pass extra state variables here */));

        EmitPending(/* pass extra state variables here */);

        return Task.CompletedTask;
    }

Similarities with XmlTokenizer

The implementation should be very similar to the XmlTokenizer, so the token types can mostly be copied from it. There is one large difference, though: HTML can contain CSS and JavaScript sections. Example:

<!DOCTYPE html>
<html>
<head>
    <style>
        body { font-family: Arial, sans-serif; }
        .container { max-width: 600px; margin: 0 auto; }
        button { padding: 10px 15px; background: #007bff; color: white; border: none; cursor: pointer; }
    </style>
</head>
<body>
    <div class="container">
        <h1>Hello World</h1>
        <button onclick="alert('Clicked!')">Click Me</button>
    </div>
    <script>
        document.addEventListener('DOMContentLoaded', () => {
            console.log('Page loaded');
        });
    </script>
</body>
</html>

In this example, the content of the style element should be passed to the CssTokenizer with </style> as the stopDelimiter, and the content of the script element should be passed to the TypescriptTokenizer with </script> as the stopDelimiter.

This same mechanism is already present in the MarkupTokenizer which also has inline sections that are handled by other tokenizers (see ParseCodeInlines()), so be sure to check that out.
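
To illustrate the stop-delimiter idea in isolation, here is a minimal sketch. It does not use BaseSubTokenizer or any NTokenizers API; `StopDelimiterSketch` and `ReadUntil` are hypothetical names introduced only for this example, and the case-insensitive match is an assumption (HTML tag names are case-insensitive). It reads one character at a time and stops as soon as the text read so far ends with the delimiter.

```csharp
using System;
using System.IO;
using System.Text;

public static class StopDelimiterSketch
{
    // Hypothetical helper, not part of NTokenizers: reads characters one at a
    // time and returns everything that precedes stopDelimiter.
    public static string ReadUntil(TextReader reader, string stopDelimiter)
    {
        // The delimiter is optional: without one, consume the whole input.
        if (string.IsNullOrEmpty(stopDelimiter))
        {
            return reader.ReadToEnd();
        }

        var buffer = new StringBuilder();
        int next;
        while ((next = reader.Read()) != -1)
        {
            buffer.Append((char)next);
            if (EndsWith(buffer, stopDelimiter))
            {
                // Drop the delimiter itself and stop reading.
                buffer.Length -= stopDelimiter.Length;
                break;
            }
        }
        return buffer.ToString();
    }

    // Case-insensitive check (an assumption) that buffer ends with suffix.
    private static bool EndsWith(StringBuilder buffer, string suffix)
    {
        if (buffer.Length < suffix.Length) return false;
        int offset = buffer.Length - suffix.Length;
        for (int i = 0; i < suffix.Length; i++)
        {
            if (char.ToLowerInvariant(buffer[offset + i]) != char.ToLowerInvariant(suffix[i]))
            {
                return false;
            }
        }
        return true;
    }
}
```

Called with a `StringReader` over `body { margin: 0; }</style></head>` and the delimiter `</style>`, this returns the CSS text and leaves the reader positioned at `</head>`, so the outer tokenizer can continue from there. The real implementation would forward characters to the sub-tokenizer instead of collecting them in a string.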

You do not have to handle styling or script within attributes; those can simply be emitted as AttributeValue, so nothing special is needed there.


Behavior

  1. Read the input character by character from a StreamReader.
  2. Recognize tokens and emit them directly via onToken as soon as they are fully recognized.
  3. Only read ahead when it is really necessary.
  4. Parse best-effort, no strict validation.
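
The behavior above can be sketched as a small state machine. This is a deliberately reduced illustration, not the intended implementation: `SketchTokenType`, `HtmlStateMachineSketch` and `Tokenize` are hypothetical names, the token kinds are not the real HtmlTokenType values, and everything between `<` and `>` is emitted as a single ElementName token (no attributes, comments, or closing-tag handling).

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical token kinds for this sketch only.
public enum SketchTokenType { OpenBracket, ElementName, CloseBracket, Text }

public static class HtmlStateMachineSketch
{
    private enum State { Text, InTag }

    // Reads one character at a time and calls onToken as soon as a token is complete.
    public static void Tokenize(TextReader reader, Action<SketchTokenType, string> onToken)
    {
        var state = State.Text;
        var pending = new StringBuilder();
        int next;
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            switch (state)
            {
                case State.Text when c == '<':
                    // Text before the tag is complete: emit it, then the bracket.
                    Flush(pending, SketchTokenType.Text, onToken);
                    onToken(SketchTokenType.OpenBracket, "<");
                    state = State.InTag;
                    break;
                case State.InTag when c == '>':
                    Flush(pending, SketchTokenType.ElementName, onToken);
                    onToken(SketchTokenType.CloseBracket, ">");
                    state = State.Text;
                    break;
                default:
                    // No token boundary yet: keep accumulating.
                    pending.Append(c);
                    break;
            }
        }

        // Best effort: emit whatever is left at end of input, without validation.
        var last = state == State.Text ? SketchTokenType.Text : SketchTokenType.ElementName;
        Flush(pending, last, onToken);
    }

    private static void Flush(StringBuilder pending, SketchTokenType type, Action<SketchTokenType, string> onToken)
    {
        if (pending.Length == 0) return;
        onToken(type, pending.ToString());
        pending.Clear();
    }
}
```

No lookahead is needed here because each state decides on the current character alone. A real tokenizer would add states for attributes, comments, and the DOCTYPE, and would only peek ahead where a single character is ambiguous.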

Implementation details

Take XmlTokenizer as an example.

  1. Goal:
  • Fully streaming HTML tokenizer
  • Supports both block-style and flow-style CSS
  • Matches the style of the XML tokenizer in the NTokenizers repo
  • HTML elements such as p or div are simply emitted as ElementName.

xUnit tests

  • Place the tests in tests\NTokenizers.Tests\HtmlTokenizerTests.cs.
  • Test all tokens
  • Assert input and output (the Parse method returns a string)
  • Test that parsing stops correctly at a given stopDelimiter.
  • Test script and style elements
  • Test the cancellation token

Showcase project

  • Add a showcase project just like the XML one.
  • Use HTML with a style section and a JavaScript section

Folder structure

project-root/
│
├─ src/
│  └─ NTokenizers/
│     └─ Html/
│        ├─ HtmlToken.cs
│        └─ HtmlTokenizer.cs
│
└─ tests/
   ├─ NTokenizers.ShowCase.Html/
   │  ├─ NTokenizers.ShowCase.Html.csproj
   │  └─ Program.cs
   └─ NTokenizers.Tests/
      └─ HtmlTokenizerTests.cs

  • HtmlTokenType.cs : defines HtmlTokenType.
  • HtmlToken.cs : defines HtmlToken.
  • HtmlTokenizer.cs : contains the HtmlTokenizer.Parse method.
  • NTokenizers.ShowCase.Html.csproj : showcase project.
  • Program.cs : HTML showcase program logic.
  • HtmlTokenizerTests.cs : xUnit tests for all token types, keywords, comments, operators, stopDelimiter, etc.

Documentation

  • Add Html to the lists in the README.md in the root (leave the example untouched)
  • Add Html to the Description lists in NTokenizers.json
  • Add an html.md file to the docs folder (for content inspiration, take json.md)
  • Add html to the _config.yml.

Bonus implementation: double nesting

It would be nice if the MarkdownTokenizer could hand off to the HtmlTokenizer (instead of the XmlTokenizer), which in turn can hand off to the CssTokenizer, etc. I have not really thought that through. If it is possible, it would be nice if this double nesting just works; if not, we will pick it up in a later iteration.
