How To Meet Ladies with AngleSharp

Florian Rappl, MVP Visual C#

How To Meet Ladies

HTML with AngleSharp in .NET

AngleSharp DOM Manipulation

Florian Rappl

Writer and consultant

  • Microsoft C# MVP and CodeProject MVP
  • Active contributions to open-source projects
  • Company workshops, talks and IT consulting

Languages and technologies

  • C#, JavaScript and C/C++
  • Full web stack (client and server)
  • High Performance and Embedded Computing

Agenda

  1. What is AngleSharp?
  2. Special rules for special ladies
  3. Get ladies, get them fast
  4. How to tell peers about it
  5. Being responsive
  6. Next generation

A little task ...

  • Get the content of a webpage
  • However:
    • The page requires login
    • The login is secured (verification token)
    • The address of the page changes
    • Only the link on the startpage is consistent
  • How do you solve this problem?
DOM
"The .NET world has definitely longed for a comprehensive functional DOM model, which is something I never had the energy to do with my own project!"

Jamie Treworgy, creator of CsQuery

AngleSharp in words

  • AngleSharp is the ultimate HTML5 parser
  • It can be easily extended
  • The project is fully standards (W3C) driven
  • It comes with a CSS(3) parser
  • Basic understanding of HTTP
  • Full DOM interaction model

AngleSharp in a picture

Performance

AngleSharp Parsing Time

Extensibility

AngleSharp Extensions

Solving the task with AngleSharp

  • Solve it like in the browser, just in code:
    • Navigate to the startpage
    • Login with provided credentials
    • Navigate to link's reference
    • Get the content
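
The steps above, sketched offline in Python (standard library only). This is not AngleSharp's API; the page markup, the field names, the token value, the link id and the example.com URLs are all made up for illustration, and no request is actually sent. The sketch also assumes the login form and the stable link sit on the same start page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class PageScanner(HTMLParser):
    """Collects the form target, form fields and identified links of a page."""

    def __init__(self):
        super().__init__()
        self.form_action = None
        self.fields = {}   # input name -> value
        self.links = {}    # link id -> href

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.form_action = attrs.get("action")
        elif tag == "input" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value") or ""
        elif tag == "a" and "id" in attrs and "href" in attrs:
            self.links[attrs["id"]] = attrs["href"]


def scan(html):
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    return scanner


# Hypothetical start page: token value and URLs are invented.
START_PAGE = """
<form action="/account/login" method="post">
  <input type="hidden" name="__RequestVerificationToken" value="abc123">
  <input type="text" name="User">
  <input type="password" name="Password">
</form>
<a id="secret" href="/pages/2015-07-26-x9f">Members area</a>
"""
BASE = "https://example.com/"

# Step 1: "navigate" to the start page (here: parse the canned markup).
page = scan(START_PAGE)
# Step 2: fill in credentials, keeping the hidden token the server issued.
page.fields.update({"User": "me", "Password": "s3cret"})
login_url = urljoin(BASE, page.form_action)
login_body = urlencode(page.fields)
# Step 3: the link on the start page is the only stable handle.
target_url = urljoin(BASE, page.links["secret"])
```

A real client would POST `login_body` to `login_url`, keep the session cookie, and then GET `target_url`.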
Special Ladies Rules

Special Algorithms of HTML

  • Context creation
  • Scoping rules
  • Foster parenting*
  • Heisenberg algorithm*
  • Formatting reconstruction

* discussion postponed.

google.com/error

<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<meta name=viewport content="initial-scale=1, ...">
<title>Error 404 (Not Found)!!1</title>
<style>/* ... */</style>
<a href=//www.google.com/><span id=logo aria-label=Google></span></a>
<p><b>404.</b> <ins>That’s an error.</ins>
<p>The requested URL <code>/error</code> was not found on this server.
<ins>That’s all we know.</ins>
			
  • head and body inserted
  • Tags automatically closed
  • p elements are closed through scoping rules
  • Document finished as it should be
  • Optional quotes (simple values)
  • The document has 0 errors
  • ⇒ Perfectly valid HTML5!

Google Error DOM Tree

Scoping Rules

  • Triggered on certain elements
  • Adhere to:
    • General [button in, e.g. html]
    • Lists (list item scope) [li in, e.g. ul]
    • Buttons (button scope) [p in, e.g. button]
    • Cells (table scope) [tr in, e.g. table]
    • Options (select scope) [select in, e.g. option]
  • Pops elements off the stack of open elements until the triggering element has been removed
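
A minimal sketch of the scope check in Python, assuming the stack of open elements is a plain list of tag names (outermost first). Only HTML elements are listed as boundaries; the MathML and SVG entries of the real rule are left out, and the function names are my own.

```python
# Boundary elements per scope (HTML namespace only).
GENERAL = {"applet", "caption", "html", "table", "td", "th",
           "marquee", "object", "template"}
LIST_ITEM = GENERAL | {"ol", "ul"}
BUTTON = GENERAL | {"button"}
TABLE = {"html", "table", "template"}


def has_element_in_scope(open_elements, target, boundary):
    """Walk the stack from the innermost element outward."""
    for name in reversed(open_elements):
        if name == target:
            return True       # found before hitting a boundary
        if name in boundary:
            return False      # a boundary element hides the target
    return False


def has_element_in_select_scope(open_elements, target):
    """Select scope is inverted: all but optgroup and option are boundaries."""
    for name in reversed(open_elements):
        if name == target:
            return True
        if name not in {"optgroup", "option"}:
            return False
    return False


# A p opened before a button is in general scope, but not in button scope.
stack = ["html", "body", "p", "button"]
in_general = has_element_in_scope(stack, "p", GENERAL)
in_button = has_element_in_scope(stack, "p", BUTTON)
```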

Tokenizer State

HTML5 Tokenizer States

Reconstruct Active Formatting

  • Simplest question: What does the DOM tree look like for the following code?
    <p>1<b>2<i>3</b>4</i>5</p>
    					

Reconstruct Active Formatting

  • Answer: Different! The parser inserts an element that is not in the source (a second i around the 4).
    <p>1<b>2<i>3</b>4</i>5</p>
    					

    HTML5 Reconstruct Active Formatting

Fast Girls

Critical Rendering Path

  • Goal: Paint information as fast as possible
  • Essential: Need to understand HTML parser

HTML Critical Rendering Path

Biggest Issue: Network

Snail Network
  • Latency and bandwidth are our biggest constraints
  • Issues with testing: Local != Global
  • The page should work fine on mobile networks

Mobile Latency

  • Latency for different mobile network generations
    • LTE < 100 ms
    • HSPA+ 100-300 ms
    • HSPA 150-500 ms
    • EDGE 300-750 ms
    • GPRS 700-1000 ms
  • ⇒ It should be obvious that minimizing requests is a good idea
  • High Performance Browser Networking

One Second

HTML Latency Mobile Generation

  • In a second we only have 400ms to render
  • But more resources may have to be loaded as well
  • Question: What are best practices to minimize overhead?
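
The arithmetic behind the one-second budget can be sketched as follows. The breakdown is an assumption chosen to reproduce the 400 ms figure: three sequential round trips (DNS lookup, TCP handshake, HTTP request/response) at 200 ms each. The function name is hypothetical.

```python
def render_budget_ms(round_trip_ms, round_trips, budget_ms=1000):
    """Time left for parsing and painting after sequential network round trips."""
    return budget_ms - round_trip_ms * round_trips


# Assumed: DNS, TCP handshake and one HTTP exchange, 200 ms each.
remaining = render_budget_ms(round_trip_ms=200, round_trips=3)

# Every additional blocking request costs at least one more round trip.
remaining_with_blocking_script = render_budget_ms(round_trip_ms=200, round_trips=4)
```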
Steve Souders: High Performance Web Sites
Steve Souders: Even Faster Web Sites

A Little Update

  • Most best practices are still valid
  • However, script tags do not necessarily have to go at the bottom of the page
  • Reason? Consider we have a single loader script:
    • decorated with defer and async
    • loading only the required resources (scripts)
    • already fully using the event loop for processing
    • all downloads happening in parallel: guaranteed
  • CDN may be bad for several reasons - be cautious

Scripts

  • Historically scripts are required to pause parsing
  • Hence every script costs parsing time while it is fetched and executed
  • Why? document.write!
  • Changing the source on the fly comes at a cost
  • The insertion point is set only for parser-inserted scripts

StyleSheets

  • Parsing does not stop for stylesheets
  • However, rendering is postponed until stylesheets are ready
  • Outstanding stylesheets therefore prevent rendering

CSS StyleSheet Blocking

The Cure?!

  • Abuse media attribute for switching on / off:
    <link rel="stylesheet" href="css.css" media="none"
          onload="if(media!='all')media='all'">
    					
  • Use it for fonts with base64 encoded content

CSS StyleSheet Deferring

What Else?

  • Use DNS prefetching, ETags and more
  • Minimize your images
  • Minify JavaScript and StyleSheets
  • Identify and remove unused CSS rules
  • Aggregate icons to fonts or spritesheets
Share Secret With Friends

HTTP

  • Builds upon TCP/IP (OSI layer 5-7)
  • Is a stateless protocol: request / response / done
  • Messages consist of header and body
  • Everything is plaintext, but the body may be binary
  • Headers determine how the body should be interpreted

Example

GET /docs/index.html HTTP/1.1
Host: www.test101.com
Accept: image/gif, image/jpeg, */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla
				
HTTP/1.1 200 OK
Date: Sun, 26 Jul 2015 08:56:53 GMT
Server: Apache/2.2.14 (Win32)
Last-Modified: Thu, 20 Nov 2014 07:16:26 GMT
ETag: "10000000565a5-2c-3e94b66c2e680"
Accept-Ranges: bytes
Content-Length: 44
Connection: close
Content-Type: text/html
X-Pad: avoid browser bug
  
<html><body><h1>It works!</h1></body></html>
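
A small sketch of how such a message splits into status line, headers and body, in Python. It assumes a complete HTTP/1.x response held in memory and keeps only the last value of a repeated header; `parse_response` is a hypothetical helper, and `RAW` repeats a shortened version of the response above.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP/1.x response into status, reason, headers and body."""
    # An empty line (CRLF CRLF) separates the header block from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    _version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status), reason, headers, body


RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 44\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body><h1>It works!</h1></body></html>"
)

status, reason, headers, body = parse_response(RAW)
# The headers tell the client how to read the body: its length and its type.
body_matches_length = len(body) == int(headers["content-length"])
```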
				

HTTP Verbs

  • GET: Defines an idempotent request without a body.
  • POST: Sends data via an optional body for modification.
  • PUT: An idempotent request potentially with a body.
  • DELETE: A destructive, idempotent request without a body.

More verbs are specified, but they are rarely or never used.

Sending Forms

  • Request determined by protocol, verb and encoding type
  • Default method of transmission is GET
  • Default encoding type is application/x-www-form-urlencoded
  • There is also text/plain, which just uses plaintext
  • Most interesting is multipart/form-data:
    • Transfer in binary form
    • Possibility to attach files
    • Separated by arbitrary boundary
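
The two most common encodings side by side, sketched in Python. The field names, values and the boundary string are invented; the `multipart` helper is a simplification that handles text fields only (no file parts, no per-part content type).

```python
from urllib.parse import urlencode

fields = {"user": "florian", "comment": "a b&c"}

# application/x-www-form-urlencoded: key=value pairs joined by "&".
urlencoded = urlencode(fields)


def multipart(fields, boundary):
    """Build a multipart/form-data body; every part starts with the boundary."""
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")  # closing delimiter
    return "".join(parts).encode("utf-8")


# The boundary is arbitrary, but must not occur inside any value.
body = multipart(fields, boundary="sketch-boundary-42")
content_type = "multipart/form-data; boundary=sketch-boundary-42"
```

Note that the multipart body carries the value `a b&c` untouched, while the URL-encoded form has to escape it.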
Mobile Responsive Design

CSS Media Queries

  • Apply different styles for different devices
  • HTML and JavaScript not influenced
  • Branch in (CSS) code, or before loading

Responsive Images

Picture Responsive
  • Problem: Load device-specific images
  • Solution 1: Branch via CSS and media queries
    • Problem: CSS bloat and strong coupling
  • Solution 2: Use JavaScript for the loading
    • Problem: Requires JavaScript

There's already something ...

  • Media elements cover this already
  • Source selection depends on the browser's capabilities
  • We may give hints (e.g., encoding)
  • But <source> would break <img>
HTML Video element

The picture Element

  • New element that follows media elements
  • Define element in conjunction with children
  • Source selection is, however, more complicated
  • Use special syntax to find match

A simple Example

<picture>
  <source 
    media="(min-width: 650px)"
    srcset="images/kitten-stretching.png">
  <source 
    media="(min-width: 465px)"
    srcset="images/kitten-sitting.png">
  <img 
    src="images/kitten-curled.png" 
    alt="a cute kitten">
</picture>
			

A complete Example

<picture>
  <source media="(min-width: 650px)" 
          srcset="images/kitten-stretching.png,
                  images/kitten-stretching@1.5x.png 1.5x,  
                  images/kitten-stretching@2x.png 2x">
  <source media="(min-width: 465px)" 
          srcset="images/kitten-sitting.png,
                  images/kitten-sitting@1.5x.png 1.5x,
                  images/kitten-sitting@2x.png 2x">
  <img src="images/kitten-curled.png" 
       srcset="images/kitten-curled@1.5x.png 1.5x,
               images/kitten-curled@2x.png 2x">
</picture>
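
How a browser might walk such markup, as a rough Python sketch. The actual choice among candidates is left to the browser, so the policy here (first source whose min-width matches, then the smallest density that covers the device pixel ratio) is an assumption. `parse_srcset` and `pick_image` are hypothetical helpers; the parser only understands comma-separated entries with `Nx` descriptors, and the img's src is modelled as the 1x entry of the fallback list.

```python
def parse_srcset(srcset):
    """Return (url, density) pairs; an entry without a descriptor counts as 1x."""
    candidates = []
    for entry in srcset.split(","):
        tokens = entry.split()
        if not tokens:
            continue
        density = float(tokens[1][:-1]) if len(tokens) > 1 else 1.0
        candidates.append((tokens[0], density))
    return candidates


def pick_image(sources, fallback_srcset, viewport_width, device_pixel_ratio):
    """sources: list of (min_width_px, srcset) in document order."""
    chosen = fallback_srcset
    for min_width, srcset in sources:
        if viewport_width >= min_width:   # first matching <source> wins
            chosen = srcset
            break
    candidates = sorted(parse_srcset(chosen), key=lambda c: c[1])
    for url, density in candidates:
        if density >= device_pixel_ratio:
            return url
    return candidates[-1][0]              # nothing dense enough: take the largest


SOURCES = [
    (650, "images/kitten-stretching.png, images/kitten-stretching@1.5x.png 1.5x, "
          "images/kitten-stretching@2x.png 2x"),
    (465, "images/kitten-sitting.png, images/kitten-sitting@1.5x.png 1.5x, "
          "images/kitten-sitting@2x.png 2x"),
]
FALLBACK = ("images/kitten-curled.png, images/kitten-curled@1.5x.png 1.5x, "
            "images/kitten-curled@2x.png 2x")

tablet_retina = pick_image(SOURCES, FALLBACK, viewport_width=500, device_pixel_ratio=2)
```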
			
Web Bricks

Web Components

  • Primary goal: Decomposition
  • Reduce dependencies, improve maintainability
  • Package CSS, JS and HTML to form new elements
  • Requires a set of new web technologies

Shadow DOM

  • DOM behind a certain element
  • Used implicitly for input elements
  • Comes with own styling, behavior and markup
  • Appears to be a normal element, but has another tree attached
  • The host element carries its own DOM tree ("shadow DOM")
  • Behavior is controlled by shadow DOM
  • Regular children are neither drawn nor in control
Shadow DOM
The Basics of the Shadow DOM

Templates

  • Templating built into HTML is now possible
  • No more text transportation via script tags
  • Content stored as document fragment - can be cloned
  • Content is already parsed as valid HTML
  • Less pressure on the parser, better performance

Mutation Observer

  • Watch for DOM changes without the performance penalty
  • Problem with "old" events: change triggers change, ...
  • Mutation observers work with the event loop
  • Batched reporting of events
  • Reports may contain:
    • Child nodes, such as elements
    • Changes to attributes (optionally filtered)
    • Text that has been modified
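
The batching idea as a toy model in Python. This is not the DOM MutationObserver API: `ToyDocument` and its methods are invented, and `deliver()` merely stands in for the point at which the browser hands queued records to the observers.

```python
class ToyDocument:
    """Queues change records and reports them to observers in one batch."""

    def __init__(self):
        self.attributes = {}
        self.children = []
        self._observers = []
        self._pending = []

    def observe(self, callback):
        self._observers.append(callback)

    def set_attribute(self, name, value):
        old = self.attributes.get(name)
        self.attributes[name] = value
        self._pending.append(("attributes", name, old))   # record, do not notify

    def append_child(self, child):
        self.children.append(child)
        self._pending.append(("childList", child, None))

    def deliver(self):
        """Hand all queued records to every observer at once."""
        records, self._pending = self._pending, []
        if records:
            for callback in self._observers:
                callback(records)


batches = []
doc = ToyDocument()
doc.observe(batches.append)
doc.set_attribute("class", "a")
doc.set_attribute("class", "b")
doc.append_child("p")
doc.deliver()   # three changes arrive as a single batch
doc.deliver()   # nothing pending, so no callback
```

Because observers are only called from `deliver()`, a change made inside a callback cannot re-enter the callback synchronously, which is the "change triggers change" problem of the old events.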
Browsing Context

What others say

Takeaways

  • AngleSharp is super useful
  • The web is more interesting than ever
  • AngleSharp gives us a really nice playground
  • More cool things are about to be standardized
  • We will try to integrate those things
Womanizer

Thanks for your attention

  • Feel free to contact me