macsencasaus/html-tokenizer-hs

HTML Tokenizer

An HTML tokenizer library for Haskell. It tokenizes HTML from a ByteString.

The single Tokenizer.hs file exposes the types

-- | Errors that can occur during tokenization
data TokenizerError
  = -- Encountered a character it wasn't expecting to see
    Unexpected Char
  | -- Encountered EOF before it was expected
    UnexpectedEOF
  deriving (Eq, Show)

and

-- | Represents types of Tokens we can tokenize
-- The tokenizer will produce a list of tokens
data Token
  = -- Text Node
    TextToken Text
  | -- <a>
    StartTagToken Text AttributeMap
  | -- </a>
    EndTagToken Text
  | -- <br/>
    SelfClosingTagToken Text AttributeMap
  | -- <!--x-->
    CommentToken Text
  | -- <!DOCTYPE x>
    DoctypeToken Text
  deriving (Eq, Show)

where

-- | Map for attributes : Attribute Name -> Attribute Value
type AttributeMap = Map.Map Text Text

as well as one function

tokenizer :: ByteString -> Either TokenizerError [Token]

which tokenizes a ByteString containing HTML and returns either the list of tokens or the first error encountered.
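
As a rough illustration of how a caller might consume the result, the sketch below pattern matches on both sides of the Either and reads attributes out of an AttributeMap. The types are re-declared locally so the snippet compiles without the library; in a real program they would come from Tokenizer.hs. The helpers `hrefs` and `report`, the `sample` token list, and the assumption that tag names are stored as bare names such as "a" are illustrative only and are not part of the library.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Map as Map
import Data.Maybe (mapMaybe)
import Data.Text (Text)
import qualified Data.Text as T

-- Local copies of the README's types so this sketch stands alone.
-- In a real program, import them from Tokenizer.hs instead.
type AttributeMap = Map.Map Text Text

data TokenizerError
  = Unexpected Char
  | UnexpectedEOF
  deriving (Eq, Show)

data Token
  = TextToken Text
  | StartTagToken Text AttributeMap
  | EndTagToken Text
  | SelfClosingTagToken Text AttributeMap
  | CommentToken Text
  | DoctypeToken Text
  deriving (Eq, Show)

-- | Hypothetical helper: the href of every <a> start tag that has one.
-- Assumes tag names are stored as bare names such as "a".
hrefs :: [Token] -> [Text]
hrefs = mapMaybe href
  where
    href (StartTagToken "a" attrs) = Map.lookup "href" attrs
    href _ = Nothing

-- | Hypothetical helper: turn a tokenizer result into a one-line summary.
report :: Either TokenizerError [Token] -> Text
report (Left (Unexpected c)) = T.pack ("unexpected character: " ++ show c)
report (Left UnexpectedEOF) = "unexpected end of input"
report (Right tokens) = T.unwords (hrefs tokens)

-- | Hand-written tokens standing in for the output of `tokenizer`.
sample :: [Token]
sample =
  [ StartTagToken "a" (Map.fromList [("href", "https://example.com")])
  , TextToken "Example"
  , EndTagToken "a"
  , StartTagToken "a" Map.empty -- an anchor without an href is skipped
  ]

main :: IO ()
main = putStrLn (T.unpack (report (Right sample)))
```

With the real library, `report (tokenizer bytes)` would take the place of `report (Right sample)`.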

See Example.hs for a small example of a program used to scrape the title of a website using the tokenizer.
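
The following is not the contents of Example.hs, only a minimal sketch of how a title could be pulled out of a token list once the page has been fetched and tokenized. It re-declares the Token type locally so it compiles on its own, skips the HTTP request entirely, and works on a hand-written token list. The function name `titleOf`, the `sample` tokens, and the case-insensitive match on the tag name are assumptions made for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Map as Map
import Data.Text (Text)
import qualified Data.Text as T

-- Local copies of the README's types so this sketch stands alone.
type AttributeMap = Map.Map Text Text

data Token
  = TextToken Text
  | StartTagToken Text AttributeMap
  | EndTagToken Text
  | SelfClosingTagToken Text AttributeMap
  | CommentToken Text
  | DoctypeToken Text
  deriving (Eq, Show)

-- | Hypothetical helper: the text node that directly follows the first
-- <title> start tag, with surrounding whitespace stripped.
-- Returns Nothing when there is no <title> or the title is empty.
titleOf :: [Token] -> Maybe Text
titleOf tokens =
  case dropWhile (not . isTitleStart) tokens of
    (_ : TextToken t : _) -> Just (T.strip t)
    _ -> Nothing
  where
    isTitleStart (StartTagToken name _) = T.toLower name == "title"
    isTitleStart _ = False

-- | Hand-written tokens standing in for a tokenized page.
sample :: [Token]
sample =
  [ StartTagToken "html" Map.empty
  , StartTagToken "head" Map.empty
  , StartTagToken "title" Map.empty
  , TextToken "  Example Domain  "
  , EndTagToken "title"
  , EndTagToken "head"
  , EndTagToken "html"
  ]

main :: IO ()
main = print (titleOf sample) -- prints: Just "Example Domain"
```

Returning a Maybe keeps the "no title found" case explicit, which matters when scraping pages that may not have a title at all.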

Dependencies

For the tokenizer:

build-depends:
  base >= 4.14 && < 5,
  bytestring >= 0.10.12.0,
  text >= 1.2.4.0,
  containers >= 0.6.2.0

For the example:

  http-conduit >= 2.3.0          

For the tests:

  hspec >= 2.9.0                 
