UltimoScraper

A web scraper designed with ethics and responsibility at its core. UltimoScraper provides a flexible, extensible framework for web scraping while respecting website resources and following best practices.

UltimoScraper's main function is to scan an entire website for pages that match the keywords you provide. It can also extract specific pieces of content (such as social media links) along the way. Once the crawl completes, you can access a dictionary where each key is a keyword and the value is the list of pages that matched it. From there you can perform more in-depth data extraction on those pages based on the keywords that were matched.

Ethical Scraping

Robots.txt Compliance

UltimoScraper includes RobotsTxtRetriever.cs to fetch and parse robots.txt files. Always retrieve and respect the ignore rules:

var robotsRetriever = serviceProvider.GetRequiredService<IRobotsTxtRetriever>();
var ignoreRules = await robotsRetriever.GetIgnoreRulesAsync("https://example.com");

Request Throttling

Built-in throttling prevents overwhelming target websites, ensuring scraping doesn't negatively impact their performance.

Max Depth & Pages

Sites are hierarchical, so a max depth of 3 means the scraper will only follow links three levels deep from the home page (Home Page -> Blog -> Blog Post). This is important for sites with an enormous amount of content. You can also specify a maximum number of unique pages to scrape; once that limit is reached, scraping stops. The scraper exits as soon as it hits either the max depth or the max pages.
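
For example, a crawl capped at a depth of 3 and 100 unique pages could be started like this. This is a minimal sketch, assuming an injected IWebParser (shown below) and reusing the ignoreRules and keywords described elsewhere in this README; the limit values are illustrative:

// Stops at whichever limit is hit first: depth 3 (Home Page -> Blog -> Blog Post)
// or 100 unique pages scraped.
var parsedSite = await _webParser.ParseSite(
    domain: "https://example.com",
    ignoreRules: ignoreRules,
    keywords: keywords,
    maxDepth: 3,
    maxPages: 100);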

The web parser is the main entry point you interact with, via the IWebParser interface. Here are the methods it offers:

public interface IWebParser
{
    Task<ParsedSite> ParseSite(
        string domain,
        IList<IgnoreRule> ignoreRules,
        IList<Keyword> keywords,
        int maxDepth,
        int maxPages,
        string sessionName = null);

    Task<IList<string>> KeywordSearch(
        string domain,
        IList<IgnoreRule> ignoreRules,
        IList<Keyword> keywords,
        IList<Keyword> searchKeywords,
        string sessionName = null);

    Task<ParsedPage> ParsePage(
        string domain,
        string path,
        IList<IgnoreRule> ignoreRules,
        IList<Keyword> keywords,
        string sessionName = null);
}
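
The exact shape of ParsedSite is not reproduced here. As a rough sketch of consuming the keyword-to-pages dictionary described in the overview, assuming the result exposes it directly (the KeywordMatches property name is hypothetical; check ParsedSite for the real member):

// parsedSite comes from a ParseSite call like the one in the previous section.
// KeywordMatches is a hypothetical name for the keyword -> matched pages dictionary
// described in the overview above.
foreach (var entry in parsedSite.KeywordMatches)
{
    Console.WriteLine($"{entry.Key}: {entry.Value.Count} matching page(s)");
}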

Extensibility

Core Retrievers

Extend scraping functionality by implementing:

  • ILinkRetriever - Extract links from pages
  • ITitleRetriever - Extract page titles
  • IListRetriever - Extract lists of elements
  • IListItemRetriever - Extract individual items from lists

Retrievers query the DOM for specific types of elements, while processors filter out those elements based on your specific needs. Here is an example of the default link retriever:

public class DefaultLinkRetriever : ALinkRetriever, ILinkRetriever
{
    public DefaultLinkRetriever(IEnumerable<ILinkProcessor> linkProcessors) : base(linkProcessors)
    {
    }

    public async Task<IList<ParsedWebLink>> GetLinks(
        HtmlNode htmlNode,
        IList<IgnoreRule> linkIgnoreRules,
        IList<Keyword> keywords)
    {
        // Collect every anchor with a non-empty href, then run the configured
        // link processors and ignore rules over the results.
        var links = htmlNode.QuerySelectorAll("a")
            .Select(a => new ParsedWebLink
            {
                Value = HttpUtility.HtmlDecode(a.GetAttributeValue("href", null)),
                Text = a.GetOnlyInnerText()
            })
            .Where(u => !string.IsNullOrEmpty(u.Value));

        return await ProcessLinks(links, linkIgnoreRules, keywords);
    }
}
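
A custom retriever follows the same shape. As a sketch, a retriever that only collects links inside a main content container could look like the following (the MainContentLinkRetriever name and the #content selector are illustrative, not part of the library):

public class MainContentLinkRetriever : ALinkRetriever, ILinkRetriever
{
    public MainContentLinkRetriever(IEnumerable<ILinkProcessor> linkProcessors) : base(linkProcessors)
    {
    }

    public async Task<IList<ParsedWebLink>> GetLinks(
        HtmlNode htmlNode,
        IList<IgnoreRule> linkIgnoreRules,
        IList<Keyword> keywords)
    {
        // Only look at anchors inside the #content container, then reuse the base
        // ProcessLinks pipeline so link processors and ignore rules still apply.
        var links = htmlNode.QuerySelectorAll("#content a")
            .Select(a => new ParsedWebLink
            {
                Value = HttpUtility.HtmlDecode(a.GetAttributeValue("href", null)),
                Text = a.GetOnlyInnerText()
            })
            .Where(l => !string.IsNullOrEmpty(l.Value));

        return await ProcessLinks(links, linkIgnoreRules, keywords);
    }
}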

Processors

Retrievers work with processors to filter and refine results, ensuring you only capture the data you need.

public class FacebookLinkProcessor : ILinkProcessor
{
    public async Task<bool> Process(ParsedWebLink webLink, IList<Keyword> keywords)
    {
        // Returns true when the link points at a Facebook page or profile.
        var match = Regex.Match(webLink.Value, @"(?:(?:http|https):\/\/)?(?:www.)?facebook.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[?\w\-]*\/)?(?:profile.php\?id=(?=\d.*))?([\w\-]*)?");
        return match.Success;
    }
}
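
A custom processor only needs to implement Process. For instance, a hypothetical LinkedInLinkProcessor that captures LinkedIn profile and company links could mirror the Facebook one:

public class LinkedInLinkProcessor : ILinkProcessor
{
    public Task<bool> Process(ParsedWebLink webLink, IList<Keyword> keywords)
    {
        // Returns true when the link looks like a LinkedIn profile or company page.
        var match = Regex.Match(webLink.Value, @"(?:https?:\/\/)?(?:www\.)?linkedin\.com\/(?:in|company)\/[\w\-]+");
        return Task.FromResult(match.Success);
    }
}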

Custom Ignore Rules

Create ignore rules targeting:

  • Links - Filter specific URLs
  • Lists - Ignore HTML elements matching criteria

Simply create a list of ignore rules and pass it to the ParseSite method:

var ignoreRules = new List<IgnoreRule>()
{
    new()
    {
        IgnoreRuleType = IgnoreRuleType.Link,
        Rule = "OnlineRegistration"
    }
};
var result = await _webParser.ParseSite(domain, ignoreRules, keywords, maxDepth: 3, maxPages: 100);

Keywords

Keywords are a prerequisite for running the scraper, as they tell it exactly what type of content you are looking for. Like IgnoreRules, Keywords must be supplied to the ParseSite method. Links can match on either the Value or the Regex property of a keyword. Regex can be left null, but use it when you want to narrow your matches: for example, matching on golf could also match disc golf.

new Keyword {
    Value = "golf",
    Regex = "\\b(?<!disc\\s)golf\\b"
}

The regex ensures that any potential matches on disc golf would be ignored.
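
A quick standalone check of that pattern (using System.Text.RegularExpressions):

var pattern = @"\b(?<!disc\s)golf\b";

Console.WriteLine(Regex.IsMatch("We offer golf lessons", pattern));     // True
Console.WriteLine(Regex.IsMatch("Join the disc golf league", pattern)); // False -- blocked by the lookbehind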

Page Interactions

Execute JavaScript during scraping to wait for dynamic content:

// Wait for content to load before scraping
public class LPageInteraction : IPageInteraction
{
    public bool IsMatch(string url)
    {
        var uri = new Uri(url);
        return uri.Host.Contains("example");
    }

    public async Task Interact(IPage page)
    {
        // Block until the #blogposts container exists and has been populated.
        await page.WaitForFunctionAsync(@"() => {" +
                                        "var items = document.querySelector(\"#blogposts\"); " +
                                        "return items !== null && items.innerHTML !== ''" +
                                        "}");
    }
}

Threaders

If the site you are scraping happens to have broken or incomplete HTML, you can create threaders to repair the markup so the site can be parsed properly:

public class BrokenDivThreader : IHtmlThreader
{
    public string Thread(string html)
    {
        // Strip the stray <b="" attribute from broken div tags so the HTML parses.
        string threadedHtml =
            Regex.Replace(html, "(<div)(.*)(<b=\"\")(>)", m => $"{m.Groups[1]} {m.Groups[2]} {m.Groups[4]}");

        return threadedHtml;
    }
}
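
For example (the input string here is purely illustrative), the threader above strips the stray <b="" attribute so the div can be parsed:

var threader = new BrokenDivThreader();

// "<div class=\"post\" <b=\"\">Hello</div>"  becomes roughly  "<div  class=\"post\"  >Hello</div>"
var fixedHtml = threader.Thread("<div class=\"post\" <b=\"\">Hello</div>");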

Usage

Dependency Injection

Register all scraper services using ServiceCollectionHelper:

services.AddWebScraper();
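
A minimal setup might look like the sketch below (using Microsoft.Extensions.DependencyInjection; depending on how the library reads the Scraper settings shown under Getting Started, you may also need to wire up configuration):

var services = new ServiceCollection();

// Registers the scraper services (IWebParser, IRobotsTxtRetriever, the default
// retrievers and processors, etc.).
services.AddWebScraper();

var provider = services.BuildServiceProvider();
var webParser = provider.GetRequiredService<IWebParser>();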

Getting Started

  1. Register services with AddWebScraper()
  2. Configure scraper settings in appsettings:

     "Scraper": {
       "PageThrottle": 5000,
       "PageTimeout": 5000,
       "MaxProcesses": 10,
       "Headless": true
     }

  3. Retrieve and respect robots.txt rules
  4. Configure custom retrievers and processors
  5. Implement page interactions if needed
  6. Start scraping via IWebParser after injecting it:

     var ignoreRules = await _robotsTxtRetriever.GetRobotsTxt(new Uri(domain));
     var result = await _webParser.ParseSite(domain, ignoreRules, keywords.Select(x => new Keyword() { Value = x }).ToList(), 5, 20);

Scrape responsibly.

About

This is a composable web scraper built with C#.
