A web scraper designed with ethics and responsibility at its core. UltimoScraper provides a flexible, extensible framework for web scraping while respecting website resources and following best practices.
UltimoScraper's main function is to scan an entire website for pages that match the keywords you provide. It can also extract specific pieces of content (such as social media links) along the way. Once a scan completes, you can access a dictionary where the key is the keyword and the value is the list of pages that matched it. From there you can perform more in-depth data extraction on those pages based on the keywords that were matched.
UltimoScraper includes RobotsTxtRetriever.cs to fetch and parse robots.txt files. Always retrieve and respect the ignore rules:
var robotsRetriever = serviceProvider.GetRequiredService<IRobotsTxtRetriever>();
var ignoreRules = await robotsRetriever.GetIgnoreRulesAsync("https://example.com");

Built-in throttling prevents overwhelming target websites, ensuring scraping doesn't negatively impact their performance.
Sites are hierarchical, so if you set a max depth of 3, the scraper will only follow links up to three levels deep from the home page (Home Page -> Blog -> Blog Post). This is important for sites with a very large amount of content. You can also specify a maximum number of unique pages to scrape; once that limit is reached, scraping stops. The scraper exits as soon as it hits either the max depth or the max pages.
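For example, a depth- and page-limited crawl might look like the sketch below. The call shape follows the IWebParser.ParseSite signature in the next code listing; the domain and limits are placeholders, and the KeywordMatches property used to read the keyword-to-pages dictionary is a hypothetical name, so check ParsedSite for the actual member.

// Sketch: limit the crawl to 3 levels deep and at most 100 unique pages.
var keywords = new List<Keyword> { new Keyword { Value = "golf" } };

ParsedSite site = await webParser.ParseSite(
    "https://example.com",
    ignoreRules,          // e.g. the rules retrieved from robots.txt above
    keywords,
    maxDepth: 3,
    maxPages: 100);

// "KeywordMatches" is an assumed name for the keyword -> matching pages dictionary.
foreach (var (keyword, pages) in site.KeywordMatches)
{
    Console.WriteLine($"{keyword}: {pages.Count} page(s) matched");
}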
The WebParser, exposed through the IWebParser interface, is what you should interact with; here are the methods it offers:
public interface IWebParser
{
Task<ParsedSite> ParseSite(
string domain,
IList<IgnoreRule> ignoreRules,
IList<Keyword> keywords,
int maxDepth,
int maxPages,
string sessionName = null);
Task<IList<string>> KeywordSearch(
string domain,
IList<IgnoreRule> ignoreRules,
IList<Keyword> keywords,
IList<Keyword> searchKeywords,
string sessionName = null);
Task<ParsedPage> ParsePage(string domain, string path, IList<IgnoreRule> ignoreRules, IList<Keyword> keywords, string sessionName = null);
}
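As a quick, hedged illustration of the other two methods, a keyword search and a single-page parse might look like this (the domain, path, and keyword values are placeholders, and exactly how keywords and searchKeywords interact is not spelled out here):

// Sketch: crawl example.com with a base keyword list plus a separate list of search keywords.
var ignoreRules = new List<IgnoreRule>();
var keywords = new List<Keyword> { new Keyword { Value = "golf" } };
var searchKeywords = new List<Keyword> { new Keyword { Value = "tee times" } };

IList<string> matchingUrls = await webParser.KeywordSearch(
    "https://example.com",
    ignoreRules,
    keywords,
    searchKeywords,
    sessionName: "golf-session");

// Sketch: parse a single known path on the same domain.
ParsedPage blogPage = await webParser.ParsePage("https://example.com", "/blog", ignoreRules, keywords);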
Extend scraping functionality by implementing:

- ILinkRetriever - Extract links from pages
- ITitleRetriever - Extract page titles
- IListRetriever - Extract lists of elements
- IListItemRetriever - Extract individual items from lists
Retrievers query the DOM for specific types of elements, while processors filter out those elements based on your specific needs. Here is an example of the default link retriever:
public class DefaultLinkRetriever : ALinkRetriever, ILinkRetriever
{
public DefaultLinkRetriever(IEnumerable<ILinkProcessor> linkProcessors) : base(linkProcessors)
{
}
public async Task<IList<ParsedWebLink>> GetLinks(
HtmlNode htmlNode,
IList<IgnoreRule> linkIgnoreRules,
IList<Keyword> keywords)
{
var links =
htmlNode.QuerySelectorAll("a")
.Select(a => new ParsedWebLink
{
Value = HttpUtility.HtmlDecode(a.GetAttributeValue("href", null)),
Text = a.GetOnlyInnerText()
}).Where(u => !string.IsNullOrEmpty(u.Value));
return await ProcessLinks(links, linkIgnoreRules, keywords);
}
}

Retrievers work with processors to filter and refine results, ensuring you only capture the data you need. The FacebookLinkProcessor below, for example, matches links that point to Facebook:
public class FacebookLinkProcessor : ILinkProcessor
{
public async Task<bool> Process(ParsedWebLink webLink, IList<Keyword> keywords)
{
var match = Regex.Match(webLink.Value, @"(?:(?:http|https):\/\/)?(?:www.)?facebook.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[?\w\-]*\/)?(?:profile.php\?id=(?=\d.*))?([\w\-]*)?");
return match.Success;
}
}
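You can write additional processors in the same shape. The sketch below is a hypothetical LinkedIn variant; the AddSingleton registration at the end is an assumption based on standard Microsoft.Extensions.DependencyInjection usage, so verify how AddWebScraper() expects processors to be registered.

public class LinkedInLinkProcessor : ILinkProcessor
{
    // Returns true when the link points to a LinkedIn profile or company page.
    public Task<bool> Process(ParsedWebLink webLink, IList<Keyword> keywords)
    {
        var match = Regex.Match(
            webLink.Value,
            @"(?:https?:\/\/)?(?:www\.)?linkedin\.com\/(?:in|company)\/[\w\-]+",
            RegexOptions.IgnoreCase);
        return Task.FromResult(match.Success);
    }
}

// Assumed registration pattern -- confirm against the library's own wiring.
services.AddSingleton<ILinkProcessor, LinkedInLinkProcessor>();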
Create ignore rules targeting:

- Links - Filter specific URLs
- Lists - Ignore HTML elements matching criteria
Simply create a list of ignore rules and pass it to the ParseSite method:
var ignoreRules = new List<IgnoreRule>()
{
new()
{
IgnoreRuleType = IgnoreRuleType.Link,
Rule = "OnlineRegistration"
}
};
var result = await _webParser.ParseSite(domain, ignoreRules, keywords, maxDepth: 3, maxPages: 100);

Keywords are a prerequisite for running the scraper, as they tell it exactly what type of content you are looking for. Like ignore rules, keywords must be supplied to the ParseSite method, and links can match on either the Value or the Regex property of a keyword. Regex can be left null, but use it when you need to narrow down your matches. For example, matching on "golf" could also inadvertently match "disc golf":
new Keyword {
Value = "golf",
Regex = "\\b(?<!disc\\s)golf\\b"
}

The regex ensures that any potential matches on "disc golf" are ignored.
Execute JavaScript during scraping to wait for dynamic content:
// Wait for content to load before scraping
public class LPageInteraction : IPageInteraction
{
public bool IsMatch(string url)
{
var uri = new Uri(url);
return uri.Host.Contains("example");
}
public async Task Interact(IPage page)
{
await page.WaitForFunctionAsync(@"() => {" +
"var items = document.querySelector(\"#blogposts\"); " +
"return items !== null && items.innerHTML !== ''" +
"}");
}
}

If you are scraping a site that happens to have broken or incomplete HTML, you can create threaders to ensure the site can still be parsed properly:
public class BrokenDivThreader : IHtmlThreader
{
public string Thread(string html)
{
string threadedHtml =
Regex.Replace(html, "(<div)(.*)(<b=\"\")(>)", m => $"{m.Groups[1]} {m.Groups[2]} {m.Groups[4]}");
return threadedHtml;
}
}

Register all scraper services using ServiceCollectionHelper:
services.AddWebScraper();

To get up and running:

- Register services with AddWebScraper() (see the wiring sketch after this list)
- Configure scraper settings in appsettings:
"Scraper": {
"PageThrottle": 5000,
"PageTimeout": 5000,
"MaxProcesses": 10,
"Headless": true
}

- Retrieve and respect robots.txt rules
- Configure custom retrievers and processors
- Implement page interactions if needed
- Start scraping via IWebParser after injecting it
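If you are wiring things up in a minimal console app, the setup might look like the sketch below. It assumes appsettings.json is loaded through the standard Microsoft.Extensions.Configuration builder and that the scraper reads its settings from the registered IConfiguration; check the library's actual AddWebScraper() overloads.

using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

var configuration = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json")
    .Build();

var services = new ServiceCollection();
services.AddSingleton<IConfiguration>(configuration); // assumption: settings come from IConfiguration
services.AddWebScraper();

var provider = services.BuildServiceProvider();
var robotsRetriever = provider.GetRequiredService<IRobotsTxtRetriever>();
var webParser = provider.GetRequiredService<IWebParser>();

With the services resolved, a full run looks like the example below (shown with constructor-injected fields):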
var ignoreRules = await _robotsTxtRetriever.GetRobotsTxt(new Uri(domain));
var result = await _webParser.ParseSite(domain, ignoreRules, keywords.Select(x => new Keyword() { Value = x }).ToList(), 5, 20);

Scrape responsibly.