See also

Search engine

architecture

http://engineering.nyu.edu/~suel/cs6913/lec8-crawl.pdf mentions BUbiNG
https://ssrg.eecs.uottawa.ca/publications.html
https://cs.au.dk/~gerth/webalg02/slides/crawling.pdf
https://en.wikipedia.org/wiki/Web_crawler
Design a Basic Search Engine, System Design Interview Prep (YT)
System Design distributed web crawler to crawl Billions of web pages (YT)
Internet Archive Crawler Requirements Analysis 2018
Search Engine Construction Kit Wiki
Introduction to Information Retrieval book
Mercator: A scalable, extensible Web crawler, paper 1999?
- Mercator, A masterclass in system design for a web crawler 20.2.2024
- TODO search for mercator scheme
https://github.com/StractOrg/stract/blob/main/docs/architecture/webgraph.md

Papers:

https://arxiv.org/pdf/1601.06919 BUbiNG: Massive Crawling for the Masses
https://ssrg.eecs.uottawa.ca/docs/2014_Khaled%20Ben%20Hafaiedh.pdf A Scalable P2P RIA Crawling System with Partial Knowledge
https://ssrg.eecs.uottawa.ca/docs/CASCON2013.pdf A Brief History of Web Crawlers
https://www.cs.sfu.ca/~ester/papers/vldb2001.pdf Focused Web Crawling: A Generic Framework for Specifying the User Interest and for Adaptive Crawling Strategies

Calculating optimal refetch time (feeds!)

https://en.wikipedia.org/wiki/Poisson_point_process (page changes modeled as a Poisson process)
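
A minimal sketch of how the Poisson model could yield a refetch time, assuming a page (or feed) is polled at a fixed interval and each poll only reveals whether something changed. The function names, the 0.5 smoothing term, and the default target probability are assumptions for illustration, not taken from any of the linked sources. If changes arrive at rate λ, the probability of at least one change within time t is 1 - exp(-λt), so the time until that probability reaches p is t = -ln(1 - p) / λ.

```python
import math


def estimate_change_rate(visits, changes, interval):
    """Estimate the change rate (changes per time unit) from regular polls.

    visits:   number of polls made at a fixed interval
    changes:  number of polls on which the content had changed
    interval: time between polls

    The probability that a poll sees a change is 1 - exp(-rate * interval).
    Solving for rate gives -ln(unchanged / visits) / interval. The 0.5
    smoothing (an assumption here) avoids log(0) when every poll saw a change.
    """
    if visits <= 0 or interval <= 0 or not 0 <= changes <= visits:
        raise ValueError("need visits > 0, interval > 0, 0 <= changes <= visits")
    unchanged = visits - changes
    return -math.log((unchanged + 0.5) / (visits + 0.5)) / interval


def refetch_interval(rate, change_probability=0.5):
    """Time after which the page has changed with the given probability."""
    if not 0 < change_probability < 1:
        raise ValueError("change_probability must be strictly between 0 and 1")
    if rate <= 0:
        return math.inf  # no change ever observed: no refetch time is implied
    return -math.log(1.0 - change_probability) / rate


if __name__ == "__main__":
    # Hypothetical feed: polled hourly for a week, changed on 12 of 168 polls.
    rate = estimate_change_rate(visits=168, changes=12, interval=1.0)
    print(f"estimated rate: {rate:.4f} changes/hour")
    print(f"refetch after:  {refetch_interval(rate):.1f} hours")
```

A real scheduler would probably clamp the result to a minimum and maximum interval, since a feed that never changed during observation would otherwise never be fetched again.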

split between crawler/WARC writer and processor

Having two separate processes for crawling and processing has advantages:

The crawler stays simple: it works through a queue of URLs and writes a WARC file.
Processing can be redone as often as needed without network access.

There should not be much overhead:

The (compressed) data should still be in the kernel's page cache if processing starts immediately after the WARC file has been written.
Decompression is fast, and the decompressed data can be piped directly into an HTML parser to minimize memory consumption.
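
The split can be sketched as two small functions that share only a file. This is an illustration under loud assumptions: the record layout (one "URL length" header line followed by the body) is a made-up stand-in for real WARC headers, the names write_record, process, and LinkCollector are hypothetical, and the crawler side is given already-fetched bytes instead of doing network I/O. Each record is written as its own gzip member, and the processor streams decompressed chunks straight into an HTML parser without holding a whole page in memory.

```python
import codecs
import gzip
import os
import tempfile
from html.parser import HTMLParser


def write_record(path, url, body):
    """Crawler side: append one fetched page as its own gzip member."""
    header = f"{url} {len(body)}\n".encode("utf-8")  # simplified, not real WARC
    with open(path, "ab") as archive:
        archive.write(gzip.compress(header + body))


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags as the parser is fed chunks."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def process(path, chunk_size=4096):
    """Processor side: yield (url, links) per record, no network needed."""
    with gzip.open(path, "rb") as stream:  # reads concatenated gzip members
        while True:
            header = stream.readline()
            if not header:
                break
            url, length = header.decode("utf-8").rsplit(" ", 1)
            remaining = int(length)
            parser = LinkCollector()
            decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
            while remaining > 0:
                chunk = stream.read(min(chunk_size, remaining))
                if not chunk:
                    break  # truncated archive
                remaining -= len(chunk)
                parser.feed(decoder.decode(chunk))
            parser.feed(decoder.decode(b"", final=True))
            parser.close()
            yield url, parser.links


def demo():
    """Write two hypothetical pages, then process the file they were written to."""
    pages = [
        ("https://example.org/", b'<a href="/about">About</a>'),
        ("https://example.org/about", b"<p>No links here</p>"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "crawl.gz")
        for url, body in pages:
            write_record(path, url, body)
        return list(process(path))


if __name__ == "__main__":
    for url, links in demo():
        print(url, links)
```

Because process only reads the file, it can be rerun with a different parser at any time. A real setup would write spec-compliant WARC records instead of the simplified header used here.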

relevant theory:

page cache
- https://biriukov.dev/docs/page-cache/3-page-cache-and-basic-file-operations/
- https://programs.team/page-cache-why-is-my-container-memory-usage-always-at-the-critical-point.html

crawling

HTTP Archive publishes an almanac with statistics about websites
https://www.commoncrawl.org
- https://crates.io/crates/cc-downloader
List of open source web crawlers - could they use URLFrontier?

rust

https://crates.io/crates/spider
- by spider.cloud, doing crawling as a service
- apparently highly optimized and decentralized
- https://crates.io/crates/spider_worker
- website.rs has about 5000 lines, unmaintainable code
https://crates.io/crates/gar-crawl - abandoned, example project
https://crates.io/crates/texting_robots - robots.txt parsing
https://crates.io/crates/dyer (2y)
https://crates.io/crates/crusty (2y), abandoned, 1 contributor
https://crates.io/crates/crawly, 1 355 lines file
https://github.com/joelkoen/wls crawl multiple sitemaps and list URLs
https://crates.io/crates/frangipani (1y, 25 commits), possibly a few ideas
https://github.com/spire-rs 1 person hobby
- https://crates.io/crates/robotxt
https://crates.io/crates/quick_crawler (+4y)
https://crates.io/crates/robots_txt (+4y)
https://crates.io/crates/website_crawler
https://crates.io/crates/stream_crawler - experiment
https://github.com/tokahuke/lopez (2y, 106 commits, crawl directives language)
https://crates.io/crates/waper CLI tool to scrape HTML websites, uses interesting data structure libraries
https://crates.io/crates/recursive_scraper

url frontier

https://github.com/crawler-commons/url-frontier

url normalization

https://crates.io/crates/urlnorm
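
To make concrete what normalization means for a crawler: different spellings of the same address should map to one key in the frontier, so a page is not fetched twice. The sketch below uses only Python's standard library and is not the behavior of the urlnorm crate; the chosen rules (lowercase scheme and host, drop default port, drop fragment, drop userinfo, empty path becomes "/") are assumptions, and real normalizers apply more rules.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form of url under a few simple, assumed rules."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname drops userinfo and port
    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"  # keep only non-default ports
    path = parts.path or "/"
    # An empty last component drops the fragment, which is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


if __name__ == "__main__":
    for raw in ("HTTP://Example.COM:80/a?b=1#frag", "https://example.com"):
        print(raw, "->", normalize_url(raw))
```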

index

tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
- https://github.com/lnx-search/lnx - deployment of tantivy
- https://crates.io/crates/tantivy_warc_indexer - builds a tantivy index from common crawl warc.wet files
seekstorm Search engine library & multi-tenancy server in Rust

scraping

https://crates.io/crates/article-extractor

search

Search Engines

https://dawnsearch.org distributed web search engine that searches by meaning, rust
- https://crates.io/crates/dawnsearch

Terrier

Terrier.org (Java) by the Information Retrieval Group within the School of Computing Science at the University of Glasgow

news archiving

https://wiki.archiveteam.org/index.php?title=NewsGrabber

feeds

https://hackage.haskell.org/package/feed-crawl

from org-mode/thk

Decentralized search engine

https://yacy.net
- RFP
- https://www.youtube.com/c/YaCyTutorials/videos
https://github.com/nvasilakis/yippee
- Java
- last commit 2012
https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
https://metaphacts.com/diesel
- non-free enterprise search engine
https://fourweekmba.com/distributed-search-engines-vs-google/
presearch.{io|org}
- non-free, blockchain, bla bla
https://github.com/kearch/kearch
- last commit 2019
- python
https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
https://wiki.p2pfoundation.net/Distributed_Search_Engines
https://en.wikipedia.org/wiki/Distributed_search_engine
https://searx.github.io/searx/
https://www.astridmager.net researcher working on alternative, ethical search
Why one should also keep a local archive: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america

Internet Archive

Crawling, Crawler

https://en.wikipedia.org/wiki/WARC_(file_format)

ArchiveBox
- https://github.com/ArchiveBox/ArchiveBox
Python, active

Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…
- https://archivebox.io
grab-site

Python, active https://github.com/ArchiveTeam/grab-site
WASP
- Paper: WASP: Web Archiving and Search Personalized via https://github.com/webis-de/wasp
  - Java, inactive?
  - related Paper: Extending WASP: providing context to a personal web archive
StormCrawler
- https://en.wikipedia.org/wiki/StormCrawler
- Based on Apache Storm (distributed stream processing)
HTTrack
- https://www.httrack.com
- Written in C, last release 2017, but active forum and GitHub
- https://github.com/xroche/httrack/tags
Grub.org

Decentralized crawler of the Wikia project, in C# and Python
- https://web.archive.org/web/20090207182028/http://grub.org/
- https://en.wikipedia.org/wiki/Grub_(search_engine)
HCE – Hierarchical Cluster Engine
- http://hierarchical-cluster-engine.com
- https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project
- Ukraine, active until about 2014?
Heritrix
- https://en.wikipedia.org/wiki/Heritrix
- Crawler of archive.org
Haskell
- https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
  - https://www.reddit.com/r/haskell/comments/6nazqm/notes_on_design_of_webscrapers_in_haskell/
- https://github.com/jordanspooner/haskell-web-crawler
  - dead since 2017, seems like student assignment
- https://hackage.haskell.org/package/scalpel
  - A high level web scraping library for Haskell.
- https://hackage.haskell.org/package/hScraper
  - one version, 0.1.0.0, from 2015
- https://hackage.haskell.org/package/hs-scrape
  - one version from 2014, but a git commit from 2020
- https://hackage.haskell.org/package/http-conduit-downloader
  - HTTP/HTTPS downloader built on top of http-client and used in https://bazqux.com crawler.
- https://github.com/hunt-framework/hunt
  - A flexible, lightweight search platform
  - Predecessor: https://github.com/fortytools/holumbus/
    - http://holumbus.fh-wedel.de/trac
- lower level libs
  - https://github.com/haskell/wreq http://www.serpentine.com/wreq
  - https://github.com/snoyberg/http-client/blob/master/TUTORIAL.md

Search

openwebsearch.eu
- https://openwebsearch.eu/the-project/research-results
lemurproject.org
1. lucindri
  
  https://lemurproject.org/lucindri.php
  
  Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene Search Engine. Lucindri consists of two components: the indexer and the searcher.
2. Galago
  
  toolkit for experimenting with text search
  - http://www.search-engines-book.com

Wikia

https://en.wikipedia.org/wiki/Wikia_Search
by Jimmy Wales
Implementation
- Grub, decentralized crawler
- Apache Nutch

Research 2023-11-26

Scraping

https://github.com/scrapy

Miscellaneous

Distributed systems

- https://crates.io/crates/amadeus stream processing on top of https://github.com/constellation-rs/constellation

Populus:DezInV/Notes
