Populus:DezInV/Notes
See also
architecture
- http://engineering.nyu.edu/~suel/cs6913/lec8-crawl.pdf mentions BUbiNG
- https://ssrg.eecs.uottawa.ca/publications.html
- https://cs.au.dk/~gerth/webalg02/slides/crawling.pdf
- https://en.wikipedia.org/wiki/Web_crawler
- Design a Basic Search Engine, System Design Interview Prep (YT)
- System Design distributed web crawler to crawl Billions of web pages (YT)
- Internet Archive Crawler Requirements Analysis 2018
- Search Engine Construction Kit Wiki
- Introduction to Information Retrieval book
- Mercator: A scalable, extensible Web crawler, paper 1999?
- Mercator, A masterclass in system design for a web crawler 20.2.2024
- TODO search for mercator scheme
Papers:
- https://arxiv.org/pdf/1601.06919 BUbiNG: Massive Crawling for the Masses
- https://ssrg.eecs.uottawa.ca/docs/2014_Khaled%20Ben%20Hafaiedh.pdf A Scalable P2P RIA Crawling System with Partial Knowledge
- https://ssrg.eecs.uottawa.ca/docs/CASCON2013.pdf A Brief History of Web Crawlers
- https://www.cs.sfu.ca/~ester/papers/vldb2001.pdf Focused Web Crawling: A Generic Framework for Specifying the User Interest and for Adaptive Crawling Strategies
Calculating optimal refetch time (feeds!)
- https://en.wikipedia.org/wiki/Poisson_point_process or poisson process model
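A minimal sketch of how a Poisson model could yield a refetch interval, assuming page or feed changes arrive at a constant rate λ. Under that model the probability that at least one change happened within time t is 1 − e^(−λt), so the interval at which this probability reaches p is t = −ln(1 − p) / λ. The function names and the naive rate estimator (changes seen divided by observation time) are illustrative assumptions, not taken from any of the linked sources; the naive estimator underestimates λ when several changes fall between two fetches.

```python
import math


def estimate_change_rate(changes_seen: int, observed_seconds: float) -> float:
    """Naive estimate of the change rate in changes per second (assumption)."""
    if observed_seconds <= 0:
        raise ValueError("observed_seconds must be positive")
    return changes_seen / observed_seconds


def refetch_interval(rate: float, change_probability: float = 0.5) -> float:
    """Seconds after which the page has changed with the given probability."""
    if not 0.0 < change_probability < 1.0:
        raise ValueError("change_probability must be strictly between 0 and 1")
    if rate <= 0:
        # No change ever observed: the model gives no finite interval.
        return math.inf
    return -math.log(1.0 - change_probability) / rate
```

For a feed that changed 7 times in 7 days, `refetch_interval(estimate_change_rate(7, 7 * 86400))` gives ln(2) days, i.e. roughly 16.6 hours until a change is as likely as not.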
split between crawler/warc writer and processor
Having two separate processes for crawling and processing has advantages:
- The crawler is simple: work through a queue of URLs and write a WARC file
- Processing can be redone as often as needed without network access
There should not be much overhead:
- The (compressed) data should still be in the kernel's page cache if processing starts immediately after the WARC file has been written.
- Decompression is fast and the decompressed data can be piped directly into an HTML parser to minimize memory consumption.
relevant theory:
- page cache
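The crawler/processor split described above can be sketched as follows. This is a simplified illustration using only the Python standard library: the "archive" is a single gzip-compressed HTML body rather than a real WARC file, and `crawl_step`, `process_step` and `LinkCollector` are hypothetical names. The point is that the processor decompresses in chunks and feeds them straight into an incremental HTML parser, so the whole document never has to sit decompressed in memory.

```python
import gzip
import os
import tempfile
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags while chunks are fed in."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_step(path: str, body: bytes) -> None:
    """Stand-in for the crawler: store an already fetched body compressed."""
    with gzip.open(path, "wb") as out:
        out.write(body)


def process_step(path: str, chunk_size: int = 64 * 1024) -> list[str]:
    """Stand-in for the processor: stream-decompress and parse, no network."""
    parser = LinkCollector()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as stream:
        while chunk := stream.read(chunk_size):
            parser.feed(chunk)  # the parser buffers tags split across chunks
    parser.close()
    return parser.links


def demo() -> list[str]:
    body = (b'<html><body><a href="https://example.org/a">a</a>'
            b'<a href="/b">b</a></body></html>')
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "page.html.gz")
        crawl_step(path, body)
        return process_step(path)
```

Because `process_step` only reads the file, it can be rerun with a different parser as often as needed.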
archive
- Kiwix, makes websites available offline for regions without internet access, uses the ZIM file format, Wikimedia Foundation
- https://wiki.archiveteam.org
- The WARC format
- Discussion about compression with somebody working on Russian web archive
compression
- Book: 1999, Managing Gigabytes - Compressing and Indexing Documents and Images
- brotli
seekable (random access) compression
Spinning hard drives have sequential read speeds above 100 MB/s. Thus a record in, e.g., a 64 MB compressed WARC file can be retrieved in about 0.3 s on average (on average half the file, 32 MB, has to be read), even when the archive has to be read sequentially from the start.
(64 MB once was the recommended size for Hadoop files.)
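One simple way to get random access without a special format is to compress every record as its own gzip member and remember the byte offset where each member starts. The sketch below assumes such an external offset index; the function names are hypothetical and the records are plain byte strings, not real WARC records. Concatenated gzip members still form a valid gzip file, so sequential tools keep working.

```python
import gzip
import io
import zlib
from typing import BinaryIO


def write_records(records: list[bytes]) -> tuple[bytes, list[int]]:
    """Compress each record as its own gzip member; return archive and offsets."""
    archive = io.BytesIO()
    offsets: list[int] = []
    for record in records:
        offsets.append(archive.tell())  # start of this member in the archive
        archive.write(gzip.compress(record))
    return archive.getvalue(), offsets


def read_record(archive: BinaryIO, offset: int, chunk_size: int = 4096) -> bytes:
    """Seek to a member start and decompress exactly that one member."""
    archive.seek(offset)
    # MAX_WBITS | 16 tells zlib to expect gzip framing.
    decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
    parts: list[bytes] = []
    while not decoder.eof:
        chunk = archive.read(chunk_size)
        if not chunk:
            break  # truncated archive: return what could be decoded
        parts.append(decoder.decompress(chunk))
    return b"".join(parts)
```

The price is a somewhat worse compression ratio, because each record is compressed without the context of the records before it.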
Still, these were the links I found:
- https://github.com/facebook/zstd/blob/dev/contrib/seekable_format/zstd_seekable_compression_format.md
- https://encode.su/threads/1766-Seekable-compression
- https://stackoverflow.com/questions/2046559/any-seekable-compression-library
- https://superuser.com/questions/1235351/what-makes-a-tar-archive-seekable
- https://serverfault.com/questions/59795/is-there-a-smarter-tar-or-cpio-out-there-for-efficiently-retrieving-a-file-store/546691#546691
- https://innovation.ebayinc.com/tech/engineering/gzinga-seekable-and-splittable-gzip/
- https://stackoverflow.com/questions/14225751/random-access-to-gzipped-files
- https://stackoverflow.com/questions/429987/compression-formats-with-good-support-for-random-access-within-archives
- https://stackoverflow.com/questions/236414/what-is-the-best-compression-algorithm-that-allows-random-reads-writes-in-a-file
crawling
- httparchive has an almanac with statistics about websites
- https://www.commoncrawl.org
- List of open source web crawlers - could they use URLFrontier?
rust
- https://crates.io/crates/spider
- by spider.cloud, doing crawling as a service
- apparently highly optimized and decentralized
- https://crates.io/crates/spider_worker
- website.rs has about 5000 lines, unmaintainable code
- https://crates.io/crates/gar-crawl - abandoned, example project
- https://crates.io/crates/texting_robots - robots.txt parsing
- https://crates.io/crates/dyer (2y)
- https://crates.io/crates/crusty (2y), abandoned, 1 contributor
- https://crates.io/crates/crawly, 1 355 lines file
- https://github.com/joelkoen/wls crawl multiple sitemaps and list URLs
- https://crates.io/crates/frangipani (1y, 25 commits), possibly a few ideas
- https://github.com/spire-rs 1 person hobby
- https://crates.io/crates/quick_crawler (+4y)
- https://crates.io/crates/robots_txt (+4y)
- https://crates.io/crates/website_crawler
- https://crates.io/crates/stream_crawler - experiment
- https://github.com/tokahuke/lopez (2y, 106 commits, crawl directives language)
- https://crates.io/crates/waper CLI tool to scrape HTML websites, interesting data structure libraries
- https://crates.io/crates/recursive_scraper
url frontier
url normalization
index
- tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
- https://github.com/lnx-search/lnx - deployment of tantivy
- https://crates.io/crates/tantivy_warc_indexer - builds a tantivy index from common crawl warc.wet files
- seekstorm Search engine library & multi-tenancy server in Rust
scraping
search
Search Engines
- https://dawnsearch.org distributed web search engine that searches by meaning, rust
Terrier
- Terrier.org (Java) by the Information Retrieval Group within the School of Computing Science at the University of Glasgow
news archiving
feeds
from org-mode/thk
decentralized search engine
- https://yacy.net
- https://github.com/nvasilakis/yippee
- Java
- last commit 2012
- https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
- https://metaphacts.com/diesel
- non-free enterprise search engine
- https://fourweekmba.com/distributed-search-engines-vs-google/
- presearch.{io|org}
- non-free, blockchain, blah blah
- https://github.com/kearch/kearch
- last commit 2019
- python
- https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
- https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
- https://wiki.p2pfoundation.net/Distributed_Search_Engines
- https://en.wikipedia.org/wiki/Distributed_search_engine
- https://searx.github.io/searx/
- https://www.astridmager.net researcher, alternative, ethical search
- Why one should also keep a local archive: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america
Internet Archive
- https://netpreserve.org/about-us/
- https://github.com/iipc/awesome-web-archiving
- https://github.com/internetarchive
- https://github.com/iipc/openwayback
- http://timetravel.mementoweb.org/
- https://github.com/webrecorder
Crawling, Crawler
ArchiveBox
Python, active
Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…
grab-site
Python, active https://github.com/ArchiveTeam/grab-site
WASP
- Paper: WASP: Web Archiving and Search Personalized via https://github.com/webis-de/wasp
- Java, inactive?
- related Paper: Extending WASP: providing context to a personal web archive
StormCrawler
- https://en.wikipedia.org/wiki/StormCrawler
- Based on Apache Storm (distributed stream processing)
HTTrack
- https://www.httrack.com
- Written in C, last release 2017, but active forum and GitHub
- https://github.com/xroche/httrack/tags
Grub.org
decentralized crawler of the Wikia project, in C#, Python
HCE – Hierarchical Cluster Engine
- http://hierarchical-cluster-engine.com
- https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project
- Ukraine, active until about 2014?
Heritrix
- https://en.wikipedia.org/wiki/Heritrix
- Crawler of archive.org
Haskell
- https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
- https://github.com/jordanspooner/haskell-web-crawler
- dead since 2017, seems like student assignment
- https://hackage.haskell.org/package/scalpel
- A high level web scraping library for Haskell.
- https://hackage.haskell.org/package/hScraper
- one version, 0.1.0.0, from 2015
- https://hackage.haskell.org/package/hs-scrape
- one version from 2014, but a git commit from 2020
- https://hackage.haskell.org/package/http-conduit-downloader
- HTTP/HTTPS downloader built on top of http-client and used in https://bazqux.com crawler.
- https://github.com/hunt-framework/hunt
- A flexible, lightweight search platform
- Predecessor: https://github.com/fortytools/holumbus/
- lower level libs
Search
openwebsearch.eu
lemurproject.org
lucindri
https://lemurproject.org/lucindri.php
Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene Search Engine. Lucindri consists of two components: the indexer and the searcher.
Galago
toolkit for experimenting with text search
Wikia
- https://en.wikipedia.org/wiki/Wikia_Search
- by Jimmy Wales
- Implementation
- Grub, decentralized crawler
- Apache Nutch
Research 2023-11-26
- https://en.wikipedia.org/wiki/Category:Web_scraping
- https://en.wikipedia.org/wiki/Category:Web_archiving
- https://en.wikipedia.org/wiki/Category:Free_search_engine_software
- https://en.wikipedia.org/wiki/Category:Internet_search_engines
- https://en.wikipedia.org/wiki/Category:Free_web_crawlers
Scraping
miscellaneous
distributed systems
- https://crates.io/crates/amadeus stream processing on top of https://github.com/constellation-rs/constellation