architecture

split between crawler/warc writer and processor

Running crawling and processing as two separate processes has advantages:

- The crawler stays simple: it works through a queue of URLs and writes a WARC file.
- Processing can be repeated as often as needed without network access.
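
The crawler half of this split can be sketched in a few lines. This is a minimal illustration, not a finished design: `warc_record`, `crawl` and `fake_fetch` are hypothetical names, the fetcher is injected so the sketch runs without network access, and only a handful of WARC headers are written. Each record is compressed as its own gzip member, which is the usual convention for `.warc.gz` files.

```python
import gzip
import io
import uuid
from collections import deque
from datetime import datetime, timezone


def warc_record(url: str, http_response: bytes) -> bytes:
    """Build one uncompressed WARC/1.0 'response' record (minimal header set)."""
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record is terminated by two CRLF pairs after the content block.
    return head + http_response + b"\r\n\r\n"


def crawl(seeds, fetch, out) -> int:
    """Work a FIFO queue of URLs; append one gzip member per record to `out`."""
    queue = deque(seeds)
    seen = set()
    written = 0
    while queue:
        url = queue.popleft()
        if url in seen:  # skip URLs that were already fetched
            continue
        seen.add(url)
        out.write(gzip.compress(warc_record(url, fetch(url))))
        written += 1
    return written


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP client (assumption: returns the raw HTTP response)."""
    body = f"<html><body>{url}</body></html>".encode("utf-8")
    return (b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n"
            b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n\r\n" + body)


buf = io.BytesIO()  # stands in for the WARC file on disk
count = crawl(
    ["https://example.org/", "https://example.org/a", "https://example.org/"],
    fake_fetch,
    buf,
)
```

Link extraction is deliberately left out of the crawler here; in this split it would belong to the processor, which could feed newly discovered URLs back into the queue.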

The overhead should be small:

- The (compressed) data should still be in the kernel's page cache if processing starts immediately after the WARC file has been written.
- Decompression is fast, and the decompressed data can be piped directly into an HTML parser to minimize memory consumption.
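
The second point can be illustrated with a streaming pipeline. The following is a sketch under simplifying assumptions: the gzip stream contains only an HTML payload (WARC and HTTP header framing is omitted), the payload is UTF-8, and `LinkCollector` and `extract_links` are hypothetical names. The data is decompressed in small chunks and fed to the parser incrementally, so the full document is never held in memory at once.

```python
import codecs
import gzip
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags seen so far."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(compressed, chunk_size=64 * 1024):
    """Decompress `compressed` chunk by chunk and feed each chunk to the parser."""
    parser = LinkCollector()
    # Incremental decoder: a multi-byte character may be split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    with gzip.GzipFile(fileobj=compressed, mode="rb") as stream:
        while chunk := stream.read(chunk_size):
            parser.feed(decoder.decode(chunk))
    parser.feed(decoder.decode(b"", final=True))
    parser.close()
    return parser.links


html = ("<html><head><title>Übersicht</title></head><body>"
        '<a href="https://example.org/a">a</a>'
        '<a href="/b?x=1">b</a></body></html>')
compressed = io.BytesIO(gzip.compress(html.encode("utf-8")))
# A tiny chunk size forces tags to be split across chunks.
links = extract_links(compressed, chunk_size=16)
```

`HTMLParser.feed` buffers incomplete tags internally, so chunk boundaries in the middle of a tag do not lose data.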

relevant theory:

page cache
- https://biriukov.dev/docs/page-cache/3-page-cache-and-basic-file-operations/
- https://programs.team/page-cache-why-is-my-container-memory-usage-always-at-the-critical-point.html

crawling

httparchive has an almanac with statistics about websites
https://www.commoncrawl.org
- https://crates.io/crates/cc-downloader

rust

https://crates.io/crates/spider
- by spider.cloud, doing crawling as a service
- apparently highly optimized and decentralized
- https://crates.io/crates/spider_worker
https://crates.io/crates/gar-crawl
https://crates.io/crates/texting_robots - robots.txt parsing
https://crates.io/crates/dyer
https://crates.io/crates/crusty - polite && scalable broad web crawler
https://crates.io/crates/crawly
https://github.com/joelkoen/wls crawl multiple sitemaps and list URLs
https://crates.io/crates/frangipani
https://github.com/spire-rs crawler & scraper framework
- https://crates.io/crates/robotxt
https://crates.io/crates/quick_crawler (+4y)
https://crates.io/crates/robots_txt (+4y)
https://crates.io/crates/website_crawler
https://crates.io/crates/stream_crawler

url normalization

https://crates.io/crates/urlnorm
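
As a rough idea of what URL normalization means for a crawl queue, here is a minimal sketch using only the Python standard library. It does not reproduce the behavior of the `urlnorm` crate; the function name `normalize_url` and the chosen rules (lowercase scheme and host, drop default ports and fragments, sort query parameters, use "/" for an empty path) are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Map equivalent spellings of a URL to one canonical string.

    Simplifications: userinfo is dropped, IPv6 hosts are not handled,
    and the query string is re-encoded by urlencode.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only if it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Sort parameters so that ?b=2&a=1 and ?a=1&b=2 compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is never sent to the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, query, ""))


a = normalize_url("HTTP://Example.ORG:80/a?b=2&a=1#frag")
b = normalize_url("http://example.org/a?a=1&b=2")
```

Both calls yield the same string, so a crawler's "seen" set would treat the two spellings as one URL.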

index

tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
- https://github.com/lnx-search/lnx - deployment of tantivy
- https://crates.io/crates/tantivy_warc_indexer - builds a tantivy index from common crawl warc.wet files
seekstorm Search engine library & multi-tenancy server in Rust

scraping

https://crates.io/crates/article-extractor

search

Search Engines

https://dawnsearch.org distributed web search engine that searches by meaning, rust
- https://crates.io/crates/dawnsearch

Terrier

Terrier.org (Java) by the Information Retrieval Group within the School of Computing Science at the University of Glasgow

news archiving

https://wiki.archiveteam.org/index.php?title=NewsGrabber

feeds

https://hackage.haskell.org/package/feed-crawl

from org-mode/thk

decentralized search engine

https://yacy.net
- RFP
- https://www.youtube.com/c/YaCyTutorials/videos
https://github.com/nvasilakis/yippee
- Java
- last commit 2012
https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
https://metaphacts.com/diesel
- non-free enterprise search engine
https://fourweekmba.com/distributed-search-engines-vs-google/
presearch.{io|org}
- non-free, blockchain, blah blah
https://github.com/kearch/kearch
- last commit 2019
- python
https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
https://wiki.p2pfoundation.net/Distributed_Search_Engines
https://en.wikipedia.org/wiki/Distributed_search_engine
https://searx.github.io/searx/
https://www.astridmager.net researcher working on alternative, ethical search
Why one should also keep a local archive: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america

Internet Archive

Crawling, Crawler

https://en.wikipedia.org/wiki/WARC_(file_format)

ArchiveBox
- https://github.com/ArchiveBox/ArchiveBox
- Python, active
- Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…
- https://archivebox.io
grab-site
- Python, active
- https://github.com/ArchiveTeam/grab-site
WASP
- Paper: WASP: Web Archiving and Search Personalized via https://github.com/webis-de/wasp
  - Java, inactive?
  - related Paper: Extending WASP: providing context to a personal web archive
StormCrawler
- https://en.wikipedia.org/wiki/StormCrawler
- Based on Apache Storm (distributed stream processing)
HTTrack
- https://www.httrack.com
- Written in C, last release 2017, but active forum and GitHub
- https://github.com/xroche/httrack/tags
Grub.org
- decentralized crawler of the Wikia project, written in C# and Python
- https://web.archive.org/web/20090207182028/http://grub.org/
- https://en.wikipedia.org/wiki/Grub_(search_engine)
HCE – Hierarchical Cluster Engine
- http://hierarchical-cluster-engine.com
- https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project
- Ukraine, active until about 2014?
Heritrix
- https://en.wikipedia.org/wiki/Heritrix
- Crawler of archive.org
Haskell
- https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
  - https://www.reddit.com/r/haskell/comments/6nazqm/notes_on_design_of_webscrapers_in_haskell/
- https://github.com/jordanspooner/haskell-web-crawler
  - dead since 2017, seems like student assignment
- https://hackage.haskell.org/package/scalpel
  - A high level web scraping library for Haskell.
- https://hackage.haskell.org/package/hScraper
  - a single version, 0.1.0.0, from 2015
- https://hackage.haskell.org/package/hs-scrape
  - one version from 2014, but a git commit from 2020
- https://hackage.haskell.org/package/http-conduit-downloader
  - HTTP/HTTPS downloader built on top of http-client and used in https://bazqux.com crawler.
- https://github.com/hunt-framework/hunt
  - A flexible, lightweight search platform
  - Predecessor: https://github.com/fortytools/holumbus/
    - http://holumbus.fh-wedel.de/trac
- lower level libs
  - https://github.com/haskell/wreq http://www.serpentine.com/wreq
  - https://github.com/snoyberg/http-client/blob/master/TUTORIAL.md

Search

openwebsearch.eu
- https://openwebsearch.eu/the-project/research-results
lemurproject.org
1. lucindri
  
  https://lemurproject.org/lucindri.php
  
  Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene Search Engine. Lucindri consists of two components: the indexer and the searcher.
2. Galago
  
  toolkit for experimenting with text search
  - http://www.search-engines-book.com

Wikia

https://en.wikipedia.org/wiki/Wikia_Search
by Jimmy Wales
Implementation
- Grub, decentralized crawler
- Apache Nutch

Research 2023-11-26

Scraping

https://github.com/scrapy

miscellaneous

distributed systems

- https://crates.io/crates/amadeus stream processing on top of https://github.com/constellation-rs/constellation

Populus:DezInV/Notes
