Populus:DezInV/Notes
architecture
split between crawler/warc writer and processor
Having two separate processes for crawling and processing has advantages:
- The crawler is simple: Work a queue of URLs and write a warc file
- Processing can be redone as often as needed without network access
There should not be much overhead:
- The (compressed) data should still be in the Kernels page cache if processing starts immediately after the warc file has been written.
- Decompression is fast and the decompressed data can be piped directly into a HTML parser to minimize memory consumption.
relevant theory:
- page cache
archive
- https://wiki.archiveteam.org
- The WARC format
- Discussion about compression with somebody working on Russian web archive
compression
- Book: 1999, Managing Gigabytes - Compressing and Indexing Documents and Images
- brotli
seekable (random access) compression
Spinning hard drives have sequential read speeds above 100 MB/s. Thus a record in e.g. a 64MB compressed warc file can be retrieved on average in less than 0.3s, even when the archive needs to be seeked from the start.
(64 MB once was the recommended size for Hadoop files.)
Still, these were the links I found:
- https://github.com/facebook/zstd/blob/dev/contrib/seekable_format/zstd_seekable_compression_format.md
- https://encode.su/threads/1766-Seekable-compression
- https://stackoverflow.com/questions/2046559/any-seekable-compression-library
- https://superuser.com/questions/1235351/what-makes-a-tar-archive-seekable
- https://serverfault.com/questions/59795/is-there-a-smarter-tar-or-cpio-out-there-for-efficiently-retrieving-a-file-store/546691#546691
- https://innovation.ebayinc.com/tech/engineering/gzinga-seekable-and-splittable-gzip/
- https://stackoverflow.com/questions/14225751/random-access-to-gzipped-files
- https://stackoverflow.com/questions/429987/compression-formats-with-good-support-for-random-access-within-archives
- https://stackoverflow.com/questions/236414/what-is-the-best-compression-algorithm-that-allows-random-reads-writes-in-a-file
crawling
- httparchive hat einen Almanach mit Statistiken über Webseiten
- https://www.commoncrawl.org
index
- tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
scraping
search
Search Engines
Terrier
- Terrier.org (Java) by the Information Retrieval Group within the School of Computing Science at the University of Glasgow
news archiving
feeds
from org-mode/thk
dezentrale Suchmaschine
- https://yacy.net
- https://github.com/nvasilakis/yippee
- Java
- last commit 2012
- https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
- https://metaphacts.com/diesel
- non-free enterprise search engine
- https://fourweekmba.com/distributed-search-engines-vs-google/
- presearch.{io|org}
- non-free, blockchain,bla bla
- https://github.com/kearch/kearch
- letzter commit 2019
- python
- https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
- https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
- https://wiki.p2pfoundation.net/Distributed_Search_Engines
- https://en.wikipedia.org/wiki/Distributed_search_engine
- https://searx.github.io/searx/
- https://www.astridmager.net Wissenschaftlerin, alternative, ethic search
- Warum man auch ein lokales Archiv haben sollte: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america
Internet Archive
- https://netpreserve.org/about-us/
- https://github.com/iipc/awesome-web-archiving
- https://github.com/internetarchive
- https://github.com/iipc/openwayback
- http://timetravel.mementoweb.org/
- https://github.com/webrecorder
Crawling, Crawler
ArchiveBox
Python, aktiv
Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…
grab-site
Python, aktiv https://github.com/ArchiveTeam/grab-site
WASP
- Paper: WASP: Web Archiving and Search Personalized via https://github.com/webis-de/wasp
- Java, inaktiv?
- related Paper: Extending WASP: providing context to a personal web archive
- Paper: WASP: Web Archiving and Search Personalized via https://github.com/webis-de/wasp
StormCrawler
- https://en.wikipedia.org/wiki/StormCrawler
- Auf Basis von Apache Storm (distributed stream processing)
HTTrack
- https://www.httrack.com
- In C geschrieben, letztes Release 2017, aber aktives Forum und Github
- https://github.com/xroche/httrack/tags
Grub.org
dezentraler Crawler des Wikia Projektes In C#, Pyton
HCE – Hierarchical Cluster Engine
- http://hierarchical-cluster-engine.com
- https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project
- Ukraine, aktiv bis ca. 2014?
Heritrix
- https://en.wikipedia.org/wiki/Heritrix
- Crawler von archive.org
Haskell
- https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
- https://github.com/jordanspooner/haskell-web-crawler
- dead since 2017, seems like student assignment
- https://hackage.haskell.org/package/scalpel
- A high level web scraping library for Haskell.
- https://hackage.haskell.org/package/hScraper
- eine Version 0.1.0.0 von 2015
- https://hackage.haskell.org/package/hs-scrape
- eine VErsion von 2014, aber git commit von 2020
- https://hackage.haskell.org/package/http-conduit-downloader
- HTTP/HTTPS downloader built on top of http-client and used in https://bazqux.com crawler.
- https://github.com/hunt-framework/hunt
- A flexible, lightweight search platform
- Vorläufer https://github.com/fortytools/holumbus/
- lower level libs
Search
openwebsearch.eu
lemurproject.org
lucindri
https://lemurproject.org/lucindri.php
Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene Search Engine. Lucindri consists of two components: the indexer and the searcher.
Galago
toolkit for experimenting with text search
Wikia
- https://en.wikipedia.org/wiki/Wikia_Search
- von Jimmy Wales
- Implementierung
- Grub, dezentraler Crawler
- Apache Nutch
Recherche 2023-11-26
- https://en.wikipedia.org/wiki/Category:Web_scraping
- https://en.wikipedia.org/wiki/Category:Web_archiving
- https://en.wikipedia.org/wiki/Category:Free_search_engine_software
- https://en.wikipedia.org/wiki/Category:Internet_search_engines
- https://en.wikipedia.org/wiki/Category:Free_web_crawlers