Current revision as of 23 November 2024, 17:23

architecture

split between crawler/warc writer and processor

Having two separate processes for crawling and processing has advantages:

The crawler is simple: work through a queue of URLs and write a WARC file.
Processing can be redone as often as needed without network access.

There should not be much overhead:

The (compressed) data should still be in the kernel's page cache if processing starts immediately after the WARC file has been written.
Decompression is fast, and the decompressed data can be piped directly into an HTML parser to minimize memory consumption.
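
The processing side can be sketched as follows. This is only a minimal illustration, assuming a single gzip-compressed HTML document instead of a real WARC file (WARC record parsing is left out). The names LinkCollector and extract_links are hypothetical and not part of any existing tool. The point is that the file is decompressed chunk by chunk and each chunk is fed straight into the parser, so the full document is never held in memory.

```python
import gzip
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags while the document streams through."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(path, chunk_size=64 * 1024):
    """Decompress `path` chunk by chunk and feed each chunk to the parser."""
    parser = LinkCollector()
    # Text mode decodes on the fly; undecodable bytes are replaced, not fatal.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk)  # the parser buffers tags split across chunks
    parser.close()
    return parser.links
```

Because the processor only reads a local file, it can be rerun with a different parser or extraction logic without touching the network again.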

relevant theory:

page cache
- https://biriukov.dev/docs/page-cache/3-page-cache-and-basic-file-operations/
- https://programs.team/page-cache-why-is-my-container-memory-usage-always-at-the-critical-point.html

crawling

httparchive publishes an almanac with statistics about websites
https://www.commoncrawl.org

index

tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
seekstorm Search engine library & multi-tenancy server in Rust

scraping

search

Search Engines

Terrier

Terrier.org (Java) by the Information Retrieval Group within the School of Computing Science at the University of Glasgow

news archiving

https://wiki.archiveteam.org/index.php?title=NewsGrabber

feeds

https://hackage.haskell.org/package/feed-crawl

from org-mode/thk

decentralized search engine

https://yacy.net
- RFP
- https://www.youtube.com/c/YaCyTutorials/videos
https://github.com/nvasilakis/yippee
- Java
- last commit 2012
https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
https://metaphacts.com/diesel
- non-free enterprise search engine
https://fourweekmba.com/distributed-search-engines-vs-google/
presearch.{io|org}
- non-free, blockchain, bla bla
https://github.com/kearch/kearch
- last commit 2019
- Python
https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
https://wiki.p2pfoundation.net/Distributed_Search_Engines
https://en.wikipedia.org/wiki/Distributed_search_engine
https://searx.github.io/searx/
https://www.astridmager.net researcher; alternative, ethical search
Why one should also keep a local archive: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america

Internet Archive

Crawling, Crawler

https://en.wikipedia.org/wiki/WARC_(file_format)

ArchiveBox
- https://github.com/ArchiveBox/ArchiveBox
Python, active

Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…
- https://archivebox.io
grab-site

Python, active https://github.com/ArchiveTeam/grab-site
WASP
- Paper: WASP: Web Archiving and Search Personalized via https://github.com/webis-de/wasp
  - Java, inactive?
  - related Paper: Extending WASP: providing context to a personal web archive
StormCrawler
- https://en.wikipedia.org/wiki/StormCrawler
- Based on Apache Storm (distributed stream processing)
HTTrack
- https://www.httrack.com
- Written in C, last release 2017, but active forum and GitHub
- https://github.com/xroche/httrack/tags
Grub.org

decentralized crawler of the Wikia project, written in C# and Python
- https://web.archive.org/web/20090207182028/http://grub.org/
- https://en.wikipedia.org/wiki/Grub_(search_engine)
HCE – Hierarchical Cluster Engine
- http://hierarchical-cluster-engine.com
- https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project
- Ukraine, active until approx. 2014?
Heritrix
- https://en.wikipedia.org/wiki/Heritrix
- Crawler of archive.org
Haskell
- https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
  - https://www.reddit.com/r/haskell/comments/6nazqm/notes_on_design_of_webscrapers_in_haskell/
- https://github.com/jordanspooner/haskell-web-crawler
  - dead since 2017, seems like a student assignment
- https://hackage.haskell.org/package/scalpel
  - A high level web scraping library for Haskell.
- https://hackage.haskell.org/package/hScraper
  - one version, 0.1.0.0, from 2015
- https://hackage.haskell.org/package/hs-scrape
  - one version from 2014, but a git commit from 2020
- https://hackage.haskell.org/package/http-conduit-downloader
  - HTTP/HTTPS downloader built on top of http-client and used in https://bazqux.com crawler.
- https://github.com/hunt-framework/hunt
  - A flexible, lightweight search platform
  - Predecessor: https://github.com/fortytools/holumbus/
    - http://holumbus.fh-wedel.de/trac
- lower level libs
  - https://github.com/haskell/wreq http://www.serpentine.com/wreq
  - https://github.com/snoyberg/http-client/blob/master/TUTORIAL.md

Search

openwebsearch.eu
- https://openwebsearch.eu/the-project/research-results
lemurproject.org
1. lucindri
  
  https://lemurproject.org/lucindri.php
  
  Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene Search Engine. Lucindri consists of two components: the indexer and the searcher.
2. Galago
  
  toolkit for experimenting with text search
  - http://www.search-engines-book.com

Wikia

https://en.wikipedia.org/wiki/Wikia_Search
by Jimmy Wales
Implementation
- Grub, decentralized crawler
- Apache Nutch

Research 2023-11-26

Scraping

https://github.com/scrapy

Populus:DezInV/Notes: Difference between revisions

@@ Line 1: / Line 1: @@
-== Crawling ==
+== architecture ==
+=== split between crawler/warc writer and processor ===
+Having two separate processes for crawling and processing has advantages:
+* The crawler is simple: work through a queue of URLs and write a WARC file.
+* Processing can be redone as often as needed without network access.
+There should not be much overhead:
+* The (compressed) data should still be in the kernel's page cache if processing starts immediately after the WARC file has been written.
+* Decompression is fast, and the decompressed data can be piped directly into an HTML parser to minimize memory consumption.
+relevant theory:
+* page cache
+** https://biriukov.dev/docs/page-cache/3-page-cache-and-basic-file-operations/
+** https://programs.team/page-cache-why-is-my-container-memory-usage-always-at-the-critical-point.html
+== archive ==
+* https://wiki.archiveteam.org
+* [https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1-annotated/ The WARC format]
+* [https://encode.su/threads/3660-Best-compressors-for-huge-JSON-and-WARC-(web-archive)-files Discussion about compression] with somebody working on a Russian web archive
+=== compression ===
+* Book: 1999, Managing Gigabytes - Compressing and Indexing Documents and Images
+* brotli
+** https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf
+==== seekable (random access) compression ====
+Spinning hard drives have sequential read speeds above 100 MB/s. Thus a record in e.g. a 64 MB compressed WARC file can be retrieved in about 0.3 s on average (half the file has to be read), even when the archive has to be scanned from the start.
+(64 MB once was the recommended size for Hadoop files.)
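
The estimate above can be checked with a few lines of arithmetic. This is a rough sketch under two assumptions that are not stated in the notes: the wanted record sits at a uniformly random position in the file, and the disk (not decompression) is the bottleneck. The function name average_seek_seconds is made up for illustration.

```python
def average_seek_seconds(archive_mb, read_mb_per_s):
    """Expected time to reach a uniformly random record when reading from the start.

    On average half of the compressed file has to be read sequentially.
    """
    return (archive_mb / 2) / read_mb_per_s


# 64 MB archive at exactly 100 MB/s: 32 MB / 100 MB/s
print(average_seek_seconds(64, 100))  # 0.32
# Faster drives bring the average below 0.3 s (from roughly 107 MB/s upwards).
print(average_seek_seconds(64, 150))
```

Under these assumptions the worst case, a record at the very end of the file, takes twice the average.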
+Still, these were the links I found:
+* https://github.com/facebook/zstd/blob/dev/contrib/seekable_format/zstd_seekable_compression_format.md
+* https://encode.su/threads/1766-Seekable-compression
+* https://stackoverflow.com/questions/2046559/any-seekable-compression-library
+* https://superuser.com/questions/1235351/what-makes-a-tar-archive-seekable
+* https://serverfault.com/questions/59795/is-there-a-smarter-tar-or-cpio-out-there-for-efficiently-retrieving-a-file-store/546691#546691
+* https://innovation.ebayinc.com/tech/engineering/gzinga-seekable-and-splittable-gzip/
+* https://stackoverflow.com/questions/14225751/random-access-to-gzipped-files
+* https://stackoverflow.com/questions/429987/compression-formats-with-good-support-for-random-access-within-archives
+* https://stackoverflow.com/questions/236414/what-is-the-best-compression-algorithm-that-allows-random-reads-writes-in-a-file
+== crawling ==
 * [https://httparchive.org httparchive] publishes an [https://almanac.httparchive.org almanac] with statistics about websites
+* https://www.commoncrawl.org
+== index ==
+* [https://github.com/quickwit-oss/tantivy tantivy] is a full-text search engine library inspired by Apache Lucene and written in Rust
+* [https://crates.io/crates/seekstorm seekstorm] Search engine library & multi-tenancy server in Rust
+== scraping ==
+== search ==
+=== Search Engines ===
+==== Terrier ====
+* [http://terrier.org Terrier.org] (Java) by the Information Retrieval Group within the School of Computing Science at the University of Glasgow
+== news archiving ==
+* https://wiki.archiveteam.org/index.php?title=NewsGrabber
+== feeds ==
+* https://hackage.haskell.org/package/feed-crawl
+= from org-mode/thk =
+== decentralized search engine ==
+* https://yacy.net
+** [https://bugs.debian.org/768171 RFP]
+** https://www.youtube.com/c/YaCyTutorials/videos
+* https://github.com/nvasilakis/yippee
+** Java
+** last commit 2012
+* https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
+* https://metaphacts.com/diesel
+** non-free enterprise search engine
+* https://fourweekmba.com/distributed-search-engines-vs-google/
+* presearch.{io|org}
+** non-free, blockchain, bla bla
+* https://github.com/kearch/kearch
+** last commit 2019
+** Python
+* https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
+* https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
+* https://wiki.p2pfoundation.net/Distributed_Search_Engines
+* https://en.wikipedia.org/wiki/Distributed_search_engine
+* https://searx.github.io/searx/
+* https://www.astridmager.net researcher; alternative, ethical search
+* Why one should also keep a local archive: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america
+=== Internet Archive ===
+* https://netpreserve.org/about-us/
+* https://github.com/iipc/awesome-web-archiving
+* https://github.com/internetarchive
+* https://github.com/iipc/openwayback
+* http://timetravel.mementoweb.org/
+** http://timetravel.mementoweb.org/about/
+* https://github.com/webrecorder
+=== Crawling, Crawler ===
+* https://en.wikipedia.org/wiki/WARC_(file_format)
+<ol>
+<li><p>ArchiveBox</p>
+<ul>
+<li>https://github.com/ArchiveBox/ArchiveBox</li></ul>
+<p>Python, active</p>
+<p>Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…</p>
+<ul>
+<li>https://archivebox.io</li></ul>
+</li>
+<li><p>grab-site</p>
+<p>Python, active https://github.com/ArchiveTeam/grab-site</p></li>
+<li><p>WASP</p>
+<ul>
+<li>Paper: [https://ceur-ws.org/Vol-2167/paper6.pdf WASP: Web Archiving and Search Personalized] via https://github.com/webis-de/wasp
+<ul>
+<li>Java, inactive?</li>
+<li>related Paper: [https://www.cs.ru.nl/bachelors-theses/2019/Gijs_Hendriksen___4324544___Extending_WASP_providing_context_to_a_personal_web_archive.pdf Extending WASP: providing context to a personal web archive]</li></ul>
+</li></ul>
+</li>
+<li><p>StormCrawler</p>
+<ul>
+<li>https://en.wikipedia.org/wiki/StormCrawler</li>
+<li>Based on Apache Storm (distributed stream processing)</li></ul>
+</li>
+<li><p>HTTrack</p>
+<ul>
+<li>https://www.httrack.com</li>
+<li>Written in C, last release 2017, but active forum and GitHub</li>
+<li>https://github.com/xroche/httrack/tags</li></ul>
+</li>
+<li><p>Grub.org</p>
+<p>decentralized crawler of the Wikia project, written in C# and Python</p>
+<ul>
+<li>https://web.archive.org/web/20090207182028/http://grub.org/</li>
+<li>https://en.wikipedia.org/wiki/Grub_(search_engine)</li></ul>
+</li>
+<li><p>HCE – Hierarchical Cluster Engine</p>
+<ul>
+<li>http://hierarchical-cluster-engine.com</li>
+<li>https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project</li>
+<li>Ukraine, active until approx. 2014?</li></ul>
+</li>
+<li><p>Heritrix</p>
+<ul>
+<li>https://en.wikipedia.org/wiki/Heritrix</li>
+<li>Crawler of archive.org</li></ul>
+</li>
+<li><p>Haskell</p>
+<ul>
+<li>https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
+<ul>
+<li>https://www.reddit.com/r/haskell/comments/6nazqm/notes_on_design_of_webscrapers_in_haskell/</li></ul>
+</li>
+<li>https://github.com/jordanspooner/haskell-web-crawler
+<ul>
+<li>dead since 2017, seems like a student assignment</li></ul>
+</li>
+<li>https://hackage.haskell.org/package/scalpel
+<ul>
+<li>A high level web scraping library for Haskell.</li></ul>
+</li>
+<li>https://hackage.haskell.org/package/hScraper
+<ul>
+<li>one version, 0.1.0.0, from 2015</li></ul>
+</li>
+<li>https://hackage.haskell.org/package/hs-scrape
+<ul>
+<li>one version from 2014, but a git commit from 2020</li></ul>
+</li>
+<li>https://hackage.haskell.org/package/http-conduit-downloader
+<ul>
+<li>HTTP/HTTPS downloader built on top of http-client and used in https://bazqux.com crawler.</li></ul>
+</li>
+<li>https://github.com/hunt-framework/hunt
+<ul>
+<li>A flexible, lightweight search platform</li>
+<li>Predecessor: https://github.com/fortytools/holumbus/
+<ul>
+<li>http://holumbus.fh-wedel.de/trac</li></ul>
+</li></ul>
+</li>
+<li>lower level libs
+<ul>
+<li>https://github.com/haskell/wreq http://www.serpentine.com/wreq</li>
+<li>https://github.com/snoyberg/http-client/blob/master/TUTORIAL.md</li></ul>
+</li></ul>
+</li></ol>
+<span id="search"></span>
+=== Search ===
+<ol>
+<li><p>openwebsearch.eu</p>
+<ul>
+<li>https://openwebsearch.eu/the-project/research-results</li></ul>
+</li>
+<li><p>lemurproject.org</p>
+<ol>
+<li><p>lucindri</p>
+<p>https://lemurproject.org/lucindri.php</p>
+<p>Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene Search Engine. Lucindri consists of two components: the indexer and the searcher.</p></li>
+<li><p>Galago</p>
+<p>toolkit for experimenting with text search</p>
+<ul>
+<li>http://www.search-engines-book.com</li></ul>
+</li></ol>
+</li></ol>
+=== Wikia ===
+* https://en.wikipedia.org/wiki/Wikia_Search
+* by Jimmy Wales
+* Implementation
+** Grub, decentralized crawler
+** Apache Nutch
+=== Research 2023-11-26 ===
+* https://en.wikipedia.org/wiki/Category:Web_scraping
-== Indexing ==
+* https://en.wikipedia.org/wiki/Category:Web_archiving
+* https://en.wikipedia.org/wiki/Category:Free_search_engine_software
+* https://en.wikipedia.org/wiki/Category:Internet_search_engines
+* https://en.wikipedia.org/wiki/Category:Free_web_crawlers
-== Scraping ==
+=== Scraping ===
+* https://github.com/scrapy
-== Search ==