Populus:DezInV/Notes

Aus Populus DE
Zur Navigation springenZur Suche springen
K (Thk verschob die Seite Populus:DezInV/Notizen nach Populus:DezInV/Notes, ohne dabei eine Weiterleitung anzulegen)
 
(6 dazwischenliegende Versionen desselben Benutzers werden nicht angezeigt)
Zeile 1: Zeile 1:
  +
== architecture ==

=== split between crawler/warc writer and processor ===

Having two separate processes for crawling and processing has advantages:
* The crawler is simple: work through a queue of URLs and write a WARC file (see the sketch after this list)
* Processing can be redone as often as needed without network access
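
A minimal sketch of the crawler side, using only the Python standard library. The file name and the simplified record layout are illustrative assumptions (a real WARC response record also stores the HTTP response headers; see the WARC format link in the archive section). Each record is written as its own gzip member, which is how .warc.gz files keep records individually decompressible:

<syntaxhighlight lang="python">
# crawler sketch -- illustrative only, not the project's actual crawler
import gzip
import uuid
from datetime import datetime, timezone
from urllib.request import urlopen

def warc_record(url: str, payload: bytes) -> bytes:
    """Build a simplified WARC-style 'resource' record: header block, blank line, payload."""
    headers = [
        "WARC/1.1",
        "WARC-Type: resource",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Length: {len(payload)}",
    ]
    return "\r\n".join(headers).encode() + b"\r\n\r\n" + payload + b"\r\n\r\n"

def crawl(urls, out_path="crawl.warc.gz"):
    """Work through a queue of URLs and append one record per fetched page."""
    with open(out_path, "wb") as out:
        for url in urls:
            payload = urlopen(url, timeout=30).read()
            # one gzip member per record: members can be decompressed independently
            out.write(gzip.compress(warc_record(url, payload)))

if __name__ == "__main__":
    crawl(["https://example.org/"])
</syntaxhighlight>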

There should not be much overhead:

* The (compressed) data should still be in the kernel's page cache if processing starts immediately after the WARC file has been written.
* Decompression is fast, and the decompressed data can be piped directly into an HTML parser to minimize memory consumption (see the processor sketch below).
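
The processor side, matching the simplified records written by the crawler sketch above (for real WARC files a parsing library such as warcio would take over the record handling): the archive is decompressed as a stream and each record's HTML is handed to an incremental parser, so only one record's payload is in memory at a time.

<syntaxhighlight lang="python">
# processor sketch -- reads the simplified records from the crawler sketch above
import gzip
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Example processing step: collect outgoing links from a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

def records(path):
    """Yield (headers, payload) per record; gzip.open crosses gzip member boundaries transparently."""
    with gzip.open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                return
            headers = {}
            while line not in (b"", b"\r\n"):
                if b":" in line:
                    name, value = line.decode().split(":", 1)
                    headers[name.strip()] = value.strip()
                line = f.readline()
            payload = f.read(int(headers.get("Content-Length", "0")))
            f.read(4)  # trailing CRLF CRLF after the payload block
            yield headers, payload

for headers, payload in records("crawl.warc.gz"):
    parser = LinkExtractor()
    parser.feed(payload.decode("utf-8", errors="replace"))
    print(headers.get("WARC-Target-URI"), len(parser.links), "links")
</syntaxhighlight>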

relevant theory:

* page cache
** https://biriukov.dev/docs/page-cache/3-page-cache-and-basic-file-operations/
** https://programs.team/page-cache-why-is-my-container-memory-usage-always-at-the-critical-point.html
 
== archive ==
 
* https://wiki.archiveteam.org
* [https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1-annotated/ The WARC format]
* [https://encode.su/threads/3660-Best-compressors-for-huge-JSON-and-WARC-(web-archive)-files Discussion about compression] with somebody working on a Russian web archive
=== compression ===
* Book (1999): Managing Gigabytes - Compressing and Indexing Documents and Images
* brotli
** https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf

==== seekable (random access) compression ====

Spinning hard drives have sequential read speeds above 100 MB/s. A record in, e.g., a 64 MB compressed WARC file can therefore be retrieved in roughly 0.3 s on average, even when the archive has to be read sequentially from the start.

(64 MB was once the recommended block size for Hadoop files.)
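
A back-of-envelope check of that figure (the numbers are illustrative and assume decompression keeps up with the disk):

<syntaxhighlight lang="python">
# rough retrieval time for one record in a record-compressed 64 MB archive on a spinning disk
archive_mb = 64     # archive size, roughly the old Hadoop block-size recommendation
disk_mb_s  = 100    # conservative sequential read speed
seek_s     = 0.01   # one initial head seek, about 10 ms

worst_case_s = seek_s + archive_mb / disk_mb_s        # wanted record is the last one
average_s    = seek_s + (archive_mb / 2) / disk_mb_s  # wanted record is in the middle on average

print(f"worst case ~{worst_case_s:.2f} s, average ~{average_s:.2f} s")  # ~0.65 s / ~0.33 s
</syntaxhighlight>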

Still, these were the links I found:

* https://github.com/facebook/zstd/blob/dev/contrib/seekable_format/zstd_seekable_compression_format.md
* https://encode.su/threads/1766-Seekable-compression
* https://stackoverflow.com/questions/2046559/any-seekable-compression-library
* https://superuser.com/questions/1235351/what-makes-a-tar-archive-seekable
* https://serverfault.com/questions/59795/is-there-a-smarter-tar-or-cpio-out-there-for-efficiently-retrieving-a-file-store/546691#546691
* https://innovation.ebayinc.com/tech/engineering/gzinga-seekable-and-splittable-gzip/
* https://stackoverflow.com/questions/14225751/random-access-to-gzipped-files
* https://stackoverflow.com/questions/429987/compression-formats-with-good-support-for-random-access-within-archives
* https://stackoverflow.com/questions/236414/what-is-the-best-compression-algorithm-that-allows-random-reads-writes-in-a-file
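
Most of the formats above boil down to the same idea: compress the data in independent chunks and keep an index mapping uncompressed positions to compressed offsets (record-at-a-time gzip plus a CDX-style index plays the same role for WARC files). A minimal sketch of that idea, assuming plain gzip members and a JSON side file; chunk size and file names are illustrative:

<syntaxhighlight lang="python">
# seekable compression sketch: independent gzip members plus an offset index
import gzip
import json

CHUNK = 1 << 20  # 1 MiB of uncompressed data per independently compressed chunk

def compress_seekable(src_path, data_path, index_path):
    """Compress src_path chunk by chunk and record (compressed_offset, uncompressed_offset) pairs."""
    index, comp_off, uncomp_off = [], 0, 0
    with open(src_path, "rb") as src, open(data_path, "wb") as dst:
        while chunk := src.read(CHUNK):
            index.append([comp_off, uncomp_off])
            member = gzip.compress(chunk)  # each chunk becomes a standalone gzip member
            dst.write(member)
            comp_off += len(member)
            uncomp_off += len(chunk)
    with open(index_path, "w") as f:
        json.dump(index, f)

def read_range(data_path, index_path, uncomp_pos, size):
    """Read `size` uncompressed bytes at `uncomp_pos` without decompressing from the start."""
    with open(index_path) as f:
        index = json.load(f)
    # last chunk that starts at or before the requested position (offsets are sorted)
    comp_off, chunk_start = [entry for entry in index if entry[1] <= uncomp_pos][-1]
    with open(data_path, "rb") as raw:
        raw.seek(comp_off)                    # jump straight to the right gzip member
        with gzip.open(raw, "rb") as g:       # reads across member boundaries if needed
            g.read(uncomp_pos - chunk_start)  # skip within the chunk
            return g.read(size)
</syntaxhighlight>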
   
 
== crawling ==
 
* [https://httparchive.org httparchive] has an [https://almanac.httparchive.org Almanac] with statistics about websites
* https://www.commoncrawl.org
   
 
== index ==
 
* [https://github.com/quickwit-oss/tantivy tantivy] is a full-text search engine library inspired by Apache Lucene and written in Rust
   
 
== scraping ==
 
== search ==
 
=== Search Engines ===

==== Terrier ====

* [http://terrier.org Terrier.org] (Java) by the Information Retrieval Group within the School of Computing Science at the University of Glasgow

== news archiving ==

* https://wiki.archiveteam.org/index.php?title=NewsGrabber

== feeds ==

* https://hackage.haskell.org/package/feed-crawl

= from org-mode/thk =

== decentralized search engine ==

* https://yacy.net
** [https://bugs.debian.org/768171 RFP]
** https://www.youtube.com/c/YaCyTutorials/videos
* https://github.com/nvasilakis/yippee
** Java
** last commit 2012
* https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
* https://metaphacts.com/diesel
** non-free enterprise search engine
* https://fourweekmba.com/distributed-search-engines-vs-google/
* presearch.{io|org}
** non-free, blockchain, blah blah
* https://github.com/kearch/kearch
** last commit 2019
** Python
* https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
* https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
* https://wiki.p2pfoundation.net/Distributed_Search_Engines
* https://en.wikipedia.org/wiki/Distributed_search_engine
* https://searx.github.io/searx/
* https://www.astridmager.net researcher on alternative, ethical search
* Why you should also keep a local archive: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america

=== Internet Archive ===
* https://netpreserve.org/about-us/
* https://github.com/iipc/awesome-web-archiving
* https://github.com/internetarchive
* https://github.com/iipc/openwayback
* http://timetravel.mementoweb.org/
** http://timetravel.mementoweb.org/about/
* https://github.com/webrecorder

=== Crawling, Crawler ===

* https://en.wikipedia.org/wiki/WARC_(file_format)

# ArchiveBox
#* https://github.com/ArchiveBox/ArchiveBox
#* Python, active
#* Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…
#* https://archivebox.io
# grab-site
#* https://github.com/ArchiveTeam/grab-site
#* Python, active
# WASP
#* Paper: [https://ceur-ws.org/Vol-2167/paper6.pdf WASP: Web Archiving and Search Personalized] via https://github.com/webis-de/wasp
#** Java, inactive?
#** related paper: [https://www.cs.ru.nl/bachelors-theses/2019/Gijs_Hendriksen___4324544___Extending_WASP_providing_context_to_a_personal_web_archive.pdf Extending WASP: providing context to a personal web archive]
# StormCrawler
#* https://en.wikipedia.org/wiki/StormCrawler
#* based on Apache Storm (distributed stream processing)
# HTTrack
#* https://www.httrack.com
#* written in C, last release 2017, but the forum and GitHub are active
#* https://github.com/xroche/httrack/tags
# Grub.org
#* decentralized crawler of the Wikia project, written in C# and Python
#* https://web.archive.org/web/20090207182028/http://grub.org/
#* https://en.wikipedia.org/wiki/Grub_(search_engine)
# HCE – Hierarchical Cluster Engine
#* http://hierarchical-cluster-engine.com
#* https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project
#* Ukraine, active until ca. 2014?
# Heritrix
#* https://en.wikipedia.org/wiki/Heritrix
#* archive.org's crawler
# Haskell
#* https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
#** https://www.reddit.com/r/haskell/comments/6nazqm/notes_on_design_of_webscrapers_in_haskell/
#* https://github.com/jordanspooner/haskell-web-crawler
#** dead since 2017, seems like a student assignment
#* https://hackage.haskell.org/package/scalpel
#** a high-level web scraping library for Haskell
#* https://hackage.haskell.org/package/hScraper
#** one version, 0.1.0.0, from 2015
#* https://hackage.haskell.org/package/hs-scrape
#** one release from 2014, but git commits from 2020
#* https://hackage.haskell.org/package/http-conduit-downloader
#** HTTP/HTTPS downloader built on top of http-client and used in the https://bazqux.com crawler
#* https://github.com/hunt-framework/hunt
#** a flexible, lightweight search platform
#** predecessor: https://github.com/fortytools/holumbus/
#*** http://holumbus.fh-wedel.de/trac
#* lower-level libs
#** https://github.com/haskell/wreq http://www.serpentine.com/wreq
#** https://github.com/snoyberg/http-client/blob/master/TUTORIAL.md

=== Search ===

# openwebsearch.eu
#* https://openwebsearch.eu/the-project/research-results
# lemurproject.org
## lucindri
##* https://lemurproject.org/lucindri.php
##* Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene Search Engine. Lucindri consists of two components: the indexer and the searcher.
## Galago
##* toolkit for experimenting with text search
##* http://www.search-engines-book.com

=== Wikia ===

* https://en.wikipedia.org/wiki/Wikia_Search
* by Jimmy Wales
* implementation
** Grub, decentralized crawler
** Apache Nutch

=== Research 2023-11-26 ===

* https://en.wikipedia.org/wiki/Category:Web_scraping
* https://en.wikipedia.org/wiki/Category:Web_archiving
* https://en.wikipedia.org/wiki/Category:Free_search_engine_software
* https://en.wikipedia.org/wiki/Category:Internet_search_engines
* https://en.wikipedia.org/wiki/Category:Free_web_crawlers

=== Scraping ===
* https://github.com/scrapy
