Populus:DezInV/Notes: Difference between revisions

From Populus DE
(10 intermediate revisions by the same user are not shown)
Line 1: Line 1:
== See also ==

* [[Suchmaschine]]

== architecture ==

* [https://www.youtube.com/watch?v=0LTXCcVRQi0 Design a Basic Search Engine, System Design Interview Prep] (YT)
* [https://www.youtube.com/watch?v=BKZxZwUgL3Y System Design distributed web crawler to crawl Billions of web pages] (YT)
* [https://github.com/internetarchive/heritrix3/wiki/Internet%20Archive%20Crawler%20Requirements%20Analysis Internet Archive Crawler Requirements Analysis] 2018
* [https://github.com/nhthanh87/seck/wiki Search Engine Construction Kit Wiki]
* [https://www-nlp.stanford.edu/IR-book/ Introduction to Information Retrieval] book
* [https://webpages.charlotte.edu/sakella/courses/cloud09/papers/Mercator.pdf Mercator: A scalable, extensible Web crawler], paper 1999?
** [https://medium.com/@kslohith1729/mercator-a-masterclass-in-system-design-for-a-web-crawler-30e690a9103b Mercator, A masterclass in system design for a web crawler] 20.2.2024
** TODO search for mercator scheme


=== split between crawler/warc writer and processor ===
Line 21: Line 34:
== archive ==


* Kiwix, makes websites available offline for regions without internet access, uses the ZIM file format, Wikimedia Foundation
* https://wiki.archiveteam.org
* [https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1-annotated/ The WARC format]
Line 53: Line 67:
* [https://httparchive.org httparchive] has an [https://almanac.httparchive.org Almanac] with statistics about websites
* https://www.commoncrawl.org
** https://crates.io/crates/cc-downloader
* [https://github.com/crawler-commons/url-frontier/discussions/45 List of open source web crawlers - could they use URLFrontier?]

=== rust ===

* https://crates.io/crates/spider
** by spider.cloud, doing crawling as a service
** apparently highly optimized and decentralized
** https://crates.io/crates/spider_worker
** website.rs has about 5000 lines, unmaintainable code
* https://crates.io/crates/gar-crawl - abandoned, example project
* https://crates.io/crates/texting_robots - robots.txt parsing
* https://crates.io/crates/dyer (2y)
* https://crates.io/crates/crusty (2y), abandoned, 1 contributor
* https://crates.io/crates/crawly, 1 355 lines file
* https://github.com/joelkoen/wls crawl multiple sitemaps and list URLs
* https://crates.io/crates/frangipani, (1y, 25 commits) possibly a few ideas
* https://github.com/spire-rs 1 person hobby
** https://crates.io/crates/robotxt
* https://crates.io/crates/quick_crawler (+4y)
* https://crates.io/crates/robots_txt (+4y)
* https://crates.io/crates/website_crawler
* https://crates.io/crates/stream_crawler - experiment
* https://github.com/tokahuke/lopez (2y, 106 commits, crawl directives language)
* https://crates.io/crates/waper CLI tool to scrape HTML websites, interesting data-structure libraries
* https://crates.io/crates/recursive_scraper

=== url frontier ===

* https://github.com/crawler-commons/url-frontier

== url normalization ==

* https://crates.io/crates/urlnorm


== index ==


* [https://github.com/quickwit-oss/tantivy tantivy] is a full-text search engine library inspired by Apache Lucene and written in Rust
** https://github.com/lnx-search/lnx - deployment of tantivy
** https://crates.io/crates/tantivy_warc_indexer - builds a tantivy index from common crawl warc.wet files
* [https://crates.io/crates/seekstorm seekstorm] Search engine library & multi-tenancy server in Rust


== scraping ==

* https://crates.io/crates/article-extractor


== search ==


=== Search Engines ===

* https://dawnsearch.org distributed web search engine that searches by meaning, rust
** https://crates.io/crates/dawnsearch


==== Terrier ====
Line 246: Line 301:


* https://github.com/scrapy

= Miscellaneous =

== Distributed systems ==

* https://crates.io/crates/amadeus stream processing on top of https://github.com/constellation-rs/constellation

Latest revision as of 17 December 2024, 18:54

See also

architecture

split between crawler/warc writer and processor

Having two separate processes for crawling and processing has advantages:

  • The crawler stays simple: it works through a queue of URLs and writes a WARC file.
  • Processing can be redone as often as needed without network access.
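
The split described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout is a simplified WARC-style format (a spec-conformant record needs further mandatory fields such as WARC-Record-ID and WARC-Date), and `fetch` is a hypothetical callable standing in for the HTTP client. Each record is written as its own gzip member, a common convention for WARC files, so the processor can re-read the archive any number of times without touching the network.

```python
import gzip
import io
from typing import BinaryIO, Callable, Iterable, Iterator, Tuple


def write_record(out: BinaryIO, url: str, body: bytes) -> None:
    """Append one simplified WARC-style record as its own gzip member."""
    header = (
        "WARC/1.1\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {url}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    out.write(gzip.compress(header + body + b"\r\n\r\n"))


def crawl(queue: Iterable[str], fetch: Callable[[str], bytes], out: BinaryIO) -> None:
    """Crawler side: work through the queue and write records, nothing else."""
    for url in queue:
        write_record(out, url, fetch(url))


def read_records(raw: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Processor side: yield (url, body) pairs; needs no network access."""
    # GzipFile reads concatenated gzip members as one continuous stream.
    with gzip.GzipFile(fileobj=raw) as stream:
        while stream.readline():  # consumes the "WARC/1.1" version line
            headers = {}
            while True:
                line = stream.readline()
                if line in (b"\r\n", b""):
                    break
                name, _, value = line.decode("utf-8").partition(":")
                headers[name.strip()] = value.strip()
            body = stream.read(int(headers["Content-Length"]))
            stream.read(4)  # skip the record terminator "\r\n\r\n"
            yield headers["WARC-Target-URI"], body


if __name__ == "__main__":
    # Hypothetical in-memory "web" so the demo runs offline.
    pages = {"https://example.org/": b"<html><body>hello</body></html>"}
    archive = io.BytesIO()
    crawl(pages, pages.__getitem__, archive)
    for _ in range(2):  # processing can be repeated on the same file
        archive.seek(0)
        for url, body in read_records(archive):
            print(url, len(body))
```

Because the processor only sees the archive file, a changed extraction step means re-running `read_records`, not re-crawling.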

There should not be much overhead:

  • The (compressed) data should still be in the kernel's page cache if processing starts immediately after the WARC file has been written.
  • Decompression is fast, and the decompressed data can be piped directly into an HTML parser to minimize memory consumption.
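
The second point can be illustrated with a small sketch. It assumes the input is a gzip-compressed HTML body in UTF-8 that has already been isolated from its record headers; `LinkCollector`, `extract_links` and the chunk size are names and values chosen here for illustration only. The data is decompressed in chunks and fed to the parser incrementally, so the full decompressed document never has to be held in memory at once.

```python
import codecs
import gzip
import io
from html.parser import HTMLParser
from typing import BinaryIO, List


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags seen so far."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(compressed: BinaryIO, chunk_size: int = 64 * 1024) -> List[str]:
    """Decompress chunk by chunk and feed each chunk straight to the parser."""
    parser = LinkCollector()
    # The incremental decoder copes with multi-byte characters that are
    # split across chunk boundaries.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    with gzip.GzipFile(fileobj=compressed) as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            parser.feed(decoder.decode(chunk))
    parser.feed(decoder.decode(b"", final=True))
    parser.close()
    return parser.links


if __name__ == "__main__":
    html = '<html><body><a href="/a">first</a> <a href="/b">second</a></body></html>'
    blob = io.BytesIO(gzip.compress(html.encode("utf-8")))
    print(extract_links(blob))
```

`HTMLParser.feed` buffers incomplete tags internally, which is why arbitrary chunk boundaries are acceptable here.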

relevant theory:

archive

compression

seekable (random access) compression

Spinning hard drives have sequential read speeds above 100 MB/s. Thus a record in e.g. a 64 MB compressed WARC file can be retrieved in roughly 0.3 s on average, even when the archive has to be read from the start: on average half the file (32 MB) is read, which takes 0.32 s at 100 MB/s.

(64 MB once was the recommended size for Hadoop files.)
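
The estimate above can be reproduced with a back-of-the-envelope calculation. The sketch assumes the wanted record sits at a uniformly random position, so on average half the file is read, and it ignores seek latency and decompression CPU time. The function name is made up for this note.

```python
def mean_retrieval_seconds(file_mb: float, read_mb_per_s: float) -> float:
    """Mean time to reach a random record when reading from the file start."""
    return (file_mb / 2) / read_mb_per_s


if __name__ == "__main__":
    # 64 MB file at 100 MB/s: about 0.32 s on average.
    print(mean_retrieval_seconds(64, 100))
    # A faster drive or a smaller file lowers the average proportionally.
    print(mean_retrieval_seconds(64, 150))
```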

Still, these were the links I found:

crawling

rust

url frontier

url normalization

index

scraping

search

Search Engines

Terrier

  • Terrier.org (Java) by the Information Retrieval Group within the School of Computing Science at the University of Glasgow

news archiving

feeds

from org-mode/thk

Decentralized search engine

Internet Archive

Crawling, Crawler

  1. ArchiveBox

    Python, active

    Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…

  2. grab-site

    Python, active https://github.com/ArchiveTeam/grab-site

  3. WASP

  4. StormCrawler

  5. HTTrack

  6. Grub.org

    Decentralized crawler of the Wikia project, in C# and Python

  7. HCE – Hierarchical Cluster Engine

  8. Heritrix

  9. Haskell

Search

  1. openwebsearch.eu

  2. lemurproject.org

    1. lucindri

      https://lemurproject.org/lucindri.php

      Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene Search Engine. Lucindri consists of two components: the indexer and the searcher.

    2. Galago

      toolkit for experimenting with text search

Wikia

Research 2023-11-26

Scraping

Miscellaneous

Distributed systems

- https://crates.io/crates/amadeus stream processing on top of https://github.com/constellation-rs/constellation