
Populus:DezInV/Notes: Difference between revisions

== search ==
 
= from org-mode/thk =
== Decentralized search engine ==

* https://yacy.net
** [https://bugs.debian.org/768171 RFP]
** https://www.youtube.com/c/YaCyTutorials/videos
* https://github.com/nvasilakis/yippee
** Java
** last commit 2012
* https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
* https://metaphacts.com/diesel
** non-free enterprise search engine
* https://fourweekmba.com/distributed-search-engines-vs-google/
* presearch.{io|org}
** non-free, blockchain-based, etc.
* https://github.com/kearch/kearch
** last commit 2019
** Python
* https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
* https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
* https://wiki.p2pfoundation.net/Distributed_Search_Engines
* https://en.wikipedia.org/wiki/Distributed_search_engine
* https://searx.github.io/searx/
* https://www.astridmager.net researcher on alternative and ethical search
* Why one should also keep a local archive: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america
=== Internet Archive ===

* https://netpreserve.org/about-us/
* https://github.com/iipc/awesome-web-archiving
* https://github.com/internetarchive
* https://github.com/iipc/openwayback
* http://timetravel.mementoweb.org/
** http://timetravel.mementoweb.org/about/
* https://github.com/webrecorder
=== Crawling, Crawler ===

* https://en.wikipedia.org/wiki/WARC_(file_format)
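For context on the WARC format linked above: a record is a small CRLF-delimited header block followed by a payload and a blank-line terminator. A minimal stdlib-only sketch of writing and reading back one record (real tooling such as warcio additionally handles digests, gzip members, and request/response pairing):

```python
from datetime import datetime, timezone
from uuid import uuid4

def build_warc_record(url: str, body: bytes) -> bytes:
    """Assemble a single WARC/1.0 'response' record by hand.

    Header names follow the WARC 1.0 spec; CRLF line endings and the
    blank line separating headers from the payload are mandatory.
    """
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", url),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", f"<urn:uuid:{uuid4()}>"),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(body))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # A record is: headers, a blank line, the payload, then two CRLFs.
    return head.encode() + b"\r\n" + body + b"\r\n\r\n"

def parse_warc_headers(record: bytes) -> dict:
    """Read the header block back into a dict (payload ignored)."""
    head, _, _ = record.partition(b"\r\n\r\n")
    lines = head.decode().split("\r\n")
    assert lines[0] == "WARC/1.0"
    return dict(line.split(": ", 1) for line in lines[1:])
```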
<ol>
<li><p>ArchiveBox</p>
<ul>
<li>https://github.com/ArchiveBox/ArchiveBox</li></ul>
<p>Python, active</p>
<p>Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…</p>
<ul>
<li>https://archivebox.io</li></ul>
</li>
<li><p>grab-site</p>
<p>Python, active: https://github.com/ArchiveTeam/grab-site</p></li>
<li><p>WASP</p>
<ul>
<li>Paper: [https://ceur-ws.org/Vol-2167/paper6.pdf WASP: Web Archiving and Search Personalized] via https://github.com/webis-de/wasp
<ul>
<li>Java, inactive?</li>
<li>related paper: [https://www.cs.ru.nl/bachelors-theses/2019/Gijs_Hendriksen___4324544___Extending_WASP_providing_context_to_a_personal_web_archive.pdf Extending WASP: providing context to a personal web archive]</li></ul>
</li></ul>
</li>
<li><p>StormCrawler</p>
<ul>
<li>https://en.wikipedia.org/wiki/StormCrawler</li>
<li>built on Apache Storm (distributed stream processing)</li></ul>
</li>
<li><p>HTTrack</p>
<ul>
<li>https://www.httrack.com</li>
<li>written in C; last release 2017, but active forum and GitHub</li>
<li>https://github.com/xroche/httrack/tags</li></ul>
</li>
<li><p>Grub.org</p>
<p>decentralized crawler of the Wikia project, in C# and Python</p>
<ul>
<li>https://web.archive.org/web/20090207182028/http://grub.org/</li>
<li>https://en.wikipedia.org/wiki/Grub_(search_engine)</li></ul>
</li>
<li><p>HCE – Hierarchical Cluster Engine</p>
<ul>
<li>http://hierarchical-cluster-engine.com</li>
<li>https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project</li>
<li>Ukraine, active until ca. 2014?</li></ul>
</li>
<li><p>Heritrix</p>
<ul>
<li>https://en.wikipedia.org/wiki/Heritrix</li>
<li>archive.org's crawler</li></ul>
</li>
<li><p>Haskell</p>
<ul>
<li>https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
<ul>
<li>https://www.reddit.com/r/haskell/comments/6nazqm/notes_on_design_of_webscrapers_in_haskell/</li></ul>
</li>
<li>https://github.com/jordanspooner/haskell-web-crawler
<ul>
<li>dead since 2017; looks like a student assignment</li></ul>
</li>
<li>https://hackage.haskell.org/package/scalpel
<ul>
<li>a high-level web scraping library for Haskell</li></ul>
</li>
<li>https://hackage.haskell.org/package/hScraper
<ul>
<li>a single version 0.1.0.0 from 2015</li></ul>
</li>
<li>https://hackage.haskell.org/package/hs-scrape
<ul>
<li>one release from 2014, but git commits from 2020</li></ul>
</li>
<li>https://hackage.haskell.org/package/http-conduit-downloader
<ul>
<li>HTTP/HTTPS downloader built on top of http-client; used in the https://bazqux.com crawler</li></ul>
</li>
<li>https://github.com/hunt-framework/hunt
<ul>
<li>a flexible, lightweight search platform</li>
<li>predecessor: https://github.com/fortytools/holumbus/
<ul>
<li>http://holumbus.fh-wedel.de/trac</li></ul>
</li></ul>
</li>
<li>lower-level libs
<ul>
<li>https://github.com/haskell/wreq http://www.serpentine.com/wreq</li>
<li>https://github.com/snoyberg/http-client/blob/master/TUTORIAL.md</li></ul>
</li></ul>
</li></ol>
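Whichever crawler from the list above is chosen, it has to honour robots.txt. Python's standard library already covers the parsing side; a small sketch (the rules and URLs below are made-up examples):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this
# from https://example.org/robots.txt before any other request.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check individual URLs against the parsed rules.
print(rp.can_fetch("mybot", "https://example.org/index.html"))  # allowed
print(rp.can_fetch("mybot", "https://example.org/private/x"))   # disallowed
print(rp.crawl_delay("mybot"))  # seconds to wait between requests
```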
<span id="search"></span>

=== Search ===
<ol>
<li><p>openwebsearch.eu</p>
<ul>
<li>https://openwebsearch.eu/the-project/research-results</li></ul>
</li>
<li><p>lemurproject.org</p>
<ol>
<li><p>Lucindri</p>
<p>https://lemurproject.org/lucindri.php</p>
<p>Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene Search Engine. Lucindri consists of two components: the indexer and the searcher.</p></li>
<li><p>Galago</p>
<p>toolkit for experimenting with text search</p>
<ul>
<li>http://www.search-engines-book.com</li></ul>
</li></ol>
</li></ol>
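All the search toolkits above revolve around an inverted index. A toy sketch of the core data structure with simple term-frequency ranking (an illustration only, not any listed engine's implementation):

```python
from collections import defaultdict

def build_index(docs: dict[str, str]) -> dict[str, dict[str, int]]:
    """Map each term to a posting list {doc_id: term frequency}."""
    index: dict[str, dict[str, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query: str) -> list[str]:
    """Return ids of docs containing every query term, best TF sum first."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]
    # Intersect posting lists: a doc must contain all query terms.
    common = set(postings[0]).intersection(*map(set, postings[1:]))
    scores = {doc_id: sum(p[doc_id] for p in postings) for doc_id in common}
    return sorted(scores, key=scores.get, reverse=True)
```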
=== Wikia ===

* https://en.wikipedia.org/wiki/Wikia_Search
* by Jimmy Wales
* implementation
** Grub, decentralized crawler
** Apache Nutch
=== Research 2023-11-26 ===

* https://en.wikipedia.org/wiki/Category:Web_scraping
* https://en.wikipedia.org/wiki/Category:Web_archiving
* https://en.wikipedia.org/wiki/Category:Free_search_engine_software
* https://en.wikipedia.org/wiki/Category:Internet_search_engines
* https://en.wikipedia.org/wiki/Category:Free_web_crawlers
=== Scraping ===
* https://github.com/scrapy
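Scrapy bundles crawling, scraping, and item pipelines; its core loop is fetch, extract, follow. The link-extraction step can be sketched with the standard library alone (Scrapy itself uses parsel selectors instead; the URLs below are placeholders):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from <a> tags: a scraper's 'follow' step."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url: str, html: str) -> list[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```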

Revision as of 7 December 2023, 13:05
