
Populus:DezInV/Notes: Difference between revisions

m (Thk moved the page Populus:DezInV/Notizen to Populus:DezInV/Notes without leaving a redirect)


== search ==

= from org-mode/thk =


== Decentralized search engine ==

* https://yacy.net
** [https://bugs.debian.org/768171 RFP]
** https://www.youtube.com/c/YaCyTutorials/videos
* https://github.com/nvasilakis/yippee
** Java
** last commit 2012
* https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
* https://metaphacts.com/diesel
** non-free enterprise search engine
* https://fourweekmba.com/distributed-search-engines-vs-google/
* presearch.{io|org}
** non-free, blockchain, blah blah
* https://github.com/kearch/kearch
** last commit 2019
** Python
* https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
* https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
* https://wiki.p2pfoundation.net/Distributed_Search_Engines
* https://en.wikipedia.org/wiki/Distributed_search_engine
* https://searx.github.io/searx/
* https://www.astridmager.net researcher; alternative, ethical search
* Why you should also keep a local archive: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america

=== Internet Archive ===

* https://netpreserve.org/about-us/
* https://github.com/iipc/awesome-web-archiving
* https://github.com/internetarchive
* https://github.com/iipc/openwayback
* http://timetravel.mementoweb.org/
** http://timetravel.mementoweb.org/about/
* https://github.com/webrecorder

=== Crawling, Crawler ===

* https://en.wikipedia.org/wiki/WARC_(file_format)
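
As a rough illustration of the WARC format linked above: a record consists of a version line, named header fields, a blank line, the content block, and two trailing CRLFs. The sketch below serializes one such record using only the Python standard library. The function name <code>warc_record</code> and its parameters are assumptions made for this note and are not taken from any of the tools listed here. Real archives should be written with an established WARC library.

```python
import uuid
from datetime import datetime, timezone


def warc_record(target_uri: str, payload: bytes,
                record_type: str = "resource",
                content_type: str = "text/html") -> bytes:
    """Serialize a single WARC/1.0 record (hypothetical helper, not a library API)."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        # Content-Length counts the bytes of the content block only.
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

Records produced this way can be concatenated into one file; compression and record types such as <code>request</code> or <code>response</code> are left out of this sketch.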

<ol>
<li><p>ArchiveBox</p>
<ul>
<li>https://github.com/ArchiveBox/ArchiveBox</li></ul>

<p>Python, active</p>
<p>Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…</p>
<ul>
<li>https://archivebox.io</li></ul>
</li>
<li><p>grab-site</p>
<p>Python, active: https://github.com/ArchiveTeam/grab-site</p></li>
<li><p>WASP</p>
<ul>
<li>Paper: [https://ceur-ws.org/Vol-2167/paper6.pdf WASP: Web Archiving and Search Personalized] via https://github.com/webis-de/wasp
<ul>
<li>Java, inactive?</li>
<li>related Paper: [https://www.cs.ru.nl/bachelors-theses/2019/Gijs_Hendriksen___4324544___Extending_WASP_providing_context_to_a_personal_web_archive.pdf Extending WASP: providing context to a personal web archive]</li></ul>
</li></ul>
</li>
<li><p>StormCrawler</p>
<ul>
<li>https://en.wikipedia.org/wiki/StormCrawler</li>
<li>Based on Apache Storm (distributed stream processing)</li></ul>
</li>
<li><p>HTTrack</p>
<ul>
<li>https://www.httrack.com</li>
<li>Written in C, last release 2017, but active forum and GitHub repository</li>
<li>https://github.com/xroche/httrack/tags</li></ul>
</li>
<li><p>Grub.org</p>
<p>Decentralized crawler of the Wikia project, written in C# and Python</p>
<ul>
<li>https://web.archive.org/web/20090207182028/http://grub.org/</li>
<li>https://en.wikipedia.org/wiki/Grub_(search_engine)</li></ul>
</li>
<li><p>HCE – Hierarchical Cluster Engine</p>
<ul>
<li>http://hierarchical-cluster-engine.com</li>
<li>https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project</li>
<li>Ukraine, active until about 2014?</li></ul>
</li>
<li><p>Heritrix</p>
<ul>
<li>https://en.wikipedia.org/wiki/Heritrix</li>
<li>Crawler of archive.org</li></ul>
</li>
<li><p>Haskell</p>
<ul>
<li>https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
<ul>
<li>https://www.reddit.com/r/haskell/comments/6nazqm/notes_on_design_of_webscrapers_in_haskell/</li></ul>
</li>
<li>https://github.com/jordanspooner/haskell-web-crawler
<ul>
<li>dead since 2017; looks like a student assignment</li></ul>
</li>
<li>https://hackage.haskell.org/package/scalpel
<ul>
<li>A high level web scraping library for Haskell.</li></ul>
</li>
<li>https://hackage.haskell.org/package/hScraper
<ul>
<li>a single version, 0.1.0.0, from 2015</li></ul>
</li>
<li>https://hackage.haskell.org/package/hs-scrape
<ul>
<li>a single version from 2014, but a git commit from 2020</li></ul>
</li>
<li>https://hackage.haskell.org/package/http-conduit-downloader
<ul>
<li>HTTP/HTTPS downloader built on top of http-client and used in https://bazqux.com crawler.</li></ul>
</li>
<li>https://github.com/hunt-framework/hunt
<ul>
<li>A flexible, lightweight search platform</li>
<li>Predecessor: https://github.com/fortytools/holumbus/
<ul>
<li>http://holumbus.fh-wedel.de/trac</li></ul>
</li></ul>
</li>
<li>lower level libs
<ul>
<li>https://github.com/haskell/wreq http://www.serpentine.com/wreq</li>
<li>https://github.com/snoyberg/http-client/blob/master/TUTORIAL.md</li></ul>
</li></ul>
</li></ol>
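
To make the basic idea behind the crawlers listed above concrete, here is a minimal sketch of a breadth-first crawl loop with a URL frontier and a seen-set, restricted to the host of the seed URL. It is not how any of the listed projects is implemented. The names <code>crawl</code>, <code>LinkExtractor</code>, <code>fetch</code> and <code>max_pages</code> are assumptions for this note. Fetching is passed in as a function so that the sketch runs without network access; a real crawler would also need robots.txt handling, rate limiting and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl limited to the seed's host.

    `fetch` is a caller-supplied function: url -> HTML string, or None on failure.
    Returns a dict mapping each fetched URL to its HTML, in visiting order.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])   # frontier of URLs still to visit
    seen = {seed}           # every URL ever queued, to avoid revisits
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

For a quick offline trial, <code>fetch</code> can simply be the <code>get</code> method of a dict that maps URLs to HTML strings.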

<span id="search"></span>
=== Search ===

<ol>
<li><p>openwebsearch.eu</p>
<ul>
<li>https://openwebsearch.eu/the-project/research-results</li></ul>
</li>
<li><p>lemurproject.org</p>
<ol>
<li><p>lucindri</p>
<p>https://lemurproject.org/lucindri.php</p>
<p>Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene Search Engine. Lucindri consists of two components: the indexer and the searcher.</p></li>
<li><p>Galago</p>
<p>toolkit for experimenting with text search</p>
<ul>
<li>http://www.search-engines-book.com</li></ul>
</li></ol>
</li></ol>
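
The indexer/searcher split mentioned for Lucindri can be illustrated with a toy inverted index. This is only a sketch of the general technique, not Lucindri's or Lucene's API: the class names <code>Indexer</code> and <code>Searcher</code> and the function <code>tokenize</code> are assumptions for this note, and there is neither ranking nor an Indri-style structured query language, only plain AND queries.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Indexer:
    """Builds an inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)


class Searcher:
    """Answers conjunctive (AND) queries against postings built by Indexer."""

    def __init__(self, postings):
        self.postings = postings

    def search(self, query):
        terms = tokenize(query)
        if not terms:
            return set()
        # Start from a copy of the first term's postings, then intersect the rest.
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

Keeping the two components separate means the index can be built once, for example by a crawler job, and queried many times afterwards.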

=== Wikia ===

* https://en.wikipedia.org/wiki/Wikia_Search
* by Jimmy Wales
* Implementation
** Grub, decentralized crawler
** Apache Nutch

=== Research 2023-11-26 ===

* https://en.wikipedia.org/wiki/Category:Web_scraping
* https://en.wikipedia.org/wiki/Category:Web_archiving
* https://en.wikipedia.org/wiki/Category:Free_search_engine_software
* https://en.wikipedia.org/wiki/Category:Internet_search_engines
* https://en.wikipedia.org/wiki/Category:Free_web_crawlers

=== Scraping ===

* https://github.com/scrapy

Revision as of 7 December 2023, 13:05

archive

crawling

index

scraping

search
