Populus:DezInV/Notes
== search ==
= from org-mode/thk =

== Decentralized search engines ==

* https://yacy.net
** [https://bugs.debian.org/768171 RFP]
** https://www.youtube.com/c/YaCyTutorials/videos
* https://github.com/nvasilakis/yippee
** Java
** last commit 2012
* https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
* https://metaphacts.com/diesel
** non-free enterprise search engine
* https://fourweekmba.com/distributed-search-engines-vs-google/
* presearch.{io|org}
** non-free, blockchain, etc.
* https://github.com/kearch/kearch
** last commit 2019
** Python
* https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
* https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
* https://wiki.p2pfoundation.net/Distributed_Search_Engines
* https://en.wikipedia.org/wiki/Distributed_search_engine
* https://searx.github.io/searx/
* https://www.astridmager.net researcher; alternative, ethical search
* Why one should also keep a local archive: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america
=== Internet Archive ===

* https://netpreserve.org/about-us/
* https://github.com/iipc/awesome-web-archiving
* https://github.com/internetarchive
* https://github.com/iipc/openwayback
* http://timetravel.mementoweb.org/
** http://timetravel.mementoweb.org/about/
* https://github.com/webrecorder
=== Crawling, crawlers ===

* https://en.wikipedia.org/wiki/WARC_(file_format)
* ArchiveBox
** https://github.com/ArchiveBox/ArchiveBox
** https://archivebox.io
** Python, active
** Open-source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more.
* grab-site
** https://github.com/ArchiveTeam/grab-site
** Python, active
* WASP
** Paper: [https://ceur-ws.org/Vol-2167/paper6.pdf WASP: Web Archiving and Search Personalized] via https://github.com/webis-de/wasp
*** Java, inactive?
*** related paper: [https://www.cs.ru.nl/bachelors-theses/2019/Gijs_Hendriksen___4324544___Extending_WASP_providing_context_to_a_personal_web_archive.pdf Extending WASP: providing context to a personal web archive]
* StormCrawler
** https://en.wikipedia.org/wiki/StormCrawler
** based on Apache Storm (distributed stream processing)
* HTTrack
** https://www.httrack.com
** written in C; last release 2017, but active forum and GitHub
** https://github.com/xroche/httrack/tags
* Grub.org
** decentralized crawler of the Wikia project; in C# and Python
** https://web.archive.org/web/20090207182028/http://grub.org/
** https://en.wikipedia.org/wiki/Grub_(search_engine)
* HCE – Hierarchical Cluster Engine
** http://hierarchical-cluster-engine.com
** https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project
** Ukraine, active until ca. 2014?
* Heritrix
** https://en.wikipedia.org/wiki/Heritrix
** archive.org's crawler
* Haskell
** https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
*** https://www.reddit.com/r/haskell/comments/6nazqm/notes_on_design_of_webscrapers_in_haskell/
** https://github.com/jordanspooner/haskell-web-crawler
*** dead since 2017, seems like a student assignment
** https://hackage.haskell.org/package/scalpel
*** a high-level web scraping library for Haskell
** https://hackage.haskell.org/package/hScraper
*** one version 0.1.0.0 from 2015
** https://hackage.haskell.org/package/hs-scrape
*** one version from 2014, but git commits from 2020
** https://hackage.haskell.org/package/http-conduit-downloader
*** HTTP/HTTPS downloader built on top of http-client and used in the https://bazqux.com crawler
** https://github.com/hunt-framework/hunt
*** a flexible, lightweight search platform
*** predecessor: https://github.com/fortytools/holumbus/
**** http://holumbus.fh-wedel.de/trac
** lower-level libraries
*** https://github.com/haskell/wreq http://www.serpentine.com/wreq
*** https://github.com/snoyberg/http-client/blob/master/TUTORIAL.md
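The WARC format listed above is what most of these crawlers (Heritrix, grab-site, webrecorder) emit: plain records with a small header block, a blank line, and the captured payload. A minimal stdlib-only sketch of that record layout, building one simplified "response" record and reading its headers back; real tooling (e.g. warcio) additionally writes mandatory fields such as WARC-Date and WARC-Record-ID, which are omitted here:

```python
# Simplified WARC/1.0 record: header block + CRLF CRLF + body.
# Illustration only; not a spec-complete writer (no WARC-Date, no record id).

def make_warc_record(uri, body):
    """Serialize a single, simplified WARC response record."""
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    # Records are separated by two blank lines.
    return headers.encode("utf-8") + body + b"\r\n\r\n"

def parse_warc_headers(record):
    """Read the header block back into a dict (version line skipped)."""
    head, _, _ = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    return dict(line.split(": ", 1) for line in lines[1:])

record = make_warc_record("http://example.com/", b"<html>hi</html>")
fields = parse_warc_headers(record)
```

The round trip recovers the target URI and type, which is essentially what replay tools like openwayback index when they scan an archive.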
<span id="search"></span>
=== Search ===
* openwebsearch.eu
** https://openwebsearch.eu/the-project/research-results
* lemurproject.org
** Lucindri
*** https://lemurproject.org/lucindri.php
*** Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene search engine. It consists of two components: the indexer and the searcher.
** Galago
*** toolkit for experimenting with text search
*** http://www.search-engines-book.com
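At their core, search toolkits like Galago or Lucindri build an inverted index: a mapping from each term to the list of documents containing it, which conjunctive queries then intersect. A toy sketch of that idea (illustrative only, not any listed tool's API):

```python
# Toy inverted index: term -> sorted posting list of doc ids.
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> sorted doc id list."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """AND query: intersect the posting lists of all query terms."""
    postings = [set(index.get(t.lower(), ())) for t in query.split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "distributed search engines", 2: "web search", 3: "web archiving"}
idx = build_index(docs)
```

Real engines add ranking (TF-IDF, BM25), positional postings, and compressed on-disk posting lists on top of this basic structure.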
=== Wikia ===

* https://en.wikipedia.org/wiki/Wikia_Search
* by Jimmy Wales
* implementation
** Grub, decentralized crawler
** Apache Nutch
=== Research 2023-11-26 ===

* https://en.wikipedia.org/wiki/Category:Web_scraping
* https://en.wikipedia.org/wiki/Category:Web_archiving
* https://en.wikipedia.org/wiki/Category:Free_search_engine_software
* https://en.wikipedia.org/wiki/Category:Internet_search_engines
* https://en.wikipedia.org/wiki/Category:Free_web_crawlers
=== Scraping ===

* https://github.com/scrapy
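The first step every scraper or crawler above shares is extracting links from fetched HTML to feed the crawl frontier. A stdlib-only sketch of that step (Scrapy and the crawlers listed earlier add scheduling, politeness, and deduplication on top): resolve every <code>a href</code> against the page URL.

```python
# Extract absolute link targets from an HTML page using only the stdlib.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href=...> encountered."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Relative hrefs are resolved against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

html = '<p><a href="/about">about</a> <a href="https://other.example/x">x</a></p>'
parser = LinkExtractor("https://example.com/index.html")
parser.feed(html)
```

A crawler would push <code>parser.links</code> onto its frontier queue, after filtering by robots.txt and already-seen URLs.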
Version of 7 December 2023, 13:05

== archive ==
* Discussion about compression with somebody working on a Russian web archive

== crawling ==
* httparchive has an almanac with statistics about websites

== index ==

== scraping ==