Populus:DezInV/Notes

= from org-mode/thk =

== decentralized search engine ==

* https://yacy.net
** [https://bugs.debian.org/768171 RFP]
** https://www.youtube.com/c/YaCyTutorials/videos
* https://github.com/nvasilakis/yippee
** Java
** last commit 2012
* https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
* https://metaphacts.com/diesel
** non-free enterprise search engine
* https://fourweekmba.com/distributed-search-engines-vs-google/
* presearch.{io|org}
** non-free, blockchain, etc.
* https://github.com/kearch/kearch
** last commit 2019
** Python
* https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
* https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
* https://wiki.p2pfoundation.net/Distributed_Search_Engines
* https://en.wikipedia.org/wiki/Distributed_search_engine
* https://searx.github.io/searx/
* https://www.astridmager.net (researcher; alternative, ethical search)
* Why you should also keep a local archive: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america
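The engines above (YaCy in particular) share one core idea: the index is sharded across many peers, a query is fanned out to them, and the partial result lists are merged and re-ranked locally. A minimal stdlib-only sketch of that scatter-gather step; the peers, their shards, and the scores are invented for illustration, and in a real system each `query_peer` call would be a network request to another node:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical peers: each holds a shard of the index and returns
# (url, score) pairs for a query.
PEERS = {
    "peer-a": {"search": [("https://yacy.net", 0.9),
                          ("https://searx.github.io/searx/", 0.7)]},
    "peer-b": {"search": [("https://yacy.net", 0.8),
                          ("https://en.wikipedia.org/wiki/Distributed_search_engine", 0.6)]},
}

def query_peer(peer_id, query):
    """Stand-in for a remote call: look up the query in the peer's shard."""
    return PEERS[peer_id].get(query, [])

def distributed_search(query, k=3):
    # Scatter: ask every peer in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda p: query_peer(p, query), PEERS))
    # Gather: merge duplicate URLs by summing scores, then rank.
    merged = {}
    for results in partials:
        for url, score in results:
            merged[url] = merged.get(url, 0.0) + score
    return sorted(merged.items(), key=lambda kv: -kv[1])[:k]
```

A URL reported by several peers accumulates score, so widely indexed pages rise to the top of the merged list.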

=== Internet Archive ===

* https://netpreserve.org/about-us/
* https://github.com/iipc/awesome-web-archiving
* https://github.com/internetarchive
* https://github.com/iipc/openwayback
* http://timetravel.mementoweb.org/
** http://timetravel.mementoweb.org/about/
* https://github.com/webrecorder

=== Crawling, Crawler ===

* https://en.wikipedia.org/wiki/WARC_(file_format)
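WARC, the container format most of the crawlers below write, is essentially a CRLF-delimited header block (`WARC/1.0` plus named fields) followed by `Content-Length` bytes of payload and a blank-line terminator. A stdlib-only sketch of writing and reading that framing; this is a deliberately minimal subset of the format (real tooling such as warcio also handles gzip, digests, and the mandatory `WARC-Record-ID`/`WARC-Date` fields):

```python
import io

def write_warc_record(uri, payload: bytes) -> bytes:
    """Serialize one minimal WARC 'resource' record (subset of the spec)."""
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    return headers.encode() + payload + b"\r\n\r\n"

def read_warc_records(data: bytes):
    """Yield (headers_dict, payload) for each record in a raw WARC byte stream."""
    stream = io.BytesIO(data)
    while True:
        version = stream.readline()
        if not version:
            return
        assert version.startswith(b"WARC/")
        headers = {}
        # Header lines until the blank CRLF line; split on the first colon.
        for line in iter(stream.readline, b"\r\n"):
            key, _, value = line.decode().partition(":")
            headers[key.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        stream.read(4)  # consume the two CRLFs that terminate the record
        yield headers, payload
```

Because records are self-delimiting via `Content-Length`, a crawl archive is just these records concatenated, which is why WARC files stream and append so well.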

# ArchiveBox
#* https://github.com/ArchiveBox/ArchiveBox
#* https://archivebox.io
#* Python, active
#* Open-source, self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…
# grab-site
#* https://github.com/ArchiveTeam/grab-site
#* Python, active
# WASP
#* Paper: [https://ceur-ws.org/Vol-2167/paper6.pdf WASP: Web Archiving and Search Personalized] via https://github.com/webis-de/wasp
#** Java, inactive?
#** related paper: [https://www.cs.ru.nl/bachelors-theses/2019/Gijs_Hendriksen___4324544___Extending_WASP_providing_context_to_a_personal_web_archive.pdf Extending WASP: providing context to a personal web archive]
# StormCrawler
#* https://en.wikipedia.org/wiki/StormCrawler
#* based on Apache Storm (distributed stream processing)
# HTTrack
#* https://www.httrack.com
#* written in C; last release 2017, but active forum and GitHub
#* https://github.com/xroche/httrack/tags
# Grub.org
#* decentralized crawler of the Wikia project, written in C# and Python
#* https://web.archive.org/web/20090207182028/http://grub.org/
#* https://en.wikipedia.org/wiki/Grub_(search_engine)
# HCE – Hierarchical Cluster Engine
#* http://hierarchical-cluster-engine.com
#* https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project
#* Ukraine, active until ca. 2014?
# Heritrix
#* https://en.wikipedia.org/wiki/Heritrix
#* archive.org's crawler
# Haskell
#* https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
#** https://www.reddit.com/r/haskell/comments/6nazqm/notes_on_design_of_webscrapers_in_haskell/
#* https://github.com/jordanspooner/haskell-web-crawler
#** dead since 2017, seems like a student assignment
#* https://hackage.haskell.org/package/scalpel
#** a high-level web scraping library for Haskell
#* https://hackage.haskell.org/package/hScraper
#** one version, 0.1.0.0, from 2015
#* https://hackage.haskell.org/package/hs-scrape
#** one version from 2014, but git commits from 2020
#* https://hackage.haskell.org/package/http-conduit-downloader
#** HTTP/HTTPS downloader built on top of http-client; used in the https://bazqux.com crawler
#* https://github.com/hunt-framework/hunt
#** a flexible, lightweight search platform
#** predecessor: https://github.com/fortytools/holumbus/
#*** http://holumbus.fh-wedel.de/trac
#* lower-level libs
#** https://github.com/haskell/wreq http://www.serpentine.com/wreq
#** https://github.com/snoyberg/http-client/blob/master/TUTORIAL.md
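The crawlers above differ mostly in scale, politeness features, and storage backend; the crawl loop itself is always the same: pop a URL from a frontier, fetch it, extract links, enqueue the unseen ones. A stdlib-only sketch of that loop; the `fetch` function is injected so the frontier logic can be shown without an HTTP layer, and a real crawler would additionally honor robots.txt and rate limits, as HTTrack and Heritrix do:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed, fetch, max_pages=10):
    """Breadth-first crawl from `seed`; `fetch(url) -> html` is injected."""
    frontier, seen, pages = deque([seed]), {seed}, {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:      # frontier deduplication
                seen.add(link)
                frontier.append(link)
    return pages
```

With a fake `fetch` over an in-memory site the loop is directly testable; swapping in `urllib.request.urlopen` (plus error handling) turns it into an actual, if impolite, crawler.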

<span id="search"></span>
=== Search ===

# openwebsearch.eu
#* https://openwebsearch.eu/the-project/research-results
# lemurproject.org
## Lucindri
##* https://lemurproject.org/lucindri.php
##* Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene search engine. Lucindri consists of two components: the indexer and the searcher.
## Galago
##* toolkit for experimenting with text search
##* http://www.search-engines-book.com
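The Lucindri description names the two classic components, an indexer and a searcher, and the data structure underneath both Lucene and Indri is an inverted index: a map from each term to the list of documents containing it. A toy stdlib-only version with AND semantics (real engines add tokenization, ranking such as BM25, and compressed posting lists):

```python
from collections import defaultdict

def build_index(docs):
    """Indexer: map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Searcher: intersect the posting sets of all query terms (AND semantics)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)
```

Splitting indexer from searcher is what lets the expensive step (indexing a crawl) run offline while queries stay cheap lookups.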

=== Wikia ===

* https://en.wikipedia.org/wiki/Wikia_Search
* by Jimmy Wales
* implementation
** Grub, decentralized crawler
** Apache Nutch

=== Research 2023-11-26 ===

* https://en.wikipedia.org/wiki/Category:Web_scraping
* https://en.wikipedia.org/wiki/Category:Web_archiving
* https://en.wikipedia.org/wiki/Category:Free_search_engine_software
* https://en.wikipedia.org/wiki/Category:Internet_search_engines
* https://en.wikipedia.org/wiki/Category:Free_web_crawlers

=== Scraping ===

* https://github.com/scrapy
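Scrapy structures scraping as spiders whose parse callbacks yield items (dicts of extracted fields) from each page. The same extraction step can be done framework-free with the stdlib parser; the headline-page structure below is invented purely for illustration:

```python
from html.parser import HTMLParser

class HeadlineScraper(HTMLParser):
    """Emit one item per <a> inside an <h2>, mimicking what a
    Scrapy spider's parse() callback would yield."""
    def __init__(self):
        super().__init__()
        self.items = []
        self._in_h2 = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append({"title": "".join(self._text).strip(),
                               "url": self._href})
            self._href = None
        elif tag == "h2":
            self._in_h2 = False

def scrape_headlines(html):
    scraper = HeadlineScraper()
    scraper.feed(html)
    return scraper.items
```

What Scrapy adds on top of this is scheduling, throttling, retries, and item pipelines; the per-page extraction is the easy part.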
Version of 7 December 2023, 13:05
== archive ==

* Discussion about compression with somebody working on a Russian web archive

== crawling ==

* httparchive has an almanac with statistics about websites

== index ==

== scraping ==

== search ==