Populus:DezInV/Notes

= from org-mode/thk =

== decentralized search engine ==

* https://yacy.net
** [https://bugs.debian.org/768171 RFP]
** https://www.youtube.com/c/YaCyTutorials/videos
* https://github.com/nvasilakis/yippee
** Java
** last commit 2012
* https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
* https://metaphacts.com/diesel
** non-free enterprise search engine
* https://fourweekmba.com/distributed-search-engines-vs-google/
* presearch.{io|org}
** non-free, blockchain, etc.
* https://github.com/kearch/kearch
** last commit 2019
** Python
* https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
* https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
* https://wiki.p2pfoundation.net/Distributed_Search_Engines
* https://en.wikipedia.org/wiki/Distributed_search_engine
* https://searx.github.io/searx/
* https://www.astridmager.net (researcher; alternative, ethical search)
* Why you should also keep a local archive: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america
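The engines above (YaCy in particular) share one core idea: the index is sharded across many peers, a query is fanned out to them, and the partial result lists are merged and re-ranked locally. A minimal stdlib-only sketch of that scatter-gather step; the peers, their shards, and the scores are invented for illustration, and in a real system each `query_peer` call would be a network request to another node:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical peers: each holds a shard of the index and returns
# (url, score) pairs for a query.
PEERS = {
    "peer-a": {"search": [("https://yacy.net", 0.9),
                          ("https://searx.github.io/searx/", 0.7)]},
    "peer-b": {"search": [("https://yacy.net", 0.8),
                          ("https://en.wikipedia.org/wiki/Distributed_search_engine", 0.6)]},
}

def query_peer(peer_id, query):
    """Stand-in for a remote call: look up the query in the peer's shard."""
    return PEERS[peer_id].get(query, [])

def distributed_search(query, k=3):
    # Scatter: ask every peer in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda p: query_peer(p, query), PEERS))
    # Gather: merge duplicate URLs by summing scores, then rank.
    merged = {}
    for results in partials:
        for url, score in results:
            merged[url] = merged.get(url, 0.0) + score
    return sorted(merged.items(), key=lambda kv: -kv[1])[:k]
```

A URL reported by several peers accumulates score, so widely indexed pages rise to the top of the merged list.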

=== Internet Archive ===

* https://netpreserve.org/about-us/
* https://github.com/iipc/awesome-web-archiving
* https://github.com/internetarchive
* https://github.com/iipc/openwayback
* http://timetravel.mementoweb.org/
** http://timetravel.mementoweb.org/about/
* https://github.com/webrecorder

=== Crawling, Crawler ===

* https://en.wikipedia.org/wiki/WARC_(file_format)
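WARC, the container format most of the crawlers below write, is essentially a CRLF-delimited header block (`WARC/1.0` plus named fields) followed by `Content-Length` bytes of payload and a blank-line terminator. A stdlib-only sketch of writing and reading that framing; this is a deliberately minimal subset of the format (real tooling such as warcio also handles gzip, digests, and the mandatory `WARC-Record-ID`/`WARC-Date` fields):

```python
import io

def write_warc_record(uri, payload: bytes) -> bytes:
    """Serialize one minimal WARC 'resource' record (subset of the spec)."""
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    return headers.encode() + payload + b"\r\n\r\n"

def read_warc_records(data: bytes):
    """Yield (headers_dict, payload) for each record in a raw WARC byte stream."""
    stream = io.BytesIO(data)
    while True:
        version = stream.readline()
        if not version:
            return
        assert version.startswith(b"WARC/")
        headers = {}
        # Header lines until the blank CRLF line; split on the first colon.
        for line in iter(stream.readline, b"\r\n"):
            key, _, value = line.decode().partition(":")
            headers[key.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        stream.read(4)  # consume the two CRLFs that terminate the record
        yield headers, payload
```

Because records are self-delimiting via `Content-Length`, a crawl archive is just these records concatenated, which is why WARC files stream and append so well.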

# ArchiveBox
#* https://github.com/ArchiveBox/ArchiveBox
#* https://archivebox.io
#* Python, active
#* Open-source, self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…
# grab-site
#* https://github.com/ArchiveTeam/grab-site
#* Python, active
# WASP
#* Paper: [https://ceur-ws.org/Vol-2167/paper6.pdf WASP: Web Archiving and Search Personalized] via https://github.com/webis-de/wasp
#** Java, inactive?
#** related paper: [https://www.cs.ru.nl/bachelors-theses/2019/Gijs_Hendriksen___4324544___Extending_WASP_providing_context_to_a_personal_web_archive.pdf Extending WASP: providing context to a personal web archive]
# StormCrawler
#* https://en.wikipedia.org/wiki/StormCrawler
#* based on Apache Storm (distributed stream processing)
# HTTrack
#* https://www.httrack.com
#* written in C; last release 2017, but active forum and GitHub
#* https://github.com/xroche/httrack/tags
# Grub.org
#* decentralized crawler of the Wikia project, written in C# and Python
#* https://web.archive.org/web/20090207182028/http://grub.org/
#* https://en.wikipedia.org/wiki/Grub_(search_engine)
# HCE – Hierarchical Cluster Engine
#* http://hierarchical-cluster-engine.com
#* https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project
#* Ukraine, active until ca. 2014?
# Heritrix
#* https://en.wikipedia.org/wiki/Heritrix
#* archive.org's crawler
# Haskell
#* https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
#** https://www.reddit.com/r/haskell/comments/6nazqm/notes_on_design_of_webscrapers_in_haskell/
#* https://github.com/jordanspooner/haskell-web-crawler
#** dead since 2017, seems like a student assignment
#* https://hackage.haskell.org/package/scalpel
#** a high-level web scraping library for Haskell
#* https://hackage.haskell.org/package/hScraper
#** one version, 0.1.0.0, from 2015
#* https://hackage.haskell.org/package/hs-scrape
#** one version from 2014, but git commits from 2020
#* https://hackage.haskell.org/package/http-conduit-downloader
#** HTTP/HTTPS downloader built on top of http-client; used in the https://bazqux.com crawler
#* https://github.com/hunt-framework/hunt
#** a flexible, lightweight search platform
#** predecessor: https://github.com/fortytools/holumbus/
#*** http://holumbus.fh-wedel.de/trac
#* lower-level libs
#** https://github.com/haskell/wreq http://www.serpentine.com/wreq
#** https://github.com/snoyberg/http-client/blob/master/TUTORIAL.md
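The crawlers above differ mostly in scale, politeness features, and storage backend; the crawl loop itself is always the same: pop a URL from a frontier, fetch it, extract links, enqueue the unseen ones. A stdlib-only sketch of that loop; the `fetch` function is injected so the frontier logic can be shown without an HTTP layer, and a real crawler would additionally honor robots.txt and rate limits, as HTTrack and Heritrix do:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed, fetch, max_pages=10):
    """Breadth-first crawl from `seed`; `fetch(url) -> html` is injected."""
    frontier, seen, pages = deque([seed]), {seed}, {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:      # frontier deduplication
                seen.add(link)
                frontier.append(link)
    return pages
```

With a fake `fetch` over an in-memory site the loop is directly testable; swapping in `urllib.request.urlopen` (plus error handling) turns it into an actual, if impolite, crawler.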

<span id="search"></span>
=== Search ===

# openwebsearch.eu
#* https://openwebsearch.eu/the-project/research-results
# lemurproject.org
## Lucindri
##* https://lemurproject.org/lucindri.php
##* Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene search engine. Lucindri consists of two components: the indexer and the searcher.
## Galago
##* toolkit for experimenting with text search
##* http://www.search-engines-book.com
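The Lucindri description names the two classic components, an indexer and a searcher, and the data structure underneath both Lucene and Indri is an inverted index: a map from each term to the list of documents containing it. A toy stdlib-only version with AND semantics (real engines add tokenization, ranking such as BM25, and compressed posting lists):

```python
from collections import defaultdict

def build_index(docs):
    """Indexer: map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Searcher: intersect the posting sets of all query terms (AND semantics)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)
```

Splitting indexer from searcher is what lets the expensive step (indexing a crawl) run offline while queries stay cheap lookups.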

=== Wikia ===

* https://en.wikipedia.org/wiki/Wikia_Search
* by Jimmy Wales
* implementation
** Grub, decentralized crawler
** Apache Nutch

=== Research 2023-11-26 ===

* https://en.wikipedia.org/wiki/Category:Web_scraping
* https://en.wikipedia.org/wiki/Category:Web_archiving
* https://en.wikipedia.org/wiki/Category:Free_search_engine_software
* https://en.wikipedia.org/wiki/Category:Internet_search_engines
* https://en.wikipedia.org/wiki/Category:Free_web_crawlers

=== Scraping ===

* https://github.com/scrapy
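Scrapy structures scraping as spiders whose parse callbacks yield items (dicts of extracted fields) from each page. The same extraction step can be done framework-free with the stdlib parser; the headline-page structure below is invented purely for illustration:

```python
from html.parser import HTMLParser

class HeadlineScraper(HTMLParser):
    """Emit one item per <a> inside an <h2>, mimicking what a
    Scrapy spider's parse() callback would yield."""
    def __init__(self):
        super().__init__()
        self.items = []
        self._in_h2 = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append({"title": "".join(self._text).strip(),
                               "url": self._href})
            self._href = None
        elif tag == "h2":
            self._in_h2 = False

def scrape_headlines(html):
    scraper = HeadlineScraper()
    scraper.feed(html)
    return scraper.items
```

What Scrapy adds on top of this is scheduling, throttling, retries, and item pipelines; the per-page extraction is the easy part.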
Version of 7 December 2023, 13:05
== archive ==

* Discussion about compression with somebody working on a Russian web archive

== crawling ==

* httparchive has an almanac with statistics about websites

== index ==

== scraping ==

== search ==