Mirror of https://git.friendi.ca/friendica/friendica-addons.git (synced 2025-07-12 03:18:48 +00:00)
Merge remote-tracking branch 'upstream/master' into retriever
Conflicts: retriever/view/help.tpl
commit
4fcd882f02
150 changed files with 343 additions and 567 deletions
153
retriever/templates/help.tpl
Normal file
@ -0,0 +1,153 @@
{{*
 * AUTOMATICALLY GENERATED TEMPLATE
 * DO NOT EDIT THIS FILE, CHANGES WILL BE OVERWRITTEN
 *
 *}}
<h2>Retriever Plugin Help</h2>
<p>
This plugin replaces the short excerpts you normally get in RSS feeds
with the full content of the article from the source website. You
specify which part of the page you're interested in with a set of
rules. As each item arrives, the plugin downloads the full page
from the website, extracts content using the rules, and replaces the
original article with the result.
</p>
<p>
There are a few reasons you may want to do this. The source website
might be slow or overloaded. The source website might be
untrustworthy, in which case using Friendica to scrub the HTML is a
good idea. You might be on a LAN that blocks certain websites.
It also works neatly with the mailstream plugin, allowing you to read
a news stream comfortably without needing continuous Internet
connectivity.
</p>
<p>
However, setting up retriever can be quite tricky, since it depends on
the internal design of the website. That design is meant to make life
easy for the website's developers, not for you. You'll need to have
some familiarity with HTML, and be willing to adapt when the website
suddenly changes everything without notice.
</p>
<h3>Configuring Retriever for a feed</h3>
<p>
To set up retriever for an RSS feed, go to the "Contacts" page and
find your feed. Then click on the drop-down menu on the contact.
Select "Retriever" to get to the retriever configuration.
</p>
<p>
The "Include" configuration section specifies parts of the page to
include in the article. Each row has three components:
</p>
<ul>
<li>An HTML tag (e.g. "div", "span", "p")</li>
<li>An attribute (usually "class" or "id")</li>
<li>A value for the attribute</li>
</ul>
<p>
A simple case is when the article is wrapped in a "div" element:
</p>
<pre>
...
&lt;div class="ArticleWrapper"&gt;
  &lt;h2&gt;Man Bites Dog&lt;/h2&gt;
  &lt;img src="mbd.jpg"&gt;
  &lt;p&gt;
    Residents of the sleepy community of Nowheresville were
    shocked yesterday by the sight of creepy local weirdo Jim
    McOddman assaulting innocent local dog Snufflekins with his
    false teeth.
  &lt;/p&gt;
  ...
&lt;/div&gt;
...
</pre>
<p>
You then specify the tag "div", attribute "class", and value
"ArticleWrapper". Everything else in the page, such as navigation
panels, menus, and footers, will be discarded. If there is more than
one section of the page you want to include, specify each one on a
separate row. If a matching section contains parts you want to
remove, specify those in the "Exclude" section in the same way.
</p>
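<p>
The include and exclude rows described above can be pictured with a
small Python sketch. This is only an illustration under stated
assumptions, not the plugin's actual code: the function names
<code>matches</code> and <code>extract</code> are hypothetical, and the
sketch uses a strict XML parser, so it only handles well-formed markup,
whereas real web pages need a more tolerant HTML parser.
</p>

```python
import xml.etree.ElementTree as ET


def matches(elem, rule):
    """Return True if elem satisfies one (tag, attribute, value) row.

    Hypothetical helper: exact comparison only, to keep the sketch small.
    """
    tag, attr, value = rule
    return elem.tag == tag and elem.get(attr) == value


def extract(page, include, exclude=()):
    """Return the top-most elements matching an include rule,
    after removing any descendants matching an exclude rule."""
    root = ET.fromstring(page)

    # First pass: remove every child element that matches an exclude rule.
    for parent in list(root.iter()):
        for child in list(parent):
            if any(matches(child, rule) for rule in exclude):
                parent.remove(child)

    # Second pass: collect included elements; do not descend into a match,
    # so nested matches are not reported twice.
    kept = []

    def walk(elem):
        if any(matches(elem, rule) for rule in include):
            kept.append(elem)
            return
        for child in elem:
            walk(child)

    walk(root)
    return kept


page = (
    '<html><body>'
    '<div id="nav">Menu</div>'
    '<div class="ArticleWrapper"><h2>Man Bites Dog</h2><p>Text</p>'
    '<div class="ads">Buy</div></div>'
    '<div id="footer">Bye</div>'
    '</body></html>'
)
kept = extract(
    page,
    include=[("div", "class", "ArticleWrapper")],
    exclude=[("div", "class", "ads")],
)
article_text = "".join(kept[0].itertext())
```

<p>
In this sketch the navigation and footer "div" elements are dropped
because they match no include row, and the hypothetical "ads" section
is removed from inside the article by the exclude row.
</p>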
<p>
Once you've got a configuration that you think will work, you can try
it out on some existing articles. Type a number into the
"Retrospectively Apply" box and click "Submit". After a while
(exactly how long depends on your system's cron configuration) the
updated articles should be available.
</p>
<h3>Techniques</h3>
<p>
You can leave the attribute and value blank to include all the
elements with the specified tag name. You can also use a tag name of
just an asterisk ("*"), which matches any element with the specified
attribute, regardless of its tag.
</p>
<p>
Note that the "class" attribute is a special case. Many web page
templates put several classes on the same element, separated by
spaces. If you specify an attribute of "class", it will match an
element if any of its classes matches the specified value.
For example:
</p>
<pre>
&lt;div class="article breaking-news"&gt;
</pre>
<p>
In this case you can specify a value of "article", or "breaking-news".
You can also specify "article breaking-news", but that won't match if
the website suddenly changes to "breaking-news article", so that's not
recommended.
</p>
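<p>
The "class" behaviour described above can be sketched in a few lines of
Python. The function name <code>class_matches</code> is hypothetical,
and the plugin's real matching logic may be implemented differently;
the sketch only mirrors the behaviour this help text describes.
</p>

```python
def class_matches(class_attr, value):
    """Return True if `value` equals the whole class attribute
    or is one of its space-separated class names."""
    return value == class_attr or value in class_attr.split()


# The element from the example: <div class="article breaking-news">
single_first = class_matches("article breaking-news", "article")
single_second = class_matches("article breaking-news", "breaking-news")
whole_string = class_matches("article breaking-news", "article breaking-news")

# The same two-class value stops matching once the site reorders the classes.
reordered = class_matches("breaking-news article", "article breaking-news")
```

<p>
Matching on a single class name keeps working when the website
reorders its classes, which is why a single name is the safer choice.
</p>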
<p>
One useful trick you can try is using the website's "print" pages.
Many news sites have print versions of all their articles. These are
usually drastically simplified compared to the live website page.
Sometimes this is a good way to get the whole article when it's
normally split across multiple pages.
</p>
<p>
With luck, the URL for the print page is a predictable variant of the
normal article URL. For example, an article URL like:
</p>
<pre>
http://www.newssite.com/article-8636.html
</pre>
<p>
...might have a print version at:
</p>
<pre>
http://www.newssite.com/print/article-8636.html
</pre>
<p>
To change the URL used to retrieve the page, use the "URL Pattern" and
"URL Replace" fields. The pattern is a regular expression matching
the part of the URL to replace. In this case, you might use a pattern
of "/article" and a replace string of "/print/article". A common
pattern is simply a dollar sign ("$"), which appends the replace
string to the end of the URL.
</p>
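<p>
The effect of the two fields can be sketched with Python's
<code>re.sub</code>. This is an illustration, not the plugin's code:
the function name <code>rewrite_url</code> and the
<code>?print=1</code> suffix are hypothetical, and the sketch assumes
every match of the pattern is replaced.
</p>

```python
import re


def rewrite_url(url, pattern, replacement):
    """Replace every match of the regular expression `pattern`
    in `url` with `replacement` (assumed behaviour)."""
    return re.sub(pattern, replacement, url)


article = "http://www.newssite.com/article-8636.html"

# Pattern "/article", replace string "/print/article".
print_url = rewrite_url(article, "/article", "/print/article")

# Pattern "$" matches the end of the URL, so the replace string is appended.
# The "?print=1" suffix is a made-up example.
appended = rewrite_url(article, "$", "?print=1")
```

<p>
Because the pattern is a regular expression, characters such as "."
and "?" have special meaning in it and need a backslash if you want
them matched literally.
</p>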
<h3>Background Processing</h3>
<p>
Retrieving and processing the articles can take some time, so it's
done in the background. Incoming articles will be marked as
invisible while they're being downloaded. If a URL fails, the plugin
will keep trying at progressively longer intervals for up to a month,
in case the website is temporarily overloaded or the network is down.
</p>
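<p>
One way to picture "progressively longer intervals for up to a month"
is a doubling schedule. The sketch below is purely illustrative: the
one-hour starting delay, the doubling factor, the 30-day cutoff, and
the name <code>retry_delays</code> are all assumptions, and the
plugin's real schedule may differ.
</p>

```python
from datetime import timedelta


def retry_delays(first=timedelta(hours=1), factor=2,
                 give_up_after=timedelta(days=30)):
    """Return the waits between attempts, each `factor` times longer
    than the last, stopping before the total exceeds `give_up_after`.

    All default values are assumptions made for this sketch.
    """
    delays = []
    elapsed = timedelta(0)
    delay = first
    while elapsed + delay <= give_up_after:
        delays.append(delay)
        elapsed += delay
        delay *= factor
    return delays


schedule = retry_delays()
total_wait = sum(schedule, timedelta(0))
```

<p>
With these assumed values a failing URL would be retried only a
handful of times over the month, which keeps the load on a struggling
website low.
</p>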
<h3>Retrieving Images</h3>
<p>
Retriever can also optionally download images and store them in the
local Friendica instance. Just check the "Download Images" box. You
can also download images in every item from your network, whether it's
an RSS feed or not. Go to the "Settings" page and
click <a href="$config">"Plugin settings"</a>. Then check the "All
Photos" box in the "Retriever Settings" section and click "Submit".
</p>
<h2>Configure Feeds:</h2>
<div>
{{ for $feeds as $feed }}
{{ inc contact_template.tpl with $contact=$feed }}{{ endinc }}
{{ endfor }}
</div>