rawdog-2.19/ 0000755 0004715 0004715 00000000000 12273447040 012343 5 ustar ats ats 0000000 0000000 rawdog-2.19/config 0000644 0004715 0004715 00000032127 12167755742 013555 0 ustar ats ats 0000000 0000000 # Sample rawdog config file. Copy this into your ~/.rawdog/ directory, and edit # it to suit your preferences. # All paths in this file should be either absolute, or relative to your .rawdog # directory. # If you want to include another config file, then use "include FILENAME". # Times in this file are specified as a value and a unit (for instance, # "4h"). Units available are "s" (seconds), "m" (minutes), "h" (hours), # "d" (days) and "w" (weeks). If no unit is specified, rawdog will # assume minutes. # Boolean (yes/no) values in this file are specified as "true" or "false". # rawdog can be extended using plugin modules written in Python. This # option specifies the directories to search for plugins to load. If a # directory does not exist or cannot be read, it will be ignored. This # option must appear before any options that are implemented by plugins. plugindirs plugins # Whether to split rawdog's state amongst multiple files. # If this is turned on, rawdog will use significantly less memory, but # will do more disk IO -- probably a good idea if you read a lot of # feeds. splitstate false # The maximum number of articles to show on the generated page. # Set this to 0 for no limit. maxarticles 200 # The maximum age of articles to show on the generated page. # Set this to 0 for no limit. maxage 0 # The age after which articles will be discarded if they do not appear # in a feed. Set this to a larger value if you want your rawdog output # to cover more than a day's worth of articles. expireage 1d # The minimum number of articles from each feed to keep around in the history. # Set this to 0 to only keep articles that were returned the last time the feed # was fetched. (If this is set to 0, or "currentonly" below is set to true, # then rawdog will not send the RFC3229+feed "A-IM: feed" header when making # HTTP requests, since it can't tell from the response to such a request # whether any articles have been removed from the feed; this makes rawdog # slightly less bandwidth-efficient.) keepmin 20 # Whether to only display articles that are currently included in a feed # (useful for "planet" pages where you only want to display the current # articles from several feeds). If this is false, rawdog will keep a # history of older articles. currentonly false # Whether to divide the articles up by day, writing a "dayformat" heading # before each set. daysections true # The format to write day headings in. See "man strftime" for more # information; for example: # %A, %d %B Wednesday, 21 January # %Y-%m-%d 2004-01-21 (ISO 8601 format) dayformat %A, %d %B # Whether to divide the articles up by time, writing a "timeformat" heading # before each set. timesections true # The format to write time headings in. For example: # %H:%M 18:07 (ISO 8601 format) # %I:%M %p 06:07 PM timeformat %H:%M # The format to display feed update and article times in. For example: # %H:%M, %A, %d %B 18:07, Wednesday, 21 January # %Y-%m-%d %H:%M 2004-01-21 18:07 (ISO 8601 format) datetimeformat %H:%M, %A, %d %B # The page template file to use, or "default" to use the built-in template # (which is probably sufficient for most users). Use "rawdog -s page" to show # the template currently in use as a starting-point for customisation. # The following strings will be replaced in the output: # __version__ The rawdog version in use # __refresh__ The HTML 4 header # __items__ The aggregated items # __num_items__ The number of items on the page # __feeds__ The feed list # __num_feeds__ The number of feeds listed # You can define additional strings using "define" in this config file; for # example, if you say "define myname Adam Sampson", then "__myname__" will be # replaced by "Adam Sampson" in the output. pagetemplate default # Similarly, the template used for each item shown. Use "rawdog -s item" to # show the template currently in use as a starting-point for customisation. # The following strings will be replaced in the output: # __title__ The item title (as an HTML link, if possible) # __title_no_link__ The item title (as text) # __url__ The item's URL, or the empty string if it doesn't # have one # __guid__ The item's GUID, or the empty string if it doesn't # have one # __description__ The item's descriptive text, or the empty string # if it doesn't have a description # __date__ The item's date as provided by the feed # __added__ The date the article was received by rawdog # __hash__ A hash of the article (useful for summary pages) # # All of the __feed_X__ strings from feeditemtemplate below will also be # expanded here, for the feed that the article came from. # # You can define additional strings on a per-feed basis by using the # "define_X" feed option; see the description of "feed" below for more # details. # # Simple conditional expansion is possible by saying something like # "__if_items__ hello __endif__"; the text between the if and endif will # only be included if __items__ would expand to something other than # the empty string. Ifs can be nested, and __else__ is supported. # (This also works for the other templates, but it's most useful here.) itemtemplate default # The template used to generate the feed list (__feeds__ above). Use "rawdog # -s feedlist" to show the current template. # The following strings will be replaced in the output: # __feeditems__ The feed items feedlisttemplate default # The template used to generate each item in the feed list. Use "rawdog # -s feeditem" to show the current template. # The following strings will be replaced in the output: # __feed_id__ The feed's title with non-alphanumeric characters # (and HTML markup) removed (useful for per-feed # styles); you can use the "id" feed option below to # set a custom ID if you prefer # __feed_hash__ A hash of the feed URL (useful for per-feed styles) # __feed_title__ The feed title (as an HTML link, if possible) # __feed_title_no_link__ # The feed title (as text) # __feed_url__ The feed URL # __feed_icon__ An "XML button" linking to the feed URL # __feed_last_update__ # The time when the feed was last updated # __feed_next_update__ # The time when the feed will next need updating feeditemtemplate default # Where to write the output HTML to. You should place style.css in the same # directory. Specify this as "-" to write the HTML to stdout. # (You will probably want to make this an absolute path, else rawdog will write # to a file in your ~/.rawdog directory.) outputfile output.html #outputfile /home/you/public_html/rawdog.html # Whether to use a tag in the generated # HTML to indicate that the page should be refreshed automatically. If # this is turned on, then the page will refresh every N minutes, where N # is the shortest feed period value specified below. # (This works by controlling whether the default template includes # __refresh__; if you use a custom template, __refresh__ is always # available.) userefresh true # Whether to show the list of active feeds in the generated HTML. # (This works by controlling whether the default template includes # __feeds__; if you use a custom template, __feeds__ is always # available.) showfeeds true # The number of concurrent threads that rawdog will use when fetching # feeds -- i.e. the number of feeds that rawdog will attempt to fetch at # the same time. If you have a lot of feeds, setting this to be 20 or # so will significantly speed up updates. If this is set to 1 (or # fewer), rawdog will not start any additional threads at all. numthreads 1 # The time that rawdog will wait before considering a feed unreachable # when trying to connect. If you're getting lots of timeout errors and # are on a slow connection, increase this. # (Unlike other times in this file, this will be assumed to be in # seconds if no unit is specified.) timeout 30s # Whether to ignore timeouts. If this is false, timeouts will be reported as # errors; if this is true, rawdog will silently ignore them. ignoretimeouts false # Whether to show Python traceback messages. If this is true, rawdog will show # a traceback message if an exception is thrown while fetching a feed; this is # mostly useful for debugging rawdog or feedparser. showtracebacks false # Whether to display verbose status messages saying what rawdog's doing # while it runs. Specifying -v or --verbose on the command line is # equivalent to saying "verbose true" here. verbose false # Whether to attempt to fix bits of HTML that should start with a # block-level element (such as article descriptions) by prepending "
" # if they don't already start with a block-level element. blocklevelhtml true # Whether to attempt to turn feed-provided HTML into valid HTML. # The most common problem that this solves is a non-closed element in an # article causing formatting problems for the rest of the page. # For this option to have any effect, you need to have PyTidyLib or mx.Tidy # installed. tidyhtml true # Whether the articles displayed should be sorted first by the date # provided in the feed (useful for "planet" pages, where you're # displaying several feeds and want new articles to appear in the right # chronological place). If this is false, then articles will first be # sorted by the time that rawdog first saw them. sortbyfeeddate false # Whether to consider articles' unique IDs or GUIDs when updating rawdog's # database. If you turn this off, then rawdog will create a new article in its # database when it sees an updated version of an existing article in a feed. # You probably want this turned on. useids true # The fields to use when detecting duplicate articles: "id" is the article's # unique ID or GUID; "link" is the article's link. rawdog will find the first # one of these that's present in the article, and ignore the article if it's # seen an article before (in any feed) that had the same value. For example, # specifying "hideduplicates id link" will first look for id/guid, then for # link. # Note that some feeds use the same link for all their articles; if you specify # "link" here, you will probably want to specify the "allowduplicates" feed # argument (see below) for those feeds. hideduplicates id # The period to use for new feeds added to the config file via the -a|--add # option. newfeedperiod 3h # Whether rawdog should automatically update this config file (and its # internal state) if feed URLs change (for instance, if a feed URL # results in a permanent HTTP redirect). If this is false, then rawdog # will ask you to make the necessary change by hand. changeconfig true # The feeds you want to watch, in the format "feed period url [args]". # The period is the minimum time between updates; if less than period # minutes have passed, "rawdog update" will skip that feed. Specifying # a period less than 30 minutes is considered to be bad manners; it is # suggested that you make the period as long as possible. # Arguments are optional, and can be given in two ways: either on the end of # the "feed" line in the form "key=value", separated by spaces, or as extra # indented lines after the feed line. # possible arguments are: # id Value for the __feed_id__ value in the item # template for items in this feed (defaults to the # feed title with non-alphanumeric characters and # HTML markup removed) # user User for HTTP basic authentication # password Password for HTTP basic authentication # format "text" to indicate that the descriptions in this feed # are unescaped plain text (rather than the usual HTML), # and should be escaped and wrapped in a
element
# X_proxy Proxy URL for protocol X (for instance, "http_proxy")
# proxyuser User for proxy basic authentication
# proxypassword Password for proxy basic authentication
# allowduplicates "true" to disable duplicate detection for this feed
# maxage Override the global "maxage" value for this feed
# keepmin Override the global "keepmin" value for this feed
# define_X Equivalent to "define X ..." for item templates
# when displaying items from this feed
# You can provide a default set of arguments for all feeds using
# "feeddefaults". You can specify as many feeds as you like.
# (These examples have been commented out; remove the leading "#" on each line
# to use them.)
#feeddefaults
# http_proxy http://proxy.example.com:3128/
#feed 1h http://example.com/feed.rss
#feed 30m http://example.com/feed2.rss id=newsfront
#feed 3h http://example.com/feed3.rss keepmin=5
#feed 3h http://example.com/secret.rss user=bob password=secret
#feed 3h http://example.com/broken.rss
# format text
# define_myclass broken
#feed 3h http://proxyfeed.example.com/proxied.rss http_proxy=http://localhost:1234/
#feed 3h http://dupsfeed.example.com/duplicated.rss allowduplicates=true
rawdog-2.19/rawdog 0000755 0004715 0004715 00000002104 11563740634 013557 0 ustar ats ats 0000000 0000000 #!/usr/bin/env python
# rawdog: RSS aggregator without delusions of grandeur.
# Copyright 2003, 2004, 2005, 2006 Adam Sampson
#
# rawdog is free software; you can redistribute and/or modify it
# under the terms of that license as published by the Free Software
# Foundation; either version 2 of the License, or (at your option)
# any later version.
#
# rawdog is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with rawdog; see the file COPYING. If not, write to the Free
# Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA, or see http://www.gnu.org/.
from rawdoglib.rawdog import main
import sys, os
def launch():
sys.exit(main(sys.argv[1:]))
if __name__ == "__main__":
if os.getenv("RAWDOG_PROFILE") is not None:
import profile
profile.run("launch()")
else:
launch()
rawdog-2.19/style.css 0000644 0004715 0004715 00000003116 11137666134 014223 0 ustar ats ats 0000000 0000000 /* Default stylesheet for rawdog. Customise this as you like.
Adam Sampson */
.xmlbutton {
/* From Dylan Greene's suggestion:
http://www.dylangreene.com/blog.asp?blogID=91 */
border: 1px solid;
border-color: #FC9 #630 #330 #F96;
padding: 0 3px;
font: bold 10px sans-serif;
color: #FFF;
background: #F60;
text-decoration: none;
margin: 0;
}
html {
margin: 0;
padding: 0;
}
body {
color: black;
background-color: white;
margin: 0;
padding: 10px;
font-size: medium;
}
#header {
background-color: #ffe;
border: 1px solid gray;
padding: 10px;
margin-bottom: 20px;
}
h1 {
font-weight: bold;
font-size: xx-large;
text-align: left;
margin: 0;
padding: 0;
}
#items {
}
.day {
clear: both;
}
h2 {
font-weight: bold;
font-size: x-large;
text-align: left;
margin: 10px 0;
padding: 0;
}
.time {
clear: both;
}
h3 {
font-weight: bold;
font-size: large;
text-align: left;
margin: 10px 0;
padding: 0;
}
.item {
margin: 20px 30px;
border: 1px solid gray;
clear: both;
}
.itemheader {
padding: 6px;
margin: 0;
background-color: #eee;
}
.itemtitle {
font-weight: bold;
}
.itemfrom {
font-style: italic;
}
.itemdescription {
border-top: 1px solid gray;
margin: 0;
padding: 6px;
}
#feedstatsheader {
}
#feedstats {
}
#feeds {
margin: 10px 0;
border: 1px solid gray;
border-spacing: 0;
}
#feedsheader TH {
background-color: #eee;
border-bottom: 1px solid gray;
padding: 5px;
margin: 0;
}
.feedsrow TD {
padding: 5px 10px;
margin: 0;
}
#footer {
background-color: #ffe;
border: 1px solid gray;
margin-top: 20px;
padding: 10px;
}
#aboutrawdog {
}
rawdog-2.19/MANIFEST.in 0000644 0004715 0004715 00000000334 12167056532 014105 0 ustar ats ats 0000000 0000000 include COPYING
include MANIFEST.in
include NEWS
include PLUGINS
include README
include config
include rawdog
include rawdog.1
include style.css
include test-rawdog
include testserver.py
recursive-include rawdoglib *.py
rawdog-2.19/PKG-INFO 0000644 0004715 0004715 00000001053 12273447040 013437 0 ustar ats ats 0000000 0000000 Metadata-Version: 1.1
Name: rawdog
Version: 2.19
Summary: RSS Aggregator Without Delusions Of Grandeur
Home-page: http://offog.org/code/rawdog/
Author: Adam Sampson
Author-email: ats@offog.org
License: UNKNOWN
Description: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: License :: OSI Approved :: GNU General Public License v2 or later (GPLv2+)
Classifier: Operating System :: POSIX
Classifier: Programming Language :: Python :: 2
Classifier: Topic :: Internet :: WWW/HTTP
rawdog-2.19/README 0000644 0004715 0004715 00000006405 12176436301 013230 0 ustar ats ats 0000000 0000000 rawdog: RSS Aggregator Without Delusions Of Grandeur
Adam Sampson
rawdog is a feed aggregator, capable of producing a personal "river of
news" or a public "planet" page. It supports all common feed formats,
including all versions of RSS and Atom. By default, it is run from cron,
collects articles from a number of feeds, and generates a static HTML
page listing the newest articles in date order. It supports per-feed
customizable update times, and uses ETags, Last-Modified, gzip
compression, and RFC3229+feed to minimize network bandwidth usage. Its
behaviour is highly customisable using plugins written in Python.
rawdog has the following dependencies:
- Python 2.6 or later (but not Python 3)
- feedparser 5.1.2 or later
- PyTidyLib 0.2.1 or later (optional but strongly recommended)
To install rawdog on your system, use distutils -- "python setup.py
install". This will install the "rawdog" command and the "rawdoglib"
Python module that it uses internally. (If you want to install to a
non-standard prefix, read the help provided by "python setup.py install
--help".)
rawdog needs a config file to function. Make the directory ".rawdog" in
your $HOME directory, copy the provided file "config" into that
directory, and edit it to suit your preferences. Comments in that file
describe what each of the options does.
You should copy the provided file "style.css" into the same directory
that you've told rawdog to write its HTML output to. rawdog should be
usable from a browser that doesn't support CSS, but it won't be very
pretty.
When you invoke rawdog from the command line, you give it a series of
actions to perform -- for instance, "rawdog --update --write" tells it
to do the "--update" action (downloading articles from feeds), then the
"--write" action (writing the latest articles it knows about to the HTML
file).
For details of all rawdog's actions and command-line options, see the
rawdog(1) man page -- "man rawdog" after installation.
You will want to run "rawdog -uw" periodically to fetch data and write
the output file. The easiest way to do this is to add a crontab entry
that looks something like this:
0,10,20,30,40,50 * * * * /path/to/rawdog -uw
(If you don't know how to use cron, then "man crontab" is probably a good
start.) This will run rawdog every ten minutes.
If you want rawdog to fetch URLs through a proxy server, then set your
"http_proxy" environment variable appropriately; depending on your
version of cron, putting something like:
http_proxy=http://myproxy.mycompany.com:3128/
at the top of your crontab should be appropriate. (The http_proxy
variable will work for many other programs too.)
In the event that rawdog gets horribly confused (for instance, if your
system clock has a huge jump and it thinks it won't need to fetch
anything for the next thirty years), you can forcibly clear its state by
removing the ~/.rawdog/state file (and the ~/.rawdog/feeds/*.state
files, if you've got the "splitstate" option turned on).
If you don't like the appearance of rawdog, then customise the style.css
file. If you come up with one that looks much better than the existing
one, please send it to me!
This should, hopefully, be all you need to know. If rawdog breaks in
interesting ways, please tell me at the email address at the top of this
file.
rawdog-2.19/test-rawdog 0000755 0004715 0004715 00000137122 12273242307 014536 0 ustar ats ats 0000000 0000000 #!/bin/sh
# test-rawdog: run some basic tests to make sure rawdog's working.
# Copyright 2013, 2014 Adam Sampson
#
# rawdog is free software; you can redistribute and/or modify it
# under the terms of that license as published by the Free Software
# Foundation; either version 2 of the License, or (at your option)
# any later version.
#
# rawdog is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with rawdog; see the file COPYING. If not, write to the Free
# Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA, or see http://www.gnu.org/.
# Default to the C locale, to avoid localised error messages.
default_LC_ALL="C"
# Try to find generic UTF-8 and Japanese UTF-8 locales. (They may not be
# installed.)
utf8_LC_ALL="$(locale -a | grep -i 'utf-\?8' | head -1)"
ja_LC_ALL="$(locale -a | grep -i 'ja_JP.utf-\?8' | head -1)"
# Default to UTC so that local times are reported consistently.
default_TZ="UTC"
statedir="testauto"
# Hostname and ports to run the test server on.
serverhost="localhost"
timeoutport="8431"
httpport="8432"
# Connections to this host should time out.
# (This is distinct from timeoutport above: if you connect to timeoutport, it
# will accept the connection but not do anything, whereas this will timeout
# while connecting.)
timeouthost=""
httpdir="$statedir/pub"
httpurl="http://$serverhost:$httpport"
usage () {
cat <.
EOF
exit 1
}
knownbad=false
keepgoing=false
rawdog="./rawdog"
while getopts bkr:T: OPT; do
case "$OPT" in
b)
knownbad=true
;;
k)
keepgoing=true
;;
r)
rawdog="$OPTARG"
;;
T)
timeouthost="$OPTARG"
;;
?)
usage
;;
esac
done
# Start the server, and kill it when this script exits.
serverpid=""
trap 'test -n "$serverpid" && kill $serverpid' 0
python testserver.py "$serverhost" "$timeoutport" "$httpport" "$httpdir" &
serverpid="$!"
exitcode=0
die () {
echo "Test failed:" "$@"
exitcode=1
if ! $keepgoing; then
exit $exitcode
fi
}
cleanstate () {
rm -fr $statedir $httpdir
mkdir -p $statedir $statedir/plugins $httpdir
cp config $statedir/config
export LC_ALL="$default_LC_ALL"
export TZ="$default_TZ"
}
add () {
echo "$1" >>$statedir/config
}
begin () {
echo ">>> Testing $1"
cleanstate
add "showtracebacks true"
cmdnum=0
}
equals () {
if [ "$1" != "$2" ]; then
die "expected '$1'; got '$2'"
fi
}
exists () {
for fn in "$@"; do
if ! [ -e "$fn" ]; then
die "expected $fn to exist"
fi
done
}
not_exists () {
for fn in "$@"; do
if [ -e "$fn" ]; then
die "expected $fn not to exist"
fi
done
}
same () {
exists "$1" "$2"
if ! cmp "$1" "$2"; then
die "expected $1 to have the same contents as $2"
fi
}
contains () {
file="$1"
exists "$file"
shift
for key in "$@"; do
if ! grep -q "$key" "$file"; then
cat "$file"
die "expected $file to contain '$key'"
fi
done
}
not_contains () {
file="$1"
exists "$file"
shift
for key in "$@"; do
if grep -q "$key" "$file"; then
cat "$file"
die "expected $file not to contain '$key'"
fi
done
}
# Run rawdog.
runf () {
cmdnum=$(expr $cmdnum + 1)
outfile=$statedir/out$cmdnum
$rawdog -d $statedir -V log$cmdnum "$@" >$outfile 2>&1
}
# Run rawdog, expecting it to exit 0.
run () {
if ! runf "$@"; then
cat $outfile
die "exited non-0"
fi
}
# Run rawdog, expecting it to exit non-0.
runn () {
if runf "$@"; then
cat $outfile
die "exited 0"
fi
}
# Run rawdog, expecting no complaints.
runs () {
run "$@"
if [ -s $outfile ]; then
cat $outfile
die "expected no output"
fi
}
# Run rawdog, expecting a complaint containing the first arg.
rune () {
key="$1"
shift
run "$@"
contains $outfile "$key"
}
# Run rawdog, expecting it to exit non-0 with a complaint containing the first
# arg.
runne () {
key="$1"
shift
runn "$@"
contains $outfile "$key"
}
make_text () {
cat >"$1" <"$1" <
Not a feed
This is manifestly not a feed.
EOF
}
make_html_head () {
cat >"$1" <
Not a feed
EOF
cat >>"$1"
cat >>"$1" <
This is manifestly not a feed.
EOF
}
make_html_body () {
cat >"$1" <
Not a feed
This is manifestly not a feed.
EOF
cat >>"$1"
cat >>"$1" <
EOF
}
make_rss10 () {
cat >"$1" <
example-feed-title
http://example.org/
example-feed-description
-
example-item-title
http://example.org/item
example-item-description
EOF
}
make_rss20 () {
cat >"$1" <
example-feed-title
http://example.org/
example-feed-description
-
example-item-title
http://example.org/item
example-item-description]]>
EOF
}
make_rss20_desc () {
cat >"$1" <
example-feed-title
http://example.org/
example-feed-description
-
example-item-title
http://example.org/item
EOF
cat >>"$1"
cat >>"$1" <
EOF
}
write_desc () {
make_rss20_desc $httpdir/feed.rss
add "feed 0 $httpurl/feed.rss"
runs -uw
}
make_atom10 () {
cat >"$1" <
example-feed-title
2013-01-01T18:00:00Z
example-feed-author
http://example.org/feed-id
example-item-title
http://example.org/item-id
2013-01-01T18:00:00Z
example-item-description
EOF
}
make_atom10_with () {
cat >"$1" <
example-feed-title
2013-01-01T18:00:00Z
example-feed-author
http://example.org/feed-id
example-item-title
http://example.org/item-id
2013-01-01T18:00:00Z
EOF
cat >>"$1"
cat >>"$1" <
EOF
}
make_single () {
cat >"$1" <
example-feed-title
2013-01-01T18:00:00Z
example-feed-author
http://example.org/feed-id
$2-title
$4
2013-01-01T18:00:00Z
$2-description
EOF
}
make_range () {
from="$1"
to="$2"
file="$3"
cat >"$file" <
example-feed-title
http://example.org/
example-feed-description
EOF
for i in $(seq $from $to); do
cat >>"$file" <
range-title-$i-
http://example.org/item$i
range-description-$i]]>
EOF
done
cat >>"$file" <
EOF
}
make_n () {
make_range 1 "$@"
}
range () {
seq -f "range-title-%.f-" $1 $2
}
output_range () {
contains $statedir/output.html $(range $1 $2)
}
not_output_range () {
not_contains $statedir/output.html $(range $1 $2)
}
output_n () {
output_range 1 "$@"
}
begin "help"
rune "Usage:" --help
begin "unknown option"
runn --aubergine
contains $outfile "Usage:"
begin "unnecessary argument"
runn aubergine
contains $outfile "Usage:"
begin "--verbose"
run -vu
contains $outfile "Starting update"
begin "--verbose overrides config"
add "verbose false"
echo "verbose false" >$statedir/config.inc
run -v -c config.inc -u
contains $outfile "Starting update"
begin "listing feeds"
make_rss20 $httpdir/0.rss
make_rss20 $httpdir/1.rss
add "feed 0 $httpurl/0.rss"
add "feed 0 $httpurl/1.rss"
run -l
contains $outfile $httpurl/0.rss $httpurl/1.rss
runs -u
run -l
contains $outfile "Title: example-feed-title"
begin "updating one feed"
make_rss20 $httpdir/feed.rss
add "feed 0 $httpurl/feed.rss"
runs -u
runs -f $httpurl/feed.rss
begin "updating nonexistant feed"
rune "No such feed" -f $httpurl/feed.rss
begin "bad config syntax"
add "foo"
rune "Bad line in config"
begin "bad config directive"
add "foo bar"
rune "Unknown config command"
begin "bad boolean value in config"
add "sortbyfeeddate aubergine"
rune "Bad value"
begin "bad time value in config"
add "timeout aubergine"
rune "Bad value"
begin "bad integer value in config"
add "maxarticles aubergine"
rune "Bad value"
begin "bad inline feed argument"
add "feed 0 $httpurl/feed.rss aubergine"
rune "Bad feed argument"
begin "bad feed argument line"
add "feed 0 $httpurl/feed.rss"
add " aubergine"
rune "Bad argument line"
begin "feed argument line with no feed"
: >$statedir/config
add " allowduplicates true"
rune "First line in config cannot be an argument"
begin "feeddefaults on one line"
add "feeddefaults allowduplicates=true"
runs
begin "feeddefaults argument lines"
add "feeddefaults"
add " allowduplicates true"
runs
begin "argument lines in the wrong place"
add "tidyhtml false"
add " allowduplicates true"
rune "Bad argument lines"
begin "feed with no time"
add "feed"
rune "Bad line in config"
begin "feed with no URL"
add "feed 3h"
rune "Bad line in config"
begin "define with no name"
add "define"
rune "Bad line in config"
begin "define with no value"
add "define thing"
rune "Bad line in config"
begin "define"
add "define myvar This is my variable!"
echo "myvar(__myvar__)" >$statedir/page
add "pagetemplate page"
runs -uw
contains $statedir/output.html "myvar(This is my variable!)"
begin "missing config file"
rm $statedir/config
rune "Can't read config file" -u
begin "empty config file"
: >$statedir/config
runs -uw
begin "--config and include"
make_rss20 $httpdir/feed.rss
add "feed 0 $httpurl/feed.rss"
runs -uw
exists $statedir/output.html
rm $statedir/output.html
echo "outputfile second.html" >$statedir/config.inc
runs -c config.inc -w
exists $statedir/second.html
not_exists $statedir/output.html
rm $statedir/second.html
add "include config.inc"
runs -w
exists $statedir/second.html
not_exists $statedir/output.html
rm $statedir/second.html
begin "missing state dir"
runn -d aubergine
contains $outfile "No aubergine directory"
begin "corrupt state file"
echo this is not a valid state file >$statedir/state
runne "means the file is corrupt" -u
begin "empty state file"
touch $statedir/state
runne "means the file is corrupt" -u
begin "corrupt splitstate file"
make_rss20 $statedir/simple.rss
add "splitstate true"
add "feed 0 simple.rss"
runs -u
echo this is not a valid state file >$(echo $statedir/feeds/*.state)
runne "means the file is corrupt" -u
for run in first second feed-adding; do
for state in false true; do
begin "recover from crash on $run run, splitstate $state"
make_rss20 $statedir/0.rss
add "splitstate $state"
add "feed 0 0.rss"
if [ "$run" != first ]; then
runs -u
fi
if [ "$run" = feed-adding ]; then
make_rss20 $statedir/1.rss
add "feed 0 1.rss"
fi
# Crash while updating, so we have both state files open.
cat >$statedir/plugins/crash.py <$statedir/plugins/crash.py <$statedir/plugins/nolock.py <$statedir/lock.py <$statedir/plugins/wait.py <$statedir/plugins/junk.txt
cat >$statedir/plugins/.hidden.py <$statedir/plugins/a.py <$statedir/plugins/b.py <$httpdir/empty.xml <
example-feed-title
http://example.org/
example-feed-description
EOF
add "feed 0 $httpurl/empty.xml"
runs -u
begin "HTTP 404"
add "feed 0 $httpurl/notthere"
rune "404" -u
for proto in http https ftp; do
if [ -n "$timeouthost" ]; then
begin "$proto: connect timeout"
add "timeout 1s"
add "feed 0 $proto://$timeouthost/feed.xml"
rune "Timeout while reading" -u
fi
begin "$proto: response timeout"
add "timeout 1s"
add "feed 0 $proto://$serverhost:$timeoutport/feed.xml"
rune "Timeout while reading" -u
done
begin "ignoretimeouts true"
add "timeout 1s"
add "ignoretimeouts true"
add "feed 0 http://$serverhost:$timeoutport/feed.xml"
runs -u
begin "0 period"
make_rss20 $httpdir/simple.rss
add "feed 0 $httpurl/simple.rss"
runs -u
rm $httpdir/simple.rss
rune "404" -u
begin "1h period"
make_rss20 $httpdir/simple.rss
add "feed 1h $httpurl/simple.rss"
runs -u
rm $httpdir/simple.rss
runs -u
begin "10 items"
make_n 10 $httpdir/feed.rss
add "feed 0 $httpurl/feed.rss"
runs -uw
output_n 10
begin "new articles are collected"
make_n 3 $httpdir/feed.rss
add "feed 0 $httpurl/feed.rss"
runs -uw
output_n 3
make_n 6 $httpdir/feed.rss
runs -uw
output_n 6
begin "outputfile"
make_rss20 $httpdir/feed.rss
add "feed 0 $httpurl/feed.rss"
add "outputfile second.html"
runs -uw
contains $statedir/second.html example-feed-title
begin "outputfile -"
make_rss20 $httpdir/feed.rss
add "feed 0 $httpurl/feed.rss"
add "outputfile -"
run -uw
contains $outfile example-feed-title
begin "maxarticles 10"
make_n 20 $httpdir/feed.rss
add "maxarticles 10"
add "feed 0 $httpurl/feed.rss"
runs -uw
output_n 10
not_output_range 11 20
begin "keepmin 10"
make_n 20 $httpdir/feed.rss
add "keepmin 10"
add "expireage 0"
add "feed 0 $httpurl/feed.rss"
runs -uw
output_n 20
make_n 5 $httpdir/feed.rss
runs -uw
# Should have the 5 currently in the feed, and 10 in total
output_n 5
if [ $(grep range-title- $statedir/output.html | wc -l) != 10 ]; then
die "Should contain 10 items"
fi
begin "currentonly true"
make_n 10 $httpdir/feed.rss
add "currentonly true"
add "feed 0 $httpurl/feed.rss"
runs -uw
output_n 10
make_n 5 $httpdir/feed.rss
runs -uw
output_n 5
not_output_range 6 10
for state in false true; do
begin "useids $state"
add "useids $state"
add "hideduplicates none"
add "feed 0 $httpurl/feed.atom"
echo "Original " | make_atom10_with $httpdir/feed.atom
runs -uw
contains $statedir/output.html Original
echo "Revised " | make_atom10_with $httpdir/feed.atom
runs -uw
contains $statedir/output.html Revised
if $state; then
# Should have updated the existing article
not_contains $statedir/output.html Original
else
# Should have kept both versions
contains $statedir/output.html Original
fi
done
dupecheck () {
add "useids false"
add "feed 0 $httpurl/feed.atom"
make_single $httpdir/feed.atom item-a \
http://example.org/link/x http://example.org/id/0
runs -u
make_single $httpdir/feed.atom item-b \
http://example.org/link/x http://example.org/id/1
runs -u
make_single $httpdir/feed.atom item-c \
http://example.org/link/y http://example.org/id/1
runs -uw
}
begin "hideduplicates none"
add "hideduplicates none"
dupecheck
contains $statedir/output.html item-a-title item-b-title item-c-title
begin "hideduplicates id"
add "hideduplicates id"
dupecheck
contains $statedir/output.html item-a-title item-c-title
not_contains $statedir/output.html item-b-title
begin "hideduplicates link"
add "hideduplicates link"
dupecheck
contains $statedir/output.html item-b-title item-c-title
not_contains $statedir/output.html item-a-title
begin "hideduplicates link id"
add "hideduplicates link id"
dupecheck
contains $statedir/output.html item-c-title
not_contains $statedir/output.html item-a-title item-b-title
begin "allowduplicates"
add "feeddefaults allowduplicates=true"
add "hideduplicates link id"
dupecheck
contains $statedir/output.html item-a-title item-b-title item-c-title
begin "sortbyfeeddate false/true"
# Debian bug 651080.
for day in 01 02 03; do
cat >$httpdir/$day.atom <
example-feed-title-${day}
2013-01-${day}T18:00:00Z
example-feed-author
http://example.org/${day}/feed-id
example-item-title-${day}
http://example.org/${day}/item-id
2013-01-${day}01T18:00:00Z
ENTRY-${day}
EOF
done
entries () {
grep 'ENTRY' $statedir/output.html | sed 's,.*ENTRY-\(..\).*,\1,' | xargs -n10 echo
}
add "feed 0 $httpurl/03.atom"
runs -u
add "feed 0 $httpurl/02.atom"
runs -u
add "feed 0 $httpurl/01.atom"
add "sortbyfeeddate false"
runs -uw
equals "01 02 03" "$(entries)"
add "sortbyfeeddate true"
runs -w
equals "03 02 01" "$(entries)"
for dstate in false true; do
for tstate in false true; do
begin "daysections $dstate, timesections $tstate"
cat >$httpdir/feed.rss <
example-feed-title
http://example.org/
example-feed-description
-
Thu, 03 Jan 2013 18:00:00 +0000
item-1
http://example.org/1
-
Wed, 02 Jan 2013 18:00:00 +0000
item-2
http://example.org/2
-
Tue, 01 Jan 2013 19:00:00 +0000
item-3
http://example.org/3
-
Tue, 01 Jan 2013 18:00:00 +0000
item-4
http://example.org/4
EOF
add "dayformat day(%d)"
add "timeformat time(%H)"
add "daysections $dstate"
add "timesections $tstate"
add "sortbyfeeddate true"
add "feed 0 $httpurl/feed.rss"
runs -uw
if $dstate; then
contains $statedir/output.html \
'day(01)' 'day(02)' 'day(03)'
else
not_contains $statedir/output.html 'day('
fi
if $tstate; then
contains $statedir/output.html \
'time(18)' 'time(19)'
else
not_contains $statedir/output.html 'time('
fi
done
done
begin "default templates"
make_rss20 $httpdir/simple.rss
add "feed 0 $httpurl/simple.rss"
runs -uw
cp $statedir/output.html $statedir/output.html.orig
for template in page item feedlist feeditem; do
run -s $template
cp $outfile $statedir/$template
run --show $template
same $outfile $statedir/$template
add "${template}template ${template}"
done
run -w
same $statedir/output.html.orig $statedir/output.html
begin "show unknown template"
runn -s aubergine
contains $outfile "Unknown template"
begin "pre-2.15 template options"
make_rss20 $httpdir/simple.rss
add "feed 0 $httpurl/simple.rss"
runs -uw
cp $statedir/output.html $statedir/output.html.orig
run -t
cp $outfile $statedir/page
run --show-template
same $outfile $statedir/page
run -T
cp $outfile $statedir/item
run --show-itemtemplate
same $outfile $statedir/item
add "template page"
add "itemtemplate item"
run -w
same $statedir/output.html.orig $statedir/output.html
echo MAGIC1__items__ >$statedir/page
echo MAGIC2 >$statedir/item
run -uw
contains $statedir/output.html MAGIC1 MAGIC2
for template in page item feedlist feeditem; do
begin "missing ${template} template file"
add "${template}template ${template}"
rune "Can't read template file" -u
done
begin "template conditionals"
make_atom10 $httpdir/feed.atom
cat >$statedir/item <$statedir/item
make_atom10 $httpdir/feed.atom
add "feed 0 $httpurl/feed.atom"
add "itemtemplate item"
rune "Character encoding problem" -uw
if [ -n "$utf8_LC_ALL" ]; then
begin "UTF-8 in template, UTF-8 locale"
echo "char(ø)" >$statedir/item
make_atom10 $httpdir/feed.atom
add "feed 0 $httpurl/feed.atom"
add "itemtemplate item"
export LC_ALL="$utf8_LC_ALL"
runs -uw
contains $statedir/output.html "char(ø)"
fi
begin "UTF-8 in define, ASCII locale"
make_atom10 $httpdir/feed.atom
echo "expand(__thing__)" >$statedir/item
add "itemtemplate item"
add "feed 0 $httpurl/feed.atom"
add " define_thing char(ø)"
rune "Character encoding problem" -uw
if [ -n "$utf8_LC_ALL" ]; then
begin "UTF-8 in define, UTF-8 locale"
make_atom10 $httpdir/feed.atom
echo "expand(__thing__)" >$statedir/item
add "itemtemplate item"
add "feed 0 $httpurl/feed.atom"
add " define_thing char(ø)"
export LC_ALL="$utf8_LC_ALL"
runs -uw
contains $statedir/output.html "expand(char(ø))"
fi
begin "item dates"
# Debian bug 651080.
run -s item
cp $outfile $statedir/item
echo "__date__" >>$statedir/item
make_atom10 $httpdir/feed.atom
add "feed 0 $httpurl/feed.atom"
add "sortbyfeeddate true"
add "timeformat HEADING-%m-%d-%H:%M"
add "datetimeformat ITEMDATE-%m-%d-%H:%M"
add "itemtemplate item"
runs -uw
contains $statedir/output.html "HEADING-01-01-18:00" "ITEMDATE-01-01-18:00"
begin "dates shown in local time"
echo "__date__" >$statedir/item
make_atom10 $httpdir/feed.atom
add "feed 0 $httpurl/feed.atom"
add "sortbyfeeddate true"
add "timeformat HEADING-%m-%d-%H:%M"
add "datetimeformat ITEMDATE-%m-%d-%H:%M"
add "itemtemplate item"
runs -u
export TZ="GMT+5"
runs -w
contains $statedir/output.html "HEADING-01-01-13:00" "ITEMDATE-01-01-13:00"
export TZ="$default_TZ"
runs -w
contains $statedir/output.html "HEADING-01-01-18:00" "ITEMDATE-01-01-18:00"
if [ -n "$ja_LC_ALL" ]; then
begin "dates shown in Japanese"
echo "__date__" >$statedir/item
make_atom10 $httpdir/feed.atom
add "feed 0 $httpurl/feed.atom"
add "sortbyfeeddate true"
add "timeformat HEADING-%A-%c"
add "datetimeformat ITEMDATE-%A-%c"
add "itemtemplate item"
export LC_ALL="$ja_LC_ALL"
runs -uw
# Japanese for Tuesday, in Unicode.
tue="火曜日"
contains $statedir/output.html "HEADING-$tue-" "ITEMDATE-$tue-"
not_contains $statedir/output.html "Tuesday"
export LC_ALL="$default_LC_ALL"
runs -uw
contains $statedir/output.html "HEADING-Tuesday" "ITEMDATE-Tuesday"
fi
begin "item authors"
cat >$httpdir/feed.atom <
example-feed-title
2013-01-01T18:00:00Z
http://example.org/feed-id
author-1
example-item-title-1
http://example.org/item-id/1
2013-01-01T18:00:00Z
example-item-description
author-2
author2@example.org
example-item-title-2
http://example.org/item-id/2
2013-01-01T18:00:00Z
example-item-description
author-3
http://example.org/author3
example-item-title-3
http://example.org/item-id/3
2013-01-01T18:00:00Z
example-item-description
author-4
author4@example.org
http://example.org/author4
example-item-title-4
http://example.org/item-id/4
2013-01-01T18:00:00Z
example-item-description
http://a5.example.org
example-item-title-5
http://example.org/item-id/5
2013-01-01T18:00:00Z
example-item-description
EOF
cat >$statedir/item <author-2)" \
"author(author-3)" \
"author(author-4)" \
"author(http://a5.example.org)"
begin "feed list templates"
make_rss20 $httpdir/0.rss
make_rss20 $httpdir/1.rss
make_rss20 $httpdir/2.rss
add "feed 0 $httpurl/0.rss"
add "feed 0 $httpurl/1.rss"
add "feed 0 $httpurl/2.rss"
run -s feedlist
cp $outfile $statedir/feedlist
echo "FEEDLIST" >>$statedir/feedlist
run -s feeditem
cp $outfile $statedir/feeditem
echo "FEEDITEM-__feed_url__" >>$statedir/feeditem
add "feedlisttemplate feedlist"
add "feeditemtemplate feeditem"
run -w
contains $statedir/output.html \
FEEDLIST \
FEEDITEM-$httpurl/0.rss FEEDITEM-$httpurl/1.rss FEEDITEM-$httpurl/2.rss
begin "prefer content over summary"
make_atom10_with $httpdir/1.atom <Content1
EOF
make_atom10_with $httpdir/2.atom <Summary2
EOF
# Note that feedparser 5.1.3 will do odd things if summary follows content --
# feedparser issue 412.
make_atom10_with $httpdir/3.atom <Summary3
Content3
EOF
add "useids false"
add "hideduplicates none"
add "feed 0 $httpurl/1.atom"
add "feed 0 $httpurl/2.atom"
add "feed 0 $httpurl/3.atom"
runs -uw
contains $statedir/output.html Content1 Summary2 Content3
not_contains $statedir/output.html Summary3
begin "showfeeds true/false"
make_atom10 $httpdir/simple.atom
add "feed 0 $httpurl/simple.atom"
runs -u
add "showfeeds true"
runs -w
contains $statedir/output.html $httpurl/simple.atom
add "showfeeds false"
runs -w
not_contains $statedir/output.html $httpurl/simple.atom
begin "userefresh true/false"
make_atom10 $httpdir/0.atom
make_atom10 $httpdir/1.atom
# It should pick the lowest of these and convert to seconds.
add "feed 1m $httpurl/0.atom"
add "feed 2m $httpurl/1.atom"
runs -u
add "userefresh true"
runs -w
contains $statedir/output.html 'http-equiv="Refresh" content="60"'
add "userefresh false"
runs -w
not_contains $statedir/output.html 'http-equiv="Refresh"'
begin "HTTP basic authentication"
make_rss20 $httpdir/private.rss
add "feed 0 $httpurl/auth-TestUser-TestPass/private.rss"
rune "401" -u
add " user TestUser"
add " password TestPass"
runs -u
# Generate a plugin to check that feedparser returned a particular HTTP status
# code.
checkstatus () {
cat >$statedir/plugins/checkstatus.py <$httpdir/.rewrites "/old.rss /301/new.rss"
rune "has been updated automatically" -uw
# We should still have the original items at this point.
output_range 1 10
runs -uw
output_range 1 10
done
begin "changeconfig for feed from included file"
make_rss20 $httpdir/feed.rss
add "changeconfig true"
add "include config2"
echo >$statedir/config2 "feed 0 $httpurl/301/feed.rss"
rune "has been updated automatically" -u
# FIXME: this behaviour is probably not what the user wanted.
# rawdog should probably complain that it's trying to change
# something but hasn't succeeded.
not_contains $statedir/config "$httpurl/feed.rss"
contains $statedir/config2 "$httpurl/301/feed.rss"
not_contains $statedir/config2 "$httpurl/feed.rss"
begin "changeconfig to same URL as existing feed"
make_rss20 $httpdir/feed.rss
add "changeconfig true"
add "feed 0 $httpurl/feed.rss"
runs -u
add "feed 0 $httpurl/301/feed.rss"
rune "already subscribed" -u
for state in false true; do
begin "changeconfig to URL of just-removed feed, splitstate $state"
make_rss20 $httpdir/feed.rss
add "splitstate $state"
add "changeconfig true"
add "feed 0 $httpurl/feed.rss"
runs -u
# Simulate the change failing, then succeeding.
for i in 1 2; do
: >$statedir/config
add "splitstate $state"
add "changeconfig true"
add "feed 0 $httpurl/301/feed.rss"
rune "has been updated automatically" -u
contains $statedir/config "$httpurl/feed.rss"
not_contains $statedir/config "$httpurl/301/feed.rss"
done
runs -u
done
begin "feed format text"
make_rss20_desc $httpdir/feed.rss <three < four"
begin "feed id"
make_rss20 $httpdir/0.rss
make_rss20 $httpdir/1.rss
add "feed 0 $httpurl/0.rss id=blah"
add "feed 0 $httpurl/1.rss"
add "itemtemplate item"
echo "feed-id(__feed_id__)" >$statedir/item
runs -uw
contains $statedir/output.html "feed-id(blah)" "feed-id(examplefeedtitle)"
begin "shorttag expansion"
#
bug fixed 2006-01-07.
#
/ has a workaround in feedparser for sgmllib.
add "tidyhtml false"
write_desc <0
" \
"1
" \
"2
/"
begin "broken processing instruction"
write_desc <
link
]]>
EOF
contains $statedir/output.html \
"$httpurl/rel-link" \
"$httpurl/rel-img"
begin "Javascript removed"
write_desc <
span
]]>
EOF
not_contains $statedir/output.html "Annoying1" "Annoying2"
begin "stray ] in URL"
# This produced an "Invalid IPv6 URL" exception with feedparser r738.
write_desc <link]]>
EOF
contains $statedir/output.html not-broken
if $knownbad; then
begin "escaped slashes in URL"
# feedparser issue 407: links with :// escaped get mangled (reported in
# rawdog by Joseph Reagle).
write_desc <link
link
link
link
]]>
EOF
contains $statedir/output.html \
http://example.com/0 http://example.com/1 \
http://example.com/2 http://example.com/3
fi
begin "add feed, actually a feed"
make_rss20 $httpdir/feed.rss
rune "Adding feed" -a $httpurl/feed.rss
contains "$statedir/config" $httpurl/feed.rss
begin "add feed, relative "
# Debian bug 657206.
make_rss20 $httpdir/feed.rss
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/feed.rss
begin "add feed, absolute "
make_rss20 $httpdir/feed.rss
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/feed.rss
begin "add feed, typical blog"
# Roughly what blogspot pages have.
make_atom10 $httpdir/posts
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/posts
not_contains "$statedir/config" "alt=rss"
begin "add feed, avoid HTML "
make_html $httpdir/dummy.html
make_html_head $httpdir/page.html <
EOF
rune "Cannot find any feeds" -a $httpurl/page.html
begin "add feed, with obvious URL"
make_rss20 $httpdir/foo.rss
make_html_body $httpdir/page.html <Here is our feed!
EOF
rune "Adding feed" -a $httpurl/page.html
if $knownbad; then
begin "add feed, with non-obvious URL"
# ... as boingboing.net currently has (old feedfinder doesn't find
# this; it finds /atom.xml by brute force).
make_rss20 $httpdir/foo
make_html_body $httpdir/page.html <Here is our RSS feed!
EOF
rune "Adding feed" -a $httpurl/page.html
fi
if $knownbad; then
# Old feedfinder could find this because it tried appending lots of
# likely suffixes to URLs. However, this generally isn't needed
# nowdays; most of the feeds that it could find that way have proper
# elements.
begin "add feed, brute force"
make_atom10 $httpdir/index.atom
make_html $httpdir/page.html
rune "Adding feed" -a $httpurl/page.html
fi
begin "add feed, no feeds to be found"
make_html $httpdir/page.html
rune "Cannot find any feeds" -a $httpurl/page.html
begin "add feed, nonsense in HTML"
# Debian bug 650776. This will provoke a HTMLParseError.
make_rss20 $httpdir/feed.rss
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/feed.rss
begin "add feed, already present"
make_atom10 $httpdir/feed.atom
add "feed 3h $httpurl/feed.atom"
rune "already in the config file" -a $httpurl/feed.atom
begin "add feed, prefer RSS 1.0 over nonsense"
make_rss10 $httpdir/feed.rdf
echo "this is nonsense" >$httpdir/feed.rss
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/feed.rdf
begin "add feed, prefer RSS 2 over RSS 1.0"
make_rss10 $httpdir/feed.rdf
make_rss20 $httpdir/feed.rss
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/feed.rss
begin "add feed, prefer .rss2 over .rss"
make_rss20 $httpdir/feed.rss
make_rss20 $httpdir/feed.rss2
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/feed.rss2
begin "add feed, prefer Atom over RSS"
make_rss10 $httpdir/feed.rdf
make_rss20 $httpdir/feed.rss
make_rss20 $httpdir/feed.rss2
make_atom10 $httpdir/feed.atom
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/feed.atom
begin "add feed, prefer entries over comments"
make_atom10 $httpdir/comments.atom
make_atom10 $httpdir/entries.atom
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/entries.atom
begin "add feed, keep page order"
make_atom10 $httpdir/0.atom
make_atom10 $httpdir/1.atom
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/0.atom
begin "add feed, ignore broken link"
make_atom10 $httpdir/1.atom
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/1.atom
begin "add feed, UTF-8 in attr"
# This problem showed up in orbitbooks.net's front page. The intent is fine,
# but it crashes Python 2.7's HTMLParser if it's not properly decoded.
make_atom10 $httpdir/feed.atom
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/feed.atom
begin "add feed, gzip-encoded response"
make_rss20 $httpdir/feed.rss
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/gzip/page.html
contains "$statedir/config" $httpurl/feed.rss
begin "add feed, gzip-encoded feed"
make_rss20 $httpdir/feed.rss
make_html_head $httpdir/page.html <
EOF
rune "Adding feed" -a $httpurl/page.html
contains "$statedir/config" $httpurl/gzip/feed.rss
begin "remove feed"
add "feed 3h $httpurl/0.rss"
add "feed 3h $httpurl/1.rss"
add "feed 3h $httpurl/2.rss"
rune "Removing feed" -r $httpurl/1.rss
contains "$statedir/config" $httpurl/0.rss $httpurl/2.rss
not_contains "$statedir/config" $httpurl/1.rss
begin "remove feed with options"
add "feed 3h $httpurl/0.rss"
add " define_foo 0a"
add " define_foo 0b"
add "feed 3h $httpurl/1.rss"
add " define_foo 1a"
add " define_foo 1b"
add "feed 3h $httpurl/2.rss"
add " define_foo 2a"
add " define_foo 2b"
rune "Removing feed" -r $httpurl/1.rss
contains "$statedir/config" \
$httpurl/0.rss "foo 0a" "foo 0b" \
$httpurl/2.rss "foo 2a" "foo 2b"
not_contains "$statedir/config" \
$httpurl/1.rss "foo 1a" "foo 1b"
begin "remove feed, preserving comments"
add "feed 3h $httpurl/0.rss"
add " define_foo 0a"
add "# Keep this comment"
add " define_foo 0b"
rune "Removing feed" -r $httpurl/0.rss
contains $statedir/config "# Keep this comment"
not_contains $statedir/config "foo 0a" "foo 0b"
begin "remove nonexistant feed"
add "feed 3h $httpurl/0.rss"
add "feed 3h $httpurl/1.rss"
add "feed 3h $httpurl/2.rss"
rune "not in the config file" -r $httpurl/3.rss
for state in false true; do
for fetched in false true; do
not=$(if ! $fetched; then echo "not "; fi)
begin "remove feed, ${not}fetched, splitstate $state"
make_rss20 $httpdir/feed.rss
add "feed 0 $httpurl/feed.rss"
add "splitstate $state"
if $fetched; then
runs -uw
contains $statedir/output.html example-item-title
if $state; then
exists $statedir/feeds/*
fi
fi
rune "Removing feed" -r $httpurl/feed.rss
if $state; then
not_exists $statedir/feeds/*
fi
runs -uw
not_contains $statedir/output.html example-item-title
done
done
# Run the plugins test suite if it's there.
if [ -e rawdog-plugins/test-plugins ]; then
. rawdog-plugins/test-plugins
fi
exit $exitcode
rawdog-2.19/NEWS 0000644 0004715 0004715 00000103250 12273274736 013055 0 ustar ats ats 0000000 0000000 - rawdog 2.19
Make test-rawdog not depend on having a host it can test connection
timeouts against, and add a -T option if you do have one.
When renaming a feed's state file in splitstate mode, don't fail if the
state file doesn't exist -- which can happen if we get a 301 response
for a feed the first time we fetch it. Also rename the lock file along
with the state file.
Add some more comprehensive tests for the changeconfig option; in
particular, test it more thoroughly with splitstate both on and off.
Don't crash if feedparser raises an exception during an update (i.e.
assume that any part of feedparser's response might be missing, until
we've checked that there wasn't an exception).
- rawdog 2.18
Be consistent about catching AttributeError when looking for attributes
that were added to Rawdog during the 2.x series (spotted by Jakub Wilk).
Add some advice in PLUGINS about escaping template parameters. Willem
reported that the enclosure plugin didn't do this, and having had a look
at the others it seems to be a common problem.
Make feedscanner handle "Content-Encoding: gzip" in responses, as
tumblr.com's webservers will use this even if you explicitly refuse it
in the request.
- rawdog 2.17
Add a one-paragraph description of rawdog to the README file, for use by
packgers.
Fix some misquoted dashes in the man page (spotted by lintian).
Set LC_ALL=C and TZ=UTC when running tests, in order to get predictable
results regardless of locale or timezone (reported by Etienne Millon).
Give sensible error messages on startup (rather than crashing) if the
config file or a template file is missing, or contains characters that
aren't in the system encoding.
Give test-rawdog some command-line options; you can now use it to test
an installed version of rawdog, or rawdog running under a non-default
Python version.
Add some more tests to the test suite, having done a coverage analysis
to work out which features weren't yet being tested: date formatting in
varying locales and timezones; RSS 1.0 support; --dump, -c, -f, -l, -N,
-v and -W; include; plugin loading; feed format and id options; author
formatting; template conditionals: broken 301 redirects; useids; content
vs. summary; daysections/timesections; removing articles from a feed;
keeping comments; numthreads; outputfile; various error messages.
Use author URIs retrieved from feeds when formatting author names
(rather than ignoring them; this was the result of a feedparser change).
Make subclasses of Persistable call Persistable's constructor.
(Identified by coverage analysis.)
Don't crash when trying to show a template that doesn't exist.
When removing a feed in splitstate mode, remove its lock file too.
- rawdog 2.16
Remove the bundled copy of feedparser, and document that it's now a
dependency.
Update the package metadata in setup.py.
- rawdog 2.15
rawdog now requires Python 2.6 (rather than Python 2.2). This is the
version in Debian and Red Hat's previous stable releases, so it should
be safe to assume on current systems.
Make setup.py complain if you have an inappropriate Python version.
Remove obsolete code that supported pre-2.6 versions of Python
(timeoutsocket.py, conditional imports, 0/1 for bools, dicts for sets,
locking without with, various standard library features).
Tidy up the code formatting in a few places to make it closer to PEP 8.
Make the rawdog(1) man page describe all of rawdog's options, and make
some other minor improvements to the documentation and help.
Remove the --upgrade option; I think it's highly unlikely that anybody
still has any rawdog 1 state files around.
Make the code that manages the pool of feed-fetching threads only start
as many threads as necessary (including none if there's only one feed to
fetch), and generally tidy it up.
Add test-rawdog, a simple test suite for rawdog with a built-in
webserver. You should be able to run this from the rawdog source
directory to check that much of rawdog is working correctly.
(If you have the rawdog plugins repo in a subdirectory called
"rawdog-plugins", it'll run tests on some of the plugins too.)
Add a -V option, which is like -v but appends the verbose output to a
file. This is mostly useful for testing.
Significantly rework the Persister class: there's now a Persisted class
that can act as a context manager for "with" statements, which
simplifies the code quite a bit, and it correctly handles persisted
objects being opened multiple times and renamed. persister.py is now
under the same license as the rest of rawdog (GPLv2+).
Fix a bug: if you're using splitstate mode, and a feed returns a 301
permanent redirect, rawdog needs to rename the state file and adjust the
articles in it so they're attached to the feed's new URL. In previous
versions this didn't work correctly for two reasons: it tried to load
the existing articles from the new filename, and the resulting file got
clobbered because it was already being used by --update.
Rework the locking logic in persister so that it uses a separate lock
file. This fixes a (mostly) harmless bug: previously if rawdog A was
waiting for rawdog B to finish, then rawdog A wouldn't see the changes
rawdog B had written to the state file. More importantly, it means
rawdog won't leave an empty ("corrupt") state file if it crashes during
the first update or write.
Split state files are now explicitly marked as modified if any articles
were expired from them. (This won't actually change rawdog's behaviour,
since articles were only expired if some articles had been seen during
the update, and that would also have marked the state as modified.)
When splitstate is enabled, make the feeds directory if it doesn't
already exist. This avoids a confusing error message if you didn't make
it by hand.
rawdog now complains if feedparser can't detect the type of a feed or
retrieve any items from it. This usually means that the URL isn't
actually a feed -- for example, if it's redirecting to an error page.
rawdog can now report more than one error for a feed at once -- e.g.
a permanent redirection to something that isn't a feed.
Show URLError exceptions returned by feedparser -- this means rawdog
gives a sensible error message for a file: or ftp: URL that gives an
error, rather than claiming it's a timeout. Plain filenames are now
turned into file: URLs so you get consistent errors for both, and
timeouts are detected by looking for a timeout exception.
Use a custom urllib2 handler to capture all the HTTP responses that
feedparser sees when handling redirects. This means rawdog can now see
both the initial and final status code (rather than the combined one
feedparser returns) -- so it can correctly handle redirects to errors,
and redirects to redirects.
Make "hideduplicates id link" work correctly in the odd corner case
where an article has both id and link duplicated, but to different other
articles.
Upgrade feedparser to version 5.1.3. As a result of the other changes
below, rawdog's copy of feedparser is now completely unmodified -- so
it should be safe to remove it and use your system version if you prefer
(provided it's new enough).
Add a --dump option to pretty-print feedparser's output for a URL.
The feedparser module used to do this if invoked as a script, but more
recent versions of feedparser don't support this.
Use a custom urllib2 handler to do HTTP basic authentication, instead of
a feedparser patch. This also fixes proxy authentication, which I
accidentally broke by removing a helper class several releases ago.
Use a custom urllib2 handler to disable RFC 3229, instead of a
feedparser patch. The behaviour is slightly different in that it now
sends "A-IM: identity" rather than no header at all; this should have
the same effect, though.
Remove the feedparser patch that provided "_raw" versions of content
(before sanitisation) for use in the article hash, and use the normal
version instead. Since we disable sanitisation at fetch time anyway, the
only difference with current feedparser is that the _raw versions didn't
have CP1252 encoding fixes applied -- so in the process of upgrading to
this version, you'll see some duplicate articles on feeds with CP1252
encoding problems. Tests suggest this doesn't affect many feeds (3 out
of the 1000-odd in my test setup).
Set feedparser behaviour using SANITIZE_HTML etc., rather than by
directly changing the lists of elements it's looking for.
Replace feedfinder, which has unfixable unclear licensing, with the
module that Decklin Foster wrote for his Debian package of rawdog
(specifically rawdog_2.13.dfsg.1-1). I've renamed it to "feedscanner",
on the grounds that it may be useful to other projects as well in the
future.
Put feedscanner's license notice into __license__, for consistency with
feedparser.
Make feedscanner understand HTML-style as well as XHTML-style
.
Fix Debian bug 657206: make feedscanner understand relative links
(reported by Peter J. Weisberg).
Fix Debian bug 650776: make feedscanner not crash if it can't parse the
URL it was given as HTML (reported by Jonathan Polley).
Make rawdog use feedscanner's preferred order of feeds in addition to
its own.
Make feedscanner only return URLs that feedparser can parse successfully
as feeds.
Make feedscanner look for links pointing to URLs with words in them
that suggest they're probably feeds.
Make feedscanner check whether the URL it was given is already a feed
before scanning it for links.
Make feedscanner decode the HTML it reads (silently ignoring errors)
before trying to parse it.
Move rawdog's feed quality heuristic into feedscanner.
Simplify the options for dealing with templates: there is now a
-s/--show command-line option that takes a template name as an argument
(i.e. you do "rawdog -s item" rather than "rawdog -T"), and the
"template" config file option is now called "pagetemplate". This
simplifies the code, and makes it possible to add more templates without
adding more command-line options. (For backwards compatibility, all the
old command-line and config-file options are still accepted, and
rawdog.get_template(config) will still return the page template.)
Add templates for the feed list and each item in the feed list
(based on patch from Arnout Engelen).
Don't append an extra newline when showing a template.
- rawdog 2.14
When adding a new feed from a page that provides several feeds, make a
more informed choice rather than just taking the first one: many blogs
provide both content and comments feeds, and we usually want the first
one.
Add a note to PLUGINS about making sure plugin storage gets saved.
Use updated_parsed instead of the obsolete modified_parsed when
extracting the feed-provided date for an item, and fall back to
published_parsed and then created_parsed if it doesn't exist (reported
by Cristian Rigamonti, Martintxo and chrysn). feedparser currently does
fallback automatically, but it's scheduled to be removed at some point,
so it's better for rawdog to do it.
- rawdog 2.13
Forcibly disable BeautifulSoup support in feedparser, since it returns
unpickleable pseudo-string objects, and it crashes when trying to parse
twenty or so of my feeds (reported by Joseph Reagle).
Make the code that cleans up feedparser's return value more thorough --
in particular, turn subclasses of "unicode" into real unicode objects.
Decode the config file from the system encoding, and escape "define_"d
strings when they're written to the output file (reported by Cristian
Rigamonti).
Add the "showtracebacks" option, which causes exceptions that occur
while a feed is being fetched to be reported with a traceback in the
resulting error message.
Use PyTidyLib in preference to mx.Tidy when available (suggested by
Joseph Reagle). If neither is available, "tidyhtml true" just does
nothing, so it's now turned on in the provided config file. The
mxtidy_args hook is now called tidy_args.
Allow template variables to start with an underscore (patch from Oberon
Faelord).
Work around broken DOCTYPEs that confuse sgmllib.
If -v is specified, force verbose on again after reading a secondary
config file (reported by Jonathan Phillips).
Resynchronise the feed list after loading a secondary config file;
previously feeds in secondary config files were ignored (reported by
Jonathan Philips).
- rawdog 2.12
Make rawdog work with Python 2.6 (reported by Roy Lanek).
If feedfinder (which now needs Python 2.4 or later) can't be imported,
just disable it.
Several changes as a result of profiling that significantly speed up
writing output files:
- Make encode_references() use regexp replacement.
- Cache the result of locale.getpreferredencoding().
- Use tuple lists rather than custom comparisons when sorting.
Update feedparser to revision 291, which fixes the handling of
elements (reported by Darren Griffith).
Only update the stored Etag and Last-Modified when a feed changes.
Add the "splitstate" option, which makes rawdog use a separate state
file for each feed rather than one large one. This significantly reduces
rawdog's memory usage at the cost of some more disk IO during --write.
The old behaviour is still the default, but I would recommend turning
splitstate on if you read a lot of feeds, if you use a long expiry time,
or if you're on a machine with limited memory.
As a result of the splitstate work, the output_filter and output_sort
hooks have been removed (because there's no longer a complete list of
articles to work with). Instead, there's now an output_sort_articles
hook that works with a list of article summaries.
Add the "useids" option, which makes rawdog respect article GUIDs when
updating feeds; if an article's GUID matches one we already know about,
we just update the existing article's contents rather than treating it
as a new article (like most aggregators do). This is turned on in the
default configuration, since the behaviour it produces is generally more
useful these days -- many feeds include random advertisements, or other
dynamic content, and so the old approach resulted in lots of duplicated
articles.
- rawdog 2.11
Avoid a crash when a feed's URL is changed and expiry is done on the
same run.
Encode dates correctly in non-ASCII locales (reported by Damjan
Georgievski).
Strengthen the warning in PLUGINS about the effects of overriding
output_write_files (suggested by Virgil Bucoci).
Add the state directory to sys.path, so you can put modules that plugins
need in your ~/.rawdog (suggested by Stuart Langridge).
When adding a feed, check that it isn't already present in the config
file (suggested by Stuart Langridge).
Add --no-lock-wait option to make rawdog exit silently if it can't lock
the state file (i.e. if there's already a rawdog running).
Update to the latest feedparser, which fixes an encoding bug with Python
2.5, among various other stuff (reported by Paul Tomblin, Tim Bishop and
Joseph Reagle).
Handle the author_detail fields being None.
- rawdog 2.10
Work around a feedparser bug (returning a detail without a type field
for posts with embedded SVG).
Pull in most of the changes from feedparser 4.1.
Fix a bug that stopped rawdog from working properly when no locale
information was present in the environment, or on versions of Python
without locale.getpreferredencoding() (reported by Michael Watkins).
Add --remove option to remove a feed from the config file (suggested by
Wolfram Sieber).
Produce a more useful error message when $HOME isn't set (reported by
Wolfram Sieber).
Fix a bug in the expiry code: if you were using keepmin, it could expire
articles that were no longer current but should be kept.
Clean up the example config file a bit.
- rawdog 2.9
Fix a documentation bug about time formats (reported by Tim Bishop).
Fix a text-handling problem related to the locale changes (patch from
Samuel Hym).
Fix use of the "A-IM: feed" header in HTTP requests. A previous upstream
change to feedparser had modified it so that it always sent this header,
which results in a subtle rawdog bug: if a feed returns a partial result
(226) and then has no changes for a long time, rawdog can expire
articles which should still be "current" in the feed. This version adds
a "keepmin" option which make a minimum number of articles be kept for
each feed; this should avoid expiring articles that are still current.
If you want the old behaviour, you can set "keepmin" to 0, in which case
rawdog won't send the "A-IM: feed" header in its requests. rawdog also
won't send that header if "currentonly" is set to true, since in that
case the current set of articles is all rawdog cares about. (See
for Sam
Ruby's discussion of the same problem in Planet.)
If the author's name is given as the empty string, fall back to the
email address, URL or "author".
Change the labels in the feed information table to "Last fetched" and
"Next fetched after", to match what rawdog actually does with the times
it stores (reported by D. Stussy).
- rawdog 2.8
Fix authentication support -- feedparser now supports Basic and Digest
authentication internally, but it needed tweaking to make it useful for
rawdog (reported by Tim Bishop).
- rawdog 2.7
Make feedfinder smarter about trying to find the preferred type of feed
(patch from Decklin Foster).
Add a plugin hook to let you modify mx.Tidy options (suggested by Jon
Lasser).
Work correctly if the threading module isn't available (patch from Jack
Diederich).
Update to feedparser 4.0.2, which includes some of our patches and fixes
an unclear license notice (reported by Jason Diamond, Joe Wreschnig and
Decklin Foster).
Fix a feedparser bug that caused things preceding shorttags to be
duplicated when sanitising HTML.
Set the locale correctly when rawdog starts up (patch from Samuel Hym).
- rawdog 2.6
Allow maxage to be set per feed (patch from Craig Allen).
Support feeddefaults with no options on the same line, as used in the
sample config file (reported by asher).
- rawdog 2.5
Ensure that all the strings in entry_info are in Unicode form, to make
it easier for plugins to deal with them.
Fix a feedparser bug that was breaking feeds which includes itunes
elements (reported by James Cameron).
Make feedparser handle content types and modes in atom:content correctly
(reported by David Dorward).
Make feedparser handle the new elements in Atom 1.0 (patch from Decklin
Foster).
Remove some unnecessary imports found by pyflakes.
Add output_sorted_filter and output_write_files hooks, deprecating
the output_write hook (which wasn't very useful originally, and isn't
used by any of the plugins I've been sent). Restructure the "write" code
so that it should be far easier to write custom output plugins: there
are several new methods on Rawdog for doing different bits of the write
process.
When selecting articles to display, don't assume they're sorted in date
order (a plugin might have done something different).
Don't write an extra newline at the end of the output file (i.e. use
f.write rather than print >>f), and be more careful about encoding when
writing output to stdout.
Provide arbitrary persistent storage for plugins via a
get_plugin_storage method on Rawdog (suggested by BAM).
Add -N option to avoid locking the state file, which may be useful if
you're on an OS or filesystem that doesn't support locks (suggested by
Andy Dustman).
If RAWDOG_PROFILE is set as an environment variable, rawdog will run
under the Python profiler.
Make some minor performance improvements.
Change the "Error parsing feed" message to "Error fetching or parsing
feed", since it really just indicates an error somewhere within
feedparser (reported by Fred Barnes).
Add support for using multiple threads when fetching feeds, which makes
updates go much faster if you've got lots of feeds. (The state-updating
part of the update is still done sequentially, since parallelising it
would mean adding lots of locking and making the code very messy.) To
use this, set "numthreads" to be greater than 0 in your config file.
Since it changes the semantics of one of the plugin hooks, it's off by
default.
Update the GPL and LGPL headers to include the FSF's new address
(reported by Decklin Foster).
- rawdog 2.4
Provide guid in item templates (suggested by Rick van Rein).
Update article-added dates correctly when "currentonly true" is used
(reported by Rick van Rein).
Clarify description of -c in README and man page (reported by Rick van
Rein).
If you return false from an output_items_heading function, then disable
DayWriter (suggested by Ian Glover).
Fix description of article_seen in PLUGINS (reported by Steve Atwell).
Escape odd characters in links and guids, and add a sanity check that'll
trip if non-ASCII somehow makes it to the output (reported by
TheCrypto).
- rawdog 2.3
Make the id= parameter work correctly (patch from Jon Nelson).
- rawdog 2.2
Add "feeddefaults" statement to specify default feed options.
Update feeds list from the config file whenever rawdog runs, rather than
just when doing an update (reported by Decklin Foster).
Reload the config files after -a, so that "rawdog -a URL -u" has the
expected behaviour (reported by Decklin Foster).
Add "define" statement and "define_X" feed option to allow the user to
define extra strings for the template; you can use this, for example, to
select classes for groups of feeds, generate different HTML for
different sorts of feeds, or set the title in different pages generated
from the same template (suggested by Decklin Foster).
Fix a logic error in the _raw changes to feedparser: if a feed didn't
specify its encoding but contained non-ASCII characters, rawdog will
now try to parse it as UTF-8 (which it should be) and, failing that,
as ISO-8859-1 (in case it just contains non-UTF-8 junk).
Don't print the "state file may be corrupt" error if the user hits
Ctrl-C while rawdog's loading it.
Add support for extending rawdog with plugin modules; see the "PLUGINS"
file for more information.
Make "verbose true" work in the config file.
Provide __author__ in items, for use in feeds that support that (patch
from Decklin Foster).
Fix conditional template expansion (patch from Decklin Foster).
Add "blocklevelhtml" statement to disable the "" workaround for
non-block-level HTML; this may be useful if you have a plugin that is
doing different HTML sanitisation, or if your template already forces a
block-level element around article descriptions.
Fix -l for feeds with non-ASCII characters in their titles.
Provide human-readable __feed_id__ in items (patch from David
Durschlag), and add feed-whatevername class to the default item
template; this should make it somewhat easier to add per-feed styles.
Handle feeds that are local files correctly, and handle file: URLs in
feedparser (reported by Chris Niekel).
Allow feed arguments to be given on indented lines after the "feed" or
"feeddefaults" lines; this makes it possible to have spaces in feed
arguments.
Add a meta element to the default template to stop search engines
indexing rawdog pages (patch from Rick van Rein).
Add new feeds at the end of the config file rather than before the first
feed line (patch from Decklin Foster).
- rawdog 2.1
Fix a character encoding problem with format=text feeds.
Add proxyuser and proxypassword options for feeds, so that you can use
per-feed proxies requiring HTTP Basic authentication (patch from Jon
Nelson).
Add a manual page (written by Decklin Foster).
Remove extraneous #! line from feedparser.py (reported by Decklin
Foster).
Update an article's modified date when a new version of it is seen
(reported by Decklin Foster).
Support nested ifs in templates (patch from David Durschlag), and add
__else__.
Make the README file list all the options that rawdog now supports
(reported by David Durschlag).
Make --verbose work even if it's specified after an action (reported by
Dan Noe and David Durschlag).
- rawdog 2.0
Update to feedparser 3.3. This meant reworking some of rawdog's
internals; state files from old versions will no longer work with rawdog
2.0 (and external programs that manipulate rawdog state files will also
be broken). The new feedparser provides a much nicer API, and is
significantly more robust; several feeds that previously caused
feedparser internal errors or Python segfaults now work fine.
Add an --upgrade option to import state from rawdog 1.x state files into
rawdog 2.x. To upgrade from 1.x to 2.x, you'll need to perform the
following steps after installing the new rawdog:
- cp -R ~/.rawdog ~/.rawdog-old
- rm ~/.rawdog/state
- rawdog -u
- rawdog --upgrade ~/.rawdog-old ~/.rawdog (to copy the state)
- rawdog -w
- rm -r ~/.rawdog-old (once you're happy with the new version)
Keep track of a version number in the state file, and complain if you
use a state file from an incompatible version.
Remove support for the old option syntax ("rawdog update write").
Remove workarounds for early 1.x state file versions.
Save the state file in the binary pickle format, and use cPickle instead
of pickle so it can be read and written more rapidly.
Add hideduplicates and allowduplicates options to attempt to hide
duplicate articles (based on patch from Grant Edwards).
Fix a bug when sorting feeds with no titles (found by Joseph Reagle).
Write the updated state file more safely, to reduce the chance that
it'll be damaged or truncated if something goes wrong while it's being
written (requested by Tim Bishop).
Include feedfinder, and add a -a|--add option to add a feed to the
config file.
Correctly handle dates with timezones specified in non-UTC locales
(reported by Paul Tomblin and Jon Lasser).
When a feed's URL changes, as indicated by a permanent HTTP redirect,
automatically update the config file and state.
- rawdog 1.13
Handle OverflowError with parsed dates (patch from Matthew Scott).
- rawdog 1.12
Add "sortbyfeeddate" option for planet pages (requested by David
Dorward).
Add "currentonly" option (patch from Chris Cutler).
Handle nested CDATA blocks in feed XML and HTML correctly in feedparser.
- rawdog 1.11
Add __num_items__ and __num_feeds__ to the page template, and __url__ to
the item template (patch from Chris Cutler).
Add "daysections" and "timesections" options to control whether to split
items up by day and time (based on patch from Chris Cutler).
Add "tidyhtml" option to use mx.Tidy to clean feed-provided HTML.
Remove the
wrapping __description__ from the default item template,
and make rawdog add
...
around the description only if it doesn't
start with a block-level element (which isn't perfect, but covers the
majority of problem cases). If you have a custom item template and want
rawdog to generate a better approximation to valid HTML, you should
change "__description__
" to "__description__".
HTML metacharacters in links are now encoded correctly in generated
HTML ("foo?a=b&c=d" as "foo?a=b&c=d").
Content type selection is now performed for all elements returned from
the feed, since some Blogger v5 feeds cause feedparser to return
multiple versions of the title and link (reported by Eric Cronin).
- rawdog 1.10
Add "ignoretimeouts" option to silently ignore timeout errors.
Fix SSL and socket timeouts on Python 2.3 (reported by Tim Bishop).
Fix entity encoding problem with HTML sanitisation that was causing
rawdog to throw an exception upon writing with feeds containing
non-US-ASCII characters in attribute values (reported by David Dorward,
Dmitry Mark and Steve Pomeroy).
Include MANIFEST.in in the distribution (reported by Chris Cutler).
- rawdog 1.9
Add "clear: both;" to item, time and date styles, so that items with
floated images in don't extend into the items below them.
Changed how rawdog selects the feeds to update; --verbose now shows
only the feeds being updated.
rawdog now uses feedparser 2.7.6, which adds date parsing and limited
sanitisation of feed-provided HTML; I've removed rawdog's own
date-parsing (including iso8601.py) and relative-link-fixing code in
favour of the more-capable feedparser equivalents.
The persister module in rawdoglib is now licensed under the LGPL
(requested by Giles Radford).
Made the error messages that listed the state dir reflect the -b
setting (patch from Antonin Kral).
Treat empty titles, links or descriptions as if they weren't supplied at
all, to cope with broken feeds that specify " " (patch
from Michael Leuchtenburg).
Make the expiry age configurable; previously it was hard-wired to 24
hours. Setting this to a larger value is useful if you want to have a
page covering more than a day's feeds.
Time specifications in the config file can now include a unit; if no
unit is specified it'll default to minutes or seconds as appropriate to
maintain compatibility with old config files. Boolean values can now be
specified as "true" or "false" (or "1" or "0" for backwards
compatibility). rawdog now gives useful errors rather than Python
exceptions for bad values. (Based on suggestions by Tero Karvinen.)
Added datetimeformat option so that you can display feed and article
times differently from the day and time headings, and added some
examples including ISO 8601 format to the config file (patch from Tero
Karvinen).
Forcing a feed to be updated with -f now clears its ETag and
Last-Modified, so it should always be refetched from the server.
Short-form XML tags in RSS ( ) are now handled correctly.
Numeric entities in RSS encoded content are now handled correctly.
- rawdog 1.8
Add format=text feed option to handle broken feeds that make their
descriptions unescaped text.
Add __hash__ and unlinked titles to item templates, so that you can use
multiple config files to build a summary list of item titles (for use in
the Mozilla sidebar, for instance). (Requested by David Dorward.)
Add the --verbose argument (and the "verbose" option to match); this
makes rawdog show what it's doing while it's running.
Add an "include" statement in config files that can be used to include
another config file.
Add feed options to select proxies (contributed by Neil Padgen). This is
straightforward for Python 2.3, but 2.2's urllib2 has a bug which
prevents ProxyHandlers from working; I've added a workaround for now.
- rawdog 1.7
Fix code in iso8601.py that caused a warning with Python 2.3.
- rawdog 1.6
Config file lines are now split on arbitary strings of whitespace, not
just single spaces (reported by Joseph Reagle).
Include a link to the rawdog home page in the default template.
Fix the --dir argument: -d worked fine, but the getopt call was missing
an "=" (reported by Gregory Margo).
Relative links (href and src attributes) in feed-provided HTML are now
made absolute in the output. (The feed validator will complain about
feeds with relative links in, but there are quite a few out there.)
Item templates are now supported, making it easier to customise item
appearance (requested by a number of users, including Giles Radford and
David Dorward). In particular, note that __feed_hash__ can be used
to apply a CSS style to a particular feed.
Simple conditions are supported in templates: __if_x__ .. __endif__ only
expands to its contents if x is not empty. These conditions cannot be
nested.
PyXML's iso8601 module is now included so that rawdog can parse dates in
feeds.
- rawdog 1.5
Remove some debugging code that broke timeouts.
- rawdog 1.4
Fix option-compatibility code (reported by BAM).
Add HTTP basic authentication support (which means modifying feedparser
again).
Print a more useful error if the statefile can't be read.
- rawdog 1.3
Reverted the "retry immediately" behaviour from 1.2, since it causes
denied or broken feeds to get checked every time rawdog is run.
Updated feedparser to 2.5.3, which now returns the XML encoding used.
rawdog uses this information to convert all incoming items into Unicode,
so multiple encodings are now handled correctly. Non-ASCII characters
are encoded using HTML numeric character references (since this allows
me to leave the HTML charset as ISO-8859-1; it's non-trivial to get
Apache to serve arbitrary HTML files with the right Content-Type,
and using won't override HTTP
headers).
Use standard option syntax (i.e. "--update --write" instead of "update
write"). The old syntax will be supported until 2.0.
Error output from reading the config file and from --update now goes to
stderr instead of stdout.
Made the socket timeout configurable (which also means the included copy
of feedparser isn't modified any more).
Added --config option to read an additional config file; this lets you
have multiple output files with different options.
Allow "outputfile -" to write the output to stdout; useful if you want
to have cron mail the output to you rather than putting it on a web
page.
Added --show-template option to show the template currently in use (so
you can customise it yourself), and "template" config option to allow
the user to specify their own template.
Added --dir option for people who want two lots of rawdog state (for two
sets of feeds, for instance).
Added "maxage" config option for people who want "only items added in
the last hour", and made it possible to disable maxarticles by setting
it to 0.
- rawdog 1.2
Updated feedparser to 2.5.2, which fixes a bug that was making rawdog
handle content incorrectly in Echo feeds, handles more content encoding
methods, and returns HTTP status codes. (I've applied a small patch to
correct handling of some Echo feeds.)
Added useful messages for different HTTP status codes and HTTP timeouts.
Since rawdog reads a config file, it can't automatically update
redirected feeds, but it will now tell you about them. Note that for
"fatal" errors (anything except a 2xx response or a redirect), rawdog
will now retry the feed next time it's run.
Prefer "content" over "content_encoded", and fall back correctly if no
useful "content" is found.
- rawdog 1.1
rawdog now preserves the ordering of articles in the RSS when a group of
articles are added at the same time.
Updated rawdog URL in setup.py, since it now has a web page.
Updated rssparser to feedparser 2.4, and added very preliminary support
for the "content" element it can return (for Echo feeds).
- rawdog 1.0
Initial stable release.
rawdog-2.19/rawdog.1 0000644 0004715 0004715 00000012022 12173317060 013702 0 ustar ats ats 0000000 0000000 .TH RAWDOG 1
.SH NAME
rawdog \- an RSS Aggregator Without Delusions Of Grandeur
.SH SYNOPSIS
.B rawdog
.RI [ options ]
.SH DESCRIPTION
\fBrawdog\fP is a feed aggregator for Unix-like systems.
.PP
\fBrawdog\fP uses the Python \fBfeedparser\fP module to retrieve
articles from a number of feeds in RSS, Atom and other formats, and
writes out a single HTML file, based on a template either provided by
the user or generated by \fBrawdog\fP, containing the latest articles
it's seen.
.PP
\fBrawdog\fP uses the ETags and Last-Modified headers to avoid fetching
a file that hasn't changed, and supports gzip and delta compression to
reduce bandwidth when it has.
\fBrawdog\fP is configured from a simple text file; the only state kept
between invocations that can't be reconstructed from the feeds is the
ordering of articles.
.SH OPTIONS
This program follows the usual GNU command line syntax, with long
options starting with two dashes (`\-').
.SS General Options
.TP
\fB\-d\fP \fIDIR\fP, \fB\-\-dir\fP \fIDIR\fP
Use \fIDIR\fP instead of the $HOME/.rawdog directory.
This option lets you have two or more \fBrawdog\fP setups with different
configurations and sets of feeds.
.TP
\fB\-N\fP, \fB\-\-no\-locking\fP
Do not lock the state file.
.IP ""
\fBrawdog\fP usually claims a lock on its state file, to stop more than
one instance from running at the same time.
Unfortunately, some filesystems don't support file locking; you can use
this option to disable locking entirely if you're in that situation.
.TP
\fB\-v\fP, \fB\-\-verbose\fP
Print more detailed information about what \fBrawdog\fP is doing to stderr
while it runs.
.TP
\fB\-V\fP \fIFILE\fP, \fB\-\-log\fP \fIFILE\fP
As with \fB\-V\fP, but write the information to \fIFILE\fP.
.TP
\fB\-W\fP, \fB\-\-no\-lock\-wait\fP
Exit silently if the state file is already locked.
.IP ""
If the state file is already locked, \fBrawdog\fP will normally wait
until it becomes available, then run.
However, if you're got a lot of feeds and a slow network connection, you
might prefer \fBrawdog\fP to just give up immediately if the previous
instance is still running.
.SS Actions
\fBrawdog\fP will perform these actions in the order given.
.TP
\fB\-a\fP \fIURL\fP, \fB\-\-add\fP \fIURL\fP
Try to find a feed associated with \fIURL\fP and add it to the config
file.
.IP ""
\fIURL\fP may be a feed itself, or it can be an HTML page that links to
a feed in any of a variety of ways.
\fBrawdog\fP uses heuristics to pick the best feed it can find, and will
complain if it can't find one.
.TP
\fB\-c\fP \fIFILE\fP, \fB\-\-config\fP \fIFILE\fP
Read \fIFILE\fP as an additional config file; any options provided in
\fIFILE\fP will override those set in the main config file (with the
exception of "feed", which is cumulative).
\fIFILE\fP may be an absolute path or a path relative to your .rawdog
directory.
.IP ""
Note that $HOME/.rawdog/config will still be read first even if you
specify this option.
\fB\-c\fP is mostly useful when you want to write the same set of feeds
out using two different sets of output options.
.TP
\fB\-f\fP \fIURL\fP, \fB\-\-update\-feed\fP \fIURL\fP
Update the feed pointed to by \fIURL\fP immediately, even if its period
hasn't elapsed since it was last updated.
This is useful when you're publishing a feed yourself, and want to test
whether it's working properly.
.TP
\fB\-l\fP, \fB\-\-list\fP
List brief information about each of the feeds that was known about at
the time of the last update.
.TP
\fB\-r\fP \fIURL\fP, \fB\-\-remove\fP \fIURL\fP
Remove feed \fIURL\fP from the config file.
.TP
\fB\-s\fP \fITEMPLATE\fP, \fB\-\-show\fP \fITEMPLATE\fP
Print one of the templates currently in use to stdout.
\fBTEMPLATE\fP may be \fBpage\fP, \fBitem\fP, \fBfeedlist\fP or
\fBfeeditem\fP.
This can be used as a starting point if you want to design your own
template for use with the corresponding \fBtemplate\fP option in the
config file.
.TP
\fB\-u\fP, \fB\-\-update\fP
Fetch data from the feeds and store it.
This could take some time if you've got lots of feeds.
.TP
\fB\-w\fP, \fB\-\-write\fP
Write out the HTML output file.
.SS Special Actions
If one of these options is specified, \fBrawdog\fP will perform only
that action, then exit.
.TP
\fB\-\-dump\fP \fIURL\fP
Show what \fBrawdog\fP's feed parser returns for \fIURL\fP.
This can be useful when trying to understand why \fBrawdog\fP doesn't
display a feed correctly.
.TP
\fB\-\-help\fP
Provide a brief summary of all the options \fBrawdog\fP supports.
.SH EXAMPLES
\fBrawdog\fP is typically invoked from
.BR cron (1).
The following
.BR crontab (5)
entry would fetch data from feeds and write it to HTML once an hour,
exiting if \fBrawdog\fP is already running:
.PP
.nf
.RS
0 * * * * rawdog \-Wuw
.RE
.fi
.SH FILES
$HOME/.rawdog/config
.SH SEE ALSO
.BR cron (1).
.SH AUTHOR
\fBrawdog\fP was mostly written by Adam Sampson , with
contributions and bug reports from many of \fBrawdog\fP's users.
See \fBrawdog\fP's NEWS file for a complete list of contributors.
.PP
This manual page was originally written by Decklin Foster
, for the Debian project (but may be used by
others).
rawdog-2.19/rawdoglib/ 0000755 0004715 0004715 00000000000 12273447040 014315 5 ustar ats ats 0000000 0000000 rawdog-2.19/rawdoglib/__init__.py 0000644 0004715 0004715 00000000104 12171062651 016417 0 ustar ats ats 0000000 0000000 __all__ = [
'feedscanner',
'persister',
'rawdog',
]
rawdog-2.19/rawdoglib/feedscanner.py 0000644 0004715 0004715 00000010436 12177454041 017152 0 ustar ats ats 0000000 0000000 """Scan a URL's contents to find related feeds
This is a compatible replacement for Aaron Swartz's feedfinder module,
using feedparser to check whether the URLs it returns are feeds.
It finds links to feeds within the following elements:
- (standard feed discovery)
- , if the href contains words that suggest it might be a feed
It orders feeds using a quality heuristic: the first result is the most
likely to be a feed for the given URL.
Required: Python 2.4 or later, feedparser
"""
__license__ = """
Copyright (c) 2008 Decklin Foster
Copyright (c) 2013 Adam Sampson
Permission to use, copy, modify, and/or distribute this software for
any purpose with or without fee is hereby granted, provided that
the above copyright notice and this permission notice appear in all
copies.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL
WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE
AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL
DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA
OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER
TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.
"""
import cStringIO
import feedparser
import gzip
import re
import urllib2
import urlparse
import HTMLParser
def is_feed(url):
"""Return true if feedparser can understand the given URL as a feed."""
p = feedparser.parse(url)
version = p.get("version")
if version is None:
version = ""
return (version != "")
def fetch_url(url):
"""Fetch the given URL and return the data from it as a Unicode string."""
request = urllib2.Request(url)
request.add_header("Accept-Encoding", "gzip")
f = urllib2.urlopen(request)
headers = f.info()
data = f.read()
f.close()
# We have to support gzip encoding because some servers will use it
# even if you explicitly refuse it in Accept-Encoding.
encodings = headers.get("Content-Encoding", "")
encodings = [s.strip() for s in encodings.split(",")]
if "gzip" in encodings:
f = gzip.GzipFile(fileobj=cStringIO.StringIO(data))
data = f.read()
f.close()
# Silently ignore encoding errors -- we don't need to go to the bother of
# detecting the encoding properly (like feedparser does).
data = data.decode("UTF-8", "ignore")
return data
class FeedFinder(HTMLParser.HTMLParser):
def __init__(self, base_uri):
HTMLParser.HTMLParser.__init__(self)
self.found = []
self.count = 0
self.base_uri = base_uri
def add(self, score, href):
url = urlparse.urljoin(self.base_uri, href)
lower = url.lower()
# Some sites provide feeds both for entries and comments;
# prefer the former.
if lower.find("comment") != -1:
score -= 50
# Prefer Atom, then RSS, then RDF (RSS 1).
if lower.find("atom") != -1:
score += 10
elif lower.find("rss2") != -1:
score -= 5
elif lower.find("rss") != -1:
score -= 10
elif lower.find("rdf") != -1:
score -= 15
self.found.append((-score, self.count, url))
self.count += 1
def urls(self):
return [link[2] for link in sorted(self.found)]
def handle_starttag(self, tag, attrs):
attrs = dict(attrs)
href = attrs.get('href')
if href is None:
return
if tag == 'link' and attrs.get('rel') == 'alternate' and \
not attrs.get('type') == 'text/html':
self.add(200, href)
if tag == 'a' and re.search(r'\b(rss|atom|rdf|feeds?)\b', href, re.I):
self.add(100, href)
def feeds(page_url):
"""Search the given URL for possible feeds, returning a list of them."""
# If the URL is a feed, there's no need to scan it for links.
if is_feed(page_url):
return [page_url]
data = fetch_url(page_url)
parser = FeedFinder(page_url)
try:
parser.feed(data)
except HTMLParser.HTMLParseError:
pass
found = parser.urls()
# Return only feeds that feedparser can understand.
return [feed for feed in found if is_feed(feed)]
rawdog-2.19/rawdoglib/persister.py 0000644 0004715 0004715 00000012051 12267755143 016717 0 ustar ats ats 0000000 0000000 # persister: persist Python objects safely to pickle files
# Copyright 2003, 2004, 2005, 2013, 2014 Adam Sampson
#
# rawdog is free software; you can redistribute and/or modify it
# under the terms of that license as published by the Free Software
# Foundation; either version 2 of the License, or (at your option)
# any later version.
#
# rawdog is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with rawdog; see the file COPYING. If not, write to the Free
# Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA, or see http://www.gnu.org/.
import cPickle as pickle
import errno
import fcntl
import os
import sys
class Persistable:
"""An object which can be persisted."""
def __init__(self):
self._modified = False
def modified(self, state=True):
"""Mark the object as having been modified (or not)."""
self._modified = state
def is_modified(self):
return self._modified
class Persisted:
"""Context manager for a persistent object. The object being persisted
must implement the Persistable interface."""
def __init__(self, klass, filename, persister):
self.klass = klass
self.filename = filename
self.persister = persister
self.lock_file = None
self.object = None
self.refcount = 0
def rename(self, new_filename):
"""Rename the persisted file. This works whether the file is
currently open or not."""
self.persister._rename(self.filename, new_filename)
for ext in ("", ".lock"):
try:
os.rename(self.filename + ext,
new_filename + ext)
except OSError, e:
# If the file doesn't exist (yet),
# that's OK.
if e.errno != errno.ENOENT:
raise e
self.filename = new_filename
def __enter__(self):
"""As open()."""
return self.open()
def __exit__(self, type, value, tb):
"""As close(), unless an exception occurred in which case do
nothing."""
if tb is None:
self.close()
def open(self, no_block=True):
"""Return the persistent object, loading it from its file if it
isn't already open. You must call close() once you're finished
with the object.
If no_block is True, then this will return None if loading the
object would otherwise block (i.e. if it's locked by another
process)."""
if self.refcount > 0:
# Already loaded.
self.refcount += 1
return self.object
try:
self._open(no_block)
except KeyboardInterrupt:
sys.exit(1)
except:
print "An error occurred while reading state from " + os.path.abspath(self.filename) + "."
print "This usually means the file is corrupt, and removing it will fix the problem."
sys.exit(1)
self.refcount = 1
return self.object
def _get_lock(self, no_block):
if not self.persister.use_locking:
return True
self.lock_file = open(self.filename + ".lock", "w+")
try:
mode = fcntl.LOCK_EX
if no_block:
mode |= fcntl.LOCK_NB
fcntl.lockf(self.lock_file.fileno(), mode)
except IOError, e:
if no_block and e.errno in (errno.EACCES, errno.EAGAIN):
return False
raise e
return True
def _open(self, no_block):
self.persister.log("Loading state file: ", self.filename)
if not self._get_lock(no_block):
return None
try:
f = open(self.filename, "rb")
except IOError:
# File can't be opened.
# Create a new object.
self.object = self.klass()
self.object.modified()
return
self.object = pickle.load(f)
self.object.modified(False)
f.close()
def close(self):
"""Reduce the reference count of the persisted object, saving
it back to its file if necessary."""
self.refcount -= 1
if self.refcount > 0:
# Still in use.
return
if self.object.is_modified():
self.persister.log("Saving state file: ", self.filename)
newname = "%s.new-%d" % (self.filename, os.getpid())
newfile = open(newname, "w")
pickle.dump(self.object, newfile, pickle.HIGHEST_PROTOCOL)
newfile.close()
os.rename(newname, self.filename)
if self.lock_file is not None:
self.lock_file.close()
self.persister._remove(self.filename)
class Persister:
"""Manage the collection of persisted files."""
def __init__(self, config):
self.files = {}
self.log = config.log
self.use_locking = config.locking
def get(self, klass, filename):
"""Get a context manager for a persisted file.
If the file is already open, this will return
the existing context manager."""
if filename in self.files:
return self.files[filename]
p = Persisted(klass, filename, self)
self.files[filename] = p
return p
def _rename(self, old_filename, new_filename):
self.files[new_filename] = self.files[old_filename]
del self.files[old_filename]
def _remove(self, filename):
del self.files[filename]
def delete(self, filename):
"""Delete a persisted file, along with its lock file,
if they exist."""
for ext in ("", ".lock"):
try:
os.unlink(filename + ext)
except OSError:
pass
rawdog-2.19/rawdoglib/rawdog.py 0000644 0004715 0004715 00000160510 12273445026 016157 0 ustar ats ats 0000000 0000000 # rawdog: RSS aggregator without delusions of grandeur.
# Copyright 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2012, 2013, 2014 Adam Sampson
#
# rawdog is free software; you can redistribute and/or modify it
# under the terms of that license as published by the Free Software
# Foundation; either version 2 of the License, or (at your option)
# any later version.
#
# rawdog is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with rawdog; see the file COPYING. If not, write to the Free
# Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA, or see http://www.gnu.org/.
VERSION = "2.19"
HTTP_AGENT = "rawdog/" + VERSION
STATE_VERSION = 2
import feedparser, plugins
from persister import Persistable, Persister
import os, time, getopt, sys, re, cgi, socket, urllib2, calendar
import string, locale
from StringIO import StringIO
import types
import threading
import hashlib
import base64
import feedscanner
try:
import tidylib
except:
tidylib = None
try:
import mx.Tidy as mxtidy
except:
mxtidy = None
# Turn off content-cleaning, since we want to see an approximation to the
# original content for hashing. rawdog will sanitise HTML when writing.
feedparser.RESOLVE_RELATIVE_URIS = 0
feedparser.SANITIZE_HTML = 0
# Disable microformat support, because it tends to return poor-quality data
# (e.g. identifying inappropriate things as enclosures), and it relies on
# BeautifulSoup which is unable to parse many feeds.
feedparser.PARSE_MICROFORMATS = 0
# This is initialised in main().
persister = None
system_encoding = None
def get_system_encoding():
"""Get the system encoding."""
return system_encoding
def safe_ftime(format, t):
"""Format a time value into a string in the current locale (as
time.strftime), but encode the result as ASCII HTML."""
u = unicode(time.strftime(format, t), get_system_encoding())
return encode_references(u)
def format_time(secs, config):
"""Format a time and date nicely."""
t = time.localtime(secs)
format = config["datetimeformat"]
if format is None:
format = config["timeformat"] + ", " + config["dayformat"]
return safe_ftime(format, t)
high_char_re = re.compile(r'[^\000-\177]')
def encode_references(s):
"""Encode characters in a Unicode string using HTML references."""
def encode(m):
return "" + str(ord(m.group(0))) + ";"
return high_char_re.sub(encode, s)
# This list of block-level elements came from the HTML 4.01 specification.
block_level_re = re.compile(r'^\s*<(p|h1|h2|h3|h4|h5|h6|ul|ol|pre|dl|div|noscript|blockquote|form|hr|table|fieldset|address)[^a-z]', re.I)
def sanitise_html(html, baseurl, inline, config):
"""Attempt to turn arbitrary feed-provided HTML into something
suitable for safe inclusion into the rawdog output. The inline
parameter says whether to expect a fragment of inline text, or a
sequence of block-level elements."""
if html is None:
return None
html = encode_references(html)
type = "text/html"
# sgmllib handles "
/" as a SHORTTAG; this workaround from
# feedparser.
html = re.sub(r'(\S)/>', r'\1 />', html)
# sgmllib is fragile with broken processing instructions (e.g.
# ""); just remove them all.
html = re.sub(r']*>', '', html)
html = feedparser._resolveRelativeURIs(html, baseurl, "UTF-8", type)
p = feedparser._HTMLSanitizer("UTF-8", type)
p.feed(html)
html = p.output()
if not inline and config["blocklevelhtml"]:
# If we're after some block-level HTML and the HTML doesn't
# start with a block-level element, then insert a tag
# before it. This still fails when the HTML contains text, then
# a block-level element, then more text, but it's better than
# nothing.
if block_level_re.match(html) is None:
html = "
" + html
if config["tidyhtml"]:
args = {"numeric_entities": 1,
"output_html": 1,
"output_xhtml": 0,
"output_xml": 0,
"wrap": 0}
plugins.call_hook("mxtidy_args", config, args, baseurl, inline)
plugins.call_hook("tidy_args", config, args, baseurl, inline)
if tidylib is not None:
# Disable PyTidyLib's somewhat unhelpful defaults.
tidylib.BASE_OPTIONS = {}
output = tidylib.tidy_document(html, args)[0]
elif mxtidy is not None:
output = mxtidy.tidy(html, None, None, **args)[2]
else:
# No Tidy bindings installed -- do nothing.
output = "
" + html + ""
html = output[output.find("") + 6
: output.rfind("")].strip()
html = html.decode("UTF-8")
box = plugins.Box(html)
plugins.call_hook("clean_html", config, box, baseurl, inline)
return box.value
def select_detail(details):
"""Pick the preferred type of detail from a list of details. (If the
argument isn't a list, treat it as a list of one.)"""
TYPES = {"text/html": 30,
"application/xhtml+xml": 20,
"text/plain": 10}
if details is None:
return None
if type(details) is not list:
details = [details]
ds = []
for detail in details:
ctype = detail.get("type", None)
if ctype is None:
continue
if TYPES.has_key(ctype):
score = TYPES[ctype]
else:
score = 0
if detail["value"] != "":
ds.append((score, detail))
ds.sort()
if len(ds) == 0:
return None
else:
return ds[-1][1]
def detail_to_html(details, inline, config, force_preformatted=False):
"""Convert a detail hash or list of detail hashes as returned by
feedparser into HTML."""
detail = select_detail(details)
if detail is None:
return None
if force_preformatted:
html = "" + cgi.escape(detail["value"]) + "
"
elif detail["type"] == "text/plain":
html = cgi.escape(detail["value"])
else:
html = detail["value"]
return sanitise_html(html, detail["base"], inline, config)
def author_to_html(entry, feedurl, config):
"""Convert feedparser author information to HTML."""
author_detail = entry.get("author_detail")
if author_detail is not None and author_detail.has_key("name"):
name = author_detail["name"]
else:
name = entry.get("author")
url = None
fallback = "author"
if author_detail is not None:
if author_detail.has_key("href"):
url = author_detail["href"]
elif author_detail.has_key("email") and author_detail["email"] is not None:
url = "mailto:" + author_detail["email"]
if author_detail.has_key("email") and author_detail["email"] is not None:
fallback = author_detail["email"]
elif author_detail.has_key("href") and author_detail["href"] is not None:
fallback = author_detail["href"]
if name == "":
name = fallback
if url is None:
html = name
else:
html = "" + cgi.escape(name) + ""
# We shouldn't need a base URL here anyway.
return sanitise_html(html, feedurl, True, config)
def string_to_html(s, config):
"""Convert a string to HTML."""
return sanitise_html(cgi.escape(s), "", True, config)
template_re = re.compile(r'(__[^_].*?__)')
def fill_template(template, bits):
"""Expand a template, replacing __x__ with bits["x"], and only
including sections bracketed by __if_x__ .. [__else__ ..]
__endif__ if bits["x"] is not "". If not bits.has_key("x"),
__x__ expands to ""."""
result = plugins.Box()
plugins.call_hook("fill_template", template, bits, result)
if result.value is not None:
return result.value
encoding = get_system_encoding()
f = StringIO()
if_stack = []
def write(s):
if not False in if_stack:
f.write(s)
for part in template_re.split(template):
if part.startswith("__") and part.endswith("__"):
key = part[2:-2]
if key.startswith("if_"):
k = key[3:]
if_stack.append(bits.has_key(k) and bits[k] != "")
elif key == "endif":
if if_stack != []:
if_stack.pop()
elif key == "else":
if if_stack != []:
if_stack.append(not if_stack.pop())
elif bits.has_key(key):
if type(bits[key]) == types.UnicodeType:
write(bits[key].encode(encoding))
else:
write(bits[key])
else:
write(part)
v = f.getvalue()
f.close()
return v
file_cache = {}
def load_file(name):
"""Read the contents of a template file, caching the result so we don't
have to read the file multiple times. The file is assumed to be in the
system encoding; the result will be an ASCII string."""
if not file_cache.has_key(name):
try:
f = open(name)
data = f.read()
f.close()
except IOError:
raise ConfigError("Can't read template file: " + name)
try:
data = data.decode(get_system_encoding())
except UnicodeDecodeError, e:
raise ConfigError("Character encoding problem in template file: " + name + ": " + str(e))
data = encode_references(data)
file_cache[name] = data.encode(get_system_encoding())
return file_cache[name]
def write_ascii(f, s, config):
"""Write the string s, which should only contain ASCII characters, to
file f; if it isn't encodable in ASCII, then print a warning message
and write UTF-8."""
try:
f.write(s)
except UnicodeEncodeError, e:
config.bug("Error encoding output as ASCII; UTF-8 has been written instead.\n", e)
f.write(s.encode("UTF-8"))
def short_hash(s):
"""Return a human-manipulatable 'short hash' of a string."""
return hashlib.sha1(s).hexdigest()[-8:]
def ensure_unicode(value, encoding):
"""Convert a structure returned by feedparser into an equivalent where
all strings are represented as fully-decoded unicode objects."""
if isinstance(value, str):
try:
return value.decode(encoding)
except:
# If the encoding's invalid, at least preserve
# the byte stream.
return value.decode("ISO-8859-1")
elif isinstance(value, unicode) and type(value) is not unicode:
# This is a subclass of unicode (e.g. BeautifulSoup's
# NavigableString, which is unpickleable in some versions of
# the library), so force it to be a real unicode object.
return unicode(value)
elif isinstance(value, dict):
d = {}
for (k, v) in value.items():
d[k] = ensure_unicode(v, encoding)
return d
elif isinstance(value, list):
return [ensure_unicode(v, encoding) for v in value]
else:
return value
class BasicAuthProcessor(urllib2.BaseHandler):
"""urllib2 handler that does HTTP basic authentication
or proxy authentication with a fixed username and password.
(Unlike the classes to do this in urllib2, this doesn't wait
for a 401/407 response first.)"""
def __init__(self, user, password, proxy=False):
self.auth = base64.b64encode(user + ":" + password)
if proxy:
self.header = "Proxy-Authorization"
else:
self.header = "Authorization"
def http_request(self, req):
req.add_header(self.header, "Basic " + self.auth)
return req
https_request = http_request
class DisableIMProcessor(urllib2.BaseHandler):
"""urllib2 handler that disables RFC 3229 for a request."""
def http_request(self, req):
# Request doesn't provide a method for removing headers --
# so overwrite the header instead.
req.add_header("A-IM", "identity")
return req
https_request = http_request
class ResponseLogProcessor(urllib2.BaseHandler):
"""urllib2 handler that maintains a log of HTTP responses."""
# Run after anything that's mangling headers (usually 500 or less), but
# before HTTPErrorProcessor (1000).
handler_order = 900
def __init__(self):
self.log = []
def http_response(self, req, response):
entry = {
"url": req.get_full_url(),
"status": response.getcode(),
}
location = response.info().get("Location")
if location is not None:
entry["location"] = location
self.log.append(entry)
return response
https_response = http_response
def get_log(self):
return self.log
non_alphanumeric_re = re.compile(r'<[^>]*>|\&[^\;]*\;|[^a-z0-9]')
class Feed:
"""An RSS feed."""
def __init__(self, url):
self.url = url
self.period = 30 * 60
self.args = {}
self.etag = None
self.modified = None
self.last_update = 0
self.feed_info = {}
def needs_update(self, now):
"""Return True if it's time to update this feed, or False if
its update period has not yet elapsed."""
return ((now - self.last_update) >= self.period)
def get_state_filename(self):
return "feeds/%s.state" % (short_hash(self.url),)
def fetch(self, rawdog, config):
"""Fetch the current set of articles from the feed."""
handlers = []
logger = ResponseLogProcessor()
handlers.append(logger)
proxies = {}
for name, value in self.args.items():
if name.endswith("_proxy"):
proxies[name[:-6]] = value
if len(proxies) != 0:
handlers.append(urllib2.ProxyHandler(proxies))
if self.args.has_key("proxyuser") and self.args.has_key("proxypassword"):
handlers.append(BasicAuthProcessor(self.args["proxyuser"], self.args["proxypassword"], proxy=True))
if self.args.has_key("user") and self.args.has_key("password"):
handlers.append(BasicAuthProcessor(self.args["user"], self.args["password"]))
if self.get_keepmin(config) == 0 or config["currentonly"]:
# If RFC 3229 and "A-IM: feed" is used, then there's
# no way to tell when an article has been removed.
# So if we only want to keep articles that are still
# being published by the feed, we have to turn it off.
handlers.append(DisableIMProcessor())
plugins.call_hook("add_urllib2_handlers", rawdog, config, self, handlers)
url = self.url
# Turn plain filenames into file: URLs. (feedparser will open
# plain filenames itself, but we want it to open the file with
# urllib2 so we get a URLError if something goes wrong.)
if not ":" in url:
url = "file:" + url
try:
result = feedparser.parse(url,
etag=self.etag,
modified=self.modified,
agent=HTTP_AGENT,
handlers=handlers)
except Exception, e:
result = {
"rawdog_exception": e,
"rawdog_traceback": sys.exc_info()[2],
}
result["rawdog_responses"] = logger.get_log()
return result
def update(self, rawdog, now, config, articles, p):
"""Add new articles from a feed to the collection.
Returns True if any articles were read, False otherwise."""
# Note that feedparser might have thrown an exception --
# so until we print the error message and return, we
# can't assume that p contains any particular field.
responses = p.get("rawdog_responses")
if len(responses) > 0:
last_status = responses[-1]["status"]
elif len(p.get("feed", [])) != 0:
# Some protocol other than HTTP -- assume it's OK,
# since we got some content.
last_status = 200
else:
# Timeout, or empty response from non-HTTP.
last_status = 0
version = p.get("version")
if version is None:
version = ""
self.last_update = now
errors = []
fatal = False
old_url = self.url
if "rawdog_exception" in p:
errors.append("Error fetching or parsing feed:")
errors.append(str(p["rawdog_exception"]))
if config["showtracebacks"]:
from traceback import format_tb
errors.append("".join(format_tb(p["rawdog_traceback"])))
errors.append("")
fatal = True
if len(responses) != 0 and responses[0]["status"] == 301:
# Permanent redirect(s). Find the new location.
i = 0
while i < len(responses) and responses[i]["status"] == 301:
i += 1
location = responses[i - 1].get("location")
if location is None:
errors.append("The feed returned a permanent redirect, but without a new location.")
else:
errors.append("New URL: " + location)
errors.append("The feed has moved permanently to a new URL.")
if config["changeconfig"]:
rawdog.change_feed_url(self.url, location, config)
errors.append("The config file has been updated automatically.")
else:
errors.append("You should update its entry in your config file.")
errors.append("")
bozo_exception = p.get("bozo_exception")
got_urlerror = isinstance(bozo_exception, urllib2.URLError)
got_timeout = isinstance(bozo_exception, socket.timeout)
if got_urlerror or got_timeout:
# urllib2 reported an error when fetching the feed.
# Check to see if it was a timeout.
if not (got_timeout or str(bozo_exception).endswith("timed out>")):
errors.append("Error while fetching feed:")
errors.append(str(bozo_exception))
errors.append("")
fatal = True
elif config["ignoretimeouts"]:
return False
else:
errors.append("Timeout while reading feed.")
errors.append("")
fatal = True
elif last_status == 304:
# The feed hasn't changed. Return False to indicate
# that we shouldn't do expiry.
return False
elif last_status in [403, 410]:
# The feed is disallowed or gone. The feed should be
# unsubscribed.
errors.append("The feed has gone.")
errors.append("You should remove it from your config file.")
errors.append("")
fatal = True
elif last_status / 100 != 2:
# Some sort of client or server error. The feed may
# need unsubscribing.
errors.append("The feed returned an error.")
errors.append("If this condition persists, you should remove it from your config file.")
errors.append("")
fatal = True
elif version == "" and len(p.get("entries", [])) == 0:
# feedparser couldn't detect the type of this feed or
# retrieve any entries from it.
errors.append("The data retrieved from this URL could not be understood as a feed.")
errors.append("You should check whether the feed has changed URLs or been removed.")
errors.append("")
fatal = True
old_error = "\n".join(errors)
plugins.call_hook("feed_fetched", rawdog, config, self, p, old_error, not fatal)
if len(errors) != 0:
print >>sys.stderr, "Feed: " + old_url
if last_status != 0:
print >>sys.stderr, "HTTP Status: " + str(last_status)
for line in errors:
print >>sys.stderr, line
if fatal:
return False
# From here, we can assume that we've got a complete feedparser
# response.
p = ensure_unicode(p, p.get("encoding") or "UTF-8")
# No entries means the feed hasn't changed, but for some reason
# we didn't get a 304 response. Handle it the same way.
if len(p["entries"]) == 0:
return False
self.etag = p.get("etag")
self.modified = p.get("modified")
self.feed_info = p["feed"]
feed = self.url
article_ids = {}
if config["useids"]:
# Find IDs for existing articles.
for (hash, a) in articles.items():
id = a.entry_info.get("id")
if a.feed == feed and id is not None:
article_ids[id] = a
seen_articles = set()
sequence = 0
for entry_info in p["entries"]:
article = Article(feed, entry_info, now, sequence)
ignore = plugins.Box(False)
plugins.call_hook("article_seen", rawdog, config, article, ignore)
if ignore.value:
continue
seen_articles.add(article.hash)
sequence += 1
id = entry_info.get("id")
if id in article_ids:
existing_article = article_ids[id]
elif article.hash in articles:
existing_article = articles[article.hash]
else:
existing_article = None
if existing_article is not None:
existing_article.update_from(article, now)
plugins.call_hook("article_updated", rawdog, config, existing_article, now)
else:
articles[article.hash] = article
plugins.call_hook("article_added", rawdog, config, article, now)
if config["currentonly"]:
for (hash, a) in articles.items():
if a.feed == feed and hash not in seen_articles:
del articles[hash]
return True
def get_html_name(self, config):
if self.feed_info.has_key("title_detail"):
r = detail_to_html(self.feed_info["title_detail"], True, config)
elif self.feed_info.has_key("link"):
r = string_to_html(self.feed_info["link"], config)
else:
r = string_to_html(self.url, config)
if r is None:
r = ""
return r
def get_html_link(self, config):
s = self.get_html_name(config)
if self.feed_info.has_key("link"):
return '' + s + ''
else:
return s
def get_id(self, config):
if self.args.has_key("id"):
return self.args["id"]
else:
r = self.get_html_name(config).lower()
return non_alphanumeric_re.sub('', r)
def get_keepmin(self, config):
return self.args.get("keepmin", config["keepmin"])
class Article:
"""An article retrieved from an RSS feed."""
def __init__(self, feed, entry_info, now, sequence):
self.feed = feed
self.entry_info = entry_info
self.sequence = sequence
self.date = None
parsed = entry_info.get("updated_parsed")
if parsed is None:
parsed = entry_info.get("published_parsed")
if parsed is None:
parsed = entry_info.get("created_parsed")
if parsed is not None:
try:
self.date = calendar.timegm(parsed)
except OverflowError:
pass
self.hash = self.compute_initial_hash()
self.last_seen = now
self.added = now
def compute_initial_hash(self):
"""Compute an initial unique hash for an article.
The generated hash must be unique amongst all articles in the
system (i.e. it can't just be the article ID, because that
would collide if more than one feed included the same
article)."""
h = hashlib.sha1()
def add_hash(s):
h.update(s.encode("UTF-8"))
add_hash(self.feed)
entry_info = self.entry_info
if entry_info.has_key("title"):
add_hash(entry_info["title"])
if entry_info.has_key("link"):
add_hash(entry_info["link"])
if entry_info.has_key("content"):
for content in entry_info["content"]:
add_hash(content["value"])
if entry_info.has_key("summary_detail"):
add_hash(entry_info["summary_detail"]["value"])
return h.hexdigest()
def update_from(self, new_article, now):
"""Update this article's contents from a newer article that's
been identified to be the same."""
self.entry_info = new_article.entry_info
self.sequence = new_article.sequence
self.date = new_article.date
self.last_seen = now
def can_expire(self, now, config):
return ((now - self.last_seen) > config["expireage"])
def get_sort_date(self, config):
if config["sortbyfeeddate"]:
return self.date or self.added
else:
return self.added
class DayWriter:
"""Utility class for writing day sections into a series of articles."""
def __init__(self, file, config):
self.lasttime = []
self.file = file
self.counter = 0
self.config = config
def start_day(self, tm):
print >>self.file, ''
day = safe_ftime(self.config["dayformat"], tm)
print >>self.file, '' + day + '
'
self.counter += 1
def start_time(self, tm):
print >>self.file, ''
clock = safe_ftime(self.config["timeformat"], tm)
print >>self.file, '' + clock + '
'
self.counter += 1
def time(self, s):
tm = time.localtime(s)
if tm[:3] != self.lasttime[:3] and self.config["daysections"]:
self.close(0)
self.start_day(tm)
if tm[:6] != self.lasttime[:6] and self.config["timesections"]:
if self.config["daysections"]:
self.close(1)
else:
self.close(0)
self.start_time(tm)
self.lasttime = tm
def close(self, n=0):
while self.counter > n:
print >>self.file, ""
self.counter -= 1
def parse_time(value, default="m"):
"""Parse a time period with optional units (s, m, h, d, w) into a time
in seconds. If no unit is specified, use minutes by default; specify
the default argument to change this. Raises ValueError if the format
isn't recognised."""
units = { "s" : 1, "m" : 60, "h" : 3600, "d" : 86400, "w" : 604800 }
for unit, size in units.items():
if value.endswith(unit):
return int(value[:-len(unit)]) * size
return int(value) * units[default]
def parse_bool(value):
"""Parse a boolean value (0, 1, false or true). Raise ValueError if
the value isn't recognised."""
value = value.strip().lower()
if value == "0" or value == "false":
return False
elif value == "1" or value == "true":
return True
else:
raise ValueError("Bad boolean value: " + value)
def parse_list(value):
"""Parse a list of keywords separated by whitespace."""
return value.strip().split(None)
def parse_feed_args(argparams, arglines):
"""Parse a list of feed arguments. Raise ConfigError if the syntax is
invalid, or ValueError if an argument value can't be parsed."""
args = {}
for p in argparams:
ps = p.split("=", 1)
if len(ps) != 2:
raise ConfigError("Bad feed argument in config: " + p)
args[ps[0]] = ps[1]
for p in arglines:
ps = p.split(None, 1)
if len(ps) != 2:
raise ConfigError("Bad argument line in config: " + p)
args[ps[0]] = ps[1]
for name, value in args.items():
if name == "allowduplicates":
args[name] = parse_bool(value)
elif name == "keepmin":
args[name] = int(value)
elif name == "maxage":
args[name] = parse_time(value)
return args
class ConfigError(Exception): pass
class Config:
"""The aggregator's configuration."""
def __init__(self, locking=True, logfile_name=None):
self.locking = locking
self.files_loaded = []
self.loglock = threading.Lock()
self.logfile = None
if logfile_name:
self.logfile = open(logfile_name, "a")
self.reset()
def reset(self):
# Note that these default values are *not* the same as
# in the supplied config file. The idea is that someone
# who has an old config file shouldn't notice a difference
# in behaviour on upgrade -- so new options generally
# default to False here, and True in the sample file.
self.config = {
"feedslist" : [],
"feeddefaults" : {},
"defines" : {},
"outputfile" : "output.html",
"maxarticles" : 200,
"maxage" : 0,
"expireage" : 24 * 60 * 60,
"keepmin" : 0,
"dayformat" : "%A, %d %B %Y",
"timeformat" : "%I:%M %p",
"datetimeformat" : None,
"userefresh" : False,
"showfeeds" : True,
"timeout" : 30,
"pagetemplate" : "default",
"itemtemplate" : "default",
"feedlisttemplate" : "default",
"feeditemtemplate" : "default",
"verbose" : False,
"ignoretimeouts" : False,
"showtracebacks" : False,
"daysections" : True,
"timesections" : True,
"blocklevelhtml" : True,
"tidyhtml" : False,
"sortbyfeeddate" : False,
"currentonly" : False,
"hideduplicates" : [],
"newfeedperiod" : "3h",
"changeconfig": False,
"numthreads": 1,
"splitstate": False,
"useids": False,
}
def __getitem__(self, key): return self.config[key]
def get(self, key, default=None): return self.config.get(key, default)
def __setitem__(self, key, value): self.config[key] = value
def reload(self):
self.log("Reloading config files")
self.reset()
for filename in self.files_loaded:
self.load(filename, False)
def load(self, filename, explicitly_loaded=True):
"""Load configuration from a config file."""
if explicitly_loaded:
self.files_loaded.append(filename)
lines = []
try:
f = open(filename, "r")
for line in f.xreadlines():
try:
line = line.decode(get_system_encoding())
except UnicodeDecodeError, e:
raise ConfigError("Character encoding problem in config file: " + filename + ": " + str(e))
stripped = line.strip()
if stripped == "" or stripped[0] == "#":
continue
if line[0] in string.whitespace:
if lines == []:
raise ConfigError("First line in config cannot be an argument")
lines[-1][1].append(stripped)
else:
lines.append((stripped, []))
f.close()
except IOError:
raise ConfigError("Can't read config file: " + filename)
for line, arglines in lines:
try:
self.load_line(line, arglines)
except ValueError:
raise ConfigError("Bad value in config: " + line)
def load_line(self, line, arglines):
"""Process a configuration directive."""
l = line.split(None, 1)
if len(l) == 1 and l[0] == "feeddefaults":
l.append("")
elif len(l) != 2:
raise ConfigError("Bad line in config: " + line)
# Load template files immediately, so we produce an error now
# rather than later if anything goes wrong.
if l[0].endswith("template") and l[1] != "default":
load_file(l[1])
handled_arglines = False
if l[0] == "feed":
l = l[1].split(None)
if len(l) < 2:
raise ConfigError("Bad line in config: " + line)
self["feedslist"].append((l[1], parse_time(l[0]), parse_feed_args(l[2:], arglines)))
handled_arglines = True
elif l[0] == "feeddefaults":
self["feeddefaults"] = parse_feed_args(l[1].split(None), arglines)
handled_arglines = True
elif l[0] == "define":
l = l[1].split(None, 1)
if len(l) != 2:
raise ConfigError("Bad line in config: " + line)
self["defines"][l[0]] = l[1]
elif l[0] == "plugindirs":
for dir in parse_list(l[1]):
plugins.load_plugins(dir, self)
elif l[0] == "outputfile":
self["outputfile"] = l[1]
elif l[0] == "maxarticles":
self["maxarticles"] = int(l[1])
elif l[0] == "maxage":
self["maxage"] = parse_time(l[1])
elif l[0] == "expireage":
self["expireage"] = parse_time(l[1])
elif l[0] == "keepmin":
self["keepmin"] = int(l[1])
elif l[0] == "dayformat":
self["dayformat"] = l[1]
elif l[0] == "timeformat":
self["timeformat"] = l[1]
elif l[0] == "datetimeformat":
self["datetimeformat"] = l[1]
elif l[0] == "userefresh":
self["userefresh"] = parse_bool(l[1])
elif l[0] == "showfeeds":
self["showfeeds"] = parse_bool(l[1])
elif l[0] == "timeout":
self["timeout"] = parse_time(l[1], "s")
elif l[0] in ("template", "pagetemplate"):
self["pagetemplate"] = l[1]
elif l[0] == "itemtemplate":
self["itemtemplate"] = l[1]
elif l[0] == "feedlisttemplate":
self["feedlisttemplate"] = l[1]
elif l[0] == "feeditemtemplate":
self["feeditemtemplate"] = l[1]
elif l[0] == "verbose":
self["verbose"] = parse_bool(l[1])
elif l[0] == "ignoretimeouts":
self["ignoretimeouts"] = parse_bool(l[1])
elif l[0] == "showtracebacks":
self["showtracebacks"] = parse_bool(l[1])
elif l[0] == "daysections":
self["daysections"] = parse_bool(l[1])
elif l[0] == "timesections":
self["timesections"] = parse_bool(l[1])
elif l[0] == "blocklevelhtml":
self["blocklevelhtml"] = parse_bool(l[1])
elif l[0] == "tidyhtml":
self["tidyhtml"] = parse_bool(l[1])
elif l[0] == "sortbyfeeddate":
self["sortbyfeeddate"] = parse_bool(l[1])
elif l[0] == "currentonly":
self["currentonly"] = parse_bool(l[1])
elif l[0] == "hideduplicates":
self["hideduplicates"] = parse_list(l[1])
elif l[0] == "newfeedperiod":
self["newfeedperiod"] = l[1]
elif l[0] == "changeconfig":
self["changeconfig"] = parse_bool(l[1])
elif l[0] == "numthreads":
self["numthreads"] = int(l[1])
elif l[0] == "splitstate":
self["splitstate"] = parse_bool(l[1])
elif l[0] == "useids":
self["useids"] = parse_bool(l[1])
elif l[0] == "include":
self.load(l[1], False)
elif plugins.call_hook("config_option_arglines", self, l[0], l[1], arglines):
handled_arglines = True
elif plugins.call_hook("config_option", self, l[0], l[1]):
pass
else:
raise ConfigError("Unknown config command: " + l[0])
if arglines != [] and not handled_arglines:
raise ConfigError("Bad argument lines in config after: " + line)
def log(self, *args):
"""Print a status message. If running in verbose mode, write
the message to stderr; if using a logfile, write it to the
logfile."""
if self["verbose"]:
with self.loglock:
print >>sys.stderr, "".join(map(str, args))
if self.logfile is not None:
with self.loglock:
print >>self.logfile, "".join(map(str, args))
self.logfile.flush()
def bug(self, *args):
"""Report detection of a bug in rawdog."""
print >>sys.stderr, "Internal error detected in rawdog:"
print >>sys.stderr, "".join(map(str, args))
print >>sys.stderr, "This could be caused by a bug in rawdog itself or in a plugin."
print >>sys.stderr, "Please send this error message and your config file to the rawdog author."
def edit_file(filename, editfunc):
"""Edit a file in place: for each line in the input file, call
editfunc(line, outputfile), then rename the output file over the input
file."""
newname = "%s.new-%d" % (filename, os.getpid())
oldfile = open(filename, "r")
newfile = open(newname, "w")
editfunc(oldfile, newfile)
newfile.close()
oldfile.close()
os.rename(newname, filename)
class AddFeedEditor:
def __init__(self, feedline):
self.feedline = feedline
def edit(self, inputfile, outputfile):
d = inputfile.read()
outputfile.write(d)
if not d.endswith("\n"):
outputfile.write("\n")
outputfile.write(self.feedline)
def add_feed(filename, url, rawdog, config):
"""Try to add a feed to the config file."""
feeds = feedscanner.feeds(url)
if feeds == []:
print >>sys.stderr, "Cannot find any feeds in " + url
return
feed = feeds[0]
if feed in rawdog.feeds:
print >>sys.stderr, "Feed " + feed + " is already in the config file"
return
print >>sys.stderr, "Adding feed " + feed
feedline = "feed %s %s\n" % (config["newfeedperiod"], feed)
edit_file(filename, AddFeedEditor(feedline).edit)
class ChangeFeedEditor:
def __init__(self, oldurl, newurl):
self.oldurl = oldurl
self.newurl = newurl
def edit(self, inputfile, outputfile):
for line in inputfile.xreadlines():
ls = line.strip().split(None)
if len(ls) > 2 and ls[0] == "feed" and ls[2] == self.oldurl:
line = line.replace(self.oldurl, self.newurl, 1)
outputfile.write(line)
class RemoveFeedEditor:
def __init__(self, url):
self.url = url
def edit(self, inputfile, outputfile):
while True:
l = inputfile.readline()
if l == "":
break
ls = l.strip().split(None)
if len(ls) > 2 and ls[0] == "feed" and ls[2] == self.url:
while True:
l = inputfile.readline()
if l == "":
break
elif l[0] == "#":
outputfile.write(l)
elif l[0] not in string.whitespace:
outputfile.write(l)
break
else:
outputfile.write(l)
def remove_feed(filename, url, config):
"""Try to remove a feed from the config file."""
if url not in [f[0] for f in config["feedslist"]]:
print >>sys.stderr, "Feed " + url + " is not in the config file"
else:
print >>sys.stderr, "Removing feed " + url
edit_file(filename, RemoveFeedEditor(url).edit)
class FeedFetcher:
"""Class that will handle fetching a set of feeds in parallel."""
def __init__(self, rawdog, feedlist, config):
self.rawdog = rawdog
self.config = config
self.lock = threading.Lock()
self.jobs = set(feedlist)
self.results = {}
def worker(self, num):
rawdog = self.rawdog
config = self.config
while True:
with self.lock:
try:
job = self.jobs.pop()
except KeyError:
# No jobs left.
break
config.log("[", num, "] Fetching feed: ", job)
feed = rawdog.feeds[job]
plugins.call_hook("pre_update_feed", rawdog, config, feed)
result = feed.fetch(rawdog, config)
with self.lock:
self.results[job] = result
def run(self, max_workers):
max_workers = max(max_workers, 1)
num_workers = min(max_workers, len(self.jobs))
self.config.log("Fetching ", len(self.jobs), " feeds using ",
num_workers, " threads")
workers = []
for i in range(1, num_workers):
t = threading.Thread(target=self.worker, args=(i,))
t.start()
workers.append(t)
self.worker(0)
for worker in workers:
worker.join()
self.config.log("Fetch complete")
return self.results
class FeedState(Persistable):
"""The collection of articles in a feed."""
def __init__(self):
Persistable.__init__(self)
self.articles = {}
class Rawdog(Persistable):
"""The aggregator itself."""
def __init__(self):
Persistable.__init__(self)
self.feeds = {}
self.articles = {}
self.plugin_storage = {}
self.state_version = STATE_VERSION
self.using_splitstate = None
def get_plugin_storage(self, plugin):
try:
st = self.plugin_storage.setdefault(plugin, {})
except AttributeError:
# rawdog before 2.5 didn't have plugin storage.
st = {}
self.plugin_storage = {plugin: st}
return st
def check_state_version(self):
"""Check the version of the state file."""
try:
version = self.state_version
except AttributeError:
# rawdog 1.x didn't keep track of this.
version = 1
return version == STATE_VERSION
def change_feed_url(self, oldurl, newurl, config):
"""Change the URL of a feed."""
assert self.feeds.has_key(oldurl)
if self.feeds.has_key(newurl):
print >>sys.stderr, "Error: New feed URL is already subscribed; please remove the old one"
print >>sys.stderr, "from the config file by hand."
return
edit_file("config", ChangeFeedEditor(oldurl, newurl).edit)
feed = self.feeds[oldurl]
# Changing the URL will change the state filename as well,
# so we need to save the old name to load from.
old_state = feed.get_state_filename()
feed.url = newurl
del self.feeds[oldurl]
self.feeds[newurl] = feed
if config["splitstate"]:
feedstate_p = persister.get(FeedState, old_state)
feedstate_p.rename(feed.get_state_filename())
with feedstate_p as feedstate:
for article in feedstate.articles.values():
article.feed = newurl
feedstate.modified()
else:
for article in self.articles.values():
if article.feed == oldurl:
article.feed = newurl
print >>sys.stderr, "Feed URL automatically changed."
def list(self, config):
"""List the configured feeds."""
for url, feed in self.feeds.items():
feed_info = feed.feed_info
print url
print " ID:", feed.get_id(config)
print " Hash:", short_hash(url)
print " Title:", feed.get_html_name(config)
print " Link:", feed_info.get("link")
def sync_from_config(self, config):
"""Update rawdog's internal state to match the
configuration."""
# Make sure the splitstate directory exists.
if config["splitstate"]:
try:
os.mkdir("feeds")
except OSError:
# Most likely it already exists.
pass
# Convert to or from splitstate if necessary.
try:
u = self.using_splitstate
except AttributeError:
# We were last run with a version of rawdog that didn't
# have this variable -- so we must have a single state
# file.
u = False
if u is None:
self.using_splitstate = config["splitstate"]
elif u != config["splitstate"]:
if config["splitstate"]:
config.log("Converting to split state files")
for feed_hash, feed in self.feeds.items():
with persister.get(FeedState, feed.get_state_filename()) as feedstate:
feedstate.articles = {}
for article_hash, article in self.articles.items():
if article.feed == feed_hash:
feedstate.articles[article_hash] = article
feedstate.modified()
self.articles = {}
else:
config.log("Converting to single state file")
self.articles = {}
for feed_hash, feed in self.feeds.items():
with persister.get(FeedState, feed.get_state_filename()) as feedstate:
for article_hash, article in feedstate.articles.items():
self.articles[article_hash] = article
feedstate.articles = {}
feedstate.modified()
persister.delete(feed.get_state_filename())
self.modified()
self.using_splitstate = config["splitstate"]
seen_feeds = set()
for (url, period, args) in config["feedslist"]:
seen_feeds.add(url)
if not self.feeds.has_key(url):
config.log("Adding new feed: ", url)
self.feeds[url] = Feed(url)
self.modified()
feed = self.feeds[url]
if feed.period != period:
config.log("Changed feed period: ", url)
feed.period = period
self.modified()
newargs = {}
newargs.update(config["feeddefaults"])
newargs.update(args)
if feed.args != newargs:
config.log("Changed feed options: ", url)
feed.args = newargs
self.modified()
for url in self.feeds.keys():
if url not in seen_feeds:
config.log("Removing feed: ", url)
if config["splitstate"]:
persister.delete(self.feeds[url].get_state_filename())
else:
for key, article in self.articles.items():
if article.feed == url:
del self.articles[key]
del self.feeds[url]
self.modified()
def update(self, config, feedurl=None):
"""Perform the update action: check feeds for new articles, and
expire old ones."""
config.log("Starting update")
now = time.time()
socket.setdefaulttimeout(config["timeout"])
if feedurl is None:
update_feeds = [url for url in self.feeds.keys()
if self.feeds[url].needs_update(now)]
elif self.feeds.has_key(feedurl):
update_feeds = [feedurl]
self.feeds[feedurl].etag = None
self.feeds[feedurl].modified = None
else:
print "No such feed: " + feedurl
update_feeds = []
numfeeds = len(update_feeds)
config.log("Will update ", numfeeds, " feeds")
fetcher = FeedFetcher(self, update_feeds, config)
fetched = fetcher.run(config["numthreads"])
seen_some_items = set()
def do_expiry(articles):
"""Expire articles from a list. Return True if any
articles were expired."""
feedcounts = {}
for key, article in articles.items():
url = article.feed
feedcounts[url] = feedcounts.get(url, 0) + 1
expiry_list = []
feedcounts = {}
for key, article in articles.items():
url = article.feed
feedcounts[url] = feedcounts.get(url, 0) + 1
expiry_list.append((article.added, article.sequence, key, article))
expiry_list.sort()
count = 0
for date, seq, key, article in expiry_list:
url = article.feed
if url not in self.feeds:
config.log("Expired article for nonexistent feed: ", url)
count += 1
del articles[key]
continue
if (url in seen_some_items
and self.feeds.has_key(url)
and article.can_expire(now, config)
and feedcounts[url] > self.feeds[url].get_keepmin(config)):
plugins.call_hook("article_expired", self, config, article, now)
count += 1
feedcounts[url] -= 1
del articles[key]
config.log("Expired ", count, " articles, leaving ", len(articles))
return (count > 0)
count = 0
for url in update_feeds:
count += 1
config.log("Updating feed ", count, " of " , numfeeds, ": ", url)
feed = self.feeds[url]
if config["splitstate"]:
feedstate_p = persister.get(FeedState, feed.get_state_filename())
feedstate = feedstate_p.open()
articles = feedstate.articles
else:
articles = self.articles
content = fetched[url]
plugins.call_hook("mid_update_feed", self, config, feed, content)
rc = feed.update(self, now, config, articles, content)
url = feed.url
plugins.call_hook("post_update_feed", self, config, feed, rc)
if rc:
seen_some_items.add(url)
if config["splitstate"]:
feedstate.modified()
if config["splitstate"]:
if do_expiry(articles):
feedstate.modified()
feedstate_p.close()
if config["splitstate"]:
self.articles = {}
else:
do_expiry(self.articles)
self.modified()
config.log("Finished update")
def get_template(self, config, name="page"):
"""Return the contents of a template."""
filename = config.get(name + "template", "default")
if filename != "default":
return load_file(filename)
if name == "page":
template = """
"""
if config["userefresh"]:
template += """__refresh__
"""
template += """
rawdog
rawdog
__items__
"""
if config["showfeeds"]:
template += """Feeds
__feeds__
"""
template += """
"""
return template
elif name == "item":
return """
__title__
[__feed_title__]
__if_description__
__description__
__endif__
"""
elif name == "feedlist":
return """
Feed RSS Last fetched Next fetched after
__feeditems__
"""
elif name == "feeditem":
return """
__feed_title__
__feed_icon__
__feed_last_update__
__feed_next_update__
"""
else:
raise KeyError("Unknown template name: " + name)
def show_template(self, name, config):
"""Show the contents of a template, as currently configured."""
try:
print self.get_template(config, name),
except KeyError:
print >>sys.stderr, "Unknown template name: " + a
def write_article(self, f, article, config):
"""Write an article to the given file."""
feed = self.feeds[article.feed]
entry_info = article.entry_info
link = entry_info.get("link")
if link == "":
link = None
guid = entry_info.get("id")
if guid == "":
guid = None
itembits = self.get_feed_bits(config, feed)
for name, value in feed.args.items():
if name.startswith("define_"):
itembits[name[7:]] = sanitise_html(value, "", True, config)
title = detail_to_html(entry_info.get("title_detail"), True, config)
key = None
for k in ["content", "summary_detail"]:
if entry_info.has_key(k):
key = k
break
if key is None:
description = None
else:
force_preformatted = (feed.args.get("format", "default") == "text")
description = detail_to_html(entry_info[key], False, config, force_preformatted)
date = article.date
if title is None:
if link is None:
title = "Article"
else:
title = "Link"
itembits["title_no_link"] = title
if link is not None:
itembits["url"] = string_to_html(link, config)
else:
itembits["url"] = ""
if guid is not None:
itembits["guid"] = string_to_html(guid, config)
else:
itembits["guid"] = ""
if link is None:
itembits["title"] = title
else:
itembits["title"] = '' + title + ''
itembits["hash"] = short_hash(article.hash)
if description is not None:
itembits["description"] = description
else:
itembits["description"] = ""
author = author_to_html(entry_info, feed.url, config)
if author is not None:
itembits["author"] = author
else:
itembits["author"] = ""
itembits["added"] = format_time(article.added, config)
if date is not None:
itembits["date"] = format_time(date, config)
else:
itembits["date"] = ""
plugins.call_hook("output_item_bits", self, config, feed, article, itembits)
itemtemplate = self.get_template(config, "item")
f.write(fill_template(itemtemplate, itembits))
def write_remove_dups(self, articles, config, now):
"""Filter the list of articles to remove articles that are too
old or are duplicates."""
kept_articles = []
seen_links = set()
seen_guids = set()
dup_count = 0
for article in articles:
feed = self.feeds[article.feed]
age = now - article.added
maxage = feed.args.get("maxage", config["maxage"])
if maxage != 0 and age > maxage:
continue
entry_info = article.entry_info
link = entry_info.get("link")
if link == "":
link = None
guid = entry_info.get("id")
if guid == "":
guid = None
if not feed.args.get("allowduplicates", False):
is_dup = False
for key in config["hideduplicates"]:
if key == "id" and guid is not None:
if guid in seen_guids:
is_dup = True
seen_guids.add(guid)
elif key == "link" and link is not None:
if link in seen_links:
is_dup = True
seen_links.add(link)
if is_dup:
dup_count += 1
continue
kept_articles.append(article)
return (kept_articles, dup_count)
def get_feed_bits(self, config, feed):
"""Get the bits that are used to describe a feed."""
bits = {}
bits["feed_id"] = feed.get_id(config)
bits["feed_hash"] = short_hash(feed.url)
bits["feed_title"] = feed.get_html_link(config)
bits["feed_title_no_link"] = detail_to_html(feed.feed_info.get("title_detail"), True, config)
bits["feed_url"] = string_to_html(feed.url, config)
bits["feed_icon"] = 'XML'
bits["feed_last_update"] = format_time(feed.last_update, config)
bits["feed_next_update"] = format_time(feed.last_update + feed.period, config)
return bits
def write_feeditem(self, f, feed, config):
"""Write a feed list item."""
bits = self.get_feed_bits(config, feed)
f.write(fill_template(self.get_template(config, "feeditem"), bits))
def write_feedlist(self, f, config):
"""Write the feed list."""
bits = {}
feeds = [(feed.get_html_name(config).lower(), feed)
for feed in self.feeds.values()]
feeds.sort()
feeditems = StringIO()
for key, feed in feeds:
self.write_feeditem(feeditems, feed, config)
bits["feeditems"] = feeditems.getvalue()
feeditems.close()
f.write(fill_template(self.get_template(config, "feedlist"), bits))
def get_main_template_bits(self, config):
"""Get the bits that are used in the default main template,
with the exception of items and num_items."""
bits = { "version" : VERSION }
bits.update(config["defines"])
refresh = config["expireage"]
for feed in self.feeds.values():
if feed.period < refresh: refresh = feed.period
bits["refresh"] = """"""
f = StringIO()
self.write_feedlist(f, config)
bits["feeds"] = f.getvalue()
f.close()
bits["num_feeds"] = str(len(self.feeds))
return bits
def write_output_file(self, articles, article_dates, config):
"""Write a regular rawdog HTML output file."""
f = StringIO()
dw = DayWriter(f, config)
plugins.call_hook("output_items_begin", self, config, f)
for article in articles:
if not plugins.call_hook("output_items_heading", self, config, f, article, article_dates[article]):
dw.time(article_dates[article])
self.write_article(f, article, config)
dw.close()
plugins.call_hook("output_items_end", self, config, f)
bits = self.get_main_template_bits(config)
bits["items"] = f.getvalue()
f.close()
bits["num_items"] = str(len(articles))
plugins.call_hook("output_bits", self, config, bits)
s = fill_template(self.get_template(config, "page"), bits)
outputfile = config["outputfile"]
if outputfile == "-":
write_ascii(sys.stdout, s, config)
else:
config.log("Writing output file: ", outputfile)
f = open(outputfile + ".new", "w")
write_ascii(f, s, config)
f.close()
os.rename(outputfile + ".new", outputfile)
def write(self, config):
"""Perform the write action: write articles to the output
file."""
config.log("Starting write")
now = time.time()
def list_articles(articles):
return [(-a.get_sort_date(config), a.feed, a.sequence, a.hash) for a in articles.values()]
if config["splitstate"]:
article_list = []
for feed in self.feeds.values():
with persister.get(FeedState, feed.get_state_filename()) as feedstate:
article_list += list_articles(feedstate.articles)
else:
article_list = list_articles(self.articles)
numarticles = len(article_list)
if not plugins.call_hook("output_sort_articles", self, config, article_list):
article_list.sort()
if config["maxarticles"] != 0:
article_list = article_list[:config["maxarticles"]]
if config["splitstate"]:
wanted = {}
for (date, feed_url, seq, hash) in article_list:
if not feed_url in self.feeds:
# This can happen if you've managed to
# kill rawdog between it updating a
# split state file and the main state
# -- so just ignore the article and
# it'll expire eventually.
continue
wanted.setdefault(feed_url, []).append(hash)
found = {}
for (feed_url, article_hashes) in wanted.items():
feed = self.feeds[feed_url]
with persister.get(FeedState, feed.get_state_filename()) as feedstate:
for hash in article_hashes:
found[hash] = feedstate.articles[hash]
else:
found = self.articles
articles = []
article_dates = {}
for (date, feed, seq, hash) in article_list:
a = found.get(hash)
if a is not None:
articles.append(a)
article_dates[a] = -date
plugins.call_hook("output_write", self, config, articles)
if not plugins.call_hook("output_sorted_filter", self, config, articles):
(articles, dup_count) = self.write_remove_dups(articles, config, now)
else:
dup_count = 0
config.log("Selected ", len(articles), " of ", numarticles, " articles to write; ignored ", dup_count, " duplicates")
if not plugins.call_hook("output_write_files", self, config, articles, article_dates):
self.write_output_file(articles, article_dates, config)
config.log("Finished write")
def usage():
"""Display usage information."""
print """rawdog, version """ + VERSION + """
Usage: rawdog [OPTION]...
General options (use only once):
-d|--dir DIR Use DIR instead of ~/.rawdog
-N, --no-locking Do not lock the state file
-v, --verbose Print more detailed status information
-V|--log FILE Append detailed status information to FILE
-W, --no-lock-wait Exit silently if state file is locked
Actions (performed in order given):
-a|--add URL Try to find a feed associated with URL and
add it to the config file
-c|--config FILE Read additional config file FILE
-f|--update-feed URL Force an update on the single feed URL
-l, --list List feeds known at time of last update
-r|--remove URL Remove feed URL from the config file
-s|--show TEMPLATE Show the contents of a template
(TEMPLATE may be: page item feedlist feeditem)
-u, --update Fetch data from feeds and store it
-w, --write Write out HTML output
Special actions (all other options are ignored if one of these is specified):
--dump URL Show what rawdog's parser returns for URL
--help Display this help and exit
Report bugs to ."""
def main(argv):
"""The command-line interface to the aggregator."""
locale.setlocale(locale.LC_ALL, "")
# This is quite expensive and not threadsafe, so we do it on
# startup and cache the result.
global system_encoding
system_encoding = locale.getpreferredencoding()
try:
SHORTOPTS = "a:c:d:f:lNr:s:tTuvV:wW"
LONGOPTS = [
"add=",
"config=",
"dir=",
"dump=",
"help",
"list",
"log=",
"no-lock-wait",
"no-locking",
"remove=",
"show=",
"show-itemtemplate",
"show-template",
"update",
"update-feed=",
"verbose",
"write",
]
(optlist, args) = getopt.getopt(argv, SHORTOPTS, LONGOPTS)
except getopt.GetoptError, s:
print s
usage()
return 1
if len(args) != 0:
usage()
return 1
if "HOME" in os.environ:
statedir = os.environ["HOME"] + "/.rawdog"
else:
statedir = None
verbose = False
logfile_name = None
locking = True
no_lock_wait = False
for o, a in optlist:
if o == "--dump":
import pprint
pprint.pprint(feedparser.parse(a, agent=HTTP_AGENT))
return 0
elif o == "--help":
usage()
return 0
elif o in ("-d", "--dir"):
statedir = a
elif o in ("-N", "--no-locking"):
locking = False
elif o in ("-v", "--verbose"):
verbose = True
elif o in ("-V", "--log"):
logfile_name = a
elif o in ("-W", "--no-lock-wait"):
no_lock_wait = True
if statedir is None:
print "$HOME not set and state dir not explicitly specified; please use -d/--dir"
return 1
try:
os.chdir(statedir)
except OSError:
print "No " + statedir + " directory"
return 1
sys.path.append(".")
config = Config(locking, logfile_name)
def load_config(fn):
try:
config.load(fn)
except ConfigError, err:
print >>sys.stderr, "In " + fn + ":"
print >>sys.stderr, err
return 1
if verbose:
config["verbose"] = True
load_config("config")
global persister
persister = Persister(config)
rawdog_p = persister.get(Rawdog, "state")
rawdog = rawdog_p.open(no_block=no_lock_wait)
if rawdog is None:
return 0
if not rawdog.check_state_version():
print "The state file " + statedir + "/state was created by an older"
print "version of rawdog, and cannot be read by this version."
print "Removing the state file will fix it."
return 1
rawdog.sync_from_config(config)
plugins.call_hook("startup", rawdog, config)
for o, a in optlist:
if o in ("-a", "--add"):
add_feed("config", a, rawdog, config)
config.reload()
rawdog.sync_from_config(config)
elif o in ("-c", "--config"):
load_config(a)
rawdog.sync_from_config(config)
elif o in ("-f", "--update-feed"):
rawdog.update(config, a)
elif o in ("-l", "--list"):
rawdog.list(config)
elif o in ("-r", "--remove"):
remove_feed("config", a, config)
config.reload()
rawdog.sync_from_config(config)
elif o in ("-s", "--show"):
rawdog.show_template(a, config)
elif o in ("-t", "--show-template"):
rawdog.show_template("page", config)
elif o in ("-T", "--show-itemtemplate"):
rawdog.show_template("item", config)
elif o in ("-u", "--update"):
rawdog.update(config)
elif o in ("-w", "--write"):
rawdog.write(config)
plugins.call_hook("shutdown", rawdog, config)
rawdog_p.close()
return 0
rawdog-2.19/rawdoglib/plugins.py 0000644 0004715 0004715 00000004531 12167103316 016350 0 ustar ats ats 0000000 0000000 # plugins: handle add-on modules for rawdog.
# Copyright 2004, 2005, 2013 Adam Sampson
#
# rawdog is free software; you can redistribute and/or modify it
# under the terms of that license as published by the Free Software
# Foundation; either version 2 of the License, or (at your option)
# any later version.
#
# rawdog is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with rawdog; see the file COPYING. If not, write to the Free
# Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA, or see http://www.gnu.org/.
# The design of rawdog's plugin API was inspired by Stuart Langridge's
# Vellum weblog system:
# http://www.kryogenix.org/code/vellum/
import os, imp
class Box:
"""Utility class that holds a mutable value. Useful for passing
immutable types by reference."""
def __init__(self, value=None):
self.value = value
plugin_count = 0
def load_plugins(dir, config):
global plugin_count
try:
files = os.listdir(dir)
except OSError:
# Ignore directories that can't be read.
return
for file in files:
if file == "" or file[0] == ".":
continue
desc = None
for d in imp.get_suffixes():
if file.endswith(d[0]) and d[2] == imp.PY_SOURCE:
desc = d
if desc is None:
continue
fn = os.path.join(dir, file)
config.log("Loading plugin ", fn)
f = open(fn, "r")
imp.load_module("plugin%d" % (plugin_count,), f, fn, desc)
plugin_count += 1
f.close()
attached = {}
def attach_hook(hookname, func):
"""Attach a function to a hook. The function should take the
appropriate arguments for the hook, and should return either True or
False to indicate whether further functions should be processed."""
attached.setdefault(hookname, []).append(func)
def call_hook(hookname, *args):
"""Call all the functions attached to a hook with the given
arguments, in the order they were added, stopping if a hook function
returns False. Returns True if any hook function returned False (i.e.
returns True if any hook function handled the request)."""
for func in attached.get(hookname, []):
if not func(*args):
return True
return False
rawdog-2.19/setup.py 0000755 0004715 0004715 00000001432 12273445020 014054 0 ustar ats ats 0000000 0000000 #!/usr/bin/env python
from distutils.core import setup
import sys
if sys.version_info < (2, 6) or sys.version_info >= (3,):
print("rawdog requires Python 2.6 or later, and not Python 3.")
sys.exit(1)
setup(name = "rawdog",
version = "2.19",
description = "RSS Aggregator Without Delusions Of Grandeur",
author = "Adam Sampson",
author_email = "ats@offog.org",
url = "http://offog.org/code/rawdog/",
scripts = ['rawdog'],
data_files = [('share/man/man1', ['rawdog.1'])],
packages = ['rawdoglib'],
classifiers = [
"Development Status :: 5 - Production/Stable",
"Environment :: Console",
"License :: OSI Approved :: GNU General Public License v2 or later (GPLv2+)",
"Operating System :: POSIX",
"Programming Language :: Python :: 2",
"Topic :: Internet :: WWW/HTTP",
])
rawdog-2.19/PLUGINS 0000644 0004715 0004715 00000030233 12176717227 013421 0 ustar ats ats 0000000 0000000 # Writing rawdog plugins
## Introduction
As provided, rawdog provides a fairly small set of features. In order to
make it do more complex jobs, rawdog can be extended using plugin
modules written in Python. This document is intended for developers who
want to extend rawdog by writing plugins.
Extensions work by registering hook functions which are called by
various bits of rawdog's core as it runs. These functions can modify
rawdog's internal state in various interesting ways. An arbitrary number
of functions can be attached to each hook; they are called in the order
they were attached. Hook functions take various arguments depending on
where they're called from, and returns a boolean value indicating
whether further functions attached to the same hook should be called.
The "plugindirs" config option gives a list of directories to search for
plugins; all Python modules found in those directories will be loaded by
rawdog. In practice, this means that you need to call your file
something ending in ".py" to have it recognised as a plugin.
## The plugins module
All plugins should import the `rawdoglib.plugins` module, which provides
the functions for registering and calling hooks, along with some
utilities for plugins. Many plugins will also want to import the
`rawdoglib.rawdog` module, which contains rawdog's core functionality,
much of which is reusable.
### rawdoglib.plugins.attach_hook(hook_name, function)
The attach_hook function adds a hook function to the hook of the given
name.
### rawdoglib.plugins.Box
The Box class is used to pass immutable types by reference to hook
functions; this allows several plugins to modify a value. It contains a
single `value` attribute for the value it is holding.
## Plugin storage
Since some plugins will need to keep state between runs, the Rawdog
object that most hook functions are provided with has a
`get_plugin_storage` method, which when called with a plugin identifier
for your plugin as an argument will give you a reference to a dictionary
which will be persisted in the rawdog state file. The dictionary is empty to
start with; you may store any pickleable objects you like in it. Plugin
identifiers should be strings based on your email address, in order to be
globally unique -- for example, `org.offog.ats.archive`.
After changing a plugin storage dictionary, you must call "rawdog.modified()"
to ensure that rawdog will write out its state file.
## Hooks
Most hook functions are called with "rawdog" and "config" as their first
two arguments; these are references to the aggregator's Rawdog and
Config objects.
If you need a hook that doesn't currently exist, please contact me.
The following hooks are supported:
### startup(rawdog, config)
Run when rawdog starts up, after the state file and config file have
been loaded, but before rawdog starts processing command-line arguments.
### shutdown(rawdog, config)
Run just before rawdog saves the state file and exits.
### config_option(config, name, value)
* name: the option name
* value: the option value
Called when rawdog encounters a config file option that it doesn't
recognise. The rawdoglib.rawdog.parse_* functions will probably be
useful when dealing with config options. You can raise ValueError to
have rawdog print an appropriate error message. You should return False
from this hook if name is an option you recognise.
Note that using config.log in this hook will probably not do what you
want, because the verbose flag may not yet have been turned on.
### config_option_arglines(config, name, value, arglines)
* name: the option name
* value: the option value
* arglines: a list of extra indented lines given after the option (which
can be used to supply extra arguments for the option)
As config_option for options that can handle extra argument lines.
If the options you are implementing should not have extra arguments,
then use the config_option hook instead.
### output_sort_articles(rawdog, config, articles)
* articles: the mutable list of (date, feed_url, sequence_number,
article_hash) tuples
Called to sort the list of articles to write. The default action here is
to just call the list's sort method; if you sort the list in a different
way, you should return False from this hook to prevent rawdog from
resorting it afterwards.
Later versions of rawdog may add more items at the end of the tuple;
bear this in mind when you're manipulating the items.
### output_write(rawdog, config, articles)
* articles: the mutable list of Article objects
Called immediately before output_sorted_filter; this hook is here for
backwards compatibility, and should not be used in new plugins.
### output_sorted_filter(rawdog, config, articles)
* articles: the mutable list of Article objects
Called after rawdog sorts the list of articles to write, but before it
removes duplicate and excessively old articles. This hook can be used to
implement alternate duplicate-filtering methods. If you return False
from this hook, then rawdog will not do its usual duplicate-removing
filter pass.
### output_write_files(rawdog, config, articles, article_dates)
* articles: the mutable list of Article objects
* article_dates: a dictionary mapping Article objects to the dates that
were used to sort them
Called when rawdog is about to write its output to files. This hook can
be used to implement alternative output methods.
If you return False from this hook, then rawdog will not write any
output itself (and the later output_ hooks will thus not be called). I
would suggest not returning False here unless you plan to call the
rawdog.write_output_file method from your hook implementation; failure
to do so will most likely break other plugins.
### output_items_begin(rawdog, config, f)
* f: a writable file object (__items__)
Called before rawdog starts expanding the items template. This set of
hooks can be used to implement alternative date (or other section)
headings.
### output_items_heading(rawdog, config, f, article, date)
* f: a writable file object (__items__)
* article: the Article object about to be written
* date: the Article's date for sorting purposes
Called before each item is written. If you return False from this hook,
then rawdog's normal time-based section headings will not be written.
### output_items_end(rawdog, config, f)
* f: a writable file object (__items__)
Called after all items are written.
### output_bits(rawdog, config, bits)
* bits: a dictionary of template parameters
Called before expanding the page template. This hook can be used to add
extra template parameters.
Note that template parameters should be valid HTML, with entities
escaped, even if they're URLs or similar. You can use rawdog's
`rawdoglib.rawdog.string_to_html` function to do this for you:
the_thing = "This can contain arbitary text & stuff"
bits["thing"] = string_to_html(the_thing, config)
It's also good idea for template parameter names to be valid Python
identifiers, so that plugins that replace the template system with
something smarter can make them into local variables.
### output_item_bits(rawdog, config, feed, article, bits)
* feed: the Feed containing this article
* article: the Article being templated
* bits: a dictionary of template parameters
Called before expanding the item template for an article. This hook can
be used to add extra template parameters.
(See the documentation for `output_bits` for some advice on adding
template parameters.)
### pre_update_feed(rawdog, config, feed)
* feed: the Feed about to be updated
Called before a feed's content is fetched. This hook can be used to
perform extra actions before fetching a feed. Note that if `usethreads`
is set to a positive number in the config file, this hook may be called
from a worker thread.
### mid_update_feed(rawdog, config, feed, content)
* feed: the Feed being updated
* content: the feedparser output from the feed (may be None)
Called after a feed's content has been fetched, but before rawdog's
internal state has been updated. This hook can be used to modify
feedparser's output.
### post_update_feed(rawdog, config, feed, seen_articles)
* feed: the Feed that has been updated
* seen_articles: a boolean indicating whether any articles were read
from the feed
Called after a feed is updated.
### article_seen(rawdog, config, article, ignore)
* article: the Article that has been received
* ignore: a Boxed boolean indicating whether to ignore the article
Called when an article is received from a feed. This hook can be used to
modify or ignore incoming articles.
### article_updated(rawdog, config, article, now)
* article: the Article that has been updated
* now: the current time
Called after an article has been updated (when rawdog receives an
article from a feed that it already has).
### article_added(rawdog, config, article, now)
* article: the Article that has been added
* now: the current time
Called after a new article has been added.
### article_expired(rawdog, config, article, now)
* article: the Article that will be expired
* now: the current time
Called before an article is expired.
### fill_template(template, bits, result)
* template: the template string to fill
* bits: a dictionary of template arguments
* result: a Boxed Unicode string for the result of template expansion
Called whenever template expansion is performed. If you set the value
inside result to something other than None, then rawdog will treat that
value as the result of template expansion (rather than performing its
normal expansion process); you can thus use this hook either for
manipulating template parameters, or for replacing the template system
entirely.
### tidy_args(config, args, baseurl, inline)
* args: a dictionary of keyword arguments for Tidy
* baseurl: the URL at which the HTML was originally found
* inline: a boolean indicating whether the output should be inline HTML
or a block element
When HTML is being sanitised by rawdog and the "tidyhtml" option is
enabled, this hook will be called just before Tidy is run (either via
PyTidyLib or via mx.Tidy). It can be used to add or modify Tidy options;
for example, to make it produce XHTML output.
### clean_html(config, html, baseurl, inline)
* html: a Boxed Unicode string containing the HTML being cleaned
* baseurl: the URL at which the HTML was originally found
* inline: a boolean indicating whether the output should be inline HTML
or a block element
Called whenever HTML is being sanitised by rawdog (after its existing
HTML sanitisation processes). You can use this to implement extra
sanitisation passes. You'll need to update the boxed value with the new,
cleaned string.
### add_urllib2_handlers(rawdog, config, feed, handlers)
* feed: the Feed to which the request will be made
* handlers: the mutable list of urllib2 *Handler objects that will be
passed to feedparser
Called before feedparser is used to fetch feed content. This hook can be
used to add additional urllib2 handlers to cope with unusual protocol
requirements; use `handlers.append` to add extra handlers.
### feed_fetched(rawdog, config, feed, feed_data, error, non_fatal)
* feed: the Feed that has just been fetched
* feed_data: the data returned from feedparser.parse
* error: the error string if an error occurred, or None if no error
occurred
* non_fatal: if error is not None, a boolean indicating whether the
error was fatal
Called after feedparser has been called to fetch the feed. This hook can
be used to manipulate the received feed data or implement custom error
handling.
## Obsolete hooks
The following hooks existed in previous versions of rawdog, but are no
longer supported:
* output_filter (since rawdog 2.12); use output_sorted_filter instead
* output_sort (since rawdog 2.12); use output_sort_articles instead
## Examples
### backwards.py
This is probably the simplest useful example plugin: it reverses the
sort order of the output.
import rawdoglib.plugins
def backwards(rawdog, config, articles):
articles.sort()
articles.reverse()
return False
rawdoglib.plugins.attach_hook("output_sort_articles", backwards)
### option.py
This plugin shows how to handle a config file option.
import rawdoglib.plugins
def option(config, name, value):
if name == "myoption":
print "Test plugin option:", value
return False
else:
return True
rawdoglib.plugins.attach_hook("config_option", option)
rawdog-2.19/testserver.py 0000644 0004715 0004715 00000016055 12177453766 015151 0 ustar ats ats 0000000 0000000 # testserver: servers for rawdog's test suite.
# Copyright 2013 Adam Sampson
#
# rawdog is free software; you can redistribute and/or modify it
# under the terms of that license as published by the Free Software
# Foundation; either version 2 of the License, or (at your option)
# any later version.
#
# rawdog is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with rawdog; see the file COPYING. If not, write to the Free
# Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA, or see http://www.gnu.org/.
import BaseHTTPServer
import SimpleHTTPServer
import SocketServer
import base64
import cStringIO
import gzip
import hashlib
import os
import re
import sys
import threading
import time
class TimeoutRequestHandler(SocketServer.BaseRequestHandler):
"""Request handler for a server that just does nothing for a few
seconds, then disconnects. This is used for testing timeout handling."""
def handle(self):
time.sleep(5)
class TimeoutServer(SocketServer.ThreadingMixIn, SocketServer.TCPServer):
"""Timeout server for rawdog's test suite."""
pass
class HTTPRequestHandler(SimpleHTTPServer.SimpleHTTPRequestHandler):
"""HTTP request handler for rawdog's test suite."""
# do_GET/do_HEAD are copied from SimpleHTTPServer because send_head isn't
# part of the API.
def do_GET(self):
f = self.send_head()
if f:
self.copyfile(f, self.wfile)
f.close()
def do_HEAD(self):
f = self.send_head()
if f:
f.close()
def send_head(self):
# Look for lines of the form "/oldpath /newpath" in .rewrites.
try:
f = open(os.path.join(self.server.files_dir, ".rewrites"))
for line in f.readlines():
(old, new) = line.split(None, 1)
if self.path == old:
self.path = new
f.close()
except IOError:
pass
m = re.match(r'^/auth-([^/-]+)-([^/]+)(/.*)$', self.path)
if m:
# Require basic authentication.
auth = "Basic " + base64.b64encode(m.group(1) + ":" + m.group(2))
if self.headers.get("Authorization") != auth:
self.send_response(401)
self.end_headers()
return None
self.path = m.group(3)
m = re.match(r'^/digest-([^/-]+)-([^/]+)(/.*)$', self.path)
if m:
# Require digest authentication. (Not a good implementation!)
realm = "rawdog test server"
nonce = "0123456789abcdef"
a1 = m.group(1) + ":" + realm + ":" + m.group(2)
a2 = "GET:" + self.path
def h(s):
return hashlib.md5(s).hexdigest()
response = h(h(a1) + ":" + nonce + ":" + h(a2))
mr = re.search(r'response="([^"]*)"',
self.headers.get("Authorization", ""))
if mr is None or mr.group(1) != response:
self.send_response(401)
self.send_header("WWW-Authenticate",
'Digest realm="%s", nonce="%s"'
% (realm, nonce))
self.end_headers()
return None
self.path = m.group(3)
m = re.match(r'^/(\d\d\d)(/.*)?$', self.path)
if m:
# Request for a particular response code.
code = int(m.group(1))
self.send_response(code)
if m.group(2):
self.send_header("Location", self.server.base_url + m.group(2))
self.end_headers()
return None
encoding = None
m = re.match(r'^/(gzip)(/.*)$', self.path)
if m:
# Request for a content encoding.
encoding = m.group(1)
self.path = m.group(2)
m = re.match(r'^/([^/]+)$', self.path)
if m:
# Request for a file.
filename = os.path.join(self.server.files_dir, m.group(1))
try:
f = open(filename, "rb")
except IOError:
self.send_response(404)
self.end_headers()
return None
# Use the SHA1 hash as an ETag.
etag = '"' + hashlib.sha1(f.read()).hexdigest() + '"'
f.seek(0)
# Oversimplistic, but matches what feedparser sends.
if self.headers.get("If-None-Match", "") == etag:
self.send_response(304)
self.end_headers()
return None
size = os.fstat(f.fileno()).st_size
mime_type = "text/plain"
if filename.endswith(".rss") or filename.endswith(".rss2"):
mime_type = "application/rss+xml"
elif filename.endswith(".rdf"):
mime_type = "application/rdf+xml"
elif filename.endswith(".atom"):
mime_type = "application/atom+xml"
elif filename.endswith(".html"):
mime_type = "text/html"
self.send_response(200)
if encoding:
self.send_header("Content-Encoding", encoding)
if encoding == "gzip":
data = f.read()
f.close()
f = cStringIO.StringIO()
g = gzip.GzipFile(fileobj=f, mode="wb")
g.write(data)
g.close()
size = f.tell()
f.seek(0)
self.send_header("Content-Length", size)
self.send_header("Content-Type", mime_type)
self.send_header("ETag", etag)
self.end_headers()
return f
# A request we can't handle.
self.send_response(500)
self.end_headers()
return None
def log_message(self, fmt, *args):
f = open(self.server.files_dir + "/.log", "a")
f.write(fmt % args + "\n")
f.close()
class HTTPServer(BaseHTTPServer.HTTPServer):
"""HTTP server for rawdog's test suite."""
def __init__(self, base_url, files_dir, *args, **kwargs):
self.base_url = base_url
self.files_dir = files_dir
BaseHTTPServer.HTTPServer.__init__(self, *args, **kwargs)
def main(args):
if len(args) < 3:
print "Usage: testserver.py HOSTNAME TIMEOUT-PORT HTTP-PORT FILES-DIR"
sys.exit(1)
hostname = args[0]
timeout_port = int(args[1])
http_port = int(args[2])
files_dir = args[3]
timeoutd = TimeoutServer((hostname, timeout_port), TimeoutRequestHandler)
t = threading.Thread(target=timeoutd.serve_forever)
t.daemon = True
t.start()
base_url = "http://" + hostname + ":" + str(http_port)
httpd = HTTPServer(base_url, files_dir, (hostname, http_port), HTTPRequestHandler)
httpd.serve_forever()
if __name__ == "__main__":
main(sys.argv[1:])
rawdog-2.19/COPYING 0000644 0004715 0004715 00000043135 10404407140 013373 0 ustar ats ats 0000000 0000000 GNU GENERAL PUBLIC LICENSE
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc.
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Library General Public License instead.) You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.
Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and
modification follow.
GNU GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".
Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.
You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:
a) You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
c) If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.
3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.
If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.
It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.
10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
Copyright (C)
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) year name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
`Gnomovision' (which makes passes at compilers) written by James Hacker.
, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Library General
Public License instead of this License.