
ubh - The Usenet Binary Harvester
ubh -c file (-S|-M) -A -C -d -D -g [(-a|-s)]|[(-f|-l) num] -i [(-I|-X) pattern] -L -n -O -r -u -w -y -Z -z
ubh is the Usenet Binary Harvester, a Perl program which
automatically discovers, downloads, and decodes single and
multi-part binary Usenet postings.
ubh decodes single and multi-part uuencoded binaries.
ubh also decodes single part MIME base64-encoded image,
audio, and video attachments, and application/octet-stream
attachments. It also combines and decodes multi-part
message/partial binaries.
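To make the uuencoded case concrete, here is a minimal sketch in Python (not ubh's own Perl code) of pulling one uuencoded file out of an article body. It relies only on binascii from the Python standard library; the uudecode helper and the sample payload are illustrative assumptions, and real postings can be messier than this handles.

```python
import binascii


def uudecode(lines):
    """Decode the first uuencoded file found in a list of body lines.

    Returns (filename, data), or None if no begin/end pair is found.
    """
    name = None
    chunks = []
    for line in lines:
        if name is None:
            # The header line looks like: begin 644 picture.jpg
            parts = line.split(" ", 2)
            if len(parts) == 3 and parts[0] == "begin" and parts[1].isdigit():
                name = parts[2].strip()
            continue
        if line.strip() == "end":
            return name, b"".join(chunks)
        if line.strip():
            # Each data line carries its own length in the first character.
            chunks.append(binascii.a2b_uu(line))
    return None


# Build a small uuencoded body (45 bytes per line) and decode it again.
payload = bytes(range(100))
body = ["Some text before the attachment\n", "begin 644 sample.bin\n"]
body += [binascii.b2a_uu(payload[i:i + 45]).decode("ascii")
         for i in range(0, len(payload), 45)]
body += ["end\n"]
name, data = uudecode(body)
```

For a multi-part posting, the body lines of all parts would be concatenated in part order before decoding.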
ubh uses a standard .newsrc file to control which
groups to process, and uses the .newsrc to keep track of
articles already processed.
You can specify search filters to select articles to download via Perl
regular expression syntax.
By default ubh will only consider articles with a
well-formed Subject header. A well-formed
Subject header is one that contains a file name with
an extension which matches the extension filter. Multi-part
Subject headers will be recognized if they contain
a part/total designator of the form
[m/n] or
(m/n), where
m is the part number and n is the total number of
parts. Part numbers begin with 1 and may or may not
contain leading zeroes. A part number of 0 is ignored.
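The part/total rule above can be sketched with a single regular expression. This is an illustration in Python rather than ubh's own code; the parse_part name is hypothetical, and taking the first designator found in the subject is an assumption.

```python
import re

# Matches "[m/n]" or "(m/n)"; the brackets must pair up, and leading
# zeroes are accepted in either number.
PART_RE = re.compile(r"\[(\d+)/(\d+)\]|\((\d+)/(\d+)\)")


def parse_part(subject):
    """Return (part, total) from a Subject header, or None.

    A part number of 0 is treated as "no usable part".
    """
    m = PART_RE.search(subject)
    if m is None:
        return None
    if m.group(1) is not None:
        part, total = int(m.group(1)), int(m.group(2))
    else:
        part, total = int(m.group(3)), int(m.group(4))
    if part == 0:
        return None
    return part, total
```

Converting the captured digits with int() is what makes 01 and 1 equivalent.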
ubh provides an interactive article preselection option
to allow you to preview the Subject: headers for multi-part
binaries, and specify which binaries you wish to download.
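A preselection list only makes sense once subjects have been grouped into binaries. The following Python sketch shows one way to do that grouping and to tell which binaries have all of their parts. It is an assumption about the general approach, not a description of ubh's internals; group_parts and complete are hypothetical names.

```python
import re
from collections import defaultdict

PART_RE = re.compile(r"\[(\d+)/(\d+)\]|\((\d+)/(\d+)\)")


def group_parts(subjects):
    """Map each binary's base subject to (parts seen, total expected)."""
    seen = defaultdict(set)
    totals = {}
    for subject in subjects:
        m = PART_RE.search(subject)
        if m is None:
            continue  # not a multi-part subject
        nums = [g for g in m.groups() if g is not None]
        part, total = int(nums[0]), int(nums[1])
        if part == 0:
            continue  # part 0 is ignored
        # The subject with the designator removed identifies the binary.
        base = (subject[:m.start()] + subject[m.end():]).strip()
        seen[base].add(part)
        totals[base] = total
    return {base: (seen[base], totals[base]) for base in seen}


def complete(groups):
    """Return the base subjects for which every part 1..total was seen."""
    return sorted(base for base, (parts, total) in groups.items()
                  if parts == set(range(1, total + 1)))
```

Only the binaries returned by complete() could be assembled and decoded; the others are still missing at least one part.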
ubh runs equally well under Unix-based Perl, Active Perl
on Win32 platforms, and Mac OS X.
ubh requires Net::NNTP and News::Newsrc
(which itself requires Set::IntSpan). See the section
INSTALLATION for details.
These options apply to single part article processing:
Option |
Description |
-S |
Process only single part articles. |
-g |
This option ("g" for
"greedy") will download and process each unread article
even if the subject does not contain a filename which
matches the single part extension filter. |
|
These options apply to multi-part article processing:
Option |
Description |
-M |
Process only multi-part articles. |
-i |
Interactive preselection of multi-part articles. |
|
These options apply to both single part and multi-part article processing:
Option |
Description |
-A |
Enables disk-based article assembly.
This will download the articles to disk (instead of RAM)
prior to decoding. |
-C |
Changes all non-alphanumeric characters in a filename to '_'.
This will eliminate spaces in filenames, as well as all other
undesirable (and possibly illegal) characters. |
-d |
Diagnostic mode. Downloads and writes all unread
articles in raw form. This occurs prior to single and
multi-part filtering. It's very useful to look at the raw
articles to see why they are failing to be selected for
downloading. Helpful for reverse-engineering new or bizarre
encoding formats.
You can also use this to perform your own post-processing
directly on the raw articles. |
-D |
Multi-part diagnostic mode. Downloads and writes all selected,
complete multi-part articles. Very useful to look at the raw
articles to see why they are failing to be decoded. Helpful
for reverse-engineering new or bizarre encoding formats.
You can also use this to perform your own post-processing
directly on the raw articles. |
-c file |
Use file as configuration file, instead of the default.
On Win32 platforms, the default is ubhrc. On Unix
platforms, the default is .ubhrc. |
-a |
Process all articles, but disregard the newsrc (i.e., consider
all articles even if they are marked as read in the newsrc, and
do not catch up the group at the end of processing of the group). |
-f num |
Process the first num articles. Updates newsrc. |
-l num |
Process the last num articles. Updates newsrc. |
-s |
Log all subjects to subjects.log. Log multi-part
subjects to multiparts.log. Doesn't download
anything. Disregards newsrc. |
-I regexp |
Inclusion search filter (enclose it in double quotes on the command line).
regexp is any valid Perl regular expression. |
-X regexp |
Exclusion search filter (enclose it in double quotes on the command line).
regexp is any valid Perl regular expression. |
-L |
Long filenames - uses the article subject as the filename.
This makes life easier because many folks encode their
files with terribly vague filenames. |
-n |
Updates the .newsrc every time an article is processed,
instead of waiting until the entire group has been processed. |
-O |
Forces ubh to overwrite duplicate filenames instead
of creating a unique filename. |
-r |
Logs rejected subjects to rejects.log in the group
directory. Logs rejected single part and multi-part articles.
Excellent for quality control to see if ubh is rejecting
any binaries, and essential for diagnosing why articles are
being rejected. Normally the rejected articles will contain
spam or discussion, but occasionally ubh will reject an
arcane format or malformed MIME-formatted message.
Rejected multi-part articles logged with this option are in
their assembled state, prefaced by the headers from the first
article. |
-u |
Prints out a brief usage summary. |
-w |
Prints out warranty information. |
-y |
Applies chmod 0666 to all output files. |
-Z |
Produces lots and lots of logs. |
-z |
Marks articles that don't pass inclusion/exclusion as
read. This cleans up the .newsrc dramatically. |
|
ubh requires the following Perl modules: Net::NNTP,
News::Newsrc, and Set::IntSpan.
These modules may already be installed on your system. To check, run the
following commands; each one prints nothing if the module is present and
reports "Can't locate ..." if it is missing:
perl -e "use Net::NNTP;"
perl -e "use News::Newsrc;"
perl -e "use Set::IntSpan;"
Net::NNTP is part of the libnet distribution,
which provides an entire family of networking modules.
If you need to install one or more of these modules, follow the
instructions below for your particular platform.
On Win32 with ActivePerl, use the Perl Package Manager (PPM) to download
and install the modules. At the PPM prompt, enter the following commands:
install libnet
install Set-IntSpan
install News-Newsrc
On Unix, download the modules from your nearest CPAN mirror:
modules/by-module/Net/libnet-1.0607.tar.gz
modules/by-module/Set/Set-IntSpan-1.07.tar.gz
modules/by-module/News/News-Newsrc-1.07.tar.gz
Feel free to use more recent versions of these modules, if available.
For each of those files, execute the following commands:
gunzip < module.tar.gz | tar xvf -
cd module
perl Makefile.PL
make
make test
make install
For more details, review the installation instructions provided with each
module.
On Mac OS X, use the Unix .tar.gz distribution.
You need to download (from Apple) "Unix tools for OS X", and then
proceed with the module installation as given above for Unix.
alt.binaries.sounds.mp3.1990s: 15558-23146
alt.binaries.sounds.mp3.1980s: 30139-35021
alt.binaries.pictures.autos:
alt.binaries.pictures.cartoons! 63671-64660
alt.binaries.pictures.hockey: 1-1406
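The lines above follow the usual .newsrc layout: a group name, then ":" for a subscribed group or "!" for an unsubscribed one, then the article numbers already marked as read. As a rough illustration, the Python sketch below splits one such line into its pieces. The parse_newsrc_line name is hypothetical, and ubh itself relies on the News::Newsrc module for this.

```python
def parse_newsrc_line(line):
    """Split one .newsrc line into (group, subscribed, ranges).

    ranges is a list of (first, last) article numbers marked as read.
    """
    line = line.strip()
    # The first ':' or '!' ends the group name.
    for i, ch in enumerate(line):
        if ch in ":!":
            break
    else:
        raise ValueError("not a newsrc line: %r" % line)
    group = line[:i]
    subscribed = ch == ":"
    ranges = []
    for item in line[i + 1:].split(","):
        item = item.strip()
        if not item:
            continue
        first, _, last = item.partition("-")
        # A single number such as "7" is a range of one article.
        ranges.append((int(first), int(last or first)))
    return group, subscribed, ranges
```

A group with nothing after the colon, such as alt.binaries.pictures.autos above, has no articles marked as read yet.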
You specify program options with a ubhrc file.
On Win32 systems, the default ubhrc file name is ubhrc. On
Unix systems, the default ubhrc file name is .ubhrc.
Here are the available options.
Specify one option per line, OPTION=value.
Comments begin with #.
Keyword |
Description |
NNTPSERVER |
Specifies the name of the news server to connect to.
Default is news. |
NNTPRETRIES |
Number of times to retry NNTP commands. |
NEWSRCNAME |
The name of the newsrc file.
On Win32 systems, the default newsrc file name is
newsrc. On Unix systems, the default newsrc
file name is .newsrc. |
DATADIR |
Directory to store downloaded group subdirectories and
downloaded binaries. Default is data. |
FORCEDIR |
Forces output to a specific directory instead
of the default data/newsgroup-name.
Handy for multiple newsgroups which hold similar stuff. |
TEMPDIR |
Directory to store encoded articles prior to decoding.
Default is DATADIR. |
PERMISSION |
Permission to apply to created subdirectories.
Default is 0777. |
MULTI_EXT |
Multi-part article file extensions. Any valid Perl regular
expression.
Default is (?i)asf|avi|gif|jpg|mov|mpg|mpeg|rm|rar. |
SINGLE_EXT |
Single part article file extensions. Any valid Perl regular
expression.
Default is (?i)asf|avi|gif|jpg|mov|mpg|mpeg|rm|rar. |
EXTENSIONS |
Sets both MULTI_EXT and SINGLE_EXT. |
ACCOUNT and PASSWORD |
News server account and password. You must define both of
these if your server requires them. By default, ubh will
access the server without authentication. |
OPT_g
OPT_i
OPT_d
OPT_D
OPT_a
OPT_f
OPT_l
OPT_r
OPT_s
OPT_S
OPT_M
OPT_I
OPT_X
OPT_C
OPT_L
OPT_z
OPT_Z
OPT_y |
All of the command-line options (except -u,
-c, and -w) can be set in the ubhrc
file, using OPT_x where x is
the corresponding command line option letter. |
|
Here is an example ubhrc file. This can be used to harvest single
and multi-part binaries in the pictures and multimedia newsgroups.
NNTPSERVER = binaries.newsfeeds.com
NEWSRCNAME = newsrc_nfdc
DATADIR = data
MULTI_EXT = (?i)asf|avi|gif|jpg|mov|mpg|rm
SINGLE_EXT = (?i)asf|avi|gif|jpg|mov|mpg|rm
ACCOUNT = bart
PASSWORD = the+simpsons
OPT_g = 1
Here is another example ubhrc file. This demonstrates how to search
for MP3s by a particular set of artists:
NNTPSERVER = mp3.newsfeeds.com
NEWSRCNAME = newsrc_mp3
DATADIR = data
MULTI_EXT = (?i)mp3
SINGLE_EXT = dontcare
OPT_a = 1
OPT_M = 1
OPT_I = (?i)oasis|horton.*heat|smiths|morrissey|bjork|foo.*fighters|green.*day
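If you want to check a ubhrc file with your own tooling, the OPTION=value format is simple to read. The Python sketch below is one possible reader, not ubh's own parser. The parse_ubhrc name is hypothetical, and it assumes that comments occupy whole lines and that spaces around the "=" are not significant, as the examples above suggest.

```python
def parse_ubhrc(text):
    """Parse OPTION=value lines into a dict; '#' lines are comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an OPTION=value line; skipped in this sketch
        options[key.strip()] = value.strip()
    return options


sample = """\
# MP3 search
NNTPSERVER = mp3.newsfeeds.com
MULTI_EXT = (?i)mp3
OPT_M = 1
OPT_I = (?i)oasis|horton.*heat
"""
opts = parse_ubhrc(sample)
```

Values are kept as strings, so a regular expression such as the OPT_I filter passes through unchanged.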
Here are some command-line usage examples:
ubh -i -M -I "(?i)rem|r.e.m.|u2|korn|hanson"
ubh -S -l 1000
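An inclusion filter is an ordinary regular expression, so a short alternative such as rem can also match inside longer words. The Python sketch below tries the filter from the first example against a few made-up subjects. Python's re syntax agrees with Perl's for the constructs used here; the sketch assumes ubh applies the filter as an unanchored match against the Subject header, and the included helper and the TIGHTER pattern are illustrative, not part of ubh.

```python
import re

# The filter from the first command-line example above.
PATTERN = re.compile(r"(?i)rem|r.e.m.|u2|korn|hanson")

# A stricter variant: whole words only, with the dots escaped.
TIGHTER = re.compile(r"(?i)\b(?:r\.?e\.?m\.?|u2|korn|hanson)\b")


def included(subject, pattern=PATTERN):
    """True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


wanted = included("R.E.M. - Losing My Religion.mp3 (1/9)")
# "Premium" contains "rem", so the loose filter accepts this one too.
accidental = included("Premium codec pack.rar (1/3)")
stricter = included("Premium codec pack.rar (1/3)", TIGHTER)
```

Here wanted and accidental are both true while stricter is false, which is why word boundaries (\b) and escaped dots are worth adding to short artist names.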
Copyright © 2000 Gerard Lanois
gerard 'at' users.sourceforge.net
P.O. Box 507264 San Diego, CA 92150-7264
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc., 59
Temple Place, Suite 330, Boston, MA 02111-1307 USA