[Noisebridge-discuss] Need help installing Python program "decruft" (web content extraction tool)

John Magolske listmail at b79.net
Mon Jan 24 21:56:25 PST 2011

I'm trying to install a particular Python program, a port of
Arc90's Readability project. It plucks the readable content out of
web pages:


I was wondering if someone with more python-fu might be able to point
the way towards successfully installing & using this (can't find any
contact info on the above linked sites or I'd ask there). See below
for details.

TIA for any help,



(on Debian Sid)

 % cd /home/john/bin/python
 % wget http://decruft.googlecode.com/files/decruft-0.1.tgz
 % tar -zxf decruft-0.1.tgz
 % cd decruft
 % ls
BeautifulSoup.py   decruft.py*  __init__.py     page_parser.pyc  url_helpers.pyc
BeautifulSoup.pyc  decruft.pyc  page_parser.py  url_helpers.py
 % sudo aptitude install python-lxml
    [ ... ]
Setting up python-lxml (2.2.8-2) ...
 % python
Python 2.6.6 (r266:84292, Oct  9 2010, 11:40:09)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from decruft import Document
>>> import urllib2
>>> f = urllib2.open(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'open'
>>> print Document(f.read()).summary()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'f' is not defined
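
A note on the two tracebacks above: the import of decruft succeeded, so the
install itself looks fine. The AttributeError comes from urllib2, which has
no open() function; the call for fetching a URL is urllib2.urlopen(). The
NameError is just a knock-on effect, since f was never assigned. Note also
that url has to be bound to a string before it is used. Below is a minimal
sketch of the corrected steps. The helper names fetch_html and
readable_summary are made up for illustration and are not part of decruft;
the Document(...).summary() call is taken from the session above, and the
Python 3 fallback import is only there so the fetch helper runs on either
interpreter.

```python
# Sketch only: works on Python 2 (urllib2) and Python 3 (urllib.request).
try:
    from urllib2 import urlopen          # Python 2, as in the session above
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalent


def fetch_html(url):
    """Return the raw bytes served at `url` (hypothetical helper)."""
    f = urlopen(url)   # urlopen(), not open()
    try:
        return f.read()
    finally:
        f.close()


def readable_summary(url):
    """Hypothetical helper: fetch `url` and pass the HTML to decruft."""
    # Imported here so fetch_html stays usable without decruft on the path.
    from decruft import Document
    return Document(fetch_html(url)).summary()
```

Run from the directory containing decruft.py, something like
print(readable_summary("http://example.com/")) should then reproduce the
intended last step of the session, with the address replaced by the page
of interest.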

John Magolske
