= 4.3.2 (20131002) =

* Fixed a bug in which short Unicode input was improperly encoded to
  ASCII when checking whether or not it was the name of a file on
  disk. [bug=1227016]

* Fixed a crash when a short input contains data not valid in
  filenames. [bug=1232604]

* Fixed a bug that caused Unicode data put into UnicodeDammit to
  return None instead of the original data. [bug=1214983]

* Combined two tests to stop a spurious test failure when tests are
  run by nosetests. [bug=1212445]

= 4.3.1 (20130815) =

* Fixed yet another problem with the html5lib tree builder, caused by
  html5lib's tendency to rearrange the tree during
  parsing. [bug=1189267]

* Fixed a bug that caused the optimized version of find_all() to
  return nothing. [bug=1212655]

= 4.3.0 (20130812) =

* Instead of converting incoming data to Unicode and feeding it to the
  lxml tree builder in chunks, Beautiful Soup now makes successive
  guesses at the encoding of the incoming data, and tells lxml to
  parse the data as that encoding. Giving lxml more control over the
  parsing process improves performance and avoids a number of bugs and
  issues with the lxml parser which had previously required elaborate
  workarounds:

  - An issue in which lxml refuses to parse Unicode strings on some
    systems. [bug=1180527]

  - A returning bug that truncated documents longer than a (very
    small) size. [bug=963880]

  - A returning bug in which extra spaces were added to a document if
    the document defined a charset other than UTF-8. [bug=972466]

  This required a major overhaul of the tree builder architecture. If
  you wrote your own tree builder and didn't tell me, you'll need to
  modify your prepare_markup() method.

* The UnicodeDammit code that makes guesses at encodings has been
  split into its own class, EncodingDetector. A lot of apparently
  redundant code has been removed from Unicode, Dammit, and some
  undocumented features have also been removed.

* Beautiful Soup will issue a warning if instead of markup you pass it
  a URL or the name of a file on disk (a common beginner's mistake).

* A number of optimizations improve the performance of the lxml tree
  builder by about 33%, the html.parser tree builder by about 20%, and
  the html5lib tree builder by about 15%.

* All find_all calls should now return a ResultSet object. Patch by
  Aaron DeVore. [bug=1194034]

= 4.2.1 (20130531) =

* The default XML formatter will now replace ampersands even if they
  appear to be part of entities. That is, "&lt;" will become
  "&amp;lt;". The old code was left over from Beautiful Soup 3, which
  didn't always turn entities into Unicode characters.

  If you really want the old behavior (maybe because you add new
  strings to the tree, those strings include entities, and you want
  the formatter to leave them alone on output), it can be found in
  EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]

* Gave new_string() the ability to create subclasses of
  NavigableString. [bug=1181986]

* Fixed another bug by which the html5lib tree builder could create a
  disconnected tree. [bug=1182089]

* The .previous_element of a BeautifulSoup object is now always None,
  not the last element to be parsed. [bug=1182089]

* Fixed test failures when lxml is not installed. [bug=1181589]

* html5lib now supports Python 3. Fixed some Python 2-specific
  code in the html5lib test suite. [bug=1181624]

* The html.parser treebuilder can now handle numeric attributes in
  text when the hexadecimal name of the attribute starts with a
  capital X. Patch by Tim Shirley. [bug=1186242]

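The difference between the two formatter behaviors described in 4.2.1
can be seen by calling both EntitySubstitution methods on the same
string. This is a small sketch; the sample string is made up for
illustration.

```python
from bs4.dammit import EntitySubstitution

# Hypothetical sample: one existing entity, one bare ampersand and one
# bare angle bracket.
text = "&lt; & <"

# Default behavior since 4.2.1: every ampersand is escaped, including
# the one that already starts an entity.
escaped_all = EntitySubstitution.substitute_xml(text)

# Old behavior: existing entities are left alone; only bare ampersands
# and angle brackets are escaped.
escaped_bare = EntitySubstitution.substitute_xml_containing_entities(text)

print(escaped_all)   # &amp;lt; &amp; &lt;
print(escaped_bare)  # &lt; &amp; &lt;
```
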
= 4.2.0 (20130514) =

* The Tag.select() method now supports a much wider variety of CSS
  selectors.

  - Added support for the adjacent sibling combinator (+) and the
    general sibling combinator (~). Tests by "liquider". [bug=1082144]

  - The combinators (>, +, and ~) can now combine with any supported
    selector, not just one that selects based on tag name.

  - Added limited support for the "nth-of-type" pseudo-class. Code
    by Sven Slootweg. [bug=1109952]

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
  it quicker to type the import statement in an interactive session:

    from bs4 import _s
     or
    from bs4 import _soup

  The alias may change in the future, so don't use this in code you're
  going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
  functions for reporting problems and doing tech support.

  - diagnose(data) tries the given markup on every installed parser,
    reporting exceptions and displaying successes. If a parser is not
    installed, diagnose() mentions this fact.

  - lxml_trace(data, html=True) runs the given markup through lxml's
    XML parser or HTML parser, and prints out the parser events as
    they happen. This helps you quickly determine whether a given
    problem occurs in lxml code or Beautiful Soup code.

  - htmlparser_trace(data) is the same thing, but for Python's
    built-in HTMLParser class.

* In an HTML document, the contents of a <script> or <style> tag will
  no longer undergo entity substitution by default. XML documents work
  the same way they did before. [bug=1085953]

* Methods like get_text() and properties like .strings now only give
  you strings that are visible in the document--no comments or
  processing instructions. [bug=1050164]

* The prettify() method now leaves the contents of <pre> tags
  alone. [bug=1095654]

* Fix a bug in the html5lib treebuilder which sometimes created
  disconnected trees. [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
  an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
  being passed on. [bug=1126734]

* Stop a crash when unwisely messing with a tag that's been
  decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
  corresponding problem on the Beautiful Soup end that was previously
  invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
  anything. Code by Stefaan Lippens. [bug=1168167]

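A short sketch of the selector features and the get_text() change
listed under 4.2.0. The markup and the id values are invented for the
example, and the built-in html.parser builder is used so that nothing
beyond Beautiful Soup itself has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
markup = (
    "<body><h1>Title</h1>"
    '<p id="a">one</p><p id="b">two</p>'
    '<div><p id="c">three</p></div></body>'
)
soup = BeautifulSoup(markup, "html.parser")

# Adjacent sibling combinator: the <p> immediately after the <h1>.
adjacent = [p["id"] for p in soup.select("h1 + p")]
# General sibling combinator: every later <p> sibling of the <h1>.
siblings = [p["id"] for p in soup.select("h1 ~ p")]
# Child combinator combined with a tag selector.
direct = [p["id"] for p in soup.select("div > p")]
# nth-of-type: the second <p> among its own siblings.
second = [p["id"] for p in soup.select("p:nth-of-type(2)")]

# get_text() skips comments and returns only the visible strings.
commented = BeautifulSoup("<p>Hello<!-- hidden --> world</p>", "html.parser")
visible = commented.get_text()

print(adjacent, siblings, direct, second, repr(visible))
```
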
= 4.1.3 (20120820) =

* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
  test failure caused by the lousy HTMLParser in those
  versions. [bug=1038503]

* Raise a more specific error (FeatureNotFound) when a requested
  parser or parser feature is not installed. Raise NotImplementedError
  instead of ValueError when the user calls insert_before() or
  insert_after() on the BeautifulSoup object itself. Patch by Aaron
  DeVore. [bug=1038301]

= 4.1.2 (20120817) =

* As per PEP-8, allow searching by CSS class using the 'class_'
  keyword argument. [bug=1037624]

* Display namespace prefixes for namespaced attribute names, instead of
  the fully-qualified names given by the lxml parser. [bug=1037597]

* Fixed a crash on encoding when an attribute name contained
  non-ASCII characters.

* When sniffing encodings, if the cchardet library is installed,
  Beautiful Soup uses it instead of chardet. cchardet is much
  faster. [bug=1020748]

* Use logging.warning() instead of warnings.warn() to notify the user
  that characters were replaced with REPLACEMENT
  CHARACTER. [bug=1013862]

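The 'class_' keyword from 4.1.2 exists because 'class' is a reserved
word in Python. A minimal sketch, with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the example.
markup = (
    '<a class="external" href="/x">x</a>'
    '<a class="internal big" href="/y">y</a>'
)
soup = BeautifulSoup(markup, "html.parser")

# class_ matches any one of a tag's CSS classes.
internal_links = [a["href"] for a in soup.find_all("a", class_="internal")]
print(internal_links)  # ['/y']
```
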
= 4.1.1 (20120703) =

* Fixed an html5lib tree builder crash which happened when html5lib
  moved a tag with a multivalued attribute from one part of the tree
  to another. [bug=1019603]

* Correctly display closing tags with an XML namespace declared. Patch
  by Andreas Kostyrka. [bug=1019635]

* Fixed a typo that made parsing significantly slower than it should
  have been, and also waited too long to close tags with XML
  namespaces. [bug=1020268]

* get_text() now returns an empty Unicode string if there is no text,
  rather than an empty bytestring. [bug=1020387]

= 4.1.0 (20120529) =

* Added experimental support for fixing Windows-1252 characters
  embedded in UTF-8 documents. (UnicodeDammit.detwingle())

* Fixed the handling of &quot; with the built-in parser. [bug=993871]

* Comments, processing instructions, document type declarations, and
  markup declarations are now treated as preformatted strings, the way
  CData blocks are. [bug=1001025]

* Fixed a bug with the lxml treebuilder that prevented the user from
  adding attributes to a tag that didn't originally have
  attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.

* Fixed some edge-case bugs having to do with inserting an element
  into a tag it's already inside, and replacing one of a tag's
  children with another. [bug=997529]

* Added the ability to search for attribute values specified in UTF-8. [bug=1003974]

  This caused a major refactoring of the search code. All the tests
  pass, but it's possible that some searches will behave differently.

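UnicodeDammit.detwingle() from 4.1.0 can be tried on a byte string
that mixes the two encodings. The sample below is a sketch: the text
is invented, and the mixed document is built by hand by concatenating
UTF-8 bytes with Windows-1252 bytes.

```python
from bs4 import UnicodeDammit

# Three snowmen encoded as UTF-8, followed by a quotation wrapped in
# Windows-1252 smart quotes.
snowmen = "\N{SNOWMAN}" * 3
quote = (
    "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!"
    "\N{RIGHT DOUBLE QUOTATION MARK}"
)
mixed = snowmen.encode("utf8") + quote.encode("windows_1252")

# detwingle() rewrites the Windows-1252 bytes as UTF-8, so the whole
# byte string can then be decoded as UTF-8.
fixed = UnicodeDammit.detwingle(mixed)
text = fixed.decode("utf8")
print(text)
```
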
= 4.0.5 (20120427) =

* Added a new method, wrap(), which wraps an element in a tag.

* Renamed replace_with_children() to unwrap(), which is easier to
  understand and also the jQuery name of the function.

* Made encoding substitution in <meta> tags completely transparent (no
  more %SOUP-ENCODING%).

* Fixed a bug in decoding data that contained a byte-order mark, such
  as data encoded in UTF-16LE. [bug=988980]

* Fixed a bug that made the HTMLParser treebuilder generate XML
  definitions ending with two question marks instead of
  one. [bug=984258]

* Upon document generation, CData objects are no longer run through
  the formatter. [bug=988905]

* The test suite now passes when lxml is not installed, whether or not
  html5lib is installed. [bug=987004]

* Print a warning on HTMLParseErrors to let people know they should
  install a better parser library.

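wrap() and unwrap() from 4.0.5 are inverses of each other, which a
round trip shows. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# wrap() puts the string inside a newly created <b> tag.
soup.p.string.wrap(soup.new_tag("b"))
wrapped = str(soup)

# unwrap() removes the <b> tag again but keeps its contents in place.
soup.b.unwrap()
unwrapped = str(soup)

print(wrapped)    # <p><b>I wish I was bold.</b></p>
print(unwrapped)  # <p>I wish I was bold.</p>
```
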
= 4.0.4 (20120416) =

* Fixed a bug that sometimes created disconnected trees.

* Fixed a bug with the string setter that moved a string around the
  tree instead of copying it. [bug=983050]

* Attribute values are now run through the provided output formatter.
  Previously they were always run through the 'minimal' formatter. In
  the future I may make it possible to specify different formatters
  for attribute values and strings, but for now, consistent behavior
  is better than inconsistent behavior. [bug=980237]

* Added the missing renderContents method from Beautiful Soup 3. Also
  added an encode_contents() method to go along with decode_contents().

* Give a more useful error when the user tries to run the Python 2
  version of BS under Python 3.

* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
  UnicodeDammit(markup, smart_quotes_to="ascii").

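The smart_quotes_to="ascii" option from 4.0.4 can be sketched as
follows. The markup is invented, and passing "windows-1252" as a
suggested encoding is an assumption made so that the example does not
depend on encoding detection.

```python
from bs4 import UnicodeDammit

# Bytes containing Windows-1252 smart quotes (\x93, \x94, \x92).
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
converted = dammit.unicode_markup

print(converted)
# <p>I just "love" Microsoft Word's smart quotes</p>
```
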
= 4.0.3 (20120403) =

* Fixed a typo that caused some versions of Python 3 to convert the
  Beautiful Soup codebase incorrectly.

* Got rid of the 4.0.2 workaround for HTML documents--it was
  unnecessary and the workaround was triggering a (possibly different,
  but related) bug in lxml. [bug=972466]

= 4.0.2 (20120326) =

* Worked around a possible bug in lxml that prevents non-tiny XML
  documents from being parsed. [bug=963880, bug=963936]

* Fixed a bug where specifying `text` while also searching for a tag
  only worked if `text` wanted an exact string match. [bug=955942]

= 4.0.1 (20120314) =

* This is the first official release of Beautiful Soup 4. There is no
  4.0.0 release, to eliminate any possibility that packaging software
  might treat "4.0.0" as being an earlier version than "4.0.0b10".

* Brought BS up to date with the latest release of soupselect, adding
  CSS selector support for direct descendant matches and multiple CSS
  class matches.

= 4.0.0b10 (20120302) =

* Added support for simple CSS selectors, taken from the soupselect project.

* Fixed a crash when using html5lib. [bug=943246]

* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
  attribute is now replaced with the appropriate encoding on
  output. [bug=942714]

* Fixed a bug that caused calling a tag to sometimes call find_all()
  with the wrong arguments. [bug=944426]

* For backwards compatibility, brought back the BeautifulStoneSoup
  class as a deprecated wrapper around BeautifulSoup.

= 4.0.0b9 (20120228) =

* Fixed the string representation of DOCTYPEs that have both a public
  ID and a system ID.

* Fixed the generated XML declaration.

* Renamed Tag.nsprefix to Tag.prefix, for consistency with
  NamespacedAttribute.

* Fixed a test failure that occurred on Python 3.x when chardet was
  installed.

* Made prettify() return Unicode by default, so it will look nice on
  Python 3 when passed into print().

= 4.0.0b8 (20120224) =

* All tree builders now preserve namespace information in the
  documents they parse. If you use the html5lib parser or lxml's XML
  parser, you can access the namespace URL for a tag as tag.namespace.

  However, there is no special support for namespace-oriented
  searching or tree manipulation. When you search the tree, you need
  to use namespace prefixes exactly as they're used in the original
  document.

* The string representation of a DOCTYPE always ends in a newline.

* Issue a warning if the user tries to use a SoupStrainer in
  conjunction with the html5lib tree builder, which doesn't support
  them.

= 4.0.0b7 (20120223) =

* Upon decoding to string, any characters that can't be represented in
  your chosen encoding will be converted into numeric XML entity
  references.

* Issue a warning if characters were replaced with REPLACEMENT
  CHARACTER during Unicode conversion.

* Restored compatibility with Python 2.6.

* The install process no longer installs docs or auxiliary text files.

* It's now possible to deepcopy a BeautifulSoup object created with
  Python's built-in HTML parser.

* About 100 unit tests that "test" the behavior of various parsers on
  invalid markup have been removed. Legitimate changes to those
  parsers caused these tests to fail, indicating that perhaps
  Beautiful Soup should not test the behavior of foreign
  libraries.

  The problematic unit tests have been reformulated as informational
  comparisons generated by the script
  scripts/demonstrate_parser_differences.py.

  This makes Beautiful Soup compatible with html5lib version 0.95 and
  future versions of HTMLParser.

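The numeric-entity fallback listed first under 4.0.0b7 shows up when a
tag is encoded to an encoding that lacks one of its characters. A
minimal sketch; the snowman markup is just a convenient non-ASCII
sample.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>\N{SNOWMAN}</b>", "html.parser")
tag = soup.b

# UTF-8 can represent the snowman, so it is written out as bytes.
as_utf8 = tag.encode("utf-8")

# ASCII cannot, so the character becomes a numeric XML entity reference.
as_ascii = tag.encode("ascii")

print(as_utf8)   # b'<b>\xe2\x98\x83</b>'
print(as_ascii)  # b'<b>&#9731;</b>'
```
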
= 4.0.0b6 (20120216) =

* Multi-valued attributes like "class" always have a list of values,
  even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
  empty-element tag. This may come back if I add a special XHTML mode
  (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
  useless.

* Passing text along with tag-specific arguments to a find* method:

    find("a", text="Click here")

  will find tags that contain the given text as their
  .string. Previously, the tag-specific arguments were ignored and
  only strings were searched.

* Fixed a bug that caused the html5lib tree builder to build a
  partially disconnected tree. Generally cleaned up the html5lib tree
  builder.

* If you restrict a multi-valued attribute like "class" to a string
  that contains spaces, Beautiful Soup will only consider it a match
  if the values correspond to that specific string.

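The multi-valued attribute rules from 4.0.0b6 can be sketched like
this. The markup is invented; "class" comes back as a list while "id"
stays a plain string, and a search string containing a space has to
match the attribute's full value exactly.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the example.
markup = (
    '<p class="body" id="intro">a</p>'
    '<p class="body strikeout">b</p>'
)
soup = BeautifulSoup(markup, "html.parser")
first, second = soup.find_all("p")

# "class" is always a list, even with a single value.
single = first["class"]
double = second["class"]
# "id" is not multi-valued, so it remains a string.
ident = first["id"]

# A value containing spaces only matches that exact string.
exact = soup.find_all("p", attrs={"class": "body strikeout"})
reordered = soup.find_all("p", attrs={"class": "strikeout body"})

print(single, double, ident, len(exact), len(reordered))
```
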
= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
  belonging to multiple CSS classes is treated as having a list of
  values for the 'class' attribute. Searching for a CSS class will
  match *any* of the CSS classes.

  This actually affects all attributes that the HTML standard defines
  as taking multiple values (class, rel, rev, archive, accept-charset,
  and headers), but 'class' is by far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
  to one of the find* methods, it'll assume you want to use that
  object to search against a tag's CSS classes. Previously this only
  worked if you passed in a string.

* Fixed a bug that caused a crash when you passed a dictionary as an
  attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
  like <meta charset="utf-8" />. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
  page, it will try each of its guesses again, with errors="replace"
  instead of errors="strict". This may mean that some data gets
  replaced with REPLACEMENT CHARACTER, but at least most of it will
  get turned into Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
  on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
  empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
  begin with.

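The "strict first, then replace" fallback described for Unicode,
Dammit in 4.0.0b5 can be illustrated with the standard library alone.
This is only a sketch of the idea, not Beautiful Soup's actual code;
the function name decode_with_guesses and the list of guesses are
hypothetical.

```python
def decode_with_guesses(data, guesses):
    """Return (text, encoding, had_replacements) for the first usable guess.

    Every guess is tried with errors="strict" first. Only if all of
    them fail is each guess retried with errors="replace", which swaps
    undecodable bytes for U+FFFD REPLACEMENT CHARACTER.
    """
    for errors in ("strict", "replace"):
        for encoding in guesses:
            try:
                text = data.decode(encoding, errors)
            except (UnicodeDecodeError, LookupError):
                continue
            return text, encoding, errors == "replace"
    raise ValueError("no usable encoding among the guesses")


# Valid UTF-8: the second guess succeeds in strict mode.
clean = decode_with_guesses(b"caf\xc3\xa9", ["ascii", "utf-8"])

# A lone \xe9 byte is neither ASCII nor valid UTF-8, so the first
# guess is retried with errors="replace".
lossy = decode_with_guesses(b"caf\xe9", ["ascii", "utf-8"])

print(clean)
print(lossy)
```
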
= 4.0.0b4 (20120208) =

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()

* BeautifulSoup.new_tag() will follow the rules of whatever
  tree-builder was used to create the original BeautifulSoup object. A
  new <p> tag will look like "<p />" if the soup object was created to
  parse XML, but it will look like "<p></p>" if the soup object was
  created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
  improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
  strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
  more general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
  entities unless the user prevents it.

* Added PageElement.insert_before() and PageElement.insert_after(),
  which let you put an element into the parse tree with respect to
  some other element.

* Raise an exception when the user tries to do something nonsensical
  like insert a tag into itself.


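new_tag(), new_string(), insert_before() and insert_after() from
4.0.0b4 work together when building up a tree by hand. A minimal
sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>middle</p>", "html.parser")

# Build a new <h1> tag containing a new string.
heading = soup.new_tag("h1")
heading.append(soup.new_string("first"))

# Place elements relative to the existing <p> tag.
soup.p.insert_before(heading)
soup.p.insert_after(soup.new_string("last"))

result = str(soup)
print(result)  # <h1>first</h1><p>middle</p>last
```
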
= 4.0.0b3 (20120203) =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

  >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

  <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

  <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

  <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

  <?xml version="1.0" encoding="utf-8"?>
  <markup>
   ...
  </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.


= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.

= 3.1.0 =

A hybrid version that supports 2.4 and can be automatically converted
to run under Python 3.0. There are three backwards-incompatible
changes you should be aware of, but no new features or deliberate
behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a string and decode() to get Unicode, and
you'll be ready (well, readier) for Python 3.

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

  <a href="foo</a>, </a><a href="bar">baz</a>
  <a b="<a>">', '<a b="&lt;a&gt;"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

  <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.


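The third point above can be checked directly against Python 3's
html.parser, without Beautiful Soup. This is a small sketch; the
AttrCollector class is a hypothetical helper written for the example.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Records the (name, value) attribute pairs of every start tag."""

    def __init__(self):
        super().__init__()
        self.attrs_seen = []

    def handle_starttag(self, tag, attrs):
        self.attrs_seen.extend(attrs)


collector = AttrCollector()
collector.feed('<a href="http://crummy.com?sacr&eacute;&bleu">')
collector.close()

# The parser has already replaced &eacute; with the character U+00E9.
href = dict(collector.attrs_seen)["href"]
print(ascii(href))
```
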
= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.


= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.


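The difference between extract() and decompose() from 3.0.6 can be
sketched as follows. The example assumes the Beautiful Soup 4 import
and the html.parser builder, since those are what current
installations provide; the markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>keep <b>bold</b> <i>italic</i></p>", "html.parser")

# extract() removes the tag from the tree and hands it back, so it can
# still be inspected or inserted somewhere else.
removed = soup.b.extract()
removed_markup = str(removed)

# decompose() removes the tag and destroys it; nothing is returned.
soup.i.decompose()

remaining_tags = [tag.name for tag in soup.find_all(True)]
print(removed_markup, remaining_tags)
```
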
= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)

= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.

= 3.0.3 (20060606) =

Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be
sure to pass in an appropriate value for convertEntities, or XML/HTML
entities might stick around that aren't valid in HTML/XML). The result
may not validate, but it should be good enough to not choke a
real-world XML parser. Specifically, the output of a properly
constructed soup object should always be valid as part of an XML
document, but parts may be missing if they were missing in the
original. As always, if the input is valid XML, the output will also
be valid.

= 3.0.2 (20060602) =

Previously, Beautiful Soup correctly handled attribute values that
contained embedded quotes (sometimes by escaping), but not other kinds
of XML character. Now, it correctly handles or escapes all special XML
characters in attribute values.

I aliased methods to the 2.x names (fetch, find, findText, etc.) for
backwards compatibility purposes. Those names are deprecated and if I
ever do a 4.0 I will remove them. I will, I tell you!

Fixed a bug where the findAll method wasn't passing along any keyword
arguments.

When run from the command line, Beautiful Soup now acts as an HTML
pretty-printer, not an XML pretty-printer.

= 3.0.1 (20060530) =

Reintroduced the "fetch by CSS class" shortcut. I thought keyword
arguments would replace it, but they don't. You can't call soup('a',
class='foo') because class is a Python keyword.

If Beautiful Soup encounters a meta tag that declares the encoding,
but a SoupStrainer tells it not to parse that tag, Beautiful Soup will
no longer try to rewrite the meta tag to mention the new
encoding. Basically, this makes SoupStrainers work in real-world
applications instead of crashing the parser.

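Both 3.0.1 items, the "fetch by CSS class" shortcut and the
SoupStrainer, can be sketched briefly. The example assumes the
Beautiful Soup 4 spelling of the API (the bs4 module and the
parse_only argument), because that is what current installations
offer; the markup is made up.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup for the example.
markup = (
    "<p>intro</p>"
    '<a class="foo" href="/1">one</a>'
    '<a class="bar" href="/2">two</a>'
)

# A string passed as the second positional argument is matched
# against each tag's CSS classes.
soup = BeautifulSoup(markup, "html.parser")
by_class = [a["href"] for a in soup("a", "foo")]

# A SoupStrainer restricts parsing to the matching tags, so the <p>
# never makes it into the tree.
only_links = BeautifulSoup(markup, "html.parser", parse_only=SoupStrainer("a"))
hrefs = [a["href"] for a in only_links.find_all("a")]

print(by_class, hrefs, only_links.find("p"))
```
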
723= 3.0.0 "Who would not give all else for two p" (20060528) =
724
725This release is not backward-compatible with previous releases. If
726you've got code written with a previous version of the library, go
727ahead and keep using it, unless one of the features mentioned here
728really makes your life easier. Since the library is self-contained,
729you can include an old copy of the library in your old applications,
730and use the new version for everything else.
731
732The documentation has been rewritten and greatly expanded with many
733more examples.
734
735Beautiful Soup autodetects the encoding of a document (or uses the one
736you specify), and converts it from its native encoding to
737Unicode. Internally, it only deals with Unicode strings. When you
738print out the document, it converts to UTF-8 (or another encoding you
739specify). [Doc reference]
740
741It's now easy to make large-scale changes to the parse tree without
742screwing up the navigation members. The methods are extract,
743replaceWith, and insert. [Doc reference. See also Improving Memory
744Usage with extract]
745
746Passing True in as an attribute value gives you tags that have any
747value for that attribute. You don't have to create a regular
748expression. Passing None for an attribute value gives you tags that
749don't have that attribute at all.
750
751Tag objects now know whether or not they're self-closing. This avoids
752the problem where Beautiful Soup thought that tags like <BR /> were
753self-closing even in XML documents. You can customize the self-closing
754tags for a parser object by passing them in as a list of
755selfClosingTags: you don't have to subclass anymore.
756
757There's a new built-in parser, MinimalSoup, which has most of
758BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc
759reference]
760
761You can use a SoupStrainer to tell Beautiful Soup to parse only part
762of a document. This saves time and memory, often making Beautiful Soup
763about as fast as a custom-built SGMLParser subclass. [Doc reference,
764SoupStrainer reference]
765
766You can (usually) use keyword arguments instead of passing a
767dictionary of attributes to a search method. That is, you can replace
768soup(args={"id" : "5"}) with soup(id="5"). You can still use args if
769(for instance) you need to find an attribute whose name clashes with
770the name of an argument to findAll. [Doc reference: **kwargs attrs]
771
772The method names have changed to the better method names used in
773Rubyful Soup. Instead of find methods and fetch methods, there are
774only find methods. Instead of a scheme where you can't remember which
775method finds one element and which one finds them all, we have find
776and findAll. In general, if the method name mentions All or a plural
777noun (eg. findNextSiblings), then it finds many elements
778method. Otherwise, it only finds one element. [Doc reference]
779
780Some of the argument names have been renamed for clarity. For instance
781avoidParserProblems is now parserMassage.
782
783Beautiful Soup no longer implements a feed method. You need to pass a
784string or a filehandle into the soup constructor, not with feed after
785the soup has been created. There is still a feed method, but it's the
786feed method implemented by SGMLParser and calling it will bypass
787Beautiful Soup and cause problems.

The NavigableText class has been renamed to NavigableString. There is
no NavigableUnicodeString anymore, because every string inside a
Beautiful Soup parse tree is a Unicode string.

findText and fetchText are gone. Just pass a text argument into find
or findAll.

Null was more trouble than it was worth, so I got rid of it. Anything
that used to return Null now returns None.

Special XML constructs like comments and CDATA now have their own
NavigableString subclasses, instead of being treated as oddly-formed
data. If you parse a document that contains CDATA and write it back
out, the CDATA will still be there.
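
One way to picture this: each kind of text node is a string subclass
that knows how to write itself back out. The class names echo the
ones mentioned above, but the output method and everything else in
this sketch are assumptions made for illustration.

```python
class NavigableString(str):
    """Hypothetical text node that knows how to write itself back out."""

    def output(self):
        return str(self)


class CData(NavigableString):
    def output(self):
        # Wrap the text in its original CDATA delimiters.
        return "<![CDATA[%s]]>" % str(self)


class Comment(NavigableString):
    def output(self):
        # Wrap the text in comment delimiters.
        return "<!--%s-->" % str(self)


nodes = [NavigableString("a "), CData("x < y"), Comment(" note ")]
round_trip = "".join(node.output() for node in nodes)
```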

When you're parsing a document, you can get Beautiful Soup to convert
XML or HTML entities into the corresponding Unicode characters. [Doc
reference]
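
To show what this kind of conversion means, here is the same idea
using html.unescape from the Python 3 standard library. This is not
the Beautiful Soup option referred to above, only an illustration of
turning entities into Unicode characters.

```python
import html

# Named and numeric entities both become the characters they stand for.
converted = html.unescape("caf&eacute; &amp; &#8220;tea&#8221;")
```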

= 2.1.1 (20050918) =

Fixed a serious performance bug in BeautifulStoneSoup which was
causing parsing to be incredibly slow.

Corrected several entities that were previously being incorrectly
translated from Microsoft smart-quote-like characters.

Fixed a bug that was breaking text fetch.

Fixed a bug that crashed the parser when text chunks that look like
HTML tag names showed up within a SCRIPT tag.

THEAD, TBODY, and TFOOT tags are now nestable within TABLE
tags. Nested tables should parse more sensibly now.

BASE is now considered a self-closing tag.

= 2.1.0 "Game, or any other dish?" (20050504) =

Added a wide variety of new search methods which, given a starting
point inside the tree, follow a particular navigation member (like
nextSibling) over and over again, looking for Tag and NavigableText
objects that match certain criteria. The new methods are findNext,
fetchNext, findPrevious, fetchPrevious, findNextSibling,
fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
findParent, and fetchParents. All of these use the same basic code
used by first and fetch, so you can pass your weird ways of matching
things into these methods.

The fetch method and its derivatives now accept a limit argument.
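
The idea of following one navigation member repeatedly, with an
optional limit, can be sketched as below. Node, follow, and is_a are
hypothetical names for this illustration; only nextSibling and limit
come from the notes above.

```python
class Node:
    """Hypothetical tree node with a single navigation member."""

    def __init__(self, name):
        self.name = name
        self.nextSibling = None


def follow(start, member, matches, limit=None):
    """Follow `member` from `start`, collecting nodes accepted by `matches`.

    Stops early once `limit` results have been collected.
    """
    results = []
    node = getattr(start, member)
    while node is not None:
        if matches(node):
            results.append(node)
            if limit is not None and len(results) >= limit:
                break
        node = getattr(node, member)
    return results


def is_a(node):
    return node.name == "a"


# Build a chain of siblings: a -> b -> a -> a
nodes = [Node(n) for n in ("a", "b", "a", "a")]
for left, right in zip(nodes, nodes[1:]):
    left.nextSibling = right

all_later = follow(nodes[0], "nextSibling", is_a)
first_later = follow(nodes[0], "nextSibling", is_a, limit=1)
```

Because the member name is a parameter, the same loop would serve for
previous, parent, or any other navigation member.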

You can now pass keyword arguments when calling a Tag object as though
it were a method.

Fixed a bug that caused all hand-created tags to share a single set of
attributes.
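
These notes don't say what caused the shared attributes. One common
way this symptom arises in Python is a mutable default argument, and
the sketch below assumes that cause purely for illustration.
SharedTag and FixedTag are hypothetical classes.

```python
class SharedTag:
    """Buggy: the default dict is created once and shared by every instance."""

    def __init__(self, name, attrs={}):
        self.name = name
        self.attrs = attrs


class FixedTag:
    """Fixed: each instance gets its own dict when none is supplied."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = {} if attrs is None else attrs


a, b = SharedTag("a"), SharedTag("b")
a.attrs["href"] = "/"      # also shows up on b

c, d = FixedTag("a"), FixedTag("b")
c.attrs["href"] = "/"      # d is unaffected
```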

= 2.0.3 (20050501) =

Fixed Python 2.2 support for iterators.

Fixed a bug that gave the wrong representation to tags within quote
tags like <script>.

Took some code from Mark Pilgrim that treats CDATA declarations as
data instead of ignoring them.

Beautiful Soup's setup.py will now do an install even if the unit
tests fail. It won't build a source distribution if the unit tests
fail, so I can't release a new version unless they pass.

= 2.0.2 (20050416) =

Added the unit tests in a separate module, and packaged it with
distutils.

Fixed a bug that sometimes caused renderContents() to return a Unicode
string even if there was no Unicode in the original string.

Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.
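
A toy model of that behaviour: open tags live on a stack, and done()
empties the stack. TinyParser and its word-per-tag input format are
made up for this sketch and bear no relation to the real parser.

```python
class TinyParser:
    """Hypothetical parser that tracks open tags on a stack."""

    def __init__(self, text=None):
        self.open_tags = []
        self.output = []
        if text is not None:
            self.feed(text)
            self.done()          # called automatically when text is supplied

    def feed(self, text):
        """Treat each whitespace-separated word as an opening tag name."""
        for name in text.split():
            self.open_tags.append(name)
            self.output.append("<%s>" % name)

    def done(self):
        """Close every tag that is still open, innermost first."""
        while self.open_tags:
            self.output.append("</%s>" % self.open_tags.pop())


automatic = TinyParser("a b")

manual = TinyParser()
manual.feed("a")
still_open = list(manual.open_tags)
manual.done()                    # without text in the constructor, call it yourself
```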

Reinstated some backwards compatibility with 1.x versions: referencing
the string member of a NavigableText object returns the NavigableText
object instead of throwing an error.

= 2.0.1 (20050412) =

Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, eg. tag.table.td.

Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.
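
The 'hidden' fix relies on a general Python rule: __getattr__ is only
consulted when ordinary attribute lookup fails. A minimal sketch,
using a hypothetical Tag class rather than the real one:

```python
class Tag:
    """Hypothetical tag whose unknown attributes are looked up as child tags."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        # Defined on every instance, so tag.hidden never reaches __getattr__.
        self.hidden = False

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        for child in self.children:
            if child.name == attr:
                return child
        return None


table = Tag("table", [Tag("td")])
```

Here table.td goes through __getattr__ and finds the child, while
table.hidden is answered directly by the instance attribute.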

Fixed a bug in the comparison operator.

= 2.0.0 "Who cares for fish?" (20050410) =

Beautiful Soup version 1 was very useful but also pretty stupid. I
originally wrote it without noticing any of the problems inherent in
trying to build a parse tree out of ambiguous HTML tags. This version
solves all of those problems to my satisfaction. It also adds many new
clever things to make up for the removal of the stupid things.

== Parsing ==

The parser logic has been greatly improved, and the BeautifulSoup
class should much more reliably yield a parse tree that looks like
what the page author intended. For a particular class of odd edge
cases that now causes problems, there is a new class,
ICantBelieveItsBeautifulSoup.

By default, Beautiful Soup now performs some cleanup operations on
text before parsing it. This is to avoid common problems with bad
definitions and self-closing tags that crash SGMLParser. You can
provide your own set of cleanup operations, or turn it off
altogether. The cleanup operations include fixing self-closing tags
that don't close, and replacing Microsoft smart quotes and similar
characters with their HTML entity equivalents.
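
A set of cleanup operations can be pictured as a list of (pattern,
replacement) pairs applied in order. The CLEANUP list, the clean
function, and the two patterns below are assumptions chosen for
illustration. In particular, reading "self-closing tags that don't
close" as tags written like <br/> is a guess, and \x91 and \x92 stand
in for the Windows-1252 single smart quotes.

```python
import re

# Hypothetical cleanup list: (compiled pattern, replacement function) pairs.
CLEANUP = [
    # "<br/>" -> "<br />": put a space before the slash of a self-closing tag.
    (re.compile(r"(<[^<>]*)/>"), lambda m: m.group(1) + " />"),
    # Replace single smart quotes with HTML entities.
    (
        re.compile("[\x91\x92]"),
        lambda m: "&rsquo;" if m.group(0) == "\x92" else "&lsquo;",
    ),
]


def clean(markup, operations=CLEANUP):
    """Apply each cleanup operation in order; pass [] to turn cleanup off."""
    for pattern, replacement in operations:
        markup = pattern.sub(replacement, markup)
    return markup


cleaned = clean("<br/>It\x92s")
untouched = clean("<br/>", operations=[])
```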

You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.

== Strings and Unicode ==

There are separate NavigableText subclasses for ASCII and Unicode
strings. These classes directly subclass the corresponding base data
types. This means you can treat NavigableText objects as strings
instead of having to call methods on them to get the strings.
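
Subclassing the base string type might look roughly like this in
modern Python. The parent argument is a hypothetical extra, added
here only to show that such an object can carry tree information and
still behave as a plain string.

```python
class NavigableText(str):
    """Hypothetical string subclass that also remembers its place in a tree."""

    def __new__(cls, value, parent=None):
        text = super().__new__(cls, value)
        text.parent = parent
        return text


title = NavigableText("Beautiful Soup", parent="title-tag")
shouted = title.upper()          # ordinary str methods work directly
```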

str() on a Tag always returns a string, and unicode() always returns
Unicode. Previously it was inconsistent.

== Tree traversal ==

In a first() or fetch() call, the tag name or the desired value of an
attribute can now be any of the following:

 * A string (matches that specific tag or that specific attribute value)
 * A list of strings (matches any tag or attribute value in the list)
 * A compiled regular expression object (matches any tag or attribute
   value that matches the regular expression)
 * A callable object that takes the Tag object or attribute value as a
   string. It returns None/false/empty string if the given string
   doesn't match, and any other value if it does.
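
The four cases above can be sketched as a single dispatch function.
The name matches is hypothetical, and using search rather than match
for regular expressions is an assumption of this sketch.

```python
import re


def matches(criterion, value):
    """Return True if `value` satisfies `criterion`.

    `criterion` may be a string, a list of strings, a compiled regular
    expression, or a callable, mirroring the four cases listed above.
    """
    if callable(criterion):
        return bool(criterion(value))
    if hasattr(criterion, "search"):          # compiled regular expression
        return criterion.search(value) is not None
    if isinstance(criterion, (list, tuple)):
        return value in criterion
    return criterion == value


heading = re.compile(r"^h[1-6]$")
results = [
    matches("b", "b"),
    matches(["a", "b"], "b"),
    matches(heading, "h2"),
    matches(lambda v: len(v) == 1, "p"),
]
```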

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I took out
SQL-style wildcards. I'll put them back if someone complains, but
their removal simplifies the code a lot.

You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.

If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.
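
That rule amounts to a small normalization step before matching. The
helper name normalize_attrs is hypothetical.

```python
def normalize_attrs(attrs):
    """Treat anything that is not a mapping as a value for "class"."""
    if attrs is None:
        return {}
    if isinstance(attrs, dict):
        return attrs
    return {"class": attrs}


shorthand = normalize_attrs("headline")
explicit = normalize_attrs({"id": "5"})
```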

1.x and 2.x both let you call a Tag object as a shorthand for
fetch(). For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
can traverse a tree in very little code:

 for header in soup.bodyTag.pTag.tableTag('th'):
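
Both shortcuts map naturally onto Python's __call__ and __getattr__
hooks. The Tag class below is a hypothetical miniature, not the real
implementation, and its first() returns None for a missing tag to
keep the sketch short.

```python
class Tag:
    """Hypothetical tag supporting tag("x") and tag.xTag shortcuts."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def fetch(self, name):
        """Return every descendant tag with the given name, in document order."""
        found = []
        for child in self.children:
            if child.name == name:
                found.append(child)
            found.extend(child.fetch(name))
        return found

    def first(self, name):
        results = self.fetch(name)
        return results[0] if results else None

    def __call__(self, name):
        return self.fetch(name)              # foo("bar") == foo.fetch("bar")

    def __getattr__(self, attr):
        if attr.endswith("Tag"):
            return self.first(attr[:-3])     # foo.barTag == foo.first("bar")
        raise AttributeError(attr)


table = Tag("table", [Tag("th"), Tag("th")])
soup = Tag("html", [Tag("body", [Tag("p", [table])])])
headers = soup.bodyTag.pTag.tableTag("th")
```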

If an element relationship (like parent or next) doesn't apply to a
tag, it'll now show up as Null instead of None. first() will also
return Null if you ask it for a nonexistent tag. Null is an object
that's just like None, except you can do whatever you want to it and
it'll give you Null instead of throwing an error.

This lets you do tree traversals like soup.htmlTag.headTag.titleTag
without having to worry if the intermediate stages are actually
there. Previously, if there was no 'head' tag in the document, headTag
in that instance would have been None, and accessing its 'titleTag'
member would have thrown an AttributeError. Now, you can get what you
want when it exists, and get Null when it doesn't, without having to
do a lot of conditionals checking to see if every stage is None.
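
An object with this behaviour is often called a null object. A
minimal sketch in modern Python, assuming nothing about how Null was
actually written (NullType is a hypothetical name):

```python
class NullType:
    """Hypothetical null object: falsy, and every operation yields itself."""

    def __getattr__(self, attr):
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __getitem__(self, key):
        return self

    def __bool__(self):
        return False

    def __repr__(self):
        return "Null"


Null = NullType()

# A missing 'head' tag no longer breaks the chain.
title = Null.headTag.titleTag
```

The price of this convenience is that mistakes stay silent: a
misspelled member name yields Null rather than an error.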

There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:

 <p><ul><li>Foo<br /><li>Bar</ul>

The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.
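
Sibling links of this kind can be maintained as children are added to
a parent. The Node class below is hypothetical and uses None where
2.0 used Null. It rebuilds only the 'ul' part of the example by hand
and does no parsing.

```python
class Node:
    """Hypothetical parse-tree element."""

    def __init__(self, name):
        self.name = name
        self.contents = []
        self.previousSibling = None
        self.nextSibling = None

    def append(self, child):
        """Add a child and link it to the previous child at the same level."""
        if self.contents:
            last = self.contents[-1]
            last.nextSibling = child
            child.previousSibling = last
        self.contents.append(child)
        return child


ul = Node("ul")
first_li = ul.append(Node("li"))
second_li = ul.append(Node("li"))
foo = first_li.append(Node("Foo"))
br = first_li.append(Node("br"))
```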

I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well. It was
really difficult to use, I never used it, and I don't think anyone
else ever used it. To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.

== Tree manipulation ==

You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.

== Porting Considerations ==

There are three changes in 2.0 that break old code:

In the post-1.2 release you could pass a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.

It's no longer possible to pass SQL-style wildcards to fetch(). Use a
regular expression instead.

The different parsing algorithm means the parse tree may not be shaped
like you expect. This will only actually affect you if your code uses
one of the affected parts. I haven't run into this problem yet while
porting my code.
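
For the wildcard change, old patterns can be translated mechanically
into regular expressions. This sketch assumes "%" meant "any run of
characters", as in SQL LIKE, and that every other character was
literal. The helper name wildcard_to_regex is hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern using "%" (any run of characters) into a regex.

    Assumes SQL LIKE semantics for "%"; every other character is literal.
    """
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(r"\A" + ".*".join(parts) + r"\Z", re.DOTALL)


ends_with_foo = wildcard_to_regex("%foo")
starts_with_a_dot_b = wildcard_to_regex("a.b%")
```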

= Between 1.2 and 2.0 =

This is the release to get if you want Python 1.5 compatibility.

The desired value of an attribute can now be any of the following:

 * A string
 * A string with SQL-style wildcards
 * A compiled RE object
 * A callable that returns None/false/empty string if the given value
   doesn't match, and any other value otherwise.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I no longer
recommend you use SQL-style wildcards. They may go away in a future
release to clean up the code.

Made Beautiful Soup handle processing instructions as text instead of
ignoring them.

Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.
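
The shorthand can be pictured as a property that delegates to the
single child. The Text and Tag classes here are hypothetical, and
"only one string-owning child" is simplified to "exactly one child"
in this sketch.

```python
class Text(str):
    """Hypothetical text node; its string is itself."""

    @property
    def string(self):
        return self


class Tag:
    """Hypothetical tag holding a list of child nodes."""

    def __init__(self, name, contents=()):
        self.name = name
        self.contents = list(contents)

    @property
    def string(self):
        # Delegate only when there is exactly one child to ask.
        if len(self.contents) == 1:
            return self.contents[0].string
        return None


nested = Tag("b", [Tag("i", [Text("hello")])])
mixed = Tag("p", [Text("one"), Tag("br")])
```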

Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.

Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".

= 1.2 "Who for such dainties would not stoop?" (20040708) =

Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.

Made BeautifulStoneSoup even dumber by making it not implicitly close
a tag when another tag of the same type is encountered; only when an
actual closing tag is encountered. This change courtesy of Fuzzy (mike
at pcblokes dot com). BeautifulSoup still works as before.

= 1.1 "Swimming in a hot tureen" =

Added more 'nestable' tags. Changed popping semantics so that when a
nestable tag is encountered, tags are popped up to the previously
encountered nestable tag (of whatever kind). I will revert this if
enough people complain, but it should make more people's lives easier
than harder. This enhancement was suggested by Anthony Baxter (anthony
at interlink dot com dot au).

= 1.0 "So rich and green" (20040420) =

Initial release.