doclist should be ordered from oldest to newest, like::
The documents must be *fragments* (str/UTF8 or unicode), not complete documents.
The markup argument is a function to mark up the spans of words. This function is called like markup('Hello', 'version 2'), and returns HTML. The first argument is text and never includes any markup. The default uses a span with a title:
    >>> print(default_markup('Some Text', 'by Joe'))
    <span title="by Joe">Some Text</span>
Represents the href in an anchor tag. Unlike other words, we only show the href when it changes.
Represents a token that is actually a tag. Currently this is just the <img> tag, which takes up visible space just like a word but is only represented in a document by a tag.
Represents a diffable token, generally a word that is displayed to the user. Opening tags are attached to this token when they are adjacent (pre_tags) and closing tags that follow the word (post_tags). Some exceptions occur when there are empty tags adjacent to a word, so there may be close tags in pre_tags, or open tags in post_tags.
We also keep track of whether the word was originally followed by whitespace, even though we do not want to treat the word as equivalent to a similar word that does not have a trailing space.
Raised when the document no longer contains any pending deletes (DEL_START/DEL_END).
Acts like SequenceMatcher, but tries not to find very small equal blocks amidst large spans of changes.
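One way to read the last description above (a matcher that acts like SequenceMatcher but avoids very small equal blocks) is as a subclass of difflib.SequenceMatcher that filters its matching blocks. The following is a minimal sketch under stated assumptions, not the module's actual code: the class name FilteringSequenceMatcher, the threshold of 2, and the size / 4 cap are all chosen here for illustration.

```python
import difflib


class FilteringSequenceMatcher(difflib.SequenceMatcher):
    """Hypothetical sketch: ignore very short equal runs between two sequences."""

    # Assumed cutoff: equal blocks of this length or shorter are discarded.
    threshold = 2

    def get_matching_blocks(self):
        # Scale the cutoff down for short inputs so small documents still match.
        size = min(len(self.a), len(self.b))
        cutoff = min(self.threshold, size / 4)
        blocks = super().get_matching_blocks()
        # Keep long blocks, plus the zero-length sentinel that difflib appends.
        return [block for block in blocks if block.size > cutoff or block.size == 0]


old_words = list("abcdefghij")
new_words = ["1", "2", "c", "3", "4", "5", "6", "7", "8", "9"]

plain_blocks = difflib.SequenceMatcher(
    a=old_words, b=new_words, autojunk=False
).get_matching_blocks()
filtered_blocks = FilteringSequenceMatcher(
    a=old_words, b=new_words, autojunk=False
).get_matching_blocks()
```

In this sketch the lone shared word "c" is reported by the plain matcher but dropped by the filtering one, so a diff built on the filtered blocks would treat the whole span as a single change.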
Removes an element, but merges its contents into its place, e.g., given <p>Hi <i>there!</i></p>, if you remove the <i> element you get <p>Hi there!</p>
Helper for _fixup_ins_del_tags; actually takes the <ins> etc tags and moves them inside any block-level tags.
True if the element contains any block-level elements, like <p>, <td>, etc.
fixup_ins_del_tags that works on an lxml document in-place.
Serialize a single lxml element as HTML. The serialized form includes the element's tail.
If skip_outer is true, then don't serialize the outermost tag.
Given an HTML string, move any <ins> or <del> tags inside of any block-level elements, e.g. transform <ins><p>word</p></ins> to <p><ins>word</ins></p>.
The text representation of an end tag for a tag. Includes trailing whitespace when appropriate.
The text representation of the start tag for a tag.
Splits some text into words. Includes trailing whitespace on each word when appropriate.
Takes an lxml element el, and generates all the text chunks for that tag. Each start tag is a chunk, each word is a chunk, and each end tag is a chunk.
If skip_tag is true, then the outermost container tag is not returned (just its contents).
This function takes a list of chunks and produces a list of tokens.
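The word-splitting step described above (each word keeps its trailing whitespace) can be sketched with a single regular expression. This is an illustrative sketch rather than the module's own code; the function name split_words_sketch and the exact pattern are assumptions.

```python
import re

# Assumed pattern: a run of non-whitespace followed by its trailing
# whitespace, or by the end of the string.
_WORD_RE = re.compile(r"\S+(?:\s+|$)")


def split_words_sketch(text):
    """Split text into words, keeping trailing whitespace on each word."""
    if not text or not text.strip():
        return []
    return _WORD_RE.findall(text)


words = split_words_sketch("Hello  world\nagain")
```

Keeping the whitespace attached to each word means the pieces can be joined back into the original text, as long as the text has no leading whitespace.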
This function takes a word, such as 'test\n\n', and returns ('test', '\n\n').
This 'cleans' the HTML, meaning that any page structure is removed (only the contents of <body> are used, if there is any <body>). Also <ins> and <del> tags are removed.
Parses an HTML fragment, returning an lxml element. Note that the HTML will be wrapped in a <div> tag that was not in the original document.
If cleanup is true, make sure there's no <head> or <body>, and get rid of any <ins> and <del> tags.
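The two string-level helpers described above (splitting trailing whitespace off a word, and cleaning page structure and <ins>/<del> tags out of HTML) can be sketched with the standard library alone. This is a rough sketch under assumptions, not the module's implementation: the function names and the three regular expressions are illustrative, and the patterns are deliberately loose.

```python
import re

# Assumed patterns for locating the body and stripping <ins>/<del> tags.
_BODY_RE = re.compile(r"<body.*?>", re.I | re.S)
_END_BODY_RE = re.compile(r"</body.*?>", re.I | re.S)
_INS_DEL_RE = re.compile(r"</?(ins|del).*?>", re.I | re.S)


def split_trailing_whitespace_sketch(word):
    """Return (word without trailing whitespace, the trailing whitespace)."""
    stripped_length = len(word.rstrip())
    return word[:stripped_length], word[stripped_length:]


def cleanup_html_sketch(html):
    """Keep only the body contents (if any) and drop <ins>/<del> tags."""
    match = _BODY_RE.search(html)
    if match:
        html = html[match.end():]
    match = _END_BODY_RE.search(html)
    if match:
        html = html[:match.start()]
    return _INS_DEL_RE.sub("", html)


cleaned = cleanup_html_sketch(
    "<html><body><p>Hi <ins>there</ins></p></body></html>"
)
```

Note that only the tags are removed; the text inside <ins> and <del> is kept.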
Parse the given HTML and return token objects (words with attached tags).
This parses only the content of a page; anything in the head is ignored, and the <head> and <body> elements are themselves optional. The content is then parsed by lxml, which ensures the validity of the resulting parsed document (though lxml may make incorrect guesses when the markup is particularly bad).
<ins> and <del> tags are also eliminated from the document, as that gets confusing.
If include_hrefs is true, then the href attribute of <a> tags is included as a special kind of diffable token.
Like locate_unbalanced_start, except handling end tags and possibly moving the point earlier in the document.
pre_delete and post_delete implicitly point to a place in the document (where the two were split). This moves that point (by popping items from one and pushing them onto the other). It moves the point to try to find a place where unbalanced_start applies.
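The point-moving behaviour described above can be illustrated on plain lists of chunks. The following is a simplified sketch, not the module's code: the function name move_point_past_start_tags is hypothetical, and the sketch ignores special cases such as delete markers and <ins> tags.

```python
def move_point_past_start_tags(unbalanced_start, pre_delete, post_delete):
    """Shift the split point forward over start tags that match unbalanced_start.

    All three lists are modified in place.
    """
    while unbalanced_start and post_delete:
        wanted_name = unbalanced_start[0].split()[0].strip("<>")
        candidate = post_delete[0]
        # Stop at text or at an end tag: only start tags can be absorbed.
        if not candidate.startswith("<") or candidate.startswith("</"):
            break
        candidate_name = candidate.split()[0].strip("<>")
        if candidate_name != wanted_name:
            break
        # The document already opens this tag here, so move the point past it.
        unbalanced_start.pop(0)
        pre_delete.append(post_delete.pop(0))


unbalanced_start = ["<div>"]
doc = ["<p>", "Text", "</p>", "<div>", "More Text", "</div>"]
pre, post = doc[:3], doc[3:]
move_point_past_start_tags(unbalanced_start, pre, post)
# pre is now ['<p>', 'Text', '</p>', '<div>'] and post is ['More Text', '</div>']
```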
As you can see, we moved the point so that the dangling <div> that we found will be effectively replaced by the div in the original document. If this doesn't work out, we just throw away unbalanced_start without doing anything.
Returns (stuff_before_DEL_START, stuff_inside_DEL_START_END, stuff_after_DEL_END). Returns the first case found (there may be more DEL_STARTs in stuff_after_DEL_END). Raises NoDeletes if there's no DEL_START found.
Return (unbalanced_start, balanced, unbalanced_end), where each is a list of text and tag chunks.
unbalanced_start is a list of all the tags that are opened, but not closed in this span. Similarly, unbalanced_end is a list of tags that are closed but were not opened. Extracting these might mean some reordering of the chunks.
Cleans up any DEL_START/DEL_END markers in the document, replacing them with <del></del>. To do this while keeping the document valid, it may need to drop some tags (either start or end tags).
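The (unbalanced_start, balanced, unbalanced_end) split described above can be sketched with a stack of open tags. This is an illustrative sketch rather than the module's implementation; the function name split_unbalanced_sketch and the EMPTY_TAGS subset are assumptions.

```python
# Assumed subset of tags that never take a closing tag.
EMPTY_TAGS = {"br", "hr", "img", "input", "meta", "link"}


def split_unbalanced_sketch(chunks):
    """Return (unbalanced_start, balanced, unbalanced_end) for a list of chunks."""
    start, end, balanced = [], [], []
    tag_stack = []  # entries are (tag name, position in balanced, chunk)
    for chunk in chunks:
        if not chunk.startswith("<"):
            balanced.append(chunk)  # plain text is always balanced
            continue
        is_end_tag = chunk.startswith("</")
        name = chunk.split()[0].strip("<>/")
        if name in EMPTY_TAGS:
            balanced.append(chunk)
        elif not is_end_tag:
            # Reserve a slot; it is filled in only if the tag gets closed.
            tag_stack.append((name, len(balanced), chunk))
            balanced.append(None)
        elif tag_stack and tag_stack[-1][0] == name:
            _, position, open_chunk = tag_stack.pop()
            balanced[position] = open_chunk
            balanced.append(chunk)
        else:
            # A close with no matching open: pending opens become unbalanced.
            start.extend(open_chunk for _, _, open_chunk in tag_stack)
            tag_stack = []
            end.append(chunk)
    start.extend(open_chunk for _, _, open_chunk in tag_stack)
    return start, [c for c in balanced if c is not None], end


result = split_unbalanced_sketch(
    ["<p>", "Text", "</p>", "<div>", "More", "</span>"]
)
```

Here the <p> pair stays in the balanced list, while the unclosed <div> and the unopened </span> are pulled out, which is the reordering the description mentions.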
It may also move the del into adjacent tags to try to move it to a similar location where it was originally located (e.g., moving a delete into a preceding <div> tag, if the del looks like (DEL_START, 'Text</div>', DEL_END)).
Adds the text chunks in del_chunks to the document doc (another list of text chunks) with a marker to show it is a delete. cleanup_delete later resolves these markers into <del> tags.
doc is the already-handled document (as a list of text chunks); here we add <ins>ins_chunks</ins> to the end of that.
Given a list of tokens, return a generator of the chunks of text for the data in the tokens.
Does a diff on the tokens themselves, returning a list of text chunks (not tokens).
Do a diff of the old and new document. The documents are HTML *fragments* (str/UTF8 or unicode); they are not complete documents (i.e., no <html> tag).
Returns HTML with <ins> and <del> tags added around the appropriate text.
Markup is generally ignored, with the markup from new_html preserved, and possibly some markup from old_html (though it is considered acceptable to lose some of the old markup). Only the words in the HTML are diffed. The exception is <img> tags, which are treated like words, and the href attribute of <a> tags, which are noted inside the tag itself when there are changes.
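The word-level diffing described above can be illustrated without any HTML parsing. The sketch below diffs plain text only and wraps changes in <ins> and <del>; it is a simplification under stated assumptions, not the module's algorithm. The function name word_diff_sketch is hypothetical, and emitting the insertion before the deletion for a replaced span is a choice made here.

```python
import difflib
import html


def word_diff_sketch(old_text, new_text):
    """Diff two plain-text strings word by word, marking changes with <ins>/<del>."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(html.escape(word) for word in new_words[j1:j2])
        if tag in ("insert", "replace"):
            out.append("<ins>%s</ins>" % html.escape(" ".join(new_words[j1:j2])))
        if tag in ("delete", "replace"):
            out.append("<del>%s</del>" % html.escape(" ".join(old_words[i1:i2])))
    return " ".join(out)


diffed = word_diff_sketch("Hello World", "Goodbye World")
```

Handling real markup needs the tokenizing and tag-balancing steps described elsewhere in this text, which this sketch leaves out.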
Serialize the list of tokens into a list of text chunks, calling markup_func around text to add annotations.
Merge tok into the last element of tokens (modifying the list of tokens in-place).
Combine adjacent tokens when there is no HTML between the tokens, and they share an annotation.
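The compress-then-serialize steps described above can be sketched on a small stand-in token type. Everything named here (AnnotatedToken, compress_tokens_sketch, serialize_tokens_sketch, and joining merged words with a single space) is an assumption made for illustration, not the module's data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnnotatedToken:
    """Hypothetical stand-in for a word with attached tags and an annotation."""
    text: str
    annotation: str
    pre_tags: List[str] = field(default_factory=list)
    post_tags: List[str] = field(default_factory=list)


def compress_tokens_sketch(tokens):
    """Merge neighbours that share an annotation and have no HTML between them."""
    result = []
    for tok in tokens:
        if (result and not result[-1].post_tags and not tok.pre_tags
                and result[-1].annotation == tok.annotation):
            last = result[-1]
            # Assumption: merged words are joined with one space.
            result[-1] = AnnotatedToken(
                last.text + " " + tok.text, last.annotation,
                last.pre_tags, tok.post_tags)
        else:
            result.append(tok)
    return result


def serialize_tokens_sketch(tokens, markup_func):
    """Flatten tokens to text chunks, wrapping each text with markup_func."""
    chunks = []
    for tok in tokens:
        chunks.extend(tok.pre_tags)
        chunks.append(markup_func(tok.text, tok.annotation))
        chunks.extend(tok.post_tags)
    return chunks


tokens = [
    AnnotatedToken("Hello", "v1"),
    AnnotatedToken("World", "v1", post_tags=["</p>"]),
    AnnotatedToken("Bye", "v1", pre_tags=["<p>"]),
    AnnotatedToken("now", "v2"),
]
compressed = compress_tokens_sketch(tokens)
chunks = serialize_tokens_sketch(
    compressed, lambda text, note: '<span title="%s">%s</span>' % (note, text))
```

Merging first keeps the output to one span per run of words with the same annotation, rather than one span per word.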
Copy annotations from the tokens listed in src to the tokens in dest.
Merge the annotations from tokens_old into tokens_new, when the tokens in the new document already existed in the old document.
Tokenize a document and add an annotation attribute to each token.
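The annotation steps described above (tokenize each version with its annotation, then carry old annotations over to words that survive into the new version) can be sketched on plain text with difflib. This is a minimal sketch under assumptions, not the module's code: the function name annotate_versions_sketch is hypothetical, and words are compared as bare strings with no HTML handling.

```python
import difflib


def annotate_versions_sketch(versions):
    """versions is a list of (text, annotation) pairs, oldest first.

    Returns a list of (word, annotation) pairs for the newest text, where each
    word carries the annotation of the version that introduced it.
    """
    first_text, first_annotation = versions[0]
    current = [(word, first_annotation) for word in first_text.split()]
    for text, annotation in versions[1:]:
        new = [(word, annotation) for word in text.split()]
        matcher = difflib.SequenceMatcher(
            a=[word for word, _ in current],
            b=[word for word, _ in new],
            autojunk=False,
        )
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged words keep the annotation they already had.
                new[j1:j2] = current[i1:i2]
        current = new
    return current


annotated = annotate_versions_sketch([
    ("Hello World", "version 1"),
    ("Goodbye World", "version 2"),
])
```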