2007年04月28日 02:30 [Edit]
CPAN - URI::Escape::XS Released!
なぜ車輪を再発明したかといえば、今ある車輪のころがりがよくなかったから。
URI::Escapeへの不満は二つあって、一つは速度が不十分だったこと。LWPなどと組み合わせて使う場合は、正規表現ベースの変換でも、他のタスクの方がずっと重いので充分速いのですが、ログの解析する時などに利用すると、ずいぶんと遅く感じます。このあたりはある作業をしていて、気になってProfileしてみてはじめて気がつきました。
もう一つは、%uHHHHの対応。一応にぽたん作のURI::Escape::JavaScriptというものがあるのですが、コードを見るとSurrogate Pairに未対応。このあたりの問題は
でも触れたのですが、丸い車輪が見当たらない。しばらくは「とりあえずいっか。どうせ%uHHHHなんて規格外だし」とたかをくくっていたのですが、先日ある仕事で切実に動く奴が必要になったのでえいやと作りました。
詳しくはPODを参照のこと。
Enjoy!
Dan the Perl Monger
NAME
URI::Escape::XS - Drop-In replacement for URI::Escape
VERSION
$Id: XS.pm,v 0.1 2007/04/27 17:17:46 dankogai Exp dankogai $
SYNOPSIS
# use it instead of URI::Escape
use URI::Escape::XS qw/uri_escape uri_unescape/;
$safe = uri_escape("10% is enough\n");
$verysafe = uri_escape("foo", "\0-;\377");
$str = uri_unescape($safe);
# or use encodeURIComponent and decodeURIComponent
use URI::Escape::XS;
$safe = encodeURIComponent("10% is enough\n");
$str = decodeURIComponent("10%25%20is%20enough%0A");
EXPORT
by default
"encodeURIComponent" and "decodeURIComponent"
on demand
"uri_escape" and "uri_unescape"
FUNCTIONS
encodeURIComponent
Does what JavaScript's encodeURIComponent does.
$uri = encodeURIComponent("http://www.example.com/");
# http%3A%2F%2Fwww.example.com%2F
Note you cannot customize characters to escape. If you need to do so,
use "uri_escape".
decodeURIComponent
Does what JavaScript's decodeURIComponent does.
$str = decodeURIComponent("http%3A%2F%2Fwww.example.com%2F");
# http://www.example.com/
It decode not only %HH sequences but also %uHHHH sequences, with
surrogate pairs correctly decoded.
$str = decodeURIComponent("%uD869%uDEB2%u5F3E%u0061");
# \x{2A6B2}\x{5F3E}a
This function UNCONDITIONALLY returns the decoded string with utf8 flag
off. To get utf8-decoded string, use Encode and
decode_utf8(decodeURIComponent($uri));
This is the correct behavior because you cannnot tell if the decoded
string actually contains UTF-8 decoded string, like ISO-8859-1 and
Shift_JIS.
uri_escape
Does exactly the same as URI::Escape::uri_escape() except when
utf8-flagged string is fed.
URI::Escape::uri_escape() croak and urge you to "uri_escape_utf8()" but
it is pointless because URI itself has no such things as utf8 flag. The
function in this module ALWAYS TREATS the string as byte sequence. That
way you can safely use this function without worring about utf8 flags.
Note this function is NOT EXPORTED by default. That way you can use
URI::Escape and URI::Escape::XS simultaneously.
uri_unescape
Does exactly the same as URI::Escape::uri_escape() except when %uHHHH is
fed.
URI::Escape::uri_unescape() simply ignores %uHHHH sequences while the
function in this module does decode it into the corresponding UTF-8 byte
sequence.
Like uri_escape, this funciton is NOT EXPORTED by default.
Note on the %uHHHH sequence
With this module the resulting strings never have the utf8 flag on. So
if you want to decode it to perl utf8, You have to explicitly decode via
Encode. Remember. URIs have always been a byte sequence, not UTF-8
characters.
If %uHHHH sequence became standard, you could have safely told if a given
URI is in Unicode. But more fortunately than unfortunately, the RFC
proposal was rejected so you cannnot tell which encoding is used just by
looking at the URI.
<http://en.wikipedia.org/wiki/Percent-encoding#Non-standard_implementations>
I said fortunately because %uHHHH can be nasty for non-BMP characters.
Since each %uHHHH can hold one 16-bit value, you need a *surrogate pair*
to represent it if it is U+10000 and above.
In spite of that, there are a significant number of URIs with %uHHHH
escapes. Therefore this module supports decoding only.
SPEED
Since this module uses XS, it is really fast except for
uri_escape("noop").
Regexp which is used in URI::Escape is really fast for non-matching but
slows down significantly when it has to replace string.
BENCHMARK
On Macbook Pro 2GHz, Perl 5.8.8.
http://www.google.co.jp/search?q=%E5%B0%8F%E9%A3%BC%E5%BC%BE
============================================================
Unescape it
-----------
U::E 58526/s -- -88%
U::E::XS 486968/s 732% --
--------------
Escape it back
--------------
U::E 30046/s -- -78%
U::E::XS 136992/s 356% --
www.example.com
===============
Unescape it
-----------
Rate U::E U::E::XS
U::E 821972/s -- -4%
U::E::XS 854732/s 4% --
--------------
Escape it back
-------------
U::E::XS 522969/s -- -7%
U::E 565112/s 8% --
AUTHOR
Dan Kogai, "<dankogai at dan.co.jp>"
BUGS
Please report any bugs or feature requests to "bug-uri-escape-xs at
rt.cpan.org", or through the web interface at
<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=URI-Escape-XS>. I will
be notified, and then you'll automatically be notified of progress on
your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc URI::Escape::XS
You can also look for information at:
* AnnoCPAN: Annotated CPAN documentation
<http://annocpan.org/dist/URI-Escape-XS>
* CPAN Ratings
<http://cpanratings.perl.org/d/URI-Escape-XS>
* RT: CPAN's request tracker
<http://rt.cpan.org/NoAuth/Bugs.html?Dist=URI-Escape-XS>
* Search CPAN
<http://search.cpan.org/dist/URI-Escape-XS>
ACKNOWLEDGEMENTS
Gisle Aas for URI::Escape
Koichi Taniguchi for URI::Escape::JavaScript
COPYRIGHT & LICENSE
Copyright 2007 Dan Kogai, all rights reserved.
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
Posted by dankogai at 02:30│Comments(2)│TrackBack(0)
この記事へのトラックバックURL
この記事へのコメント
blogに貼り付けてるPODのSYNOPSISの項のサンプルで
ダブルクォートの閉じ忘れでハイライトがおかしくない?
ダブルクォートの閉じ忘れでハイライトがおかしくない?
Posted by 通りすがり at 2007年04月28日 13:31
通りすがりさん、
お、さくさく感が戻ってきました。
他にもcan'tやcould'veがハイライトをおかしくしていたので、
手元のRepositoryでもこちらのREADMEの移しも直しておきました。
Google Prettyprintの意外な効用。
Dan the Pretty Printed
お、さくさく感が戻ってきました。
他にもcan'tやcould'veがハイライトをおかしくしていたので、
手元のRepositoryでもこちらのREADMEの移しも直しておきました。
Google Prettyprintの意外な効用。
Dan the Pretty Printed
Posted by 弾 at 2007年04月28日 15:06