Regular Expression INTRO [long] [OT] 
Author Message
 Regular Expression INTRO [long] [OT]

wow -- this was nearly a year ago and i still haven't
done it!

Quote:

> Would you mind taking a few moments and writing a guide on using Regex
> functions in PostgreSQL?

> This is to put on the techdocs.postgresql.org website so everyone in the
> PostgreSQL community will have access to it.

i'm posting to pgsql-general so the community
can help find holes and plug them...

here goes --

Finding data in PostgreSQL using LIKE or ~ (REGULAR EXPRESSIONS):

LIKE
====

You're probably familiar with the WHERE clause
of the SELECT statement --

        SELECT somefield FROM table WHERE somefield = somevalue;

For example, to show all customers within the
post code "60201" you could try

        SELECT * FROM cust WHERE zip = '60201';

But if you also have nine-digit zips stored in the zip
field (in the U.S.A. we have standard five-digit postal
codes and also a four-digit extension for further
geographic resolution) then this search will OMIT
customers having more-specific zips such as '60201-9876'
or '60201-1234'.

The Percent "%"
---------------

Try using LIKE instead of = (EQUALS):

        SELECT * FROM cust WHERE zip LIKE '60201%';

With LIKE, your search values can use the percent (%)
to tell PostgreSQL that anything can occupy that spot
-- one character, twelve symbols or zero digits -- and
still satisfy the search.

On Unix or Linux, this is basically the same as the
ASTERISK (*) at the command line, when dealing with
file names:

        # list all files whose names begin with '.bash'
        ls .bash*

        # list all files containing 'out' anywhere in the name
        ls *out*

        # list all file names ending with '.pl'
        ls *.pl

        # list file starting with 'proj', ending with '.c'
        ls proj*.c

With PostgreSQL using the LIKE operator, use the
percent, instead:

        -- list all customers within the 47610 postal code
        SELECT * FROM cust WHERE zip LIKE '47610%';

        -- display customers who have 'Corp' in their names
        SELECT * FROM cust WHERE name LIKE '%Corp%';

        -- show customers whose names end in 'LLC'
        SELECT * FROM cust WHERE name LIKE '%LLC';

        -- documents beginning with 'We', ending with 'rica'
        SELECT * FROM doc WHERE contents LIKE 'We%rica';

Wherever the '%' appears (using the LIKE operator)
Postgres allows anything at all to match -- from a
lengthy string of text, to one single character, to
a zero-length string -- i.e. nothing at all.

        ...ILIKE 'A%Z'

The Underscore "_"
------------------

For the LIKE operator, the UNDERSCORE (_) takes on the same
meaning as the question mark does to shell operations for
Unix and Linux file names:

        # list files starting with dot, followed by at least two chars
        ls .??*

The underscore matches just one character exactly:

        SELECT * FROM atable WHERE afield LIKE '_';
        -- shows records where afield is exactly one character
        -- omitting ones where it's blank or has two (or more)
        -- characters in it)

You can restrict your searches to finding fields
of certain lengths this way:

        SELECT * FROM cust WHERE zip LIKE '_____' OR zip LIKE '_________';
        -- display all five- or nine-character zip codes

Combine it with percent to find fields over a certain length:

        SELECT * FROM atable WHERE afield LIKE '___%';
        -- display records where afield has three or more characters

Or you can accomodate some language idiosyncracies, as
well:

        SELECT * FROM activity WHERE venue LIKE 'Theat__';
        -- find u.k. 'theatre' and u.s. 'theater'

ILIKE
=====

But what if a customer's name is all capitals, as in
'THE ACME CORPORATION'? The "...LIKE '%Corp%'" won't
find it, because uppercase 'O' is not identical to
lowercase 'o'!

Finding text independent of uppercase/lowercase
is something we often need to do. And so we have
the ILIKE operator.

Use ILIKE instead of LIKE when your search should
ignore case -- the "I" stands for case-[I]nsensitive:

        -- show all customers with 'corp' in the name
        SELECT * FROM cust WHERE name ILIKE '%Corp%';
        -- ignoring whether the field is upper- or lowercase

That'll find 'Bubba Gump Shrimp Corp' and 'ACME CORP'
and even 'Amalgamated Switch and Relay corporation'
because anything can ('%') precede or ('%') follow the
string 'Corp' which can be uppercase or lowercase.

But there's even more power in "regular expressions"!
See below...

Notes
-----

Always include some real data to search for!

        -- show everything, doing a lot of unnecessary work:
        SELECT * FROM atable WHERE afield LIKE '%';

It would be rather silly to make PostgreSQL compare
every record to see if it matched 'anything goes,
in "afield"'. Just say no.

Convert your spoken-language request to a LIKE
clause like this:

        "afield ENDS WITH avalue"
        reword => "afield starts with anything, ending with avalue"
        sql => "... WHERE afield LIKE '%avalue'"

        "afield STARTS WITH avalue"
        reword => "afield starts with avalue, ending with anything"
        sql => "... WHERE afield LIKE 'avalue%'"

See how that works?

        "afield CONTAINS avalue"
        sql => "... WHERE afield LIKE '%avalue%'"

If you're looking for something at the beginning of
a field, that's where your data goes, and you END with
the percent. If you're looking for something at the end
of a field, put your data there and allow anything at
the front by putting the percent there.

ALSO -- if your field is indexed, you'll defeat the index
unless you anchor your search to the beginning of the field.

        -- find fields starting with 'something'
        ... afield LIKE 'something%'; -- uses afield's index

The example below isn't anchored at the start of the field,
so it does no good to use the index:

        -- find fields containing 'pooh'
        ... afield LIKE '%pooh%'; -- can't use afield's index

Afield must contain 'pooh' BUT it could be anywhere in
the field. The index would be useless.

LIMITATION
==========

Okay. Let's say you have a full-name field that has
first, middle, and last name all in the one field.
Sometimes you'll have a middle name, sometimes
you won't. Sometimes the first name will be spelled
out, other times it'll be shortened or even abbreviated
to just the first initial.

Here's how you'd search that field for "Abraham Lincoln"
using the LIKE operator:

        SELECT * FROM peron WHERE fullname LIKE 'A%Lincoln';

It'll find 'Abraham Lincoln' and 'A. Lincoln' and
'Abe Lincoln' because anything can go between the
initial 'A' and the ending 'Lincoln'.

But -- it'll also match 'Andrew M. Lincoln' and 'Abner
Sasquatch Lincoln' as well... and this is why some bright
souls created regular expressions!

REGULAR EXPRESSIONS
===================

A regular expression is a pattern to search for.
The structure of the pattern has to match a
rigid set of rules so that PostgreSQL will know
what you're trying to search for.

This introduction will barely scratch the surface,
but hopefully it'll get you started--

If you've ever tinkered with perl, you've probably
encountered regular expressions:

        perl -ne 'print if /[A-Za-z]/;' somefile

Here, the "[A-Za-z]" is a regular expression.
It matches any line containing uppercase (A-Z)
or lowercase (a-z) letters. In this example, the
lines will come from the file "somefile" and
lines that match will be printed out.

Well, PostgreSQL can do that, too!

Example
-------

        SELECT * FROM person
        WHERE fullname ~ 'A(be|braham|\\.) *Lincoln';

The REGULAR EXPRESSION operator is the TILDE (~).
The value you're searching for then must be a valid
regular expression. In the above example, we're asking
PostgreSQL to display all rows containing

        'A'

followed by EITHER

        'be'
OR
        'braham'
OR
        '\\.' -- a dot

which is then followed by

        ' ' -- a space
        '*' -- zero or more times

followed by

        'Lincoln'

and that's all. Whew!

As you can guess, these will all successfully match:

        'A. Lincoln'
        'AbeLincoln' -- without any space at all
        'Abraham    Lincoln' -- with lots of spaces

and these won't match:

        'A Lincoln' -- no dot, no 'braham', no 'be'
        'Abe Gump Lincoln' -- nothing will match 'Gump'

Quoting "\\"
------------

So what's with the BACKSLASH in front of the dot? Well,
just as LIKE has the UNDERSCORE (_) to denote "any single
character", REGULAR EXPRESSIONS use the DOT (.) for
that very same purpose. So we have to "escape" the
dot to alter its normal interpretation, using the.

Note that string literals in PostgreSQL already use the
backslash, so you'll have to double any backslash
you use:

        psql=# SELECT * FROM person
        psql-# WHERE fullname ~ 'A(be|braham|\\.) *Lincoln'
        psql-# -- a second backslash added
        psql-# ;

Grouping "()"
-------------

How about the PARENTHESES () and the BAR |, hmm? Well,
that's how you can group permitted alternatives:

        ... ~ 'A(be|braham|\\.)'

That says that 'A' can be followed by any of the three
sub-expressions within the parentheses, namely 'be' or
'braham' or <dot>.

These all produce the same matches:

        ... ~ 'A(be|braham|\\.)'
        ... ~ 'Abe|Abraham|A\\.'
        ... ~ 'A(b(e|raham)|\\.)'

Let's dissect this last one -- it specifies that:

        'A'

can be followed by either

        'b' followed by some more stuff
or
        <dot>

nestling into the details following the 'b' -- if
'A' is followed by 'b', then the 'b' must be followed
by either

        'e' -- which matches 'Abe'
or
        'raham' -- which matches 'Abraham'

Do you see how powerful this can be?

Character Classes "[]"
----------------------

So you're trying to find "McAndrews" -- but maybe it's
spelled with a lower-case "a" as in "Mcandrews", instead!

        SELECT * FROM person
        WHERE fullname ~ 'Mc[Aa]ndrews';

Using the BRACKETS tells the regular expression
parser that you're allowing any one of a whole class
of characters in that spot.

        '[Aa]'

It specifies that you're looking for either an uppercase
'A' or a lowercase 'a' in that spot.

Here's how you can find fields containing vowels:

        ... ~ '[AEIOUaeiou]'

Search for fields containing lowercase letters:

        ... ~ '[abcdefghijklmnopqrstuvwxyz]'

Of course, there's a shortcut for specifying character
classes that cover a whole range:

        ... ~ '[a-z]' -- also matches any
...

read more »



Sun, 26 Sep 2004 08:05:56 GMT
 Regular Expression INTRO [long] [OT]

will> Email addresses can look like any of these:





And





either safe or quoted.

will> A reasonably-functional email address pattern might be
will> something like this:


Bad.  Wrong.  Do not use this.  {sigh}

will> It's not a perfect match for all legal email patterns,

will> it might be sufficient depending on your requirements.

No.  Do not use that for anything.  Do not.  Do not.  You will be
locking out legitimate users.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095

Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!



Mon, 27 Sep 2004 22:38:56 GMT
 
 [ 2 post ] 

 Relevant Pages 

1. Regular Expression INTRO [long] [OT]

2. Regular expressions in view/stored procedure

3. REPLACE and regular expressions

4. Regular expressions in T-SQL

5. Free/open source XP for using regular expressions in T-SQL

6. how to evaluate regular expressions from SQL Server 2000

7. Regular Expression

8. Regular expression

9. Regular expressions ?

10. SQL Parsing Regular Expression?

11. Using regular expressions w/ LIKE keyword

12. Regular Expressions


 
Powered by phpBB® Forum Software