Full-Text search on field containing HTML 
I have a text field in a SQL Server 7.0 database containing HTML. Full-text
search does not work properly on it because the HTML tags may be
directly adjacent to the words.

How do I tell the search engine to ignore the tags? Should I use the
noise file for this?

Any help will be greatly appreciated.

Per

Sent via Deja.com
http://www.***.com/



Sun, 08 Jun 2003 06:47:18 GMT

SQL Server 7 can only accurately index plain text. There is no method of
telling it the format of the text to be indexed.

SQL Server 2000 CAN index rich documents. The documents must be stored in an
image type column, and an additional varchar column should be used to specify
the appropriate extension (.doc, .htm, .xls, etc.). SQL Server uses this to
identify the filter to apply.

Only one problem: I haven't been able to get this to work yet. If anyone
else has, please let me know!

David Lapsley




Sun, 08 Jun 2003 20:35:52 GMT
Thanks David,

Would it be worth the effort to add the most common HTML tags to the
noise file? And if so, will the full-text engine index the words that
are left when the tags are removed?

If there were a way to edit the word-breaking properties, I can see a
way to make this work. The < and > could be defined as word boundaries.

Does anyone else have an idea on how to make this work without writing your
own indexing mechanism? I have a feeling that's where it's going to end
up for me unless someone has a solution.

Thanks again.

Per





Sun, 08 Jun 2003 21:52:03 GMT

Quote:

> Anyone else have an idea on how to make this work without writing your
> own indexing mechanism? I have a feeling that's where it's going to end
> up for me unless someone has a solution.

Do you need the HTML in the indexed text? If not, you can remove HTML,
CSS and JavaScript with three regular expressions and then insert the
cleaned text. Otherwise, I suppose you could store both the filtered version
and the original - it'd waste space but would be a lot less work than writing
your own index engine.
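
A minimal sketch of that clean-up step, in Python. The post does not say
which three expressions were meant, so the patterns and the name
strip_markup below are assumptions for illustration only. Regular
expressions are not a full HTML parser, so unusual markup (for example a
">" inside an attribute value) may slip through.

```python
import re

# Assumed patterns: one each for script blocks, style blocks, and remaining tags.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(html_text):
    """Return html_text with script blocks, style blocks and tags removed."""
    # Remove whole script and style blocks first so their contents are not indexed.
    text = SCRIPT_RE.sub(" ", html_text)
    text = STYLE_RE.sub(" ", text)
    # Replace each remaining tag with a space so adjacent words stay separate.
    text = TAG_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind by the removed markup.
    return " ".join(text.split())
```

Tags are replaced with a space rather than deleted outright, so words that
were separated only by a tag do not get glued together in the indexed text.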




Sun, 29 Jun 2003 15:44:30 GMT
Chris,
Yes, use SQL 2000's new IFilter for HTML! See the SQL 2000 BOL titles
"Filtering Supported File Types" and "Using Full-text Predicates to Query
image Columns". Specifically, the first says: "Microsoft® SQL Server™ 2000
includes filters for these file extensions: .doc, .xls, .ppt, .txt, and
.htm."
Regards,
John





Mon, 30 Jun 2003 08:52:15 GMT
Can SQL 2000 full-text search handle entities,
e.g. Belgi&euml; or Belgi&#39;?
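
Whether the SQL 2000 HTML filter decodes entities is not settled in this
thread. If the text is pre-processed before insertion anyway, one possible
workaround is to decode entities at that step so the index only ever sees
plain characters. A minimal sketch in Python, where decode_entities is a
hypothetical helper name:

```python
import html


def decode_entities(text):
    """Turn HTML character references into the characters they stand for."""
    # html.unescape handles named (&euml;) and numeric (&#39;) references.
    return html.unescape(text)
```

With this, Belgi&euml; becomes the word with a real e-diaeresis, and
Belgi&#39; becomes Belgi followed by an apostrophe.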



Wed, 02 Jul 2003 19:31:24 GMT
 