HTML markup affecting search

Hi

Hopefully a simple question, but here goes.
I want to read some data into my database which could contain HTML markup,
for example:

The <b>fox</b> jumped over the <I>gate</I>

(N.B. the data being read in will be an XML file.)

At some point I will create a web page from this data, so the tags will be
part of that page. However, I'm worried that when the user types the word
"gate" into a search box (yet to be written), the search will fail to find
it because the word is stored as <I>gate</I>.

I don't want to use wildcards, because they would find every occurrence of
"gate" and so would also bring back, for example, "stargates" (if that word
was in the database). Am I missing the point anywhere? Should I be
investigating ignored words?

I understand you can use XML and XSL to format output based on XML tags.
However, the data coming into my database will be XML, and the text between
the XML tags could contain HTML markup. (I understand some manipulation is
needed for ampersands, quotes, etc.)

One solution, I guess, would be to strip the markup out and hold the stripped
data in a separate table, so that one table is used to create the web page and
the other to build the search catalogue. However, this does not seem the most
elegant way.
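
For illustration only, the markup-stripping idea could be sketched as below. This is a minimal sketch in Python, not anything posted in the thread: the names TagStripper and strip_markup are hypothetical, and it assumes the stripping happens in application code before the row is written to the database.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps text content and discards every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.parts.append(data)


def strip_markup(fragment: str) -> str:
    """Return the fragment with HTML tags removed (assumed helper name)."""
    stripper = TagStripper()
    stripper.feed(fragment)
    stripper.close()
    return "".join(stripper.parts)


# The marked-up value would feed the web page; the stripped value would be
# the copy that gets indexed for searching.
display_text = "The <b>fox</b> jumped over the <I>gate</I>"
search_text = strip_markup(display_text)
```

The stripped copy could just as well live in a second column of the same table as in a separate table; which is better depends on how the page is generated.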

Any help or advice would be greatly appreciated.

Thanks,
Brad



Mon, 11 Oct 2004 02:06:12 GMT

Try the neutral wordbreaker -- that seems to help folks out quite a bit.
You appear not to need the stemming support (which you lose with the neutral
wordbreaker) but the rest may work.  Give it a shot -- hopefully this helps.

Thanks,
--andrew

Andrew Cencini
Program Manager - SQL Server
Microsoft Corporation

This posting is provided "AS IS" with no warranties, and confers no rights.




Mon, 11 Oct 2004 05:45:51 GMT
Andrew,

I'm working with Brad Hayes (the original contributor) and I'm trying
out what you suggested.

I've reset the 'default full-text language' option in Query Analyser and
also done 'reconfigure'. (I presume this is what you mean by the neutral
wordbreaker.) Having done this and rebuilt and repopulated the full-text
catalogue, it doesn't seem to make any difference. If I have a field
entry "This text is <b>really</b> important.", I don't get anything
returned for either of the following queries:

SELECT * FROM tbl_Test WHERE CONTAINS (*, '"really"')
SELECT * FROM tbl_Test WHERE CONTAINS (*, '"real*"')

I am confused as to how the neutral setting might suddenly "see through"
the <b> and </b> tags in the string and pick out the word in between.
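
One possible reading of the suggestion, sketched as a toy model in Python: if a word breaker splits text purely on whitespace and punctuation, the characters <, > and / act as separators and the word between the tags becomes a token of its own. The function toy_word_break is hypothetical and is not SQL Server's word breaker; whether the neutral word breaker actually behaves this way is an assumption here, not something confirmed in this thread.

```python
import re


def toy_word_break(text):
    """Hypothetical stand-in for a punctuation-splitting word breaker."""
    # Treat every run of letters or digits as one token; anything else
    # (spaces, <, >, /, full stops) is a separator.
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


tokens = toy_word_break("This text is <b>really</b> important.")
# Under this model the tag names become short tokens of their own, and
# "really" is indexed as a separate word.
found = "really" in tokens
```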

Thanks in advance,

Jon.




Mon, 25 Oct 2004 00:36:20 GMT
You reset the default Full-Text language, not the column language.  The
default language is used when you add new columns to a Full-Text index and
do not specify a language.  Your existing column here will retain the
setting of English (LCID 1033).  You can verify this by executing
sp_help_fulltext_columns and looking at the FULLTEXT_LANGUAGE value for that
column in your index.

The proper approach would be to drop the column from the index, then re-add it
with the neutral language:

EXEC sp_fulltext_column 'tablename', 'columnname', 'drop'
EXEC sp_fulltext_column 'tablename', 'columnname', 'add', 0x0
-- (0x0 means neutral language setting)

Also, you may want to set your Default Full-Text Language back to English
unless you want to implicitly use neutral language on all future column
additions to your Full-Text catalogs.

Cheers,
--andrew





Tue, 26 Oct 2004 11:15:04 GMT
 