I need to get the first sentence of a Wikipedia article. For example, if the user types "Computer" in the text box, the program should go to Wikipedia and return the first sentence of the first paragraph. My approach uses Google's I'm Feeling Lucky button to search for "computer Wikipedia", which redirects to the Wikipedia page. I then get the source of that page and strip out the HTML. From what I observed, the required text always comes right after the first <p> tag, so I remove everything before that tag and take the next 150 characters (which I assume is about as long as a sentence can get). Within those 150 characters, I remove the text after the first period, since I only need the first sentence. Here's my code:
Imports System.Text.RegularExpressions

Function StripTags(ByVal html As String) As String
    ' Remove HTML tags.
    Return Regex.Replace(html, "<.*?>", "")
End Function
Dim search As String = TextBox1.Text
'Remove unnecessary words in input
search = search.Replace("what's", "")
search = search.Replace("what is", "")
'Remove leading and trailing spaces in input
search = search.Trim()
'The following is required for google search
search = search.Replace("+", "%2B")
search = search.Replace(" ", "+")
Dim sourcecode As String
Dim wb As New WebBrowser
wb.Navigate("http://ift.tt/1h5G4rz'm+Feeling+Lucky&q=" & search & "+wikipedia")
sourcecode = wb.DocumentText.ToString()
'First remove the text before the first <p>
sourcecode = sourcecode.Substring(sourcecode.IndexOf("<p>"), 150)
'Remove the text after the first period.
sourcecode = sourcecode.Substring(0, sourcecode.IndexOf(".") + 1)
'Remove HTML tags
sourcecode = StripTags(sourcecode)
'Show the required output in the label
lblReply.Text = sourcecode
I'm getting an ArgumentOutOfRangeException on the line that removes the text before the <p> tag. What's wrong with the code?
Thanks,
Rahul
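One likely explanation, offered as a sketch rather than a confirmed diagnosis: WebBrowser.Navigate returns before the page has finished loading, so DocumentText can still be empty when it is read. In that case IndexOf("<p>") returns -1 and Substring(-1, 150) throws ArgumentOutOfRangeException. Substring throws the same exception when fewer than 150 characters remain after the tag. The helper below guards against both cases. The names FirstSentence, FirstSentenceSketch and maxLength are hypothetical and not part of the original code, and the sample HTML in Main is made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FirstSentenceSketch
    ' Hypothetical helper: returns the first sentence after the first <p> tag,
    ' or an empty string when the markup is empty or has no <p> tag.
    Function FirstSentence(ByVal html As String, Optional ByVal maxLength As Integer = 150) As String
        If String.IsNullOrEmpty(html) Then Return ""

        Dim start As Integer = html.IndexOf("<p>")
        ' IndexOf returns -1 when nothing matches; Substring(-1, ...) would throw.
        If start < 0 Then Return ""

        ' Strip tags first so that markup does not count toward the length limit.
        Dim body As String = Regex.Replace(html.Substring(start), "<.*?>", "").Trim()

        ' Never ask Substring for more characters than remain.
        body = body.Substring(0, Math.Min(maxLength, body.Length))

        ' Keep everything up to and including the first period, if there is one.
        Dim period As Integer = body.IndexOf(".")
        If period >= 0 Then body = body.Substring(0, period + 1)
        Return body
    End Function

    Sub Main()
        ' Made-up sample input, not real Wikipedia markup.
        Console.WriteLine(FirstSentence("<html><p>A <b>computer</b> is a machine. It runs programs.</p></html>"))
        ' Empty input (for example, a page that has not loaded yet) yields "".
        Console.WriteLine(FirstSentence("").Length)
    End Sub
End Module
```

Unlike the original code, this sketch strips the tags before truncating, so tags inside the paragraph do not use up part of the 150 characters. To make sure the page has actually loaded, the helper would be called with wb.DocumentText from a handler for the browser's DocumentCompleted event instead of immediately after Navigate.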