Cruising Newsgroups: Accelerated Text Parsing with BLOBs
By Tim Tonooka, 4D Solution Partner
Technical Note 01-6
Technical Notes for 01-02 February 2001
Introduction
This technical note introduces the 4D News Jockey example database. This database includes many features, but in particular it highlights what can be accomplished with optimized text parsing techniques that work with BLOBs.
Why use BLOBs for text parsing?
One of the main benefits of text parsing in BLOBs is that it enables you to get past the 32,000-character limit of 4D text fields. Suppose you're creating a web application. You can start with some basic text, then you add some HTML to it, and pretty soon it can grow past 32,000 characters. And when that happens, you need to start handling your text in BLOBs.
There are some text-parsing techniques you can use that will help you realize substantial performance gains. And if you're going to start writing your own text parsing code, you might as well write it so it works with BLOBs, so you can handle larger blocks of text.
Some of the main applications where you can apply these techniques are:
Heavy-duty web serving applications
Document processing
General text parsing
Implementing a TCP protocol
That last item brings us to the "4D News Jockey" example database that accompanies this technical note. This is a project that I had wanted to do for a long time. I tried using shareware Usenet newsreaders, but I'd always end up finding things that I knew I could improve on if it was in 4D. When the 4D Internet Commands plug-in version 6.7 was released, it included some new features that made it possible to write my own newsreader, so I gave it a shot.
4D News Jockey includes many features because I was not only its designer but also its end user: anytime I found a feature I wanted, I was able to create it in 4D. There is much more to the example database than will be discussed in this technical note. A separate document that accompanies the example database includes comprehensive documentation of its features.
A self-tour of 4D News Jockey
Requirements
In order to perform the text parsing demonstration with the 4D News Jockey newsreader database, you will need to have access to a news server. Here are some of your options for accessing a news server:
Via your ISP account. The number of newsgroups that you will be able to access depends on the news server that your Internet Service Provider makes available to you. (Many ISPs do not offer access to a news server at all.) The most newsgroups I've seen on one news server is over 84,000. If you're interested in perusing newsgroups on a regular basis, you should get an account with an ISP that has a good news server.
If you don't have access to a news server via your ISP account, there are other news servers you can access:
news.4d.com. To access the news server of 4D, Inc., you must be a member of the Partners program. You will need the Partners' user name and password to log in.
news.microsoft.com. Microsoft's news server has many discussion groups about their products, and anyone with an Internet connection can access them (no user name or password required).
Public news servers. There are actually quite a few news servers available to the general public at no charge, with no user name or password required. The downside of using these public news servers is that the free public access is typically unintentional, and free public access to a particular news server could end at any time. For the full scoop on public news servers, and frequently updated listings of them, see <www.newzbot.com>
Starting the tour
Be sure to use the compiled version of the structure. The interpreted version should only be used when you want to examine the code. This program performs massive amounts of text parsing, and the compiled version does that about a thousand times faster.
A note for modem users
You should initiate your PPP dial-up modem connection from outside 4D, BEFORE using the online features of 4D News Jockey. If the modem connection isn't already active when you start using 4D News Jockey's online features, it will dial the modem, but this isn't always successful in initiating a usable connection.
Creating records for your news servers
To begin using the database, you'll need to first enter the information about the news server(s) that you want to connect to. Select Newsgroups > News Servers. This opens an output form window listing your news servers. You'll need to enter at least one record here. Click the "Add Record" button in the output form's footer to display the input form to enter a new record.
For example, if you are a member of the 4D, Inc. Partner program, you should enter a record for the 4D, Inc. news server:
The News Server ID field is automatically filled out with a sequence ID number.
In the News Server field, you need to enter either the name of the news server, or its IP address. In this case, enter "news.4d.com".
If the news server requires authentication for access, you'll also need to enter the User Name and Password. In this example, in the User Name field, enter the current 4D Partners program user name. In the password field, enter the 4D Partners program password for the current month (you'll need to revise this record each month).
If you are not a member of the 4D Partner program, you can use Microsoft's news server to get started:
Enter "news.microsoft.com" in the News Server field.
Leave the User name and Password fields empty. (Some news servers, such as this one, do not require authentication.)
After filling out the fields, click the green Accept button to save the record.
Getting the list of newsgroups
After entering records for your news server(s), you'll want to get the list of newsgroups carried by those news server(s). In the News Servers output form, you can click on a news server record to select it. You can hold down the Shift key while selecting another record to select a range of records, or you can hold down the Control key (Windows) or the Command key (Macintosh) to select multiple non-contiguous records.
After highlighting at least one record in the News Servers output form, click the Download List of Newsgroups button in the footer. This will connect across the Internet to each of the highlighted news servers, and download the list of newsgroups carried by each news server. If the news servers carry many newsgroups, this list could take a long time to download.
Using the Newsgroups list
After 4D News Jockey downloads the list of newsgroups, it opens up the Newsgroups window. This window shows the list of all newsgroup records, including new records for the newsgroups in the list that you just downloaded. Across Usenet, there are many, many newsgroups. There's one for just about every interest. As you peruse the newsgroup names, you'll probably find some that appeal to you.
To help locate newsgroups about topics that interest you, you can use the Query button in the footer of the output form. The other buttons can also help you manipulate the current selection.
Subscribing to a newsgroup
When you find a newsgroup that you're interested in, you can choose to subscribe to it. Highlight a record in the Newsgroups output form, then click the Subscribe button. This "subscribes" you to that newsgroup. This is not the same as subscribing to a mailing list like the 4D NUG: when you subscribe to a newsgroup, no e-mails are sent anywhere, and no one is notified of your interests. In the parlance of news readers, "subscribing" simply means that the news reader program will keep track of your visits to this newsgroup, for example storing information such as the last record you downloaded from the newsgroup, so the news reader knows which articles to start with the next time you visit.
When you click the Subscribe button, the Subscribed Newsgroups window is opened, and a new record is created in that table. This table keeps track of a lot of useful information about your subscribed newsgroups in its records. The newly subscribed newsgroup is added to the end of the list of records already in the window, and is highlighted.
If you happen to know the name of a newsgroup and the news server from which you wish to access the group, you can manually enter a new record in the Subscribed Newsgroups table by clicking the Add Record button in the window's footer area.
Downloading article summaries
The Subscribed Newsgroups window includes a Download Article Summaries button in the output form's footer area. Highlight one or more records in the Subscribed Newsgroup output form, then click this button. 4D News Jockey will then connect across the Internet and download article summaries consisting of only a few selected fields from the article headers in each of the highlighted newsgroups, so you can get an idea of what articles are in the newsgroups. These article summary records can be downloaded much faster than complete articles.
Perusing article summaries
After the downloading of article summaries has completed, the Articles output form window will be opened. If the window was already opened, it will be brought to the front. The current selection of the Articles window will be changed to show only the article summaries that were just downloaded.
The Articles window includes many interface features to assist you in perusing the list to find article summaries of interest to you. (These features are described in full detail in the 4D News Jockey documentation.) For example, you can click on the column headers to sort the articles by the contents of that column. You can Shift-click on a column header to sort the articles in reverse order. 4D News Jockey does not include true article threading, but you can accomplish pseudo-threading by clicking the header button of the Subject column.
Another interface feature is a shortcut to instantly scroll to the last record by pressing the End key (or Shift-Home). (The Home key will take you back to the top of the list.) This uses a little trick with the HIGHLIGHT RECORDS command that saves you the usual routine in a 4D output form: using the scrollbar to go to the end of the list, which displays a blank screen of records, and then manually scrolling back a bit to bring the last records of the list into view.
Downloading articles
Often, you can decide just from skimming the article summaries in the output form which articles you'd like to download the full content of. Once you find some articles that look interesting, highlight them in the output form and click the "Foreground Download" button. 4D News Jockey then connects across the Internet and downloads the header and body of each article. As each article downloads, the Body Length field is filled in, and you'll see a double greater-than sign appear in the DL column, to mark the record as having been downloaded.
If the article contains a uuencoded binary, and the "Extract binaries as each article downloads" preference is enabled (the default setting in the Download Settings dialog), after the article downloads, 4D News Jockey will then parse out the uuencoded binary from the article body text. After extracting the binary content, it will be decoded into its original binary format. The results will be stored into separate tables ([ArticleBodies] and [BinaryFiles]).
After downloading the full text for a batch of articles, double-click one of them in the output form. This will display the article in the input form, where you can review its content.
Another nice interface feature is that if you double-click one record in the output form, then in the input form you use the blue record navigation buttons (First, Previous, Next, and Last) to move around inside the selection of records, when you exit back to the output form, the record that you last looked at will be highlighted. (The normal behavior of 4D is that the first record you had clicked in the output form to get to the input form is the one that will be highlighted when you return to the output form, even though that may not be the last record you looked at in the input form.) This is another interface trick performed with the HIGHLIGHT RECORDS command introduced in 4D v6.5.
Other features in the Articles output form allow you to highlight a group of records and modify them as a batch (with the Apply Keywords or Apply Label buttons), without losing the sort order or the scroll position. You can also delete a batch of Articles records, while retaining the original sort order of the records remaining in the list. And you can remove a subset of articles without losing the sort order. The code that does this uses the commands BOOLEAN ARRAY FROM SET, and LONGINT ARRAY FROM SELECTION introduced in 4D v6.5. Using those commands allows you to build an array of the selected record numbers for the records that were highlighted (the UserSet). Once you've got that, you can manipulate the selection of records without losing the sort order, by using the GOTO SELECTED RECORD command. If the user is trying to work their way through a long list of records, they don't want to have to re-sort them and scroll back each time after modifying a record.
Online Search
The Online Search is one of the most powerful features in 4D News Jockey. Choose Newsgroups > Online Search to display this window.
This feature allows you to search through newsgroups to find articles that interest you, so you can then download them. You can select newsgroups to search in from the list of subscribed groups, or you can pick out newsgroups from the entire list. You select newsgroups in the upper part of the window, and use the buttons underneath the list to add those newsgroups to the list of newsgroups to be searched in.
To help you find newsgroups faster, you can query the list of newsgroups. Here, the Online Search window lets you take advantage of a new feature in 4D v6.7. "Contains" searches in 4D used to be very slow. 4D v6.5 included optimizations that made those searches significantly faster. And 4D v6.7 improves that even more. Now, if you have an indexed alpha field, and you do a contains search in it, the contains search will use the index. To take advantage of this new capability, I've added a field called "NameEndsWith," which holds the last 80 characters of the newsgroup name. With Usenet newsgroup names, the part of the name that you'll be most interested in is more likely to be at the end of the name than at the beginning. And very few newsgroup names are longer than 80 characters anyway. For example, you could do a query to find all newsgroups where "NameEndsWith" contains "database". Then you could click the Copy All Groups button (the one with two blue arrowheads) to copy all those groups down to the list of newsgroups to be searched in.
In the Find Articles Containing entry area at the bottom of the window, you can type something like "ODBC" and click the Online Search button. Then the program will go across the Internet and connect to your news servers that carry the specified newsgroups. For each of the chosen newsgroups, the Subject field of the articles is searched, to find the articles that contain the specified search string.
If you wanted to use a browser to find newsgroup articles, you'd have to open up a list of all the articles for one newsgroup, manually skim the titles, skip past all the spam that says you can instantly become wealthy, lose weight effortlessly, and hot babes want to meet you, just to try to find the articles about ODBC. The Online Search skips all that.
After the search completes, the Online Search window displays the results. At the top of the window is the list of newsgroups that were searched, and below that is the list of the articles found. When you click the Download Summaries button, 4D News Jockey downloads summaries of all these articles and creates records for them in the Articles table. This opens the Articles window, from where you can then select the articles you would like to download the full content of, and download them.
4D News Jockey progress window
While the Online Search runs, you'll see another special feature of 4D News Jockey. Its custom progress window is shown in the upper-left corner of the screen. Instead of having only a standard progress thermometer, this display enables you to monitor the details of a complex sequence of events. For example, as the Online Search runs, the progress window tells us how many newsgroups we've searched, what newsgroup we're currently searching, the running count of how many articles we've found, etc. This progress window could also be used to assist you in troubleshooting complex code, by letting you see exactly what's transpired so far.
One design issue you need to consider in the creation of a progress reporting system is what you'll do when you want to monitor multiple activities running simultaneously. Do you really want to clutter up the screen with a separate progress window for each activity? The 4D News Jockey progress system uses one progress window to monitor all activities. The progress window displays the status of whatever activity is the frontmost process. This way, there is only one progress window, so you know where to look for it.
The project methods that run the progress window are very easy to use. The IFC_ProgressWindowOpen method opens the progress window. The IFC_ProgressMessage method displays individual progress messages. It allows you to specify the x and y coordinates of where you want the message to appear within the progress window. The IFC_ProgressWindowClose method closes the progress window.
This system is about as easy to use as 4D's MESSAGE command, except it has the advantage that the display is persistent: after a screen draw, it'll still be there, unlike the MESSAGE command's text. Also, with 4D News Jockey's progress system, even after an activity has concluded, and even after the process that was running it has ended, you can recall the final progress window for that activity and redisplay it.
What's going on behind the scenes?
At this point, it'd be good to take a look behind the scenes, to see what's really going on. Underneath the surface of 4D News Jockey's elaborate interface, it's using a simple text-based TCP protocol that dates back about 22 years, way before the days of the World Wide Web. NNTP (Network News Transfer Protocol) is a text-based TCP protocol, originally used with a command-line interface. You send a text command, and receive back a text response. In 4D News Jockey and other modern newsreaders, the program then takes the text stream response and parses it out into fields and records. We can even receive encoded binary data such as pictures, music, and movies. But at its core, the newsreader is basically a giant text parser.
Let's demonstrate that. Choose Newsgroups > NNTP Terminal to open the NNTP Terminal window. This allows you to interact with a news server at the lowest level possible.
Connecting to a news server
The first thing you'll do in the NNTP Terminal window is select a news server and connect to it. (Prior to doing that, you'll need to have defined at least one news server by creating a record in the NewsServers table.) At the top of the NNTP Terminal window is a pop-up menu/drop-down list showing all the news servers you've created records for in the NewsServers table. Select the news server you want to connect to, then click the Connect button. This button automates the login and authentication sequence.
In the scrolling text area in the NNTP Terminal window, you'll be able to see what happens next: 4D News Jockey opens a connection to the news server. The news server then sends back its greeting response. 4D News Jockey then sends the LIST EXTENSIONS command, to see if the news server requires a user name and password for access. The response "480 Logon Required" would tell us that it does. Then 4D News Jockey would use the AUTHINFO command to send the username and the password. If these are valid, the news server will reply back with "281 Authentication ok".
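As a rough illustration of the status handling described above, here is a minimal sketch in Python (not the 4D code used in the example database). It assumes only what the text shows: each reply starts with a three-digit response code followed by descriptive text, with 480 meaning a logon is required and 281 meaning authentication succeeded. The function names are hypothetical.

```python
def parse_status(line: bytes):
    """Split an NNTP status line into (code, text)."""
    text = line.decode("latin-1").rstrip("\r\n")
    code, _, rest = text.partition(" ")
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"not an NNTP status line: {text!r}")
    return int(code), rest


def needs_login(code: int) -> bool:
    # 480 is the code shown above for "Logon Required"
    return code == 480


def login_ok(code: int) -> bool:
    # 281 is the code shown above for "Authentication ok"
    return code == 281
```

A caller would test `needs_login` on the first reply and, after sending the user name and password, test `login_ok` on the last one.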
Now you're ready to start issuing commands to the news server.
Experimenting with NNTP commands
To learn the full details about the NNTP commands, you'll need to read the RFCs and related documents. Here's a quick review of some NNTP commands you can experiment with.
To send a command to the news server, type it in the entry area to the right of the Send Command button, then click the button (or press the keyboard shortcuts Return or Enter). The commands aren't case-sensitive (though certain parameters can be).
Start by sending the HELP command. Type "HELP" into the command area (leave out the quote marks), and click the Send Command button. This returns a list of the NNTP commands that are supported by the particular news server you're connected to. As you can see, it's a fairly brief list. It's a relatively simple protocol.
Now use the LIST ACTIVE command. This makes the news server reply with a list of its active newsgroups. Warning: if your news server carries many newsgroups (the most newsgroups I've ever heard of a news server carrying is 84,196), it will take a while to receive back the response. The NNTP Terminal window will only display the first 32,000 characters of the response.
Let's suppose you have connected to the 4D, Inc. news server (news.4d.com). The reply to the LIST ACTIVE command will be:
215 list of newsgroups follow
Connectivity 237 7 m
Here's what that means. The first line starts with "215" which is a response code the news server sends. All responses sent by the news server start with a response code, then some descriptive text, in this case "list of newsgroups follow." From the remainder of the response, we see that this news server carries only one newsgroup, which is named "Connectivity." The article number of the most recent article the news server carries for this newsgroup is 237. The oldest article the news server still has for this newsgroup is article number 7. The "m" at the end of the line indicates that this is a moderated newsgroup. (The other codes that would be used there would be "y" indicating that posting is allowed, or "n" to indicate that posting is not allowed.)
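The line format just described can be parsed with a few lines of code. The following is a minimal Python sketch rather than the 4D code from the example database; the names `ActiveGroup` and `parse_active_line` are hypothetical, and it assumes the four fields are separated by whitespace as in the reply shown above.

```python
from typing import NamedTuple


class ActiveGroup(NamedTuple):
    name: str
    last: int     # article number of the most recent article
    first: int    # article number of the oldest article still held
    status: str   # "y", "n", or "m"


STATUS_MEANING = {
    "y": "posting allowed",
    "n": "posting not allowed",
    "m": "moderated",
}


def parse_active_line(line: str) -> ActiveGroup:
    """Parse one data line of a LIST ACTIVE reply."""
    name, last, first, status = line.split()
    return ActiveGroup(name, int(last), int(first), status)


group = parse_active_line("Connectivity 237 7 m")
print(group.name, group.last, group.first, STATUS_MEANING[group.status])
# prints: Connectivity 237 7 moderated
```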
Now that we know the name of a newsgroup on the news server, we can explore some of the newsgroup-specific commands. These let you work directly with the articles in a newsgroup. These commands require that you first specify the default newsgroup you want to work with. This is done with the NNTP "GROUP" command. It's like using the DEFAULT TABLE command in 4D.
In this example, you'd continue by entering the following command:
GROUP Connectivity
This will make Connectivity become our current newsgroup on the news server. The news server will return this reply:
211 205 7 237 Connectivity
Analyzing this reply, we see that "211" is the NNTP response code. The rest of the response tells us that the estimated number of articles the news server currently has available for this newsgroup is 205, with article numbers ranging from 7 to 237.
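Note that the GROUP reply lists the lowest article number before the highest, the reverse of the order used in the LIST ACTIVE reply. A minimal Python sketch of parsing it (the function name and dictionary keys are hypothetical, not taken from the example database):

```python
def parse_group_reply(line: str) -> dict:
    """Parse a '211 count first last name' reply to the GROUP command."""
    code, count, first, last, name = line.strip().split(maxsplit=4)
    if code != "211":
        raise ValueError(f"unexpected response code: {code}")
    return {
        "count": int(count),   # estimated number of articles available
        "first": int(first),   # lowest article number
        "last": int(last),     # highest article number
        "name": name,
    }


reply = parse_group_reply("211 205 7 237 Connectivity")
```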
Earlier, we used the automated interface of 4D News Jockey to pull data from the news server. So what you're seeing in the NNTP Terminal window now is what 4D News Jockey is doing behind the scenes, communicating back and forth with the news server. It's using the 4D Internet Commands plug-in to send NNTP commands to the news server, and receive back the text replies.
Now use the "ARTICLE 237" command. The news server replies back with the full text of article number 237. The article consists of two parts, the head and the body.
If you use the "HEAD 237" or "BODY 237" commands, you can receive back the head or body separately.
From this, you can see the raw material that 4D News Jockey works with. In its regular downloading, it parses the text received in this format, and creates article summary records from it.
It's interesting to note that when you connect to the news server, the news server actually has its own database. So we can do some database-to-database communications with it. It has what's called an "overview" database that contains an index of some of the fields for the articles on the news server.
Try the "LIST OVERVIEW.FMT" command. This sends back a list of the fields in the "overview" database. From this, you can see that what's happening when you communicate with the news server is that you're actually communicating with a database of news articles. This is the key to how the Online Search works. It just searches through the field indexes in the overview database on the news server for each newsgroup. That way, you can query on individual fields, do pattern searches, etc.
Let's see how we can tap into that power. Try the "XOVER 234-237" command. The news server will send back article summaries -- the contents of the fields in the overview database for articles number 234-237:
224 Overview information follows
234 Re: EDM Updating "Sebastian Frey" <sebfrey@sextantti.com> Tue, 29 Aug 2000 13:04:12 -0700 <002651319201d80WWW@news.acius.com> <39AAC8EC.D73790C6@knowledgesharing.com> <001753215231c80WWW@news.acius.com> <39AAF88C.29673324@knowledgesharing.com> <0012a4010151d80WWW@news.acius.com> <39ABF9DF.9FC6CD2A@knowledgesharing.com> 5464 126 Xref: www Connectivity:234
235 EDM_ModelRowCountGet Simon J Wright <swright@knowledgesharing.com> Wed, 30 Aug 2000 10:48:22 -0400 <39AD1EB6.2FACB461@knowledgesharing.com> 4080 26 Xref: www Connectivity:235
236 EDM Generic Arrays Simon J Wright <swright@knowledgesharing.com> Wed, 30 Aug 2000 16:40:39 -0400 <39AD7147.75B0DE26@knowledgesharing.com> 4080 13 Xref: www Connectivity:236
237 Re: EDM_ModelRowCountGet "Sebastian Frey" <sebfrey@sextantti.com> Thu, 31 Aug 2000 07:50:48 -0700 <000541605151f80WWW@news.acius.com> <39AD1EB6.2FACB461@knowledgesharing.com> 4080 42 Xref: www Connectivity:237
(Don't get confused by the first line of the news server's response, which is "224 Overview information follows." The 224 there is the news server's response code, not a reference to article number 224.)
These summaries include all the fields from the list returned by the "LIST OVERVIEW.FMT" command we used earlier.
When you use 4D News Jockey to download article summaries, this is exactly what it gets back from the news server after using the XOVER command. Then it uses basic text parsing techniques to parse these article summaries into records in the Articles table.
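To make the summary parsing concrete, here is a minimal Python sketch (not the 4D code from the example database). It rests on two assumptions that the printed listing above cannot confirm: that the server separates overview fields with tab characters, and that the fields arrive in the commonly used order of article number, subject, author, date, message ID, references, byte count, and line count, followed by any optional fields such as Xref. On a real server, the field order should be taken from the LIST OVERVIEW.FMT reply. All names in the code are hypothetical.

```python
OVERVIEW_FIELDS = [
    "number", "subject", "from", "date",
    "message_id", "references", "bytes", "lines",
]


def parse_overview_line(line: str) -> dict:
    """Split one tab-delimited XOVER line into a record (assumed field order)."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) < len(OVERVIEW_FIELDS):
        raise ValueError("too few fields in overview line")
    record = dict(zip(OVERVIEW_FIELDS, parts))
    # Anything after the fixed fields (for example Xref) is kept as-is.
    record["extra"] = parts[len(OVERVIEW_FIELDS):]
    for key in ("number", "bytes", "lines"):
        record[key] = int(record[key]) if record[key] else 0
    return record


# Article 235 from the listing above, rebuilt with tab delimiters.
sample = "\t".join([
    "235",
    "EDM_ModelRowCountGet",
    "Simon J Wright <swright@knowledgesharing.com>",
    "Wed, 30 Aug 2000 10:48:22 -0400",
    "<39AD1EB6.2FACB461@knowledgesharing.com>",
    "",  # this article has no References
    "4080",
    "26",
    "Xref: www Connectivity:235",
])
rec = parse_overview_line(sample)
```

Splitting on the tab rather than on spaces matters here, because the subject, author, and date fields all contain spaces themselves.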
Disconnect from the news server
When you're done communicating with a news server, click the Disconnect button to issue the NNTP "QUIT" command that ends your session with that news server. You can then connect to a different news server, or close the NNTP Terminal window.
The true test of the parsing code
Now that you've completed the exercises in the NNTP Terminal window, you've learned that nearly all the records in the 4D News Jockey database are created by parsing the text responses sent by the news server in reply to commands sent to the news server by 4D News Jockey.
So far, the text parsing we've done is relatively simple. We parse through the text stream looking for a delimiter character, then drop the preceding text into a field. Now it's time to ratchet up the complexity level, and see how well we fare. The true test of an example database is not how well it holds up using some carefully prepared demo data, or performing tasks that are relatively simple. What you want to know is how well it will hold up against unknown data.
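The delimiter scan described above can be sketched as follows. This is a Python illustration, not the 4D code from the example database: a Python bytes object stands in for a BLOB, and the function names are hypothetical. The idea it shows is that the parser keeps an offset into the buffer and advances it past each delimiter, instead of repeatedly copying the remaining text.

```python
def next_field(buf: bytes, offset: int, delim: bytes = b"\t"):
    """Return (field, new_offset); new_offset is -1 once the buffer is used up."""
    end = buf.find(delim, offset)
    if end == -1:
        return buf[offset:], -1
    return buf[offset:end], end + len(delim)


def split_fields(buf: bytes, delim: bytes = b"\t"):
    """Walk the whole buffer by offset, collecting every field."""
    fields = []
    offset = 0
    while offset != -1:
        field, offset = next_field(buf, offset, delim)
        fields.append(field)
    return fields
```

Because only the offset moves, the same loop works unchanged on a buffer far larger than 32,000 characters.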
For the next exercise you'll need to subscribe to a newsgroup that features pictures. For this, you'll need to use a news server that carries newsgroups of this type. (Neither 4D, Inc.'s nor Microsoft's news server does.)
Choose Newsgroups > All Newsgroups. This displays the list of all the newsgroups carried by all your news servers.
Click the Query button and search for records where the "NameEndsWith" field contains "pictures". (There are also picture newsgroups that don't contain the word "picture" in their name.)
After the query completes, click the header button in the Group Name column to sort the records by their name.
Now you can scroll through this list to look for a picture newsgroup that interests you. As you contemplate this selection, please bear in mind that as a 4D developer, you occupy an esteemed position within the community, alongside the butcher, the baker, and the candlestick maker. Be sure to make a choice worthy of your venerated status in society. An excellent selection would be "alt.binaries.pictures.fractals". If that is unavailable from your news server, other suitable choices would include "alt.binaries.pictures.autos", "alt.binaries.pictures.vehicles", or "alt.binaries.clip-art".
Subscribe to the newsgroup of your choice by highlighting the group in the Newsgroups output form and clicking the Subscribe button in the form's footer.
The Subscribed Newsgroups window then opens up, displaying a record for the newsgroup that you just subscribed to. With that record highlighted in the Subscribed Newsgroups window, click the Download Article Summaries button in the footer of the form. This will download summaries of all of the articles that your news server has for that newsgroup. The Articles window will open up, displaying the article summaries you just downloaded.
To perform the next step of this exercise, you'll need to change one of the default settings. Choose Show Download Settings from the Downloading menu. This opens the Download Settings floating palette window. Uncheck the "Extract binaries as articles download" check box.
Now return your attention to the Articles window. In these binary picture newsgroups, some articles contain both pictures and text, while others contain only text. For the purpose of this demonstration, we want to separate the articles with pictures from the text-only articles. The easiest way to tell these articles apart is by their size. Click the header button over the "Total Length" column. This sorts the list by the size of the articles. The larger articles almost certainly contain pictures. Click on the first record in the list to highlight it. Then scroll down the list until you see some articles that are larger than 30,000 bytes. Hold down the Shift key and click on the last article that is smaller than that. Now all the small articles in the list should be highlighted. Click the Omit Subset button to remove the small articles from the current selection.
Locate two or three articles near the top of the list whose Subject includes ".gif" or ".jpg" and highlight them by clicking on them while holding down the Ctrl key (Windows) or the Command key (Macintosh).
Now click the Foreground Download button to download the full text of the articles. After they download, a double "greater-than" symbol appears in the DL column to indicate that the articles have been downloaded. Double-click on one of the articles to display it in the input form. The Body field should show a lot of gibberish text. What's that all about?
Since NNTP is a 22-year-old protocol, it's really designed for sending text. So to include binary content such as a picture in your article, the binary has to be encoded as text and inserted into the body of the article. The vast majority of newsgroup binaries are encoded with the uuencode format. The rest use MIME encoding. Uuencode, which stands for Unix-to-Unix encoding, is a way of taking a binary file, and converting it so that it can be represented using only 64 different ASCII characters.
4D News Jockey can parse through the text of an article and identify if it contains a uuencoded binary. If it does, the code locates its beginning and end, then extracts the binary from the article, decodes it from the uuencoded format into its native binary format, and stores the results into a separate BLOB field. To complicate things more, sometimes people insert multiple binaries into the body of an article, and 4D News Jockey handles those as well.
To go through an article and get the picture out of it, what exactly does 4D News Jockey need to do, in terms of parsing? This makes for an interesting exercise. We'll take a look at the raw uuencoded binary in the input form, and see what has to be done to parse out the picture.
In an article that includes a uuencoded binary, the text sometimes includes introductory comments, typically saying something about the picture. We need to skip past that, to get to the uuencoded binary itself. The uuencoded binary begins with a line like "begin 644 MyPicture.jpg". In the uuencode format, a line starting with the word "begin" in lowercase letters is the signal for the beginning of a binary. So we can parse and find that: a carriage return, a line feed, and a lowercase b-e-g-i-n. But how do you know that's not just part of somebody's sentence, continued on a second line? You parse over to the next item, and find something that looks like a Unix file permissions code. So that's a good clue. But maybe that's just a coincidence? Then we parse the next item, and find something that looks like a filename, immediately followed by a carriage return and line feed, so that's three good clues. If this is indeed a binary, we want to make sure it's complete. So we'll go to the end of the article, and parse backwards to see if the uuencode end-of-string marker (the word "end" on a line by itself) is present. If we succeed in finding that, we've established that we apparently have both the beginning and end of a uuencoded binary, so we can continue parsing the binary content itself.
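The three clues described above can be sketched as a small 4D function. This is only an illustration, not the code used in 4D News Jockey: the method name UU_IsBeginLineF is hypothetical, it assumes the candidate line has already been copied into a text parameter without its trailing carriage return and line feed, and it assumes the permissions code has three or four octal digits.

```4d
` Hypothetical project method: UU_IsBeginLineF (line) -> Boolean
` Returns True if the line looks like "begin 644 MyPicture.jpg"
C_TEXT($1;$tLine)
C_BOOLEAN($0;$bOK)
C_LONGINT($nSpace;$i;$nCode)
$tLine:=$1
$bOK:=False
If (Length($tLine)>=11)  ` "begin 644 x" is the shortest plausible header
  If (Substring($tLine;1;5)="begin")  ` 4D string comparison ignores case...
    ` ...so also require a lowercase "b" (98), followed by a space (32) after the word:
    If ((Ascii($tLine[[1]])=98) & (Ascii($tLine[[6]])=32))
      ` Clue 2: find the space that ends the permissions code
      $nSpace:=Position(" ";Substring($tLine;7))
      If (($nSpace=4) | ($nSpace=5))  ` Three or four digits precede it
        $bOK:=True
        For ($i;7;5+$nSpace)
          $nCode:=Ascii($tLine[[$i]])
          If (($nCode<48) | ($nCode>55))  ` Not an octal digit, 0 through 7
            $bOK:=False
          End if
        End for
        ` Clue 3: something that could be a filename must follow the space
        If (Length($tLine)<(7+$nSpace))
          $bOK:=False
        End if
      End if
    End if
  End if
End if
$0:=$bOK
```

The nested If statements are deliberate: 4D evaluates both sides of an & expression, so each character reference is only made after the length of the line has been checked.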
This binary content is a whole lot of gibberish consisting of capital letters, numbers, and some symbols. In the input form, 4D News Jockey shows only a short 2K snippet of it. The full text of long articles is stored in a separate table ([ArticleBodies]). The parsing code will have to look at this text stream, and make sure it follows the rules of the uuencode binary format. Each line of about 63 characters needs to be converted into 45 bytes of original binary data. We need to parse through it, and perform validation to make sure that nothing got lost on the way back from the news server. So we look at the first character, an encoded length byte, which tells us how many bytes of decoded data the line represents. From that we can calculate how many characters of encoded data should follow on the line. This gives us an expectation of where we should find the carriage return and line feed at the end of the line. So we parse forward to find out where that is, and check whether there's the right number of characters in between. If not, we'll have to alert the user that an error was found and that the transmission was invalid.
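As a worked illustration of that length check (the variable names are only illustrative), a full-length uuencoded line starts with the letter "M". Subtracting 32 from its character code gives the number of decoded bytes, and every three decoded bytes take four encoded characters:

```4d
C_LONGINT($nCharCode;$nDecodedBytes;$nEncodedChars;$nExpectedCR)
$nCharCode:=Ascii("M")  ` 77, the length character of a full line
$nDecodedBytes:=$nCharCode-32  ` 45 bytes of original binary data
` Round up to a whole number of three-byte groups, four characters each:
$nEncodedChars:=(($nDecodedBytes+2)\3)*4  ` 60 encoded characters
` Counting from 1 at the length character, the carriage return should be here:
$nExpectedCR:=1+$nEncodedChars+1  ` 62nd character of the line
```

If the carriage return turns up anywhere before that position, some of the line's data is missing.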
In between the length byte and the end of the line are groups of four ASCII characters that each need to be converted into three binary bytes, by doing some bit shifting. When you're sending binary content over Usenet, you can't be certain that someone isn't using the eighth bit (#7) of each byte for parity checking. So you can only be certain of getting seven out of eight bits of data safely sent. The uuencode format is a way of encoding binary files so they use only 64 different ASCII characters. (This has the drawback of making your file about 40% larger, or roughly 140% of its original size.) Since you're only using 64 different ASCII characters, that means that you're only using six bits per character. So you take six bits from each of the four ASCII characters, a total of 24 bits, put them into one long row, then read them off in groups of eight bits to get back the three original binary bytes.
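Here is a minimal worked example of that conversion, using the four characters "0V%T", which are the uuencoded form of the three bytes that spell "Cat". The variable names are illustrative only. The actual 4D News Jockey code is shown later in this note.

```4d
C_TEXT($tQuartet)
C_LONGINT($nBitstream;$nByte1;$nByte2;$nByte3)
$tQuartet:="0V%T"  ` Uuencoded form of "Cat"
` Take six bits from each character and line them up in one 24-bit value:
$nBitstream:=((Ascii($tQuartet[[1]])-32) & 63) << 18
$nBitstream:=$nBitstream+(((Ascii($tQuartet[[2]])-32) & 63) << 12)
$nBitstream:=$nBitstream+(((Ascii($tQuartet[[3]])-32) & 63) << 6)
$nBitstream:=$nBitstream+((Ascii($tQuartet[[4]])-32) & 63)
` Read the 24 bits back off in groups of eight:
$nByte1:=($nBitstream >> 16) & 255  ` 67, the letter "C"
$nByte2:=($nBitstream >> 8) & 255  ` 97, the letter "a"
$nByte3:=$nBitstream & 255  ` 116, the letter "t"
```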
In 4D News Jockey, this conversion is all done with 4D code. Obviously, the conversion could be done faster with an external. But doing it in 4D code gives us a chance to test the parsing code on something really demanding.
So let's see just how well the parsing code stands up. Click the yellow Cancel button in the Articles input form, to return to the output form.
Go back to the Download Settings dialog and check the "Extract binaries as articles download" check box. Also be sure to check "Display pictures as they download."
In the Articles output form, click the header button in the Total Length column. This will sort the list in ascending order by article size.
Press Ctrl-A (Windows) or Command-A (Macintosh) to highlight all the articles. Then click the Foreground Download button.
As the full text of the articles is downloaded, any uuencoded binaries will be extracted and when those binaries contain a picture, the picture will be displayed in a large floating palette window.
As you watch each picture display on the screen as it downloads, think about what's going on: we're downloading text articles from the news server, and as the full text of each article is downloaded, the 4D News Jockey program is doing all the text parsing described earlier. That's happening for each picture as you watch it download off Usenet: it's parsing through the text of the article, looking for the word "begin", looking for the Unix file permissions code, looking for the file name, looking for the end-of-text marker, going back, validating each line, decoding the characters…and all this is done in compiled 4D code, as you watch.
To help optimize the text searching, an adaptation in 4D code of the Boyer-Moore search algorithm is used. You might ask, "How often is the Boyer-Moore search and replace being run in 4D News Jockey?" The answer is that it's used on every text response received from the news server. So bear in mind as you watch each picture download that every bit of text that's downloaded has had the Boyer-Moore search run across it at least once. Judging from the speed at which the pictures are downloaded and extracted, you can see how fast that runs in compiled 4D code.
(The reason the search and replace has to be run across all text is that in the NNTP protocol, in preparing text for transmission from the news server, whenever the first character of a text line is a period, it has to be replaced with two periods, so that period won't be confused with the NNTP end-of-text marker, which is a CRLF + period + CRLF. The newsreader client then needs to strip out those extra periods.)
In addition to using the Boyer-Moore search routine on all incoming text to strip out those doubled periods, it's also incorporated into the code involved in finding, extracting, and decoding binaries.
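As a rough sketch of how those doubled periods could be stripped, the call below uses the BLB_TextReplaceString method from the example database, with the parameter order given in its description later in this note. The process variable vArticleBLOB is hypothetical, and this is not necessarily how 4D News Jockey itself makes the call.

```4d
C_BLOB(vArticleBLOB)  ` Hypothetical process variable holding the raw server response
C_TEXT($tCRLF)
$tCRLF:=Char(Carriage return)+Char(Line feed)
` A doubled period at the start of a line becomes a single period again:
BLB_TextReplaceString (->vArticleBLOB;$tCRLF+"..";$tCRLF+".")
```

Note that this simple version only catches a doubled period that follows a line ending, so a doubled period at the very beginning of the BLOB would need to be checked separately.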
Binary Browser
Once you've downloaded some pictures into the database, you can use 4D News Jockey's thumbnail viewer: Choose Newsgroups > Binary Browser. In the scrolling list of pictures on the left, click on any thumbnail, and it will appear in the large display area on the right side of the window, along with the information about the picture.
Another way of opening up the Binary Browser window is by highlighting several records in the Articles output window, then clicking the Binary Browse button in the footer. This will display the thumbnails of all the pictures contained in the highlighted articles.
The interface in the Binary Browser window is actually very easy to create. The scrolling list of thumbnails is just a 4D subform area. The output form displayed in the subform has no header. Its detail area only has a picture field on the left, and a text field on the right.
Text Parsing with BLOBs
Are BLOBs really faster than text?
Are BLOB variables inherently faster than text variables? No. For the equivalent string-handling operation, there is no performance advantage in using BLOBs over text, other than the fact that BLOBs have a larger capacity.
We can do a simple test where we create a text block of 32,000 characters, and a BLOB of 32,000 bytes, and run a loop to replace every character in each:
vText[[$i]]:="B"
vBLOB{$i}:=66
In compiled code, the BLOB took 105% longer than text on Windows, and 180% longer than text on Macintosh. Now that's just one example, and who's to say there wouldn't be a different situation where the BLOB would be slightly faster. The important point here is that BLOBs in and of themselves are not faster than text.
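A test along these lines might look like the sketch below. The number of passes is an arbitrary assumption, chosen so that the timing is long enough to measure in ticks (sixtieths of a second). It should be run in a compiled structure.

```4d
C_TEXT(vText)
C_BLOB(vBLOB)
C_LONGINT($i;$nPass;$nStart;$nTextTicks;$nBLOBTicks)
vText:="A"*32000  ` Text block of 32,000 characters
SET BLOB SIZE(vBLOB;32000;65)  ` BLOB of 32,000 bytes, each filled with "A"
` Time the text version:
$nStart:=Tickcount
For ($nPass;1;100)
  For ($i;1;32000)
    vText[[$i]]:="B"
  End for
End for
$nTextTicks:=Tickcount-$nStart
` Time the BLOB version (BLOB offsets start at zero):
$nStart:=Tickcount
For ($nPass;1;100)
  For ($i;0;31999)
    vBLOB{$i}:=66
  End for
End for
$nBLOBTicks:=Tickcount-$nStart
ALERT("Text: "+String($nTextTicks)+" ticks; BLOB: "+String($nBLOBTicks)+" ticks")
```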
In situations where you do need to handle more than 32,000 characters of text, the way that you manage your string handling within a BLOB can make an enormous difference in your text parsing performance.
Avoid excessive type changes
If you need to perform string-handling functions and text parsing in a BLOB, you should use custom methods that work directly in the BLOB. You might be inclined (as I was at first) to copy a smaller section of characters out of the source BLOB into a text variable, where you can then use the familiar 4D string-handling commands to manipulate it, then copy the results into a destination BLOB.
You'll get much better performance by using custom text-parsing methods that work directly inside a BLOB, like these methods from the 4D News Jockey example database:
BLB_TextSearchBMF
BLB_TextReplaceString
BLB_TextSubstringF
BLB_PositionF
BLB_TextFindCharF
BLB_TextIsHereF
(You can use 4D Insider to see how these methods are utilized in the database. For performance, they need to run in a compiled structure.)
The reason it's best to handle your text parsing directly in the BLOB is that the BLOB to text and TEXT TO BLOB commands are particularly time-consuming. It's okay to use those commands to initially move your text into a BLOB, and to take text out of the final processed end result. But converting the data type back and forth between BLOB and text many times inside your main parsing loop will really hurt your performance.
Strategic text parsing
The biggest way to boost your text-parsing performance is to think in terms of strategies. You need to study the source text and your desired results and determine the most efficient way to achieve your goal. And each situation can be different. Here are some examples of effective strategies for optimized text parsing:
Make as few passes through the text as possible.
For example, suppose you need to parse records out of a text stream. One technique would be to parse through the source text looking for the record delimiters, and use something like the Substring function to copy one record's worth of text into a smaller string. Then the smaller string would be parsed to look for field delimiters, copying the text in between into fields, before saving the record. The problem with this technique is that you're making two parsing passes through your text.
Instead, you can do it all in one parsing pass: write your loop to parse through the text looking for field delimiters. As each field delimiter is found, the text between that and the previous delimiter is copied into a field of a record. At the same time, the code checks to see if each parsed character is the record delimiter. When the record delimiter is found, the record is saved.
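A minimal sketch of such a single-pass loop follows. It assumes a hypothetical process BLOB named vSource that holds tab-delimited fields and carriage-return-delimited records, with every record ending in a carriage return. What is done with each field and record is left as a comment, since that depends on your own tables.

```4d
C_LONGINT($i;$nSize;$nByte;$nFieldStart;$nOffset;$nLength;$nFieldCount;$nRecordCount)
C_TEXT($tField)
$nSize:=BLOB size(vSource)
$nFieldStart:=0  ` BLOB offset where the current field begins
$nFieldCount:=0
$nRecordCount:=0
For ($i;0;$nSize-1)
  $nByte:=vSource{$i}
  If (($nByte=9) | ($nByte=13))  ` Tab or carriage return: a field ends here
    $nLength:=$i-$nFieldStart
    If ($nLength>0)
      $nOffset:=$nFieldStart  ` BLOB to text updates the offset variable it is passed
      $tField:=BLOB to text(vSource;Text without length;$nOffset;$nLength)
    Else
      $tField:=""  ` Two delimiters in a row: empty field
    End if
    $nFieldCount:=$nFieldCount+1
    ` (Copy $tField into field number $nFieldCount of the current record here)
    $nFieldStart:=$i+1
    If ($nByte=13)  ` The carriage return also ends the record
      ` (Save the record here)
      $nRecordCount:=$nRecordCount+1
      $nFieldCount:=0
    End if
  End if
End for
```

Each byte of the source is examined only once, and text is only copied out of the BLOB when a complete field has been found.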
Peruse as little of the text as possible.
For example, suppose you need to determine if a certain character is within the 90th to 100th characters of a 32,000-character text string. One technique would be to search the entire text for that character, and after completing that search, see if the resulting position of the found character is within the 90th-100th characters. That's about what you'd have to do if you used 4D's Position command.
You'll have much better performance (in compiled code) if you use a custom method, such as BLB_TextFindCharF from 4D News Jockey, that can conduct its search only within the 90th-100th characters of the 32,000-character text.
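The idea behind a range-limited search can be sketched in a few lines. This is not the code of BLB_TextFindCharF itself. It assumes a hypothetical process BLOB named vSource and looks for a carriage return, and because BLOB offsets start at zero, the 90th through 100th characters are at offsets 89 through 99.

```4d
C_LONGINT($i;$nLast;$nFound)
$nFound:=-1  ` Stays at -1 if the character is not in the range
$nLast:=99
If ($nLast>(BLOB size(vSource)-1))
  $nLast:=BLOB size(vSource)-1  ` Never read past the end of the BLOB
End if
$i:=89
While (($i<=$nLast) & ($nFound<0))
  If (vSource{$i}=13)  ` Carriage return
    $nFound:=$i
  End if
  $i:=$i+1
End while
```

At most eleven bytes are examined, no matter how large the BLOB is.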
When you're trying to find a string in a large text block, start your search from the most likely place the string will be, before looking elsewhere.
The most obvious example of this principle would be if you're looking for an end-of-text marker, you should start your search working backwards from the end of the text block, rather than working your way towards it from the beginning.
A Case Study
For the sake of example, I'll compare several versions of the same code from 4D News Jockey, to show how it was improved for text parsing performance using BLOBs. This section of code is iterated many times inside a loop, and is meant to be run in a compiled structure.
Excerpt #1
This excerpt is from an early version of what evolved into the NWS_BinaryUUDecodeF project method in 4D News Jockey. The code is used in the decoding of a uuencoded binary. Uuencoding represents binary bytes by using only 64 different ASCII characters. Each group of three binary bytes is represented by four ASCII characters. In the code below, each encoded ASCII character is converted into a six-bit value. The six bits from four ASCII characters are combined to make a 24-bit number. Then the bits are read off in groups of eight, to get back the original three binary bytes.
(Writing your own uuencode decoding routine in a high-level language like 4D, even in compiled code, won't be as fast as using a plug-in routine written in C++. It's only presented in this example database as an extremely demanding example of text parsing, to demonstrate how fast these text parsing routines can be.)
In this first excerpt, the encoded ASCII string is held in the text variable $tEncodedLine, and the decoded string is held in a text variable $tDecodedLine:
` r33, 8:34 PM, June 11, 2000
` Based on the byte count in first char of the line,
` create strings for encoded and decoded lines of text:
$nDecodedBytes:=$nCharCode-32
$tDecodedLine:=Char(0)*$nDecodedBytes
` Calculate how long the encoded line should be:
$nEncodedBytes:=($nDecodedBytes\3)*4
If ($nDecodedBytes%3#0)
$nEncodedBytes:=$nEncodedBytes+4
End if
` Read the encoded line of text from BLOB in $ptrSource->
` in this case by using the 4D "BLOB to text" command:
$tEncodedLine:=BLB_TextSubstringF ($ptrSource;$nPosCurrent;$nPosCR-$nPosCurrent)
` Pad out the encoded string with $20 chars if needed.
` (This would normally only occur on the last line of the encoded binary):
If (Length($tEncodedLine)<$nEncodedBytes)
$tEncodedLine:=$tEncodedLine+(Char(Space )*($nEncodedBytes-Length($tEncodedLine)))
End if
` Decode the line:
For ($i;0;$nDecodedBytes-1)
If ($i%3=0)
$nQuartetIndex:=($i\3)*4
$nBitstream:=(Ascii($tEncodedLine[[$nQuartetIndex+4]])-32) & 63
$nBitstream:=$nBitstream+(((Ascii($tEncodedLine[[$nQuartetIndex+3]])-32) & 63) << 6)
$nBitstream:=$nBitstream+(((Ascii($tEncodedLine[[$nQuartetIndex+2]])-32) & 63) << 12)
$nBitstream:=$nBitstream+(((Ascii($tEncodedLine[[$nQuartetIndex+1]])-32) & 63) << 18)
End if
` Decode original byte
Case of
: ($i%3=0) ` First byte in encoded quartet
$nCharCode:=($nBitstream & 0x00FF0000) >> 16
$tDecodedLine[[$i+1]]:=Char($nCharCode)
: ($i%3=1) ` Second byte in encoded quartet
$nCharCode:=($nBitstream & 0xFF00) >> 8
$tDecodedLine[[$i+1]]:=Char($nCharCode)
: ($i%3=2) ` Third byte in encoded quartet
$nCharCode:=$nBitstream & 0x00FF
$tDecodedLine[[$i+1]]:=Char($nCharCode)
End case
End for
` Write the decoded line to the destination BLOB:
TEXT TO BLOB($tDecodedLine;$ptrDestBLOB->;Text without length ;*)
Excerpt #2
This is the next revision of the code. Now the encoded string is held in a BLOB variable $oEncodedLine, and the decoded string is held in a BLOB variable $oDecodedLine:
` r34, 11:21 PM June 11, 2000
` Based on the byte count in first char of the line,
` create BLOBs for encoded and decoded lines of text:
$nDecodedBytes:=$nCharCode-32
SET BLOB SIZE($oDecodedLine;$nDecodedBytes;0)
` Calculate how long the encoded line should be:
$nEncodedBytes:=($nDecodedBytes\3)*4
If ($nDecodedBytes%3#0)
$nEncodedBytes:=$nEncodedBytes+4
End if
` Read the encoded line into a BLOB
` (Note that the third parameter (Space) handles the padding of the line):
SET BLOB SIZE($oEncodedLine;$nEncodedBytes;Space )
$nSourceOffset:=$nPosCurrent-1
$nDestOffset:=0
COPY BLOB($ptrSourceBLOB->;$oEncodedLine;$nSourceOffset;$nDestOffset;$nPosCR-$nPosCurrent)
` Decode the line:
For ($i;0;$nDecodedBytes-1)
If ($i%3=0)
$nQuartetIndex:=($i\3)*4
$nBitstream:=($oEncodedLine{$nQuartetIndex+3}-32) & 63
$nBitstream:=$nBitstream+((($oEncodedLine{$nQuartetIndex+2}-32) & 63) << 6)
$nBitstream:=$nBitstream+((($oEncodedLine{$nQuartetIndex+1}-32) & 63) << 12)
$nBitstream:=$nBitstream+((($oEncodedLine{$nQuartetIndex+0}-32) & 63) << 18)
End if
` Decode original byte
Case of
: ($i%3=0) ` First byte in encoded quartet
$nCharCode:=($nBitstream & 0x00FF0000) >> 16
$oDecodedLine{$i}:=$nCharCode
: ($i%3=1) ` Second byte in encoded quartet
$nCharCode:=($nBitstream & 0xFF00) >> 8
$oDecodedLine{$i}:=$nCharCode
: ($i%3=2) ` Third byte in encoded quartet
$nCharCode:=$nBitstream & 0x00FF
$oDecodedLine{$i}:=$nCharCode
End case
End for
` Write the decoded line to the destination BLOB:
$nSourceOffset:=0
$nDestOffset:=$nBytesWritten
COPY BLOB($oDecodedLine;$ptrDestBLOB->;$nSourceOffset;$nDestOffset;$nDecodedBytes)
$nBytesWritten:=$nBytesWritten+$nDecodedBytes
This version ran astoundingly faster than the previous version. Originally, I had concluded that this was due to the elimination of the Ascii and Char commands. But further testing revealed that the section of code that previously used the Ascii and Char functions actually ran slightly slower in Excerpt #2 in which those commands were removed.
The performance boost in the second version is in fact entirely due to the fact that it's now using the COPY BLOB command to copy a section of the larger BLOB into a smaller BLOB where the manipulations are performed, and then using COPY BLOB to copy the manipulated result into a destination BLOB.
The original example used the BLOB to text and TEXT TO BLOB commands to copy a section of the larger BLOB into a text variable where the manipulations were performed, then to copy the text into a destination BLOB. These commands are much slower than COPY BLOB.
Some tests I did with compiled code indicated that:
On Macintosh, BLOB to text took 239% longer than COPY BLOB.
On Windows, BLOB to text took 514% longer than COPY BLOB.
On Macintosh, TEXT TO BLOB took 211% longer than COPY BLOB.
On Windows, TEXT TO BLOB took 314% longer than COPY BLOB.
Excerpt #3
For the sake of reference, here is the final version of the code that appeared in 4D News Jockey on the 4D Summit 2000 CD-ROM:
` r170, 7:30 AM September 14, 2000
` Decode the line, looping once for each group of four encoded characters:
For ($i;0;$nTriplets-1)
$nQuartetIndex:=$nCurrentPos+($i*4) ` BLOB offset for first char of the quartet
` Read the four encoded characters into an array
` (Already confirmed that there are enough bytes left in the BLOB for this)
For ($j;0;3)
$aQuartetBytes{$j}:=$ptrSourceBLOB->{$nQuartetIndex+$j}
End for
` Test each byte for validity:
For ($j;0;3)
$nByteOffset:=$nQuartetIndex+$j
Case of
: ($ptrSourceBLOB->{$nByteOffset}=Carriage return )
$nPosCR:=$nByteOffset+1 ` Store CR position (as an ordinal)
If ($nPosCR<=($nCurrentPos+$nEncodedBytes))
$nError:=9113 ` This line's carriage return is premature: data is missing
Else
` Pad out the rest of the quartet with spaces:
For ($k;$j;3)
$aQuartetBytes{$k}:=Space
End for
$j:=3 ` Exit this loop
End if
: ($ptrSourceBLOB->{$nByteOffset}>96)
$nError:=9114 ` Invalid character in encoded string
: ($ptrSourceBLOB->{$nByteOffset}<32)
$nError:=9114 ` Invalid character in encoded string
End case
End for
` Create a 24-bit value by combining 6 bits from each encoded character:
$nBitstream:=($aQuartetBytes{3}-32) & 63
$nBitstream:=$nBitstream+((($aQuartetBytes{2}-32) & 63) << 6)
$nBitstream:=$nBitstream+((($aQuartetBytes{1}-32) & 63) << 12)
$nBitstream:=$nBitstream+((($aQuartetBytes{0}-32) & 63) << 18)
` Decode each 8 bits of the 24 into one original value:
$nTripletIndex:=$i*3
$oDecodedLine{$nTripletIndex}:=($nBitstream & 0x00FF0000) >> 16 ` First byte in encoded quartet
$oDecodedLine{$nTripletIndex+1}:=($nBitstream & 0xFF00) >> 8 ` Second byte in encoded quartet
$oDecodedLine{$nTripletIndex+2}:=$nBitstream & 0x00FF ` Third byte in encoded quartet
End for
` Write the decoded line to the destination BLOB:
$nSourceOffset:=0
$nDestOffset:=$nDestBaseOffset+$nBytesWritten
COPY BLOB($oDecodedLine;$ptrDestBLOB->;$nSourceOffset;$nDestOffset;$nDecodedBytes)
$nBytesWritten:=$nBytesWritten+$nDecodedBytes
The overhead of running the outer loop has been reduced by changing it to run only once for every group of three decoded bytes, whereas the previous code excerpts ran the outer loop once for every decoded byte.
The code was also modified to do more complete data validation. Whereas the previous examples assumed that the encoded bytes were in the ASCII range valid for the uuencoding, the new code checks to make sure that each encoded byte is actually in the correct ASCII range. (This validation probably slows things down a little.)
Addressing BLOB contents
If you're not used to using BLOBs, one of the most important things to understand is that you should NEVER use an invalid BLOB address.
Let's take this example:
$charCode := MyBLOB {$address}
Here, the code is assigning to the variable $charCode the value of one of the bytes in the BLOB MyBLOB. The particular byte is specified by the variable $address inside the curly braces.
Making sure that the value of $address is valid is critical.
Let's suppose the BLOB is 100 bytes long. The range of valid address values will start with 0, which represents the first byte in the BLOB. Therefore, 99, which is the length of the BLOB minus 1, will be the last valid address value.
So you need to make sure that before you run that line of code, the address value specified is somewhere in the range of 0-99. Any other value, especially in compiled code, can cause a crash, because you've asked for the value of a location in memory that doesn't exist.
You can also use a calculated value for a BLOB address:
$charCode := MyBLOB {$offset+$i}
In this example, the variable $offset could hold the beginning address of a line of text within a BLOB, to which you add the index $i to reference perhaps a particular character in that line of text. When you use a calculated BLOB address, it is essential that you make sure that the expression inside the curly braces evaluates into a valid value within the range of the BLOB's size.
For example, if your BLOB is empty, what would be the range of valid index values? None. Because zero would be referring to the first byte of the BLOB, and there is no first byte. So if the BLOB is empty, you shouldn't even be executing a line of code like:
$charCode := MyBLOB {$address}
Instead, you should always (especially when using calculated values) check for valid range, with code such as:
Case of
:(BLOB size (MyBLOB) = 0)
$charCode := -1
:($address < 0)
$charCode := -1
:($address >= BLOB size (MyBLOB))
$charCode := -1
Else
$charCode := MyBLOB{$address}
End case
In this example, the value -1 is returned in the variable $charCode if the BLOB address specified in $address is invalid.
Optimized text searching routines
To get better text parsing performance, there are some optimized text searching routines available in 4D. Most are based on the Boyer-Moore algorithm. This was outlined in the 1994 ACI US technical note #53, which included an example database. To explain the basic outline of this technique, let's suppose that within some text you're searching for the word "easier". The Boyer-Moore algorithm starts by aligning that word with the beginning of the text, then first compares the last letter of the search string, in this case "r", which is its sixth character, with the character at the sixth position in the text. Let's suppose that we found the letter "w" there. That's not anywhere in the word "easier", so we know that there is no possibility of finding a match for our search string starting anywhere from the first through the sixth characters of the text. So we can skip ahead, and begin looking for a match at the seventh character of the text. This saves us the time that the brute force method would have spent testing for a match at the second, third, fourth, fifth, and sixth positions in the text. The Boyer-Moore algorithm analyzes the search string, and builds a table that shows, for each ASCII character, how many comparison positions can be skipped.
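To make the skip table concrete, here is a minimal sketch of a forward, case-sensitive search in a BLOB. It is a simplified variant (usually called Boyer-Moore-Horspool) that uses only the skip table described above. It is not the BLB_TextSearchBMF method from 4D News Jockey. The method name BLB_SketchFindF is hypothetical, and the sketch assumes a search string of plain 7-bit ASCII characters.

```4d
` Hypothetical project method: BLB_SketchFindF (ptrBLOB; searchString) -> offset
` Returns the zero-based BLOB offset of the first match, or -1 if there is none
C_POINTER($1)
C_TEXT($2;$tPattern)
C_LONGINT($0;$nFound;$nTextLen;$nPatLen;$nPos;$nByte;$i;$j)
ARRAY LONGINT($aSkip;255)  ` Elements 0 through 255, one per byte value
$tPattern:=$2
$nPatLen:=Length($tPattern)
$nTextLen:=BLOB size($1->)
$nFound:=-1
If (($nPatLen>0) & ($nPatLen<=$nTextLen))
  ` A character that isn't in the search string allows a skip of its full length:
  For ($i;0;255)
    $aSkip{$i}:=$nPatLen
  End for
  ` Characters in the search string (except its last position) allow a shorter skip:
  For ($i;1;$nPatLen-1)
    $aSkip{Ascii($tPattern[[$i]])}:=$nPatLen-$i
  End for
  $nPos:=0  ` BLOB offset currently aligned with the start of the search string
  While (($nPos<=($nTextLen-$nPatLen)) & ($nFound<0))
    ` Compare from the last character of the search string back to the first:
    $j:=$nPatLen
    While ($j>0)
      If ($1->{$nPos+$j-1}=Ascii($tPattern[[$j]]))
        $j:=$j-1
      Else
        $j:=-1  ` Mismatch: leave the comparison loop
      End if
    End while
    If ($j=0)
      $nFound:=$nPos
    Else
      ` Skip ahead based on the text byte aligned with the end of the search string:
      $nByte:=$1->{$nPos+$nPatLen-1}
      $nPos:=$nPos+$aSkip{$nByte}
    End if
  End while
End if
$0:=$nFound
```

With the search string "easier", the table holds 6 for "w", so finding a "w" at the sixth position of the text moves the next comparison to the seventh position, as in the description above.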
Several people have started with the example database from the 1994 technical note, and derived their own implementation to search for text in BLOBs. If you're preparing to do this yourself, you should be aware that there are some bugs in that example database: it won't find the search string if the text being searched starts with that string, and the backwards search doesn't work.
Since there are a variety of optimized routines for doing text searching in BLOBs already available in 4D code, you can spare yourself the trouble of writing your own. Here are some of the choices available:
There's a set of routines included in 4D News Jockey. This is an implementation of the Boyer-Moore algorithm based on the old technical note. The only real changes consist of fixing bugs in the original example, and implementing the skip table in a BLOB instead of in a string. Also, the original example featured the ability to do non-case-sensitive and non-diacritical-sensitive searches. In order to do the non-diacritical-sensitive searches, you need to build a table listing the diacritical characters. The original version used a table in the data file, which meant that your database couldn't start out with an empty data file. So I moved that diacritical table into a string resource in the structure.
Michael Ginsberg's freeware "MDG_BLOB_Code" can be downloaded from <ftp://ftp.mdg.com/free_stuff/>. It includes these commands:
BLOB_Append
BLOB_Position
BLOB_Replace
BLOB_Substring
At the 1999 4D Summit, John Macrae and Steven Willis had a "Queries" session where they discussed optimized queries. The presentation included an adaptation for BLOBs of the Boyer-Moore technical note. As well as being on the 1999 Summit CD-ROM, it was also included in a Dimensions article (January/February 1999), and you can download it from the 4D Zine site: <ftp://ftp.4dzine.com/summit_1999/>.
Steven Willis also has plans to do an entirely new implementation of the Boyer-Moore algorithm. Rather than reworking the old technical note, this will go back to the Boyer-Moore algorithm itself. It will be a whole new interpretation in 4D code, including some of the features from the original algorithm that had not previously been implemented in 4D code. Watch for this on the 4D Zine web site: <www.4dzine.com>.
Edward V. Berard wrote a "BlobPosition" method that uses the Boyer-Moore technique to find a specified string within a BLOB. Like the other examples, it's freeware. You can get it from the 4D Zine site: <http://www.4dzine.com/4dz.acgi$freeware_show_00000062>. An updated version can be downloaded via anonymous FTP from: <ftp.toa.com>, in a folder named "Blob Position (4D)".
The "QFree 2.0" freeware plug-in from Escape Information Services, makers of QPix and QMedia, has a variety of interesting QuickTime features, including 4 "Regex" commands that can do text parsing in BLOBs, so you can find, replace, split and extract. It uses an entirely different algorithm than Boyer-Moore. What distinguishes this from the other examples is that it is a plug-in, which gives you a big performance advantage. The Regex command set gives you a lot more versatility in the kind of searches and replaces you can do, because it has a powerful matching syntax that includes wildcards. QFree requires QuickTime 4 or later. <http://www.escape.gr/q/q_download.html>
4D News Jockey's BLOB text parsing methods
4D includes many BLOB commands, but some are of particular interest for text parsing. Here is a list of the BLOB commands used in 4D News Jockey:
If you want to study the text parsing techniques featured in this technical note, the six main project methods in 4D News Jockey you should look at are:
BLB_TextSearchBMF (searchString; ptrSourceText; start; forward?; caseSensitive?; diacritic?; wraparound?{; ptrSkipTable}) -> position
This is the routine that does the Boyer-Moore search explained earlier. It's based on the 1994 technical note. It searches for a string within the source text. You can specify where in the text to start your search, whether you want the search to be forward or backward, whether you want the search to be case-sensitive or diacritical-sensitive, and whether you want the search to wrap around. In 4D News Jockey, it's used to find the "begin" and "end" markers in a uuencoded binary, and it's also used by the "BLB_TextReplaceString" method.
BLB_TextReplaceString (ptrToSourceBLOB; oldString;newString{; howMany})
This routine finds occurrences of oldString in the source BLOB and replaces them with newString. It's a replacement for the Replace string command, with the exception that rather than working on text, it works on BLOBs. It uses the Boyer-Moore algorithm. This is used in 4D News Jockey to remove the NNTP escape code (a doubled period) that the NNTP protocol inserts periodically in the text stream.
BLB_TextSubstringF (ptrSource; firstChar{; numChars}) -> text
This works basically the same as 4D's Substring command. The syntax is almost identical, except that it works on both BLOBs and text. In 4D News Jockey, this extracts the NNTP response code from news server replies, and extracts the file name of a uuencoded binary file from a Usenet article.
BLB_PositionF (findString; ptrBLOB) -> position of first occurrence
This works just like 4D's Position command, except that it works for BLOBs. It finds the position of the first occurrence of a specified string within a BLOB. In 4D News Jockey, this finds the first carriage return/line feed pair in an NNTP reply.
BLB_TextFindCharF (charCode; ptrSource{; startPos}{; endPos}) -> position of first occurrence found
This method is specifically optimized to find a single character in either text or a BLOB. You can search backwards or forwards. This function is especially useful when you need to find the next carriage return, after having already found the first one, because it allows you to specify the start position and the end position of the range you want to limit your search to in the text or BLOB. That saves you the time of searching through the entire text or BLOB. In comparison, 4D's Position command always has to start its search from the beginning.
BLB_TextIsHereF (findString; ptrSourceText; position) -> Boolean
This returns a Boolean answer to the question, "Is the specified findString at the specified position in the BLOB or text pointed at by ptrSourceText?" 4D News Jockey mainly uses this to search for the end-of-text marker. When the news server sends text back, it just keeps sending packet after packet of text, then it signals the end of the text by putting a carriage return/linefeed/period/carriage return/linefeed, a total of five characters, at the end of the last text packet it sends. So you know that when you do find this end-of-text marker, it's going to be the last five characters of the text. So if you've received a BLOB that is 10,005 bytes long, you want to look for the end-of-text marker at the 10,001st character. That will save you the time of searching through the entire BLOB.
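The same fixed-position test can be sketched without the helper method, by comparing the last five bytes of the BLOB directly. The process variable vReply is a hypothetical name for the BLOB that accumulates the server's response.

```4d
C_LONGINT($nSize;$nStart)
C_BOOLEAN($bEndFound)
$bEndFound:=False
$nSize:=BLOB size(vReply)
If ($nSize>=5)  ` Otherwise there is no room for the marker, and no valid offsets to test
  $nStart:=$nSize-5  ` Zero-based offset of the fifth byte from the end
  ` Carriage return, line feed, period, carriage return, line feed:
  $bEndFound:=(vReply{$nStart}=13) & (vReply{$nStart+1}=10) & (vReply{$nStart+2}=46) & (vReply{$nStart+3}=13) & (vReply{$nStart+4}=10)
End if
```

For a BLOB of 10,005 bytes, $nStart is 10,000, which is the zero-based offset of the 10,001st character.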
Summary
This technical note introduced the 4D News Jockey example database. This database includes many features, but in particular it highlights what can be accomplished with optimized text parsing techniques that work with BLOBs.
See also
Technical Note 94-53, "Faster, More Powerful Text Searches" presents an implementation of the Boyer-Moore text searching algorithm in 4D code, and includes an example database.