Tuesday, February 06, 2007

There are a few things I've been meaning to write about how to write StreamCruncher Queries to achieve good performance.

If your Query filters Events based on some criteria in the Where clause, while co-relating it with Events from other Streams like this:

select .... from

StreamA (partition by store last 10 minutes) as FirstStream,
StreamB (partition by store latest 25) as SecondStream

where FirstStream.eventId = SecondStream.eventId

and FirstStream.$row_status is not dead
and FirstStream.someColumn > 10
and SecondStream.$row_status is new
and SecondStream.otherColumn is not null;

Here are some simple tips. The same concepts that you learned while optimizing SQL Queries apply here too.
a) Re-arrange the Filter conditions in the Where clause before the Join (Co-relation) to reduce the candidate Rows. In the Query above, the First and Second Streams are Joined on eventId and the resulting combined Rows/Events are filtered using the subsequent Filter criteria like "FirstStream is not dead and .. .someColumn > 10..". This is computationally wasteful because the Database joinsall those additional Rows/Events from the 2 Streams and then removes the ones that do not match the Criteria.
b) If the Events need to be Filtered, use the Pre-Filter clause in the Partition definition to consume only the required Events

Thus, the final Query should look like this:

select .... from

StreamA (partition by store last 10 minutes
where someColumn > 10) as FirstStream,

StreamB (partition by store latest 25
where otherColumn is not null) as SecondStream

where SecondStream.$row_status is new

and FirstStream.$row_status is not dead
and FirstStream.eventId = SecondStream.eventId;

You will notice that the Events are Pre-filtered in the Partition clause itself, where the un-necessary Events are weeded out even before they enter the Partitions. And since Events are fetched into Partitions in-Parallel with Query execution, you can shave off previous milliseconds by Pre-Filtering.

Another trick is to use the Table/EventStream with the least number of Rows/Events as the "Driving Table" (first Table in the Join) i.e the "SecondStream.$row_status is new" will have fewer Events because it picks up only the newly arrived Events to join with the other Stream. This speeds up Join processing time, if the Database underneath uses Hash-Joins.

It is also recommended to have the filter critera like "$row_status .." (the ones that cannot be Pre-Filtered) before the Join clause. So, use "FirstStream.eventId = SecondStream.eventId" in the end, after culling the Rows that are not needed so that only the required Rows are presented to the Join clause.