I found out about the beagle project a couple of months ago, and I’m totally excited by it. It’s the missing brick that I longed for to write a `grep on steroids’ which I can use as a source code reading tool.
Yeah, right, I was using grep to read source code, often finding cscope insufficient (because some files are not source code, and even cscope’s fuzzy syntax parser cannot parse them). On the other hand, with large projects such as the Linux kernel, or the even larger Android system, grepping the whole project can be very slow. I once searched for `readlink’ in the Android source tree, and it took me more than 30 minutes!
With beagrep (beagle combined with grep), I can grep for it in less than 2 seconds!
Why is it possible
When you grep for the sake of reading source code, you often don’t need complex regexp power: you search for `grep readlink’, not `grep r.*e.*a.*d.*l.*i.*n.*k’, which just does not make any sense!
IOW, 99.9% of the time you search only for whole words like `readlink’, which is a simple kind of regexp and, unlike complex regexps (such as `r.*e.*a.*d.*l.*i.*n.*k’), is something search engines can deal with perfectly.
How is it done
It is really a very simple idea. When you want to grep for a target regexp, do the following:
- Break the target regexp into whole words; e.g., `grep -e some.*fun.*stuff’ should be broken into “some fun stuff”.
- Query beagle with the whole words; beagle answers with the files in the repository that contain these words. These files are the possible matching files.
- Grep for the target regexp in those possible files (which are often only a very small part of the whole repository, so grep can finish in the blink of an eye).
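The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a tiny in-memory inverted index stands in for beagle, and the names regexp_to_words, build_index, candidates and beagrep are made up for this sketch. The real beagrep scripts may well work differently.

```python
import re
from collections import defaultdict


def regexp_to_words(pattern):
    """Step 1: pull the whole words out of a simple regexp.

    Naive on purpose: it only handles simple patterns and would
    mistake an escape such as \\d for the word 'd'.
    """
    return re.findall(r"[A-Za-z0-9]+", pattern)


def build_index(files):
    """Toy stand-in for the beagle index: lowercased word -> set of paths."""
    index = defaultdict(set)
    for path, text in files.items():
        for word in regexp_to_words(text):
            index[word.lower()].add(path)
    return index


def candidates(index, words):
    """Step 2: the files that contain every one of the words."""
    sets = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*sets) if sets else set()


def beagrep(files, index, pattern):
    """Step 3: run the real regexp, but only over the candidate files."""
    words = regexp_to_words(pattern)
    # With no usable words, fall back to scanning every file.
    paths = candidates(index, words) if words else set(files)
    regexp = re.compile(pattern)
    hits = []
    for path in sorted(paths):
        for lineno, line in enumerate(files[path].splitlines(), 1):
            if regexp.search(line):
                hits.append((path, lineno, line))
    return hits


# Hypothetical miniature "repository": path -> file content.
files = {
    "a.c": "int readlink(const char *p);\n",
    "b.txt": "some fun stuff here\nsome stuff\n",
    "c.txt": "fun only\n",
}
index = build_index(files)
for path, lineno, line in beagrep(files, index, "some.*fun.*stuff"):
    print(f"{path}:{lineno}: {line}")
```

In this toy repository only b.txt contains all three words, so the regexp is run against a single file instead of all three. The same pruning is what makes the final grep fast on a large tree.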
Modifications to beagle
Here are the details of how I changed beagle to suit my needs (warning: boring stuff ahead):
- Change all of beagle’s built-in filters to FilterText. This is because I don’t want keywords filtered out by the SourceCode filters. This way, I can beagle-query `extends CFunny’ to see which classes inherit from `CFunny’ in Java (the default Java filter will remove `extends’ since it is too common and uninteresting in Java source files).
- Remove some restrictions. E.g., only the first 100000 tokens in a file would be indexed, which is undesirable for my purpose. Also, I raised the memory threshold tenfold, since I found it causing problems with some large XML files.
- Remove more restrictions. Basically, I kept everything the NoiseFilter would remove. Another filter removes common English words; I kept those as well.
- Add support for indexing Chinese characters (this is because I’m Chinese).
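To see why dropping noise words hurts source code search, here is a small Python sketch. It is not beagle’s code: the stop list JAVA_NOISE, the two sample files and the helper names are hypothetical, chosen only to mirror the `extends CFunny’ case above.

```python
import re


def tokenize(text, stop_words=frozenset()):
    """Lowercased words of text, minus anything on the stop list."""
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    return {word for word in words if word not in stop_words}


def matching_files(files, query, stop_words=frozenset()):
    """Paths whose tokens include every token of the query."""
    wanted = tokenize(query, stop_words)
    return sorted(
        path for path, text in files.items()
        if wanted <= tokenize(text, stop_words)
    )


# Hypothetical sample files and stop list, for illustration only.
files = {
    "Child.java": "class Child extends CFunny { }",
    "Other.java": "class Other { CFunny helper; }",
}
JAVA_NOISE = frozenset({"extends", "class"})

# With the stop list, `extends' vanishes from both the index and the
# query, so any file that merely mentions CFunny matches.
print(matching_files(files, "extends CFunny", JAVA_NOISE))

# With nothing filtered, only the class that really extends CFunny matches.
print(matching_files(files, "extends CFunny"))
```

Keeping every token makes the index larger, but for a code-reading tool the keywords are often exactly what you want to search for.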
Here’s how I use it:
- Build a static index at the top-level dir of the source code:
- Use beagrep in any directory in the source tree:
~/bin/beagrep -e "ENGLISH_STOP_WORDS"
The output is like the following:
beagle query argument `ENGLISH STOP WORDS'
/src/beagle/beagled/LuceneCommon.cs:1206: ...ENGLISH_STOP_WORDS...
...
...
`ENGLISH_STOP_WORDS’ is broken into 3 words before beagle is queried.
Where to find everything
I have put the source code on GitHub.
If you check out the source code, you can find beagrep and its helper scripts under windows-config/bin.
The beagle source code I modified is under windows-config/gcode/beagle.
The C# program which breaks `ENGLISH_STOP_WORDS’ into `ENGLISH STOP WORDS’ is under windows-config/gcode/BeagleTokenizer.
The simplest way to set things up is to run
For more details, please RTFS using beagrep!