Monday, March 28, 2011

Checking an enormous number of files into cvs

Well, you can check in many files in one directory by doing this:
> cvs add *
> cvs ci *

but... if you want to do it recursively...

find . -type f | grep -v CVS | xargs cvs add

This finds everything whose type is a regular file (not a directory), filters out the paths that belong to the CVS bookkeeping directories, and then uses xargs to run 'cvs add' on the output of grep. Follow it with a cvs ci as above to actually commit the files.

Monday, December 7, 2009

Remove text from the email body with pine

The email program pine has a feature that lets you filter your message before you send it out.

Why do I want to do this? Because I am idiot enough to include sensitive information in my email body from time to time, and people keep warning me about it. Using pine is really cool, but composing email too fast without carefully reviewing the message content itself is not really cool.

So here is what you need to do to set up a basic filter:
1. First go to S -> C (Setup -> Config) and find "sending-filters".
2. Point it at the script or program that will parse your message and do the necessary replacement or removal.

ex: /path/to/your/program/sendfilter_parser.pl _TMPFILE_ _RESULTFILE_

Detailed options for the command-modifying tokens can be found here:
http://www.washington.edu/pine/tech-notes/config.html

My perl script reads two arguments. The first is _TMPFILE_, which is exactly your message body; this is what your program parses. The second is _RESULTFILE_, which is the message you want pine to show after the mail is sent out.

Here is my example:
#!/usr/bin/perl -w

# Patterns to remove from the outgoing message.
@patterns = ( 'Bad Words', 'Dirty Words' );

$text   = $ARGV[0];   # _TMPFILE_: the message body pine hands us
$result = $ARGV[1];   # _RESULTFILE_: text pine shows after the filter runs

open( TEXT,   "+<$text"  ) or die $!;
open( RESULT, ">$result" ) or die $!;

# Put the whole message into the lines array.
@lines = <TEXT>;

# Move the pointer back to the head of the message so we can overwrite it in place.
seek TEXT, 0, 0;

# Step through the message line by line.
foreach $line (@lines)
{
    # Step through pattern by pattern and replace each with an empty string (remove it).
    foreach $onep ( @patterns )
    {
        $line =~ s/$onep//g;
    }
    print TEXT $line;
}

# Drop any leftover bytes in case the filtered message is shorter than the original.
truncate( TEXT, tell(TEXT) ) or die $!;

# Show a short message after the change.
printf RESULT ("%s\n", "Done parsing");

close( TEXT );
close( RESULT );


Very neat, right?

Friday, December 4, 2009

Some performance experience with MyISAM

While working on SQL statement optimization, I once found that the same SQL statement could produce two hugely different query times: one was 5 secs, the other was 17 mins. Oh my god, what's going on in MyISAM? The reason is simple: MyISAM uses pread/pwrite to read and write data through a file descriptor. From Vadim's post I found that Peter once wrote a function that allows MyISAM to use mmap to cache a table if the table size is smaller than 2GB (a 32-bit limitation). The original post and experimental testing is here: http://www.mysqlperformanceblog.com/2006/05/26/myisam-mmap-feature-51/
Therefore, MyISAM always deals with real files, whether they are cached in the file system cache or not.

Yeah, so I found some truths:
1. If your table is not cached in file system memory, damn, you have to pray your hard drive is quick enough to read the data and put it into the file system cache.

2. If you are lucky and running a SQL statement like this:
SELECT * FROM your_table;

Yes, it is a full table scan, but no worries, this is not the worst case. It will spit out data much faster than you would think.

3. If you are doing some joins in your SQL statement like this:
SELECT * FROM A JOIN B;

You will have a very bad response time if your A and B tables are not cached in the file system cache. Remember how MySQL joins tables? "Nested Loop JOIN" (check google for what this is). MySQL reads a couple of rows from the outer table and joins them with a couple of inner rows. I watched the IO behavior with iostat, and it did show how the "nested loop join" works: MySQL reads a couple of blocks of 'outer loop rows', then a couple of blocks of 'inner loop rows', does the join, and keeps going until the condition is met. I could see a lot of small IOs going back and forth. In case 2, by contrast, iostat showed a small number of IOPS but a few requests with a large IO size. Some disks, like SATA drives, are good for large IO throughput but not good at IOPS. Further, if your table is frequently deleted from and inserted into, your data will not be laid out contiguously, so once you hit condition 3 the performance gets even worse, because your disk needs longer seek times to find the right data.

There is a trick to avoid condition 3. You can run "SELECT * FROM" on both your A and B tables first. It forces MySQL to cache your tables in memory with a small number of IOPS and large chunks of data read from disk (for sure, you have to have enough memory to hold the table files). Once the tables are in the cache, you will be fine with any kind of operation, since you are dealing with your tables entirely in memory now.
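A minimal sketch of that warm-up, assuming the tables are simply named A and B as above:

-- Full sequential scans pull the table files into the OS file system cache
-- with a few big reads (you need enough free memory to hold both tables).
SELECT * FROM A;
SELECT * FROM B;

-- With both tables sitting in the cache, the nested loop join no longer
-- waits on lots of small random disk reads.
SELECT * FROM A JOIN B;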

Back to what Peter did with mmap: I cannot find this feature on mysql.com right now. Only by using myisampack to compress your tables, which leaves them as read-only tables, will MyISAM use mmap to cache your tables in memory. I hope that in the coming future the MyISAM team will consider implementing it, since from Vadim's post, Solaris and Linux are not really good at the pread call, as he says in the post. Oracle is smart in that it allocates shared memory at system startup; it manages its own memory, caches tables, and so on, and flushes its cache on its own.

MyISAM table file sizes growing fast?!

Recently I found that when you delete rows from a MyISAM table, MyISAM will only mark the rows as 'deleted' and re-use the blocks later. That means the actual table file sizes (.MYI, .MYD) will not shrink even if you delete a lot of data.
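If you want to see how much space is sitting in those deleted blocks, SHOW TABLE STATUS reports it in the Data_free column (the table name here is just a placeholder):

-- For a MyISAM table, Data_free is the number of bytes allocated to deleted
-- rows: space MyISAM will re-use later but never gives back to the file system.
SHOW TABLE STATUS LIKE 'your_table';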

There are a lot of ways to force MyISAM to resize the table files, which is a kind of optimization.

1. You can create a new index or add a new column on the table. MyISAM will then actually create a temporary table with the new structure, copy the data from the old table into the temp table, and finally rename the temp table to the name of the table you are working on. This is an optimization in MyISAM; it also sorts your data while inserting it into the temp table. The drawback is that it will take you a lot of time, even though you thought creating an index was an easy thing, and it causes the table to be locked for a while. Yeah, a table lock, and that will kill you if your application cannot tolerate long locking. My experience is that 5 million rows take 8~10 mins, more or less, depending on your hardware performance and on whether the table is cached in file system memory or not. (See the first sketch after this list for an example.)

2. You can shut down your MySQL server and run myisamchk -r to repair your table. It will shrink the table files as well.

3. A nice way to smoothly shrink the file sizes without long table locking goes like this: first you create a new table that is the same as your old table, then you copy your data over to the new table, and once that is done you just swap tables by renaming org_table to old_table and new_table to org_table (see the second sketch after this list).

The trick in this process is that while you are copying your data from the old table to the temp table, your application is still writing new data into the old table, so you have to loop and insert the data into the temp table incrementally (that way a select on the old table will not lock it for long and block writes; concurrent writes can avoid locking in some conditions). Then, for a very short moment, you pause your application's writes to the old table, immediately swap the tables, and finally resume the application. Downtime will be 0 in this case.

Second, you can keep your application running. However, between the incremental insertion and the table swap, there is a chance that new data gets written into the old table before the swap. Therefore, write down the latest primary key or unique key in the new table once you finish your insertion (remember to adjust the auto_increment id to the latest id + 1), and after that swap your tables. Once new writes start using the new table, you have to go back to the old table, review it, and compare against the latest primary or unique id. If you find any id larger than the one you recorded from your latest insertion into the new table, you must copy those rows over to the new table after the swap.

There is a problem with this second case: once you swap the tables, if any data was written into the old table in between and you copy it over to the new table, the sequence of ids will be inconsistent. If your application relies on the order of the inserted data, or say the order of the primary key, then you probably don't want to follow this way; instead you must pause your application for a short moment.
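For option 1, the rebuild is just a side effect of changing the table structure. A minimal sketch, assuming a table called your_table and a made-up column created_at to index:

-- MyISAM creates a temporary table with the new structure, copies and sorts
-- the rows into it, then renames it over the original, so the .MYD/.MYI
-- files come back compacted. The table stays locked for the whole rebuild.
ALTER TABLE your_table ADD INDEX idx_created_at (created_at);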
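For option 3, here is a rough sketch of the whole swap, assuming the live table is org_table with an auto-increment primary key id; the id boundaries are placeholders, and the batch copy would normally be driven by a small script loop:

-- 1. Build an empty copy with the same structure and indexes.
CREATE TABLE new_table LIKE org_table;

-- 2. Copy rows over in batches so no single SELECT holds a lock on
--    org_table for long; write down the last id you copied.
INSERT INTO new_table SELECT * FROM org_table WHERE id <= 100000;
INSERT INTO new_table SELECT * FROM org_table WHERE id > 100000 AND id <= 200000;
-- ... keep batching until you are close to the newest row ...

-- 3. Bump the auto_increment counter past the last copied id.
ALTER TABLE new_table AUTO_INCREMENT = 200001;

-- 4. Swap both tables in one atomic rename.
RENAME TABLE org_table TO old_table, new_table TO org_table;

-- 5. If writes kept going during the copy, pull over any rows that landed
--    in the old table after the last id you recorded.
INSERT INTO org_table SELECT * FROM old_table WHERE id > 200000;

As noted above, rows pulled over in the last step can end up out of order with the ids generated after the swap, so if your application depends on the primary key order you should pause writes around the rename instead.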

Why do we need to do this to shrink the table file sizes? It is all about the file system cache: if we have small table files, we can cache more tables in memory to speed up database reads and writes.