Wednesday, April 27, 2016

MapR Stream Workshop 4: Cold Backup and Restore

Theory:

mapr exportstream and mapr importstream are used together to export data from MapR streams into binary sequence files, and then import the data from the binary sequence files into other MapR streams.
So we can use these two tools for cold backup and restore.

After that, mapr diffstreams can be used to check the differences between the 2 streams.
mapr formatresult can be used to parse a sequence file generated by mapr diffstreams.

Experiment:

1. Backup stream /stream/s1 to MFS location /tmp/backup_s1 -- exportstream

mapr exportstream -src /stream/s1 -dst /tmp/backup_s1
The physical size of stream "/stream/s1" is about 163MB:
# maprcli stream topic list -path /stream/s1
topic  partitions  logicalsize  consumers  maxlag  physicalsize
info   1           371326976    1          858571  171065344
The backup is even smaller, about 100MB:
[root@v5 backup_s1]# ls -altr
total 101620
drwxrwxrwx 3 mapr root         1 Apr 26 07:09 ..
drwxr--r-- 2 root root         3 Apr 26 07:09 .
-rw-r--r-- 1 root root       563 Apr 26 14:39 part0
-rw-r--r-- 1 root root       175 Apr 26 14:39 part2
-rw-r--r-- 1 root root 104055345 Apr 26 14:39 part1
[root@v5 backup_s1]# pwd
/mapr/my2.cluster.com/tmp/backup_s1
This suggests that the backup is compressed better than the source stream itself.
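As a quick sanity check on the sizes above, here is a small sketch that only does arithmetic on the byte counts from the two listings. The variable names are my own, and the "ratio" is rough: it assumes the backup holds the same data as the source, which may not be exactly true if expired messages are skipped during export.

```python
# Byte counts copied from the listings above.
logical_size = 371326976             # logicalsize of topic "info" in /stream/s1
physical_size = 171065344            # physicalsize of topic "info" in /stream/s1
backup_size = 563 + 175 + 104055345  # part0 + part2 + part1 in /tmp/backup_s1

MIB = 1024 * 1024

physical_mib = physical_size / MIB
backup_mib = backup_size / MIB

# Logical bytes per stored byte: higher means better compression.
stream_ratio = logical_size / physical_size
backup_ratio = logical_size / backup_size

print(f"stream physical size: {physical_mib:.1f} MiB, ratio {stream_ratio:.2f}")
print(f"backup size:          {backup_mib:.1f} MiB, ratio {backup_ratio:.2f}")
```

With these numbers the backup comes out a bit under 100 MiB against roughly 163 MiB for the stream, so the backup ratio is the higher of the two.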

2. Restore the backup to stream /stream/s1_clone -- importstream

maprcli stream create -path /stream/s1_clone
mapr importstream -src /tmp/backup_s1 -dst /stream/s1_clone
The target stream size is similar to that of the source stream.
# maprcli stream topic list -path /stream/s1_clone
topic  partitions  logicalsize  consumers  maxlag     physicalsize
info   1           371957760    1          440290179  172695552

3. Compare the number of messages between them -- streamanalyzer

# mapr streamanalyzer -path /stream/s1 -topics info
Total number of messages: 16002707
# mapr streamanalyzer -path /stream/s1_clone -topics info
Total number of messages: 14502707
Here we can see a difference of 1500000 messages.
This could be due to "dead" messages in the source stream that have passed their TTL (7 days by default).
This means that "mapr streamanalyzer" may also count the "dead" messages.

Interestingly, when I re-ran "mapr streamanalyzer" one day later, the counts of the two streams matched.
# mapr streamanalyzer -path /stream/s1 -topics info
Total number of messages: 14502707
# mapr streamanalyzer -path /stream/s1_clone -topics info
Total number of messages: 14502707
Note: The earlier difference occurred because the NTP service was not in sync among my nodes.
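If you want to script this comparison, a minimal sketch is to parse the "Total number of messages" line from each streamanalyzer run and subtract. The helper name parse_message_count is my own, and the code assumes the output line looks exactly like the ones shown above; it works on captured text rather than calling the tool itself.

```python
import re


def parse_message_count(output: str) -> int:
    """Extract the count from a 'Total number of messages: N' line."""
    match = re.search(r"Total number of messages:\s*(\d+)", output)
    if match is None:
        raise ValueError("no message count found in output")
    return int(match.group(1))


# Output lines copied from the first pair of runs above.
source_output = "Total number of messages: 16002707"
clone_output = "Total number of messages: 14502707"

difference = parse_message_count(source_output) - parse_message_count(clone_output)
print(difference)  # 1500000
```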

4. Check the differences -- diffstreams

To prove that the message difference in step #3 is due to "dead" messages, we can use "mapr diffstreams" to check the difference. The output will be stored in the MFS directory "/tmp/diff".
# mapr diffstreams -src /stream/s1 -dst /stream/s1_clone -outdir /tmp/diff 
tables '/stream/s1', and '/stream/s1_clone' didn't match
Number of rows processed in '/stream/s1' : 250919
Number of rows processed in '/stream/s1_clone' : 227480
Mismatch row count in '/stream/s1' : 23439
Mismatch row count in '/stream/s1_clone' : 0
Rows with mismatch are stored in /tmp/diff
Interestingly, the mismatched "row" count is 23439 instead of 1500000.
Why? No hurry, I will explain and dig deeper in the next step.

5. Parse the output file generated by diffstreams -- formatresult

The output of diffstreams is a sequence file, so we need to use "mapr formatresult" to parse it into a readable text format.
# mapr formatresult -indir /tmp/diff/OpsForDstTable -outdir /tmp/diff/OpsForDstTable_parse 
Successfully created files in /tmp/diff/OpsForDstTable_parse
The row count of this output file is 23439 which matches the row count in step #4.
# wc -l opsfordst_1.diff.txt
23439 opsfordst_1.diff.txt
# pwd
/mapr/my2.cluster.com/tmp/diff/OpsForDstTable_parse

However, the actual count of messages is 1500000, because each row in that output file holds multiple messages.
Because the value of each message is stored in binary format in that output file, one easy way to count the messages is to count the occurrences of the word "binary" in that file.
This can be done with the Unix editor "vim": open the file and type ":%s/binary//gn".
vim then reports the real count of differing messages.
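If the file is too large to open comfortably in an editor, the same count can be done with a short script. This is only a sketch: count_word is a name I made up, and it simply counts every occurrence of the string "binary" line by line, which is the same thing the vim command above does.

```python
import re


def count_word(path: str, word: str = "binary") -> int:
    """Count every occurrence of `word` in the text file at `path`."""
    pattern = re.compile(re.escape(word))
    total = 0
    # errors="replace" keeps the scan going if the file has odd bytes.
    with open(path, "r", errors="replace") as handle:
        for line in handle:
            total += len(pattern.findall(line))
    return total


# Example call, using the file from step #5:
# count_word("/mapr/my2.cluster.com/tmp/diff/OpsForDstTable_parse/opsfordst_1.diff.txt")
```

Reading line by line keeps memory use flat no matter how big the diff output is.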

6.  Prove the mismatched messages are "dead" messages

From the output in step #5, you can get the epoch timestamps (in milliseconds) of the mismatched messages.
Take 1461080075789.0 for example: converted to human-readable time, it is April 19, 2016 at 8:34:35 AM PDT, which is 7 days before I ran these tests.
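The conversion can be reproduced with a few lines of code. In this sketch the function name epoch_ms_to_datetime is mine, PDT is modelled as a fixed UTC-7 offset (an assumption, to avoid depending on a timezone database), and the 7-day TTL is the default mentioned in step #3.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PDT as a fixed UTC-7 offset.
PDT = timezone(timedelta(hours=-7), "PDT")


def epoch_ms_to_datetime(epoch_ms: float, tz: timezone = PDT) -> datetime:
    """Convert an epoch timestamp in milliseconds to an aware datetime."""
    seconds, millis = divmod(int(epoch_ms), 1000)
    return datetime.fromtimestamp(seconds, tz) + timedelta(milliseconds=millis)


message_time = epoch_ms_to_datetime(1461080075789.0)
expiry_time = message_time + timedelta(days=7)  # default TTL of 7 days

print(message_time.isoformat())  # 2016-04-19T08:34:35.789000-07:00
print(expiry_time.isoformat())   # 2016-04-26T08:34:35.789000-07:00
```

So a message written at that time would pass its TTL on the morning of April 26, 2016.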

Key takeaways:

1. "mapr streamanalyzer" may count the "dead" messages in a stream if time (NTP) is not in sync among nodes.

2. "mapr diffstreams" reports the count of rows, which is not the same as the count of messages.
