Wednesday, June 10, 2015

Information Theory notes #01


Short notes with a little bit of clarification.

Self-information of an event is inversely related to the probability of its occurrence: the more likely an event is, the less information it carries.

It can be expressed by the following formula:

$$I(x) = - \log P(x)$$
given that $ P(x) \in [0,1] $.


Events with lower probability are more informative, and a guaranteed event with $ P(x)=1 $ ideally provides no information, since $ -\log 1 = 0 $.
More details on this topic can be found in the chapter "Probability and Information Theory" of Yoshua Bengio's Deep Learning book:
http://www-labs.iro.umontreal.ca/~bengioy/dlbook/
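As a quick sanity check, the formula can be evaluated directly in Python (using the natural logarithm, so the result is in nats):

```python
import math

def self_information(p):
    """I(x) = -log P(x), in nats (natural logarithm)."""
    return -math.log(p)

# A guaranteed event carries no information; less likely events carry more.
print(self_information(1.0))
print(self_information(0.5))
print(self_information(0.1))
```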

Friday, March 20, 2015

bit tricks

1. check if number is power of 2
The trick is that if a number is a power of two, its binary representation contains exactly one set bit.
x & (x-1) clears the lowest set bit, and for a power of two this clears the only set bit, leaving zero.

idea was taken from this article:
http://articles.leetcode.com/2010/09/number-of-1-bits.html


# A number is a power of two iff it has exactly one set bit
# (x = 0 is excluded by the leading "x and").
check_base_2 = lambda x: x and not (x & (x - 1))

# bin(16)   -> 0b10000
# bin(16-1) -> 0b01111
#   10000
# & 01111
# = 00000
assert check_base_2(16)
assert check_base_2(32)
assert check_base_2(128)
assert check_base_2(8)
assert check_base_2(256)
assert not check_base_2(3)

Thursday, August 28, 2014

[Python] Get source of the module function



 >>> import inspect  
 >>> import fractions  
 >>> print (inspect.getsource(fractions.gcd))  
 def gcd(a, b):  
   """Calculate the Greatest Common Divisor of a and b.  
   Unless b==0, the result will have the same sign as b (so that when  
   b is divided by it, the result comes out positive).  
   """  
   while b:  
     a, b = b, a%b  
   return a  
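Note that fractions.gcd was later removed from the standard library (in Python 3.9, with math.gcd as the replacement). The same technique still works for any pure-Python function whose source file is importable, e.g. json.dumps:

```python
import inspect
import json

# getsource() returns the text of any pure-Python function
# defined in an importable source file.
src = inspect.getsource(json.dumps)
print(src.splitlines()[0])  # the "def dumps(..." signature line
```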

Sunday, August 17, 2014

[Python] Understanding iterators and iterables


I hope this note helps not only me to understand iterators and iterables in Python 3.x.

Remarks
An iterable in Python is any value that can provide a stream of values; an iterable can be used in a for loop.
More precisely: an iterable produces an iterator, and the iterator produces the stream of values.
Lists, dicts, and strings are all iterables, and every iterable type provides its own iterator type:

lists - list iterators,
dicts - dict iterators, etc.


 data = ['say', 134, 'words']  
 # Lists => elements  
 iter(data)  
 # <list_iterator object at 0x7f3ca04fcb00>  
 word = "morning"  
 # Strings => characters  
 it = iter(word)  
 # <str_iterator at 0x7f3ca04fc8d0>  
 data_struct = {'age': 12, 'color': 'red', 'isempty': True}  
 # Dicts => keys  
 iter(data_struct)  
 # <dict_keyiterator at 0x7f3ca0289688>  
 # Passed to iter(), dicts produce key iterators,  
 # so it's possible to pass a dict to max() to get the maximum key.  
 max(data_struct)  
 # 'isempty'  
 f = open('find_friends.ipynb')  
 # Files => lines  
 iter(f)  
 # <_io.TextIOWrapper name='find_friends.ipynb' mode='r' encoding='UTF-8'>  
 # iter() called on an iterator returns the iterator itself,  
 # so the nested calls below are equivalent to next(iter(data)).  
 next(iter(iter(iter(iter(data)))))  
 # 'say'  

The most interesting point is that an iterable can be passed as an argument to max(), for example (and generally to any built-in function that accepts an iterable). The result depends on the iterable's type, so passing a dict to max() produces the maximum key in the dictionary. A list of keys can be generated in the same way.

 max(data_struct) # 'isempty'   
 list(data_struct) # ['age', 'isempty', 'color']  

The more advanced ways of using and understanding iterables are generators, which produce a stream of values, and making your own objects iterable by adding an __iter__() method.
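As a minimal sketch of the second idea: a class becomes iterable as soon as it defines __iter__(), and writing __iter__() as a generator is the simplest way to do it (the Countdown class below is just an illustration):

```python
class Countdown:
    """An iterable that counts down from `start` to 1."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Writing __iter__ as a generator makes the class iterable:
        # each call produces a fresh iterator over the stream of values.
        n = self.start
        while n > 0:
            yield n
            n -= 1

print(list(Countdown(3)))  # [3, 2, 1]
for i in Countdown(2):
    print(i)
```

Because __iter__ returns a fresh generator on every call, the same Countdown object can be iterated more than once, unlike a bare iterator.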

Hope, I'll write some notes about these topics later on.

Friday, June 13, 2014

[Oracle] Register MVIEW manually


1. Check if MVIEW is already registered.

 select * from dba_registered_mviews;  

2. Register MVIEW if it doesn't exist in DBA_REGISTERED_MVIEWS

 begin  
 dbms_mview.register_mview ( 'TSTCBF' , 'ADDRESS_EX' , 'TST', 204, DBMS_MVIEW.REG_PRIMARY_KEY_MVIEW,  
 'select Address_Ex_ID Tracking_ID, Unique_Address_FK Address_ID, Last_Mod_Date Last_Mod_Date from ADDRESS_EX_TBL@BAS',  
 DBMS_MVIEW.REG_V8_SNAPSHOT);  
 end;  

Use the package-specific constants as input parameters where necessary, as described in the documentation:
DBMS_MVIEW.REG_PRIMARY_KEY_MVIEW
DBMS_MVIEW.REG_V8_SNAPSHOT

More about the sys.dbms_mview package can be found in the Oracle documentation.

Tuesday, May 27, 2014

Oracle data pump export/import

The standard Oracle export/import tools are remarkably useful and fast.

To export an entire schema as the SYS user, you only need to specify a directory object, the dump file name, and the schema name. The generated file is binary; inspect the log file to make sure the export was successful.

For example:
 expdp \"SYS AS SYSDBA\"  DIRECTORY=data_pump_dir DUMPFILE=OFMDEVPH1_MDS_new_2.dmp SCHEMAS=OFMDEVPH1_MDS  


Since OFMDEVPH1_MDS_new_2.dmp is binary, you can inspect its content using the impdp tool.
 impdp \"SYS AS SYSDBA\"  DIRECTORY=data_pump_dir DUMPFILE=OFMDEVPH1_MDS_new_2.dmp SCHEMAS=OFMDEVPH1_MDS SQLFILE=user_OFM_MDS.sql INCLUDE=USER,SYSTEM_GRANT,ROLE_GRANT,DEFAULT_ROLE,TABLESPACE_QUOTA  

The SQLFILE option tells impdp to generate a SQL file containing all the DDL in text form instead of performing the import.
INCLUDE restricts the generated file user_OFM_MDS.sql to the specified categories of objects.
Content of the generated file:
 -- CONNECT SYS  
 ALTER SESSION SET EVENTS '10150 TRACE NAME CONTEXT FOREVER, LEVEL 1';  
 ALTER SESSION SET EVENTS '10904 TRACE NAME CONTEXT FOREVER, LEVEL 1';  
 ALTER SESSION SET EVENTS '25475 TRACE NAME CONTEXT FOREVER, LEVEL 1';  
 ALTER SESSION SET EVENTS '10407 TRACE NAME CONTEXT FOREVER, LEVEL 1';  
 ALTER SESSION SET EVENTS '10851 TRACE NAME CONTEXT FOREVER, LEVEL 1';  
 ALTER SESSION SET EVENTS '22830 TRACE NAME CONTEXT FOREVER, LEVEL 192 ';  
 -- new object type path: SCHEMA_EXPORT/USER  
 -- CONNECT SYSTEM  
 CREATE USER "OFMDEVPH1_MDS" IDENTIFIED BY VALUES 'S:E2ED5B8A25;FB42494E4D316374'  
    DEFAULT TABLESPACE "OFMDEVPH1_MDS"  
    TEMPORARY TABLESPACE "OFMDEVPH1_IAS_TEMP";  
 -- new object type path: SCHEMA_EXPORT/SYSTEM_GRANT  
 GRANT CREATE TYPE TO "OFMDEVPH1_MDS";  
 GRANT CREATE PROCEDURE TO "OFMDEVPH1_MDS";  
 GRANT CREATE SEQUENCE TO "OFMDEVPH1_MDS";  
 GRANT CREATE TABLE TO "OFMDEVPH1_MDS";  
 -- new object type path: SCHEMA_EXPORT/ROLE_GRANT  
 GRANT "CONNECT" TO "OFMDEVPH1_MDS";  
 -- new object type path: SCHEMA_EXPORT/DEFAULT_ROLE  
 ALTER USER "OFMDEVPH1_MDS" DEFAULT ROLE ALL;  
 -- new object type path: SCHEMA_EXPORT/TABLESPACE_QUOTA  
 DECLARE  
  TEMP_COUNT NUMBER;  
  SQLSTR VARCHAR2(200);  
 BEGIN  
  SQLSTR := 'ALTER USER "OFMDEVPH1_MDS" QUOTA UNLIMITED ON "OFMDEVPH1_MDS"';  
  EXECUTE IMMEDIATE SQLSTR;  
 EXCEPTION  
  WHEN OTHERS THEN  
   IF SQLCODE = -30041 THEN  
    SQLSTR := 'SELECT COUNT(*) FROM USER_TABLESPACES  
        WHERE TABLESPACE_NAME = ''OFMDEVPH1_MDS'' AND CONTENTS = ''TEMPORARY''';  
    EXECUTE IMMEDIATE SQLSTR INTO TEMP_COUNT;  
    IF TEMP_COUNT = 1 THEN RETURN;  
    ELSE RAISE;  
    END IF;  
   ELSE  
    RAISE;  
   END IF;  
 END;  
 /  

In this way you can review the entire dump file. For me it was just a matter of curiosity to see how passwords are represented.

To import the dump file, use:

 impdp \"SYS AS SYSDBA\"  DIRECTORY=data_pump_dir DUMPFILE=OFMDEVPH1_MDS_new_2.dmp SCHEMAS=OFMDEVPH1_MDS  

Good article for further reading: Oracle Data Pump Quick Start

Friday, April 4, 2014

Oracle. Statistics gathering issue.


While looking at an AWR report and trying to understand why calls to the database (Oracle 11g) were so slow, I found one query that was expensive in terms of time consumption.

Here is this query:
  select distinct  
  evt_map_tbl.UNIQUE_EVENT_MAPPING_ID as UNIQUE1_43_  
   , evt_map_tbl.ACCOUNT_FK as ACCOUNT9_43_  
   , evt_map_tbl.EVENT_MAPPING_ID as EVENT2_43_  
   , evt_map_tbl.EXT_ID_TYPE_CODE_ID as EXT3_43_  
   , evt_map_tbl.LAST_MOD_DATE as LAST4_43_  
   , evt_map_tbl.OL_VERSION as OL5_43_  
  , evt_map_tbl.VALID_FROM as VALID6_43_  
   , evt_map_tbl.VALID_TO as VALID7_43_  
   , evt_map_tbl.VALUE as VALUE43_  
  from BASTST1.EVENT_MAPPING_TBL evt_map_tbl, BASTST1.EVENT_MAPPING_ASSOC_TBL evt_map_assoc,BASTST1.SOLD_COMPONENT_TBL sold_comp, BASTST1.SOLD_PRODUCT_TBL sold_prod, BASTST1.ACCOUNT_TBL acc_tbl  
  where evt_map_tbl.EVENT_MAPPING_ID=evt_map_assoc.EVENT_MAPPING_REF  
   and evt_map_assoc.SOLD_COMPONENT_FK=sold_comp.SOLD_COMPONENT_ID  
   and sold_comp.SOLD_PRODUCT_FK=sold_prod.SOLD_PRODUCT_ID  
  and sold_prod.ACCOUNT_FK=acc_tbl.ACCOUNT_ID  
   and acc_tbl.ACCOUNT_ID=:acc  
  and evt_map_tbl.VALID_FROM<=:from_dt and evt_map_tbl.VALID_TO>:to_dt;  


and execution plan is:

------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                         |     1 |   289 |    20   (5)| 00:00:01 |
|   1 |  SORT UNIQUE                     |                         |     1 |   289 |    20   (5)| 00:00:01 |
|   2 |   NESTED LOOPS                   |                         |       |       |            |          |
|   3 |    NESTED LOOPS                  |                         |     1 |   289 |    19   (0)| 00:00:01 |
|   4 |     MERGE JOIN CARTESIAN         |                         |  1414 |   363K|    19   (0)| 00:00:01 |
|   5 |      NESTED LOOPS                |                         |       |       |            |          |
|   6 |       NESTED LOOPS               |                         |    15 |  3555 |    19   (0)| 00:00:01 |
|*  7 |        TABLE ACCESS FULL         | EVENT_MAPPING_TBL       |    15 |  3165 |    19   (0)| 00:00:01 |
|*  8 |        INDEX RANGE SCAN          | EMASS_MAPREF_IDX        |     1 |       |     0   (0)| 00:00:01 |
|   9 |       TABLE ACCESS BY INDEX ROWID| EVENT_MAPPING_ASSOC_TBL |     1 |    26 |     0   (0)| 00:00:01 |
|  10 |      BUFFER SORT                 |                         |    97 |  2522 |    19   (0)| 00:00:01 |
|  11 |       TABLE ACCESS BY INDEX ROWID| SOLD_PRODUCT_TBL        |    97 |  2522 |     0   (0)| 00:00:01 |
|* 12 |        INDEX RANGE SCAN          | SP_ACCFK_IDX            |    39 |       |     0   (0)| 00:00:01 |
|* 13 |     INDEX RANGE SCAN             | SC_SPFK_IDX             |     1 |       |     0   (0)| 00:00:01 |
|* 14 |    TABLE ACCESS BY INDEX ROWID   | SOLD_COMPONENT_TBL      |     1 |    26 |     0   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------


The most expensive step is the MERGE JOIN CARTESIAN.

Tom Kyte notes that MERGE JOIN CARTESIAN can also be efficient in some cases.
The CBO (cost-based optimizer) needs statistics about the data to choose the most efficient execution plan.
The DBA_TAB_STATISTICS view displays optimizer statistics for all tables in the database:

 SQL> select TABLE_NAME , nvl(NUM_ROWS,0), nvl(BLOCKS,0), nvl(AVG_ROW_LEN,0) , nvl(SAMPLE_SIZE,0) ,LAST_ANALYZED from DBA_TAB_STATISTICS  
  2 where owner = '&tab_owner'  
  3 and table_name in ( 'EVENT_MAPPING_TBL', 'EVENT_MAPPING_ASSOC_TBL','SOLD_COMPONENT_TBL','SOLD_PRODUCT_TBL','ACCOUNT_TBL');  



TABLE_NAME                     NVL(NUM_ROWS,0) NVL(BLOCKS,0) NVL(AVG_ROW_LEN,0) NVL(SAMPLE_SIZE,0) LAST_ANALYZED
------------------------------ --------------- ------------- ------------------ ------------------ -------------------
ACCOUNT_TBL                                  0             0                  0                  0
EVENT_MAPPING_ASSOC_TBL                      0             0                  0                  0
EVENT_MAPPING_TBL                            0             0                  0                  0
SOLD_COMPONENT_TBL                           0             0                  0                  0
SOLD_PRODUCT_TBL                             0             0                  0                  0

Statistics had not been gathered at all, which means the CBO had no information about the data and used plans suited for the lowest number of rows.

Statistics can be gathered using DBMS_STATS package.

 SQL> exec dbms_stats.gather_schema_stats( 'BASTST1');  

This procedure takes many parameters, but only the first is mandatory: the name of the schema for which statistics should be collected.

Checking DBA_TAB_STATISTICS one more time: statistics are now collected, and the CBO is aware of parameters such as NUM_ROWS, BLOCKS, and AVG_ROW_LEN.

TABLE_NAME                     NVL(NUM_ROWS,0) NVL(BLOCKS,0) NVL(AVG_ROW_LEN,0) NVL(SAMPLE_SIZE,0) LAST_ANALYZED
------------------------------ --------------- ------------- ------------------ ------------------ -------------------
ACCOUNT_TBL                               5967           244                236               5967 2014-04-03 17:09:50
EVENT_MAPPING_ASSOC_TBL                   5966            20                 14               5966 2014-04-03 17:11:33
EVENT_MAPPING_TBL                         5966            65                 91               5966 2014-04-03 17:11:52
SOLD_COMPONENT_TBL                        9571           244                123               9571 2014-04-03 17:12:03
SOLD_PRODUCT_TBL                          9571            80                 49               9571 2014-04-03 17:12:15

Finally, another, more efficient execution plan was used for the problematic query.

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name                         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                              |     1 |   115 |    17   (6)| 00:00:01 |
|   1 |  SORT UNIQUE                    |                              |     1 |   115 |    17   (6)| 00:00:01 |
|   2 |   NESTED LOOPS                  |                              |     1 |   115 |    16   (0)| 00:00:01 |
|   3 |    NESTED LOOPS                 |                              |     2 |    48 |    12   (0)| 00:00:01 |
|   4 |     NESTED LOOPS                |                              |     3 |    48 |     9   (0)| 00:00:01 |
|   5 |      TABLE ACCESS BY INDEX ROWID| SOLD_PRODUCT_TBL             |     3 |    24 |     3   (0)| 00:00:01 |
|*  6 |       INDEX RANGE SCAN          | SP_ACCFK_IDX                 |     3 |       |     1   (0)| 00:00:01 |
|   7 |      TABLE ACCESS BY INDEX ROWID| SOLD_COMPONENT_TBL           |     1 |     8 |     2   (0)| 00:00:01 |
|*  8 |       INDEX RANGE SCAN          | SC_SPFK_IDX                  |     1 |       |     1   (0)| 00:00:01 |
|*  9 |     INDEX RANGE SCAN            | EMASS_SCEMREF_IDX            |     1 |     8 |     1   (0)| 00:00:01 |
|* 10 |    TABLE ACCESS BY INDEX ROWID  | EVENT_MAPPING_TBL            |     1 |    91 |     2   (0)| 00:00:01 |
|* 11 |     INDEX RANGE SCAN            | EVTMAPP_EVTID_VALID_FROM_IDX |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------------

Without the expensive MERGE JOIN CARTESIAN: a very clean and efficient plan.

I was curious why statistics had not been collected at all, because without statistics the CBO is a blind tool, good for nothing. I found some information on this topic, and soon there will be a new post about the differences in how the default statistics-gathering jobs work in Oracle 10g and Oracle 11g.