
Sanjivani K.B.P.Polytechnic, Kopargoan

A Report On

RDBMS

Submitted by

1. Miss. Rathi Sejal G. Roll No: 56

For the subject

Relational Database Management System

In The Academic Year

2016-17


Normalization in DBMS:

1NF, 2NF, 3NF and BCNF in Database:

Normalization is a process of organizing the data in a database to avoid data redundancy and the insertion, update, and deletion anomalies. Let's discuss anomalies first, and then we will discuss the normal forms with examples.

Anomalies in DBMS:

There are three types of anomalies that occur when the database is not normalized: the insertion, update, and deletion anomalies. Let's take an example to understand this.

Example: Suppose a manufacturing company stores the employee details in a table named employee that has four attributes: emp_id for storing the employee's ID, emp_name for storing the employee's name, emp_address for storing the employee's address, and emp_dept for storing the details of the department in which the employee works. At some point in time the table looks like this:


emp_id | emp_name | emp_address | emp_dept
101 | Rick | Delhi | D001
101 | Rick | Delhi | D002
123 | Maggie | Agra | D890
166 | Glenn | Chennai | D900
166 | Glenn | Chennai | D004

The above table is not normalized. We will see the problems that we face when a table is not normalized.

Update anomaly:

In the above table we have two rows for employee Rick, as he belongs to two departments of the company. If we want to update the address of Rick, then we have to update the same in two rows, or the data will become inconsistent. If somehow the correct address gets updated in one department but not in the other, then as per the database Rick would have two different addresses, which is not correct and would lead to inconsistent data.

Insert anomaly:

Suppose a new employee who is under training joins the company and is currently not assigned to any department. Then we would not be able to insert the data into the table if the emp_dept field doesn't allow nulls.

Delete anomaly:

Suppose that at some point in time the company closes department D890. Deleting the rows that have emp_dept as D890 would also delete the information of employee Maggie, since she is assigned only to this department.

To overcome these anomalies we need to normalize the data. In the next section we will discuss normalization.
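To make these anomalies concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and values follow the employee example above, but the script itself (and the new address "Mumbai") is only an illustration and is not part of the original report.

```python
import sqlite3

# In-memory database with the unnormalized employee table from the example above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (emp_id INT, emp_name TEXT, emp_address TEXT, emp_dept TEXT)")
con.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (101, "Rick", "Delhi", "D001"),
        (101, "Rick", "Delhi", "D002"),
        (123, "Maggie", "Agra", "D890"),
        (166, "Glenn", "Chennai", "D900"),
        (166, "Glenn", "Chennai", "D004"),
    ],
)

# Update anomaly: Rick's address changes, but only one of his two rows is updated.
con.execute("UPDATE employee SET emp_address = 'Mumbai' WHERE emp_id = 101 AND emp_dept = 'D001'")
print(con.execute("SELECT DISTINCT emp_address FROM employee WHERE emp_id = 101").fetchall())
# Two different addresses for the same employee -> inconsistent data.

# Delete anomaly: closing department D890 also removes all information about Maggie.
con.execute("DELETE FROM employee WHERE emp_dept = 'D890'")
print(con.execute("SELECT * FROM employee WHERE emp_id = 123").fetchall())  # []
```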


Normalization:

If a database design is not perfect, it may contain anomalies, which are like a bad dream for any database administrator. Managing a database with anomalies is next to impossible.

Update anomalies − If data items are scattered and are not linked to each other properly, then it could lead to strange situations. For example, when we try to update one data item having its copies scattered over several places, a few instances get updated properly while a few others are left with old values. Such instances leave the database in an inconsistent state.

Page 7: RDBMS Mini Project Report made by sejal (sandhya)Rathi.

Deletion anomalies − We try to delete a record, but parts of it are left undeleted because, without our being aware of it, the data is also saved somewhere else.

Insert anomalies − We try to insert data into a record that does not exist at all.

Normalization is a method to remove all these anomalies and bring the database to a consistent state.

Here are the most commonly used normal forms:

First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)

First Normal Form:

First Normal Form is defined in the definition of relations (tables) itself. This rule defines that all the attributes in a relation must have atomic domains. The values in an atomic domain are indivisible units.

We re-arrange the relation (table) as below, to convert it to First Normal Form.


Each attribute must contain only a single value from its pre-defined domain.
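The re-arranged table itself is not reproduced in this transcript, so the following is only a hypothetical sketch (the student/phone-number example is invented): a row that stores several values in one field is split into one row per value, making every attribute atomic.

```python
# Hypothetical non-atomic relation: one column stores several values at once.
students = [
    ("S1", "Asha", "98765,91234"),  # two phone numbers in one field -> violates 1NF
    ("S2", "Ravi", "99887"),
]

# Re-arranged into First Normal Form: one row per (student, phone) pair.
students_1nf = [
    (stu_id, name, phone)
    for stu_id, name, phones in students
    for phone in phones.split(",")
]

for row in students_1nf:
    print(row)
# ('S1', 'Asha', '98765'), ('S1', 'Asha', '91234'), ('S2', 'Ravi', '99887')
```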

Second Normal Form:

Before we learn about the second normal form, we need to understand the following −

Prime attribute − An attribute that is part of a candidate key is known as a prime attribute.

Non-prime attribute − An attribute that is not part of any candidate key is said to be a non-prime attribute.

If we follow second normal form, then every non-prime attribute should be fully functionally dependent on the candidate key. That is, if X → A holds, then there should not be any proper subset Y of X for which Y → A also holds.


We see here in the Student_Project relation that the prime key attributes are Stu_ID and Proj_ID. According to the rule, the non-key attributes, i.e. Stu_Name and Proj_Name, must depend on both of them and not on either prime key attribute individually. But we find that Stu_Name can be identified by Stu_ID alone and Proj_Name can be identified by Proj_ID alone. This is called partial dependency, which is not allowed in Second Normal Form.

We broke the relation in two, as depicted in the above picture, so there exists no partial dependency.
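Since that picture is not reproduced in this transcript, the following is only a minimal sketch of one such split, using Python's built-in sqlite3 module; the column types and the assumption that each student works on a single project are illustrative choices, not taken from the report.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Student_Project had the composite key (Stu_ID, Proj_ID) with partial
# dependencies Stu_ID -> Stu_Name and Proj_ID -> Proj_Name.
# Splitting the relation removes those partial dependencies:
con.executescript("""
CREATE TABLE student (
    Stu_ID   TEXT PRIMARY KEY,
    Stu_Name TEXT,   -- depends on Stu_ID alone
    Proj_ID  TEXT    -- the project this student works on
);
CREATE TABLE project (
    Proj_ID   TEXT PRIMARY KEY,
    Proj_Name TEXT   -- depends on Proj_ID alone
);
""")
```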

Third Normal Form:

For a relation to be in Third Normal Form, it must be in Second Normal Form and the following must be satisfied −

No non-prime attribute is transitively dependent on prime key attribute.

For any non-trivial functional dependency X → A, either:

X is a superkey, or
A is a prime attribute.


We find that in the above Student_detail relation, Stu_ID is the key and the only prime attribute. We find that City can be identified by Stu_ID as well as by Zip itself. Neither is Zip a superkey, nor is City a prime attribute. Additionally, Stu_ID → Zip → City, so there exists a transitive dependency.

To bring this relation into third normal form, we break the relation into two relations as follows −
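The decomposed relations are not shown in this transcript, so the sketch below is only an assumed reconstruction from the dependencies described above (Stu_ID → Zip and Zip → City), again using sqlite3; the column types are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The transitive dependency Stu_ID -> Zip -> City is removed by keeping
# (Stu_ID, Zip) with the student and moving (Zip, City) into its own relation.
con.executescript("""
CREATE TABLE student_detail (
    Stu_ID TEXT PRIMARY KEY,
    Zip    TEXT            -- depends directly on the key
);
CREATE TABLE zip_codes (
    Zip  TEXT PRIMARY KEY,
    City TEXT              -- City now depends only on Zip
);
""")
```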

Boyce-Codd Normal Form (BCNF): It is an advanced version of 3NF, which is why it is also referred to as 3.5NF. BCNF is stricter than 3NF. A table complies with BCNF if it is in 3NF and, for every functional dependency X -> Y, X is a superkey of the table.

Example: Suppose there is a company wherein employees work in more than one department. They store the data like this:

emp_id | emp_nationality | emp_dept | dept_type | dept_no_of_emp
1001 | Austrian | Production and planning | D001 | 200
1001 | Austrian | stores | D001 | 250
1002 | American | design and technical support | D134 | 100
1002 | American | Purchasing department | D134 | 600

Functional dependencies in the table above:
emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}

Candidate key: {emp_id, emp_dept}

The table is not in BCNF as neither emp_id nor emp_dept alone is a key.


To make the table comply with BCNF we can break the table into three tables like this:

emp_nationality table:

emp_id | emp_nationality
1001 | Austrian
1002 | American

emp_dept table:

emp_dept | dept_type | dept_no_of_emp
Production and planning | D001 | 200
stores | D001 | 250
design and technical support | D134 | 100
Purchasing department | D134 | 600

emp_dept_mapping table:

emp_id | emp_dept
1001 | Production and planning
1001 | stores
1002 | design and technical support
1002 | Purchasing department


Functional dependencies:
emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}

Candidate keys:
For the first table: emp_id
For the second table: emp_dept
For the third table: {emp_id, emp_dept}

This is now in BCNF since, in both functional dependencies, the left-hand side is a key.
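As a sketch, the three tables above can be written as SQL through Python's built-in sqlite3 module; the column types and foreign-key constraints are assumptions added for illustration, not part of the original report.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE emp_nationality (
    emp_id          INTEGER PRIMARY KEY,
    emp_nationality TEXT
);
CREATE TABLE emp_dept (
    emp_dept        TEXT PRIMARY KEY,
    dept_type       TEXT,
    dept_no_of_emp  INTEGER
);
CREATE TABLE emp_dept_mapping (
    emp_id   INTEGER REFERENCES emp_nationality(emp_id),
    emp_dept TEXT    REFERENCES emp_dept(emp_dept),
    PRIMARY KEY (emp_id, emp_dept)
);
""")
# In every remaining functional dependency the determinant (left-hand side)
# is a key of its table, so each table is in BCNF.
```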

Non-prime attribute:

An attribute that is not part of any candidate key is known as non-prime attribute.

Prime attribute:

An attribute that is a part of one of the candidate keys is known as prime attribute.

Data mining:

Data mining is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD.

The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It is also a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence, machine learning, and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons. Often the more general terms (large-scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but they do belong to the overall KDD process as additional steps.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

Evolutionary Step | Business Question | Enabling Technology
Data Collection (1960s) | "What was my total revenue in the last five years?" | computers, tapes, disks
Data Access (1980s) | "What were unit sales in New England last March?" | faster and cheaper computers with more storage, relational databases
Data Warehousing and Decision Support | "What were unit sales in New England last March? Drill down to Boston." | faster and cheaper computers with more storage, on-line analytical processing (OLAP), multidimensional databases, data warehouses
Data Mining | "What's likely to happen to Boston unit sales next month? Why?" | faster and cheaper computers with more storage, advanced computer algorithms

Data Warehouses:

Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data.


Centralization of data is needed to maximize user access and analysis. Dramatic technological advances are making this vision a reality for many companies. And, equally dramatic advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining.

What can data mining do?

Data mining is primarily used today by companies with a strong consumer focus - retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. And, it enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detail transactional data.

With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments.

For example, Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. American Express can suggest products to its cardholders based on analysis of their monthly expenditures.

Wal-Mart is pioneering massive data mining to transform its supplier relationships. Wal-Mart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata data warehouse. Wal-Mart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities. In 1995, Wal-Mart computers processed over 1 million complex data queries.

The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game.

By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open jump shot.


How does data mining work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

Classes:

Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clusters:

Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Associations:

Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
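A minimal sketch of this idea, counting how often pairs of items occur in the same basket, is shown below; the transaction data is invented purely for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale baskets (invented for illustration).
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "diapers", "milk"},
    {"milk", "bread"},
    {"beer", "chips"},
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))
# ('beer', 'diapers') has the highest support -> a candidate association rule.
```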

Sequential patterns:

Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:

Extract, transform, and load transaction data onto the data warehouse system.

Store and manage the data in a multidimensional database system.

Provide data access to business analysts and information technology professionals.

Analyze the data by application software.

Present the data in a useful format, such as a graph or table.

Different levels of analysis are available:

Artificial neural networks:

Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Genetic algorithms:

Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.

Decision trees:

Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
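As an illustration of a single CART-style 2-way split, here is a minimal sketch that picks the threshold on one numeric attribute minimizing the weighted Gini impurity; the attribute and data are invented for illustration.

```python
# Records: (age, bought_backpack) -- invented data for illustration only.
records = [(22, "yes"), (25, "yes"), (31, "no"), (45, "no"), (52, "no"), (19, "yes")]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_yes = labels.count("yes") / n
    return 1.0 - p_yes ** 2 - (1.0 - p_yes) ** 2

def best_split(data):
    """Find the 2-way split `age <= threshold` with the lowest weighted impurity."""
    best = None
    for threshold, _ in data:
        left = [label for age, label in data if age <= threshold]
        right = [label for age, label in data if age > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

print(best_split(records))  # (25, 0.0): age <= 25 separates the two classes cleanly
```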

Nearest neighbor method:

A technique that classifies each record in a dataset based on a combination of the classes of the k records most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
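A minimal sketch of the k-nearest neighbor idea on made-up two-dimensional records (all features, labels, and the query point are invented for illustration):

```python
from collections import Counter

# Historical records: (feature_1, feature_2) -> class label.
history = [
    ((1.0, 1.2), "low"),
    ((0.8, 0.9), "low"),
    ((3.1, 2.9), "high"),
    ((3.3, 3.0), "high"),
    ((2.9, 3.2), "high"),
]

def classify(record, k=3):
    """Label `record` by majority vote among its k nearest historical records."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    nearest = sorted(history, key=lambda item: sq_dist(item[0], record))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((3.0, 3.0)))  # -> "high"
```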

Rule induction:

The extraction of useful if-then rules from data based on statistical significance.

Data visualization:

The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.


What technological infrastructure is required?

Today, data mining applications are available on all size systems for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. NCR has the capacity to deliver applications exceeding 100 terabytes. There are two critical technological drivers:

Size of the database:

The more data being processed and maintained, the more powerful the system required.

Query complexity:

The more complex the queries and the greater the number of queries being processed, the more powerful the system required.

Relational database storage and management technology is adequate for many data mining applications of less than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support larger applications. Some vendors have added extensive indexing capabilities to improve query performance. Others use new hardware architectures such as Massively Parallel Processors (MPP) to achieve order-of-magnitude improvements in query time. For example, MPP systems from NCR link hundreds of high-speed Pentium processors to achieve performance levels exceeding those of the largest supercomputers.