KaaShiv InfoTech, Number 1 Inplant Training Experts in Chennai.
Data mining, an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information.
For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system.
The core idea of the project is to provide a graphical representation of an individual's predicted health history by matching the patient's history with related information in the system. Temporal string pattern matching is one of the core algorithmic approaches used to match the patient's history against an individual's symptoms in order to predict impending health issues. Restrictions are also applied through the temporal string pattern algorithm to filter out the appropriate disease-related matches. In addition to pattern matching, the results are obtained by means of nondeterministic finite state automata (NFA). In our case, the automata are built on factors, that is, a factor-based nondeterministic finite state automaton. Checking constraints such as data constraints, drawing constraints, and interface constraints makes it possible to obtain the exact filter criteria and thereby match the exact issues from the patient's history.
This paper proposes a novel temporal knowledge representation and learning framework to perform large-scale temporal signature mining of longitudinal heterogeneous event data. The framework enables the representation, extraction, and mining of high-order latent event structure and relationships within single and multiple event sequences. The proposed knowledge representation maps the heterogeneous event sequences to a geometric image by encoding events as a structured spatial-temporal shape process. We present a doubly constrained convolutional sparse coding framework that learns interpretable and shift-invariant latent temporal event signatures. We show how to cope with the sparsity in the data as well as in the latent factor model by inducing a double sparsity constraint on the β-divergence to learn an overcomplete sparse latent factor model. A novel stochastic optimization scheme performs large-scale incremental learning of group-specific temporal event signatures. We validate the framework on synthetic data and on an electronic health record dataset.
In recent years, privacy-preserving techniques have been actively studied for time-series data in fields such as financial, medical, and weather analysis. We focus on preserving the data through anonymity and generalization techniques. We first investigate what privacy needs to be incorporated in the time-series data. After identifying the data that needs to be protected, various perturbation techniques were identified and worked out in relation to secure multi-party computation (SMC) and encryption techniques in distributed computing. Our project focuses on a generalization technique in which the data is filtered or generalized into a grouped structure based on a time-series grouping algorithm and shown in an approximate format, so that the data is not disclosed. Overall, the user is provided with the performance structure without the exact information. In addition, we are trying to incorporate security by adding a deformable/detectable noise to the time-series data.
Time series is an important form of data available in numerous applications and often contains a vast amount of personal private information. The need to protect privacy in time-series data while effectively supporting complex queries on them poses nontrivial challenges to the database community. We study the anonymization of time series while trying to support complex queries, such as range and pattern matching queries, on the published data. The conventional k-anonymity model cannot effectively address this problem as it may suffer severe pattern loss. We propose a novel anonymization model called (k, P)-anonymity for pattern-rich time series. This model publishes both the attribute values and the patterns of time series in separate data forms. We demonstrate that our model can prevent linkage attacks on the published data while effectively supporting a wide variety of queries on the anonymized data. We propose two algorithms to enforce (k, P)-anonymity on time-series data. Our anonymity model supports customized data publishing, which allows a certain part of the values but a different part of the pattern of the anonymized time series to be published simultaneously. We present estimation techniques to support query processing on such customized data. The proposed methods are evaluated in a comprehensive experimental study. Our results verify the effectiveness and efficiency of our approach.
Crowdsourcing, one of the largest human resource provider networks, outsources complex jobs to an unlimited number of human workers to achieve quick task completion. The work done by the workers is finally merged by the operator to complete the task. The work published on the cloud may contain sensitive data that is prone to attack by unauthorized users. Existing privacy protection techniques only partially solve this issue because they involve huge data loss, thus solving one issue while creating a new one. We identify an interesting real-time problem, finding the best products through user feedback, which has not been studied before. Given a group of products in the existing market, we want to identify a set of k “best/alternative” possible products such that the newly entered products are not dominated by the products in the existing market. In addition, the products' private information should not be disclosed to anyone. Dominance and skyline analysis are important in decision-making applications. Existing research focuses on helping customers find a set of “best” possible solutions from a pool of given solutions. This project identifies an interesting problem, predicting data based on the available data, with two major concerns: the prediction must be accurate, and the security of the data must not be compromised. Given a set of solutions in the existing business, we want to find a set of k “best” possible solutions such that these new solutions are not dominated by the solutions available in the existing market. Prior knowledge about the outsourced data/work helps to drastically reduce data loss, increase accuracy, and better tune the anonymity strategy.
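The dominance test behind skyline analysis can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes each product is a tuple of numeric scores where larger is better, and the names `dominates`, `skyline`, and the sample data are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on every attribute and
    strictly better on at least one (assumption: larger is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points: List[Point]) -> List[Point]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical products scored on (quality, battery life).
existing = [(4.0, 2.0), (3.0, 3.0), (1.0, 1.0), (2.0, 3.0)]
market_skyline = skyline(existing)

# A candidate product is acceptable if nothing on the market dominates it.
candidate = (3.5, 2.5)
candidate_ok = not any(dominates(q, candidate) for q in existing)
```

Here `(1.0, 1.0)` and `(2.0, 3.0)` are dominated and drop out of the skyline, while the candidate survives because no existing product beats it on both attributes.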
In a crowdsourcing database, human operators are embedded into the database engine and collaborate with other conventional database operators to process the queries. Each human operator publishes small HITs (Human Intelligence Tasks) to the crowdsourcing platform, which consist of a set of database records and corresponding questions for human workers. The human workers complete the HITs and return the results to the crowdsourcing database for further processing. In practice, published records in HITs may contain sensitive attributes, probably causing privacy leakage so that malicious workers could link them with other public databases to reveal individual private information. Conventional privacy protection techniques, such as K-Anonymity, can be applied to partially solve the problem. However, after generalizing the data, the result of standard K-Anonymity algorithms may incur uncontrollable information loss and affect the accuracy of crowdsourcing. In this paper, we first study the tradeoff between privacy and accuracy for the human operator within the data anonymization process. A probability model is proposed to estimate the lower bound and upper bound of the accuracy for general K-Anonymity approaches. We show that searching for the optimal anonymity approach is NP-hard and only heuristic approaches are available. The second contribution of the paper is a general feedback-based K-Anonymity scheme. In our scheme, synthetic samples are published to the human workers, the results of which are used to guide the selection of anonymity strategies. We apply the scheme to the Mondrian algorithm by adaptively cutting the dimensions based on our feedback results on the synthetic samples. We evaluate the performance of the feedback-based approach on a US census dataset, and show that given a predefined K, our proposal outperforms standard K-Anonymity approaches in retaining the effectiveness of crowdsourcing.
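To make the K-Anonymity requirement concrete, the following sketch checks whether every combination of quasi-identifier values occurs at least k times, before and after a simple generalization step. It is an illustration only, not the feedback-based scheme or the Mondrian algorithm described above; the records, the 10-year age buckets, and the function names are assumptions.

```python
from collections import Counter


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a range such as '20-29' (assumed bucket width)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(records, quasi_identifiers, k):
    """True if each quasi-identifier combination is shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical records; 'disease' is the sensitive attribute.
records = [
    {"age": 23, "zip": "60001", "disease": "flu"},
    {"age": 27, "zip": "60001", "disease": "cold"},
    {"age": 31, "zip": "60002", "disease": "flu"},
    {"age": 38, "zip": "60002", "disease": "asthma"},
]
generalized = [{**r, "age": generalize_age(r["age"])} for r in records]

raw_ok = is_k_anonymous(records, ["age", "zip"], k=2)
generalized_ok = is_k_anonymous(generalized, ["age", "zip"], k=2)
```

The raw table fails for k = 2 because every (age, zip) pair is unique, while the generalized table passes. The price is the information loss the abstract refers to: exact ages are no longer available to the workers.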
A co-location spatial pattern is a pattern of multiple groups that correlates spatial features/events that are frequently co-located in the same zone. Co-location pattern mining emphasizes overall analysis using the proportion of spatial features and other relevant information. The existing system uses the participation index to measure the prevalence of a co-location for two reasons. First, this measure is closely related to the cross-K function, which is often used as a statistical measure among pairs of spatial features. Second, it has an anti-monotone property that can be exploited for computational efficiency. In our research project, we are trying to incorporate a novel multi-resolution pruning technique to address the problem of mining co-location patterns with rare spatial features.
A spatial co-location pattern is a group of spatial features whose instances are frequently located together in geographic space. Discovering co-locations has many useful applications. For example, co-located plant species discovered from plant distribution data sets can contribute to the analysis of plant geography, phytosociology studies, and plant protection recommendations. In this paper, we study the co-location mining problem in the context of uncertain data, as the data generated from a wide range of data sources are inherently uncertain. One straightforward method to mine the prevalent co-locations in a spatially uncertain data set is to simply compute the expected participation index of a candidate and decide if it exceeds a minimum prevalence threshold. Although this definition has been widely adopted, it misses important information about the confidence which can be associated with the participation index of a co-location. We propose another definition, probabilistic prevalent co-locations, trying to find all the co-locations that are likely to be prevalent in a randomly generated possible world. Finding probabilistic prevalent co-locations (PPCs) turns out to be difficult. First, we propose pruning strategies for candidates to reduce the amount of computation of the probabilistic participation index values. Next, we design an improved dynamic programming algorithm for identifying candidates. This algorithm is suitable for parallel computation and approximate computation. Finally, the effectiveness and efficiency of the proposed methods, as well as the pruning strategies and the optimization techniques, are verified by extensive experiments with “real + synthetic” spatially uncertain data sets.
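For orientation, the ordinary (deterministic) participation index of a two-feature co-location can be computed as below. This sketch does not handle uncertain data or the probabilistic definition proposed in the abstract; it assumes exact point coordinates and a Euclidean distance threshold, and all names and sample points are hypothetical.

```python
from math import dist


def participation_index(instances_a, instances_b, max_distance):
    """Participation index of the pattern {A, B}: the smaller of the two
    participation ratios, where a feature's ratio is the fraction of its
    instances that have at least one neighbor of the other feature."""
    a_in, b_in = set(), set()
    for i, pa in enumerate(instances_a):
        for j, pb in enumerate(instances_b):
            if dist(pa, pb) <= max_distance:
                a_in.add(i)
                b_in.add(j)
    ratio_a = len(a_in) / len(instances_a)
    ratio_b = len(b_in) / len(instances_b)
    return min(ratio_a, ratio_b)


# Hypothetical instances of two plant species.
species_a = [(0, 0), (5, 5), (10, 10)]
species_b = [(0, 1), (5, 6)]
pi = participation_index(species_a, species_b, max_distance=1.5)
```

Two of the three A instances and both B instances take part in a neighbor pair, so the index is min(2/3, 1) = 2/3. The pattern would count as prevalent under any minimum prevalence threshold at or below that value.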
With current internet trends, every individual expects optimized search suggestions. Crowdsourcing, one of the largest human resource provider networks, is useful for search suggestions because it helps find common phrases that other people have searched for. We identify an interesting real-time problem: finding the best products through suggestions that combine an individual user's rating of a particular brand, the ratings given to the product by the user's friends, and the generalized crowdsourced opinion. The system comprises product-to-product correlation under a brand, user-to-user correlation under common attributes, and crowdsourced opinion by means of key factors obtained for the product, in order to make the suggestions more optimal.
Users face many choices on the web when it comes to choosing which product to buy, which video to watch, and so on. In making adoption decisions, users rely not only on their own preferences, but also on friends. We call the latter social correlation, which may be caused by the homophily and social influence effects. In this paper, we focus on modeling social correlation on users' item adoptions. Given a user-user social graph and an item-user adoption graph, our research seeks to answer the following questions: whether the items adopted by a user correlate with items adopted by her friends, and how to model item adoptions using social correlation. We propose a social correlation framework that considers a social correlation matrix representing the degrees of correlation from every user to the user’s friends, in addition to a set of latent factors representing topics of interest of individual users. Based on the framework, we develop two generative models, namely sequential and unified, and the corresponding parameter estimation approaches. From each model, we devise the social correlation only and hybrid methods for predicting missing adoption links. Experiments on LiveJournal and Epinions data sets show that our proposed models outperform the approach based on latent factors only (LDA).
Projecting data in different dimensions is the core concept of this project. The data is dimensionalized from various perspectives, which can be achieved using three methods: CASE, exploiting the programming CASE construct; SPJ, based on standard relational algebra operators (SPJ queries); and PIVOT, using the PIVOT operator, which is offered by some DBMSs. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. Existing SQL aggregations have limitations for preparing data sets because they return one column per aggregated group. In general, significant manual effort is required to build data sets where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. In addition, this project introduces a performance evaluation of the three methods, and we have included a further area of data mining: extracting data from knowledge cubes, which can be achieved with MDX queries.
Preparing a data set for analysis is generally the most time-consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations have limitations for preparing data sets because they return one column per aggregated group. In general, significant manual effort is required to build data sets where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: CASE, exploiting the programming CASE construct; SPJ, based on standard relational algebra operators (SPJ queries); and PIVOT, using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and it is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not.
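The CASE method can be illustrated with a small generated query. The sketch below uses Python's standard `sqlite3` module purely as a convenient SQL engine. The `sales` table, its columns, and the generated column names are assumptions made for the example, and the code generator is a simplified stand-in for the one the abstract describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("S1", "Q1", 10.0), ("S1", "Q2", 20.0), ("S1", "Q1", 5.0), ("S2", "Q2", 7.0)],
)

# Step 1: find the distinct values that will become output columns.
quarters = [row[0] for row in conn.execute(
    "SELECT DISTINCT quarter FROM sales ORDER BY quarter")]

# Step 2: generate one SUM(CASE ...) expression per value.
# The values are interpolated directly because they come from this toy table;
# real code would need to quote or validate them.
columns = ", ".join(
    f"SUM(CASE WHEN quarter = '{q}' THEN amount ELSE 0.0 END) AS amount_{q}"
    for q in quarters
)
query = f"SELECT store, {columns} FROM sales GROUP BY store ORDER BY store"

# Step 3: one row per store, one aggregated column per quarter.
rows = conn.execute(query).fetchall()
```

A plain `GROUP BY store, quarter` would return one row per (store, quarter) pair. The generated query instead returns a single row per store with `amount_Q1` and `amount_Q2` side by side, which is the horizontal layout most mining algorithms expect as input.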
Usually, data mining is considered the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. In our data-driven data mining model, knowledge originally exists in the data but is not understandable to humans. Data mining is taken as a process of transforming knowledge from data format into some other human-understandable format such as a rule, formula, or theorem. In order to keep the knowledge unchanged in a data mining process, the knowledge properties should be kept unchanged during the knowledge transformation process. Many real-world data mining tasks are highly constraint-based and domain-oriented. Thus, domain prior knowledge should also be a knowledge source for data mining. The control a user exerts over a data mining process can also be taken as a kind of dynamic input to the process. Thus, a data mining process mines knowledge not only from data but also from humans. This is the key idea of Domain-oriented Data-driven Data Mining (3DM). In the view of granular computing (GrC), a data mining process can be considered the transformation of knowledge across different granularities. Original data is a representation of knowledge at the finest granularity and is not understandable to humans, who are, however, sensitive to knowledge at coarser granularities. So, a data mining process can be considered a transformation of knowledge from a finer granularity space to a coarser granularity space. The understandings of data mining in 3DM and GrC are consistent with each other. Rough sets and fuzzy sets are two important computing paradigms of GrC. Both are generalizations of classical set theory for modeling vagueness and uncertainty. Although both can be used to address vagueness, they are not rivals; in some real problems they are even complementary. In this plenary talk, the new understanding of data mining, domain-oriented data-driven data mining (3DM), will be introduced. The relationship between 3DM and GrC, and granular computing based data mining in the views of rough sets and fuzzy sets, will be discussed.
The following topics are dealt with: data mining in the Web 2.0 environment; knowledge discovery from multimedia data and multimedia applications; mining and management of biological data; data mining in medicine; optimization-based data mining techniques; high; data mining on; data stream mining and management; spatial and spatio-temporal data mining.
In this paper, we propose four data mining models for the Internet of Things: a multi-layer data mining model, a distributed data mining model, a Grid-based data mining model, and a data mining model from a multi-technology integration perspective. Among them, the multi-layer model includes four layers: (1) data collection layer, (2) data management layer, (3) event processing layer, and (4) data mining service layer. The distributed data mining model can solve problems arising from depositing data at different sites. The Grid-based data mining model allows a Grid framework to realize the functions of data mining. The data mining model from a multi-technology integration perspective describes the corresponding framework for the future Internet. Several key issues in data mining of IoT are also discussed.
Many organizations underutilize their existing data warehouses. In this paper, we suggest a way of acquiring more information from corporate data warehouses without the complications and drawbacks of deploying additional software systems. Association-rule mining, which captures co-occurrence patterns within data, has attracted considerable effort from data warehousing researchers and practitioners alike. Unfortunately, most data mining tools are loosely coupled, at best, with the data warehouse repository. Furthermore, these tools can often find association rules only within the main fact table of the data warehouse (thus ignoring the information-rich dimensions of the star schema) and are not easily applied to the non-transaction-level data often found in data warehouses. In this paper, we present a new data mining framework that is tightly integrated with data warehousing technology. Our framework has several advantages over the use of separate data mining tools. First, the data stays in the data warehouse, and thus the effort of managing security and privacy issues is greatly reduced. Second, we utilize the query processing power of the data warehouse itself, without using a separate data mining tool. In addition, this framework allows ad hoc data mining queries over the whole data warehouse, not just over the transformed portion of the data that is required when a standard data mining tool is used. Finally, this framework also expands the domain of association-rule mining from transaction-level data to aggregated data.
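Association rules are usually scored by support and confidence, which the following sketch computes over a handful of transactions. It only illustrates these two standard measures; it is not the warehouse-integrated framework described above, and the transactions and function names are made up for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent): support of the union divided
    by support of the antecedent (assumed to be non-zero)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

For the rule bread → milk, two of the four baskets contain both items (support 0.5) and two of the three bread baskets also contain milk (confidence 2/3).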
Data mining is the process of posing queries and extracting patterns, often previously unknown, from large quantities of data using pattern matching or other reasoning techniques. Data mining has many applications in security, including national security as well as cyber security. The threats to national security include attacking buildings and destroying critical infrastructures such as power grids and telecommunication systems. Data mining techniques are being investigated to find out who the suspicious people are and who is capable of carrying out terrorist activities. Cyber security is concerned with protecting computer and network systems against corruption due to Trojan horses, worms, and viruses. Data mining is also being applied to provide solutions such as intrusion detection and auditing. The first part of the presentation will discuss my joint research with Prof. Latifur Khan and our students at the University of Texas at Dallas on data mining for cyber security applications. For example, anomaly detection techniques could be used to detect unusual patterns and behaviors. Link analysis may be used to trace viruses to the perpetrators. Classification may be used to group various cyber attacks and then use the profiles to detect an attack when it occurs. Prediction may be used to determine potential future attacks, depending in part on information learned about terrorists through email and phone conversations.
The well-known privacy-preserving data mining modifies existing data mining techniques to work on randomized data. In this paper, we investigate data mining as a technique for masking data, therefore termed data mining based privacy protection. This approach partially incorporates the requirement of a targeted data mining task into the process of masking data so that essential structure is preserved in the masked data. The idea is simple but novel: we explore the data generalization concept from data mining as a way to hide detailed information, rather than discover trends and patterns. Once the data is masked, standard data mining techniques can be applied without modification. Our work demonstrates another positive use of data mining technology: not only can it discover useful patterns, but it can also mask private information. We consider the following privacy problem: a data holder wants to release a version of data for building classification models, but wants to protect against linking the released data to an external source for inferring sensitive information. We adapt an iterative bottom-up generalization from data mining to generalize the data. The generalized data remains useful for classification but becomes difficult to link to other sources. The generalization space is specified by a hierarchical structure of generalizations. A key step is identifying the best generalization to climb up the hierarchy at each iteration. Enumerating all candidate generalizations is impractical. We present a scalable solution that examines at most one generalization in each iteration for each attribute involved in the linking.
This paper accentuates an approach of implementing distributed data mining (DDM) using multi-agent system (MAS) technology, and proposes a data mining technique called “CAKE” (classifying, associating & knowledge discovery). The architecture is based on centralized parallel data mining agents (PADMAs). Data mining is part of a recently introduced term known as BI, or business intelligence. The need is to derive knowledge out of the abstract data. The process is difficult, complex, time-consuming, and resource-hungry. These problems are addressed in the proposed model. The model architecture is distributed, uses a knowledge-driven mining technique, and is flexible enough to work on any data warehouse, which helps to overcome these problems. Good knowledge of the data, meta-data, and business domain is required for defining rules for data mining. The model assumes that the data and the data warehouse have already gone through the necessary processes and are ready for data mining.
With the increasing use of database applications, mining interesting information from huge databases has become of great concern, and a variety of mining algorithms have been proposed in recent years. As we know, the data processed in data mining may be obtained from many sources in which different data types may be used. However, no algorithm can be applied to all applications due to the difficulty of fitting data types to the algorithm. The selection of an appropriate data mining algorithm is based not only on the goal of the application, but also on the data fittability. Therefore, transforming a non-fitting data type into a target one is also important in data mining, but the work is often tedious or complex since a lot of data types exist in the real world. Merging the similar data types of a given selected mining algorithm into a generalized data type seems to be a good approach to reduce the transformation complexity. In this work, the data type fittability problem for six kinds of widely used data mining techniques is discussed and a data type generalization process, including merging and transforming phases, is proposed. In the merging phase, the original data types of the data sources to be mined are first merged into the generalized ones. The transforming phase is then used to convert the generalized data types into the target ones for the selected mining algorithm. Using the data type generalization process, the user can select an appropriate mining algorithm just for the goal of the application without considering the data types.
This paper proposes a framework for cost-sensitive classification under a generalized cost function. By combining decision trees with sequential binary programming, we can handle unequal misclassification costs, constrained classification, and complex objective functions that other methods cannot. Our approach has two main contributions. First, it provides a new method for cost-sensitive classification that outperforms a traditional, accuracy-based method and some current cost-sensitive approaches. Second, and more important, our approach can handle a generalized cost function, instead of the simpler misclassification cost matrix to which other approaches are limited.
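For contrast with the generalized cost function proposed above, the simpler misclassification cost matrix that the abstract attributes to other approaches leads to a one-line decision rule: predict the class with the lowest expected cost. The sketch below shows only that baseline rule, not the decision-tree and sequential binary programming method of the paper; the probabilities, costs, and names are illustrative assumptions.

```python
def min_expected_cost_class(probabilities, cost):
    """Return (best_class, expected_costs).

    probabilities[i] is the estimated probability that the true class is i;
    cost[i][j] is the cost of predicting class j when the true class is i.
    """
    n = len(probabilities)
    expected = [
        sum(probabilities[i] * cost[i][j] for i in range(n)) for j in range(n)
    ]
    best = min(range(n), key=expected.__getitem__)
    return best, expected


# Hypothetical two-class case: class 0 = healthy, class 1 = sick.
# Missing a sick patient (true 1, predicted 0) is assumed to cost 10 times
# more than a false alarm (true 0, predicted 1).
probs = [0.8, 0.2]
cost_matrix = [[0.0, 1.0],
               [10.0, 0.0]]
best_class, expected_costs = min_expected_cost_class(probs, cost_matrix)
```

An accuracy-based classifier would predict "healthy" here because its probability is 0.8. The expected costs are 2.0 for predicting healthy and 0.8 for predicting sick, so the cost-sensitive rule predicts "sick".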
Many important industrial applications rely on data mining methods to uncover patterns and trends in large data warehouse environments. Since a data warehouse is typically updated periodically in a batch mode, the mined patterns have to be updated as well. This requires not only accuracy from data mining methods but also fast availability of up-to-date knowledge, particularly in the presence of a heavy update load. To cope with this problem, we propose the use of online data mining algorithms which permanently store the discovered knowledge in suitable data structures and enable an efficient adaptation of these structures after insertions and deletions on the raw data. In this paper, we demonstrate how hierarchical clustering methods can be reformulated as online algorithms based on the hierarchical clustering method OPTICS, using a density estimator for data grouping. We also discuss how this algorithmic schema can be specialized for efficient online single-link clustering. A broad experimental evaluation demonstrates that the efficiency is superior with significant speed-up factors even for large bulk insertions and deletions.
In this paper we propose a novel approach that uses structure as well as the content of emails in a folder for email classification. Our approach is based on the premise that representative (common and recurring) structures/patterns can be extracted from a pre-classified email folder and the same can be used effectively for classifying incoming emails. A number of factors that influence representative structure extraction and the classification are analyzed conceptually and validated experimentally. In our approach, the notion of inexact graph match is leveraged for deriving structures that provide coverage for characterizing folder contents. Extensive experimentation validates the selection of parameters and the effectiveness of our approach for email classification.
We consider the problem of detecting anomalies in data that arise as multidimensional arrays with each dimension corresponding to the levels of a categorical variable. In typical data mining applications, the number of cells in such arrays is usually large. Our primary focus is detecting anomalies by comparing information at the current time to historical data. Naive approaches advocated in the process control literature do not work well in this scenario due to the multiple testing problem: performing multiple statistical tests on the same data produces an excessive number of false positives. We use an empirical Bayes method which works by fitting a two-component Gaussian mixture to deviations at the current time. The approach is scalable to problems that involve monitoring a massive number of cells and fast enough to be potentially useful in many streaming scenarios. We show the superiority of the method relative to a naive "per component error rate" procedure through simulation. A novel feature of our technique is the ability to suppress deviations that are merely the consequence of sharp changes in the marginal distributions. This research was motivated by the need to extract critical application information and business intelligence from the daily logs that accompany large-scale spoken dialog systems deployed by AT&T. We illustrate our method on one such system.
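The multiple testing problem mentioned above can be made concrete with a short calculation. The symbols m (number of cells tested) and α (per-test significance level) and the numbers used are illustrative assumptions, not values from the paper. If none of the m cells is truly anomalous, then

```latex
\[
\mathbb{E}[\text{false positives}] = m\,\alpha,
\qquad
\Pr(\text{at least one false positive}) = 1 - (1 - \alpha)^{m},
\]
```

where the second expression additionally assumes the tests are independent. With m = 10,000 cells and α = 0.05, a per-cell test raises about 500 false alarms on average even when nothing has changed, and the probability of at least one false alarm is essentially 1.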
We present a new framework for classifier fusion that uses a shared sampling distribution for obtaining a weighted classifier ensemble. The weight update process is self-regularizing, as subsequent classifiers trained on the disjoint views rectify the bias introduced by any classifier in preceding iterations. We provide theoretical guarantees that our approach provides results which are better than the case when boosting is performed separately on different views. The results are shown to outperform other classifier fusion strategies on a well-known texture image database.
Accurate topical classification of user queries allows for increased effectiveness and efficiency in general-purpose Web search systems. Such classification becomes critical if the system is to return results not just from a general Web collection but from topic-specific back-end databases as well. Maintaining sufficient classification recall is very difficult as Web queries are typically short, yielding few features per query. This feature sparseness coupled with the high query volumes typical for a large-scale search service makes manual and supervised learning approaches alone insufficient. We use an application of computational linguistics to develop an approach for mining the vast amount of unlabeled data in Web query logs to improve automatic topical Web query classification. We show that our approach in combination with manual matching and supervised learning allows us to classify a substantially larger proportion of queries than any single technique. We examine the performance of each approach on a real Web query stream and show that our combined method accurately classifies 46% of queries, outperforming the recall of best single approach by nearly 20%, with a 7% improvement in overall effectiveness.
Given a large collection of medical images of several conditions and treatments, how can we succinctly describe the characteristics of each setting? For example, given a large collection of retinal images from several different experimental conditions (normal, detached, reattached, etc.), how can data mining help biologists focus on important regions in the images or on the differences between different experimental conditions? If the images were text documents, we could find the main terms and concepts for each condition by existing IR methods (e.g., tf/idf and LSI). We propose something analogous, but for the much more challenging case of an image collection: We propose to automatically develop a visual vocabulary by breaking images into n × n tiles and deriving key tiles ("ViVos") for each image and condition. We experiment with numerous domain-independent ways of extracting features from tiles (color histograms, textures, etc.), and several ways of choosing characteristic tiles (PCA, ICA). We perform experiments on two disparate biomedical datasets. The quantitative measure of success is classification accuracy: Our "ViVos" achieve high classification accuracy (up to 83% for a nine-class problem on feline retinal images). More importantly, qualitatively, our "ViVos" do an excellent job as "visual vocabulary terms": they have biological meaning, as corroborated by domain experts; they help spot characteristic regions of images, exactly like text vocabulary terms do for documents; and they highlight the differences between pairs of images.
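The first step described above, breaking an image into tiles and extracting a simple feature per tile, can be sketched as follows. This is a minimal illustration on a grayscale image stored as a list of rows; the function names, the reading of "n × n" as the tile size in pixels, and the intensity histogram as the feature are assumptions, and the selection of key tiles (PCA, ICA) is not shown.

```python
def split_into_tiles(image, tile_size):
    """Cut a 2-D grayscale image into non-overlapping tile_size x tile_size tiles.

    Pixels that do not fill a complete tile at the right or bottom edge are dropped.
    """
    rows, cols = len(image), len(image[0])
    tiles = []
    for r in range(0, rows - tile_size + 1, tile_size):
        for c in range(0, cols - tile_size + 1, tile_size):
            tiles.append([row[c:c + tile_size] for row in image[r:r + tile_size]])
    return tiles


def intensity_histogram(tile, bins=4, max_value=256):
    """Normalized histogram of pixel intensities, used as the tile's feature vector."""
    counts = [0] * bins
    for row in tile:
        for value in row:
            counts[min(value * bins // max_value, bins - 1)] += 1
    total = sum(counts)
    return [count / total for count in counts]


# Made-up 4 x 4 image with four visually distinct 2 x 2 regions.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [100, 100, 200, 200],
    [100, 100, 200, 200],
]
tiles = split_into_tiles(image, 2)
features = [intensity_histogram(tile) for tile in tiles]
```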
The problem of record linkage focuses on determining whether two object descriptions refer to the same underlying entity. Addressing this problem effectively has many practical applications, e.g., elimination of duplicate records in databases and citation matching for scholarly articles. In this paper, we consider a new domain where the record linkage problem is manifested: Internet comparison shopping. We address the resulting linkage setting that requires learning a similarity function between record pairs from streaming data. The learned similarity function is subsequently used in clustering to determine which records are co-referent and should be linked. We present an online machine learning method for addressing this problem, where a composite similarity function based on a linear combination of basic functions is learned incrementally. We illustrate the efficacy of this approach on several real-world datasets from an Internet comparison shopping site, and show that our method is able to effectively learn various distance functions for product data with differing characteristics. We also provide experimental results that show the importance of considering multiple performance measures in record linkage evaluation.
Assessing rules with interestingness measures is the cornerstone of successful applications of association rule discovery. However, there exists no information-theoretic measure which is adapted to the semantics of association rules. In this article, we present the directed information ratio (DIR), a new rule interestingness measure which is based on information theory. DIR is specially designed for association rules, and in particular it differentiates two opposite rules a → b and a → ¬b. Moreover, to our knowledge, DIR is the only rule interestingness measure which rejects both independence and (what we call) equilibrium, i.e. it discards both the rules whose antecedent and consequent are negatively correlated, and the rules which have more counter-examples than examples. Experimental studies show that DIR is a strongly filtering measure, which is useful for association rule post-processing.
Data mining algorithms face the challenge of dealing with an increasing number of complex objects. For graph data, a whole toolbox of data mining algorithms becomes available by defining a kernel function on instances of graphs. Graph kernels based on walks, subtrees and cycles in graphs have been proposed so far. As a general problem, these kernels are either computationally expensive or limited in their expressiveness. We try to overcome this problem by defining expressive graph kernels which are based on paths. As the computation of all paths and longest paths in a graph is NP-hard, we propose graph kernels based on shortest paths. These kernels are computable in polynomial time, retain expressivity and are still positive definite. In experiments on classification of graph models of proteins, our shortest-path kernels show significantly higher classification accuracy than walk-based kernels.
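A kernel built on shortest paths can be sketched as follows. This is a simplified variant for unweighted, unlabeled graphs that compares only path lengths: each graph is summarized by a histogram of shortest-path lengths, and the kernel value is the dot product of the two histograms. The function names and the choice to ignore node labels are assumptions for illustration, not the formulation of the abstract.

```python
from collections import Counter, deque


def shortest_path_lengths(adj):
    """Histogram of shortest-path lengths over all ordered, connected node pairs.

    adj maps each node to an iterable of its neighbours (unweighted graph).
    """
    lengths = Counter()
    for source in adj:
        # Breadth-first search gives shortest distances in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for target, d in dist.items():
            if target != source:
                lengths[d] += 1
    return lengths


def shortest_path_kernel(adj_a, adj_b):
    """Count pairs of shortest paths (one per graph) that have equal length."""
    hist_a = shortest_path_lengths(adj_a)
    hist_b = shortest_path_lengths(adj_b)
    return sum(count * hist_b[d] for d, count in hist_a.items())


# Made-up example graphs: a path on three nodes and a triangle.
path3 = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
triangle = {"x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"]}
```

Because the value is an inner product of explicit feature vectors, this variant is positive semidefinite by construction, and the breadth-first searches keep the cost polynomial in the graph size.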
Many applications track the movement of mobile objects, which can be represented as sequences of timestamped locations. Given such a spatiotemporal series, we study the problem of discovering sequential patterns, which are routes frequently followed by the object. Sequential pattern mining algorithms for transaction data are not directly applicable for this setting. The challenges to address are: (i) the fuzziness of locations in patterns, and (ii) the identification of non-explicit pattern instances. In this paper, we define pattern elements as spatial regions around frequent line segments. Our method first transforms the original sequence into a list of sequence segments, and detects frequent regions in a heuristic way. Then, we propose algorithms to find patterns by employing a newly proposed substring tree structure and improving the Apriori technique. A performance evaluation demonstrates the effectiveness and efficiency of our approach.
We present SUDA2, a recursive algorithm for finding minimal sample uniques (MSUs). SUDA2 uses a novel method for representing the search space for MSUs and new observations about the properties of MSUs to prune and traverse this space. Experimental comparisons with previous work demonstrate that SUDA2 is not only several orders of magnitude faster but is also capable of identifying the boundaries of the search space, enabling datasets with larger numbers of columns than before to be addressed.
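To make the notion of a minimal sample unique concrete, the sketch below enumerates MSUs for one record by brute force. It assumes the usual definition: a set of columns is a sample unique for a record if no other record shares its values on those columns, and it is minimal if no proper subset is also unique. This is an exhaustive illustration with hypothetical function names, not the SUDA2 algorithm, whose pruning and recursion are not described in the abstract.

```python
from itertools import combinations


def is_unique(records, index, columns):
    """True if records[index] is the only record with its values on `columns`."""
    key = tuple(records[index][c] for c in columns)
    matches = sum(1 for r in records if tuple(r[c] for c in columns) == key)
    return matches == 1


def minimal_sample_uniques(records, index, max_size=None):
    """Brute-force list of minimal unique column sets for one record."""
    n_cols = len(records[index])
    max_size = max_size or n_cols
    found = []
    # Sizes are visited in increasing order, so any unique set that contains
    # no previously found set is guaranteed to be minimal.
    for size in range(1, max_size + 1):
        for cols in combinations(range(n_cols), size):
            if any(set(m) <= set(cols) for m in found):
                continue  # a smaller unique subset exists, so cols is not minimal
            if is_unique(records, index, cols):
                found.append(cols)
    return found


# Made-up microdata: columns are (sex, age band, city).
records = [
    ("M", "30", "NY"),
    ("M", "30", "LA"),
    ("F", "30", "NY"),
    ("M", "40", "NY"),
]
```

The exhaustive search grows exponentially with the number of columns, which is the cost that a pruned search such as the one described above is meant to avoid.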
Data mining aims at extraction of previously unidentified information from large databases. It can be viewed as an automated application of algorithms to discover hidden patterns and to extract knowledge from data. Online Analytical Processing (OLAP) systems, on the other hand, allow exploring and querying huge datasets in interactive way. These OLAP systems are the predominant front-end tools used in data warehousing environments and the OLAP system's market has developed rapidly during the last few years. Several works in the past emphasized the integration of OLAP and data mining. More recently, data mining techniques along with OLAP have been applied in decision support applications to analyze large data sets in an efficient manner. However, in order to integrate data mining results with OLAP the data has to be modeled in a particular type of OLAP schema. An OLAP schema is a collection of database objects, including tables, views, indexes and synonyms. Schema generation process was considered a manual task but in the recent years research communities reported their work in automatic schema generation. In this paper, we reviewed literature on the schema generation techniques and highlighted the limitations of the existing works. The review reveals that automatic schema generation has never been integrated with data mining. Hence, we propose a model for data mining and automatic schema generation of three types namely star, snowflake, and galaxy. Hierarchical clustering technique of data mining was used and schema from the clustered data was generated. We have also developed a prototype of the proposed model and validated it via experiments of real-life data set. The proposed model is significant as it supports both integration and automation process.
An effective analysis of clinical trials data involves analyzing different types of data such as heterogeneous and high dimensional time series data. The current time series analysis methods generally assume that the series at hand have sufficient length to apply statistical techniques to them. Other ideal case assumptions are that data are collected in equal length intervals, and while comparing time series, the lengths are usually expected to be equal to each other. However, these assumptions are not valid for many real data sets, especially for the clinical trials data sets. In addition, the data sources are different from each other, the data are heterogeneous, and the sensitivity of the experiments varies by the source. Approaches for mining time series data need to be revisited, keeping the wide range of requirements in mind. In this paper, we propose a novel approach for information mining that involves two major steps: applying a data mining algorithm over homogeneous subsets of data, and identifying common or distinct patterns over the information gathered in the first step. Our approach is implemented specifically for heterogeneous and high dimensional time series clinical trials data. Using this framework, we propose a new way of utilizing frequent item set mining, as well as clustering and declustering techniques with novel distance metrics for measuring similarity between time series data. By clustering the data, we find groups of analytes (substances in blood) that are most strongly correlated. Most of these relationships are already known and are verified by the clinical panels, and, in addition, we identify novel groups that need further biomedical analysis. A slight modification to our algorithm results in an effective declustering of high dimensional time series data, which is then used for "feature selection." Using industry-sponsored clinical trials datasets, we are able to identify a small set of analytes that effectively models the state of normal health.
Frequent episode mining has been proposed as a data mining task with the goal of recovering sequential patterns from temporal data sequences. While several episode mining approaches have been proposed in the last fifteen years, most of the developed techniques have not been evaluated on a common benchmark data set, limiting the insights gained from experimental evaluations. In particular, it is unclear how well episodes are actually being recovered, leaving an episode mining user without guidelines in the knowledge discovery process. One reason for this can be found in non-disclosure agreements that prevent real life data sets on which approaches have been evaluated from entering the public domain. But even easily accessible real life data sets would not allow to ascertain miners' abilities to identify underlying patterns. A solution to this problem can be seen in generating artificial data, which has the added advantage that patterns can be known, allowing to evaluate the accuracy of mined patterns. Based on insights and experiences stemming from consultations with industrial partners and work with real life data, we propose a data generator for the generation of diverse data sets that reflect realistic data characteristics. We discuss in detail which characteristics real life data can be expected to have and how our generator models them. Finally, we show that we can recreate artificial data that has been used in the literature, contrast it with real life data showing very different characteristics, and show how our generator can be used to create data with realistic characteristics.
With the development of Internet and storage technology, we have accumulated a lot of data. In order to find information in these data, data mining has become an increasingly important topic in research as well as in industrial applications. By now there are many data mining methods and specific tools. This article mainly discusses a new data mining method called Data Mining based on Lattice. It has been applied in many research areas, such as databases, data analysis, and machine learning. Experiments by researchers abroad have shown that it may be a useful method for information retrieval and machine learning problem domains. Data Mining based on Lattice is a useful method of organization across these domains, and its use and application is an area of active and promising research in various fields. Therefore, it is important for us to study the data mining method based on lattices.
This paper focuses on a domain-driven data mining outsourcing scenario whereby a data owner publishes data to an application service provider who returns mining results. To ensure data privacy against an un-trusted party, anonymization, a widely used technique capable of preserving true attribute values and supporting various data mining algorithms is required. Several issues emerge when anonymization is applied in a real world outsourcing scenario. The majority of methods have focused on the traditional data mining paradigm, therefore they do not implement domain knowledge nor optimize data for domain-driven usage. Furthermore, existing techniques are mostly non-interactive in nature, providing little control to users while assuming their natural capability of producing Domain Generalization Hierarchies (DGH). Moreover, previous utility metrics have not considered attribute correlations during generalization. To successfully obtain optimal data privacy and actionable patterns in a real world setting, these concerns need to be addressed. This paper proposes an anonymization framework for aiding users in a domain-driven data mining outsourcing scenario. The framework involves several components designed to anonymize data while preserving meaningful or actionable patterns that can be discovered after mining. In contrast with existing works for traditional data-mining, this framework integrates domain ontology knowledge during DGH creation to retain value meanings after anonymization. In addition, users can implement constraints based on their mining tasks thereby controlling how data generalization is performed. Finally, attribute correlations are calculated to ensure preservation of important features. Preliminary experiments show that an ontology-based DGH manages to preserve semantic meaning after attribute generalization. Also, using Chi-Square as a correlation measure can possibly improve attribute selection before generalization.
With the development of the electric power industry, more and more real-time data is sent to databases by data acquisition systems, and large amounts of data accumulate. Abundant knowledge exists in this historical data. It is meaningful to analyze the historical data of the electric industry and to find useful knowledge and rules in the mass of data in order to provide better decision support and better adjustment guidance. The concept and steps of data mining are introduced in detail. Based on the characteristics of electric data, data mining techniques are introduced into the electric industry, and their feasibility and necessity are discussed. The application of data mining in the electric power industry is discussed, and fault diagnosis and operation optimization based on data mining are studied in detail. The application of data mining in the electric industry can guide optimal operation based on historical data and improve the economic efficiency of power plants.
One of the goals of data mining is to discover hidden rules from existing data. The real rules in data differ according to the characteristics of the data, and the effectiveness of data mining depends mostly on whether the selected method matches those characteristics. To improve the effectiveness of data mining, this paper first discusses the correlation between data mining methods and the characteristics of data, taking temporal data generated by a dynamical system as an example. It then discusses the types of dynamical systems, since the characteristics of the data are determined by the type of dynamical system, and how to determine the type from the data. Finally, we build a neural network to mine the data given the type and parameters of the dynamical system.
Sensor data, which is input from a sensor network, is stream data with continuous and infinite properties. Previous data mining techniques cannot be used directly for sensor data mining because of these properties. Also, most application services in sensor networks are only event alert services, which perceive events from sensors and alert the supervisor to them. In this paper, we define a continuous sensor data mining model and design a system based on the model. The system can provide useful knowledge through continuous mining of data gathered from sensors in the sensor network. First, we classify sensor data into three types (simple sensor data, continuous sensor data, and sensor event data) and define sensor data mining models for outlier analysis, pattern analysis, and prediction analysis. After defining the models, we design a system, based on these mining models, which can be used in application services such as u-Silver care, the Sea Ranching Program, and City Environment Management in a sensor network environment.
Nowadays, most people rely on traditional data mining techniques to address business problems. But data mining was designed for large amounts of data and lacks effective methods for data that is scarce or incomplete, or that is complex overall but shows strong regularity at a certain time or place. Grey system theory is a newer method for studying problems with little data, poor information, and uncertainty, and it makes up for these shortcomings of traditional data mining. Therefore, this paper combines grey system theory with data mining technology, studies and improves the grey relational data mining model, and uses it as the basis for establishing a grey clustering mining model. Finally, the paper applies the grey data mining model to a comparison of securities companies' core competitiveness, demonstrating the correctness and effectiveness of the model.
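For readers unfamiliar with grey relational analysis, the sketch below computes classical grey relational grades between a reference sequence and several candidate sequences. It assumes the textbook formulation with distinguishing coefficient rho = 0.5 and inputs already normalized to a common scale; it is not the improved model proposed in the abstract, and the function name and data are hypothetical.

```python
def grey_relational_grades(reference, candidates, rho=0.5):
    """Classical grey relational grade of each candidate against the reference.

    Assumes all sequences have equal length and are already normalized.
    """
    # Absolute differences between the reference and each candidate, per indicator.
    deltas = [[abs(r - c) for r, c in zip(reference, cand)] for cand in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        return [1.0] * len(candidates)  # every candidate equals the reference
    grades = []
    for row in deltas:
        # Grey relational coefficient per indicator, then the mean as the grade.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))
    return grades


# Made-up normalized indicator scores for three companies against an ideal profile.
ideal = [1.0, 1.0, 1.0]
companies = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [0.0, 0.0, 0.0]]
grades = grey_relational_grades(ideal, companies)
```

A higher grade means the candidate tracks the reference more closely, so ranking by grade gives a comparison that remains usable with only a handful of observations.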
The need for real-time data mining has long been recognized in various application domains. However existing methodologies are still limited to the optimization of single classical data mining algorithms. In this paper, we investigate the development of a general purpose methodology for real-time data mining and propose a novel supporting framework. In the methodology, definition, characteristics and principles of real-time data mining are finely studied. The framework is proposed based on the novel dynamic data mining process model. The model offers the ability to incrementally update data mining knowledge and synchronously execute data mining tasks; an implementation of the framework and a case study are also presented.
Data mining is an increasingly important technology for extracting useful knowledge hidden in large collections of data. There are, however, negative social perceptions about data mining, among which are potential privacy invasion and potential discrimination. The latter consists of unfairly treating people on the basis of their belonging to a specific group. Automated data collection and data mining techniques such as classification rule mining have paved the way to making automated decisions, like loan granting/denial, insurance premium computation, etc. If the training data sets are biased with regard to discriminatory (sensitive) attributes like gender, race, religion, etc., discriminatory decisions may ensue. For this reason, anti-discrimination techniques including discrimination discovery and prevention have been introduced in data mining. Discrimination can be either direct or indirect. Direct discrimination occurs when decisions are made based on sensitive attributes. Indirect discrimination occurs when decisions are made based on nonsensitive attributes which are strongly correlated with biased sensitive ones. In this paper, we tackle discrimination prevention in data mining and propose new techniques applicable for direct or indirect discrimination prevention individually or both at the same time. We discuss how to clean training data sets and outsourced data sets in such a way that direct and/or indirect discriminatory decision rules are converted to legitimate (nondiscriminatory) classification rules. We also propose new metrics to evaluate the utility of the proposed approaches and we compare these approaches. The experimental evaluations demonstrate that the proposed techniques are effective at removing direct and/or indirect discrimination biases in the original data set while preserving data quality.
It is well known that over 80% of the time required to carry out any real world data mining project is usually spent on data preprocessing. Data preprocessing lays the groundwork for data mining. Before the discovery of useful information/knowledge, the target data set must be properly prepared. Unfortunately, it is ignored by most data mining researchers due to its perceived difficulty. This paper describes an efficient approach for data preprocessing for mining Web based customer survey data in order to speed up the data preparation process. The proposed approach is based on a unified data model derived from analysis of the characteristics of the customer survey data. The unified data model is used as a standard representation for the incoming data so that it can be mined. It not only provides flexibility for data preprocessing but also reduces the complexity and difficulty of preparing customer survey data for mining.
Cloud computing, a business computing model, was proposed in 2007. Promoted heavily by major companies, it is developing at a very rapid pace. With its mass data storage and distributed computation, cloud computing provides a new approach to data mining, effectively addressing the problems of distributed mining of mass data and efficient storage and computation. The cloud computing model brings many benefits and conveniences. This paper introduces cloud computing and data mining, then briefly introduces some existing parallel data mining algorithms based on cloud computing as well as data mining service platforms. Finally, it gives a brief description of the problems and prospects of data mining based on cloud computing.
With the availability of large datasets in a variety of scientific and commercial domains, data mining has emerged as an important area within the last decade. Data mining techniques focus on finding novel and useful patterns or models from large datasets. Because of the volume of the data to be analyzed, the amount of computation involved, and the need for rapid or even interactive analysis, data mining applications require the use of parallel machines. We believe that parallel compilation technology can be used for providing high-level language support for carrying out data mining implementations. Our study of a variety of popular data mining techniques has shown that they can be parallelized in a similar fashion. In our previous work, we have developed a middleware system that exploits this similarity to support distributed memory parallelization and execution on disk-resident datasets. This paper focuses on developing a data parallel language interface for using our middleware's functionality. We use a data parallel dialect of Java and show that it is well suited for data mining algorithms. Compiler techniques for translating this dialect to a middleware specification are presented. The most significant of these is a new technique for extracting a global reduction function from a data parallel loop. We present a detailed experimental evaluation of our compiler using apriori association mining, k-means clustering, and k-nearest neighbor classifiers. Our experimental results show that: 1) compiler generated parallel data mining codes achieve high speedups in a cluster environment, 2) the performance of compiler generated codes is quite close to the performance of manually written codes, and 3) simple additional optimizations like inlining can further reduce the gap between compiled and manual codes.
Spatial data mining and spatial data visualization have both become popular techniques in recent years. In essence, both aim to reveal the geographic phenomena that spatial data express and to find the knowledge and laws implicit in geographic entities, so it is necessary to combine the two organically and form a new research direction: Visualization Spatial Data Mining (VSDM). This paper mainly discusses the key relationships between visualization and spatial data mining, the main applications of visualization theories and technologies in spatial data mining, and the main methods and examples of visualization spatial data mining. We also present a reference model, Visualization Spatial Data.
Data mining has become a major academic research area over the last ten years. However, in the transition from academic prototypes to commercial products there have been few successes with commercial data mining applications failing to make any significant impact in the marketplace. In this paper it is argued that most data mining applications concentrate on the model-building phase of the data mining process and rarely engage the user in other stages. The paper reviews the challenges of producing successful data mining applications and in particular the role of information visualization. Visualization in data mining tends to be used to present final results, rather than playing an important part throughout the entire process. We argue that an immersive environment may provide the user with a more suitable interface than is commonly offered. We present a virtual data mining environment which attempts to integrate a data mining application interface and information visualization in a seamless manner, using the concept of 'liquid data'.
Exploring and extracting knowledge from data is one of the fundamental problems in science. While many data mining models concentrate on automation and efficiency, interactive data mining models focus on adaptive and effective communication between human users and computer systems. User views, preferences, and strategies play the most important roles in human-machine interaction and guide the selection of target knowledge representations, operations, and measurements. Discovering patterns with data mining algorithms alone is not sufficient, because the algorithms may produce a surplus of patterns that are of no use to people. Moreover, knowledge that is useful to one person may not be useful to another. For these reasons, a user-oriented interactive approach needs to be applied to the data mining process. User-oriented interactive data mining models mediate between users and computer systems, form efficient communication structures, and make it possible to discover the most suitable data mining algorithm for each user. Data mining thus ceases to be a hard and tedious process, and users get the chance to discover the knowledge most suitable for themselves. This study reviews work related to user-oriented interactive data mining and presents the results of those studies, assessments of the systems developed, and suggestions. The review highlights that data mining aimed at discovering essential information needs an interactive structure and multiple visualization techniques.
Data mining has attracted increasing interest in recent years. Although several data mining software suites are available, it is not easy for an end user to apply data mining techniques without the help of a data mining expert. The difficulty is that, given the huge number of data mining algorithms, users must choose a set of algorithms that is appropriate to their data and satisfies their requirements. In other words, users need knowledge of the characteristics of data mining algorithms. In addition, we believe that even a data mining expert may lack this type of knowledge. The no free lunch theorem has shown that no algorithm is universally better than other algorithms across all datasets. Therefore, an algorithm that is relatively better than others for some type of dataset under some evaluation criteria might perform worse in other cases. To address this problem, we propose a method to extract and represent knowledge about mining algorithms. The knowledge is represented by an ontology. Users or agents can then select mining algorithms easily using the data mining ontology.
The objective of this paper is to introduce further development of a data mining tool, MUSASHI (Mining Utilities and System Architecture for Scalable processing of Historical data), for service computing. Recent advances in information systems have allowed us to gather enormous amounts of data on marketing. However, these gathered data have been individually stored at each company, and have never been integrated because of a lack of techniques to analyze the data in an integrated way and to handle the large amount of data efficiently. To address this issue, we are currently investigating a way to provide a data mining platform as a service so that users can apply various data mining techniques to their marketing data with ease and at a low cost. For this purpose, we have developed an ASP platform leveraging distributed computing technology represented by Cloud computing. This paper describes the ASP platform for data mining services and introduces an empirical application of data mining using our platform.
Data mining, the non-trivial extraction of novel, implicit, and actionable knowledge from large datasets, is an evolving technology that is a direct result of the increasing use of computer databases to store and retrieve information effectively. It is also known as Knowledge Discovery in Databases (KDD) and enables data exploration, data analysis, and data visualization of huge databases at a high level of abstraction, without a specific hypothesis in mind. Data mining works by building a model and using it to make predictions. Data mining techniques are the result of a long process of research and product development and include artificial neural networks, decision trees, and genetic algorithms. This paper surveys data mining technology: its definition, motivation, process and architecture, the kinds of data mined, the functionalities and classification of data mining, major issues, applications, and directions for further research.
Mining the large Web based online distributed databases to discover new knowledge and financial gain is an important research problem. These computations require high performance distributed and parallel computing environments. Traditional data mining techniques such as classification, association, clustering can be extended to find new efficient solutions. The paper presents the scalable data mining problem, proposes the use of software DSM (distributed shared memory) with a new mechanism as an effective solution and discusses both the implementation and performance evaluation results. It is observed that the overhead of a software DSM is very large for scalable data mining programs. A new Log Based Consistency (LBC) mechanism, especially designed for scalable data mining on the software DSM is proposed to overcome this overhead. Traditional association rule based data mining programs frequently modify the same fields by count-up operations. In contrast, the LBC mechanism keeps up the consistency by broadcasting the count-up operation logs among the multiple nodes.
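The idea of keeping replicated counters consistent by exchanging count-up logs rather than shared memory pages can be sketched as follows. This is a single-process simulation with hypothetical names (`Node`, `count_up`, `synchronize`); the abstract does not describe the protocol details, so the batching of logs and the synchronization point are assumptions for illustration.

```python
from collections import Counter


class Node:
    """One simulated node holding a local replica of the shared item counters."""

    def __init__(self, name):
        self.name = name
        self.counts = Counter()  # local replica of the shared counters
        self.log = Counter()     # count-up operations performed since the last sync

    def count_up(self, item, amount=1):
        """Apply a count-up locally and record it in the log."""
        self.counts[item] += amount
        self.log[item] += amount


def synchronize(nodes):
    """Broadcast every node's log to all other nodes, then clear the logs."""
    logs = [(node, node.log) for node in nodes]
    for receiver in nodes:
        for sender, log in logs:
            if sender is not receiver:
                receiver.counts.update(log)  # Counter.update adds the logged increments
    for node in nodes:
        node.log = Counter()


# Two nodes count itemset occurrences over their own partitions of the data.
node_a, node_b = Node("a"), Node("b")
node_a.count_up("bread", 2)
node_a.count_up("milk")
node_b.count_up("bread")
synchronize([node_a, node_b])
```

Because count-up operations commute, the logs can be applied in any order and every replica still converges to the same totals, which is what makes shipping logs a workable substitute for page-level consistency in this kind of workload.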
Subjects such as knowledge engineering, pervasive computing, unified communication, ubiquitous sensing and actuation, and situation awareness are gaining critical attention from information technology (IT) professionals and pundits across the globe as a way to accomplish the vision of ambient intelligence (AmI). It is all about the effective, round-the-clock gleaning of data and information from different and distributed sources. Secondly, whatever is gathered, transmitted, and stored is subjected to a cornucopia of tasks such as processing, mining, clustering, classification, and analysis for the real-time extraction of hidden actionable insights. Based on the knowledge extracted and the needs identified, the final task is to decide on and initiate the next course of action in time. Not only information, interaction, and transaction, but also physical services can be conceived, constructed, and supplied to human users with the stability and maturity of AmI technologies and instrumented, interconnected, and intelligent devices. This paper gives a detailed description of an AmI application which can provide impenetrable and unbreakable security, convenience, care, and comfort for the needy. Our focus here is to develop a secure and safety-critical Ambient Assisted Living (AAL) environment which can monitor the patient's situation and give timely updates. To fulfill all these needs, a smart environment has been created to effectively and insightfully control patients' needs. The middleware standard preferred for the development and deployment of a bevy of ambient and articulate services is the Open Service Gateway Initiative (OSGi).
With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherently complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.
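Why assessment metrics matter under class skew can be shown with a short sketch. It computes accuracy alongside precision, recall, and F1 for the minority class. The choice of these four metrics and the function names are illustrative assumptions; the abstract above does not list which metrics the review covers.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return accuracy plus minority-class precision, recall, and F1."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data: 95 majority-class records (0) and 5 minority-class records (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100
metrics = imbalance_metrics(y_true, y_pred)
```

On this toy data the always-majority classifier reaches 95% accuracy while its recall and F1 on the minority class are zero, which is the kind of gap that motivates skew-aware metrics.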
In November 2010, Microsoft released the Kinect sensor for the Xbox 360 video game console. This device, similar to a webcam, allows an individual to interact with an Xbox 360 or a computer in three-dimensional space using an infrared depth-finding camera and a standard RGB camera. As of January 2012, over 24 million units had been sold. Using a combination of custom and open-source software, we were able to develop a means for students to visualize and interact with the data, allowing us to introduce the concepts and skills used in the field of Electrical and Computer Engineering. The unique technological application, the visual appeal of the output, and the ubiquity of the device make this an ideal platform for raising interest in Electrical and Computer Engineering among high school students. To understand the appeal of the Kinect, a working knowledge of the technical details of the device is useful. The novelty and appeal of the Kinect sensor lie in its infrared camera, which is composed of two distinct devices. An infrared projector sends out a 640x480 grid of infrared beams, and an infrared detector is used to measure how long the reflection of each beam takes to return to the sensor. This data set is known as a "point cloud". The point cloud is a three-dimensional vector composed of data points with values between 40 and 2000, which correspond to the distance of each beam from the device. The data in this array can then be parsed to construct a 3D image. The Kinect's infrared camera operates at 30 Hz, or 30 samples per second, so the device is able to deliver a frame rate that is sufficient to create the illusion of motion. This allows for the development of applications that give the user a sense of interacting in real time with the image on the screen.
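Parsing the depth array into points can be sketched as follows. This is a minimal illustration, not the software used by the authors: the function name is hypothetical, the input is assumed to be a row-major grid of raw depth readings, and treating readings outside the 40 to 2000 range as invalid is an assumption. No camera calibration is applied, so the x and y coordinates are simply grid indices.

```python
def depth_grid_to_points(depth, min_value=40, max_value=2000):
    """Convert a 2-D grid of depth readings into (x, y, depth) points.

    depth is a list of rows; each row is a list of raw readings.
    Readings outside [min_value, max_value] are skipped (assumed invalid).
    """
    points = []
    for row_index, row in enumerate(depth):
        for col_index, value in enumerate(row):
            if min_value <= value <= max_value:
                points.append((col_index, row_index, value))
    return points


# A tiny 2x2 grid stands in for the 640x480 frame described above.
frame = [
    [0, 40],
    [2000, 2500],
]
cloud = depth_grid_to_points(frame)
```

A full frame would be processed the same way, once per sample at the 30 Hz rate mentioned above.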
The unique visual appeal, novelty of interaction, and relatively easy-to-understand theory of operation make the Kinect an attractive platform for recruitment and outreach. Using the Kinect, a recruiter is able to quickly and effectively demonstrate a range of concepts involving hardware, software, and the design process on a platform that students are familiar with and find appealing. In a short window of time, recruiters are able to show examples and explain the fundamental principles of the system while providing tangible, meaningful, and enjoyable interactivity with the device itself. This level of approachability and familiarity is rare among highly technical fields, and provides an excellent catalyst to develop interest in Electrical and Computer Engineering education.
Efficient performance of complex knowledge work is of crucial importance to saving resources in the global economy and to long-term sustainability. A lot remains to be leveraged in engineering computer-based systems for assisting humans via cognitive and performance aids. The performance of knowledge-intensive tasks (simply, knowledge work) involves complex and dynamic interactions between human cognition and multiple sources of information. To achieve efficient healthcare for patients, a Knowledge-work Support System (KwSS) focused on the biomedical domain needs complete access to domain information in order to offer correct and precise data to a knowledge worker. The collection of this data and their interrelationships can be automated by gleaning the necessary knowledge from Linked Open Data (LOD) sets available on the Internet. Because LOD sets are interlinked, populating a KwSS's knowledge base with their informational content allows the system to store the relationships among various biomedical concepts, thereby making it a more active consumer of knowledge and improving its ability to aid in any given setting. This paper explores the utility and completion of LOD sets for a KwSS focused on the biomedical domain. In particular, two types of LOD sets are examined: domain-specific (e.g. Dailymed, DrugBank) and general-context (e.g. DBpedia, WordNet). More specifically, this paper investigates the structure of the available data, the extent to which such interlinked data can provide the knowledge content necessary for fulfilling tasks and activities performed in the biomedical domain (e.g. the patient-doctor setting), and how an individual can potentially access this data.
Measures of text similarity have been used for a long time in applications in natural language processing and related areas such as text mining, Web page retrieval, and dialogue systems. Existing methods for computing sentence similarity have been adopted from approaches used for long text documents. These methods process sentences in a very high-dimensional space and are consequently inefficient, require human input, and are not adaptable to some application domains. This paper presents a method for measuring the semantic similarity of texts, using corpus-based and knowledge-based measures of similarity. The semantic similarity of two sentences is calculated using information from a structured lexical database and from corpus statistics. The use of a lexical database enables our method to model human common sense knowledge, and the incorporation of corpus statistics allows our method to be adaptable to different domains. The proposed method can be used in a variety of applications that involve text knowledge representation and discovery. Experiments on two sets of selected sentence pairs demonstrate that the proposed method provides a similarity measure that shows a significant correlation to human intuition.
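One way to combine a lexical resource with corpus statistics is sketched below. It is an illustration under stated assumptions, not the paper's exact formula: each word is matched to its most similar word in the other sentence, the match is weighted by an inverse-document-frequency score, and the two directions are averaged. The LEXICAL_SIMILARITY table and DOC_FREQUENCY counts are made-up toy data standing in for a real lexical database and a real corpus.

```python
import math

# Hypothetical stand-in for a structured lexical database:
# symmetric word-pair similarity scores in [0, 1].
LEXICAL_SIMILARITY = {
    frozenset({"car", "automobile"}): 0.9,
    frozenset({"fast", "quick"}): 0.8,
}

# Hypothetical corpus statistics: number of documents containing each word.
DOC_FREQUENCY = {"the": 100, "car": 10, "automobile": 5,
                 "is": 90, "fast": 20, "quick": 20}
TOTAL_DOCS = 100


def word_similarity(a, b):
    """Knowledge-based similarity: 1.0 for identical words, else table lookup."""
    if a == b:
        return 1.0
    return LEXICAL_SIMILARITY.get(frozenset({a, b}), 0.0)


def idf(word):
    """Corpus-based weight; words unseen in the toy corpus count as one document."""
    return math.log(TOTAL_DOCS / DOC_FREQUENCY.get(word, 1))


def directed_score(source, target):
    """IDF-weighted average of each source word's best match in target."""
    total_weight = sum(idf(w) for w in source)
    if total_weight == 0 or not target:
        return 0.0
    weighted = sum(max(word_similarity(w, t) for t in target) * idf(w)
                   for w in source)
    return weighted / total_weight


def sentence_similarity(s1, s2):
    """Average the two directed scores so the measure is symmetric."""
    a, b = s1.lower().split(), s2.lower().split()
    return 0.5 * (directed_score(a, b) + directed_score(b, a))
```

With the toy tables, sentence_similarity("the car is fast", "the automobile is quick") scores well above a pair of unrelated sentences. Swapping in a different corpus changes only the weights, which is the adaptability to different domains that the abstract describes.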
Daryl Pregibon, a research scientist at Google Inc., states that "Data Mining is a mixture of statistics, artificial intelligence and database research." In other words, the purpose of this process is the automatic discovery of knowledge hidden in data using various computational techniques. The purpose of this work is to analyze the impact of GRID technology on storing and processing large amounts of information and knowledge. With the computational power of computers and the most effective means of working with data, information exploitation is no longer a difficulty. The work shows a strong expansion of the use of GRID technologies in various fields, as a consequence of the development of our society and, in particular, of the scientific and technical world, which requires technologies that allow all parties to use resources in a well-controlled and well-organized way. Therefore, we can use GRID technologies for Data Mining processing. To see what the data "mining" process consists of, we must go through the following steps: construction and validation of the model, and application of the model to new data. The GRID - Data Mining connection can be successfully used to monitor environmental factors in the environmental protection field, to monitor behavior over time in the civil engineering field, to determine diagnoses in the medical field, and in telecommunications. To develop "mining" applications for distributed data within a GRID network, the infrastructure to be used is the Knowledge GRID. This high-level infrastructure has an architecture dedicated to data "mining" operations and specialized services for discovering resources stored in distributed repositories and for managing information services.
In this concept, the achievement of data storage and processing is one of the most effective ways one can obtain results with high accuracy, according to initial requirements, using automated knowledge discovery principles across the entire resource of knowledge existing in different systems. We can say that the main benefit obtained by using the Knowledge GRID architecture is a major improvement in the execution speed of the "mining" process.
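The three steps named in the abstract above (construction of the model, validation of the model, and application of the model to new data) can be sketched on a single machine. The nearest-centroid classifier, the function names, and the temperature readings are illustrative assumptions. Nothing here represents the Knowledge GRID services, which would distribute this work across nodes.

```python
def build_model(samples):
    """Construction: compute the mean feature value (centroid) per label."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def apply_model(model, value):
    """Application: assign the label whose centroid is closest to the value."""
    return min(model, key=lambda label: abs(model[label] - value))


def validate_model(model, samples):
    """Validation: fraction of held-out samples the model labels correctly."""
    correct = sum(1 for value, label in samples
                  if apply_model(model, value) == label)
    return correct / len(samples)


# Made-up body temperature readings (degrees Celsius) with labels.
training = [(36.5, "normal"), (36.8, "normal"), (37.0, "normal"),
            (38.5, "fever"), (39.0, "fever"), (39.4, "fever")]
held_out = [(36.6, "normal"), (38.8, "fever")]

model = build_model(training)
accuracy = validate_model(model, held_out)
prediction = apply_model(model, 39.2)
```

Keeping the held-out samples separate from the training samples is what makes the validation step meaningful before the model is applied to new readings.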
Knowledge capture is key in a business world where huge quantities of data are available via the Internet. Knowledge, as usable information, is a necessary element in the success of any organization. Online information in the form of academic papers on algorithms and tools for Thai word segmentation has grown recently and is distributed across various web sites; however, it has not been organized in a systematic way. Thus, this study proposes a knowledge capture method to support knowledge management activities. To meet the objectives of the study, knowledge engineering techniques play a very important role in the knowledge capture process in various ways, such as building a knowledge model, simplifying access to the information it contains, and providing better ways to represent the knowledge explicitly. In this study, several knowledge engineering methods, namely SPEDE, MOKA, and CommonKADS, were compared to select a suitable method for capturing knowledge from academic papers. The CommonKADS methodology was selected because it provides sufficient tools, such as a model suite and templates for different knowledge-intensive tasks. However, creating and representing a knowledge model is difficult for the knowledge engineer because the source of knowledge is ambiguous and unstructured. Therefore, the objective of this paper is to propose a methodology for capturing knowledge from academic papers by using the knowledge engineering approach. Academic papers whose content relates to algorithms and tools for Thai word segmentation are used as a case study to demonstrate the proposed methodology.
The convergence of electronics and communication technologies and the integration of voice, data, and images have made it possible for information technology to play a major role in human resource re-engineering in knowledge-networked environments. The whole scenario synergises into the concept of providing education or learning on demand and leveraging information and expertise to improve organizational innovation, responsiveness, productivity, and competency. Human resource re-engineering assumes greater significance in the new millennium, with knowledge management providing a catalytic tool for involving, acquiring, creating, packaging, distributing, applying, and maintaining knowledge databases. This paper deals with certain components of knowledge management, emphasizing knowledge networking concepts by means of strategic partnerships and alliances with leading organizations, which will enable new paradigms for assessing and measuring a country's economic empowerment in the totally networked global economy. The salient features described in this paper also cover knowledge management, knowledge categories, knowledge types, and strategic business objectives, as well as the leading role that knowledge management and collaborations, alliances, and partnerships will play in the years to come in making the one-world economy predominantly network and knowledge dependent.
Knowledge management plays an important role in personalized service in e-commerce. However, the incompleteness of knowledge has degraded knowledge collaboration in the context of e-commerce. This paper addresses issues of knowledge acquisition for serving users with needed information and products through agent technology and fuzzy ontology. A seller agent and a buyer agent are constructed, which handle knowledge acquisition and utilization. Finally, the knowledge management framework is implemented to carry out knowledge application, and the framework is validated with related empirical data.
Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
The term is a buzzword, and is frequently also applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as to any application of a computer decision support system, including artificial intelligence, machine learning, and business intelligence. The popular book "Data mining: Practical machine learning tools and techniques with Java" (which covers mostly machine learning material) was originally to be named just "Practical machine learning", and the term "data mining" was only added for marketing reasons. Often the more general terms "(large scale) data analysis" or "analytics" are more appropriate, or, when referring to actual methods, artificial intelligence and machine learning.
The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection and data preparation nor the result interpretation and reporting are part of the data mining step, but they do belong to the overall KDD process as additional steps.
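Of the pattern types listed above, dependencies (association rule mining) are the simplest to illustrate. The sketch below computes the two standard rule measures, support and confidence, over a toy list of transactions. The transactions and function names are made up for illustration, and a practical miner would also need a candidate-generation strategy such as Apriori, which is omitted here.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Toy market-basket data (made up).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(TRANSACTIONS, {"bread", "milk"})
rule_confidence = confidence(TRANSACTIONS, {"bread"}, {"milk"})
```

Here the rule bread -> milk holds in two of the four baskets (support 0.5) and in two of the three baskets that contain bread (confidence of about 0.67).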
The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations. In the 1960s, statisticians used terms like "Data Fishing" or "Data Dredging" to refer to what they considered the bad practice of analyzing data without an a priori hypothesis. The term "Data Mining" appeared around 1990 in the database community. For a short time in the 1980s, the phrase "database mining"™ was used, but since it was trademarked by HNC, a San Diego-based company (now merged into FICO), to pitch their Database Mining Workstation, researchers turned to "data mining". Other terms used include Data Archaeology, Information Harvesting, Information Discovery, and Knowledge Extraction. Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in Databases" for the first workshop on the topic (KDD-1989), and this term became more popular in the AI and machine learning community. However, the term data mining became more popular in the business and press communities. Currently, Data Mining and Knowledge Discovery are used interchangeably. Since about 2007, "Predictive Analytics" and, since 2011, "Data Science" have also been used to describe this field.
In business, data mining is the analysis of historical business activities, stored as static data in data warehouse databases. The goal is to reveal hidden patterns and trends. Data mining software uses advanced pattern recognition algorithms to sift through large amounts of data to assist in discovering previously unknown strategic business information.
Examples of what businesses use data mining for include performing market analysis to identify new product bundles, finding the root cause of manufacturing problems, preventing customer attrition, acquiring new customers, cross-selling to existing customers, and profiling customers with more accuracy.