hey this one is an ieee project right?? i have this pdf, unfortunately not the code :(
Abstract:
Record matching, which identifies the records that represent the same real-world entity, is an important step for data integration. Most state-of-the-art record matching methods are supervised, which requires the user to provide training data.
These methods are not applicable for the Web database scenario, where the records to match are query results dynamically generated on-the-fly. Such records are query-dependent and a prelearned method using training examples from previous query results may fail on the results of a new query.
To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates from the query result records of multiple Web databases. After removal of the same-source duplicates, the “presumed” nonduplicate records from the same source can be used as training examples, alleviating the burden of users having to manually label training examples.
Starting from the nonduplicate set, we use two cooperating classifiers, a weighted component similarity summing classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Experimental results show that UDD works well for the Web database scenario where existing supervised methods do not apply.
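To make the abstract a bit more concrete, here is a minimal Python sketch of two of the ideas it mentions: collecting "presumed" nonduplicate pairs from within one source, and scoring cross-source pairs with a weighted sum of per-field similarities. This is not the authors' code (I don't have it either). The field names, weights, threshold, sample records, and the use of difflib as the string similarity measure are all assumptions made for illustration, and the SVM classifier and the iterative loop between the two classifiers are left out entirely.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical fields and weights; the abstract does not specify either.
FIELDS = ["title", "author"]
WEIGHTS = [0.7, 0.3]
THRESHOLD = 0.8  # assumed cutoff for calling a pair a duplicate


def field_similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (difflib ratio, case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_vector(r1: dict, r2: dict, fields: list) -> list:
    """One similarity value per field for a pair of records."""
    return [field_similarity(r1[f], r2[f]) for f in fields]


def weighted_sum(vector: list, weights: list) -> float:
    """Weighted component similarity sum for one record pair."""
    return sum(w * s for w, s in zip(weights, vector))


def presumed_nonduplicates(source: list, fields: list) -> list:
    """Similarity vectors of all record pairs inside one source.

    Assumes same-source duplicates were already removed, so every
    remaining pair can be treated as a negative (nonduplicate) example.
    """
    return [similarity_vector(a, b, fields) for a, b in combinations(source, 2)]


def find_duplicates(source_a, source_b, fields, weights, threshold):
    """Index pairs (i, j) whose weighted similarity reaches the threshold."""
    return [
        (i, j)
        for i, ra in enumerate(source_a)
        for j, rb in enumerate(source_b)
        if weighted_sum(similarity_vector(ra, rb, fields), weights) >= threshold
    ]


# Invented sample records, not taken from the paper's Fig. 1.
source_a = [
    {"title": "Harry Potter and the Goblet of Fire", "author": "J. K. Rowling"},
    {"title": "Harry Potter and the Half-Blood Prince", "author": "J. K. Rowling"},
]
source_b = [
    {"title": "Harry Potter and the Goblet of Fire", "author": "Rowling, J. K."},
    {"title": "The Magical Worlds of Harry Potter", "author": "David Colbert"},
]

negatives = presumed_nonduplicates(source_a, FIELDS)
duplicates = find_duplicates(source_a, source_b, FIELDS, WEIGHTS, THRESHOLD)
print(len(negatives), duplicates)
```

With these sample records only the two "Goblet of Fire" entries are paired up. In a fuller version the negative vectors would presumably be fed to the second classifier as training data together with the duplicates found so far, but that part is beyond this sketch.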
Introduction:
Today, more and more databases that dynamically generate Web pages in response to user queries are available on the Web. These Web databases compose the deep or hidden Web, which is estimated to contain a much larger amount of high quality, usually structured information and to have a faster growth rate than the static Web. Most Web databases are only accessible via a query interface through which users can submit queries. Once a query is received, the Web server will retrieve the corresponding results from the back-end database and return them to the user.
To build a system that helps users integrate and, more importantly, compare the query results returned from multiple Web databases, a crucial task is to match the different sources’ records that refer to the same real-world entity. For example, Fig. 1 shows some of the query results returned by two online bookstores, booksamillion.com and abebooks.com, in response to the same query “Harry Potter” over the Title field. It can be seen that the record numbered 3 in Fig. 1a and the third record in Fig. 1b refer to the same book, since they have the same ISBN number although their authors differ somewhat. In comparison, the record numbered 5 in Fig. 1a and the second record in Fig. 1b also refer to the same book if we are interested only in the book title and author.
Download Record Matching over Query Results from Multiple Web Databases Project