Free download of the English paper "A Tool for Statistical Analysis on Network Big Data" - IEEE 2017

Persian title
ابزاری برای تحلیل های آماری در شبکه کلان داده ها
English title
A Tool for Statistical Analysis on Network Big Data
Pages (Persian paper)
0
Pages (English paper)
5
Publication year
2017
Publisher
IEEE
English paper format
PDF
Product code
E10391
Related fields
Information Technology Engineering
Related specializations
Information Systems Management
Venue
International Workshop on Database and Expert Systems Applications
Affiliation
AT&T Labs - Research, USA. Research work conducted while visiting AT&T Labs, USA; C. Ordonez's current affiliation: University of Houston, USA
DOI or digital identifier
https://doi.org/10.1109/DEXA.2017.23
Abstract


Due to advances in parallel file systems for big data (e.g., HDFS) and larger-capacity hardware (multicore CPUs, large RAM), it is now feasible to manage and query network data in a parallel DBMS supporting SQL, but performing statistical analysis remains a challenge. On the statistics side, the R language is popular, but it presents important limitations: R is limited by main memory, R works in a different address space from query processing, R cannot analyze large disk-resident data sets efficiently, and R has no data management capabilities. Moreover, some R libraries allow R to work in parallel, but without data management capabilities. Considering the challenges and limitations described above, we present a system that allows combining SQL queries and R functions in a seamless manner. We argue that a parallel DBMS and the R runtime are two different systems that benefit from low-level integration. Our parallel DBMS is built on top of HDFS, programmed in Java and C++, with a flexible scale-out architecture, whereas R is programmed purely in C. The user or developer can make calls in both directions: (1) R calling SQL, to evaluate analytic queries or retrieve data from materialized views (transferring result tables in RAM in a streaming fashion and analyzing them in R), and vice versa, (2) SQL calling R, allowing SQL to convert relational tables to matrices or vectors and perform complex computations on them. We give a summary of network monitoring tasks at AT&T and present specific programming examples, showing language calls in both directions (i.e., R calls SQL, SQL calls R).
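
The two call directions described in the abstract can be illustrated with a small stand-in. The sketch below is not the authors' system: it assumes Python in place of R and an in-memory SQLite database in place of the parallel DBMS, and the table `link_load`, the helper `query_to_columns`, and the aggregate `sample_stdev` are hypothetical names chosen for illustration. Direction 1 streams a query result into column vectors in a single pass; direction 2 registers a host-language function so that a SQL query can call it.

```python
import sqlite3
import statistics

# Stand-in for the DBMS (assumption): an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link_load (router TEXT, hour INTEGER, mbps REAL)")
conn.executemany(
    "INSERT INTO link_load VALUES (?, ?, ?)",
    [("r1", 0, 10.0), ("r1", 1, 14.0), ("r2", 0, 30.0), ("r2", 1, 50.0)],
)


def query_to_columns(connection, sql):
    """Direction 1: the host language calls SQL and builds a column-oriented
    structure (a rough analogue of a data frame) in one pass over the rows."""
    cursor = connection.execute(sql)
    names = [d[0] for d in cursor.description]
    columns = {name: [] for name in names}
    for row in cursor:  # rows are fetched incrementally from the cursor
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


frame = query_to_columns(
    conn, "SELECT router, mbps FROM link_load ORDER BY router, hour"
)
mean_mbps = statistics.mean(frame["mbps"])


class StdDev:
    """Direction 2: a host-language aggregate that SQL can call."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per input row; NULLs are ignored.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Sample standard deviation needs at least two values.
        return statistics.stdev(self.values) if len(self.values) > 1 else None


conn.create_aggregate("sample_stdev", 1, StdDev)
per_router = conn.execute(
    "SELECT router, sample_stdev(mbps) FROM link_load "
    "GROUP BY router ORDER BY router"
).fetchall()
```

In this stand-in the host function runs inside the database engine's process, which loosely mirrors the low-level integration the abstract argues for, though it says nothing about how the actual system shares memory or parallelizes work.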

CONCLUSIONS


We presented a system that enables fast bi-directional data transfer between a parallel DBMS and the R runtime. In one direction, our system converts SQL relational tables into R data frames or matrices. In the opposite direction, an R data frame or matrix is converted into a relational table, with a transformed data frame being the most common case. Our system is built on top of a careful mapping between atomic data types. The system efficiently constructs data structures (i.e., non-atomic data types) in RAM in one pass over a data set. The net gain is that an R script can call an SQL query or materialized view to analyze the result set. Conversely, an SQL query (not a script or longer embedded SQL program) can call an R function to perform some mathematical computation in an intermediate step. Our initial prototype opens several research directions. We want to define functional constructs in the R programming language to transform relational tables into data frames. In a similar manner, we want to study alternatives to transform a matrix into an SQL object (flat table, subscript/value triples, or binary object). Propagating insertions to materialized views and then to a mathematical model computed by R is a challenging problem. Finally, we need to conduct a detailed performance study on the AT&T network data warehouse.
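
One of the matrix-to-SQL alternatives mentioned above, subscript/value triples, can be sketched as follows. This is an illustrative assumption rather than the paper's design: it again uses Python and SQLite as stand-ins, the `SQL_TYPE` mapping and the functions `matrix_to_triples`, `store_matrix`, and `load_matrix` are hypothetical, and omitting zero entries is a choice made here to keep the table small.

```python
import sqlite3

# Hypothetical mapping from host-language atomic types to SQL column types.
SQL_TYPE = {int: "INTEGER", float: "REAL", str: "TEXT"}


def matrix_to_triples(matrix):
    """Yield (i, j, value) triples with 1-based subscripts, skipping zeros."""
    for i, row in enumerate(matrix, start=1):
        for j, value in enumerate(row, start=1):
            if value != 0:
                yield (i, j, value)


def store_matrix(conn, name, matrix):
    """Write the matrix as a triples table. The table name is interpolated
    directly, so it must come from trusted code (illustration only)."""
    value_type = SQL_TYPE[type(matrix[0][0])]
    conn.execute(f"CREATE TABLE {name} (i INTEGER, j INTEGER, v {value_type})")
    conn.executemany(
        f"INSERT INTO {name} VALUES (?, ?, ?)", matrix_to_triples(matrix)
    )


def load_matrix(conn, name, n_rows, n_cols):
    """Rebuild the dense matrix in one pass over the triples table."""
    matrix = [[0.0] * n_cols for _ in range(n_rows)]
    for i, j, v in conn.execute(f"SELECT i, j, v FROM {name}"):
        matrix[i - 1][j - 1] = v
    return matrix


conn = sqlite3.connect(":memory:")
m = [[1.5, 0.0, 2.0], [0.0, 0.0, 4.25]]
store_matrix(conn, "m_triples", m)
triples = conn.execute("SELECT i, j, v FROM m_triples ORDER BY i, j").fetchall()
restored = load_matrix(conn, "m_triples", 2, 3)
```

Because zeros are dropped, the matrix dimensions are not recoverable from the table alone and have to be passed to `load_matrix`; a flat-table or binary-object representation would trade that off differently.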

