Channel: Active questions tagged row - Stack Overflow

Trouble with removing duplicate rows in R, maybe due to a problem with merging two csv files?


I have been trying to remove duplicate rows from a CSV file in both R and Excel. However, both report that there are no duplicate rows, even though I can clearly see them. I have checked for extra spaces, differing character counts, and capitalization, and the rows all appear to be exact duplicates. I'm not sure what's going on, and I'd appreciate any advice on what to try next.

The CSV file has a header row, 23,700 rows, and 7 columns. The labels below are made up but styled after my own dataset, just shorter. Example:

    Sample     Class A    SubClass B   SubClass C   SubClass D   SubClass E  SubClass F
    SJAKFHL    Type A     Cya          Oxy          Syne         Cyana       Syne_CC867
    DSKLFHJAS  Type A     Pro          Gamma        S_13         Pseudo      C_III
    SKJDHF     Type B     Pro          Oxy          Syne         C_47        Syne_CC867
    ASDFJH     Type A     Cya          Oxy          Pseudo       Pseudo      Hacen
    SJAKFHL    Type A     Cya          Oxy          Syne         Cyana       Syne_CC867
    JLSDHFSDL  Type B     Act          Acid         Actin        Cyana       C_I

In Excel, I've tried comparing cell lengths with the =LEN(cell) function and downloading data merging tools, but so far haven't had any luck.

In R, I've tried a few different things, including working with the data as a data frame and as a tibble. I've been using vroom instead of readr because it is faster with large files. The data set I'm working with was created by merging two others, hence the duplicates. Both source data sets are formatted exactly as shown above, one with 12,000 rows and the other with 13,000. Some rows were dropped during the merge, which was expected, but other duplicates seem to have stayed. Neither left_join nor right_join will work for this.

    library(vroom)
    library(dplyr)
    library(data.table)

    data1 <- vroom("data1.csv")
    data2 <- vroom("data2.csv")

    merged_data <- full_join(data1, data2)

    duplicated(merged_data)
    # Returned all FALSE, claimed there were no duplicates

    data_NoDuplicates1 <- duplicated(merged_data)
    # Thought that maybe it needed to be assigned to a value, but still returned all FALSE, claimed there were no duplicates

    data_NoDuplicates2 <- merged_data[!duplicated(merged_data),]
    # Had no effect

    data_NoDuplicates3 <- distinct(merged_data)
    # Had no effect

I'm not sure what to try next, or where the problem may be. I've tried experimenting with other ways of merging as well, but so far it's either given me the same result or just not worked.
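
One possible next check, sketched here on made-up data rather than the real files: rows that look identical can still differ by an invisible character, such as a non-breaking space (U+00A0), which an ordinary whitespace trim does not remove. Whether that is the cause in this case is only an assumption. The helper name `clean_strings` and the toy values are hypothetical, and the sketch uses base R only.

```r
# Toy data: rows 1 and 2 look identical when printed, but row 2 has a
# trailing non-breaking space (U+00A0) in the Sample column.
toy <- data.frame(
  Sample = c("SJAKFHL", "SJAKFHL\u00a0", "ASDFJH"),
  Class  = c("Type A", "Type A", "Type A"),
  stringsAsFactors = FALSE
)

# The hidden character makes the two rows differ, so nothing is flagged.
any(duplicated(toy))

# Inspecting code points reveals the extra character (160 is U+00A0).
utf8ToInt(toy$Sample[2])

# Hypothetical helper: in every character column, turn non-breaking
# spaces into ordinary spaces, then trim leading/trailing whitespace.
clean_strings <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) {
    trimws(gsub("\u00a0", " ", x, fixed = TRUE))
  })
  df
}

cleaned <- clean_strings(toy)

# After cleaning, row 2 matches row 1 and can be dropped.
any(duplicated(cleaned))
deduped <- cleaned[!duplicated(cleaned), ]
```

If the cleaned copy of the real data still shows no duplicates, comparing two suspect rows column by column, for example with `utf8ToInt()` or `charToRaw()` on each cell, might show which column actually differs.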

Any advice is appreciated. Thank you.

