Read irregular .txt file

- 7 mins

Overview:

My friend #8 turned to me for a script that can automatically extract informative data from his irregular .txt (sample.txt) and output it as .xlsx with multiple sheets. As he wish. Let’s go!


Tasks:


Steps: 3 functions

1. Get file name

# file name
get.filename <- function(path){
  filename <- readtext::readtext(path)$text %>% 
  	str_extract(".+?\n") %>% 
  	str_remove("[:blank:]+") %>% str_remove("\n")
  	return(filename)
}

2. Get data

Useful functions: readtext::readtext(), readLines(), writeLines(), file.remove(), data.frame(matrix(nrow=0)), readLines(), names(list).

# tables
get.data <- function(path){
  wd <- str_extract(path,"^..+/")
  data.txt <- readtext::readtext(path)$text %>% str_extract_all("STEERS(.|\n)+?\n\n") %>% unlist()
  n <- length(data.txt)
  data.list <- list()
  data.name.list <- c()
  for (i in 1:n){
    data.name.list[i] <- data.txt[i] %>% str_extract('STEERS.+?\n') %>% str_remove('\n') %>%
      str_remove(" \\(.+\\)") %>% str_remove("^STEERS - ")
    file.path <- paste0(wd,'STEER',i,".txt")
    data.txt[i] %>%
      str_remove("\n\n") %>%
      str_remove('STEERS.+?\n') %>%
      writeLines(file.path)
    data.lines <- readLines(file.path)
    file.remove(file.path)
    # parse
    # var
    var <- data.lines[1] %>% str_extract_all('[:alpha:]+[:blank:]?[:alpha:]+') %>% unlist()
    # data
    data <- data.frame(matrix(ncol = length(var), nrow = 0))
    colnames(data) <- var
    rows <- list()
    for (k in 2:length(data.lines)){
      rows[[k]] <- data.lines[k] %>% str_extract_all('([:digit:]|\\.|-|[:alpha:])+') %>% unlist()
      data[k-1,] <- rows[[k]][1:length(var)]
    }
    data.list[[i]] <- data
  }
  names(data.list) <- data.name.list
  return(data.list)
}

3. Filter data

# extract information in need
filter.data <- function(data.list,lower = 700, upper = 900){
  names <- names(data.list)
  n <- length(data.list)
  data.list.copy <- data.list
  data.list.new <- list()
  for (i in 1:n){
    n.row <- nrow(data.list[[i]])
    for (k in 1:n.row){
      if (str_detect( data.list[[i]][k,]$`Wt Range`, '-')){
        n.1 <- data.list[[i]][k,]$`Wt Range` %>% str_extract('^[:digit:]+') %>% as.numeric()
        n.2 <- data.list[[i]][k,]$`Wt Range` %>% str_extract('[:digit:]+$') %>% as.numeric()
        data.list.copy[[i]][k,]$`Wt Range` <- (n.1+n.2)/2
      }
    }
    data.list.new[[i]] <- data.list[[i]][as.numeric(data.list.copy[[i]]$'Avg Wt') > lower & as.numeric(data.list.copy[[i]]$'Avg Wt') < upper,]
  }
  names(data.list.new) <- names
  return(data.list.new)
}

4. Output

At the beginning, I tried to use library(xlsx). It worked several months ago, but this time I got the following warning message:

WARNING: Initial Java 12 release has broken JNI support and does NOT work. Use stable Java 11 (or watch for 12u if avaiable).
ERROR: Java exception occurred during rJava bootstrap - see stderr for Java stack trace.
Exception in thread "main" java.lang.NullPointerException
	at java.base/jdk.internal.reflect.Reflection.verifyMemberAccess(Reflection.java:130)
	at java.base/java.lang.reflect.AccessibleObject.slowVerifyAccess(AccessibleObject.java:673)
	at java.base/java.lang.reflect.AccessibleObject.verifyAccess(AccessibleObject.java:666)
	at java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:638)
	at java.base/java.lang.reflect.Field.checkAccess(Field.java:1075)
	at java.base/java.lang.reflect.Field.get(Field.java:416)
Error: package or namespace load failed for ‘xlsx’:
 .onLoad failed in loadNamespace() for 'xlsx', details:
  call: .jcheck(silent = FALSE)
  error: java.lang.NullPointerException.jcall(f, "Ljava/lang/Object;", "get", .jcast(ic, "java/lang/Object"))<S4 object of class "jobjRef">
In addition: Warning message:
package ‘xlsx’ was built under R version 3.4.4 

It seemed something went wrong with my Java 12. So I downloaded Java 11 and tried to change the default Java (JDK) version. It didn’t work. It might work if I uninstall Java 12 (instruction)(tool). But I was just too lazy to try. It already took me a while to install it 😭.

Prior to this, I follow the instruction by Zhiya Zuo to install Java 12. Quite honestly, I should have install older version.

So I turned to another package writexl. It can do the same thing, but is not based on rJava. Perfect!

# output
out <- function(path){
  file.name <- get.filename(path)
  data <- get.data(path)
  data.need <- filter.data(data)
  writexl::write_xlsx(data.need,paste0(str_extract(path,"^.+/"),file.name,".xlsx"))
}
# example
library(tidyverse)
path <- "~/Downloads/no8.txt"
out(path)

📎 Output: Eastern MO Commission Company - Bowling Green, MO.xlsx


More

But I won’t do these unless he treat me a bubble tea 😌

Zhijian Liu

Zhijian Liu

A foodaholic

comments powered by Disqus
rss facebook twitter github gitlab youtube mail spotify lastfm instagram linkedin google google-plus pinterest medium vimeo stackoverflow reddit quora quora